Awesome AI Training List
date: Apr 29, 2023
slug: all-ai-training-resources-in-one-place
status: Published
tags: Research
summary: All AI training resources in one place
type: Post
Distributed:
DeepSpeed:
Microsoft's deep learning optimization library for distributed training; the ZeRO line of papers describes its core memory optimizations:
- ZeRO-Offload: Democratizing Billion-Scale Model Training: https://arxiv.org/abs/2101.06840
- ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning: https://arxiv.org/abs/2104.07857
- ZeRO: Memory Optimizations Toward Training Trillion Parameter Models: https://arxiv.org/abs/1910.02054
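In code, DeepSpeed wraps your model and optimizer behind a single engine; a minimal sketch (the ZeRO stage and optimizer live in the JSON/dict config, and the values here are only illustrative):

```python
import torch
import deepspeed

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 1024)
)

ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "zero_optimization": {"stage": 2},  # ZeRO stage 1/2/3; stage 3 also shards the parameters
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
}

# The returned engine handles backward(), step(), gradient accumulation and ZeRO sharding.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)
```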
EleutherAI/DeeperSpeed:
EleutherAI's fork of DeepSpeed, used to train models such as GPT-NeoX.
HuggingFace Accelerate:
🤗 Accelerate is a library that enables the same PyTorch code to be run across any distributed configuration by adding just four lines of code! In short, training and inference at scale made simple, efficient and adaptable.
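The "four lines" look roughly like this (a sketch; `model`, `optimizer` and `dataloader` are assumed to be your existing PyTorch objects, and the batch/loss handling follows the usual Transformers style):

```python
from accelerate import Accelerator

accelerator = Accelerator()  # reads the setup created by `accelerate config` / `accelerate launch`
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for batch in dataloader:
    optimizer.zero_grad()
    loss = model(**batch).loss        # assumes a Transformers-style model that returns .loss
    accelerator.backward(loss)        # replaces loss.backward()
    optimizer.step()
```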
Fully Sharded Data Parallel
To accelerate training of huge models on larger batch sizes, we can use a fully sharded data parallel model. This data parallel paradigm fits more data and larger models by sharding the optimizer states, gradients, and parameters. To read more about it and the benefits, check out the Fully Sharded Data Parallel blog. Accelerate integrates PyTorch's Fully Sharded Data Parallel (FSDP) training feature; all you need to do is enable it through the config.
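With Accelerate the sharding strategy goes into the `accelerate config` file; the raw PyTorch equivalent is essentially one wrapper call. A sketch, assuming the script is started with `torchrun` so the process-group environment variables are set:

```python
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 1024)
).cuda()
model = FSDP(model)  # shards parameters and gradients (and hence optimizer state) across ranks
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
```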
AutoTrain:
🤗 AutoTrain is a no-code tool for training state-of-the-art models for Natural Language Processing (NLP), Computer Vision (CV), Speech, and even Tabular tasks. It is built on top of the awesome tools developed by the Hugging Face team, and it is designed to be easy to use.
Onnx Runtime:
ONNX Runtime inference can enable faster customer experiences and lower costs, supporting models from deep learning frameworks such as PyTorch and TensorFlow/Keras as well as classical machine learning libraries such as scikit-learn, LightGBM, and XGBoost. ONNX Runtime is compatible with different hardware, drivers, and operating systems, and provides optimal performance by leveraging hardware accelerators where applicable alongside graph optimizations and transforms.
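A minimal inference sketch (assumes you have already exported `model.onnx`, e.g. via `torch.onnx.export`; with `onnxruntime-gpu` installed you can add `"CUDAExecutionProvider"` to the provider list):

```python
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

input_name = session.get_inputs()[0].name
x = np.random.randn(1, 3, 224, 224).astype(np.float32)  # example input shape, adjust to your model
outputs = session.run(None, {input_name: x})             # None = return all model outputs
```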
NVIDIA APEX:
A PyTorch extension: tools for easy mixed-precision and distributed training in PyTorch
- github: https://github.com/NVIDIA/apex
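The classic `apex.amp` pattern looks roughly like this (a sketch; note that recent PyTorch releases ship native `torch.cuda.amp`, which is generally recommended over apex.amp today):

```python
import torch
from apex import amp

model = torch.nn.Linear(512, 512).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# "O1" = mixed precision: whitelisted ops run in fp16, the rest stay in fp32
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

loss = model(torch.randn(8, 512, device="cuda")).sum()
with amp.scale_loss(loss, optimizer) as scaled_loss:  # loss scaling avoids fp16 gradient underflow
    scaled_loss.backward()
optimizer.step()
```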
NVIDIA DALI:
A GPU-accelerated library containing highly optimized building blocks and an execution engine for data processing to accelerate deep learning training and inference applications.
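A rough sketch of a DALI image pipeline (assumes an `./images` directory laid out for the file reader, i.e. one subdirectory per class):

```python
from nvidia.dali import pipeline_def, fn

@pipeline_def(batch_size=32, num_threads=4, device_id=0)
def image_pipeline():
    jpegs, labels = fn.readers.file(file_root="./images", random_shuffle=True)
    images = fn.decoders.image(jpegs, device="mixed")       # JPEG decoding on the GPU
    images = fn.resize(images, resize_x=224, resize_y=224)
    return images, labels

pipe = image_pipeline()
pipe.build()
images, labels = pipe.run()  # returns DALI TensorLists, usable via the framework iterators
```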
Colossal-AI:
An open-source deep learning system for large-scale model training that combines data, tensor, and pipeline parallelism with heterogeneous memory management behind a unified interface.
Reinforcement:
carperai/trlx:
trlX is a distributed training framework designed from the ground up to focus on fine-tuning large language models with reinforcement learning using either a provided reward function or a reward-labeled dataset.
TRL - Transformer Reinforcement Learning:
TRL is a library built on top of Hugging Face Transformers for fine-tuning language models with reinforcement learning (e.g. PPO), using either a reward model or a custom reward function.
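Both libraries revolve around the same PPO loop: generate, score with a reward, optimize. A heavily simplified TRL sketch (the constant reward stands in for a real reward model, and API details vary between versions):

```python
import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

model = AutoModelForCausalLMWithValueHead.from_pretrained("gpt2")
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

ppo_trainer = PPOTrainer(PPOConfig(batch_size=1, mini_batch_size=1), model, ref_model, tokenizer)

query = tokenizer("Write a haiku about GPUs:", return_tensors="pt").input_ids
response = model.generate(query, max_new_tokens=20)[:, query.shape[1]:]  # keep only the new tokens
rewards = [torch.tensor(1.0)]  # stand-in for a reward model / labeled reward

ppo_trainer.step([query[0]], [response[0]], rewards)  # one PPO update
```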
Efficiency:
LoRA: Low-Rank Adaptation of Large Language Models: https://arxiv.org/abs/2106.09685
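The idea fits in a few lines: freeze the pretrained weight W and learn a low-rank update ΔW = BA, scaled by alpha/r, so only a tiny fraction of parameters are trained. A from-scratch PyTorch sketch (not the reference implementation):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear with a trainable low-rank update: y = Wx + (alpha/r) * B(Ax)."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                                    # freeze pretrained weights
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)

layer = LoRALinear(nn.Linear(768, 768))
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # only A and B are trainable
```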
Language:
Triton by OpenAI:
- An Intermediate Language and Compiler for Tiled Neural Network Computations
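The canonical "hello world" is a vector-add kernel, written in Python and JIT-compiled to a GPU kernel:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                        # each program instance handles one block
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                        # guard the tail of the array
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

x = torch.rand(4096, device="cuda")
y = torch.rand(4096, device="cuda")
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)
add_kernel[grid](x, y, out, x.numel(), BLOCK_SIZE=1024)
```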
JAX:
- Composable transformations of Python+NumPy programs: differentiate, vectorize, JIT to GPU/TPU, and more
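The transformations compose freely; a tiny sketch:

```python
import jax
import jax.numpy as jnp

def loss(w, x, y):
    return jnp.mean((x @ w - y) ** 2)

grad_fn = jax.jit(jax.grad(loss))                    # differentiate w.r.t. w, then JIT-compile
batched_loss = jax.vmap(loss, in_axes=(None, 0, 0))  # vectorize over a leading batch axis

w = jnp.ones((3,))
x = jnp.ones((8, 3))
y = jnp.zeros((8,))
print(grad_fn(w, x[0], y[0]))       # gradient for a single example
print(batched_loss(w, x, y).shape)  # per-example losses: (8,)
```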
Compilers:
Hidet:
Hidet is an open-source deep learning compiler, written in Python. It supports end-to-end compilation of DNN models from PyTorch and ONNX to efficient CUDA kernels, applying a series of graph-level and operator-level optimizations to improve performance.
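Hidet plugs into `torch.compile` as a backend; a sketch, assuming a CUDA GPU and the `hidet` package installed (it currently targets inference):

```python
import torch
import hidet  # importing hidet registers the "hidet" torch.compile backend

model = torch.nn.Sequential(torch.nn.Linear(128, 128), torch.nn.ReLU()).cuda().eval()
x = torch.randn(1, 128, device="cuda")

compiled = torch.compile(model, backend="hidet")  # lowers the graph to Hidet-generated CUDA kernels
with torch.no_grad():
    y = compiled(x)
```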
Quantization:
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale: https://arxiv.org/abs/2208.07339
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers: https://arxiv.org/abs/2210.17323
GPTQ-for-LLaMa:
GPTQ is a state-of-the-art one-shot weight quantization method; this repository applies it to LLaMA models.
bitsandbytes:
bitsandbytes is a lightweight wrapper around CUDA custom functions, in particular 8-bit optimizers, 8-bit matrix multiplication (LLM.int8()), and quantization functions.
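The easiest way to try LLM.int8() is through the Transformers integration (a sketch; `facebook/opt-1.3b` is just an example checkpoint, and this needs `bitsandbytes` plus `accelerate` installed):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# load_in_8bit routes the Linear layers through bitsandbytes' LLM.int8() matmul
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-1.3b", load_in_8bit=True, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```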
AutoGPTQ:
An easy-to-use model quantization package with user-friendly APIs, based on the GPTQ algorithm.
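A rough quantize-and-save sketch following the project's basic usage (the calibration-data format and arguments can differ between versions):

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

pretrained = "facebook/opt-125m"                      # small example checkpoint
tokenizer = AutoTokenizer.from_pretrained(pretrained)
quantize_config = BaseQuantizeConfig(bits=4, group_size=128)

model = AutoGPTQForCausalLM.from_pretrained(pretrained, quantize_config)
examples = [tokenizer("auto-gptq is an easy-to-use quantization package", return_tensors="pt")]
model.quantize(examples)                              # one-shot GPTQ on the calibration examples
model.save_quantized("opt-125m-4bit")
```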
Frameworks:
Ray:
- A unified framework for scaling Python and AI workloads from a laptop to a cluster, with libraries for distributed training, hyperparameter tuning, and serving.
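The core primitive is turning a plain function into a distributed task (a minimal sketch):

```python
import ray

ray.init()  # starts a local Ray runtime; connect to a cluster by passing an address

@ray.remote
def square(x):
    return x * x

# tasks are scheduled across the available workers and their results fetched with ray.get
results = ray.get([square.remote(i) for i in range(8)])
print(results)
```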
Lightning:
- Deep learning framework to train, deploy, and ship AI products Lightning fast.
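A `LightningModule` plus a `Trainer` replaces the hand-written training loop; a minimal sketch on dummy data:

```python
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset

class LitRegressor(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Linear(10, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.net(x), y)  # Lightning handles backward/step

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

data = DataLoader(TensorDataset(torch.randn(64, 10), torch.randn(64, 1)), batch_size=8)
pl.Trainer(max_epochs=1).fit(LitRegressor(), data)
```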