GPU Memory Computation for LLMs
For this computation, we will use a 7B dense model as an example.
Basic knowledge (verified in the quick check below):
- 8 bits = 1 byte
- Floating point, 32-bit precision (FP32) = 4 bytes
- Floating point, 16-bit precision (FP16) = 2 bytes
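As a quick check, NumPy reports these per-element sizes directly:

```python
# Bytes per element for common numeric types, checked with NumPy.
import numpy as np

for name in ("float32", "float16", "int8"):
    print(name, "->", np.dtype(name).itemsize, "byte(s)")
# float32 -> 4 byte(s)
# float16 -> 2 byte(s)
# int8 -> 1 byte(s)
```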
The GPU memory (VRAM) requirement depends on:
1. Model Parameters
Each parameter is just a number, so 7B parameters means 7B numbers. Each number can be a floating point (32-bit, 16-bit, ...) or an integer, and the VRAM needed depends on that precision or data type.
| Precision | Bytes per parameter | Total for 7B |
|---|---|---|
| FP32 | 4 | 28 GB |
| FP16 | 2 | 14 GB |
| Mixed precision | ~3 | ~21 GB |
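The table can be reproduced with a small calculation. This is a minimal sketch: the 7e9 parameter count, the 1 GB = 10^9 bytes convention, and the ~3 bytes/parameter figure for mixed precision are the assumptions used above.

```python
# Rough VRAM needed just to hold the weights of a 7B-parameter model.
NUM_PARAMS = 7e9  # assumed 7B dense model

def param_memory_gb(num_params: float, bytes_per_param: float) -> float:
    # Using the 1 GB = 1e9 bytes convention to match the round numbers above.
    return num_params * bytes_per_param / 1e9

for name, bytes_per_param in [("FP32", 4), ("FP16", 2), ("Mixed precision (~3 B/param)", 3)]:
    print(f"{name}: {param_memory_gb(NUM_PARAMS, bytes_per_param):.0f} GB")
# FP32: 28 GB
# FP16: 14 GB
# Mixed precision (~3 B/param): 21 GB
```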
2. Optimizer States
The optimizer keeps extra information for each parameter. Adam (the default choice) keeps two extra values per parameter: the first and second moments (m, v). In mixed-precision training it also keeps an FP32 master copy of each weight, so count roughly 3 extra numbers per parameter in the optimizer states.
These optimizer states are usually stored in FP32 (4 bytes per value) because they are sensitive to precision.
So for a 7B model: 3 values × 4 bytes × 7B = 84 GB
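A short sketch of the same arithmetic, assuming the accounting above of 3 extra FP32 values per parameter (m, v, and the FP32 master copy):

```python
# Adam optimizer states for a 7B model.
NUM_PARAMS = 7e9
EXTRA_VALUES_PER_PARAM = 3  # m, v, and the FP32 master weight (assumption above)
BYTES_PER_VALUE = 4         # FP32

optimizer_gb = NUM_PARAMS * EXTRA_VALUES_PER_PARAM * BYTES_PER_VALUE / 1e9
print(f"Adam optimizer states: {optimizer_gb:.0f} GB")  # 84 GB
```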
8-bit Optimizer
8-bit optimizers, such as those in bitsandbytes (also used with QLoRA), store the momentum states in int8. This reduces the two momentum values from 4 bytes to 8 bits == 1 byte each.
So for a 7B model: 2 × 1 byte × 7B (momentums) + 1 × 4 bytes × 7B (FP32 master copy) = 14 GB + 28 GB = 42 GB
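The same calculation for the 8-bit case, again assuming two int8 momentum states plus one FP32 copy per parameter:

```python
# 8-bit optimizer states for a 7B model.
NUM_PARAMS = 7e9

momentum_gb = 2 * 1 * NUM_PARAMS / 1e9     # m and v stored as int8 (1 byte each)
master_copy_gb = 1 * 4 * NUM_PARAMS / 1e9  # remaining FP32 master copy (4 bytes)
print(f"8-bit optimizer states: {momentum_gb + master_copy_gb:.0f} GB")  # 42 GB
```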
3. Gradients
A gradient is stored for each parameter, so 1 extra value per parameter.
So for a 7B model, we will have 7B × 2 bytes = 14 GB (FP16) or 7B × 4 bytes = 28 GB (FP32), depending on the precision.
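A sketch of the gradient memory for both precisions, using the same 7B count and 1 GB = 10^9 bytes convention:

```python
# One gradient value per parameter, stored at the training precision.
NUM_PARAMS = 7e9

for name, bytes_per_grad in [("FP16", 2), ("FP32", 4)]:
    print(f"Gradients ({name}): {NUM_PARAMS * bytes_per_grad / 1e9:.0f} GB")
# Gradients (FP16): 14 GB
# Gradients (FP32): 28 GB
```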
4. Activations (Forward Pass)
Activations are the intermediate outputs of each layer, for every token in every batch element, produced during the forward pass. They are kept so that they do not need to be recomputed during the backward pass.
A rough estimate of the activation memory per layer is:
Batch size × Sequence length × Hidden size × bytes per value
Normally, in modern training, we use gradient checkpointing, which trades a small increase in computation time for a large reduction in activation VRAM.
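A rough sketch of the activation formula above. The batch size, sequence length, hidden size, and layer count below are illustrative assumptions, and real transformer layers keep several intermediate tensors each, so actual usage is a multiple of this estimate:

```python
# Per-layer activation estimate: batch * seq_len * hidden * bytes per value.
BATCH = 8         # assumed batch size
SEQ_LEN = 2048    # assumed sequence length
HIDDEN = 4096     # assumed hidden size (roughly 7B-scale)
NUM_LAYERS = 32   # assumed number of layers
BYTES = 2         # FP16 activations

per_layer_gb = BATCH * SEQ_LEN * HIDDEN * BYTES / 1e9
print(f"Per layer:  {per_layer_gb:.2f} GB")               # ~0.13 GB
print(f"All layers: {per_layer_gb * NUM_LAYERS:.1f} GB")  # ~4.3 GB before checkpointing
```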
Total
For half precision and a 7B model:
Model parameters = 14 GB
Optimizer states = 42 GB (8-bit optimizer)
Gradients = 14 GB
Activations = ~0 (with gradient checkpointing)
Total = 70 GB
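Putting the half-precision numbers above together in one place:

```python
# Total training VRAM for a 7B model in half precision (numbers from above).
params_gb = 14      # FP16 weights
optimizer_gb = 42   # 8-bit optimizer: int8 m, v + FP32 master copy
gradients_gb = 14   # FP16 gradients
activations_gb = 0  # assumed negligible with gradient checkpointing

print(f"Total: {params_gb + optimizer_gb + gradients_gb + activations_gb} GB")  # 70 GB
```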