GPU Memory Computation for LLM Training

For this computation, we will use a 7B dense model as an example.

Basic knowledge:

The GPU memory (VRAM) required for training depends on:

1. Model Parameters

Each parameter is just a number, so 7B parameters means 7 billion numbers. Each number can be a floating-point value (32-bit, 16-bit, ...) or an integer, and the VRAM needed to hold the model depends on this precision (data type).

Precision          Bytes per parameter   Total for 7B
FP32               4                     28 GB
FP16               2                     14 GB
Mixed precision    ~3                    ~21 GB
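
As a quick check, here is a minimal Python sketch of the same table (the byte counts per precision are the only inputs):

```python
# Rough parameter-memory footprint of a 7B dense model at different precisions.
N_PARAMS = 7e9

bytes_per_param = {
    "FP32": 4,
    "FP16": 2,
    "Mixed precision (approx.)": 3,
}

for precision, nbytes in bytes_per_param.items():
    gb = N_PARAMS * nbytes / 1e9  # 1 GB taken as 1e9 bytes, as in the table
    print(f"{precision:<26} {gb:.0f} GB")
```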

2. Optimizer States

The optimizer keeps extra information for each parameter. Adam (the default choice) stores two extra values per parameter, the first and second moments (m, v); together with the FP32 master copy of the weights that mixed-precision training keeps, that is roughly 3 extra numbers per parameter in the optimizer states.

Optimizer states are usually stored in full precision (FP32, 4 bytes each) because they are very sensitive to precision.

So for a 7B model: 3 × 4 bytes × 7B = 84 GB.
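
The same arithmetic as a quick sketch (3 FP32 states per parameter, as described above):

```python
# Adam with mixed precision: FP32 master weights + first moment m + second moment v
# = 3 extra FP32 values per parameter.
N_PARAMS = 7e9
STATES_PER_PARAM = 3
BYTES_PER_STATE = 4  # FP32

optimizer_gb = STATES_PER_PARAM * BYTES_PER_STATE * N_PARAMS / 1e9
print(optimizer_gb)  # -> 84.0 GB
```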

8-bit Optimizer

An 8-bit optimizer, as in bitsandbytes (also used by QLoRA), stores the two moments in int8. This reduces those two values from 4 bytes to 8 bits == 1 byte each.

So for a 7B model: 2 × 1 byte × 7B (moments) + 1 × 4 bytes × 7B (FP32 master weights) = 42 GB.
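
And the 8-bit variant as a sketch (this assumes the FP32 master weights stay at 4 bytes while m and v drop to 1 byte each):

```python
# 8-bit optimizer (e.g. bitsandbytes): m and v stored as int8, master weights still FP32.
N_PARAMS = 7e9

moments_gb = 2 * 1 * N_PARAMS / 1e9  # two int8 moments, 1 byte each
master_gb = 1 * 4 * N_PARAMS / 1e9   # FP32 master copy of the weights
print(moments_gb + master_gb)  # -> 42.0 GB
```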

3. Gradients

A gradient is stored for each parameter, so that is 1 extra number per parameter.

So for a 7B model, gradients take 7B × 2 bytes = 14 GB or 7B × 4 bytes = 28 GB, depending on the precision.
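
The same gradient arithmetic in code, for both precisions:

```python
# One gradient value per parameter, stored at the training precision.
N_PARAMS = 7e9
print(N_PARAMS * 2 / 1e9)  # 14.0 GB with FP16 gradients
print(N_PARAMS * 4 / 1e9)  # 28.0 GB with FP32 gradients
```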

4. Activations (Forward Pass)

Activations are the intermediate outputs of each layer, for each token in each batch, during the forward pass. They are kept so that they do not need to be recomputed during the backward pass.

A rough estimate of the activation memory per layer is:
Batch size × sequence length × hidden size × bytes per value
(multiplied by the number of layers for the whole model).
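
A sketch of this estimate, assuming a typical 7B configuration (hidden size 4096, 32 layers) and FP16 activations; real layers also store attention/MLP intermediates, so treat this as a lower bound:

```python
# Rough per-layer activation memory: batch x seq_len x hidden x bytes,
# then multiplied by the number of layers for the whole model.
# hidden=4096 and n_layers=32 are assumed values for a typical 7B model.
batch, seq_len, hidden, n_layers, nbytes = 8, 2048, 4096, 32, 2

per_layer_gb = batch * seq_len * hidden * nbytes / 1e9
total_gb = per_layer_gb * n_layers
print(f"{per_layer_gb:.2f} GB per layer, {total_gb:.1f} GB for all layers")
# -> 0.13 GB per layer, 4.3 GB for all layers (before attention/MLP overhead)
```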

Normally, in modern training, we use gradient checkpointing to reduce activation VRAM, at the cost of a small increase in computation time (activations are recomputed during the backward pass).

Total

For half precision (FP16 weights and gradients, 8-bit optimizer) and a 7B model:

Model parameters = 14 GB
Optimizer states = 42 GB
Gradients = 14 GB
Activations ≈ 0 (with gradient checkpointing)

Total ≈ 70 GB
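
Putting the four pieces together (activations taken as ~0 because of gradient checkpointing, matching the breakdown above):

```python
# Total training memory for a 7B model: FP16 weights + 8-bit optimizer.
N_PARAMS = 7e9

params_gb = N_PARAMS * 2 / 1e9                   # FP16 weights
optimizer_gb = (2 * 1 + 1 * 4) * N_PARAMS / 1e9  # int8 m, v + FP32 master weights
grads_gb = N_PARAMS * 2 / 1e9                    # FP16 gradients
activations_gb = 0                               # assumed negligible with checkpointing

print(params_gb + optimizer_gb + grads_gb + activations_gb)  # -> 70.0 GB
```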

