Optimizing Transformers

[Figure: Transformer structure]

There are several ways to optimize a transformer for training or inference under low-resource compute constraints:

  1. Pruning: Pruning identifies the least important weights, attention heads, or entire layers and removes them, shrinking the model with little accuracy cost (see the magnitude-pruning sketch after this list).
  2. Quantization: Quantization reduces the precision of the model's weights and activations by using fewer bits, for example going from 32-bit floats to 8-bit or 4-bit integers. This can cost some accuracy, but quantization-aware training or post-training methods such as dynamic quantization keep the loss small (see the dynamic-quantization sketch after this list).
  3. Knowledge Distillation: The main idea is to train a smaller student model, often for a specific task, to imitate the outputs of a larger, more general (frontier) teacher model (see the distillation-loss sketch after this list).
  4. Mixture of Experts: MoE does little to reduce training cost, but it cuts inference compute by a large margin: a learned router activates only a small fraction of the expert sub-networks for each request. It's like training multiple specialists and selecting the most appropriate one per input (see the top-1 routing sketch after this list).
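A minimal magnitude-pruning sketch in PyTorch, using `torch.nn.utils.prune`. The feed-forward block here is a hypothetical stand-in for one transformer sublayer, and the 30% pruning ratio is an illustrative choice:

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Toy feed-forward block standing in for one transformer sublayer.
ffn = nn.Sequential(
    nn.Linear(512, 2048),
    nn.ReLU(),
    nn.Linear(2048, 512),
)

# Magnitude pruning: zero out the 30% of weights with the smallest absolute value.
for module in ffn:
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")  # make the pruning permanent

zeros = sum((m.weight == 0).sum().item() for m in ffn if isinstance(m, nn.Linear))
total = sum(m.weight.numel() for m in ffn if isinstance(m, nn.Linear))
print(f"sparsity: {zeros / total:.1%}")
```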
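A dynamic-quantization sketch using PyTorch's built-in `torch.quantization.quantize_dynamic`, which stores `nn.Linear` weights as int8 and quantizes activations on the fly. The toy model is again an assumed stand-in, not a real transformer:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512))

# Dynamic quantization: int8 weights, activations quantized at runtime.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
print(quantized(x).shape)  # same interface as the original model, smaller weights
```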
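A sketch of a standard distillation loss: a soft KL term pulling the student toward the teacher's softened output distribution, plus the usual cross-entropy against ground-truth labels. The temperature `T` and mixing weight `alpha` are illustrative defaults, not prescribed values:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: KL divergence between temperature-softened distributions.
    # The T*T factor keeps gradient magnitudes comparable across temperatures.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: ordinary cross-entropy against the true labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```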
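A toy top-1 routing sketch showing the core MoE idea: a learned gate scores the experts, and each token runs through only its highest-scoring expert. The class name, dimensions, and expert count are made up for illustration; production MoE layers add load-balancing losses and capacity limits omitted here:

```python
import torch
import torch.nn as nn

class Top1MoE(nn.Module):
    """Sketch of a sparse MoE layer: route each token to its top-1 expert."""

    def __init__(self, d_model=512, d_ff=2048, n_experts=4):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: (tokens, d_model)
        scores = self.gate(x).softmax(dim=-1)   # router probabilities per token
        weight, idx = scores.max(dim=-1)        # pick the top-1 expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = idx == e
            if mask.any():                      # only run experts that got tokens
                out[mask] = weight[mask].unsqueeze(-1) * expert(x[mask])
        return out

moe = Top1MoE()
tokens = torch.randn(10, 512)
print(moe(tokens).shape)  # torch.Size([10, 512])
```

Only the selected expert's parameters are exercised per token, which is why compute per request stays low even as the total parameter count grows.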
