llm

Note
Additive Attention
Autoregressive Model
BERT
Byte Pair Encoding (BPE)
Cache-Augmented Generation (CAG)
Causal Language Modeling
Cross-Attention
DAPO
DistilBERT
ELMo Embeddings
Encoder-Decoder Transformer
GPU Computation for LLM
Grouped-Query Attention
GSPO
Hypothetical Document Embedding (HyDE)
Instruction Fine-Tuning
KV Cache
Mamba Architecture
Masked Self-Attention
Multi-Head Attention
Multi-Head Latent Attention
Multi-Query Attention
Optimizing Transformer
Parallelism in LLM
Positional Encoding in Transformer
Pre-Fill in LLM
Pre-Training LLM
Prompt Engineering
Rotary Position Embedding (RoPE)
Self-Attention
Sliding Window Attention
State Space Model
Transformer vs LSTM
When is less data better than more?
Why do we scale attention weights?
Why do we use Projection in QKV?
Why Trigonometric Function for Positional Encoding?
Yet another RoPE Extension (YaRN)