01 — Quantisation
Numerical Precision Reduction
FP32 → INT8 → INT4 → Binary
Reduce the bit-width of weights and activations. Most model behaviour survives the transition from 32-bit float to 8-bit integer. Applied either after training as post-training quantisation (PTQ) or with low-precision rounding simulated during training as quantisation-aware training (QAT).
4× memory reduction · 2-4× inference speedup · <1% accuracy loss at INT8
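A minimal sketch of symmetric per-tensor INT8 PTQ in NumPy, to make the scale/round/clip mechanics concrete. The function names (quantise_int8, dequantise) and the 4096×4096 matrix are illustrative, not from the source; production toolchains usually quantise per-channel and calibrate activations separately.

```python
import numpy as np

def quantise_int8(w: np.ndarray):
    """Symmetric per-tensor PTQ: map FP32 weights onto the signed INT8 grid."""
    scale = max(np.abs(w).max() / 127.0, 1e-12)        # one scale for the whole tensor
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantise(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an FP32 approximation of the original weights."""
    return q.astype(np.float32) * scale

w = np.random.randn(4096, 4096).astype(np.float32)     # hypothetical FP32 weight matrix
q, scale = quantise_int8(w)
w_hat = dequantise(q, scale)
print(q.nbytes / w.nbytes)                              # 0.25 -> the 4x memory reduction
print(np.abs(w - w_hat).max())                          # worst-case rounding error, <= scale/2
```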
02 — Pruning & Sparsification
Weight Elimination
Structured · Unstructured · Magnitude
Zero out weights below a threshold — unstructured for maximum compression, structured for hardware-friendly sparsity patterns. Lottery ticket hypothesis: the sparse subnetwork was always there.
Up to 90% sparsity achievable · Structured pruning maps to SIMD ops
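A sketch of both pruning flavours mentioned above, assuming NumPy and invented helper names (magnitude_prune, structured_prune_rows): unstructured magnitude pruning zeroes individual weights below a threshold, while the structured variant drops whole rows (output channels) so the remaining matrix stays dense and hardware-friendly.

```python
import numpy as np

def magnitude_prune(w: np.ndarray, sparsity: float) -> np.ndarray:
    """Unstructured: zero the smallest-magnitude weights until `sparsity` is reached."""
    k = int(w.size * sparsity)                           # number of weights to remove
    threshold = np.partition(np.abs(w).ravel(), k)[k]    # (k+1)-th smallest |w|
    return w * (np.abs(w) >= threshold)

def structured_prune_rows(w: np.ndarray, sparsity: float) -> np.ndarray:
    """Structured: drop whole rows (output channels) with the smallest L2 norm."""
    k = int(w.shape[0] * sparsity)
    keep = np.sort(np.argsort(np.linalg.norm(w, axis=1))[k:])
    return w[keep]                                       # smaller dense matrix, no masks needed

w = np.random.randn(1024, 1024).astype(np.float32)
print((magnitude_prune(w, 0.9) == 0).mean())             # ~0.9 of weights zeroed
print(structured_prune_rows(w, 0.5).shape)               # (512, 1024) dense result
```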
03 — Low-Rank Decomposition
Matrix Factorisation
SVD · Tucker · CP Decomposition
Decompose large weight matrices W ≈ AB where A, B have far fewer parameters. The effective rank of trained layers is typically much lower than the nominal dimension — the model is over-parameterised by design.
LoRA adapters: rank r ≪ d · Attention layers compress most · 50-80% param reduction
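A sketch of the W ≈ AB factorisation via truncated SVD, using NumPy and an invented helper name (low_rank_factorise); the matrix and rank are illustrative. On a random matrix the reconstruction error is large because its spectrum is flat, whereas trained layers with low effective rank compress far better, which is the point of the technique.

```python
import numpy as np

def low_rank_factorise(w: np.ndarray, rank: int):
    """Truncated SVD: W (d_out x d_in) ~= A @ B, A (d_out x r), B (r x d_in)."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    A = u[:, :rank] * s[:rank]            # absorb the top singular values into A
    B = vt[:rank, :]
    return A, B

d, r = 4096, 256
w = np.random.randn(d, d).astype(np.float32)
A, B = low_rank_factorise(w, r)
print((A.size + B.size) / w.size)                         # ~2r/d of the original parameters
print(np.linalg.norm(w - A @ B) / np.linalg.norm(w))      # relative reconstruction error
```

The same factorised shape is what a LoRA adapter trains from scratch: the frozen weight is used as-is and only the rank-r pair A, B is learned, so the update costs 2rd parameters instead of d².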
04 — Mixture of Experts & Adapters
Conditional Computation
Sparse Gating · Task-Specific Routing
Don't compress the model — make most of it inactive per inference. MoE routes each token to 2 of N expert sub-networks. Adapter architectures bolt small trained modules onto frozen base weights.
Mixtral 8×7B: 47B params, 13B active · <1% routing overhead at scale
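A toy sketch of top-k sparse gating in NumPy, assuming an invented moe_forward helper with per-token routing; the expert count, dimensions, and tanh experts are illustrative, not Mixtral's architecture. It shows the core property: every token only pays for k of the N experts.

```python
import numpy as np

def moe_forward(x, gate_w, experts, k=2):
    """Route each token to its top-k experts and mix their outputs.
    x: (tokens, d), gate_w: (d, N) router weights, experts: list of N callables d -> d."""
    logits = x @ gate_w                                    # (tokens, N) router scores
    top_k = np.argsort(logits, axis=1)[:, -k:]             # indices of the k best experts per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        sel = top_k[t]
        weights = np.exp(logits[t, sel])
        weights /= weights.sum()                           # softmax over the selected experts only
        for w_i, e_i in zip(weights, sel):
            out[t] += w_i * experts[e_i](x[t])             # only k expert sub-networks run per token
    return out

# toy usage: 8 experts, 2 active per token
d, n_experts = 16, 8
rng = np.random.default_rng(0)
experts = [(lambda W: (lambda v: np.tanh(v @ W)))(rng.standard_normal((d, d)))
           for _ in range(n_experts)]
gate_w = rng.standard_normal((d, n_experts))
x = rng.standard_normal((4, d))
print(moe_forward(x, gate_w, experts).shape)               # (4, 16)
```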