Map of Content · Deep Tech · AI

Model Compression
& Edge AI

Software Foundations · Hardware Physics · Deployment Constraints

The question is not "how small can we make it" — it is "what do we lose, and does it matter for the deployment we care about?" That question lives at the intersection of mathematics, information theory, and hardware physics.

Mathematics
What rank does this matrix really have? Most weight matrices are full-rank by construction but low-rank by behaviour.
Information Theory
What precision does this layer actually need? Bits spent preserving noise are bits wasted.
Hardware Physics
What can this chip actually multiply? Silicon enforces the tradeoffs that theory makes optional.
The Compression Question
Where all three meet — extracting the useful signal, discarding the rest, without breaking what made the model work.
Cloud Deployment
Can Afford to Be Lazy
A cloud GPU has headroom. Overprovisioned memory, multi-second latency budgets, external cooling, unconstrained power draw. Compression is optimisation — useful, but not existential.
Edge Deployment — Where Tradeoffs Become Real
Cannot.
Edge AI is where these tradeoffs stop being theoretical. A phone, a sensor, or a pacemaker operates under hard constraints. Every watt, every millisecond, every byte of memory is load-bearing.
Mobile SoC
IoT Sensor
Medical Device
Industrial MCU
Satellite
Four Canonical Techniques
Everything else is a variation, combination, or physical realisation of these.
01 — Quantisation
Numerical Precision Reduction
FP32 → INT8 → INT4 → Binary
Reduce the bit-width of weights and activations. Most model behaviour survives the transition from 32-bit float to 8-bit integer. Post-training quantisation (PTQ) or quantisation-aware training (QAT).
4× memory reduction · 2-4× inference speedup · <1% accuracy loss at INT8
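A minimal post-training quantisation sketch in NumPy, for illustration only: symmetric per-tensor INT8, with the scale taken from the largest weight magnitude. Real PTQ pipelines calibrate per-channel scales on sample data and quantise activations as well; the random weight matrix below is an assumption standing in for one trained layer.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor INT8 quantisation: w ≈ scale * q, with q in [-127, 127]."""
    scale = np.abs(w).max() / 127.0                          # largest magnitude maps to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Reconstruct an FP32 approximation from the INT8 codes."""
    return q.astype(np.float32) * scale

# Toy stand-in for one trained FP32 linear layer.
rng = np.random.default_rng(0)
w_fp32 = rng.normal(0.0, 0.05, size=(512, 512)).astype(np.float32)

q, scale = quantize_int8(w_fp32)
w_hat = dequantize(q, scale)

print("memory: 4 bytes/weight -> 1 byte/weight (4x reduction)")
print("mean abs reconstruction error:", np.abs(w_fp32 - w_hat).mean())
```

QAT differs mainly in when this happens: the rounding is simulated inside the training loop, so the weights adapt to the reduced precision instead of being quantised after the fact.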
02 — Pruning & Sparsification
Weight Elimination
Structured · Unstructured · Magnitude
Zero out weights below a threshold — unstructured for maximum compression, structured for hardware-friendly sparsity patterns. Lottery ticket hypothesis: the sparse subnetwork was always there.
Up to 90% sparsity achievable · Structured pruning maps to SIMD ops
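A sketch of unstructured magnitude pruning in NumPy, under the same illustrative assumptions (a random matrix stands in for a trained layer): zero every weight whose magnitude falls below the chosen percentile.

```python
import numpy as np

def magnitude_prune(w, sparsity):
    """Unstructured magnitude pruning: zero the smallest |w| entries.

    `sparsity` is the fraction of weights to remove (e.g. 0.9 keeps the top 10%).
    """
    threshold = np.quantile(np.abs(w), sparsity)   # cutoff below which weights are dropped
    mask = np.abs(w) >= threshold
    return w * mask, mask

rng = np.random.default_rng(0)
w = rng.normal(size=(1024, 1024)).astype(np.float32)

w_pruned, mask = magnitude_prune(w, sparsity=0.9)
print("achieved sparsity:", 1.0 - mask.mean())     # ~0.9 of weights are now exactly zero
```

Structured pruning works the same way but removes whole rows, channels, or heads, so the surviving tensor stays dense and maps cleanly onto SIMD hardware.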
03 — Low-Rank Decomposition
Matrix Factorisation
SVD · Tucker · CP Decomposition
Decompose large weight matrices W ≈ AB where A, B have far fewer parameters. The effective rank of trained layers is typically much lower than the nominal dimension — the model is over-parameterised by design.
LoRA adapters: rank r ≪ d · Attention layers compress most · 50-80% param reduction
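A truncated-SVD factorisation sketch, with illustrative shapes and rank: W is replaced by two thin factors A and B, and the parameter count drops accordingly.

```python
import numpy as np

def low_rank_factorize(w, rank):
    """Truncated SVD: W ≈ A @ B with A of shape (m, r) and B of shape (r, n)."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    a = u[:, :rank] * s[:rank]          # absorb the singular values into A
    b = vt[:rank, :]
    return a, b

rng = np.random.default_rng(0)
# Build a matrix whose effective rank is low, mimicking a trained layer.
w = rng.normal(size=(1024, 64)) @ rng.normal(size=(64, 1024))

a, b = low_rank_factorize(w, rank=64)
params_before = w.size
params_after = a.size + b.size
print("param reduction:", 1 - params_after / params_before)                    # ~0.875
print("relative error:", np.linalg.norm(w - a @ b) / np.linalg.norm(w))        # ~0
```

LoRA applies the same idea to fine-tuning: the frozen W is kept and only a rank-r update BA is trained on top of it.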
04 — Mixture of Experts & Adapters
Conditional Computation
Sparse Gating · Task-Specific Routing
Don't compress the model — make most of it inactive per inference. MoE routes each token to 2 of N expert sub-networks. Adapter architectures bolt small trained modules onto frozen base weights.
Mixtral 8×7B: 47B params, 13B active · <1% routing overhead at scale
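A minimal top-2 gating sketch in NumPy. The expert count, shapes, and softmax router are illustrative assumptions rather than any production model's implementation; the point is that all N expert matrices are stored but only two touch each token.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def moe_layer(x, router_w, experts, top_k=2):
    """Route each token to its top_k experts; combine outputs with gate weights."""
    logits = x @ router_w                            # (tokens, n_experts) router scores
    top = np.argsort(logits, axis=-1)[:, -top_k:]    # indices of the top_k experts per token
    gates = softmax(np.take_along_axis(logits, top, axis=-1), axis=-1)
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for k in range(top_k):
            e = top[t, k]
            out[t] += gates[t, k] * (x[t] @ experts[e])   # each expert is one weight matrix here
    return out

rng = np.random.default_rng(0)
d, n_experts, tokens = 64, 8, 16
x = rng.normal(size=(tokens, d))
router_w = rng.normal(size=(d, n_experts))
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]

y = moe_layer(x, router_w, experts)   # 8 experts stored, only 2 active per token
```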
Why is compression possible at all without quality collapse?
The answer sits in how these models are trained, not in the compression method itself. Frontier models are dramatically over-parameterised for the tasks they solve — by intentional design. SGD finds solutions that generalise well, and those solutions occupy a far lower-dimensional manifold than the parameter space suggests.

Neural Scaling Laws imply diminishing returns: most task-relevant capability is acquired early in training. Late-stage parameters are refinements, not foundations — and refinements compress. The compressibility of a model is evidence of its training quality, not a bug.

[Chart: Task Quality vs Compression. Task quality (y-axis) against compression ratio (x-axis), running from the over-parameterised zone through the target zone to the collapse threshold where quality falls off.]
Where Software Compression Meets Silicon
Beyond the Von Neumann Bottleneck
Analog In-Memory Computing
Compute Where the Weights Live
The von Neumann bottleneck — the energy cost of moving data between memory and processor — dominates digital inference at the edge. Analog in-memory computing performs matrix-vector multiplication where weights are stored, using the physics of the memory device (memristor, PCM, ReRAM) as the compute substrate. Radical departure from digital CMOS.
Energy per MAC: 100× better than digital CMOS
Precision constraint: 3–6 effective bits
Requires: quantisation + noise-aware training (sketched below)
Bottleneck eliminated: memory bandwidth
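A sketch of the forward model that noise-aware training targets, with an assumed and much simplified noise model: weights are quantised to the device's effective precision and perturbed by read noise before the multiply. During training this stands in for the exact matmul, so the network learns weights that tolerate the analog error.

```python
import numpy as np

def analog_matvec(w, x, bits=4, noise_std=0.02, rng=None):
    """Simulated analog in-memory matrix-vector product.

    Weights are quantised to `bits` effective precision (the 3-6 bit device
    constraint) and perturbed by conductance/read noise; both the bit-width
    and the Gaussian noise model are illustrative assumptions.
    """
    rng = rng if rng is not None else np.random.default_rng()
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / levels
    w_q = np.round(w / scale) * scale                                  # device precision
    w_noisy = w_q + rng.normal(0.0, noise_std * np.abs(w).max(), size=w.shape)
    return w_noisy @ x

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256)).astype(np.float32)
x = rng.normal(size=256).astype(np.float32)
err = np.linalg.norm(analog_matvec(w, x, rng=rng) - w @ x) / np.linalg.norm(w @ x)
print("relative output error from analog non-idealities:", err)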
Efficient Transformer Architectures
Attention Designed for Silicon Reality
Standard attention is O(n²) in sequence length — tractable in the cloud, unacceptable at the edge. Architectural compression: Linear Attention, Mamba (state-space models), RetNet, and Flash Attention variants redesign the computation graph to fit within edge SRAM budgets. These are not post-hoc compressions of a trained model — they are compression-native architectures (a minimal linear-attention sketch follows below).
Attention complexity: O(n²) → O(n) or O(1) state
SRAM fit (mobile): 1–4 MB working set
Latency target: <50ms time-to-first-token
Key architectures: Mamba · GQA · Flash-Attn · RetNet
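A kernelised linear-attention sketch in NumPy (the elu+1 feature map is one common choice; shapes are illustrative): the n×n attention matrix is never materialised, so the working set scales with n·d rather than n².

```python
import numpy as np

def elu_plus_one(x):
    """Positive feature map φ(x) = elu(x) + 1, applied elementwise."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(q, k, v):
    """O(n) attention: softmax(QKᵀ)V is replaced by φ(Q)(φ(K)ᵀV) / φ(Q)(φ(K)ᵀ1)."""
    q, k = elu_plus_one(q), elu_plus_one(k)
    kv = k.T @ v                       # (d, d_v) summary — no (n, n) matrix is formed
    z = q @ k.sum(axis=0)              # per-token normaliser, shape (n,)
    return (q @ kv) / z[:, None]

rng = np.random.default_rng(0)
n, d = 4096, 64                        # sequence length, head dimension
q, k, v = (rng.normal(size=(n, d)) for _ in range(3))
out = linear_attention(q, k, v)        # memory stays ~n*d, not n*n
```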
Biology
Optional but revealing. Compression and scaling play out differently when the sequence domain is not language.
Biological Sequence Modelling
Language model architectures applied to DNA, RNA, and protein sequences — but the compression dynamics are distinct. Biological sequences carry sparser, more conserved information than natural language: the effective vocabulary is tiny (4 nucleotides, 20 amino acids), yet the long-range dependencies are extreme (gene regulatory elements span megabases). Achievable compression ratios are therefore higher than in NLP for equivalent task performance — but the quality collapse point is non-negotiable: a misfolded protein does not degrade gracefully.
DNA · RNA · Protein
4–20 token vocab
Ultra-long contexts
Hard quality floor
AlphaFold lineage