Map of Content · Deep Tech · AI

Model Compression
& Edge AI

Software Foundations · Hardware Physics · Deployment Constraints

The question is not "how small can we make it" — it is "what do we lose, and does it matter for the deployment we care about?" That question lives at the intersection of mathematics, information theory, and hardware physics.

Mathematics
What rank does this matrix really have? Most weight matrices are full-rank by construction but low-rank by behaviour.
Information Theory
What precision does this layer actually need? Bits spent preserving noise are bits wasted.
Hardware Physics
What can this chip actually multiply? Silicon enforces the tradeoffs that theory makes optional.
The Compression Question
Where all three meet — extracting the useful signal, discarding the rest, without breaking what made the model work.
Cloud Deployment
Can Afford to Be Lazy
A cloud GPU has headroom. Overprovisioned memory, multi-second latency budgets, external cooling, unconstrained power draw. Compression is optimisation — useful, but not existential.
Edge Deployment — Where Tradeoffs Become Real
Cannot.
Edge AI is where these tradeoffs stop being theoretical. A phone, a sensor, or a pacemaker operates under hard constraints. Every watt, every millisecond, every byte of memory is load-bearing.
Mobile SoC
IoT Sensor
Medical Device
Industrial MCU
Satellite
Four Canonical Techniques
Everything else is a variation, combination, or physical realisation of these.
01 — Quantisation
Numerical Precision Reduction
FP32 → INT8 → INT4 → Binary
Reduce the bit-width of weights and activations. Most model behaviour survives the transition from 32-bit float to 8-bit integer. Post-training quantisation (PTQ) or quantisation-aware training (QAT).
4× memory reduction · 2-4× inference speedup · <1% accuracy loss at INT8
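A minimal post-training quantisation sketch in NumPy, for illustration only: symmetric per-tensor INT8, with the scale taken from the largest weight magnitude. Real PTQ pipelines calibrate per-channel scales on sample data and quantise activations as well; the random weight matrix below is an assumption standing in for one trained layer.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor INT8 quantisation: w ≈ scale * q, with q in [-127, 127]."""
    scale = np.abs(w).max() / 127.0                          # largest magnitude maps to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Reconstruct an FP32 approximation from the INT8 codes."""
    return q.astype(np.float32) * scale

# Toy stand-in for one trained FP32 linear layer.
rng = np.random.default_rng(0)
w_fp32 = rng.normal(0.0, 0.05, size=(512, 512)).astype(np.float32)

q, scale = quantize_int8(w_fp32)
w_hat = dequantize(q, scale)

print("memory: 4 bytes/weight -> 1 byte/weight (4x reduction)")
print("mean abs reconstruction error:", np.abs(w_fp32 - w_hat).mean())
```

QAT differs mainly in when this happens: the rounding is simulated inside the training loop, so the weights adapt to the reduced precision instead of being quantised after the fact.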
02 — Pruning & Sparsification
Weight Elimination
Structured · Unstructured · Magnitude
Zero out weights below a threshold — unstructured for maximum compression, structured for hardware-friendly sparsity patterns. Lottery ticket hypothesis: the sparse subnetwork was always there.
Up to 90% sparsity achievable · Structured pruning maps to SIMD ops
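A sketch of unstructured magnitude pruning in NumPy, under the same illustrative assumptions (a random matrix stands in for a trained layer): zero every weight whose magnitude falls below the chosen percentile.

```python
import numpy as np

def magnitude_prune(w, sparsity):
    """Unstructured magnitude pruning: zero the smallest |w| entries.

    `sparsity` is the fraction of weights to remove (e.g. 0.9 keeps the top 10%).
    """
    threshold = np.quantile(np.abs(w), sparsity)   # cutoff below which weights are dropped
    mask = np.abs(w) >= threshold
    return w * mask, mask

rng = np.random.default_rng(0)
w = rng.normal(size=(1024, 1024)).astype(np.float32)

w_pruned, mask = magnitude_prune(w, sparsity=0.9)
print("achieved sparsity:", 1.0 - mask.mean())     # ~0.9 of weights are now exactly zero
```

Structured pruning works the same way but removes whole rows, channels, or heads, so the surviving tensor stays dense and maps cleanly onto SIMD hardware.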
03 — Low-Rank Decomposition
Matrix Factorisation
SVD · Tucker · CP Decomposition
Decompose large weight matrices W ≈ AB where A, B have far fewer parameters. The effective rank of trained layers is typically much lower than the nominal dimension — the model is over-parameterised by design.
LoRA adapters: rank r ≪ d · Attention layers compress most · 50-80% param reduction
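A truncated-SVD factorisation sketch, with illustrative shapes and rank: W is replaced by two thin factors A and B, and the parameter count drops accordingly.

```python
import numpy as np

def low_rank_factorize(w, rank):
    """Truncated SVD: W ≈ A @ B with A of shape (m, r) and B of shape (r, n)."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    a = u[:, :rank] * s[:rank]          # absorb the singular values into A
    b = vt[:rank, :]
    return a, b

rng = np.random.default_rng(0)
# Build a matrix whose effective rank is low, mimicking a trained layer.
w = rng.normal(size=(1024, 64)) @ rng.normal(size=(64, 1024))

a, b = low_rank_factorize(w, rank=64)
params_before = w.size
params_after = a.size + b.size
print("param reduction:", 1 - params_after / params_before)                    # ~0.875
print("relative error:", np.linalg.norm(w - a @ b) / np.linalg.norm(w))        # ~0
```

LoRA applies the same idea to fine-tuning: the frozen W is kept and only a rank-r update BA is trained on top of it.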
04 — Mixture of Experts & Adapters
Conditional Computation
Sparse Gating · Task-Specific Routing
Don't compress the model — make most of it inactive per inference. MoE routes each token to 2 of N expert sub-networks. Adapter architectures bolt small trained modules onto frozen base weights.
Mixtral 8×7B: 47B params, 13B active · <1% routing overhead at scale
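A minimal top-2 gating sketch in NumPy. The expert count, shapes, and softmax router are illustrative assumptions rather than any production model's implementation; the point is that all N expert matrices are stored but only two touch each token.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def moe_layer(x, router_w, experts, top_k=2):
    """Route each token to its top_k experts; combine outputs with gate weights."""
    logits = x @ router_w                            # (tokens, n_experts) router scores
    top = np.argsort(logits, axis=-1)[:, -top_k:]    # indices of the top_k experts per token
    gates = softmax(np.take_along_axis(logits, top, axis=-1), axis=-1)
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for k in range(top_k):
            e = top[t, k]
            out[t] += gates[t, k] * (x[t] @ experts[e])   # each expert is one weight matrix here
    return out

rng = np.random.default_rng(0)
d, n_experts, tokens = 64, 8, 16
x = rng.normal(size=(tokens, d))
router_w = rng.normal(size=(d, n_experts))
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]

y = moe_layer(x, router_w, experts)   # 8 experts stored, only 2 active per token
```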
Why is compression possible at all without quality collapse?
The answer sits in how these models are trained, not in the compression method itself. Frontier models are dramatically over-parameterised for the tasks they solve — by intentional design. SGD finds solutions that generalise well, and those solutions occupy a far lower-dimensional manifold than the parameter space suggests.

Neural Scaling Laws imply diminishing returns: most task-relevant capability is acquired early in training. Late-stage parameters are refinements, not foundations — and refinements compress. The compressibility of a model is evidence of its training quality, not a bug.

[Chart: Task Quality vs Compression. Task quality (y-axis) against compression ratio (x-axis), running from the over-parameterised zone through the target zone to the collapse threshold where quality falls off.]
Where Software Compression Meets Silicon
Beyond the Von Neumann Bottleneck
Analog In-Memory Computing
Compute Where the Weights Live
The von Neumann bottleneck — the energy cost of moving data between memory and processor — dominates digital inference at the edge. Analog in-memory computing performs matrix-vector multiplication where weights are stored, using the physics of the memory device (memristor, PCM, ReRAM) as the compute substrate. Radical departure from digital CMOS.
Energy per MAC: 100× better than digital CMOS
Precision constraint: 3–6 effective bits
Requires: quantisation + noise-aware training (sketched below)
Bottleneck eliminated: memory bandwidth
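A sketch of the forward model that noise-aware training targets, with an assumed and much simplified noise model: weights are quantised to the device's effective precision and perturbed by read noise before the multiply. During training this stands in for the exact matmul, so the network learns weights that tolerate the analog error.

```python
import numpy as np

def analog_matvec(w, x, bits=4, noise_std=0.02, rng=None):
    """Simulated analog in-memory matrix-vector product.

    Weights are quantised to `bits` effective precision (the 3-6 bit device
    constraint) and perturbed by conductance/read noise; both the bit-width
    and the Gaussian noise model are illustrative assumptions.
    """
    rng = rng if rng is not None else np.random.default_rng()
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / levels
    w_q = np.round(w / scale) * scale                                  # device precision
    w_noisy = w_q + rng.normal(0.0, noise_std * np.abs(w).max(), size=w.shape)
    return w_noisy @ x

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256)).astype(np.float32)
x = rng.normal(size=256).astype(np.float32)
err = np.linalg.norm(analog_matvec(w, x, rng=rng) - w @ x) / np.linalg.norm(w @ x)
print("relative output error from analog non-idealities:", err)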
Efficient Transformer Architectures
Attention Designed for Silicon Reality
Standard attention is O(n²) in sequence length — tractable in the cloud, unacceptable at the edge. Architectural compression: Linear Attention, Mamba (state-space models), RetNet, and Flash Attention variants redesign the computation graph to fit within edge SRAM budgets. These are not post-hoc compressions of a trained model — they are compression-native architectures (a minimal linear-attention sketch follows below).
Attention complexity: O(n²) → O(n) or O(1) state
SRAM fit (mobile): 1–4 MB working set
Latency target: <50ms time-to-first-token
Key architectures: Mamba · GQA · Flash-Attn · RetNet
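A kernelised linear-attention sketch in NumPy (the elu+1 feature map is one common choice; shapes are illustrative): the n×n attention matrix is never materialised, so the working set scales with n·d rather than n².

```python
import numpy as np

def elu_plus_one(x):
    """Positive feature map φ(x) = elu(x) + 1, applied elementwise."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(q, k, v):
    """O(n) attention: softmax(QKᵀ)V is replaced by φ(Q)(φ(K)ᵀV) / φ(Q)(φ(K)ᵀ1)."""
    q, k = elu_plus_one(q), elu_plus_one(k)
    kv = k.T @ v                       # (d, d_v) summary — no (n, n) matrix is formed
    z = q @ k.sum(axis=0)              # per-token normaliser, shape (n,)
    return (q @ kv) / z[:, None]

rng = np.random.default_rng(0)
n, d = 4096, 64                        # sequence length, head dimension
q, k, v = (rng.normal(size=(n, d)) for _ in range(3))
out = linear_attention(q, k, v)        # memory stays ~n*d, not n*n
```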
Biology
Optional but revealing. Compression and scaling play out differently when the sequence domain is not language.
Biological Sequence Modelling
Language model architectures applied to DNA, RNA, and protein sequences — but the compression dynamics are distinct. Biological sequences carry sparser, more conserved information than natural language: the effective vocabulary is tiny (4 nucleotides, 20 amino acids), yet the long-range dependencies are extreme (gene regulatory elements span megabases). Achievable compression ratios are therefore higher than in NLP for equivalent task performance — but the quality collapse point is non-negotiable: a misfolded protein does not degrade gracefully.
DNA · RNA · Protein
4–20 token vocab
Ultra-long contexts
Hard quality floor
AlphaFold lineage