How thermodynamic systems explain codec tuning
Working in the media group (i.e. handling user-uploaded audio, images, and video) at a tech company, we often apply an encoding to uploaded user files. An important decision when choosing the encoding options is the tradeoff between "file size" and "file quality".
The more we can compress our files, the more we save on storage and network costs. With millions of users and files in the MBs to GBs range, a small percentage saving translates into huge absolute numbers.
Of course this comes with a downside: compressing a file can reduce its quality (and customer satisfaction - *ahem* looking at you, Facebook Messenger, for compressing all my videos from my high school days).
But file size and quality are two separate dimensions. If we view quality as a function of file size, what function describes the theoretical best quality we can get for a specified file size?
First, I'll do a quick refresher on lossless compression since it'll contain some important concepts usually introduced (and usually forgotten) in CS undergrad.
The hard limit on lossless compression (where the original is perfectly recoverable) is the entropy of the source:
$$ H(X) = -\sum_x p(x) \log_2 p(x) $$
where $p(x)$ is the probability of a symbol $x$. Entropy measures the "average surprise" - a predictable source has low entropy (few bits needed), while an unpredictable source has high entropy (can't compress at all).
For a coin flip with probability $p$ of heads, the entropy $H = -p\log_2 p - (1-p)\log_2(1-p)$ looks like:
The outcomes of a biased coin ($p \approx 0$ or $p \approx 1$) are predictable and compress well.
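To make the curve concrete, here's a small sketch (plain Python, with biases chosen just for illustration) of the binary entropy at a few values of $p$:

```python
import math

def binary_entropy(p: float) -> float:
    """Entropy in bits of a coin with P(heads) = p."""
    if p in (0.0, 1.0):
        return 0.0  # a certain outcome carries no surprise
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.01, 0.1, 0.3, 0.5, 0.9, 0.99):
    print(f"p = {p:<4}: H = {binary_entropy(p):.3f} bits/flip")
```

The entropy peaks at 1 bit per flip for a fair coin and falls towards 0 at the extremes - a fair coin is incompressible, while a heavily biased one barely needs any bits at all.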
As an example:
Say the outcome of our coin tosses contained a run of 100 H's. We could represent it with something like `"H" * 100` instead of the raw string. The longer the repeated run, the more it compresses, and the more frequently such runs occur, the more often we can apply this strategy. So the more biased the coin, the longer and more frequent these compressible patterns become, and the better the compression we can achieve.
But media codecs go far beyond the entropy limit every day - raw 1080p video at 24 fps is about 1.5 Gbps uncompressed, yet streaming services deliver it at around 5 Mbps, a roughly 300x reduction.
This is because they use lossy compression. Unlike with text, human perception has limits, so the encoder can drop information that is unlikely to be noticed - fine high-frequency detail, colour the eye resolves poorly, audio masked by louder sounds.
So now if we can tolerate some distortion, how few bits do we actually need?
Suppose you have a source of data drawn from distribution $p(x)$ - audio samples, video frames, whatever. You want to compress each sample $x$ into a reconstruction $\hat{x}$.
The encoder is a conditional distribution $q(\hat{x} | x)$ - for a given input $x$, it gives the probability of producing each possible reconstruction $\hat{x}$.
The reconstruction won't be perfect, so we need a way to measure how far off it is. This is the distortion function $d(x, \hat{x})$ - it takes the original and the reconstruction and returns a number representing how different they are.
There are many possible distortion functions - squared error (MSE), absolute error, perceptual metrics like SSIM.
But we'll soon see that rate-distortion theory works regardless of what distortion function we choose.
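For concreteness, here are two common choices sketched in Python with NumPy (perceptual metrics like SSIM are more involved, but slot into the same place):

```python
import numpy as np

def squared_error(x: np.ndarray, x_hat: np.ndarray) -> float:
    """Mean squared error: penalises large deviations heavily."""
    return float(np.mean((x - x_hat) ** 2))

def absolute_error(x: np.ndarray, x_hat: np.ndarray) -> float:
    """Mean absolute error: more forgiving of occasional large deviations."""
    return float(np.mean(np.abs(x - x_hat)))
```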
We want to find the encoder that uses the fewest bits while keeping distortion acceptable. This is a tradeoff between two things - rate $R$ (bits used) and distortion $D$ (quality lost) - and we can express it directly as a single objective using a Lagrange multiplier $\lambda$:
$$ \begin{align*} \mathcal{L} &= R + \lambda D \\ &= I(X; \hat{X}) + \lambda \mathbb{E}[d(x, \hat{x})] \end{align*} $$
| Symbol | Name | Description |
|---|---|---|
| $R = I(X; \hat{X})$ | Rate | the mutual information between source $X$ and reconstruction $\hat{X}$, i.e. the bitrate |
| $D = \mathbb{E}[d(x, \hat{x})]$ | Distortion | the expected distortion, averaged over all inputs |
| $\lambda$ | Lagrange multiplier | controls weighting between rate and distortion |
Sweeping $\lambda$ from $0$ to $\infty$ and solving for the optimal encoder at each value traces out the rate-distortion curve $R(D)$ - the theoretical best rate at every distortion level:
Any point on the curve is the best possible; real encoders sit above the curve, and each improvement brings them closer to it.
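For a Gaussian source with squared-error distortion the curve has a well-known closed form, $R(D) = \max\!\left(0, \tfrac{1}{2}\log_2 \tfrac{\sigma^2}{D}\right)$, which makes it easy to sketch numerically:

```python
import numpy as np

def rate_gaussian(D: np.ndarray, sigma2: float = 1.0) -> np.ndarray:
    """R(D) for a Gaussian source (variance sigma2) under squared-error distortion."""
    return np.maximum(0.0, 0.5 * np.log2(sigma2 / D))

D = np.linspace(0.01, 1.2, 120)
R = rate_gaussian(D)
# R drops to 0 once D >= sigma^2: at that distortion you can just transmit
# the mean (zero bits) and still hit the target.
```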
Short primer if you haven't seen statistical mechanics before: it studies systems with many possible states, each with some energy $E$. Heat constantly moves the system between states, so rather than sitting in one state it spends time in many - lower-energy states more often, higher-energy states less often.
The Boltzmann distribution describes exactly how much time is spent in each state:
$$ p(\text{state}) = \frac{1}{Z} e^{-\beta E(\text{state})} $$
where $\beta$ is the inverse temperature (large $\beta$ = cold, small $\beta$ = hot) and $Z = \sum_{\text{states}} e^{-\beta E}$ is a normalising constant called the partition function.
The system is minimising the free energy $F$ where
$$ F = E - \frac{1}{\beta}S $$
This equation shows our tradeoff: minimise energy (settle into low-energy states) versus maximise entropy $S$ (spread out across many states). The parameter $\beta$ (inverse temperature) controls how nature balances the two.
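A tiny numerical sketch (three states with arbitrarily chosen energies) shows how $\beta$ interpolates between "spread out over everything" and "sit in the lowest-energy state":

```python
import numpy as np

def boltzmann(energies: np.ndarray, beta: float) -> np.ndarray:
    """p(state) = exp(-beta * E) / Z."""
    weights = np.exp(-beta * energies)
    return weights / weights.sum()   # dividing by Z, the partition function

E = np.array([0.0, 1.0, 2.0])        # three states, arbitrary energies
for beta in (0.1, 1.0, 10.0):
    print(beta, boltzmann(E, beta).round(3))
# hot (beta = 0.1): nearly uniform; cold (beta = 10): almost all mass on E = 0
```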
Comparing this with our rate-distortion Lagrangian:
$$ \mathcal{L} = R + \lambda D $$
It has the same structure - minimise distortion (energy) versus minimise rate (negative entropy) - and the parameter $\lambda$, like $\beta$, controls how much weight each term gets.
TLDR: information entropy ($H = -\sum p \log p$) and thermodynamic entropy ($S = k \log W$) are both counts of configurations - bit sequences vs physical microstates. We take the $\log$ because total uncertainty in independent systems adds ($\log W_1 + \log W_2$) whilst independent probabilities and state counts multiply ($W_1 \cdot W_2$).
Since the Lagrangian has a $q \log q$ term (from mutual information) and a linear $q \cdot d$ term (from distortion), setting the derivative to zero and isolating $\log q$ gives $\log q = \text{stuff}$, and exponentiating both sides gives $q = e^{\text{stuff}}$. Hence we get the Boltzmann distribution.
Not-the-TLDR:
The Lagrangian is defined as
$$ \begin{align*} \mathcal{L} &= R + \lambda D \\ &= I(X;\hat{X}) + \lambda \mathbb{E}[d(x, \hat{x})] \\ &= \sum_{x,\hat{x}} p(x) q(\hat{x} \mid x)\log\frac{q(\hat{x} \mid x)}{q(\hat{x})} + \lambda \sum_{x,\hat{x}} p(x) q(\hat{x} \mid x) d(x,\hat{x}) \\ &= \sum_{x, \hat{x}} p(x) q(\hat{x} | x) \left[ \log \frac{q(\hat{x} | x)}{q(\hat{x})} + \lambda d(x, \hat{x}) \right] \end{align*} $$
where $q(\hat{x}) = \sum_x p(x) q(\hat{x}|x)$ is the marginal over reconstructions.
Solving for the optimal conditional distribution means minimising $\mathcal{L}[q]$ over all valid $q(\hat{x} \mid x)$:
$$ \begin{align*} q^*(\hat{x} \mid x) &= \arg\min_{q(\hat{x} \mid x)} \mathcal{L}[q] \\ \text{subject to } \sum_{\hat{x}} q(\hat{x} \mid x) &= 1 \quad \text{for each } x \end{align*} $$
Enforcing that constraint and setting the derivative equal to 0 gives:
$$ \log \frac{q^*(\hat{x} \mid x)}{q(\hat{x})} = -\lambda d(x, \hat{x}) - \log Z(x) $$
Exponentiating both sides and rearranging gives:
$$ q^*(\hat{x} \mid x) = \frac{q(\hat{x})}{Z(x)} e^{-\lambda d(x, \hat{x})} $$
That's the Boltzmann distribution, with the same partition function
$$ Z(x) = \sum_{\hat{x}} q(\hat{x}) e^{-\lambda d(x, \hat{x})} $$
that enforces $\sum_{\hat{x}} q^*(\hat{x}|x) = 1$.
Notice that $d(x, \hat{x})$ enters the Lagrangian linearly (as a coefficient multiplied by $q(\hat{x} | x)$) so the derivative treats it as a constant. Hence if we change $d$ the "energy landscape" changes, but the Boltzmann form stays the same.
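This fixed point is exactly what the Blahut-Arimoto algorithm exploits: alternate between the Boltzmann update for $q(\hat{x} \mid x)$ and recomputing the marginal $q(\hat{x})$. A minimal sketch (discrete toy source and squared-error distortion, chosen just for illustration):

```python
import numpy as np

def blahut_arimoto(p_x, d, lam, n_iter=200):
    """Alternating updates for the optimal encoder q(x_hat | x).

    p_x : source distribution over |X| symbols
    d   : |X| x |X_hat| distortion matrix
    lam : Lagrange multiplier trading off rate against distortion
    """
    n_x, n_xhat = d.shape
    q_xhat = np.full(n_xhat, 1.0 / n_xhat)      # marginal over reconstructions
    for _ in range(n_iter):
        # Boltzmann-form update: q*(x_hat | x) proportional to q(x_hat) exp(-lam * d)
        q_cond = q_xhat[None, :] * np.exp(-lam * d)
        q_cond /= q_cond.sum(axis=1, keepdims=True)   # Z(x) normalisation
        # recompute the marginal: q(x_hat) = sum_x p(x) q(x_hat | x)
        q_xhat = p_x @ q_cond
    return q_cond, q_xhat

# toy example: 4 source values, 2 reconstruction levels
x, x_hat = np.array([0.0, 1.0, 2.0, 3.0]), np.array([0.5, 2.5])
p_x = np.full(4, 0.25)
d = (x[:, None] - x_hat[None, :]) ** 2
q_cond, q_xhat = blahut_arimoto(p_x, d, lam=2.0)
```

Sweeping `lam` and evaluating the rate and distortion at each fixed point traces out the $R(D)$ curve from earlier.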
Now we can use this knowledge to build intuition about what tweaking certain parameters is doing. I'll use FFmpeg parameters here, but the ideas apply more broadly (this is not an exhaustive list):
| Parameter | Role | Physics Analogy |
|---|---|---|
| bitrate | resource limits | energy |
| QP | quality vs bitrate tradeoff | temperature |
| preset | how hard the encoder searches for a good solution (encode time vs compression efficiency) | time allowed to reach equilibrium |
| partitions | block split size | degrees of freedom |
| deblock | smooth block boundaries | heat diffusion |
Using this model we can view the encoder and its parameters as defining a free-energy functional, with the encoding process being the system "relaxing" towards a minimum free-energy state.
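As a rough illustration, here's how you might trace an empirical rate/quality trade-off by sweeping CRF from Python (a sketch that assumes `ffmpeg` with libx264 is on your PATH and uses a hypothetical `input.mp4` as a stand-in for one of your files; file size is a crude proxy for rate, and you'd plug in your own quality metric):

```python
import os
import subprocess

def encode_with_crf(src: str, crf: int, preset: str = "medium") -> str:
    """Encode src with libx264 at the given CRF; return the output path."""
    out = f"out_crf{crf}.mp4"
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-c:v", "libx264",
         "-preset", preset, "-crf", str(crf), out],
        check=True, capture_output=True,
    )
    return out

# higher CRF -> "hotter" system: smaller files, more distortion tolerated
for crf in (18, 23, 28, 33):
    path = encode_with_crf("input.mp4", crf)   # hypothetical input file
    print(crf, os.path.getsize(path))
```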
You can push the analogy a bit further with rate control parameters as well:
| Setting | Implication | Physics Analogy |
|---|---|---|
| CRF (Constant Rate Factor) | bit rate varies, quality constant with time | canonical ensemble (heat bath to maintain temperature) |
| CBR (Constant Bitrate) | bit rate constant, quality varies with time | microcanonical ensemble (fixed energy which redistributes itself locally) |
| Two-pass encoding | measure content complexity in a first pass, then optimise rate control in the second | measuring the energy landscape to set optimal conditions for relaxation |
This concludes my article, but the fun doesn't stop there (nerdy, I know) - recognising that entropy means the same thing in thermodynamics and information theory gives us a way to build intuition for encoding options as properties of a thermodynamic system.