How thermodynamic systems explain codec tuning
Working in the media group (i.e. handling user-uploaded audio, images, and video) at a tech company, we often apply an encoding to uploaded user files. An important decision when choosing the encoding options is the tradeoff between "file size" and "file quality".
The more we can compress our files, the more we save on storage and network costs. With millions of users and files in the MBs to GBs range, a small percentage saving translates into huge absolute numbers.
Of course this comes with a downside: compressing a file can reduce its quality (and customer satisfaction - *ahem* looking at you, Facebook Messenger, for compressing all my videos from my high school days).
But file size and quality are two separate dimensions. If we view quality as a function of file size, what function describes the theoretical best quality we can get for a specified file size?
First, I'll do a quick refresher on lossless compression since it'll contain some important concepts usually introduced (and usually forgotten) in CS undergrad.
The hard limit on lossless compression (where the original is perfectly recoverable) is the entropy of the source:
$$ H(X) = -\sum_x p(x) \log_2 p(x) $$
where $p(x)$ is the probability of a symbol $x$. Entropy measures the "average surprise" - a predictable source has low entropy (few bits needed), while an unpredictable source has high entropy (can't compress at all).
For a coin flip with probability $p$ of heads, the entropy $H = -p\log_2 p - (1-p)\log_2(1-p)$ looks like:
The outcomes of a biased coin ($p \approx 0$ or $p \approx 1$) are predictable and compress well.
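To make the curve concrete, here's a small sketch (plain Python, with biases chosen just for illustration) of the binary entropy at a few values of $p$:

```python
import math

def binary_entropy(p: float) -> float:
    """Entropy in bits of a coin with P(heads) = p."""
    if p in (0.0, 1.0):
        return 0.0  # a certain outcome carries no surprise
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.01, 0.1, 0.3, 0.5, 0.9, 0.99):
    print(f"p = {p:<4}: H = {binary_entropy(p):.3f} bits/flip")
```

The entropy peaks at 1 bit per flip for a fair coin and falls towards 0 at the extremes - a fair coin is incompressible, while a heavily biased one barely needs any bits at all.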
As an example:
Say the outcome of our coin tosses contained a run of 100 H's. We could represent it with something like `"H" * 100` instead of the raw string. The longer the repeated run, the more it compresses, and the more frequently such runs occur, the more often we can apply this strategy. So the more biased the coin, the longer and more frequent these compressible patterns become, and the better the compression we can achieve.
But media codecs go far beyond the entropy limit every day - raw 1080p video at 24 fps is about 1.5 Gbps uncompressed, yet streaming services deliver it at around 5 Mbps, a roughly 300x reduction.
This is because they use lossy compression. Unlike with text, human perception has limits, so the encoder can drop information that is unlikely to be noticed - fine high-frequency detail, colour the eye resolves poorly, audio masked by louder sounds.
So now if we can tolerate some distortion, how few bits do we actually need?
Suppose you have a source of data drawn from distribution $p(x)$ - audio samples, video frames, whatever. You want to compress each sample $x$ into a reconstruction $\hat{x}$.
The encoder is a conditional distribution $q(\hat{x} | x)$ - for a given input $x$, it gives the probability of producing each possible reconstruction $\hat{x}$.
The reconstruction won't be perfect, so we need a way to measure how far off it is. This is the distortion function $d(x, \hat{x})$ - it takes the original and the reconstruction and returns a number representing how different they are.
There are many possible distortion functions - squared error (MSE), absolute error, perceptual metrics like SSIM.
But we'll soon see that rate-distortion theory works regardless of what distortion function we choose.
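For concreteness, here are two common choices sketched in Python with NumPy (perceptual metrics like SSIM are more involved, but slot into the same place):

```python
import numpy as np

def squared_error(x: np.ndarray, x_hat: np.ndarray) -> float:
    """Mean squared error: penalises large deviations heavily."""
    return float(np.mean((x - x_hat) ** 2))

def absolute_error(x: np.ndarray, x_hat: np.ndarray) -> float:
    """Mean absolute error: more forgiving of occasional large deviations."""
    return float(np.mean(np.abs(x - x_hat)))
```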
We want to find the encoder that uses the fewest bits while keeping distortion acceptable. This is a tradeoff between two things - rate $R$ (bits used) and distortion $D$ (quality lost) - and we can express it directly as a single objective using a Lagrange multiplier $\lambda$:
$$ \begin{align*} \mathcal{L} &= R + \lambda D \\ &= I(X; \hat{X}) + \lambda \mathbb{E}[d(x, \hat{x})] \end{align*} $$
| Symbol | Name | Description |
|---|---|---|
| $R = I(X; \hat{X})$ | Rate | the mutual information between source $X$ and reconstruction $\hat{X}$, i.e. the bitrate |
| $D = \mathbb{E}[d(x, \hat{x})]$ | Distortion | the expected distortion, averaged over all inputs |
| $\lambda$ | Lagrange multiplier | controls weighting between rate and distortion |
Sweeping $\lambda$ from $0$ to $\infty$ and solving for the optimal encoder at each value traces out the rate-distortion curve $R(D)$ - the theoretical best rate at every distortion level:
Any point on the curve is the best possible; real encoders sit above the curve, and each improvement brings them closer to it.
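For a Gaussian source with squared-error distortion the curve has a well-known closed form, $R(D) = \max\!\left(0, \tfrac{1}{2}\log_2 \tfrac{\sigma^2}{D}\right)$, which makes it easy to sketch numerically:

```python
import numpy as np

def rate_gaussian(D: np.ndarray, sigma2: float = 1.0) -> np.ndarray:
    """R(D) for a Gaussian source (variance sigma2) under squared-error distortion."""
    return np.maximum(0.0, 0.5 * np.log2(sigma2 / D))

D = np.linspace(0.01, 1.2, 120)
R = rate_gaussian(D)
# R drops to 0 once D >= sigma^2: at that distortion you can just transmit
# the mean (zero bits) and still hit the target.
```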
Short primer if you haven't seen statistical mechanics before: it studies systems with many possible states, each with some energy $E$. Heat constantly moves the system between states, so rather than sitting in one state it spends time in many - lower-energy states more often, higher-energy states less often.
The Boltzmann distribution describes exactly how much time is spent in each state:
$$ p(\text{state}) = \frac{1}{Z} e^{-\beta E(\text{state})} $$
where $\beta$ is the inverse temperature (large $\beta$ = cold, small $\beta$ = hot) and $Z = \sum_{\text{states}} e^{-\beta E}$ is a normalising constant called the partition function.
The system is minimising the free energy $F$ where
$$ F = E - \frac{1}{\beta}S $$
This equation shows our tradeoff: minimise energy (settle into low-energy states) versus maximise entropy $S$ (spread out across many states). The parameter $\beta$ (inverse temperature) controls how nature balances the two.
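A tiny numerical sketch (three states with arbitrarily chosen energies) shows how $\beta$ interpolates between "spread out over everything" and "sit in the lowest-energy state":

```python
import numpy as np

def boltzmann(energies: np.ndarray, beta: float) -> np.ndarray:
    """p(state) = exp(-beta * E) / Z."""
    weights = np.exp(-beta * energies)
    return weights / weights.sum()   # dividing by Z, the partition function

E = np.array([0.0, 1.0, 2.0])        # three states, arbitrary energies
for beta in (0.1, 1.0, 10.0):
    print(beta, boltzmann(E, beta).round(3))
# hot (beta = 0.1): nearly uniform; cold (beta = 10): almost all mass on E = 0
```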
Comparing this with our rate-distortion Lagrangian:
$$ \mathcal{L} = R + \lambda D $$
It has the same structure - minimise distortion (energy) versus minimise rate (negative entropy) - and the parameter $\lambda$, like $\beta$, controls how much weight each term gets.
TLDR: information entropy ($H = -\sum p \log p$) and thermodynamic entropy ($S = k \log W$) are both counts of configurations - bit sequences vs physical microstates. We take the $\log$ because total uncertainty in independent systems adds ($\log W_1 + \log W_2$) whilst independent probabilities and state counts multiply ($W_1 \cdot W_2$).
Since the Lagrangian has a $q \log q$ term (from mutual information) and a linear $q \cdot d$ term (from distortion), setting the derivative to zero and isolating $\log q$ gives $\log q = \text{stuff}$, and exponentiating both sides gives $q = e^{\text{stuff}}$. Hence we get the Boltzmann distribution.
Not-the-TLDR:
The Lagrangian is defined as
$$ \begin{align*} \mathcal{L} &= R + \lambda D \\ &= I(X;\hat{X}) + \lambda \mathbb{E}[d(x, \hat{x})] \\ &= \sum_{x,\hat{x}} p(x) q(\hat{x} \mid x)\log\frac{q(\hat{x} \mid x)}{q(\hat{x})} + \lambda \sum_{x,\hat{x}} p(x) q(\hat{x} \mid x) d(x,\hat{x}) \\ &= \sum_{x, \hat{x}} p(x) q(\hat{x} | x) \left[ \log \frac{q(\hat{x} | x)}{q(\hat{x})} + \lambda d(x, \hat{x}) \right] \end{align*} $$
where $q(\hat{x}) = \sum_x p(x) q(\hat{x}|x)$ is the marginal over reconstructions.
Solving for the optimal conditional distribution means minimising $\mathcal{L}[q]$ over all valid $q(\hat{x} \mid x)$:
$$ \begin{align*} q^*(\hat{x} \mid x) &= \arg\min_{q(\hat{x} \mid x)} \mathcal{L}[q] \\ \text{subject to } \sum_{\hat{x}} q(\hat{x} \mid x) &= 1 \quad \text{for each } x \end{align*} $$
Enforcing that constraint and setting the derivative equal to 0 gives:
$$ \log \frac{q^*(\hat{x} \mid x)}{q(\hat{x})} = -\lambda d(x, \hat{x}) - \log Z(x) $$
Exponentiating both sides and rearranging gives:
$$ q^*(\hat{x} \mid x) = \frac{q(\hat{x})}{Z(x)} e^{-\lambda d(x, \hat{x})} $$
That's the Boltzmann distribution, with the same partition function
$$ Z(x) = \sum_{\hat{x}} q(\hat{x}) e^{-\lambda d(x, \hat{x})} $$
that enforces $\sum_{\hat{x}} q^*(\hat{x}|x) = 1$.
Notice that $d(x, \hat{x})$ enters the Lagrangian linearly (as a coefficient multiplied by $q(\hat{x} | x)$) so the derivative treats it as a constant. Hence if we change $d$ the "energy landscape" changes, but the Boltzmann form stays the same.
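This fixed point is exactly what the Blahut-Arimoto algorithm exploits: alternate between the Boltzmann update for $q(\hat{x} \mid x)$ and recomputing the marginal $q(\hat{x})$. A minimal sketch (discrete toy source and squared-error distortion, chosen just for illustration):

```python
import numpy as np

def blahut_arimoto(p_x, d, lam, n_iter=200):
    """Alternating updates for the optimal encoder q(x_hat | x).

    p_x : source distribution over |X| symbols
    d   : |X| x |X_hat| distortion matrix
    lam : Lagrange multiplier trading off rate against distortion
    """
    n_x, n_xhat = d.shape
    q_xhat = np.full(n_xhat, 1.0 / n_xhat)      # marginal over reconstructions
    for _ in range(n_iter):
        # Boltzmann-form update: q*(x_hat | x) proportional to q(x_hat) exp(-lam * d)
        q_cond = q_xhat[None, :] * np.exp(-lam * d)
        q_cond /= q_cond.sum(axis=1, keepdims=True)   # Z(x) normalisation
        # recompute the marginal: q(x_hat) = sum_x p(x) q(x_hat | x)
        q_xhat = p_x @ q_cond
    return q_cond, q_xhat

# toy example: 4 source values, 2 reconstruction levels
x, x_hat = np.array([0.0, 1.0, 2.0, 3.0]), np.array([0.5, 2.5])
p_x = np.full(4, 0.25)
d = (x[:, None] - x_hat[None, :]) ** 2
q_cond, q_xhat = blahut_arimoto(p_x, d, lam=2.0)
```

Sweeping `lam` and evaluating the rate and distortion at each fixed point traces out the $R(D)$ curve from earlier.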
Now we can use this knowledge to build intuition about what tweaking certain parameters is doing. I'll use FFmpeg parameters here, but the ideas apply more broadly (this is not an exhaustive list):
| Parameter | Role | Physics Analogy |
|---|---|---|
| bitrate | resource limits | energy |
| QP | quality vs bitrate tradeoff | temperature |
| preset | how hard the encoder searches for a good solution (encode time vs compression efficiency) | time allowed to reach equilibrium |
| partitions | block split size | degrees of freedom |
| deblock | smooth block boundaries | heat diffusion |
Using this model we can view the encoder and its parameters as defining a free-energy functional, with the encoding process being the system "relaxing" towards a minimum free-energy state.
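As a rough illustration, here's how you might trace an empirical rate/quality trade-off by sweeping CRF from Python (a sketch that assumes `ffmpeg` with libx264 is on your PATH and uses a hypothetical `input.mp4` as a stand-in for one of your files; file size is a crude proxy for rate, and you'd plug in your own quality metric):

```python
import os
import subprocess

def encode_with_crf(src: str, crf: int, preset: str = "medium") -> str:
    """Encode src with libx264 at the given CRF; return the output path."""
    out = f"out_crf{crf}.mp4"
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-c:v", "libx264",
         "-preset", preset, "-crf", str(crf), out],
        check=True, capture_output=True,
    )
    return out

# higher CRF -> "hotter" system: smaller files, more distortion tolerated
for crf in (18, 23, 28, 33):
    path = encode_with_crf("input.mp4", crf)   # hypothetical input file
    print(crf, os.path.getsize(path))
```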
You can push the analogy a bit further with rate control parameters as well:
| Setting | Implication | Physics Analogy |
|---|---|---|
| CRF (Constant Rate Factor) | bit rate varies, quality constant with time | canonical ensemble (heat bath to maintain temperature) |
| CBR (Constant Bitrate) | bit rate constant, quality varies with time | microcanonical ensemble (fixed energy which redistributes itself locally) |
| Two-pass encoding | measure content complexity in a first pass, then optimise rate control in the second | measuring the energy landscape to set optimal conditions for relaxation |
This concludes my article, but the fun doesn't stop there (nerdy, I know) - recognising that entropy means the same thing in thermodynamics and information theory gives us a way to build intuition for encoding options as properties of a thermodynamic system.