KL divergence, diffusion, optimal transport [WIP]
The KL divergence (relative entropy) $D(p(x) || q(x))$ quantifies the expected number of extra bits required to code samples from $p(x)$ using a code optimized for $q(x)$ rather than for $p(x)$:
\[D(p(x) || q(x)) = \sum_{x\in X} p(x) \log \frac{p(x)}{q(x)}\]

The relationship between Shannon entropy, cross-entropy, and relative entropy can be written as follows (for a discrete random variable $x$):
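A minimal sketch of this definition in code, assuming two hypothetical discrete distributions `p` and `q` on the same support and working in base 2 so the result is in bits:

```python
import numpy as np

def kl_divergence(p, q):
    """D(p || q) = sum_x p(x) * log2(p(x) / q(x)), in bits.

    Terms with p(x) == 0 contribute nothing, by the convention 0 * log 0 = 0.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

# Hypothetical example: a fair coin coded with a code tuned for a biased coin.
p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl_divergence(p, q))  # ~0.737 extra bits per sample
print(kl_divergence(p, p))  # 0.0 -- no penalty when the code matches p
```

Base 2 is used here to match the "extra bits" interpretation above; the natural logarithm (nats) is equally common.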
\[D(p(x) || q(x)) = H(p(x), q(x)) - H(p(x))\]

where $H(p(x), q(x)) = -\sum_{x\in X} p(x) \log q(x)$ is the cross-entropy and $H(p(x)) = -\sum_{x\in X} p(x) \log p(x)$ is the Shannon entropy.

(TODO)
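A minimal numerical check of this identity, reusing the same hypothetical fair-coin/biased-coin pair (no zero-probability outcomes, so all logarithms are finite):

```python
import numpy as np

p = np.array([0.5, 0.5])   # hypothetical "true" distribution (fair coin)
q = np.array([0.9, 0.1])   # hypothetical coding distribution (biased coin)

# Shannon entropy H(p), cross-entropy H(p, q), and relative entropy D(p || q), in bits.
shannon = -np.sum(p * np.log2(p))
cross = -np.sum(p * np.log2(q))
kl = np.sum(p * np.log2(p / q))

print(shannon)              # 1.0
print(cross)                # ~1.737
print(cross - shannon, kl)  # both ~0.737: D(p || q) = H(p, q) - H(p)
```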