Following the success of Wasserstein GANs and Sinkhorn divergences, optimal transport theory is rapidly becoming an essential theoretical tool for machine learning research. Optimal transport has been used for generative modeling, probabilistic autoencoders, variational inference, reinforcement learning and clustering, among many other things. In other words, if you are a machine learning researcher you need some understanding of optimal transport or you risk being left behind. Unfortunately, optimal transport theory is often presented in heavily mathematical jargon that risks scaring away the non-mathematicians among us. This is a pity, since the parts of optimal transport theory that are most relevant for modern machine learning research are often very intuitive. In this series of posts, I’ll try to provide an intuitive guide to optimal transport methods without relying on complex mathematical concepts. In this first post I will formulate the (Kantorovich) optimal transport problem from both a deterministic and a probabilistic viewpoint. I bet that the curiosity of most of my readers has been drawn to this topic by the celebrated Wasserstein GAN. While I will cover the Wasserstein GAN in a future post of this series, I will conclude this post with an introduction to the dual optimal transport problem and give a simple proof of the Kantorovich-Rubinstein duality, which is the theoretical foundation of this adversarial method.

## Optimal transport problems

Optimal transport problems can be formulated in a very intuitive way. Consider the following example: an online retailer has $N$ storage areas and there are $K$ customers who ordered e-book readers. The n-th storage area ${x}_{n}$ contains ${m}_{n}$ readers while the k-th customer ${y}_{k}$ ordered ${h}_{k}$ readers. The transport cost $c(x,y)$ is the distance between the storage area $x$ and the address of customer $y$. The optimal transport problem consists of finding the least expensive way of moving all the readers from the storage areas to the customers who ordered them. A transportation map $\mathrm{\Gamma}$ is a matrix whose ${\mathrm{\Gamma}}_{nk}$ entry represents the number of e-book readers sent from the n-th storage area to the k-th customer. For consistency, the sum of all the readers leaving the n-th storage area has to equal the total number of readers stored in that area, while the sum of all the readers arriving at a customer’s house has to equal the number of e-book readers she ordered. These are the hard constraints of the transport problem and can be written as follows:

$$\sum _{k}{\mathrm{\Gamma}}_{nk}={m}_{n},$$

$$\sum _{n}{\mathrm{\Gamma}}_{nk}={h}_{k}.$$

The final constraint is that the entries of the matrix have to be non-negative (for obvious reasons). The optimal solution $\hat{\mathrm{\Gamma}}$ is the transportation matrix that minimizes the total cost while respecting the constraints:

$$\hat{\mathrm{\Gamma}}={\text{argmin}}_{\mathrm{\Gamma}}\sum _{n,k}{\mathrm{\Gamma}}_{nk}c({x}_{n},{y}_{k}).$$

In this expression we are assuming that transporting $L$ e-readers from ${x}_{n}$ to ${y}_{k}$ is $L$ times more expensive than transporting one reader. Note that this assumption is not realistic in most real-world transportation problems, since the transportation cost usually does not scale linearly with the number of transported units. Nevertheless, this simplified problem gives rise to a very elegant and useful mathematical theory.
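This discrete problem is a linear program, so it can be solved directly with an off-the-shelf solver. Here is a minimal sketch using SciPy; the storage locations, customer addresses and order counts are made-up numbers for illustration:

```python
import numpy as np
from scipy.optimize import linprog

# Made-up example: 2 storage areas on a line, 3 customers.
m = np.array([5.0, 3.0])                 # readers stored in each area
h = np.array([2.0, 4.0, 2.0])            # readers ordered by each customer
x = np.array([0.0, 4.0])                 # storage area locations
y = np.array([1.0, 2.0, 5.0])            # customer addresses
C = np.abs(x[:, None] - y[None, :])      # cost c(x_n, y_k) = distance

N, K = C.shape
# Equality constraints on the flattened matrix Gamma.ravel():
# the first N rows force the row sums to equal m,
# the last K rows force the column sums to equal h.
A_eq = np.vstack([np.kron(np.eye(N), np.ones(K)),
                  np.kron(np.ones(N), np.eye(K))])
b_eq = np.concatenate([m, h])

res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
Gamma = res.x.reshape(N, K)              # optimal transportation matrix
total_cost = res.fun                     # minimal total transport cost: 12.0
```

The non-negativity constraint is handled by the `bounds` argument, and the two marginal constraints become equality rows of the linear program.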

### Probabilistic formulation

In machine learning and statistics it is often useful to reformulate the optimal transport problem in probabilistic terms. Consider two finite probability spaces $(X,P)$ and $(Y,Q)$ where $X$ and $Y$ are finite sets and $P$ and $Q$ are probability functions assigning a probability to each element of their set. The optimal transport between $P$ and $Q$ is the conditional probability function $\gamma (y|x)$ that minimizes the following loss function:

$${\text{argmin}}_{\gamma}\sum _{n,k}\gamma ({y}_{k}|{x}_{n})P({x}_{n})c({x}_{n},{y}_{k}),$$

subject to the following marginalization constraint:

$$\sum _{n}\gamma ({y}_{k}|{x}_{n})P({x}_{n})=Q({y}_{k}).$$

This simply means that the marginal distribution of the joint probability $\gamma (y|x)P(x)$ is $Q(y)$. In other words, $\gamma ({y}_{k}|{x}_{n})$ transports the distribution $P(x)$ into the distribution $Q(y)$. This transportation can be interpreted as a stochastic function that takes $x$ as input and outputs a $y$ with probability $\gamma (y|x)$. The problem thus consists of finding a stochastic transport that maps the probability distribution $P$ into the probability distribution $Q$ while minimizing the expected transportation cost. It is easy to see that this problem is formally identical to the deterministic problem that I introduced in the previous section: the transportation matrix ${\mathrm{\Gamma}}_{nk}$ is given by $\gamma ({y}_{k}|{x}_{n})P({x}_{n})$. This ensures that the first constraint is automatically fulfilled while the second constraint still needs to be enforced.
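To make the correspondence concrete, here is a tiny NumPy sketch (the distributions are made-up numbers) showing that a plan built as ${\mathrm{\Gamma}}_{nk}=\gamma ({y}_{k}|{x}_{n})P({x}_{n})$ satisfies the first constraint automatically, while its column sums give the target marginal that must be matched to $Q$:

```python
import numpy as np

# Made-up discrete distributions: 2 source points, 3 target points.
P = np.array([0.6, 0.4])                  # P(x_n)
gamma = np.array([[0.5, 0.5, 0.0],        # gamma(y_k | x_n): each row is a
                  [0.0, 0.25, 0.75]])     # conditional distribution over y
Gamma = gamma * P[:, None]                # joint plan Gamma_nk

# Row sums recover P automatically (first constraint), because each
# row of gamma sums to one.  Column sums give the implied marginal
# over y, which the second constraint requires to equal Q.
print(Gamma.sum(axis=1))                  # [0.6 0.4], equal to P
Q_implied = Gamma.sum(axis=0)             # [0.3 0.4 0.3]
```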

### Continuous formulation

It is straightforward to extend the definition of probabilistic optimal transport to continuous probability distributions. This can be done by replacing the probabilities $P(x)$ and $Q(x)$ with the probability densities $p(x)$ and $q(x)$ and the summation with an integration:

$${\text{argmin}}_{\gamma}\int \gamma (y|x)p(x)c(x,y)dxdy.$$

Analogously, the marginalization constraint becomes:

$$\int \gamma (y|x)p(x)dx=q(y).$$

This continuous optimal transport problem is usually introduced in a slightly different (and in my opinion less intuitive) form. I will denote the joint density $\gamma (y|x)p(x)$ as $\gamma (x,y)$. It is easy to see that the problem can be reformulated as follows:

$${\text{arginf}}_{\gamma}\int \gamma (x,y)c(x,y)dxdy,$$

with the two marginalization constraints:

$$\int \gamma (x,y)dx=q(y)$$

and

$$\int \gamma (x,y)dy=p(x).$$

## Optimal transport divergences

In many situations the primary interest is not to obtain the optimal transportation map. Instead, we are often interested in using the optimal transportation cost as a statistical divergence between two probability distributions. A statistical divergence is a function that takes two probability distributions as input and outputs a non-negative number that is zero if and only if the two distributions are identical. Statistical divergences such as the KL divergence are widely used in statistics and machine learning as a way of measuring dissimilarity between two probability distributions. Statistical divergences have a central role in several of the most active areas of statistical machine learning, such as generative modeling and variational Bayesian inference.

### Optimal transport divergences and the Wasserstein distance

An optimal transport divergence is defined as the optimal transportation cost between two probability distributions:

$$O{T}_{c}[p,q]={\text{inf}}_{\gamma}\int \gamma (x,y)c(x,y)dxdy,$$

where the optimization is subject to the usual marginalization constraints. This expression provides a valid divergence as long as the cost is non-negative and $c(x,x)$ vanishes for all values of $x$. Clearly, the properties of an optimal transport divergence depend on its cost function. A common choice is the squared Euclidean distance:

$$c(x,y)={\Vert x-y\Vert}_{2}^{2}.$$

Using the squared Euclidean distance as a cost function, we obtain the famous (squared) 2-Wasserstein distance:

$${\mathcal{W}}_{2}[p,q{]}^{2}={\text{inf}}_{\gamma}\int \gamma (x,y){\Vert x-y\Vert}_{2}^{2}dxdy.$$

The square root of ${\mathcal{W}}_{2}[p,q{]}^{2}$ is a proper metric between probability distributions, as it respects the triangle inequality. Using a proper metric such as the Wasserstein distance instead of other kinds of optimal transport divergences is not crucial for most machine learning applications, but it often simplifies the mathematical treatment. Finally, given an integer $k$, the k-Wasserstein distance is defined as follows:

$${\mathcal{W}}_{k}[p,q{]}^{k}={\text{inf}}_{\gamma}\int \gamma (x,y){\Vert x-y\Vert}_{k}^{k}dxdy,$$

where ${\Vert \cdot \Vert}_{k}$ denotes the ${L}_{k}$ norm.
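In one dimension the optimal plan for these costs simply pairs the quantiles of the two distributions, so for two equally sized and equally weighted samples the k-Wasserstein distance can be computed by sorting both samples and pairing them in order. A minimal sketch (the function name and the toy Gaussian samples are my own):

```python
import numpy as np

def wasserstein_k(a, b, k=2):
    """k-Wasserstein distance between two 1-D empirical distributions
    with the same number of equally weighted samples.  In one dimension
    the optimal coupling pairs sorted samples."""
    a, b = np.sort(a), np.sort(b)
    return np.mean(np.abs(a - b) ** k) ** (1.0 / k)

rng = np.random.default_rng(0)
p_samples = rng.normal(0.0, 1.0, 10_000)   # samples from N(0, 1)
q_samples = rng.normal(3.0, 1.0, 10_000)   # samples from N(3, 1)

# For two Gaussians with equal variance, W2 equals the distance
# between the means, so this estimate should be close to 3.
w2 = wasserstein_k(p_samples, q_samples, k=2)
```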

## The dual problem and the Wasserstein GAN

Optimal transport problems are a special case of linear programming problems, since both the function to be optimized and the constraints are linear in the transportation map. The theory behind linear programming dates back to the beginning of the last century and is one of the cornerstones of mathematical optimization. One of the most fundamental results of linear programming is that any linear problem has a dual problem whose solution provides a bound on the solution of the original (primal) problem: for a minimization problem such as optimal transport, the dual is a maximization problem whose optimal value is a lower bound. Fortunately, it turns out that in the case of optimal transport the solution of the dual problem does not simply provide a bound but is indeed identical to the solution of the primal problem. Furthermore, the dual formulation of the optimal transport problem is the starting point for adversarial algorithms and the Wasserstein GAN. The dual formulation of an optimal transport divergence is given by the following formula:

$$O{T}_{c}[p,q]={\text{sup}}_{f\in {L}_{c}}\left[\int f(x)p(x)dx-\int f(y)q(y)dy\right],$$

where ${L}_{c}$ is the set of functions whose growth is bounded by $c$:

$${L}_{c}=\left\{f:\mathbb{R}\to \mathbb{R}\mid f(x)-f(y)\le c(x,y)\ \text{for all}\ x,y\right\}.$$

It is far from obvious why this expression is equivalent to the primal expression that I gave in the previous sections, and I will spend the rest of the post proving this result. However, the formula in itself has a rather intuitive interpretation. Clearly, if $p$ is equal to $q$, the difference between the expected values of any function $f$ under the two distributions will be zero and consequently the divergence will vanish. Now assume that $p$ and $q$ differ in some region of their domain. In this case the divergence is obtained by finding the function $f$ that maximizes the difference between the two expected values. In other words, $f$ acts like a feature detector that extracts the features that maximally differentiate $p$ from $q$. For example, imagine that $p$ is a distribution over landscape images without traces of human activity while $q$ is a distribution over landscape images with a plane in the sky. In this case, the optimal $f$ will be a plane detector. From this example you can see how $f$ plays the role of a discriminator in the Wasserstein GAN. Note that without any constraint on $f$, any small difference between the distributions could be magnified arbitrarily and the divergence would be infinite.
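In the discrete case both the primal and the dual are ordinary linear programs, so the duality can be checked numerically. The following sketch uses made-up distributions on three shared support points and the metric cost $c(x,y)=|x-y|$, solves both problems with SciPy, and confirms that they give the same value:

```python
import numpy as np
from scipy.optimize import linprog

# Made-up distributions on shared support points 0, 1, 2,
# with the metric cost c(x, y) = |x - y|.
pts = np.array([0.0, 1.0, 2.0])
p = np.array([0.5, 0.5, 0.0])
q = np.array([0.25, 0.25, 0.5])
C = np.abs(pts[:, None] - pts[None, :])
n = len(pts)

# Primal: minimize <Gamma, C> subject to the two marginal constraints.
A_eq = np.vstack([np.kron(np.eye(n), np.ones(n)),    # row sums = p
                  np.kron(np.ones(n), np.eye(n))])   # column sums = q
primal = linprog(C.ravel(), A_eq=A_eq,
                 b_eq=np.concatenate([p, q]), bounds=(0, None)).fun

# Dual: maximize sum_i f_i (p_i - q_i) over f with f_i - f_j <= C_ij
# (the discrete version of f in L_c).  linprog minimizes, so the
# objective is negated; the potentials f_i are unbounded.
A_ub = (np.kron(np.eye(n), np.ones((n, 1)))          # +1 at f_i
        - np.kron(np.ones((n, 1)), np.eye(n)))       # -1 at f_j
res = linprog(-(p - q), A_ub=A_ub, b_ub=C.ravel(), bounds=(None, None))
dual = -res.fun

# Both values equal the optimal transport cost, here 0.75.
```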

### Proving the duality

In order to prove the duality, we need to reformulate the constrained optimization in the primal problem as an unconstrained one. Consider the following optimization:

$${\text{sup}}_{f}\left[\int f({x}^{\prime})p({x}^{\prime})d{x}^{\prime}-\int f(x)\gamma (x,y)dxdy\right],$$

where $f$ can be any function. The first term is the expectation of $f$ under $p$ while the second term is the expectation of $f$ under the marginal distribution $\int \gamma (x,y)dy$. This expression is clearly zero for all possible $f$ if the marginal constraint over $p$ is met, since the two terms will be identical. However, if the constraint is not met, the values of $f$ can be chosen to be arbitrarily large for the values of $x$ where the two marginals differ and the result of the optimization will be infinity. Therefore, adding two terms of this form to the loss function of our optimization problem will not change the problem when the constraints are met, but it will exclude all the possible solutions that do not satisfy the constraints. Also note that the first term ($\int f({x}^{\prime})p({x}^{\prime})d{x}^{\prime}$) can be moved inside the expectation integral, since the expectation integral of a constant is the constant itself:

$${\text{sup}}_{f}\left[\int f({x}^{\prime})p({x}^{\prime})d{x}^{\prime}-\int f(x)\gamma (x,y)dxdy\right]={\text{sup}}_{f}\int \left[\int f({x}^{\prime})p({x}^{\prime})d{x}^{\prime}-f(x)\right]\gamma (x,y)dxdy.$$

We can now write a modified loss function that incorporates the constraints:

$$O{T}_{c}[p,q]={\text{inf}}_{\gamma}\left[\int \gamma (x,y)c(x,y)dxdy+{\text{sup}}_{f}\mathcal{L}[f,\gamma ]\right],$$

where

$$\mathcal{L}[f,\gamma ]=\int \left[\left(\int f({x}^{\prime})p({x}^{\prime})d{x}^{\prime}-f(x)\right)-\left(\int f({y}^{\prime})q({y}^{\prime})d{y}^{\prime}-f(y)\right)\right]\gamma (x,y)dxdy.$$

The next step is to exchange the order of the infimum and the supremum. This can be done using Sion’s minimax theorem, since the loss function is linear in both $f$ and $\gamma $:

$$O{T}_{c}[p,q]={\text{sup}}_{f}{\text{inf}}_{\gamma}\left[\int \gamma (x,y)c(x,y)dxdy+\mathcal{L}[f,\gamma ]\right]$$

$$={\text{sup}}_{f}\left[\int f({x}^{\prime})p({x}^{\prime})d{x}^{\prime}-\int f({y}^{\prime})q({y}^{\prime})d{y}^{\prime}+{\text{inf}}_{\gamma}\int l(x,y)\gamma (x,y)dxdy\right],$$

where

$$l(x,y)=c(x,y)-(f(x)-f(y)).$$

We are almost there! The infimum over $\gamma$ on the right-hand side of this expression can be converted into a constraint on $f$. In fact, if $l(x,y)\ge 0$ for all $x$ and $y$, then the infimum is zero and is reached by assigning the whole probability density to the $x=y$ subspace. Conversely, if there is a region where $l(x,y)<0$, the objective can be made arbitrarily negative by assigning an arbitrarily large amount of density to that region, so such an $f$ cannot contribute to the supremum. The supremum over $f$ is therefore restricted to the functions satisfying $f(x)-f(y)\le c(x,y)$, which is exactly the set ${L}_{c}$, and we arrive at the dual formulation of the optimal transport problem.
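To make the last step concrete, here is a small numerical illustration (the two test functions are made-up examples) of why the constraint selects exactly the functions in ${L}_{c}$. With the cost $c(x,y)=|x-y|$, $l(x,y)$ is non-negative everywhere precisely when $f$ is 1-Lipschitz:

```python
import numpy as np

# Evaluate l(x, y) = c(x, y) - (f(x) - f(y)) on a grid,
# with the metric cost c(x, y) = |x - y|.
xs = np.linspace(-3.0, 3.0, 201)
X, Y = np.meshgrid(xs, xs, indexing="ij")
c = np.abs(X - Y)

def l_matrix(f):
    return c - (f(X) - f(Y))

def f_bad(t):
    return t ** 2          # not 1-Lipschitz on [-3, 3]

# np.sin is 1-Lipschitz, so l >= 0 everywhere and f stays feasible;
# for f_bad, l < 0 somewhere and the infimum over gamma diverges.
ok_min = l_matrix(np.sin).min()    # >= 0 (zero on the diagonal x = y)
bad_min = l_matrix(f_bad).min()    # strictly negative
```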