Maximum likelihood estimation is about finding model parameters that maximize the likelihood of the data. KL divergence measures how much one probability distribution differs from another. So what do these have in common? This:

$$\argmin_\theta D_{\text{KL}}(p \parallel q) = \argmax_\theta q(\mathcal{D}|\theta)$$

Let's take a look at the definition of the KL divergence.

$$ D_\text{KL} (p \parallel q) = \mathbb{E}_{x \sim p} \left[ \log \frac{p(x)}{q(x)}\right] $$

Here, $p$ is the underlying data distribution. We never have direct access to it,
but we want to approximate it with our model $q$, which has parameters $\theta$. So
when we're fitting $q$ we want to choose $\theta$ such that this divergence is
*minimized*.
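As a quick translation of the definition into code, here's a minimal sketch for discrete distributions on a shared support (the example probabilities are made up for illustration):

```python
import math

def kl_divergence(p, q):
    # D_KL(p || q) = sum_x p(x) * log(p(x) / q(x)) for discrete p, q,
    # given as lists of probabilities over the same support.
    # Terms with p(x) = 0 contribute nothing (0 * log 0 := 0).
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.5, 0.3, 0.2]
q = [0.25, 0.25, 0.5]

kl_divergence(p, p)  # 0.0: a distribution doesn't diverge from itself
kl_divergence(p, q)  # > 0, and generally not equal to kl_divergence(q, p)
```

Note that the divergence is not symmetric: $D_\text{KL}(p \parallel q)$ and $D_\text{KL}(q \parallel p)$ are generally different quantities, which is why the direction in the formula above matters.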

$$ \argmin_\theta D_\text{KL} (p \parallel q) = \argmin_\theta \mathbb{E}_{x \sim p} \left[ \log \frac{p(x)}{q(x)}\right] $$

Now, because we're only after the $\argmin$, we can simplify a bit. The term $\mathbb{E}_{x \sim p}\left[\log p(x)\right]$ doesn't depend on the model parameters $\theta$ at all, so it's a constant within the optimization problem and we can drop it.

$$\begin{aligned} \argmin_\theta D_\text{KL} (p \parallel q) &= \argmin_\theta \mathbb{E}_{x \sim p} \left[ \log \frac{p(x)}{q(x)}\right]\\ &= \argmin_\theta \mathbb{E}_{x \sim p} \left[ \log p(x) - \log q(x)\right]\\ &= \argmin_\theta \mathbb{E}_{x \sim p} \left[ - \log q(x)\right]\\ &= \argmin_\theta - \mathbb{E}_{x \sim p} \left[ \log q(x)\right]\\ \end{aligned}$$Now: finding the minimum of some function is the same as finding the maximum of that function flipped upside down, right? So we can use the minus here to flip that $\argmin$ to an $\argmax$.

$$\argmax_\theta \mathbb{E}_{x \sim p} \left[ \log q(x)\right]$$

Beautiful. And this looks familiar, too! Let's write the expectation out.

$$\argmax_\theta \mathbb{E}_{x \sim p} \left[ \log q(x)\right] \approx \argmax_\theta \frac{1}{N} \sum_{i=1}^N \log q(x_i|\theta)$$We can't evaluate this expectation exactly, because we don't know $p$. But our dataset $\mathcal{D} = \{x_1, \dots, x_N\}$ consists of samples drawn from $p$, so we can approximate the expectation with the empirical average. The constant factor $\frac{1}{N}$ doesn't change the $\argmax$, and a sum of logs is the log of a product. Hah, this is maximizing the log likelihood!

$$\argmax_\theta \sum_{i=1}^N \log q(x_i|\theta) = \argmax_\theta \log \prod_{i=1}^N q(x_i|\theta) = \argmax_\theta \log q(\mathcal{D}|\theta)$$

And maximizing the log likelihood gives you the same $\theta$ as maximizing the likelihood itself.

$$\argmax_\theta \log q(\mathcal{D}|\theta) = \argmax_\theta q(\mathcal{D}|\theta)$$
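A quick numerical sanity check of this maximization, as a sketch assuming a unit-variance Gaussian model and synthetic data: the parameter (here, the mean $\mu$) that maximizes the log likelihood over a grid should land on the sample mean, the well-known closed-form MLE for a Gaussian mean.

```python
import math
import random

random.seed(0)
# synthetic data drawn from an assumed "true" distribution: N(3, 1)
data = [random.gauss(3.0, 1.0) for _ in range(1000)]

def avg_log_likelihood(mu, xs):
    # average log-density of a unit-variance Gaussian model with mean mu
    return sum(-0.5 * math.log(2 * math.pi) - 0.5 * (x - mu) ** 2 for x in xs) / len(xs)

# crude grid search over candidate means: 2.00, 2.01, ..., 4.00
candidates = [i / 100 for i in range(200, 401)]
best_mu = max(candidates, key=lambda mu: avg_log_likelihood(mu, data))

# the grid maximizer matches the closed-form MLE (the sample mean)
# up to the grid resolution
sample_mean = sum(data) / len(data)
```

The averaging by `len(xs)` is exactly the $\frac{1}{N}$ factor from before: it doesn't move the maximizer, it just keeps the numbers well-scaled.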

And there we have it. Doing maximum likelihood estimation is the same as minimizing the KL divergence from the true distribution to our model, with the expectation over $p$ approximated by our samples.

$$\argmin_\theta D_{\text{KL}}(p \parallel q) = \argmax_\theta q(\mathcal{D}|\theta)$$
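The whole chain can be checked end to end on a toy example. This is a sketch with a made-up Bernoulli setup: the model is $\text{Bernoulli}(\theta)$, and the data's empirical distribution is constructed to match the "true" coin exactly, so both criteria should pick out the same $\theta$ on a shared grid.

```python
import math

def kl(p, q):
    # D_KL(p || q) for discrete distributions given as probability lists
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

def log_likelihood(theta, heads, tails):
    # log likelihood of a Bernoulli(theta) model given observed counts
    return heads * math.log(theta) + tails * math.log(1 - theta)

# "true" distribution p over {heads, tails}: a biased coin (made up for illustration)
p = [0.7, 0.3]
# a dataset whose empirical distribution matches p exactly: 700 heads, 300 tails
heads, tails = 700, 300

thetas = [i / 1000 for i in range(1, 1000)]
theta_min_kl = min(thetas, key=lambda t: kl(p, [t, 1 - t]))
theta_mle = max(thetas, key=lambda t: log_likelihood(t, heads, tails))

# both criteria select the same parameter, theta = 0.7
```

Minimizing the divergence and maximizing the likelihood walk down to the same spot, which is the whole point of the derivation above.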