Maximum Likelihood Estimation is Minimizing KL Divergence
Maximum likelihood estimation is about finding model parameters that maximize
the likelihood of the data. KL divergence measures how much one probability
distribution diverges from another. So what do these have in common? This:
$$\underset{\theta}{\operatorname{argmin}}\; D_{\text{KL}}(p \parallel q) = \underset{\theta}{\operatorname{argmax}}\; p(\mathcal{D} \mid \theta)$$
Let's take a look at the definition of the KL divergence.
$$D_{\text{KL}}(p \parallel q) = \mathbb{E}_{x \sim p}\left[\log \frac{p(x)}{q(x)}\right]$$
(There's a good video that nicely explains the KL divergence.)
Here, $p$ is the underlying distribution. We never truly have access to it,
but we want to approximate it with our model $q$, which has parameters $\theta$. So
when we're fitting $q$, we want to set $\theta$ such that this divergence is
minimized.
$$\underset{\theta}{\operatorname{argmin}}\; D_{\text{KL}}(p \parallel q) = \underset{\theta}{\operatorname{argmin}}\; \mathbb{E}_{x \sim p}\left[\log \frac{p(x)}{q(x)}\right]$$
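To make the quantity we're minimizing concrete, here's a minimal NumPy sketch of the definition for discrete distributions. The arrays `p` and `q` below are made-up examples, not anything from the derivation:

```python
import numpy as np

# D_KL(p || q) for discrete distributions, straight from the definition:
# E_{x~p}[log(p(x)/q(x))] = sum_x p(x) * log(p(x)/q(x))
def kl_divergence(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log(p / q))

p = np.array([0.5, 0.3, 0.2])  # stand-in for the "true" distribution
q = np.array([0.4, 0.4, 0.2])  # stand-in for our model

print(kl_divergence(p, q))  # a small positive number
print(kl_divergence(p, p))  # 0.0: a distribution doesn't diverge from itself
```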
Now, because we only care about the argmin, we can simplify a bit.
Writing the log of the ratio as a difference of logs exposes a term that
doesn't matter within the optimization problem: $\mathbb{E}_{x \sim p}[\log p(x)]$. It
doesn't depend on the model parameters $\theta$ at all, so we can drop it.
$$\begin{aligned}
\underset{\theta}{\operatorname{argmin}}\; D_{\text{KL}}(p \parallel q)
&= \underset{\theta}{\operatorname{argmin}}\; \mathbb{E}_{x \sim p}\left[\log \frac{p(x)}{q(x)}\right] \\
&= \underset{\theta}{\operatorname{argmin}}\; \mathbb{E}_{x \sim p}\left[\log p(x) - \log q(x)\right] \\
&= \underset{\theta}{\operatorname{argmin}}\; \mathbb{E}_{x \sim p}\left[-\log q(x)\right] \\
&= \underset{\theta}{\operatorname{argmin}}\; \left(-\mathbb{E}_{x \sim p}\left[\log q(x)\right]\right)
\end{aligned}$$
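A quick numeric check that dropping the $\log p(x)$ term is safe: the KL divergence and the cross-entropy $-\mathbb{E}_{x \sim p}[\log q(x)]$ differ by exactly the entropy of $p$, which is the same no matter which $q$ (and hence which $\theta$) we pick. This sketch reuses the made-up discrete distributions from above:

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])  # the fixed underlying distribution

def kl(q):
    return np.sum(p * np.log(p / q))

def cross_entropy(q):
    # -E_{x~p}[log q(x)], the objective left after dropping log p(x)
    return -np.sum(p * np.log(q))

for q in (np.array([0.4, 0.4, 0.2]), np.array([0.6, 0.2, 0.2])):
    # The gap is the entropy of p -- identical for every q.
    print(kl(q), cross_entropy(q), cross_entropy(q) - kl(q))
```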
Now: finding the minimum of some function is the same as finding the maximum of
that function flipped upside down, right? So we can use the minus here to flip
that argmin to an argmax.
$$\underset{\theta}{\operatorname{argmax}}\; \mathbb{E}_{x \sim p}\left[\log q(x)\right]$$
Beautiful. And this looks familiar, too! We can't evaluate the expectation over $p$ exactly, but we can approximate it with the samples from $p$ that we do have: our dataset $\mathcal{D} = \{x_1, \dots, x_N\}$. The constant factor $1/N$ in that average doesn't affect the argmax, and since $q$ is just our model with parameters $\theta$, we can write $q(x_i) = p(x_i \mid \theta)$:
$$\underset{\theta}{\operatorname{argmax}}\; \mathbb{E}_{x \sim p}\left[\log q(x)\right] \approx \underset{\theta}{\operatorname{argmax}}\; \frac{1}{N}\sum_{i=1}^{N} \log p(x_i \mid \theta) = \underset{\theta}{\operatorname{argmax}}\; \log \prod_{i=1}^{N} p(x_i \mid \theta)$$
Hah, this is maximizing the log likelihood!
$$\underset{\theta}{\operatorname{argmax}}\; \log \prod_{i=1}^{N} p(x_i \mid \theta) = \underset{\theta}{\operatorname{argmax}}\; \log p(\mathcal{D} \mid \theta)$$
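Here's a sketch of that step with a hypothetical model: $p$ is a standard Gaussian and $q$ is a Gaussian $\mathcal{N}(\theta, 1)$. The sample average of $\log q(x_i)$ is a Monte Carlo estimate of the expectation, and it's exactly the log likelihood divided by $N$:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=100_000)  # x_i ~ p = N(0, 1)

def log_q(x, theta):
    # log density of the model q = N(theta, 1)
    return -0.5 * np.log(2 * np.pi) - 0.5 * (x - theta) ** 2

theta = 0.3
avg_log_q = np.mean(log_q(data, theta))      # Monte Carlo estimate of E_p[log q(x)]
log_likelihood = np.sum(log_q(data, theta))  # log p(D | theta)
print(avg_log_q, log_likelihood / len(data)) # equal: same objective up to 1/N
```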
And because the logarithm is strictly increasing, maximizing the log likelihood
gives you the same $\theta$ as maximizing the likelihood itself.
$$\underset{\theta}{\operatorname{argmax}}\; \log p(\mathcal{D} \mid \theta) = \underset{\theta}{\operatorname{argmax}}\; p(\mathcal{D} \mid \theta)$$
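A toy check that the log changes the value of the objective but not where its maximum sits, using a hypothetical Bernoulli dataset (7 heads, 3 tails) over a grid of $\theta$ values:

```python
import numpy as np

theta_grid = np.linspace(0.01, 0.99, 99)
heads, tails = 7, 3
likelihood = theta_grid**heads * (1 - theta_grid)**tails

# log is strictly increasing, so the argmax is preserved.
print(theta_grid[np.argmax(likelihood)])          # 0.7, the MLE
print(theta_grid[np.argmax(np.log(likelihood))])  # the same 0.7
```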
And there we have it. Doing maximum likelihood estimation is the same as
minimizing KL divergence.
$$\underset{\theta}{\operatorname{argmin}}\; D_{\text{KL}}(p \parallel q) = \underset{\theta}{\operatorname{argmax}}\; p(\mathcal{D} \mid \theta)$$
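To close the loop, here's an end-to-end sketch under an assumed setup: fitting the mean $\theta$ of $q = \mathcal{N}(\theta, 1)$ to samples from $p = \mathcal{N}(2, 1)$. Minimizing the sample estimate of the $\theta$-dependent part of the KL divergence and maximizing the likelihood recover the same $\theta$:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(42)
data = rng.normal(loc=2.0, scale=1.0, size=10_000)  # x_i ~ p = N(2, 1)

def neg_log_likelihood(theta):
    # -log p(D | theta), dropping the constant Gaussian normalizer
    return 0.5 * np.sum((data - theta) ** 2)

def kl_estimate(theta):
    # Sample estimate of -E_{x~p}[log q(x)], the theta-dependent part of
    # D_KL(p || q); the E_p[log p(x)] term is the constant we dropped earlier.
    return np.mean(0.5 * (data - theta) ** 2)

theta_mle = minimize_scalar(neg_log_likelihood).x
theta_kl = minimize_scalar(kl_estimate).x
print(theta_mle, theta_kl)  # both ~2.0: the same theta, as derived
```

This is also why training with a negative log likelihood (cross-entropy) loss can be read as pushing the model's distribution toward the data distribution in the KL sense.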