Principle Component Analysis

If your data has too many features you may find that your model overfits, gets stuck in local optima (depending on your model) or just takes too long to converge. In these cases, or when you’re simply looking to visualize high-dimensional data, you have a few options to reduce the number of features:

  • Feature selection: select the features you want to keep and drop the rest.

    • Forward Feature Selection: start with an empty set. Add features one by one and check which ones add the most in terms of the validiation score.

    • Backward Feature Selection: start with all the features. Drop them one by one and see which ones contribute the most to your model.

  • Feature Extraction: combine certain features.

Feature selection is fairly straightforward – you just delete some columns from your data set. But let’s approach it in a mathematical way. Take some data point \(x_1\) from distribution \(\mathbf{x}\) with two features, with values \(4\) and \(3\):

\[x_1 = \begin{bmatrix}4\\3\end{bmatrix}\]

If we decided that feature 1 is not useful for whatever reason and we want to throw it away, we create a matrix \(\mathbf{w}\) such that when applying it to any data point from \(\mathbf{x}\), feature 1 disappears.

\[\mathbf{w} = \begin{bmatrix}0\\1\end{bmatrix}\qquad \mathbf{w}^T x_1 = \begin{bmatrix}3\end{bmatrix}\]

It’s easy to see what \(\mathbf{w}\) would look like if you wanted to throw away feature 2. This idea can be scaled up to any number of features. For example, here’s how to select feature 1 and 3 from a data point with 4 features:

\[x_1 = \begin{bmatrix}4\\3\\5\\9\end{bmatrix}\qquad \mathbf{w} = \begin{bmatrix} 1 & 0\\ 0 & 0\\ 0 & 1\\ 0 & 0 \end{bmatrix}\qquad \mathbf{w}^T x_1 = \begin{bmatrix}4\\5\end{bmatrix}\]

What if, instead of completely throwing away a feature, you designed \(\mathbf{w}\) such that the components of the resulting vector are not exact copies of one of the original features, but a combination of more of them?

\[x_1 = \begin{bmatrix}4\\3\\5\\9\end{bmatrix}\qquad \mathbf{w} = \begin{bmatrix} \frac{1}{2} & 0\\ \frac{1}{2} & 0\\ 0 & \frac{1}{2}\\ 0 & \frac{1}{2} \end{bmatrix}\qquad \mathbf{w}^T x_1 = \begin{bmatrix}3 \frac{1}{2}\\7\end{bmatrix}\]

We’ve now stepped into feature extraction, where we’re creating new features that are linear combinations of old ones. Now, if we’re looking into these kinds of operations we need some sensible way of deciding on the components of \(\mathbf{w}\). Principle Component Analysis is one such strategy.

Principle Component Analysis (PCA)

In PCA, the metric of a good \(\mathbf{w}\) is one that maximizes the variance (\(\sigma^2\)) 1 of the data after applying it. This makes sense if you think about it in the context of feature reduction; seeing how the data points are spread out over a feature is pretty much the whole point of having that feature in your data set. So even if you’ve decided you can deal with some loss of that information as a trade-off to achieve a dimensionality reduction, it would still be nice to retain as much variance as possible.

Now, if we really wanted to maximize variance we should start by multiplying all the data points by infinity because the variance would scale quadratically 2. To nip this in the bud we add an extra constraint: the columns of \(\mathbf{w}\) must have a length of \(1\); they need to be unit vectors.

There’s another problem. Say we have many data points about houses. Two of the features are the number of rooms in the house and its surface area in square meters, and we want to combine these into one, new feature that says something along the lines of how useful the house is. The problem that arises is that the surface area ranges between 20 and 200, whereas the number of rooms ranges between 1 and 5. If we try to maximize the variance after applying our \(\mathbf{w}\) projection, the surface area feature would contribute way more to that than the room count feature, simply because its variance was larger to begin with.

This problem can be solved by applying z-normalisation. After applying this to each feature, the means will be 0 and the standard deviations will be 1.

\[x_i = \frac{x_i - \mu}{\sigma}\]

Alright! We’ve normalized the data and we know what we’re looking for in a good \(\mathbf{w}\). Via math™ it turns out what we are looking is an eigenvector of the normalized data’s covariance matrix (\(\mathbf{\Sigma}\)). This is a bit of a tough one to intuit, and I’m just going to stick the math in a footnote if you’re into that 3.

You can actually distill more new features from that covariance matrix if you want to retain more of the original variance in your new data set. The columns of the change of basis matrix \(\mathbf{W}\) are made up out of each next eigenvector by descending order of its corresponding eigenvalue:

\[\mathbf{W} = \begin{bmatrix}&&\\&&\\e_1&e_2&\cdots\\&&\\&&\end{bmatrix}\]

The way to get your new features’ vectors (\(\mathbf{z}\)) is always:

\[\mathbf{z} = \mathbf{W}^T \mathbf{x}\]



The variance (\(\sigma^2\)) for a distribution \(X\) with \(N\) samples and mean \(\mu\) is:

\[\frac{\sum_i^N (x_i - \mu)^2}{N}\]

Check this video from Khanh Academy for an example.


For any random variable \(X\): \(\mathrm{Var}(aX+ b) = a^2 \mathrm{Var}(X)\)


Work out that \(\mathrm{Var}(\mathbf{w}^T\mathbf{x}) = \mathbf{w}^T \mathbf{\Sigma} \mathbf{w}\). For principle component 1, see that solving the partial derivative of this result with the Lagrange multiplier \(\mathbf{w_1}^T \mathbf{w_1} - 1\) with respect to \(\mathbf{w_1}\) for \(0\) gives \(\mathbf{\Sigma} \mathbf{w_1} = \lambda \mathbf{w_1}\). This is exactly the equation that defines an eigenvector! Boom.