Principal Component Analysis (PCA) from Scratch
This page also exists as an executable Google Colab Notebook.
(c) Scott H. Hawley, Dec 21, 2019
The world doesn't need yet another PCA tutorial, just like the world doesn't need another silly love song. But sometimes you still get the urge to write your own.
Principal Component Analysis (PCA) is a data-reduction technique that finds application in a wide variety of fields, including biology, sociology, physics, medicine, and audio processing. PCA may be used as a "front end" processing step that feeds into additional layers of machine learning, or it may be used by itself, for example when doing data visualization. It is so useful and ubiquitous that it is worth learning not only what it is for and what it is, but how to actually do it.
In this interactive worksheet, we work through how to perform PCA on a few different datasets, writing our own code as we go.
Other (better?) treatments
My treatment here was written several months after viewing...
- the excellent demo page at setosa.io
- this quick 1m30s video of a teapot,
- this great StatsQuest video
- this lecture from Andrew Ng's course
Put simply, PCA involves making a coordinate transformation (i.e., a rotation) from the arbitrary axes (or "features") you started with to a set of axes 'aligned with the data itself.' Doing this almost always means that you can get rid of a few of these 'components' of data that have small variance without suffering much in the way of accuracy, while saving yourself a ton of computation.
Once you "get it," you'll find that PCA is almost no big deal; it would hardly merit the fuss, if it weren't so darn useful!
We'll define the following terms as we go, but here's the process in a nutshell:
- Covariance: Find the covariance matrix for your dataset
- Eigenvectors: Find the eigenvectors of that matrix (these are the "components" btw)
- Ordering: Sort the eigenvectors/'dimensions' from biggest to smallest variance
- Projection / Data reduction: Use the eigenvectors corresponding to the largest variance to project the dataset into a reduced-dimensional space
- (Check: How much did we lose by that truncation?)
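The steps above can be sketched in plain NumPy. This is a minimal illustration of the recipe, not the code we'll build up below; the function name `pca_reduce` and the toy dataset are my own, and the example deliberately makes one data dimension nearly redundant so that truncation loses little:

```python
import numpy as np

def pca_reduce(X, k):
    """Project data X (n_samples, n_features) onto its top-k principal components."""
    Xc = X - X.mean(axis=0)                   # 0. center the data
    C = np.cov(Xc, rowvar=False)              # 1. covariance matrix
    evals, evecs = np.linalg.eigh(C)          # 2. eigenvectors (eigh: C is symmetric)
    order = np.argsort(evals)[::-1]           # 3. sort by variance, biggest first
    evals, evecs = evals[order], evecs[:, order]
    X_reduced = Xc @ evecs[:, :k]             # 4. project onto top-k components
    retained = evals[:k].sum() / evals.sum()  # 5. check: fraction of variance kept
    return X_reduced, retained

# Toy data: 3 features, but the third is almost a copy of the first
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[:, 2] = 0.9 * X[:, 0] + 0.1 * rng.normal(size=200)

X2, frac = pca_reduce(X, k=2)
print(X2.shape, round(frac, 3))  # most of the variance survives in 2 dimensions
```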
Since PCA only involves linear transformations, there are some situations where it won't help, but it's handy enough that it's usually worth giving it a shot!
If you've got two data dimensions and they vary together, then they are co-variant.
Example: Two-dimensional data that's somewhat co-linear:
```python
import numpy as np
import matplotlib.pyplot as plt
import plotly.graph_objects as go

N = 100
x = np.random.normal(size=N)
y = 0.5*x + 0.2*(np.random.normal(size=N))

fig = go.Figure(data=[go.Scatter(x=x, y=y, mode='markers',
                                 marker=dict(size=8, opacity=0.5),
                                 name="data")])
fig.update_layout(
    xaxis_title="x", yaxis_title="y",
    yaxis=dict(scaleanchor="x", scaleratio=1)  # keep x and y at equal scale
)
fig.show()
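To put a number on that co-variance, we can compute the covariance matrix of this kind of data directly with `np.cov`. A quick sketch (the seeded generator is my own addition, for reproducibility):

```python
import numpy as np

# Regenerate data of the same form as the scatter plot above
rng = np.random.default_rng(42)
N = 100
x = rng.normal(size=N)
y = 0.5*x + 0.2*rng.normal(size=N)

# np.cov expects variables along rows by default; rowvar=False
# says each column of `data` is a variable (x or y)
data = np.stack([x, y], axis=1)   # shape (N, 2)
cov = np.cov(data, rowvar=False)  # 2x2 covariance matrix
print(cov)
```

The off-diagonal entries are nonzero (and positive) because x and y vary together; if they were independent, those entries would be near zero.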