PCA From Scratch

Writing Principal Component Analysis

This page also exists as an executable Google Colab Notebook.

Splash image by the author. (c) Scott H. Hawley, Dec 21, 2019

The world doesn't need yet another PCA tutorial, just like the world doesn't need another silly love song. But sometimes you still get the urge to write your own.

Relevance

Principal Component Analysis (PCA) is a data-reduction technique that finds application in a wide variety of fields, including biology, sociology, physics, medicine, and audio processing. PCA may be used as a "front end" processing step that feeds into additional layers of machine learning, or it may be used by itself, for example when doing data visualization. It is so useful and ubiquitous that it is worth learning not only what it is for and what it is, but how to actually do it.

In this interactive worksheet, we work through how to perform PCA on a few different datasets, writing our own code as we go.

Other (better?) treatments

My treatment here was written several months after viewing...

Basic Idea

Put simply, PCA involves making a coordinate transformation (i.e., a rotation) from the arbitrary axes (or "features") you started with to a set of axes 'aligned with the data itself.' Doing this almost always means that you can get rid of a few of those 'components' of the data that have small variance without suffering much in the way of accuracy, while saving yourself a ton of computation.

Once you "get it," you'll find that PCA would be almost no big deal, if it weren't for the fact that it's so darn useful!

We'll define the following terms as we go, but here's the process in a nutshell (sketched in code right after this list):

  1. Covariance: Find the covariance matrix for your dataset
  2. Eigenvectors: Find the eigenvectors of that matrix (these are the "components" btw)
  3. Ordering: Sort the eigenvectors/'dimensions' from biggest to smallest variance
  4. Projection / Data reduction: Use the eigenvectors corresponding to the largest variance to project the dataset into a reduced-dimensional space
  5. (Check: How much did we lose by that truncation?)
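
Before working through each step in detail, here's a minimal NumPy sketch of that whole pipeline in one place. The function name pca_reduce, the data matrix X (one sample per row), and the number of kept components k are placeholders for illustration, not code from the rest of this worksheet:

import numpy as np

def pca_reduce(X, k):
    Xc = X - X.mean(axis=0)              # center the data
    C = np.cov(Xc, rowvar=False)         # 1. covariance matrix
    lambdas, vecs = np.linalg.eigh(C)    # 2. eigenvalues & eigenvectors (C is symmetric)
    order = np.argsort(lambdas)[::-1]    # 3. sort components by decreasing variance
    W = vecs[:, order[:k]]               # keep the top-k eigenvectors
    X_red = Xc @ W                       # 4. project onto the reduced space
    kept = lambdas[order[:k]].sum() / lambdas.sum()  # 5. fraction of variance retained
    return X_red, kept

The rest of this worksheet essentially unpacks, motivates, and tests each of those lines.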

Caveats

Since PCA only involves making linear transformations, there are some situations where it won't help, but... it's handy enough that it's usually worth giving it a shot!

Covariance

If you've got two data dimensions and they vary together, then they are co-variant.

Example: Two-dimensional data that's somewhat co-linear:

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import plotly.graph_objects as go

# Generate N points that roughly follow the line y = 0.5*x, plus a little noise
N = 100
x = np.random.normal(size=N)
y = 0.5*x + 0.2*np.random.normal(size=N)

# Scatter plot, with the y axis locked to the same scale as the x axis
fig = go.Figure(data=[go.Scatter(x=x, y=y, mode='markers',
                marker=dict(size=8, opacity=0.5), name="data")])
fig.update_layout(xaxis_title="x", yaxis_title="y",
    yaxis=dict(scaleanchor="x", scaleratio=1))
fig.show()
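
To see that co-variance numerically, we can ask NumPy for the covariance matrix of this data directly. This is just a quick check (the exact numbers will differ from run to run, since the data above is random):

# np.cov treats each row of its input as one variable by default,
# so we stack x and y as two rows.
C = np.cov(np.stack([x, y]))
print(C)   # diagonal = variances of x and y; off-diagonal = their covariance

The nonzero off-diagonal entries are the signature of x and y varying together.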