Principal Component Analysis (PCA)

Steps:

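  1. For each dimension (feature), compute the mean
  2. Plot the point made of those means (the center of the data)
  3. Shift the data so that this center sits at the origin
    1. Shifting moves every point by the same amount, so the points keep their positions relative to each other
  4. Fit a line through the origin that minimizes the sum of squared distances from the points to the line (equivalently, maximizes the sum of squared distances from the projected points to the origin)
  5. That line is PC1
  6. For each next principal component PCi, fit the best line through the origin that is perpendicular to PC1...PC(i-1)
  7. Project the D-dimensional data onto the principal components you keep (usually 2 or 3)
  8. Rotate the axes so that PC1 is horizontal (aligned with the X axis)
    1. The projected points rotate along with the axes
  9. That is the final PCA plot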
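
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `pca`, its parameters, and the toy data are assumptions made for the example, and it uses a singular value decomposition (SVD) to get all the perpendicular best-fit lines at once instead of fitting them one by one.

```python
import numpy as np

def pca(X, n_components=2):
    """Sketch of the steps above for an (n_samples, n_features) array X."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)        # steps 1-2: the mean of each dimension
    centered = X - mean          # step 3: move the center of the data to the origin
    # Steps 4-6: the rows of Vt are mutually perpendicular unit vectors.
    # The first row (PC1) is the best-fitting line through the origin,
    # the second row is the best line perpendicular to it, and so on.
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    components = Vt[:n_components]
    # Steps 7-8: projecting onto the components gives coordinates in which
    # PC1 is the horizontal axis and PC2 the vertical axis.
    scores = centered @ components.T
    return scores, components, mean

# Hypothetical toy data: 6 samples, 2 features
X = np.array([[10.0, 6.0], [11.0, 4.0], [8.0, 5.0],
              [3.0, 3.0], [2.0, 2.8], [1.0, 1.0]])
scores, components, mean = pca(X, n_components=2)
```

Here `scores[:, 0]` holds the position of each sample along PC1 and `scores[:, 1]` its position along PC2, so plotting the two columns against each other gives the final PCA plot.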

Variation for Principal Component

Variation for PCi = (Sum of Squared Distances from the projected points to the origin along PCi) / (n - 1), where n is the number of samples
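
As a rough sketch, the variation of each component can be computed by projecting the centered data onto the components and dividing the sum of squared distances by n - 1. The toy data and variable names are assumptions for illustration, and SVD is used here only as a convenient way to obtain the components.

```python
import numpy as np

# Hypothetical toy data: 6 samples, 2 features
X = np.array([[10.0, 6.0], [11.0, 4.0], [8.0, 5.0],
              [3.0, 3.0], [2.0, 2.8], [1.0, 1.0]])

centered = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(centered, full_matrices=False)

# Each column holds the signed distances of the projected points
# from the origin along one principal component.
scores = centered @ Vt.T

n = X.shape[0]
variation = (scores ** 2).sum(axis=0) / (n - 1)  # variation for PC1, PC2, ...
share = variation / variation.sum()              # fraction of total variation per PC
```

The values in `share` show how much of the total variation each component accounts for, which helps when deciding how many components to keep.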

Practical Tips:

  1. Put all variables on the same scale, e.g., rescale each to [0, 1] (Normalization) or to unit variance (Standardization)
  2. Make sure the data is centered on the origin (if not done by the library)

Why do we need to do data normalization in PCA?

If we don't normalize the data before running PCA, variables with large variance can dominate the principal components simply because of their units. For example, if one feature is a weight recorded in grams and another is a weight recorded in kg, the feature in grams may dominate PC1 because its numeric variance is much larger than that of the feature in kg.
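
A small sketch of this effect, using made-up data: two correlated weights, one recorded in grams and one in kg. The data, the seed, and the helper `first_component` are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical data: two correlated weights, one in grams and one in kg
base = rng.normal(size=200)
grams = 70_000.0 + 10_000.0 * (base + 0.3 * rng.normal(size=200))
kg = 70.0 + 10.0 * (base + 0.3 * rng.normal(size=200))
X = np.column_stack([grams, kg])

def first_component(A):
    """Return PC1 (a unit vector) of the data after centering it."""
    centered = A - A.mean(axis=0)
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    return Vt[0]

# Without normalization, PC1 points almost entirely along the grams axis.
pc1_raw = first_component(X)

# Min-max normalization puts both features on [0, 1].
X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# After normalization, both features contribute comparably to PC1.
pc1_scaled = first_component(X_scaled)
```

Comparing `pc1_raw` with `pc1_scaled` shows the difference: in the raw data the kg feature has almost no weight in PC1, even though it carries about as much information as the grams feature.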

Why is it required to center the data in PCA?

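PCA looks for the directions of greatest variance, and variance is measured around the mean. Centering shifts the data so that every feature has a mean of 0, which lets the principal components pass through the center of the data. If we don't center the data, the first component can end up pointing from the origin toward the mean of the data instead of along the direction of greatest variance.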
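
The following sketch illustrates the point with made-up data that is spread mostly along the X axis but sits far from the origin. The data, the seed, and the helper `first_direction` are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical data: wide spread along X, narrow spread along Y,
# with the center of the data at roughly (0, 50).
X = np.column_stack([rng.normal(0.0, 5.0, size=200),
                     rng.normal(50.0, 1.0, size=200)])

def first_direction(A):
    """Best-fitting line through the origin for the rows of A (unit vector)."""
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    return Vt[0]

# Not centered: the line points from the origin toward the mean (the Y axis).
pc1_uncentered = first_direction(X)

# Centered: the line follows the direction of greatest variance (the X axis).
pc1_centered = first_direction(X - X.mean(axis=0))
```

Only the centered version recovers the X axis, which is where the data actually varies the most.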
