Principal Component Analysis

The singular value decomposition of data matrix

The Principal Component Analysis (PCA) is a technique used in many fields: data science, signal processing, mechanics, etc. As a student of machine learning, I should take sometime to at least review this technique; and maybe ICA too, in some future posts.

Reducing dimensionality

With a data science mind set, the key idea of PCA is to reduce the dimensionality of the data while retaining as much variation as possible in the result. Personally, I think of PCA as projecting the “cloud” of data points to a “flat” surface. More technically, PCA is particularly useful when we have a large number of variables. In such situation, we might want to look at the data from a point of view where one direction capture the most variance (the data spread out the most). A picture from setosa illustrates this idea:

Under the transformation, our data now have large variance on pc1 and small variance on pc2. The data now can be represented only on pc1 without much information loss.

Let \(\mathbf{x}\) be a vector containing \(p\) random variables, we define the principal components of \(\mathbf{x}\) as follow:

  1. Find \(\alpha_1 \in R^p\) such that:
\[\left\{ \begin{array}{l} \left\lVert\alpha_1\right\rVert_2 = 1 \\ z_1 = \alpha_1 \mathbf{x} = \sum_{j=1}^p \alpha_{1j} \mathbf{x}_j \text{ has the largest variance.} %_ \end{array} \right.\]
  1. Next, find \(\alpha_2 \in R^p\) such that:
\[\left\{ \begin{array}{l} \left\lVert\alpha_2\right\rVert_2 = 1 \\ z_2 = \alpha_2 \mathbf{x} = \sum_{j=1}^p \alpha_{2j} \mathbf{x}_j \text{ has the largest variance, } z_2 \text{ is uncorrelated with } z_1 %_ \end{array} \right.\]
  1. Continue doing so, we can define \(\alpha_3; \alpha_4;... ;\alpha_k (k < p)\) to satisfy the condition above.

Main theorem

Let \(\Sigma\) be the covariance matrix of \(\mathbf{x}\), then \(\alpha_1; \alpha_2;... ;\alpha_k\) are respectively eigenvectors of \(\Sigma\) corresponding with eigenvalues \(\lambda_1; ...; \lambda_k\) (s.t. \(\lambda_1 > ... > \lambda_k\)) and \(V(z_i) = V(\alpha_i \mathbf{x}) = \lambda_i\).


PCA is widely used when it comes to data as it gives a general view of the dataset. It is known as the singular value decomposition of data matrix \(\mathbf{X}\), or the eigenvalue decomposition of \(\mathbf{X}^\top\mathbf{X}\) (main theorem).

Take a simple approach to image recognition as an example. If we consider each pixel of a given image is a random variable, then we can compute the covariance matrix. By choosing k-largest eigenvalues and their corresponding eigenvectors, we can have the image in a new space. Interestingly, we can choose k as small as we want, resulting in a compact representation of images. In the newly defined space, every image is represented as a vector. These vectors can be used for similarity comparison. Such approach to image recognition is naïve and not effective in many cases. However, it gives a baseline and an example of the PCA technique.