# Principal Component Analysis

The singular value decomposition of data matrix

The Principal Component Analysis (PCA) is a technique used in many fields: data science, signal processing, mechanics, etc. As a student of machine learning, I should take sometime to at least review this technique; and maybe ICA too, in some future posts.

## Reducing dimensionality

With a data science mind set, the key idea of PCA is to reduce the dimensionality of the data while retaining as much variation as possible in the result. Personally, I think of PCA as projecting the “cloud” of data points to a “flat” surface. More technically, PCA is particularly useful when we have a large number of variables. In such situation, we might want to look at the data from a point of view where one direction capture the most variance (the data spread out the most). A picture from setosa illustrates this idea: Under the transformation, our data now have large variance on pc1 and small variance on pc2. The data now can be represented only on pc1 without much information loss.

Let $$\mathbf{x}$$ be a vector containing $$p$$ random variables, we define the principal components of $$\mathbf{x}$$ as follow:

1. Find $$\alpha_1 \in R^p$$ such that:
$\left\{ \begin{array}{l} \left\lVert\alpha_1\right\rVert_2 = 1 \\ z_1 = \alpha_1 \mathbf{x} = \sum_{j=1}^p \alpha_{1j} \mathbf{x}_j \text{ has the largest variance.} %_ \end{array} \right.$
1. Next, find $$\alpha_2 \in R^p$$ such that:
$\left\{ \begin{array}{l} \left\lVert\alpha_2\right\rVert_2 = 1 \\ z_2 = \alpha_2 \mathbf{x} = \sum_{j=1}^p \alpha_{2j} \mathbf{x}_j \text{ has the largest variance, } z_2 \text{ is uncorrelated with } z_1 %_ \end{array} \right.$
1. Continue doing so, we can define $$\alpha_3; \alpha_4;... ;\alpha_k (k < p)$$ to satisfy the condition above.

## Main theorem

Let $$\Sigma$$ be the covariance matrix of $$\mathbf{x}$$, then $$\alpha_1; \alpha_2;... ;\alpha_k$$ are respectively eigenvectors of $$\Sigma$$ corresponding with eigenvalues $$\lambda_1; ...; \lambda_k$$ (s.t. $$\lambda_1 > ... > \lambda_k$$) and $$V(z_i) = V(\alpha_i \mathbf{x}) = \lambda_i$$.

## Applications

PCA is widely used when it comes to data as it gives a general view of the dataset. It is known as the singular value decomposition of data matrix $$\mathbf{X}$$, or the eigenvalue decomposition of $$\mathbf{X}^\top\mathbf{X}$$ (main theorem).

Take a simple approach to image recognition as an example. If we consider each pixel of a given image is a random variable, then we can compute the covariance matrix. By choosing k-largest eigenvalues and their corresponding eigenvectors, we can have the image in a new space. Interestingly, we can choose k as small as we want, resulting in a compact representation of images. In the newly defined space, every image is represented as a vector. These vectors can be used for similarity comparison. Such approach to image recognition is naïve and not effective in many cases. However, it gives a baseline and an example of the PCA technique.