Sunday, October 24, 2010

Principal Component Analysis Explained

Principal Component Analysis (PCA) is a mathematical procedure for transforming a set of correlated variables into a smaller number of uncorrelated variables. Basically, PCA decorrelates the feature space and orders the resulting dimensions by decreasing variance. We can then reduce the dimensionality by keeping only the N dimensions that account for most of the variance in the original data. PCA is optimal in the sense that, among linear projections to a lower dimension, it retains the maximum variance of the original data.
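
To see the decorrelation concretely, here is a minimal NumPy sketch (with two artificially correlated features, not the Iris data) showing that the covariance matrix of the PCA-projected data comes out essentially diagonal:

import numpy as np

# toy data: two strongly correlated features
rng = np.random.RandomState(0)
x = rng.randn(200)
X = np.column_stack([x, 2 * x + 0.1 * rng.randn(200)])

# PCA via the eigendecomposition of the covariance matrix
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc.T))
Y = Xc.dot(eigvecs)

print(np.cov(X.T))  # large off-diagonal entries: correlated
print(np.cov(Y.T))  # off-diagonal entries near zero: decorrelated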

A high-dimensional feature space causes many problems (such as over-fitting) in machine-learning algorithms. In such cases, PCA can be a handy tool for reducing the dimension of the feature space and addressing the "curse of dimensionality" problem.

I have implemented a PCA algorithm in Python, using NumPy for computation and Pylab for visualization. The source code shown below applies it to the Iris dataset.
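
Each record in iris.data is four comma-separated measurements (sepal length, sepal width, petal length, petal width) followed by the class label, for example:

5.1,3.5,1.4,0.2,Iris-setosa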

The low-dimensional representation of the Iris dataset produced by PCA is shown below.

[Figure: two-dimensional PCA projection of the three Iris classes]

Source Code:

from numpy import *
from pylab import *

def pca(X, reduced_dimension=None):
    #Principal Component Analysis
    #Input:
    #  X : matrix of training data, one sample per row
    #  reduced_dimension : number of principal components to keep
    #Output: projection (transformation) matrix
    samples, dim = X.shape
    #zero-mean, unit-variance normalization
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    #singular value decomposition; the rows of V are the
    #eigenvectors of X.T X, ordered by decreasing variance
    U, S, V = linalg.svd(X)
    if reduced_dimension:
        V = V[:reduced_dimension]
    return V.T

def normalize(X):
    #the same zero-mean, unit-variance scaling used inside pca(),
    #so points are projected in the space the components were computed in
    return (X - X.mean(axis=0)) / X.std(axis=0)

#read the dataset (the Iris dataset in this case)
f = open("iris.data", "r")

#one list of samples per iris class
setosa = []
versicolor = []
virginica = []

#read data from file
for line in f.readlines():
    val = line.strip().split(",")
    if len(val) < 5:
        continue  #skip blank lines
    iris_type = val[-1]
    values = [double(i) for i in val[:-1]]
    if iris_type == "Iris-setosa":
        setosa.append(values)
    elif iris_type == "Iris-versicolor":
        versicolor.append(values)
    elif iris_type == "Iris-virginica":
        virginica.append(values)
f.close()

#compute the principal-component projection matrix for each class;
#the 2 means the matrix projects into two dimensions
setosa_pc_projection_matrix = pca(array(setosa), 2)
versicolor_pc_projection_matrix = pca(array(versicolor), 2)
virginica_pc_projection_matrix = pca(array(virginica), 2)

#project the (normalized) iris data points down to two dimensions
low_dim_setosa_points = dot(normalize(array(setosa)), setosa_pc_projection_matrix)
low_dim_versicolor_points = dot(normalize(array(versicolor)), versicolor_pc_projection_matrix)
low_dim_virginica_points = dot(normalize(array(virginica)), virginica_pc_projection_matrix)

#plot the projected points, one colour and marker per class
plot(low_dim_setosa_points[:, 0], low_dim_setosa_points[:, 1], "ro", label="Iris setosa")
plot(low_dim_versicolor_points[:, 0], low_dim_versicolor_points[:, 1], "g*", label="Iris versicolor")
plot(low_dim_virginica_points[:, 0], low_dim_virginica_points[:, 1], "b^", label="Iris virginica")
title("PCA based dimension reduction")
xlabel("X1")
ylabel("X2")
legend(loc="upper left")
show()
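
The comment in pca(), that the rows of V from the SVD are the eigenvectors of X.T X, can be checked directly. The following minimal sketch (my addition, using random data rather than the Iris set) compares the SVD against an explicit eigendecomposition; the squared singular values also match the eigenvalues:

from numpy import *

#random, normalized data standing in for one of the iris classes
random.seed(0)
X = random.randn(50, 4)
X = (X - X.mean(axis=0)) / X.std(axis=0)

U, S, V = linalg.svd(X, full_matrices=False)
eigenvalues, eigenvectors = linalg.eigh(dot(X.T, X))

#eigh sorts eigenvalues in ascending order, while the SVD sorts
#singular values in descending order, so reverse before comparing
eigenvectors = eigenvectors[:, ::-1]
for i in range(4):
    #eigenvectors are only defined up to sign
    assert allclose(abs(V[i]), abs(eigenvectors[:, i]))
#the squared singular values are the eigenvalues of X.T X
assert allclose(S ** 2, eigenvalues[::-1])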

Download source and data here.