Principal component analysis

Principal Component Analysis (PCA) is a statistical procedure that transforms a data set into a new set of linearly uncorrelated variables, known as principal components. The components are ordered so that each one captures as much of the remaining variance (information) in the data as possible.
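To make "linearly uncorrelated" concrete, here is a minimal sketch. It assumes the built-in mtcars data set as a stand-in (it is not the cars.csv file used in this article), and the names demo_data, demo_pca and score_cor are chosen only for illustration.

```r
# Stand-in data (assumption): four numeric columns of the built-in mtcars
# data set, not the cars.csv file analyzed in this article.
demo_data <- mtcars[, c("mpg", "disp", "hp", "wt")]

# The original variables are correlated with each other.
round(cor(demo_data), 2)

# After PCA, the component scores are pairwise uncorrelated:
# the correlation matrix of the scores is the identity matrix.
demo_pca <- prcomp(demo_data, center = TRUE, scale. = TRUE)
score_cor <- cor(demo_pca$x)
round(score_cor, 2)
```

The second correlation matrix should show ones on the diagonal and zeros elsewhere, up to rounding.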

Data Set

In our experiment we used the cars data set from https://perso.telecom-paristech.fr/eagan/class/igr204/datasets . It contains 406 observations of 9 variables.

Step 1: Data Cleansing

We stored the data set in the local variable cars. We removed the 1st and 9th columns because PCA cannot handle character data.

There is no need for imputation with the mice package, as md.pattern() reports that the data contain no missing values.

cars<-read.csv("~/R/cars.csv", sep=";")
cars_new<-cars[-c(1,9)]
str(cars_new)
## 'data.frame':    406 obs. of  7 variables:
##  $ MPG         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ Cylinders   : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ Displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ Horsepower  : num  130 165 150 150 140 198 220 215 225 190 ...
##  $ Weight      : num  3504 3693 3436 3433 3449 ...
##  $ Acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ Model       : int  70 70 70 70 70 70 70 70 70 70 ...
library(mice)
## Loading required package: lattice
## 
## Attaching package: 'mice'
## The following objects are masked from 'package:base':
## 
##     cbind, rbind
md.pattern(cars_new)
##  /\     /\
## {  `---'  }
## {  O   O  }
## ==>  V <==  No need for mice. This data set is completely observed.
##  \  \|/  /
##   `-----'

##     MPG Cylinders Displacement Horsepower Weight Acceleration Model  
## 406   1         1            1          1      1            1     1 0
##       0         0            0          0      0            0     0 0
head(cars_new,10)
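
The cleaning step above can also be sketched in base R, without mice. This example uses a hypothetical toy data frame (toy, with one deliberately missing value) rather than the real file, and selects numeric columns by type instead of by position, which may be more robust if the column order changes.

```r
# Hypothetical toy data frame standing in for the raw file: one character
# column, two numeric columns, and one deliberately missing value.
toy <- data.frame(
  Car    = c("A", "B", "C"),
  MPG    = c(18, NA, 16),
  Weight = c(3504, 3693, 3436),
  stringsAsFactors = FALSE
)

# Keep only the numeric columns, selected by type rather than by position.
toy_num <- toy[, sapply(toy, is.numeric), drop = FALSE]

# Count missing values per column; all zeros would mean nothing to impute.
na_counts <- colSums(is.na(toy_num))
na_counts

# A single TRUE/FALSE answer for the whole data frame.
has_missing <- any(is.na(toy_num))
has_missing
```

For cars_new, the same colSums(is.na(...)) call would be expected to return zeros for every column, in line with the md.pattern() output.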

Step 2: Use prcomp() function

prcomp() performs a principal components analysis on the given data matrix and returns the results as an object of class prcomp. Our data contain 7 variables, which is hard to analyze directly and would make a model built on all of them complex; PCA lets us work with a smaller number of components instead.

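# scale() already centers each column and scales it to unit variance,
# so center = TRUE is redundant here but harmless.
cars.pca <- prcomp(scale(cars_new), center = TRUE)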

The calculation is done by a singular value decomposition of the data matrix, not by using eigen on the covariance matrix. This is generally the preferred method for numerical accuracy. The print method for these objects prints the results in a nice format and the plot method produces a scree plot.
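
The two routes mentioned above give the same standard deviations, which the following sketch checks numerically. It assumes the built-in mtcars data as a stand-in for cars_new, and the variable names are illustrative only.

```r
# Stand-in data (assumption): standardized numeric columns of built-in mtcars.
X <- scale(mtcars[, c("mpg", "disp", "hp", "wt")])
n <- nrow(X)

# Route 1: singular value decomposition of the (centered) data matrix.
# Singular values divided by sqrt(n - 1) are the component standard deviations.
sdev_svd <- svd(X)$d / sqrt(n - 1)

# Route 2: eigendecomposition of the covariance matrix.
# Square roots of the eigenvalues are the component standard deviations.
sdev_eigen <- sqrt(eigen(cov(X), symmetric = TRUE)$values)

# Reference: prcomp() itself.
sdev_prcomp <- prcomp(X)$sdev

# All three agree up to floating-point error.
svd_vs_eigen  <- all.equal(sdev_svd, sdev_eigen)
svd_vs_prcomp <- all.equal(sdev_svd, sdev_prcomp)
svd_vs_eigen
svd_vs_prcomp
```

The SVD route avoids forming the covariance matrix explicitly, which is the usual reason given for its better numerical accuracy.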

prcomp returns a list with class “prcomp” containing the following components:

sdev: the standard deviations of the principal components.

rotation: the matrix of variable loadings. The function princomp returns this in the element loadings.

x: if retx is true, the value of the rotated data (the centred, and scaled if requested, data multiplied by the rotation matrix) is returned. Hence, cov(x) is the diagonal matrix diag(sdev^2). For the formula method, napredict() is applied to handle the treatment of values omitted by the na.action.

center, scale: the centering and scaling used, or FALSE.
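
The relationships between these components can be checked directly. The sketch below again assumes the built-in mtcars data as a stand-in; scores_by_hand, scores_match and cov_is_diag are illustrative names.

```r
# Stand-in data (assumption): built-in mtcars, not the article's cars.csv.
demo_data <- mtcars[, c("mpg", "disp", "hp", "wt")]
demo_pca <- prcomp(demo_data, center = TRUE, scale. = TRUE)

# x is the centered and scaled data multiplied by the rotation matrix.
scores_by_hand <- scale(demo_data,
                        center = demo_pca$center,
                        scale  = demo_pca$scale) %*% demo_pca$rotation
scores_match <- all.equal(unname(scores_by_hand), unname(demo_pca$x))
scores_match

# The covariance of the scores is diagonal, with sdev^2 on the diagonal.
cov_is_diag <- all.equal(unname(cov(demo_pca$x)), diag(demo_pca$sdev^2))
cov_is_diag
```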

names(cars.pca)
## [1] "sdev"     "rotation" "center"   "scale"    "x"
summary(cars.pca)
## Importance of components:
##                           PC1    PC2    PC3     PC4    PC5     PC6   PC7
## Standard deviation     2.2241 0.9311 0.8415 0.47434 0.3880 0.26077 0.187
## Proportion of Variance 0.7066 0.1238 0.1012 0.03214 0.0215 0.00971 0.005
## Cumulative Proportion  0.7066 0.8305 0.9316 0.96379 0.9853 0.99500 1.000

From the summary it is clear that PC1 explains about 70.7% of the variance in the data. If we set a cutoff of, say, 90%, the first three components reach a cumulative proportion of 93.2%, so PC1, PC2 and PC3 are our desired principal components.
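
The cutoff rule can be applied programmatically. This sketch uses the proportions of variance reported for this data set; cutoff, cum_var and n_keep are names introduced here for illustration.

```r
# Proportions of variance as reported for this data set.
prop_var <- c(0.706641351, 0.123840832, 0.101164128, 0.032142961,
              0.021500913, 0.009714179, 0.004995635)

cutoff <- 0.90  # the 90% threshold used in the text
cum_var <- cumsum(prop_var)

# Smallest number of components whose cumulative proportion reaches the cutoff.
n_keep <- which(cum_var >= cutoff)[1]
n_keep
```

With a fitted object such as cars.pca, the retained scores would then be cars.pca$x[, 1:n_keep].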

Scree plot

We can also examine the PCA results graphically. The scree plot shows that PC1, PC2 and PC3 have the highest variances, and together they explain more than 90% of the variance.

## [1] "proportions of variance:"
## [1] 0.706641351 0.123840832 0.101164128 0.032142961 0.021500913 0.009714179
## [7] 0.004995635
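
The code that produced the proportions and the plot is not shown above. One possible way to obtain both is sketched here, assuming the built-in mtcars data as a stand-in for cars_new; the original may have used a different approach, such as screeplot() or a plotting package.

```r
# Stand-in data (assumption): built-in mtcars instead of the article's cars.csv.
demo_pca <- prcomp(mtcars[, c("mpg", "cyl", "disp", "hp", "wt", "qsec")],
                   center = TRUE, scale. = TRUE)

# Proportion of variance explained by each component.
prop_var <- demo_pca$sdev^2 / sum(demo_pca$sdev^2)
print("proportions of variance:")
print(prop_var)

# Scree plot: one point per component, joined by lines.
plot(prop_var, type = "b",
     xlab = "Principal component",
     ylab = "Proportion of variance explained",
     main = "Scree plot")
```

Replacing demo_pca with cars.pca should give proportions matching the summary() output for the cars data.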

The End