Principal component analysis - Example 1: Crime Data

In machine learning and statistics, dimensionality reduction or dimension reduction is the process of reducing the number of random variables under consideration, via obtaining a set of principal variables. It can be divided into feature selection and feature extraction.

Principal component analysis (PCA) is routinely employed on a wide range of problems, from the detection of outliers to predictive modeling. PCA projects observations described by p variables onto a few orthogonal components defined where the data ‘stretch’ the most, rendering a simplified overview. It is an unsupervised method, meaning it always looks for the greatest sources of variation regardless of any class or outcome structure in the data.
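
As a quick illustration of the idea (a minimal base-R sketch, using the USArrests data introduced in Step 1 below): the principal components are the eigenvectors of the correlation matrix of the standardized data, and the eigenvalues give the variance captured by each component. The prcomp function used throughout this post computes the same quantities via a singular value decomposition.

X <- scale(USArrests)                    # standardize the four variables
eig <- eigen(cor(USArrests))             # eigenvectors point where the data 'stretch' the most
scores <- X %*% eig$vectors              # observations projected onto those directions
round(eig$values / sum(eig$values), 3)   # share of the total variance per component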

Step 0: Build the pcaCharts function for exploratory analysis of the variance explained

pcaCharts <- function(x) {
    # x is a prcomp object; the variance of each component is its sdev squared
    x.var <- x$sdev ^ 2
    x.pvar <- x.var/sum(x.var)
    print("proportions of variance:")
    print(x.pvar)
    
    # 2 x 2 panel: proportion of variance, cumulative proportion, and two scree plots
    par(mfrow=c(2,2))
    plot(x.pvar,xlab="Principal component", ylab="Proportion of variance explained", ylim=c(0,1), type='b')
    plot(cumsum(x.pvar),xlab="Principal component", ylab="Cumulative Proportion of variance explained", ylim=c(0,1), type='b')
    screeplot(x)
    screeplot(x,type="l")
    par(mfrow=c(1,1))
}

Step 1: Load Data for analysis - Crime Data

data("USArrests")
rawdf <- na.omit(USArrests)

# the USArrests columns are Murder, Assault, UrbanPop and Rape, in that order
names(rawdf) <- c("Murder", "Assault", "UrbanPop", "Rape")
head(rawdf)
##            Murder Assault UrbanPop Rape
## Alabama      13.2     236       58 21.2
## Alaska       10.0     263       48 44.5
## Arizona       8.1     294       80 31.0
## Arkansas      8.8     190       50 19.5
## California    9.0     276       91 40.6
## Colorado      7.9     204       78 38.7

Reminder: principal component analysis (PCA) is a method used to reduce the number of variables in a dataset. Here we will simplify the four-variable crime data down to two variables. This does not mean that we eliminate two variables and keep two; it means that we replace the four original variables with two brand-new ones called “principal components”.

prcomp: performs a principal components analysis on the given data matrix and returns the results as an object of class prcomp.
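
For reference, prcomp can also perform the standardization itself through its center and scale. arguments. The call below (with the illustrative object name arrests.pca.alt) is equivalent to calling scale first, as done in Step 2, and has the small advantage that the original means and standard deviations are stored in the returned object.

arrests.pca.alt <- prcomp(USArrests, center = TRUE, scale. = TRUE)
arrests.pca.alt$sdev   # same component standard deviations as the scale()-based call in Step 2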

Step 2: Standardize the data with scale and apply the prcomp function

arrests.pca <- prcomp(scale(USArrests),center = TRUE)
#Checking the output of the PCA. A prcomp object contains the component standard deviations (sdev), the loadings (rotation), the centering and scaling used, and the rotated data (x).
names(arrests.pca)
## [1] "sdev"     "rotation" "center"   "scale"    "x"
print(arrests.pca)
## Standard deviations (1, .., p=4):
## [1] 1.5748783 0.9948694 0.5971291 0.4164494
## 
## Rotation (n x k) = (4 x 4):
##                 PC1        PC2        PC3         PC4
## Murder   -0.5358995  0.4181809 -0.3412327  0.64922780
## Assault  -0.5831836  0.1879856 -0.2681484 -0.74340748
## UrbanPop -0.2781909 -0.8728062 -0.3780158  0.13387773
## Rape     -0.5434321 -0.1673186  0.8177779  0.08902432
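
To make the “brand new variables” concrete, here is a small sketch showing that the PC1 score of a state is simply a weighted sum of its standardized values, with the weights taken from the first column of the rotation matrix above:

z <- scale(USArrests)                              # the standardized data the PCA was run on
sum(z["Alabama", ] * arrests.pca$rotation[, 1])    # PC1 score for Alabama computed by hand
arrests.pca$x["Alabama", 1]                        # the same score as stored by prcomp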

Step 3: Choose the principal components with highest variances

Now that R has computed 4 new variables (“principal components”), you can choose the two (or one, or three) principal components with the highest variances.

This step identifies how much of the variance in the dataset each individual principal component covers. The summary() function or a scree plot can be used to examine the variance explained.

summary(arrests.pca)
## Importance of components:
##                           PC1    PC2     PC3     PC4
## Standard deviation     1.5749 0.9949 0.59713 0.41645
## Proportion of Variance 0.6201 0.2474 0.08914 0.04336
## Cumulative Proportion  0.6201 0.8675 0.95664 1.00000

From the summary, we can see that PC1 explains 62% of the variance, PC2 explains 24%, and so on. Usually, the principal components that together explain about 95% of the variance are kept for modeling. The summary also gives the cumulative proportion of variance of the principal components.
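
These numbers can be reproduced directly from the component standard deviations, since the variance of each component is its sdev squared:

pve <- arrests.pca$sdev^2 / sum(arrests.pca$sdev^2)   # proportion of variance explained
round(pve, 4)
round(cumsum(pve), 4)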

It is also helpful to plot the PCA with various types of scree plot. The pcaCharts function declared above produces several forms of scree plot.

Step 4: Visualization of the data in the new reduced dimension

pcaCharts(arrests.pca)
## [1] "proportions of variance:"
## [1] 0.62006039 0.24744129 0.08914080 0.04335752

biplot(arrests.pca,scale=0, cex=.7)

# The sign of each principal component is arbitrary: flipping the signs of both the
# loadings and the scores gives an equivalent solution that is often easier to read.
pca.out <- arrests.pca
pca.out$rotation <- -pca.out$rotation
pca.out$x <- -pca.out$x
biplot(pca.out,scale=0, cex=.7)

pca.out$rotation[,1:2]
##                PC1        PC2
## Murder   0.5358995 -0.4181809
## Assault  0.5831836 -0.1879856
## UrbanPop 0.2781909  0.8728062
## Rape     0.5434321  0.1673186
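
Before moving on, note that a fitted PCA can also place new observations into the same reduced space with predict. Because arrests.pca was fitted on data standardized beforehand with scale, a new observation must first be standardized with the original means and standard deviations. A minimal sketch with made-up values for a hypothetical state:

new_state <- data.frame(Murder = 10, Assault = 200, UrbanPop = 70, Rape = 30)  # made-up values
new_std <- scale(new_state,
                 center = colMeans(USArrests),      # means of the original data
                 scale  = apply(USArrests, 2, sd))  # standard deviations of the original data
predict(arrests.pca, newdata = new_std)             # coordinates of the new point in PC space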

Principal component analysis - Example 2: Wine Data

A wine data set from the UCI Machine Learning Repository will serve as a good second example. These data consist of 13 physicochemical parameters measured on 178 wine samples from three distinct cultivars grown in Italy.

Step 1: Load Data for analysis - Wine Data

# wine.data has no header row; header = FALSE keeps the first sample from being misread as column names
winedf <- read.csv(url("http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data"), header = FALSE)

names(winedf) <- c("Cvs","Alcohol","Malic acid","Ash","Alcalinity of ash", "Magnesium", "Total phenols", "Flavanoids", "Nonflavanoid phenols", "Proanthocyanins", "Color intensity", "Hue", "OD280/OD315 of diluted wines", "Proline")

head(winedf)
##   Cvs Alcohol Malic acid  Ash Alcalinity of ash Magnesium Total phenols
## 1   1   13.20       1.78 2.14              11.2       100          2.65
## 2   1   13.16       2.36 2.67              18.6       101          2.80
## 3   1   14.37       1.95 2.50              16.8       113          3.85
## 4   1   13.24       2.59 2.87              21.0       118          2.80
## 5   1   14.20       1.76 2.45              15.2       112          3.27
## 6   1   14.39       1.87 2.45              14.6        96          2.50
##   Flavanoids Nonflavanoid phenols Proanthocyanins Color intensity  Hue
## 1       2.76                 0.26            1.28            4.38 1.05
## 2       3.24                 0.30            2.81            5.68 1.03
## 3       3.49                 0.24            2.18            7.80 0.86
## 4       2.69                 0.39            1.82            4.32 1.04
## 5       3.39                 0.34            1.97            6.75 1.05
## 6       2.52                 0.30            1.98            5.25 1.02
##   OD280/OD315 of diluted wines Proline
## 1                         3.40    1050
## 2                         3.17    1185
## 3                         3.45    1480
## 4                         2.93     735
## 5                         2.85    1450
## 6                         3.58    1290
summary(winedf)
##       Cvs           Alcohol        Malic acid        Ash       
##  Min.   :1.000   Min.   :11.03   Min.   :0.74   Min.   :1.360  
##  1st Qu.:1.000   1st Qu.:12.36   1st Qu.:1.60   1st Qu.:2.210  
##  Median :2.000   Median :13.05   Median :1.87   Median :2.360  
##  Mean   :1.944   Mean   :12.99   Mean   :2.34   Mean   :2.366  
##  3rd Qu.:3.000   3rd Qu.:13.67   3rd Qu.:3.10   3rd Qu.:2.560  
##  Max.   :3.000   Max.   :14.83   Max.   :5.80   Max.   :3.230  
##  Alcalinity of ash   Magnesium      Total phenols     Flavanoids   
##  Min.   :10.60     Min.   : 70.00   Min.   :0.980   Min.   :0.340  
##  1st Qu.:17.20     1st Qu.: 88.00   1st Qu.:1.740   1st Qu.:1.200  
##  Median :19.50     Median : 98.00   Median :2.350   Median :2.130  
##  Mean   :19.52     Mean   : 99.59   Mean   :2.292   Mean   :2.023  
##  3rd Qu.:21.50     3rd Qu.:107.00   3rd Qu.:2.800   3rd Qu.:2.860  
##  Max.   :30.00     Max.   :162.00   Max.   :3.880   Max.   :5.080  
##  Nonflavanoid phenols Proanthocyanins Color intensity       Hue       
##  Min.   :0.1300       Min.   :0.410   Min.   : 1.280   Min.   :0.480  
##  1st Qu.:0.2700       1st Qu.:1.250   1st Qu.: 3.210   1st Qu.:0.780  
##  Median :0.3400       Median :1.550   Median : 4.680   Median :0.960  
##  Mean   :0.3623       Mean   :1.587   Mean   : 5.055   Mean   :0.957  
##  3rd Qu.:0.4400       3rd Qu.:1.950   3rd Qu.: 6.200   3rd Qu.:1.120  
##  Max.   :0.6600       Max.   :3.580   Max.   :13.000   Max.   :1.710  
##  OD280/OD315 of diluted wines    Proline      
##  Min.   :1.270                Min.   : 278.0  
##  1st Qu.:1.930                1st Qu.: 500.0  
##  Median :2.780                Median : 672.0  
##  Mean   :2.604                Mean   : 745.1  
##  3rd Qu.:3.170                3rd Qu.: 985.0  
##  Max.   :4.000                Max.   :1680.0
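
One thing the summary already makes clear is that the features live on very different scales: Proline ranges in the hundreds while Hue stays near 1. Standardizing before PCA is therefore essential, otherwise Proline would dominate the components. A quick check of the per-feature standard deviations:

round(apply(winedf[, -1], 2, sd), 2)   # spread of each feature on its original scale
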
wineClasses <- factor(winedf$Cvs)

plot(main="Three Different Cultivars",winedf$Alcohol,winedf$'Alcalinity of ash', col = wineClasses)

plot(main="Three Different Cultivars",winedf$`Malic acid`,winedf$Magnesium, col = wineClasses)

plot(main="Three Different Cultivars",winedf$Ash,winedf$Flavanoids, col = wineClasses)

Step 2: Apply the prcomp function

winePCA <- prcomp(scale(winedf[,-1]))

Step 3: Choose the principal components with highest variances

prcomp has now computed 13 new principal components from the 13 original features. For visualization we will keep only the two with the highest variance; these are not 2 of the original 13 features, but 2 brand-new variables built from all of them.

summary(winePCA)
## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5    PC6     PC7
## Standard deviation     2.1628 1.5816 1.2055 0.96148 0.92830 0.8030 0.74295
## Proportion of Variance 0.3598 0.1924 0.1118 0.07111 0.06629 0.0496 0.04246
## Cumulative Proportion  0.3598 0.5522 0.6640 0.73515 0.80144 0.8510 0.89350
##                            PC8     PC9    PC10    PC11    PC12   PC13
## Standard deviation     0.59223 0.53775 0.49680 0.47481 0.41034 0.3224
## Proportion of Variance 0.02698 0.02224 0.01899 0.01734 0.01295 0.0080
## Cumulative Proportion  0.92048 0.94272 0.96171 0.97905 0.99200 1.0000
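
Rather than reading the table by eye, the number of components needed to reach the 95% rule of thumb can be computed directly; from the cumulative proportions above, the threshold is first crossed at the 10th component.

pve <- winePCA$sdev^2 / sum(winePCA$sdev^2)   # proportion of variance explained
which(cumsum(pve) >= 0.95)[1]                 # smallest number of PCs covering 95% of the variance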

Step 4: Visualization of the data in the new reduced dimension

pcaCharts(winePCA)
## [1] "proportions of variance:"
##  [1] 0.359830707 0.192412795 0.111794601 0.071111087 0.066287442
##  [6] 0.049603670 0.042460139 0.026979910 0.022244617 0.018985283
## [11] 0.017341553 0.012952063 0.007996133

biplot(winePCA,scale=0, cex=.7)

wineClasses <- factor(winedf$Cvs)

plot(main="Three Different Cultivars",winePCA$x[,1:2], col = wineClasses)

Compared with the earlier plots of the original features, the three cultivars are now much easier to separate in the space of the first two principal components.
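
As a final touch (a small sketch using base graphics only), the same projection can be drawn with proper axis labels and a legend so the three cultivars are identifiable:

plot(winePCA$x[, 1], winePCA$x[, 2], col = wineClasses, pch = 19,
     xlab = "PC1", ylab = "PC2", main = "Three Different Cultivars")
legend("topright", legend = levels(wineClasses), col = 1:3, pch = 19, title = "Cultivar")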