In machine learning and statistics, dimensionality reduction or dimension reduction is the process of reducing the number of random variables under consideration, via obtaining a set of principal variables. It can be divided into feature selection and feature extraction.
Principal component analysis (PCA) is routinely employed on a wide range of problems. From the detection of outliers to predictive modeling, PCA has the ability of projecting the observations described by p variables into few orthogonal components defined at where the data âstretchâ the most, rendering a simplified overview. It is an unsupervised method, meaning it will always look into the greatest sources of variation regardless of the data structure.
pcaCharts <- function(x) {
x.var <- x$sdev ^ 2
x.pvar <- x.var/sum(x.var)
print("proportions of variance:")
print(x.pvar)
par(mfrow=c(2,2))
plot(x.pvar,xlab="Principal component", ylab="Proportion of variance explained", ylim=c(0,1), type='b')
plot(cumsum(x.pvar),xlab="Principal component", ylab="Cumulative Proportion of variance explained", ylim=c(0,1), type='b')
screeplot(x)
screeplot(x,type="l")
par(mfrow=c(1,1))
}
data("USArrests")
rawdf <- na.omit(USArrests)
names(rawdf)=c("Murder","Assault", "Rape", "UrbanPop")
head(rawdf)
## Murder Assault Rape UrbanPop
## Alabama 13.2 236 58 21.2
## Alaska 10.0 263 48 44.5
## Arizona 8.1 294 80 31.0
## Arkansas 8.8 190 50 19.5
## California 9.0 276 91 40.6
## Colorado 7.9 204 78 38.7
Reminder: Principal Component Analysis (PCA) is a method used to reduce the number of variables in a dataset. Now, we will simplify the data into two-variables data. This does not mean that we are eliminating two variables and keeping two; it means that we are replacing the four variables with two brand new ones called âprincipal componentsâ.
prcomp: Performs a principal components analysis on the given data matrix and returns the results as an object of class
arrests.pca <- prcomp(scale(USArrests),center = TRUE)
#Checking output of pca. prcomp function returns standard deviation (sdev), rotation and loadings.
names(arrests.pca)
## [1] "sdev" "rotation" "center" "scale" "x"
print(arrests.pca)
## Standard deviations (1, .., p=4):
## [1] 1.5748783 0.9948694 0.5971291 0.4164494
##
## Rotation (n x k) = (4 x 4):
## PC1 PC2 PC3 PC4
## Murder -0.5358995 0.4181809 -0.3412327 0.64922780
## Assault -0.5831836 0.1879856 -0.2681484 -0.74340748
## UrbanPop -0.2781909 -0.8728062 -0.3780158 0.13387773
## Rape -0.5434321 -0.1673186 0.8177779 0.08902432
Now that R has computed 4 new variables (âprincipal componentsâ), you can choose the two (or one, or three) principal components with the highest variances.
This step is to identify coverage of variance in dataset by individual principal components. summary() function can be used or screen plot can be used to explain the variance.
summary(arrests.pca)
## Importance of components%s:
## PC1 PC2 PC3 PC4
## Standard deviation 1.5749 0.9949 0.59713 0.41645
## Proportion of Variance 0.6201 0.2474 0.08914 0.04336
## Cumulative Proportion 0.6201 0.8675 0.95664 1.00000
From the the summary, we can undersand PC1 explains 62% of variance and PC2 explains 24% so on. Usually Principal components which explains about 95% variance can be considered for models. Summary also yields cumulative proportion of the principal components.
Best thing is, plot PCA using various types of scree plot. Above declared âpcaChartsâ function invokes various forms of scree plot
pcaCharts(arrests.pca)
## [1] "proportions of variance:"
## [1] 0.62006039 0.24744129 0.08914080 0.04335752
biplot(arrests.pca,scale=0, cex=.7)
pca.out <- arrests.pca
pca.out$rotation <- -pca.out$rotation
pca.out$x <- -pca.out$x
biplot(pca.out,scale=0, cex=.7)
pca.out$rotation[,1:2]
## PC1 PC2
## Murder 0.5358995 -0.4181809
## Assault 0.5831836 -0.1879856
## UrbanPop 0.2781909 0.8728062
## Rape 0.5434321 0.1673186
A wine data set at the UCI Machine Learning Repository will serve as a good starting example, these data consist of 13 physicochemical parameters measured in 178 wine samples from three distinct cultivars grown in Italy.
winedf <- read.csv(url("http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data"),header=T)
names(winedf) <- c("Cvs","Alcohol","Malic acid","Ash","Alcalinity of ash", "Magnesium", "Total phenols", "Flavanoids", "Nonflavanoid phenols", "Proanthocyanins", "Color intensity", "Hue", "OD280/OD315 of diluted wines", "Proline")
head(winedf)
## Cvs Alcohol Malic acid Ash Alcalinity of ash Magnesium Total phenols
## 1 1 13.20 1.78 2.14 11.2 100 2.65
## 2 1 13.16 2.36 2.67 18.6 101 2.80
## 3 1 14.37 1.95 2.50 16.8 113 3.85
## 4 1 13.24 2.59 2.87 21.0 118 2.80
## 5 1 14.20 1.76 2.45 15.2 112 3.27
## 6 1 14.39 1.87 2.45 14.6 96 2.50
## Flavanoids Nonflavanoid phenols Proanthocyanins Color intensity Hue
## 1 2.76 0.26 1.28 4.38 1.05
## 2 3.24 0.30 2.81 5.68 1.03
## 3 3.49 0.24 2.18 7.80 0.86
## 4 2.69 0.39 1.82 4.32 1.04
## 5 3.39 0.34 1.97 6.75 1.05
## 6 2.52 0.30 1.98 5.25 1.02
## OD280/OD315 of diluted wines Proline
## 1 3.40 1050
## 2 3.17 1185
## 3 3.45 1480
## 4 2.93 735
## 5 2.85 1450
## 6 3.58 1290
summary(winedf)
## Cvs Alcohol Malic acid Ash
## Min. :1.000 Min. :11.03 Min. :0.74 Min. :1.360
## 1st Qu.:1.000 1st Qu.:12.36 1st Qu.:1.60 1st Qu.:2.210
## Median :2.000 Median :13.05 Median :1.87 Median :2.360
## Mean :1.944 Mean :12.99 Mean :2.34 Mean :2.366
## 3rd Qu.:3.000 3rd Qu.:13.67 3rd Qu.:3.10 3rd Qu.:2.560
## Max. :3.000 Max. :14.83 Max. :5.80 Max. :3.230
## Alcalinity of ash Magnesium Total phenols Flavanoids
## Min. :10.60 Min. : 70.00 Min. :0.980 Min. :0.340
## 1st Qu.:17.20 1st Qu.: 88.00 1st Qu.:1.740 1st Qu.:1.200
## Median :19.50 Median : 98.00 Median :2.350 Median :2.130
## Mean :19.52 Mean : 99.59 Mean :2.292 Mean :2.023
## 3rd Qu.:21.50 3rd Qu.:107.00 3rd Qu.:2.800 3rd Qu.:2.860
## Max. :30.00 Max. :162.00 Max. :3.880 Max. :5.080
## Nonflavanoid phenols Proanthocyanins Color intensity Hue
## Min. :0.1300 Min. :0.410 Min. : 1.280 Min. :0.480
## 1st Qu.:0.2700 1st Qu.:1.250 1st Qu.: 3.210 1st Qu.:0.780
## Median :0.3400 Median :1.550 Median : 4.680 Median :0.960
## Mean :0.3623 Mean :1.587 Mean : 5.055 Mean :0.957
## 3rd Qu.:0.4400 3rd Qu.:1.950 3rd Qu.: 6.200 3rd Qu.:1.120
## Max. :0.6600 Max. :3.580 Max. :13.000 Max. :1.710
## OD280/OD315 of diluted wines Proline
## Min. :1.270 Min. : 278.0
## 1st Qu.:1.930 1st Qu.: 500.0
## Median :2.780 Median : 672.0
## Mean :2.604 Mean : 745.1
## 3rd Qu.:3.170 3rd Qu.: 985.0
## Max. :4.000 Max. :1680.0
wineClasses <- factor(winedf$Cvs)
plot(main="Three Different Cultivars",winedf$Alcohol,winedf$'Alcalinity of ash', col = wineClasses)
plot(main="Three Different Cultivars",winedf$`Malic acid`,winedf$Magnesium, col = wineClasses)
plot(main="Three Different Cultivars",winedf$Ash,winedf$Flavanoids, col = wineClasses)
winePCA <- prcomp(scale(winedf[,-1]))
Now the 13 features has reduced to only 2 new Principal Components These are not 2 of those 13, but 2 new components
summary(winePCA)
## Importance of components%s:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.1628 1.5816 1.2055 0.96148 0.92830 0.8030 0.74295
## Proportion of Variance 0.3598 0.1924 0.1118 0.07111 0.06629 0.0496 0.04246
## Cumulative Proportion 0.3598 0.5522 0.6640 0.73515 0.80144 0.8510 0.89350
## PC8 PC9 PC10 PC11 PC12 PC13
## Standard deviation 0.59223 0.53775 0.49680 0.47481 0.41034 0.3224
## Proportion of Variance 0.02698 0.02224 0.01899 0.01734 0.01295 0.0080
## Cumulative Proportion 0.92048 0.94272 0.96171 0.97905 0.99200 1.0000
pcaCharts(winePCA)
## [1] "proportions of variance:"
## [1] 0.359830707 0.192412795 0.111794601 0.071111087 0.066287442
## [6] 0.049603670 0.042460139 0.026979910 0.022244617 0.018985283
## [11] 0.017341553 0.012952063 0.007996133
biplot(winePCA,scale=0, cex=.7)
wineClasses <- factor(winedf$Cvs)
plot(main="Three Different Cultivars",winePCA$x[,1:2], col = wineClasses)
As you compared with the plot with original features, you can see now it is much easy to classified them