*** Principal Component Analysis (PCA): an Unsupervised Learning Method ***

Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.

The number of principal components is less than or equal to the number of original variables. This transformation is defined in such a way that the first principal component has the largest possible variance (that is, accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components.

The resulting vectors are an uncorrelated orthogonal basis set. PCA is sensitive to the relative scaling of the original variables.
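As a minimal sketch of both properties, the following example (on simulated data; the names x1, x2, X, and pc are purely illustrative, not part of the original text) shows that the component scores produced by prcomp() are uncorrelated, and that changing the scaling changes the components:

set.seed(1)
x1 = rnorm(200)
x2 = 0.8 * x1 + rnorm(200, sd = 0.5)   # x2 is correlated with x1
X = cbind(x1, x2)
pc = prcomp(X, scale = TRUE)
round(cor(pc$x), 3)                    # component scores are uncorrelated (identity matrix)
prcomp(X, scale = FALSE)$sdev          # compare with pc$sdev: different scaling, different components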

Principal components analysis is a procedure for identifying a smaller number of uncorrelated variables, called “principal components”, from a large set of data. The goal of principal components analysis is to explain the maximum amount of variance with the fewest principal components. Principal components analysis is commonly used in the social sciences, market research, and other industries that use large data sets.

Principal components analysis is commonly used as one step in a series of analyses. You can use principal components analysis to reduce the number of variables and avoid multicollinearity, or when you have too many predictors relative to the number of observations.

Example: A consumer products company wants to analyze customer responses to several characteristics of a new shampoo: color, smell, texture, cleanliness, shine, volume, amount needed to lather, and price. They perform a principal components analysis to determine whether they can form a smaller number of uncorrelated variables that are easier to interpret and analyze. The results identify the following patterns:

Color, smell, and texture form a “Shampoo quality” component.

Cleanliness, shine, and volume form an “Effect on hair” component.

Amount needed to lather and price form a “Value” component.



In machine learning and statistics, dimensionality reduction or dimension reduction is the process of reducing the number of random variables under consideration,[1] via obtaining a set of principal variables. It can be divided into feature selection and feature extraction.[2]

Feature selection:

Feature selection approaches try to find a subset of the original variables (also called features or attributes). There are three strategies: the filter approach (e.g., information gain), the wrapper approach (e.g., a search guided by model accuracy), and the embedded approach (features are added or removed while the model is being built, based on prediction errors). See also combinatorial optimization problems.

In some cases, data analysis such as regression or classification can be done in the reduced space more accurately than in the original space.
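As a minimal sketch of the filter strategy (on hypothetical simulated data; the names X, y, scores, and keep are illustrative), one can rank predictors by their absolute correlation with a response and keep only the top few:

set.seed(2)
X = data.frame(matrix(rnorm(100 * 5), ncol = 5))    # 5 candidate features
y = X[[1]] + 0.5 * X[[2]] + rnorm(100)              # response depends mainly on the first two
scores = sapply(X, function(col) abs(cor(col, y)))  # filter criterion: |correlation| with y
keep = names(sort(scores, decreasing = TRUE))[1:2]  # keep the 2 highest-ranked features
X.selected = X[, keep]                              # a subset of the original variables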

Feature extraction:

Feature extraction transforms the data in the high-dimensional space to a space of fewer dimensions. The data transformation may be linear, as in principal component analysis (PCA), but many nonlinear dimensionality reduction techniques also exist.[3][4] For multidimensional data, tensor representation can be used in dimensionality reduction through multilinear subspace learning.[5]
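In contrast, a minimal feature-extraction sketch (again on hypothetical simulated data; X, pc, and X.reduced are illustrative names) builds a small number of new variables from all of the original ones, here with PCA:

set.seed(3)
X = matrix(rnorm(100 * 10), ncol = 10)   # 100 observations, 10 original features
pc = prcomp(X, scale = TRUE)
X.reduced = pc$x[, 1:3]                  # keep only the first 3 extracted components
dim(X.reduced)                           # 100 x 3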

Advantages of dimensionality reduction

It reduces the time and storage space required.
Removal of multi-collinearity improves the performance of the machine learning model.
It becomes easier to visualize the data when reduced to very low dimensions such as 2D or 3D.

PCA is mostly used as a tool in exploratory data analysis and for making predictive models.

PCA can be done by eigenvalue decomposition of a data covariance (or correlation) matrix or singular value decomposition of a data matrix, usually after mean centering (and normalizing or using Z-scores) the data matrix for each attribute.[4] The results of a PCA are usually discussed in terms of component scores, sometimes called factor scores (the transformed variable values corresponding to a particular data point), and loadings (the weight by which each standardized original variable should be multiplied to get the component score).[5]
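The following sketch (on hypothetical simulated data; the names X, Xs, eig, sv, and scores are illustrative) shows both routes on the same centered and scaled matrix. The eigenvalues of the covariance matrix should match the squared singular values divided by n - 1, up to sign differences in the vectors:

set.seed(4)
X = matrix(rnorm(100 * 3), ncol = 3)
Xs = scale(X)               # mean-center and scale each column
eig = eigen(cov(Xs))        # eigenvalue decomposition of the covariance (here correlation) matrix
sv = svd(Xs)                # singular value decomposition of the data matrix
eig$values                  # variances of the principal components
sv$d^2 / (nrow(Xs) - 1)     # the same variances, obtained from the singular values
scores = Xs %*% eig$vectors # component scores; eig$vectors holds the loadings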

PCA is the simplest of the true eigenvector-based multivariate analyses. Often, its operation can be thought of as revealing the internal structure of the data in a way that best explains the variance in the data. If a multivariate dataset is visualised as a set of coordinates in a high-dimensional data space (1 axis per variable), PCA can supply the user with a lower-dimensional picture, a projection or “shadow” of this object when viewed from its (in some sense; see below) most informative viewpoint. This is done by using only the first few principal components so that the dimensionality of the transformed data is reduced.

PCA is closely related to factor analysis. Factor analysis typically incorporates more domain-specific assumptions about the underlying structure and solves for the eigenvectors of a slightly different matrix.

PCA is also related to canonical correlation analysis (CCA). CCA defines coordinate systems that optimally describe the cross-covariance between two datasets while PCA defines a new orthogonal coordinate system that optimally describes variance in a single dataset.
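A brief sketch of the contrast, using base R's cancor() on hypothetical simulated data (the names A and B are illustrative):

set.seed(5)
A = matrix(rnorm(100 * 3), ncol = 3)
B = matrix(rnorm(100 * 2), ncol = 2)
prcomp(A, scale = TRUE)$sdev   # PCA: variance structure within the single data set A
cancor(A, B)$cor               # CCA: canonical correlations between the two data sets A and B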

data("USArrests")

usarrests = USArrests

dim(usarrests)
## [1] 50  4
str(usarrests)
## 'data.frame':    50 obs. of  4 variables:
##  $ Murder  : num  13.2 10 8.1 8.8 9 7.9 3.3 5.9 15.4 17.4 ...
##  $ Assault : int  236 263 294 190 276 204 110 238 335 211 ...
##  $ UrbanPop: int  58 48 80 50 91 78 77 72 80 60 ...
##  $ Rape    : num  21.2 44.5 31 19.5 40.6 38.7 11.1 15.8 31.9 25.8 ...
states = row.names(usarrests)
states
##  [1] "Alabama"        "Alaska"         "Arizona"        "Arkansas"      
##  [5] "California"     "Colorado"       "Connecticut"    "Delaware"      
##  [9] "Florida"        "Georgia"        "Hawaii"         "Idaho"         
## [13] "Illinois"       "Indiana"        "Iowa"           "Kansas"        
## [17] "Kentucky"       "Louisiana"      "Maine"          "Maryland"      
## [21] "Massachusetts"  "Michigan"       "Minnesota"      "Mississippi"   
## [25] "Missouri"       "Montana"        "Nebraska"       "Nevada"        
## [29] "New Hampshire"  "New Jersey"     "New Mexico"     "New York"      
## [33] "North Carolina" "North Dakota"   "Ohio"           "Oklahoma"      
## [37] "Oregon"         "Pennsylvania"   "Rhode Island"   "South Carolina"
## [41] "South Dakota"   "Tennessee"      "Texas"          "Utah"          
## [45] "Vermont"        "Virginia"       "Washington"     "West Virginia" 
## [49] "Wisconsin"      "Wyoming"
names(usarrests)
## [1] "Murder"   "Assault"  "UrbanPop" "Rape"
apply(usarrests, 2, mean)
##   Murder  Assault UrbanPop     Rape 
##    7.788  170.760   65.540   21.232
apply(usarrests, 2, sd)
##    Murder   Assault  UrbanPop      Rape 
##  4.355510 83.337661 14.474763  9.366385
apply(usarrests, 2, var)
##     Murder    Assault   UrbanPop       Rape 
##   18.97047 6945.16571  209.51878   87.72916

Because the variables have very different means and variances, scaling before applying PCA is a good idea.

If we failed to scale the variables before performing PCA, most of the principal components we observed would be driven by the “Assault” variable, since it has by far the largest mean and variance.

Thus, it is important to standardize the variables to have mean zero and standard deviation one before performing PCA.
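A quick way to check this claim with the data loaded above (pr.unscaled is just an illustrative name) is to run prcomp() without scaling and look at the first loading vector, which is dominated by Assault:

pr.unscaled = prcomp(usarrests, scale = FALSE)
pr.unscaled$rotation[, 1]   # Assault carries almost all of the weight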

pr.out = prcomp(usarrests, scale = TRUE)

The prcomp() function centers the variables to have mean zero.

scale = TRUE scales the variables to have standard deviation one.

names(pr.out)
## [1] "sdev"     "rotation" "center"   "scale"    "x"

The “center” and “scale” components correspond to the means and standard deviations of the variables that were used for centering and scaling prior to implementing PCA.

pr.out$center
##   Murder  Assault UrbanPop     Rape 
##    7.788  170.760   65.540   21.232
pr.out$scale
##    Murder   Assault  UrbanPop      Rape 
##  4.355510 83.337661 14.474763  9.366385
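As a quick sanity check (not part of the original output), these should match the column means and standard deviations computed earlier with apply():

all.equal(pr.out$center, apply(usarrests, 2, mean))
all.equal(pr.out$scale, apply(usarrests, 2, sd))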

The rotation matrix provides the principal component loadings; each column of pr.out$rotation contains the corresponding principal component loading vector.

pr.out$rotation
##                 PC1        PC2        PC3         PC4
## Murder   -0.5358995  0.4181809 -0.3412327  0.64922780
## Assault  -0.5831836  0.1879856 -0.2681484 -0.74340748
## UrbanPop -0.2781909 -0.8728062 -0.3780158  0.13387773
## Rape     -0.5434321 -0.1673186  0.8177779  0.08902432

There are 4 distinct principal components, because the number of PCs equals min(n - 1, p); here n - 1 = 49 and p = 4.
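The count can be verified directly from the dimensions of the data:

min(nrow(usarrests) - 1, ncol(usarrests))   # = 4 distinct principal components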

The loading vectors need to be matrix-multiplied with the (centered and scaled) predictor matrix X to obtain the coordinates of the data in the rotated coordinate system; these coordinates are the principal component scores.

score vector = X matrix × loading vector

pr.out$x is this score matrix; it has one row per observation and one column per principal component, so here it has the same dimensions as the data set.

dim(pr.out$x)
## [1] 50  4
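As a sanity check (scores.manual is just an illustrative name), the scores can be recomputed by multiplying the centered and scaled data matrix by the rotation matrix; the result should match pr.out$x up to floating-point error:

scores.manual = scale(usarrests) %*% pr.out$rotation
all.equal(scores.manual, pr.out$x)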

We can plot the first two principal components as follows.

biplot(pr.out, scale=0)

The scale = 0 argument ensures that the arrows are scaled to represent the loadings; other values of scale give biplots with slightly different interpretations.

If the signs of the loading vectors and the score vectors are flipped together, the result represents the same principal components as before.

Different software packages may report these vectors with different signs, but the absolute values should be the same.

pr.out$rotation = -pr.out$rotation   # flip the sign of every loading vector

pr.out$x = -pr.out$x   # flip the sign of every score vector

biplot(pr.out, scale=0)

pr.out$sdev
## [1] 1.5748783 0.9948694 0.5971291 0.4164494
pr.var = pr.out$sdev^2
pr.var
## [1] 2.4802416 0.9897652 0.3565632 0.1734301

pr.out$sdev is the standard deviation of each principal component; squaring it gives the variance explained by each PC for this data set.

To compute the proportion of variance explained (PVE) by each PC, divide the variance explained by that PC by the total variance explained by all the PCs.

PVE = pr.var / sum(pr.var) * 100 # in percentage
PVE
## [1] 62.006039 24.744129  8.914080  4.335752

We see that the first PC explains 62% of the variance in the data set, and the second PC explains a further 24.7%.
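The same quantities (as fractions rather than percentages) can also be read directly from summary():

summary(pr.out)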

We can now plot the PVE of each component as well as the cumulative PVE.

par(mfrow = c(1, 2))

plot(PVE, xlab="Principal Component",
     ylab="Proportion of variance explained",
     main="variance explained by each PC",
     type='b',
     col="red",
     lwd=2)

# if PVE had not been converted to a percentage, add ylim = c(0, 1)

plot(cumsum(PVE), xlab="Principal Component",
     ylab="Cumulative proportion of variance explained",
     main="cumulative variance explained by PCs",
     type='b',
     col="blue",
     lwd=2)

All 4 PCs together explain 100% of the variance in the data set.
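A quick check that the percentages add up:

sum(PVE)   # 100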