About: This document is also available at http://rpubs.com/sherloconan/615424

 

Goals

In this assignment you will be working with the dataset from your 699 project. You will perform principal component analysis (PCA).

Submission Format

Tasks

  1. Establish the optimal number of components: visualize the scree plot and explain your decision. [ - 10pts]

  2. Visualize PCA1 and PCA2 and describe which variables contribute to the PCA. [ - 10pts]

  3. Reflect how you could use the reduced dimensionality in your final paper. [ - 10pts]

  4. Writing style. [ - 10pts]

 

Project Progress

The EDA parts are available at RPubs - part I and RPubs - part II. The modeling part is available at RPubs - part III.

The project dataset is the Bitcoin price as a time series. PCA is not applicable in this case, so the built-in dataset “mtcars” is used instead. Fig. 1 shows the raw data for two of its variables, mpg and disp, and Fig. 2 shows the same data after standardization.

library(ggplot2)  # provides ggplot(); gridExtra is called below via ::

# Raw data: fuel economy (mpg, column 1) and engine displacement (disp, column 3)
data <- mtcars[, c("mpg", "disp")]
f1 <- ggplot(data,aes(mpg,disp))+geom_point()+labs(x="Fuel economy (miles per gallon)",y="Engine displacement (cubic inch)",subtitle="Raw data")+ggtitle("Fig. 1. Scatter plot of MPG and DISP")+theme_classic()

# Standardize each column to mean 0 and standard deviation 1
data <- data.frame(scale(data))
f2 <- ggplot(data,aes(mpg,disp))+geom_point()+labs(x="Fuel economy (miles per gallon)",y="Engine displacement (cubic inch)",subtitle="Scaled data")+ggtitle("Fig. 2. Scatter plot of MPG and DISP")+theme_classic()

# Show both panels side by side
gridExtra::grid.arrange(f1,f2,nrow=1,bottom="Before PCA")

 

Principal Component Analysis (PCA)

The summary shows the importance of components. The object returned by prcomp has the following elements:

(1) sdev: the standard deviations of the principal components, i.e., the square roots of the eigenvalues of the covariance/correlation matrix, although the calculation is actually done with the singular values of the data matrix.

(2) rotation: the matrix of variable loadings, i.e., a matrix whose columns contain the eigenvectors. The function “princomp” returns this in the element loadings.

(3) x: if retx is true, the rotated data (the centred, and scaled if requested, data multiplied by the rotation matrix). Hence, cov(x) is the diagonal matrix diag(sdev^2). For the formula method, napredict() is applied to handle the treatment of values omitted by the na.action.

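Reading the coefficients off the rotation matrix, the two components are linear combinations of the scaled variables:

\(PC1 = -0.7071 \times mpg + 0.7071 \times disp\)

\(PC2 = 0.7071 \times mpg + 0.7071 \times disp\)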
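
The loadings of ±0.7071 are not specific to mpg and disp. As a short derivation (writing \(r\) for the sample correlation between the two variables and \(R\) for their correlation matrix, both symbols introduced here for illustration), PCA on two standardized variables amounts to diagonalizing a 2×2 correlation matrix:

\[
R = \begin{pmatrix} 1 & r \\ r & 1 \end{pmatrix}, \qquad
R \begin{pmatrix} -1 \\ 1 \end{pmatrix} = (1 - r) \begin{pmatrix} -1 \\ 1 \end{pmatrix}, \qquad
R \begin{pmatrix} 1 \\ 1 \end{pmatrix} = (1 + r) \begin{pmatrix} 1 \\ 1 \end{pmatrix}
\]

Normalizing the eigenvectors to unit length gives entries of \(\pm 1/\sqrt{2} \approx \pm 0.7071\) for any value of \(r\); the overall sign of each eigenvector is arbitrary. Because mpg and disp are negatively correlated, the larger eigenvalue is \(1 - r\), so the first component has standard deviation \(\sqrt{1 - r}\) and explains a share \((1 - r)/2\) of the total variance. The proportion 0.9238 reported in the summary output corresponds to \(r \approx -0.85\).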

# data is already standardized, so no further centering or scaling is requested
PCA <- prcomp(data, center = FALSE, scale. = FALSE)
summary(PCA)
## Importance of components:
##                           PC1     PC2
## Standard deviation     1.3592 0.39045
## Proportion of Variance 0.9238 0.07622
## Cumulative Proportion  0.9238 1.00000
PCA$rotation
##             PC1       PC2
## mpg  -0.7071068 0.7071068
## disp  0.7071068 0.7071068
#plot(PCA$x)
data_PCA <- data.frame(PCA$x)
ggplot(data_PCA,aes(PC1,PC2))+geom_point()+labs(subtitle="After PCA")+ggtitle("Fig. 3. Scatter plot of PC1 and PC2")+theme_classic()
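
The properties of sdev, rotation, and x listed in the description of the prcomp output can be checked numerically. The following is a minimal sketch in base R, assuming the same two standardized mtcars columns; the names scaled, pca, and eig are illustrative and do not come from the original analysis.

```r
# Minimal check of the prcomp properties described above (base R only).
# `scaled`, `pca`, and `eig` are illustrative names, not from the original.
scaled <- data.frame(scale(mtcars[, c("mpg", "disp")]))
pca <- prcomp(scaled, center = FALSE, scale. = FALSE)

# (1) sdev equals the square roots of the eigenvalues of the covariance matrix
eig <- eigen(cov(scaled))
sdev_matches <- isTRUE(all.equal(pca$sdev, sqrt(eig$values)))

# (2) the columns of rotation are unit-length and mutually orthogonal
rotation_orthonormal <- isTRUE(all.equal(crossprod(pca$rotation), diag(2),
                                         check.attributes = FALSE))

# (3) the scores are uncorrelated, with variances sdev^2
cov_is_diagonal <- isTRUE(all.equal(cov(pca$x), diag(pca$sdev^2),
                                    check.attributes = FALSE))

print(c(sdev = sdev_matches,
        rotation = rotation_orthonormal,
        cov = cov_is_diagonal))
```

Comparing with all.equal rather than == allows for floating-point rounding, which matters here because prcomp works from a singular value decomposition while eigen() works from the covariance matrix.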

 

PCA continued

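# PCA on all 11 mtcars variables, without centering or scaling,
# so large-valued variables such as disp and hp dominate the first components
data2 <- prcomp(mtcars, center = FALSE, scale. = FALSE, retx = TRUE)
plot(data2$x[, 1:2], main = "Principal component analysis")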

data2$x[1:5,1:5]
##                         PC1       PC2       PC3         PC4        PC5
## Mazda RX4         -195.4586 -12.82442 11.366763 -0.01644411  2.1681553
## Mazda RX4 Wag     -195.4900 -12.85837 11.672530  0.47938929  2.1123201
## Datsun 710        -142.4803 -25.93604 16.034034  1.33669483 -1.1818054
## Hornet 4 Drive    -279.1129  38.27291 14.032390 -0.15698678 -0.8169072
## Hornet Sportabout -399.4494  37.33958  1.384863 -2.55678873 -0.4435470
data2$sdev
##  [1] 310.1170486  40.8849807  15.8494620   2.1406948   1.0130078   0.7559841
##  [7]   0.4637388   0.2914478   0.2518935   0.2107261   0.1985110
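
Task 1 asks for a scree plot, which can be derived from the standard deviations printed above. The following is a minimal sketch using base graphics, with the same prcomp settings as data2; the names pca_all and var_explained are illustrative assumptions.

```r
# Sketch of a scree plot; same prcomp settings as data2 in this section.
# `pca_all` and `var_explained` are illustrative names.
pca_all <- prcomp(mtcars, center = FALSE, scale. = FALSE)

# Variance carried by each component, as a share of the total
var_explained <- pca_all$sdev^2 / sum(pca_all$sdev^2)

plot(seq_along(var_explained), var_explained, type = "b",
     xlab = "Principal component", ylab = "Proportion of variance",
     main = "Scree plot (mtcars, uncentered and unscaled)")

# Cumulative share, to read off how many components reach a chosen threshold
round(cumsum(var_explained), 4)
```

Because the data are neither centered nor scaled here, the first component mostly reflects the large raw magnitudes of disp and hp. Rerunning with center = TRUE and scale. = TRUE is a common alternative and would give a different scree plot.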
data2$rotation[1:5,1:5]
##              PC1         PC2          PC3           PC4          PC5
## mpg  -0.05192570 -0.12168895  0.816770206 -0.5384199012  0.014048862
## cyl  -0.02055752 -0.01353632  0.068072076  0.0965395213  0.220032664
## disp -0.85225865  0.52234494  0.009452163 -0.0223072691  0.007183881
## hp   -0.51719303 -0.84049922 -0.157726895  0.0009687559 -0.033835232
## drat -0.01010078 -0.02135314  0.107698845  0.0342661459  0.167587424
# With center = FALSE and scale. = FALSE, the scores are simply data %*% rotation
recover <- as.matrix(mtcars) %*% data2$rotation
recover[1:5,1:5]
##                         PC1       PC2       PC3         PC4        PC5
## Mazda RX4         -195.4586 -12.82442 11.366763 -0.01644411  2.1681553
## Mazda RX4 Wag     -195.4900 -12.85837 11.672530  0.47938929  2.1123201
## Datsun 710        -142.4803 -25.93604 16.034034  1.33669483 -1.1818054
## Hornet 4 Drive    -279.1129  38.27291 14.032390 -0.15698678 -0.8169072
## Hornet Sportabout -399.4494  37.33958  1.384863 -2.55678873 -0.4435470
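
The product above reproduces data2$x only because the PCA was run without centering or scaling. As a hedged sketch of the more general case, prcomp stores the centering and scaling vectors it used in the elements center and scale, and the scores can be rebuilt by applying them before the rotation. The names pca_std, standardized, and manual_scores are illustrative.

```r
# Sketch: reproducing the scores when prcomp centers and scales the data.
# `pca_std`, `standardized`, and `manual_scores` are illustrative names.
pca_std <- prcomp(mtcars, center = TRUE, scale. = TRUE)

# Apply the stored centering and scaling, then rotate
standardized <- scale(mtcars, center = pca_std$center, scale = pca_std$scale)
manual_scores <- standardized %*% pca_std$rotation

# TRUE if the manual product matches the scores returned by prcomp
isTRUE(all.equal(manual_scores, pca_std$x, check.attributes = FALSE))

# Skipping the standardization step: the raw product is compared the same way
isTRUE(all.equal(as.matrix(mtcars) %*% pca_std$rotation, pca_std$x,
                 check.attributes = FALSE))
```

The same pattern applies when projecting new observations onto existing components: the new rows should be standardized with the stored center and scale, not with their own means and standard deviations.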