PCA in R

In R, there are several functions from different packages that allow us to perform PCA. Here are 5 different ways to do a PCA, using the following functions (with their corresponding packages in parentheses):
prcomp() (stats)
princomp() (stats)
PCA() (FactoMineR)
dudi.pca() (ade4)
acp() (amap)
Brief note: it is no coincidence that the three external packages (“FactoMineR”, “ade4”, and “amap”) were developed by French data analysts, who have a long tradition of, and preference for, PCA and related exploratory techniques.
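As a quick sketch of how interchangeable the base-R options are, here is the same PCA via prcomp() and princomp() on the built-in USArrests data (revisited at the end of this post); calls to the three external packages are analogous but require installing them first:

# same analysis two ways; princomp() divides by n rather than n - 1,
# so its standard deviations come out very slightly smaller
p1 <- prcomp(USArrests, scale. = TRUE)   # SVD-based
p2 <- princomp(USArrests, cor = TRUE)    # eigen-decomposition of the correlation matrix
round(p1$sdev, 3)
round(unname(p2$sdev), 3)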
Typical PCA results consist of a set of eigenvalues, a table with the scores or Principal Components (PCs), and a table of loadings (or correlations between variables and PCs). The eigenvalues provide information about the variability in the data. The scores provide information about the structure of the observations. The loadings (or correlations) allow you to get a sense of the relationships between variables, as well as their associations with the extracted PCs.
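With prcomp(), for example, these three pieces map onto the fit object as follows (a minimal sketch, again on USArrests):

fit <- prcomp(USArrests, scale. = TRUE)
eigenvalues <- fit$sdev^2    # variances along each PC
scores      <- fit$x         # the observations in PC space
loadings    <- fit$rotation  # how the original variables load on each PC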
pulsar <- read.csv("pulsar_stars.csv", header=TRUE)
head(pulsar)
## Mean.of.the.integrated.profile
## 1 140.56250
## 2 102.50781
## 3 103.01562
## 4 136.75000
## 5 88.72656
## 6 93.57031
## Standard.deviation.of.the.integrated.profile
## 1 55.68378
## 2 58.88243
## 3 39.34165
## 4 57.17845
## 5 40.67223
## 6 46.69811
## Excess.kurtosis.of.the.integrated.profile
## 1 -0.23457141
## 2 0.46531815
## 3 0.32332837
## 4 -0.06841464
## 5 0.60086608
## 6 0.53190485
## Skewness.of.the.integrated.profile Mean.of.the.DM.SNR.curve
## 1 -0.6996484 3.199833
## 2 -0.5150879 1.677258
## 3 1.0511644 3.121237
## 4 -0.6362384 3.642977
## 5 1.1234917 1.178930
## 6 0.4167211 1.636288
## Standard.deviation.of.the.DM.SNR.curve
## 1 19.11043
## 2 14.86015
## 3 21.74467
## 4 20.95928
## 5 11.46872
## 6 14.54507
## Excess.kurtosis.of.the.DM.SNR.curve Skewness.of.the.DM.SNR.curve
## 1 7.975532 74.24222
## 2 10.576487 127.39358
## 3 7.735822 63.17191
## 4 6.896499 53.59366
## 5 14.269573 252.56731
## 6 10.621748 131.39400
## target_class
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
names(pulsar)
## [1] "Mean.of.the.integrated.profile"
## [2] "Standard.deviation.of.the.integrated.profile"
## [3] "Excess.kurtosis.of.the.integrated.profile"
## [4] "Skewness.of.the.integrated.profile"
## [5] "Mean.of.the.DM.SNR.curve"
## [6] "Standard.deviation.of.the.DM.SNR.curve"
## [7] "Excess.kurtosis.of.the.DM.SNR.curve"
## [8] "Skewness.of.the.DM.SNR.curve"
## [9] "target_class"
As in the Python workflow, rename the original columns to shorter names. The Python version was:

data2.columns = ['mean_profile', 'std_profile', 'kurtosis_profile', 'skewness_profile', 'mean_dmsnr_curve', 'std_dmsnr_curve', 'kurtosis_dmsnr_curve', 'skewness_dmsnr_curve', 'target_class']

The R equivalent:
names(pulsar) <- c("mean_profile", "std_profile", "kurtosis_profile","skewness_profile","mean_dmsnr_curve","std_dmsnr_curve","kurtosis_dmsnr_curve","skewness_dmsnr_curve", "target_class")
names(pulsar)
## [1] "mean_profile" "std_profile" "kurtosis_profile"
## [4] "skewness_profile" "mean_dmsnr_curve" "std_dmsnr_curve"
## [7] "kurtosis_dmsnr_curve" "skewness_dmsnr_curve" "target_class"
str(pulsar)
## 'data.frame': 17898 obs. of 9 variables:
## $ mean_profile : num 140.6 102.5 103 136.8 88.7 ...
## $ std_profile : num 55.7 58.9 39.3 57.2 40.7 ...
## $ kurtosis_profile : num -0.2346 0.4653 0.3233 -0.0684 0.6009 ...
## $ skewness_profile : num -0.7 -0.515 1.051 -0.636 1.123 ...
## $ mean_dmsnr_curve : num 3.2 1.68 3.12 3.64 1.18 ...
## $ std_dmsnr_curve : num 19.1 14.9 21.7 21 11.5 ...
## $ kurtosis_dmsnr_curve: num 7.98 10.58 7.74 6.9 14.27 ...
## $ skewness_dmsnr_curve: num 74.2 127.4 63.2 53.6 252.6 ...
## $ target_class : int 0 0 0 0 0 0 0 0 0 0 ...
Scale only columns 1-8, and keep target_class unchanged:
# cbind() combines objects by column; here it puts the untouched
# target_class column next to the z-scaled features
pulsar.z <- as.data.frame(cbind(pulsar[, 9, drop = FALSE], scale(pulsar[, 1:8])))
nrow(pulsar.z)
## [1] 17898
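A quick sanity check that the scaling behaved as intended (column 1 of pulsar.z is the untouched target_class, so we skip it):

# means should be ~0 and standard deviations ~1 for the 8 scaled features
round(colMeans(pulsar.z[, -1]), 3)
round(apply(pulsar.z[, -1], 2, sd), 3)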
# Divide the data in an 80:20 train:test ratio
samp_size <- floor(0.80 * nrow(pulsar.z))
train_ind <- sample(seq_len(nrow(pulsar.z)), size = samp_size)
# Training data
pca.train <- as.data.frame(pulsar.z[train_ind,])
# Test Data
pca.test <- as.data.frame(pulsar.z[-train_ind,])
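Note that sample() draws a random subset, so the exact rows chosen (and all of the numeric output below) will vary from run to run. For a reproducible split, the sampling step above could be preceded by a fixed seed (123 is an arbitrary choice):

# fix the RNG state before sampling so the split can be reproduced
set.seed(123)
train_ind <- sample(seq_len(nrow(pulsar.z)), size = samp_size)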
# With scale = TRUE, we normalize the variables to have standard deviation equal to 1.
# Note that pca.train still contains target_class, so the label enters the PCA as a 9th variable.
prin_comp <- prcomp(pca.train, scale = TRUE)
We will use the prcomp() function for PCA. prcomp() returns the outputs dumped below.

sdev - the standard deviations of the projected points on each PC; as expected, they are in decreasing order from PC1 to PC9.
rotation - the principal component axes. Here there are nine principal components, because nine columns (the eight features plus target_class) went into the PCA.
center/scale - the mean and standard deviation of each input feature in the original feature space (without any transformation).
#The prcomp() function results in 5 useful measures:
names(prin_comp)
## [1] "sdev" "rotation" "center" "scale" "x"
#sdev - the standard deviations of the projected points on each PC; as expected, they decrease from PC1 to PC9.
#sqrt of eigenvalues
prin_comp$sdev
## [1] 2.1746738 1.4804373 0.9056351 0.7107247 0.5622048 0.5079116 0.3822143
## [8] 0.1323042 0.1271003
#compute standard deviation
std_dev <- prin_comp$sdev
#compute variance
pr_var <- std_dev^2
#variances of the components; only 9 PCs exist, so the 10th entry is NA
pr_var[1:10]
## [1] 4.72920593 2.19169448 0.82017494 0.50512958 0.31607422 0.25797417
## [7] 0.14608777 0.01750440 0.01615449 NA
We aim to find the components which explain the maximum variance, because we want to retain as much information as possible using these components. The higher the explained variance, the more information those components contain.
To compute the proportion of variance explained by each component, we simply divide each component's variance by the total variance. This results in:
#proportion of variance explained
prop_varex <- pr_var/sum(pr_var)
prop_varex[1:10]
## [1] 0.525467325 0.243521609 0.091130549 0.056125509 0.035119358
## [6] 0.028663797 0.016231975 0.001944934 0.001794943 NA
This shows that the first principal component explains 52% of the variance, the second 24%, the third 9%, the fourth 5%, and so on.
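The same figures are available directly from summary(), without the manual arithmetic:

# standard deviation, proportion of variance, and cumulative
# proportion for each PC in a single table
summary(prin_comp)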
plot(prop_varex, xlab = "Principal Component",
ylab = "Proportion of Variance Explained",
type = "b")
The first 6 components explain 98% of the variance.
#cumulative scree plot
plot(cumsum(prop_varex), xlab = "Principal Component",
ylab = "Cumulative Proportion of Variance Explained",
type = "b")
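Rather than reading the threshold off the plot, the number of components can be chosen programmatically; a small sketch using a 98% cutoff:

# smallest number of PCs explaining at least 98% of the variance
k <- which(cumsum(prop_varex) >= 0.98)[1]
k   # equals 6 for the run shown above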
Source: https://www.analyticsvidhya.com/blog/2016/03/practical-guide-principal-component-analysis-python/
#rotation - the principal component axes; there are nine principal components because nine variables entered the PCA.
#loadings
prin_comp$rotation
## PC1 PC2 PC3 PC4
## target_class 0.3720348 0.1505419 0.121130307 0.43675288
## mean_profile -0.3533864 -0.3146865 0.009749826 -0.14548585
## std_profile -0.2084167 -0.3903647 0.482560007 0.69425753
## kurtosis_profile 0.4070189 0.2702159 0.073312122 0.17627547
## skewness_profile 0.3879750 0.2564561 0.055545604 0.02609747
## mean_dmsnr_curve 0.3051815 -0.2855959 0.536031470 -0.42431627
## std_dmsnr_curve 0.3451514 -0.3520348 0.219020085 -0.23339296
## kurtosis_dmsnr_curve -0.3216209 0.4467007 0.276045164 -0.03159109
## skewness_dmsnr_curve -0.2477426 0.4267358 0.576548160 -0.19702432
## PC5 PC6 PC7 PC8
## target_class -0.75052872 0.0380515543 -0.245542716 0.09202271
## mean_profile -0.37304411 -0.7247529191 -0.064043466 -0.29280968
## std_profile 0.28056320 -0.0363387373 0.076516288 0.05834996
## kurtosis_profile 0.23855164 -0.1566171438 0.080372724 -0.79642986
## skewness_profile 0.27077668 -0.6599123375 0.081437507 0.51379768
## mean_dmsnr_curve 0.12323712 0.0631815672 -0.574383895 -0.01293889
## std_dmsnr_curve -0.23154069 0.0849371647 0.732631632 0.02871594
## kurtosis_dmsnr_curve -0.03893636 -0.0002887855 0.005736838 0.04556202
## skewness_dmsnr_curve -0.13523171 0.0272991397 0.223521244 -0.03214195
## PC9
## target_class -0.0009823011
## mean_profile -0.0236960246
## std_profile -0.0049493579
## kurtosis_profile -0.0517215331
## skewness_profile 0.0226900333
## mean_dmsnr_curve -0.0930678321
## std_dmsnr_curve -0.2368560045
## kurtosis_dmsnr_curve -0.7849806511
## skewness_dmsnr_curve 0.5614855937
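To see which variables dominate a given axis, sort the absolute loadings; for example:

# variables with the largest absolute loadings on PC1 and PC2
head(sort(abs(prin_comp$rotation[, "PC1"]), decreasing = TRUE), 3)
head(sort(abs(prin_comp$rotation[, "PC2"]), decreasing = TRUE), 3)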
# PCs (aka scores)
head(prin_comp$x)
## PC1 PC2 PC3 PC4 PC5 PC6
## 7696 -0.6674070 0.94204844 -0.2868181 -0.3838119 0.08173933 0.2440567
## 15865 -1.4923128 -0.62156871 0.3069022 0.7466008 0.07911779 -0.5022875
## 5491 -0.4405090 0.06057573 -0.4563747 0.1943803 0.32533050 0.2303513
## 5851 3.1357392 0.75946563 -0.7188919 1.0361107 -1.82885569 0.2754297
## 2241 4.2779965 -1.58914285 1.8894446 2.2646479 -0.57817163 1.7010631
## 6838 -0.4767944 0.44131696 -1.0889361 -0.9329042 -0.34024180 -0.2560680
## PC7 PC8 PC9
## 7696 -0.1476946 0.02930861 0.02538018
## 15865 -0.1848377 -0.03612507 0.03736138
## 5491 -0.1395701 0.04704056 0.02249237
## 5851 -0.3766238 -0.17635924 0.13142379
## 2241 -0.1097442 0.10563952 0.03410825
## 6838 -0.3364263 0.00213954 0.10171090
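As a sanity check, the scores are simply the centered-and-scaled training data multiplied by the rotation matrix:

# recompute the scores by hand and compare with prin_comp$x
recomputed <- scale(as.matrix(pca.train),
                    center = prin_comp$center,
                    scale  = prin_comp$scale) %*% prin_comp$rotation
all.equal(unname(recomputed), unname(prin_comp$x))   # TRUE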
#center/scale - the mean and standard deviation of each input feature in the original feature space (before the PCA transformation). Since the features were z-scaled earlier, the centers are near 0 and the scales near 1 for everything except target_class.
prin_comp$center
## target_class mean_profile std_profile
## 0.093937701 -0.002429675 -0.001890109
## kurtosis_profile skewness_profile mean_dmsnr_curve
## 0.003616044 0.003095856 0.002965435
## std_dmsnr_curve kurtosis_dmsnr_curve skewness_dmsnr_curve
## 0.004025270 -0.004143251 -0.002479922
prin_comp$scale
## target_class mean_profile std_profile
## 0.2917522 1.0048682 1.0026539
## kurtosis_profile skewness_profile mean_dmsnr_curve
## 1.0057490 1.0045527 1.0044289
## std_dmsnr_curve kurtosis_dmsnr_curve skewness_dmsnr_curve
## 1.0027342 1.0005313 1.0032054
After we’ve calculated the principal components on the training set, let’s now understand the process of predicting on test data using these components. The process is simple: we project the test data onto the component axes learned from the training set, and then train the model.
A few important points to understand:
1. We should not combine the train and test sets to obtain PCA components of the whole data at once. This would violate the assumption of generalization, since test data would get ‘leaked’ into the training set; the test set would no longer remain ‘unseen’, and our estimate of the model’s generalization ability would be compromised.
2. We should not perform PCA on the test and train sets separately. The resultant vectors from the two PCAs would have different directions (due to unequal variance), so we would end up comparing data registered on different axes. The vectors from train and test data must share the same axes.
Therefore, we apply exactly the same transformation to the test set as we did to the training set, including the centering and scaling. Let’s do it in R:
# put the class label back next to the training-set scores
train_data <- data.frame(target_class = pca.train$target_class, prin_comp$x)
# we are interested in the first 6 PCs (column 1 is target_class)
train_data <- train_data[, 1:7]
# fit a decision tree on the PC scores
#install.packages("rpart")
library(rpart)
rpart.model <- rpart(target_class ~ ., data = train_data, method = "anova")
rpart.model
## n= 14318
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 14318 1218.654000 0.093937700
## 2) PC1< 2.806277 13130 240.428400 0.018659560
## 4) PC5>=-1.487643 12892 12.986890 0.001008377 *
## 5) PC5< -1.487643 238 5.848739 0.974789900 *
## 3) PC1>=2.806277 1188 81.481480 0.925925900
## 6) PC4< -1.299098 69 2.869565 0.043478260 *
## 7) PC4>=-1.299098 1119 21.567470 0.980339600 *
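Optionally, the fitted tree can be drawn with base graphics (packages such as rpart.plot give prettier output, but are not used here):

# quick base-graphics rendering of the fitted tree
plot(rpart.model, uniform = TRUE, margin = 0.1)
text(rpart.model, use.n = TRUE, cex = 0.8)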
# project the test set onto the training-set components
test.data <- predict(prin_comp, newdata = pca.test)
test.data <- as.data.frame(test.data)
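Under the hood, predict.prcomp() just applies the training-set center, scale, and rotation to the new data; a quick check of that equivalence:

# manual projection of the test set onto the training-set axes
manual <- scale(as.matrix(pca.test),
                center = prin_comp$center,
                scale  = prin_comp$scale) %*% prin_comp$rotation
all.equal(unname(manual), unname(as.matrix(test.data)))   # TRUE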
#select the first 6 components
test.data <- test.data[,1:6]
#make prediction on test data
rpart.prediction <- predict(rpart.model, test.data)
# collect the actual and predicted values for inspection
# (Item_Identifier is a column name kept from the template this example
# was adapted from; here it simply duplicates target_class)
final.sub <- data.frame(Item_Identifier = pca.test$target_class, target_class = pca.test$target_class, predicted_class = rpart.prediction)
write.csv(final.sub, "pca2.csv",row.names = F)
head(final.sub, 45)
## Item_Identifier target_class predicted_class
## 5 0 0 0.001008377
## 10 0 0 0.001008377
## 13 0 0 0.001008377
## 14 0 0 0.001008377
## 17 0 0 0.001008377
## 22 0 0 0.001008377
## 27 0 0 0.001008377
## 30 0 0 0.001008377
## 31 0 0 0.001008377
## 33 0 0 0.001008377
## 34 0 0 0.001008377
## 35 0 0 0.001008377
## 37 0 0 0.001008377
## 41 0 0 0.001008377
## 42 0 0 0.001008377
## 45 0 0 0.001008377
## 47 0 0 0.001008377
## 51 0 0 0.001008377
## 53 0 0 0.001008377
## 63 0 0 0.001008377
## 64 0 0 0.001008377
## 68 0 0 0.001008377
## 77 0 0 0.001008377
## 79 0 0 0.001008377
## 89 0 0 0.001008377
## 103 0 0 0.001008377
## 107 0 0 0.001008377
## 109 0 0 0.001008377
## 111 1 1 0.980339589
## 115 0 0 0.001008377
## 116 0 0 0.001008377
## 125 0 0 0.001008377
## 130 0 0 0.001008377
## 133 1 1 0.980339589
## 139 0 0 0.001008377
## 145 0 0 0.001008377
## 148 0 0 0.001008377
## 156 0 0 0.001008377
## 169 0 0 0.001008377
## 171 0 0 0.001008377
## 177 0 0 0.001008377
## 200 0 0 0.001008377
## 205 1 1 0.974789916
## 206 0 0 0.001008377
## 212 0 0 0.001008377
# mean squared error on the test set, rounded to 5 decimal places
round(mean((rpart.prediction - pca.test$target_class)^2),5)
## [1] 0.00165
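Since target_class is binary, the anova tree’s fitted values can also be read as probabilities and thresholded to get a classification accuracy (0.5 is an arbitrary cutoff):

# threshold the predicted probabilities to hard 0/1 labels
pred_class <- as.integer(rpart.prediction > 0.5)
# fraction of test rows classified correctly
mean(pred_class == pca.test$target_class)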
# PCA with the function prcomp
# With scale. = TRUE, we normalize the variables to have standard deviation equal to 1
pca1 = prcomp(USArrests, scale. = TRUE)
# load ggplot2
library(ggplot2)
# create data frame with scores
scores = as.data.frame(pca1$x)
# plot of observations
ggplot(data = scores, aes(x = PC1, y = PC2, label = rownames(scores))) +
geom_hline(yintercept = 0, colour = "gray65") +
geom_vline(xintercept = 0, colour = "gray65") +
geom_text(colour = "tomato", alpha = 0.8, size = 4) +
ggtitle("PCA plot of USA States - Crime Rates")
head(USArrests)
## Murder Assault UrbanPop Rape
## Alabama 13.2 236 58 21.2
## Alaska 10.0 263 48 44.5
## Arizona 8.1 294 80 31.0
## Arkansas 8.8 190 50 19.5
## California 9.0 276 91 40.6
## Colorado 7.9 204 78 38.7
head(pca1)
## $sdev
## [1] 1.5748783 0.9948694 0.5971291 0.4164494
##
## $rotation
## PC1 PC2 PC3 PC4
## Murder -0.5358995 0.4181809 -0.3412327 0.64922780
## Assault -0.5831836 0.1879856 -0.2681484 -0.74340748
## UrbanPop -0.2781909 -0.8728062 -0.3780158 0.13387773
## Rape -0.5434321 -0.1673186 0.8177779 0.08902432
##
## $center
## Murder Assault UrbanPop Rape
## 7.788 170.760 65.540 21.232
##
## $scale
## Murder Assault UrbanPop Rape
## 4.355510 83.337661 14.474763 9.366385
##
## $x
## PC1 PC2 PC3 PC4
## Alabama -0.97566045 1.12200121 -0.43980366 0.154696581
## Alaska -1.93053788 1.06242692 2.01950027 -0.434175454
## Arizona -1.74544285 -0.73845954 0.05423025 -0.826264240
## Arkansas 0.13999894 1.10854226 0.11342217 -0.180973554
## California -2.49861285 -1.52742672 0.59254100 -0.338559240
## Colorado -1.49934074 -0.97762966 1.08400162 0.001450164
## Connecticut 1.34499236 -1.07798362 -0.63679250 -0.117278736
## Delaware -0.04722981 -0.32208890 -0.71141032 -0.873113315
## Florida -2.98275967 0.03883425 -0.57103206 -0.095317042
## Georgia -1.62280742 1.26608838 -0.33901818 1.065974459
## Hawaii 0.90348448 -1.55467609 0.05027151 0.893733198
## Idaho 1.62331903 0.20885253 0.25719021 -0.494087852
## Illinois -1.36505197 -0.67498834 -0.67068647 -0.120794916
## Indiana 0.50038122 -0.15003926 0.22576277 0.420397595
## Iowa 2.23099579 -0.10300828 0.16291036 0.017379470
## Kansas 0.78887206 -0.26744941 0.02529648 0.204421034
## Kentucky 0.74331256 0.94880748 -0.02808429 0.663817237
## Louisiana -1.54909076 0.86230011 -0.77560598 0.450157791
## Maine 2.37274014 0.37260865 -0.06502225 -0.327138529
## Maryland -1.74564663 0.42335704 -0.15566968 -0.553450589
## Massachusetts 0.48128007 -1.45967706 -0.60337172 -0.177793902
## Michigan -2.08725025 -0.15383500 0.38100046 0.101343128
## Minnesota 1.67566951 -0.62590670 0.15153200 0.066640316
## Mississippi -0.98647919 2.36973712 -0.73336290 0.213342049
## Missouri -0.68978426 -0.26070794 0.37365033 0.223554811
## Montana 1.17353751 0.53147851 0.24440796 0.122498555
## Nebraska 1.25291625 -0.19200440 0.17380930 0.015733156
## Nevada -2.84550542 -0.76780502 1.15168793 0.311354436
## New Hampshire 2.35995585 -0.01790055 0.03648498 -0.032804291
## New Jersey -0.17974128 -1.43493745 -0.75677041 0.240936580
## New Mexico -1.96012351 0.14141308 0.18184598 -0.336121113
## New York -1.66566662 -0.81491072 -0.63661186 -0.013348844
## North Carolina -1.11208808 2.20561081 -0.85489245 -0.944789648
## North Dakota 2.96215223 0.59309738 0.29824930 -0.251434626
## Ohio 0.22369436 -0.73477837 -0.03082616 0.469152817
## Oklahoma 0.30864928 -0.28496113 -0.01515592 0.010228476
## Oregon -0.05852787 -0.53596999 0.93038718 -0.235390872
## Pennsylvania 0.87948680 -0.56536050 -0.39660218 0.355452378
## Rhode Island 0.85509072 -1.47698328 -1.35617705 -0.607402746
## South Carolina -1.30744986 1.91397297 -0.29751723 -0.130145378
## South Dakota 1.96779669 0.81506822 0.38538073 -0.108470512
## Tennessee -0.98969377 0.85160534 0.18619262 0.646302674
## Texas -1.34151838 -0.40833518 -0.48712332 0.636731051
## Utah 0.54503180 -1.45671524 0.29077592 -0.081486749
## Vermont 2.77325613 1.38819435 0.83280797 -0.143433697
## Virginia 0.09536670 0.19772785 0.01159482 0.209246429
## Washington 0.21472339 -0.96037394 0.61859067 -0.218628161
## West Virginia 2.08739306 1.41052627 0.10372163 0.130583080
## Wisconsin 2.05881199 -0.60512507 -0.13746933 0.182253407
## Wyoming 0.62310061 0.31778662 -0.23824049 -0.164976866
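To view the states and the crime variables in the same picture, base R’s biplot() works directly on a prcomp object; a minimal sketch:

# scores and loadings overlaid on the PC1/PC2 plane
biplot(pca1, scale = 0, cex = 0.6)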