PCA on Pulsar Data, following the same workflow as this Python notebook:

https://github.com/olgabradford/machine_learning_python/blob/master/01_10_Principal_Component_Analysis_PCA_classification_Pulsar.ipynb

PCA in R

In R, there are several functions from different packages that allow us to perform PCA. Here are five different ways to do a PCA, using the following functions (with their corresponding packages in parentheses):

prcomp() (stats)
princomp() (stats)
PCA() (FactoMineR)
dudi.pca() (ade4)
acp() (amap)

Brief note: it is no coincidence that the three external packages ("FactoMineR", "ade4", and "amap") were developed by French data analysts, who have a long tradition of, and preference for, PCA and related exploratory techniques.
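
As a minimal sketch (assuming the FactoMineR, ade4, and amap packages are installed), here is how the five calls look on a small all-numeric data frame such as the built-in USArrests:

# five ways to run a PCA on the same standardized data
dat <- USArrests

p1 <- prcomp(dat, scale. = TRUE)                  # stats
p2 <- princomp(dat, cor = TRUE)                   # stats
p3 <- FactoMineR::PCA(dat, graph = FALSE)         # FactoMineR
p4 <- ade4::dudi.pca(dat, scannf = FALSE, nf = 4) # ade4
p5 <- amap::acp(dat)                              # amap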

The typical PCA results consist of a set of eigenvalues, a table with the scores or principal components (PCs), and a table of loadings (or correlations between variables and PCs). The eigenvalues quantify the variability in the data, the scores describe the structure of the observations, and the loadings give a sense of the relationships between the variables as well as their associations with the extracted PCs.
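
In prcomp() terms (previewing what we extract step by step below), these three pieces map onto the fitted object like this, sketched on the built-in USArrests data:

fit <- prcomp(USArrests, scale. = TRUE)
fit$sdev^2     # eigenvalues: variance explained along each PC
head(fit$x)    # scores: the observations in PC space
fit$rotation   # loadings: weights of each original variable on each PC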

pulsar <- read.csv("pulsar_stars.csv", header=TRUE)
head(pulsar)
##   Mean.of.the.integrated.profile
## 1                      140.56250
## 2                      102.50781
## 3                      103.01562
## 4                      136.75000
## 5                       88.72656
## 6                       93.57031
##   Standard.deviation.of.the.integrated.profile
## 1                                     55.68378
## 2                                     58.88243
## 3                                     39.34165
## 4                                     57.17845
## 5                                     40.67223
## 6                                     46.69811
##   Excess.kurtosis.of.the.integrated.profile
## 1                               -0.23457141
## 2                                0.46531815
## 3                                0.32332837
## 4                               -0.06841464
## 5                                0.60086608
## 6                                0.53190485
##   Skewness.of.the.integrated.profile Mean.of.the.DM.SNR.curve
## 1                         -0.6996484                 3.199833
## 2                         -0.5150879                 1.677258
## 3                          1.0511644                 3.121237
## 4                         -0.6362384                 3.642977
## 5                          1.1234917                 1.178930
## 6                          0.4167211                 1.636288
##   Standard.deviation.of.the.DM.SNR.curve
## 1                               19.11043
## 2                               14.86015
## 3                               21.74467
## 4                               20.95928
## 5                               11.46872
## 6                               14.54507
##   Excess.kurtosis.of.the.DM.SNR.curve Skewness.of.the.DM.SNR.curve
## 1                            7.975532                     74.24222
## 2                           10.576487                    127.39358
## 3                            7.735822                     63.17191
## 4                            6.896499                     53.59366
## 5                           14.269573                    252.56731
## 6                           10.621748                    131.39400
##   target_class
## 1            0
## 2            0
## 3            0
## 4            0
## 5            0
## 6            0
names(pulsar)
## [1] "Mean.of.the.integrated.profile"              
## [2] "Standard.deviation.of.the.integrated.profile"
## [3] "Excess.kurtosis.of.the.integrated.profile"   
## [4] "Skewness.of.the.integrated.profile"          
## [5] "Mean.of.the.DM.SNR.curve"                    
## [6] "Standard.deviation.of.the.DM.SNR.curve"      
## [7] "Excess.kurtosis.of.the.DM.SNR.curve"         
## [8] "Skewness.of.the.DM.SNR.curve"                
## [9] "target_class"

Rename columns

As in the Python workflow, rename the original columns to shorter names. The Python equivalent was:

data2.columns = ['mean_profile', 'std_profile', 'kurtosis_profile', 'skewness_profile', 'mean_dmsnr_curve', 'std_dmsnr_curve', 'kurtosis_dmsnr_curve', 'skewness_dmsnr_curve', 'target_class']

names(pulsar) <- c("mean_profile", "std_profile", "kurtosis_profile","skewness_profile","mean_dmsnr_curve","std_dmsnr_curve","kurtosis_dmsnr_curve","skewness_dmsnr_curve", "target_class")
names(pulsar)
## [1] "mean_profile"         "std_profile"          "kurtosis_profile"    
## [4] "skewness_profile"     "mean_dmsnr_curve"     "std_dmsnr_curve"     
## [7] "kurtosis_dmsnr_curve" "skewness_dmsnr_curve" "target_class"
str(pulsar)
## 'data.frame':    17898 obs. of  9 variables:
##  $ mean_profile        : num  140.6 102.5 103 136.8 88.7 ...
##  $ std_profile         : num  55.7 58.9 39.3 57.2 40.7 ...
##  $ kurtosis_profile    : num  -0.2346 0.4653 0.3233 -0.0684 0.6009 ...
##  $ skewness_profile    : num  -0.7 -0.515 1.051 -0.636 1.123 ...
##  $ mean_dmsnr_curve    : num  3.2 1.68 3.12 3.64 1.18 ...
##  $ std_dmsnr_curve     : num  19.1 14.9 21.7 21 11.5 ...
##  $ kurtosis_dmsnr_curve: num  7.98 10.58 7.74 6.9 14.27 ...
##  $ skewness_dmsnr_curve: num  74.2 127.4 63.2 53.6 252.6 ...
##  $ target_class        : int  0 0 0 0 0 0 0 0 0 0 ...
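
The target is heavily imbalanced; a quick check of the class counts (a small addition, consistent with the target_class mean of about 0.094 seen later in prin_comp$center):

# class balance: roughly 9% of the observations are pulsars (target_class == 1)
table(pulsar$target_class)
round(prop.table(table(pulsar$target_class)), 3)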

Scale the pulsar data

Scale only columns 1-8, and keep target_class unchanged

#cbind() combines objects by column: target_class goes first, then the scaled features
pulsar.z <- as.data.frame(cbind(pulsar[, 9, drop = FALSE], scale(pulsar[, 1:8])))
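
As a quick sanity check (not in the original workflow), the scaled feature columns should now have mean ~0 and standard deviation ~1, while target_class sits untouched in column 1:

# verify the scaling: feature means near 0, standard deviations near 1
round(colMeans(pulsar.z[, 2:9]), 6)
round(apply(pulsar.z[, 2:9], 2, sd), 6)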

Train Test split

nrow(pulsar.z)
## [1] 17898
# Divide the data in an 80:20 training:test ratio
samp_size <- floor(0.80 * nrow(pulsar.z))
train_ind <- sample(seq_len(nrow(pulsar.z)), size = samp_size)

# Training data
pca.train <- as.data.frame(pulsar.z[train_ind,])

# Test Data
pca.test <-  as.data.frame(pulsar.z[-train_ind,])
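
Note that sample() is random, so the exact split changes between runs; a set.seed() call before sampling (omitted in the original) would make it reproducible. The split sizes work out to:

# set.seed(123)   # hypothetical seed; call before sample() for reproducibility
nrow(pca.train)   # 14318 rows (80%)
nrow(pca.test)    # 3580 rows (20%)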

PCA on the scaled training data

#With parameter scale. = TRUE, we normalize the variables to have standard deviation equal to 1
#(note: pca.train still contains target_class, so it enters the PCA as a variable)
prin_comp <- prcomp(pca.train, scale. = TRUE)

We will use the prcomp() function for PCA. It provides the outputs dumped below:

sdev - the standard deviations of the projected points on each PC; as expected, they decrease from PC1 to PC9.
rotation - the principal component axes (the loadings); there are nine principal components because there are nine input variables.
center / scale - the mean and standard deviation of each input variable in the original feature space (without any transformation).
x - the scores, i.e. the observations projected onto the PCs.

#The prcomp() function results in 5 useful measures:
names(prin_comp)
## [1] "sdev"     "rotation" "center"   "scale"    "x"
#sdev - the standard deviations of the projected points on each PC, decreasing from PC1 to PC9
#sqrt of eigenvalues
prin_comp$sdev
## [1] 2.1746738 1.4804373 0.9056351 0.7107247 0.5622048 0.5079116 0.3822143
## [8] 0.1323042 0.1271003
#compute standard deviation
std_dev <- prin_comp$sdev

#compute variance
pr_var <- std_dev^2
#check the variance of the components (only 9 exist, so the 10th element below is NA)

pr_var[1:10]
##  [1] 4.72920593 2.19169448 0.82017494 0.50512958 0.31607422 0.25797417
##  [7] 0.14608777 0.01750440 0.01615449         NA

We aim to find the components which explain the maximum variance. This is because, we want to retain as much information as possible using these components. So, higher is the explained variance, higher will be the information contained in those components.

To compute the proportion of variance explained by each component, we simply divide the variance by sum of total variance. This results in:

#proportion of variance explained
prop_varex <- pr_var/sum(pr_var)
prop_varex[1:10]
##  [1] 0.525467325 0.243521609 0.091130549 0.056125509 0.035119358
##  [6] 0.028663797 0.016231975 0.001944934 0.001794943          NA

This shows that the first principal component explains about 52% of the variance, the second 24%, the third 9%, the fourth 6%, and so on.

Scree plot

 plot(prop_varex, xlab = "Principal Component",
             ylab = "Proportion of Variance Explained",
             type = "b")

The first six components explain about 98% of the variance:

#cumulative scree plot
plot(cumsum(prop_varex), xlab = "Principal Component",
              ylab = "Cumulative Proportion of Variance Explained",
              type = "b")

Reference: https://www.analyticsvidhya.com/blog/2016/03/practical-guide-principal-component-analysis-python/

#rotation - the principal component axes; there are nine principal components because there are nine input variables
#loadings
prin_comp$rotation
##                             PC1        PC2         PC3         PC4
## target_class          0.3720348  0.1505419 0.121130307  0.43675288
## mean_profile         -0.3533864 -0.3146865 0.009749826 -0.14548585
## std_profile          -0.2084167 -0.3903647 0.482560007  0.69425753
## kurtosis_profile      0.4070189  0.2702159 0.073312122  0.17627547
## skewness_profile      0.3879750  0.2564561 0.055545604  0.02609747
## mean_dmsnr_curve      0.3051815 -0.2855959 0.536031470 -0.42431627
## std_dmsnr_curve       0.3451514 -0.3520348 0.219020085 -0.23339296
## kurtosis_dmsnr_curve -0.3216209  0.4467007 0.276045164 -0.03159109
## skewness_dmsnr_curve -0.2477426  0.4267358 0.576548160 -0.19702432
##                              PC5           PC6          PC7         PC8
## target_class         -0.75052872  0.0380515543 -0.245542716  0.09202271
## mean_profile         -0.37304411 -0.7247529191 -0.064043466 -0.29280968
## std_profile           0.28056320 -0.0363387373  0.076516288  0.05834996
## kurtosis_profile      0.23855164 -0.1566171438  0.080372724 -0.79642986
## skewness_profile      0.27077668 -0.6599123375  0.081437507  0.51379768
## mean_dmsnr_curve      0.12323712  0.0631815672 -0.574383895 -0.01293889
## std_dmsnr_curve      -0.23154069  0.0849371647  0.732631632  0.02871594
## kurtosis_dmsnr_curve -0.03893636 -0.0002887855  0.005736838  0.04556202
## skewness_dmsnr_curve -0.13523171  0.0272991397  0.223521244 -0.03214195
##                                PC9
## target_class         -0.0009823011
## mean_profile         -0.0236960246
## std_profile          -0.0049493579
## kurtosis_profile     -0.0517215331
## skewness_profile      0.0226900333
## mean_dmsnr_curve     -0.0930678321
## std_dmsnr_curve      -0.2368560045
## kurtosis_dmsnr_curve -0.7849806511
## skewness_dmsnr_curve  0.5614855937
# PCs (aka scores)
head(prin_comp$x)
##              PC1         PC2        PC3        PC4         PC5        PC6
## 7696  -0.6674070  0.94204844 -0.2868181 -0.3838119  0.08173933  0.2440567
## 15865 -1.4923128 -0.62156871  0.3069022  0.7466008  0.07911779 -0.5022875
## 5491  -0.4405090  0.06057573 -0.4563747  0.1943803  0.32533050  0.2303513
## 5851   3.1357392  0.75946563 -0.7188919  1.0361107 -1.82885569  0.2754297
## 2241   4.2779965 -1.58914285  1.8894446  2.2646479 -0.57817163  1.7010631
## 6838  -0.4767944  0.44131696 -1.0889361 -0.9329042 -0.34024180 -0.2560680
##              PC7         PC8        PC9
## 7696  -0.1476946  0.02930861 0.02538018
## 15865 -0.1848377 -0.03612507 0.03736138
## 5491  -0.1395701  0.04704056 0.02249237
## 5851  -0.3766238 -0.17635924 0.13142379
## 2241  -0.1097442  0.10563952 0.03410825
## 6838  -0.3364263  0.00213954 0.10171090
#center/scale - the means and standard deviations of the input features in the original feature space (without any transformation)
prin_comp$center
##         target_class         mean_profile          std_profile 
##          0.093937701         -0.002429675         -0.001890109 
##     kurtosis_profile     skewness_profile     mean_dmsnr_curve 
##          0.003616044          0.003095856          0.002965435 
##      std_dmsnr_curve kurtosis_dmsnr_curve skewness_dmsnr_curve 
##          0.004025270         -0.004143251         -0.002479922
prin_comp$scale
##         target_class         mean_profile          std_profile 
##            0.2917522            1.0048682            1.0026539 
##     kurtosis_profile     skewness_profile     mean_dmsnr_curve 
##            1.0057490            1.0045527            1.0044289 
##      std_dmsnr_curve kurtosis_dmsnr_curve skewness_dmsnr_curve 
##            1.0027342            1.0005313            1.0032054
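
As a consistency check (a small sketch, not part of the original workflow), the scores in prin_comp$x are exactly the centered-and-scaled training data multiplied by the rotation matrix:

# reconstruct the scores by hand: standardize with the stored center/scale, then rotate
manual_scores <- scale(pca.train,
                       center = prin_comp$center,
                       scale  = prin_comp$scale) %*% prin_comp$rotation
all.equal(unname(manual_scores), unname(prin_comp$x))  # should be TRUE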

Predictive Modeling with PCA components

Predicting on the test set

After calculating the principal components on the training set, let's look at how to predict on the test data using these components. The process is simple: we do not fit a new PCA on the test set; instead, we project the test data onto the components learned from the training set, and then train the model on the transformed training data.

A few important points to understand:

We should not combine the train and test sets to obtain the PCA components of the whole data at once, because this would violate the assumption of generalization: test data would 'leak' into the training set. In other words, the test set would no longer remain 'unseen', which would undermine the model's ability to generalize.

We should not perform PCA on the train and test sets separately either, because the resulting component vectors would point in different directions (due to unequal variance). We would end up comparing data registered on different axes; the vectors from the train and test data must share the same axes.

Instead, we apply exactly the same transformation to the test set as we did to the training set, including the centering and scaling. Let's do it in R:

# put the class label back alongside the training-set scores
train_data <- data.frame(target_class = pca.train$target_class, prin_comp$x)
#we are interested in the first 6 PCs
train_data <- train_data[, 1:7]  # target_class plus PC1..PC6
#fit a regression tree (method = "anova" predicts the class as a probability)
#install.packages("rpart")
library(rpart)
rpart.model <- rpart(target_class ~ ., data = train_data, method = "anova")
rpart.model
## n= 14318 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
## 1) root 14318 1218.654000 0.093937700  
##   2) PC1< 2.806277 13130  240.428400 0.018659560  
##     4) PC5>=-1.487643 12892   12.986890 0.001008377 *
##     5) PC5< -1.487643 238    5.848739 0.974789900 *
##   3) PC1>=2.806277 1188   81.481480 0.925925900  
##     6) PC4< -1.299098 69    2.869565 0.043478260 *
##     7) PC4>=-1.299098 1119   21.567470 0.980339600 *
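
Optionally, assuming the rpart.plot package is installed (it is not used elsewhere in this post), the fitted tree can be visualised:

#install.packages("rpart.plot")
library(rpart.plot)
rpart.plot(rpart.model)   # draws the splits on PC1, PC4 and PC5
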
#project the test set onto the PCA space learned from the training data
test.data <- predict(prin_comp, newdata = pca.test)
test.data <- as.data.frame(test.data)
#keep the first 6 components, as in training
test.data <- test.data[, 1:6]
#make predictions on the test data
rpart.prediction <- predict(rpart.model, test.data)

#write the predictions to file (Item_Identifier is a leftover column name from the
#original tutorial; here it simply duplicates target_class)
final.sub <- data.frame(Item_Identifier = pca.test$target_class, target_class = pca.test$target_class, predicted_class = rpart.prediction)
write.csv(final.sub, "pca2.csv", row.names = FALSE)

Check what went into the file:

head(final.sub, 45)
##     Item_Identifier target_class predicted_class
## 5                 0            0     0.001008377
## 10                0            0     0.001008377
## 13                0            0     0.001008377
## 14                0            0     0.001008377
## 17                0            0     0.001008377
## 22                0            0     0.001008377
## 27                0            0     0.001008377
## 30                0            0     0.001008377
## 31                0            0     0.001008377
## 33                0            0     0.001008377
## 34                0            0     0.001008377
## 35                0            0     0.001008377
## 37                0            0     0.001008377
## 41                0            0     0.001008377
## 42                0            0     0.001008377
## 45                0            0     0.001008377
## 47                0            0     0.001008377
## 51                0            0     0.001008377
## 53                0            0     0.001008377
## 63                0            0     0.001008377
## 64                0            0     0.001008377
## 68                0            0     0.001008377
## 77                0            0     0.001008377
## 79                0            0     0.001008377
## 89                0            0     0.001008377
## 103               0            0     0.001008377
## 107               0            0     0.001008377
## 109               0            0     0.001008377
## 111               1            1     0.980339589
## 115               0            0     0.001008377
## 116               0            0     0.001008377
## 125               0            0     0.001008377
## 130               0            0     0.001008377
## 133               1            1     0.980339589
## 139               0            0     0.001008377
## 145               0            0     0.001008377
## 148               0            0     0.001008377
## 156               0            0     0.001008377
## 169               0            0     0.001008377
## 171               0            0     0.001008377
## 177               0            0     0.001008377
## 200               0            0     0.001008377
## 205               1            1     0.974789916
## 206               0            0     0.001008377
## 212               0            0     0.001008377
#mean squared error of the predicted probabilities, rounded to 5 decimals
round(mean((rpart.prediction - pca.test$target_class)^2),5)
## [1] 0.00165
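
Since the anova-method tree returns probabilities rather than hard labels, we can also threshold them (0.5 is an assumed cutoff, not from the original) to get a confusion table and accuracy:

# convert predicted probabilities to class labels at an assumed 0.5 cutoff
pred_class <- ifelse(rpart.prediction >= 0.5, 1, 0)
table(predicted = pred_class, actual = pca.test$target_class)
mean(pred_class == pca.test$target_class)   # test-set accuracy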

Plots of observations

# PCA with function prcomp
#With parameter scale. = TRUE, we normalize the variables to have standard deviation equal to 1
pca1 = prcomp(USArrests, scale. = TRUE)

# load ggplot2
library(ggplot2)

# create data frame with scores
scores = as.data.frame(pca1$x)

# plot of observations
ggplot(data = scores, aes(x = PC1, y = PC2, label = rownames(scores))) +
  geom_hline(yintercept = 0, colour = "gray65") +
  geom_vline(xintercept = 0, colour = "gray65") +
  geom_text(colour = "tomato", alpha = 0.8, size = 4) +
  ggtitle("PCA plot of USA States - Crime Rates")

head(USArrests)
##            Murder Assault UrbanPop Rape
## Alabama      13.2     236       58 21.2
## Alaska       10.0     263       48 44.5
## Arizona       8.1     294       80 31.0
## Arkansas      8.8     190       50 19.5
## California    9.0     276       91 40.6
## Colorado      7.9     204       78 38.7
# note: head() on a prcomp object prints its list elements in full
head(pca1)
## $sdev
## [1] 1.5748783 0.9948694 0.5971291 0.4164494
## 
## $rotation
##                 PC1        PC2        PC3         PC4
## Murder   -0.5358995  0.4181809 -0.3412327  0.64922780
## Assault  -0.5831836  0.1879856 -0.2681484 -0.74340748
## UrbanPop -0.2781909 -0.8728062 -0.3780158  0.13387773
## Rape     -0.5434321 -0.1673186  0.8177779  0.08902432
## 
## $center
##   Murder  Assault UrbanPop     Rape 
##    7.788  170.760   65.540   21.232 
## 
## $scale
##    Murder   Assault  UrbanPop      Rape 
##  4.355510 83.337661 14.474763  9.366385 
## 
## $x
##                        PC1         PC2         PC3          PC4
## Alabama        -0.97566045  1.12200121 -0.43980366  0.154696581
## Alaska         -1.93053788  1.06242692  2.01950027 -0.434175454
## Arizona        -1.74544285 -0.73845954  0.05423025 -0.826264240
## Arkansas        0.13999894  1.10854226  0.11342217 -0.180973554
## California     -2.49861285 -1.52742672  0.59254100 -0.338559240
## Colorado       -1.49934074 -0.97762966  1.08400162  0.001450164
## Connecticut     1.34499236 -1.07798362 -0.63679250 -0.117278736
## Delaware       -0.04722981 -0.32208890 -0.71141032 -0.873113315
## Florida        -2.98275967  0.03883425 -0.57103206 -0.095317042
## Georgia        -1.62280742  1.26608838 -0.33901818  1.065974459
## Hawaii          0.90348448 -1.55467609  0.05027151  0.893733198
## Idaho           1.62331903  0.20885253  0.25719021 -0.494087852
## Illinois       -1.36505197 -0.67498834 -0.67068647 -0.120794916
## Indiana         0.50038122 -0.15003926  0.22576277  0.420397595
## Iowa            2.23099579 -0.10300828  0.16291036  0.017379470
## Kansas          0.78887206 -0.26744941  0.02529648  0.204421034
## Kentucky        0.74331256  0.94880748 -0.02808429  0.663817237
## Louisiana      -1.54909076  0.86230011 -0.77560598  0.450157791
## Maine           2.37274014  0.37260865 -0.06502225 -0.327138529
## Maryland       -1.74564663  0.42335704 -0.15566968 -0.553450589
## Massachusetts   0.48128007 -1.45967706 -0.60337172 -0.177793902
## Michigan       -2.08725025 -0.15383500  0.38100046  0.101343128
## Minnesota       1.67566951 -0.62590670  0.15153200  0.066640316
## Mississippi    -0.98647919  2.36973712 -0.73336290  0.213342049
## Missouri       -0.68978426 -0.26070794  0.37365033  0.223554811
## Montana         1.17353751  0.53147851  0.24440796  0.122498555
## Nebraska        1.25291625 -0.19200440  0.17380930  0.015733156
## Nevada         -2.84550542 -0.76780502  1.15168793  0.311354436
## New Hampshire   2.35995585 -0.01790055  0.03648498 -0.032804291
## New Jersey     -0.17974128 -1.43493745 -0.75677041  0.240936580
## New Mexico     -1.96012351  0.14141308  0.18184598 -0.336121113
## New York       -1.66566662 -0.81491072 -0.63661186 -0.013348844
## North Carolina -1.11208808  2.20561081 -0.85489245 -0.944789648
## North Dakota    2.96215223  0.59309738  0.29824930 -0.251434626
## Ohio            0.22369436 -0.73477837 -0.03082616  0.469152817
## Oklahoma        0.30864928 -0.28496113 -0.01515592  0.010228476
## Oregon         -0.05852787 -0.53596999  0.93038718 -0.235390872
## Pennsylvania    0.87948680 -0.56536050 -0.39660218  0.355452378
## Rhode Island    0.85509072 -1.47698328 -1.35617705 -0.607402746
## South Carolina -1.30744986  1.91397297 -0.29751723 -0.130145378
## South Dakota    1.96779669  0.81506822  0.38538073 -0.108470512
## Tennessee      -0.98969377  0.85160534  0.18619262  0.646302674
## Texas          -1.34151838 -0.40833518 -0.48712332  0.636731051
## Utah            0.54503180 -1.45671524  0.29077592 -0.081486749
## Vermont         2.77325613  1.38819435  0.83280797 -0.143433697
## Virginia        0.09536670  0.19772785  0.01159482  0.209246429
## Washington      0.21472339 -0.96037394  0.61859067 -0.218628161
## West Virginia   2.08739306  1.41052627  0.10372163  0.130583080
## Wisconsin       2.05881199 -0.60512507 -0.13746933  0.182253407
## Wyoming         0.62310061  0.31778662 -0.23824049 -0.164976866