PCA example

This simple example demonstrates how PCA can be used for dimensionality reduction. It also shows that PCA may not always be useful or practical. We will build a simple linear regression model to predict house prices and compare the results with and without PCA. We will use R^2 (R-squared) to evaluate how much of the variance is explained by each model; the closer it is to 100%, the better.
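
To be concrete, the R^2 reported on the test set here is simply the squared correlation between actual and predicted values. A one-line sketch of the helper (the same one defined again later in the code):

rsq <- function(actual, predicted) cor(actual, predicted)^2  # R^2 as squared correlation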

We will be using the Boston Housing data from the mlbench package. We are going to predict the house price (medv) based on all of the other features (columns).

library(mlbench)
library(help = "mlbench")  # lists the datasets available in mlbench
library(ggplot2)
library(dplyr)       # select(), %>%, as_tibble()
library(caret)       # createDataPartition()
library(factoextra)  # fviz_eig(), fviz_pca_var()
# Boston Housing Data
data(BostonHousing)
dim(BostonHousing)
## [1] 506  14
head(BostonHousing)
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio      b lstat
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21
##   medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
summary(BostonHousing)
##       crim                zn             indus       chas         nox        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   0:471   Min.   :0.3850  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1: 35   1st Qu.:0.4490  
##  Median : 0.25651   Median :  0.00   Median : 9.69           Median :0.5380  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14           Mean   :0.5547  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10           3rd Qu.:0.6240  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74           Max.   :0.8710  
##        rm             age              dis              rad        
##  Min.   :3.561   Min.   :  2.90   Min.   : 1.130   Min.   : 1.000  
##  1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100   1st Qu.: 4.000  
##  Median :6.208   Median : 77.50   Median : 3.207   Median : 5.000  
##  Mean   :6.285   Mean   : 68.57   Mean   : 3.795   Mean   : 9.549  
##  3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188   3rd Qu.:24.000  
##  Max.   :8.780   Max.   :100.00   Max.   :12.127   Max.   :24.000  
##       tax           ptratio            b              lstat      
##  Min.   :187.0   Min.   :12.60   Min.   :  0.32   Min.   : 1.73  
##  1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38   1st Qu.: 6.95  
##  Median :330.0   Median :19.05   Median :391.44   Median :11.36  
##  Mean   :408.2   Mean   :18.46   Mean   :356.67   Mean   :12.65  
##  3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23   3rd Qu.:16.95  
##  Max.   :711.0   Max.   :22.00   Max.   :396.90   Max.   :37.97  
##       medv      
##  Min.   : 5.00  
##  1st Qu.:17.02  
##  Median :21.20  
##  Mean   :22.53  
##  3rd Qu.:25.00  
##  Max.   :50.00

PCA works only on numerical variables, so we need to remove categorical or ordinal features first. Some people one-hot encode them instead, but to me that does not make much sense for PCA.
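
A quick way to spot the non-numerical columns is to check which columns are factors; in this dataset chas is the only one (a minimal check, nothing more):

sapply(BostonHousing, is.factor)  # TRUE only for chas
# str(BostonHousing) works just as well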

BH <- BostonHousing %>% select(- c(chas))

For a proper prediction model we need to split our data into train and test sets. Here we do a basic 70/30 split with createDataPartition().

set.seed(2020)
train_index <- createDataPartition(BH$medv, p = 0.7, list = F)
train <- BH[train_index,]
test <- BH[-train_index,]
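
As a quick sanity check we can look at the resulting sizes; they should be roughly 70/30 of the 506 rows (the exact counts depend on the seed):

dim(train)  # about 70% of the rows
dim(test)   # the remainder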

We fit PCA on the training set only; we will transform the test set with the same rotation later, when we predict. From the scree plot we can see that PC1 alone explains a large share of the variance.

set.seed(2020)
BH_pca <- prcomp(train %>% select(-medv), center = TRUE, scale= TRUE)
fviz_eig(BH_pca)

summary(BH_pca)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6    PC7
## Standard deviation     2.4710 1.1685 1.09205 0.92934 0.80464 0.71262 0.6178
## Proportion of Variance 0.5088 0.1138 0.09938 0.07197 0.05395 0.04232 0.0318
## Cumulative Proportion  0.5088 0.6226 0.72199 0.79396 0.84792 0.89024 0.9220
##                            PC8     PC9    PC10    PC11    PC12
## Standard deviation     0.52750 0.46712 0.44758 0.41678 0.25506
## Proportion of Variance 0.02319 0.01818 0.01669 0.01448 0.00542
## Cumulative Proportion  0.94523 0.96341 0.98010 0.99458 1.00000
fviz_pca_var(BH_pca,
             col.var = "cos2", # Color by contributions to the PC
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE     # Avoid text overlapping
)
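
To see which original variables drive each component, we can also inspect the loadings directly; this complements the variable plot above (a quick look, not a required step):

round(BH_pca$rotation[, 1:3], 2)  # loadings of the original features on PC1-PC3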

We create a training data set from the first three principal components (PC1, PC2, and PC3). We now have only 3 features instead of the 12 in the original dataset! Dimensionality reduction, remember :)

new_BH_pca <- cbind(medv = train$medv, BH_pca$x[,1:3]) %>% as_tibble()
head(new_BH_pca)
## # A tibble: 6 x 4
##    medv   PC1    PC2     PC3
##   <dbl> <dbl>  <dbl>   <dbl>
## 1  24   -2.10 0.279  -1.08  
## 2  21.6 -1.46 0.935  -0.172 
## 3  34.7 -2.06 0.236  -0.846 
## 4  33.4 -2.60 0.0204 -0.0776
## 5  36.2 -2.44 0.0721 -0.190 
## 6  28.7 -2.21 0.488   0.365

Let’s build a simple linear regression model on the principal components.

bh_m_PCA <- lm(medv ~ PC1 + PC2 + PC3 , data = new_BH_pca)
summary(bh_m_PCA)
## 
## Call:
## lm(formula = medv ~ PC1 + PC2 + PC3, data = new_BH_pca)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.827  -2.966  -0.505   1.802  34.834 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  22.6975     0.3030  74.913   <2e-16 ***
## PC1          -2.2640     0.1228 -18.438   <2e-16 ***
## PC2          -1.3566     0.2597  -5.224    3e-07 ***
## PC3          -4.3615     0.2778 -15.698   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.717 on 352 degrees of freedom
## Multiple R-squared:  0.6355, Adjusted R-squared:  0.6324 
## F-statistic: 204.6 on 3 and 352 DF,  p-value: < 2.2e-16

We build another model with all of the original features.

bh_m <- lm(medv ~ . , data = train)
summary(bh_m)
## 
## Call:
## lm(formula = medv ~ ., data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.4930  -2.8798  -0.7844   1.5707  25.9508 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  36.886566   6.456711   5.713 2.41e-08 ***
## crim         -0.149127   0.044936  -3.319 0.001002 ** 
## zn            0.042555   0.016869   2.523 0.012099 *  
## indus         0.087170   0.074299   1.173 0.241522    
## nox         -20.222886   4.566635  -4.428 1.28e-05 ***
## rm            4.061471   0.535160   7.589 3.07e-13 ***
## age           0.009415   0.016146   0.583 0.560194    
## dis          -1.468881   0.240243  -6.114 2.64e-09 ***
## rad           0.394694   0.083376   4.734 3.23e-06 ***
## tax          -0.015324   0.004635  -3.306 0.001047 ** 
## ptratio      -1.054124   0.163361  -6.453 3.75e-10 ***
## b             0.011969   0.003261   3.670 0.000281 ***
## lstat        -0.546897   0.063261  -8.645  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.947 on 343 degrees of freedom
## Multiple R-squared:  0.734,  Adjusted R-squared:  0.7247 
## F-statistic: 78.87 on 12 and 343 DF,  p-value: < 2.2e-16
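
For a direct numeric comparison of the two training fits (a small sketch reusing the models above):

c(PCA_3_PCs = summary(bh_m_PCA)$r.squared,   # ~0.64
  All_features = summary(bh_m)$r.squared)    # ~0.73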

From the training fits we can see that the all-features model performs better than the PCA model built on PC1, PC2, and PC3 (training R^2 of about 0.73 vs 0.64). Nevertheless, let us apply the PCA transformation to the test set and make predictions. The results are displayed in a Predicted vs Actual chart.

set.seed(2020)
rsq <- function (x, y) cor(x, y) ^ 2
new_BH_pca_test <- predict(BH_pca, test %>% select(-medv))
test_pca_df <- new_BH_pca_test[,1:3] %>% as_tibble()
y_test_pca <- predict(bh_m_PCA, test_pca_df)
y_test <- test$medv
rsq_pca1 <- round(rsq(y_test,y_test_pca),3)
df <- cbind(Actual = y_test, Predicted = y_test_pca) %>% as_tibble()
ggplot(df,aes(x=Actual, y =  Predicted))+
geom_point() + ggtitle(paste("Test Acuracy for PCA model with R^2:" ,rsq_pca1),) +
  theme(plot.title = element_text(hjust = 0.5))

Now let us compare the test accuracy with the model that uses all features (no PCA).

rsq <- function (x, y) cor(x, y) ^ 2
y_test_all <- predict(bh_m, test %>% select(-medv))
y_test <- test$medv
rsq_all <- round(rsq(y_test,y_test_all),3)
df1 <- cbind(Actual = y_test, Predicted = y_test_all) %>% as_tibble()
ggplot(df1,aes(x=Actual, y =  Predicted))+
geom_point() + ggtitle(paste("Test Acuracy for PCA model with R^2:" ,rsq_all),) +
  theme(plot.title = element_text(hjust = 0.5))
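
Putting the two test R^2 values side by side makes the comparison explicit (a small sketch reusing the values computed above):

c(PCA_3_PCs = rsq_pca1, All_features = rsq_all)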

Conclusion

We can see that the Actual vs Predicted points of the all-features model lie closer to the diagonal line than those of the PCA model. In addition, the test R^2 is better for the all-features model than for the PCA model (74% vs 70%). The all-features model also has the better training accuracy.

Takeaway

In conclusion, for this dataset PCA does not help improve accuracy. PCA is meant for dimensionality reduction and may not be very useful when we have only a small number of features, as in the Boston Housing dataset. It will perhaps help only when we have a great many features, and even then we can still experiment with other techniques.
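
If we did want to keep PCA in a pipeline, a common rule of thumb is to retain enough components to reach a target share of variance rather than a fixed number. A minimal sketch (the 90% threshold is just an illustrative choice); with the fit above this selects 7 components, as the cumulative proportions in the PCA summary show:

var_explained <- BH_pca$sdev^2 / sum(BH_pca$sdev^2)  # proportion of variance per PC
n_pc <- which(cumsum(var_explained) >= 0.90)[1]      # smallest number of PCs reaching 90%
n_pc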