This simple example demonstrates how PCA can be used for dimensionality reduction. It also shows that PCA may not always be useful or practical. We will build a simple linear regression model and compare the results with and without PCA. We will use R^2 (R-squared) to evaluate the proportion of variance that can be explained by our model: the closer it gets to 100%, the better.
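As a quick illustration of the metric, R^2 here is simply the squared correlation between the actual and predicted values. A minimal sketch (the actual and predicted vectors below are toy values, purely for illustration):
rsq <- function(x, y) cor(x, y)^2               # R^2 as squared correlation
actual    <- c(24.0, 21.6, 34.7, 33.4, 36.2)    # toy values
predicted <- c(25.1, 20.9, 33.0, 31.8, 35.5)    # toy values
rsq(actual, predicted)                          # close to 1 means most of the variance is explained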
We will be using the Boston Housing data from the mlbench package. We’re going to predict house price (medv) based on all the other features (columns).
library(mlbench)
library(help = "mlbench")
library(ggplot2)
library(dplyr)      # for %>% and select()
library(caret)      # for createDataPartition()
library(factoextra) # for fviz_eig() and fviz_pca_var()
# Boston Housing Data
data(BostonHousing)
dim(BostonHousing)
## [1] 506 14
head(BostonHousing)
## crim zn indus chas nox rm age dis rad tax ptratio b lstat
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21
## medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
summary(BostonHousing)
## crim zn indus chas nox
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 0:471 Min. :0.3850
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1: 35 1st Qu.:0.4490
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.5380
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.5547
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.6240
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :0.8710
## rm age dis rad
## Min. :3.561 Min. : 2.90 Min. : 1.130 Min. : 1.000
## 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100 1st Qu.: 4.000
## Median :6.208 Median : 77.50 Median : 3.207 Median : 5.000
## Mean :6.285 Mean : 68.57 Mean : 3.795 Mean : 9.549
## 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188 3rd Qu.:24.000
## Max. :8.780 Max. :100.00 Max. :12.127 Max. :24.000
## tax ptratio b lstat
## Min. :187.0 Min. :12.60 Min. : 0.32 Min. : 1.73
## 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38 1st Qu.: 6.95
## Median :330.0 Median :19.05 Median :391.44 Median :11.36
## Mean :408.2 Mean :18.46 Mean :356.67 Mean :12.65
## 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23 3rd Qu.:16.95
## Max. :711.0 Max. :22.00 Max. :396.90 Max. :37.97
## medv
## Min. : 5.00
## 1st Qu.:17.02
## Median :21.20
## Mean :22.53
## 3rd Qu.:25.00
## Max. :50.00
PCA only works on numeric variables, so we need to remove categorical or ordinal features. Some people one-hot encode them instead, but to me that does not make much sense for PCA.
BH <- BostonHousing %>% select(- c(chas))
For a proper prediction model we need to split our data into training and test sets. Here we do a basic 70/30 split with caret's createDataPartition().
set.seed(2020)
train_index <- createDataPartition(BH$medv, p = 0.7, list = F)
train <- BH[train_index,]
test <- BH[-train_index,]
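As a quick sanity check (assuming the split above has already been run), we can confirm the sizes of the two sets:
dim(train)  # roughly 70% of the 506 rows
dim(test)   # the remaining ~30%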
We fit the PCA on the training set; we will apply the same transformation to the test set later for prediction. We can see that PC1 alone explains a large share of the variance.
set.seed(2020)
BH_pca <- prcomp(train %>% select(-medv), center = TRUE, scale. = TRUE)
fviz_eig(BH_pca)
summary(BH_pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.4710 1.1685 1.09205 0.92934 0.80464 0.71262 0.6178
## Proportion of Variance 0.5088 0.1138 0.09938 0.07197 0.05395 0.04232 0.0318
## Cumulative Proportion 0.5088 0.6226 0.72199 0.79396 0.84792 0.89024 0.9220
## PC8 PC9 PC10 PC11 PC12
## Standard deviation 0.52750 0.46712 0.44758 0.41678 0.25506
## Proportion of Variance 0.02319 0.01818 0.01669 0.01448 0.00542
## Cumulative Proportion 0.94523 0.96341 0.98010 0.99458 1.00000
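Keeping three components is a judgment call. A minimal sketch of how we could instead pick the smallest number of components that reaches a chosen cumulative-variance threshold (the 90% cut-off is just an assumption for illustration):
# cumulative proportion of variance explained by the components
cum_var <- cumsum(BH_pca$sdev^2) / sum(BH_pca$sdev^2)
# smallest number of components reaching the (assumed) 90% threshold
which(cum_var >= 0.90)[1]
From the summary above, that threshold is first reached with 7 components.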
fviz_pca_var(BH_pca,
col.var = "cos2", # Color by contributions to the PC
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
)
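To see which original features drive each retained component, we can also inspect the loadings (the rotation matrix returned by prcomp); a minimal sketch:
# loadings: contribution of each original feature to PC1-PC3
round(BH_pca$rotation[, 1:3], 2)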
We create a training data set based on the first three principal components (PC1, PC2 and PC3). We now have only 3 predictor features instead of the 12 in the original dataset! Dimension reduction, remember :)
new_BH_pca <- cbind(medv = train$medv, BH_pca$x[,1:3]) %>% as_tibble()
head(new_BH_pca)
## # A tibble: 6 x 4
## medv PC1 PC2 PC3
## <dbl> <dbl> <dbl> <dbl>
## 1 24 -2.10 0.279 -1.08
## 2 21.6 -1.46 0.935 -0.172
## 3 34.7 -2.06 0.236 -0.846
## 4 33.4 -2.60 0.0204 -0.0776
## 5 36.2 -2.44 0.0721 -0.190
## 6 28.7 -2.21 0.488 0.365
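A nice property worth verifying: the principal component scores are (near-)uncorrelated by construction, which is one reason they play well with linear regression. A quick check:
# correlations between the retained component scores (should be ~0 off-diagonal)
round(cor(new_BH_pca[, c("PC1", "PC2", "PC3")]), 3)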
Let’s build a simple linear regression model on the principal components.
bh_m_PCA <- lm(medv ~ PC1 + PC2 + PC3 , data = new_BH_pca)
summary(bh_m_PCA)
##
## Call:
## lm(formula = medv ~ PC1 + PC2 + PC3, data = new_BH_pca)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.827 -2.966 -0.505 1.802 34.834
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.6975 0.3030 74.913 <2e-16 ***
## PC1 -2.2640 0.1228 -18.438 <2e-16 ***
## PC2 -1.3566 0.2597 -5.224 3e-07 ***
## PC3 -4.3615 0.2778 -15.698 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.717 on 352 degrees of freedom
## Multiple R-squared: 0.6355, Adjusted R-squared: 0.6324
## F-statistic: 204.6 on 3 and 352 DF, p-value: < 2.2e-16
We build another model using all the original features.
bh_m <- lm(medv ~ . , data = train)
summary(bh_m)
##
## Call:
## lm(formula = medv ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.4930 -2.8798 -0.7844 1.5707 25.9508
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.886566 6.456711 5.713 2.41e-08 ***
## crim -0.149127 0.044936 -3.319 0.001002 **
## zn 0.042555 0.016869 2.523 0.012099 *
## indus 0.087170 0.074299 1.173 0.241522
## nox -20.222886 4.566635 -4.428 1.28e-05 ***
## rm 4.061471 0.535160 7.589 3.07e-13 ***
## age 0.009415 0.016146 0.583 0.560194
## dis -1.468881 0.240243 -6.114 2.64e-09 ***
## rad 0.394694 0.083376 4.734 3.23e-06 ***
## tax -0.015324 0.004635 -3.306 0.001047 **
## ptratio -1.054124 0.163361 -6.453 3.75e-10 ***
## b 0.011969 0.003261 3.670 0.000281 ***
## lstat -0.546897 0.063261 -8.645 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.947 on 343 degrees of freedom
## Multiple R-squared: 0.734, Adjusted R-squared: 0.7247
## F-statistic: 78.87 on 12 and 343 DF, p-value: < 2.2e-16
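Before moving to the test set, the in-sample fits can be compared directly; a minimal sketch that simply pulls the R-squared values out of the two model summaries above:
summary(bh_m_PCA)$r.squared  # PCA model, about 0.64
summary(bh_m)$r.squared      # all-features model, about 0.73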
From the training fits we can see that the all-features model performs better than the PCA model (PC1, PC2 and PC3). Nevertheless, let us apply the PCA transformation to the test dataset and make predictions. The results are displayed in a Predicted vs Actual chart.
set.seed(2020)
rsq <- function (x, y) cor(x, y) ^ 2
new_BH_pca_test <- predict(BH_pca, test %>% select(-medv))
test_pca_df <- new_BH_pca_test[,1:3] %>% as_tibble()
y_test_pca <- predict(bh_m_PCA, test_pca_df)
y_test <- test$medv
rsq_pca1 <- round(rsq(y_test,y_test_pca),3)
df <- cbind(Actual = y_test, Predicted = y_test_pca) %>% as_tibble()
ggplot(df, aes(x = Actual, y = Predicted)) +
  geom_point() + ggtitle(paste("Test Accuracy for PCA model with R^2:", rsq_pca1)) +
  theme(plot.title = element_text(hjust = 0.5))
Now let us compare the test accuracy with the model that uses all features (no PCA).
rsq <- function (x, y) cor(x, y) ^ 2
y_test_all <- predict(bh_m, test %>% select(-medv))
y_test <- test$medv
rsq_all <- round(rsq(y_test,y_test_all),3)
df1 <- cbind(Actual = y_test, Predicted = y_test_all) %>% as_tibble()
ggplot(df1, aes(x = Actual, y = Predicted)) +
  geom_point() + ggtitle(paste("Test Accuracy for All-features model with R^2:", rsq_all)) +
  theme(plot.title = element_text(hjust = 0.5))
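To put the two test results side by side, here is a small comparison table (it simply reuses rsq_pca1 and rsq_all computed above):
data.frame(model = c("PCA (3 components)", "All features"),
           test_r_squared = c(rsq_pca1, rsq_all))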
We can see that the all-features model's Actual vs Predicted plot is closer to the diagonal line than the PCA model's. In addition, the test R^2 is better for the all-features model than for the PCA model (74% vs 70%). It is also clear from the training fits that the all-features model has better accuracy than the PCA model.
In conclusion, for this dataset PCA does not help improve accuracy. It is meant for dimensionality reduction and may not be useful when we only have a small number of features, as in the Boston Housing dataset. It will perhaps be helpful only when we have a very large number of features, but even then we can still experiment with other techniques.