When a dataset has many predictors, using all of them in a regression model can cause problems: collinearity among the predictors makes the coefficients hard to interpret, and an overfit model can degrade prediction performance. In such cases it is often better to reduce the dimension of the data before modeling.
Principal component analysis (PCA) is one of a family of techniques for dealing with high-dimensional data: it exploits the dependencies among the variables to represent the data in a lower-dimensional form without losing too much information. It is one of the simplest ways to perform dimensionality reduction. Each new index variable, called a principal component, is a linear combination (weighted sum) of the original variables, and the components are uncorrelated with one another.
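To make the idea of a linear combination concrete, here is a minimal sketch (on toy data, not the housing data used below) of PCA computed by hand from the eigendecomposition of the correlation matrix; the prcomp function used later wraps the same computation.
# minimal sketch: PCA by hand on toy data via the eigendecomposition
set.seed(1)
X <- matrix(rnorm(100 * 3), ncol = 3)   # toy data: 100 observations, 3 variables
X_scaled <- scale(X)                    # center and scale each column
eig <- eigen(cor(X))                    # eigenvectors = loadings, eigenvalues = variances
pc_scores <- X_scaled %*% eig$vectors   # each component is a weighted sum of the variables
round(eig$values / sum(eig$values), 2)  # proportion of variance per component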
To demonstrate PCA, we will use the Boston housing dataset, which has 506 records and 14 variables (named below).
# load required packages: data.table (fread), GGally (ggcorr), psych (describe), factoextra (fviz_eig)
library(data.table)
library(GGally)
library(psych)
library(factoextra)
# housing data
housing <- fread("https://raw.githubusercontent.com/amit-kapoor/data621/main/blog1/housing.csv", header = FALSE)
# assign column names
colnames(housing) <- c("CRIM","ZN","INDUS","CHAS","NOX","RM","AGE","DIS","RAD","TAX","PTRATIO","B","LSTAT", "MEDV")
head(housing)
## CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
## 1: 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98
## 2: 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14
## 3: 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03
## 4: 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94
## 5: 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33
## 6: 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21
## MEDV
## 1: 24.0
## 2: 21.6
## 3: 34.7
## 4: 33.4
## 5: 36.2
## 6: 28.7
# data dimension
dim(housing)
## [1] 506 14
# correlation
ggcorr(housing, label = TRUE) + labs(title = "Correlation of variables")
We can see that several variables are highly correlated (for example, RAD and TAX), which confirms the collinearity concern raised above.
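As a quick numeric check (the 0.7 cutoff here is an arbitrary choice for illustration), we can list the variable pairs whose absolute correlation exceeds it.
# list variable pairs with absolute correlation above 0.7
corr_mat <- cor(housing)
corr_mat[lower.tri(corr_mat, diag = TRUE)] <- NA   # keep each pair once
high_corr <- which(abs(corr_mat) > 0.7, arr.ind = TRUE)
data.frame(var1 = rownames(corr_mat)[high_corr[, 1]],
           var2 = colnames(corr_mat)[high_corr[, 2]],
           corr = round(corr_mat[high_corr], 2))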
# describe data
describe(housing)[-c(1)]
## n mean sd median trimmed mad min max range skew
## CRIM 506 3.61 8.60 0.26 1.68 0.33 0.01 88.98 88.97 5.19
## ZN 506 11.36 23.32 0.00 5.08 0.00 0.00 100.00 100.00 2.21
## INDUS 506 11.14 6.86 9.69 10.93 9.37 0.46 27.74 27.28 0.29
## CHAS 506 0.07 0.25 0.00 0.00 0.00 0.00 1.00 1.00 3.39
## NOX 506 0.55 0.12 0.54 0.55 0.13 0.38 0.87 0.49 0.72
## RM 506 6.28 0.70 6.21 6.25 0.51 3.56 8.78 5.22 0.40
## AGE 506 68.57 28.15 77.50 71.20 28.98 2.90 100.00 97.10 -0.60
## DIS 506 3.80 2.11 3.21 3.54 1.91 1.13 12.13 11.00 1.01
## RAD 506 9.55 8.71 5.00 8.73 2.97 1.00 24.00 23.00 1.00
## TAX 506 408.24 168.54 330.00 400.04 108.23 187.00 711.00 524.00 0.67
## PTRATIO 506 18.46 2.16 19.05 18.66 1.70 12.60 22.00 9.40 -0.80
## B 506 356.67 91.29 391.44 383.17 8.09 0.32 396.90 396.58 -2.87
## LSTAT 506 12.65 7.14 11.36 11.90 7.11 1.73 37.97 36.24 0.90
## MEDV 506 22.53 9.20 21.20 21.56 5.93 5.00 50.00 45.00 1.10
## kurtosis se
## CRIM 36.60 0.38
## ZN 3.95 1.04
## INDUS -1.24 0.30
## CHAS 9.48 0.01
## NOX -0.09 0.01
## RM 1.84 0.03
## AGE -0.98 1.25
## DIS 0.46 0.09
## RAD -0.88 0.39
## TAX -1.15 7.49
## PTRATIO -0.30 0.10
## B 7.10 4.06
## LSTAT 0.46 0.32
## MEDV 1.45 0.41
Next we use the prcomp function, which performs a principal component analysis on the given data matrix and returns the results as a prcomp object. Setting center = TRUE and scale. = TRUE standardizes each variable first, which matters here because the variables are on very different scales (compare, say, TAX and NOX above).
pca_housing <- prcomp(housing, center = TRUE, scale. = TRUE)
summary(pca_housing)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.5585 1.2843 1.16142 0.94156 0.92244 0.81241 0.73172
## Proportion of Variance 0.4676 0.1178 0.09635 0.06332 0.06078 0.04714 0.03824
## Cumulative Proportion 0.4676 0.5854 0.68174 0.74507 0.80585 0.85299 0.89123
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.63488 0.5266 0.50225 0.4613 0.42777 0.36607 0.24561
## Proportion of Variance 0.02879 0.0198 0.01802 0.0152 0.01307 0.00957 0.00431
## Cumulative Proportion 0.92003 0.9398 0.95785 0.9730 0.98612 0.99569 1.00000
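The proportions reported by summary can be reproduced directly from the component standard deviations, since each component's variance is the square of its standard deviation.
# reproduce the cumulative proportion row from $sdev
var_explained <- pca_housing$sdev^2 / sum(pca_housing$sdev^2)
round(cumsum(var_explained), 4)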
# $x holds the principal component scores (one column per component)
dim(pca_housing$x)
## [1] 506 14
# standard deviations used for scaling
pca_housing$scale
## CRIM ZN INDUS CHAS NOX RM
## 8.6015451 23.3224530 6.8603529 0.2539940 0.1158777 0.7026171
## AGE DIS RAD TAX PTRATIO B
## 28.1488614 2.1057101 8.7072594 168.5371161 2.1649455 91.2948644
## LSTAT MEDV
## 7.1410615 9.1971041
# means used for centering
pca_housing$center
## CRIM ZN INDUS CHAS NOX RM
## 3.61352356 11.36363636 11.13677866 0.06916996 0.55469506 6.28463439
## AGE DIS RAD TAX PTRATIO B
## 68.57490119 3.79504269 9.54940711 408.23715415 18.45553360 356.67403162
## LSTAT MEDV
## 12.65306324 22.53280632
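As a sanity check, $center and $scale are simply the column means and standard deviations of the original data.
# center and scale match the column means and standard deviations of housing
all.equal(pca_housing$center, colMeans(housing))
all.equal(pca_housing$scale, apply(housing, 2, sd))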
# loadings of the first principal component
round(pca_housing$rotation[,1],2)
## CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX
## 0.24 -0.25 0.33 -0.01 0.33 -0.20 0.30 -0.30 0.30 0.32
## PTRATIO B LSTAT MEDV
## 0.21 -0.20 0.31 -0.27
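We can also verify that the scores in $x are exactly the standardized data multiplied by the loadings, which is the weighted-sum definition of a principal component given in the introduction.
# PC1 scores equal the standardized data times the first loading vector
scaled_data <- scale(housing, center = pca_housing$center, scale = pca_housing$scale)
all.equal(as.numeric(scaled_data %*% pca_housing$rotation[, 1]),
          as.numeric(pca_housing$x[, 1]))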
Next we look at the scree plot, which shows how much of the variance each principal component explains (fviz_eig plots the percentage of explained variance per component).
# scree plot
fviz_eig(pca_housing)
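One common rule of thumb, the Kaiser criterion, is to retain components whose eigenvalue exceeds 1; applied here it suggests keeping the first three components, which is the choice made below.
# Kaiser criterion: keep components whose variance (eigenvalue) exceeds 1
eigenvalues <- pca_housing$sdev^2
which(eigenvalues > 1)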
Finally, we fit two models: a full model using all of the original predictors, and a second model using the first three principal components identified above.
set.seed(317)
# full model using all predictors
housing.fullmodel <- lm(MEDV ~ ., data = housing)
summary(housing.fullmodel)
##
## Call:
## lm(formula = MEDV ~ ., data = housing)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.595 -2.730 -0.518 1.777 26.199
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.646e+01 5.103e+00 7.144 3.28e-12 ***
## CRIM -1.080e-01 3.286e-02 -3.287 0.001087 **
## ZN 4.642e-02 1.373e-02 3.382 0.000778 ***
## INDUS 2.056e-02 6.150e-02 0.334 0.738288
## CHAS 2.687e+00 8.616e-01 3.118 0.001925 **
## NOX -1.777e+01 3.820e+00 -4.651 4.25e-06 ***
## RM 3.810e+00 4.179e-01 9.116 < 2e-16 ***
## AGE 6.922e-04 1.321e-02 0.052 0.958229
## DIS -1.476e+00 1.995e-01 -7.398 6.01e-13 ***
## RAD 3.060e-01 6.635e-02 4.613 5.07e-06 ***
## TAX -1.233e-02 3.760e-03 -3.280 0.001112 **
## PTRATIO -9.527e-01 1.308e-01 -7.283 1.31e-12 ***
## B 9.312e-03 2.686e-03 3.467 0.000573 ***
## LSTAT -5.248e-01 5.072e-02 -10.347 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.745 on 492 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7338
## F-statistic: 108.1 on 13 and 492 DF, p-value: < 2.2e-16
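To quantify the collinearity among the original predictors, one option (assuming the car package is installed) is to compute variance inflation factors for the full model; large values indicate predictors that are highly correlated with the others.
# variance inflation factors for the full model
library(car)
vif(housing.fullmodel)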
set.seed(317)
# model using the first 3 principal components
housing.pcamodel <- lm(housing$MEDV ~ pca_housing$x[,1:3])
summary(housing.pcamodel)
##
## Call:
## lm(formula = housing$MEDV ~ pca_housing$x[, 1:3])
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.0345 -2.1015 -0.0748 1.6409 24.3703
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.53281 0.17015 132.43 <2e-16 ***
## pca_housing$x[, 1:3]PC1 -2.45228 0.06657 -36.84 <2e-16 ***
## pca_housing$x[, 1:3]PC2 4.09202 0.13261 30.86 <2e-16 ***
## pca_housing$x[, 1:3]PC3 1.50086 0.14664 10.23 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.827 on 502 degrees of freedom
## Multiple R-squared: 0.8278, Adjusted R-squared: 0.8268
## F-statistic: 804.7 on 3 and 502 DF, p-value: < 2.2e-16
Comparing the two models, the PCA model explains about 83% of the variability in MEDV with just three components, while the full model using all 13 predictors has \(R^2 = 0.74\). Note, however, that the principal components were computed from all 14 variables, including MEDV itself, which contributes to the PCA model's high \(R^2\).
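As a final sketch, using the PCA model to score new observations requires projecting them onto the components first; here we simply reuse the first five rows for illustration. Because MEDV was included when the components were computed, genuinely new data would also need a MEDV column, which is a limitation of building the components from all 14 variables.
# project (illustrative) new data onto the components, then apply the coefficients
new_scores <- predict(pca_housing, newdata = housing[1:5, ])
cbind(1, new_scores[, 1:3]) %*% coef(housing.pcamodel)   # intercept plus PC1-PC3 terms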