Dimensionality reduction is the process of reducing the number of features (or dimensions) in a dataset while preserving as much meaningful information as possible. High-dimensional data can be computationally expensive, prone to overfitting, and difficult to visualize.
For example, if we have a dataset with 1000 variables, it is challenging to analyze patterns and relationships effectively. Dimensionality reduction techniques help us simplify the dataset without losing significant insights.
There are two major types: feature selection, where we keep only a subset of the original variables, and feature extraction, where we create a smaller set of new variables (such as principal components) that combine the original ones.
One of the reasons we use variable reduction techniques is to address multicollinearity. Multicollinearity arises when independent variables are highly correlated. Because such variables move together, it becomes difficult to separate their individual effects on the predicted variable, and it is even possible to obtain coefficients whose sign is opposite to the true relationship. Let's see an example of how multicollinearity can distort the relationships in a model.
For that purpose, we will use the dataset seatpos from the faraway package.
Task: Install and load library faraway
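A minimal way to do this is:

# Install the package (only needed once)
install.packages("faraway")
# Load it for the current session
library(faraway)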
This is a dataset about 38 car drivers and their seat position in the car, collected by the University of Michigan. The variables include the driver's age (Age), weight (Weight), height wearing shoes (HtShoes), barefoot height (Ht), seated height in centimeters (Seated), arm length (Arm), thigh length (Thigh) and lower leg length in centimeters (Leg). The purpose of our model is to explain hipcenter, which represents the horizontal distance from the midpoint of the hips to a fixed location in the car.
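To get a quick feel for the data, we can inspect its structure, for example:

# Structure of the seatpos data frame: 38 drivers, 9 numeric variables
str(seatpos)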
Not surprisingly, there are variables that are highly correlated, as is the case of HtShoes and Ht.
To assess the extent of the problem, we can examine the correlation matrix.
Task: Create the correlation matrix
## Age Weight HtShoes Ht Seated Arm Thigh Leg hipcenter
## Age 1.00 0.08 -0.08 -0.09 -0.17 0.36 0.09 -0.04 0.21
## Weight 0.08 1.00 0.83 0.83 0.78 0.70 0.57 0.78 -0.64
## HtShoes -0.08 0.83 1.00 1.00 0.93 0.75 0.72 0.91 -0.80
## Ht -0.09 0.83 1.00 1.00 0.93 0.75 0.73 0.91 -0.80
## Seated -0.17 0.78 0.93 0.93 1.00 0.63 0.61 0.81 -0.73
## Arm 0.36 0.70 0.75 0.75 0.63 1.00 0.67 0.75 -0.59
## Thigh 0.09 0.57 0.72 0.73 0.61 0.67 1.00 0.65 -0.59
## Leg -0.04 0.78 0.91 0.91 0.81 0.75 0.65 1.00 -0.79
## hipcenter 0.21 -0.64 -0.80 -0.80 -0.73 -0.59 -0.59 -0.79 1.00
round(cor(seatpos),2)
You can see that many variables are highly correlated. As expected, HtShoes and Ht have an essentially perfect correlation (1.00 after rounding).
We can run a regression analysis. The function for linear regression is lm. The hip center is the variable we want to explain. Let’s run a regression with all the other variables as explanatory variables and see the results.
##
## Call:
## lm(formula = hipcenter ~ ., data = seatpos)
##
## Residuals:
## Min 1Q Median 3Q Max
## -73.827 -22.833 -3.678 25.017 62.337
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 436.43213 166.57162 2.620 0.0138 *
## Age 0.77572 0.57033 1.360 0.1843
## Weight 0.02631 0.33097 0.080 0.9372
## HtShoes -2.69241 9.75304 -0.276 0.7845
## Ht 0.60134 10.12987 0.059 0.9531
## Seated 0.53375 3.76189 0.142 0.8882
## Arm -1.32807 3.90020 -0.341 0.7359
## Thigh -1.14312 2.66002 -0.430 0.6706
## Leg -6.43905 4.71386 -1.366 0.1824
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 37.72 on 29 degrees of freedom
## Multiple R-squared: 0.6866, Adjusted R-squared: 0.6001
## F-statistic: 7.94 on 8 and 29 DF, p-value: 1.306e-05
fit <- lm(hipcenter ~ ., data = seatpos)
summary(fit)
You can see that all of the individual coefficients appear insignificant. Although the correlation matrix showed strong relationships between hipcenter and several of the variables, this is not reflected in the regression coefficients. At the same time, both the R-squared and adjusted R-squared values are quite high, and the F-statistic indicates that the model as a whole has real explanatory power.
What exactly is the problem? Most likely the reason is multicollinearity.
### Variance Inflation Factor

One tool for detecting such issues is the Variance Inflation Factor (VIF). The VIF measures the strength of the correlation between one independent variable and the others: it is computed by taking each predictor in turn and regressing it against every other predictor. One function that does this is vif, which comes from the faraway package. The syntax is vif with the regression object inside the parentheses.
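Under the hood, the VIF of a predictor is 1 / (1 - R^2), where R^2 comes from regressing that predictor on all the other predictors. As a small sketch (assuming seatpos is loaded), the VIF for HtShoes can be reproduced by hand:

# Regress HtShoes on the other predictors (the response hipcenter is excluded)
r2 <- summary(lm(HtShoes ~ . - hipcenter, data = seatpos))$r.squared
# VIF = 1 / (1 - R^2); this should agree with the value reported by vif() below
1 / (1 - r2)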
## Age Weight HtShoes Ht Seated Arm Thigh
## 1.997931 3.647030 307.429378 333.137832 8.951054 4.496368 2.762886
## Leg
## 6.694291
vif(fit)
The output shows each predictor variable with its respective VIF value. As a rule of thumb, we want this number to be less than five; any variable above five is susceptible to multicollinearity issues. You can see that the values for HtShoes and Ht are extremely high, at roughly 307 and 333, but there are also problems with the Seated and Leg variables. So what can we do in this case?
We need to reduce the number of variables used in our model. Since collinear variables carry essentially the same information, we can keep just one from each collinear group. For example, we could run a model that includes only the Age, Weight and Ht variables.
Task: Run a regression using only Age, Weight and Ht as predictors, check the results and see how the VIF values have changed.
fit2 <- lm(hipcenter ~ Age + Weight + Ht, data = seatpos)
summary(fit2)
vif(fit2)
Let’s see how to use PCA for variable reduction in R. You can use any dataset you wish that has 10 or more dimensions/columns of numerical data. This guide, however, will be based on the mtcars dataset that has 11 dimensions and 32 rows (representing different car models). The data is included in the base (datasets) package of R:
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
The first things to check are: (1) the data is all numerical, (2) there are no missing values and (3) the variables are either all minimising or all maximising. You will find out that no changes need to be made to the mtcars dataset.
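These checks are easy to run in R, for example:

# Confirm that all columns are numeric and that there are no missing values
sapply(mtcars, is.numeric)
sum(is.na(mtcars))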
Next, it's a good idea to have a look at the correlation matrix of the data.
Task: Check the correlation matrix of the data
round(cor(mtcars), 3)
## mpg cyl disp hp drat wt qsec vs am gear
## mpg 1.000 -0.852 -0.848 -0.776 0.681 -0.868 0.419 0.664 0.600 0.480
## cyl -0.852 1.000 0.902 0.832 -0.700 0.782 -0.591 -0.811 -0.523 -0.493
## disp -0.848 0.902 1.000 0.791 -0.710 0.888 -0.434 -0.710 -0.591 -0.556
## hp -0.776 0.832 0.791 1.000 -0.449 0.659 -0.708 -0.723 -0.243 -0.126
## drat 0.681 -0.700 -0.710 -0.449 1.000 -0.712 0.091 0.440 0.713 0.700
## wt -0.868 0.782 0.888 0.659 -0.712 1.000 -0.175 -0.555 -0.692 -0.583
## qsec 0.419 -0.591 -0.434 -0.708 0.091 -0.175 1.000 0.745 -0.230 -0.213
## vs 0.664 -0.811 -0.710 -0.723 0.440 -0.555 0.745 1.000 0.168 0.206
## am 0.600 -0.523 -0.591 -0.243 0.713 -0.692 -0.230 0.168 1.000 0.794
## gear 0.480 -0.493 -0.556 -0.126 0.700 -0.583 -0.213 0.206 0.794 1.000
## carb -0.551 0.527 0.395 0.750 -0.091 0.428 -0.656 -0.570 0.058 0.274
## carb
## mpg -0.551
## cyl 0.527
## disp 0.395
## hp 0.750
## drat -0.091
## wt 0.428
## qsec -0.656
## vs -0.570
## am 0.058
## gear 0.274
## carb 1.000
From checking this, you can see that some variables are highly correlated whilst others are clearly not. As there are many variables, producing a very large correlation matrix, it may be more helpful to review the correlations using visualisation methods such as the following (a short code sketch is given after the list):
Using the corrgram function from the corrgram library. Dark blue indicates a high positive correlation whilst a dark red indicates a high negative correlation.
Using the plotcorr function from the ellipse library. Read more about this function with ?plotcorr.
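A short sketch of both options, assuming the corrgram and ellipse packages are installed:

# Shaded correlogram: dark blue = strong positive, dark red = strong negative
library(corrgram)
corrgram(mtcars)

# Ellipse-based correlation plot of the same matrix
library(ellipse)
plotcorr(cor(mtcars))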
Next we need to work out how many components to extract. We can do this by looking at the scree plot.
To produce the scree plot type the following command:
plot(prcomp(mtcars, scale = TRUE), type="line")
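Before reading the plot, it can also help to look at the component variances (the eigenvalues) directly, since the rules of thumb below are based on them:

# Variances of the principal components (squared standard deviations)
prcomp(mtcars, scale = TRUE)$sdev^2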
According to Kaiser's rule (keep the components whose variances, i.e. eigenvalues, are above 1) we would select 2 components. From looking at the scree plot (keep the components to the left of the elbow) you would take 2-3 components. Let's also check the summary to understand the cumulative explained variance.
summary(prcomp(mtcars, scale = TRUE))
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.5707 1.6280 0.79196 0.51923 0.47271 0.46000 0.3678
## Proportion of Variance 0.6008 0.2409 0.05702 0.02451 0.02031 0.01924 0.0123
## Cumulative Proportion 0.6008 0.8417 0.89873 0.92324 0.94356 0.96279 0.9751
## PC8 PC9 PC10 PC11
## Standard deviation 0.35057 0.2776 0.22811 0.1485
## Proportion of Variance 0.01117 0.0070 0.00473 0.0020
## Cumulative Proportion 0.98626 0.9933 0.99800 1.0000
We can see again that 2-3 components is a good choice.
From the previous table, we can see that 2 components account for 84% of the variance. We can then have a look at the two principal components using the following command:
prcomp(mtcars, scale = TRUE)$rotation[,1:2]
## PC1 PC2
## mpg -0.3625305 0.01612440
## cyl 0.3739160 0.04374371
## disp 0.3681852 -0.04932413
## hp 0.3300569 0.24878402
## drat -0.2941514 0.27469408
## wt 0.3461033 -0.14303825
## qsec -0.2004563 -0.46337482
## vs -0.3065113 -0.23164699
## am -0.2349429 0.42941765
## gear -0.2069162 0.46234863
## carb 0.2140177 0.41357106
It is quite difficult to interpret these loadings directly, so we need to rotate the components. First, though, we can have a look at a 2D biplot using:
biplot(prcomp(mtcars, scale = TRUE))
The plot shows the original variables and how they are placed in the 2D principal component space. You can see that cyl and disp load highly on the first component and hardly at all on the second (this can also be seen in the PC1 and PC2 values shown previously), while carb loads highly on the second component and not much on the first. The plot also shows the car models and where they sit in the same 2D space. You can view the PCA scores for each of the car models using:
prcomp(mtcars, scale = TRUE)$x[,1:2]
## PC1 PC2
## Mazda RX4 -0.6468627420 1.7081142
## Mazda RX4 Wag -0.6194831460 1.5256219
## Datsun 710 -2.7356242748 -0.1441501
## Hornet 4 Drive -0.3068606268 -2.3258038
## Hornet Sportabout 1.9433926844 -0.7425211
## Valiant -0.0552534228 -2.7421229
## Duster 360 2.9553851233 0.3296133
## Merc 240D -2.0229593244 -1.4421056
## Merc 230 -2.2513839535 -1.9522879
## Merc 280 -0.5180912217 -0.1594610
## Merc 280C -0.5011860079 -0.3187934
## Merc 450SE 2.2124096339 -0.6727099
## Merc 450SL 2.0155715693 -0.6724606
## Merc 450SLC 2.1147047372 -0.7891129
## Cadillac Fleetwood 3.8383725118 -0.8149087
## Lincoln Continental 3.8918495626 -0.7218314
## Chrysler Imperial 3.5363862158 -0.4145024
## Fiat 128 -3.7955510831 -0.2920783
## Honda Civic -4.1870356784 0.6775721
## Toyota Corolla -4.1675359344 -0.2748890
## Toyota Corona -1.8741790870 -2.0864529
## Dodge Challenger 2.1504414942 -0.9982442
## AMC Javelin 1.8340369797 -0.8921886
## Camaro Z28 2.8434957523 0.6701037
## Pontiac Firebird 2.2105479148 -0.8600504
## Fiat X1-9 -3.5176818134 -0.1192950
## Porsche 914-2 -2.6095003965 2.0141425
## Lotus Europa -3.3323844512 1.3568877
## Ford Pantera L 1.3513346957 3.4448780
## Ferrari Dino -0.0009743305 3.1683750
## Maserati Bora 2.6270897605 4.3107016
## Volvo 142E -2.3824711412 0.2299603
For example, you can see from the scores how 2.63 on PC1 and 4.31 on PC2 place the Maserati Bora at the top of the plot and slightly to the right. To make the components easier to interpret we can use rotation. The psych library contains the functions for this (together with GPArotation, which it needs for the rotations), so load both libraries:
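Assuming both packages are already installed, loading them looks like this:

# psych provides principal(); GPArotation is needed for the rotations
library(psych)
library(GPArotation)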
We can try varimax rotation using:
principal(mtcars, nfactors = 2, rotate="varimax")
## Principal Components Analysis
## Call: principal(r = mtcars, nfactors = 2, rotate = "varimax")
## Standardized loadings (pattern matrix) based upon correlation matrix
## RC1 RC2 h2 u2 com
## mpg 0.68 -0.63 0.87 0.131 2.0
## cyl -0.64 0.72 0.93 0.071 2.0
## disp -0.73 0.60 0.90 0.098 1.9
## hp -0.32 0.88 0.88 0.116 1.3
## drat 0.85 -0.21 0.77 0.228 1.1
## wt -0.80 0.46 0.85 0.154 1.6
## qsec -0.16 -0.90 0.83 0.165 1.1
## vs 0.30 -0.82 0.76 0.237 1.3
## am 0.92 0.08 0.85 0.146 1.0
## gear 0.91 0.17 0.85 0.150 1.1
## carb 0.08 0.87 0.76 0.244 1.0
##
## RC1 RC2
## SS loadings 4.67 4.59
## Proportion Var 0.42 0.42
## Cumulative Var 0.42 0.84
## Proportion Explained 0.50 0.50
## Cumulative Proportion 0.50 1.00
##
## Mean item complexity = 1.4
## Test of the hypothesis that 2 components are sufficient.
##
## The root mean square of the residuals (RMSR) is 0.05
## with the empirical chi square 9.45 with prob < 1
##
## Fit based upon off diagonal values = 0.99
You can see that now mpg, drat, am and gear load heavily on the first component and cyl, disp, hp and carb load heavily on the second component. We can also view a biplot of the rotated components using:
biplot(principal(mtcars, nfactors = 2, rotate="varimax"))
If you want to include the car labels on the plot you can use: biplot(principal(mtcars, nfactors = 2, rotate="varimax"), lab=row.names(mtcars))
You can also look at the scores for the rotated components using:
principal(mtcars, nfactors = 2, rotate="varimax")$scores
## RC1 RC2
## Mazda RX4 0.913545261 0.5740734
## Mazda RX4 Wag 0.827547997 0.5013891
## Datsun 710 0.698816380 -0.8074274
## Hornet 4 Drive -0.913638812 -1.1047271
## Hornet Sportabout -0.859350144 0.2025926
## Valiant -1.162422544 -1.2190911
## Duster 360 -0.680268641 0.9486423
## Merc 240D -0.056857408 -1.1835017
## Merc 230 -0.212468151 -1.4696568
## Merc 280 0.075581300 -0.2109478
## Merc 280C 0.002444315 -0.2763119
## Merc 450SE -0.904174186 0.3064233
## Merc 450SL -0.849329738 0.2529898
## Merc 450SLC -0.927001103 0.2287337
## Cadillac Fleetwood -1.417403948 0.6862707
## Lincoln Continental -1.392296773 0.7416873
## Chrysler Imperial -1.161445427 0.7799435
## Fiat 128 0.930026590 -1.1606987
## Honda Civic 1.455372982 -0.8414166
## Toyota Corolla 1.040852468 -1.2543364
## Toyota Corona -0.374988630 -1.4259630
## Dodge Challenger -1.026764303 0.1466255
## AMC Javelin -0.893224853 0.1071275
## Camaro Z28 -0.502907752 1.0677154
## Pontiac Firebird -0.984122353 0.2236560
## Fiat X1-9 0.926969145 -1.0092451
## Porsche 914-2 1.590766193 0.1745827
## Lotus Europa 1.509486177 -0.3106522
## Ford Pantera L 1.103849172 1.8802235
## Ferrari Dino 1.361140948 1.3909629
## Maserati Bora 1.120968759 2.6074298
## Volvo 142E 0.761297080 -0.5470932
After performing the PCA with varimax rotation, we have two rotated components (RC1 and RC2) that together explain 84% of the variance. Now we will proceed by using these components in a regression model to predict the target variable (here, mpg).
Step 1: Extract PCA Scores After Varimax Rotation

We need to extract the rotated PCA scores for each car model and use them as the independent variables in a regression model. We remove the first column here because mpg will be used as the dependent variable.
# Extract rotated principal components using Varimax rotation
pca_rotated <- principal(mtcars[, -1], nfactors = 2, rotate = "varimax")
# View the rotated PCA scores for each car model
pca_scores <- as.data.frame(pca_rotated$scores)
# Add the dependent variable (mpg) to the PCA scores dataframe
pca_scores$mpg <- mtcars$mpg
# Check the new dataframe
head(pca_scores)
## RC1 RC2 mpg
## Mazda RX4 0.5580509 0.9287428 21.0
## Mazda RX4 Wag 0.4868240 0.8413597 21.0
## Datsun 710 -0.8669120 0.7575937 22.8
## Hornet 4 Drive -1.0834612 -0.9386737 21.4
## Hornet Sportabout 0.2606309 -0.9199634 18.7
## Valiant -1.2444094 -1.1321068 18.1
Here, the pca_scores dataframe now contains the first two rotated principal components and the target variable mpg.
Step 2: Perform Regression Using PCA Components

We now use the two rotated principal components (RC1 and RC2) to predict mpg. This allows us to avoid multicollinearity issues while retaining most of the variance.
# Run linear regression using the rotated principal components as predictors
pca_lm <- lm(mpg ~ RC1 + RC2, data=pca_scores)
# Display the regression summary
summary(pca_lm)
##
## Call:
## lm(formula = mpg ~ RC1 + RC2, data = pca_scores)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.3611 -1.7263 -0.3322 1.3208 5.6763
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.0906 0.4591 43.760 < 2e-16 ***
## RC1 -3.7487 0.4665 -8.037 7.31e-09 ***
## RC2 3.9952 0.4665 8.565 1.96e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.597 on 29 degrees of freedom
## Multiple R-squared: 0.8263, Adjusted R-squared: 0.8143
## F-statistic: 68.97 on 2 and 29 DF, p-value: 9.493e-12
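As a quick sanity check (a sketch using the pca_scores data frame created above), we can verify that the two rotated components are essentially uncorrelated, so this model does not suffer from the multicollinearity problem we saw with the seatpos regression:

# Correlation between the two rotated component scores; it should be close to 0
round(cor(pca_scores$RC1, pca_scores$RC2), 3)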
Step 3: Compare With the Original Model

Now let's compare the model using the PCA components with the original model that uses all the predictors. We'll look at the R-squared values of both models to assess their performance.
# Original regression model using all variables
original_lm <- lm(mpg ~ ., data = mtcars)
# Compare R-squared values of both models
cat("Original Model R-squared:", summary(original_lm)$r.squared, "\n")
## Original Model R-squared: 0.8690158
cat("PCA Model R-squared:", summary(pca_lm)$r.squared, "\n")
## PCA Model R-squared: 0.8262911
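Because the original model uses ten predictors and the PCA model only two, it may also be worth comparing the adjusted R-squared values, which penalise the extra predictors:

# Adjusted R-squared accounts for the number of predictors in each model
cat("Original Model Adjusted R-squared:", summary(original_lm)$adj.r.squared, "\n")
cat("PCA Model Adjusted R-squared:", summary(pca_lm)$adj.r.squared, "\n")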