One of the reasons we use variable reduction techniques is multicollinearity. Multicollinearity exists when independent variables are highly correlated. Because such variables move together, it is hard to distinguish their individual effects on the predicted variable. If you try to model a relationship using highly correlated variables, it will be difficult to recover the true relationship, and you may even obtain coefficients whose signs point in the opposite direction to the true relationship. Let's see an example of how multicollinearity can distort the relationships in a model.
For that purpose we are going to use a dataset named seatpos from the package faraway.
Task: Install and load library faraway
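One way to complete this task (the package only needs installing once):
install.packages("faraway")  # run once
library(faraway)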
This is a dataset about 38 car drivers and their seat position in the car, collected at the University of Michigan. Among the variables we have the age of the driver Age, the weight of the driver Weight, the height of the driver while wearing shoes HtShoes, the height of the driver barefoot Ht, the seated height in centimeters Seated, the arm length Arm, the thigh length Thigh, and the lower leg length in centimeters Leg. The purpose of our model is to explain the hip center hipcenter, which is the horizontal distance of the midpoint of the hips from a fixed location in the car.
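Before modelling, it helps to glance at the data:
str(seatpos)   # 38 observations of 9 numeric variables
head(seatpos)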
Not surprisingly, there are variables that are highly correlated, as is the case of HtShoes and Ht.
To understand the magnitude of the problem we can check the correlation matrix.
Task: Create the correlation matrix
round(cor(seatpos), 2)
## Age Weight HtShoes Ht Seated Arm Thigh Leg hipcenter
## Age 1.00 0.08 -0.08 -0.09 -0.17 0.36 0.09 -0.04 0.21
## Weight 0.08 1.00 0.83 0.83 0.78 0.70 0.57 0.78 -0.64
## HtShoes -0.08 0.83 1.00 1.00 0.93 0.75 0.72 0.91 -0.80
## Ht -0.09 0.83 1.00 1.00 0.93 0.75 0.73 0.91 -0.80
## Seated -0.17 0.78 0.93 0.93 1.00 0.63 0.61 0.81 -0.73
## Arm 0.36 0.70 0.75 0.75 0.63 1.00 0.67 0.75 -0.59
## Thigh 0.09 0.57 0.72 0.73 0.61 0.67 1.00 0.65 -0.59
## Leg -0.04 0.78 0.91 0.91 0.81 0.75 0.65 1.00 -0.79
## hipcenter 0.21 -0.64 -0.80 -0.80 -0.73 -0.59 -0.59 -0.79 1.00
You can see that many variables are highly correlated. As expected, the correlation between HtShoes and Ht rounds to a perfect 1.00.
We can run a regression analysis. The function for linear regression is lm. The hip center is the variable we want to explain. Let’s run a regression with all the other variables as explanatory variables and see the results.
fit <- lm(hipcenter ~ ., data = seatpos)
summary(fit)
##
## Call:
## lm(formula = hipcenter ~ ., data = seatpos)
##
## Residuals:
## Min 1Q Median 3Q Max
## -73.827 -22.833 -3.678 25.017 62.337
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 436.43213 166.57162 2.620 0.0138 *
## Age 0.77572 0.57033 1.360 0.1843
## Weight 0.02631 0.33097 0.080 0.9372
## HtShoes -2.69241 9.75304 -0.276 0.7845
## Ht 0.60134 10.12987 0.059 0.9531
## Seated 0.53375 3.76189 0.142 0.8882
## Arm -1.32807 3.90020 -0.341 0.7359
## Thigh -1.14312 2.66002 -0.430 0.6706
## Leg -6.43905 4.71386 -1.366 0.1824
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 37.72 on 29 degrees of freedom
## Multiple R-squared: 0.6866, Adjusted R-squared: 0.6001
## F-statistic: 7.94 on 8 and 29 DF, p-value: 1.306e-05
You can see here that none of the predictor variables is statistically significant. So although the correlation matrix showed a high correlation between hipcenter and several variables, this is not reflected in the individual coefficients of the model. However, the R-squared and the adjusted R-squared are quite high, and the F-statistic indicates that the model as a whole has predictive power.
What exactly is the problem? The most likely culprit is multicollinearity.
One tool that you can use to detect such effects is the variance inflation factor, known as VIF. The VIF measures how strongly each independent variable is correlated with the other independent variables. It is computed by taking each predictor in turn and regressing it against all the other predictors: if R2_j is the R-squared of that regression, then VIF_j = 1 / (1 - R2_j).
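As a minimal sketch, you can reproduce the VIF of HtShoes by hand, assuming seatpos is loaded:
# R-squared from regressing HtShoes on all the other predictors
# (the response hipcenter is excluded from the formula)
r2 <- summary(lm(HtShoes ~ . - hipcenter, data = seatpos))$r.squared
1 / (1 - r2)  # about 307, matching the vif() output shown below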
### Variance Inflation Factor
One function to compute this is vif, which comes from the package faraway. The syntax is vif with the regression object inside the parentheses.
vif(fit)
## Age Weight HtShoes Ht Seated Arm Thigh
## 1.997931 3.647030 307.429378 333.137832 8.951054 4.496368 2.762886
## Leg
## 6.694291
The output shows all the predictor variables and their respective VIF values. We want each value to be below five; any variable above five is susceptible to multicollinearity issues. You can see that the values for HtShoes and Ht are extremely high: about 307 and 333. At the same time, the Seated and Leg variables also exceed the threshold, at about 8.95 and 6.69. So what can we do in this case?
We need to reduce the number of variables in our model. Since collinear variables carry essentially the same information, we can keep just one of them. For example, we could run a model that includes only the Age, Weight and Ht variables.
Task: Run a regression using only Age, Weight and Ht as predictors, check the results and see how the VIF values have changed.
fit2 <- lm(hipcenter ~ Age + Weight + Ht, data = seatpos)
summary(fit2)
vif(fit2)
### Principal Component Analysis
Let's see how to use PCA for variable reduction in R. You can use any dataset you wish that has 10 or more dimensions/columns of numerical data. This guide, however, will be based on the mtcars dataset, which has 11 dimensions and 32 rows (representing different car models). The data is included in the base (datasets) package of R:
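Typing the dataset's name prints it in full:
mtcars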
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
The first things to check are that (1) the data is all numerical, (2) there are no missing values, and (3) the variables are either all minimising or all maximising. You will find that no changes need to be made to the mtcars dataset.
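A quick way to verify the first two checks:
str(mtcars)        # all 11 columns are numeric
sum(is.na(mtcars)) # returns 0: no missing values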
Next, it's a good idea to have a look at the correlation matrix of the data.
Task: Check the correlation matrix of the data
round(cor(mtcars), 3)
## mpg cyl disp hp drat wt qsec vs am gear
## mpg 1.000 -0.852 -0.848 -0.776 0.681 -0.868 0.419 0.664 0.600 0.480
## cyl -0.852 1.000 0.902 0.832 -0.700 0.782 -0.591 -0.811 -0.523 -0.493
## disp -0.848 0.902 1.000 0.791 -0.710 0.888 -0.434 -0.710 -0.591 -0.556
## hp -0.776 0.832 0.791 1.000 -0.449 0.659 -0.708 -0.723 -0.243 -0.126
## drat 0.681 -0.700 -0.710 -0.449 1.000 -0.712 0.091 0.440 0.713 0.700
## wt -0.868 0.782 0.888 0.659 -0.712 1.000 -0.175 -0.555 -0.692 -0.583
## qsec 0.419 -0.591 -0.434 -0.708 0.091 -0.175 1.000 0.745 -0.230 -0.213
## vs 0.664 -0.811 -0.710 -0.723 0.440 -0.555 0.745 1.000 0.168 0.206
## am 0.600 -0.523 -0.591 -0.243 0.713 -0.692 -0.230 0.168 1.000 0.794
## gear 0.480 -0.493 -0.556 -0.126 0.700 -0.583 -0.213 0.206 0.794 1.000
## carb -0.551 0.527 0.395 0.750 -0.091 0.428 -0.656 -0.570 0.058 0.274
## carb
## mpg -0.551
## cyl 0.527
## disp 0.395
## hp 0.750
## drat -0.091
## wt 0.428
## qsec -0.656
## vs -0.570
## am 0.058
## gear 0.274
## carb 1.000
From checking this, you can see that some variables are highly correlated whilst others are clearly not. As there are many variables, producing a very large correlation matrix, it may be more helpful to review the correlations using visualisation methods such as the following, both sketched below:
- Using the corrgram function from the corrgram library. Dark blue indicates a high positive correlation whilst dark red indicates a high negative correlation.
- Using the plotcorr function from the ellipse library. Read more about this function with ?plotcorr.
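A minimal sketch of both approaches, assuming the corrgram and ellipse packages are installed:
library(corrgram)
library(ellipse)
corrgram(mtcars)       # shaded correlogram: blue positive, red negative
plotcorr(cor(mtcars))  # one ellipse per pair; narrower means stronger correlation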
Next we need to work out how many components to extract. We can do this by looking at the scree plot.
To produce the scree plot type the following command:
plot(prcomp(mtcars, scale = TRUE), type="line")
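The heights in the scree plot are the component variances (eigenvalues), which you can also print directly:
prcomp(mtcars, scale = TRUE)$sdev^2  # only the first two values exceed 1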
According to Kaiser's rule (keep the components whose eigenvalue is above 1) we would select 2 components. From looking at the scree plot (keep the components to the left of the elbow) you would take 2-3 components. Let's also check the summary to understand the cumulative explained variance.
summary(prcomp(mtcars, scale = TRUE))
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.5707 1.6280 0.79196 0.51923 0.47271 0.46000 0.3678
## Proportion of Variance 0.6008 0.2409 0.05702 0.02451 0.02031 0.01924 0.0123
## Cumulative Proportion 0.6008 0.8417 0.89873 0.92324 0.94356 0.96279 0.9751
## PC8 PC9 PC10 PC11
## Standard deviation 0.35057 0.2776 0.22811 0.1485
## Proportion of Variance 0.01117 0.0070 0.00473 0.0020
## Cumulative Proportion 0.98626 0.9933 0.99800 1.0000
We can see again that 2-3 components is a good choice.
From the previous table, we can see that 2 components account for 84% of the variance. We can then have a look at the loadings of the first two principal components using the following command:
prcomp(mtcars, scale = TRUE)$rotation[,1:2]
## PC1 PC2
## mpg -0.3625305 0.01612440
## cyl 0.3739160 0.04374371
## disp 0.3681852 -0.04932413
## hp 0.3300569 0.24878402
## drat -0.2941514 0.27469408
## wt 0.3461033 -0.14303825
## qsec -0.2004563 -0.46337482
## vs -0.3065113 -0.23164699
## am -0.2349429 0.42941765
## gear -0.2069162 0.46234863
## carb 0.2140177 0.41357106
It is quite difficult to interpret this, so we need to rotate the components. First, though, we can have a look at a 2D biplot using:
biplot(prcomp(mtcars, scale = TRUE))
The plot shows the original variables and how they are placed on the 2D principal component space. You can see that cyl and disp load highly on the first component and hardly at all on the second component (this can also be seen in the PC1 and PC2 values shown previously). You can also see that carb loads highly on the second component and not much on the first component. The plot also shows the car models and where they are placed in the 2D principal component space. You can view the PCA scores for each of the car models using:
prcomp(mtcars, scale = TRUE)$x[,1:2]
## PC1 PC2
## Mazda RX4 -0.6468627420 1.7081142
## Mazda RX4 Wag -0.6194831460 1.5256219
## Datsun 710 -2.7356242748 -0.1441501
## Hornet 4 Drive -0.3068606268 -2.3258038
## Hornet Sportabout 1.9433926844 -0.7425211
## Valiant -0.0552534228 -2.7421229
## Duster 360 2.9553851233 0.3296133
## Merc 240D -2.0229593244 -1.4421056
## Merc 230 -2.2513839535 -1.9522879
## Merc 280 -0.5180912217 -0.1594610
## Merc 280C -0.5011860079 -0.3187934
## Merc 450SE 2.2124096339 -0.6727099
## Merc 450SL 2.0155715693 -0.6724606
## Merc 450SLC 2.1147047372 -0.7891129
## Cadillac Fleetwood 3.8383725118 -0.8149087
## Lincoln Continental 3.8918495626 -0.7218314
## Chrysler Imperial 3.5363862158 -0.4145024
## Fiat 128 -3.7955510831 -0.2920783
## Honda Civic -4.1870356784 0.6775721
## Toyota Corolla -4.1675359344 -0.2748890
## Toyota Corona -1.8741790870 -2.0864529
## Dodge Challenger 2.1504414942 -0.9982442
## AMC Javelin 1.8340369797 -0.8921886
## Camaro Z28 2.8434957523 0.6701037
## Pontiac Firebird 2.2105479148 -0.8600504
## Fiat X1-9 -3.5176818134 -0.1192950
## Porsche 914-2 -2.6095003965 2.0141425
## Lotus Europa -3.3323844512 1.3568877
## Ford Pantera L 1.3513346957 3.4448780
## Ferrari Dino -0.0009743305 3.1683750
## Maserati Bora 2.6270897605 4.3107016
## Volvo 142E -2.3824711412 0.2299603
For example, you can see from the scores how a value of 2.63 on PC1 and 4.31 on PC2 places the Maserati Bora at the top of the plot and slightly to the right. To make the components easier to interpret we can use rotation. The psych library contains functions for this, so load the following libraries:
library(psych)
library(GPArotation)
We can try varimax rotation using:
principal(mtcars, nfactors = 2, rotate="varimax")
## Principal Components Analysis
## Call: principal(r = mtcars, nfactors = 2, rotate = "varimax")
## Standardized loadings (pattern matrix) based upon correlation matrix
## RC1 RC2 h2 u2 com
## mpg 0.68 -0.63 0.87 0.131 2.0
## cyl -0.64 0.72 0.93 0.071 2.0
## disp -0.73 0.60 0.90 0.098 1.9
## hp -0.32 0.88 0.88 0.116 1.3
## drat 0.85 -0.21 0.77 0.228 1.1
## wt -0.80 0.46 0.85 0.154 1.6
## qsec -0.16 -0.90 0.83 0.165 1.1
## vs 0.30 -0.82 0.76 0.237 1.3
## am 0.92 0.08 0.85 0.146 1.0
## gear 0.91 0.17 0.85 0.150 1.1
## carb 0.08 0.87 0.76 0.244 1.0
##
## RC1 RC2
## SS loadings 4.67 4.59
## Proportion Var 0.42 0.42
## Cumulative Var 0.42 0.84
## Proportion Explained 0.50 0.50
## Cumulative Proportion 0.50 1.00
##
## Mean item complexity = 1.4
## Test of the hypothesis that 2 components are sufficient.
##
## The root mean square of the residuals (RMSR) is 0.05
## with the empirical chi square 9.45 with prob < 1
##
## Fit based upon off diagonal values = 0.99
You can see that now mpg, drat, am and gear load heavily on the first component and cyl, disp, hp and carb load heavily on the second component. We can also view a biplot of the rotated components using:
biplot(principal(mtcars, nfactors = 2, rotate="varimax"))
If you want to include the car labels on the plot you can use:
biplot(principal(mtcars, nfactors = 2, rotate="varimax"), lab=row.names(mtcars))
You can also look at the scores for the rotated components using:
principal(mtcars, nfactors = 2, rotate="varimax")$scores
## RC1 RC2
## Mazda RX4 0.913545261 0.5740734
## Mazda RX4 Wag 0.827547997 0.5013891
## Datsun 710 0.698816380 -0.8074274
## Hornet 4 Drive -0.913638812 -1.1047271
## Hornet Sportabout -0.859350144 0.2025926
## Valiant -1.162422544 -1.2190911
## Duster 360 -0.680268641 0.9486423
## Merc 240D -0.056857408 -1.1835017
## Merc 230 -0.212468151 -1.4696568
## Merc 280 0.075581300 -0.2109478
## Merc 280C 0.002444315 -0.2763119
## Merc 450SE -0.904174186 0.3064233
## Merc 450SL -0.849329738 0.2529898
## Merc 450SLC -0.927001103 0.2287337
## Cadillac Fleetwood -1.417403948 0.6862707
## Lincoln Continental -1.392296773 0.7416873
## Chrysler Imperial -1.161445427 0.7799435
## Fiat 128 0.930026590 -1.1606987
## Honda Civic 1.455372982 -0.8414166
## Toyota Corolla 1.040852468 -1.2543364
## Toyota Corona -0.374988630 -1.4259630
## Dodge Challenger -1.026764303 0.1466255
## AMC Javelin -0.893224853 0.1071275
## Camaro Z28 -0.502907752 1.0677154
## Pontiac Firebird -0.984122353 0.2236560
## Fiat X1-9 0.926969145 -1.0092451
## Porsche 914-2 1.590766193 0.1745827
## Lotus Europa 1.509486177 -0.3106522
## Ford Pantera L 1.103849172 1.8802235
## Ferrari Dino 1.361140948 1.3909629
## Maserati Bora 1.120968759 2.6074298
## Volvo 142E 0.761297080 -0.5470932
For this example, we were able to reduce 11 dimensions to 2 components that account for 84% of the variance. Why not try a different dataset to see how you can reduce the number of dimensions?
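As a starting point, here is the whole pipeline as a reusable sketch; df and k are placeholders for your own all-numeric data frame and chosen number of components:
# df: any all-numeric data frame; k: number of components to keep (placeholders)
pca <- prcomp(df, scale = TRUE)
plot(pca, type = "line")                         # scree plot
summary(pca)                                     # cumulative explained variance
principal(df, nfactors = k, rotate = "varimax")  # rotated solution (psych package)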