One of the reasons we use variable reduction techniques is multicollinearity. Multicollinearity exists when independent variables are highly correlated. Because such variables move together, it is hard to distinguish their individual effects on the predicted variable. If you try to model a relationship using highly correlated variables, it will be difficult to recover the true relationship, and you may even obtain coefficients whose signs point in the opposite direction to the true relationship. Let's see an example of how multicollinearity can distort the relationships in a model.
For that purpose we are going to use a dataset named seatpos from the package faraway.
Task: Install and load library faraway
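One way to complete this task (the package only needs installing once):
install.packages("faraway")  # run once
library(faraway)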
This is a dataset about 38 car drivers and their seat position in the car, collected at the University of Michigan. Among the variables we have the age of the driver Age, the weight of the driver Weight, the height of the driver while wearing shoes HtShoes, the height of the driver barefoot Ht, the seated height in centimeters Seated, the arm length Arm, the thigh length Thigh, and the lower leg length in centimeters Leg. The purpose of our model is to explain the hip center hipcenter, which is the horizontal distance of the midpoint of the hips from a fixed location in the car.
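Before modelling, it helps to glance at the data:
str(seatpos)   # 38 observations of 9 numeric variables
head(seatpos)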
Not surprisingly, there are variables that are highly correlated, as is the case of HtShoes and Ht.
To understand the magnitude of the problem we can check the correlation matrix.
Task: Create the correlation matrix
round(cor(seatpos), 2)
## Age Weight HtShoes Ht Seated Arm Thigh Leg hipcenter
## Age 1.00 0.08 -0.08 -0.09 -0.17 0.36 0.09 -0.04 0.21
## Weight 0.08 1.00 0.83 0.83 0.78 0.70 0.57 0.78 -0.64
## HtShoes -0.08 0.83 1.00 1.00 0.93 0.75 0.72 0.91 -0.80
## Ht -0.09 0.83 1.00 1.00 0.93 0.75 0.73 0.91 -0.80
## Seated -0.17 0.78 0.93 0.93 1.00 0.63 0.61 0.81 -0.73
## Arm 0.36 0.70 0.75 0.75 0.63 1.00 0.67 0.75 -0.59
## Thigh 0.09 0.57 0.72 0.73 0.61 0.67 1.00 0.65 -0.59
## Leg -0.04 0.78 0.91 0.91 0.81 0.75 0.65 1.00 -0.79
## hipcenter 0.21 -0.64 -0.80 -0.80 -0.73 -0.59 -0.59 -0.79 1.00
You can see that many variables are highly correlated. As expected, the correlation between HtShoes and Ht rounds to a perfect 1.00.
We can run a regression analysis. The function for linear regression is lm. The hip center is the variable we want to explain. Let’s run a regression with all the other variables as explanatory variables and see the results.
fit <- lm(hipcenter ~ ., data = seatpos)
summary(fit)
##
## Call:
## lm(formula = hipcenter ~ ., data = seatpos)
##
## Residuals:
## Min 1Q Median 3Q Max
## -73.827 -22.833 -3.678 25.017 62.337
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 436.43213 166.57162 2.620 0.0138 *
## Age 0.77572 0.57033 1.360 0.1843
## Weight 0.02631 0.33097 0.080 0.9372
## HtShoes -2.69241 9.75304 -0.276 0.7845
## Ht 0.60134 10.12987 0.059 0.9531
## Seated 0.53375 3.76189 0.142 0.8882
## Arm -1.32807 3.90020 -0.341 0.7359
## Thigh -1.14312 2.66002 -0.430 0.6706
## Leg -6.43905 4.71386 -1.366 0.1824
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 37.72 on 29 degrees of freedom
## Multiple R-squared: 0.6866, Adjusted R-squared: 0.6001
## F-statistic: 7.94 on 8 and 29 DF, p-value: 1.306e-05
You can see here that none of the predictor variables is statistically significant. So although the correlation matrix showed a high correlation between hipcenter and several variables, this is not reflected in the individual coefficients of the model. However, the R-squared and the adjusted R-squared are quite high, and the F-statistic indicates that the model as a whole has predictive power.
What exactly is the problem? The most likely culprit is multicollinearity.
One tool that you can use to detect such effects is the variance inflation factor, known as VIF. The VIF measures how strongly each independent variable is correlated with the other independent variables. It is computed by taking each predictor in turn and regressing it against all the other predictors: if R2_j is the R-squared of that regression, then VIF_j = 1 / (1 - R2_j).
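As a minimal sketch, you can reproduce the VIF of HtShoes by hand, assuming seatpos is loaded:
# R-squared from regressing HtShoes on all the other predictors
# (the response hipcenter is excluded from the formula)
r2 <- summary(lm(HtShoes ~ . - hipcenter, data = seatpos))$r.squared
1 / (1 - r2)  # about 307, matching the vif() output shown below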
### Variance Inflation Factor
One function to compute this is vif, which comes from the package faraway. The syntax is vif with the regression object inside the parentheses.
vif(fit)
## Age Weight HtShoes Ht Seated Arm Thigh
## 1.997931 3.647030 307.429378 333.137832 8.951054 4.496368 2.762886
## Leg
## 6.694291
The output shows all the predictor variables and their respective VIF values. We want each value to be below five; any variable above five is susceptible to multicollinearity issues. You can see that the values for HtShoes and Ht are extremely high: about 307 and 333. At the same time, the Seated and Leg variables also exceed the threshold, at about 8.95 and 6.69. So what can we do in this case?
We need to reduce the number of variables in our model. Since collinear variables carry essentially the same information, we can keep just one of them. For example, we could run a model that includes only the Age, Weight and Ht variables.
Task: Run a regression using only Age, Weight and Ht as predictors, check the results and see how the VIF values have changed.
fit2 <- lm(hipcenter ~ Age + Weight + Ht, data = seatpos)
summary(fit2)
vif(fit2)
### Principal Component Analysis
Let's see how to use PCA for variable reduction in R. You can use any dataset you wish that has 10 or more dimensions/columns of numerical data. This guide, however, will be based on the mtcars dataset, which has 11 dimensions and 32 rows (representing different car models). The data is included in the base (datasets) package of R:
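Typing the dataset's name prints it in full:
mtcars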
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
The first things to check are that (1) the data is all numerical, (2) there are no missing values, and (3) the variables are either all minimising or all maximising. You will find that no changes need to be made to the mtcars dataset.
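A quick way to verify the first two checks:
str(mtcars)        # all 11 columns are numeric
sum(is.na(mtcars)) # returns 0: no missing values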
Next, it's a good idea to have a look at the correlation matrix of the data.
Task: Check the correlation matrix of the data
round(cor(mtcars), 3)
## mpg cyl disp hp drat wt qsec vs am gear
## mpg 1.000 -0.852 -0.848 -0.776 0.681 -0.868 0.419 0.664 0.600 0.480
## cyl -0.852 1.000 0.902 0.832 -0.700 0.782 -0.591 -0.811 -0.523 -0.493
## disp -0.848 0.902 1.000 0.791 -0.710 0.888 -0.434 -0.710 -0.591 -0.556
## hp -0.776 0.832 0.791 1.000 -0.449 0.659 -0.708 -0.723 -0.243 -0.126
## drat 0.681 -0.700 -0.710 -0.449 1.000 -0.712 0.091 0.440 0.713 0.700
## wt -0.868 0.782 0.888 0.659 -0.712 1.000 -0.175 -0.555 -0.692 -0.583
## qsec 0.419 -0.591 -0.434 -0.708 0.091 -0.175 1.000 0.745 -0.230 -0.213
## vs 0.664 -0.811 -0.710 -0.723 0.440 -0.555 0.745 1.000 0.168 0.206
## am 0.600 -0.523 -0.591 -0.243 0.713 -0.692 -0.230 0.168 1.000 0.794
## gear 0.480 -0.493 -0.556 -0.126 0.700 -0.583 -0.213 0.206 0.794 1.000
## carb -0.551 0.527 0.395 0.750 -0.091 0.428 -0.656 -0.570 0.058 0.274
## carb
## mpg -0.551
## cyl 0.527
## disp 0.395
## hp 0.750
## drat -0.091
## wt 0.428
## qsec -0.656
## vs -0.570
## am 0.058
## gear 0.274
## carb 1.000
From checking this, you can see that some variables are highly correlated whilst others are clearly not. As there are many variables, producing a very large correlation matrix, it may be more helpful to review the correlations using visualisation methods such as the following, both sketched below:
- Using the corrgram function from the corrgram library. Dark blue indicates a high positive correlation whilst dark red indicates a high negative correlation.
- Using the plotcorr function from the ellipse library. Read more about this function with ?plotcorr.
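A minimal sketch of both approaches, assuming the corrgram and ellipse packages are installed:
library(corrgram)
library(ellipse)
corrgram(mtcars)       # shaded correlogram: blue positive, red negative
plotcorr(cor(mtcars))  # one ellipse per pair; narrower means stronger correlation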
Next we need to work out how many components to extract. We can do this by looking at the scree plot.
To produce the scree plot type the following command:
plot(prcomp(mtcars, scale = TRUE), type="line")
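The heights in the scree plot are the component variances (eigenvalues), which you can also print directly:
prcomp(mtcars, scale = TRUE)$sdev^2  # only the first two values exceed 1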
According to Kaiser's rule (keep the components whose eigenvalue is above 1) we would select 2 components. From looking at the scree plot (keep the components to the left of the elbow) you would take 2-3 components. Let's also check the summary to understand the cumulative explained variance.
summary(prcomp(mtcars, scale = TRUE))
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.5707 1.6280 0.79196 0.51923 0.47271 0.46000 0.3678
## Proportion of Variance 0.6008 0.2409 0.05702 0.02451 0.02031 0.01924 0.0123
## Cumulative Proportion 0.6008 0.8417 0.89873 0.92324 0.94356 0.96279 0.9751
## PC8 PC9 PC10 PC11
## Standard deviation 0.35057 0.2776 0.22811 0.1485
## Proportion of Variance 0.01117 0.0070 0.00473 0.0020
## Cumulative Proportion 0.98626 0.9933 0.99800 1.0000
We can see again that 2-3 components is a good choice.
From the previous table, we can see that 2 components account for 84% of the variance. We can then have a look at the loadings of the first two principal components using the following command:
prcomp(mtcars, scale = TRUE)$rotation[,1:2]
## PC1 PC2
## mpg -0.3625305 0.01612440
## cyl 0.3739160 0.04374371
## disp 0.3681852 -0.04932413
## hp 0.3300569 0.24878402
## drat -0.2941514 0.27469408
## wt 0.3461033 -0.14303825
## qsec -0.2004563 -0.46337482
## vs -0.3065113 -0.23164699
## am -0.2349429 0.42941765
## gear -0.2069162 0.46234863
## carb 0.2140177 0.41357106
It is quite difficult to interpret this, so we need to rotate the components. First, though, we can have a look at a 2D biplot using:
biplot(prcomp(mtcars, scale = TRUE))
The plot shows the original variables and how they are placed on the 2D principal component space. You can see that cyl and disp load highly on the first component and hardly at all on the second component (this can also be seen in the PC1 and PC2 values shown previously). You can also see that carb loads highly on the second component and not much on the first component. The plot also shows the car models and where they are placed in the 2D principal component space. You can view the PCA scores for each of the car models using:
prcomp(mtcars, scale = TRUE)$x[,1:2]
## PC1 PC2
## Mazda RX4 -0.6468627420 1.7081142
## Mazda RX4 Wag -0.6194831460 1.5256219
## Datsun 710 -2.7356242748 -0.1441501
## Hornet 4 Drive -0.3068606268 -2.3258038
## Hornet Sportabout 1.9433926844 -0.7425211
## Valiant -0.0552534228 -2.7421229
## Duster 360 2.9553851233 0.3296133
## Merc 240D -2.0229593244 -1.4421056
## Merc 230 -2.2513839535 -1.9522879
## Merc 280 -0.5180912217 -0.1594610
## Merc 280C -0.5011860079 -0.3187934
## Merc 450SE 2.2124096339 -0.6727099
## Merc 450SL 2.0155715693 -0.6724606
## Merc 450SLC 2.1147047372 -0.7891129
## Cadillac Fleetwood 3.8383725118 -0.8149087
## Lincoln Continental 3.8918495626 -0.7218314
## Chrysler Imperial 3.5363862158 -0.4145024
## Fiat 128 -3.7955510831 -0.2920783
## Honda Civic -4.1870356784 0.6775721
## Toyota Corolla -4.1675359344 -0.2748890
## Toyota Corona -1.8741790870 -2.0864529
## Dodge Challenger 2.1504414942 -0.9982442
## AMC Javelin 1.8340369797 -0.8921886
## Camaro Z28 2.8434957523 0.6701037
## Pontiac Firebird 2.2105479148 -0.8600504
## Fiat X1-9 -3.5176818134 -0.1192950
## Porsche 914-2 -2.6095003965 2.0141425
## Lotus Europa -3.3323844512 1.3568877
## Ford Pantera L 1.3513346957 3.4448780
## Ferrari Dino -0.0009743305 3.1683750
## Maserati Bora 2.6270897605 4.3107016
## Volvo 142E -2.3824711412 0.2299603
For example, you can see from the scores how a value of 2.63 on PC1 and 4.31 on PC2 places the Maserati Bora at the top of the plot and slightly to the right. To make the components easier to interpret we can use rotation. The psych library contains functions for this, so load the following libraries:
library(psych)
library(GPArotation)
We can try varimax rotation using:
principal(mtcars, nfactors = 2, rotate="varimax")
## Principal Components Analysis
## Call: principal(r = mtcars, nfactors = 2, rotate = "varimax")
## Standardized loadings (pattern matrix) based upon correlation matrix
## RC1 RC2 h2 u2 com
## mpg 0.68 -0.63 0.87 0.131 2.0
## cyl -0.64 0.72 0.93 0.071 2.0
## disp -0.73 0.60 0.90 0.098 1.9
## hp -0.32 0.88 0.88 0.116 1.3
## drat 0.85 -0.21 0.77 0.228 1.1
## wt -0.80 0.46 0.85 0.154 1.6
## qsec -0.16 -0.90 0.83 0.165 1.1
## vs 0.30 -0.82 0.76 0.237 1.3
## am 0.92 0.08 0.85 0.146 1.0
## gear 0.91 0.17 0.85 0.150 1.1
## carb 0.08 0.87 0.76 0.244 1.0
##
## RC1 RC2
## SS loadings 4.67 4.59
## Proportion Var 0.42 0.42
## Cumulative Var 0.42 0.84
## Proportion Explained 0.50 0.50
## Cumulative Proportion 0.50 1.00
##
## Mean item complexity = 1.4
## Test of the hypothesis that 2 components are sufficient.
##
## The root mean square of the residuals (RMSR) is 0.05
## with the empirical chi square 9.45 with prob < 1
##
## Fit based upon off diagonal values = 0.99
You can see that now mpg, drat, am and gear load heavily on the first component and cyl, disp, hp and carb load heavily on the second component. We can also view a biplot of the rotated components using:
biplot(principal(mtcars, nfactors = 2, rotate="varimax"))
If you want to include the car labels on the plot you can use:
biplot(principal(mtcars, nfactors = 2, rotate="varimax"), lab=row.names(mtcars))
You can also look at the scores for the rotated components using:
principal(mtcars, nfactors = 2, rotate="varimax")$scores
## RC1 RC2
## Mazda RX4 0.913545261 0.5740734
## Mazda RX4 Wag 0.827547997 0.5013891
## Datsun 710 0.698816380 -0.8074274
## Hornet 4 Drive -0.913638812 -1.1047271
## Hornet Sportabout -0.859350144 0.2025926
## Valiant -1.162422544 -1.2190911
## Duster 360 -0.680268641 0.9486423
## Merc 240D -0.056857408 -1.1835017
## Merc 230 -0.212468151 -1.4696568
## Merc 280 0.075581300 -0.2109478
## Merc 280C 0.002444315 -0.2763119
## Merc 450SE -0.904174186 0.3064233
## Merc 450SL -0.849329738 0.2529898
## Merc 450SLC -0.927001103 0.2287337
## Cadillac Fleetwood -1.417403948 0.6862707
## Lincoln Continental -1.392296773 0.7416873
## Chrysler Imperial -1.161445427 0.7799435
## Fiat 128 0.930026590 -1.1606987
## Honda Civic 1.455372982 -0.8414166
## Toyota Corolla 1.040852468 -1.2543364
## Toyota Corona -0.374988630 -1.4259630
## Dodge Challenger -1.026764303 0.1466255
## AMC Javelin -0.893224853 0.1071275
## Camaro Z28 -0.502907752 1.0677154
## Pontiac Firebird -0.984122353 0.2236560
## Fiat X1-9 0.926969145 -1.0092451
## Porsche 914-2 1.590766193 0.1745827
## Lotus Europa 1.509486177 -0.3106522
## Ford Pantera L 1.103849172 1.8802235
## Ferrari Dino 1.361140948 1.3909629
## Maserati Bora 1.120968759 2.6074298
## Volvo 142E 0.761297080 -0.5470932
For this example, we were able to reduce 11 dimensions to 2 components that account for 84% of the variance. Why not try a different dataset to see how you can reduce the number of dimensions?
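As a starting point, here is the whole pipeline as a reusable sketch; df and k are placeholders for your own all-numeric data frame and chosen number of components:
# df: any all-numeric data frame; k: number of components to keep (placeholders)
pca <- prcomp(df, scale = TRUE)
plot(pca, type = "line")                         # scree plot
summary(pca)                                     # cumulative explained variance
principal(df, nfactors = k, rotate = "varimax")  # rotated solution (psych package)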