Dimensionality reduction is the process of reducing the number of features (or dimensions) in a dataset while preserving as much meaningful information as possible. High-dimensional data can be computationally expensive, prone to overfitting, and difficult to visualize.
For example, if we have a dataset with 1000 variables, it is challenging to analyze patterns and relationships effectively. Dimensionality reduction techniques help us simplify the dataset without losing significant insights.
There are two major types: feature selection, where we keep only a subset of the original variables, and feature extraction, where we create a smaller set of new variables (such as principal components) that combine the original ones.
One of the reasons we use variable reduction techniques is to address multicollinearity. Multicollinearity arises when independent variables are highly correlated. Because such variables move together, it becomes difficult to separate their individual effects on the predicted variable, and it is even possible to obtain coefficients whose sign is opposite to the true relationship. Let's see an example of how multicollinearity can distort the relationships in a model.
For that purpose, we will use the dataset seatpos from the faraway package.
Task: Install and load library faraway
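A minimal way to do this is:

# Install the package (only needed once)
install.packages("faraway")
# Load it for the current session
library(faraway)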
This is a dataset about 38 car drivers and their seat position in the car, collected by the University of Michigan. The variables include the driver's age (Age), weight (Weight), height wearing shoes (HtShoes), barefoot height (Ht), seated height in centimeters (Seated), arm length (Arm), thigh length (Thigh) and lower leg length in centimeters (Leg). The purpose of our model is to explain hipcenter, which represents the horizontal distance from the midpoint of the hips to a fixed location in the car.
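To get a quick feel for the data, we can inspect its structure, for example:

# Structure of the seatpos data frame: 38 drivers, 9 numeric variables
str(seatpos)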
Not surprisingly, there are variables that are highly correlated, as is the case of HtShoes and Ht.
To assess the extent of the problem, we can examine the correlation matrix.
Task: Create the correlation matrix
## Age Weight HtShoes Ht Seated Arm Thigh Leg hipcenter
## Age 1.00 0.08 -0.08 -0.09 -0.17 0.36 0.09 -0.04 0.21
## Weight 0.08 1.00 0.83 0.83 0.78 0.70 0.57 0.78 -0.64
## HtShoes -0.08 0.83 1.00 1.00 0.93 0.75 0.72 0.91 -0.80
## Ht -0.09 0.83 1.00 1.00 0.93 0.75 0.73 0.91 -0.80
## Seated -0.17 0.78 0.93 0.93 1.00 0.63 0.61 0.81 -0.73
## Arm 0.36 0.70 0.75 0.75 0.63 1.00 0.67 0.75 -0.59
## Thigh 0.09 0.57 0.72 0.73 0.61 0.67 1.00 0.65 -0.59
## Leg -0.04 0.78 0.91 0.91 0.81 0.75 0.65 1.00 -0.79
## hipcenter 0.21 -0.64 -0.80 -0.80 -0.73 -0.59 -0.59 -0.79 1.00
round(cor(seatpos),2)
You can see that many variables are highly correlated. As expected, HtShoes and Ht have an essentially perfect correlation (1.00 after rounding).
We can run a regression analysis. The function for linear regression is lm. The hip center is the variable we want to explain. Let’s run a regression with all the other variables as explanatory variables and see the results.
##
## Call:
## lm(formula = hipcenter ~ ., data = seatpos)
##
## Residuals:
## Min 1Q Median 3Q Max
## -73.827 -22.833 -3.678 25.017 62.337
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 436.43213 166.57162 2.620 0.0138 *
## Age 0.77572 0.57033 1.360 0.1843
## Weight 0.02631 0.33097 0.080 0.9372
## HtShoes -2.69241 9.75304 -0.276 0.7845
## Ht 0.60134 10.12987 0.059 0.9531
## Seated 0.53375 3.76189 0.142 0.8882
## Arm -1.32807 3.90020 -0.341 0.7359
## Thigh -1.14312 2.66002 -0.430 0.6706
## Leg -6.43905 4.71386 -1.366 0.1824
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 37.72 on 29 degrees of freedom
## Multiple R-squared: 0.6866, Adjusted R-squared: 0.6001
## F-statistic: 7.94 on 8 and 29 DF, p-value: 1.306e-05
fit <- lm(hipcenter ~ ., data = seatpos)
summary(fit)
You can see that all of the individual coefficients appear insignificant. Although the correlation matrix showed strong relationships between hipcenter and several of the variables, this is not reflected in the regression coefficients. At the same time, both the R-squared and adjusted R-squared values are quite high, and the F-statistic indicates that the model as a whole has real explanatory power.
What exactly is the problem? Most likely the reason is multicollinearity.
### Variance Inflation Factor

One tool for detecting such issues is the Variance Inflation Factor (VIF). The VIF measures the strength of the correlation between one independent variable and the others: it is computed by taking each predictor in turn and regressing it against every other predictor. One function that does this is vif, which comes from the faraway package. The syntax is vif with the regression object inside the parentheses.
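Under the hood, the VIF of a predictor is 1 / (1 - R^2), where R^2 comes from regressing that predictor on all the other predictors. As a small sketch (assuming seatpos is loaded), the VIF for HtShoes can be reproduced by hand:

# Regress HtShoes on the other predictors (the response hipcenter is excluded)
r2 <- summary(lm(HtShoes ~ . - hipcenter, data = seatpos))$r.squared
# VIF = 1 / (1 - R^2); this should agree with the value reported by vif() below
1 / (1 - r2)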
## Age Weight HtShoes Ht Seated Arm Thigh
## 1.997931 3.647030 307.429378 333.137832 8.951054 4.496368 2.762886
## Leg
## 6.694291
vif(fit)
The output shows each predictor variable with its respective VIF value. As a rule of thumb, we want this number to be less than five; any variable above five is susceptible to multicollinearity issues. You can see that the values for HtShoes and Ht are extremely high, at roughly 307 and 333, but there are also problems with the Seated and Leg variables. So what can we do in this case?
We need to reduce the number of variables used in our model. Since collinear variables carry essentially the same information, we can keep just one from each collinear group. For example, we could run a model that includes only the Age, Weight and Ht variables.
Task: Run a regression using only Age, Weight and Ht as predictors, check the results and see how the VIF values have changed.
fit2 <- lm(hipcenter ~ Age + Weight + Ht, data = seatpos)
summary(fit2)
vif(fit2)
Let’s see how to use PCA for variable reduction in R. You can use any dataset you wish that has 10 or more dimensions/columns of numerical data. This guide, however, will be based on the mtcars dataset that has 11 dimensions and 32 rows (representing different car models). The data is included in the base (datasets) package of R:
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
The first things to check are: (1) the data is all numerical, (2) there are no missing values and (3) the variables are either all minimising or all maximising. You will find out that no changes need to be made to the mtcars dataset.
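These checks are easy to run in R, for example:

# Confirm that all columns are numeric and that there are no missing values
sapply(mtcars, is.numeric)
sum(is.na(mtcars))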
Next, it's a good idea to have a look at the correlation matrix of the data.
Task: Check the correlation matrix of the data
round(cor(mtcars), 3)
## mpg cyl disp hp drat wt qsec vs am gear
## mpg 1.000 -0.852 -0.848 -0.776 0.681 -0.868 0.419 0.664 0.600 0.480
## cyl -0.852 1.000 0.902 0.832 -0.700 0.782 -0.591 -0.811 -0.523 -0.493
## disp -0.848 0.902 1.000 0.791 -0.710 0.888 -0.434 -0.710 -0.591 -0.556
## hp -0.776 0.832 0.791 1.000 -0.449 0.659 -0.708 -0.723 -0.243 -0.126
## drat 0.681 -0.700 -0.710 -0.449 1.000 -0.712 0.091 0.440 0.713 0.700
## wt -0.868 0.782 0.888 0.659 -0.712 1.000 -0.175 -0.555 -0.692 -0.583
## qsec 0.419 -0.591 -0.434 -0.708 0.091 -0.175 1.000 0.745 -0.230 -0.213
## vs 0.664 -0.811 -0.710 -0.723 0.440 -0.555 0.745 1.000 0.168 0.206
## am 0.600 -0.523 -0.591 -0.243 0.713 -0.692 -0.230 0.168 1.000 0.794
## gear 0.480 -0.493 -0.556 -0.126 0.700 -0.583 -0.213 0.206 0.794 1.000
## carb -0.551 0.527 0.395 0.750 -0.091 0.428 -0.656 -0.570 0.058 0.274
## carb
## mpg -0.551
## cyl 0.527
## disp 0.395
## hp 0.750
## drat -0.091
## wt 0.428
## qsec -0.656
## vs -0.570
## am 0.058
## gear 0.274
## carb 1.000
From checking this, you can see that some variables are highly correlated whilst others are clearly not. As there are many variables, producing a very large correlation matrix, it may be more helpful to review the correlations using visualisation methods such as the following (a short code sketch is given after the list):
Using the corrgram function from the corrgram library. Dark blue indicates a high positive correlation whilst a dark red indicates a high negative correlation.
Using the plotcorr function from the ellipse library. Read more about this function with ?plotcorr.
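A short sketch of both options, assuming the corrgram and ellipse packages are installed:

# Shaded correlogram: dark blue = strong positive, dark red = strong negative
library(corrgram)
corrgram(mtcars)

# Ellipse-based correlation plot of the same matrix
library(ellipse)
plotcorr(cor(mtcars))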
Next we need to work out how many components to extract. We can do this by looking at the scree plot.
To produce the scree plot type the following command:
plot(prcomp(mtcars, scale = TRUE), type="line")
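Before reading the plot, it can also help to look at the component variances (the eigenvalues) directly, since the rules of thumb below are based on them:

# Variances of the principal components (squared standard deviations)
prcomp(mtcars, scale = TRUE)$sdev^2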
According to Kaiser's rule (keep the components whose variances, i.e. eigenvalues, are above 1) we would select 2 components. From looking at the scree plot (keep the components to the left of the elbow) you would take 2-3 components. Let's also check the summary to understand the cumulative explained variance.
summary(prcomp(mtcars, scale = TRUE))
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.5707 1.6280 0.79196 0.51923 0.47271 0.46000 0.3678
## Proportion of Variance 0.6008 0.2409 0.05702 0.02451 0.02031 0.01924 0.0123
## Cumulative Proportion 0.6008 0.8417 0.89873 0.92324 0.94356 0.96279 0.9751
## PC8 PC9 PC10 PC11
## Standard deviation 0.35057 0.2776 0.22811 0.1485
## Proportion of Variance 0.01117 0.0070 0.00473 0.0020
## Cumulative Proportion 0.98626 0.9933 0.99800 1.0000
We can see again that 2-3 components is a good choice.
From the previous table, we can see that 2 components account for 84% of the variance. We can then have a look at the two principal components using the following command:
prcomp(mtcars, scale = TRUE)$rotation[,1:2]
## PC1 PC2
## mpg -0.3625305 0.01612440
## cyl 0.3739160 0.04374371
## disp 0.3681852 -0.04932413
## hp 0.3300569 0.24878402
## drat -0.2941514 0.27469408
## wt 0.3461033 -0.14303825
## qsec -0.2004563 -0.46337482
## vs -0.3065113 -0.23164699
## am -0.2349429 0.42941765
## gear -0.2069162 0.46234863
## carb 0.2140177 0.41357106
It is quite difficult to interpret these loadings directly, so we need to rotate the components. First, though, we can have a look at a 2D biplot using:
biplot(prcomp(mtcars, scale = TRUE))
The plot shows the original variables and how they are placed in the 2D principal component space. You can see that cyl and disp load highly on the first component and hardly at all on the second (this can also be seen in the PC1 and PC2 values shown previously), while carb loads highly on the second component and not much on the first. The plot also shows the car models and where they sit in the same 2D space. You can view the PCA scores for each of the car models using:
prcomp(mtcars, scale = TRUE)$x[,1:2]
## PC1 PC2
## Mazda RX4 -0.6468627420 1.7081142
## Mazda RX4 Wag -0.6194831460 1.5256219
## Datsun 710 -2.7356242748 -0.1441501
## Hornet 4 Drive -0.3068606268 -2.3258038
## Hornet Sportabout 1.9433926844 -0.7425211
## Valiant -0.0552534228 -2.7421229
## Duster 360 2.9553851233 0.3296133
## Merc 240D -2.0229593244 -1.4421056
## Merc 230 -2.2513839535 -1.9522879
## Merc 280 -0.5180912217 -0.1594610
## Merc 280C -0.5011860079 -0.3187934
## Merc 450SE 2.2124096339 -0.6727099
## Merc 450SL 2.0155715693 -0.6724606
## Merc 450SLC 2.1147047372 -0.7891129
## Cadillac Fleetwood 3.8383725118 -0.8149087
## Lincoln Continental 3.8918495626 -0.7218314
## Chrysler Imperial 3.5363862158 -0.4145024
## Fiat 128 -3.7955510831 -0.2920783
## Honda Civic -4.1870356784 0.6775721
## Toyota Corolla -4.1675359344 -0.2748890
## Toyota Corona -1.8741790870 -2.0864529
## Dodge Challenger 2.1504414942 -0.9982442
## AMC Javelin 1.8340369797 -0.8921886
## Camaro Z28 2.8434957523 0.6701037
## Pontiac Firebird 2.2105479148 -0.8600504
## Fiat X1-9 -3.5176818134 -0.1192950
## Porsche 914-2 -2.6095003965 2.0141425
## Lotus Europa -3.3323844512 1.3568877
## Ford Pantera L 1.3513346957 3.4448780
## Ferrari Dino -0.0009743305 3.1683750
## Maserati Bora 2.6270897605 4.3107016
## Volvo 142E -2.3824711412 0.2299603
For example, you can see from the scores how 2.63 on PC1 and 4.31 on PC2 place the Maserati Bora at the top of the plot and slightly to the right. To make the components easier to interpret we can use rotation. The psych library contains the functions for this (together with GPArotation, which it needs for the rotations), so load both libraries:
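Assuming both packages are already installed, loading them looks like this:

# psych provides principal(); GPArotation is needed for the rotations
library(psych)
library(GPArotation)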
We can try varimax rotation using:
principal(mtcars, nfactors = 2, rotate="varimax")
## Principal Components Analysis
## Call: principal(r = mtcars, nfactors = 2, rotate = "varimax")
## Standardized loadings (pattern matrix) based upon correlation matrix
## RC1 RC2 h2 u2 com
## mpg 0.68 -0.63 0.87 0.131 2.0
## cyl -0.64 0.72 0.93 0.071 2.0
## disp -0.73 0.60 0.90 0.098 1.9
## hp -0.32 0.88 0.88 0.116 1.3
## drat 0.85 -0.21 0.77 0.228 1.1
## wt -0.80 0.46 0.85 0.154 1.6
## qsec -0.16 -0.90 0.83 0.165 1.1
## vs 0.30 -0.82 0.76 0.237 1.3
## am 0.92 0.08 0.85 0.146 1.0
## gear 0.91 0.17 0.85 0.150 1.1
## carb 0.08 0.87 0.76 0.244 1.0
##
## RC1 RC2
## SS loadings 4.67 4.59
## Proportion Var 0.42 0.42
## Cumulative Var 0.42 0.84
## Proportion Explained 0.50 0.50
## Cumulative Proportion 0.50 1.00
##
## Mean item complexity = 1.4
## Test of the hypothesis that 2 components are sufficient.
##
## The root mean square of the residuals (RMSR) is 0.05
## with the empirical chi square 9.45 with prob < 1
##
## Fit based upon off diagonal values = 0.99
You can see that now mpg, drat, am and gear load heavily on the first component and cyl, disp, hp and carb load heavily on the second component. We can also view a biplot of the rotated components using:
biplot(principal(mtcars, nfactors = 2, rotate="varimax"))
If you want to include the car labels on the plot you can use: biplot(principal(mtcars, nfactors = 2, rotate="varimax"), lab=row.names(mtcars))
You can also look at the scores for the rotated components using:
principal(mtcars, nfactors = 2, rotate="varimax")$scores
## RC1 RC2
## Mazda RX4 0.913545261 0.5740734
## Mazda RX4 Wag 0.827547997 0.5013891
## Datsun 710 0.698816380 -0.8074274
## Hornet 4 Drive -0.913638812 -1.1047271
## Hornet Sportabout -0.859350144 0.2025926
## Valiant -1.162422544 -1.2190911
## Duster 360 -0.680268641 0.9486423
## Merc 240D -0.056857408 -1.1835017
## Merc 230 -0.212468151 -1.4696568
## Merc 280 0.075581300 -0.2109478
## Merc 280C 0.002444315 -0.2763119
## Merc 450SE -0.904174186 0.3064233
## Merc 450SL -0.849329738 0.2529898
## Merc 450SLC -0.927001103 0.2287337
## Cadillac Fleetwood -1.417403948 0.6862707
## Lincoln Continental -1.392296773 0.7416873
## Chrysler Imperial -1.161445427 0.7799435
## Fiat 128 0.930026590 -1.1606987
## Honda Civic 1.455372982 -0.8414166
## Toyota Corolla 1.040852468 -1.2543364
## Toyota Corona -0.374988630 -1.4259630
## Dodge Challenger -1.026764303 0.1466255
## AMC Javelin -0.893224853 0.1071275
## Camaro Z28 -0.502907752 1.0677154
## Pontiac Firebird -0.984122353 0.2236560
## Fiat X1-9 0.926969145 -1.0092451
## Porsche 914-2 1.590766193 0.1745827
## Lotus Europa 1.509486177 -0.3106522
## Ford Pantera L 1.103849172 1.8802235
## Ferrari Dino 1.361140948 1.3909629
## Maserati Bora 1.120968759 2.6074298
## Volvo 142E 0.761297080 -0.5470932
After performing the PCA with varimax rotation, we have two rotated components (RC1 and RC2) that together explain 84% of the variance. Now we will proceed by using these components in a regression model to predict the target variable (here, mpg).
Step 1: Extract PCA Scores After Varimax Rotation

We need to extract the rotated PCA scores for each car model and use them as the independent variables in a regression model. We remove the first column here because mpg will be used as the dependent variable.
# Extract rotated principal components using Varimax rotation
pca_rotated <- principal(mtcars[, -1], nfactors = 2, rotate = "varimax")
# View the rotated PCA scores for each car model
pca_scores <- as.data.frame(pca_rotated$scores)
# Add the dependent variable (mpg) to the PCA scores dataframe
pca_scores$mpg <- mtcars$mpg
# Check the new dataframe
head(pca_scores)
## RC1 RC2 mpg
## Mazda RX4 0.5580509 0.9287428 21.0
## Mazda RX4 Wag 0.4868240 0.8413597 21.0
## Datsun 710 -0.8669120 0.7575937 22.8
## Hornet 4 Drive -1.0834612 -0.9386737 21.4
## Hornet Sportabout 0.2606309 -0.9199634 18.7
## Valiant -1.2444094 -1.1321068 18.1
Here, the pca_scores dataframe now contains the first two rotated principal components and the target variable mpg.
Step 2: Perform Regression Using PCA Components

We now use the two rotated principal components (RC1 and RC2) to predict mpg. This allows us to avoid multicollinearity issues while retaining most of the variance.
# Run linear regression using the rotated principal components as predictors
pca_lm <- lm(mpg ~ RC1 + RC2, data=pca_scores)
# Display the regression summary
summary(pca_lm)
##
## Call:
## lm(formula = mpg ~ RC1 + RC2, data = pca_scores)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.3611 -1.7263 -0.3322 1.3208 5.6763
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.0906 0.4591 43.760 < 2e-16 ***
## RC1 -3.7487 0.4665 -8.037 7.31e-09 ***
## RC2 3.9952 0.4665 8.565 1.96e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.597 on 29 degrees of freedom
## Multiple R-squared: 0.8263, Adjusted R-squared: 0.8143
## F-statistic: 68.97 on 2 and 29 DF, p-value: 9.493e-12
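As a quick sanity check (a sketch using the pca_scores data frame created above), we can verify that the two rotated components are essentially uncorrelated, so this model does not suffer from the multicollinearity problem we saw with the seatpos regression:

# Correlation between the two rotated component scores; it should be close to 0
round(cor(pca_scores$RC1, pca_scores$RC2), 3)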
Step 3: Compare With the Original Model

Now let's compare the model using the PCA components with the original model that uses all the predictors. We'll look at the R-squared values of both models to assess their performance.
# Original regression model using all variables
original_lm <- lm(mpg ~ ., data = mtcars)
# Compare R-squared values of both models
cat("Original Model R-squared:", summary(original_lm)$r.squared, "\n")
## Original Model R-squared: 0.8690158
cat("PCA Model R-squared:", summary(pca_lm)$r.squared, "\n")
## PCA Model R-squared: 0.8262911
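Because the original model uses ten predictors and the PCA model only two, it may also be worth comparing the adjusted R-squared values, which penalise the extra predictors:

# Adjusted R-squared accounts for the number of predictors in each model
cat("Original Model Adjusted R-squared:", summary(original_lm)$adj.r.squared, "\n")
cat("PCA Model Adjusted R-squared:", summary(pca_lm)$adj.r.squared, "\n")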