Previously on STAT 412:

  • Cleaning the dataset

TODAY: MULTICOLLINEARITY

Why is detecting and solving multicollinearity important?

Detecting multicollinearity matters in regression because highly correlated predictors inflate the variance of the coefficient estimates and confuse our results: it becomes difficult to determine which factors truly influence the outcome and by how much. By identifying multicollinearity, we ensure that our conclusions are reliable and not driven by misleading relationships among the predictors.
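To see this concretely, here is a minimal simulation sketch (our own illustration, not part of today's dataset): when one predictor is nearly a copy of another, the standard errors of both coefficients blow up even though the overall fit is fine.

set.seed(412)                  # arbitrary seed for reproducibility
n = 100
x1 = rnorm(n)
x2 = x1 + rnorm(n, sd = 0.05)  # x2 is almost a copy of x1
y = 2*x1 + rnorm(n)
summary(lm(y ~ x1 + x2))       # huge standard errors on both x1 and x2
summary(lm(y ~ x1))            # dropping x2 stabilizes the estimate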

In this recitation, we will use a car-price dataset.

library(readxl)
car = read_excel("car2.xlsx")
head(car)
car_ID CarName fueltype aspiration doornumber wheelbase carlength carwidth carheight curbweight enginesize boreratio stroke compressionratio horsepower peakrpm citympg highwaympg price
1 alfa-romero giulia gas std two 88.6 168.8 64.1 48.8 2548 130 3.47 2.68 9.0 111 5000 21 27 13495
2 alfa-romero stelvio gas std two 88.6 168.8 64.1 48.8 2548 130 3.47 2.68 9.0 111 5000 21 27 16500
3 alfa-romero Quadrifoglio gas std two 94.5 171.2 65.5 52.4 2823 152 2.68 3.47 9.0 154 5000 19 26 16500
4 audi 100 ls gas std four 99.8 176.6 66.2 54.3 2337 109 3.19 3.40 10.0 102 5500 24 30 13950
5 audi 100ls gas std four 99.4 176.6 66.4 54.3 2824 136 3.19 3.40 8.0 115 5500 18 22 17450
6 audi fox gas std two 99.8 177.3 66.3 53.1 2507 136 3.19 3.40 8.5 110 5500 19 25 15250

Quickly examine the dataset:

sum(is.na(car))
## [1] 0
sum(duplicated(car))
## [1] 0
dim(car)
## [1] 205  19

The data has no missing or duplicated values.

Data Dictionary:

  • car_ID: Unique ID of each observation (Integer)
  • CarName: Name of the car company and model (Categorical)
  • fueltype: Car fuel type, i.e., gas or diesel (Categorical)
  • aspiration: Aspiration used in a car (Categorical)
  • doornumber: Number of doors in a car (Categorical)
  • wheelbase: Wheelbase of car (Numeric)
  • carlength: Length of car (Numeric)
  • carwidth: Width of car (Numeric)
  • carheight: Height of car (Numeric)
  • curbweight: The weight of a car without occupants or baggage (Numeric)
  • enginesize: Size of the car's engine (Numeric)
  • boreratio: Bore ratio of car (Numeric)
  • stroke: Stroke, or volume inside the engine (Numeric)
  • compressionratio: Compression ratio of car (Numeric)
  • horsepower: Horsepower (Numeric)
  • peakrpm: Car peak rpm (Numeric)
  • citympg: Mileage in city (Numeric)
  • highwaympg: Mileage on highway (Numeric)
  • price (dependent variable): Price of car (Numeric)

Today's focus is on multicollinearity, so we keep only the numeric variables and ignore the categorical ones. Therefore, we drop the variables car_ID, CarName, fueltype, aspiration, and doornumber.

str(car)
## tibble [205 × 19] (S3: tbl_df/tbl/data.frame)
##  $ car_ID          : num [1:205] 1 2 3 4 5 6 7 8 9 10 ...
##  $ CarName         : chr [1:205] "alfa-romero giulia" "alfa-romero stelvio" "alfa-romero Quadrifoglio" "audi 100 ls" ...
##  $ fueltype        : chr [1:205] "gas" "gas" "gas" "gas" ...
##  $ aspiration      : chr [1:205] "std" "std" "std" "std" ...
##  $ doornumber      : chr [1:205] "two" "two" "two" "four" ...
##  $ wheelbase       : num [1:205] 88.6 88.6 94.5 99.8 99.4 ...
##  $ carlength       : num [1:205] 169 169 171 177 177 ...
##  $ carwidth        : num [1:205] 64.1 64.1 65.5 66.2 66.4 66.3 71.4 71.4 71.4 67.9 ...
##  $ carheight       : num [1:205] 48.8 48.8 52.4 54.3 54.3 53.1 55.7 55.7 55.9 52 ...
##  $ curbweight      : num [1:205] 2548 2548 2823 2337 2824 ...
##  $ enginesize      : num [1:205] 130 130 152 109 136 136 136 136 131 131 ...
##  $ boreratio       : num [1:205] 3.47 3.47 2.68 3.19 3.19 3.19 3.19 3.19 3.13 3.13 ...
##  $ stroke          : num [1:205] 2.68 2.68 3.47 3.4 3.4 3.4 3.4 3.4 3.4 3.4 ...
##  $ compressionratio: num [1:205] 9 9 9 10 8 8.5 8.5 8.5 8.3 7 ...
##  $ horsepower      : num [1:205] 111 111 154 102 115 110 110 110 140 160 ...
##  $ peakrpm         : num [1:205] 5000 5000 5000 5500 5500 5500 5500 5500 5500 5500 ...
##  $ citympg         : num [1:205] 21 21 19 24 18 19 19 19 17 16 ...
##  $ highwaympg      : num [1:205] 27 27 26 30 22 25 25 25 20 22 ...
##  $ price           : num [1:205] 13495 16500 16500 13950 17450 ...
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
car=car %>% select(-car_ID,-CarName,-fueltype,-aspiration,-doornumber)
summary(car)
##    wheelbase        carlength        carwidth       carheight    
##  Min.   : 86.60   Min.   :141.1   Min.   :60.30   Min.   :47.80  
##  1st Qu.: 94.50   1st Qu.:166.3   1st Qu.:64.10   1st Qu.:52.00  
##  Median : 97.00   Median :173.2   Median :65.50   Median :54.10  
##  Mean   : 98.76   Mean   :174.0   Mean   :65.91   Mean   :53.72  
##  3rd Qu.:102.40   3rd Qu.:183.1   3rd Qu.:66.90   3rd Qu.:55.50  
##  Max.   :120.90   Max.   :208.1   Max.   :72.30   Max.   :59.80  
##    curbweight     enginesize      boreratio        stroke      compressionratio
##  Min.   :1488   Min.   : 61.0   Min.   :2.54   Min.   :2.070   Min.   : 7.00   
##  1st Qu.:2145   1st Qu.: 97.0   1st Qu.:3.15   1st Qu.:3.110   1st Qu.: 8.60   
##  Median :2414   Median :120.0   Median :3.31   Median :3.290   Median : 9.00   
##  Mean   :2556   Mean   :126.9   Mean   :3.33   Mean   :3.255   Mean   :10.14   
##  3rd Qu.:2935   3rd Qu.:141.0   3rd Qu.:3.58   3rd Qu.:3.410   3rd Qu.: 9.40   
##  Max.   :4066   Max.   :326.0   Max.   :3.94   Max.   :4.170   Max.   :23.00   
##    horsepower       peakrpm        citympg        highwaympg        price      
##  Min.   : 48.0   Min.   :4150   Min.   :13.00   Min.   :16.00   Min.   : 5118  
##  1st Qu.: 70.0   1st Qu.:4800   1st Qu.:19.00   1st Qu.:25.00   1st Qu.: 7788  
##  Median : 95.0   Median :5200   Median :24.00   Median :30.00   Median :10295  
##  Mean   :104.1   Mean   :5125   Mean   :25.22   Mean   :30.75   Mean   :13277  
##  3rd Qu.:116.0   3rd Qu.:5500   3rd Qu.:30.00   3rd Qu.:34.00   3rd Qu.:16503  
##  Max.   :288.0   Max.   :6600   Max.   :49.00   Max.   :54.00   Max.   :45400

Detecting Multicollinearity

1. Informal way: Correlation plot via “corrplot” package

corr=cor(car)
corr
##                   wheelbase  carlength   carwidth   carheight curbweight
## wheelbase         1.0000000  0.8745875  0.7951436  0.58943476  0.7763863
## carlength         0.8745875  1.0000000  0.8411183  0.49102946  0.8777285
## carwidth          0.7951436  0.8411183  1.0000000  0.27921032  0.8670325
## carheight         0.5894348  0.4910295  0.2792103  1.00000000  0.2955717
## curbweight        0.7763863  0.8777285  0.8670325  0.29557173  1.0000000
## enginesize        0.5693287  0.6833599  0.7354334  0.06714874  0.8505941
## boreratio         0.4887499  0.6064544  0.5591499  0.17107092  0.6484797
## stroke            0.1609590  0.1295326  0.1829417 -0.05530667  0.1687900
## compressionratio  0.2497858  0.1584137  0.1811286  0.26121423  0.1513617
## horsepower        0.3532945  0.5526230  0.6407321 -0.10880206  0.7507393
## peakrpm          -0.3604687 -0.2872422 -0.2200123 -0.32041072 -0.2662432
## citympg          -0.4704136 -0.6709087 -0.6427043 -0.04863963 -0.7574138
## highwaympg       -0.5440819 -0.7046616 -0.6772179 -0.10735763 -0.7974648
## price             0.5778156  0.6829200  0.7593253  0.11933623  0.8353049
##                   enginesize    boreratio      stroke compressionratio
## wheelbase         0.56932868  0.488749875  0.16095905      0.249785845
## carlength         0.68335987  0.606454358  0.12953261      0.158413706
## carwidth          0.73543340  0.559149909  0.18294169      0.181128627
## carheight         0.06714874  0.171070922 -0.05530667      0.261214226
## curbweight        0.85059407  0.648479749  0.16879004      0.151361740
## enginesize        1.00000000  0.583774327  0.20312859      0.028971360
## boreratio         0.58377433  1.000000000 -0.05590898      0.005197339
## stroke            0.20312859 -0.055908983  1.00000000      0.186110110
## compressionratio  0.02897136  0.005197339  0.18611011      1.000000000
## horsepower        0.80976865  0.573676823  0.08093954     -0.204326226
## peakrpm          -0.24465983 -0.254975528 -0.06796375     -0.435740514
## citympg          -0.65365792 -0.584531716 -0.04214475      0.324701425
## highwaympg       -0.67746991 -0.587011784 -0.04393093      0.265201389
## price             0.87414480  0.553173237  0.07944308      0.067983506
##                   horsepower     peakrpm     citympg  highwaympg       price
## wheelbase         0.35329448 -0.36046875 -0.47041361 -0.54408192  0.57781560
## carlength         0.55262297 -0.28724220 -0.67090866 -0.70466160  0.68292002
## carwidth          0.64073208 -0.22001230 -0.64270434 -0.67721792  0.75932530
## carheight        -0.10880206 -0.32041072 -0.04863963 -0.10735763  0.11933623
## curbweight        0.75073925 -0.26624318 -0.75741378 -0.79746479  0.83530488
## enginesize        0.80976865 -0.24465983 -0.65365792 -0.67746991  0.87414480
## boreratio         0.57367682 -0.25497553 -0.58453172 -0.58701178  0.55317324
## stroke            0.08093954 -0.06796375 -0.04214475 -0.04393093  0.07944308
## compressionratio -0.20432623 -0.43574051  0.32470142  0.26520139  0.06798351
## horsepower        1.00000000  0.13107251 -0.80145618 -0.77054389  0.80813882
## peakrpm           0.13107251  1.00000000 -0.11354438 -0.05427481 -0.08526715
## citympg          -0.80145618 -0.11354438  1.00000000  0.97133704 -0.68575134
## highwaympg       -0.77054389 -0.05427481  0.97133704  1.00000000 -0.69759909
## price             0.80813882 -0.08526715 -0.68575134 -0.69759909  1.00000000
library(corrplot)
## corrplot 0.92 loaded
corrplot(corr, method = "number", type = "upper", diag = FALSE, number.cex = 0.7)

The correlation plot shows that there is a multicollinearity problem in our dataset. For example, the value 0.97 indicates an almost perfect linear relationship between citympg and highwaympg.
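As a quick programmatic complement to the plot (a small sketch of our own; the 0.9 cutoff is an arbitrary choice), we can list the variable pairs whose absolute correlation exceeds a threshold:

high = which(abs(corr) > 0.9 & upper.tri(corr), arr.ind = TRUE)  # index pairs above the cutoff
data.frame(var1 = rownames(corr)[high[, 1]],
           var2 = colnames(corr)[high[, 2]],
           r = round(corr[high], 2))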

2. Informal way: Correlation plot via “GGally” package

The ggpairs() function of the GGally package allows us to build a scatter-plot matrix.

library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
gg=ggpairs(car, title = "Correlogram with ggpairs()",
           upper = list(continuous = wrap("cor", size = 2)),
           lower = list(continuous = wrap("points", size = 0.5)))
gg + theme(axis.text = element_text(size = 3),     # axis text size
           strip.text.x = element_text(size = 4),  # panel text size
           strip.text.y = element_text(size = 4))  # panel text size

You see the density plots and scatter plots alongside the pairwise correlations.

3. Informal way: Correlation plot via “PerformanceAnalytics” package

library(PerformanceAnalytics)
chart.Correlation(car)  #method= "pearson" (default), "kendall", or "spearman"

Similar to ggpairs(), this correlation chart gives us the densities, scatter plots, and correlations.

4. Informal way: Model Inspection

Another way to detect multicollinearity makes use of model construction. First, let's build the model with all predictors.

fit=lm(price~.,data=car)
summary(fit)
## 
## Call:
## lm(formula = price ~ ., data = car)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10789.3  -1671.5   -188.7   1628.8  14243.7 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -4.750e+04  1.527e+04  -3.111 0.002154 ** 
## wheelbase         1.226e+02  1.005e+02   1.220 0.223782    
## carlength        -9.468e+01  5.556e+01  -1.704 0.089986 .  
## carwidth          5.056e+02  2.460e+02   2.055 0.041235 *  
## carheight         1.632e+02  1.357e+02   1.202 0.230727    
## curbweight        1.885e+00  1.737e+00   1.085 0.279395    
## enginesize        1.173e+02  1.384e+01   8.481 6.04e-15 ***
## boreratio        -1.003e+03  1.196e+03  -0.838 0.402850    
## stroke           -3.035e+03  7.786e+02  -3.897 0.000134 ***
## compressionratio  2.981e+02  8.291e+01   3.596 0.000412 ***
## horsepower        3.081e+01  1.622e+01   1.900 0.058959 .  
## peakrpm           2.375e+00  6.709e-01   3.540 0.000502 ***
## citympg          -3.204e+02  1.778e+02  -1.802 0.073110 .  
## highwaympg        2.028e+02  1.598e+02   1.270 0.205794    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3189 on 191 degrees of freedom
## Multiple R-squared:  0.8508, Adjusted R-squared:  0.8406 
## F-statistic: 83.78 on 13 and 191 DF,  p-value: < 2.2e-16

Here, the model is statistically significant overall (p < 2.2e-16). The adjusted R-squared of 0.84 indicates that about 84% of the variability in price is explained by the independent variables.

REMINDER: \(R^2\) is a statistical measure that represents the proportion of the variance of the dependent variable that is explained by the independent variable(s) in a regression model.

REMINDER: Adjusted R-squared is a modified version of R-squared that accounts for the number of predictors in the model. It increases when a new term improves the model more than would be expected by chance and decreases when a predictor improves the model less than expected. It is never higher than the R-squared and, unlike the R-squared, it can occasionally be negative. We generally prefer the adjusted R-squared when comparing models because it penalizes unnecessary predictors.
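For reference, with \(n\) observations and \(p\) predictors,

\(R^2_{adj} = 1 - \frac{(1-R^2)(n-1)}{n-p-1}.\)

As a check against the output above, plugging in \(R^2 = 0.8508\), \(n = 205\) and \(p = 13\) gives \(1 - 0.1492 \times 204/191 \approx 0.8406\), matching the reported adjusted R-squared.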

The variables carwidth, enginesize, stroke, compressionratio, and peakrpm are statistically significant, with p-values smaller than 0.05.

You might expect city mpg to have a positive effect on car price; however, its coefficient is negative. This makes us suspicious about the existence of multicollinearity.

In such cases, we cannot interpret the regression coefficients in the usual way. The common interpretation of a regression coefficient, "the effect on Y of a one-unit change in this variable, holding all other X's constant," is not fully applicable when multicollinearity exists: when the X's are highly correlated, you cannot really hold the others constant while changing one.

Try: remove one or more predictors from the model and see what happens to the remaining coefficients.

Let's drop the variable highwaympg.

fit2=lm(price~. -highwaympg,data=car)
summary(fit2)
## 
## Call:
## lm(formula = price ~ . - highwaympg, data = car)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10927.4  -1564.9   -207.4   1570.3  14360.7 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -4.612e+04  1.526e+04  -3.024 0.002840 ** 
## wheelbase         9.756e+01  9.866e+01   0.989 0.323985    
## carlength        -8.077e+01  5.455e+01  -1.481 0.140377    
## carwidth          5.173e+02  2.462e+02   2.101 0.036974 *  
## carheight         1.643e+02  1.359e+02   1.209 0.228186    
## curbweight        1.347e+00  1.688e+00   0.798 0.425816    
## enginesize        1.157e+02  1.380e+01   8.386 1.07e-14 ***
## boreratio        -9.562e+02  1.197e+03  -0.799 0.425417    
## stroke           -2.925e+03  7.750e+02  -3.774 0.000214 ***
## compressionratio  3.019e+02  8.299e+01   3.637 0.000354 ***
## horsepower        3.500e+01  1.590e+01   2.201 0.028906 *  
## peakrpm           2.315e+00  6.703e-01   3.454 0.000679 ***
## citympg          -1.225e+02  8.571e+01  -1.430 0.154442    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3194 on 192 degrees of freedom
## Multiple R-squared:  0.8495, Adjusted R-squared:  0.8401 
## F-statistic: 90.34 on 12 and 192 DF,  p-value: < 2.2e-16
list(fit=round(fit$coefficients,2),fit2=round(fit2$coefficients,2))
## $fit
##      (Intercept)        wheelbase        carlength         carwidth 
##        -47495.74           122.62           -94.68           505.57 
##        carheight       curbweight       enginesize        boreratio 
##           163.18             1.88           117.35         -1002.57 
##           stroke compressionratio       horsepower          peakrpm 
##         -3034.61           298.14            30.81             2.38 
##          citympg       highwaympg 
##          -320.35           202.82 
## 
## $fit2
##      (Intercept)        wheelbase        carlength         carwidth 
##        -46124.34            97.56           -80.77           517.26 
##        carheight       curbweight       enginesize        boreratio 
##           164.33             1.35           115.71          -956.23 
##           stroke compressionratio       horsepower          peakrpm 
##         -2924.54           301.85            35.00             2.32 
##          citympg 
##          -122.54

Now carwidth, enginesize, stroke, compressionratio, peakrpm and, additionally, horsepower are statistically significant, with p-values smaller than 0.05. Look at the coefficient of citympg: its negative magnitude shrinks considerably (from about -320 to about -123). Moreover, there is a sharp change in the coefficient of wheelbase, which can be an important indication of multicollinearity.

We have seen several informal ways of detecting multicollinearity. What about the formal way?

5. Formal way: VIF values

5.1. Checking individual VIF values
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
vif(fit)
##        wheelbase        carlength         carwidth        carheight 
##         7.340949         9.422999         5.586367         2.205975 
##       curbweight       enginesize        boreratio           stroke 
##        16.413371         6.658982         2.103912         1.195781 
## compressionratio       horsepower          peakrpm          citympg 
##         2.175511         8.247880         2.053763        27.128671 
##       highwaympg 
##        24.277439

According to the values presented above, certain predictors may cause a multicollinearity problem. The literature uses two different thresholds, VIF > 5 and VIF > 10. Considering the latter, three predictors (curbweight, citympg, and highwaympg) have a VIF greater than 10. Therefore, multicollinearity exists.
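REMINDER: the VIF of predictor \(j\) is \(\text{VIF}_j = \frac{1}{1-R_j^2}\), where \(R_j^2\) is the R-squared from regressing \(X_j\) on all the other predictors. As a sanity check (a sketch; note that the response price must be excluded from the right-hand side), we can reproduce the VIF of citympg by hand:

rj2 = summary(lm(citympg ~ . - price, data = car))$r.squared  # citympg regressed on the other predictors
1/(1 - rj2)  # should match vif(fit)["citympg"], about 27.13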

5.2. Checking joint VIF value

Besides the individual VIF scores, the joint (mean) VIF gives us information about the existence of multicollinearity.

ifelse(mean(vif(fit)) > 1, "multicollinearity exists", "multicollinearity does not exist")  # mean VIF across the 13 predictors; a mean well above 1 signals multicollinearity
## [1] "multicollinearity exists"

The mean VIF (about 8.8) is considerably greater than 1, so multicollinearity exists.

What are the possible solutions for multicollinearity?

Remedies for Multicollinearity

  • Transformation: scaling (standardization, min-max, robust), centering, or the correlation transformation (demonstrated in the small exercise below)

Small Exercise

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
data(mtcars)
head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
cor(mtcars[,c("cyl" , "disp" , "hp" , "wt" , "qsec")])
##             cyl       disp         hp         wt       qsec
## cyl   1.0000000  0.9020329  0.8324475  0.7824958 -0.5912421
## disp  0.9020329  1.0000000  0.7909486  0.8879799 -0.4336979
## hp    0.8324475  0.7909486  1.0000000  0.6587479 -0.7082234
## wt    0.7824958  0.8879799  0.6587479  1.0000000 -0.1747159
## qsec -0.5912421 -0.4336979 -0.7082234 -0.1747159  1.0000000
model = lm(mpg ~ cyl + disp + hp + wt + qsec + disp*hp, data = mtcars)
summary(model)
## 
## Call:
## lm(formula = mpg ~ cyl + disp + hp + wt + qsec + disp * hp, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4624 -1.4010 -0.5426  1.5036  3.6878 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.850e+01  8.895e+00   4.328 0.000213 ***
## cyl          1.860e-01  8.006e-01   0.232 0.818199    
## disp        -4.248e-02  2.232e-02  -1.904 0.068516 .  
## hp          -9.104e-02  3.034e-02  -3.001 0.006031 ** 
## wt          -3.878e+00  1.124e+00  -3.451 0.001997 ** 
## qsec         3.079e-01  4.352e-01   0.708 0.485734    
## disp:hp      2.622e-04  9.455e-05   2.773 0.010334 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.272 on 25 degrees of freedom
## Multiple R-squared:  0.8854,  Adjusted R-squared:  0.8579 
## F-statistic: 32.21 on 6 and 25 DF,  p-value: 1.352e-10
vif(model)
## there are higher-order terms (interactions) in this model
## consider setting type = 'predictor'; see ?vif
##       cyl      disp        hp        wt      qsec   disp:hp 
## 12.283585 45.955641 26.002534  7.265024  3.632723 62.810241
centered_mtcars = as.data.frame(lapply(mtcars, function(x) x - mean(x)))
centered_model = lm(mpg ~ cyl + disp + hp + wt + qsec + disp*hp, data = centered_mtcars)
summary(centered_model)
## 
## Call:
## lm(formula = mpg ~ cyl + disp + hp + wt + qsec + disp * hp, data = centered_mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4624 -1.4010 -0.5426  1.5036  3.6878 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -1.707e+00  7.350e-01  -2.323   0.0286 * 
## cyl          1.860e-01  8.006e-01   0.232   0.8182   
## disp        -4.017e-03  1.208e-02  -0.333   0.7423   
## hp          -3.054e-02  1.461e-02  -2.090   0.0469 * 
## wt          -3.878e+00  1.124e+00  -3.451   0.0020 **
## qsec         3.079e-01  4.352e-01   0.708   0.4857   
## disp:hp      2.622e-04  9.455e-05   2.773   0.0103 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.272 on 25 degrees of freedom
## Multiple R-squared:  0.8854,  Adjusted R-squared:  0.8579 
## F-statistic: 32.21 on 6 and 25 DF,  p-value: 1.352e-10
vif(centered_model)
## there are higher-order terms (interactions) in this model
## consider setting type = 'predictor'; see ?vif
##       cyl      disp        hp        wt      qsec   disp:hp 
## 12.283585 13.465157  6.029215  7.265024  3.632723  1.801127
standardized_mtcars = as.data.frame(scale(mtcars))
standardized_model = lm(mpg ~ cyl + disp + hp + wt + qsec + disp*hp, data = standardized_mtcars)
summary(standardized_model)
## 
## Call:
## lm(formula = mpg ~ cyl + disp + hp + wt + qsec + disp * hp, data = standardized_mtcars)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.57449 -0.23246 -0.09002  0.24949  0.61189 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -0.28330    0.12196  -2.323   0.0286 * 
## cyl          0.05511    0.23725   0.232   0.8182   
## disp        -0.08260    0.24840  -0.333   0.7423   
## hp          -0.34745    0.16622  -2.090   0.0469 * 
## wt          -0.62961    0.18246  -3.451   0.0020 **
## qsec         0.09130    0.12902   0.708   0.4857   
## disp:hp      0.36973    0.13331   2.773   0.0103 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3769 on 25 degrees of freedom
## Multiple R-squared:  0.8854,  Adjusted R-squared:  0.8579 
## F-statistic: 32.21 on 6 and 25 DF,  p-value: 1.352e-10
vif(standardized_model)
## there are higher-order terms (interactions) in this model
## consider setting type = 'predictor'; see ?vif
##       cyl      disp        hp        wt      qsec   disp:hp 
## 12.283585 13.465157  6.029215  7.265024  3.632723  1.801127

Takeaway: centering (or standardizing) the predictors leaves the fit itself unchanged (identical residuals, R-squared, and interaction coefficient) but dramatically reduces the VIF of the interaction term disp:hp, from about 62.8 to 1.8. This is why centering is a standard remedy for the structural multicollinearity introduced by interaction and polynomial terms.
  • Eliminating the corresponding variable
  • PCA (a brief sketch of this remedy follows this list)
  • Shrinkage Methods
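A minimal sketch of the PCA remedy mentioned above (keeping 5 components is an arbitrary choice for illustration): principal components are uncorrelated by construction, so a regression on them has no multicollinearity.

pcs = prcomp(car[, names(car) != "price"], scale. = TRUE)  # PCA on the predictors only
summary(pcs)                                               # variance explained by each component
pc_data = data.frame(pcs$x[, 1:5], price = car$price)      # keep the first 5 components
fit_pca = lm(price ~ ., data = pc_data)                    # orthogonal predictors: all VIFs equal 1

The price to pay is interpretability: each component is a linear combination of all the original predictors.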

RSS:

\(\text{RSS}(\boldsymbol{\beta})=\sum_{i=1}^n(Y_i-\beta_0-\beta_1X_{i1}-\ldots-\beta_pX_{ip})^2.\)

Ridge Regression

\(\text{RSS}(\boldsymbol{\beta})+\lambda\sum_{j=1}^p \beta_j^2\)

where \(\lambda\) is a penalty or tuning parameter and it modulates the importance of fit vs. shrinkage.

We can implement ridge regression via the glmnet() function in R. However, glmnet() does not use the y ~ x formula notation, so we have to define the predictors and the dependent variable in matrix form. The model.matrix() function is very helpful for generating the x matrix: it not only creates a matrix of the predictors but also automatically converts any qualitative variables into dummy variables.

BE CAREFUL: The glmnet() function can only accept numerical or quantitative inputs and it cannot handle categorical or qualitative data.

x=model.matrix(price~., data=car)[,-1]  # [,-1] drops the column of 1s (the intercept) that model.matrix() adds
y=as.matrix(car[,"price"])              # glmnet expects a matrix, not a data frame

If the value of alpha is 0, glmnet() fits a ridge regression model, and if the value of alpha is 1, it fits a lasso model.

Simple idea: we fit the model for many values of \(\lambda\) and then choose \(\lambda\) by cross-validation.

library(glmnet)
## Warning: package 'glmnet' was built under R version 4.3.3
## Loading required package: Matrix
## Warning: package 'Matrix' was built under R version 4.3.3
## Loaded glmnet 4.1-8
grid=10^seq(10, -5, length = 100)
ridge=glmnet(x, y, alpha = 0, lambda = grid, standardize = TRUE)  # alpha = 0 means ridge regression

By default, glmnet() performs ridge regression over an automatically selected range of \(\lambda\) values. However, it is worthwhile to supply our own grid to control the search.

dim(coef(ridge))
## [1]  14 100
ridge$lambda[5]
## [1] 2477076356
coef(ridge)[,5]
##      (Intercept)        wheelbase        carlength         carwidth 
##     1.327529e+04     2.466178e-03     1.422688e-03     9.097444e-03 
##        carheight       curbweight       enginesize        boreratio 
##     1.255186e-03     4.123195e-05     5.395171e-04     5.249292e-02 
##           stroke compressionratio       horsepower          peakrpm 
##     6.510854e-03     4.398977e-04     5.252472e-04    -4.594367e-06 
##          citympg       highwaympg 
##    -2.694046e-03    -2.603567e-03
ridge$lambda[80]
## [1] 0.01072267
coef(ridge)[,80]
##      (Intercept)        wheelbase        carlength         carwidth 
##    -47531.287430       121.837641       -94.278836       505.844286 
##        carheight       curbweight       enginesize        boreratio 
##       163.140696         1.889160       117.243758     -1000.712083 
##           stroke compressionratio       horsepower          peakrpm 
##     -3032.251689       297.568762        30.931334         2.374865 
##          citympg       highwaympg 
##      -316.416415       199.904330
set.seed(1)
cv_ridge = cv.glmnet(x, y, alpha = 0, lambda = grid)
plot(cv_ridge)

opt_lambda_ridge = cv_ridge$lambda.min
opt_lambda_ridge
## [1] 533.6699

It's worth mentioning that glmnet() standardizes the variables by default so that they are on the same scale. This prevents the penalty from affecting some coefficients more than others simply because of their units.

ridge2=glmnet(x, y, alpha = 0, lambda = opt_lambda_ridge, standardize = TRUE)  # standardize = TRUE (default)

coef(ridge2)
## 14 x 1 sparse Matrix of class "dgCMatrix"
##                             s0
## (Intercept)      -45165.234494
## wheelbase            70.068761
## carlength           -29.252596
## carwidth            479.245300
## carheight            75.125313
## curbweight            2.173168
## enginesize           89.058070
## boreratio          -681.845774
## stroke            -2390.988074
## compressionratio    235.677483
## horsepower           44.827994
## peakrpm               1.734463
## citympg             -78.681126
## highwaympg            7.380630

BE CAREFUL: It is seen that ridge regression cannot set coefficients exactly to zero! Although ridge regression effectively handles multicollinearity, it has a significant drawback: it keeps all predictors in the model, since the penalty term forces many coefficients close to zero but never exactly to zero. While this typically doesn't hurt prediction accuracy, it can complicate the interpretation of the results. Hence, considering alternative techniques may be advisable.

pred_ridge = predict(ridge2, s = opt_lambda_ridge, newx = x)

#RMSE
rmse_ridge = sqrt(mean((pred_ridge - y)^2))
rmse_ridge
## [1] 3135.089

Lasso Regression

\(\text{RSS}(\boldsymbol{\beta})+\lambda\sum_{j=1}^p |\beta_j|\) where \(\lambda\) is penalty parameter which modulates the importance of fit vs. shrinkage.

Why would we use the lasso instead of ridge regression?

  • Ridge regression shrinks all the coefficients toward zero, but never exactly to zero.
  • The lasso shrinks some of the coefficients all the way to zero.

Hence, the lasso performs feature selection at the same time.

Let’s apply lasso regression.

grid=10^seq(10, -5, length = 100)
lasso=glmnet(x, y, alpha = 1, lambda = grid)
plot(lasso)
## Warning in regularize.values(x, y, ties, missing(ties), na.rm = na.rm):
## collapsing to unique 'x' values

The coefficient plot above illustrates that certain coefficients become exactly zero depending on the chosen tuning parameter. Next, we conduct cross-validation to choose the tuning parameter and estimate the associated test error.

cv_lasso=cv.glmnet(x, y, alpha = 1, lambda = grid)
opt_lambda_lasso = cv_lasso$lambda.min
opt_lambda_lasso
## [1] 187.3817
plot(cv_lasso)

The optimal lambda is about 187. Now, let's build the model with this lambda.

lasso2=glmnet(x, y, alpha = 1, lambda = opt_lambda_lasso)
coef(lasso2)
## 14 x 1 sparse Matrix of class "dgCMatrix"
##                             s0
## (Intercept)      -45482.975329
## wheelbase             8.316413
## carlength             .       
## carwidth            481.165370
## carheight            23.742790
## curbweight            2.052894
## enginesize           99.720706
## boreratio             .       
## stroke            -1840.543793
## compressionratio    147.877488
## horsepower           37.681530
## peakrpm               1.498585
## citympg              -2.505219
## highwaympg            .

Now it is obvious that some of the coefficients are set to 0. This is called feature selection.

pred_lasso= predict(lasso2, s = opt_lambda_lasso, newx = x)

#RMSE
rmse_lasso= sqrt(mean((pred_lasso - y)^2))
rmse_lasso
## [1] 3175.461

The RMSE obtained from the lasso is about 3175.

Elastic Net

Elastic net is a compromise between ridge and lasso: it combines the two penalties.

\(\text{RSS}(\boldsymbol{\beta})+\lambda \left[ \left(1 - \alpha \right) \frac{\left \lVert \beta \right \rVert_{2}^{2}}{2} + \alpha \left \lVert \beta \right \rVert_{1} \right]\)

If we set alpha to 0.5, we implement elastic net regression with equal weight on the two penalties. The implementation is similar to the ridge and lasso regressions.

elast=glmnet(x, y, alpha = 0.5, lambda = grid)
plot(elast)
## Warning in regularize.values(x, y, ties, missing(ties), na.rm = na.rm):
## collapsing to unique 'x' values

cv_elast=cv.glmnet(x, y, alpha = 0.5, lambda = grid)
opt_lambda_elast = cv_elast$lambda.min
opt_lambda_elast
## [1] 23.1013

The optimum lambda value for elastic net regression is about 23.

plot(cv_elast)

elast2 = glmnet(x, y, alpha = 0.5, lambda = opt_lambda_elast) #best_elastic
coef(elast2)
## 14 x 1 sparse Matrix of class "dgCMatrix"
##                             s0
## (Intercept)      -47027.523203
## wheelbase            94.801178
## carlength           -68.739889
## carwidth            497.713313
## carheight           144.423009
## curbweight            1.594638
## enginesize          114.644763
## boreratio          -877.702745
## stroke            -2888.979819
## compressionratio    287.701323
## horsepower           33.616248
## peakrpm               2.286126
## citympg            -207.284055
## highwaympg          102.096937
pred_elast = predict(elast2, s = opt_lambda_elast, newx = x)

#RMSE
rmse_elast = sqrt(mean((pred_elast - y)^2))
rmse_elast
## [1] 3083.036

Assessing the Performances

coefs=data.frame(as.matrix(coef(ridge2)),as.matrix(coef(lasso2)),as.matrix(coef(elast2)))
colnames(coefs)=c("Ridge","Lasso","Elastic")
coefs
Ridge Lasso Elastic
(Intercept) -45165.234494 -45482.975329 -47027.523203
wheelbase 70.068761 8.316413 94.801177
carlength -29.252596 0.000000 -68.739889
carwidth 479.245300 481.165370 497.713313
carheight 75.125313 23.742790 144.423009
curbweight 2.173168 2.052894 1.594637
enginesize 89.058070 99.720706 114.644763
boreratio -681.845774 0.000000 -877.702745
stroke -2390.988074 -1840.543793 -2888.979819
compressionratio 235.677483 147.877488 287.701323
horsepower 44.827994 37.681530 33.616248
peakrpm 1.734463 1.498585 2.286126
citympg -78.681126 -2.505219 -207.284055
highwaympg 7.380630 0.000000 102.096937

In the table above, we see that some coefficients obtained from the lasso regression are exactly 0. It is also meaningful to look at the RMSE scores of the three regressions.

rmse_comp=data.frame(rmse_ridge,rmse_lasso,rmse_elast)
colnames(rmse_comp)=c("Ridge","Lasso","Elastic")
rmse_comp
Ridge Lasso Elastic
3135.089 3175.461 3083.036

According to the RMSE scores, elastic net regression is the best suited to our data.
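Note that the RMSE values above are computed on the same data used to fit the models. For an honest comparison, a held-out test set is preferable; here is a minimal sketch (the 80/20 split and the seed are arbitrary choices):

set.seed(412)
train = sample(nrow(x), 0.8 * nrow(x))                # 80% of the rows for training
cv_tr = cv.glmnet(x[train, ], y[train], alpha = 0.5)  # tune lambda on the training set only
pred_test = predict(cv_tr, s = "lambda.min", newx = x[-train, ])
sqrt(mean((pred_test - y[-train])^2))                 # test RMSE for elastic net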

References:

  • https://web.stanford.edu/class/stats202/notes/Model-selection/Shrinkage.html
  • Yozgatlıgil, C. (2024). STAT 412 Lecture Notes.