Previously on STAT 412:

  • Cleaning the dataset

TODAY: MULTICOLLINEARITY

Why is detecting and solving multicollinearity important?

Detecting multicollinearity matters in regression because highly correlated predictors inflate the variance of the coefficient estimates and confuse our results: it becomes difficult to determine which factors truly influence the outcome and by how much. By identifying multicollinearity, we ensure that our conclusions are reliable and not driven by misleading relationships among the predictors.
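To see this concretely, here is a minimal simulation sketch (our own illustration, not part of today's dataset): when one predictor is nearly a copy of another, the standard errors of both coefficients blow up even though the overall fit is fine.

set.seed(412)                  # arbitrary seed for reproducibility
n = 100
x1 = rnorm(n)
x2 = x1 + rnorm(n, sd = 0.05)  # x2 is almost a copy of x1
y = 2*x1 + rnorm(n)
summary(lm(y ~ x1 + x2))       # huge standard errors on both x1 and x2
summary(lm(y ~ x1))            # dropping x2 stabilizes the estimate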

In this recitation, we will use a car-price dataset.

library(readxl)
car = read_excel("car2.xlsx")
head(car)
car_ID CarName fueltype aspiration doornumber wheelbase carlength carwidth carheight curbweight enginesize boreratio stroke compressionratio horsepower peakrpm citympg highwaympg price
1 alfa-romero giulia gas std two 88.6 168.8 64.1 48.8 2548 130 3.47 2.68 9.0 111 5000 21 27 13495
2 alfa-romero stelvio gas std two 88.6 168.8 64.1 48.8 2548 130 3.47 2.68 9.0 111 5000 21 27 16500
3 alfa-romero Quadrifoglio gas std two 94.5 171.2 65.5 52.4 2823 152 2.68 3.47 9.0 154 5000 19 26 16500
4 audi 100 ls gas std four 99.8 176.6 66.2 54.3 2337 109 3.19 3.40 10.0 102 5500 24 30 13950
5 audi 100ls gas std four 99.4 176.6 66.4 54.3 2824 136 3.19 3.40 8.0 115 5500 18 22 17450
6 audi fox gas std two 99.8 177.3 66.3 53.1 2507 136 3.19 3.40 8.5 110 5500 19 25 15250

Quickly examine the dataset:

sum(is.na(car))
## [1] 0
sum(duplicated(car))
## [1] 0
dim(car)
## [1] 205  19

The data has no missing or duplicated values.

Data Dictionary:

  • car_ID: Unique ID of each observation (Integer)
  • CarName: Name of the car company and model (Categorical)
  • fueltype: Car fuel type, i.e., gas or diesel (Categorical)
  • aspiration: Aspiration used in a car (Categorical)
  • doornumber: Number of doors in a car (Categorical)
  • wheelbase: Wheelbase of car (Numeric)
  • carlength: Length of car (Numeric)
  • carwidth: Width of car (Numeric)
  • carheight: Height of car (Numeric)
  • curbweight: The weight of a car without occupants or baggage (Numeric)
  • enginesize: Size of the car's engine (Numeric)
  • boreratio: Bore ratio of car (Numeric)
  • stroke: Stroke, or volume inside the engine (Numeric)
  • compressionratio: Compression ratio of car (Numeric)
  • horsepower: Horsepower (Numeric)
  • peakrpm: Car peak rpm (Numeric)
  • citympg: Mileage in city (Numeric)
  • highwaympg: Mileage on highway (Numeric)
  • price (dependent variable): Price of car (Numeric)

Today's focus is on multicollinearity, so we keep only the numeric variables and ignore the categorical ones. Therefore, we drop the variables car_ID, CarName, fueltype, aspiration, and doornumber.

str(car)
## tibble [205 × 19] (S3: tbl_df/tbl/data.frame)
##  $ car_ID          : num [1:205] 1 2 3 4 5 6 7 8 9 10 ...
##  $ CarName         : chr [1:205] "alfa-romero giulia" "alfa-romero stelvio" "alfa-romero Quadrifoglio" "audi 100 ls" ...
##  $ fueltype        : chr [1:205] "gas" "gas" "gas" "gas" ...
##  $ aspiration      : chr [1:205] "std" "std" "std" "std" ...
##  $ doornumber      : chr [1:205] "two" "two" "two" "four" ...
##  $ wheelbase       : num [1:205] 88.6 88.6 94.5 99.8 99.4 ...
##  $ carlength       : num [1:205] 169 169 171 177 177 ...
##  $ carwidth        : num [1:205] 64.1 64.1 65.5 66.2 66.4 66.3 71.4 71.4 71.4 67.9 ...
##  $ carheight       : num [1:205] 48.8 48.8 52.4 54.3 54.3 53.1 55.7 55.7 55.9 52 ...
##  $ curbweight      : num [1:205] 2548 2548 2823 2337 2824 ...
##  $ enginesize      : num [1:205] 130 130 152 109 136 136 136 136 131 131 ...
##  $ boreratio       : num [1:205] 3.47 3.47 2.68 3.19 3.19 3.19 3.19 3.19 3.13 3.13 ...
##  $ stroke          : num [1:205] 2.68 2.68 3.47 3.4 3.4 3.4 3.4 3.4 3.4 3.4 ...
##  $ compressionratio: num [1:205] 9 9 9 10 8 8.5 8.5 8.5 8.3 7 ...
##  $ horsepower      : num [1:205] 111 111 154 102 115 110 110 110 140 160 ...
##  $ peakrpm         : num [1:205] 5000 5000 5000 5500 5500 5500 5500 5500 5500 5500 ...
##  $ citympg         : num [1:205] 21 21 19 24 18 19 19 19 17 16 ...
##  $ highwaympg      : num [1:205] 27 27 26 30 22 25 25 25 20 22 ...
##  $ price           : num [1:205] 13495 16500 16500 13950 17450 ...
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
car=car %>% select(-car_ID,-CarName,-fueltype,-aspiration,-doornumber)
summary(car)
##    wheelbase        carlength        carwidth       carheight    
##  Min.   : 86.60   Min.   :141.1   Min.   :60.30   Min.   :47.80  
##  1st Qu.: 94.50   1st Qu.:166.3   1st Qu.:64.10   1st Qu.:52.00  
##  Median : 97.00   Median :173.2   Median :65.50   Median :54.10  
##  Mean   : 98.76   Mean   :174.0   Mean   :65.91   Mean   :53.72  
##  3rd Qu.:102.40   3rd Qu.:183.1   3rd Qu.:66.90   3rd Qu.:55.50  
##  Max.   :120.90   Max.   :208.1   Max.   :72.30   Max.   :59.80  
##    curbweight     enginesize      boreratio        stroke      compressionratio
##  Min.   :1488   Min.   : 61.0   Min.   :2.54   Min.   :2.070   Min.   : 7.00   
##  1st Qu.:2145   1st Qu.: 97.0   1st Qu.:3.15   1st Qu.:3.110   1st Qu.: 8.60   
##  Median :2414   Median :120.0   Median :3.31   Median :3.290   Median : 9.00   
##  Mean   :2556   Mean   :126.9   Mean   :3.33   Mean   :3.255   Mean   :10.14   
##  3rd Qu.:2935   3rd Qu.:141.0   3rd Qu.:3.58   3rd Qu.:3.410   3rd Qu.: 9.40   
##  Max.   :4066   Max.   :326.0   Max.   :3.94   Max.   :4.170   Max.   :23.00   
##    horsepower       peakrpm        citympg        highwaympg        price      
##  Min.   : 48.0   Min.   :4150   Min.   :13.00   Min.   :16.00   Min.   : 5118  
##  1st Qu.: 70.0   1st Qu.:4800   1st Qu.:19.00   1st Qu.:25.00   1st Qu.: 7788  
##  Median : 95.0   Median :5200   Median :24.00   Median :30.00   Median :10295  
##  Mean   :104.1   Mean   :5125   Mean   :25.22   Mean   :30.75   Mean   :13277  
##  3rd Qu.:116.0   3rd Qu.:5500   3rd Qu.:30.00   3rd Qu.:34.00   3rd Qu.:16503  
##  Max.   :288.0   Max.   :6600   Max.   :49.00   Max.   :54.00   Max.   :45400

Detecting Multicollinearity

1. Informal way: Correlation plot via “corrplot” package

corr=cor(car)
corr
##                   wheelbase  carlength   carwidth   carheight curbweight
## wheelbase         1.0000000  0.8745875  0.7951436  0.58943476  0.7763863
## carlength         0.8745875  1.0000000  0.8411183  0.49102946  0.8777285
## carwidth          0.7951436  0.8411183  1.0000000  0.27921032  0.8670325
## carheight         0.5894348  0.4910295  0.2792103  1.00000000  0.2955717
## curbweight        0.7763863  0.8777285  0.8670325  0.29557173  1.0000000
## enginesize        0.5693287  0.6833599  0.7354334  0.06714874  0.8505941
## boreratio         0.4887499  0.6064544  0.5591499  0.17107092  0.6484797
## stroke            0.1609590  0.1295326  0.1829417 -0.05530667  0.1687900
## compressionratio  0.2497858  0.1584137  0.1811286  0.26121423  0.1513617
## horsepower        0.3532945  0.5526230  0.6407321 -0.10880206  0.7507393
## peakrpm          -0.3604687 -0.2872422 -0.2200123 -0.32041072 -0.2662432
## citympg          -0.4704136 -0.6709087 -0.6427043 -0.04863963 -0.7574138
## highwaympg       -0.5440819 -0.7046616 -0.6772179 -0.10735763 -0.7974648
## price             0.5778156  0.6829200  0.7593253  0.11933623  0.8353049
##                   enginesize    boreratio      stroke compressionratio
## wheelbase         0.56932868  0.488749875  0.16095905      0.249785845
## carlength         0.68335987  0.606454358  0.12953261      0.158413706
## carwidth          0.73543340  0.559149909  0.18294169      0.181128627
## carheight         0.06714874  0.171070922 -0.05530667      0.261214226
## curbweight        0.85059407  0.648479749  0.16879004      0.151361740
## enginesize        1.00000000  0.583774327  0.20312859      0.028971360
## boreratio         0.58377433  1.000000000 -0.05590898      0.005197339
## stroke            0.20312859 -0.055908983  1.00000000      0.186110110
## compressionratio  0.02897136  0.005197339  0.18611011      1.000000000
## horsepower        0.80976865  0.573676823  0.08093954     -0.204326226
## peakrpm          -0.24465983 -0.254975528 -0.06796375     -0.435740514
## citympg          -0.65365792 -0.584531716 -0.04214475      0.324701425
## highwaympg       -0.67746991 -0.587011784 -0.04393093      0.265201389
## price             0.87414480  0.553173237  0.07944308      0.067983506
##                   horsepower     peakrpm     citympg  highwaympg       price
## wheelbase         0.35329448 -0.36046875 -0.47041361 -0.54408192  0.57781560
## carlength         0.55262297 -0.28724220 -0.67090866 -0.70466160  0.68292002
## carwidth          0.64073208 -0.22001230 -0.64270434 -0.67721792  0.75932530
## carheight        -0.10880206 -0.32041072 -0.04863963 -0.10735763  0.11933623
## curbweight        0.75073925 -0.26624318 -0.75741378 -0.79746479  0.83530488
## enginesize        0.80976865 -0.24465983 -0.65365792 -0.67746991  0.87414480
## boreratio         0.57367682 -0.25497553 -0.58453172 -0.58701178  0.55317324
## stroke            0.08093954 -0.06796375 -0.04214475 -0.04393093  0.07944308
## compressionratio -0.20432623 -0.43574051  0.32470142  0.26520139  0.06798351
## horsepower        1.00000000  0.13107251 -0.80145618 -0.77054389  0.80813882
## peakrpm           0.13107251  1.00000000 -0.11354438 -0.05427481 -0.08526715
## citympg          -0.80145618 -0.11354438  1.00000000  0.97133704 -0.68575134
## highwaympg       -0.77054389 -0.05427481  0.97133704  1.00000000 -0.69759909
## price             0.80813882 -0.08526715 -0.68575134 -0.69759909  1.00000000
library(corrplot)
## corrplot 0.92 loaded
corrplot(corr, method = "number", type = "upper", diag = FALSE, number.cex = 0.7)

The correlation plot shows that there is a multicollinearity problem in our dataset. For example, the value 0.97 indicates an almost perfect linear relationship between citympg and highwaympg.
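As a quick programmatic complement to the plot (a small sketch of our own; the 0.9 cutoff is an arbitrary choice), we can list the variable pairs whose absolute correlation exceeds a threshold:

high = which(abs(corr) > 0.9 & upper.tri(corr), arr.ind = TRUE)  # index pairs above the cutoff
data.frame(var1 = rownames(corr)[high[, 1]],
           var2 = colnames(corr)[high[, 2]],
           r = round(corr[high], 2))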

2. Informal way: Correlation plot via “GGally” package

The ggpairs() function of the GGally package allows us to build a scatter-plot matrix.

library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
gg=ggpairs(car, title = "Correlogram with ggpairs()",
           upper = list(continuous = wrap("cor", size = 2)),
           lower = list(continuous = wrap("points", size = 0.5)))
gg + theme(axis.text = element_text(size = 3),     # axis text size
           strip.text.x = element_text(size = 4),  # panel text size
           strip.text.y = element_text(size = 4))  # panel text size

You see the density plots and scatter plots alongside the pairwise correlations.

3. Informal way: Correlation plot via “PerformanceAnalytics” package

library(PerformanceAnalytics)
chart.Correlation(car)  #method= "pearson" (default), "kendall", or "spearman"

Similar to ggpairs(), this correlation chart gives us the densities, scatter plots, and correlations.

4. Informal way: Model Inspection

Another way to detect multicollinearity makes use of model construction. First, let's build the model with all predictors.

fit=lm(price~.,data=car)
summary(fit)
## 
## Call:
## lm(formula = price ~ ., data = car)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10789.3  -1671.5   -188.7   1628.8  14243.7 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -4.750e+04  1.527e+04  -3.111 0.002154 ** 
## wheelbase         1.226e+02  1.005e+02   1.220 0.223782    
## carlength        -9.468e+01  5.556e+01  -1.704 0.089986 .  
## carwidth          5.056e+02  2.460e+02   2.055 0.041235 *  
## carheight         1.632e+02  1.357e+02   1.202 0.230727    
## curbweight        1.885e+00  1.737e+00   1.085 0.279395    
## enginesize        1.173e+02  1.384e+01   8.481 6.04e-15 ***
## boreratio        -1.003e+03  1.196e+03  -0.838 0.402850    
## stroke           -3.035e+03  7.786e+02  -3.897 0.000134 ***
## compressionratio  2.981e+02  8.291e+01   3.596 0.000412 ***
## horsepower        3.081e+01  1.622e+01   1.900 0.058959 .  
## peakrpm           2.375e+00  6.709e-01   3.540 0.000502 ***
## citympg          -3.204e+02  1.778e+02  -1.802 0.073110 .  
## highwaympg        2.028e+02  1.598e+02   1.270 0.205794    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3189 on 191 degrees of freedom
## Multiple R-squared:  0.8508, Adjusted R-squared:  0.8406 
## F-statistic: 83.78 on 13 and 191 DF,  p-value: < 2.2e-16

Here, the model is statistically significant overall (p < 2.2e-16). The adjusted R-squared of 0.84 indicates that about 84% of the variability in price is explained by the independent variables.

REMINDER: \(R^2\) is a statistical measure that represents the proportion of the variance of the dependent variable that is explained by the independent variable(s) in a regression model.

REMINDER: Adjusted R-squared is a modified version of R-squared that accounts for the number of predictors in the model. It increases when a new term improves the model more than would be expected by chance and decreases when a predictor improves the model less than expected. It is never higher than the R-squared and, unlike the R-squared, it can occasionally be negative. We generally prefer the adjusted R-squared when comparing models because it penalizes unnecessary predictors.
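For reference, with \(n\) observations and \(p\) predictors,

\(R^2_{adj} = 1 - \frac{(1-R^2)(n-1)}{n-p-1}.\)

As a check against the output above, plugging in \(R^2 = 0.8508\), \(n = 205\) and \(p = 13\) gives \(1 - 0.1492 \times 204/191 \approx 0.8406\), matching the reported adjusted R-squared.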

The variables carwidth, enginesize, stroke, compressionratio, and peakrpm are statistically significant, with p-values smaller than 0.05.

You might expect city mpg to have a positive effect on car price; however, its coefficient is negative. This makes us suspicious about the existence of multicollinearity.

In such cases, we cannot interpret the regression coefficients in the usual way. The common interpretation of a regression coefficient, "the effect on Y of a one-unit change in this variable, holding all other X's constant," is not fully applicable when multicollinearity exists: when the X's are highly correlated, you cannot really hold the others constant while changing one.

Try: remove one or more predictors from the model and see what happens to the remaining coefficients.

Let's drop the variable highwaympg.

fit2=lm(price~. -highwaympg,data=car)
summary(fit2)
## 
## Call:
## lm(formula = price ~ . - highwaympg, data = car)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10927.4  -1564.9   -207.4   1570.3  14360.7 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -4.612e+04  1.526e+04  -3.024 0.002840 ** 
## wheelbase         9.756e+01  9.866e+01   0.989 0.323985    
## carlength        -8.077e+01  5.455e+01  -1.481 0.140377    
## carwidth          5.173e+02  2.462e+02   2.101 0.036974 *  
## carheight         1.643e+02  1.359e+02   1.209 0.228186    
## curbweight        1.347e+00  1.688e+00   0.798 0.425816    
## enginesize        1.157e+02  1.380e+01   8.386 1.07e-14 ***
## boreratio        -9.562e+02  1.197e+03  -0.799 0.425417    
## stroke           -2.925e+03  7.750e+02  -3.774 0.000214 ***
## compressionratio  3.019e+02  8.299e+01   3.637 0.000354 ***
## horsepower        3.500e+01  1.590e+01   2.201 0.028906 *  
## peakrpm           2.315e+00  6.703e-01   3.454 0.000679 ***
## citympg          -1.225e+02  8.571e+01  -1.430 0.154442    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3194 on 192 degrees of freedom
## Multiple R-squared:  0.8495, Adjusted R-squared:  0.8401 
## F-statistic: 90.34 on 12 and 192 DF,  p-value: < 2.2e-16
list(fit=round(fit$coefficients,2),fit2=round(fit2$coefficients,2))
## $fit
##      (Intercept)        wheelbase        carlength         carwidth 
##        -47495.74           122.62           -94.68           505.57 
##        carheight       curbweight       enginesize        boreratio 
##           163.18             1.88           117.35         -1002.57 
##           stroke compressionratio       horsepower          peakrpm 
##         -3034.61           298.14            30.81             2.38 
##          citympg       highwaympg 
##          -320.35           202.82 
## 
## $fit2
##      (Intercept)        wheelbase        carlength         carwidth 
##        -46124.34            97.56           -80.77           517.26 
##        carheight       curbweight       enginesize        boreratio 
##           164.33             1.35           115.71          -956.23 
##           stroke compressionratio       horsepower          peakrpm 
##         -2924.54           301.85            35.00             2.32 
##          citympg 
##          -122.54

Now carwidth, enginesize, stroke, compressionratio, peakrpm and, additionally, horsepower are statistically significant, with p-values smaller than 0.05. Look at the coefficient of citympg: its negative magnitude shrinks considerably (from about -320 to about -123). Moreover, there is a sharp change in the coefficient of wheelbase, which can be an important indication of multicollinearity.

We have seen several informal ways of detecting multicollinearity. What about the formal way?

5. Formal way: VIF values

5.1. Checking individual VIF values
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
vif(fit)
##        wheelbase        carlength         carwidth        carheight 
##         7.340949         9.422999         5.586367         2.205975 
##       curbweight       enginesize        boreratio           stroke 
##        16.413371         6.658982         2.103912         1.195781 
## compressionratio       horsepower          peakrpm          citympg 
##         2.175511         8.247880         2.053763        27.128671 
##       highwaympg 
##        24.277439

According to the values presented above, certain predictors may cause a multicollinearity problem. The literature uses two different thresholds, VIF > 5 and VIF > 10. Considering the latter, three predictors (curbweight, citympg, and highwaympg) have a VIF greater than 10. Therefore, multicollinearity exists.
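REMINDER: the VIF of predictor \(j\) is \(\text{VIF}_j = \frac{1}{1-R_j^2}\), where \(R_j^2\) is the R-squared from regressing \(X_j\) on all the other predictors. As a sanity check (a sketch; note that the response price must be excluded from the right-hand side), we can reproduce the VIF of citympg by hand:

rj2 = summary(lm(citympg ~ . - price, data = car))$r.squared  # citympg regressed on the other predictors
1/(1 - rj2)  # should match vif(fit)["citympg"], about 27.13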

5.2. Checking joint VIF value

Besides the individual VIF scores, the joint (mean) VIF gives us information about the existence of multicollinearity.

ifelse(mean(vif(fit)) > 1, "multicollinearity exists", "multicollinearity does not exist")  # mean VIF across the 13 predictors; a mean well above 1 signals multicollinearity
## [1] "multicollinearity exists"

The mean VIF (about 8.8) is considerably greater than 1, so multicollinearity exists.

What are the possible solutions for multicollinearity?

Remedies for Multicollinearity

  • Transformation: scaling (standardization, min-max, robust), centering, or the correlation transformation (demonstrated in the small exercise below)

Small Exercise

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
data(mtcars)
head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
cor(mtcars[,c("cyl" , "disp" , "hp" , "wt" , "qsec")])
##             cyl       disp         hp         wt       qsec
## cyl   1.0000000  0.9020329  0.8324475  0.7824958 -0.5912421
## disp  0.9020329  1.0000000  0.7909486  0.8879799 -0.4336979
## hp    0.8324475  0.7909486  1.0000000  0.6587479 -0.7082234
## wt    0.7824958  0.8879799  0.6587479  1.0000000 -0.1747159
## qsec -0.5912421 -0.4336979 -0.7082234 -0.1747159  1.0000000
model = lm(mpg ~ cyl + disp + hp + wt + qsec + disp*hp, data = mtcars)
summary(model)
## 
## Call:
## lm(formula = mpg ~ cyl + disp + hp + wt + qsec + disp * hp, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4624 -1.4010 -0.5426  1.5036  3.6878 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.850e+01  8.895e+00   4.328 0.000213 ***
## cyl          1.860e-01  8.006e-01   0.232 0.818199    
## disp        -4.248e-02  2.232e-02  -1.904 0.068516 .  
## hp          -9.104e-02  3.034e-02  -3.001 0.006031 ** 
## wt          -3.878e+00  1.124e+00  -3.451 0.001997 ** 
## qsec         3.079e-01  4.352e-01   0.708 0.485734    
## disp:hp      2.622e-04  9.455e-05   2.773 0.010334 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.272 on 25 degrees of freedom
## Multiple R-squared:  0.8854,  Adjusted R-squared:  0.8579 
## F-statistic: 32.21 on 6 and 25 DF,  p-value: 1.352e-10
vif(model)
## there are higher-order terms (interactions) in this model
## consider setting type = 'predictor'; see ?vif
##       cyl      disp        hp        wt      qsec   disp:hp 
## 12.283585 45.955641 26.002534  7.265024  3.632723 62.810241
centered_mtcars = as.data.frame(lapply(mtcars, function(x) x - mean(x)))
centered_model = lm(mpg ~ cyl + disp + hp + wt + qsec + disp*hp, data = centered_mtcars)
summary(centered_model)
## 
## Call:
## lm(formula = mpg ~ cyl + disp + hp + wt + qsec + disp * hp, data = centered_mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4624 -1.4010 -0.5426  1.5036  3.6878 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -1.707e+00  7.350e-01  -2.323   0.0286 * 
## cyl          1.860e-01  8.006e-01   0.232   0.8182   
## disp        -4.017e-03  1.208e-02  -0.333   0.7423   
## hp          -3.054e-02  1.461e-02  -2.090   0.0469 * 
## wt          -3.878e+00  1.124e+00  -3.451   0.0020 **
## qsec         3.079e-01  4.352e-01   0.708   0.4857   
## disp:hp      2.622e-04  9.455e-05   2.773   0.0103 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.272 on 25 degrees of freedom
## Multiple R-squared:  0.8854,  Adjusted R-squared:  0.8579 
## F-statistic: 32.21 on 6 and 25 DF,  p-value: 1.352e-10
vif(centered_model)
## there are higher-order terms (interactions) in this model
## consider setting type = 'predictor'; see ?vif
##       cyl      disp        hp        wt      qsec   disp:hp 
## 12.283585 13.465157  6.029215  7.265024  3.632723  1.801127
standardized_mtcars = as.data.frame(scale(mtcars))
standardized_model = lm(mpg ~ cyl + disp + hp + wt + qsec + disp*hp, data = standardized_mtcars)
summary(standardized_model)
## 
## Call:
## lm(formula = mpg ~ cyl + disp + hp + wt + qsec + disp * hp, data = standardized_mtcars)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.57449 -0.23246 -0.09002  0.24949  0.61189 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -0.28330    0.12196  -2.323   0.0286 * 
## cyl          0.05511    0.23725   0.232   0.8182   
## disp        -0.08260    0.24840  -0.333   0.7423   
## hp          -0.34745    0.16622  -2.090   0.0469 * 
## wt          -0.62961    0.18246  -3.451   0.0020 **
## qsec         0.09130    0.12902   0.708   0.4857   
## disp:hp      0.36973    0.13331   2.773   0.0103 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3769 on 25 degrees of freedom
## Multiple R-squared:  0.8854,  Adjusted R-squared:  0.8579 
## F-statistic: 32.21 on 6 and 25 DF,  p-value: 1.352e-10
vif(standardized_model)
## there are higher-order terms (interactions) in this model
## consider setting type = 'predictor'; see ?vif
##       cyl      disp        hp        wt      qsec   disp:hp 
## 12.283585 13.465157  6.029215  7.265024  3.632723  1.801127

Takeaway: centering (or standardizing) the predictors leaves the fit itself unchanged (identical residuals, R-squared, and interaction coefficient) but dramatically reduces the VIF of the interaction term disp:hp, from about 62.8 to 1.8. This is why centering is a standard remedy for the structural multicollinearity introduced by interaction and polynomial terms.
  • Eliminating the corresponding variable
  • PCA (a brief sketch of this remedy follows this list)
  • Shrinkage Methods
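A minimal sketch of the PCA remedy mentioned above (keeping 5 components is an arbitrary choice for illustration): principal components are uncorrelated by construction, so a regression on them has no multicollinearity.

pcs = prcomp(car[, names(car) != "price"], scale. = TRUE)  # PCA on the predictors only
summary(pcs)                                               # variance explained by each component
pc_data = data.frame(pcs$x[, 1:5], price = car$price)      # keep the first 5 components
fit_pca = lm(price ~ ., data = pc_data)                    # orthogonal predictors: all VIFs equal 1

The price to pay is interpretability: each component is a linear combination of all the original predictors.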

RSS:

\(\text{RSS}(\boldsymbol{\beta})=\sum_{i=1}^n(Y_i-\beta_0-\beta_1X_{i1}-\ldots-\beta_pX_{ip})^2.\)

Ridge Regression

\(\text{RSS}(\boldsymbol{\beta})+\lambda\sum_{j=1}^p \beta_j^2\)

where \(\lambda\) is a penalty or tuning parameter and it modulates the importance of fit vs. shrinkage.

We can implement ridge regression via the glmnet() function in R. However, glmnet() does not use the y ~ x formula notation, so we have to define the predictors and the dependent variable in matrix form. The model.matrix() function is very helpful for generating the x matrix: it not only creates a matrix of the predictors but also automatically converts any qualitative variables into dummy variables.

BE CAREFUL: The glmnet() function can only accept numerical or quantitative inputs and it cannot handle categorical or qualitative data.

x=model.matrix(price~., data=car)[,-1]  # [,-1] drops the column of 1s (the intercept) that model.matrix() adds
y=as.matrix(car[,"price"])              # glmnet expects a matrix, not a data frame

If the value of alpha is 0, glmnet() fits a ridge regression model, and if the value of alpha is 1, it fits a lasso model.

Simple idea: we fit the model for many values of \(\lambda\) and then choose \(\lambda\) by cross-validation.

library(glmnet)
## Warning: package 'glmnet' was built under R version 4.3.3
## Loading required package: Matrix
## Warning: package 'Matrix' was built under R version 4.3.3
## Loaded glmnet 4.1-8
grid=10^seq(10, -5, length = 100)
ridge=glmnet(x, y, alpha = 0, lambda = grid, standardize = TRUE)  # alpha = 0 means ridge regression

By default, glmnet() performs ridge regression over an automatically selected range of \(\lambda\) values. However, it is worthwhile to supply our own grid to control the search.

dim(coef(ridge))
## [1]  14 100
ridge$lambda[5]
## [1] 2477076356
coef(ridge)[,5]
##      (Intercept)        wheelbase        carlength         carwidth 
##     1.327529e+04     2.466178e-03     1.422688e-03     9.097444e-03 
##        carheight       curbweight       enginesize        boreratio 
##     1.255186e-03     4.123195e-05     5.395171e-04     5.249292e-02 
##           stroke compressionratio       horsepower          peakrpm 
##     6.510854e-03     4.398977e-04     5.252472e-04    -4.594367e-06 
##          citympg       highwaympg 
##    -2.694046e-03    -2.603567e-03
ridge$lambda[80]
## [1] 0.01072267
coef(ridge)[,80]
##      (Intercept)        wheelbase        carlength         carwidth 
##    -47531.287430       121.837641       -94.278836       505.844286 
##        carheight       curbweight       enginesize        boreratio 
##       163.140696         1.889160       117.243758     -1000.712083 
##           stroke compressionratio       horsepower          peakrpm 
##     -3032.251689       297.568762        30.931334         2.374865 
##          citympg       highwaympg 
##      -316.416415       199.904330
set.seed(1)
cv_ridge = cv.glmnet(x, y, alpha = 0, lambda = grid)
plot(cv_ridge)

opt_lambda_ridge = cv_ridge$lambda.min
opt_lambda_ridge
## [1] 533.6699

It's worth mentioning that glmnet() standardizes the variables by default so that they are on the same scale. This prevents the penalty from affecting some coefficients more than others simply because of their units.

ridge2=glmnet(x, y, alpha = 0, lambda = opt_lambda_ridge, standardize = TRUE)  # standardize = TRUE (default)

coef(ridge2)
## 14 x 1 sparse Matrix of class "dgCMatrix"
##                             s0
## (Intercept)      -45165.234494
## wheelbase            70.068761
## carlength           -29.252596
## carwidth            479.245300
## carheight            75.125313
## curbweight            2.173168
## enginesize           89.058070
## boreratio          -681.845774
## stroke            -2390.988074
## compressionratio    235.677483
## horsepower           44.827994
## peakrpm               1.734463
## citympg             -78.681126
## highwaympg            7.380630

BE CAREFUL: It is seen that ridge regression cannot set coefficients exactly to zero! Although ridge regression effectively handles multicollinearity, it has a significant drawback: it keeps all predictors in the model, since the penalty term forces many coefficients close to zero but never exactly to zero. While this typically doesn't hurt prediction accuracy, it can complicate the interpretation of the results. Hence, considering alternative techniques may be advisable.

pred_ridge = predict(ridge2, s = opt_lambda_ridge, newx = x)

#RMSE
rmse_ridge = sqrt(mean((pred_ridge - y)^2))
rmse_ridge
## [1] 3135.089

Lasso Regression

\(\text{RSS}(\boldsymbol{\beta})+\lambda\sum_{j=1}^p |\beta_j|\) where \(\lambda\) is penalty parameter which modulates the importance of fit vs. shrinkage.

Why would we use the lasso instead of ridge regression?

  • Ridge regression shrinks all the coefficients toward zero, but never exactly to zero.
  • The lasso shrinks some of the coefficients all the way to zero.

Hence, the lasso performs feature selection at the same time.

Let’s apply lasso regression.

grid=10^seq(10, -5, length = 100)
lasso=glmnet(x, y, alpha = 1, lambda = grid)
plot(lasso)
## Warning in regularize.values(x, y, ties, missing(ties), na.rm = na.rm):
## collapsing to unique 'x' values

The coefficient plot above illustrates that certain coefficients become exactly zero depending on the chosen tuning parameter. Next, we conduct cross-validation to choose the tuning parameter and estimate the associated test error.

cv_lasso=cv.glmnet(x, y, alpha = 1, lambda = grid)
opt_lambda_lasso = cv_lasso$lambda.min
opt_lambda_lasso
## [1] 187.3817
plot(cv_lasso)

The optimal lambda is about 187. Now, let's build the model with this lambda.

lasso2=glmnet(x, y, alpha = 1, lambda = opt_lambda_lasso)
coef(lasso2)
## 14 x 1 sparse Matrix of class "dgCMatrix"
##                             s0
## (Intercept)      -45482.975329
## wheelbase             8.316413
## carlength             .       
## carwidth            481.165370
## carheight            23.742790
## curbweight            2.052894
## enginesize           99.720706
## boreratio             .       
## stroke            -1840.543793
## compressionratio    147.877488
## horsepower           37.681530
## peakrpm               1.498585
## citympg              -2.505219
## highwaympg            .

Now it is obvious that some of the coefficients are set to 0. This is called feature selection.

pred_lasso= predict(lasso2, s = opt_lambda_lasso, newx = x)

#RMSE
rmse_lasso= sqrt(mean((pred_lasso - y)^2))
rmse_lasso
## [1] 3175.461

The RMSE obtained from the lasso is about 3175.

Elastic Net

Elastic net is a compromise between ridge and lasso: it combines the two penalties.

\(\text{RSS}(\boldsymbol{\beta})+\lambda \left[ \left(1 - \alpha \right) \frac{\left \lVert \beta \right \rVert_{2}^{2}}{2} + \alpha \left \lVert \beta \right \rVert_{1} \right]\)

If we set alpha to 0.5, we implement elastic net regression with equal weight on the two penalties. The implementation is similar to the ridge and lasso regressions.

elast=glmnet(x, y, alpha = 0.5, lambda = grid)
plot(elast)
## Warning in regularize.values(x, y, ties, missing(ties), na.rm = na.rm):
## collapsing to unique 'x' values

cv_elast=cv.glmnet(x, y, alpha = 0.5, lambda = grid)
opt_lambda_elast = cv_elast$lambda.min
opt_lambda_elast
## [1] 23.1013

The optimum lambda value for elastic net regression is about 23.

plot(cv_elast)

elast2 = glmnet(x, y, alpha = 0.5, lambda = opt_lambda_elast) #best_elastic
coef(elast2)
## 14 x 1 sparse Matrix of class "dgCMatrix"
##                             s0
## (Intercept)      -47027.523203
## wheelbase            94.801178
## carlength           -68.739889
## carwidth            497.713313
## carheight           144.423009
## curbweight            1.594638
## enginesize          114.644763
## boreratio          -877.702745
## stroke            -2888.979819
## compressionratio    287.701323
## horsepower           33.616248
## peakrpm               2.286126
## citympg            -207.284055
## highwaympg          102.096937
pred_elast = predict(elast2, s = opt_lambda_elast, newx = x)

#RMSE
rmse_elast = sqrt(mean((pred_elast - y)^2))
rmse_elast
## [1] 3083.036

Assessing the Performances

coefs=data.frame(as.matrix(coef(ridge2)),as.matrix(coef(lasso2)),as.matrix(coef(elast2)))
colnames(coefs)=c("Ridge","Lasso","Elastic")
coefs
Ridge Lasso Elastic
(Intercept) -45165.234494 -45482.975329 -47027.523203
wheelbase 70.068761 8.316413 94.801177
carlength -29.252596 0.000000 -68.739889
carwidth 479.245300 481.165370 497.713313
carheight 75.125313 23.742790 144.423009
curbweight 2.173168 2.052894 1.594637
enginesize 89.058070 99.720706 114.644763
boreratio -681.845774 0.000000 -877.702745
stroke -2390.988074 -1840.543793 -2888.979819
compressionratio 235.677483 147.877488 287.701323
horsepower 44.827994 37.681530 33.616248
peakrpm 1.734463 1.498585 2.286126
citympg -78.681126 -2.505219 -207.284055
highwaympg 7.380630 0.000000 102.096937

In the table above, we see that some coefficients obtained from the lasso regression are exactly 0. It is also meaningful to look at the RMSE scores of the three regressions.

rmse_comp=data.frame(rmse_ridge,rmse_lasso,rmse_elast)
colnames(rmse_comp)=c("Ridge","Lasso","Elastic")
rmse_comp
Ridge Lasso Elastic
3135.089 3175.461 3083.036

According to the RMSE scores, elastic net regression is the best suited to our data.
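Note that the RMSE values above are computed on the same data used to fit the models. For an honest comparison, a held-out test set is preferable; here is a minimal sketch (the 80/20 split and the seed are arbitrary choices):

set.seed(412)
train = sample(nrow(x), 0.8 * nrow(x))                # 80% of the rows for training
cv_tr = cv.glmnet(x[train, ], y[train], alpha = 0.5)  # tune lambda on the training set only
pred_test = predict(cv_tr, s = "lambda.min", newx = x[-train, ])
sqrt(mean((pred_test - y[-train])^2))                 # test RMSE for elastic net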

References:

  • https://web.stanford.edu/class/stats202/notes/Model-selection/Shrinkage.html
  • Yozgatlıgil, C. (2024). STAT 412 Lecture Notes.