Previously on STAT 412:
Why is detecting and solving multicollinearity important?
Detecting multicollinearity matters in regression because it protects the reliability of our coefficient estimates. When predictors are strongly related to one another, the model cannot separate their individual effects: it becomes difficult to determine which factors truly influence the outcome and by how much, and the estimates become unstable. By identifying and addressing multicollinearity, we keep our conclusions from being driven by misleading relationships among the predictors.
In this recitation, we will use a car-price dataset.
| car_ID | CarName | fueltype | aspiration | doornumber | wheelbase | carlength | carwidth | carheight | curbweight | enginesize | boreratio | stroke | compressionratio | horsepower | peakrpm | citympg | highwaympg | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | alfa-romero giulia | gas | std | two | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | 130 | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495 |
| 2 | alfa-romero stelvio | gas | std | two | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | 130 | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500 |
| 3 | alfa-romero Quadrifoglio | gas | std | two | 94.5 | 171.2 | 65.5 | 52.4 | 2823 | 152 | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500 |
| 4 | audi 100 ls | gas | std | four | 99.8 | 176.6 | 66.2 | 54.3 | 2337 | 109 | 3.19 | 3.40 | 10.0 | 102 | 5500 | 24 | 30 | 13950 |
| 5 | audi 100ls | gas | std | four | 99.4 | 176.6 | 66.4 | 54.3 | 2824 | 136 | 3.19 | 3.40 | 8.0 | 115 | 5500 | 18 | 22 | 17450 |
| 6 | audi fox | gas | std | two | 99.8 | 177.3 | 66.3 | 53.1 | 2507 | 136 | 3.19 | 3.40 | 8.5 | 110 | 5500 | 19 | 25 | 15250 |
Quickly examine the dataset:
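A sketch of how these checks can be made (the object name car_raw is a placeholder for the imported data frame):
sum(is.na(car_raw)) # number of missing values
sum(duplicated(car_raw)) # number of duplicated rows
dim(car_raw) # number of rows and columns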
## [1] 0
## [1] 0
## [1] 205 19
The data has no missing or duplicated values.
Data Dictionary:
Today’s focus is multicollinearity, so we keep only the numeric variables and set the categorical ones aside. Therefore, we drop the variables car_ID, CarName, fueltype, aspiration, and doornumber (a sketch of this step follows the package messages below).
## tibble [205 × 19] (S3: tbl_df/tbl/data.frame)
## $ car_ID : num [1:205] 1 2 3 4 5 6 7 8 9 10 ...
## $ CarName : chr [1:205] "alfa-romero giulia" "alfa-romero stelvio" "alfa-romero Quadrifoglio" "audi 100 ls" ...
## $ fueltype : chr [1:205] "gas" "gas" "gas" "gas" ...
## $ aspiration : chr [1:205] "std" "std" "std" "std" ...
## $ doornumber : chr [1:205] "two" "two" "two" "four" ...
## $ wheelbase : num [1:205] 88.6 88.6 94.5 99.8 99.4 ...
## $ carlength : num [1:205] 169 169 171 177 177 ...
## $ carwidth : num [1:205] 64.1 64.1 65.5 66.2 66.4 66.3 71.4 71.4 71.4 67.9 ...
## $ carheight : num [1:205] 48.8 48.8 52.4 54.3 54.3 53.1 55.7 55.7 55.9 52 ...
## $ curbweight : num [1:205] 2548 2548 2823 2337 2824 ...
## $ enginesize : num [1:205] 130 130 152 109 136 136 136 136 131 131 ...
## $ boreratio : num [1:205] 3.47 3.47 2.68 3.19 3.19 3.19 3.19 3.19 3.13 3.13 ...
## $ stroke : num [1:205] 2.68 2.68 3.47 3.4 3.4 3.4 3.4 3.4 3.4 3.4 ...
## $ compressionratio: num [1:205] 9 9 9 10 8 8.5 8.5 8.5 8.3 7 ...
## $ horsepower : num [1:205] 111 111 154 102 115 110 110 110 140 160 ...
## $ peakrpm : num [1:205] 5000 5000 5000 5500 5500 5500 5500 5500 5500 5500 ...
## $ citympg : num [1:205] 21 21 19 24 18 19 19 19 17 16 ...
## $ highwaympg : num [1:205] 27 27 26 30 22 25 25 25 20 22 ...
## $ price : num [1:205] 13495 16500 16500 13950 17450 ...
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
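The attach messages above come from loading dplyr. A sketch of the drop described earlier, producing the data frame car used in the rest of the recitation (car_raw is again a placeholder name):
car = car_raw %>%
  select(-car_ID, -CarName, -fueltype, -aspiration, -doornumber) # keep only the numeric columns
summary(car) # five-number summaries shown below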
## wheelbase carlength carwidth carheight
## Min. : 86.60 Min. :141.1 Min. :60.30 Min. :47.80
## 1st Qu.: 94.50 1st Qu.:166.3 1st Qu.:64.10 1st Qu.:52.00
## Median : 97.00 Median :173.2 Median :65.50 Median :54.10
## Mean : 98.76 Mean :174.0 Mean :65.91 Mean :53.72
## 3rd Qu.:102.40 3rd Qu.:183.1 3rd Qu.:66.90 3rd Qu.:55.50
## Max. :120.90 Max. :208.1 Max. :72.30 Max. :59.80
## curbweight enginesize boreratio stroke compressionratio
## Min. :1488 Min. : 61.0 Min. :2.54 Min. :2.070 Min. : 7.00
## 1st Qu.:2145 1st Qu.: 97.0 1st Qu.:3.15 1st Qu.:3.110 1st Qu.: 8.60
## Median :2414 Median :120.0 Median :3.31 Median :3.290 Median : 9.00
## Mean :2556 Mean :126.9 Mean :3.33 Mean :3.255 Mean :10.14
## 3rd Qu.:2935 3rd Qu.:141.0 3rd Qu.:3.58 3rd Qu.:3.410 3rd Qu.: 9.40
## Max. :4066 Max. :326.0 Max. :3.94 Max. :4.170 Max. :23.00
## horsepower peakrpm citympg highwaympg price
## Min. : 48.0 Min. :4150 Min. :13.00 Min. :16.00 Min. : 5118
## 1st Qu.: 70.0 1st Qu.:4800 1st Qu.:19.00 1st Qu.:25.00 1st Qu.: 7788
## Median : 95.0 Median :5200 Median :24.00 Median :30.00 Median :10295
## Mean :104.1 Mean :5125 Mean :25.22 Mean :30.75 Mean :13277
## 3rd Qu.:116.0 3rd Qu.:5500 3rd Qu.:30.00 3rd Qu.:34.00 3rd Qu.:16503
## Max. :288.0 Max. :6600 Max. :49.00 Max. :54.00 Max. :45400
## wheelbase carlength carwidth carheight curbweight
## wheelbase 1.0000000 0.8745875 0.7951436 0.58943476 0.7763863
## carlength 0.8745875 1.0000000 0.8411183 0.49102946 0.8777285
## carwidth 0.7951436 0.8411183 1.0000000 0.27921032 0.8670325
## carheight 0.5894348 0.4910295 0.2792103 1.00000000 0.2955717
## curbweight 0.7763863 0.8777285 0.8670325 0.29557173 1.0000000
## enginesize 0.5693287 0.6833599 0.7354334 0.06714874 0.8505941
## boreratio 0.4887499 0.6064544 0.5591499 0.17107092 0.6484797
## stroke 0.1609590 0.1295326 0.1829417 -0.05530667 0.1687900
## compressionratio 0.2497858 0.1584137 0.1811286 0.26121423 0.1513617
## horsepower 0.3532945 0.5526230 0.6407321 -0.10880206 0.7507393
## peakrpm -0.3604687 -0.2872422 -0.2200123 -0.32041072 -0.2662432
## citympg -0.4704136 -0.6709087 -0.6427043 -0.04863963 -0.7574138
## highwaympg -0.5440819 -0.7046616 -0.6772179 -0.10735763 -0.7974648
## price 0.5778156 0.6829200 0.7593253 0.11933623 0.8353049
## enginesize boreratio stroke compressionratio
## wheelbase 0.56932868 0.488749875 0.16095905 0.249785845
## carlength 0.68335987 0.606454358 0.12953261 0.158413706
## carwidth 0.73543340 0.559149909 0.18294169 0.181128627
## carheight 0.06714874 0.171070922 -0.05530667 0.261214226
## curbweight 0.85059407 0.648479749 0.16879004 0.151361740
## enginesize 1.00000000 0.583774327 0.20312859 0.028971360
## boreratio 0.58377433 1.000000000 -0.05590898 0.005197339
## stroke 0.20312859 -0.055908983 1.00000000 0.186110110
## compressionratio 0.02897136 0.005197339 0.18611011 1.000000000
## horsepower 0.80976865 0.573676823 0.08093954 -0.204326226
## peakrpm -0.24465983 -0.254975528 -0.06796375 -0.435740514
## citympg -0.65365792 -0.584531716 -0.04214475 0.324701425
## highwaympg -0.67746991 -0.587011784 -0.04393093 0.265201389
## price 0.87414480 0.553173237 0.07944308 0.067983506
## horsepower peakrpm citympg highwaympg price
## wheelbase 0.35329448 -0.36046875 -0.47041361 -0.54408192 0.57781560
## carlength 0.55262297 -0.28724220 -0.67090866 -0.70466160 0.68292002
## carwidth 0.64073208 -0.22001230 -0.64270434 -0.67721792 0.75932530
## carheight -0.10880206 -0.32041072 -0.04863963 -0.10735763 0.11933623
## curbweight 0.75073925 -0.26624318 -0.75741378 -0.79746479 0.83530488
## enginesize 0.80976865 -0.24465983 -0.65365792 -0.67746991 0.87414480
## boreratio 0.57367682 -0.25497553 -0.58453172 -0.58701178 0.55317324
## stroke 0.08093954 -0.06796375 -0.04214475 -0.04393093 0.07944308
## compressionratio -0.20432623 -0.43574051 0.32470142 0.26520139 0.06798351
## horsepower 1.00000000 0.13107251 -0.80145618 -0.77054389 0.80813882
## peakrpm 0.13107251 1.00000000 -0.11354438 -0.05427481 -0.08526715
## citympg -0.80145618 -0.11354438 1.00000000 0.97133704 -0.68575134
## highwaympg -0.77054389 -0.05427481 0.97133704 1.00000000 -0.69759909
## price 0.80813882 -0.08526715 -0.68575134 -0.69759909 1.00000000
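The correlation matrix above and the correlation plot interpreted below can be produced along these lines (the display options are assumptions; any corrplot method conveys the same information):
library(corrplot)
corr_mat = cor(car) # pairwise Pearson correlations
corrplot(corr_mat, method = "number", number.cex = 0.6) # correlation plot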
## corrplot 0.92 loaded
The correlation plot shows that there is a multicollinearity problem in our dataset. For example, the correlation of 0.97 between citympg and highwaympg indicates an almost perfectly linear relationship between the two.
The ggpairs() function of the GGally package lets us build a detailed scatter-plot matrix.
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
c=ggpairs(car,title="Correlogram with ggpairs()",upper = list(continuous = wrap("cor", size = 2)),lower = list(continuous = wrap("points", size = 0.5)))
c + theme(axis.text = element_text(size = 3), #axis text size
strip.text.x = element_text(size = 4), #panel text size
strip.text.y = element_text(size = 4)) #panel text size
You see density plots on the diagonal and scatter plots in the lower panels, alongside the pairwise correlations in the upper panels.
library(PerformanceAnalytics)
chart.Correlation(car) #method= "pearson" (default), "kendall", or "spearman"
Similar to ggpairs(), chart.Correlation() displays the densities, scatter plots, and pairwise correlations in a single matrix.
Another way to detect multicollinearity relies on model construction. First, let’s build the model with all predictors.
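A sketch of the fit that produces the summary below; the object name fit matches the vif(fit) call used later in the recitation:
fit = lm(price ~ ., data = car) # regress price on all remaining predictors
summary(fit)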
##
## Call:
## lm(formula = price ~ ., data = car)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10789.3 -1671.5 -188.7 1628.8 14243.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.750e+04 1.527e+04 -3.111 0.002154 **
## wheelbase 1.226e+02 1.005e+02 1.220 0.223782
## carlength -9.468e+01 5.556e+01 -1.704 0.089986 .
## carwidth 5.056e+02 2.460e+02 2.055 0.041235 *
## carheight 1.632e+02 1.357e+02 1.202 0.230727
## curbweight 1.885e+00 1.737e+00 1.085 0.279395
## enginesize 1.173e+02 1.384e+01 8.481 6.04e-15 ***
## boreratio -1.003e+03 1.196e+03 -0.838 0.402850
## stroke -3.035e+03 7.786e+02 -3.897 0.000134 ***
## compressionratio 2.981e+02 8.291e+01 3.596 0.000412 ***
## horsepower 3.081e+01 1.622e+01 1.900 0.058959 .
## peakrpm 2.375e+00 6.709e-01 3.540 0.000502 ***
## citympg -3.204e+02 1.778e+02 -1.802 0.073110 .
## highwaympg 2.028e+02 1.598e+02 1.270 0.205794
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3189 on 191 degrees of freedom
## Multiple R-squared: 0.8508, Adjusted R-squared: 0.8406
## F-statistic: 83.78 on 13 and 191 DF, p-value: < 2.2e-16
Overall, the model is statistically significant (p < 2.2e-16). The adjusted R-squared of 0.84 indicates that about 84% of the variability in price is explained by the predictors.
REMINDER: \(R^2\) is a statistical measure that represents the proportion of the variance for a dependent variable that’s explained by an independent variable or variables in a regression model.
REMINDER: Adjusted R-squared is a modified version of R-squared that accounts for the number of predictors in the model. It increases only when a new term improves the model more than would be expected by chance, and it decreases when a predictor improves the model less than expected. It is always less than or equal to R-squared and, unlike R-squared, can even be negative for very poor models. We generally prefer adjusted R-squared when comparing models with different numbers of predictors.
Car width, engine size, stroke, compression ratio, and peak rpm are statistically significant, each with a p-value smaller than 0.05.
You might expect city mpg to have a positive effect on car price, yet its coefficient is negative. This makes us suspicious that multicollinearity is present.
In such cases, we cannot interpret the regression coefficients in the usual way. A coefficient is normally read as “the effect on Y of a one-unit change in this variable, holding all other X’s constant.” When the X’s are highly correlated, however, you cannot realistically change one while holding the others fixed.
Try: remove one or more predictors from the model and see what happens to the remaining coefficients.
Let’s drop the variable highwaympg.
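A sketch of the reduced fit summarized below; the names fit and fit2 match the coefficient listing ($fit, $fit2) printed afterwards, which could come from something like list(fit = round(coef(fit), 2), fit2 = round(coef(fit2), 2)):
fit2 = lm(price ~ . - highwaympg, data = car) # same model without highwaympg
summary(fit2)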
##
## Call:
## lm(formula = price ~ . - highwaympg, data = car)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10927.4 -1564.9 -207.4 1570.3 14360.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.612e+04 1.526e+04 -3.024 0.002840 **
## wheelbase 9.756e+01 9.866e+01 0.989 0.323985
## carlength -8.077e+01 5.455e+01 -1.481 0.140377
## carwidth 5.173e+02 2.462e+02 2.101 0.036974 *
## carheight 1.643e+02 1.359e+02 1.209 0.228186
## curbweight 1.347e+00 1.688e+00 0.798 0.425816
## enginesize 1.157e+02 1.380e+01 8.386 1.07e-14 ***
## boreratio -9.562e+02 1.197e+03 -0.799 0.425417
## stroke -2.925e+03 7.750e+02 -3.774 0.000214 ***
## compressionratio 3.019e+02 8.299e+01 3.637 0.000354 ***
## horsepower 3.500e+01 1.590e+01 2.201 0.028906 *
## peakrpm 2.315e+00 6.703e-01 3.454 0.000679 ***
## citympg -1.225e+02 8.571e+01 -1.430 0.154442
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3194 on 192 degrees of freedom
## Multiple R-squared: 0.8495, Adjusted R-squared: 0.8401
## F-statistic: 90.34 on 12 and 192 DF, p-value: < 2.2e-16
## $fit
## (Intercept) wheelbase carlength carwidth
## -47495.74 122.62 -94.68 505.57
## carheight curbweight enginesize boreratio
## 163.18 1.88 117.35 -1002.57
## stroke compressionratio horsepower peakrpm
## -3034.61 298.14 30.81 2.38
## citympg highwaympg
## -320.35 202.82
##
## $fit2
## (Intercept) wheelbase carlength carwidth
## -46124.34 97.56 -80.77 517.26
## carheight curbweight enginesize boreratio
## 164.33 1.35 115.71 -956.23
## stroke compressionratio horsepower peakrpm
## -2924.54 301.85 35.00 2.32
## citympg
## -122.54
Now car width, engine size, stroke, compression ratio, horsepower (newly significant), and peak rpm are statistically significant, with p-values smaller than 0.05. Look at the coefficient of citympg: its negative magnitude shrinks from about -320 to about -123. Moreover, there is a sharp change in the coefficient of wheelbase (from about 123 to 98), which is an important indication of multicollinearity.
We have seen several informal ways of detecting multicollinearity. What about the formal way?
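The formal check uses the variance inflation factor (VIF) from the car package; a sketch of the call whose output appears below:
library(car) # provides vif(); note that it masks dplyr::recode, as the message below shows
vif(fit) # VIF of each predictor in the full model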
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## wheelbase carlength carwidth carheight
## 7.340949 9.422999 5.586367 2.205975
## curbweight enginesize boreratio stroke
## 16.413371 6.658982 2.103912 1.195781
## compressionratio horsepower peakrpm citympg
## 2.175511 8.247880 2.053763 27.128671
## highwaympg
## 24.277439
According to the values presented above, certain predictors may cause a multicollinearity problem. The literature uses two common thresholds, VIF > 5 and VIF > 10; consider the stricter one, VIF > 10. Three predictors (curbweight, citympg, and highwaympg) exceed it, so multicollinearity exists.
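A small sketch of how the predictors exceeding the stricter threshold can be listed:
v = vif(fit)
names(v)[v > 10] # curbweight, citympg, highwaympg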
Besides the individual VIF scores, the joint VIF also gives information about the existence of multicollinearity.
ifelse(sum(vif(fit))/(length(fit$coefficients)-2)>1, "multicollinearity exists", "multicollinearity does not exist" ) # -2 because of intercept and 1 predictor.
## [1] "multicollinearity exists"
The joint VIF is greater than 1, so multicollinearity exists.
What are the possible solutions for multicollinearity? One simple remedy, useful when the collinearity is created by interaction or polynomial terms, is to center or standardize the predictors. We illustrate this with the mtcars data and a model for mpg that includes the disp*hp interaction.
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
| | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Mazda RX4 | 21.0 | 6 | 160 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
| Mazda RX4 Wag | 21.0 | 6 | 160 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
| Datsun 710 | 22.8 | 4 | 108 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
| Hornet 4 Drive | 21.4 | 6 | 258 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
| Hornet Sportabout | 18.7 | 8 | 360 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |
| Valiant | 18.1 | 6 | 225 | 105 | 2.76 | 3.460 | 20.22 | 1 | 0 | 3 | 1 |
## cyl disp hp wt qsec
## cyl 1.0000000 0.9020329 0.8324475 0.7824958 -0.5912421
## disp 0.9020329 1.0000000 0.7909486 0.8879799 -0.4336979
## hp 0.8324475 0.7909486 1.0000000 0.6587479 -0.7082234
## wt 0.7824958 0.8879799 0.6587479 1.0000000 -0.1747159
## qsec -0.5912421 -0.4336979 -0.7082234 -0.1747159 1.0000000
##
## Call:
## lm(formula = mpg ~ cyl + disp + hp + wt + qsec + disp * hp, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4624 -1.4010 -0.5426 1.5036 3.6878
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.850e+01 8.895e+00 4.328 0.000213 ***
## cyl 1.860e-01 8.006e-01 0.232 0.818199
## disp -4.248e-02 2.232e-02 -1.904 0.068516 .
## hp -9.104e-02 3.034e-02 -3.001 0.006031 **
## wt -3.878e+00 1.124e+00 -3.451 0.001997 **
## qsec 3.079e-01 4.352e-01 0.708 0.485734
## disp:hp 2.622e-04 9.455e-05 2.773 0.010334 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.272 on 25 degrees of freedom
## Multiple R-squared: 0.8854, Adjusted R-squared: 0.8579
## F-statistic: 32.21 on 6 and 25 DF, p-value: 1.352e-10
## there are higher-order terms (interactions) in this model
## consider setting type = 'predictor'; see ?vif
## cyl disp hp wt qsec disp:hp
## 12.283585 45.955641 26.002534 7.265024 3.632723 62.810241
The interaction term disp:hp and its components disp and hp have very large VIFs. Centering each variable at its mean often removes this kind of structural collinearity:
centered_mtcars = as.data.frame(lapply(mtcars, function(x) x - mean(x)))
centered_model = lm(mpg ~ cyl + disp + hp + wt + qsec+disp*hp, data = centered_mtcars)
summary(centered_model)
##
## Call:
## lm(formula = mpg ~ cyl + disp + hp + wt + qsec + disp * hp, data = centered_mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4624 -1.4010 -0.5426 1.5036 3.6878
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.707e+00 7.350e-01 -2.323 0.0286 *
## cyl 1.860e-01 8.006e-01 0.232 0.8182
## disp -4.017e-03 1.208e-02 -0.333 0.7423
## hp -3.054e-02 1.461e-02 -2.090 0.0469 *
## wt -3.878e+00 1.124e+00 -3.451 0.0020 **
## qsec 3.079e-01 4.352e-01 0.708 0.4857
## disp:hp 2.622e-04 9.455e-05 2.773 0.0103 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.272 on 25 degrees of freedom
## Multiple R-squared: 0.8854, Adjusted R-squared: 0.8579
## F-statistic: 32.21 on 6 and 25 DF, p-value: 1.352e-10
## there are higher-order terms (interactions) in this model
## consider setting type = 'predictor'; see ?vif
## cyl disp hp wt qsec disp:hp
## 12.283585 13.465157 6.029215 7.265024 3.632723 1.801127
Standardizing the variables (centering and scaling them) gives the same VIFs as centering alone, since rescaling does not change correlations:
standardized_mtcars = as.data.frame(scale(mtcars))
standardized_model = lm(mpg ~ cyl + disp + hp + wt + qsec+disp*hp, data = standardized_mtcars)
summary(standardized_model)
##
## Call:
## lm(formula = mpg ~ cyl + disp + hp + wt + qsec + disp * hp, data = standardized_mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.57449 -0.23246 -0.09002 0.24949 0.61189
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.28330 0.12196 -2.323 0.0286 *
## cyl 0.05511 0.23725 0.232 0.8182
## disp -0.08260 0.24840 -0.333 0.7423
## hp -0.34745 0.16622 -2.090 0.0469 *
## wt -0.62961 0.18246 -3.451 0.0020 **
## qsec 0.09130 0.12902 0.708 0.4857
## disp:hp 0.36973 0.13331 2.773 0.0103 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3769 on 25 degrees of freedom
## Multiple R-squared: 0.8854, Adjusted R-squared: 0.8579
## F-statistic: 32.21 on 6 and 25 DF, p-value: 1.352e-10
## there are higher-order terms (interactions) in this model
## consider setting type = 'predictor'; see ?vif
## cyl disp hp wt qsec disp:hp
## 12.283585 13.465157 6.029215 7.265024 3.632723 1.801127
Another family of remedies is shrinkage (penalized regression). Recall the residual sum of squares:
\(\text{RSS}(\boldsymbol{\beta})=\sum_{i=1}^n(Y_i-\beta_0-\beta_1X_{i1}-\ldots-\beta_pX_{ip})^2.\)
Ridge regression minimizes the penalized criterion
\(\text{RSS}(\boldsymbol{\beta})+\lambda\sum_{j=1}^p \beta_j^2,\)
where \(\lambda\) is a penalty or tuning parameter that modulates the importance of fit vs. shrinkage.
We can implement ridge regression via the glmnet() function in R. However, glmnet() does not use the y ~ x formula notation, so we have to define the predictors and the dependent variable in matrix form. The model.matrix() function is very helpful for generating the x matrix: it not only creates a matrix of the predictors but also automatically converts any qualitative variables into dummy variables.
BE CAREFUL: The glmnet() function can only accept numerical or quantitative inputs and it cannot handle categorical or qualitative data.
x=model.matrix(price~. ,data=car)[,-1] #without [,-1], x would be the full design matrix with a column of 1s for the intercept
y=as.matrix(car[,"price"]) #glmnet generally expects a matrix, not a data frame
If the value of alpha is 0, glmnet() fits a ridge regression model; if the value of alpha is 1, it fits a lasso model.
Simple idea: We find an estimate for many values of \(\lambda\) and then choose it by cross-validation.
## Warning: package 'glmnet' was built under R version 4.3.3
## Loading required package: Matrix
## Warning: package 'Matrix' was built under R version 4.3.3
## Loaded glmnet 4.1-8
grid=10^seq(10,-5, length =100)
ridge=glmnet(x,y,alpha=0,lambda=grid,standardize = TRUE) # alpha=0 means ridge regression
By default, glmnet() conducts ridge regression over an automatically selected range of λ values, but supplying our own grid gives us more control over the search. The output below shows the dimensions of the coefficient matrix (14 coefficients by 100 values of λ), the coefficients at a very large λ (shrunk almost to zero), the coefficients at a very small λ (close to the least-squares estimates), and finally the cross-validated λ used in the next step.
## [1] 14 100
## [1] 2477076356
## (Intercept) wheelbase carlength carwidth
## 1.327529e+04 2.466178e-03 1.422688e-03 9.097444e-03
## carheight curbweight enginesize boreratio
## 1.255186e-03 4.123195e-05 5.395171e-04 5.249292e-02
## stroke compressionratio horsepower peakrpm
## 6.510854e-03 4.398977e-04 5.252472e-04 -4.594367e-06
## citympg highwaympg
## -2.694046e-03 -2.603567e-03
## [1] 0.01072267
## (Intercept) wheelbase carlength carwidth
## -47531.287430 121.837641 -94.278836 505.844286
## carheight curbweight enginesize boreratio
## 163.140696 1.889160 117.243758 -1000.712083
## stroke compressionratio horsepower peakrpm
## -3032.251689 297.568762 30.931334 2.374865
## citympg highwaympg
## -316.416415 199.904330
## [1] 533.6699
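The λ printed just above (about 534) is used as opt_lambda_ridge in the next chunk. A plausible way to obtain it, mirroring the cv.glmnet() calls used later for the lasso and the elastic net:
cv_ridge = cv.glmnet(x, y, alpha = 0, lambda = grid) # 10-fold cross-validation over the same grid
opt_lambda_ridge = cv_ridge$lambda.min # λ with the smallest cross-validation error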
It’s worth mentioning that glmnet() standardizes the variables by default so that they are on the same scale; this prevents the penalty from hitting some coefficients harder than others simply because of their measurement units.
ridge2=glmnet(x,y,alpha=0,lambda=opt_lambda_ridge,standardize = TRUE) #standardize=TRUE (default)
coef(ridge2)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) -45165.234494
## wheelbase 70.068761
## carlength -29.252596
## carwidth 479.245300
## carheight 75.125313
## curbweight 2.173168
## enginesize 89.058070
## boreratio -681.845774
## stroke -2390.988074
## compressionratio 235.677483
## horsepower 44.827994
## peakrpm 1.734463
## citympg -78.681126
## highwaympg 7.380630
BE CAREFUL: Notice that ridge regression does not set any coefficient exactly to zero. Although ridge regression effectively handles multicollinearity, it has a significant drawback: it keeps all predictors in the model, since the penalty forces many coefficients close to zero but never exactly to zero. While this typically doesn’t hurt prediction accuracy, it can complicate interpretation. Hence, considering alternative techniques may be advisable.
pred_ridge = predict(ridge2, s = opt_lambda_ridge, newx = x)
#RMSE
rmse_ridge = sqrt(mean((pred_ridge - y)^2))
rmse_ridge
## [1] 3135.089
The lasso minimizes
\(\text{RSS}(\boldsymbol{\beta})+\lambda\sum_{j=1}^p |\beta_j|,\)
where \(\lambda\) is a penalty parameter which modulates the importance of fit vs. shrinkage.
Why would we use the lasso instead of ridge regression?
- Ridge regression shrinks all the coefficients towards zero, but none of them becomes exactly zero.
- The lasso shrinks some of the coefficients all the way to zero.
Hence, the lasso performs feature selection at the same time.
Let’s apply lasso regression.
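A sketch of the lasso fit and the coefficient-path plot discussed below (the plotting options are assumptions):
lasso = glmnet(x, y, alpha = 1, lambda = grid, standardize = TRUE) # alpha=1 means lasso
plot(lasso, xvar = "lambda", label = TRUE) # coefficient paths versus log(λ)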
## Warning in regularize.values(x, y, ties, missing(ties), na.rm = na.rm):
## collapsing to unique 'x' values
The coefficient-path plot above illustrates that some coefficients are set to zero depending on the chosen tuning parameter. Next, we use cross-validation to choose the λ with the smallest estimated test error.
cv_lasso =cv.glmnet (x,y,alpha =1, lambda =grid)
opt_lambda_lasso = cv_lasso$lambda.min
opt_lambda_lasso
## [1] 187.3817
The optimal λ is about 187. Now, let’s build the model with this optimal λ.
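A sketch of the refit at the chosen λ; the object name lasso2 matches the prediction step later:
lasso2 = glmnet(x, y, alpha = 1, lambda = opt_lambda_lasso, standardize = TRUE)
coef(lasso2) # sparse coefficient vector shown below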
## 14 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) -45482.975329
## wheelbase 8.316413
## carlength .
## carwidth 481.165370
## carheight 23.742790
## curbweight 2.052894
## enginesize 99.720706
## boreratio .
## stroke -1840.543793
## compressionratio 147.877488
## horsepower 37.681530
## peakrpm 1.498585
## citympg -2.505219
## highwaympg .
Now it is obvious that some of the coefficients are set exactly to 0; this is called feature selection.
pred_lasso= predict(lasso2, s = opt_lambda_lasso, newx = x)
#RMSE
rmse_lasso= sqrt(mean((pred_lasso - y)^2))
rmse_lasso
## [1] 3175.461
The RMSE obtained from the lasso is about 3175.
Elastic net extends the lasso by combining the lasso and ridge penalties:
\(\text{RSS}(\boldsymbol{\beta})+\lambda \left[ \left(1 - \alpha \right) \frac{\left \lVert \beta \right \rVert_{2}^{2}}{2} + \alpha \left \lVert \beta \right \rVert_{1} \right]\)
If we specify alpha as 0.5, we implement elastic net regression with equal weight on the two penalties. The implementation is similar to the ridge and lasso regressions.
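A sketch of the elastic net fit over the same grid, mirroring the ridge and lasso steps (the plotting call is an assumption):
elast = glmnet(x, y, alpha = 0.5, lambda = grid, standardize = TRUE) # alpha=0.5 means elastic net
plot(elast, xvar = "lambda", label = TRUE) # coefficient paths versus log(λ)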
## Warning in regularize.values(x, y, ties, missing(ties), na.rm = na.rm):
## collapsing to unique 'x' values
cv_elast =cv.glmnet (x,y,alpha =0.5, lambda =grid)
opt_lambda_elast = cv_elast$lambda.min
opt_lambda_elast
## [1] 23.1013
The optimum lambda value for elastic net regression is about 23.
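A sketch of the refit at the chosen λ; the object name elast2 matches the prediction step and the coefficient table below:
elast2 = glmnet(x, y, alpha = 0.5, lambda = opt_lambda_elast, standardize = TRUE)
coef(elast2) # coefficients shown below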
## 14 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) -47027.523203
## wheelbase 94.801178
## carlength -68.739889
## carwidth 497.713313
## carheight 144.423009
## curbweight 1.594638
## enginesize 114.644763
## boreratio -877.702745
## stroke -2888.979819
## compressionratio 287.701323
## horsepower 33.616248
## peakrpm 2.286126
## citympg -207.284055
## highwaympg 102.096937
pred_elast = predict(elast2, s = opt_lambda_elast, newx = x)
#RMSE
rmse_elast = sqrt(mean((pred_elast - y)^2))
rmse_elast
## [1] 3083.036
coefs=data.frame(as.matrix(coef(ridge2)),as.matrix(coef(lasso2)),as.matrix(coef(elast2)))
colnames(coefs)=c("Ridge","Lasso","Elastic")
coefs
| | Ridge | Lasso | Elastic |
|---|---|---|---|
| (Intercept) | -45165.234494 | -45482.975329 | -47027.523203 |
| wheelbase | 70.068761 | 8.316413 | 94.801177 |
| carlength | -29.252596 | 0.000000 | -68.739889 |
| carwidth | 479.245300 | 481.165370 | 497.713313 |
| carheight | 75.125313 | 23.742790 | 144.423009 |
| curbweight | 2.173168 | 2.052894 | 1.594637 |
| enginesize | 89.058070 | 99.720706 | 114.644763 |
| boreratio | -681.845774 | 0.000000 | -877.702745 |
| stroke | -2390.988074 | -1840.543793 | -2888.979819 |
| compressionratio | 235.677483 | 147.877488 | 287.701323 |
| horsepower | 44.827994 | 37.681530 | 33.616248 |
| peakrpm | 1.734463 | 1.498585 | 2.286126 |
| citympg | -78.681126 | -2.505219 | -207.284055 |
| highwaympg | 7.380630 | 0.000000 | 102.096937 |
In the table above, we see that some coefficients obtained from the lasso are exactly 0. It is also informative to compare the RMSE scores of the three regressions.
rmse_comp=data.frame(rmse_ridge,rmse_lasso,rmse_elast)
colnames(rmse_comp)=c("Ridge","Lasso","Elastic")
rmse_comp
| Ridge | Lasso | Elastic |
|---|---|---|
| 3135.089 | 3175.461 | 3083.036 |
According to the RMSE scores, elastic net regression is the best suited of the three for this dataset.
References:
- https://web.stanford.edu/class/stats202/notes/Model-selection/Shrinkage.html
- Yozgatlıgil, C. (2024). STAT 412 Lecture Notes.