library(caret)  # provides trainControl() and train()
data <- read.csv("HW5data.csv")
set.seed(1)
fitControl <- trainControl(method = "cv", number = 10)
model.cv <- train(Solubility ~.,
data = data,
method = "lm",
trControl = fitControl)
model.cv
Linear Regression
1267 samples
228 predictor
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 1141, 1140, 1140, 1139, 1141, 1141, ...
Resampling results:
RMSE Rsquared MAE
0.7126272 0.8800056 0.5329865
Tuning parameter 'intercept' was held constant at a value of TRUE
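The cross-validated RMSE is 0.7126, so a typical prediction error on the held-out folds is about 0.71 units of Solubility. RMSE weights large errors more heavily than a plain average of absolute errors, which is why it is larger than the MAE of 0.5330.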
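As a small illustration of how RMSE and MAE summarize the same errors differently, the sketch below computes both by hand. The vectors `actual` and `predicted` are made-up values for illustration, not taken from the Solubility data.

```r
# Toy held-out values and predictions (hypothetical, for illustration only).
actual    <- c(1.0, 2.0, 3.0, 4.0)
predicted <- c(1.5, 2.0, 2.0, 4.5)

errors <- actual - predicted      # -0.5, 0, 1, -0.5
rmse   <- sqrt(mean(errors^2))    # squares the errors, so the error of 1 dominates
mae    <- mean(abs(errors))       # treats every unit of error equally

rmse  # about 0.612
mae   # 0.5
```

RMSE is never smaller than MAE on the same errors, and the gap grows when a few errors are much larger than the rest.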
model.lm <- lm(Solubility ~., data = data)
rmse <- sqrt(mean(residuals(model.lm)^2))
rmse
[1] 0.5375757
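This training RMSE (0.5376) is lower than the cross-validated RMSE from part (a) (0.7126) because the model is being evaluated on the same data it was fit to. With 228 predictors, OLS can fit some of the noise in the training set, so the in-sample error understates the error on new data.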
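The gap between in-sample and held-out error can be reproduced on simulated data. This is a minimal sketch using base R only; the simulated data frame `sim`, the 70/30 split, and the helper `rmse()` are assumptions made for illustration and are not part of the homework data.

```r
set.seed(1)
n <- 100
p <- 40
x <- matrix(rnorm(n * p), nrow = n)
y <- x[, 1] + rnorm(n)              # only the first predictor matters
sim <- data.frame(y = y, x)         # predictor columns are named X1..X40

train_idx <- 1:70                   # first 70 rows for fitting, last 30 held out
fit <- lm(y ~ ., data = sim[train_idx, ])

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

train_rmse <- rmse(sim$y[train_idx], fitted(fit))
test_rmse  <- rmse(sim$y[-train_idx], predict(fit, newdata = sim[-train_idx, ]))

c(train = train_rmse, test = test_rmse)
```

With many predictors relative to the number of rows, the training RMSE comes out well below the held-out RMSE, which is the same pattern seen between parts (a) and (b).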
coef_ols <- summary(model.lm)$coefficients[,1]
hist(coef_ols)
The majority of the coefficient estimates are very close to 0, so most
predictors have only a small estimated effect on Solubility per unit change.
Only a small number of coefficients have larger magnitudes. Because the
predictors are not standardized in this fit, coefficient size also depends on
each predictor's units, so magnitude alone is only a rough guide to importance.
length(which(coef_ols == 0))
[1] 0
p_vals <- summary(model.lm)$coefficients[,4]
sum(p_vals > .10)
[1] 155
155 of the p-values are greater than 0.10, so most coefficients are not individually significant at the 10% level.
set.seed(1)
kGrid <- expand.grid(k = 1:20)
model.knn.sc.k <- train(Solubility ~.,
data = data,
method = "knn",
preProc = c("center", "scale"),
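# Fixed typo: this argument was written as tryControl, which train() does not
# recognize, so the output below used caret's default bootstrap (25 reps).
trControl = fitControl,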
tuneGrid = kGrid)
model.knn.sc.k
k-Nearest Neighbors
1267 samples
228 predictor
Pre-processing: centered (228), scaled (228)
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 1267, 1267, 1267, 1267, 1267, 1267, ...
Resampling results across tuning parameters:
k RMSE Rsquared MAE
1 1.198573 0.6877347 0.8842548
2 1.160632 0.6983165 0.8597383
3 1.129450 0.7084610 0.8409524
4 1.108760 0.7158757 0.8315002
5 1.095056 0.7205767 0.8259047
6 1.087393 0.7226650 0.8228438
7 1.083740 0.7236593 0.8217687
8 1.079321 0.7253272 0.8209392
9 1.075952 0.7269070 0.8194609
10 1.078885 0.7252852 0.8230893
11 1.081267 0.7241185 0.8281407
12 1.082280 0.7237036 0.8305415
13 1.080696 0.7245372 0.8312641
14 1.080345 0.7250477 0.8316676
15 1.081281 0.7248098 0.8344448
16 1.085586 0.7227357 0.8365685
17 1.087382 0.7219311 0.8380272
18 1.091152 0.7201571 0.8418194
19 1.094536 0.7185419 0.8452972
20 1.098663 0.7166530 0.8489150
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was k = 9.
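The best-performing KNN model uses k = 9, with an RMSE of 1.076. Note that the output above reports bootstrap resampling (25 reps) rather than 10-fold CV, so this estimate is not strictly like-for-like with the OLS result, but the gap (1.076 vs. 0.7126) is large enough that the conclusion is unlikely to change. KNN likely performed worse than OLS because the dataset contains a large number of predictors, many of which are binary and not significant. Since KNN relies on distance, every predictor contributes equally to that distance after scaling, so many irrelevant variables make it harder to identify meaningful nearest neighbors. In contrast, OLS estimates a separate coefficient for each predictor and can give the irrelevant ones weights near zero, which makes it less sensitive to this type of noise.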
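The distance argument above can be illustrated with a small simulation. This is a sketch in base R under stated assumptions: the data are simulated, `knn_predict()` is a hypothetical hand-written helper (not caret's implementation), and k = 5 is an arbitrary choice.

```r
set.seed(1)
n_train <- 200
n_test  <- 100
n       <- n_train + n_test

signal <- rnorm(n)                          # the one informative predictor
noise  <- matrix(rnorm(n * 50), nrow = n)   # 50 irrelevant predictors
y      <- 2 * signal + rnorm(n, sd = 0.3)

# Plain KNN regression: average the responses of the k closest training rows.
knn_predict <- function(x_train, y_train, x_test, k) {
  apply(x_test, 1, function(row) {
    d <- sqrt(colSums((t(x_train) - row)^2))  # Euclidean distance to each training row
    mean(y_train[order(d)[1:k]])
  })
}

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

tr <- 1:n_train
te <- (n_train + 1):n
x_clean <- matrix(signal, ncol = 1)   # informative predictor only
x_noisy <- cbind(signal, noise)       # informative predictor plus 50 noise columns

rmse_clean <- rmse(y[te], knn_predict(x_clean[tr, , drop = FALSE], y[tr],
                                      x_clean[te, , drop = FALSE], k = 5))
rmse_noisy <- rmse(y[te], knn_predict(x_noisy[tr, ], y[tr], x_noisy[te, ], k = 5))

c(clean = rmse_clean, noisy = rmse_noisy)
```

Adding the noise columns changes which rows count as "nearest", so the neighbors no longer share similar values of the informative predictor and the held-out RMSE rises sharply.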
set.seed(1)
tune.grid <- expand.grid(alpha = 0, lambda = 10^seq(-3, 0.5, length = 100))
model.ridge <- train(Solubility ~ .,
data = data,
method = "glmnet",
tuneGrid = tune.grid,
standardize = TRUE,
trControl = fitControl)
model.ridge
glmnet
1267 samples
228 predictor
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 1141, 1140, 1140, 1139, 1141, 1141, ...
Resampling results across tuning parameters:
lambda RMSE Rsquared MAE
0.001000000 0.6704221 0.8924728 0.5138610
0.001084810 0.6704221 0.8924728 0.5138610
0.001176812 0.6704221 0.8924728 0.5138610
0.001276617 0.6704221 0.8924728 0.5138610
0.001384886 0.6704221 0.8924728 0.5138610
0.001502338 0.6704221 0.8924728 0.5138610
0.001629751 0.6704221 0.8924728 0.5138610
0.001767969 0.6704221 0.8924728 0.5138610
0.001917910 0.6704221 0.8924728 0.5138610
0.002080568 0.6704221 0.8924728 0.5138610
0.002257020 0.6704221 0.8924728 0.5138610
0.002448437 0.6704221 0.8924728 0.5138610
0.002656088 0.6704221 0.8924728 0.5138610
0.002881350 0.6704221 0.8924728 0.5138610
0.003125716 0.6704221 0.8924728 0.5138610
0.003390807 0.6704221 0.8924728 0.5138610
0.003678380 0.6704221 0.8924728 0.5138610
0.003990342 0.6704221 0.8924728 0.5138610
0.004328761 0.6704221 0.8924728 0.5138610
0.004695882 0.6704221 0.8924728 0.5138610
0.005094138 0.6704221 0.8924728 0.5138610
0.005526170 0.6704221 0.8924728 0.5138610
0.005994843 0.6704221 0.8924728 0.5138610
0.006503263 0.6704221 0.8924728 0.5138610
0.007054802 0.6704221 0.8924728 0.5138610
0.007653118 0.6704221 0.8924728 0.5138610
0.008302176 0.6704221 0.8924728 0.5138610
0.009006280 0.6704221 0.8924728 0.5138610
0.009770100 0.6704221 0.8924728 0.5138610
0.010598698 0.6704221 0.8924728 0.5138610
0.011497570 0.6704221 0.8924728 0.5138610
0.012472675 0.6704221 0.8924728 0.5138610
0.013530478 0.6704221 0.8924728 0.5138610
0.014677993 0.6704221 0.8924728 0.5138610
0.015922828 0.6704221 0.8924728 0.5138610
0.017273237 0.6704221 0.8924728 0.5138610
0.018738174 0.6704221 0.8924728 0.5138610
0.020327352 0.6704221 0.8924728 0.5138610
0.022051307 0.6704221 0.8924728 0.5138610
0.023921471 0.6704221 0.8924728 0.5138610
0.025950242 0.6704221 0.8924728 0.5138610
0.028151073 0.6704221 0.8924728 0.5138610
0.030538555 0.6704221 0.8924728 0.5138610
0.033128519 0.6704221 0.8924728 0.5138610
0.035938137 0.6704221 0.8924728 0.5138610
0.038986037 0.6704221 0.8924728 0.5138610
0.042292429 0.6704221 0.8924728 0.5138610
0.045879234 0.6704221 0.8924728 0.5138610
0.049770236 0.6704221 0.8924728 0.5138610
0.053991231 0.6704221 0.8924728 0.5138610
0.058570208 0.6704221 0.8924728 0.5138610
0.063537526 0.6704221 0.8924728 0.5138610
0.068926121 0.6704221 0.8924728 0.5138610
0.074771720 0.6704221 0.8924728 0.5138610
0.081113083 0.6704221 0.8924728 0.5138610
0.087992254 0.6704221 0.8924728 0.5138610
0.095454846 0.6704221 0.8924728 0.5138610
0.103550337 0.6704221 0.8924728 0.5138610
0.112332403 0.6704221 0.8924728 0.5138610
0.121859274 0.6704221 0.8924728 0.5138610
0.132194115 0.6705272 0.8924525 0.5139918
0.143405450 0.6713023 0.8922466 0.5148594
0.155567614 0.6722688 0.8919871 0.5158956
0.168761248 0.6733807 0.8916902 0.5170364
0.183073828 0.6746725 0.8913450 0.5183414
0.198600253 0.6761524 0.8909484 0.5198129
0.215443469 0.6778002 0.8905079 0.5214519
0.233715152 0.6796517 0.8900126 0.5232851
0.253536449 0.6816888 0.8894698 0.5252506
0.275038784 0.6839183 0.8888784 0.5274750
0.298364724 0.6863330 0.8882385 0.5298792
0.323668929 0.6889760 0.8875382 0.5323798
0.351119173 0.6918326 0.8867823 0.5349548
0.380897464 0.6949455 0.8859578 0.5378214
0.413201240 0.6982727 0.8850811 0.5408001
0.448244688 0.7017706 0.8841665 0.5438586
0.486260158 0.7055869 0.8831644 0.5471585
0.527499706 0.7095274 0.8821451 0.5505160
0.572236766 0.7138389 0.8810199 0.5540401
0.620767959 0.7183435 0.8798561 0.5576575
0.673415066 0.7231104 0.8786291 0.5615143
0.730527154 0.7281917 0.8773221 0.5657155
0.792482898 0.7335308 0.8759603 0.5701107
0.859693087 0.7391893 0.8745213 0.5747646
0.932603347 0.7451599 0.8730070 0.5796828
1.011697100 0.7514360 0.8714319 0.5847919
1.097498765 0.7580411 0.8697822 0.5902346
1.190577239 0.7649585 0.8680702 0.5958909
1.291549665 0.7722074 0.8662909 0.6017523
1.401085526 0.7798601 0.8644057 0.6080231
1.519911083 0.7879216 0.8624318 0.6146552
1.648814193 0.7963971 0.8603824 0.6215462
1.788649529 0.8053040 0.8582456 0.6289280
1.940344250 0.8146425 0.8560235 0.6367362
2.104904145 0.8244226 0.8537155 0.6450014
2.283420305 0.8346695 0.8513144 0.6536780
2.477076356 0.8454169 0.8488133 0.6626732
2.687156307 0.8566924 0.8462067 0.6719731
2.915053063 0.8685091 0.8434937 0.6815190
3.162277660 0.8808827 0.8406724 0.6914448
Tuning parameter 'alpha' was held constant at a value of 0
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were alpha = 0 and lambda = 0.1218593.
The selected lambda was approximately 0.1219, with a cross-validated RMSE of 0.6704. Every lambda in the grid up to about 0.1219 gives the same RMSE, so the grid does not identify a unique optimum below that value; this is probably because glmnet's internally computed lambda path for ridge does not extend that low. Since this RMSE is smaller than the OLS cross-validated RMSE of 0.7126, ridge regression performed better. This is likely because the dataset contains many predictors that are correlated with each other. In this situation, OLS can produce unstable coefficient estimates, because different combinations of large positive and negative coefficients give nearly the same fit. Ridge regression addresses this by penalizing the sum of squared coefficients, which shrinks the estimates toward zero. Smaller coefficients are less sensitive to noise in the training data, so the regularized model performed better on the held-out folds.
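The shrinkage effect can be seen directly from the closed-form ridge solution. This is a minimal sketch on simulated, nearly collinear predictors; `ridge_coef()` is a hypothetical helper, and its `lambda` is on the scale of the textbook formula (X'X + lambda I)^(-1) X'y, which is not the same scale as glmnet's lambda.

```r
set.seed(1)
n  <- 50
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)   # nearly collinear with x1
X  <- scale(cbind(x1, x2))       # center and scale the predictors
y  <- x1 + x2 + rnorm(n)
y  <- y - mean(y)                # center the response so no intercept is needed

# Closed-form ridge estimate; lambda = 0 gives ordinary least squares.
ridge_coef <- function(X, y, lambda) {
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
}

b_ols   <- ridge_coef(X, y, lambda = 0)
b_ridge <- ridge_coef(X, y, lambda = 10)

cbind(ols = as.vector(b_ols), ridge = as.vector(b_ridge))
```

With near-collinear columns the OLS pair tends to split into one large and one small (or opposite-signed) coefficient, while the ridge pair is pulled toward similar, smaller values. The squared length of the ridge coefficient vector is always smaller than that of the OLS vector for lambda > 0.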
coef.ridge <- coef(model.ridge$finalModel, model.ridge$bestTune$lambda)
hist(as.vector(coef.ridge))
length(which(coef.ridge[,1] == 0))
[1] 0
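Compared with OLS, the ridge coefficients are more tightly concentrated around zero and smaller in magnitude. Once again, none of the coefficients are exactly zero, because the ridge penalty shrinks coefficients toward zero without eliminating them.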
set.seed(1)
tune.grid <- expand.grid(alpha = 1, lambda = 10^seq(-3, 0.5, length = 100))
model.lasso <- train(Solubility ~ .,
data = data,
method = "glmnet",
tuneGrid = tune.grid,
standardize = TRUE,
trControl = fitControl)
model.lasso
glmnet
1267 samples
228 predictor
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 1141, 1140, 1140, 1139, 1141, 1141, ...
Resampling results across tuning parameters:
lambda RMSE Rsquared MAE
0.001000000 0.6823043 0.8889596 0.5136810
0.001084810 0.6812989 0.8892328 0.5129574
0.001176812 0.6803933 0.8894927 0.5124579
0.001276617 0.6793222 0.8898082 0.5119425
0.001384886 0.6781872 0.8901499 0.5114422
0.001502338 0.6770456 0.8904882 0.5109158
0.001629751 0.6756104 0.8909051 0.5101109
0.001767969 0.6745134 0.8912226 0.5094828
0.001917910 0.6732547 0.8915901 0.5087927
0.002080568 0.6722322 0.8918833 0.5081706
0.002257020 0.6710708 0.8922213 0.5075220
0.002448437 0.6696498 0.8926425 0.5068243
0.002656088 0.6685145 0.8929756 0.5062654
0.002881350 0.6675204 0.8932701 0.5057492
0.003125716 0.6665593 0.8935681 0.5052962
0.003390807 0.6654356 0.8939196 0.5047436
0.003678380 0.6642763 0.8942777 0.5042477
0.003990342 0.6633989 0.8945655 0.5041042
0.004328761 0.6630948 0.8946846 0.5044461
0.004695882 0.6631954 0.8946825 0.5050696
0.005094138 0.6634878 0.8946110 0.5059066
0.005526170 0.6639049 0.8945036 0.5069783
0.005994843 0.6645962 0.8943278 0.5083596
0.006503263 0.6655527 0.8940380 0.5098164
0.007054802 0.6671454 0.8935797 0.5117387
0.007653118 0.6686601 0.8931255 0.5137932
0.008302176 0.6704091 0.8926129 0.5163103
0.009006280 0.6722159 0.8920902 0.5189581
0.009770100 0.6740341 0.8915452 0.5216614
0.010598698 0.6762320 0.8908740 0.5244964
0.011497570 0.6787567 0.8900912 0.5274267
0.012472675 0.6817571 0.8891630 0.5306378
0.013530478 0.6848488 0.8882273 0.5336304
0.014677993 0.6877221 0.8873578 0.5362642
0.015922828 0.6906689 0.8864751 0.5386887
0.017273237 0.6939073 0.8855001 0.5411193
0.018738174 0.6975176 0.8844071 0.5435738
0.020327352 0.7016217 0.8831550 0.5464656
0.022051307 0.7062813 0.8817356 0.5501272
0.023921471 0.7112683 0.8802109 0.5536750
0.025950242 0.7159500 0.8788159 0.5567117
0.028151073 0.7210058 0.8773218 0.5601800
0.030538555 0.7263597 0.8757436 0.5638470
0.033128519 0.7318813 0.8741296 0.5677765
0.035938137 0.7378164 0.8723989 0.5720924
0.038986037 0.7443633 0.8704547 0.5770824
0.042292429 0.7513937 0.8683487 0.5826862
0.045879234 0.7582825 0.8663528 0.5882470
0.049770236 0.7648356 0.8645334 0.5936344
0.053991231 0.7715843 0.8627001 0.5993908
0.058570208 0.7788703 0.8607217 0.6057880
0.063537526 0.7865053 0.8586457 0.6124971
0.068926121 0.7947925 0.8564090 0.6198544
0.074771720 0.8038024 0.8540315 0.6277365
0.081113083 0.8139583 0.8512742 0.6362516
0.087992254 0.8249663 0.8482840 0.6457059
0.095454846 0.8373859 0.8448196 0.6559923
0.103550337 0.8512847 0.8408155 0.6672516
0.112332403 0.8671085 0.8360633 0.6801789
0.121859274 0.8848381 0.8305321 0.6948395
0.132194115 0.9040344 0.8244359 0.7106752
0.143405450 0.9240027 0.8181313 0.7269278
0.155567614 0.9443462 0.8118635 0.7437428
0.168761248 0.9658467 0.8051504 0.7614518
0.183073828 0.9887707 0.7979033 0.7799368
0.198600253 1.0136777 0.7897818 0.7996606
0.215443469 1.0401324 0.7809972 0.8199658
0.233715152 1.0680259 0.7715079 0.8413549
0.253536449 1.0971146 0.7614251 0.8635377
0.275038784 1.1272671 0.7509882 0.8857034
0.298364724 1.1579637 0.7408118 0.9079514
0.323668929 1.1864394 0.7331197 0.9289053
0.351119173 1.2141606 0.7273430 0.9501386
0.380897464 1.2442202 0.7217266 0.9738010
0.413201240 1.2782498 0.7148659 1.0006287
0.448244688 1.3171716 0.7056190 1.0311237
0.486260158 1.3615587 0.6929488 1.0657047
0.527499706 1.4120534 0.6752460 1.1055412
0.572236766 1.4688818 0.6505485 1.1503814
0.620767959 1.5263309 0.6235032 1.1944187
0.673415066 1.5812937 0.6012835 1.2339638
0.730527154 1.6410228 0.5720554 1.2765613
0.792482898 1.7066535 0.5281226 1.3230855
0.859693087 1.7719372 0.4740940 1.3685571
0.932603347 1.8273186 0.4315378 1.4098715
1.011697100 1.8728175 0.4189354 1.4480108
1.097498765 1.9204608 0.4189354 1.4889799
1.190577239 1.9750392 0.4189354 1.5356990
1.291549665 2.0367586 0.3973913 1.5900329
1.401085526 2.0507379 NaN 1.6026278
1.519911083 2.0507379 NaN 1.6026278
1.648814193 2.0507379 NaN 1.6026278
1.788649529 2.0507379 NaN 1.6026278
1.940344250 2.0507379 NaN 1.6026278
2.104904145 2.0507379 NaN 1.6026278
2.283420305 2.0507379 NaN 1.6026278
2.477076356 2.0507379 NaN 1.6026278
2.687156307 2.0507379 NaN 1.6026278
2.915053063 2.0507379 NaN 1.6026278
3.162277660 2.0507379 NaN 1.6026278
Tuning parameter 'alpha' was held constant at a value of 1
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were alpha = 1 and lambda = 0.004328761.
With lasso regression, the optimal lambda is 0.004328761 and the cross-validated RMSE is 0.6631. This is again better than OLS (0.7126), and slightly better than ridge (0.6704).
coef.lasso <- coef(model.lasso$finalModel, model.lasso$bestTune$lambda)
hist(as.vector(coef.lasso))
length(which(coef.lasso[,1] == 0))
[1] 78
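This time, 78 coefficients are exactly zero. This happens because the lasso's absolute-value (L1) penalty can shrink a coefficient all the way to zero when the predictor contributes too little to the fit, whereas the ridge penalty only shrinks coefficients toward zero.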
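The difference between the two penalties is easiest to see in the special case of an orthonormal design, where the lasso solution is a soft-thresholded version of the OLS coefficients and the ridge solution is a proportional rescaling. The sketch below assumes that special case; `soft_threshold()` is a hypothetical helper, and `z` is a made-up vector of OLS coefficients.

```r
# Soft-thresholding: move each value toward zero by lambda, stopping at zero.
soft_threshold <- function(z, lambda) sign(z) * pmax(abs(z) - lambda, 0)

z      <- c(-2.0, -0.3, 0.1, 0.8, 3.0)   # hypothetical OLS coefficients
lambda <- 0.5

lasso_like <- soft_threshold(z, lambda)  # small coefficients become exactly 0
ridge_like <- z / (1 + lambda)           # every coefficient shrinks, none reach 0

sum(lasso_like == 0)  # 2
sum(ridge_like == 0)  # 0
```

Any coefficient whose magnitude is below the threshold is set exactly to zero by the lasso-style update, while the ridge-style update only scales it down.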
Since LASSO regression has the smallest resampled RMSE (0.6631) and the largest R^2 (0.8947) compared to OLS, KNN, and ridge regression, it is the best-performing model, although its margin over ridge (RMSE 0.6704) is small. This makes sense because the dataset contains a large number of predictors, many of which are likely not important. LASSO is effective in this setting because it can set some coefficients exactly equal to 0, effectively removing irrelevant variables and reducing noise.
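For reference, the reported results can be collected into one table and sorted. The values below are copied from the outputs above; the data frame `results` and its column names are just one convenient layout, not something produced by caret.

```r
# Resampled performance of each model, as printed in the outputs above.
results <- data.frame(
  model      = c("OLS", "KNN (k = 9)", "Ridge", "LASSO"),
  rmse       = c(0.7126272, 1.075952, 0.6704221, 0.6630948),
  rsquared   = c(0.8800056, 0.7269070, 0.8924728, 0.8946846),
  resampling = c("10-fold CV", "bootstrap (25 reps)", "10-fold CV", "10-fold CV")
)

# Order from lowest to highest RMSE.
results[order(results$rmse), ]
```

The `resampling` column records that the KNN figures come from bootstrap resampling, so they are not directly comparable with the three 10-fold CV estimates.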