library(ISLR)
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-10
data("Hitters")
set.seed(123)
Check the number of observations (rows) and variables (columns)
dim(Hitters)
## [1] 322 20
Check how many rows have missing data (sum(is.na())) Identify which column(s) contain missing values
(sum(is.na(Hitters)))
## [1] 59
colSums(is.na(Hitters))
## AtBat Hits HmRun Runs RBI Walks Years CAtBat
## 0 0 0 0 0 0 0 0
## CHits CHmRun CRuns CRBI CWalks League Division PutOuts
## 0 0 0 0 0 0 0 0
## Assists Errors Salary NewLeague
## 0 0 59 0
Action: Remove rows with missing values
Create a new dataset ‘Hitters_clean’ using na.omit()
Verify dimensions again to ensure rows were removed
Hitters_clean<- na.omit(Hitters)
dim(Hitters_clean)
## [1] 263 20
Create a histogram of the ‘Salary’ variable Observation: Does it look normal or skewed? (Write comment below)
hist(Hitters_clean$Salary, main = "Salary Distribution", xlab = "Salary")
Yes it is very positively skewed to the left, with most of the salaries being at that lower range.
Split ‘Hitters_clean’ into Training (70%) and Testing (30%)
Create index for training data
Create ‘train_data’ and ‘test_data’
Hitters_index <- sample(1:nrow(Hitters_clean),
round(0.7 * nrow(Hitters_clean)))
Hitters_Train <- Hitters_clean[Hitters_index,]
Hitters_Test <- Hitters_clean[-Hitters_index,]
Fit a full linear regression model on ‘train_data’ using all predictors
Display the summary of the model
full_model <- lm(Salary~., data = Hitters_Train)
summary(full_model)
##
## Call:
## lm(formula = Salary ~ ., data = Hitters_Train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -828.0 -169.1 -11.2 134.6 1757.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 236.5886 107.6923 2.197 0.02943 *
## AtBat -2.0079 0.8134 -2.469 0.01458 *
## Hits 6.3735 3.2080 1.987 0.04861 *
## HmRun -3.4758 7.4955 -0.464 0.64347
## Runs -2.9202 3.8616 -0.756 0.45061
## RBI 2.1402 3.2835 0.652 0.51543
## Walks 7.2022 2.4394 2.953 0.00361 **
## Years -16.6985 15.3867 -1.085 0.27940
## CAtBat -0.2203 0.1648 -1.336 0.18336
## CHits 0.2019 0.8722 0.232 0.81721
## CHmRun 1.1003 2.0708 0.531 0.59589
## CRuns 2.2317 0.9931 2.247 0.02596 *
## CRBI 0.4078 0.9501 0.429 0.66831
## CWalks -0.9651 0.4245 -2.273 0.02430 *
## LeagueN 11.3278 105.7700 0.107 0.91484
## DivisionW -119.3533 49.7837 -2.397 0.01763 *
## PutOuts 0.2173 0.0930 2.337 0.02066 *
## Assists 0.6353 0.2898 2.192 0.02979 *
## Errors -10.2234 5.7896 -1.766 0.07929 .
## NewLeagueN 85.9137 106.3435 0.808 0.42033
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 315.4 on 164 degrees of freedom
## Multiple R-squared: 0.6234, Adjusted R-squared: 0.5798
## F-statistic: 14.29 on 19 and 164 DF, p-value: < 2.2e-16
Perform stepwise selection using step() with direction = “both” or “backward”
Save the final model as ‘lm_step_aic’
null_model <- lm(Salary ~ 1, data = Hitters_Train)
lm_step_aic <- step(full_model,
scope = list(lower = null_model, upper = full_model),
direction = "both", trace = F)
summary(lm_step_aic)
##
## Call:
## lm(formula = Salary ~ AtBat + Hits + Walks + CAtBat + CRuns +
## CRBI + CWalks + Division + PutOuts + Assists + Errors + NewLeague,
## data = Hitters_Train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -797.0 -171.8 -10.1 127.8 1765.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 172.19676 83.25669 2.068 0.040121 *
## AtBat -2.00031 0.67324 -2.971 0.003394 **
## Hits 5.93052 2.12649 2.789 0.005889 **
## Walks 6.42584 2.01994 3.181 0.001742 **
## CAtBat -0.25730 0.06738 -3.819 0.000187 ***
## CRuns 2.30925 0.51824 4.456 1.51e-05 ***
## CRBI 1.05082 0.29166 3.603 0.000412 ***
## CWalks -1.01174 0.30628 -3.303 0.001164 **
## DivisionW -120.14999 47.93791 -2.506 0.013132 *
## PutOuts 0.22148 0.09084 2.438 0.015790 *
## Assists 0.67724 0.26680 2.538 0.012030 *
## Errors -9.26152 5.59590 -1.655 0.099747 .
## NewLeagueN 106.50038 48.78764 2.183 0.030403 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 311.5 on 171 degrees of freedom
## Multiple R-squared: 0.6169, Adjusted R-squared: 0.59
## F-statistic: 22.95 on 12 and 171 DF, p-value: < 2.2e-16
length(coef(lm_step_aic)) - 1
## [1] 12
Question: How many predictors were left in the final AIC model?
Comment: There are 12 predictors left.
Create the input matrix ‘x’ (predictors) and vector ‘y’ (response) from ‘train_data’ Hint: use model.matrix() for x to handle categorical variables automatically
x <- model.matrix(Salary~., data = Hitters_Train)[, -1]
y <- Hitters_Train$Salary
Use cv.glmnet() to find optimal lambda
Hitters_lasso <- cv.glmnet(x, y, alpha = 1)
Hitters_lasso$lambda.min
## [1] 3.216343
Report the value of lambda.min
## [1] 3.216343
Fit the final LASSO model using lambda.min
Display the coefficients using coef()
Hitters_lasso_final <- glmnet(x, y, alpha = 1, lambda = Hitters_lasso$lambda.min)
coef(Hitters_lasso_final)
## 20 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 207.8575346
## AtBat -1.5163170
## Hits 4.4833440
## HmRun -3.8302523
## Runs .
## RBI 1.2041422
## Walks 5.9093856
## Years -26.6539318
## CAtBat .
## CHits .
## CHmRun 1.6096210
## CRuns 1.1849423
## CRBI 0.1229401
## CWalks -0.6825704
## LeagueN 10.5114166
## DivisionW -131.1153295
## PutOuts 0.1882778
## Assists 0.3445808
## Errors -6.5273150
## NewLeagueN 71.3468791
Question: Which variables were shrunk exactly to zero?
Comment: The runs, CatBar and Chits variables have been shrunk to zero meaning that they do not meaningfully improve ours models prediction abilities.
Hitters_lasso_min <- as.matrix(coef(Hitters_lasso, s = "lambda.min"))
rownames(Hitters_lasso_min)[Hitters_lasso_min == 0]
## [1] "Runs" "CAtBat" "CHits"
’ 1. Make predictions on ‘test_data’ using ‘lm_step_aic’
Hitters_pred_aic <- predict(lm_step_aic, newdata = Hitters_Test)
Hint: newx argument in predict() for LASSO needs a matrix
x_train <- model.matrix(Salary ~ ., data = Hitters_Train)[, -1]
y_train <- Hitters_Train$Salary
Hitters_lasso <- cv.glmnet(x_train, y_train, alpha = 1)
x_test <- model.matrix(Salary ~ ., data = Hitters_Test)[, -1]
pred_lasso <- predict(Hitters_lasso,
s = "lambda.min",
newx = x_test)
MSE = mean((Actual - Predicted)^2)
mse_aic <- mean((Hitters_Test$Salary - Hitters_pred_aic)^2)
mse_aic
## [1] 124511.4
mse_lasso <- mean((Hitters_Test$Salary - pred_lasso)^2)
mse_lasso
## [1] 124910.6
Conclusion: Based on OOS MSE, which model performed better?
Comment: While close, we can see that the MSE of the AIC model is ever so slightly smaller, indicating that the AIC model perfomed better than the lasso model.