———————————————————

MSBA-680: Data Mining - Assignment 2

Student Name: Joe Frye

———————————————————

— SETUP —

1. Load necessary libraries

library(ISLR)   
library(glmnet) 
## Loading required package: Matrix
## Loaded glmnet 4.1-10

2. Load the dataset

data("Hitters")

3. Set seed for reproducibility (Crucial for this assignment!)

set.seed(123)

———————————————————

Task 1: Data Exploration & Preparation

———————————————————

Q1.1: Dimensions

Check the number of observations (rows) and variables (columns)

dim(Hitters)
## [1] 322  20

Q1.2: Missing Values

Check how many rows have missing data (sum(is.na())) Identify which column(s) contain missing values

(sum(is.na(Hitters)))
## [1] 59
colSums(is.na(Hitters))
##     AtBat      Hits     HmRun      Runs       RBI     Walks     Years    CAtBat 
##         0         0         0         0         0         0         0         0 
##     CHits    CHmRun     CRuns      CRBI    CWalks    League  Division   PutOuts 
##         0         0         0         0         0         0         0         0 
##   Assists    Errors    Salary NewLeague 
##         0         0        59         0

Action: Remove rows with missing values

Create a new dataset ‘Hitters_clean’ using na.omit()

Verify dimensions again to ensure rows were removed

Hitters_clean<- na.omit(Hitters)
dim(Hitters_clean)
## [1] 263  20

Q1.3: Distribution

Create a histogram of the ‘Salary’ variable Observation: Does it look normal or skewed? (Write comment below)

hist(Hitters_clean$Salary, main = "Salary Distribution", xlab = "Salary")

Yes it is very positively skewed to the left, with most of the salaries being at that lower range.

———————————————————

Task 2: Data Splitting

———————————————————

Q2.1: Train/Test Split

Split ‘Hitters_clean’ into Training (70%) and Testing (30%)

Create index for training data

Create ‘train_data’ and ‘test_data’

Hitters_index <- sample(1:nrow(Hitters_clean), 
                        round(0.7 * nrow(Hitters_clean)))
Hitters_Train <- Hitters_clean[Hitters_index,]
Hitters_Test <- Hitters_clean[-Hitters_index,]

———————————————————

Task 3: Linear Regression & Stepwise Selection

———————————————————

Q3.1: Full Model

Fit a full linear regression model on ‘train_data’ using all predictors

Display the summary of the model

full_model <- lm(Salary~., data = Hitters_Train)
summary(full_model)
## 
## Call:
## lm(formula = Salary ~ ., data = Hitters_Train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -828.0 -169.1  -11.2  134.6 1757.9 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  236.5886   107.6923   2.197  0.02943 * 
## AtBat         -2.0079     0.8134  -2.469  0.01458 * 
## Hits           6.3735     3.2080   1.987  0.04861 * 
## HmRun         -3.4758     7.4955  -0.464  0.64347   
## Runs          -2.9202     3.8616  -0.756  0.45061   
## RBI            2.1402     3.2835   0.652  0.51543   
## Walks          7.2022     2.4394   2.953  0.00361 **
## Years        -16.6985    15.3867  -1.085  0.27940   
## CAtBat        -0.2203     0.1648  -1.336  0.18336   
## CHits          0.2019     0.8722   0.232  0.81721   
## CHmRun         1.1003     2.0708   0.531  0.59589   
## CRuns          2.2317     0.9931   2.247  0.02596 * 
## CRBI           0.4078     0.9501   0.429  0.66831   
## CWalks        -0.9651     0.4245  -2.273  0.02430 * 
## LeagueN       11.3278   105.7700   0.107  0.91484   
## DivisionW   -119.3533    49.7837  -2.397  0.01763 * 
## PutOuts        0.2173     0.0930   2.337  0.02066 * 
## Assists        0.6353     0.2898   2.192  0.02979 * 
## Errors       -10.2234     5.7896  -1.766  0.07929 . 
## NewLeagueN    85.9137   106.3435   0.808  0.42033   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 315.4 on 164 degrees of freedom
## Multiple R-squared:  0.6234, Adjusted R-squared:  0.5798 
## F-statistic: 14.29 on 19 and 164 DF,  p-value: < 2.2e-16

Q3.2: Stepwise AIC Selection

Perform stepwise selection using step() with direction = “both” or “backward”

Save the final model as ‘lm_step_aic’

null_model <- lm(Salary ~ 1, data = Hitters_Train)
lm_step_aic <- step(full_model,
                    scope = list(lower = null_model, upper = full_model),
                    direction = "both", trace = F)
summary(lm_step_aic)
## 
## Call:
## lm(formula = Salary ~ AtBat + Hits + Walks + CAtBat + CRuns + 
##     CRBI + CWalks + Division + PutOuts + Assists + Errors + NewLeague, 
##     data = Hitters_Train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -797.0 -171.8  -10.1  127.8 1765.4 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  172.19676   83.25669   2.068 0.040121 *  
## AtBat         -2.00031    0.67324  -2.971 0.003394 ** 
## Hits           5.93052    2.12649   2.789 0.005889 ** 
## Walks          6.42584    2.01994   3.181 0.001742 ** 
## CAtBat        -0.25730    0.06738  -3.819 0.000187 ***
## CRuns          2.30925    0.51824   4.456 1.51e-05 ***
## CRBI           1.05082    0.29166   3.603 0.000412 ***
## CWalks        -1.01174    0.30628  -3.303 0.001164 ** 
## DivisionW   -120.14999   47.93791  -2.506 0.013132 *  
## PutOuts        0.22148    0.09084   2.438 0.015790 *  
## Assists        0.67724    0.26680   2.538 0.012030 *  
## Errors        -9.26152    5.59590  -1.655 0.099747 .  
## NewLeagueN   106.50038   48.78764   2.183 0.030403 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 311.5 on 171 degrees of freedom
## Multiple R-squared:  0.6169, Adjusted R-squared:   0.59 
## F-statistic: 22.95 on 12 and 171 DF,  p-value: < 2.2e-16
length(coef(lm_step_aic)) - 1
## [1] 12

Question: How many predictors were left in the final AIC model?

Comment: There are 12 predictors left.

———————————————————

Task 4: LASSO Regression

———————————————————

Q4.1: Matrix Preparation

Create the input matrix ‘x’ (predictors) and vector ‘y’ (response) from ‘train_data’ Hint: use model.matrix() for x to handle categorical variables automatically

x <- model.matrix(Salary~., data = Hitters_Train)[, -1]
y <- Hitters_Train$Salary

Q4.2: Cross-Validation for Lambda

Use cv.glmnet() to find optimal lambda

Hitters_lasso <- cv.glmnet(x, y, alpha = 1)
Hitters_lasso$lambda.min
## [1] 3.216343

Report the value of lambda.min

## [1] 3.216343

Q4.3: Final LASSO Model

Fit the final LASSO model using lambda.min

Display the coefficients using coef()

Hitters_lasso_final <- glmnet(x, y, alpha = 1, lambda = Hitters_lasso$lambda.min)
coef(Hitters_lasso_final)
## 20 x 1 sparse Matrix of class "dgCMatrix"
##                       s0
## (Intercept)  207.8575346
## AtBat         -1.5163170
## Hits           4.4833440
## HmRun         -3.8302523
## Runs           .        
## RBI            1.2041422
## Walks          5.9093856
## Years        -26.6539318
## CAtBat         .        
## CHits          .        
## CHmRun         1.6096210
## CRuns          1.1849423
## CRBI           0.1229401
## CWalks        -0.6825704
## LeagueN       10.5114166
## DivisionW   -131.1153295
## PutOuts        0.1882778
## Assists        0.3445808
## Errors        -6.5273150
## NewLeagueN    71.3468791

Question: Which variables were shrunk exactly to zero?

Comment: The runs, CatBar and Chits variables have been shrunk to zero meaning that they do not meaningfully improve ours models prediction abilities.

Hitters_lasso_min <- as.matrix(coef(Hitters_lasso, s = "lambda.min"))
rownames(Hitters_lasso_min)[Hitters_lasso_min == 0]
## [1] "Runs"   "CAtBat" "CHits"

———————————————————

Task 5: Model Comparison & Conclusion

———————————————————

Q5.1: Prediction & MSE

’ 1. Make predictions on ‘test_data’ using ‘lm_step_aic’

Hitters_pred_aic <- predict(lm_step_aic, newdata = Hitters_Test)
  1. Make predictions on ‘test_data’ using the LASSO model

Hint: newx argument in predict() for LASSO needs a matrix

x_train <- model.matrix(Salary ~ ., data = Hitters_Train)[, -1]
y_train <- Hitters_Train$Salary

Hitters_lasso <- cv.glmnet(x_train, y_train, alpha = 1)

x_test <- model.matrix(Salary ~ ., data = Hitters_Test)[, -1]

pred_lasso <- predict(Hitters_lasso,
                      s = "lambda.min",
                      newx = x_test)
  1. Calculate Out-of-Sample MSE for both models

MSE = mean((Actual - Predicted)^2)

mse_aic <- mean((Hitters_Test$Salary - Hitters_pred_aic)^2)
mse_aic
## [1] 124511.4
mse_lasso <- mean((Hitters_Test$Salary - pred_lasso)^2)
mse_lasso
## [1] 124910.6

Conclusion: Based on OOS MSE, which model performed better?

Comment: While close, we can see that the MSE of the AIC model is ever so slightly smaller, indicating that the AIC model perfomed better than the lasso model.

———————————————————

End of Assignment 2

———————————————————