———————————————————

MSBA-680: Data Mining - Assignment 2

Student Name: Joe Frye

———————————————————

— SETUP —

1. Load necessary libraries

library(ISLR)   
library(glmnet)

## Loading required package: Matrix

## Loaded glmnet 4.1-10

2. Load the dataset

data("Hitters")

3. Set seed for reproducibility (Crucial for this assignment!)

set.seed(123)

———————————————————

Task 1: Data Exploration & Preparation

———————————————————

Q1.1: Dimensions

Check the number of observations (rows) and variables (columns)

dim(Hitters)

## [1] 322  20

Q1.2: Missing Values

Check how many rows have missing data (sum(is.na())) Identify which column(s) contain missing values

(sum(is.na(Hitters)))

## [1] 59

colSums(is.na(Hitters))

##     AtBat      Hits     HmRun      Runs       RBI     Walks     Years    CAtBat 
##         0         0         0         0         0         0         0         0 
##     CHits    CHmRun     CRuns      CRBI    CWalks    League  Division   PutOuts 
##         0         0         0         0         0         0         0         0 
##   Assists    Errors    Salary NewLeague 
##         0         0        59         0

Action: Remove rows with missing values

Create a new dataset ‘Hitters_clean’ using na.omit()

Verify dimensions again to ensure rows were removed

Hitters_clean<- na.omit(Hitters)
dim(Hitters_clean)

## [1] 263  20

Q1.3: Distribution

Create a histogram of the ‘Salary’ variable Observation: Does it look normal or skewed? (Write comment below)

hist(Hitters_clean$Salary, main = "Salary Distribution", xlab = "Salary")

Yes it is very positively skewed to the left, with most of the salaries being at that lower range.

———————————————————

Task 2: Data Splitting

———————————————————

Q2.1: Train/Test Split

Split ‘Hitters_clean’ into Training (70%) and Testing (30%)

Create index for training data

Create ‘train_data’ and ‘test_data’

Hitters_index <- sample(1:nrow(Hitters_clean), 
                        round(0.7 * nrow(Hitters_clean)))
Hitters_Train <- Hitters_clean[Hitters_index,]
Hitters_Test <- Hitters_clean[-Hitters_index,]

———————————————————

Task 3: Linear Regression & Stepwise Selection

———————————————————

Q3.1: Full Model

Fit a full linear regression model on ‘train_data’ using all predictors

Display the summary of the model

full_model <- lm(Salary~., data = Hitters_Train)
summary(full_model)

## 
## Call:
## lm(formula = Salary ~ ., data = Hitters_Train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -828.0 -169.1  -11.2  134.6 1757.9 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  236.5886   107.6923   2.197  0.02943 * 
## AtBat         -2.0079     0.8134  -2.469  0.01458 * 
## Hits           6.3735     3.2080   1.987  0.04861 * 
## HmRun         -3.4758     7.4955  -0.464  0.64347   
## Runs          -2.9202     3.8616  -0.756  0.45061   
## RBI            2.1402     3.2835   0.652  0.51543   
## Walks          7.2022     2.4394   2.953  0.00361 **
## Years        -16.6985    15.3867  -1.085  0.27940   
## CAtBat        -0.2203     0.1648  -1.336  0.18336   
## CHits          0.2019     0.8722   0.232  0.81721   
## CHmRun         1.1003     2.0708   0.531  0.59589   
## CRuns          2.2317     0.9931   2.247  0.02596 * 
## CRBI           0.4078     0.9501   0.429  0.66831   
## CWalks        -0.9651     0.4245  -2.273  0.02430 * 
## LeagueN       11.3278   105.7700   0.107  0.91484   
## DivisionW   -119.3533    49.7837  -2.397  0.01763 * 
## PutOuts        0.2173     0.0930   2.337  0.02066 * 
## Assists        0.6353     0.2898   2.192  0.02979 * 
## Errors       -10.2234     5.7896  -1.766  0.07929 . 
## NewLeagueN    85.9137   106.3435   0.808  0.42033   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 315.4 on 164 degrees of freedom
## Multiple R-squared:  0.6234, Adjusted R-squared:  0.5798 
## F-statistic: 14.29 on 19 and 164 DF,  p-value: < 2.2e-16

Q3.2: Stepwise AIC Selection

Perform stepwise selection using step() with direction = “both” or “backward”

Save the final model as ‘lm_step_aic’

null_model <- lm(Salary ~ 1, data = Hitters_Train)
lm_step_aic <- step(full_model,
                    scope = list(lower = null_model, upper = full_model),
                    direction = "both", trace = F)
summary(lm_step_aic)

## 
## Call:
## lm(formula = Salary ~ AtBat + Hits + Walks + CAtBat + CRuns + 
##     CRBI + CWalks + Division + PutOuts + Assists + Errors + NewLeague, 
##     data = Hitters_Train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -797.0 -171.8  -10.1  127.8 1765.4 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  172.19676   83.25669   2.068 0.040121 *  
## AtBat         -2.00031    0.67324  -2.971 0.003394 ** 
## Hits           5.93052    2.12649   2.789 0.005889 ** 
## Walks          6.42584    2.01994   3.181 0.001742 ** 
## CAtBat        -0.25730    0.06738  -3.819 0.000187 ***
## CRuns          2.30925    0.51824   4.456 1.51e-05 ***
## CRBI           1.05082    0.29166   3.603 0.000412 ***
## CWalks        -1.01174    0.30628  -3.303 0.001164 ** 
## DivisionW   -120.14999   47.93791  -2.506 0.013132 *  
## PutOuts        0.22148    0.09084   2.438 0.015790 *  
## Assists        0.67724    0.26680   2.538 0.012030 *  
## Errors        -9.26152    5.59590  -1.655 0.099747 .  
## NewLeagueN   106.50038   48.78764   2.183 0.030403 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 311.5 on 171 degrees of freedom
## Multiple R-squared:  0.6169, Adjusted R-squared:   0.59 
## F-statistic: 22.95 on 12 and 171 DF,  p-value: < 2.2e-16

length(coef(lm_step_aic)) - 1

## [1] 12

Question: How many predictors were left in the final AIC model?

Comment: There are 12 predictors left.

———————————————————

Task 4: LASSO Regression

———————————————————

Q4.1: Matrix Preparation

Create the input matrix ‘x’ (predictors) and vector ‘y’ (response) from ‘train_data’ Hint: use model.matrix() for x to handle categorical variables automatically

x <- model.matrix(Salary~., data = Hitters_Train)[, -1]
y <- Hitters_Train$Salary

Q4.2: Cross-Validation for Lambda

Use cv.glmnet() to find optimal lambda

Hitters_lasso <- cv.glmnet(x, y, alpha = 1)
Hitters_lasso$lambda.min

## [1] 3.216343

Report the value of lambda.min

## [1] 3.216343

Q4.3: Final LASSO Model

Fit the final LASSO model using lambda.min

Display the coefficients using coef()

Hitters_lasso_final <- glmnet(x, y, alpha = 1, lambda = Hitters_lasso$lambda.min)
coef(Hitters_lasso_final)

## 20 x 1 sparse Matrix of class "dgCMatrix"
##                       s0
## (Intercept)  207.8575346
## AtBat         -1.5163170
## Hits           4.4833440
## HmRun         -3.8302523
## Runs           .        
## RBI            1.2041422
## Walks          5.9093856
## Years        -26.6539318
## CAtBat         .        
## CHits          .        
## CHmRun         1.6096210
## CRuns          1.1849423
## CRBI           0.1229401
## CWalks        -0.6825704
## LeagueN       10.5114166
## DivisionW   -131.1153295
## PutOuts        0.1882778
## Assists        0.3445808
## Errors        -6.5273150
## NewLeagueN    71.3468791

Question: Which variables were shrunk exactly to zero?

Comment: The runs, CatBar and Chits variables have been shrunk to zero meaning that they do not meaningfully improve ours models prediction abilities.

Hitters_lasso_min <- as.matrix(coef(Hitters_lasso, s = "lambda.min"))
rownames(Hitters_lasso_min)[Hitters_lasso_min == 0]

## [1] "Runs"   "CAtBat" "CHits"

———————————————————

Task 5: Model Comparison & Conclusion

———————————————————

Q5.1: Prediction & MSE

’ 1. Make predictions on ‘test_data’ using ‘lm_step_aic’

Hitters_pred_aic <- predict(lm_step_aic, newdata = Hitters_Test)

Make predictions on ‘test_data’ using the LASSO model

Hint: newx argument in predict() for LASSO needs a matrix

x_train <- model.matrix(Salary ~ ., data = Hitters_Train)[, -1]
y_train <- Hitters_Train$Salary

Hitters_lasso <- cv.glmnet(x_train, y_train, alpha = 1)

x_test <- model.matrix(Salary ~ ., data = Hitters_Test)[, -1]

pred_lasso <- predict(Hitters_lasso,
                      s = "lambda.min",
                      newx = x_test)

Calculate Out-of-Sample MSE for both models

MSE = mean((Actual - Predicted)^2)

mse_aic <- mean((Hitters_Test$Salary - Hitters_pred_aic)^2)
mse_aic

## [1] 124511.4

mse_lasso <- mean((Hitters_Test$Salary - pred_lasso)^2)
mse_lasso

## [1] 124910.6

Conclusion: Based on OOS MSE, which model performed better?

Comment: While close, we can see that the MSE of the AIC model is ever so slightly smaller, indicating that the AIC model perfomed better than the lasso model.

Assignment 2

Joe Frye

2026-03-02

———————————————————

MSBA-680: Data Mining - Assignment 2

Student Name: Joe Frye

———————————————————

— SETUP —

1. Load necessary libraries

2. Load the dataset

3. Set seed for reproducibility (Crucial for this assignment!)

———————————————————

Task 1: Data Exploration & Preparation

———————————————————

Q1.1: Dimensions

Q1.2: Missing Values

Q1.3: Distribution

———————————————————

Task 2: Data Splitting

———————————————————

Q2.1: Train/Test Split

———————————————————

Task 3: Linear Regression & Stepwise Selection

———————————————————

Q3.1: Full Model

Q3.2: Stepwise AIC Selection

———————————————————

Task 4: LASSO Regression

———————————————————

Q4.1: Matrix Preparation

Q4.2: Cross-Validation for Lambda

Q4.3: Final LASSO Model

———————————————————

Task 5: Model Comparison & Conclusion

———————————————————

Q5.1: Prediction & MSE

———————————————————

End of Assignment 2

———————————————————