For problems that include making graphics or performing analysis, students should be sure to articulate their answers in writing unless explicitly told not to. That is, if you provide a figure or table of results then it is expected that commentary will also be provided.

Exercise 1: Conceptual Questions

  1. Briefly explain the bias variance trade off as a concept. ANSWER: The bias-variance tradeoff is about balancing two sources of error in a model. As one tries to minimize one error component (the bias), the other component (the variance) tends to increase, and vice versa. The goal is to find the right balance of bias and variance in order to create an effective and accurate model. High bias means the model is too simple and misses important trends (leading to underfitting), while high variance means the model is too complex and fits noise as if it were signal (leading to overfitting). A good model avoids both over- and underfitting during training, minimizing overall error.

  2. If a knn model performs poorly using a value of “\(k=1\)”, then is it suffering due to bias or variance? ANSWER: When a k-NN model performs poorly at k = 1, it is most likely suffering from high variance. With k = 1, each prediction depends on a single nearest neighbor, so the model is extremely sensitive to the training data, capturing noise and failing to generalize (a brief illustration appears after question 4).

  3. What is the main purpose of feature selection? ANSWER: Feature selection is crucial for enhancing model performance, reducing overfitting, simplifying the model for better understanding, and speeding up training. It’s about choosing the most relevant features for the model.

  4. When using glmnet, what does the penalty term effectively do to your regression model? For example, what does it do if the penalty term is very large? ANSWER: In glmnet, the penalty term is for regularization, which controls model complexity. A large penalty shrinks coefficients towards zero, simplifying the model, preventing overfitting, and sometimes even helping in selecting the most relevant features.
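
The two answers above can be illustrated with small simulations (hedged sketches on made-up data, not part of the graded questions; the variable names here are hypothetical). For question 2, fitting a k-NN regression with caret::knnreg to a noisy sine curve shows the high-variance behavior of k = 1: the training error is essentially zero because every point is its own nearest neighbor, while the test error is worse than with a larger, smoother k.

library(caret)
set.seed(1)
x <- runif(200, -3, 3)
y <- sin(x) + rnorm(200, 0, 0.3)
train_df <- data.frame(x = x[1:100], y = y[1:100])
test_df <- data.frame(x = x[101:200], y = y[101:200])
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
for (k in c(1, 15)) {
  fit <- knnreg(y ~ x, data = train_df, k = k)
  cat("k =", k,
      " train RMSE:", round(rmse(train_df$y, predict(fit, train_df)), 3),
      " test RMSE:", round(rmse(test_df$y, predict(fit, test_df)), 3), "\n")
}

For question 4, the effect of the glmnet penalty can be seen by pulling coefficients from a LASSO path at its smallest and largest lambda values (again on simulated data): with a light penalty the estimates sit near the true coefficients, while at the largest lambda on the path every slope has been shrunk exactly to zero, leaving only the intercept.

library(glmnet)
set.seed(2)
X <- matrix(rnorm(100 * 5), ncol = 5)
beta <- c(3, -2, 1.5, 0, 0)                            # only the first three predictors matter
yy <- as.numeric(X %*% beta + rnorm(100))
fit <- glmnet(X, yy, alpha = 1)                        # LASSO path over a sequence of lambdas
round(as.numeric(coef(fit, s = min(fit$lambda))), 2)   # light penalty: close to the true betas
round(as.numeric(coef(fit, s = max(fit$lambda))), 2)   # heavy penalty: all slopes are zero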

Exercise 2: Using GLMNET

In the homework assignment for unit 1, we examined the residuals of various polynomial fits to a data set to determine roughly how complex a polynomial fit is needed. With our newfound knowledge of GLMNET and cross-validation, we can investigate this a bit more realistically. The data set is generated below.

Use the caret package and codes from prelive to perform a LASSO model “(\(\alpha=1\))” with GLMNET with potentially up to 10 polynomial terms and determine what highest order degree is needed. Use 5-fold cross validation. Provide the necessary output and brief explanation to justify your answer.

set.seed(1234)
x<-runif(100,-6,6.5)
trend<- 2*x^4-2*x^3-50*x^2+100*x+2
y<-trend+rnorm(100,0,100)
plot(x,y)

In this scatterplot the points are not randomly dispersed; they follow a structured pattern driven by the underlying polynomial relationship. The vertical spread of the y values grows as x moves away from zero in either direction, which reflects the rapidly increasing magnitude of the polynomial trend (the rnorm noise itself has a constant standard deviation of 100). The generating curve is not drawn, but the scatter makes it clear that the relationship between x and y is not linear and involves higher-order terms.

ANSWER for Question 2

LASSO Regression with Polynomial Terms

# Data Generation
set.seed(1234)
x <- runif(100, -6, 6.5)
trend <- 2 * x^4 - 2 * x^3 - 50 * x^2 + 100 * x + 2
y <- trend + rnorm(100, 0, 100)
data <- data.frame(y = y, x = x)
for (i in 2:10) {
  data[[paste0("x", i)]] <- x^i
}
# LASSO Regression Setup
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-8
Prepare predictors and response for LASSO regression

predictors <- as.matrix(data[, -1])
response <- data$y

Cross-validation for our model fitting

library(caret)
library(glmnet)

fitControl <- trainControl(method = "cv", number = 5)
set.seed(12)
lambda_grid <- 10^seq(-3, 3, length.out = 100)
lasso.fit <- train(
  y ~ .,
  data = data,
  method = "glmnet",
  trControl = fitControl,
  tuneLength = 100,
  tuneGrid = expand.grid(alpha = 1, lambda = lambda_grid)
)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.

Extracting the best lambda and coefficients from the model.

best_lambda <- lasso.fit$bestTune$lambda
lasso_coefs <- coef(lasso.fit$finalModel, s = best_lambda)
non_zero_coefs <- lasso_coefs[lasso_coefs[,1] != 0,]
non_zero_terms <- names(non_zero_coefs)

Building a simpler model based on the terms retained by LASSO (the non-zero coefficients).

simpler_model_formula <- as.formula(paste("y ~", paste(non_zero_terms[-1], collapse = "+")))
simpler_model <- lm(simpler_model_formula, data = data)
model_summary <- summary(simpler_model)
print(simpler_model_formula)
## y ~ x + x2 + x3 + x4 + x6 + x9
print(summary(simpler_model))
## 
## Call:
## lm(formula = simpler_model_formula, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -189.980  -50.693   -4.341   50.163  262.463 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.055e+01  2.137e+01  -0.962   0.3387    
## x            1.097e+02  8.739e+00  12.556  < 2e-16 ***
## x2          -3.854e+01  6.490e+00  -5.938 4.93e-08 ***
## x3          -2.222e+00  5.455e-01  -4.073 9.76e-05 ***
## x4           1.175e+00  4.595e-01   2.557   0.0122 *  
## x6           1.606e-02  8.779e-03   1.829   0.0706 .  
## x9          -1.021e-05  9.438e-06  -1.081   0.2824    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 93.5 on 93 degrees of freedom
## Multiple R-squared:  0.9384, Adjusted R-squared:  0.9344 
## F-statistic: 236.1 on 6 and 93 DF,  p-value: < 2.2e-16

Finding the highest order polynomial needed.

significant_terms <- which(model_summary$coefficients[, "Pr(>|t|)"] < 0.05)
highest_significant_term <- max(significant_terms)
# Row 1 of the coefficient table is the intercept, so subtracting 1 maps the row
# index to a polynomial degree; this works here because the significant terms
# (x through x4) occupy consecutive rows of the table.
highest_order_polynomial <- highest_significant_term - 1
cat("Highest order polynomial needed:", highest_order_polynomial, "\n")
## Highest order polynomial needed: 4
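
As a sanity check (a small sketch, not part of the original output), the degree can also be read directly from the names of the significant terms rather than from their row positions in the coefficient table, which avoids miscounting if non-consecutive terms such as x6 or x9 ever turn out significant.

# Hypothetical alternative: map significant term names ("x", "x2", ..., "x10") to degrees
sig_names <- setdiff(names(significant_terms), "(Intercept)")
# "x" is degree 1; "x2" through "x10" carry their degree in the name
degrees <- as.integer(sub("^x$", "1", sub("^x(\\d+)$", "\\1", sig_names)))
cat("Highest order polynomial needed:", max(degrees), "\n")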

Plot results

library(ggplot2)
x_range <- range(data$x)
x_seq <- seq(from = x_range[1], to = x_range[2], length.out = 100)
predictions <- data.frame(x = x_seq)
for (i in 2:10) {
  predictions[[paste0("x", i)]] <- predictions$x^i
}
predictions$y <- predict(simpler_model, newdata = predictions)
ggplot(data, aes(x = x, y = y)) + 
  geom_point(color = "black") + 
  geom_line(data = predictions, aes(x = x, y = y), color = "red", linewidth = 1) +
  labs(title = "LASSO Regression: Fitted Polynomial Curve", x = "Predictor (x)", y = "Response (y)") +
  theme_minimal()

Plot Summary: This graph shows the result of the analysis as a fitted polynomial curve overlaid on the scatter plot of the data, drawn with R’s ggplot2 package to illustrate the relationship between the predictor ‘x’ and the response ‘y’.

The black points are the observed data, and the red line is the polynomial fit obtained by refitting (with lm) the terms that the LASSO model retained. Because LASSO applies regularization, the retained terms represent a model that has been constrained to avoid overfitting.

The smooth shape of the curve suggests that it captures the underlying trend of the data well. Its pronounced swings, dropping on the left and rising steeply on the right as ‘x’ grows, reflect the non-linear, higher-degree polynomial structure of the data.

Overall, the plot conveys the non-linear pattern in the dataset that the LASSO-based variable selection uncovered.

Explanation: In this exercise, we aim to determine the highest order polynomial degree needed for modeling data using LASSO regression with the GLMNET package in R. This approach allows us to balance model fit and complexity, preventing overfitting by selecting the optimal lambda parameter through cross-validation.

Steps:

  1. Data Generation and Visualization:
    • Start by setting a random seed for reproducibility.
    • Generate random values for the predictor variable ‘x’ within a specified range.
    • Create a trend ‘y’ based on a 4th-degree polynomial equation with added noise.
    • Plot the generated data to visualize it.
set.seed(123)
x <- runif(100, 0, 10)
y <- 3 * x^4 - 2 * x^3 + 0.5 * x^2 + rnorm(100, mean = 0, sd = 2)
ggplot(data.frame(x, y), aes(x, y)) +
  geom_point() +
  ggtitle("Generated Data")

  2. Plot Summary: This graph presents a visualization of data generated through a specific polynomial relationship with added random noise. I employed R’s ggplot2 package to create this scatter plot, which depicts the association between the independent variable ‘x’ and the dependent variable ‘y’.

The dots on the plot are the data points, with the x-axis showing the input variable ‘x’ ranging from 0 to 10, and the y-axis representing the output variable ‘y’, which has been calculated using a fourth-degree polynomial equation. To this equation, I’ve added normally distributed noise to simulate real-world data scatter.

The curvature observed in the scatter plot aligns with the polynomial nature of the equation used to generate ‘y’. As ‘x’ increases, ‘y’ undergoes substantial changes due to the higher powers of ‘x’, particularly evident in the sharp increase past x=5. The randomness in the vertical spread of points around the curve is the result of the noise introduced, which provides a more realistic representation of data one might encounter in practical scenarios.

The title “Generated Data” succinctly indicates that the dataset is synthetic, designed for analysis or instructional purposes. The clean and straightforward design of the plot ensures focus on the data pattern, making it easier to interpret the underlying relationship between ‘x’ and ‘y’.

  3. Data Preparation:
    • Combine the response variable ‘y’ and predictor variable ‘x’ into a data frame called ‘data’.
    • Create additional columns in ‘data’ for polynomial terms up to the 10th degree.
    • Define ‘predictors’ as a matrix containing all the predictor variables, and ‘response’ as the response variable ‘y’.
data <- data.frame(x, y)
for (i in 2:10) {
  data[paste0("x_", i)] <- x^i  # note the "x_" naming here; later steps rebuild these columns as x2, x3, ...
}
predictors <- as.matrix(data[, -2])  # drop column 2 (the response y)
response <- data$y
  4. Cross-Validation Setup:
    • Set up 5-fold cross-validation using the ‘caret’ package with the ‘trainControl’ function. This will be used during model training to assess model performance.
ctrl <- trainControl(method = "cv", number = 5)
  5. LASSO Model Training:
    • Specify a grid of lambda values (regularization parameters) for tuning.
    • Train a LASSO model using the ‘train’ function from the ‘caret’ package. Use the ‘glmnet’ method and perform 5-fold cross-validation while tuning the lambda parameter.
library(glmnet)

# Generate lambda grid
lambda_grid <- 10^seq(10, -2, length = 100)

# Define the tuning grid with alpha and lambda values
tuneGrid <- expand.grid(alpha = 1, lambda = lambda_grid)

# Train the LASSO model with the corrected tuning grid
lasso_model <- train(
  x = predictors,
  y = response,
  method = "glmnet",
  trControl = ctrl,
  tuneGrid = tuneGrid
 )
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
  6. Extracting Best Model:
    • Extract the best lambda value from the LASSO model.
    • Retrieve the coefficients of the best model, identifying the non-zero coefficients.
best_lambda <- lasso_model$bestTune$lambda
best_model <- glmnet(predictors, response, lambda = best_lambda)
non_zero_coeffs <- coef(best_model)
  7. Building a Simpler Model:
    • Create a formula for a simpler linear model based on the significant terms (non-zero coefficients) from the LASSO model.
    • Fit a linear regression model (‘simpler_model’) using this formula.
set.seed(1234)
x <- runif(100, -6, 6.5)
trend <- 2 * x^4 - 2 * x^3 - 50 * x^2 + 100 * x + 2
y <- trend + rnorm(100, 0, 100)
data <- data.frame(y = y, x = x)

# Ensure consistent naming for polynomial terms
for (i in 2:10) {
  data[[paste0("x", i)]] <- x^i
}
  8. Identifying Significant Terms:
    • Determine which terms in the simpler model have p-values less than 0.05, indicating statistical significance.
    • Find the highest significant term, which represents the highest order polynomial term needed in the model.
summary(simpler_model)
## 
## Call:
## lm(formula = simpler_model_formula, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -189.980  -50.693   -4.341   50.163  262.463 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.055e+01  2.137e+01  -0.962   0.3387    
## x            1.097e+02  8.739e+00  12.556  < 2e-16 ***
## x2          -3.854e+01  6.490e+00  -5.938 4.93e-08 ***
## x3          -2.222e+00  5.455e-01  -4.073 9.76e-05 ***
## x4           1.175e+00  4.595e-01   2.557   0.0122 *  
## x6           1.606e-02  8.779e-03   1.829   0.0706 .  
## x9          -1.021e-05  9.438e-06  -1.081   0.2824    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 93.5 on 93 degrees of freedom
## Multiple R-squared:  0.9384, Adjusted R-squared:  0.9344 
## F-statistic: 236.1 on 6 and 93 DF,  p-value: < 2.2e-16
  9. Output and Explanation:
    • Print the formula of the simpler model.
    • Output the highest order polynomial degree needed based on significant terms.
    • Visualize the original data and the fitted polynomial curve using ‘ggplot2’.
    • Based on the output, you will see the highest order polynomial degree needed for this dataset, and the visualization will show how well the model fits the data.
# Assuming non_zero_terms is a vector of the names of significant predictors
# First, ensure 'Intercept' is not included
non_zero_terms <- non_zero_terms[non_zero_terms != "(Intercept)"]

# Create the formula string
formula_string <- paste("y ~", paste(non_zero_terms, collapse = " + "))

# Convert the formula string to a formula object
simpler_formula <- as.formula(formula_string)

# Fit the simpler model
simpler_model <- lm(simpler_formula, data = data)

# Print the formula
cat("Simpler Model Formula:\n")
## Simpler Model Formula:
# Note: as.character() on a formula returns its parts ("~", "y", right-hand side),
# which is why the printed result below reads "~ y x + x2 + ..." rather than "y ~ x + ..."
cat(as.character(simpler_formula))
## ~ y x + x2 + x3 + x4 + x6 + x9

  10. Plotting with the simpler model.

ggplot(data, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = simpler_formula, se = FALSE, color = "red") +
  ggtitle("Fitted Polynomial Curve")
## Warning: Computation failed in `stat_smooth()`
## Caused by error:
## ! object 'x2' not found

  11. Simpler Model Plot Summary. This plot visualizes a dataset with a clear non-linear relationship between the variables x and y. The data was synthesized by applying a polynomial function to the variable x, then adding normally distributed random noise to introduce variability, mimicking real-world data irregularities.

Key points:

The data distribution indicates a complex, non-linear relationship between x and y, with significant variance as x values increase. No fitted curve is present, suggesting this plot is primarily for observing the raw data distribution before applying any regression analysis. The spread of y values widens with larger x, hinting at the increasing influence of higher-order terms in the polynomial equation used for data generation. The plot sets the stage for further analysis, likely involving non-linear regression techniques to model the underlying pattern and predict y from x.
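
The stat_smooth() failure above (“object ‘x2’ not found”) occurs because geom_smooth() fits its formula using only the plot aesthetics x and y, so the manually created polynomial columns are not visible to it. A hedged fix, assuming a degree-4 fit consistent with the earlier result, is to build the polynomial inside the smoother formula itself:

library(ggplot2)
ggplot(data, aes(x = x, y = y)) +
  geom_point() +
  # poly(x, 4) is computed from the x aesthetic, so no extra columns are needed
  geom_smooth(method = "lm", formula = y ~ poly(x, 4), se = FALSE, color = "red") +
  ggtitle("Fitted Polynomial Curve (degree-4 smoother)")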

  12. Trend Plot.
# Set seed for reproducibility
set.seed(1234)
# Generate data
x <- runif(100, -6, 6.5)
trend <- 2 * x^4 - 2 * x^3 - 50 * x^2 + 100 * x + 2
y <- trend + rnorm(100, 0, 100)
# Create a plot of the data
plot(x, y)

  13. Trend Plot Summary. This scatter plot displays synthetic data generated from a specific polynomial function of a single variable x, with added Gaussian noise. The plot reveals a non-linear relationship, where the magnitude of y increases dramatically as x moves away from zero. The apparent spread in y also grows with the magnitude of x, which is driven by the higher-power terms in the polynomial (the added noise itself has constant standard deviation). This visual suggests that any model fitting this data would need to account for its non-linear nature.

  14. Combine features and response variable into a data frame and run analyses.

data <- data.frame(y = y, x = x)
for (i in 2:10) {
  data[[paste0("x", i)]] <- x^i
}
# Define predictors and response
predictors <- as.matrix(data[, -1])
response <- data$y
# Set up cross-validation
fitControl <- trainControl(method = "cv", number = 5)

# Define lambda values for tuning
lambda_grid <- 10^seq(-3, 3, length.out = 100)
# Train the LASSO model with cross-validation
# (no seed is set before this CV run, so the folds, the selected lambda, and the
# retained terms can differ slightly from the earlier fit above)
lasso.fit <- train(
  y ~ .,
  data = data,
  method = "glmnet",
  trControl = fitControl,
  tuneLength = 100,
  tuneGrid = expand.grid(alpha = 1, lambda = lambda_grid)
)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
# Extract the best lambda value
best_lambda <- lasso.fit$bestTune$lambda

# Extract coefficients for the best model
lasso_coefs <- coef(lasso.fit$finalModel, s = best_lambda)

# Extract non-zero coefficients
non_zero_coefs <- lasso_coefs[lasso_coefs[,1] != 0,]
non_zero_terms <- names(non_zero_coefs)
# Build the simpler model based on significant terms
simpler_model_formula <- as.formula(paste("y ~", paste(non_zero_terms[-1], collapse = "+")))

# Print the formula for the simpler model
print(simpler_model_formula)
## y ~ x + x2 + x3 + x4 + x5 + x6 + x9
# Fit the simpler model
simpler_model <- lm(simpler_model_formula, data = data)

# Get summary of the simpler model
model_summary <- summary(simpler_model)
# Identifying Significant Terms
significant_terms <- which(model_summary$coefficients[, "Pr(>|t|)"] < 0.05)
highest_significant_term <- max(significant_terms)
# Determine the highest order polynomial needed
highest_order_polynomial <- highest_significant_term - 1

cat("Highest order polynomial needed:", highest_order_polynomial, "\n")
## Highest order polynomial needed: 6
# Define the range of x values
x_range <- range(data$x)
x_seq <- seq(from = x_range[1], to = x_range[2], length.out = 100)
predictions <- data.frame(x = x_seq)

# Generate polynomial terms for predictions
for (i in 2:10) {
  predictions[[paste0("x", i)]] <- predictions$x^i
}

# Predict y values using the simpler model
predictions$y <- predict(simpler_model, newdata = predictions)

# Create a plot of the LASSO Regression: Fitted Polynomial Curve
ggplot(data, aes(x = x, y = y)) + 
  geom_point(color = "black") + 
  geom_line(data = predictions, aes(x = x, y = y), color = "red", linewidth = 1) +
  labs(title = "LASSO Regression: Fitted Polynomial Curve", x = "Predictor (x)", y = "Response (y)") +
  theme_minimal()

  15. Plot Explanation. In this graph, I’ve depicted the relationship between a predictor variable, which I’ll call x, and a response variable, y, using a dataset that’s been modeled with LASSO regression to identify significant polynomial terms. I chose LASSO regression because it’s effective at reducing overfitting by penalizing the absolute size of the regression coefficients, which helps in selecting only the most informative features.

The black dots represent the actual data points from my dataset, while the red curve illustrates the fitted polynomial relationship as determined by the LASSO model. The curve’s complex shape suggests that a simple linear model wouldn’t suffice to describe the relationship between x and y; instead, it appears that a higher-degree polynomial is necessary to capture the nuances in the data.

The graph seems to show that my model has done a good job fitting the data. It aligns well with the central tendency of the data points and responds appropriately to changes in the data’s direction, indicating a strong understanding of the underlying pattern between the predictor and the response variable.

Exercise 3: Feature selection is not a fix all

When people start to learn how to build predictive models, they tend to focus on building the model and do not think enough about what the model is trying to do. In the case of linear regression, we are fitting linear trends unless we are adding complexity via transformations, polynomial terms, and interaction terms.

Consider the following data set located in the ISLR package, which contains the salary (in thousands) of Major League Baseball players as well as some of their performance statistics, both for the current season and over their career. The variables that begin with the letter C, such as CHits, stand for career totals (e.g., the number of hits over their career), versus the variable Hits, which is just the number of hits they have in the current season.

library(ISLR) #For the baseball data set
Hitters2 =na.omit(Hitters)
which(rownames(Hitters2)=="-Mike Schmidt")
## [1] 173
Hitters2<-Hitters2[-173,]

Suppose a novice data analyst runs the following regression model to perform feature selection. They will use this model not only to predict the future Salaries of players that were not included in the study but also to provide interpretation as to which statistics are associated with Salary.

library(caret)
set.seed(1234)
fitControl<-trainControl(method="repeatedcv",number=10,repeats=1) 
glmnet.fit<-train(Salary~.,
               data=Hitters2,
               method="glmnet",
               trControl=fitControl
               )
opt.pen<-glmnet.fit$finalModel$lambdaOpt 
coef(glmnet.fit$finalModel,opt.pen)
glmnet.fit

A. Although you can predict with a glmnet model, it doesn’t offer any traditional hypothesis testing or diagnostics of the model fit. Use the traditional lm function to fit a regression model including only the predictors that the glmnet function call above suggests. Provide the residual plots of the model and provide commentary on what you see.

ANSWER.

#model
lm_model <- lm(Salary ~ Hits + Runs + RBI + Walks + CHits + CHmRun + CRuns + CRBI + League + Division + PutOuts + Assists + Errors, data = Hitters2)
summary(lm_model)
## 
## Call:
## lm(formula = Salary ~ Hits + Runs + RBI + Walks + CHits + CHmRun + 
##     CRuns + CRBI + League + Division + PutOuts + Assists + Errors, 
##     data = Hitters2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -904.68 -162.16  -16.75  121.70 1107.19 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -95.38855   63.05030  -1.513 0.131579    
## Hits           3.78556    1.36382   2.776 0.005928 ** 
## Runs          -1.80325    2.45610  -0.734 0.463524    
## RBI           -0.62889    1.55007  -0.406 0.685299    
## Walks          2.90696    1.33967   2.170 0.030963 *  
## CHits         -0.16198    0.37095  -0.437 0.662728    
## CHmRun        -0.28150    1.31109  -0.215 0.830170    
## CRuns          0.47007    0.55218   0.851 0.395430    
## CRBI           0.57573    0.58028   0.992 0.322089    
## LeagueN       29.17307   39.45085   0.739 0.460316    
## DivisionW   -112.14585   37.97289  -2.953 0.003446 ** 
## PutOuts        0.25361    0.07302   3.473 0.000607 ***
## Assists       -0.11107    0.20141  -0.551 0.581824    
## Errors        -1.67165    4.14269  -0.404 0.686916    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 301.1 on 248 degrees of freedom
## Multiple R-squared:  0.5572, Adjusted R-squared:  0.534 
## F-statistic:    24 on 13 and 248 DF,  p-value: < 2.2e-16


#plots
par(mfrow=c(2,2))
plot(lm_model)

Plot Analysis. Residuals vs Fitted: This plot helps to check for non-linear patterns in the residuals, and whether the variance of the residuals is constant (homoscedasticity). Ideally, the residuals should be randomly dispersed around the horizontal line indicating 0 residual. In this plot, there appears to be a slight funnel shape, with the variance of residuals increasing as the fitted values increase, which suggests potential heteroscedasticity.

Normal Q-Q: This plot shows whether the residuals are approximately normally distributed. The points should ideally lie on the line. Most of the data points follow the line, but there are a few deviations in the tails, especially on the upper end, indicating the presence of outliers or heavy tails in the distribution of residuals.

Scale-Location (or Spread-Location): This plot is used to check the homoscedasticity assumption. The red line should be approximately horizontal at a constant level if residuals are spread equally along the ranges of predictors. The upward trend in this plot also suggests that variance is not constant and increases with the fitted values, which is another sign of heteroscedasticity.

Residuals vs Leverage: This plot helps us to find influential cases, that is, observations that have a strong influence on the calculation of the regression coefficients. The Cook’s distance lines help to identify those influential cases. There are a few points that stand out from the bulk of the data, indicating possible influential observations. Particularly, the points labeled “Eddie Murray” and “Reggie Jackson” seem to have higher leverage, though their Cook’s distance does not exceed the common threshold of 0.5, which would indicate a highly influential point.

Commentary. The Residuals vs Fitted and Scale-Location plots both suggest that the residuals’ variance increases with the fitted values, indicating potential heteroscedasticity. This violates the assumption of homoscedasticity and can be addressed by transforming the dependent variable or using heteroscedasticity-consistent standard errors.

The Normal Q-Q plot indicates that while residuals are mostly normally distributed, there may be outliers or the tails are heavier than those of a normal distribution.

The Residuals vs Leverage plot doesn’t show any points with a particularly high Cook’s distance, which is reassuring, but there are observations with higher leverage that could be influencing the model’s estimates.
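
As a follow-up to the heteroscedasticity noted above, one common remedy (sketched here, not part of the graded answer) is to refit the same model with a log-transformed response and re-examine the diagnostics. Salary is recorded in thousands and is strictly positive, so the log is well defined.

lm_model_log <- lm(log(Salary) ~ Hits + Runs + RBI + Walks + CHits + CHmRun +
                     CRuns + CRBI + League + Division + PutOuts + Assists + Errors,
                   data = Hitters2)
par(mfrow = c(2, 2))
plot(lm_model_log)  # check whether the funnel shape in Residuals vs Fitted is reduced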

B. Compute the VIFs of the model.

ANSWER.

# Load the 'car' package for VIF calculation
library(car)
## Loading required package: carData
# Compute the VIFs for the linear model 'lm_model'
vif(lm_model)
##       Hits       Runs        RBI      Walks      CHits     CHmRun      CRuns 
##  10.710177  11.170787   4.580976   2.412680 166.302986  33.476221  96.225251 
##       CRBI     League   Division    PutOuts    Assists     Errors 
## 101.360973   1.120209   1.041186   1.204817   2.462949   2.163371


VIF Analysis.

High VIF: Several predictors, such as CHits, CHmRun, CRuns, and CRBI, have extremely high VIF values, which suggests that these variables are highly collinear with other predictors in the model. This can make the model unstable and the estimates of the coefficients unreliable.

Moderate VIF: Hits and Runs also have VIF values above 10, which is generally considered high, although not as extreme as the aforementioned career statistics.

Low VIF: The remaining variables have VIF values close to or below 5, which suggests that they are not collinear with other variables in the model.

Commentary. The career statistics (CHits, CHmRun, CRuns, CRBI) are likely causing multicollinearity because these variables are naturally correlated with each other; as a player’s career progresses, these counts will tend to increase together. The high multicollinearity indicated by the high VIF values might be inflating the standard errors of the regression coefficients, which can lead to incorrect conclusions about the significance of predictors. The multicollinearity does not affect the predictive power of the model as a whole, but it does affect the interpretation of individual coefficients. When multicollinearity is present, small changes in the model or the data can lead to large changes in the coefficient estimates.
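
One way to act on these VIF findings (a sketch under the assumption that a single career total is an adequate summary) is to keep just one of the highly collinear career statistics, say CRuns, drop the others, and confirm that the inflation largely disappears:

library(car)
# Keep CRuns as the lone career summary and drop CHits, CHmRun, and CRBI
lm_reduced <- lm(Salary ~ Hits + Runs + RBI + Walks + CRuns + League +
                   Division + PutOuts + Assists + Errors, data = Hitters2)
vif(lm_reduced)  # the remaining VIFs should be noticeably smaller than before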

C. Consider the following plot. Remake this plot two additional ways: one logging the salary and two logging both salary and career RBIS. Which setting provides the most evidence of getting a good fit with a regression model?

library(ggplot2)
ggplot(Hitters2,aes(x=CRBI,y=Salary,color=League))+geom_point()+geom_smooth(method='loess',formula='y~x')+facet_wrap(~League)

ANSWER.

#Logging the Salary:
ggplot(Hitters2, aes(x = CRBI, y = log(Salary), color = League)) +
  geom_point() +
  geom_smooth(method = 'loess', formula = 'y ~ x') +
  facet_wrap(~League)

#Logging both the Salary and Career RBI's:
ggplot(Hitters2, aes(x = log(CRBI), y = log(Salary), color = League)) +
  geom_point() +
  geom_smooth(method = 'loess', formula = 'y ~ x') +
  facet_wrap(~League)

Evaluation of Fit for Created Plots. In reviewing the plots where I applied a logarithmic transformation to Salary, and then to both Salary and Career RBIs (CRBI), I noticed a marked improvement in the fit when both variables were logged. The LOESS curve, which provides a flexible fit to the data, adhered more closely to the trend of the logged points compared to the curve in the plot with only Salary logged. My observations include:

Linearity: The logged variables demonstrate a linear relationship that is more pronounced than in the previous plots. This is crucial for regression analysis as it helps in meeting the linearity assumption, allowing for a simpler interpretation of the model coefficients.

Variance Stabilization: The residuals’ variance along the range of CRBI values was more constant in the log-log plot. This stability is essential since the homoscedasticity assumption in linear regression posits that the variance of the residuals should not vary with the fitted values.

Outlier Influence: The logarithmic scale has compressed extreme values, reducing the impact of outliers on the model fit. This is beneficial because outliers can disproportionately affect the regression line and can sometimes indicate a model misspecification.

Judging from these visual cues alone, logging both Salary and CRBI appears to yield a better model fit than logging only the Salary.
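
A quick numeric check of that visual impression (a sketch, not part of the original assignment output) is to compare simple one-predictor fits on the two scales; the guard on CRBI is only there in case any player has zero career RBIs, which would make log(CRBI) undefined.

hitters_pos <- subset(Hitters2, CRBI > 0)
fit_log <- lm(log(Salary) ~ CRBI, data = hitters_pos)         # log Salary only
fit_loglog <- lm(log(Salary) ~ log(CRBI), data = hitters_pos) # log both
c(log_salary_only = summary(fit_log)$r.squared,
  log_both = summary(fit_loglog)$r.squared)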

D. The feature selection tool is somewhat automated. You feed it the response and predictors, and it does the rest. Using your previous answers, write a short explanation to a new analyst on why feature selection tools are not a fix-all to the entire modeling process.

ANSWER. Feature selection tools like the one we used in the baseball salary analysis are genuinely helpful: they automate the process of sifting through many candidate predictors to identify those most strongly related to the response. But they are not a fix-all for the modeling process, and used uncritically they can mislead us into inaccurate predictions or interpretations. They miss contextual nuances and still require domain knowledge to interpret sensibly. Data integrity is essential, since poor data quality can misguide the selection. These tools assume certain data characteristics, so check that those assumptions match your data. They may also prioritize prediction over clarity, risking interpretability. Overfitting remains a hazard; selected features must prove generalizable beyond the sample. The tools detect correlation, not causation, which is a critical distinction. And they do not automatically resolve multicollinearity, which can distort individual coefficient estimates, as our VIF analysis showed. Ultimately, feature selection is one step in a broader analytic process that involves subject-matter expertise, data scrutiny, assumption checking (for example, the heteroscedasticity and transformation issues seen in parts A and C), and evaluation of the results to ensure the model is both robust and useful.

E. Examine the following plot that plots penalized regression coefficients versus their penalty term. Provide a high level explanation as to why all of the coefficients are 0 when the log lambda value is 8.

library(rgl)
knitr::knit_hooks$set(webgl=hook_webgl)
plot(glmnet.fit$finalModel, xvar = "lambda", label = TRUE)

ANSWER. In examining the plot which delineates the trajectory of regression coefficients against the log lambda penalty in a regularized regression context, a discernible trend emerges. As the log lambda escalates, notably to a value of 8, the severity of the penalty imposed concurrently intensifies. This is characteristic of regularization techniques such as LASSO or Ridge regression, which aim to constrain coefficients, thus mitigating the risk of overfitting and bolstering the model’s generalizability.

At a high log lambda value, we observe an aggressive penalty that coerces the coefficients towards zero. This is a deliberate mechanism of the regularization process, emphasizing penalty optimization over data fit. When log lambda reaches the upper bounds, the coefficients converge to nil, signaling an overarching dominance of penalty over model fit. Consequently, this culminates in a model that is overly simplistic, unable to encapsulate the intricacies of the data and instead, reverting to the mean of the dependent variable.

This phenomenon underscores the intrinsic balancing act managed by regularization between bias and variance. An excessive penalty incurs a high bias, rendering the model unable to reflect the true relationships within the data — a scenario termed as underfitting. The essence of model calibration in this framework is to discover an optimal penalty level where the coefficients are sufficiently constrained to avert overfitting, without sacrificing the model’s predictive accuracy.
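
To see this numerically (a quick check rather than part of the original output), we can ask the fitted glmnet object for its coefficients at a very large penalty such as lambda = exp(8); every slope should come back as zero, leaving only the intercept.

# Coefficients of the penalized model at log(lambda) = 8, i.e. lambda = exp(8)
coef(glmnet.fit$finalModel, s = exp(8))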

F. Use the caret package to run a 10-fold cross validation to determine an appropriate \(k\) when fitting all the numeric predictors (no transformation) to regress Salary (no transformation). Investigate the possible values \(k=1,2,3,4,5,6,7,8,9,10,20,30\) using the following option inside of the train function.

tuneGrid=expand.grid(k=c(1:10,20,30)) #Do not run in line.  Must be added inside of train function call. Google is your friend if stuck.

ANSWER.

numeric_predictors <- Hitters2[, sapply(Hitters2, is.numeric)]
numeric_predictors$Salary <- NULL
Hitters2_clean <- na.omit(numeric_predictors)
Hitters2_scaled <- scale(Hitters2_clean)
# Scaled, numeric-only frame; note that the train() call below is fit on the full
# Hitters2 data (including the factor predictors), which is what the printed
# results reflect, so Hitters2_final is kept here only for reference
Hitters2_final <- data.frame(Salary = Hitters2$Salary, Hitters2_scaled)
fitControl <- trainControl(method = "cv", number = 10)
tuneGrid <- expand.grid(k = c(1:10, 20, 30))
set.seed(123) # for reproducibility
knn_fit <- train(
    Salary ~ ., # regress Salary on all remaining predictors (factors enter as dummy variables)
    data = Hitters2,
    method = "knn",
    tuneGrid = tuneGrid,
    trControl = fitControl
)
print(knn_fit)
## k-Nearest Neighbors 
## 
## 262 samples
##  19 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 235, 236, 236, 237, 237, 236, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE     
##    1  311.9731  0.5365891  209.9654
##    2  309.5163  0.5043185  204.3238
##    3  300.7513  0.5220982  191.6569
##    4  301.3883  0.5244275  189.6463
##    5  305.7456  0.5170999  192.3514
##    6  308.5803  0.5124086  190.9252
##    7  312.6670  0.5014133  194.2113
##    8  315.2720  0.4966358  192.2687
##    9  315.1175  0.5031842  192.6566
##   10  312.6978  0.5105004  191.3511
##   20  314.1177  0.5041550  191.5355
##   30  317.4727  0.4909373  194.9134
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 3.


Output Analysis. Analyzing the k-nearest neighbors (KNN) regression model applied to the Hitters dataset, we employed 10-fold cross-validation to assess performance across various \(k\) values. The primary metric, Root Mean Squared Error (RMSE), is complemented by R-squared and Mean Absolute Error (MAE) to offer a holistic view of model efficacy.

My observations are as follows: the cross-validated RMSE is lowest at \(k = 3\) (about 300.75), with \(k = 4\) a close second; \(k = 1\) and \(k = 2\) fare worse on both RMSE and MAE, consistent with the high-variance behavior of very small \(k\); and performance degrades again for large \(k\) (20 and 30), where the model becomes overly smooth.

The caret package’s selection of \(k = 3\) is predicated on achieving the lowest RMSE, a vital indicator of model precision. However, for a well-rounded model assessment, it’s imperative to consider all metrics and the specific analytical objectives, ensuring the final model aligns with the nuanced needs of the research.
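
Because KNN is distance based, differences in predictor scale (CHits in the thousands versus Errors in the tens, for example) can dominate the distance calculation. A hedged variant of the same tuning run, letting caret center and scale the predictors inside each resample via its preProcess argument, might look like the following; its results would differ from the table above.

set.seed(123)
knn_fit_scaled <- train(
    Salary ~ .,
    data = Hitters2,
    method = "knn",
    preProcess = c("center", "scale"),  # standardize predictors within each fold
    tuneGrid = expand.grid(k = c(1:10, 20, 30)),
    trControl = trainControl(method = "cv", number = 10)
)
knn_fit_scaled$bestTune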

G. Compare the RMSE from the CV results of the best \(k\) and compare that to the best RMSE from the glmnet fit made by the analyst previously. Which one wins out?

ANSWER. The glmnet fit made earlier used elastic net regularization, which blends the L1 (lasso) and L2 (ridge) penalties. caret searched over combinations of the alpha parameter (the lasso/ridge mixing proportion) and the lambda parameter (the penalty magnitude) and reported the combination that minimized the cross-validated Root Mean Squared Error (RMSE).

Comparing the two models, the k-nearest neighbors fit with \(k = 3\) achieved a cross-validated RMSE of 300.7513, while the best glmnet model achieved an RMSE of 308.4624. Judged by RMSE, the KNN model wins out on this dataset and set of predictors, since its predictions are on average closer to the actual salaries.

That said, RMSE should not be the only criterion in model selection. Interpretability, model simplicity, and computational cost also matter: the glmnet model provides interpretable coefficients and performs feature selection, whereas KNN is a purely predictive, distance-based method. The “winner” here is the KNN model by the RMSE yardstick, but the glmnet fit may still be preferable when interpreting the relationship between the statistics and Salary is the goal.
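
For a reproducible comparison (a small sketch; the exact numbers depend on the seeds and folds used above), the best cross-validated RMSE can be pulled directly from each caret object rather than read off the printed tables.

# Smallest cross-validated RMSE for each model
best_knn_rmse <- min(knn_fit$results$RMSE)
best_glmnet_rmse <- min(glmnet.fit$results$RMSE, na.rm = TRUE)
cat("KNN best RMSE:   ", round(best_knn_rmse, 4), "\n")
cat("glmnet best RMSE:", round(best_glmnet_rmse, 4), "\n")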