Task

Analyse the Credit and Hitters data sets as done in the Introduction to Statistical Learning using R and compare the following models:

Multiple linear regression.

Use caret for model development and evaluation.

Introduction

This task investigates the performance of several linear regression methods and variable selection approaches on the Hitters and Credit datasets that ship with the ISLR/ISLR2 packages in R.

The bias and variance, RMSE and R-squared of the models will be used as the yardsticks for judging which model, and hence which method, performs best.

The methods to be used are listed and motivated below.

Methods

The methods used in the analysis of the two datasets are: forward selection, backward selection, stepwise (sequential replacement) selection, ridge regression, the lasso, and the elastic net.

Rationale for methods

Forward regression

  • A variable selection method that, unlike backward selection, remains applicable even when the number of predictors (p) exceeds the number of observations (n), because the full model never has to be fit.

  • Begins with an empty model and iteratively adds the most significant variable until no more significant variables can be added.

  • Useful for reducing the risk of overfitting and can handle a large number of predictors.

Backward regression

  • Requires the number of observations (n) to exceed the number of predictors (p), since the procedure starts from the full model, which must be fit first.
  • Starts with a full model containing all predictors and iteratively removes the least significant variable until no more variables can be removed.
  • Helpful for reducing the number of predictors in situations where overfitting is a concern and the sample size is limited.

Stepwise Regression

  • Combines the features of both forward and backward regression.
  • Begins with no predictors and adds the most significant variable at each step, then removes the least significant variable if it fails a certain criterion.
  • Offers a compromise between the forward and backward approaches; caret’s “leapSeq” method used later implements this as sequential replacement. A minimal illustration of all three subset-selection strategies, using the leaps package, is sketched below.
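As a minimal sketch of the three strategies (using the leaps package, which caret wraps via leapForward, leapBackward and leapSeq, and the built-in mtcars data purely as a stand-in dataset):

# Minimal illustration of the three subset-selection strategies with leaps;
# mtcars is used only as a stand-in dataset
library(leaps)

fit_fwd <- regsubsets(mpg ~ ., data = mtcars, nvmax = 5, method = "forward")
fit_bwd <- regsubsets(mpg ~ ., data = mtcars, nvmax = 5, method = "backward")
fit_seq <- regsubsets(mpg ~ ., data = mtcars, nvmax = 5, method = "seqrep")

# The outmat component shows which variables enter at each subset size
summary(fit_fwd)$outmat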

Ridge Regression

Helps mitigate multicollinearity by shrinking the regression coefficients.

  • This method is useful when dealing with high-dimensional data where the number of predictors is close to, or exceeds, the number of observations (\(p \ge n\)).

Lasso (Least Absolute Shrinkage and Selection Operator)

Performs variable selection and regularization by imposing an L1-norm penalty, i.e. a penalty proportional to the sum of the absolute values of the regression coefficients.

Encourages sparsity in the model by driving some coefficients to exactly zero, effectively removing irrelevant features.

“Sparse modeling is a rapidly developing area at the intersection of statistical learning and signal processing, motivated by the age-old statistical problem of selecting a small number of predictive variables in high-dimensional datasets” (source: https://direct.mit.edu/books/edited-volume/4027/Practical-Applications-of-Sparse-Modeling).

Elastic Net

This method combines the penalties of ridge regression and the lasso, providing a balance between the strengths of both. The model coefficients are penalized with both the L1-norm and the L2-norm; the consequence is shrinkage of the coefficients (as in ridge regression) combined with setting some coefficients exactly to zero (as in the lasso).

Useful when dealing with highly correlated predictors and when there is a need for variable selection along with regularization.

Mathematically, the penalty term in regularization methods like lasso, ridge, and elastic net regression is added to the loss function that the model minimizes during training. Let’s denote the loss function as \(\mathcal{L}\), and the penalty term as \(\Omega(\beta)\), where \(\beta\) represents the coefficients of the model.

  1. Lasso Regression (L1 Penalty):
    • The Lasso penalty term is the L1 norm of the coefficient vector: \(\Omega(\beta) = \|\beta\|_1\).
    • The loss function becomes: \(\mathcal{L}_{\text{lasso}}(\beta) = \text{RSS} + \lambda \|\beta\|_1\), where RSS is the residual sum of squares and \(\lambda\) is the regularization parameter.
  2. Ridge Regression (L2 Penalty):
    • The Ridge penalty term is the L2 norm of the coefficient vector: \(\Omega(\beta) = \|\beta\|_2^2\).
    • The loss function becomes: \(\mathcal{L}_{\text{ridge}}(\beta) = \text{RSS} + \lambda \|\beta\|_2^2\).
  3. Elastic Net Regression (Combination of L1 and L2 Penalties):
    • The Elastic Net penalty term combines both L1 and L2 norms: \(\Omega(\beta) = \alpha \|\beta\|_1 + (1 - \alpha) \|\beta\|_2^2\), where \(\alpha\) controls the mixing ratio between the L1 and L2 penalties.
    • The loss function becomes: \(\mathcal{L}_{\text{elastic net}}(\beta) = \text{RSS} + \lambda \left( \alpha \|\beta\|_1 + (1 - \alpha) \|\beta\|_2^2 \right)\).

In each case, the penalty term is multiplied by a regularization parameter \(\lambda\) to control the strength of regularization. By tuning \(\lambda\), we can adjust the impact of the penalty on the coefficient estimates, thereby influencing the model’s bias-variance trade-off.

Implication: the regularization parameter \(\lambda\) in ridge, lasso, and elastic net regression influences the estimates by controlling the trade-off between the fit to the training data and the magnitude of the coefficients:

  1. Ridge Regression:
    • As \(\lambda\) increases, the penalty term becomes more influential, leading to smaller coefficient estimates. Ridge regression tends to shrink the coefficients towards zero but does not typically set them exactly to zero.
  2. Lasso Regression:
    • For lasso regression, as \(\lambda\) increases, the penalty becomes more severe. This often leads to some coefficients being exactly zero, effectively performing variable selection by eliminating some features from the model.
  3. Elastic Net Regression:
    • In elastic net regression, the influence of \(\lambda\) depends on the mixing ratio \(\alpha\) between the L1 and L2 penalties.
    • For \(\alpha = 1\), elastic net behaves like lasso regression, and for \(\alpha = 0\), it behaves like ridge regression.
    • By adjusting \(\alpha\) and \(\lambda\), we can control the degree of sparsity (i.e., the number of non-zero coefficients) and the degree of shrinkage of the coefficients.

In summary, higher values of \(\lambda\) result in more regularization, which tends to shrink the coefficient estimates towards zero, reducing the complexity of the model and potentially improving its generalization performance on unseen data.
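In the glmnet package (used later in this report), these three penalties are selected through the alpha argument, with lambda controlling the overall strength. A minimal sketch on simulated data, shown purely for illustration:

# alpha = 0 -> ridge, alpha = 1 -> lasso, 0 < alpha < 1 -> elastic net
library(glmnet)

set.seed(1)
x <- matrix(rnorm(100 * 10), nrow = 100)   # 100 observations, 10 predictors
y <- x[, 1] - 2 * x[, 2] + rnorm(100)      # only the first two predictors matter

fit_ridge <- glmnet(x, y, alpha = 0)
fit_lasso <- glmnet(x, y, alpha = 1)
fit_enet  <- glmnet(x, y, alpha = 0.5)

# Coefficients at a moderately strong penalty
coef(fit_lasso, s = 0.5)   # several coefficients are exactly zero
coef(fit_ridge, s = 0.5)   # all coefficients shrunk but non-zero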

It is important to note that the standard error (SE) of each coefficient in a linear regression model can be calculated from the residual sum of squares (RSS) together with the design matrix \(X\).

General formula for calculating the SE of the coefficients:

\[ SE(\hat{\beta}_j) = \sqrt{\hat{\sigma}^2 \left[ (X^\top X)^{-1} \right]_{jj}}, \qquad \hat{\sigma}^2 = \frac{RSS}{n - p - 1} \]

Where:

  • \(\hat{\beta}_j\) is the estimated coefficient for the j-th predictor variable.

  • \(RSS\) is the residual sum of squares.

  • \(n\) is the number of observations.

  • \(p\) is the number of predictor variables (excluding the intercept).

  • \(\left[(X^\top X)^{-1}\right]_{jj}\) is the j-th diagonal element of the inverse of \(X^\top X\); it reflects both the variance of the j-th predictor and its correlation with the other predictors.

In the special case of simple linear regression this reduces to the familiar \(SE(\hat{\beta}_1) = \sqrt{\hat{\sigma}^2 / \sum_i (x_i - \bar{x})^2}\).

Once you have \(SE(\hat{\beta}_j)\) for each coefficient, you can use them to compute confidence intervals, perform hypothesis tests, and assess the significance of the coefficients in your regression model.
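As a quick sketch of this calculation in R (using the built-in mtcars data purely for illustration), the manually computed standard errors reproduce those reported by summary():

# Recover coefficient standard errors from RSS and (X'X)^{-1}
fit <- lm(mpg ~ wt + hp, data = mtcars)

X      <- model.matrix(fit)        # design matrix, including the intercept
rss    <- sum(residuals(fit)^2)    # residual sum of squares
n      <- nrow(X)
p      <- ncol(X) - 1              # predictors, excluding the intercept
sigma2 <- rss / (n - p - 1)        # estimated error variance

se_manual <- sqrt(sigma2 * diag(solve(t(X) %*% X)))
cbind(manual = se_manual,
      summary_lm = summary(fit)$coefficients[, "Std. Error"])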

Exploratory Data Analysis - CREDIT DATA

Exploratory data analysis will be conducted in an attempt to

  • Establish the extent of data incompleteness

  • Establish patterns

  • Establish possible correlations between the predictors.

  • Check for Linearity of response variable

# Load the libraries used throughout this report
library(ISLR2)      # Credit data set
library(tidyverse)  # data wrangling and plotting
library(caret)      # model training and evaluation
library(knitr)      # kable() tables

#load the Credit data
data("Credit")
#Credit%>%select(everything())%>%View()

Credit <- Credit %>%
  mutate(
    Own = factor(Own),
    Student = factor(Student),
    Married = factor(Married),
    Region = factor(Region)
  )

# Split the data into numeric and factor columns for inspection
credit_numeric <- Credit %>% select(-c(Own, Student, Married, Region))
credit_factors <- Credit %>% select(Own, Student, Married, Region)
#view(credit_numeric); view(credit_factors)   # optional interactive inspection

Check for data completeness

summary_df <- Credit %>%
  summarize(across(everything(), ~ sum(is.na(.)))) %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "#Missing")

# Display the summarized and pivoted data as a table
kable(summary_df)
Variable #Missing
Income 0
Limit 0
Rating 0
Cards 0
Age 0
Education 0
Own 0
Student 0
Married 0
Region 0
Balance 0

Check for multicollinearity

#Library for conditional formatting
library(DT)

# Correlation (not covariance) matrix of the numeric variables, so that the
# 0.7 threshold below is meaningful
correlation_df <- cor(select(Credit, -c("Own", "Student", "Married", "Region")))

# Apply conditional formatting:
#   |correlation| <  0.7 -> green (acceptable)
#   |correlation| >= 0.7 -> red  (strong collinearity; the diagonal is trivially 1)

correlation_df %>%
  round(3) %>%
  datatable() %>%
  formatStyle(colnames(correlation_df),
              color = styleInterval(c(-0.7, 0.7), c('red', 'green', 'red')))

From the above correlation matrix it is easy to see that strong multicollinearity exists between several of the independent variables (\(\lvert \rho \rvert > 0.7\)), most notably between Limit and Rating.

The direct implication of such strong multicollinearity is inflated standard errors and unstable regression coefficients, which can mislead statistical inference: individually important predictors may appear insignificant, and coefficient signs and magnitudes can change erratically depending on which other variables are in the model. A complementary diagnostic, the variance inflation factor (VIF), is sketched below.
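As a complementary sketch (assuming the car package is installed; the model object vif_fit is introduced here only for illustration), variance inflation factors quantify how much each coefficient's variance is inflated by collinearity, with values well above 5-10 commonly treated as problematic:

# Variance inflation factors for the numeric predictors of Balance
library(car)

vif_fit <- lm(Balance ~ Income + Limit + Rating + Cards + Age + Education,
              data = Credit)
vif(vif_fit)   # Limit and Rating are expected to dominate, given their near-perfect correlation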

Statistical modelling - Forward regression

In this section we will do the following;

  • Using the caret library, we will fit the forward selection regression model with Balance as the response.

  • Establish the final model

# Defining the control parameters for the forward selection method with cross-validation
ctrl <- trainControl(method = "cv", 
                     number = 10,    # Number of folds for cross-validation
                     selectionFunction = "oneSE") 

# Fit the forward regression model using caret
model <- train(Balance ~ ., 
               data = Credit, 
               method = "leapForward", 
               trControl = ctrl,
               tuneGrid = expand.grid(nvmax = 1:6))  # specifying tuning grid for nvmax

# Retrieve the final model
final_model <- model$finalModel

# Print the summary of cross-validated performance metrics
print(model)
## Linear Regression with Forward Selection 
## 
## 400 samples
##  10 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 359, 360, 360, 360, 360, 360, ... 
## Resampling results across tuning parameters:
## 
##   nvmax  RMSE       Rsquared   MAE      
##   1      230.58913  0.7537775  175.18696
##   2      162.26589  0.8759333  121.51468
##   3      103.58619  0.9500479   84.38680
##   4      101.34101  0.9520441   82.18097
##   5       99.39332  0.9540532   79.74080
##   6       99.04843  0.9542028   79.57278
## 
## RMSE was used to select the optimal model using  the one SE rule.
## The final value used for the model was nvmax = 4.
coefficients(final_model,id=4)
##  (Intercept)       Income        Limit       Rating   StudentYes 
## -516.7182606   -7.9446339    0.1216825    2.1904387  422.6683787
summary(final_model,id=5)
## Subset selection object
## 11 Variables  (and intercept)
##             Forced in Forced out
## Income          FALSE      FALSE
## Limit           FALSE      FALSE
## Rating          FALSE      FALSE
## Cards           FALSE      FALSE
## Age             FALSE      FALSE
## Education       FALSE      FALSE
## OwnYes          FALSE      FALSE
## StudentYes      FALSE      FALSE
## MarriedYes      FALSE      FALSE
## RegionSouth     FALSE      FALSE
## RegionWest      FALSE      FALSE
## 1 subsets of each size up to 4
## Selection Algorithm: forward
##          Income Limit Rating Cards Age Education OwnYes StudentYes MarriedYes
## 1  ( 1 ) " "    " "   "*"    " "   " " " "       " "    " "        " "       
## 2  ( 1 ) "*"    " "   "*"    " "   " " " "       " "    " "        " "       
## 3  ( 1 ) "*"    " "   "*"    " "   " " " "       " "    "*"        " "       
## 4  ( 1 ) "*"    "*"   "*"    " "   " " " "       " "    "*"        " "       
##          RegionSouth RegionWest
## 1  ( 1 ) " "         " "       
## 2  ( 1 ) " "         " "       
## 3  ( 1 ) " "         " "       
## 4  ( 1 ) " "         " "

Results and Discussion

From the above output

  • In the first row, the model includes only the “Rating” variable.

  • In the second row, the model includes both “Income” and “Rating” variables.

  • In the third row, the model includes “Income,” “Rating,” and “StudentYes” variables.

  • In the fourth row, the model additionally includes the “Limit” variable.

Therefore the final model from Forward regression is \[ \text{Balance} = -7.945\text{Income} + 2.19\text{Rating} + 422\text{Student}[\text{Yes}] + 0.122\text{Limit} - 516.7 \]

With the following metrics:

# Create a data frame with the metrics and values
metrics <- c("RMSE", "R-squared", "MAE")
values <- c(101.34, "95.20%", 82.18)
df <- data.frame(Metric = metrics, Value = values)

# Print the table using kable
kable(df)
Metric Value
RMSE 101.34
R-squared 95.20%
MAE 82.18
#plotting the final model
#install.packages("scatterplot3d")

# Load the scatterplot3d package
library(scatterplot3d)

# Define variables for the first scatterplot
x1 <- Credit$Balance
y1 <- Credit$Income
z1 <- Credit$Rating

# Define variables for the second scatterplot
x2 <- Credit$Balance
y2 <- Credit$Income
z2 <- Credit$Limit

# Find the maximum value of Income
max_income <- max(Credit$Income)
max_limit <- max(Credit$Limit)

# Set up the layout for side-by-side plots
par(mfrow = c(1, 2))

# Plot the first scatterplot
scatterplot3d(x1, y1, z1, color = "red", pch = 16, main = "Rating, Balance, and Income",
              xlab = "Balance", ylab = "Income", zlab = "Rating", ylim = c(0, max_income))

# Plot the second scatterplot
scatterplot3d(x2, y2, z2, color = "blue", pch = 16, main = "Limit, Balance, and Income",
              xlab = "Balance", ylab = "Income", zlab = "Limit", ylim = c(0, max_income))

Statistical modelling - Backward regression

In this section we will do the following;

  • Using the caret library, we will fit the backward selection regression model with Balance as the response.

  • Establish the final model

# Defining the control parameters for the backward selection method with cross-validation
ctrl <- trainControl(method = "cv", 
                     number = 10,    # Number of folds for cross-validation
                     selectionFunction = "oneSE") 

# Fit the backward regression model using caret
model_bw <- train(Balance ~ ., 
               data = Credit, 
               method = "leapBackward", 
               trControl = ctrl,
               tuneGrid = expand.grid(nvmax = 1:6))  # specifying tuning grid for nvmax

# Retrieve the final model
final_model_bw <- model_bw$finalModel

# Print the summary of cross-validated performance metrics
print(model_bw)
## Linear Regression with Backwards Selection 
## 
## 400 samples
##  10 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 360, 360, 360, 360, 360, 360, ... 
## Resampling results across tuning parameters:
## 
##   nvmax  RMSE       Rsquared   MAE      
##   1      233.61660  0.7540540  178.01865
##   2      164.76757  0.8701696  124.96094
##   3      103.88123  0.9499447   83.59614
##   4       98.95628  0.9545240   79.04815
##   5       99.54709  0.9540526   79.33030
##   6       98.37400  0.9550340   78.93216
## 
## RMSE was used to select the optimal model using  the one SE rule.
## The final value used for the model was nvmax = 4.
coefficients(final_model_bw,id=4)
##  (Intercept)       Income        Limit        Cards   StudentYes 
## -499.7272117   -7.8392288    0.2666445   23.1753794  429.6064203
summary(final_model_bw,id=5)
## Subset selection object
## 11 Variables  (and intercept)
##             Forced in Forced out
## Income          FALSE      FALSE
## Limit           FALSE      FALSE
## Rating          FALSE      FALSE
## Cards           FALSE      FALSE
## Age             FALSE      FALSE
## Education       FALSE      FALSE
## OwnYes          FALSE      FALSE
## StudentYes      FALSE      FALSE
## MarriedYes      FALSE      FALSE
## RegionSouth     FALSE      FALSE
## RegionWest      FALSE      FALSE
## 1 subsets of each size up to 4
## Selection Algorithm: backward
##          Income Limit Rating Cards Age Education OwnYes StudentYes MarriedYes
## 1  ( 1 ) " "    "*"   " "    " "   " " " "       " "    " "        " "       
## 2  ( 1 ) "*"    "*"   " "    " "   " " " "       " "    " "        " "       
## 3  ( 1 ) "*"    "*"   " "    " "   " " " "       " "    "*"        " "       
## 4  ( 1 ) "*"    "*"   " "    "*"   " " " "       " "    "*"        " "       
##          RegionSouth RegionWest
## 1  ( 1 ) " "         " "       
## 2  ( 1 ) " "         " "       
## 3  ( 1 ) " "         " "       
## 4  ( 1 ) " "         " "

Results and Discussion

From the above output

  • In the first row, the model includes only the “Limit” variable.

  • In the second row, the model includes both “Income” and “Limit” variables.

  • In the third row, the model includes “Income,” “Limit” and “Student[Yes]” variables.

  • In the fourth row, the model additionally includes the “Cards” variable.

Therefore the final model from Backward regression is \[ \text{Balance} = -7.839\text{Income} + 0.2667\text{Limit}+23.17\text{Cards} + 429.6\text{Student}[\text{Yes}] - 499.7 \]

With the following metrics:

# Create a data frame with the metrics and values
metrics <- c("RMSE", "R-squared", "MAE")
values <- c(98.96, "95.45%", 79.05)
df <- data.frame(Metric = metrics, Value = values)

# Print the table using kable
kable(df)
Metric Value
RMSE 98.96
R-squared 95.45%
MAE 79.05

Statistical modelling - Stepwise regression

In this section we will do the following;

  • Using the caret library, we will fit the stepwise (sequential replacement) regression model with Balance as the response.

  • Establish the final model

# Defining the control parameters for the stepwise selection method with cross-validation
ctrl <- trainControl(method = "cv", 
                     number = 10,    # Number of folds for cross-validation
                     selectionFunction = "oneSE") 

# Fit the stepwise regression model using caret
model_both <- train(Balance ~ ., 
               data = Credit, 
               method = "leapSeq", 
               trControl = ctrl,
               tuneGrid = expand.grid(nvmax = 1:6))  # specifying tuning grid for nvmax

# Retrieve the final model
final_model_both <- model_both$finalModel

# Print the summary of cross-validated performance metrics
print(model_both)
## Linear Regression with Stepwise Selection 
## 
## 400 samples
##  10 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 360, 360, 360, 359, 360, 360, ... 
## Resampling results across tuning parameters:
## 
##   nvmax  RMSE       Rsquared   MAE      
##   1      230.92511  0.7584440  175.53263
##   2      164.67355  0.8745133  125.26604
##   3      104.60032  0.9499402   84.98184
##   4       99.43728  0.9540559   79.44548
##   5      142.35993  0.9043306  106.77332
##   6      162.15334  0.8799233  121.52676
## 
## RMSE was used to select the optimal model using  the one SE rule.
## The final value used for the model was nvmax = 4.
coefficients(final_model_both,id=4)
##  (Intercept)       Income        Limit        Cards   StudentYes 
## -499.7272117   -7.8392288    0.2666445   23.1753794  429.6064203
summary(final_model_both,id=5)
## Subset selection object
## 11 Variables  (and intercept)
##             Forced in Forced out
## Income          FALSE      FALSE
## Limit           FALSE      FALSE
## Rating          FALSE      FALSE
## Cards           FALSE      FALSE
## Age             FALSE      FALSE
## Education       FALSE      FALSE
## OwnYes          FALSE      FALSE
## StudentYes      FALSE      FALSE
## MarriedYes      FALSE      FALSE
## RegionSouth     FALSE      FALSE
## RegionWest      FALSE      FALSE
## 1 subsets of each size up to 4
## Selection Algorithm: 'sequential replacement'
##          Income Limit Rating Cards Age Education OwnYes StudentYes MarriedYes
## 1  ( 1 ) " "    " "   "*"    " "   " " " "       " "    " "        " "       
## 2  ( 1 ) "*"    "*"   " "    " "   " " " "       " "    " "        " "       
## 3  ( 1 ) "*"    " "   "*"    " "   " " " "       " "    "*"        " "       
## 4  ( 1 ) "*"    "*"   " "    "*"   " " " "       " "    "*"        " "       
##          RegionSouth RegionWest
## 1  ( 1 ) " "         " "       
## 2  ( 1 ) " "         " "       
## 3  ( 1 ) " "         " "       
## 4  ( 1 ) " "         " "

Results and Discussion

From the above output

  • In the first row, the model includes only the “Rating” variable.

  • In the second row, the model includes the “Income” and “Limit” variables (“Rating” is swapped out, which sequential replacement permits).

  • In the third row, the model includes the “Income”, “Rating” and “Student[Yes]” variables.

  • In the fourth row, the model includes the “Income”, “Limit”, “Cards” and “Student[Yes]” variables.

Therefore the final model from stepwise regression is \[ \text{Balance} = -7.839\text{Income} + 0.2667\text{Limit}+23.17\text{Cards}+ 429.6\text{Student}[\text{Yes}]- 499.7 \]

With the following metrics:

# Create a data frame with the metrics and values
metrics <- c("RMSE", "R-squared", "MAE")
values <- c(99.44, "95.41%", 79.45)
df <- data.frame(Metric = metrics, Value = values)

# Print the table using kable
kable(df)
Metric Value
RMSE 99.44
R-squared 95.41%
MAE 79.45

Statistical modelling - Shrinkage approaches

In this section we will do the following;

  • Using the glmnet library, we will fit ridge, lasso and elastic net regression models with Balance as the response (a caret-based alternative is sketched below).

  • Establish the final model
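For reference, the same family of shrinkage models can also be tuned directly through caret with method = "glmnet", searching jointly over alpha (the ridge/lasso mix) and lambda (the penalty strength). The sketch below is illustrative only; the grid and the object name en_caret are assumptions, and the analysis in this report calls glmnet and cv.glmnet directly instead.

# Sketch: joint (alpha, lambda) tuning of the elastic-net family via caret
en_caret <- train(Balance ~ .,
                  data = Credit,
                  method = "glmnet",
                  trControl = trainControl(method = "cv", number = 10),
                  tuneGrid = expand.grid(alpha  = seq(0, 1, by = 0.25),
                                         lambda = 10^seq(-2, 2, length.out = 20)))
en_caret$bestTune   # cross-validated choice of alpha and lambda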

# Load necessary packages
library(glmnet)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## Loaded glmnet 4.1-8
library(caret)
library(tidyverse)

# Set seed for reproducibility
set.seed(123)

# Create a data partition
train_index <- createDataPartition(Credit$Balance, p = 0.8, list = FALSE)
train_set <- Credit[train_index, ]
test_set <- Credit[-train_index, ]

# Predictor variables
x_train <- model.matrix(Balance ~ ., data = train_set)[,-1]
# Outcome variable
y_train <- train_set$Balance

Statistical modelling - Ridge regression/Shrinkage approach 1

# Fit the ridge regression model
ridge_model <- glmnet(x_train, y_train, alpha = 0)

# Find the best lambda using cross-validation
set.seed(123)
cv <- cv.glmnet(x_train, y_train, alpha = 0)

# Fit the final model on the training data using the minimum lambda
model_ridge <- glmnet(x_train, y_train, alpha = 0, lambda = cv$lambda.min)

# Display regression coefficients
coef(model_ridge)
## 12 x 1 sparse Matrix of class "dgCMatrix"
##                       s0
## (Intercept) -431.0013019
## Income        -5.4669947
## Limit          0.1162357
## Rating         1.6702323
## Cards         12.3089206
## Age           -0.6981572
## Education      1.2873121
## OwnYes        -7.0313702
## StudentYes   386.0383178
## MarriedYes   -15.0870029
## RegionSouth   15.6815173
## RegionWest    20.0473809
# Make predictions on the test data
x_test <- model.matrix(Balance ~., test_set)[,-1]
predictions <- model_ridge %>% predict(x_test) %>% as.vector()
# Model performance metrics
data.frame(
  RMSE = RMSE(predictions, test_set$Balance),
  Rsquare = R2(predictions, test_set$Balance)
)
##       RMSE  Rsquare
## 1 104.4455 0.962792

Statistical modelling - Lasso regression/Shrinkage approach 2

# Fit the Lasso regression model
lasso_model <- glmnet(x_train, y_train, alpha = 1)

# Find the best lambda using cross-validation
set.seed(123)
cv <- cv.glmnet(x_train, y_train, alpha = 1)

# Fit the final model on the training data using the minimum lambda
model_lasso <- glmnet(x_train, y_train, alpha = 1, lambda = cv$lambda.min)

# Display Lasso regression coefficients
coef(model_lasso)
## 12 x 1 sparse Matrix of class "dgCMatrix"
##                       s0
## (Intercept) -480.9840186
## Income        -7.8328211
## Limit          0.2224328
## Rating         0.6469600
## Cards         15.6345469
## Age           -0.2982869
## Education      .        
## OwnYes        -6.5649578
## StudentYes   423.7889695
## MarriedYes   -10.0150676
## RegionSouth   10.7656973
## RegionWest    15.3186326
# Make predictions on the test data
x_test <- model.matrix(Balance ~., test_set)[,-1]
predictions <- model_lasso %>% predict(x_test) %>% as.vector()
# Model performance metrics
data.frame(
  RMSE = RMSE(predictions, test_set$Balance),
  Rsquare = R2(predictions, test_set$Balance)
)
##       RMSE   Rsquare
## 1 97.83669 0.9600978

Statistical modelling - Elastic net regression/Shrinkage approach 3

# Find the best lambda using cross-validation
set.seed(123)
cv <- cv.glmnet(x_train, y_train, alpha = 0.1)

# Fit the final model on the training data using the minimum lambda
model_en <- glmnet(x_train, y_train, alpha = 0.1, lambda = cv$lambda.min)

# Display regression coefficients
coef(model_en)
## 12 x 1 sparse Matrix of class "dgCMatrix"
##                        s0
## (Intercept) -494.89212404
## Income        -7.96242557
## Limit          0.19796624
## Rating         1.04361311
## Cards         14.55564456
## Age           -0.34253069
## Education     -0.02913358
## OwnYes        -9.41879557
## StudentYes   425.48555727
## MarriedYes   -14.30258432
## RegionSouth   17.00957983
## RegionWest    23.66858422
# Make predictions on the test data
x_test <- model.matrix(Balance ~., test_set)[,-1]
predictions <- model_en %>% predict(x_test) %>% as.vector()
# Model performance metrics
data.frame(
  RMSE = RMSE(predictions, test_set$Balance),
  Rsquare = R2(predictions, test_set$Balance)
)
##      RMSE   Rsquare
## 1 97.4995 0.9602919
# Create a data frame with the test-set results reported above
results <- data.frame(Method = c("Ridge", "Lasso", "Elastic Net"),
                      RMSE = c(104.4455, 97.8367, 97.4995),
                      Rsquare = c(0.962792, 0.9600978, 0.9602919))

kable(results, align = "c", caption = "Model Performance Metrics")
Model Performance Metrics
Method RMSE Rsquare
Ridge 104.4455 0.9627920
Lasso 97.8367 0.9600978
Elastic Net 97.4995 0.9602919
# Inspect which predictors were retained by the fitted lasso model (model_lasso)

# Retrieve coefficients
lasso_coefficients <- coef(model_lasso)

# Print coefficients
print(lasso_coefficients)
## 12 x 1 sparse Matrix of class "dgCMatrix"
##                       s0
## (Intercept) -480.9840186
## Income        -7.8328211
## Limit          0.2224328
## Rating         0.6469600
## Cards         15.6345469
## Age           -0.2982869
## Education      .        
## OwnYes        -6.5649578
## StudentYes   423.7889695
## MarriedYes   -10.0150676
## RegionSouth   10.7656973
## RegionWest    15.3186326
# Identify significant coefficients
significant_indices <- which(lasso_coefficients != 0)
significant_predictors <- rownames(lasso_coefficients)[significant_indices]

# Print significant predictors
print(significant_predictors)
##  [1] "(Intercept)" "Income"      "Limit"       "Rating"      "Cards"      
##  [6] "Age"         "OwnYes"      "StudentYes"  "MarriedYes"  "RegionSouth"
## [11] "RegionWest"

Results and Discussion

From the above output, the lasso and the elastic net perform almost identically on the held-out test data (RMSE of roughly 97.8 and 97.5 respectively, each with \(R^2 \approx 96\%\)), and both outperform ridge. The lasso has the additional advantage of yielding a sparser model: it drives the Education coefficient exactly to zero. The final model reported here is therefore the lasso fit:

\[ \text{Balance} = -7.83 \times \text{Income} + 0.22 \times \text{Limit} + 0.65 \times \text{Rating} + 15.63 \times \text{Cards} - 0.30 \times \text{Age} - 6.56 \times \text{Own[Yes]} \] \[ + 423.79 \times \text{Student[Yes]} - 10.02 \times \text{Married[Yes]} + 10.77 \times \text{Region[South]} + 15.32 \times \text{Region[West]} - 481.0 \]

Here, each coefficient represents the effect of the corresponding predictor variable on Balance, holding the other variables constant. For example, the coefficient of -7.83 for Income indicates that each unit increase in Income is associated with a decrease of about 7.83 units in Balance, all else held constant.
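To illustrate how the fitted lasso model would be used for scoring, the sketch below predicts the Balance of a single hypothetical customer; the profile values are invented purely for illustration.

# Hypothetical new customer (values are made up for illustration only)
new_customer <- data.frame(
  Income = 50, Limit = 5000, Rating = 350, Cards = 3, Age = 40, Education = 14,
  Own     = factor("Yes",   levels = levels(Credit$Own)),
  Student = factor("No",    levels = levels(Credit$Student)),
  Married = factor("Yes",   levels = levels(Credit$Married)),
  Region  = factor("South", levels = levels(Credit$Region)),
  Balance = 0                               # placeholder so the formula below works
)

new_x <- model.matrix(Balance ~ ., data = new_customer)[, -1, drop = FALSE]
predict(model_lasso, newx = new_x)          # predicted Balance for this profile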

Exploratory Data Analysis - HITTERS

library(ISLR)
## 
## Attaching package: 'ISLR'
## The following object is masked _by_ '.GlobalEnv':
## 
##     Credit
## The following objects are masked from 'package:ISLR2':
## 
##     Auto, Credit
data(Hitters)
#Hitters%>%select(everything())%>%View()

Check for data completeness

summary_df_hit <- Hitters %>%
  summarize(across(everything(), ~ sum(is.na(.)))) %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "#Missing")

# Display the summarized and pivoted data as a table
kable(summary_df_hit)
Variable #Missing
AtBat 0
Hits 0
HmRun 0
Runs 0
RBI 0
Walks 0
Years 0
CAtBat 0
CHits 0
CHmRun 0
CRuns 0
CRBI 0
CWalks 0
League 0
Division 0
PutOuts 0
Assists 0
Errors 0
Salary 59
NewLeague 0

From the analysis we can observe that Salary has 59 missing values.

We will impute the missing values with the median salary.

The data frame below confirms that the ‘Salary’ column was successfully imputed.

Hitters <- Hitters %>%
  mutate(Salary = ifelse(is.na(Salary), median(Salary, na.rm = TRUE), Salary))
summary_df_hit2 <- Hitters %>%
  summarize(across(everything(), ~ sum(is.na(.)))) %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "#Missing")

# Display the summarized and pivoted data as a table
kable(summary_df_hit2)
Variable #Missing
AtBat 0
Hits 0
HmRun 0
Runs 0
RBI 0
Walks 0
Years 0
CAtBat 0
CHits 0
CHmRun 0
CRuns 0
CRBI 0
CWalks 0
League 0
Division 0
PutOuts 0
Assists 0
Errors 0
Salary 0
NewLeague 0

Check for multicollinearity

#Library for conditional formatting (DT) already loaded above

# Correlation (not covariance) matrix of the numeric columns, so that the
# 0.7 threshold below is meaningful
cor_hitters <- cor(select(Hitters, where(is.numeric)))

# Apply conditional formatting:
#   |correlation| <  0.7 -> green (acceptable)
#   |correlation| >= 0.7 -> red  (strong collinearity; the diagonal is trivially 1)

cor_hitters %>%
  round(3) %>%
  datatable() %>%
  formatStyle(colnames(cor_hitters),
              color = styleInterval(c(-0.7, 0.7), c('red', 'green', 'red')))

From the above correlation matrix it is easy to see that strong multicollinearity exists between several of the predictors (\(\lvert \rho \rvert > 0.7\)), particularly among the career statistics (CAtBat, CHits, CRuns, CRBI) and their single-season counterparts.

As with the Credit data, the direct implication is inflated standard errors and unstable coefficient estimates, which undermine statistical inference on the individual predictors. An automated screen using caret's findCorrelation() is sketched below.
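As an additional, automated screen (a sketch using caret, which is already loaded; the correlation matrix is recomputed here so the block is self-contained), findCorrelation() flags the columns that would have to be removed to keep all pairwise absolute correlations below the chosen cutoff:

# Automated collinearity screen on the numeric Hitters columns
cor_hitters <- cor(select(Hitters, where(is.numeric)))

high_corr_idx <- findCorrelation(cor_hitters, cutoff = 0.7)
colnames(cor_hitters)[high_corr_idx]   # candidate columns to drop, or to handle via shrinkage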

Statistical modelling - Forward regression

In this section we will do the following;

  • Using the caret library, we will fit the forward selection regression model with Salary as the response.

  • Establish the final model

# Defining the control parameters for the forward selection method with cross-validation
ctrl <- trainControl(method = "cv", 
                     number = 10,    # Number of folds for cross-validation
                     selectionFunction = "oneSE") 

# Fit the forward regression model using caret
model <- train(Salary ~ ., 
               data = Hitters, 
               method = "leapForward", 
               trControl = ctrl,
               tuneGrid = expand.grid(nvmax = 1:19))  # specifying tuning grid for nvmax

# Retrieve the final model
final_model <- model$finalModel

# Print the summary of cross-validated performance metrics
print(model)
## Linear Regression with Forward Selection 
## 
## 322 samples
##  19 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 290, 289, 290, 289, 290, 290, ... 
## Resampling results across tuning parameters:
## 
##   nvmax  RMSE      Rsquared   MAE     
##    1     352.7571  0.2938096  251.0881
##    2     343.3796  0.3248059  243.4211
##    3     347.3429  0.3156264  247.3834
##    4     339.7834  0.3373238  245.7417
##    5     334.0857  0.3588010  242.0721
##    6     328.3922  0.3743058  237.7368
##    7     331.8928  0.3721921  237.4194
##    8     326.1281  0.3937233  235.9920
##    9     326.7185  0.3945905  236.4108
##   10     324.2184  0.3960174  236.9005
##   11     327.5217  0.3846775  240.0618
##   12     327.3180  0.3830222  240.7650
##   13     326.3354  0.3865431  239.5034
##   14     326.0394  0.3890317  238.8098
##   15     324.1536  0.3942695  237.2391
##   16     323.0666  0.3968692  237.0025
##   17     324.2417  0.3934248  237.8743
##   18     323.9227  0.3936974  237.5771
##   19     324.0420  0.3933971  237.6676
## 
## RMSE was used to select the optimal model using  the one SE rule.
## The final value used for the model was nvmax = 2.
coefficients(final_model,id=2)
## (Intercept)        Hits       CRuns 
##  57.9708916   2.6870742   0.5188723
summary(final_model,id=2)
## Subset selection object
## 19 Variables  (and intercept)
##            Forced in Forced out
## AtBat          FALSE      FALSE
## Hits           FALSE      FALSE
## HmRun          FALSE      FALSE
## Runs           FALSE      FALSE
## RBI            FALSE      FALSE
## Walks          FALSE      FALSE
## Years          FALSE      FALSE
## CAtBat         FALSE      FALSE
## CHits          FALSE      FALSE
## CHmRun         FALSE      FALSE
## CRuns          FALSE      FALSE
## CRBI           FALSE      FALSE
## CWalks         FALSE      FALSE
## LeagueN        FALSE      FALSE
## DivisionW      FALSE      FALSE
## PutOuts        FALSE      FALSE
## Assists        FALSE      FALSE
## Errors         FALSE      FALSE
## NewLeagueN     FALSE      FALSE
## 1 subsets of each size up to 2
## Selection Algorithm: forward
##          AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI
## 1  ( 1 ) " "   " "  " "   " "  " " " "   " "   " "    " "   " "    "*"   " " 
## 2  ( 1 ) " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    "*"   " " 
##          CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
## 1  ( 1 ) " "    " "     " "       " "     " "     " "    " "       
## 2  ( 1 ) " "    " "     " "       " "     " "     " "    " "

Results and Discussion

From the above output

  • In the first row, the model includes only the “CRuns” variable.

  • In the second row, the model includes both “Hits” and “CRuns” variables.

Based on the output provided above, the one-standard-error rule gives an optimal model with nvmax = 2, resulting in the following linear model:

\[ \text{Salary} = 2.69\text{Hits} + 0.52\text{CRuns} + 58 \]

With the following metrics:

# Create a data frame with the metrics and values
metrics <- c("RMSE", "R-squared", "MAE")
values <- c(343.4, "32.5%", 243.4)
df <- data.frame(Metric = metrics, Value = values)

# Print the table using kable
kable(df)
Metric Value
RMSE 343.4
R-squared 32.5%
MAE 243.4
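As a rough, simplified re-implementation of the one-standard-error rule applied above (caret's internal oneSE() selection function is the authoritative version, and may differ in detail), the cross-validation results stored in model$results can be checked by hand: find the lowest mean RMSE, add one standard error of that estimate, and keep the smallest nvmax whose RMSE stays below that threshold.

# Manual sketch of the one-SE rule on the forward-selection results for Hitters
res <- model$results                         # nvmax, RMSE, RMSESD, ... per candidate

best      <- which.min(res$RMSE)             # row with the lowest mean CV RMSE
threshold <- res$RMSE[best] + res$RMSESD[best] / sqrt(10)   # 10 CV folds were used

# Smallest model whose CV RMSE is within one SE of the best
res$nvmax[min(which(res$RMSE <= threshold))]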

Statistical modelling - Backward regression

In this section we will do the following;

  • Using the caret library, we will fit the backward selection regression model with Salary as the response.

  • Establish the final model

# Defining the control parameters for the backward selection method with cross-validation
ctrl <- trainControl(method = "cv", 
                     number = 10,    # Number of folds for cross-validation
                     selectionFunction = "oneSE") 

# Fit the backward regression model using caret
model_bw <- train(Salary ~ ., 
               data = Hitters, 
               method = "leapBackward", 
               trControl = ctrl,
               tuneGrid = expand.grid(nvmax = 1:6))  # specifying tuning grid for nvmax

# Retrieve the final model
final_model_bw <- model_bw$finalModel

# Print the summary of cross-validated performance metrics
print(model_bw)
## Linear Regression with Backwards Selection 
## 
## 322 samples
##  19 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 290, 289, 289, 290, 290, 290, ... 
## Resampling results across tuning parameters:
## 
##   nvmax  RMSE      Rsquared   MAE     
##   1      355.4987  0.2679713  250.3121
##   2      336.3418  0.3528149  237.9877
##   3      347.5916  0.3214210  246.1997
##   4      342.8978  0.3499113  242.5172
##   5      334.4856  0.3790217  240.0144
##   6      325.9773  0.4024042  235.6599
## 
## RMSE was used to select the optimal model using  the one SE rule.
## The final value used for the model was nvmax = 2.
coefficients(final_model_bw,id=2)
## (Intercept)        Hits       CRuns 
##  57.9708916   2.6870742   0.5188723
summary(final_model_bw,id=2)
## Subset selection object
## 19 Variables  (and intercept)
##            Forced in Forced out
## AtBat          FALSE      FALSE
## Hits           FALSE      FALSE
## HmRun          FALSE      FALSE
## Runs           FALSE      FALSE
## RBI            FALSE      FALSE
## Walks          FALSE      FALSE
## Years          FALSE      FALSE
## CAtBat         FALSE      FALSE
## CHits          FALSE      FALSE
## CHmRun         FALSE      FALSE
## CRuns          FALSE      FALSE
## CRBI           FALSE      FALSE
## CWalks         FALSE      FALSE
## LeagueN        FALSE      FALSE
## DivisionW      FALSE      FALSE
## PutOuts        FALSE      FALSE
## Assists        FALSE      FALSE
## Errors         FALSE      FALSE
## NewLeagueN     FALSE      FALSE
## 1 subsets of each size up to 2
## Selection Algorithm: backward
##          AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI
## 1  ( 1 ) " "   " "  " "   " "  " " " "   " "   " "    " "   " "    "*"   " " 
## 2  ( 1 ) " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    "*"   " " 
##          CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
## 1  ( 1 ) " "    " "     " "       " "     " "     " "    " "       
## 2  ( 1 ) " "    " "     " "       " "     " "     " "    " "

Results and Discussion

From the above output

  • In the first row, the model includes only the “CRuns” variable.

  • In the second row, the model includes both “Hits” and “CRuns” variables.

Based on the output provided above, the one-standard-error rule gives an optimal model with nvmax = 2, resulting in the following linear model: \[ \text{Salary} = 2.69\text{Hits} + 0.52\text{CRuns} + 58 \]

With the following metrics:

# Create a data frame with the metrics and values
metrics <- c("RMSE", "R-squared", "MAE")
values <- c(336.3, "35.2%", 237.9)
df <- data.frame(Metric = metrics, Value = values)

# Print the table using kable
kable(df)
Metric Value
RMSE 336.3
R-squared 35.2%
MAE 237.9

Statistical modelling - Stepwise regression

In this section we will do the following;

  • Using the caret library, we will fit the stepwise (sequential replacement) regression model with Salary as the response.

  • Establish the final model

# Defining the control parameters for the stepwise selection method with cross-validation
ctrl <- trainControl(method = "cv", 
                     number = 10,    # Number of folds for cross-validation
                     selectionFunction = "oneSE") 

# Fit the stepwise regression model using caret
model_both <- train(Salary ~ ., 
               data = Hitters, 
               method = "leapSeq", 
               trControl = ctrl,
               tuneGrid = expand.grid(nvmax = 1:6))  # specifying tuning grid for nvmax

# Retrieve the final model
final_model_both <- model_both$finalModel

# Print the summary of cross-validated performance metrics
print(model_both)
## Linear Regression with Stepwise Selection 
## 
## 322 samples
##  19 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 289, 289, 288, 290, 290, 290, ... 
## Resampling results across tuning parameters:
## 
##   nvmax  RMSE      Rsquared   MAE     
##   1      357.9762  0.2602810  252.5547
##   2      344.6466  0.3293830  242.6623
##   3      348.3695  0.2955762  249.8461
##   4      347.6043  0.3200350  247.6202
##   5      342.4594  0.3470465  246.7261
##   6      327.0878  0.3784566  236.1341
## 
## RMSE was used to select the optimal model using  the one SE rule.
## The final value used for the model was nvmax = 2.
coefficients(final_model_both,id=2)
## (Intercept)        Hits        CRBI 
##  61.6032482   2.8026922   0.5175608
summary(final_model_both,id=2)
## Subset selection object
## 19 Variables  (and intercept)
##            Forced in Forced out
## AtBat          FALSE      FALSE
## Hits           FALSE      FALSE
## HmRun          FALSE      FALSE
## Runs           FALSE      FALSE
## RBI            FALSE      FALSE
## Walks          FALSE      FALSE
## Years          FALSE      FALSE
## CAtBat         FALSE      FALSE
## CHits          FALSE      FALSE
## CHmRun         FALSE      FALSE
## CRuns          FALSE      FALSE
## CRBI           FALSE      FALSE
## CWalks         FALSE      FALSE
## LeagueN        FALSE      FALSE
## DivisionW      FALSE      FALSE
## PutOuts        FALSE      FALSE
## Assists        FALSE      FALSE
## Errors         FALSE      FALSE
## NewLeagueN     FALSE      FALSE
## 1 subsets of each size up to 2
## Selection Algorithm: 'sequential replacement'
##          AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI
## 1  ( 1 ) " "   " "  " "   " "  " " " "   " "   " "    " "   " "    "*"   " " 
## 2  ( 1 ) " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "   "*" 
##          CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
## 1  ( 1 ) " "    " "     " "       " "     " "     " "    " "       
## 2  ( 1 ) " "    " "     " "       " "     " "     " "    " "

Results and Discussion

From the above output

  • In the first row, the model includes only the “CRuns” variable.

  • In the second row, the model includes both “Hits” and “CRBI” variables.

Based on the output provided above, the one-standard-error rule gives an optimal model with nvmax = 2, resulting in the following linear model: \[ \text{Salary} = 2.80\text{Hits} + 0.52\text{CRBI} + 61.6 \]

With the following metrics:

# Create a data frame with the metrics and values
metrics <- c("RMSE", "R-squared", "MAE")
values <- c(344.65, "32.9%", 242.7)
df <- data.frame(Metric = metrics, Value = values)

# Print the table using kable
kable(df)
Metric Value
RMSE 344.65
R-squared 32.9%
MAE 242.7

Statistical modelling - Shrinkage approaches

In this section we will do the following;

  • Using the glmnet library, we will fit ridge, lasso and elastic net regression models with Salary as the response.

  • Establish the final model

# Set seed for reproducibility
set.seed(123)

# Create a data partition
train_index <- createDataPartition(Hitters$Salary, p = 0.8, list = FALSE)
train_set <- Hitters[train_index, ]
test_set <- Hitters[-train_index, ]

# Predictor variables
x_train <- model.matrix(Salary ~ ., data = train_set)[,-1]
# Outcome variable
y_train <- train_set$Salary

Statistical modelling - Ridge regression/Shrinkage approach 1

ridge_model <- glmnet(x_train, y_train, alpha = 0)

# Find the best lambda using cross-validation
set.seed(123)
cv <- cv.glmnet(x_train, y_train, alpha = 0)

# Fit the final model on the training data using the minimum lambda
model_ridge <- glmnet(x_train, y_train, alpha = 0, lambda = cv$lambda.min)

# Display regression coefficients
coef(model_ridge)
## 20 x 1 sparse Matrix of class "dgCMatrix"
##                        s0
## (Intercept) 149.094338926
## AtBat        -0.026351452
## Hits          0.631595908
## HmRun         1.006411578
## Runs          0.937189751
## RBI           0.896224399
## Walks         1.438906787
## Years        -1.301690125
## CAtBat        0.009490935
## CHits         0.051148718
## CHmRun        0.300233880
## CRuns         0.105898602
## CRBI          0.096299283
## CWalks        0.038763702
## LeagueN      25.024070269
## DivisionW   -91.786426702
## PutOuts       0.166351009
## Assists       0.133157453
## Errors       -2.800253005
## NewLeagueN   -3.541849977
# Make predictions on the test data
x_test <- model.matrix(Salary ~., test_set)[,-1]
predictions <- model_ridge %>% predict(x_test) %>% as.vector()
# Model performance metrics
data.frame(
  RMSE = RMSE(predictions, test_set$Salary),
  Rsquare = R2(predictions, test_set$Salary)
)
##       RMSE   Rsquare
## 1 299.1887 0.4834495

Statistical modelling - Lasso regression/Shrinkage approach 2

lasso_model <- glmnet(x_train, y_train, alpha = 1)

# Find the best lambda using cross-validation
set.seed(123)
cv <- cv.glmnet(x_train, y_train, alpha = 1)

# Fit the final model on the training data using the minimum lambda
model_lasso <- glmnet(x_train, y_train, alpha = 1, lambda = cv$lambda.min)

# Display Lasso regression coefficients
coef(model_lasso)
## 20 x 1 sparse Matrix of class "dgCMatrix"
##                        s0
## (Intercept)  333.52945997
## AtBat         -1.91341335
## Hits           5.13939816
## HmRun          2.41315679
## Runs           0.45351575
## RBI            1.17401457
## Walks          4.05707149
## Years        -17.82966629
## CAtBat         .         
## CHits          .         
## CHmRun         0.09376282
## CRuns          0.76275991
## CRBI           0.28163504
## CWalks        -0.48304804
## LeagueN       94.57372906
## DivisionW   -131.69260973
## PutOuts        0.24665996
## Assists        0.48933206
## Errors        -6.57440972
## NewLeagueN   -70.37322651
# Make predictions on the test data
x_test <- model.matrix(Salary ~., test_set)[,-1]
predictions <- model_lasso %>% predict(x_test) %>% as.vector()
# Model performance metrics
data.frame(
  RMSE = RMSE(predictions, test_set$Salary),
  Rsquare = R2(predictions, test_set$Salary)
)
##       RMSE   Rsquare
## 1 287.0566 0.5237959

Statistical modelling - Elastic net regression/Shrinkage approach 3

# Find the best lambda using cross-validation
set.seed(123)
cv <- cv.glmnet(x_train, y_train, alpha = 0.1)

# Fit the final model on the training data using the minimum lambda
model_en <- glmnet(x_train, y_train, alpha = 0.1, lambda = cv$lambda.min)

# Display regression coefficients
coef(model_en)
## 20 x 1 sparse Matrix of class "dgCMatrix"
##                       s0
## (Intercept)  293.9354097
## AtBat         -1.2347188
## Hits           2.9462509
## HmRun          0.8718318
## Runs           1.4853760
## RBI            1.3727764
## Walks          3.0715062
## Years        -16.7807437
## CAtBat         .        
## CHits          0.1211086
## CHmRun         0.2913277
## CRuns          0.4012250
## CRBI           0.2083505
## CWalks        -0.2908127
## LeagueN       88.7486079
## DivisionW   -132.1801612
## PutOuts        0.2461816
## Assists        0.4439587
## Errors        -7.0980713
## NewLeagueN   -65.4366729
# Make predictions on the test data
x_test <- model.matrix(Salary ~., test_set)[,-1]
predictions <- model_en %>% predict(x_test) %>% as.vector()
# Model performance metrics
data.frame(
  RMSE = RMSE(predictions, test_set$Salary),
  Rsquare = R2(predictions, test_set$Salary)
)
##      RMSE   Rsquare
## 1 291.408 0.5080909

Results and Discussion

From the above output it is clear that the lasso is the superior shrinkage method for this dataset.

The lasso has the lowest test RMSE (287.1) and an \(R^2\) of 52.4%.

Therefore the final model from lasso regression is (CAtBat and CHits are shrunk exactly to zero and drop out): \[ \text{Salary} = 333.53 - 1.91 \times \text{AtBat} + 5.14 \times \text{Hits} + 2.41 \times \text{HmRun} + 0.45 \times \text{Runs} + 1.17 \times \text{RBI} + 4.06 \times \text{Walks} - 17.83 \times \text{Years} \] \[ + 0.09 \times \text{CHmRun} + 0.76 \times \text{CRuns} + 0.28 \times \text{CRBI} - 0.48 \times \text{CWalks} + 94.57 \times \text{LeagueN} \] \[ - 131.69 \times \text{DivisionW} + 0.25 \times \text{PutOuts} + 0.49 \times \text{Assists} - 6.57 \times \text{Errors} - 70.37 \times \text{NewLeagueN} \] Here, each coefficient represents the effect of the corresponding predictor variable on Salary, holding other variables constant. For example, the coefficient of -17.83 for Years indicates that for each unit increase in Years, the Salary is expected to decrease by about 17.83 units, while holding other variables constant.
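A useful complement is the lasso coefficient path, which shows how each coefficient shrinks towards zero as the penalty grows and where the cross-validated penalty sits. This is a sketch only; the lasso cross-validation is re-run here (as cv_lasso) so the reference line is guaranteed to correspond to alpha = 1, since the cv object above was last refitted for the elastic net.

# Lasso coefficient paths for the Hitters training data
set.seed(123)
cv_lasso <- cv.glmnet(x_train, y_train, alpha = 1)

plot(lasso_model, xvar = "lambda", label = TRUE)   # one curve per coefficient
abline(v = log(cv_lasso$lambda.min), lty = 2)      # lambda chosen by cross-validation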

Discussion of Results/Findings

Comparing all six methods, it was evident that the shrinkage approaches, and the lasso in particular, outperformed the subset-selection methods in terms of both RMSE and \(R^2\) on the two datasets; the lasso combined near-best predictive accuracy with a sparser, more interpretable model. The forward, backward and stepwise approaches performed comparatively poorly, particularly on the Hitters data, where none of them exceeded an \(R^2\) of about 35%. (Note that the shrinkage methods were evaluated on a held-out 20% test set, whereas the subset-selection methods report 10-fold cross-validation metrics, so the comparison is indicative rather than exact.) A sketch of a more systematic comparison of the caret-tuned models follows.
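As a sketch of that comparison for the caret-tuned models, caret's resamples() collects the fold-level metrics side by side; ideally the three fits would share identical resampling indices (for example by fixing seeds inside a common trainControl object) for the paired comparison to be exact, which was not enforced above.

# Fold-level comparison of the three subset-selection fits on Hitters
subset_fits <- resamples(list(Forward  = model,
                              Backward = model_bw,
                              Stepwise = model_both))
summary(subset_fits)   # distribution of RMSE, R-squared and MAE across folds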

Recommendation

For predicting Balance in the Credit data and Salary in the Hitters data, the lasso method is recommended for the modelling.

Dataset: Credit

Best model accuracy and Bias metrics

Variable selection method RMSE R2 MAE
Lasso 97.8 96.0% –
Ridge 104.4 96.3% –
Elastic Net 97.5 96.0% –
Forward regression 101.3 95.2% 82.2
Backward regression 99.0 95.5% 79.0
Stepwise 99.4 95.4% 79.4

Dataset: Hitters

Best model accuracy and Bias metrics

Variable selection method RMSE R2 MAE
Lasso 287.1 52.4% –
Ridge 299.2 48.3% –
Elastic Net 291.4 50.8% –
Forward regression 343.4 32.5% 243.4
Backward regression 336.3 35.3% 238.0
Stepwise 344.6 32.9% 242.7