Analyse the Credit and Hitters data sets as done in the Introduction to Statistical Learning using R and compare the following models:
Multiple linear regression.
Use caret for model development and evaluation.
This task investigates the performance of various linear regression and variable selection approaches on the Hitters and Credit datasets built into R.
To determine the superior model, and hence the superior method, the models will be compared on their bias and variance, RMSE, and R-squared.
The following methods are used in the analysis of the two datasets:
Forward regression
This is a variable selection method that begins with an empty model and iteratively adds the most significant variable until no more significant variables can be added.
Unlike backward selection, it can be applied even when the number of variables (p) is large relative to the number of samples (n).
Useful for reducing the risk of overfitting and can handle a large number of predictors.
Backward regression
Begins with the full model containing all the predictors and iteratively removes the least significant variable until only significant variables remain.
Requires the number of samples (n) to exceed the number of variables (p) so that the full model can be fitted.
Stepwise Regression
Combines forward and backward moves: at each step a variable can be added or removed (sequential replacement), allowing the procedure to revisit and undo earlier choices.
Ridge Regression
Helps mitigate multicollinearity by shrinking the regression coefficients via an L2-norm penalty (the sum of the squared coefficients); coefficients are shrunk towards zero but never set exactly to zero.
Lasso (Least Absolute Shrinkage and Selection Operator)
Performs variable selection and regularization by imposing an L1-norm penalty (the sum of the absolute values of the coefficients) on the regression coefficients.
Encourages sparsity in the model by driving some coefficients to exactly zero, effectively removing irrelevant features.
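A quick way to see this sparsity is the lasso coefficient path. Below is a minimal sketch on simulated data (the X_demo/y_demo objects are illustrative only, not part of the analysis):

library(glmnet)
set.seed(1)
X_demo <- matrix(rnorm(100 * 10), 100, 10)
y_demo <- 3 * X_demo[, 1] + rnorm(100)         # only the first predictor is truly relevant
demo_fit <- glmnet(X_demo, y_demo, alpha = 1)
plot(demo_fit, xvar = "lambda", label = TRUE)  # most coefficient paths hit exactly zero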
“Sparse modeling is a rapidly developing area at the intersection of statistical learning and signal processing, motivated by the age-old statistical problem of selecting a small number of predictive variables in high-dimensional datasets” (https://direct.mit.edu/books/edited-volume/4027/Practical-Applications-of-Sparse-Modeling).
Elastic Net
This method combines the penalties of ridge regression and lasso, providing a balance between the strengths of both. The model coefficients are penalized with both the L1-norm and the L2-norm; the consequence is shrinkage of the coefficients (as in ridge regression) combined with setting some coefficients exactly to zero (as in the lasso).
Useful when dealing with highly correlated predictors and when there is a need for variable selection along with regularization.
Mathematically, the penalty term in regularization methods like lasso, ridge, and elastic net regression is added to the loss function that the model minimizes during training. Let’s denote the loss function as \(\mathcal{L}\), and the penalty term as \(\Omega(\beta)\), where \(\beta\) represents the coefficients of the model.
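For the three penalties used here (written in the standard glmnet-style parameterization), the penalized objectives are:

\[ \text{Ridge:} \quad \min_{\beta} \; \mathcal{L}(\beta) + \lambda \sum_{j=1}^{p} \beta_j^2 \]

\[ \text{Lasso:} \quad \min_{\beta} \; \mathcal{L}(\beta) + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert \]

\[ \text{Elastic Net:} \quad \min_{\beta} \; \mathcal{L}(\beta) + \lambda \left( \alpha \sum_{j=1}^{p} \lvert \beta_j \rvert + \frac{1 - \alpha}{2} \sum_{j=1}^{p} \beta_j^2 \right), \quad \alpha \in [0, 1] \]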
In each case, the penalty term is multiplied by a regularization parameter \(\lambda\) to control the strength of regularization. By tuning \(\lambda\), we can adjust the impact of the penalty on the coefficient estimates, thereby influencing the model’s bias-variance trade-off.
Implication: the regularization parameter \(\lambda\) in ridge, lasso, and elastic net regression influences the estimates by controlling the trade-off between the fit to the training data and the magnitude of the coefficients.
In summary, higher values of \(\lambda\) result in more regularization, which tends to shrink the coefficient estimates towards zero, reducing the complexity of the model and potentially improving its generalization performance on unseen data.
It is important to note that the standard error (SE) of the coefficients in a linear regression model can be calculated from the residual sum of squares (RSS) together with the design matrix (X).
General formula for calculating the SE of the coefficients:
\[ SE(\hat{\beta}_j) = \sqrt{\hat{\sigma}^2 \left[ (X^{\top}X)^{-1} \right]_{jj}}, \qquad \hat{\sigma}^2 = \frac{RSS}{n - p - 1} \]
Where:
\(\hat{\beta}_j\) is the estimated coefficient for the j-th predictor variable.
\(RSS\) is the residual sum of squares.
\(n\) is the number of observations.
\(p\) is the number of predictor variables (excluding the intercept).
\(\left[ (X^{\top}X)^{-1} \right]_{jj}\) is the j-th diagonal element of \((X^{\top}X)^{-1}\); for a single centred predictor this reduces to \(1 / \sum_i (x_i - \bar{x})^2\), so the SE shrinks as the predictor’s spread grows.
Once you have \(SE(\hat{\beta}_j)\) for each coefficient, you can use them to compute confidence intervals, perform hypothesis tests, and assess the significance of the coefficients in your regression model.
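To make the formula concrete, here is a minimal sketch on the built-in mtcars data, checking the manual computation against the standard errors reported by summary.lm():

# Coefficient SEs from the RSS and the design matrix, compared with summary.lm()
fit <- lm(mpg ~ wt + hp, data = mtcars)
X <- model.matrix(fit)                      # includes the intercept column
rss <- sum(residuals(fit)^2)
sigma2_hat <- rss / (nrow(X) - ncol(X))     # RSS / (n - p - 1)
se_manual <- sqrt(sigma2_hat * diag(solve(t(X) %*% X)))
cbind(manual = se_manual, lm = coef(summary(fit))[, "Std. Error"])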
Exploratory data analysis will be conducted in order to:
Establish the extent of data incompleteness
Establish patterns
Establish possible correlations between the predictors
Check for linearity of the response variable
#load the required libraries and the Credit data
library(ISLR2)      # provides the Credit data set
library(tidyverse)  # dplyr, tidyr, ggplot2, ...
library(knitr)      # kable()

data("Credit")

#ensure the categorical columns are encoded as factors
Credit <- Credit %>%
  mutate(
    Own = factor(Own),
    Student = factor(Student),
    Married = factor(Married),
    Region = factor(Region)
  )

#split out the numeric and factor columns; view() opens each in the data viewer
credit_numeric <- Credit %>% select(-c(Own, Student, Married, Region)) %>% view()
credit_factors <- Credit %>% select(Own, Student, Married, Region) %>% view()

#count the missing values per column
summary_df <- Credit %>%
  summarize(across(everything(), ~ sum(is.na(.)))) %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "#Missing")

# Display the summarized and pivoted data as a table
kable(summary_df)
| Variable | #Missing |
|---|---|
| Income | 0 |
| Limit | 0 |
| Rating | 0 |
| Cards | 0 |
| Age | 0 |
| Education | 0 |
| Own | 0 |
| Student | 0 |
| Married | 0 |
| Region | 0 |
| Balance | 0 |
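As a first look at the linearity of the response (one of the EDA goals above), here is a quick faceted scatterplot sketch (ggplot2 and tidyr are loaded with the tidyverse; the choice of the four predictors shown is illustrative):

# Balance against each numeric predictor, with a linear fit overlaid
Credit %>%
  pivot_longer(c(Income, Limit, Rating, Age), names_to = "predictor") %>%
  ggplot(aes(x = value, y = Balance)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~ predictor, scales = "free_x")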
#Library for conditional formatting
library(DT)

# Correlation matrix of the numeric predictors (the +/-0.7 thresholds below
# only make sense for correlations, so cor() is used rather than cov())
correlation_df <- cor(select(Credit, -c("Own", "Student", "Married", "Region")))

# Apply conditional formatting:
# any correlation with absolute value <= 0.7 -> green (acceptable)
# any correlation with absolute value >  0.7 -> red  (potential multicollinearity)
Reduce(
  function(tbl, col) formatStyle(tbl, col,
                                 color = styleInterval(c(-0.7, 0.7), c('red', 'green', 'red'))),
  colnames(correlation_df),
  init = datatable(correlation_df)
)
From the above correlation matrix it is easy to see that there exists strong multicollinearity between several of the independent variables: \[ \lvert \rho \rvert > 0.7 \]
The direct implication of such high multicollinearity is inflated standard errors of the regression coefficients, which makes the individual estimates unstable and misleads statistical inference: truly relevant predictors can appear insignificant (a Type II error), and the signs and magnitudes of the coefficients become unreliable.
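The inflation can be quantified with variance inflation factors; a quick sketch (this assumes the car package is installed, which is not used elsewhere in this analysis):

# Variance inflation factors for the full model; values above 10 are a common
# rule-of-thumb signal of severe multicollinearity
library(car)
vif(lm(Balance ~ ., data = Credit))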
In this section we will do the following:
Using the caret library, fit the forward regression model with Balance as the response.
Establish the final model.
# Defining the control parameters for the forward selection method with cross-validation
ctrl <- trainControl(method = "cv",
                     number = 10,                 # Number of folds for cross-validation
                     selectionFunction = "oneSE")

# Fit the forward regression model using caret
model <- train(Balance ~ .,
               data = Credit,
               method = "leapForward",
               trControl = ctrl,
               tuneGrid = expand.grid(nvmax = 1:6))  # specifying tuning grid for nvmax
# Retrieve the final model
final_model <- model$finalModel
# Print the summary of cross-validated performance metrics
print(model)
## Linear Regression with Forward Selection
##
## 400 samples
## 10 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 359, 360, 360, 360, 360, 360, ...
## Resampling results across tuning parameters:
##
## nvmax RMSE Rsquared MAE
## 1 230.58913 0.7537775 175.18696
## 2 162.26589 0.8759333 121.51468
## 3 103.58619 0.9500479 84.38680
## 4 101.34101 0.9520441 82.18097
## 5 99.39332 0.9540532 79.74080
## 6 99.04843 0.9542028 79.57278
##
## RMSE was used to select the optimal model using the one SE rule.
## The final value used for the model was nvmax = 4.
coefficients(final_model,id=4)
## (Intercept) Income Limit Rating StudentYes
## -516.7182606 -7.9446339 0.1216825 2.1904387 422.6683787
summary(final_model)
## Subset selection object
## 11 Variables (and intercept)
## Forced in Forced out
## Income FALSE FALSE
## Limit FALSE FALSE
## Rating FALSE FALSE
## Cards FALSE FALSE
## Age FALSE FALSE
## Education FALSE FALSE
## OwnYes FALSE FALSE
## StudentYes FALSE FALSE
## MarriedYes FALSE FALSE
## RegionSouth FALSE FALSE
## RegionWest FALSE FALSE
## 1 subsets of each size up to 4
## Selection Algorithm: forward
## Income Limit Rating Cards Age Education OwnYes StudentYes MarriedYes
## 1 ( 1 ) " " " " "*" " " " " " " " " " " " "
## 2 ( 1 ) "*" " " "*" " " " " " " " " " " " "
## 3 ( 1 ) "*" " " "*" " " " " " " " " "*" " "
## 4 ( 1 ) "*" "*" "*" " " " " " " " " "*" " "
## RegionSouth RegionWest
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) " " " "
## 4 ( 1 ) " " " "
From the above output:
In the first row, the model includes only the Rating variable.
In the second row, the model includes the Income and Rating variables.
In the third row, the model includes the Income, Rating, and Student[Yes] variables.
In the fourth row, the model additionally includes Limit.
Therefore the final model from Forward regression is \[ \text{Balance} = -7.945\text{Income} + 2.19\text{Rating} + 422.7\text{Student}[\text{Yes}] + 0.122\text{Limit} - 516.7 \]
With the following metrics:
# Create a data frame with the metrics and values (nvmax = 4 row from the output above)
metrics <- c("RMSE", "R-squared", "MAE")
values <- c(101.34, "95.20%", 82.18)
df <- data.frame(Metric = metrics, Value = values)

# Print the table using kable
kable(df)

| Metric | Value |
|---|---|
| RMSE | 101.34 |
| R-squared | 95.20% |
| MAE | 82.18 |
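Rather than typing these numbers by hand, the cross-validated metrics for the selected model can be pulled directly from the fitted caret object; a small sketch using the model object above:

# CV metrics for the nvmax chosen by the oneSE rule
subset(model$results, nvmax == model$bestTune$nvmax)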
#plotting the final model
#install.packages("scatterplot3d")   # if not already installed
library(scatterplot3d)

# Define variables for the first scatterplot
x1 <- Credit$Balance
y1 <- Credit$Income
z1 <- Credit$Rating

# Define variables for the second scatterplot
x2 <- Credit$Balance
y2 <- Credit$Limit
z2 <- Credit$Income

# Axis limits
max_income <- max(Credit$Income)
max_limit <- max(Credit$Limit)

# Set up the layout for side-by-side plots
par(mfrow = c(1, 2))

# Plot the first scatterplot
scatterplot3d(x1, y1, z1, color = "red", pch = 16, main = "Rating, Balance, and Income",
              xlab = "Balance", ylab = "Income", zlab = "Rating", ylim = c(0, max_income))

# Plot the second scatterplot (axis labels now match the plotted variables)
scatterplot3d(x2, y2, z2, color = "blue", pch = 16, main = "Limit, Balance, and Income",
              xlab = "Balance", ylab = "Limit", zlab = "Income", ylim = c(0, max_limit))
In this section we will do the following:
Using the caret library, fit the backward regression model with Balance as the response.
Establish the final model.
# Defining the control parameters for the backward selection method with cross-validation
ctrl <- trainControl(method = "cv",
                     number = 10,                 # Number of folds for cross-validation
                     selectionFunction = "oneSE")

# Fit the backward regression model using caret
model_bw <- train(Balance ~ .,
                  data = Credit,
                  method = "leapBackward",
                  trControl = ctrl,
                  tuneGrid = expand.grid(nvmax = 1:6))  # specifying tuning grid for nvmax
# Retrieve the final model
final_model_bw <- model_bw$finalModel
# Print the summary of cross-validated performance metrics
print(model_bw)
## Linear Regression with Backwards Selection
##
## 400 samples
## 10 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 360, 360, 360, 360, 360, 360, ...
## Resampling results across tuning parameters:
##
## nvmax RMSE Rsquared MAE
## 1 233.61660 0.7540540 178.01865
## 2 164.76757 0.8701696 124.96094
## 3 103.88123 0.9499447 83.59614
## 4 98.95628 0.9545240 79.04815
## 5 99.54709 0.9540526 79.33030
## 6 98.37400 0.9550340 78.93216
##
## RMSE was used to select the optimal model using the one SE rule.
## The final value used for the model was nvmax = 4.
coefficients(final_model_bw,id=4)
## (Intercept) Income Limit Cards StudentYes
## -499.7272117 -7.8392288 0.2666445 23.1753794 429.6064203
summary(final_model_bw)
## Subset selection object
## 11 Variables (and intercept)
## Forced in Forced out
## Income FALSE FALSE
## Limit FALSE FALSE
## Rating FALSE FALSE
## Cards FALSE FALSE
## Age FALSE FALSE
## Education FALSE FALSE
## OwnYes FALSE FALSE
## StudentYes FALSE FALSE
## MarriedYes FALSE FALSE
## RegionSouth FALSE FALSE
## RegionWest FALSE FALSE
## 1 subsets of each size up to 4
## Selection Algorithm: backward
## Income Limit Rating Cards Age Education OwnYes StudentYes MarriedYes
## 1 ( 1 ) " " "*" " " " " " " " " " " " " " "
## 2 ( 1 ) "*" "*" " " " " " " " " " " " " " "
## 3 ( 1 ) "*" "*" " " " " " " " " " " "*" " "
## 4 ( 1 ) "*" "*" " " "*" " " " " " " "*" " "
## RegionSouth RegionWest
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) " " " "
## 4 ( 1 ) " " " "
From the above output:
In the first row, the model includes only the Limit variable.
In the second row, the model includes the Income and Limit variables.
In the third row, the model includes the Income, Limit, and Student[Yes] variables.
In the fourth row, the model additionally includes Cards.
Therefore the final model from Backward regression is \[ \text{Balance} = -7.839\text{Income} + 0.2667\text{Limit}+23.17\text{Cards} + 429.6\text{Student}[\text{Yes}] - 499.7 \]
With the following metrics:
# Create a data frame with the metrics and values (nvmax = 4 row from the output above)
metrics <- c("RMSE", "R-squared", "MAE")
values <- c(98.96, "95.45%", 79.05)
df <- data.frame(Metric = metrics, Value = values)

# Print the table using kable
kable(df)

| Metric | Value |
|---|---|
| RMSE | 98.96 |
| R-squared | 95.45% |
| MAE | 79.05 |
In this section we will do the following:
Using the caret library, fit the stepwise regression model with Balance as the response.
Establish the final model.
# Defining the control parameters for the stepwise selection method with cross-validation
ctrl <- trainControl(method = "cv",
                     number = 10,                 # Number of folds for cross-validation
                     selectionFunction = "oneSE")

# Fit the stepwise regression model using caret
model_both <- train(Balance ~ .,
                    data = Credit,
                    method = "leapSeq",
                    trControl = ctrl,
                    tuneGrid = expand.grid(nvmax = 1:6))  # specifying tuning grid for nvmax
# Retrieve the final model
final_model_both <- model_both$finalModel
# Print the summary of cross-validated performance metrics
print(model_both)
## Linear Regression with Stepwise Selection
##
## 400 samples
## 10 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 360, 360, 360, 359, 360, 360, ...
## Resampling results across tuning parameters:
##
## nvmax RMSE Rsquared MAE
## 1 230.92511 0.7584440 175.53263
## 2 164.67355 0.8745133 125.26604
## 3 104.60032 0.9499402 84.98184
## 4 99.43728 0.9540559 79.44548
## 5 142.35993 0.9043306 106.77332
## 6 162.15334 0.8799233 121.52676
##
## RMSE was used to select the optimal model using the one SE rule.
## The final value used for the model was nvmax = 4.
coefficients(final_model_both,id=4)
## (Intercept) Income Limit Cards StudentYes
## -499.7272117 -7.8392288 0.2666445 23.1753794 429.6064203
summary(final_model_both)
## Subset selection object
## 11 Variables (and intercept)
## Forced in Forced out
## Income FALSE FALSE
## Limit FALSE FALSE
## Rating FALSE FALSE
## Cards FALSE FALSE
## Age FALSE FALSE
## Education FALSE FALSE
## OwnYes FALSE FALSE
## StudentYes FALSE FALSE
## MarriedYes FALSE FALSE
## RegionSouth FALSE FALSE
## RegionWest FALSE FALSE
## 1 subsets of each size up to 4
## Selection Algorithm: 'sequential replacement'
## Income Limit Rating Cards Age Education OwnYes StudentYes MarriedYes
## 1 ( 1 ) " " " " "*" " " " " " " " " " " " "
## 2 ( 1 ) "*" "*" " " " " " " " " " " " " " "
## 3 ( 1 ) "*" " " "*" " " " " " " " " "*" " "
## 4 ( 1 ) "*" "*" " " "*" " " " " " " "*" " "
## RegionSouth RegionWest
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) " " " "
## 4 ( 1 ) " " " "
From the above output:
In the first row, the model includes only the Rating variable.
In the second row, the model includes the Income and Limit variables (with sequential replacement, variables chosen earlier can be swapped out; Rating was dropped here).
In the third row, the model includes the Income, Rating, and Student[Yes] variables.
In the fourth row, the model comprises Income, Limit, Cards, and Student[Yes].
Therefore the final model from stepwise regression is \[ \text{Balance} = -7.839\text{Income} + 0.2667\text{Limit}+23.17\text{Cards}+ 429.6\text{Student}[\text{Yes}]- 499.7 \]
With the following metrics:
# Create a data frame with the metrics and values (nvmax = 4 row from the output above)
metrics <- c("RMSE", "R-squared", "MAE")
values <- c(99.44, "95.41%", 79.45)
df <- data.frame(Metric = metrics, Value = values)

# Print the table using kable
kable(df)

| Metric | Value |
|---|---|
| RMSE | 99.44 |
| R-squared | 95.41% |
| MAE | 79.45 |
In this section we will do the following:
Using the caret library, fit the Ridge, Lasso, and Elastic Net regression models with Balance as the response.
Establish the final model.
# Load necessary packages
library(glmnet)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
## Loaded glmnet 4.1-8
library(caret)
library(tidyverse)
# Set seed for reproducibility
set.seed(123)
# Create a data partition
train_index <- createDataPartition(Credit$Balance, p = 0.8, list = FALSE)
train_set <- Credit[train_index, ]
test_set <- Credit[-train_index, ]
# Predictor variables
x_train <- model.matrix(Balance ~ ., data = train_set)[,-1]
# Outcome variable
y_train <- train_set$Balance
# Fit the ridge regression model
ridge_model <- glmnet(x_train, y_train, alpha = 0)
# Find the best lambda using cross-validation
set.seed(123)
cv <- cv.glmnet(x_train, y_train, alpha = 0)
# Fit the final model on the training data using the minimum lambda
model_ridge <- glmnet(x_train, y_train, alpha = 0, lambda = cv$lambda.min)
# Display regression coefficients
coef(model_ridge)
## 12 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) -431.0013019
## Income -5.4669947
## Limit 0.1162357
## Rating 1.6702323
## Cards 12.3089206
## Age -0.6981572
## Education 1.2873121
## OwnYes -7.0313702
## StudentYes 386.0383178
## MarriedYes -15.0870029
## RegionSouth 15.6815173
## RegionWest 20.0473809
# Make predictions on the test data
x_test <- model.matrix(Balance ~., test_set)[,-1]
predictions <- model_ridge %>% predict(x_test) %>% as.vector()
# Model performance metrics
data.frame(
  RMSE = RMSE(predictions, test_set$Balance),
  Rsquare = R2(predictions, test_set$Balance)
)
## RMSE Rsquare
## 1 104.4455 0.962792
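Note that cv.glmnet also reports lambda.1se, the largest \(\lambda\) whose cross-validated error is within one standard error of the minimum; it is the glmnet analogue of the oneSE rule used with caret above, and a more conservative alternative to the lambda.min used here:

# The two candidate lambdas from the ridge cross-validation above
c(lambda_min = cv$lambda.min, lambda_1se = cv$lambda.1se)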
# Fit the Lasso regression model
lasso_model <- glmnet(x_train, y_train, alpha = 1)
# Find the best lambda using cross-validation
set.seed(123)
cv <- cv.glmnet(x_train, y_train, alpha = 1)
# Fit the final model on the training data using the minimum lambda
model_lasso <- glmnet(x_train, y_train, alpha = 1, lambda = cv$lambda.min)
# Display Lasso regression coefficients
coef(model_lasso)
## 12 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) -480.9840186
## Income -7.8328211
## Limit 0.2224328
## Rating 0.6469600
## Cards 15.6345469
## Age -0.2982869
## Education .
## OwnYes -6.5649578
## StudentYes 423.7889695
## MarriedYes -10.0150676
## RegionSouth 10.7656973
## RegionWest 15.3186326
# Make predictions on the test data
x_test <- model.matrix(Balance ~., test_set)[,-1]
predictions <- model_lasso %>% predict(x_test) %>% as.vector()
# Model performance metrics
data.frame(
  RMSE = RMSE(predictions, test_set$Balance),
  Rsquare = R2(predictions, test_set$Balance)
)
## RMSE Rsquare
## 1 97.83669 0.9600978
# Find the best lambda using cross-validation
set.seed(123)
cv <- cv.glmnet(x_train, y_train, alpha = 0.1)
# Fit the final model on the training data using the minimum lambda
model_en <- glmnet(x_train, y_train, alpha = 0.1, lambda = cv$lambda.min)
# Display regression coefficients
coef(model_en)
## 12 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) -494.89212404
## Income -7.96242557
## Limit 0.19796624
## Rating 1.04361311
## Cards 14.55564456
## Age -0.34253069
## Education -0.02913358
## OwnYes -9.41879557
## StudentYes 425.48555727
## MarriedYes -14.30258432
## RegionSouth 17.00957983
## RegionWest 23.66858422
# Make predictions on the test data
x_test <- model.matrix(Balance ~., test_set)[,-1]
predictions <- model_en %>% predict(x_test) %>% as.vector()
# Model performance metrics
data.frame(
  RMSE = RMSE(predictions, test_set$Balance),
  Rsquare = R2(predictions, test_set$Balance)
)
## RMSE Rsquare
## 1 97.4995 0.9602919
# Create a data frame with the test-set results from the output above
results <- data.frame(Method = c("Ridge", "Lasso", "Elastic Net"),
                      RMSE = c(104.4455, 97.8367, 97.4995),
                      Rsquare = c(0.9627920, 0.9600978, 0.9602919))

kable(results, align = "c", caption = "Model Performance Metrics")

| Method | RMSE | Rsquare |
|---|---|---|
| Ridge | 104.4455 | 0.9627920 |
| Lasso | 97.8367 | 0.9600978 |
| Elastic Net | 97.4995 | 0.9602919 |
# Retrieve the coefficients from the fitted Lasso model (model_lasso)
lasso_coefficients <- coef(model_lasso)
# Print coefficients
print(lasso_coefficients)
## 12 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) -480.9840186
## Income -7.8328211
## Limit 0.2224328
## Rating 0.6469600
## Cards 15.6345469
## Age -0.2982869
## Education .
## OwnYes -6.5649578
## StudentYes 423.7889695
## MarriedYes -10.0150676
## RegionSouth 10.7656973
## RegionWest 15.3186326
# Identify the nonzero (retained) coefficients
significant_indices <- which(lasso_coefficients != 0)
significant_predictors <- rownames(lasso_coefficients)[significant_indices]

# Print the predictors retained by the lasso
print(significant_predictors)
## [1] "(Intercept)" "Income" "Limit" "Rating" "Cards"
## [6] "Age" "OwnYes" "StudentYes" "MarriedYes" "RegionSouth"
## [11] "RegionWest"
From the above output, Lasso and Elastic Net perform almost identically on the test set (RMSE 97.84 vs 97.50) and both clearly beat Ridge on RMSE. Lasso is preferred here because it also performs variable selection, shrinking the Education coefficient to exactly zero. With an RMSE of 97.84 and an \(R^2\) of 96.0%, the final model from Lasso regression is:
\[ \text{Balance} = -7.83 \times \text{Income} + 0.22 \times \text{Limit} + 0.65 \times \text{Rating} + 15.63 \times \text{Cards} - 0.30 \times \text{Age} - 6.56 \times \text{Own[Yes]} \] \[ + 423.79 \times \text{Student[Yes]} - 10.02 \times \text{Married[Yes]} + 10.77 \times \text{Region[South]} + 15.32 \times \text{Region[West]} - 480.98 \]
Here, each coefficient represents the effect of the corresponding predictor variable on the Balance, holding the other variables constant. For example, the coefficient of -7.83 for Income indicates that for each unit increase in Income, the Balance is expected to decrease by 7.83 units, while holding the other variables constant. Education does not appear in the model because the lasso shrank its coefficient to exactly zero.
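This interpretation can be checked numerically: increasing Income by one unit for a single observation changes the prediction by exactly the Income coefficient. A small sketch using the x_test matrix built above:

# Bump Income by 1 for one test observation and compare predictions
x0 <- x_test[1, , drop = FALSE]
x1 <- x0
x1[, "Income"] <- x1[, "Income"] + 1
predict(model_lasso, newx = x1) - predict(model_lasso, newx = x0)  # equals the Income coefficient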
library(ISLR)
##
## Attaching package: 'ISLR'
## The following object is masked _by_ '.GlobalEnv':
##
## Credit
## The following objects are masked from 'package:ISLR2':
##
## Auto, Credit
data(Hitters)
#Hitters%>%select(everything())%>%View()
summary_df_hit <- Hitters %>%
  summarize(across(everything(), ~ sum(is.na(.)))) %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "#Missing")
# Display the summarized and pivoted data as a table
kable(summary_df_hit)
| Variable | #Missing |
|---|---|
| AtBat | 0 |
| Hits | 0 |
| HmRun | 0 |
| Runs | 0 |
| RBI | 0 |
| Walks | 0 |
| Years | 0 |
| CAtBat | 0 |
| CHits | 0 |
| CHmRun | 0 |
| CRuns | 0 |
| CRBI | 0 |
| CWalks | 0 |
| League | 0 |
| Division | 0 |
| PutOuts | 0 |
| Assists | 0 |
| Errors | 0 |
| Salary | 59 |
| NewLeague | 0 |
From the analysis we can observe that Salary has 59 missing values.
We will impute the missing values using the median.
The table below confirms that the Salary column was successfully imputed.
Hitters <- Hitters %>%
  mutate(Salary = ifelse(is.na(Salary), median(Salary, na.rm = TRUE), Salary))

summary_df_hit2 <- Hitters %>%
  summarize(across(everything(), ~ sum(is.na(.)))) %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "#Missing")
# Display the summarized and pivoted data as a table
kable(summary_df_hit2)
| Variable | #Missing |
|---|---|
| AtBat | 0 |
| Hits | 0 |
| HmRun | 0 |
| Runs | 0 |
| RBI | 0 |
| Walks | 0 |
| Years | 0 |
| CAtBat | 0 |
| CHits | 0 |
| CHmRun | 0 |
| CRuns | 0 |
| CRBI | 0 |
| CWalks | 0 |
| League | 0 |
| Division | 0 |
| PutOuts | 0 |
| Assists | 0 |
| Errors | 0 |
| Salary | 0 |
| NewLeague | 0 |
#Library for conditional formatting (DT is already loaded above)
# Calculate the correlation matrix for the numeric columns (as above, the
# +/-0.7 thresholds only make sense for correlations, so cor() is used)
cor_hitters <- cor(select(Hitters, where(is.numeric)))

# Apply conditional formatting:
# any correlation with absolute value <= 0.7 -> green (acceptable)
# any correlation with absolute value >  0.7 -> red  (potential multicollinearity)
Reduce(
  function(tbl, col) formatStyle(tbl, col,
                                 color = styleInterval(c(-0.7, 0.7), c('red', 'green', 'red'))),
  colnames(cor_hitters),
  init = datatable(cor_hitters)
)
From the above correlation matrix it is easy to see that there exists strong multicollinearity between several of the independent variables: \[ \lvert \rho \rvert > 0.7 \]
The direct implication of such high multicollinearity is inflated standard errors of the regression coefficients, which makes the individual estimates unstable and misleads statistical inference: truly relevant predictors can appear insignificant (a Type II error), and the signs and magnitudes of the coefficients become unreliable.
In this section we will do the following:
Using the caret library, fit the forward regression model with Salary as the response.
Establish the final model.
# Defining the control parameters for the forward selection method with cross-validation
ctrl <- trainControl(method = "cv",
                     number = 10,                 # Number of folds for cross-validation
                     selectionFunction = "oneSE")

# Fit the forward regression model using caret
model <- train(Salary ~ .,
               data = Hitters,
               method = "leapForward",
               trControl = ctrl,
               tuneGrid = expand.grid(nvmax = 1:19))  # specifying tuning grid for nvmax
# Retrieve the final model
final_model <- model$finalModel
# Print the summary of cross-validated performance metrics
print(model)
## Linear Regression with Forward Selection
##
## 322 samples
## 19 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 290, 289, 290, 289, 290, 290, ...
## Resampling results across tuning parameters:
##
## nvmax RMSE Rsquared MAE
## 1 352.7571 0.2938096 251.0881
## 2 343.3796 0.3248059 243.4211
## 3 347.3429 0.3156264 247.3834
## 4 339.7834 0.3373238 245.7417
## 5 334.0857 0.3588010 242.0721
## 6 328.3922 0.3743058 237.7368
## 7 331.8928 0.3721921 237.4194
## 8 326.1281 0.3937233 235.9920
## 9 326.7185 0.3945905 236.4108
## 10 324.2184 0.3960174 236.9005
## 11 327.5217 0.3846775 240.0618
## 12 327.3180 0.3830222 240.7650
## 13 326.3354 0.3865431 239.5034
## 14 326.0394 0.3890317 238.8098
## 15 324.1536 0.3942695 237.2391
## 16 323.0666 0.3968692 237.0025
## 17 324.2417 0.3934248 237.8743
## 18 323.9227 0.3936974 237.5771
## 19 324.0420 0.3933971 237.6676
##
## RMSE was used to select the optimal model using the one SE rule.
## The final value used for the model was nvmax = 2.
coefficients(final_model,id=2)
## (Intercept) Hits CRuns
## 57.9708916 2.6870742 0.5188723
summary(final_model,id=2)
## Subset selection object
## 19 Variables (and intercept)
## Forced in Forced out
## AtBat FALSE FALSE
## Hits FALSE FALSE
## HmRun FALSE FALSE
## Runs FALSE FALSE
## RBI FALSE FALSE
## Walks FALSE FALSE
## Years FALSE FALSE
## CAtBat FALSE FALSE
## CHits FALSE FALSE
## CHmRun FALSE FALSE
## CRuns FALSE FALSE
## CRBI FALSE FALSE
## CWalks FALSE FALSE
## LeagueN FALSE FALSE
## DivisionW FALSE FALSE
## PutOuts FALSE FALSE
## Assists FALSE FALSE
## Errors FALSE FALSE
## NewLeagueN FALSE FALSE
## 1 subsets of each size up to 2
## Selection Algorithm: forward
## AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI
## 1 ( 1 ) " " " " " " " " " " " " " " " " " " " " "*" " "
## 2 ( 1 ) " " "*" " " " " " " " " " " " " " " " " "*" " "
## CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
## 1 ( 1 ) " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " "
From the above output:
In the first row, the model includes only the CRuns variable.
In the second row, the model includes the Hits and CRuns variables.
Based on the output provided above, the one standard error rule gives an optimal model with nvmax = 2, resulting in the following linear model:
\[ \text{Salary} = 2.69\text{Hits} + 0.52\text{CRuns} + 57.97 \]
With the following metrics:
# Create a data frame with the metrics and values
metrics <- c("RMSE", "R-squared", "MAE")
values <- c(343.4, "32.5%", 243.4)
df <- data.frame(Metric = metrics, Value = values)
# Print the table using kable
kable(df)
| Metric | Value |
|---|---|
| RMSE | 343.4 |
| R-squared | 32.5% |
| MAE | 243.4 |
In this section we will do the following:
Using the caret library, fit the backward regression model with Salary as the response.
Establish the final model.
# Defining the control parameters for the backward selection method with cross-validation
ctrl <- trainControl(method = "cv",
                     number = 10,                 # Number of folds for cross-validation
                     selectionFunction = "oneSE")

# Fit the backward regression model using caret
model_bw <- train(Salary ~ .,
                  data = Hitters,
                  method = "leapBackward",
                  trControl = ctrl,
                  tuneGrid = expand.grid(nvmax = 1:6))  # specifying tuning grid for nvmax
# Retrieve the final model
final_model_bw <- model_bw$finalModel
# Print the summary of cross-validated performance metrics
print(model_bw)
## Linear Regression with Backwards Selection
##
## 322 samples
## 19 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 290, 289, 289, 290, 290, 290, ...
## Resampling results across tuning parameters:
##
## nvmax RMSE Rsquared MAE
## 1 355.4987 0.2679713 250.3121
## 2 336.3418 0.3528149 237.9877
## 3 347.5916 0.3214210 246.1997
## 4 342.8978 0.3499113 242.5172
## 5 334.4856 0.3790217 240.0144
## 6 325.9773 0.4024042 235.6599
##
## RMSE was used to select the optimal model using the one SE rule.
## The final value used for the model was nvmax = 2.
coefficients(final_model_bw,id=2)
## (Intercept) Hits CRuns
## 57.9708916 2.6870742 0.5188723
summary(final_model_bw,id=2)
## Subset selection object
## 19 Variables (and intercept)
## Forced in Forced out
## AtBat FALSE FALSE
## Hits FALSE FALSE
## HmRun FALSE FALSE
## Runs FALSE FALSE
## RBI FALSE FALSE
## Walks FALSE FALSE
## Years FALSE FALSE
## CAtBat FALSE FALSE
## CHits FALSE FALSE
## CHmRun FALSE FALSE
## CRuns FALSE FALSE
## CRBI FALSE FALSE
## CWalks FALSE FALSE
## LeagueN FALSE FALSE
## DivisionW FALSE FALSE
## PutOuts FALSE FALSE
## Assists FALSE FALSE
## Errors FALSE FALSE
## NewLeagueN FALSE FALSE
## 1 subsets of each size up to 2
## Selection Algorithm: backward
## AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI
## 1 ( 1 ) " " " " " " " " " " " " " " " " " " " " "*" " "
## 2 ( 1 ) " " "*" " " " " " " " " " " " " " " " " "*" " "
## CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
## 1 ( 1 ) " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " "
From the above output:
In the first row, the model includes only the CRuns variable.
In the second row, the model includes the Hits and CRuns variables.
Based on the output provided above, the one standard error rule gives an optimal model with nvmax = 2, resulting in the following linear model: \[ \text{Salary} = 2.69\text{Hits} + 0.52\text{CRuns} + 57.97 \]
With the following metrics:
# Create a data frame with the metrics and values (nvmax = 2 row from the output above)
metrics <- c("RMSE", "R-squared", "MAE")
values <- c(336.3, "35.3%", 238.0)
df <- data.frame(Metric = metrics, Value = values)

# Print the table using kable
kable(df)

| Metric | Value |
|---|---|
| RMSE | 336.3 |
| R-squared | 35.3% |
| MAE | 238.0 |
In this section we will do the following:
Using the caret library, fit the stepwise regression model with Salary as the response.
Establish the final model.
# Defining the control parameters for the stepwise selection method with cross-validation
ctrl <- trainControl(method = "cv",
                     number = 10,                 # Number of folds for cross-validation
                     selectionFunction = "oneSE")

# Fit the stepwise regression model using caret
model_both <- train(Salary ~ .,
                    data = Hitters,
                    method = "leapSeq",
                    trControl = ctrl,
                    tuneGrid = expand.grid(nvmax = 1:6))  # specifying tuning grid for nvmax
# Retrieve the final model
final_model_both <- model_both$finalModel
# Print the summary of cross-validated performance metrics
print(model_both)
## Linear Regression with Stepwise Selection
##
## 322 samples
## 19 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 289, 289, 288, 290, 290, 290, ...
## Resampling results across tuning parameters:
##
## nvmax RMSE Rsquared MAE
## 1 357.9762 0.2602810 252.5547
## 2 344.6466 0.3293830 242.6623
## 3 348.3695 0.2955762 249.8461
## 4 347.6043 0.3200350 247.6202
## 5 342.4594 0.3470465 246.7261
## 6 327.0878 0.3784566 236.1341
##
## RMSE was used to select the optimal model using the one SE rule.
## The final value used for the model was nvmax = 2.
coefficients(final_model_both,id=2)
## (Intercept) Hits CRBI
## 61.6032482 2.8026922 0.5175608
summary(final_model_both,id=2)
## Subset selection object
## 19 Variables (and intercept)
## Forced in Forced out
## AtBat FALSE FALSE
## Hits FALSE FALSE
## HmRun FALSE FALSE
## Runs FALSE FALSE
## RBI FALSE FALSE
## Walks FALSE FALSE
## Years FALSE FALSE
## CAtBat FALSE FALSE
## CHits FALSE FALSE
## CHmRun FALSE FALSE
## CRuns FALSE FALSE
## CRBI FALSE FALSE
## CWalks FALSE FALSE
## LeagueN FALSE FALSE
## DivisionW FALSE FALSE
## PutOuts FALSE FALSE
## Assists FALSE FALSE
## Errors FALSE FALSE
## NewLeagueN FALSE FALSE
## 1 subsets of each size up to 2
## Selection Algorithm: 'sequential replacement'
## AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI
## 1 ( 1 ) " " " " " " " " " " " " " " " " " " " " "*" " "
## 2 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
## CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
## 1 ( 1 ) " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " "
From the above output:
In the first row, the model includes only the CRuns variable.
In the second row, the model includes the Hits and CRBI variables (sequential replacement swapped CRuns out for CRBI).
Based on the output provided above, the one standard error rule gives an optimal model with nvmax = 2, resulting in the following linear model: \[ \text{Salary} = 2.80\text{Hits} + 0.52\text{CRBI} + 61.6 \]
With the following metrics:
# Create a data frame with the metrics and values
metrics <- c("RMSE", "R-squared", "MAE")
values <- c(344.65, "32.9%", 242.7)
df <- data.frame(Metric = metrics, Value = values)
# Print the table using kable
kable(df)
| Metric | Value |
|---|---|
| RMSE | 344.65 |
| R-squared | 32.9% |
| MAE | 242.7 |
In this section we will do the following:
Using the caret library, fit the Ridge, Lasso, and Elastic Net regression models with Salary as the response.
Establish the final model.
# Set seed for reproducibility
set.seed(123)
# Create a data partition
train_index <- createDataPartition(Hitters$Salary, p = 0.8, list = FALSE)
train_set <- Hitters[train_index, ]
test_set <- Hitters[-train_index, ]
# Predictor variables
x_train <- model.matrix(Salary ~ ., data = train_set)[,-1]
# Outcome variable
y_train <- train_set$Salary
ridge_model <- glmnet(x_train, y_train, alpha = 0)
# Find the best lambda using cross-validation
set.seed(123)
cv <- cv.glmnet(x_train, y_train, alpha = 0)
# Fit the final model on the training data using the minimum lambda
model_ridge <- glmnet(x_train, y_train, alpha = 0, lambda = cv$lambda.min)
# Display regression coefficients
coef(model_ridge)
## 20 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 149.094338926
## AtBat -0.026351452
## Hits 0.631595908
## HmRun 1.006411578
## Runs 0.937189751
## RBI 0.896224399
## Walks 1.438906787
## Years -1.301690125
## CAtBat 0.009490935
## CHits 0.051148718
## CHmRun 0.300233880
## CRuns 0.105898602
## CRBI 0.096299283
## CWalks 0.038763702
## LeagueN 25.024070269
## DivisionW -91.786426702
## PutOuts 0.166351009
## Assists 0.133157453
## Errors -2.800253005
## NewLeagueN -3.541849977
# Make predictions on the test data
x_test <- model.matrix(Salary ~., test_set)[,-1]
predictions <- model_ridge %>% predict(x_test) %>% as.vector()
# Model performance metrics
data.frame(
  RMSE = RMSE(predictions, test_set$Salary),
  Rsquare = R2(predictions, test_set$Salary)
)
## RMSE Rsquare
## 1 299.1887 0.4834495
lasso_model <- glmnet(x_train, y_train, alpha = 1)
# Find the best lambda using cross-validation
set.seed(123)
cv <- cv.glmnet(x_train, y_train, alpha = 1)
# Fit the final model on the training data using the minimum lambda
model_lasso <- glmnet(x_train, y_train, alpha = 1, lambda = cv$lambda.min)
# Display Lasso regression coefficients
coef(model_lasso)
## 20 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 333.52945997
## AtBat -1.91341335
## Hits 5.13939816
## HmRun 2.41315679
## Runs 0.45351575
## RBI 1.17401457
## Walks 4.05707149
## Years -17.82966629
## CAtBat .
## CHits .
## CHmRun 0.09376282
## CRuns 0.76275991
## CRBI 0.28163504
## CWalks -0.48304804
## LeagueN 94.57372906
## DivisionW -131.69260973
## PutOuts 0.24665996
## Assists 0.48933206
## Errors -6.57440972
## NewLeagueN -70.37322651
# Make predictions on the test data
x_test <- model.matrix(Salary ~., test_set)[,-1]
predictions <- model_lasso %>% predict(x_test) %>% as.vector()
# Model performance metrics
data.frame(
  RMSE = RMSE(predictions, test_set$Salary),
  Rsquare = R2(predictions, test_set$Salary)
)
## RMSE Rsquare
## 1 287.0566 0.5237959
# Find the best lambda using cross-validation
set.seed(123)
cv <- cv.glmnet(x_train, y_train, alpha = 0.1)
# Fit the final model on the training data using the minimum lambda
model_en <- glmnet(x_train, y_train, alpha = 0.1, lambda = cv$lambda.min)
# Display regression coefficients
coef(model_en)
## 20 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 293.9354097
## AtBat -1.2347188
## Hits 2.9462509
## HmRun 0.8718318
## Runs 1.4853760
## RBI 1.3727764
## Walks 3.0715062
## Years -16.7807437
## CAtBat .
## CHits 0.1211086
## CHmRun 0.2913277
## CRuns 0.4012250
## CRBI 0.2083505
## CWalks -0.2908127
## LeagueN 88.7486079
## DivisionW -132.1801612
## PutOuts 0.2461816
## Assists 0.4439587
## Errors -7.0980713
## NewLeagueN -65.4366729
# Make predictions on the test data
x_test <- model.matrix(Salary ~., test_set)[,-1]
predictions <- model_en %>% predict(x_test) %>% as.vector()
# Model performance metrics
data.frame(
  RMSE = RMSE(predictions, test_set$Salary),
  Rsquare = R2(predictions, test_set$Salary)
)
## RMSE Rsquare
## 1 291.408 0.5080909
From the above output it is clear that Lasso is the superior method: it has the lowest test RMSE (287.06) and an \(R^2\) of 52.4%.
Therefore the final model from Lasso regression is: \[ \text{Salary} = 333.53 - 1.91 \times \text{AtBat} + 5.14 \times \text{Hits} + 2.41 \times \text{HmRun} + 0.45 \times \text{Runs} + 1.17 \times \text{RBI} + 4.06 \times \text{Walks} - 17.83 \times \text{Years} \] \[ + 0.09 \times \text{CHmRun} + 0.76 \times \text{CRuns} + 0.28 \times \text{CRBI} - 0.48 \times \text{CWalks} + 94.57 \times \text{LeagueN} - 131.69 \times \text{DivisionW} \] \[ + 0.25 \times \text{PutOuts} + 0.49 \times \text{Assists} - 6.57 \times \text{Errors} - 70.37 \times \text{NewLeagueN} \] Here, each coefficient represents the effect of the corresponding predictor variable on the Salary, holding the other variables constant. For example, the coefficient of -17.83 for Years indicates that for each additional year, the Salary is expected to decrease by 17.83 units, while holding the other variables constant. CAtBat and CHits do not appear because the lasso shrank their coefficients to exactly zero.
Comparing all six methods, it is evident that the regularized approaches (Lasso, Ridge, and Elastic Net) outperformed the subset-selection methods in terms of both RMSE and \(R^2\) on both datasets. Lasso had the lowest RMSE and the highest \(R^2\) on the Hitters data and was essentially tied with Elastic Net on the Credit data, while the subset-selection methods (forward, backward, and stepwise) trailed on both metrics, most visibly on the Hitters data.
For the prediction of Balance in the Credit dataset and Salary in the Hitters dataset, the Lasso method should therefore be used in the modelling.
| Variable selection method | RMSE | R2 | MAE |
|---|---|---|---|
| Lasso | 97.8 | 96.0% | |
| Ridge | 104.4 | 96.3% | |
| Elastic Net | 97.5 | 96.0% | |
| Forward regression | 101.3 | 95.2% | 82.2 |
| Backward regression | 99.0 | 95.5% | 79.0 |
| Stepwise | 99.4 | 95.4% | 79.4 |
| Variable selection method | RMSE | R2 | MAE |
|---|---|---|---|
| Lasso | 287.1 | 52.4% | |
| Ridge | 299.2 | 48.3% | |
| Elastic Net | 291.4 | 50.8% | |
| Forward regression | 343.4 | 32.5% | 243.4 |
| Backward regression | 336.3 | 35.3% | 238.0 |
| Stepwise | 344.6 | 32.9% | 242.7 |
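As a reproducibility note, the glmnet rows of the Hitters table above can be regenerated directly from the fitted objects rather than typed by hand; a small helper sketch (RMSE() and R2() come from caret, and the model objects are those fitted in the Hitters section):

# Recompute the test-set metrics for the three glmnet fits
eval_on_test <- function(fit, x, y) {
  preds <- as.vector(predict(fit, newx = x))
  c(RMSE = RMSE(preds, y), Rsquare = R2(preds, y))
}

rbind(
  Lasso = eval_on_test(model_lasso, x_test, test_set$Salary),
  Ridge = eval_on_test(model_ridge, x_test, test_set$Salary),
  "Elastic Net" = eval_on_test(model_en, x_test, test_set$Salary)
)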