The objective of this exercise is to predict the salary of a baseball player based on certain statistics associated with their performance over the year.
We start by installing/loading the glmnet and ISLR packages, and we load the Hitters dataset into R as well.
# install.packages("glmnet")
# install.packages("ISLR")
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-8
library(ISLR)
data("Hitters")
head(Hitters)
## AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun
## -Andy Allanson 293 66 1 30 29 14 1 293 66 1
## -Alan Ashby 315 81 7 24 38 39 14 3449 835 69
## -Alvin Davis 479 130 18 66 72 76 3 1624 457 63
## -Andre Dawson 496 141 20 65 78 37 11 5628 1575 225
## -Andres Galarraga 321 87 10 39 42 30 2 396 101 12
## -Alfredo Griffin 594 169 4 74 51 35 11 4408 1133 19
## CRuns CRBI CWalks League Division PutOuts Assists Errors
## -Andy Allanson 30 29 14 A E 446 33 20
## -Alan Ashby 321 414 375 N W 632 43 10
## -Alvin Davis 224 266 263 A W 880 82 14
## -Andre Dawson 828 838 354 N E 200 11 3
## -Andres Galarraga 48 46 33 N E 805 40 4
## -Alfredo Griffin 501 336 194 A W 282 421 25
## Salary NewLeague
## -Andy Allanson NA A
## -Alan Ashby 475.0 N
## -Alvin Davis 480.0 A
## -Andre Dawson 500.0 N
## -Andres Galarraga 91.5 N
## -Alfredo Griffin 750.0 A
Explore the dataset: determine its size, its variables, and its individuals. Are there many missing values? Use the ‘na.omit()’ function to remove the individuals for whom some data is missing.
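Before cleaning the data, a quick exploration along these lines (a sketch using only base R functions) answers the questions above:
dim(Hitters)                 # number of individuals (rows) and variables (columns)
str(Hitters)                 # variable names and types
sum(is.na(Hitters))          # total number of missing values
sum(is.na(Hitters$Salary))   # the missing values are concentrated in Salary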
The ‘glmnet()’ function allows, among other things, the estimation of the parameters of a linear model explaining salary from the other predictors by regularizing the least squares criterion; here we consider Ridge or Lasso penalties. Consult the documentation of the ‘glmnet()’ function. It takes as input a matrix ‘x’ of covariates (one column per quantitative variable; the qualitative variables are not included in this linear model). Identify the qualitative variables and the target variable, then construct a matrix (denoted ‘X’) containing only the quantitative covariates.
Hitters <- na.omit(Hitters)   # remove the players with missing values (mostly missing Salary)
# Qualitative variables: League, Division, NewLeague; target variable: Salary
# Keep only the quantitative covariates in the design matrix X
quant.vars <- setdiff(names(Hitters), c("Salary", "League", "Division", "NewLeague"))
X <- as.matrix(Hitters[, quant.vars])
# Ridge estimator (alpha = 0)
ridge.mod <- glmnet(X, Hitters$Salary, alpha = 0)
The parameter \(α\) controls the form of the penalty: the glmnet() function computes (an approximation of) the solution to the following elastic-net problem, where \(α = 0\) corresponds to the Ridge penalty and \(α = 1\) to the Lasso penalty:
\[ \min_{\beta \in \mathbb{R}^d} \left\{ \lVert Y - X\beta \rVert_2^2 + \lambda \left( \alpha \lVert \beta \rVert_1 + (1-\alpha) \frac{1}{2} \lVert \beta \rVert_2^2 \right) \right\} \]
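As an illustration only, an intermediate value of \(α\) gives an elastic-net compromise between the two penalties (the name enet.mod below is purely illustrative):
# Elastic-net fit, alpha = 0.5 chosen arbitrarily for illustration
enet.mod <- glmnet(X, Hitters$Salary, alpha = 0.5)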
Similarly, proceed to calculate a Lasso-type estimator. Store the result in the variable lasso.mod.
The glmnet() function computes estimators for various values of λ. The default grid of λ values can be retrieved with the following command:
ridge.mod$lambda
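For instance, one can inspect the size and range of this grid, and extract the coefficients associated with one particular value (a quick sketch):
length(ridge.mod$lambda)                     # number of lambda values on the default grid
range(ridge.mod$lambda)                      # smallest and largest lambda
coef(ridge.mod, s = ridge.mod$lambda[50])    # coefficients for one value of the grid (50th, chosen arbitrarily)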
# Lasso-type estimator (alpha = 1), stored in lasso.mod
lasso.mod <- glmnet(X, Hitters$Salary, alpha = 1)
We will plot the evolution of the coefficients as a function of λ (see ?plot.glmnet).
par(mfrow=c(1,2))
nbvar = ncol(X)
plot(ridge.mod,xvar='lambda',main="Ridge")
plot(lasso.mod,xvar='lambda',main='Lasso')
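To identify which curve corresponds to which variable, plot.glmnet also accepts a label argument; for example:
plot(lasso.mod, xvar = 'lambda', label = TRUE, main = 'Lasso (labelled coefficient paths)')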
For the Lasso estimator, we can also plot the number of selected variables as a function of \(λ\).
plot(lasso.mod$lambda,lasso.mod$df,type='s')
The function ‘cv.glmnet()’ allows for the selection of \(λ\) through cross-validation.
ridge.mod.cv = cv.glmnet(X,Hitters$Salary,alpha=0)
lasso.mod.cv = cv.glmnet(X,Hitters$Salary,alpha=1)
par(mfrow=c(1,2))
plot(ridge.mod.cv,main="Ridge") # Plot the cross-validation error
plot(lasso.mod.cv,main="Lasso")
Once a model is selected, we can display the value of λ chosen by cross-validation and the coefficients of the corresponding estimator \(\hat{\beta}\).
lambda.opt.ridge = ridge.mod.cv$lambda.min
coeffs.Ridge.cv = coef(ridge.mod.cv,s="lambda.min")
coeffs.Ridge.cv
## 6 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 540.807429
## V1 -2.485928
## V2 -3.356149
## V3 10.807330
## V4 -3.189787
## V5 -12.209296
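Besides lambda.min, cv.glmnet also stores lambda.1se, the largest \(λ\) whose cross-validated error is within one standard error of the minimum; it gives a more heavily regularized model. A quick sketch:
lambda.1se.ridge = ridge.mod.cv$lambda.1se
coef(ridge.mod.cv, s = "lambda.1se")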
Do the same with Lasso and observe.
lambda.opt.lasso = lasso.mod.cv$lambda.min
coeffs.lasso.cv = coef(lasso.mod.cv,s="lambda.min")
coeffs.lasso.cv
## 6 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 541.35305
## V1 .
## V2 .
## V3 53.31443
## V4 .
## V5 -66.72442
A dot ‘.’ in place of a coefficient indicates that the corresponding variable has a zero coefficient, i.e. it is not selected by the Lasso.
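If needed, the selected variables can be listed directly from the sparse coefficient matrix (a small sketch; the name nonzero is illustrative):
nonzero = which(coeffs.lasso.cv[, 1] != 0)    # non-zero coefficients, including the intercept
setdiff(names(nonzero), "(Intercept)")        # names of the selected covariates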
For the Lasso, we can also select \(λ\) with a BIC-type criterion, \(\mathrm{BIC}(\lambda) = \mathrm{RSS}(\lambda)/\hat{\sigma}^2 + \log(n)\,\mathrm{df}(\lambda)\), where \(\mathrm{df}(\lambda)\) is the number of variables selected for that value of \(λ\) and \(\hat{\sigma}^2\) is an estimate of the noise variance.
# Assuming you have already fitted the Lasso model (lasso.mod) and created the X matrix
# Calculate predictions for the values of lambda in the grid and all individuals
predictions <- predict(lasso.mod, newx = X, type = "response")
# Estimate the noise variance from the least-penalized fit (last column = smallest lambda)
sigmahat2 <- mean((predictions[, ncol(predictions)] - Hitters$Salary)^2)
# Calculate the BIC criterion
BIC.crit <- colSums((predictions - Hitters$Salary)^2) / sigmahat2 + log(nrow(Hitters)) * lasso.mod$df
# Plot the BIC criterion
plot(BIC.crit, type = 'l', xlab = 'Index', ylab = 'BIC Criterion', main = 'BIC Criterion for Lasso')
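Since the x-axis above is only the position in the \(λ\) grid, it can be more readable to plot the criterion against \(\log(λ)\) directly (sketch):
plot(log(lasso.mod$lambda), BIC.crit, type = 'l',
     xlab = 'log(lambda)', ylab = 'BIC Criterion', main = 'BIC Criterion for Lasso')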
Find the minimum of the BIC criterion and compare it with the value of \(λ\) selected by cross-validation.
# Find the index of the minimum BIC criterion
min_index <- which.min(BIC.crit)
# Get the corresponding value of lambda
lambda_min_BIC <- lasso.mod$lambda[min_index]
# Print the minimum index and corresponding lambda
cat("Index of Minimum BIC Criterion:", min_index, "\n")
## Index of Minimum BIC Criterion: 1
cat("Corresponding Lambda:", lambda_min_BIC, "\n")
## Corresponding Lambda: 40.22725
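To complete the comparison asked for above, we can print the \(λ\) selected by cross-validation next to it (its exact value depends on the folds):
cat("Lambda selected by cross-validation:", lasso.mod.cv$lambda.min, "\n")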
We now have various estimators:
1 - The least squares estimator (if it exists),
2 - The Ridge estimator with λ selected by cross-validation,
3 - The Lasso estimator with λ selected by cross-validation,
4 - The Lasso estimator with λ selected by the BIC criterion.
To test the predictive performance of these estimators, we need new observations that we don’t have. Therefore, we will split our dataset into two parts: a training sample, on which the estimators are fitted, and a test sample, on which their predictions are evaluated.
To achieve this, we choose a proportion \(p_{\text{test}}\), for example, \(p_{\text{test}} = 30\%\).
p = 0.3
n = nrow(Hitters)
Iapp = sample.int(n,round(n*(1-p))) # Indices of the individuals in the training sample
Xtest = X[-Iapp,]
Xapp = X[Iapp,]
Ytest = Hitters$Salary[-Iapp]
Yapp = Hitters$Salary[Iapp]
ntest = length(Ytest)
napp = length(Yapp)
# Least Squares Estimator
lm.mod.app = lm(Salary~.-NewLeague-League-Division,data=Hitters,subset=Iapp)
# Recalculate the Ridge and Lasso estimators as well
Ridge.mod.app = cv.glmnet(Xapp,Yapp,alpha = 0)
Lasso.mod.app = cv.glmnet(Xapp,Yapp,alpha = 1)
# Lasso path on the training sample, used for the BIC selection of lambda
lasso.mod.tot = glmnet(Xapp,Yapp,alpha = 1)
predictions.app = predict(lasso.mod.tot,newx=Xapp,type="response")
# Noise variance estimated from the least-penalized fit (smallest lambda)
sigmahat2.app = mean((predictions.app[,ncol(predictions.app)]-Yapp)^2)
# BIC criterion on the training sample and the lambda that minimizes it
BIC.crit.app = colSums((predictions.app-Yapp)^2)/sigmahat2.app+log(napp)*lasso.mod.tot$df
lambda.BIC.app = lasso.mod.tot$lambda[which.min(BIC.crit.app)]
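As a quick check (the values vary with the random split and the folds), the regularization levels retained on the training sample can be displayed side by side:
c(Ridge.CV = Ridge.mod.app$lambda.min, Lasso.CV = Lasso.mod.app$lambda.min, Lasso.BIC = lambda.BIC.app)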
Now, we will predict the value of the variable Y for the individuals in the test sample. In other words, we are estimating the salary of each player in the test sample using each of the estimators and the values of the covariates (AtBat, Hits, …).
# Ridge.mod.app is the cv.glmnet model fitted on the training data
# (by default, predict() on a cv.glmnet object uses s = "lambda.1se")
Ridge.pred <- predict(Ridge.mod.app, newx = as.matrix(Xtest))
# Common axis limits so that predicted and actual salaries share the same scale
ylimites <- range(c(Ridge.pred, Ytest))
# Plotting
plot(Ytest, Ridge.pred, pch = 16, main = "Ridge Estimator", xlab = "Actual Salaries of Players", ylab = "Predicted Salaries", xlim = ylimites, ylim = ylimites)
abline(c(0, 1), col = 'red')
Calculate the mean squared prediction error of the Ridge estimator on the test sample. Also, plot the salaries predicted by the two Lasso estimators against the actual salaries and compute their prediction errors on the test sample. Draw conclusions.
err.Ridge = mean((Ytest-Ridge.pred)^2)
Lasso.pred.cv = predict(Lasso.mod.app,newx=Xtest)
plot(Ytest,Lasso.pred.cv,pch=16,main="Lasso Estimator (Cross-Validation)",xlab="Actual Salaries of Players",ylab="Predicted Salaries",xlim=ylimites,ylim=ylimites)
abline(c(0,1),col='red')
err.Lasso.cv = mean((Ytest-Lasso.pred.cv)^2)
Lasso.pred.BIC = predict(Lasso.mod.app,s=lambda.BIC.app,newx=Xtest)
plot(Ytest,Lasso.pred.BIC,pch=16,main="Lasso Estimator (BIC)",xlab="Actual Salaries of Players",ylab="Predicted Salaries",xlim=ylimites,ylim=ylimites)
abline(c(0,1),col='red')
err.Lasso.BIC = mean((Ytest-Lasso.pred.BIC)^2); err.Lasso.BIC
## [1] 214967.2
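To draw conclusions, the least squares estimator fitted above (lm.mod.app) can be evaluated in the same way and the four test errors gathered side by side; this is only a sketch, and the values depend on the random split:
# Least squares predictions on the test sample (estimator 1 in the list above)
lm.pred = predict(lm.mod.app, newdata = Hitters[-Iapp, ])
err.lm = mean((Ytest - lm.pred)^2)
# Gather the four test errors for comparison
c(LeastSquares = err.lm, Ridge = err.Ridge, Lasso.CV = err.Lasso.cv, Lasso.BIC = err.Lasso.BIC)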