Linear Regression

Getting started with the data and the glmnet package

The objective of this exercise is to predict the salary of a baseball player based on certain statistics associated with their performance over the year.

We will start by installing/loading the glmnet and ISLR packages, then load the Hitters dataset into R.

# install.packages("glmnet")
# install.packages("ISLR")
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-8
library(ISLR)
data("Hitters")
head(Hitters)
##                   AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun
## -Andy Allanson      293   66     1   30  29    14     1    293    66      1
## -Alan Ashby         315   81     7   24  38    39    14   3449   835     69
## -Alvin Davis        479  130    18   66  72    76     3   1624   457     63
## -Andre Dawson       496  141    20   65  78    37    11   5628  1575    225
## -Andres Galarraga   321   87    10   39  42    30     2    396   101     12
## -Alfredo Griffin    594  169     4   74  51    35    11   4408  1133     19
##                   CRuns CRBI CWalks League Division PutOuts Assists Errors
## -Andy Allanson       30   29     14      A        E     446      33     20
## -Alan Ashby         321  414    375      N        W     632      43     10
## -Alvin Davis        224  266    263      A        W     880      82     14
## -Andre Dawson       828  838    354      N        E     200      11      3
## -Andres Galarraga    48   46     33      N        E     805      40      4
## -Alfredo Griffin    501  336    194      A        W     282     421     25
##                   Salary NewLeague
## -Andy Allanson        NA         A
## -Alan Ashby        475.0         N
## -Alvin Davis       480.0         A
## -Andre Dawson      500.0         N
## -Andres Galarraga   91.5         N
## -Alfredo Griffin   750.0         A

After exploring the dataset, determine its size (number of individuals and variables). Are there many missing values? Use the ‘na.omit()’ function to remove the individuals for whom some data is missing.
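A quick way to carry out this exploration, for example:

# Size of the dataset: individuals (rows) and variables (columns)
dim(Hitters)
# Names and types of the variables
str(Hitters)
# Number of missing values, notably in the Salary variable
sum(is.na(Hitters$Salary))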

The ‘glmnet()’ function allows, among other things, the estimation of the parameters of a linear model explaining Salary from the other predictors by regularizing the least squares criterion; here we consider Ridge and Lasso penalties. Consult the documentation of the ‘glmnet()’ function. It takes as input a matrix ‘x’ of covariates, with one column per quantitative variable (qualitative variables are not handled by this linear model). Identify the qualitative variables and the target variable, then construct a matrix (denoted ‘X’) containing only the quantitative covariates.

Hitters <- na.omit(Hitters)   # Remove the individuals with a missing Salary

# The qualitative variables are League, Division and NewLeague; Salary is the target.
# Build the matrix X containing only the quantitative covariates.
quali <- c("League", "Division", "NewLeague")
X <- as.matrix(Hitters[, setdiff(names(Hitters), c(quali, "Salary"))])

# Ridge estimator: glmnet with alpha = 0
ridge.mod <- glmnet(X, Hitters$Salary, alpha = 0)

The parameter \(\alpha\) controls the penalty: the glmnet() function approaches the solution of the following elastic-net problem,

\[ \min_{\beta \in \mathbb{R}^d} \left\{ \frac{1}{2n}\lVert Y - X\beta \rVert_2^2 + \lambda \left( \alpha \lVert \beta \rVert_1 + \frac{1-\alpha}{2} \lVert \beta \rVert_2^2 \right) \right\} \]

so that \(\alpha = 0\) corresponds to the Ridge penalty and \(\alpha = 1\) to the Lasso penalty.
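For instance, an intermediate value of \(\alpha\) gives a compromise between the two penalties; a quick sketch (not used in the rest of the exercise):

# Elastic-net estimator mixing the two penalties, here with alpha = 0.5
enet.mod <- glmnet(X, Hitters$Salary, alpha = 0.5)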

Similarly, compute a Lasso-type estimator and store the result in the variable lasso.mod.

# Lasso estimator: same call with alpha = 1
lasso.mod <- glmnet(X, Hitters$Salary, alpha = 1)

The glmnet() function computes the estimators for a whole grid of values of \(λ\). The default grid can be retrieved using the following command:

ridge.mod$lambda

We will plot the evolution of the coefficients as a function of λ (see ?plot.glmnet).

par(mfrow=c(1,2))
plot(ridge.mod,xvar='lambda',main="Ridge")   # coefficient paths as a function of log(lambda)
plot(lasso.mod,xvar='lambda',main='Lasso')

For the Lasso estimator, we can also plot the number of selected variables as a function of \(λ\).

plot(lasso.mod$lambda,lasso.mod$df,type='s',xlab="lambda",ylab="Number of selected variables")

λ parameter selection

The function ‘cv.glmnet()’ allows for the selection of \(λ\) through cross-validation.

ridge.mod.cv = cv.glmnet(X,Hitters$Salary,alpha=0)
lasso.mod.cv = cv.glmnet(X,Hitters$Salary,alpha=1)
par(mfrow=c(1,2))
plot(ridge.mod.cv,main="Ridge") # Plot the cross-validation error
plot(lasso.mod.cv,main="Lasso")

Once a model is selected, we can display the value of \(λ\) chosen by cross-validation and the coefficients of the corresponding estimator \(\beta^*\).

lambda.opt.ridge = ridge.mod.cv$lambda.min
coeffs.Ridge.cv = coef(ridge.mod.cv,s="lambda.min")
coeffs.Ridge.cv
## 6 x 1 sparse Matrix of class "dgCMatrix"
##                     s1
## (Intercept) 540.807429
## V1           -2.485928
## V2           -3.356149
## V3           10.807330
## V4           -3.189787
## V5          -12.209296

Do the same with Lasso and observe.

lambda.opt.lasso = lasso.mod.cv$lambda.min
coeffs.lasso.cv = coef(lasso.mod.cv,s="lambda.min")
coeffs.lasso.cv
## 6 x 1 sparse Matrix of class "dgCMatrix"
##                    s1
## (Intercept) 541.35305
## V1            .      
## V2            .      
## V3           53.31443
## V4            .      
## V5          -66.72442

A dot ‘.’ in place of a coefficient indicates that the corresponding variable is not selected: its coefficient is set exactly to zero by the Lasso.
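One way to list the selected variables programmatically, for instance (reusing the coefficient vector computed above):

# Variables with a non-zero Lasso coefficient, intercept excluded
selected <- rownames(coeffs.lasso.cv)[which(coeffs.lasso.cv[, 1] != 0)]
setdiff(selected, "(Intercept)")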

For Lasso, we can also consider the BIC criterion.
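Concretely, for each value of \(λ\) on the grid, the criterion computed below is

\[ \mathrm{BIC}(\lambda) = \frac{\lVert Y - X\hat{\beta}_\lambda \rVert_2^2}{\hat{\sigma}^2} + \log(n)\,\mathrm{df}(\lambda), \]

where \(\hat{\sigma}^2\) is an estimate of the residual variance (taken here from the least penalized model of the grid) and \(\mathrm{df}(\lambda)\) is the number of non-zero coefficients.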

# Assuming you have already fitted the Lasso model (lasso.mod) and created the X matrix

# Calculate predictions for the values of lambda in the grid and all individuals
predictions <- predict(lasso.mod, newx = X, type = "response")

# Estimate the residual variance from the least penalized model of the grid (last column)
sigmahat2 <- mean((predictions[, ncol(predictions)] - Hitters$Salary)^2)

# Calculate the BIC criterion
BIC.crit <- colSums((predictions - Hitters$Salary)^2) / sigmahat2 + log(nrow(Hitters)) * lasso.mod$df

# Plot the BIC criterion
plot(BIC.crit, type = 'l', xlab = 'Index', ylab = 'BIC Criterion', main = 'BIC Criterion for Lasso')

Find the minimum of the BIC criterion and compare it with the value of \(λ\) selected by cross-validation.

# Find the index of the minimum BIC criterion
min_index <- which.min(BIC.crit)

# Get the corresponding value of lambda
lambda_min_BIC <- lasso.mod$lambda[min_index]

# Print the minimum index and corresponding lambda
cat("Index of Minimum BIC Criterion:", min_index, "\n")
## Index of Minimum BIC Criterion: 1
cat("Corresponding Lambda:", lambda_min_BIC, "\n")
## Corresponding Lambda: 40.22725
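To complete the comparison, we can display the value of \(λ\) selected by cross-validation alongside it (reusing lasso.mod.cv computed earlier):

# Lambda selected by cross-validation, for comparison with the BIC choice
cat("Lambda selected by cross-validation:", lasso.mod.cv$lambda.min, "\n")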

Comparison of the different estimators by splitting into a training sample and a test sample

We now have various estimators:

  1. The least squares estimator (if it exists),
  2. The Ridge estimator with λ selected by cross-validation,
  3. The Lasso estimator with λ selected by cross-validation,
  4. The Lasso estimator with λ selected by the BIC criterion.

To test the predictive performance of these estimators, we need new observations that we don’t have. Therefore, we will split our dataset into two parts:

  • A training subset that we will use to calculate the different estimators,
  • A test subset that we will use to evaluate and compare the estimators.

To achieve this, we choose a proportion \(p_{\text{test}}\), for example, \(p_{\text{test}} = 30\%\).

p = 0.3
n = nrow(Hitters)
Iapp = sample.int(n,round(n*(1-p))) # Indices of the individuals in the training sample.
Xtest = X[-Iapp,]
Xapp = X[Iapp,]
Ytest = Hitters$Salary[-Iapp]
Yapp = Hitters$Salary[Iapp]
ntest = length(Ytest)
napp = length(Yapp)

# Least Squares Estimator
lm.mod.app = lm(Salary~.-NewLeague-League-Division,data=Hitters,subset=Iapp)

# Recalculate the Ridge and Lasso estimators as well
Ridge.mod.app = cv.glmnet(Xapp,Yapp,alpha = 0)
Lasso.mod.app = cv.glmnet(Xapp,Yapp,alpha = 1)
lasso.mod.tot = glmnet(Xapp,Yapp,alpha = 1) # Full Lasso path on the training sample, used for the BIC selection
predictions.app = predict(lasso.mod.tot,newx=Xapp,type="response")
sigmahat2.app = mean((predictions.app[,ncol(predictions.app)]-Yapp)^2)
BIC.crit.app = colSums((predictions.app-Yapp)^2)/sigmahat2.app+log(napp)*lasso.mod.tot$df
lambda.BIC.app = lasso.mod.tot$lambda[which.min(BIC.crit.app)]

Now, we will predict the value of the variable Y for the individuals in the test sample. In other words, we are estimating the salary of each player in the test sample using each of the estimators and the values of the covariates (AtBat, Hits, …).

# Predictions of the Ridge model (fitted on the training data) on the test sample, at lambda.min
Ridge.pred <- predict(Ridge.mod.app, newx = Xtest, s = "lambda.min")

# Common limits so that both axes use the same scale
ylimites <- range(c(Ridge.pred, Ytest))

# Plotting
plot(Ytest, Ridge.pred, pch = 16, main = "Ridge Estimator", xlab = "Actual Salaries of Players", ylab = "Predicted Salaries", xlim = ylimites, ylim = ylimites)
abline(c(0, 1), col = 'red')

Calculate the mean squared prediction error of the Ridge estimator on the test sample. Also plot the salaries estimated by the two Lasso estimators against the actual salaries and compute their prediction errors on the test sample. Draw conclusions.

err.Ridge = mean((Ytest-Ridge.pred)^2)

Lasso.pred.cv = predict(Lasso.mod.app,newx=Xtest,s="lambda.min")
plot(Ytest,Lasso.pred.cv,pch=16,main="Lasso Estimator (cross-validation)",xlab="Actual Salaries of Players",ylab="Predicted Salaries",xlim=ylimites,ylim=ylimites)
abline(c(0,1),col='red')

err.Lasso.cv = mean((Ytest-Lasso.pred.cv)^2)

Lasso.pred.BIC = predict(Lasso.mod.app,s=lambda.BIC.app,newx=Xtest)
plot(Ytest,Lasso.pred.BIC,pch=16,main="Lasso Estimator (BIC)",xlab="Actual Salaries of Players",ylab="Predicted Salaries",xlim=ylimites,ylim=ylimites)
abline(c(0,1),col='red')

err.Lasso.BIC = mean((Ytest-Lasso.pred.BIC)^2); err.Lasso.BIC
## [1] 214967.2
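To draw conclusions more easily, we can also compute the test error of the least squares estimator fitted above and display all the errors side by side (a sketch; the names lm.pred and err.lm are ours):

# Test predictions and mean squared error of the least squares estimator
lm.pred <- predict(lm.mod.app, newdata = Hitters[-Iapp, ])
err.lm <- mean((Ytest - lm.pred)^2)

# Side-by-side comparison of the test errors of the four estimators
c(LeastSquares = err.lm, Ridge = err.Ridge, Lasso.CV = err.Lasso.cv, Lasso.BIC = err.Lasso.BIC)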