Welcome, AFIT Data Science learners! This lesson on regression is based on the Introduction to Statistical Learning in R (ISLR) course and book by Hastie and Tibshirani [1]. Their course is offered for free on the Stanford Lagunita online edX platform; this lesson comes from Chapter 6: Linear Model Selection and Regularization. You may need to register first at https://lagunita.stanford.edu/ to get access to the edX course.
library(ISLR)
library(tidyverse)
library(leaps)
First off, let's examine our dataset, Hitters.
?Hitters
The authors of the ISLR book/course use the Hitters data from the ISLR R package. The Hitters data contains Major League Baseball (MLB) data from the 1986 season. In this dataset there are 322 observations and 20 variables, and each observation represents one MLB player. Let's look at these 20 variables by calling the str (structure) function as well as View. After looking at them, make a guess with your classmates on which variables seem most likely to predict MLB players' salaries.
str(Hitters)
## 'data.frame': 322 obs. of 20 variables:
## $ AtBat : int 293 315 479 496 321 594 185 298 323 401 ...
## $ Hits : int 66 81 130 141 87 169 37 73 81 92 ...
## $ HmRun : int 1 7 18 20 10 4 1 0 6 17 ...
## $ Runs : int 30 24 66 65 39 74 23 24 26 49 ...
## $ RBI : int 29 38 72 78 42 51 8 24 32 66 ...
## $ Walks : int 14 39 76 37 30 35 21 7 8 65 ...
## $ Years : int 1 14 3 11 2 11 2 3 2 13 ...
## $ CAtBat : int 293 3449 1624 5628 396 4408 214 509 341 5206 ...
## $ CHits : int 66 835 457 1575 101 1133 42 108 86 1332 ...
## $ CHmRun : int 1 69 63 225 12 19 1 0 6 253 ...
## $ CRuns : int 30 321 224 828 48 501 30 41 32 784 ...
## $ CRBI : int 29 414 266 838 46 336 9 37 34 890 ...
## $ CWalks : int 14 375 263 354 33 194 24 12 8 866 ...
## $ League : Factor w/ 2 levels "A","N": 1 2 1 2 2 1 2 1 2 1 ...
## $ Division : Factor w/ 2 levels "E","W": 1 2 2 1 1 2 1 2 2 1 ...
## $ PutOuts : int 446 632 880 200 805 282 76 121 143 0 ...
## $ Assists : int 33 43 82 11 40 421 127 283 290 0 ...
## $ Errors : int 20 10 14 3 4 25 7 9 19 0 ...
## $ Salary : num NA 475 480 500 91.5 750 70 100 75 1100 ...
## $ NewLeague: Factor w/ 2 levels "A","N": 1 2 1 2 2 1 1 1 2 1 ...
View(Hitters)
First, we notice that the Salary column, which is our response variable, is measured in thousands of dollars. We'll need to remember that later on when conducting our analysis. Next, after scrolling through the Salary column, we see that there are many NA's. These will hamper our linear models, so we must remove them.
Hitters = na.omit(Hitters)
with(Hitters, sum(is.na(Salary))) # there are no more NA's
## [1] 0
Now that we have a clean dataset (reduced to 263 observations), we can work on the goal of this lesson, which is to model and best predict Salary using some combination of the other 19 variables.
Next, let's look at the leaps package [2]. This package will allow us to create best subset models of the MLB variables that best explain Salary. Before we do that, let's make a guess! Which one variable do you think singlehandedly best explains the salary of MLB players in 1987?
?leaps
?regsubsets
Now let's begin the modeling process to see if your guess is correct. Here we create a new object, regfit.full, to capture the output of the regsubsets function, which performs an exhaustive search for the best-fitting models with \(1\) through \(n\) variables from Hitters. It is important to note that the default maximum is 8 variables; if you want more variables included, you can set nvmax to a greater value.
regfit.full = regsubsets(Salary~., data = Hitters)
Next, we examine our model.
summary(regfit.full)
First, we must note that the summary output is just scratching the surface: check out ?regsubsets for more details. For now we are most interested in the eight rows at the bottom (the forced-in and forced-out output will not be covered). The summary shows us the best model with only one variable (row 1). We find it by looking for the one variable marked with an asterisk ("*"); in this case it is CRBI, which is defined as career runs batted in. For the best model with only two variables, we look at row 2, where we see that, in addition to CRBI, Hits is marked with an asterisk and is therefore the second most important variable. Whereas CRBI is a career statistic, Hits only counts the hits each player had during the 1986 season. Continuing on, we can read down the output to row 8 to see the model with the eight most important variables. If we want to see larger models (with more than eight variables), we must adjust our modeling input. There are nineteen available predictors (Salary is the response), so let's allow up to a nineteen-variable model by adding nvmax = 19 to the code.
regfit.full = regsubsets(Salary~., data = Hitters, nvmax = 19)
summary(regfit.full)
## Subset selection object
## Call: regsubsets.formula(Salary ~ ., data = Hitters, nvmax = 19)
## 19 Variables (and intercept)
## Forced in Forced out
## AtBat FALSE FALSE
## Hits FALSE FALSE
## HmRun FALSE FALSE
## Runs FALSE FALSE
## RBI FALSE FALSE
## Walks FALSE FALSE
## Years FALSE FALSE
## CAtBat FALSE FALSE
## CHits FALSE FALSE
## CHmRun FALSE FALSE
## CRuns FALSE FALSE
## CRBI FALSE FALSE
## CWalks FALSE FALSE
## LeagueN FALSE FALSE
## DivisionW FALSE FALSE
## PutOuts FALSE FALSE
## Assists FALSE FALSE
## Errors FALSE FALSE
## NewLeagueN FALSE FALSE
## 1 subsets of each size up to 19
## Selection Algorithm: exhaustive
## AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns
## 1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " "
## 2 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " "
## 3 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " "
## 4 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " "
## 5 ( 1 ) "*" "*" " " " " " " " " " " " " " " " " " "
## 6 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " " "
## 7 ( 1 ) " " "*" " " " " " " "*" " " "*" "*" "*" " "
## 8 ( 1 ) "*" "*" " " " " " " "*" " " " " " " "*" "*"
## 9 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*"
## 10 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*"
## 11 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*"
## 12 ( 1 ) "*" "*" " " "*" " " "*" " " "*" " " " " "*"
## 13 ( 1 ) "*" "*" " " "*" " " "*" " " "*" " " " " "*"
## 14 ( 1 ) "*" "*" "*" "*" " " "*" " " "*" " " " " "*"
## 15 ( 1 ) "*" "*" "*" "*" " " "*" " " "*" "*" " " "*"
## 16 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" " " "*"
## 17 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" " " "*"
## 18 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" " " "*"
## 19 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
## CRBI CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
## 1 ( 1 ) "*" " " " " " " " " " " " " " "
## 2 ( 1 ) "*" " " " " " " " " " " " " " "
## 3 ( 1 ) "*" " " " " " " "*" " " " " " "
## 4 ( 1 ) "*" " " " " "*" "*" " " " " " "
## 5 ( 1 ) "*" " " " " "*" "*" " " " " " "
## 6 ( 1 ) "*" " " " " "*" "*" " " " " " "
## 7 ( 1 ) " " " " " " "*" "*" " " " " " "
## 8 ( 1 ) " " "*" " " "*" "*" " " " " " "
## 9 ( 1 ) "*" "*" " " "*" "*" " " " " " "
## 10 ( 1 ) "*" "*" " " "*" "*" "*" " " " "
## 11 ( 1 ) "*" "*" "*" "*" "*" "*" " " " "
## 12 ( 1 ) "*" "*" "*" "*" "*" "*" " " " "
## 13 ( 1 ) "*" "*" "*" "*" "*" "*" "*" " "
## 14 ( 1 ) "*" "*" "*" "*" "*" "*" "*" " "
## 15 ( 1 ) "*" "*" "*" "*" "*" "*" "*" " "
## 16 ( 1 ) "*" "*" "*" "*" "*" "*" "*" " "
## 17 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*"
## 18 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*"
## 19 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*"
Where did your prediction on the most important variable fall? Did your guess make the top 19?
Next, let’s look at the model more closely.
reg.summary = summary(regfit.full)
names(reg.summary)
## [1] "which" "rsq" "rss" "adjr2" "cp" "bic" "outmat" "obj"
Notice that there are more results captured in the summary than are actually printed. From names(reg.summary) we see there are seven more elements describing this model in addition to the printed selection table. Here is the description from the leaps help page for regsubsets {leaps}, in the section called Value, which tells us that summary.regsubsets returns an object with these elements:

- which: a logical matrix indicating which elements are in each model
- rsq: the r-squared for each model
- rss: residual sum of squares for each model
- adjr2: adjusted r-squared
- cp: Mallows' Cp; an estimate of prediction error, equivalent to AIC in Gaussian linear regression [3]
- bic: Schwarz's information criterion, BIC
- outmat: a version of the which component that is formatted for printing
- obj: a copy of the regsubsets object

The coef method returns a coefficient vector or list of vectors, and the vcov method returns a matrix or list of matrices. From this list, we see that, among other data, the summary captures the Cp value for each size of model. For the purposes of this lesson, we will skip the rest of the metrics and focus only on the Cp values, which we can use to help select the best model. Using the Cp values, we can plot the results for every model size and find the bend in the curve where the estimated prediction error is lowest. First we need to see how many variables give us the best model in terms of Cp.
which.min(reg.summary$cp)
## [1] 10
We see that a 10-variable model gives us the lowest Cp. Next, let's plot the Cp values against the number of variables in the model, marking the 10-variable point in red using the points() command.
plot(reg.summary$cp, xlab = "Number of Variables", ylab = "Cp")
points(10, reg.summary$cp[10], pch = 20, col = "red")
From this plot we see that the estimated prediction error is lowest when the model includes 10 of the 19 available predictors. This suggests that selecting those 10 variables gives the best trade-off between fit and model size.
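Although we focus on Cp in this lesson, the other criteria stored in reg.summary can be checked in exactly the same way. Here is a quick sketch (these criteria do not always agree; BIC, for instance, carries a heavier penalty and tends to favor smaller models):
# Model size favored by each stored criterion.
which.min(reg.summary$cp)     # Mallows' Cp (minimize)
which.min(reg.summary$bic)    # BIC (minimize)
which.max(reg.summary$adjr2)  # adjusted R-squared (maximize)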
In addition to the plot generated above, we can look at our model another way. There is a plot method that comes included with the regsubsets object, which presents what the ISLR authors say looks "like a patent chart".
plot(regfit.full, scale = "Cp")
This plot confirms what we saw before by presenting the variables according to importance. Here we can pick out the most important variables by looking for the columns with the most dark blocks, which are Hits, CRBI, and PutOuts. This matches what we learned from the original summary output. Next, let's take a look at the 10 variables that the first plot told us minimize the estimated prediction error.
coef(regfit.full, 10)
## (Intercept) AtBat Hits Walks CAtBat
## 162.5354420 -2.1686501 6.9180175 5.7732246 -0.1300798
## CRuns CRBI CWalks DivisionW PutOuts
## 1.4082490 0.7743122 -0.8308264 -112.3800575 0.2973726
## Assists
## 0.2831680
This output gives us the 10 most important variables as well as their respective coefficients. With this information, we can construct the linear model that best fits the MLB data we were given. Remember that the Salary variable was in terms of thousands of dollars.
The 10-variable linear model that best explains salary for Major League Baseball players in 1987:
\[\widehat{Salary} = 162.54 - 2.17 \cdot AtBat + 6.92 \cdot Hits + 5.77 \cdot Walks - 0.13 \cdot CAtBat + 1.41 \cdot CRuns + 0.77 \cdot CRBI - 0.83 \cdot CWalks\] \[- 112.38 \cdot DivisionW + 0.30 \cdot PutOuts + 0.28 \cdot Assists\]
This linear model shows us that our intercept is 162.54 (or 162,540 dollars). We also see from the coefficients that if an MLB player could only improve two of his statistics in 1986 in order to raise his 1987 salary, he should focus on Hits and Walks: each extra hit and each extra walk would be worth roughly 6,900 and 5,800 additional dollars in expected salary, respectively. The variables in this model with the largest negative effects are DivisionW and AtBat. These negative coefficients tell us that playing in the West Division lowers a player's expected salary by around 112,000 dollars, while each at-bat costs a player a little more than 2,000 dollars, holding everything else constant.
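As a quick back-of-the-envelope check of that interpretation (our own illustration, not part of the ISLR lab), we can pull the relevant coefficients and convert them to dollars:
# Coefficients of the 10-variable best subset model (Salary is in thousands of dollars).
best10 <- coef(regfit.full, 10)
best10["Hits"] * 1000       # expected salary change per additional hit (about $6,900)
best10["Walks"] * 1000      # expected salary change per additional walk (about $5,800)
best10["DivisionW"] * 1000  # expected effect of playing in the West Division (about -$112,000)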
Lastly, let’s check the R-squared values for each of the 19 models and see how our best subset of 10 variables performs.
reg.summary$rsq
## [1] 0.3214501 0.4252237 0.4514294 0.4754067 0.4908036 0.5087146 0.5141227
## [8] 0.5285569 0.5346124 0.5404950 0.5426153 0.5436302 0.5444570 0.5452164
## [15] 0.5454692 0.5457656 0.5459518 0.5460945 0.5461159
reg.summary$rsq[10]
## [1] 0.540495
plot(reg.summary$rsq, xlab = "# Variables Included", ylab = "R-sq")
From the R-squared output and plot we see that the R-squared values improve steadily from the 1-variable model up to the 10-variable model, at which point the curve plateaus. This is consistent with our earlier conclusion that 10 variables is the best subset: increasing the model's size to 11 variables or beyond has little to no impact on performance. The R-squared value with the 10 best variables included is reported as 0.5405; in other words, our best subset of 10 variables accounts for approximately 54% of the variation in our response, 1987 MLB salaries. This concludes our best subsets analysis, as we have achieved a satisfactory result. While an R-squared of 54% is not ideal, it's about the best we can do with this linear best subsets technique. To improve on this result, other approaches such as transformations or higher-order (non-linear) terms may need to be considered. For now, we will continue with the linear approaches.
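As a sketch of what such a transformation might look like (not pursued further in this lesson), we could repeat the best subset search on a log-transformed salary, a common way to handle skewed salary data. The code below is our own illustration and assumes Hitters has already been cleaned with na.omit as above:
# Best subsets on log(Salary): copy the data and transform the response.
Hitters.log <- Hitters
Hitters.log$Salary <- log(Hitters.log$Salary)
regfit.log <- regsubsets(Salary ~ ., data = Hitters.log, nvmax = 19)
log.summary <- summary(regfit.log)
which.min(log.summary$cp)  # best model size by Cp on the log scale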
Now that we've found the best subset model, let's try a different approach. Here, we will proceed as we did before but add method = "forward".
regfit.fwd = regsubsets(Salary ~., data = Hitters, nvmax = 19, method = "forward")
reg.fwd.summary <- summary(regfit.fwd)
plot(regfit.fwd, scale = "Cp")
We get a similar looking “patent chart” to before. Let’s look more closely.
which.min(reg.fwd.summary$cp)
## [1] 10
Again, 10 variables minimizes Cp. Let’s plot again though.
plot(reg.fwd.summary$cp, xlab = "Number of Variables", ylab = "Cp")
points(10, reg.fwd.summary$cp[10], pch = 20, col = "blue")  # mark the 10-variable model
The curve looks similar to before, with a 10-variable model outperforming the rest. Let’s look at the coefficients to see if any differences arise, i.e. does a forward stepwise regression model select a different set of 10 variables?
coef(regfit.fwd, 10)
## (Intercept) AtBat Hits Walks CAtBat
## 162.5354420 -2.1686501 6.9180175 5.7732246 -0.1300798
## CRuns CRBI CWalks DivisionW PutOuts
## 1.4082490 0.7743122 -0.8308264 -112.3800575 0.2973726
## Assists
## 0.2831680
Our coefficients here match exactly what we obtained before. The forward stepwise technique confirms and supports the best subset model we generated earlier. At our desired 10-variable model size, it doesn't look like we can improve on our initial result using other standard regsubsets techniques like forward stepwise regression. However, let's do one last comparison before we move on by trying a smaller model size. Let's reduce the number of variables to 7 to see if any differences arise.
coef(regfit.fwd, 7)-coef(regfit.full,7)
## (Intercept) AtBat Hits Walks CRBI CWalks
## 30.33635900 -3.24223638 4.22245083 5.28837511 -0.64194510 -1.74736082
## DivisionW PutOuts
## 2.86425039 0.01665911
Interesting! Whereas with 10 variables the two methods produced identical models, with 7 variables the models are noticeably different. From this output, for example, we can see that the difference between the two models' intercepts is just above 30 (about 30,000 dollars), which is quite large! This tells us that it is important to compare multiple model sizes: at 10 variables, best subset and forward stepwise settle on the same result, but at the slightly smaller size of 7 variables, the results are quite different. Now that we've seen the best subset and forward stepwise regression techniques in action, let's move on to a different tactic by utilizing training and validation sets.
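If we wanted to check systematically at which sizes the two methods agree, a quick comparison along these lines (our own sketch, not something regsubsets provides) would do it:
# For each model size, check whether exhaustive and forward selection
# choose the same set of variables.
same.vars <- sapply(1:19, function(i)
  setequal(names(coef(regfit.full, i)), names(coef(regfit.fwd, i))))
same.vars  # FALSE at the sizes where the two methods disagree (e.g., size 7)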
Now the ISLR authors instruct us to make a training set and a validation set! This is an important skill and OR tool because training our model and then validating it allows us to analyze its performance objectively. It's important that we split our data up (randomly) so that we can build a model from one set of data (the training set) and then evaluate it on a separate set (the validation set); we can't just use the same set for both. This is because we are building a predictive model, which requires us to conduct all model-fitting on a separate chunk of training data. In order to be confident in our model, we need to validate it by running it on a 'fresh' set of data that the model hasn't seen yet. Once we've completed the validation phase and are satisfied with our model's performance, we can take our model into the real world and use it on test sets of data to make new predictions, e.g., about future MLB salaries. With all this in mind, we begin by re-examining our Hitters dataset.
dim(Hitters)
## [1] 263 20
round(2/3*dim(Hitters)[1])
## [1] 175
round(1/3*dim(Hitters)[1])
## [1] 88
Our (cleaned) dataset is 263 rows by 20 columns. Now, in order to create training and validation sets, we need to break Hitters into two pieces. A roughly two-thirds/one-third split would give about 175 training and 88 validation observations, as computed above; here we take the simpler route of randomly flagging each row TRUE or FALSE, which yields an approximately 50/50 split. Note: in order to obtain reproducibility and standardization, we set our seed value at 1. Finally, we fit our training model in the same manner as before, just on the reduced (training rows only) dataset.
set.seed(1)
train <- sample(c(TRUE, FALSE), nrow(Hitters), rep = TRUE)
test <- (!train)
# alternative: train <- sample(seq(263), 175, replace = FALSE)
head(train)
## [1] TRUE TRUE FALSE FALSE TRUE FALSE
regfit.train <- regsubsets(Salary~., data = Hitters[train,], nvmax = 19)
Now we will use the training model to predict the observations in the validation set, collecting the validation-set errors for each model size. We will then select our best model by examining the mean squared errors (MSEs) of all 19 sizes of the training model run on the validation set. Since there is no predict method for regsubsets objects in R, the ISLR authors suggest we use some creative code, including a model matrix as well as a loop, to get our desired output.
val.errors = rep(NA, 19)
x.test = model.matrix(Salary~., data = Hitters[test,])  # design matrix for the validation set
# make predictions for each model size and record the validation MSE
for(i in 1:19){
  coef.i = coef(regfit.train, id = i)                     # coefficients of the best i-variable model
  pred = x.test[, names(coef.i)]%*%coef.i                 # predictions on the validation set
  val.errors[i] = mean((Hitters$Salary[test] - pred)^2)   # mean squared error
}
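As an aside, this prediction logic is often wrapped into a small helper function so it can be reused; the ISLR book's lab defines a version much like the sketch below (a convenience function of our own here, not part of the leaps package):
# A minimal predict method for regsubsets objects, mirroring the loop above.
predict.regsubsets <- function(object, newdata, id, ...) {
  form  <- as.formula(object$call[[2]])  # recover the formula used in the regsubsets call
  mat   <- model.matrix(form, newdata)   # build the design matrix for the new data
  coefi <- coef(object, id = id)         # coefficients of the model with 'id' variables
  mat[, names(coefi)] %*% coefi          # predicted values
}
With this helper in place, the prediction step inside the loop could be written as pred = predict(regfit.train, Hitters[test,], id = i).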
Now that our predictions have been made, we can examine the list of MSE’s in order to find our best performing model.
val.errors
## [1] 220968.0 169157.1 178518.2 163426.1 168418.1 171270.6 162377.1
## [8] 157909.3 154055.7 148162.1 151156.4 151742.5 152214.5 157358.7
## [15] 158541.4 158743.3 159972.7 159859.8 160105.6
plot(val.errors, xlab = "# of Variables in Model", ylab = "MSE")
We see from this output and plot that the best performing model is again one that includes 10 variables: the lowest MSE, 148162.1, occurs at the 10-variable mark, which means the 10-variable model has the smallest error when predicting the validation set. This matches our results from the earlier best subset and forward stepwise regressions!
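Rather than reading the minimum off the plot, we can confirm it directly:
which.min(val.errors)              # the model size with the smallest validation MSE (10 here)
val.errors[which.min(val.errors)]  # the minimum MSE itself (148162.1)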
Now let’s determine if the 10-variable model suggested by the training/validation method has the same coefficients as before.
coef(regfit.train ,10)
## (Intercept) AtBat Hits Walks CAtBat CHits
## -80.2751499 -1.4683816 7.1625314 3.6430345 -0.1855698 1.1053238
## CHmRun CWalks LeagueN DivisionW PutOuts
## 1.3844863 -0.7483170 84.5576103 -53.0289658 0.2381662
The coefficients here show us that we have a clearly different result. Comparing just the intercepts, we see a large difference between the models: the best subset/forward stepwise regression produced an intercept of 162.54, while the training & validation model produced an intercept of -80.28. In terms of Salary, that's a difference of over 242,000 dollars! Additionally, we see that the training & validation method did not select exactly the same 10 variables. Of the 10 from the best subset/forward stepwise model, 3 were swapped out: career runs (CRuns), career RBI (CRBI), and assists were replaced by career hits (CHits), career home runs (CHmRun), and whether or not the player was in the National League (LeagueN).
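To list exactly which variables were swapped, a quick set comparison (our own sketch) works well:
full10  <- names(coef(regfit.full, 10))   # variables chosen on the full data
train10 <- names(coef(regfit.train, 10))  # variables chosen on the training data
setdiff(full10, train10)    # only in the full-data model: CRuns, CRBI, Assists
setdiff(train10, full10)    # only in the training model: CHits, CHmRun, LeagueN
intersect(full10, train10)  # the shared predictors (plus the intercept)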
The new 10-variable linear model via the training & validation modeling process:
\[\widehat{Salary} = -80.28 - 1.47 \cdot AtBat + 7.16 \cdot Hits + 3.64 \cdot Walks - 0.19 \cdot CAtBat + 1.11 \cdot CHits + 1.38 \cdot CHmRun - 0.75 \cdot CWalks\] \[+ 84.56 \cdot LeagueN - 53.03 \cdot DivisionW + 0.24 \cdot PutOuts\]
We have obtained a somewhat different result here, as the training/validation technique suggests a 10-variable model in which 3 variables differ from before. We can attribute this difference to the fact that the training & validation model was built only from the training set, which is a smaller dataset than the complete Hitters set used for the best subset/forward stepwise regression. Since the datasets are different, it's completely understandable that the final models would differ. It's a positive result that the training & validation process confirmed our 10-variable model size. Clearly, we have enough results and evidence to suggest that a 10-variable model best explains the Salary portion of this Hitters data. Moreover, the 7 variables shared by the two models, together with top predictors identified earlier such as CRBI, Hits, and PutOuts, account for most of the variation in 1987 Salary that these models explain.
In conclusion, by working with a straightforward and fun dataset, we have learned some valuable linear regression tools in R. Our varying findings are important: they remind us that there is no one right answer when seeking the most suitable model. Different approaches may often yield different, and equally viable, results, and it's up to us to compare them and ultimately recommend to the decision maker how best to proceed. This Hitters dataset has allowed us to learn various linear modeling techniques and has provided key insights into how the 1987 MLB salaries depend on both 1986-season and career variables. In the end, we have added new tools to our belt and have seen how capable R is at producing effective linear modeling results.
[1] James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013). An Introduction to Statistical Learning with Applications in R. Springer-Verlag, New York. www.StatLearning.com [from ISLR]
[2] Miller, A. Subset Selection in Regression. Chapman & Hall. [from leaps]
[3] Boisbunon, A., Canu, S., Fourdrinier, D., Strawderman, W., and Wells, M. T. (2013). "AIC, Cp and estimators of loss for elliptically symmetric distributions." arXiv:1308.2766.