Today, we will see how to use R to create a least squares linear regression line. This is a tool we use to describe the linear relationship between two numeric variables.
Lab workflow
There is an entire field called sports analytics that is devoted to using statistics and data in professional sports. Today we are going to work with data from 30 different Major League Baseball teams. We are specifically interested in figuring out which variables might be related to the number of runs a team scores during a Major League Baseball season. This is important because identifying such variables could help a team improve its chances of winning a game.
As always, our first step is to load the data.
load(url("http://www.openintro.org/stat/data/mlb11.RData"))
You will notice that this command is different from the code we used last time to load the data. There are actually quite a few ways to load data into R, and we will use several as we go through this course.
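To see why load() needs no assignment arrow, here is a minimal sketch of how .RData files behave. The data frame toy and its values are made up for illustration (they are not part of mlb11), and a temporary file stands in for the URL.

```r
# Hypothetical toy data frame (made-up values, NOT the mlb11 data)
toy <- data.frame(x = c(1, 2, 3, 4, 5), y = c(2, 4, 5, 4, 5))

# Save the object to a temporary .RData file, then remove it from the workspace
tmp <- tempfile(fileext = ".RData")
save(toy, file = tmp)
rm(toy)

# load() restores every saved object under its original name and
# invisibly returns the names of the objects it restored
restored <- load(tmp)
restored    # the names of the loaded objects
head(toy)   # the data frame is available again under its old name
```

In the same way, loading the lab's file should make an object named mlb11 appear in your workspace without you naming it yourself.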
Our response variable is runs, the number of runs a team
scored during a season of Major League Baseball. Each row in the data
set represents a different team during this season.
Suppose the manager of a Major League Baseball team asks you if the
number of times the team is up to bat has a strong relationship with the
number of runs scored during the season. The manager is specifically
interested in having you use a least squares linear regression model to
describe the relationship between X = at_bats and Y =
runs.
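To complete this task, we need to (1) determine whether a linear model is appropriate for these data, (2) estimate the least squares line, and (3) assess how well that line fits.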
We need to start with (1), because it will not make sense to use
linear regression if the relationship between X = at_bats
and Y = runs is not the right shape.
In order to check the shape, we need to create a visualization that
explores the relationship between at_bats and our numeric
response variable, runs.
Now that we know what type of plot we need, let’s make it. Recall from Lab 1 that a helpful command for plotting is:
plot(x = , y = , xlab = " ", ylab = " ")
One reminder: in order to plot any variable, we have to tell R (1)
the name of the data set and (2) the name of the column. This means that if
you want to plot runs, you actually need the code mlb11$runs. This tells R “Hey,
look at the data set mlb11 and grab the column ($) called runs”. Also remember that
when you name your axes, you need to put quotation marks around your
chosen labels.
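As a small illustration of the $ notation and the quoted axis labels, here is a sketch on a made-up data frame. The name toy, its columns x and y, and the values are assumptions for this example only. For the lab itself you would swap in mlb11 and its columns.

```r
# Hypothetical toy data frame (made-up values, NOT the mlb11 data)
toy <- data.frame(x = c(1, 2, 3, 4, 5), y = c(2, 4, 5, 4, 5))

# dataset$column pulls one column out of the data frame as a vector
toy$y

# Scatterplot: explanatory variable on the x axis, response on the y axis.
# Axis labels are text, so they go inside quotation marks.
plot(x = toy$x, y = toy$y,
     xlab = "Explanatory variable (x)",
     ylab = "Response variable (y)")
```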
Creating the plot is an excellent first step to answering the manager’s question, but once we have created the plot we then have to be able to interpret what this plot is telling us.
It can be difficult to assess the strength of a linear relationship
just by looking at a plot. Luckily, we have learned that a numerical way
to quantify the strength of a relationship is with the correlation. To
determine the correlation of two variables firstvariable and secondvariable in R, use:
cor(firstvariable, secondvariable)
Replace firstvariable and secondvariable with your two
variables of interest.
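To make the cor command concrete, the sketch below computes a correlation two ways on a made-up data frame (toy, with hypothetical columns x and y): once with cor and once from the usual sample correlation formula. The two results should agree.

```r
# Hypothetical toy data frame (made-up values, NOT the mlb11 data)
toy <- data.frame(x = c(1, 2, 3, 4, 5), y = c(2, 4, 5, 4, 5))

# Built-in correlation
r_builtin <- cor(toy$x, toy$y)

# Same quantity by hand: sum of products of deviations,
# divided by (n - 1) times the two sample standard deviations
n <- nrow(toy)
r_manual <- sum((toy$x - mean(toy$x)) * (toy$y - mean(toy$y))) /
  ((n - 1) * sd(toy$x) * sd(toy$y))

r_builtin
r_manual
```

Note that the order of the two variables does not matter: cor(toy$y, toy$x) gives the same value.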
Once we have determined that regression is an appropriate choice, we can move on to actually estimating our line. By choosing a least-squares line, we are assuming that a straight line is a reasonable description of the relationship between y and x. The question then becomes: which line?
We have discussed in class that there are specific formulas that give us the slope (\(b_1\)) and intercept (\(b_0\)) estimates for a least-squares regression line:
\[\hat{y}= b_0 + b_1 x\]
Luckily for us, R has these formulas built in to a few commands, so we will not have to manually compute the values. To obtain the least squares regression line in R, we can use the following code:
m1 <- lm(runs ~ at_bats, data = mlb11)
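For anyone curious about what lm is doing behind the scenes, here is a rough sketch that computes the estimates by hand on a made-up data frame (toy, with hypothetical columns x and y). It uses one common textbook form of the formulas, slope = r × (s_y / s_x) and intercept = ȳ − slope × x̄, which may be written differently from the version shown in class.

```r
# Hypothetical toy data frame (made-up values, NOT the mlb11 data)
toy <- data.frame(x = c(1, 2, 3, 4, 5), y = c(2, 4, 5, 4, 5))

# Slope: correlation times the ratio of standard deviations
r  <- cor(toy$x, toy$y)
b1 <- r * sd(toy$y) / sd(toy$x)

# Intercept: forces the line through the point (mean of x, mean of y)
b0 <- mean(toy$y) - b1 * mean(toy$x)

# Compare with what lm reports
fit <- lm(y ~ x, data = toy)
c(intercept = b0, slope = b1)
coef(fit)
```

For this toy data both approaches give an intercept of 2.2 and a slope of 0.6.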
Let’s get back to the actual line of code defining m1.
The function lm (short for “linear model”) tells R that we want to
fit a regression line. So, m1 <- lm( ) will tell R to
fit a least-squares regression line and store the result as m1. Our
next step is to tell R what to use to actually build the line. What is
the response variable (y)? What is the explanatory variable (x)? To tell
R what variable to use to create the line, we use a formula that takes
the form y ~ x.
Notice that we don’t use the $ notation here. Why?
Because the data = mlb11 part of the command tells R to look
in the mlb11 data frame to find the runs and
at_bats variables.
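The sketch below shows, on a made-up data frame (toy, with hypothetical columns x and y), that the data argument and the $ notation lead to the same fitted line. Only the labels attached to the coefficients differ.

```r
# Hypothetical toy data frame (made-up values, NOT the mlb11 data)
toy <- data.frame(x = c(1, 2, 3, 4, 5), y = c(2, 4, 5, 4, 5))

# Version 1: bare column names, with data = telling R where to look
fit_data <- lm(y ~ x, data = toy)

# Version 2: no data argument, so each column is spelled out with $
fit_dollar <- lm(toy$y ~ toy$x)

coef(fit_data)     # slope is labelled "x"
coef(fit_dollar)   # slope is labelled "toy$x", but the numbers match
```

The first version is usually easier to read, which is one reason to prefer the data argument.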
Now, what exactly is m1? It holds bits and pieces of
information that result from fitting a least-squares regression line.
The output of lm is an object that contains all of the
information we need about the regression line. Specifically, right now
our goal is to use R to get estimates of the intercept and the slope.
How do we get those out of m1? Use the code:
m1 <- lm(runs ~ at_bats, data = mlb11)
m1$coefficients
This reaches into the object m1 and retrieves the
estimated coefficients, i.e., the intercept (\(b_0\)) and slope (\(b_1\)) estimates.
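If you want the intercept and slope as separate numbers, one option is to index the coefficient vector by name. The sketch below does this for a model fit to a made-up data frame (toy, with hypothetical columns x and y). With m1, the slope would be labelled at_bats instead of x.

```r
# Hypothetical toy data frame (made-up values, NOT the mlb11 data)
toy <- data.frame(x = c(1, 2, 3, 4, 5), y = c(2, 4, 5, 4, 5))
fit <- lm(y ~ x, data = toy)

# coef() is an accessor function that returns the same values as fit$coefficients
coef(fit)
names(coef(fit))   # "(Intercept)" and the name of the explanatory variable

# Pull out each estimate by name
b0 <- coef(fit)[["(Intercept)"]]
b1 <- coef(fit)[["x"]]
b0
b1
```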
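With this information, we can write down the least squares linear regression (LSLR) line. Using the rounded estimates, where \(x\) is at_bats and \(\hat{y}\) is the predicted value of runs:
\[\hat{y} = -2789.24 + 0.63x\]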
Interpretation:
• Slope (\(b_1\)): For each additional at-bat, the model predicts that a team scores 0.63 more runs, on average.
• Intercept (\(b_0\)): For a team with zero at-bats, the model predicts -2789.24 runs. This value is nonsensical in context, because no team has anything close to zero at-bats, but it is still needed to position the line.
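To see how the slope and intercept work together, here is a short worked prediction. It uses the rounded estimates quoted above, so the answer will differ slightly from what the unrounded model would give. The value of 5500 at-bats is a hypothetical input chosen for illustration, not a number taken from the data.

```r
# Rounded estimates quoted in the interpretation above
b0 <- -2789.24   # intercept
b1 <- 0.63       # slope

# Hypothetical team with 5500 at-bats in a season (illustrative value only)
at_bats_example <- 5500

# Predicted runs: y-hat = b0 + b1 * x
predicted_runs <- b0 + b1 * at_bats_example
predicted_runs   # about 675.76 runs
```

Predictions like this are only trustworthy for at-bat values similar to those in the data, which is also why the intercept itself has no sensible interpretation.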
It is often helpful to actually visualize the least squares line by drawing it on the data. Let’s create a scatterplot with the least squares line laid on top.
plot(x = mlb11$at_bats, y = mlb11$runs, xlab = "At Bats", ylab = "Runs")
abline(m1)
Together, the two lines of code above (1) create a scatter plot for
at_bats and runs and (2) draw the least
squares line on top of it. You do need both lines of code, but the function
abline is the one that actually plots a line based on the
slope and intercept we estimated in m1.
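abline can also be given an intercept and slope directly, through its a and b arguments. The sketch below, on a made-up data frame (toy, with hypothetical columns x and y), draws the line both ways. The second line lands exactly on top of the first, which confirms that abline(fit) is simply reading the two coefficients out of the model.

```r
# Hypothetical toy data frame (made-up values, NOT the mlb11 data)
toy <- data.frame(x = c(1, 2, 3, 4, 5), y = c(2, 4, 5, 4, 5))
fit <- lm(y ~ x, data = toy)

# Step 1: the scatterplot has to exist before a line can be added to it
plot(x = toy$x, y = toy$y, xlab = "x", ylab = "y")

# Step 2a: hand the fitted model straight to abline
abline(fit)

# Step 2b: same line, giving the intercept (a) and slope (b) explicitly
b0 <- coef(fit)[["(Intercept)"]]
b1 <- coef(fit)[["x"]]
abline(a = b0, b = b1, lty = 2, col = "red")   # dashed red, drawn over the first
```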
Looking at the line we have just drawn, we will notice that the line does not go through every data point. This should not surprise us. A least-squares regression line is a way to approximate the relationship between y and x, but it will not perfectly fit every data point. In order to explore how well the line estimates each data point, we use residuals.
In order to visualize the residuals, we will use a special command.
plot_ss(x = mlb11$at_bats, y = mlb11$runs, leastSquares = TRUE)
After running this command, you will see a scatter plot. The least squares line is shown in black and the residuals are shown in blue.
Note that there are 30 residuals, one for each of the 30 observations. Recall that each residual is the observed value minus the value predicted by the line:
\[e_i = y_i - \hat{y}_i\]
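The residual formula can be checked directly in R. The sketch below uses a made-up data frame (toy, with hypothetical columns x and y) and compares the residuals computed by hand with those stored in the fitted model.

```r
# Hypothetical toy data frame (made-up values, NOT the mlb11 data)
toy <- data.frame(x = c(1, 2, 3, 4, 5), y = c(2, 4, 5, 4, 5))
fit <- lm(y ~ x, data = toy)

# Predicted values (y-hat), one per observation
y_hat <- fitted(fit)

# Residuals by hand: observed minus predicted
e_manual <- toy$y - y_hat

# Residuals as reported by R
e_builtin <- resid(fit)

e_manual
e_builtin
sum(e_builtin)   # essentially zero for a least squares line with an intercept
```

There is one residual per row of the data, so with mlb11 you would get 30 of them.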
Okay, so we have estimated the coefficients we needed based on the
fact that we chose a least-squares regression line. We now have a line.
Wonderful. Before we use this line to make any sweeping conclusions
about baseball, let’s see if we can determine how good a job our line
actually does at explaining the variation in runs.
We have learned that one way to examine fit is to consider the amount
of variation in our response that is actually able to be explained by
the line. In other words, how much of the variation in runs
can we explain by at_bats? To get at this, we need to look
at the \(R^2\).
There are two ways to get at the \(R^2\) in R. The first is to find the correlation (which we already did) and literally square it.
cor(mlb11$at_bats, mlb11$runs)^2
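Another route to \(R^2\), which may or may not be the second way this lab has in mind, is to read it from the model summary. The sketch below uses a made-up data frame (toy, with hypothetical columns x and y) to show that the squared correlation, the value stored in summary(), and the definition 1 − SSE/SST all agree for a regression with one explanatory variable.

```r
# Hypothetical toy data frame (made-up values, NOT the mlb11 data)
toy <- data.frame(x = c(1, 2, 3, 4, 5), y = c(2, 4, 5, 4, 5))
fit <- lm(y ~ x, data = toy)

# Way 1: square the correlation
r2_cor <- cor(toy$x, toy$y)^2

# Way 2: read it from the model summary
r2_summary <- summary(fit)$r.squared

# From the definition: 1 - (unexplained variation / total variation)
sse <- sum(resid(fit)^2)               # sum of squared residuals
sst <- sum((toy$y - mean(toy$y))^2)    # total variation in the response
r2_def <- 1 - sse / sst

c(r2_cor, r2_summary, r2_def)   # all 0.6 for this toy data
```

With the lab's model, the corresponding command would be summary(m1)$r.squared.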
Note: This lab is adapted from the Intro to Linear Regression Lab, which is a product of OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel.