Lab 3: Simple Linear Regression

Goal

Today, we will see how to use R to create a least squares linear regression line. This is a tool we use to describe the linear relationship between two numeric variables.

Lab workflow

Visualize the Data
1. Create a scatterplot
2. Observe the relationship (direction, strength, form)
Fit a Simple Linear Regression Model
1. Obtain the regression coefficients using software
2. Interpret the slope ($b_1$) and intercept ($b_0$)
3. Discuss predictions using the equation
Evaluate the Fit
1. Examine the residuals
2. Interpret $R^2$

The Data

There is an entire field called sports analytics that is devoted to using statistics and data in professional sports. Today we are going to work with data from 30 Major League Baseball teams. We’re specifically interested in figuring out which variables might be related to the number of runs a team scores during a Major League Baseball season. This is important because identifying such variables could help a team improve its chances of winning games.

As always, our first step is to load the data.

load(url("http://www.openintro.org/stat/data/mlb11.RData"))

You will notice that this command is different from the code we used last time to load the data. There are actually quite a few ways to load data into R, and we will use several as we go through this course.

Our response variable is runs, the number of runs a team scored during a season of Major League Baseball. Each row in the data set represents a different team during this season.
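Before going further, it can help to confirm that the data loaded as expected. The sketch below repeats the load step so that it runs on its own, then checks the size of the data set and looks at the first few rows. It uses only the base R functions dim, names, and head.

```r
# Load the data (same command as above; requires an internet connection)
load(url("http://www.openintro.org/stat/data/mlb11.RData"))

# Number of rows (teams) and columns (variables)
dim(mlb11)

# Names of the variables available in the data set
names(mlb11)

# First six rows, to see what the values look like
head(mlb11)
```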

Graphing the Relationship between two numeric variables

Suppose the manager of a Major League Baseball team asks you whether the number of times the team is up to bat has a strong relationship with the number of runs scored during the season. The manager is specifically interested in having you use a least squares linear regression model to describe the relationship between X = at_bats and Y = runs.

To complete this task, we need to:

1. Determine whether using least squares linear regression is appropriate, and
2. If so, build and interpret a least squares linear regression line.

We need to start with (1), because it will not make sense to use linear regression if the relationship between X = at_bats and Y = runs is not the right shape.

In order to check the shape, we need to create a visualization that explores the relationship between at_bats and our numeric response variable, runs.

Question 1: What type of plot would you use to determine the shape of the relationship between X = at_bats and Y = runs? Why?

Now that we know what type of plot we need, let’s make it. Recall from Lab 1 that a helpful command for plotting is:

plot(x = , y = , xlab = " ", ylab = " ")

One reminder: in order to plot any variable, we have to tell R (1) the name of the data set and (2) the name of the column. This means that if you want to plot runs, you actually need the code mlb11$runs. This tells R “Hey, look at the data set mlb11 and grab the column called runs”; the $ separates the data set name from the column name. Also remember that when you name your axes, you need to put quotation marks around your chosen labels.
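As a small illustration of the $ notation, the sketch below pulls out the runs column and summarizes it. The name team_runs is made up for this example, and the load step is repeated only so the code runs on its own.

```r
# Load the data (requires an internet connection)
load(url("http://www.openintro.org/stat/data/mlb11.RData"))

# Extract the runs column from the mlb11 data set as a vector
team_runs <- mlb11$runs  # team_runs is a name chosen for this example

# There is one value per team
length(team_runs)

# Minimum, quartiles, mean, and maximum of the runs scored
summary(team_runs)
```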

Question 2: Plot the relationship between X=at_bats and Y=runs. Show the plot as your answer, and label your axes.

Creating the plot is an excellent first step to answering the manager’s question, but once we have created the plot we then have to be able to interpret what this plot is telling us.

Question 3: Describe the relationship between X = at_bats and Y= runs. Does it seem reasonable to consider a regression line to describe the relationship between X = at_bats and Y= runs? Explain.

It can be difficult to assess the strength of a linear relationship just by looking at a plot. Luckily, we have learned that a numerical way to quantify the strength of a relationship is with the correlation. To determine the correlation of two variables firstvariable and secondvariable in R, use:

cor(firstvariable, secondvariable)

Question 4: What is the correlation between X = at_bats and Y= runs? What does this tell us about the strength of the linear relationship between these two variables? Hint: You need to replace firstvariable and secondvariable with your two variables of interest.
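To see the syntax in action without using the baseball data, here is a minimal sketch on two short made-up vectors. The names hours_studied and quiz_score and their values are invented purely for illustration.

```r
# Two short made-up vectors (not from the baseball data)
hours_studied <- c(1, 2, 3, 4, 5)
quiz_score    <- c(2, 4, 5, 4, 5)

# Correlation between the two vectors
r <- cor(hours_studied, quiz_score)
r

# The order of the two variables does not change the correlation
cor(quiz_score, hours_studied)
```

A value close to 1 or -1 indicates a strong linear relationship, while a value close to 0 indicates a weak one.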

Fitting the least-squares regression line

Once we have determined that regression is an appropriate choice, we can move into the actual process of estimating our line. If we choose to use a least-squares line, we are assuming that we can describe the relationship between y and x using a straight line. The question then becomes: which line?

We have discussed in class that there are specific formulas that give us the slope ($b_1$) and intercept ($b_0$) estimates for a least-squares regression line:

\[\hat{y}= b_0 + b_1 x\]
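For reference, these formulas are usually written as follows, where $\bar{x}$ and $\bar{y}$ are the sample means, $s_x$ and $s_y$ are the sample standard deviations, and $r$ is the correlation between x and y. These symbols are standard textbook notation, assumed here rather than defined earlier in this lab.

\[b_1 = r \, \frac{s_y}{s_x}, \qquad b_0 = \bar{y} - b_1 \bar{x}\]

The second formula implies that the least-squares line always passes through the point $(\bar{x}, \bar{y})$.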

Luckily for us, R has these formulas built into a few commands, so we will not have to compute the values manually. To obtain the least squares regression line in R, we can use the following code:

m1 <- lm(runs ~ at_bats, data = mlb11)

Let’s look closely at the line of code defining m1. The function lm (short for “linear model”) tells R that we want to fit a regression line. So, m1 <- lm( ) tells R to fit a least-squares regression line and store the result under the name m1. Our next step is to tell R what to use to actually build the line. What is the response variable (y)? What is the explanatory variable (x)? To tell R which variables to use, the first argument to lm is a formula of the form y ~ x.

Notice that we don’t use the $ notation here. Why? Because the data = mlb11 argument tells R to look in the mlb11 data frame to find the runs and at_bats variables.

Now, what exactly is m1? It holds bits and pieces of information that result from fitting a least-squares regression line. The output of lm is an object that contains all of the information we need about the regression line. Specifically, right now our goal is to use R to get estimates of the intercept and the slope. How do we get those out of m1? Use the code:

m1 <- lm(runs ~ at_bats, data = mlb11)
m1$coefficients

This reaches into the object m1 and retrieves the estimated coefficients, i.e., the intercept ($b_0$) and slope ($b_1$) estimates.
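As a check on what lm is doing, the sketch below recomputes the two estimates by hand from the correlation, standard deviations, and means, using the standard formulas $b_1 = r \cdot s_y / s_x$ and $b_0 = \bar{y} - b_1 \bar{x}$. The names b0 and b1 are chosen for this example, and the load step is repeated so the code runs on its own.

```r
# Load the data and fit the model (requires an internet connection)
load(url("http://www.openintro.org/stat/data/mlb11.RData"))
m1 <- lm(runs ~ at_bats, data = mlb11)

# coef() is an equivalent way to extract the estimates
coef(m1)

# Slope: correlation times the ratio of the standard deviations
b1 <- cor(mlb11$at_bats, mlb11$runs) * sd(mlb11$runs) / sd(mlb11$at_bats)

# Intercept: chosen so the line passes through (mean of x, mean of y)
b0 <- mean(mlb11$runs) - b1 * mean(mlb11$at_bats)

# These should agree with the values stored in m1
c(b0, b1)
```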

With this information, we can write down the least squares linear regression (LSLR) line.

Question 5: Write down the LSLR line.

Interpretation:

• Slope ($b_1$): For each additional at-bat, the model predicts that a team scores about 0.63 more runs, on average.

• Intercept ($b_0$): For a team with zero at-bats (a hypothetical case far outside the range of the data), the model predicts -2789.24 runs. This is nonsensical in this context, but the intercept is still needed to position the line.
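Once the line is fitted, the equation can be used to make predictions. Here is a minimal sketch using R's predict function, followed by the same calculation done directly from the coefficients. The value of 5500 at-bats and the name new_team are arbitrary choices for illustration, not taken from the lab questions.

```r
# Load the data and fit the model (requires an internet connection)
load(url("http://www.openintro.org/stat/data/mlb11.RData"))
m1 <- lm(runs ~ at_bats, data = mlb11)

# A hypothetical team with 5500 at-bats (value chosen only for illustration)
new_team <- data.frame(at_bats = 5500)

# Predicted runs from the fitted model
predict(m1, newdata = new_team)

# The same prediction computed directly from the equation b0 + b1 * x
b <- coef(m1)
b[["(Intercept)"]] + b[["at_bats"]] * 5500
```

The column name in new_team has to match the explanatory variable used in the formula (at_bats); otherwise predict cannot find the values to plug in.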

Question 6: Think back to your plot. Does it make sense that the slope coefficient is positive? Why or why not?

Drawing the line and considering residuals

It is often helpful to actually visualize the least squares line by drawing it on the data. Let’s create a scatterplot with the least squares line laid on top.

plot(x = mlb11$at_bats, y = mlb11$runs, xlab = "At Bats", ylab = "Runs")
abline(m1)

Together, the two lines of code above (1) create a scatterplot of at_bats and runs and (2) draw the least squares line on top of it. You do need both lines of code, but the function abline is the one that actually plots a line based on the slope and intercept we estimated in m1.

Looking at the line we have just drawn, we will notice that the line does not go through every data point. This should not surprise us. A least-squares regression line is a way to approximate the relationship between y and x, but it will not perfectly fit every data point. In order to explore how well the line estimates each data point, we use residuals.

In order to visualize the residuals, we will use a special command.

plot_ss(x = mlb11$at_bats, y = mlb11$runs, leastSquares = TRUE)

After running this command, you will see a scatterplot. The least squares line is shown in black and the residuals are shown in blue.

Question 7: In words, what do the residuals represent? When we fit least-squares regression lines, do we want the residuals to be small or large? Explain.

Note that there are 30 residuals, one for each of the 30 observations. Recall that the residuals are denoted

\[e_i = y_i - \hat{y}_i\]
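The definition above can be checked directly in R. In this sketch, fitted and residuals are base R functions that extract the predicted values and residuals from a fitted model. The names y_hat and e are chosen for this example, and the load step is repeated so the code runs on its own.

```r
# Load the data and fit the model (requires an internet connection)
load(url("http://www.openintro.org/stat/data/mlb11.RData"))
m1 <- lm(runs ~ at_bats, data = mlb11)

# Fitted values: the predicted runs for each team in the data
y_hat <- fitted(m1)

# Residuals computed from the definition: observed minus predicted
e <- mlb11$runs - y_hat

# R stores the same values in the model object
head(residuals(m1))
head(e)

# One residual per observation
length(e)
```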

Question 8: Use the least squares line to answer the following. If a team manager saw the least squares regression line and not the actual data, about how many runs would they predict for a team with 5579 at_bats? Compare the predicted runs with the actual runs of 713.

Question 9: What is the residual for the prediction of runs for a team with 5579 at_bats?

Assessing the line

Okay, so we have estimated the coefficients we needed based on the fact that we chose a least-squares regression line. We now have a line. Wonderful. Before we use this line to make any sweeping conclusions about baseball, let’s see if we can determine how good a job our line actually does at explaining the variation in runs.

We have learned that one way to examine fit is to consider the amount of variation in our response that is actually able to be explained by the line. In other words, how much of the variation in runs can we explain by at_bats? To get at this, we need to look at the $R^2$.

One way to get the $R^2$ in R is to find the correlation (which we already did) and square it.

cor(mlb11$at_bats, mlb11$runs)^2
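A second way to obtain the $R^2$ is to read it from the model summary. The sketch below assumes the model is fitted exactly as earlier in this lab and repeats the load step so it runs on its own. For a regression with a single explanatory variable, the two approaches should give the same number.

```r
# Load the data and fit the model (requires an internet connection)
load(url("http://www.openintro.org/stat/data/mlb11.RData"))
m1 <- lm(runs ~ at_bats, data = mlb11)

# R-squared as reported by the model summary
summary(m1)$r.squared

# For simple linear regression this equals the squared correlation
cor(mlb11$at_bats, mlb11$runs)^2
```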

Question 10: What percentage of the variability in runs is explained by a team's number of at-bats?

Wrap-Up

• We explored the relationship between two numeric variables.

• The simple linear regression equation helps us make predictions, but other factors could improve the model. (What additional variables might help explain runs scored?)

• We can expand to multiple regression to include additional predictors.

Note: This lab is adapted from the Intro to Linear Regression Lab, which is a product of OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel.