STA 111 Lab 3
Complete all questions and submit your final document in PDF or HTML format on Canvas.
Goal
In the last lab, we focused on visualizations as well as summary statistics for one numeric variable. Today, we are going to use visualizations and numeric measures to describe the relationship between two numeric variables. The primary visualization we will use is a scatter plot, and the primary numeric measure we will use is called correlation. We will also see how to use R to create a least squares linear regression line. This is another tool we use to describe the relationship between two numeric variables.
The Data
The use of data to improve team performance, training, and recruiting is becoming increasingly popular and important for college and professional sports. There is an entire field called sports analytics that is devoted to using statistics and data in professional sports. Today we are going to work with data from 30 different Major League Baseball teams. We’re interested specifically in figuring out what variables might be related to the number of runs a team scores during a Major League Baseball season. This is important because the team that scores more runs wins the game. This means that identifying variables that might be related to the number of runs could help a team improve its chances of winning.
As always, our first step is to load the data. We know how to do this now: we create a chunk, paste in the code, and then press play!
You will notice that this command is different from the code we used last time to load the data. There are actually quite a few ways to load data into R, and we will use several as we go through this course.
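As one illustration of loading data (a sketch only, not necessarily the command this lab uses), base R's `load()` restores objects that were saved in an `.RData` file. The temporary file and the data frame `toy` below are made up for the example and have nothing to do with the lab's data.

```r
# Toy data frame with made-up values (hypothetical; not the lab's data set)
toy <- data.frame(x = c(1, 2, 3), y = c(2, 4, 6))

# Save it to a temporary .RData file, then remove it from the workspace
path <- tempfile(fileext = ".RData")
save(toy, file = path)
rm(toy)

# load() restores the saved objects and returns their names
loaded_names <- load(path)
print(loaded_names)
```

The name that `load()` reports is the name you use in later commands, which is why it matters to know what your data set is called.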
Question 1
What is the name of the data set you have just loaded?
Our response variable for our analysis for today is runs, the number of runs a team scored during a season of Major League Baseball. Each row in the data set represents a different team during this season.
One formatting note. Right now, if you knit your Markdown, the command to load the data will show up in your PDF. We need that command to run, yes, but we don’t actually need to see it. If you want to hide the R code in a chunk from your PDF, but you still need the code to run, there is a way to do it. Look at the top of your R chunk. You should see {r}. If you want to stop the R code for that chunk from appearing in your PDF, change this to {r,echo=FALSE}.
Graphing the relationship between two numeric variables
Suppose the manager of a Major League Baseball team asks you if the number of times the team is up to bat has a strong relationship with the number of runs scored during the season. The client is specifically interested in having you use a least squares linear regression model to describe the relationship between X = at bats and Y = runs. Remember that response variables are usually denoted by a capital letter Y, and explanatory variables are usually denoted by a capital letter X.
To complete this task, we need to:
- Determine if using least squares linear regression is appropriate, and
- If so, build and interpret a least squares linear regression line.
We need to start with the first step, because it will not make sense to use linear regression if the relationship between X = at bats and Y = runs is not the right shape.
Question 2
What shape does the relationship between X = at bats and Y = runs need to be in order for us to reasonably use least squares linear regression?
In order to check the shape, we need to create a visualization that explores the relationship between at bats and our numeric response variable, runs.
Question 3
What type of plot would you use to determine the shape of the relationship between X = at bats and Y = runs? Why?
Now that we know what type of plot we need, let’s make it. Recall from Lab 1 that a helpful command for plotting is:
One reminder. In order to plot any variable, we have to tell R (1) the name of the data set and (2) the name of the column. This means that if you want to plot `runs`, you actually need the code `mlb11$runs`. This tells R “Hey, look at the data set `mlb11` and grab a column (`$`) called `runs`”. Also remember that when you name your axes, you need to put quotation marks around your chosen labels.
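A minimal sketch of this pattern, assuming base R's `plot()` function. The data frame `toy` and its columns `hours` and `score` are made-up values used only for illustration; for the lab you would swap in your own data set and column names.

```r
# Toy data frame with made-up values (hypothetical; not the lab's data set)
toy <- data.frame(
  hours = c(1, 2, 3, 4, 5),
  score = c(52, 58, 61, 70, 74)
)

# dataset$column pulls one column out of the data frame as a vector
x_values <- toy$hours
y_values <- toy$score

# Scatter plot: explanatory variable on the x-axis, response on the y-axis.
# Axis labels are text, so they go inside quotation marks.
plot(x_values, y_values,
     xlab = "Hours studied",
     ylab = "Exam score")
```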
Question 4
Plot the relationship between X = `at_bats` and Y = `runs`. Show the plot as your answer, and label your axes.
Creating the plot is an excellent first step to answering the manager’s question, but once we have created the plot we then have to be able to interpret what this plot is telling us. Generally, when we are comparing two numeric variables using this type of plot, we are interested in four things:
- Does the relationship seem linear?
- Does the relationship between these variables seem to be strong or weak?
- Is the relationship positive, negative, or neither?
- Are there any points that seem not to follow the trend of the rest of the data? In other words, are there any outliers?
There is a reason why we are interested in these specific questions. The first question (Does the relationship seem linear?) tells us whether or not using a least-squares regression line makes sense for these two variables. All models have assumptions. A big one for least-squares regression lines is that the relationship between the two variables we are modeling should be able to be described using a line. In other words, the relationship should look linear. If it does not, we should not use a least-squares regression line.
Question 5
Does it seem reasonable to consider a regression line to describe the relationship between X = at bats and Y = runs? Explain.
The second question asks about the strength of the relationship between the two variables. This is important because a least squares linear regression line like the one the client has asked us for uses X to describe the variation in Y. In other words, we are hoping to be able to say something like “as X goes up by 1, Y generally goes up by 2.5”. If the relationship between X and Y is not very strong, then the line may not be a good tool for describing the relationship.
It can be difficult to assess the strength of a linear relationship just by looking at a plot. Luckily, we have learned that a numerical way to quantify the strength of a relationship is with the correlation. To determine the correlation of two variables `firstvariable` and `secondvariable` in R, use:
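A minimal sketch of this computation, assuming base R's `cor()` function. The data frame `toy` and its columns are made-up values standing in for `firstvariable` and `secondvariable`; they are not the lab's data.

```r
# Toy data frame with made-up values (hypothetical; not the lab's data set)
toy <- data.frame(
  hours = c(1, 2, 3, 4, 5),
  score = c(52, 58, 61, 70, 74)
)

# cor() takes two numeric vectors and returns one number between -1 and 1
r <- cor(toy$hours, toy$score)
print(r)

# The order of the two variables does not change the correlation
r_swapped <- cor(toy$score, toy$hours)
```

Values close to 1 or -1 indicate a strong linear relationship, and values close to 0 indicate a weak one.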
Question 6
What is the correlation between X = at bats and Y = runs? What does this tell us about the strength of the linear relationship between these two variables? Hint: You need to replace `firstvariable` and `secondvariable` with your two variables of interest.
The third question (Is the relationship positive, negative, or neither?) tells us what sort of slope to expect when we fit a regression line to the data. Do we expect a positive slope? A negative one? This intuition will allow us to check our line once we fit it. Do the results of the line make sense with what the plot is telling us?
The fourth question (Are there any points that seem not to follow the trend of the rest of the data?) helps with identifying outliers that can strongly impact the fit of a line.
Question 7
Describe the relationship between X = at bats and Y = runs. Make sure to comment on all four things listed above. Would you be comfortable telling the client that a least squares linear regression model is a reasonable choice to describe the relationship between at bats and runs?
Fitting the least-squares regression line
Once we have determined that regression is an appropriate choice, we can move into the actual process of estimating our line. If we choose to use a least-squares line, we are making the choice that we can describe the relationship between y and x using a line. The question then becomes: which line?
We have discussed in class that there are specific formulas that can give us the slope and intercept for a least-squares regression line. Luckily for us, R has these formulas built in to a few commands, so we will not have to manually compute the values. To obtain the least squares regression line in R, we can use the following code:
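As a runnable sketch of the same pattern on made-up data: the data frame `toy`, its columns, and the object name `toy_fit` are hypothetical and chosen only so the example stands alone. The lab's own line, as quoted in the text, appears in a comment for comparison.

```r
# Toy data frame with made-up values (hypothetical; not the lab's data set)
toy <- data.frame(
  hours = c(1, 2, 3, 4, 5),
  score = c(52, 58, 61, 70, 74)
)

# The formula is response ~ explanatory; data = says where to find the columns.
# The lab's own line has the same shape:
#   m1 <- lm(runs ~ at_bats, data = mlb11)
toy_fit <- lm(score ~ hours, data = toy)

# Storing the model prints nothing; check that the object now exists
class(toy_fit)
```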
This code is a little different from the code we have worked with so far. We run the code, and nothing seems to happen…but it did. Run the code (hit play on the code chunk) and then take a look at the Environment tab in the upper right panel of your R screen. See that we now have an object called `m1`? You just created it, using the line of code above.
The easiest way to think about it is to imagine that you just told R to create a box called `m1`. This box was empty. Now, you want R to fill the box with something. This is what the `<-` part of the code does. It tells R to fill object `m1` with whatever comes next in the code line. In this case, we are going to fill the box `m1` with all of the information we need to build our regression line.
Now, an important point. Let’s say you run a different command in the form `m1 <-`. This means that you will empty out your `m1` box and then fill it according to the new command. In other words, you will replace your original `m1` with the result of your new command. Because of this, whenever you create a new object in R, you need a new name. These names are case sensitive, meaning that `M1` and `m1` are different.
Let’s get back to the actual line of code defining `m1`. The function `lm` (short for linear model) tells R that we want to create a regression line. So, `m1 <- lm( )` will tell R to create a least-squares regression line and name it `m1`. Our next step is to tell R what to use to actually build the line. What is the response variable? What is the explanatory variable?
To tell R what variables to use to create the line, we use a formula that takes the form `y ~ x`. See how our code says `m1 <- lm(runs ~ at_bats, data = mlb11)`? This can be read as telling R we want to make a regression line (`lm`) with `runs` as our response variable and `at_bats` as our explanatory variable. We notice that we don’t use the `$` notation here. Why? Because the `data = mlb11` part of the command specifies that R should look in the `mlb11` data frame to find the `runs` and `at_bats` variables.
Now, what exactly is `m1`? It’s not a data set. Instead, it holds bits and pieces of information that result from fitting a least-squares regression line. The output of `lm` is an object that contains all of the information we need about the regression line that was just fit. Specifically, right now our goal is to use R to get estimates of the intercept a and the slope b. How do we get those out of `m1`? Use the code:
This reaches into the “box” `m1` and retrieves the estimated coefficients, i.e., the estimates of a and b.
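One way to retrieve the coefficients, sketched on made-up data. The data frame `toy` and the object `toy_fit` are hypothetical names used so the example stands alone; with the lab's model you would reach into `m1` the same way.

```r
# Toy data frame with made-up values (hypothetical; not the lab's data set)
toy <- data.frame(
  hours = c(1, 2, 3, 4, 5),
  score = c(52, 58, 61, 70, 74)
)
toy_fit <- lm(score ~ hours, data = toy)

# Pull the estimated intercept and slope out of the fitted object.
# coef(toy_fit) returns the same values.
coefs <- toy_fit$coefficients
a <- coefs[["(Intercept)"]]  # estimated intercept
b <- coefs[["hours"]]        # estimated slope, named after the explanatory variable
print(coefs)
```

The slope is labeled with the name of the explanatory variable, which is a quick way to tell the two estimates apart in the output.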
With this information, we can write down the least squares linear regression (LSLR) line.
Question 8
Write down the LSLR line.
Note: If you want to produce the notation, copy and paste the following into WHITE SPACE in your Markdown file (NOT a chunk) and adapt it:
$$\widehat{Y} = a + b X $$
Question 9
Think back to your plot. Does it make sense that the slope coefficient is positive? Why or why not?
Drawing the line and considering residuals
It is often helpful to actually visualize the least squares line by drawing it on the data. Let’s create a scatterplot with the least squares line laid on top.
Together, the two lines of code above (1) create a scatter plot for `at_bats` and `runs` and (2) draw on the least squares line. You do need both lines of code, but the function `abline` is the one that actually plots a line based on the slope and intercept we estimated in `m1`.
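The two-step pattern just described can be sketched on made-up data as follows. The data frame `toy` and the object `toy_fit` are hypothetical; the functions `plot()` and `abline()` are base R.

```r
# Toy data frame with made-up values (hypothetical; not the lab's data set)
toy <- data.frame(
  hours = c(1, 2, 3, 4, 5),
  score = c(52, 58, 61, 70, 74)
)
toy_fit <- lm(score ~ hours, data = toy)

# Step 1: draw the scatter plot
plot(toy$hours, toy$score,
     xlab = "Hours studied",
     ylab = "Exam score")

# Step 2: add the fitted line to the existing plot.
# abline() reads the intercept and slope from the fitted model object.
abline(toy_fit)
```

Because `abline()` adds to an existing plot, it has to run after `plot()` in the same chunk.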
Looking at the line we have just drawn, we will notice that the line does not go through every data point. This should not surprise us. A least-squares regression line is a way to approximate the relationship between y and x, but it will not perfectly fit every data point. In order to explore how well the line estimates each data point, we use residuals.
In order to visualize the residuals, we will use a special command. Copy and paste the following into a chunk, and hit play.
After running this command, you will see a scatter plot. The line you specified is shown in black and the residuals are shown in blue.
Question 10
In words, what do the residuals represent? When we fit least-squares regression lines, do we want the residuals to be small or large? Explain.
Note that there are 30 residuals, one for each of the 30 observations. Recall that the residuals are denoted \[ e_i = y_i - \hat{y}_i \]
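To make the residual formula concrete, here is a sketch on made-up data that computes each residual as observed minus predicted. The data frame `toy` and the object `toy_fit` are hypothetical; `fitted()`, `residuals()`, and `predict()` are base R functions.

```r
# Toy data frame with made-up values (hypothetical; not the lab's data set)
toy <- data.frame(
  hours = c(1, 2, 3, 4, 5),
  score = c(52, 58, 61, 70, 74)
)
toy_fit <- lm(score ~ hours, data = toy)

# Predicted value (y-hat) for each observation
predicted <- fitted(toy_fit)

# Residual = observed y minus predicted y, one per observation
resid_by_hand <- toy$score - predicted

# residuals() returns the same values directly
resid_builtin <- residuals(toy_fit)

# Prediction from the line at a chosen x value
new_pred <- predict(toy_fit, newdata = data.frame(hours = 3))
print(resid_by_hand)
```

A negative residual means the line predicted a value higher than what was observed (an overestimate), and a positive residual means the line predicted too low (an underestimate).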
Question 11
Use the least squares line to answer the following. If a team manager saw the least squares regression line and not the actual data, about how many runs would they predict for a team with 5579 at-bats? Is this estimate an overestimate or an underestimate of the true value?
Question 12
What is the residual for the prediction of runs for a team with 5579 at-bats?
Note:
This lab is adapted from the Intro to Linear Regression Lab, which is a product of OpenIntro that is released under a Creative Commons Attribution-ShareAlike 3.0 Unported license. The original lab was adapted for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel from a lab written by the faculty and TAs of UCLA Statistics.
This version of the lab is not endorsed by OpenIntro and was last updated by Nicole Dalzell May 30, 2022.