STA 111 Lab 3

Complete all questions and submit your final document in PDF or HTML form on Canvas.

Goal

In the last lab, we focused on visualizations as well as summary statistics for one numeric variable. Today, we are going to use visualizations and numeric measures to describe the relationship between two numeric variables. The primary visualization we will use is a scatter plot, and the primary numeric measure we will use is called correlation. We will also see how to use R to create a least squares linear regression line. This is another tool we use to describe the relationship between two numeric variables.

The Data

The use of data to improve team performance, training, and recruiting is becoming increasingly popular and important for college and professional sports. There is an entire field called sports analytics that is devoted to using statistics and data in professional sports. Today we are going to work with data from 30 different Major League Baseball teams. We’re interested specifically in figuring out what variables might be related to the number of runs a team scores during a Major League Baseball season. This is important because the team that scores more runs wins the game. This means that identifying variables that might be related to the number of runs could help a team improve its chances of winning.

As always, our first step is to load the data. We know how to do this now: we create a chunk, paste in the code, and then press play!

load(url("http://www.openintro.org/stat/data/mlb11.RData"))

You will notice that this command is different from the code we used last time to load the data. There are actually quite a few ways to load data into R, and we will use several as we go through this course.

Question 1

What is the name of the data set you have just loaded?

Our response variable for our analysis for today is runs, the number of runs a team scored during a season of Major League Baseball. Each row in the data set represents a different team during this season.

One formatting note. Right now, if you knit your Markdown, the command to load the data will show up in your PDF. We need that command to run, yes, but we don’t actually need to see it. If you want to hide the R code in a chunk from your PDF, but you still need the code to run, there is a way to do it. Look at the top of your R chunk. You should see {r}. If you want to stop the R code for that chunk from appearing in your PDF, change this to {r,echo=FALSE}.

Graphing the relationship between two numeric variables

Suppose the manager of a Major League Baseball team asks you if the number of times the team is up to bat has a strong relationship with the number of runs scored during the season. The client is specifically interested in having you use a least squares linear regression model to describe the relationship between X = at bats and Y = runs. Remember that response variables are usually denoted by a capital letter Y, and explanatory variables are usually denoted by a capital letter X.

To complete this task, we need to:

1. Determine if using least squares linear regression is appropriate, and
2. If so, build and interpret a least squares linear regression line.

We need to start with (1), because it will not make sense to use linear regression if the relationship between X = at bats and Y = runs is not the right shape.

Question 2

What shape does the relationship between X = at bats and Y = runs need to be in order for us to reasonably use least squares linear regression?

In order to check the shape, we need to create a visualization that explores the relationship between at bats and our numeric response variable, runs.

Question 3

What type of plot would you use to determine the shape of the relationship between X = at bats and Y = runs? Why?

Now that we know what type of plot we need, let’s make it. Recall from Lab 1 that a helpful command for plotting is:

plot(x = , y = , xlab = " ", ylab = " ")

One reminder. In order to plot any variable, we have to tell R (1) the name of the data set and (2) the name of the column. This means that if you want to plot runs, you actually need the code mlb11$runs. This tells R “Hey, look at the data set mlb11 and grab a column $ called runs”. Also remember that when you name your axes, you need to put quotation marks around your chosen labels.
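As a rough sketch of the $ notation in action, the chunk below builds a tiny made-up data set (called toy here, a hypothetical name; the numbers are invented and are not the mlb11 values) and plots one column against another.

```r
# Made-up numbers for illustration only; these are NOT the mlb11 values.
toy <- data.frame(
  at_bats = c(5400, 5450, 5500, 5550, 5600, 5650),
  runs    = c(620, 655, 640, 700, 715, 750)
)

# Data set name, then $, then column name: this pulls out one column.
toy_runs <- toy$runs
toy_runs

# The same pattern is used for both x and y; axis labels go in quotation marks.
plot(x = toy$at_bats, y = toy$runs, xlab = "At Bats", ylab = "Runs")
```

For the lab itself, the same pattern should apply with mlb11 in place of toy.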

Question 4

Plot the relationship between X = at_bats and Y = runs. Show the plot as your answer, and label your axes.

Creating the plot is an excellent first step to answering the manager’s question, but once we have created the plot we then have to be able to interpret what this plot is telling us. Generally, when we are comparing two numeric variables using this type of plot, we are interested in four things:

1. Does the relationship seem linear?
2. Does the relationship between these variables seem to be strong or weak?
3. Is the relationship positive, negative, or neither?
4. Are there any points that seem not to follow the trend of the rest of the data? In other words, are there any outliers?

There is a reason why we are interested in these specific questions. The first question (Does the relationship seem linear?) tells us whether or not using a least-squares regression line makes sense for these two variables. All models have assumptions. A big one for least-squares regression lines is that the relationship between the two variables we are modeling should be able to be described using a line. In other words, the relationship should look linear. If it does not, we should not use a least-squares regression line.

Question 5

Does it seem reasonable to consider a regression line to describe the relationship between X = at bats and Y = runs? Explain.

The second question asks about the strength of the relationship between the two variables. This is important because a least squares linear regression line like the client has asked us for uses X to describe the variation in Y. In other words, we are hoping to be able to say something like “as X goes up by 1, Y generally goes up by 2.5.” If the relationship between X and Y is not very strong, then the line may not be a good tool for describing the relationship.

It can be difficult to assess the strength of a linear relationship just by looking at a plot. Luckily, we have learned that a numerical way to quantify the strength of a linear relationship is the correlation. To determine the correlation of two variables firstvariable and secondvariable in R, use:

cor(firstvariable, secondvariable)
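To see how the placeholders get filled in, here is a small sketch using a made-up data set (toy is a hypothetical name and the numbers are invented, not the mlb11 values). Each placeholder is replaced by a column written with the $ notation.

```r
# Made-up numbers for illustration only; these are NOT the mlb11 values.
toy <- data.frame(
  at_bats = c(5400, 5450, 5500, 5550, 5600, 5650),
  runs    = c(620, 655, 640, 700, 715, 750)
)

# firstvariable and secondvariable are each replaced by dataset$column.
r <- cor(toy$at_bats, toy$runs)
r

# The order of the two variables does not change the correlation.
r_swapped <- cor(toy$runs, toy$at_bats)
r_swapped
```

Correlation is always between -1 and 1, with values near either end indicating a stronger linear relationship.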

Question 6

What is the correlation between X = at bats and Y = runs? What does this tell us about the strength of the linear relationship between these two variables? Hint: You need to replace firstvariable and secondvariable with your two variables of interest.

The third question (Is the relationship positive, negative, or neither?) tells us what sort of slope to expect when we fit a regression line to the data. Do we expect a positive slope? A negative one? This intuition will allow us to check our line once we fit it. Do the results of the line make sense with what the plot is telling us?

The fourth question (Are there any points that seem not to follow the trend of the rest of the data?) helps with identifying outliers that can strongly impact the fit of a line.

Question 7

Describe the relationship between X = at bats and Y = runs. Make sure to comment on all four things listed above. Would you be comfortable telling the client that a least squares linear regression model is a reasonable choice to describe the relationship between at bats and runs?

Fitting the least-squares regression line

Once we have determined that regression is an appropriate choice, we can move into the actual process of estimating our line. If we choose to use a least-squares line, we are making the choice that we can describe the relationship between y and x using a line. The question then becomes: which line?

We have discussed in class that there are specific formulas that can give us the slope and intercept for a least-squares regression line. Luckily for us, R has these formulas built into a few commands, so we will not have to manually compute the values. To obtain the least squares regression line in R, we can use the following code:

m1 <- lm(runs ~ at_bats, data = mlb11)

This code is a little different from the code we have worked with so far. When we run it, nothing seems to happen, but something did. Run the code (hit play on the code chunk) and then take a look at the Environment tab in the upper right panel of your R screen. See that we now have an object called m1? You just created it, using the line of code above.

The easiest way to think about it is to imagine that you just told R to create a box called m1. This box was empty. Now, you want R to fill the box with something. This is what the <- part of the code does. It tells R to fill object m1 with whatever comes next in the code line. In this case, we are going to fill the box m1 with all of the information we need to build our regression line.

Now, an important point. Let’s say you run a different command in the form m1<-. This means that you will empty out your m1 box and then fill it according to the new command. In other words, you will replace your original m1 with the result of your new command. Because of this, whenever you create a new object in R, you need a new name. These names are case sensitive, meaning that M1 and m1 are different.
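A quick sketch of the box idea, using hypothetical object names (box and Box) so that nothing here touches your m1:

```r
# Create a box called box and fill it with the number 5.
box <- 5

# Reusing the same name empties the box and refills it: the 5 is gone.
box <- "new contents"
box

# Names are case sensitive, so Box (capital B) is a separate object.
Box <- 10
Box
```

After running this, the Environment tab should list both box and Box as separate objects.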

Let’s get back to the actual line of code defining m1. The function name lm (short for linear model) tells R that we want to fit a regression line. So, m1 <- lm( ) will tell R to create a least-squares regression line and name it m1. Our next step is to tell R what to use to actually build the line. What is the response variable? What is the explanatory variable?

To tell R which variables to use to create the line, we use a formula that takes the form y ~ x. See how our code says m1 <- lm(runs ~ at_bats, data = mlb11)? This can be read as telling R we want to make a regression line (lm) with runs as our response variable and at_bats as our explanatory variable. Notice that we don’t use the $ notation here. Why? Because the data = mlb11 part of the command tells R to look in the mlb11 data frame to find the runs and at_bats variables.
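To illustrate why the $ is not needed, the sketch below fits the same line two ways on a made-up data set (toy, fit_data, and fit_dollar are hypothetical names, and the numbers are invented rather than taken from mlb11).

```r
# Made-up numbers for illustration only; these are NOT the mlb11 values.
toy <- data.frame(
  at_bats = c(5400, 5450, 5500, 5550, 5600, 5650),
  runs    = c(620, 655, 640, 700, 715, 750)
)

# With data = toy, R looks inside toy for runs and at_bats.
fit_data <- lm(runs ~ at_bats, data = toy)

# Without data =, each variable has to be spelled out with $.
fit_dollar <- lm(toy$runs ~ toy$at_bats)

# The estimates agree; only the labels on the output differ.
fit_data$coefficients
fit_dollar$coefficients
```

The data = version is usually easier to read, which is likely why the lab uses it.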

Now, what exactly is m1? It’s not a data set. Instead, it holds bits and pieces of information that result from fitting a least-squares regression line. The output of lm is an object that contains all of the information we need about the regression line that was just fit. Specifically, right now our goal is to use R to get estimates of the intercept a and the slope b. How do we get those out of m1? Use the code:

m1 <- lm(runs ~ at_bats, data = mlb11)
m1$coefficients

This reaches into the “box” m1 and retrieves the estimated coefficients, i.e., the estimates of a and b.
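As a hedged illustration on made-up data (toy and fit are hypothetical names; the numbers are not the mlb11 values), the sketch below pulls out the two estimates and checks them against hand formulas. It assumes the class formulas are the usual ones, b = r * s_y / s_x and a = mean(y) - b * mean(x).

```r
# Made-up numbers for illustration only; these are NOT the mlb11 values.
toy <- data.frame(
  at_bats = c(5400, 5450, 5500, 5550, 5600, 5650),
  runs    = c(620, 655, 640, 700, 715, 750)
)

fit <- lm(runs ~ at_bats, data = toy)
est <- fit$coefficients

# The first entry is the intercept a; the second is the slope b.
a <- est[["(Intercept)"]]
b <- est[["at_bats"]]

# Assumed hand formulas: slope from the correlation and the two
# standard deviations, then the intercept from the two means.
b_by_hand <- cor(toy$at_bats, toy$runs) * sd(toy$runs) / sd(toy$at_bats)
a_by_hand <- mean(toy$runs) - b_by_hand * mean(toy$at_bats)

c(a = a, b = b)
c(a = a_by_hand, b = b_by_hand)
```

The two printed pairs should match, which is one way to convince yourself that lm is applying the same formulas for you.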

With this information, we can write down the least squares linear regression (LSLR) line.

Question 8

Write down the LSLR line.

Note: If you want to produce the notation, copy and paste the following into WHITE SPACE in your Markdown file (NOT a chunk) and adapt it:

$$\widehat{Y} = a + b X $$

Question 9

Think back to your plot. Does it make sense that the slope coefficient is positive? Why or why not?

Drawing the line and considering residuals

It is often helpful to actually visualize the least squares line by drawing it on the data. Let’s create a scatterplot with the least squares line laid on top.

plot(x = mlb11$at_bats, y = mlb11$runs, xlab = "At Bats", ylab = "Runs")
abline(m1)

Together, the two lines of code above (1) create a scatter plot for at_bats and runs and (2) draw on the least squares line. You do need both lines of code, but the function abline is the one that actually plots a line based on the slope and intercept we estimated in m1.
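To make the link between abline and the estimated coefficients concrete, here is a small sketch on made-up data (toy and fit are hypothetical names; the numbers are not the mlb11 values). It draws the line once from the fitted object and once by passing the intercept and slope directly through abline's a and b arguments.

```r
# Made-up numbers for illustration only; these are NOT the mlb11 values.
toy <- data.frame(
  at_bats = c(5400, 5450, 5500, 5550, 5600, 5650),
  runs    = c(620, 655, 640, 700, 715, 750)
)
fit <- lm(runs ~ at_bats, data = toy)

# Step 1: the scatter plot. Step 2: the least squares line on top.
plot(x = toy$at_bats, y = toy$runs, xlab = "At Bats", ylab = "Runs")
abline(fit)

# The same line drawn from its intercept (a) and slope (b), dashed,
# so it should sit exactly on top of the first one.
a <- fit$coefficients[["(Intercept)"]]
b <- fit$coefficients[["at_bats"]]
abline(a = a, b = b, lty = 2)
```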

Looking at the line we have just drawn, we will notice that the line does not go through every data point. This should not surprise us. A least-squares regression line is a way to approximate the relationship between y and x, but it will not perfectly fit every data point. In order to explore how well the line estimates each data point, we use residuals.

In order to visualize the residuals, we will use a special command. Copy and paste the following into a chunk, and hit play.

plot_ss(x = mlb11$at_bats, y = mlb11$runs)

After running this command, you will see a scatter plot. The line you specified is shown in black and the residuals are shown in blue.

Question 10

In words, what do the residuals represent? When we fit least-squares regression lines, do we want the residuals to be small or large? Explain.

Note that there are 30 residuals, one for each of the 30 observations. Recall that the residual for observation i is the observed value minus the predicted value: \[ e_i = y_i - \hat{y}_i \]
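The residual formula can be sketched in R as follows. The data are made up (toy and fit are hypothetical names, and there are 6 rows rather than 30), so the numbers are not answers to the questions below.

```r
# Made-up numbers for illustration only; these are NOT the mlb11 values.
toy <- data.frame(
  at_bats = c(5400, 5450, 5500, 5550, 5600, 5650),
  runs    = c(620, 655, 640, 700, 715, 750)
)
fit <- lm(runs ~ at_bats, data = toy)

a <- fit$coefficients[["(Intercept)"]]
b <- fit$coefficients[["at_bats"]]

# Predicted value for every row: y-hat = a + b * x.
y_hat <- a + b * toy$at_bats

# Residual for every row: observed minus predicted.
# A positive residual means the line predicted too low for that row.
e <- toy$runs - y_hat
e

# R stores the same values inside the fitted object.
resid(fit)
```

There is one residual per row of the data, and for a least squares line with an intercept they add up to zero apart from rounding error.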

Question 11

Use the least squares line to answer the following. If a team manager saw the least squares regression line and not the actual data, about how many runs would they predict for a team with 5579 at-bats? Is this estimate an overestimate or an underestimate of the true value?

Question 12

What is the residual for the prediction of runs for a team with 5579 at-bats?

Note:

This lab is adapted from the Intro to Linear Regression Lab, which is a product of OpenIntro that is released under a Creative Commons Attribution-ShareAlike 3.0 Unported license. The original lab was adapted for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel from a lab written by the faculty and TAs of UCLA Statistics.

This version of the lab is not endorsed by OpenIntro and was last updated by Nicole Dalzell May 30, 2022.