Goal

This lab will explore how statistical computing can help us answer questions using data. Last week, we discussed different statistical measures like IQR, standard deviation, and variance. We also learned how to plot box plots, histograms, and scatter plots. We will apply these measures and plots in this lab to examine numerical data.

This week we learned about linear regression. In this lab, we will use least squares regression to fit linear models to data.

If you want a reminder on how to use R or R Markdown, look back at Lab 1!

The Data and Installing Library and Packages

As we learned in the last lab, the first thing we need to start an analysis is data. Our data for today comes from the movie series Star Wars. We are going to look at information on the height and weight of different characters. In order to do that, we need a data set.

The Star Wars data set is stored in a library called dplyr. Your first question is probably “what is a library?” A library in R is a collection of code that performs a certain task. For example, the library ggplot2 contains code for creating professional data visualizations. As we work in R, sometimes we need to load libraries in order to access the code that we want. So, how do we do that?

Look at the very top of your RStudio screen. You should see an option called Tools. Click on Tools and from the drop-down menu select Install Packages. In the prompt box that appears, type dplyr, and then hit Install. Now, a whole bunch of output is going to start to appear on your screen. We can ignore all of it. This is just how R tells us it is installing a library.

Once the package installs, there is one more step we need to do to access the Star Wars data. Create a chunk in R, copy and paste the following, and hit play.

library(dplyr)
data("starwars")

If you look in your environment tab in the upper right-hand corner of your RStudio screen, you should see a data set called starwars. Let’s start to explore this data set.
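
As a small illustration of what exploring a data set can look like, the sketch below builds a tiny made-up data frame and asks R for its size. The name toy and all of its values are hypothetical and are not part of starwars; the same commands should work on any data frame.

```r
# A tiny, made-up data frame (hypothetical values, not the real starwars data)
toy <- data.frame(
  name   = c("A", "B", "C"),
  height = c(172, NA, 96),
  mass   = c(77, 75, 32)
)

nrow(toy)  # number of cases (rows)
ncol(toy)  # number of variables (columns)
dim(toy)   # both at once: rows first, then columns
```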

Question 1

How many cases (rows) are in this data set? How many variables?

You will notice something interesting in the 4th column of the data set. Some of the entries are recorded as NA. NA stands for “not available” and is one way that people indicate in a data set that information is missing. In other words, when we see NA in a data set, this tells us that a certain piece of information is not known.
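
If you would rather have R find the missing entries than look for them by eye, one possible approach is sketched below. The vector heights is made up for illustration; is.na() is a standard R function that marks each missing value.

```r
# Made-up heights with two unknown values (hypothetical data)
heights <- c(172, NA, 96, NA, 188)

is.na(heights)       # TRUE wherever a value is missing, FALSE otherwise
sum(is.na(heights))  # counts the missing values, because each TRUE counts as 1
```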

Considering Height

Let’s start off by exploring the height (in centimeters) of different characters in Star Wars. This is an interesting question to explore because there are multiple different species in the Star Wars universe. Whenever you start to explore a variable, remember that the tools you use will depend on the type of variable you are exploring. The tools we use to describe a categorical variable are different from the tools we use to describe a numeric variable.

In R, one powerful command for helping us explore a variable is the summary command. Copy and paste the code from the chunk below and press play.

summary(starwars$height)

Question 2

For how many characters in this data set is their height unknown?

Now let’s calculate the variance and standard deviation of height. You will need three new commands: mean(), var(), and sd(). The var() command gives the variance of the target column, which in our case is height. The sd() command gives the standard deviation. But there is a problem: for some of the characters in the data set, the height is unknown. These unknown heights are called missing values. If you just use the mean() command, the output will be NA. The general idea in R is that NA stands for “unknown”. If some of the values in a vector are unknown, then the mean of the vector is also unknown. Often, though, it makes sense to set the missing values aside and compute the mean of the values we do know. One of the easier ways to do this is the argument na.rm = TRUE, which calculates the result while ignoring the missing values. If you write mean(object, na.rm = TRUE) in the chunk, you will get the mean. Note that TRUE must be written in capital letters, because R is case sensitive. Use the same process to calculate the standard deviation and variance. An example for the mean is given in the chunk.

mean(starwars$height, na.rm=TRUE)
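
To see what na.rm = TRUE changes, here is a minimal sketch on a made-up vector. The name x and its values are hypothetical, chosen so the arithmetic is easy to check by hand.

```r
# Made-up values with one missing entry (hypothetical data)
x <- c(150, 160, NA, 170, 180)

mean(x)                # NA: one value is unknown, so the mean is unknown
mean(x, na.rm = TRUE)  # 165: the mean of the four known values

var(x, na.rm = TRUE)   # variance of the known values
sd(x, na.rm = TRUE)    # standard deviation of the known values

# The standard deviation is the square root of the variance
sqrt(var(x, na.rm = TRUE))
```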

Question 3

What are the mean, variance, and standard deviation? Explain the result of the standard deviation in relation to the mean.

Boxplots

A box plot visualizes the center and spread of a distribution quite differently from a histogram. Specifically, box plots show the first quartile, median, and third quartile of a variable, and also make it easier to see outliers, i.e., unusually large or small values of the variable.

To make a boxplot in R, the command we need is boxplot.

boxplot(object, col= "some color", 
        xlab= 'the X axis label', horizontal = TRUE)

The horizontal = TRUE part of the code just tells R that we want a horizontal box plot. If you want a vertical boxplot, just remove this piece of the code. We are starting to see that when we have a primary command, like boxplot or histogram, we can then add arguments, or extra pieces of the command that help us to personalize our plot. Adding color, adding labels to the axes, and choosing whether the plot is vertical or horizontal are all examples of things we can specify using arguments.
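
The sketch below fills in the arguments on a made-up vector, so you can see how the template turns into a working command. The name values, its numbers, the color, and the label are all hypothetical. boxplot.stats() is a base R helper that reports which points the plot would draw as outliers.

```r
# Made-up values with one unusually large observation (hypothetical data)
values <- c(10, 12, 13, 14, 15, 16, 18, 60)

# The boxplot command with each argument filled in
boxplot(values, col = "steelblue",
        xlab = "Made-up values", horizontal = TRUE)

# The points that the plot draws as outliers (here: 60)
boxplot.stats(values)$out
```

Removing horizontal = TRUE from the call above should give the same plot drawn vertically.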

Question 4

Adapt the code above to create a boxplot of the height of the Star Wars characters in this data set. Label the x axis “Height of characters (in centimeters)”. Make the plot any color you like, but do not use gold, black, or white! For suggestions, look at http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf

Question 5

Which measure of center is depicted in a boxplot: the mean or the median?

Question 6

Based on the boxplot, are there any outliers in terms of the characters’ heights?

Question 7

Adapt the code above to create a box plot of the mass of the Star Wars characters in this data set. Label the x axis “Mass of characters (in kilograms)”. Make the plot any color you like, but do not use gold, black, or white!

Question 8

Based on the box plot, there is one very large outlier in terms of mass. Which Star Wars character has this very large body mass? Hint: You will need to open your data set!

Scatter Plot and Linear Regression

The Data

The use of data to improve team performance, training, and recruiting is becoming increasingly popular and important for college and professional sports. There is an entire field called sports analytics that is devoted to using statistics and data in professional sports. Today we are going to work with data from 30 different Major League Baseball teams. We’re interested specifically in figuring out what variables might be related to the number of runs a team scores during a Major League Baseball season. This is important because runs are how a team scores: the team with more runs wins the game. This means that identifying variables that might be related to the number of runs could help a team improve their chances of winning a game.

As always, our first step is to load the data. We know how to do this now: we create a chunk, paste in the code, and then press play!

load(url("http://www.openintro.org/stat/data/mlb11.RData"))

You will notice that this command is different from the code we used last time to load the data. There are actually quite a few ways to load data into R, and we will use several as we go through this course.

Question 9

What is the name of the data set you have just loaded?

Our response variable for our analysis for today is runs, the number of runs a team scored during a season of Major League Baseball. Each row in the data set represents a different team during this season.

One formatting note. Right now, if you knit your Markdown, the command to load the data will show up in your PDF. We need that command to run, yes, but we don’t actually need to see it. If you want to hide the R code in a chunk from your PDF, but you still need the code to run, there is a way to do it. Look at the top of your R chunk. You should see {r}. If you want to stop the R code for that chunk from appearing in your PDF, change this to {r,echo=FALSE}.

Graphing the Relationship between two numeric variables

Suppose the manager of a Major League Baseball team asks you if the number of times the team is up to bat has a strong relationship with the number of runs scored during the season. The client is specifically interested in having you use a least squares linear regression model to describe the relationship between X = at bats and Y = runs. Remember that response variables are usually denoted by a capital letter Y, and explanatory variables are usually denoted by a capital letter X.

To complete this task, we need to

(1) Determine if using least squares linear regression is appropriate, and
(2) If so, build and interpret a least squares linear regression line.

We need to start with (1), because it will not make sense to use linear regression if the relationship between X = at bats and Y = runs is not the right shape.

In order to check the shape, we need to create a visualization that explores the relationship between at bats and our numeric response variable, runs. A scatter plot does exactly this, so that is the plot we are going to use.

Now that we know what type of plot we need, let’s make it. Recall from Lab 1 that a helpful command for plotting is:

plot(x = , y = , xlab = " ", ylab = " ")

One reminder. In order to plot any variable, we have to tell R (1) the name of the data set and (2) the name of the column. This means that if you want to plot runs, you actually need the code mlb11$runs. This tells R “Hey, look at the data set mlb11 and grab a column ($) called runs”. Also remember that when you name your axes, you need to put quotation marks around your chosen labels.

Question 10

Plot the relationship between X= at_bats and Y = runs. Show the plot as your answer, and label your axes.

Creating the plot is an excellent first step to answering the manager’s question, but once we have created the plot we then have to be able to interpret what this plot is telling us. Generally, when we are comparing two numeric variables using this type of plot, we are interested in four things:

(1) Does the relationship seem linear?
(2) Does the relationship between these variables seem to be strong or weak?
(3) Is the relationship positive, negative, or neither?
(4) Are there any points that seem not to follow the trend of the rest of the data? In other words, are there any outliers?

There is a reason why we are interested in these specific questions. The first question (Does the relationship seem linear?) tells us whether or not using a least-squares regression line makes sense for these two variables. All models have assumptions. A big one for least-squares regression lines is that the relationship between the two variables we are modeling should be able to be described using a line. In other words, the relationship should look linear. If it does not, we should not use a least-squares regression line.

Question 11

Does it seem reasonable to consider a line to describe the relationship between X = at bats and Y= runs?

The second question asks about the strength of the relationship between the two variables. This is important because a least squares linear regression line like the client has asked us for uses X to describe the variation in Y. In other words, we are hoping to be able to say something like “as X goes up by 1, Y generally goes up by 2.5”. If the relationship between X and Y is not very strong, then the line may not be a good tool for describing the relationship.

It can be difficult to assess the strength of a linear relationship just by looking at a scatterplot. Luckily, we have learned that a numerical way to quantify the strength of a relationship is with the correlation. To determine the correlation of two variables firstvariable and secondvariable in R, use the code:

cor(firstvariable , secondvariable )
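
To get a feel for what the number means, here is a sketch using made-up vectors. The names x, y_line, and y_noisy are hypothetical and have nothing to do with the baseball data. Points that fall exactly on a line give a correlation of 1 or -1, and scatter pulls the value toward 0.

```r
# Made-up data (hypothetical values)
x       <- 1:5
y_line  <- 2 * x + 1         # falls exactly on an upward-sloping line
y_noisy <- c(2, 1, 4, 3, 5)  # upward trend, but with some scatter

cor(x, y_line)    # 1: a perfect positive linear relationship
cor(x, -y_line)   # -1: a perfect negative linear relationship
cor(x, y_noisy)   # 0.8: positive, but not perfect
```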

Question 12

What is the correlation between X = at bats and Y = runs? What does this tell us about the strength of the linear relationship between these two variables? Hint: You need to replace firstvariable and secondvariable with your two variables of interest. To specify a variable, give the name of the data set, then the $ sign, and then the name of the variable. Here the variables are columns of the data set.

Fitting the least-squares regression line

Once we have determined that regression is an appropriate choice, we can move into the actual process of estimating our line. If we choose to use a least-squares line, we are making the choice that we can describe the relationship between y and x using a line. The question then becomes…which line?

We have discussed in class that there are specific formulas that can give us the slope and intercept for a least-squares regression line. Luckily for us, R has these formulas built into a few commands, so we will not have to manually compute the values. To obtain the least squares regression line in R, we can use the following code:

m1 <- lm(runs ~ at_bats, data = mlb11)

This code is a little different from those we have worked with. We run the code, and nothing seems to happen…but it did. Run the code (hit play on the code chunk) and then take a look at the workspace (environment) in the upper right panel of your R screen. See that we now have an object called m1? You just created it, using the line of code above.

The easiest way to think about it is to imagine that you just told R to create a box called m1. This box was empty. Now, you want R to fill the box with something. This is what the <- part of the code does. It tells R to fill object m1 with whatever comes next in the code line. In this case, we are going to fill the box m1 with all of the information we need to build our regression line.

Now, an important point. Let’s say you run a different command in the form m1<-. This means that you will empty out your m1 box and then fill it according to the new command. In other words, you will replace your original m1 with the result of your new command. Because of this, whenever you create a new object in R, you need a new name. These names are case sensitive, meaning that M1 and m1 are different.
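
A minimal sketch of both points, using hypothetical object names (my_box and My_box) that are not used anywhere else in this lab:

```r
my_box <- 5        # create an object called my_box that holds the number 5
my_box <- "hello"  # same name again: the 5 is thrown out and replaced
my_box             # now holds "hello"

My_box <- 10       # capital M: R treats this as a completely separate object
my_box             # still "hello"
My_box             # 10
```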

Let’s get back to the actual line of code defining m1. The function lm (short for linear model) tells R that we want to create a regression line. So, m1 <- lm( ) will tell R to create a least-squares regression line and name it m1. Our next step is to tell R what to use to actually build the line. What is the response variable? What is the explanatory variable?

To tell R what variables to use to create the line, we use a formula that takes the form y ~ x. See how our code says m1 <- lm(runs ~ at_bats, data = mlb11)? This can be read as telling R we want to make a regression line (lm) with runs as our response variable and at_bats as our explanatory variable. We notice that we don’t use the $ notation here. Why? Because the data = mlb11 part of the command specifies that R should look in the mlb11 data frame to find the runs and at_bats variables.

Now, what exactly is m1? It’s not a data set. Instead, it holds bits and pieces of information that result from fitting a least-squares regression line. The output of lm is an object that contains all of the information we need about the regression line that was just fit. Specifically, right now our goal is to use R to get estimates of the intercept a and the slope b. How do we get those out of m1? Use the code:

m1 <- lm(runs ~ at_bats, data = mlb11)
m1$coefficients

This reaches into the “box” m1 and retrieves the estimated coefficients, i.e., the estimates of a and b.
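
The same steps can be tried on a small made-up data set where the answer is easy to check by hand. The names toy and fit and all of the values are hypothetical. coef() is a standard R function that returns the same values as $coefficients.

```r
# Made-up data (hypothetical values, not the baseball data)
toy <- data.frame(x = 1:5, y = c(2, 1, 4, 3, 5))

# Fit the least squares line with y as the response and x as the explanatory variable
fit <- lm(y ~ x, data = toy)

fit$coefficients  # the intercept comes first (0.6), then the slope (0.8)
coef(fit)         # another way to pull out the same two numbers
```

For this toy data the fitted line would be written as y-hat = 0.6 + 0.8x.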

With this information, we can write down the least squares linear regression (LSLR) line.

Question 13

Write down the LSLR line.

Drawing the line and considering residuals

It is often helpful to actually visualize the least squares line by drawing it on the data. Let’s create a scatterplot with the least squares line laid on top.

plot(x = mlb11$at_bats, y = mlb11$runs, xlab = "At Bats", ylab = "Runs")
abline(m1)

Together, the two lines of code above (1) create a scatter plot for at_bats and runs and (2) draw on the least squares line. You do need both lines of code, but the function abline is the one that actually plots a line based on the slope and intercept we estimated in m1.

Looking at the line we have just drawn, we will notice that the line does not go through every data point. This should not surprise us. A least-squares regression line is a way to approximate the relationship between y and x, but it will not perfectly fit every data point. In order to explore how well the line estimates each data point, we use residuals.

In order to visualize the residuals, we will use a special command. Copy and paste the following into a chunk, and hit play.

plot_ss(x = mlb11$at_bats, y = mlb11$runs)

After running this command, you will see a scatter plot. The line you specified is shown in black and the residuals are shown in blue.

Question 14

In words, what do the residuals represent? When we fit least-squares regression lines, do we want the residuals to be small or large? Explain.

Note that there are 30 residuals, one for each of the 30 observations.
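
R can also report the residuals as numbers. The sketch below uses a small made-up data set (the names toy and fit and all values are hypothetical) and the standard functions fitted() and resid().

```r
# Made-up data (hypothetical values, not the baseball data)
toy <- data.frame(x = 1:5, y = c(2, 1, 4, 3, 5))
fit <- lm(y ~ x, data = toy)

fitted(fit)          # the value on the line for each observation
resid(fit)           # the residual for each observation
toy$y - fitted(fit)  # the same residuals computed by hand: observed minus fitted

length(resid(fit))   # 5: one residual for each of the 5 observations
```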

Question 15

Use the least squares line (the equation, not the picture) to answer the following. If a team manager saw the least squares regression line and not the actual data, about how many runs would they predict for a team with 5579 at-bats?

Question 16

What is the residual for the prediction of runs for a team with 5579 at-bats? Interpret this value in words. Hint: You will need to open the data set to see the true value for runs for this team.
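
If you want to check your arithmetic, the sketch below shows one way to get a prediction and a residual from a fitted line. It uses a small made-up data set, so the names toy and fit and all values are hypothetical; you would need to adapt it to m1 and mlb11 yourself.

```r
# Made-up data (hypothetical values, not the baseball data)
toy <- data.frame(x = 1:5, y = c(2, 1, 4, 3, 5))
fit <- lm(y ~ x, data = toy)

# Prediction by hand at x = 3: intercept + slope * 3
b <- coef(fit)
b[1] + b[2] * 3

# The same prediction using predict(); newdata must use the explanatory variable's name
pred <- predict(fit, newdata = data.frame(x = 3))
pred

# Residual = observed value minus predicted value
observed <- toy$y[toy$x == 3]
observed - pred
```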

Assessing the line

Okay, so we have estimated the coefficients we needed based on the fact that we chose a least-squares regression line. We now have a line. Wonderful. Before we use this line to make any sweeping conclusions about baseball, let’s see if we can determine how good a job our line actually does at explaining the variation in runs.

We have learned that one way to examine fit is to consider the amount of variation in our response that is actually able to be explained by the line. In other words, how much of the variation in runs can we explain by at_bats? To get at this, we need to look at the R-squared.

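There are two ways to get at the R-squared in R. The first is to find the correlation (which we already did) and literally square it, as in the code below. The second is to run summary(m1) and read the value labeled Multiple R-squared.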

cor( mlb11$at_bats, mlb11$runs)^2
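
As a check that squaring the correlation really gives the R-squared of the fitted line, here is a sketch on a small made-up data set. The names toy and fit and all values are hypothetical. summary(fit)$r.squared pulls the R-squared out of the standard lm summary.

```r
# Made-up data (hypothetical values, not the baseball data)
toy <- data.frame(x = 1:5, y = c(2, 1, 4, 3, 5))
fit <- lm(y ~ x, data = toy)

cor(toy$x, toy$y)^2     # 0.64: the squared correlation
summary(fit)$r.squared  # the R-squared stored in the model summary, also 0.64
```

For a line with a single explanatory variable the two numbers should always agree.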

Question 17

What percentage of the variability in runs is associated with the number of times a team was at bat?

Summary

In this lab, we learned how to calculate statistical measures such as the standard deviation and variance. We also learned how to customize plots such as box plots and scatter plots. Finally, we learned how to interpret the results of linear regression.