Today, we are going to focus on using R to understand the relationship between two numeric variables, and to fit a LSLR model using those two variables.
In the last lab, we learned that the "data environment" (the upper right hand panel of RStudio) is where all of our data sets are stored. If you look at your Environment now, you will probably see the data set from the last lab. If your Environment is blank, don't worry about it! If it's not blank, there is a way to clear your Environment, i.e., remove data sets that you no longer need.
To clear all of the data in the upper right hand panel, look at the top of the panel. Beside "Import Dataset" there is an image that looks like a small broom. Pushing that will "clean" your space, meaning that it will remove all the contents within the panel. It is good practice to do this before you start each lab. This keeps us from having a lot of unnecessary stuff cluttering up this window, making the data sets you need easier to find.
Everything we learned in Lab 1 still applies. Flip back to those more detailed instructions as you need to throughout the course. We recall that we begin every new lab by creating a Markdown file. Go ahead and do that, and remember that you need to delete everything after line 12 in the template that comes up. You will do this for every Markdown you create in this class.
Now that you have a clean template and a clean Environment, we're ready to start a new lab.
Today, we are going to work with data on nutrition information for Tall (12 ounce) drinks at Starbucks. The data set comes from Kaggle, an online data repository with data sets on a wide range of topics.
As we discussed in Lab 1, the first thing we need to do to start any analysis in R is to load the data. Last class, we loaded data directly from the internet using a single line of code. We were able to do this because the data was available on a website. However, not all data sets are able to be loaded into R in this way. Today, we are going to load the data from a very common file type, a .csv file.
The data that you need is on Canvas, under Lab 2 in the assignments tab. Go ahead and download the data set from Canvas. A note for Mac users. It is strongly recommended that you use a browser other than Safari when you do this. When you are prompted to save the data, make sure to save the .csv file in the same folder where you generally save your STA 112 work. It is not helpful if you download the file and then cannot find it!
Now that we have downloaded the data, we need to move the data from your computer into R. To do that, look at your Environment (the upper right hand panel). There is an icon called Import Dataset. When you click on the drop down arrow, you will be presented with a few choices. We are looking for one of two choices: "From Text (base)" or "From CSV". The option you are presented with will depend on your machine, but either one works! The purpose of these options is to help R know what type of file it should expect to import.
Once you have selected the appropriate option, you will be prompted to browse within your computer files to find the data set. Go ahead and navigate to the folder where you stored the data, and click on it. A preview of the data set will appear. Make sure that your data has a heading/header, i.e., that there are column labels on the data set. Once the preview looks okay, go ahead and load/import the data. Take a look in your Environment and see that you now have a data set called Starbucks with 230 observations.
At this point, R can see the data set. However, RMarkdown cannot see it yet. This is important. RMarkdown will only be able to see the data if the line of code that loads the data is in a chunk in your Markdown document.
What line of code?? Look at your Console, which should be the lower left panel of your screen. You should see a line of code that includes Starbucks <- read.csv() or Starbucks <- read_csv(). Inside the parentheses will be the path that specifies where the file is stored on your computer. Copy that entire line of code, from Starbucks to the final parentheses, and paste it into a chunk in your Markdown document. Depending on which import option you used, your code may say read.csv rather than read_csv; either one is fine. Congratulations, you have now successfully loaded the data into RMarkdown!
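For reference, once you paste it into a chunk, the data-loading line might look something like the sketch below; the file path shown is just a made-up example, and yours will be whatever path appears in your Console.

# Load the Starbucks data from a .csv file (the path here is only an example)
Starbucks <- read.csv("~/Documents/STA112/starbucks.csv")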
Now, why did we have to do all that? It seemed much easier just to use the single line of code from last time! While it is true that loading that way is faster, it is often the case that data files are not able to be loaded by using a URL. More often than not, the data sets are in the form of an Excel or .csv file, and they need to be moved into R before we can work with them. The process you have just gone through allows you to load any such data set into R.
One formatting note. Right now, if you knit your Markdown, the command to load the data will show up in your PDF. We need that command to run, yes, but we don't actually need to see it. If you want to hide the R code in a chunk from your PDF, but you still need the code to run, there is a way to do it. Look at the top of your R chunk. You should see ```{r}. If you want to stop the R code for that chunk from appearing in your PDF, change this to ```{r,echo=FALSE}. Knit your document and notice that the command to load the data has disappeared. You can use this technique on any chunk where you do not want the line of code to show up in the document, but you do want the code to run.
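For instance, a hidden version of the data-loading chunk might look like the sketch below (again, the path is just a placeholder for whatever your Console showed):

```{r,echo=FALSE}
# This runs when you knit, but the code itself does not appear in the PDF
Starbucks <- read.csv("~/Documents/STA112/starbucks.csv")
```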
Based on the name of this lab and the stated goals, it's probably safe to assume we will be using a LSLR model for the relationship between our two variables of interest. However, in practice we will not have that information at the beginning of an analysis. What we will have is data, and a question of interest. It is then the job of the statistician to determine what techniques are appropriate to help address the question of interest.
Sugar content and its relationship to the obesity epidemic in the United States is a hot topic of research and debate. Restaurants are being required to post calories and sugar content of items on their menus, and companies like soda manufacturers are posting information cautioning consumers on the health risks associated with sugar. Today, we are going to explore the relationship between sugar content and the number of calories in 230 Starbucks beverages. Specifically, we want to answer the question: Is higher sugar content associated with higher calories in Starbucks beverages?
Based on this question of interest, what is our response variable? Our explanatory variable?
With our question of interest defined, we can now begin the process of exploring the data. As a first step, we will conduct exploratory data analysis, or EDA. This process of visualizing and otherwise exploring the data provides information that will help us decide on statistical techniques appropriate for the task we are trying to accomplish. This is the first step in any modeling task. Why? Because we need to choose a form for our model! Understanding the data will help us decide what models are appropriate for our research question.
The process of conducting EDA is different for different types of data. If we have one numeric variable, for instance, we tend to make a histogram or a box plot. If we have one categorical variable, we tend to use a table or a bar plot. For two numeric variables, we use a scatter plot. For one numeric and one categorical, we use a side by side box plot.
To use R to determine whether or not a variable is numeric, we use the command class. For instance, to check to see if Calories is a numeric variable, we can use the code:
class(Starbucks$Calories)
Notice that when we run this code, R returns the word "integer", which tells us that Calories contains only integers and is therefore a numeric variable. The result "numeric" is another option that suggests we have a numeric variable.
Notice in this code that we have provided R (1) the name of the data set and (2) the name of the column. This means that if you want to get the class of Calories, you actually need the code Starbucks$Calories. This tells R "Hey, look at the data set Starbucks and grab ($) a column called Calories".
Why do we need to bother to check this? Didn't we learn how to distinguish numeric and categorical variables in intro stat? We did, but just because our intuition tells us that a variable should be numeric does not mean the creators of the data set chose to record the values as a number. For instance, let's consider the variable VitaminA, which tells us how much Vitamin A is in each Starbucks drink. Our intuition likely tells us that this is a numeric variable. Let's check.

What type of variable does R think VitaminA is? Show the code you used to check your answer. What class did we expect? Why do you think R believes the variable is not the class we expect? Hint: Take a look at the symbols used to record the variable.
The moral of the story: It is important to check to see how a variable is actually recorded in a data set. It may not be what you think!
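As an aside, if a variable that should be numeric turns out to be stored as text because of extra symbols (a percent sign, for example), one common fix is to strip the symbol and convert what remains. The variable and values below are made up purely for illustration; you do not need to do this for today's lab.

# Hypothetical example: values recorded as text because of a % symbol
vitamin_example <- c("10%", "0%", "25%")
as.numeric(gsub("%", "", vitamin_example))  # strip the % and convert to numbers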
Let's return to our research question. Is higher sugar content associated with higher calories in Starbucks beverages? This means that we are going to focus our attention on two variables: Calories and Sugars.
What type of plot would you use to explore the relationship between Calories and Sugars? Why?
Plot the relationship between Sugars and Calories, making sure that the explanatory variable you identified in Question 1 is on the x-axis. Show the plot in your answer, and clearly label your axes. Hint: Remember that you have to run the code library(ggplot2) in a chunk before you try to make your first plot.
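If you need a reminder of the ggplot syntax from the last lab, a minimal sketch of a scatter plot is below; the axis labels are only suggestions, so pick labels that describe your variables.

library(ggplot2)

# Scatter plot with the explanatory variable on the x-axis
ggplot(Starbucks, aes(x = Sugars, y = Calories)) +
  geom_point() +
  labs(x = "Sugars (grams)", y = "Calories")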
Now that we have used R to create a plot, we need to read it. Reading a plot means interpreting the information the plot is telling us. When we create a plot for two numeric variables, there are typically four pieces of information we try to read:

1. What is the shape of the relationship?
2. What is the strength of the relationship?
3. Is the relationship positive, negative, or neither?
4. Are there any points that seem not to follow the trend of the rest of the data?

Using your plot, describe the relationship between Sugars (in grams) and Calories. Make sure to comment on all four things listed above.

There is a reason why we are interested in these four specific questions. The first question (What is the shape of the relationship?) tells us whether or not using a line makes sense for this data set. All models have assumptions. A big one for LSLR models is that the relationship between the two variables we are modeling should be able to be described using a line. In other words, the relationship should look linear. If it does not, we should not use a LSLR model.
Based on your plot of Sugars and Calories, does it look like a LSLR model might be a reasonable model to consider? Explain. (We will learn other ways to help answer this question later. For now, just reason based on what you see.)

The second question (What is the strength of the relationship?) tells us whether or not there seems to be an informative relationship between the variables. If there is really no relationship, building a statistical model is not going to be particularly useful. We can still fit a model, but the whole point of modeling is to use one variable (X) to describe the variation in another (Y). If that relationship is weak, our model is not going to be able to explain much variation in Y, and therefore will not be particularly useful for prediction or interpretation.
We have learned that one way to quantify the strength of a linear relationship is with the correlation. To determine the correlation of two variables firstvariable and secondvariable in R, use the following code:

cor(firstvariable, secondvariable)
To do this with your data, replace firstvariable and secondvariable with your variables of interest. Remember that to do this, you have to provide R (1) the name of the data set and (2) the name of the column. This means that if you want to use Calories as your first variable, you actually need to type Starbucks$Calories.
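For example, if you chose Calories as your first variable and Sugars as your second, the call would look like the line below (just a sketch of the syntax).

# Correlation between two columns of the Starbucks data set
cor(Starbucks$Calories, Starbucks$Sugars)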
What is the correlation between Sugars and Calories? What does this tell us about the strength of the linear relationship between these two variables?

The third question (Is the relationship positive, negative, or neither?) tells us what sort of slope to expect when we fit a LSLR model to the data. Do we expect a positive slope? A negative one? This intuition allows us to check our model once we fit it. Do the results of the model make sense with what the plot is telling us?
The fourth question (Are there any points that seem not to follow the trend of the rest of the data?) helps with identifying outliers and influential points that can strongly impact the fit of a model. We'll learn more about these in coming classes.
At this point in the course, this is the end of EDA. One plot, and the correlation. As we move deeper into the material, we will learn that there are more things to look for in this stage of the modeling process. For now, the thing to keep in mind is that the first step to any modeling task is some form of EDA. This helps us to choose the form of the model that might be appropriate for a given data set.
Once we have conducted EDA, we can move into the actual process of building a model. This is Step 2 in model building: Fit the Model. If we choose to use a LSLR model, we are making the choice that we can describe the relationship between y and x using a line. The question then becomes...which line?
Looking at a plot, there are a ton of different lines we could choose. All we need to define a line is two points. So, how do we settle on one line? We discussed in class that when we fit LSLR models, we are actually looking for a line with a very specific property. This line will be called our least-squares linear regression line, or our LSLR line.
To obtain the least-squares linear regression line, i.e., fit a least-squares linear regression model, in R, we can use the following code:
m1 <- lm(Calories ~ Sugars, data = Starbucks)
This code is a little different from those we have worked with. We run the code, and nothing seems to happen...but it did. Run the code (hit play on the code chunk) and then take a look at the Environment in the upper right panel of your R screen. See that we now have an object called m1? You just created it, using the line of code above.
The easiest way to think about it is to imagine that you just told R to create a box called m1. This box was empty. Now, you want R to fill the box with something. This is what the <- part of the code does. It tells R to fill object m1 with whatever comes next in the code line. In this case, we are going to fill the box m1 with all of the information we need to build our LSLR line.
Now, an important point. Let's say you run a different command in the form m1 <- . This means that you will empty out your m1 box and then fill it according to the new command. In other words, you will replace your original m1 with the result of your new command. Because of this, whenever you create a new object in R, you need a new name. These names are case sensitive, meaning that M1 and m1 are different.
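If you want to see this overwriting behavior for yourself, try a short experiment like the one below; the object name m2 is made up so that we do not accidentally empty out our model m1.

m2 <- 5        # create a box called m2 and fill it with the number 5
m2 <- "hello"  # the box is emptied and refilled; the 5 is gone
m2             # prints "hello"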
Let's get back to the actual line of code defining m1. The lm part of the code tells R that we want to fit a least-squares linear regression model. So, m1 <- lm( ) tells R to create a model and name it m1. Our next step is to tell R what to use to actually build the model. What is the response variable? What is the explanatory variable?
To tell R what variables to use to create the model, we use a formula that takes the form y ~ x. See how our code says m1 <- lm(Calories ~ Sugars, data = Starbucks)? This can be read as telling R that we want to make a least-squares linear regression model (lm) with Calories as our response variable and Sugars as our explanatory variable. Notice that we don't use the $ notation here. Why? Because the data = Starbucks part of the command specifies that R should look in the Starbucks data frame to find the Calories and Sugars variables.
We could also use the code m1 <- lm(Starbucks$Calories ~ Starbucks$Sugars) and get the same result. As long as R knows where to find the columns you want, it's good to go.
Now, what exactly is m1? It's not a data set. Instead, it holds bits and pieces of information that result from fitting a model. The output of lm is an object that contains all of the information we need about the LSLR model that was just fit. Specifically, right now our goal is to use R to get the estimated coefficients \(\hat{\beta}_0\) and \(\hat{\beta}_1\). How do we get those out of m1? Use the code:
m1$coefficients
This reaches into the "box" m1 and retrieves the estimated coefficients, i.e., \(\hat{\beta}_0\) and \(\hat{\beta}_1\).
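As a side note, base R also has a coef() function that pulls out the same values; the sketch below assumes m1 has already been created as above.

# An equivalent way to extract the estimated coefficients
coef(m1)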
With this information, we can write down the least-squares linear regression line:
\[ \widehat{Calories} = 45.3571 + 4.5679~Sugars \]
One note in Markdown. Let's say that you want to write down \(\hat{\beta}_1\), using that mathematical notation. Can we do it? Yes, indeed.
There are three types of things you can type into a Markdown document: regular text, R code, and mathematical notation. Anything in the white space is read just as text, like what you would type into Word. Anything in an R chunk is read as R code. If you want to write mathematical notation, we need to tell Markdown, "Hey, we're going to make a math symbol!" To do that, you use dollar signs. For instance, to make \(\hat{\beta}_1\), you simply put $\hat{\beta}_1$ into the white space (not a chunk) in your Markdown. Go ahead and do that. See how the dollar signs change colors? Also note that if you hover your mouse over what you just pasted, the mathematical symbol we want will appear.
The same type of structure can be used for other mathematical symbols. Let's say I want to write out the LSLR line. The code is $\widehat{Calories} = 45.3571 + 4.5679~Sugars$. Notice that we used \hat for \(\hat{\beta}_1\) but \widehat over Calories. Why? Because Calories is a longer word than \(\hat{\beta}_1\) and hence needs a bigger (wider) hat.
A word of caution. You must make sure that the dollar signs at the beginning and end of your mathematical expression are touching text. In other words, $\hat{y}$ will knit just fine but $\hat{y} $ will yield an error. This is important. Your document will not knit if you forget! Now, you can put spaces inside, meaning that $\hat{y} = 4$ is fine, but the beginning and end can have NO spaces.
Okay, so we have (1) chosen a form for the model and (2) fit it, i.e., estimated the coefficients we needed based on the fact that we chose a LSLR model. We now have a line. Wonderful. Before we use this line to make any sweeping conclusions about Starbucks drinks, let's see if we can determine how good a job our model actually does at explaining the variation in Calories.
We have learned that one way to examine a model fit is to consider the amount of variation in our response that is actually able to be explained by the model. In other words, how much of the variation in Calories can we associate with the linear relationship between Calories and Sugars? To get at this, we need to look at the \(R^2\).
There are two ways to get at the \(R^2\) in R. The first is to find the correlation (which we already did) and literally square it.
cor(Starbucks$Sugars, Starbucks$Calories)^2
What percent of the variation in Calories is associated with its linear relationship with Sugars?

There is another way to get at the \(R^2\), as well as a lot of other things about a model that we will use later. We can look at a summary of what is stored in m1.
summary(m1)
The \(R^2\) is actually called the Multiple R-squared in the output. Can you find it? Note that the coefficient estimates are here as well, along with a lot of other things that we don't need now, but will understand soon!
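If you ever want just the \(R^2\) by itself rather than the whole summary, you can also pull it directly out of the summary object; a quick sketch:

# Extract only the R-squared value from the model summary
summary(m1)$r.squared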
There are other ways to assess the fit of a LSLR line, and we will explore those in the next lab.
When we use a LSLR model to describe the relationship between two variables, we are making the claim that a line is a good description of their relationship. A good way to check this is to plot the LSLR line on top of the data. To do this in R, we will use the following code:
ggplot(Starbucks, aes(Sugars, Calories)) + geom_point() + stat_smooth(formula = y ~ x, method = "lm", se = FALSE)
We already know that the first two parts of the code (ggplot and geom_point) create the scatter plot. The final part (stat_smooth) is what actually draws the LSLR line on the plot.
Looking at the line we have just drawn, we will notice that the line does not go through every data point. This should not surprise us. A LSLR line is a way to approximate the relationship between the response variable Y and explanatory variable X. We then use this line to estimate our response variable, i.e., obtain \(\hat{y}_i\), at each value \(x_i\) of our explanatory variable. So, the line goes through an average value for calories at each value of sugar, but is not expected to go through all the points.
Recall that we can use the LSLR line to estimate the value of Y (in our case Calories) for any given value of X (Sugars).
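As an illustration, one way to obtain such an estimate in R is with the predict() function; the value of 30 grams of sugar below is just a made-up example.

# Predicted calories for a hypothetical drink with 30 grams of sugar
predict(m1, newdata = data.frame(Sugars = 30))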
We can use the LSLR line to create predictions, but these predictions will not exactly match the data. One way to assess a LSLR line is to examine how far off our predictions are from the actual data, i.e., to examine the residuals. Recall that the residuals are denoted:
\[ e_i = y_i - \hat{y}_i \]
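R stores a residual for every row of the data inside the fitted model object, so you can peek at them with a line like the sketch below; head() just shows the first few.

# First few residuals from the fitted LSLR model
head(residuals(m1))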
The last step in our LSLR model fitting process is to perform model diagnostics to make sure our LSLR line is a good fit to the data. We will do that in the next lab.