STA 112 Lab 2
Complete all questions and submit your final PDF or HTML file (either works!) under Assignments in Canvas.
Goal
Today, we are going to focus on using R to understand the relationship between two numeric variables, and to fit a LSLR line using those two variables.
Getting Started: Clearing the Workspace
In the last lab, we learned that the “data environment” (the upper right hand panel of RStudio) is where all of our data sets are stored. If you look at your Environment now, you will probably see the data set from the last lab. If your Environment is blank, don’t worry about it! If it’s not blank, there is a way to clear your Environment, i.e., remove data sets that you no longer need.
To clear all of the data in the upper right hand panel, look at the top of the panel. Beside “Import Dataset” there is an image that looks like a small broom. Pushing that will “clean” your space, meaning that it will remove all the contents within the panel.
It is good practice to do this before you start each lab. This keeps us from having a lot of unnecessary stuff cluttering up this window, making the data sets you need easier to find.
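The broom button also has a code equivalent that can be typed into the Console. A minimal sketch, where old_data is a made-up object standing in for a data set left over from a previous lab:

```r
# Create a made-up object so the Environment is not empty
old_data <- c(1, 2, 3)

# List everything currently stored in the Environment
ls()

# Remove every object that ls() lists, much like pressing the broom
rm(list = ls())

# Check again: nothing should be listed now
ls()
```

Either approach leaves you with a blank Environment, so use whichever you find easier.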
Everything we learned in Lab 1 still applies. Flip back to those more detailed instructions as you need to throughout the course. We recall that we begin every new lab by creating a Markdown file. Go ahead and do that, and remember that you need to delete everything after the first chunk in the template that comes up. You will do this for every Markdown you create in this class.
Now that you have a clean template and a clean Environment, we’re ready to start a new lab.
Loading the Data: Coffee, anyone?
Today, we are going to work with data on nutrition information for Tall (12 ounce) drinks at Starbucks. You can download the data by copying and pasting the following link into your browser:
https://drive.google.com/file/d/1KAxD9jTODqeI4zOErDsL6zg0PYqeBEm0/view?usp=sharing
Now that you have downloaded the data, let’s move it into R!
- Step 1 Look at the upper right hand panel of your RStudio screen (the Environment tab).
- Step 2 Find “Import Dataset” or “Import” and click on it. If you don’t see it, look for this symbol instead and click on it. If you have none of these, let Dr. Dalzell know!!
- Step 3 Choose “Text (base)” or “From CSV” (it will depend on your computer).
- Step 4 Navigate to your data (Starbucks.csv) in the list that comes up. Choose it!
- Step 5 Now, look at the bottom right hand panel of your screen. You should see a line of code with something like Starbucks <- read.csv("Starbucks.csv"). Copy that ENTIRE line of code.
- Step 6 In your Markdown file, insert a code chunk.
- Step 7 Paste the line from Step 5 into this gray code chunk, and press the green arrow (the play button).
This means your Markdown should currently look like this (though what is inside the “” will look different for you, as it will point to where the file is on your computer!!)
And you are ready to go!! You now have the data loaded and you can work with it!
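After loading, it can help to check that the import worked before moving on. The sketch below is an illustration only: it writes a tiny made-up file (two invented drinks, not the real Starbucks.csv) to a temporary location so that the code can run anywhere, reads it in with read.csv as in Step 5, and then inspects the result. The names toy_path and toy are placeholders.

```r
# Write a tiny made-up CSV to a temporary file (illustration values only)
toy_path <- tempfile(fileext = ".csv")
writeLines(c("Drink,Calories,Sugars",
             "Toy latte,150,17",
             "Toy tea,60,14"),
           toy_path)

# Read the file into R with read.csv
toy <- read.csv(toy_path)

dim(toy)    # number of rows, then number of columns
names(toy)  # column names
head(toy)   # first few rows
```

With your own data, you would run the last three commands on Starbucks instead of toy.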
Exploratory Data Analysis
Based on the name of this lab and the stated goals, it’s probably safe to assume we will be using a LSLR line for the relationship between our two variables of interest. However, in practice we will not have that information at the beginning of an analysis. What we will have is data, and a question of interest. It is then the job of the statistician to determine what techniques are appropriate to help address the question of interest.
Sugar content and its relationship to the obesity epidemic in the United States is a hot topic of research and debate. Restaurants are being required to post calories and sugar content of items on their menus, and companies like soda manufacturers are posting information cautioning consumers on the health risks associated with sugar. Today, we are going to explore the relationship between sugar content and the number of calories in 230 Starbucks beverages. Specifically, we want to answer the question: Is higher sugar content associated with higher calories in Starbucks beverages? In other words, as sugar content increases, is this associated with higher calories?
Question 1
Based on this question of interest, what is our response variable? Our explanatory variable?
With our question of interest defined, we can now begin the process of exploring the data. As a first step, we will conduct exploratory data analysis, or EDA. This process of visualizing and otherwise exploring the data provides information that will help us decide on statistical techniques appropriate for the task we are trying to accomplish. This is the first step in any modeling task.
The process of conducting EDA is different for different types of data. If we have one numeric variable, for instance, we tend to make a histogram or a box plot. Sometimes we can tell what type of variable we have just by looking at the data. However, we can also use R to help us. To use R to determine whether a variable is numeric, we use the command class. For instance, to check what kind of variable Calories is, we can use the code:
class(Starbucks$Calories)
Notice that when we run this code, R returns the word "integer", which tells us that Calories contains only integers, and is therefore a numeric variable. The result "numeric" is another option that suggests we have a numeric variable.
In this code, we have provided R with (1) the name of the data set and (2) the name of the column. This means that if you want to get the class of Calories, you actually need the code Starbucks$Calories. This tells R “Hey, look at the data set Starbucks and grab a column ($) called Calories”.
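To see both possible results side by side, here is a small sketch. The data frame toy and its columns are made up for illustration and are not part of the Starbucks data. The L after a number tells R to store it as an integer.

```r
# Made-up data frame (illustration values only)
toy <- data.frame(
  Servings = c(1L, 2L, 4L),       # whole numbers stored as integers
  Price    = c(2.75, 3.10, 4.95)  # numbers with decimals
)

class(toy$Servings)  # "integer"
class(toy$Price)     # "numeric"
```

Both results indicate a numeric variable, just as described above.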
Why do we need to bother to check this? Didn’t we learn how to distinguish numeric and categorical variables in intro stat? We did, but just because our intuition tells us that a variable should be numeric does not mean the creators of the data set chose to record the values as a number. For instance, let’s consider the variable VitaminA, which tells us how much Vitamin A is in each Starbucks drink. Our intuition likely tells us that this is a numeric variable. Let’s check.
Question 2
What type of variable does R think VitaminA is? Show the code you used to check your answer. Why do you think R believes the variable is not the class we expect? Hint: Take a look at the symbols used to record the variable.
The moral of the story: It is important to check to see how a variable is actually recorded in a data set. It may not be what you think!
Sugar and Calories
Let’s return to our research question. Is higher sugar content associated with higher calories in Starbucks beverages? This means that we are going to focus our attention on two variables: Calories and Sugars.
Question 3
What type of plot would you use to explore the relationship between sugar and calories? Why?
Question 4
Plot the relationship between X = Sugars and Y = Calories. Show the plot in your answer, and clearly label your axes.
Hint 1: Remember you have to run the code library(ggplot2) in a chunk before you try to make your first plot.
Hint 2: Remember from last lab that the structure of how we make this plot is:
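A sketch of that general structure, using a small made-up data set rather than the Starbucks data. The data frame toy, its columns Hours and Score, and the labels are all placeholders chosen for illustration:

```r
library(ggplot2)

# Made-up data frame (illustration values only)
toy <- data.frame(
  Hours = c(1, 2, 3, 4, 5),
  Score = c(52, 55, 61, 64, 70)
)

# Data set first, then which columns go on the x and y axes
p <- ggplot(toy, aes(x = Hours, y = Score)) +
  geom_point() +                               # draw one dot per row
  labs(x = "Hours studied", y = "Exam score",  # axis labels
       title = "Figure 1")

p  # display the plot
```

To adapt this, swap in your own data set name, column names, and labels.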
Now that we have used R to create a plot, we need to read it. Reading a plot means interpreting the information the plot is telling us. When we create a plot for two numeric variables, there are typically four pieces of information we try to read:
- Form (Shape): What is the shape of the relationship? Does it look like the pattern in the plot can be described by a line? A curve?
- Strength: Does the relationship between these variables seem to be strong, weak, moderate, or is there no apparent relationship?
- Direction: Is the relationship positive, negative, or flat (no relationship)?
- Outliers: Are there any points that seem not to follow the trend of the rest of the data?
There is a reason why we are interested in these four specific questions. The first question (What is the shape of the relationship?) tells us whether or not using a line makes sense for this data set. All models have assumptions. A big one for LSLR lines is that the relationship between the two variables we are modeling should be able to be described using a line. In other words, the relationship should look linear. If it does not, we should not use a LSLR line.
Question 5
What is the form (shape) of the relationship between X = sugars and Y = calories?
The second question tells us whether or not there seems to be an informative relationship between the variables. If there is really no relationship, building a statistical model is not going to be particularly useful.
We have learned that one way to quantify the strength of a linear relationship is with the correlation. To determine the correlation of two variables firstvariable and secondvariable in R, use the following code:
cor(firstvariable, secondvariable)
To do this with your data, replace firstvariable and secondvariable with your variables of interest. Remember that to do this, you have to provide R with (1) the name of the data set and (2) the name of the column. This means that if you want to use Calories as your first variable, you actually need to type Starbucks$Calories.
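As an illustration of the data set and column pattern, the sketch below computes a correlation on a small made-up data frame. The names toy, Hours, and Score are placeholders, not part of the Starbucks data:

```r
# Made-up data frame (illustration values only)
toy <- data.frame(
  Hours = c(1, 2, 3, 4, 5),
  Score = c(52, 55, 61, 64, 70)
)

# Correlation between the two columns; the result is always between -1 and 1
r <- cor(toy$Hours, toy$Score)
r

# Swapping the order of the two variables gives the same answer
cor(toy$Score, toy$Hours)
```

Values close to 1 or -1 suggest a strong linear relationship, while values close to 0 suggest a weak one.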
Question 6
What is the correlation between sugar and calories? What does this tell us about the strength of the linear relationship between these two variables?
The third question (Is the relationship positive, negative, or neither?) tells us what sort of slope to expect when we fit a LSLR line to the data. Do we expect a positive slope? A negative one? This intuition will allow us to check our model once we fit it. Do the results of the model make sense with what the plot is telling us?
Question 7
Is the relationship positive or negative?
The fourth question (Are there any points that seem not to follow the trend of the rest of the data?) helps with identifying points that can strongly impact the fit of a model. We’ll learn more about these in coming classes.
At this point in the course, this is the end of EDA. One plot, and the correlation. As we move deeper into the material, we will learn that there are more things to look for in this stage of the modeling process. For now, the thing to keep in mind is that the first step to any modeling task is some form of EDA. This helps us to choose the form of the model that might be appropriate for a given data set.
The LSLR line
Once we have conducted EDA, we can move into the actual process of building a model. Based on what we have seen so far, it looks like it would be reasonable to use a line to describe the relationship between X = sugars and Y = calories. This means we will be building a linear model.
A linear model has the form
\[y = \beta_0 + \beta_1 x + \epsilon\]
where
- \({y}\) is the response variable
- \(x\) is the explanatory variable
- \({\beta}_1\) is the slope
- \({\beta}_0\) is the y-intercept
- \(\epsilon\) is the residual
We are going to use a very specific line to create our linear model. This line is called the least-squares linear regression (LSLR) line.
Estimating the slope and intercept
We have seen the formula in class for how to compute the slope and intercept of the LSLR line. However, today we want to focus on how we would typically find these values in the real world by using statistical computing.
To obtain the slope and intercept for our LSLR line in R, create a chunk, copy in the following code, and press play:
lm(Calories~Sugars, data = Starbucks)
This code is a little different from those we have worked with. The command lm tells R that we want to build an LSLR line. Our next step is to tell R what to use to actually build the line. What is the response variable? What is the explanatory variable?
To tell R what variable to use to create the line, we use a formula that takes the form y ~ x. See how our code says lm(Calories~Sugars, data = Starbucks)? This can be read as telling R we want to make a LSLR line (lm) with Calories as our response variable and Sugars as our explanatory variable. We notice that we don’t use the $ notation here. Why? Because there is a part of the command that specifies that R should look in the Starbucks data frame to find the Calories and Sugars variables.
The output of lm shows us the values of \(\hat{\beta}_0\) and \(\hat{\beta}_1\) that we need to make our LSLR line.
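To show how to read the output of lm, here is a sketch on a small made-up data set (the names toy, Hours, Score, and fit are placeholders for illustration). Storing the result of lm under a name lets us pull out the two estimates with coef:

```r
# Made-up data frame (illustration values only)
toy <- data.frame(
  Hours = c(1, 2, 3, 4, 5),
  Score = c(52, 55, 61, 64, 70)
)

# Response variable on the left of ~, explanatory variable on the right
fit <- lm(Score ~ Hours, data = toy)

# Printing the fit shows the estimated intercept and slope
fit

# coef() returns just the two numbers:
# "(Intercept)" is the intercept, "Hours" is the slope
coef(fit)  # intercept 46.9, slope 4.5
```

In the output, the number under (Intercept) is \(\hat{\beta}_0\) and the number under the name of the explanatory variable is \(\hat{\beta}_1\).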
Question 8
Write down the LSLR line. Hint: If you want to make the hat symbols, try copying and pasting the following into the white space where you usually type your answers (NOT in a chunk). Then, just replace the words intercept and slope with the numeric values:
$$\widehat{y} = intercept+ slope x$$
Question 9
Think back to your EDA. Does it make sense that the slope coefficient \(\hat{\beta}_1\) is positive? Why or why not?
Now that we have the line, we can start to describe the relationship between sugar and calories in a little more detail than we did with our scatter plot. Instead of just saying that we have a positive relationship, we can say something more specific.
Question 10
Use your LSLR line to answer the following: If we increase the amount of sugar in a drink by 1 gram, about how many more calories do we expect the drink to have?
Question 11
Did Question 10 involve the intercept or the slope?
Question 12
Use your LSLR line to answer the following: If we have a drink with no grams of sugar, about how many calories do we think the drink will have?
Question 13
Did Question 12 involve the intercept or the slope?
Visualizing the LSLR line
Now that we have created our line, we can also draw this line on top of our scatter plot. To do so, copy and paste the following code into a chunk and press play:
ggplot(Starbucks, aes(x = Sugars, y = Calories)) +
  geom_point() +
  stat_smooth(formula = y ~ x, method = "lm", se = FALSE) +
  labs(x = "Sugars (in grams)", y = "Calories", title = "Figure 2")
The first two lines of the code above are familiar - we did these earlier today! The third line is new. This part of the code (stat_smooth(...)) tells R to draw the LSLR line onto the scatter plot.
Question 14
Change the color of the points. Make them any color you like aside from black (because they are already black) or white (because then we can’t see them!)
Question 15
Change the color of the LSLR line. Make it any color you like aside from black or white.
We notice that even though the relationship between sugars and calories is linear, the line will not go through every single dot on the graph. Some dots are below the line, and some dots are above the line. This should not surprise us. A LSLR line is a way to approximate the relationship between the response variable Y and explanatory variable X. If it went through every point, we would not have a line!
It turns out that choosing a line has some distinct advantages. The first is that lines are not difficult to use.
Question 16
Take a look at the first drink in the Starbucks data set. How many grams of sugar are in this drink? How many calories are in this drink?
Question 17
Using your LSLR line, how many calories would we predict that drink 1 would have? In other words, based on the grams of sugar in the first drink in the data set, what value of \(\widehat{y}\) would you compute using your LSLR line? Hint: Use the equation from Question 8 to answer this.
Question 18
What is the difference between the actual number of calories in the first drink (\(y_1\)) in the data set and the predicted value we got from the LSLR line (\(\widehat{y}_1\))? In other words, what is the value of the residual for the first drink?
Considering Residuals
It turns out that residuals are really important when you are building statistical models. They represent the distance between the actual data and the line you have used. In our case, it is the distance between the actual value of Calories for each Starbucks drink and the LSLR line.
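R can compute every residual at once. The sketch below uses a small made-up data set (toy, Hours, Score, and fit are placeholder names, not part of the Starbucks data) and checks that subtracting the predicted values from the actual values by hand matches what R reports:

```r
# Made-up data frame (illustration values only)
toy <- data.frame(
  Hours = c(1, 2, 3, 4, 5),
  Score = c(52, 55, 61, 64, 70)
)
fit <- lm(Score ~ Hours, data = toy)

# Predicted values from the line, one per row of the data
y_hat <- fitted(fit)

# Residual = actual value minus predicted value
res_by_hand <- toy$Score - y_hat

# R can also return the residuals directly
res_from_r <- resid(fit)

# Show actual values, predictions, and residuals side by side
cbind(actual = toy$Score, predicted = y_hat, residual = res_by_hand)
```

A positive residual means the point sits above the line, and a negative residual means it sits below.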
We want the residuals to be small, because this means the line is close to the data. It turns out that the LSLR line we have just built actually gives the smallest possible value of the residuals out of any possible line we could draw. Specifically, the LSLR line has the minimum value of the residual sum of squares (RSS).
\[RSS = \sum_{i=1}^n \epsilon_i^2 = \epsilon_1^2 + \epsilon_2^2 + \dots + \epsilon_n^2 = (y_1 - \hat{y}_1)^2 + \dots + (y_n -\hat{y}_n)^2\]
The RSS is a way to measure the size of the residuals across the data set. We take the residual for each data point, we square it to convert all the values to positive numbers, and then we add them all up. Since small residuals indicate a stronger model fit, we want the smallest possible value of the RSS. The LSLR line is the line that achieves the smallest possible RSS for a given data set.
We can actually compute the RSS using R, but it takes a few steps. Place the following in a chunk of code and press play. Don’t worry about the actual code itself, though please ask if you are interested in how it is set up!
# Step 1: Make predictions using your line
CaloriesHat <- 45.357 + 4.568*Starbucks$Sugars
# Step 2: Store the residuals
residuals <- (Starbucks$Calories - CaloriesHat)
# Step 3: Square the residuals and then add them up!
sum(residuals^2)
Question 19
What is the RSS for your LSLR line?
Question 20
Look at Step 1 in the code above. Change the intercept and slope to be different numbers (any numbers you want). Compute the RSS using these new values of the intercept and the slope. Is this value larger or smaller than what you got in Question 19?
This will hold true for any other combination of slope and intercept values you want to try. Nothing beats the LSLR line in terms of the RSS!
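The same comparison can be sketched on a small made-up data set. Everything here is for illustration: toy, Hours, and Score are placeholder names, rss is a helper function written just for this example, and the two alternative lines were picked arbitrarily.

```r
# Made-up data frame (illustration values only)
toy <- data.frame(
  Hours = c(1, 2, 3, 4, 5),
  Score = c(52, 55, 61, 64, 70)
)

# Helper for this example: RSS of the line with the given intercept and slope
rss <- function(intercept, slope) {
  y_hat <- intercept + slope * toy$Hours  # predictions from the line
  sum((toy$Score - y_hat)^2)              # square the residuals and add them up
}

# Intercept and slope of the LSLR line
b <- unname(coef(lm(Score ~ Hours, data = toy)))

rss_lslr   <- rss(b[1], b[2])  # RSS for the LSLR line
rss_other1 <- rss(45, 5)       # RSS for an arbitrary alternative line
rss_other2 <- rss(50, 4)       # RSS for another arbitrary alternative line

c(LSLR = rss_lslr, other1 = rss_other1, other2 = rss_other2)
```

In this toy example the LSLR line has the smallest RSS of the three, which matches the pattern described above.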
Question 21
The RMSE for our LSLR line is 42.88. Interpret this value in the context of the data. In other words, what can this value tell us?
Before you submit
Three last steps before we knit, and then you will be done with Lab 2!
- Find the top of this file (the little tab), and look under it. You should see something with ABC and a check mark. This is for checking spelling! Click this to check your spelling before you do a final knit and submit.
- If you are working with a partner, make sure their name and yours are at the top of the file.
- You must submit a PDF or HTML file. If you submit any other file type, it cannot be graded. Let me know if you have any questions.
Once you’ve done this, knit your file. This will create the PDF or HTML file you need to submit. If you get stuck, let Dr. Dalzell know!
This work was created by Nicole Dalzell and is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Last updated 2025 August 30.