Learning Objectives

Statistical Learning Objectives

  1. Interpret a correlation matrix
  2. Interpret a fitted linear regression model
  3. Check the fit of the linear regression model using \(R^2\)
  4. Explore the dangers of extrapolation

R Learning Objectives

  1. Dive deeper into R plotting
  2. Create a correlation matrix
  3. Use R to fit a linear regression model

Functions covered in this lab

  1. plot()
  2. legend()
  3. levels()
  4. lm()
  5. anova()
  6. abline()
  7. cor()

Weekly Advice

This week, we’re learning more about R “graphics” (e.g., making plots and graphs). The trick to working with graphics in R is to remember these two points:

  1. R “draws” graphs like ink on paper. You can create a graph (using, e.g., plot()), then add things to it by using other functions to draw on top of the graph. But because R draws in “ink”, there’s no eraser! The way to clean things up is to make the graph again.
  2. The way to get a graphic you like is by trying stuff and adjusting. Don’t be afraid to play around!
  3. Use R’s built-in help for “graphical parameters”! In the R console, type ?par then hit enter/return.

Make sure you knit your document before submitting!


Lab Tutorial

Vectors

Remember that a vector in R is just a way to store a bunch of things together, like a pill counter. We saw last week that we can create vectors of consecutive integers using :, like 1:6.

We can make other types of vectors using the c() function. c here stands for combine. Here’s a vector of non-consecutive numbers, for example:

x <- c(1, 72.15, -4)
x
## [1]  1.00 72.15 -4.00

We’re going to see this c() function pop up a lot, so know that when you see it, it’s making a vector.

The stringsAsFactors Argument to read.csv()

Here’s our penguins again.

penguins <- read.csv("https://raw.githubusercontent.com/STATS250SBI/palmerpenguins/master/inst/extdata/penguins_NArm.csv", stringsAsFactors = T)

Something that’s important to note when we read in the data this time is that we gave read.csv() an extra argument called stringsAsFactors and we set it to T (T stands for TRUE). This argument tells read.csv() that it should treat data that looks like text as a categorical variable. Sometimes this is what you want, sometimes it’s not; the default in R is stringsAsFactors = FALSE, so it doesn’t treat text-like data as categorical by default.

In STATS 250, text-like data will almost always be a categorical variable, so we’ll be setting stringsAsFactors = TRUE quite often. Here, we’re doing it so we can more easily customize scatterplots below.

Customize a Scatterplot

Let’s make a scatterplot of penguin body mass vs. bill “depth”:

plot(body_mass_g ~ bill_depth_mm, 
     data = penguins,
     main = "Scatterplot of Penguin Body Mass vs. Bill Depth",
     xlab = "Bill Depth (mm)",
     ylab = "Body Mass (g)")

You should notice a couple of things about what we just did:

  1. We used “formula syntax” in the plot() code: we specified the response variable (y) and the explanatory variable (x) by typing y ~ x and then providing the data argument so R would know where to look for those variables. Remember that we can read the tilde (~) as “by”, or “versus”, so body_mass_g ~ bill_depth_mm would be read “body mass g versus bill depth mm”. We always say “[y variable] vs. [x variable]”.
  2. There’s some pretty obvious clustering happening in this plot! Take a second to think of an explanation for this.

One possible reason for the clustering might be the fact that our data set contains information on three different penguin species. It might be nice to identify which points belong to each species. To do this, we’ll use the col argument to plot, but in a tricky way.

plot(bill_depth_mm ~ body_mass_g, 
     data = penguins,
     main = "Scatterplot of Penguin Body Mass vs. Bill Depth",
     xlab = "Bill Depth (mm)",
     ylab = "Body Mass (g)",
     col = c("midnightblue", "brown1", "mediumseagreen")[penguins$species])

So it looks like one species is very different from the other two in terms of the relationship between bill depth and body mass! It seems like they have relatively deep bills but are also lighter than the other species. How can we tell which color corresponds to which species?

R chooses colors from the vector of options we give it – c("midnightblue", "brown1", "mediumseagreen") is a vector color names, just like 1:6 is a vector of consecutive integers from last week – in the order of the levels of that categorical variable penguins$species. Remember that R orders these levels alphabetically! We can use the levels() function to see this order:

levels(penguins$species)
## [1] "Adelie"    "Chinstrap" "Gentoo"

The “medium sea green” dots are coming from the third level (since mediumseagreen is the third “element” of that colors vector), so the green dots are Gentoo penguins.

Let’s make a legend (or key) so people can better understand our scatterplot.

# Make the plot again
plot(bill_depth_mm ~ body_mass_g, 
     data = penguins,
     main = "Scatterplot of Penguin Body Mass vs. Bill Depth",
     xlab = "Bill Depth (mm)",
     ylab = "Body Mass (g)",
     col = c("midnightblue", "brown1", "mediumseagreen")[penguins$species])

# Add a legend
legend("topright", 
       legend = levels(penguins$species),
       col = c("midnightblue", "brown1", "mediumseagreen"),
       pch = 1,
       title = "Species")

The first argument to legend is a position – it’s where you want the legend to go on the plot. The easiest way to set the position is to choose one of “bottomright”, “bottom”, “bottomleft”, “left”, “topleft”, “top”, “topright”, “right” or “center”.

The next argument is called “legend” and is the words you want to go in the legend. Here, we want our legend to identify the “levels” of the species variable in penguins, so we’ll say legend = levels(penguins$species). Then we give our vector of colors to col.

The last (non-obviously-named) argument we provide to legend() is called pch, which stands for Plotting CHaracter. Changing the pch argument changes what the points look like in our plot. Since we didn’t specify pch in our plot, it gave us open circles (pch = 1).

If we want, we could make the plot again, this time giving each species its own plotting character. You can see the possible options for pch using help: type ?points in the console, hit enter/return, then scroll down.

# Make the plot again
plot(bill_depth_mm ~ body_mass_g, 
     data = penguins,
     main = "Scatterplot of Penguin Body Mass vs. Bill Depth",
     xlab = "Bill Depth (mm)",
     ylab = "Body Mass (g)",
     col = c("midnightblue", "brown1", "mediumseagreen")[penguins$species],
     pch = c(0, 1, 2)[penguins$species])

# Add a legend
legend("topright", 
       legend = levels(penguins$species),
       col = c("midnightblue", "brown1", "mediumseagreen"),
       pch = c(0, 1, 2),
       title = "Species")

Correlation Matrices

Let’s go back to the scatterplot we made last week.

plot(penguins$bill_length_mm, penguins$body_mass_g,
     main = "Scatterplot of Penguin Body Mass versus Bill Length",
     xlab = "Bill Length (mm)",
     ylab = "Body Mass in (g)")

Last week, we said this was a moderately-strong linear relationship with no obvious outliers or clustering, and computed the correlation between the two variables:

cor(penguins$bill_length_mm, penguins$body_mass_g)
## [1] 0.5894511

If we wanted to consider the correlation between multiple quantitative variables, we could use cor() on every pair of them, but that’s tedious. Instead, we’ll compute a correlation matrix. To do this, we’re going to give cor() a subset of penguins containing only the variables we want to look at the correlation between. To make this subset, we’ll use the subset() function and the select argument. select is a vector of variable names in penguins.

cor(subset(penguins, select = c("bill_length_mm", "bill_depth_mm", "body_mass_g")))
##                bill_length_mm bill_depth_mm body_mass_g
## bill_length_mm      1.0000000    -0.2286256   0.5894511
## bill_depth_mm      -0.2286256     1.0000000  -0.4720157
## body_mass_g         0.5894511    -0.4720157   1.0000000

Each “entry” in the correlation matrix is the correlation between the variables labeling that entry’s row and column. So for example, the correlation between bill depth and bill length is about -0.229.

Linear Regression

We’ve actually already discussed most of what we need to perform linear regression in R, so let’s jump right into it.

We’ll use the function lm(), and provide it a formula (y ~ x) and a data argument. We’ll store that as an object called reg1. Then, to get detailed results, we’ll use the summary() function.

reg1 <- lm(body_mass_g ~ bill_length_mm, data = penguins)
summary(reg1)
## 
## Call:
## lm(formula = body_mass_g ~ bill_length_mm, data = penguins)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1759.38  -468.82    27.79   464.20  1641.00 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     388.845    289.817   1.342    0.181    
## bill_length_mm   86.792      6.538  13.276   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 651.4 on 331 degrees of freedom
## Multiple R-squared:  0.3475, Adjusted R-squared:  0.3455 
## F-statistic: 176.2 on 1 and 331 DF,  p-value: < 2.2e-16

MORE DETAILS ABOUT HOW TO INTERP OUTPUT

We can also get more details about the regression through an ANOVA table, which we can compute using by giving the “regression model object” (reg1) to the anova() function:

anova(reg1)
## Analysis of Variance Table
## 
## Response: body_mass_g
##                 Df    Sum Sq  Mean Sq F value    Pr(>F)    
## bill_length_mm   1  74792533 74792533  176.24 < 2.2e-16 ***
## Residuals      331 140467133   424372                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

** MORE DETAILS ABOUT HOW TO INTERP OUTPUT **

Then, when we’re done with all of that, we can add thec estimated regression line to our scatterplot by giving the model object to the abline() function.

plot(penguins$bill_length_mm, penguins$body_mass_g,
     main = "Scatterplot of Penguin Body Mass versus Bill Length",
     xlab = "Bill Length (mm)",
     ylab = "Body Mass in (g)")
abline(reg1)

Try It!

Depending on your instruction method (in person, synchronous remote, asynchronous remote), complete the following exercises collaboratively.

  1. In person: Form groups of 2-4 with those around you (maintaining physical distance). Your instructor will provide additional details.
  2. Synchronous remote: You will be placed in a Zoom breakout room to work together in small groups.
  3. Asynchronous remote: Join a Piazza lab group for asynchronous collaboration.

Collaborators: If applicable, replace this text with the names of anyone you collaborated with on this project.

This week, we’ll be using a data set from the Coffee Quality Database collected from the Coffee Quality Institute’s review pages in January 2018 by BuzzFeed data scientist James Ledoux. The data are available to you in coffee_ratings.csv.

1. Read the data into R. You’ll need to give the name of this file (in quotes!) to read.csv(), and call the data.frame coffee (if you don’t do this, you WILL get errors later). Be sure to set stringsAsFactors = TRUE! Then use your favorite function to peek at the data.

# Replace this comment with code required for Try It 1. (Remember that this text is a comment, so R ignores it; you can delete it if you want.) If you don't delete it, **start your code on a new line that doesn't start with #**

2. Create a correlation matrix of the variables total_cup_points, acidity, body, and balance. total_cup_points represents a coffee’s overall quality rating (0-100 points). The others represent a score of a coffee’s acidity, body, and balance, respectively (0-10 points). Which pair of these variables have the strongest linear association?

# Replace this comment with code required for Try It 2. (Remember that this text is a comment, so R ignores it; you can delete it if you want.) If you don't delete it, **start your code on a new line that doesn't start with #**

Replace this text with your answer to Try It 2.

3. Make a scatterplot of the two variables you chose in Try It 2. Feel free to play around with the color of the points, or the plotting character via pch. Use the variable descriptions in Try It 2 to determine which variable should be the explanatory variable and which should be the response.

# Replace this comment with code required for Try It 3. (Remember that this text is a comment, so R ignores it; you can delete it if you want.) If you don't delete it, **start your code on a new line that doesn't start with #**

4. Fit a linear regression model to the data using the same explanatory and response variables as in try it 3. What is our estimate of the population slope? What is the \(R^2\) value?

# Replace this comment with code required for Try It 4. (Remember that this text is a comment, so R ignores it; you can delete it if you want.) If you don't delete it, **start your code on a new line that doesn't start with #**

Replace this text with your answer to Try It 4.

5. Use the anova() function to produce the ANOVA table for the model, and use it to check the value of \(R^2\) that you found in Try It 4. Show your work the best you can.

# Replace this comment with code required for Try It 5. (Remember that this text is a comment, so R ignores it; you can delete it if you want.) If you don't delete it, **start your code on a new line that doesn't start with #**

Replace this text with your answers to Try It 5.

6. In Try It 3, you made a scatterplot of two variables. The coffee data contains a categorical variable called color, which represents the color of the coffee bean (before roasting; coffee’s not brown until it’s roasted!). Make a frequency table of this variable, then remake your scatterplot from Try It 3, this time coloring the points according to the value of color.

# Replace this comment with code required for Try It 6. (Remember that this text is a comment, so R ignores it; you can delete it if you want.) If you don't delete it, **start your code on a new line that doesn't start with #**

Dive Deeper

In the Try It, you played around a little with data about real estate. Now, we’re going to have you dive a little deeper.

1. How would you interpret the estimate of the slope coefficient we found in Try It 4? Speculate on the real-world impact of this estimate.

Replace this text with your written answer for Dive Deeper 1

2. How would you interpret the value of \(R^2\) we found in Try It 4? Why do we care about this?

Replace this text with your written answer to Dive Deeper 2.

3. In your scatterplot from Try It 3, you should notice that the data seem to be in “strips” – vertical stacks of data points. Speculate on why this is happening.

Replace this text with your written answer to Dive Deeper 3.

# Replace this comment with code required for Dive Deeper 3. (Remember that this text is a comment, so R ignores it; you can delete it if you want.) If you don't delete it, **start your code on a new line that doesn't start with #**

4. Using your scatterplot from Try It 6, does coffee bean color appear to be related to the coffee’s total cup points and balance? Why or why not?

Replace this text with your written answer to Dive Deeper 4.

5. In late June, this image made the rounds on statistics Twitter. It’s a plot of COVID-19 per-capita (read: population-adjusted) mortality rate versus average physician salary in each state. The caption of the image was “States where physicians are highly paid have lower Covid-19 mortality per capita.” Do you agree with that conclusion based on the plot? Why or why not?

Replace this text with your written answer to Dive Deeper 5.

Wrap-Up and Submission

At the top of the document, make sure you’ve changed the author field to your name (in quotes!). If you’d like, change the date as well.

When you’ve finished the lab, click the Knit button one last time.

Submission instructions

If you’re using RStudio Cloud

  1. In the Files pane, check the box next to lab03report.html.
  2. Click More > Export…
  3. Click Download and save the file on your computer in a folder you’ll remember and be able to find later. You can just call the file lab03report.html or whatever else you’d like (as long as you remember what you called it).

If you’re using RStudio installed on your computer

  1. locate the lab03report.html file on your computer. The file will be saved in the location indicated at the top of the files pane.

Submission to Canvas

  1. Click the “Assignments” panel on the left side of the page. Scroll to find “Lab 3”, and open the assignment. Click “Submit Assignment”.
  2. Towards the bottom of the page, you’ll be able to choose lab03report.html from the folder you saved it in from RStudio Cloud or noted if you’re using RStudio Desktop. You will only be able to upload a .html file – do not upload any other file type.
  3. Click “Submit Assignment”. You’re done!