plot()legend()levels()lm()anova()abline()cor()This week, we’re learning more about R “graphics” (e.g., making plots and graphs). The trick to working with graphics in R is to remember these two points:
plot()), then add things to it by using other functions to draw on top of the graph. But because R draws in “ink”, there’s no eraser! The way to clean things up is to make the graph again.?par then hit enter/return.Make sure you knit your document before submitting!
Remember that a vector in R is just a way to store a bunch of things together, like a pill counter. We saw last week that we can create vectors of consecutive integers using :, like 1:6.
We can make other types of vectors using the c() function. c here stands for combine. Here’s a vector of non-consecutive numbers, for example:
x <- c(1, 72.15, -4)
x
## [1] 1.00 72.15 -4.00
We’re going to see this c() function pop up a lot, so know that when you see it, it’s making a vector.
stringsAsFactors Argument to read.csv()Here’s our penguins again.
penguins <- read.csv("https://raw.githubusercontent.com/STATS250SBI/palmerpenguins/master/inst/extdata/penguins_NArm.csv", stringsAsFactors = T)
Something that’s important to note when we read in the data this time is that we gave read.csv() an extra argument called stringsAsFactors and we set it to T (T stands for TRUE). This argument tells read.csv() that it should treat data that looks like text as a categorical variable. Sometimes this is what you want, sometimes it’s not; the default in R is stringsAsFactors = FALSE, so it doesn’t treat text-like data as categorical by default.
In STATS 250, text-like data will almost always be a categorical variable, so we’ll be setting stringsAsFactors = TRUE quite often. Here, we’re doing it so we can more easily customize scatterplots below.
Let’s make a scatterplot of penguin body mass vs. bill “depth”:
plot(body_mass_g ~ bill_depth_mm,
data = penguins,
main = "Scatterplot of Penguin Body Mass vs. Bill Depth",
xlab = "Bill Depth (mm)",
ylab = "Body Mass (g)")
You should notice a couple of things about what we just did:
plot() code: we specified the response variable (y) and the explanatory variable (x) by typing y ~ x and then providing the data argument so R would know where to look for those variables. Remember that we can read the tilde (~) as “by”, or “versus”, so body_mass_g ~ bill_depth_mm would be read “body mass g versus bill depth mm”. We always say “[y variable] vs. [x variable]”.One possible reason for the clustering might be the fact that our data set contains information on three different penguin species. It might be nice to identify which points belong to each species. To do this, we’ll use the col argument to plot, but in a tricky way.
plot(bill_depth_mm ~ body_mass_g,
data = penguins,
main = "Scatterplot of Penguin Body Mass vs. Bill Depth",
xlab = "Bill Depth (mm)",
ylab = "Body Mass (g)",
col = c("midnightblue", "brown1", "mediumseagreen")[penguins$species])
So it looks like one species is very different from the other two in terms of the relationship between bill depth and body mass! It seems like they have relatively deep bills but are also lighter than the other species. How can we tell which color corresponds to which species?
R chooses colors from the vector of options we give it – c("midnightblue", "brown1", "mediumseagreen") is a vector color names, just like 1:6 is a vector of consecutive integers from last week – in the order of the levels of that categorical variable penguins$species. Remember that R orders these levels alphabetically! We can use the levels() function to see this order:
levels(penguins$species)
## [1] "Adelie" "Chinstrap" "Gentoo"
The “medium sea green” dots are coming from the third level (since mediumseagreen is the third “element” of that colors vector), so the green dots are Gentoo penguins.
Let’s make a legend (or key) so people can better understand our scatterplot.
# Make the plot again
plot(bill_depth_mm ~ body_mass_g,
data = penguins,
main = "Scatterplot of Penguin Body Mass vs. Bill Depth",
xlab = "Bill Depth (mm)",
ylab = "Body Mass (g)",
col = c("midnightblue", "brown1", "mediumseagreen")[penguins$species])
# Add a legend
legend("topright",
legend = levels(penguins$species),
col = c("midnightblue", "brown1", "mediumseagreen"),
pch = 1,
title = "Species")
The first argument to legend is a position – it’s where you want the legend to go on the plot. The easiest way to set the position is to choose one of “bottomright”, “bottom”, “bottomleft”, “left”, “topleft”, “top”, “topright”, “right” or “center”.
The next argument is called “legend” and is the words you want to go in the legend. Here, we want our legend to identify the “levels” of the species variable in penguins, so we’ll say legend = levels(penguins$species). Then we give our vector of colors to col.
The last (non-obviously-named) argument we provide to legend() is called pch, which stands for Plotting CHaracter. Changing the pch argument changes what the points look like in our plot. Since we didn’t specify pch in our plot, it gave us open circles (pch = 1).
If we want, we could make the plot again, this time giving each species its own plotting character. You can see the possible options for pch using help: type ?points in the console, hit enter/return, then scroll down.
# Make the plot again
plot(bill_depth_mm ~ body_mass_g,
data = penguins,
main = "Scatterplot of Penguin Body Mass vs. Bill Depth",
xlab = "Bill Depth (mm)",
ylab = "Body Mass (g)",
col = c("midnightblue", "brown1", "mediumseagreen")[penguins$species],
pch = c(0, 1, 2)[penguins$species])
# Add a legend
legend("topright",
legend = levels(penguins$species),
col = c("midnightblue", "brown1", "mediumseagreen"),
pch = c(0, 1, 2),
title = "Species")
Let’s go back to the scatterplot we made last week.
plot(penguins$bill_length_mm, penguins$body_mass_g,
main = "Scatterplot of Penguin Body Mass versus Bill Length",
xlab = "Bill Length (mm)",
ylab = "Body Mass in (g)")
Last week, we said this was a moderately-strong linear relationship with no obvious outliers or clustering, and computed the correlation between the two variables:
cor(penguins$bill_length_mm, penguins$body_mass_g)
## [1] 0.5894511
If we wanted to consider the correlation between multiple quantitative variables, we could use cor() on every pair of them, but that’s tedious. Instead, we’ll compute a correlation matrix. To do this, we’re going to give cor() a subset of penguins containing only the variables we want to look at the correlation between. To make this subset, we’ll use the subset() function and the select argument. select is a vector of variable names in penguins.
cor(subset(penguins, select = c("bill_length_mm", "bill_depth_mm", "body_mass_g")))
## bill_length_mm bill_depth_mm body_mass_g
## bill_length_mm 1.0000000 -0.2286256 0.5894511
## bill_depth_mm -0.2286256 1.0000000 -0.4720157
## body_mass_g 0.5894511 -0.4720157 1.0000000
Each “entry” in the correlation matrix is the correlation between the variables labeling that entry’s row and column. So for example, the correlation between bill depth and bill length is about -0.229.
We’ve actually already discussed most of what we need to perform linear regression in R, so let’s jump right into it.
We’ll use the function lm(), and provide it a formula (y ~ x) and a data argument. We’ll store that as an object called reg1. Then, to get detailed results, we’ll use the summary() function.
reg1 <- lm(body_mass_g ~ bill_length_mm, data = penguins)
summary(reg1)
##
## Call:
## lm(formula = body_mass_g ~ bill_length_mm, data = penguins)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1759.38 -468.82 27.79 464.20 1641.00
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 388.845 289.817 1.342 0.181
## bill_length_mm 86.792 6.538 13.276 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 651.4 on 331 degrees of freedom
## Multiple R-squared: 0.3475, Adjusted R-squared: 0.3455
## F-statistic: 176.2 on 1 and 331 DF, p-value: < 2.2e-16
MORE DETAILS ABOUT HOW TO INTERP OUTPUT
We can also get more details about the regression through an ANOVA table, which we can compute using by giving the “regression model object” (reg1) to the anova() function:
anova(reg1)
## Analysis of Variance Table
##
## Response: body_mass_g
## Df Sum Sq Mean Sq F value Pr(>F)
## bill_length_mm 1 74792533 74792533 176.24 < 2.2e-16 ***
## Residuals 331 140467133 424372
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
** MORE DETAILS ABOUT HOW TO INTERP OUTPUT **
Then, when we’re done with all of that, we can add thec estimated regression line to our scatterplot by giving the model object to the abline() function.
plot(penguins$bill_length_mm, penguins$body_mass_g,
main = "Scatterplot of Penguin Body Mass versus Bill Length",
xlab = "Bill Length (mm)",
ylab = "Body Mass in (g)")
abline(reg1)
Depending on your instruction method (in person, synchronous remote, asynchronous remote), complete the following exercises collaboratively.
Collaborators: If applicable, replace this text with the names of anyone you collaborated with on this project.
This week, we’ll be using a data set from the Coffee Quality Database collected from the Coffee Quality Institute’s review pages in January 2018 by BuzzFeed data scientist James Ledoux. The data are available to you in coffee_ratings.csv.
1. Read the data into R. You’ll need to give the name of this file (in quotes!) to
read.csv(), and call the data.framecoffee(if you don’t do this, you WILL get errors later). Be sure to setstringsAsFactors = TRUE! Then use your favorite function to peek at the data.
# Replace this comment with code required for Try It 1. (Remember that this text is a comment, so R ignores it; you can delete it if you want.) If you don't delete it, **start your code on a new line that doesn't start with #**
2. Create a correlation matrix of the variables
total_cup_points,acidity,body, andbalance.total_cup_pointsrepresents a coffee’s overall quality rating (0-100 points). The others represent a score of a coffee’s acidity, body, and balance, respectively (0-10 points). Which pair of these variables have the strongest linear association?
# Replace this comment with code required for Try It 2. (Remember that this text is a comment, so R ignores it; you can delete it if you want.) If you don't delete it, **start your code on a new line that doesn't start with #**
Replace this text with your answer to Try It 2.
3. Make a scatterplot of the two variables you chose in Try It 2. Feel free to play around with the color of the points, or the plotting character via
pch. Use the variable descriptions in Try It 2 to determine which variable should be the explanatory variable and which should be the response.
# Replace this comment with code required for Try It 3. (Remember that this text is a comment, so R ignores it; you can delete it if you want.) If you don't delete it, **start your code on a new line that doesn't start with #**
4. Fit a linear regression model to the data using the same explanatory and response variables as in try it 3. What is our estimate of the population slope? What is the \(R^2\) value?
# Replace this comment with code required for Try It 4. (Remember that this text is a comment, so R ignores it; you can delete it if you want.) If you don't delete it, **start your code on a new line that doesn't start with #**
Replace this text with your answer to Try It 4.
5. Use the
anova()function to produce the ANOVA table for the model, and use it to check the value of \(R^2\) that you found in Try It 4. Show your work the best you can.
# Replace this comment with code required for Try It 5. (Remember that this text is a comment, so R ignores it; you can delete it if you want.) If you don't delete it, **start your code on a new line that doesn't start with #**
Replace this text with your answers to Try It 5.
6. In Try It 3, you made a scatterplot of two variables. The
coffeedata contains a categorical variable calledcolor, which represents the color of the coffee bean (before roasting; coffee’s not brown until it’s roasted!). Make a frequency table of this variable, then remake your scatterplot from Try It 3, this time coloring the points according to the value ofcolor.
# Replace this comment with code required for Try It 6. (Remember that this text is a comment, so R ignores it; you can delete it if you want.) If you don't delete it, **start your code on a new line that doesn't start with #**
In the Try It, you played around a little with data about real estate. Now, we’re going to have you dive a little deeper.
1. How would you interpret the estimate of the slope coefficient we found in Try It 4? Speculate on the real-world impact of this estimate.
Replace this text with your written answer for Dive Deeper 1
2. How would you interpret the value of \(R^2\) we found in Try It 4? Why do we care about this?
Replace this text with your written answer to Dive Deeper 2.
3. In your scatterplot from Try It 3, you should notice that the data seem to be in “strips” – vertical stacks of data points. Speculate on why this is happening.
Replace this text with your written answer to Dive Deeper 3.
# Replace this comment with code required for Dive Deeper 3. (Remember that this text is a comment, so R ignores it; you can delete it if you want.) If you don't delete it, **start your code on a new line that doesn't start with #**
4. Using your scatterplot from Try It 6, does coffee bean color appear to be related to the coffee’s total cup points and balance? Why or why not?
Replace this text with your written answer to Dive Deeper 4.
5. In late June, this image made the rounds on statistics Twitter. It’s a plot of COVID-19 per-capita (read: population-adjusted) mortality rate versus average physician salary in each state. The caption of the image was “States where physicians are highly paid have lower Covid-19 mortality per capita.” Do you agree with that conclusion based on the plot? Why or why not?
Replace this text with your written answer to Dive Deeper 5.
At the top of the document, make sure you’ve changed the author field to your name (in quotes!). If you’d like, change the date as well.
When you’ve finished the lab, click the Knit button one last time.
lab03report.html.lab03report.html or whatever else you’d like (as long as you remember what you called it).lab03report.html file on your computer. The file will be saved in the location indicated at the top of the files pane.lab03report.html from the folder you saved it in from RStudio Cloud or noted if you’re using RStudio Desktop. You will only be able to upload a .html file – do not upload any other file type.