This presentation corresponds to Chapter 2: Data visualization https://r4ds.hadley.nz/data-visualize.html
Chapter 2 of R4DS focuses on ggplot2, one of the core members of the tidyverse. To access the datasets for the first time on your computer, type (in your console)
install.packages("tidyverse")
and then load tidyverse for this session by running:
library(tidyverse)
From Thorndike, Chapter 2, pp. 23-28:
“Catherine Johnson and Peter Cordero wanted to gather information about achievement levels in their two sixth-grade classes. They gave their students a 45-item reading comprehension test provided in their current reading series, a 65-item review test from the mathematics book, and a dictation spelling test of 80 items based on the words their classes had been studying during the past 6 weeks.”
The dataset should pop up in your environment. But if not…
# Convert the "Gender" column to a factor and label its levels
Table.2.1 <- Table.2.1 %>%
  mutate(Gender = factor(Gender, levels = c("1", "2"), labels = c("male", "female")))
# Convert the "Class" column to a factor and label its levels
Table.2.1 <- Table.2.1 %>%
  mutate(Class = factor(Class, levels = c("1", "2"), labels = c("Johnson", "Cordero")))
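As a rough illustration of what the recoding above does, here is a minimal sketch using base R only. The vector `gender_codes` is a hypothetical stand-in for the raw `Gender` column, on the assumption that it stores the codes 1 and 2.

```r
# Hypothetical raw codes, standing in for the unrecoded Gender column
gender_codes <- c(1, 2, 2, 1)

# `levels` names the codes expected in the data (matched as text);
# `labels` supplies the display name for each level, in the same order
gender <- factor(gender_codes, levels = c("1", "2"), labels = c("male", "female"))

print(gender)          # the values now show as labels rather than codes
print(levels(gender))  # the two labels, in level order
```

Any code that is not listed in `levels` would become `NA`, so it is worth checking the result with `table()` or `summary()` after recoding.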
R4DS asked: “Do penguins with longer flippers weigh more or less than penguins with shorter flippers?”
So for us, how about: “Do students who do well on the sixth-grade spelling test also do better on the math review test?”
# A tibble: 52 × 7
   First    Last     Gender Class   Reading Spelling  Math
   <chr>    <chr>    <fct>  <fct>     <dbl>    <dbl> <dbl>
 1 Aaron    Andrews  male   Johnson      32       64    43
 2 Byron    Biggs    male   Johnson      40       64    37
 3 Charles  Cowen    male   Johnson      36       40    38
 4 Donna    Davis    female Johnson      41       74    40
 5 Erin     Edwards  female Johnson      36       69    28
 6 Fernando Franco   male   Johnson      41       67    42
 7 Gail     Galaraga female Johnson      40       71    37
 8 Harpo    Henry    male   Johnson      30       51    34
 9 Irrida   Ignacio  female Johnson      37       68    35
10 Jack     Johanson male   Johnson      26       56    26
# … with 42 more rows
Rows: 52
Columns: 7
$ First <chr> "Aaron", "Byron", "Charles", "Donna", "Erin", "Fernando", "Ga…
$ Last <chr> "Andrews", "Biggs", "Cowen", "Davis", "Edwards", "Franco", "G…
$ Gender <fct> male, male, male, female, female, male, female, male, female,…
$ Class <fct> Johnson, Johnson, Johnson, Johnson, Johnson, Johnson, Johnson…
$ Reading <dbl> 32, 40, 36, 41, 36, 41, 40, 30, 37, 26, 28, 36, 39, 22, 36, 3…
$ Spelling <dbl> 64, 64, 40, 74, 69, 67, 71, 51, 68, 56, 51, 57, 68, 47, 59, 6…
$ Math <dbl> 43, 37, 38, 40, 28, 42, 37, 34, 35, 26, 25, 53, 37, 22, 33, 3…
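The two displays above look like the result of printing the tibble and then calling `glimpse()` on it. The slides do not show the commands, so that is an assumption. A minimal sketch, where `scores` is a hypothetical stand-in holding only the first three rows printed above:

```r
library(tibble)
library(dplyr)

# Stand-in for Table.2.1: the first three rows shown in this presentation
scores <- tibble(
  First    = c("Aaron", "Byron", "Charles"),
  Last     = c("Andrews", "Biggs", "Cowen"),
  Gender   = factor(c("male", "male", "male"), levels = c("male", "female")),
  Class    = factor(c("Johnson", "Johnson", "Johnson"), levels = c("Johnson", "Cordero")),
  Reading  = c(32, 40, 36),
  Spelling = c(64, 64, 40),
  Math     = c(43, 37, 38)
)

print(scores)    # row-wise view: one line per student
glimpse(scores)  # column-wise view: one line per variable, with its type
```

With the real data loaded, the same two calls on `Table.2.1` should give the 52-row versions.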
The blank canvas:
…to the plot with the x axis already set
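The progression from a blank canvas to a plot with axes can be sketched as follows. This assumes the plot is built up the way R4DS does it; `scores` is a hypothetical stand-in containing only the first ten rows printed earlier, not the full `Table.2.1`.

```r
library(ggplot2)

# Stand-in data: Spelling and Math for the first ten students shown above
scores <- data.frame(
  Spelling = c(64, 64, 40, 74, 69, 67, 71, 51, 68, 56),
  Math     = c(43, 37, 38, 40, 28, 42, 37, 34, 35, 26)
)

# Step 1: data only, which draws an empty plotting area
canvas <- ggplot(data = scores)

# Step 2: map Spelling to x, so the x axis appears but nothing is drawn yet
with_x <- ggplot(data = scores, mapping = aes(x = Spelling))

# Step 3: map Math to y as well, so both axes are set, still with no points
with_xy <- ggplot(data = scores, mapping = aes(x = Spelling, y = Math))

print(canvas)
print(with_x)
print(with_xy)
```

None of these three plots has a layer yet, which is why no data points appear.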
The function geom_point() adds a layer of points to your plot, which creates a scatterplot.
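A minimal sketch of adding the point layer, again using a hypothetical `scores` stand-in built from the first ten rows printed earlier:

```r
library(ggplot2)

# Stand-in data: Spelling and Math for the first ten students shown above
scores <- data.frame(
  Spelling = c(64, 64, 40, 74, 69, 67, 71, 51, 68, 56),
  Math     = c(43, 37, 38, 40, 28, 42, 37, 34, 35, 26)
)

# geom_point() adds one point per row, positioned by the x and y mappings
scatter <- ggplot(data = scores, mapping = aes(x = Spelling, y = Math)) +
  geom_point()

print(scatter)
```

With the full data, replacing `scores` with `Table.2.1` should produce the scatterplot of all 52 students.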
You can add a third variable, like class, to a scatterplot by mapping it to an aesthetic — a visual property of the objects in your plot. Aesthetics include things like the size, the shape, or the color of your points.
“When a variable is mapped to an aesthetic, ggplot2 will automatically assign a unique value of the aesthetic (here a unique color) to each unique level of the variable…a process known as scaling. ggplot2 will also add a legend that explains which values correspond to which levels.”
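Mapping a third variable to color might look like the sketch below. The ten rows printed earlier all come from Ms. Johnson's class, so this stand-in uses `Gender` as the third variable instead. With the full `Table.2.1`, `color = Class` would be the analogous mapping.

```r
library(ggplot2)

# Stand-in data: the first ten students shown above (all in one class)
scores <- data.frame(
  Gender   = factor(c("male", "male", "male", "female", "female",
                      "male", "female", "male", "female", "male")),
  Spelling = c(64, 64, 40, 74, 69, 67, 71, 51, 68, 56),
  Math     = c(43, 37, 38, 40, 28, 42, 37, 34, 35, 26)
)

# color = Gender is placed inside aes(), so ggplot2 scales it:
# one color per factor level, plus an automatic legend
colored <- ggplot(scores, aes(x = Spelling, y = Math, color = Gender)) +
  geom_point()

print(colored)
```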
7-day rolling average
Similar to the way the New York Times displays COVID infections over time, we can take a single variable, such as spelling test scores, and display the count for each score (from 38 to 76) together with a smoothed curve (a kernel density estimate, scaled to counts).
ggplot(Table.2.1, aes(x = Spelling, y = after_stat(count))) +
  geom_histogram(binwidth = 1, color = "black", fill = "white") +
  geom_density(lwd = .5, color = "black", adjust = .7, fill = "green", alpha = .5) +
  labs(title = "Spelling Scores for Two Classes", subtitle = "and overlaid density plot") +
  theme_classic()
The R4DS authors first suggest assigning a color to each of the classrooms.
The previous plot used color coding to parse what’s going on between the two classrooms.
Another way to approach this is to use “subplots that each display one subset of the data.”
In this case, we show two plots side by side: one of spelling and math scores for Mr. Cordero’s class and one for Ms. Johnson’s class.
“It’s generally not a good idea to represent information using only colors on a plot, as people perceive colors differently due to color blindness or other color vision differences. Therefore, in addition to color, we can also map species [or classroom] to the shape aesthetic.”
“And finally, we can improve the labels of our plot using the labs() function in a new layer. Some of the arguments to labs() might be self explanatory: title adds a title and subtitle adds a subtitle to the plot. Other arguments match the aesthetic mappings, x is the x-axis label, y is the y-axis label, and color and shape define the label for the legend.”
ggplot(
  data = Table.2.1,
  mapping = aes(x = Spelling, y = Math)
) +
  geom_point(aes(color = Class, shape = Class)) +
  geom_smooth() +
  labs(
    title = "Spelling scores and Math scores",
    subtitle = "Test scores for two classrooms",
    x = "Spelling (out of 80 items)",
    y = "Math (out of 65 items)",
    color = "Class",
    shape = "Class"
  )
Rewriting the basic scatterplot code
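R4DS rewrites its basic scatterplot by dropping the argument names `data` and `mapping`, and optionally by piping the data into `ggplot()`. A sketch of the same idea for these scores, assuming that is what this slide refers to; `scores` is a hypothetical stand-in built from the first ten rows printed earlier.

```r
library(ggplot2)

# Stand-in data: Spelling and Math for the first ten students shown above
scores <- data.frame(
  Spelling = c(64, 64, 40, 74, 69, 67, 71, 51, 68, 56),
  Math     = c(43, 37, 38, 40, 28, 42, 37, 34, 35, 26)
)

# Explicit form: both arguments are named
explicit <- ggplot(data = scores, mapping = aes(x = Spelling, y = Math)) +
  geom_point()

# Concise form: data and mapping are matched by position
concise <- ggplot(scores, aes(x = Spelling, y = Math)) +
  geom_point()

# Piped form: the data frame is passed in with the native pipe (R >= 4.1)
piped <- scores |>
  ggplot(aes(x = Spelling, y = Math)) +
  geom_point()

print(concise)
```

All three objects describe the same plot, so the choice is a matter of readability.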
A very boring chart
a histogram or a density plot.
different binwidths can reveal different patterns
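To see the effect of the bin width, the sketch below draws the same values with a narrow and a wide setting. The vector holds only the first fifteen Spelling scores visible in the `glimpse()` output earlier, and the bin widths 2 and 10 are arbitrary choices for illustration.

```r
library(ggplot2)

# Stand-in data: the first fifteen Spelling scores shown in the glimpse() output
scores <- data.frame(
  Spelling = c(64, 64, 40, 74, 69, 67, 71, 51, 68, 56, 51, 57, 68, 47, 59)
)

# Narrow bins: more bars, so more detail and more noise
narrow <- ggplot(scores, aes(x = Spelling)) +
  geom_histogram(binwidth = 2)

# Wide bins: fewer bars, so a smoother but coarser picture
wide <- ggplot(scores, aes(x = Spelling)) +
  geom_histogram(binwidth = 10)

print(narrow)
print(wide)
```

With all 52 students the shapes will differ from this small excerpt, but the trade-off between detail and smoothness is the same.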
To facet your plot by a single variable, use facet_wrap()
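A minimal faceting sketch follows. The ten rows printed earlier are all from one classroom, so this stand-in facets by `Gender`. With the full `Table.2.1`, `facet_wrap(~Class)` would give one panel per classroom.

```r
library(ggplot2)

# Stand-in data: the first ten students shown above (all in one class)
scores <- data.frame(
  Gender   = factor(c("male", "male", "male", "female", "female",
                      "male", "female", "male", "female", "male")),
  Spelling = c(64, 64, 40, 74, 69, 67, 71, 51, 68, 56),
  Math     = c(43, 37, 38, 40, 28, 42, 37, 34, 35, 26)
)

# facet_wrap() takes a one-sided formula naming the variable to split on;
# each level of that variable gets its own panel with shared axes
faceted <- ggplot(scores, aes(x = Spelling, y = Math)) +
  geom_point() +
  facet_wrap(~Gender)

print(faceted)
```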