RDS Chapter 2

Hoffman

R for Data Science Ch2

This presentation corresponds to Chapter 2: Data visualization https://r4ds.hadley.nz/data-visualize.html

Prerequisites

Chapter 2 of RDS focuses on ggplot2, one of the core members of the tidyverse. If you have not yet installed the tidyverse on your computer, type this in your console:

install.packages("tidyverse")

and then load tidyverse for this session by running:

library(tidyverse)

Follow RDS ch. 2 with Thorndike data

From Thorndike, Chapter 2, pp. 23-28:

“Catherine Johnson and Peter Cordero wanted to gather information about achievement levels in their two sixth-grade classes. They gave their students a 45-item reading comprehension test provided in their current reading series, a 65-item review test from the mathematics book, and a dictation spelling test of 80 items based on the words their classes had been studying during the past 6 weeks.”

Loading Table 2-1 for our use

## CLEAR WORKSPACE
rm(list=ls())

## READ IN DATA
# read data from your own computer
Table.2.1 <- read_csv(file = "Table.2.1.csv")

The dataset should appear in your Environment pane. If it does not:

  • Did you download the data from Canvas? If not, please download Table.2.1_clean.csv.
  • What directory on your computer are you working from? Check with getwd().
  • Is the name of the .csv file in your script the same as the file you downloaded?
  • Did you install tidyverse? If not, run install.packages("tidyverse").
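
The checks in this list can be run from the console. This is only a minimal sketch in base R: csv_name is a placeholder set to the file name used in the read_csv() call above, so change it if your download has a different name.

```r
# Placeholder: the file name used in read_csv(); change it if yours differs
csv_name <- "Table.2.1.csv"

# Which folder is R working from?
getwd()

# Which .csv files does R see in that folder?
list.files(pattern = "\\.csv$")

# TRUE if the file is in the working directory, FALSE otherwise
file_found <- file.exists(csv_name)
file_found

# TRUE if the tidyverse package is installed, FALSE otherwise
tidyverse_installed <- requireNamespace("tidyverse", quietly = TRUE)
tidyverse_installed
```

If file_found is FALSE, either move the .csv into the folder printed by getwd() or change the working directory.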

Some data cleaning

# Convert "Gender" to a factor and label its levels (1 = male, 2 = female)
Table.2.1 <- Table.2.1 %>%
  mutate(Gender = factor(Gender, levels = c("1", "2"), labels = c("male", "female")))

# Convert "Class" to a factor and label its levels (1 = Johnson, 2 = Cordero)
Table.2.1 <- Table.2.1 %>%
  mutate(Class = factor(Class, levels = c("1", "2"), labels = c("Johnson", "Cordero")))
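
To see what factor() does with levels and labels, here is a small sketch on a plain vector. The codes are reconstructed from the Gender column of the first ten students printed in the next section, so gender_codes is an illustration and not the file itself.

```r
# Gender codes for the first ten students (1 = male, 2 = female),
# reconstructed from the printed rows of Table.2.1
gender_codes <- c(1, 1, 1, 2, 2, 1, 2, 1, 2, 1)

# levels says which codes to expect; labels says what to call them
gender <- factor(gender_codes, levels = c("1", "2"), labels = c("male", "female"))

levels(gender)  # the labels, in the order given
table(gender)   # how many students carry each label

# A code that is not listed in levels becomes NA
unexpected <- factor(c(1, 2, 3), levels = c("1", "2"), labels = c("male", "female"))
unexpected
```

The last line is a useful check after cleaning: an unexpected NA usually means the data contains a code you did not list.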

2.2 First steps

RDS asked: “Do penguins with longer flippers weigh more or less than penguins with shorter flippers?”

So for us, how about “Do students who are doing well on the 6th-grade spelling tests do better on the math review tests?”

2.2.1 The Table.2.1 data frame

Table.2.1
# A tibble: 52 × 7
   First    Last     Gender Class   Reading Spelling  Math
   <chr>    <chr>    <fct>  <fct>     <dbl>    <dbl> <dbl>
 1 Aaron    Andrews  male   Johnson      32       64    43
 2 Byron    Biggs    male   Johnson      40       64    37
 3 Charles  Cowen    male   Johnson      36       40    38
 4 Donna    Davis    female Johnson      41       74    40
 5 Erin     Edwards  female Johnson      36       69    28
 6 Fernando Franco   male   Johnson      41       67    42
 7 Gail     Galaraga female Johnson      40       71    37
 8 Harpo    Henry    male   Johnson      30       51    34
 9 Irrida   Ignacio  female Johnson      37       68    35
10 Jack     Johanson male   Johnson      26       56    26
# … with 42 more rows

glimpse() the Table.2.1 data frame

glimpse(Table.2.1)
Rows: 52
Columns: 7
$ First    <chr> "Aaron", "Byron", "Charles", "Donna", "Erin", "Fernando", "Ga…
$ Last     <chr> "Andrews", "Biggs", "Cowen", "Davis", "Edwards", "Franco", "G…
$ Gender   <fct> male, male, male, female, female, male, female, male, female,…
$ Class    <fct> Johnson, Johnson, Johnson, Johnson, Johnson, Johnson, Johnson…
$ Reading  <dbl> 32, 40, 36, 41, 36, 41, 40, 30, 37, 26, 28, 36, 39, 22, 36, 3…
$ Spelling <dbl> 64, 64, 40, 74, 69, 67, 71, 51, 68, 56, 51, 57, 68, 47, 59, 6…
$ Math     <dbl> 43, 37, 38, 40, 28, 42, 37, 34, 35, 26, 25, 53, 37, 22, 33, 3…

2.2.3 Creating a ggplot

The blank canvas:

ggplot(data = Table.2.1)

Add the x axis (spelling)

ggplot(
  data = Table.2.1,
  mapping = aes(x = Spelling))

Add the y axis (math)

…to the plot with the x axis already set

ggplot(
  data = Table.2.1,
  mapping = aes(x = Spelling, y = Math))

Add the layer of data points

The function geom_point() adds a layer of points to your plot, which creates a scatterplot.

ggplot(
  data = Table.2.1,
  mapping = aes(x = Spelling, y = Math)) + 
  geom_point()
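
The same build-up can be done by storing the plot in an object and adding to it. This is a sketch only: first_ten is a hypothetical data frame holding the Spelling and Math scores of the ten students printed earlier, and base_plot and scatter are names made up for the example.

```r
library(ggplot2)

# Hypothetical mini data set: the first ten rows printed from Table.2.1
first_ten <- data.frame(
  Spelling = c(64, 64, 40, 74, 69, 67, 71, 51, 68, 56),
  Math     = c(43, 37, 38, 40, 28, 42, 37, 34, 35, 26)
)

# Step 1: canvas plus both axes, nothing drawn yet
base_plot <- ggplot(data = first_ten, mapping = aes(x = Spelling, y = Math))

# Step 2: add the layer of points to the stored plot
scatter <- base_plot + geom_point()
scatter

# layer_data() returns the data that the first layer draws
points_drawn <- layer_data(scatter, 1)
nrow(points_drawn)  # one row per student
```

Storing the base plot makes it easy to try different layers without retyping the data and mapping.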

2.2.4 Adding aesthetics and layers

You can add a third variable, like class, to a scatterplot by mapping it to an aesthetic — a visual property of the objects in your plot. Aesthetics include things like the size, the shape, or the color of your points.

“When a variable is mapped to an aesthetic, ggplot2 will automatically assign a unique value of the aesthetic (here a unique color) to each unique level of the variable…a process known as scaling. ggplot2 will also add a legend that explains which values correspond to which levels.”

Did Mr. Cordero emphasize Math and Ms. Johnson emphasize spelling?

ggplot(
  data = Table.2.1,
  mapping = aes(x = Spelling, y = Math, color = Class)) +
  geom_point()
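
Scaling can be inspected directly. The sketch below uses a hypothetical data frame, first_ten, built from the ten students printed earlier. All ten are in Ms. Johnson's class, so it maps Gender to color instead of Class; the idea is the same.

```r
library(ggplot2)

# Hypothetical mini data set: the first ten rows printed from Table.2.1
first_ten <- data.frame(
  Gender = factor(
    c("male", "male", "male", "female", "female",
      "male", "female", "male", "female", "male"),
    levels = c("male", "female")
  ),
  Spelling = c(64, 64, 40, 74, 69, 67, 71, 51, 68, 56),
  Math     = c(43, 37, 38, 40, 28, 42, 37, 34, 35, 26)
)

p <- ggplot(first_ten, aes(x = Spelling, y = Math, color = Gender)) +
  geom_point()
p

# After scaling, every point carries the color assigned to its level
drawn <- layer_data(p, 1)
colors_used <- unique(drawn$colour)
colors_used  # one color value per level of Gender
```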

One way of reporting COVID data

7-day rolling average

Univariate (one variable)

Similar to the way the New York Times displays COVID infections over time, we can look at a single variable, such as spelling test scores, and display the count for each score (from 38 to 76) overlaid with a kernel density estimate, which is a smoothed version of the histogram.

ggplot(Table.2.1, aes(x = Spelling, y = after_stat(count))) +
  geom_histogram(binwidth = 1, color = "black", fill = "white") +
  geom_density(linewidth = 0.5, color = "black", adjust = 0.7, fill = "green", alpha = 0.5) +
  labs(title = "Spelling Scores for Two Classes", subtitle = "and overlaid density plot") +
  theme_classic()
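
Why y = after_stat(count)? By default geom_density() draws a curve on the density scale, which would be far too small next to bars that count students. Asking for count rescales the curve to density times the number of students, so both layers share one axis. A sketch with a hypothetical first_ten data frame (the ten Spelling scores printed earlier):

```r
library(ggplot2)

# Hypothetical mini data set: Spelling scores of the first ten students
first_ten <- data.frame(Spelling = c(64, 64, 40, 74, 69, 67, 71, 51, 68, 56))

p <- ggplot(first_ten, aes(x = Spelling, y = after_stat(count))) +
  geom_histogram(binwidth = 1) +
  geom_density()
p

hist_data <- layer_data(p, 1)  # the bars
dens_data <- layer_data(p, 2)  # the curve

# The bar heights add up to the number of students
total_in_bars <- sum(hist_data$count)
total_in_bars

# The curve is the ordinary density multiplied by the number of students
same_scale <- all.equal(dens_data$y, dens_data$density * nrow(first_ten))
same_scale
```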

RDS recommends geom_smooth

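They first suggest assigning a color to each of the classrooms. Because color is mapped inside ggplot(), geom_smooth() inherits it and draws one curve per classroom.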

ggplot(
  data = Table.2.1,
  mapping = aes(x = Spelling, y = Math, color = Class)
) +
  geom_point() +
  geom_smooth()

One smooth curve, not two

ggplot(
  data = Table.2.1,
  mapping = aes(x = Spelling, y = Math)
) +
  geom_point(mapping = aes(color = Class)) +
  geom_smooth()
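
The difference between the last two plots is where color is mapped: inside ggplot() it applies to every layer, inside geom_point() it applies to the points only. The sketch below counts the curves in each case. Assumptions: first_ten is a hypothetical data frame of the ten students printed earlier, Gender stands in for Class (all ten are in one class), and method = "lm" replaces the default smoother only because ten points are too few for a stable curve.

```r
library(ggplot2)

# Hypothetical mini data set: the first ten rows printed from Table.2.1
first_ten <- data.frame(
  Gender = factor(
    c("male", "male", "male", "female", "female",
      "male", "female", "male", "female", "male"),
    levels = c("male", "female")
  ),
  Spelling = c(64, 64, 40, 74, 69, 67, 71, 51, 68, 56),
  Math     = c(43, 37, 38, 40, 28, 42, 37, 34, 35, 26)
)

# Color mapped globally: both layers are split by Gender
global_map <- ggplot(first_ten, aes(x = Spelling, y = Math, color = Gender)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Color mapped locally: only the points are split by Gender
local_map <- ggplot(first_ten, aes(x = Spelling, y = Math)) +
  geom_point(aes(color = Gender)) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Layer 2 is the smooth; count how many separate lines it draws
n_lines_global <- length(unique(layer_data(global_map, 2)$group))
n_lines_local  <- length(unique(layer_data(local_map, 2)$group))
c(global = n_lines_global, local = n_lines_local)
```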

3.5 Facets

The previous plot used color coding to parse what’s going on between the two classrooms.

Another way to approach this is to use “subplots that each display one subset of the data.”

In this case, we show two plots side by side: one with spelling and math for Ms. Johnson’s class and a second for Mr. Cordero’s class.

ggplot(data = Table.2.1) + 
  geom_point(mapping = aes(x = Spelling, y = Math)) +
  facet_wrap(~ Class)
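
Each facet receives only its own subset of the rows. The sketch below checks that. It assumes a hypothetical first_ten data frame of the ten students printed earlier and facets by Gender, because all ten are in the same class.

```r
library(ggplot2)

# Hypothetical mini data set: the first ten rows printed from Table.2.1
first_ten <- data.frame(
  Gender = factor(
    c("male", "male", "male", "female", "female",
      "male", "female", "male", "female", "male"),
    levels = c("male", "female")
  ),
  Spelling = c(64, 64, 40, 74, 69, 67, 71, 51, 68, 56),
  Math     = c(43, 37, 38, 40, 28, 42, 37, 34, 35, 26)
)

p <- ggplot(data = first_ten) +
  geom_point(mapping = aes(x = Spelling, y = Math)) +
  facet_wrap(~ Gender)
p

# PANEL records which subplot each point lands in
panel_counts <- table(layer_data(p, 1)$PANEL)
panel_counts  # panel 1 is the first factor level (male), panel 2 is female
```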

Also the shape of data points

“It’s generally not a good idea to represent information using only colors on a plot, as people perceive colors differently due to color blindness or other color vision differences. Therefore, in addition to color, we can also map species [or classroom] to the shape aesthetic.”

ggplot(
  data = Table.2.1,
  mapping = aes(x = Spelling, y = Math)
) +
  geom_point(mapping = aes(color = Class, shape = Class)) +
  geom_smooth()

Titles and subtitles

“And finally, we can improve the labels of our plot using the labs() function in a new layer. Some of the arguments to labs() might be self explanatory: title adds a title and subtitle adds a subtitle to the plot. Other arguments match the aesthetic mappings, x is the x-axis label, y is the y-axis label, and color and shape define the label for the legend.”

ggplot(
  data = Table.2.1,
  mapping = aes(x = Spelling, y = Math)
) +
  geom_point(aes(color = Class, shape = Class)) +
  geom_smooth() +
  labs(
    title = "Spelling scores and Math scores",
    subtitle = "Test scores for two classrooms",
    x = "Spelling (out of 80 items)",
    y = "Math (out of 65 items)",
    color = "Class",
    shape = "Class"
  )

2.3 ggplot2 calls

Rewriting the basic scatterplot code

ggplot(Table.2.1, aes(x = Spelling, y = Math)) + 
  geom_point()

And with the pipe:

Table.2.1 |> 
  ggplot(aes(x = Spelling, y = Math)) + 
  geom_point()
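
The pipe pays off when you want to transform the data on its way into the plot. A sketch, assuming a hypothetical first_ten data frame of the ten students printed earlier and dplyr's filter() to keep a subset:

```r
library(dplyr)
library(ggplot2)

# Hypothetical mini data set: the first ten rows printed from Table.2.1
first_ten <- data.frame(
  Gender = factor(
    c("male", "male", "male", "female", "female",
      "male", "female", "male", "female", "male"),
    levels = c("male", "female")
  ),
  Spelling = c(64, 64, 40, 74, 69, 67, 71, 51, 68, 56),
  Math     = c(43, 37, 38, 40, 28, 42, 37, 34, 35, 26)
)

# Filter first, then hand the result to ggplot() as its data argument
female_plot <- first_ten |>
  filter(Gender == "female") |>
  ggplot(aes(x = Spelling, y = Math)) +
  geom_point()
female_plot

n_points <- nrow(layer_data(female_plot, 1))
n_points  # only the filtered rows are plotted
```

Note that data steps are joined with |> while plot layers are still joined with +.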

2.4 Visualizing distributions

2.4.1 A categorical variable

A very boring chart

ggplot(Table.2.1, aes(x = Class)) +
  geom_bar()

2.4.2 A numerical variable

Use a histogram or a density plot.

ggplot(Table.2.1, aes(x = Spelling)) + 
  geom_histogram(binwidth = 3)

ggplot(Table.2.1, aes(x = Spelling)) + 
  geom_density()

Histogram Binwidth

Different binwidths can reveal different patterns.

ggplot(Table.2.1, aes(x = Spelling)) +
  geom_histogram(binwidth = 2)

ggplot(Table.2.1, aes(x = Spelling)) +
  geom_histogram(binwidth = 10)
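
Changing binwidth regroups the same students into different bins; it never adds or removes anyone. The sketch below compares the two settings on a hypothetical first_ten data frame (the ten Spelling scores printed earlier).

```r
library(ggplot2)

# Hypothetical mini data set: Spelling scores of the first ten students
first_ten <- data.frame(Spelling = c(64, 64, 40, 74, 69, 67, 71, 51, 68, 56))

narrow_plot <- ggplot(first_ten, aes(x = Spelling)) +
  geom_histogram(binwidth = 2)
wide_plot <- ggplot(first_ten, aes(x = Spelling)) +
  geom_histogram(binwidth = 10)

# One row per bin in each histogram
narrow_bins <- layer_data(narrow_plot, 1)
wide_bins   <- layer_data(wide_plot, 1)

# Narrow bins: many bars with few students in each
nrow(narrow_bins)
# Wide bins: few bars with many students in each
nrow(wide_bins)

# Either way, the bars account for all ten students
c(narrow = sum(narrow_bins$count), wide = sum(wide_bins$count))
```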

Frequency polygons

ggplot(Table.2.1, aes(x = Spelling, color = Class)) +
  geom_freqpoly(binwidth = 4, linewidth = 0.75)

Overlaid density plots

ggplot(Table.2.1, aes(x = Spelling, color = Class, fill = Class)) +
  geom_density(alpha = 0.5)

2.5.2 Two categorical variables

“We can use segmented bar plots to visualize the distribution between two categorical variables.” For this data, we can compare the gender breakdown in each classroom.

ggplot(Table.2.1, aes(x = Class, fill = Gender)) +
  geom_bar()

2.5.4 Three or more variables

ggplot(Table.2.1, aes(x = Spelling, y = Math)) +
  geom_point(aes(color = Class, shape = Gender))

Split your plot into facets

To facet your plot by a single variable, use facet_wrap().

ggplot(Table.2.1, aes(x = Spelling, y = Math)) +
  geom_point(aes(color = Gender, shape = Class)) +
  facet_wrap(~Class)