Content Reference:

This lab references practice problems from “R for Data Science”, Chapter 3: Data Visualisation (https://r4ds.had.co.nz/data-visualisation.html).

In this lab we will discuss and apply:

  • Position Adjustments (for bars)
  • Geometric Objects

Example 1: Diamonds

First, load the tidyverse package:

library(tidyverse)

The diamonds dataset is built into the ggplot2 package. Its help page is titled “Prices of over 50,000 round cut diamonds”.

Description: A dataset containing the prices and other attributes of almost 54,000 diamonds.

The variables are listed in the help file, which you can open with:

?diamonds

A. Learn About the Data

What kinds of variables are we working with?

# look at the structure of the data
str(diamonds)
## tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
##  $ carat  : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

B. Basic Bar Graph

Let’s start with a basic bar graph. This is a univariate graphical tool that shows the frequency (count) of each level of a categorical variable.

ggplot(data=diamonds)+
  geom_bar(aes(x=cut))
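
As a rough check on what the bars represent, the heights can be reproduced with count() from dplyr. This is only a sketch; the object name cut_counts is made up for illustration.

```r
library(tidyverse)

# Count the number of diamonds at each level of cut;
# these counts are the bar heights that geom_bar() draws
cut_counts <- diamonds %>%
  count(cut)

cut_counts

# geom_col() plots precomputed heights, so this should give the same picture
ggplot(data = cut_counts) +
  geom_col(aes(x = cut, y = n))
```

geom_bar() does the counting for you, while geom_col() expects the heights to already be in the data.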

1. Playing with Color vs Fill

Color
# Apply cut to the color aesthetic
ggplot(data=diamonds)+
  geom_bar(aes(x=cut, color=cut))

Does the output of the graphic above match what you imagined? Why or why not?

Fill

Now try changing the aesthetic to fill.

# Apply cut to the fill aesthetic
ggplot(data=diamonds)+
  geom_bar(aes(x=cut, fill=cut))

Does this match what you were hoping for?

2. Color Palette

What do you notice about this color palette?

Hint: How is it different from the following example using the mpg dataset?

ggplot(data=mpg)+
  geom_bar(aes(x=class, fill=class))
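
One way to investigate the hint is to compare how the two variables are stored. The sketch below only inspects the variable types; it assumes the default ggplot2 scales, which pick a sequential palette for ordered factors and a qualitative palette for unordered discrete variables.

```r
library(tidyverse)

# cut is stored as an ordered factor (Fair < Good < ... < Ideal)
is.ordered(diamonds$cut)
levels(diamonds$cut)

# class is stored as plain character data, with no built-in ordering
is.ordered(mpg$class)
class(mpg$class)
```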

C. Position Adjustment

1. Stacked Bar Graphs

By default, when the fill aesthetic is mapped to a different variable than the one on the x-axis, ggplot2 creates a stacked bar graph.

# Look at the frequency of cut 
# Apply clarity to the fill aesthetic
ggplot(data=diamonds)+
  geom_bar(aes(x=cut, fill=clarity))

  • When would it be useful to use a graphic like this?
  • When might it not be a good choice? Why?

2. Filled Bar Graphs

When we are interested in comparing proportions across groups, we can use the `position = "fill"` argument, which stretches every bar to the same height.

# add position="fill" to your graph above
ggplot(data=diamonds)+
  geom_bar(aes(x=cut, fill=clarity), position = "fill")
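
To see the numbers behind a filled bar graph, the proportions can be computed by hand. This is a minimal sketch; the names clarity_props and prop are hypothetical and not part of the lab.

```r
library(tidyverse)

# Within each cut, compute the share of diamonds at each clarity level
clarity_props <- diamonds %>%
  count(cut, clarity) %>%
  group_by(cut) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

head(clarity_props)
```

Within each cut the proportions sum to 1, which is why every bar in the filled graph has the same height.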

3. Side-by-side Bar Graphs

When we are interested in comparing counts across groups, we can use the `position = "dodge"` argument, which places the bars for each group side by side.

# add position="dodge" to your graph above
ggplot(data=diamonds)+
  geom_bar(aes(x=cut, fill=clarity), position = "dodge")

  • When would it be useful to use a graphic like this?
  • When might it not be a good choice? Why?
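
The string "dodge" is shorthand for the position_dodge() function, and calling the function directly exposes extra options. The sketch below uses preserve = "single", which keeps every bar the same width even if some combinations are missing; whether that matters for this dataset is left as an assumption to check.

```r
library(tidyverse)

# Same graph as above, written with the position function instead of the string
p_dodge <- ggplot(data = diamonds) +
  geom_bar(aes(x = cut, fill = clarity),
           position = position_dodge(preserve = "single"))

p_dodge
```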

Example 2: FiveThirtyEight

A. Read the Article

“Voter Registrations Are Way, Way Down During the Pandemic” (Jun 26, 2020) by Kaleigh Rogers and Nathaniel Rakich

https://fivethirtyeight.com/features/voter-registrations-are-way-way-down-during-the-pandemic/

B. Discuss in Small Groups

  1. How are graphics used to tell the authors’ story?

  2. What geometries are used?

C. The Data

What does the raw data look like?

# Import data
vreg<-read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/voter-registration/new-voter-registrations.csv",
               header=TRUE)


head(vreg)
##   Jurisdiction Year Month New.registered.voters
## 1      Arizona 2016   Jan                 25852
## 2      Arizona 2016   Feb                 51155
## 3      Arizona 2016   Mar                 48614
## 4      Arizona 2016   Apr                 30668
## 5      Arizona 2020   Jan                 33229
## 6      Arizona 2020   Feb                 50853

D. Processing the Data

Relevel

Relevel the Month variable so that the months appear in calendar order:

# Set the levels of Month so that they are in calendar order (i.e., not alphabetical)
vreg$Month<-factor(vreg$Month,
                   levels=c("Jan", "Feb", "Mar", "Apr", "May"))
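
To see why the levels need to be set, here is a small sketch on a made-up vector of month abbreviations (the object names are for illustration only). Without explicit levels, factor() sorts them alphabetically.

```r
# Hypothetical vector of month abbreviations
months_chr <- c("Jan", "Feb", "Mar", "Apr", "May")

# Default: levels are sorted alphabetically
default_levels <- levels(factor(months_chr))
default_levels

# Explicit levels: calendar order is preserved
ordered_levels <- levels(factor(months_chr,
                                levels = c("Jan", "Feb", "Mar", "Apr", "May")))
ordered_levels
```

Note that any value not listed in levels becomes NA, so the spelling in the levels vector has to match the data exactly.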

Tidy

### USE spread() FROM tidyr
vregYear<-vreg%>%
  spread(Year, New.registered.voters)

### RENAME THE COLUMNS
colnames(vregYear)<-c("Jurisdiction", "Month", "Y2016", "Y2020")

Mutate

Add change in registration.

### mutate() FROM dplyr
vregChange<-vregYear%>%
  mutate(change=Y2020-Y2016)

head(vregChange)
##   Jurisdiction Month  Y2016  Y2020 change
## 1      Arizona   Jan  25852  33229   7377
## 2      Arizona   Feb  51155  50853   -302
## 3      Arizona   Mar  48614  31872 -16742
## 4      Arizona   Apr  30668  10249 -20419
## 5   California   Jan  87574 151595  64021
## 6   California   Feb 103377 238281 134904

E. Recreate the graphic

# type code/answer here
ggplot(data=vregChange, aes(x=Month, y=change, fill=change>0))+
  geom_col()+
  facet_wrap(Jurisdiction~., scales = "free_y")+
  theme_minimal()+
  guides(fill="none")+
  scale_y_continuous(name="")+
  scale_x_discrete(name="")

Other hints

  • You can add another column to define color
  • Pay careful attention to the axes; you might want to read the help file for facet_wrap
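
The first hint could be sketched as follows. This is one possible approach, not the article's method: the column name direction and the two colors are assumptions chosen for illustration, and the toy rows are copied from the head(vregChange) output.

```r
library(dplyr)
library(ggplot2)

# Toy rows for Arizona, taken from the head(vregChange) output
toy_change <- data.frame(
  Jurisdiction = "Arizona",
  Month = factor(c("Jan", "Feb", "Mar", "Apr"),
                 levels = c("Jan", "Feb", "Mar", "Apr", "May")),
  change = c(7377, -302, -16742, -20419)
)

# Add a column that labels each bar as an increase or a decrease
toy_change <- toy_change %>%
  mutate(direction = ifelse(change > 0, "increase", "decrease"))

# Map the new column to fill and choose the colors by hand
p_change <- ggplot(toy_change, aes(x = Month, y = change, fill = direction)) +
  geom_col() +
  scale_fill_manual(values = c(increase = "steelblue", decrease = "firebrick")) +
  guides(fill = "none")

p_change
```

Naming the colors by level in scale_fill_manual() keeps the mapping stable even when a panel has only increases or only decreases.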