Aims of this worksheet

In an earlier worksheet, you learned the basic data manipulation verbs from the dplyr package: select(), filter(), mutate(), arrange(), group_by(), and summarize(). In this worksheet you will learn additional data verbs from the dplyr and tidyr packages. These data verbs relate to window functions (lead() and lag()), data table joins (left_join() et al.), and data reshaping (spread() and gather())

To begin, we will load the necessary packages, as well as the Methodist data

library(dplyr)
library(tidyr)
library(readr)
library(historydata)
library(ggplot2)
methodists <- read_csv("http://lincolnmullen.com/projects/worksheets/data/methodists.csv")

Data joining with two table verbs (left_join() et al.)

Data lookups

It is often the case that we want to use some variable in our data to create a new variable. Consider the Methodist data for the year 1800. Perhaps we are interested in the racial composition of the churches. Do they tend to be all white and all black, or do some churches have both white and black members in varying proportions? The simplest way to get a look at that question is to create a scatter plot of the figures for white and black membership.

methodists_1800 <- methodists %>% 
  filter(minutes_year == 1800) %>% 
  select(meeting, state, members_white, members_colored)

ggplot(methodists_1800, aes(x = members_white, y = members_colored)) +
  geom_point(shape = 1) 

That scatterplot is interesting as far as it goes, but we might reasonably suspect that the racial composition of methodist meetings varies by region. We could use the state variable to facet the plot by state. However, this has two problems. There are 20 states represented in that year.

methodists_1800$state %>% unique() %>% length()
## [1] 20

Our faceted plot would have 20 panels, which is too many. But more important, by looking at individual states we might be getting too fine grained a look at the data. We have good reason to think that it is regions that matter more than states.

It is easy enough to describe what we would do to translate states into a new column with regions. We would look at each state name and assign it to a region. Connecticut would be in the Northeast, New York would be in the Mid-Atlantic, and so on. So far when we have created new columns in a data frame we have done so with mutate().1 Another way to think of this problem, though, is to think of looking up the state names in a table where they associated with regions. We can create such a data frame with the code below. In many cases, though, it would make more sense to create a CSV file with the data and read it in as a data frame.

regions <- data_frame(
  states = c("Connecticut", "Delaware", "Georgia", "Kentucky", "Maine", 
             "Maryland", "Massachusetts", "Mississippi", "New Hampshire", 
             "New Jersey", "New York", "North Carolina",
             "Northwestern Territory", "Pennsylvania", "Rhode Island",
             "South Carolina", "Tennessee", "Upper Canada", "Vermont",
             "Virginia"),
  region = c("Northeast", "Atlantic South", "Atlantic South", "West",
             "Northeast", "Atlantic South", "Northeast", "Deep South", 
             "Northeast", "Mid-Atlantic", "Mid-Atlantic", "Atlantic South",
             "West", "Mid-Atlantic", "Northeast", "Atlantic South", "West",
             "Canada", "Northeast", "Atlantic South")
)

And now we can inspect the table.

regions
## Source: local data frame [20 x 2]
## 
##                    states         region
##                     (chr)          (chr)
## 1             Connecticut      Northeast
## 2                Delaware Atlantic South
## 3                 Georgia Atlantic South
## 4                Kentucky           West
## 5                   Maine      Northeast
## 6                Maryland Atlantic South
## 7           Massachusetts      Northeast
## 8             Mississippi     Deep South
## 9           New Hampshire      Northeast
## 10             New Jersey   Mid-Atlantic
## 11               New York   Mid-Atlantic
## 12         North Carolina Atlantic South
## 13 Northwestern Territory           West
## 14           Pennsylvania   Mid-Atlantic
## 15           Rhode Island      Northeast
## 16         South Carolina Atlantic South
## 17              Tennessee           West
## 18           Upper Canada         Canada
## 19                Vermont      Northeast
## 20               Virginia Atlantic South

We can do a look up where we take the state column in the methodists_1800 data frame and associate it with the states column in our regions data frame. The result will be a new column region. Notice how we use the by = argument to specify which column in the left hand table matches which column in the right hand table.

methodists_region <- methodists_1800 %>% 
  left_join(regions, by = c("state" = "states"))

methodists_region
## Source: local data frame [169 x 5]
## 
##        meeting          state members_white members_colored         region
##          (chr)          (chr)         (int)           (int)          (chr)
## 1      Augusta        Georgia            61               9 Atlantic South
## 2        Burke        Georgia           297              36 Atlantic South
## 3     Richmond        Georgia           548             115 Atlantic South
## 4   Washington        Georgia           497              92 Atlantic South
## 5  Broad River South Carolina           604              62 Atlantic South
## 6   Bush River South Carolina           328              31 Atlantic South
## 7   Charleston South Carolina            60             440 Atlantic South
## 8     Cherokee South Carolina            79               0 Atlantic South
## 9       Edisto South Carolina           572             126 Atlantic South
## 10  Georgetown South Carolina            10             223 Atlantic South
## ..         ...            ...           ...             ...            ...

Then we can plot the results. As we suspected, there is a huge regional variation.

ggplot(methodists_region, aes(x = members_white, y = members_colored)) +
  geom_point(shape = 1) +
  facet_wrap(~ region)

  1. Beginning in 1802, the Methodist minutes no longer kept track of the state in which a meeting was held. Instead it classified them into different conferences. For the years 1802 and after, create a look up table assigning the different conferences to regions of your choice, and join that table back to the Methodists data.
methodists_1802 <- methodists %>% 
  filter(minutes_year >= 1802)

conference_regions = data_frame( 
  regional = c("Western", "MidAtlantic", "Atlantic South", "MidAtlantic","MidAtlantic","Northeast", "Northeast", "Northeast", "Western", "South","South", "South", "South", "Canada", "South", "Northeast", "MidAtlantic", "Western", "Northeast", "Northeast", "Northeast", "NA"),

conferences = c("Western", "Baltimore", "South Carolina", "Virginia", "Philadelphia", "New York", "New England", "Genesee", "Ohio", "Tennessee", "Missouri", "Mississippi", "Kentucky", "Canada", "Holston", "Maine", "Pittsburg", "Illinios", "Genessee", "Oneida", "New Hampshire and Vermont", "NA")
)

regional_conference <- methodists_1802 %>% 
  left_join(conference_regions, by = c("conference" = "conferences"))
  1. Can you summarize the racial composition of the different regions by year (i.e., a region had \(x\) percent white and black members for a given year) and create a plot of the changing racial composition in each region over time?
regional_conference %>% 
  mutate(avg_colored = members_colored/(members_colored + members_white), avg_white = members_white/(members_white + members_colored)) %>% 
  group_by(minutes_year, regional) %>% 
  summarize(perc_colored = sum(avg_colored, na.rm = TRUE)/n(), perc_white = sum(avg_white, na.rm = TRUE)/n()) %>% 
  ggplot(aes(x=minutes_year)) +
  geom_line(aes(y = perc_colored, color = regional)) +
  geom_line(aes(y = perc_white, color = regional))

  1. In the europop package there are two data frames, europop with the historical populations of European cities, and city_coords which has the latitudes and longitudes of those cities. Load that package and join the two tables together. Can you get the populations of cities north of 48° of latitude?
library(europop)

europop_coords<- europop %>% 
  left_join(city_coords, by = c("city" = "city"))

europop_coords %>% 
  filter(lat>48)
## Source: local data frame [1,414 x 6]
## 
##          city            region  year population       lon      lat
##         (chr)             (chr) (int)      (int)     (dbl)    (dbl)
## 1      BERGEN       Scandinavia  1500          0  5.330000 60.38944
## 2  COPENHAGEN       Scandinavia  1500         NA 12.565530 55.67594
## 3    GOTEBORG       Scandinavia  1500          0 11.966790 57.70716
## 4  KARLSKRONA       Scandinavia  1500          0 15.586610 56.16156
## 5        OSLO       Scandinavia  1500          0 10.746090 59.91273
## 6   STOCKHOLM       Scandinavia  1500          0 18.064900 59.33258
## 7        BATH England and Wales  1500          0 -2.359070 51.37795
## 8  BIRMINGHAM England and Wales  1500          0 -1.898073 52.48137
## 9   BLACKBURN England and Wales  1500          0 -2.483330 53.75000
## 10     BOLTON England and Wales  1500          0 -2.428887 53.57769
## ..        ...               ...   ...        ...       ...      ...
  1. In the historydata package there are two tables, judges_people and judges_appointments. Join them together. What are the names of black judges who were appointed to the Supreme Court?
judges_total <- judges_people %>% 
  left_join(judges_appointments, by = c("judge_id" = "judge_id"))
judges_total %>% 
  filter(race=="African American", court_type=="USSC") %>% 
  select(judge_id,starts_with("name_"), commission_date, gender, race)
## Source: local data frame [2 x 8]
## 
##   judge_id name_first name_middle name_last name_suffix commission_date
##      (int)      (chr)       (chr)     (chr)       (chr)           (chr)
## 1     1489   Thurgood          NA  Marshall          NA      08/30/1967
## 2     2362   Clarence          NA    Thomas          NA      10/18/1991
## Variables not shown: gender (chr), race (chr)
  1. What courts did those justices serve on before the Supreme Court?
judges_total %>% 
  filter(judge_id==1489) %>% 
  select(starts_with("name_"), court_type, commission_date)
## Source: local data frame [2 x 6]
## 
##   name_first name_middle name_last name_suffix court_type commission_date
##        (chr)       (chr)     (chr)       (chr)      (chr)           (chr)
## 1   Thurgood          NA  Marshall          NA       USCA      09/14/1962
## 2   Thurgood          NA  Marshall          NA       USSC      08/30/1967
judges_total %>% 
  filter( judge_id==2362) %>% 
  select(starts_with("name_"), court_type, commission_date)
## Source: local data frame [2 x 6]
## 
##   name_first name_middle name_last name_suffix court_type commission_date
##        (chr)       (chr)     (chr)       (chr)      (chr)           (chr)
## 1   Clarence          NA    Thomas          NA       USCA      03/06/1990
## 2   Clarence          NA    Thomas          NA       USSC      10/18/1991

Data reshaping (spread() and gather())

It can be helpful to think of tabular data as coming in two forms: wide data, and long data. Let’s load in a table of data. This data contains total membership figures for the Virginia conference of the Methodist Episcopal Church for the years 1812 to 1830.

va_methodists_wide <- read_csv("http://lincolnmullen.com/projects/worksheets/data/va-methodists-wide.csv")
va_methodists_wide
## Source: local data frame [10 x 21]
## 
##    conference    district  1812  1813  1814  1815  1816  1817  1818  1819
##         (chr)       (chr) (int) (int) (int) (int) (int) (int) (int) (int)
## 1    Virginia James River  5348  4691  4520  4209  4118  3888  3713  3580
## 2    Virginia    Meherren  4882  4486  4771  4687  4702    NA    NA    NA
## 3    Virginia    Meherrin    NA    NA    NA    NA    NA  4435  3964  3860
## 4    Virginia       Neuse    NA    NA  3474  3475  3448  2702  3340  4667
## 5    Virginia     Newbern  3511  3558    NA    NA    NA    NA    NA    NA
## 6    Virginia     Norfolk  4686  6196  6127  6001  5661  6495  6471    NA
## 7    Virginia     Raleigh  3822  4018    NA    NA    NA    NA    NA    NA
## 8    Virginia     Roanoke    NA    NA    NA    NA  3049    NA  1507    NA
## 9    Virginia   Tar River    NA    NA  3834  3466    NA    NA    NA    NA
## 10   Virginia      Yadkin  3174  3216  3528  3323  3374  3323  4689  4547
## Variables not shown: 1820 (int), 1821 (int), 1822 (int), 1823 (int), 1824
##   (int), 1825 (int), 1826 (int), 1827 (int), 1828 (int), 1829 (int), 1830
##   (int)

The first thing we can notice about this data frame is that it is very wide because it has a column for each of the years. The data is also suitable for reading because it like a table in a publication. We can read from left to right and see when certain districts begin and end and get the values for each year. The difficulties of computing on or plotting the data will also become quickly apparent. How would you make a plot of the change over time in the number of members in each district? Or how would you filter by year, or summarize by year? For that matter, what do the numbers in the table represent, since they are not given an explicit variable name?

The problem with the table is that it is not tidy data, because the variables are not in columns and observations in rows. One of the variables is the year, but its values are in the column headers. And another of the variables is total membership, but its values are spread across rows and columns and it is not explicitly named.

The gather() function from the tidyr package lets us turn wide data into long data. We need to tell the function two kinds of information. First we need to tell it the name of the column to create from the column headers and the name of the implicit variable in the rows. In the example below, we create two new columns minutes_year and total_membership. Then we also have to tell the function if there are any columns which should remain unchanged. In this case, the conference and district variables should remain the same, so we remove them from the gathering using the same syntax as the select() function.

va_methodists_wide %>% 
  gather(minutes_year, total_membership, -conference, -district)
## Source: local data frame [190 x 4]
## 
##    conference    district minutes_year total_membership
##         (chr)       (chr)       (fctr)            (int)
## 1    Virginia James River         1812             5348
## 2    Virginia    Meherren         1812             4882
## 3    Virginia    Meherrin         1812               NA
## 4    Virginia       Neuse         1812               NA
## 5    Virginia     Newbern         1812             3511
## 6    Virginia     Norfolk         1812             4686
## 7    Virginia     Raleigh         1812             3822
## 8    Virginia     Roanoke         1812               NA
## 9    Virginia   Tar River         1812               NA
## 10   Virginia      Yadkin         1812             3174
## ..        ...         ...          ...              ...

We can see the results above. There are two ways that this result is not quite what we want. Because the years were column headers they are treated as character vectors rather than integers. We can manually convert them in a later step, but we can also let gather() do the right thing with the convert = argument. Then we have a lot of NA values which were explicit in the wide table but which can be removed from the long table with na.rm =.

va_methodists_long <- va_methodists_wide %>% 
  gather(minutes_year, total_membership, -conference, -district, 
         convert = TRUE, na.rm = TRUE)

va_methodists_long
## Source: local data frame [100 x 4]
## 
##    conference    district minutes_year total_membership
##         (chr)       (chr)        (int)            (int)
## 1    Virginia James River         1812             5348
## 2    Virginia    Meherren         1812             4882
## 3    Virginia     Newbern         1812             3511
## 4    Virginia     Norfolk         1812             4686
## 5    Virginia     Raleigh         1812             3822
## 6    Virginia      Yadkin         1812             3174
## 7    Virginia James River         1813             4691
## 8    Virginia    Meherren         1813             4486
## 9    Virginia     Newbern         1813             3558
## 10   Virginia     Norfolk         1813             6196
## ..        ...         ...          ...              ...

Notice that now we can use the data in ggplot2 without any problem.

ggplot(va_methodists_long, 
       aes(x = minutes_year, y = total_membership, color = district)) +
  geom_line() +
  ggtitle("Membership of districts in the Virginia conference")