In an earlier worksheet, you learned the basic data manipulation verbs from the dplyr package: select(), filter(), mutate(), arrange(), group_by(), and summarize(). In this worksheet you will learn additional data verbs from the dplyr and tidyr packages. These data verbs relate to window functions (lead() and lag()), data table joins (left_join() et al.), and data reshaping (spread() and gather()).
To begin, we will load the necessary packages, as well as the Methodist data.
library(dplyr)
library(tidyr)
library(readr)
library(historydata)
library(ggplot2)
methodists <- read_csv("http://lincolnmullen.com/projects/worksheets/data/methodists.csv")
Data table joins (left_join() et al.)
It is often the case that we want to use some variable in our data to create a new variable. Consider the Methodist data for the year 1800. Perhaps we are interested in the racial composition of the churches. Do they tend to be all white and all black, or do some churches have both white and black members in varying proportions? The simplest way to get a look at that question is to create a scatter plot of the figures for white and black membership.
methodists_1800 <- methodists %>%
filter(minutes_year == 1800) %>%
select(meeting, state, members_white, members_colored)
ggplot(methodists_1800, aes(x = members_white, y = members_colored)) +
geom_point(shape = 1)
That scatterplot is interesting as far as it goes, but we might reasonably suspect that the racial composition of Methodist meetings varies by region. We could use the state variable to facet the plot by state. However, this has two problems. The first is that there are 20 states represented in that year.
methodists_1800$state %>% unique() %>% length()
## [1] 20
Our faceted plot would have 20 panels, which is too many. More importantly, by looking at individual states we might be taking too fine-grained a look at the data. We have good reason to think that regions matter more than states.
It is easy enough to describe what we would do to translate states into a new column with regions. We would look at each state name and assign it to a region. Connecticut would be in the Northeast, New York would be in the Mid-Atlantic, and so on. So far when we have created new columns in a data frame we have done so with mutate().[1] Another way to think of this problem, though, is to think of looking up the state names in a table where they are associated with regions. We can create such a data frame with the code below. In many cases, though, it would make more sense to create a CSV file with the data and read it in as a data frame.
regions <- data_frame(
states = c("Connecticut", "Delaware", "Georgia", "Kentucky", "Maine",
"Maryland", "Massachusetts", "Mississippi", "New Hampshire",
"New Jersey", "New York", "North Carolina",
"Northwestern Territory", "Pennsylvania", "Rhode Island",
"South Carolina", "Tennessee", "Upper Canada", "Vermont",
"Virginia"),
region = c("Northeast", "Atlantic South", "Atlantic South", "West",
"Northeast", "Atlantic South", "Northeast", "Deep South",
"Northeast", "Mid-Atlantic", "Mid-Atlantic", "Atlantic South",
"West", "Mid-Atlantic", "Northeast", "Atlantic South", "West",
"Canada", "Northeast", "Atlantic South")
)
And now we can inspect the table.
regions
## Source: local data frame [20 x 2]
##
## states region
## (chr) (chr)
## 1 Connecticut Northeast
## 2 Delaware Atlantic South
## 3 Georgia Atlantic South
## 4 Kentucky West
## 5 Maine Northeast
## 6 Maryland Atlantic South
## 7 Massachusetts Northeast
## 8 Mississippi Deep South
## 9 New Hampshire Northeast
## 10 New Jersey Mid-Atlantic
## 11 New York Mid-Atlantic
## 12 North Carolina Atlantic South
## 13 Northwestern Territory West
## 14 Pennsylvania Mid-Atlantic
## 15 Rhode Island Northeast
## 16 South Carolina Atlantic South
## 17 Tennessee West
## 18 Upper Canada Canada
## 19 Vermont Northeast
## 20 Virginia Atlantic South
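As an aside, the function-based alternative (translating state names into regions inside mutate()) might look something like the sketch below. The helper name state_to_region, the three-state lookup, and the example rows are all assumptions made for illustration; they are not part of the worksheet data.

```r
library(dplyr)

# Illustrative subset of the lookup; a real version would cover every state.
region_lookup <- c(
  "Connecticut" = "Northeast",
  "New York"    = "Mid-Atlantic",
  "Virginia"    = "Atlantic South"
)

# Hypothetical helper: returns the region for each state name,
# or NA when a state is missing from the lookup.
state_to_region <- function(state) {
  unname(region_lookup[state])
}

# Made-up rows, only to show the helper inside mutate()
example_meetings <- tibble(
  meeting = c("A", "B", "C"),
  state   = c("Virginia", "New York", "Ohio")
)

example_with_region <- example_meetings %>%
  mutate(region = state_to_region(state))
```

A lookup table joined with left_join() is usually easier to maintain than a function like this, because the table can be edited or read from a CSV file without touching the code.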
We can do a lookup in which we take the state column in the methodists_1800 data frame and match it with the states column in our regions data frame. The result will be a new column, region. Notice how we use the by = argument to specify which column in the left-hand table matches which column in the right-hand table.
methodists_region <- methodists_1800 %>%
left_join(regions, by = c("state" = "states"))
methodists_region
## Source: local data frame [169 x 5]
##
## meeting state members_white members_colored region
## (chr) (chr) (int) (int) (chr)
## 1 Augusta Georgia 61 9 Atlantic South
## 2 Burke Georgia 297 36 Atlantic South
## 3 Richmond Georgia 548 115 Atlantic South
## 4 Washington Georgia 497 92 Atlantic South
## 5 Broad River South Carolina 604 62 Atlantic South
## 6 Bush River South Carolina 328 31 Atlantic South
## 7 Charleston South Carolina 60 440 Atlantic South
## 8 Cherokee South Carolina 79 0 Atlantic South
## 9 Edisto South Carolina 572 126 Atlantic South
## 10 Georgetown South Carolina 10 223 Atlantic South
## .. ... ... ... ... ...
Then we can plot the results. As we suspected, there is huge regional variation.
ggplot(methodists_region, aes(x = members_white, y = members_colored)) +
geom_point(shape = 1) +
facet_wrap(~ region)
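When a key in the left-hand table has no match in the right-hand table, left_join() keeps the row and fills the new columns with NA. The sketch below uses small made-up tables (not the Methodist data) to show that behavior, along with anti_join() as one way to list the unmatched rows.

```r
library(dplyr)

# Made-up tables; the key columns deliberately have different names
meetings <- tibble(
  meeting = c("A", "B", "C"),
  state   = c("Georgia", "Vermont", "Ohio")
)
lookup <- tibble(
  states = c("Georgia", "Vermont"),
  region = c("Atlantic South", "Northeast")
)

# left_join() keeps every row of the left-hand table;
# "Ohio" has no match, so its region is NA
joined <- meetings %>%
  left_join(lookup, by = c("state" = "states"))

# anti_join() returns the left-hand rows that have no match
unmatched <- meetings %>%
  anti_join(lookup, by = c("state" = "states"))
```

Checking the unmatched rows after a join is a quick way to catch spelling differences between the two tables.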
methodists_1802 <- methodists %>%
filter(minutes_year >= 1802) %>%
select(minutes_year, conference, district, meeting, state, starts_with("members_"))
conference_regions <- data_frame(
conference = c("Western", "Baltimore", "South Carolina", "Virginia", "Philadelphia", "New York", "New England", "Genesee", "Ohio", "Tennessee", "Missouri", "Mississippi", "Kentucky", "Canada", "Holston", "Maine", "Pittsburg", "Illinois", "Oneida", "New Hampshire and Vermont"),
regions_conference = c("West", "Midatlantic", "South", "South", "Rust Belt", "North", "Northeast", "North", "Rust Belt", "South", "Midwest", "South", "North", "North of US", "South", "Northeast", "Rust Belt", "Rust Belt", "North", "North")
)
name_variable <- methodists_1802 %>%
left_join(conference_regions, by = c("conference" = "conference"))
name_variable
## Source: local data frame [13,672 x 10]
##
## minutes_year conference district meeting state
## (int) (chr) (chr) (chr) (chr)
## 1 1802 Western Kentucky Scioto NA
## 2 1802 Baltimore Pittsburg Muskingum NA
## 3 1802 Western Kentucky Miami NA
## 4 1802 Baltimore Pittsburg West Wheeling NA
## 5 1802 Western Kentucky Limestone NA
## 6 1802 Western Kentucky Hinkstone NA
## 7 1802 Western Kentucky Lexington NA
## 8 1802 Western Kentucky Salt River and Shelby NA
## 9 1802 Western Kentucky Danville NA
## 10 1802 Western Kentucky Cumberland NA
## .. ... ... ... ... ...
## Variables not shown: members_general (int), members_white (int),
## members_colored (int), members_indian (chr), regions_conference (chr)
name_variable_2 <- name_variable %>%
mutate(members_total = members_colored + members_white) %>%
mutate(perc_black = members_colored / members_total) %>%
mutate(perc_white = members_white / members_total) %>%
group_by(minutes_year, regions_conference) %>%
summarize(black_average = mean(perc_black, na.rm = TRUE), white_average = mean(perc_white, na.rm = TRUE))
ggplot(name_variable_2, aes(x = minutes_year, color = regions_conference)) +
geom_line(aes(y = black_average)) +
geom_line(aes(y = white_average))
I know that this isn’t the best (or even a correct) way to display the data, but I got stumped after trying to use facet_wrap() by regions_conference for a while. I was trying this code:
ggplot(name_variable_2, aes(x = minutes_year)) +
geom_line(aes(y = black_average)) +
geom_line(aes(y = white_average)) +
facet_wrap(~ regions_conference)
And I would get this error: "Error in layout_base(data, vars, drop = drop) : At least one layer must contain all variables used for facetting." I went to Google and was unable to make heads or tails of why it didn’t want to facet wrap.
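The sketch below does not diagnose that error. It shows one common way to draw both series in each facet: reshape the two average columns into long form with gather() and map the new column to color. The data values are made up, and the names averages, series, and average are assumptions; only the column names minutes_year, regions_conference, black_average, and white_average come from the code above.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Made-up summary with the same columns as name_variable_2
averages <- tibble(
  minutes_year       = c(1802, 1803, 1802, 1803),
  regions_conference = c("South", "South", "North", "North"),
  black_average      = c(0.30, 0.32, 0.05, 0.06),
  white_average      = c(0.70, 0.68, 0.95, 0.94)
)

# Gather the two average columns into a key column (series)
# and a value column (average)
averages_long <- averages %>%
  gather(series, average, black_average, white_average)

# One line per series, one panel per region
p <- ggplot(averages_long,
            aes(x = minutes_year, y = average, color = series)) +
  geom_line() +
  facet_wrap(~ regions_conference)
```

Because the facetting variable is a column of the single data frame passed to ggplot(), every layer has access to it.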
europop with the historical populations of European cities, and city_coords which has the latitudes and longitudes of those cities. Load that package and join the two tables together. Can you get the populations of cities north of 48° of latitude?
library(europop)
city_europop <- europop %>%
left_join(city_coords) %>%
filter(lat > 48)
## Joining by: "city"
city_europop
## Source: local data frame [1,414 x 6]
##
## city region year population lon lat
## (chr) (chr) (int) (int) (dbl) (dbl)
## 1 BERGEN Scandinavia 1500 0 5.330000 60.38944
## 2 COPENHAGEN Scandinavia 1500 NA 12.565530 55.67594
## 3 GOTEBORG Scandinavia 1500 0 11.966790 57.70716
## 4 KARLSKRONA Scandinavia 1500 0 15.586610 56.16156
## 5 OSLO Scandinavia 1500 0 10.746090 59.91273
## 6 STOCKHOLM Scandinavia 1500 0 18.064900 59.33258
## 7 BATH England and Wales 1500 0 -2.359070 51.37795
## 8 BIRMINGHAM England and Wales 1500 0 -1.898073 52.48137
## 9 BLACKBURN England and Wales 1500 0 -2.483330 53.75000
## 10 BOLTON England and Wales 1500 0 -2.428887 53.57769
## .. ... ... ... ... ... ...
judges_people and judges_appointments. Join them together. What are the names of black judges who were appointed to the Supreme Court?
judges_manip <- judges_people %>%
left_join(judges_appointments)
## Joining by: "judge_id"
judges_manip %>%
filter(race == "African American", court_type == "USSC") %>%
select(race, court_type, judge_id, starts_with("name_"))
## Source: local data frame [2 x 7]
##
## race court_type judge_id name_first name_middle name_last
## (chr) (chr) (int) (chr) (chr) (chr)
## 1 African American USSC 1489 Thurgood NA Marshall
## 2 African American USSC 2362 Clarence NA Thomas
## Variables not shown: name_suffix (chr)
judges_manip %>%
filter(judge_id == 1489) %>%
select(race, court_type, judge_id, starts_with("name"))
## Source: local data frame [2 x 7]
##
## race court_type judge_id name_first name_middle name_last
## (chr) (chr) (int) (chr) (chr) (chr)
## 1 African American USCA 1489 Thurgood NA Marshall
## 2 African American USSC 1489 Thurgood NA Marshall
## Variables not shown: name_suffix (chr)
judges_manip %>%
filter(judge_id == 2362) %>%
select(race, court_type, judge_id, starts_with("name"))
## Source: local data frame [2 x 7]
##
## race court_type judge_id name_first name_middle name_last
## (chr) (chr) (int) (chr) (chr) (chr)
## 1 African American USCA 2362 Clarence NA Thomas
## 2 African American USSC 2362 Clarence NA Thomas
## Variables not shown: name_suffix (chr)
I do not think this is the most elegant way to find the answer.
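One possibly tidier approach is to join, filter, and then keep one row per judge with distinct(). The sketch below uses made-up stand-in tables (people and appointments, with invented names) that borrow only the column names judge_id, name_last, race, and court_type from the output above.

```r
library(dplyr)

# Made-up stand-ins for judges_people and judges_appointments
people <- tibble(
  judge_id  = c(1, 2, 3),
  name_last = c("Alpha", "Beta", "Gamma"),
  race      = c("African American", "White", "African American")
)
appointments <- tibble(
  judge_id   = c(1, 1, 2, 3),
  court_type = c("USCA", "USSC", "USSC", "USDC")
)

# Join, keep only the matching appointments, then collapse
# to one row per judge
ussc_judges <- people %>%
  inner_join(appointments, by = "judge_id") %>%
  filter(race == "African American", court_type == "USSC") %>%
  distinct(judge_id, name_last)
```

Since a judge can have several appointments, distinct() guards against the same name appearing more than once in the answer.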
Data reshaping (spread() and gather())
It can be helpful to think of tabular data as coming in two forms: wide data, and long data. Let’s load in a table of data. This data contains total membership figures for the Virginia conference of the Methodist Episcopal Church for the years 1812 to 1830.
va_methodists_wide <- read_csv("http://lincolnmullen.com/projects/worksheets/data/va-methodists-wide.csv")
va_methodists_wide
## Source: local data frame [10 x 21]
##
## conference district 1812 1813 1814 1815 1816 1817 1818 1819
## (chr) (chr) (int) (int) (int) (int) (int) (int) (int) (int)
## 1 Virginia James River 5348 4691 4520 4209 4118 3888 3713 3580
## 2 Virginia Meherren 4882 4486 4771 4687 4702 NA NA NA
## 3 Virginia Meherrin NA NA NA NA NA 4435 3964 3860
## 4 Virginia Neuse NA NA 3474 3475 3448 2702 3340 4667
## 5 Virginia Newbern 3511 3558 NA NA NA NA NA NA
## 6 Virginia Norfolk 4686 6196 6127 6001 5661 6495 6471 NA
## 7 Virginia Raleigh 3822 4018 NA NA NA NA NA NA
## 8 Virginia Roanoke NA NA NA NA 3049 NA 1507 NA
## 9 Virginia Tar River NA NA 3834 3466 NA NA NA NA
## 10 Virginia Yadkin 3174 3216 3528 3323 3374 3323 4689 4547
## Variables not shown: 1820 (int), 1821 (int), 1822 (int), 1823 (int), 1824
## (int), 1825 (int), 1826 (int), 1827 (int), 1828 (int), 1829 (int), 1830
## (int)
The first thing we can notice about this data frame is that it is very wide because it has a column for each of the years. The data is also suitable for reading because it is like a table in a publication. We can read from left to right and see when certain districts begin and end and get the values for each year. The difficulties of computing on or plotting the data will also become quickly apparent. How would you make a plot of the change over time in the number of members in each district? Or how would you filter by year, or summarize by year? For that matter, what do the numbers in the table represent, since they are not given an explicit variable name?
The problem with the table is that it is not tidy data, because the variables are not in columns and the observations are not in rows. One of the variables is the year, but its values are in the column headers. And another of the variables is total membership, but its values are spread across rows and columns and it is not explicitly named.
The gather() function from the tidyr package lets us turn wide data into long data. We need to tell the function two kinds of information. First we need to tell it the name of the column to create from the column headers and the name of the implicit variable in the rows. In the example below, we create two new columns, minutes_year and total_membership. Then we also have to tell the function if there are any columns which should remain unchanged. In this case, the conference and district variables should remain the same, so we remove them from the gathering using the same syntax as the select() function.
va_methodists_wide %>%
gather(minutes_year, total_membership, -conference, -district)
## Source: local data frame [190 x 4]
##
## conference district minutes_year total_membership
## (chr) (chr) (fctr) (int)
## 1 Virginia James River 1812 5348
## 2 Virginia Meherren 1812 4882
## 3 Virginia Meherrin 1812 NA
## 4 Virginia Neuse 1812 NA
## 5 Virginia Newbern 1812 3511
## 6 Virginia Norfolk 1812 4686
## 7 Virginia Raleigh 1812 3822
## 8 Virginia Roanoke 1812 NA
## 9 Virginia Tar River 1812 NA
## 10 Virginia Yadkin 1812 3174
## .. ... ... ... ...
We can see the results above. There are two ways that this result is not quite what we want. First, because the years were column headers, they are treated as text (a factor in the output above) rather than as integers. We can manually convert them in a later step, but we can also let gather() do the right thing with the convert = argument. Second, we have a lot of NA values which were explicit in the wide table but which can be removed from the long table with na.rm =.
va_methodists_long <- va_methodists_wide %>%
gather(minutes_year, total_membership, -conference, -district,
convert = TRUE, na.rm = TRUE)
va_methodists_long
## Source: local data frame [100 x 4]
##
## conference district minutes_year total_membership
## (chr) (chr) (int) (int)
## 1 Virginia James River 1812 5348
## 2 Virginia Meherren 1812 4882
## 3 Virginia Newbern 1812 3511
## 4 Virginia Norfolk 1812 4686
## 5 Virginia Raleigh 1812 3822
## 6 Virginia Yadkin 1812 3174
## 7 Virginia James River 1813 4691
## 8 Virginia Meherren 1813 4486
## 9 Virginia Newbern 1813 3558
## 10 Virginia Norfolk 1813 6196
## .. ... ... ... ...
Notice that now we can use the data in ggplot2 without any problem.
ggplot(va_methodists_long,
aes(x = minutes_year, y = total_membership, color = district)) +
geom_line() +
ggtitle("Membership of districts in the Virginia conference")
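The questions raised earlier about filtering or summarizing by year also become straightforward once the data is long. The sketch below uses a small made-up data frame (long_example) with the same column names as va_methodists_long; the district names and figures are invented for illustration.

```r
library(dplyr)

# Made-up long-format rows with the same columns as va_methodists_long
long_example <- tibble(
  district         = c("A", "B", "A", "B"),
  minutes_year     = c(1812L, 1812L, 1813L, 1813L),
  total_membership = c(100L, 200L, 110L, 190L)
)

# Filtering by year is an ordinary filter() on the year column
only_1813 <- long_example %>%
  filter(minutes_year == 1813)

# Summarizing by year is an ordinary group_by() plus summarize()
by_year <- long_example %>%
  group_by(minutes_year) %>%
  summarize(total = sum(total_membership))
```

Neither operation has an obvious equivalent while the years are still column headers.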
The inverse operation of gather() is spread(). With spread() we specify the name of the column which should become the new column headers (in this case minutes_year), and then the name of the column to fill in underneath those new column headers (in this case, total_membership). We can see the results below.
va_methodists_wide2 <- va_methodists_long %>%
spread(minutes_year, total_membership)
va_methodists_wide2
## Source: local data frame [10 x 21]
##
## conference district 1812 1813 1814 1815 1816 1817 1818 1819
## (chr) (chr) (int) (int) (int) (int) (int) (int) (int) (int)
## 1 Virginia James River 5348 4691 4520 4209 4118 3888 3713 3580
## 2 Virginia Meherren 4882 4486 4771 4687 4702 NA NA NA
## 3 Virginia Meherrin NA NA NA NA NA 4435 3964 3860
## 4 Virginia Neuse NA NA 3474 3475 3448 2702 3340 4667
## 5 Virginia Newbern 3511 3558 NA NA NA NA NA NA
## 6 Virginia Norfolk 4686 6196 6127 6001 5661 6495 6471 NA
## 7 Virginia Raleigh 3822 4018 NA NA NA NA NA NA
## 8 Virginia Roanoke NA NA NA NA 3049 NA 1507 NA
## 9 Virginia Tar River NA NA 3834 3466 NA NA NA NA
## 10 Virginia Yadkin 3174 3216 3528 3323 3374 3323 4689 4547
## Variables not shown: 1820 (int), 1821 (int), 1822 (int), 1823 (int), 1824
## (int), 1825 (int), 1826 (int), 1827 (int), 1828 (int), 1829 (int), 1830
## (int)
Just by looking at the data we can see that we got back to where we started, but we can also verify that programmatically.
identical(va_methodists_wide, va_methodists_wide2)
## [1] TRUE
Turning long data into wide is often useful when you want to create a tabular representation of data. (And once you have a data frame that can be a table, the knitr::kable() function is quite nice.) And some algorithms, such as clustering algorithms, expect wide data rather than tidy data.
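As a small sketch of that workflow, the code below spreads a made-up long data frame into wide form and passes it to knitr::kable(). It assumes the knitr package is installed; the data and the object names are invented for illustration.

```r
library(dplyr)
library(tidyr)

# Made-up long-format data
long_example <- tibble(
  district         = c("A", "A", "B", "B"),
  minutes_year     = c(1812L, 1813L, 1812L, 1813L),
  total_membership = c(100L, 110L, 200L, 190L)
)

# One row per district, one column per year
wide_example <- long_example %>%
  spread(minutes_year, total_membership)

# kable() turns the data frame into a text table for a report
table_text <- knitr::kable(wide_example)
```

Printing table_text in an R Markdown document renders the wide data as a formatted table.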
For the exercise, we will use summary statistics of the number of white and black members in the Methodists by year.
methodists_by_year_race <- methodists %>%
filter(minutes_year >= 1786) %>%
group_by(minutes_year) %>%
summarize(white = sum(members_white, na.rm = TRUE),
black = sum(members_colored, na.rm = TRUE),
indian = sum(as.integer(members_indian), na.rm = TRUE))
methodists_by_year_race
## Source: local data frame [45 x 4]
##
## minutes_year white black indian
## (int) (int) (int) (int)
## 1 1786 18291 2890 0
## 2 1787 21949 3883 0
## 3 1788 30557 7991 0
## 4 1789 34425 8840 0
## 5 1790 45983 11682 0
## 6 1791 50580 13098 0
## 7 1792 52079 13871 0
## 8 1793 51486 14420 0
## 9 1794 52794 13906 0
## 10 1795 48121 12171 0
## .. ... ... ... ...
methodists_by_year_race could be tidier still. While white, black, and indian are variables, it is perhaps better to think of them as two different variables. One variable would be race, containing the racial descriptions that the Methodists used, and another would be members, containing the number of members. Using the gather() function, create that data frame.
methodists_tidy_long <- methodists_by_year_race %>%
gather(race, members, -minutes_year)
methodists_tidy_long
## Source: local data frame [135 x 3]
##
## minutes_year race members
## (int) (fctr) (int)
## 1 1786 white 18291
## 2 1787 white 21949
## 3 1788 white 30557
## 4 1789 white 34425
## 5 1790 white 45983
## 6 1791 white 50580
## 7 1792 white 52079
## 8 1793 white 51486
## 9 1794 white 52794
## 10 1795 white 48121
## .. ... ... ...
race column to the color aesthetic.
ggplot(methodists_tidy_long,
aes(x = minutes_year, y = members, color = race)) +
geom_line() +
ggtitle("Methodist membership over time")
methodists_tidy_wide <- methodists_tidy_long %>%
spread(minutes_year, race)
methodists_tidy_wide
## Source: local data frame [98 x 46]
##
## members 1786 1787 1788 1789 1790 1791 1792 1793 1794
## (int) (fctr) (fctr) (fctr) (fctr) (fctr) (fctr) (fctr) (fctr) (fctr)
## 1 0 indian indian indian indian indian indian indian indian indian
## 2 56 NA NA NA NA NA NA NA NA NA
## 3 112 NA NA NA NA NA NA NA NA NA
## 4 274 NA NA NA NA NA NA NA NA NA
## 5 496 NA NA NA NA NA NA NA NA NA
## 6 633 NA NA NA NA NA NA NA NA NA
## 7 942 NA NA NA NA NA NA NA NA NA
## 8 2890 black NA NA NA NA NA NA NA NA
## 9 3883 NA black NA NA NA NA NA NA NA
## 10 3986 NA NA NA NA NA NA NA NA NA
## .. ... ... ... ... ... ... ... ... ... ...
## Variables not shown: 1795 (fctr), 1796 (fctr), 1797 (fctr), 1798 (fctr),
## 1799 (fctr), 1800 (fctr), 1801 (fctr), 1802 (fctr), 1803 (fctr), 1804
## (fctr), 1805 (fctr), 1806 (fctr), 1807 (fctr), 1808 (fctr), 1809 (fctr),
## 1810 (fctr), 1811 (fctr), 1812 (fctr), 1813 (fctr), 1814 (fctr), 1815
## (fctr), 1816 (fctr), 1817 (fctr), 1818 (fctr), 1819 (fctr), 1820 (fctr),
## 1821 (fctr), 1822 (fctr), 1823 (fctr), 1824 (fctr), 1825 (fctr), 1826
## (fctr), 1827 (fctr), 1828 (fctr), 1829 (fctr), 1830 (fctr)
methodists_tidy_wide_again <- methodists_tidy_long %>%
spread(race, members)
methodists_tidy_wide_again
## Source: local data frame [45 x 4]
##
## minutes_year white black indian
## (int) (int) (int) (int)
## 1 1786 18291 2890 0
## 2 1787 21949 3883 0
## 3 1788 30557 7991 0
## 4 1789 34425 8840 0
## 5 1790 45983 11682 0
## 6 1791 50580 13098 0
## 7 1792 52079 13871 0
## 8 1793 51486 14420 0
## 9 1794 52794 13906 0
## 10 1795 48121 12171 0
## .. ... ... ... ...
There are a number of different kinds of window functions in R. We are going to look at two window functions, lead() and lag(), which help us look for change over time. (For a fuller explanation of window functions, see the related dplyr vignette.)
To understand what a window function does, it is helpful to compare it to a transformation function and an aggregation function. Suppose we have a vector with five numeric values.
original <- c(1.1, 2.2, 3.3, 4.4, 5.5)
A transformation function changes each element in the vector and returns a new value for each. In the example below, we round each element in the vector. We have a different result, but it still has five elements.
round(original, 0)
## [1] 1 2 3 4 6
In an aggregation function, we pass in a vector of numbers and get back a single value. In this case, we get the sum of the numbers.
sum(original)
## [1] 16.5
Aggregation functions work well with summarize(); transformation functions work well with mutate().
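To make that pairing concrete, here is a minimal sketch that puts the same five numbers in a data frame. The data frame name values and the new column names are assumptions for illustration.

```r
library(dplyr)

values <- tibble(original = c(1.1, 2.2, 3.3, 4.4, 5.5))

# Transformation inside mutate(): one output value per row,
# so the result still has five rows
rounded <- values %>%
  mutate(rounded = round(original, 0))

# Aggregation inside summarize(): the five rows collapse to one
total <- values %>%
  summarize(total = sum(original))
```

The row counts are the thing to notice: mutate() preserves them, while summarize() reduces each group to a single row.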
A window function gives back a vector of numbers, but a vector which has fewer useable elements than the original. It is like sliding a window over the vector. Consider the case below.
lead(original)
## [1] 2.2 3.3 4.4 5.5 NA
lag(original)
## [1] NA 1.1 2.2 3.3 4.4
The function lead() returns the next element of a vector in place of the original value. At the end of the vector we get an NA because there are no more elements left. The function lag() does the opposite, giving us the previous element in the vector. In that case, the first element of the returned vector is NA.
The lead() and lag() functions are useful for comparing one value to its previous or successor value. Suppose, for instance, that we have a vector of membership figures for each year. We can calculate the number of new members each year by subtracting the previous value from the current value.
membership <- c(100, 150, 250, 400, 600)
lag(membership)
## [1] NA 100 150 250 400
membership - lag(membership)
## [1] NA 50 100 150 200
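The same idea extends to relative growth and to longer offsets. The sketch below reuses the membership vector; the names growth_rate, two_back, and next_year are made up for illustration. Note that dplyr needs to be loaded, since base R also has a function called lag() that behaves differently.

```r
library(dplyr)  # provides the lag() and lead() used here

membership <- c(100, 150, 250, 400, 600)

# Absolute growth: current value minus previous value
growth <- membership - lag(membership)

# Relative growth: absolute growth as a share of the previous value
growth_rate <- growth / lag(membership)

# The n argument looks further back; here, two positions
two_back <- lag(membership, n = 2)

# lead() looks forward instead of back
next_year <- lead(membership)
```

The first element of growth_rate is NA because there is no earlier value to compare against, and two_back starts with two NA values for the same reason.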
Now that we understand those basics, we can apply that to the Methodist annual minutes data that we worked with in a previous lesson. Let’s start by getting just the membership data from Fairfax, Virginia. We will also calculate the members_general value for the years it is missing, and select only the columns we absolutely need.
fairfax <- methodists %>%
filter(meeting == "Fairfax") %>%
mutate(members_general = ifelse(is.na(members_general),
members_white + members_colored,
members_general)) %>%
select(minutes_year, meeting, starts_with("members"), -members_indian)
fairfax
## Source: local data frame [54 x 5]
##
## minutes_year meeting members_general members_white members_colored
## (int) (chr) (int) (int) (int)
## 1 1775 Fairfax 30 NA NA
## 2 1776 Fairfax 350 NA NA
## 3 1777 Fairfax 330 NA NA
## 4 1779 Fairfax 309 NA NA
## 5 1780 Fairfax 361 NA NA
## 6 1781 Fairfax 301 NA NA
## 7 1782 Fairfax 362 NA NA
## 8 1783 Fairfax 310 NA NA
## 9 1784 Fairfax 317 NA NA
## 10 1786 Fairfax 260 260 0
## .. ... ... ... ... ...
Now that we have the data, we can add a column for the number of new members added each year.
fairfax <- fairfax %>%
mutate(growth = members_general - lag(members_general))
fairfax %>%
filter(growth < 0) %>%
select(minutes_year, growth)
## Source: local data frame [29 x 2]
##
## minutes_year growth
## (int) (int)
## 1 1777 -20
## 2 1779 -21
## 3 1781 -60
## 4 1783 -52
## 5 1786 -57
## 6 1791 -152
## 7 1792 -9
## 8 1793 -219
## 9 1795 -166
## 10 1796 -15
## .. ... ...
ggplot(fairfax,
aes(x = minutes_year, y = growth)) +
geom_line()
## Warning: Removed 1 rows containing missing values (geom_path).
If there was a sudden uptick or downturn in growth, it might mean that the number of members was counted incorrectly, or that whoever input the data into the table did so incorrectly.
plot_fairfax <- fairfax %>%
mutate(growth_black = members_colored - lag(members_colored)) %>%
mutate(growth_white = members_white - lag(members_white)) %>%
filter(growth_black < 0, growth_white < 0) %>%
select(minutes_year, growth_black, growth_white)
ggplot(plot_fairfax, aes(x = minutes_year)) +
geom_line(aes(y = growth_black, color = "Black members")) +
geom_line(aes(y = growth_white, color = "White members")) +
labs(title = "Growth in White and Black Methodist Membership",
x = "Year",
y = "Growth",
color = "Growth of members")
methodists data. Beginning in 1802, the Methodists organized the data by conference, district, and meeting. For 1802 and following, calculate the growth for each conference. Which conferences were growing the most in absolute terms? Which were growing the most in relative terms (i.e., growth percentage)? Do you get a clearer picture by looking at the growth in districts? Feel free to plot the data if you wish and to add explanatory text.
hard_question <- methodists %>%
filter(minutes_year >= 1802) %>%
mutate(members_general = ifelse(is.na(members_general),
members_white + members_colored,
members_general)) %>%
mutate(members_growth = members_general - lag(members_general)) %>%
group_by(conference) %>%
arrange(desc(members_growth))
hard_question
## Source: local data frame [13,672 x 11]
## Groups: conference [21]
##
## minutes_year conference district meeting state
## (int) (chr) (chr) (chr) (chr)
## 1 1830 Baltimore Baltimore Baltimore NA
## 2 1829 Baltimore Baltimore Baltimore NA
## 3 1828 Baltimore Baltimore Baltimore NA
## 4 1827 Baltimore Baltimore Baltimore city NA
## 5 1826 Baltimore Baltimore Baltimore station NA
## 6 1825 Baltimore Baltimore Baltimore city NA
## 7 1823 Baltimore Baltimore Baltimore city NA
## 8 1822 Baltimore Baltimore Baltimore city NA
## 9 1824 Baltimore Baltimore Baltimore city NA
## 10 1817 Baltimore Baltimore Baltimore city NA
## .. ... ... ... ... ...
## Variables not shown: members_general (int), members_white (int),
## members_colored (int), members_indian (chr), url (chr), members_growth
## (int)
ggplot(hard_question, aes(x = minutes_year, y = members_growth)) +
geom_point(shape = .5) +
facet_wrap(~ conference)
## Warning: Removed 130 rows containing missing values (geom_point).
percentage_growth <- methodists %>%
filter(minutes_year >= 1802) %>%
mutate(members_general = ifelse(is.na(members_general),
members_white + members_colored,
members_general)) %>%
mutate(members_percentage = members_general / (members_general - lag(members_general))) %>%
group_by(conference) %>%
arrange(desc(members_percentage))
percentage_growth
## Source: local data frame [13,672 x 11]
## Groups: conference [21]
##
## minutes_year conference district meeting state
## (int) (chr) (chr) (chr) (chr)
## 1 1806 Baltimore Susquehannah Northumberland NA
## 2 1821 Baltimore Carlisle Bedford NA
## 3 1823 Baltimore Baltimore Baltimore cir. NA
## 4 1802 Baltimore Baltimore Prince George's NA
## 5 1820 Baltimore Potomac Loudoun NA
## 6 1811 Baltimore Monongahela Hartford NA
## 7 1825 Baltimore Winchester Rockingham NA
## 8 1825 Baltimore Baltimore Baltimore cir. NA
## 9 1830 Baltimore Potomac Leesburg NA
## 10 1816 Baltimore Carlisle Huntingdon and Mishannon NA
## .. ... ... ... ... ...
## Variables not shown: members_general (int), members_white (int),
## members_colored (int), members_indian (chr), url (chr),
## members_percentage (dbl)
ggplot(percentage_growth, aes(x = minutes_year, y = members_percentage)) +
geom_point(shape = .5) +
facet_wrap(~ conference)
## Warning: Removed 130 rows containing missing values (geom_point).
I was only able to get this far.
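One possible continuation, offered as a sketch rather than a definitive answer: in the attempts above, lag() runs before group_by(), so each row is compared with whatever row happens to precede it, regardless of conference. An alternative is to total the members per conference and year first, and then apply lag() within each conference. The data below is made up, and the names conference_growth, members, growth, and growth_pct are assumptions.

```r
library(dplyr)

# Made-up meeting-level rows with the relevant columns
meetings <- tibble(
  minutes_year    = c(1802, 1802, 1803, 1803, 1802, 1803),
  conference      = c("X", "X", "X", "X", "Y", "Y"),
  members_general = c(100, 200, 150, 250, 50, 100)
)

conference_growth <- meetings %>%
  # Total membership per conference per year
  group_by(conference, minutes_year) %>%
  summarize(members = sum(members_general, na.rm = TRUE)) %>%
  ungroup() %>%
  # Sort so that lag() sees the years in order within each conference
  arrange(conference, minutes_year) %>%
  group_by(conference) %>%
  mutate(growth     = members - lag(members),
         growth_pct = growth / lag(members)) %>%
  ungroup()
```

Sorting conference_growth by growth or by growth_pct would then rank the conferences in absolute or relative terms. Swapping conference for district in the grouping gives the district-level view.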
[1] And indeed, we could write a function that translates state names into regions.