In an earlier worksheet, you learned the basic data manipulation verbs from the dplyr package: select()
, filter()
, mutate()
, arrange()
, group_by()
, and summarize()
. In this worksheet you will learn additional data verbs from the dplyr and tidyr packages. These data verbs relate to window functions (lead()
and lag()
), data table joins (left_join()
et al.), and data reshaping (spread()
and gather()
)
To begin, we will load the necessary packages, as well as the Methodist data
library(dplyr)
library(tidyr)
library(readr)
library(historydata)
library(ggplot2)
methodists <- read_csv("http://lincolnmullen.com/projects/worksheets/data/methodists.csv")
left_join()
et al.)It is often the case that we want to use some variable in our data to create a new variable. Consider the Methodist data for the year 1800. Perhaps we are interested in the racial composition of the churches. Do they tend to be all white and all black, or do some churches have both white and black members in varying proportions? The simplest way to get a look at that question is to create a scatter plot of the figures for white and black membership.
methodists_1800 <- methodists %>%
filter(minutes_year == 1800) %>%
select(meeting, state, members_white, members_colored)
ggplot(methodists_1800, aes(x = members_white, y = members_colored)) +
geom_point(shape = 1)
That scatterplot is interesting as far as it goes, but we might reasonably suspect that the racial composition of methodist meetings varies by region. We could use the state
variable to facet the plot by state. However, this has two problems. There are 20 states represented in that year.
methodists_1800$state %>% unique() %>% length()
## [1] 20
Our faceted plot would have 20 panels, which is too many. But more important, by looking at individual states we might be getting too fine grained a look at the data. We have good reason to think that it is regions that matter more than states.
It is easy enough to describe what we would do to translate states into a new column with regions. We would look at each state name and assign it to a region. Connecticut would be in the Northeast, New York would be in the Mid-Atlantic, and so on. So far when we have created new columns in a data frame we have done so with mutate()
.1 Another way to think of this problem, though, is to think of looking up the state names in a table where they associated with regions. We can create such a data frame with the code below. In many cases, though, it would make more sense to create a CSV file with the data and read it in as a data frame.
regions <- data_frame(
states = c("Connecticut", "Delaware", "Georgia", "Kentucky", "Maine",
"Maryland", "Massachusetts", "Mississippi", "New Hampshire",
"New Jersey", "New York", "North Carolina",
"Northwestern Territory", "Pennsylvania", "Rhode Island",
"South Carolina", "Tennessee", "Upper Canada", "Vermont",
"Virginia"),
region = c("Northeast", "Atlantic South", "Atlantic South", "West",
"Northeast", "Atlantic South", "Northeast", "Deep South",
"Northeast", "Mid-Atlantic", "Mid-Atlantic", "Atlantic South",
"West", "Mid-Atlantic", "Northeast", "Atlantic South", "West",
"Canada", "Northeast", "Atlantic South")
)
And now we can inspect the table.
regions
## Source: local data frame [20 x 2]
##
## states region
## (chr) (chr)
## 1 Connecticut Northeast
## 2 Delaware Atlantic South
## 3 Georgia Atlantic South
## 4 Kentucky West
## 5 Maine Northeast
## 6 Maryland Atlantic South
## 7 Massachusetts Northeast
## 8 Mississippi Deep South
## 9 New Hampshire Northeast
## 10 New Jersey Mid-Atlantic
## 11 New York Mid-Atlantic
## 12 North Carolina Atlantic South
## 13 Northwestern Territory West
## 14 Pennsylvania Mid-Atlantic
## 15 Rhode Island Northeast
## 16 South Carolina Atlantic South
## 17 Tennessee West
## 18 Upper Canada Canada
## 19 Vermont Northeast
## 20 Virginia Atlantic South
We can do a look up where we take the state
column in the methodists_1800
data frame and associate it with the states
column in our regions
data frame. The result will be a new column region
. Notice how we use the by =
argument to specify which column in the left hand table matches which column in the right hand table.
methodists_region <- methodists_1800 %>%
left_join(regions, by = c("state" = "states"))
methodists_region
## Source: local data frame [169 x 5]
##
## meeting state members_white members_colored region
## (chr) (chr) (int) (int) (chr)
## 1 Augusta Georgia 61 9 Atlantic South
## 2 Burke Georgia 297 36 Atlantic South
## 3 Richmond Georgia 548 115 Atlantic South
## 4 Washington Georgia 497 92 Atlantic South
## 5 Broad River South Carolina 604 62 Atlantic South
## 6 Bush River South Carolina 328 31 Atlantic South
## 7 Charleston South Carolina 60 440 Atlantic South
## 8 Cherokee South Carolina 79 0 Atlantic South
## 9 Edisto South Carolina 572 126 Atlantic South
## 10 Georgetown South Carolina 10 223 Atlantic South
## .. ... ... ... ... ...
Then we can plot the results. As we suspected, there is a huge regional variation.
ggplot(methodists_region, aes(x = members_white, y = members_colored)) +
geom_point(shape = 1) +
facet_wrap(~ region)
methodists_1802 <- methodists %>%
filter(minutes_year >= 1802)
conference_regions = data_frame(
regional = c("Western", "MidAtlantic", "Atlantic South", "MidAtlantic","MidAtlantic","Northeast", "Northeast", "Northeast", "Western", "South","South", "South", "South", "Canada", "South", "Northeast", "MidAtlantic", "Western", "Northeast", "Northeast", "Northeast", "NA"),
conferences = c("Western", "Baltimore", "South Carolina", "Virginia", "Philadelphia", "New York", "New England", "Genesee", "Ohio", "Tennessee", "Missouri", "Mississippi", "Kentucky", "Canada", "Holston", "Maine", "Pittsburg", "Illinios", "Genessee", "Oneida", "New Hampshire and Vermont", "NA")
)
regional_conference <- methodists_1802 %>%
left_join(conference_regions, by = c("conference" = "conferences"))
regional_conference %>%
mutate(avg_colored = members_colored/(members_colored + members_white), avg_white = members_white/(members_white + members_colored)) %>%
group_by(minutes_year, regional) %>%
summarize(perc_colored = sum(avg_colored, na.rm = TRUE)/n(), perc_white = sum(avg_white, na.rm = TRUE)/n()) %>%
ggplot(aes(x=minutes_year)) +
geom_line(aes(y = perc_colored, color = regional)) +
geom_line(aes(y = perc_white, color = regional))
europop
with the historical populations of European cities, and city_coords
which has the latitudes and longitudes of those cities. Load that package and join the two tables together. Can you get the populations of cities north of 48° of latitude?library(europop)
europop_coords<- europop %>%
left_join(city_coords, by = c("city" = "city"))
europop_coords %>%
filter(lat>48)
## Source: local data frame [1,414 x 6]
##
## city region year population lon lat
## (chr) (chr) (int) (int) (dbl) (dbl)
## 1 BERGEN Scandinavia 1500 0 5.330000 60.38944
## 2 COPENHAGEN Scandinavia 1500 NA 12.565530 55.67594
## 3 GOTEBORG Scandinavia 1500 0 11.966790 57.70716
## 4 KARLSKRONA Scandinavia 1500 0 15.586610 56.16156
## 5 OSLO Scandinavia 1500 0 10.746090 59.91273
## 6 STOCKHOLM Scandinavia 1500 0 18.064900 59.33258
## 7 BATH England and Wales 1500 0 -2.359070 51.37795
## 8 BIRMINGHAM England and Wales 1500 0 -1.898073 52.48137
## 9 BLACKBURN England and Wales 1500 0 -2.483330 53.75000
## 10 BOLTON England and Wales 1500 0 -2.428887 53.57769
## .. ... ... ... ... ... ...
judges_people
and judges_appointments
. Join them together. What are the names of black judges who were appointed to the Supreme Court?judges_total <- judges_people %>%
left_join(judges_appointments, by = c("judge_id" = "judge_id"))
judges_total %>%
filter(race=="African American", court_type=="USSC") %>%
select(judge_id,starts_with("name_"), commission_date, gender, race)
## Source: local data frame [2 x 8]
##
## judge_id name_first name_middle name_last name_suffix commission_date
## (int) (chr) (chr) (chr) (chr) (chr)
## 1 1489 Thurgood NA Marshall NA 08/30/1967
## 2 2362 Clarence NA Thomas NA 10/18/1991
## Variables not shown: gender (chr), race (chr)
judges_total %>%
filter(judge_id==1489) %>%
select(starts_with("name_"), court_type, commission_date)
## Source: local data frame [2 x 6]
##
## name_first name_middle name_last name_suffix court_type commission_date
## (chr) (chr) (chr) (chr) (chr) (chr)
## 1 Thurgood NA Marshall NA USCA 09/14/1962
## 2 Thurgood NA Marshall NA USSC 08/30/1967
judges_total %>%
filter( judge_id==2362) %>%
select(starts_with("name_"), court_type, commission_date)
## Source: local data frame [2 x 6]
##
## name_first name_middle name_last name_suffix court_type commission_date
## (chr) (chr) (chr) (chr) (chr) (chr)
## 1 Clarence NA Thomas NA USCA 03/06/1990
## 2 Clarence NA Thomas NA USSC 10/18/1991
spread()
and gather()
)It can be helpful to think of tabular data as coming in two forms: wide data, and long data. Let’s load in a table of data. This data contains total membership figures for the Virginia conference of the Methodist Episcopal Church for the years 1812 to 1830.
va_methodists_wide <- read_csv("http://lincolnmullen.com/projects/worksheets/data/va-methodists-wide.csv")
va_methodists_wide
## Source: local data frame [10 x 21]
##
## conference district 1812 1813 1814 1815 1816 1817 1818 1819
## (chr) (chr) (int) (int) (int) (int) (int) (int) (int) (int)
## 1 Virginia James River 5348 4691 4520 4209 4118 3888 3713 3580
## 2 Virginia Meherren 4882 4486 4771 4687 4702 NA NA NA
## 3 Virginia Meherrin NA NA NA NA NA 4435 3964 3860
## 4 Virginia Neuse NA NA 3474 3475 3448 2702 3340 4667
## 5 Virginia Newbern 3511 3558 NA NA NA NA NA NA
## 6 Virginia Norfolk 4686 6196 6127 6001 5661 6495 6471 NA
## 7 Virginia Raleigh 3822 4018 NA NA NA NA NA NA
## 8 Virginia Roanoke NA NA NA NA 3049 NA 1507 NA
## 9 Virginia Tar River NA NA 3834 3466 NA NA NA NA
## 10 Virginia Yadkin 3174 3216 3528 3323 3374 3323 4689 4547
## Variables not shown: 1820 (int), 1821 (int), 1822 (int), 1823 (int), 1824
## (int), 1825 (int), 1826 (int), 1827 (int), 1828 (int), 1829 (int), 1830
## (int)
The first thing we can notice about this data frame is that it is very wide because it has a column for each of the years. The data is also suitable for reading because it like a table in a publication. We can read from left to right and see when certain districts begin and end and get the values for each year. The difficulties of computing on or plotting the data will also become quickly apparent. How would you make a plot of the change over time in the number of members in each district? Or how would you filter by year, or summarize by year? For that matter, what do the numbers in the table represent, since they are not given an explicit variable name?
The problem with the table is that it is not tidy data, because the variables are not in columns and observations in rows. One of the variables is the year, but its values are in the column headers. And another of the variables is total membership, but its values are spread across rows and columns and it is not explicitly named.
The gather()
function from the tidyr package lets us turn wide data into long data. We need to tell the function two kinds of information. First we need to tell it the name of the column to create from the column headers and the name of the implicit variable in the rows. In the example below, we create two new columns minutes_year
and total_membership
. Then we also have to tell the function if there are any columns which should remain unchanged. In this case, the conference
and district
variables should remain the same, so we remove them from the gathering using the same syntax as the select()
function.
va_methodists_wide %>%
gather(minutes_year, total_membership, -conference, -district)
## Source: local data frame [190 x 4]
##
## conference district minutes_year total_membership
## (chr) (chr) (fctr) (int)
## 1 Virginia James River 1812 5348
## 2 Virginia Meherren 1812 4882
## 3 Virginia Meherrin 1812 NA
## 4 Virginia Neuse 1812 NA
## 5 Virginia Newbern 1812 3511
## 6 Virginia Norfolk 1812 4686
## 7 Virginia Raleigh 1812 3822
## 8 Virginia Roanoke 1812 NA
## 9 Virginia Tar River 1812 NA
## 10 Virginia Yadkin 1812 3174
## .. ... ... ... ...
We can see the results above. There are two ways that this result is not quite what we want. Because the years were column headers they are treated as character vectors rather than integers. We can manually convert them in a later step, but we can also let gather()
do the right thing with the convert =
argument. Then we have a lot of NA
values which were explicit in the wide table but which can be removed from the long table with na.rm =
.
va_methodists_long <- va_methodists_wide %>%
gather(minutes_year, total_membership, -conference, -district,
convert = TRUE, na.rm = TRUE)
va_methodists_long
## Source: local data frame [100 x 4]
##
## conference district minutes_year total_membership
## (chr) (chr) (int) (int)
## 1 Virginia James River 1812 5348
## 2 Virginia Meherren 1812 4882
## 3 Virginia Newbern 1812 3511
## 4 Virginia Norfolk 1812 4686
## 5 Virginia Raleigh 1812 3822
## 6 Virginia Yadkin 1812 3174
## 7 Virginia James River 1813 4691
## 8 Virginia Meherren 1813 4486
## 9 Virginia Newbern 1813 3558
## 10 Virginia Norfolk 1813 6196
## .. ... ... ... ...
Notice that now we can use the data in ggplot2 without any problem.
ggplot(va_methodists_long,
aes(x = minutes_year, y = total_membership, color = district)) +
geom_line() +
ggtitle("Membership of districts in the Virginia conference")
The inverse operation of gather()
is spread()
. With spread()
we specify the name of the column which should become the new column headers (in this case minutes_year
), and then the name of the column to fill in underneath those new column headers (in this case, total_membership
). We can see the results below.
va_methodists_wide2 <- va_methodists_long %>%
spread(minutes_year, total_membership)
va_methodists_wide2
## Source: local data frame [10 x 21]
##
## conference district 1812 1813 1814 1815 1816 1817 1818 1819
## (chr) (chr) (int) (int) (int) (int) (int) (int) (int) (int)
## 1 Virginia James River 5348 4691 4520 4209 4118 3888 3713 3580
## 2 Virginia Meherren 4882 4486 4771 4687 4702 NA NA NA
## 3 Virginia Meherrin NA NA NA NA NA 4435 3964 3860
## 4 Virginia Neuse NA NA 3474 3475 3448 2702 3340 4667
## 5 Virginia Newbern 3511 3558 NA NA NA NA NA NA
## 6 Virginia Norfolk 4686 6196 6127 6001 5661 6495 6471 NA
## 7 Virginia Raleigh 3822 4018 NA NA NA NA NA NA
## 8 Virginia Roanoke NA NA NA NA 3049 NA 1507 NA
## 9 Virginia Tar River NA NA 3834 3466 NA NA NA NA
## 10 Virginia Yadkin 3174 3216 3528 3323 3374 3323 4689 4547
## Variables not shown: 1820 (int), 1821 (int), 1822 (int), 1823 (int), 1824
## (int), 1825 (int), 1826 (int), 1827 (int), 1828 (int), 1829 (int), 1830
## (int)
Just by looking at the data we can see that we got back to where we started, but we can also verify that programmatically.
identical(va_methodists_wide, va_methodists_wide2)
## [1] TRUE
Turning long data into wide is often useful when you want to create a tabular representation of data. (And once you have a data frame that can be a table, the knitr::kable()
function is quite nice.) And some algorithms, such as clustering algorithms, expect wide data rather than tidy data.
For the exercise, we will use summary statistics of the number of white and black members in the Methodists by year.
methodists_by_year_race <- methodists %>%
filter(minutes_year >= 1786) %>%
group_by(minutes_year) %>%
summarize(white = sum(members_white, na.rm = TRUE),
black = sum(members_colored, na.rm = TRUE),
indian = sum(as.integer(members_indian), na.rm = TRUE))
methodists_by_year_race
## Source: local data frame [45 x 4]
##
## minutes_year white black indian
## (int) (int) (int) (int)
## 1 1786 18291 2890 0
## 2 1787 21949 3883 0
## 3 1788 30557 7991 0
## 4 1789 34425 8840 0
## 5 1790 45983 11682 0
## 6 1791 50580 13098 0
## 7 1792 52079 13871 0
## 8 1793 51486 14420 0
## 9 1794 52794 13906 0
## 10 1795 48121 12171 0
## .. ... ... ... ...
methodists_by_year_race
could be tidier still. While white
, black
, and indian
are variables, it is perhaps better to think of them as two different variables. One variable would be race
, containing the racial descriptions that the Methodists used, and another would be members
, containing the number of members. Using the gather()
function, create that data frame.methodists_race_long <- methodists_by_year_race %>%
gather(race, members, -minutes_year)
methodists_race_long
## Source: local data frame [135 x 3]
##
## minutes_year race members
## (int) (fctr) (int)
## 1 1786 white 18291
## 2 1787 white 21949
## 3 1788 white 30557
## 4 1789 white 34425
## 5 1790 white 45983
## 6 1791 white 50580
## 7 1792 white 52079
## 8 1793 white 51486
## 9 1794 white 52794
## 10 1795 white 48121
## .. ... ... ...
race
column to the color
aesthetic.methodists_race_long %>%
ggplot(aes(x=minutes_year)) +
geom_line(aes(y=members, color=race))
methodists_race_wide <- methodists_race_long %>%
spread(minutes_year, members)
methodists_race_wide
## Source: local data frame [3 x 46]
##
## race 1786 1787 1788 1789 1790 1791 1792 1793 1794 1795 1796
## (fctr) (int) (int) (int) (int) (int) (int) (int) (int) (int) (int) (int)
## 1 white 18291 21949 30557 34425 45983 50580 52079 51486 52794 48121 45374
## 2 black 2890 3883 7991 8840 11682 13098 13871 14420 13906 12171 11380
## 3 indian 0 0 0 0 0 0 0 0 0 0 0
## Variables not shown: 1797 (int), 1798 (int), 1799 (int), 1800 (int), 1801
## (int), 1802 (int), 1803 (int), 1804 (int), 1805 (int), 1806 (int), 1807
## (int), 1808 (int), 1809 (int), 1810 (int), 1811 (int), 1812 (int), 1813
## (int), 1814 (int), 1815 (int), 1816 (int), 1817 (int), 1818 (int), 1819
## (int), 1820 (int), 1821 (int), 1822 (int), 1823 (int), 1824 (int), 1825
## (int), 1826 (int), 1827 (int), 1828 (int), 1829 (int), 1830 (int)
methodists_race_long %>%
spread(race, members)
## Source: local data frame [45 x 4]
##
## minutes_year white black indian
## (int) (int) (int) (int)
## 1 1786 18291 2890 0
## 2 1787 21949 3883 0
## 3 1788 30557 7991 0
## 4 1789 34425 8840 0
## 5 1790 45983 11682 0
## 6 1791 50580 13098 0
## 7 1792 52079 13871 0
## 8 1793 51486 14420 0
## 9 1794 52794 13906 0
## 10 1795 48121 12171 0
## .. ... ... ... ...
There are a number of different kinds of window functions in R. We are going to look at two window functions, lead()
and lag()
which help us look for change over time. (For a fuller explanation of window functions, see the related dplyr vignette.)
To understand what a window function does, it is helpful to compare it to a transformation function and an aggregation function. Suppose we have a vector with five numeric values.
original <- c(1.1, 2.2, 3.3, 4.4, 5.5)
A transformation function changes each element in the vector and returns a new value for each. In the example below, we round each element in the vector. We have a different result, but it still has five elements.
round(original, 0)
## [1] 1 2 3 4 6
In an aggregation function, we pass in a vector of numbers and get back a single value. In this case, we get the sum of the numbers.
sum(original)
## [1] 16.5
Aggregation functions work well with summarize()
; transformation functions works well with mutate()
.
A window function gives back a vector of numbers, but a vector which has fewer useable elements than the original. It is like sliding a window over the vector. Consider the case below.
lead(original)
## [1] 2.2 3.3 4.4 5.5 NA
lag(original)
## [1] NA 1.1 2.2 3.3 4.4
The function lead()
returns the next element of a vector in place of the original value. At the end of the vector we get an NA
because there are no more elements left. The function lag()
does the opposite, giving us the previous element in the vector. In that case, the first element of the returned vector is NA
.
The lead()
and lag()
functions are useful for comparing one value to its previous or successor value. Suppose, for instance, that we have a vector of membership figures for each year. We can calculate the number of new members each year by subtracting the current value from its previous value.
membership <- c(100, 150, 250, 400, 600)
membership - lag(membership)
## [1] NA 50 100 150 200
Now that we understand those basics, we can apply that to the Methodist annual minutes data that we worked with in a previous lesson. Let’s start by getting just the membership data from Fairfax, Virginia. We will also calculate the members_general
value for the years it is missing, and select only the columns we absolutely need.
fairfax <- methodists %>%
filter(meeting == "Fairfax") %>%
mutate(members_general = ifelse(is.na(members_general),
members_white + members_colored,
members_general)) %>%
select(minutes_year, meeting, starts_with("members"), -members_indian)
fairfax
## Source: local data frame [54 x 5]
##
## minutes_year meeting members_general members_white members_colored
## (int) (chr) (int) (int) (int)
## 1 1775 Fairfax 30 NA NA
## 2 1776 Fairfax 350 NA NA
## 3 1777 Fairfax 330 NA NA
## 4 1779 Fairfax 309 NA NA
## 5 1780 Fairfax 361 NA NA
## 6 1781 Fairfax 301 NA NA
## 7 1782 Fairfax 362 NA NA
## 8 1783 Fairfax 310 NA NA
## 9 1784 Fairfax 317 NA NA
## 10 1786 Fairfax 260 260 0
## .. ... ... ... ... ...
Now that we have the data, we can add a column for the number of new members added each year.
fairfax <- fairfax %>%
mutate(growth = members_general - lag(members_general))
fairfax
## Source: local data frame [54 x 6]
##
## minutes_year meeting members_general members_white members_colored
## (int) (chr) (int) (int) (int)
## 1 1775 Fairfax 30 NA NA
## 2 1776 Fairfax 350 NA NA
## 3 1777 Fairfax 330 NA NA
## 4 1779 Fairfax 309 NA NA
## 5 1780 Fairfax 361 NA NA
## 6 1781 Fairfax 301 NA NA
## 7 1782 Fairfax 362 NA NA
## 8 1783 Fairfax 310 NA NA
## 9 1784 Fairfax 317 NA NA
## 10 1786 Fairfax 260 260 0
## .. ... ... ... ... ...
## Variables not shown: growth (int)
fairfax %>%
filter(growth<0) %>%
select(minutes_year,growth)
## Source: local data frame [29 x 2]
##
## minutes_year growth
## (int) (int)
## 1 1777 -20
## 2 1779 -21
## 3 1781 -60
## 4 1783 -52
## 5 1786 -57
## 6 1791 -152
## 7 1792 -9
## 8 1793 -219
## 9 1795 -166
## 10 1796 -15
## .. ... ...
fairfax %>%
filter(growth>0) %>%
select(minutes_year, growth)
## Source: local data frame [24 x 2]
##
## minutes_year growth
## (int) (int)
## 1 1776 320
## 2 1780 52
## 3 1782 61
## 4 1784 7
## 5 1787 10
## 6 1788 98
## 7 1789 182
## 8 1790 400
## 9 1794 20
## 10 1798 2
## .. ... ...
fairfax %>%
ggplot(aes(x=minutes_year))+
geom_line(aes(y=growth))
## Warning: Removed 1 rows containing missing values (geom_path).
My initial thought is it apply my knowledge of history to the dataset to see if anything doesn’t quite match up. For example, were there years of war, famine, disease, revivals, increase immigrations etc. that would directly affect these growth statistics. Also, I feel, that it is always good to explore outliers or extreme changes from year to year for transcription errors.
fairfax %>%
mutate(white_growth = members_white - lag(members_white), colored_growth = members_colored- lag(members_colored)) %>%
ggplot(aes(x=minutes_year))+
geom_line(aes(y=colored_growth), color="blue") +
geom_line(aes(y=white_growth), color="black")
## Warning: Removed 10 rows containing missing values (geom_path).
## Warning: Removed 10 rows containing missing values (geom_path).
methodists
data. Beginning in 1802, the Methodists organized the data by conference, district, and meeting. For 1802 and following, calculate the growth for each conference. Which conferences were growing the most in absolute terms? Which were growing the most in relative terms (i.e., growth percentage)? Do you get a clearer picture by looking at the growth in districts? Feel free to plot the data if you wish and to add explanatory text.general_members_methodists <- methodists %>%
filter(minutes_year>=1802) %>%
mutate(members_general = ifelse(is.na(members_general),
members_white + members_colored,
members_general))
conference_growth_methodists <- general_members_methodists %>%
group_by(conference, minutes_year) %>%
summarize(total_members_yearly = sum(members_general, na.rm=TRUE)) %>%
group_by(conference) %>%
mutate(growth = total_members_yearly - lag(total_members_yearly)) %>%
summarize(absolute_growth=sum(growth, na.rm=TRUE), relative_growth = (tail(total_members_yearly,1)/head(total_members_yearly, 1))*100)
conference_growth_methodists %>%
ggplot(aes(x= conference)) +
geom_bar(stat = "identity", aes(y=absolute_growth)) +
theme(axis.text.x =
element_text(size = 10,
angle = 45,
hjust = 1,
vjust = 1))
## Warning: Removed 1 rows containing missing values (position_stack).
conference_growth_methodists %>%
ggplot(aes(x= conference)) +
geom_bar(stat = "identity", aes(y=relative_growth))+
theme(axis.text.x =
element_text(size = 10,
angle = 45,
hjust = 1,
vjust = 1))
## Warning: Removed 1 rows containing missing values (position_stack).
It is a pretty interesting break down to compare the absolute growth to the relative growth as the leaders in each are not the same. As an exploratory tool, these type of calculations are fantastic and extremely helpful. In absolute terms, Baltimore, New York, South Carolina, Virginia, and Western were growing the most. However according to the relative growth calculation, South Carolina and Western are still growing the most but so are New England and Mississippi. Of those four, Mississippi has the highest relative growth rate.
district_growth_methodists <- general_members_methodists %>%
group_by(district, minutes_year) %>%
summarize(total_members_yearly = sum(members_general, na.rm=TRUE)) %>%
group_by(district) %>%
mutate(growth = total_members_yearly - lag(total_members_yearly)) %>%
summarize(absolute_growth=sum(growth, na.rm=TRUE), relative_growth = (tail(total_members_yearly,1, na.rm=TRUE)/head(total_members_yearly, 1, na.rm=TRUE))*100)
district_growth_methodists %>%
ggplot(aes(x= district)) +
geom_bar(stat = "identity", aes(y=absolute_growth)) +
theme(axis.text.x =
element_text(size = 5,
angle = 45,
hjust = 1,
vjust = 1))
## Warning: Removed 1 rows containing missing values (position_stack).
## Warning: Stacking not well defined when ymin != 0
district_growth_methodists %>%
ggplot(aes(x= district)) +
geom_bar(stat = "identity", aes(y=relative_growth))+
theme(axis.text.x =
element_text(size = 5,
angle = 45,
hjust = 1,
vjust = 1))
## Warning: Removed 2 rows containing missing values (position_stack).
By breaking the membrship growth down by distrcit, I do notice that Asbury has a ridiculous relative growth rate (167,850) which is either the result of a transcription error or my scripting/calcuations (more likely). If it is a transcription error, by breaking the membership growth down by he district level, I notice that transcription error that I probably wouldn’t if I only looked at the data from the conference level.
And indeed, we could write a function that translates state names into regions.↩