Basic data manipulation with dplyr

Aims of this worksheet

One of the key reasons to use R is to be able to manipulate data with ease. After completing this worksheet you will be able to work with the most commonly used data manipulation verbs provided by the dplyr package. In addition, you will learn how to read a CSV file into R, and learn how to use the pipe operator.

Since these packages were written by Hadley Wickham, you may find his articles on “Tidy Data” and “The Split-Apply-Combine Strategy for Data Analysis” to be useful. In addition, you should read the package documentation and vignettes for the dplyr package.

Loading CSV files

We will begin by loading the necessary packages:

library(dplyr)
library(historydata)

So far we have worked almost exclusively with data which is contained in R packages, which we can install and then load. However, most data will come to you outside of an R package, so you must learn how to load that data.

Often tabular data will come to you in the form of a CSV (comma separated values) file. This is a text file where, as the name indicates, the columns in the data are separated with columns. This is a much better way of receiving data than say, an Excel spreadsheet.

R has a number of functions that can read in data in various formats. We will use one, read_csv(), from the readr package, which works nicely with dplyr and tidyr. But there is an equivalent function in base R called read.csv(). See ?readr::read_csv or ?read.csv for details.

Because R is awesome, we can pass it the URL to a file (in addition to the local path to a file) and it will be loaded. (You could also download the file with download.file().)

library(readr)
methodists <- read_csv("http://lincolnmullen.com/projects/worksheets/data/methodists.csv")

The pipe (`%>%`)

The packages that we will be working with all take advantage of an operator provided by the magrittr package. Called the pipe, this operator %>% passes the output of one function to the first argument of a second function.

Consider this example from an earlier exercise, where we find the earliest and latest dates from a column in a data frame.

range(as.Date(naval_promotions$date), na.rm = TRUE)

## [1] "1794-06-04" "1905-02-12"

Here we have to read the expression from the inside out, from the naval_promotions$date, to the as.Date() function, to the range() function. The pipe operator lets us write an equivalent expression which is more legible.

naval_promotions$date %>% 
  as.Date() %>% 
  range(na.rm = TRUE)

## [1] "1794-06-04" "1905-02-12"

Selecting columns (`select()`)

Now we can begin to use the Methodists data. This data contains membership statistics for different churches, circuits, and class meetings. (This data is transcribed from the Methodist annual minutes meeting, and so it uses the racial terminology from those primary sources.) First can get a sense of what the data looks like.

methodists

## Source: local data frame [15,887 x 10]
## 
##    minutes_year conference district      meeting state members_general
##           (int)      (chr)    (chr)        (chr) (chr)           (int)
## 1          1773         NA       NA     New York    NA             180
## 2          1773         NA       NA Philadelphia    NA             180
## 3          1773         NA       NA   New Jersey    NA             200
## 4          1773         NA       NA     Maryland    NA             500
## 5          1773         NA       NA     Virginia    NA             100
## 6          1774         NA       NA     New York    NA             222
## 7          1774         NA       NA Philadelphia    NA             204
## 8          1774         NA       NA   New Jersey    NA             257
## 9          1774         NA       NA      Chester    NA              36
## 10         1774         NA       NA    Baltimore    NA             738
## ..          ...        ...      ...          ...   ...             ...
## Variables not shown: members_white (int), members_colored (int),
##   members_indian (chr), url (chr)

The first data manipulation verb that we are going to use is select(). This function lets us pass the names of the columns that we want to keep.

methodists %>% 
  select(minutes_year, meeting, members_general)

## Source: local data frame [15,887 x 3]
## 
##    minutes_year      meeting members_general
##           (int)        (chr)           (int)
## 1          1773     New York             180
## 2          1773 Philadelphia             180
## 3          1773   New Jersey             200
## 4          1773     Maryland             500
## 5          1773     Virginia             100
## 6          1774     New York             222
## 7          1774 Philadelphia             204
## 8          1774   New Jersey             257
## 9          1774      Chester              36
## 10         1774    Baltimore             738
## ..          ...          ...             ...

Notice that we have not actually changed the data stored in methodists until we assign the changed data back to the variable.

Read the documentation for this function, ?select.

Select the minutes_year, meeting, and all the columns that begin with the word members_.

methodists %>% 
  select(minutes_year, meeting, starts_with("members_"))

## Source: local data frame [15,887 x 6]
## 
##    minutes_year      meeting members_general members_white members_colored
##           (int)        (chr)           (int)         (int)           (int)
## 1          1773     New York             180            NA              NA
## 2          1773 Philadelphia             180            NA              NA
## 3          1773   New Jersey             200            NA              NA
## 4          1773     Maryland             500            NA              NA
## 5          1773     Virginia             100            NA              NA
## 6          1774     New York             222            NA              NA
## 7          1774 Philadelphia             204            NA              NA
## 8          1774   New Jersey             257            NA              NA
## 9          1774      Chester              36            NA              NA
## 10         1774    Baltimore             738            NA              NA
## ..          ...          ...             ...           ...             ...
## Variables not shown: members_indian (chr)

Remove the column url.

methodists %>% 
  select(-url)

## Source: local data frame [15,887 x 9]
## 
##    minutes_year conference district      meeting state members_general
##           (int)      (chr)    (chr)        (chr) (chr)           (int)
## 1          1773         NA       NA     New York    NA             180
## 2          1773         NA       NA Philadelphia    NA             180
## 3          1773         NA       NA   New Jersey    NA             200
## 4          1773         NA       NA     Maryland    NA             500
## 5          1773         NA       NA     Virginia    NA             100
## 6          1774         NA       NA     New York    NA             222
## 7          1774         NA       NA Philadelphia    NA             204
## 8          1774         NA       NA   New Jersey    NA             257
## 9          1774         NA       NA      Chester    NA              36
## 10         1774         NA       NA    Baltimore    NA             738
## ..          ...        ...      ...          ...   ...             ...
## Variables not shown: members_white (int), members_colored (int),
##   members_indian (chr)

Filtering rows (`filter()`)

The select() function lets us pick certain columns. The filter() function lets select certain rows based on logical conditions. For example, here we get the only the churches where the total number of members is at greater than 1,000.

methodists %>% 
  filter(members_general > 1000)

## Source: local data frame [4 x 10]
## 
##   minutes_year conference district   meeting state members_general
##          (int)      (chr)    (chr)     (chr) (chr)           (int)
## 1         1776         NA       NA Brunswick    NA            1611
## 2         1777         NA       NA Brunswick    NA            1360
## 3         1781         NA       NA  Delaware    NA            1052
## 4         1783         NA       NA     Dover    NA            1017
## Variables not shown: members_white (int), members_colored (int),
##   members_indian (chr), url (chr)

Get just the rows from New York in 1800.

methodists %>% 
  filter(state == "New York",
         minutes_year == 1800)

## Source: local data frame [18 x 10]
## 
##    minutes_year conference district                 meeting    state
##           (int)      (chr)    (chr)                   (chr)    (chr)
## 1          1800         NA       NA             Albany City New York
## 2          1800         NA       NA          Albany Circuit New York
## 3          1800         NA       NA                Brooklyn New York
## 4          1800         NA       NA               Cambridge New York
## 5          1800         NA       NA                Chenango New York
## 6          1800         NA       NA                Columbia New York
## 7          1800         NA       NA                Delaware New York
## 8          1800         NA       NA                Dutchess New York
## 9          1800         NA       NA                Herkimer New York
## 10         1800         NA       NA             Long Island New York
## 11         1800         NA       NA                  Mohawk New York
## 12         1800         NA       NA                 Newburg New York
## 13         1800         NA       NA New Rochelle and Croton New York
## 14         1800         NA       NA                New York New York
## 15         1800         NA       NA       Oneida and Cayuga New York
## 16         1800         NA       NA              Plattsburg New York
## 17         1800         NA       NA                Saratoga New York
## 18         1800         NA       NA                  Seneca New York
## Variables not shown: members_general (int), members_white (int),
##   members_colored (int), members_indian (chr), url (chr)

Which Methodist meetings had only black members?

methodists %>% 
  filter(members_white == 0)

## Source: local data frame [8 x 10]
## 
##   minutes_year     conference   district
##          (int)          (chr)      (chr)
## 1         1786             NA         NA
## 2         1787             NA         NA
## 3         1819       New York   New York
## 4         1820       New York   New York
## 5         1823       New York   New York
## 6         1830 South Carolina    Augusta
## 7         1830 South Carolina Charleston
## 8         1830 South Carolina Charleston
## Variables not shown: meeting (chr), state (chr), members_general (int),
##   members_white (int), members_colored (int), members_indian (chr), url
##   (chr)

Of the 8 records returned, 5 of them have NA for members_native. It would be a judgement call as to whether you would include those 5 records as only having black members or not.

The Methodists gradually kept more detailed records. They began by keeping track of the total number of members, then the number of white and black members, and eventually the number of white, black, and Indian members. But when they kept track of those racial divisions, they did not report the total number of members.

In which year did the Methodists start keeping track of white and black members? (Hint: use is.na() to find when the missing values begin.)

methodists %>% 
  filter(!is.na(members_white))

## Source: local data frame [15,583 x 10]
## 
##    minutes_year conference district      meeting state members_general
##           (int)      (chr)    (chr)        (chr) (chr)           (int)
## 1          1786         NA       NA   Portsmouth    NA              NA
## 2          1786         NA       NA       Sussex    NA              NA
## 3          1786         NA       NA    Brunswick    NA              NA
## 4          1786         NA       NA       Amelia    NA              NA
## 5          1786         NA       NA  Mecklenburg    NA              NA
## 6          1786         NA       NA      Bedford    NA              NA
## 7          1786         NA       NA       Orange    NA              NA
## 8          1786         NA       NA Williamsburg    NA              NA
## 9          1786         NA       NA    Alleghany    NA              NA
## 10         1786         NA       NA      Berkley    NA              NA
## ..          ...        ...      ...          ...   ...             ...
## Variables not shown: members_white (int), members_colored (int),
##   members_indian (chr), url (chr)

Bonus: some of these Methodist meetings were “missions.” Load the stringr package and look at the documentation for ?str_detect. Can you find which rows represent missions?

library (stringr)
methodists %>% 
  filter(str_detect(meeting, "mission"))

## Source: local data frame [82 x 10]
## 
##    minutes_year     conference     district                   meeting
##           (int)          (chr)        (chr)                     (chr)
## 1          1823 South Carolina      Catawba      Early county mission
## 2          1823        Genesee Upper Canada       Grand River mission
## 3          1824           Ohio      Lebanon            Indian mission
## 4          1824       Kentucky   Cumberland        Cumberland mission
## 5          1824      Tennessee   Huntsville Cherokee mission, Indiana
## 6          1824 South Carolina       Athens            Monroe mission
## 7          1824 South Carolina       Athens      Yellow River mission
## 8          1824 South Carolina       Athens          Gwinnett mission
## 9          1824 South Carolina       Oconee     St. Augustine mission
## 10         1824 South Carolina       Oconee      Chatahoochee mission
## ..          ...            ...          ...                       ...
## Variables not shown: state (chr), members_general (int), members_white
##   (int), members_colored (int), members_indian (chr), url (chr)

Creating new columns (`mutate()`)

Very often one will want to create a new column based on other columns in the data. For instance, in our Methodist data, there is a column called minutes_year for the year that the minutes were kept. But the actual data was for the previous year. Here we create a new column called year, where each value is one less than in minutes_year.

methodists %>% 
  mutate(year = minutes_year - 1) %>% 
  select(minutes_year, year, meeting)

## Source: local data frame [15,887 x 3]
## 
##    minutes_year  year      meeting
##           (int) (dbl)        (chr)
## 1          1773  1772     New York
## 2          1773  1772 Philadelphia
## 3          1773  1772   New Jersey
## 4          1773  1772     Maryland
## 5          1773  1772     Virginia
## 6          1774  1773     New York
## 7          1774  1773 Philadelphia
## 8          1774  1773   New Jersey
## 9          1774  1773      Chester
## 10         1774  1773    Baltimore
## ..          ...   ...          ...

Notice that we chained the data manipulation functions using the pipe (%>%). This lets us create a pipeline were we can do many different manipulations in a row.

Filter the Methodists data to 1786 and later. All of the members_total values for those years are NA. Assign the the value of members_white plus members_colored to the members_total column.

methodists %>% 
  filter(minutes_year >= 1786) %>% 
  mutate(members_total = members_white + members_colored) %>% 
  select(minutes_year, members_white,members_colored, members_total)

## Source: local data frame [15,663 x 4]
## 
##    minutes_year members_white members_colored members_total
##           (int)         (int)           (int)         (int)
## 1          1786           330              26           356
## 2          1786           416              72           488
## 3          1786           305              59           364
## 4          1786           382              30           412
## 5          1786           392              37           429
## 6          1786           524              16           540
## 7          1786           374              75           449
## 8          1786           167              11           178
## 9          1786           350              18           368
## 10         1786           140              26           166
## ..          ...           ...             ...           ...

Create two new columns, one with the percentage of white members, and one with the percentage of black members.

methodists %>% 
  filter(minutes_year >= 1786) %>% 
  mutate(perc_afam = members_colored / (members_colored + members_white)) %>%
  mutate(perc_whit = members_white / (members_colored + members_white)) %>% 
  select(minutes_year, perc_afam, perc_whit)

## Source: local data frame [15,663 x 3]
## 
##    minutes_year  perc_afam perc_whit
##           (int)      (dbl)     (dbl)
## 1          1786 0.07303371 0.9269663
## 2          1786 0.14754098 0.8524590
## 3          1786 0.16208791 0.8379121
## 4          1786 0.07281553 0.9271845
## 5          1786 0.08624709 0.9137529
## 6          1786 0.02962963 0.9703704
## 7          1786 0.16703786 0.8329621
## 8          1786 0.06179775 0.9382022
## 9          1786 0.04891304 0.9510870
## 10         1786 0.15662651 0.8433735
## ..          ...        ...       ...

Bonus: We neglected the column members_indian above, which should be counted in the members_total column. However, for some years the values of that column are NA, and in other years they have a numeric value. The difference is that the years where the value is NA, that column does not appear in the data because it was not tracked. How can you add that column to the total without running into problems with the missing values? (Hints: One way to do this is to use the ifelse() function to conditionally assign a value to members_total. Another way to do this is to group the data frame using rowwise() and use sum() with the na.rm = TRUE argument. A third, less elegant way, would be to split the data frame manually based on whether the value is NA or not, then recombine them.)

Sorting columns (`arrange()`)

Often we want to sort a data frame by one of its columns. This can be done with the verb arrange(). By default arrange() will sort from least to greatest; we can use the function desc() to sort from greatest to least. In this example, we sort the data frame to get the circuits with the highest number of white members.

methodists %>% 
  arrange(desc(members_white))

## Source: local data frame [15,887 x 10]
## 
##    minutes_year conference  district     meeting state members_general
##           (int)      (chr)     (chr)       (chr) (chr)           (int)
## 1          1825       Ohio Lancaster   Muskingum    NA              NA
## 2          1830   New York  New York    New York    NA              NA
## 3          1829   New York  New York    New York    NA              NA
## 4          1828   New York  New York    New York    NA              NA
## 5          1830  Baltimore Baltimore   Baltimore    NA              NA
## 6          1825       Ohio Lancaster Hockhocking    NA              NA
## 7          1827   New York  New York    New York    NA              NA
## 8          1829  Baltimore Baltimore   Baltimore    NA              NA
## 9          1828  Baltimore Baltimore   Baltimore    NA              NA
## 10         1826   New York  New York    New York    NA              NA
## ..          ...        ...       ...         ...   ...             ...
## Variables not shown: members_white (int), members_colored (int),
##   members_indian (chr), url (chr)

Which circuits had the highest number of black members? Be sure to select only the necessary columns so that the results print in a meaningful way.

methodists %>% 
  arrange(desc(members_colored)) %>% 
  select(minutes_year, members_colored, members_general)

## Source: local data frame [15,887 x 3]
## 
##    minutes_year members_colored members_general
##           (int)           (int)           (int)
## 1          1817            5699              NA
## 2          1816            5313              NA
## 3          1815            3796              NA
## 4          1813            3604              NA
## 5          1814            3418              NA
## 6          1830            3163              NA
## 7          1812            3128              NA
## 8          1829            3000              NA
## 9          1825            2747              NA
## 10         1826            2712              NA
## ..          ...             ...             ...

Which circuits had the most members overall? Keep in mind that you will have to use your code from above to calculate the total members for years where that value is NA because only the number of white and black members is supplied?

methodists %>% 
  filter(minutes_year>=1786) %>% 
  mutate(members_total = members_white + members_colored) %>% 
  select(minutes_year, members_white, members_colored, members_total) %>% 
  arrange(desc(members_colored))

## Source: local data frame [15,663 x 4]
## 
##    minutes_year members_white members_colored members_total
##           (int)         (int)           (int)         (int)
## 1          1817           350            5699          6049
## 2          1816           345            5313          5658
## 3          1815           262            3796          4058
## 4          1813           321            3604          3925
## 5          1814           279            3418          3697
## 6          1830           548            3163          3711
## 7          1812           282            3128          3410
## 8          1829           500            3000          3500
## 9          1825           431            2747          3178
## 10         1826           441            2712          3153
## ..          ...           ...             ...           ...

Which circuits had the highest percentage of black members without being entirely black?

methodists %>% 
  filter(minutes_year>=1786) %>% 
  mutate(perc_afam = members_colored / (members_colored + members_white)) %>%
  arrange(desc(perc_afam)) %>% 
  filter(!members_white == 0) %>% 
  select(minutes_year, perc_afam, members_colored, members_white)

## Source: local data frame [15,575 x 4]
## 
##    minutes_year perc_afam members_colored members_white
##           (int)     (dbl)           (int)         (int)
## 1          1830 0.9952153             208             1
## 2          1799 0.9627660             181             7
## 3          1798 0.9625668             180             7
## 4          1816 0.9581182            1670            73
## 5          1829 0.9574468              45             2
## 6          1800 0.9570815             223            10
## 7          1801 0.9503311             287            15
## 8          1815 0.9501823            1564            82
## 9          1805 0.9484778             405            22
## 10         1828 0.9478944            1328            73
## ..          ...       ...             ...           ...

Split-apply-combine (`group_by()`)

Notice that in the example above the arrange() function sorted the entire data frame. So when we looked for the circuits with the largest number of members, we got rows from 1825, then 1830, then 1829, then 1830, and so on. What if we wanted to get the biggest circuit from each year?

We can solve this kind of problem with what Hadley Wickham calls the “split-apply-combine” pattern of data analysis. Think of it this way. First we can split the big data frame into separate data frames, one for each year. Then we can apply our logic to get the results we want; in this case, that means sorting the data frame. We might also want to get just the top one row with the biggest number of members. Then we can combine those split apart data frames into a new data frame.

Observe how this works. If we want to get circuit with the most black members in each year, we can use the following code:

methodists %>% 
  filter(minutes_year >= 1786) %>% 
  select(minutes_year, meeting, starts_with("members")) %>% 
  group_by(minutes_year) %>% 
  arrange(desc(members_colored)) %>% 
  slice(1)

## Source: local data frame [45 x 6]
## Groups: minutes_year [45]
## 
##    minutes_year meeting members_general members_white members_colored
##           (int)   (chr)           (int)         (int)           (int)
## 1          1786 Antigua              NA             0            1000
## 2          1787    Kent              NA           607             604
## 3          1788 Calvert              NA           505             842
## 4          1789 Calvert              NA           943             909
## 5          1790 Calvert              NA           814            1170
## 6          1791 Calvert              NA           760            1329
## 7          1792 Calvert              NA           700            1200
## 8          1793   Surry              NA           814             955
## 9          1794 Calvert              NA           682            1102
## 10         1795 Calvert              NA           602             924
## ..          ...     ...             ...           ...             ...
## Variables not shown: members_indian (chr)

Let’s walk through that logic step by step.

First we get only the years since 1786 with filter(), since the data does not keep track of the number of white and black members until that year.
Then we select only the columns that we are interested in, namely, the column for the year, the name of the meeting, and the various values for membership. We do this just so that the results print out in a useful way: in a real analysis we might decide not to throw away the other columns.
The crucial step is when we group_by() the minutes_year. This creates a new data-frame (the split step) for each unique combination of values in the variables. Notice that the printed our result says that there are 45 groups, i.e., one for each year from 1786 to 1830. (Note that you can group by combinations of columns, so, one group for each combination of city and state, for instance.)
Next we apply our logic, in this case, sorting by the column members_colored in descending order. This puts the rows with the biggest value at the top.
Next we continue to apply our logic with slice(). This function simply gives us the rows in each of the split-up data frames with that index. So slice(1) gives us the first row, slice(5) gives us this fifth row, and slice(1:5) gives us the first through fifth rows.
The last step, combine, where the split-up data frames are brought back together, is done for us automatically. Note that the data frame is still grouped, however, so any subsequent data manipulation verbs will be applied to the groups rather than the whole data frame. If we wished, we could use ungroup().

This particular operation, getting the top value in a split up data frame is so common that dplyr provides us with a top_n() function as a short cut. That function also handles ties better. (What if, for instance, two circuits both have the same biggest value?)

methodists %>% 
  filter(minutes_year >= 1786) %>% 
  select(minutes_year, meeting, starts_with("members")) %>% 
  group_by(minutes_year) %>% 
  top_n(1, members_colored)

## Source: local data frame [45 x 6]
## Groups: minutes_year [45]
## 
##    minutes_year meeting members_general members_white members_colored
##           (int)   (chr)           (int)         (int)           (int)
## 1          1786 Antigua              NA             0            1000
## 2          1787    Kent              NA           607             604
## 3          1788 Calvert              NA           505             842
## 4          1789 Calvert              NA           943             909
## 5          1790 Calvert              NA           814            1170
## 6          1791 Calvert              NA           760            1329
## 7          1792 Calvert              NA           700            1200
## 8          1793   Surry              NA           814             955
## 9          1794 Calvert              NA           682            1102
## 10         1795 Calvert              NA           602             924
## ..          ...     ...             ...           ...             ...
## Variables not shown: members_indian (chr)

We get the same results more concisely and reliably, though the steps of “split-apply-combine” are perhaps somewhat less easy to see.

For each year, which was the biggest circuit?

methodists %>% 
  filter(minutes_year>= 1786) %>% 
  mutate(members_total = members_colored + members_white) %>% 
  select(minutes_year, conference, district, meeting, members_total) %>% 
  group_by(minutes_year) %>% 
  top_n(1, members_total)

## Source: local data frame [45 x 5]
## Groups: minutes_year [45]
## 
##    minutes_year conference district meeting members_total
##           (int)      (chr)    (chr)   (chr)         (int)
## 1          1786         NA       NA    Kent          1013
## 2          1787         NA       NA  Talbot          1601
## 3          1788         NA       NA  Sussex          1611
## 4          1789         NA       NA Calvert          1852
## 5          1790         NA       NA Calvert          1984
## 6          1791         NA       NA Calvert          2089
## 7          1792         NA       NA Calvert          1900
## 8          1793         NA       NA   Surry          1769
## 9          1794         NA       NA  Sussex          2354
## 10         1795         NA       NA Calvert          1526
## ..          ...        ...      ...     ...           ...

For each year since 1786, which church had the biggest percentage of black members without being entirely black?

methodists %>% 
  filter(minutes_year >= 1786, !members_white==0) %>% 
  mutate(perc_afam = members_colored/(members_colored + members_white)) %>% 
  select(minutes_year, meeting, members_colored, members_white, perc_afam) %>% 
  group_by(minutes_year) %>% 
  arrange(desc(perc_afam)) %>% 
  slice(1)

## Source: local data frame [45 x 5]
## Groups: minutes_year [45]
## 
##    minutes_year         meeting members_colored members_white perc_afam
##           (int)           (chr)           (int)         (int)     (dbl)
## 1          1786         Calvert             316           295 0.5171849
## 2          1787      Charleston              53            34 0.6091954
## 3          1788         Calvert             842           505 0.6250928
## 4          1789     Mecklenburg             692            98 0.8759494
## 5          1790       Annapolis             185           122 0.6026059
## 6          1791      Charleston             119            66 0.6432432
## 7          1792      Georgetown             100            49 0.6711409
## 8          1793 Prince George's             225            65 0.7758621
## 9          1794      Charleston             220            60 0.7857143
## 10         1795      Charleston             280            65 0.8115942
## ..          ...             ...             ...           ...       ...

For the year 1825, what was the biggest meeting in each conference? In each district?

methodists %>% 
  filter(minutes_year == 1825) %>% 
  select(conference, district, meeting, members_white, members_colored) %>% 
   mutate(members_total=members_colored + members_white) %>% 
   group_by(conference) %>% 
  arrange(desc(members_total)) %>% 
  slice(1)

## Source: local data frame [15 x 6]
## Groups: conference [15]
## 
##        conference      district               meeting members_white
##             (chr)         (chr)                 (chr)         (int)
## 1       Baltimore     Baltimore        Baltimore city          2639
## 2          Canada Bay of Quinte               Augusta           590
## 3         Genesee       Ontario                 Lyons          1000
## 4         Holston       Holston               Holston          1700
## 5        Kentucky      Kentucky        Lexington cir.           750
## 6           Maine      Portland             Livermore           668
## 7     Mississippi       Cahawba            Tuscaloosa           735
## 8        Missouri       Indiana            Charleston          1000
## 9     New England    New London Hebron and New London           660
## 10       New York      New York         New York city          2567
## 11           Ohio     Lancaster             Muskingum          7775
## 12   Philadelphia  Philadelphia          St. George's          1994
## 13 South Carolina        Edisto            Charleston           431
## 14      Tennessee     Nashville               Lebanon          1050
## 15       Virginia       Norfolk          Princess Ann          1023
## Variables not shown: members_colored (int), members_total (int)

methodists %>% 
  filter(minutes_year == 1825) %>% 
  select(conference, district, meeting, members_white, members_colored) %>% 
   mutate(members_total=members_colored + members_white) %>% 
   group_by(district) %>% 
  arrange(desc(members_total)) %>% 
  slice(1)

## Source: local data frame [74 x 6]
## Groups: district [74]
## 
##        conference      district        meeting members_white
##             (chr)         (chr)          (chr)         (int)
## 1     Mississippi       Alabama        Alabama           639
## 2        Missouri      Arkansas  Mount Prairie           258
## 3  South Carolina        Asbury         Asbury             2
## 4  South Carolina        Athens    Cedar Creek           910
## 5        Kentucky       Augusta        Fleming           873
## 6       Baltimore     Baltimore Baltimore city          2639
## 7          Canada Bay of Quinte        Augusta           590
## 8         Genesee   Black River         Le Roy           630
## 9     New England        Boston         Boston           550
## 10 South Carolina   Broad River         Santee           647
## ..            ...           ...            ...           ...
## Variables not shown: members_colored (int), members_total (int)

For each year, what were the three biggest churches in the Baltimore conference?

methodists %>% 
  filter(conference == "Baltimore") %>% 
   mutate(members_total = members_colored + members_white) %>% 
   select(minutes_year, conference, district, meeting, members_total) %>%
   group_by(minutes_year) %>% 
  arrange(desc(members_total)) %>% 
  slice(1:3)

## Source: local data frame [87 x 5]
## Groups: minutes_year [29]
## 
##    minutes_year conference  district           meeting members_total
##           (int)      (chr)     (chr)             (chr)         (int)
## 1          1802  Baltimore Baltimore           Calvert          1612
## 2          1802  Baltimore Baltimore        Montgomery          1362
## 3          1802  Baltimore Baltimore Baltimore circuit          1314
## 4          1803  Baltimore Baltimore           Calvert          2052
## 5          1803  Baltimore Baltimore    Baltimore cir.          1718
## 6          1803  Baltimore Baltimore        Montgomery          1376
## 7          1804  Baltimore Baltimore           Calvert          2457
## 8          1804  Baltimore Baltimore         Baltimore          1857
## 9          1804  Baltimore Baltimore Baltimore circuit          1392
## 10         1805  Baltimore Baltimore           Calvert          2531
## ..          ...        ...       ...               ...           ...

Summarizing or aggregating data (`summarize()`)

In the examples using top_n() or slice() we performed a very simple kind of data summary, where we took the single row with the biggest value in a given column. This essentially boiled many rows of a data frame down into a single row. We would like to be able to summarize or aggregate a data frame in other ways as well. For instance, we often want to take the sum or the mean of a given column. We can do this using the summarize() function in conjunction with the group_by() function.

In this example, we group by the year the minutes were taken. Then we find the total number of white members for each year.

methodists %>% 
  filter(minutes_year >= 1786) %>% 
  group_by(minutes_year) %>% 
  summarize(total_members_white = sum(members_white, na.rm = TRUE))

## Source: local data frame [45 x 2]
## 
##    minutes_year total_members_white
##           (int)               (int)
## 1          1786               18291
## 2          1787               21949
## 3          1788               30557
## 4          1789               34425
## 5          1790               45983
## 6          1791               50580
## 7          1792               52079
## 8          1793               51486
## 9          1794               52794
## 10         1795               48121
## ..          ...                 ...

Notice that we get one row in the recombined data frame for each group in the original data frame. The value in the new column is the result of a function (in this case, sum()) applied to the columns in each of the split apart data frames.

There is also a special case where we might want to know how many rows were in each of the split apart (or grouped) data frames. We can use the special n() function to get that count. (Just like the case of slice() and top_n(), this is such a common thing to do that dplyr provides the special functions count() and tally(). You can look up their documentation to see how they work.)

methodists %>% 
  group_by(minutes_year) %>% 
  summarize(total_meetings = n())

## Source: local data frame [56 x 2]
## 
##    minutes_year total_meetings
##           (int)          (int)
## 1          1773              5
## 2          1774              9
## 3          1775             10
## 4          1776             12
## 5          1777             15
## 6          1779             22
## 7          1780             21
## 8          1781             24
## 9          1782             27
## 10         1783             37
## ..          ...            ...

How many meetings (i.e., churches or circuits) were there in each conference in each year since 1802?

methodists %>% 
  filter(minutes_year>= 1802) %>% 
  group_by(minutes_year, conference) %>% 
  summarize(total_meetings = n())

## Source: local data frame [310 x 3]
## Groups: minutes_year [?]
## 
##    minutes_year     conference total_meetings
##           (int)          (chr)          (int)
## 1          1802      Baltimore             30
## 2          1802    New England             20
## 3          1802       New York             36
## 4          1802   Philadelphia             43
## 5          1802 South Carolina             18
## 6          1802       Virginia             30
## 7          1802        Western             13
## 8          1803      Baltimore             30
## 9          1803    New England             22
## 10         1803       New York             39
## ..          ...            ...            ...

What is the average number of white, black, and average number of total members for each year since 1786?

methodists %>% 
  filter(minutes_year >= 1786) %>% 
  group_by(minutes_year) %>% 
  summarize(avg_mem_white = (sum(members_white, na.rm = TRUE)/n()), avg_mem_colored = (sum(members_colored, na.rm=TRUE)/n()), tot_avg = (sum(members_white, members_colored, na.rm = TRUE)/n()))

## Source: local data frame [45 x 4]
## 
##    minutes_year avg_mem_white avg_mem_colored  tot_avg
##           (int)         (dbl)           (dbl)    (dbl)
## 1          1786      358.6471        56.66667 415.3137
## 2          1787      399.0727        70.60000 469.6727
## 3          1788      402.0658       105.14474 507.2105
## 4          1789      409.8214       105.23810 515.0595
## 5          1790      464.4747       118.00000 582.4747
## 6          1791      401.4286       103.95238 505.3810
## 7          1792      385.7704       102.74815 488.5185
## 8          1793      365.1489       102.26950 467.4184
## 9          1794      361.6027        95.24658 456.8493
## 10         1795      353.8309        89.49265 443.3235
## ..          ...           ...             ...      ...

Make a plot of the average percentage of black members over time.

library(ggplot2)

methodists %>% 
  filter(minutes_year >= 1786) %>% 
  mutate(perc_colored = members_colored / (members_colored + members_white)) %>% 
  group_by(minutes_year) %>% 
  summarize(avg_perc_colored = sum(perc_colored, na.rm = TRUE)/n()) %>% 

ggplot(aes(x=minutes_year, y=avg_perc_colored)) + 
  geom_line()

Make a plot of the average and median number of members over time.

library(ggplot2)

methodists %>% 
  filter(minutes_year >= 1786) %>% 
  mutate(members_total = members_colored + members_white) %>% 
  group_by(minutes_year) %>% 
  summarize(avg_members = sum(members_total, na.rm = TRUE)/n(), median_members = median(members_total, na.rm=TRUE)) %>% 

ggplot(aes(x=minutes_year)) + 
  geom_line(aes(y = avg_members), color="blue") +
  geom_line(aes(y = median_members), color="red")

What was the average number of members in each conference for each year? Can you also make a plot of this?

library(ggplot2)

methodists %>% 
  filter(minutes_year >= 1786) %>% 
  mutate(members_total = members_colored + members_white) %>% 
  group_by(minutes_year, conference) %>% 
  summarize(avg_members = sum(members_total, na.rm = TRUE)/n(), median_members = median(members_total, na.rm=TRUE)) %>% 

ggplot(aes(x=minutes_year)) + 
  geom_line(aes(y = avg_members), color="blue") +
  geom_point(aes(y = avg_members), color="blue", size = 0.5) +
  facet_wrap(~conference)

## geom_path: Each group consists of only one observation. Do you need to
## adjust the group aesthetic?
## geom_path: Each group consists of only one observation. Do you need to
## adjust the group aesthetic?
## geom_path: Each group consists of only one observation. Do you need to
## adjust the group aesthetic?

What was the average percentage of black members in each conference for each year? Can you also make a plot of this?

library(ggplot2)

methodists %>% 
  filter(minutes_year >= 1786) %>% 
  mutate(avg_colored = members_colored / (members_colored + members_white)) %>% 
  group_by(minutes_year, conference) %>% 
  summarize(avg_perc_colored = sum(avg_colored, na.rm = TRUE)/n()) %>% 

ggplot(aes(x=minutes_year)) + 
  geom_line(aes(y = avg_perc_colored), color="blue") +
  geom_point(aes(y = avg_perc_colored), color="blue", size = 0.5) +
  facet_wrap(~conference)

## geom_path: Each group consists of only one observation. Do you need to
## adjust the group aesthetic?
## geom_path: Each group consists of only one observation. Do you need to
## adjust the group aesthetic?
## geom_path: Each group consists of only one observation. Do you need to
## adjust the group aesthetic?

Basic data manipulation with dplyr

Jordan Bratt

Aims of this worksheet

Loading CSV files

The pipe (%>%)

Selecting columns (select())

Filtering rows (filter())

Creating new columns (mutate())

Sorting columns (arrange())

Split-apply-combine (group_by())