One of the key reasons to use R is to be able to manipulate data with ease. After completing this worksheet you will be able to work with the most commonly used data manipulation verbs provided by the dplyr package. In addition, you will learn how to read a CSV file into R, and learn how to use the pipe operator.
Since these packages were written by Hadley Wickham, you may find his articles on “Tidy Data” and “The Split-Apply-Combine Strategy for Data Analysis” to be useful. In addition, you should read the package documentation and vignettes for the dplyr package.
We will begin by loading the necessary packages:
library(dplyr)
library(historydata)
So far we have worked almost exclusively with data which is contained in R packages, which we can install and then load. However, most data will come to you outside of an R package, so you must learn how to load that data.
Often tabular data will come to you in the form of a CSV (comma separated values) file. This is a text file where, as the name indicates, the columns in the data are separated with columns. This is a much better way of receiving data than say, an Excel spreadsheet.
R has a number of functions that can read in data in various formats. We will use one, read_csv(), from the readr package, which works nicely with dplyr and tidyr. But there is an equivalent function in base R called read.csv(). See ?readr::read_csv or ?read.csv for details.
Because R is awesome, we can pass it the URL to a file (in addition to the local path to a file) and it will be loaded. (You could also download the file with download.file().)
library(readr)
methodists <- read_csv("http://lincolnmullen.com/projects/worksheets/data/methodists.csv")
%>%)The packages that we will be working with all take advantage of an operator provided by the magrittr package. Called the pipe, this operator %>% passes the output of one function to the first argument of a second function.
Consider this example from an earlier exercise, where we find the earliest and latest dates from a column in a data frame.
range(as.Date(naval_promotions$date), na.rm = TRUE)
## [1] "1794-06-04" "1905-02-12"
Here we have to read the expression from the inside out, from the naval_promotions$date, to the as.Date() function, to the range() function. The pipe operator lets us write an equivalent expression which is more legible.
naval_promotions$date %>%
as.Date() %>%
range(na.rm = TRUE)
## [1] "1794-06-04" "1905-02-12"
select())Now we can begin to use the Methodists data. This data contains membership statistics for different churches, circuits, and class meetings. (This data is transcribed from the Methodist annual minutes meeting, and so it uses the racial terminology from those primary sources.) First can get a sense of what the data looks like.
methodists
## Source: local data frame [15,887 x 10]
##
## minutes_year conference district meeting state members_general
## (int) (chr) (chr) (chr) (chr) (int)
## 1 1773 NA NA New York NA 180
## 2 1773 NA NA Philadelphia NA 180
## 3 1773 NA NA New Jersey NA 200
## 4 1773 NA NA Maryland NA 500
## 5 1773 NA NA Virginia NA 100
## 6 1774 NA NA New York NA 222
## 7 1774 NA NA Philadelphia NA 204
## 8 1774 NA NA New Jersey NA 257
## 9 1774 NA NA Chester NA 36
## 10 1774 NA NA Baltimore NA 738
## .. ... ... ... ... ... ...
## Variables not shown: members_white (int), members_colored (int),
## members_indian (chr), url (chr)
The first data manipulation verb that we are going to use is select(). This function lets us pass the names of the columns that we want to keep.
methodists %>%
select(minutes_year, meeting, members_general)
## Source: local data frame [15,887 x 3]
##
## minutes_year meeting members_general
## (int) (chr) (int)
## 1 1773 New York 180
## 2 1773 Philadelphia 180
## 3 1773 New Jersey 200
## 4 1773 Maryland 500
## 5 1773 Virginia 100
## 6 1774 New York 222
## 7 1774 Philadelphia 204
## 8 1774 New Jersey 257
## 9 1774 Chester 36
## 10 1774 Baltimore 738
## .. ... ... ...
Notice that we have not actually changed the data stored in methodists until we assign the changed data back to the variable.
Read the documentation for this function, ?select.
minutes_year, meeting, and all the columns that begin with the word members_.methodists %>%
select(minutes_year, meeting, starts_with("members_"))
## Source: local data frame [15,887 x 6]
##
## minutes_year meeting members_general members_white members_colored
## (int) (chr) (int) (int) (int)
## 1 1773 New York 180 NA NA
## 2 1773 Philadelphia 180 NA NA
## 3 1773 New Jersey 200 NA NA
## 4 1773 Maryland 500 NA NA
## 5 1773 Virginia 100 NA NA
## 6 1774 New York 222 NA NA
## 7 1774 Philadelphia 204 NA NA
## 8 1774 New Jersey 257 NA NA
## 9 1774 Chester 36 NA NA
## 10 1774 Baltimore 738 NA NA
## .. ... ... ... ... ...
## Variables not shown: members_indian (chr)
url.methodists %>%
select(-url)
## Source: local data frame [15,887 x 9]
##
## minutes_year conference district meeting state members_general
## (int) (chr) (chr) (chr) (chr) (int)
## 1 1773 NA NA New York NA 180
## 2 1773 NA NA Philadelphia NA 180
## 3 1773 NA NA New Jersey NA 200
## 4 1773 NA NA Maryland NA 500
## 5 1773 NA NA Virginia NA 100
## 6 1774 NA NA New York NA 222
## 7 1774 NA NA Philadelphia NA 204
## 8 1774 NA NA New Jersey NA 257
## 9 1774 NA NA Chester NA 36
## 10 1774 NA NA Baltimore NA 738
## .. ... ... ... ... ... ...
## Variables not shown: members_white (int), members_colored (int),
## members_indian (chr)
filter())The select() function lets us pick certain columns. The filter() function lets select certain rows based on logical conditions. For example, here we get the only the churches where the total number of members is at greater than 1,000.
methodists %>%
filter(members_general > 1000)
## Source: local data frame [4 x 10]
##
## minutes_year conference district meeting state members_general
## (int) (chr) (chr) (chr) (chr) (int)
## 1 1776 NA NA Brunswick NA 1611
## 2 1777 NA NA Brunswick NA 1360
## 3 1781 NA NA Delaware NA 1052
## 4 1783 NA NA Dover NA 1017
## Variables not shown: members_white (int), members_colored (int),
## members_indian (chr), url (chr)
methodists %>%
filter(state == "New York",
minutes_year == 1800)
## Source: local data frame [18 x 10]
##
## minutes_year conference district meeting state
## (int) (chr) (chr) (chr) (chr)
## 1 1800 NA NA Albany City New York
## 2 1800 NA NA Albany Circuit New York
## 3 1800 NA NA Brooklyn New York
## 4 1800 NA NA Cambridge New York
## 5 1800 NA NA Chenango New York
## 6 1800 NA NA Columbia New York
## 7 1800 NA NA Delaware New York
## 8 1800 NA NA Dutchess New York
## 9 1800 NA NA Herkimer New York
## 10 1800 NA NA Long Island New York
## 11 1800 NA NA Mohawk New York
## 12 1800 NA NA Newburg New York
## 13 1800 NA NA New Rochelle and Croton New York
## 14 1800 NA NA New York New York
## 15 1800 NA NA Oneida and Cayuga New York
## 16 1800 NA NA Plattsburg New York
## 17 1800 NA NA Saratoga New York
## 18 1800 NA NA Seneca New York
## Variables not shown: members_general (int), members_white (int),
## members_colored (int), members_indian (chr), url (chr)
methodists %>%
filter(members_white == 0)
## Source: local data frame [8 x 10]
##
## minutes_year conference district
## (int) (chr) (chr)
## 1 1786 NA NA
## 2 1787 NA NA
## 3 1819 New York New York
## 4 1820 New York New York
## 5 1823 New York New York
## 6 1830 South Carolina Augusta
## 7 1830 South Carolina Charleston
## 8 1830 South Carolina Charleston
## Variables not shown: meeting (chr), state (chr), members_general (int),
## members_white (int), members_colored (int), members_indian (chr), url
## (chr)
Of the 8 records returned, 5 of them have NA for
members_native. It would be a judgement call as to whether you would include those 5 records as only having black members or not.
The Methodists gradually kept more detailed records. They began by keeping track of the total number of members, then the number of white and black members, and eventually the number of white, black, and Indian members. But when they kept track of those racial divisions, they did not report the total number of members.
is.na() to find when the missing values begin.)methodists %>%
filter(!is.na(members_white))
## Source: local data frame [15,583 x 10]
##
## minutes_year conference district meeting state members_general
## (int) (chr) (chr) (chr) (chr) (int)
## 1 1786 NA NA Portsmouth NA NA
## 2 1786 NA NA Sussex NA NA
## 3 1786 NA NA Brunswick NA NA
## 4 1786 NA NA Amelia NA NA
## 5 1786 NA NA Mecklenburg NA NA
## 6 1786 NA NA Bedford NA NA
## 7 1786 NA NA Orange NA NA
## 8 1786 NA NA Williamsburg NA NA
## 9 1786 NA NA Alleghany NA NA
## 10 1786 NA NA Berkley NA NA
## .. ... ... ... ... ... ...
## Variables not shown: members_white (int), members_colored (int),
## members_indian (chr), url (chr)
stringr package and look at the documentation for ?str_detect. Can you find which rows represent missions?library (stringr)
methodists %>%
filter(str_detect(meeting, "mission"))
## Source: local data frame [82 x 10]
##
## minutes_year conference district meeting
## (int) (chr) (chr) (chr)
## 1 1823 South Carolina Catawba Early county mission
## 2 1823 Genesee Upper Canada Grand River mission
## 3 1824 Ohio Lebanon Indian mission
## 4 1824 Kentucky Cumberland Cumberland mission
## 5 1824 Tennessee Huntsville Cherokee mission, Indiana
## 6 1824 South Carolina Athens Monroe mission
## 7 1824 South Carolina Athens Yellow River mission
## 8 1824 South Carolina Athens Gwinnett mission
## 9 1824 South Carolina Oconee St. Augustine mission
## 10 1824 South Carolina Oconee Chatahoochee mission
## .. ... ... ... ...
## Variables not shown: state (chr), members_general (int), members_white
## (int), members_colored (int), members_indian (chr), url (chr)
mutate())Very often one will want to create a new column based on other columns in the data. For instance, in our Methodist data, there is a column called minutes_year for the year that the minutes were kept. But the actual data was for the previous year. Here we create a new column called year, where each value is one less than in minutes_year.
methodists %>%
mutate(year = minutes_year - 1) %>%
select(minutes_year, year, meeting)
## Source: local data frame [15,887 x 3]
##
## minutes_year year meeting
## (int) (dbl) (chr)
## 1 1773 1772 New York
## 2 1773 1772 Philadelphia
## 3 1773 1772 New Jersey
## 4 1773 1772 Maryland
## 5 1773 1772 Virginia
## 6 1774 1773 New York
## 7 1774 1773 Philadelphia
## 8 1774 1773 New Jersey
## 9 1774 1773 Chester
## 10 1774 1773 Baltimore
## .. ... ... ...
Notice that we chained the data manipulation functions using the pipe (%>%). This lets us create a pipeline were we can do many different manipulations in a row.
members_total values for those years are NA. Assign the the value of members_white plus members_colored to the members_total column.methodists %>%
filter(minutes_year >= 1786) %>%
mutate(members_total = members_white + members_colored) %>%
select(minutes_year, members_white,members_colored, members_total)
## Source: local data frame [15,663 x 4]
##
## minutes_year members_white members_colored members_total
## (int) (int) (int) (int)
## 1 1786 330 26 356
## 2 1786 416 72 488
## 3 1786 305 59 364
## 4 1786 382 30 412
## 5 1786 392 37 429
## 6 1786 524 16 540
## 7 1786 374 75 449
## 8 1786 167 11 178
## 9 1786 350 18 368
## 10 1786 140 26 166
## .. ... ... ... ...
methodists %>%
filter(minutes_year >= 1786) %>%
mutate(perc_afam = members_colored / (members_colored + members_white)) %>%
mutate(perc_whit = members_white / (members_colored + members_white)) %>%
select(minutes_year, perc_afam, perc_whit)
## Source: local data frame [15,663 x 3]
##
## minutes_year perc_afam perc_whit
## (int) (dbl) (dbl)
## 1 1786 0.07303371 0.9269663
## 2 1786 0.14754098 0.8524590
## 3 1786 0.16208791 0.8379121
## 4 1786 0.07281553 0.9271845
## 5 1786 0.08624709 0.9137529
## 6 1786 0.02962963 0.9703704
## 7 1786 0.16703786 0.8329621
## 8 1786 0.06179775 0.9382022
## 9 1786 0.04891304 0.9510870
## 10 1786 0.15662651 0.8433735
## .. ... ... ...
members_indian above, which should be counted in the members_total column. However, for some years the values of that column are NA, and in other years they have a numeric value. The difference is that the years where the value is NA, that column does not appear in the data because it was not tracked. How can you add that column to the total without running into problems with the missing values? (Hints: One way to do this is to use the ifelse() function to conditionally assign a value to members_total. Another way to do this is to group the data frame using rowwise() and use sum() with the na.rm = TRUE argument. A third, less elegant way, would be to split the data frame manually based on whether the value is NA or not, then recombine them.)arrange())Often we want to sort a data frame by one of its columns. This can be done with the verb arrange(). By default arrange() will sort from least to greatest; we can use the function desc() to sort from greatest to least. In this example, we sort the data frame to get the circuits with the highest number of white members.
methodists %>%
arrange(desc(members_white))
## Source: local data frame [15,887 x 10]
##
## minutes_year conference district meeting state members_general
## (int) (chr) (chr) (chr) (chr) (int)
## 1 1825 Ohio Lancaster Muskingum NA NA
## 2 1830 New York New York New York NA NA
## 3 1829 New York New York New York NA NA
## 4 1828 New York New York New York NA NA
## 5 1830 Baltimore Baltimore Baltimore NA NA
## 6 1825 Ohio Lancaster Hockhocking NA NA
## 7 1827 New York New York New York NA NA
## 8 1829 Baltimore Baltimore Baltimore NA NA
## 9 1828 Baltimore Baltimore Baltimore NA NA
## 10 1826 New York New York New York NA NA
## .. ... ... ... ... ... ...
## Variables not shown: members_white (int), members_colored (int),
## members_indian (chr), url (chr)
methodists %>%
arrange(desc(members_colored)) %>%
select(minutes_year, members_colored, members_general)
## Source: local data frame [15,887 x 3]
##
## minutes_year members_colored members_general
## (int) (int) (int)
## 1 1817 5699 NA
## 2 1816 5313 NA
## 3 1815 3796 NA
## 4 1813 3604 NA
## 5 1814 3418 NA
## 6 1830 3163 NA
## 7 1812 3128 NA
## 8 1829 3000 NA
## 9 1825 2747 NA
## 10 1826 2712 NA
## .. ... ... ...
NA because only the number of white and black members is supplied?methodists %>%
filter(minutes_year>=1786) %>%
mutate(members_total = members_white + members_colored) %>%
select(minutes_year, members_white, members_colored, members_total) %>%
arrange(desc(members_colored))
## Source: local data frame [15,663 x 4]
##
## minutes_year members_white members_colored members_total
## (int) (int) (int) (int)
## 1 1817 350 5699 6049
## 2 1816 345 5313 5658
## 3 1815 262 3796 4058
## 4 1813 321 3604 3925
## 5 1814 279 3418 3697
## 6 1830 548 3163 3711
## 7 1812 282 3128 3410
## 8 1829 500 3000 3500
## 9 1825 431 2747 3178
## 10 1826 441 2712 3153
## .. ... ... ... ...
methodists %>%
filter(minutes_year>=1786) %>%
mutate(perc_afam = members_colored / (members_colored + members_white)) %>%
arrange(desc(perc_afam)) %>%
filter(!members_white == 0) %>%
select(minutes_year, perc_afam, members_colored, members_white)
## Source: local data frame [15,575 x 4]
##
## minutes_year perc_afam members_colored members_white
## (int) (dbl) (int) (int)
## 1 1830 0.9952153 208 1
## 2 1799 0.9627660 181 7
## 3 1798 0.9625668 180 7
## 4 1816 0.9581182 1670 73
## 5 1829 0.9574468 45 2
## 6 1800 0.9570815 223 10
## 7 1801 0.9503311 287 15
## 8 1815 0.9501823 1564 82
## 9 1805 0.9484778 405 22
## 10 1828 0.9478944 1328 73
## .. ... ... ... ...
group_by())Notice that in the example above the arrange() function sorted the entire data frame. So when we looked for the circuits with the largest number of members, we got rows from 1825, then 1830, then 1829, then 1830, and so on. What if we wanted to get the biggest circuit from each year?
We can solve this kind of problem with what Hadley Wickham calls the “split-apply-combine” pattern of data analysis. Think of it this way. First we can split the big data frame into separate data frames, one for each year. Then we can apply our logic to get the results we want; in this case, that means sorting the data frame. We might also want to get just the top one row with the biggest number of members. Then we can combine those split apart data frames into a new data frame.
Observe how this works. If we want to get circuit with the most black members in each year, we can use the following code:
methodists %>%
filter(minutes_year >= 1786) %>%
select(minutes_year, meeting, starts_with("members")) %>%
group_by(minutes_year) %>%
arrange(desc(members_colored)) %>%
slice(1)
## Source: local data frame [45 x 6]
## Groups: minutes_year [45]
##
## minutes_year meeting members_general members_white members_colored
## (int) (chr) (int) (int) (int)
## 1 1786 Antigua NA 0 1000
## 2 1787 Kent NA 607 604
## 3 1788 Calvert NA 505 842
## 4 1789 Calvert NA 943 909
## 5 1790 Calvert NA 814 1170
## 6 1791 Calvert NA 760 1329
## 7 1792 Calvert NA 700 1200
## 8 1793 Surry NA 814 955
## 9 1794 Calvert NA 682 1102
## 10 1795 Calvert NA 602 924
## .. ... ... ... ... ...
## Variables not shown: members_indian (chr)
Let’s walk through that logic step by step.
filter(), since the data does not keep track of the number of white and black members until that year.group_by() the minutes_year. This creates a new data-frame (the split step) for each unique combination of values in the variables. Notice that the printed our result says that there are 45 groups, i.e., one for each year from 1786 to 1830. (Note that you can group by combinations of columns, so, one group for each combination of city and state, for instance.)members_colored in descending order. This puts the rows with the biggest value at the top.slice(). This function simply gives us the rows in each of the split-up data frames with that index. So slice(1) gives us the first row, slice(5) gives us this fifth row, and slice(1:5) gives us the first through fifth rows.ungroup().This particular operation, getting the top value in a split up data frame is so common that dplyr provides us with a top_n() function as a short cut. That function also handles ties better. (What if, for instance, two circuits both have the same biggest value?)
methodists %>%
filter(minutes_year >= 1786) %>%
select(minutes_year, meeting, starts_with("members")) %>%
group_by(minutes_year) %>%
top_n(1, members_colored)
## Source: local data frame [45 x 6]
## Groups: minutes_year [45]
##
## minutes_year meeting members_general members_white members_colored
## (int) (chr) (int) (int) (int)
## 1 1786 Antigua NA 0 1000
## 2 1787 Kent NA 607 604
## 3 1788 Calvert NA 505 842
## 4 1789 Calvert NA 943 909
## 5 1790 Calvert NA 814 1170
## 6 1791 Calvert NA 760 1329
## 7 1792 Calvert NA 700 1200
## 8 1793 Surry NA 814 955
## 9 1794 Calvert NA 682 1102
## 10 1795 Calvert NA 602 924
## .. ... ... ... ... ...
## Variables not shown: members_indian (chr)
We get the same results more concisely and reliably, though the steps of “split-apply-combine” are perhaps somewhat less easy to see.
methodists %>%
filter(minutes_year>= 1786) %>%
mutate(members_total = members_colored + members_white) %>%
select(minutes_year, conference, district, meeting, members_total) %>%
group_by(minutes_year) %>%
top_n(1, members_total)
## Source: local data frame [45 x 5]
## Groups: minutes_year [45]
##
## minutes_year conference district meeting members_total
## (int) (chr) (chr) (chr) (int)
## 1 1786 NA NA Kent 1013
## 2 1787 NA NA Talbot 1601
## 3 1788 NA NA Sussex 1611
## 4 1789 NA NA Calvert 1852
## 5 1790 NA NA Calvert 1984
## 6 1791 NA NA Calvert 2089
## 7 1792 NA NA Calvert 1900
## 8 1793 NA NA Surry 1769
## 9 1794 NA NA Sussex 2354
## 10 1795 NA NA Calvert 1526
## .. ... ... ... ... ...
methodists %>%
filter(minutes_year >= 1786, !members_white==0) %>%
mutate(perc_afam = members_colored/(members_colored + members_white)) %>%
select(minutes_year, meeting, members_colored, members_white, perc_afam) %>%
group_by(minutes_year) %>%
arrange(desc(perc_afam)) %>%
slice(1)
## Source: local data frame [45 x 5]
## Groups: minutes_year [45]
##
## minutes_year meeting members_colored members_white perc_afam
## (int) (chr) (int) (int) (dbl)
## 1 1786 Calvert 316 295 0.5171849
## 2 1787 Charleston 53 34 0.6091954
## 3 1788 Calvert 842 505 0.6250928
## 4 1789 Mecklenburg 692 98 0.8759494
## 5 1790 Annapolis 185 122 0.6026059
## 6 1791 Charleston 119 66 0.6432432
## 7 1792 Georgetown 100 49 0.6711409
## 8 1793 Prince George's 225 65 0.7758621
## 9 1794 Charleston 220 60 0.7857143
## 10 1795 Charleston 280 65 0.8115942
## .. ... ... ... ... ...
methodists %>%
filter(minutes_year == 1825) %>%
select(conference, district, meeting, members_white, members_colored) %>%
mutate(members_total=members_colored + members_white) %>%
group_by(conference) %>%
arrange(desc(members_total)) %>%
slice(1)
## Source: local data frame [15 x 6]
## Groups: conference [15]
##
## conference district meeting members_white
## (chr) (chr) (chr) (int)
## 1 Baltimore Baltimore Baltimore city 2639
## 2 Canada Bay of Quinte Augusta 590
## 3 Genesee Ontario Lyons 1000
## 4 Holston Holston Holston 1700
## 5 Kentucky Kentucky Lexington cir. 750
## 6 Maine Portland Livermore 668
## 7 Mississippi Cahawba Tuscaloosa 735
## 8 Missouri Indiana Charleston 1000
## 9 New England New London Hebron and New London 660
## 10 New York New York New York city 2567
## 11 Ohio Lancaster Muskingum 7775
## 12 Philadelphia Philadelphia St. George's 1994
## 13 South Carolina Edisto Charleston 431
## 14 Tennessee Nashville Lebanon 1050
## 15 Virginia Norfolk Princess Ann 1023
## Variables not shown: members_colored (int), members_total (int)
methodists %>%
filter(minutes_year == 1825) %>%
select(conference, district, meeting, members_white, members_colored) %>%
mutate(members_total=members_colored + members_white) %>%
group_by(district) %>%
arrange(desc(members_total)) %>%
slice(1)
## Source: local data frame [74 x 6]
## Groups: district [74]
##
## conference district meeting members_white
## (chr) (chr) (chr) (int)
## 1 Mississippi Alabama Alabama 639
## 2 Missouri Arkansas Mount Prairie 258
## 3 South Carolina Asbury Asbury 2
## 4 South Carolina Athens Cedar Creek 910
## 5 Kentucky Augusta Fleming 873
## 6 Baltimore Baltimore Baltimore city 2639
## 7 Canada Bay of Quinte Augusta 590
## 8 Genesee Black River Le Roy 630
## 9 New England Boston Boston 550
## 10 South Carolina Broad River Santee 647
## .. ... ... ... ...
## Variables not shown: members_colored (int), members_total (int)
methodists %>%
filter(conference == "Baltimore") %>%
mutate(members_total = members_colored + members_white) %>%
select(minutes_year, conference, district, meeting, members_total) %>%
group_by(minutes_year) %>%
arrange(desc(members_total)) %>%
slice(1:3)
## Source: local data frame [87 x 5]
## Groups: minutes_year [29]
##
## minutes_year conference district meeting members_total
## (int) (chr) (chr) (chr) (int)
## 1 1802 Baltimore Baltimore Calvert 1612
## 2 1802 Baltimore Baltimore Montgomery 1362
## 3 1802 Baltimore Baltimore Baltimore circuit 1314
## 4 1803 Baltimore Baltimore Calvert 2052
## 5 1803 Baltimore Baltimore Baltimore cir. 1718
## 6 1803 Baltimore Baltimore Montgomery 1376
## 7 1804 Baltimore Baltimore Calvert 2457
## 8 1804 Baltimore Baltimore Baltimore 1857
## 9 1804 Baltimore Baltimore Baltimore circuit 1392
## 10 1805 Baltimore Baltimore Calvert 2531
## .. ... ... ... ... ...
summarize())In the examples using top_n() or slice() we performed a very simple kind of data summary, where we took the single row with the biggest value in a given column. This essentially boiled many rows of a data frame down into a single row. We would like to be able to summarize or aggregate a data frame in other ways as well. For instance, we often want to take the sum or the mean of a given column. We can do this using the summarize() function in conjunction with the group_by() function.
In this example, we group by the year the minutes were taken. Then we find the total number of white members for each year.
methodists %>%
filter(minutes_year >= 1786) %>%
group_by(minutes_year) %>%
summarize(total_members_white = sum(members_white, na.rm = TRUE))
## Source: local data frame [45 x 2]
##
## minutes_year total_members_white
## (int) (int)
## 1 1786 18291
## 2 1787 21949
## 3 1788 30557
## 4 1789 34425
## 5 1790 45983
## 6 1791 50580
## 7 1792 52079
## 8 1793 51486
## 9 1794 52794
## 10 1795 48121
## .. ... ...
Notice that we get one row in the recombined data frame for each group in the original data frame. The value in the new column is the result of a function (in this case, sum()) applied to the columns in each of the split apart data frames.
There is also a special case where we might want to know how many rows were in each of the split apart (or grouped) data frames. We can use the special n() function to get that count. (Just like the case of slice() and top_n(), this is such a common thing to do that dplyr provides the special functions count() and tally(). You can look up their documentation to see how they work.)
methodists %>%
group_by(minutes_year) %>%
summarize(total_meetings = n())
## Source: local data frame [56 x 2]
##
## minutes_year total_meetings
## (int) (int)
## 1 1773 5
## 2 1774 9
## 3 1775 10
## 4 1776 12
## 5 1777 15
## 6 1779 22
## 7 1780 21
## 8 1781 24
## 9 1782 27
## 10 1783 37
## .. ... ...
methodists %>%
filter(minutes_year>= 1802) %>%
group_by(minutes_year, conference) %>%
summarize(total_meetings = n())
## Source: local data frame [310 x 3]
## Groups: minutes_year [?]
##
## minutes_year conference total_meetings
## (int) (chr) (int)
## 1 1802 Baltimore 30
## 2 1802 New England 20
## 3 1802 New York 36
## 4 1802 Philadelphia 43
## 5 1802 South Carolina 18
## 6 1802 Virginia 30
## 7 1802 Western 13
## 8 1803 Baltimore 30
## 9 1803 New England 22
## 10 1803 New York 39
## .. ... ... ...
methodists %>%
filter(minutes_year >= 1786) %>%
group_by(minutes_year) %>%
summarize(avg_mem_white = (sum(members_white, na.rm = TRUE)/n()), avg_mem_colored = (sum(members_colored, na.rm=TRUE)/n()), tot_avg = (sum(members_white, members_colored, na.rm = TRUE)/n()))
## Source: local data frame [45 x 4]
##
## minutes_year avg_mem_white avg_mem_colored tot_avg
## (int) (dbl) (dbl) (dbl)
## 1 1786 358.6471 56.66667 415.3137
## 2 1787 399.0727 70.60000 469.6727
## 3 1788 402.0658 105.14474 507.2105
## 4 1789 409.8214 105.23810 515.0595
## 5 1790 464.4747 118.00000 582.4747
## 6 1791 401.4286 103.95238 505.3810
## 7 1792 385.7704 102.74815 488.5185
## 8 1793 365.1489 102.26950 467.4184
## 9 1794 361.6027 95.24658 456.8493
## 10 1795 353.8309 89.49265 443.3235
## .. ... ... ... ...
library(ggplot2)
methodists %>%
filter(minutes_year >= 1786) %>%
mutate(perc_colored = members_colored / (members_colored + members_white)) %>%
group_by(minutes_year) %>%
summarize(avg_perc_colored = sum(perc_colored, na.rm = TRUE)/n()) %>%
ggplot(aes(x=minutes_year, y=avg_perc_colored)) +
geom_line()
library(ggplot2)
methodists %>%
filter(minutes_year >= 1786) %>%
mutate(members_total = members_colored + members_white) %>%
group_by(minutes_year) %>%
summarize(avg_members = sum(members_total, na.rm = TRUE)/n(), median_members = median(members_total, na.rm=TRUE)) %>%
ggplot(aes(x=minutes_year)) +
geom_line(aes(y = avg_members), color="blue") +
geom_line(aes(y = median_members), color="red")
library(ggplot2)
methodists %>%
filter(minutes_year >= 1786) %>%
mutate(members_total = members_colored + members_white) %>%
group_by(minutes_year, conference) %>%
summarize(avg_members = sum(members_total, na.rm = TRUE)/n(), median_members = median(members_total, na.rm=TRUE)) %>%
ggplot(aes(x=minutes_year)) +
geom_line(aes(y = avg_members), color="blue") +
geom_point(aes(y = avg_members), color="blue", size = 0.5) +
facet_wrap(~conference)
## geom_path: Each group consists of only one observation. Do you need to
## adjust the group aesthetic?
## geom_path: Each group consists of only one observation. Do you need to
## adjust the group aesthetic?
## geom_path: Each group consists of only one observation. Do you need to
## adjust the group aesthetic?
library(ggplot2)
methodists %>%
filter(minutes_year >= 1786) %>%
mutate(avg_colored = members_colored / (members_colored + members_white)) %>%
group_by(minutes_year, conference) %>%
summarize(avg_perc_colored = sum(avg_colored, na.rm = TRUE)/n()) %>%
ggplot(aes(x=minutes_year)) +
geom_line(aes(y = avg_perc_colored), color="blue") +
geom_point(aes(y = avg_perc_colored), color="blue", size = 0.5) +
facet_wrap(~conference)
## geom_path: Each group consists of only one observation. Do you need to
## adjust the group aesthetic?
## geom_path: Each group consists of only one observation. Do you need to
## adjust the group aesthetic?
## geom_path: Each group consists of only one observation. Do you need to
## adjust the group aesthetic?