library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
options(stringsAsFactors = FALSE)
opts_chunk$set(tidy = FALSE)
ca <- read.csv("ca_cities_short.csv")
dplyr is purely functional, so can run any of its commands without changing your original data. You can see the five main verbs with ?manip; there are also commands for data joins. Use the pipe operator (%>%) to chain operations.
The key thing to wrap your mind around is the split-apply-combine model. The group_by operator splits a data frame into various pieces. You then apply a function, perhaps summarizing or counting the data. dplyr will then combine the separate data frames back together.
Let's group by county. To get the population figures for a county, we'll sum some of the population fields for each country grouping. (Remove the call to head() if you want the full data)
ca %>%
group_by(COUNTY) %>%
summarize(YR_1850 = sum(YR_1850),
YR_1940 = sum(YR_1940),
YR_2000 = sum(YR_2000)) %>%
arrange(desc(YR_1850)) %>%
head()
## Source: local data frame [6 x 4]
##
## COUNTY YR_1850 YR_1940 YR_2000
## 1 El Dorado 2269 0 5
## 2 Sutter 547 0 0
## 3 Yuba 504 0 0
## 4 Solano 480 3 43
## 5 Sacramento 335 0 0
## 6 Yolo 275 2 60
The nice thing about dplyr is that you can build your manipulation gradually. So if we took the same commands as above, we could then add another line to sort the rows:
ca %>%
group_by(COUNTY) %>%
summarize(YR_1850 = sum(YR_1850),
YR_1940 = sum(YR_1940),
YR_2000 = sum(YR_2000)) %>%
arrange(desc(YR_1850)) %>%
head()
## Source: local data frame [6 x 4]
##
## COUNTY YR_1850 YR_1940 YR_2000
## 1 El Dorado 2269 0 5
## 2 Sutter 547 0 0
## 3 Yuba 504 0 0
## 4 Solano 480 3 43
## 5 Sacramento 335 0 0
## 6 Yolo 275 2 60
If you wanted to count how many cities were in a county, you could do this using the special n() function:
ca %>%
group_by(COUNTY) %>%
summarize(cities = n()) %>%
arrange(desc(cities)) %>%
head()
## Source: local data frame [6 x 2]
##
## COUNTY cities
## 1 Placer 13
## 2 El Dorado 12
## 3 Mendocino 9
## 4 Calaveras 8
## 5 Butte 7
## 6 Nevada 7
Preserving properties when summarizing data could potentially be tricky, if for example the value of a property varied. That's why if you summarize you have to tell dplyr how to figure out the value. Here I've preserved for example the incorporation year (this doesn't actually make sense, of course, since the value doesn't map on to the county):
ca %>%
group_by(COUNTY) %>%
summarize(YR_1850 = sum(YR_1850),
YR_1940 = sum(YR_1940),
YR_2000 = sum(YR_2000),
INCORP_YR = max(INCORP_YR)) %>%
arrange(desc(YR_1850)) %>%
head()
## Source: local data frame [6 x 5]
##
## COUNTY YR_1850 YR_1940 YR_2000 INCORP_YR
## 1 El Dorado 2269 0 5 0
## 2 Sutter 547 0 0 0
## 3 Yuba 504 0 0 0
## 4 Solano 480 3 43 1878
## 5 Sacramento 335 0 0 0
## 6 Yolo 275 2 60 1917
Or you could create new statistics. This calculates the growth from one year to the next, then shows the cities that grew the most:
ca %>%
mutate(growth = YR_1860 - YR_1850) %>%
select(NAME, COUNTY, growth) %>%
arrange(desc(growth)) %>%
head()
## NAME COUNTY growth
## 1 Downieville Sierra 940
## 2 Auburn Placer 811
## 3 Crescent City Del Norte 638
## 4 American Riverside 629
## 5 Big River Mendocino 473
## 6 Big River San Bernardino 473
There's lots more, but the vignettes and docs are pretty helpful.