Aggregating CA Cities to Counties

library(dplyr)

## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

options(stringsAsFactors = FALSE)
opts_chunk$set(tidy = FALSE)

ca <- read.csv("ca_cities_short.csv")

dplyr is purely functional, so can run any of its commands without changing your original data. You can see the five main verbs with ?manip; there are also commands for data joins. Use the pipe operator (%>%) to chain operations.

The key thing to wrap your mind around is the split-apply-combine model. The group_by operator splits a data frame into various pieces. You then apply a function, perhaps summarizing or counting the data. dplyr will then combine the separate data frames back together.

Let's group by county. To get the population figures for a county, we'll sum some of the population fields for each country grouping. (Remove the call to head() if you want the full data)

ca %>%
  group_by(COUNTY) %>%
  summarize(YR_1850 = sum(YR_1850),
            YR_1940 = sum(YR_1940),
            YR_2000 = sum(YR_2000)) %>%
  arrange(desc(YR_1850)) %>%
  head()

## Source: local data frame [6 x 4]
## 
##       COUNTY YR_1850 YR_1940 YR_2000
## 1  El Dorado    2269       0       5
## 2     Sutter     547       0       0
## 3       Yuba     504       0       0
## 4     Solano     480       3      43
## 5 Sacramento     335       0       0
## 6       Yolo     275       2      60

The nice thing about dplyr is that you can build your manipulation gradually. So if we took the same commands as above, we could then add another line to sort the rows:

ca %>%
  group_by(COUNTY) %>%
  summarize(YR_1850 = sum(YR_1850),
            YR_1940 = sum(YR_1940),
            YR_2000 = sum(YR_2000)) %>%
  arrange(desc(YR_1850)) %>%
  head()

## Source: local data frame [6 x 4]
## 
##       COUNTY YR_1850 YR_1940 YR_2000
## 1  El Dorado    2269       0       5
## 2     Sutter     547       0       0
## 3       Yuba     504       0       0
## 4     Solano     480       3      43
## 5 Sacramento     335       0       0
## 6       Yolo     275       2      60

If you wanted to count how many cities were in a county, you could do this using the special n() function:

ca %>%
  group_by(COUNTY) %>%
  summarize(cities = n()) %>%
  arrange(desc(cities)) %>%
  head()

## Source: local data frame [6 x 2]
## 
##      COUNTY cities
## 1    Placer     13
## 2 El Dorado     12
## 3 Mendocino      9
## 4 Calaveras      8
## 5     Butte      7
## 6    Nevada      7

Preserving properties when summarizing data could potentially be tricky, if for example the value of a property varied. That's why if you summarize you have to tell dplyr how to figure out the value. Here I've preserved for example the incorporation year (this doesn't actually make sense, of course, since the value doesn't map on to the county):

ca %>%
  group_by(COUNTY) %>%
  summarize(YR_1850 = sum(YR_1850),
            YR_1940 = sum(YR_1940),
            YR_2000 = sum(YR_2000),
            INCORP_YR = max(INCORP_YR)) %>%
  arrange(desc(YR_1850)) %>%
  head()

## Source: local data frame [6 x 5]
## 
##       COUNTY YR_1850 YR_1940 YR_2000 INCORP_YR
## 1  El Dorado    2269       0       5         0
## 2     Sutter     547       0       0         0
## 3       Yuba     504       0       0         0
## 4     Solano     480       3      43      1878
## 5 Sacramento     335       0       0         0
## 6       Yolo     275       2      60      1917

Or you could create new statistics. This calculates the growth from one year to the next, then shows the cities that grew the most:

ca %>%
  mutate(growth = YR_1860 - YR_1850) %>%
  select(NAME, COUNTY, growth) %>%
  arrange(desc(growth)) %>%
  head()

##            NAME         COUNTY growth
## 1   Downieville         Sierra    940
## 2        Auburn         Placer    811
## 3 Crescent City      Del Norte    638
## 4      American      Riverside    629
## 5     Big River      Mendocino    473
## 6     Big River San Bernardino    473

There's lots more, but the vignettes and docs are pretty helpful.