Visualization Essay: Papers of the War Department

library(magrittr)
library(ggplot2)
library(ggmap)

## 
## Attaching package: 'ggmap'

## The following object is masked from 'package:magrittr':
## 
##     inset

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(USAboundaries)
library(sp)
library(broom)
library(maptools)

## Checking rgeos availability: TRUE

library(lubridate)

## 
## Attaching package: 'lubridate'

## The following object is masked from 'package:base':
## 
##     date

library(gridExtra)

## 
## Attaching package: 'gridExtra'

## The following object is masked from 'package:dplyr':
## 
##     combine

library(rgdal)

## rgdal: version: 1.1-8, (SVN revision 616)
##  Geospatial Data Abstraction Library extensions to R successfully loaded
##  Loaded GDAL runtime: GDAL 1.9.2, released 2012/10/08
##  Path to GDAL shared files: /usr/share/gdal
##  Loaded PROJ.4 runtime: Rel. 4.8.0, 6 March 2012, [PJ_VERSION: 480]
##  Path to PROJ.4 shared files: (autodetected)
##  Linking to sp version: 1.2-2

library(RColorBrewer)
library(stringr)
library(readr)

I am analyzing data from volunteers who have signed up to transcribe for the Papers of the War Department (PWD), a project of the Public Projects Division at the Roy Rosenzweig Center for History and New Media. Please note that the dataset I imported is not a complete representation of all volunteers who have signed up to transcribe with the PWD. I duplicated our spreadsheet of volunteers who have signed up for a transcriber account, and then went through the entirety of the data and made the following modifications: I removed any duplicates (volunteers who have signed up multiple times at once); I removed any volunteers who specified that they were from the Center and using their account for testing purposes; I removed any zip code extensions volunteers provided; and I removed those volunteers who either did not provide a zip code or who were residing in countries other than the United States. I also removed the columns that included names and email addresses prior to publishing my Google sheet.

pwdregistration <- read.csv("/tmp/RtmpKqYAWk/data5a4279ee78b", colClasses=c("ZCTA5CE10"="character"))

Part One

Summary of the data

summary(pwdregistration)

##       timestamp                     affiliation  
##  2015-09-14:  41                          : 845  
##  2012-09-04:  29   none                   :  57  
##  2012-09-06:  21   George Mason University:  24  
##  2015-04-06:  17   None                   :  24  
##  2011-03-17:  15   DAR                    :  21  
##  2014-11-17:  15   (Other)                :1046  
##  (Other)   :1881   NA's                   :   2  
##                      country     ZCTA5CE10        
##  USA                     :980   Length:2019       
##  United States           :592   Class :character  
##  usa                     : 96   Mode  :character  
##  US                      : 92                     
##                          : 40                     
##  United States of America: 36                     
##  (Other)                 :183                     
##                  other      account_created
##                     : 350   jph    :969    
##  Genealogy          :  25   mrb    :544    
##  genealogy          :  12   atf    :295    
##  Family History     :   5   RJM    :124    
##  historical research:   5   sml    : 60    
##  Genealogy research :   4          : 23    
##  (Other)            :1618   (Other):  4

How many dates are represented in the dataset?

pwdregistration %>% count(timestamp)

## Source: local data frame [1,091 x 2]
## 
##     timestamp     n
##        (fctr) (int)
## 1                 1
## 2  2011-02-22     2
## 3  2011-02-23     1
## 4  2011-02-27     2
## 5  2011-02-28     1
## 6  2011-03-02     2
## 7  2011-03-04     2
## 8  2011-03-05     3
## 9  2011-03-06     1
## 10 2011-03-07     1
## ..        ...   ...

How many volunteers provided a unique affiliation?

pwdregistration %>% count(affiliation)

## Source: local data frame [795 x 2]
## 
##                          affiliation     n
##                               (fctr) (int)
## 1                                      845
## 2           1 NJ Regiment, Recreated     1
## 3              1st delaware regiment     1
## 4           1st NJ Regt. (recreated)     1
## 5        24 Connecticut Militia Rgt.     1
## 6                  4 great grand son     1
## 7              4th Inf rgt vet assoc     1
## 8              5th great-grandfather     1
## 9  5th New York Regiment (recreated)     1
## 10      Alabama Genealogical Society     1
## ..                               ...   ...

How many volunteers specified an interest in the military, army, or militia when they signed up?

changed_military <- pwdregistration %>% 
  filter(str_detect(other, "military"))

count(changed_military)

## Source: local data frame [1 x 1]
## 
##       n
##   (int)
## 1    84

changed_army <- pwdregistration %>% 
  filter(str_detect(other, "army"))

count(changed_army)

## Source: local data frame [1 x 1]
## 
##       n
##   (int)
## 1     5

changed_militia <- pwdregistration %>% 
  filter(str_detect(other, "militia"))

count(changed_militia)

## Source: local data frame [1 x 1]
## 
##       n
##   (int)
## 1    13

How many volunteers specified an interest in genealogy when they signed up?

changed_geneology <- pwdregistration %>% 
  filter(str_detect(other, "genealogy"))

count(changed_geneology)

## Source: local data frame [1 x 1]
## 
##       n
##   (int)
## 1    79

changed_geneology <- pwdregistration %>% 
  filter(str_detect(affiliation, "genealogy"))

count(changed_geneology)

## Source: local data frame [1 x 1]
## 
##       n
##   (int)
## 1     4

changed_family <- pwdregistration %>% 
  filter(str_detect(other, "family"))

count(changed_family)

## Source: local data frame [1 x 1]
## 
##       n
##   (int)
## 1   121

changed_relative <- pwdregistration %>% 
  filter(str_detect(other, "relative"))

count(changed_relative)

## Source: local data frame [1 x 1]
## 
##       n
##   (int)
## 1    21

changed_ancestor <- pwdregistration %>% 
  filter(str_detect(other, "ancestor"))

count(changed_ancestor)

## Source: local data frame [1 x 1]
## 
##       n
##   (int)
## 1   117

How many volunteers specified an interest in the American Revolution or war when they signed up?

changed_rev <- pwdregistration %>% 
  filter(str_detect(other, "Revolution"))

count(changed_rev)

## Source: local data frame [1 x 1]
## 
##       n
##   (int)
## 1   181

changed_war <- pwdregistration %>% 
  filter(str_detect(other, "war"))

count(changed_war)

## Source: local data frame [1 x 1]
## 
##       n
##   (int)
## 1   114

How many volunteers either identified as Native American or specified an interest in Native American studies when signing up?

changed_native <- pwdregistration %>% 
  filter(str_detect(other, "Native American"))

count(changed_native)

## Source: local data frame [1 x 1]
## 
##       n
##   (int)
## 1    59

changed_native2 <- pwdregistration %>% 
  filter(str_detect(other, "Indian"))

count(changed_native2)

## Source: local data frame [1 x 1]
## 
##       n
##   (int)
## 1   139

changed_nativeaff <- pwdregistration %>% 
  filter(str_detect(affiliation, "Indian"))

count(changed_nativeaff)

## Source: local data frame [1 x 1]
## 
##       n
##   (int)
## 1     7

changed_tribe <- pwdregistration %>% 
  filter(str_detect(affiliation, "tribe"))

count(changed_tribe)

## Source: local data frame [1 x 1]
## 
##       n
##   (int)
## 1     3

How many people at the Center worked to create accounts?

pwdregistration %>% count(account_created)

## Source: local data frame [10 x 2]
## 
##                                                         account_created
##                                                                  (fctr)
## 1                                                                      
## 2                                                                 \njph
## 3                                                                   atf
## 4  Don't think this is actually a dup; can't find the previous instance
## 5                                                                   jpg
## 6                                                                   jph
## 7                                                                   mrb
## 8                                                                   RjM
## 9                                                                   RJM
## 10                                                                  sml
## Variables not shown: n (int)

How many unique zip codes are contained in the dataset?

pwdregistration %>% count(ZCTA5CE10)

## Source: local data frame [1,635 x 2]
## 
##    ZCTA5CE10     n
##        (chr) (int)
## 1                3
## 2      01002     1
## 3      01083     1
## 4      01098     1
## 5      01301     1
## 6      01430     1
## 7      01453     1
## 8      01460     1
## 9      01532     1
## 10     01564     1
## ..       ...   ...

Part Two - Mapping PWD zip code data

First I download the zip code tabulation data from the Census Bureau.

dir.create("data_2/", showWarnings = FALSE)
get_census_data <- function(x) {
  download.file(paste0("http://www2.census.gov/geo/tiger/GENZ2015/shp/", x),
  paste0("data_2/", x))
  unzip(paste0("data_2/", x), exdir = "data_2/")
}
get_census_data("cb_2015_us_zcta510_500k.zip")

Then I turn the shapefile into a vector.

library(raster)

## 
## Attaching package: 'raster'

## The following object is masked from 'package:dplyr':
## 
##     select

## The following object is masked from 'package:magrittr':
## 
##     extract

zipcode <- shapefile("~/pwd/data_2/cb_2015_us_zcta510_500k.shp", stringsAsFactors = TRUE)

Next I create centroids of zip codes. Mapping each individual zip code would be too much data to put on a single map, so I created centroids, which are the center points of each zip code, instead. Then I mapped the centroids using ggplot.

library(rgeos)

## rgeos version: 0.3-19, (SVN revision 524)
##  GEOS runtime version: 3.3.8-CAPI-1.7.8 
##  Linking to sp version: 1.2-2 
##  Polygon checking: TRUE

pts <- gCentroid(zipcode, byid = TRUE)
pts <- as.data.frame(pts)
pts <- bind_cols(pts, zipcode@data)

ggplot(pts, aes(x = x, y = y)) + geom_point()

I use left_join to combine the PWD registration data with the centroid data.

pwdzip <- pwdregistration %>% 
  left_join(pts, by = c("ZCTA5CE10"))

## Warning in left_join_impl(x, y, by$x, by$y): joining factor and character
## vector, coercing into character vector

I run a test map of the newly created vector pwdzip.

ggplot(pwdzip, aes(x = x, y = y)) +
  geom_point()

## Warning: Removed 56 rows containing missing values (geom_point).

I create a basemap, and then I add an albers projection to it.

states_2000 <- us_states("2000-12-31")
plot(states_2000)

states_df <- tidy(states_2000, region = "id")
head(states_df)

##        long      lat order  hole piece      group       id
## 1 -146.7397 61.38464     1 FALSE     1 ak_state.1 ak_state
## 2 -146.4230 61.36831     2 FALSE     1 ak_state.1 ak_state
## 3 -146.2444 61.44145     3 FALSE     1 ak_state.1 ak_state
## 4 -146.4647 61.40448     4 FALSE     1 ak_state.1 ak_state
## 5 -146.5387 61.49832     5 FALSE     1 ak_state.1 ak_state
## 6 -146.4333 61.58677     6 FALSE     1 ak_state.1 ak_state

base_map <- ggplot() +
  geom_map(data = states_df, map = states_df, 
           aes(x = long, y = lat, map_id = id, group = group),
           fill = "white", color = "gray", size = 0.25) +
  coord_map(projection = "albers", lat0 = 29.5, lat1 = 45.5) +
  theme_nothing(legend = TRUE)

Next I map the pwdzip data to the basemap with albers projection. I play around with geom_point and geom_count to see which is better suited to the data.

base_map +
  geom_point(data = pwdzip, aes(x = x, y = y),
             color = "purple") +
  labs(title = "Map of PWD volunteers by zip code")

## Warning: Removed 56 rows containing missing values (geom_point).

base_map +
  geom_count(data = pwdzip, aes(x = x, y = y), color = "purple") +
  labs(title = "Map of PWD volunteers by zip code")

## Warning: Removed 56 rows containing non-finite values (stat_sum).

base_map +
  geom_count(data = pwdzip, aes (x = x, y = y), shape = 1, color = "purple") +
  labs(title = "Map of PWD volunteers by zip code")

## Warning: Removed 56 rows containing non-finite values (stat_sum).

Finally, I map the data using leaflet. The data displays as cluster markers, and the popups are the information volunteers provided as to why they want to transcribe for the PWD and/or particular topics of interest.

pwdzip_2 <- pwdzip %>% 
  filter(x != "null", y != "null")

library(leaflet)

leaflet(pwdzip_2) %>% addTiles() %>%  addMarkers(lng = ~x, lat = ~y, popup = ~other, clusterOptions = markerClusterOptions())

The locations of our transcribers match fairly well with general popualation trends for the US: we have a strong showing of voluteers east of the Mississippi, a good showing on the west coast, and a comparatively sparse number of volunteers throughout the midwest. Spatial analysis can be useful for many different kinds of public history related efforts. The leaflet map, for example, can be used to determine outreach strategies for the PWD. Apart from the northern Virginia area, where the project orginiated, where are the largest number of volunteers located? What specific topics or historical figures are they interested in? What interests did volunteers hailing from the midwest specify, and how can the outreach efforts of the PWD cater to them? These questions are a small sample of the various ways in which spatial data can be used in public history related projects.