Introduction

This is a continuation of my previous examples of analyzing building permit data from the [data.howardcountymd.gov][data] site to answer the following question: Which localities within Howard County, Maryland, saw the most residential building permits issued in 2014?

In this version I create a map of Howard County with the varying number of permits shown using variations in color from one zip code area to another.

Load packages

For this analysis I’ll again be using the R statistical package run from the RStudio development environment, along with the dplyr package to do data manipulation and the ggplot2 package to draw the map.

library("dplyr", warn.conflicts = FALSE)
library("ggplot2")

I also need to load several R packages used to manipulate spatial data in R. I first load the sp package, a prerequisite for using the other packages. I use the rgdal package to load spatial data on zip code boundaries downloaded from data.howardcountymd.gov, and the rgeos and maptools packages to manipulate the spatial data once loaded. The maptools package checks on the presence of the rgeos package, so I load rgeos first and then maptools.

library("sp")
library("rgdal")
## rgdal: version: 0.9-1, (SVN revision 518)
## Geospatial Data Abstraction Library extensions to R successfully loaded
## Loaded GDAL runtime: GDAL 1.11.1, released 2014/09/24
## Path to GDAL shared files: /Library/Frameworks/GDAL.framework/Versions/1.11/Resources/gdal
## Loaded PROJ.4 runtime: Rel. 4.8.0, 6 March 2012, [PJ_VERSION: 480]
## Path to PROJ.4 shared files: (autodetected)
library("rgeos")
## rgeos version: 0.3-8, (SVN revision 460)
##  GEOS runtime version: 3.4.2-CAPI-1.8.2 r3921 
##  Polygon checking: TRUE
library("maptools")
## Checking rgeos availability: TRUE

I should note here (having forgotten to do it in the previous examples) that none of the packages above are part of the base R system, and so I had to install them using the install.packages() function. The mapping packages also require installing sets of underlying mapping libraries. Doing this is not trivial, and varies from operating system to operating system. Unfortunately I don’t have space in this document to describe the entire process.

Loading the data

First I download the CVS-format data relating to issuance of building permits from the data.howardcountymd.gov site and store it in a local file hoco-building-permits.csv. (This is the same process as in the previous examples.)

download.file("https://data.howardcountymd.gov/geoserver/ows?service=WFS&version=1.0.0&request=GetFeature&typeName=general:Permits_View_Building_New&outputFormat=csv",
              "hoco-building-permits.csv", method = "curl")

Then I download spatial data from data.howardcountymd.gov spcifying the boundaries of Howard County zip code areas. The Howard County GIS Division rather nicely offers this data in a variety of formats; I chose to use the GeoJSON format, and thus stored the data locally in a file zipcodes.json.

download.file("https://data.howardcountymd.gov/geoserver/ows?service=WFS&version=1.0.0&request=GetFeature&typeName=general:Zipcodes&outputFormat=application/json", "zipcodes.json", method = "curl")

Next I read the CSV data for building permits and convert it into a data frame. (Again, this is the same process as in the last two examples.)

permits <- read.csv("hoco-building-permits.csv", stringsAsFactors = FALSE)

Finally I read in the GeoJSON zip code data. The readOGR() function is part of the rgdal package; since the function understands multiple format I have to tell it which format this file is in.

zips <- readOGR("zipcodes.json", "OGRGeoJSON")
## OGR data source with driver: GeoJSON 
## Source: "zipcodes.json", layer: "OGRGeoJSON"
## with 25 features and 2 fields
## Feature type: wkbPolygon with 2 dimensions
str(zips@data)
## 'data.frame':    25 obs. of  2 variables:
##  $ ZIPCODE : Factor w/ 25 levels "20701","20723",..: 22 13 14 16 6 1 4 24 19 23 ...
##  $ AREANAME: Factor w/ 22 levels "Annapolis Junction",..: 17 4 4 12 14 1 18 20 5 19 ...

The zips variable is not a regular data frame but rather is a special type of data structure, a “SpatialPolygonsDataFrame” containing both a regular data frame (in zips@data) and a list of 25 “polygons” containing map data for the 25 Howard County zip codes.

Data processing

I first need to repeat the processing steps from the prior examples to get a count of residential permits issued for each zip code in 2014. The main difference in this example is that I choose to work only with zip codes and do not need the locality names.

permits_by_zip <- permits %>%
    select(Issued_Date, Permit_Type_2, Detailed_Permit_Type, City, Zip) %>%
    filter(Permit_Type_2 == "Residential") %>%
    filter(grepl("/2014$", Issued_Date)) %>%
    group_by(Zip) %>%
    summarise(Permits = n()) %>%
    arrange(desc(Permits))

Now I have to further modify the permits_by_zip data frame to add data for zip codes for which there were no permits issued in 2014. (I neglected to do this the first time, and the resulting map was messed up something awful.)

The permits_by_zip data frame includes the following zip codes:

sort(permits_by_zip$Zip)
##  [1] 20723 20759 20777 20794 20833 21029 21036 21042 21043 21044 21045
## [12] 21075 21076 21104 21163 21723 21737 21738 21771 21784 21794 21797

The zips@data data frame within the zips data structure contains the following zip codes:

sort(unique(zips@data$ZIPCODE))
##  [1] 20701 20723 20759 20763 20777 20794 20833 21029 21036 21042 21043
## [12] 21044 21045 21046 21075 21076 21104 21163 21723 21737 21738 21771
## [23] 21784 21794 21797
## 25 Levels: 20701 20723 20759 20763 20777 20794 20833 21029 21036 ... 21797

I want to find all zip codes that are in the zip code map but not in permits_by_zip. I first convert the ZIPCODE values from zips@data into integers, to match the permits_by_zip$Zip values. (I have to convert the ZIPCODE values first to characters and then to integers because the values are factors, an R data type that has some odd properties.)

map_zips <- as.integer(as.character(zips@data$ZIPCODE))
str(map_zips)
##  int [1:25] 21771 21045 21046 21076 20794 20701 20763 21794 21723 21784 ...

I then use the setdiff() function to find the set of zip codes in map_zips that are not in permits_by_zip$Zip. (I reference the setdiff() function as base:setdiff to avoid a name conflict with a function of the same name in the dplyr package.) I store the resulting set of zips codes in a variable named Zip, for reasons which will become apparent in the next step.

Zip <- base::setdiff(map_zips, permits_by_zip$Zip)
Zip
## [1] 21046 20701 20763

Next I use the variable Zip to create a data frame permits_by_zip2 with a single variable permits_by_zip2$Zip as the first and only column. Then I add a new column permits_by_zip2$Permits with all zero values. Finally I combine the original permits_by_zip data frame with the newly-created permits_by_zip2 data frame to create a modified permits_by_zip data frame with rows for the formerly missing zip codes. (This is why I chose the name Zip above for the column in permits_by_zip2, to match the corresponding column name in permits_by_zip.)

Since I might want to reuse this code for future years, I allow for the possibility that there are no zip codes without permits issued in a given year, and do all this only if the set of missing zip codes is non-empty.

if (length(Zip) > 0) {
    permits_by_zip2 <- as.data.frame(Zip)
    permits_by_zip2$Permits <- 0
    permits_by_zip <- bind_rows(permits_by_zip, permits_by_zip2)
}

I then print the entire resulting data frame to show the added rows for the zip codes with no permits issued in 2014.

print.data.frame(permits_by_zip)
##      Zip Permits
## 1  21043     130
## 2  21042      70
## 3  21104      69
## 4  20759      57
## 5  21075      54
## 6  21044      36
## 7  21029      32
## 8  20723      30
## 9  21737      25
## 10 21797      21
## 11 21076      17
## 12 20794      14
## 13 21794       8
## 14 21723       5
## 15 21771       5
## 16 20777       4
## 17 21784       3
## 18 21036       2
## 19 21045       2
## 20 20833       1
## 21 21163       1
## 22 21738       1
## 23 21046       0
## 24 20701       0
## 25 20763       0

Mapping the number of permits by zip code

Now comes the fun part: actually mapping the data. I first use the plot() function in base R to verify that the zips data structure actually contains the necessary data to produce a zip code map of Howard County.

plot(zips)

So far so good. I now have two possible strategies I can follow. What I want to do is to color each of the zip code areas differently according to the number of permits issued in that particular area. (Fun fact: This is called a choropleth map.)

One way to do this is to use the plot() function in base R and provide a vector of colors to use in coloring each zip code area (polygon), with the colors calculated according to the number of permits issued. The plot() function doesn’t know how to do this calculation itself, so I would have to write code to do it.

The other approach is to use the ggplot() function from the ggplot2 package. The ggplot() function already knows how to pick colors based on data values. However it doesn’t understand the zips data structure (a SpatialPolygonsDataFrame), so I have to convert that data into the type of data frame that ggplot() expects.

In this analysis I follow the second strategy. I got the basic idea for how to do this from the answers to a Stack Overflow question “R ggplot2 merge with shapefile and csv data to fill polygons”. I have to add a new column id to the data frame zip@data inside zips, use the function fortify() to convert the polygons in zips into individual latitude/longitude pairs grouped by id, then merge that data with the data in zip@data to create a new data frame zips_df.

zips@data$id <- rownames(zips@data)
zips_points <- fortify(zips, region = "id")
zips_df <- merge(zips_points, zips@data, by = "id")
str(zips_df)
## 'data.frame':    18664 obs. of  9 variables:
##  $ id      : chr  "0" "0" "0" "0" ...
##  $ long    : num  1283882 1283895 1283912 1283952 1284018 ...
##  $ lat     : num  620135 620105 620081 620060 620037 ...
##  $ order   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ hole    : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ piece   : Factor w/ 7 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ group   : Factor w/ 38 levels "0.1","1.1","10.1",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ZIPCODE : Factor w/ 25 levels "20701","20723",..: 22 22 22 22 22 22 22 22 22 22 ...
##  $ AREANAME: Factor w/ 22 levels "Annapolis Junction",..: 17 17 17 17 17 17 17 17 17 17 ...

The resulting variable zips_df looks more like the sort of data frame we’ve in previous examples; each row (observation) corresponds to a point in the set of polygons representing the zip code boundaries, and includes values for the zip code with which the point is associated.

Now I want to combine the zips_df data frame with the permits_by_zip data to create a consolidated data set with the number of permits included. To do that I first create a new variable zips_df$Zip that is of the same integer data type as the variable permits_by_zip$Zip, and then join the two data frames together based on that common variable. The left_join() function (from the dplyr package) takes all the rows in the zips_df data frame for each zip code and adds the Permits column from the row in permits_by_zip that has a matching zip code.

zips_df <- mutate(zips_df, Zip = as.integer(as.character(ZIPCODE)))
zips_df <- left_join(zips_df, permits_by_zip, by = "Zip")
str(zips_df)
## 'data.frame':    18664 obs. of  11 variables:
##  $ id      : chr  "0" "0" "0" "0" ...
##  $ long    : num  1283882 1283895 1283912 1283952 1284018 ...
##  $ lat     : num  620135 620105 620081 620060 620037 ...
##  $ order   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ hole    : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ piece   : Factor w/ 7 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ group   : Factor w/ 38 levels "0.1","1.1","10.1",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ZIPCODE : Factor w/ 25 levels "20701","20723",..: 22 22 22 22 22 22 22 22 22 22 ...
##  $ AREANAME: Factor w/ 22 levels "Annapolis Junction",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ Zip     : int  21771 21771 21771 21771 21771 21771 21771 21771 21771 21771 ...
##  $ Permits : num  5 5 5 5 5 5 5 5 5 5 ...

Now I have something ggplot() can deal with. I tell ggplot() to plot zips_df using the long (longitude) values on the x axis and the lat (latitude) values on the y axis. Then I add to the graph object g the drawing of actual polygons, grouping the points together that correspond to the various polygons, with the color of each polygon (the “fill” value) determined based on the value of the zips_df$Permits value for that polygon.

g <- ggplot(zips_df, aes(x = long, y = lat, group = group, fill = Permits))
g <- g + geom_polygon(aes(group = group, fill = Permits))
print(g)

This map doesn’t look bad as a first attempt. The colors do reflect the number of permits, and there’s even a nice legend to specify how they match up. However the map looks somewhat squished in the horizontal direction, and I don’t really need or want the latitude and longitude values on the axes.

Those are relatively easy things to fix. However what I really want to do is more complicated: I’d like to label each area with its zip code value. To do this I first need to figure out where to put the zip code labels on the map. I use the coordinates() function to find the centers (more correctly, the centroids) of each of the 25 polygons in the original zips data structures that store the zip code boundaries.

zips_centers = coordinates(zips)
str(zips_centers)
##  num [1:25, 1:2] 1275434 1361257 1357333 1389449 1369186 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:25] "0" "1" "2" "3" ...
##   ..$ : NULL

The zips_centers variable is a matrix of 25 rows and 2 columns. To use it with ggplot() I have to convert it into a data frame, including giving the two columns their correct names (“long” and “lat”). I then add a third column containing the zip code values from the zips@data data frame within the zips data structure. (This works because the zip code values in zips@data are in the same order as the polygons corresponding to those zip codes.)

zips_centers_df <- as.data.frame(zips_centers)
names(zips_centers_df) <- c("long", "lat")
zips_centers_df$Zip = as.character(zips@data$ZIPCODE)
str(zips_centers_df)
## 'data.frame':    25 obs. of  3 variables:
##  $ long: num  1275434 1361257 1357333 1389449 1369186 ...
##  $ lat : num  608676 560783 548843 555423 541748 ...
##  $ Zip : chr  "21771" "21045" "21046" "21076" ...

Now I have two data frames, zips_df and zips_centers_df, and I want to plot them on the same map. To do this I start off with an “empty” graph object, created by calling ggplot() with no arguments. I next add a layer consisting of polygons for the zip code areas, with latitude and longitude data taken from zips_df and the polygons colored according to the Permits variable in zips_df. Then I add a second layer consisting of text strings for the zip codes themselves, with latitude and longitude data taken from zips_centers_df, the text labels themselves from the Zip variable in zips_centers_df, and the text colored and sized as indicated. (I had to experiment a little to find a text size that worked well.)

g <- ggplot()
g <- g + geom_polygon(data = zips_df, aes(x = long, y = lat,
                      group = group, fill = Permits))
g <- g + geom_text(data = zips_centers_df, aes(x = long, y = lat, label = Zip),
                   size = 2.5, colour = "white", show_guide = FALSE)

Finally I made a few other minor fixes to the graph: Setting the coordinates so that the map would have a proper aspect ratio, removing the text strings on the x and y axes, and giving the map a suitable title.

g <- g + coord_equal()
g <- g + theme(axis.title = element_blank(), axis.text = element_blank())
g <- g + ggtitle("Howard County, Maryland, 2014 Residential Building Permits")
print(g)

There are a few other things I might want to do in a perfect world, but as it is the resulting graph looks reasonably professional.

Conclusion

In this example I made the big leap from tables and bar charts to actual data-driven maps. It was a fair amount of trouble to install all the necessary geospatial software (all free, by the way) and figure out how to use the geospatial data with ggplot2, but I think the results were well worth it.

That concludes this example of analyzing Howard County building permit data. Unless I can think of something else worth doing this will be the last example in this particular series.

Appendix

I used the following R environment in doing the analysis for this example:

sessionInfo()
## R version 3.1.2 (2014-10-31)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] maptools_0.8-34 rgeos_0.3-8     rgdal_0.9-1     sp_1.0-17      
## [5] ggplot2_1.0.0   dplyr_0.4.0     RCurl_1.95-4.3  bitops_1.0-6   
## 
## loaded via a namespace (and not attached):
##  [1] assertthat_0.1   colorspace_1.2-4 DBI_0.3.1        digest_0.6.4    
##  [5] evaluate_0.5.5   foreign_0.8-61   formatR_1.0      grid_3.1.2      
##  [9] gtable_0.1.2     htmltools_0.2.6  knitr_1.7        labeling_0.3    
## [13] lattice_0.20-29  lazyeval_0.1.10  magrittr_1.0.1   MASS_7.3-35     
## [17] munsell_0.4.2    parallel_3.1.2   plyr_1.8.1       proto_0.3-10    
## [21] Rcpp_0.11.3      reshape2_1.4     rmarkdown_0.5.1  scales_0.2.4    
## [25] stringr_0.6.2    tools_3.1.2      yaml_2.1.13

The underlying GDAL and GEOS libraries for the rgeos and rgdal packages are from the KyngChaos GDAL Complete distribution version 1.11 for Mac OS X.

You can find the source code for this analysis and others at my HoCoData repository on GitHub. This document and its source code are available for unrestricted use, distribution and modification under the terms of the Creative Commons CC0 1.0 Universal (CC0 1.0) Public Domain Dedication. Stated more simply, you’re free to do whatever you‘d like with it.