Introduction

This is an extension of my previous example of analyzing building permit data from the data.howardcountymd.gov site to answer the following question: Which localities within Howard County, Maryland, saw the most residential building permits issued in 2014?

In this version I streamline the analysis by using a data processing “pipeline” and then graph the results.

Load libraries

For this and future analyses I’ll again be using the R statistical package run from the RStudio development environment, along with the dplyr package to do data manipulation. Since I want to graph the results I also load the ggplot2 package, a plotting library from the same people who created dplyr.

library("dplyr", warn.conflicts = FALSE)
library("ggplot2")

Loading the data

First I download the CSV-format data relating to issuance of building permits from the data.howardcountymd.gov site and store it in a local file hoco-building-permits.csv.

download.file("https://data.howardcountymd.gov/geoserver/ows?service=WFS&version=1.0.0&request=GetFeature&typeName=general:Permits_View_Building_New&outputFormat=csv",
              "hoco-building-permits.csv", method = "curl")

Next I read the CSV data and convert it into a data frame.

permits <- read.csv("hoco-building-permits.csv", stringsAsFactors = FALSE)

At the time of writing there are a total of 755 rows in the dataset, each representing a single issued building permit. Since this dataset gets continually updated as more permits are issued, as time goes on the number of rows in the dataset will grow.
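A quick way to confirm what was loaded is to check the row count and the column types. The sketch below is only an illustration: it reads a tiny made-up CSV string instead of the downloaded file, so it runs without network access. The two rows and their values (including the "Commercial" permit type) are hypothetical; only the column names are taken from the analysis.

```r
# Hypothetical two-row stand-in for hoco-building-permits.csv.
# The values are invented; only the column names match the real data.
csv_text <- "Issued_Date,Permit_Type_2,City,Zip
01/15/2014,Residential,ELLICOTT CITY,21043
12/03/2013,Commercial,COLUMBIA,21044"

# read.csv() accepts literal text through its text argument.
sample_permits <- read.csv(text = csv_text, stringsAsFactors = FALSE)

nrow(sample_permits)   # number of rows, i.e., number of permits
str(sample_permits)    # column names and types
```

With stringsAsFactors = FALSE the text columns stay as plain character strings rather than factors, which is what the later pattern matching on Issued_Date relies on. Running the same two calls on the real permits data frame would show its current size.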

The data processing pipeline

Now that I have the permits data I have to do several data processing steps to select only residential permits for 2014, count the number of permits per locality and zip code, and sort the data into descending order by number of residential permits issued.

In the previous example I did this one step at a time, taking one data set and processing it to produce a new data set, then taking that data set and processing it to produce yet another data set, and so on until I had the final data set containing the results. Fortunately the dplyr package provides a simpler way to specify these steps, using so-called “pipeline” syntax.

The basic idea is this: Suppose we have a data set dsinput with variables a, b, c, and so on, and we want to transform it into a second data set dsoutput by first selecting only variables a and b and then filtering to include only rows where the value of a is equal to 100. In the more traditional approach we would first do the selection of variables as follows to create a temporary data frame dstemp:

dstemp <- select(dsinput, a, b)

and then filter dstemp to create the final data set dsoutput:

dsoutput <- filter(dstemp, a == 100)

The dplyr pipeline syntax allows us to express the first step as follows:

dstemp <- dsinput %>% select(a, b)

The syntax dsinput %>% select(a, b) is equivalent to select(dsinput, a, b) but better expresses the idea of a pipeline where the results of one step are used as input to the next step.

Similarly we can re-write the second step as follows:

dsoutput <- dstemp %>% filter(a == 100)

where dstemp %>% filter(a == 100) is equivalent to filter(dstemp, a == 100).

But now that we’re using the pipeline syntax like this we no longer need to explicitly reference the intermediate data frame dstemp. Instead we can express the entire data processing pipeline in one line:

dsoutput <- dsinput %>% select(a, b) %>% filter(a == 100)

If we want to add additional data processing steps we can simply insert them into the pipeline as appropriate.
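To make the comparison concrete, here is a small runnable sketch using a toy version of dsinput. The values in the data frame are made up purely for illustration, and the extra arrange() step is just one example of a step that might be added.

```r
library("dplyr", warn.conflicts = FALSE)

# Toy input data set; the values are invented for illustration.
dsinput <- data.frame(a = c(100, 200, 100),
                      b = c("x", "y", "z"),
                      c = c(TRUE, FALSE, TRUE),
                      stringsAsFactors = FALSE)

# Traditional form: nested function calls, read from the inside out.
nested <- filter(select(dsinput, a, b), a == 100)

# Pipeline form: the same two steps, read from left to right.
piped <- dsinput %>% select(a, b) %>% filter(a == 100)

# Both forms should produce the same result.
all.equal(nested, piped)

# Adding a step means adding one more stage to the pipeline,
# here sorting the rows in descending order of b.
extended <- dsinput %>%
    select(a, b) %>%
    filter(a == 100) %>%
    arrange(desc(b))
print(extended)
```

The nested form gets harder to read with every added step, while the pipeline form simply grows by one line per step.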

The streamlined analysis

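In this case I need to do the following processing steps: select only the variables of interest from the permits data; filter the rows to include only residential permits; filter again to include only permits issued in 2014; create a new variable CityZip combining the locality name and zip code; group the rows by CityZip; count the number of permits in each group; and sort the results in descending order by number of permits.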

Using the pipeline syntax I can compress all of the steps in the above list into one R statement:

permits_by_zip <- permits %>%
    select(Issued_Date, Permit_Type_2, Detailed_Permit_Type, City, Zip) %>%
    filter(Permit_Type_2 == "Residential") %>%
    filter(grepl("/2014$", Issued_Date)) %>%
    mutate(CityZip = paste(City, Zip, sep = "/")) %>%
    group_by(CityZip) %>%
    summarise(Permits = n()) %>%
    arrange(desc(Permits))

The new data frame permits_by_zip has only 22 rows, one per locality, and only two fields: CityZip, the variable I grouped by, and Permits, the variable containing the number of permits in each group (returned by the n() function).
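The 2014 filter in the pipeline depends on the pattern "/2014$", where the trailing $ anchors the match to the end of the string. The short sketch below shows how grepl() treats a few sample strings. The date strings are hypothetical; I am assuming only that the real Issued_Date values end with a slash followed by the four-digit year, as the pattern implies.

```r
# Hypothetical date strings; the real Issued_Date format is assumed
# to end in "/yyyy", as the pattern in the pipeline implies.
dates <- c("01/15/2014", "12/30/2013", "2014/01/15")

# TRUE only where the string ends in "/2014".
in_2014 <- grepl("/2014$", dates)
print(in_2014)
## [1]  TRUE FALSE FALSE
```

The third string contains 2014 but does not end with "/2014", so it is not matched. If the date format in the dataset ever changed, this filter would silently drop rows, so it is worth checking the number of matching rows after a fresh download.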

Graphing the number of permits by locality

As I did before, I print the entire resulting data frame to show all of the localities and the number of residential building permits issued for each one in 2014.

print.data.frame(permits_by_zip)
##                  CityZip Permits
## 1    ELLICOTT CITY/21043     130
## 2    ELLICOTT CITY/21042      70
## 3   MARRIOTTSVILLE/21104      69
## 4           FULTON/20759      57
## 5         ELKRIDGE/21075      54
## 6         COLUMBIA/21044      36
## 7      CLARKSVILLE/21029      32
## 8           LAUREL/20723      30
## 9          GLENELG/21737      25
## 10        WOODBINE/21797      21
## 11         HANOVER/21076      17
## 12          JESSUP/20794      14
## 13 WEST FRIENDSHIP/21794       8
## 14      COOKSVILLE/21723       5
## 15      MOUNT AIRY/21771       5
## 16        HIGHLAND/20777       4
## 17      SYKESVILLE/21784       3
## 18        COLUMBIA/21045       2
## 19          DAYTON/21036       2
## 20     BROOKEVILLE/20833       1
## 21        GLENWOOD/21738       1
## 22       WOODSTOCK/21163       1

However it would also be nice to have a graph of the data, say as a bar chart, in order to get a better sense of the relative numbers of permits issued in each locality and zip code. I do that using the ggplot2 package, which implements a “grammar of graphics” (hence the name) that is somewhat difficult to learn but once learned makes it relatively easy to produce professional-looking graphs.

I start with a relatively basic graph: a bar chart with only the minimum information needed to produce a plot. I use the ggplot() function to specify the data going into the plot (the permits_by_zip data frame) and the “aesthetics” of the plot, that is, that I want to plot the localities on the x axis and the number of permits on the y axis:

g <- ggplot(permits_by_zip, aes(x = CityZip, y = Permits))

The ggplot() function returns an object g that I can then modify to produce the actual graph. In this case I want to start with a simple bar chart, so I modify g to include a geometric object or “geom” consisting of multiple bars (one per locality/zip code) with the height of the bars equal to the value of the Permits variable. (This is what the stat = "identity" expression specifies.) I then use the print() function to actually plot the graph.

g <- g + geom_bar(stat = "identity")
print(g)

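Unfortunately this graph falls short in a number of areas. Most notably the names of the localities are all run together in a way that makes them unreadable. I can improve the graph by doing the following: reorienting the locality names and zip codes so that they run vertically and no longer overlap, adding more descriptive labels for the x and y axes, and adding an overall title.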

I modify the graph object g to make these changes and then print it again. The xlab(), ylab(), and ggtitle() functions should be self-explanatory. Reorienting the names and zip codes is a bit more complicated. I confess that I resorted to an online search to find an answer on the Stack Overflow developer site that describes how to do this. (Although it’s apparent only in the underlying source code for this document, I also increased the height of the overall graph a bit to compensate for the extra space taken up by the vertical labels.)

g <- g +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
    xlab("Locality/Zip Code") +
    ylab("Number of Permits Issued") +
    ggtitle("Howard County, Maryland, 2014 Residential Building Permits")
print(g)

This graph could be further improved in various ways, but as is it presents the information in a reasonably readable way.
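One possible improvement, offered here only as a sketch and not as part of the original analysis, is to order the bars by number of permits. By default ggplot2 arranges a character variable on the x axis alphabetically, so the sorting done by arrange() is not reflected in the plot. The base R reorder() function can be used inside aes() to sort the localities by permit count instead. To keep the example self-contained I re-enter the first five rows of the results by hand; the name top_permits is just for illustration.

```r
library("ggplot2")

# First five rows of the results printed earlier, re-entered by hand
# so that this sketch runs on its own.
top_permits <- data.frame(
    CityZip = c("ELLICOTT CITY/21043", "ELLICOTT CITY/21042",
                "MARRIOTTSVILLE/21104", "FULTON/20759", "ELKRIDGE/21075"),
    Permits = c(130, 70, 69, 57, 54),
    stringsAsFactors = FALSE)

# reorder() converts CityZip to a factor whose levels are sorted by
# -Permits, so bars run from most permits to fewest.
g2 <- ggplot(top_permits, aes(x = reorder(CityZip, -Permits), y = Permits)) +
    geom_bar(stat = "identity") +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
    xlab("Locality/Zip Code") +
    ylab("Number of Permits Issued")
print(g2)
```

The same reorder() call should work unchanged with the full permits_by_zip data frame in place of top_permits.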

Conclusion

Using data processing pipelines with the dplyr functions makes it much easier to understand conceptually what’s going on in the analysis: each step performs a clearly delineated task, with its output being the input for the next step. It’s easy to see how the above analysis could be modified to cover a different year, or to count commercial building permits rather than residential permits.
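As a rough sketch of what such a modification might look like, the pipeline can be wrapped in a function that takes the year and permit type as arguments. The function name count_permits and the toy data are hypothetical, and I am assuming that commercial permits are marked with the value "Commercial" in the Permit_Type_2 variable, which I have not verified against the dataset.

```r
library("dplyr", warn.conflicts = FALSE)

# Hypothetical helper: count permits per locality/zip code
# for a given year and permit type.
count_permits <- function(permits, year, permit_type) {
    # Build the end-of-string pattern, e.g. "/2014$".
    year_pattern <- paste0("/", year, "$")
    permits %>%
        filter(Permit_Type_2 == permit_type) %>%
        filter(grepl(year_pattern, Issued_Date)) %>%
        mutate(CityZip = paste(City, Zip, sep = "/")) %>%
        group_by(CityZip) %>%
        summarise(Permits = n()) %>%
        arrange(desc(Permits))
}

# Toy data with invented values; "Commercial" is an assumed type name.
toy_permits <- data.frame(
    Issued_Date = c("01/15/2014", "03/02/2014", "06/10/2014", "12/30/2013"),
    Permit_Type_2 = c("Residential", "Residential", "Commercial", "Residential"),
    City = c("FULTON", "FULTON", "LAUREL", "LAUREL"),
    Zip = c(20759, 20759, 20723, 20723),
    stringsAsFactors = FALSE)

residential_2014 <- count_permits(toy_permits, 2014, "Residential")
commercial_2014 <- count_permits(toy_permits, 2014, "Commercial")
print(residential_2014)
print(commercial_2014)
```

With the real data, a call such as count_permits(permits, 2014, "Residential") should reproduce the counts shown earlier.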

The ggplot2 package is a natural complement to dplyr, as both use data frames as their data structure of choice. In this case plotting the data makes it much easier to see the wide disparities among localities in Howard County in terms of residential building permit issuance.

That concludes this example of analyzing Howard County building permit data. If I have time to do another example I’ll try to create an actual map showing the relative number of permits issued per zip code within the county.

Appendix

I used the following R environment in doing the analysis for this example:

sessionInfo()
## R version 3.1.2 (2014-10-31)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggplot2_1.0.0  dplyr_0.4.0    RCurl_1.95-4.3 bitops_1.0-6  
## 
## loaded via a namespace (and not attached):
##  [1] assertthat_0.1   colorspace_1.2-4 DBI_0.3.1        digest_0.6.4    
##  [5] evaluate_0.5.5   formatR_1.0      grid_3.1.2       gtable_0.1.2    
##  [9] htmltools_0.2.6  knitr_1.7        labeling_0.3     lazyeval_0.1.10 
## [13] magrittr_1.0.1   MASS_7.3-35      munsell_0.4.2    parallel_3.1.2  
## [17] plyr_1.8.1       proto_0.3-10     Rcpp_0.11.3      reshape2_1.4    
## [21] rmarkdown_0.5.1  scales_0.2.4     stringr_0.6.2    tools_3.1.2     
## [25] yaml_2.1.13

You can find the source code for this analysis and others at my HoCoData repository on GitHub. This document and its source code are available for unrestricted use, distribution and modification under the terms of the Creative Commons CC0 1.0 Universal (CC0 1.0) Public Domain Dedication. Stated more simply, you’re free to do whatever you’d like with it.