This is an extension of my previous example of analyzing building permit data from the data.howardcountymd.gov site to answer the following question: Which localities within Howard County, Maryland, saw the most residential building permits issued in 2014?
In this version I streamline the analysis by using a data processing “pipeline” and then graph the results.
For this and future analyses I’ll again be using the R statistical package run from the RStudio development environment, along with the dplyr package to do data manipulation. Since I want to graph the results I also load the ggplot2 package, a plotting library from the same people who created dplyr.
library("dplyr", warn.conflicts = FALSE)
library("ggplot2")
First I download the CSV-format data relating to issuance of building permits from the data.howardcountymd.gov site and store it in a local file hoco-building-permits.csv.
download.file("https://data.howardcountymd.gov/geoserver/ows?service=WFS&version=1.0.0&request=GetFeature&typeName=general:Permits_View_Building_New&outputFormat=csv",
              "hoco-building-permits.csv", method = "curl")
Next I read the CSV data and convert it into a data frame.
permits <- read.csv("hoco-building-permits.csv", stringsAsFactors = FALSE)
At the time of writing there are a total of 755 rows in the dataset, each representing a single issued building permit. Since this dataset gets continually updated as more permits are issued, as time goes on the number of rows in the dataset will grow.
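As a quick check on what read.csv() returns, the following sketch reads a few rows from an in-memory string rather than from the downloaded file. The column names match those used later in this analysis. The row values, the "Commercial" permit type, and the month/day/year date layout are assumptions made up for illustration.

```r
# A few invented rows in CSV form; only the column names come from the analysis.
sample_csv <- "Issued_Date,Permit_Type_2,City,Zip
01/15/2014,Residential,ELLICOTT CITY,21043
02/03/2014,Commercial,COLUMBIA,21044
11/20/2013,Residential,FULTON,20759"

sample_permits <- read.csv(text = sample_csv, stringsAsFactors = FALSE)

# With stringsAsFactors = FALSE, text columns stay as character vectors
# instead of being converted to factors.
str(sample_permits)

# Each row represents one permit.
nrow(sample_permits)
```

Keeping the text columns as plain character vectors makes the later string operations (pattern matching on dates, pasting city and zip code together) behave predictably.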
Now that I have the permits data I have to do several data processing steps to select only residential permits for 2014, count the number of permits per locality and zip code, and sort the data into descending order by number of residential permits issued.
In the previous example I did this one step at a time, taking one data set and processing it to produce a new data set, then taking that data set and processing it to produce yet another data set, and so on until I had the final data set containing the results. Fortunately the dplyr package provides a simpler way to specify these steps, using so-called “pipeline” syntax.
The basic idea is this: Suppose we have a data set dsinput with variables a, b, c, and so on, and we want to transform it into a second data set dsoutput by first selecting only variables a and b and then filtering to include only rows where the value of a is equal to 100. In the more traditional approach we would first do the selection of variables as follows to create a temporary data frame dstemp:
dstemp <- select(dsinput, a, b)
and then filter dstemp to create the final data set dsoutput:
dsoutput <- filter(dstemp, a == 100)
The dplyr pipeline syntax allows us to express the first step as follows:
dstemp <- dsinput %>% select(a, b)
The syntax dsinput %>% select(a, b) is equivalent to select(dsinput, a, b) but better expresses the idea of a pipeline where the results of one step are used as input to the next step.
Similarly we can re-write the second step as follows:
dsoutput <- dstemp %>% filter(a == 100)
where dstemp %>% filter(a == 100) is equivalent to filter(dstemp, a == 100).
But now that we’re using the pipeline syntax like this we no longer need to explicitly reference the intermediate data frame dstemp. Instead we can express the entire data processing pipeline in one line:
dsoutput <- dsinput %>% select(a, b) %>% filter(a == 100)
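To make the equivalence concrete, here is a small sketch using a made-up data frame. The values in dsinput are invented for illustration, and the names ds_two_step and ds_pipeline are hypothetical. It runs the two-step version and the pipeline version and checks that they give the same result.

```r
library("dplyr", warn.conflicts = FALSE)

# Toy input data; the values are invented for illustration only.
dsinput <- data.frame(a = c(100, 200, 100),
                      b = c("x", "y", "z"),
                      c = c(TRUE, FALSE, TRUE),
                      stringsAsFactors = FALSE)

# Traditional approach with an explicit intermediate data frame.
dstemp <- select(dsinput, a, b)
ds_two_step <- filter(dstemp, a == 100)

# Pipeline approach with no intermediate data frame.
ds_pipeline <- dsinput %>% select(a, b) %>% filter(a == 100)

# Compare the two results, then show the pipeline output.
print(isTRUE(all.equal(ds_two_step, ds_pipeline)))
print(ds_pipeline)
```

Both versions keep only columns a and b and only the rows where a equals 100.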
If we want to add additional data processing steps we can simply insert them into the pipeline as appropriate.
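In this case I need to do the following processing steps:
1. Start with the permits data set I just read in.
2. Select only the variables of interest: Issued_Date, Permit_Type_2, Detailed_Permit_Type, City, and Zip.
3. Filter the rows to retain only residential permits.
4. Filter the rows to retain only permits issued in 2014.
5. Create a new variable CityZip combining the locality name and zip code.
6. Group the rows by CityZip.
7. Count the number of permits in each group.
8. Sort the results in descending order by number of permits.
9. Store the result in the data frame permits_by_zip.
Using the pipeline syntax I can compress all of the steps in the above list into one R statement: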
permits_by_zip <- permits %>%
    select(Issued_Date, Permit_Type_2, Detailed_Permit_Type, City, Zip) %>%
    filter(Permit_Type_2 == "Residential") %>%
    filter(grepl("/2014$", Issued_Date)) %>%
    mutate(CityZip = paste(City, Zip, sep = "/")) %>%
    group_by(CityZip) %>%
    summarise(Permits = n()) %>%
    arrange(desc(Permits))
The new data frame permits_by_zip has only 22 rows, one per combination of locality and zip code, and only two fields: CityZip, the variable I grouped by, and Permits, the variable containing the number of permits in each group (returned by the n() function).
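The year filter in the pipeline relies on a regular expression, so it may help to see it in isolation. This sketch assumes that Issued_Date is stored as text ending in a four-digit year, as the pattern "/2014$" implies. The sample strings are invented.

```r
# Invented date strings; the real Issued_Date values are assumed to end in "/YYYY".
dates <- c("01/15/2014", "12/31/2013", "2014/01/05", "06/30/2014")

# "$" anchors the match at the end of the string, so only strings whose
# final component is 2014 match.
is_2014 <- grepl("/2014$", dates)
print(is_2014)

# Number of matching strings.
sum(is_2014)
```

Note that the third string contains 2014 but does not end with it, so it is not matched. If the date format in the data set ever changed, this filter would silently drop rows.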
As I did before, I print the entire resulting data frame to show all of the localities and the number of residential building permits issued for each one in 2014.
print.data.frame(permits_by_zip)
## CityZip Permits
## 1 ELLICOTT CITY/21043 130
## 2 ELLICOTT CITY/21042 70
## 3 MARRIOTTSVILLE/21104 69
## 4 FULTON/20759 57
## 5 ELKRIDGE/21075 54
## 6 COLUMBIA/21044 36
## 7 CLARKSVILLE/21029 32
## 8 LAUREL/20723 30
## 9 GLENELG/21737 25
## 10 WOODBINE/21797 21
## 11 HANOVER/21076 17
## 12 JESSUP/20794 14
## 13 WEST FRIENDSHIP/21794 8
## 14 COOKSVILLE/21723 5
## 15 MOUNT AIRY/21771 5
## 16 HIGHLAND/20777 4
## 17 SYKESVILLE/21784 3
## 18 COLUMBIA/21045 2
## 19 DAYTON/21036 2
## 20 BROOKEVILLE/20833 1
## 21 GLENWOOD/21738 1
## 22 WOODSTOCK/21163 1
However it would also be nice to have a graph of the data, say as a bar chart, in order to get a better sense of the relative numbers of permits issued in each locality and zip code. I do that using the ggplot2 package, which implements a “grammar of graphics” (hence the name) that is somewhat difficult to learn but once learned makes it relatively easy to produce professional-looking graphs.
I first start out with a relatively basic graph, a bar chart with the minimum information specified in order to produce a plot. I use the ggplot() function to specify the data going into the plot (the permits_by_zip data frame) and the “aesthetics” of the plot, that is, that I want to plot the localities on the x axis and the number of permits on the y axis:
g <- ggplot(permits_by_zip, aes(x = CityZip, y = Permits))
The ggplot() function returns an object g that I can then modify to produce the actual graph. In this case I want to start with a simple bar chart, so I modify g to include a geometric object or “geom” consisting of multiple bars (one per locality/zip code) with the height of the bars equal to the value of the Permits variable. (This is what the stat = "identity" expression specifies.) I then use the print() function to actually plot the graph.
g <- g + geom_bar(stat = "identity")
print(g)
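Unfortunately this graph falls short in a number of areas. Most notably the names of the localities are all run together in a way that makes them unreadable. I can improve the graph by doing the following:
- Reorient the locality names and zip codes so that they are printed vertically.
- Add a more descriptive label for the x axis.
- Add a more descriptive label for the y axis.
- Add an overall title for the graph.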
I modify the graph object g to make these changes and then print it again. The xlab(), ylab(), and ggtitle() functions should be self-explanatory. Reorienting the names and zip codes is a bit more complicated. I confess that I resorted to an online search to find an answer on the Stack Overflow developer site that describes how to do this. (Although it’s apparent only in the underlying source code for this document, I also increased the height of the overall graph a bit to compensate for the extra space taken up by the vertical labels.)
g <- g +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
    xlab("Locality/Zip Code") +
    ylab("Number of Permits Issued") +
    ggtitle("Howard County, Maryland, 2014 Residential Building Permits")
print(g)
This graph could be further improved in various ways, but as is it presents the information in a reasonably readable way.
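One possible improvement, offered here as a suggestion rather than as part of the original analysis, is to order the bars by number of permits instead of alphabetically. The sketch below uses R's reorder() function inside aes() to do that. The small data frame top_localities is a hypothetical name and simply copies the top three rows of the results printed earlier.

```r
library("ggplot2")

# Top three rows copied from the results printed earlier.
top_localities <- data.frame(
    CityZip = c("ELLICOTT CITY/21043", "ELLICOTT CITY/21042", "MARRIOTTSVILLE/21104"),
    Permits = c(130, 70, 69),
    stringsAsFactors = FALSE
)

# reorder() sorts the x-axis categories by -Permits, so the locality
# with the most permits is drawn first.
g_sorted <- ggplot(top_localities, aes(x = reorder(CityZip, -Permits), y = Permits)) +
    geom_bar(stat = "identity") +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
    xlab("Locality/Zip Code") +
    ylab("Number of Permits Issued")
print(g_sorted)
```

The same aes() change should work with the full permits_by_zip data frame, since it has the same two variables.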
Using data processing pipelines with the dplyr functions makes it much easier to understand conceptually what’s going on in the analysis: each step performs a clearly delineated task, with its output being the input for the next step. It’s easy to see how the above analysis could be modified to cover a different year, or to count commercial building permits rather than residential permits.
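As a rough sketch of that kind of modification, the pipeline could be wrapped in a function that takes the year and permit type as arguments. The function name count_permits, its arguments, and the "Commercial" value are assumptions made for illustration (the only Permit_Type_2 value confirmed in this analysis is "Residential"), and the sample data is invented.

```r
library("dplyr", warn.conflicts = FALSE)

# Hypothetical helper: count permits per locality/zip code for a given
# year and permit type, sorted in descending order.
count_permits <- function(permits, year, type) {
    year_pattern <- paste0("/", year, "$")
    permits %>%
        filter(Permit_Type_2 == type) %>%
        filter(grepl(year_pattern, Issued_Date)) %>%
        mutate(CityZip = paste(City, Zip, sep = "/")) %>%
        group_by(CityZip) %>%
        summarise(Permits = n()) %>%
        arrange(desc(Permits))
}

# Invented sample data using the same column names as the real data set.
sample_permits <- data.frame(
    Issued_Date = c("01/15/2014", "03/02/2014", "07/09/2014", "11/20/2013", "05/05/2014"),
    Permit_Type_2 = c("Residential", "Residential", "Commercial", "Residential", "Residential"),
    City = c("ELLICOTT CITY", "ELLICOTT CITY", "COLUMBIA", "FULTON", "FULTON"),
    Zip = c(21043, 21043, 21044, 20759, 20759),
    stringsAsFactors = FALSE
)

residential_2014 <- count_permits(sample_permits, 2014, "Residential")
commercial_2014 <- count_permits(sample_permits, 2014, "Commercial")
print(as.data.frame(residential_2014))
```

With the real data, a call such as count_permits(permits, 2014, "Residential") should give the same counts as the pipeline shown earlier, since the steps are the same apart from the initial select().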
The ggplot2 package is a natural complement to dplyr, as both use data frames as their data structure of choice. In this case plotting the data makes it much easier to see the wide disparities among localities in Howard County in terms of residential building permit issuance.
That concludes this example of analyzing Howard County building permit data. If I have time to do another example I’ll try to create an actual map showing the relative number of permits issued per zip code within the county.
I used the following R environment in doing the analysis for this example:
sessionInfo()
## R version 3.1.2 (2014-10-31)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggplot2_1.0.0 dplyr_0.4.0 RCurl_1.95-4.3 bitops_1.0-6
##
## loaded via a namespace (and not attached):
## [1] assertthat_0.1 colorspace_1.2-4 DBI_0.3.1 digest_0.6.4
## [5] evaluate_0.5.5 formatR_1.0 grid_3.1.2 gtable_0.1.2
## [9] htmltools_0.2.6 knitr_1.7 labeling_0.3 lazyeval_0.1.10
## [13] magrittr_1.0.1 MASS_7.3-35 munsell_0.4.2 parallel_3.1.2
## [17] plyr_1.8.1 proto_0.3-10 Rcpp_0.11.3 reshape2_1.4
## [21] rmarkdown_0.5.1 scales_0.2.4 stringr_0.6.2 tools_3.1.2
## [25] yaml_2.1.13
You can find the source code for this analysis and others at my HoCoData repository on GitHub. This document and its source code are available for unrestricted use, distribution and modification under the terms of the Creative Commons CC0 1.0 Universal (CC0 1.0) Public Domain Dedication. Stated more simply, you’re free to do whatever you’d like with it.