Introduction

This is a simple example of analyzing building permit data from the data.howardcountymd.gov site. I want to answer the following question: Which localities within Howard County, Maryland, saw the most residential building permits issued in 2014?

This question is not the same as asking which places saw the most housing units built, because a single permit could cover multiple dwellings, for example a multi-unit apartment complex. Nevertheless the answer will give a rough picture of where residential development is occurring in the county.

Load needed libraries

For this analysis I’ll be using the R statistical package, run from the R Studio development environment. To help make the analysis easier I’ll first use the library() function to load an additional R package, add-on software that provides additional capability not in the base R software.

The dplyr package provides functions for manipulating datasets in a fairly intuitive way (at least, more intuitive than is often the case with R). Because the package provides functions whose names conflict with those elsewhere in R, I call the library() function with the argument warn.conflicts = FALSE to suppress associated warning messages.

library("dplyr", warn.conflicts = FALSE)

Loading the data

First I download the CVS-format data relating to issuance of building permits from the data.howardcountymd.gov site and store it in a local file hoco-building-permits.csv.

download.file("https://data.howardcountymd.gov/geoserver/ows?service=WFS&version=1.0.0&request=GetFeature&typeName=general:Permits_View_Building_New&outputFormat=csv",
              "hoco-building-permits.csv", method = "curl")

Next I read the CSV data and convert it into a data frame, a standard R object for representing datasets. A data frame is similar to a spreadsheet: It has multiple columns representing different variables (one per column), and multiple rows representing different sets of values for those variables (one set per row). One major difference from a traditional spreadsheet is that in R you can refer to the variables (columns) using meaningful names, rather than using letters like ‘A’, ‘B’, and so on.

I use the stringsAsFactors = FALSE argument to the read.csv() function to keep character strings in the data as character strings, and not convert them into factors (an R data type used for categorical variables, i.e., variables whose values are like answers to a multiple-choice question).

permits <- read.csv("hoco-building-permits.csv", stringsAsFactors = FALSE)

Now let’s take a look at what’s in the dataset; the str() function shows the overall structure of the data frame (hence the function name):

str(permits)
## 'data.frame':    755 obs. of  14 variables:
##  $ FID                 : chr  "Permits_View_Building_New.fid-27152ead_14b8a9cb73c_-210e" "Permits_View_Building_New.fid-27152ead_14b8a9cb73c_-210d" "Permits_View_Building_New.fid-27152ead_14b8a9cb73c_-210c" "Permits_View_Building_New.fid-27152ead_14b8a9cb73c_-210b" ...
##  $ X                   : num  -77.2 -77.2 -77.1 -77.1 -77.1 ...
##  $ Y                   : num  39.3 39.3 39.3 39.3 39.3 ...
##  $ Permit_ID           : chr  "B14003498" "B14004339" "B14000500" "B14004193" ...
##  $ Issued_Date         : chr  "10/29/2014" "01/23/2015" "05/27/2014" "12/17/2014" ...
##  $ Permit_Type         : chr  "Building" "Building" "Building" "Building" ...
##  $ Permit_Type_2       : chr  "Commercial" "Residential" "Residential" "Residential" ...
##  $ Detailed_Permit_Type: chr  "Commercial New Building Permit" "Residential New Single Family Dwelling Permit" "Residential New Single Family Dwelling Permit" "Residential New Single Family Dwelling Permit" ...
##  $ Is_Capital_Project  : chr  "No" "No" "No" "No" ...
##  $ Address             : chr  "1504 LONG CORNER RD" "18336 CHELSEA KNOLLS DR" "17762 HARDY RD" "2771 FLORENCE RD" ...
##  $ City                : chr  "MOUNT AIRY" "MOUNT AIRY" "MOUNT AIRY" "WOODBINE" ...
##  $ State               : chr  "MD" "MD" "MD" "MD" ...
##  $ Zip                 : int  21771 21771 21771 21797 21797 21771 21771 21797 21797 21771 ...
##  $ geom                : chr  "POINT (1267772.72262635 607564.595198192)" "POINT (1268073.13298635 600789.777068192)" "POINT (1275124.27426635 612582.978998192)" "POINT (1278720.69640635 602131.649258192)" ...

The first thing to notice is that there are 14 variables (columns) in the data frame, ranging from FID to geom.

The second thing to notice is that there are a total of 755 rows in the dataset (“obs.”, or observations), each representing a single issued building permit. Since this dataset gets continually updated as more permits are issued, as time goes on the number of rows in the dataset will grow.

Finally, the str() function prints the first few values for each of the variables (corresponding to the first few rows of the data frame). These values are numbers with decimals (‘num’), character strings (‘str’), or numbers without decimals (‘int’, for integer).

Choosing which data to work with

As it turns out I don’t need all the data in the building permits dataset in order to answer the question I asked. My next step is therefore to reduce the amount of data I’m working with. In spreadsheet terms this is analogous to selecting only the columns and rows I want, and deleting the rest. First I select the variables (columns) I want, and then I filter the dataset for the observations (rows) I want.

Some of the variables in the permits dataset are not of immediate interest, including FID and Permit_ID (internal identifiers for each permit), X, Y, and geom (data for mapping), and Is_Capital_Project (irrelevant for this example). I also don’t care about the entire address, just the city and zip portions.

Looking a bit further, the variable Permit_Type appears to always have the same value, ‘Building’. I can test this using the unique() function to print out all the unique values the variable Permit_Type takes on. (The syntax permits$Permit_Type, with the ‘$’ separating permits and Permit_Type, is how I tell R that I am interested in the variable Permit_Type within the data frame permits.)

unique(permits$Permit_Type)
## [1] "Building"

As I thought, this variable always has the same value, presumably since all the data in this data set is related to building permits. Since the variable doesn’t add any useful information I can ignore it as well.

I then start the process of cutting down the data I’m dealing with by creating a new data frame with just the variables I need. The select() function takes as its first argument a data frame, with the remaining arguments naming the variables I want to keep. The function returns a new data frame, which I assign to the variable permits1.

permits1 <- select(permits, Issued_Date, Permit_Type_2,
                   Detailed_Permit_Type, City, Zip)
str(permits1)
## 'data.frame':    755 obs. of  5 variables:
##  $ Issued_Date         : chr  "10/29/2014" "01/23/2015" "05/27/2014" "12/17/2014" ...
##  $ Permit_Type_2       : chr  "Commercial" "Residential" "Residential" "Residential" ...
##  $ Detailed_Permit_Type: chr  "Commercial New Building Permit" "Residential New Single Family Dwelling Permit" "Residential New Single Family Dwelling Permit" "Residential New Single Family Dwelling Permit" ...
##  $ City                : chr  "MOUNT AIRY" "MOUNT AIRY" "MOUNT AIRY" "WOODBINE" ...
##  $ Zip                 : int  21771 21771 21771 21797 21797 21771 21771 21797 21797 21771 ...

The new data frame permits1 has only 5 variables, but still has the full set of 755 observations. Having picked the variables I want, I now want to cut down the data set to include only the permits I’m interested in.

My original question was about residential permits, so I want to filter the data to eliminate non-residential permits. Looking at the values in permits2 it appears that the variable Permit_Type_2 will allow me to figure out which permits are for residential construction and which are not. I confirm this by using the unique() function to print out all the unique values that the variable Permit_Type_2 takes on.

unique(permits1$Permit_Type_2)
## [1] "Commercial"  "Residential"

As I thought, this variable has the value ‘Residential’ for residential building permits and ‘Commercial’ for non-residential permits. I can therefore filter the data again to create a new data frame permits2 that contains only residential permits. (The operator ‘==’ is a check for equality, that is, that the value of the variable Permit_Type_2 is the character string ‘Residential’.)

permits2 <- filter(permits1, Permit_Type_2 == "Residential")
str(permits2)
## 'data.frame':    713 obs. of  5 variables:
##  $ Issued_Date         : chr  "01/23/2015" "05/27/2014" "12/17/2014" "12/22/2014" ...
##  $ Permit_Type_2       : chr  "Residential" "Residential" "Residential" "Residential" ...
##  $ Detailed_Permit_Type: chr  "Residential New Single Family Dwelling Permit" "Residential New Single Family Dwelling Permit" "Residential New Single Family Dwelling Permit" "Residential New Single Family Dwelling Permit" ...
##  $ City                : chr  "MOUNT AIRY" "MOUNT AIRY" "WOODBINE" "WOODBINE" ...
##  $ Zip                 : int  21771 21771 21797 21797 21771 21771 21797 21797 21771 21797 ...

I next want to restrict myself to permits issued in 2014, using the Issued_Date variable.

Date handling in R can be somewhat clunky at times, with multiple ways of handling dates depending on what you want to do with them. In this case the easiest way to identify 2014 permits is to take Issued_Date as a character string and look for the last part of the string being ‘/2014’, using the grepl() function.

(The ‘grep’ part of the name ‘grepl’ comes from the Unix/Linux command grep; for more information see the relevant Wikipedia article. The ‘l’ part of the name indicates that the function returns a logical value TRUE or FALSE, depending on whether the string in question was found or not. The ‘$’ part of the search string ‘/2014$’ tells the function to look for ‘/2014’ only at the end of the string.)

permits3 <- filter(permits2, grepl("/2014$", Issued_Date))
str(permits3)
## 'data.frame':    587 obs. of  5 variables:
##  $ Issued_Date         : chr  "05/27/2014" "12/17/2014" "12/22/2014" "04/21/2014" ...
##  $ Permit_Type_2       : chr  "Residential" "Residential" "Residential" "Residential" ...
##  $ Detailed_Permit_Type: chr  "Residential New Single Family Dwelling Permit" "Residential New Single Family Dwelling Permit" "Residential New Single Family Dwelling Permit" "Residential New Single Family Dwelling Permit" ...
##  $ City                : chr  "MOUNT AIRY" "WOODBINE" "WOODBINE" "MOUNT AIRY" ...
##  $ Zip                 : int  21771 21797 21797 21771 21771 21797 21797 21771 21797 21771 ...

The resulting data frame permits3 has the same 5 variables as the data frame permits1, but has only 587 observations.

Counting the number of permits for each locality

I now want to count the number of permits issued for each locality within Howard County. There are two variables that specify the location for which the permit was issued: City and Zip. I again use the unique() function to determine what values these variables contain:

unique(permits3$City)
##  [1] "MOUNT AIRY"      "WOODBINE"        "BROOKEVILLE"    
##  [4] "GLENELG"         "GLENWOOD"        "COOKSVILLE"     
##  [7] "DAYTON"          "WEST FRIENDSHIP" "CLARKSVILLE"    
## [10] "ELLICOTT CITY"   "HIGHLAND"        "SYKESVILLE"     
## [13] "FULTON"          "MARRIOTTSVILLE"  "COLUMBIA"       
## [16] "LAUREL"          "WOODSTOCK"       "JESSUP"         
## [19] "ELKRIDGE"        "HANOVER"
unique(permits3$Zip)
##  [1] 21771 21797 20833 21737 21738 21723 21036 21794 21029 21042 20777
## [12] 21784 20759 21104 21044 20723 21163 21045 20794 21043 21075 21076

It looks as if there are more zip codes than locality names, probably because both Columbia and Ellicott City contain multiple zip codes. It’s useful to be able to go down to individual zip codes, but the zip codes are hard to interpret for people not familiar with the area.

To get the best of both approaches I use the mutate() function to create a new data frame with the City and Zip variables consolidated into a single variable CityZip. The paste() function pastes together the values of the two original variables with a slash character used as a separator.

permits4 <- mutate(permits3, CityZip = paste(City, Zip, sep = "/"))
str(permits4)
## 'data.frame':    587 obs. of  6 variables:
##  $ Issued_Date         : chr  "05/27/2014" "12/17/2014" "12/22/2014" "04/21/2014" ...
##  $ Permit_Type_2       : chr  "Residential" "Residential" "Residential" "Residential" ...
##  $ Detailed_Permit_Type: chr  "Residential New Single Family Dwelling Permit" "Residential New Single Family Dwelling Permit" "Residential New Single Family Dwelling Permit" "Residential New Single Family Dwelling Permit" ...
##  $ City                : chr  "MOUNT AIRY" "WOODBINE" "WOODBINE" "MOUNT AIRY" ...
##  $ Zip                 : int  21771 21797 21797 21771 21771 21797 21797 21771 21797 21771 ...
##  $ CityZip             : chr  "MOUNT AIRY/21771" "WOODBINE/21797" "WOODBINE/21797" "MOUNT AIRY/21771" ...

I’m now ready to count the number of residential building permits in each locality. I first use the function group_by() to group the permits based on the values of the CityZip variables, and then use the function summarise() to count the number of permits in each group.

permits5 <- group_by(permits4, CityZip)
permits6 <- summarise(permits5, Permits = n())
str(permits6)
## Classes 'tbl_df', 'tbl' and 'data.frame':    22 obs. of  2 variables:
##  $ CityZip: chr  "BROOKEVILLE/20833" "CLARKSVILLE/21029" "COLUMBIA/21044" "COLUMBIA/21045" ...
##  $ Permits: int  1 32 36 2 5 2 54 70 130 57 ...
##  - attr(*, "drop")= logi TRUE

The new data frame has only 22 rows, one per locality, and only two fields: CityZip, the variable I grouped by, and Permits, the variable containing the number of permits in each group (returned by the n() function).

Displaying the number of permits by locality

Since I’m interested in which localities had the most permits issued I sort the rows using the arrange() function to put them in descending order based on the Permits variable.

permits7 <- arrange(permits6, desc(Permits))

Finally I print the entire data frame to show all of the localities and the number of residential building permits issued for each one in 2014.

print.data.frame(permits7)
##                  CityZip Permits
## 1    ELLICOTT CITY/21043     130
## 2    ELLICOTT CITY/21042      70
## 3   MARRIOTTSVILLE/21104      69
## 4           FULTON/20759      57
## 5         ELKRIDGE/21075      54
## 6         COLUMBIA/21044      36
## 7      CLARKSVILLE/21029      32
## 8           LAUREL/20723      30
## 9          GLENELG/21737      25
## 10        WOODBINE/21797      21
## 11         HANOVER/21076      17
## 12          JESSUP/20794      14
## 13 WEST FRIENDSHIP/21794       8
## 14      COOKSVILLE/21723       5
## 15      MOUNT AIRY/21771       5
## 16        HIGHLAND/20777       4
## 17      SYKESVILLE/21784       3
## 18        COLUMBIA/21045       2
## 19          DAYTON/21036       2
## 20     BROOKEVILLE/20833       1
## 21        GLENWOOD/21738       1
## 22       WOODSTOCK/21163       1

Conclusion

Given that the two Ellicott City zip codes cover a relatively large and populous area of Howard County (see the zip code map on data.howardcountymd.gov), it’s not surprising that Ellicott City saw the most residential permits issued in 2014. Marriottsville, Fulton, and Elkridge also saw a lot of residential development. However it’s interesting to note that East Columbia (21045 zip code) had only two residential building permits issued in all of 2014.

That concludes this example of analyzing Howard County building permit data. In the next example I’ll show how to streamline the analysis and then create a graph from the summary data.

Appendix 1: R environment

I used the following R environment in doing the analysis for this example:

sessionInfo()
## R version 3.1.2 (2014-10-31)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] dplyr_0.4.0    RCurl_1.95-4.3 bitops_1.0-6  
## 
## loaded via a namespace (and not attached):
##  [1] assertthat_0.1  DBI_0.3.1       digest_0.6.4    evaluate_0.5.5 
##  [5] formatR_1.0     htmltools_0.2.6 knitr_1.7       lazyeval_0.1.10
##  [9] magrittr_1.0.1  parallel_3.1.2  Rcpp_0.11.3     rmarkdown_0.5.1
## [13] stringr_0.6.2   tools_3.1.2     yaml_2.1.13

Appendix 2: Public domain dedication

I am making this work available under the terms of the Creative Commons CC0 1.0 Universal (CC0 1.0) Public Domain Dedication.