Wednesday, April 15, 2015

Introduction and Background

NYC's 311 data has been explored, summarized, and mapped countless times by statisticians, bloggers, and civic hackers. Yet to my knowledge, no one has asked a question that seems fundamental to this data set: do some neighborhoods behave similarly? In short, is there any underlying structure to this data? Can neighborhoods be grouped by their 311 calls, and what can we learn about them by classifying them in this way?

Goals and Outline

Data Preparation and Cleaning

library(plyr)
library(dplyr)
NYC311 = read.csv('2014_NYC.csv', header=T)

#Make complaints uniform
NYC311$Complaint.Type = tolower(NYC311$Complaint.Type) 
NYC311$Complaint.Type = gsub('s$', '', NYC311$Complaint.Type) 
NYC311$Complaint.Type = gsub('paint - plaster', 'paint/plaster', NYC311$Complaint.Type)
NYC311$Complaint.Type = gsub('general construction', 'construction', NYC311$Complaint.Type)
NYC311$Complaint.Type = gsub('nonconst', 'construction', NYC311$Complaint.Type)

#Group Similar Complaints
NYC311$Complaint.Type = gsub('street sign - [[:alpha:]]+', 'street sign', NYC311$Complaint.Type)
NYC311$Complaint.Type = gsub('fire alarm - .+','fire alarm', NYC311$Complaint.Type)

#Make Zipcodes Uniform
NYC311$Incident.Zip = gsub('-[[:digit:]]{4}$', '', NYC311$Incident.Zip)
#Keep only rows whose zipcode is exactly five digits
idx = grepl('^[[:digit:]]{5}$', NYC311$Incident.Zip)
NYC311clean = NYC311[idx,]

#Counts of each complaint by zipcode
NYC311byZip = ddply(NYC311clean, .(Incident.Zip, Complaint.Type), count)
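
The zipcode cleaning can be checked on a handful of hand-made values. The sketch below is only an illustration: the vector `zip_raw` is made up, and the point is to show how stripping a ZIP+4 suffix and then testing against an anchored five-digit pattern behaves on malformed entries.

```r
# Toy values standing in for the Incident.Zip column (hypothetical data)
zip_raw = c('10001', '10001-1234', '', 'N/A', '123456', '11201')

# Strip a ZIP+4 suffix, as in the cleaning step above
zip = gsub('-[[:digit:]]{4}$', '', zip_raw)

# Anchored pattern: exactly five digits and nothing else
valid = grepl('^[[:digit:]]{5}$', zip)

zip_clean = zip[valid]
print(zip_clean)

# Without the ^ and $ anchors, a six-digit string also matches
print(grepl('[[:digit:]]{5}', '123456'))
```

The anchors matter because `grepl` reports a match anywhere in the string, so a pattern without them accepts any value that merely contains five consecutive digits.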

Data Exploration and Structure

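Exploratory Factor Analysis (EFA) was used to explore the underlying structure of the data set and to understand whether any latent variables, or factors, might explain the variance seen across multiple complaint types. The code below uses principal() from the psych package, which extracts rotated principal components (labeled RC in the output).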

library(tidyr); library(psych); library(reshape2); library(ggplot2)

#Prepare data for PCA/EFA
raw = spread(NYC311byZip, Complaint.Type, n)
raw[is.na(raw)] = 0
zipcodes = raw[,1]
raw = raw[,-1]
#Drop complaint types with fewer than 10 complaints citywide
keep = colSums(raw) >= 10
raw = raw[,keep]
processed = scale(raw, center=T, scale=T)

#Extract five rotated components from the correlation matrix
pca = principal(processed, nfactors=5, covar=F)
loadings = as.data.frame(pca$loadings[,1:5])
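
The reshaping and scaling that precede the PCA can be tried on a toy table. This is a sketch under assumptions: the data frame `long` is invented to resemble `NYC311byZip`, and base R's `xtabs` is swapped in for `spread` so the example needs no extra packages.

```r
# Hypothetical long-format counts, shaped like NYC311byZip
long = data.frame(
  Incident.Zip   = c('10001', '10001', '11201', '11201', '10458'),
  Complaint.Type = c('noise', 'heating', 'heating', 'snow', 'heating'),
  n              = c(40, 5, 30, 12, 55),
  stringsAsFactors = FALSE
)

# xtabs() sums n within each zip/complaint cell; absent pairs become 0
wide = xtabs(n ~ Incident.Zip + Complaint.Type, data = long)
wide = as.data.frame.matrix(wide)
print(wide)

# Center and scale each complaint column to mean 0 and standard deviation 1
z = scale(wide, center = TRUE, scale = TRUE)
print(round(z, 2))
```

Each row of `z` is one zipcode and each column one complaint type, which is the layout that both the PCA and the clustering expect.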

Visualizing EFA results

Five rotated components were extracted. Four of them have multiple variables with loadings above 0.9, suggesting that four latent variables lead residents to make similar complaints. The first rows of the loadings table are shown below.

##                                   RC3         RC1         RC2         RC5          RC4
## air quality                0.07117459  0.22591181 0.624238032  0.45181672  0.001674035
## animal abuse               0.54470537  0.62494829 0.082246626  0.11319346  0.070357837
## animal in a park           0.12018019  0.03693350 0.186197219  0.49037864 -0.013666413
## appliance                  0.08785885  0.94192572 0.006914322  0.05472913  0.045679394
## asbesto                    0.14267891  0.35927232 0.541564666  0.28437665  0.049882041
## beach/pool/sauna complaint 0.19726022 -0.01480234 0.409175759 -0.12737744 -0.013770279
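
One way to read a loadings table is to list, for each component, the variables that load above a cutoff. The sketch below uses a small made-up matrix `L` (the values only loosely echo the printed output, and the row 'made-up complaint' is invented) rather than the real `loadings` object.

```r
# Hypothetical loadings matrix; values are illustrative, not the real results
L = matrix(
  c(0.07, 0.23, 0.62,
    0.09, 0.94, 0.01,
    0.14, 0.36, 0.54,
    0.05, 0.92, 0.10),
  ncol = 3, byrow = TRUE,
  dimnames = list(c('air quality', 'appliance', 'asbesto', 'made-up complaint'),
                  c('RC3', 'RC1', 'RC2'))
)

threshold = 0.9  # cutoff mentioned in the text

# For each component, the variables whose absolute loading exceeds the cutoff
high = lapply(colnames(L), function(comp) rownames(L)[abs(L[, comp]) > threshold])
names(high) = colnames(L)
print(high)
```

A component with several entries in `high` is a candidate latent variable; a component with none is contributing little structure at that cutoff.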

A better way to visualize EFA results

Clustering Zipcodes

Given that four factors appear to drive the variation in the data, the zipcodes were clustered with k-means using four centers. The cluster assignments were then visualized in eigenspace to inspect the results. The clusters are fairly well separated and the cluster assignments appear reasonable.

#Cluster data
set.seed(400)
cluster=kmeans(processed, 4)

Eigenspace

Reducing the number of dimensions from p to three eigendimensions allows us to inspect the clustering results visually. The first three principal components are the eigendimensions along which the variance is maximized, so the cluster centers should ideally be well separated in this space.
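
The projection described above can be sketched on simulated data. Everything in this example is an assumption made for illustration: the matrix `x` is random data with four planted groups, `nstart = 25` is an added choice rather than the setting used in this analysis, and `prcomp` stands in for whatever produced the original plot.

```r
set.seed(400)

# Hypothetical stand-in for the scaled zip-by-complaint matrix:
# 4 groups of 15 "zipcodes" with 6 "complaint types" each
centers_true = matrix(rnorm(4 * 6, sd = 4), nrow = 4)
x = centers_true[rep(1:4, each = 15), ] + matrix(rnorm(60 * 6), nrow = 60)
x = scale(x)

# nstart reruns k-means from several random starts and keeps the best fit
km = kmeans(x, centers = 4, nstart = 25)

# Principal components of the scaled data; keep the first three score columns
pc = prcomp(x)
scores = pc$x[, 1:3]

# Share of total variance captured by those three components
var_share = sum(pc$sdev[1:3]^2) / sum(pc$sdev^2)
print(round(var_share, 3))

# Pairwise scatterplots of the three components, colored by cluster
pairs(scores, col = km$cluster, pch = 19)
```

If the clusters overlap heavily in every panel, either the clustering is weak or the first three components do not capture the directions that separate the groups, which is why the variance share is worth printing alongside the plot.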

Missing Values

Interestingly, many zipcodes in NYC had no complaints, yet the map (see below) had no unassigned areas. These zipcodes have no complaints because they are assigned to individual buildings, not to areas of NYC. For example, the World Trade Center, the Empire State Building, and the Saks Fifth Avenue shoe department, among others, each have their own "vanity zip."

sum(is.na(zipcolors$clusters))
## [1] 50
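
Missing cluster values of this kind typically appear when a full list of zipcodes is joined to the cluster assignments. The sketch below shows the mechanism on toy data; `map_zips`, `assigned`, and `zipcolors_toy` are hypothetical names, and how the real `zipcolors` table was built is not shown in this post.

```r
# Hypothetical zipcodes from a map file; '10118' stands in for a single-building zip
map_zips = data.frame(zip = c('10001', '10118', '11201'), stringsAsFactors = FALSE)

# Hypothetical cluster assignments for zipcodes that had complaints
assigned = data.frame(zip = c('10001', '11201'), clusters = c(1, 2),
                      stringsAsFactors = FALSE)

# Left join: all.x = TRUE keeps every map zip, filling unmatched ones with NA
zipcolors_toy = merge(map_zips, assigned, by = 'zip', all.x = TRUE)
print(zipcolors_toy)

# Count of zipcodes with no cluster assignment
print(sum(is.na(zipcolors_toy$clusters)))
```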

Visualizing the Clusters on a Map - Segments

  • Cluster 1: midtown and lower Manhattan ("Manhattan")
  • Cluster 2: large swaths of Queens, Brooklyn, and Staten Island ("Typical New Yorker")

  • Cluster 3: Outer NYC, adjacent to Long Island and Westchester ("Suburbanites")
  • Cluster 4: Harlem, the Bronx, and the boundary of Brooklyn and Queens ("Third World")

What each Cluster complains about - Manhattan

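Since the data is centered and scaled before clustering, the cluster centers are Z-scores, which are straightforward to interpret relative to the overall mean: a center of 2 means the zipcodes in that cluster average two standard deviations more of that complaint than the mean across all zipcodes.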

#Manhattan
sort(cluster$centers[1,], decreasing=T)[1:5]
##              taxi complaint                       noise 
##                    2.678001                    2.511287 
##           broken muni meter                 air quality 
##                    2.459634                    2.254665 
## dof parking - tax exemption 
##                    2.230553
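
The Z-score reading of the centers can be verified on a small example: after scaling, each k-means center is simply the average Z-score of the cluster's members. The counts in `raw_toy` are invented for this sketch.

```r
set.seed(1)

# Hypothetical counts for 8 "zipcodes" and 2 "complaint types"
raw_toy = cbind(noise   = c(50, 60, 55, 58, 5, 8, 6, 4),
                heating = c(3, 5, 4, 6, 40, 45, 42, 38))

# Same preprocessing as above: center and scale each column
z = scale(raw_toy)
km = kmeans(z, centers = 2, nstart = 10)

# Column-wise mean Z-score of the members of each cluster
manual = aggregate(as.data.frame(z), by = list(cluster = km$cluster), FUN = mean)

print(km$centers)
print(manual)
```

The two printouts agree, so sorting a row of `cluster$centers` ranks complaint types by how far that cluster sits above or below the citywide average.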

Typical New Yorker

#Typical New Yorker
sort(cluster$centers[2,], decreasing=T)[1:5]
##           emergency response team (ert) 
##                             -0.05850242 
##                        best/site safety 
##                             -0.07466780 
##              unsanitary animal facility 
##                             -0.09033467 
##                        bridge condition 
##                             -0.09085959 
## special projects inspection team (spit) 
##                             -0.09847059

Suburbanites

#Suburbanites
sort(cluster$centers[3,], decreasing=T)[1:5]
##                          snow                  damaged tree 
##                      1.166978                      1.153884 
##                         sewer root/sewer/sidewalk condition 
##                      1.114761                      1.102813 
##               illegal parking 
##                      1.098841

The Third World

#The Third World
sort(cluster$centers[4,], decreasing=T)[1:5]
##      plumbing paint/plaster  construction      electric       heating 
##      2.460937      2.459225      2.401919      2.387758      2.370252

Conclusion

  • Manhattan (Cluster 1) complaints are largely about taxis, noise, air quality, and broken muni-meters: true First World problems.

  • Typical New Yorkers (Cluster 2) complain less frequently on average (about everything) than the residents in the other clusters.

  • Suburbanites (Cluster 3), New York City residents living on the fringes of the city, often adjacent to its suburbs, complain about suburban problems: damaged trees, snow, abandoned vehicles, etc.

  • The poorer residents of NYC (Cluster 4) complain most often about basic necessities: plumbing, heating, and electric problems. Even in a First World city, Third World problems persist.