Load libraries:
library(corrplot)
library(dplyr)
library(ggplot2)
library(lubridate)
library(lme4)
library(pscl)
library(tidycensus)
library(tidyr)
source('../R/multiplot.R')
Introduction
The purpose of this document is to explore the relationship between 311 calls and housing violations in New York City. After investigating their statistical properties and incorporating demographic variables, I develop a number of models for predicting housing violations in NYC zip codes. After testing each model on a hold-out set, the best performer is a zero-inflated Poisson regression that accounts for 72 percent of the variation in housing violations.
311 is a phone number in New York City where citizens can make civil, non-emergency calls, e.g., reports of graffiti or noise complaints. The expectation is that geographic areas with more 311 calls should produce more housing violations. A number of theories can plausibly account for this relationship: areas with more 311 calls may have residents more inclined to contact authorities, or may be watched more closely by authorities, who then carry out more housing inspections. I will leave it to the reader to develop their own explanations.
The end goal is to build a model to predict housing violations. A successful model could potentially allow the city government to plan and target housing inspections more efficiently.
The first section of this paper describes the data and how it is assembled. Logs of 311 calls and housing violations from 2014 are pulled from NYC’s open data web site. Additionally, I use supplementary demographic data pulled from the U.S. Census Bureau (via a very nice R package called tidycensus).
Data Collection
311 Calls
The 311 call logs are updated every day on the NYC Open Data web site. The online records go back to 2010, and each year contains on the order of two million calls. For the purposes of this project, I will limit my exploration to 2014.
I used this curl call to download the dataset directly from the web site, and then used the grep command line tool to filter to 2014:
curl 'https://data.cityofnewyork.us/resource/fhrw-4uyv.csv?%24limit=5000&%24%24app_token=YOURAPPTOKENHERE' | grep '2014-' > ./data/raw/311_calls_2014.csv
Note that this command requires registering for an app token, that the URL must be quoted so the shell does not treat the & as a background operator, and that I had to fiddle with various query parameters (especially limit!) to get exactly what I wanted.
Load the data:
raw_311 <- read.csv('../data/raw/311_calls_2014.csv', stringsAsFactors=TRUE)
colnames(raw_311) <- c('address_type', 'agency', 'agency_name', 'bbl', 'borough',
'bridge_highway_direction', 'bridge_highway_name',
'bridge_highway_segment', 'city', 'closed_date',
'community_board', 'complaint_type', 'created_date',
'cross_street_1', 'cross_street_2', 'descriptor',
'due_date', 'facility_type', 'incident_address',
'incident_zip', 'intersection_street_1',
'intersection_street_2', 'landmark', 'latitude',
'location', 'location_address', 'location_city',
'location_state', 'location_type', 'location_zip',
'longitude', 'open_data_channel_type', 'park_borough',
'park_facility_name', 'resolution_action_updated_date',
'resolution_description', 'road_ramp', 'status',
'street_name', 'taxi_company_borough',
'taxi_pick_up_location', 'unique_key', 'vehicle_type',
'x_coordinate_state_plane', 'y_coordinate_state_plane')
calls <- raw_311 %>%
rename(lat=latitude, lon=longitude) %>%
mutate(created_date = as.Date(created_date, '%Y-%m-%dT%X')) %>%
filter(year(created_date) == 2014) %>%
drop_na(lat, lon)
rm(raw_311); gc() # garbage collection!
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 4851497 259.1 14442815 771.4 9574461 511.4
## Vcells 71438814 545.1 224984221 1716.5 281227329 2145.6
head(calls)
## address_type agency agency_name bbl
## 1 ADDRESS DEP Department of Environmental Protection 2058120054
## 2 ADDRESS DEP Department of Environmental Protection 2037060001
## 3 ADDRESS DEP Department of Environmental Protection 4026810027
## 4 INTERSECTION DEP Department of Environmental Protection NA
## 5 ADDRESS DEP Department of Environmental Protection 5006690001
## 6 ADDRESS DEP Department of Environmental Protection 4022450046
## borough bridge_highway_direction bridge_highway_name
## 1 BRONX
## 2 BRONX
## 3 QUEENS
## 4 STATEN ISLAND
## 5 STATEN ISLAND
## 6 QUEENS
## bridge_highway_segment city closed_date
## 1 BRONX 2014-12-30T00:00:00.000
## 2 BRONX 2014-11-13T00:00:00.000
## 3 Maspeth 2014-12-18T00:00:00.000
## 4 STATEN ISLAND 2014-10-07T00:00:00.000
## 5 STATEN ISLAND 2014-12-05T00:00:00.000
## 6 Forest Hills 2016-02-11T00:00:00.000
## community_board complaint_type created_date cross_street_1
## 1 08 BRONX ATF 2014-12-22 FIELDSTON RD
## 2 09 BRONX ATF 2014-11-05 QUIMBY AVE
## 3 05 QUEENS ATF 2014-11-13 58 PL
## 4 02 STATEN ISLAND ATF 2014-10-06
## 5 01 STATEN ISLAND FATF 2014-12-04 ONTARIO AVE
## 6 06 QUEENS ATF 2014-07-17 GRAND CENTRAL PKWY
## cross_street_2 descriptor due_date facility_type
## 1 DELAFIELD AVE N/A
## 2 CROSS BRONX EXPWY ET 6 A EB N/A
## 3 58 RD N/A
## 4 N/A
## 5 LOGAN AVE N/A
## 6 GRAND CENTRAL PKWY N/A
## incident_address incident_zip intersection_street_1
## 1 458 WEST 246 STREET 10471
## 2 930 ZEREGA AVENUE 10473
## 3 58-52 GRAND AVENUE 11378
## 4 10304 TARGEE STREET
## 5 1270 VICTORY BOULEVARD 10301
## 6 69-70 PEARTREE AVENUE 11375
## intersection_street_2 landmark lat
## 1 40.89360
## 2 40.82710
## 3 40.72003
## 4 RALPH PLACE 40.60509
## 5 40.61544
## 6 40.72387
## location location_address location_city
## 1 POINT (-73.906451039282 40.89360169479) NA NA
## 2 POINT (-73.843596451986 40.827103589668) NA NA
## 3 POINT (-73.909836842511 40.720028020229) NA NA
## 4 POINT (-74.091052322046 40.605092404738) NA NA
## 5 POINT (-74.105751318156 40.615435937341) NA NA
## 6 POINT (-73.837481878781 40.723868142958) NA NA
## location_state location_type location_zip lon
## 1 NA NA -73.90645
## 2 NA NA -73.84360
## 3 NA NA -73.90984
## 4 NA NA -74.09105
## 5 NA NA -74.10575
## 6 NA NA -73.83748
## open_data_channel_type park_borough park_facility_name
## 1 UNKNOWN BRONX Unspecified
## 2 UNKNOWN BRONX Unspecified
## 3 UNKNOWN QUEENS Unspecified
## 4 UNKNOWN STATEN ISLAND Unspecified
## 5 UNKNOWN STATEN ISLAND Unspecified
## 6 UNKNOWN QUEENS Unspecified
## resolution_action_updated_date resolution_description road_ramp status
## 1 2014-12-30T00:00:00.000 Closed
## 2 2014-11-13T00:00:00.000 Closed
## 3 2014-12-18T00:00:00.000 Closed
## 4 2014-10-07T00:00:00.000 Closed
## 5 2014-12-05T00:00:00.000 Closed
## 6 2016-02-11T00:00:00.000 Closed
## street_name taxi_company_borough taxi_pick_up_location unique_key
## 1 WEST 246 STREET 36273085
## 2 ZEREGA AVENUE 36281859
## 3 GRAND AVENUE 36282922
## 4 36381500
## 5 VICTORY BOULEVARD 36390454
## 6 PEARTREE AVENUE 36452915
## vehicle_type x_coordinate_state_plane y_coordinate_state_plane
## 1 1010114 264855
## 2 1027535 240652
## 3 1009243 201615
## 4 958967 159741
## 5 954890 163514
## 6 1029297 203043
The scale of 311 calls over a year is immediately apparent: Nearly 2 million calls in 2014!
Especially important are the geographic fields (address, borough, latitude, and longitude), the creation and close dates, and the complaint type.
Housing Violations
This data set was also pulled from the NYC Open Data web site, using a similar bash formulation:
curl https://data.cityofnewyork.us/resource/b2iz-pps8.csv?%24limit=5000&%24%24app_token=YOURAPPTOKENHERE > ./data/raw/housing_violations.csv
Note that this data set is also updated daily, but does not extend as far back in time as the calls dataset.
Load the dataset:
house <- read.csv('../data/raw/housing_violations.csv', stringsAsFactors=FALSE)
house <- house %>%
mutate(approveddate = as.Date(approveddate, '%Y-%m-%dT%X')) %>%
filter(year(approveddate) == 2014)
head(house)
## apartment approveddate bbl bin block boro boroid
## 1 2I 2014-01-28 2029760092 2010493 2976 BRONX 2
## 2 2014-01-29 2025220109 2003396 2522 BRONX 2
## 3 2014-01-29 2025220109 2003396 2522 BRONX 2
## 4 2B 2014-02-02 2025220109 2003396 2522 BRONX 2
## 5 1 2014-02-13 3043420016 3097632 4342 BROOKLYN 3
## 6 1 2014-02-13 3043420016 3097632 4342 BROOKLYN 3
## buildingid censustract certifieddate class communityboard
## 1 110036 123 A 3
## 2 103554 211 A 4
## 3 103554 211 B 4
## 4 103554 211 A 4
## 5 331370 1104 C 5
## 6 331370 1104 C 5
## councildistrict currentstatus currentstatusdate currentstatusid
## 1 17 VIOLATION CLOSED 2017-10-23T00:00:00.000 19
## 2 16 VIOLATION CLOSED 2017-10-19T00:00:00.000 19
## 3 16 VIOLATION CLOSED 2017-10-19T00:00:00.000 19
## 4 16 VIOLATION CLOSED 2017-10-26T00:00:00.000 19
## 5 42 VIOLATION CLOSED 2017-10-19T00:00:00.000 19
## 6 42 VIOLATION CLOSED 2017-10-19T00:00:00.000 19
## highhousenumber housenumber inspectiondate latitude longitude
## 1 1327 1327 2014-01-27T00:00:00.000 40.83098 -73.89157
## 2 1383 1383 2014-01-28T00:00:00.000 40.84214 -73.92328
## 3 1383 1383 2014-01-28T00:00:00.000 40.84214 -73.92328
## 4 1383 1383 2014-02-01T00:00:00.000 40.84214 -73.92328
## 5 176 176 2014-02-12T00:00:00.000 40.65646 -73.89339
## 6 176 176 2014-02-12T00:00:00.000 40.65646 -73.89339
## lot lowhousenumber newcertifybydate newcorrectbydate
## 1 92 1327
## 2 109 1383
## 3 109 1383
## 4 109 1383
## 5 16 176
## 6 16 176
## novdescription
## 1 SECTION 27-2005 ADM CODE PROPERLY REPAIR WITH SIMILAR MATERIAL THE BROKEN OR DEFECTIVE WOODEN FLOOR IN THE ENTIRE APARTMENT LOCATED AT APT 2I, 2nd STORY, 2nd APARTMENT FROM NORTH AT EAST
## 2 SECTION 27-2053 ADM CODE POST SIGN ON WALL OF ENTRANCE STORY BEARING NAME, ADDRESS INCLUDING APARTMENT NUMBER IF ANY, AND TELEPHONE NUMBER OF SUPERINTENDENT, JANITOR OR HOUSEKEEPER.
## 3 SECTION 27-2053 ADM CODE PROVIDE DWELLING WITH A JANITOR OR RESPONSIBLE PERSON OR JANITORIAL SERVICE.
## 4 SECTION 27-2005 ADM CODE PROPERLY SECURE THE LOOSE WOOD SADDLE AT DOOR OPENING IN THE KITCHEN LOCATED AT APT 2B, 2nd STORY, 2nd APARTMENT FROM WEST AT NORTH
## 5 SECTION 27-2031 ADMIN. CODE: PROVIDE HOT WATER AT ALL HOT WATER FIXTURES IN THE ENTIRE APARTMENT LOCATED AT APT 1, 1st STORY
## 6 SECTION 27-2073 ADM CODE - PROVIDE AN ADEQUATE SUPPLY OF GAS TO THE FIXTURES AT KITCHEN STOVE IN THE ENTIRE APARTMENT LOCATED AT APT 1, 1st STORY
## novid novissueddate novtype nta ordernumber
## 1 4768682 2014-01-29T00:00:00.000 Original Crotona Park East 502
## 2 4769405 2014-01-30T00:00:00.000 Original Highbridge 722
## 3 4769406 2014-01-30T00:00:00.000 Original Highbridge 721
## 4 4772303 2014-02-03T00:00:00.000 Original Highbridge 509
## 5 4781222 2014-02-14T00:00:00.000 Original East New York 970
## 6 4780994 2014-02-14T00:00:00.000 Original East New York 1042
## originalcertifybydate originalcorrectbydate registrationid
## 1 2014-05-18T00:00:00.000 2014-05-04T00:00:00.000 221732
## 2 2014-05-19T00:00:00.000 2014-05-05T00:00:00.000 200526
## 3 2014-03-20T00:00:00.000 2014-03-06T00:00:00.000 200526
## 4 2014-05-23T00:00:00.000 2014-05-09T00:00:00.000 200526
## 5 2014-02-25T00:00:00.000 2014-02-13T00:00:00.000 816931
## 6 2014-02-27T00:00:00.000 2014-02-22T00:00:00.000 816931
## story streetcode streetname violationid violationstatus
## 1 2 65320 SOUTHERN BOULEVARD 10112705 Close
## 2 All Stories 58720 PLIMPTON AVENUE 10113444 Close
## 3 All Stories 58720 PLIMPTON AVENUE 10113445 Close
## 4 2 58720 PLIMPTON AVENUE 10118834 Close
## 5 1 58230 MALTA STREET 10135223 Close
## 6 1 58230 MALTA STREET 10135252 Close
## zip
## 1 10459
## 2 10452
## 3 10452
## 4 10452
## 5 11207
## 6 11207
This housing violations data set is still substantial, but less voluminous than the calls dataset, at about 23 thousand violations in 2014. Important columns here include various dates (of inspection, certification, etc.), geographic information (address, lat/long, Census tract, etc.), a description of the violation, and violation status.
Census Demographics
The U.S. Census Bureau makes all of its data available to the public, including via an API. Fortunately for us, some thoughtful data citizens have made an easy-to-use R package called tidycensus, which makes the process of accessing Census data quite simple. One simply needs to install the package and get an API key from the Census.
Census data can be aggregated at various geographical levels. The smallest is the census tract, which is conveniently available in the house dataset. However, it is not in calls. It would be possible to use addresses in calls to look up each observation’s tract, but that would be tedious with nearly 2 million calls, even using an API. Additionally, that level of granularity might make the data too sparse. Instead, I chose to aggregate by zip code, a reasonable compromise between granularity and the need to avoid sparseness. (Unfortunately, tidycensus does not allow me to limit the data pull to just New York state when I request aggregation by zip code, so I will have to filter the resulting dataset.)
The Census tabulates hundreds of variables, many of which would be useful for this task. To keep the project manageable, I focus on summarizing geographic zones by measures of race/ethnicity, age, economic class, language, and education. Of course, most of these variables are related to each other, and we should expect the problem of multicollinearity to crop up in the modeling phase.
All sensitive permissions are stored in a config file (ignored by Git), accessible by sourcing the file.
Load the data:
source('../R/config.R')
nyc_zips <- unique(c(house$zip, calls$incident_zip))  # c(), not unlist(), to combine the two zip vectors
census_api_key(CENSUS_API_KEY)
census <- get_acs(geography = 'zip code tabulation area', year = 2014,
variables = c(total_pop ='B01001_001',
pop_white = 'B01001H_001', # not hispanic
pop_black = 'B01001B_001',
pop_hispanic = 'B01001I_001',
median_age = 'B01002_001',
below_poverty = 'B05010_002',
speak_english = 'B06007_002',
speak_spanish = 'B06007_003',
bachelors = 'B06008_002',
married = 'B06008_003',
no_hs = 'B06009_002',
hs = 'B06009_003',
bach_degree = 'B06009_005',
grad_degree = 'B06009_006',
income_high = 'B06010_011')) # $75k+
census <- census %>%
filter(GEOID %in% nyc_zips) %>%
select(-moe) %>%
spread(variable, estimate)
Normalize these variables by converting most to proportions of the zip code’s total population:
census <- census %>%
mutate(GEOID = as.integer(GEOID),
bach_degree = bach_degree / total_pop,
bachelors = bachelors / total_pop,
below_poverty = below_poverty/ total_pop,
grad_degree = grad_degree / total_pop,
hs = hs / total_pop,
income_high = income_high / total_pop,
married = married / total_pop,
no_hs = no_hs / total_pop,
pop_black = pop_black / total_pop,
pop_hispanic = pop_hispanic / total_pop,
pop_white = pop_white / total_pop,
speak_english = speak_english / total_pop,
speak_spanish = speak_spanish / total_pop)
colnames(census)[1] <- 'zip'
head(census)
## # A tibble: 6 x 17
## zip NAME bach_degree bachelors below_poverty grad_degree hs
## <int> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 10001 ZCTA5 10001 0.268 0.529 0.0406 0.238 0.101
## 2 10002 ZCTA5 10002 0.177 0.392 0.0460 0.0871 0.156
## 3 10003 ZCTA5 10003 0.307 0.599 0.00264 0.262 0.0428
## 4 10009 ZCTA5 10009 0.289 0.542 0.0313 0.175 0.107
## 5 10010 ZCTA5 10010 0.305 0.530 0.00597 0.285 0.0453
## 6 10011 ZCTA5 10011 0.331 0.483 0.0105 0.318 0.0541
## # ... with 10 more variables: income_high <dbl>, married <dbl>,
## # median_age <dbl>, no_hs <dbl>, pop_black <dbl>, pop_hispanic <dbl>,
## # pop_white <dbl>, speak_english <dbl>, speak_spanish <dbl>,
## # total_pop <dbl>
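As an aside, the repetitive division above could be written more compactly with dplyr’s across(). This is only a sketch of an equivalent formulation (it assumes dplyr 1.0 or later and was not what was run here):
census <- census %>%
  mutate(GEOID = as.integer(GEOID),
         # divide every share column by the zip code's total population
         across(c(bach_degree, bachelors, below_poverty, grad_degree, hs,
                  income_high, married, no_hs, pop_black, pop_hispanic,
                  pop_white, speak_english, speak_spanish),
                ~ .x / total_pop))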
Dependent and Independent Variables
The next step is to transform housing violations into a ‘proper’ dependent variable, accounting for time.
Roll up violations by month:
house_ts <- house %>%
mutate(epoch = as.Date(paste(year(inspectiondate), month(inspectiondate), '01', sep='-'), '%Y-%m-%d')) %>%
group_by(zip, epoch) %>%
summarise(violations = n()) %>%
arrange(zip, epoch)
head(house_ts)
## # A tibble: 6 x 3
## # Groups: zip [3]
## zip epoch violations
## <int> <date> <int>
## 1 10001 2014-01-01 4
## 2 10001 2014-02-01 10
## 3 10002 2014-01-01 49
## 4 10002 2014-02-01 44
## 5 10003 2014-01-01 20
## 6 10003 2014-02-01 25
However, not every possible (zip, epoch) combination is present: the data set as it stands is missing every observation where violations = 0. To remedy this, we can build the full grid of combinations with a cross join using merge:
housing_obs <- unique(merge(house_ts$zip, house_ts$epoch, all=TRUE))
colnames(housing_obs) <- c('zip', 'epoch')
house_ts <- housing_obs %>%
left_join(house_ts, by=c('zip', 'epoch')) %>%
mutate(violations = replace_na(violations, 0)) %>%
arrange(zip, epoch)
rm(housing_obs)
head(house_ts)
## zip epoch violations
## 1 10001 2013-11-01 0
## 2 10001 2013-12-01 0
## 3 10001 2014-01-01 4
## 4 10001 2014-02-01 10
## 5 10001 2014-03-01 0
## 6 10001 2014-04-01 0
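As a design note, tidyr offers a more explicit way to build the same grid; a sketch of the equivalent call (not run here):
housing_obs <- crossing(zip = unique(house_ts$zip), epoch = unique(house_ts$epoch))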
Now let’s perform a similar operation on calls:
calls_ts <- calls %>%
mutate(epoch = as.Date(paste(year(created_date), month(created_date), '01', sep='-'), '%Y-%m-%d')) %>%
group_by(incident_zip, epoch) %>%
summarize(calls = n()) %>%
arrange(incident_zip, epoch)
colnames(calls_ts) <- c('zip', 'epoch', 'calls')
calls_obs <- unique(merge(calls_ts$zip, calls_ts$epoch, all=TRUE))
colnames(calls_obs) <- c('zip', 'epoch')
calls_ts <- calls_obs %>%
left_join(calls_ts, by=c('zip', 'epoch'))%>%
mutate(calls = replace_na(calls, 0)) %>%
arrange(zip, epoch)
rm(calls_obs)
head(calls_ts)
## zip epoch calls
## 1 2014-01-01 55
## 2 2014-02-01 83
## 3 2014-03-01 114
## 4 2014-04-01 82
## 5 2014-05-01 93
## 6 2014-06-01 37
Finally, join the three tables to create the final data set (the blank zip visible above belongs to calls with no recorded zip code; such rows drop out in the inner join):
calls_ts$zip <- as.factor(calls_ts$zip)
house_ts$zip <- as.factor(house_ts$zip)
census$zip <- as.factor(census$zip)
df <- calls_ts %>%
inner_join(house_ts, by=c('zip', 'epoch')) %>%
left_join(census, by='zip') %>%
select(-NAME) %>%
filter(year(epoch) == 2014)
df <- df[complete.cases(df), ]
Split into test and train sets:
split_data <- function(df, proportion, seed) {
set.seed(seed)
n <- nrow(df)
n_train <- as.integer(n * proportion)
n_test <- n - n_train
index <- seq_len(n)
train_index <- sample(index, n_train)
train <- df[train_index,]
test <- df[-train_index,]
return(list('train'=train, 'test'=test))
}
split <- split_data(df, proportion=.75, seed=1804)
train <- split$train
test <- split$test
write.csv(train, '../data/clean/train.csv', row.names=FALSE)
write.csv(test, '../data/clean/test.csv', row.names=FALSE)
# keep workspace clean
rm(split, df, calls, census, house); gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 2111823 112.8 20082226 1072.6 22802738 1217.8
## Vcells 20049699 153.0 259322622 1978.5 322894673 2463.5
Exploratory Data Analysis
First, let’s get a sense of our two variables of interest: violations and calls.
hist_vio <- ggplot(train, aes(x=violations)) + geom_histogram(binwidth=10)
hist_calls <- ggplot(train, aes(x=calls)) + geom_histogram(binwidth=100)
multiplot(hist_vio, hist_calls)
quantile(train$violations, probs=c(.1, .25, .5, .75, .9, .95, .99))
## 10% 25% 50% 75% 90% 95% 99%
## 0.0 0.0 0.0 0.0 18.0 53.5 299.5
quantile(train$calls, probs=c(.1, .25, .5, .75, .9, .95, .99))
## 10% 25% 50% 75% 90% 95% 99%
## 335.0 542.5 839.0 1298.0 1730.0 2026.0 3001.7
calls is a much nicer variable than violations. It has a distribution that can much more reasonably be called ‘approximately normal.’ It is still right-skewed, however, with a mean above its median. violations is highly distorted, with lots of zeros and a very long right tail: at least 75 percent of (zip, epoch) observations have zero violations.
Let’s examine the relationship between the two variables with a scatterplot:
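The plotting chunk was not echoed; a minimal sketch of what it presumably looked like (a basic ggplot2 scatterplot, with transparency to handle overplotting):
ggplot(train, aes(x=calls, y=violations)) + geom_point(alpha=0.3)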
There are two groups of observations in this plot: one that is clearly linear, and a cluster of observations with a wide range of calls but no housing violations. Hopefully adding demographic variables will help straighten this relationship out for the model.
Correlations
The corrplot package creates a helpful visual to help us understand the correlations between our numeric variables:
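The chunk itself was not echoed; a sketch of a typical call, assuming the non-numeric columns are dropped first:
num_cols <- train %>% select(-zip, -epoch)  # keep only numeric columns for cor()
corrplot(cor(num_cols), method='color', type='lower', tl.cex=0.7)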
This plot is informative. violations appears only weakly correlated with most of the variables. Meanwhile, calls is more strongly correlated with variables such as below_poverty (0.4), median_age (-0.42), and speak_spanish (0.31); total_pop has a correlation of 0.77! This suggests it might be wise to ‘standardize’ calls and violations by dividing both by total_pop, creating a sort of ‘311 calls and housing violations per capita.’
The strongest correlations are between the demographic variables. For instance, the proportion of people in a zip code with a bachelor’s degree is negatively correlated with the proportion of people who live below the poverty line. This is to be expected; we’ll have to keep an eye on these correlations in the modeling stage.
Modeling
The overall analytic strategy is to divide the full dataset into train and test sets (see above), fit a number of models on the training set, apply the trained models to the test set, and evaluate each model according to mean squared error (\(MSE\)). I will also pay attention to \(R^2\), the percentage of variation in violations each model accounts for. However, \(MSE\) is our primary measure to minimize.
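Formally, for observed values \(y_i\) and predictions \(\hat{y}_i\), \(MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2\), and \(R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}\).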
Because I use different kinds of models from different packages, it is helpful to have these measures explicitly coded:
mse <- function(m) mean(resid(m)^2)
calc_r2 <- function(y, y_hat) {
  rss <- sum((y_hat - y)^2)
  # NB: classical R^2 centers the total sum of squares on mean(y); centering on
  # mean(y_hat) matches it exactly for in-sample OLS fits (where the two means
  # coincide) and pins the dummy model at exactly 0
  tss <- sum((y - mean(y_hat))^2)
  return(1 - (rss/tss))
}
\(M_0\): Dummy Model
First, I create a dummy model that predicts the mean of violations for every observation. This baseline will let us see how much the more complicated models add to our predictive capability:
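The defining chunk was not echoed; an intercept-only regression reproduces this baseline (a sketch, assuming lm was used so that predict() works in the evaluation section below):
m0 <- lm(violations ~ 1, data=train)  # intercept-only: every prediction is mean(train$violations)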
In-sample performance:
mse(m0)
## [1] 2140.327
calc_r2(train$violations, predict(m0, train))
## [1] 0
Ideally, future models will have a much lower \(MSE\) than 2140, and a much higher \(R^2\) than 0. If they do not, we’ll know we’re not getting anywhere with this data set.
\(M_1\): Simple Linear Regression
The next model \(M_1\) is a simple linear regression with all of the independent variables:
m1 <- lm(violations ~ calls + bach_degree + bachelors + below_poverty +
grad_degree + hs + income_high + married + median_age +
no_hs + pop_black + pop_hispanic + pop_white +
speak_english + speak_spanish + total_pop, train)
summary(m1)
##
## Call:
## lm(formula = violations ~ calls + bach_degree + bachelors + below_poverty +
## grad_degree + hs + income_high + married + median_age + no_hs +
## pop_black + pop_hispanic + pop_white + speak_english + speak_spanish +
## total_pop, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -141.08 -13.71 -1.74 5.70 351.59
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.108e+01 5.431e+01 -0.204 0.8383
## calls 5.600e-02 3.393e-03 16.505 <2e-16 ***
## bach_degree -1.013e+02 5.759e+01 -1.759 0.0788 .
## bachelors 5.144e+00 5.045e+01 0.102 0.9188
## below_poverty 9.829e+01 7.708e+01 1.275 0.2025
## grad_degree -1.409e+01 5.573e+01 -0.253 0.8004
## hs -9.054e+01 6.719e+01 -1.347 0.1780
## income_high 7.040e+01 4.382e+01 1.607 0.1083
## married 3.412e+01 6.486e+01 0.526 0.5990
## median_age 6.643e-01 5.069e-01 1.311 0.1902
## no_hs -5.072e+00 5.063e+01 -0.100 0.9202
## pop_black 2.537e+01 1.919e+01 1.322 0.1863
## pop_hispanic 1.360e+01 6.133e+01 0.222 0.8246
## pop_white 7.404e+00 1.443e+01 0.513 0.6081
## speak_english -4.050e+01 1.830e+01 -2.214 0.0270 *
## speak_spanish -2.167e+01 7.029e+01 -0.308 0.7579
## total_pop -7.484e-04 7.564e-05 -9.894 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 41.04 on 1414 degrees of freedom
## Multiple R-squared: 0.2224, Adjusted R-squared: 0.2136
## F-statistic: 25.28 on 16 and 1414 DF, p-value: < 2.2e-16
mse(m1)
## [1] 1664.268
This model performs reasonably well for a first pass. Its \(MSE = 1664\) is 22 percent less than \(M_0\)’s, and \(R^2 = 0.22\). The \(F\)-statistic is highly significant, confirming that this model performs better than a dummy model. Interestingly, the intercept is not significant, i.e., it is not distinguishable from zero. In my view, this could make sense if housing violations are driven entirely by the independent variables.
The intuition underlying this paper, that 311 calls have something to do with housing violations, is confirmed by this model. The positive \(\beta\) for calls indicates that as calls increase, so do housing violations. The estimate is significant at \(p < .001\). Interestingly, total_pop has a negative relationship with violations. This is hard to square with intuition; a priori, it should have a positive relationship. For now, I will not worry too much about this because the coefficient is very small.
Let’s examine the residuals:
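The plotting chunk was not echoed; a minimal sketch of a residual histogram in the style of the earlier plots:
ggplot(data.frame(resid = resid(m1)), aes(x=resid)) + geom_histogram(binwidth=10)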
It is clear that \(M_1\)’s residuals deviate sharply from a normal distribution. There is a long right tail, indicating quite a few outlying observations that the model does not handle well.
Let’s examine some of these leading residuals:
train_m1 <- train %>%
mutate(pred = predict(m1, train),
resid = violations - pred) %>%
arrange(desc(resid))
head(train_m1[c('zip', 'epoch', 'resid')], 10)
## zip epoch resid
## 1 11226 2014-02-01 351.5891
## 2 11213 2014-02-01 313.6789
## 3 11207 2014-01-01 310.0429
## 4 11221 2014-02-01 295.4526
## 5 10457 2014-02-01 288.9505
## 6 10032 2014-02-01 279.7924
## 7 10456 2014-02-01 262.8867
## 8 10453 2014-02-01 261.4567
## 9 10456 2014-01-01 261.4486
## 10 10452 2014-02-01 254.7850
\(M_1\) appears to perform the worst in winter: January and February. This indicates that it may be possible to improve the model by adding a month variable.
\(M_2\): Regression + Month
To remedy this, let’s conduct the same kind of regression but add a variable for month:
train_m2 <- train %>%
mutate(month = month(epoch))
m2 <- lm(violations ~ calls + bach_degree + bachelors + below_poverty +
grad_degree + hs + income_high + married + median_age +
no_hs + pop_black + pop_hispanic + pop_white +
speak_english + speak_spanish + total_pop +
as.factor(month), data=train_m2)
summary(m2)
##
## Call:
## lm(formula = violations ~ calls + bach_degree + bachelors + below_poverty +
## grad_degree + hs + income_high + married + median_age + no_hs +
## pop_black + pop_hispanic + pop_white + speak_english + speak_spanish +
## total_pop + as.factor(month), data = train_m2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -108.353 -13.470 3.028 11.018 313.795
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.039e+01 4.617e+01 0.875 0.3818
## calls 4.513e-02 2.966e-03 15.214 <2e-16 ***
## bach_degree -8.523e+01 4.893e+01 -1.742 0.0817 .
## bachelors 1.215e+01 4.284e+01 0.284 0.7768
## below_poverty 6.675e+01 6.550e+01 1.019 0.3083
## grad_degree -2.339e+01 4.731e+01 -0.494 0.6211
## hs -1.005e+02 5.707e+01 -1.760 0.0785 .
## income_high 4.113e+01 3.723e+01 1.105 0.2694
## married 2.393e+01 5.509e+01 0.434 0.6641
## median_age 5.423e-01 4.307e-01 1.259 0.2082
## no_hs -1.064e+01 4.300e+01 -0.247 0.8047
## pop_black 1.888e+01 1.631e+01 1.157 0.2473
## pop_hispanic -2.406e+00 5.209e+01 -0.046 0.9632
## pop_white 4.412e+00 1.227e+01 0.360 0.7192
## speak_english -2.616e+01 1.556e+01 -1.681 0.0929 .
## speak_spanish 2.813e+00 5.970e+01 0.047 0.9624
## total_pop -5.518e-04 6.559e-05 -8.413 <2e-16 ***
## as.factor(month)2 8.349e+00 4.529e+00 1.844 0.0655 .
## as.factor(month)3 -5.613e+01 4.525e+00 -12.403 <2e-16 ***
## as.factor(month)4 -5.243e+01 4.605e+00 -11.384 <2e-16 ***
## as.factor(month)5 -5.587e+01 4.554e+00 -12.268 <2e-16 ***
## as.factor(month)6 -5.552e+01 4.514e+00 -12.299 <2e-16 ***
## as.factor(month)7 -5.481e+01 4.600e+00 -11.915 <2e-16 ***
## as.factor(month)8 -5.280e+01 4.553e+00 -11.596 <2e-16 ***
## as.factor(month)9 -5.180e+01 4.634e+00 -11.180 <2e-16 ***
## as.factor(month)10 -5.433e+01 4.499e+00 -12.077 <2e-16 ***
## as.factor(month)11 -5.737e+01 4.463e+00 -12.855 <2e-16 ***
## as.factor(month)12 -5.555e+01 4.496e+00 -12.355 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 34.83 on 1403 degrees of freedom
## Multiple R-squared: 0.4442, Adjusted R-squared: 0.4335
## F-statistic: 41.53 on 27 and 1403 DF, p-value: < 2.2e-16
mse(m2)
## [1] 1189.566
Adding month really helped! Almost all of the month dummies are significant at \(p < .001\). \(MSE\) falls by almost a third compared to \(M_1\), and \(R^2\) doubles from 0.22 to 0.44.
Accounting for month also makes intuitive sense: Winter is when people use heating, and New Yorkers are prone to having problems with their apartment’s heat. This could naturally inspire more calls to the housing inspection authorities.
Although calls remains highly significant, almost none of the other variables do.
Residual plots:
The shape of the residuals remains about the same compared to \(M_1\), although notice the maximum residual (the right tail) is smaller:
max(resid(m1))
## [1] 351.5891
max(resid(m2))
## [1] 313.7954
Let’s re-check the leading residuals:
train_m2 <- train_m2 %>%
mutate(pred = predict(m2, train_m2),
resid = violations - pred) %>%
arrange(desc(resid))
head(train_m2[c('zip', 'epoch', 'resid')], 10)
## zip epoch resid
## 1 11226 2014-02-01 313.7954
## 2 11207 2014-01-01 272.4597
## 3 11213 2014-02-01 267.4036
## 4 11221 2014-02-01 244.8598
## 5 10457 2014-02-01 242.1114
## 6 10456 2014-01-01 230.1314
## 7 10032 2014-02-01 226.9596
## 8 10458 2014-01-01 217.5249
## 9 10453 2014-02-01 216.4459
## 10 10456 2014-02-01 209.9291
The residuals on the whole are smaller, but it appears to be the same observations that \(M_1\) missed.
\(M_3\): Mixed Effects Panel Model
The three previous models have used vanilla linear regression. However, this is not analytically appropriate. Linear regression assumes that the observations are independent of each other, which is not the case for this dataset. The number of housing violations in a zip code in month \(t\) is not independent of the number in month \(t-1\); as we’ve seen above, February’s housing violations are not independent of January’s.
We can use mixed models to account for the fact that observations are sampled from the same ‘unit’ multiple times. I will not attempt to explain the underlying logic of these models here.
zip and month are coded as random effects:
train_m3 <- train %>%
mutate(month = as.factor(month(epoch)))
m3 <- lmer(violations ~ calls + bach_degree + bachelors + below_poverty +
grad_degree + hs + income_high + married + median_age +
no_hs + pop_black + pop_hispanic + pop_white +
speak_english + speak_spanish + total_pop + (1 | zip) +
(1 | month), data=train_m3)
## Warning: Some predictor variables are on very different scales: consider
## rescaling
summary(m3)
## Linear mixed model fit by REML ['lmerMod']
## Formula: violations ~ calls + bach_degree + bachelors + below_poverty +
## grad_degree + hs + income_high + married + median_age + no_hs +
## pop_black + pop_hispanic + pop_white + speak_english + speak_spanish +
## total_pop + (1 | zip) + (1 | month)
## Data: train_m3
##
## REML criterion at convergence: 14173.8
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0934 -0.3897 0.0822 0.3124 9.0298
##
## Random effects:
## Groups Name Variance Std.Dev.
## zip (Intercept) 0.0 0.00
## month (Intercept) 519.7 22.80
## Residual 1213.3 34.83
## Number of obs: 1431, groups: zip, 159; month, 12
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) -4.591e+00 4.660e+01 -0.099
## calls 4.535e-02 2.965e-03 15.297
## bach_degree -8.555e+01 4.893e+01 -1.749
## bachelors 1.201e+01 4.284e+01 0.280
## below_poverty 6.738e+01 6.550e+01 1.029
## grad_degree -2.321e+01 4.731e+01 -0.490
## hs -1.003e+02 5.707e+01 -1.757
## income_high 4.172e+01 3.723e+01 1.121
## married 2.413e+01 5.509e+01 0.438
## median_age 5.448e-01 4.307e-01 1.265
## no_hs -1.053e+01 4.300e+01 -0.245
## pop_black 1.900e+01 1.631e+01 1.165
## pop_hispanic -2.084e+00 5.209e+01 -0.040
## pop_white 4.469e+00 1.227e+01 0.364
## speak_english -2.644e+01 1.556e+01 -1.700
## speak_spanish 2.315e+00 5.970e+01 0.039
## total_pop -5.558e-04 6.556e-05 -8.477
##
## Correlation matrix not shown by default, as p = 17 > 12.
## Use print(x, correlation=TRUE) or
## vcov(x) if you need it
## fit warnings:
## Some predictor variables are on very different scales: consider rescaling
mse(m3)
## [1] 1189.749
calc_r2(train_m3$violations, fitted(m3))
## [1] 0.4441274
Surprisingly, using this more appropriate method fails to decrease \(MSE\) or improve \(R^2\) over \(M_2\) at all. In fact, the two models are nearly identical.
\(M_4\): Dealing with Multicollinearity
From the correlation plot and some of the residuals, it seems likely that some, perhaps most, of the demographic variables from the Census are correlated with each other. This state of affairs, called multicollinearity, is harmful to the modeling process because it can cause misestimation of coefficients. Training a model on less-correlated variables should make the estimates more accurate and increase performance on the test set.
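As an aside, a standard way to quantify this is the variance inflation factor; a sketch using the car package (an assumption on my part, since car is not among the libraries loaded above):
library(car)
vif(m1)  # values well above 5-10 flag problematic collinearity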
Any sociologist can tell you that education level, poverty, income, and language are all related to each other, primarily by class-exclusionary mechanisms. I will try to retain the essentials of this data while reducing the correlation:
train_m4 <- train %>%
mutate(month = as.factor(month(epoch)),
college = bach_degree + grad_degree) %>%
select(-bach_degree, -bachelors, -grad_degree, -income_high, -pop_hispanic,
-pop_black, -speak_spanish, -hs)
m4 <- lmer(violations ~ calls + college + below_poverty + married +
median_age + no_hs + pop_white + speak_english +
total_pop + (1 | zip) + (1 | month), data=train_m4)
## Warning: Some predictor variables are on very different scales: consider
## rescaling
summary(m4)
## Linear mixed model fit by REML ['lmerMod']
## Formula:
## violations ~ calls + college + below_poverty + married + median_age +
## no_hs + pop_white + speak_english + total_pop + (1 | zip) +
## (1 | month)
## Data: train_m4
##
## REML criterion at convergence: 14244.6
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.2114 -0.3983 0.0733 0.3007 9.0422
##
## Random effects:
## Groups Name Variance Std.Dev.
## zip (Intercept) 0 0.00
## month (Intercept) 523 22.87
## Residual 1215 34.86
## Number of obs: 1431, groups: zip, 159; month, 12
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) -2.988e+01 2.029e+01 -1.473
## calls 4.462e-02 2.871e-03 15.540
## college 1.453e+01 1.223e+01 1.188
## below_poverty 1.506e+02 5.038e+01 2.990
## married 8.353e+00 2.071e+01 0.403
## median_age 5.289e-01 3.329e-01 1.589
## no_hs -1.012e+01 3.421e+01 -0.296
## pop_white -1.163e+01 7.187e+00 -1.618
## speak_english -6.255e+00 8.990e+00 -0.696
## total_pop -5.371e-04 6.247e-05 -8.598
##
## Correlation of Fixed Effects:
## (Intr) calls colleg blw_pv marrid medn_g no_hs pp_wht spk_ng
## calls -0.208
## college -0.599 0.091
## below_pvrty -0.623 0.135 0.552
## married -0.441 0.272 0.479 0.455
## median_age -0.601 0.134 0.237 0.449 -0.210
## no_hs -0.467 -0.007 0.330 -0.085 0.235 -0.044
## pop_white 0.240 -0.055 -0.624 -0.349 -0.464 -0.140 0.197
## speak_nglsh -0.541 -0.005 0.279 0.230 0.502 -0.093 0.651 -0.042
## total_pop 0.077 -0.776 -0.040 -0.172 -0.274 -0.011 -0.037 0.001 0.018
## fit warnings:
## Some predictor variables are on very different scales: consider rescaling
mse(m4)
## [1] 1197.75
calc_r2(train_m4$violations, fitted(m4))
## [1] 0.4403892
\(R^2\) and \(MSE\) do not change much compared to the two previous models. However, some parameter estimates have changed quite drastically. This should translate into improved performance on the test set, even if in-sample performance is unchanged.
Bonus: \(M_5\): Zero-Inflated Poisson Regression
NOTE: This is new territory for me, just trying it out for the first time!
Since \(M_4\) has likely done a better job with regard to multicollinearity, I am using its reduced variable set as the basis for this next model: a zero-inflated Poisson regression, a special kind of Poisson regression built to deal with count outcomes that have excessive zeros. It is part of the pscl package; a UCLA web site has a quick tutorial. In zeroinfl’s two-part formula, y ~ count_predictors | zero_predictors, the first part fits a Poisson model for the counts and the second a logistic model for the excess zeros.
train_m5 <- train %>%
mutate(month = as.factor(month(epoch)),
college = bach_degree + grad_degree) %>%
select(-bach_degree, -bachelors, -grad_degree, -income_high, -pop_hispanic,
-pop_black, -speak_spanish, -hs)
m5 <- zeroinfl(violations ~ calls + below_poverty + pop_white + median_age +
college | month, data=train_m5)
summary(m5)
##
## Call:
## zeroinfl(formula = violations ~ calls + below_poverty + pop_white +
## median_age + college | month, data = train_m5)
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -4.3586 -0.3253 -0.2796 -0.2407 15.2318
##
## Count model coefficients (poisson with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.819e+00 1.959e-01 19.50 <2e-16 ***
## calls 8.473e-04 8.863e-06 95.60 <2e-16 ***
## below_poverty 5.165e+00 4.254e-01 12.14 <2e-16 ***
## pop_white -1.245e+00 7.794e-02 -15.97 <2e-16 ***
## median_age -5.094e-02 4.448e-03 -11.45 <2e-16 ***
## college 2.034e+00 1.416e-01 14.36 <2e-16 ***
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.6811 0.5849 -6.293 3.11e-10 ***
## month2 1.2604 0.6777 1.860 0.0629 .
## month3 7.3445 0.8271 8.880 < 2e-16 ***
## month4 6.1563 0.6802 9.051 < 2e-16 ***
## month5 6.2834 0.6902 9.103 < 2e-16 ***
## month6 6.4623 0.7028 9.196 < 2e-16 ***
## month7 5.7212 0.6550 8.734 < 2e-16 ***
## month8 5.8779 0.6594 8.915 < 2e-16 ***
## month9 5.9177 0.6654 8.893 < 2e-16 ***
## month10 5.6639 0.6465 8.760 < 2e-16 ***
## month11 6.1228 0.6715 9.118 < 2e-16 ***
## month12 6.2195 0.6797 9.150 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Number of iterations in BFGS optimization: 29
## Log-likelihood: -1.062e+04 on 18 Df
mse(m5)
## [1] 2.583519
calc_r2(train_m5$violations, predict(m5, train_m5))
## [1] 0.6543592
At first this \(MSE\) looks baffling, but the explanation is that residuals() on a zeroinfl object returns Pearson (standardized) residuals by default, so mse() here is computed on a different scale than for the previous models. The \(R^2 = 0.65\) looks great though: nearly fifty percent better than the last few models.
Model Evaluation
It is finally time to evaluate each of these models on the test set!
Prepare the test set for each model:
test <- test %>% mutate(month = as.factor(month(epoch)))
test_m0 <- test
test_m0 <- test_m0 %>%
mutate(pred = predict(m0, test_m0),
resid = violations - pred)
test_m1 <- test
test_m1 <- test_m1 %>%
mutate(pred = predict(m1, test_m1),
resid = violations - pred)
test_m2 <- test
test_m2 <- test_m2 %>%
mutate(pred = predict(m2, test_m2),
resid = violations - pred)
test_m3 <- test
test_m3 <- test_m3 %>%
mutate(pred = predict(m3, test_m3),
resid = violations - pred)
test_m4 <- test %>%
mutate(month = as.factor(month(epoch)),
college = bach_degree + grad_degree) %>%
select(-bach_degree, -bachelors, -grad_degree, -income_high, -pop_hispanic,
-pop_black, -speak_spanish, -hs)
test_m4 <- test_m4 %>%
mutate(pred = predict(m4, test_m4),
resid = violations - pred)
test_m5 <- test %>%
mutate(month = as.factor(month(epoch)),
college = bach_degree + grad_degree) %>%
select(-bach_degree, -bachelors, -grad_degree, -income_high, -pop_hispanic,
-pop_black, -speak_spanish, -hs)
test_m5 <- test_m5 %>%
mutate(pred = predict(m5, test_m5),
resid = violations - pred)
Calculate \(MSE\) for each model on the test set:
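The evaluation chunk was not echoed; a minimal sketch of code that produces output of this shape:
print(paste('M_0:', mean(test_m0$resid^2)))
print(paste('M_1:', mean(test_m1$resid^2)))
print(paste('M_2:', mean(test_m2$resid^2)))
print(paste('M_3:', mean(test_m3$resid^2)))
print(paste('M_4:', mean(test_m4$resid^2)))
print(paste('M_5:', mean(test_m5$resid^2)))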
## [1] "M_0: 2656.90358493984"
## [1] "M_1: 1848.85988638713"
## [1] "M_2: 1305.03244510467"
## [1] "M_3: 1306.33557707587"
## [1] "M_4: 1303.55055189596"
## [1] "M_5: 757.679178844792"
The zero-inflated Poisson regression \(M_5\) came out best, with almost 50 percent lower \(MSE\) than the mixed effects models! As expected, accounting for month improved \(M_2\) substantially over \(M_1\). Surprisingly, using a mixed model did not improve \(M_3\) over \(M_2\). Also a surprise: \(M_4\), the model meant to deal with multicollinearity, yielded only a slight improvement.
Now let’s look at \(R^2\):
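Again a sketch of the presumed evaluation code, reusing calc_r2 from above:
print(paste('M_0:', calc_r2(test_m0$violations, test_m0$pred)))
print(paste('M_1:', calc_r2(test_m1$violations, test_m1$pred)))
print(paste('M_2:', calc_r2(test_m2$violations, test_m2$pred)))
print(paste('M_3:', calc_r2(test_m3$violations, test_m3$pred)))
print(paste('M_4:', calc_r2(test_m4$violations, test_m4$pred)))
print(paste('M_5:', calc_r2(test_m5$violations, test_m5$pred)))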
## [1] "M_0: 0"
## [1] "M_1: 0.303941300833325"
## [1] "M_2: 0.508693379627696"
## [1] "M_3: 0.508202160577651"
## [1] "M_4: 0.50924525862471"
## [1] "M_5: 0.715024352069364"
Interestingly, the \(R^2\) for each model on the test set is higher than when applied to the training set. \(M_5\) comes out the winner again, explaining almost 72 percent of variance in violations.
Conclusion
This paper used open data from New York City and the U.S. Census Bureau to predict monthly housing violations. A number of models were tested. The best was a zero-inflated Poisson regression, a model built to deal with count outcomes that have excessive zeros. It explained 72 percent of the variation in housing violations on the hold-out set.
The percentage of people living below the poverty line in a zip code, and the percentage with college degrees, are positively associated with housing violations. A higher proportion of white residents and a higher median age are both negatively associated with housing violations.
This paper was particularly interested in whether the number of calls made to 311 in a zip code could help predict the number of housing violations. Every model found a highly significant, positive relationship between the two, even when accounting for demographic variables and seasonality: every 1175 calls to 311 is associated with roughly one additional housing violation.
I would like to thank whoever wrote tidycensus for making this project a million times easier, and I am also fortunate that this Poisson model worked out so well.