Modeling Housing Violations in New York City

Ben Horvath

December 9, 2018

Load libraries:
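The library chunk is not echoed in this rendering. A plausible set, inferred from the packages used later in the document (an assumption, not necessarily the author's exact list):

library(dplyr)       # data manipulation
library(readr)       # reading the raw CSVs
library(lubridate)   # date handling for the monthly roll-ups
library(ggplot2)     # plots
library(tidycensus)  # Census demographics
library(corrplot)    # correlation plot
library(lme4)        # mixed effects models (lmer)
library(pscl)        # zero-inflated Poisson regression (zeroinfl)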

Introduction

The purpose of this document is to explore the relationship between 311 calls and housing violations in New York City. After investigating their statistical properties and incorporating demographic variables, I develop a number of successful models for predicting housing violations in NYC zip codes. After testing each model on a hold-out set, the best model was a zero-inflated Poisson regression that accounted for roughly 72 percent of the variation in housing violations.

311 is New York City’s phone line for civil, non-emergency complaints, e.g., reports of graffiti or noise. The expectation is that geographic areas with more 311 calls should also produce more housing violations. A number of theories could plausibly account for such a relationship: residents of areas with more 311 calls may be more inclined to contact authorities, or those areas may be watched more closely by authorities, who then carry out more housing inspections. I will leave it to the reader to develop their own explanations.

The end goal is to build a model to predict housing violations. A successful model could potentially allow the city government to plan and target housing inspections more efficiently.

The first section of this paper describes the data and how it is assembled. Logs of 311 calls and housing violations from 2014 are pulled from NYC’s open data web site. Additionally, I use supplementary demographic data pulled from the U.S. Census Bureau (via a very nice R package called tidycensus).

Data Collection

311 Calls

The 311 call logs are automatically updated every day on the NYC Open Data web site. The online records go back to 2010, and each year contains roughly two million calls. For the purposes of this project, I will limit my exploration to 2014.

I used this curl call to download the dataset directly from the web site, and then used the grep command line tool to filter to 2014:

curl 'https://data.cityofnewyork.us/resource/fhrw-4uyv.csv?%24limit=5000&%24%24app_token=YOURAPPTOKENHERE' | grep '2014-' > ./data/raw/311_calls_2014.csv

Note that this command requires registering for an app token, and that I had to fiddle with various query parameters (especially limit!) to get exactly what I wanted. The URL is quoted so the shell does not interpret the &.

Load the data:

##            used  (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells  4851497 259.1   14442815  771.4   9574461  511.4
## Vcells 71438814 545.1  224984221 1716.5 281227329 2145.6
##   address_type agency                            agency_name        bbl
## 1      ADDRESS    DEP Department of Environmental Protection 2058120054
## 2      ADDRESS    DEP Department of Environmental Protection 2037060001
## 3      ADDRESS    DEP Department of Environmental Protection 4026810027
## 4 INTERSECTION    DEP Department of Environmental Protection         NA
## 5      ADDRESS    DEP Department of Environmental Protection 5006690001
## 6      ADDRESS    DEP Department of Environmental Protection 4022450046
##         borough bridge_highway_direction bridge_highway_name
## 1         BRONX                                             
## 2         BRONX                                             
## 3        QUEENS                                             
## 4 STATEN ISLAND                                             
## 5 STATEN ISLAND                                             
## 6        QUEENS                                             
##   bridge_highway_segment          city             closed_date
## 1                                BRONX 2014-12-30T00:00:00.000
## 2                                BRONX 2014-11-13T00:00:00.000
## 3                              Maspeth 2014-12-18T00:00:00.000
## 4                        STATEN ISLAND 2014-10-07T00:00:00.000
## 5                        STATEN ISLAND 2014-12-05T00:00:00.000
## 6                         Forest Hills 2016-02-11T00:00:00.000
##    community_board complaint_type created_date     cross_street_1
## 1         08 BRONX            ATF   2014-12-22       FIELDSTON RD
## 2         09 BRONX            ATF   2014-11-05         QUIMBY AVE
## 3        05 QUEENS            ATF   2014-11-13              58 PL
## 4 02 STATEN ISLAND            ATF   2014-10-06                   
## 5 01 STATEN ISLAND           FATF   2014-12-04        ONTARIO AVE
## 6        06 QUEENS            ATF   2014-07-17 GRAND CENTRAL PKWY
##                 cross_street_2 descriptor due_date facility_type
## 1                DELAFIELD AVE                               N/A
## 2 CROSS BRONX EXPWY ET 6 A  EB                               N/A
## 3                        58 RD                               N/A
## 4                                                            N/A
## 5                    LOGAN AVE                               N/A
## 6           GRAND CENTRAL PKWY                               N/A
##         incident_address incident_zip intersection_street_1
## 1   458 WEST  246 STREET        10471                      
## 2      930 ZEREGA AVENUE        10473                      
## 3     58-52 GRAND AVENUE        11378                      
## 4                               10304         TARGEE STREET
## 5 1270 VICTORY BOULEVARD        10301                      
## 6  69-70 PEARTREE AVENUE        11375                      
##   intersection_street_2 landmark      lat
## 1                                40.89360
## 2                                40.82710
## 3                                40.72003
## 4           RALPH PLACE          40.60509
## 5                                40.61544
## 6                                40.72387
##                                   location location_address location_city
## 1  POINT (-73.906451039282 40.89360169479)               NA            NA
## 2 POINT (-73.843596451986 40.827103589668)               NA            NA
## 3 POINT (-73.909836842511 40.720028020229)               NA            NA
## 4 POINT (-74.091052322046 40.605092404738)               NA            NA
## 5 POINT (-74.105751318156 40.615435937341)               NA            NA
## 6 POINT (-73.837481878781 40.723868142958)               NA            NA
##   location_state location_type location_zip       lon
## 1             NA                         NA -73.90645
## 2             NA                         NA -73.84360
## 3             NA                         NA -73.90984
## 4             NA                         NA -74.09105
## 5             NA                         NA -74.10575
## 6             NA                         NA -73.83748
##   open_data_channel_type  park_borough park_facility_name
## 1                UNKNOWN         BRONX        Unspecified
## 2                UNKNOWN         BRONX        Unspecified
## 3                UNKNOWN        QUEENS        Unspecified
## 4                UNKNOWN STATEN ISLAND        Unspecified
## 5                UNKNOWN STATEN ISLAND        Unspecified
## 6                UNKNOWN        QUEENS        Unspecified
##   resolution_action_updated_date resolution_description road_ramp status
## 1        2014-12-30T00:00:00.000                                  Closed
## 2        2014-11-13T00:00:00.000                                  Closed
## 3        2014-12-18T00:00:00.000                                  Closed
## 4        2014-10-07T00:00:00.000                                  Closed
## 5        2014-12-05T00:00:00.000                                  Closed
## 6        2016-02-11T00:00:00.000                                  Closed
##         street_name taxi_company_borough taxi_pick_up_location unique_key
## 1  WEST  246 STREET                                              36273085
## 2     ZEREGA AVENUE                                              36281859
## 3      GRAND AVENUE                                              36282922
## 4                                                                36381500
## 5 VICTORY BOULEVARD                                              36390454
## 6   PEARTREE AVENUE                                              36452915
##   vehicle_type x_coordinate_state_plane y_coordinate_state_plane
## 1                               1010114                   264855
## 2                               1027535                   240652
## 3                               1009243                   201615
## 4                                958967                   159741
## 5                                954890                   163514
## 6                               1029297                   203043

The scale of 311 calls over a year is immediately apparent: Nearly 2 million calls in 2014!

Especially important are the geographic fields (address, borough, latitude, and longitude), the creation and close dates, and the complaint type.
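The loading chunk itself is not echoed. A minimal sketch, assuming readr and the path from the curl command above:

# Note: the grep filter above strips the CSV header row, so in practice
# the column names may need to be reattached after reading.
calls <- read_csv('./data/raw/311_calls_2014.csv')
head(calls)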

Housing Violations

This data set was also pulled from the NYC Open Data web site, using a similar curl command:

curl 'https://data.cityofnewyork.us/resource/b2iz-pps8.csv?%24limit=5000&%24%24app_token=YOURAPPTOKENHERE' > ./data/raw/housing_violations.csv

Note that this data set is also updated daily, but does not extend as far back in time as the calls dataset.

Load the dataset:

##   apartment approveddate        bbl     bin block     boro boroid
## 1        2I   2014-01-28 2029760092 2010493  2976    BRONX      2
## 2             2014-01-29 2025220109 2003396  2522    BRONX      2
## 3             2014-01-29 2025220109 2003396  2522    BRONX      2
## 4        2B   2014-02-02 2025220109 2003396  2522    BRONX      2
## 5         1   2014-02-13 3043420016 3097632  4342 BROOKLYN      3
## 6         1   2014-02-13 3043420016 3097632  4342 BROOKLYN      3
##   buildingid censustract certifieddate class communityboard
## 1     110036         123                   A              3
## 2     103554         211                   A              4
## 3     103554         211                   B              4
## 4     103554         211                   A              4
## 5     331370        1104                   C              5
## 6     331370        1104                   C              5
##   councildistrict    currentstatus       currentstatusdate currentstatusid
## 1              17 VIOLATION CLOSED 2017-10-23T00:00:00.000              19
## 2              16 VIOLATION CLOSED 2017-10-19T00:00:00.000              19
## 3              16 VIOLATION CLOSED 2017-10-19T00:00:00.000              19
## 4              16 VIOLATION CLOSED 2017-10-26T00:00:00.000              19
## 5              42 VIOLATION CLOSED 2017-10-19T00:00:00.000              19
## 6              42 VIOLATION CLOSED 2017-10-19T00:00:00.000              19
##   highhousenumber housenumber          inspectiondate latitude longitude
## 1            1327        1327 2014-01-27T00:00:00.000 40.83098 -73.89157
## 2            1383        1383 2014-01-28T00:00:00.000 40.84214 -73.92328
## 3            1383        1383 2014-01-28T00:00:00.000 40.84214 -73.92328
## 4            1383        1383 2014-02-01T00:00:00.000 40.84214 -73.92328
## 5             176         176 2014-02-12T00:00:00.000 40.65646 -73.89339
## 6             176         176 2014-02-12T00:00:00.000 40.65646 -73.89339
##   lot lowhousenumber newcertifybydate newcorrectbydate
## 1  92           1327                                  
## 2 109           1383                                  
## 3 109           1383                                  
## 4 109           1383                                  
## 5  16            176                                  
## 6  16            176                                  
##                                                                                                                                                                                  novdescription
## 1 SECTION 27-2005 ADM CODE  PROPERLY REPAIR WITH SIMILAR MATERIAL THE BROKEN OR DEFECTIVE WOODEN FLOOR  IN THE ENTIRE APARTMENT  LOCATED AT APT 2I, 2nd STORY, 2nd APARTMENT FROM NORTH AT EAST
## 2        SECTION 27-2053 ADM CODE  POST SIGN ON WALL OF ENTRANCE STORY BEARING NAME, ADDRESS INCLUDING APARTMENT NUMBER IF ANY, AND TELEPHONE NUMBER OF SUPERINTENDENT, JANITOR OR HOUSEKEEPER.
## 3                                                                                        SECTION 27-2053 ADM CODE  PROVIDE DWELLING WITH A JANITOR OR RESPONSIBLE PERSON OR JANITORIAL SERVICE.
## 4                               SECTION 27-2005 ADM CODE  PROPERLY SECURE THE LOOSE WOOD SADDLE AT DOOR OPENING  IN THE KITCHEN  LOCATED AT APT 2B, 2nd STORY, 2nd APARTMENT FROM WEST AT NORTH
## 5                                                               SECTION 27-2031  ADMIN. CODE:  PROVIDE HOT WATER AT ALL HOT WATER FIXTURES IN THE ENTIRE APARTMENT  LOCATED AT APT 1, 1st STORY
## 6                                           SECTION 27-2073 ADM CODE - PROVIDE AN ADEQUATE SUPPLY OF GAS TO THE FIXTURES AT KITCHEN STOVE  IN THE ENTIRE APARTMENT  LOCATED AT APT 1, 1st STORY
##     novid           novissueddate  novtype               nta ordernumber
## 1 4768682 2014-01-29T00:00:00.000 Original Crotona Park East         502
## 2 4769405 2014-01-30T00:00:00.000 Original        Highbridge         722
## 3 4769406 2014-01-30T00:00:00.000 Original        Highbridge         721
## 4 4772303 2014-02-03T00:00:00.000 Original        Highbridge         509
## 5 4781222 2014-02-14T00:00:00.000 Original     East New York         970
## 6 4780994 2014-02-14T00:00:00.000 Original     East New York        1042
##     originalcertifybydate   originalcorrectbydate registrationid
## 1 2014-05-18T00:00:00.000 2014-05-04T00:00:00.000         221732
## 2 2014-05-19T00:00:00.000 2014-05-05T00:00:00.000         200526
## 3 2014-03-20T00:00:00.000 2014-03-06T00:00:00.000         200526
## 4 2014-05-23T00:00:00.000 2014-05-09T00:00:00.000         200526
## 5 2014-02-25T00:00:00.000 2014-02-13T00:00:00.000         816931
## 6 2014-02-27T00:00:00.000 2014-02-22T00:00:00.000         816931
##         story streetcode         streetname violationid violationstatus
## 1           2      65320 SOUTHERN BOULEVARD    10112705           Close
## 2 All Stories      58720    PLIMPTON AVENUE    10113444           Close
## 3 All Stories      58720    PLIMPTON AVENUE    10113445           Close
## 4           2      58720    PLIMPTON AVENUE    10118834           Close
## 5           1      58230       MALTA STREET    10135223           Close
## 6           1      58230       MALTA STREET    10135252           Close
##     zip
## 1 10459
## 2 10452
## 3 10452
## 4 10452
## 5 11207
## 6 11207

This housing violations data set is still substantial, but less voluminous than the calls dataset, at about 23 thousand violations in 2014. Important columns here include various dates (of inspection, certification, etc.), geographic information (address, lat/long, Census tract, etc.), a description of the violation, and violation status.

Census Demographics

The U.S. Census Bureau makes all of its data available to the public, including through an API. Fortunately for us, some thoughtful data citizens have written an easy-to-use R package called tidycensus, which makes accessing Census data quite simple. One simply needs to install the package and get an API key from the Census.

Census data can be aggregated at various geographic levels. The smallest is the census tract, which is conveniently available in the housing dataset but not in the calls data. It would be possible to use the addresses in calls to look up each observation’s tract, but that would be tedious with nearly 2 million calls, even using an API. Additionally, that level of granularity might make the data too sparse. Instead, I aggregate by zip code, a reasonable compromise between granularity and the need to avoid sparseness. (Unfortunately, tidycensus does not allow limiting the pull to New York state when aggregating by zip code—I will have to filter the resulting dataset.)

The Census tabulates hundreds of variables, many of which would be useful in this task. To keep the project manageable, I focus on summarizing geographic zones by measures of race/ethnicity, age, economic class, language, and education. Of course, most of these variables are related to each other, and we should expect the problem of multicollinearity to crop up in the modeling phase.

All sensitive credentials are stored in a config file (ignored by Git) and loaded by sourcing it.

Load the data:
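The pull itself is not echoed. A sketch of the kind of call involved; the config file name, the key variable, and the two ACS variable codes shown are illustrative assumptions (the full variable list is not reproduced here):

source('config.R')              # hypothetical config file holding the Census API key
census_api_key(CENSUS_API_KEY)  # hypothetical variable defined in config.R
census_raw <- get_acs(geography = 'zcta',   # zip code tabulation areas, covering the whole U.S.
                      variables = c(total_pop  = 'B01003_001',   # total population
                                    median_age = 'B01002_001'),  # median age
                      year = 2014)
# The result covers every U.S. ZCTA; New York zips are filtered afterward.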

Normalize these variables by converting most to proportions of the zip code’s total population:

## # A tibble: 6 x 17
##     zip NAME        bach_degree bachelors below_poverty grad_degree     hs
##   <int> <chr>             <dbl>     <dbl>         <dbl>       <dbl>  <dbl>
## 1 10001 ZCTA5 10001       0.268     0.529       0.0406       0.238  0.101 
## 2 10002 ZCTA5 10002       0.177     0.392       0.0460       0.0871 0.156 
## 3 10003 ZCTA5 10003       0.307     0.599       0.00264      0.262  0.0428
## 4 10009 ZCTA5 10009       0.289     0.542       0.0313       0.175  0.107 
## 5 10010 ZCTA5 10010       0.305     0.530       0.00597      0.285  0.0453
## 6 10011 ZCTA5 10011       0.331     0.483       0.0105       0.318  0.0541
## # ... with 10 more variables: income_high <dbl>, married <dbl>,
## #   median_age <dbl>, no_hs <dbl>, pop_black <dbl>, pop_hispanic <dbl>,
## #   pop_white <dbl>, speak_english <dbl>, speak_spanish <dbl>,
## #   total_pop <dbl>
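A sketch of the normalization, assuming the pulled counts carry the column names shown above (median_age and total_pop are left untouched; census_raw is an assumed object name):

census <- census_raw %>%
  mutate(across(c(bach_degree, bachelors, below_poverty, grad_degree, hs,
                  income_high, married, no_hs, pop_black, pop_hispanic,
                  pop_white, speak_english, speak_spanish),
                ~ .x / total_pop))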

Dependent and Independent Variables

The next step is to transform housing violations into a ‘proper’ dependent variable, accounting for time.

Roll up violations by month:

## # A tibble: 6 x 3
## # Groups:   zip [3]
##     zip epoch      violations
##   <int> <date>          <int>
## 1 10001 2014-01-01          4
## 2 10001 2014-02-01         10
## 3 10002 2014-01-01         49
## 4 10002 2014-02-01         44
## 5 10003 2014-01-01         20
## 6 10003 2014-02-01         25
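A sketch of the roll-up, assuming the violations table is named house and using lubridate’s floor_date:

violations_monthly <- house %>%
  mutate(epoch = floor_date(as.Date(substr(inspectiondate, 1, 10)), 'month')) %>%
  group_by(zip, epoch) %>%
  summarise(violations = n())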

However, not every possible (zip, epoch) combination appears: as it stands, the data set is missing all observations where violations = 0. To remedy this, we can conduct a cross join using merge:

##     zip      epoch violations
## 1 10001 2013-11-01          0
## 2 10001 2013-12-01          0
## 3 10001 2014-01-01          4
## 4 10001 2014-02-01         10
## 5 10001 2014-03-01          0
## 6 10001 2014-04-01          0
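A sketch of the zero-filling step: merge with no shared columns yields the Cartesian product of zips and epochs, which is then left-joined back onto the monthly counts (object names are assumptions):

grid <- merge(data.frame(zip   = unique(violations_monthly$zip)),
              data.frame(epoch = unique(violations_monthly$epoch)))  # cross join
violations_full <- merge(grid, violations_monthly,
                         by = c('zip', 'epoch'), all.x = TRUE)
violations_full$violations[is.na(violations_full$violations)] <- 0   # absent cells mean zero violations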

Now let’s perform a similar operation on calls:

##   zip      epoch calls
## 1     2014-01-01    55
## 2     2014-02-01    83
## 3     2014-03-01   114
## 4     2014-04-01    82
## 5     2014-05-01    93
## 6     2014-06-01    37

Finally, join to create the final data set:

Split into test and train sets:

##            used  (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells  2111823 112.8   20082226 1072.6  22802738 1217.8
## Vcells 20049699 153.0  259322622 1978.5 322894673 2463.5
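A sketch of the join and the split; the table names calls_monthly and census, the 75/25 proportion, and the seed are all assumptions, since the document does not show them:

full <- violations_full %>%
  inner_join(calls_monthly, by = c('zip', 'epoch')) %>%   # zip columns assumed to share a type
  inner_join(census, by = 'zip')

set.seed(2018)                                            # arbitrary seed
train_idx <- sample(nrow(full), floor(0.75 * nrow(full)))
train <- full[train_idx, ]
test  <- full[-train_idx, ]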

Exploratory Data Analysis

First, let’s get a sense of our two variables of interest: violations and calls.

##   10%   25%   50%   75%   90%   95%   99% 
##   0.0   0.0   0.0   0.0  18.0  53.5 299.5
##    10%    25%    50%    75%    90%    95%    99% 
##  335.0  542.5  839.0 1298.0 1730.0 2026.0 3001.7

calls (the second set of quantiles above) is a much better-behaved variable than violations: its distribution can far more reasonably be called ‘approximately normal.’ It is still right-skewed, however, with a mean to the right of its median. violations (the first set) is highly skewed, with a pile-up at zero and a very long right tail: at least seventy-five percent of (zip, epoch) observations have no violations at all.

Let’s examine the relationship between the two variables with a scatterplot:
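The figure is not reproduced in this rendering; a sketch of how it could be drawn:

ggplot(train, aes(x = calls, y = violations)) +
  geom_point(alpha = 0.3)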

There are two groups of observations in this plot: One which is clearly linear, and a cluster of observations with a wide range of calls but no housing violations. Hopefully adding demographic variables will help straighten this relationship out for the model.

Correlations

The corrplot package creates a helpful visual to help us understand the correlations between our numeric variables:
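A sketch of the call, assuming the zip identifier is dropped before computing correlations:

num_vars <- train %>% select(-zip) %>% select_if(is.numeric)
corrplot(cor(num_vars), type = 'lower', tl.cex = 0.7)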

This plot is informative. violations appears only weakly correlated with most of the variables. Meanwhile, calls is more strongly correlated with variables such as below_poverty (0.4), median_age (-0.42), and speak_spanish (0.31)—total_pop has a correlation of 0.77! This suggests it might be wise to ‘standardize’ calls and violations by dividing both by total_pop, creating a sort of ‘311 calls and housing violations per capita.’

The strongest correlations are between the demographic variables. For instance, the proportion of people in a zip code with a bachelor’s degree is negatively correlated with the proportion who live below the poverty line. This is to be expected; we’ll have to keep an eye on these correlations in the modeling stage.

Modeling

The overall analytic strategy is to divide the full dataset into a train and a test set—see above—fit a number of models on the training data, apply the trained models to the test set, and evaluate each model according to mean squared error (\(MSE\)). I will also pay attention to \(R^2\), the percentage of variation in violations each model accounts for. However, \(MSE\) is the primary measure to minimize.

Because I use different kinds of models from different packages, it is helpful to have these measures explicitly coded:
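The helper chunk is not visible in this rendering; a minimal sketch of the two measures, defined in the usual way:

mse <- function(actual, predicted) {
    mean((actual - predicted)^2)
}

r_squared <- function(actual, predicted) {
    1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
}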

\(M_0\): Dummy Model

First, I create a dummy model that predicts the mean of violations for every observation. This will show how much the more complicated models add to our predictive capability:
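A sketch of the dummy prediction, assuming the train object from the split above:

m_0_pred <- rep(mean(train$violations), nrow(train))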

In-sample performance:

## [1] 2140.327
## [1] 0

Ideally, future models will have a much lower \(MSE\) than 2140, and a much higher \(R^2\) than 0. If they do not, we’ll know we’re not getting anywhere with this data set.

\(M_1\): Simple Linear Regression

The next model \(M_1\) is a simple linear regression with all of the independent variables:

## 
## Call:
## lm(formula = violations ~ calls + bach_degree + bachelors + below_poverty + 
##     grad_degree + hs + income_high + married + median_age + no_hs + 
##     pop_black + pop_hispanic + pop_white + speak_english + speak_spanish + 
##     total_pop, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -141.08  -13.71   -1.74    5.70  351.59 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -1.108e+01  5.431e+01  -0.204   0.8383    
## calls          5.600e-02  3.393e-03  16.505   <2e-16 ***
## bach_degree   -1.013e+02  5.759e+01  -1.759   0.0788 .  
## bachelors      5.144e+00  5.045e+01   0.102   0.9188    
## below_poverty  9.829e+01  7.708e+01   1.275   0.2025    
## grad_degree   -1.409e+01  5.573e+01  -0.253   0.8004    
## hs            -9.054e+01  6.719e+01  -1.347   0.1780    
## income_high    7.040e+01  4.382e+01   1.607   0.1083    
## married        3.412e+01  6.486e+01   0.526   0.5990    
## median_age     6.643e-01  5.069e-01   1.311   0.1902    
## no_hs         -5.072e+00  5.063e+01  -0.100   0.9202    
## pop_black      2.537e+01  1.919e+01   1.322   0.1863    
## pop_hispanic   1.360e+01  6.133e+01   0.222   0.8246    
## pop_white      7.404e+00  1.443e+01   0.513   0.6081    
## speak_english -4.050e+01  1.830e+01  -2.214   0.0270 *  
## speak_spanish -2.167e+01  7.029e+01  -0.308   0.7579    
## total_pop     -7.484e-04  7.564e-05  -9.894   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 41.04 on 1414 degrees of freedom
## Multiple R-squared:  0.2224, Adjusted R-squared:  0.2136 
## F-statistic: 25.28 on 16 and 1414 DF,  p-value: < 2.2e-16
## [1] 1664.268

This model performs reasonably well for a first pass. Its \(MSE = 1664\) is 22 percent lower than \(M_0\)’s, and \(R^2 = 0.22\). The \(F\)-statistic is highly significant, confirming that this model performs better than a dummy model. Interestingly, the intercept is not significant, i.e., it is not distinguishable from zero. In my view, this could make sense if housing violations are driven entirely by the independent variables.

The intuition underlying this paper, that 311 calls have something to do with housing violations, is confirmed by this model. The positive \(\beta\) for calls indicates that as calls increase, so do housing violations. The estimate is significant at \(p < .001\). Interestingly, total_pop has a negative relationship with violations. This is hard to square with intuition; a priori it should have a positive relationship. For now, I will not worry too much about it because the coefficient is very small.

Let’s examine the residuals:

It is clear \(M_1\)’s residuals deviate sharply from a normal distribution. There is a long right-tail, indicating there are quite a few outlying observations the model does not handle well.

Let’s examine some of these leading residuals:

##      zip      epoch    resid
## 1  11226 2014-02-01 351.5891
## 2  11213 2014-02-01 313.6789
## 3  11207 2014-01-01 310.0429
## 4  11221 2014-02-01 295.4526
## 5  10457 2014-02-01 288.9505
## 6  10032 2014-02-01 279.7924
## 7  10456 2014-02-01 262.8867
## 8  10453 2014-02-01 261.4567
## 9  10456 2014-01-01 261.4486
## 10 10452 2014-02-01 254.7850

\(M_1\) appears to perform the worst in winter: January and February. This indicates that it may be possible to improve the model by adding a month variable.

\(M_2\): Regression + Month

To remedy this, let’s conduct the same kind of regression but add a variable for month:
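The chunk is not echoed; the month variable can be derived from the epoch date, for example with lubridate (this construction is an assumption; the model call itself is reproduced in the summary below):

train_m2 <- train %>% mutate(month = month(epoch))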

## 
## Call:
## lm(formula = violations ~ calls + bach_degree + bachelors + below_poverty + 
##     grad_degree + hs + income_high + married + median_age + no_hs + 
##     pop_black + pop_hispanic + pop_white + speak_english + speak_spanish + 
##     total_pop + as.factor(month), data = train_m2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -108.353  -13.470    3.028   11.018  313.795 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         4.039e+01  4.617e+01   0.875   0.3818    
## calls               4.513e-02  2.966e-03  15.214   <2e-16 ***
## bach_degree        -8.523e+01  4.893e+01  -1.742   0.0817 .  
## bachelors           1.215e+01  4.284e+01   0.284   0.7768    
## below_poverty       6.675e+01  6.550e+01   1.019   0.3083    
## grad_degree        -2.339e+01  4.731e+01  -0.494   0.6211    
## hs                 -1.005e+02  5.707e+01  -1.760   0.0785 .  
## income_high         4.113e+01  3.723e+01   1.105   0.2694    
## married             2.393e+01  5.509e+01   0.434   0.6641    
## median_age          5.423e-01  4.307e-01   1.259   0.2082    
## no_hs              -1.064e+01  4.300e+01  -0.247   0.8047    
## pop_black           1.888e+01  1.631e+01   1.157   0.2473    
## pop_hispanic       -2.406e+00  5.209e+01  -0.046   0.9632    
## pop_white           4.412e+00  1.227e+01   0.360   0.7192    
## speak_english      -2.616e+01  1.556e+01  -1.681   0.0929 .  
## speak_spanish       2.813e+00  5.970e+01   0.047   0.9624    
## total_pop          -5.518e-04  6.559e-05  -8.413   <2e-16 ***
## as.factor(month)2   8.349e+00  4.529e+00   1.844   0.0655 .  
## as.factor(month)3  -5.613e+01  4.525e+00 -12.403   <2e-16 ***
## as.factor(month)4  -5.243e+01  4.605e+00 -11.384   <2e-16 ***
## as.factor(month)5  -5.587e+01  4.554e+00 -12.268   <2e-16 ***
## as.factor(month)6  -5.552e+01  4.514e+00 -12.299   <2e-16 ***
## as.factor(month)7  -5.481e+01  4.600e+00 -11.915   <2e-16 ***
## as.factor(month)8  -5.280e+01  4.553e+00 -11.596   <2e-16 ***
## as.factor(month)9  -5.180e+01  4.634e+00 -11.180   <2e-16 ***
## as.factor(month)10 -5.433e+01  4.499e+00 -12.077   <2e-16 ***
## as.factor(month)11 -5.737e+01  4.463e+00 -12.855   <2e-16 ***
## as.factor(month)12 -5.555e+01  4.496e+00 -12.355   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 34.83 on 1403 degrees of freedom
## Multiple R-squared:  0.4442, Adjusted R-squared:  0.4335 
## F-statistic: 41.53 on 27 and 1403 DF,  p-value: < 2.2e-16
## [1] 1189.566

Adding month really helped! Almost all of the month variables are significant at \(p < .001\). \(MSE\) is almost a third less compared to \(M_1\), and \(R^2\) doubles from 0.22 to 0.44.

Accounting for month also makes intuitive sense: Winter is when people use heating, and New Yorkers are prone to having problems with their apartment’s heat. This could naturally inspire more calls to the housing inspection authorities.

Although calls remains highly significant, almost none of the other variables are.

Residual plots:

The shape of the residuals remains about the same compared to \(M_1\), although notice the right-tail is smaller:

## [1] 351.5891
## [1] 313.7954

Let’s re-check the leading residuals:

##      zip      epoch    resid
## 1  11226 2014-02-01 313.7954
## 2  11207 2014-01-01 272.4597
## 3  11213 2014-02-01 267.4036
## 4  11221 2014-02-01 244.8598
## 5  10457 2014-02-01 242.1114
## 6  10456 2014-01-01 230.1314
## 7  10032 2014-02-01 226.9596
## 8  10458 2014-01-01 217.5249
## 9  10453 2014-02-01 216.4459
## 10 10456 2014-02-01 209.9291

The residuals on the whole are smaller, but it appears to be the same observations that \(M_1\) missed.

\(M_3\): Mixed Effects Panel Model

The previous models have all used vanilla linear regression. However, this is not analytically appropriate. Linear regression assumes that observations are independent of one another—which is not the case for this dataset. The number of housing violations in a zip code in month \(t\) is not independent of the number in month \(t-1\). And as we’ve seen above, the number of housing violations in February is not independent of the number in January.

We can use mixed models to account for the fact that observations are sampled from the same ‘unit’ multiple times. I will not attempt to explain the underlying logic of these models here.

zip and month are coded as random effects:

## Warning: Some predictor variables are on very different scales: consider
## rescaling
## Linear mixed model fit by REML ['lmerMod']
## Formula: violations ~ calls + bach_degree + bachelors + below_poverty +  
##     grad_degree + hs + income_high + married + median_age + no_hs +  
##     pop_black + pop_hispanic + pop_white + speak_english + speak_spanish +  
##     total_pop + (1 | zip) + (1 | month)
##    Data: train_m3
## 
## REML criterion at convergence: 14173.8
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0934 -0.3897  0.0822  0.3124  9.0298 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  zip      (Intercept)    0.0    0.00   
##  month    (Intercept)  519.7   22.80   
##  Residual             1213.3   34.83   
## Number of obs: 1431, groups:  zip, 159; month, 12
## 
## Fixed effects:
##                 Estimate Std. Error t value
## (Intercept)   -4.591e+00  4.660e+01  -0.099
## calls          4.535e-02  2.965e-03  15.297
## bach_degree   -8.555e+01  4.893e+01  -1.749
## bachelors      1.201e+01  4.284e+01   0.280
## below_poverty  6.738e+01  6.550e+01   1.029
## grad_degree   -2.321e+01  4.731e+01  -0.490
## hs            -1.003e+02  5.707e+01  -1.757
## income_high    4.172e+01  3.723e+01   1.121
## married        2.413e+01  5.509e+01   0.438
## median_age     5.448e-01  4.307e-01   1.265
## no_hs         -1.053e+01  4.300e+01  -0.245
## pop_black      1.900e+01  1.631e+01   1.165
## pop_hispanic  -2.084e+00  5.209e+01  -0.040
## pop_white      4.469e+00  1.227e+01   0.364
## speak_english -2.644e+01  1.556e+01  -1.700
## speak_spanish  2.315e+00  5.970e+01   0.039
## total_pop     -5.558e-04  6.556e-05  -8.477
## 
## Correlation matrix not shown by default, as p = 17 > 12.
## Use print(x, correlation=TRUE)  or
##     vcov(x)        if you need it
## fit warnings:
## Some predictor variables are on very different scales: consider rescaling
## [1] 1189.749
## [1] 0.4441274

Surprisingly, using this more appropriate method fails to decrease \(MSE\) or improve \(R^2\) over \(M_2\) at all. In fact, the two models are nearly identical. Notably, the estimated variance of the zip random intercept is zero, so the random structure adds little beyond the month effects already captured in \(M_2\).

\(M_4\): Dealing with Multicollinearity

From the correlation plot and some of the residuals, it seems likely that some or perhaps most of the demographic variables from the Census are correlated with each other. This state of affairs, called multicollinearity, is harmful to the modeling process because it can cause misestimation of coefficients. Training a model on less correlated variables should make the estimates more accurate and improve performance on the test set.

Any sociologist can tell you that education level, poverty, income, and language are all related to each other, primarily by class-exclusionary mechanisms. I will try to retain the essentials of this data while reducing the correlation:

## Warning: Some predictor variables are on very different scales: consider
## rescaling
## Linear mixed model fit by REML ['lmerMod']
## Formula: 
## violations ~ calls + college + below_poverty + married + median_age +  
##     no_hs + pop_white + speak_english + total_pop + (1 | zip) +  
##     (1 | month)
##    Data: train_m4
## 
## REML criterion at convergence: 14244.6
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.2114 -0.3983  0.0733  0.3007  9.0422 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  zip      (Intercept)    0      0.00   
##  month    (Intercept)  523     22.87   
##  Residual             1215     34.86   
## Number of obs: 1431, groups:  zip, 159; month, 12
## 
## Fixed effects:
##                 Estimate Std. Error t value
## (Intercept)   -2.988e+01  2.029e+01  -1.473
## calls          4.462e-02  2.871e-03  15.540
## college        1.453e+01  1.223e+01   1.188
## below_poverty  1.506e+02  5.038e+01   2.990
## married        8.353e+00  2.071e+01   0.403
## median_age     5.289e-01  3.329e-01   1.589
## no_hs         -1.012e+01  3.421e+01  -0.296
## pop_white     -1.163e+01  7.187e+00  -1.618
## speak_english -6.255e+00  8.990e+00  -0.696
## total_pop     -5.371e-04  6.247e-05  -8.598
## 
## Correlation of Fixed Effects:
##             (Intr) calls  colleg blw_pv marrid medn_g no_hs  pp_wht spk_ng
## calls       -0.208                                                        
## college     -0.599  0.091                                                 
## below_pvrty -0.623  0.135  0.552                                          
## married     -0.441  0.272  0.479  0.455                                   
## median_age  -0.601  0.134  0.237  0.449 -0.210                            
## no_hs       -0.467 -0.007  0.330 -0.085  0.235 -0.044                     
## pop_white    0.240 -0.055 -0.624 -0.349 -0.464 -0.140  0.197              
## speak_nglsh -0.541 -0.005  0.279  0.230  0.502 -0.093  0.651 -0.042       
## total_pop    0.077 -0.776 -0.040 -0.172 -0.274 -0.011 -0.037  0.001  0.018
## fit warnings:
## Some predictor variables are on very different scales: consider rescaling
## [1] 1197.75
## [1] 0.4403892

\(R^2\) and \(MSE\) do not change much compared to the two previous models. However, we do see that some parameter estimates have changed quite drastically. This should result in improved performance on the test set, even if performance is unchanged on the in-sample training set.

Bonus: \(M_5\): Zero-Inflated Poisson Regression

NOTE: This is new territory for me, just trying it out for the first time!

Since \(M_4\) has likely done a better job with regard to multicollinearity, I am using it as the basis for this next model: a zero-inflated Poisson regression, a special kind of Poisson regression built to handle dependent variables with excess zeros. It is part of the pscl package—a UCLA web site has a quick tutorial.
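The fitting chunk is not echoed; the call below is reconstructed from the printed summary (the object names and the factor coding of month are assumptions):

# month is assumed to already be a factor in train_m5
m_5 <- zeroinfl(violations ~ calls + below_poverty + pop_white + median_age + college | month,
                data = train_m5)
summary(m_5)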

## 
## Call:
## zeroinfl(formula = violations ~ calls + below_poverty + pop_white + 
##     median_age + college | month, data = train_m5)
## 
## Pearson residuals:
##     Min      1Q  Median      3Q     Max 
## -4.3586 -0.3253 -0.2796 -0.2407 15.2318 
## 
## Count model coefficients (poisson with log link):
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    3.819e+00  1.959e-01   19.50   <2e-16 ***
## calls          8.473e-04  8.863e-06   95.60   <2e-16 ***
## below_poverty  5.165e+00  4.254e-01   12.14   <2e-16 ***
## pop_white     -1.245e+00  7.794e-02  -15.97   <2e-16 ***
## median_age    -5.094e-02  4.448e-03  -11.45   <2e-16 ***
## college        2.034e+00  1.416e-01   14.36   <2e-16 ***
## 
## Zero-inflation model coefficients (binomial with logit link):
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -3.6811     0.5849  -6.293 3.11e-10 ***
## month2        1.2604     0.6777   1.860   0.0629 .  
## month3        7.3445     0.8271   8.880  < 2e-16 ***
## month4        6.1563     0.6802   9.051  < 2e-16 ***
## month5        6.2834     0.6902   9.103  < 2e-16 ***
## month6        6.4623     0.7028   9.196  < 2e-16 ***
## month7        5.7212     0.6550   8.734  < 2e-16 ***
## month8        5.8779     0.6594   8.915  < 2e-16 ***
## month9        5.9177     0.6654   8.893  < 2e-16 ***
## month10       5.6639     0.6465   8.760  < 2e-16 ***
## month11       6.1228     0.6715   9.118  < 2e-16 ***
## month12       6.2195     0.6797   9.150  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Number of iterations in BFGS optimization: 29 
## Log-likelihood: -1.062e+04 on 18 Df
## [1] 2.583519
## [1] 0.6543592

I do not understand this \(MSE\); presumably it has been computed on a transformed scale, but I’m not sure which. The \(R^2 = 0.65\) looks great, though—nearly fifty percent better than the last few models.

Model Evaluation

It is finally time to evaluate each of these models on the test set!

Prepare the test set for each model:
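A sketch of generating hold-out predictions for each model; the model and test-set object names follow the conventions used above and are assumptions:

pred_m0 <- rep(mean(train$violations), nrow(test))
pred_m1 <- predict(m_1, newdata = test)
pred_m2 <- predict(m_2, newdata = test_m2)
pred_m3 <- predict(m_3, newdata = test_m3, allow.new.levels = TRUE)
pred_m4 <- predict(m_4, newdata = test_m4, allow.new.levels = TRUE)
pred_m5 <- predict(m_5, newdata = test_m5, type = 'response')   # expected counts from the zero-inflated model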

Calculate \(MSE\) for each model on the test set:

## [1] "M_0: 2656.90358493984"
## [1] "M_1: 1848.85988638713"
## [1] "M_2: 1305.03244510467"
## [1] "M_3: 1306.33557707587"
## [1] "M_4: 1303.55055189596"
## [1] "M_5: 757.679178844792"

The Poisson regression equipped to deal with excess zeros, \(M_5\), came out best, with roughly 40 percent lower \(MSE\) than the mixed effects models! As expected, accounting for month improved \(M_2\) substantially over \(M_1\). Surprisingly, using a mixed model did not improve \(M_3\) over \(M_2\). Also a surprise: \(M_4\), the model built to deal with multicollinearity, offered only a slight improvement.

Now let’s look at \(R^2\):

## [1] "M_0: 0"
## [1] "M_1: 0.303941300833325"
## [1] "M_2: 0.508693379627696"
## [1] "M_3: 0.508202160577651"
## [1] "M_4: 0.50924525862471"
## [1] "M_5: 0.715024352069364"

Interestingly, the \(R^2\) for each model on the test set is higher than when applied to the training set. \(M_5\) comes out the winner again, explaining almost 72 percent of variance in violations.

Conclusion

This paper used open source data from New York City and the U.S. Census Bureau to predict monthly housing violations. A number of models were tested. The best model was a special kind of Poisson regression built to deal with dependent variables with excessive zeros. This model explained 72 percent of variation in housing violations.

The percentage of people living below poverty in a zip code, and the percentage with college degrees, are positively associated with housing violations. A higher proportion of white residents and a higher median age are both negatively associated with housing violations.

This paper was particularly interested in whether the number of calls made to 311 in a zip code could help predict the number of housing violations. Every model found a highly significant, positive relationship between the two, even when accounting for demographic variables and seasonality—every 1175 calls to 311 is associated with one additional housing violation.

I would like to thank whoever wrote tidycensus for making this project a million times easier, and I am also fortunate that this Poisson model worked out so well.