Overview

The purpose of this project was to determine whether any data features of a city can be used to predict blight for a given location. All code, the ETL workflow, and the analysis are available at https://github.com/aliciatb/blight.

The data

The data provided for this project include 311 Service Calls, Crime Incidents, Blight Violations and Permits for Demolition. The first three datasets provide the foundation for a Buildings dataset of unique locations and are the source of derived features used in model creation, such as Number of Crimes, Number of Blight Violations, and Number of Service Calls. These datasets were downloaded from the course site, but are also available via the capstone project repo on GitHub. All of the data comes from the Socrata-powered Detroit Open Data Portal, https://data.detroitmi.gov/.

Files

Note - I downloaded this Detroit Demolition dataset rather than the building permits dataset provided by the instructor, since it was cleaner and contained only the essential fields needed to label known blight locations.

Data preparation & Tools

The greatest challenge with the data provided by Socrata is the Location column: it concatenates all of the fields used in the geocoding process, and when address fields are included, line breaks are embedded in the field and cause havoc until they are removed from the data file. Before the data could be analyzed, all files were first cleaned with Excel Power Query to remove the aforementioned line breaks and to standardize street numbers and addresses. The data was then loaded into FME, a powerful ETL tool, to validate and standardize the geographic coordinates and to create well-formatted incident and unique-building files. Exploratory analysis and model creation were performed in Python notebooks and RStudio.

Buildings

To derive a building from the incident files, the latitude and longitude coordinates were rounded to 3 decimal places, and each file was then individually joined with the demolition data, whose coordinates had been rounded to the same precision. Where a match was found, that building record was labeled as blighted. Any incident record that lacked coordinates was excluded from the final dataset. One interesting discovery is that street addresses are not captured consistently across datasets and could not be relied on for aggregation due to variances in how the same address is recorded.
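
A minimal sketch of this coordinate-rounding join, assuming the dplyr package and illustrative column names (`lat`, `lon`); the real files may use different names:

```r
library(dplyr)

# Label a building as blighted when its rounded coordinates match a demolition record.
derive_blight_labels <- function(incidents, demolitions, digits = 3) {
  demolition_keys <- demolitions %>%
    mutate(lat_r = round(lat, digits),
           lon_r = round(lon, digits)) %>%
    distinct(lat_r, lon_r) %>%
    mutate(blight = 1L)

  incidents %>%
    filter(!is.na(lat), !is.na(lon)) %>%               # exclude records without coordinates
    mutate(lat_r = round(lat, digits),
           lon_r = round(lon, digits)) %>%
    left_join(demolition_keys, by = c("lat_r", "lon_r")) %>%
    mutate(blight = coalesce(blight, 0L))              # unmatched locations are labeled 0
}
```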

Features

Features drive the creation of prediction models because it is in their diversity that differences can be discovered that explain why one building may be more prone to becoming blighted than another. In one of the readings for the course, Spatial Characteristics of Housing Abandonment, Dr. Morckel posits that housing abandonment results from 3 key conditions - market conditions, gentrification and physical neglect. For this project, we focus mostly on the data evidence of neglect.

The first features added to the building dataset are counts of total 311 calls, crime incidents and blight violation citations for a given building. No filtering was performed on any of the incident datasets, because I didn’t want to presume that calls about infrastructure or non-violent crimes are unrelated to a geographic inclination towards neglect.
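
As an illustration, per-building incident counts could be derived along these lines, again assuming dplyr and the rounded-coordinate keys described above (column and dataset names are hypothetical):

```r
library(dplyr)

# Count incidents of one type per rounded-coordinate "building"
count_per_building <- function(incidents, digits = 3, count_name = "IncidentCount") {
  incidents %>%
    mutate(lat_r = round(lat, digits),
           lon_r = round(lon, digits)) %>%
    count(lat_r, lon_r, name = count_name)
}

# Example usage with hypothetical data frames:
# buildings <- buildings %>%
#   left_join(count_per_building(calls_311,  count_name = "X311Count"),
#             by = c("lat_r", "lon_r")) %>%
#   left_join(count_per_building(crimes,     count_name = "CrimeCount"),
#             by = c("lat_r", "lon_r")) %>%
#   left_join(count_per_building(violations, count_name = "BlightViolationCount"),
#             by = c("lat_r", "lon_r"))
```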

A second set of features was added from a Property Values dataset found on the Detroit Data Portal that included appraised and taxed values, sale price, tax status, and whether the property had been improved at any point. It also included well-formatted street addresses and latitude and longitude coordinates, which helped further reduce the overall building dataset, since buildings that lack features are not useful for generating prediction models.

Features that were not included in this project, but would be interesting to add so that the model accounts for possible gentrification and economic conditions, include:

* Building Permits - alterations and other permit types that may signal gentrification in progress in a neighborhood
* Zillow Zestimate for an address
* American Community Survey annual estimates on income, mortgage and rental costs at the Census tract block level

It would also be interesting to consider time as a factor and perhaps calculate incident counts at 90 day, 180 day and annual intervals before demolition.

Models

For predicting whether a location is blighted or not, I used logistic regression and classification tree models to answer that question. Logistic regression models the probability that an outcome belongs to a category. Classification trees identify feature importance and provide a visual representation of the decision path taken by the best performing model. The models are judged on their Accuracy and Kappa values; Kappa measures how much better the accuracy is than what would be expected by random chance.
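
For reference, Cohen's Kappa (the standard definition, not anything specific to this project) compares the observed accuracy $p_o$ with the accuracy $p_e$ expected by chance given the class proportions:

$$\kappa = \frac{p_o - p_e}{1 - p_e}$$

A Kappa near 1 means the model agrees with the true labels far better than chance; a Kappa near 0 means it does no better than guessing the majority class.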

Training & Test datasets

In order to estimate the accuracy of a model, the data must be divided into training and test datasets. I allocated 75% of the data for training the models. I also converted the blight and IsImproved fields to factors and dropped fields such as Address, Latitude, Longitude, Parcel Number, Tax Status and Year Built, so that the models evaluated only IsImproved, Appraised Value, Taxed Value, Sale Price, Ward Number, and the counts of 311 calls, crimes, and blight violations as indicators of blight. Originally, Tax Status was included, but its importance was not significant in any of the models.
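
A minimal sketch of this split and preparation, assuming caret's createDataPartition and field names matching the model output below (the dataset name and seed are illustrative):

```r
library(caret)

set.seed(42)  # illustrative seed

# 'buildings' is the prepared building-level dataset (hypothetical name)
buildings$blight     <- as.factor(buildings$blight)
buildings$IsImproved <- as.factor(buildings$IsImproved)

# Keep only the fields the models should evaluate
model_fields <- c("blight", "IsImproved", "AppraisedValue", "TaxedValue",
                  "SalePrice", "Ward", "X311Count", "CrimeCount",
                  "BlightViolationCount")
model_data <- buildings[, model_fields]

# 75% of rows for training, stratified on the outcome
in_train <- createDataPartition(model_data$blight, p = 0.75, list = FALSE)
training <- model_data[in_train, ]
testing  <- model_data[-in_train, ]
```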

Logistic Regression Model
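
A hedged sketch of how output like the following could be produced with caret (the exact call used in the project may differ):

```r
library(caret)

# For a two-class factor outcome, caret's "glm" method fits a binomial (logistic) model
logistic_model <- train(blight ~ ., data = training, method = "glm")

summary(logistic_model$finalModel)               # coefficient table as shown below
logistic_pred <- predict(logistic_model, newdata = testing)
confusionMatrix(logistic_pred, testing$blight)   # test-set confusion matrix
```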

## 
## Call:
## NULL
## 
## Deviance Residuals: 
##        Min          1Q      Median          3Q         Max  
## -1.232e-05  -7.940e-06  -7.660e-06  -7.050e-06   5.348e-04  
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)
## (Intercept)          -2.170e+00  4.767e+02  -0.005    0.996
## CrimeCount            5.388e+01  1.196e+03   0.045    0.964
## X311Count             1.080e+01  6.052e+02   0.018    0.986
## BlightViolationCount  1.290e+02  1.731e+03   0.075    0.941
## Ward                  1.420e-01  3.060e+02   0.000    1.000
## SalePrice            -4.127e-02  2.264e+03   0.000    1.000
## IsImproved1          -3.787e-01  2.216e+02  -0.002    0.999
## AppraisedValue       -2.394e+00  2.537e+04   0.000    1.000
## TaxedValue           -6.364e-01  2.760e+04   0.000    1.000
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3.4638e+04  on 120738  degrees of freedom
## Residual deviance: 1.2957e-05  on 120730  degrees of freedom
## AIC: 18
## 
## Number of Fisher Scoring iterations: 25

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 38937     0
##          1     0  1309
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9999, 1)
##     No Information Rate : 0.9675     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0000     
##             Specificity : 1.0000     
##          Pos Pred Value : 1.0000     
##          Neg Pred Value : 1.0000     
##              Prevalence : 0.9675     
##          Detection Rate : 0.9675     
##    Detection Prevalence : 0.9675     
##       Balanced Accuracy : 1.0000     
##                                      
##        'Positive' Class : 0          
## 

Logistic Model Analysis

This model indicates that the top 3 fields that have the most predictive value are Blight Violation Count, Crime Count and 311 Call Count. The overall accuracy of the model is 100% with a Kappa value of 100%.

Tree Model

The first tree-based model is a simple classification tree that reveals the decision splits most frequently associated with a blight classification.
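
A sketch of one way such a tree could be fit and evaluated, assuming caret's "rpart" method (the method actually used may differ):

```r
library(caret)

tree_model <- train(blight ~ ., data = training, method = "rpart")
tree_pred  <- predict(tree_model, newdata = testing)
confusionMatrix(tree_pred, testing$blight)

# Optional visual of the splits (assumes the rpart.plot package is installed):
# rpart.plot::rpart.plot(tree_model$finalModel)
```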

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 38937     6
##          1     0  1303
##                                           
##                Accuracy : 0.9999          
##                  95% CI : (0.9997, 0.9999)
##     No Information Rate : 0.9675          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.9976          
##  Mcnemar's Test P-Value : 0.04123         
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.9954          
##          Pos Pred Value : 0.9998          
##          Neg Pred Value : 1.0000          
##              Prevalence : 0.9675          
##          Detection Rate : 0.9675          
##    Detection Prevalence : 0.9676          
##       Balanced Accuracy : 0.9977          
##                                           
##        'Positive' Class : 0               
## 

Tree Model Analysis

This model indicates that the top 3 fields that have the most predictive value are Crime Count, Blight Violation Count and 311 Call Count. The overall accuracy of the model is 99.99% with a Kappa value of 99.76%.

Bagging Model

The next model uses Bootstrap Aggregating (bagging): the training data is resampled with replacement many times, a tree is fit to each bootstrap sample, and the individual predictions are aggregated.
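
The "Bagged CART" label in the output below corresponds to caret's "treebag" method; a sketch of such a call:

```r
library(caret)

# Bagged classification trees (ipred under the hood), bootstrap resampling by default
bagged_model <- train(blight ~ ., data = training, method = "treebag")
bagged_pred  <- predict(bagged_model, newdata = testing)
confusionMatrix(bagged_pred, testing$blight)
```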

## Bagged CART 
## 
## 120739 samples
##      8 predictor
##      2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 120739, 120739, 120739, 120739, 120739, 120739, ... 
## Resampling results:
## 
##   Accuracy  Kappa
##   1         1    
## 
## 
## 
## Bagging classification trees with 25 bootstrap replications

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 38937     0
##          1     0  1309
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9999, 1)
##     No Information Rate : 0.9675     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0000     
##             Specificity : 1.0000     
##          Pos Pred Value : 1.0000     
##          Neg Pred Value : 1.0000     
##              Prevalence : 0.9675     
##          Detection Rate : 0.9675     
##    Detection Prevalence : 0.9675     
##       Balanced Accuracy : 1.0000     
##                                      
##        'Positive' Class : 0          
## 

Bagging Model Analysis

This model indicates that the top 3 fields that have the most predictive value are Blight Violation Count, Crime Count and 311 Call Count. The overall accuracy of the model is 100% with a Kappa value of 100%.

Boosting Model

Boosting combines many weak classifiers, fitting them sequentially so that each new tree focuses on the cases the previous ones got wrong, to produce a more powerful model.
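
The "Stochastic Gradient Boosting" output below corresponds to caret's "gbm" method; a sketch of such a call:

```r
library(caret)

boost_model <- train(blight ~ ., data = training, method = "gbm", verbose = FALSE)
boost_pred  <- predict(boost_model, newdata = testing)
confusionMatrix(boost_pred, testing$blight)

# Relative influence of each feature, as in the table below (plotit = FALSE skips the bar chart)
summary(boost_model$finalModel, plotit = FALSE)
```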

## Stochastic Gradient Boosting 
## 
## 120739 samples
##      8 predictor
##      2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 120739, 120739, 120739, 120739, 120739, 120739, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  Accuracy   Kappa    
##   1                   50      0.9997360  0.9957829
##   1                  100      0.9998396  0.9974343
##   1                  150      1.0000000  1.0000000
##   2                   50      1.0000000  1.0000000
##   2                  100      1.0000000  1.0000000
##   2                  150      1.0000000  1.0000000
##   3                   50      1.0000000  1.0000000
##   3                  100      1.0000000  1.0000000
##   3                  150      1.0000000  1.0000000
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using  the largest value.
## The final values used for the model were n.trees = 50, interaction.depth
##  = 2, shrinkage = 0.1 and n.minobsinnode = 10.

##                                       var   rel.inf
## BlightViolationCount BlightViolationCount 85.774721
## CrimeCount                     CrimeCount 12.784663
## X311Count                       X311Count  1.440617
## Ward                                 Ward  0.000000
## SalePrice                       SalePrice  0.000000
## IsImproved1                   IsImproved1  0.000000
## AppraisedValue             AppraisedValue  0.000000
## TaxedValue                     TaxedValue  0.000000
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 38937     0
##          1     0  1309
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9999, 1)
##     No Information Rate : 0.9675     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0000     
##             Specificity : 1.0000     
##          Pos Pred Value : 1.0000     
##          Neg Pred Value : 1.0000     
##              Prevalence : 0.9675     
##          Detection Rate : 0.9675     
##    Detection Prevalence : 0.9675     
##       Balanced Accuracy : 1.0000     
##                                      
##        'Positive' Class : 0          
## 

Boosting Model Analysis

This model indicates that the top 3 fields that have the most predictive value are Blight Violation Count, Crime Count and 311 Call Count. The overall accuracy of the model is 100% with a Kappa value of 100%.

Other Models

A Random Forest is among the most powerful tree-based classification models: like bagging it resamples the training data, but it also samples the candidate features at each split. However, its computational cost led to long processing times and, in this case, memory exceptions on two different and otherwise powerful machines (a MacBook Pro and an iMac), so I was unable to run this model successfully.
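
For completeness, a sketch of the kind of call that was attempted, assuming caret's "rf" method (which wraps the randomForest package); on this dataset it exhausted memory rather than completing:

```r
library(caret)

# Warning: this ran for a long time and raised memory exceptions on the hardware used here
rf_model <- train(blight ~ ., data = training, method = "rf")
```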

Model Comparison

The four models are easy to compare in R, and a visualization of how they compare can be generated using the caret package's [resample methods](http://www.inside-r.org/packages/cran/caret/docs/as.matrix.resamples).
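
A sketch of that comparison, assuming the four caret train objects fit above:

```r
library(caret)

results <- resamples(list(tree_model     = tree_model,
                          bagged_model   = bagged_model,
                          boost_model    = boost_model,
                          logistic_model = logistic_model))
summary(results)

# bwplot(results) or dotplot(results) draws the comparison visually
```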

## 
## Call:
## summary.resamples(object = results)
## 
## Models: tree_model, bagged_model, boost_model, logistic_model 
## Number of resamples: 25 
## 
## Accuracy 
##                  Min. 1st Qu. Median   Mean 3rd Qu. Max. NA's
## tree_model     0.9996  0.9997 0.9997 0.9998       1    1    0
## bagged_model   1.0000  1.0000 1.0000 1.0000       1    1    0
## boost_model    1.0000  1.0000 1.0000 1.0000       1    1    0
## logistic_model 0.9929  1.0000 1.0000 0.9997       1    1    0
## 
## Kappa 
##                  Min. 1st Qu. Median  Mean 3rd Qu. Max. NA's
## tree_model     0.9942   0.995 0.9958 0.997       1    1    0
## bagged_model   1.0000   1.000 1.0000 1.000       1    1    0
## boost_model    1.0000   1.000 1.0000 1.000       1    1    0
## logistic_model 0.8747   1.000 1.0000 0.995       1    1    0

Conclusion

Three of the four models returned accuracy and Kappa values of 100%, while the fourth (the classification tree) scored 99.99% accuracy and a 99.76% Kappa, and all indicated that blight violation and crime counts are key factors in whether a building is blighted, with 311 call counts also being a significant indicator. It isn’t surprising that these 3 attributes are excellent indicators of blight, but their usefulness in a model that predicts the probability of blight is questionable if the goal of a city is to become aware of potential blight early enough to counter whatever conditions are the true underlying cause. For this reason, I think it is very important to build models using economic and gentrification features in addition to physical neglect and crime statistics, because they may reveal problem areas earlier. I would also like to account for time in the models and see whether there is a significant difference in incident counts at different time intervals.

In addition to adding more economic features to model building, I am excited about a new feature in FME's latest release, the RCaller transformer, which I hope will allow FME to leverage the caret library I used for all of my models. This would make it possible to create a workflow that could analyze the data for any city, provided the datasets conform to a general schema for 311 calls, crime incidents and blight violations, on a regular schedule and predict which properties are likely to become blighted. If the data can indeed be standardized across a group of cities, and more known blight can be used to train the models, this would be an amazing opportunity to empower cities to proactively address potential issues while conditions are manageable, rather than only being able to react to them with bulldozers.

Resources & Readings