Introduction

This is a capstone project in the Data Science at Scale specialization, offered by the University of Washington in cooperation with Coursera. The task is to predict building abandonment (a “blighted building”) based on public datasets.

Data

Downloading

The public data can be downloaded from: http://data.detroitmi.gov

The project uses four files: blight violation incidents, criminal incidents, 311 call incidents (typically complaints), and issued demolition permits. For each of the four incident types, the location of the incident is given as a (latitude, longitude) pair.

Data cleaning

The official center of Detroit lies at approximately 42.3314 degrees latitude (lat) and -83.0458 degrees longitude (lon). All records whose (lat, lon) values are out of range or missing (such as NA) are removed, unless the values can easily be extracted from another field in the dataset.
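The location filter described above can be sketched as follows. This is a minimal illustration in Python, not the project's actual code: the window size `MAX_OFFSET_DEG` and the field names `lat` and `lon` are assumptions, and the recovery of coordinates from other fields is left out.

```python
# Approximate city center, taken from the text above.
DETROIT_LAT, DETROIT_LON = 42.3314, -83.0458

# Hypothetical half-width of the accepted window in degrees (an assumption,
# the text does not state the exact range that was used).
MAX_OFFSET_DEG = 0.25


def has_valid_location(record):
    """Return True if the record carries a usable (lat, lon) near Detroit."""
    lat, lon = record.get("lat"), record.get("lon")
    try:
        lat, lon = float(lat), float(lon)
    except (TypeError, ValueError):
        return False  # missing values such as None or the string "NA"
    # Comparisons with NaN are always False, so NaN values are rejected too.
    return (abs(lat - DETROIT_LAT) <= MAX_OFFSET_DEG
            and abs(lon - DETROIT_LON) <= MAX_OFFSET_DEG)


records = [
    {"id": 1, "lat": 42.418, "lon": -83.008},  # inside the window
    {"id": 2, "lat": "NA", "lon": -83.0},      # missing latitude
    {"id": 3, "lat": 0.0, "lon": 0.0},         # far out of range
    {"id": 4, "lat": None, "lon": None},       # missing pair
]
cleaned = [r for r in records if has_valid_location(r)]
print([r["id"] for r in cleaned])
```

Only the first record survives the filter in this toy input.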

As the earliest demolition permit in the incident data is from 2010, incidents older than 2002 are filtered out from the source data.

Finally, all addresses are converted to uppercase, and extra whitespace (spaces, tabs, etc.) is cleaned out. The same cleaning process is applied across all files so that the data format is consistent.
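A minimal sketch of the address cleaning, written in Python for illustration. The function name `clean_address` is hypothetical, and it assumes that runs of whitespace are collapsed to a single space rather than removed entirely.

```python
import re


def clean_address(raw):
    """Uppercase an address and collapse whitespace runs to single spaces."""
    # \s+ matches spaces, tabs and newlines; strip() drops leading/trailing ones.
    return re.sub(r"\s+", " ", raw).strip().upper()


print(clean_address("  11200\tPortlance   st \n"))
```

Applying the same function to every file keeps the address format consistent before buildings are matched.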

Definition of buildings

Identify a building

Most clustering algorithms would not differentiate between individual buildings in high-density areas. A more natural way to identify a building is to use the street name and the street number as an identifier. However, because of suffixes such as Road, Street, Drive, or Avenue, the same street name and street number could still map to different buildings.

So identifying a building boils down to deciding the granularity of the identifier. An ideal identifier would pinpoint every unique building. A really bad identifier would, for example, lump all buildings together as one single building!

In this project our identifier consists of 4 (or possibly 5) parts:
street address + street number + latitude + longitude + prefix

Only 3 decimal places of the latitude and longitude are used to represent a unique building. Three decimal places correspond to a resolution of roughly 111 meters in latitude (and about 82 meters in longitude at Detroit's latitude), which should be good enough for our purposes.
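The resolution of three decimal places can be checked with a short calculation. The sketch below, in Python, assumes a spherical approximation with about 111,320 meters per degree of latitude; the function name `grid_cell_size_m` is hypothetical.

```python
import math

# Approximate length of one degree of latitude in meters (spherical Earth).
METERS_PER_DEGREE_LAT = 111_320


def grid_cell_size_m(lat_deg, decimals=3):
    """Return the (north-south, east-west) size in meters of one grid step."""
    step_deg = 10 ** -decimals
    north_south = step_deg * METERS_PER_DEGREE_LAT
    # Meridians converge toward the poles, so a longitude step shrinks
    # with the cosine of the latitude.
    east_west = north_south * math.cos(math.radians(lat_deg))
    return north_south, east_west


ns, ew = grid_cell_size_m(42.3314)
print(round(ns), round(ew))
```

At the latitude of Detroit, a step of 0.001 degrees is therefore shorter east-west than north-south.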

So any unique (street address + street number + latitude + longitude + prefix) tuple is considered a single building.
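One way to build such an identifier is sketched below in Python. The layout (underscore separators, coordinates scaled by 1000, sign of the longitude dropped) is inferred from the sample ID PORTLANCE_11200_42418_83008 in this report and is an assumption, as are the function name `make_building_id` and the use of rounding rather than truncation. The optional blighted marker is left out.

```python
def make_building_id(street_name, street_number, lat, lon, decimals=3):
    """Combine street name, street number and coarse coordinates into one key."""
    scale = 10 ** decimals
    # Keep only `decimals` decimal places and drop the sign (assumed layout).
    lat_key = int(round(abs(lat) * scale))
    lon_key = int(round(abs(lon) * scale))
    return f"{street_name}_{street_number}_{lat_key}_{lon_key}"


print(make_building_id("PORTLANCE", 11200, 42.4181, -83.0082))
```

Two records on the same street and number whose coordinates agree to three decimal places end up with the same key, and therefore count as the same building.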

Finally, if a building is demolished (and therefore no longer exists), we add blighted (and, in the next release, the date of the demolition) as a prefix.

Visualization of the top 10 biggest buildings

The identifier assigns each record a buildingID. Across all files, for each unique buildingID, we save all distinct (lat, lon) values into a list and append it as an additional column. For a unique buildingID, I can then visualize all the incidents (the cluster) belonging to that ID.

[Figure: Building Clusters]

A cluster can be a point, a line, or a polygon. On top of the building, I plotted the incident types from each of the 4 files. To give a building some spatial extent, each cluster is enclosed by a rectangle (a bounding box).
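The bounding box of a cluster is simply the minimum and maximum of its coordinates. A minimal Python sketch, with a hypothetical function name `bounding_box` and made-up sample points:

```python
def bounding_box(points):
    """Return (min_lat, min_lon, max_lat, max_lon) for a list of (lat, lon)."""
    if not points:
        raise ValueError("a cluster needs at least one point")
    lats = [lat for lat, _ in points]
    lons = [lon for _, lon in points]
    return min(lats), min(lons), max(lats), max(lons)


# Made-up incident locations belonging to one buildingID.
cluster = [(42.4181, -83.0082), (42.4184, -83.0079), (42.4179, -83.0085)]
print(bounding_box(cluster))
```

A cluster that consists of a single point yields a box with zero area, so point and line clusters need no special handling.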

Collect all the buildings in a master dataframe

To track the origin of each incident, a column named type, with the possible values complaint, crime, blight and demolited, was appended to each of the four files. The files were then merged into one big dataframe.
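The tagging and merging step can be sketched in Python with plain lists of dictionaries standing in for dataframes. The function name `tag_and_merge`, the building IDs and the sample rows are made up for illustration.

```python
def tag_and_merge(sources):
    """Merge several record lists into one, tagging each row with its origin.

    `sources` maps a type label to the list of records read from that file.
    """
    master = []
    for incident_type, rows in sources.items():
        for row in rows:
            # Copy the row so the input lists are left untouched.
            master.append({**row, "type": incident_type})
    return master


# Made-up rows; real files carry many more columns.
sources = {
    "COMPLAINT": [{"buildingID": "EXAMPLE_1_42331_83046", "dt": "2015-06-09"}],
    "CRIME": [{"buildingID": "EXAMPLE_1_42331_83046", "dt": "2015-07-09"},
              {"buildingID": "EXAMPLE_2_42332_83047", "dt": "2014-02-01"}],
}
master = tag_and_merge(sources)
print(len(master), sorted({row["type"] for row in master}))
```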

Now, for any buildingID, I can easily construct other columns of interest, e.g. building area, max latitude, min latitude and many more.

By sorting this master file by building and by time of the reported incident, and then using subset() or aggregate(), the history of a single building can easily be extracted:

buildingID type dt truth.label
PORTLANCE_11200_42418_83008 BLIGHT_VIOLATIONS 2008-09-16 NON.BLIGHTED
PORTLANCE_11200_42418_83008 BLIGHT_VIOLATIONS 2012-06-29 NON.BLIGHTED
PORTLANCE_11200_42418_83008 BLIGHT_VIOLATIONS 2012-11-15 NON.BLIGHTED
PORTLANCE_11200_42418_83008 BLIGHT_VIOLATIONS 2013-04-22 NON.BLIGHTED
PORTLANCE_11200_42418_83008 BLIGHT_VIOLATIONS 2014-03-20 NON.BLIGHTED
PORTLANCE_11200_42418_83008 BLIGHT_VIOLATIONS 2014-05-21 NON.BLIGHTED
PORTLANCE_11200_42418_83008 BLIGHT_VIOLATIONS 2014-06-23 NON.BLIGHTED
PORTLANCE_11200_42418_83008 BLIGHT_VIOLATIONS 2014-08-01 NON.BLIGHTED
PORTLANCE_11200_42418_83008 COMPLAINT 2015-06-09 NON.BLIGHTED
PORTLANCE_11200_42418_83008 COMPLAINT 2015-07-07 NON.BLIGHTED
PORTLANCE_11200_42418_83008 CRIME 2015-07-09 NON.BLIGHTED
PORTLANCE_11200_42418_83008 CRIME 2015-07-19 NON.BLIGHTED
PORTLANCE_11200_42418_83008 COMPLAINT 2015-08-19 NON.BLIGHTED
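A history like the one above can be produced by filtering on the buildingID and sorting by date. The Python sketch below is only an illustration of that idea: the helper `building_history` is hypothetical, and the second building ID is made up.

```python
from datetime import date

# A few rows of a master table, deliberately out of order.
master = [
    {"buildingID": "PORTLANCE_11200_42418_83008", "type": "CRIME",
     "dt": date(2015, 7, 9)},
    {"buildingID": "PORTLANCE_11200_42418_83008", "type": "BLIGHT_VIOLATIONS",
     "dt": date(2008, 9, 16)},
    {"buildingID": "EXAMPLE_1_42331_83046", "type": "COMPLAINT",
     "dt": date(2014, 1, 2)},  # made-up second building
    {"buildingID": "PORTLANCE_11200_42418_83008", "type": "COMPLAINT",
     "dt": date(2015, 6, 9)},
]


def building_history(rows, building_id):
    """Return all incidents of one building, oldest first."""
    selected = [r for r in rows if r["buildingID"] == building_id]
    return sorted(selected, key=lambda r: r["dt"])


history = building_history(master, "PORTLANCE_11200_42418_83008")
for row in history:
    print(row["dt"].isoformat(), row["type"])
```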

Note: To keep the building history up to date, during pre-processing any incident that happened before a demolition permit incident is transformed by appending a suffix (“blighted”) to its buildingID.
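The relabeling in this note could look roughly like the Python sketch below. Several details are assumptions: the exact suffix text `_BLIGHTED`, the type label `DEMOLITION_PERMIT`, the use of the earliest permit when a building has more than one, and the function name `mark_pre_demolition`. The building IDs are made up.

```python
from datetime import date

BLIGHTED_SUFFIX = "_BLIGHTED"          # assumed spelling of the suffix
DEMOLITION_TYPE = "DEMOLITION_PERMIT"  # hypothetical type label


def mark_pre_demolition(rows):
    """Append the blighted suffix to incidents dated before a demolition permit."""
    # Earliest permit date per building (assumption: the earliest one counts).
    permit_date = {}
    for r in rows:
        if r["type"] == DEMOLITION_TYPE:
            known = permit_date.get(r["buildingID"])
            if known is None or r["dt"] < known:
                permit_date[r["buildingID"]] = r["dt"]

    result = []
    for r in rows:
        cutoff = permit_date.get(r["buildingID"])
        is_earlier = cutoff is not None and r["dt"] < cutoff
        if is_earlier and r["type"] != DEMOLITION_TYPE:
            r = {**r, "buildingID": r["buildingID"] + BLIGHTED_SUFFIX}
        result.append(r)
    return result


rows = [
    {"buildingID": "EXAMPLE_1_42331_83046", "type": "CRIME", "dt": date(2014, 1, 1)},
    {"buildingID": "EXAMPLE_1_42331_83046", "type": DEMOLITION_TYPE, "dt": date(2015, 1, 1)},
    {"buildingID": "EXAMPLE_1_42331_83046", "type": "COMPLAINT", "dt": date(2016, 1, 1)},
    {"buildingID": "EXAMPLE_2_42332_83047", "type": "CRIME", "dt": date(2014, 1, 1)},
]
marked = mark_pre_demolition(rows)
print([r["buildingID"] for r in marked])
```

In this toy input only the first row is renamed: it is the only incident that precedes a permit for its building.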

Model construction

Feature selection

I am using gbm (Gradient Boosting Machine) through the caret package, for flexibility and efficiency. The following variables (features) are selected from the master dataframe for prediction. The predictors of the model are normalized (centered and scaled).

Variable      Description
truth.label   Response variable: BLIGHTED / NON.BLIGHTED
no.blight     Number of blight violations
no.crime      Number of crimes
no.complaint  Number of complaints (311 calls)
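The count features and the normalization can be sketched in Python as below. This is an illustration under assumptions, not the project's code: the helper names `count_features` and `center_and_scale` are hypothetical, and the scaling uses the sample standard deviation.

```python
from collections import Counter
from statistics import mean, stdev


def count_features(rows):
    """Count incidents per building and incident type."""
    per_building = {}
    for r in rows:
        per_building.setdefault(r["buildingID"], Counter())[r["type"]] += 1
    return {
        bid: {"no.blight": c["BLIGHT_VIOLATIONS"],
              "no.crime": c["CRIME"],
              "no.complaint": c["COMPLAINT"]}
        for bid, c in per_building.items()
    }


def center_and_scale(values):
    """Subtract the mean and divide by the sample standard deviation."""
    mu = mean(values)
    sd = stdev(values) if len(values) > 1 else 0.0
    if sd == 0:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - mu) / sd for v in values]


# Incident types of the building shown in the history table of this report.
types = ["BLIGHT_VIOLATIONS"] * 8 + ["COMPLAINT"] * 3 + ["CRIME"] * 2
rows = [{"buildingID": "PORTLANCE_11200_42418_83008", "type": t} for t in types]
features = count_features(rows)
print(features["PORTLANCE_11200_42418_83008"])
print(center_and_scale([1, 2, 3]))
```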

The following formulas give these accuracies:

Formula                                             Accuracy
truth.label ~ no.blight                             0.54
truth.label ~ no.blight + no.crime                  0.69
truth.label ~ no.blight + no.crime + no.complaint   0.70
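For reference, accuracy is read here as the fraction of buildings whose predicted label matches the truth label. That reading is an assumption about how the figures were obtained. A minimal Python sketch with made-up labels:

```python
def accuracy(truth, predicted):
    """Fraction of predictions that equal the true label."""
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must have the same length")
    if not truth:
        raise ValueError("need at least one label")
    correct = sum(t == p for t, p in zip(truth, predicted))
    return correct / len(truth)


# Made-up labels, only to show the calculation.
truth = ["BLIGHTED", "NON.BLIGHTED", "BLIGHTED", "NON.BLIGHTED"]
predicted = ["BLIGHTED", "NON.BLIGHTED", "NON.BLIGHTED", "NON.BLIGHTED"]
print(accuracy(truth, predicted))
```

Whether an accuracy of 0.70 beats a trivial baseline also depends on how balanced the two classes are in the evaluation data.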

Summary

The model is very simplistic, but it already works considerably better than a coin flip. Given more time, a good next step might be to add a predictor that estimates the lifetime of a building. Further development is definitely needed before the model could be handed to a wrecking ball crew.