Introduction

Detroit has been struck by a rash of dropping property values, home foreclosures, and blight. The city has decided to combat this problem by demolishing tens of thousands of blighted properties, but funding and time are limiting factors in this work. Detroit has joined a number of cities in providing critical data sets to the public, and is working to prioritize the demolition of its many blighted properties, to better understand the factors leading to blight, and to predict at-risk neighborhoods. This report leverages several public data sets provided by the city of Detroit to build predictive models of blight and subsequent demolition, with particular attention paid to producing a model capable of flagging at-risk properties well in advance of blight status.

IMPORTANT NOTE: Code for every section of this report is available by clicking the ‘Code’ button on the right-hand side, or for all chunks at once using the menu at the top right of the report.

Data Loading and Preparation

Several data sets are used in the analysis of blight within Detroit, including a reference data set of all parcels in Detroit, as well as incident-based data sets such as crime reports, blight violations, 311 calls, and demolitions. Up-to-date versions of these data sets can be obtained at the Detroit Open Data Portal, using the links provided with each data set below. A summary of the date ranges spanned by each data set is included below for reference (a short sketch after the loading code shows how these ranges can be recovered).
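
The code throughout this report assumes the following packages are loaded up front. This list is inferred from the functions used in the chunks below, and loading them here is a sketch rather than part of the original chunks:

# Packages assumed by the code in this report
library(dplyr)        # data manipulation pipelines
library(tidyr)        # separate(), complete(), full_seq()
library(stringr)      # str_extract() for coordinate parsing
library(lubridate)    # month() and year()
library(zoo)          # rollapply() for rolling means
library(geohash)      # gh_encode() and gh_neighbours()
library(ggmap)        # qmap() base maps (also loads ggplot2)
library(caret)        # train(), downSample(), confusionMatrix()
library(randomForest) # random forest models and varImpPlot()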

# 311 Call Data
# Downloaded from https://data.detroitmi.gov/Government/Improve-Detroit-Submitted-Issues/fwz3-w3yn
call311_dat <- read.csv('/mnt/data/Improve_Detroit__Submitted_Issues.csv', header=T, na.strings='', stringsAsFactors=F)

# Blight Violations
# Downloaded from https://data.detroitmi.gov/Property-Parcels/Blight-Violations/teu6-anhh
blight_viols <- read.csv('/mnt/data/Blight_Violations.csv', header=T, na.strings='', stringsAsFactors=F)

# Demo Permits
# Downloaded from https://data.detroitmi.gov/Property-Parcels/Building-Permits/xw2a-a7tf
demo_permits <- read.csv('/mnt/data/Detroit_Demolitions.csv', header=T, stringsAsFactors=F)

# Crime Data
# Downloaded from https://data.detroitmi.gov/Public-Safety/DPD-All-Crime-Incidents-2009-Present-Provisional-/b4hw-v6w2
crime_dat <- read.csv('/mnt/data/DPD__All_Crime_Incidents__2009_-_Present__Provisional_.csv', header=T, na.strings='', stringsAsFactors=F)

# Detroit Land Parcels
# Downloaded from https://data.detroitmi.gov/Property-Parcels/Parcel-Points-Ownership/eijm-6nr4
parcel_data <- read.csv('/mnt/data/Parcel_Points_Ownership.csv', header=T, sep=',', na.strings='', stringsAsFactors=F)
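
The date ranges referenced above can be recovered directly from the date columns of each data set. This is a sketch that reuses the same columns and date formats as the aggregation code later in this report:

# Date range spanned by each incident data set (sketch)
range(as.Date(call311_dat$ticket_created_date_time, format='%m/%d/%Y'), na.rm=TRUE)
range(as.Date(blight_viols$TicketIssuedDT, format='%m/%d/%Y'), na.rm=TRUE)
range(as.Date(demo_permits$Demolition.Date, format='%m/%d/%Y'), na.rm=TRUE)
range(as.Date(crime_dat$INCIDENTDATE, format='%m/%d/%Y'), na.rm=TRUE)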

In order to investigate ‘neighborhoods’ and proximity, I will be using a method called ‘geohashing’. Geohashing divides the globe into a hierarchy of grid cells of decreasing size, down to sub-meter cells at the finest precisions. Each longitude/latitude coordinate can be assigned to a unique cell, which greatly speeds up the identification of events that occur in close proximity. Geohashing can be accomplished using the ‘geohash’ package in R, and I’ve elected to identify ‘buildings’ using a precision of 8 (cells of roughly 38 by 19 meters) and neighborhoods using a precision of 7 (roughly 153 by 153 meters). For a visual depiction of geohashing, see http://www.movable-type.co.uk/scripts/geohash.html.
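
As a quick illustration (using a hypothetical coordinate near downtown Detroit, not taken from the data), the 7-character hash is simply the prefix of the 8-character hash for the same point, and gh_neighbours() returns the eight surrounding cells:

# Illustration of the two precisions (hypothetical coordinate, for illustration only)
example_lat <- 42.3314
example_lng <- -83.0458
gh_encode(example_lat, example_lng, 8)  # 8 characters: a cell of roughly 38 x 19 meters ('building')
gh_encode(example_lat, example_lng, 7)  # 7 characters: the containing ~153 x 153 meter 'neighborhood'
gh_neighbours(gh_encode(example_lat, example_lng, 7))  # the eight cells surrounding that neighborhood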

# Geohashing houses with precision 8
# check out this site for a visual: http://www.movable-type.co.uk/scripts/geohash.html
# Alternatively, plug neighbors into the Haversine formula until you get a reasonable distance
# between neighbors

call311_dat$geohash <- gh_encode(call311_dat$lat, call311_dat$lng, 8)
call311_dat$geohash_7 <- gh_encode(call311_dat$lat, call311_dat$lng, 7)
call311_dat$unique_key <- paste0('call311_',seq(1,nrow(call311_dat)))

blight_viols <- blight_viols %>%
    mutate(coord = str_extract(ViolationLocation, "\\([^()]+\\)")) %>%
    # Drop rows where no coordinate pair could be extracted
    filter(!is.na(coord)) %>%
    mutate(coord = gsub('[()]', '', coord)) %>%
    separate(coord, c('lat', 'lng'), sep = ', ') %>%
    mutate(geohash = gh_encode(as.numeric(lat), as.numeric(lng), 8),
           geohash_7 = gh_encode(as.numeric(lat), as.numeric(lng), 7),
           unique_key = paste0('blightviol_', row_number()))

demo_permits <- demo_permits %>%
    mutate(geohash = gh_encode(as.numeric(Latitude), as.numeric(Longitude), 8),
           geohash_7 = gh_encode(as.numeric(Latitude), as.numeric(Longitude), 7),
                 unique_key = paste0('demo_',row_number()))

crime_dat$geohash <- gh_encode(crime_dat$LAT, crime_dat$LON, 8)
crime_dat$geohash_7 <- gh_encode(crime_dat$LAT, crime_dat$LON, 7)
crime_dat$unique_key <- paste0('crime_',seq(1,nrow(crime_dat)))

# Parcel Data Formatting
parcel_data <- parcel_data %>%
    filter(!is.na(ParcelNo)) %>%
    transform(Latitude = as.numeric(Latitude),
                        Longitude = as.numeric(Longitude)) %>%
    mutate(geohash = gh_encode(Latitude, Longitude, 8),
                 demo_flag = ifelse(ParcelNo %in% demo_permits$Parcel.ID, 'Yes', 'No'))

parcel_data$demo_flag <- as.factor(parcel_data$demo_flag)

Exploratory Analysis

In total, the data sets comprise the following record counts (tallied in the sketch below):
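
A minimal sketch for tallying the record counts and the demolition rate cited in the next paragraph:

# Record counts for each data set, and the share of parcels with a demolition permit
c(calls_311          = nrow(call311_dat),
  blight_violations  = nrow(blight_viols),
  crime_reports      = nrow(crime_dat),
  demolition_permits = nrow(demo_permits),
  parcels            = nrow(parcel_data))

nrow(demo_permits) / nrow(parcel_data)  # just under 3% of parcels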

With 10704 demolition permits and 384675 parcels of land, the incidence of demolition due to blight is just under 3% over the roughly three years the demolitions span. That rate would be unusually high, except that we know Detroit is actively combatting blight by demolishing homes; the city has stated it would like to demolish roughly 40,000 structures in 8 years. This presents an interesting modeling problem: many properties that have not yet been demolished are probably blighted, but time and funding have kept them from being prioritized.

It is important to investigate where events are concentrated in Detroit for each data set, and whether those concentrations correspond to blight-related demolitions. Density heatmaps are plotted below, showing areas of high and low concentration for each data set.

det_map <- qmap(location='detroit', zoom=10, maptype='roadmap', color='bw')

crime_plot <- det_map + #geom_point(data=crime_dat, aes(x=LON, y=LAT), color="dark green", alpha=.1, size=1.1) +
  geom_density2d(data=crime_dat, aes(x=LON, y=LAT), size = 0.3) + 
  stat_density2d(data=crime_dat, aes(x=LON, y=LAT, fill = ..level.., alpha = ..level..), size = 0.01, bins = 16, geom = "polygon") + 
  scale_fill_gradient(low = "green", high = "red") + 
  scale_alpha(range = c(0, 0.3), guide = FALSE) + labs(title="Crime Reports for Detroit")

# 311 Calls
call311_plot <- det_map + #geom_point(data=call311_dat, aes(x=lng, y=lat), color="dark green", alpha=.1, size=1.1) +
  geom_density2d(data=call311_dat, aes(x=lng, y=lat), size = 0.3) + 
  stat_density2d(data=call311_dat, aes(x=lng, y=lat, fill = ..level.., alpha = ..level..), size = 0.01, bins = 16, geom = "polygon") + 
  scale_fill_gradient(low = "green", high = "red") + 
  scale_alpha(range = c(0, 0.3), guide = FALSE) + labs(title="311 Calls for Detroit")

# Blight Violations
blight_viols <- blight_viols %>%
  transform(lng = as.numeric(lng),
            lat = as.numeric(lat))

blight_plot <- det_map + 
  geom_density2d(data=blight_viols, aes(x=lng, y=lat), size = 0.3) + 
  stat_density2d(data=blight_viols, aes(x=lng, y=lat, fill = ..level.., alpha = ..level..), size = 0.01, bins = 16, geom = "polygon") + 
  scale_fill_gradient(low = "green", high = "red") + 
  scale_alpha(range = c(0, 0.3), guide = FALSE) + labs(title="Blight Violations for Detroit")

# Demolitions
demo_plot <- det_map + #geom_point(data=demo_permits, aes(x=lng, y=lat), color="dark green", alpha=.1, size=1.1) +
  geom_density2d(data=demo_permits, aes(x=Longitude, y=Latitude), size = 0.3) + 
  stat_density2d(data=demo_permits, aes(x=Longitude, y=Latitude, fill = ..level.., alpha = ..level..), size = 0.01, bins = 16, geom = "polygon") + 
  scale_fill_gradient(low = "green", high = "red") + 
  scale_alpha(range = c(0, 0.3), guide = FALSE) + labs(title="Demolition Locations for Detroit")

[Figure: density heatmaps, shown as tabbed panels in the rendered report — 311 Calls, Blight Violations, Crime Reports, Demolitions]

From the plots above, we can see that calls to 311 have the highest concentration on the west side of town, around Dearborn Heights. There are also medium concentrations in the northeast corner of town and in central Detroit. Crime has a large hotspot in downtown Detroit (the south-central region), while the rest of town sees a moderate amount of reported crime. Demolitions are highest in the east-side neighborhoods.

The high crime in downtown Detroit is likely due to it being the urban center, while the low number of demolitions is most likely attributed to a lack of residential dwellings. The high number of 311 calls in Dearborn Heights is interesting, and corresponds with a reasonably large number of demolitions. Demolitions are spread throughout Detroit, but appear to have higher concentrations in roughly 5 neighborhoods. It is likely that these neighborhoods have been targeted by the city.

Modeling Techniques

Two models will be trained for this project. The first will be a simple random forest model trained using only characteristics of the property itself, obtained from the parcel data. The second model will add incident-based metrics for 311 calls, crime, blight violations, and demolitions in the surrounding neighborhood.

Since demolition due to blight is a relatively rare outcome (as seen in the exploratory analysis), we will need to rebalance the training set. This can be accomplished by keeping every observation of a demolished property and an equal-sized random sample of non-demolished properties. This technique is known as down-sampling, and can be accomplished using caret’s downSample() function.
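
As a quick illustration of what downSample() returns, here is a toy sketch on made-up data (not the Detroit parcels):

# Toy illustration of caret::downSample() on made-up data
set.seed(7)
toy_x <- data.frame(value = rnorm(100))
toy_y <- factor(c(rep('No', 97), rep('Yes', 3)))
balanced <- downSample(toy_x, toy_y, list = FALSE)  # returns the sampled rows plus a 'Class' column
table(balanced$Class)                               # equal counts: 3 'No' and 3 'Yes'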

The data will be split into training and test sets at an 80:20 ratio, and the training portion will then be down-sampled to balance the classes; the 20% holdout keeps its natural class balance. Each model will be trained on the down-sampled training set with 10-fold cross-validation, using ROC AUC as the optimization metric, and subsequently checked for generalization on the holdout set. I prefer ROC AUC as the optimization metric for problems like this because it tends to select models that not only identify the positive cases, but also assign them high probabilities.

parcel_data <- parcel_data %>%
  mutate(sales_year = year(as.Date(SaleDate, format='%m/%d/%Y')),
         SalePrice = as.numeric(gsub("\\$",'',SalePrice)),
         SEV = as.numeric(gsub("\\$",'',SEV)),
         AV = as.numeric(gsub("\\$",'',AV)),
         TV =  as.numeric(gsub("\\$",'',TV)))

set.seed(42)
train_idx <- createDataPartition(parcel_data$demo_flag, times=1, p=0.8, list=T)

smallset <- downSample(parcel_data[train_idx$Resample1,], parcel_data[train_idx$Resample1,]$demo_flag, list=F)

Simple Model

The simple model will use the sale price of the home, the SEV (State Equalized Value), the build year for residential properties, the square footage of the home, the acreage of the land, the frontage and depth of the lot, and a flag for whether or not the property has been improved.

fitControl <- trainControl(method = "repeatedcv",
                           number = 10,
                           ## Estimate class probabilities
                           classProbs = TRUE,
                           ## Evaluate performance using 
                           ## the following function
                           summaryFunction = twoClassSummary
                           )

set.seed(1024)
rf_fit <- train(demo_flag ~ SalePrice + SEV + ResYrBuilt +
                  TotSqFt + TotAcres + Frontage + Depth + IsImproved,
                 data = smallset, 
                 method = "rf",
                 ntree = 100,
                 trControl = fitControl, 
                 verbose = FALSE,
                 ## Specify which metric to optimize
                 metric = "ROC"
                )

rf_fit
## Random Forest 
## 
## 17128 samples
##    35 predictors
##     2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 15416, 15415, 15415, 15415, 15415, 15416, ... 
## Resampling results across tuning parameters:
## 
##   mtry  ROC        Sens       Spec       ROC SD       Sens SD    
##   2     0.9259681  0.8905874  0.8867365  0.008290925  0.007466176
##   5     0.9248295  0.8773944  0.8909409  0.007531252  0.009620851
##   8     0.9245233  0.8744744  0.8891889  0.006975152  0.011015747
##   Spec SD   
##   0.01064143
##   0.01018435
##   0.01208218
## 
## ROC was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 2.

The simple model performs remarkably well, with an AUC of 0.926 +/- 0.008. Let’s quickly confirm on our test set that performance holds up.

test_preds <- predict(rf_fit, newdata=parcel_data[-train_idx$Resample1,])

confusionMatrix(test_preds,parcel_data[-train_idx$Resample1,]$demo_flag)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    No   Yes
##        No  66755   251
##        Yes  8024  1889
##                                           
##                Accuracy : 0.8924          
##                  95% CI : (0.8902, 0.8946)
##     No Information Rate : 0.9722          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : NA              
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.8927          
##             Specificity : 0.8827          
##          Pos Pred Value : 0.9963          
##          Neg Pred Value : 0.1906          
##              Prevalence : 0.9722          
##          Detection Rate : 0.8679          
##    Detection Prevalence : 0.8711          
##       Balanced Accuracy : 0.8877          
##                                           
##        'Positive' Class : No              
## 
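
For completeness, the holdout ROC AUC can also be computed from the predicted class probabilities. This is a sketch and assumes the pROC package, which is not used elsewhere in this report:

# Holdout ROC AUC for the simple model (sketch; assumes the pROC package)
library(pROC)
test_probs <- predict(rf_fit, newdata = parcel_data[-train_idx$Resample1,], type = 'prob')
roc_simple <- roc(response = parcel_data[-train_idx$Resample1,]$demo_flag,
                  predictor = test_probs$Yes, levels = c('No', 'Yes'))
auc(roc_simple)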

The sensitivity and specificity are almost identical to the cross-validated results, and within the standard deviations from the 10-fold cross-validation. Now let’s look at which variables the trained model considers most important (plotted with the sketch below).
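
The importance rankings discussed next can be plotted with the same call used later for the incident-based model (a sketch):

# Variable importance for the simple model
varImpPlot(rf_fit$finalModel, main='Variable Importance: Simple Model')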

By far, the most predictive variable in the model is the State Equalized Value (SEV), which is the value used to estimate property taxes. The year the property was built is also predictive, with sale price, total acreage, and whether or not the property was recently improved rounding out the top five variables.

With an AUC of 0.926, the model is reasonably predictive, but the confusion matrix shows that it has a habit of flagging a large number of properties as ‘at risk’ of blight: it predicts more than four times as many demolitions as actually occurred in the test set. Given that Detroit has stated it would like to demolish roughly 40,000 properties, many of these flagged properties may simply not have been demolished yet. Still, let’s see if including the incident data can differentiate demolished properties with higher specificity.

Incident-Based Modeling

Processing of the incidents is critical to accurate modeling. It is important that we only consider events leading up to a demolition, not afterward: if a building was demolished in 2014, crime in that neighborhood in 2016 tells us nothing about why it was torn down. To simplify the calculation, I will aggregate each data set at the month/year level and then join on the data for the month prior to the actual demolition. I will include cumulative counts of crimes, 311 calls, blight violations, and surrounding demolitions, as well as a cumulative monthly average and the average monthly values for the 12 months leading up to the demolition. All values will be calculated using events in the geohash-7 cell containing the demolition, as well as its neighboring cells.
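
As a small illustration of the ‘prior month’ join used below (hypothetical dates, for illustration only): a March 2014 demolition joins the February 2014 neighborhood aggregates, while a January 2015 demolition rolls back to December 2014.

# Illustration of the prior-month join logic (hypothetical dates)
example_dates <- as.Date(c('2014-03-15', '2015-01-10'))
data.frame(demo_date  = example_dates,
           join_month = ifelse(month(example_dates) == 1, 12, month(example_dates) - 1),
           join_year  = ifelse(month(example_dates) == 1, year(example_dates) - 1, year(example_dates)))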

#------------------------
# Support Functions
#------------------------

# This will grab all neighbors for a geohash, plus the geohash itself
gh_filter <- function(x) {
    nbrs <- gh_neighbours(x)

    # gh_neighbours() returns a one-row data frame of the 8 neighbors;
    # flatten it and prepend the original hash
    return(c(x, unlist(nbrs, use.names = FALSE)))
}

# Rolling 12-Month Mean
new_rollmean <- function(x) { rollapply(x, width=12, FUN=mean, na.rm=T, fill=NA, align="right")}

#-------------------------
# Monthly Aggregation
#-------------------------

geo7_monthly_311calls <- call311_dat %>%
  mutate(incident_date = as.Date(ticket_created_date_time, format='%m/%d/%Y'),
         month = month(incident_date),
         year = year(incident_date)) %>%
  group_by(geohash_7, month, year) %>%
  summarize(monthly_311calls = n()) %>%
  arrange(month, year)

geo7_monthly_crime <- crime_dat %>%
  mutate(incident_date = as.Date(INCIDENTDATE, format='%m/%d/%Y'),
         month = month(incident_date),
         year = year(incident_date)) %>%
  group_by(geohash_7, month, year) %>%
  summarize(monthly_crime = n()) %>%
  arrange(month, year)

geo7_monthly_demos <- demo_permits %>%
  mutate(incident_date = as.Date(Demolition.Date, format='%m/%d/%Y'),
         month = month(incident_date),
         year = year(incident_date)) %>%
  group_by(geohash_7, month, year) %>%
  summarize(monthly_demos = n()) %>%
  arrange(month, year)

geo7_monthly_viols <- blight_viols %>%
  mutate(incident_date = as.Date(TicketIssuedDT, format='%m/%d/%Y'),
         month = month(incident_date),
         year = year(incident_date)) %>%
  group_by(geohash_7, month, year) %>%
  summarize(monthly_viols = n()) %>%
  arrange(month, year)

date_seq <- seq.POSIXt(from=as.POSIXct('1/01/2009', format="%m/%d/%Y"), 
                       to=as.POSIXct('11/01/2016', format='%m/%d/%Y'), by='1 month')

date_seq <- as.data.frame(date_seq) %>%
  mutate(month = month(date_seq),
         year = year(date_seq))
         
all_incidents <- geo7_monthly_crime %>% 
  left_join(geo7_monthly_311calls, by=c('geohash_7'='geohash_7', 'month'='month', 'year'='year')) %>%
  left_join(geo7_monthly_demos, by=c('geohash_7'='geohash_7', 'month'='month', 'year'='year')) %>%
  left_join(geo7_monthly_viols, by=c('geohash_7'='geohash_7', 'month'='month', 'year'='year')) %>%
  transform(monthly_crime = ifelse(is.na(monthly_crime), 0, monthly_crime),
            monthly_311calls = ifelse(is.na(monthly_311calls), 0, monthly_311calls),
            monthly_demos = ifelse(is.na(monthly_demos), 0, monthly_demos),
            monthly_viols = ifelse(is.na(monthly_viols), 0, monthly_viols))

# Aggregate incidents at the geohash-7 level (plus neighbors) for each parcel's neighborhood
# Start by assigning each parcel to a geohash-7 cell
parcel_data <- parcel_data %>%
    mutate(geohash_7 = gh_encode(Latitude, Longitude, 7))

unique_geohash7 <- unique(parcel_data$geohash_7)

incident_list <- lapply(unique_geohash7, function(x) { 
    all_incidents %>%
        filter(geohash_7 %in% gh_filter(x)) %>%
        mutate(reference = x)
    })

# Combine the per-geohash incident tables into a single data frame
incident_list <- bind_rows(incident_list)

incident_list_cumulative <- incident_list %>% 
    group_by(reference) %>%
    # tidyr::complete() fills in the missing month/year combinations (full_seq() builds the sequences)
    tidyr::complete(month = full_seq(seq(1,12), 1),
                     year = full_seq(seq(2009,2016), 1)) %>%
    # Fill in the newly added blank rows with 0 (non-reported)
    transform(monthly_311calls = ifelse(is.na(monthly_311calls),0,monthly_311calls),
                        monthly_crime = ifelse(is.na(monthly_crime),0,monthly_crime),
                        monthly_demos = ifelse(is.na(monthly_demos),0,monthly_demos),
                        monthly_viols = ifelse(is.na(monthly_viols),0,monthly_viols)) %>%
    group_by(reference, month, year) %>%
    # Aggregate to the reference geohash at each month/year (all neighbors grouped into one row)
    summarize(monthly_crime = sum(monthly_crime),
                        monthly_311calls = sum(monthly_311calls),
                        monthly_demos = sum(monthly_demos),
                        monthly_viols = sum(monthly_viols)) %>%
    group_by(reference) %>%
    arrange(reference, year, month) %>%
    # Now we create average/rolling variables for each stat
    mutate(cummean_crimes = cummean(monthly_crime),
                 cummean_311calls = cummean(monthly_311calls),
                 cummean_demos = cummean(monthly_demos),
                 cummean_viols = cummean(monthly_viols),
                 cumsum_crimes = cumsum(monthly_crime),
                 cumsum_311calls= cumsum(monthly_311calls),
                 cumsum_demos = cumsum(monthly_demos),
                 cumsum_viols = cumsum(monthly_viols),
                 rollmean_crimes = new_rollmean(monthly_crime),
                 rollmean_311calls = new_rollmean(monthly_311calls),
                 rollmean_demos = new_rollmean(monthly_demos),
                 rollmean_viols = new_rollmean(monthly_viols)) %>%
    # For the first months of each series (fewer than 12 observations), fall back to the cumulative mean
    transform(rollmean_crimes = ifelse(is.na(rollmean_crimes), cummean_crimes, rollmean_crimes),
                        rollmean_311calls = ifelse(is.na(rollmean_311calls), cummean_311calls, rollmean_311calls),
                        rollmean_demos = ifelse(is.na(rollmean_demos), cummean_demos, rollmean_demos),
                        rollmean_viols = ifelse(is.na(rollmean_viols), cummean_viols, rollmean_viols))
# Attach each parcel's most recent demolition date, then join the prior-month incident stats for its neighborhood
parcel_incidents <- parcel_data %>%
    left_join({demo_permits %>%
                                transform(PERMIT_ISSUED = as.Date(Demolition.Date, format="%m/%d/%Y")) %>%
                                group_by(Parcel.ID) %>%
                                summarize(PERMIT_DATE = max(PERMIT_ISSUED))}, by=c('ParcelNo'='Parcel.ID')) %>%
  # Join each demolished parcel to the month *prior* to its demolition
  # (January rolls back to December of the previous year)
  mutate(join_month = ifelse(month(PERMIT_DATE) == 1, 12, month(PERMIT_DATE) - 1),
         join_year = ifelse(month(PERMIT_DATE) == 1, year(PERMIT_DATE) - 1, year(PERMIT_DATE)),
         # Parcels with no demolition permit join to the last full month of data (November 2016)
         join_month = ifelse(is.na(join_month), 11, join_month),
         join_year = ifelse(is.na(join_year), 2016, join_year)) %>%
    left_join(incident_list_cumulative, by=c('geohash_7'='reference', 'join_month'='month', 'join_year'='year'))

# Some geohashes don't show up because they have no incidents, and neighbors have no incidents. Fill these in with zeroes.
parcel_incidents <- parcel_incidents %>%
  transform(monthly_crime = ifelse(is.na(monthly_crime), 0, monthly_crime),
            monthly_311calls = ifelse(is.na(monthly_311calls), 0, monthly_311calls),
            monthly_demos = ifelse(is.na(monthly_demos), 0, monthly_demos),
            monthly_viols = ifelse(is.na(monthly_viols), 0, monthly_viols),
            cummean_crimes = ifelse(is.na(cummean_crimes), 0, cummean_crimes),
            cummean_311calls = ifelse(is.na(cummean_311calls), 0, cummean_311calls),
            cummean_demos = ifelse(is.na(cummean_demos), 0, cummean_demos),
            cummean_viols = ifelse(is.na(cummean_viols), 0, cummean_viols),
            cumsum_crimes = ifelse(is.na(cumsum_crimes), 0, cumsum_crimes),
            cumsum_311calls = ifelse(is.na(cumsum_311calls), 0, cumsum_311calls),
            cumsum_demos = ifelse(is.na(cumsum_demos), 0, cumsum_demos),
            cumsum_viols = ifelse(is.na(cumsum_viols), 0, cumsum_viols),
            rollmean_crimes = ifelse(is.na(rollmean_crimes), 0, rollmean_crimes),
            rollmean_311calls = ifelse(is.na(rollmean_311calls), 0, rollmean_311calls),
            rollmean_demos = ifelse(is.na(rollmean_demos), 0, rollmean_demos),
            rollmean_viols = ifelse(is.na(rollmean_viols), 0, rollmean_viols))

Model

parcel_incidents <- parcel_incidents %>%
  mutate(sales_year = year(as.Date(SaleDate, format='%m/%d/%Y')),
         SalePrice = as.numeric(gsub("\\$",'',SalePrice)),
         SEV = as.numeric(gsub("\\$",'',SEV)),
         AV = as.numeric(gsub("\\$",'',AV)),
         TV =  as.numeric(gsub("\\$",'',TV)))


set.seed(42)
train_idx <- createDataPartition(parcel_incidents$demo_flag, times=1, p=0.8, list=T)

smallset <- downSample(parcel_incidents[train_idx$Resample1,], parcel_incidents[train_idx$Resample1,]$demo_flag, list=F)

fitControl <- trainControl(method = "repeatedcv",
                           number = 10,
                           ## Estimate class probabilities
                           classProbs = TRUE,
                           ## Evaluate performance using 
                           ## the following function
                           summaryFunction = twoClassSummary
)

set.seed(1024)
rf_fit_incidents <- train(demo_flag ~ SalePrice + SEV + ResYrBuilt +
                  TotSqFt + TotAcres + Frontage + Depth + IsImproved +
#                   monthly_crime,monthly_311calls,monthly_demos,
#                   cummean_crimes,cummean_311calls,cummean_demos,
#                   cumsum_crimes,cumsum_311calls,cumsum_demos +
                   rollmean_crimes + rollmean_311calls + rollmean_demos + rollmean_viols,
                data = smallset, 
                method = "rf",
                ntree = 100,
                trControl = fitControl, 
                verbose = FALSE,
                ## Specify which metric to optimize
                metric = "ROC"
)

rf_fit_incidents
## Random Forest 
## 
## 18296 samples
##    55 predictors
##     2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 16466, 16467, 16467, 16466, 16466, 16466, ... 
## Resampling results across tuning parameters:
## 
##   mtry  ROC        Sens       Spec       ROC SD       Sens SD    
##    2    0.9540114  0.8806293  0.9075185  0.004551512  0.005479783
##    7    0.9531509  0.8891571  0.8969154  0.005762455  0.008052482
##   12    0.9511553  0.8885010  0.8969154  0.006356062  0.009766297
##   Spec SD   
##   0.01204289
##   0.01161469
##   0.01244415
## 
## ROC was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 2.
# rf_fit_allvars <- train(demo_flag ~ SalePrice + SEV + ResYrBuilt +
#                   TotSqFt + TotAcres + Frontage + Depth + IsImproved +
#                   monthly_crime + monthly_311calls + monthly_demos + monthly_viols +
#                   cummean_crimes + cummean_311calls + cummean_demos + cummean_viols +
#                   cumsum_crimes + cumsum_311calls + cumsum_demos + cumsum_viols +
#                   rollmean_crimes + rollmean_311calls + rollmean_demos + rollmean_viols,
#                 data = smallset,
#                 method = "rf",
#                 ntree = 100,
#                 trControl = fitControl,
#                 verbose = FALSE,
#                 ## Specify which metric to optimize
#                 metric = "ROC"
# )

# Need to impute missing values for the test set now that I am using caret. 
test_pred_incidents <- predict(rf_fit_incidents, newdata=parcel_incidents[-train_idx$Resample1,])
#test_preds_allvars <- predict(rf_fit_allvars, newdata=parcel_incidents[-train_idx$Resample1,])

The incident-based model once again performs well, with an AUC of 0.954 +/- 0.004. The performance has not increased drastically from the simple model, but with an AUC over 0.9, we would not expect to see huge leaps in performance.

confusionMatrix(test_pred_incidents, parcel_incidents[-train_idx$Resample1,]$demo_flag, positive='Yes')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    No   Yes
##        No  65999   242
##        Yes  8634  2045
##                                           
##                Accuracy : 0.8846          
##                  95% CI : (0.8823, 0.8869)
##     No Information Rate : 0.9703          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : NA              
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.89418         
##             Specificity : 0.88431         
##          Pos Pred Value : 0.19150         
##          Neg Pred Value : 0.99635         
##              Prevalence : 0.02973         
##          Detection Rate : 0.02659         
##    Detection Prevalence : 0.13883         
##       Balanced Accuracy : 0.88925         
##                                           
##        'Positive' Class : Yes             
## 

Unfortunately, the confusion matrix indicates that most of the improvement came from correctly identifying more of the demolished buildings, while the number of non-demolished properties flagged as at-risk has not decreased.

varImpPlot(rf_fit_incidents$finalModel, main='Variable Importance: Incident-Based Model')

Incidents clearly do play a role in the identification of blight, with the previous year’s monthly average of 311 calls in the surrounding neighborhoods being the second strongest predictor of blight, and nearly as influential as SEV. Interestingly, the only incident-based variable that does not appear to contribute strongly to the model is blight violations. Perhaps that is because the largest number of blight violations were located downtown, where there are few residential properties.

Results and Blight Prediction

Two models were trained to predict demolitions due to blight: a simple random forest model based on basic property features such as size, build year, and value, and a second random forest model that added in a rolling 12-month average of blight violations, 311 calls, crime, and demolitions in the surrounding neighborhoods. Both models were successful at identifying demolished properties, but also ‘misclassified’ a large number of properties as blighted.

A third model, not described here, included raw counts of incidents from the month prior to demolition; it achieved an AUC of 0.98 while reducing the number of misclassified properties by a factor of four. However, that performance comes at the cost of only a one-month lead time prior to demolition, which limits the model’s usefulness for diagnosing blight early. Given that Detroit wishes to demolish upwards of 40,000 buildings due to blight, a high ‘misclassification’ rate is probably expected, and the model with the greater lead time is therefore the most useful, even with its lower performance metrics.

The most predictive features in the incident-based random forest model are the State Equalized Value, the monthly averages of 311 calls, demolitions, and crimes over the 12 months prior to demolition, and the year the property was built. This collection of features is generally easy to come by, and provides a relatively long lead time for diagnosing impending blight in neighborhoods.