Illegal dumping is a major problem in Philadelphia. Especially in low-income, minority neighborhoods, illegal dumping has a significant impact on quality of life, property values, safety, and public health. For a variety of reasons, dumping is likely under-reported in certain areas, which introduces selection bias into complaint data. Below, we implement two Poisson regression-based predictive models that attempt to mitigate this selection bias. One of these models incorporates spatial process into its data in the form of spatial lag, while the other does not. Both models are evaluated using regular 100-fold cross validation and leave-one-location-out spatial cross validation. Finally, we compare the results to the accuracy of a traditional kernel density estimate-based hotspot prediction. We find that:
incorporating spatial process into the model substantially improved its performance,
spatial cross-validation reveals that random k-fold cross validation is meaningfully over-optimistic in its evaluation of predictions,
our model does not generalize well across neighborhoods or racial contexts,
our model does not meaningfully improve over a more traditional kernel density estimate prediction approach,
and finally, our model is highly sensitive to input conditions such as cell size and classification breaks.
As a result, we do not recommend that this algorithm be put into production.
Illegal dumping is a major issue in Philadelphia. Especially in low-income, minority neighborhoods, illegal dumping has a significant impact on quality of life, property values, safety, and public health. For years, the City has failed to address the issue. This is likely due to a combination of factors such as bureaucratic ineptitude, lack of resources, the COVID-19 pandemic, and the sheer scale of the problem (there are, as of this writing, more than 1,600 open cases across the city, but only one detective assigned to solve them). In the case of illegal dumping, selection bias (when inaccurate sampling negatively impacts the quality of data) is likely a major issue. Residents of wealthier neighborhoods are more likely to report illegal dumping for a variety of reasons, such as greater trust in City services and a higher expectation that something will actually be done about the problem. Residents of poorer, majority-minority neighborhoods are less inclined to report illegal dumping, in part because they feel that nothing will be done about the problem anyway. As a result, illegal dumping is likely under-reported in the neighborhoods it most impacts.
As is evident from the histogram below, illegal dumping is also unevenly distributed across the city; the large majority of illegal dumping (complaints, at least) happens in only a handful of places across Philadelphia. This makes illegal dumping a suitable candidate for hotspot analysis, but the non-normal distribution also poses some challenges for statistical modeling.
The distribution of illegal dumping across Philadelphia hexagon bins, 2020-22
To intervene proactively and mitigate illegal dumping, it is useful to have a model that accounts for this selection bias. Below, we use a cross-validated Poisson regression to build a model that attempts to do exactly that. By incorporating associated risk factors (in this case, vacant properties, abandoned cars, street lights out, and building permits), we seek to construct a model that can more accurately predict future occurrences of illegal dumping while mitigating the influence of selection bias.
3 Methods
3.1 Data Sources & Cleaning
To begin with, we draw on several City data sources for feature engineering. We use data from 2020 to 2022, and, in addition to illegal dumping complaints, incorporate abandoned vehicles, light outages, vacant properties, and building permits. Note that, because these data come from the peak years of the COVID-19 pandemic, they are likely anomalous in many ways; for example, the Kenney administration cut the sanitation department’s budget.
3.1.1 The Fishnet Grid and Spatial Process
For the purposes of our spatial model, we aggregate data to a grid of equal-area hexagons covering Philadelphia. As is noted below, the cell size of these hexagons has a meaningful impact on the accuracy of the model. Ideally, this cell size should be matched to the spatial process of the dependent variable, but in this case we have settled on a value of 1000 feet, based on a trade-off between optimizing MAE and processing speed. (Generally, we found that smaller cell sizes reduced MAE, especially for the spatial process models.)
The distribution of illegal dumping across Philadelphia, 2020-22
In addition to the aggregated data, we calculate the spatial lag of contiguous neighbors. (Note that this introduces some issues with edge cases that have fewer contiguous neighbors than others.) We do this to account for the inherently spatial nature of our data; it is an approximation of spatial lag regression. The spatial lag model assumes that the value of the dependent variable at one location is associated with the values of that variable in nearby locations. “Nearby” is defined by the weights matrix W (rook, queen, or within a certain distance of one another). In other words, the spatial lag model includes the spatial lag of the dependent variable as a predictor. Although we do not implement a proper spatial lag model here, we attempt to approximate it by incorporating the lagged values of neighboring cells into our spatial process models. Here, for example, we visualize the spatial lag of abandoned cars relative to the actual values of abandoned cars.
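As a rough illustration of what a spatial lag computes, the sketch below builds one by hand for a toy strip of four cells. The cell counts, the simple edge-sharing neighbor structure, and the choice to row-standardize the weights (so that each lag is the mean of a cell's neighbors) are all assumptions made for the example; they are not taken from the analysis itself.

```r
# Toy example: four cells in a row (1-2-3-4), where adjacent cells are neighbors.
# The counts are made up for illustration.
counts <- c(10, 0, 4, 2)

# Binary contiguity matrix: adj[i, j] = 1 when cells i and j share an edge.
adj <- matrix(0, nrow = 4, ncol = 4)
adj[cbind(c(1, 2, 2, 3, 3, 4),
          c(2, 1, 3, 2, 4, 3))] <- 1

# Row-standardize (an assumption): each row of W sums to 1,
# so the lag is the average of the neighboring values.
W <- adj / rowSums(adj)

# Spatial lag: weighted average of each cell's neighbors.
lagged <- as.vector(W %*% counts)
lagged
```

Cells 1 and 4 sit on the edge of the strip and have only one neighbor each, so their lag is simply that single neighbor's value. This is the edge effect mentioned above.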
The spatial lag of illegal dumping across Philadelphia, 2020-22
Using local Moran’s I calculations, we can identify statistically significant hotspots of illegal dumping in Philadelphia. They appear to cluster in North Philly, West Philly, and Center City, roughly.
To assess their relevance to our model, we can plot the correlation of our predictors with our dependent variable. We note that all of our predictors have high R-squared values, and that, generally, our spatial process features have slightly higher R-squared values than their non-spatial process counterparts.
Below, we implement two versions of a Poisson regression, which is a type of generalized linear model (GLM) specifically meant to address count data that are skewed, as in the case of our illegal dumping data. (This contrasts with the default version of a GLM, which is based on a Gaussian, or normal, distribution of data.) We note that a Poisson regression is not the only potential model that could be used here, and that we found, for example, slightly better results using a regression tree, which suggests that there may be other suitable ways to approach this problem as well.
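To make the model form concrete: a Poisson regression with a log link models log(E[y]) as a linear combination of the predictors. The sketch below fits one with base R's `glm()` on simulated data. The variable names (`vacant`, `lights_out`, `dumping`), the simulated coefficients, and the sample size are placeholders chosen for illustration and are not the data or specification used in this analysis.

```r
set.seed(42)

# Simulated stand-in data: one row per grid cell (all values are made up).
n <- 500
cells <- data.frame(
  vacant     = rpois(n, lambda = 3),
  lights_out = rpois(n, lambda = 1)
)

# Generate counts from a known log-linear relationship.
true_mean <- exp(0.5 + 0.2 * cells$vacant + 0.1 * cells$lights_out)
cells$dumping <- rpois(n, lambda = true_mean)

# Poisson GLM with the canonical log link.
fit <- glm(dumping ~ vacant + lights_out,
           family = poisson(link = "log"),
           data = cells)

# Coefficients are on the log scale; exponentiate for multiplicative effects.
exp(coef(fit))

# Predicted counts on the response scale, and in-sample mean absolute error.
preds <- predict(fit, type = "response")
mae <- mean(abs(cells$dumping - preds))
mae
```

Because predictions are produced on the response scale, they are always positive, which is one reason the Poisson family suits count data better than a Gaussian GLM.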
3.2.2 Spatial Process
To account for spatial process, as mentioned above, we took inspiration from a spatial lag regression approach. Thus, we used our Poisson regression on two sets of data: one that was simply counts of predictors aggregated to our hexagon fishnet, and one that was the spatial lag of these predictors. Our hypothesis was that the spatial lag would minimize the impact of outliers and better account for spatial process.
4 Results
4.1 Non-Spatial Process Data
Broadly speaking, we see that our non-spatial process data produces granular hotspots fairly consistent with what we would expect. Both the k-fold cross validation and the leave-one-geography-out spatial cross validation produce similar prediction patterns, although we note that the spatial CV returns a much broader range of predictions, including some predictions that are obviously wrong.
On the other hand, the spatial process data produces less granular hotspots, but this tradeoff appears to increase the accuracy of the model. The hotspots are not as localized, but the range of predictions is much more consistent with what we would expect.
Validating our data, we see that the non-spatial process data yields an odd MAE outlier, indicating perhaps some kind of trouble dealing with the spatial nature of our data. On the other hand, the MAE distribution per iteration with our model that accounts for spatial process is much closer to a normal distribution, and appears to have a central tendency around 10.
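For reference, the mean absolute error for a single validation fold follows the standard definition below. The notation is ours rather than the original analysis's: $y_i$ is the observed complaint count in held-out cell $i$, $\hat{y}_i$ is the model's predicted count, and $n$ is the number of held-out cells in the fold.

```latex
\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|
```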
Here, we examine the mean and standard deviation of MAE per validation approach across the non-spatial and spatial models. We find that, while in both cases spatial cross validation reveals higher errors than k-fold CV, the model accounting for spatial process has a much lower MAE and a much smaller standard deviation of MAE than the model that does not account for spatial process. (We note, furthermore, that this is highly contingent on cell size; running this model with a cell size of 2500 feet, as opposed to our current 1000 feet, we did not see much difference between the non-spatial and spatial models.)
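The comparison between the two validation schemes can be sketched as follows. This is a minimal illustration on simulated data, not the code used for the results reported here: the `neighborhood` grouping, the single `vacant` predictor, and the use of 10 random folds (rather than the 100 used in the analysis) are assumptions made to keep the example short.

```r
set.seed(1)

# Simulated stand-in data: grid cells tagged with a hypothetical neighborhood.
n <- 300
cells <- data.frame(
  neighborhood = sample(paste0("nbhd_", 1:10), n, replace = TRUE),
  vacant       = rpois(n, lambda = 3)
)
cells$dumping <- rpois(n, lambda = exp(0.3 + 0.25 * cells$vacant))

# Fit on the training rows and return the MAE on the held-out rows.
fold_mae <- function(data, is_test) {
  fit <- glm(dumping ~ vacant, family = poisson, data = data[!is_test, ])
  preds <- predict(fit, newdata = data[is_test, ], type = "response")
  mean(abs(data$dumping[is_test] - preds))
}

# Random k-fold CV: cells are assigned to folds regardless of location.
k <- 10
fold_id <- sample(rep(seq_len(k), length.out = n))
cv_mae <- sapply(seq_len(k), function(i) fold_mae(cells, fold_id == i))

# Spatial CV: hold out one whole neighborhood at a time.
groups <- unique(cells$neighborhood)
spcv_mae <- sapply(groups, function(g) fold_mae(cells, cells$neighborhood == g))

# Mean and standard deviation of MAE per validation approach.
mae_summary <- data.frame(
  method   = c("random k-fold", "leave-one-neighborhood-out"),
  mean_mae = c(mean(cv_mae), mean(spcv_mae)),
  sd_mae   = c(sd(cv_mae), sd(spcv_mae))
)
mae_summary
```

In this toy data the neighborhoods carry no real spatial structure, so the two schemes should give similar errors. With real, spatially clustered data, holding out entire neighborhoods removes nearby cells from the training set, which is why spatial CV tends to report higher errors than random k-fold.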
Overall, the model incorporating spatial process was more accurate than the one that did not. With a lower MAE and smaller standard deviation, it is preferable. That does not, however, mean that it is an especially accurate model in general, merely that it is more accurate than a non-spatial Poisson regression model.
5.1.1 Comparing to KDE Approach
To assess the accuracy of our model more generally, we compare it to the more traditional hotspot approach based on a kernel density estimate to see how both models fare in predicting 2023 illegal dumping based on 2020-22 data. Here, we implement an adaptive bandwidth kernel density estimate.
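```r
library(dplyr)  # provides the %>% pipe
library(tmap)

# st_kde() is a project helper defined in R/st_kde.R
source("R/st_kde.R")

# Kernel density estimate of the sampled complaints
dumping_kde <- st_kde(complaints_sample)
terra::crs(dumping_kde) <- crs

# Clip the raster to the Philadelphia boundary
dumping_kde_phl <- terra::mask(dumping_kde, phl)

# Note: %>% binds more tightly than +, so tmap_theme() receives the
# tm_raster() layer rather than the combined map.
tm <- tm_shape(dumping_kde_phl) +
  tm_raster(palette = "viridis", title = "KDE", style = "fisher") %>%
  tmap_theme("Kernel Density Estimate of Illegal Dumping")
```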
We then classify both the KDE layer and the spatial process Poisson model into five risk classes based on quantile breaks (this decision is discussed below). We then compare how these risk classes capture the actual distribution of illegal dumping complaints in Philadelphia in 2023 so far.
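One way to assign five quantile-based risk classes is sketched below. Because dumping predictions are skewed and often tied (many cells share the same low value), computing quantile breaks directly can yield duplicate break points; this sketch sidesteps that by splitting on ranks instead. The helper name `to_risk_class`, the simulated `predicted` values, and the tie-breaking rule are assumptions for illustration, and the original analysis may have classified cells differently.

```r
set.seed(7)

# Stand-in for predicted dumping counts per grid cell (made-up values).
predicted <- rpois(200, lambda = 2)

# Hypothetical helper: split values into n_classes equally sized groups by rank.
# Ties are broken by order of appearance, so tied values that straddle a
# class boundary can land in adjacent classes.
to_risk_class <- function(x, n_classes = 5) {
  ranks <- rank(x, ties.method = "first")
  as.integer(ceiling(ranks * n_classes / length(x)))
}

risk_class <- to_risk_class(predicted)

# Each class holds one fifth of the cells; class 5 is the highest predicted risk.
table(risk_class)
```

A model is doing its job when the highest risk classes capture a disproportionate share of the complaints observed later, which is the comparison made in the chart that follows.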
We find that the spatial process Poisson model does not meaningfully outperform the KDE predictions when considering 2023 data. This suggests that, without further improvement to the input parameters and the feature engineering, it is not worth the effort to implement this approach rather than a more traditional, straightforward hotspot model using a KDE.
```r
library(dplyr)
library(sf)
library(ggplot2)

# Attach observed 2023 complaint counts to the spatial process predictions
lag_preds_net <- left_join(lag_preds_net, st_drop_geometry(recent_complaints_net))

# lag_spcv_preds_boxplot <- ggplot(lag_preds_net) +
#   geom_boxplot(aes(x = risk_class, y = count_recent_complaints)) +
#   labs(title = "2023 Illegal Dumping by Predicted Risk",
#        subtitle = "Spatial CV, Spatial Process Data",
#        x = "Risk Class",
#        y = "Actual 2023 Complaints")

# Attach the same observed counts to the KDE grid
kde_grid <- left_join(kde_grid, st_drop_geometry(recent_complaints_net))

# kde_preds_boxplot <- ggplot(kde_grid) +
#   geom_boxplot(aes(x = risk_class, y = count_recent_complaints)) +
#   labs(title = "2023 Illegal Dumping by Predicted Risk",
#        subtitle = "KDE",
#        x = "Risk Class",
#        y = "Actual 2023 Complaints")
#
# ggarrange(lag_spcv_preds_boxplot, kde_preds_boxplot, ncol = 2)

# Share of 2023 complaints captured by each risk class (spatial process model)
lag_preds_net_predicted <- lag_preds_net %>%
  st_drop_geometry() %>%
  group_by(risk_class) %>%
  summarize(tot_dumping = sum(count_recent_complaints)) %>%
  mutate(pct_dumping = tot_dumping / sum(tot_dumping),
         model = "SPCV") %>%
  select(risk_class, pct_dumping, model)

# Share of 2023 complaints captured by each risk class (KDE)
kde_grid_predicted <- kde_grid %>%
  st_drop_geometry() %>%
  group_by(risk_class) %>%
  summarize(tot_dumping = sum(count_recent_complaints)) %>%
  mutate(pct_dumping = tot_dumping / sum(tot_dumping),
         model = "KDE") %>%
  select(risk_class, pct_dumping, model)

predicted_compared <- rbind(lag_preds_net_predicted, kde_grid_predicted)

# Side-by-side bars: percent of 2023 dumping per risk class, by model
ggplot(predicted_compared) +
  geom_col(aes(x = risk_class, y = pct_dumping * 100, fill = model),
           position = "dodge") +
  labs(title = "Risk Prediction (2023 Illegal Dumping)",
       subtitle = "Spatial Process Poisson Regression vs. KDE",
       x = "Risk Class",
       y = "Pct. Dumping",
       fill = "Model")
```
5.2 Generalizability
The use of spatial cross validation is in itself a means of testing the generalizability of the model across neighborhoods. Given that, even with spatial process accounted for in the data, the spatial cross validation indicates a higher average MAE than regular k-fold cross validation, it is evident that this model does not generalize perfectly across neighborhoods.
5.2.1 Racial Context
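Furthermore, when considering racial context, we note that the model performs worse in majority non-white neighborhoods than in majority white neighborhoods: under spatial cross validation, the spatial process model has an MAE of about 10.5 in the former and about 6.3 in the latter. This indicates that it does not generalize across racial contexts. (The one exception is the non-spatial model under spatial cross validation, where the majority white error of 73.45 may reflect the MAE outlier noted above.)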
```r
mae_x_race %>%
  kbl(caption = "MAE by Race Context and Model") %>%
  kable_minimal(full_width = FALSE)
```
MAE by Race Context and Model

| race_context | cv_avg_error | spcv_avg_error | lag_cv_avg_error | lag_spcv_avg_error |
|---|---|---|---|---|
| Majority Non-White | 10.930919 | 11.077264 | 10.321334 | 10.530307 |
| Majority White | 7.448051 | 73.450550 | 6.139702 | 6.287381 |
| NA | 4.047303 | 4.068974 | 2.947341 | 2.962322 |
6 Conclusion
Above, we evaluated the effectiveness of a Poisson regression in predicting illegal dumping across Philadelphia. We assessed the utility of incorporating spatial process into our dataset and compared the merits of random k-fold cross validation versus spatial cross validation at the neighborhood level. We evaluated the accuracy and generalizability of all models and found that the model that incorporated spatial process meaningfully outperformed the model that did not. However, spatial cross-validation revealed that the model does not generalize well across neighborhoods. Furthermore, we found that it does not generalize well across racial contexts. Lastly, we compared the model to a more traditional hotspot prediction approach using a kernel density estimate and found that it did not offer much in the way of improvement.
As a result of these findings, we do not recommend that this model be put into production. Given its limited accuracy and generalizability, especially compared to simpler approaches such as a KDE, it is not worth the effort invested. We found, furthermore, that it is highly sensitive to input parameters such as the cell size of the hexagon grid, or the class breaks used to define risk classes for prediction (quantile breaks were more effective at capturing risk than Jenks breaks, despite the skewed distribution of dumping complaints). In the course of our work, we found that other models, such as a regression tree, spatial lag regression, or geographically weighted regression, performed better and may be preferable, too. In sum, given the sensitivity of the model to input parameters and its limited improvement over simpler approaches, we do not recommend using it.