CPLN 675 Midterm - Inundation Model (Cartwright and Csere)

Project Presentation Information

A recording of our presentation can be viewed here: https://upenn.zoom.us/rec/share/TR7GwgcxE5TarJl8AIMjnw3SA4l5wifkve9Qq0wZDaFEjagxp1YkqqTM81U48uiv.TYPC4DRPexdr9dYA?startTime=1617201061000 Passcode: 15A&JjhD

Project Introduction

The 100-year flood has been highlighted as a misleading term in recent years. Regions around the world have grappled with flooding from unusually intense storms and increasingly variable rainfall due to climate change. Flood events have wide-ranging impacts, including safety concerns, property damage, loss of economic activity, and environmental destruction. In 2013, heavy rainfall in the province of Alberta, Canada caused an unexpected flood that resulted in over C$5 billion (US$3.958 billion) in damages and five deaths. Planners must learn from these events and attempt to understand why they had such drastic impacts, as well as determine what can be done to mitigate them in the future. Using data from the City of Calgary, the largest city in Alberta, this project builds a predictive model for areas of flood inundation and tests the model’s validity on areas that actually flooded in 2013. We then apply the predictive model to Denver County, Colorado, USA, to highlight areas that may be at risk of flooding if subject to similar levels of heavy rainfall. These identified areas could then be the focus of a plan to build Denver’s resilience in the face of increased flooding risk.

Significant Features

After testing multiple variables, our team found four to be the most significant and accurate predictors of flood inundation. The four variables used in the predictive model are: distance from a stream (in meters, with streams derived using ArcGIS); elevation (categorized from 1 to 10, with 1 as the lowest and 10 as the highest); whether or not the grid cell contains a concerning body of water; and permeable land as a percentage of the total area of the grid cell. A concerning body of water is defined as a water body that touches no land categorized as more than 60% permeable. Grid cells containing such a water body were assigned a value of 1 and all other grid cells a value of 0, creating a binary variable.

Each variable is summarized on a grid of 300 meter by 300 meter cells, which gives the model a consistent unit of analysis and allows it to be generalized more readily across cities.
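The grid summary described above can be illustrated with a minimal Python sketch. It is not the authors' ArcGIS or R workflow. The function names, the grid origin, and the use of projected coordinates in meters are assumptions made for illustration; only the 300 meter cell size and the definition of permeable land as a percentage of cell area come from the text.

```python
import math

CELL_SIZE_M = 300.0  # cell edge length stated in the report


def cell_index(x, y, origin_x=0.0, origin_y=0.0, size=CELL_SIZE_M):
    """Return the (column, row) of the grid cell containing a projected point.

    The origin is a hypothetical lower-left corner of the fishnet.
    """
    col = math.floor((x - origin_x) / size)
    row = math.floor((y - origin_y) / size)
    return col, row


def permeable_share(permeable_area_m2, size=CELL_SIZE_M):
    """Permeable land as a percentage of the cell's total area."""
    return 100.0 * permeable_area_m2 / (size * size)


print(cell_index(650.0, 299.0))    # cell containing the point (650 m, 299 m)
print(permeable_share(45000.0))    # half of a 90,000 square meter cell
```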

Final Logistic Regression Model Summary

Summary of Model

Our final model indicates that the intercept and all four independent variables are statistically significant at the 0.001 level. Stream distance and elevation are both negatively associated with inundation: cells farther from a stream or at a higher elevation are less likely to be inundated. In contrast, permeable land percentage and the presence of a concerning body of water in the fishnet cell are positively associated with inundation.

## 
## Call:
## glm(formula = Training ~ ., family = binomial(link = "logit"), 
##     data = floodTrain %>% as.data.frame() %>% select(-geometry))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9415  -0.4567  -0.1969  -0.0492   3.8931  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  7.688e-01  1.674e-01   4.592 4.39e-06 ***
## Stream_dis  -1.164e-03  6.622e-05 -17.576  < 2e-16 ***
## Perm_land    8.597e-03  1.829e-03   4.701 2.59e-06 ***
## Water_conc1  1.778e+00  1.586e-01  11.215  < 2e-16 ***
## elevation   -6.673e-01  3.940e-02 -16.934  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4259.2  on 5559  degrees of freedom
## Residual deviance: 2869.2  on 5555  degrees of freedom
## AIC: 2879.2
## 
## Number of Fisher Scoring iterations: 7
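To make the coefficients in the summary above concrete, the following Python sketch converts them into an inundation probability for a single grid cell using the logistic function. It is an illustration, not the authors' R code. The coefficient values are copied from the summary; the function names and the example cell values are hypothetical.

```python
import math

# Coefficients copied from the model summary above.
COEF = {
    "intercept": 7.688e-01,
    "Stream_dis": -1.164e-03,
    "Perm_land": 8.597e-03,
    "Water_conc1": 1.778e+00,
    "elevation": -6.673e-01,
}


def inundation_probability(stream_dis, perm_land, water_conc, elevation):
    """Predicted probability of inundation for one grid cell.

    stream_dis: distance to the nearest stream in meters
    perm_land:  permeable land as a percentage of the cell (0 to 100)
    water_conc: 1 if the cell contains a concerning body of water, else 0
    elevation:  elevation category from 1 (lowest) to 10 (highest)
    """
    logit = (
        COEF["intercept"]
        + COEF["Stream_dis"] * stream_dis
        + COEF["Perm_land"] * perm_land
        + COEF["Water_conc1"] * water_conc
        + COEF["elevation"] * elevation
    )
    return 1.0 / (1.0 + math.exp(-logit))


def odds_ratio(name):
    """Multiplicative change in the odds for a one-unit increase in a predictor."""
    return math.exp(COEF[name])


# Hypothetical cells: one low-lying and next to a stream, one high and far away.
print(inundation_probability(0, 50, 1, 1))
print(inundation_probability(2000, 50, 0, 8))
print(odds_ratio("Water_conc1"), odds_ratio("elevation"))
```

Read this way, the presence of a concerning body of water multiplies the odds of inundation by exp(1.778), roughly 5.9, while each step up in elevation category multiplies the odds by exp(-0.6673), roughly 0.51, holding the other variables fixed.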

Probability Threshold

Our team tested multiple models against a subset of data from Calgary’s 2013 inundation. By comparing each model’s flood inundation predictions to real-world data, our team was able to adjust features to maximize accuracy. The logistic regression model generates an inundation probability for each entry in the data set, but it is up to the modelers to choose the probability threshold above which an entry is classified as inundated. The threshold that best balanced the sensitivity (the true positive rate) and the specificity (the true negative rate) of the model was 0.175.
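The thresholding step can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' procedure: the report does not say exactly how sensitivity and specificity were combined, so the sketch uses their sum (Youden's J), one common criterion. The toy observations and probabilities are invented, and treating a probability equal to the threshold as inundated is also an assumption.

```python
def classify(probabilities, threshold=0.175):
    """Label a cell 1 (inundated) when its probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]


def sensitivity_specificity(observed, predicted):
    """True positive rate and true negative rate; both classes must be present."""
    pairs = list(zip(observed, predicted))
    tp = sum(1 for o, p in pairs if o == 1 and p == 1)
    fn = sum(1 for o, p in pairs if o == 1 and p == 0)
    tn = sum(1 for o, p in pairs if o == 0 and p == 0)
    fp = sum(1 for o, p in pairs if o == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def best_threshold(observed, probabilities, candidates):
    """Candidate threshold with the largest sensitivity + specificity."""
    def score(threshold):
        sens, spec = sensitivity_specificity(
            observed, classify(probabilities, threshold)
        )
        return sens + spec
    return max(candidates, key=score)


# Invented toy data: four dry cells and two inundated cells.
observed = [0, 0, 0, 0, 1, 1]
probabilities = [0.02, 0.05, 0.10, 0.30, 0.20, 0.60]

print(classify(probabilities))
print(sensitivity_specificity(observed, classify(probabilities)))
print(best_threshold(observed, probabilities, [0.175, 0.5]))
```

Because inundated cells are a small share of the data, a cutoff well below 0.5 can capture many more true positives at the cost of additional false positives, which is the trade-off the 0.175 threshold reflects.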

Confusion Matrix

The confusion matrix for the final model is included below. Based on this information, the sensitivity (“true positive rate”) of the predictive model is 0.8537 and the specificity (“true negative rate”) is 0.8343. Additionally, the Kappa statistic, which tests for interrater reliability and “represents the extent to which the data collected in the study are correct representations of the variables measured” (1), is 0.4769. Under Cohen's commonly cited interpretation, discussed in an article in Biochemia medica, kappa values between 0.41 and 0.60 indicate moderate agreement (1).

Based on this information, we plotted our ROC curve to visualize the model’s accuracy and proceeded to cross-validate the results.

(1) McHugh, M. L. (2012). Interrater reliability: the kappa statistic. Biochemia medica, 22(3), 276–282.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1742   43
##          1  346  251
##                                           
##                Accuracy : 0.8367          
##                  95% CI : (0.8212, 0.8513)
##     No Information Rate : 0.8766          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.4769          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.8537          
##             Specificity : 0.8343          
##          Pos Pred Value : 0.4204          
##          Neg Pred Value : 0.9759          
##              Prevalence : 0.1234          
##          Detection Rate : 0.1054          
##    Detection Prevalence : 0.2506          
##       Balanced Accuracy : 0.8440          
##                                           
##        'Positive' Class : 1               
##
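The statistics in the confusion matrix above can be reproduced directly from its four counts. The Python sketch below is an illustration rather than the authors' R code, and the function name is hypothetical. The counts are read from the matrix with rows as predictions and columns as reference values, and 1 as the positive class.

```python
def confusion_metrics(tn, fn, fp, tp):
    """Summary statistics for a 2x2 confusion matrix with 1 as the positive class."""
    n = tn + fn + fp + tp
    accuracy = (tp + tn) / n
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    ppv = tp / (tp + fp)           # positive predictive value
    npv = tn / (tn + fn)           # negative predictive value
    # Agreement expected by chance, from the row and column totals.
    expected = ((tn + fn) * (tn + fp) + (tp + fp) * (tp + fn)) / n ** 2
    kappa = (accuracy - expected) / (1 - expected)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "kappa": kappa,
    }


# Counts from the confusion matrix above.
metrics = confusion_metrics(tn=1742, fn=43, fp=346, tp=251)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The low positive predictive value alongside the high sensitivity shows the effect of the 0.175 threshold: most inundated cells are found, but fewer than half of the cells flagged as inundated actually flooded.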

Receiver Operating Characteristic (ROC) Curve

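The final ROC curve, a plot of the true positive rate against the false positive rate across all probability thresholds, is shown below. The area under the curve (AUC) is 0.9015. The AUC is not an accuracy percentage; it means that a randomly chosen inundated cell receives a higher predicted probability than a randomly chosen non-inundated cell about 90% of the time, which indicates that the final model discriminates well between the two classes.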

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

## Area under the curve: 0.9015
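The area under the ROC curve equals the share of (inundated, non-inundated) pairs in which the inundated cell receives the higher predicted probability. The Python sketch below computes it that way on invented toy data. It illustrates the definition and is not the routine that produced the value above; the function name is hypothetical.

```python
def auc(observed, scores):
    """Area under the ROC curve via pairwise comparison of cases and controls.

    Ties between a case and a control count as half a win.
    """
    cases = [s for o, s in zip(observed, scores) if o == 1]
    controls = [s for o, s in zip(observed, scores) if o == 0]
    wins = 0.0
    for case_score in cases:
        for control_score in controls:
            if case_score > control_score:
                wins += 1.0
            elif case_score == control_score:
                wins += 0.5
    return wins / (len(cases) * len(controls))


# Invented toy data: two dry cells and two inundated cells.
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

The pairwise loop is quadratic in the number of cells, which is fine for illustration; rank-based formulas give the same value more efficiently on large data sets.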

Cross Validation

To test how well the model generalizes, we used k-fold cross-validation with 100 folds on the Calgary data set. The mean accuracy across the 100 folds was 0.8925 and the mean Kappa was 0.3641, compared with an accuracy of 0.8367 and a Kappa of 0.4769 in the confusion matrix above. The variance in accuracy between folds was low, giving us confidence in applying the model to comparable urbanized areas.

## Generalized Linear Model 
## 
## 7942 samples
##    4 predictor
##    2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (100 fold) 
## Summary of sample sizes: 7863, 7863, 7863, 7863, 7863, 7862, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.8924769  0.3640955
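The fold structure in the output above can be illustrated with a short Python sketch. It splits 7942 samples into 100 folds whose sizes differ by at most one, which yields the training-set sizes of 7863 and 7862 reported in the summary. This is a plain random split written for illustration; the resampling used in the R output may assign samples differently (for example, by stratifying on the outcome), and the function names are hypothetical.

```python
import random
from statistics import mean, pstdev


def kfold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and split them into k near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)  # the first `extra` folds get one more
        folds.append(indices[start:start + size])
        start += size
    return folds


def summarize(fold_accuracies):
    """Mean and population standard deviation of per-fold accuracy."""
    return mean(fold_accuracies), pstdev(fold_accuracies)


folds = kfold_indices(7942, 100)
# Each fold is held out once; the model is fit on the remaining samples.
train_sizes = [7942 - len(fold) for fold in folds]
print(sorted(set(train_sizes)))
```

With only 79 or 80 cells held out per fold, each fold's accuracy is a coarse estimate, so the mean and spread across all 100 folds are more informative than any single fold.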

Training Set and Predicted Outcomes

Calgary Predictions

Because we have access to real data showing the areas where Calgary flooded in 2013, we are able to compare the model’s predictions of flood inundation to the historical data. If the model predicted flood inundation where it actually occurred in 2013, this is a “true positive”. The training set predictions for Calgary are shown below. We have included a map showing the true positives and negatives, as well as the false positives and negatives predicted by the model. We have also included a map showing where the final model predicts flooding in Calgary would occur.
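The four map categories described above follow from comparing each cell's observed 2013 outcome with the model's prediction. A minimal Python sketch of that labeling, with hypothetical function names and invented toy values, is:

```python
from collections import Counter


def outcome(observed, predicted):
    """Label one grid cell by comparing observed and predicted inundation (1 or 0)."""
    if observed == 1 and predicted == 1:
        return "true positive"
    if observed == 0 and predicted == 0:
        return "true negative"
    if observed == 0 and predicted == 1:
        return "false positive"
    return "false negative"


# Invented toy values for five cells.
observed = [1, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 0]
labels = [outcome(o, p) for o, p in zip(observed, predicted)]
print(Counter(labels))
```

Joining these labels back to the grid cells is what allows the errors to be mapped and inspected spatially.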

Denver Predictions

Though the model was trained on Calgary data, its predictions can potentially be extended to Denver because of the similarities between the two cities. A map of the predicted Denver flood inundation areas is shown below.