We used flood inundation data from Calgary, Alberta, Canada and selected variables related to watersheds to create a model for predicting areas that will flood. We then tuned the model and used it to predict flood inundation in a comparable city: Denver, Colorado, USA. Denver is half the size of Calgary in terms of land area and population, but we selected Denver because it has similar characteristics. Denver also has a major river running through it, is located in the foothills of a mountain range but is relatively flat, and is not a coastal city. We found data for both cities for the same 5 variables: flow accumulation, land cover (vegetated vs developed), distance to a body of water (250m or 820ft), distance to a body of water (500m or 1640ft), and impervious surfaces. The following document shows our process for creating the model and illustrates the output plots and maps for both Calgary and Denver.
To get started, let’s install libraries and a set of ggplot styles we
call mapTheme and plotTheme.
library(caret)
library(pscl)
library(plotROC)
library(pROC)
library(sf)
library(tidyverse)
library(knitr)
library(kableExtra)
library(tigris)
library(viridis)
mapTheme <- theme(plot.title =element_text(size=12),
plot.subtitle = element_text(size=8),
plot.caption = element_text(size = 6),
axis.line=element_blank(),
axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks=element_blank(),
axis.title.x=element_blank(),
axis.title.y=element_blank(),
panel.background=element_blank(),
panel.border=element_blank(),
panel.grid.major=element_line(colour = 'transparent'),
panel.grid.minor=element_blank(),
legend.direction = "vertical",
legend.position = "right",
plot.margin = margin(1, 1, 1, 1, 'cm'),
legend.key.height = unit(1, "cm"), legend.key.width = unit(0.2, "cm"))
plotTheme <- theme(
plot.title =element_text(size=12),
plot.subtitle = element_text(size=8),
plot.caption = element_text(size = 6),
axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
axis.text.y = element_text(size = 10),
axis.title.y = element_text(size = 10),
# Set the entire chart region to blank
panel.background=element_blank(),
plot.background=element_blank(),
#panel.border=element_rect(colour="#F0F0F0"),
# Format the grid
panel.grid.major=element_line(colour="#D0D0D0",linewidth=.75),
axis.ticks=element_blank())
Let’s load our data. We are going to include two datasets:
Calg_Fishnet, which is a fishnet dataset representing all
of our variables for the model, and Calg_Flooded, which is
the original shapefile of flooded areas (flood inundation) in Calgary,
Alberta, Canada.
Calg_Fishnet <- st_read("C:/Users/3lpaw/Desktop/ArcGIS Pro 3.2/EnvModeling/MidtermProject_FloodProbability/Calgary_fishnet_500m_Final3.shp")
Calg_Flooded <- st_read("C:/Users/3lpaw/Desktop/ArcGIS Pro 3.2/EnvModeling/MidtermProject_FloodProbability/Calg_Inundation_Poly.shp")
The data that we loaded in is projected in the Calgary_3TM_WGS_1984_W114 coordinate system, which uses meters as the unit. We want to make sure the data is projected with a linear unit in feet or meters so that we can do various distance measurements later on.
We loaded a shapefile of the Calgary city boundary to use as a background for the maps.
Calg_CityBoundary <- st_read("C:/Users/3lpaw/Desktop/ArcGIS Pro 3.2/EnvModeling/MidtermProject_FloodProbability/Calg_Cityboundary_2.shp")
Let’s start by checking out the variables we have in our datasets
using the glimpse command.
glimpse(Calg_Fishnet)
glimpse(Calg_Flooded)
The Calgary Fishnet variables are as follows:
MAJ_Hyd250 = In an area that is a body of water or within 250m of water (buffer)
MAJ_Hyd500 = In an area that is a body of water or within 500m of water (buffer)
SUM_FlowAcc = Flow Accumulation (continuous)
MAJ_ImpSurf = Impervious surface
MAJ_LandCo = Where land cover includes vegetated land cover types (forest, grassland, shrubland, agriculture, and golf course) = 0 and where land cover includes developed land or water bodies = 1.
Flooded = Where a cell has actually flooded (column joined from Calg_Flooded shapefile)
We selected these variables because we predict that cells that have high flow accumulation, are located in close proximity to a body of water, and have developed land cover are more likely to flood.
Let’s plot the Calgary flooded areas on top of the city boundary. We give the shape some color and transparency (alpha).
ggplot() +
geom_sf(data = Calg_CityBoundary,
fill = "transparent",
color = "black",
alpha = 1)+
geom_sf(data=Calg_Flooded,
aes(fill=as.factor(gridcode)),
color = "dark blue",
alpha = 0.6) +
scale_fill_manual(values = c("white", "dark blue"),
labels = c("Not Flooded","Flooded"),
name = "") +
labs(title="Flooded Areas in Calgary, Alberta, Canada") +
mapTheme
The gray line in the background shows the Calgary city boundary. The blue areas show where it actually flooded in Calgary. These areas have some overlap with the hydrology of the watershed (where there are rivers and bodies of water). The major river that runs through Calgary is the Bow River and the large body of water in the southwest is the Glenmore Reservoir.
Now let’s plot the fishnet version. We used a cell size of 500m.
ggplot() +
geom_sf(data=Calg_Fishnet, aes(fill=as.factor(Flooded)), color = "transparent") +
geom_sf(data = Calg_CityBoundary, fill = "transparent", color = "black")+
scale_fill_manual(values = c("white", "blue"),
labels = c("Not Flooded","Flooded"),
name = "") +
labs(title="Flooded Areas in Calgary, Canada (Fishnet)") +
mapTheme
When comparing the two plots, you can see that choosing 500m for the cell size may be a bit too large because in the fishnet the areas where the rivers are not very wide are no longer considered “Flooded”. But we will keep going with this analysis using the 500m cell size.
Here are maps for the five environmental- and watershed-related variables we selected. These maps show the Calgary fishnet feature class and each of the fields.
ggplot() +
geom_sf(data=Calg_Fishnet, aes(fill=as.factor(MAJ_Hyd250)), color = "transparent") +
geom_sf(data = Calg_CityBoundary, fill = "transparent", color = "black")+
scale_fill_manual(values = c("white", "light blue"),
labels = c("Not within 250m of a Body of Water","Within 250m of a Body of Water"),
name = "") +
labs(title="Proximity to Water (250m) in Calgary, Canada (Fishnet)") +
mapTheme
ggplot() +
geom_sf(data=Calg_Fishnet, aes(fill=as.factor(MAJ_hyd500)), color = "transparent") +
geom_sf(data = Calg_CityBoundary, fill = "transparent", color = "black")+
scale_fill_manual(values = c("white", "#0047AB"),
labels = c("Not within 500m of a Body of Water","Within 500m of a Body of Water"),
name = "") +
labs(title="Proximity to Water (500m) in Calgary, Canada (Fishnet)") +
mapTheme
ggplot() +
geom_sf(data=Calg_Fishnet, aes(fill=SUM_FlowAc), color = "black") +
geom_sf(data = Calg_CityBoundary, fill = "transparent", color = "black") +
labs(title="Flow Accumulation (Sum) in Calgary, Canada (Fishnet)", subtitle = "Cells that are light blue have higher flow accumulation") +
mapTheme
ggplot() +
geom_sf(data=Calg_Fishnet, aes(fill=as.factor(MAJ_ImpSur)), color = "transparent") +
geom_sf(data = Calg_CityBoundary, fill = "transparent", color = "black")+
scale_fill_manual(values = c("white", "gray"),
labels = c("Not Impervious","Impervious Surfaces"),
name = "") +
labs(title="Impervious Surfaces in Calgary, Canada (Fishnet)") +
mapTheme
ggplot() +
geom_sf(data=Calg_Fishnet, aes(fill=as.factor(MAJ_LandCo)), color = "transparent") +
geom_sf(data = Calg_CityBoundary, fill = "transparent", color = "black")+
scale_fill_manual(values = c("#d1ffbd", "dark gray"),
labels = c("Land Cover: Vegetated or Null","Land Cover: Water Bodies or Developed Land"),
name = "") +
labs(title="Land Cover (Reclassified) in Calgary, Canada (Fishnet)") +
mapTheme
Let’s build some bar plots that show differences in our independent variables across land that has and has not been flooded.
CalgPlotVariables <-
Calg_Fishnet %>%
as.data.frame() %>%
select(Flooded,MAJ_Hyd250,MAJ_hyd500,MAJ_ImpSur,MAJ_LandCo,SUM_FlowAc) %>%
gather(variable, value, -Flooded)
glimpse(CalgPlotVariables)
## Rows: 26,560
## Columns: 3
## $ Flooded <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0…
## $ variable <chr> "MAJ_Hyd250", "MAJ_Hyd250", "MAJ_Hyd250", "MAJ_Hyd250", "MAJ_…
## $ value <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
view(CalgPlotVariables)
Note there is a more modern version of gather
known as pivot_longer, where the syntax would look like
this:
Calg_Fishnet %>% as.data.frame() %>%
select(Flooded,MAJ_Hyd250,MAJ_hyd500,MAJ_ImpSur,MAJ_LandCo,SUM_FlowAc) %>%
pivot_longer(cols = -Flooded)
view(Calg_Fishnet)
Let’s examine some of the variables and how they vary across our
flooded/not flooded variable. Our CalgPlotVariables data
frame has rows for each cell-variable pair, we group_by
variable and flooded status and take the mean value for each variable by
status.
ggplot(CalgPlotVariables %>%
group_by(Flooded, variable) %>%
summarize(mean = mean(value))) +
geom_bar(aes(as.factor(Flooded),
mean,
fill=as.factor(Flooded)),
stat="identity") +
facet_wrap(~variable, scales = "free") + # You can use scales = "fixed" to keep them the same scale
scale_fill_manual(values = c("white", "dark blue"),
labels = c("Not Flooded","Flooded"),
name = "") +
labs(x="Flooded", y="Value")
## `summarise()` has grouped output by 'Flooded'. You can override using the
## `.groups` argument.
ggplot(CalgPlotVariables) +
geom_violin(aes(x = as.factor(Flooded),
y = value, fill = as.factor(Flooded))) +
facet_wrap(~variable, scales = "free") +
labs(x="Flooded", y="Value") +
scale_fill_manual(values = c("white", "dark blue"),
labels = c("Not Flooded","Flooded"), name = "") +
labs(x="Flooded", y="Value") +
plotTheme
Calg_Fishnet %>% as.data.frame () %>% group_by(Flooded) %>% tally() %>%mutate (pct = n/sum(n))
Let’s build another data set of just the variables we want to analyze. We re-code all of the variables from integers to factors (except for flow accumulation).
Calg_Fishnet <-
Calg_Fishnet %>%
select(Flooded,MAJ_Hyd250,MAJ_hyd500,MAJ_ImpSur,MAJ_LandCo,SUM_FlowAc) %>%
mutate(MAJ_Hyd250 = as.factor(MAJ_Hyd250))%>%
mutate(MAJ_hyd500 = as.factor(MAJ_hyd500))%>%
mutate(MAJ_ImpSur = as.factor(MAJ_ImpSur))%>%
mutate(MAJ_LandCo = as.factor(MAJ_LandCo))
view(Calg_Fishnet)
Now we create training and test sets. Let’s look over this operation:
- set.seed generates a random number -
createDataPartition randomly separates our data into two
sets. We set p to .85 - an 85% training set and 15% test
set.
set.seed(3456)
trainIndex <- createDataPartition(Calg_Fishnet$Flooded, p = .85,
list = FALSE,
times = 1)
CalgTrain <- Calg_Fishnet[ trainIndex,]
CalgTest <- Calg_Fishnet[-trainIndex,]
Now let’s estimate a logistic regression model. The binomial logit
model runs in the glm function (generalized linear models).
We specify the dependent variable as Flooded and run the
model on our training set CalgTrain. Note how we can use
the dplyr pipes right in the data parameter. We have to convert to a
data frame because R won’t know how to run a regression on an sf.
Let’s look at the model output, we see that we have coefficients, and p-values, but no R-squared. There are other goodness of fit metrics we will look at. The AIC, though not on a 0-1 scale like R-squared, has a similar function in that it tells you about overall model fit, but not about error and accuracy.
We are not really interested in our coefficients other than their
magnitude, directionality and p-value (generall). But for the record,
the way the coefficients in a logistic regression are interpreted is
different than in OLS - we are talking in terms of “odds” of an outcome
occurring (in our case odds of an area being flooded). If we
exponentiate the coefficient (exp()) we can interpret it as
all else equal the exponentiated value being the increase or
decrease in the odds of the outcome.
CalgModel <- glm(Flooded ~ .,
family="binomial"(link="logit"), data = CalgTrain %>%
as.data.frame() %>%
select(-geometry))
summary(CalgModel)
##
## Call:
## glm(formula = Flooded ~ ., family = binomial(link = "logit"),
## data = CalgTrain %>% as.data.frame() %>% select(-geometry))
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.717e+00 1.012e-01 -26.861 < 2e-16 ***
## MAJ_Hyd2501 7.838e-01 1.558e-01 5.032 4.85e-07 ***
## MAJ_hyd5001 2.762e-02 1.646e-01 0.168 0.8668
## MAJ_ImpSur1 -5.839e-02 2.094e-01 -0.279 0.7804
## MAJ_LandCo1 -3.847e-01 1.770e-01 -2.173 0.0298 *
## SUM_FlowAc 1.308e-06 2.900e-07 4.511 6.46e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2688.6 on 4515 degrees of freedom
## Residual deviance: 2580.2 on 4510 degrees of freedom
## AIC: 2592.2
##
## Number of Fisher Scoring iterations: 5
You can see that MAJ_hyd500 (Variable: Areas within 500m of a body of water) and MAJ_ImpSur (Variable: Impervious surfaces) are not significant variables in the model.
The other three variables MAJ_Hyd250 (Variable: Areas within 250m of a body of water), MAJ_LandCo (Variable: Areas where land cover is developed or a body of water) and SUM_FlowAc (Variable: Flow accumulation) are significant in the model.
These results may be because 500m from a body of water is too far of a buffer distance (compared to 250m) and may include land areas that do not typically flood. The impervious surface data is similarly not a good variable to use for this model to predict where there will be flood inundation. In general, areas that are developed, are bodies of water, or are located very close to bodies of water seem to be most significant in predicting flood inundation.
Let’s delete the variables from the model that are not significant: MAJ_hyd500 (areas that are a body of water or within 500m of one) and MAJ_ImpSur (impervious surfaces)
Calg_Fishnet <-
Calg_Fishnet %>%
select(Flooded,MAJ_Hyd250,MAJ_LandCo,SUM_FlowAc) %>%
mutate(MAJ_Hyd250 = as.factor(MAJ_Hyd250))%>%
mutate(MAJ_LandCo = as.factor(MAJ_LandCo))
view(Calg_Fishnet)
Create the training and test set again for Calgary.
set.seed(3456)
trainIndex <- createDataPartition(Calg_Fishnet$Flooded, p = .85,
list = FALSE,
times = 1)
CalgTrain <- Calg_Fishnet[ trainIndex,]
CalgTest <- Calg_Fishnet[-trainIndex,]
Run the model again for Calgary.
CalgModel <- glm(Flooded ~ .,
family="binomial"(link="logit"), data = CalgTrain %>%
as.data.frame() %>%
select(-geometry))
summary(CalgModel)
##
## Call:
## glm(formula = Flooded ~ ., family = binomial(link = "logit"),
## data = CalgTrain %>% as.data.frame() %>% select(-geometry))
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.711e+00 8.389e-02 -32.312 < 2e-16 ***
## MAJ_Hyd2501 8.045e-01 1.129e-01 7.128 1.02e-12 ***
## MAJ_LandCo1 -4.182e-01 1.299e-01 -3.221 0.00128 **
## SUM_FlowAc 1.313e-06 2.896e-07 4.534 5.80e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2688.6 on 4515 degrees of freedom
## Residual deviance: 2580.3 on 4512 degrees of freedom
## AIC: 2588.3
##
## Number of Fisher Scoring iterations: 5
Using the predict function, we create a vector of
classification probabilities we call classProbs. These are
the predicted probability of a test set (CalgTest) fishnet
cell being flooded conditional on our model. Setting the parameter
type="reponse" returns probabilities that range from 0 to
1.
Look at the distribution of classProbs.
classProbs_Calg <- predict(CalgModel, CalgTest, type="response")
hist(classProbs_Calg,
main = "Histogram of Class Probabilities (Calgary)",
xlab = "Class Probabilities")
Let’s put classProbs_Calg into a data frame along with
the observed Flooded outcome, which is either
1 for Flooded or 0 for Not Flooded.
Then we build this plot, testProbsPlot. The vertical
line represents a 0.5 probability of flooding.
testProbs <- data.frame(obs = as.numeric(CalgTest$Flooded),
pred = classProbs_Calg)
ggplot(testProbs, aes(x = pred, fill=as.factor(obs))) +
geom_density() +
facet_grid(obs ~ .) +
xlab("Probability") +
ylab("Frequency")+
geom_vline(xintercept = .5) +
scale_fill_manual(values = c("white", "dark blue"),
labels = c("Not Flooded","Flooded"),
name = "")+
plotTheme
You can adjust the value to move the probability line left or right. We tweak it in the next few steps.
Now we have to figure out at which probability level do we wish to classify land as being flooded.
We selected a prediction level of 0.12 (probability of flooding).
testProbs$predClass = ifelse(testProbs$pred > .12 ,1,0) # We are using pred> 0.12 for probability level
caret::confusionMatrix(reference = as.factor(testProbs$obs),
data = as.factor(testProbs$predClass),
positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 537 30
## 1 192 37
##
## Accuracy : 0.7211
## 95% CI : (0.6885, 0.752)
## No Information Rate : 0.9158
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1377
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.55224
## Specificity : 0.73663
## Pos Pred Value : 0.16157
## Neg Pred Value : 0.94709
## Prevalence : 0.08417
## Detection Rate : 0.04648
## Detection Prevalence : 0.28769
## Balanced Accuracy : 0.64443
##
## 'Positive' Class : 1
##
What is the sensitivity and specificity suggest about our model? What is accuracy suggest?
Predicted = 0, Observed = 0 —> True Negative
Predicted = 0, Observed = 1 —> False Negative
Predicted = 1, Observed = 0 —> False Positive
Predicted = 1, Observed = 1 —> True Positive
We had 537 True Negatives, where we predicted it would not flood and it did not flood.
We had 30 False Negatives, where we predicted it would not flood and it did flood.
We had 192 False Positives, where we predicted it would flood and it did not flood.
We had 37 True Positives, where we predicted it would flood and it did flood.
1. Sensitivity - the proportion of actual positives (1’s) that were predicted to be positive. Also known as “true positive rate”.
2. Specificity - The proportion of actual negatives (0’s) that were predicted to be negatives. Also known as “true negative rate”.
Let’s create an ROC (receiver operating characteristic) curve. It plots the rate of true positives and the rate of false positives at each threshold setting to illustrate the performance of the model.
ggplot(testProbs, aes(d = obs, m = pred)) +
geom_roc(n.cuts = 50, labels = FALSE) +
style_roc(theme = theme_grey) +
geom_abline(slope = 1, intercept = 0, size = 1.5, color = 'grey')
Each point on the line represents a threshold cutoff for our model. On the x-axis we have the proportion of times our model gets 0’s wrong (false positives), and on the y-axis the proportion of the time we get 1’s right (true positives). So for a point at (1,1), there is a cutoff where we are getting all our 1s right, but we are also mis-classifying all our 0’s as 1’s - essentially our model is a machine that predicts 1s.
A healthy model is one where you get it right most of the time, but not all of the time. If the ROC curve looks like a right angle, so the slope is really high in the beginning and then the line levels off, then it overfits your data. It won’t be good at make predictions for new data. You want it to be an arc over the straight line. This doesn’t overfit your data but still makes accurate predictions for true positives. Our ROC curve is not a perfect arc, but upon first observations, it seems to have a fair amount of area under the curve.
Let’s find the actual area under the curve. It ranges between 0 and 1 and you want to try to maximize this value.
auc(testProbs$obs, testProbs$pred)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Area under the curve: 0.6536
Testing the power of your model on out of sample data is critical to the machine learning process. Cross-validation iteratively creates many randomly generated test sets or ‘folds’, testing the power of your model on each.
First we set the ctrl parameter which specifies the flavor of cross validation we wish to use. You can see all the different cross validation options here. In this instance, number = 100 tell us that we are going to iteratively test our models on 100 hold out test sets.Then we estimate our model using the train function from the caret package. Note that we use the entire flood inundation data set as the process of cross-validation will create test sets for us.
Make sure that you update the regression to the model you specified above. We used p = 0.85.
ctrl <- trainControl(method = "cv",
number = 100,
p = 0.85,
savePredictions = TRUE)
cvFit <- train(as.factor(Flooded) ~ ., data = Calg_Fishnet %>%
as.data.frame() %>%
select(-geometry),
method="glm", family="binomial",
trControl = ctrl)
cvFit
## Generalized Linear Model
##
## 5312 samples
## 3 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (100 fold)
## Summary of sample sizes: 5258, 5258, 5259, 5259, 5260, 5260, ...
## Resampling results:
##
## Accuracy Kappa
## 0.9125364 -0.0003184713
Our accuracy is about 91%.
Notice that the accuracy metric is actually the average accuracy across all 100 folds. While that is useful, what we are really interested in is the variability of accuracy across all 100 folds. Before going any further into that, let’s plot a histogram of accuracy across all 100 folds.
ggplot(as.data.frame(cvFit$resample), aes(Accuracy)) +
geom_histogram() +
scale_x_continuous(limits = c(0, 1)) +
labs(x="Accuracy",
y="Count")+
plotTheme
The key to this plot is as follows. We want a model that is generalizable, particularly to data it hasn’t seen already. Cross-validation helps us understand how the model works in this context. If our model was not generalizable to ‘out-of-sample’ data, then we should expect wildly different accuracies across each of the 100 folds.
Now that we have tuned our model, let’s predict for the entire dataset and assess our predictions.
allPredictions <-
predict(cvFit, Calg_Fishnet, type="prob")[,2]
Calg_Fishnet <-
cbind(Calg_Fishnet,allPredictions) %>%
mutate(allPredictions = round(allPredictions * 100))
Now we map the predictions. Note how we use ntile in the aes parameter to create quintiles. (The quintile labels are created in the scale_fill_manual function.)
ggplot() +
geom_sf(data=Calg_Fishnet, aes(fill=factor(ntile(allPredictions,5))),
colour=NA) +
scale_fill_manual(values = c("#edf8fb","#b3cde3","#8c96c6","#8856a7","#810f7c"),
labels=as.character(quantile(Calg_Fishnet$allPredictions,
c(0.1,.2,.4,.6,.8),
na.rm=T)),
name="Predicted\nProbabilities(%)\n(Quintile\nBreaks)") +
mapTheme +
labs(title="Predicted Flood Inundation Areas",
subtitle="Calgary, Canada")
The areas that have a darker purple color show where there is a higher predicted probability of flooding. The lighter purple or blue areas show where there is a lower predicted probability of flooding.
Let’s map it again with areas where it has flooded overlaid.
ggplot() +
geom_sf(data=Calg_Fishnet, aes(fill=factor(ntile(allPredictions,5))), colour=NA) +
scale_fill_manual(values = c("#edf8fb","#b3cde3","#8c96c6","#8856a7","#810f7c"),
labels=as.character(quantile(Calg_Fishnet$allPredictions,
c(0.1,.2,.4,.6,.8),
na.rm=T)),
name="Predicted\nProbabilities(%)\n(Quintile\nBreaks)") +
geom_sf(data=Calg_Fishnet %>%
filter(Flooded == 1),
fill="dark blue",colour=NA) +
mapTheme +
labs(title="Observed and Predicted Flood Inundation Areas",
subtitle="Calgary, Canada; Existing flooded areas in dark blue")
The areas where it actually flooded are shown in dark blue. Compare this map and the previous one to see all of the areas where we predicted that it would flood.
We can assess many things about our model by exploring our errors.
Calg_Fishnet %>%
mutate(confResult=case_when(allPredictions < 15 & Flooded==0 ~ "True Negative",
allPredictions >= 15 & Flooded==1 ~ "True Positive",
allPredictions < 15 & Flooded==1 ~ "False Negative",
allPredictions >= 15 & Flooded==0 ~ "False Positive")) %>%
ggplot()+
geom_sf(aes(fill = confResult), color = "transparent")+
scale_fill_manual(values = c("Red","Orange","Light Blue","Light Green"),
name="Outcomes")+
labs(title="Confusion Metrics") +
mapTheme
Recall the results of the Confusion Matrix that we created before:
We had 30 False Negatives, where we predicted it would not flood and it did flood. These are shown in red in the map.
We had 192 False Positives, where we predicted it would flood and it did not flood. These are the second highest amount of our predictions. These are shown in orange in the map.
We had 537 True Negatives, where we predicted it would not flood and it did not flood. These are the majority of our predictions. These are shown in blue in the map.
We had 37 True Positives, where we predicted it would flood and it did flood. These are shown in green in the map.
Overall, the model is best at predicting where it will not flood. It accurately predicts where it will flood a small amount of the time. We tried to minimize the False Negatives and accepted that there would be some False Positives. This is because it is better to predict that it will flood, even if it doesn’t, so that residents of those areas can properly prepare for a flooding event. If we predicted it would not flood and it actually did (as with a False Negative), residents would not anticipate that they would be impacted by a flooding event and they may not make any preparations for one. This may cause more property damages and/or financial losses for those residents. Ideally, this model would still have a high accuracy (91%) but it would be better at predicting where where it would actually flood. This would allow city officials to properly warn residents in flood-prone areas and help them to prepare for flooding events.
If we recreated this model, we could have used a smaller fishnet cell size or other input variables to perhaps have greater accuracy.
Now we will use the model to predict flood inundation in Denver, Colorado. We have the same variables for both cities. Denver is similar to Calgary in that a major river runs through it, the South Platte River, and there are a few reservoirs located within the city boundary or nearby.
Denver_Fishnet <- st_read("C:/Users/3lpaw/Desktop/ArcGIS Pro 3.2/EnvModeling/MidtermProject_FloodProbability/Denver_Fishnet_Export.shp")
View(Denver_Fishnet)
# Renaming the columns to match Calgary variables
names(Denver_Fishnet)[names(Denver_Fishnet) == "DSmallWate"] <- "MAJ_Hyd250"
names(Denver_Fishnet)[names(Denver_Fishnet) == "DLargeWate"] <- "MAJ_hyd500"
names(Denver_Fishnet)[names(Denver_Fishnet) == "DImpervSur"] <- "MAJ_ImpSur"
names(Denver_Fishnet)[names(Denver_Fishnet) == "DLC"] <- "MAJ_LandCo"
names(Denver_Fishnet)[names(Denver_Fishnet) == "SUM"] <- "SUM_FlowAc"
Let’s also load a shape of CO counties from the US Census via the
tigris package, make sure it’s an sf object
(st_as_sf) and reproject it to crs = 2232 like our Denver
data (st_transform). Then create a new object that only
includes Denver County.
counties_CO <- counties('CO') %>%
st_as_sf() %>%
st_transform(crs = 2232)
view(counties_CO)
Denver_Countyboundary <- counties_CO %>%
filter(NAME == "Denver")
View(Denver_Countyboundary)
Now let’s plot the fishnet shapefile for Denver to see one of the variables.
ggplot() +
geom_sf(data=Denver_Fishnet, aes(fill=as.factor(MAJ_Hyd250)), color = "transparent") +
geom_sf(data = Denver_Countyboundary, fill = "transparent", color = "black")+
scale_fill_manual(values = c("white", "blue"),
labels = c("Not Within 820ft (250m) of a Body of Water","Within 820ft (250m) of a Body of Water"),
name = "") +
labs(title="Proximity to a Body of Water in Denver, CO (Fishnet)",
subtitle="The black boundary represents the Denver city border") +
mapTheme
Convert the variables to factors.
We only want to include the variables that we used in the final model for Calgary (the ones that were significant). These included (1) flow accumulation, (2) areas within 250m (or 820ft) of a water body, and (3) land cover (vegetated vs developed).
Denver_Fishnet <-
Denver_Fishnet %>%
select(MAJ_Hyd250,MAJ_LandCo,SUM_FlowAc) %>%
mutate(MAJ_Hyd250 = as.factor(MAJ_Hyd250))%>%
mutate(MAJ_LandCo = as.factor(MAJ_LandCo))
view(Denver_Fishnet)
classProbs_Denv <- predict(CalgModel, Denver_Fishnet, type="response")
hist(classProbs_Denv,
main = "Histogram of Class Probabilities (Denver)",
xlab = "Class Probabilities")
Now that we have tuned our model, let’s predict for the Denver dataset and assess our predictions.
allPredictions <-
predict(cvFit, Denver_Fishnet, type="prob")[,2]
Denver_Fishnet <-
cbind(Denver_Fishnet,allPredictions) %>%
mutate(allPredictions = round(allPredictions * 100))
Now we map the predictions for Denver.
ggplot() +
geom_sf(data=Denver_Fishnet, aes(fill=factor(ntile(allPredictions,5))),
colour=NA) +
scale_fill_manual(values = c("#edf8fb","#b3cde3","#8c96c6","#8856a7","#810f7c"),
labels=as.character(quantile(Denver_Fishnet$allPredictions,
c(0.1,.2,.4,.6,.8),
na.rm=T)),
name="Predicted\nProbabilities(%)\n(Quintile\nBreaks)") +
mapTheme +
labs(title="Predicted Flood Inundation Areas",
subtitle="Denver, Colorado")
The darker purple areas show where there are higher predicted probabilities for flooding in Denver, Colorado. The lighter purple or blue areas show where there are lower predicted probabilities for flooding.
Click the link below to watch.
library("vembedr")
## Warning: package 'vembedr' was built under R version 4.3.3
##
## Attaching package: 'vembedr'
## The following object is masked from 'package:lubridate':
##
## hms
embed_url("https://www.youtube.com/watch?v=BNCLNaMPlNQ")