DS 6030 | Spring 2021 | University of Virginia
In early 2010 the Caribbean nation of Haiti was devastated by a magnitude 7.0 earthquake. The catastrophe leveled many buildings and claimed numerous lives. Most people around the world are familiar with this disaster and its level of destruction, but fewer are familiar with its after-effects on those who survived but whose homes did not.
In the wake of the earthquake, an estimated five million people, more than 50% of the population at the time, were displaced, with 1.5 million of them living in tent camps (https://www.worldvision.org/disaster-relief-news-stories/2010-haiti-earthquake-facts). This widespread displacement of people across a country with badly damaged infrastructure made relief efforts more difficult. Teams needed an accurate way to locate these individuals so they could provide aid.
In an effort to assist the search, a team from the Rochester Institute of Technology (RIT) collected aerial imagery of the country. These images were then converted into datasets of Red, Green, and Blue (RGB) values. Using this RGB data from the imagery, with the knowledge that many of the displaced people were using distinguishable blue tarps as their shelter, I attempted to predict the locations of these blue tarps using several classification models.
My goal in this analysis was to determine the optimal model for locating displaced people. To compare the models, I focused on three statistics: accuracy, false positive rate (FPR), and false negative rate (FNR). Given the context of the situation, I believed the FNR to be the most important metric, much more so than the FPR, because I wanted to make sure no displaced individual was overlooked. I would much rather over-classify and find no one at a certain location than under-classify and fail to provide aid to someone in need. However, these efforts still needed to be made in a timely manner, so grossly over-classifying to get the smallest FNR was not the optimal solution either. A combination of accuracy, FNR, and FPR was therefore used to evaluate the models.
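For reference, all three of these statistics come directly from a 2x2 confusion matrix. A minimal sketch of how they would be computed in R is below; the objects pred and truth are hypothetical stand-ins for a model's predicted classes and the true classes.
# Hypothetical sketch: pred and truth are factors with levels c("No","Yes")
cm <- table(Predicted = pred, Actual = truth)
TP <- cm["Yes","Yes"]; TN <- cm["No","No"]
FP <- cm["Yes","No"];  FN <- cm["No","Yes"]
accuracy  <- (TP + TN) / sum(cm)
FNR       <- FN / (FN + TP)  # displaced people the model misses
FPR       <- FP / (FP + TN)  # non-tarp pixels flagged as tarps
precision <- TP / (TP + FP)  # share of flagged pixels that really are tarps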
Two datasets were provided for this analysis, the training data and the hold-out data. All seven models were trained using 10-fold cross validation and then their predictive ability tested on the hold-out dataset.
Several packages were used for this analysis, but the most important was the Caret package. This package allowed me to easily reproduce different trained models, use the same 10-fold cross-validation on each, and iteratively test different thresholds and tuning parameters when needed.
# Load Required Packages
library(tidyverse)
library(htmlTable)
library(ggplot2)
library(ggmap)
library(GGally)
library(ROCR)
library(knitr)
library(caret)
library(plotly)
library(gridExtra)
library(doParallel)
Parallel processing was used for the training of each model to improve performance.
numCores <- 14
The data provided for this analysis consists of four fields. "Class" defines the land classification of the pixel: vegetation, soil, rooftop, various non-tarp, or blue tarp. The other three fields, "Red," "Green," and "Blue," contain the RGB color values of the pixel and were used as the predictors in the models.
data <- read.table("./data/HaitiPixels.csv", sep=",", header=TRUE)
To make the analysis easier, a new binary field was added to the dataset representing whether or not the pixel was a "Blue Tarp." This was done to simplify the modeling process, because we are not interested in predicting the other classes. The new field, "ClassTarp," was used as the response in the models.
data$ClassTarp <- ifelse(data$Class == "Blue Tarp", "Yes", "No")
data$ClassTarp <- factor(data$ClassTarp, levels = c("No", "Yes"))
In exploring the correlations of the color values, I noticed that Blue Tarp pixels rarely had Red values above 200, and only pixels with a Blue value above roughly 100 were classified as Blue Tarp. This may cause issues in areas with strong shadows from trees or buildings.
data %>%
dplyr::select(-Class, -ClassTarp) %>%
ggpairs(upper=list(continuous=wrap("cor", size=3)),
mapping=ggplot2::aes(color=data$ClassTarp,alpha=0.7))
Because each pixel value is composed of three colors, I thought a 3D scatter plot would be more appropriate for showing how all three values play into each classification. In the plot below, it is easy to see a strong distinction between the Blue Tarp pixels and the others, with very little bleed into the other categories' regions. I also noticed that the cloud of Blue Tarp pixels grows as all three values increase. This could be because pixels with lower color values are very dark, making it more difficult to classify a pixel as a Blue Tarp.
plot_ly(x=data$Red, y=data$Green, z=data$Blue, type="scatter3d", mode="markers", color=data$Class)
The above plots show that it is unlikely to find a Blue Tarp pixel with a high Red value, but not as unlikely for a pixel with a high Green value. I think this is because Green and Blue are more visually related than Red and Blue.
After looking at the relationships between the three color values, and how they interact with the land classifications, I believe there are cases where a model with an added interaction between Green and Blue could be beneficial to the overall accuracy.
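To make that idea concrete, the interaction simply adds a Green-by-Blue product column to the model's design matrix; a quick illustrative check (not part of the original analysis) is below.
# Illustrative only: the interaction formula adds a single Green:Blue column
head(model.matrix(~ Red + Green + Blue, data))
head(model.matrix(~ Red + Green + Blue + Green:Blue, data))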
Several values were set and functions created to make the model building and analysis process more efficient. This included setting a universal seed so everything is reproducible, and creating a set of folds and a trainControl() object so that the same 10-fold cross-validation was performed on all models.
useed <- 13
folds <- createFolds(data$ClassTarp, k=10, list = TRUE, returnTrain = TRUE)
The trainControl() object was a very useful part of the Caret package for this analysis. It controls how the models are trained and what information is returned from that training. By setting method to "cv," number to 10, and using the previously created folds as the index, I was able to reproduce the same 10-fold cross-validation on every model. Saving the prediction values was also very useful for plotting the ROC curves, and classProbs was set to TRUE so that the probabilities for both the "No" and "Yes" classes were calculated and returned.
# https://www.rdocumentation.org/packages/caret/versions/6.0-88/topics/trainControl
control <- trainControl(method="cv",
number=10,
index=folds,
savePredictions=TRUE,
classProbs=TRUE,
allowParallel=TRUE)
The threshold_test() function was created to produce a dataframe of summary statistics for the inputted model at multiple thresholds, in order to determine the most desirable value. It makes use of the thresholder() function from the Caret package, which takes a model, a sequence of threshold values, and a list of summary statistics, and returns a dataframe of those statistics calculated at each threshold. Because I was also interested in minimizing the false negative rate, I added the FNR (and FPR) to the dataframe before it was returned.
# https://www.rdocumentation.org/packages/caret/versions/6.0-88/topics/thresholder
th <- seq(0.1,0.9, by=0.1)
statsWanted <- c("Accuracy","Kappa","Sensitivity","Specificity","Precision")
threshold_test <- function(model) {
set.seed(useed) # just in case
stats <- thresholder(model,
threshold=th,
statistics=statsWanted)
stats$falseNeg <- 1 - stats$Sensitivity
stats$falsePos <- 1 - stats$Specificity
return(stats)
}
The plotROC() function was created to make ROC curve plotting easier for each model, and to calculate and append the AUROC value to the statistics generated by the threshold_test() function. It takes the model, the statistics of the selected "optimal" model, and a string to add to the title of the plot. Because the model passed in is a train() object created by the Caret package, it already has the cross-validated predictions saved in "$pred," which are used to calculate the true positive rates (TPR) and false positive rates (FPR) for the ROC curve.
plotROC <- function(model, stats.selected, model_name) {
set.seed(useed)
# https://www.statmethods.net/management/sorting.html
prob <- model$pred[order(model$pred$rowIndex),]
# https://rviews.rstudio.com/2019/03/01/some-r-packages-for-roc-curves/
rates <- prediction(prob$Yes,as.numeric(data$ClassTarp))
roc <- performance(rates, measure="tpr", x.measure ="fpr")
plot(roc, main=paste("ROC Curve:", model_name))
lines(x=c(0,1),y=c(0,1),col="red")
auc <- performance(rates,"auc")
stats.selected <- stats.selected %>% mutate(AUROC = auc@y.values[[1]])
return(stats.selected)
}
The predictionStats() function was made to calculate and organize the predictive ability of the trained models on the hold-out dataset. Parallel processing was applied within the function because closing the cluster after each model consistently kept memory usage low.
predictionStats <- function(model, oldStats) {
cluster <- makePSOCKcluster(numCores)
registerDoParallel(cluster)
set.seed(useed)
prob <- predict(model, newdata=hold.out, type="prob")
pred <- as.factor(ifelse(prob$Yes>oldStats$prob_threshold, "Yes", "No"))
rate <- prediction(prob[,2], hold.out$ClassTarp)
auc <- performance(rate,"auc")
stopCluster(cluster)
conf <- confusionMatrix(pred, hold.out$ClassTarp)
Tuning <- NaN
Tuning2 <- NaN
cols <- colnames(oldStats)
# tuning
for (col in c("alpha","k","mtry","C")) {
if(col %in% cols){
Tuning <- oldStats[[col]]
}
}
# tuning2
for (col in c("lambda","sigma")) {
if(col %in% cols){
Tuning2 <- oldStats[[col]]
}
}
stats <- data.frame(Tuning = Tuning,
Tuning2 = Tuning2,
AUROC = auc@y.values[[1]],
Threshold = oldStats$prob_threshold,
Accuracy = conf$overall[["Accuracy"]],
TPR = conf$byClass[["Sensitivity"]],
FPR = 1-conf$byClass[["Specificity"]],
Precision = conf$byClass[["Precision"]])
return(stats)
}
For most model types I started by comparing the measured accuracy, kappa, and other summary statistics of two formulas. The first formula is the basic full model,
\[ \text{ClassTarp} = \beta_1\text{Red} + \beta_2\text{Green} + \beta_3\text{Blue} \]
and the second is the same formula with an added Green-Blue interaction,
\[ \text{ClassTarp} = \beta_1\text{Red} + \beta_2\text{Green} + \beta_3\text{Blue} + \beta_4(\text{Green} \times \text{Blue}) \]
I did this believing the added interaction might make the model more accurate, or reduce the FNR by a desirable amount.
To create the logistic regression models, I used the train() function of the Caret package and set the family to “binomial” and the method to “glm” for generalized linear model. Also, the trainControl object was brought in to perform 10-fold cross-validation on the model to calculate the accuracy and kappa statistics.
# https://www.youtube.com/watch?v=BQ1VAZ7jNYQ
set.seed(useed)
cluster <- makePSOCKcluster(numCores)
registerDoParallel(cluster)
data.log <- train(ClassTarp~Red+Green+Blue,
data=data,
family="binomial",
method="glm",
trControl=control)
data.log.i <- train(ClassTarp~Red+Green+Blue+Green:Blue, data=data,
family="binomial",
method="glm",
trControl=control)
stopCluster(cluster)
data.log$results %>% slice_max(Accuracy)
#> parameter Accuracy Kappa AccuracySD KappaSD
#> 1 none 0.9953194 0.9210953 0.0009495595 0.0167334
data.log.i$results %>% slice_max(Accuracy)
#> parameter Accuracy Kappa AccuracySD KappaSD
#> 1 none 0.9955092 0.9258895 0.000742601 0.01245748
Comparing the accuracy and kappa statistics of each model, the values were nearly identical, implying that adding the interaction would not be beneficial enough to justify the added complexity. However, there was still a chance that the added interaction could benefit other statistics, so I ran the threshold test on both models to compare their performance across multiple threshold values.
Using the threshold_test() function I tested each model’s performance and selected the threshold that produced the highest accuracy for each.
data.log.thres <- threshold_test(data.log)
data.log.thres[2:9] %>% slice_max(Accuracy)
#> prob_threshold Accuracy Kappa Sensitivity Specificity Precision
#> 1 0.7 0.9956515 0.9281814 0.9984482 0.9109472 0.9970642
#> falseNeg falsePos
#> 1 0.001551815 0.08905282
data.log.i.thres <- threshold_test(data.log.i)
data.log.i.thres[2:9] %>% slice_max(Accuracy)
#> prob_threshold Accuracy Kappa Sensitivity Specificity Precision
#> 1 0.8 0.9961575 0.939379 0.9972231 0.9638882 0.9988059
#> falseNeg falsePos
#> 1 0.002776924 0.03611179
The highest accuracy threshold for the no-interaction model was 0.7, while that of the interaction model was 0.8. The accuracy of the interaction model was higher, but the FNR of the no-interaction model was lower. Because I wanted to reduce the FNR, and the decrease in accuracy was insignificant, I believed the best choice for logistic regression was the no-interaction model, at the threshold 0.7.
The ROC curve is a good way to visually represent the classification abilities of a model, plotting the TPR against the FPR at numerous threshold values. The ROC curves for this analysis are built from the out-of-sample data predictions provided from the train() function. I used the plotROC() function to both plot the ROC curve and calculate the AUROC value of the model.
log.selected <- data.log.thres[2:9] %>% slice_max(Accuracy)
log.selected <- plotROC(data.log, log.selected, "Logistic Regression")
The ROC curve for the selected logistic regression model was very good. The true positive rate reaches 1 barely above a false positive rate of 0. The AUROC for this model was 0.998491.
To create the LDA models, I used the train() function again, but set the method to “lda.” By using the same trainControl() object throughout all the model building, I was able to perform 10-fold cross-validation on the model to calculate the accuracy and kappa statistics, using the same folds every time.
set.seed(useed)
cluster <- makePSOCKcluster(numCores)
registerDoParallel(cluster)
data.lda <- train(ClassTarp~Red+Green+Blue, data=data,
method="lda",
trControl=control)
data.lda.i <- train(ClassTarp~Red+Green+Blue+Green:Blue, data=data,
method="lda",
trControl=control)
stopCluster(cluster)
data.lda$results %>% slice_max(Accuracy)
#> parameter Accuracy Kappa AccuracySD KappaSD
#> 1 none 0.9839503 0.7529257 0.001399024 0.02368392
data.lda.i$results %>% slice_max(Accuracy)
#> parameter Accuracy Kappa AccuracySD KappaSD
#> 1 none 0.9943549 0.9030947 0.001075436 0.01957562
Unlike what I saw with the logistic regression models, there was a clear difference between the interaction and no-interaction models for LDA; the kappa value for the no-interaction model was noticeably lower. The interaction model seemed the better choice because of its stronger metrics, but I wanted to be sure, so I ran the threshold test on both models.
data.lda.thres <- threshold_test(data.lda)
data.lda.thres[2:9] %>% slice_max(Accuracy)
#> prob_threshold Accuracy Kappa Sensitivity Specificity Precision
#> 1 0.1 0.9846144 0.7468456 0.9926494 0.7413281 0.9914688
#> falseNeg falsePos
#> 1 0.007350638 0.2586719
data.lda.i.thres <- threshold_test(data.lda.i)
data.lda.i.thres[2:9] %>% slice_max(Accuracy)
#> prob_threshold Accuracy Kappa Sensitivity Specificity Precision
#> 1 0.7 0.9944814 0.9063693 0.9987259 0.8659562 0.9955878
#> falseNeg falsePos
#> 1 0.001274109 0.1340438
The threshold test upheld the earlier determination that the interaction model was more accurate, with higher accuracy and precision at all thresholds tested. The highest accuracy threshold for the no-interaction model was 0.1, while that of the interaction model was 0.7. With higher accuracy, kappa, and precision, and a much lower FNR, the interaction model at a threshold of 0.7 was clearly the better choice.
I used the plotROC() function to both plot the ROC curve and calculate the AUROC value of the model.
lda.selected <- data.lda.i.thres[2:9] %>% slice_max(Accuracy)
lda.selected <- plotROC(data.lda.i, lda.selected, "LDA")
The ROC curve for the selected LDA model is not as sharp toward the top-left corner as the logistic regression model's ROC curve. There was a sharp increase to a TPR of about 0.8, after which the growth drops off into a jagged incline toward a TPR of 1. The AUROC for this model was 0.9952404.
To create the QDA models, I used the train() function again, but set the method to “qda.” By using the same trainControl() object throughout all the model building, I was able to perform 10-fold cross-validation on the model to calculate the accuracy and kappa statistics, using the same folds every time.
set.seed(useed)
cluster <- makePSOCKcluster(numCores)
registerDoParallel(cluster)
data.qda <- train(ClassTarp~Red+Green+Blue, data=data,
method="qda",
trControl=control)
data.qda.i <- train(ClassTarp~Red+Green+Blue+Green:Blue, data=data,
method="qda",
trControl=control)
stopCluster(cluster)
data.qda$results %>% slice_max(Accuracy)
#> parameter Accuracy Kappa AccuracySD KappaSD
#> 1 none 0.9946237 0.9060371 0.001062324 0.0199123
data.qda.i$results %>% slice_max(Accuracy)
#> parameter Accuracy Kappa AccuracySD KappaSD
#> 1 none 0.9889945 0.8356276 0.001534962 0.02110115
As expected, the accuracies of the no-interaction and interaction models were very similar. But unlike the outcomes of the logistic regression and LDA comparisons, for QDA the no-interaction model was more accurate and had a higher kappa. From this, I was very doubtful that the added interaction could benefit the model at all, but I thought it was worth verifying through the threshold test.
data.qda.thres <- threshold_test(data.qda)
data.qda.thres[2:9] %>% slice_max(Accuracy)
#> prob_threshold Accuracy Kappa Sensitivity Specificity Precision
#> 1 0.6 0.9947344 0.9084762 0.9995753 0.8481369 0.9950092
#> falseNeg falsePos
#> 1 0.0004247112 0.1518631
data.qda.i.thres <- threshold_test(data.qda.i)
data.qda.i.thres[2:9] %>% slice_max(Accuracy)
#> prob_threshold Accuracy Kappa Sensitivity Specificity Precision
#> 1 0.1 0.9944814 0.9067588 0.9987259 0.8659464 0.9955875
#> falseNeg falsePos
#> 1 0.001274104 0.1340536
The highest accuracy threshold for the no-interaction model was 0.6, while that of the interaction model was 0.1. The no-interaction model was verified as the better choice at this point, with stronger values in almost every metric, especially the FNR. With a higher accuracy and kappa, and a much lower FNR, the no-interaction model at a threshold of 0.6 was the better choice.
I used the plotROC() function to both plot the ROC curve and calculate the AUROC value of the model.
qda.selected <- data.qda.thres[2:9] %>% slice_max(Accuracy)
qda.selected <- plotROC(data.qda, qda.selected, "QDA")
The ROC curve for the selected QDA model was quite good, with a similar point of TPR at which the sharp increase stops. It was similar to the others, but not as jagged in the corner as the LDA ROC curve. The AUROC for this model was 0.997053.
To create the KNN models, I used the train() function again, but made several changes to the settings. A list of k values (1, then every 5 from 5 to 50) was supplied as the tuneGrid argument of the train() function. The tuneGrid acts as the set of candidate tuning values: train() fits a model for each and selects the optimal k value from the ones given.
It was important to consider an additional argument to the train() function, preProcess. Adding "center" and "scale" to that argument would standardize the color values, and it is a common practice I saw online when using Caret to train a KNN model. However, because the three RGB predictors are already on the same 0-255 scale, I did not believe it was necessary to normalize the data again.
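For completeness, a sketch of what that alternative would have looked like; this call is not used in the analysis, and the tuning grid here is illustrative.
# Not used in this analysis: the same KNN fit with centering and scaling
# applied by caret inside each resampling fold via preProcess.
data.knn.scaled <- train(ClassTarp~Red+Green+Blue, data=data,
                         method="knn",
                         preProcess=c("center","scale"),
                         tuneGrid=data.frame(k=seq(5,50,5)),
                         metric="Accuracy",
                         trControl=control)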
cluster <- makePSOCKcluster(numCores)
registerDoParallel(cluster)
start.time <- Sys.time()
set.seed(useed)
klist <- data.frame(k=seq(0,50,5))
klist[1,] <- 1
data.knn <- train(ClassTarp~Red+Green+Blue, data=data,
tuneGrid = klist,
method="knn",
metric="Accuracy",
trControl=control)
data.knn.i <- train(ClassTarp~Red+Green+Blue+Green:Blue, data=data,
tuneGrid = klist,
method="knn",
metric="Accuracy",
trControl=control)
stopCluster(cluster)
data.knn$results %>% slice_max(Accuracy)
#> k Accuracy Kappa AccuracySD KappaSD
#> 1 10 0.9971853 0.954792 0.0005766019 0.009223093
data.knn.i$results %>% slice_max(Accuracy)
#> k Accuracy Kappa AccuracySD KappaSD
#> 1 1 0.9943391 0.9069368 0.001048404 0.01731508
plot(data.knn, main="Accuracy of KNN at different k")
For all levels of k, the no-interaction and interaction models had very similar outcomes, so I believed it best to work with the no-interaction model for simplicity.
The output of the train() function shows the optimal k value to be 10, based on accuracy. I was hesitant to use 10, because the accuracy remained above 0.99 for all k values tested, and I think a k that small could prove to be a problem for a larger dataset, introducing more variance.
So, I chose to go with k=20 to retain a similar accuracy but with more potential stability. Given the nature of the train() function, I needed to re-run it with 20 as the only available value for k to get the model object.
cluster <- makePSOCKcluster(numCores)
registerDoParallel(cluster)
data.knn20 <- train(ClassTarp~Red+Green+Blue, data=data,
tuneGrid = data.frame(k=seq(20,20,1)),
method="knn",
metric="Accuracy",
trControl=control)
stopCluster(cluster)
data.knn.thres <- threshold_test(data.knn)
data.knn20.thres <- threshold_test(data.knn20) # had to run up here to get the plot to work
p1 <- ggplot() +
geom_line(data = data.knn.thres,
aes(x = prob_threshold, y = Accuracy),
color = "blue", lwd=1) +
geom_line(data = data.knn20.thres,
aes(x = prob_threshold, y = Accuracy),
color="red", lwd=1) +
labs(x = "Threshold", y="Accuracy", title="Accuracy by Threshold")
p2 <- ggplot() +
geom_line(data = data.knn.thres,
aes(x = prob_threshold, y = falseNeg),
color = "blue", lwd=1) +
geom_line(data = data.knn20.thres,
aes(x = prob_threshold, y = falseNeg),
color="red", lwd=1) +
labs(x = "Threshold", y="FNR", title="FNR by Threshold")
# plot falsePos (1 - Specificity) so the panel matches its FPR label
p3 <- ggplot() +
geom_line(data = data.knn.thres,
aes(x = prob_threshold, y = falsePos),
color = "blue", lwd=1) +
geom_line(data = data.knn20.thres,
aes(x = prob_threshold, y = falsePos),
color="red", lwd=1) +
labs(x = "Threshold", y="FPR", title="FPR by Threshold")
p4 <- ggplot() +
geom_line(data = data.knn.thres,
aes(x = prob_threshold, y = Sensitivity),
color = "blue", lwd=1) +
geom_line(data = data.knn20.thres,
aes(x = prob_threshold, y = Sensitivity),
color="red", lwd=1) +
labs(x = "Threshold", y="TPR", title="TPR by Threshold")
grid.arrange(p1,p2,p3,p4, nrow=2)
When comparing the metrics of the k=10 (blue) and k=20 (red) models, they are very similar. The k=10 model is technically more accurate, but the FNR, FPR, and TPR are nearly the same. Given that similarity, I believe it is fine to sacrifice a small amount of accuracy to guard against potential variability in the future.
data.knn.thres <- threshold_test(data.knn20)
data.knn20.thres <- threshold_test(data.knn20)
data.knn.thres %>% slice_max(Accuracy)
#> k prob_threshold Accuracy Kappa Sensitivity Specificity Precision
#> 1 20 0.4 0.9970272 0.9516135 0.9987259 0.9455836 0.9982042
#> falseNeg falsePos
#> 1 0.001274131 0.05441643
data.knn20.thres %>% slice_max(Accuracy)
#> k prob_threshold Accuracy Kappa Sensitivity Specificity Precision
#> 1 20 0.4 0.9970272 0.9516135 0.9987259 0.9455836 0.9982042
#> falseNeg falsePos
#> 1 0.001274131 0.05441643
The highest accuracy threshold for the k=20 model was 0.4. This threshold produced a very high accuracy and precision, as well as a good FNR and FPR combination. The metrics for the k=20 model are not quite as good as those for k=10, but the differences are very minor. Because of the added stability, I believe the k=20 model at a threshold of 0.4 is the optimal model of the two.
I used the plotROC() function to both plot the ROC curve and calculate the AUROC value of the model.
knn.selected <- data.knn20.thres[1:9] %>% slice_max(Accuracy)
knn.selected <- plotROC(data.knn20, knn.selected, "KNN")
The ROC curve for the selected KNN model looks very good. The curve is better than the logistic regression model's ROC curve, and reaches a TPR of 1 very early in the plot. The AUROC for the KNN model was 0.9994968.
Following advice from Professor Gedeck, I decided to start with a ridge regression model and work forward from there. To create the ridge model, and the other elastic net models after it, I used the train() function again with the method set to "glmnet." A tune grid was also passed to the train() function containing a sequence of lambda values from 0 to 1 in steps of 0.1.
set.seed(useed)
cluster <- makePSOCKcluster(numCores)
registerDoParallel(cluster)
lambdaGrid <- expand.grid(alpha = 0, lambda = seq(0,1, 0.1))
data.ridge <- train(ClassTarp~Red+Green+Blue, data=data,
method="glmnet",
tuneGrid=lambdaGrid,
trControl=control)
data.ridge.i <- train(ClassTarp~Red+Green+Blue+Green:Blue, data=data,
method="glmnet",
tuneGrid=lambdaGrid,
trControl=control)
stopCluster(cluster)
data.ridge$results %>% slice_max(Accuracy)
#> alpha lambda Accuracy Kappa AccuracySD KappaSD
#> 1 0 0 0.9779099 0.4634828 0.0009550066 0.03435105
data.ridge.i$results %>% slice_max(Accuracy)
#> alpha lambda Accuracy Kappa AccuracySD KappaSD
#> 1 0 0 0.9784949 0.4844104 0.001033443 0.03644156
plot(data.ridge, main="Accuracy of PLR at different Lambda values")
The optimal lambda value selected for the ridge regression was 0. This makes sense, because each of the color values has proven very significant for predicting the response, so none of them should be shrunk. Every lambda tested above 0 had the same, lower accuracy and a kappa of 0, so if there were a lambda better than 0, it would have to lie between 0 and 0.1, which seemed unlikely. After settling on a lambda of 0, I wanted to determine the best value of alpha.
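Had I wanted to rule that possibility out, the same setup could be re-run on a finer lambda grid between 0 and 0.1; the sketch below is illustrative and was not part of the original analysis.
# Optional check (not run in the original analysis): a finer lambda grid
# between 0 and 0.1 to confirm that nothing beats lambda = 0.
fineGrid <- expand.grid(alpha = 0, lambda = seq(0, 0.1, 0.01))
data.ridge.fine <- train(ClassTarp~Red+Green+Blue, data=data,
                         method="glmnet",
                         tuneGrid=fineGrid,
                         trControl=control)
data.ridge.fine$results %>% slice_max(Accuracy)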
A second model was created, this time with a constant lambda of 0 and a sequence of alpha values from 0 to 1 in steps of 0.05 to run and compare.
set.seed(useed)
cluster <- makePSOCKcluster(numCores)
registerDoParallel(cluster)
alphaGrid <- expand.grid(alpha = seq(0,1, 0.05), lambda=0)
data.elastic <- train(ClassTarp~Red+Green+Blue, data=data,
method="glmnet",
tuneGrid=alphaGrid,
trControl=control)
stopCluster(cluster)
data.elastic$results %>% slice_max(Accuracy)
#> alpha lambda Accuracy Kappa AccuracySD KappaSD
#> 1 0.85 0 0.9952562 0.9197249 0.000974985 0.01736751
plot(data.elastic, main="Accuracy of PLR at different Alpha values")
With a lambda of 0, the most accurate model produced had an alpha of 0.85 (accuracy 0.9952562), which puts the model close to a lasso regression, but not quite there. The penalized logistic regression carried forward for thresholding and the final comparison used alpha=0.8 and lambda=0.
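The object data.elastic2 used below does not appear in the code shown, so I assume it is the elastic-net model re-fit with the selected tuning values held fixed; a minimal reconstruction under that assumption follows.
# Assumed retraining step (definition not shown in the original code):
# re-fit the elastic net with the selected alpha and lambda fixed so the
# saved cross-validated predictions correspond to a single tuning pair.
set.seed(useed)
cluster <- makePSOCKcluster(numCores)
registerDoParallel(cluster)
data.elastic2 <- train(ClassTarp~Red+Green+Blue, data=data,
                       method="glmnet",
                       tuneGrid=expand.grid(alpha=0.8, lambda=0),
                       trControl=control)
stopCluster(cluster)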
data.elastic.thres <- threshold_test(data.elastic2)
data.elastic.thres %>% slice_max(Accuracy)
#> alpha lambda prob_threshold Accuracy Kappa Sensitivity Specificity
#> 1 0.8 0 0.7 0.9956041 0.9270436 0.9985789 0.9055114
#> Precision falseNeg falsePos
#> 1 0.9968859 0.001421136 0.09448861
The most accurate threshold for classification ended up being 0.7. This level produced very high accuracy and precision, as well as low false negative and false positive rates, making it a very effective option.
I used the plotROC() function to both plot the ROC curve and calculate the AUROC value of the model.
plr.selected <- data.elastic.thres %>% slice_max(Accuracy)
plr.selected <- plotROC(data.elastic2, plr.selected, "Penalized Logistic Regression")
The ROC curve for the penalized logistic regression model was good and looked quite similar to the ROC curve of the logistic regression. It also rounds off at the corner, similar to the QDA curve. The AUROC for this model was 0.9985056.
Because of the processing time required to train a random forest model, I did not train an interaction model to compare against the no-interaction model. With this data there are only three predictors, which greatly limits the sensible values for mtry. The general rule of thumb is to use either the square root of the number of predictors or the number of predictors divided by three. With those two values coming out to roughly 2 and 1, I decided to test only which of those would be optimal.
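For reference, the arithmetic behind those two candidate values:
# Rule-of-thumb candidates for mtry with p = 3 predictors
p <- 3
sqrt(p)  # ~1.73, which rounds to 2
p / 3    # 1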
set.seed(useed)
cluster <- makePSOCKcluster(8) # not enough RAM to handle 14 workers for RF
registerDoParallel(cluster)
mtryGrid <- data.frame(mtry=c(1,2))
data.rdf <- train(ClassTarp~Red+Green+Blue, data=data,
tuneGrid = mtryGrid,
method="rf",
metric="Accuracy",
trControl=control)
stopCluster(cluster)
data.rdf$results %>% slice_max(Accuracy)
#> mtry Accuracy Kappa AccuracySD KappaSD
#> 1 1 0.9971379 0.953283 0.0007359755 0.01223592
The value of mtry that came out with the highest accuracy was 1.
data.rdf.thres <- threshold_test(data.rdf)
data.rdf.thres[2:9] %>% slice_max(Accuracy)
#> prob_threshold Accuracy Kappa Sensitivity Specificity Precision
#> 1 0.5 0.9971537 0.9535596 0.9988239 0.9465639 0.9982372
#> falseNeg falsePos
#> 1 0.001176116 0.05343608
p1 <- ggplot() +
geom_line(data = data.rdf.thres,
aes(x = prob_threshold, y = Accuracy), lwd=1) +
labs(x = "Threshold", y="Accuracy", title="Random Forest: Accuracy by Threshold")
p2 <- ggplot() +
geom_line(data = data.rdf.thres,
aes(x = prob_threshold, y = falseNeg), lwd=1) +
labs(x = "Threshold", y="FNR", title="Random Forest: FNR by Threshold")
grid.arrange(p1,p2,nrow=1)
The model saw its highest accuracy at a threshold of 0.5. The FNR at that threshold is still very good (<0.002), so I believed this to be the best value to go with.
The model was re-trained with the optimal mtry value to get the correct number of predictions.
cluster <- makePSOCKcluster(numCores)
registerDoParallel(cluster)
set.seed(useed)
# re-running to get the right number of predictions
data.rdf1 <- train(ClassTarp~Red+Green+Blue, data=data,
tuneGrid = data.frame(mtry=1),
method="rf",
metric="Accuracy",
trControl=control)
stopCluster(cluster)
I used the plotROC() function to both plot the ROC curve and calculate the AUROC value of the model.
rdf.selected <- data.rdf.thres %>% slice_max(Accuracy)
rdf.selected <- plotROC(data.rdf1, rdf.selected, "Random Forest")
The ROC curve for the random forest model was very good. There is no rounding at the top left corner, but it was worth noting that the TPR did not reach exactly 1.0 as quickly as the other models. The AUROC for this model was 0.9942777.
For the SVM models, instead of comparing a no-interaction to an interaction model, I compared the accuracies of the different kernel functions: linear, radial, and polynomial. Because the RGB values are already on a common 0-255 scale, I did not see it necessary to pre-process the data in the train() function.
Each kernel takes slightly different tuning parameters. The linear kernel uses cost (C), which determines how much misclassification will be tolerated; cost acts as the penalty weight on the soft margin.
The polynomial kernel adds degree as a tuning parameter, which determines the degree of the polynomial used for the decision boundary. A degree of 1 yields a linear separation, while higher degrees allow more flexibility in the decision boundary.
The radial basis kernel takes cost as a tuning parameter, but also takes sigma, which controls the reach of a single training instance's influence and therefore how flexible the decision boundary can be.
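To make the parameterization concrete, these are the kinds of tuning grids each kernel accepts in Caret; the specific values below are illustrative, not the ones used in the models that follow.
# Illustrative tuning grids for each kernel (values are examples only)
linGrid  <- expand.grid(C = c(0.25, 0.5, 1))                        # svmLinear: cost
polyGrid <- expand.grid(degree = 1:3, scale = 0.1, C = c(0.5, 1))   # svmPoly: degree, scale, cost
radGrid  <- expand.grid(sigma = c(1, 5, 10), C = c(0.5, 1, 2))      # svmRadial: sigma, cost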
To create the SVM models, I used the train() function but this time did not make changes to the settings. I let each kernel run as default to compare what was produced.
cluster <- makePSOCKcluster(numCores)
registerDoParallel(cluster)
set.seed(useed)
# https://rpubs.com/uky994/593668
data.svm.linear <- train(ClassTarp~Red+Green+Blue, data=data,
method="svmLinear",
metric="Accuracy",
trControl=control)
data.svm.poly <- train(ClassTarp~Red+Green+Blue, data=data,
method="svmPoly",
metric="Accuracy",
trControl=control)
data.svm.radial <- train(ClassTarp~Red+Green+Blue, data=data,
method="svmRadial",
metric="Accuracy",
trControl=control)
stopCluster(cluster)
data.svm.linear$results %>% slice_max(Accuracy)
#> C Accuracy Kappa AccuracySD KappaSD
#> 1 1 0.9953985 0.9224624 0.000979053 0.01726858
data.svm.poly$results %>% slice_max(Accuracy)
#> degree scale C Accuracy Kappa AccuracySD KappaSD
#> 1 3 0.1 1 0.996031 0.9354796 0.0007206174 0.01152732
data.svm.radial$results %>% slice_max(Accuracy)
#> sigma C Accuracy Kappa AccuracySD KappaSD
#> 1 8.183301 1 0.9969956 0.9508022 0.0006873597 0.0112738
With the default tuning settings for SVM in Caret, the outputs showed that using the radial basis function kernel was the most effective. The model chose sigma=8.183301 and C=1 for the tuning parameters. Not only was the radial function more accurate, but it was also more time efficient when compared to the polynomial function.
Although the radial model was very promising, I was aware that it only tested three values for C: 0.25, 0.5, and 1. The model chose 1 for C and was trained with a constant value for sigma of 8.183301. I wanted to explore those tuning parameters further, so I did some sensitivity testing around the given values.
cluster <- makePSOCKcluster(numCores)
registerDoParallel(cluster)
set.seed(useed)
rad.grid <- expand.grid(C=c(0.75, 1, 1.5, 2), sigma=c(7.5, 8, 8.18, 9, 9.5))
data.svm.radial2 <- train(ClassTarp~Red+Green+Blue, data=data,
method="svmRadial",
metric="Accuracy",
trControl=control,
tuneGrid=rad.grid)
stopCluster(cluster)
data.svm.radial2$results %>% slice_max(Accuracy)
#> C sigma Accuracy Kappa AccuracySD KappaSD
#> 1 2 9.5 0.9970746 0.9521057 0.0006337175 0.01041054
After manually tuning the radial model, I found that a cost of 2 with sigma of either 9 or 9.5 was the optimal combination for accuracy.
data.svm.radial.thres <- threshold_test(data.svm.radial)
data.svm.radial.thres[1:9] %>% slice_max(Accuracy)
#> sigma C prob_threshold Accuracy Kappa Sensitivity Specificity
#> 1 8.183301 1 0.7 0.9970588 0.9521256 0.9987422 0.9460811
#> Precision falseNeg
#> 1 0.9982206 0.001257791
data.svm.radial2.thres <- threshold_test(data.svm.radial2)
data.svm.radial2.thres[1:9] %>% slice_max(Accuracy)
#> sigma C prob_threshold Accuracy Kappa Sensitivity Specificity Precision
#> 1 9.5 2 0.8 0.9971379 0.953725 0.9985625 0.9539921 0.9984811
#> falseNeg
#> 1 0.001437468
There was little difference between the performances of the auto-tuned and manually tuned models, although the manually tuned model performed better in most metrics, including accuracy, kappa, and precision; its FNR was marginally higher. The best threshold was 0.7 for the auto-tuned model and 0.8 for the manually tuned one.
The model was retrained with the optimal tuning parameters to get the correct number of predictions.
cluster <- makePSOCKcluster(numCores)
registerDoParallel(cluster)
# re-running to get the right number of predictions
data.svm.radial3 <- train(ClassTarp~Red+Green+Blue, data=data,
method="svmRadial",
metric="Accuracy",
trControl=control,
tuneGrid=data.frame(sigma=9, C=2))
stopCluster(cluster)
I used the plotROC() function to both plot the ROC curve and calculate the AUROC value of the model.
data.svm.radial3.thres <- threshold_test(data.svm.radial3)
svm.selected <- data.svm.radial3.thres %>% slice_max(Accuracy)
svm.selected <- plotROC(data.svm.radial3, svm.selected, "Support Vector Machine")
Very similar to the ROC curve of the random forest model, the ROC curve for the SVM model had no rounded corner, and was not quick to reach a TPR of exactly 1.0. The AUROC of this model was 0.9952222.
# add model name to stats
log.selected <- log.selected %>% mutate(Model="Log Reg")
lda.selected <- lda.selected %>% mutate(Model="LDA")
qda.selected <- qda.selected %>% mutate(Model="QDA")
knn.selected <- knn.selected %>% mutate(Model="KNN")
plr.selected <- plr.selected %>% mutate(Model="PLR")
rdf.selected <- rdf.selected %>% mutate(Model="RF")
svm.selected <- svm.selected %>% mutate(Model="SVM")
Note: Tuning2 refers to lambda for PLR and sigma for SVM.
| Model | Tuning | Tuning2 | AUROC | Threshold | Accuracy | TPR | FPR | Precision |
|---|---|---|---|---|---|---|---|---|
| PLR | 0.8 | 0 | 0.9985 | 0.7 | 0.9956 | 0.9986 | 0.0945 | 0.9969 |
| SVM | 2 | 9 | 0.995 | 0.6 | 0.9972 | 0.9989 | 0.0554 | 0.9982 |
| QDA | | | 0.9971 | 0.6 | 0.9947 | 0.9996 | 0.1519 | 0.995 |
| RF | 1 | | 0.9943 | 0.5 | 0.9972 | 0.9988 | 0.0534 | 0.9982 |
| LDA | | | 0.9952 | 0.7 | 0.9945 | 0.9987 | 0.134 | 0.9956 |
| KNN | 20 | | 0.9995 | 0.4 | 0.997 | 0.9987 | 0.0544 | 0.9982 |
| Log Reg | | | 0.9985 | 0.7 | 0.9957 | 0.9984 | 0.0891 | 0.9971 |
Each of the seven classification methods tested in this analysis proved to be very effective options for classifying blue tarp pixels based on RGB values, when training through 10-fold cross-validation. Two models performed very similarly, the logistic regression and penalized logistic regression model. This makes sense because the optimal lambda value was zero, so the predictors were not minimized to any extent. The only statistics where the two models differed were the TPR and FPR, where the penalized logistic regression model had slightly higher values. Overall, each model type was very effective, but the question was, which one was the optimal solution?
When looking only at the table, the best option appeared to be Random Forest. The RF model had one of the highest accuracies and the lowest FPR. The metrics for the RF model were very promising, but it needed to be noted that these were all calculated without the use of a test dataset. To confirm RF as the best model, or to crown a different method as optimal, the models needed to be tested against the hold-out data.
The hold-out data provided was not in the same format as the training data. It was given as several text files with inconsistent naming conventions, and the data in the files were not quite in CSV format: each file started with several lines of metadata from when the values were exported from ENVI. This introduced a small problem for importing the data, but I was able to navigate it fairly easily.
#setwd("C:\\Users\\cschr\\Documents\\GitHub\\haiti-displacement-classification")
setwd("./data")
files <- list.files(pattern = "*.txt", full.names=TRUE)
hold.out <- data.frame(matrix(ncol=10, nrow=0))
colnames(hold.out) <- c("X","Y","MapX","MapY","Lat","Lon","B1","B2","B3","ClassTarp")
for (file in files) {
lines <- readLines(file)[-c(1:8)]
csv <- read.csv(textConnection(lines), header=FALSE, sep="")[-1]
names(csv) <- c("X","Y","MapX","MapY","Lat","Lon","B1","B2","B3")
csv$ClassTarp <- ifelse(grepl("NON|NOT", file), "No", "Yes")
hold.out <- rbind(hold.out, csv)
}
hold.out$ClassTarp <- factor(hold.out$ClassTarp, levels = c("No", "Yes"))
head(hold.out)
#> X Y MapX MapY Lat Lon B1 B2 B3 ClassTarp
#> 1 1745 1128 769849.6 2049927 18.52267 -72.44399 104 89 63 No
#> 2 1745 1129 769849.6 2049927 18.52267 -72.44399 101 80 60 No
#> 3 1746 1129 769849.6 2049927 18.52267 -72.44399 103 87 69 No
#> 4 1747 1129 769849.7 2049927 18.52267 -72.44399 107 93 72 No
#> 5 1748 1129 769849.8 2049927 18.52267 -72.44399 109 99 68 No
#> 6 1744 1129 769849.5 2049927 18.52267 -72.44399 103 73 53 No
The true problem with the hold-out data was the column names: the files labeled the color values as band numbers (B1, B2, B3) rather than Red, Green, and Blue.
allTarps <- subset(hold.out, ClassTarp=="Yes")
(c(mean(allTarps$B1),mean(allTarps$B2),mean(allTarps$B3)))
#> [1] 111.6773 133.3961 165.2613
dataTarps <- subset(data, ClassTarp=="Yes")
(c(mean(dataTarps$Red),mean(dataTarps$Green),mean(dataTarps$Blue)))
#> [1] 169.6627 186.4149 205.0371
noTarps <- subset(hold.out, ClassTarp=="No")
(c(mean(noTarps$B1),mean(noTarps$B2),mean(noTarps$B3)))
#> [1] 118.3081 105.1734 81.7588
all <- rbind(allTarps,noTarps)[c(7,8,9,10)]
all %>%
dplyr::select(-ClassTarp) %>%
ggpairs(upper=list(continuous=wrap("cor", size=3)),
mapping=ggplot2::aes(color=all$ClassTarp, alpha=0.7))
The bottom-right cell of the correlation plot makes it clear that band 3 represents Blue. And after comparing the mean values of each band in the hold-out data to the mean values of each color in the training data, it was evident that the bands were in RGB order.
hold.out <- hold.out %>% dplyr::rename(Red = B1, Green = B2, Blue = B3)
I was also curious about the locations of the blue tarp pixels.
mapboxToken <- paste(readLines("./.mapbox_token"), collapse="")
Sys.setenv("MAPBOX_TOKEN" = mapboxToken)
fig <- allTarps %>% plot_mapbox(lat = ~Lat, lon = ~Lon, mode = 'scattermapbox')
fig <- fig %>% config(mapboxAccessToken = Sys.getenv("MAPBOX_TOKEN")) %>%
layout(
mapbox = list(
zoom = 13.5,
center = list(lon=-72.406, lat=18.5208),
style = 'open-street-map'
),
width=600, height=400
)
fig
A large group of blue tarp pixels was spread around the Fraide River and between two medical centers in Carrefour. The distance to water and to a medical center may have played a role in where these displaced people set up camp, or perhaps they set up camp close to where their homes were. Carrefour was one of the main areas devastated by the earthquake, with approximately 40-50% of its buildings damaged or destroyed (https://escweb.wr.usgs.gov/share/mooney/142.pdf), and it would make sense that people would not have the energy to travel far after the event.
After getting the hold-out data in order, I was ready to test the trained models against the test set. I did this using the predictionStats() function shown in the Functions section. The function used each model to predict the hold-out data, and returned the updated measurements of accuracy, TPR, FPR, AUROC, and precision.
log.pred <- predictionStats(data.log, log.selected)
lda.pred <- predictionStats(data.lda.i, lda.selected)
qda.pred <- predictionStats(data.qda, qda.selected)
knn.pred <- predictionStats(data.knn20, knn.selected)
plr.pred <- predictionStats(data.elastic2, plr.selected)
rdf.pred <- predictionStats(data.rdf1, rdf.selected)
svm.pred <- predictionStats(data.svm.radial3, svm.selected)
Note: Tuning2 refers to lambda for PLR and sigma for SVM.
| Model | Tuning | Tuning2 | AUROC | Threshold | Accuracy | TPR | FPR | Precision |
|---|---|---|---|---|---|---|---|---|
| Log Reg | | | 0.9994 | 0.7 | 0.9947 | 0.9948 | 0.0171 | 0.9999 |
| LDA | | | 0.9977 | 0.7 | 0.9976 | 0.9983 | 0.1033 | 0.9992 |
| QDA | | | 0.9598 | 0.6 | 0.9821 | 0.9834 | 0.203 | 0.9985 |
| KNN | 20 | | 0.9546 | 0.4 | 0.9911 | 0.9919 | 0.1136 | 0.9992 |
| PLR | 0.8 | 0 | 0.9995 | 0.7 | 0.9958 | 0.996 | 0.0194 | 0.9999 |
| SVM | 2 | 9 | 0.9561 | 0.6 | 0.9897 | 0.9942 | 0.6221 | 0.9955 |
| RF | 1 | | 0.9802 | 0.5 | 0.9934 | 0.995 | 0.2152 | 0.9984 |
Of the seven models trained through cross-validation, I did not see a clear winner. That is not to say the data is not suited to any particular type of model, but it is worth noting that the data was least suited to discriminant analysis: in the cross-validation results, the QDA and LDA models had the highest FPRs and the lowest precisions and accuracies. Their high FPRs were accompanied by some of the highest TPRs, which points toward both models over-classifying blue tarp pixels.
If I focused solely on accuracy and FNR (1-TPR), the best models from this analysis were QDA and KNN. The QDA model had the second lowest accuracy, but the best performance in minimizing the FNR, which I believed was a very important value to minimize in these circumstances. All seven model types had AUROC, accuracy, and TPR values of at least 0.99, showing that each model would be effective.
The KNN model, with k=20 as the tuning parameter and 0.4 as the threshold, had the highest AUROC and some of the highest accuracy and precision values. It also had the second lowest FPR, which would be very beneficial in reducing time wasted trying to reach areas no one was actually in. It is worth noting that the KNN model's metrics might have been even higher had I stuck with the chosen k=10 model, instead of increasing k to 20 to allow for more stability.
The random forest model performed similarly to the KNN model in cross-validation, with nearly identical accuracy, TPR, and precision, and it had the lowest FPR of all seven models.
In a situation where people need to be located accurately and quickly, I believe it is also pertinent to make sure no one is left behind, so when comparing these models I placed a lot of importance on the FNR (1-TPR). From that metric alone, the best model would be QDA. However, time and resources need to be considered in these situations, and the FPR of the QDA model could result in a lot of resources wasted on unnecessary searches. To keep a good balance between FNR and FPR, I believe the best model from cross-validation would be the random forest, because of its very low FNR and low FPR values. Even so, my confidence in this decision is not high, because these statistics are very similar and were generated without a test dataset.
It was clear that the performances of most models differed noticeably between cross-validation and the hold-out test. Judging these performances through the same lens, the optimal model for locating the most displaced individuals within a reasonable amount of time and use of resources depends on the FNR and FPR values. From that perspective there were two clear contenders: logistic regression and penalized logistic regression.
The logistic regression model had the second highest AUROC at 0.9994 and the lowest FPR at 0.0171. Such a low FPR would greatly improve the proper allocation of resources. The logistic regression model also had the highest precision seen in both the cross-validation and hold-out tests at 0.9999. These values were very impressive, but also quite similar to those of another model: the penalized logistic regression model had the highest AUROC at 0.9995, the second lowest FPR at 0.0194, and an essentially identical precision of 0.9999.
Although the metrics of these models are very similar, I believed the penalized logistic regression model to be the optimal solution. Its increase in FPR was still quite small (especially relative to the other models) and seemed a sacrifice worth making for the increase in TPR and the corresponding decrease in FNR.
Even though the optimal models found from cross-validation and hold-out testing do not match, most results for each performance measure could be seen as compatible. It is difficult to distinguish a strong difference between values when everything begins with 0.98 or 0.99, and many of the values could easily be considered identical.
Where the results are not compatible is in the relative rankings. There were noticeable increases and decreases in rank for several models. For TPR, QDA was the highest in cross-validation but fell to the lowest rank in the hold-out test. Another noticeable change was in precision, where SVM dropped from 3rd to last and random forest from 1st to 6th. Even with these drastic changes, other models stayed relatively close to their starting rank for most metrics, and it is because of this that I believe the results are reconcilable.
The discrepancies in model performance could be attributed to differences between the training and hold-out color distributions. Hue and brightness play a strong role in color values, and the two sets of imagery may have had different levels of both. The training data had dimmer blues, while the hold-out data showed a stronger distinction between tarp and non-tarp pixels based solely on the blue value. Training on dimmer blue values could explain the increase in false positives for most models in the hold-out test, where non-tarp pixels that resembled the dimmer tarp pixels of the training data were wrongly classified. Perhaps the training data could be normalized to decrease the effects of brightness.
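One simple version of that idea, sketched below and not applied in this analysis, would be to convert each pixel's RGB values to chromaticity (each channel divided by the pixel's total intensity) so that overall brightness has less influence on the predictors.
# Illustrative brightness normalization (not applied in this analysis):
# convert RGB to chromaticity so each pixel's channels sum to 1.
normalize_brightness <- function(df) {
  total <- df$Red + df$Green + df$Blue
  total[total == 0] <- 1  # guard against all-black pixels
  df %>% mutate(Red = Red/total, Green = Green/total, Blue = Blue/total)
}
data.norm <- normalize_brightness(data)
hold.out.norm <- normalize_brightness(hold.out)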
Through this analysis, I believed the most important metrics to consider for all models were FNR and FPR. Balancing both of those values would lead to the optimal solution: the most people located in the fastest time. It was very important to reduce the FNR to make sure no displaced people were being left behind, but it was also important to recognize the cost of searching for and providing aid across a devastated region. Decreasing FNR without regard for FPR would not be useful, because the most people would be located but a large amount of resources could be wasted trying to reach areas without people present. A third important metric considered was precision, the proportion of predicted blue tarp pixels that truly are blue tarps; higher precision implies fewer resources wasted following up on false alarms.
I think this method can be effective for classifying blue tarp pixels in imagery, but I do not believe it to be the optimal solution in a real-world application. In real-world context, there are a lot of additional factors that would need to be considered before any lives could be saved. Factors like road access and closures, location of displaced people relative to airports, and available funds/resources have a huge impact on the execution and effectiveness of relief efforts. For example, more points of access to aid would need to be set up in an area with many road closures than an area of the same size and density of people with open access to roads.
I think these models can greatly benefit relief efforts and influence logistic decisions by identifying highly probable locations of displaced people, but I also recognize there is a lot more work to be done before those people can be helped.
In a country like Haiti where about 60% of the population has cell phones (https://www.statista.com/statistics/502111/mobile-cellular-subscriptions-per-100-inhabitants-in-haiti/), pinging the locations of those devices might be a faster and more accurate approach to locating displaced people.
On a larger scale of imagery, I believe the accuracy of the models could be improved with some geospatial influence. Predicting the location of people strictly from color leaves a lot of room to be misled, for example by a shed on a farm with a blue roof, or a water feature reflecting enough light to appear blue.
If the placement of blue tarps followed a general rule, where most displaced people congregated together, then additional tools like kernel density estimation could be used to add extra confidence to the predictions. Adding this spatial context would allow the model to rule out "blue tarp" pixels located in very sparsely populated areas that are unlikely to be hosting displaced individuals. The results could be provided as heatmaps instead of direct point locations, which would be more useful to search parties: rather than a strict location, they would have a general area in which displaced people are likely to be found, even if they move. It is also likely that displaced people did not travel very far from their original locations, so other factors like population density or the number of damaged buildings in an area could also inform the predictions.
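As a rough sketch of that idea, the locations flagged as blue tarps could be passed through a 2D kernel density estimate and rendered as a heatmap; the snippet below uses the labeled tarp pixels in the hold-out data as a stand-in for model predictions and is purely illustrative.
# Illustrative only: a 2D kernel density estimate over tarp locations,
# rendered as a heatmap to highlight likely encampment areas.
library(MASS)  # for kde2d (note: masks dplyr::select, which is called explicitly above)
tarp.points <- subset(hold.out, ClassTarp == "Yes")
dens <- kde2d(tarp.points$Lon, tarp.points$Lat, n = 200)
image(dens, xlab = "Longitude", ylab = "Latitude",
      main = "Density of blue tarp pixels")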