Do race,mental illness, threat level, and gender predict whether an individual was fleeing during a fatal police shooting?
The data set I am using is titled “fatal_police_shootings,” which comes from the https://www.openintro.org/data/index.php?data=fatal_police_shootings This data set is a subset of the Washington Post data base. This data set contains records of every fatal police shooting by an on-duty officer since January 1,2015. This data set has 6421 observations and 12 variables.
For this project I focused on specific 5 columns from the data set.
flee race gender signs_of_mental_illness threat_level
The Flee columns shows whether the individual was fleeing or not and it is a categorical variable. Race is showing the race of the individual and it is a categorical variable. Gender is showing the gender of the individual and it is a categorical variable. Signs_of_mental_illness is showing whether the individual showed signs of mental illness or not and it is a categorical variable. Threat_level is showing level of threat reported at the time of shooting and it is a categorical variable.
Washington Post
Before building my model, I first cleaned and explored the data set to make sure I was working with reliable information. I checked for missing values and found that some of the variables I needed (flee, race , and gender) has NA values, so I replaced them using the most common value (mode) to keep the data consistent. I did not impute the race but I removed the NAs from that variable. Since the outcome variable (flee) had four categories, I converted it into a binary variable to fit a logistic regression model. I also renamed the race column using clearer category names and converted all predictors into factors so they would work correctly in the model. After cleaning, I created a smaller data set with only the variables needed for my analysis. I used several dplyr functions, such as mutate, case_when, and select, to prepare the data. For quality control, I checked that each category had enough observations in both fleeing and non-fleeing groups. Once the data was cleaned and organized, I was ready to run the logistic regression and evaluate how well the predictors explained fleeing behavior.
For my analysis, I used a logistic regression model to see whether race,gender, signs of mental illness, and threat level could predict if a person was fleeing during a fatal police shooting. The original flee variable had four categories, so I first converted it into a binary outcome called flee_binary (0= not fleeing, 1= fleeing) to make it suitable for logistic regression. After cleaning the data set and preparing the predictors, I fit the final model using the selected variables. The model summary showed that race and signs of mental illness were significant predictors of fleeing, while gender and threat level were not. However, the overall predictive power of the model was low. This is shown by the pseudo R² value of 0.043, meaning the model explains only about 4.3% of the variation in fleeing behavior. In addition to that, the model struggled to correctly identify fleeing cases due to the imbalanced data set. Even though the model found some meaningful relationships, it is not very strong at predicting who was actually fleeing.
library(tidyverse)
fatal_shooting <- read_csv("fatal_police_shootings.csv")
str(fatal_shooting)
## spc_tbl_ [6,421 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ date : Date[1:6421], format: "2015-01-02" "2015-01-02" ...
## $ manner_of_death : chr [1:6421] "shot" "shot" "shot and Tasered" "shot" ...
## $ armed : chr [1:6421] "gun" "gun" "unarmed" "toy weapon" ...
## $ age : num [1:6421] 53 47 23 32 39 18 22 35 34 47 ...
## $ gender : chr [1:6421] "M" "M" "M" "M" ...
## $ race : chr [1:6421] "A" "W" "H" "W" ...
## $ city : chr [1:6421] "Shelton" "Aloha" "Wichita" "San Francisco" ...
## $ state : chr [1:6421] "WA" "OR" "KS" "CA" ...
## $ signs_of_mental_illness: logi [1:6421] TRUE FALSE FALSE TRUE FALSE FALSE ...
## $ threat_level : chr [1:6421] "attack" "attack" "other" "attack" ...
## $ flee : chr [1:6421] "Not fleeing" "Not fleeing" "Not fleeing" "Not fleeing" ...
## $ body_camera : logi [1:6421] FALSE FALSE FALSE FALSE FALSE FALSE ...
## - attr(*, "spec")=
## .. cols(
## .. date = col_date(format = ""),
## .. manner_of_death = col_character(),
## .. armed = col_character(),
## .. age = col_double(),
## .. gender = col_character(),
## .. race = col_character(),
## .. city = col_character(),
## .. state = col_character(),
## .. signs_of_mental_illness = col_logical(),
## .. threat_level = col_character(),
## .. flee = col_character(),
## .. body_camera = col_logical()
## .. )
## - attr(*, "problems")=<externalptr>
head(fatal_shooting)
## # A tibble: 6 × 12
## date manner_of_death armed age gender race city state
## <date> <chr> <chr> <dbl> <chr> <chr> <chr> <chr>
## 1 2015-01-02 shot gun 53 M A Shelton WA
## 2 2015-01-02 shot gun 47 M W Aloha OR
## 3 2015-01-03 shot and Tasered unarmed 23 M H Wichita KS
## 4 2015-01-04 shot toy weapon 32 M W San Francisco CA
## 5 2015-01-04 shot nail gun 39 M H Evans CO
## 6 2015-01-04 shot gun 18 M W Guthrie OK
## # ℹ 4 more variables: signs_of_mental_illness <lgl>, threat_level <chr>,
## # flee <chr>, body_camera <lgl>
colSums(is.na(fatal_shooting))
## date manner_of_death armed
## 0 0 208
## age gender race
## 285 1 666
## city state signs_of_mental_illness
## 0 0 0
## threat_level flee body_camera
## 0 434 0
Mode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x,ux)))]
}
fatal_shooting$flee[is.na(fatal_shooting$flee)] <- Mode(fatal_shooting$flee)
fatal_shooting$gender[is.na(fatal_shooting$gender)] <- Mode(fatal_shooting$gender)
fatal_shooting <- fatal_shooting |>
filter(!is.na(race))
I did not impute the missing race values because it can actually make the data more biased. Since race is already sensitive and unevenly represented, filling in NAs with the mode could exaggerate one group and give misleading results. So I kept the race values as they were to avoid adding extra bias.
colSums(is.na(fatal_shooting[c("flee","race","gender")]))
## flee race gender
## 0 0 0
fatal_shooting <- fatal_shooting |>
mutate(flee_binary = ifelse(flee == "Not fleeing",0,1) )
fatal_shooting
## # A tibble: 5,755 × 13
## date manner_of_death armed age gender race city state
## <date> <chr> <chr> <dbl> <chr> <chr> <chr> <chr>
## 1 2015-01-02 shot gun 53 M A Shelton WA
## 2 2015-01-02 shot gun 47 M W Aloha OR
## 3 2015-01-03 shot and Tasered unarmed 23 M H Wichita KS
## 4 2015-01-04 shot toy weapon 32 M W San Francisco CA
## 5 2015-01-04 shot nail gun 39 M H Evans CO
## 6 2015-01-04 shot gun 18 M W Guthrie OK
## 7 2015-01-05 shot gun 22 M H Chandler AZ
## 8 2015-01-06 shot gun 35 M W Assaria KS
## 9 2015-01-06 shot unarmed 34 F W Burlington IA
## 10 2015-01-06 shot toy weapon 47 M B Knoxville PA
## # ℹ 5,745 more rows
## # ℹ 5 more variables: signs_of_mental_illness <lgl>, threat_level <chr>,
## # flee <chr>, body_camera <lgl>, flee_binary <dbl>
fatal_shooting <- fatal_shooting |>
mutate(race = case_when(
race == "W" ~ "White non-Hispanic",
race == "B" ~ "Black non-Hispanic",
race == "A" ~ "Asian",
race == "N" ~ "Native American",
race == "H" ~ "Hispanic",
race == "O" ~ "Other"
))
fatal_shooting
## # A tibble: 5,755 × 13
## date manner_of_death armed age gender race city state
## <date> <chr> <chr> <dbl> <chr> <chr> <chr> <chr>
## 1 2015-01-02 shot gun 53 M Asian Shel… WA
## 2 2015-01-02 shot gun 47 M White non-Hi… Aloha OR
## 3 2015-01-03 shot and Tasered unarmed 23 M Hispanic Wich… KS
## 4 2015-01-04 shot toy weapon 32 M White non-Hi… San … CA
## 5 2015-01-04 shot nail gun 39 M Hispanic Evans CO
## 6 2015-01-04 shot gun 18 M White non-Hi… Guth… OK
## 7 2015-01-05 shot gun 22 M Hispanic Chan… AZ
## 8 2015-01-06 shot gun 35 M White non-Hi… Assa… KS
## 9 2015-01-06 shot unarmed 34 F White non-Hi… Burl… IA
## 10 2015-01-06 shot toy weapon 47 M Black non-Hi… Knox… PA
## # ℹ 5,745 more rows
## # ℹ 5 more variables: signs_of_mental_illness <lgl>, threat_level <chr>,
## # flee <chr>, body_camera <lgl>, flee_binary <dbl>
fatal_shooting <- fatal_shooting |>
mutate(race = as.factor(race),
gender = as.factor(gender),
signs_of_mental_illness = as.factor(signs_of_mental_illness),
threat_level = as.factor(threat_level))
str(fatal_shooting)
## tibble [5,755 × 13] (S3: tbl_df/tbl/data.frame)
## $ date : Date[1:5755], format: "2015-01-02" "2015-01-02" ...
## $ manner_of_death : chr [1:5755] "shot" "shot" "shot and Tasered" "shot" ...
## $ armed : chr [1:5755] "gun" "gun" "unarmed" "toy weapon" ...
## $ age : num [1:5755] 53 47 23 32 39 18 22 35 34 47 ...
## $ gender : Factor w/ 2 levels "F","M": 2 2 2 2 2 2 2 2 1 2 ...
## $ race : Factor w/ 6 levels "Asian","Black non-Hispanic",..: 1 6 3 6 3 6 3 6 6 2 ...
## $ city : chr [1:5755] "Shelton" "Aloha" "Wichita" "San Francisco" ...
## $ state : chr [1:5755] "WA" "OR" "KS" "CA" ...
## $ signs_of_mental_illness: Factor w/ 2 levels "FALSE","TRUE": 2 1 1 2 1 1 1 1 1 1 ...
## $ threat_level : Factor w/ 3 levels "attack","other",..: 1 1 2 1 1 1 1 1 2 1 ...
## $ flee : chr [1:5755] "Not fleeing" "Not fleeing" "Not fleeing" "Not fleeing" ...
## $ body_camera : logi [1:5755] FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ flee_binary : num [1:5755] 0 0 0 0 0 0 1 0 0 0 ...
fatal_clean <- fatal_shooting |>
select(flee_binary,race,gender,signs_of_mental_illness,threat_level)
fatal_clean
## # A tibble: 5,755 × 5
## flee_binary race gender signs_of_mental_illness threat_level
## <dbl> <fct> <fct> <fct> <fct>
## 1 0 Asian M TRUE attack
## 2 0 White non-Hispanic M FALSE attack
## 3 0 Hispanic M FALSE other
## 4 0 White non-Hispanic M TRUE attack
## 5 0 Hispanic M FALSE attack
## 6 0 White non-Hispanic M FALSE attack
## 7 1 Hispanic M FALSE attack
## 8 0 White non-Hispanic M FALSE attack
## 9 0 White non-Hispanic F FALSE other
## 10 0 Black non-Hispanic M FALSE attack
## # ℹ 5,745 more rows
xtabs(~flee_binary +race,data = fatal_clean)
## race
## flee_binary Asian Black non-Hispanic Hispanic Native American Other
## 0 85 925 691 60 30
## 1 20 603 375 30 17
## race
## flee_binary White non-Hispanic
## 0 2063
## 1 856
xtabs(~flee_binary +gender,data = fatal_clean)
## gender
## flee_binary F M
## 0 198 3656
## 1 70 1831
xtabs(~flee_binary +signs_of_mental_illness,data = fatal_clean)
## signs_of_mental_illness
## flee_binary FALSE TRUE
## 0 2718 1136
## 1 1697 204
xtabs(~flee_binary +threat_level,data = fatal_clean)
## threat_level
## flee_binary attack other undetermined
## 0 2485 1261 108
## 1 1235 615 51
Each category has more more than 5 in both groups, so these variables are safe to use in modeling, no need to drop or merge any levels.
model <- glm(flee_binary ~ race + gender + signs_of_mental_illness + threat_level,data = fatal_clean,family = "binomial")
summary(model)
##
## Call:
## glm(formula = flee_binary ~ race + gender + signs_of_mental_illness +
## threat_level, family = "binomial", data = fatal_clean)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.44753 0.29069 -4.980 6.37e-07 ***
## raceBlack non-Hispanic 0.95689 0.25773 3.713 0.000205 ***
## raceHispanic 0.78472 0.26035 3.014 0.002578 **
## raceNative American 0.70791 0.33960 2.085 0.037112 *
## raceOther 0.87109 0.39975 2.179 0.029323 *
## raceWhite non-Hispanic 0.63668 0.25560 2.491 0.012742 *
## genderM 0.22286 0.14615 1.525 0.127300
## signs_of_mental_illnessTRUE -1.20334 0.08288 -14.519 < 2e-16 ***
## threat_levelother 0.03653 0.06215 0.588 0.556654
## threat_levelundetermined -0.13743 0.17656 -0.778 0.436346
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 7302.0 on 5754 degrees of freedom
## Residual deviance: 6988.9 on 5745 degrees of freedom
## AIC: 7008.9
##
## Number of Fisher Scoring iterations: 4
Interpretation
Now that we have fitted the logistic regression model predicting whether an individual was fleeing, we can look at the predictors that are statistically significant and interpret their effects. Below are the variables with meaningful log-odds ratios.
raceBlack non-Hispanic = 0.95689 (p=0.000205): Being Black non-Hispanic increases the log-odds of fleeing compared to the reference group(Asian individuals). This effect is statistically significant, which means race plays an important role in predicting fleeing behavior.
raceHispanic = 0.78472 (p=0.002578): Hispanic individuals have higher log-odds of fleeing than Asians. This is also significant, indicating that race meaningfully influences the likelihood of fleeing.
raceNative American =0.70791 (p=0.037112): Native American individuals show significantly higher log-odds of fleeing compared to Asians. Although the effect is smaller than other race groups, it is still statistically meaningful.
raceOther = 0.87109 (p=0.029323): Individuals categorized as “Other” race has significantly higher log-odds of fleeing than Asians. This suggests the “Other” race group is also an important predictor.
raceWhite non-Hispanic = 0.63668 (p=0.012742): White non-Hispanic individuals have significantly higher log-odds of fleeing compared to Asians. This again highlights race as a strong overall predictor.
signs_of_mental_illnessTRUE = -1.20334 (p<2e-16): This is one of the strongest predictors in the model. Individuals showing signs of mental illness have much lower log-odds of fleeing than those without mental illness. This effect is extremely significant, indicating that mental illness strongly decreases the likelihood of fleeing.
genderM = 0.22286 (p=0.127300): Gender is not significant (p>0.05) This means that once we account for race,threat level, and mental illness, gender does not meaningfully affect whether someone was fleeing.
threat_levelOther and threat_levelundetermined: Both threat level categories have p-values greater than 0.05, meaning they are not significant predictors of fleeing. Threat level does not improve the model’s predictive power once other factors are included.
Overall, the residual deviance is lower than the null deviance, meaning the predictors improve model fit compared to an intercept-only model.
r_square_all <- 1 - (model$deviance/model$null.deviance)
r_square_all
## [1] 0.04288325
The R² value for this model is 0.043, meaning the predictors explain about 4.3% of the variation in whether an individual was fleeing. This value is considered low, which is common in logistic regression, especially with social or behavioral data sets. A low R² does not mean the model is incorrect, it simply indicates that fleeing behavior is influenced by many factors not captured in this model. I report that the R² is low, and future models could include additional predictors to improve the explanatory power.
1 - pchisq((model$null.deviance - model$deviance), df=(length(model$coefficients)-1))
## [1] 0
The overall p-value of the model is effectively 0, meaning the model is statistically significant.
predicted.probs <- model$fitted.values
predicted.classes <- ifelse(predicted.probs > 0.3, 1, 0)
confusion <- table(
Predicted = factor(predicted.classes, levels = c(0, 1)),
Actual = factor(fatal_clean$flee_binary, levels = c(0, 1))
)
confusion
## Actual
## Predicted 0 1
## 0 1197 225
## 1 2657 1676
I used a lower threshold of 0.3 instead of the default 0.5 to reduce the number of false negatives.
1197 people were actually not fleeing, and the model correctly predicted them as not fleeing. This is a True Negative(TN).
225 people were actually fleeing, but the model mistakenly predicted them as not fleeing. This is False Negative (FN).
2657 people were actually not fleeing, but the model mistakenly predicted them as fleeing. This is False Positive (FP).
1676 people were actually fleeing, and the model correctly predicted them as fleeing. This is True Positive (TP).
TN <- 1197
FP <- 2657
FN <- 225
TP <- 1676
# Metrics
accuracy <- (TP + TN) / (TP + TN + FP + FN)
sensitivity <- TP / (TP + FN)
specificity <- TN / (TN + FP)
precision <- TP / (TP + FP)
f1_score <- 2 * (precision * sensitivity) / (precision + sensitivity)
# Print results
cat("Accuracy: ", round(accuracy, 4), "\n")
## Accuracy: 0.4992
cat("Sensitivity: ", round(sensitivity, 4), "\n")
## Sensitivity: 0.8816
cat("Specificity: ", round(specificity, 4), "\n")
## Specificity: 0.3106
cat("Precision: ", round(precision, 4), "\n")
## Precision: 0.3868
cat("F1 Score: ", round(f1_score, 4), "\n")
## F1 Score: 0.5377
The model shows moderate overall accuracy (50%), meaning it is correct roughly half of the time. Even though this accuracy not high, the more important point is how the model performs for each class because the data set imbalanced.
The model is very strong at identifying individuals who were actually fleeing, with a high sensitivity of about 88%. This means when a person truly was fleeing, the model identified them correctly most of the time.
However, the model struggles more with the non-fleeing group. The specificity is only about 31%, which means the model often predicts “fleeing” even when the person person was not fleeing. Because of this precision is also low (about 39%), showing that many of the model’s “fleeing” predictions were false alarms.
The F1-score (about 0.54) gives a balanced view of precision and sensitivity. The model is very good at catching fleeing individuals, but it does so by incorrectly labeling many non-fleeing individuals as fleeing.
library(pROC)
## Warning: package 'pROC' was built under R version 4.5.2
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
# ROC curve & AUC on full data
roc_obj <- roc(response = fatal_clean$flee_binary,
predictor = model$fitted.values,
levels = c(0,1),
direction = "<")
# Print AUC value
auc_val <- auc(roc_obj); auc_val
## Area under the curve: 0.6233
# Plot ROC with AUC displayed
plot.roc(roc_obj, print.auc = TRUE, legacy.axes = TRUE,
xlab = "False Positive Rate (1 - Specificity)",
ylab = "True Positive Rate (Sensitivity)")
The AUC = 0.623 means the model has a modest ability to distinguish between people who were fleeing and those who were not.
On the ROC plot, the curve is above the diagonal “random guess” line, which shows that the model performs better than chance. But it is not very strong.
In plain words: If you randomly pick one person who fled and one who did not flee, the model has about a 62.2% chance of giving the higher predicted probability to the person who actually fled.
This means the predictors (race,gender,mental illness,threat level) provide some useful information, but the model does not fully separate the two groups. More predictors would likely improve performance in the future.
In this project, I looked at whether race, gender, signs of mental illness, and threat level could help predict if a person was fleeing during a fatal police shooting. The results showed that some variables, especially race and signs of mental illness, were related to fleeing behavior. However, the model overall did not predict fleeing very well. The pseudo R² was low (around 0.04), meaning the predictors explained only a small pat of the behavior. The AUC value of 0.623 also showed that the model is only slightly better than random guessing.
These results suggest that fleeing is influenced by many other factors that were not included in this model. Important details like, what was happening during the incident, the environment, or how the situation unfolded, might help improve prediction. In the future, adding more meaningful variables, exploring more detailed incident level information, or trying different types of models could lead to better results. Overall, the current model provides some insights, but more information is needed to better understand the factors related to fleeing behavior.
https://ai.lange-analytics.com/htmlbook/LogRegr.html#LogRegr-AnalyzeChurn <- About imbalanced data sets
https://www.r-bloggers.com/2020/04/how-to-impute-missing-values-in-r/
https://libguides.princeton.edu/c.php?g=1391216&p=10290937#s-lg-box-wrapper-38136674 <- imputing NAs with mode