Fatal Shooting

Research Question

Do race,mental illness, threat level, and gender predict whether an individual was fleeing during a fatal police shooting?

Introduction

The data set I am using is titled “fatal_police_shootings,” which comes from the https://www.openintro.org/data/index.php?data=fatal_police_shootings This data set is a subset of the Washington Post data base. This data set contains records of every fatal police shooting by an on-duty officer since January 1,2015. This data set has 6421 observations and 12 variables.

For this project I focused on specific 5 columns from the data set.

flee race gender signs_of_mental_illness threat_level

The Flee columns shows whether the individual was fleeing or not and it is a categorical variable. Race is showing the race of the individual and it is a categorical variable. Gender is showing the gender of the individual and it is a categorical variable. Signs_of_mental_illness is showing whether the individual showed signs of mental illness or not and it is a categorical variable. Threat_level is showing level of threat reported at the time of shooting and it is a categorical variable.

Source

Washington Post

Data Analysis

Before building my model, I first cleaned and explored the data set to make sure I was working with reliable information. I checked for missing values and found that some of the variables I needed (flee, race , and gender) has NA values, so I replaced them using the most common value (mode) to keep the data consistent. I did not impute the race but I removed the NAs from that variable. Since the outcome variable (flee) had four categories, I converted it into a binary variable to fit a logistic regression model. I also renamed the race column using clearer category names and converted all predictors into factors so they would work correctly in the model. After cleaning, I created a smaller data set with only the variables needed for my analysis. I used several dplyr functions, such as mutate, case_when, and select, to prepare the data. For quality control, I checked that each category had enough observations in both fleeing and non-fleeing groups. Once the data was cleaned and organized, I was ready to run the logistic regression and evaluate how well the predictors explained fleeing behavior.

Regresion Analysis

For my analysis, I used a logistic regression model to see whether race,gender, signs of mental illness, and threat level could predict if a person was fleeing during a fatal police shooting. The original flee variable had four categories, so I first converted it into a binary outcome called flee_binary (0= not fleeing, 1= fleeing) to make it suitable for logistic regression. After cleaning the data set and preparing the predictors, I fit the final model using the selected variables. The model summary showed that race and signs of mental illness were significant predictors of fleeing, while gender and threat level were not. However, the overall predictive power of the model was low. This is shown by the pseudo R² value of 0.043, meaning the model explains only about 4.3% of the variation in fleeing behavior. In addition to that, the model struggled to correctly identify fleeing cases due to the imbalanced data set. Even though the model found some meaningful relationships, it is not very strong at predicting who was actually fleeing.

Loading the libraries

library(tidyverse)

Loading the data set

fatal_shooting <- read_csv("fatal_police_shootings.csv")

To look at the data type and first 6 rows

str(fatal_shooting)

## spc_tbl_ [6,421 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ date                   : Date[1:6421], format: "2015-01-02" "2015-01-02" ...
##  $ manner_of_death        : chr [1:6421] "shot" "shot" "shot and Tasered" "shot" ...
##  $ armed                  : chr [1:6421] "gun" "gun" "unarmed" "toy weapon" ...
##  $ age                    : num [1:6421] 53 47 23 32 39 18 22 35 34 47 ...
##  $ gender                 : chr [1:6421] "M" "M" "M" "M" ...
##  $ race                   : chr [1:6421] "A" "W" "H" "W" ...
##  $ city                   : chr [1:6421] "Shelton" "Aloha" "Wichita" "San Francisco" ...
##  $ state                  : chr [1:6421] "WA" "OR" "KS" "CA" ...
##  $ signs_of_mental_illness: logi [1:6421] TRUE FALSE FALSE TRUE FALSE FALSE ...
##  $ threat_level           : chr [1:6421] "attack" "attack" "other" "attack" ...
##  $ flee                   : chr [1:6421] "Not fleeing" "Not fleeing" "Not fleeing" "Not fleeing" ...
##  $ body_camera            : logi [1:6421] FALSE FALSE FALSE FALSE FALSE FALSE ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   date = col_date(format = ""),
##   ..   manner_of_death = col_character(),
##   ..   armed = col_character(),
##   ..   age = col_double(),
##   ..   gender = col_character(),
##   ..   race = col_character(),
##   ..   city = col_character(),
##   ..   state = col_character(),
##   ..   signs_of_mental_illness = col_logical(),
##   ..   threat_level = col_character(),
##   ..   flee = col_character(),
##   ..   body_camera = col_logical()
##   .. )
##  - attr(*, "problems")=<externalptr>

head(fatal_shooting)

## # A tibble: 6 × 12
##   date       manner_of_death  armed        age gender race  city          state
##   <date>     <chr>            <chr>      <dbl> <chr>  <chr> <chr>         <chr>
## 1 2015-01-02 shot             gun           53 M      A     Shelton       WA   
## 2 2015-01-02 shot             gun           47 M      W     Aloha         OR   
## 3 2015-01-03 shot and Tasered unarmed       23 M      H     Wichita       KS   
## 4 2015-01-04 shot             toy weapon    32 M      W     San Francisco CA   
## 5 2015-01-04 shot             nail gun      39 M      H     Evans         CO   
## 6 2015-01-04 shot             gun           18 M      W     Guthrie       OK   
## # ℹ 4 more variables: signs_of_mental_illness <lgl>, threat_level <chr>,
## #   flee <chr>, body_camera <lgl>

Checking NAs

colSums(is.na(fatal_shooting))

##                    date         manner_of_death                   armed 
##                       0                       0                     208 
##                     age                  gender                    race 
##                     285                       1                     666 
##                    city                   state signs_of_mental_illness 
##                       0                       0                       0 
##            threat_level                    flee             body_camera 
##                       0                     434                       0

Imputing NAs with mode only for variables I am using

Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x,ux)))]
}

fatal_shooting$flee[is.na(fatal_shooting$flee)] <- Mode(fatal_shooting$flee)
fatal_shooting$gender[is.na(fatal_shooting$gender)] <- Mode(fatal_shooting$gender)

fatal_shooting <- fatal_shooting |>
  filter(!is.na(race))

I did not impute the missing race values because it can actually make the data more biased. Since race is already sensitive and unevenly represented, filling in NAs with the mode could exaggerate one group and give misleading results. So I kept the race values as they were to avoid adding extra bias.

Checking NAs after imputng

colSums(is.na(fatal_shooting[c("flee","race","gender")]))

##   flee   race gender 
##      0      0      0

Converting flee into binary outcome

We are doing this because flee has 4 categories but for logistic regression we need only two outcomes(binary 0 or 1).

fatal_shooting <- fatal_shooting |>
  mutate(flee_binary = ifelse(flee == "Not fleeing",0,1) )
fatal_shooting

## # A tibble: 5,755 × 13
##    date       manner_of_death  armed        age gender race  city          state
##    <date>     <chr>            <chr>      <dbl> <chr>  <chr> <chr>         <chr>
##  1 2015-01-02 shot             gun           53 M      A     Shelton       WA   
##  2 2015-01-02 shot             gun           47 M      W     Aloha         OR   
##  3 2015-01-03 shot and Tasered unarmed       23 M      H     Wichita       KS   
##  4 2015-01-04 shot             toy weapon    32 M      W     San Francisco CA   
##  5 2015-01-04 shot             nail gun      39 M      H     Evans         CO   
##  6 2015-01-04 shot             gun           18 M      W     Guthrie       OK   
##  7 2015-01-05 shot             gun           22 M      H     Chandler      AZ   
##  8 2015-01-06 shot             gun           35 M      W     Assaria       KS   
##  9 2015-01-06 shot             unarmed       34 F      W     Burlington    IA   
## 10 2015-01-06 shot             toy weapon    47 M      B     Knoxville     PA   
## # ℹ 5,745 more rows
## # ℹ 5 more variables: signs_of_mental_illness <lgl>, threat_level <chr>,
## #   flee <chr>, body_camera <lgl>, flee_binary <dbl>

Renaming the race variable

fatal_shooting <- fatal_shooting |>
  mutate(race = case_when(
    race == "W" ~ "White non-Hispanic",
    race == "B" ~ "Black non-Hispanic",
    race == "A" ~ "Asian",
    race == "N" ~ "Native American",
    race == "H" ~ "Hispanic",
    race == "O" ~ "Other"
  ))
fatal_shooting

## # A tibble: 5,755 × 13
##    date       manner_of_death  armed        age gender race          city  state
##    <date>     <chr>            <chr>      <dbl> <chr>  <chr>         <chr> <chr>
##  1 2015-01-02 shot             gun           53 M      Asian         Shel… WA   
##  2 2015-01-02 shot             gun           47 M      White non-Hi… Aloha OR   
##  3 2015-01-03 shot and Tasered unarmed       23 M      Hispanic      Wich… KS   
##  4 2015-01-04 shot             toy weapon    32 M      White non-Hi… San … CA   
##  5 2015-01-04 shot             nail gun      39 M      Hispanic      Evans CO   
##  6 2015-01-04 shot             gun           18 M      White non-Hi… Guth… OK   
##  7 2015-01-05 shot             gun           22 M      Hispanic      Chan… AZ   
##  8 2015-01-06 shot             gun           35 M      White non-Hi… Assa… KS   
##  9 2015-01-06 shot             unarmed       34 F      White non-Hi… Burl… IA   
## 10 2015-01-06 shot             toy weapon    47 M      Black non-Hi… Knox… PA   
## # ℹ 5,745 more rows
## # ℹ 5 more variables: signs_of_mental_illness <lgl>, threat_level <chr>,
## #   flee <chr>, body_camera <lgl>, flee_binary <dbl>

Converting predictors to factors

fatal_shooting <- fatal_shooting |>
  mutate(race = as.factor(race),
         gender = as.factor(gender),
         signs_of_mental_illness = as.factor(signs_of_mental_illness),
         threat_level = as.factor(threat_level))
str(fatal_shooting)

## tibble [5,755 × 13] (S3: tbl_df/tbl/data.frame)
##  $ date                   : Date[1:5755], format: "2015-01-02" "2015-01-02" ...
##  $ manner_of_death        : chr [1:5755] "shot" "shot" "shot and Tasered" "shot" ...
##  $ armed                  : chr [1:5755] "gun" "gun" "unarmed" "toy weapon" ...
##  $ age                    : num [1:5755] 53 47 23 32 39 18 22 35 34 47 ...
##  $ gender                 : Factor w/ 2 levels "F","M": 2 2 2 2 2 2 2 2 1 2 ...
##  $ race                   : Factor w/ 6 levels "Asian","Black non-Hispanic",..: 1 6 3 6 3 6 3 6 6 2 ...
##  $ city                   : chr [1:5755] "Shelton" "Aloha" "Wichita" "San Francisco" ...
##  $ state                  : chr [1:5755] "WA" "OR" "KS" "CA" ...
##  $ signs_of_mental_illness: Factor w/ 2 levels "FALSE","TRUE": 2 1 1 2 1 1 1 1 1 1 ...
##  $ threat_level           : Factor w/ 3 levels "attack","other",..: 1 1 2 1 1 1 1 1 2 1 ...
##  $ flee                   : chr [1:5755] "Not fleeing" "Not fleeing" "Not fleeing" "Not fleeing" ...
##  $ body_camera            : logi [1:5755] FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ flee_binary            : num [1:5755] 0 0 0 0 0 0 1 0 0 0 ...

Making the final cleaned data set

fatal_clean <- fatal_shooting |>
  select(flee_binary,race,gender,signs_of_mental_illness,threat_level)
fatal_clean

## # A tibble: 5,755 × 5
##    flee_binary race               gender signs_of_mental_illness threat_level
##          <dbl> <fct>              <fct>  <fct>                   <fct>       
##  1           0 Asian              M      TRUE                    attack      
##  2           0 White non-Hispanic M      FALSE                   attack      
##  3           0 Hispanic           M      FALSE                   other       
##  4           0 White non-Hispanic M      TRUE                    attack      
##  5           0 Hispanic           M      FALSE                   attack      
##  6           0 White non-Hispanic M      FALSE                   attack      
##  7           1 Hispanic           M      FALSE                   attack      
##  8           0 White non-Hispanic M      FALSE                   attack      
##  9           0 White non-Hispanic F      FALSE                   other       
## 10           0 Black non-Hispanic M      FALSE                   attack      
## # ℹ 5,745 more rows

Quality Control

xtabs(~flee_binary +race,data = fatal_clean)

##            race
## flee_binary Asian Black non-Hispanic Hispanic Native American Other
##           0    85                925      691              60    30
##           1    20                603      375              30    17
##            race
## flee_binary White non-Hispanic
##           0               2063
##           1                856

xtabs(~flee_binary +gender,data = fatal_clean)

##            gender
## flee_binary    F    M
##           0  198 3656
##           1   70 1831

xtabs(~flee_binary +signs_of_mental_illness,data = fatal_clean)

##            signs_of_mental_illness
## flee_binary FALSE TRUE
##           0  2718 1136
##           1  1697  204

xtabs(~flee_binary +threat_level,data = fatal_clean)

##            threat_level
## flee_binary attack other undetermined
##           0   2485  1261          108
##           1   1235   615           51

Each category has more more than 5 in both groups, so these variables are safe to use in modeling, no need to drop or merge any levels.

Logistic Regression Model

model <- glm(flee_binary ~ race + gender + signs_of_mental_illness + threat_level,data = fatal_clean,family = "binomial")

summary(model)

## 
## Call:
## glm(formula = flee_binary ~ race + gender + signs_of_mental_illness + 
##     threat_level, family = "binomial", data = fatal_clean)
## 
## Coefficients:
##                             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 -1.44753    0.29069  -4.980 6.37e-07 ***
## raceBlack non-Hispanic       0.95689    0.25773   3.713 0.000205 ***
## raceHispanic                 0.78472    0.26035   3.014 0.002578 ** 
## raceNative American          0.70791    0.33960   2.085 0.037112 *  
## raceOther                    0.87109    0.39975   2.179 0.029323 *  
## raceWhite non-Hispanic       0.63668    0.25560   2.491 0.012742 *  
## genderM                      0.22286    0.14615   1.525 0.127300    
## signs_of_mental_illnessTRUE -1.20334    0.08288 -14.519  < 2e-16 ***
## threat_levelother            0.03653    0.06215   0.588 0.556654    
## threat_levelundetermined    -0.13743    0.17656  -0.778 0.436346    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 7302.0  on 5754  degrees of freedom
## Residual deviance: 6988.9  on 5745  degrees of freedom
## AIC: 7008.9
## 
## Number of Fisher Scoring iterations: 4

Interpretation

Now that we have fitted the logistic regression model predicting whether an individual was fleeing, we can look at the predictors that are statistically significant and interpret their effects. Below are the variables with meaningful log-odds ratios.

raceBlack non-Hispanic = 0.95689 (p=0.000205): Being Black non-Hispanic increases the log-odds of fleeing compared to the reference group(Asian individuals). This effect is statistically significant, which means race plays an important role in predicting fleeing behavior.

raceHispanic = 0.78472 (p=0.002578): Hispanic individuals have higher log-odds of fleeing than Asians. This is also significant, indicating that race meaningfully influences the likelihood of fleeing.

raceNative American =0.70791 (p=0.037112): Native American individuals show significantly higher log-odds of fleeing compared to Asians. Although the effect is smaller than other race groups, it is still statistically meaningful.

raceOther = 0.87109 (p=0.029323): Individuals categorized as “Other” race has significantly higher log-odds of fleeing than Asians. This suggests the “Other” race group is also an important predictor.

raceWhite non-Hispanic = 0.63668 (p=0.012742): White non-Hispanic individuals have significantly higher log-odds of fleeing compared to Asians. This again highlights race as a strong overall predictor.

signs_of_mental_illnessTRUE = -1.20334 (p<2e-16): This is one of the strongest predictors in the model. Individuals showing signs of mental illness have much lower log-odds of fleeing than those without mental illness. This effect is extremely significant, indicating that mental illness strongly decreases the likelihood of fleeing.

genderM = 0.22286 (p=0.127300): Gender is not significant (p>0.05) This means that once we account for race,threat level, and mental illness, gender does not meaningfully affect whether someone was fleeing.

threat_levelOther and threat_levelundetermined: Both threat level categories have p-values greater than 0.05, meaning they are not significant predictors of fleeing. Threat level does not improve the model’s predictive power once other factors are included.

Overall, the residual deviance is lower than the null deviance, meaning the predictors improve model fit compared to an intercept-only model.

Calculating the r-square

r_square_all <- 1 - (model$deviance/model$null.deviance)

r_square_all

## [1] 0.04288325

The R² value for this model is 0.043, meaning the predictors explain about 4.3% of the variation in whether an individual was fleeing. This value is considered low, which is common in logistic regression, especially with social or behavioral data sets. A low R² does not mean the model is incorrect, it simply indicates that fleeing behavior is influenced by many factors not captured in this model. I report that the R² is low, and future models could include additional predictors to improve the explanatory power.

Calculating p-value

1 - pchisq((model$null.deviance - model$deviance), df=(length(model$coefficients)-1))

## [1] 0

The overall p-value of the model is effectively 0, meaning the model is statistically significant.

Confusion matrix

predicted.probs <- model$fitted.values

predicted.classes <- ifelse(predicted.probs > 0.3, 1, 0) 

confusion <- table(
  Predicted = factor(predicted.classes, levels = c(0, 1)),
  Actual = factor(fatal_clean$flee_binary, levels = c(0, 1))
)

confusion

##          Actual
## Predicted    0    1
##         0 1197  225
##         1 2657 1676

I used a lower threshold of 0.3 instead of the default 0.5 to reduce the number of false negatives.

1197 people were actually not fleeing, and the model correctly predicted them as not fleeing. This is a True Negative(TN).

225 people were actually fleeing, but the model mistakenly predicted them as not fleeing. This is False Negative (FN).

2657 people were actually not fleeing, but the model mistakenly predicted them as fleeing. This is False Positive (FP).

1676 people were actually fleeing, and the model correctly predicted them as fleeing. This is True Positive (TP).

Calculating key performance metrics

TN <- 1197
FP <- 2657
FN <- 225
TP <- 1676

# Metrics
accuracy <- (TP + TN) / (TP + TN + FP + FN)
sensitivity <- TP / (TP + FN)  
specificity <- TN / (TN + FP)  
precision <- TP / (TP + FP)   
f1_score <- 2 * (precision * sensitivity) / (precision + sensitivity)

# Print results
cat("Accuracy:    ", round(accuracy, 4), "\n")

## Accuracy:     0.4992

cat("Sensitivity: ", round(sensitivity, 4), "\n")

## Sensitivity:  0.8816

cat("Specificity: ", round(specificity, 4), "\n")

## Specificity:  0.3106

cat("Precision:   ", round(precision, 4), "\n")

## Precision:    0.3868

cat("F1 Score:    ", round(f1_score, 4), "\n")

## F1 Score:     0.5377

The model shows moderate overall accuracy (50%), meaning it is correct roughly half of the time. Even though this accuracy not high, the more important point is how the model performs for each class because the data set imbalanced.

The model is very strong at identifying individuals who were actually fleeing, with a high sensitivity of about 88%. This means when a person truly was fleeing, the model identified them correctly most of the time.

However, the model struggles more with the non-fleeing group. The specificity is only about 31%, which means the model often predicts “fleeing” even when the person person was not fleeing. Because of this precision is also low (about 39%), showing that many of the model’s “fleeing” predictions were false alarms.

The F1-score (about 0.54) gives a balanced view of precision and sensitivity. The model is very good at catching fleeing individuals, but it does so by incorrectly labeling many non-fleeing individuals as fleeing.

ROC and AUC

library(pROC)

## Warning: package 'pROC' was built under R version 4.5.2

## Type 'citation("pROC")' for a citation.

## 
## Attaching package: 'pROC'

## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

# ROC curve & AUC on full data
roc_obj <- roc(response = fatal_clean$flee_binary,
               predictor = model$fitted.values,
               levels = c(0,1),
               direction = "<")  

# Print AUC value
auc_val <- auc(roc_obj); auc_val

## Area under the curve: 0.6233

# Plot ROC with AUC displayed
plot.roc(roc_obj, print.auc = TRUE, legacy.axes = TRUE,
         xlab = "False Positive Rate (1 - Specificity)",
         ylab = "True Positive Rate (Sensitivity)")

The AUC = 0.623 means the model has a modest ability to distinguish between people who were fleeing and those who were not.

On the ROC plot, the curve is above the diagonal “random guess” line, which shows that the model performs better than chance. But it is not very strong.

In plain words: If you randomly pick one person who fled and one who did not flee, the model has about a 62.2% chance of giving the higher predicted probability to the person who actually fled.

This means the predictors (race,gender,mental illness,threat level) provide some useful information, but the model does not fully separate the two groups. More predictors would likely improve performance in the future.

Conclusion

In this project, I looked at whether race, gender, signs of mental illness, and threat level could help predict if a person was fleeing during a fatal police shooting. The results showed that some variables, especially race and signs of mental illness, were related to fleeing behavior. However, the model overall did not predict fleeing very well. The pseudo R² was low (around 0.04), meaning the predictors explained only a small pat of the behavior. The AUC value of 0.623 also showed that the model is only slightly better than random guessing.

These results suggest that fleeing is influenced by many other factors that were not included in this model. Important details like, what was happening during the incident, the environment, or how the situation unfolded, might help improve prediction. In the future, adding more meaningful variables, exploring more detailed incident level information, or trying different types of models could lead to better results. Overall, the current model provides some insights, but more information is needed to better understand the factors related to fleeing behavior.

References

https://ai.lange-analytics.com/htmlbook/LogRegr.html#LogRegr-AnalyzeChurn <- About imbalanced data sets

https://www.r-bloggers.com/2020/04/how-to-impute-missing-values-in-r/

https://libguides.princeton.edu/c.php?g=1391216&p=10290937#s-lg-box-wrapper-38136674 <- imputing NAs with mode