Research Question: Is the age of individuals involved in police use-of-force incidents different depending on their race, sex, and the type of force used?

The data for my project come from the Minneapolis Police Use of Force data set provided by OpenIntro (Link: https://www.openintro.org/data/index.php?data=mn_police_use_of_force). This data set contains 12,925 observations of police use-of-force incidents in Minneapolis, Minnesota from 2016 through August 2021. Each row represents one incident and has 13 variables, which include demographic information about the subjects (such as age, race, and sex), details about the type of force used by the officer, and details of when and where each incident occurred.

The purpose of this project is to see whether race, sex, and the type of force used are associated with differences in the ages of individuals involved in police use-of-force incidents. I wanted to look into this data set because police use of force is a major issue in public safety and social justice, and I wanted to see what patterns I could find in the data.

Variables Selected:

age: age of subject (quantitative)

race: race of subject (categorical)

sex: gender of subject (categorical)

force_type: type of police force used (categorical)

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
setwd("~/Desktop/FALL25/data science")
mn_force_data <- read_csv("mn_police_use_of_force.csv")
## Rows: 12925 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (11): response_datetime, problem, is_911_call, primary_offense, subject_...
## dbl  (2): age, precinct
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(mn_force_data)
## spc_tbl_ [12,925 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ response_datetime: chr [1:12925] "2016/01/01 00:47:36" "2016/01/01 02:19:34" "2016/01/01 02:19:34" "2016/01/01 02:28:48" ...
##  $ problem          : chr [1:12925] "Assault in Progress" "Fight" "Fight" "Fight" ...
##  $ is_911_call      : chr [1:12925] "Yes" "No" "No" "No" ...
##  $ primary_offense  : chr [1:12925] "DASLT1" "DISCON" "DISCON" "PRIORI" ...
##  $ subject_injury   : chr [1:12925] NA NA NA NA ...
##  $ force_type       : chr [1:12925] "Bodily Force" "Chemical Irritant" "Chemical Irritant" "Chemical Irritant" ...
##  $ force_type_action: chr [1:12925] "Body Weight to Pin" "Personal Mace" "Personal Mace" "Crowd Control Mace" ...
##  $ race             : chr [1:12925] "Black" "Black" "White" "Black" ...
##  $ sex              : chr [1:12925] "Male" "Female" "Female" "Male" ...
##  $ age              : num [1:12925] 20 27 23 20 20 20 20 20 20 18 ...
##  $ type_resistance  : chr [1:12925] "Tensed" "Verbal Non-Compliance" "Verbal Non-Compliance" "Commission of Crime" ...
##  $ precinct         : num [1:12925] 1 1 1 1 1 1 1 1 1 1 ...
##  $ neighborhood     : chr [1:12925] "Downtown East" "Downtown West" "Downtown West" "Downtown West" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   response_datetime = col_character(),
##   ..   problem = col_character(),
##   ..   is_911_call = col_character(),
##   ..   primary_offense = col_character(),
##   ..   subject_injury = col_character(),
##   ..   force_type = col_character(),
##   ..   force_type_action = col_character(),
##   ..   race = col_character(),
##   ..   sex = col_character(),
##   ..   age = col_double(),
##   ..   type_resistance = col_character(),
##   ..   precinct = col_double(),
##   ..   neighborhood = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
head(mn_force_data)
## # A tibble: 6 × 13
##   response_datetime   problem         is_911_call primary_offense subject_injury
##   <chr>               <chr>           <chr>       <chr>           <chr>         
## 1 2016/01/01 00:47:36 Assault in Pro… Yes         DASLT1          <NA>          
## 2 2016/01/01 02:19:34 Fight           No          DISCON          <NA>          
## 3 2016/01/01 02:19:34 Fight           No          DISCON          <NA>          
## 4 2016/01/01 02:28:48 Fight           No          PRIORI          <NA>          
## 5 2016/01/01 02:28:48 Fight           No          PRIORI          <NA>          
## 6 2016/01/01 02:28:48 Fight           No          PRIORI          <NA>          
## # ℹ 8 more variables: force_type <chr>, force_type_action <chr>, race <chr>,
## #   sex <chr>, age <dbl>, type_resistance <chr>, precinct <dbl>,
## #   neighborhood <chr>

Data Analysis

To answer my research question, I first loaded my dataset and explored it with basic EDA functions such as str() and head() to understand the structure of the data. My next step was to clean the data by using select() to keep only the variables related to my analysis, which were age, race, sex, and force_type. Since the dataset recorded some ages as 0, I used mutate() with na_if() to replace those values with NA. I then used filter() to drop rows with a missing age and na.omit() to remove rows with a missing race or sex, so that I was only using valid data. Once I had cleaned the dataset, I examined it again with str(), head(), and summary() to make sure the variables looked correct before running my regression model.

Cleaning EDA functions

mn_force_clean <- mn_force_data |>
  select(age, race, sex, force_type) |>
  mutate(age = na_if(age, 0)) |>      
  filter(!is.na(age))          

Check for NA’s and remove

colSums(is.na(mn_force_clean))
##        age       race        sex force_type 
##          0        454         45          0
mn_force_clean <- na.omit(mn_force_clean)
str(mn_force_clean)
## tibble [11,176 × 4] (S3: tbl_df/tbl/data.frame)
##  $ age       : num [1:11176] 20 27 23 20 20 20 20 20 20 18 ...
##  $ race      : chr [1:11176] "Black" "Black" "White" "Black" ...
##  $ sex       : chr [1:11176] "Male" "Female" "Female" "Male" ...
##  $ force_type: chr [1:11176] "Bodily Force" "Chemical Irritant" "Chemical Irritant" "Chemical Irritant" ...
##  - attr(*, "na.action")= 'omit' Named int [1:466] 123 142 144 312 325 326 435 436 523 524 ...
##   ..- attr(*, "names")= chr [1:466] "123" "142" "144" "312" ...
head(mn_force_clean)
## # A tibble: 6 × 4
##     age race  sex    force_type       
##   <dbl> <chr> <chr>  <chr>            
## 1    20 Black Male   Bodily Force     
## 2    27 Black Female Chemical Irritant
## 3    23 White Female Chemical Irritant
## 4    20 Black Male   Chemical Irritant
## 5    20 Black Male   Chemical Irritant
## 6    20 Black Male   Chemical Irritant
summary(mn_force_clean)
##       age            race               sex             force_type       
##  Min.   : 8.00   Length:11176       Length:11176       Length:11176      
##  1st Qu.:23.00   Class :character   Class :character   Class :character  
##  Median :28.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :30.11                                                           
##  3rd Qu.:36.00                                                           
##  Max.   :82.00
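
Because race, sex, and force_type are stored as character columns, summary() only reports their length, class, and mode, which says nothing about how many observations fall in each group. The sketch below shows one way to get group counts. It uses a small made-up data frame (the name toy and its values are hypothetical, not taken from the real data) and base R only.

```r
# Toy data standing in for the cleaned tibble (all values are made up).
toy <- data.frame(
  age  = c(20, 27, 23, 35, 41, 19),
  race = c("Black", "Black", "White", "Asian", "White", "Black"),
  sex  = c("Male", "Female", "Female", "Male", "Male", "Male"),
  stringsAsFactors = FALSE
)

# On a character column, summary() reports only length, class, and mode.
summary(toy$race)

# table() counts the observations in each category.
race_counts <- table(toy$race)
race_counts

# After converting to a factor, summary() reports the counts as well.
toy$race <- factor(toy$race)
summary(toy$race)
```

Applied to the cleaned data, the same calls would show which categories have very few observations. Small groups tend to produce large standard errors in a regression.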

Regression Analysis

In my analysis, I used a multiple linear regression model to test whether age differs across race, sex, and force_type groups. Because lm() uses the first level of each categorical predictor as the baseline, the reference categories are Asian for race, Female for sex, and the alphabetically first force_type level. That level is the one missing from the coefficient table; Chemical Irritant has its own coefficient, so it is not the baseline. Every coefficient is a difference from the baseline, holding the other predictors constant. The coefficient for sexMale was 2.5278, which means that males are on average about 2.5 years older than females involved in use-of-force incidents. The very small p-value of <2e-16 indicates this result is highly significant. Some race categories also showed significant differences, namely “Other / Mixed Race” and “White.” Subjects classified as “Other / Mixed Race” were estimated to be about 3.4 years younger than the baseline race group (p = 0.00306), and “White” subjects were estimated to be about 2.7 years older than the baseline race group (p = 0.00325). None of the individual force_type coefficients was statistically significant, which suggests that the type of force used does not meaningfully predict age relative to the baseline force type. The R-squared value was 0.06267, or about 6%. Since the value is low, most of the variation in age is due to factors not included in this model.
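
The baseline category of a character predictor in lm() comes from how R builds factors: the levels are sorted, and the first one becomes the reference. The sketch below illustrates this with a short made-up vector of force-type labels (the vector force is hypothetical), and shows how relevel() could set a different reference category if one were wanted.

```r
# Hypothetical force-type labels, for illustration only.
force <- c("Taser", "Bodily Force", "Chemical Irritant", "Bodily Force")

# Character predictors are converted to factors; the levels are sorted
# alphabetically and the first level becomes the reference category.
force_f <- factor(force)
levels(force_f)
baseline <- levels(force_f)[1]
baseline

# relevel() chooses a different reference category explicitly.
force_chem <- relevel(force_f, ref = "Chemical Irritant")
levels(force_chem)[1]
```

Running levels(factor(...)) on each predictor in the cleaned data is a quick way to confirm which category a set of coefficients is being compared against.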

# Multiple linear regression model
multiple_model <- lm(age ~ race + sex + force_type, data = mn_force_clean)
summary(multiple_model)
## 
## Call:
## lm(formula = age ~ race + sex + force_type, data = mn_force_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.222  -6.887  -1.841   5.159  52.159 
## 
## Coefficients:
##                                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                            25.8312     5.8602   4.408 1.05e-05 ***
## raceBlack                              -0.8835     0.9021  -0.979  0.32742    
## raceNative American                     0.4974     0.9663   0.515  0.60676    
## raceOther / Mixed Race                 -3.3886     1.1439  -2.962  0.00306 ** 
## racePacific Islander                   -6.4060     4.1907  -1.529  0.12639    
## raceWhite                               2.6899     0.9138   2.944  0.00325 ** 
## sexMale                                 2.5278     0.2619   9.653  < 2e-16 ***
## force_typeBodily Force                  2.3657     5.7850   0.409  0.68260    
## force_typeChemical Irritant            -2.9074     5.7943  -0.502  0.61584    
## force_typeFirearm                       2.0245     9.1452   0.221  0.82481    
## force_typeGun Point Display             2.3366     5.8668   0.398  0.69043    
## force_typeImprovised Weapon             3.5511     5.8485   0.607  0.54375    
## force_typeLess Lethal                   4.3532     5.9872   0.727  0.46718    
## force_typeLess Lethal Projectile        9.5245     8.1798   1.164  0.24429    
## force_typeMaximal Restraint Technique   0.9280     5.8394   0.159  0.87374    
## force_typePolice K9 Bite               -1.8712     5.8985  -0.317  0.75108    
## force_typeTaser                         3.4119     5.7913   0.589  0.55577    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.02 on 11159 degrees of freedom
## Multiple R-squared:  0.06267,    Adjusted R-squared:  0.06133 
## F-statistic: 46.63 on 16 and 11159 DF,  p-value: < 2.2e-16
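
The p-values in a coefficient table compare each level with the baseline only, so they do not directly test whether a categorical predictor matters as a whole. One common way to test that is a partial F-test between nested models. The sketch below is a minimal illustration on simulated data, not a result from this project: the data frame sim, the force labels A to D, and the effect sizes are all assumptions made up for the example.

```r
set.seed(1)
n <- 300

# Simulated predictors (hypothetical labels).
sim <- data.frame(
  sex   = factor(sample(c("Female", "Male"), n, replace = TRUE)),
  force = factor(sample(c("A", "B", "C", "D"), n, replace = TRUE))
)

# Simulated response: depends on sex only, not on force.
sim$age <- 28 + 2.5 * (sim$sex == "Male") + rnorm(n, sd = 10)

full    <- lm(age ~ sex + force, data = sim)
reduced <- lm(age ~ sex, data = sim)

# Partial F-test: do the three force dummy variables jointly improve the fit?
f_test <- anova(reduced, full)
f_test
```

For the model in this report, the analogous comparison would be between lm(age ~ race + sex) and the full model. drop1(full, test = "F") runs the same kind of test for every term at once.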

Model Assumptions and Diagnostics

To check the assumptions of multiple linear regression, I first examined component-plus-residual plots using crPlots(). Since all my predictors are categorical, these plots appear as boxplots rather than smooth curves, showing how the partial residuals for age differ across groups. For categorical variables, the goal is to see whether each group has a reasonably consistent pattern and whether any group stands far outside the others. The boxplots show normal variation without any unexpected patterns, so there is no sign that the linearity assumption is violated.

The next plot, the Residuals vs. Order plot, checks for independence. It shows residuals scattered randomly around the zero line with no trend, curve, or clustering, so there is no visible evidence of dependence between observations in row order.

The third check uses the four core diagnostic plots to assess linearity, homoscedasticity, normality, and influence. The first plot, Residuals vs. Fitted, shows the points fairly scattered with no strong curve, suggesting the form of the model is reasonable. The second plot, the Q-Q plot, shows the middle following the line very closely, while the tails show some deviation, suggesting that the residuals are not perfectly normal. With a sample this large, mild departures from normality have little effect on the coefficient tests. The third plot, Scale-Location, shows a fairly consistent spread of residuals across the fitted values. Since the variability stays mostly stable and does not fan outward, the assumption of homoscedasticity does not appear to be violated. The last plot, Residuals vs. Leverage, shows most points having very low leverage with no major influential points.

I did not check multicollinearity with a correlation matrix, because pairwise correlations only apply to numeric variables and all the predictors in the model are categorical. For categorical predictors, the generalized variance inflation factor from vif() in the car package is the usual check. The RMSE for the model is about 10.01, which means the ages predicted by the model typically differ from the actual ages by roughly 10 years. This makes sense because the R-squared value was low: race, sex, and force type explain only a small part of why ages differ, and most of the variation in age is explained by other factors not included in the model.

1- Linearity check:

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
crPlots(multiple_model) 

2- Independence of observations

plot(resid(multiple_model), type="b",
     main="Residuals vs Order", ylab="Residuals"); abline(h=0, lty=2)

3- Core diagnostics (covers: linearity, homoscedasticity, normality, influence)

par(mfrow=c(2,2)); plot(multiple_model); par(mfrow=c(1,1))
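
When every predictor is categorical, the fitted values take only a limited number of distinct values, so a numeric summary of residual spread by group can complement the Scale-Location plot. The sketch below is one possible way to do that, shown on R's built-in chickwts data rather than on the police data. The choice of dataset and the use of a largest-to-smallest SD ratio are illustrative assumptions, not part of the original analysis.

```r
# Built-in data used purely for illustration: chick weights by feed type.
fit <- lm(weight ~ feed, data = chickwts)

# Standard deviation of the residuals within each feed group.
resid_sd <- tapply(resid(fit), chickwts$feed, sd)
round(resid_sd, 1)

# Ratio of the largest to the smallest group SD.
# Values close to 1 suggest a similar spread across groups.
sd_ratio <- max(resid_sd) / min(resid_sd)
sd_ratio
```

The same tapply() call with resid(multiple_model) and one of the predictors from mn_force_clean would give the group-wise spread for the model in this report.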

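4- Check Multicollinearity

Since all my predictors are categorical, a correlation matrix does not apply, because correlations require numeric values. Multicollinearity among categorical predictors can instead be checked with the generalized variance inflation factor, for example with vif() from the car package.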
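
As a rough illustration of how collinearity can be examined when predictors are categorical, the sketch below fits a model with two factor predictors on R's built-in mtcars data. The dataset, the choice of cyl and gear as factors, and the use of Cramér's V as a base-R measure of association are all assumptions for the example. The car::vif() call runs only if the car package is installed.

```r
# Built-in data for illustration; cyl and gear are treated as categorical.
mt <- transform(mtcars, cyl = factor(cyl), gear = factor(gear))
fit <- lm(mpg ~ cyl + gear, data = mt)

if (requireNamespace("car", quietly = TRUE)) {
  # For terms with more than one degree of freedom, car::vif() reports
  # generalized VIFs (GVIF) along with GVIF^(1/(2*Df)).
  print(car::vif(fit))
}

# Base-R alternative: Cramer's V, a 0-to-1 measure of association
# between two categorical variables, computed from a chi-squared test.
tab <- table(mt$cyl, mt$gear)
chi <- suppressWarnings(chisq.test(tab))
cramers_v <- sqrt(unname(chi$statistic) / (sum(tab) * (min(dim(tab)) - 1)))
cramers_v
```

For the model in this report, car::vif(multiple_model) would be the corresponding call. Values of GVIF^(1/(2*Df)) close to 1 indicate little collinearity.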

RMSE

residuals_multiple <- resid(multiple_model)

rmse_multiple <- sqrt(mean(residuals_multiple^2))
rmse_multiple
## [1] 10.01015
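
The RMSE above (10.01) is slightly smaller than the residual standard error in the model summary (10.02) because the two divide the residual sum of squares by different quantities: RMSE uses n, while the residual standard error uses the residual degrees of freedom. With n = 11,176 and 11,159 residual degrees of freedom, the two differ by a factor of about sqrt(11176/11159), roughly 1.0008. The sketch below shows the relationship on R's built-in mtcars data, which is used here only for illustration.

```r
# Built-in data for illustration.
fit <- lm(mpg ~ wt + hp, data = mtcars)

rss <- sum(resid(fit)^2)   # residual sum of squares
n   <- nobs(fit)           # number of observations used in the fit

rmse <- sqrt(rss / n)                  # divides by n
rse  <- sqrt(rss / df.residual(fit))   # divides by n - p, as summary() does

# The hand-computed residual standard error matches sigma().
all.equal(rse, sigma(fit))

# RMSE is the residual standard error scaled by sqrt((n - p) / n).
all.equal(rmse, rse * sqrt(df.residual(fit) / n))
```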

Conclusion and Future Directions

My analysis of the Minneapolis Police Use of Force dataset showed that age does differ across certain groups involved in these incidents. The multiple linear regression model showed that sex and only certain race categories were significant predictors of age, while the type of force showed no significance. Males were on average about 2.5 years older than females involved in use-of-force incidents, with a highly significant p-value of <2e-16. Only two race categories showed significant age differences relative to the baseline group (Asian): individuals identified as White were about 2.7 years older, while those classified as Other / Mixed Race were roughly 3.4 years younger. These findings answer my research question by showing that age varies across certain demographic characteristics, even though not all predictors were significant. Even though sex and two race categories showed clear differences, the model overall did not explain much of the variation in age. The R-squared value shows that the model explained about 6% of the differences in age, so most of the variation comes from factors not included in the model. This suggests that while demographic patterns exist, they’re not strong enough to accurately predict age on their own. In the future, I’d like to explore other variables that might help explain more of this variation, like incident circumstances, time of day, location, or officer characteristics.