Police Shootings Project 2

Author

Gabriel Castillo

Intro Paragraph

Data Source: https://www.washingtonpost.com/graphics/investigations/police-shootings-database/

I am covering two fatal police shootings datasets in the United States with the Washington Post data collection. The first dataset has 9497 observations with 19 different variables. The variables are self explantory but here are the few that could be a little confusing. The threat variable with different catergorical options listed on the website as:

- shoot: The victim fired a weapon.
- point :The victim pointed a weapon at another individual.
- attack: The victim attacked with other weapons or physical force.
- threat: The victim had some kind of weapon visible to the officers on the scene.
- move: The victim was moving in a threatening way.
- flee: The victim was fleeing (see flee_status)
- accident: ❓
- undetermined: The threat type could not be determined from available evidence

The other confusing variable is armed_with which are filled in with the following options also copied from the github website:

options:
- gun: A firearm, handgun, shotgun, or other firearm
- knife : A knife or other cutting instrument (razors, hatches, axes, cleavers, scissors, broken bottles, ice picks, etc.)
- blunt_object: A blunt object (baseball bats, the butt of a handgun, clubs, bricks, tire irons, bottles, etc.)
- other: Any other weapon (BB guns, pellet guns, Tasers, pepper spray, stun guns, etc.)
- replica: A toy weapon, replica, or other non-functional firearm.
- undetermined: Whether the victim had a weapon could not be determined from available evidence.
- unknown: There was a weapon involved, but we do not know what kind.
- unarmed: The victim had no weapon according to available evidence.
- vehicle: A motor vehicle or vessel.

The other dataset fatal police shooting agencies just covers each department and the how many fatal police shooting they had from 2015 to present.

I chose this topic because I wanted to explore the trends many of these police shootings have in common and discover new ones. As a person of color whom are many of the victims in the data, I wanted to see the trend with people of color and learn about why there is so many fatal police shootings. It could be scary being confronted by police and see how the differents ways victims confronted the police. Scaning the data some of the victims did have mental disabilities problems. Hopefully I discover new trends that others didn’t find out int the data.

Loading Libraries and Importing datasets

library(readr)
setwd("C:/Users/casti/OneDrive/Documents/DATA 110")
fatal_police_shootings_data <- read_csv("fatal-police-shootings-data.csv")

Rows: 9497 Columns: 19
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (12): threat_type, flee_status, armed_with, city, county, state, locati...
dbl   (4): id, latitude, longitude, age
lgl   (2): was_mental_illness_related, body_camera
date  (1): date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

fatal_police_shootings_agencies <- read_csv("fatal-police-shootings-agencies.csv") #Loading in the two Datasets

Rows: 3521 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): name, type, state, oricodes
dbl (2): id, total_shootings

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

library(tidyverse)

Warning: package 'tidyverse' was built under R version 4.3.3

Warning: package 'lubridate' was built under R version 4.3.3

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ purrr     1.0.2
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.0     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(lubridate)
library(leaflet) # Loading in the packages

Warning: package 'leaflet' was built under R version 4.3.3

library(sf)

Warning: package 'sf' was built under R version 4.3.3

Linking to GEOS 3.11.2, GDAL 3.8.2, PROJ 9.3.1; sf_use_s2() is TRUE

library(ggrepel)

Warning: package 'ggrepel' was built under R version 4.3.3

Joining the two datasets and tidying

Police_shootings <- merge.data.frame(x = fatal_police_shootings_agencies, y = fatal_police_shootings_data, by.x = "id", by.y = "agency_ids")

Lubridate the date into separate columns

Police_shootings <- Police_shootings |>
  mutate(year = lubridate::year(date), month = lubridate::month(date),
         day = lubridate::day(date)) # I had to new seperate columns for the month,day,year.

Changing all the abreviations to their actual race name

Police_shootings <- Police_shootings |> mutate(race = recode(race, B = "Black", W = "White", H= "Hispanic", A = "Asian", N = "Native American", O = "Other", "N;H" = "Native American Hispanic", "B;H" = "Black Hispanic","W;B" = "Mixed", "W;H" = " White Hispanic", "W;B;N" = "Mixed Native American")) # Changing all the abreviations to their actual race name

Removing NAs

Police_shootings <- Police_shootings |> filter(!is.na(race)) # using only data with race display being available

Checking the variables

str(Police_shootings) # Making sure that my variables are categorized right

'data.frame':   7839 obs. of  27 variables:
 $ id                        : num  1 1 1 1 2 2 2 3 3 3 ...
 $ name.x                    : chr  "Elizabethtown Police Department" "Elizabethtown Police Department" "Elizabethtown Police Department" "Elizabethtown Police Department" ...
 $ type                      : chr  "local_police" "local_police" "local_police" "local_police" ...
 $ state.x                   : chr  "KY" "KY" "KY" "KY" ...
 $ oricodes                  : chr  "KY04701" "KY04701" "KY04701" "KY04701" ...
 $ total_shootings           : num  4 4 4 4 3 3 3 12 12 12 ...
 $ id.y                      : num  252 7422 618 7423 1149 ...
 $ date                      : Date, format: "2015-03-29" "2021-12-11" ...
 $ threat_type               : chr  "point" "attack" "threat" "undetermined" ...
 $ flee_status               : chr  "not" "not" "not" NA ...
 $ armed_with                : chr  "gun" "knife" "knife" "undetermined" ...
 $ city                      : chr  "Elizabethtown" "Elizabethtown" "Elizabethtown" "Elizabethtown" ...
 $ county                    : chr  "Hardin" NA "Hardin" "Hardin" ...
 $ state.y                   : chr  "KY" "KY" "KY" "KY" ...
 $ latitude                  : num  37.7 NA 37.7 NA 37.6 ...
 $ longitude                 : num  -85.9 NA -85.9 NA -120.9 ...
 $ location_precision        : chr  "not_available" NA "not_available" NA ...
 $ name.y                    : chr  "Byron Herbert" "Christopher Sterusky" "Joshua Blough" "Gary Lee McCormick" ...
 $ age                       : num  29 29 28 34 28 27 15 33 41 37 ...
 $ gender                    : chr  "male" "male" "male" "male" ...
 $ race                      : chr  "Black" "Black" "White" "Black" ...
 $ race_source               : chr  "not_available" "not_available" "not_available" "photo" ...
 $ was_mental_illness_related: logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ body_camera               : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ year                      : num  2015 2021 2015 2021 2016 ...
 $ month                     : num  3 12 7 12 1 10 8 10 3 11 ...
 $ day                       : int  29 11 7 7 5 22 18 12 27 6 ...

Filtering for the dataset I am going use

New_Data <- Police_shootings |> 
  filter(year >= 2019) |>
  filter(armed_with == "unarmed") #only looking at years 2019 and above and that they were unarmed. Only innocent lives.

Making a frequency table

freq_tables= 
  apply(New_Data,2,table) # Made a frequency table for all the variables to further see what graphs I want to make

threat <- Police_shootings |>
  group_by(threat_type, body_camera, race ) |> # filter for the three variables
  summarise(cnt=n()) |> # count 
  mutate(props = round(cnt/sum(cnt),2)) # To create a proportion for each situation

`summarise()` has grouped output by 'threat_type', 'body_camera'. You can
override using the `.groups` argument.

threat # I wanted to see if threat_type, bodycam(On/Off), and Race have a big impact in fatal police shootings

# A tibble: 96 × 5
# Groups:   threat_type, body_camera [18]
   threat_type body_camera race              cnt props
   <chr>       <lgl>       <chr>           <int> <dbl>
 1 accident    FALSE       Black              14  0.34
 2 accident    FALSE       Hispanic           10  0.24
 3 accident    FALSE       White              17  0.41
 4 accident    TRUE        Black               3  0.38
 5 accident    TRUE        Hispanic            2  0.25
 6 accident    TRUE        White               3  0.38
 7 attack      FALSE       Asian              25  0.03
 8 attack      FALSE       Black             277  0.28
 9 attack      FALSE       Hispanic          196  0.2 
10 attack      FALSE       Native American    13  0.01
# ℹ 86 more rows

Renaming Columns

Police_shootings <- Police_shootings |>
  rename(Police_Departments = name.x)

# renaming name.x to Police Deparments

Police_shootings <- Police_shootings |>
  rename(Victims_names = name.y) 
# renaming name.y to Victims Names

Making My First Simple Plot For Exploration

Police_shootings |>
  ggplot(aes(threat_type, fill = gender)) + # looking at threat type and gender a simple plot
  geom_bar(position = "dodge", # dodge to make a little see through
           alpha = 0.5) +
  theme_bw() + # a nice theme
  ggtitle("Police Shooting Deaths Due to Threat Type ")

Discovery:

I found that males are the most victims and that shooting at officers or seen as a threat are the most common reasons for fatal police shootings

Second Simple Plot For Exploration

Police_shootings |>
  ggplot(aes(body_camera, fill = race )) + # The two variables I use to make the barchart
  geom_bar(
           alpha = 0.8) + # Making the color pop more but not too much
  theme_bw() +
  ggtitle("If Body Cam was Present in Fatal Police Shootings")

Discovery:

Most of the police shootings have no body cam and most of the victims are white and black people. Its a serious issue that officers don’t have their body cam on to protect themselves and people everyday.

Third Simple Plot for exploration

Police_shootings |>
  ggplot(aes(x = age, fill = gender)) +  # only lloking at gender and age for this histogram
  geom_histogram()+ # ggplot histogram
  theme_bw() + # same theme in all my simple plots
  ggtitle("Age of The Victims") # Adding a title

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Warning: Removed 126 rows containing non-finite outside the scale range
(`stat_bin()`).

Discovery:

The mean of the victims falls between 25 to 30 and the oldest being around 80 years old. This is a sad plot as it highlights how young most of the victims were. There should be more police training because 9,500 deaths from 2015 to present is a lot. I thought police shootings were much lower but hopefully more reform is being set in place.

  ggplot(threat,aes(x = cnt, y = props, col = threat_type, label = race, shape = body_camera )) + # looking into the threat dataset and looking at proportions and count. Labeling for race and adding a circle or triangle for true or false bodycam
  geom_point() + # A scatterplot 
    geom_smooth(method = "lm",se = FALSE, lty = 2, linewidth = 0.3) + # I made each correlation line be dotted and small so it looks more presentable
    geom_text_repel(nudge_x = 0.005) +  # To nudge the text to make the labels a little more visible 
    xlab(" # of Deaths ") +
    ylab("Proportions") + # Adding y and x titles 
    ggtitle("Proportions of deaths due to Threat type vs Bodycam vs Race") # final title

`geom_smooth()` using formula = 'y ~ x'

Warning: The following aesthetics were dropped during statistical transformation: label.
ℹ This can happen when ggplot fails to infer the correct grouping structure in
  the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
  variable into a factor?

Warning: ggrepel: 61 unlabeled data points (too many overlaps). Consider
increasing max.overlaps

Fit1 <- lm(cnt ~ props, data = threat) # Making a summary by count and proportion
summary(Fit1)


Call:
lm(formula = cnt ~ props, data = threat)

Residuals:
    Min      1Q  Median      3Q     Max 
-248.13  -78.34   -1.33    9.37  696.33 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   -8.121     22.676  -0.358    0.721    
props        478.015     87.767   5.446  4.1e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 152.6 on 94 degrees of freedom
Multiple R-squared:  0.2399,    Adjusted R-squared:  0.2318 
F-statistic: 29.66 on 1 and 94 DF,  p-value: 4.098e-07

The Equation for the Model

Fatal Police shooting = 478.015(proportion of threat type,bodycam usage, and race) - 8.121

P-value :

The extremely small P value indicates that threat type, race, and if the body cam is on or off are strong evidence to support that these three variables are the big issues in fatal police shootings . The three * also indicate that these three variables are important.

Adjusted R squared :

The adjusted R squared of 0.23 % means that 76% of the variation of data is not explained by this model. I feel like if I added gender then it will be a higher R squared %.

Multi regression plots:

library(ggfortify) #loading in a library

Warning: package 'ggfortify' was built under R version 4.3.3

autoplot(Fit1,1:4, nrow=2,ncol=2) #Using autplots to make 4 plots for my regression analysis

Explanation for the Multi-Plots :

The residuals vs fitted plot shows that a linear plot is not appropriate.
The Q-Q plot shows that there is three outliers #45,71,57
The Scale location is diagonal and it shows all the points that skew it however the main outliers are #45,71,57

The cook distance shows that the three outliers have high leverage which can be causing problems in my model. I need to remove them.

Threat2 <- threat[-c(45,71,57),] # Removing the previous outliers 
fit2 <- lm(cnt ~ props, data = Threat2) # A new summary for the New Threat2
summary(fit2)


Call:
lm(formula = cnt ~ props, data = Threat2)

Residuals:
    Min      1Q  Median      3Q     Max 
-126.88  -40.43  -16.25   -3.48  471.54 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)    17.25      14.60   1.181 0.240488    
props         223.25      60.06   3.717 0.000347 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 97.03 on 91 degrees of freedom
Multiple R-squared:  0.1318,    Adjusted R-squared:  0.1223 
F-statistic: 13.82 on 1 and 91 DF,  p-value: 0.0003474

I see a smaller R squared which is bad and a bigger p-value which is still is a strong case.

autoplot(fit2,1:4, nrow=2,ncol=2) # Using Autoplots Again

Explanation:

The cook value improved but the other threee diagnostic graphs stayed the same. In the future I might need to add more variables to see if the R squared value increase to a prefered precent like 0.7. The outliers from Fit1 did improve the R squared value and the plots themselves. The new outliers are 90,52, and 12 but they are much closer to the other data compared to Fit 1 outliers.

My Final Graph:

palette <- colorNumeric(
  c("blue", "yellow","purple", "red"),# Adding a desired pallete to highlight the departments with the most killings from 2015 
  domain = New_Data$total_shootings
)
popupShooting <- paste0( 
  "<strong>: Gender: </strong>", New_Data$gender, "<br>",
      "<strong> Race: </strong>", New_Data$race, "<br>",
      "<strong>Victim Name: </strong>", New_Data$Victims_names, "<br>",
   "<strong>: : </strong>", New_Data$gender, "<br>",
   "<strong>: Mentally illness: </strong>", New_Data$was_mental_illness_related, "<br>",
   "<strong>: Date: </strong>", New_Data$date, "<br>",
   "<strong>: age: </strong>", New_Data$age, "<br>",
   "<strong>: Department: </strong>", New_Data$Police_Departments, "<br>",
   "<strong>: Total Departments Shooting: </strong>", New_Data$total_shootings, "<br>",
  "<strong>: Armed With: </strong>", New_Data$armed_with, "<br>") # adding all the interactivity for my map

leaflet() |>
  setView(lng = -81.2, lat = 33.8, zoom = 3) |> # zoom three looks good to show all the United States
  addProviderTiles("OpenStreetMap.DE") |> # Adding the map background
  addCircles(
    data = New_Data, # adding the data
    radius = New_Data$age*10, # Adding the radius with age
    color = ~palette(New_Data$total_shootings), # Adding the pallete I made earlier
    fillColor = ~palette(New_Data$total_shootings),
    fillOpacity = 0.9, # so the colors aren't too bold
    popup = popupShooting
    )

Assuming "longitude" and "latitude" are longitude and latitude, respectively

Warning in validateCoords(lng, lat, funcName): Data contains 37 rows with
either missing or invalid lat/lon values and will be ignored

Final Essay:

a. The topic of my data is fatal police shootings around the United States. I used all the variables that from both datasets from the Washington Post in my final visualization. The variables cover the number od deaths from many police departments. Age, race, and name of the victims of each fatal police shooting . It also had records if the victim had a weapon and what type of weapon. Moreover if the victim was mentally challenged or if body cam was present. This dataset was alot to unpack but fun to explore the variables. I picked this dataset because I wanted to explore why police may have targeted people of color and use the variables as factors to why it escalated to a death under police hands.

b. A sad stat is that police officers fatally shoot about three people a day which is super devasting that a first world country have problems in their police departments in deescalating situations. In 2016 with heavy pressure, the fbi now track records of fatal and not fatal force of officers to make up the little data they had beforehand. It is very important for this type of data to be released and track so fellow data scientist can get answers by this data and also inform the public. Like shown in my plots, African Americans are one of the most targeted groups and the article said around 2x more. This is deal to the heavy policing in black neighborhoods. The real question is when the police departments are going to make change in these communities and spark better change. It might be scary being a police officer however there is still many ways to handle people in distress.

Article Source : Peeples, Lynne, and Nature magazine. “What the Data Say about Police Shootings.” Scientific American, 20 Feb. 2024, www.scientificamerican.com/article/what-the-data-say-about-police-shootings/.

c. My Map visualization represents the innocent unarmed victims because if I was going to plot every death ever since 2015. The map appeared to be full and it was hard to see any patterns. I only focused on unarmed people and from 2019 and up . The main thing that suprised me was that 220 people died from 2019 to present. I really liked how the color pallete helped with higlighting the departments and states that killed the most people. I feel like there is more fatal shootings in the east coast and this scares me being a person of color. The only differently I would have done is maybe add more methods to color code more variables or a new way to add them without making the map too crazy. I liked this project and I learned alot.