I am covering two fatal police shootings datasets in the United States with the Washington Post data collection. The first dataset has 9497 observations with 19 different variables. The variables are self explantory but here are the few that could be a little confusing. The threat variable with different catergorical options listed on the website as:
- shoot: The victim fired a weapon.
- point :The victim pointed a weapon at another individual.
- attack: The victim attacked with other weapons or physical force.
- threat: The victim had some kind of weapon visible to the officers on the scene.
- move: The victim was moving in a threatening way.
- flee: The victim was fleeing (see flee_status)
- accident: ❓
- undetermined: The threat type could not be determined from available evidence
The other confusing variable is armed_with which are filled in with the following options also copied from the github website:
options:
- gun: A firearm, handgun, shotgun, or other firearm
- knife : A knife or other cutting instrument (razors, hatches, axes, cleavers, scissors, broken bottles, ice picks, etc.)
- blunt_object: A blunt object (baseball bats, the butt of a handgun, clubs, bricks, tire irons, bottles, etc.)
- other: Any other weapon (BB guns, pellet guns, Tasers, pepper spray, stun guns, etc.)
- replica: A toy weapon, replica, or other non-functional firearm.
- undetermined: Whether the victim had a weapon could not be determined from available evidence.
- unknown: There was a weapon involved, but we do not know what kind.
- unarmed: The victim had no weapon according to available evidence.
- vehicle: A motor vehicle or vessel.
The other dataset fatal police shooting agencies just covers each department and the how many fatal police shooting they had from 2015 to present.
I chose this topic because I wanted to explore the trends many of these police shootings have in common and discover new ones. As a person of color whom are many of the victims in the data, I wanted to see the trend with people of color and learn about why there is so many fatal police shootings. It could be scary being confronted by police and see how the differents ways victims confronted the police. Scaning the data some of the victims did have mental disabilities problems. Hopefully I discover new trends that others didn’t find out int the data.
Rows: 9497 Columns: 19
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (12): threat_type, flee_status, armed_with, city, county, state, locati...
dbl (4): id, latitude, longitude, age
lgl (2): was_mental_illness_related, body_camera
date (1): date
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
fatal_police_shootings_agencies <-read_csv("fatal-police-shootings-agencies.csv") #Loading in the two Datasets
Rows: 3521 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): name, type, state, oricodes
dbl (2): id, total_shootings
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
library(tidyverse)
Warning: package 'tidyverse' was built under R version 4.3.3
Warning: package 'lubridate' was built under R version 4.3.3
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.3 ✔ purrr 1.0.2
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.0 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate)library(leaflet) # Loading in the packages
Warning: package 'leaflet' was built under R version 4.3.3
library(sf)
Warning: package 'sf' was built under R version 4.3.3
Linking to GEOS 3.11.2, GDAL 3.8.2, PROJ 9.3.1; sf_use_s2() is TRUE
library(ggrepel)
Warning: package 'ggrepel' was built under R version 4.3.3
Police_shootings <- Police_shootings |>mutate(year = lubridate::year(date), month = lubridate::month(date),day = lubridate::day(date)) # I had to new seperate columns for the month,day,year.
Changing all the abreviations to their actual race name
Police_shootings <- Police_shootings |>mutate(race =recode(race, B ="Black", W ="White", H="Hispanic", A ="Asian", N ="Native American", O ="Other", "N;H"="Native American Hispanic", "B;H"="Black Hispanic","W;B"="Mixed", "W;H"=" White Hispanic", "W;B;N"="Mixed Native American")) # Changing all the abreviations to their actual race name
Removing NAs
Police_shootings <- Police_shootings |>filter(!is.na(race)) # using only data with race display being available
Checking the variables
str(Police_shootings) # Making sure that my variables are categorized right
New_Data <- Police_shootings |>filter(year >=2019) |>filter(armed_with =="unarmed") #only looking at years 2019 and above and that they were unarmed. Only innocent lives.
Making a frequency table
freq_tables=apply(New_Data,2,table) # Made a frequency table for all the variables to further see what graphs I want to make
threat <- Police_shootings |>group_by(threat_type, body_camera, race ) |># filter for the three variablessummarise(cnt=n()) |># count mutate(props =round(cnt/sum(cnt),2)) # To create a proportion for each situation
`summarise()` has grouped output by 'threat_type', 'body_camera'. You can
override using the `.groups` argument.
threat # I wanted to see if threat_type, bodycam(On/Off), and Race have a big impact in fatal police shootings
# A tibble: 96 × 5
# Groups: threat_type, body_camera [18]
threat_type body_camera race cnt props
<chr> <lgl> <chr> <int> <dbl>
1 accident FALSE Black 14 0.34
2 accident FALSE Hispanic 10 0.24
3 accident FALSE White 17 0.41
4 accident TRUE Black 3 0.38
5 accident TRUE Hispanic 2 0.25
6 accident TRUE White 3 0.38
7 attack FALSE Asian 25 0.03
8 attack FALSE Black 277 0.28
9 attack FALSE Hispanic 196 0.2
10 attack FALSE Native American 13 0.01
# ℹ 86 more rows
Police_shootings |>ggplot(aes(threat_type, fill = gender)) +# looking at threat type and gender a simple plotgeom_bar(position ="dodge", # dodge to make a little see throughalpha =0.5) +theme_bw() +# a nice themeggtitle("Police Shooting Deaths Due to Threat Type ")
Discovery:
I found that males are the most victims and that shooting at officers or seen as a threat are the most common reasons for fatal police shootings
Second Simple Plot For Exploration
Police_shootings |>ggplot(aes(body_camera, fill = race )) +# The two variables I use to make the barchartgeom_bar(alpha =0.8) +# Making the color pop more but not too muchtheme_bw() +ggtitle("If Body Cam was Present in Fatal Police Shootings")
Discovery:
Most of the police shootings have no body cam and most of the victims are white and black people. Its a serious issue that officers don’t have their body cam on to protect themselves and people everyday.
Third Simple Plot for exploration
Police_shootings |>ggplot(aes(x = age, fill = gender)) +# only lloking at gender and age for this histogramgeom_histogram()+# ggplot histogramtheme_bw() +# same theme in all my simple plotsggtitle("Age of The Victims") # Adding a title
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 126 rows containing non-finite outside the scale range
(`stat_bin()`).
Discovery:
The mean of the victims falls between 25 to 30 and the oldest being around 80 years old. This is a sad plot as it highlights how young most of the victims were. There should be more police training because 9,500 deaths from 2015 to present is a lot. I thought police shootings were much lower but hopefully more reform is being set in place.
ggplot(threat,aes(x = cnt, y = props, col = threat_type, label = race, shape = body_camera )) +# looking into the threat dataset and looking at proportions and count. Labeling for race and adding a circle or triangle for true or false bodycamgeom_point() +# A scatterplot geom_smooth(method ="lm",se =FALSE, lty =2, linewidth =0.3) +# I made each correlation line be dotted and small so it looks more presentablegeom_text_repel(nudge_x =0.005) +# To nudge the text to make the labels a little more visible xlab(" # of Deaths ") +ylab("Proportions") +# Adding y and x titles ggtitle("Proportions of deaths due to Threat type vs Bodycam vs Race") # final title
`geom_smooth()` using formula = 'y ~ x'
Warning: The following aesthetics were dropped during statistical transformation: label.
ℹ This can happen when ggplot fails to infer the correct grouping structure in
the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
variable into a factor?
Warning: ggrepel: 61 unlabeled data points (too many overlaps). Consider
increasing max.overlaps
Fit1 <-lm(cnt ~ props, data = threat) # Making a summary by count and proportionsummary(Fit1)
Call:
lm(formula = cnt ~ props, data = threat)
Residuals:
Min 1Q Median 3Q Max
-248.13 -78.34 -1.33 9.37 696.33
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -8.121 22.676 -0.358 0.721
props 478.015 87.767 5.446 4.1e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 152.6 on 94 degrees of freedom
Multiple R-squared: 0.2399, Adjusted R-squared: 0.2318
F-statistic: 29.66 on 1 and 94 DF, p-value: 4.098e-07
The Equation for the Model
Fatal Police shooting = 478.015(proportion of threat type,bodycam usage, and race) - 8.121
P-value :
The extremely small P value indicates that threat type, race, and if the body cam is on or off are strong evidence to support that these three variables are the big issues in fatal police shootings . The three * also indicate that these three variables are important.
Adjusted R squared :
The adjusted R squared of 0.23 % means that 76% of the variation of data is not explained by this model. I feel like if I added gender then it will be a higher R squared %.
Multi regression plots:
library(ggfortify) #loading in a library
Warning: package 'ggfortify' was built under R version 4.3.3
autoplot(Fit1,1:4, nrow=2,ncol=2) #Using autplots to make 4 plots for my regression analysis
Explanation for the Multi-Plots :
The residuals vs fitted plot shows that a linear plot is not appropriate.
The Q-Q plot shows that there is three outliers #45,71,57
The Scale location is diagonal and it shows all the points that skew it however the main outliers are #45,71,57
The cook distance shows that the three outliers have high leverage which can be causing problems in my model. I need to remove them.
Threat2 <- threat[-c(45,71,57),] # Removing the previous outliers fit2 <-lm(cnt ~ props, data = Threat2) # A new summary for the New Threat2summary(fit2)
Call:
lm(formula = cnt ~ props, data = Threat2)
Residuals:
Min 1Q Median 3Q Max
-126.88 -40.43 -16.25 -3.48 471.54
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 17.25 14.60 1.181 0.240488
props 223.25 60.06 3.717 0.000347 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 97.03 on 91 degrees of freedom
Multiple R-squared: 0.1318, Adjusted R-squared: 0.1223
F-statistic: 13.82 on 1 and 91 DF, p-value: 0.0003474
I see a smaller R squared which is bad and a bigger p-value which is still is a strong case.
autoplot(fit2,1:4, nrow=2,ncol=2) # Using Autoplots Again
Explanation:
The cook value improved but the other threee diagnostic graphs stayed the same. In the future I might need to add more variables to see if the R squared value increase to a prefered precent like 0.7. The outliers from Fit1 did improve the R squared value and the plots themselves. The new outliers are 90,52, and 12 but they are much closer to the other data compared to Fit 1 outliers.
My Final Graph:
palette <-colorNumeric(c("blue", "yellow","purple", "red"),# Adding a desired pallete to highlight the departments with the most killings from 2015 domain = New_Data$total_shootings)popupShooting <-paste0( "<strong>: Gender: </strong>", New_Data$gender, "<br>","<strong> Race: </strong>", New_Data$race, "<br>","<strong>Victim Name: </strong>", New_Data$Victims_names, "<br>","<strong>: : </strong>", New_Data$gender, "<br>","<strong>: Mentally illness: </strong>", New_Data$was_mental_illness_related, "<br>","<strong>: Date: </strong>", New_Data$date, "<br>","<strong>: age: </strong>", New_Data$age, "<br>","<strong>: Department: </strong>", New_Data$Police_Departments, "<br>","<strong>: Total Departments Shooting: </strong>", New_Data$total_shootings, "<br>","<strong>: Armed With: </strong>", New_Data$armed_with, "<br>") # adding all the interactivity for my map
leaflet() |>setView(lng =-81.2, lat =33.8, zoom =3) |># zoom three looks good to show all the United StatesaddProviderTiles("OpenStreetMap.DE") |># Adding the map backgroundaddCircles(data = New_Data, # adding the dataradius = New_Data$age*10, # Adding the radius with agecolor =~palette(New_Data$total_shootings), # Adding the pallete I made earlierfillColor =~palette(New_Data$total_shootings),fillOpacity =0.9, # so the colors aren't too boldpopup = popupShooting )
Assuming "longitude" and "latitude" are longitude and latitude, respectively
Warning in validateCoords(lng, lat, funcName): Data contains 37 rows with
either missing or invalid lat/lon values and will be ignored
Final Essay:
a. The topic of my data is fatal police shootings around the United States. I used all the variables that from both datasets from the Washington Post in my final visualization. The variables cover the number od deaths from many police departments. Age, race, and name of the victims of each fatal police shooting . It also had records if the victim had a weapon and what type of weapon. Moreover if the victim was mentally challenged or if body cam was present. This dataset was alot to unpack but fun to explore the variables. I picked this dataset because I wanted to explore why police may have targeted people of color and use the variables as factors to why it escalated to a death under police hands.
b. A sad stat is that police officers fatally shoot about three people a day which is super devasting that a first world country have problems in their police departments in deescalating situations. In 2016 with heavy pressure, the fbi now track records of fatal and not fatal force of officers to make up the little data they had beforehand. It is very important for this type of data to be released and track so fellow data scientist can get answers by this data and also inform the public. Like shown in my plots, African Americans are one of the most targeted groups and the article said around 2x more. This is deal to the heavy policing in black neighborhoods. The real question is when the police departments are going to make change in these communities and spark better change. It might be scary being a police officer however there is still many ways to handle people in distress.
Article Source : Peeples, Lynne, and Nature magazine. “What the Data Say about Police Shootings.” Scientific American, 20 Feb. 2024, www.scientificamerican.com/article/what-the-data-say-about-police-shootings/.
c. My Map visualization represents the innocent unarmed victims because if I was going to plot every death ever since 2015. The map appeared to be full and it was hard to see any patterns. I only focused on unarmed people and from 2019 and up . The main thing that suprised me was that 220 people died from 2019 to present. I really liked how the color pallete helped with higlighting the departments and states that killed the most people. I feel like there is more fatal shootings in the east coast and this scares me being a person of color. The only differently I would have done is maybe add more methods to color code more variables or a new way to add them without making the map too crazy. I liked this project and I learned alot.