Fatal Shootings by Police Since 2015

Introduction

In these past few years, there have been so many news stories about shootings by a police officer, and I wanted to know if the news has been reporting only certain cases. I also wanted to see if there were patterns or common factors in the shooting cases. For example where are they occurring more often, what race have a higher rate of shooting cases, if the victim was armed and what factors may have lead the victims to be killed.

Solution overview: To figure out what was common I am going to be using R to perform the data analysis and visualization to find and display the data and see what patterns emerge from it. For some of the section I will use charts and others I will be mapping out densities on a map of the US. Then at the end I will see what I can conclude from the visualizations.

Packages Required

The packages required will help manipulate the data properly and provide me with functions for visualization and mapping.

library(tidyverse)   #helps with data visualization, tidying and more
library(tibble)      #storing data as a tibble which is easier to control
library(kableExtra)  #table creating
library(formattable) #table manipulation
library(ggplot2)     #plotting and mapping
library(DT)          #previewing data tables
library(dplyr)       #data manipulation
library(plotly)      #interactive charts
library(ggrepel)     #prevents labels overlapping
library(usmap)       #mapping tools

Data Preparation

I used the data set from Github that was posted by The Washington Post and it shows every fatal shooting from 2015 from a on duty police officer, the data set is constantly updated with more cases. At this time, the data set provides information on just under 7,000 victims. It contains their names, date, the age, location and more. There are a lot of cases that have missing data and I will sort them out in the preparation section.

#loading the data
raw_data <- read.csv("fatal-police-shootings-data.csv")

#storing the data as a tibble
raw_tib <- as_tibble(raw_data)

Now that the data has been stored we can look at the data

#getting the total cases
nrow(raw_tib)
## [1] 6800
#Getting all the variable names
names(raw_tib)
##  [1] "id"                      "name"                   
##  [3] "date"                    "manner_of_death"        
##  [5] "armed"                   "age"                    
##  [7] "gender"                  "race"                   
##  [9] "city"                    "state"                  
## [11] "signs_of_mental_illness" "threat_level"           
## [13] "flee"                    "body_camera"            
## [15] "longitude"               "latitude"               
## [17] "is_geocoding_exact"

Now that the data has been loaded and the variable names are known, it’s just a matter of cleaning up the data, and picking the variables needed. The variables that I will be choosing are state, race, age, armed, threat_level and flee. I have a brief description of what the variables mean below and will further elaborate on some of them in the later sections.

#removing cases with missing data
raw_tib <- raw_tib %>% na_if("") %>% na.omit

#now select the variables wanted
cleaned <- raw_tib %>% select(c('state', 'race', 'age', 'armed', 'threat_level', 'flee'))

#previewing the cleaned data
head(cleaned,100) %>% datatable(options = list(
  columnDefs = list(list(className = 'dt-center', targets = 1:4))
  ))

Above is a preview of the cleaned data and the selected variables. Each case has a unique id and the total amount of cases that have full information on the victim came out to be 4,938 cases. The state column is the abbreviation of the states in the US. The race column consists of letters that symbolize a different ethnicity, A stands for Asian, B is Black, H is Hispanic, N is Native American, W is White. The armed column shows if the victim was armed, and with what, or unarmed. The threat_level column shows 3 different characters, attack, other, and undetermined. (I will go into more detail about this in the analysis section) The flee column states whether or not the victim was fleeing and the method, or if they were not fleeing.

To make it easier later on, I am going to divide the cleaned data into their own subsets so later it will be easier and the code will look a bit cleaner.

#putting the data into subsets
st <- cleaned %>% select(state)
rac <- cleaned %>% select(race)
ag <- cleaned %>% select(age)
arm <- cleaned %>% select(armed)
thr <- cleaned %>% select(threat_level)
fl <- cleaned %>% select(flee)

Data Analysis

Race

I am going to change the letters of the race into full words then find the count for the cases

#change letters into full words
rac['race'][rac["race"] =='A'] <- 'Asian'
rac['race'][rac["race"] =='B'] <- 'Black'
rac['race'][rac["race"] =='H'] <- 'Hispanic'
rac['race'][rac["race"] =='N'] <- 'Native American'
rac['race'][rac["race"] =='O'] <- 'Other'
rac['race'][rac["race"] =='W'] <- 'White' 

#getting count
rac <- rac %>%
  count(race) %>% 
  mutate(Proportion = n/4938)
rac$Proportion <- formattable(rac$Proportion, format="f", digits=2)

Now that I have the count and proportions of each race, I made a table and bar plot so we can see the spread of races among the cases

#making the table
rac %>% kbl(col.names = c("Race","Victims", "Proportion")) %>%
  kable_paper("striped","hover","condensed", full_width = T) 
Race Victims Proportion
Asian 88 0.02
Black 1318 0.27
Hispanic 911 0.18
Native American 73 0.01
Other 42 0.01
White 2506 0.51
#creating interactive barplot
barpltrc <- ggplot(data = rac, aes(x = race, y = n, fill = race)) + geom_bar(color='black', stat = "identity") + theme(axis.text.x = element_text(angle = 45, vjust = 0.45)) + xlab("Race") + ylab("Number of Cases") + ggtitle('Number of Shooting Cases Per Race')
ggplotly(barpltrc)

Using the plotly package allows for many type of interactive charts.Try hovering over parts of the plot and clicking and dragging over section of the plot!

From looking at the plot we can conclude that a lot of the victims were White, Black or Hispanic.

State

Here I gathered the amount of shooting cases from all states.

#count of cases in each state
st <- st %>% count(state)
#creating table for the cases
st %>% kbl(col.names = c("State","Victims")) %>%
  kable_styling("striped", "condensed", full_width = T) %>%
  scroll_box(height = '300px')
State Victims
AK 31
AL 91
AR 65
AZ 214
CA 730
CO 176
CT 17
DC 17
DE 11
FL 343
GA 179
HI 23
IA 31
ID 42
IL 107
IN 99
KS 52
KY 84
LA 94
MA 32
MD 81
ME 17
MI 81
MN 62
MO 118
MS 61
MT 25
NC 145
ND 9
NE 26
NH 14
NJ 54
NM 93
NV 87
NY 93
OH 146
OK 147
OR 78
PA 101
RI 2
SC 73
SD 12
TN 132
TX 422
UT 59
VA 93
VT 7
WA 127
WI 86
WV 36
WY 13

Side note: The data set has 51 states because it has Washington D.C as it’s own state

Below is a map of the states and the states with a higher amount of cases are colored in more than those that have a lower amount of cases.

#generating the map
plot_usmap(data = st, values = "n",  color = "black", labels=F) + 
  scale_fill_continuous( low = "white", high = "red", name = "Cases") + 
  theme(legend.position = "right") + 
  theme(panel.background = element_rect(colour = "black")) + 
  labs(title = "Shooting Cases by State")

After looking at the map, California, Texas and Florida seem to have the highest amount of shooting cases. Lets get the numbers for each state, and to get that, I generated a interactive bar chart of with the number of cases per state.

#generating barplot
stplt <- ggplot(st, aes(x = reorder(state, -n), y = n)) + geom_col(data=st, width = 0.8, color='black', fill = 'royalblue2') + theme(axis.text.x = element_text(angle = 90, vjust = 0.45)) +
  xlab("State") +
  ylab("Number of Shooting Cases")+
  ggtitle('Number of Shooting Cases In Each State')
ggplotly(stplt)

It seems that those top three states are well above the rest, with Florida being over than 100 more than Arizona.

Threat Level

On the data sets website, the threat level variable has 3 different levels, attack, other & undetermined. Attack is stating that the threat was a direct and immediate threat to life, this would include when officers were shot at, threatened with a gun or with any other weapon or physical force. Other includes incidents where the threat was not immediate or direct but still significant. Undetermined is unknown if the victim posed a significant or immediate threat.

#counting the amount for each level of threat
thr <- thr %>% 
  count(threat_level) %>%
  mutate(Proportion = n/4938)
thr$Proportion <- formattable(thr$Proportion, format="f", digits=2)


#creating the table
thr %>% kbl(col.names = c("Threat Level","Victims", "Proportion")) %>%
  kable_paper("hover", "condensed", full_width = F)
Threat Level Victims Proportion
attack 3256 0.66
other 1572 0.32
undetermined 110 0.02
#creating pie chart
piechrtthr <- plot_ly(thr, labels = ~threat_level, values = ~Proportion, type = 'pie', textinfo = 'label+percent', textposition = 'outside', hoverinfo = 'text', text = ~paste(n, "Cases"))
piechrtthr <- piechrtthr %>% layout(title = "Shooting Case Victims by Threat Level")
piechrtthr

The pie chart shows that almost 66% of the victims were immediate threats, and while almost 32% were not immediate threats, they still were threatening enough for the officers to take action.

Flee

In the data set, there were many ways the victim would flee, by car, on foot, or other methods. To simplify this, I combined all the methods of fleeing and combine them into one and labeled them as “Yes” and if they were not fleeing “No”.

#combining the variables into one
fl[fl =='Car' | fl == 'Foot' | fl == 'Other'] <- 'Yes'
fl[fl =='Not fleeing'] <- 'No'

#getting count
fl <- fl %>% count(flee) %>%
  mutate(Proportion = n/4938)
fl$Proportion <- formattable(fl$Proportion, format="f", digits=2)

#creating table
fl %>% kbl(col.names = c("Flee","Victims", "Proportion")) %>%
  kable_paper("hover","condensed", full_width = F)
Flee Victims Proportion
No 3283 0.66
Yes 1655 0.34
#creating pie chart
piechrttfl <- plot_ly(fl, labels = ~flee, values = ~Proportion, type = 'pie', textinfo = 'label+percent', textposition = 'inside', hoverinfo = 'text', text = ~paste(n, "Cases"))
piechrttfl <- piechrttfl %>% layout(title = "Shooting Case Victims by Flee Status")
piechrttfl

Only a little over 33% of the victims tried to flee while the rest did not attempt to flee.

Age

For the age I am going to put the data into ranges, the ranges are from 18 and under, 19-40 years old, 41-60 years old, and 60 and older.

#putting the ages into an age range
agerng <- cut(ag$age, c(1,18,40,60,100), labels = c("18 and Under", "19 to 40", "41 to 60", "60 and Up")) 

#counts the cases in each age range
rngtbl <- as.data.frame(table(agerng)) 
rngtbl <- rngtbl %>% mutate(Proportion = Freq/4938)
rngtbl$Proportion <- formattable(rngtbl$Proportion, format="f", digits=2)

#creating table
rngtbl %>% kbl(col.names = c("Age Range (In Years)", "Victims", 'Proportion')) %>% kable_paper("striped","hover", "condensed", full_width = F)
Age Range (In Years) Victims Proportion
18 and Under 190 0.04
19 to 40 3110 0.63
41 to 60 1399 0.28
60 and Up 239 0.05
#making interactive boxplot
boxage <- plot_ly(x = ag$age, type = "box", name = '')
boxage <- boxage %>% layout(title = "Age Range of Victims (Years)")
boxage

By looking at the box plot, we can see that the majority of the ages range from 27 to 45 years old with a max being 91 years old and a shocking 6 years old for minimum!

Armed

For the armed category, there are many different things that people were armed with, so just like the flee table, I am going to combine all the data into 3 categories, armed, unarmed and undetermined. However for this variable, there are far too many different weapons the victims were armed with, therefore I will use a for loop and if-else statements to quickly sort the weapons into one “Armed” category.

#combining the weapons and making the categories capitalized
for(i in 1:nrow(arm)){
  if(arm[i,'armed'] == "unarmed"){
    arm[i,'armed'] <- "Unarmed"
  }
  else if(arm[i,'armed'] == "undetermined"){
    arm[i,'armed'] <- "Undetermined"
  }
  else(arm[i,"armed"] <- "Armed")
}

#getting the count
arm <- arm %>% count(armed) %>%
  mutate(Proportion = n/4938)
arm$Proportion <- formattable(arm$Proportion, format="f", digits=2)

#generating the table
arm %>% kbl(col.names = c("Armed or Not","Victims", "Proportion")) %>%
  kable_material("hover", "condensed", full_width = T)
Armed or Not Victims Proportion
Armed 4469 0.91
Unarmed 376 0.08
Undetermined 93 0.02
#generating pie chart
piechrtarm <- plot_ly(arm, labels = ~armed, values = ~Proportion, type = 'pie', textinfo = 'label+percent', textposition = 'inside', hoverinfo = 'text', text = ~paste(n, "Cases"))
piechrtarm <- piechrtarm %>% layout(title = "Victims Armed And Unarmed") 
piechrtarm

Looking at the pie chart shows that over 90% of the victims are armed when killed. Now seeing this, how come the news seems to only show the cases that have unarmed or undetermined victims? There seems to be some sort of factor that is unknown.

Summary

Summarizing the problem: Finding common occurrences and patterns for fatal shooting cases involving an on duty police officer. The analysis includes results for highest amount of cases for race, states ranked highest to lowest for number of cases, the percentages for threat level of the victims, how many of the victims fled, the age range of the victims and if they were armed or not.

How the problem was addressed: By implementing the functions of packages like kable, plotly, ggplot and more, the data was collected and manipulated accordingly for analysis. After creating the visualizations for the data, the common occurrences were determined.

Conclusion

With all the analysis completed, we can conclude from the observations the most common factors in the fatal shootings. The majority of the victims are White with 51% of the victims, with Black and Hispanic following behind with a combined 45%. The state with the most shootings is California with more than 300 cases than Texas. 97% of the victims were posing a threat to officers or people, which is a good thing as police do not have the best reputation at the moment, so this shows that the majority of victims were not innocent civilians. We can conclude from the flee section that the majority of the victims did not flee and went down with a fight. The age of the victims are mainly ranging from 19 to 40 years old. And being that only 7.6% of victims were unarmed it goes to show that the police are taking precautions to try to only take down armed and dangerous people. What does this mean? While 7.6 % is still 376 people, which is a lot, this shows that the police are not just shooting people while the media can make it seem like the police are.

The were a few limitations on the data. Some being the lack of complete data, and other factors like what the situation was that led to the victims being shot. Next, to further improve the analysis, I would like to include more information on the victims and further manipulate the data to see if there was any prejudice, racial bias, and what race was had the most cases per year. I want these variables to be included because on the Washington Posts article, it states that black people are shot at a rate that is more than double of what white people are.read more here