The report is created as part of “Data wrangling in R” final project to analyze police fatal shooting incidents in United states from 2015 onwards and to consider several key findings. This issue has become quite controversial in recent years in US and has been an ongoing topic of serious discussion due to various incidents happening across country. It is, therefore, imperative for us to know which areas and important factors are need to be looked at for these shooting incidents. It is important to mention here that purpose of this report is not to adjudicate anyone but only to analyze important findings. To accomplish this, I retrieved relevant data set and listed down all the variables whose values can help us understand various trends.During data cleaning, when deemded necessay, rows with missing values were either deleted or imputed.After that, scenarios were established by carefully looking at the data which are then implemented using appropriate graph.
Package(s) used in project to execute R code are mentioned below:
library(dplyr) ## Package for data manipulation
library(tidyverse) #Set of packages including dpylr and ggplot
library(ggplot2) #Package to produce complex multi-layered graphs in R
library(RSocrata) ## Provides easier interaction with Socrata open data portals
library(tibble) #Used for manipulating and printing tibbles
library(readr) #Read text files from disk
library(lubridate) #contains functions to work with data/time
library(viridis) #Use color maps for better analysis
library(stringr) ## used for regex and other string functions
library(tm) #Package to create wordcloud
library(SnowballC) #Package to create wordcloud
library(wordcloud) #Package to create wordcloud
library(RColorBrewer) #Provides color schemes to various figures in R
library(corrplot) #Package for graphical display of correlation matrix
library(treemap) #Offers flexibility to draw treemaps
library(gmodels) #Package to use CrossTable function for making contigency table
Data is extracted from https://www.kaggle.com/washingtonpost/police-shootings link. The Washington Post compiles and regularly updates a database of every fatal shooting in the United States by a police officer in the line of duty since January 1, 2015. It involves persons involved,place and circumstances related to an incident.
Note: I previously planned to utilize https://data.cityofchicago.org/Public-Safety/Crimes-2015/vwwp-7yr9 data for data analysis but due to sheer volume and issues with quality of data, I, with permission of Proessor Bradley Boehmke changed it to similar data set.
In this section, a brief description of the dataset has been provided. There 1851 observations and 14 variables in original table.
| Variable Name | Description | Data Type |
|---|---|---|
| id | Unique number of a criminal case | Int |
| name | Name of person shot | Chr |
| date | Date of incident | Date |
| manner_of_death | Manner of death | Chr |
| armed | Whether person who died was armed or not | Chr |
| age | Age of person who died | Int |
| gender | Gender | Chr |
| race | Race of person who died | Chr |
| city | City where incident occured | Chr |
| state | State where incident occured | Chr |
| signs_of_mental_illness | Whether person showed signs of mental illness | Chr |
| threat_level | Threat level perceived by police officer | Chr |
| flee | Method employed by person to flee from place of incident | Chr |
| body_camera | Whether body camera installed on police officer’s uniform | Chr |
fatal_shooting <- as_tibble(read_csv("Fatal_shooting.csv",col_names = TRUE))
names(fatal_shooting) <- c("incident_id", "name","incident_date","manner_of_death",
"armed","age","gender","race","city","state","signs_of_mental_illness",
"threat_level","fleeing_method","body_camera")
#Number of rows and variables
dim(fatal_shooting)
## [1] 1851 14
#Names of variables
names(fatal_shooting)
## [1] "incident_id" "name"
## [3] "incident_date" "manner_of_death"
## [5] "armed" "age"
## [7] "gender" "race"
## [9] "city" "state"
## [11] "signs_of_mental_illness" "threat_level"
## [13] "fleeing_method" "body_camera"
#Checking top and bottom values
head(fatal_shooting)
## # A tibble: 6 × 14
## incident_id name incident_date manner_of_death armed
## <int> <chr> <date> <chr> <chr>
## 1 3 Tim Elliot 2015-01-02 shot gun
## 2 4 Lewis Lee Lembke 2015-01-02 shot gun
## 3 5 John Paul Quintero 2015-01-03 shot and Tasered unarmed
## 4 8 Matthew Hoffman 2015-01-04 shot toy weapon
## 5 9 Michael Rodriguez 2015-01-04 shot nail gun
## 6 11 Kenneth Joe Brown 2015-01-04 shot gun
## # ... with 9 more variables: age <int>, gender <chr>, race <chr>,
## # city <chr>, state <chr>, signs_of_mental_illness <chr>,
## # threat_level <chr>, fleeing_method <chr>, body_camera <chr>
tail(fatal_shooting)
## # A tibble: 6 × 14
## incident_id name incident_date manner_of_death
## <int> <chr> <date> <chr>
## 1 2060 George Bush III 2016-11-21 shot
## 2 2064 Michael Giles 2016-11-21 shot
## 3 2065 TK TK 2016-11-22 shot
## 4 2066 Ivory C. Pantallion 2016-11-22 shot
## 5 2067 Frank Nathaniel Clark 2016-11-22 shot
## 6 2068 Tai Upchurch 2016-11-22 shot
## # ... with 10 more variables: armed <chr>, age <int>, gender <chr>,
## # race <chr>, city <chr>, state <chr>, signs_of_mental_illness <chr>,
## # threat_level <chr>, fleeing_method <chr>, body_camera <chr>
#Counting missing values
summary(fatal_shooting)
## incident_id name incident_date
## Min. : 3.0 Length:1851 Min. :2015-01-02
## 1st Qu.: 587.5 Class :character 1st Qu.:2015-06-28
## Median :1079.0 Mode :character Median :2015-12-10
## Mean :1075.4 Mean :2015-12-09
## 3rd Qu.:1580.5 3rd Qu.:2016-05-27
## Max. :2068.0 Max. :2016-11-22
##
## manner_of_death armed age gender
## Length:1851 Length:1851 Min. : 6.00 Length:1851
## Class :character Class :character 1st Qu.:26.00 Class :character
## Mode :character Mode :character Median :34.00 Mode :character
## Mean :36.52
## 3rd Qu.:45.00
## Max. :86.00
## NA's :36
## race city state
## Length:1851 Length:1851 Length:1851
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## signs_of_mental_illness threat_level fleeing_method
## Length:1851 Length:1851 Length:1851
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## body_camera
## Length:1851
## Class :character
## Mode :character
##
##
##
##
sum(is.na(fatal_shooting$armed))
## [1] 5
sum(is.na(fatal_shooting$age))
## [1] 36
sum(is.na(fatal_shooting$race))
## [1] 141
sum(is.na(fatal_shooting$fleeing_method))
## [1] 23
# See what all rows have incomplete data
fatal_shooting_new <- fatal_shooting
#Excluding missing values
fatal_shooting_new<- fatal_shooting_new[-which(is.na(fatal_shooting_new$armed)), ]
fatal_shooting_new<- fatal_shooting_new[-which(is.na(fatal_shooting_new$age)), ]
fatal_shooting_new<- fatal_shooting_new[-which(is.na(fatal_shooting_new$fleeing_method)), ]
#Replacing characters in race with proper names
fatal_shooting_new$Race[fatal_shooting_new$race =='A'] <- "Asian"
fatal_shooting_new$Race[fatal_shooting_new$race =='B'] <- "Black"
fatal_shooting_new$Race[fatal_shooting_new$race =='W'] <- "White"
fatal_shooting_new$Race[fatal_shooting_new$race =='H'] <- "Hispanic"
fatal_shooting_new$Race[fatal_shooting_new$race =='N'] <- "Native American"
fatal_shooting_new$Race[fatal_shooting_new$race =='O'] <- "Others"
fatal_shooting_new$race[which(is.na(fatal_shooting_new$race))] <- "Unknown"
#Checking Null values again
sum(is.na(fatal_shooting_new$armed))
## [1] 0
sum(is.na(fatal_shooting_new$age))
## [1] 0
sum(is.na(fatal_shooting_new$race))
## [1] 0
sum(is.na(fatal_shooting_new$fleeing_method))
## [1] 0
** There are 205 missing values in original data set. Out of 205 rows with missing values, I removed missing values for three variables - Armed, Age and Fleeing Method as number of such rows are insignificant in comparison to total rows. For analysis, I require rows with missing values of Race variable. For that, I replaced null values with “Unknown” which simply means Race of the person shot was not identified at the time of incident.As almost all variables are categorical variables so outout of summary function really doesn’t reveal aanything significant. Interestingly, there is unique Incident ID column which does not contain all the values but nevertheless it is a column containing unique values.**
fatal_shooting_new1 <- fatal_shooting_new
# Extracting month and year from Incident Date
fatal_shooting_new1$Month <- month(fatal_shooting_new1$incident_date)
fatal_shooting_new1$Year <- year(fatal_shooting_new1$incident_date)
# Ranking every proportion of Fleeing method for each month of 2015
fleeing_data <- fatal_shooting_new1 %>%
filter(Year ==2015) %>%
group_by(Month, add=T) %>%
mutate(count_month = n()) %>%
group_by(fleeing_method, add=T) %>%
mutate(fleeing_percentage = round(100 * (n()/count_month),2)) %>%
select(c(fleeing_method, Month, fleeing_percentage)) %>%
distinct(fleeing_method, Month, .keep_all = T) %>%
arrange(Month, desc(fleeing_percentage)) %>%
group_by(Month) %>%
mutate(ranking = rank(desc(fleeing_percentage)))
# Plotting graph
ggplot(fleeing_data,
aes(x = Month, y = ranking, colour = fleeing_method, group = fleeing_method)) +
scale_x_continuous(breaks = c(1:12),
labels = c("Jan", "Feb", "Mar", "Apr", "May",
"Jun", "July", "Aug", "Sept", "Oct",
"Nov", "Dec")) +
geom_line(lwd = 2.7, alpha = 0.7) +
geom_point(size = 13) +
geom_text(aes(label = paste0(round(fleeing_percentage,0), "%")), color = "white") +
labs(x = "Months", y = "Rank", title = "Fleeing Methods by Type") +
scale_y_continuous(trans = "reverse",labels = c("#1","#2","#3","#4")) +
scale_colour_hue(name = "Fleeing Method") +
theme(legend.position = "bottom",legend.key=element_rect(colour = "black"))
** We can clearly conclude that in 70-80% of cases when a person was shot, suspect was not fleeing the place of incident and this trend continues throughout entire year 2015. In remaining few cases, suspect tried to escape using car or by foot before getting shot. It becomes vital for concerned authorities to take note of this finding i.e what prompted police officer(s) to shoot the suspect.**
# Extracting month and year from Incident Date
fatal_shooting_new1$Month <- month(fatal_shooting_new1$incident_date)
fatal_shooting_new1$Year <- year(fatal_shooting_new1$incident_date)
gender_number <- fatal_shooting_new1 %>%
filter(format(incident_date, "%Y") == 2015) %>%
group_by(state,gender) %>%
summarise(gender_count = n()) %>%
select(state,gender,gender_count) %>%
group_by(state) %>%
mutate(gender_sum = sum(gender_count)) %>%
arrange(desc(gender_sum)) %>%
ungroup() %>%
mutate(Ranking = dense_rank(desc(gender_sum))) %>%
subset(Ranking <= 10) %>%
select(c(state,gender,gender_count,gender_sum))
### Visualize the data
ggplot(data = gender_number,aes(y = gender_count,x = reorder(state, -gender_count),fill = gender)) +
geom_bar(stat = "identity") +
labs(y = "Gender Count",
x = "State",
title = "Top 10 states with maximum shooting incidents") +
scale_y_continuous(breaks = waiver(),limits = c(0,200))
Top 10 states with shooting incidents are listed. Regardless of gender, California and Texas leads the way in witnessing maximum such incidents. I am not concluding here that these are most dangerous places if police shootings is the judging criteria. This is because we need to consider population of each state before coming to such conclusion. Moreover, even one big incident can abruptly increase number of fatalities occuring in a state. Casualties in females during such incidents are very less as compared to men
race_data <- fatal_shooting_new1 %>%
group_by(race) %>%
mutate(count_race = n()) %>%
distinct(race,count_race,threat_level)
treemap(race_data,
index = c("race","threat_level"), #list of your categorical variables
vSize = "count_race", #quantitative variable
type = "index",
palette = "RdYlGn", #Scolor palette
inflate.labels = FALSE,
title = "Race & Threat Level wise distinction of shooting incidents",
fontsize.title = 14,
mirror.x = FALSE,
mirror.y = FALSE
)
Maximum fatalities are seen by white people followed by Black and Hispanic population. Asian and Native Americans have bore the least in such incidents
text_mining <- Corpus(VectorSource(fatal_shooting_new1$armed))
text_mining <- tm_map(text_mining, content_transformer(tolower))
# Remove numbers
text_mining <- tm_map(text_mining, removeNumbers)
# Remove english common stopwords
text_mining <- tm_map(text_mining, removeWords, stopwords("english"))
# Remove your own stop word
# specify your stopwords as a character vector
text_mining <- tm_map(text_mining, removeWords, c("weapon","unknown","and"))
# Remove punctuations
text_mining <- tm_map(text_mining, removePunctuation)
# Eliminate extra white spaces
text_mining <- tm_map(text_mining, stripWhitespace)
document_matrix <- TermDocumentMatrix(text_mining)
a <- as.matrix(document_matrix)
b <- sort(rowSums(a),decreasing=TRUE)
c <- data.frame(word = names(b),freq=b)
head(c, 10)
## word freq
## gun gun 1008
## knife knife 261
## unarmed unarmed 137
## vehicle vehicle 110
## undetermined undetermined 77
## toy toy 69
## machete machete 16
## metal metal 12
## baseball baseball 7
## bat bat 7
set.seed(1234)
wordcloud(words = c$word, freq = c$freq, min.freq = 1,
max.words = 200, random.order = FALSE,scale = c(8,0.9), rot.per = 0.45,
colors = brewer.pal(8, "Dark2"))
We will try to understand how the people were armed when they were shot through pretty intersting word cloud. It picks up most commonly used words in text and displays it through varying sizes.Gun, knife and vehicle(escaping/hiding) top the chart in this regard. This is further corroborated by document matrix of the same text in which frequency of each word is displayed against the word
joint <- CrossTable(fatal_shooting_new1$manner_of_death,fatal_shooting_new1$threat_level,prop.chisq = FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 1790
##
##
## | fatal_shooting_new1$threat_level
## fatal_shooting_new1$manner_of_death | attack | other | undetermined | Row Total |
## ------------------------------------|--------------|--------------|--------------|--------------|
## shot | 1130 | 457 | 87 | 1674 |
## | 0.675 | 0.273 | 0.052 | 0.935 |
## | 0.954 | 0.891 | 0.946 | |
## | 0.631 | 0.255 | 0.049 | |
## ------------------------------------|--------------|--------------|--------------|--------------|
## shot and Tasered | 55 | 56 | 5 | 116 |
## | 0.474 | 0.483 | 0.043 | 0.065 |
## | 0.046 | 0.109 | 0.054 | |
## | 0.031 | 0.031 | 0.003 | |
## ------------------------------------|--------------|--------------|--------------|--------------|
## Column Total | 1185 | 513 | 92 | 1790 |
## | 0.662 | 0.287 | 0.051 | |
## ------------------------------------|--------------|--------------|--------------|--------------|
##
##
joint <- CrossTable(fatal_shooting_new1$signs_of_mental_illness,fatal_shooting_new1$threat_level,prop.chisq = FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 1790
##
##
## | fatal_shooting_new1$threat_level
## fatal_shooting_new1$signs_of_mental_illness | attack | other | undetermined | Row Total |
## --------------------------------------------|--------------|--------------|--------------|--------------|
## False | 903 | 356 | 81 | 1340 |
## | 0.674 | 0.266 | 0.060 | 0.749 |
## | 0.762 | 0.694 | 0.880 | |
## | 0.504 | 0.199 | 0.045 | |
## --------------------------------------------|--------------|--------------|--------------|--------------|
## True | 282 | 157 | 11 | 450 |
## | 0.627 | 0.349 | 0.024 | 0.251 |
## | 0.238 | 0.306 | 0.120 | |
## | 0.158 | 0.088 | 0.006 | |
## --------------------------------------------|--------------|--------------|--------------|--------------|
## Column Total | 1185 | 513 | 92 | 1790 |
## | 0.662 | 0.287 | 0.051 | |
## --------------------------------------------|--------------|--------------|--------------|--------------|
##
##
joint <- CrossTable(fatal_shooting_new1$race,fatal_shooting_new1$threat_level,prop.chisq = FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 1790
##
##
## | fatal_shooting_new1$threat_level
## fatal_shooting_new1$race | attack | other | undetermined | Row Total |
## -------------------------|--------------|--------------|--------------|--------------|
## A | 13 | 10 | 0 | 23 |
## | 0.565 | 0.435 | 0.000 | 0.013 |
## | 0.011 | 0.019 | 0.000 | |
## | 0.007 | 0.006 | 0.000 | |
## -------------------------|--------------|--------------|--------------|--------------|
## B | 300 | 115 | 28 | 443 |
## | 0.677 | 0.260 | 0.063 | 0.247 |
## | 0.253 | 0.224 | 0.304 | |
## | 0.168 | 0.064 | 0.016 | |
## -------------------------|--------------|--------------|--------------|--------------|
## H | 174 | 98 | 25 | 297 |
## | 0.586 | 0.330 | 0.084 | 0.166 |
## | 0.147 | 0.191 | 0.272 | |
## | 0.097 | 0.055 | 0.014 | |
## -------------------------|--------------|--------------|--------------|--------------|
## N | 13 | 8 | 1 | 22 |
## | 0.591 | 0.364 | 0.045 | 0.012 |
## | 0.011 | 0.016 | 0.011 | |
## | 0.007 | 0.004 | 0.001 | |
## -------------------------|--------------|--------------|--------------|--------------|
## O | 13 | 9 | 0 | 22 |
## | 0.591 | 0.409 | 0.000 | 0.012 |
## | 0.011 | 0.018 | 0.000 | |
## | 0.007 | 0.005 | 0.000 | |
## -------------------------|--------------|--------------|--------------|--------------|
## Unknown | 76 | 35 | 5 | 116 |
## | 0.655 | 0.302 | 0.043 | 0.065 |
## | 0.064 | 0.068 | 0.054 | |
## | 0.042 | 0.020 | 0.003 | |
## -------------------------|--------------|--------------|--------------|--------------|
## W | 596 | 238 | 33 | 867 |
## | 0.687 | 0.275 | 0.038 | 0.484 |
## | 0.503 | 0.464 | 0.359 | |
## | 0.333 | 0.133 | 0.018 | |
## -------------------------|--------------|--------------|--------------|--------------|
## Column Total | 1185 | 513 | 92 | 1790 |
## | 0.662 | 0.287 | 0.051 | |
## -------------------------|--------------|--------------|--------------|--------------|
##
##
The challenge in my case is that almost all variables are categorical, making it hard to apply correlation and understand the relationship among various variables. So, I relied on contigency table for three combinations to perform correlation. After seeing all the results, we can deduce that: a. Largest number of suspects with perceived level “Attack” were shot b. Use of shot and tasered is very less. Police department may want to see how to increase usage of tasers instead of real bullets c. Mental Illness is not the biggest factor for a police officer who tags threat level(Attack) to the suspect. Although, mental illness is still a big factor with 450 reported incidents d. Blacks and Whites have the highest number of casualty regardless of threat level
year_month <- fatal_shooting_new1 %>%
group_by(Year,Month) %>%
summarize(count = n())
ggplot(year_month, aes(Month, Year, fill = count)) +
geom_tile(color = "white", size = 0.1) +
scale_fill_viridis(name = "Number of #Killed") +
theme(legend.position = "bottom") +
labs( title = "Number of casualities per month and year(2015-2016)") +
geom_text(aes(label=count), color='white') +
scale_x_continuous(breaks = c(1:12),
labels = c("Jan", "Feb", "Mar", "Apr", "May",
"Jun", "July", "Aug", "Sept", "Oct",
"Nov", "Dec"))
July’15 witnessed maximum casualities and it is a worrying trend that shooting incidents are not decreasing since 2015 i.e number is consistently high
To analyze the data of police fatal shootings mainly for year 2015 to understand the ongoing trends and what can be done to address the problem.The entire data compiled by Washington post is retreived. It lists detail of every incident reported since 2015 and is used to accomplish our task. Data cleaning, data manipulation and graphical representation makes us view the data in variety of ways and we conclude few key findings: a. 70-80% of suspects were shot when they were not even trying to flee the scene. Circumstances in which such incidents happened can be seen in greater detail b. California and Texas have witnessed maximum incidents of police shootings. Police departments of each state must scrutinize these incidents on highest priority c. White people and Black people have the highest percentage casualty among all the races which is understanble considering their highest population d. In majority of cases, suspects were carrying gun or knife when they were shot. There is urgent need to make tougher laws to regulate sale of guns e. Use of taser guns compared to pistols by police officers is very less which is a concerning trend f. Number of shooting incidents remain consistently high for all the months of 2015-16