Synopsis

The report is created as part of “Data wrangling in R” final project to analyze police fatal shooting incidents in United states from 2015 onwards and to consider several key findings. This issue has become quite controversial in recent years in US and has been an ongoing topic of serious discussion due to various incidents happening across country. It is, therefore, imperative for us to know which areas and important factors are need to be looked at for these shooting incidents. It is important to mention here that purpose of this report is not to adjudicate anyone but only to analyze important findings. To accomplish this, I retrieved relevant data set and listed down all the variables whose values can help us understand various trends.During data cleaning, when deemded necessay, rows with missing values were either deleted or imputed.After that, scenarios were established by carefully looking at the data which are then implemented using appropriate graph.

Packages Required

Package(s) used in project to execute R code are mentioned below:

library(dplyr) ## Package for data manipulation
library(tidyverse) #Set of packages including dpylr and ggplot
library(ggplot2) #Package to produce complex multi-layered graphs in R
library(RSocrata) ## Provides easier interaction with Socrata open data portals
library(tibble) #Used for manipulating and printing tibbles
library(readr) #Read text files from disk
library(lubridate) #contains functions to work with data/time
library(viridis) #Use color maps for better analysis
library(stringr) ## used for regex and other string functions
library(tm) #Package to create wordcloud
library(SnowballC) #Package to create wordcloud
library(wordcloud) #Package to create wordcloud
library(RColorBrewer) #Provides color schemes to various figures in R
library(corrplot) #Package for graphical display of correlation matrix
library(treemap) #Offers flexibility to draw treemaps
library(gmodels) #Package to use CrossTable function for making contigency table

Data Source

Data is extracted from https://www.kaggle.com/washingtonpost/police-shootings link. The Washington Post compiles and regularly updates a database of every fatal shooting in the United States by a police officer in the line of duty since January 1, 2015. It involves persons involved,place and circumstances related to an incident.

Note: I previously planned to utilize https://data.cityofchicago.org/Public-Safety/Crimes-2015/vwwp-7yr9 data for data analysis but due to sheer volume and issues with quality of data, I, with permission of Proessor Bradley Boehmke changed it to similar data set.

Data Description

In this section, a brief description of the dataset has been provided. There 1851 observations and 14 variables in original table.

Variable Name Description Data Type
id Unique number of a criminal case Int
name Name of person shot Chr
date Date of incident Date
manner_of_death Manner of death Chr
armed Whether person who died was armed or not Chr
age Age of person who died Int
gender Gender Chr
race Race of person who died Chr
city City where incident occured Chr
state State where incident occured Chr
signs_of_mental_illness Whether person showed signs of mental illness Chr
threat_level Threat level perceived by police officer Chr
flee Method employed by person to flee from place of incident Chr
body_camera Whether body camera installed on police officer’s uniform Chr

Importing And Cleaning data

fatal_shooting <- as_tibble(read_csv("Fatal_shooting.csv",col_names = TRUE))
names(fatal_shooting) <- c("incident_id", "name","incident_date","manner_of_death",
                           "armed","age","gender","race","city","state","signs_of_mental_illness",
                           "threat_level","fleeing_method","body_camera")

#Number of rows and variables
dim(fatal_shooting)
## [1] 1851   14
#Names of variables
names(fatal_shooting)
##  [1] "incident_id"             "name"                   
##  [3] "incident_date"           "manner_of_death"        
##  [5] "armed"                   "age"                    
##  [7] "gender"                  "race"                   
##  [9] "city"                    "state"                  
## [11] "signs_of_mental_illness" "threat_level"           
## [13] "fleeing_method"          "body_camera"
#Checking top and bottom values
head(fatal_shooting)
## # A tibble: 6 × 14
##   incident_id               name incident_date  manner_of_death      armed
##         <int>              <chr>        <date>            <chr>      <chr>
## 1           3         Tim Elliot    2015-01-02             shot        gun
## 2           4   Lewis Lee Lembke    2015-01-02             shot        gun
## 3           5 John Paul Quintero    2015-01-03 shot and Tasered    unarmed
## 4           8    Matthew Hoffman    2015-01-04             shot toy weapon
## 5           9  Michael Rodriguez    2015-01-04             shot   nail gun
## 6          11  Kenneth Joe Brown    2015-01-04             shot        gun
## # ... with 9 more variables: age <int>, gender <chr>, race <chr>,
## #   city <chr>, state <chr>, signs_of_mental_illness <chr>,
## #   threat_level <chr>, fleeing_method <chr>, body_camera <chr>
tail(fatal_shooting)
## # A tibble: 6 × 14
##   incident_id                  name incident_date manner_of_death
##         <int>                 <chr>        <date>           <chr>
## 1        2060       George Bush III    2016-11-21            shot
## 2        2064         Michael Giles    2016-11-21            shot
## 3        2065                 TK TK    2016-11-22            shot
## 4        2066   Ivory C. Pantallion    2016-11-22            shot
## 5        2067 Frank Nathaniel Clark    2016-11-22            shot
## 6        2068          Tai Upchurch    2016-11-22            shot
## # ... with 10 more variables: armed <chr>, age <int>, gender <chr>,
## #   race <chr>, city <chr>, state <chr>, signs_of_mental_illness <chr>,
## #   threat_level <chr>, fleeing_method <chr>, body_camera <chr>
#Counting missing values
summary(fatal_shooting)
##   incident_id         name           incident_date       
##  Min.   :   3.0   Length:1851        Min.   :2015-01-02  
##  1st Qu.: 587.5   Class :character   1st Qu.:2015-06-28  
##  Median :1079.0   Mode  :character   Median :2015-12-10  
##  Mean   :1075.4                      Mean   :2015-12-09  
##  3rd Qu.:1580.5                      3rd Qu.:2016-05-27  
##  Max.   :2068.0                      Max.   :2016-11-22  
##                                                          
##  manner_of_death       armed                age           gender         
##  Length:1851        Length:1851        Min.   : 6.00   Length:1851       
##  Class :character   Class :character   1st Qu.:26.00   Class :character  
##  Mode  :character   Mode  :character   Median :34.00   Mode  :character  
##                                        Mean   :36.52                     
##                                        3rd Qu.:45.00                     
##                                        Max.   :86.00                     
##                                        NA's   :36                        
##      race               city              state          
##  Length:1851        Length:1851        Length:1851       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##  signs_of_mental_illness threat_level       fleeing_method    
##  Length:1851             Length:1851        Length:1851       
##  Class :character        Class :character   Class :character  
##  Mode  :character        Mode  :character   Mode  :character  
##                                                               
##                                                               
##                                                               
##                                                               
##  body_camera       
##  Length:1851       
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 
sum(is.na(fatal_shooting$armed))
## [1] 5
sum(is.na(fatal_shooting$age))
## [1] 36
sum(is.na(fatal_shooting$race))
## [1] 141
sum(is.na(fatal_shooting$fleeing_method))
## [1] 23
# See what all rows have incomplete data

fatal_shooting_new <- fatal_shooting

#Excluding missing values
fatal_shooting_new<- fatal_shooting_new[-which(is.na(fatal_shooting_new$armed)), ]
fatal_shooting_new<- fatal_shooting_new[-which(is.na(fatal_shooting_new$age)), ]
fatal_shooting_new<- fatal_shooting_new[-which(is.na(fatal_shooting_new$fleeing_method)), ]

#Replacing characters in race with proper names
fatal_shooting_new$Race[fatal_shooting_new$race =='A'] <- "Asian"
fatal_shooting_new$Race[fatal_shooting_new$race =='B'] <- "Black"
fatal_shooting_new$Race[fatal_shooting_new$race =='W'] <- "White"
fatal_shooting_new$Race[fatal_shooting_new$race =='H'] <- "Hispanic"
fatal_shooting_new$Race[fatal_shooting_new$race =='N'] <- "Native American"
fatal_shooting_new$Race[fatal_shooting_new$race =='O'] <- "Others"

fatal_shooting_new$race[which(is.na(fatal_shooting_new$race))] <- "Unknown"

#Checking Null values again
sum(is.na(fatal_shooting_new$armed))
## [1] 0
sum(is.na(fatal_shooting_new$age))
## [1] 0
sum(is.na(fatal_shooting_new$race)) 
## [1] 0
sum(is.na(fatal_shooting_new$fleeing_method))
## [1] 0

** There are 205 missing values in original data set. Out of 205 rows with missing values, I removed missing values for three variables - Armed, Age and Fleeing Method as number of such rows are insignificant in comparison to total rows. For analysis, I require rows with missing values of Race variable. For that, I replaced null values with “Unknown” which simply means Race of the person shot was not identified at the time of incident.As almost all variables are categorical variables so outout of summary function really doesn’t reveal aanything significant. Interestingly, there is unique Incident ID column which does not contain all the values but nevertheless it is a column containing unique values.**

Data Analysis

Analysis 1

fatal_shooting_new1 <- fatal_shooting_new

# Extracting month and year from Incident Date
fatal_shooting_new1$Month <- month(fatal_shooting_new1$incident_date)
fatal_shooting_new1$Year <- year(fatal_shooting_new1$incident_date)

# Ranking every proportion of Fleeing method for each month of 2015
fleeing_data <- fatal_shooting_new1 %>%
                filter(Year ==2015) %>%
                group_by(Month, add=T) %>%
                mutate(count_month = n()) %>%
                group_by(fleeing_method, add=T) %>%
                mutate(fleeing_percentage = round(100 * (n()/count_month),2)) %>%
                select(c(fleeing_method, Month, fleeing_percentage)) %>%
                distinct(fleeing_method, Month, .keep_all = T) %>%
                arrange(Month, desc(fleeing_percentage)) %>%
                group_by(Month) %>%
                mutate(ranking = rank(desc(fleeing_percentage))) 


# Plotting graph
ggplot(fleeing_data, 
       aes(x = Month, y = ranking, colour = fleeing_method, group = fleeing_method)) + 
       scale_x_continuous(breaks = c(1:12),
                          labels = c("Jan", "Feb", "Mar", "Apr", "May",
                                     "Jun", "July", "Aug", "Sept", "Oct", 
                                     "Nov", "Dec")) +
       geom_line(lwd = 2.7, alpha = 0.7) +
       geom_point(size = 13) + 
       geom_text(aes(label = paste0(round(fleeing_percentage,0), "%")), color = "white") +
       labs(x = "Months", y = "Rank", title = "Fleeing Methods by Type") +
       scale_y_continuous(trans = "reverse",labels = c("#1","#2","#3","#4")) +
       scale_colour_hue(name = "Fleeing Method") +
       theme(legend.position = "bottom",legend.key=element_rect(colour = "black"))

** We can clearly conclude that in 70-80% of cases when a person was shot, suspect was not fleeing the place of incident and this trend continues throughout entire year 2015. In remaining few cases, suspect tried to escape using car or by foot before getting shot. It becomes vital for concerned authorities to take note of this finding i.e what prompted police officer(s) to shoot the suspect.**

Analysis 2

# Extracting month and year from Incident Date
fatal_shooting_new1$Month <- month(fatal_shooting_new1$incident_date)
fatal_shooting_new1$Year <- year(fatal_shooting_new1$incident_date)

gender_number <- fatal_shooting_new1 %>%
                 filter(format(incident_date, "%Y") == 2015) %>%
                 group_by(state,gender) %>%
                 summarise(gender_count = n()) %>%
                 select(state,gender,gender_count) %>%
                 group_by(state) %>%
                 mutate(gender_sum = sum(gender_count)) %>%
                 arrange(desc(gender_sum)) %>%
                 ungroup() %>%
                 mutate(Ranking = dense_rank(desc(gender_sum))) %>%
                 subset(Ranking <= 10) %>%
                 select(c(state,gender,gender_count,gender_sum))
  
### Visualize the data
ggplot(data = gender_number,aes(y = gender_count,x = reorder(state, -gender_count),fill = gender)) + 
geom_bar(stat = "identity") + 
labs(y = "Gender Count",
     x = "State",
     title = "Top 10 states with maximum shooting incidents") +
scale_y_continuous(breaks = waiver(),limits = c(0,200))

Top 10 states with shooting incidents are listed. Regardless of gender, California and Texas leads the way in witnessing maximum such incidents. I am not concluding here that these are most dangerous places if police shootings is the judging criteria. This is because we need to consider population of each state before coming to such conclusion. Moreover, even one big incident can abruptly increase number of fatalities occuring in a state. Casualties in females during such incidents are very less as compared to men

Analysis 3

race_data <- fatal_shooting_new1 %>%
              group_by(race) %>%
              mutate(count_race = n()) %>%
              distinct(race,count_race,threat_level)

treemap(race_data, 
        index = c("race","threat_level"),  #list of your categorical variables
        vSize = "count_race",  #quantitative variable
        type = "index", 
        palette = "RdYlGn", #Scolor palette
        inflate.labels = FALSE,
        title = "Race & Threat Level wise distinction of shooting incidents",
        fontsize.title = 14,
        mirror.x = FALSE,
        mirror.y = FALSE
)

Maximum fatalities are seen by white people followed by Black and Hispanic population. Asian and Native Americans have bore the least in such incidents

Analysis 4

text_mining <- Corpus(VectorSource(fatal_shooting_new1$armed))

text_mining <- tm_map(text_mining, content_transformer(tolower))
# Remove numbers
text_mining <- tm_map(text_mining, removeNumbers)
# Remove english common stopwords
text_mining <- tm_map(text_mining, removeWords, stopwords("english"))
# Remove your own stop word
# specify your stopwords as a character vector
text_mining <- tm_map(text_mining, removeWords, c("weapon","unknown","and")) 
# Remove punctuations
text_mining <- tm_map(text_mining, removePunctuation)
# Eliminate extra white spaces
text_mining <- tm_map(text_mining, stripWhitespace)

document_matrix <- TermDocumentMatrix(text_mining)
a <- as.matrix(document_matrix)
b <- sort(rowSums(a),decreasing=TRUE)
c <- data.frame(word = names(b),freq=b)
head(c, 10)
##                      word freq
## gun                   gun 1008
## knife               knife  261
## unarmed           unarmed  137
## vehicle           vehicle  110
## undetermined undetermined   77
## toy                   toy   69
## machete           machete   16
## metal               metal   12
## baseball         baseball    7
## bat                   bat    7
set.seed(1234)
wordcloud(words = c$word, freq = c$freq, min.freq = 1,
          max.words = 200, random.order = FALSE,scale = c(8,0.9), rot.per = 0.45, 
          colors = brewer.pal(8, "Dark2"))

We will try to understand how the people were armed when they were shot through pretty intersting word cloud. It picks up most commonly used words in text and displays it through varying sizes.Gun, knife and vehicle(escaping/hiding) top the chart in this regard. This is further corroborated by document matrix of the same text in which frequency of each word is displayed against the word

Analysis 5

joint <- CrossTable(fatal_shooting_new1$manner_of_death,fatal_shooting_new1$threat_level,prop.chisq = FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  1790 
## 
##  
##                                     | fatal_shooting_new1$threat_level 
## fatal_shooting_new1$manner_of_death |       attack |        other | undetermined |    Row Total | 
## ------------------------------------|--------------|--------------|--------------|--------------|
##                                shot |         1130 |          457 |           87 |         1674 | 
##                                     |        0.675 |        0.273 |        0.052 |        0.935 | 
##                                     |        0.954 |        0.891 |        0.946 |              | 
##                                     |        0.631 |        0.255 |        0.049 |              | 
## ------------------------------------|--------------|--------------|--------------|--------------|
##                    shot and Tasered |           55 |           56 |            5 |          116 | 
##                                     |        0.474 |        0.483 |        0.043 |        0.065 | 
##                                     |        0.046 |        0.109 |        0.054 |              | 
##                                     |        0.031 |        0.031 |        0.003 |              | 
## ------------------------------------|--------------|--------------|--------------|--------------|
##                        Column Total |         1185 |          513 |           92 |         1790 | 
##                                     |        0.662 |        0.287 |        0.051 |              | 
## ------------------------------------|--------------|--------------|--------------|--------------|
## 
## 
joint <- CrossTable(fatal_shooting_new1$signs_of_mental_illness,fatal_shooting_new1$threat_level,prop.chisq = FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  1790 
## 
##  
##                                             | fatal_shooting_new1$threat_level 
## fatal_shooting_new1$signs_of_mental_illness |       attack |        other | undetermined |    Row Total | 
## --------------------------------------------|--------------|--------------|--------------|--------------|
##                                       False |          903 |          356 |           81 |         1340 | 
##                                             |        0.674 |        0.266 |        0.060 |        0.749 | 
##                                             |        0.762 |        0.694 |        0.880 |              | 
##                                             |        0.504 |        0.199 |        0.045 |              | 
## --------------------------------------------|--------------|--------------|--------------|--------------|
##                                        True |          282 |          157 |           11 |          450 | 
##                                             |        0.627 |        0.349 |        0.024 |        0.251 | 
##                                             |        0.238 |        0.306 |        0.120 |              | 
##                                             |        0.158 |        0.088 |        0.006 |              | 
## --------------------------------------------|--------------|--------------|--------------|--------------|
##                                Column Total |         1185 |          513 |           92 |         1790 | 
##                                             |        0.662 |        0.287 |        0.051 |              | 
## --------------------------------------------|--------------|--------------|--------------|--------------|
## 
## 
joint <- CrossTable(fatal_shooting_new1$race,fatal_shooting_new1$threat_level,prop.chisq = FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  1790 
## 
##  
##                          | fatal_shooting_new1$threat_level 
## fatal_shooting_new1$race |       attack |        other | undetermined |    Row Total | 
## -------------------------|--------------|--------------|--------------|--------------|
##                        A |           13 |           10 |            0 |           23 | 
##                          |        0.565 |        0.435 |        0.000 |        0.013 | 
##                          |        0.011 |        0.019 |        0.000 |              | 
##                          |        0.007 |        0.006 |        0.000 |              | 
## -------------------------|--------------|--------------|--------------|--------------|
##                        B |          300 |          115 |           28 |          443 | 
##                          |        0.677 |        0.260 |        0.063 |        0.247 | 
##                          |        0.253 |        0.224 |        0.304 |              | 
##                          |        0.168 |        0.064 |        0.016 |              | 
## -------------------------|--------------|--------------|--------------|--------------|
##                        H |          174 |           98 |           25 |          297 | 
##                          |        0.586 |        0.330 |        0.084 |        0.166 | 
##                          |        0.147 |        0.191 |        0.272 |              | 
##                          |        0.097 |        0.055 |        0.014 |              | 
## -------------------------|--------------|--------------|--------------|--------------|
##                        N |           13 |            8 |            1 |           22 | 
##                          |        0.591 |        0.364 |        0.045 |        0.012 | 
##                          |        0.011 |        0.016 |        0.011 |              | 
##                          |        0.007 |        0.004 |        0.001 |              | 
## -------------------------|--------------|--------------|--------------|--------------|
##                        O |           13 |            9 |            0 |           22 | 
##                          |        0.591 |        0.409 |        0.000 |        0.012 | 
##                          |        0.011 |        0.018 |        0.000 |              | 
##                          |        0.007 |        0.005 |        0.000 |              | 
## -------------------------|--------------|--------------|--------------|--------------|
##                  Unknown |           76 |           35 |            5 |          116 | 
##                          |        0.655 |        0.302 |        0.043 |        0.065 | 
##                          |        0.064 |        0.068 |        0.054 |              | 
##                          |        0.042 |        0.020 |        0.003 |              | 
## -------------------------|--------------|--------------|--------------|--------------|
##                        W |          596 |          238 |           33 |          867 | 
##                          |        0.687 |        0.275 |        0.038 |        0.484 | 
##                          |        0.503 |        0.464 |        0.359 |              | 
##                          |        0.333 |        0.133 |        0.018 |              | 
## -------------------------|--------------|--------------|--------------|--------------|
##             Column Total |         1185 |          513 |           92 |         1790 | 
##                          |        0.662 |        0.287 |        0.051 |              | 
## -------------------------|--------------|--------------|--------------|--------------|
## 
## 

The challenge in my case is that almost all variables are categorical, making it hard to apply correlation and understand the relationship among various variables. So, I relied on contigency table for three combinations to perform correlation. After seeing all the results, we can deduce that: a. Largest number of suspects with perceived level “Attack” were shot b. Use of shot and tasered is very less. Police department may want to see how to increase usage of tasers instead of real bullets c. Mental Illness is not the biggest factor for a police officer who tags threat level(Attack) to the suspect. Although, mental illness is still a big factor with 450 reported incidents d. Blacks and Whites have the highest number of casualty regardless of threat level

Analysis 6

year_month <- fatal_shooting_new1 %>%
              group_by(Year,Month) %>%
              summarize(count = n())

ggplot(year_month, aes(Month, Year, fill = count)) + 
geom_tile(color = "white", size = 0.1) + 
scale_fill_viridis(name = "Number of #Killed") + 
theme(legend.position = "bottom") +
labs( title = "Number of casualities per month and year(2015-2016)") +
geom_text(aes(label=count), color='white') +
scale_x_continuous(breaks = c(1:12),
                   labels = c("Jan", "Feb", "Mar", "Apr", "May",
                              "Jun", "July", "Aug", "Sept", "Oct", 
                             "Nov", "Dec"))

July’15 witnessed maximum casualities and it is a worrying trend that shooting incidents are not decreasing since 2015 i.e number is consistently high

Summary

To analyze the data of police fatal shootings mainly for year 2015 to understand the ongoing trends and what can be done to address the problem.The entire data compiled by Washington post is retreived. It lists detail of every incident reported since 2015 and is used to accomplish our task. Data cleaning, data manipulation and graphical representation makes us view the data in variety of ways and we conclude few key findings: a. 70-80% of suspects were shot when they were not even trying to flee the scene. Circumstances in which such incidents happened can be seen in greater detail b. California and Texas have witnessed maximum incidents of police shootings. Police departments of each state must scrutinize these incidents on highest priority c. White people and Black people have the highest percentage casualty among all the races which is understanble considering their highest population d. In majority of cases, suspects were carrying gun or knife when they were shot. There is urgent need to make tougher laws to regulate sale of guns e. Use of taser guns compared to pistols by police officers is very less which is a concerning trend f. Number of shooting incidents remain consistently high for all the months of 2015-16