This project explores a dataset on fatal police shootings in the United States. Since January 1, 2015, The Washington Post has recorded every fatal shooting by an on-duty police officer in the United States, along with more than a dozen variables. Although the dataset covers a wide range of variables, this assignment focuses on the state and year in which each shooting took place and on the victim’s race. The dataset already meets the three formatting criteria, so no structural cleaning is needed: the file is a .csv, the variable names contain no spaces, and the variable names are all lowercase. We will, however, need to replace each race abbreviation with a full label, convert the date column to the Date class, and create a new column holding just the year of each incident.
Source: https://github.com/washingtonpost/data-police-shootings
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.2 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(tinytex)
library(dplyr)
library(RColorBrewer)
library(ggalluvial) # this is the improved alluvial package
setwd("/Users/KathyOchoa/Documents/DATA 110/Project 1")
policeShootings <- read.csv("fatal-police-shootings-data.csv")
head(policeShootings) # first 6 rows of the dataset
## id name date manner_of_death armed age gender race
## 1 3 Tim Elliot 2015-01-02 shot gun 53 M A
## 2 4 Lewis Lee Lembke 2015-01-02 shot gun 47 M W
## 3 5 John Paul Quintero 2015-01-03 shot and Tasered unarmed 23 M H
## 4 8 Matthew Hoffman 2015-01-04 shot toy weapon 32 M W
## 5 9 Michael Rodriguez 2015-01-04 shot nail gun 39 M H
## 6 11 Kenneth Joe Brown 2015-01-04 shot gun 18 M W
## city state signs_of_mental_illness threat_level flee
## 1 Shelton WA True attack Not fleeing
## 2 Aloha OR False attack Not fleeing
## 3 Wichita KS False other Not fleeing
## 4 San Francisco CA True attack Not fleeing
## 5 Evans CO False attack Not fleeing
## 6 Guthrie OK False attack Not fleeing
## body_camera longitude latitude is_geocoding_exact
## 1 False -123.122 47.247 True
## 2 False -122.892 45.487 True
## 3 False -97.281 37.695 True
## 4 False -122.422 37.763 True
## 5 False -104.692 40.384 True
## 6 False -97.423 35.877 True
policeShootings$race[policeShootings$race == "W"] <- "White, non-Hispanic"
policeShootings$race[policeShootings$race == "B"] <- "Black, non-Hispanic"
policeShootings$race[policeShootings$race == "A"] <- "Asian"
policeShootings$race[policeShootings$race == "N"] <- "Native American"
policeShootings$race[policeShootings$race == "H"] <- "Hispanic"
policeShootings$race[policeShootings$race == "O"] <- "Other"
policeShootings$race[policeShootings$race == ""] <- "Unknown"
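The seven assignments above can also be written as a single lookup. The following is only a sketch on a toy vector (`race_codes` is a made-up stand-in for `policeShootings$race`), and it differs from the code above in one respect: any code missing from the lookup, not just the empty string, ends up as "Unknown".

```r
# Toy vector standing in for policeShootings$race (values are illustrative)
race_codes <- c("W", "B", "", "H", "A", "N", "O", "W")

# Named lookup: abbreviation -> full label, using the same labels as above
race_labels <- c(
  W = "White, non-Hispanic",
  B = "Black, non-Hispanic",
  A = "Asian",
  N = "Native American",
  H = "Hispanic",
  O = "Other"
)

# Index the lookup by code; empty or unrecognized codes come back as NA
race_full <- unname(race_labels[race_codes])

# Treat everything that did not match as "Unknown"
race_full[is.na(race_full)] <- "Unknown"

race_full
```

Keeping the labels in one named vector means a new abbreviation only needs one new entry rather than another assignment line.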
class(policeShootings$date) # determine class of date
## [1] "character"
policeShootings$date <- as.Date(policeShootings$date, format = "%Y-%m-%d") # convert date from character to the Date class
class(policeShootings$date) # determine the new class of date
## [1] "Date"
addYear <- policeShootings %>%
mutate(year = format(date, "%Y")) #adds new column
Create a new dataset, by_race, from addYear: group by race and year, then count the incidents in each group.
by_race <- addYear %>%
group_by(race, year) %>% # group by race then year
summarise(count=n()) # and then count number of incidents
## `summarise()` has grouped output by 'race'. You can override using the
## `.groups` argument.
head(by_race)
## # A tibble: 6 × 3
## # Groups: race [1]
## race year count
## <chr> <chr> <int>
## 1 Asian 2015 15
## 2 Asian 2016 15
## 3 Asian 2017 16
## 4 Asian 2018 21
## 5 Asian 2019 20
## 6 Asian 2020 15
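As a side note on the grouping step, dplyr's `count()` gives the same tally in one call and returns an ungrouped result, which avoids the `.groups` message shown above. This is a sketch on a small made-up data frame (`toy` is hypothetical, not rows from the real dataset):

```r
library(dplyr)

# Toy stand-in for addYear (rows are illustrative, not from the real data)
toy <- data.frame(
  race = c("Asian", "Asian", "Hispanic", "Hispanic", "Hispanic"),
  year = c("2015", "2015", "2015", "2016", "2016")
)

# group_by() + summarise(), dropping the leftover grouping explicitly
by_group <- toy %>%
  group_by(race, year) %>%
  summarise(count = n(), .groups = "drop")

# count() groups and tallies in a single step
by_count <- toy %>%
  count(race, year, name = "count")

by_count
```

Either form should work for the plots that follow; the choice is mostly about whether the intermediate grouping is needed later.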
plot1 <- ggplot(by_race, aes(x = year, y = count, alluvium = race)) +
  geom_alluvium(aes(fill = race),
                color = "white",
                width = .1,
                alpha = .9,
                decreasing = FALSE) +
  # use a single fill scale: a second scale_fill_*() call would replace the palette
  scale_fill_brewer(palette = "Set1", name = "Race") +
  theme_minimal(base_size = 12) +
  ggtitle("Police shootings by race each year from 2015-2022\n") +
  ylab("Number of Shootings") +
  xlab("Year") +
  labs(caption = "Source: The Washington Post")
plot1
states <- addYear %>%
group_by(state, year) %>% # group by state then year
summarise(count = n()) %>% # and then count the number of incidents
arrange(desc(count)) # and then arrange in descending order
## `summarise()` has grouped output by 'state'. You can override using the
## `.groups` argument.
head(states, 20) # check the highest 20 entries
## # A tibble: 20 × 3
## # Groups: state [4]
## state year count
## <chr> <chr> <int>
## 1 CA 2015 190
## 2 CA 2017 160
## 3 CA 2020 147
## 4 CA 2021 141
## 5 CA 2016 139
## 6 CA 2019 135
## 7 CA 2018 116
## 8 TX 2019 108
## 9 TX 2015 100
## 10 TX 2021 95
## 11 FL 2020 93
## 12 CA 2022 92
## 13 TX 2018 84
## 14 TX 2020 83
## 15 TX 2022 82
## 16 TX 2016 81
## 17 TX 2017 69
## 18 FL 2018 64
## 19 FL 2019 64
## 20 AZ 2018 61
We see that the highest yearly counts belong to four states: California, Texas, Florida, and Arizona.
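The table above ranks state-year pairs, so one way to confirm a ranking of states overall is to sum the yearly counts per state first. A minimal sketch, assuming a table shaped like `states` (the data frame `toy_states` and its numbers are made up for illustration):

```r
library(dplyr)

# Toy state-year counts shaped like the `states` table (numbers are illustrative)
toy_states <- data.frame(
  state = c("CA", "CA", "TX", "TX", "AZ", "AZ"),
  year  = c("2015", "2016", "2015", "2016", "2015", "2016"),
  count = c(10L, 8L, 6L, 7L, 3L, 2L)
)

# Collapse the years: one row per state with its total across all years
state_totals <- toy_states %>%
  group_by(state) %>%
  summarise(total = sum(count), .groups = "drop") %>%
  arrange(desc(total))

# Top entries by overall total
head(state_totals, 2)
```

Applied to the real `states` table, `head(state_totals, 4)` would list the four states with the highest totals directly.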
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
plot2 <- states %>%
  filter(state %in% c("CA", "TX", "FL", "AZ")) %>%
  ggplot() +
  geom_bar(aes(x = year, y = count, fill = state), position = "dodge", stat = "identity") +
  labs(title = "4 States with the Highest Police Shootings",
       fill = "State") +
  theme_minimal(base_size = 14) +
  ylab("Number of Shootings") +
  xlab("Year")
plot2 <- ggplotly(plot2)
plot2
The first visualization is an alluvial plot of fatal police shootings from 2015 through 2022, categorized by race. Each race is assigned a color so that its yearly count can be compared with the other races as well as with the annual total. I initially hypothesized that shootings would be highest among Hispanic and Black, non-Hispanic victims. However, I was surprised to learn that White, non-Hispanic victims make up the largest group. I was also pleasantly surprised that, after 2020, the number of shootings decreased for every category except “Unknown”. On the other hand, I was disappointed to see the number of victims classified as “Unknown” increase after 2020, reaching its maximum in 2021. It makes me wonder whether classifying more victims as “Unknown” has been a deliberate attempt to conceal racial biases and racially motivated shootings.
The second visualization is a bar graph comparing, for each year from 2015 through 2022, the four states with the highest numbers of shootings. I created a subset of the addYear dataset, grouped it by state and year, and counted the incidents in each group. After arranging the entries in descending order, I determined that the four states with the highest numbers of fatal shootings are California (CA), Texas (TX), Florida (FL), and Arizona (AZ). The first three were not surprising, since they are the three most populous states. I was slightly surprised that the fourth state, Arizona, ranks among the highest rather than a state like New York, which has a much larger population.
While these two plots help in understanding the dataset, there are a few things I would have liked to incorporate. First, I would have liked to count how many shootings involved a victim who showed signs of mental illness, broken down by race. This information could have been plotted on a treemap with state or race as the index, the number of shootings set as vSize, and the number of victims with signs of mental illness set as vColor. Such a visualization could help show which states need more sensitivity training in mental health and which races are most affected by shootings related to mental illness. Second, I would have liked to create a heatmap with race on the vertical axis and, on the horizontal axis, the total number of shootings, the number of victims with signs of mental illness, and the number of victims who were not fleeing police. Lastly, the analysis would benefit from population by race and each state’s total population. Because this information is not included, the results from the two plots could be misleading. Although the first plot shows more shootings of White victims, this may simply reflect a larger White population. Similarly, the second plot shows the states with the most shootings without taking total population into account. States with larger populations are likely to have more police shootings, so their raw counts cannot be fairly compared with those of less populous states.
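To illustrate the population caveat, here is a rough sketch of how raw counts could be turned into a rate per million residents once population figures are available. Everything in it is hypothetical: the state codes, the incident totals, and the populations are made up, and real population values would have to come from an outside source such as census data.

```r
library(dplyr)

# Toy incident totals (illustrative numbers, not taken from the dataset)
toy_totals <- data.frame(
  state = c("AA", "BB", "CC"),
  shootings = c(120L, 90L, 40L)
)

# HYPOTHETICAL populations; real values are not part of the Post dataset
toy_population <- data.frame(
  state = c("AA", "BB", "CC"),
  population = c(40000000, 20000000, 5000000)
)

# Join the two tables, then scale counts to shootings per million residents
rates <- toy_totals %>%
  inner_join(toy_population, by = "state") %>%
  mutate(per_million = shootings / population * 1e6) %>%
  arrange(desc(per_million))

rates
```

In this toy example the state with the fewest shootings has the highest rate, which is exactly the kind of reversal that raw counts can hide.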