This project explores a dataset on fatal police shootings in the United States. Since January 1, 2015, The Washington Post has recorded every fatal shooting by an on-duty police officer in the United States, along with over a dozen variables. Of these, this assignment focuses on the state and year in which each shooting took place and the victim’s race. The dataset already meets the three criteria, so little cleaning is needed: the format is .csv, the variable names contain no spaces, and the variable names are already lowercase. However, we still need to replace each race abbreviation with its full label, convert the date column to the Date class, and create a new column containing just the year of each incident.

Source: https://github.com/washingtonpost/data-police-shootings

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6      ✔ purrr   0.3.4 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.2      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(tinytex)
library(dplyr)
library(RColorBrewer)
library(ggalluvial) # this is the improved alluvial package 

Read in the dataset

setwd("/Users/KathyOchoa/Documents/DATA 110/Project 1")
policeShootings <- read.csv("fatal-police-shootings-data.csv")
head(policeShootings) # first 6 rows of the dataset
##   id               name       date  manner_of_death      armed age gender race
## 1  3         Tim Elliot 2015-01-02             shot        gun  53      M    A
## 2  4   Lewis Lee Lembke 2015-01-02             shot        gun  47      M    W
## 3  5 John Paul Quintero 2015-01-03 shot and Tasered    unarmed  23      M    H
## 4  8    Matthew Hoffman 2015-01-04             shot toy weapon  32      M    W
## 5  9  Michael Rodriguez 2015-01-04             shot   nail gun  39      M    H
## 6 11  Kenneth Joe Brown 2015-01-04             shot        gun  18      M    W
##            city state signs_of_mental_illness threat_level        flee
## 1       Shelton    WA                    True       attack Not fleeing
## 2         Aloha    OR                   False       attack Not fleeing
## 3       Wichita    KS                   False        other Not fleeing
## 4 San Francisco    CA                    True       attack Not fleeing
## 5         Evans    CO                   False       attack Not fleeing
## 6       Guthrie    OK                   False       attack Not fleeing
##   body_camera longitude latitude is_geocoding_exact
## 1       False  -123.122   47.247               True
## 2       False  -122.892   45.487               True
## 3       False   -97.281   37.695               True
## 4       False  -122.422   37.763               True
## 5       False  -104.692   40.384               True
## 6       False   -97.423   35.877               True

Change the race abbreviation

policeShootings$race[policeShootings$race == "W"] <- "White, non-Hispanic"
policeShootings$race[policeShootings$race == "B"] <- "Black, non-Hispanic"
policeShootings$race[policeShootings$race == "A"] <- "Asian"
policeShootings$race[policeShootings$race == "N"] <- "Native American"
policeShootings$race[policeShootings$race == "H"] <- "Hispanic"
policeShootings$race[policeShootings$race == "O"] <- "Other"
policeShootings$race[policeShootings$race == ""] <- "Unknown"
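
The seven assignments above can also be written as a single lookup. The following is a minimal sketch on a toy data frame, not the code used for this report; `toy`, `race_labels`, and `race_full` are illustrative names that do not appear elsewhere in the analysis.

```r
# Toy data standing in for policeShootings$race (values are illustrative only)
toy <- data.frame(race = c("W", "B", "", "H", "A", "N", "O"),
                  stringsAsFactors = FALSE)

# Lookup table: abbreviation -> full label
race_labels <- c(
  W = "White, non-Hispanic",
  B = "Black, non-Hispanic",
  A = "Asian",
  N = "Native American",
  H = "Hispanic",
  O = "Other"
)

# match() returns NA for any code not in the lookup, including the empty string
toy$race_full <- unname(race_labels[match(toy$race, names(race_labels))])

# Treat every unmatched code as "Unknown"
toy$race_full[is.na(toy$race_full)] <- "Unknown"

print(toy)
```

Keeping the labels in one named vector means a new code only needs one new entry, and any unexpected code falls through to “Unknown” instead of staying as a raw abbreviation.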

Adjust the class of the date by converting from character format to date.

class(policeShootings$date) # determine class of date
## [1] "character"
policeShootings$date <- as.Date(policeShootings$date, format = "%Y-%m-%d") # convert the date from character to Date class
class(policeShootings$date) # determine the new class of date
## [1] "Date"
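
Because as.Date() returns NA for any string that does not match the given format, it can be worth counting how many values failed to parse. This is a small sketch on made-up strings, assuming the same "%Y-%m-%d" format; `raw_dates`, `parsed`, and `n_failed` are illustrative names.

```r
# Made-up input: two well-formed dates and one deliberately malformed value
raw_dates <- c("2015-01-02", "2015-01-03", "unknown")

# Same conversion as above; values that do not match the format become NA
parsed <- as.Date(raw_dates, format = "%Y-%m-%d")

# Count values that were present in the input but could not be parsed
n_failed <- sum(is.na(parsed) & !is.na(raw_dates))
print(n_failed)
```

A result of zero on the real column would confirm that every date was converted.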

Create a new column, year, holding the four-digit year of each incident, and store the result (using mutate) in a new dataset, addYear.

addYear <- policeShootings %>%
  mutate(year = format(date, "%Y")) # add the four-digit year as a character column

First plot

Create an alluvial plot that explores police shootings by race for each year from 2015 to 2022.

Create a new dataset, by_race, using data from addYear. Group by race and year, then count the incidents in each group.

by_race <- addYear %>%
  group_by(race, year) %>% # group by race then year
  summarise(count=n()) # and then count number of incidents
## `summarise()` has grouped output by 'race'. You can override using the
## `.groups` argument.
head(by_race)
## # A tibble: 6 × 3
## # Groups:   race [1]
##   race  year  count
##   <chr> <chr> <int>
## 1 Asian 2015     15
## 2 Asian 2016     15
## 3 Asian 2017     16
## 4 Asian 2018     21
## 5 Asian 2019     20
## 6 Asian 2020     15
plot1 <- ggplot(by_race, aes(x=year, y=count, alluvium=race)) +
  geom_alluvium(aes(fill = race), 
                color = "white",
                width = .1, 
                alpha = .9,
                decreasing = FALSE) +
  scale_fill_brewer(palette = "Set1", name = "Race") + # a single fill scale sets both the palette and the legend title
  theme_minimal(base_size = 12) +
  ggtitle("Police shootings by race each year from 2015-2022\n") +
  ylab("Number of Shootings") +
  xlab("Year") +
  labs(caption = "Source: The Washington Post")
plot1
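
The group-and-count step used to build by_race printed a message about grouped output. One way to get the same counts as an ungrouped table is dplyr's count(). The sketch below runs on a toy data frame with made-up rows; `toy` and `by_race_toy` are illustrative names.

```r
library(dplyr)

# Toy rows standing in for addYear (values are illustrative only)
toy <- data.frame(
  race = c("Asian", "Asian", "Hispanic", "Asian"),
  year = c("2015", "2015", "2015", "2016"),
  stringsAsFactors = FALSE
)

# count() groups, tallies, and ungroups in one step;
# name = "count" keeps the column name used in the plot code
by_race_toy <- toy %>%
  count(race, year, name = "count")

print(by_race_toy)
```

The result has one row per race-year pair and carries no grouping, so later steps such as arrange() behave the same way regardless of the earlier group_by().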

Second plot

Create a bar plot of the four states with the highest number of shootings.

states <- addYear %>%
  group_by(state, year) %>% # group by state then year
  summarise(count = n()) %>% # and then count the number of incidents
  arrange(desc(count)) # and then arrange in descending order
## `summarise()` has grouped output by 'state'. You can override using the
## `.groups` argument.
head(states, 20) # check the highest 20 entries
## # A tibble: 20 × 3
## # Groups:   state [4]
##    state year  count
##    <chr> <chr> <int>
##  1 CA    2015    190
##  2 CA    2017    160
##  3 CA    2020    147
##  4 CA    2021    141
##  5 CA    2016    139
##  6 CA    2019    135
##  7 CA    2018    116
##  8 TX    2019    108
##  9 TX    2015    100
## 10 TX    2021     95
## 11 FL    2020     93
## 12 CA    2022     92
## 13 TX    2018     84
## 14 TX    2020     83
## 15 TX    2022     82
## 16 TX    2016     81
## 17 TX    2017     69
## 18 FL    2018     64
## 19 FL    2019     64
## 20 AZ    2018     61

We see that the four states with the highest numbers of shootings are California, Texas, Florida, and Arizona.
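
The table above ranks state-year pairs, so a state can appear many times. A more direct way to pick the top four states is to total the incidents per state across all years. The sketch below uses a toy data frame whose counts are made up purely for illustration; `toy` and `top_states` are illustrative names.

```r
library(dplyr)

# Toy incident-level data: one row per shooting (counts are NOT real figures)
toy <- data.frame(
  state = rep(c("CA", "TX", "FL", "AZ", "NY"), times = c(5, 4, 3, 2, 1)),
  stringsAsFactors = FALSE
)

# Total incidents per state, sorted from highest to lowest, keeping the top four
top_states <- toy %>%
  count(state, sort = TRUE, name = "total") %>%
  slice_head(n = 4)

print(top_states)
```

Applied to addYear instead of the toy data, the same pipeline would give the all-years totals, and top_states$state could be passed to filter() when building the bar plot.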

library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
plot2 <- states %>%
  filter(state %in% c("CA", "TX", "FL", "AZ")) %>% # keep the four states identified above
  ggplot() + 
  geom_col(aes(x=year, y=count, fill=state), position = "dodge") + # geom_col() plots the counts as given
  labs(title = "Four States with the Most Fatal Police Shootings", 
    fill = "State") +
  theme_minimal(base_size = 14) +
  ylab("Number of Shootings") +
  xlab("Year")
plot2 <- ggplotly(plot2)
plot2

Essay

Visualization Representation

The first visualization is an alluvial plot of fatal police shootings from 2015 through 2022, categorized by race. Each race is assigned a color so that its number of shootings can be compared with those of other races, as well as with the annual total. I initially hypothesized that the numbers would be highest among Hispanic and Black, non-Hispanic victims. However, I was surprised to learn that the largest group of shooting victims has been White, non-Hispanic. I was also pleasantly surprised to see the number of shootings decrease after 2020 for every race except those categorized as “Unknown.” On the other hand, I was disappointed to see the number of victims classified as “Unknown” increase after 2020, reaching its maximum in 2021. It makes me wonder whether classifying more victims as “Unknown” has been a deliberate attempt to conceal racial bias and racially motivated shootings.

The second visualization is a bar graph comparing, for each year from 2015 through 2022, the four states with the highest numbers of shootings. I created a subset of the addYear dataset, grouped it by state and then year, and counted the number of incidents. After arranging the entries in descending order, I determined that the four states with the most fatal shootings are California (CA), Texas (TX), Florida (FL), and Arizona (AZ). The first three were not surprising, since they are the three most populous states. I was slightly surprised that the fourth state, Arizona, is among the states with the most shootings, rather than a state like New York with a much larger population.

Desired inclusions

While these two plots are helpful for understanding the dataset, there are a few things I would have liked to incorporate. First, I would have liked to count how many shootings involved a victim who showed signs of mental illness, along with the race of each of those victims. This information could be plotted on a treemap with state or race as the index, the number of shootings as vSize, and the number of victims with signs of mental illness as vColor. This visualization could help determine which states need more sensitivity training in mental health and which races are most affected by shootings involving mental illness. Second, I would have liked to create a heatmap with race on the vertical axis and, on the horizontal axis, the total number of shootings, the number of victims with signs of mental illness, and the number of victims who were not fleeing police. Lastly, including population by race as well as each state’s total population would improve the analysis. Without this information, the results of the two plots can be misleading. Although the first plot shows more shootings of White victims, this may simply reflect a larger White population compared with other races. Similarly, the second plot shows the states with the most shootings but does not take total population into account. States with larger populations are bound to have more police shootings, so raw counts do not allow a fair comparison with less populous states.
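
The population adjustment described above could be sketched as follows. Both the counts and the populations here are hypothetical placeholders chosen to make the arithmetic easy, not real figures, and `shootings`, `population`, `rates`, and `per_million` are illustrative names. A real analysis would need population estimates from an outside source.

```r
library(dplyr)

# Hypothetical shooting counts for two placeholder states
shootings <- data.frame(
  state = c("State A", "State B"),
  count = c(100, 30),
  stringsAsFactors = FALSE
)

# Hypothetical populations for the same placeholder states
population <- data.frame(
  state = c("State A", "State B"),
  population = c(40e6, 6e6),
  stringsAsFactors = FALSE
)

# Join the two tables and convert raw counts to shootings per million residents
rates <- shootings %>%
  left_join(population, by = "state") %>%
  mutate(per_million = count / population * 1e6) %>%
  arrange(desc(per_million))

print(rates)
```

In this toy example the state with fewer shootings has the higher rate (5 per million versus 2.5 per million), which is the kind of reversal that raw counts can hide.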