Introduction

The article I chose for this assignment is “Where Police Have Killed Americans In 2015”, written by Ben Casselman for FiveThirtyEight (https://fivethirtyeight.com/features/where-police-have-killed-americans-in-2015/).

This article is about the release of the Guardian’s interactive database of Americans killed by police in 2015. The data were compiled from a combination of media coverage, reader submissions, and open-source information. The Guardian then verified the incidents through its own reporting.

Analysis

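In this section, I load the data from GitHub, keep only the columns I need, remove rows with missing age or poverty values, convert those two columns to numeric types, and rename the columns.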

library(readr)

# retrieve the csv file from GitHub
urlfile <- "https://raw.githubusercontent.com/fivethirtyeight/data/master/police-killings/police_killings.csv"

# read_csv() accepts a URL directly
policekillings <- read_csv(urlfile, show_col_types = FALSE)

# subset the data into a smaller data frame
policekillings <- subset(policekillings, select=c("name", "age", "gender", "raceethnicity", "state", "h_income", "pov"))

# remove the rows that have "Unknown" for age
policekillings <- policekillings[policekillings$age != "Unknown", ]

# remove the rows that have "-" for poverty
policekillings <- policekillings[policekillings$pov != "-", ]

# change columns from characters to numeric
policekillings$pov <- as.numeric(policekillings$pov)
policekillings$age <- as.integer(policekillings$age)

# rename the columns
colnames(policekillings) <- c("Name", "Age", "Gender", "Race", "State", "HouseholdIncome", "PovertyRate")

policekillings <- data.frame(policekillings)

# show a glimpse of the data frame
head(policekillings)
##                 Name Age Gender            Race State HouseholdIncome
## 1 A'donte Washington  16   Male           Black    AL           51367
## 2     Aaron Rutledge  27   Male           White    LA           27972
## 3        Aaron Siler  26   Male           White    WI           45365
## 4       Aaron Valdez  25   Male Hispanic/Latino    CA           48295
## 5       Adam Jovicic  29   Male           White    OH           68785
## 6      Adam Reinhart  29   Male           White    AZ           20833
##   PovertyRate
## 1        14.1
## 2        28.8
## 3        14.6
## 4        11.7
## 5         1.9
## 6        58.0
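
The order of the cleaning steps matters: the placeholder strings are filtered out before the columns are converted, so the conversion never has to deal with them. As a minimal sketch of why, here is the same pattern on a small made-up data frame (the name `toy` and its values are hypothetical, not rows from the data set):

```r
# Toy data frame using the same placeholder strings as the real columns
# (values are made up for illustration)
toy <- data.frame(
  age = c("16", "Unknown", "27", "40"),
  pov = c("14.1", "28.8", "-", "1.9"),
  stringsAsFactors = FALSE
)

# Converting first would turn the placeholder into NA (with a warning)
converted_first <- suppressWarnings(as.integer(toy$age))
converted_first

# Filtering first leaves only rows that convert cleanly
toy_clean <- toy[toy$age != "Unknown" & toy$pov != "-", ]
toy_clean$age <- as.integer(toy_clean$age)
toy_clean$pov <- as.numeric(toy_clean$pov)

nrow(toy_clean)   # number of rows kept
anyNA(toy_clean)  # whether any missing values remain
```

Only the two rows with valid values in both columns survive, and no NA values are introduced by the conversion.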

In this next section, I use the above subset to count the killings in each poverty rate range, grouped by state. The counts are summarized in the table below.

library(gt)
library(dplyr)
# Show a table with each state's count of killings for each poverty level range

# Define the breakpoints for poverty rate categories
breaks <- seq(0, 100, by = 10)

# Label each category
custom_labels <- c(
  "0-10%", "10-20%", "20-30%", "30-40%", "40-50%",
  "50-60%", "60-70%", "70-80%", "80-90%", "90-100%"
)

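# Create a new column with poverty rate categories
# (include.lowest = TRUE keeps a rate of exactly 0 in the "0-10%" category instead of NA)
policekillings <- policekillings %>%
  mutate(pov_category = cut(PovertyRate, breaks = breaks, labels = custom_labels, include.lowest = TRUE))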

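# Group the data by state and poverty rate category, calculate counts
# (the result stays grouped by State, which gt uses to build its row groups)
summary_data <- policekillings %>%
  group_by(State, pov_category) %>%
  summarise(count = n(), .groups = "drop_last")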

# Create a gt table from the summarized data
policekillings_tbl <- gt(summary_data)

# Customize the table headers
policekillings_tbl <- policekillings_tbl |>
  tab_header(
    title = md("**Killings by Poverty Rate in Each State**")
  ) |>
  cols_label(
    State = "State", pov_category = md("**Poverty Rate Range**"), count = md("**Killings Count**")
  )

# Display the table
policekillings_tbl
Killings by Poverty Rate in Each State

State  Poverty Rate Range  Killings Count
AK     10-20%              1
AK     20-30%              1
AL     0-10%               3
AL     10-20%              2
AL     20-30%              2
AL     30-40%              1
AR     10-20%              2
AR     20-30%              1
AR     30-40%              1
AZ     0-10%               4
AZ     10-20%              9
AZ     20-30%              5
AZ     30-40%              4
AZ     40-50%              1
AZ     50-60%              2
CA     0-10%               18
CA     10-20%              22
CA     20-30%              19
CA     30-40%              9
CA     40-50%              2
CA     50-60%              3
CA     70-80%              1
CO     0-10%               3
CO     10-20%              3
CO     20-30%              5
CO     40-50%              1
CT     0-10%               1
DC     10-20%              1
DE     0-10%               1
DE     10-20%              1
FL     0-10%               5
FL     10-20%              13
FL     20-30%              2
FL     30-40%              5
FL     40-50%              2
FL     50-60%              2
GA     0-10%               5
GA     10-20%              6
GA     20-30%              2
GA     30-40%              1
GA     40-50%              2
HI     0-10%               2
HI     10-20%              1
HI     20-30%              1
IA     20-30%              1
IA     30-40%              1
ID     0-10%               2
ID     20-30%              1
ID     40-50%              1
IL     10-20%              4
IL     20-30%              3
IL     30-40%              4
IN     0-10%               1
IN     10-20%              2
IN     20-30%              4
IN     30-40%              1
KS     0-10%               2
KS     10-20%              1
KS     20-30%              1
KS     30-40%              1
KS     40-50%              1
KY     0-10%               1
KY     10-20%              3
KY     30-40%              2
KY     40-50%              1
LA     10-20%              2
LA     20-30%              2
LA     30-40%              2
LA     40-50%              2
LA     50-60%              2
MA     0-10%               3
MA     10-20%              1
MA     40-50%              1
MD     0-10%               3
MD     10-20%              3
MD     20-30%              1
MD     30-40%              1
MD     40-50%              1
MD     50-60%              1
ME     10-20%              1
MI     0-10%               4
MI     10-20%              2
MI     30-40%              1
MI     40-50%              1
MI     50-60%              1
MN     0-10%               2
MN     10-20%              2
MN     20-30%              1
MN     40-50%              1
MO     10-20%              3
MO     20-30%              2
MO     30-40%              3
MO     50-60%              1
MO     60-70%              1
MS     10-20%              3
MS     20-30%              1
MS     30-40%              1
MS     40-50%              1
MT     0-10%               1
MT     20-30%              1
NC     0-10%               1
NC     10-20%              2
NC     20-30%              4
NC     30-40%              2
NC     60-70%              1
NE     0-10%               1
NE     10-20%              4
NE     30-40%              1
NH     0-10%               1
NJ     0-10%               5
NJ     10-20%              5
NJ     30-40%              1
NM     0-10%               2
NM     10-20%              2
NM     20-30%              1
NV     10-20%              2
NV     20-30%              1
NY     0-10%               5
NY     10-20%              3
NY     20-30%              1
NY     30-40%              1
NY     40-50%              3
OH     0-10%               2
OH     10-20%              3
OH     20-30%              1
OH     30-40%              2
OH     40-50%              1
OK     0-10%               4
OK     10-20%              3
OK     20-30%              8
OK     30-40%              6
OK     50-60%              1
OR     0-10%               2
OR     10-20%              1
OR     20-30%              3
OR     30-40%              2
PA     0-10%               2
PA     10-20%              3
PA     30-40%              2
SC     0-10%               2
SC     10-20%              4
SC     20-30%              2
SC     40-50%              1
TN     0-10%               1
TN     10-20%              3
TN     30-40%              2
TX     0-10%               10
TX     10-20%              12
TX     20-30%              10
TX     30-40%              7
TX     40-50%              3
TX     50-60%              1
UT     0-10%               2
UT     10-20%              2
UT     20-30%              1
VA     0-10%               2
VA     10-20%              3
VA     20-30%              3
VA     30-40%              1
WA     10-20%              4
WA     20-30%              5
WA     30-40%              2
WI     10-20%              4
WI     20-30%              1
WV     10-20%              2
WY     0-10%               1
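
When reading the ranges in this table, it helps to know how `cut()` treats values that fall exactly on a breakpoint. By default its intervals are closed on the right, so a rate of exactly 10 belongs to "0-10%", and a rate of exactly 0 is only binned when `include.lowest = TRUE` is set. A minimal sketch with made-up rates (the names `example_rates`, `default_bins`, and `closed_bins` are only for illustration):

```r
# Same breakpoints as the analysis; labels built from the breakpoints
breaks <- seq(0, 100, by = 10)
labels <- paste0(head(breaks, -1), "-", tail(breaks, -1), "%")

# Made-up rates chosen to sit on or near the bin edges
example_rates <- c(0, 10, 10.1, 58)

# Default behavior: intervals are (0,10], (10,20], ... so 0 is left out
default_bins <- cut(example_rates, breaks = breaks, labels = labels)

# include.lowest = TRUE makes the first interval [0,10]
closed_bins <- cut(example_rates, breaks = breaks, labels = labels,
                   include.lowest = TRUE)

data.frame(rate = example_rates,
           default = default_bins,
           include_lowest = closed_bins)
```

In the default column the rate of 0 comes back as NA, while 10 lands in "0-10%" and 10.1 in "10-20%" in both columns.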

Findings and Recommendations

The article is very short and contains only a data table with a small subset of the data. If I wanted to extend the work in the article, I would add a few graphs to help readers visualize the data.

Recommendation 1

I would provide a graph showing the distribution of killings based on poverty rate.

library(ggplot2)

# Column names were already set in the Analysis section; reassigning seven names
# here would not cover the added pov_category column

# Create histogram for distribution of killings based on poverty rate
ggplot() +
  geom_histogram(data = policekillings, aes(x = PovertyRate), fill = "lightblue", color = "darkblue", binwidth = 5, alpha = 0.5) +
  labs(
    title = "Distribution of Killings Based on Poverty Rate",
    x = "Poverty Rate (%)",
    y = "Frequency"
  ) +
  scale_x_continuous(breaks = seq(0, 100, by = 10)) 

As you can see from the above graph, most of the killings occurred in areas where the poverty rate is between 5% and 25%. This is interesting in light of the article's statement: “One thing that’s clear from the data: Police killings tend to take place in neighborhoods that are poorer and blacker than the U.S. as a whole.” (*)

In the article, the author based this statement on the household income data. Let’s see if the household income provides a different distribution.
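
To put a number on how concentrated a histogram like this is, the share of values inside a range can be computed directly. This is only a sketch: `share_in_range` is a helper name I made up, and the six values are the poverty rates from the `head()` output earlier, not the full data set.

```r
# Hypothetical helper: fraction of values between lower and upper (inclusive)
share_in_range <- function(x, lower, upper) {
  mean(x >= lower & x <= upper, na.rm = TRUE)
}

# Poverty rates of the six rows shown by head() above
first_six_pov <- c(14.1, 28.8, 14.6, 11.7, 1.9, 58.0)

# Share of those six rows with a poverty rate between 5% and 25%
share_in_range(first_six_pov, 5, 25)
```

Three of the six rows fall in the range, so the call returns 0.5. Running `share_in_range(policekillings$PovertyRate, 5, 25)` would give the corresponding share for the whole cleaned data frame.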

Recommendation 2

I would provide a graph showing the distribution of killings based on household income.

# Column names were already set in the Analysis section, so they are not reassigned here

# Create histogram for distribution based on household income 
ggplot() +
  geom_histogram(data = policekillings, aes(x = HouseholdIncome), fill = "lightgreen", color = "darkgreen", binwidth = 10000, alpha = 0.5) +
  labs(
    title = "Distribution of Killings Based on Household Income",
    x = "Household Income ($)",
    y = "Frequency"
  ) +
  scale_x_continuous(breaks = seq(0, 140000, by = 15000)) 

As you can see from the above graph, most of the killings occurred in areas where the household income is lower (between $15,000 and $60,000). Looking at the data in this way, you could say the author was correct in their statement (*).
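
Since the two histograms look at poverty rate and household income separately, one way to check how the two measures relate is a rank correlation. The sketch below uses only the six rows from the `head()` output earlier, so it illustrates the method and is not a result for the data set; the variable names are my own.

```r
# Household income and poverty rate for the six rows shown by head() above
first_six_income <- c(51367, 27972, 45365, 48295, 68785, 20833)
first_six_pov <- c(14.1, 28.8, 14.6, 11.7, 1.9, 58.0)

# Spearman rank correlation: negative when higher income goes with lower poverty
rho <- cor(first_six_income, first_six_pov, method = "spearman")
rho
```

For these six rows the correlation is strongly negative. On the full data frame, the equivalent call would be `cor(policekillings$HouseholdIncome, policekillings$PovertyRate, method = "spearman", use = "complete.obs")`, where `use = "complete.obs"` is my assumption in case any income values are missing.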

Sources

FiveThirtyEight. Where Police Have Killed Americans In 2015. https://fivethirtyeight.com/features/where-police-have-killed-americans-in-2015/

FiveThirtyEight. Police Killings Data. https://github.com/fivethirtyeight/data/blob/master/police-killings