The article I chose for this assignment is “Where Police Have Killed Americans In 2015” by Ben Casselman (https://fivethirtyeight.com/features/where-police-have-killed-americans-in-2015/).
The article covers the release of The Guardian’s interactive database of Americans killed by police in 2015. The data was compiled from a combination of media coverage, reader submissions, and open-source information. The Guardian then verified the incidents through its own reporting.
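In this section, I wrangle the data: I load the CSV file, keep the columns of interest, drop rows with an unknown age or a missing poverty rate, convert those columns to numeric types, and rename the columns.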
library(readr)
# retrieve the csv file from GitHub
urlfile = "https://raw.githubusercontent.com/fivethirtyeight/data/master/police-killings/police_killings.csv"
policekillings <- read_csv(url(urlfile), show_col_types = FALSE)
# subset the data into a smaller data frame
policekillings <- subset(policekillings, select=c("name", "age", "gender", "raceethnicity", "state", "h_income", "pov"))
# remove the rows that have "unknown" for age
policekillings <- policekillings[policekillings$age != "Unknown", ]
# remove the rows that have "-" for poverty
policekillings <- policekillings[policekillings$pov != "-", ]
# change columns from characters to numeric
policekillings$pov <- as.numeric(policekillings$pov)
policekillings$age <- as.integer(policekillings$age)
# rename the columns
colnames(policekillings) <- c("Name", "Age", "Gender", "Race", "State", "HouseholdIncome", "PovertyRate")
policekillings <- data.frame(policekillings)
# show a glimpse of the data frame
head(policekillings)
## Name Age Gender Race State HouseholdIncome
## 1 A'donte Washington 16 Male Black AL 51367
## 2 Aaron Rutledge 27 Male White LA 27972
## 3 Aaron Siler 26 Male White WI 45365
## 4 Aaron Valdez 25 Male Hispanic/Latino CA 48295
## 5 Adam Jovicic 29 Male White OH 68785
## 6 Adam Reinhart 29 Male White AZ 20833
## PovertyRate
## 1 14.1
## 2 28.8
## 3 14.6
## 4 11.7
## 5 1.9
## 6 58.0
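The filter-then-convert order used above matters because `as.numeric()` and `as.integer()` turn placeholder strings such as "Unknown" or "-" into `NA` with a warning. The following is a minimal sketch on a made-up data frame (the `toy` values are hypothetical, not taken from the real file) that repeats the same steps and also counts how many rows the filters remove, which is worth reporting alongside the results.

```r
# Toy data: hypothetical values that only mimic the placeholders in the real file
toy <- data.frame(
  name = c("A", "B", "C", "D"),
  age  = c("16", "Unknown", "27", "40"),
  pov  = c("14.1", "28.8", "-", "1.9"),
  stringsAsFactors = FALSE
)

# Same approach as above: drop rows with placeholder values first ...
clean <- toy[toy$age != "Unknown" & toy$pov != "-", ]

# ... then convert the remaining character columns
clean$age <- as.integer(clean$age)
clean$pov <- as.numeric(clean$pov)

# Number of rows removed by the filters
rows_dropped <- nrow(toy) - nrow(clean)
rows_dropped
```

Running the same row count on `policekillings` before and after the filtering step would show how much of the original data the later table and graphs actually cover.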
Next, I use this subset to count the killings in each 10-point poverty rate range, grouped by state. The table below shows the result.
library(gt)
library(dplyr)
## show a table with each state's count of killings for each poverty level range
# Define the breakpoints for poverty rate categories
breaks <- seq(0, 100, by = 10)
# Label each category
custom_labels <- c(
"0-10%", "10-20%", "20-30%", "30-40%", "40-50%",
"50-60%", "60-70%", "70-80%", "80-90%", "90-100%"
)
# Create a new column with poverty rate categories
policekillings <- policekillings %>%
  mutate(pov_category = cut(PovertyRate, breaks = breaks, labels = custom_labels))
# Group the data by state and poverty rate category, calculate counts
# (the result stays grouped by State, which gt uses as row groups)
summary_data <- policekillings %>%
  group_by(State, pov_category) %>%
  summarise(count = n())
# Create a gt table from the summarized data
policekillings_tbl <- gt(summary_data)
# Customize the table headers
policekillings_tbl <- policekillings_tbl |>
  tab_header(
    title = md("**Killings by Poverty Rate in Each State**")
  ) |>
  cols_label(
    State = "State",
    pov_category = md("**Poverty Rate Range**"),
    count = md("**Killings Count**")
  )
# Display the table
policekillings_tbl
Killings by Poverty Rate in Each State

Poverty Rate Range | Killings Count |
---|---|
AK | |
10-20% | 1 |
20-30% | 1 |
AL | |
0-10% | 3 |
10-20% | 2 |
20-30% | 2 |
30-40% | 1 |
AR | |
10-20% | 2 |
20-30% | 1 |
30-40% | 1 |
AZ | |
0-10% | 4 |
10-20% | 9 |
20-30% | 5 |
30-40% | 4 |
40-50% | 1 |
50-60% | 2 |
CA | |
0-10% | 18 |
10-20% | 22 |
20-30% | 19 |
30-40% | 9 |
40-50% | 2 |
50-60% | 3 |
70-80% | 1 |
CO | |
0-10% | 3 |
10-20% | 3 |
20-30% | 5 |
40-50% | 1 |
CT | |
0-10% | 1 |
DC | |
10-20% | 1 |
DE | |
0-10% | 1 |
10-20% | 1 |
FL | |
0-10% | 5 |
10-20% | 13 |
20-30% | 2 |
30-40% | 5 |
40-50% | 2 |
50-60% | 2 |
GA | |
0-10% | 5 |
10-20% | 6 |
20-30% | 2 |
30-40% | 1 |
40-50% | 2 |
HI | |
0-10% | 2 |
10-20% | 1 |
20-30% | 1 |
IA | |
20-30% | 1 |
30-40% | 1 |
ID | |
0-10% | 2 |
20-30% | 1 |
40-50% | 1 |
IL | |
10-20% | 4 |
20-30% | 3 |
30-40% | 4 |
IN | |
0-10% | 1 |
10-20% | 2 |
20-30% | 4 |
30-40% | 1 |
KS | |
0-10% | 2 |
10-20% | 1 |
20-30% | 1 |
30-40% | 1 |
40-50% | 1 |
KY | |
0-10% | 1 |
10-20% | 3 |
30-40% | 2 |
40-50% | 1 |
LA | |
10-20% | 2 |
20-30% | 2 |
30-40% | 2 |
40-50% | 2 |
50-60% | 2 |
MA | |
0-10% | 3 |
10-20% | 1 |
40-50% | 1 |
MD | |
0-10% | 3 |
10-20% | 3 |
20-30% | 1 |
30-40% | 1 |
40-50% | 1 |
50-60% | 1 |
ME | |
10-20% | 1 |
MI | |
0-10% | 4 |
10-20% | 2 |
30-40% | 1 |
40-50% | 1 |
50-60% | 1 |
MN | |
0-10% | 2 |
10-20% | 2 |
20-30% | 1 |
40-50% | 1 |
MO | |
10-20% | 3 |
20-30% | 2 |
30-40% | 3 |
50-60% | 1 |
60-70% | 1 |
MS | |
10-20% | 3 |
20-30% | 1 |
30-40% | 1 |
40-50% | 1 |
MT | |
0-10% | 1 |
20-30% | 1 |
NC | |
0-10% | 1 |
10-20% | 2 |
20-30% | 4 |
30-40% | 2 |
60-70% | 1 |
NE | |
0-10% | 1 |
10-20% | 4 |
30-40% | 1 |
NH | |
0-10% | 1 |
NJ | |
0-10% | 5 |
10-20% | 5 |
30-40% | 1 |
NM | |
0-10% | 2 |
10-20% | 2 |
20-30% | 1 |
NV | |
10-20% | 2 |
20-30% | 1 |
NY | |
0-10% | 5 |
10-20% | 3 |
20-30% | 1 |
30-40% | 1 |
40-50% | 3 |
OH | |
0-10% | 2 |
10-20% | 3 |
20-30% | 1 |
30-40% | 2 |
40-50% | 1 |
OK | |
0-10% | 4 |
10-20% | 3 |
20-30% | 8 |
30-40% | 6 |
50-60% | 1 |
OR | |
0-10% | 2 |
10-20% | 1 |
20-30% | 3 |
30-40% | 2 |
PA | |
0-10% | 2 |
10-20% | 3 |
30-40% | 2 |
SC | |
0-10% | 2 |
10-20% | 4 |
20-30% | 2 |
40-50% | 1 |
TN | |
0-10% | 1 |
10-20% | 3 |
30-40% | 2 |
TX | |
0-10% | 10 |
10-20% | 12 |
20-30% | 10 |
30-40% | 7 |
40-50% | 3 |
50-60% | 1 |
UT | |
0-10% | 2 |
10-20% | 2 |
20-30% | 1 |
VA | |
0-10% | 2 |
10-20% | 3 |
20-30% | 3 |
30-40% | 1 |
WA | |
10-20% | 4 |
20-30% | 5 |
30-40% | 2 |
WI | |
10-20% | 4 |
20-30% | 1 |
WV | |
10-20% | 2 |
WY | |
0-10% | 1 |
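One detail of the binning is worth spelling out. By default, `cut()` builds intervals that are open on the left and closed on the right, so a poverty rate of exactly 10 is labelled "0-10%" and a rate of exactly 0 falls outside every bin and becomes `NA`. The sketch below uses a few made-up values (the vector `x` is hypothetical) to show this behaviour and how `include.lowest = TRUE` would keep a zero in the first bin; whether the real data contains such a value is not checked here.

```r
# Same breakpoints as above; the labels are generated rather than typed out
breaks <- seq(0, 100, by = 10)
custom_labels <- paste0(head(breaks, -1), "-", tail(breaks, -1), "%")

# Hypothetical poverty rates chosen to sit on or near bin edges
x <- c(0, 10, 10.1, 58)

# Default: intervals are (0,10], (10,20], ... so 0 is not assigned to any bin
default_bins <- cut(x, breaks = breaks, labels = custom_labels)

# include.lowest = TRUE closes the first interval on the left: [0,10]
inclusive_bins <- cut(x, breaks = breaks, labels = custom_labels,
                      include.lowest = TRUE)

data.frame(x, default_bins, inclusive_bins)
```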
The article is very short and contains only a data table with a small subset of the data. If I wanted to extend the work in the article, I would provide a few graphs to help readers visualize the data.
The first graph would show the distribution of killings by poverty rate.
library(ggplot2)
# Create histogram for distribution of killings based on poverty rate
# (the columns were already renamed during data wrangling)
ggplot() +
  geom_histogram(data = policekillings, aes(x = PovertyRate), fill = "lightblue", color = "darkblue", binwidth = 5, alpha = 0.5) +
  labs(
    title = "Distribution of Killings Based on Poverty Rate",
    x = "Poverty Rate (%)",
    y = "Frequency"
  ) +
  scale_x_continuous(breaks = seq(0, 100, by = 10))
As the graph above shows, killings are concentrated in areas where the poverty rate is roughly 5-25%, which at first glance does not look especially poor. This is interesting because the article states, “One thing that’s clear from the data: Police killings tend to take place in neighborhoods that are poorer and blacker than the U.S. as a whole.” (*)
In the article, the author bases this statement on the household income data. Let’s see whether household income shows a different distribution.
The second graph would show the distribution of killings by household income.
# Create histogram for distribution based on household income
# (the columns were already renamed during data wrangling)
ggplot() +
  geom_histogram(data = policekillings, aes(x = HouseholdIncome), fill = "lightgreen", color = "darkgreen", binwidth = 10000, alpha = 0.5) +
  labs(
    title = "Distribution of Killings Based on Household Income",
    x = "Household Income ($)",
    y = "Frequency"
  ) +
  scale_x_continuous(breaks = seq(0, 140000, by = 15000))
As the graph above shows, killings are concentrated in areas where household income is lower (between $15,000 and $60,000). Viewed this way, the data appears to support the author’s statement (*).
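Reading ranges off a histogram is approximate, so it can help to put a number on the share of cases in a range. The sketch below defines a small helper, `share_in_range()` (a hypothetical name introduced only for illustration), and applies it to the six rows printed by `head()` earlier, plus one `NA` added to show the missing-value handling. The resulting figures describe only those six rows, not the full data set.

```r
# Hypothetical helper: share of non-missing values within [lower, upper]
share_in_range <- function(x, lower, upper) {
  x <- x[!is.na(x)]
  mean(x >= lower & x <= upper)
}

# The six rows shown by head() above; the NA is added for illustration
toy_income  <- c(51367, 27972, 45365, 48295, 68785, 20833, NA)
toy_poverty <- c(14.1, 28.8, 14.6, 11.7, 1.9, 58.0)

# Share of these rows with household income between $15,000 and $60,000
income_share <- share_in_range(toy_income, 15000, 60000)

# Share of these rows with a poverty rate between 5% and 25%
poverty_share <- share_in_range(toy_poverty, 5, 25)

# Median household income, ignoring the missing value
income_median <- median(toy_income, na.rm = TRUE)
```

Applied to the full `HouseholdIncome` and `PovertyRate` columns, the same helper would give shares that could be set against national figures, which is the comparison the article’s statement relies on.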
FiveThirtyEight. Where Police Have Killed Americans in 2015. https://fivethirtyeight.com/features/where-police-have-killed-americans-in-2015/
FiveThirtyEight. Police Killings Data. https://github.com/fivethirtyeight/data/blob/master/police-killings