DATA 607 Homework #1

Introduction

The article I chose for this assignment is “Where Police Have Killed Americans In 2015” by Ben Casselman (https://fivethirtyeight.com/features/where-police-have-killed-americans-in-2015/).

This article is about the release of the Guardian’s interactive database of Americans killed by police in 2015. The data was compiled from a combination of media coverage, reader submissions, and open-source information. The Guardian then verified the incidents through its own reporting.

Analysis

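In this section, I load the data set, keep the columns of interest, drop the rows with a missing age or poverty rate, and convert those two columns to numeric types.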

library(readr)

# retrieve the csv file from GitHub
urlfile <- "https://raw.githubusercontent.com/fivethirtyeight/data/master/police-killings/police_killings.csv"

policekillings <- read_csv(url(urlfile), show_col_types = FALSE)

# subset the data into a smaller data frame
policekillings <- subset(policekillings, select=c("name", "age", "gender", "raceethnicity", "state", "h_income", "pov"))

# remove the rows that have "Unknown" for age
policekillings <- policekillings[policekillings$age != "Unknown", ]

# remove the rows that have "-" for poverty
policekillings <- policekillings[policekillings$pov != "-", ]

# change columns from characters to numeric
policekillings$pov <- as.numeric(policekillings$pov)
policekillings$age <- as.integer(policekillings$age)

# rename the columns
colnames(policekillings) <- c("Name", "Age", "Gender", "Race", "State", "HouseholdIncome", "PovertyRate")

policekillings <- data.frame(policekillings)

# show a glimpse of the data frame
head(policekillings)

##                 Name Age Gender            Race State HouseholdIncome
## 1 A'donte Washington  16   Male           Black    AL           51367
## 2     Aaron Rutledge  27   Male           White    LA           27972
## 3        Aaron Siler  26   Male           White    WI           45365
## 4       Aaron Valdez  25   Male Hispanic/Latino    CA           48295
## 5       Adam Jovicic  29   Male           White    OH           68785
## 6      Adam Reinhart  29   Male           White    AZ           20833
##   PovertyRate
## 1        14.1
## 2        28.8
## 3        14.6
## 4        11.7
## 5         1.9
## 6        58.0
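
The cleanup above removes rows by comparing against the placeholder strings "Unknown" and "-" and then converts the columns by hand. As a minimal sketch of an alternative, the placeholders can be declared as missing values when the file is read, so the columns parse as numbers directly. The three rows below are made up for illustration and are not taken from the data set; `na.strings` is the base R `read.csv` argument, and readr's `read_csv` has a comparable `na` argument.

```r
# Toy CSV text (made-up rows, not from the real data set)
toy_csv <- "name,age,pov
Person A,16,14.1
Person B,Unknown,28.8
Person C,27,-"

# Treat the placeholder strings as missing values while reading
toy <- read.csv(text = toy_csv, na.strings = c("", "NA", "Unknown", "-"))

# Both columns are parsed as numbers, with NA where a placeholder was
str(toy)

# Keep only the rows where both age and pov are present
toy_clean <- toy[complete.cases(toy[, c("age", "pov")]), ]
toy_clean
```

With this approach, a placeholder spelled differently from the listed strings would still leave the column as text, so checking the column types after reading remains worthwhile.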

In this next section, I use the above subset to count the killings in each poverty rate range within each state. The table below shows the result.

library(gt)
library(dplyr)
# Show a table with each state's count of killings for each poverty rate range

# Define the breakpoints for poverty rate categories
breaks <- seq(0, 100, by = 10)

# Label each category
custom_labels <- c(
  "0-10%", "10-20%", "20-30%", "30-40%", "40-50%",
  "50-60%", "60-70%", "70-80%", "80-90%", "90-100%"
)

# Create a new column with poverty rate categories
# (include.lowest = TRUE keeps a rate of exactly 0 in the first bin)
policekillings <- policekillings %>%
  mutate(pov_category = cut(PovertyRate, breaks = breaks, labels = custom_labels,
                            include.lowest = TRUE))

# Group the data by state and poverty rate category, calculate counts
# (the result stays grouped by State, which gt uses for its row groups)
summary_data <- policekillings %>%
  group_by(State, pov_category) %>%
  summarise(count = n(), .groups = "drop_last")

# Create a gt table from the summarized data
policekillings_tbl <- gt(summary_data)

# Customize the table headers
policekillings_tbl <- policekillings_tbl |>
  tab_header(
    title = md("**Killings by Poverty Rate in Each State**")
  ) |>
  cols_label(
    State = "State", pov_category = md("**Poverty Rate Range**"), count = md("**Killings Count**")
  )

# Display the table
policekillings_tbl

Killings by Poverty Rate in Each State
Poverty Rate Range	Killings Count
AK
10-20%	1
20-30%	1
AL
0-10%	3
10-20%	2
20-30%	2
30-40%	1
AR
10-20%	2
20-30%	1
30-40%	1
AZ
0-10%	4
10-20%	9
20-30%	5
30-40%	4
40-50%	1
50-60%	2
CA
0-10%	18
10-20%	22
20-30%	19
30-40%	9
40-50%	2
50-60%	3
70-80%	1
CO
0-10%	3
10-20%	3
20-30%	5
40-50%	1
CT
0-10%	1
DC
10-20%	1
DE
0-10%	1
10-20%	1
FL
0-10%	5
10-20%	13
20-30%	2
30-40%	5
40-50%	2
50-60%	2
GA
0-10%	5
10-20%	6
20-30%	2
30-40%	1
40-50%	2
HI
0-10%	2
10-20%	1
20-30%	1
IA
20-30%	1
30-40%	1
ID
0-10%	2
20-30%	1
40-50%	1
IL
10-20%	4
20-30%	3
30-40%	4
IN
0-10%	1
10-20%	2
20-30%	4
30-40%	1
KS
0-10%	2
10-20%	1
20-30%	1
30-40%	1
40-50%	1
KY
0-10%	1
10-20%	3
30-40%	2
40-50%	1
LA
10-20%	2
20-30%	2
30-40%	2
40-50%	2
50-60%	2
MA
0-10%	3
10-20%	1
40-50%	1
MD
0-10%	3
10-20%	3
20-30%	1
30-40%	1
40-50%	1
50-60%	1
ME
10-20%	1
MI
0-10%	4
10-20%	2
30-40%	1
40-50%	1
50-60%	1
MN
0-10%	2
10-20%	2
20-30%	1
40-50%	1
MO
10-20%	3
20-30%	2
30-40%	3
50-60%	1
60-70%	1
MS
10-20%	3
20-30%	1
30-40%	1
40-50%	1
MT
0-10%	1
20-30%	1
NC
0-10%	1
10-20%	2
20-30%	4
30-40%	2
60-70%	1
NE
0-10%	1
10-20%	4
30-40%	1
NH
0-10%	1
NJ
0-10%	5
10-20%	5
30-40%	1
NM
0-10%	2
10-20%	2
20-30%	1
NV
10-20%	2
20-30%	1
NY
0-10%	5
10-20%	3
20-30%	1
30-40%	1
40-50%	3
OH
0-10%	2
10-20%	3
20-30%	1
30-40%	2
40-50%	1
OK
0-10%	4
10-20%	3
20-30%	8
30-40%	6
50-60%	1
OR
0-10%	2
10-20%	1
20-30%	3
30-40%	2
PA
0-10%	2
10-20%	3
30-40%	2
SC
0-10%	2
10-20%	4
20-30%	2
40-50%	1
TN
0-10%	1
10-20%	3
30-40%	2
TX
0-10%	10
10-20%	12
20-30%	10
30-40%	7
40-50%	3
50-60%	1
UT
0-10%	2
10-20%	2
20-30%	1
VA
0-10%	2
10-20%	3
20-30%	3
30-40%	1
WA
10-20%	4
20-30%	5
30-40%	2
WI
10-20%	4
20-30%	1
WV
10-20%	2
WY
0-10%	1
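
The poverty rate ranges in this table come from `cut()`, and its settings decide which range a boundary value lands in. The short check below uses hand-picked values (not rows from the data set) to show that the intervals are closed on the right, so a rate of exactly 10 is counted as "0-10%", and that a rate of exactly 0 is only binned when `include.lowest = TRUE` is set.

```r
# Same breakpoints as the table: 0, 10, ..., 100
breaks <- seq(0, 100, by = 10)

# Build the labels "0-10%", "10-20%", ..., "90-100%" from the breakpoints
labels <- paste0(head(breaks, -1), "-", tail(breaks, -1), "%")

# Hand-picked boundary values for illustration
x <- c(0, 10, 10.1, 100)

# Default behavior: intervals are (0,10], (10,20], ... so 0 falls outside
default_bins <- cut(x, breaks = breaks, labels = labels)

# include.lowest = TRUE turns the first interval into [0,10]
closed_bins <- cut(x, breaks = breaks, labels = labels, include.lowest = TRUE)

data.frame(rate = x, default = default_bins, include_lowest = closed_bins)
```

A range label such as "10-20%" therefore means "greater than 10 and up to 20" in this table.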

Findings and Recommendations

The article is very short and only contains a data table with a small subset of the data. If I wanted to extend the work in the article, I would add a few graphs to help readers visualize the data.

Recommendation 1

I would provide a graph showing the distribution of killings based on poverty rate.

library(ggplot2)

# Create histogram for distribution of killings based on poverty rate
ggplot(policekillings, aes(x = PovertyRate)) +
  geom_histogram(
    fill = "lightblue", color = "darkblue",
    binwidth = 5, alpha = 0.5
  ) +
  labs(
    title = "Distribution of Killings Based on Poverty Rate",
    x = "Poverty Rate (%)",
    y = "Frequency"
  ) +
  scale_x_continuous(breaks = seq(0, 100, by = 10))

As the graph above shows, killings are concentrated in areas where the poverty rate is between 5% and 25%. This is interesting, because the article states: “One thing that’s clear from the data: Police killings tend to take place in neighborhoods that are poorer and blacker than the U.S. as a whole.” (*)
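
Reading a range off a histogram depends on the bin width, so it can help to compute the share directly. The sketch below is one way to do that; the helper `share_in_range` is a hypothetical name introduced here, and the poverty rates are made-up values rather than rows from the data set.

```r
# Hypothetical helper: fraction of non-missing values with lo <= x <= hi
share_in_range <- function(x, lo, hi) {
  x <- x[!is.na(x)]
  mean(x >= lo & x <= hi)
}

# Made-up poverty rates in percent, including one missing value
toy_rates <- c(1.9, 8.0, 11.7, 14.1, 14.6, 22.3, 28.8, 35.0, 58.0, NA)

# Share of the toy values that fall between 5% and 25%
share_5_25 <- share_in_range(toy_rates, 5, 25)
share_5_25

# Quartiles summarize the same distribution without depending on bins
quantile(toy_rates, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)
```

On the real data, the equivalent call would be `share_in_range(policekillings$PovertyRate, 5, 25)`.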

In the article, the author based this statement on the household income data. Let’s see whether household income shows a different distribution.

Recommendation 2

I would provide a graph showing the distribution of killings based on household income.

# Create histogram for distribution based on household income
ggplot(policekillings, aes(x = HouseholdIncome)) +
  geom_histogram(
    fill = "lightgreen", color = "darkgreen",
    binwidth = 10000, alpha = 0.5
  ) +
  labs(
    title = "Distribution of Killings Based on Household Income",
    x = "Household Income ($)",
    y = "Frequency"
  ) +
  scale_x_continuous(breaks = seq(0, 140000, by = 15000))

As the graph above shows, most killings occurred in areas with lower household incomes (between $15,000 and $60,000). Viewed this way, the data supports the author’s statement (*).
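
The income reading can also be expressed as proportions per income band. The sketch below does this with `cut()`, `table()`, and `prop.table()`, using only the six household incomes printed by `head()` earlier as a tiny example. The band edges of $15,000 and $60,000 follow the range named above, and the band labels are my own.

```r
# The six household incomes shown in the head() output above
toy_income <- c(51367, 27972, 45365, 48295, 68785, 20833)

# Three bands around the $15,000 to $60,000 range discussed in the text
income_bins <- cut(
  toy_income,
  breaks = c(0, 15000, 60000, Inf),
  labels = c("under $15k", "$15k-$60k", "over $60k"),
  include.lowest = TRUE
)

# Proportion of rows in each band
income_share <- prop.table(table(income_bins))
round(income_share, 2)
```

Applied to `policekillings$HouseholdIncome`, the same steps would give the share of all rows in the $15,000 to $60,000 band.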

Sources

FiveThirtyEight. Where Police Have Killed Americans in 2015. https://fivethirtyeight.com/features/where-police-have-killed-americans-in-2015/

FiveThirtyEight. Police Killings Data. https://github.com/fivethirtyeight/data/blob/master/police-killings