Hello everyone. As many know, police brutality and killings have been a major issue throughout the USA for a very long time; the term "police brutality" was first used by the Chicago Tribune in 1872, but it has only become a prominent issue relatively recently. There have been many reactions to this issue, most notably the Black Lives Matter movement. In this project, I want to examine how several factors relate to police killings: age, poverty level, state, and graduation rate.
The dataset was found here: https://www.kaggle.com/kwullum/fatal-police-shootings-in-the-us?select=ShareRaceByCity.csv
library(rio)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ tibble 3.1.2 ✓ purrr 0.3.4
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(highcharter)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
## Highcharts (www.highcharts.com) is a Highsoft software product which is
## not free for commercial and Governmental use
library(broom)
library(ggpubr)
library(ggthemes)
Let's assign the CSV file names to variables before importing them, to make them easier to work with in the long run.
PovertyLevel <- "PercentagePeopleBelowPovertyLevel.csv"
GraduationRate <- "PercentOver25CompletedHighSchool.csv"
PoliceKillings <- "PoliceKillingsUS.csv"
Now let's import the actual datasets.
setwd("/Users/justinpark/Desktop/DATA 110 Project 2")
PovertyLevel <- import(PovertyLevel)
GraduationRate <- import(GraduationRate)
PoliceKillings <- import(PoliceKillings)
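As a quick aside, setwd() ties the script to one machine; import() also accepts full paths, so the same data could be loaded without changing the working directory. A minimal sketch of that variant (data_dir is a placeholder to adjust per machine):
data_dir <- "/Users/justinpark/Desktop/DATA 110 Project 2"  # placeholder path
PovertyLevel <- import(file.path(data_dir, "PercentagePeopleBelowPovertyLevel.csv"))
GraduationRate <- import(file.path(data_dir, "PercentOver25CompletedHighSchool.csv"))
PoliceKillings <- import(file.path(data_dir, "PoliceKillingsUS.csv"))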
Given the scale of this dataset, there are many variables to look at. The first two are "poverty_rate" and "percent_completed_hs", which respectively describe the poverty rate of each city in a state and the rate at which people 25 years old or older have graduated from high school. "age" in the PoliceKillings dataset gives the ages of the victims killed by police. And lastly there are "state" and "Geographic Area", both of which identify the states plus Washington DC. Many of these variable names will be changed later for convenience.
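To verify these variables and their types before cleaning, a quick structural check can be run; glimpse() comes with dplyr (this check is an addition, not part of the original workflow):
glimpse(PoliceKillings)  # one row per variable: name, type, and first few values
glimpse(PovertyLevel)
glimpse(GraduationRate)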
First, I want to compare the poverty rates and graduation rates. To do so, I merged both datasets into one titled "total".
total <- merge(PovertyLevel, GraduationRate)
# total
There were too many observations to begin with, so I chose to look at the state-level data for each variable rather than by city. As such, I selected "Geographic Area" (state) along with the poverty and graduation rates. There were a lot of NAs throughout the dataset (marked by "-"), so these needed to be removed for both poverty and graduation rates. Both variables then had to be converted to numerics, as they were originally stored as characters. Lastly, the variables were renamed to State, Pov_rate, and Grad_rate respectively for easy reference.
total1 <- total %>%
select(`Geographic Area`, poverty_rate, percent_completed_hs)
total1 <- total1 %>%
filter(percent_completed_hs != "-")
total1$percent_completed_hs <- as.numeric(total1$percent_completed_hs)
total1 <- na.omit(total1)
colnames(total1) <- c("State", "Pov_rate", "Grad_rate")
# total1
Next, I cleaned up the poverty rate in a separate chunk to keep the two steps distinguishable.
total2 <- total1 %>%
filter(Pov_rate != "-")
total2$Pov_rate <- as.numeric(total2$Pov_rate)
total2 <- na.omit(total2)
# total2
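For reference, the two cleaning chunks above could also be collapsed into a single pipeline; a sketch that should produce the same table as total2:
total2_alt <- total %>%
  select(State = `Geographic Area`, Pov_rate = poverty_rate, Grad_rate = percent_completed_hs) %>%
  filter(Pov_rate != "-", Grad_rate != "-") %>%   # drop the "-" placeholders for both rates
  mutate(across(c(Pov_rate, Grad_rate), as.numeric)) %>%   # convert characters to numerics
  na.omit()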
Now we'll select only the states from the PoliceKillings dataset. The other variables will be used later.
PoliceKillings1 <- PoliceKillings %>%
select(state)
# PoliceKillings1
Now, just to check the data, I arranged Pov_rate and Grad_rate in descending order to verify that no NAs remained. Since none did, I could move on.
top_pov <- total2 %>%   # named top_pov to avoid shadowing base::head()
  arrange(desc(Pov_rate)) %>%
  head(20)
top_pov
## State Pov_rate Grad_rate
## 1 AK 100 0.0
## 2 AK 100 100.0
## 3 AZ 100 0.0
## 4 AZ 100 100.0
## 5 AZ 100 50.0
## 6 AZ 100 77.8
## 7 AZ 100 40.7
## 8 CA 100 80.6
## 9 CA 100 100.0
## 10 CA 100 100.0
## 11 CO 100 100.0
## 12 IA 100 100.0
## 13 ID 100 0.0
## 14 MD 100 100.0
## 15 MD 100 100.0
## 16 MN 100 100.0
## 17 MN 100 0.0
## 18 MO 100 27.3
## 19 MT 100 100.0
## 20 MT 100 100.0
top_grad <- total2 %>%
  arrange(desc(Grad_rate)) %>%
  head(20)
top_grad
## State Pov_rate Grad_rate
## 1 AK 0.0 100
## 2 AK 0.0 100
## 3 AK 0.0 100
## 4 AK 34.8 100
## 5 AK 0.0 100
## 6 AK 0.0 100
## 7 AK 0.0 100
## 8 AK 0.0 100
## 9 AK 15.5 100
## 10 AK 0.0 100
## 11 AK 5.4 100
## 12 AK 100.0 100
## 13 AK 0.0 100
## 14 AK 0.0 100
## 15 AK 0.0 100
## 16 AK 0.0 100
## 17 AK 0.0 100
## 18 AK 5.7 100
## 19 AK 35.9 100
## 20 AK 5.0 100
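A more direct check is also possible: anyNA() scans the whole data frame at once (again, a quick verification rather than part of the original steps):
anyNA(total2)        # should return FALSE after the cleaning above
sum(is.na(total2))   # or count the remaining NAs explicitly; should be 0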
Since the dataset lists the cities of each state, there were far too many observations. To fix this, I took the average poverty and graduation rates for each state. This way, there would only be 51 rows (the 50 states plus DC) rather than the tens of thousands present before.
Avg <- total2 %>%
group_by(State) %>%
summarise(Avg_Pov_Rate = mean(Pov_rate), Avg_Grad_Rate = mean(Grad_rate))
head(Avg, 20)
## # A tibble: 20 x 3
## State Avg_Pov_Rate Avg_Grad_Rate
## <chr> <dbl> <dbl>
## 1 AK 19.9 84.5
## 2 AL 20.6 80.3
## 3 AR 23.0 79.9
## 4 AZ 25.7 80.5
## 5 CA 17.1 82.0
## 6 CO 13.4 90.1
## 7 CT 9.14 91.6
## 8 DC 18 89.3
## 9 DE 12.6 88.5
## 10 FL 17.6 85.7
## 11 GA 23.8 79.0
## 12 HI 13.4 91.7
## 13 IA 12.3 90.1
## 14 ID 18.2 85.2
## 15 IL 13.9 88.5
## 16 IN 15.5 86.3
## 17 KS 14.8 88.2
## 18 KY 20.0 82.4
## 19 LA 22.3 79.3
## 20 MA 9.59 92.4
Since I would be looking for the relation between the average poverty and graduation rates, a regression line was necessary. Highcharter does not provide a direct way to add one, so I used the package "broom" to create a separate fitted line that I could add to the scatterplot.
model <- lm(Avg_Grad_Rate ~ Avg_Pov_Rate, data = Avg)
fit <- augment(model) %>% arrange(Avg_Pov_Rate)
Now that the line was created, all I had to do was add it to the scatterplot:
graph <- highchart() %>%
hc_add_series(Avg,
type = "scatter",
hcaes(x = Avg_Pov_Rate, y = Avg_Grad_Rate,
group = State)) %>%
hc_add_series(fit, type = "line", hcaes(x = Avg_Pov_Rate, y = .fitted)) %>%
hc_xAxis(title = list(text = "Average Poverty Rate (%)"),
labels = list(format = "{value}%")) %>%
hc_yAxis(title = list(text = "Average Graduation Rate (%)"),
labels = list(format = "{value}%")) %>%
hc_title(
text = "<b>Relations Between Average State Poverty and Graduation Rates in 2015</b>",
margin = 20,
align = "center",
style = list(color = "#22A884", useHTML = TRUE)) %>%
hc_plotOptions(series = list(marker = list(symbol = "circle"))) %>%
hc_legend(enabled = FALSE) %>%
hc_tooltip(borderColor = "black",
pointFormat = '{point.x:.2f}% {point.y:.2f}%')
graph
And even just at first glance, a relationship between the two variables can be observed. Now I would have to determine the actual correlation coefficient and possibly the p-value.
To find these, I made a separate scatterplot with the regression line using the package "ggpubr". It produced essentially the same graph, but with the correlation coefficient (R) and p-value displayed.
ggscatter(Avg, x = "Avg_Pov_Rate", y = "Avg_Grad_Rate",
add = "reg.line", conf.int = FALSE,
cor.coef = TRUE, cor.method = "pearson", cor.coef.coord = c(20, 90),
xlab = "Average Poverty Rate (%)",
ylab = "Average Graduation Rate (%)")
## `geom_smooth()` using formula 'y ~ x'
From this, I can say two things. The correlation coefficient (R) of -0.86 conveys that there is a relatively strong negative relationship between Average Poverty Rates and Average Graduation Rates across the states. This indicates that as the Average Poverty Rate increases, the Average Graduation Rate decreases. The p-value of 4.8x10^-16 (~0) indicates that the result is statistically significant.
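For reference, the same two numbers can also be pulled directly in R without a plot; a minimal sketch using cor.test() from the stats package:
ct <- cor.test(Avg$Avg_Pov_Rate, Avg$Avg_Grad_Rate, method = "pearson")
ct$estimate   # Pearson's R, which should match the -0.86 reported above
ct$p.value    # should match the p-value displayed on the ggpubr plot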
At first, I intended to relate the average poverty and graduation rates to the police killings themselves. But the amount of cleaning that would have taken led me to take a much simpler approach: examining the police killings by age and state. So first, I looked at the killings by state, which required altering the dataset to display counts of police killings per state.
PoliceKillings2 <- PoliceKillings1 %>%
group_by(state) %>%
count() %>%
arrange(desc(n))
# PoliceKillings2
This shows the total number of police killings across the years (2015-17) covered by this dataset.
count(PoliceKillings)
## n
## 1 2535
Essentially, this can be interpreted as 2535/3 = 845 police killings per year. This will come into use later.
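That back-of-the-envelope figure can be computed directly, assuming the records span exactly three years:
n_killings <- nrow(PoliceKillings)   # 2535 records in total
n_killings / 3                       # roughly 845 killings per year on average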
This graph attempts to show the number of police killings in each state. The results were quite astonishing.
ggplot(PoliceKillings2) +
geom_bar(aes(x = reorder(state, n), y = n, fill = state), stat = 'identity') +
labs(title = "Number of Police Killings by State (2015-17)",
x = "State",
y = "Number of Police Killings") +
theme_solarized() +
theme(axis.text.x = element_text(size = 7, angle = 70, hjust = 1),
plot.title = element_text(hjust = 0.50),
legend.position = "none")
Very clearly, 4 of the 50 states have significantly more killings than the others. But what is most shocking is California (CA), which displays nearly twice as many killings as the second highest state (Texas [TX]).
Next, we look at the age distribution of the victims killed by police.
ggplot(PoliceKillings) +
geom_bar(aes(x = age)) +
labs(title = "Number of Police Killings by Age (2015-17)",
x = "Age",
y = "Number of Police Killings") +
theme_solarized() +
scale_x_continuous(breaks = seq(0, 100, 10)) +
theme(axis.text.x = element_text(size = 7, angle = 70, hjust = 1),
plot.title = element_text(hjust = 0.50),
legend.position = "none")
## Warning: Removed 77 rows containing non-finite values (stat_count).
Just by first observation, the distribution is skewed to the right. This indicates that most police killing victims were in the age range of 20 to around the late 30s.
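Summary statistics back up this reading; a quick check (the 77 missing ages flagged in the warning above are handled automatically by summary()):
summary(PoliceKillings$age)   # with a right skew, the mean sits above the median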
It has been several years since the first attempts to decrease police brutality and killings throughout the USA. But what factors are related to these cases, and are there any observable patterns? This project aims to roughly answer such questions.
First, let's discuss the cleaning. Many of the variables and much of the actual data required a lot of cleaning. For example, the dataset included many NAs, but they were depicted with "-" rather than a blank. Furthermore, many of the variables needed renaming, such as "Geographic Area", which represented the states in an awkward manner. And lastly, several variables, specifically the poverty and graduation rates, were stored as characters rather than numerics, so I had to convert them for the graphs to be made correctly.
Starting with the first graph, which displayed the average state poverty and graduation rates: as mentioned, the correlation coefficient (R) of -0.86 conveys a relatively strong negative relationship between Average Poverty Rates and Average Graduation Rates across the states, indicating that as the Average Poverty Rate increases, the Average Graduation Rate decreases. Furthermore, the p-value of 4.8x10^-16 (~0) indicates that the result is statistically significant.
But what does this mean? Essentially, throughout the states, a higher poverty rate relates to a lower high school graduation rate among those over 25. This isn't very surprising. But I wanted to compare this to the actual police killings: was there a relation there? However, I could not find a way to merge the poverty and graduation rate dataset with the police killings data without running into numerous errors and cleaning issues. Because of this, I looked at different factors regarding the victims of these cases.
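For what it's worth, one plausible route would be joining the state averages onto the per-state killing counts via the two-letter state codes; a sketch of that idea (untested against the cleaning issues mentioned above):
merged <- Avg %>%
  left_join(PoliceKillings2, by = c("State" = "state")) %>%   # n = killings per state
  rename(Killings = n)
# cor(merged$Avg_Pov_Rate, merged$Killings, use = "complete.obs") would give a first estimate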
But are there any patterns regarding the police killings themselves? Certainly, several were visible. The graph of police killings by state shows California at the top, without any competition. However, the graph does not display proportions. So even though California has more police shootings than any other state, "it also has more people. When you account for population size across all 50 states and the District of Columbia, California's rate of police shootings is in the middle of the pack, but higher than the national average" (Rosenhall). If proportions were used, the graph would differ greatly. After a bit of research, I found that New Mexico, Alaska, and Oklahoma have the 3 highest police killing rates (for all people regardless of race or age).
And lastly, we come to the age groups. The last graph clearly shows that young adults tend to be victims of police killings most often. This is backed by the Washington Post, whose article explains that "an overwhelming majority of people shot and killed by police are male — over 95 percent. More than half the victims are between 20 and 40 years old".
In the end, I would have liked to incorporate both the poverty and graduation rates into the analysis of the actual police killings, as they would certainly have provided interesting results. With more knowledge of R, this seems very plausible and would be beneficial for future projects. I would also have liked to redo the "Number of Police Killings by State" graph with proportions rather than raw counts. This would have provided a better comparison, as the data would be normalized.
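Such a proportional graph would only require a table of state populations; a hypothetical sketch (state_pop and its pop column stand in for real census data):
# state_pop: hypothetical data frame with columns `state` (two-letter code) and `pop`
per_capita <- PoliceKillings2 %>%
  left_join(state_pop, by = "state") %>%
  mutate(rate_per_million = n / pop * 1e6)   # normalize counts by population
# reordering the bar chart by rate_per_million would then give the proportional view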
Ultimately, it can be said that police brutality and killings remain a major issue in the USA, and the number of killings stays roughly the same annually. From the data I acquired, I determined that around 845 killings happened yearly on average from 2015-17. The Washington Post portrays an even bigger issue, showing nearly 1000 killings every year. And despite the pandemic in 2020 and 2021, the number has stayed the same. Many of the efforts to reduce these numbers seem to be in vain, as progress cannot be seen in either the totals or the victim age groups. Hopefully, things will improve in the long term as more and more people realize that improvements have not yet been made.
Sources:
https://www.washingtonpost.com/graphics/investigations/police-shootings-database/
https://mappingpoliceviolence.org/states
https://calmatters.org/explainers/california-police-shootings-deadly-force-new-law-explained/