The data I have selected presents a thorough database of homicide report data in the United States from 1980 - 2014, including cases both solved and still open. Assessing and reducing homicide occurrences requires highly complex investigation into socioeconomic, political, psychological and educational issues that drive motives. Through an ex post facto data approach however, broad insights can be gleaned on the level of national statistics and person profiles. I seek to employ a quantitative approach to shed light on how the state of homicide in the United States is changing, and in doing so, hope to assist in renegotiating how discussions of the socioeconomic, political, psychological and educational factors will be carried out.
I would like to track national homicide activity from 1980 - 2014 in order to answer the following questions: How have homicide rates changed over time? By state? Are investigators becoming more effective at solving and closing homicide cases? Through these questions I would like to better identify and summarize what criminal activity looks like in the U.S.
My approach will primarily employ techniques in visualization to communicate the way our crime data has changed over the past three and a half decades.
library(tidyverse) # To load tidyverse package, including: magrittr, dplyr, ggplot2
library(ggrepel) # Repel overlapping text in plots
library(DT) # To produce HTML output tables
The following packages were loaded for my analysis:
The data set I am using is an exhaustive catalog of homicide data in the United States from 1980 - 2014. This data is a part of the Murder Accountability Project, which, according to the organization, “includes murders from the FBI’s Supplementary Homicide Report from 1976 to the present and Freedom of Information Act data on more than 22,000 homicides that were not reported to the Justice Department. This dataset includes the age, race, sex, ethnicity of victims and perpetrators, in addition to the relationship between the victim and perpetrator and weapon used.” The data has since been used to revisit and reassess cold cases.
#Read in initial data set variable list
read_csv("HomicideData.csv", n_max = 1)
## # A tibble: 1 x 24
## `Record ID` `Agency Code` `Agency Name` `Agency Type` City
## <chr> <chr> <chr> <chr> <chr>
## 1 000001 AK00101 Anchorage Municipal Police Anchorage
## # ... with 19 more variables: State <chr>, Year <int>, Month <chr>,
## # Incident <int>, `Crime Type` <chr>, `Crime Solved` <chr>, `Victim
## # Sex` <chr>, `Victim Age` <int>, `Victim Race` <chr>, `Victim
## # Ethnicity` <chr>, `Perpetrator Sex` <chr>, `Perpetrator Age` <int>,
## # `Perpetrator Race` <chr>, `Perpetrator Ethnicity` <chr>,
## # Relationship <chr>, Weapon <chr>, `Victim Count` <int>, `Perpetrator
## # Count` <int>, `Record Source` <chr>
ReducedData <- read_csv("HomicideData.csv") %>%
select(everything(), -c(`Record ID`, `Agency Code`, `Perpetrator Ethnicity`,
`Agency Name`, `Incident`, `Victim Ethnicity`,
`Perpetrator Count`, `Record Source`,
`Victim Count`))
I started by generating the list of variables present in the data set, then created a new data frame, ReducedData, that included only the variables relevant to my analysis. There are 638454 observations in the dataset, and 15 variables
VicAgeRandom <- ReducedData %>%
subset(`Victim Age` == 998)
PerpAgeRandom <- ReducedData %>%
subset(`Perpetrator Age` < 5)
I then discovered anomalies that existed in both the variables Victim Age and Perpetrator Age. There were 974 instances where the Victim Age was recorded as 998, and 216420 where the Perpetrator Age was recorded as 0. Both of these scenarios represented unsolved crimes, and were included in the majority of the following analysis.
na.zero <- function (x) {
x[is.na(x)] <- 0
return(x)
}
CrimesPerYear <- ReducedData %>%
group_by(State, Year, `Crime Solved`) %>%
count() %>%
spread(`Crime Solved`, n) %>%
rename(`Case Closed` = Yes, `Case Open` = No) %>%
na.zero %>%
mutate(`Solve Rate` = `Case Closed` / (`Case Closed` + `Case Open`))
CrimesPerYear %>%
mutate(`Solve Rate` = round(`Solve Rate`, 3)) %>%
datatable()
I then generated the CrimesPerYear data frame in order to organize the data into a breakdown of how many cases reported were solved, not solved, and the subsequent annual case solving rate for each state. This data frame required the function na.zero in order to set instances where spreading the data resulted in values of NA. There are 1759 observations in the dataset, and 5 variables
CrimesByAgency <- ReducedData %>%
group_by(`Agency Type`, Year, `Crime Solved`) %>%
count() %>%
spread(`Crime Solved`, n) %>%
rename(`Case Closed` = Yes, `Case Open` = No) %>%
na.zero %>%
mutate(`Solve Rate` = `Case Closed` / (`Case Closed` + `Case Open`))
CrimesByAgency %>%
mutate(`Solve Rate` = round(`Solve Rate`, 3)) %>%
datatable()
I then created the CrimesByAgency data frame so that I could determine which levels of law enforcement where most often involved in homicide cases. This then allows me to make claims about the success different enforcement agencies have on closing cases. There are 232 observations in the dataset, and 5 variables
AvgSolveRate <- CrimesPerYear %>%
group_by(Year) %>%
summarise(`Yearly Avg` = mean(`Solve Rate`))
AvgSolveRate %>%
mutate('Yearly Avg' = round(`Yearly Avg`, 3)) %>%
datatable(rownames = FALSE)
I finally created the AvgSolveRate data frame in order to generate a figure that represents the annual nationwide crime solving percentage.
I first wanted to compare how the magnitude of total reports, closed cases, and open cases were changing. Illustrated below, total homicide reports is represented in blue, cases closed in red, and cases remaining open in green. On the whole, since 1980 the magnitude of reported homicides across the United States has fallen significantly.
ReducedData %>%
group_by(Year, `Crime Solved`) %>%
count() %>%
spread(`Crime Solved`, n) %>%
rename(`Case Closed` = Yes, `Case Open` = No) %>%
na.zero %>%
mutate(`Report Total` = `Case Closed` + `Case Open`) %>%
ggplot(aes(x = Year, y = `Report Total`)) +
geom_line(color = "blue") +
geom_line(aes(x = Year, y = `Case Open`), color = "green") +
geom_line(aes(x = Year, y = `Case Closed`), color = "red") +
ylab("Frequency") +
theme_minimal() +
labs(title = "Annual Homicides Reports from 1980 thorugh 2014",
subtitle = "The magnitude of reported homicides has experienced a general downward trend \nsince 1980, with the exception of the boom during the 1990s")
While a decrease in overall magnitude is a positive trend in homicide reports, I then wanted to test what homicide growth rate in the U.S. was telling us. I tested the year-over-year homicide reports to decide how annual report statistics were changing. Since 1980, the national homicide growth rate sits at -4%, confirming the overall decline in homicide. Notably, in the 1990s after a boom period of homicide, the rate of decrease was consistently the greatest, reaching over -10% in 1995.
ReducedData %>%
group_by(Year, `Crime Solved`) %>%
count() %>%
ungroup() %>%
spread(`Crime Solved`, n) %>%
rename(`Case Closed` = Yes, `Case Open` = No) %>%
na.zero %>%
mutate(`Report Total` = `Case Closed` + `Case Open`,
`Crime Lag` = lead(`Report Total`)) %>%
na.zero %>%
mutate(`Homicide Growth Rate` = (`Crime Lag`- `Report Total`) / `Report Total`,
GrowthAvg = mean(`Homicide Growth Rate`),
Above = ifelse(`Homicide Growth Rate` > 0, TRUE, FALSE)) %>%
filter(Year != 2014) %>%
ggplot(aes(x = Year, y = `Homicide Growth Rate`, color = Above)) +
geom_segment(aes(x = Year, y = 0, xend = Year,
yend = `Homicide Growth Rate`)) +
geom_hline(aes(yintercept = GrowthAvg), linetype = "dashed") +
annotate("text", x = 2015.5, y = -0.035, label = "Average = - 4%", size = 2.4) +
geom_point() +
scale_y_continuous(labels = scales::percent) +
theme_minimal() +
theme(legend.position = "none") +
labs(title = "Year over Year Homicide Growth Rate in the United States",
subtitle = "The average decrease in homicide since 1980 is 4%")
Despite this positive vital, I wanted to further probe into the efficiency of law enforcement in solving cases. Over the period, the national average proportion of cases that remained open was 29.8%. The data suggests that while homicide was historically more prevalent, the criminal process has become less efficient at closing cases. From the early 1990s to the late 2000s the rate at which homicide cases are closed remained above the average with the exception of 1998. Since 2009 however, that figure has begun to tend back towards the average.
ReducedData %>%
group_by(Year, `Crime Solved`) %>%
count() %>%
spread(`Crime Solved`, n) %>%
rename(`Case Closed` = Yes, `Case Open` = No) %>%
na.zero %>%
ungroup() %>%
mutate(`Report Total` = `Case Closed` + `Case Open`,
`Case Open Percent` = `Case Open` / `Report Total`,
TotalAvg = mean(`Case Open Percent`),
Above = ifelse(`Case Open Percent` - TotalAvg > 0, TRUE, FALSE)) %>%
ggplot(aes(x = Year, y = `Case Open Percent`, color = Above)) +
geom_segment(aes(x = Year, y = TotalAvg, xend = Year, yend = `Case Open Percent`)) +
geom_point() +
geom_hline(aes(yintercept = TotalAvg), linetype = "dashed") +
annotate("text", x = 2015, y = .295, label = "Average = 29.8%", size = 2.4) +
scale_y_continuous(labels = scales::percent) +
theme_minimal() +
theme(legend.position = "none") +
labs(title = "Proportion of Homicides Reported that remained unsolved",
subtitle = "The average percentage of reported homicides that remain unsolved is 29.8%, \nbut that figure has been higher in recent years when compared to the 1980s")
What about the longevity of the crime solve rate though? Information about how the movement of the national solve rate helps in understanding the national patterns related to law enforcement efficacy. What the data indicates is that, decade over decade, the homicide solve rate has remained relatively stagnant, with a slight uptick in 2010 of approximately 2%.
SolveRateMean <- CrimesPerYear %>%
filter(Year %in% c(1980, 1990, 2000, 2010)) %>%
group_by(Year) %>%
summarise(Mean = mean(`Solve Rate`)) %>%
mutate(Label = paste0(prettyNum(round(Mean, 3))))
CrimesPerYear %>%
filter(Year %in% seq(1980, 2014, by = 10)) %>%
ggplot(aes(x = `Solve Rate`)) +
geom_histogram(color = "grey30", fill = "dodgerblue", alpha = .3) +
facet_grid(Year ~ ., scales = "free") +
geom_vline(data = SolveRateMean, aes(xintercept = Mean),
linetype = "dashed", color = "red") +
geom_text(data = SolveRateMean, aes(x = Mean + .04, y = 8, label = Label),
size = 4) +
scale_x_continuous(labels = scales::percent) +
ylab("State Frequency") +
theme_bw() +
labs(title = "Comparison of Homicide Solve Rate Decade over Decade",
subtitle = "The mean solve rate over the 35 year period surveyed experienced little \nfluctuation, remaining near 77% with a slight increase in 2010")
The next analysis was evaluating State’s abilities to close homicide cases year over year. In 2014 the top performing states were North and South Dakota, Alabama, Vermont and Montana. The lowest performing states were New Jersey, Pennsylvania, Massachusetts, Washington D.C. and Illinois. Interestingly, in spite of large fluctuations from state to state, the national homicide solve rate remained relatively stable over the 35 year period documented.
top5 <- CrimesPerYear %>%
ungroup() %>%
mutate(State = factor(State)) %>%
arrange(desc(`Solve Rate`)) %>%
filter(Year == 2014) %>%
slice(1:5)
bottom5 <- CrimesPerYear %>%
ungroup() %>%
mutate(State = factor(State)) %>%
arrange(`Solve Rate`) %>%
filter(Year == 2014) %>%
slice(1:5)
CrimesPerYear %>%
left_join(AvgSolveRate) %>%
ggplot(aes(x = Year, y = `Solve Rate`, group = State)) +
geom_line(alpha = .1) +
geom_line(data = filter(CrimesPerYear, State %in% top5$State),
aes(Year, `Solve Rate`, group = State), color = "dodgerblue") +
geom_line(data = filter(CrimesPerYear, State %in% bottom5$State),
aes(Year, `Solve Rate`, group = State), color = "red") +
geom_line(aes(Year, `Yearly Avg`, group = 1), linetype = "dashed") +
annotate("text", x = 2014.5, y = .78, label = "Average", size = 3) +
geom_point(data = top5, aes(Year, `Solve Rate`), color = "dodgerblue") +
geom_text_repel(data = top5, aes(label = State), nudge_x = 3, size = 3) +
geom_point(data = bottom5, aes(Year, `Solve Rate`), color = "red") +
geom_text_repel(data = bottom5, aes(label = State), nudge_x = 3, size = 3) +
scale_y_continuous(labels = scales::percent) +
theme_minimal() +
labs(title = "Annual State Solved Homicide Rate",
subtitle = "The national average has remained relatively unchanged since 1980")
Then that analysis was fleshed out to the most recent year in the study, 2014, to identify which states were performing better or worse relative to the national average. Notably, many of the highest populated states were the largest under performers. These included New York, California, Illinois and Massachusetts.
CrimesPerYear %>%
arrange(`Solve Rate`) %>%
left_join(AvgSolveRate) %>%
mutate(Above = ifelse(`Solve Rate` - `Yearly Avg` > 0, TRUE, FALSE)) %>%
filter(Year == 2014) %>%
ungroup() %>%
mutate(State = factor(State, levels = State)) %>%
ggplot(aes(x = `Solve Rate`, y = State, color = Above)) +
geom_segment(aes(x = `Yearly Avg`, y = State, xend = `Solve Rate`,
yend = State)) +
geom_point() +
scale_x_continuous(labels = scales::percent) +
theme_minimal() +
theme(legend.position = "none") +
labs(title = "2014 Comparison of State Homicide Solve Rate")
A natural question to ask when figuring out whether or not homicide cases are being handled properly is who is taking on these cases? Differing agencies benefit or lose out on access to time and resources which seems that it should bear an impact on their ability to deal with cases. From the body of the data, we see that far and away Municipal Police are the ones carrying out investigations, followed by Sheriffs. Noticeably, Regional Police and Tribal Police are rarely involved in the homicides reported. Municipal Police dealt with nearly 500,000 of the roughly 638,000 homicides included in the report.
ReducedData %>%
ggplot(aes(x = `Agency Type`, fill = `Agency Type`)) +
geom_bar() +
scale_y_continuous(labels = scales::comma) +
theme_minimal() +
ylab("Frequency") +
labs(title = "Frequency of Agency Type handling homicide investigations")
For further investigation, Tribal Police have been relegated from consideration. The next metric of interest was the development of each agencies Solve Rate over time. Surprisingly, it seems that Municipal Police have in fact faltered in becoming more effective at closing cases. Only Municipal Police and Special Police have lower Solve Rates now than in 1980.
CrimesByAgency %>%
filter(`Agency Type` != "Tribal Police") %>%
ggplot(aes(x = Year, y = `Solve Rate`, group = `Agency Type`,
color = `Agency Type`)) +
geom_smooth(se = FALSE) +
facet_grid(`Agency Type` ~ .) +
scale_y_continuous(labels = scales::percent) +
theme_bw() +
theme(legend.position = "none") +
labs(title = "Solve Rates over time by Agency Type",
subtitle = "Only Municipal Police and Special Police report lower case closed numbers when \ncompared to the solve rate in 1980")
To further understand this, the Case Close Growth rate was tracked over time for Municipal Police. On average, the year over year solve rate has decreased since 1980 by .23%. Although the fluctuations around 0% make identifying a solid trend difficult, since 2010 the Municipal Police Force has yet to increase their Solve Rate from the previous year.
CrimesByAgency %>%
ungroup() %>%
filter(`Agency Type` == "Municipal Police") %>%
mutate(Lead = lead(`Solve Rate`),
`Crime Close Growth` = (Lead - `Solve Rate`) / `Solve Rate`) %>%
na.zero() %>%
mutate(GrowthAvg = mean(`Crime Close Growth`),
Above = ifelse(`Crime Close Growth` > 0, TRUE, FALSE)) %>%
filter(Year != 2014) %>%
ggplot(aes(x = Year, y = `Crime Close Growth`, color = Above)) +
geom_segment(aes(x = Year, y = 0, xend = Year,
yend = `Crime Close Growth`)) +
geom_hline(aes(yintercept = GrowthAvg), linetype = "dashed") +
annotate("text", x = 2014, y = 0.001, label = "Average = - .23%", size = 3) +
geom_point() +
scale_y_continuous(labels = scales::percent) +
theme_minimal() +
theme(legend.position = "none") +
labs(title = "Municipal Police Force Year over Year Solve Rate Growth",
subtitle = "Although Municipial Police deal with by far the most homicide cases, their \ngrowth rate averages to an annual decrease of .23%")
The next step is to come down from the aerie view of crime statistics in identifying and understanding the profiles of victims and perpetrators in relation to the homicides at hand. This is the phase where the data can serve most actively in bearing an impact on the aforementioned national crime-health indicators. Over the period surveyed, the median victim average decade over decade surprisingly moved only marginally away from age 30, shedding light on frequently targeted groups. The median was used due to significant skewness present in the victim age data
The mean was used in the analysis of perpetrator age due to a relative lack of skewness in the data. Although this figure also remained stable, it tended around age 21, signifying a 9 year disparity between the ages of victims and perpetrators.
VicAgeMedian <- ReducedData %>%
filter(Year %in% c(1980, 1990, 2000, 2010)) %>%
group_by(Year) %>%
summarise(Median = median(`Victim Age`)) %>%
mutate(Label = paste0(prettyNum(round(Median, 1))))
ReducedData %>%
filter(`Victim Age` < 998, `Perpetrator Age` != 0,
Year %in% c(1980,1990,2000,2010)) %>%
ggplot(aes(x = `Victim Age`)) +
geom_histogram(color = "grey30", fill = "dodgerblue", alpha = .3) +
geom_vline(data = VicAgeMedian, aes(xintercept = Median), linetype = "dashed",
color = "red") +
facet_grid(Year ~ .) +
geom_text(data = VicAgeMedian, aes(x = Median + 3.5, y = 2000, label = Label),
size = 4) +
ylab("Frequency") +
labs(title = "Victim Age Distribution in the United States from 1980 to 2010",
subtitle = "Median Victim Age has remained steadily around 30 years of age") +
theme_bw()
PerpAgeMean <- ReducedData %>%
filter(Year %in% c(1980, 1990, 2000, 2010)) %>%
group_by(Year) %>%
summarise(Mean = mean(`Perpetrator Age`)) %>%
mutate(Label = paste0(prettyNum(round(Mean, 1))))
ReducedData %>%
filter(`Victim Age` < 998, `Perpetrator Age` != 0,
Year %in% c(1980,1990,2000,2010)) %>%
ggplot(aes(x = `Perpetrator Age`)) +
geom_histogram(color = "grey30", fill = "dodgerblue", alpha = .3) +
facet_grid(Year ~ .) +
geom_vline(data = PerpAgeMean, aes(xintercept = Mean), linetype = "dashed",
color = "red") +
geom_text(data = PerpAgeMean, aes(x = Mean - 3.5, y = 2950, label = Label),
size = 4) +
ylab("Frequency") +
labs(title = "Perpetrator Age Distribution in the United States from 1980 to 2010",
subtitle = "Mean perpetrator age has remained steadily around 21 years of age") +
theme_bw()
The ultimate goal of an analysis of this scope is to assess the evidence at hand in retroactively identifying portions of the data that give new insights into cold cases, as well as tailoring enforcement procedures around the predictions given by a model. In particular, identifying how victim-perpetrator relationships, weapon type, and geography bear weight in homicide cases presents an area where more work needs to be done. Modeling these figures requires linear regression models as well as further investigation in the descriptive statistics relevant to creating victim and perpetrator profiles.