This project explores the question: What demographic and situational factors are common among people who commit homicide in the United States? And can those factors be used to identify offender types or predict how a crime is committed?
The goal is to use clustering and logistic regression in order to uncover meaningful patterns in homicide offender behavior.
All visualizations were created using base R and the ggplot2 package.
To answer this question, I used the FBI Homicide Reports dataset (1980-2014). The data was compiled from the FBI's Supplementary Homicide Reports (SHR) program and from Freedom of Information Act data. It contains detailed information on more than 630,000 homicides reported by U.S. law enforcement agencies.
This dataset was obtained from Kaggle (https://www.kaggle.com/datasets/murderaccountability/homicide-reports). It includes 24 variables per case, but for this analysis I focused on the variables that describe the offender in order to investigate common traits and patterns among homicide offenders.
Key variables include:
Perpetrator Age: Age of the individual who committed the homicide
Perpetrator Sex: Gender of the offender (Male/Female)
Perpetrator Race: Racial identification of the offender (e.g. White, Native American/Alaska Native, Asian/Pacific Islander, Black, Unknown)
Perpetrator Ethnicity: Hispanic/Not Hispanic/Unknown
Relationship: The relationship between the offender and the victim (e.g. Stranger, Wife, Neighbor, etc.)
Weapon: The type of weapon used in the homicide (e.g., Handgun, Knife, Firearm, Blunt Object, Shotgun, Rifle)
Crime Solved: Whether the case was cleared by arrest or not
df <- read.csv("database.csv", stringsAsFactors = FALSE)
str(df) #View the structure of the dataset
## 'data.frame': 638454 obs. of 24 variables:
## $ Record.ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Agency.Code : chr "AK00101" "AK00101" "AK00101" "AK00101" ...
## $ Agency.Name : chr "Anchorage" "Anchorage" "Anchorage" "Anchorage" ...
## $ Agency.Type : chr "Municipal Police" "Municipal Police" "Municipal Police" "Municipal Police" ...
## $ City : chr "Anchorage" "Anchorage" "Anchorage" "Anchorage" ...
## $ State : chr "Alaska" "Alaska" "Alaska" "Alaska" ...
## $ Year : int 1980 1980 1980 1980 1980 1980 1980 1980 1980 1980 ...
## $ Month : chr "January" "March" "March" "April" ...
## $ Incident : int 1 1 2 1 2 1 2 1 2 3 ...
## $ Crime.Type : chr "Murder or Manslaughter" "Murder or Manslaughter" "Murder or Manslaughter" "Murder or Manslaughter" ...
## $ Crime.Solved : chr "Yes" "Yes" "No" "Yes" ...
## $ Victim.Sex : chr "Male" "Male" "Female" "Male" ...
## $ Victim.Age : int 14 43 30 43 30 30 42 99 32 38 ...
## $ Victim.Race : chr "Native American/Alaska Native" "White" "Native American/Alaska Native" "White" ...
## $ Victim.Ethnicity : chr "Unknown" "Unknown" "Unknown" "Unknown" ...
## $ Perpetrator.Sex : chr "Male" "Male" "Unknown" "Male" ...
## $ Perpetrator.Age : int 15 42 0 42 0 36 27 35 0 40 ...
## $ Perpetrator.Race : chr "Native American/Alaska Native" "White" "Unknown" "White" ...
## $ Perpetrator.Ethnicity: chr "Unknown" "Unknown" "Unknown" "Unknown" ...
## $ Relationship : chr "Acquaintance" "Acquaintance" "Unknown" "Acquaintance" ...
## $ Weapon : chr "Blunt Object" "Strangulation" "Unknown" "Strangulation" ...
## $ Victim.Count : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Perpetrator.Count : int 0 0 0 0 1 0 0 0 0 1 ...
## $ Record.Source : chr "FBI" "FBI" "FBI" "FBI" ...
The raw dataset contains over 630,000 records, but many rows include incomplete or unknown values. To fix this, I removed rows containing placeholder values such as “Unknown” or a perpetrator age of 0, converted Perpetrator Age from integer to numeric, and removed any remaining rows with missing data.
df <- read.csv("database.csv", stringsAsFactors = FALSE)
# Filter out unwanted values
df_clean <- subset(df,
Perpetrator.Age != 0 &
Perpetrator.Sex != "Unknown" &
Perpetrator.Race != "Unknown" &
Perpetrator.Ethnicity != "Unknown" &
Relationship != "Unknown" &
Weapon != "Unknown"
)
# Convert age to numeric
df_clean$Perpetrator.Age <- as.numeric(df_clean$Perpetrator.Age)
# Drop rows with missing values
df_clean <- na.omit(df_clean)
nrow(df_clean)
## [1] 152624
After cleaning, the dataset was reduced to 152,624 complete records.
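The same filtering idea can also be expressed by recoding placeholder values to NA first and then dropping incomplete rows, which makes it easy to see how much each column contributes to the loss. The sketch below is a minimal illustration on a made-up five-row data frame: `toy`, `missing_per_column`, and `toy_clean` are hypothetical names, and the values are not taken from the dataset. It assumes, as the filter above does, that an age of 0 and the label "Unknown" both stand for missing information.

```r
# Hypothetical toy data mimicking three of the columns used above
toy <- data.frame(
  Perpetrator.Age = c(15, 0, 42, 27, 0),
  Perpetrator.Sex = c("Male", "Unknown", "Male", "Female", "Male"),
  Weapon          = c("Knife", "Unknown", "Handgun", "Unknown", "Rifle"),
  stringsAsFactors = FALSE
)

# Recode placeholders to NA so missingness is explicit
toy$Perpetrator.Age[toy$Perpetrator.Age == 0] <- NA
char_cols <- sapply(toy, is.character)
toy[char_cols] <- lapply(toy[char_cols], function(x) {
  x[x == "Unknown"] <- NA
  x
})

# Count missing values per column before dropping anything
missing_per_column <- colSums(is.na(toy))
print(missing_per_column)

# Keep only rows with no missing values
toy_clean <- toy[complete.cases(toy), ]
print(nrow(toy_clean))
```

Inspecting the per-column counts before calling `complete.cases()` shows which variable is responsible for most of the dropped rows.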
sex_counts <- table(df_clean$Perpetrator.Sex)
sex_percents <- round(prop.table(sex_counts) * 100,2)
data.frame(Sex = names(sex_counts),
Count = as.vector(sex_counts),
Percent = as.vector(sex_percents)) # as.vector() keeps Percent as one plain column
An overwhelming majority of offenders are male - about 88% or 134,327 of the 152,624 cases.
weapon_counts <- sort(table(df_clean$Weapon), decreasing = TRUE)
barplot(head(weapon_counts, 10),
col = "lemonchiffon",
las = 2,
main = "Top 10 Weapons Used",
ylab = "Count")
weapon_percents <- round(prop.table(weapon_counts) * 100, 2)
data.frame(Weapon = names(head(weapon_counts, 10)),
Count = as.vector(head(weapon_counts, 10)),
Percent = as.vector(head(weapon_percents, 10))) # as.vector() keeps Percent as one plain column
Handguns are by far the most common weapon with 75,974 cases. The second most common is a knife followed by blunt objects.
hist(df_clean$Perpetrator.Age,
breaks = 30,
col = "rosybrown1",
main = "Distribution of Perpetrator Age",
xlab = "Age")
summary(df_clean$Perpetrator.Age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 22.00 28.00 31.44 38.00 99.00
The age distribution is right-skewed, with a median of 28. Most offenders fall between ages 20 and 40, but a smaller number of outliers exist in much older age ranges.
rel_counts <- sort(table(df_clean$Relationship), decreasing = TRUE)
barplot(head(rel_counts, 10),
col = "thistle",
las = 2,
main = "Top 10 Offender-Victim Relationships",
ylab = "Count")
rel_percents <- round(prop.table(rel_counts) * 100, 2)
data.frame(Relationship = names(head(rel_counts, 10)),
Count = as.vector(head(rel_counts, 10)),
Percent = as.vector(head(rel_percents, 10))) # as.vector() keeps Percent as one plain column
The most common offender-victim relationship is “Acquaintance”. This means offenders are more likely to kill someone they know rather than a stranger.
Another noteworthy detail is that “Girlfriend” ranks 5th, while “Boyfriend” is last, suggesting a gendered pattern in partner violence. Male perpetrators target female partners much more frequently than the reverse, which is alarming yet not surprising.
weapon_by_sex <- table(df_clean$Weapon, df_clean$Perpetrator.Sex)
top_weapons <- names(sort(rowSums(weapon_by_sex), decreasing = TRUE) [1:5])
weapon_by_sex_top <- weapon_by_sex[top_weapons, ]
barplot(weapon_by_sex_top,
beside = TRUE,
col = c("lemonchiffon", "lavender", "burlywood4", "snow3", "honeydew3"),
legend = TRUE,
las = 2,
main = "Top Weapons Used by Sex",
ylab = "Count")
This barplot shows the raw count of homicides committed using each weapon. Both male and female offenders share the same top 5 weapons of handguns, knives, blunt objects, shotguns, and lastly, rifles. While men are responsible for the vast majority of cases across all weapons, the order of weapon preference is consistent between genders.
weapon_counts_df <- as.data.frame.matrix(weapon_by_sex_top)
weapon_counts_df$Weapon <- rownames(weapon_counts_df)
weapon_totals <- rowSums(weapon_by_sex_top)
weapon_counts_df$Male.Percent <- round(weapon_by_sex_top[, "Male"] / weapon_totals * 100, 2)
weapon_counts_df$Female.Percent <- round(weapon_by_sex_top[, "Female"] / weapon_totals * 100, 2)
weapon_summary <- weapon_counts_df[, c("Weapon", "Male", "Male.Percent", "Female", "Female.Percent")]
print(weapon_summary)
## Weapon Male Male.Percent Female Female.Percent
## Handgun Handgun 68198 89.76 7776 10.24
## Knife Knife 23914 82.23 5169 17.77
## Blunt Object Blunt Object 15786 88.58 2036 11.42
## Shotgun Shotgun 10367 92.53 837 7.47
## Rifle Rifle 7856 91.95 688 8.05
This table shows counts and proportions within each weapon. While handguns are the most commonly used weapon overall, women account for a larger share of knife homicides (17.8%) than of any other top-five weapon (7.5% to 11.4%). This indicates that weapon preference differs by sex once the overall male majority is taken into account.
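A complementary view is to compute proportions within each sex rather than within each weapon. The sketch below re-enters the counts from the table above into a small matrix and uses `prop.table()` with `margin = 2`. The names `top5_counts` and `within_sex` are introduced here for illustration only, and the shares are relative to the top five weapons, not to all homicides.

```r
# Counts copied from the weapon summary table above (top five weapons only)
top5_counts <- matrix(
  c(68198, 23914, 15786, 10367, 7856,   # Male column
     7776,  5169,  2036,   837,  688),  # Female column
  ncol = 2,
  dimnames = list(
    Weapon = c("Handgun", "Knife", "Blunt Object", "Shotgun", "Rifle"),
    Sex    = c("Male", "Female")
  )
)

# margin = 2 divides each column by its own total, giving the share of
# each weapon within a sex instead of the share of each sex within a weapon
within_sex <- round(prop.table(top5_counts, margin = 2) * 100, 2)
print(within_sex)
```

In this breakdown knives make up roughly 31% of the female column and roughly 19% of the male column, which is the comparison the "normalized" reading calls for.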
cluster_df <- df_clean
cluster_df$SexNum <- ifelse(cluster_df$Perpetrator.Sex == "Male", 1, 0)
cluster_df$RaceNum <- as.numeric(factor(cluster_df$Perpetrator.Race))
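# Perpetrator.Race is a character column, so levels() returns NULL unless it is wrapped in factor()
race_levels <- levels(factor(df_clean$Perpetrator.Race))
race_levels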
cluster_data <- data.frame(
Age = cluster_df$Perpetrator.Age,
Sex = cluster_df$SexNum,
Race = cluster_df$RaceNum
)
cluster_scaled <- scale(cluster_data)
set.seed(123)
k3 <- kmeans(cluster_scaled, centers = 3, nstart = 25)
cluster_df$Cluster <- k3$cluster
table(cluster_df$Cluster)
##
## 1 2 3
## 54237 18297 80090
aggregate(cluster_data, by = list(Cluster = cluster_df$Cluster), FUN = mean)
To explore whether there are distinct “types” of homicide offenders, I applied k-means clustering using three variables: age, sex, and race. Each offender was grouped into one of three clusters based on these features.
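RaceNum key: 1 = Asian/Pacific Islander, 2 = Black, 3 = Native American/Alaska Native, 4 = White. These codes are nominal labels, so a cluster's mean RaceNum describes a mix of groups rather than a single category.
The results showed three distinct clusters of 54,237, 18,297, and 80,090 offenders. The second cluster's size matches the number of female offenders in the cleaned data (152,624 - 134,327 = 18,297), which suggests the split is driven largely by sex.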
These clusters highlight potential offender profiles that may assist in understanding the demographic distribution of homicide perpetrators, which could be valuable for crime prevention and criminology research. However, since context and motive are not included, clustering should be considered exploratory.
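One way to avoid treating race codes as if they were ordered is to expand race into 0/1 indicator columns before scaling. This is not the encoding used in the analysis above; the sketch below is a hedged alternative run on randomly generated toy data, and `toy`, `race_dummies`, `features`, and `km` are hypothetical names.

```r
set.seed(123)

# Hypothetical toy offenders; all values are randomly generated
n <- 200
toy <- data.frame(
  Age  = sample(18:70, n, replace = TRUE),
  Sex  = sample(c("Male", "Female"), n, replace = TRUE, prob = c(0.88, 0.12)),
  Race = factor(sample(c("Asian/Pacific Islander", "Black",
                         "Native American/Alaska Native", "White"),
                       n, replace = TRUE))
)

# model.matrix() with "- 1" drops the intercept, so every Race level
# gets its own 0/1 indicator column
race_dummies <- model.matrix(~ Race - 1, data = toy)

features <- cbind(
  Age  = toy$Age,
  Male = as.numeric(toy$Sex == "Male"),
  race_dummies
)

# Standardize the columns, then cluster as before
km <- kmeans(scale(features), centers = 3, nstart = 25)

# Cluster means of the indicator columns are proportions of each group
profile <- aggregate(as.data.frame(features),
                     by = list(Cluster = km$cluster), FUN = mean)
print(profile)
```

With indicator columns, each cluster mean reads directly as the proportion of that cluster belonging to a group, which is easier to interpret than an average of category codes.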
df_clean$UsedHandgun <- ifelse(df_clean$Weapon == "Handgun", 1, 0)
df_clean$Sex <- factor(df_clean$Perpetrator.Sex)
df_clean$Race <- factor(df_clean$Perpetrator.Race)
df_clean$Relationship <- factor(df_clean$Relationship)
logit_model <- glm(UsedHandgun ~ Perpetrator.Age + Sex + Race + Relationship,
data = df_clean, family = "binomial")
summary(logit_model)
##
## Call:
## glm(formula = UsedHandgun ~ Perpetrator.Age + Sex + Race + Relationship,
## family = "binomial", data = df_clean)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.4507900 0.0481999 -9.353 < 2e-16 ***
## Perpetrator.Age 0.0080737 0.0004333 18.633 < 2e-16 ***
## SexMale 0.2892946 0.0228226 12.676 < 2e-16 ***
## RaceBlack 0.0815408 0.0409562 1.991 0.046489 *
## RaceNative American/Alaska Native -1.0187714 0.0734538 -13.870 < 2e-16 ***
## RaceWhite -0.2009066 0.0405884 -4.950 7.43e-07 ***
## RelationshipBoyfriend -0.1434708 0.0440460 -3.257 0.001125 **
## RelationshipBoyfriend/Girlfriend -1.0191726 0.0917239 -11.111 < 2e-16 ***
## RelationshipBrother -0.4007945 0.0424830 -9.434 < 2e-16 ***
## RelationshipCommon-Law Husband -0.2254360 0.0682314 -3.304 0.000953 ***
## RelationshipCommon-Law Wife -0.2884987 0.0593824 -4.858 1.18e-06 ***
## RelationshipDaughter -1.4063118 0.0529240 -26.572 < 2e-16 ***
## RelationshipEmployee -0.0890615 0.1592775 -0.559 0.576053
## RelationshipEmployer 0.0187857 0.1294244 0.145 0.884594
## RelationshipEx-Husband 0.9955441 0.1251889 7.952 1.83e-15 ***
## RelationshipEx-Wife 0.2135461 0.0698162 3.059 0.002223 **
## RelationshipFamily -0.5459371 0.0352465 -15.489 < 2e-16 ***
## RelationshipFather -0.6145466 0.0511453 -12.016 < 2e-16 ***
## RelationshipFriend -0.0870114 0.0208334 -4.177 2.96e-05 ***
## RelationshipGirlfriend -0.2827982 0.0268430 -10.535 < 2e-16 ***
## RelationshipHusband 0.5311216 0.0384376 13.818 < 2e-16 ***
## RelationshipIn-Law -0.0189401 0.0457306 -0.414 0.678753
## RelationshipMother -1.1265660 0.0590605 -19.075 < 2e-16 ***
## RelationshipNeighbor -0.4642148 0.0384302 -12.079 < 2e-16 ***
## RelationshipSister -0.3330803 0.0947122 -3.517 0.000437 ***
## RelationshipSon -1.0866688 0.0416764 -26.074 < 2e-16 ***
## RelationshipStepdaughter -1.2242367 0.1363824 -8.977 < 2e-16 ***
## RelationshipStepfather -0.5807401 0.0833194 -6.970 3.17e-12 ***
## RelationshipStepmother -0.3958748 0.2234313 -1.772 0.076428 .
## RelationshipStepson -0.6451417 0.0911380 -7.079 1.45e-12 ***
## RelationshipStranger 0.5240941 0.0137949 37.992 < 2e-16 ***
## RelationshipWife -0.0265027 0.0231825 -1.143 0.252947
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 211579 on 152623 degrees of freedom
## Residual deviance: 203513 on 152592 degrees of freedom
## AIC: 203577
##
## Number of Fisher Scoring iterations: 4
predicted <- ifelse(predict(logit_model, type = "response") > 0.5, 1, 0)
mean(predicted == df_clean$UsedHandgun)
## [1] 0.5938581
To predict whether a homicide was committed with a handgun, I used logistic regression. The predictors were perpetrator age, sex, race, and their relationship to the victim.
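The model’s predictions were correct in approximately 59% of cases (accuracy 0.594, measured on the same data used to fit the model). For comparison, handguns account for 75,974 of the 152,624 cases (49.8%), so always predicting “no handgun” would be correct about 50% of the time. Demographic and relationship data therefore offer some, but limited, predictive power.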
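Accuracy is easier to judge when it is measured on held-out rows and set beside a majority-class baseline. The sketch below shows that comparison on randomly generated toy data. It is not a rerun of the model above: `toy`, the simulated coefficients, and the 70/30 split are all assumptions made for illustration.

```r
set.seed(123)

# Hypothetical toy data: a binary outcome loosely related to age and sex
n <- 1000
toy <- data.frame(
  Age  = sample(18:70, n, replace = TRUE),
  Male = rbinom(n, 1, 0.88)
)
toy$UsedHandgun <- rbinom(n, 1, plogis(-0.5 + 0.01 * toy$Age + 0.3 * toy$Male))

# Hold out 30% of rows so accuracy is measured on unseen cases
test_idx <- sample(seq_len(n), size = round(0.3 * n))
train <- toy[-test_idx, ]
test  <- toy[test_idx, ]

fit <- glm(UsedHandgun ~ Age + Male, data = train, family = "binomial")

# Classify at the same 0.5 cutoff used above
pred <- as.integer(predict(fit, newdata = test, type = "response") > 0.5)
model_accuracy <- mean(pred == test$UsedHandgun)

# Baseline: always predict the most common class in the training data
majority_class <- as.integer(mean(train$UsedHandgun) >= 0.5)
baseline_accuracy <- mean(test$UsedHandgun == majority_class)

# Confusion matrix of predicted vs. actual outcomes
confusion <- table(Predicted = pred, Actual = test$UsedHandgun)
print(c(model = model_accuracy, baseline = baseline_accuracy))
print(confusion)
```

The gap between the two accuracy figures, rather than the raw accuracy, indicates how much the predictors add.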
Key findings:
Positive coefficients indicate higher odds of handgun use. For example, a coefficient of 0.53 for Husband means the odds of using a handgun are approximately 70% higher for husbands than for the reference category (Acquaintance), holding the other predictors constant.
This model suggests that both demographics and the relationship to the victim are associated with weapon choice in homicide cases.
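The step from a log-odds coefficient to a percent change in odds is a single exponentiation. The sketch below applies it to four estimates copied from the regression summary above. `log_odds`, `odds_ratio`, and `percent_change` are names introduced here, and each result is relative to the reference relationship category, which appears to be Acquaintance since it is the only level without its own coefficient.

```r
# Log-odds estimates copied from the regression summary above
log_odds <- c(
  "Ex-Husband" = 0.9955441,
  "Husband"    = 0.5311216,
  "Stranger"   = 0.5240941,
  "Daughter"   = -1.4063118
)

# exp() turns a log-odds coefficient into an odds ratio
odds_ratio <- exp(log_odds)

# Percent change in the odds relative to the reference category
percent_change <- round((odds_ratio - 1) * 100, 1)
print(percent_change)
```

A negative value means lower odds: the Daughter estimate corresponds to odds roughly three quarters lower than the reference category.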
coefs <- summary(logit_model)$coefficients
conf_int <- confint(logit_model)
## Waiting for profiling to be done...
coef_df <- data.frame(
Variable = rownames(coefs),
Estimate = coefs[, "Estimate"],
Lower = conf_int[, 1],
Upper = conf_int[, 2]
)
coef_df <- coef_df[coef_df$Variable != "(Intercept)", ]
library(ggplot2)
ggplot(coef_df, aes(x = reorder(Variable, Estimate), y = Estimate)) +
geom_point(color = "lightpink3") +
geom_errorbar(aes(ymin = Lower, ymax = Upper), width = 0.2) +
geom_hline(yintercept = 0, linetype = "dashed", color = "burlywood4") +
coord_flip() +
labs(title = "Effect of Predictor Variables on Handgun Use",
x = "Predictor",
y = "Log-Odds Estimate (+- 95% CI)") +
theme_minimal()
While the logistic regression output provides valuable information, it’s a little difficult to interpret at a glance. I decided to create a coefficient plot to visually display each variable’s effect on the likelihood of handgun use, along with confidence intervals to indicate statistical significance.
Each dot represents the effect of a variable while each horizontal line represents the confidence interval. Dots to the right of 0 (like Strangers or Husbands) indicate an increased likelihood of handgun use while dots to the left of 0 (like Daughter or Mother) indicate a decreased likelihood.
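Reading significance off the plot amounts to checking whether an interval crosses 0. The sketch below makes that check explicit on a small model fitted to randomly generated toy data. It uses `confint.default()`, which returns Wald intervals and is a swapped-in alternative to the profile-likelihood intervals from `confint()` used above; `toy`, `fit`, and `ci_df` are hypothetical names.

```r
set.seed(123)

# Hypothetical toy data, generated only to have a model to summarize
n <- 500
toy <- data.frame(
  Age  = sample(18:70, n, replace = TRUE),
  Male = rbinom(n, 1, 0.88)
)
toy$UsedHandgun <- rbinom(n, 1, plogis(-0.5 + 0.01 * toy$Age + 0.3 * toy$Male))
fit <- glm(UsedHandgun ~ Age + Male, data = toy, family = "binomial")

# Wald intervals (estimate +/- about 1.96 standard errors); unlike confint(),
# confint.default() does not profile the likelihood
ci <- confint.default(fit)

ci_df <- data.frame(
  Variable = rownames(ci),
  Estimate = coef(fit),
  Lower    = ci[, 1],
  Upper    = ci[, 2]
)

# An interval that excludes 0 corresponds to p < 0.05 on the Wald test
ci_df$ExcludesZero <- ci_df$Lower > 0 | ci_df$Upper < 0
print(ci_df)
```

Wald intervals avoid the profiling step, which can be slow on a dataset of this size, at the cost of being an approximation.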
This project explored offender-level characteristics in U.S. homicide data to identify common patterns and predict weapon use. Through exploratory analysis, I found that most homicide offenders are male and between the ages of 20 and 40, with a median age of 28. Surprisingly, the most common relationship between victim and offender was “Acquaintance”, suggesting that many homicides occur between people who know each other.
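Using k-means clustering, I identified three offender profiles, which I summarized as young Black males, Native American females, and White males. Because race entered the clustering as a numeric code, these labels are approximate: the female cluster (18,297 offenders, equal to the total number of female offenders) likely spans several racial groups.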
Finally, I built a logistic regression model to predict whether a handgun was used in the homicide. The model indicated that offender sex, race, age, and relationship to the victim are all significantly associated with the likelihood of handgun use. Relationships like “Stranger” and “Husband” were strong positive predictors, while family members like “Daughter” and “Mother” were strong negative predictors.
Overall, this analysis shows that demographic and relational context can help us to understand trends in violent crime.
Murder Accountability Project. “Homicide Reports, 1980-2014”. Kaggle, 2017, https://www.kaggle.com/datasets/murderaccountability/homicide-reports/data.