Introduction/Problem Statement

This project explores the question: What demographic and situational factors are common among people who commit homicide in the United States? And can those factors be used to identify offender types or predict how a crime is committed?

The goal is to use clustering and logistic regression in order to uncover meaningful patterns in homicide offender behavior.

All visualizations were created using base R and the ggplot2 package.

Dataset Description

To answer this question, I used the FBI Homicide Reports dataset (1980-2014). The data was originally compiled from the FBI's Supplementary Homicide Reports (SHR) program and from data obtained through the Freedom of Information Act. It contains detailed information on over 638,000 homicides reported by U.S. law enforcement agencies.

This dataset was obtained from Kaggle (https://www.kaggle.com/datasets/murderaccountability/homicide-reports). It includes 24 variables per case, but for this analysis I focused on offender characteristics in order to investigate common traits and patterns among homicide offenders.

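Key variables include Perpetrator.Sex, Perpetrator.Age, Perpetrator.Race, Perpetrator.Ethnicity, Relationship, and Weapon.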

Loading the Data

df <- read.csv("database.csv", stringsAsFactors = FALSE)

str(df) #View the structure of the dataset
## 'data.frame':    638454 obs. of  24 variables:
##  $ Record.ID            : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Agency.Code          : chr  "AK00101" "AK00101" "AK00101" "AK00101" ...
##  $ Agency.Name          : chr  "Anchorage" "Anchorage" "Anchorage" "Anchorage" ...
##  $ Agency.Type          : chr  "Municipal Police" "Municipal Police" "Municipal Police" "Municipal Police" ...
##  $ City                 : chr  "Anchorage" "Anchorage" "Anchorage" "Anchorage" ...
##  $ State                : chr  "Alaska" "Alaska" "Alaska" "Alaska" ...
##  $ Year                 : int  1980 1980 1980 1980 1980 1980 1980 1980 1980 1980 ...
##  $ Month                : chr  "January" "March" "March" "April" ...
##  $ Incident             : int  1 1 2 1 2 1 2 1 2 3 ...
##  $ Crime.Type           : chr  "Murder or Manslaughter" "Murder or Manslaughter" "Murder or Manslaughter" "Murder or Manslaughter" ...
##  $ Crime.Solved         : chr  "Yes" "Yes" "No" "Yes" ...
##  $ Victim.Sex           : chr  "Male" "Male" "Female" "Male" ...
##  $ Victim.Age           : int  14 43 30 43 30 30 42 99 32 38 ...
##  $ Victim.Race          : chr  "Native American/Alaska Native" "White" "Native American/Alaska Native" "White" ...
##  $ Victim.Ethnicity     : chr  "Unknown" "Unknown" "Unknown" "Unknown" ...
##  $ Perpetrator.Sex      : chr  "Male" "Male" "Unknown" "Male" ...
##  $ Perpetrator.Age      : int  15 42 0 42 0 36 27 35 0 40 ...
##  $ Perpetrator.Race     : chr  "Native American/Alaska Native" "White" "Unknown" "White" ...
##  $ Perpetrator.Ethnicity: chr  "Unknown" "Unknown" "Unknown" "Unknown" ...
##  $ Relationship         : chr  "Acquaintance" "Acquaintance" "Unknown" "Acquaintance" ...
##  $ Weapon               : chr  "Blunt Object" "Strangulation" "Unknown" "Strangulation" ...
##  $ Victim.Count         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Perpetrator.Count    : int  0 0 0 0 1 0 0 0 0 1 ...
##  $ Record.Source        : chr  "FBI" "FBI" "FBI" "FBI" ...

Cleaning the Data

The raw dataset contains over 630,000 records, but many rows include incomplete or unknown values. To fix this, I removed rows with placeholder values such as “Unknown” (or 0 for perpetrator age), converted Perpetrator Age to numeric, and removed rows with missing data.

df <- read.csv("database.csv", stringsAsFactors = FALSE)

# Filter out unwanted values
df_clean <- subset(df,
  Perpetrator.Age != 0 &
  Perpetrator.Sex != "Unknown" &
  Perpetrator.Race != "Unknown" &
  Perpetrator.Ethnicity != "Unknown" &
  Relationship != "Unknown" &
  Weapon != "Unknown"
)

# Convert age to numeric 
df_clean$Perpetrator.Age <- as.numeric(df_clean$Perpetrator.Age)

# Drop rows with missing values
df_clean <- na.omit(df_clean)

nrow(df_clean)
## [1] 152624

After cleaning, the dataset was reduced to 152,624 complete records.
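Since the filter removes roughly three quarters of the records, it can help to see how many rows each rule drops. The sketch below is a minimal illustration on a made-up toy data frame (the `toy`, `dropped_by`, `drop_counts`, and `kept` names are hypothetical, and only three of the six rules are shown); it is not the output of the real dataset.

```r
# Toy data standing in for the real file; column names match the dataset,
# but the values are invented for illustration.
toy <- data.frame(
  Perpetrator.Age = c(25, 0, 40, 31, 0),
  Perpetrator.Sex = c("Male", "Unknown", "Female", "Male", "Male"),
  Weapon          = c("Handgun", "Unknown", "Knife", "Unknown", "Rifle"),
  stringsAsFactors = FALSE
)

# One logical vector per cleaning rule: TRUE means the row fails that rule.
dropped_by <- list(
  age_zero       = toy$Perpetrator.Age == 0,
  sex_unknown    = toy$Perpetrator.Sex == "Unknown",
  weapon_unknown = toy$Weapon == "Unknown"
)

# Rows failing each rule (a single row can fail several rules at once).
drop_counts <- sapply(dropped_by, sum)

# Rows that pass every rule, which is what the subset() call keeps.
kept <- toy[!Reduce(`|`, dropped_by), ]

print(drop_counts)
print(nrow(kept))
```

Applied to the real data, the same pattern would show which variable is responsible for most of the lost records.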

Exploratory Data Analysis

Do most homicide offenders tend to be male or female?

sex_counts <- table(df_clean$Perpetrator.Sex)
sex_percents <- round(prop.table(sex_counts) * 100, 2)
# as.vector() keeps Percent as a single numeric column instead of a nested table
data.frame(Sex = names(sex_counts),
           Count = as.vector(sex_counts),
           Percent = as.vector(sex_percents))

An overwhelming majority of offenders are male: about 88%, or 134,327 of the 152,624 cases.

Most Common Weapons

weapon_counts <- sort(table(df_clean$Weapon), decreasing = TRUE)

barplot(head(weapon_counts, 10),
        col = "lemonchiffon",
        las = 2,
        main = "Top 10 Weapons Used",
        ylab = "Count")

weapon_percents <- round(prop.table(weapon_counts) * 100, 2)
# as.vector() keeps Percent as a single numeric column instead of a nested table
data.frame(Weapon = names(head(weapon_counts, 10)),
           Count = as.vector(head(weapon_counts, 10)),
           Percent = as.vector(head(weapon_percents, 10)))

Handguns are by far the most common weapon, with 75,974 cases. Knives are the second most common, followed by blunt objects.

Age Distribution

hist(df_clean$Perpetrator.Age,
     breaks = 30,
     col = "rosybrown1",
     main = "Distribution of Perpetrator Age",
     xlab = "Age")

summary(df_clean$Perpetrator.Age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   22.00   28.00   31.44   38.00   99.00

The age distribution is right-skewed, with a median of 28 and a mean of 31.4. Half of all offenders fall between ages 22 and 38, with a smaller number of outliers at much older ages.

Relationship to Victim

rel_counts <- sort(table(df_clean$Relationship), decreasing = TRUE)
barplot(head(rel_counts, 10),
        col = "thistle",
        las = 2,
        main = "Top 10 Offender-Victim Relationships",
        ylab = "Count")

rel_percents <- round(prop.table(rel_counts) * 100, 2)
# as.vector() keeps Percent as a single numeric column instead of a nested table
data.frame(Relationship = names(head(rel_counts, 10)),
           Count = as.vector(head(rel_counts, 10)),
           Percent = as.vector(head(rel_percents, 10)))

The most common offender-victim relationship is “Acquaintance”. This means offenders are more likely to kill someone they know rather than a stranger.

Another noteworthy detail is that “Girlfriend” ranks 5th, while “Boyfriend” is last, suggesting a gendered pattern in partner violence. Male perpetrators target female partners much more frequently than the reverse, which is alarming yet not surprising.

Weapon Used by Sex

weapon_by_sex <- table(df_clean$Weapon, df_clean$Perpetrator.Sex)

top_weapons <- names(sort(rowSums(weapon_by_sex), decreasing = TRUE) [1:5])
weapon_by_sex_top <- weapon_by_sex[top_weapons, ]

barplot(weapon_by_sex_top,
        beside = TRUE,
        col = c("lemonchiffon", "lavender", "burlywood4", "snow3", "honeydew3"),
        legend = TRUE,
        las = 2,
        main = "Top Weapons Used by Sex",
        ylab = "Count")

This barplot shows the raw count of homicides committed using each weapon. Both male and female offenders share the same top 5 weapons of handguns, knives, blunt objects, shotguns, and lastly, rifles. While men are responsible for the vast majority of cases across all weapons, the order of weapon preference is consistent between genders.

weapon_counts_df <- as.data.frame.matrix(weapon_by_sex_top)
weapon_counts_df$Weapon <- rownames(weapon_counts_df)

weapon_totals <- rowSums(weapon_by_sex_top)
weapon_counts_df$Male.Percent <- round(weapon_by_sex_top[, "Male"] / weapon_totals * 100, 2)
weapon_counts_df$Female.Percent <- round(weapon_by_sex_top[, "Female"] / weapon_totals * 100, 2)

weapon_summary <- weapon_counts_df[, c("Weapon", "Male", "Male.Percent", "Female", "Female.Percent")]
print(weapon_summary)
##                    Weapon  Male Male.Percent Female Female.Percent
## Handgun           Handgun 68198        89.76   7776          10.24
## Knife               Knife 23914        82.23   5169          17.77
## Blunt Object Blunt Object 15786        88.58   2036          11.42
## Shotgun           Shotgun 10367        92.53    837           7.47
## Rifle               Rifle  7856        91.95    688           8.05

This table shows counts and proportions within each weapon. While handguns are the most commonly used weapon overall, women account for a larger share of knife homicides (17.8%) than of handgun homicides (10.2%), and more than their roughly 12% share of all offenders. This indicates that weapon preference differs by gender.
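The percentages above are each sex's share within a weapon. A complementary view is each weapon's share within a sex, which can be sketched with `prop.table()` and `margin = 2`. The counts below are copied from the printed table, so the shares are among the top five weapons only, not all homicides; the `within_sex` name is introduced here for illustration.

```r
# Counts copied from the weapon_summary table (top five weapons only).
weapon_by_sex_top <- matrix(
  c(68198, 23914, 15786, 10367, 7856,   # Male column
     7776,  5169,  2036,   837,  688),  # Female column
  ncol = 2,
  dimnames = list(
    Weapon = c("Handgun", "Knife", "Blunt Object", "Shotgun", "Rifle"),
    Sex    = c("Male", "Female")
  )
)

# margin = 2 divides each column by its own total,
# so the percentages for each sex sum to 100.
within_sex <- round(prop.table(weapon_by_sex_top, margin = 2) * 100, 1)
print(within_sex)
```

Read column by column, this shows how each sex's homicides are distributed across the five weapons, which is the comparison needed to talk about preference rather than volume.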

Clustering Offender Types

cluster_df <- df_clean

cluster_df$SexNum <- ifelse(cluster_df$Perpetrator.Sex == "Male", 1, 0)

cluster_df$RaceNum <- as.numeric(factor(cluster_df$Perpetrator.Race))

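# Perpetrator.Race is a character column, so convert it to a factor before asking for levels
race_levels <- levels(factor(df_clean$Perpetrator.Race))
race_levels
## [1] "Asian/Pacific Islander"        "Black"                        
## [3] "Native American/Alaska Native" "White"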
cluster_data <- data.frame(
  Age = cluster_df$Perpetrator.Age,
  Sex = cluster_df$SexNum,
  Race = cluster_df$RaceNum
)

cluster_scaled <- scale(cluster_data)

set.seed(123)
k3 <- kmeans(cluster_scaled, centers = 3, nstart = 25)
cluster_df$Cluster <- k3$cluster
table(cluster_df$Cluster)
## 
##     1     2     3 
## 54237 18297 80090
aggregate(cluster_data, by = list(Cluster = cluster_df$Cluster), FUN = mean)

To explore whether there are distinct “types” of homicide offenders, I applied k-means clustering using three variables: age, sex, and race. Each offender was grouped into one of three clusters based on these features.

RaceNum key: 1 = Asian/Pacific Islander, 2 = Black, 3 = Native American/Alaska Native, 4 = White
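One common way to sanity-check the number of clusters is an elbow check on the total within-cluster sum of squares. The sketch below runs on simulated data, not on the homicide data; `sim`, `sim_scaled`, and `wss` are hypothetical names, and the three simulated groups are an assumption chosen only to make the pattern visible.

```r
set.seed(1)

# Simulated stand-in for cluster_scaled: three loose groups in two dimensions.
sim <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 4), ncol = 2),
  matrix(rnorm(100, mean = 8), ncol = 2)
)
sim_scaled <- scale(sim)

# Total within-cluster sum of squares for k = 1 through 6.
wss <- sapply(1:6, function(k) {
  kmeans(sim_scaled, centers = k, nstart = 10)$tot.withinss
})

# The "elbow" is the k after which wss stops dropping sharply.
print(round(wss, 1))
```

On the real data, `cluster_scaled` would take the place of `sim_scaled`; with about 150,000 rows a smaller `nstart` may be needed to keep the run time reasonable.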

The results showed three distinct clusters:

  • Cluster 1: Predominantly young Black male offenders, with an average age just under 30
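  • Cluster 2: The smallest cluster, made up of female offenders who are slightly older on average. Its race composition cannot be read from the cluster mean alone, because a mean of the nominal RaceNum codes near 3 can result from a mix of Black (2) and White (4) offenders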
  • Cluster 3: The largest group, consisting mostly of White males around age 32

These clusters highlight potential offender profiles that may assist in understanding the demographic distribution of homicide perpetrators, which could be valuable for crime prevention and criminology research. However, since context and motive are not included, clustering should be considered exploratory.
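Because RaceNum is a nominal code, a cluster's mean race value does not identify a race. The toy example below (hypothetical data, not taken from the dataset) shows how a cluster with no Native American/Alaska Native offenders can still average to code 3, and how a frequency table reveals the actual composition.

```r
# Race labels in the same order as the RaceNum key.
race_labels <- c("Asian/Pacific Islander", "Black",
                 "Native American/Alaska Native", "White")

# Hypothetical cluster of 10 offenders: five coded 2 (Black), five coded 4 (White).
toy_race <- factor(rep(c("Black", "White"), each = 5), levels = race_labels)

# Averaging the integer codes gives 3, the code for Native American/Alaska Native,
# even though nobody in this toy cluster belongs to that group.
mean_code <- mean(as.numeric(toy_race))

# A frequency table shows the real composition.
composition <- table(toy_race)

print(mean_code)
print(composition)
```

On the real clusters, `table(cluster_df$Cluster, cluster_df$Perpetrator.Race)` would give the race composition of each cluster directly.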

Logistic Regression to Predict the Weapon Type

df_clean$UsedHandgun <- ifelse(df_clean$Weapon == "Handgun", 1, 0)

df_clean$Sex <- factor(df_clean$Perpetrator.Sex)
df_clean$Race <- factor(df_clean$Perpetrator.Race)
df_clean$Relationship <- factor(df_clean$Relationship)

logit_model <- glm(UsedHandgun ~ Perpetrator.Age + Sex + Race + Relationship,
                   data = df_clean, family = "binomial")
summary(logit_model)
## 
## Call:
## glm(formula = UsedHandgun ~ Perpetrator.Age + Sex + Race + Relationship, 
##     family = "binomial", data = df_clean)
## 
## Coefficients:
##                                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                       -0.4507900  0.0481999  -9.353  < 2e-16 ***
## Perpetrator.Age                    0.0080737  0.0004333  18.633  < 2e-16 ***
## SexMale                            0.2892946  0.0228226  12.676  < 2e-16 ***
## RaceBlack                          0.0815408  0.0409562   1.991 0.046489 *  
## RaceNative American/Alaska Native -1.0187714  0.0734538 -13.870  < 2e-16 ***
## RaceWhite                         -0.2009066  0.0405884  -4.950 7.43e-07 ***
## RelationshipBoyfriend             -0.1434708  0.0440460  -3.257 0.001125 ** 
## RelationshipBoyfriend/Girlfriend  -1.0191726  0.0917239 -11.111  < 2e-16 ***
## RelationshipBrother               -0.4007945  0.0424830  -9.434  < 2e-16 ***
## RelationshipCommon-Law Husband    -0.2254360  0.0682314  -3.304 0.000953 ***
## RelationshipCommon-Law Wife       -0.2884987  0.0593824  -4.858 1.18e-06 ***
## RelationshipDaughter              -1.4063118  0.0529240 -26.572  < 2e-16 ***
## RelationshipEmployee              -0.0890615  0.1592775  -0.559 0.576053    
## RelationshipEmployer               0.0187857  0.1294244   0.145 0.884594    
## RelationshipEx-Husband             0.9955441  0.1251889   7.952 1.83e-15 ***
## RelationshipEx-Wife                0.2135461  0.0698162   3.059 0.002223 ** 
## RelationshipFamily                -0.5459371  0.0352465 -15.489  < 2e-16 ***
## RelationshipFather                -0.6145466  0.0511453 -12.016  < 2e-16 ***
## RelationshipFriend                -0.0870114  0.0208334  -4.177 2.96e-05 ***
## RelationshipGirlfriend            -0.2827982  0.0268430 -10.535  < 2e-16 ***
## RelationshipHusband                0.5311216  0.0384376  13.818  < 2e-16 ***
## RelationshipIn-Law                -0.0189401  0.0457306  -0.414 0.678753    
## RelationshipMother                -1.1265660  0.0590605 -19.075  < 2e-16 ***
## RelationshipNeighbor              -0.4642148  0.0384302 -12.079  < 2e-16 ***
## RelationshipSister                -0.3330803  0.0947122  -3.517 0.000437 ***
## RelationshipSon                   -1.0866688  0.0416764 -26.074  < 2e-16 ***
## RelationshipStepdaughter          -1.2242367  0.1363824  -8.977  < 2e-16 ***
## RelationshipStepfather            -0.5807401  0.0833194  -6.970 3.17e-12 ***
## RelationshipStepmother            -0.3958748  0.2234313  -1.772 0.076428 .  
## RelationshipStepson               -0.6451417  0.0911380  -7.079 1.45e-12 ***
## RelationshipStranger               0.5240941  0.0137949  37.992  < 2e-16 ***
## RelationshipWife                  -0.0265027  0.0231825  -1.143 0.252947    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 211579  on 152623  degrees of freedom
## Residual deviance: 203513  on 152592  degrees of freedom
## AIC: 203577
## 
## Number of Fisher Scoring iterations: 4
predicted <- ifelse(predict(logit_model, type = "response") > 0.5, 1, 0)
mean(predicted == df_clean$UsedHandgun)
## [1] 0.5938581

To predict whether a homicide was committed with a handgun, I used logistic regression. The predictors were perpetrator age, sex, race, and their relationship to the victim.

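The model’s predictions were correct in approximately 59% of cases. Because 49.8% of the cleaned cases (75,974 of 152,624) involved a handgun, always predicting “no handgun” would be correct about 50% of the time, so demographic and relationship data offer real but limited predictive power. This accuracy is also measured on the same data used to fit the model.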
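To put the accuracy figure in context, it can be compared with a naive baseline that always predicts the more common class. The sketch below uses only numbers reported earlier in this analysis; the variable names are introduced here for illustration.

```r
# Figures reported earlier in this analysis.
n_total   <- 152624     # rows in df_clean
n_handgun <- 75974      # homicides committed with a handgun
model_acc <- 0.5938581  # in-sample accuracy of logit_model

# Accuracy of always predicting the more common class.
p_handgun    <- n_handgun / n_total
baseline_acc <- max(p_handgun, 1 - p_handgun)

# Improvement of the model over that baseline, in percentage points.
gain_pts <- (model_acc - baseline_acc) * 100

print(round(c(baseline = baseline_acc, model = model_acc), 4))
print(round(gain_pts, 1))
```

A confusion matrix such as `table(Predicted = predicted, Actual = df_clean$UsedHandgun)` would additionally show whether the errors are concentrated in one class.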

Key findings:

  • Older perpetrators are more likely to use handguns.
  • Males are significantly more likely than females to use handguns.
  • Native American offenders are much less likely to use handguns than offenders of any other race (the race coefficients are measured against the baseline category, Asian/Pacific Islander).
  • Homicides of strangers, husbands, and ex-husbands are significantly more likely to involve a handgun than homicides of acquaintances, the baseline relationship category.
  • Homicides of close family members (e.g., daughters, mothers, sons) are less likely to involve a handgun. This possibly indicates more impulsive or personal violence involving other weapon types.

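Positive coefficients indicate higher odds of handgun use relative to the baseline category, holding the other predictors constant. For example, a coefficient of 0.53 for Husband means the odds of handgun use are approximately 70% higher (exp(0.53) ≈ 1.70) than for the baseline relationship, Acquaintance.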
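The conversion from log-odds to a percent change in odds can be sketched as follows. The three estimates are copied from the `summary(logit_model)` output; the `log_odds`, `odds_ratio`, and `pct_change` names are introduced here for illustration.

```r
# Log-odds estimates copied from the summary(logit_model) output.
log_odds <- c(Husband = 0.5311216, Stranger = 0.5240941, Daughter = -1.4063118)

# exp() turns a log-odds coefficient into an odds ratio against the baseline level.
odds_ratio <- exp(log_odds)

# Percent change in the odds relative to the baseline.
pct_change <- (odds_ratio - 1) * 100

print(round(odds_ratio, 2))
print(round(pct_change, 1))
```

On the fitted model, `exp(coef(logit_model))` would give the odds ratio for every term at once.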

This model suggests that both demographics and the relationship to the victim are associated with weapon choice in homicide cases.

coefs <- summary(logit_model)$coefficients
conf_int <- confint(logit_model)
## Waiting for profiling to be done...
coef_df <- data.frame(
  Variable = rownames(coefs),
  Estimate = coefs[, "Estimate"],
  Lower = conf_int[, 1],
  Upper = conf_int[, 2]
)

coef_df <- coef_df[coef_df$Variable != "(Intercept)", ]

library(ggplot2)
ggplot(coef_df, aes(x = reorder(Variable, Estimate), y = Estimate)) +
  geom_point(color = "lightpink3") +
  geom_errorbar(aes(ymin = Lower, ymax = Upper), width = 0.2) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "burlywood4") +
  coord_flip() +
  labs(title = "Effect of Predictor Variables on Handgun Use",
       x = "Predictor",
       y = "Log-Odds Estimate (95% CI)") +
  theme_minimal()

While the logistic regression output provides valuable information, it’s a little difficult to interpret at a glance. I decided to create a coefficient plot to visually display each variable’s effect on the likelihood of handgun use, along with 95% confidence intervals to show significance.

Each dot represents a variable’s estimated effect, and each horizontal line represents its 95% confidence interval. Dots to the right of 0 (like Stranger or Husband) indicate an increased likelihood of handgun use relative to the baseline category, while dots to the left of 0 (like Daughter or Mother) indicate a decreased likelihood. An interval that crosses 0 means the effect is not statistically significant at the 5% level.

Conclusion

This project explored offender-level characteristics in U.S. homicide data to identify common patterns and predict weapon use. Through exploratory analysis, I found that most homicide offenders are male and between the ages of 20 and 40. Surprisingly, the most common relationship between victim and offender was “Acquaintance”, suggesting that many homicides occur between people who know each other.

Using k-means clustering, I identified three offender profiles: young Black males, female offenders, and White males.

Finally, I built a logistic regression model to predict whether a handgun was used in the homicide. The model revealed that offender sex, race, age, and relationship to the victim are all significantly associated with the likelihood of handgun use. Relationships like “Stranger” and “Husband” were strong positive predictors, while family relationships like “Daughter” and “Mother” were strong negative predictors.

Overall, this analysis shows that demographic and relational context can help us to understand trends in violent crime.

Dataset Citation

Murder Accountability Project. “Homicide Reports, 1980-2014”. Kaggle, 2017, https://www.kaggle.com/datasets/murderaccountability/homicide-reports/data.