Data and Methods

Sources of Data

For this project, our group used the U.S. Crime Dataset available on Kaggle: US Crime Dataset. This dataset was collected by the U.S. Federal Government to track reported crimes and is summarized from FBI sources. The dataset includes reported crimes from 12/31/1979 to 8/30/2014. Key variables include:
- Location: City, State
- Temporal: Year, Month
- Crime Details: Incident, Crime Type, Crime Solved
- Victim Information: Age, Sex, Race/Ethnicity
- Perpetrator Information: Age, Sex, Race/Ethnicity
- Other Details: Relationship, Weapon, Victim/Perpetrator Count
The dataset contains no missing values, though some entries are marked “unknown.” No official metadata dictionary or user guide is available, and although we found no academic papers that use this exact Kaggle dataset, similar FBI UCR data has been widely studied.
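Because "unknown" entries are not counted as missing values, it can help to measure how common they are in each column. The sketch below uses a small made-up data frame; the column names and values are illustrative assumptions, not taken from the Kaggle file.

```r
# Toy stand-in for the Kaggle data; column names and values are illustrative only.
homicides <- data.frame(
  Victim.Sex = c("Male", "Female", "Unknown", "Male"),
  Weapon = c("Handgun", "Unknown", "Unknown", "Knife"),
  stringsAsFactors = FALSE
)

# Share of entries in each column that are coded as "unknown" (case-insensitive).
unknown_share <- sapply(homicides, function(col) mean(tolower(col) == "unknown"))
unknown_share
```

A column with a large share of "unknown" entries may need the same caution as a column with many missing values.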

We also used the State Crime CSV dataset from the CORGIS Dataset Project (Whitcomb, Choi, & Guan, 2021). The dataset is publicly available here: State Crime CSV File.
This dataset provides annual crime statistics for each U.S. state, including data on violent and property crimes, as well as population information. It is intended to allow researchers, students, and educators to explore patterns in crime across the United States.
- Sample Size: The dataset includes all 50 states and the District of Columbia, with multiple years of data per state (more than 2,550 state-year observations).
- Variables Used in Analysis:
- State: Name of the state (categorical)
- Year: Year of observation (numeric)
- Data.Population: State population (numeric)
- Data.Rates.Violent.Murder: Homicide rate per 100,000 population (numeric; dependent variable for analysis)
- Other variables include rates and totals for rape, robbery, aggravated assault, and property crimes (burglary, larceny, and motor vehicle theft).
We focus primarily on Data.Rates.Violent.Murder as the dependent variable, with State and Year serving as key grouping variables to summarize and compare homicide rates across states. Additional independent variables, such as population size, may be used for normalization or contextual analysis.
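As a rough illustration of what normalizing by population means here, a rate per 100,000 is the count divided by the population, multiplied by 100,000. The sketch below uses the dataset's column names with made-up values. It assumes the published rates follow this standard definition, which we have not verified against the source.

```r
# Toy rows using column names from the State Crime CSV; the values are made up.
toy <- data.frame(
  State = c("A", "B"),
  Data.Population = c(2000000, 500000),
  Data.Totals.Violent.Murder = c(100, 40)
)

# Rate per 100,000 residents = count / population * 100,000.
toy$Murder.Rate.Per.100k <- toy$Data.Totals.Violent.Murder / toy$Data.Population * 1e5
toy
```

State A has more homicides in total but the lower rate, which is why rates rather than counts are used to compare states of different sizes.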

Methods

We used R for all data wrangling, analysis, and visualization, leveraging the following packages:

#1 Load required libraries

library(dplyr)

library(ggplot2)

library(readr)

#2 Set working directory

setwd("~/Desktop/BDATA 200")

#3 Load the dataset

crime <- read.csv("state_crime.csv", stringsAsFactors = FALSE)

#4 View column names

colnames(crime)

##  [1] "State"                         "Year"                         
##  [3] "Data.Population"               "Data.Rates.Property.All"      
##  [5] "Data.Rates.Property.Burglary"  "Data.Rates.Property.Larceny"  
##  [7] "Data.Rates.Property.Motor"     "Data.Rates.Violent.All"       
##  [9] "Data.Rates.Violent.Assault"    "Data.Rates.Violent.Murder"    
## [11] "Data.Rates.Violent.Rape"       "Data.Rates.Violent.Robbery"   
## [13] "Data.Totals.Property.All"      "Data.Totals.Property.Burglary"
## [15] "Data.Totals.Property.Larceny"  "Data.Totals.Property.Motor"   
## [17] "Data.Totals.Violent.All"       "Data.Totals.Violent.Assault"  
## [19] "Data.Totals.Violent.Murder"    "Data.Totals.Violent.Rape"     
## [21] "Data.Totals.Violent.Robbery"

Results

Descriptive Summary of Homicide Rates

# Calculate descriptive statistics for homicide rates
summary_stats <- crime %>%
  summarise(
    mean_homicide = mean(Data.Rates.Violent.Murder, na.rm = TRUE),
    sd_homicide = sd(Data.Rates.Violent.Murder, na.rm = TRUE),
    min_homicide = min(Data.Rates.Violent.Murder, na.rm = TRUE),
    max_homicide = max(Data.Rates.Violent.Murder, na.rm = TRUE),
    median_homicide = median(Data.Rates.Violent.Murder, na.rm = TRUE),
    first_quartile = quantile(Data.Rates.Violent.Murder, 0.25, na.rm = TRUE),
    third_quartile = quantile(Data.Rates.Violent.Murder, 0.75, na.rm = TRUE),
    na_count = sum(is.na(Data.Rates.Violent.Murder))
  )

summary_stats

##   mean_homicide sd_homicide min_homicide max_homicide median_homicide
## 1      6.477207    5.886449          0.2         80.6             5.4
##   first_quartile third_quartile na_count
## 1            3.1            8.4        0

#5 Filter out DC and calculate average homicide rate by state

state_data <- crime %>%
  filter(State != "District of Columbia") %>%
  group_by(State) %>%
  summarise(
    Avg_Homicide_Rate = mean(Data.Rates.Violent.Murder, na.rm = TRUE)
  )

#6 Top 10 states

top10 <- state_data %>%
  arrange(desc(Avg_Homicide_Rate)) %>%
  slice(1:10)

top10

## # A tibble: 10 × 2
##    State          Avg_Homicide_Rate
##    <chr>                      <dbl>
##  1 Louisiana                  12.7 
##  2 Mississippi                10.4 
##  3 Georgia                    10.3 
##  4 Alabama                    10.1 
##  5 Nevada                      9.83
##  6 South Carolina              9.81
##  7 Texas                       9.44
##  8 Maryland                    9.04
##  9 Florida                     8.93
## 10 New Mexico                  8.69

# Horizontal bar chart; bars are drawn first so the value labels sit on top
ggplot(top10, aes(x = reorder(State, Avg_Homicide_Rate),
                  y = Avg_Homicide_Rate)) +
  geom_col(fill = "skyblue") +
  geom_text(aes(label = round(Avg_Homicide_Rate, 1)),
            hjust = -0.1, size = 4) +
  coord_flip() +
  labs(
    title = "Top 10 States by Average Homicide Rate",
    x = "State",
    y = "Average Homicide Rate (per 100,000)"
  ) +
  theme_minimal() +
  expand_limits(y = max(top10$Avg_Homicide_Rate) * 1.1)

# Vertical version of the same chart, with value labels above each bar
ggplot(top10, aes(x = reorder(State, Avg_Homicide_Rate),
                  y = Avg_Homicide_Rate)) +
  geom_col(fill = "skyblue") +
  geom_text(aes(label = round(Avg_Homicide_Rate, 1)),
            vjust = -0.3, size = 4) +
  labs(
    title = "Top 10 States by Average Homicide Rate",
    x = "State",
    y = "Average Homicide Rate (per 100,000)"
  ) +
  theme_minimal() +
  scale_y_continuous(expand = expansion(mult = c(0, 0.1)))

Discussion

The results show clear patterns in homicide rates across the U.S. When we rank states by average homicide rate (the mean of each state's yearly homicide rate per 100,000 population, excluding the District of Columbia), Louisiana ranks first at 12.7 homicides per 100,000. Most of the top 10 states are in the Southeast, and all ten averages (8.7 to 12.7) are above the overall mean of about 6.5 across all state-year observations. This suggests that some regions have had higher homicide rates than others over the period covered. For policymakers deciding where to allocate resources, this could help focus efforts on the states that need it most; prevention and community-support programs could have a bigger impact in these areas.

There are limits to what we can conclude. This dataset contains only state-level data, so we cannot see the more detailed picture at the county or city level. And because this is a descriptive analysis, we cannot say why these states have higher rates. Future studies could examine demographics, policies, or other factors to identify the reasons.
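One caveat about the averages: the state figures are unweighted means of yearly rates, so every year counts equally regardless of population. A population-weighted alternative pools homicide counts and population across years before dividing. The base R sketch below contrasts the two on made-up numbers. The column names follow the State Crime CSV, but the values and the helper objects are hypothetical.

```r
# Made-up state-year rows; column names follow the State Crime CSV.
toy <- data.frame(
  State = c("A", "A", "B", "B"),
  Data.Population = c(1000000, 3000000, 500000, 500000),
  Data.Totals.Violent.Murder = c(100, 150, 20, 30)
)

# Yearly rate per 100,000 for each row.
toy$rate <- toy$Data.Totals.Violent.Murder / toy$Data.Population * 1e5

# Unweighted mean of yearly rates (the approach used for the state averages).
mean_of_yearly_rates <- tapply(toy$rate, toy$State, mean)

# Pooled rate: total homicides over total person-years, per 100,000.
pooled_rate <- tapply(toy$Data.Totals.Violent.Murder, toy$State, sum) /
  tapply(toy$Data.Population, toy$State, sum) * 1e5

comparison <- data.frame(
  State = names(pooled_rate),
  mean_of_yearly_rates = as.numeric(mean_of_yearly_rates),
  pooled_rate = as.numeric(pooled_rate)
)
comparison
```

The two measures agree when a state's population is stable (state B) and diverge when it changes substantially (state A), so rankings could shift slightly depending on which is used.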

Top 10 States

Mao N

2026-03-06
