This project uses a hate crime dataset covering 1991 to 2022. The most common offenses in this period are Damage (destruction, damage, or vandalism of property), Intimidation, and Simple Assault. Using the Random Forest method, we aim to predict how often each of these offenses will occur in 2023. This analysis is intended to help understand crime trends and to support decisions in law enforcement and policy making.

if(!require(tidyverse))install.packages("tidyverse")
## Loading required package: tidyverse
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidyverse)

This line of code checks if the tidyverse package is installed; if not, it installs it. Then, it loads the tidyverse package. Tidyverse is a collection of R packages for data manipulation and visualization.

if(!require(gridExtra))install.packages("gridExtra")
## Loading required package: gridExtra
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
library(gridExtra)

Here, we check if the gridExtra package is installed; if not, we install it. Then we load it. The gridExtra package helps arrange multiple grid-based figures on a page.

# Load necessary libraries
library(dplyr)
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:gridExtra':
## 
##     combine
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift

dplyr is used for data manipulation.

randomForest is used for building and training Random Forest models.

caret is used for creating data partitions and model evaluation.

Load the data into a dataframe

data <- read.csv("hate_crime.csv")

Recoding Some of the Offense Names

data <- data %>% mutate(offense_name = ifelse(offense_name =="Destruction/Damage/Vandalism of Property",
                                                 "Damage", offense_name))

data <- data %>% mutate(offense_name = ifelse(offense_name =="Simple Assault",
                                                 "Sim.Assault", offense_name))

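The two recoding steps above can also be written as a single lookup. The following is a minimal base R sketch on made-up values in the style of the offense_name column; the short_names table is a hypothetical helper, not part of the original analysis. It also shows that only exact matches are recoded, so a combined entry keeps its long label.

```r
# Hypothetical lookup table: long offense labels -> short labels.
short_names <- c(
  "Destruction/Damage/Vandalism of Property" = "Damage",
  "Simple Assault" = "Sim.Assault"
)

# Made-up example values in the style of the offense_name column.
offense_name <- c(
  "Simple Assault",
  "Intimidation",
  "Destruction/Damage/Vandalism of Property",
  "Aggravated Assault;Destruction/Damage/Vandalism of Property"
)

# Replace only exact matches; every other value is left unchanged.
matched <- offense_name %in% names(short_names)
offense_name[matched] <- unname(short_names[offense_name[matched]])

print(offense_name)
```
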
First few rows and columns of the dataframe

head(data)
##   incident_id data_year       ori   pug_agency_name pub_agency_unit
## 1          43      1991 AR0350100        Pine Bluff                
## 2          44      1991 AR0350100        Pine Bluff                
## 3          45      1991 AR0600300 North Little Rock                
## 4          46      1991 AR0600300 North Little Rock                
## 5          47      1991 AR0670000            Sevier                
## 6        3015      1991 AR0040200            Rogers                
##   agency_type_name state_abbr state_name      division_name region_name
## 1             City         AR   Arkansas West South Central       South
## 2             City         AR   Arkansas West South Central       South
## 3             City         AR   Arkansas West South Central       South
## 4             City         AR   Arkansas West South Central       South
## 5           County         AR   Arkansas West South Central       South
## 6             City         AR   Arkansas West South Central       South
##   population_group_code   population_group_description incident_date
## 1                     3 Cities from 50,000 thru 99,999    1991-07-04
## 2                     3 Cities from 50,000 thru 99,999    1991-12-24
## 3                     3 Cities from 50,000 thru 99,999    1991-07-10
## 4                     3 Cities from 50,000 thru 99,999    1991-10-06
## 5                    8D  Non-MSA counties under 10,000    1991-10-14
## 6                     5 Cities from 10,000 thru 24,999    1991-08-31
##   adult_victim_count juvenile_victim_count total_offender_count
## 1                 NA                    NA                    1
## 2                 NA                    NA                    1
## 3                 NA                    NA                    1
## 4                 NA                    NA                    2
## 5                 NA                    NA                    1
## 6                 NA                    NA                    1
##   adult_offender_count juvenile_offender_count             offender_race
## 1                   NA                      NA Black or African American
## 2                   NA                      NA Black or African American
## 3                   NA                      NA Black or African American
## 4                   NA                      NA Black or African American
## 5                   NA                      NA                     White
## 6                   NA                      NA                     White
##   offender_ethnicity victim_count
## 1      Not Specified            1
## 2      Not Specified            2
## 3      Not Specified            2
## 4      Not Specified            1
## 5      Not Specified            1
## 6      Not Specified            1
##                                                  offense_name
## 1                                          Aggravated Assault
## 2 Aggravated Assault;Destruction/Damage/Vandalism of Property
## 3     Aggravated Assault;Murder and Nonnegligent Manslaughter
## 4                                                Intimidation
## 5                                                Intimidation
## 6                                                Intimidation
##   total_individual_victims                      location_name
## 1                        1                     Residence/Home
## 2                        1 Highway/Road/Alley/Street/Sidewalk
## 3                        2                     Residence/Home
## 4                        1                     Residence/Home
## 5                        1                     School/College
## 6                        1 Highway/Road/Alley/Street/Sidewalk
##                        bias_desc victim_types multiple_offense multiple_bias
## 1 Anti-Black or African American   Individual                S             S
## 2                     Anti-White   Individual                M             S
## 3                     Anti-White   Individual                M             S
## 4                     Anti-White   Individual                S             S
## 5 Anti-Black or African American   Individual                S             S
## 6 Anti-Black or African American   Individual                S             S

Summary Statistics of the dataframe

summary(data)
##   incident_id        data_year        ori            pug_agency_name   
##  Min.   :      2   Min.   :1991   Length:241663      Length:241663     
##  1st Qu.:  60446   1st Qu.:1999   Class :character   Class :character  
##  Median : 120873   Median :2006   Mode  :character   Mode  :character  
##  Mean   : 349025   Mean   :2007                                        
##  3rd Qu.: 181301   3rd Qu.:2016                                        
##  Max.   :1494167   Max.   :2022                                        
##                                                                        
##  pub_agency_unit    agency_type_name    state_abbr         state_name       
##  Length:241663      Length:241663      Length:241663      Length:241663     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  division_name      region_name        population_group_code
##  Length:241663      Length:241663      Length:241663        
##  Class :character   Class :character   Class :character     
##  Mode  :character   Mode  :character   Mode  :character     
##                                                             
##                                                             
##                                                             
##                                                             
##  population_group_description incident_date      adult_victim_count
##  Length:241663                Length:241663      Min.   :  0.00    
##  Class :character             Class :character   1st Qu.:  0.00    
##  Mode  :character             Mode  :character   Median :  1.00    
##                                                  Mean   :  0.73    
##                                                  3rd Qu.:  1.00    
##                                                  Max.   :146.00    
##                                                  NA's   :170538    
##  juvenile_victim_count total_offender_count adult_offender_count
##  Min.   : 0.0          Min.   : 0.0000      Min.   : 0.00       
##  1st Qu.: 0.0          1st Qu.: 0.0000      1st Qu.: 0.00       
##  Median : 0.0          Median : 1.0000      Median : 0.00       
##  Mean   : 0.1          Mean   : 0.9559      Mean   : 0.61       
##  3rd Qu.: 0.0          3rd Qu.: 1.0000      3rd Qu.: 1.00       
##  Max.   :60.0          Max.   :99.0000      Max.   :60.00       
##  NA's   :172978                             NA's   :177148      
##  juvenile_offender_count offender_race      offender_ethnicity
##  Min.   : 0.00           Length:241663      Length:241663     
##  1st Qu.: 0.00           Class :character   Class :character  
##  Median : 0.00           Mode  :character   Mode  :character  
##  Mean   : 0.12                                                
##  3rd Qu.: 0.00                                                
##  Max.   :20.00                                                
##  NA's   :177155                                               
##   victim_count     offense_name       total_individual_victims
##  Min.   :  1.000   Length:241663      Min.   :  0.000         
##  1st Qu.:  1.000   Class :character   1st Qu.:  1.000         
##  Median :  1.000   Mode  :character   Median :  1.000         
##  Mean   :  1.242                      Mean   :  0.989         
##  3rd Qu.:  1.000                      3rd Qu.:  1.000         
##  Max.   :900.000                      Max.   :147.000         
##                                       NA's   :4859            
##  location_name       bias_desc         victim_types       multiple_offense  
##  Length:241663      Length:241663      Length:241663      Length:241663     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  multiple_bias     
##  Length:241663     
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 

Filter the data for “Damage”

damage_data <- data %>% filter(offense_name == "Damage")

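This line of code filters the data to include only rows where the offense_name column is “Damage”. This subset is stored in damage_data. The comparison is an exact match, so incidents recorded with several offenses in one field (for example “Aggravated Assault;Destruction/Damage/Vandalism of Property”) are not included.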
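
To illustrate the difference between exact and substring matching, here is a small base R sketch on made-up rows. It uses the original long label for clarity; the vector offenses and the counts it produces are illustrative only and say nothing about the real dataset.

```r
target <- "Destruction/Damage/Vandalism of Property"

# Made-up offense_name values, including one multi-offense entry.
offenses <- c(
  "Aggravated Assault",
  "Aggravated Assault;Destruction/Damage/Vandalism of Property",
  "Destruction/Damage/Vandalism of Property",
  "Intimidation"
)

# Exact match: only rows whose whole value equals the target.
exact_n <- sum(offenses == target)

# Substring match: rows that mention the target anywhere in the field.
# fixed = TRUE treats the target as literal text, not a regular expression.
any_n <- sum(grepl(target, offenses, fixed = TRUE))

print(c(exact = exact_n, substring = any_n))
```

Which of the two counts is appropriate depends on whether multi-offense incidents should contribute to the yearly totals.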

Aggregate the data by year

damage_yearly <- damage_data %>% group_by(data_year) %>% 
  summarise(count = n())

This code groups damage_data by data_year and counts the number of occurrences in each year. The result is stored in damage_yearly.
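
Since the same filter-and-count steps are repeated for each offense, they could be wrapped in a small function. The sketch below is one possible base R version; yearly_counts is a hypothetical helper name, and the toy data frame only mimics the data_year and offense_name columns.

```r
# Hypothetical helper: yearly incident counts for one offense label.
yearly_counts <- function(df, offense) {
  rows <- df[df$offense_name == offense, ]
  counts <- table(rows$data_year)
  data.frame(
    data_year = as.integer(names(counts)),
    count = as.integer(counts)
  )
}

# Made-up rows standing in for the real data.
toy <- data.frame(
  data_year = c(1991, 1991, 1992, 1992, 1992, 1993),
  offense_name = c("Damage", "Intimidation", "Damage",
                   "Damage", "Intimidation", "Damage")
)

damage_toy <- yearly_counts(toy, "Damage")
print(damage_toy)
```

Years in which the offense does not occur at all are simply absent from the result, as they are with group_by and summarise.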

# Inspect the aggregated data
damage_yearly
## # A tibble: 32 × 2
##    data_year count
##        <int> <int>
##  1      1991  1231
##  2      1992  1760
##  3      1993  2182
##  4      1994  1642
##  5      1995  2148
##  6      1996  2663
##  7      1997  2396
##  8      1998  2446
##  9      1999  2509
## 10      2000  2634
## # ℹ 22 more rows

Prepare the features (X) and labels (y)

X <- damage_yearly$data_year
y <- damage_yearly$count

These lines extract the feature (year) and the labels (yearly counts) into the variables X and y.

Convert to data frame for modeling

data_model <- data.frame(X, y)

Split the data into training and testing sets

set.seed(42)
train_index <- createDataPartition(y, p = 0.8, list = FALSE)
train_data <- data_model[train_index, ]
test_data <- data_model[-train_index, ]

The first line of code sets a random seed using set.seed(42) to ensure reproducibility. This ensures that the random processes in the code (like splitting data into training and testing sets) produce the same results every time you run the code. It is just like using the same starting point for a random number generator so that the sequence of random numbers it produces is always the same.

The next line splits the data into training (80%) and testing (20%) sets using createDataPartition. The indices of the training rows are stored in train_index. The training and testing data are stored in train_data and test_data, respectively.
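
createDataPartition draws the test years at random from the whole period. For yearly data, a different option (not the one used in this report) is a chronological holdout, where the most recent years form the test set. A minimal base R sketch, assuming a made-up series in place of the real counts and a hypothetical n_test of six years:

```r
# Made-up yearly counts standing in for the aggregated data.
data_model <- data.frame(
  X = 1991:2022,
  y = seq(1000, by = 50, length.out = 32)
)

# Hypothetical choice: hold out the six most recent years.
n_test <- 6
cutoff <- max(data_model$X) - n_test

train_data <- data_model[data_model$X <= cutoff, ]
test_data <- data_model[data_model$X > cutoff, ]

print(range(train_data$X))
print(range(test_data$X))
```

With this split the model is always evaluated on years later than any it was trained on, which is closer to the forecasting task.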

Separate features and labels for training and testing sets

X_train <- train_data$X
y_train <- train_data$y
X_test <- test_data$X
y_test <- test_data$y

These lines of code separate the features (X_train and X_test) and labels (y_train and y_test) for both training and testing sets.

1. X_train <- train_data$X: extracts the feature (year) from the training data and assigns it to X_train.
2. y_train <- train_data$y: extracts the label (number of “Damage” crimes) from the training data and assigns it to y_train.
3. X_test <- test_data$X: extracts the feature (year) from the testing data and assigns it to X_test.
4. y_test <- test_data$y: extracts the label (number of “Damage” crimes) from the testing data and assigns it to y_test.

These lines of code separate the input data (features) from the output data (labels) for both the training and testing datasets. This separation is essential for training a machine learning model and then evaluating its performance on new data.

Train the Random Forest model

model <- randomForest(y ~ X, data = train_data, ntree = 100)

This line of code trains a Random Forest model using randomForest. The formula y ~ X specifies that y is the dependent variable and X is the independent variable. The model is trained on train_data with 100 trees (ntree = 100). randomForest has no seed argument; reproducibility comes from the set.seed(42) call made before the data split.

Predict on the test set

y_pred <- predict(model, newdata = data.frame(X = X_test))

This line of code uses the trained model to predict the y values for the X_test data. The predictions are stored in y_pred.

Evaluate the model

mse <- mean((y_test - y_pred)^2)
print(paste("Mean Squared Error:", mse))
## [1] "Mean Squared Error: 61895.7503074861"

This line of code calculates the Mean Squared Error (MSE) between the actual y_test values and the predicted y_pred values to evaluate the model’s performance. The MSE is printed.
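
Because the MSE is expressed in squared counts, it can be easier to read as a Root Mean Squared Error (RMSE), which is in the same unit as the yearly counts. A short base R sketch using the MSE value printed above:

```r
# MSE value copied from the output above.
mse <- 61895.7503074861

# RMSE is the square root of the MSE, in incidents per year.
rmse <- sqrt(mse)

print(paste("Root Mean Squared Error:", round(rmse, 1)))
## [1] "Root Mean Squared Error: 248.8"
```

An error of roughly 249 incidents can be compared with the yearly counts shown earlier, which run from 1,231 to 2,663 in the first ten years.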

Predict the frequency of occurrences for the year 2023

year_2023 <- data.frame(X = 2023)
prediction_2023 <- predict(model, year_2023)

print(paste("Predicted frequency of Damage in 2023:", round(prediction_2023)))
## [1] "Predicted frequency of Damage in 2023: 2546"

The first line creates a data frame, year_2023, containing the year 2023, and the second uses the trained model to predict the number of “Damage” occurrences for that year.

The last line prints the result: the model predicts 2,546 occurrences of Damage in 2023.

This figure should be read with care. A Random Forest predicts by averaging counts seen in training, so for a year beyond the training range it returns the level of the most recent training years rather than a continuation of the trend.
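
To see how a Random Forest behaves for years outside its training range, the following sketch fits one to a synthetic, steadily rising series (not the crime data) and predicts for 2022, 2023, and 2050. It assumes the randomForest package is installed, as it is elsewhere in this report; the data frame toy and its values are made up.

```r
library(randomForest)

set.seed(42)

# Synthetic series that rises by 100 every year.
toy <- data.frame(X = 1991:2022)
toy$y <- 100 * (toy$X - 1990)

fit <- randomForest(y ~ X, data = toy, ntree = 100)

# The last training year and two years beyond the training range.
future <- data.frame(X = c(2022, 2023, 2050))
pred <- unname(predict(fit, future))
print(pred)

# Every tree sends all three years to the same terminal node, so the
# three predictions are identical and cannot exceed the largest
# training value, even though the series itself keeps rising.
print(max(toy$y))
```

A model that can follow a trend, such as a linear regression on the year, is one way to check how much this limitation matters for a given series.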

Filter the data for “Intimidation”

intimidation_data <- data %>% filter(offense_name == "Intimidation")

This line of code filters the data to include only rows where the offense_name column is “Intimidation”. This subset is stored in intimidation_data.

Aggregate the data by year

intimidation_yearly <- intimidation_data %>% group_by(data_year) %>% 
  summarise(count = n())

This line of code aggregates the intimidation_data by data_year and counts the number of occurrences each year. The result is stored in intimidation_yearly.

# Inspect the aggregated data
intimidation_yearly
## # A tibble: 32 × 2
##    data_year count
##        <int> <int>
##  1      1991  1519
##  2      1992  2288
##  3      1993  2421
##  4      1994  2091
##  5      1995  2838
##  6      1996  2965
##  7      1997  2750
##  8      1998  2710
##  9      1999  2605
## 10      2000  2632
## # ℹ 22 more rows

Prepare the features (X) and labels (y)

X <- intimidation_yearly$data_year
y <- intimidation_yearly$count

These lines extract the feature (year) and the labels (yearly counts) into the variables X and y.

Convert to data frame for modeling

data_model <- data.frame(X, y)

Split the data into training and testing sets

set.seed(42)
train_index <- createDataPartition(y, p = 0.8, list = FALSE)
train_data <- data_model[train_index, ]
test_data <- data_model[-train_index, ]

As before, createDataPartition splits the data into training (80%) and testing (20%) sets. The indices of the training rows are stored in train_index. The training and testing data are stored in train_data and test_data, respectively.

Separate features and labels for training and testing sets

X_train <- train_data$X
y_train <- train_data$y
X_test <- test_data$X
y_test <- test_data$y

These lines of code separate the features (X_train and X_test) and labels (y_train and y_test) for both training and testing sets.

Train the Random Forest model

model <- randomForest(y ~ X, data = train_data, ntree = 100)

This line of code trains a Random Forest model using randomForest. The formula y ~ X specifies that y is the dependent variable and X is the independent variable. The model is trained on train_data with 100 trees (ntree = 100). randomForest has no seed argument; reproducibility comes from the set.seed(42) call made before the data split.

Predict on the test set

y_pred <- predict(model, newdata = data.frame(X = X_test))

This line of code uses the trained model to predict the y values for the X_test data. The predictions are stored in y_pred.

Evaluate the model

mse <- mean((y_test - y_pred)^2)
print(paste("Mean Squared Error:", mse))
## [1] "Mean Squared Error: 67203.8869917432"

This line of code calculates the Mean Squared Error (MSE) between the actual y_test values and the predicted y_pred values to evaluate the model’s performance. The MSE is printed.

Predict the frequency of occurrences for the year 2023

year_2023 <- data.frame(X = 2023)
prediction_2023 <- predict(model, year_2023)

print(paste("Predicted frequency of Intimidation in 2023:", round(prediction_2023)))
## [1] "Predicted frequency of Intimidation in 2023: 3092"

The first line creates a data frame, year_2023, containing the year 2023, and the second uses the trained model to predict the number of “Intimidation” occurrences for that year.

The last line prints the result: the model predicts 3,092 occurrences of Intimidation in 2023.

Because a Random Forest cannot extrapolate beyond its training range, this figure reflects the counts of the most recent training years rather than a projected trend.

Filter the data for “Simple Assault”

sim_data <- data %>% filter(offense_name == "Sim.Assault")

This line of code filters the data to include only rows where the offense_name column is “Sim.Assault”, the short label assigned to Simple Assault earlier. This subset is stored in sim_data.

Aggregate the data by year

sim_yearly <- sim_data %>% group_by(data_year) %>% 
  summarise(count = n())

This code groups sim_data by data_year and counts the number of occurrences in each year. The result is stored in sim_yearly.

# Inspect the aggregated data
sim_yearly
## # A tibble: 32 × 2
##    data_year count
##        <int> <int>
##  1      1991   760
##  2      1992  1240
##  3      1993  1424
##  4      1994  1025
##  5      1995  1395
##  6      1996  1280
##  7      1997  1361
##  8      1998  1366
##  9      1999  1437
## 10      2000  1380
## # ℹ 22 more rows

Prepare the features (X) and labels (y)

X <- sim_yearly$data_year
y <- sim_yearly$count

These lines extract the feature (year) and the labels (yearly counts) into the variables X and y.

Convert to data frame for modeling

data_model <- data.frame(X, y)

Split the data into training and testing sets

set.seed(42)
train_index <- createDataPartition(y, p = 0.8, list = FALSE)
train_data <- data_model[train_index, ]
test_data <- data_model[-train_index, ]

As before, createDataPartition splits the data into training (80%) and testing (20%) sets. The indices of the training rows are stored in train_index. The training and testing data are stored in train_data and test_data, respectively.

Separate features and labels for training and testing sets

X_train <- train_data$X
y_train <- train_data$y
X_test <- test_data$X
y_test <- test_data$y

These lines of code separate the features (X_train and X_test) and labels (y_train and y_test) for both training and testing sets.

Train the Random Forest model

model <- randomForest(y ~ X, data = train_data, ntree = 100)

This line of code trains a Random Forest model using randomForest. The formula y ~ X specifies that y is the dependent variable and X is the independent variable. The model is trained on train_data with 100 trees (ntree = 100). randomForest has no seed argument; reproducibility comes from the set.seed(42) call made before the data split.

Predict on the test set

y_pred <- predict(model, newdata = data.frame(X = X_test))

This line of code uses the trained model to predict the y values for the X_test data. The predictions are stored in y_pred.

Evaluate the model

mse <- mean((y_test - y_pred)^2)
print(paste("Mean Squared Error:", mse))
## [1] "Mean Squared Error: 28555.0848982084"

This line of code calculates the Mean Squared Error (MSE) between the actual y_test values and the predicted y_pred values to evaluate the model’s performance.

Predict the frequency of occurrences for the year 2023

year_2023 <- data.frame(X = 2023)
prediction_2023 <- predict(model, year_2023)

print(paste("Predicted frequency of Simple Assault in 2023:", round(prediction_2023)))
## [1] "Predicted frequency of Simple Assault in 2023: 2273"

The first line creates a data frame, year_2023, containing the year 2023, and the second uses the trained model to predict the number of “Simple Assault” occurrences for that year.

The last line prints the result: the model predicts 2,273 occurrences of Simple Assault in 2023.

Because a Random Forest cannot extrapolate beyond its training range, this figure reflects the counts of the most recent training years rather than a projected trend.