This project uses a hate crime dataset (hate_crime.csv) covering 1991 to 2022. The three most frequent offenses in this period are Damage (Destruction/Damage/Vandalism of Property), Intimidation, and Simple Assault. Using the Random Forest method, we aim to predict how often each of these offenses will occur in 2023. The analysis is intended to help understand crime trends and to support decisions in law enforcement and policy making.
if(!require(tidyverse))install.packages("tidyverse")
## Loading required package: tidyverse
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidyverse)
This line of code checks if the tidyverse package is installed; if not, it installs it. Then, it loads the tidyverse package. Tidyverse is a collection of R packages for data manipulation and visualization.
if(!require(gridExtra))install.packages("gridExtra")
## Loading required package: gridExtra
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
library(gridExtra)
Here, we check whether the gridExtra package is installed and install it if it is not; we then load it. The gridExtra package helps arrange multiple grid-based figures on a page.
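The install-then-load pattern used for tidyverse and gridExtra can be wrapped in a small helper. The sketch below is one possible way to do it; the function name ensure_package is an assumption made for illustration and is not part of the original analysis. It relies on requireNamespace(), which checks whether a package is available without attaching it.

```r
# Hypothetical helper: install a package only if it is missing, then attach it.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  library(pkg, character.only = TRUE)
}

# "stats" ships with R, so this call never triggers an installation.
ensure_package("stats")
```

The same call with "tidyverse" or "gridExtra" would follow the steps described above.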
# Load necessary libraries
library(dplyr)
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:gridExtra':
##
## combine
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
dplyr is used for data manipulation.
randomForest is used for building and training Random Forest models.
caret is used for creating data partitions and model evaluation.
# Read the hate crime data
data <- read.csv("hate_crime.csv")
# Shorten two long offense labels; only rows whose offense_name matches exactly are changed
data <- data %>%
  mutate(offense_name = ifelse(offense_name == "Destruction/Damage/Vandalism of Property",
                               "Damage", offense_name))
data <- data %>%
  mutate(offense_name = ifelse(offense_name == "Simple Assault",
                               "Sim.Assault", offense_name))
head(data)
## incident_id data_year ori pug_agency_name pub_agency_unit
## 1 43 1991 AR0350100 Pine Bluff
## 2 44 1991 AR0350100 Pine Bluff
## 3 45 1991 AR0600300 North Little Rock
## 4 46 1991 AR0600300 North Little Rock
## 5 47 1991 AR0670000 Sevier
## 6 3015 1991 AR0040200 Rogers
## agency_type_name state_abbr state_name division_name region_name
## 1 City AR Arkansas West South Central South
## 2 City AR Arkansas West South Central South
## 3 City AR Arkansas West South Central South
## 4 City AR Arkansas West South Central South
## 5 County AR Arkansas West South Central South
## 6 City AR Arkansas West South Central South
## population_group_code population_group_description incident_date
## 1 3 Cities from 50,000 thru 99,999 1991-07-04
## 2 3 Cities from 50,000 thru 99,999 1991-12-24
## 3 3 Cities from 50,000 thru 99,999 1991-07-10
## 4 3 Cities from 50,000 thru 99,999 1991-10-06
## 5 8D Non-MSA counties under 10,000 1991-10-14
## 6 5 Cities from 10,000 thru 24,999 1991-08-31
## adult_victim_count juvenile_victim_count total_offender_count
## 1 NA NA 1
## 2 NA NA 1
## 3 NA NA 1
## 4 NA NA 2
## 5 NA NA 1
## 6 NA NA 1
## adult_offender_count juvenile_offender_count offender_race
## 1 NA NA Black or African American
## 2 NA NA Black or African American
## 3 NA NA Black or African American
## 4 NA NA Black or African American
## 5 NA NA White
## 6 NA NA White
## offender_ethnicity victim_count
## 1 Not Specified 1
## 2 Not Specified 2
## 3 Not Specified 2
## 4 Not Specified 1
## 5 Not Specified 1
## 6 Not Specified 1
## offense_name
## 1 Aggravated Assault
## 2 Aggravated Assault;Destruction/Damage/Vandalism of Property
## 3 Aggravated Assault;Murder and Nonnegligent Manslaughter
## 4 Intimidation
## 5 Intimidation
## 6 Intimidation
## total_individual_victims location_name
## 1 1 Residence/Home
## 2 1 Highway/Road/Alley/Street/Sidewalk
## 3 2 Residence/Home
## 4 1 Residence/Home
## 5 1 School/College
## 6 1 Highway/Road/Alley/Street/Sidewalk
## bias_desc victim_types multiple_offense multiple_bias
## 1 Anti-Black or African American Individual S S
## 2 Anti-White Individual M S
## 3 Anti-White Individual M S
## 4 Anti-White Individual S S
## 5 Anti-Black or African American Individual S S
## 6 Anti-Black or African American Individual S S
summary(data)
## incident_id data_year ori pug_agency_name
## Min. : 2 Min. :1991 Length:241663 Length:241663
## 1st Qu.: 60446 1st Qu.:1999 Class :character Class :character
## Median : 120873 Median :2006 Mode :character Mode :character
## Mean : 349025 Mean :2007
## 3rd Qu.: 181301 3rd Qu.:2016
## Max. :1494167 Max. :2022
##
## pub_agency_unit agency_type_name state_abbr state_name
## Length:241663 Length:241663 Length:241663 Length:241663
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## division_name region_name population_group_code
## Length:241663 Length:241663 Length:241663
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## population_group_description incident_date adult_victim_count
## Length:241663 Length:241663 Min. : 0.00
## Class :character Class :character 1st Qu.: 0.00
## Mode :character Mode :character Median : 1.00
## Mean : 0.73
## 3rd Qu.: 1.00
## Max. :146.00
## NA's :170538
## juvenile_victim_count total_offender_count adult_offender_count
## Min. : 0.0 Min. : 0.0000 Min. : 0.00
## 1st Qu.: 0.0 1st Qu.: 0.0000 1st Qu.: 0.00
## Median : 0.0 Median : 1.0000 Median : 0.00
## Mean : 0.1 Mean : 0.9559 Mean : 0.61
## 3rd Qu.: 0.0 3rd Qu.: 1.0000 3rd Qu.: 1.00
## Max. :60.0 Max. :99.0000 Max. :60.00
## NA's :172978 NA's :177148
## juvenile_offender_count offender_race offender_ethnicity
## Min. : 0.00 Length:241663 Length:241663
## 1st Qu.: 0.00 Class :character Class :character
## Median : 0.00 Mode :character Mode :character
## Mean : 0.12
## 3rd Qu.: 0.00
## Max. :20.00
## NA's :177155
## victim_count offense_name total_individual_victims
## Min. : 1.000 Length:241663 Min. : 0.000
## 1st Qu.: 1.000 Class :character 1st Qu.: 1.000
## Median : 1.000 Mode :character Median : 1.000
## Mean : 1.242 Mean : 0.989
## 3rd Qu.: 1.000 3rd Qu.: 1.000
## Max. :900.000 Max. :147.000
## NA's :4859
## location_name bias_desc victim_types multiple_offense
## Length:241663 Length:241663 Length:241663 Length:241663
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## multiple_bias
## Length:241663
## Class :character
## Mode :character
##
##
##
##
damage_data <- data %>% filter(offense_name == "Damage")
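This line of code filters the data to include only rows where the offense_name column is exactly “Damage”. Incidents that list several offenses (for example, “Aggravated Assault;Destruction/Damage/Vandalism of Property”) are not matched. This subset is stored in damage_data.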
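The difference between an exact comparison and a substring search matters here because some rows list several offenses separated by semicolons. The sketch below illustrates it on three made-up labels modeled on the values shown by head(data); the vector and variable names are assumptions for illustration only.

```r
# Toy offense labels (made up for illustration)
offense_name <- c(
  "Damage",
  "Intimidation",
  "Aggravated Assault;Destruction/Damage/Vandalism of Property"
)

# Exact comparison, as used by filter(offense_name == "Damage")
exact_match <- offense_name == "Damage"

# Substring search: also catches multi-offense rows that mention Damage
contains_damage <- grepl("Damage", offense_name, fixed = TRUE)

sum(exact_match)      # 1
sum(contains_damage)  # 2
```

Whether multi-offense incidents should be counted is an analysis choice; the analysis in this document uses the exact comparison.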
damage_yearly <- damage_data %>% group_by(data_year) %>%
summarise(count = n())
This code groups damage_data by data_year and counts the number of incidents in each year. The result is stored in damage_yearly.
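For reference, dplyr also offers count() as a shorthand for the group_by() plus summarise(n()) pattern. The sketch below compares the two on a made-up data frame (toy is an assumption for illustration, not the real data).

```r
library(dplyr)

# Toy incidents (made-up years, one row per incident)
toy <- data.frame(data_year = c(1991, 1991, 1992, 1993, 1993, 1993))

# Pattern used in the text
by_summarise <- toy %>%
  group_by(data_year) %>%
  summarise(count = n())

# Shorthand that produces the same year/count table
by_count <- toy %>% count(data_year, name = "count")

by_count$count  # 2 1 3
```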
# Inspect the aggregated data
damage_yearly
## # A tibble: 32 × 2
## data_year count
## <int> <int>
## 1 1991 1231
## 2 1992 1760
## 3 1993 2182
## 4 1994 1642
## 5 1995 2148
## 6 1996 2663
## 7 1997 2396
## 8 1998 2446
## 9 1999 2509
## 10 2000 2634
## # ℹ 22 more rows
X <- damage_yearly$data_year
y <- damage_yearly$count
These lines extract the feature (year) and the label (count) into the variables X and y.
data_model <- data.frame(X, y)
set.seed(42)
train_index <- createDataPartition(y, p = 0.8, list = FALSE)
train_data <- data_model[train_index, ]
test_data <- data_model[-train_index, ]
The first line combines X and y into the data frame data_model. The call set.seed(42) then fixes the random seed to ensure reproducibility: the random steps in the code (such as splitting the data into training and testing sets) produce the same results every time the code is run. It is like giving the random number generator the same starting point, so that the sequence of random numbers it produces is always the same.
The next line splits the data into training (about 80%) and testing (about 20%) sets using createDataPartition. The indices of the training rows are stored in train_index. The training and testing data are stored in train_data and test_data, respectively.
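The effect of the seed on the split can be checked directly. The sketch below uses a made-up table of 32 yearly counts (toy is an assumption for illustration) and confirms that the same seed gives the same partition and that the training and testing sets do not overlap.

```r
library(caret)

# Toy yearly counts (made-up values), one row per year as in data_model
toy <- data.frame(X = 1991:2022, y = seq(1000, 4100, by = 100))

set.seed(42)
idx_a <- createDataPartition(toy$y, p = 0.8, list = FALSE)
set.seed(42)
idx_b <- createDataPartition(toy$y, p = 0.8, list = FALSE)

train <- toy[idx_a, ]
test  <- toy[-idx_a, ]

identical(idx_a, idx_b)                # same seed, same split
nrow(train) + nrow(test) == nrow(toy)  # every year lands in exactly one set
length(intersect(train$X, test$X))     # number of years shared by both sets
```

With only 32 yearly observations, the test set holds just a handful of years, so the reported error depends noticeably on which years happen to be held out.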
X_train <- train_data$X
y_train <- train_data$y
X_test <- test_data$X
y_test <- test_data$y
These lines of code separate the features (X_train and X_test) and labels (y_train and y_test) for both training and testing sets.
1. X_train <- train_data$X: extracts the feature (year) from the training data and assigns it to X_train.
2. y_train <- train_data$y: extracts the label (number of “Damage” crimes) from the training data and assigns it to y_train.
3. X_test <- test_data$X: extracts the feature (year) from the testing data and assigns it to X_test.
4. y_test <- test_data$y: extracts the label (number of “Damage” crimes) from the testing data and assigns it to y_test.
These lines of code separate the input data (features) from the output data (labels) for both the training and testing datasets. This separation is essential for training a machine learning model and then evaluating its performance on new data.
model <- randomForest(y ~ X, data = train_data, ntree = 100)
This line of code trains a Random Forest model using randomForest. The formula y ~ X specifies that y is the dependent variable and X is the independent variable. The model is trained on train_data with 100 trees (ntree = 100). randomForest has no seed argument; reproducibility comes from the set.seed(42) call made earlier, before the data split.
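Reproducibility of a Random Forest fit is controlled through R's random number generator, that is, through set.seed() called before the fit. The sketch below shows this on made-up data (toy, fit_a, fit_b and new_years are assumptions for illustration): two fits started from the same seed give the same predictions.

```r
library(randomForest)

# Toy yearly counts (made-up values)
toy <- data.frame(X = 1991:2022, y = seq(1000, 4100, by = 100))

# Seed set with set.seed() immediately before each fit
set.seed(42)
fit_a <- randomForest(y ~ X, data = toy, ntree = 100)
set.seed(42)
fit_b <- randomForest(y ~ X, data = toy, ntree = 100)

new_years <- data.frame(X = c(1995, 2010, 2022))
all.equal(predict(fit_a, new_years), predict(fit_b, new_years))  # TRUE
```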
y_pred <- predict(model, newdata = data.frame(X = X_test))
This line of code uses the trained model to predict the y values for the X_test data. The predictions are stored in y_pred.
mse <- mean((y_test - y_pred)^2)
print(paste("Mean Squared Error:", mse))
## [1] "Mean Squared Error: 61895.7503074861"
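These lines calculate the Mean Squared Error (MSE) between the actual y_test values and the predicted y_pred values to evaluate the model’s performance, and print it. The MSE is expressed in squared counts; its square root, about 249 incidents per year, is easier to compare with the yearly totals.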
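The MSE and its square root (RMSE, in the same units as the counts) can be written as two small helpers. This is a minimal sketch; the names calc_mse and calc_rmse and the toy vectors are assumptions for illustration. The last line applies the square root to the MSE printed above.

```r
# Mean squared error and its square root (same units as the counts)
calc_mse  <- function(actual, predicted) mean((actual - predicted)^2)
calc_rmse <- function(actual, predicted) sqrt(calc_mse(actual, predicted))

# Toy values (made up): the errors are 0, 0 and -2
actual    <- c(1, 2, 3)
predicted <- c(1, 2, 5)
calc_mse(actual, predicted)   # 4/3
calc_rmse(actual, predicted)  # sqrt(4/3)

# Square root of the MSE reported above for Damage
round(sqrt(61895.7503074861), 1)  # 248.8
```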
year_2023 <- data.frame(X = 2023)
prediction_2023 <- predict(model, year_2023)
print(paste("Predicted frequency of Damage in 2023:", round(prediction_2023)))
## [1] "Predicted frequency of Damage in 2023: 2546"
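The first line creates a data frame year_2023 containing the year 2023, and the second uses the trained model to predict the frequency of “Damage” occurrences for that year.
The last line prints the result: the model predicts 2,546 Damage occurrences in 2023.
This final output gives an estimate for the coming year based on historical data. Because 2023 lies outside the range of training years and a Random Forest cannot extrapolate beyond the values it was trained on, the estimate is driven by the counts of the most recent years in the training data rather than by a continued trend.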
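How a Random Forest behaves for a year outside the training range can be seen on a toy series. The sketch below uses made-up counts with a steady upward trend (toy, fit and future are assumptions for illustration): predictions for 2023 and 2050 coincide, and neither exceeds the largest observed count, because each tree averages training observations in its last leaf.

```r
library(randomForest)

# Toy series with a steady upward trend (made-up values)
toy <- data.frame(X = 1991:2022, y = seq(1000, 4100, by = 100))

set.seed(42)
fit <- randomForest(y ~ X, data = toy, ntree = 100)

# Both years lie beyond the last training year
future <- unname(predict(fit, data.frame(X = c(2023, 2050))))

future[1] == future[2]     # TRUE: every tree sends both years to the same leaf
max(future) <= max(toy$y)  # TRUE: leaves average observed counts
```

If a continuing trend is expected, a model that can extrapolate (for example a linear trend) may be worth comparing against; that comparison is outside the scope of this analysis.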
intimidation_data <- data %>% filter(offense_name == "Intimidation")
This line of code filters the data to include only rows where the offense_name column is “Intimidation”. This subset is stored in intimidation_data.
intimidation_yearly <- intimidation_data %>% group_by(data_year) %>%
summarise(count = n())
This line of code aggregates the intimidation_data by data_year and counts the number of occurrences each year. The result is stored in intimidation_yearly.
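Grouping and counting only produces rows for years that actually occur in the data, so a year with no incidents would be silently absent. The 32-row result here covers every year from 1991 to 2022, but a quick check is cheap. The sketch below shows one on a made-up table (yearly and missing_years are assumptions for illustration).

```r
# Toy yearly table (made-up counts) in which 1993 has no incidents
yearly <- data.frame(data_year = c(1991, 1992, 1994), count = c(5, 7, 6))

# Years in the expected range that are absent from the table
missing_years <- setdiff(1991:1994, yearly$data_year)
missing_years  # 1993
```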
# Inspect the aggregated data
intimidation_yearly
## # A tibble: 32 × 2
## data_year count
## <int> <int>
## 1 1991 1519
## 2 1992 2288
## 3 1993 2421
## 4 1994 2091
## 5 1995 2838
## 6 1996 2965
## 7 1997 2750
## 8 1998 2710
## 9 1999 2605
## 10 2000 2632
## # ℹ 22 more rows
X <- intimidation_yearly$data_year
y <- intimidation_yearly$count
These lines extract the feature (year) and the label (count) into the variables X and y.
data_model <- data.frame(X, y)
set.seed(42)
train_index <- createDataPartition(y, p = 0.8, list = FALSE)
train_data <- data_model[train_index, ]
test_data <- data_model[-train_index, ]
This code fixes the random seed and then splits the data into training (about 80%) and testing (about 20%) sets using createDataPartition. The indices of the training rows are stored in train_index. The training and testing data are stored in train_data and test_data, respectively.
X_train <- train_data$X
y_train <- train_data$y
X_test <- test_data$X
y_test <- test_data$y
These lines of code separate the features (X_train and X_test) and labels (y_train and y_test) for both training and testing sets.
model <- randomForest(y ~ X, data = train_data, ntree = 100)
This line of code trains a Random Forest model using randomForest. The formula y ~ X specifies that y is the dependent variable and X is the independent variable. The model is trained on train_data with 100 trees (ntree = 100). randomForest has no seed argument; reproducibility comes from the set.seed(42) call made earlier, before the data split.
y_pred <- predict(model, newdata = data.frame(X = X_test))
This line of code uses the trained model to predict the y values for the X_test data. The predictions are stored in y_pred.
mse <- mean((y_test - y_pred)^2)
print(paste("Mean Squared Error:", mse))
## [1] "Mean Squared Error: 67203.8869917432"
These lines calculate the Mean Squared Error (MSE) between the actual y_test values and the predicted y_pred values to evaluate the model’s performance, and print it. The square root of this MSE is about 259 incidents per year.
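An MSE is easier to judge next to a simple reference. One common choice, shown here only as a sketch on made-up numbers, is a baseline that always predicts the mean of the training counts; all variable names below are assumptions for illustration and the values are not taken from the real data.

```r
# Toy train/test counts (made-up values)
y_train_toy <- c(2000, 2200, 2400, 2600)
y_test_toy  <- c(2100, 2500)
y_pred_toy  <- c(2150, 2450)  # stand-in for model predictions

# Baseline: always predict the mean of the training counts (2300 here)
baseline_pred <- rep(mean(y_train_toy), length(y_test_toy))

mse_model    <- mean((y_test_toy - y_pred_toy)^2)     # 2500
mse_baseline <- mean((y_test_toy - baseline_pred)^2)  # 40000
```

A model whose MSE is not clearly below the baseline's adds little over the historical average.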
year_2023 <- data.frame(X = 2023)
prediction_2023 <- predict(model, year_2023)
print(paste("Predicted frequency of Intimidation in 2023:", round(prediction_2023)))
## [1] "Predicted frequency of Intimidation in 2023: 3092"
The first line creates a data frame year_2023 containing the year 2023, and the second uses the trained model to predict the frequency of “Intimidation” occurrences for that year.
The last line prints the result: the model predicts 3,092 Intimidation occurrences in 2023.
This final output gives an estimate for the coming year based on historical data. Since 2023 lies outside the training years, the estimate is driven by the counts of the most recent years in the training data.
sim_data <- data %>% filter(offense_name == "Sim.Assault")
This line of code filters the data to include only rows where the offense_name column is “Sim.Assault”, the shortened label assigned to “Simple Assault” earlier. This subset is stored in sim_data.
sim_yearly <- sim_data %>% group_by(data_year) %>%
summarise(count = n())
This code groups sim_data by data_year and counts the number of incidents in each year. The result is stored in sim_yearly.
# Inspect the aggregated data
sim_yearly
## # A tibble: 32 × 2
## data_year count
## <int> <int>
## 1 1991 760
## 2 1992 1240
## 3 1993 1424
## 4 1994 1025
## 5 1995 1395
## 6 1996 1280
## 7 1997 1361
## 8 1998 1366
## 9 1999 1437
## 10 2000 1380
## # ℹ 22 more rows
X <- sim_yearly$data_year
y <- sim_yearly$count
These lines extract the feature (year) and the label (count) into the variables X and y.
data_model <- data.frame(X, y)
set.seed(42)
train_index <- createDataPartition(y, p = 0.8, list = FALSE)
train_data <- data_model[train_index, ]
test_data <- data_model[-train_index, ]
This code fixes the random seed and then splits the data into training (about 80%) and testing (about 20%) sets using createDataPartition. The indices of the training rows are stored in train_index. The training and testing data are stored in train_data and test_data, respectively.
X_train <- train_data$X
y_train <- train_data$y
X_test <- test_data$X
y_test <- test_data$y
These lines of code separate the features (X_train and X_test) and labels (y_train and y_test) for both training and testing sets.
model <- randomForest(y ~ X, data = train_data, ntree = 100)
This line of code trains a Random Forest model using randomForest. The formula y ~ X specifies that y is the dependent variable and X is the independent variable. The model is trained on train_data with 100 trees (ntree = 100). randomForest has no seed argument; reproducibility comes from the set.seed(42) call made earlier, before the data split.
y_pred <- predict(model, newdata = data.frame(X = X_test))
This line of code uses the trained model to predict the y values for the X_test data. The predictions are stored in y_pred.
mse <- mean((y_test - y_pred)^2)
print(paste("Mean Squared Error:", mse))
## [1] "Mean Squared Error: 28555.0848982084"
These lines calculate the Mean Squared Error (MSE) between the actual y_test values and the predicted y_pred values to evaluate the model’s performance, and print it. The square root of this MSE is about 169 incidents per year.
year_2023 <- data.frame(X = 2023)
prediction_2023 <- predict(model, year_2023)
print(paste("Predicted frequency of Simple Assault in 2023:", round(prediction_2023)))
## [1] "Predicted frequency of Simple Assault in 2023: 2273"
The first line creates a data frame year_2023 containing the year 2023, and the second uses the trained model to predict the frequency of “Simple Assault” occurrences for that year.
The last line prints the result: the model predicts 2,273 Simple Assault occurrences in 2023.
This final output gives an estimate for the coming year based on historical data. Since 2023 lies outside the training years, the estimate is driven by the counts of the most recent years in the training data.
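The three sections above repeat the same split, fit, evaluate and predict steps. They could be collected into one function, sketched below under stated assumptions: the name forecast_offense, its arguments, and the toy table are hypothetical and not part of the original analysis. The function expects a yearly table with the columns data_year and count.

```r
library(caret)
library(randomForest)

# Hypothetical helper: repeats the split, fit, evaluate and predict steps
# for one yearly table with columns data_year and count.
forecast_offense <- function(yearly, target_year = 2023, seed = 42) {
  data_model <- data.frame(X = yearly$data_year, y = yearly$count)

  set.seed(seed)
  train_index <- createDataPartition(data_model$y, p = 0.8, list = FALSE)
  train_data <- data_model[train_index, ]
  test_data  <- data_model[-train_index, ]

  model  <- randomForest(y ~ X, data = train_data, ntree = 100)
  y_pred <- predict(model, newdata = test_data)

  list(
    mse        = mean((test_data$y - y_pred)^2),
    prediction = unname(predict(model, data.frame(X = target_year)))
  )
}

# Toy yearly table (made-up counts)
toy_yearly <- data.frame(data_year = 1991:2022, count = seq(1000, 4100, by = 100))
result <- forecast_offense(toy_yearly)
```

Calling forecast_offense(damage_yearly), forecast_offense(intimidation_yearly) and forecast_offense(sim_yearly) would apply the same steps to the three offenses.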