This dataset comprises reported crimes in Chicago from 2001 to the present, excluding the most recent seven days, and is sourced from the Chicago Police Department’s CLEAR system. This project uses the 2019 to 2023 subset. Our objective is to employ predictive modeling techniques to forecast crime patterns and inform strategic decision-making.
To keep the project manageable, we will drop columns that are not needed for the analysis (a sketch appears after the date conversion below).
The business questions focus on strategic decision-making in the context of crime prevention. Specifically, they explore how to allocate resources effectively, identify effective crime prevention initiatives, and evaluate the effectiveness of existing policies.
Before proceeding, we need to import the necessary libraries:
# load library
library(tidyverse)
library(lubridate)
library(padr)
library(zoo)
library(forecast)
library(MLmetrics)
library(tseries)
library(ggplot2)
Next, we need to prepare our data before moving on to the next steps:
df_crime <- read.csv("data/Crimes_-_2001_to_Present_20240814.csv")
df_crime %>% head()
Let’s delve deeper to understand our data.
#check structure
df_crime %>%
glimpse()
#> Rows: 1,185,002
#> Columns: 22
#> $ ID <int> 13327763, 13325009, 13324997, 13327752, 13324881,…
#> $ Case.Number <chr> "JH103488", "JH100002", "JH100010", "JH102557", "…
#> $ Date <chr> "12/31/2023 11:59:00 PM", "12/31/2023 11:51:00 PM…
#> $ Block <chr> "010XX N ORLEANS ST", "051XX S PRINCETON AVE", "0…
#> $ IUCR <chr> "1320", "0550", "0530", "0890", "0486", "0454", "…
#> $ Primary.Type <chr> "CRIMINAL DAMAGE", "ASSAULT", "ASSAULT", "THEFT",…
#> $ Description <chr> "TO VEHICLE", "AGGRAVATED POLICE OFFICER - HANDGU…
#> $ Location.Description <chr> "STREET", "STREET", "APARTMENT", "BAR OR TAVERN",…
#> $ Arrest <chr> "false", "true", "false", "false", "false", "fals…
#> $ Domestic <chr> "false", "false", "true", "false", "true", "false…
#> $ Beat <int> 1823, 935, 624, 122, 923, 2532, 1732, 1215, 1834,…
#> $ District <int> 18, 9, 6, 1, 9, 25, 17, 12, 18, 6, 6, 7, 18, 15, …
#> $ Ward <int> 27, 20, 8, 42, 14, 37, 30, 27, 42, 17, 9, 16, 42,…
#> $ Community.Area <int> 8, 37, 69, 32, 63, 25, 21, 28, 8, 71, 49, 67, 8, …
#> $ FBI.Code <chr> "14", "04A", "04A", "06", "08B", "08B", "22", "08…
#> $ X.Coordinate <int> 1173727, 1175152, 1183685, 1175349, 1159244, 1140…
#> $ Y.Coordinate <int> 1907173, 1871065, 1854148, 1902127, 1870437, 1909…
#> $ Year <int> 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2…
#> $ Updated.On <chr> "01/08/2024 03:59:56 PM", "01/17/2024 03:41:31 PM…
#> $ Latitude <dbl> 41.90070, 41.80158, 41.75497, 41.88682, 41.80020,…
#> $ Longitude <dbl> -87.63733, -87.63318, -87.60241, -87.63152, -87.6…
#> $ Location <chr> "(41.900698378, -87.637329754)", "(41.801583507, …
#check missing value
df_crime %>%
is.na() %>%
colSums()
#> ID Case.Number Date
#> 0 0 0
#> Block IUCR Primary.Type
#> 0 0 0
#> Description Location.Description Arrest
#> 0 0 0
#> Domestic Beat District
#> 0 0 0
#> Ward Community.Area FBI.Code
#> 48 2 0
#> X.Coordinate Y.Coordinate Year
#> 18316 18316 0
#> Updated.On Latitude Longitude
#> 0 18316 18316
#> Location
#> 0
Insights:
The Date and Case.Number columns will be used to aggregate case counts by date.
Now that we understand the data better, we can start preparing it. First, we need to convert the Date column into a date-time format and store it in a new column called Crime.Date.
# Convert the string to a date-time object using mdy_hms
df_crime$Crime.Date <-mdy_hms(df_crime$Date, tz = "UTC")
head(df_crime)
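As noted earlier, only a few columns are needed for the daily aggregation that follows. Below is a minimal sketch of how the unnecessary columns could be dropped and the data restricted to the 2019–2023 window; the object name df_crime_slim and the kept columns are illustrative, and the step is optional because the grouping in the next step only carries forward the columns it uses.
# optional sketch: keep only the columns needed downstream and restrict to 2019-2023
df_crime_slim <- df_crime %>%
  filter(Year >= 2019, Year <= 2023) %>%
  select(Case.Number, Crime.Date)
head(df_crime_slim)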
Next, we aggregate the data to a daily level:
- Create a new crime_day column by extracting the date part of Crime.Date.
- Drop the original Date column.
- Group the data by the crime_day column.
- Count the distinct Case.Number values that occurred on each day.
df_crime_n <-
df_crime %>%
mutate(crime_day = floor_date(Crime.Date, "day")) %>%
select(-Date) %>%
group_by(crime_day) %>%
summarise(crime_commited = n_distinct(Case.Number))
head(df_crime_n)
tail(df_crime_n)
After validating the results, we must ensure they are in ascending order and check for any missing intervals.
# arrange the data
df_crime_arrange <- arrange(df_crime_n, crime_day)
head(df_crime_arrange)
mindate <- min(df_crime_arrange$crime_day)
maxdate <- max(df_crime_arrange$crime_day)
Next, let’s check for missing intervals in our data frame by creating a formatted_complete_day object filled with a daily sequence from our oldest to newest crime day, and then comparing it with df_crime_arrange$crime_day.
complete_day <- seq(from = min(df_crime_arrange$crime_day),
to = max(df_crime_arrange$crime_day),
by = "day")
# Removing time zones from the entire time series
formatted_complete_day <- as.Date(format(complete_day, format = "%Y-%m-%d"))
# check for oldest and newest sequence
head(formatted_complete_day)
#> [1] "2019-01-01" "2019-01-02" "2019-01-03" "2019-01-04" "2019-01-05"
#> [6] "2019-01-06"
tail(formatted_complete_day)
#> [1] "2023-12-26" "2023-12-27" "2023-12-28" "2023-12-29" "2023-12-30"
#> [6] "2023-12-31"
#compare with our data
all(df_crime_arrange$crime_day == formatted_complete_day)
#> [1] TRUE
In the last step, we checked for missing intervals in the data frame. Since the result is TRUE, there are no missing intervals, and we can proceed to the next steps.
head(df_crime_arrange)
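Had the comparison returned FALSE, the missing days would need to be inserted before building the time series object. Below is a minimal sketch using the padr package loaded earlier (hypothetical here, since our data has no gaps): pad() adds a row for every missing date, and the resulting NA counts can then be interpolated with zoo::na.approx().
# hypothetical gap filling: only needed if some days were missing
df_crime_filled <- df_crime_arrange %>%
  pad(interval = "day") %>%                            # insert rows for any missing dates
  mutate(crime_commited = na.approx(crime_commited))   # interpolate the inserted NAs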
In this step, we are going to create a time series object from the df_crime_arrange data frame. The ts() function requires the following parameters:
- data: the numerical column to be forecast.
- frequency: the frequency of the recurring pattern to be analyzed.
# create ts object
df_crime_ts<-ts(data=df_crime_arrange$crime_commited, frequency=365)
Let’s explore our newly made ts object.
# Create ts plot
autoplot(df_crime_ts,
ylab = "Crime Rate",
xlab = "Time",
main = "Daily Crime Rate with Seasonal Patterns")
Insights:
# Decomposition
# Create decompose
df_crime_ts %>%
decompose() %>%
autoplot()
Insights:
summary(df_crime_ts)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 340.0 578.0 653.5 648.9 721.0 1898.0
# Distribution
hist(df_crime_ts)
Insights:
Based on the insights gained from the decomposition and descriptive statistics above, our next step is to identify days where crime_commited is unusually high, i.e. above the third quartile plus 1.5 times the IQR.
# Calculate IQR
iqr_value <- IQR(df_crime_arrange$crime_commited)
# Identify potential outliers
df_crime_outliers <- df_crime_arrange %>%
filter(crime_commited > quantile(crime_commited, 0.75) + 1.5 * iqr_value)
df_crime_outliers
Our analysis identified extreme data points on May 31, 2020. Subsequent research revealed that Chicago experienced civil unrest from May 28 to June 1, 2020, following the death of George Floyd, which likely influenced the data. To mitigate the impact of these outliers, we will replace the May 31 data with the average of the preceding and following day’s values.
# Identify the maximum value
max_value <- max(df_crime_arrange$crime_commited)
# Replace the maximum value with NA
df_crime_arrange$crime_commited[df_crime_arrange$crime_commited == max_value] <- NA
# Fill NA with the average of the nearest two values
df_crime_arrange$crime_commited <- na.approx(df_crime_arrange$crime_commited)
# check the data
filtered_data <- df_crime_arrange %>%
filter(crime_day == as.Date("2020-05-31"))
filtered_data
# create ts object
df_crime_ts2<-ts(data=df_crime_arrange$crime_commited, frequency=365)
df_crime_msts <- df_crime_ts2 %>%
msts( seasonal.periods = c(7,30,365))
autoplot(df_crime_msts)
# Create decompose
decom_crime <-
df_crime_msts %>%
mstl()
autoplot(decom_crime)
Insights:
The dataset will be divided into two subsets: training and test/validation. Given a five-year dataset (January 1, 2019, to December 31, 2023), we will allocate the first four years for model training, representing roughly 80% of the data, and the remaining year will serve as the test/validation set.
The msts_train dataset will consist of the first four years of data, from January 1, 2019, to December 31, 2022. This dataset, encompassing approximately 1,461 data points, will be used to train the model. The final year, from January 1, 2023, to December 31, 2023, will form the msts_test dataset for model evaluation. This dataset contains 365 data points.
msts_train <- head(df_crime_msts, length(df_crime_msts) - 365)
msts_test <- tail(df_crime_msts, 365)
A stationary time series has statistical properties (mean, variance, autocorrelation) that remain constant over time, meaning the data does not exhibit trends or seasonal patterns. Stationarity matters because non-stationary data can lead to inaccurate forecasts and misleading results.
There are two methods to check for stationarity. The first is visual inspection of the series:
autoplot(msts_train)
The second is the Augmented Dickey-Fuller (ADF) test, which evaluates the following hypotheses: the null hypothesis is that the series is non-stationary (it has a unit root), and the alternative hypothesis is that the series is stationary.
The decision to reject or fail to reject the null hypothesis is based on the p-value. If the p-value exceeds the significance level (typically 0.05), we conclude that the data is non-stationary. Conversely, a p-value below the significance level indicates stationarity.
adf.test(msts_train)
#>
#> Augmented Dickey-Fuller Test
#>
#> data: msts_train
#> Dickey-Fuller = -3.2702, Lag order = 11, p-value = 0.07582
#> alternative hypothesis: stationary
Insights:
Visual inspection does not show a strong long-term trend, but the ADF test’s p-value = 0.07582 is above the 0.05 significance level, so we fail to reject the null hypothesis of non-stationarity; the evidence for stationarity is only borderline. This is not an obstacle for the models used next, since STLM and TBATS model trend and seasonality directly.
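Since the ADF result is borderline, one optional check is to see how many differences the series would need and whether the differenced series passes the test. This is a minimal sketch using functions from the forecast and tseries packages already loaded; it is not required for the STLM and TBATS models below.
# optional: estimate the required number of differences and re-test
ndiffs(msts_train)           # suggested number of first differences
adf.test(diff(msts_train))   # ADF test on the first-differenced series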
STLM and TBATS are two primary methods for handling time series data with multiple seasonal patterns. STLM is suitable for initial exploration of data and understanding basic seasonal patterns, while TBATS is preferred for accurate forecasting with complex seasonal structures. The choice between the two depends on the specific needs of the analysis and the complexity of the time series data.
The TBATS (Trigonometric Seasonality, Box-Cox Transformation, ARMA Errors, Trend and Seasonal components) model excels at handling intricate seasonal patterns, encompassing those with varying frequencies. It can also adapt to non-linear trends and evolving seasonality over time.
We can leverage the tbats() function with specific parameters:
- use.box.cox = TRUE
- use.trend = FALSE
- use.damped.trend = TRUE
model_tbats <- msts_train %>%
tbats(use.box.cox = TRUE,
use.trend = FALSE,
use.damped.trend = TRUE)
The visualizations below show the actual number of crimes committed and the TBATS model fit.
# Plot of TBATS model results with actual data
msts_train %>%
autoplot()+
autolayer(msts_test, series = "Data test") +
autolayer(model_tbats$fitted, series = "TBATS" )+
ggtitle("Actual Crime Rate vs Model TBATS") +
xlab("Time") +
ylab("Crime Commited")
Insights: The fitted TBATS values track the actual data closely; little deviation is visible in the plot.
The STLM (Seasonal and Trend decomposition using Loess) model is more
intuitive and simpler to implement compared to TBATS. It excels in
generating easily interpretable components, including trend,
seasonality, and residuals. To utilize the STLM model, employ the
stlm() function with a lambda value of 0 and
default settings for other parameters.
model_stlm <- msts_train %>%
  stlm(lambda = 0)
The visualizations below show the actual number of crimes committed and the STLM model fit.
# Plot of STLM Model results with actual data
msts_train %>%
autoplot()+
autolayer(msts_test, series = "Data test") +
autolayer(model_stlm$fitted, series = "STLM" )+
ggtitle("Actual Crime Rate vs Model STLM") +
xlab("Time") +
ylab("Crime Commited")
Insights: The fitted STLM values track the actual data closely; little deviation is visible in the plot.
Following the development of STLM and TBATS models, we will proceed with generating forecasts.
To create a forecast with model_tbats, we use the forecast() function, passing the model and h = 365.
forecast_mod<-forecast(model_tbats, h = 365)
The visualizations below show the actual and forecast number of crimes committed using the TBATS model.
# Plot of TBATS prediction results with actual data
msts_train %>%
autoplot()+
autolayer(msts_test, series = "Data test") +
autolayer(model_tbats$fitted, series = "TBATS" )+
autolayer(forecast_mod$mean, series = "Forecast")+
ggtitle("Actual Crime Rate vs Forecast (TBATS)") +
xlab("Time") +
ylab("Crime Commited")
Insights: The TBATS forecast tracks the test data well, indicating that the model captures the general pattern in the data.
To create a forecast with model_stlm, we use the forecast() function, passing the model and h = 365.
forecast_stlm<-forecast(model_stlm, h = 365)
The visualizations below show the actual and forecast number of crimes committed using the STLM model.
# Plot of STLM prediction results with actual data
msts_train %>%
autoplot()+
autolayer(msts_test, series = "Data test") +
autolayer(model_stlm$fitted, series = "STLM" )+
autolayer(forecast_stlm$mean, series = "Forecast")+
ggtitle("Actual Crime Rate vs Forecast (STLM)") +
xlab("Time") +
ylab("Crime Commited")
Insight: The STLM forecast tracks the test data well, indicating that the model captures the general pattern in the data.
To compare which model performs better, we can calculate the magnitude of the error between the forecast results and the actual values. Similar to regression, we can use error metrics such as MAE (Mean Absolute Error), RMSE (Root Mean Squared Error), and MAPE (Mean Absolute Percentage Error)
The forecast package provides the accuracy() function to display these error metrics at once: accuracy(object, x), where:
- object: a forecast object that stores the forecast results of the model
- x: the data to be evaluated against
metrics<-rbind(accuracy(forecast_mod$mean, msts_test),
accuracy(forecast_stlm$mean, msts_test))
rownames(metrics)<-c("TBATS Model", "STLM Model")
rmarkdown::paged_table(data.frame(metrics))
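As a cross-check, the same error metrics can be computed one at a time with the MLmetrics package loaded at the beginning; a minimal sketch for the TBATS forecast (MLmetrics expects the predictions first, then the actual values, and MAPE is returned as a proportion):
# cross-check the TBATS error metrics with MLmetrics
MAE(y_pred = forecast_mod$mean, y_true = msts_test)
RMSE(y_pred = forecast_mod$mean, y_true = msts_test)
MAPE(y_pred = forecast_mod$mean, y_true = msts_test) * 100  # convert to percent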
Insights:
Based on the metrics, it appears that the TBATS model performs slightly better than the STLM model. This is primarily due to its lower RMSE and MAE values, indicating better overall accuracy and lower average absolute errors. Additionally, its lower ACF1 value suggests less autocorrelation in the residuals.
The TBATS model relies on two key assumptions: normally distributed errors and uncorrelated residuals. We will employ hypothesis tests with a significance level of 0.05 to evaluate if our model satisfies these criteria.
We check for normality using two methods: the Shapiro-Wilk test, which is best suited to small sample sizes, and the Kolmogorov-Smirnov test, which is best suited to larger sample sizes.
# Shapiro-Wilk Test
shapiro.test(residuals(forecast_mod))
#>
#> Shapiro-Wilk normality test
#>
#> data: residuals(forecast_mod)
#> W = 0.95283, p-value < 0.00000000000000022
# Kolmogorov-Smirnov Test
ks.test(forecast_mod$residuals, "pnorm")
#>
#> Asymptotic one-sample Kolmogorov-Smirnov test
#>
#> data: forecast_mod$residuals
#> D = 0.28341, p-value < 0.00000000000000022
#> alternative hypothesis: two-sided
# Histogram
hist(forecast_mod$residuals, main = "Histogram of Residuals")
# Q-Q Plot
qqnorm(forecast_mod$residuals)
qqline(forecast_mod$residuals)
Insights
Since the p-values of both tests are far below 0.05 (p-value < 2.2e-16 for both the Shapiro-Wilk and the Kolmogorov-Smirnov test), we reject the null hypothesis: the model’s residuals are not normally distributed.
The histogram and Q-Q plot support the same conclusion: the residuals are not normally distributed.
Autocorrelation refers to the relationship between a data point and its past values. In the context of time series analysis, it’s crucial to ensure that the residuals (errors) of a model are uncorrelated. This means that the value of a residual at a particular time point should not be dependent on the value of the residual at a previous time point.
# Ljung-Box Test
Box.test(forecast_mod$residuals,type = "Ljung-Box")
#>
#> Box-Ljung test
#>
#> data: forecast_mod$residuals
#> X-squared = 0.033072, df = 1, p-value = 0.8557
# ACF Plot
acf(forecast_mod$residuals)
Insights
Based on the result, the p-value is greater than 0.05 (p-value = 0.8557), so we fail to reject the null hypothesis. This means the residuals are uncorrelated; no autocorrelation is present.
The ACF plot shows no significant spikes outside the confidence bands, which also suggests no autocorrelation.
The STLM model has a Mean Absolute Error (MAE) of 67.58105, while the TBATS model has a lower MAE of 64.20831. Therefore, the TBATS model was chosen for predicting Chicago crime. This decision was based on the comparison of the two models’ performance in time series analysis with multiple seasonality.
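Having selected TBATS, a natural next step, sketched here under the assumption that a one-year horizon is of interest, would be to refit the same specification on the full series and forecast the coming year.
# refit the chosen TBATS specification on the full series and forecast one year ahead
model_final <- tbats(df_crime_msts,
                     use.box.cox = TRUE,
                     use.trend = FALSE,
                     use.damped.trend = TRUE)
forecast_final <- forecast(model_final, h = 365)
autoplot(forecast_final)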
Looking at the original data, one particular event created an extreme outlier: the civil unrest following the George Floyd protests from May 28 to June 1, 2020. With that outlier removed, the highest daily crime count in the Chicago area falls on 2020-01-01, with 1054 cases.
max_crime <- df_crime_arrange %>%
filter(crime_commited == max(crime_commited))
max_crime
Chicago Crime Data, 2001 to present. Retrieved from https://data.cityofchicago.org/Public-Safety/Chicago-Police-Department-Illinois-Uniform-Crime-R/c7ck-438e/about_data
Wikipedia contributors. (2024, August 18). George Floyd protests in Chicago. In Wikipedia, The Free Encyclopedia. Retrieved 11:16, August 19, 2024, from https://en.wikipedia.org/w/index.php?title=George_Floyd_protests_in_Chicago&oldid=1240901576