1 Introduction

This dataset comprises reported crimes in Chicago from 2001 to present, excluding the most recent seven days. Sourced from the Chicago Police Department’s CLEAR system, this project will utilize data from the 2019 to 2023 timeframe. Our objective is to employ predictive modeling techniques to forecast crime patterns and inform strategic decision-making.

To keep the project manageable, we will drop the columns that are not needed for the analysis.
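
For illustration, a minimal sketch of such a column selection, assuming the data has been loaded as df_crime (section 4.1) and keeping only the fields used later for the daily aggregation (the exact set of retained columns is an assumption):

# illustrative only: keep just the columns needed for daily crime counts
df_crime_slim <- df_crime %>%
  select(Case.Number, Date)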

2 Business Question

The business questions focus on strategic decision-making in the context of crime prevention. Specifically, they explore how to allocate resources effectively, identify promising crime prevention initiatives, and evaluate existing policies.

3 Importing Libraries

Before proceeding, we need to import the necessary libraries:

# load libraries
library(tidyverse)  # data wrangling (dplyr) and plotting (ggplot2)
library(lubridate)  # date-time parsing (mdy_hms, floor_date)
library(padr)       # padding missing time intervals
library(zoo)        # na.approx() for interpolating missing values
library(forecast)   # ts/msts objects, TBATS, STLM, forecasting, accuracy()
library(MLmetrics)  # error metrics (MAE, RMSE, MAPE)
library(tseries)    # adf.test() for stationarity
library(ggplot2)    # plotting (also attached by tidyverse)

4 Data Collection and Preparation

Next, we need to prepare our data before we can continue to the next steps:

4.1 Gather Relevant Data

df_crime <- read.csv("data/Crimes_-_2001_to_Present_20240814.csv")
df_crime %>% head()

4.2 Understanding Data

Let's delve deeper to understand our data.

#check structure
df_crime %>%
  glimpse()
#> Rows: 1,185,002
#> Columns: 22
#> $ ID                   <int> 13327763, 13325009, 13324997, 13327752, 13324881,…
#> $ Case.Number          <chr> "JH103488", "JH100002", "JH100010", "JH102557", "…
#> $ Date                 <chr> "12/31/2023 11:59:00 PM", "12/31/2023 11:51:00 PM…
#> $ Block                <chr> "010XX N ORLEANS ST", "051XX S PRINCETON AVE", "0…
#> $ IUCR                 <chr> "1320", "0550", "0530", "0890", "0486", "0454", "…
#> $ Primary.Type         <chr> "CRIMINAL DAMAGE", "ASSAULT", "ASSAULT", "THEFT",…
#> $ Description          <chr> "TO VEHICLE", "AGGRAVATED POLICE OFFICER - HANDGU…
#> $ Location.Description <chr> "STREET", "STREET", "APARTMENT", "BAR OR TAVERN",…
#> $ Arrest               <chr> "false", "true", "false", "false", "false", "fals…
#> $ Domestic             <chr> "false", "false", "true", "false", "true", "false…
#> $ Beat                 <int> 1823, 935, 624, 122, 923, 2532, 1732, 1215, 1834,…
#> $ District             <int> 18, 9, 6, 1, 9, 25, 17, 12, 18, 6, 6, 7, 18, 15, …
#> $ Ward                 <int> 27, 20, 8, 42, 14, 37, 30, 27, 42, 17, 9, 16, 42,…
#> $ Community.Area       <int> 8, 37, 69, 32, 63, 25, 21, 28, 8, 71, 49, 67, 8, …
#> $ FBI.Code             <chr> "14", "04A", "04A", "06", "08B", "08B", "22", "08…
#> $ X.Coordinate         <int> 1173727, 1175152, 1183685, 1175349, 1159244, 1140…
#> $ Y.Coordinate         <int> 1907173, 1871065, 1854148, 1902127, 1870437, 1909…
#> $ Year                 <int> 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2…
#> $ Updated.On           <chr> "01/08/2024 03:59:56 PM", "01/17/2024 03:41:31 PM…
#> $ Latitude             <dbl> 41.90070, 41.80158, 41.75497, 41.88682, 41.80020,…
#> $ Longitude            <dbl> -87.63733, -87.63318, -87.60241, -87.63152, -87.6…
#> $ Location             <chr> "(41.900698378, -87.637329754)", "(41.801583507, …
#check missing value
df_crime %>%
  is.na() %>% 
  colSums()
#>                   ID          Case.Number                 Date 
#>                    0                    0                    0 
#>                Block                 IUCR         Primary.Type 
#>                    0                    0                    0 
#>          Description Location.Description               Arrest 
#>                    0                    0                    0 
#>             Domestic                 Beat             District 
#>                    0                    0                    0 
#>                 Ward       Community.Area             FBI.Code 
#>                   48                    2                    0 
#>         X.Coordinate         Y.Coordinate                 Year 
#>                18316                18316                    0 
#>           Updated.On             Latitude            Longitude 
#>                    0                18316                18316 
#>             Location 
#>                    0

Insights:

  1. Multiple features, including Ward, Community.Area, X.Coordinate, Y.Coordinate, Latitude, and Longitude, contain missing data points.
  2. The Date column is currently stored as a character string and needs to be converted to a date-time type.
  3. For forecasting, we will use the Date and Case.Number columns to aggregate case counts by date.

Now that we understand our data better, we can start preparing it. First, we convert the Date column to a date-time format and store the result in a new column called Crime.Date.

# Convert the string to a date-time object using mdy_hms
df_crime$Crime.Date <- mdy_hms(df_crime$Date, tz = "UTC")
head(df_crime)

Next, we do the following:

  1. Create a new column called crime_day that holds the date part of Crime.Date.
  2. Remove the original Date column.
  3. Group the data by the crime_day column.
  4. Count the number of distinct Case.Number values to get the total number of crimes committed on each day.
df_crime_n <- 
  df_crime %>% 
  mutate(crime_day = floor_date(Crime.Date, "day")) %>%
  select(-Date) %>% 
  group_by(crime_day) %>%
  summarise(crime_commited = n_distinct(Case.Number))

head(df_crime_n)
tail(df_crime_n)

4.3 Data cleaning

After validating the results, we need to make sure the rows are sorted in ascending date order and check for any missing intervals.

# arrange the data
df_crime_arrange <- arrange(df_crime_n, crime_day)
head(df_crime_arrange)
mindate <- min(df_crime_arrange$crime_day)
maxdate <- max(df_crime_arrange$crime_day)

Next, let's check for missing intervals in our data frame by creating a formatted_complete_day object filled with the daily sequence from our oldest to our newest crime day, and then comparing it with df_crime_arrange$crime_day.

complete_day <- seq(from = min(df_crime_arrange$crime_day),
                 to = max(df_crime_arrange$crime_day),
                 by = "day")

# Removing time zones from the entire time series
formatted_complete_day <-  as.Date(format(complete_day, format = "%Y-%m-%d"))

# check for oldest and newest sequence
head(formatted_complete_day)
#> [1] "2019-01-01" "2019-01-02" "2019-01-03" "2019-01-04" "2019-01-05"
#> [6] "2019-01-06"
tail(formatted_complete_day)
#> [1] "2023-12-26" "2023-12-27" "2023-12-28" "2023-12-29" "2023-12-30"
#> [6] "2023-12-31"
#compare with our data
all(df_crime_arrange$crime_day == formatted_complete_day) 
#> [1] TRUE

The last bit checks for missing intervals in the data frame. Since the result is TRUE, there are no missing intervals in our data frame, and we can proceed to the next steps.

head(df_crime_arrange)
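
Had the comparison returned FALSE, the padr package (already loaded) could insert the missing days; a minimal sketch, assuming gaps existed and that days with no recorded case should count as zero crimes:

# illustrative only: pad the daily sequence and fill inserted days with 0
df_crime_padded <- df_crime_arrange %>%
  pad(interval = "day") %>%
  fill_by_value(crime_commited, value = 0)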

4.4 Time series creation

In this step, we are going to create Time Series Object from df_crime_arrange dataframe. The ts function requires the following parameters:

  • data: The numerical column to be forecast.
  • frequency: The frequency of the recurring pattern to be analyzed.
# create ts object (daily data with a yearly seasonal frequency)
df_crime_ts <- ts(data = df_crime_arrange$crime_commited, frequency = 365)

4.5 Exploratory data analysis (EDA)

Let's explore our newly created ts object.

Result

Time Series Plot

# Create ts plot
autoplot(df_crime_ts, 
     ylab = "Crime Rate", 
     xlab = "Time", 
     main = "Daily Crime Rate with Seasonal Patterns")

Insights:

  • High Volatility: The crime rate exhibits significant fluctuations over time, with numerous peaks and valleys.
  • Potential Seasonality: While there are some recurring patterns, they are not immediately apparent due to the high volatility. Further analysis, such as decomposition, would be necessary to confirm seasonality.

Time Series Decomposition

# Decomposition
# Create decompose
df_crime_ts %>% 
  decompose() %>% 
  autoplot()

Insights:

  • Data: The original time series exhibits high volatility with several spikes, indicating potential outliers or anomalies.
  • Trend: There’s a clear downward trend in the data, suggesting a decreasing overall level over time.
  • Seasonal: The seasonal component shows a consistent pattern, likely representing weekly or monthly seasonality. The amplitude of the seasonal component is relatively high compared to the trend.
  • Remainder: The remainder component appears to have some structure, indicating that the decomposition might not fully capture all the patterns in the data. There are also some outliers in the remainder, which could be due to the original data’s outliers or other uncaptured patterns.

Descriptive Statistics

summary(df_crime_ts)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   340.0   578.0   653.5   648.9   721.0  1898.0
# Distribution
hist(df_crime_ts)

Insights:

  • Outliers: The maximum value (1898) might be considered an outlier, as it is significantly higher than the rest of the data. It’s worth investigating this further to understand the reason for this unusually high value.
  • Distribution: The right-skewness suggests that most crime counts are relatively low, with a few instances of significantly higher crime numbers.

Based on the insights gained from the decomposition and descriptive statistics above, our next steps are:

  • Check the reason behind the outliers and act accordingly (a quick visual aid is sketched below)
  • Test for multiple seasonality by constructing an msts object, which allows further analysis
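
As a quick visual aid (not part of the original workflow), a boxplot of the daily counts makes the extreme days easy to spot before applying the IQR rule in the next step:

# visualize the spread of daily crime counts; points beyond the whiskers are potential outliers
boxplot(df_crime_arrange$crime_commited,
        main = "Daily Crimes Committed, 2019-2023",
        ylab = "Crimes per day")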

4.6 Check for Outliers

Next, we identify days whose crime_commited value lies above the upper bound of the IQR rule (the third quartile plus 1.5 times the IQR).

# Calculate IQR
iqr_value <- IQR(df_crime_arrange$crime_commited)

# Identify potential outliers
df_crime_outliers <- df_crime_arrange %>%
  filter(crime_commited > quantile(crime_commited, 0.75) + 1.5 * iqr_value)

df_crime_outliers

Our analysis identified extreme data points on May 31, 2020. Subsequent research revealed that Chicago experienced civil unrest from May 28 to June 1, 2020, following the death of George Floyd, which likely influenced the data. To mitigate the impact of these outliers, we will replace the May 31 data with the average of the preceding and following day’s values.

# Identify the maximum value
max_value <- max(df_crime_arrange$crime_commited)

# Replace the maximum value with NA
df_crime_arrange$crime_commited[df_crime_arrange$crime_commited == max_value] <- NA

# Fill NA with the average of the nearest two values
df_crime_arrange$crime_commited <- na.approx(df_crime_arrange$crime_commited)

# check the data
filtered_data <- df_crime_arrange %>%
  filter(crime_day == as.Date("2020-05-31"))

filtered_data

4.7 Multi Seasonality Time Series

4.7.1 Create msts Object

# create ts object
df_crime_ts2 <- ts(data = df_crime_arrange$crime_commited, frequency = 365)
df_crime_msts <- df_crime_ts2 %>%
  msts(seasonal.periods = c(7, 30, 365))

autoplot(df_crime_msts)

4.7.2 Decomposition

# Create decompose
decom_crime <- 
  df_crime_msts %>% 
   mstl
  
autoplot(decom_crime)

Insights:

  • Dominant Seasonality: The weekly seasonality (Seasonal7) appears to be the most prominent component, suggesting a strong weekly pattern in the data.
  • Multiple Seasonalities: The presence of the Seasonal30 and Seasonal365 components indicates additional seasonal patterns, although their impact appears less significant than the weekly component.
  • Trend: The downward trend suggests a decreasing overall level of the time series over time.
  • Noise: The remainder component contains a significant amount of noise, indicating that the model might not fully capture all the variations in the data.

5 Model and Evaluation

5.1 Cross Validation

The dataset will be divided into two subsets: training and test/validation. Given a five-year dataset, we will allocate the first four years (January 1, 2019, to December 31, 2022) for model training, representing roughly 80% of the data. The remaining year will serve as the test/validation set.

The msts_train dataset will consist of the first four years of data, from January 1, 2019, to December 31, 2022, encompassing approximately 1,461 data points, and will be used to train the model. The final year, from January 1, 2023, to December 31, 2023, forms the msts_test dataset used for model evaluation; it contains 365 data points.

msts_train <- head(df_crime_msts, length(df_crime_msts) - 365)
msts_test <- tail(df_crime_msts,  365)
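
As a quick sanity check (optional), we can confirm that the split produces the expected number of observations:

# confirm the train/test split sizes (expected: 1461 and 365 observations)
length(msts_train)
length(msts_test)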

5.2 Stationarity

A stationary time series has statistical properties (mean, variance, autocorrelation) that remain constant over time, meaning the data does not exhibit trends or seasonal patterns. Stationarity matters because non-stationary data can lead to inaccurate forecasts and misleading results.

There are two ways to check for stationarity:

  1. Visual inspection

autoplot(msts_train)

  2. Statistical tests

To determine if our data exhibits stationarity, we will use the Augmented Dickey-Fuller (ADF) test. This test evaluates the following hypotheses:

  • Null hypothesis (H0): The data is non-stationary.
  • Alternative hypothesis (H1): The data is stationary.

The decision to reject or fail to reject the null hypothesis is based on the p-value. If the p-value exceeds the significance level (typically 0.05), we fail to reject the null hypothesis and treat the data as non-stationary. Conversely, a p-value below the significance level indicates stationarity.

adf.test(msts_train)
#> 
#>  Augmented Dickey-Fuller Test
#> 
#> data:  msts_train
#> Dickey-Fuller = -3.2702, Lag order = 11, p-value = 0.07582
#> alternative hypothesis: stationary

Insights:

  1. Visual inspection shows the high volatility and the downward trend identified earlier, so the series does not look clearly stationary.

  2. The ADF test's p-value (0.07582) is greater than the 0.05 significance level, so we fail to reject the null hypothesis: the data cannot be considered stationary at the 5% level (it would be at the 10% level). Because the STLM and TBATS models used below account for trend and seasonality explicitly, we can still proceed.
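
As an optional follow-up (not part of the original workflow), forecast::ndiffs() estimates how many differences would be needed to make the series stationary, and the ADF test can be re-run on the differenced series as a quick check:

# estimate the number of first differences required for stationarity
ndiffs(msts_train)

# illustrative: re-test after a single difference if ndiffs() suggests 1
adf.test(diff(msts_train, differences = 1))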

5.3 Model Selection

STLM and TBATS are two primary methods for handling time series data with multiple seasonal patterns. STLM is suitable for initial exploration of data and understanding basic seasonal patterns, while TBATS is preferred for accurate forecasting with complex seasonal structures. The choice between the two depends on the specific needs of the analysis and the complexity of the time series data.

5.3.1 TBATS

The TBATS (Trigonometric Seasonality, Box-Cox Transformation, ARMA Errors, Trend and Seasonal components) model excels at handling intricate seasonal patterns, including those with varying frequencies. It can also adapt to non-linear trends and seasonality that evolves over time.

We can leverage the tbats() function with specific parameters:

  • Box-Cox transformation: This parameter allows us to specify a transformation applied to the data to stabilize its variance.
    • Use: use.box.cox = TRUE
  • Trend: Since this model doesn’t utilize a trend component, we set:
    • Use: use.trend = FALSE
  • Damping parameter: This parameter controls how quickly the trend is damped (flattened) over the forecast horizon.
    • Use: use.damped.trend = TRUE
model_tbats <- msts_train %>%
            tbats(use.box.cox = TRUE, 
                  use.trend = FALSE, 
                  use.damped.trend = TRUE)

The visualizations below show the actual number of crimes committed and the fitted values of the TBATS model.

# Plot of TBATS model results with actual data
msts_train %>% 
  autoplot()+
  autolayer(msts_test, series = "Data test") +
  autolayer(model_tbats$fitted, series = "TBATS" )+
  ggtitle("Actual Crime Rate vs Model TBATS") +
  xlab("Time") +
  ylab("Crime Commited")

Insights: The TBATS model follows the actual data closely; little deviation is visible in the plot.

5.3.2 STLM

The STLM (Seasonal and Trend decomposition using Loess) model is more intuitive and simpler to implement compared to TBATS. It excels in generating easily interpretable components, including trend, seasonality, and residuals. To utilize the STLM model, employ the stlm() function with a lambda value of 0 and default settings for other parameters.

model_stlm <- msts_train %>%
  stlm(lambda = 0)

The visualizations below show the actual number of crimes committed and the fitted values of the STLM model.

# Plot of STLM Model results with actual data
msts_train %>% 
  autoplot()+
  autolayer(msts_test, series = "Data test") +
  autolayer(model_stlm$fitted, series = "STLM" )+
  ggtitle("Actual Crime Rate vs Model STLM") +
  xlab("Time") +
  ylab("Crime Commited")

Insights: The STLM model also follows the actual data closely; little deviation is visible in the plot.

5.4 Forecast

Following the development of STLM and TBATS models, we will proceed with generating forecasts.

Results
TBATS

To create a forecast with model_tbats, we use the forecast() function, filling the parameters with model name and h=365.

forecast_mod <- forecast(model_tbats, h = 365)

The visualizations below show the actual and forecast crime counts using the TBATS model.

# Plot of TBATS prediction results with actual data
msts_train %>% 
  autoplot()+
  autolayer(msts_test, series = "Data test") +
  autolayer(model_tbats$fitted, series = "TBATS" )+
  autolayer(forecast_mod$mean, series = "Forecast")+
  ggtitle("Actual Crime Rate vs Forecast (TBATS)") +
  xlab("Time") +
  ylab("Crime Commited")

Insights: The TBATS forecast tracks the test data well, indicating that the model captures the general pattern in the data.

STLM

To create a forecast with model_stlm, we use the forecast() function, filling the parameters with model name and h=365.

forecast_stlm <- forecast(model_stlm, h = 365)

The visualizations below show the actual and forecast crime counts using the STLM model.

# Plot of STLM prediction results with actual data
msts_train %>% 
  autoplot()+
  autolayer(msts_test, series = "Data test") +
  autolayer(model_stlm$fitted, series = "STLM" )+
  autolayer(forecast_stlm$mean, series = "Forecast")+
  ggtitle("Actual Crime Rate vs Forecast (STLM)") +
  xlab("Time") +
  ylab("Crime Commited")

Insight: The STLM forecast also tracks the test data well, indicating that the model captures the general pattern in the data.

5.5 Model Evaluation and Validation

To compare which model performs better, we calculate the magnitude of the error between the forecast results and the actual values. As with regression, we can use error metrics such as MAE (Mean Absolute Error), RMSE (Root Mean Squared Error), and MAPE (Mean Absolute Percentage Error).
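
Since MLmetrics is already loaded, the same quantities can also be cross-checked by hand; a minimal sketch for the TBATS forecast (the values should match the accuracy() table below up to rounding):

# manual cross-check of the TBATS error metrics using MLmetrics
MAE(y_pred = as.numeric(forecast_mod$mean), y_true = as.numeric(msts_test))
RMSE(y_pred = as.numeric(forecast_mod$mean), y_true = as.numeric(msts_test))
MAPE(y_pred = as.numeric(forecast_mod$mean), y_true = as.numeric(msts_test)) * 100  # in percent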

The forecast package provides the accuracy() function to display these error metrics at once: accuracy(object, x), where object is a forecast object that stores the model's forecast results and x is the data to evaluate against.

metrics<-rbind(accuracy(forecast_mod$mean, msts_test),
               accuracy(forecast_stlm$mean, msts_test))
rownames(metrics)<-c("TBATS Model", "STLM Model")
rmarkdown::paged_table(data.frame(metrics))

Insights:

  • RMSE: The TBATS Model has a lower RMSE (79.42) compared to the STLM Model (81.93). A lower RMSE indicates better overall accuracy.
  • MAE: Similarly, the TBATS Model has a lower MAE (64.20) than the STLM Model (67.58), suggesting that it makes smaller average errors.
  • MAPE: The TBATS Model also has a lower MAPE (8.8%) compared to the STLM Model (9.29%). This means it has a smaller average percentage error.
  • ACF1: Both models have similar ACF1 values, indicating that they both capture autocorrelation in the data to a comparable extent.
  • Theil’s U: The TBATS Model has a slightly lower Theil’s U (1.23) compared to the STLM Model (1.28). This suggests that the TBATS Model is more efficient in forecasting compared to a naive model.

Based on these metrics, the TBATS model performs slightly better than the STLM model, primarily due to its lower RMSE and MAE values, which indicate better overall accuracy and smaller average errors, along with its slightly lower Theil's U.

6 Prediction Performance

The TBATS model relies on two key assumptions: normally distributed errors and uncorrelated residuals. We will employ hypothesis tests with a significance level of 0.05 to evaluate if our model satisfies these criteria.

6.1 Normality

We check for normality using two methods: the Shapiro-Wilk test, which is best suited to smaller samples, and the Kolmogorov-Smirnov test, which is better suited to larger samples.

Results
Shapiro-Wilk Test
# Shapiro-Wilk Test
shapiro.test(residuals(forecast_mod))
#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  residuals(forecast_mod)
#> W = 0.95283, p-value < 0.00000000000000022
Kolmogorov-Smirnov Test
# Kolmogorov-Smirnov Test
ks.test(forecast_mod$residuals, "pnorm")
#> 
#>  Asymptotic one-sample Kolmogorov-Smirnov test
#> 
#> data:  forecast_mod$residuals
#> D = 0.28341, p-value < 0.00000000000000022
#> alternative hypothesis: two-sided
Histogram
# Histogram
hist(forecast_mod$residuals, main = "Histogram of Residuals")

Q-Q Plot
# Q-Q Plot
qqnorm(forecast_mod$residuals)
qqline(forecast_mod$residuals)

Insights

  1. The Shapiro-Wilk and Kolmogorov-Smirnov normality tests use the hypotheses:
  • H0: The residuals of the model are normally distributed
  • H1: The residuals of the model are not normally distributed
  2. Since the p-value for both tests is below 0.05 (p-value < 0.00000000000000022 in each case), we reject the null hypothesis. This means the model's residuals are not normally distributed.

  3. Both the histogram and the Q-Q plot confirm that the model's residuals are not normally distributed.

6.2 Auto-Correlation

Autocorrelation refers to the relationship between a data point and its past values. In the context of time series analysis, it’s crucial to ensure that the residuals (errors) of a model are uncorrelated. This means that the value of a residual at a particular time point should not be dependent on the value of the residual at a previous time point.

# Ljung-Box Test
Box.test(forecast_mod$residuals,type = "Ljung-Box")
#> 
#>  Box-Ljung test
#> 
#> data:  forecast_mod$residuals
#> X-squared = 0.033072, df = 1, p-value = 0.8557
# ACF Plot
acf(forecast_mod$residuals)

Insights

  1. The autocorrelation test using the Ljung-Box test holds the hypotheses:
  • H0: the errors have no autocorrelation
  • H1: the errors have autocorrelation
  2. Based on the result, the p-value is greater than 0.05 (p-value = 0.8557), so we fail to reject the null hypothesis. This means the residuals are not correlated, i.e. no autocorrelation is present.

  3. The ACF plot shows no significant spikes outside the confidence bands, which also suggests no autocorrelation is present.
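
As a convenience, forecast::checkresiduals() bundles the Ljung-Box test, the ACF plot, and a residual histogram into a single call; this optional one-liner is equivalent to the checks above:

# one-stop residual diagnostics for the TBATS forecast (Ljung-Box test, ACF plot, histogram)
checkresiduals(forecast_mod)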

7 Conclusion

The STLM model has a Mean Absolute Error (MAE) of 67.58105, while the TBATS model has a lower MAE of 64.20831. Therefore, the TBATS model was chosen for predicting Chicago crime. This decision was based on the comparison of the two models’ performance in time series analysis with multiple seasonality.

Looking at the original data, one particular event created an extreme outlier: the civil unrest following the death of George Floyd, from May 28 to June 1, 2020. With that outlier removed, the highest daily crime count in the Chicago area occurs on 2020-01-01, with 1054 cases.

max_crime <- df_crime_arrange %>%
  filter(crime_commited == max(crime_commited))

max_crime

8 Reference

Chicago Crime Data 2001 - present, Retrieved from https://data.cityofchicago.org/Public-Safety/Chicago-Police-Department-Illinois-Uniform-Crime-R/c7ck-438e/about_data

Wikipedia contributors. (2024, August 18). George Floyd protests in Chicago. In Wikipedia, The Free Encyclopedia. Retrieved 11:16, August 19, 2024, from https://en.wikipedia.org/w/index.php?title=George_Floyd_protests_in_Chicago&oldid=1240901576