Introduction

This project was created to fulfill the Learning by Building (LBB) assignment on the Time Series (TS) material in the Machine Learning Specialization Course. The dataset used in this project is the Chicago Crime Dataset, which reflects reported incidents of crime (with the exception of murders, for which data exists for each victim) within the City of Chicago. The data spans from 2001 to the present, excluding the most recent seven days, and is sourced from the CLEAR (Citizen Law Enforcement Analysis and Reporting) system of the Chicago Police Department.

With the Chicago crime dataset, our aim is to develop a prediction model using time series analysis to forecast crime frequency. Through this analysis, we will investigate whether there are discernible patterns in when crimes occur and identify specific time periods that demand increased vigilance and caution.

knitr::include_graphics("assets/180805183142-01-chicago-0805.jpg")

Import Libraries

library(dplyr)      # data manipulation
library(lubridate)  # parsing and manipulating date-times
library(tidyr)      # data tidying
library(tidyverse)  # ggplot2 and the rest of the tidyverse
library(padr)       # padding gaps in date-time series
library(forecast)   # time series models and forecasting (ets, stlm, msts, mstl)
library(tseries)    # time series statistical tests
library(gridExtra)  # arranging multiple plots (grid.arrange)
library(MLmetrics)  # error metrics (MAE)
library(nortest)    # normality tests (ad.test)

Read Dataset

crime <- read.csv("dataset/Crimes_-_2001_to_Present.csv")
head(crime)

Here is a description of each column; a quick check of the imported structure follows the list below.

  • ID: Unique identifier for the record.
  • Case.Number: The Chicago Police Department RD Number (Records Division Number), which is unique to the incident.
  • Date: Date when the incident occurred. This is sometimes a best estimate.
  • Block: The partially redacted address where the incident occurred, placing it on the same block as the actual address.
  • IUCR: The Illinois Uniform Crime Reporting code. This is directly linked to the Primary Type and Description.
  • Primary.Type: The primary description of the IUCR code.
  • Description: The secondary description of the IUCR code, a subcategory of the primary description.
  • Location.Description: Description of the location where the incident occurred.
  • Arrest: Indicates whether an arrest was made.
  • Domestic: Indicates whether the incident was domestic-related as defined by the Illinois Domestic Violence Act.
  • Beat: Indicates the beat where the incident occurred. A beat is the smallest police geographic area.
  • District: Indicates the police district where the incident occurred.
  • Ward: The ward (City Council district) where the incident occurred.
  • Community.Area: Indicates the community area where the incident occurred. Chicago has 77 community areas.
  • FBI.Code: Indicates the crime classification as outlined in the FBI’s National Incident-Based Reporting System (NIBRS).
  • X.Coordinate: The x coordinate of the location where the incident occurred in State Plane Illinois East NAD 1983 projection.
  • Y.Coordinate: The y coordinate of the location where the incident occurred in State Plane Illinois East NAD 1983 projection.
  • Year: Year the incident occurred.
  • Updated.On: Date and time the record was last updated.
  • Latitude: The latitude of the location where the incident occurred.
  • Longitude: The longitude of the location where the incident occurred.
  • Location: The location where the incident occurred.
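
Before wrangling, it can help to confirm the column types R assigned on import. A minimal check (a sketch using glimpse(), which dplyr already provides):

glimpse(crime) # compact overview of every column and its type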

Data Wrangling

We will keep only the columns needed for the time series analysis, namely Date and Primary.Type.

crimes <- crime %>% 
  select(c(Date, Primary.Type)) %>% 
  mutate(Date = mdy_hms(Date),                    # parse the month-day-year timestamps
         Date = floor_date(Date, unit = "hours"), # round each timestamp down to the hour
         Primary.Type = as.factor(Primary.Type)) %>% 
  arrange(Date)                                   # order chronologically

To keep the analysis manageable, we will focus on a single crime category. To select it, we plot the five most prevalent crimes in Chicago and identify the category to analyze in depth.

top_5_crimes <- crimes %>%
  count(Primary.Type, sort = TRUE) %>%
  head(5) %>%
  mutate(Primary.Type = factor(Primary.Type, levels = rev(Primary.Type)))

ggplot(top_5_crimes, aes(x = n, y = reorder(Primary.Type, n), fill = n)) +
  geom_col() +
  scale_fill_gradient(low = "#f3ff82", high = "#1f4260", labels= scales::comma) + 
  scale_x_continuous(labels = scales::comma) + 
  labs(title = 'Top 5 Crime Cases in Chicago', 
       x = 'Number of Crimes', 
       y = 'Crime',
       fill = 'Number of Crimes') +
  theme(plot.title = element_text(hjust = 0.5))

Based on the plot above, theft has the highest number of recorded incidents of any crime type. Consequently, this analysis will focus on a time series forecast for theft, the most frequent crime category.

For the purpose of this analysis, we will narrow down the data to focus solely on the number of theft crimes occurring between the beginning of 2018 and the end of 2022.

theft <- crimes %>% 
  filter(Primary.Type == 'THEFT') %>% 
  group_by(Date) %>% 
  summarise(Theft = n()) %>% 
  filter(Date > '2018-01-01' & Date < '2023-01-02')
head(theft, 7)

There are 6 rows from the end of 2017, so we will use slice() to remove them.

theft <- theft %>% 
  slice(-(1:6))

Don’t forget to check the tail of the data too.

tail(theft, 5)

There are 17 rows from the beginning of 2023, so we will also use slice() to remove them.

theft <- theft %>% 
  slice(-(42177:42193))
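
As an alternative to hard-coded row positions, the same trimming could be written as an explicit filter on the timestamps (a sketch; it assumes the Date column is stored in UTC, which the range check below confirms):

# equivalent trimming with explicit UTC bounds instead of row positions (a sketch)
theft <- theft %>% 
  filter(Date >= ymd_hms("2018-01-01 00:00:00", tz = "UTC"),
         Date <= ymd_hms("2022-12-31 23:00:00", tz = "UTC"))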

Check the range of Date once more to make sure the series covers exactly 2018 through 2022.

range(theft$Date)
## [1] "2018-01-01 00:00:00 UTC" "2022-12-31 23:00:00 UTC"

Data Pre-Processing

Check for missing values, since data used for time series analysis must not contain any.

colSums(is.na(theft))
##  Date Theft 
##     0     0

Next, the series must be complete and evenly spaced in time, so we pad any missing hours and fill their counts with 0.

theft <- theft %>% 
  pad(start_val = ymd_hms("2018-01-01 00:00:00"), 
      end_val = ymd_hms("2022-12-31 23:00:00")) %>% # insert a row for every missing hour
  replace(., is.na(.), 0)                           # hours with no recorded theft become 0
range(theft$Date)
## [1] "2018-01-01 00:00:00 UTC" "2022-12-31 23:00:00 UTC"

Time Series

theft_ts <- ts(data = theft$Theft,
   start = min(theft$Date),
   frequency = 24) # daily seasonality

autoplot(theft_ts)

From the plot, we can see that the data does not have a clear trend (neither downward nor upward). For further analysis, we use a decomposition plot to examine the properties of the data.

# decompose ts object
theft_deco <- decompose(theft_ts)
autoplot(theft_deco)

Looking at the decomposition, the trend component still exhibits fluctuations or oscillations, which suggests a multi-seasonal time series. To gain deeper insight, we should re-examine the data with additional seasonal frequencies beyond the daily pattern identified so far.

Daily, Weekly Seasonality

theft$Theft %>% 
  msts(seasonal.periods = c(24,24*7)) %>% 
  mstl() %>% 
  autoplot() 

Daily, Weekly, Monthly Seasonality

theft$Theft %>% 
  msts(seasonal.periods = c(24, 24*7, 24*7*4)) %>% 
  mstl() %>% 
  autoplot() 

Daily, Weekly, Monthly, Annual Seasonality

theft$Theft %>% 
  msts(seasonal.periods = c(24, 24*7, 24*7*4, 24*7*4*12)) %>% 
  mstl() %>% 
  autoplot()

The last multi-seasonal object produced the cleanest decomposition of the three attempts, making it the preferred choice for building the time series model. It is also important to note that the data follows an additive time series pattern, which will be valuable information in the subsequent modeling steps.

theft_ts <- theft$Theft %>% 
  msts(seasonal.periods = c(24, 24*7, 24*7*4, 24*7*4*12))

Cross Validation

The cross-validation scheme for time series must split the data sequentially; here the last 24*7*4*12 = 8,064 hours (roughly the final 11 months) are held out as the test set.

theft_train <- theft_ts %>% head(length(theft_ts) - 24*7*4*12)
theft_test <- theft_ts %>% tail(24*7*4*12)
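
A quick check (a sketch) that the two partitions are sequential and together cover the whole series:

# the train and test partitions should add back up to the full series (a sketch)
length(theft_train) + length(theft_test) == length(theft_ts)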

Analyze the Seasonality

# Decompose MSTS Object
theft_multi_dec <- theft_ts %>%
  mstl()

theft_multi_dec %>%
  tail(24*7*4*12) %>%
  autoplot()

Convert the decomposed msts object into a data frame.

df_theft_multi <- as.data.frame(theft_multi_dec)

Hourly Seasonality

df_theft_multi %>%
  mutate(day = theft$Date) %>%
  group_by(day) %>%
  summarise(seasonal = sum(Seasonal24 + Seasonal168 + Seasonal672 + Seasonal8064)) %>%
  head(24*7) %>%
  ggplot(aes(x = day, y = seasonal)) +
  geom_point(col = "#1f4260") + 
  geom_line(col = "#17b79c") +
  theme_minimal()

Insight: Based on the hourly plot over a one-week window, the seasonal component of theft rises gradually from 10 am and peaks at 5 pm, around the end of office hours. It then fluctuates and gradually decreases after midnight. During these peak hours, it would be prudent to advise the public to exercise heightened caution and vigilance.

Daily Seasonality

df_theft_multi %>%
  mutate(day = wday(theft$Date, label = TRUE)) %>%
  group_by(day) %>%
  summarise(seasonal = sum(Seasonal24 + Seasonal168 + Seasonal672 + Seasonal8064)) %>%
  ggplot(aes(x = day, y = seasonal, group = 1)) + 
  geom_line(col = "#17b79c") +
  geom_point(col = "#1f4260") + 
  theme_minimal()

As observed from the plot, theft incidents are noticeably higher on Wednesdays and Fridays.

Monthly Seasonality

df_theft_multi %>%
  mutate(month = month(theft$Date, label = TRUE)) %>%
  group_by(month) %>%
  summarise(seasonal = sum(Seasonal24 + Seasonal168 + Seasonal672 + Seasonal8064)) %>%
  ggplot(aes(x = month, y = seasonal, group = 1)) +  
  geom_line(col = "#17b79c") +
  geom_point(col = "#1f4260") + 
  theme_minimal()

As observed from the plot, theft incidents are noticeably higher from April until August.

Model Building

For the model building phase, we will compare three approaches: Simple Exponential Smoothing (SES), ETS Holt-Winters, and Seasonal ARIMA (the latter two fitted on the STL-decomposed series with stlm()).

Simple Exponential Smoothing (SES)

theft_ses <- ets(theft_train, model = "ANN")
summary(theft_ses)
## ETS(A,N,N) 
## 
## Call:
##  ets(y = theft_train, model = "ANN") 
## 
##   Smoothing parameters:
##     alpha = 0.5128 
## 
##   Initial states:
##     l = 11.1006 
## 
##   sigma:  3.3013
## 
##      AIC     AICc      BIC 
## 460351.2 460351.2 460376.7 
## 
## Training set error measures:
##                         ME    RMSE      MAE  MPE MAPE     MASE        ACF1
## Training set -0.0003627239 3.30125 2.542178 -Inf  Inf 0.818224 -0.03970625

ETS Holt-Winters

theft_ets <- stlm(theft_train, method = "ets")
summary(theft_ets)
##               Length Class  Mode     
## stl           250320 mstl   numeric  
## model             19 ets    list     
## modelfunction      1 -none- function 
## lambda             0 -none- NULL     
## x              35760 msts   numeric  
## series             1 -none- character
## m                  1 -none- numeric  
## fitted         35760 ts     numeric  
## residuals      35760 ts     numeric

Seasonal ARIMA

theft_arima <- stlm(theft_train, method = "arima")
summary(theft_arima)
##               Length Class          Mode     
## stl           250320 mstl           numeric  
## model             18 forecast_ARIMA list     
## modelfunction      1 -none-         function 
## lambda             0 -none-         NULL     
## x              35760 msts           numeric  
## series             1 -none-         character
## m                  1 -none-         numeric  
## fitted         35760 ts             numeric  
## residuals      35760 ts             numeric

Forecasting

theft_ses_f <- forecast(theft_ses, h = 24*7*4*12)
theft_ets_f <- forecast(theft_ets, h = 24*7*4*12)
theft_arima_f <- forecast(theft_arima, h = 24*7*4*12)

Visualize the forecast

ses <- autoplot(theft_ses_f, series = "SES", fcol = "#abeb88") +
  autolayer(theft_ts, series = "Actual", color = "black") + 
  labs(subtitle = "Number of Theft Case at Chicago from Jan 2018 - Dec 2022",
       y = "Frequency") +
  theme_minimal()

ets <- autoplot(theft_ets_f, series = "ETS", fcol = "#67d294") +
  autolayer(theft_ts, series = "Actual", color = "black") + 
  labs(subtitle = "Number of Theft Case at Chicago from Jan 2018 - Dec 2022",
       y = "Frequency") +
  theme_minimal()

arima <- autoplot(theft_arima_f, series = "ARIMA", fcol = "#009a9c") +
  autolayer(theft_ts, series = "Actual", color = "black") +
  labs(subtitle = "Number of Theft Case at Chicago from Jan 2018 - Dec 2022",
       y = "Frequency") +
  theme_minimal()

grid.arrange(ses,ets,arima)

MAE Value

data.frame(SES   = MAE(theft_ses_f$mean, theft_test), 
           ETS   = MAE(theft_ets_f$mean, theft_test), 
           ARIMA = MAE(theft_arima_f$mean, theft_test))

Based on the MAE values above, we have successfully forecasted theft frequency, and the ARIMA model emerged as the most effective choice, showing slightly better accuracy on the test set than SES and ETS. Therefore, the ARIMA model is the preferred option for this particular time series forecasting task.
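
As an optional cross-check (a sketch, not part of the original selection criterion), the same comparison can be repeated with RMSE from MLmetrics, which penalizes large hourly errors more heavily; like MAE(), RMSE() takes the predictions first and the actuals second:

# RMSE comparison on the same test window (a sketch)
data.frame(SES   = RMSE(theft_ses_f$mean, theft_test), 
           ETS   = RMSE(theft_ets_f$mean, theft_test), 
           ARIMA = RMSE(theft_arima_f$mean, theft_test))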

Assumption Check

Normality: Anderson-Darling Test

  • H0 : residuals are normally distributed
  • H1 : residuals are not normally distributed
ad.test(theft_arima_f$residuals)
## 
##  Anderson-Darling normality test
## 
## data:  theft_arima_f$residuals
## A = 57.652, p-value < 2.2e-16
hist(theft_arima_f$residuals, breaks = 20)

Autocorrelation: Box.test - Ljung-Box

  • H0 : No autocorrelation in the forecast errors
  • H1 : there is autocorrelation in the forecast errors
Box.test(theft_arima_f$residuals, type = "Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  theft_arima_f$residuals
## X-squared = 0.54887, df = 1, p-value = 0.4588

Based on the assumption checks, the residuals are not normally distributed (Anderson-Darling p-value < 0.05), but there is no autocorrelation in the forecast residuals (Ljung-Box p-value > 0.05).
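
For a combined residual diagnostic, the forecast package also provides checkresiduals(), which plots the residuals, their ACF, and a histogram, and runs a Ljung-Box test in a single call (a sketch):

# one-call residual diagnostics for the chosen model (a sketch)
checkresiduals(theft_arima_f)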

Conclusion

Based on the analysis and the performance evaluation of the models, we have successfully forecasted theft frequency, and the ARIMA model has been identified as the best model, demonstrating the strongest performance among the models considered for this forecasting task. On the seasonality side, theft crime tends to increase from around 10 am onward, peaking in the late afternoon, and there is a notable rise in theft incidents from April until August.