Introduction

About Dataset

This dataset reflects reported incidents of crime (with the exception of murders where data exists for each victim) that occurred in the City of Chicago from 2001 to present (2022), minus the most recent seven days. Data is extracted from the Chicago Police Department’s CLEAR (Citizen Law Enforcement Analysis and Reporting) system.

Objective

Using Chicago crime dataset, we will make a prediction model to forecasting crime frequency based on time series, and analyze whether is there any pattern for a crime to likely to happen and time that required to be more careful.

Library Preparation

library(dplyr)
library(lubridate)
library(tidyr)
library(tidyverse)
library(padr)
library(forecast)
library(tseries)
library(gridExtra)
library(MLmetrics)
library(nortest)

Data Preparation

Import Data

crime <- read.csv('data_input/crimes.csv')

head(crime)
glimpse(crime)
## Rows: 7,473,529
## Columns: 22
## $ ID                   <int> 10224738, 10224739, 11646166, 10224740, 10224741,~
## $ Case.Number          <chr> "HY411648", "HY411615", "JC213529", "HY411595", "~
## $ Date                 <chr> "09/05/2015 01:30:00 PM", "09/04/2015 11:30:00 AM~
## $ Block                <chr> "043XX S WOOD ST", "008XX N CENTRAL AVE", "082XX ~
## $ IUCR                 <chr> "0486", "0870", "0810", "2023", "0560", "0610", "~
## $ Primary.Type         <chr> "BATTERY", "THEFT", "THEFT", "NARCOTICS", "ASSAUL~
## $ Description          <chr> "DOMESTIC BATTERY SIMPLE", "POCKET-PICKING", "OVE~
## $ Location.Description <chr> "RESIDENCE", "CTA BUS", "RESIDENCE", "SIDEWALK", ~
## $ Arrest               <chr> "false", "false", "false", "true", "false", "fals~
## $ Domestic             <chr> "true", "false", "true", "false", "true", "false"~
## $ Beat                 <int> 924, 1511, 631, 1412, 1522, 614, 1434, 1034, 1222~
## $ District             <int> 9, 15, 6, 14, 15, 6, 14, 10, 12, 8, 8, 16, 5, 2, ~
## $ Ward                 <int> 12, 29, 8, 35, 28, 21, 32, 25, 27, 15, 13, 45, 34~
## $ Community.Area       <int> 61, 25, 44, 21, 25, 71, 24, 31, 27, 63, 65, 11, 4~
## $ FBI.Code             <chr> "08B", "06", "06", "18", "08A", "05", "05", "06",~
## $ X.Coordinate         <int> 1165074, 1138875, NA, 1152037, 1141706, 1168430, ~
## $ Y.Coordinate         <int> 1875917, 1904869, NA, 1920384, 1900086, 1850165, ~
## $ Year                 <int> 2015, 2015, 2018, 2015, 2015, 2015, 2015, 2015, 2~
## $ Updated.On           <chr> "02/10/2018 03:50:01 PM", "02/10/2018 03:50:01 PM~
## $ Latitude             <dbl> 41.81512, 41.89508, NA, 41.93741, 41.88190, 41.74~
## $ Longitude            <dbl> -87.67000, -87.76540, NA, -87.71665, -87.75512, -~
## $ Location             <chr> "(41.815117282, -87.669999562)", "(41.895080471, ~

Column Description:

  • ID: Unique identifier for the record.
  • Case.Number: The Chicago Police Department RD Number (Records Division Number), which is unique to the incident.
  • Date: Date when the incident occurred. this is sometimes a best estimate.
  • Block: The partially redacted address where the incident occurred, placing it on the same block as the actual address.
  • IUCR: The Illinois Unifrom Crime Reporting code. This is directly linked to the Primary Type and Description.
  • Primary.Type: The primary description of the IUCR code.
  • Description: The secondary description of the IUCR code, a subcategory of the primary description.
  • Location.Description: Description of the location where the incident occurred.
  • Arrest: Indicates whether an arrest was made.
  • Domestic: Indicates whether the incident was domestic-related as defined by the Illinois Domestic Violence Act.
  • Beat: Indicates the beat where the incident occurred. A beat is the smallest police geographic area
  • District: Indicates the police district where the incident occurred.
  • Ward: The ward (City Council district) where the incident occurred.
  • Community.Area: Indicates the community area where the incident occurred. Chicago has 77 community areas.
  • FBI.Code: Indicates the crime classification as outlined in the FBI’s National Incident-Based Reporting System (NIBRS).
  • X.Coordinate: The x coordinate of the location where the incident occurred in State Plane Illinois East NAD 1983 projection.
  • Y.Coordinate: The y coordinate of the location where the incident occurred in State Plane Illinois East NAD 1983 projection.
  • Year: Year the incident occurred.
  • Updated.On: Date and time the record was last updated.
  • Latitude: The latitude of the location where the incident occurred.
  • Longitude: The longitude of the location where the incident occurred.
  • Location: The location where the incident occurred.

Data Wrangling

As we will conduct Time Series Analysis, we are going to use only the necessary Information which is Date and the kind of crime that recorded in the Primary.Type.

crimes <- crime %>% 
  select(c(Date, Primary.Type)) %>% 
  mutate(Primary.Type = as.factor(Primary.Type),
         Date = mdy_hms(Date), 
         Date = floor_date(Date, unit = "hours")) %>% #takes a date-time object and rounds it down to hours unit
  arrange(Date)

As a limitation of the problem, we will choose a crime case to be analyzed. To determine the case, a plot was created that presented data on the top 5 crimes in Chicago.

crimes %>% 
  count(Primary.Type, sort = T) %>% 
  head(5) %>% 
  ggplot(aes(x = n, y = reorder(Primary.Type, n))) +
  geom_col()+
  labs(title = 'Top 5 Crime in Chicago', 
       x = 'Number of Crime', 
       y = 'Crime')

From the chart above, theft has the highest number of crimes among others. Therefore, in this analysis we will try to do a time series prediction analysis for the theft case.

range(crimes$Date)
## [1] "2001-01-01 00:00:00 UTC" "2022-01-13 23:00:00 UTC"

For this analysis, we will limit the data analysis to only the number of theft crimes from the beginning of 2018 to the end of 2021.

theft <- crimes %>% 
  filter(Primary.Type == 'THEFT') %>% 
  group_by(Date) %>% 
  summarise(Theft = n()) %>% 
  filter(Date > '2018-01-01' & Date < '2022-01-02')
head(theft, 10)

There are still 6 data from 2017, so we will use slice to remove the data.

theft <- theft %>% 
  slice(-(1:6)) 
tail(theft, 10)

There are still 17 data from 2022, so we will use slice to remove the data.

length(theft$Date)
## [1] 33641
theft <- theft %>%
  slice(-c(33625:33641))
range(theft$Date)
## [1] "2018-01-01 00:00:00 UTC" "2021-12-31 22:00:00 UTC"

Now we have the correct range of data. We will fix the missing intervals in the pre-processing data

Data Pre-Processing

Data provisions that must be treated in the Time Series, among others:

  1. no missing intervals
  2. no missing values
  3. data should be ordered by time
colSums(is.na(theft))
##  Date Theft 
##     0     0
theft <- theft %>% 
  pad(start_val = ymd_hms("2018-01-01 00:00:00"), end_val = ymd_hms("2021-12-31 23:00:00")) %>% 
  replace(., is.na(.), 0)
## pad applied on the interval: hour
range(theft$Date)
## [1] "2018-01-01 00:00:00 UTC" "2021-12-31 23:00:00 UTC"

TS Object & EDA

theft_ts <- ts(data = theft$Theft,
   start = min(theft$Date),
   frequency = 24) # daily seasonality

In this section, I tried to explore whether our timeseries object has trend and seasonal properties (one-seasonal/multiseasonal).

# decompose ts object
theft_deco <- decompose(theft_ts)
autoplot(theft_deco)

Based on the plot, the trend still shows a pattern of ups and downs which indicates a multiseasonal time series object. Therefore, we should try reanalyze this data through multiseasonal time series using another seasonal frequency.

  1. First Trial
theft$Theft %>% 
  msts(seasonal.periods = c(24,24*7)) %>% # multiseasonal ts (daily, weekly)
  mstl() %>% # multiseasonal ts decomposition
  autoplot() 

  1. Second Trial
theft$Theft %>% 
  msts(seasonal.periods = c(24, 24*7, 24*7*4)) %>% # multiseasonal ts (daily, weekly, monthly)
  mstl() %>% # multiseasonal ts decomposition
  autoplot() 

  1. Third Trial
theft$Theft %>% 
  msts(seasonal.periods = c(24, 24*7, 24*7*4, 24*7*4*12)) %>% # multiseasonal ts (daily, weekly, monthly, annualy)
  mstl() %>% # multiseasonal ts decomposition
  autoplot() 

The last ts object showed the best decomposition among the 3 trials and therefore will be used for the time series model building. Additionally, note that the data is an additive time series and this is a valuable information for the model building later.

# assign final ts object
theft_msts <- theft$Theft %>% 
  msts(seasonal.periods = c(24, 24*7, 24*7*4, 24*7*4*12))

# check for stationary
adf.test(theft_msts)
## Warning in adf.test(theft_msts): p-value smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  theft_msts
## Dickey-Fuller = -15.149, Lag order = 32, p-value = 0.01
## alternative hypothesis: stationary

Based on Augmented Dickey-Fuller Test (adf.test) result, the p-value is < alpha and therefore the data is already stationary. Therefore, for a model building using SARIMA, we do not need to perform differencing on the data first.

Cross validation

After finishing our time frame, we will continue to conduct the cross validation activities. we will split our data into two types of data. Train data as a data that will be trained in to our model, and test data as a data that provide an unbiased evaluation of a model fit on the training data set.

We will split our train data into two type of data set named train and test. Our test data will be a set of data consist of last 365 days of reported crime from our data frame.

length(theft$Date)
## [1] 35064

The cross-validation scheme for time series should not be sampled randomly, but splitted sequentially.

theft_train <- theft_msts %>% head(length(theft_msts) - 24*7*4*12)
theft_test <- theft_msts %>% tail(24*7*4*12)
head(theft_msts)
## Multi-Seasonal Time Series:
## Start: 1 1
## Seasonal Periods: 24 168 672 8064
## Data:
## [1] 14  8  4  5  2  2

Seasonality Analysis

# Decompose MSTS Object
theft_multi_dec <- theft_msts %>%
  mstl()

theft_multi_dec %>%
  tail(24*7*4*12) %>%
  autoplot()

# Create a data frame based on MSTS Object

df_theft_multi <- as.data.frame(theft_multi_dec)
glimpse(df_theft_multi)
## Rows: 35,064
## Columns: 7
## $ Data         <dbl> 14, 8, 4, 5, 2, 2, 2, 4, 3, 10, 2, 9, 14, 2, 3, 7, 9, 7, ~
## $ Trend        <dbl> 7.038822, 7.038940, 7.039058, 7.039175, 7.039293, 7.03941~
## $ Seasonal24   <dbl> 0.89884443, -2.08155997, -3.88467415, -3.66681841, -4.309~
## $ Seasonal168  <dbl> -0.06864737, 0.51792606, 0.18783236, -0.86665844, 0.36975~
## $ Seasonal672  <dbl> -0.26391796, -0.04478123, 0.02080604, 0.81419168, 0.41859~
## $ Seasonal8064 <dbl> 1.1201326, 0.9887574, 0.6611658, 0.7187973, 0.8358824, 0.~
## $ Remainder    <dbl> 5.27476587, 1.58071770, -0.02418773, 0.96131257, -2.35383~
  1. Hourly Seasonality

These are the plot of Hourly Seasonality of Battery Crime in Chicago

df_theft_multi %>%
  mutate(day = theft$Date) %>%
  group_by(day) %>%
  summarise(seasonal = sum(Seasonal24 + Seasonal168 + Seasonal672 + Seasonal8064)) %>%
  head(24*7) %>%
  ggplot(aes(x = day, y = seasonal)) +
  geom_point(col = "maroon") + 
  geom_line(col = "red") +
  theme_minimal()

Based on the daily plot at 1 week, it can be seen that the average theft crime starts to rise at 10 am, and reaches a peak value at 5 pm (after office hours). The figure will fluctuate and start to fall above 12 o’clock in the morning. At that time the public can be advised to be more careful.

  1. Daily Seasonality
df_theft_multi %>%
  mutate(day = wday(theft$Date, label = T)) %>%
  group_by(day) %>%
  summarise(seasonal = sum(Seasonal24 + Seasonal168 + Seasonal672 + Seasonal8064)) %>%
  ggplot(aes(x = day, y = seasonal)) +
  geom_col() +
  theme_minimal()

As we can see from the plot above, The theft tend to increase in the Wednesday, Thursday, Friday and Saturday.

  1. Monthly Seasonality
df_theft_multi %>%
  mutate(month = month(theft$Date, label = T)) %>%
  group_by(month) %>%
  summarise(seasonal = sum(Seasonal24 + Seasonal168 + Seasonal672 + Seasonal8064)) %>%
  ggplot(aes(x = month, y = seasonal)) +
  geom_col() +
  theme_minimal()

Based on the monthly plot, the theft tend to increase from June until October.

Model Building

For model building, I will compare between two of the widely used time series modeling in business and industry: ETS Holt-Winters and Seasonal Arima. I use ETS Holt-Winters because my data contain both seasonal and trend. I also want to compare it between seasonal Arima to check whether seasonal ARIMA can give better forecasting performance.

# ets Holt-Winters
theft_ets <- stlm(theft_train, method = "ets") # no log transformation for additive data

# SARIMA
theft_arima <- stlm(theft_train, method = "arima")

Forecast & Evaluation

# forecast
theft_ets_f <- forecast(theft_ets, h = 24*7*4*12)
theft_arima_f <- forecast(theft_arima, h = 24*7*4*12)
# visualization

a <- autoplot(theft_ets_f, series = "ETS", fcol = "red") +
  autolayer(theft_msts, series = "Actual", color = "black") + 
  labs(subtitle = "Number of Theft Case at Chicago, from Jan 2018 - Dec 2021",
       y = "Theft Frequency") +
  theme_minimal()

b <- autoplot(theft_arima_f, series = "ARIMA", fcol = "blue") +
  autolayer(theft_msts, series = "Actual", color = "black") +
  labs(subtitle = "Number of Theft Case at Chicago, from Jan 2018 - Dec 2021",
       y = "Theft Frequency") +
  theme_minimal()

grid.arrange(a,b)

# model evaluation: root mean squared error (RMSE)
data.frame(ETS = MAE(theft_ets_f$mean, theft_test), ARIMA = MAE(theft_arima_f$mean, theft_test))

From the analysis above, we can conclude that we have sucsesfully forecast theft frequency and found ARIMA as the best model with the lowest error (MAE ~2.303%, although not very different from ETS model).

Assumption Check

  1. Normality: Anderson-Darling Test
  • H0 : residuals are normally distributed
  • H1 : residuals are not normally distributed
ad.test(theft_arima_f$residuals)
## 
##  Anderson-Darling normality test
## 
## data:  theft_arima_f$residuals
## A = 35.48, p-value < 2.2e-16
hist(theft_arima_f$residuals, breaks = 20)

  1. Autocorrelation: Box.test - Ljng-Box
  • H0 : No autocorrelation in the forecast errors
  • H1 : there is an autocorrelation in the forecast errors
Box.test(theft_arima_f$residuals, type = "Ljung-Box") # there is not enough data to reject H0
## 
##  Box-Ljung test
## 
## data:  theft_arima_f$residuals
## X-squared = 0.0058471, df = 1, p-value = 0.939

Based on the assumption check, there is no autocorrelation on our forecast residuals (p-value > 0.05). Still, our forecast’s residuals are not distributed normally, therefore it’s residuals may not be appeared around its mean as seen in the histogram.

In a time series, such errors might emerge from various unpredictable events and is actually quite unavoidable. One strategy to overcome it is to analyze what kinds of unpredictable events that might occur and occurs frequently. This can be done by time series analysis using seasonality adjustment.

Conclusion

Based on the analysis and our model performance, our model able to predict the theft crime rate in the city of Chicago with Measure of Error (MAE) around 2.303%. Still, it can be optimized thoroughly with several way as our model does not fulfill the Normality assumptions.

As for the seasonal analysis, it is safe to say that the theft crime is likely to increase around 10 am, it reaches a peak value in 5 pm (after office hours) and tend to increase towards 12 o’clock in the morning. The crime itself is more likely to happen from June until October.