Introduction

The dataset we’re using reflects reported incidents of crime that occured in the City of Chicago from 2001 to present (2020). Data is extracted from Chicago Police Department’s CLEAR (Citizen Law Enforcement Analysis and Reporting) system.

What we’re trying to see as an exercise in Time Series, is to focus on one crime only, THEFT. We’ll try to see if there’s any seasonality or trend related toward the amount of theft going on, and maybe give a helpful tip on when theft is more likely to occur.

EDA

Create Time Series Object

Plotting our Time Series

Our Decomposed time series containing Trend, Seasonal, and Error plot

There’s a sharp spike in theft rate from 2015 to 2016, as well as 2019 to 2020. What’s causing it is abit unclear, even after looking online for news reference. This may impact our model perfromance greatly, but we’ll try our forecast anyway.


Theft Seasonality Plot

Looking at our graph, this means that the rate of theft increases from May to August, and the rate of theft is lowest on February.

Cross Validation

We need to split our data into train and test in order check our model performance. In this case, I’m using the last 8 month as our test period.

Modelling

We will try both HoltWinters as well as ARIMA and compare our MAPE (Mean Absolute Percentage Error) to measure our model performance, where the closer it is to 0, the better.

HoltWinters

Our model forecast.

Our MAPE in HoltWinters is 52.37%.

## [1] 51.86823

Our HW model forecast compared to our actual data.

ARIMA

We’ll be using auto.arima for our model

ARIMA’s forecast plot

Our MAPE in ARIMA model is 39.8%. A 12.57% improvement from our HoltWinters model, but still not a good number.

## [1] 39.20893

Our ARIMA model forecast compared to our actual data.

Conclusion for Theft Monthly Time Series

There’s a large spike starting from 2019, similar to the behaviour observed in 2015. Unfortunately, such sudden spike is causing our model to fail to make an accurate prediction.

Therefore, the insight that can probably be taken from our theft time series model comes from our seasonality plot, where we know that the rate of theft is high from May to August.


Theft Spike from 2015 to 2016

Inspecting the period between March 2015 to August 2016, we found that there is multiple seasonality, Weekly and Monthly.

Below is our modiefied Data Wrangling to accomodate our need.

## Warning: 1 failed to parse.

Below is our multiple seasonality time series object

Below is our decomposed time series object, showing the trend, both seasonality, and random/error value.

Weekly Seasonality Analysis

As we can see from the plot below, crime rate is the highest on the 15th to 23rd each month.

The months where theft incident is high are April, July, October, and December.

Based on the two seasonality plot above, we can recommend that Chicago Police and Chicago citizens to be more vigilant for theft around April, July, October, and December, as well as 15th to 23rd of each month, because according to the data we observed, that’s when theft is most likely to happen.


Cross Validation

We’re taking the last 20 weeks as our test, and the rest as our training data.

HoltWinters Complex Seasonal Model Forecast

Below is our forecast for HoltWinters model.

Comparing our forecast with our actual data

The resulting MAPE of our model is 11%.

## [1] 11.0139

Conclusion

After examining both overall data and the data on 2015-2016, there’s a difference in our seasonal data pattern, as we can see below.

Monthly Seasonal

Our monthly seasonal data can be used to determine when the police and citizen should be more aware of theft.

2015 to 2016


Both of our forecast failed to predict the sudden jump/fall of our theft rate, although the model in between 2015 to 2016 has smaller MAPE value because the sudden jump/fall period is smaller.


In conclusion, while we can extract good seasonal data, the forecast ability is not as reliable, because of the existence of said sudden jump/fall in numbers that we don’t seem to be able to predict.