Theft in Chicago - A Time Series Report

Calvin Christopher

Sunday, May 30th 2021

1. Context

Theft is the taking of another person’s property or services or scrap money without that person’s permission or consent with the intent to deprive the rightful owner of it.The word theft is also used as an informal shorthand term for some crimes against property, such as burglary, embezzlement, larceny, looting, robbery, shoplifting, library theft or fraud. In some jurisdictions, theft is considered to be synonymous with larceny, in others, theft has replaced larceny. Someone who carries out an act of or makes a career out of theft is known as a thief. (Source: Wikipedia)

In this report we are going to make a report using Time Series method to see if there is a pattern (Seasonality) and/or trend to thievery in Chicago. The Data itself came from Chicago Police Department and ranged from January 2001 to October 2018.

2. Data Preparation

2.1 Library Preparation

library(dplyr) 
library(forecast) 
library(tseries) 
library(MLmetrics) 
library(ggplot2)
library(lubridate)
library(TSstudio)

2.2 Read Data

theft <- read.csv("theft-ts.csv")
str(theft)

## 'data.frame':    6492 obs. of  2 variables:
##  $ Date        : chr  "2001-01-01" "2001-01-02" "2001-01-03" "2001-01-04" ...
##  $ Amount_Theft: int  413 221 226 243 265 246 205 244 261 255 ...

theft <- theft %>%
  mutate(Date = ymd(Date)) %>% 
  arrange(Date) 

# Check for missing value
colSums(is.na(theft))

##         Date Amount_Theft 
##            0            0

range(theft$Date)

## [1] "2001-01-01" "2018-10-10"

Column Description:

Date Date

Amount_Theft Daily theft instances

3. Modelling

3.1 TS Object

theft_ts <- ts(data = theft$Amount_Theft, start = ymd("2010-01-01"), frequency = 365)
autoplot(theft_ts) +
  ylab("Confirmed Case")

ggplot(theft %>% head(3650), aes(x = Date, y = Amount_Theft)) +
  geom_line()

theft_dc <- decompose(x = theft_ts)

autoplot(theft_dc)

3.2 Seasonality Analysis

Seasonality analysis will help us know when the target value is high and/or low.

theft_seas <- theft %>% 
  mutate(seasonal = theft_dc$seasonal, 
         month = month(ymd(Date), label = T, abbr = T)) %>%
  distinct(month, seasonal) 

# ploting
theft_seas %>% 
  ggplot(mapping = aes(x = month, y = seasonal)) +
  geom_col() + 
  theme_minimal() +
  ylab("Confirmed Case")

By referring to the plot above, police and other departments of justice can anticipate when thievery most likely to happen, and then dispatch more patrol or other smart strategies to reduce or prevent theft.

Insight:

Theft are highest mid-year, with July and August at the highest.
Theft seems to drop during winter, with February at the lowest, followed by January and December.

3.3 Cross Validation

We are going to split the data using head and tail since the data should be splitted sequentially rather than randomly.

theft_train <- theft_ts %>% head(length(theft_ts) - 365)
theft_test <- theft_ts %>% tail(365)

3.4 Modeling

For modeling we are going to use Holt-Winters and SARIMA since our data contains both Trends and Seasonality.

#Holt-Winters Model
theft_hw <- HoltWinters(x = theft_train)

# Check if data is Stationary
adf.test(diff(theft_train, lag = 365, differences = 1))

## 
##  Augmented Dickey-Fuller Test
## 
## data:  diff(theft_train, lag = 365, differences = 1)
## Dickey-Fuller = -10.033, Lag order = 17, p-value = 0.01
## alternative hypothesis: stationary

Data is stationary with 1 times differencing. It is crucial to use a stationary data in SARIMA, usually differencing are done 1-2 times. It is advisable not to use data with more than 2 times differencing.

# Auto SARIMA 
theft_auto_sarima <- stlm(y = theft_train, s.window = 365, method='arima')
summary(theft_auto_sarima)

##               Length Class          Mode     
## stl           24508  mstl           numeric  
## model            18  forecast_ARIMA list     
## modelfunction     1  -none-         function 
## lambda            0  -none-         NULL     
## x              6127  ts             numeric  
## series            1  -none-         character
## m                 1  -none-         numeric  
## fitted         6127  ts             numeric  
## residuals      6127  ts             numeric

Forecasts:

forecast_hw <- forecast(object = theft_hw, h = 365)
forecast_sarima <- forecast(theft_auto_sarima, h = 365)

3.5 Model Evaluation

#MAPE(y_pred = forecast_hw$mean, y_true = btc_test)
#MAPE(y_pred = forecast_sarima$mean, y_true = btc_test)

accuracy(object = forecast_hw, x = theft_test)

##                      ME     RMSE      MAE       MPE     MAPE      MASE
## Training set -0.3380423 31.60689 23.27167 -1.692058 11.33559 0.7829420
## Test set      9.0453118 25.09082 20.22590  3.791786 11.60077 0.6804715
##                      ACF1 Theil's U
## Training set -0.002588675        NA
## Test set      0.211432606 0.9607076

accuracy(object = forecast_sarima, x = theft_test)

##                      ME     RMSE      MAE       MPE     MAPE      MASE
## Training set -0.2886329 28.44855 20.99752 -1.807458 10.08910 0.7064314
## Test set      5.6276813 25.01393 19.25523  1.835932 11.06513 0.6478146
##                     ACF1 Theil's U
## Training set 0.002013285        NA
## Test set     0.419684183 0.9152758

SARIMA model is better since it has lowered MAPE, although they differ only a bit. Therefore we are going to continue using SARIMA model.

plot_forecast(forecast_obj = forecast_sarima)

Normality (Shapiro Test) is a test to see if the residuals are normal or abnormal by looking at the distribution. If they are close to the mean, the data is normal. Vice-versa.

shapiro.test(forecast_sarima$residuals[0:5000])

## 
##  Shapiro-Wilk normality test
## 
## data:  forecast_sarima$residuals[0:5000]
## W = 0.95482, p-value < 2.2e-16

Since the result p-value is < 0.05, it means that residuals are not normally distributed.

hist(forecast_sarima$residuals, breaks = 100)

Although the data seems normally distributed in the histogram, when we look at it using line plot it will look similar to the error plot from our object decomposition.

Box.test(forecast_sarima$residuals)

## 
##  Box-Pierce test
## 
## data:  forecast_sarima$residuals
## X-squared = 0.024835, df = 1, p-value = 0.8748

Since the p-value is > 0.05, it means that the data is normal (No Auto-correlation on residuals).

4. Conclusion

Based on model evaluation, there are no auto-correlation on our forecast residuals (p-value > 0.05). Still, our forecast residuals are not distributed normally, therefore its residuals may not be appeared around its mean as seen in the histogram above. But, if we inspect the distribution of residuals through a line plot, it is actually resembles the error plot from our time series object decomposition.

In a time series, such errors might emerge from various unpredictable events and is actually quite unavoidable. One strategy to overcome it is to analyze what kinds of unpredictable events that might occur and occurs frequently. This can be done by time series analysis using seasonality adjustment. From that insight, police and/or departments of justice can develop a standard operational procedure and smart strategies to deal with such events.