Introduction

Chicago is one of America's most iconic cities, with a long and colorful history. Recently, Chicago was also a setting for one of Netflix's popular series: Ozark. The story has it that Chicago is the center of drug distribution for the Navarro cartel.

So, how true is the series? A quick search on the internet reveals a recently released DEA report on the subject. The report shows that drug crime does exist in Chicago, with drugs distributed by the Cartel de Jalisco Nueva Generacion, the Sinaloa Cartel and the Guerreros Unidos, to name a few.

The government of the City of Chicago provides a publicly available crime database. I have created a subset of the data covering narcotics crimes from 2016 - 2020, available here. With this dataset, I hope to take you on a journey to answer some questions that may help in fighting drug-related crimes:

  1. Are there any weekly, monthly and seasonal patterns to drug crimes?
  2. What is the trend of drug-related crimes in Chicago throughout the years?
  3. How much does COVID-19 impact drug-related crimes?

Data pre-processing

Before proceeding, let's import the required packages:

library(tidyverse)
library(lubridate)
library(forecast)
library(kableExtra)
library(zoo)

Next, we load the data:

data <- read_csv('chicago_narcotics_2016_2020.csv')
glimpse(data)
## Rows: 57,115
## Columns: 10
## $ case_number          <chr> "JC341682", "HZ337479", "JA352545", "JA420415", …
## $ date                 <chr> "2019-07-09 17:23:00 UTC", "2016-07-05 23:00:00 …
## $ iucr                 <dbl> 2092, 2093, 2022, 1822, 2022, 1812, 2093, 2022, …
## $ description          <chr> "SOLICIT NARCOTICS ON PUBLICWAY", "FOUND SUSPECT…
## $ location_description <chr> "SIDEWALK", "STREET", "STREET", "CHA PARKING LOT…
## $ arrest               <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, …
## $ domestic             <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ district             <dbl> 9, 2, 9, 5, 16, 4, 16, 16, 16, 8, 17, 4, 2, 6, 5…
## $ ward                 <dbl> 3, 3, 12, 9, 41, 10, 45, 45, 45, 13, 50, 8, 4, 8…
## $ community_area       <dbl> 37, 37, 59, 54, 10, 48, 12, 10, 11, 64, 13, 47, …
data %>% head()
## # A tibble: 6 x 10
##   case_number date   iucr description location_descri… arrest domestic district
##   <chr>       <chr> <dbl> <chr>       <chr>            <lgl>  <lgl>       <dbl>
## 1 JC341682    2019…  2092 SOLICIT NA… SIDEWALK         TRUE   FALSE           9
## 2 HZ337479    2016…  2093 FOUND SUSP… STREET           TRUE   FALSE           2
## 3 JA352545    2017…  2022 POSS: COCA… STREET           TRUE   FALSE           9
## 4 JA420415    2017…  1822 MANU/DEL:C… CHA PARKING LOT… TRUE   FALSE           5
## 5 JB176183    2018…  2022 POSS: COCA… SIDEWALK         TRUE   FALSE          16
## 6 JB474406    2018…  1812 POSS: CANN… STREET           TRUE   FALSE           4
## # … with 2 more variables: ward <dbl>, community_area <dbl>

Cleaning and Wrangling

As we will only be focusing on the occurrences of the crimes, we actually need only the date column. We'll also have to convert the timestamps to Chicago's local timezone (America/Chicago):

data <- data %>% 
    select(date) %>% 
    mutate(
        # Parse the UTC timestamps and convert to Chicago local time
        datetime = with_tz(ymd_hms(date), tz = 'America/Chicago')
    ) %>%
    arrange(datetime)

After adjusting the timezone, there are 5 crimes which happened in the late hours of 2015. To make our analysis easier, let's remove these records and limit our analysis to 2016 - 2020 local time.

data <- data %>% 
    filter(datetime >= ymd('2016-01-01', tz = 'America/Chicago'))

Let's add other columns to facilitate the analysis:

  • date (date) : Date
  • day_of_week (numeric) : Day of the week, from Monday (1) to Sunday (7)
  • day (numeric) : Day of the month, from 1 - 31
  • week_of_month (numeric) : Week of the month from 1 - 6
  • month (numeric) : Month of the year, from Jan (1) to Dec (12)
  • year (numeric) : Year
  • is_day (logical) : Categorized into day (06:00 - 17:59) or night (18:00 - 05:59)
data.clean <- data %>% 
    mutate(
        date = date(datetime),
        day_of_week = wday(date, week_start = 1), # Monday = 1 ... Sunday = 7
        day = day(datetime),
        # Week of month calculation: 
        # https://stackoverflow.com/questions/25199851/r-how-to-get-the-week-number-of-the-month
        week_of_month = (5 + day(date) + wday(floor_date(date, 'month'))) %/% 7,
        month = month(date),
        year = as.factor(year(date)),
        is_day = hour(datetime) >= 6 & hour(datetime) < 18 # TRUE for 06:00 - 17:59
    )

head(data.clean)
## # A tibble: 6 x 8
##   date       datetime            day_of_week   day week_of_month month year 
##   <date>     <dttm>                    <dbl> <int>         <dbl> <dbl> <fct>
## 1 2015-12-31 2015-12-31 12:50:00           5    31             5    12 2015 
## 2 2015-12-31 2015-12-31 16:23:00           5    31             5    12 2015 
## 3 2015-12-31 2015-12-31 21:00:00           5    31             5    12 2015 
## 4 2015-12-31 2015-12-31 21:27:00           5    31             5    12 2015 
## 5 2015-12-31 2015-12-31 23:28:00           5    31             5    12 2015 
## 6 2016-01-01 2016-01-01 00:35:00           6     1             1     1 2016 
## # … with 1 more variable: is_day <lgl>

Exploratory data analysis

Before getting into modeling, let's take a look at the data by plotting the daily number of crimes:

data.daily <- data.clean %>% 
    group_by(date) %>% 
    summarize(n_crime = n())

data.ts.daily <- ts(data.daily$n_crime, start = c(2016,1), frequency = 365)
data.ts.daily %>% autoplot() + 
    geom_vline(xintercept = c(2016:2020), col = 'red') +
    labs(x = 'Year', y = 'No. of narcotic crimes in a day', title = 'Number of narcotic crimes in Chicago, 2016 - 2020')

The data shows a wavelike (sinusoidal) seasonality, with high crime numbers at the start of the year, decreasing towards the middle of the year, and an increase around the last quarter before going down again. We can also see that 2020 is unique, with a massive drop in crime around the end of the first quarter. This is most probably due to the stay-at-home orders enforced during the COVID-19 pandemic.

decompose(data.ts.daily) %>% autoplot()

Yearly trend

From the data, we can see that there is an increasing trend from the beginning of 2017 which plateaued around Q1 of 2019. The trend then nosedives during Q3 of 2019, plunging until the start of 2020. Keep in mind that the trendline is a 365-day moving average: the sharp COVID drop in early 2020 is smeared backwards by the centered window, which is why its effect already appears in Q3 2019. We should not take the trend literally. We'll revisit this later during modeling.
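
To see why the drop appears so early, note that a centered 365-day moving average at day t averages roughly 182 days on either side of t, so observations from early 2020 already enter the windows of points in Q3 2019. As a quick sketch, we can overlay this moving average on the daily series using forecast's ma():

# The centered 365-day window at day t spans roughly t-182 .. t+182,
# so the early-2020 drop pulls the trend estimate down from ~Q3 2019 onwards
trend.ma <- ma(data.ts.daily, order = 365)
data.ts.daily %>% autoplot(series = 'daily') +
    autolayer(trend.ma, series = '365-day moving average')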

Yearly seasonality

From the decomposition, we can see that there is a seasonality to the data. Crimes spike during the first quarter of the year, peaking around February-March. To get a better understanding, we can isolate the seasonal component and smooth it with a 7-day moving average.

seas.var <- decompose(data.ts.daily)$seasonal[1:365] # Taking one year of the seasonal component

seas.var.ma <- ma(seas.var, 7) # 7-day moving average

ts(seas.var, frequency = 30) %>% autoplot(col = 'darkgrey') +
    labs(title = '7-day average seasonal effect of drug crimes in 1 year', x = 'month', y = 'deviance') +
    geom_line(aes(y = seas.var.ma), col = 'red') +
    geom_hline(yintercept = 0, col = 'blue') + 
    geom_vline(xintercept = seq(1, 13, by = 0.25), col = 'lightblue') +
    geom_vline(xintercept = seq(1, 13), col = 'blue', lty = 2) +
    ylim(c(-12, 12))

From the graph, we can see that in the early months of the year (Jan-May), drug-related crimes are above average. They begin to flatten out and decrease from June until the end of the year. However, we can see large spikes in the first week of July and the first week of September, with another notable spike during the first week of March. These peaks might coincide with the spring, summer and autumn breaks notorious for college parties.

Year-to-year trend

To better understand the year-to-year trend, we can remove the effects of seasonality before looking at the trend.

data.daily <- data.daily %>%
    mutate(
        adj = seasadj(decompose(data.ts.daily)) # Seasonally adjusted series
    )

# Comparing the raw data with the seasonally adjusted moving average
data.daily %>% 
    ggplot(aes(x = date, y = n_crime)) +
    geom_line(alpha = 0.4) +
    geom_line(aes(y = ma(adj, order = 7)), col = 'red')

The grey line in the graph shows the values with seasonality, while the red line is the moving average of the seasonally adjusted data. Even without seasonality, there is a drop in drug crimes from the middle of Q1 2016 until the end of that year, a slow, steady rise from the beginning of 2017 until Q3 2019, followed by a marked decrease in Q4 2019 and the pronounced drop in early 2020 due to COVID-19.

Monthly trend

Next, let's look at how the average number of crimes varies across the days of the month.

data.clean %>%
    group_by(year, month, day) %>% 
    summarize(n_crime = n()) %>% # Count the number of crimes on each day
    group_by(day) %>% 
    summarize(mean_n_crime = mean(n_crime)) %>% # Average the counts per day of the month
    ggplot(aes(x = day, y = mean_n_crime)) + 
    geom_line() + 
    scale_x_continuous(breaks = seq(1, 31)) +
    labs(x = 'Day of the month', y = 'Average narcotics crime count', title = 'Average crime count for every day of the month') + 
    geom_vline(xintercept = seq(0, 28, by = 7), col = 'red', lty = 2)

From the graph, we can see that most drug crimes are caught during the first days of the month. Drug-related crimes spike up again at the beginning of the fourth week, and rise significantly over the last seven days of the month.

Daily trend

Similarly, let's see the effect of the day of the week on the number of drug cases:

data.clean %>%
    group_by(year, month, week_of_month, day_of_week) %>% 
    summarize(n_crime = n()) %>% # Count the number of crimes on each day
    group_by(day_of_week) %>% 
    summarize(mean_n_crime = mean(n_crime)) %>% # Average the counts per day of the week
    ggplot(aes(x = day_of_week, y = mean_n_crime)) + 
    geom_line() + 
    scale_x_continuous(breaks = seq(1, 7), labels = c("Mon","Tue", "Wed", "Thu", "Fri", "Sat", "Sun")) +
    scale_y_continuous(breaks = seq(28, 42)) +
    labs(x = 'Day of the week', y = 'Average narcotics crime count', title = 'Average crime count for every day of the week')

From the data, we can see that Saturday is the day on which most drug crimes are caught, with almost a 50% increase compared to Monday.

Modeling

The COVID-19 pandemic heavily distorts trends, including drug-crime trends. To limit its impact, let's cut the data off at the end of February 2020, giving us a cleaner picture. With the models built on this data, we'll then try to measure the impact of the pandemic on drug crimes.

Time-series prediction is a difficult subject. For this modeling, we'll create two models: one based on daily data to predict daily crime counts, and another based on monthly data to predict monthly crime counts. We'll isolate the data from the effects of COVID-19 by cutting it off at 29 February 2020, before the impact was felt. We'll use the last two months (1 Jan 2020 - 29 Feb 2020) as the test dataset and the remaining data as the train dataset.

# Getting data to be used in the model: 1 Jan 2016 - 29 Feb 2020
data.model <- data.clean %>% 
    filter(date >= '2016-01-01' & date <= '2020-02-29') %>% 
    group_by(date) %>% 
    summarize(n_crime = n())

# Splitting into train and test data
data.test <- data.model %>% filter(date >= '2020-01-01')
data.train <- data.model %>% filter(date < '2020-01-01')

# Visualizing the split data
ts(data.train$n_crime, start = c(2016, 1), frequency = 365) %>% autoplot(series = 'Train') +
    autolayer(ts(data.test$n_crime, start = c(2020, 1), frequency = 365), series = 'Test') +
    labs(title = 'Train and test datasets', x = 'Year', y = 'Num crime')

Daily model

# Creating train and test datasets for the daily model
data.train.daily <- ts(data.train$n_crime, start = c(2016, 1), frequency = 365)
data.test.daily <- ts(data.test$n_crime, start = c(2020, 1), frequency = 365)

# Decomposing data
decompose(data.train.daily) %>% autoplot()

We can see that the data has both trend and seasonality. We'll approach it with two models: Holt-Winters and ARIMA.

Holt-Winters model

First, we'll fit a Holt-Winters model using automated parameter fitting.

# Fitting models
model.hw.daily <- HoltWinters(data.train.daily)

# Making predictions
model.hw.daily.result <- forecast(model.hw.daily, h = length(data.test.daily))

# Calculating accuracy on the test set
model.hw.daily.acc <- accuracy(model.hw.daily.result$mean, data.test.daily)
model.hw.daily.acc
##                ME     RMSE      MAE      MPE     MAPE
## Test set 9.300797 9.300797 9.300797 31.00266 31.00266
# Visualizing
data.test.daily %>% autoplot(series = 'real') +
    autolayer(model.hw.daily.result$mean, series = 'predicted') + 
    labs(title = 'Prediction using Holt-Winters model', x = 'Time', y = 'Number of crimes')

The model differs quite significantly from the actual values, with a MAPE (mean absolute percentage error) of 31.00% and an MPE of 31.00. Identical MPE and MAPE values indicate that the errors all point in the same direction: the predicted values are consistently higher than the real values.
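
For reference, these metrics can be computed by hand. Below is a minimal sketch of the formulas that accuracy() reports, with actual and predicted as plain numeric vectors (the helper names are ours, for illustration):

# Mean percentage error: the sign reveals the direction of the bias
mpe  <- function(actual, predicted) mean((actual - predicted) / actual) * 100
# Mean absolute percentage error: the overall size of the errors
mape <- function(actual, predicted) mean(abs(actual - predicted) / actual) * 100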

Model tuning

Holt-Winters models can be tuned through three parameters: alpha, beta and gamma. These correspond to the smoothing weights of the error (level), trend and seasonality terms in the model. Each parameter ranges from 0 to 1, where values close to 0 spread the weight across the whole history of the data, and values close to 1 give greater weight to the most recent data points.

The values for these parameters from the automated model fitting are:

  • alpha : 0.0702478
  • beta : 0
  • gamma : 0.5936261
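
These values can be read directly off the fitted model object, for example:

# Smoothing parameters chosen by the automated fitting
model.hw.daily$alpha
model.hw.daily$beta
model.hw.daily$gamma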

From the fitted model, the value of beta is 0, which means the model removes the trend component. This makes sense: for most of the data the series trends upwards, but beginning in Q3 2019 the trend inverts and decreases. As there is no consistent trend, the model eliminates the trend term from the prediction.

The model attributes a high value to gamma, recognizing a remarkable effect of seasonality, while only attributing a small value to alpha. The plotted graph shows that our prediction overshoots the real values. Let's try to adjust the parameters to better reflect recent events by increasing the values of alpha and gamma.

# Fitting models
model.hw.daily <- HoltWinters(data.train.daily, alpha = 0.7, beta = 0, gamma = 0.7)

# Making predictions
model.hw.daily.result <- forecast(model.hw.daily, h = length(data.test.daily))

# Calculating accuracy on the test set
model.hw.daily.acc <- accuracy(model.hw.daily.result$mean, data.test.daily)
model.hw.daily.acc
##                ME     RMSE      MAE      MPE     MAPE
## Test set 4.049182 4.049182 4.049182 13.49727 13.49727
# Visualizing
data.test.daily %>% autoplot(series = 'real') +
    autolayer(model.hw.daily.result$mean, series = 'predicted') + 
    labs(title = 'Prediction using HoltWinters model', x = 'Time', y = 'Number of crimes')

By tuning alpha = 0.7 and gamma = 0.7, we have reduced the MAPE from 31.0% to 13.5%. Visually, we can also see that the predictions now track the real values more closely.

Selecting training data

Reflecting on the value of beta = 0, let's try narrowing down the training dataset to 2019 only, instead of starting from 2016. Every month of the calendar except February has 30 or 31 days, and unfortunately we cannot specify periods of variable length in R's time series objects. Therefore, we make each month 30 days long by deleting every 31st day. February is still 2 days short, so we also shift the starting period of the test dataset 2 days earlier.

# Keeping only 30 days per month (drop every 31st day)
data.train.2019 <- data.train %>% 
    filter(date >= '2019-01-01' & date <= '2019-12-30' & day(date) != 31)

# Modeling data
data.train.2019 <- ts(data.train.2019$n_crime, start = c(1,1), frequency = 30)
model.hw.daily.2019 <- HoltWinters(data.train.2019)

# Creating the test dataset on the 30-day scale, shifted 2 days earlier
data.test.2019 <- ts(data.test$n_crime, start = c(12, 28), frequency = 30)

# Making predictions
model.hw.daily.2019.result <- forecast(model.hw.daily.2019, h = length(data.test.2019))

# Calculating accuracy on the test set
model.hw.daily.2019.acc <- accuracy(model.hw.daily.2019.result$mean, data.test.2019)
model.hw.daily.2019.acc
##                 ME     RMSE      MAE       MPE     MAPE
## Test set -4.402395 4.402395 4.402395 -14.67465 14.67465
# Visualizing and comparing the result
data.test.2019 %>% autoplot(series = 'real') +
    autolayer(model.hw.daily.2019.result$mean, series = 'predicted') + 
    labs(title = 'Prediction using HoltWinters model, only 2019 dataset', x = 'Time', y = 'Number of crimes')

Before any tuning, the model trained on one year of data produces a better MAPE than the model trained on four years of data (2016 - 2019). This highlights an important factor of time-series prediction:

Past performance does not guarantee future predictions

Time-series predictions are only as good as the persistence of the prevailing trends and seasonality. The world is complex, however, and sudden disruptions to that regularity, such as the COVID-19 pandemic, throw the models off. The model trained on 2019 alone fits the recent trend better, but it also has far less history from which to learn the yearly seasonality.

Model Improvement

Our previous attempts highlight the difficulty of using granular time spans such as days for prediction. To overcome this problem, it is common practice to apply calendar adjustment to regularize the input data.

Calendar adjustment

Calendar adjustment minimizes variations caused by the differing numbers of days in each month by taking averages rather than totals. Let's convert our daily data into monthly averages:

# Creating monthly average train data set
data.train.monthly <- data.train %>% 
    mutate(
        year = year(date),
        month = month(date)
    ) %>% 
    group_by(year, month) %>% 
    summarize(n_crime = mean(n_crime))

data.train.monthly <- ts(data.train.monthly$n_crime, start = c(2016,1), frequency = 12)

# Decomposing the time series
data.train.monthly %>% decompose() %>% autoplot()

As we can see, the trend and seasonality are much simpler.

# Creating monthly test data
data.test.monthly <- data.test %>% 
    mutate(
        year = year(date),
        month = month(date)
    ) %>% 
    group_by(year, month) %>% 
    summarize(n_crime = mean(n_crime))

# The test set starts one month after the end of the training set
data.test.monthly <- ts(data.test.monthly$n_crime, start = end(data.train.monthly) + c(0, 1), frequency = 12)
# Training the monthly model and making predictions
model.hw.monthly <- HoltWinters(data.train.monthly)
model.hw.monthly.result <- forecast(model.hw.monthly, h = length(data.test.monthly))

# Calculating accuracy on the test set
accuracy(model.hw.monthly.result$mean, data.test.monthly)
##                 ME     RMSE      MAE       MPE     MAPE ACF1 Theil's U
## Test set -2.910803 3.515282 2.910803 -7.833375 7.833375 -0.5  1.857236
# Visualizing train, test and predicted values
data.train.monthly %>% autoplot(series = 'data') +
    autolayer(model.hw.monthly.result$fitted, series = 'model') + 
    autolayer(data.test.monthly, series = 'test') + 
    autolayer(model.hw.monthly.result$mean, series = 'prediction')

# Visualizing and comparing the result
data.test.monthly %>% autoplot(series = 'real') +
    autolayer(model.hw.monthly.result$mean, series = 'predicted') + 
    labs(title = 'Prediction using HoltWinters model, monthly dataset', x = 'Time', y = 'Number of crimes')

SARIMA model

The monthly Holt-Winters model reaches a MAPE of 7.83%, a clear improvement over the daily models. Let's now try a more complicated model. ARIMA models combine autoregression (correlation with the data's own past values), integration (differencing) and moving averages of past errors to make a prediction. To use an ARIMA model, we have to make our dataset stationary, meaning it has no trend or seasonality. To do that, we take the diff of the dataset, that is, the difference between consecutive data points. Let's use the monthly data with a Seasonal ARIMA (SARIMA) model.
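
As a quick check of how much differencing is needed for stationarity, we can use forecast's ndiffs() and nsdiffs() (a sketch; auto.arima() below performs similar checks internally):

# Estimated number of regular and seasonal differences needed
# to make the monthly series stationary
ndiffs(data.train.monthly)
nsdiffs(data.train.monthly)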

The Seasonal ARIMA (SARIMA) model is an extension of the ARIMA model with the addition of a seasonal component. The base ARIMA model takes in three parameters that we can tune:

  • p : the number of lags in the autoregressive (AR) term
  • d : the order of integration (I), i.e. the number of differences taken
  • q : the order of the moving average (MA) term

In addition to these values, the SARIMA model takes in three more parameters, P, D and Q, which are the seasonal counterparts of the three parameters above. To model our time series automatically, we can use the auto.arima() function. By default, it takes seasonality into account, producing a SARIMA model.

model.sarima.1 <- auto.arima(data.train.monthly)
summary(model.sarima.1)
## Series: data.train.monthly 
## ARIMA(1,1,0)(0,1,0)[12] 
## 
## Coefficients:
##           ar1
##       -0.3010
## s.e.   0.1653
## 
## sigma^2 estimated as 13.78:  log likelihood=-95.11
## AIC=194.22   AICc=194.6   BIC=197.33
## 
## Training set error measures:
##                      ME     RMSE      MAE       MPE     MAPE      MASE
## Training set 0.08918743 3.124395 2.006166 0.2484202 5.737377 0.3728083
##                    ACF1
## Training set 0.05327045

The trained model has the parameters:

  • p = 1
  • d = 1
  • q = 0
  • P = 0
  • D = 1
  • Q = 0

Tuning the SARIMA model

How can we tune these models? Starting from the parameters given by auto.arima(), we can try different combinations, for example p = 1, d = 1, q = 1, P = 1, D = 1, Q = 1:

model.sarima.2 <- Arima(data.train.monthly, order = c(1, 1, 1), seasonal = c(1,1,1))
summary(model.sarima.2)
## Series: data.train.monthly 
## ARIMA(1,1,1)(1,1,1)[12] 
## 
## Coefficients:
##           ar1     ma1    sar1     sma1
##       -0.9453  0.7785  0.7856  -0.9999
## s.e.   0.1053  0.1603  0.3111   0.4324
## 
## sigma^2 estimated as 12.66:  log likelihood=-93.91
## AIC=197.81   AICc=199.88   BIC=205.59
## 
## Training set error measures:
##                     ME     RMSE      MAE       MPE    MAPE      MASE
## Training set 0.1026505 2.859765 1.714083 0.2571001 4.89391 0.3185301
##                     ACF1
## Training set -0.02447706

We got a slight improvement, with the training MAPE dropping from 5.74% to 4.89%. Let's try again, this time also tweaking the seasonal parameters, with the combination p = 2, d = 1, q = 1, P = 1, D = 2, Q = 3:

model.sarima.3 <- Arima(data.train.monthly, order = c(2, 1, 1), seasonal = c(1,2,3))
summary(model.sarima.3)
## Series: data.train.monthly 
## ARIMA(2,1,1)(1,2,3)[12] 
## 
## Coefficients:
##           ar1      ar2      ma1     sar1     sma1     sma2    sma3
##       -0.1859  -0.0132  -0.1014  -0.2872  -0.1033  -0.6184  0.1916
## s.e.   2.6249   0.5850   2.5427   0.3587      NaN   0.0703     NaN
## 
## sigma^2 estimated as 23.48:  log likelihood=-72.84
## AIC=161.68   AICc=171.97   BIC=170.77
## 
## Training set error measures:
##                      ME     RMSE      MAE       MPE     MAPE      MASE
## Training set -0.4303191 2.797468 1.580446 -1.002008 4.118434 0.2936961
##                     ACF1
## Training set -0.03866684

Out of the three models, the third gives the best fit on the training data, with a MAPE of 4.12%.
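
We can also compare the candidates on their in-sample AIC via the aic component of the fitted objects (a quick sketch; note that AIC values are only strictly comparable between models with the same orders of differencing):

# In-sample AIC of the three SARIMA candidates
c(model.sarima.1$aic, model.sarima.2$aic, model.sarima.3$aic)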

# Forecasting the test period (Jan - Feb 2020) with the third SARIMA model
model.sarima.3.pred <- forecast(model.sarima.3, h = 2)

# Visualizing the model and prediction
data.train.monthly %>% autoplot(series = 'original') +
    autolayer(model.sarima.3.pred$fitted, series = 'model') + 
    autolayer(data.test.monthly, series = 'test') + 
    autolayer(model.sarima.3.pred$mean, series = 'predicted')

Prediction

Now that we have built models from the two classes, let's try predicting drug-related crimes from January 2020 until July 2020, assuming COVID-19 did not happen.

For the first class we will use the monthly Holt-Winters model, model.hw.monthly, and for the second class, model.sarima.3.

# Predicting with the Holt-Winters model
model.hw.monthly.forecast <- forecast(model.hw.monthly, h = 7)

# Predicting with SARIMA
model.sarima.3.forecast <- forecast(model.sarima.3, h = 7)
# Creating 2020 time series
data.2020 <- data.clean %>% 
    filter(date >= '2020-01-01') %>% 
    group_by(
        month = month(date),
        date = date(date)
    ) %>% 
    summarize(n_crime = n()) %>% # Daily crime counts within each month
    summarize(n_crime = mean(n_crime)) # Average daily count per month

data.2020.ts <- ts(data.2020$n_crime, start = c(2020,1), frequency = 12)

# Comparing the figures
data.2020.ts %>% autoplot(series = 'with COVID-19') +
    autolayer(model.sarima.3.forecast$mean, series = 'SARIMA') +
    autolayer(model.hw.monthly.forecast$mean, series = 'Holt-Winters') +
    labs(title = 'Drug crime prediction in 2020', x = 'Month', y = 'no of drug-related crimes in a month') +
    scale_x_continuous(trans = yearmon_trans(format = "%b", n = 9))

With these predictions in hand, we can see how much COVID-19 has reduced drug-related crimes:

# Calculating difference between predicted and COVID-level drug crimes
months <- as.character(month(c(1:7), label = T, abbr = T))

crime.diffs <- data.frame(
    cbind(months,
          round(data.2020$n_crime, 0),
          round(model.sarima.3.forecast$mean, 0),
          round((model.sarima.3.forecast$mean - data.2020$n_crime)/model.sarima.3.forecast$mean * 100, 2),
          round(model.hw.monthly.forecast$mean, 0),
          round((model.hw.monthly.forecast$mean - data.2020$n_crime)/model.hw.monthly.forecast$mean * 100, 2)
    )
)

colnames(crime.diffs) <- c('Month', 'Data','SARIMA 3', 'diff SARIMA (%)', 'Holt-Winters', 'diff Holt-Winters (%)')

crime.diffs %>% 
    kable(caption = 'Comparing data and predicted drug related crimes in 2020') %>% 
    kable_styling()
Comparing data and predicted drug related crimes in 2020

Month   Data   SARIMA 3   diff SARIMA (%)   Holt-Winters   diff Holt-Winters (%)
Jan       39         43              8.42             40                    2.33
Feb       37         43             14.98             42                   11.72
Mar       22         39              43.5             44                   49.46
Apr        6         36             84.77             42                   86.72
May       13         32              59.8             41                   68.94
Jun       10         33             68.52             42                   75.27
Jul       13         35             62.25             42                   69.06

From the table above, we can see that COVID-19 greatly reduced drug-related crimes, by roughly 45 - 90% from March 2020 onwards.

Conclusion

The ARIMA family of models is more complex and can capture seasonality better than the family of exponential smoothing models such as Holt-Winters. This can be seen from the lower training errors of the tuned SARIMA models.

After our adventure, we can answer the questions that we posed at the beginning:

  1. Are there any weekly, monthly and seasonal patterns to drug crimes?

As we have seen in the yearly seasonality, monthly trend and daily trend sections, there are certain regular patterns that repeat throughout each interval.

  2. What is the trend of drug-related crimes in Chicago throughout the years?

Drug-related crimes decreased in 2016, only to increase throughout 2017 until mid-2019, before levelling off and then decreasing. We can see this in the year-to-year trend.

  3. How much does COVID-19 impact drug-related crimes?

From the comparison table above, we can see that COVID-19 heavily impacted drug-related crimes. The effect is most pronounced around April 2020, when lockdowns were in place. Without COVID-19, the models predict a relatively flat level of crime, which might otherwise have followed the downward trend that started in 2019. Nevertheless, time-series forecasting is very sensitive to change, and as the pandemic shows, the predictions only hold when all other factors stay the same (ceteris paribus).