The dataset we’re using reflects reported incidents of crime that occured in the City of Chicago from 2001 to present (2020). Data is extracted from Chicago Police Department’s CLEAR (Citizen Law Enforcement Analysis and Reporting) system.
What we’re trying to see as an exercise in Time Series, is to focus on one crime only, THEFT. We’ll try to see if there’s any seasonality or trend related toward the amount of theft going on, and maybe give a helpful tip on when theft is more likely to occur.
We want to see how many theft happens in the duration of a day.
crime_select_1 <- crime %>%
select(-id, -case_number, -block, -iucr, -location_description, -description, -fbi_code,
-beat, -district, -ward, -community_area, -x_coordinate, -y_coordinate, -year, -longitude, -latitude, -location) %>%
mutate(date = mdy_hms(date)) %>%
filter(primary_type == "THEFT") %>%
filter(date <= as.Date("2020-02-28")) %>% #removing March 2020 because it's still ongoing
select(date) %>%
arrange(date) %>%
group_by(year = year(date)) %>%
count(month(date, label = T)) %>%
ungroup(year) %>%
select(-year)
# Rename our columns
colnames(crime_select_1) <- c("month", "total_theft")Create Time Series Object
Plotting our Time Series
Our Decomposed time series containing Trend, Seasonal, and Error plot
There’s a sharp spike in theft rate from 2015 to 2016, as well as 2019 to 2020. What’s causing it is abit unclear, even after looking online for news reference. This may impact our model perfromance greatly, but we’ll try our forecast anyway.
crime_select_1 %>%
mutate(seasonal = crime_dc$seasonal,
trend = crime_dc$trend,
random = crime_dc$random) %>%
distinct(month,seasonal) %>%
ggplot(aes(month, seasonal)) + geom_col()Looking at our graph, this means that the rate of theft increases from May to August, and the rate of theft is lowest on February.
We need to split our data into train and test in order check our model performance. In this case, I’m using the last 8 month as our test period.
We will try both HoltWinters as well as ARIMA and compare our MAPE (Mean Absolute Percentage Error) to measure our model performance, where the closer it is to 0, the better.
Our model forecast.
crime_1_model_forecast <- forecast(crime_1_model, h = length(crime_1_test))
autoplot(crime_1_model_forecast)Our MAPE in HoltWinters is 52.37%.
## [1] 51.86823
Our HW model forecast compared to our actual data.
We’ll be using auto.arima for our model
ARIMA’s forecast plot
crime_1_model_arima_forecast <- forecast(crime_1_model_arima, h = length(crime_1_test))
autoplot(crime_1_model_arima_forecast)Our MAPE in ARIMA model is 39.8%. A 12.57% improvement from our HoltWinters model, but still not a good number.
## [1] 39.20893
Our ARIMA model forecast compared to our actual data.
There’s a large spike starting from 2019, similar to the behaviour observed in 2015. Unfortunately, such sudden spike is causing our model to fail to make an accurate prediction.
Therefore, the insight that can probably be taken from our theft time series model comes from our seasonality plot, where we know that the rate of theft is high from May to August.
Inspecting the period between March 2015 to August 2016, we found that there is multiple seasonality, Weekly and Monthly.
Below is our modiefied Data Wrangling to accomodate our need.
crime_select_2 <- crime %>%
select(-id, -case_number, -block, -iucr, -location_description, -description, -fbi_code,
-beat, -district, -ward, -community_area, -x_coordinate, -y_coordinate, -year, -longitude, -latitude, -location) %>%
mutate(date = mdy_hms(date)) %>%
filter(primary_type == "THEFT") %>%
select(date) %>%
arrange(date) %>%
filter(date >= as.Date("2015-03-01") & date <= as.Date("2016-08-31")) %>%
group_by(year = year(date), monthly = month(date)) %>%
count(daily = day(date)) %>%
ungroup()## Warning: 1 failed to parse.
Below is our multiple seasonality time series object
crime_weekly_ts <- msts(crime_select_2$n, start = c(3,1), seasonal.periods = c(7, 30))
autoplot(crime_weekly_ts)Below is our decomposed time series object, showing the trend, both seasonality, and random/error value.
As we can see from the plot below, crime rate is the highest on the 15th to 23rd each month.
as.data.frame(crime_weekly_dc) %>%
mutate(day = crime_select_2$daily) %>%
group_by(day) %>%
summarise(seasonal = sum(Seasonal7 + Seasonal30)) %>%
ggplot(aes(day, seasonal)) +geom_col()The months where theft incident is high are April, July, October, and December.
as.data.frame(crime_weekly_dc) %>%
mutate(day = crime_select_2$daily, month = month(crime_select_2$monthly, label = T)) %>%
group_by(month) %>%
summarise(seasonal = sum(Seasonal7 + Seasonal30)) %>%
ggplot(aes(month, seasonal)) +geom_col()Based on the two seasonality plot above, we can recommend that Chicago Police and Chicago citizens to be more vigilant for theft around April, July, October, and December, as well as 15th to 23rd of each month, because according to the data we observed, that’s when theft is most likely to happen.
We’re taking the last 20 weeks as our test, and the rest as our training data.
Below is our forecast for HoltWinters model.
model_hw_crime_msts <- HoltWinters(crime_w_train)
model_hw_msts_forecast <- forecast(model_hw_crime_msts, h=n_test)
autoplot(model_hw_msts_forecast)Comparing our forecast with our actual data
The resulting MAPE of our model is 11%.
## [1] 11.0139
After examining both overall data and the data on 2015-2016, there’s a difference in our seasonal data pattern, as we can see below.
Our monthly seasonal data can be used to determine when the police and citizen should be more aware of theft.
crime_select_1 %>%
mutate(seasonal = crime_dc$seasonal,
trend = crime_dc$trend,
random = crime_dc$random) %>%
distinct(month,seasonal) %>%
ggplot(aes(month, seasonal)) + geom_col()as.data.frame(crime_weekly_dc) %>%
mutate(day = crime_select_2$daily, month = month(crime_select_2$monthly, label = T)) %>%
group_by(month) %>%
summarise(seasonal = sum(Seasonal7 + Seasonal30)) %>%
ggplot(aes(month, seasonal)) +geom_col()Both of our forecast failed to predict the sudden jump/fall of our theft rate, although the model in between 2015 to 2016 has smaller MAPE value because the sudden jump/fall period is smaller.
In conclusion, while we can extract good seasonal data, the forecast ability is not as reliable, because of the existence of said sudden jump/fall in numbers that we don’t seem to be able to predict.