Introduction

This dataset contains reported crime incidents in the City of Chicago (with the exception of murders, for which data is recorded for each victim); our analysis focuses on the years 2010 to 2015. The data is extracted from the CLEAR (Citizen Law Enforcement Analysis and Reporting) system operated by the Chicago Police Department. In line with the commitment to safeguard the privacy of crime victims, addresses of incidents are displayed only at the block level, with specific locations undisclosed, to protect the personal information of those involved. The dataset can be accessed here: Link to the Chicago Crimes Dataset

Time series analysis of this dataset will be conducted using two models: the Holt-Winters model and the STLM model (Seasonal and Trend decomposition using Loess combined with a forecasting model, here ARIMA). The stlm approach decomposes the series into trend, seasonal, and remainder components and fits an ARIMA model to the seasonally adjusted series. Both models will be evaluated in the same way: comparing the model predictions against actual data and computing error metrics such as Mean Absolute Percentage Error (MAPE), Mean Absolute Error (MAE), and Root Mean Squared Error (RMSE).
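
For reference, with \(y_t\) the actual value, \(\hat{y}_t\) the forecast, and \(n\) the number of evaluated points, these standard error metrics are defined as:

\[
\text{MAE} = \frac{1}{n}\sum_{t=1}^{n} \lvert y_t - \hat{y}_t \rvert, \qquad
\text{RMSE} = \sqrt{\frac{1}{n}\sum_{t=1}^{n} (y_t - \hat{y}_t)^2}, \qquad
\text{MAPE} = \frac{100\%}{n}\sum_{t=1}^{n} \left\lvert \frac{y_t - \hat{y}_t}{y_t} \right\rvert
\]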

Import Library

These libraries collectively equip us with a robust toolkit to analyze and derive insights from the time series dataset, enhancing our ability to make informed decisions and interpretations. Here are the libraries used:

# load libraries
library(dplyr)     # data manipulation
library(lubridate) # date manipulation
library(forecast)  # time series modeling and forecasting
library(TTR)       # simple moving average functions
library(MLmetrics) # error metrics
library(tseries)   # adf.test
library(fpp)       # example datasets (usconsumption)
library(padr)      # pad missing dates in a time series
library(ggplot2)   # data visualization
library(maptools)  # reading and handling spatial data

dplyr: This library is widely used for data manipulation tasks. It provides functions that facilitate filtering, summarizing, mutating, and arranging data.

lubridate: Lubridate simplifies working with date and time data. It offers intuitive functions to parse, manipulate, and format date-time objects.

forecast: Forecast is a dedicated time series library. It equips you with tools for time series forecasting, including various forecasting methods and evaluation techniques.

TTR (Technical Trading Rules): TTR offers functions related to technical analysis and trading strategies. It’s commonly used for tasks like calculating moving averages and other technical indicators.

MLmetrics: This library provides functions to calculate various error metrics for assessing the performance of machine learning models, such as Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), etc.

tseries: Tseries contains tools for time series analysis, including functions like the Augmented Dickey-Fuller test (adf.test) for testing stationarity.

fpp (Forecasting: Principles and Practice): This library accompanies the book “Forecasting: Principles and Practice” by Rob J. Hyndman and George Athanasopoulos. It includes datasets and functions to support the concepts discussed in the book.

padr: Padr is used for handling missing data by performing padding processes, which can be crucial in maintaining data continuity for time series analysis.

ggplot2: This popular data visualization library enables the creation of a wide variety of graphs and charts using a layered approach.

maptools: Maptools provides functions for reading and manipulating geographic data, which is particularly useful for geospatial analysis and mapping.

By utilizing these libraries, you can streamline various tasks such as data preparation, time series modeling, forecasting, error evaluation, data visualization, and geospatial analysis, ultimately enhancing the depth and breadth of your analysis.

Exploratory Data Analysis

Exploratory Data Analysis (EDA) plays a pivotal role in gaining a comprehensive understanding of the Chicago crime dataset. Through EDA, we delve into the dataset’s intricacies, patterns, and anomalies, shedding light on key insights that lay the foundation for informed analysis. By employing techniques such as data visualization and summary statistics, we unravel the temporal distribution of reported crimes, uncover potential trends across different crime types, and identify any notable patterns across geographic locations. Moreover, EDA allows us to detect missing or inconsistent data, providing valuable insights into data quality and potential preprocessing requirements. This exploratory phase equips us with the necessary context and knowledge to subsequently apply sophisticated analytical methods, contributing to a more holistic and data-driven examination of crime dynamics in the city of Chicago.

Reading Dataset

Reading a dataset is the crucial first step in any data analysis. In the context of crime analysis in Chicago, this step initiates the process of understanding and extracting information from a relevant data source. By applying functions such as read.csv() or other dataset reading tools, we gain access to information hidden within columns and rows of data.

crime <- read.csv("data/Crimes_-_2001_to_Present.csv")

head(crime)
glimpse(crime)
#> Rows: 7,863,375
#> Columns: 22
#> $ ID                   <int> 11646166, 11645836, 11243268, 1896258, 11645527, …
#> $ Case.Number          <chr> "JC213529", "JC212333", "JB167760", "G749215", "J…
#> $ Date                 <chr> "09/01/2018 12:01:00 AM", "05/01/2016 12:25:00 AM…
#> $ Block                <chr> "082XX S INGLESIDE AVE", "055XX S ROCKWELL ST", "…
#> $ IUCR                 <chr> "0810", "1153", "1562", "0460", "1153", "1153", "…
#> $ Primary.Type         <chr> "THEFT", "DECEPTIVE PRACTICE", "SEX OFFENSE", "BA…
#> $ Description          <chr> "OVER $500", "FINANCIAL IDENTITY THEFT OVER $ 300…
#> $ Location.Description <chr> "RESIDENCE", "", "APARTMENT", "STREET", "OTHER", …
#> $ Arrest               <chr> "false", "false", "false", "false", "false", "fal…
#> $ Domestic             <chr> "true", "false", "false", "false", "false", "fals…
#> $ Beat                 <int> 631, 824, 1913, 1824, 811, 412, 222, 233, 2515, 3…
#> $ District             <int> 6, 8, 19, 18, 8, 4, 2, 2, 25, 3, 17, 6, 19, 2, 11…
#> $ Ward                 <int> 8, 15, 47, NA, 23, 8, 4, 5, 30, 5, 33, 6, 46, 3, …
#> $ Community.Area       <int> 44, 63, 3, NA, 56, 45, 39, 41, 19, 43, 14, 44, 3,…
#> $ FBI.Code             <chr> "06", "11", "17", "08B", "11", "11", "14", "18", …
#> $ X.Coordinate         <int> NA, NA, NA, NA, NA, NA, 1184667, NA, NA, NA, NA, …
#> $ Y.Coordinate         <int> NA, NA, NA, NA, NA, NA, 1875669, NA, NA, NA, NA, …
#> $ Year                 <int> 2018, 2016, 2017, 2001, 2015, 2001, 2015, 2018, 2…
#> $ Updated.On           <chr> "04/06/2019 04:04:43 PM", "04/06/2019 04:04:43 PM…
#> $ Latitude             <dbl> NA, NA, NA, NA, NA, NA, 41.81400, NA, NA, NA, NA,…
#> $ Longitude            <dbl> NA, NA, NA, NA, NA, NA, -87.59814, NA, NA, NA, NA…
#> $ Location             <chr> "", "", "", "", "", "", "(41.81399924, -87.598137…

This dataset consists of 7,863,375 rows and 22 columns, each containing valuable information about crime in Chicago. Here’s a breakdown of each column in the crime data frame:

  1. ID: Unique identifier for each crime record.

  2. Case.Number: The case number associated with the crime.

  3. Date: The date and time when the crime occurred.

  4. Block: The block where the crime occurred.

  5. IUCR: The Illinois Uniform Crime Reporting (IUCR) code for the crime.

  6. Primary.Type: The primary type of the crime.

  7. Description: The nature of the crime.

  8. Location.Description: The location type where the crime occurred.

  9. Arrest: Indicating whether an arrest was made (true or false).

  10. Domestic: Indicating whether the crime was domestic in nature (true or false).

  11. Beat: The police beat where the crime occurred.

  12. District: The police district where the crime occurred.

  13. Ward: The ward of the city where the crime occurred.

  14. Community.Area: The community area where the crime occurred.

  15. FBI.Code: The FBI crime code for the crime.

  16. X.Coordinate: The X-coordinate of the location where the crime occurred.

  17. Y.Coordinate: The Y-coordinate of the location where the crime occurred.

  18. Year: The year when the crime occurred.

  19. Updated.On: The date and time when the crime record was last updated.

  20. Latitude: The latitude coordinate of the location where the crime occurred.

  21. Longitude: The longitude coordinate of the location where the crime occurred.

  22. Location: The precise location (latitude and longitude) of the crime.

Top 10 Crimes in Chicago

We will focus on one crime category from the top 10 crimes in Chicago. By selecting this particular category, we will be able to analyze it more extensively, identify specific patterns, and gain deeper insights into the characteristics of the crime. Here are the ten most frequent crime types in Chicago that frame our analysis:

# Count the number of crimes for each crime type
crime_counts <- crime %>%
  group_by(Primary.Type) %>%
  summarize(Count = n()) %>%
  arrange(desc(Count)) %>%
  top_n(10, Count)

# Create a bar chart of the top ten crime types
ggplot(crime_counts, aes(y = reorder(Primary.Type, Count), x = Count, fill = Count)) +
  geom_bar(stat = "identity") +
  scale_fill_gradient(low = "black", high = "red") +
  theme_minimal() +
  labs(title = "Top Ten Crimes in Chicago (All Years)",
       x = "Number of Crimes",
       y = "Crime Type") +
  guides(fill = "none")

We will specifically focus on the “Criminal Damage” crime case, which is one of the top 10 crime categories in Chicago.

Data Wrangling

The data wrangling process is an essential stage in data preparation before further analysis. In the context of crime analysis in Chicago, this step involves filtering, selecting, and reorganizing data related to criminal damage cases. First, the data will be filtered to include only criminal damage crime reports. Subsequently, irrelevant columns or those with incomplete values will be excluded, ensuring a cleaner and more focused dataset. Furthermore, the process will include sorting dates and grouping based on specific attributes to prepare the data for time series analysis.

# Filter the crime data for the CRIMINAL DAMAGE type
criminal_crime <- crime %>%
  filter(Primary.Type == "CRIMINAL DAMAGE")
criminal_crime

This process leads to the creation of a highly relevant new data subset for crime analysis in Chicago. By filtering crime data based on the “Criminal Damage” crime type, we are generating a new dataset that specifically focuses on criminal damage cases. This action enables us to delve deeper into patterns, trends, and unique characteristics that may be associated with this type of crime.

Convert Data Type

In an effort to ensure the smooth progression of further analysis, the step of data type conversion is crucial. In the context of crime analysis in Chicago, it is important to ensure that relevant attributes, such as the date of incidents, have been transformed into appropriate data types.

crime_date<- criminal_crime %>%
  mutate(
    Date = as.Date(Date, format = "%m/%d/%Y %I:%M:%S %p"),
    Arrest = as.logical(Arrest),
    Domestic = as.logical(Domestic),
    Latitude = as.numeric(Latitude),
    Longitude = as.numeric(Longitude),
    Case.Number = as.factor(Case.Number),
    Block = as.factor(Block),
    IUCR = as.factor(IUCR),
    Primary.Type = as.factor(Primary.Type),
    Description = as.factor(Description),
    Location.Description = as.factor(Location.Description),
    FBI.Code = as.factor(FBI.Code),
    Updated.On = as.factor(Updated.On)
  )

# Show the first few rows after data wrangling
head(crime_date)

Sub-Setting Dataset for Time Series

In this process, we focus the analysis on crime incidents over a specific time range, from January 1, 2010, to December 30, 2015, aggregating the data by counting the number of crimes on each individual date.

# Count the number of crimes for each date
crime_date <- crime_date %>%
  group_by(Date) %>%
  summarize(Count = n()) %>%
  arrange(Date) %>%
  filter(Date >= as.Date("2010-01-01") & Date <= as.Date("2015-12-30"))
head(crime_date)

This process aims to create a highly relevant new data subset for crime analysis in Chicago. By filtering crime data based on the “Criminal Damage” crime type and the time range from 2010 to 2015, we are generating a new dataset specifically focused on criminal damage cases. This allows us to delve deeper into patterns, trends, and unique characteristics that may be associated with this type of crime within the specified time frame.

summary(crime_date)
#>       Date                Count       
#>  Min.   :2010-01-01   Min.   : 30.00  
#>  1st Qu.:2011-07-02   1st Qu.: 77.00  
#>  Median :2012-12-30   Median : 90.00  
#>  Mean   :2012-12-30   Mean   : 91.83  
#>  3rd Qu.:2014-06-30   3rd Qu.:106.00  
#>  Max.   :2015-12-30   Max.   :184.00

Time Range: The criminal damage data is analyzed over the full 2010 to 2015 range. This covers a long enough period to observe long-term trends in criminal damage crime.

Number of Incidents: The daily count of criminal damage incidents varies between 30 and 184. An average of around 91.83 incidents per day indicates fluctuations in the frequency of crime throughout this period.

Padding Dataset

One of the prerequisites for effective time series analysis is having sequential and complete data. To fulfill this requirement, we pad the dataset so that every date in the analyzed range is present, ensuring the integrity of our time series data.

crime_date <- crime_date %>%
  pad()

# Verify that every date in the range is present
full_dates <- seq.Date(from = min(crime_date$Date), to = max(crime_date$Date), by = "day")
all(full_dates == crime_date$Date)
#> [1] TRUE

The TRUE output confirms that there are no missing dates and that each date within the designated timeframe is present in the dataset.

Checking Missing Value

Next, we will check for the presence of missing values in the crime_date dataset. Identifying any missing values early ensures that the time series object can be built without gaps.

# check missing values
anyNA(crime_date)
#> [1] FALSE

The output “FALSE” indicates that there are no missing values in the “crime_date” dataset.

Creating Time Series Object

The time series object depicts a sequence of consecutive observations within a specific time interval, enabling us to identify patterns, trends, and fluctuations in the data over time.

In this case, we will set the frequency to 365 to reflect the number of daily observations in a year. With this frequency, the daily data are treated as following a yearly seasonal cycle, enabling us to identify long-term trends and recurring annual patterns in criminal damage incidents in Chicago during the specified time range. Setting the appropriate frequency is a crucial step to ensure that our time series analysis aligns with the desired observation scale, and it provides a broader perspective on the dynamics of crime over the past several years.

criminal_ts <- ts(crime_date$Count,
               start = 2010,
               frequency = 365)

Next, let’s visualize the dataset. The following code will generate a visualization of the “criminal_ts” time series object:

autoplot(criminal_ts) +
  labs(title = "Criminal Damage Time Series",
       x = "Date",
       y = "Number of Crimes")

Next, we will proceed to perform decomposition to gain a deeper insight into the trend, seasonal patterns, and residuals of this dataset.

Decomposition

Decomposition is a method that allows us to separate a time series into main components, including trend, seasonal, and residual components. In this example, we will use the additive decomposition method.

# Perform decomposition assuming an additive time series
decomposed_criminal <- decompose(criminal_ts, type = "additive")

# Plot the decomposed components
autoplot(decomposed_criminal) +
  labs(title = "Decomposition of Criminal Damage Time Series",
       x = "Date")

This dataset shows a declining trend from 2010 to 2015, visible in the trend component alongside the fluctuations in the number of criminal damage incidents. Additionally, we can identify a seasonal pattern whose amplitude remains roughly constant over time, which is consistent with an additive model. This pattern indicates periodic variations in the number of incidents that tend to repeat on a yearly scale.

adf.test(criminal_ts)
#> 
#>  Augmented Dickey-Fuller Test
#> 
#> data:  criminal_ts
#> Dickey-Fuller = -5.6557, Lag order = 12, p-value = 0.01
#> alternative hypothesis: stationary

The p-value is a crucial value in hypothesis testing. In this case, the null hypothesis is that the data is non-stationary (has a unit root), and the alternative hypothesis is that the data is stationary. A p-value of 0.01 indicates that the data is likely stationary since the p-value is smaller than a commonly used significance level of 0.05.

In conclusion, based on the Augmented Dickey-Fuller Test results, there is evidence to suggest that the time series data “criminal_ts” is stationary, which is a favorable characteristic for time series analysis.

Cross Validation

In this step, we will perform cross-validation on the previously created time series object. The purpose of cross-validation is to divide the data into alternating training and testing subsets. This way, we can test the model’s performance on unseen data to obtain a more objective estimate of how well the model can make predictions in the real world.

# Separate data into train (all but the last 365 days) and test (the last 365 days) sets
test_criminal <- tail(criminal_ts, 365)
train_criminal <- head(criminal_ts, -365)

Here we split the data into distinct training and testing sets. The test_criminal subset consists of the last 365 data points, while train_criminal contains everything before them.

length(train_criminal)
#> [1] 1825

The training set contains 1825 data points. This training dataset will be used to build and train our forecasting models.

length(test_criminal)
#> [1] 365

The test_criminal set consists of 365 data points, corresponding to the final year of the series. Holding out one full year is appropriate because our dataset spans 2010 to 2015 and follows a yearly seasonal cycle.

autoplot(train_criminal) +
  autolayer(test_criminal)

Modeling

Modeling in time series analysis plays a central role in our efforts to understand patterns, trends, and behavior of data over time. In the context of crime analysis in Chicago, we will apply two modeling methods to forecast the future count of crimes. The Holt-Winters model forecasts time series data with seasonality using three smoothed components: level (α), trend (β), and seasonal (γ). We will first build a Holt-Winters model on the training data and then use it to make predictions on the test data. STL-M ARIMA combines Seasonal and Trend decomposition using Loess (STL) with an ARIMA model: the series is decomposed into trend, seasonal, and remainder components, an ARIMA model is fitted to the seasonally adjusted series, and the seasonal component is added back when forecasting. These modeling methods will guide us in predicting future crime counts and provide insights into the behavior of criminal damage crime over time in Chicago.

Holt-Winters

The Holt-Winters model is a forecasting method that is useful for handling time series data with seasonal components. In the context of criminal crime analysis in Chicago, I have applied the Holt-Winters model to the training data.
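
For reference, the additive Holt-Winters recursions take the standard textbook form (as in Hyndman and Athanasopoulos), with level \(\ell_t\), trend \(b_t\), seasonal term \(s_t\), seasonal period \(m\), and smoothing parameters \(\alpha\), \(\beta\), and \(\gamma\), which HoltWinters() estimates by minimizing the in-sample squared prediction errors:

\[
\begin{aligned}
\ell_t &= \alpha (y_t - s_{t-m}) + (1 - \alpha)(\ell_{t-1} + b_{t-1}) \\
b_t &= \beta (\ell_t - \ell_{t-1}) + (1 - \beta)\, b_{t-1} \\
s_t &= \gamma (y_t - \ell_{t-1} - b_{t-1}) + (1 - \gamma)\, s_{t-m} \\
\hat{y}_{t+h \mid t} &= \ell_t + h\, b_t + s_{t+h-m(k+1)}, \qquad k = \lfloor (h-1)/m \rfloor
\end{aligned}
\]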

model_hw <- HoltWinters(
  x = train_criminal,
  seasonal = "additive"
)
summary(model_hw)
#>              Length Class  Mode     
#> fitted       5840   mts    numeric  
#> x            1825   ts     numeric  
#> alpha           1   -none- numeric  
#> beta            1   -none- numeric  
#> gamma           1   -none- numeric  
#> coefficients  367   -none- numeric  
#> seasonal        1   -none- character
#> SSE             1   -none- numeric  
#> call            3   -none- call

The code creates a Holt-Winters model using the training data train_criminal and configures it to consider additive seasonality. The resulting model is stored in the variable model_hw.
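
To see which smoothing parameters were estimated, the components of the returned HoltWinters object can be printed directly; this is a small illustrative check, and the exact values will depend on the data:

# Inspect the estimated smoothing parameters and the in-sample fit
model_hw$alpha  # level smoothing parameter
model_hw$beta   # trend smoothing parameter
model_hw$gamma  # seasonal smoothing parameter
model_hw$SSE    # sum of squared one-step-ahead in-sample errors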

STLM

STLM (Seasonal and Trend decomposition using Loess) is an effective method in time series analysis for separating the main components of the data, namely trend and seasonality. This approach combines the Loess regression approach with seasonal decomposition, allowing us to identify and observe short-term and long-term patterns in the time series data.

seasonal_window <- 365

model_stlm <- stlm(train_criminal, 
                   method = "arima",
                   s.window = seasonal_window,
                   robust = TRUE)

summary(model_stlm)
#>               Length Class          Mode     
#> stl           7300   mstl           numeric  
#> model           18   forecast_ARIMA list     
#> modelfunction    1   -none-         function 
#> lambda           0   -none-         NULL     
#> x             1825   ts             numeric  
#> series           1   -none-         character
#> m                1   -none-         numeric  
#> fitted        1825   ts             numeric  
#> residuals     1825   ts             numeric

This code creates the STLM model using the train_criminal time series data. The s.window parameter is set to 365 and the method parameter to ‘arima’, indicating that an ARIMA (AutoRegressive Integrated Moving Average) model will be fitted to the seasonally adjusted series and used for forecasting.
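
The summary above shows that the fitted ARIMA component is stored in the model element, so its order and coefficients can be inspected directly; a quick illustrative check:

# The ARIMA model fitted to the seasonally adjusted series
model_stlm$model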

Model Performance

In assessing the effectiveness of our forecasting models, we turn to model performance metrics. Here is the code to view the performance metrics of the models:

accuracy(model_hw$fitted[,1], criminal_ts)
#>                 ME     RMSE     MAE       MPE     MAPE      ACF1 Theil's U
#> Test set 0.1978621 16.73723 12.3296 -1.856645 14.63264 0.2020875 0.9798595
accuracy(model_stlm$fitted, criminal_ts)
#>                   ME    RMSE      MAE       MPE     MAPE         ACF1 Theil's U
#> Test set -0.07385308 13.2293 9.995969 -1.882639 11.15342 -0.007889541 0.7385049

The Holt-Winters model has a MAPE value of 14.63, while the STL-M ARIMA model has a MAPE value of 11.15 on the fitted (in-sample) values. The lower MAPE of the STL-M ARIMA model indicates that its fitted values track the observed series more closely, with a smaller average percentage error, than those of the Holt-Winters model. Here are the Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE) values for both models:

For the Holt-Winters model (model_hw):

  • RMSE (Root Mean Squared Error): 16.73723
  • MAE (Mean Absolute Error): 12.3296

For the STL-M ARIMA model (model_stlm):

  • RMSE (Root Mean Squared Error): 13.2293
  • MAE (Mean Absolute Error): 9.995969

In summary, both the RMSE and MAE metrics suggest that the STL-M ARIMA model outperforms the Holt-Winters model in forecasting the “Criminal Damage” time series data.
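
As a sanity check on the in-sample numbers above, the same metrics can be recomputed by hand with the MLmetrics package loaded earlier. This is an illustrative sketch comparing the STLM fitted values with the training series, so the results should closely match the accuracy() output:

# Recompute the in-sample errors for the STLM model with MLmetrics
fitted_stlm  <- as.numeric(model_stlm$fitted)
actual_train <- as.numeric(train_criminal)

MAPE(y_pred = fitted_stlm, y_true = actual_train) * 100  # MAPE in percent
MAE(y_pred = fitted_stlm, y_true = actual_train)
RMSE(y_pred = fitted_stlm, y_true = actual_train)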

Forecasting

The two forecasting models that have been created, namely the Holt-Winters Model and the STL-M ARIMA Model, will assist us in predicting the number of criminal damage crimes within the specified time range. We will utilize the forecasting outcomes from these two models to gain insights into how the number of criminal crimes is projected to change in the future. Furthermore, we will compare the performance of both models to determine which one provides more accurate and reliable predictions.

# Perform forecasting for 365 days using the Holt-Winters model
hw_forecast <- forecast(model_hw, h = 365)

# Perform forecasting for 365 days using stlm model
stlm_forecast <- forecast(model_stlm, h = 365)

Here, I continue with the forecasting process using the two models that I have previously created. First, I conducted a forecast for the next 365 days using the Holt-Winters Model. Next, I also performed a forecast for the next 365 days using the STL-M ARIMA Model.

Model Evaluation

At this stage, I evaluate the performance of the two forecasting models. I use error metrics to measure how closely the model predictions align with the actual values in the test dataset.

accuracy(hw_forecast$mean, test_criminal)
#>                 ME     RMSE      MAE       MPE     MAPE      ACF1 Theil's U
#> Test set -8.683297 19.12184 15.78949 -14.52474 22.31743 0.4703442  1.384764
accuracy(stlm_forecast$mean, test_criminal)
#>                 ME     RMSE      MAE       MPE     MAPE      ACF1 Theil's U
#> Test set -4.048004 17.15061 14.04535 -8.996596 19.54944 0.4251647  1.237184
# Calculate MAPE for Holt-Winters model
mape_hw <- accuracy(hw_forecast, test_criminal )[2, "MAPE"]

# Calculate MAPE for ARIMA model
mape_arima <- accuracy(stlm_forecast, test_criminal)[2, "MAPE"]

# Print the MAPE values
print(paste("MAPE for Holt-Winters model:", round(mape_hw, 2)))
#> [1] "MAPE for Holt-Winters model: 22.32"
print(paste("MAPE for ARIMA model:", round(mape_arima, 2)))
#> [1] "MAPE for ARIMA model: 19.55"

  1. Holt-Winters Model: The Holt-Winters model has a MAPE of 22.32%, meaning that, on average, its predictions deviate from the actual values by approximately 22.32%. While this value provides an indication of prediction accuracy, it also implies a relatively large average percentage difference from the actual values.

  2. ARIMA Model: On the other hand, the ARIMA model exhibits a lower MAPE of 19.55%. This lower MAPE suggests that the ARIMA model’s predictions have a closer average percentage difference from the actual values compared to the Holt-Winters model. In other words, the ARIMA model is providing more accurate forecasts on average.

Next, let’s take a look at the visualization of the forecasting results from both models to provide a clearer understanding of their performance in predicting crime data.

# Plot the STL-M ARIMA forecast against the training and test data
train_criminal %>% 
  autoplot()+
  autolayer(object = test_criminal)+
  autolayer(stlm_forecast$mean)+
  labs(title = "STL-M ARIMA Forecasting",
       x = "Date",
       y = "Number of Crimes")

# Plot the Holt-Winters forecast against the training and test data
train_criminal %>% 
  autoplot()+
  autolayer(object = test_criminal)+
  autolayer(hw_forecast$mean)+
  labs(title = "Holt Winter Forecasting",
       x = "Date",
       y = "Number of Crimes")

In this visualization, the training data and test data are displayed alongside the forecasting results from the STL-M ARIMA and Holt-Winters models. Through this visualization, we can identify the differences between their predictions and understand which model is closer to the actual data in forecasting the number of crimes.

mape_difference <- mape_arima - mape_hw
mape_difference
#> [1] -2.767997

The difference of roughly 2.77 percentage points in MAPE indicates that the STL-M ARIMA model has better predictive performance on this metric, providing more accurate and reliable predictions for the “criminal damage” time series than the Holt-Winters model.

Assumption

In this step, we will test two assumptions related to the STL-M ARIMA model: the absence of autocorrelation in residuals and the normality of residuals.

No Autocorrelation in Residuals

Box.test(model_stlm$residuals, type = "Ljung-Box")
#> 
#>  Box-Ljung test
#> 
#> data:  model_stlm$residuals
#> X-squared = 0.11378, df = 1, p-value = 0.7359

Because the p-value for model_stlm (0.7359) is greater than the commonly used significance level of 0.05, there is not enough evidence to reject the null hypothesis. Based on this p-value, no significant autocorrelation is detected in the residuals of the STL-M ARIMA model.
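
As a complementary visual check (an addition beyond the original output), the autocorrelation function of the residuals can be plotted with the forecast package; bars staying within the confidence bounds are consistent with the Ljung-Box result:

# Visual check: autocorrelation of the STL-M ARIMA residuals
ggAcf(model_stlm$residuals) +
  labs(title = "ACF of STL-M ARIMA Residuals")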

Normality of Residuals

\(H_0\): Residuals are normally distributed.
\(H_1\): Residuals are not normally distributed.

shapiro.test(model_stlm$residuals)
#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  model_stlm$residuals
#> W = 0.97502, p-value < 0.00000000000000022
hist(model_stlm$residuals, breaks = 20)

The Shapiro-Wilk normality test conducted on the residuals of the STL-M ARIMA model resulted in a p-value that is extremely small (p-value < 0.00000000000000022). This suggests strong evidence against the null hypothesis of normality, indicating that the residuals may not be normally distributed.

plot(model_stlm$residuals)

Although the residuals of the model do not always perfectly follow a normal distribution, this is actually common in many cases. There are various real-world factors that can lead to fluctuations and uncertainties in data, which in turn affect the distribution of residuals. In time series analysis, our main goal is to understand patterns, trends, and fluctuations in the data as much as possible, and to develop models that can handle this complex variability.

In the context of time series analysis, we often encounter data with inherent characteristics that are difficult to predict and follow complex patterns. Residuals that do not adhere to a normal distribution do not always indicate errors or model failure. Instead, this reflects the natural characteristics of the data and the challenges faced in forecasting it. Therefore, in many cases, our focus is more on the model’s ability to capture the main trends and patterns in the data, rather than how closely the distribution of residuals resembles a normal distribution.

However, it is important to take appropriate steps to thoroughly examine assumptions and model performance. In this regard, assumption tests such as tests for normality and tests for non-autocorrelation of residuals are important to ensure that the model provides reliable results. Even though residuals may not follow a normal distribution, by conducting assumption tests and proper analysis, we can ensure that the model provides valuable and useful insights in understanding and forecasting time series data.

Conclusion

In the case of analyzing criminal damage in Chicago using a time series approach, we have undertaken a series of steps to understand, analyze, and forecast the patterns of criminal damage crimes. We began with data exploration and visualization to gain an initial overview of trends and fluctuations within the dataset. Subsequently, we applied the Holt-Winters and STL-M ARIMA forecasting models to predict future crime counts.

Through performance testing and a comparison between the two models, we found that the STL-M ARIMA model provides more accurate predictions in terms of average percentage errors and successfully addresses key assumptions such as the lack of autocorrelation in residuals. While the residuals of the STL-M ARIMA model do not always perfectly adhere to a normal distribution, we acknowledge the inherent complexity of time series data and prioritize the model’s ability to capture main trends and patterns.

In conclusion, time series analysis proves to be a valuable tool for understanding and forecasting phenomena such as criminal activities. Despite challenges posed by intricate variations and non-ideal residual distributions, this approach provides valuable insights for decision-making and the development of handling strategies. By incorporating relevant assumption tests and thorough analyses, forecasting models can serve as reliable tools to tackle the complexities of time series data, as demonstrated in the case of criminal damage in Chicago.