Abstract

In earlier assignments, we evaluated the impact of several independent variables on crime using regression modelling. This analysis focuses on determining the seasonality and trend patterns in the total crime count for Ku-Ring-Gai and on forecasting that count using several time series modelling methods.

Through time series analysis, I derived the necessary components of the ARIMA, Average, Naive, Drift, ETS, Hw_additive and Hw_multiplicative models and fitted them to the available data set. ARIMA was the most accurate model produced to forecast the crime count for Ku-Ring-Gai for the next 2 years.

From the predicted data points and standard errors, I determined the estimated crime count for Ku-Ring-Gai from February 2018 to January 2020.

Introduction

In Assessment 2, we analysed and evaluated the impact of several independent variables on crime. The methods of analysis included data visualization, time series analysis, measures of similarity and regression analysis. The analysis was conducted on all Local Government Areas (LGAs) in New South Wales and then on a smaller group of 42 LGAs with similar characteristics to Ku-Ring-Gai. In AT-2, the time series analysis covered crime count, number of residents, crime rate, building count and number of dwellings.

The goal was to give the councillors the best information for making decisions that benefit the community, and we focused on determining how factors such as population density, socioeconomic factors, employment and education were related to crime. For the individual exploration, I extend the analysis of the monthly crime count of Ku-Ring-Gai council and build a time series analysis to forecast the crime count for the next two years.

Ku-Ring-Gai was chosen because of the available data and a better understanding of that data. The aim of this analysis is to find accurate predicted values for the monthly crime count of Ku-Ring-Gai council using time series modelling. The forecasts cover the near future and will help the local council to move forward with its proposed high-density urban development within the council area.

Research Question

This paper explores the crime count of Ku-Ring-Gai and predictions of the crime count in the near future. Specifically, it will answer the following research questions:

Dataset

Using the data from Assignment 2 (“crime_df”), the monthly total crime count for Ku-Ring-Gai was extracted into Ku_ring_crime.csv, which includes the date, year, month and total crime count. From this dataset the time series object “y” was created.
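The code that read Ku_ring_crime.csv is not shown in this report, so the following is only a minimal sketch of how a monthly series such as “y” might be built with base R's ts(). The vector holds just the twelve 1995 counts printed in this report, and the names crime_1995 and y_demo are illustrative assumptions.

```r
# Illustrative subset: the twelve monthly counts for 1995 printed in this report.
# The full series runs from Jan 1995 to Jan 2018 (277 observations).
crime_1995 <- c(269, 275, 390, 307, 286, 271, 329, 319, 278, 328, 287, 302)

# frequency = 12 marks the data as monthly; start = c(1995, 1) is January 1995.
y_demo <- ts(crime_1995, start = c(1995, 1), frequency = 12)

print(y_demo)
print(start(y_demo))      # first period as (year, month)
print(end(y_demo))        # last period as (year, month)
print(frequency(y_demo))  # observations per year
```

With the real data, the numeric vector would presumably be the total_crime_count column of the tibble shown below.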

Ku-Ring-Gai Crime data

## # A tibble: 288 x 3
##    date        year total_crime_count
##    <date>     <int>             <dbl>
##  1 1995-01-01  1995               269
##  2 1995-02-01  1995               275
##  3 1995-03-01  1995               390
##  4 1995-04-01  1995               307
##  5 1995-05-01  1995               286
##  6 1995-06-01  1995               271
##  7 1995-07-01  1995               329
##  8 1995-08-01  1995               319
##  9 1995-09-01  1995               278
## 10 1995-10-01  1995               328
## # ... with 278 more rows

Time series data “y”

##      Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1995 269 275 390 307 286 271 329 319 278 328 287 302
## 1996 286 323 332 282 289 319 307 313 382 379 350 339
## 1997 340 332 286 385 368 305 352 311 360 365 420 377
## 1998 390 369 369 425 362 413 400 477 425 428 439 433
## 1999 413 359 350 355 323 397 372 322 357 392 460 441
## 2000 437 400 482 495 456 425 545 438 403 444 451 446
## 2001 414 348 398 414 497 480 509 453 438 508 478 391
## 2002 408 471 412 413 469 391 496 411 354 395 401 366
## 2003 415 429 453 478 469 457 396 400 408 402 380 332
## 2004 397 334 359 372 355 319 319 315 301 311 307 316
## 2005 349 287 302 357 336 288 249 266 290 267 298 290
## 2006 280 268 308 268 253 240 278 297 292 339 349 344
## 2007 253 309 320 342 283 280 302 282 266 306 344 295
## 2008 316 277 327 306 350 310 261 307 272 321 346 255
## 2009 350 284 341 365 322 383 360 302 372 310 307 348
## 2010 345 301 300 323 303 312 329 298 232 239 254 255
## 2011 329 274 320 325 264 244 250 237 343 258 275 280
## 2012 286 242 274 349 302 397 283 369 246 311 344 305
## 2013 326 298 328 327 316 286 298 248 276 282 263 334
## 2014 272 251 291 224 267 241 222 282 293 271 291 283
## 2015 247 231 276 299 220 214 234 234 209 226 226 195
## 2016 215 170 270 292 298 263 238 202 215 232 204 226
## 2017 241 260 263 216 281 289 251 237 230 273 299 270
## 2018 256

EDA of crime series data

##  Time-Series [1:277] from 1995 to 2018: 269 275 390 307 286 271 329 319 278 328 ...
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   170.0   276.0   315.0   327.5   372.0   545.0
## [1] "ts"
## [1] 1995    1
## [1] 2018    1
## [1] 12
##      Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1995   1   2   3   4   5   6   7   8   9  10  11  12
## 1996   1   2   3   4   5   6   7   8   9  10  11  12
## 1997   1   2   3   4   5   6   7   8   9  10  11  12
## 1998   1   2   3   4   5   6   7   8   9  10  11  12
## 1999   1   2   3   4   5   6   7   8   9  10  11  12
## 2000   1   2   3   4   5   6   7   8   9  10  11  12
## 2001   1   2   3   4   5   6   7   8   9  10  11  12
## 2002   1   2   3   4   5   6   7   8   9  10  11  12
## 2003   1   2   3   4   5   6   7   8   9  10  11  12
## 2004   1   2   3   4   5   6   7   8   9  10  11  12
## 2005   1   2   3   4   5   6   7   8   9  10  11  12
## 2006   1   2   3   4   5   6   7   8   9  10  11  12
## 2007   1   2   3   4   5   6   7   8   9  10  11  12
## 2008   1   2   3   4   5   6   7   8   9  10  11  12
## 2009   1   2   3   4   5   6   7   8   9  10  11  12
## 2010   1   2   3   4   5   6   7   8   9  10  11  12
## 2011   1   2   3   4   5   6   7   8   9  10  11  12
## 2012   1   2   3   4   5   6   7   8   9  10  11  12
## 2013   1   2   3   4   5   6   7   8   9  10  11  12
## 2014   1   2   3   4   5   6   7   8   9  10  11  12
## 2015   1   2   3   4   5   6   7   8   9  10  11  12
## 2016   1   2   3   4   5   6   7   8   9  10  11  12
## 2017   1   2   3   4   5   6   7   8   9  10  11  12
## 2018   1

Fig 1: Total crime count of Ku-Ring-Gai (Jan,1995-Jan,2018)

Fig 2: Aggregating the cycles and displaying a year-on-year trend

Fig 3: Box plot of crime counts across all months

Seasonality

Fig 4: A seasonal plot: Displaying seasonal pattern of each year

The above plot is especially useful for identifying years in which the pattern changes. The seasonal plot shows no uniformity in the spikes: they occur in a different month each year. This difference in seasonality could be due to events that take place in different months, such as Easter, which falls in March or April. Christmas always falls in December, so it would not shift the pattern much.

Fig 5: Subseries plot, displaying yearly seasonal pattern by means of each month

The horizontal lines indicate the means for each month. This form of plot enables the underlying seasonal pattern to be seen clearly, and also shows the changes in seasonality over time. It is especially useful in identifying changes within particular seasons.

The subseries plot shows crime increasing in March and April and then decreasing in winter; it starts increasing again in October and November and then decreases in December. The December decrease is probably due to Christmas, when security is high. An overall decrease in winter is also observed. Even though the mean value of each month is quite different, the variance is small. Hence, we have a strong seasonal effect with a cycle of 12 months or less.
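The monthly means that the subseries plot draws as horizontal lines can also be computed directly. The sketch below does this in base R for an illustrative subset of the data (the 1995 to 1997 counts printed in this report); the names crime_sub and y_demo are assumptions, and the numbers it prints describe only these three years, not the full series.

```r
# Illustrative subset: monthly counts for 1995-1997 as printed in this report.
crime_sub <- c(
  269, 275, 390, 307, 286, 271, 329, 319, 278, 328, 287, 302,
  286, 323, 332, 282, 289, 319, 307, 313, 382, 379, 350, 339,
  340, 332, 286, 385, 368, 305, 352, 311, 360, 365, 420, 377)
y_demo <- ts(crime_sub, start = c(1995, 1), frequency = 12)

# cycle() gives the month index (1 = Jan, ..., 12 = Dec) of every observation,
# so tapply() groups the counts by calendar month.
monthly_mean <- as.numeric(tapply(y_demo, cycle(y_demo), mean))
monthly_sd   <- as.numeric(tapply(y_demo, cycle(y_demo), sd))
names(monthly_mean) <- month.abb
names(monthly_sd)   <- month.abb

print(round(rbind(mean = monthly_mean, sd = monthly_sd), 1))
```

Comparing the spread between the monthly means with the spread within each month gives a rough numerical counterpart to the visual reading of the plot.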

Decomposition

Fig 6: Decomposition: Seasonality and Trend patterns in Ku-Ring-Gai crime data

1. Average Method

Fig 7: Average method for forecasting total crime count (per month)

As we can see, forecasting using the mean of all the values on the time series does not seem a realistic prediction. So, now we will use the naive method, which forecasts the next values using the last value observed.

2. Naive method

Fig 8: Naive method for forecasting total crime count (per month)

Now the prediction seems more accurate than with the average method, but it takes into account neither the seasonality of the series nor the trend. The seasonal pattern can be captured with the seasonal naive method, which repeats the values of the last cycle.

3. Seasonal naive method

Fig 9: Seasonal naive method for forecasting total crime count (per month)

Although this method is more accurate and complex than the methods shown before, it still does not capture the trend.

The simplest method that predicts using the general trend of the series is the drift method, whose prediction is the result of drawing a line between the first and the last observations of the series and extending it forward.

4. Drift method

Fig 10: Drift method for forecasting total crime count (per month)

This prediction captures the trend but lacks the seasonal pattern.
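The four benchmark methods above reduce to very simple formulas, sketched below in base R on an illustrative subset of the data (the 1995 to 1997 counts printed in this report). The variable names are assumptions, and this is not the code that produced Figs 7 to 10; the forecast package offers meanf(), naive(), snaive() and rwf(drift = TRUE) for the same methods.

```r
# Illustrative subset: monthly counts for 1995-1997 as printed in this report.
crime_sub <- c(
  269, 275, 390, 307, 286, 271, 329, 319, 278, 328, 287, 302,
  286, 323, 332, 282, 289, 319, 307, 313, 382, 379, 350, 339,
  340, 332, 286, 385, 368, 305, 352, 311, 360, 365, 420, 377)

n <- length(crime_sub)  # number of observations
m <- 12                 # seasonal period (months per year)
h <- 1:12               # forecast horizons, one year ahead

# Average method: every forecast is the mean of the history.
mean_fc <- rep(mean(crime_sub), length(h))
# Naive method: every forecast is the last observed value.
naive_fc <- rep(crime_sub[n], length(h))
# Seasonal naive: repeat the last observed year (valid for h <= m).
snaive_fc <- crime_sub[(n - m + 1):n]
# Drift: extend the line through the first and last observations.
drift_fc <- crime_sub[n] + h * (crime_sub[n] - crime_sub[1]) / (n - 1)

print(round(data.frame(h, mean_fc, naive_fc, snaive_fc, drift_fc), 1))
```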

5. ETS

Fig 11: ETS modelling for forecasting total crime count (per month)

6. Holt-Winters’ Seasonal Methods for Forecasting Time Series

The Holt-Winters forecasting method captures both the seasonality (as the seasonal naïve method does) and the trend (as the drift method does). Holt (1957) and Winters (1960) extended Holt’s method to capture seasonality. There are two variations of this method, which differ in the nature of the seasonal component.

Applying Holt-Winters’ additive method

The additive method is preferred when the seasonal variations are roughly constant through the series.
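For reference, the additive method can be written in its standard component form. The notation here is an assumption, not taken from this report: \(\ell_t\) is the level, \(b_t\) the trend, \(s_t\) the seasonal component, \(m = 12\) the seasonal period, \(k\) the integer part of \((h-1)/m\), and \(\alpha\), \(\beta\), \(\gamma\) the smoothing parameters reported in the output below.

```latex
\begin{align*}
\hat{y}_{t+h|t} &= \ell_t + h\,b_t + s_{t+h-m(k+1)} \\
\ell_t &= \alpha\,(y_t - s_{t-m}) + (1-\alpha)(\ell_{t-1} + b_{t-1}) \\
b_t &= \beta\,(\ell_t - \ell_{t-1}) + (1-\beta)\,b_{t-1} \\
s_t &= \gamma\,(y_t - \ell_{t-1} - b_{t-1}) + (1-\gamma)\,s_{t-m}
\end{align*}
```

The seasonal component enters by addition and subtraction, which is why this form suits seasonal swings of roughly constant size.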

## 
## Forecast method: Holt-Winters' additive method
## 
## Model Information:
## Holt-Winters' additive method 
## 
## Call:
##  hw(y = y, seasonal = "additive") 
## 
##   Smoothing parameters:
##     alpha = 0.3624 
##     beta  = 1e-04 
##     gamma = 0.001 
## 
##   Initial states:
##     l = 299.3987 
##     b = 0.0727 
##     s = -3.3207 8.8442 -0.0899 -12.8302 -11.9615 0.5237
##            2.5029 9.4775 15.2263 5.6447 -19.4996 5.4826
## 
##   sigma:  36.8196
## 
##      AIC     AICc      BIC 
## 3573.113 3575.476 3634.721 
## 
## Error measures:
##                      ME    RMSE      MAE       MPE     MAPE     MASE
## Training set -0.5685053 35.7404 28.32289 -1.170132 8.823303 0.605483
##                    ACF1
## Training set 0.03121484
## 
## Forecasts:
##          Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## Feb 2018       244.4037 197.2175 291.5900 172.2387 316.5688
## Mar 2018       269.6834 219.4927 319.8742 192.9233 346.4436
## Apr 2018       279.2334 226.2065 332.2604 198.1357 360.3312
## May 2018       273.4182 217.6977 329.1387 188.2010 358.6354
## Jun 2018       266.5572 208.2660 324.8485 177.4085 355.7060
## Jul 2018       264.7560 204.0013 325.5107 171.8397 357.6723
## Aug 2018       252.3346 189.2111 315.4580 155.7956 348.8736
## Sep 2018       251.4369 186.0291 316.8447 151.4043 351.4695
## Oct 2018       264.2908 196.6745 331.9070 160.8806 367.7010
## Nov 2018       273.2461 203.4899 343.0022 166.5633 379.9289
## Dec 2018       261.0633 189.2299 332.8968 151.2035 370.9232
## Jan 2019       269.8472 195.9936 343.7008 156.8979 382.7965
## Feb 2019       245.0744 169.2423 320.9065 129.0992 361.0496
## Mar 2019       270.3541 192.6035 348.1047 151.4448 389.2633
## Apr 2019       279.9041 200.2801 359.5280 158.1297 401.6784
## May 2019       274.0888 192.6335 355.5442 149.5137 398.6640
## Jun 2019       267.2279 183.9804 350.4753 139.9119 394.5439
## Jul 2019       265.4267 180.4239 350.4295 135.4261 395.4273
## Aug 2019       253.0052 166.2816 339.7289 120.3728 385.6376
## Sep 2019       252.1076 163.6956 340.5196 116.8931 387.3221
## Oct 2019       264.9614 174.8918 355.0311 127.2118 402.7110
## Nov 2019       273.9167 182.2184 365.6150 133.6763 414.1572
## Dec 2019       261.7340 168.4346 355.0334 119.0448 404.4232
## Jan 2020       270.5179 175.6434 365.3923 125.4199 415.6159

The forecast crime counts for the next two years (February 2018 to January 2020) are displayed above, with 80% and 95% prediction intervals. The output also gives the level smoothing parameter (alpha), the trend smoothing parameter (beta) and the seasonal smoothing parameter (gamma), together with the initial values of the level, trend and seasonal states. The data are smoothed by applying the Holt-Winters additive method; the smoothed (fitted) values for the observed data are shown below.

##           Jan      Feb      Mar      Apr      May      Jun      Jul
## 1995 304.9540 267.0114 295.1205 339.1652 321.8365 301.9477 288.8228
## 1996 307.9495 275.1297 317.7879 332.4657 308.4971 294.5322 301.5664
## 1997 352.8610 323.4211 351.8154 337.4263 349.0299 349.0664 331.2533
## 1998 382.2626 360.3201 388.6845 391.1458 397.7620 377.9070 388.8643
## 1999 438.0135 404.2160 413.0265 399.8388 377.8592 351.1488 365.9728
## 2000 422.1182 402.7566 426.9392 456.5803 464.7875 454.9077 442.2583
## 2001 452.5304 413.7974 415.1908 418.6199 411.1649 435.5472 449.9846
## 2002 451.5628 410.9767 458.0287 451.0294 431.5665 438.3661 419.5351
## 2003 395.0441 377.5736 421.3891 442.5309 449.7863 449.8984 450.9431
## 2004 381.6413 362.5326 377.3379 380.3642 371.6946 358.7542 342.7223
## 2005 320.2129 305.9080 324.1958 325.8180 331.4587 326.1801 310.7256
## 2006 293.4518 263.7779 290.4342 306.5149 286.8573 267.6017 255.9399
## 2007 334.8939 280.4404 315.9374 327.0724 326.7981 303.9571 293.6784
## 2008 311.1036 288.2184 309.2727 325.3702 312.6015 319.2100 314.3148
## 2009 301.1600 294.1843 315.6450 334.4707 339.8516 326.3916 345.3115
## 2010 339.2792 316.6227 336.1552 332.6948 323.4437 309.1162 308.5123
## 2011 265.7813 263.9235 292.7348 312.2784 311.1357 287.1528 269.8724
## 2012 283.7423 259.7364 278.4820 286.4979 303.3305 295.9534 330.9711
## 2013 320.7441 297.8150 323.0817 334.5875 325.9607 315.5678 303.0894
## 2014 302.5246 266.6143 286.1494 297.6065 265.0346 258.9263 250.6763
## 2015 286.0465 247.0575 266.4471 279.5215 280.7653 251.9012 236.3970
## 2016 221.1516 194.0914 210.5795 241.7243 254.0374 263.1524 261.3719
## 2017 228.6752 208.2960 252.3503 265.8185 241.8483 249.1698 261.8545
## 2018 276.6991                                                      
##           Aug      Sep      Oct      Nov      Dec
## 1995 290.9709 300.3362 305.0563 322.3816 297.4677
## 1996 291.1111 298.1981 341.4388 364.0129 346.8985
## 1997 326.3744 320.0279 347.3387 362.6471 371.4081
## 1998 380.4807 414.7682 431.3027 439.0773 426.9886
## 1999 355.8188 342.7622 360.7133 381.0086 397.5708
## 2000 467.1437 455.8641 449.5417 456.5621 442.4590
## 2001 458.8845 456.0088 462.3700 487.9433 472.2704
## 2002 434.6837 425.3358 412.4228 415.0669 397.8034
## 2003 418.3619 410.8885 422.8386 424.2507 396.0296
## 2004 321.4642 318.2908 324.9739 328.8210 308.6816
## 2005 275.6997 271.3285 291.0371 291.2179 281.4659
## 2006 251.3235 267.0479 288.9874 316.0421 315.7876
## 2007 284.1172 282.5045 289.4465 304.3546 306.5156
## 2008 282.4057 290.4602 296.7296 314.4608 313.6331
## 2009 338.1429 324.1520 354.5118 347.3341 320.3686
## 2010 303.3821 300.6184 288.6566 279.5987 258.0249
## 2011 250.0734 244.4401 293.0826 289.3337 271.8695
## 2012 301.0113 324.8932 309.0947 318.7797 315.6854
## 2013 288.7858 273.0812 287.0075 294.2041 270.6115
## 2014 227.7699 246.5353 276.2305 283.3157 273.9146
## 2015 223.0982 226.1477 232.7238 239.2681 222.2601
## 2016 240.4908 225.6116 234.5675 242.6120 216.4033
## 2017 245.4952 241.5194 250.1592 267.3811 266.6879
## 2018

Fig 12: Holt-Winters’ additive method: forecasting total crime count (per month)

The original data and the data smoothed with the Holt-Winters method are plotted.

Fig 13: Plotting the smoothed data (Holt-Winter’s additive method)

Decomposing the additive time series data

Fig 13: Estimated components for the Holt-Winters method with additive seasonal components.

Fig 13 shows the estimates of the level, trend and seasonal components of the time series. The trend shows that the crime count decreases very slightly over time, and the small value of \(\beta\) for the additive model means that the slope component hardly changes over time.

Measuring Forecast Accuracy: Residual Plot

Fig 14: Residual plot of the Holt-Winters additive model

The residual plot above shows some pattern. This does not mean that the forecast values are incorrect (only slight variations are observed in the plot), but other methods will also be applied.

Applying Holt-Winters’ multiplicative method

The multiplicative method is preferred when the seasonal variations change in proportion to the level of the series. With the multiplicative method, the seasonal component is expressed in relative terms (percentages), and the series is seasonally adjusted by dividing through by the seasonal component.
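In its standard component form, the multiplicative method can be written as follows. The notation is an assumption, not taken from this report: \(\ell_t\) is the level, \(b_t\) the trend, \(s_t\) the seasonal component, \(m = 12\) the seasonal period, and \(k\) the integer part of \((h-1)/m\).

```latex
\begin{align*}
\hat{y}_{t+h|t} &= (\ell_t + h\,b_t)\,s_{t+h-m(k+1)} \\
\ell_t &= \alpha\,\frac{y_t}{s_{t-m}} + (1-\alpha)(\ell_{t-1} + b_{t-1}) \\
b_t &= \beta\,(\ell_t - \ell_{t-1}) + (1-\beta)\,b_{t-1} \\
s_t &= \gamma\,\frac{y_t}{\ell_{t-1} + b_{t-1}} + (1-\gamma)\,s_{t-m}
\end{align*}
```

Here the seasonal component multiplies the level and trend, so the seasonal states are ratios close to 1 rather than counts.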

## 
## Forecast method: Holt-Winters' multiplicative method
## 
## Model Information:
## Holt-Winters' multiplicative method 
## 
## Call:
##  hw(y = crime, seasonal = "multiplicative") 
## 
##   Smoothing parameters:
##     alpha = 0.4438 
##     beta  = 0.0068 
##     gamma = 8e-04 
## 
##   Initial states:
##     l = 297.6833 
##     b = 5.2668 
##     s = 0.9845 1.0361 1.0104 0.955 0.9733 1.0231
##            0.9905 1.0146 1.0419 1.0208 0.9442 1.0056
## 
##   sigma:  0.1166
## 
##      AIC     AICc      BIC 
## 3586.627 3588.989 3648.235 
## 
## Error measures:
##                     ME     RMSE      MAE       MPE     MAPE      MASE
## Training set -2.911101 36.08355 28.68287 -1.714943 8.970069 0.6131785
##                     ACF1
## Training set -0.03624604
## 
## Forecasts:
##          Point Forecast    Lo 80    Hi 80     Lo 95    Hi 95
## Feb 2018       249.4617 212.1731 286.7503 192.43373 306.4897
## Mar 2018       269.6703 225.4032 313.9375 201.96960 337.3711
## Apr 2018       274.9994 226.0314 323.9674 200.10932 349.8895
## May 2018       267.5499 216.3434 318.7565 189.23627 345.8636
## Jun 2018       261.0631 207.7408 314.3854 179.51366 342.6125
## Jul 2018       269.3268 210.9539 327.6997 180.05320 358.6004
## Aug 2018       256.2232 197.5695 314.8768 166.52015 345.9262
## Sep 2018       251.3134 190.7860 311.8408 158.74479 343.8821
## Oct 2018       265.6113 198.5278 332.6948 163.01591 368.2067
## Nov 2018       272.2022 200.3119 344.0926 162.25539 382.1491
## Dec 2018       258.5263 187.3013 329.7513 149.59704 367.4556
## Jan 2019       263.8561 188.1878 339.5245 148.13139 379.5809
## Feb 2019       247.5774 173.8030 321.3519 134.74915 360.4057
## Mar 2019       267.6321 184.9147 350.3495 141.12676 394.1375
## Apr 2019       272.9196 185.5617 360.2776 139.31715 406.5221
## May 2019       265.5252 177.6245 353.4259 131.09274 399.9577
## Jun 2019       259.0862 170.4902 347.6822 123.59026 394.5821
## Jul 2019       267.2860 172.9794 361.5927 123.05647 411.5156
## Aug 2019       254.2805 161.8045 346.7565 112.85061 395.7103
## Sep 2019       249.4067 156.0030 342.8105 106.55799 392.2555
## Oct 2019       263.5949 162.0272 365.1625 108.26052 418.9292
## Nov 2019       270.1345 163.1277 377.1412 106.48178 433.7871
## Dec 2019       256.5612 152.1589 360.9634  96.89165 416.2307
## Jan 2020       261.8492 152.4646 371.2338  94.55993 429.1385

Fig 15: Forecasting: Holt-Winter’s Multiplicative method

Fig 16: Residual plot of the Holt-Winters multiplicative model

The above residual plot does not show any pattern. Even so, this alone does not allow us to conclude that the forecast values are completely correct.

Decomposing the Multiplicative time series data

Fig 17: Decomposition: estimated components for the Holt-Winters method with multiplicative seasonal components.

The above figure shows the estimates of the level, trend and seasonal components of the time series. The figure shows that the trend decreases over time. The small value of \(\gamma\) for the multiplicative model means that the seasonal component hardly changes over time.

Compare Holt-Winters Additive and Multiplicative Methods:

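library(forecast)

# Full monthly series (y is the ts object created earlier)
crime <- window(y, start = c(1995, 1))
# Training set: the first 20 years (Jan 1995 to Dec 2014)
train <- window(crime, end = c(2014, 12))
# Test set: Jan 2015 onwards
test <- window(crime, start = c(2015, 1))
# Forecast horizon = number of test observations
ts.h <- length(crime) - length(train)

crime.hw_additive <- hw(train, seasonal = "additive", h = ts.h)
crime.hw_multiplicative <- hw(train, seasonal = "multiplicative", h = ts.h)

# Keep the test-set row (2) and the RMSE, MAE, MAPE and MASE columns
result <- rbind(
  forecast::accuracy(crime.hw_additive, test)[2, c(2, 3, 5, 6)],
  forecast::accuracy(crime.hw_multiplicative, test)[2, c(2, 3, 5, 6)])
rownames(result) <- c("Hw.Additive", "Hw_Multip")
result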
##                 RMSE      MAE     MAPE      MASE
## Hw.Additive 40.82635 33.43966 15.14005 0.6924204
## Hw_Multip   45.64793 37.57081 17.06836 0.7779624

Fig 18: Forecasting accuracy of the Holt-Winters (additive and multiplicative) methods

The additive method has a lower RMSE, MAE, MAPE and MASE than the multiplicative method, which suggests that the additive method is more appropriate.
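As a rough guide to what these four measures compute, the sketch below implements them in base R. The function name accuracy_measures and the toy numbers are hypothetical, and the MASE scaling shown (the in-sample mean absolute error of a seasonal naive forecast) is an assumption about how the table above was scaled.

```r
# Hypothetical helper: four common forecast accuracy measures.
accuracy_measures <- function(actual, predicted, train, m = 12) {
  e <- actual - predicted
  # MASE scale: in-sample MAE of the seasonal naive forecast (lag m)
  scale <- mean(abs(diff(train, lag = m)))
  c(RMSE = sqrt(mean(e^2)),                  # root mean squared error
    MAE  = mean(abs(e)),                     # mean absolute error
    MAPE = mean(abs(e / actual)) * 100,      # mean absolute percentage error
    MASE = mean(abs(e)) / scale)             # mean absolute scaled error
}

# Toy numbers (hypothetical, not the crime data); m = 1 means no seasonality.
train_toy  <- c(100, 120, 110, 130)
actual_toy <- c(100, 200)
pred_toy   <- c(110, 180)

res <- accuracy_measures(actual_toy, pred_toy, train_toy, m = 1)
print(round(res, 3))
```

A MASE below 1 means the forecast errors are smaller, on average, than those of the naive benchmark on the training data.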

Fig 19: Comparison of Holt-Winters Additive and Multiplicative Methods.

The results of the forecast show that the prediction follows the trend and the seasonality of our time series closely.

7. ARIMA model

ARIMA models provide another approach to time series forecasting. Exponential smoothing and ARIMA models are the two most widely used approaches to time series forecasting, and they provide complementary approaches to the problem. ARIMA models aim to describe the autocorrelations in the data.

Stationarity and Differencing

## # A tibble: 1 x 5
##   statistic p.value parameter method                       alternative
##       <dbl>   <dbl>     <dbl> <chr>                        <chr>      
## 1     -2.93   0.186         6 Augmented Dickey-Fuller Test stationary
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 5 lags. 
## 
## Value of test-statistic is: 2.5634 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 5 lags. 
## 
## Value of test-statistic is: 0.0653 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

The concept of stationarity is fundamental for time series. To be able to fit a time series ARIMA model and to approach a time series forecast, the series should be stationary. The mean, variance and the autocorrelation of the series should not change with time.

In the time series plot (Fig 1), the data are observed to decrease over the years. Whether a series is stationary can be determined with the KPSS test, using the ur.kpss() function from the urca package.

KPSS: For the original series, the test statistic (2.5634) is higher than the 1% critical value (0.739), so the null hypothesis of stationarity is rejected and the data are not stationary. We therefore difference the series with diff() to make it stationary. For the differenced series, the test statistic (0.0653) is below all the critical values, so the null hypothesis of stationarity is not rejected and the differenced series can be treated as stationary.
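The differencing step and the KPSS check might look like the sketch below. It uses an illustrative subset of the data (the 1995 to 1997 counts printed in this report), so its test statistics will not match the output above; the names crime_sub, y_demo and dy_demo are assumptions, and the KPSS part runs only if the urca package is installed.

```r
# Illustrative subset: monthly counts for 1995-1997 as printed in this report.
crime_sub <- c(
  269, 275, 390, 307, 286, 271, 329, 319, 278, 328, 287, 302,
  286, 323, 332, 282, 289, 319, 307, 313, 382, 379, 350, 339,
  340, 332, 286, 385, 368, 305, 352, 311, 360, 365, 420, 377)
y_demo <- ts(crime_sub, start = c(1995, 1), frequency = 12)

# First difference: y[t] - y[t-1]; one observation is lost.
dy_demo <- diff(y_demo)
print(length(y_demo))
print(length(dy_demo))

# KPSS test (null hypothesis: the series is stationary around a level).
if (requireNamespace("urca", quietly = TRUE)) {
  kpss_level <- urca::ur.kpss(y_demo, type = "mu", lags = "short")
  kpss_diff  <- urca::ur.kpss(dy_demo, type = "mu", lags = "short")
  # Stationarity is rejected when the statistic exceeds the critical value.
  print(kpss_level@teststat)
  print(kpss_diff@teststat)
  print(kpss_level@cval)
}
```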

Autocorrelation

With a stationary time series, the next step is to select the best ARIMA model. The autocorrelation function (ACF) and partial autocorrelation function (PACF) help to choose the order parameters for the ARIMA model. These plots also help in detecting time series patterns such as trends and seasonal and cyclic behaviour in the dataset.

## 
## Autocorrelations of series 'y', by lag
## 
## 0.0833 0.1667 0.2500 0.3333 0.4167 0.5000 0.5833 0.6667 0.7500 0.8333 
##  0.809  0.778  0.763  0.738  0.729  0.697  0.699  0.685  0.672  0.648 
## 0.9167 1.0000 1.0833 1.1667 1.2500 1.3333 1.4167 1.5000 1.5833 1.6667 
##  0.662  0.649  0.635  0.619  0.603  0.590  0.560  0.595  0.602  0.553 
## 1.7500 1.8333 1.9167 2.0000 
##  0.538  0.528  0.524  0.498
## 
##  Augmented Dickey-Fuller Test
## 
## data:  y
## Dickey-Fuller = -2.9253, Lag order = 6, p-value = 0.1863
## alternative hypothesis: stationary

Fig 20a: ACF of Differenced Series

Pacf(y, main='')

# ggAcf(y) + ggtitle("ACF of a stationary time series") + theme_bw()
# Pacf(y, main='PACF for Differenced Series')

Fig 20b: PACF for Differenced Series

White noise

Fig 21: White noise of differenced series

For a white noise series, we expect each autocorrelation to be close to zero. Of course, they will not be exactly equal to zero, as there is some random variation. For a white noise series, we expect 95% of the spikes in the ACF to lie within \(\pm 2/\sqrt{T}\) (T=277). In this analysis, all of the autocorrelation coefficients lie within the limits, confirming that the data are white noise.
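With the value of T stated above, the bounds work out as follows (a worked evaluation only, no new assumption beyond T = 277):

```latex
\[
\pm\frac{2}{\sqrt{T}} = \pm\frac{2}{\sqrt{277}} \approx \pm\frac{2}{16.64} \approx \pm 0.120
\]
```

So a sample autocorrelation larger than about 0.12 in absolute value would fall outside the limits.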

ARIMA

The ARIMA model is built on the dataset to describe the autocorrelations in the data and is used for time series forecasting. Using the ACF and PACF plots, the ARIMA(0,1,1) model was identified.

From the autocorrelation and partial autocorrelation graphs, we can read off each parameter needed for the seasonal ARIMA model (Appendix). For the non-seasonal component, we see a PACF that trails off and an ACF that cuts off at lag 1, indicating a moving average model at lag 1 (p=0, q=1), and because the data were differenced to make them stationary, we include a non-seasonal difference (d=1). Based on the ACF and PACF, the candidate model is therefore ARIMA(0,1,1).
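Written out, an ARIMA(0,1,1) model without a constant takes the form below. The symbols are assumptions of this sketch: \(\varepsilon_t\) is the white noise error and \(\theta_1\) is the moving average coefficient, whose estimate is reported as ma1 = -0.6604 in the Modelling section.

```latex
\begin{align*}
y_t - y_{t-1} &= \varepsilon_t + \theta_1\,\varepsilon_{t-1} \\
\hat{y}_{T+1|T} &= y_T + \theta_1\,\varepsilon_T
\end{align*}
```

This model is equivalent to simple exponential smoothing with smoothing parameter \(\alpha = 1 + \theta_1\) (about 0.34 for the estimate above), and its point forecasts are the same at every horizon, so it captures neither seasonality nor a trend by itself.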

Split up training and test sets

To test the accuracy of the models, the dataset was split into training and test sets. The training set comprised the first 240 data points, from January 1995 to December 2014, while the test set contained the remaining 37 data points, from January 2015 to January 2018.

The model was trained only on the data up to December 2014 and tested for accuracy against the test set, a forecast horizon of about three years. The ARIMA(0,1,1) model proved to be the most accurate. Therefore, this model will be used to forecast the total crime count for the next 2 years in Ku-Ring-Gai.
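The sizes of the two sets can be checked by counting the months on each side of December 2014 for a monthly series that runs from January 1995 to January 2018 (277 observations). The index series idx below is a hypothetical placeholder, not the crime data.

```r
# Placeholder monthly series with the same time span as the crime data:
# 277 observations from January 1995 to January 2018.
idx <- ts(seq_len(277), start = c(1995, 1), frequency = 12)

# Same split as used for the models: train up to Dec 2014, test from Jan 2015.
train_idx <- window(idx, end = c(2014, 12))
test_idx  <- window(idx, start = c(2015, 1))

print(end(idx))           # last period as (year, month)
print(length(train_idx))  # 20 years x 12 months
print(length(test_idx))   # remaining months
```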

Modelling

# ARIMA model
crime_arifit <- auto.arima(train, seasonal = TRUE)  # selects ARIMA(0,1,1)
crime_arifit
## Series: train 
## ARIMA(0,1,1) 
## 
## Coefficients:
##           ma1
##       -0.6604
## s.e.   0.0550
## 
## sigma^2 estimated as 1455:  log likelihood=-1209.2
## AIC=2422.39   AICc=2422.44   BIC=2429.35

The model from the auto.arima() function in the forecast package (AICc=2422.44) performed better than the one from the arima() function (AICc=2788.85), so the auto.arima() model is used for further analysis and forecasting.

Residual Checks

Next, the residuals and the model assumptions are checked. Almost all residual autocorrelations are within the bounds, and the Ljung-Box test returns a p-value of 19.87%, indicating no significant autocorrelation in the forecast errors. The fitted model is ARIMA(0,1,1).

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,1)
## Q* = 28.464, df = 23, p-value = 0.1987
## 
## Model df: 1.   Total lags used: 24

The p-value is greater than 0.05. This indicates that the result is not significant, and hence we can conclude that the residuals are not distinguishable from white noise.
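The reported p-value can be reproduced from the test statistic with base R, since the Ljung-Box statistic is compared against a chi-squared distribution whose degrees of freedom are the number of lags minus the number of estimated model coefficients. The variable names below are illustrative.

```r
q_stat   <- 28.464  # Q* reported in the Ljung-Box output above
lags     <- 24      # total lags used
model_df <- 1       # one estimated coefficient (ma1)

# Degrees of freedom of the reference chi-squared distribution
df_test <- lags - model_df

# Upper-tail probability: chance of a statistic at least this large
# if the residuals really are white noise
p_value <- pchisq(q_stat, df = df_test, lower.tail = FALSE)

print(df_test)
print(round(p_value, 4))
```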

Forecast and test accuracy

The ARIMA(0,1,1) model has the best RMSE as well, and is considered the most accurate among all the models used (including the ARIMA and seasonal ARIMA models in the Appendix).

##                        ME     RMSE      MAE        MPE      MAPE      MASE
## Training set  -0.02032931 37.98436 30.24926  -1.045813  9.021275 0.6263583
## Test set     -35.76642160 47.80521 40.28630 -16.715355 18.242232 0.8341909
##                    ACF1 Theil's U
## Training set 0.02599103        NA
## Test set     0.43992947  1.366679

The forecast from the auto.arima model shows that the crime count in Ku-Ring-Gai is likely to decrease slightly over the next two years. In other words, this change in crime count should be taken as an opportunity to move forward with the proposed high-density urban development in Ku-Ring-Gai.

Comparison of forecasting methods

It is better to compare the chosen method against a set of simple forecasting methods using the chosen accuracy measures. The different methods are the Average method, the Naïve method, the Drift method, ETS, Hw_additive and Hw_multiplicative.

##              RMSE      MAE     MAPE      MASE
## ARIMA    47.80521 40.28630 18.24223 0.8341909
## Average 102.21540 97.16937 42.39647 2.0120440
## Naive    50.81737 43.05405 19.49108 0.8915016
## Drift    51.59737 43.84406 19.83789 0.9078599
## ETS      53.55083 47.37636 21.91886 0.9810018

Fig 21: Prediction accuracy of ARIMA, Average, Naive, Drift and ETS models

The forecast accuracy of the chosen method is compared with that of a set of simple forecasting methods. ARIMA has the lowest value on every accuracy measure in the table (RMSE, MAE, MAPE and MASE), so it is the best of the methods compared here regardless of which measure is used.

Fig 22: Comparison of all methods: Actual vs predicted crime count: Ku-Ring-Gai

Based on the above result, while comparing with the different methods, it has been observed that the best method is the ARIMA model which has the smallest RMSE value of 47.805.

Forecasting for the next 2 years

Fig 22: Comparison of all models: Forecast for crime (Feb, 2018 to Jan, 2020): Ku-Ring-Gai

Results

Using all the above modelling methods, the two best models for describing trend plus seasonality and autocorrelations are Holt-Winters and ARIMA, respectively.

Conclusion and Reflections

The objective of this research paper is to expand on the initial research into the effect of various parameters on crime. The forecast made from the analysis is in line with expectations, since there is a clear pattern in the data. Compared with the models used in Assignment Task 2, time series analysis seems to perform better, with a slight decrease in crime count in the forecast values. Of the models applied in the time series analysis, the best two are the Holt-Winters additive and ARIMA models, and of these two the best is the Holt-Winters additive method, with a test-set RMSE of 40.8 (against 47.8 for ARIMA). The time series models predict from the crime data alone, and other factors such as employment, gender, health, drug addiction, socioeconomic status, education and urbanization may also affect the forecasts. These should be considered when analysing crime counts and making predictions of crime.

The current analysis was able to predict the crime count for the next couple of years, which may allow us to infer whether high-density urban development should be implemented in Ku-Ring-Gai. The paper suggests that this is advisable, as the crime count appears to decrease slightly over the next couple of years. However, this may not be the case for other councils in crime_df.csv. It is therefore recommended that further research and forecasting be carried out in this area; for time series modelling, an ensemble approach could make better predictions and reduce the RMSE further.

Overall, given the limited time frame and the scope of this assignment, I think the aim was fulfilled: to determine the trend and seasonality in the crime data, to make predictions for the next two years using multiple methods, and to find the most accurate method based on the lowest RMSE value.

References

  1. https://otexts.com/fpp2/autocorrelation.html
  2. file:///C:/Users/Public/Zarmina-Data%20science/data%20statistics/block%20session%203/time_seriesWorkbook.nb.html
  3. https://www.analyticsvidhya.com/blog/2015/12/complete-tutorial-time-series-modeling/
  4. https://otexts.com/fpp2/holt-winters.html
  5. https://rpubs.com/mr148/303786