In Assignment 2, the DeTechTives team investigated the correlation between crime and other factors extracted from historical Census data, weather stations, and the Socio-Economic Indexes for Areas (SEIFA). Insights from that project are relevant to policy making for crime control. We found that unemployment and income inequality have a significant influence on the crime rate, and we also found patterns of crime around seasonal events of the year. This led to my own extended research question.
In fact, the relationship between crime and seasons is not new; it has been known for well over a century. The statistician Adolphe Quetelet stated some 150 years ago that “The seasons in their course, exercise a very marked influence: thus, during summer the greatest number of crimes against persons are committed and the fewest against property; the contrary takes place during the winter” (Gentleman and Whitmore, 1994). Unfortunately, we did not complete this line of analysis in Assignment 2. I therefore decided to study crime further using time series analysis, an advanced technique for capturing the trend and seasonality of data; this is the methodology used to answer my research question. I chose to focus on Blacktown. My objective is to forecast the number of crime cases in Blacktown in the near future, which will be useful information for the government in making better policy to reduce crime.
Among all types of crime, theft has the highest number of reported cases (see Figure A2 in the Appendix). Sydney has the highest number of crimes, almost double that of Blacktown, which comes second (see Figure A1 in the Appendix).
Following Adolphe Quetelet, I grouped crimes into three categories: Crime Against Persons, Property Crime, and Other Offences. I will focus on the first two. Property Crime in Blacktown is currently quite low, about 600 cases below its 2009 peak, and there are signs that the trend will increase steadily in the coming years (Figure 1). The variation in Property Crime within a year is now only about 300 cases, compared with almost 450 cases at the 2009 peak. In contrast, Crime Against Persons is currently quite high, only about 10 cases below its 2016 peak, and there are signs that the trend will increase steadily in the coming years (Figure 2). The variation in cases throughout 2018 is about 175, compared with almost 225 at the 2016 peak. Surprisingly, the monthly means do not quite follow Quetelet's observation. Although summer has the lowest level of Crime Against Persons and winter the highest, this is not the case for Property Crime (Figures 1 and 2). This might be due to large changes in both crime types over the last decades.
Figure 1. Property Crime
Figure 2. Crime against Persons
It is important to consider the following points before applying time series models to data. Although the data DeTechTives collected runs from 1995 to 2018, the 2018 data is incomplete, so I excluded 2018 from my time series analysis.

* Granularity: Most commonly used time series models are well suited to monthly, quarterly or yearly data. This data is monthly, covering 23 years.
* Number of data points: This is related to granularity. With monthly data, 36 data points represent three years of data, which is sufficient for seasonal adjustment. This project collected 276 data points, representing 23 years of data.
* Response variable: Although it is not necessary to assume normality of errors, time series models for count data are still underdeveloped. With large counts, however, the response can simply be treated as a continuous variable.
* Missing values: The original data that DeTechTives collected was clean, with no missing values, and was recorded monthly for over 23 years. Unfortunately, its format was not convenient for use as a data frame in R, so I had to transform it into a familiar format before converting it to a time series for analysis.
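The reshaping step mentioned under missing values can be sketched roughly as follows. This is a plain Python illustration, not the R code used for the analysis. The column layout (one row per LGA and offence, one column per month labelled like `Jan 1995`), the category names and all counts are assumptions made up for the example.

```python
from datetime import datetime

# Hypothetical wide-format rows: one row per LGA/offence, one column per month.
# The "Mon YYYY" header format and every count below are assumptions.
rows = [
    {"LGA": "Blacktown", "Offence category": "Theft",
     "Jan 1995": 120, "Feb 1995": 110, "Mar 1995": 130},
    {"LGA": "Blacktown", "Offence category": "Assault",
     "Jan 1995": 40, "Feb 1995": 35, "Mar 1995": 45},
    {"LGA": "Sydney", "Offence category": "Theft",
     "Jan 1995": 300, "Feb 1995": 280, "Mar 1995": 310},
]
ID_COLUMNS = {"LGA", "Offence category", "Subcategory"}


def monthly_series(rows, lga, categories):
    """Sum the selected offence categories for one LGA into a chronological series."""
    totals = {}
    for row in rows:
        if row["LGA"] != lga or row["Offence category"] not in categories:
            continue
        for column, count in row.items():
            if column in ID_COLUMNS:
                continue
            month = datetime.strptime(column, "%b %Y")
            totals[month] = totals.get(month, 0) + count
    # Sort by date so the result can be treated as a time series.
    return [(month.strftime("%Y-%m"), totals[month]) for month in sorted(totals)]


series = monthly_series(rows, "Blacktown", {"Theft", "Assault"})
print(series)
```

Once the counts are in one chronological sequence per crime group, they can be handed to any time series routine with a period of 12.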
A KPSS test is performed to check whether the data that DeTechTives collected is stationary, that is, whether the mean, the variance, and the autocorrelation of the data do not change over time (Itl.nist.gov, 2019). Differencing is the technique of calculating the difference between each value and the one from the time period immediately before it. It is a very common technique in time series analysis that allows data scientists to transform a non-stationary series into a stationary one. Applying the test to the differenced data gives test statistics of 0.0736 for Property Crime and 0.0559 for Crime Against Persons. Both are far below even the 10% critical value of 0.347 shown in the output below, so the null hypothesis of stationarity is not rejected and we can conclude that the differenced data are stationary. This result, combined with the abundant number of monthly data points, satisfies the conditions needed for a sound time series analysis.
## [1] "| Result of KPSS test for property crime |"
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 5 lags.
##
## Value of test-statistic is: 0.0736
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
## [1] "| Result of KPSS test for crime against persons |"
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 5 lags.
##
## Value of test-statistic is: 0.0559
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
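To make the two steps above concrete, here is a minimal Python sketch of first differencing and of the level-stationarity KPSS statistic with a Bartlett-weighted long-run variance. It is an illustration under assumptions, not the R routine that produced the output above, so its values are not guaranteed to match that output digit for digit. The two example series are made up.

```python
def difference(values, lag=1):
    """Return values[t] - values[t - lag] for every t >= lag."""
    return [values[t] - values[t - lag] for t in range(lag, len(values))]


def kpss_level_statistic(values, lags):
    """KPSS statistic for stationarity around a constant level."""
    n = len(values)
    mean = sum(values) / n
    resid = [v - mean for v in values]

    # Sum of squared partial sums of the demeaned series.
    partial, sum_sq_partial = 0.0, 0.0
    for e in resid:
        partial += e
        sum_sq_partial += partial ** 2

    # Long-run variance estimate with Bartlett weights.
    long_run_var = sum(e * e for e in resid) / n
    for s in range(1, lags + 1):
        weight = 1.0 - s / (lags + 1)
        autocov = sum(resid[t] * resid[t - s] for t in range(s, n)) / n
        long_run_var += 2.0 * weight * autocov

    return sum_sq_partial / (n ** 2 * long_run_var)


CRITICAL_1PCT = 0.739  # 1% critical value from the test output above

# Made-up examples: a steadily rising series and one that hovers around a level.
trend = [float(t) for t in range(100)]
alternating = [1.0 if t % 2 == 0 else -1.0 for t in range(100)]

print(difference([5, 7, 10, 14]))
print(kpss_level_statistic(trend, lags=4) > CRITICAL_1PCT)   # True: stationarity rejected
print(kpss_level_statistic(alternating, lags=0) > CRITICAL_1PCT)
```

A statistic above the critical value rejects stationarity, while a small statistic, like those reported for the differenced crime series, does not.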
The ACF and PACF of both crime types without differencing show that the ACF does not stay within the threshold limits, which strengthens the case for differencing (Figures 3.0 and 4.0).
Figure 3.0 ACF & PACF of Property Crime (without differencing)
Figure 4.0 ACF & PACF of Crime against Persons (without differencing)
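The threshold limit referred to here is usually drawn at about ±1.96/√n, where n is the number of observations. The following Python sketch computes sample autocorrelations and lists the lags that fall outside that band. The trending series is made up, and the band is the common large-sample approximation rather than anything taken from the figures.

```python
import math


def acf(values, max_lag):
    """Sample autocorrelations r_1 .. r_max_lag."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    c0 = sum(d * d for d in dev)
    return [
        sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
        for k in range(1, max_lag + 1)
    ]


def lags_outside_limit(values, max_lag):
    """Lags whose autocorrelation lies outside the approximate 95% band."""
    limit = 1.96 / math.sqrt(len(values))
    return [k for k, r in enumerate(acf(values, max_lag), start=1) if abs(r) > limit]


# Made-up trending series: a rising level keeps neighbouring values alike,
# so the low-order autocorrelations are large and decay slowly.
trending = [100 + 2 * t for t in range(48)]
print(lags_outside_limit(trending, 6))
```

A slowly decaying ACF with many lags outside the band is the typical signature of a series that still needs differencing.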
After differencing, the ACF and PACF of both crime types are shown in Figures 3 and 4 below.
Figure 3. ACF & PACF of Property Crime (differencing once)
Figure 4. ACF & PACF of Crime against Persons (differencing once)
After differencing, more lags stay within the limits, which is an improvement, but the result is still not good enough because the values do not settle down as the lag increases. I therefore examined the seasonal component of the time series to see whether removing seasonality can stabilize the ACF and PACF (Figures 5 and 6).
Figure 5. ACF & PACF of Property Crime (with differencing & removing seasonal)
Figure 6. ACF & PACF of Crime against Persons (with differencing & removing seasonal)
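One common way to remove a yearly pattern from monthly data is seasonal differencing at lag 12. Whether the figures above were produced this way or by a decomposition is not stated, so the Python sketch below is an assumption, and the series in it is made up.

```python
def difference(values, lag=1):
    """Return values[t] - values[t - lag] for every t >= lag."""
    return [values[t] - values[t - lag] for t in range(lag, len(values))]


# Made-up monthly pattern repeated over three years on top of a linear trend.
pattern = [30, 10, 20, 0, -10, -20, -30, -20, -10, 0, 10, 20]
series = [500 + 2 * t + pattern[t % 12] for t in range(36)]

seasonal_diff = difference(series, lag=12)  # removes the repeating yearly pattern
both = difference(seasonal_diff, lag=1)     # then removes what is left of the trend

print(seasonal_diff[:3], both[:3])
```

In this idealised example the lag-12 difference leaves only the yearly growth, and the additional first difference leaves nothing but zeros; real data would leave noise whose ACF and PACF can then be inspected.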
The ACF and PACF of both crime types after differencing and removing seasonality show promising results. Insights gained from exploring the ACF and PACF will help me tune the ARIMA model.
To find suitable parameters for the ARIMA model, I first use the auto.arima() function available in R, then check the ACF of the residuals and run a portmanteau (Ljung-Box) test to evaluate the model.
## [1] "| Result of ACF residual for property crime |"
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,4)(0,0,2)[12]
## Q* = 23.552, df = 18, p-value = 0.1703
##
## Model df: 6. Total lags used: 24
## [1] "| Result of ACF residual for crime against persons |"
##
## Ljung-Box test
##
## data: Residuals from ARIMA(4,1,2)(2,0,0)[12]
## Q* = 19.743, df = 16, p-value = 0.232
##
## Model df: 8. Total lags used: 24
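The numbers in these outputs fit together in a simple way: the degrees of freedom are the lags used minus the number of estimated coefficients (24 − 6 = 18 and 24 − 8 = 16), and the p-value is the upper tail of a chi-square distribution at Q*. The Python sketch below shows the Q* formula and recomputes the p-values. The tail function uses a closed form that only holds for even degrees of freedom, which happens to cover both cases here; it is an illustration, not the routine R uses.

```python
import math


def ljung_box_q(residuals, lags):
    """Ljung-Box statistic Q* = n (n + 2) * sum_k r_k^2 / (n - k)."""
    n = len(residuals)
    mean = sum(residuals) / n
    dev = [r - mean for r in residuals]
    c0 = sum(d * d for d in dev)
    q = 0.0
    for k in range(1, lags + 1):
        r_k = sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
        q += r_k ** 2 / (n - k)
    return n * (n + 2) * q


def chi2_upper_tail_even_df(x, df):
    """P(X > x) for a chi-square variable; closed form valid for even df only."""
    if df % 2 != 0:
        raise ValueError("closed form requires an even number of degrees of freedom")
    half = x / 2.0
    term, total = 1.0, 0.0
    for j in range(df // 2):
        total += term
        term *= half / (j + 1)
    return math.exp(-half) * total


# Degrees of freedom = lags used - number of estimated coefficients.
p_property = chi2_upper_tail_even_df(23.552, 24 - 6)
p_persons = chi2_upper_tail_even_df(19.743, 24 - 8)

# These should land close to the reported 0.1703 and 0.232.
print(round(p_property, 4), round(p_persons, 3))
```

A p-value above 0.05 means the residual autocorrelations, taken together, are not distinguishable from those of white noise.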
In both cases, the ACF of the residuals does not stay entirely within the threshold limits, especially for Crime Against Persons, which is not ideal. However, both p-values from the Ljung-Box test are higher than 0.05, so the hypothesis that the residuals are white noise cannot be rejected and these may be adequate models. This means the auto.arima() function can be used as a starting point for finding the ARIMA parameters.
In the case of Property Crime, the PACF in Figure 5 shows that values start decreasing after lag 3, and the KPSS test gave a good result after differencing once. After investigating further and testing different parameters, I found that ARIMA(3,1,2)(1,1,2)[12] gave residuals reasonably close to white noise, with a Ljung-Box p-value of 0.0968, which is higher than 0.05. The ACF shows that almost all values are within the threshold limits, and the mean of the residuals is close to 0 (Figure 7).
Figure 7. ARIMA results for Property Crime
##
## Ljung-Box test
##
## data: Residuals from ARIMA(3,1,2)(1,1,2)[12]
## Q* = 23.677, df = 16, p-value = 0.0968
##
## Model df: 8. Total lags used: 24
For Crime Against Persons, I fine-tuned the parameters with different values and found that ARIMA(4,1,2)(2,0,0)[12], the combination suggested by auto.arima() above, gave the best result, with a p-value of 0.0921, which is higher than 0.05. The residuals are approximately normally distributed with a mean close to 0, and the ACF values are mostly within the threshold limits. All these factors show that this model for Crime Against Persons is reasonable (Figure 8).
Figure 8. ARIMA results for Crime against Persons
To assess the goodness of my models, I removed the years 2017 and 2018, then used the models fitted on 1995 to 2016 to predict crime in those two years, and compared the predictions with the actual values for 2017 and 2018. Both predictions are very good, as the validation and prediction values follow each other closely (Figures 9 and 10).
Figure 9. Validation & Prediction comparison for Property Crime (blue = Validation, red = Prediction)
Figure 10. Validation & Prediction comparison for Crime Against Persons (blue = Validation, red = Prediction)
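The visual comparison can be backed by error measures computed on the held-out period. The Python sketch below splits a series into training and holdout parts and scores a set of predictions with RMSE, MAE and MAPE. It also includes a seasonal-naive forecast (repeat the last observed year) as a simple yardstick, which is a swapped-in baseline and not the ARIMA model used in this blog. All numbers are made up.

```python
import math


def split_holdout(series, n_holdout):
    """Split a series into a training part and the last n_holdout points."""
    return series[:-n_holdout], series[-n_holdout:]


def seasonal_naive(training, horizon, period=12):
    """Baseline forecast: repeat the last observed season."""
    last_season = training[-period:]
    return [last_season[h % period] for h in range(horizon)]


def accuracy(actual, predicted):
    """RMSE, MAE and MAPE (in percent) between held-out values and predictions."""
    errors = [a - p for a, p in zip(actual, predicted)]
    n = len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n
    mape = 100.0 * sum(abs(e) / a for e, a in zip(errors, actual)) / n
    return rmse, mae, mape


# Made-up held-out counts and made-up model predictions for four months.
actual = [100, 110, 120, 130]
predicted = [110, 100, 120, 140]
rmse, mae, mape = accuracy(actual, predicted)
print(round(rmse, 2), mae, round(mape, 2))
```

If a model's errors on the holdout are clearly smaller than those of the seasonal-naive baseline, that supports the impression given by the validation charts.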
Using the ARIMA models with the fine-tuned parameters from the previous section, the forecasts of Property Crime and Crime Against Persons for the next two years, 2019 and 2020, are shown in Figure 11 and Figure 12 respectively. The forecast values are listed in Table T1 and Table T2 in the Appendix.
Figure 11. Forecast for Property Crime
Figure 12. Forecast for Crime against Persons
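The shaded bands in forecast charts like these correspond to the Lo/Hi columns of Tables T1 and T2. Assuming the intervals are symmetric normal intervals around the point forecast, the 80% band implies a forecast standard error, from which the 95% band follows. The Python sketch below checks this on the January 2019 Property Crime row of Table T1; the normality assumption is mine and is not stated in the output.

```python
from statistics import NormalDist


def interval(point, std_error, level):
    """Symmetric normal prediction interval around a point forecast."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return point - z * std_error, point + z * std_error


# Jan 2019 Property Crime row from Table T1 in the Appendix.
point, lo80, hi80 = 1362.636, 1237.364, 1487.908

# Back out the standard error implied by the 80% interval.
z80 = NormalDist().inv_cdf(0.90)
std_error = (hi80 - lo80) / (2.0 * z80)

lo95, hi95 = interval(point, std_error, 0.95)
print(round(std_error, 2), round(lo95, 1), round(hi95, 1))
```

The recomputed 95% limits land very close to the tabulated 1171.0493 and 1554.222. The same tables show the bands widening with the horizon, so the 2020 forecasts are considerably less certain than those for early 2019.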
My forecast is that Property Crime will decrease slightly while Crime Against Persons will increase in the near future. Moreover, Figure 12 shows three peaks and two troughs over the two-year period, which suggests that the number of Crime Against Persons cases will be lowest in summer and high in winter. Thus, the relationship between crime and seasons is real, and leave and holiday schedules for the Police Department should be planned accordingly.
Based on my forecast, Property Crime will decrease slightly while Crime Against Persons will increase in the near future. This suggests that policy in the near future needs to focus more on Crime Against Persons. Below are some suggestions to decrease Crime Against Persons:
This new analysis enhances the insights gained in Assessment Task 2 and gives the government further insight for making better policy to reduce crime, particularly in Blacktown (NSW). Previously, we found that unemployment and income have a strong correlation with the crime rate. The analysis in this blog shows that Crime Against Persons is currently high and will increase in the near future, and that summer is a relatively safe time for citizens. Thus, some policy suggestions have been given to cope with the insights gained from the time series analysis.
In this assignment, I gained practical skills in time series analysis. The knowledge gained from researching crime-related issues, particularly in Australia, is also valuable. The ARIMA models chosen in my analysis might not be the most efficient ones, because the residual ACF does not stay entirely within the threshold limits for either Property Crime or Crime Against Persons, but validation showed a fairly good prediction for Property Crime and a good prediction for Crime Against Persons. In future work, a different time series technique should be applied and compared with the ARIMA models in this blog.
Gentleman, J. and Whitmore, G. (1994). Case studies in data analysis. New York: Springer-Verlag.
Cook, P. and Kang, S. (2016). Birthdays, Schooling, and Crime: Regression-Discontinuity Analysis of School Performance, Delinquency, Dropout, and Crime Initiation. American Economic Journal: Applied Economics, 8(1), pp.33-57.
Hyndman, R. and Athanasopoulos, G. (2018). Forecasting. [Heathmont, Vic.]: OTexts.
John Jay College of Criminal Justice (2014). Bridging the Great Divide: Can Police-Community Partnerships Reduce Crime and Strengthen Our Democracy?. [online] The City University of New York. Available at: http://www.jjay.cuny.edu/sites/default/files/Research/OSF_Panelist_Bios.pdf [Accessed 10 Jun. 2019].
Itl.nist.gov. (2019). NIST/SEMATECH e-Handbook of Statistical Methods. [online] Available at: http://www.itl.nist.gov/div898/handbook/ [Accessed 7 Jun. 2019].
Figure A1. Crime Cases based on LGAs
Figure A2. Crime Cases based on Crime Category
Table T1. Forecast for Property Crime
##
## Forecast method: ARIMA(3,1,2)(1,1,2)[12]
##
## Model Information:
##
## Call:
## arima(x = dfts.property, order = c(3, 1, 2), seasonal = list(order = c(1, 1,
## 2), period = 12), method = "ML")
##
## Coefficients:
## ar1 ar2 ar3 ma1 ma2 sar1 sma1 sma2
## -0.3138 0.3041 0.1252 -0.0696 -0.6179 -0.0033 -0.8495 -0.0761
## s.e. 0.2853 0.1486 0.0855 0.2840 0.2078 0.5139 0.5151 0.4613
##
## sigma^2 estimated as 9508: log likelihood = -1661.13, aic = 3340.26
##
## Error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -7.861958 95.28308 73.90091 -0.7037087 4.664433 0.5442881
## ACF1
## Training set -0.004490831
##
## Forecasts:
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Jan 2019 1362.636 1237.364 1487.908 1171.0493 1554.222
## Feb 2019 1326.109 1178.947 1473.270 1101.0447 1551.173
## Mar 2019 1438.508 1282.095 1594.921 1199.2950 1677.721
## Apr 2019 1359.566 1191.432 1527.701 1102.4266 1616.706
## May 2019 1472.223 1298.022 1646.425 1205.8056 1738.641
## Jun 2019 1398.110 1216.810 1579.409 1120.8357 1675.384
## Jul 2019 1369.589 1182.797 1556.380 1083.9162 1655.261
## Aug 2019 1404.163 1211.786 1596.539 1109.9480 1698.377
## Sep 2019 1409.287 1211.793 1606.780 1107.2465 1711.327
## Oct 2019 1434.850 1232.347 1637.353 1125.1487 1744.551
## Nov 2019 1419.570 1212.246 1626.893 1102.4959 1736.644
## Dec 2019 1345.303 1133.275 1557.332 1021.0337 1669.573
## Jan 2020 1380.437 1159.133 1601.740 1041.9818 1718.892
## Feb 2020 1300.510 1072.236 1528.784 951.3957 1649.625
## Mar 2020 1459.860 1225.674 1694.046 1101.7039 1818.017
## Apr 2020 1351.374 1111.134 1591.614 983.9585 1718.790
## May 2020 1458.142 1212.518 1703.767 1082.4921 1833.793
## Jun 2020 1383.194 1132.157 1634.232 999.2651 1767.124
## Jul 2020 1359.627 1103.452 1615.803 967.8405 1751.414
## Aug 2020 1398.911 1137.666 1660.156 999.3711 1798.451
## Sep 2020 1404.577 1138.400 1670.755 997.4945 1811.660
## Oct 2020 1430.950 1159.923 1701.976 1016.4504 1845.449
## Nov 2020 1404.128 1128.349 1679.907 982.3601 1825.896
## Dec 2020 1343.235 1062.779 1623.692 914.3145 1772.156
Table T2. Forecast for Crime against Persons
##
## Forecast method: ARIMA(4,1,2)(2,0,0)[12]
##
## Model Information:
##
## Call:
## arima(x = dfts.against, order = c(4, 1, 2), seasonal = list(order = c(2, 0,
## 0), period = 12), method = "CSS")
##
## Coefficients:
## ar1 ar2 ar3 ar4 ma1 ma2 sar1 sar2
## 0.7079 0.0781 -0.1015 -0.2300 -1.3722 0.4732 0.1671 0.3568
## s.e. 0.1985 0.0936 0.0858 0.0664 0.1917 0.1842 0.0670 0.0583
##
## sigma^2 estimated as 1334: part log likelihood = -1439.84
##
## Error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 1.529587 34.63531 25.68254 -0.07612771 5.168646 0.5980242
## ACF1
## Training set -0.0156553
##
## Forecasts:
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Jan 2019 596.7932 549.9872 643.5991 525.2096 668.3767
## Feb 2019 535.4596 486.0881 584.8312 459.9523 610.9669
## Mar 2019 544.5719 491.4889 597.6548 463.3885 625.7552
## Apr 2019 536.7358 481.5733 591.8983 452.3720 621.0996
## May 2019 521.3855 466.0390 576.7321 436.7402 606.0309
## Jun 2019 530.9200 475.4630 586.3771 446.1058 615.7343
## Jul 2019 532.9382 477.4596 588.4168 448.0910 617.7854
## Aug 2019 518.2210 462.6994 573.7427 433.3080 603.1341
## Sep 2019 551.0928 495.3462 606.8395 465.8357 636.3499
## Oct 2019 584.8288 528.5829 641.0747 498.8081 670.8495
## Nov 2019 587.5403 530.4415 644.6392 500.2151 674.8655
## Dec 2019 580.7633 522.5648 638.9618 491.7564 669.7702
## Jan 2020 574.9221 513.5836 636.2606 481.1130 668.7312
## Feb 2020 555.5721 492.7243 618.4200 459.4546 651.6897
## Mar 2020 559.5284 495.2854 623.7714 461.2772 657.7795
## Apr 2020 539.2301 473.9740 604.4862 439.4295 639.0307
## May 2020 545.8769 480.0277 611.7261 445.1692 646.5845
## Jun 2020 556.9404 490.5885 623.2923 455.4640 658.4168
## Jul 2020 546.9837 480.1807 613.7866 444.8174 649.1499
## Aug 2020 534.9060 467.6127 602.1993 431.9898 637.8223
## Sep 2020 565.9743 498.0810 633.8675 462.1405 669.8081
## Oct 2020 598.5517 529.9457 667.1576 493.6280 703.4753
## Nov 2020 607.8440 538.4247 677.2632 501.6764 714.0115
## Dec 2020 585.1448 514.8583 655.4312 477.6509 692.6386