Previously, in assignment 2, the DeTechTives team looked for correlations and relationships between crime and other factors extracted from historical Census data, weather stations, and the Socio-Economic Indexes for Areas (SEIFA). Insights gained from that project are essential input for making policy related to crime control.
In that project, we found that unemployment and income inequality have a significant influence on the crime rate. We also found patterns of crime around seasonal events of the year. The relationship between crime and seasons is not new; it has been known for well over a century. The statistician Adolphe Quetelet wrote about 150 years ago that “The seasons in their course, exercise a very marked influence: thus, during summer the greatest number of crimes against persons are committed and the fewest against property; the contrary takes place during the winter” (Gentleman and Whitmore, 1994). Unfortunately, we did not complete this part of the analysis in assignment 2. I therefore decided to investigate crime further using time series analysis, a technique for capturing the trend and seasonality of data. My focus is Sydney, the largest Australian city and the one with the most recorded crime (see Figure A1 in the Appendix). My objective is to forecast the number of crime cases in Sydney in the near future, which should be useful to the government when making policy to reduce crime.
Among all types of crime, theft has the highest number of reported cases and accounts for 46% of all reported crime (see Figure A2 in the Appendix). Sydney has the highest number of crimes, almost double that of Blacktown, which comes second (see Figure A1 in the Appendix).
Taking insight from Quetelet, I grouped crimes into three categories: Crime Against Persons, Property Crime, and Other Offences. My focus is on the first two. Property Crime in Sydney is currently at its lowest level in the last 24 years, far below the peak in 2001 (Figure 1). Its variation within a year is now only about 500 cases, compared with almost 1,500 cases around the 2001 peak. In contrast, Crime Against Persons is currently quite high, only about 100 cases below the peak in 2009, and there are signs that it will keep increasing steadily over the coming years (Figure 2). Its variation within the year in 2018 is also the highest in the last 10 years. Surprisingly, the monthly means do not quite follow Quetelet's observation. Summer has the lowest Property Crime and winter the highest, as he described, but the same is true for Crime Against Persons (Figures 1 and 2). This may be the result of the large changes in both crime types over the last decades.
Figure 1. Property Crime
Figure 2. Crime Against Persons
The original data collected by DeTechTives was clean, with no missing values. It was recorded monthly for over 24 years. Unfortunately, its format was not convenient for use as a data frame in R, so I had to transform it into a more familiar format before converting it to a time series for analysis.
An initial KPSS test was performed to check whether the data is stationary, that is, whether its mean, variance, and autocorrelation stay constant over time (Itl.nist.gov, 2019). The test statistics for Property Crime and Crime Against Persons were 3.7742 and 3.0989 respectively, both far above the 1% critical value of 0.739, which suggests that differencing is necessary. Differencing replaces each value with the difference between it and the value from the time period immediately before it. It is a very common technique in time series analysis that allows a non-stationary series to be transformed into a stationary one. The results after differencing once are shown below: both test statistics are well below even the 10% critical value of 0.347, so stationarity is not rejected. This result, combined with the large number of data points collected monthly, satisfies the conditions needed for a good time series analysis.
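The reshaping step can be sketched roughly as follows. This is only an illustration: the layout of the raw file is not shown in this post, so the wide format, the data frame `raw`, its `category` column, the month column names, and the helper `to_monthly_ts()` are all hypothetical.

```r
# Hypothetical wide layout: one row per crime category, one column per month.
# The real column names are not shown in this post; these are made up.
raw <- data.frame(
  category = c("Property Crime", "Crime Against Persons"),
  `Jan 1995` = c(3100, 520),
  `Feb 1995` = c(2950, 480),
  `Mar 1995` = c(3050, 510),
  check.names = FALSE
)

# Pick one category and turn its monthly columns into a monthly ts object.
to_monthly_ts <- function(wide, category_name, start_year, start_month = 1) {
  sel <- wide[wide$category == category_name, , drop = FALSE]
  values <- as.numeric(unlist(sel[1, setdiff(names(wide), "category")]))
  # frequency = 12 marks the series as monthly so seasonal tools work later.
  ts(values, start = c(start_year, start_month), frequency = 12)
}

property_ts <- to_monthly_ts(raw, "Property Crime", start_year = 1995)
print(property_ts)
```

Setting `frequency = 12` at this point matters because the seasonal differencing and seasonal ARIMA terms used later rely on the series knowing its period.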
## [1] "=========================================="
## [1] "| Result of KPSS test for property crime |"
## [1] "=========================================="
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 5 lags.
##
## Value of test-statistic is: 0.0949
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
## [1] "================================================="
## [1] "| Result of KPSS test for crime against persons |"
## [1] "================================================="
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 5 lags.
##
## Value of test-statistic is: 0.0204
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
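A minimal sketch of this stationarity check is given here, assuming the `ur.kpss()` function from the `urca` package and using the built-in `AirPassengers` series as a stand-in, since the crime data is not included in this post. The object names are illustrative only.

```r
library(urca)  # assumed package; provides ur.kpss()

# Stand-in monthly series shipped with R, used only because the crime data
# is not included here.
y <- AirPassengers

# Level-stationarity KPSS test on the raw series.
kpss_raw <- ur.kpss(y, type = "mu", lags = "short")

# First difference: y[t] - y[t-1], which removes a stochastic trend.
dy <- diff(y, differences = 1)
kpss_diff <- ur.kpss(dy, type = "mu", lags = "short")

# The null hypothesis of KPSS is stationarity, so a statistic ABOVE the
# critical value rejects stationarity.
crit_5pct <- kpss_raw@cval[1, 2]  # columns are 10pct, 5pct, 2.5pct, 1pct
cat("raw:        ", round(kpss_raw@teststat, 4), "\n")
cat("differenced:", round(kpss_diff@teststat, 4), "\n")
cat("5% critical value:", crit_5pct, "\n")
```

Note the direction of the test: unlike most unit root tests, a small KPSS statistic is the desired outcome.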
The ACF and PACF of both crime types without differencing showed that the ACF does not stay within the threshold limits, which strengthens the case for differencing (Figures A3 and A4 in the Appendix). The ACF and PACF of both crime types after differencing are shown in Figure 3 and Figure 4 below.
Figure 3. ACF & PACF of Property Crime (differencing once)
Figure 4. ACF & PACF of Crime against Persons (differencing once)
After differencing, more lags stay within the limits, which is an improvement, but the result is still not good enough because the values do not settle down at higher lags. I therefore checked the seasonal component of the time series to see whether removing seasonality can stabilize the ACF and PACF (Figures 5 and 6).
Figure 5. ACF & PACF of Property Crime (after differencing and removing seasonality)
Figure 6. ACF & PACF of Crime against Persons (after differencing and removing seasonality)
After differencing and removing seasonality, the ACF and PACF of both crime types showed promising results. The insights gained from exploring the ACF and PACF help me choose the parameters of the ARIMA model.
To find the best parameters for the ARIMA model, I first used the auto.arima() function available in R, then checked the ACF of the residuals and ran a portmanteau test to evaluate the model (Hyndman and Athanasopoulos, 2018).
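The comparison of ACF and PACF before and after removing seasonality can be sketched as follows. The sketch assumes that "removing seasonality" means an extra seasonal difference at lag 12, which is one common option but not necessarily the exact method behind the figures, and it again uses the built-in `AirPassengers` series as a stand-in.

```r
y <- AirPassengers  # stand-in monthly series

d1 <- diff(y)                 # regular first difference
d1_s12 <- diff(d1, lag = 12)  # additional seasonal difference at lag 12

# plot = FALSE returns the correlations instead of drawing the chart.
acf_d1 <- acf(d1, lag.max = 36, plot = FALSE)
acf_d1_s12 <- acf(d1_s12, lag.max = 36, plot = FALSE)
pacf_d1_s12 <- pacf(d1_s12, lag.max = 36, plot = FALSE)

# Approximate 95% threshold shown as dashed lines on ACF plots: 1.96 / sqrt(n)
threshold <- function(x) qnorm(0.975) / sqrt(length(x))

# Count how many lags fall outside the threshold in each case.
# acf() includes lag 0 (always 1), so it is dropped with [-1].
outside_d1 <- sum(abs(acf_d1$acf[-1]) > threshold(d1))
outside_d1_s12 <- sum(abs(acf_d1_s12$acf[-1]) > threshold(d1_s12))
cat("lags outside threshold, differenced once:", outside_d1, "\n")
cat("lags outside threshold, plus seasonal difference:", outside_d1_s12, "\n")
cat("first PACF values after both differences:",
    round(pacf_d1_s12$acf[1:3], 2), "\n")
```

Counting the lags outside the threshold is just a numeric stand-in for reading the charts by eye; the plots themselves come from calling `acf()` and `pacf()` with the default `plot = TRUE`.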
## [1] "============================================="
## [1] "| Result of ACF residual for property crime |"
## [1] "============================================="
##
## Ljung-Box test
##
## data: Residuals from ARIMA(3,1,1)(2,0,0)[12]
## Q* = 36.474, df = 18, p-value = 0.006132
##
## Model df: 6. Total lags used: 24
## [1] "===================================================="
## [1] "| Result of ACF residual for crime against persons |"
## [1] "===================================================="
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,1)(1,1,0)[12]
## Q* = 86.291, df = 22, p-value = 1.454e-09
##
## Model df: 2. Total lags used: 24
In both cases the residual ACF does not stay within the threshold limits, especially for Crime Against Persons, and both p-values from the Ljung-Box test are below 0.05, which implies that the residuals are not white noise and that these are not good models. This means the auto.arima() function alone cannot be relied on to find the best parameters for the ARIMA model here.
For Property Crime, the PACF in Figure 5 showed that values start decreasing after lag 3, and the KPSS test gave a good result after differencing once. After further investigation and testing with different parameters, I found that ARIMA(3,1,1)(1,0,3)[12] gave residuals reasonably close to white noise, with a Ljung-Box p-value of 0.08646. The ACF values are within the threshold limits and the mean of the residuals is close to 0 (Figure 7).
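The automatic search and residual check described above can be sketched like this, assuming the `forecast` package and the built-in `AirPassengers` series as a stand-in for the crime data. The model it selects for the stand-in series will differ from the ones reported above.

```r
library(forecast)  # assumed package; provides auto.arima() and checkresiduals()

y <- AirPassengers  # stand-in monthly series

# Let auto.arima() search over (p,d,q)(P,D,Q)[12] combinations.
fit_auto <- auto.arima(y)
print(fit_auto)

# checkresiduals() draws the residual plot, ACF and histogram, and prints a
# Ljung-Box test; lag = 24 matches the "Total lags used" in the output above.
checkresiduals(fit_auto, lag = 24)
```

A Ljung-Box p-value below 0.05 in this output is the signal, as in the two results above, that autocorrelation remains in the residuals and the orders should be adjusted by hand.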
Figure 7. ARIMA results for Property Crime
##
## Ljung-Box test
##
## data: Residuals from ARIMA(3,1,1)(1,0,3)[12]
## Q* = 24.142, df = 16, p-value = 0.08646
##
## Model df: 8. Total lags used: 24
For Crime Against Persons, the PACF in Figure 6 suggested that a lag order of 3 is reasonable, and differencing once also gives a good KPSS test result. After fine-tuning the parameters with different values, I found that ARIMA(3,1,1)(1,1,1)[12] gave the best result, with a p-value of 0.03969, which is still lower than 0.05. The residuals are approximately normally distributed with a mean close to 0, and the ACF values are mostly within the threshold limits. Together, these factors suggest that the model for Crime Against Persons is reasonable, though it might not be as good as the model for Property Crime (Figure 8).
Figure 8. ARIMA results for Crime against Persons
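Both manually chosen models can be fitted and checked in the same way. The sketch below uses only base R; the helper `fit_and_check()` is hypothetical, and the demonstration runs a smaller model on the built-in `AirPassengers` series so that it fits quickly. With the 8 coefficients of ARIMA(3,1,1)(1,0,3)[12] and 24 lags, the Ljung-Box degrees of freedom are 24 - 8 = 16, which matches the Property Crime output above.

```r
# Fit a seasonal ARIMA with explicit orders and test the residuals for
# remaining autocorrelation (Ljung-Box).
fit_and_check <- function(y, order, seasonal, lag = 24) {
  fit <- arima(y, order = order,
               seasonal = list(order = seasonal, period = frequency(y)),
               method = "ML")
  n_coef <- length(coef(fit))  # the "Model df" of the Ljung-Box output
  lb <- Box.test(residuals(fit), lag = lag, type = "Ljung-Box", fitdf = n_coef)
  list(fit = fit, ljung_box = lb)
}

# Stand-in data and a small model, chosen only so the sketch runs quickly.
# For the Property Crime model, pass order = c(3, 1, 1), seasonal = c(1, 0, 3).
res <- fit_and_check(log(AirPassengers),
                     order = c(0, 1, 1), seasonal = c(0, 1, 1))
print(res$ljung_box)
```

Passing `fitdf` is what makes the test account for the number of estimated parameters; without it the p-value would be too optimistic.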
To assess the goodness of my models, I removed the years 2017 and 2018, then fitted the models on the data from 1995 to 2016 and used them to predict crime in those two years. I then compared the predictions with the actual values for 2017 and 2018. Both predictions are very good, as the predicted values follow the validation values closely (Figures 9 and 10).
Figure 9. Validation & Prediction comparison for Property Crime
Figure 10. Validation & Prediction comparison for Crime Against Persons
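A hold-out check of this kind can be sketched as follows. It is a simplified illustration on the built-in `AirPassengers` series, with its last two years standing in for 2017 and 2018, and with a small seasonal model rather than the ones fitted above.

```r
y <- log(AirPassengers)  # stand-in monthly series, 1949 to 1960

# Hold out the last 24 months, mirroring the two-year hold-out above.
train <- window(y, end = c(1958, 12))
test <- window(y, start = c(1959, 1))

# Fit on the training window only, then predict the held-out months.
fit <- arima(train, order = c(0, 1, 1),
             seasonal = list(order = c(0, 1, 1), period = 12))
pred <- predict(fit, n.ahead = length(test))$pred

# Error measures on the held-out months.
rmse <- sqrt(mean((test - pred)^2))
mape <- mean(abs((test - pred) / test)) * 100
cat("hold-out RMSE:", round(rmse, 4), " MAPE (%):", round(mape, 2), "\n")
```

Reporting an error measure such as RMSE or MAPE alongside the chart makes the comparison between the two crime models less dependent on visual judgement.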
Using the ARIMA models with the fine-tuned parameters from the previous section, the forecasts of Property Crime and Crime Against Persons for the next two years, 2019 and 2020, are shown in Figure 11 and Figure 12 respectively. The forecast values are listed in Table T1 and Table T2 in the Appendix.
Figure 11. Forecast for Property Crime
Figure 12. Forecast for Crime against Persons
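The shaded bands in these forecasts are prediction intervals. As a small worked check, the first-month interval for Property Crime can be reproduced from two numbers in Table T1, under the usual assumption that forecast errors are normally distributed and that the one-step standard error is approximately the square root of the estimated sigma^2. The helper `interval()` is illustrative only.

```r
# Values copied from Table T1 (Property Crime, Jan 2019).
sigma2 <- 37432    # "sigma^2 estimated as 37432"
point <- 1804.672  # point forecast for Jan 2019

# For the first forecast step the standard error is approximately sigma.
se1 <- sqrt(sigma2)

interval <- function(point, se, level) {
  z <- qnorm(1 - (1 - level) / 2)  # about 1.2816 for 80%, 1.9600 for 95%
  c(lower = point - z * se, upper = point + z * se)
}

print(round(interval(point, se1, 0.80), 1))  # lower 1556.7, upper 2052.6
print(round(interval(point, se1, 0.95), 1))  # lower 1425.5, upper 2183.9
```

These agree with the Lo 80, Hi 80, Lo 95 and Hi 95 columns for Jan 2019 in Table T1. For later months the standard error grows with the horizon, which is why the intervals in the table widen over time.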
Based on the results of my forecast, Property Crime will decrease slightly while Crime Against Persons will increase in the near future. This insight suggests that policy in the near future needs to concentrate more on Crime Against Persons. Below are some suggestions to reduce Crime Against Persons:
Gun control: In the last few years gun control, a major factor in Crime Against Persons, has been a very hot topic in Australia, especially after the Christchurch mosque shootings a couple of months ago.
Alcohol: According to the National Council on Alcoholism and Drug Dependence, alcohol is a factor in 40% of all violent crimes.
Community policing: This refers to police officers working consistently in the same area to build a proactive partnership with citizens in order to identify and solve problems. This tactic, which originated in the United States, is currently practised in Australia and should be encouraged to strengthen the relationship between the police and citizens.
Education: According to my group's research in assignment 2, education is one of the major factors related to crime. Research published in 2016 suggested that raising the age and grade at which students can drop out of school allows more children to complete school, which eventually reduces crime (Cook and Kang, 2016). This tactic should be used to reduce crime in the future.
Moreover, Figure 12 shows 3 peaks and 2 troughs over the 2-year period. According to the point forecasts in Table T2, Crime Against Persons is lowest around June to August and highest from December to March. Staffing, leave, and holiday schedules for the Police Department should therefore be planned around these months.
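The timing of the peaks and troughs can be read directly from the point forecasts in Table T2. The short sketch below copies those values and averages them by calendar month; the variable names are illustrative.

```r
# Point forecasts copied from Table T2 (Crime Against Persons, 2019-2020).
fc <- ts(c(729.2043, 657.0354, 734.7847, 653.1980, 641.0001, 602.3592,
           622.6645, 609.0782, 624.8280, 672.5460, 650.1362, 753.8464,
           725.7582, 641.4304, 736.1901, 645.9718, 637.5325, 599.6061,
           622.5491, 605.4041, 629.2302, 672.3173, 653.9639, 751.5136),
         start = c(2019, 1), frequency = 12)

# Average forecast per calendar month across the two years.
month_index <- as.integer(cycle(fc))
by_month <- sapply(1:12, function(m) mean(as.numeric(fc)[month_index == m]))
names(by_month) <- month.abb

lowest <- names(sort(by_month))[1:3]
highest <- names(sort(by_month, decreasing = TRUE))[1:3]
cat("lowest months: ", lowest, "\n")
cat("highest months:", highest, "\n")
```

On these values the three lowest months are June, August and July, and the three highest are December, March and January.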
The analysis in this blog is an extension of assignment 2, intended to give the government further insight for making better policy to reduce crime, particularly in Sydney. Previously, we found that unemployment and income inequality are strongly correlated with the crime rate. The analysis in this blog showed that Crime Against Persons is currently high and is forecast to increase in the near future. Moreover, the forecast shows a clear seasonal pattern, with the middle of the year being relatively quiet for Crime Against Persons. Some policy suggestions have therefore been proposed in response to the insights gained from the time series analysis.
In this assignment, I gained valuable skills in time series analysis. The knowledge gained from researching crime-related issues, particularly in Australia, is also valuable. The ARIMA models chosen in my analysis might not be the best ones, because the p-value of the Ljung-Box test for Crime Against Persons is still lower than desired, but validation showed good prediction results. In future work, a different time series technique should be applied and compared with the ARIMA models in this blog.
Cook, P. and Kang, S. (2016). Birthdays, Schooling, and Crime: Regression-Discontinuity Analysis of School Performance, Delinquency, Dropout, and Crime Initiation. American Economic Journal: Applied Economics, 8(1), pp.33-57.
Gentleman, J. and Whitmore, G. (1994). Case studies in data analysis. New York: Springer-Verlag.
Hyndman, R. and Athanasopoulos, G. (2018). Forecasting: Principles and Practice. [Heathmont, Vic.]: OTexts.
John Jay College of Criminal Justice (2014). Bridging the Great Divide: Can Police-Community Partnerships Reduce Crime and Strengthen Our Democracy?. [online] The City University of New York. Available at: http://www.jjay.cuny.edu/sites/default/files/Research/OSF_Panelist_Bios.pdf [Accessed 10 Jun. 2019].
Itl.nist.gov. (2019). NIST/SEMATECH e-Handbook of Statistical Methods. [online] Available at: http://www.itl.nist.gov/div898/handbook/ [Accessed 7 Jun. 2019].
Figure A1. Crime Cases based on LGAs
Figure A2. Crime Cases based on Crime Category
Figure A3. ACF & PACF of Property Crime (without differencing)
Figure A4. ACF & PACF of Crime against Persons (without differencing)
Table T1. Forecast for Property Crime
##
## Forecast method: ARIMA(3,1,1)(1,0,3)[12]
##
## Model Information:
##
## Call:
## arima(x = dfts.property, order = c(3, 1, 1), seasonal = list(order = c(1, 0,
## 3), period = 12), method = "ML")
##
## Coefficients:
## ar1 ar2 ar3 ma1 sar1 sma1 sma2 sma3
## 0.3563 0.1087 0.0509 -0.7569 0.9720 -0.8058 -0.1279 0.1398
## s.e. 0.1850 0.0969 0.0753 0.1736 0.0173 0.0667 0.0752 0.0725
##
## sigma^2 estimated as 37432: log likelihood = -1924.81, aic = 3867.63
##
## Error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -9.633881 193.1378 145.1631 -0.463487 4.891133 0.4846615
## ACF1
## Training set 0.001090959
##
## Forecasts:
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Jan 2019 1804.672 1556.7250 2052.619 1425.4699 2183.874
## Feb 2019 1604.105 1315.0302 1893.181 1162.0031 2046.208
## Mar 2019 1768.338 1447.0727 2089.604 1277.0050 2259.671
## Apr 2019 1657.438 1307.3996 2007.475 1122.1007 2192.774
## May 2019 1738.664 1364.3575 2112.970 1166.2118 2311.115
## Jun 2019 1627.615 1231.4594 2023.771 1021.7472 2233.483
## Jul 2019 1558.419 1142.1035 1974.734 921.7195 2195.118
## Aug 2019 1549.567 1114.3933 1984.740 884.0264 2215.107
## Sep 2019 1514.010 1060.9814 1967.038 821.1626 2206.857
## Oct 2019 1590.127 1120.0561 2060.197 871.2159 2309.037
## Nov 2019 1544.377 1057.9459 2030.807 800.4450 2288.308
## Dec 2019 1605.165 1102.9598 2107.371 837.1083 2373.222
## Jan 2020 1702.043 1173.1304 2230.956 893.1409 2510.946
## Feb 2020 1495.914 946.3185 2045.509 655.3804 2336.447
## Mar 2020 1678.536 1109.3911 2247.682 808.1038 2548.969
## Apr 2020 1554.593 966.6061 2142.580 655.3446 2453.841
## May 2020 1619.084 1013.1047 2225.063 692.3186 2545.850
## Jun 2020 1534.200 910.8613 2157.538 580.8858 2487.514
## Jul 2020 1454.175 814.0170 2094.333 475.1377 2433.212
## Aug 2020 1444.701 788.2026 2101.200 440.6730 2448.730
## Sep 2020 1429.560 757.1464 2101.974 401.1922 2457.928
## Oct 2020 1517.898 829.9562 2205.840 465.7818 2570.014
## Nov 2020 1478.740 775.6250 2181.856 403.4182 2554.063
## Dec 2020 1525.847 807.8863 2243.808 427.8207 2623.874
Table T2. Forecast for Crime against Persons
##
## Forecast method: ARIMA(3,1,1)(1,1,1)[12]
##
## Model Information:
##
## Call:
## arima(x = dfts.against, order = c(3, 1, 1), seasonal = list(order = c(1, 1,
## 1), period = 12), method = "CSS")
##
## Coefficients:
## ar1 ar2 ar3 ma1 sar1 sma1
## -0.0154 -0.0118 0.1384 -0.7768 -0.1218 -0.8478
## s.e. 0.1057 0.0862 0.0752 0.0847 0.0635 0.0408
##
## sigma^2 estimated as 2072: part log likelihood = -1440.2
##
## Error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -1.152649 43.25056 31.68385 -0.4530731 4.961031 0.5838001
## ACF1
## Training set -0.0006039902
##
## Forecasts:
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Jan 2019 729.2043 670.8644 787.5442 639.9811 818.4275
## Feb 2019 657.0354 597.4502 716.6205 565.9077 748.1630
## Mar 2019 734.7847 673.9749 795.5945 641.7842 827.7852
## Apr 2019 653.1980 588.9415 717.4544 554.9262 751.4697
## May 2019 641.0001 575.1864 706.8138 540.3468 741.6535
## Jun 2019 602.3592 535.0239 669.6946 499.3786 705.3398
## Jul 2019 622.6645 553.5672 691.7619 516.9893 728.3398
## Aug 2019 609.0782 538.4591 679.6972 501.0757 717.0806
## Sep 2019 624.8280 552.7195 696.9366 514.5476 735.1085
## Oct 2019 672.5460 598.9408 746.1511 559.9766 785.1153
## Nov 2019 650.1362 575.0899 725.1825 535.3628 764.9096
## Dec 2019 753.8464 677.3861 830.3066 636.9105 870.7822
## Jan 2020 725.7582 647.5505 803.9660 606.1498 845.3667
## Feb 2020 641.4304 561.7946 721.0662 519.6379 763.2229
## Mar 2020 736.1901 655.1513 817.2290 612.2519 860.1283
## Apr 2020 645.9718 563.5049 728.4387 519.8496 772.0941
## May 2020 637.5325 553.6983 721.3667 509.3191 765.7459
## Jun 2020 599.6061 514.4265 684.7857 469.3351 729.8770
## Jul 2020 622.5491 536.0382 709.0600 490.2421 754.8562
## Aug 2020 605.4041 517.5869 693.2212 471.0993 739.7088
## Sep 2020 629.2302 540.1260 718.3344 492.9571 765.5034
## Oct 2020 672.3173 581.9433 762.6913 534.1023 810.5324
## Nov 2020 653.9639 562.3385 745.5893 513.8349 794.0928
## Dec 2020 751.5136 658.6536 844.3736 609.4965 893.5307