Introduction

This article analyzes how different types of crime vary with time and geographic location. The question that interests me most is whether time affects the number of crimes and the response speed of the Toronto Police. From the dataset I mainly use crime type, time, and geographic data, and I use a linear model to evaluate police efficiency. I hope this research will help the Toronto Police allocate their officers sensibly.

I load the Toronto crime data in R Markdown and check it for missing values. A few rows do have missing values (see the warning under Plot 4), and they are dropped before the regression. To calculate the police response speed, I also add two columns: the difference between reported year and occurrence year, and the difference between reported day and occurrence day.

thefts <- read.csv("auto_thefts.csv")
# Total number of missing values (is.na() alone only builds a logical matrix)
n_missing <- sum(is.na(thefts))
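A per-column count is easier to read than a raw logical matrix when checking for missing values. The sketch below uses a small made-up data frame (`toy` and its values are assumptions for illustration, not rows from `auto_thefts.csv`); the same two calls could be run on `thefts`.

```r
# Hypothetical toy data standing in for the real dataset
toy <- data.frame(
  occurrenceyear = c(2018, NA, 2017),
  reportedyear   = c(2018, 2018, 2017),
  premisetype    = c("House", "Outside", NA)
)

# Number of missing values in each column
missing_per_column <- colSums(is.na(toy))
print(missing_per_column)

# Number of rows with at least one missing value
rows_with_missing <- sum(!complete.cases(toy))
print(rows_with_missing)
```

Rows counted by `complete.cases()` as incomplete are the ones that `na.omit()` would drop.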

Here are the first six rows of the dataset. The output shows 24 variables (26 once the two derived columns are added).

head(thefts)
##   Index_ event_unique_id           occurrencedate             reporteddate
## 1 169469  GO-20181562633 2018-08-24T03:00:00.000Z 2018-08-24T07:20:00.000Z
## 2 169470  GO-20181581089 2018-08-26T22:00:00.000Z 2018-08-27T00:12:00.000Z
## 3 169471  GO-20181582261 2018-08-27T05:39:00.000Z 2018-08-27T07:28:00.000Z
## 4 169472  GO-20181582390 2018-08-24T18:00:00.000Z 2018-08-27T08:06:00.000Z
## 5 169473  GO-20171056390 2017-06-13T22:30:00.000Z 2017-06-14T05:56:00.000Z
## 6 169474  GO-20171063947 2017-06-14T21:00:00.000Z 2017-06-15T06:11:00.000Z
##   premisetype                offence reportedyear reportedmonth
## 1       House Theft Of Motor Vehicle         2018        August
## 2     Outside Theft Of Motor Vehicle         2018        August
## 3       Other Theft Of Motor Vehicle         2018        August
## 4     Outside Theft Of Motor Vehicle         2018        August
## 5       House Theft Of Motor Vehicle         2017          June
## 6       House Theft Of Motor Vehicle         2017          June
##   reportedday reporteddayofyear reporteddayofweek reportedhour
## 1          24               236        Friday                7
## 2          27               239        Monday                0
## 3          27               239        Monday                7
## 4          27               239        Monday                8
## 5          14               165        Wednesday             5
## 6          15               166        Thursday              6
##   occurrenceyear occurrencemonth occurrenceday occurrencedayofyear
## 1           2018          August            24                 236
## 2           2018          August            26                 238
## 3           2018          August            27                 239
## 4           2018          August            24                 236
## 5           2017            June            13                 164
## 6           2017            June            14                 165
##   occurrencedayofweek occurrencehour        MCI Division Hood_ID
## 1          Friday                  3 Auto Theft      D42     130
## 2          Sunday                 22 Auto Theft      D42     131
## 3          Monday                  5 Auto Theft      D42     131
## 4          Friday                 18 Auto Theft      D41     126
## 5          Tuesday                22 Auto Theft      D12      28
## 6          Wednesday              21 Auto Theft      D22      15
##         Neighbourhood      Lat      Long
## 1      Milliken (130) 43.82499 -79.27251
## 2         Rouge (131) 43.80894 -79.20272
## 3         Rouge (131) 43.82441 -79.21520
## 4   Dorset Park (126) 43.75995 -79.27584
## 5         Rustic (28) 43.70583 -79.50460
## 6 Kingsway South (15) 43.65693 -79.50514

Statistical Methods

I mainly use bar charts and boxplots to present the descriptive results. For inferential statistics, I use hypothesis tests and simple linear regression.

Results

Plot 1

First, I used a bar chart to count and compare the number of crimes per year, broken down by premise type. As the figure shows, crimes at outside premises are the most frequent, followed by houses; apartments and other premises are the least frequent. The number of recorded crimes reached a new high in 2018.

library(ggplot2)
# Stacked bar chart of crimes per reported year, filled by premise type
g <- ggplot(thefts, aes(reportedyear))
g + geom_bar(aes(fill = premisetype))
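The counts behind a stacked bar chart like this can also be read directly from a contingency table. As a minimal sketch, the block below tabulates only the six rows printed by `head(thefts)`; `sample_rows` is a name introduced here for illustration, and on the real data the same calls would take `thefts$reportedyear` and `thefts$premisetype`.

```r
# Six rows copied from the head(thefts) output
sample_rows <- data.frame(
  reportedyear = c(2018, 2018, 2018, 2018, 2017, 2017),
  premisetype  = c("House", "Outside", "Other", "Outside", "House", "House")
)

# Counts per year (rows) and premise type (columns)
counts <- table(sample_rows$reportedyear, sample_rows$premisetype)
print(counts)

# Share of each premise type within a year (each row sums to 1)
shares <- prop.table(counts, margin = 1)
print(round(shares, 2))
```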

Plot 2

Second, we look at the number of crimes in each month. The monthly counts show a seasonal pattern: there are fewer crimes in winter, and the number rises in summer. So Toronto police should deploy more officers ahead of the summer to prevent crime.

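# Order months chronologically instead of alphabetically
thefts$reportedmonth <- factor(thefts$reportedmonth, levels = month.name)
g <- ggplot(thefts, aes(reportedmonth))
g + geom_bar(aes(fill = premisetype))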

Plot 3

Most crimes occur between the afternoon and the evening, but crimes at commercial and other premises are concentrated around noon. So Toronto police should allocate officers according to the time of day to deal with incidents.

p <- ggplot(thefts, aes(premisetype, occurrencehour))
p + geom_boxplot() + coord_flip() +
  theme(axis.line = element_line(colour = "black"),
        panel.background = element_rect(fill = NA))

Plot 4

I calculated the difference, in years, between the date of the crime and the date of the report. The result is shocking: although most cases are discovered promptly, some, especially at outside and apartment premises, were discovered more than five or even ten years after they occurred. This suggests that the efficiency of the Toronto Police needs to be improved.

# Reporting delay in whole years (NA where a year is missing)
thefts$year <- thefts$reportedyear - thefts$occurrenceyear
p <- ggplot(thefts, aes(premisetype, year))
p + geom_boxplot() + coord_flip() +
  theme(axis.line = element_line(colour = "black"),
        panel.background = element_rect(fill = NA))
## Warning: Removed 3 rows containing non-finite values (stat_boxplot).
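The year difference is a coarse measure of delay. A finer one can be computed from the `occurrencedate` and `reporteddate` timestamps. The sketch below is one possible approach, not the method used in this analysis: it parses two timestamp pairs copied from the `head(thefts)` output (rows 1 and 4) and assumes the trailing `Z` means UTC. The helper `parse_utc` is a name introduced here for illustration.

```r
# Two timestamp pairs copied from the head(thefts) output (rows 1 and 4)
occurred <- c("2018-08-24T03:00:00.000Z", "2018-08-24T18:00:00.000Z")
reported <- c("2018-08-24T07:20:00.000Z", "2018-08-27T08:06:00.000Z")

# Keep the first 19 characters ("YYYY-MM-DDTHH:MM:SS") and parse them as UTC
parse_utc <- function(x) {
  as.POSIXct(substr(x, 1, 19), format = "%Y-%m-%dT%H:%M:%S", tz = "UTC")
}

# Delay between occurrence and report, in hours and in days
delay_hours <- as.numeric(difftime(parse_utc(reported), parse_utc(occurred),
                                   units = "hours"))
delay_days <- delay_hours / 24
print(round(delay_hours, 2))
print(round(delay_days, 2))
```

For these two rows the delays are a little over four hours and about two and a half days, differences that a year-level subtraction reports as zero.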

Plot 5

Next is a scatter plot of crime locations in Toronto. Most crimes occurred between longitude -79.5 and -79.3 and latitude 43.65 and 43.75. This suggests that the police force in this area should be better equipped to prevent further incidents.

plot(thefts$Long, thefts$Lat, xlab = "Longitude", ylab = "Latitude")
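The claim about where crimes concentrate can be checked numerically by computing the share of points inside the bounding box. The sketch below assumes the box is longitude -79.5 to -79.3 and latitude 43.65 to 43.75 (taking latitude as positive, as in the `Lat` column) and uses only the six coordinates printed by `head(thefts)`, so the resulting share says nothing about the full dataset; `pts` is a name introduced here for illustration.

```r
# Coordinates copied from the head(thefts) output
pts <- data.frame(
  Lat  = c(43.82499, 43.80894, 43.82441, 43.75995, 43.70583, 43.65693),
  Long = c(-79.27251, -79.20272, -79.21520, -79.27584, -79.50460, -79.50514)
)

# TRUE for points that fall inside the assumed bounding box
in_box <- with(pts, Long >= -79.5 & Long <= -79.3 & Lat >= 43.65 & Lat <= 43.75)

# Share of points inside the box
share_in_box <- mean(in_box)
print(share_in_box)
```

None of these six rows happens to fall inside the box; running the same expression on `thefts$Lat` and `thefts$Long` would give the share for all records.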

Plot 6

Finally, we fit a linear regression to judge the efficiency of the Toronto Police in detecting crimes. The predictor (x-axis) is the time the crime occurred, measured in days since the start of 2000, and the response (y-axis) is the delay in days between occurrence and report. The model is fitted only to records with oc below 5500, that is, crimes that occurred before late January 2015; more recent records, 2018 in particular, are left out because their large number would dominate the fit.

R-squared is 0.6712, so the model explains about 67% of the variance in the reporting delay. The slope is negative: the earlier a crime occurred, the longer it took to be discovered.

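# Reporting delay in days, approximating every year as 365 days
thefts$day <- thefts$reporteddayofyear - thefts$occurrencedayofyear
thefts$year <- thefts$reportedyear - thefts$occurrenceyear
thefts$accuratedat <- 365 * thefts$year + thefts$day
thefts <- na.omit(thefts)
# Occurrence time in days since the start of 2000
thefts$oc <- 365 * (thefts$occurrenceyear - 2000) + thefts$occurrencedayofyear
# Keep only crimes that occurred before oc = 5500 (late January 2015)
thefts <- thefts[thefts$oc < 5500, ]
summary(lm(accuratedat ~ oc, data = thefts))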
## 
## Call:
## lm(formula = accuratedat ~ oc, data = thefts)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -169.20  -81.86    0.63   73.10 2361.27 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.378e+03  5.023e+01   87.17   <2e-16 ***
## oc          -8.236e-01  9.475e-03  -86.92   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 114.9 on 3700 degrees of freedom
## Multiple R-squared:  0.6712, Adjusted R-squared:  0.6711 
## F-statistic:  7554 on 1 and 3700 DF,  p-value: < 2.2e-16
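To make the coefficients concrete, the fitted line can be evaluated for a particular occurrence date. The sketch below copies the rounded intercept and slope from the summary output and reuses the same 365-day approximation for `oc`; the helper names `oc_from_date` and `predicted_delay` are introduced here for illustration, and the result is only as precise as the rounded coefficients.

```r
# Coefficients copied (rounded) from the summary output for accuratedat ~ oc
intercept <- 4.378e+03
slope     <- -8.236e-01

# Days since the start of 2000, treating every year as 365 days
oc_from_date <- function(year, dayofyear) 365 * (year - 2000) + dayofyear

# Predicted reporting delay in days for a given occurrence time
predicted_delay <- function(oc) intercept + slope * oc

oc_2010    <- oc_from_date(2010, 1)      # 3651
delay_2010 <- predicted_delay(oc_2010)   # about 1371 days
print(oc_2010)
print(round(delay_2010))
```

Under this fit, a crime that occurred on 1 January 2010 has a predicted delay of about 1371 days, a little under four years.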

Multiple linear regression

After adding a second predictor, premisetype, R-squared rises only slightly, to 0.6717, and none of the premisetype coefficients is statistically significant.

mod1<-lm(accuratedat~oc+premisetype,data = thefts)
summary(mod1)
## 
## Call:
## lm(formula = accuratedat ~ oc + premisetype, data = thefts)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -171.32  -81.25    1.49   73.51 2356.81 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            4.380e+03  5.104e+01  85.830   <2e-16 ***
## oc                    -8.239e-01  9.486e-03 -86.851   <2e-16 ***
## premisetypeCommercial  7.577e+00  1.301e+01   0.582    0.560    
## premisetypeHouse       2.726e+00  1.181e+01   0.231    0.818    
## premisetypeOther      -1.923e+01  1.586e+01  -1.212    0.225    
## premisetypeOutside    -1.424e+00  1.142e+01  -0.125    0.901    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 114.9 on 3696 degrees of freedom
## Multiple R-squared:  0.6717, Adjusted R-squared:  0.6712 
## F-statistic:  1512 on 5 and 3696 DF,  p-value: < 2.2e-16
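Whether premisetype adds anything to the model can be checked with a partial F-test. The sketch below computes an approximate version from the two reported R-squared values; because those values are rounded to four decimals, the statistic is only rough. With the fitted models at hand, `anova()` on the reduced and full models would give the exact test.

```r
# Values copied from the two summary outputs
r2_reduced  <- 0.6712   # accuratedat ~ oc
r2_full     <- 0.6717   # accuratedat ~ oc + premisetype
extra_terms <- 4        # premisetype coefficients added to the model
df_residual <- 3696     # residual degrees of freedom of the full model

# Partial F statistic for the added terms
f_stat <- ((r2_full - r2_reduced) / extra_terms) /
  ((1 - r2_full) / df_residual)

# Upper-tail p-value from the F distribution
p_value <- pf(f_stat, extra_terms, df_residual, lower.tail = FALSE)
print(round(f_stat, 2))
print(round(p_value, 2))
```

The F statistic is about 1.4 and the p-value is well above 0.05, which agrees with the individual t-tests: premisetype does not significantly improve the model.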

Plot 7

The fitted regression lines for the different premise types nearly coincide, which suggests that premise type has little effect on the time it takes to discover a crime.

library(broom)
ggplot(thefts, aes(x = oc, y = accuratedat, color = factor(premisetype))) +
  geom_point(alpha = 0.5) +
  geom_line(data = augment(mod1),
            aes(y = .fitted, colour = factor(premisetype)))

Conclusion

According to the charts above, the number of recorded crimes increased from 2016 to 2018, and most crimes were concentrated in summer and in the afternoon. The efficiency of the Toronto Police still needs to improve, because some crimes were discovered nearly 15 years after they occurred. This shocked me because it shows that the police have blind spots. In the city center, between longitude -79.5 and -79.3 and latitude 43.65 and 43.75, the number of crimes is significantly higher than in other areas, so the police should allocate officers according to time and place. According to the linear regression, the later a crime occurred, the sooner it was detected, which suggests that police efficiency has been improving.