Research goal

We want to estimate criminal violence level in Chicago using open data that is too really big for primitive handy calculations.

Data

We use dataset “Crimes - 2001 to present” from data.gov. This dataset reflects reported incidents of crime that occurred in the City of Chicago from 2001 to present. Data is extracted from the Chicago Police Department’s CLEAR (Citizen Law Enforcement Analysis and Reporting) system.

For more about Chicago criminal data see https://catalog.data.gov/dataset/crimes-2001-to-present-398a4. The previous parts of the current research see here https://rpubs.com/alex-lev.

Preparing data

We omit reading file with data and converting character strings to numbers as the time consuming routine process

#chicago_crime<-readRDS(file = "chicago_crime.rds") 
#it takes two minutes to load compressed file in memory
#chicago_crime <- na.omit(chicago_crime)#omit NA
#dd_hms <- separate(chicago_crime,"Day_time",c("MDY","HMS"),sep = " ")
#separate Day_time by two parts
#dd_hms <- separate(dd_hms,"MDY",c("Y","M","D"),convert=T)
#separeate Year - Y, Month - M, Day - D as numbers
#d_hom <- filter(dd_hms,dd_hms$`Primary Type`=="HOMICIDE")
#filtering data by type of crime - HOMICIDE

Agregating and arranging data

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
load(file = "d_hom.RData")
d_hom_ym <- filter(d_hom,Y!=2017) %>% group_by(Y) %>% count(M) %>% arrange()
#Homocide data for 2001 - 2016 by years with months summary
d_hom_ym
## Source: local data frame [192 x 3]
## Groups: Y [?]
## 
##        Y     M     n
##    <int> <int> <int>
## 1   2001     1    28
## 2   2001     2    20
## 3   2001     3    31
## 4   2001     4    45
## 5   2001     5    31
## 6   2001     6    56
## 7   2001     7    60
## 8   2001     8    48
## 9   2001     9    40
## 10  2001    10    38
## # ... with 182 more rows

Violence proccess control

We use quality control package for demonstrating violence rate (number of homicides by month) fluctuations.

library(qcc)
## Package 'qcc', version 2.6
## Type 'citation("qcc")' for citing this R package in publications.
d_hom_xbar <- qcc(d_hom_ym$n,type = "R",sizes = 12,std.dev = sd(d_hom_ym$n),labels = d_hom_ym$Y)

d_hom_xbar$limits #low and upper limits by 3 sigma
##       LCL      UCL
##  9.067561 72.00536
d_hom_crit <- d_hom_xbar$violations$beyond.limits# critical time points with extream number of homicides beyond random gaussian proccess limits
d_hom_ym[d_hom_crit,1:3]
## Source: local data frame [5 x 3]
## Groups: Y [2]
## 
##       Y     M     n
##   <int> <int> <int>
## 1  2002     8    77
## 2  2016     6    75
## 3  2016     8    92
## 4  2016    10    80
## 5  2016    11    78

As we see the second half of 2016 homicide rate demonstrates anomaly (remember D.Trump replica for Chicago crimes). But we can see August 2002 too as critical time. Is B.Obama presidency worse than G.W.Bush one for Chicago violence? Let’s see!

Student test

We want to compare two sample means of homicide numbers for Bush and Obama for 8 years or 96 months. Our null hypothesis is \(H_0:N_B=N_O\) - means are equal for both presidents and alternative one - \(H_1:{N_B}\neq{N_O}\) - means are not equal.

Frequentist case

To begin with we should split data for two time periods: 2001 - 2008 as Bush and 2009 - 20016 as Obama.

Bush <- d_hom_ym %>% filter(Y<=2008)
Obama <- d_hom_ym %>% filter(Y>2008)
hist(Bush$n, col="red", xlab = "Chicago homicide rate by month", main = "G.W.Bush, 2001 - 2008")

hist(Obama$n, col="blue", xlab = "Chicago homicide rate by month", main = "B.Obama, 2009 - 2016")

Histograms demonstrate outliers for Obama presidency. Really, it’s an amazing thing - the first afroamerican president in the 250 years history of the USA inspired violence at the end of his presidency! God bless D.Trump to end this bloody stuff in Chicago if possible!!!

Now we come up to the test. Attention, please!

shapiro.test(Bush$n) # Bush is normal
## 
##  Shapiro-Wilk normality test
## 
## data:  Bush$n
## W = 0.97749, p-value = 0.09741
shapiro.test(Obama$n) # Obama is not normal
## 
##  Shapiro-Wilk normality test
## 
## data:  Obama$n
## W = 0.95174, p-value = 0.00142
var.test(Bush$n,Obama$n)
## 
##  F test to compare two variances
## 
## data:  Bush$n and Obama$n
## F = 0.82403, num df = 95, denom df = 95, p-value = 0.3472
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.5498319 1.2349819
## sample estimates:
## ratio of variances 
##          0.8240342
t.test(Bush$n,Obama$n,mu = 0,var.equal = T)
## 
##  Two Sample t-test
## 
## data:  Bush$n and Obama$n
## t = -0.048078, df = 190, p-value = 0.9617
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.940098  3.752598
## sample estimates:
## mean of x mean of y 
##  40.48958  40.58333

We see no difference of two means at all ( p-value = 0.9617)!

Bayesian case

Let’s try Bayes for Student test.

library(BEST)
BESTout <- BESTmcmc(Bush$n,Obama$n,rnd.seed = 12345,numSavedSteps = 5000,verbose = F)
summary(BESTout)
##              mean  median     mode HDI%  HDIlo  HDIup compVal %>compVal
## mu1       40.2281 40.2202 40.05100   95 37.660 42.936                  
## mu2       40.1032 40.1179 40.21258   95 37.490 43.056                  
## muDiff     0.1249  0.1303 -0.06192   95 -3.597  3.963       0      52.5
## sigma1    12.4497 12.4096 12.32932   95 10.481 14.543                  
## sigma2    13.1998 13.1974 13.31158   95 10.456 15.549                  
## sigmaDiff -0.7501 -0.7569 -0.68749   95 -3.736  2.048       0      30.1
## nu        28.2679 20.9335 11.74518   95  3.333 75.507                  
## log10nu    1.3296  1.3208  1.36331   95  0.759  1.950                  
## effSz      0.0102  0.0104 -0.00537   95 -0.285  0.306       0      52.5
plot(BESTout,which = "mean")

plot(BESTout,which = "sd")

plot(BESTout,which = "effect")

At first glance we see the same result as in previous frequentest test. But the devil is in the details. we will explain details below.

Conclusions

  1. Chicago criminal violence process in terms of homicide rate demonstrates random fluctuations by years and months.
  2. Applying quality control to the data gives interesting results.
  3. There are distinct outbreaks of violence due to increasing homicide rate in the end of B.Obama presidency (second half of 2016).
  4. Student test for G.W.Bush and B.Obama means both in frequentest and Bayesian models demonstrate no difference. That is good news for democrats and D.Trump’s opponents especially.
  5. The bad news according to Bayes and MCMC is that the difference in means is shifted to Obama time (not to G.W.Bush) with 52.5% probability while difference in standard deviations is shifted to G.W.Bush time with 70% probability. But the effect size is shifted to B.Obama time too with 52.5% probability though very little.