We want to estimate the level of criminal violence in Chicago using open data that is far too large for manual calculations.
We use the dataset “Crimes - 2001 to present” from data.gov. This dataset reflects reported incidents of crime that occurred in the City of Chicago from 2001 to the present. The data is extracted from the Chicago Police Department’s CLEAR (Citizen Law Enforcement Analysis and Reporting) system.
For more about the Chicago crime data, see https://catalog.data.gov/dataset/crimes-2001-to-present-398a4. For the previous parts of this research, see https://rpubs.com/alex-lev.
We omit reading the data file and converting character strings to numbers, as this is a time-consuming routine process:
#chicago_crime <- readRDS(file = "chicago_crime.rds")
#it takes about two minutes to load the compressed file into memory
#chicago_crime <- na.omit(chicago_crime) #omit rows with NAs
#dd_hms <- separate(chicago_crime, "Day_time", c("MDY","HMS"), sep = " ")
#separate Day_time into two parts: date and time
#dd_hms <- separate(dd_hms, "MDY", c("Y","M","D"), convert = TRUE)
#separate Year - Y, Month - M, Day - D as numbers
#d_hom <- filter(dd_hms, dd_hms$`Primary Type` == "HOMICIDE")
#filter the data by type of crime - HOMICIDE
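For completeness, here is a runnable sketch of the omitted preprocessing. It assumes the file name "chicago_crime.rds" and the column names "Day_time" and "Primary Type" from the commented code above; note that separate() comes from the tidyr package, not dplyr.
library(dplyr)
library(tidyr) #provides separate()
#assumed file and column names, taken from the commented code above
chicago_crime <- readRDS("chicago_crime.rds") #~2 minutes for the compressed file
d_hom <- chicago_crime %>%
  na.omit() %>%                                        #drop rows with NAs
  separate("Day_time", c("MDY","HMS"), sep = " ") %>%  #split date and time
  separate("MDY", c("Y","M","D"), convert = TRUE) %>%  #numeric year/month/day
  filter(`Primary Type` == "HOMICIDE")                 #keep homicides only
save(d_hom, file = "d_hom.RData")                      #cache used below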
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
load(file = "d_hom.RData")
d_hom_ym <- filter(d_hom, Y != 2017) %>% group_by(Y) %>% count(M) %>% arrange(Y, M)
#Homicide counts for 2001 - 2016, grouped by year and month
d_hom_ym
## Source: local data frame [192 x 3]
## Groups: Y [?]
##
## Y M n
## <int> <int> <int>
## 1 2001 1 28
## 2 2001 2 20
## 3 2001 3 31
## 4 2001 4 45
## 5 2001 5 31
## 6 2001 6 56
## 7 2001 7 60
## 8 2001 8 48
## 9 2001 9 40
## 10 2001 10 38
## # ... with 182 more rows
We use the qcc quality control package to visualize fluctuations in the violence rate (the number of homicides per month).
library(qcc)
## Package 'qcc', version 2.6
## Type 'citation("qcc")' for citing this R package in publications.
d_hom_xbar <- qcc(d_hom_ym$n, type = "R", sizes = 12, std.dev = sd(d_hom_ym$n), labels = d_hom_ym$Y)
d_hom_xbar$limits #lower and upper control limits (3 sigma)
## LCL UCL
## 9.067561 72.00536
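As a rough sanity check (a sketch, not part of the original analysis), we can compare these limits with the naive mean ± 3·SD of the raw monthly counts; qcc derives its limits from the chart type and the supplied sizes and std.dev, so the numbers need not match exactly.
m <- mean(d_hom_ym$n)
s <- sd(d_hom_ym$n)
c(LCL = m - 3 * s, UCL = m + 3 * s) #naive 3-sigma limits for comparison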
d_hom_crit <- d_hom_xbar$violations$beyond.limits #critical time points with an extreme number of homicides, beyond the limits of a random Gaussian process
d_hom_ym[d_hom_crit,1:3]
## Source: local data frame [5 x 3]
## Groups: Y [2]
##
## Y M n
## <int> <int> <int>
## 1 2002 8 77
## 2 2016 6 75
## 3 2016 8 92
## 4 2016 10 80
## 5 2016 11 78
As we can see, the homicide rate in the second half of 2016 is anomalous (recall D.Trump’s remarks about Chicago crime). But August 2002 also appears as a critical point. Was B.Obama’s presidency worse than G.W.Bush’s for Chicago violence? Let’s see!
We want to compare the two sample means of monthly homicide counts for Bush and Obama, each covering 8 years, or 96 months. Our null hypothesis is \(H_0:N_B=N_O\) - the means are equal for both presidents - and the alternative is \(H_1:{N_B}\neq{N_O}\) - the means are not equal.
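For reference, the pooled (equal-variance) two-sample t statistic applied below is
\[ t = \frac{\bar{N}_B - \bar{N}_O}{s_p\sqrt{\frac{1}{n_B}+\frac{1}{n_O}}}, \qquad s_p^2 = \frac{(n_B-1)s_B^2 + (n_O-1)s_O^2}{n_B+n_O-2}, \]
with \(n_B=n_O=96\), so the test has \(n_B+n_O-2=190\) degrees of freedom, matching the t-test output below.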
To begin with, we split the data into two periods: 2001 - 2008 for Bush and 2009 - 2016 for Obama.
Bush <- d_hom_ym %>% filter(Y<=2008)
Obama <- d_hom_ym %>% filter(Y>2008)
hist(Bush$n, col="red", xlab = "Chicago homicide rate by month", main = "G.W.Bush, 2001 - 2008")
hist(Obama$n, col="blue", xlab = "Chicago homicide rate by month", main = "B.Obama, 2009 - 2016")
The histograms show outliers during the Obama presidency. Really, it is an amazing thing: the tenure of the first African American president in the history of the USA ended with a surge of violence! God bless D.Trump to end this bloodshed in Chicago, if possible!
Now we come to the tests. Attention, please!
shapiro.test(Bush$n) #Bush sample: normality is not rejected (p > 0.05)
##
## Shapiro-Wilk normality test
##
## data: Bush$n
## W = 0.97749, p-value = 0.09741
shapiro.test(Obama$n) #Obama sample: normality is rejected (p < 0.05)
##
## Shapiro-Wilk normality test
##
## data: Obama$n
## W = 0.95174, p-value = 0.00142
var.test(Bush$n, Obama$n) #check the equal-variance assumption before the pooled t-test
##
## F test to compare two variances
##
## data: Bush$n and Obama$n
## F = 0.82403, num df = 95, denom df = 95, p-value = 0.3472
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.5498319 1.2349819
## sample estimates:
## ratio of variances
## 0.8240342
Since the F test does not reject the equality of variances (p-value = 0.3472), we can use the pooled two-sample t-test.
t.test(Bush$n, Obama$n, mu = 0, var.equal = TRUE)
##
## Two Sample t-test
##
## data: Bush$n and Obama$n
## t = -0.048078, df = 190, p-value = 0.9617
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -3.940098 3.752598
## sample estimates:
## mean of x mean of y
## 40.48958 40.58333
We see no difference between the two means at all (p-value = 0.9617)!
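Since the Obama sample failed the normality test, a nonparametric cross-check is also reasonable; here is a minimal sketch (not part of the original analysis) using the Wilcoxon rank-sum (Mann-Whitney) test, which does not assume normality.
wilcox.test(Bush$n, Obama$n) #compares the two samples without a normality assumption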
Let’s try the Bayesian analogue of Student’s t-test (BEST: Bayesian Estimation Supersedes the t Test).
library(BEST)
BESTout <- BESTmcmc(Bush$n, Obama$n, rnd.seed = 12345, numSavedSteps = 5000, verbose = FALSE)
summary(BESTout)
## mean median mode HDI% HDIlo HDIup compVal %>compVal
## mu1 40.2281 40.2202 40.05100 95 37.660 42.936
## mu2 40.1032 40.1179 40.21258 95 37.490 43.056
## muDiff 0.1249 0.1303 -0.06192 95 -3.597 3.963 0 52.5
## sigma1 12.4497 12.4096 12.32932 95 10.481 14.543
## sigma2 13.1998 13.1974 13.31158 95 10.456 15.549
## sigmaDiff -0.7501 -0.7569 -0.68749 95 -3.736 2.048 0 30.1
## nu 28.2679 20.9335 11.74518 95 3.333 75.507
## log10nu 1.3296 1.3208 1.36331 95 0.759 1.950
## effSz 0.0102 0.0104 -0.00537 95 -0.285 0.306 0 52.5
plot(BESTout,which = "mean")
plot(BESTout,which = "sd")
plot(BESTout,which = "effect")
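To make the %>compVal column above concrete: BESTmcmc returns a data frame of posterior draws (columns mu1, mu2, sigma1, sigma2, nu), so the posterior probability that the Bush-era mean exceeds the Obama-era mean can be read off directly (a sketch, assuming the BESTout object above).
mean(BESTout$mu1 > BESTout$mu2) #posterior P(mu1 > mu2); about 0.525, matching the 52.5 above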
At first glance we see the same result as in the previous frequentist test. But the devil is in the details. We will explain the details below.