Our assignment as a group is to analyze our dataset on crime from 50 US states and synthesize a report. We are required to design a statistical model which will best predict the variables which are most closely related to the risk for murder and nonnegligent manslaughter. This will allow us to find out which initiatives are worth funding to reduce the rate of murder. This report shows the steps taken by me to analyze the data and design the model.
library(tidyverse)
library(lmtest)
library(broom)
library(MASS)
library(knitr)
load("crime.RData")
summary(crime)
## state murde punemploy bradyscore
## Length:50 Min. : 10.00 Min. :2.800 Min. :-39.00
## Class :character 1st Qu.: 62.25 1st Qu.:4.225 1st Qu.:-19.00
## Mode :character Median : 195.00 Median :5.050 Median : -8.75
## Mean : 310.68 Mean :5.006 Mean : 3.94
## 3rd Qu.: 462.25 3rd Qu.:5.950 3rd Qu.: 19.62
## Max. :1861.00 Max. :6.700 Max. : 76.00
## totpop medage rate_robbe rate_aggra
## Min. : 586102 Min. :30.0 Min. : 10.10 Min. : 69.3
## 1st Qu.: 1853216 1st Qu.:36.0 1st Qu.: 52.70 1st Qu.:161.2
## Median : 4546634 Median :37.5 Median : 79.45 Median :224.8
## Mean : 6407342 Mean :37.6 Mean : 80.71 Mean :234.1
## 3rd Qu.: 7065179 3rd Qu.:39.0 3rd Qu.:107.92 3rd Qu.:289.1
## Max. :39032444 Max. :44.0 Max. :217.50 Max. :497.1
## rate_burgl rate_larce rate_motor sqmiles
## Min. :223.7 Min. :1064 Min. : 28.4 Min. : 1034
## 1st Qu.:351.8 1st Qu.:1435 1st Qu.:132.1 1st Qu.: 36741
## Median :466.8 Median :1784 Median :194.4 Median : 53891
## Mean :488.9 Mean :1783 Mean :199.6 Mean : 70637
## 3rd Qu.:587.3 3rd Qu.:2064 3rd Qu.:255.8 3rd Qu.: 81226
## Max. :828.8 Max. :2934 Max. :436.8 Max. :570641
The summary is showing that the average number of murders in 2015 was about 311. While average score for legislations from The Brady Campaign was 3.94. The median score is -8.75, and maximum is 76. The average aggravated assault in this year was 234 per 100,000 inhabitants. The average rates for robbery, burglary, and larceny/theft was 80.71, 488.9, and 1783 per 100,000 inhabitants.
head(crime)
## # A tibble: 6 x 12
## state murde punemploy bradyscore totpop medage rate_robbe rate_aggra
## <chr> <dbl> <dbl> <dbl> <int> <int> <dbl> <dbl>
## 1 alab~ 348 6.1 -18 4.85e6 38 94.9 328.
## 2 alas~ 59 6.5 -30 7.38e5 33 103. 497.
## 3 ariz~ 309 6.1 -39 6.80e6 37 93.1 267.
## 4 arka~ 181 5 -24 2.98e6 37 70.4 380
## 5 cali~ 1861 6.2 76 3.90e7 36 135 254.
## 6 colo~ 176 3.9 22 5.44e6 36 60.9 197.
## # ... with 4 more variables: rate_burgl <dbl>, rate_larce <dbl>,
## # rate_motor <dbl>, sqmiles <dbl>
dim(crime)
## [1] 50 12
This dataset contains data from 50 states and 12 variables.
hist(rpois(10000, lambda = 5), main = "", xlab = "murde")
This is a multiple linear regression including “percentage unemployed” and “Brady Score” variables to calculate the murder rate.
model1 <- lm(murde ~ punemploy + bradyscore, data = crime)
model1 %>% summary
##
## Call:
## lm(formula = murde ~ punemploy + bradyscore, data = crime)
##
## Residuals:
## Min 1Q Median 3Q Max
## -516.92 -159.74 -50.45 112.53 1242.14
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -269.493 230.032 -1.172 0.2473
## punemploy 114.016 44.992 2.534 0.0147 *
## bradyscore 2.388 1.474 1.620 0.1119
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 332.6 on 47 degrees of freedom
## Multiple R-squared: 0.1643, Adjusted R-squared: 0.1287
## F-statistic: 4.62 on 2 and 47 DF, p-value: 0.01473
The regression model can be written as:
E(No. of murders) = -269.5 + 114(percentage unemployed) + 2.39(brady score)
plot(model1,1)
plot(model1, 2)
The above graphs show that 2 of the assumptions of normality were met by this model for multilinear regression.
model1 %>% AIC
## [1] 727.4865
This model is includes the variables “percentage unemployed” and “Brady Score” to calculate the murder rates in a Poisson regression model.
model2 <- glm(murde ~ punemploy + bradyscore, data = crime, family = "poisson")
model2 %>% summary
##
## Call:
## glm(formula = murde ~ punemploy + bradyscore, family = "poisson",
## data = crime)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -30.700 -13.835 -3.731 5.836 51.596
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.5250520 0.0481641 73.19 <2e-16 ***
## punemploy 0.4141513 0.0087533 47.31 <2e-16 ***
## bradyscore 0.0066528 0.0002168 30.68 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 15937 on 49 degrees of freedom
## Residual deviance: 12571 on 47 degrees of freedom
## AIC: 12922
##
## Number of Fisher Scoring iterations: 5
The regression model can be written as:
log(Risk of murder) = 3.53 + 0.41(percentage unemployed) + 0.007(brady score)
model2 %>% coef() %>% exp()
## (Intercept) punemploy bradyscore
## 33.955539 1.513086 1.006675
model2 %>% confint() %>% exp()
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) 30.887034 37.305599
## punemploy 1.487399 1.539322
## bradyscore 1.006247 1.007102
model2 %>% AIC
## [1] 12922.01
The hypotheses are as follows:
checkpois <- function(pglm) {
sum(residuals(pglm, type = "pearson")^2) %>%
pchisq(., pglm$df.residual, lower.tail = F)
}
checkpois(model2)
## [1] 0
Since it is [1], the assumptoms were not met by this model.
Accodring to Tatum’s analysis on her models, we have decided that this model will be the best fit for our data. We have analyzed other variables and have chosen “punemploy”, “bradyscore” and “rate_robbe” according to their significance. Offset will be applied which will include the total population as the denominator.
model3 <- glm(murde ~ punemploy + bradyscore + rate_robbe, data = crime, family = "poisson", offset = log(totpop))
model3 %>% summary
##
## Call:
## glm(formula = murde ~ punemploy + bradyscore + rate_robbe, family = "poisson",
## data = crime, offset = log(totpop))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -9.954 -3.352 -0.074 2.238 11.085
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.115e+01 5.851e-02 -190.55 <2e-16 ***
## punemploy 1.500e-01 1.212e-02 12.38 <2e-16 ***
## bradyscore -4.900e-03 2.286e-04 -21.43 <2e-16 ***
## rate_robbe 4.666e-03 3.039e-04 15.35 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 1809.25 on 49 degrees of freedom
## Residual deviance: 925.67 on 46 degrees of freedom
## AIC: 1279.1
##
## Number of Fisher Scoring iterations: 4
The regression model can be written as:
log(Risk of murder) = -1.12 + 0.15 (percentage unemployed) - 0.0049(brady score) + 0.00467(rate of robbery)
We exponentiate our coefficients to get the relative risks.
model3 %>% coef() %>% exp()
## (Intercept) punemploy bradyscore rate_robbe
## 1.438742e-05 1.161838e+00 9.951118e-01 1.004677e+00
model3 %>% confint() %>% exp()
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) 1.282495e-05 1.613119e-05
## punemploy 1.134591e+00 1.189786e+00
## bradyscore 9.946656e-01 9.955575e-01
## rate_robbe 1.004078e+00 1.005274e+00
table3 <- tidy(model3, conf.int = T) %>% mutate(.,
RR = exp(estimate),
lb = exp(conf.low),
ub = exp(conf.high)) %>%
filter(., term != "(Intercept)") %>%
dplyr::select(., term, RR : ub)
table3
## # A tibble: 3 x 4
## term RR lb ub
## <chr> <dbl> <dbl> <dbl>
## 1 punemploy 1.16 1.13 1.19
## 2 bradyscore 0.995 0.995 0.996
## 3 rate_robbe 1.00 1.00 1.01
kable(table3, digits = 3)
| term | RR | lb | ub |
|---|---|---|---|
| punemploy | 1.162 | 1.135 | 1.190 |
| bradyscore | 0.995 | 0.995 | 0.996 |
| rate_robbe | 1.005 | 1.004 | 1.005 |
ggplot(table3, aes(x = term, y= RR)) +
geom_pointrange(aes(ymin = lb, ymax = ub)) +
geom_hline(yintercept = 1, linetype = 2, color = "blue")
model3 %>% AIC
## [1] 1279.114
checkpois <- function(pglm) {
sum(residuals(pglm, type = "pearson")^2) %>%
pchisq(., pglm$df.residual, lower.tail = F)
}
checkpois(model3)
## [1] 2.360128e-166
The assumptions for this model was still not met.
model3ca <- glm(murde ~ punemploy, data = crime,
family = "quasipoisson", offset = log(totpop))
model3ca %>% coef() %>% exp()
## (Intercept) punemploy
## 1.741289e-05 1.210681e+00
model3ca %>% confint() %>% exp()
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) 9.135169e-06 3.252276e-05
## punemploy 1.079199e+00 1.361442e+00
Since “punemploy” alone is 1.21, and with other variables is 1.16 it is not a confounder.
model3cb <- glm(murde ~ bradyscore, data = crime,
family = "quasipoisson", offset = log(totpop))
model3cb %>% coef() %>% exp()
## (Intercept) bradyscore
## 5.051273e-05 9.970790e-01
model3cb %>% confint() %>% exp()
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) 4.579944e-05 5.552889e-05
## bradyscore 9.945165e-01 9.995924e-01
Bradyscore alone is 9.970790e-01, and with other variables is 9.951118e-01. It is also not a confounder.
model3cc <- glm(murde ~ rate_robbe, data = crime,
family = "quasipoisson", offset = log(totpop))
model3cc %>% coef() %>% exp()
## (Intercept) rate_robbe
## 3.054193e-05 1.004497e+00
model3cc %>% confint() %>% exp()
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) 2.208705e-05 4.195665e-05
## rate_robbe 1.001553e+00 1.007438e+00
RR for rate_robbe 1.001553e+00, with other variables it is 1.004677e+00, also not a confounder.
model3i <- glm(murde ~ punemploy + bradyscore + rate_robbe + punemploy * bradyscore * rate_robbe,
data = crime,
family = "quasipoisson", offset = log(totpop))
summary(model3i)
##
## Call:
## glm(formula = murde ~ punemploy + bradyscore + rate_robbe + punemploy *
## bradyscore * rate_robbe, family = "quasipoisson", data = crime,
## offset = log(totpop))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -8.4753 -2.3558 0.0763 2.5113 9.0883
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.240e+01 6.894e-01 -17.981 < 2e-16 ***
## punemploy 3.767e-01 1.311e-01 2.873 0.00635 **
## bradyscore -3.661e-02 3.516e-02 -1.041 0.30377
## rate_robbe 1.820e-02 6.760e-03 2.692 0.01017 *
## punemploy:bradyscore 3.899e-03 6.601e-03 0.591 0.55788
## punemploy:rate_robbe -2.458e-03 1.224e-03 -2.009 0.05100 .
## bradyscore:rate_robbe 2.443e-04 2.948e-04 0.829 0.41201
## punemploy:bradyscore:rate_robbe -2.866e-05 5.398e-05 -0.531 0.59833
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 17.51433)
##
## Null deviance: 1809.25 on 49 degrees of freedom
## Residual deviance: 724.24 on 42 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 4
As the p-value is more than 0.05, we can reject the hypothesis that there is interaction present in this model.
This is done to check the model fit.
model3 %>% lrtest()
## Likelihood ratio test
##
## Model 1: murde ~ punemploy + bradyscore + rate_robbe
## Model 2: murde ~ 1
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 4 -635.56
## 2 1 -1077.35 -3 883.58 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As the p-value for this test is less than 0.05, we can reject the null hypothesis which was that there is no variable associated with our outcome, “murde” (murder and nonnegligent manslaughter). It suggests that there is at least one explanatory variable which is associated with our outcome.