Our assignment as a group is to analyze our dataset on crime from 50 US states and synthesize a report. We are required to design a statistical model which will best predict the variables which are most closely related to the risk for murder and nonnegligent manslaughter. This will allow us to find out which initiatives are worth funding to reduce the rate of murder. This report shows the steps taken by me to analyze the data and design the model.

Load packages and crime file

library(tidyverse)
library(lmtest)
library(broom)
library(MASS)
library(knitr)

load("crime.RData")

Explore the data

summary(crime)

##     state               murde           punemploy       bradyscore    
##  Length:50          Min.   :  10.00   Min.   :2.800   Min.   :-39.00  
##  Class :character   1st Qu.:  62.25   1st Qu.:4.225   1st Qu.:-19.00  
##  Mode  :character   Median : 195.00   Median :5.050   Median : -8.75  
##                     Mean   : 310.68   Mean   :5.006   Mean   :  3.94  
##                     3rd Qu.: 462.25   3rd Qu.:5.950   3rd Qu.: 19.62  
##                     Max.   :1861.00   Max.   :6.700   Max.   : 76.00  
##      totpop             medage       rate_robbe       rate_aggra   
##  Min.   :  586102   Min.   :30.0   Min.   : 10.10   Min.   : 69.3  
##  1st Qu.: 1853216   1st Qu.:36.0   1st Qu.: 52.70   1st Qu.:161.2  
##  Median : 4546634   Median :37.5   Median : 79.45   Median :224.8  
##  Mean   : 6407342   Mean   :37.6   Mean   : 80.71   Mean   :234.1  
##  3rd Qu.: 7065179   3rd Qu.:39.0   3rd Qu.:107.92   3rd Qu.:289.1  
##  Max.   :39032444   Max.   :44.0   Max.   :217.50   Max.   :497.1  
##    rate_burgl      rate_larce     rate_motor       sqmiles      
##  Min.   :223.7   Min.   :1064   Min.   : 28.4   Min.   :  1034  
##  1st Qu.:351.8   1st Qu.:1435   1st Qu.:132.1   1st Qu.: 36741  
##  Median :466.8   Median :1784   Median :194.4   Median : 53891  
##  Mean   :488.9   Mean   :1783   Mean   :199.6   Mean   : 70637  
##  3rd Qu.:587.3   3rd Qu.:2064   3rd Qu.:255.8   3rd Qu.: 81226  
##  Max.   :828.8   Max.   :2934   Max.   :436.8   Max.   :570641

The summary is showing that the average number of murders in 2015 was about 311. While average score for legislations from The Brady Campaign was 3.94. The median score is -8.75, and maximum is 76. The average aggravated assault in this year was 234 per 100,000 inhabitants. The average rates for robbery, burglary, and larceny/theft was 80.71, 488.9, and 1783 per 100,000 inhabitants.

head(crime)

## # A tibble: 6 x 12
##   state murde punemploy bradyscore totpop medage rate_robbe rate_aggra
##   <chr> <dbl>     <dbl>      <dbl>  <int>  <int>      <dbl>      <dbl>
## 1 alab~   348       6.1        -18 4.85e6     38       94.9       328.
## 2 alas~    59       6.5        -30 7.38e5     33      103.        497.
## 3 ariz~   309       6.1        -39 6.80e6     37       93.1       267.
## 4 arka~   181       5          -24 2.98e6     37       70.4       380 
## 5 cali~  1861       6.2         76 3.90e7     36      135         254.
## 6 colo~   176       3.9         22 5.44e6     36       60.9       197.
## # ... with 4 more variables: rate_burgl <dbl>, rate_larce <dbl>,
## #   rate_motor <dbl>, sqmiles <dbl>

dim(crime)

## [1] 50 12

This dataset contains data from 50 states and 12 variables.

Poisson Plot on the Frequency of Murders in the US in 2015

hist(rpois(10000, lambda = 5), main = "", xlab = "murde")

Regression Models

Model 1

This is a multiple linear regression including “percentage unemployed” and “Brady Score” variables to calculate the murder rate.

model1 <- lm(murde ~ punemploy + bradyscore, data = crime)

model1 %>% summary

## 
## Call:
## lm(formula = murde ~ punemploy + bradyscore, data = crime)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -516.92 -159.74  -50.45  112.53 1242.14 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -269.493    230.032  -1.172   0.2473  
## punemploy    114.016     44.992   2.534   0.0147 *
## bradyscore     2.388      1.474   1.620   0.1119  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 332.6 on 47 degrees of freedom
## Multiple R-squared:  0.1643, Adjusted R-squared:  0.1287 
## F-statistic:  4.62 on 2 and 47 DF,  p-value: 0.01473

The regression model can be written as:

E(No. of murders) = -269.5 + 114(percentage unemployed) + 2.39(brady score)

Plots for Model 1 & Checking Assumptions

1. Residuals

plot(model1,1)

2. Normality

plot(model1, 2)

The above graphs show that 2 of the assumptions of normality were met by this model for multilinear regression.

AIC calculation

model1 %>% AIC

## [1] 727.4865

Model 2

This model is includes the variables “percentage unemployed” and “Brady Score” to calculate the murder rates in a Poisson regression model.

model2 <- glm(murde ~ punemploy + bradyscore, data = crime, family = "poisson")

model2 %>% summary

## 
## Call:
## glm(formula = murde ~ punemploy + bradyscore, family = "poisson", 
##     data = crime)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -30.700  -13.835   -3.731    5.836   51.596  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) 3.5250520  0.0481641   73.19   <2e-16 ***
## punemploy   0.4141513  0.0087533   47.31   <2e-16 ***
## bradyscore  0.0066528  0.0002168   30.68   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 15937  on 49  degrees of freedom
## Residual deviance: 12571  on 47  degrees of freedom
## AIC: 12922
## 
## Number of Fisher Scoring iterations: 5

The regression model can be written as:

log(Risk of murder) = 3.53 + 0.41(percentage unemployed) + 0.007(brady score)

Exponentiated Relative Risks (RR) for Model 2

model2 %>% coef() %>% exp()

## (Intercept)   punemploy  bradyscore 
##   33.955539    1.513086    1.006675

95% Confidence Intervals (CI) for Model 2

model2 %>% confint() %>% exp()

## Waiting for profiling to be done...

##                 2.5 %    97.5 %
## (Intercept) 30.887034 37.305599
## punemploy    1.487399  1.539322
## bradyscore   1.006247  1.007102

AIC for Model 2

model2 %>% AIC

## [1] 12922.01

Check Poisson assumptions for Model 2

The hypotheses are as follows:

H0: The Poisson assumption is met
H1: The Poisson assumption is not met

checkpois <- function(pglm) {
sum(residuals(pglm, type = "pearson")^2) %>%
pchisq(., pglm$df.residual, lower.tail = F)
}

checkpois(model2)

## [1] 0

Since it is [1], the assumptoms were not met by this model.

Model 3

Accodring to Tatum’s analysis on her models, we have decided that this model will be the best fit for our data. We have analyzed other variables and have chosen “punemploy”, “bradyscore” and “rate_robbe” according to their significance. Offset will be applied which will include the total population as the denominator.

model3 <- glm(murde ~ punemploy + bradyscore + rate_robbe, data = crime, family = "poisson", offset = log(totpop))

model3 %>% summary

## 
## Call:
## glm(formula = murde ~ punemploy + bradyscore + rate_robbe, family = "poisson", 
##     data = crime, offset = log(totpop))
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -9.954  -3.352  -0.074   2.238  11.085  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.115e+01  5.851e-02 -190.55   <2e-16 ***
## punemploy    1.500e-01  1.212e-02   12.38   <2e-16 ***
## bradyscore  -4.900e-03  2.286e-04  -21.43   <2e-16 ***
## rate_robbe   4.666e-03  3.039e-04   15.35   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 1809.25  on 49  degrees of freedom
## Residual deviance:  925.67  on 46  degrees of freedom
## AIC: 1279.1
## 
## Number of Fisher Scoring iterations: 4

The regression model can be written as:

log(Risk of murder) = -1.12 + 0.15 (percentage unemployed) - 0.0049(brady score) + 0.00467(rate of robbery)

Exponentiated RRs for Model 3

We exponentiate our coefficients to get the relative risks.

model3 %>% coef() %>% exp()

##  (Intercept)    punemploy   bradyscore   rate_robbe 
## 1.438742e-05 1.161838e+00 9.951118e-01 1.004677e+00

95% CIs for Model 3

model3 %>% confint() %>% exp()

## Waiting for profiling to be done...

##                    2.5 %       97.5 %
## (Intercept) 1.282495e-05 1.613119e-05
## punemploy   1.134591e+00 1.189786e+00
## bradyscore  9.946656e-01 9.955575e-01
## rate_robbe  1.004078e+00 1.005274e+00

Tabulated RRs and 95% CIs

table3 <- tidy(model3, conf.int = T) %>% mutate(.,
                           RR = exp(estimate),
                           lb = exp(conf.low),
                           ub = exp(conf.high)) %>%
  filter(., term != "(Intercept)") %>%
  dplyr::select(., term, RR : ub)
table3

## # A tibble: 3 x 4
##   term          RR    lb    ub
##   <chr>      <dbl> <dbl> <dbl>
## 1 punemploy  1.16  1.13  1.19 
## 2 bradyscore 0.995 0.995 0.996
## 3 rate_robbe 1.00  1.00  1.01

kable(table3, digits = 3)

term	RR	lb	ub
punemploy	1.162	1.135	1.190
bradyscore	0.995	0.995	0.996
rate_robbe	1.005	1.004	1.005

Figure

ggplot(table3, aes(x = term, y= RR)) +
  geom_pointrange(aes(ymin = lb, ymax = ub)) +
  geom_hline(yintercept = 1, linetype = 2, color = "blue")

AIC for Model 3

model3 %>% AIC

## [1] 1279.114

Check Poisson assumptions for Model 3

checkpois <- function(pglm) {
sum(residuals(pglm, type = "pearson")^2) %>%
pchisq(., pglm$df.residual, lower.tail = F)
}

checkpois(model3)

## [1] 2.360128e-166

The assumptions for this model was still not met.

To Check for Confounding and Interaction

Confounding

Percent unemployed

model3ca <- glm(murde ~ punemploy, data = crime,
family = "quasipoisson", offset = log(totpop))
model3ca %>% coef() %>% exp()

##  (Intercept)    punemploy 
## 1.741289e-05 1.210681e+00

model3ca %>% confint() %>% exp()

## Waiting for profiling to be done...

##                    2.5 %       97.5 %
## (Intercept) 9.135169e-06 3.252276e-05
## punemploy   1.079199e+00 1.361442e+00

Since “punemploy” alone is 1.21, and with other variables is 1.16 it is not a confounder.

Brady Score

model3cb <- glm(murde ~ bradyscore, data = crime,
family = "quasipoisson", offset = log(totpop))
model3cb %>% coef() %>% exp()

##  (Intercept)   bradyscore 
## 5.051273e-05 9.970790e-01

model3cb %>% confint() %>% exp()

## Waiting for profiling to be done...

##                    2.5 %       97.5 %
## (Intercept) 4.579944e-05 5.552889e-05
## bradyscore  9.945165e-01 9.995924e-01

Bradyscore alone is 9.970790e-01, and with other variables is 9.951118e-01. It is also not a confounder.

Rate of Robbery

model3cc <- glm(murde ~ rate_robbe, data = crime,
family = "quasipoisson", offset = log(totpop))
model3cc %>% coef() %>% exp()

##  (Intercept)   rate_robbe 
## 3.054193e-05 1.004497e+00

model3cc %>% confint() %>% exp()

## Waiting for profiling to be done...

##                    2.5 %       97.5 %
## (Intercept) 2.208705e-05 4.195665e-05
## rate_robbe  1.001553e+00 1.007438e+00

RR for rate_robbe 1.001553e+00, with other variables it is 1.004677e+00, also not a confounder.

Interaction

model3i <- glm(murde ~ punemploy + bradyscore + rate_robbe + punemploy * bradyscore * rate_robbe,
data = crime,
family = "quasipoisson", offset = log(totpop))

summary(model3i)

## 
## Call:
## glm(formula = murde ~ punemploy + bradyscore + rate_robbe + punemploy * 
##     bradyscore * rate_robbe, family = "quasipoisson", data = crime, 
##     offset = log(totpop))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -8.4753  -2.3558   0.0763   2.5113   9.0883  
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     -1.240e+01  6.894e-01 -17.981  < 2e-16 ***
## punemploy                        3.767e-01  1.311e-01   2.873  0.00635 ** 
## bradyscore                      -3.661e-02  3.516e-02  -1.041  0.30377    
## rate_robbe                       1.820e-02  6.760e-03   2.692  0.01017 *  
## punemploy:bradyscore             3.899e-03  6.601e-03   0.591  0.55788    
## punemploy:rate_robbe            -2.458e-03  1.224e-03  -2.009  0.05100 .  
## bradyscore:rate_robbe            2.443e-04  2.948e-04   0.829  0.41201    
## punemploy:bradyscore:rate_robbe -2.866e-05  5.398e-05  -0.531  0.59833    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 17.51433)
## 
##     Null deviance: 1809.25  on 49  degrees of freedom
## Residual deviance:  724.24  on 42  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 4

As the p-value is more than 0.05, we can reject the hypothesis that there is interaction present in this model.

Likelihood Test for Model 3

This is done to check the model fit.

model3 %>% lrtest()

## Likelihood ratio test
## 
## Model 1: murde ~ punemploy + bradyscore + rate_robbe
## Model 2: murde ~ 1
##   #Df   LogLik Df  Chisq Pr(>Chisq)    
## 1   4  -635.56                         
## 2   1 -1077.35 -3 883.58  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As the p-value for this test is less than 0.05, we can reject the null hypothesis which was that there is no variable associated with our outcome, “murde” (murder and nonnegligent manslaughter). It suggests that there is at least one explanatory variable which is associated with our outcome.

Case Study 1: Predictors of Murder Rate in US States

Jayati Atahar

9/13/2020

Load packages and crime file

Explore the data

Poisson Plot on the Frequency of Murders in the US in 2015

Regression Models

Model 1

Plots for Model 1 & Checking Assumptions

1. Residuals

2. Normality

AIC calculation

Model 2

Exponentiated Relative Risks (RR) for Model 2

95% Confidence Intervals (CI) for Model 2

AIC for Model 2

Check Poisson assumptions for Model 2

Model 3

Exponentiated RRs for Model 3

95% CIs for Model 3

Tabulated RRs and 95% CIs

Figure

AIC for Model 3

Check Poisson assumptions for Model 3

To Check for Confounding and Interaction

Confounding

Percent unemployed

Brady Score

Rate of Robbery

Interaction

Likelihood Test for Model 3