Safiya Stewart

Sociology 712

Professor Songe

4/5/19

For this week’s assignment, I chose to work with data provided on the Kaggle website, an online platform for data scientists, students and novices to data exploration etc. My topic of focus was on Gun Violence in the United States, sadly to say that the U.S. has the second highest ranking of gun violence in the world preceded by Brazil according to Laura Santhanam of PBS (Public Broadcasting Station) article can be found here. The Gun Violence data on Kaggle can be found here, the data was scraped from the Gun Violence Archive website (additional link to this site can be found on Kaggle’s website) and data were collected from 2013-2018 and consisted of over 260,000 incidents. Variables that were used in this data set are: date of crime, state of crime, age of suspect, number of people injured or killed (mutually exclusive), congressional districts and gun types etc. For this week’s assignment however, the following variables were used in my analysis.

State
Injured (# of respondents injured)
Guns involved (# of guns invovled)
District (district of shooting)

Dependent variable: Injured Independent variables: State, Guns involved, Killed & District

My research question is as follows:

Does gun violence injury significantly vary on the district and state levels?

The subjects in my analysis were individuals

HYPOTHESIS

I believe that gun violence will vary significanlty by state and district levels due to varying laws in different states for e.g. states such as Florida and Alabama practice more leniency in their gun laws as opposed to Pennsylvania and Illinois. Other factors that may influence the difference in gun violence is one’s religion and political affiliation.

#install.packages("nlme")
#install.packages("haven")
#install.packages("lmerTest")

library(nlme)
library(dplyr)
library(magrittr)
library(tidyr)
library(haven)
library(lmerTest)
library(ggplot2)
library(texreg)
library(readr)

read_csv("/Users/safiesaf/Downloads/gun-violence-data_01-2013_03-2018.csv")

## # A tibble: 239,677 x 29
##    incident_id date       state city_or_county address n_killed n_injured
##          <dbl> <date>     <chr> <chr>          <chr>      <dbl>     <dbl>
##  1      461105 2013-01-01 Penn… Mckeesport     1506 V…        0         4
##  2      460726 2013-01-01 Cali… Hawthorne      13500 …        1         3
##  3      478855 2013-01-01 Ohio  Lorain         1776 E…        1         3
##  4      478925 2013-01-05 Colo… Aurora         16000 …        4         0
##  5      478959 2013-01-07 Nort… Greensboro     307 Mo…        2         2
##  6      478948 2013-01-07 Okla… Tulsa          6000 b…        4         0
##  7      479363 2013-01-19 New … Albuquerque    2806 L…        5         0
##  8      479374 2013-01-21 Loui… New Orleans    LaSall…        0         5
##  9      479389 2013-01-21 Cali… Brentwood      1100 b…        0         4
## 10      492151 2013-01-23 Mary… Baltimore      1500 b…        1         6
## # … with 239,667 more rows, and 22 more variables: incident_url <chr>,
## #   source_url <chr>, incident_url_fields_missing <lgl>,
## #   congressional_district <dbl>, gun_stolen <chr>, gun_type <chr>,
## #   incident_characteristics <chr>, latitude <dbl>,
## #   location_description <chr>, longitude <dbl>, n_guns_involved <dbl>,
## #   notes <chr>, participant_age <chr>, participant_age_group <chr>,
## #   participant_gender <chr>, participant_name <chr>,
## #   participant_relationship <chr>, participant_status <chr>,
## #   participant_type <chr>, sources <chr>, state_house_district <dbl>,
## #   state_senate_district <dbl>

Gah<-read_csv("/Users/safiesaf/Downloads/gun-violence-data_01-2013_03-2018.csv")

GunViolence<-Gah%>%
  rename("State"=state,
         "Injured"=n_injured,
         "Guns_Inv"=n_guns_involved,
         "District"=congressional_district)%>%
  
filter(!is.na(Injured),
         !is.na(State),
         !is.na(Guns_Inv),
         !is.na(District))

This command shows us that there are 53 unique responses for the state level.

length(unique(GunViolence$State))

## [1] 51

This command shows us that there are 54 unique responses for the district level.

length(unique(GunViolence$District))

## [1] 54

** Data grouped by State**

GunViolence%>% 
  group_by(State) %>% 
  summarise(NSta = n())

## # A tibble: 51 x 2
##    State                 NSta
##    <chr>                <int>
##  1 Alabama               2540
##  2 Alaska                 710
##  3 Arizona               1363
##  4 Arkansas              1561
##  5 California            9073
##  6 Colorado              1787
##  7 Connecticut           1712
##  8 Delaware               829
##  9 District of Columbia  1195
## 10 Florida               8470
## # … with 41 more rows

Ecological Analysis

The results below show the estimated average of respondents that were injured by gun violence and the estimated average of guns that were involved in gun violence by the 51 states.

StateDat <- GunViolence%>% 
  group_by(State) %>% 
  summarise(mean_inj = mean(Injured, na.rm = TRUE), mean_gi = mean(Guns_Inv, na.rm = TRUE))
head(StateDat)

## # A tibble: 6 x 3
##   State      mean_inj mean_gi
##   <chr>         <dbl>   <dbl>
## 1 Alabama       0.562    1.32
## 2 Alaska        0.193    1.37
## 3 Arizona       0.453    1.27
## 4 Arkansas      0.459    1.28
## 5 California    0.412    1.80
## 6 Colorado      0.307    1.52

The results below are statistically significant with a p value < 0.05. On average, for every additional gun involved in gun violence, the estimated number of injuries sustained per state was -0.27, this result may or may not reflect results on the district level so we must do additional analysis to avoid ecological fallacy.

EcoModel <- lm(mean_inj ~ mean_gi, data = StateDat)
summary(EcoModel)

## 
## Call:
## lm(formula = mean_inj ~ mean_gi, data = StateDat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.26893 -0.07800  0.01776  0.08983  0.29120 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.7431     0.1659    4.48 4.49e-05 ***
## mean_gi      -0.2651     0.1194   -2.22   0.0311 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1285 on 49 degrees of freedom
## Multiple R-squared:  0.09139,    Adjusted R-squared:  0.07284 
## F-statistic: 4.928 on 1 and 49 DF,  p-value: 0.03108

Complete Pooling

The results below are statistically significant with a p value < 0.05. The state level is removed and now we can focus on results at the congressional district level. The results show that for every 1 unit increase in congressional district, the estimated number of injuries sustained from gun violence decreases by 0.0011.

What happens if we were curious to find out if any relationship exists among states? It may be worth it to run an analysis on variance.

CPoolingMeth <-lm(Injured ~ District, data = GunViolence)
summary(CPoolingMeth)

## 
## Call:
## lm(formula = Injured ~ District, data = GunViolence)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -0.460 -0.456 -0.446  0.544 52.545 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.4599206  0.0028069 163.855  < 2e-16 ***
## District    -0.0010715  0.0002422  -4.424 9.68e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7248 on 128299 degrees of freedom
## Multiple R-squared:  0.0001526,  Adjusted R-squared:  0.0001448 
## F-statistic: 19.58 on 1 and 128299 DF,  p-value: 9.68e-06

No-Pooling Model (The Intercept):

The results below show the intercept which represents the average number of injuries sustained by congressional district. The intercept represents the injury averages of the districts that are smaller (district=1). The histogram shows the average injuries sustained by lower congressional districts are a little more than 0.25 with a variation of more than 0 and less than 1.25

Intercept <- GunViolence %>%
  group_by(State)%>%
  do(model = lm(Injured ~ District, data = .))
coef <- Intercept %>% 
do(data.frame(Interc = coef(.$mod)[1]))
ggplot(coef, aes(x = Interc)) + geom_histogram()+xlab("Injury sustained by lower districts")

The Slope

This model shows the difference in injuries sustained by gun violence by district across the state level. The results show that on average, the difference in injuries by district on the state level is a about 0 with a variation a little more that -0.4 and less than 0.2

Slope <- GunViolence%>% 
    group_by(State)%>% 
    do(model = lm(Injured ~ District, data = .))
coef <- Slope %>%
do(data.frame(DSlo = coef(.$mod)[2]))
ggplot(coef, aes(x = DSlo)) + geom_histogram()+xlab("Difference in Injuries by District across states")

Random Intercept

This model combines complete pooling with no pooling. It shows us variation between the states while also imposing structure on said variations. It does not allow variations between the districts however and the analysis can be interpreted as follows: the standard deviation between states is 0.13. On average, injuries for smaller districts are 0.39 with a decrease of 0.003 for larger districts.

Model <- lme(Injured ~ District, data = GunViolence, random = ~1|State, method = "ML")
summary(Model)

## Linear mixed-effects model fit by maximum likelihood
##  Data: GunViolence 
##        AIC      BIC    logLik
##   278155.4 278194.5 -139073.7
## 
## Random effects:
##  Formula: ~1 | State
##         (Intercept)  Residual
## StdDev:    0.133584 0.7147803
## 
## Fixed effects: Injured ~ District 
##                  Value   Std.Error     DF   t-value p-value
## (Intercept)  0.3925360 0.019039354 128249 20.617086       0
## District    -0.0027132 0.000337238 128249 -8.045284       0
##  Correlation: 
##          (Intr)
## District -0.083
## 
## Standardized Within-Group Residuals:
##        Min         Q1        Med         Q3        Max 
## -1.0340807 -0.6355143 -0.4996146  0.7003635 73.5634170 
## 
## Number of Observations: 128301
## Number of Groups: 51

Ramdom Slope (Partial pooling)

This model allows the district level to vary between states and the analyses are as follows: smaller congressional districts experience injuries on average of 0.37 with a standard deviation of 0.18 across states, while injuries for larger congressional districts are 0.007 with a standard deviation of 0.02 across states. The intercept and slope are negatively correlated across the state level at 0.78. This means that as districts get larger across the state level, injuries sustained from gun violence gets smaller.

Model1 <- lme(Injured ~ District, data = GunViolence, random = ~ District|State, method = "ML")
summary(Model1)

## Linear mixed-effects model fit by maximum likelihood
##  Data: GunViolence 
##        AIC      BIC    logLik
##   276706.1 276764.7 -138347.1
## 
## Random effects:
##  Formula: ~District | State
##  Structure: General positive-definite, Log-Cholesky parametrization
##             StdDev     Corr  
## (Intercept) 0.18367828 (Intr)
## District    0.02191375 -0.828
## Residual    0.71047299       
## 
## Fixed effects: Injured ~ District 
##                 Value   Std.Error     DF   t-value p-value
## (Intercept) 0.3719063 0.026251011 128249 14.167314  0.0000
## District    0.0074220 0.003557173 128249  2.086481  0.0369
##  Correlation: 
##          (Intr)
## District -0.774
## 
## Standardized Within-Group Residuals:
##        Min         Q1        Med         Q3        Max 
## -1.3955727 -0.6281676 -0.4818851  0.7153096 74.0725860 
## 
## Number of Observations: 128301
## Number of Groups: 51

The AIC results show that the best fit model for the data set is Model1 (because the results are the smallest) which is the Random slope (partial pooling) model. This model shows the districts varying across the state level.

AIC(CPoolingMeth, Model, Model1)

##              df      AIC
## CPoolingMeth  3 281506.9
## Model         4 278155.4
## Model1        6 276706.1

#htmlreg(list(Model, Model1), doctype = FALSE)

CONCLUSION

In conclusion, my hypothesis was wrong in that I assumed injuries from gun violence will vary more on the state and district levels but it was quite the opposite. On average,the results showed that smaller districts are estimated to have higher injuries sustained from gun violence as opposed to larger districts across the state level.

Assignment 8: Multilevel Modeling