For this week’s assignment, I chose to work with data provided on the Kaggle website, an online platform for data scientists, students and novices to data exploration etc. My topic of focus was on Gun Violence in the United States, sadly to say that the U.S. has the second highest ranking of gun violence in the world preceded by Brazil according to Laura Santhanam of PBS (Public Broadcasting Station) article can be found here. The Gun Violence data on Kaggle can be found here, the data was scraped from the Gun Violence Archive website (additional link to this site can be found on Kaggle’s website) and data were collected from 2013-2018 and consisted of over 260,000 incidents. Variables that were used in this data set are: date of crime, state of crime, age of suspect, number of people injured or killed (mutually exclusive), congressional districts and gun types etc. For this week’s assignment however, the following variables were used in my analysis.
Dependent variable: Injured Independent variables: State, Guns involved, Killed & District
My research question is as follows:
Does gun violence injury significantly vary on the district and state levels?
The subjects in my analysis were individuals
HYPOTHESIS
I believe that gun violence will vary significanlty by state and district levels due to varying laws in different states for e.g. states such as Florida and Alabama practice more leniency in their gun laws as opposed to Pennsylvania and Illinois. Other factors that may influence the difference in gun violence is one’s religion and political affiliation.
#install.packages("nlme")
#install.packages("haven")
#install.packages("lmerTest")
library(nlme)
library(dplyr)
library(magrittr)
library(tidyr)
library(haven)
library(lmerTest)
library(ggplot2)
library(texreg)
library(readr)
read_csv("/Users/safiesaf/Downloads/gun-violence-data_01-2013_03-2018.csv")
## # A tibble: 239,677 x 29
## incident_id date state city_or_county address n_killed n_injured
## <dbl> <date> <chr> <chr> <chr> <dbl> <dbl>
## 1 461105 2013-01-01 Penn… Mckeesport 1506 V… 0 4
## 2 460726 2013-01-01 Cali… Hawthorne 13500 … 1 3
## 3 478855 2013-01-01 Ohio Lorain 1776 E… 1 3
## 4 478925 2013-01-05 Colo… Aurora 16000 … 4 0
## 5 478959 2013-01-07 Nort… Greensboro 307 Mo… 2 2
## 6 478948 2013-01-07 Okla… Tulsa 6000 b… 4 0
## 7 479363 2013-01-19 New … Albuquerque 2806 L… 5 0
## 8 479374 2013-01-21 Loui… New Orleans LaSall… 0 5
## 9 479389 2013-01-21 Cali… Brentwood 1100 b… 0 4
## 10 492151 2013-01-23 Mary… Baltimore 1500 b… 1 6
## # … with 239,667 more rows, and 22 more variables: incident_url <chr>,
## # source_url <chr>, incident_url_fields_missing <lgl>,
## # congressional_district <dbl>, gun_stolen <chr>, gun_type <chr>,
## # incident_characteristics <chr>, latitude <dbl>,
## # location_description <chr>, longitude <dbl>, n_guns_involved <dbl>,
## # notes <chr>, participant_age <chr>, participant_age_group <chr>,
## # participant_gender <chr>, participant_name <chr>,
## # participant_relationship <chr>, participant_status <chr>,
## # participant_type <chr>, sources <chr>, state_house_district <dbl>,
## # state_senate_district <dbl>
Gah<-read_csv("/Users/safiesaf/Downloads/gun-violence-data_01-2013_03-2018.csv")
GunViolence<-Gah%>%
rename("State"=state,
"Injured"=n_injured,
"Guns_Inv"=n_guns_involved,
"District"=congressional_district)%>%
filter(!is.na(Injured),
!is.na(State),
!is.na(Guns_Inv),
!is.na(District))
This command shows us that there are 53 unique responses for the state level.
length(unique(GunViolence$State))
## [1] 51
This command shows us that there are 54 unique responses for the district level.
length(unique(GunViolence$District))
## [1] 54
** Data grouped by State**
GunViolence%>%
group_by(State) %>%
summarise(NSta = n())
## # A tibble: 51 x 2
## State NSta
## <chr> <int>
## 1 Alabama 2540
## 2 Alaska 710
## 3 Arizona 1363
## 4 Arkansas 1561
## 5 California 9073
## 6 Colorado 1787
## 7 Connecticut 1712
## 8 Delaware 829
## 9 District of Columbia 1195
## 10 Florida 8470
## # … with 41 more rows
The results below show the estimated average of respondents that were injured by gun violence and the estimated average of guns that were involved in gun violence by the 51 states.
StateDat <- GunViolence%>%
group_by(State) %>%
summarise(mean_inj = mean(Injured, na.rm = TRUE), mean_gi = mean(Guns_Inv, na.rm = TRUE))
head(StateDat)
## # A tibble: 6 x 3
## State mean_inj mean_gi
## <chr> <dbl> <dbl>
## 1 Alabama 0.562 1.32
## 2 Alaska 0.193 1.37
## 3 Arizona 0.453 1.27
## 4 Arkansas 0.459 1.28
## 5 California 0.412 1.80
## 6 Colorado 0.307 1.52
The results below are statistically significant with a p value < 0.05. On average, for every additional gun involved in gun violence, the estimated number of injuries sustained per state was -0.27, this result may or may not reflect results on the district level so we must do additional analysis to avoid ecological fallacy.
EcoModel <- lm(mean_inj ~ mean_gi, data = StateDat)
summary(EcoModel)
##
## Call:
## lm(formula = mean_inj ~ mean_gi, data = StateDat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.26893 -0.07800 0.01776 0.08983 0.29120
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.7431 0.1659 4.48 4.49e-05 ***
## mean_gi -0.2651 0.1194 -2.22 0.0311 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1285 on 49 degrees of freedom
## Multiple R-squared: 0.09139, Adjusted R-squared: 0.07284
## F-statistic: 4.928 on 1 and 49 DF, p-value: 0.03108
The results below are statistically significant with a p value < 0.05. The state level is removed and now we can focus on results at the congressional district level. The results show that for every 1 unit increase in congressional district, the estimated number of injuries sustained from gun violence decreases by 0.0011.
What happens if we were curious to find out if any relationship exists among states? It may be worth it to run an analysis on variance.
CPoolingMeth <-lm(Injured ~ District, data = GunViolence)
summary(CPoolingMeth)
##
## Call:
## lm(formula = Injured ~ District, data = GunViolence)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.460 -0.456 -0.446 0.544 52.545
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.4599206 0.0028069 163.855 < 2e-16 ***
## District -0.0010715 0.0002422 -4.424 9.68e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7248 on 128299 degrees of freedom
## Multiple R-squared: 0.0001526, Adjusted R-squared: 0.0001448
## F-statistic: 19.58 on 1 and 128299 DF, p-value: 9.68e-06
The results below show the intercept which represents the average number of injuries sustained by congressional district. The intercept represents the injury averages of the districts that are smaller (district=1). The histogram shows the average injuries sustained by lower congressional districts are a little more than 0.25 with a variation of more than 0 and less than 1.25
Intercept <- GunViolence %>%
group_by(State)%>%
do(model = lm(Injured ~ District, data = .))
coef <- Intercept %>%
do(data.frame(Interc = coef(.$mod)[1]))
ggplot(coef, aes(x = Interc)) + geom_histogram()+xlab("Injury sustained by lower districts")
This model shows the difference in injuries sustained by gun violence by district across the state level. The results show that on average, the difference in injuries by district on the state level is a about 0 with a variation a little more that -0.4 and less than 0.2
Slope <- GunViolence%>%
group_by(State)%>%
do(model = lm(Injured ~ District, data = .))
coef <- Slope %>%
do(data.frame(DSlo = coef(.$mod)[2]))
ggplot(coef, aes(x = DSlo)) + geom_histogram()+xlab("Difference in Injuries by District across states")
This model combines complete pooling with no pooling. It shows us variation between the states while also imposing structure on said variations. It does not allow variations between the districts however and the analysis can be interpreted as follows: the standard deviation between states is 0.13. On average, injuries for smaller districts are 0.39 with a decrease of 0.003 for larger districts.
Model <- lme(Injured ~ District, data = GunViolence, random = ~1|State, method = "ML")
summary(Model)
## Linear mixed-effects model fit by maximum likelihood
## Data: GunViolence
## AIC BIC logLik
## 278155.4 278194.5 -139073.7
##
## Random effects:
## Formula: ~1 | State
## (Intercept) Residual
## StdDev: 0.133584 0.7147803
##
## Fixed effects: Injured ~ District
## Value Std.Error DF t-value p-value
## (Intercept) 0.3925360 0.019039354 128249 20.617086 0
## District -0.0027132 0.000337238 128249 -8.045284 0
## Correlation:
## (Intr)
## District -0.083
##
## Standardized Within-Group Residuals:
## Min Q1 Med Q3 Max
## -1.0340807 -0.6355143 -0.4996146 0.7003635 73.5634170
##
## Number of Observations: 128301
## Number of Groups: 51
This model allows the district level to vary between states and the analyses are as follows: smaller congressional districts experience injuries on average of 0.37 with a standard deviation of 0.18 across states, while injuries for larger congressional districts are 0.007 with a standard deviation of 0.02 across states. The intercept and slope are negatively correlated across the state level at 0.78. This means that as districts get larger across the state level, injuries sustained from gun violence gets smaller.
Model1 <- lme(Injured ~ District, data = GunViolence, random = ~ District|State, method = "ML")
summary(Model1)
## Linear mixed-effects model fit by maximum likelihood
## Data: GunViolence
## AIC BIC logLik
## 276706.1 276764.7 -138347.1
##
## Random effects:
## Formula: ~District | State
## Structure: General positive-definite, Log-Cholesky parametrization
## StdDev Corr
## (Intercept) 0.18367828 (Intr)
## District 0.02191375 -0.828
## Residual 0.71047299
##
## Fixed effects: Injured ~ District
## Value Std.Error DF t-value p-value
## (Intercept) 0.3719063 0.026251011 128249 14.167314 0.0000
## District 0.0074220 0.003557173 128249 2.086481 0.0369
## Correlation:
## (Intr)
## District -0.774
##
## Standardized Within-Group Residuals:
## Min Q1 Med Q3 Max
## -1.3955727 -0.6281676 -0.4818851 0.7153096 74.0725860
##
## Number of Observations: 128301
## Number of Groups: 51
The AIC results show that the best fit model for the data set is Model1 (because the results are the smallest) which is the Random slope (partial pooling) model. This model shows the districts varying across the state level.
AIC(CPoolingMeth, Model, Model1)
## df AIC
## CPoolingMeth 3 281506.9
## Model 4 278155.4
## Model1 6 276706.1
#htmlreg(list(Model, Model1), doctype = FALSE)
In conclusion, my hypothesis was wrong in that I assumed injuries from gun violence will vary more on the state and district levels but it was quite the opposite. On average,the results showed that smaller districts are estimated to have higher injuries sustained from gun violence as opposed to larger districts across the state level.