Gun violence, especially mass shootings is now a disease in USA. Therefore, I think it is an interesting topic to understand and analyze if there is a correlation between unemployment rate and mass shootings in the USA. So, i set the research question for this project: “Does unemployment rate have an effect on mass shootings in US states?” For this analysis, I collected the unemployment rate data for the states from The U.S. Bureau of Labor Statistics for the year between 2017 and 2021 by using web scrapping from their web page. I also collected mass shooting casualty data as CSV files from the Gun Violence Archive and stored them on my github.Then, I will clean up and transform the data as required to conduct some exploratory and statistical analysis (linear regression analysis) to see if there is a correlation between the stated variables.
library(dplyr)
library(rvest)
library(ggplot2)
# Uemployment rate data of 2021 annual averages
unemp2021_url<-read_html("https://www.bls.gov/lau/lastrk21.htm")
unemp_data_2021<-unemp2021_url %>% html_nodes(xpath='//*[@id="lastrk21"]/tbody') %>% html_table()
unemp2021_df<-as.data.frame(unemp_data_2021)
glimpse(unemp_data_2021)
## List of 1
## $ : tibble [53 × 3] (S3: tbl_df/tbl/data.frame)
## ..$ X1: chr [1:53] "United States" "" "Nebraska" "Utah" ...
## ..$ X2: num [1:53] 5.3 NA 2.5 2.7 3.1 3.2 3.4 3.4 3.4 3.4 ...
## ..$ X3: int [1:53] NA NA 1 2 3 4 5 5 5 5 ...
head(unemp2021_df)
## X1 X2 X3
## 1 United States 5.3 NA
## 2 NA NA
## 3 Nebraska 2.5 1
## 4 Utah 2.7 2
## 5 South Dakota 3.1 3
## 6 Kansas 3.2 4
# Unemployment rate data of 2020 annual averages
unemp2020_url<-read_html("https://www.bls.gov/lau/lastrk20.htm")
unemp_data_2020<-unemp2020_url %>% html_nodes(xpath='//*[@id="lastrk20"]/tbody') %>% html_table()
unemp2020_df<-as.data.frame(unemp_data_2020)
head(unemp2020_df)
## X1 X2 X3
## 1 United States 8.1 NA
## 2 NA NA
## 3 Nebraska 4.1 1
## 4 South Dakota 4.3 2
## 5 Utah 4.7 3
## 6 Maine 5.0 4
# Unemployment rate data of 2019 annual averages
unemp2019_url<-read_html("https://www.bls.gov/lau/lastrk19.htm")
unemp_data_2019<-unemp2019_url %>% html_nodes(xpath='//*[@id="lastrk19"]/tbody') %>% html_table()
unemp2019_df<-as.data.frame(unemp_data_2019)
head(unemp2019_df)
## X1 X2 X3
## 1 United States 3.7 NA
## 2 NA NA
## 3 North Dakota 2.1 1
## 4 Vermont 2.3 2
## 5 Hawaii 2.5 3
## 6 Colorado 2.6 4
# Unemployment rate data of 2018 annual averages
unemp2018_url<-read_html("https://www.bls.gov/lau/lastrk18.htm")
unemp_data_2018<-unemp2018_url %>% html_nodes(xpath='//*[@id="lastrk18"]/tbody') %>% html_table()
unemp2018_df<-as.data.frame(unemp_data_2018)
head(unemp2018_df)
## X1 X2 X3
## 1 United States 3.9 NA
## 2 NA NA
## 3 Hawaii 2.4 1
## 4 North Dakota 2.4 1
## 5 Iowa 2.5 3
## 6 New Hampshire 2.6 4
# Unemployment rate data of 2017 annual average
unemp2017_url<-read_html("https://www.bls.gov/lau/lastrk17.htm")
unemp_data_2017<-unemp2017_url %>% html_nodes(xpath='//*[@id="lastrk17"]/tbody') %>% html_table()
unemp2017_df<-as.data.frame(unemp_data_2017)
head(unemp2017_df)
## X1 X2 X3
## 1 United States 4.4 NA
## 2 NA NA
## 3 Hawaii 2.2 1
## 4 Colorado 2.6 2
## 5 North Dakota 2.6 2
## 6 New Hampshire 2.8 4
# Remove irrelevant rows and columns from data frames
unemp2021_df<-unemp2021_df %>% slice(-c(1,2))
unemp2021_df<-subset(unemp2021_df, select=-X3)
unemp2020_df<-unemp2020_df %>% slice(-c(1,2))
unemp2020_df<-subset(unemp2020_df, select=-X3)
unemp2019_df<-unemp2019_df %>% slice(-c(1,2))
unemp2019_df<-subset(unemp2019_df, select=-X3)
unemp2018_df<-unemp2018_df %>% slice(-c(1,2))
unemp2018_df<-subset(unemp2018_df, select=-X3)
unemp2017_df<-unemp2017_df %>% slice(-c(1,2))
unemp2017_df<-subset(unemp2017_df, select=-X3)
# Renaming columns
unemp2021_df<-unemp2021_df %>% rename(state=X1,unemployment_rate_2021=X2)
head(unemp2021_df)
## state unemployment_rate_2021
## 1 Nebraska 2.5
## 2 Utah 2.7
## 3 South Dakota 3.1
## 4 Kansas 3.2
## 5 Alabama 3.4
## 6 Minnesota 3.4
unemp2020_df<-unemp2020_df %>% rename(state=X1,unemployment_rate_2020=X2)
head(unemp2020_df)
## state unemployment_rate_2020
## 1 Nebraska 4.1
## 2 South Dakota 4.3
## 3 Utah 4.7
## 4 Maine 5.0
## 5 Iowa 5.1
## 6 North Dakota 5.1
unemp2019_df<-unemp2019_df %>% rename(state=X1,unemployment_rate_2019=X2)
head(unemp2019_df)
## state unemployment_rate_2019
## 1 North Dakota 2.1
## 2 Vermont 2.3
## 3 Hawaii 2.5
## 4 Colorado 2.6
## 5 Iowa 2.6
## 6 New Hampshire 2.6
unemp2018_df<-unemp2018_df %>% rename(state=X1,unemployment_rate_2018=X2)
head(unemp2018_df)
## state unemployment_rate_2018
## 1 Hawaii 2.4
## 2 North Dakota 2.4
## 3 Iowa 2.5
## 4 New Hampshire 2.6
## 5 Vermont 2.6
## 6 South Dakota 2.8
unemp2017_df<-unemp2017_df %>% rename(state=X1,unemployment_rate_2017=X2)
head(unemp2017_df)
## state unemployment_rate_2017
## 1 Hawaii 2.2
## 2 Colorado 2.6
## 3 North Dakota 2.6
## 4 New Hampshire 2.8
## 5 Nebraska 3.0
## 6 Vermont 3.0
# Combining data frames with one column name in common
unemp_df<-full_join(unemp2021_df,unemp2020_df,by="state") %>%
full_join(unemp2019_df,by="state") %>%
full_join(unemp2018_df,by="state") %>%
full_join(unemp2017_df,by="state")
head(unemp_df)
## state unemployment_rate_2021 unemployment_rate_2020
## 1 Nebraska 2.5 4.1
## 2 Utah 2.7 4.7
## 3 South Dakota 3.1 4.3
## 4 Kansas 3.2 5.7
## 5 Alabama 3.4 6.5
## 6 Minnesota 3.4 6.3
## unemployment_rate_2019 unemployment_rate_2018 unemployment_rate_2017
## 1 3.0 2.9 3.0
## 2 2.6 2.9 3.1
## 3 2.8 2.8 3.1
## 4 3.1 3.3 3.6
## 5 3.2 3.9 4.5
## 6 3.4 3.1 3.5
# Convert to lowercase in data frames
unemp_df$state<-tolower(unemp_df$state)
head(unemp_df)
## state unemployment_rate_2021 unemployment_rate_2020
## 1 nebraska 2.5 4.1
## 2 utah 2.7 4.7
## 3 south dakota 3.1 4.3
## 4 kansas 3.2 5.7
## 5 alabama 3.4 6.5
## 6 minnesota 3.4 6.3
## unemployment_rate_2019 unemployment_rate_2018 unemployment_rate_2017
## 1 3.0 2.9 3.0
## 2 2.6 2.9 3.1
## 3 2.8 2.8 3.1
## 4 3.1 3.3 3.6
## 5 3.2 3.9 4.5
## 6 3.4 3.1 3.5
# Sorting data frame by alphabetic order
unemp_df <- unemp_df[order(unemp_df$state),]
head(unemp_df)
## state unemployment_rate_2021 unemployment_rate_2020
## 5 alabama 3.4 6.5
## 46 alaska 6.4 8.2
## 27 arizona 4.9 7.7
## 17 arkansas 4.0 6.1
## 51 california 7.3 10.2
## 33 colorado 5.4 6.9
## unemployment_rate_2019 unemployment_rate_2018 unemployment_rate_2017
## 5 3.2 3.9 4.5
## 46 5.5 6.0 6.5
## 27 4.9 4.8 5.0
## 17 3.5 3.6 3.7
## 51 4.1 4.3 4.8
## 33 2.6 3.0 2.6
# Adding average unemployment rate column
unemp_df<-unemp_df %>%
rowwise() %>%
mutate(avg_unemp_rate = mean(c_across(unemployment_rate_2021:unemployment_rate_2017))) %>% select(state,avg_unemp_rate)
glimpse(unemp_df)
## Rows: 51
## Columns: 2
## Rowwise:
## $ state <chr> "alabama", "alaska", "arizona", "arkansas", "california…
## $ avg_unemp_rate <dbl> 4.30, 6.52, 5.46, 4.18, 6.14, 4.10, 5.18, 4.98, 6.34, 4…
head(unemp_df)
## # A tibble: 6 × 2
## # Rowwise:
## state avg_unemp_rate
## <chr> <dbl>
## 1 alabama 4.3
## 2 alaska 6.52
## 3 arizona 5.46
## 4 arkansas 4.18
## 5 california 6.14
## 6 colorado 4.1
df_2017<-read.csv("https://raw.githubusercontent.com/Raji030/data607_mass_shooting/main/2017.csv")
df_2018<-read.csv("https://raw.githubusercontent.com/Raji030/data607_mass_shooting/main/2018.csv")
df_2019<-read.csv("https://raw.githubusercontent.com/Raji030/data607_mass_shooting/main/2019.csv")
df4<-read.csv("https://raw.githubusercontent.com/Raji030/data607_mass_shooting/main/19-22.csv")
# Subset data frame after columns renaming and adding new column
df_2017<-df_2017 %>% rename(state=State,killed=X..Killed,injured=X..Injured)
df_2017<-df_2017 %>% mutate(total_casualty_2017=killed+injured) %>% select(state,total_casualty_2017)
df_2018<-df_2018 %>% rename(state=State,killed=X..Killed,injured=X..Injured)
df_2018<-df_2018 %>% mutate(total_casualty_2018=killed+injured) %>% select(state,total_casualty_2018)
df_2019<-df_2019 %>% rename(state=State,killed=X..Killed,injured=X..Injured)
df_2019<-df_2019 %>% mutate(total_casualty_2019=killed+injured) %>% select(state,total_casualty_2019)
df_2020 <- df4[c(1310:1919), ]
df_2020<-df_2020 %>% rename(state=State,killed=X..Killed,injured=X..Injured)
df_2020<-df_2020%>% mutate(total_casualty_2020=killed+injured) %>% select(state,total_casualty_2020)
df_2021<-df4[c(620:1309),]
df_2021<-df_2021%>% rename(state=State,killed=X..Killed,injured=X..Injured)
df_2021<-df_2021%>% mutate(total_casualty_2021=killed+injured) %>% select(state,total_casualty_2021)
#Merging rows with the same name in state column
df_2017<-df_2017 %>% group_by(state) %>% summarise_each(funs(sum))
## Warning: `summarise_each_()` was deprecated in dplyr 0.7.0.
## ℹ Please use `across()` instead.
## ℹ The deprecated feature was likely used in the dplyr package.
## Please report the issue at <https://github.com/tidyverse/dplyr/issues>.
## Warning: `funs()` was deprecated in dplyr 0.8.0.
## ℹ Please use a list of either functions or lambdas:
##
## # Simple named list: list(mean = mean, median = median)
##
## # Auto named with `tibble::lst()`: tibble::lst(mean, median)
##
## # Using lambdas list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
df_2018<-df_2018%>% group_by(state) %>% summarise_each(funs(sum))
df_2019<-df_2019 %>% group_by(state) %>% summarise_each(funs(sum))
df_2020<-df_2020 %>% group_by(state) %>% summarise_each(funs(sum))
df_2021<-df_2021 %>% group_by(state) %>% summarise_each(funs(sum))
# Combine casualty dataframe
casualty_df<-full_join(df_2021,df_2020,by="state") %>%
full_join(df_2019,by="state") %>%
full_join(df_2018,by="state") %>%
full_join(df_2017,by="state")
head(casualty_df)
## # A tibble: 6 × 6
## state total_casualty_2021 total_casualty_2020 total_cas…¹ total…² total…³
## <chr> <int> <int> <int> <int> <int>
## 1 Alabama 107 53 67 67 44
## 2 Alaska 5 13 NA NA NA
## 3 Arizona 37 20 33 9 20
## 4 Arkansas 29 58 23 24 37
## 5 California 224 190 258 186 186
## 6 Colorado 74 58 22 36 12
## # … with abbreviated variable names ¹total_casualty_2019, ²total_casualty_2018,
## # ³total_casualty_2017
# Convert state name to lower case
casualty_df$state<-tolower(casualty_df$state)
# order state name alphabetically
casualty_df <- casualty_df[order(casualty_df$state),]
casualty_df
## # A tibble: 48 × 6
## state total_casualty_2021 total_casu…¹ total…² total…³ total…⁴
## <chr> <int> <int> <int> <int> <int>
## 1 alabama 107 53 67 67 44
## 2 alaska 5 13 NA NA NA
## 3 arizona 37 20 33 9 20
## 4 arkansas 29 58 23 24 37
## 5 california 224 190 258 186 186
## 6 colorado 74 58 22 36 12
## 7 connecticut 8 34 13 8 8
## 8 delaware 29 14 10 5 4
## 9 district of columbia 75 63 32 25 24
## 10 florida 187 159 74 179 122
## # … with 38 more rows, and abbreviated variable names ¹total_casualty_2020,
## # ²total_casualty_2019, ³total_casualty_2018, ⁴total_casualty_2017
df_combined<-full_join(unemp_df,casualty_df,by="state")
df_combined
## # A tibble: 51 × 7
## # Rowwise:
## state avg_unemp_rate total_c…¹ total…² total…³ total…⁴ total…⁵
## <chr> <dbl> <int> <int> <int> <int> <int>
## 1 alabama 4.3 107 53 67 67 44
## 2 alaska 6.52 5 13 NA NA NA
## 3 arizona 5.46 37 20 33 9 20
## 4 arkansas 4.18 29 58 23 24 37
## 5 california 6.14 224 190 258 186 186
## 6 colorado 4.1 74 58 22 36 12
## 7 connecticut 5.18 8 34 13 8 8
## 8 delaware 4.98 29 14 10 5 4
## 9 district of columbia 6.34 75 63 32 25 24
## 10 florida 4.8 187 159 74 179 122
## # … with 41 more rows, and abbreviated variable names ¹total_casualty_2021,
## # ²total_casualty_2020, ³total_casualty_2019, ⁴total_casualty_2018,
## # ⁵total_casualty_2017
df_combined[is.na(df_combined)] <- 0
df_combined
## # A tibble: 51 × 7
## # Rowwise:
## state avg_unemp_rate total_c…¹ total…² total…³ total…⁴ total…⁵
## <chr> <dbl> <int> <int> <int> <int> <int>
## 1 alabama 4.3 107 53 67 67 44
## 2 alaska 6.52 5 13 0 0 0
## 3 arizona 5.46 37 20 33 9 20
## 4 arkansas 4.18 29 58 23 24 37
## 5 california 6.14 224 190 258 186 186
## 6 colorado 4.1 74 58 22 36 12
## 7 connecticut 5.18 8 34 13 8 8
## 8 delaware 4.98 29 14 10 5 4
## 9 district of columbia 6.34 75 63 32 25 24
## 10 florida 4.8 187 159 74 179 122
## # … with 41 more rows, and abbreviated variable names ¹total_casualty_2021,
## # ²total_casualty_2020, ³total_casualty_2019, ⁴total_casualty_2018,
## # ⁵total_casualty_2017
# Adding average and total casualty column
df_final<-df_combined %>%
rowwise() %>%
mutate(avg_casualty = mean(c_across(total_casualty_2021:total_casualty_2017)),total_casualty=sum(c_across(total_casualty_2021:total_casualty_2017))) %>% select(state,avg_unemp_rate,avg_casualty,total_casualty)
df_final<-data.frame(df_final)
glimpse(df_final)
## Rows: 51
## Columns: 4
## $ state <chr> "alabama", "alaska", "arizona", "arkansas", "california…
## $ avg_unemp_rate <dbl> 4.30, 6.52, 5.46, 4.18, 6.14, 4.10, 5.18, 4.98, 6.34, 4…
## $ avg_casualty <dbl> 67.6, 3.6, 23.8, 34.2, 208.8, 40.4, 14.2, 12.4, 43.8, 1…
## $ total_casualty <int> 338, 18, 119, 171, 1044, 202, 71, 62, 219, 721, 378, 0,…
head(df_final)
## state avg_unemp_rate avg_casualty total_casualty
## 1 alabama 4.30 67.6 338
## 2 alaska 6.52 3.6 18
## 3 arizona 5.46 23.8 119
## 4 arkansas 4.18 34.2 171
## 5 california 6.14 208.8 1044
## 6 colorado 4.10 40.4 202
ggplot(unemp_df,aes(x=reorder(state,-avg_unemp_rate), avg_unemp_rate))+geom_bar(stat='identity',width = 0.8,color='red')+coord_flip()
index1<-which.max(unemp_df$avg_unemp_rate)
unemp_df$state[index1]
## [1] "nevada"
unemp_df$avg_unemp_rate[index1]
## [1] 6.82
index2<-which.min(unemp_df$avg_unemp_rate)
unemp_df$state[index2]
## [1] "nebraska"
unemp_df$avg_unemp_rate[index2]
## [1] 3.1
ggplot(df_final,aes(x=reorder(state,-total_casualty), total_casualty))+geom_bar(stat='identity',width = 0.8,color='red')+coord_flip()
index3<-which.max(df_final$total_casualty)
df_final$state[index3]
## [1] "illinois"
df_final$total_casualty[index3]
## [1] 1345
index4<-which.min(df_final$total_casualty)
df_final$state[index4]
## [1] "hawaii"
df_final$total_casualty[index4]
## [1] 0
plotting the relationship between the average unemployment rate and total casualty in US states for the year between 2017 and 2021 considering average unemployment rate as the predictor (independent variable ).
ggplot(df_final,aes(x=avg_unemp_rate,y=total_casualty))+geom_point(stat="identity")+geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
df_final %>%
summarise(cor(avg_unemp_rate,total_casualty))
## cor(avg_unemp_rate, total_casualty)
## 1 0.4384533
The coefficient value above indicates a moderate relationship between the variables.
By using the lm function to fit the linear model ( a.k.a regression line):
linear_model <- lm(total_casualty ~ avg_unemp_rate, data =df_final)
summary(linear_model)
##
## Call:
## lm(formula = total_casualty ~ avg_unemp_rate, data = df_final)
##
## Residuals:
## Min 1Q Median 3Q Max
## -464.97 -179.17 -40.33 129.22 967.15
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -373.81 186.17 -2.008 0.05019 .
## avg_unemp_rate 131.41 38.48 3.415 0.00129 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 264.2 on 49 degrees of freedom
## Multiple R-squared: 0.1922, Adjusted R-squared: 0.1758
## F-statistic: 11.66 on 1 and 49 DF, p-value: 0.00129
With the summary above,the least squares regression line for the linear model: yˆ= 131.41 × (avg_unemp_rate)-373.81
The positive slope line indicates that it’s trend is upward. The slope of the line also indicates that if change in average unemployment rate happens by 1, the change in total casualty value will go up by 131.41. The R-squared value above is very low (0.19) which indicating that the correlation between the two variables is not strong. It is also reflecting that 19% of the variability in the predicting variables can be explained by this model. The p-value,0.00129 is lower than the usual significance level (0.05) which indicating that the average unemployment rate is statistically significant.
To assess whether the linear model is reliable, I will check for linearity, nearly normal residual and constant variability of the residuals.
**Linearity check by the residuals vs. fitted (predicted) plot:
ggplot(data = linear_model, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") +
xlab("Fitted values") +
ylab("Residuals")
From the plot above, it is seen that the residuals are not distributed
around 0 with a consistent pattern, which indicating a non linear trend.
So, I can not assume a linear regression model for fitting the data set
here.
** Nearly normal residuals check by plotting histogram:
ggplot(data = linear_model, aes(x = .resid)) +
geom_histogram(binwidth = 120) +
xlab("Residuals")
or by a normal probability plot of the residuals:
ggplot(data = linear_model, aes(sample = .resid)) +
stat_qq()
It is seen that the histogram is not nearly normally distributed (right skewed). Also, the qq plot is not reflecting a straight line. It has zig-zag and curvature in its shape and errant points. So, the residuals condition for being nearly normal is not met here.
** Constant variability check: From the residuals vs. fitted plot above, it is seen that the spread of the residuals is not roughly equal at each level of the fitted values.So, the constant variance condition or assumption is not properly met here.
The correlation coefficient value (0.44) is suggesting a moderate correlation between the unemployment rate and mass shooting casualty and the p-value (0.00129) is lower than the usual significance level (0.05) is indicating that the unemployment rate is statistically significant.From the linear model diagnostics above, it is seen that the linear regression model was unable to satisfy all the conditions for a strong correlation between the two variables. So, the model will not work properly to predict the relationship between the variables i chose here. Hence, from this analysis it can be said that the unemployment rate in states is not the proper predictor of the mass shooting casualty in US. Rather, it is a relatively a weak predictor of the mass shooting casualties and it is practical. Since, other factors like far-right radicalization, community’s change in employment or economic well-being over time have much impact on mass shooting casualties. Therefore, it can be said that the effect of unemployment rate alone is less on mass shooting casualty in USA.
Unemployment rate data source: https://www.bls.gov/ US mass shooting casualty data source: https://www.gunviolencearchive.org/ Introduction to linear regression: https://fall2022.data606.net/chapters/chapter8/ https://github.com/Raji030/data606_lab8