Introduction

Gun violence, especially mass shootings is now a disease in USA. Therefore, I think it is an interesting topic to understand and analyze if there is a correlation between unemployment rate and mass shootings in the USA. So, i set the research question for this project: “Does unemployment rate have an effect on mass shootings in US states?” For this analysis, I collected the unemployment rate data for the states from The U.S. Bureau of Labor Statistics for the year between 2017 and 2021 by using web scrapping from their web page. I also collected mass shooting casualty data as CSV files from the Gun Violence Archive and stored them on my github.Then, I will clean up and transform the data as required to conduct some exploratory and statistical analysis (linear regression analysis) to see if there is a correlation between the stated variables.

Load libraries

library(dplyr)
library(rvest)
library(ggplot2)

Getting unemployment rate data for states for the year between 2017 and 2021 through web srcapping from Us Bureau of Labor Statistics website and creating data frames

# Uemployment rate data of 2021 annual averages
unemp2021_url<-read_html("https://www.bls.gov/lau/lastrk21.htm")
unemp_data_2021<-unemp2021_url %>% html_nodes(xpath='//*[@id="lastrk21"]/tbody') %>% html_table()
unemp2021_df<-as.data.frame(unemp_data_2021)
glimpse(unemp_data_2021)
## List of 1
##  $ : tibble [53 × 3] (S3: tbl_df/tbl/data.frame)
##   ..$ X1: chr [1:53] "United States" "" "Nebraska" "Utah" ...
##   ..$ X2: num [1:53] 5.3 NA 2.5 2.7 3.1 3.2 3.4 3.4 3.4 3.4 ...
##   ..$ X3: int [1:53] NA NA 1 2 3 4 5 5 5 5 ...
head(unemp2021_df)
##              X1  X2 X3
## 1 United States 5.3 NA
## 2                NA NA
## 3      Nebraska 2.5  1
## 4          Utah 2.7  2
## 5  South Dakota 3.1  3
## 6        Kansas 3.2  4
# Unemployment rate data of 2020 annual averages
unemp2020_url<-read_html("https://www.bls.gov/lau/lastrk20.htm")
unemp_data_2020<-unemp2020_url %>% html_nodes(xpath='//*[@id="lastrk20"]/tbody') %>% html_table()
unemp2020_df<-as.data.frame(unemp_data_2020)
head(unemp2020_df)
##              X1  X2 X3
## 1 United States 8.1 NA
## 2                NA NA
## 3      Nebraska 4.1  1
## 4  South Dakota 4.3  2
## 5          Utah 4.7  3
## 6         Maine 5.0  4
# Unemployment rate data of 2019 annual averages
unemp2019_url<-read_html("https://www.bls.gov/lau/lastrk19.htm")
unemp_data_2019<-unemp2019_url %>% html_nodes(xpath='//*[@id="lastrk19"]/tbody') %>% html_table()
unemp2019_df<-as.data.frame(unemp_data_2019)
head(unemp2019_df)
##              X1  X2 X3
## 1 United States 3.7 NA
## 2                NA NA
## 3  North Dakota 2.1  1
## 4       Vermont 2.3  2
## 5        Hawaii 2.5  3
## 6      Colorado 2.6  4
# Unemployment rate data of 2018 annual averages
unemp2018_url<-read_html("https://www.bls.gov/lau/lastrk18.htm")
unemp_data_2018<-unemp2018_url %>% html_nodes(xpath='//*[@id="lastrk18"]/tbody') %>% html_table()
unemp2018_df<-as.data.frame(unemp_data_2018)
head(unemp2018_df)
##              X1  X2 X3
## 1 United States 3.9 NA
## 2                NA NA
## 3        Hawaii 2.4  1
## 4  North Dakota 2.4  1
## 5          Iowa 2.5  3
## 6 New Hampshire 2.6  4
# Unemployment rate data of 2017 annual average
unemp2017_url<-read_html("https://www.bls.gov/lau/lastrk17.htm")
unemp_data_2017<-unemp2017_url %>% html_nodes(xpath='//*[@id="lastrk17"]/tbody') %>% html_table()
unemp2017_df<-as.data.frame(unemp_data_2017)
head(unemp2017_df)
##              X1  X2 X3
## 1 United States 4.4 NA
## 2                NA NA
## 3        Hawaii 2.2  1
## 4      Colorado 2.6  2
## 5  North Dakota 2.6  2
## 6 New Hampshire 2.8  4

Unemployment rate data tidying and transformation

# Remove irrelevant rows and columns from data frames
unemp2021_df<-unemp2021_df %>% slice(-c(1,2))
unemp2021_df<-subset(unemp2021_df, select=-X3)

unemp2020_df<-unemp2020_df %>% slice(-c(1,2))
unemp2020_df<-subset(unemp2020_df, select=-X3)

unemp2019_df<-unemp2019_df %>% slice(-c(1,2))
unemp2019_df<-subset(unemp2019_df, select=-X3)

unemp2018_df<-unemp2018_df %>% slice(-c(1,2))
unemp2018_df<-subset(unemp2018_df, select=-X3)

unemp2017_df<-unemp2017_df %>% slice(-c(1,2))
unemp2017_df<-subset(unemp2017_df, select=-X3)

# Renaming columns
unemp2021_df<-unemp2021_df %>% rename(state=X1,unemployment_rate_2021=X2)
head(unemp2021_df)
##          state unemployment_rate_2021
## 1     Nebraska                    2.5
## 2         Utah                    2.7
## 3 South Dakota                    3.1
## 4       Kansas                    3.2
## 5      Alabama                    3.4
## 6    Minnesota                    3.4
unemp2020_df<-unemp2020_df %>% rename(state=X1,unemployment_rate_2020=X2)
head(unemp2020_df)
##          state unemployment_rate_2020
## 1     Nebraska                    4.1
## 2 South Dakota                    4.3
## 3         Utah                    4.7
## 4        Maine                    5.0
## 5         Iowa                    5.1
## 6 North Dakota                    5.1
unemp2019_df<-unemp2019_df %>% rename(state=X1,unemployment_rate_2019=X2)
head(unemp2019_df)
##           state unemployment_rate_2019
## 1  North Dakota                    2.1
## 2       Vermont                    2.3
## 3        Hawaii                    2.5
## 4      Colorado                    2.6
## 5          Iowa                    2.6
## 6 New Hampshire                    2.6
unemp2018_df<-unemp2018_df %>% rename(state=X1,unemployment_rate_2018=X2)
head(unemp2018_df)
##           state unemployment_rate_2018
## 1        Hawaii                    2.4
## 2  North Dakota                    2.4
## 3          Iowa                    2.5
## 4 New Hampshire                    2.6
## 5       Vermont                    2.6
## 6  South Dakota                    2.8
unemp2017_df<-unemp2017_df %>% rename(state=X1,unemployment_rate_2017=X2)
head(unemp2017_df)
##           state unemployment_rate_2017
## 1        Hawaii                    2.2
## 2      Colorado                    2.6
## 3  North Dakota                    2.6
## 4 New Hampshire                    2.8
## 5      Nebraska                    3.0
## 6       Vermont                    3.0
# Combining data frames with one column name in common
unemp_df<-full_join(unemp2021_df,unemp2020_df,by="state") %>% 
          full_join(unemp2019_df,by="state") %>% 
          full_join(unemp2018_df,by="state") %>% 
          full_join(unemp2017_df,by="state")
head(unemp_df)
##          state unemployment_rate_2021 unemployment_rate_2020
## 1     Nebraska                    2.5                    4.1
## 2         Utah                    2.7                    4.7
## 3 South Dakota                    3.1                    4.3
## 4       Kansas                    3.2                    5.7
## 5      Alabama                    3.4                    6.5
## 6    Minnesota                    3.4                    6.3
##   unemployment_rate_2019 unemployment_rate_2018 unemployment_rate_2017
## 1                    3.0                    2.9                    3.0
## 2                    2.6                    2.9                    3.1
## 3                    2.8                    2.8                    3.1
## 4                    3.1                    3.3                    3.6
## 5                    3.2                    3.9                    4.5
## 6                    3.4                    3.1                    3.5
# Convert to lowercase in data frames
unemp_df$state<-tolower(unemp_df$state)
head(unemp_df)
##          state unemployment_rate_2021 unemployment_rate_2020
## 1     nebraska                    2.5                    4.1
## 2         utah                    2.7                    4.7
## 3 south dakota                    3.1                    4.3
## 4       kansas                    3.2                    5.7
## 5      alabama                    3.4                    6.5
## 6    minnesota                    3.4                    6.3
##   unemployment_rate_2019 unemployment_rate_2018 unemployment_rate_2017
## 1                    3.0                    2.9                    3.0
## 2                    2.6                    2.9                    3.1
## 3                    2.8                    2.8                    3.1
## 4                    3.1                    3.3                    3.6
## 5                    3.2                    3.9                    4.5
## 6                    3.4                    3.1                    3.5
# Sorting data frame by alphabetic order
unemp_df <- unemp_df[order(unemp_df$state),]
head(unemp_df)
##         state unemployment_rate_2021 unemployment_rate_2020
## 5     alabama                    3.4                    6.5
## 46     alaska                    6.4                    8.2
## 27    arizona                    4.9                    7.7
## 17   arkansas                    4.0                    6.1
## 51 california                    7.3                   10.2
## 33   colorado                    5.4                    6.9
##    unemployment_rate_2019 unemployment_rate_2018 unemployment_rate_2017
## 5                     3.2                    3.9                    4.5
## 46                    5.5                    6.0                    6.5
## 27                    4.9                    4.8                    5.0
## 17                    3.5                    3.6                    3.7
## 51                    4.1                    4.3                    4.8
## 33                    2.6                    3.0                    2.6
# Adding average unemployment rate column
unemp_df<-unemp_df %>%
  rowwise() %>%
  mutate(avg_unemp_rate = mean(c_across(unemployment_rate_2021:unemployment_rate_2017))) %>% select(state,avg_unemp_rate)
glimpse(unemp_df)
## Rows: 51
## Columns: 2
## Rowwise: 
## $ state          <chr> "alabama", "alaska", "arizona", "arkansas", "california…
## $ avg_unemp_rate <dbl> 4.30, 6.52, 5.46, 4.18, 6.14, 4.10, 5.18, 4.98, 6.34, 4…
head(unemp_df)
## # A tibble: 6 × 2
## # Rowwise: 
##   state      avg_unemp_rate
##   <chr>               <dbl>
## 1 alabama              4.3 
## 2 alaska               6.52
## 3 arizona              5.46
## 4 arkansas             4.18
## 5 california           6.14
## 6 colorado             4.1

Getting mass shooting casualty data

df_2017<-read.csv("https://raw.githubusercontent.com/Raji030/data607_mass_shooting/main/2017.csv")
df_2018<-read.csv("https://raw.githubusercontent.com/Raji030/data607_mass_shooting/main/2018.csv")
df_2019<-read.csv("https://raw.githubusercontent.com/Raji030/data607_mass_shooting/main/2019.csv")
df4<-read.csv("https://raw.githubusercontent.com/Raji030/data607_mass_shooting/main/19-22.csv")

Mass shooting casualty data tidying and transformation

# Subset data frame after columns renaming and adding new column
df_2017<-df_2017 %>% rename(state=State,killed=X..Killed,injured=X..Injured)
df_2017<-df_2017 %>% mutate(total_casualty_2017=killed+injured) %>% select(state,total_casualty_2017)

df_2018<-df_2018 %>% rename(state=State,killed=X..Killed,injured=X..Injured)
df_2018<-df_2018 %>% mutate(total_casualty_2018=killed+injured) %>% select(state,total_casualty_2018)

df_2019<-df_2019 %>% rename(state=State,killed=X..Killed,injured=X..Injured)
df_2019<-df_2019 %>% mutate(total_casualty_2019=killed+injured) %>% select(state,total_casualty_2019)

df_2020 <- df4[c(1310:1919), ]
df_2020<-df_2020 %>% rename(state=State,killed=X..Killed,injured=X..Injured)
df_2020<-df_2020%>% mutate(total_casualty_2020=killed+injured) %>% select(state,total_casualty_2020)

df_2021<-df4[c(620:1309),]
df_2021<-df_2021%>% rename(state=State,killed=X..Killed,injured=X..Injured)
df_2021<-df_2021%>% mutate(total_casualty_2021=killed+injured) %>% select(state,total_casualty_2021)

#Merging rows with the same name in state column
df_2017<-df_2017 %>% group_by(state) %>% summarise_each(funs(sum))
## Warning: `summarise_each_()` was deprecated in dplyr 0.7.0.
## ℹ Please use `across()` instead.
## ℹ The deprecated feature was likely used in the dplyr package.
##   Please report the issue at <https://github.com/tidyverse/dplyr/issues>.
## Warning: `funs()` was deprecated in dplyr 0.8.0.
## ℹ Please use a list of either functions or lambdas:
## 
## # Simple named list: list(mean = mean, median = median)
## 
## # Auto named with `tibble::lst()`: tibble::lst(mean, median)
## 
## # Using lambdas list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
df_2018<-df_2018%>% group_by(state) %>% summarise_each(funs(sum))
df_2019<-df_2019 %>% group_by(state) %>% summarise_each(funs(sum))
df_2020<-df_2020 %>% group_by(state) %>% summarise_each(funs(sum))
df_2021<-df_2021 %>% group_by(state) %>% summarise_each(funs(sum))

# Combine casualty dataframe
casualty_df<-full_join(df_2021,df_2020,by="state") %>% 
          full_join(df_2019,by="state") %>% 
          full_join(df_2018,by="state") %>% 
          full_join(df_2017,by="state")
head(casualty_df)
## # A tibble: 6 × 6
##   state      total_casualty_2021 total_casualty_2020 total_cas…¹ total…² total…³
##   <chr>                    <int>               <int>       <int>   <int>   <int>
## 1 Alabama                    107                  53          67      67      44
## 2 Alaska                       5                  13          NA      NA      NA
## 3 Arizona                     37                  20          33       9      20
## 4 Arkansas                    29                  58          23      24      37
## 5 California                 224                 190         258     186     186
## 6 Colorado                    74                  58          22      36      12
## # … with abbreviated variable names ¹​total_casualty_2019, ²​total_casualty_2018,
## #   ³​total_casualty_2017
# Convert state name to lower case
casualty_df$state<-tolower(casualty_df$state)

# order state name alphabetically
casualty_df <- casualty_df[order(casualty_df$state),]
casualty_df
## # A tibble: 48 × 6
##    state                total_casualty_2021 total_casu…¹ total…² total…³ total…⁴
##    <chr>                              <int>        <int>   <int>   <int>   <int>
##  1 alabama                              107           53      67      67      44
##  2 alaska                                 5           13      NA      NA      NA
##  3 arizona                               37           20      33       9      20
##  4 arkansas                              29           58      23      24      37
##  5 california                           224          190     258     186     186
##  6 colorado                              74           58      22      36      12
##  7 connecticut                            8           34      13       8       8
##  8 delaware                              29           14      10       5       4
##  9 district of columbia                  75           63      32      25      24
## 10 florida                              187          159      74     179     122
## # … with 38 more rows, and abbreviated variable names ¹​total_casualty_2020,
## #   ²​total_casualty_2019, ³​total_casualty_2018, ⁴​total_casualty_2017

Combine unemployment and casualty data and get final data frame

df_combined<-full_join(unemp_df,casualty_df,by="state")
df_combined
## # A tibble: 51 × 7
## # Rowwise: 
##    state                avg_unemp_rate total_c…¹ total…² total…³ total…⁴ total…⁵
##    <chr>                         <dbl>     <int>   <int>   <int>   <int>   <int>
##  1 alabama                        4.3        107      53      67      67      44
##  2 alaska                         6.52         5      13      NA      NA      NA
##  3 arizona                        5.46        37      20      33       9      20
##  4 arkansas                       4.18        29      58      23      24      37
##  5 california                     6.14       224     190     258     186     186
##  6 colorado                       4.1         74      58      22      36      12
##  7 connecticut                    5.18         8      34      13       8       8
##  8 delaware                       4.98        29      14      10       5       4
##  9 district of columbia           6.34        75      63      32      25      24
## 10 florida                        4.8        187     159      74     179     122
## # … with 41 more rows, and abbreviated variable names ¹​total_casualty_2021,
## #   ²​total_casualty_2020, ³​total_casualty_2019, ⁴​total_casualty_2018,
## #   ⁵​total_casualty_2017
df_combined[is.na(df_combined)] <- 0
df_combined
## # A tibble: 51 × 7
## # Rowwise: 
##    state                avg_unemp_rate total_c…¹ total…² total…³ total…⁴ total…⁵
##    <chr>                         <dbl>     <int>   <int>   <int>   <int>   <int>
##  1 alabama                        4.3        107      53      67      67      44
##  2 alaska                         6.52         5      13       0       0       0
##  3 arizona                        5.46        37      20      33       9      20
##  4 arkansas                       4.18        29      58      23      24      37
##  5 california                     6.14       224     190     258     186     186
##  6 colorado                       4.1         74      58      22      36      12
##  7 connecticut                    5.18         8      34      13       8       8
##  8 delaware                       4.98        29      14      10       5       4
##  9 district of columbia           6.34        75      63      32      25      24
## 10 florida                        4.8        187     159      74     179     122
## # … with 41 more rows, and abbreviated variable names ¹​total_casualty_2021,
## #   ²​total_casualty_2020, ³​total_casualty_2019, ⁴​total_casualty_2018,
## #   ⁵​total_casualty_2017
# Adding average and total casualty column
df_final<-df_combined %>%
  rowwise() %>%
  mutate(avg_casualty = mean(c_across(total_casualty_2021:total_casualty_2017)),total_casualty=sum(c_across(total_casualty_2021:total_casualty_2017))) %>% select(state,avg_unemp_rate,avg_casualty,total_casualty)
df_final<-data.frame(df_final)
glimpse(df_final)
## Rows: 51
## Columns: 4
## $ state          <chr> "alabama", "alaska", "arizona", "arkansas", "california…
## $ avg_unemp_rate <dbl> 4.30, 6.52, 5.46, 4.18, 6.14, 4.10, 5.18, 4.98, 6.34, 4…
## $ avg_casualty   <dbl> 67.6, 3.6, 23.8, 34.2, 208.8, 40.4, 14.2, 12.4, 43.8, 1…
## $ total_casualty <int> 338, 18, 119, 171, 1044, 202, 71, 62, 219, 721, 378, 0,…
head(df_final)
##        state avg_unemp_rate avg_casualty total_casualty
## 1    alabama           4.30         67.6            338
## 2     alaska           6.52          3.6             18
## 3    arizona           5.46         23.8            119
## 4   arkansas           4.18         34.2            171
## 5 california           6.14        208.8           1044
## 6   colorado           4.10         40.4            202

Visualize average unemployment rate in states

ggplot(unemp_df,aes(x=reorder(state,-avg_unemp_rate), avg_unemp_rate))+geom_bar(stat='identity',width = 0.8,color='red')+coord_flip()

Check state names with highest and lowest average unemployement rate

index1<-which.max(unemp_df$avg_unemp_rate)
unemp_df$state[index1]
## [1] "nevada"
unemp_df$avg_unemp_rate[index1]
## [1] 6.82
index2<-which.min(unemp_df$avg_unemp_rate)
unemp_df$state[index2]
## [1] "nebraska"
unemp_df$avg_unemp_rate[index2]
## [1] 3.1

Visualize total casualty in states

ggplot(df_final,aes(x=reorder(state,-total_casualty), total_casualty))+geom_bar(stat='identity',width = 0.8,color='red')+coord_flip()

Check state names with highest and lowest mass shooting casualty

index3<-which.max(df_final$total_casualty)
df_final$state[index3]
## [1] "illinois"
df_final$total_casualty[index3]
## [1] 1345
index4<-which.min(df_final$total_casualty)
df_final$state[index4]
## [1] "hawaii"
df_final$total_casualty[index4]
## [1] 0

Linear regression analysis

plotting the relationship between the average unemployment rate and total casualty in US states for the year between 2017 and 2021 considering average unemployment rate as the predictor (independent variable ).

ggplot(df_final,aes(x=avg_unemp_rate,y=total_casualty))+geom_point(stat="identity")+geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'

Find correlation coefficient

df_final %>%
summarise(cor(avg_unemp_rate,total_casualty))
##   cor(avg_unemp_rate, total_casualty)
## 1                           0.4384533

The coefficient value above indicates a moderate relationship between the variables.

Find the linear model

By using the lm function to fit the linear model ( a.k.a regression line):

linear_model <- lm(total_casualty ~ avg_unemp_rate, data =df_final)
summary(linear_model)
## 
## Call:
## lm(formula = total_casualty ~ avg_unemp_rate, data = df_final)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -464.97 -179.17  -40.33  129.22  967.15 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)   
## (Intercept)     -373.81     186.17  -2.008  0.05019 . 
## avg_unemp_rate   131.41      38.48   3.415  0.00129 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 264.2 on 49 degrees of freedom
## Multiple R-squared:  0.1922, Adjusted R-squared:  0.1758 
## F-statistic: 11.66 on 1 and 49 DF,  p-value: 0.00129

With the summary above,the least squares regression line for the linear model: yˆ= 131.41 × (avg_unemp_rate)-373.81

The positive slope line indicates that it’s trend is upward. The slope of the line also indicates that if change in average unemployment rate happens by 1, the change in total casualty value will go up by 131.41. The R-squared value above is very low (0.19) which indicating that the correlation between the two variables is not strong. It is also reflecting that 19% of the variability in the predicting variables can be explained by this model. The p-value,0.00129 is lower than the usual significance level (0.05) which indicating that the average unemployment rate is statistically significant.

Model diagnostic

To assess whether the linear model is reliable, I will check for linearity, nearly normal residual and constant variability of the residuals.

**Linearity check by the residuals vs. fitted (predicted) plot:

ggplot(data = linear_model, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") +
xlab("Fitted values") +
ylab("Residuals")

From the plot above, it is seen that the residuals are not distributed around 0 with a consistent pattern, which indicating a non linear trend. So, I can not assume a linear regression model for fitting the data set here.

** Nearly normal residuals check by plotting histogram:

ggplot(data = linear_model, aes(x = .resid)) +
geom_histogram(binwidth = 120) +
xlab("Residuals")

or by a normal probability plot of the residuals:

ggplot(data = linear_model, aes(sample = .resid)) +
stat_qq()

It is seen that the histogram is not nearly normally distributed (right skewed). Also, the qq plot is not reflecting a straight line. It has zig-zag and curvature in its shape and errant points. So, the residuals condition for being nearly normal is not met here.

** Constant variability check: From the residuals vs. fitted plot above, it is seen that the spread of the residuals is not roughly equal at each level of the fitted values.So, the constant variance condition or assumption is not properly met here.

Conclusion

The correlation coefficient value (0.44) is suggesting a moderate correlation between the unemployment rate and mass shooting casualty and the p-value (0.00129) is lower than the usual significance level (0.05) is indicating that the unemployment rate is statistically significant.From the linear model diagnostics above, it is seen that the linear regression model was unable to satisfy all the conditions for a strong correlation between the two variables. So, the model will not work properly to predict the relationship between the variables i chose here. Hence, from this analysis it can be said that the unemployment rate in states is not the proper predictor of the mass shooting casualty in US. Rather, it is a relatively a weak predictor of the mass shooting casualties and it is practical. Since, other factors like far-right radicalization, community’s change in employment or economic well-being over time have much impact on mass shooting casualties. Therefore, it can be said that the effect of unemployment rate alone is less on mass shooting casualty in USA.

References:

Unemployment rate data source: https://www.bls.gov/ US mass shooting casualty data source: https://www.gunviolencearchive.org/ Introduction to linear regression: https://fall2022.data606.net/chapters/chapter8/ https://github.com/Raji030/data606_lab8