Analysis of Mass shootings in USA

Dashboard

Use the below link for our visualization dashboard

Loading packages

This project makes use of the following libraries. We used “readr” and “readxl” to read the data, “skimr” to explore and find different data characteristics, “datadictonary” and “flextable” to create and display data dictionary containing a description of each variable in the dataset, and “lubridate” and “stringr” to manipulate the data. These are some of the primary functions carried out with the assistance of the libraries listed below.

library(readr)
library(readxl)
library(skimr)
library(flextable)
library(dplyr)
library(stringr)
library(tidyr)
library(lubridate)
library(stringr)
library(ggplot2)
library(forcats)
library(datadictionary)
library(tidyverse)

Importing the data

These are the datasets that were used in this project. First, there is “Mass Shootings.csv,” which contains data on mass shootings in the United States of America from 1982 to 2022. The second dataset was “State level data.xlsx,” which contains estimates of the proportion of adult, noninstitutionalized residents who live in a household with a firearm in each U.S. state and year (1980-2016), as well as the data used to produce these estimates.

MassShootings <- read_csv("Mother Jones - Mass Shootings Database, 1982 - 2022 - Sheet1.csv")

State_level_data <- read_excel("Sate level data & factor score.xlsx")

Exploring the data

Exploring and displaying high-level characteristics of the data sets, e.g., important variables and their types, different levels for factor variables, any patterns of missing values.

After Exploring the data by the use of skim function, I can observe that there are no missing values in the datasets I have used for this project.

skim(MassShootings)

Data summary
Name	MassShootings
Number of rows	135
Number of columns	24
_______________________
Column type frequency:
character	22
numeric	2
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
case	0	1.00	12	45	135
location	0	1.00	12	36	125
date	0	1.00	8	10	135
summary	0	1.00	83	908	135
injured	0	1.00	1	21	32
total_victims	0	1.00	1	3	41
Place	0	1.00	5	10	10
age_of_shooter	0	1.00	1	2	45
prior_signs_mental_health_issues	0	1.00	1	7	7
mental_health_details	0	1.00	1	619	93
weapons_obtained_legally	0	1.00	1	131	9
where_obtained	0	1.00	1	124	75
weapon_type	0	1.00	1	389	72
weapon_details	1	0.99	1	328	108
race	0	1.00	1	15	10
gender	0	1.00	1	13	5
sources	0	1.00	41	951	135
mental_health_sources	0	1.00	1	444	85
sources_additional_age	0	1.00	1	951	101
latitude	0	1.00	1	11	127
longitude	0	1.00	1	12	127
type	0	1.00	4	5	2

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
fatalities	0	1	7.87	7.60	3	4.0	6	8.5	58	▇▁▁▁▁
year	0	1	2009.84	10.71	1982	2003.5	2014	2018.0	2022	▁▂▂▃▇

skim(State_level_data)

Data summary
Name	State_level_data
Number of rows	1850
Number of columns	20
_______________________
Column type frequency:
character	1
numeric	19
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
STATE	0	1	4	14	0	50	0

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
FIP	1	29.32	15.63	1.00	17.00	29.50	42.00	56.00	▆▇▇▇▇
Year	1	1998.00	10.68	1980.00	1989.00	1998.00	2007.00	2016.00	▇▇▇▇▇
HFR	1	0.44	0.14	0.03	0.37	0.46	0.55	0.80	▂▂▇▇▁
HFR_se	1	0.03	0.01	0.02	0.03	0.03	0.04	0.05	▂▂▇▅▃
universl	1	0.18	0.38	0.00	0.00	0.00	0.00	1.00	▇▁▁▁▂
permit	1	0.22	0.42	0.00	0.00	0.00	0.00	1.00	▇▁▁▁▂
Fem_FS_S	1	-0.05	1.96	-9.00	0.24	0.35	0.49	0.92	▁▁▁▁▇
Male_FS_S	1	0.62	0.13	0.14	0.56	0.63	0.71	0.87	▁▂▃▇▃
BRFSS	1	-8.25	2.55	-9.00	-9.00	-9.00	-9.00	0.64	▇▁▁▁▁
GALLUP	1	-5.73	4.51	-9.00	-9.00	-9.00	0.43	0.80	▇▁▁▁▅
GSS	1	-4.75	4.68	-9.00	-9.00	-9.00	0.40	0.73	▇▁▁▁▆
PEW	1	-6.02	4.39	-9.00	-9.00	-9.00	0.35	0.73	▇▁▁▁▃
HuntLic	1	0.37	0.28	-9.00	0.25	0.34	0.46	1.14	▁▁▁▁▇
GunsAmmo	1	0.00	0.99	-3.47	-0.65	-0.03	0.42	3.93	▁▅▇▁▁
BackChk	1	-4.62	4.55	-9.00	-9.00	-9.00	-0.13	2.82	▇▁▁▆▂
PewQChng	1	0.11	0.31	0.00	0.00	0.00	0.00	1.00	▇▁▁▁▁
BS1	1	9.76	3.58	0.00	8.73	12.00	12.00	12.00	▁▁▁▁▇
BS2	1	6.00	5.06	0.00	0.27	6.00	11.73	12.00	▇▂▂▂▇
BS3	1	2.24	3.58	0.00	0.00	0.00	3.27	12.00	▇▁▁▁▁

Using Lubridate package and String modification

Cleaning the data set and modifying the variables and data to match our requirements for the Project

# Removing the two values as they were not provided and given as character. So replacing them with 0

MassShootings[7,6] = "0"
MassShootings[7,7] = "0"

# Changing the variable type character to numeric to perform group level summary statistics

MassShootings$injured <- as.numeric(as.character(MassShootings$injured))
MassShootings$total_victims <- as.numeric(as.character(MassShootings$total_victims))


# separating the location variable into two variables city and state and removing the space before state values using str_trim

# Using lubridate package and the function mdy to modify the date variable
MassShootings<- separate(MassShootings,location,c("City","STATE"),sep=",") %>% mutate(STATE=str_trim(STATE),date=mdy(date), Year=year(date))



# Making adjustments to the race variable that are capitalized differently and has no values using mutate and stringr function str_replace


MassShootings <- MassShootings %>% mutate(across('race',str_replace,"black","Black"))
MassShootings <- MassShootings %>% mutate(across('race',str_replace,"unclear","Unknown"))
MassShootings <- MassShootings %>% mutate(across('race',str_replace,"-","Unknown"))
MassShootings <- MassShootings %>% mutate(across('race',str_replace,"white","White"))

# Modifying the place variable with different capitalizations and wordings

MassShootings <- MassShootings %>% mutate(across('Place',str_replace,"workplace","Workplace"))
MassShootings <- MassShootings %>% mutate(across('Place',str_replace,"\nWorkplace","Workplace"))
MassShootings <- MassShootings %>% mutate(across('Place',str_replace,"religious","Religious"))
MassShootings <- MassShootings %>% mutate(across('Place',str_replace,"Other\n","Other"))

Data Dictionary for each of the datasets

# creating a data dictionary using the function datadictionary and keeping only required data using slice function

Dictionary1<-datadictionary::create_dictionary(MassShootings)

# Slicing only required rows of the data dictionary

Dictionary1<- Dictionary1 %>% slice(3,5,7,9,11,13,18,23,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,65)

# updating the description of variables to the dictionary

Dictionary1$label<-c("case name assigned to the mass shooting ",
                     "City", 
                     "State",
                     "Date of incident",
                    "Summary of the Mass shooting",
                    "The number of people dead",
                    "The number of people injured",
                    "Total number of victims",
                     "Type of the place",
                    "Age of the shooter",
                    "Any prior signs for the offender's mental issues",
                    "Mental health details of the shooter",
                     "Legality of the weapons obtained",
                    "where was the weapons obtained from",
                    "Type of the weapon used",
                    "Details of the weapon",
                    "Race of the shooter",
                     "Gender of the shooter",
                    "Source of the data",
                    "Sources of the mental health",
                    "Additional sources",
                    "Latitude of the place of incident",
                     "Longitude of the place of incident",
                    "Type of Mass shooting",
                    "Year of mass shooting")

Dictionary1<- select(Dictionary1,1:3)

# renaming the variables

Dictionary1<- Dictionary1 %>% rename(Variable = item,Description=label,Type=class)
Dictionary1[Dictionary1 == "character"] <- "Categorical"
Dictionary1[Dictionary1 == "numeric"] <- "Quantitative"

flextable(Dictionary1,cwidth = 4,cheight = 1)

Variable	Description	Type
case	case name assigned to the mass shooting	Categorical
City	City	Categorical
STATE	State	Categorical
date	Date of incident	Date
summary	Summary of the Mass shooting	Categorical
fatalities	The number of people dead	Quantitative
injured	The number of people injured	Quantitative
total_victims	Total number of victims	Quantitative
Place	Type of the place	Categorical
age_of_shooter	Age of the shooter	Categorical
prior_signs_mental_health_issues	Any prior signs for the offender's mental issues	Categorical
mental_health_details	Mental health details of the shooter	Categorical
weapons_obtained_legally	Legality of the weapons obtained	Categorical
where_obtained	where was the weapons obtained from	Categorical
weapon_type	Type of the weapon used	Categorical
weapon_details	Details of the weapon	Categorical
race	Race of the shooter	Categorical
gender	Gender of the shooter	Categorical
sources	Source of the data	Categorical
mental_health_sources	Sources of the mental health	Categorical
sources_additional_age	Additional sources	Categorical
latitude	Latitude of the place of incident	Categorical
longitude	Longitude of the place of incident	Categorical
type	Type of Mass shooting	Categorical
Year	Year of mass shooting	Quantitative

# Creating data dictionary for 2nd dataset of Household firearm estimate

Dictionary2<-datadictionary::create_dictionary(State_level_data)

#Slicing the data dictionary to keep the required rows 
Dictionary2<- Dictionary2 %>% slice(3,8,13,15,20,25,30,35,40,45,50,55,60,65,70,75,80,85,90,95)

# Selecting only required columns of the data ditionary
Dictionary2<- select(Dictionary2,1:3)

# renaming the variables

Dictionary2<- Dictionary2 %>% rename(Variable = item,Description=label,Type=class)
Dictionary2[Dictionary2 == "character"] <- "Categorical"
Dictionary2[Dictionary2 == "numeric"] <- "Quantitative"

# updating the description of variables to the dictionary

Dictionary2$Description<-c("State FIP value ",
                           "Year",
                          "Name of the State",
                          "Factor scores for household firearm ownership latent factor",
                          "Standard errors of factor scores for household firearm ownership latent factor",
                          "State has universal background checks law (1=yes; 0=no)",
                          "State has permit to purchase law (1=yes; 0=no)",
                          "Female Firearm Suicide/Total Female Suicide *100",
                          "Male Firearm Suicide/Total Male Suicide *100",
                          "BRFSS state-level survey estimate",
                          "Gallup state-level survey estimate",
                          "General Social Survey state-level survey estimate",
                          "Pew state-level survey estimate",
                          "Square root of Hunting Licenses/Population",
                          "Within-year standardization of the square root of Guns and Ammo Subscriptions/Population",
                          "Within-year standardization of Background Checks (without checks for permits)/Population",
                          "Indicator for Pew firearm ownership question wording change",
                          "Blended linear spline 1 represents roughly 1980–1992",
                          "Blended linear spline 1 represents roughly 1993–2004",
                          "Blended linear spline 1 represents roughly 2005–2016")

flextable(Dictionary2,cwidth = 4,cheight = 1)

Variable	Description	Type
FIP	State FIP value	Quantitative
Year	Year	Quantitative
STATE	Name of the State	Categorical
HFR	Factor scores for household firearm ownership latent factor	Quantitative
HFR_se	Standard errors of factor scores for household firearm ownership latent factor	Quantitative
universl	State has universal background checks law (1=yes; 0=no)	Quantitative
permit	State has permit to purchase law (1=yes; 0=no)	Quantitative
Fem_FS_S	Female Firearm Suicide/Total Female Suicide *100	Quantitative
Male_FS_S	Male Firearm Suicide/Total Male Suicide *100	Quantitative
BRFSS	BRFSS state-level survey estimate	Quantitative
GALLUP	Gallup state-level survey estimate	Quantitative
GSS	General Social Survey state-level survey estimate	Quantitative
PEW	Pew state-level survey estimate	Quantitative
HuntLic	Square root of Hunting Licenses/Population	Quantitative
GunsAmmo	Within-year standardization of the square root of Guns and Ammo Subscriptions/Population	Quantitative
BackChk	Within-year standardization of Background Checks (without checks for permits)/Population	Quantitative
PewQChng	Indicator for Pew firearm ownership question wording change	Quantitative
BS1	Blended linear spline 1 represents roughly 1980–1992	Quantitative
BS2	Blended linear spline 1 represents roughly 1993–2004	Quantitative
BS3	Blended linear spline 1 represents roughly 2005–2016	Quantitative

Creating 5 unique visualizations

Plot 1

From the below graph we can observe most number of shootings took place over the time (1982-2022)

# plot 1: Line chart showing in which year the most shootings took place

# creating a new df Yearcount that has the Count of mass shootings in the specific year

YearCount <- MassShootings %>% count(year)

massshootings_year<-YearCount %>% ggplot(aes(x = year, y = n)) +
  geom_line(size=1,color="Red") +
  scale_y_continuous(breaks = c(2,4,6,8,10,12))+
  scale_x_continuous(breaks = c(1980,1985,1990,1995,2000,2005,2010,2015,2020,2025))+
  labs(title = "Mass shootings in the United States from 1980-2022",
       x = "Year",
       y = "Number of incidents",
       caption = "Data source: www.motherjones.com") +
  theme(plot.title = element_text(hjust = 0.5))+
  theme_get()

massshootings_year

Plot 2

From the below graph we can observe that number of mass shooting incidents took place in every state and compare them.

# creating a new df locationcount that has the values of number of mass shootings occured in each state
locationCount <- MassShootings %>% count(STATE)

massshootings_state<-locationCount %>% ggplot(aes(x = fct_reorder(STATE, n), y = n)) +
  geom_col(color="black",fill="#87CEEB") +
  coord_flip()+
  scale_y_continuous(breaks = c(0,4,8,12,16,20,24,28))+
  labs(title = "Number of shootings per state in the United States",
       x = "States",
       y = "Number of incidents",
       caption = "Data source: www.motherjones.com") +
  theme_bw()+
  theme(axis.text.y  = element_text(size = 7))

massshootings_state

Plot 3

From this plot we can see the mass shootings committed by suspects of different ethnicity.

# Removing the Las vegas mass shooting data as it was only big outlier that alters the results of analysis and for better visualization 
massshootings2 <- MassShootings[-45,]

massshootings_race<-massshootings2 %>% mutate(race=fct_relevel(race,"White","Black","Unknown","Latino","Asian","Other","Native American"))%>%
  ggplot(aes(x=race,y=total_victims,fill=race)) +
  geom_boxplot(color="black") +
  scale_y_continuous(breaks = c(0,10,20,30,40,50,60,70,80,90,100,110,120))+
  scale_fill_viridis_d()+
  labs(title = "Victims in mass shootings by suspects of different ethnicities ",
       x = "Race of the criminal",
       y = "Number of victims",
       caption = "Data source: www.motherjones.com")+
  theme_light()+
  theme( legend.position = "bottom", panel.border = element_rect(color = "black", fill = NA),
         axis.line = element_line(color = "black"),
      axis.ticks = element_line(color = "black"),
      axis.text = element_text(color = "black",size = 7))

massshootings_race

Plot4

Pivotting data into longer format

This visualization was created to identify firearm suicide commited by people in mississippi based on their gender.

# Creating a new table from the data set with data required for the plot from the state of mississippi

Mississippi_firearm_suicide<-filter(State_level_data,STATE=="Mississippi")%>%select("STATE","Year","Male_FS_S","Fem_FS_S")

# Multiplying the variables with 100 as they are very small in number and hard to visualize if plotted with same values
Mississippi_firearm_suicide <- Mississippi_firearm_suicide %>% mutate(Male_FS_S=Male_FS_S*100,Fem_FS_S=Fem_FS_S*100)

# renaming column names to male and female instead of Male_FS_S, Fem_FS_S

Mississippi_firearm_suicide <-  rename(Mississippi_firearm_suicide, Male=Male_FS_S,Female=Fem_FS_S)

# Pivoting the variables to make it easier to visualize both genders in the same plot
Mississippi_firearm_suicide <- Mississippi_firearm_suicide %>% pivot_longer(cols = c(Male,Female),names_to = "Gender",values_to ="Firearm_sucide_rate")

# Plot 4
mississippi_suicide_gender<-Mississippi_firearm_suicide %>% ggplot(aes(x=Year,y=Firearm_sucide_rate,color=Gender))+
  geom_line(size=1)+
  geom_point(shape=16, color="black", fill="black", size=1)+
  scale_x_continuous(breaks = c(1980,1985,1990,1995,2000,2005,2010,2015,2020))+
  scale_color_discrete(labels = c("Female", "Male"))+
  labs(title = "Firearm suicide rate in mississippi based on gender",
       x="Year",
       Y="Rate of firearm suicide",
       caption = "RAND : State-Level Estimates of Household Firearm Ownership")+
  theme_light()+
  theme(legend.position = "bottom",panel.border = element_rect(color = "black", fill = NA),
         axis.line = element_line(color = "black"),
      axis.ticks = element_line(color = "black"),
      axis.text = element_text(color = "black",size = 7))

mississippi_suicide_gender

plot 5

Joining two datasets using the function Left_join()

This plot is to identify the firearm suicides in males per year

# Joining the two data sets by Year and STATE variables to keep the Variables Year,STATE,Fem_FS_S,Male_FS_S to find out male suicides from 1980-2020

Newtable <- left_join(MassShootings,select(State_level_data,c(Year,STATE,Fem_FS_S,Male_FS_S)),by=c("Year","STATE"))

# Removing the NA values in the newtable as there are  53 missing values after left joining the data sets

Newtable <- Newtable[complete.cases(Newtable),]

# Multiplying the variable by 100 to make it visually better in the plot

Newtable$Fem_FS_S <- Newtable$Fem_FS_S*100
Newtable$Male_FS_S <- Newtable$Male_FS_S*100


# The  Male_FS_S    variable is - Male Firearm Suicide/Total Male Suicide *100

# creating a plot to show the male fire arm suicide percentage

male_suicide_percentage<-Newtable %>% ggplot(aes(x=Year,y=Male_FS_S))+
  geom_col(color="red",fill="red")+
  scale_x_continuous(breaks = c(1980,1985,1990,1995,2000,2005,2010,2015,2020))+
  labs(title = " Percentage of Male firearm suicide per year ",
       subtitle = "The male firearm suicide variable is multiplied by 100 to make it visually better",
       x = "Year",
       y = "percentage of Male firearm suicides",
       caption = "Data source: www.motherjones.com & RAND : State-Level Estimates of Household Firearm Ownership") +
  theme_bw()
  

male_suicide_percentage

Two tables of group-level summary statistics

The two summary tables shows the summary statistics of of firearm suicide rate based on gender and number of victims in mass shootings by ethnicity of suspect.

# 1- This data frame is for the summary statistics of firearm suicide rate in mississipi based on gender

firearm_suicide_gender<- Mississippi_firearm_suicide %>% group_by(Gender) %>% 
  summarise(mean_f=round(mean(Firearm_sucide_rate),2),
            median_f=round(median(Firearm_sucide_rate),2),
            sd_f=round(sd(Firearm_sucide_rate),2),
            max_f=round(max(Firearm_sucide_rate),2),
            min_f=round(min(Firearm_sucide_rate),2),.groups = 'drop') %>%
  as.data.frame()

 firearm_suicide_gender

##   Gender mean_f median_f sd_f max_f min_f
## 1 Female  66.61    66.67 9.87 85.25 49.06
## 2   Male  78.03    79.09 5.52 87.05 67.53

# 2- Summary statistics data frame of number of victims in mass shootings by ethnicity of suspect
total_victims_race<- massshootings2 %>% group_by(race) %>% 
  summarise(mean_v=round(mean(total_victims),2),
            median_v=round(median(total_victims),2),
            sd_v=round(sd(total_victims),2),
            max_v=round(max(total_victims),2),
            min_v=round(min(total_victims),2),.groups = 'drop') %>%
  as.data.frame()

total_victims_race

##              race mean_v median_v  sd_v max_v min_v
## 1           Asian  13.75        7 17.29    55     3
## 2           Black   9.30        7  6.06    25     0
## 3          Latino  10.45        7  9.84    38     3
## 4 Native American   9.00        6  5.20    15     6
## 5           Other  41.20       35 36.91   102     7
## 6         Unknown   9.46        6  8.16    34     4
## 7           White  16.79       11 14.43    82     3

Frequency table

Frequency table to show the frequency of massshootings occuring at places by the suspects of different ethnicities.

From the below table we can observe that most number of incidents took place in the workplace

freq_table<- table(MassShootings$Place, MassShootings$race)
freq_table

##            
##             Asian Black Latino Native American Other Unknown White
##   Airport       0     0      1               0     0       0     0
##   Military      0     1      1               0     2       1     1
##   Other         2     9      3               1     1       3    32
##   Religious     0     0      0               0     0       1     7
##   School        3     1      1               2     1       1    11
##   Workplace     3    12      5               0     1       7    21

as.data.frame(freq_table)

##         Var1            Var2 Freq
## 1    Airport           Asian    0
## 2   Military           Asian    0
## 3      Other           Asian    2
## 4  Religious           Asian    0
## 5     School           Asian    3
## 6  Workplace           Asian    3
## 7    Airport           Black    0
## 8   Military           Black    1
## 9      Other           Black    9
## 10 Religious           Black    0
## 11    School           Black    1
## 12 Workplace           Black   12
## 13   Airport          Latino    1
## 14  Military          Latino    1
## 15     Other          Latino    3
## 16 Religious          Latino    0
## 17    School          Latino    1
## 18 Workplace          Latino    5
## 19   Airport Native American    0
## 20  Military Native American    0
## 21     Other Native American    1
## 22 Religious Native American    0
## 23    School Native American    2
## 24 Workplace Native American    0
## 25   Airport           Other    0
## 26  Military           Other    2
## 27     Other           Other    1
## 28 Religious           Other    0
## 29    School           Other    1
## 30 Workplace           Other    1
## 31   Airport         Unknown    0
## 32  Military         Unknown    1
## 33     Other         Unknown    3
## 34 Religious         Unknown    1
## 35    School         Unknown    1
## 36 Workplace         Unknown    7
## 37   Airport           White    0
## 38  Military           White    1
## 39     Other           White   32
## 40 Religious           White    7
## 41    School           White   11
## 42 Workplace           White   21

Implementing a Permutation test

# 2b) Implement at least one permutation test based on a traditional hypothesis test, such as a two-sample t-test or a chi-squared test of independence, to test a hypothesis of interest for your data.
# creating a new variable shooter which determines if the shooter is minor or adult

MassShootings <- mutate(MassShootings, Shooter = ifelse(age_of_shooter >= 18, "Adult", "Minor"))

# creating a new dataframe with fatalities and shooter variables

TableA <- data.frame(MassShootings$fatalities,MassShootings$Shooter)

TableA<-TableA %>% rename(fatalities=MassShootings.fatalities,shooter=MassShootings.Shooter)

table(TableA$Shooter)

## < table of extent 0 >

# creating a box plot of fatalities by those shooters

boxplot(TableA$fatalities~TableA$shooter, las = 1, 
        ylab = "fatalities ", 
        xlab = "shooter",
        main = "fatalities by age of shooter")

# difference in sample MEANS
mean(TableA$fatalities[TableA$shooter == "Minor"]) # mean for minor

## [1] 5.769231

mean(TableA$fatalities[TableA$shooter == "Adult"]) # mean for Adult

## [1] 8.098361

# absolute difference in means
test.stat1 <- abs(mean(TableA$fatalities[TableA$shooter == "Adult"]) - 
                  mean(TableA$fatalities[TableA$shooter == "Minor"]))
test.stat1

## [1] 2.32913

# difference in sample MEDIANS
median(TableA$fatalities[TableA$shooter == "Adult"]) # median for Adult

## [1] 6

median(TableA$fatalities[TableA$shooter == "Minor"]) # median for minor

## [1] 4

# absolute difference in medians
test.stat2 <- abs(median(TableA$fatalities[TableA$shooter == "Adult"]) - 
                  median(TableA$fatalities[TableA$shooter == "Minor"]))  
test.stat2

## [1] 2

# Permutation Test
  
# for getting the same results by using the seed
set.seed(1979)  
  
# number of observations to sample
n <- length(TableA$shooter)  
  
# number of permutation samples to take
P <- 100000
  
# the variable we will resample from 
variable <- TableA$fatalities  
  
# initialize a matrix to store the permutation data
PermSamples <- matrix(0, nrow = n, ncol = P)

  
# each column is a permutation sample of data. now, get those permutation samples, using a loop

for(i in 1:P)
  {
    PermSamples[, i] <- sample(variable, 
                               size = n, 
                               replace = FALSE)
  }
  
# first 5 of the samples of first 5 observations
PermSamples[1:5, 1:5]

##      [,1] [,2] [,3] [,4] [,5]
## [1,]    5    7   12    3    4
## [2,]    6    9    7    3    9
## [3,]    8    5    4    9    7
## [4,]    8    9    3    5    3
## [5,]    4   27    4    5    5

# initialize vectors to store all of the Test-stats
Perm.test.stat1 <- Perm.test.stat2 <- rep(0, P)
  
# loop thru, and calculate the test-stats
for (i in 1:P)
  {
    # calculate the perm-test-stat1 and save it
    Perm.test.stat1[i] <- abs(mean(PermSamples[TableA$shooter == "Adult",i]) - 
                              mean(PermSamples[TableA$shooter == "Minor",i]))
      
    # calculate the perm-test-stat2 and save it
    Perm.test.stat2[i] <- abs(median(PermSamples[TableA$shooter == "Adult",i]) - 
                              median(PermSamples[TableA$shooter == "Minor",i]))
  }
  

# The test stats

test.stat1; test.stat2

## [1] 2.32913

## [1] 2

# First 15 observations of both the test stats
round(Perm.test.stat1[1:15], 1)

##  [1] 2.2 1.0 0.3 4.8 6.4 3.4 1.3 0.1 6.9 3.1 2.9 1.0 0.3 0.2 0.1

round(Perm.test.stat2[1:15], 1)

##  [1] 0.0 0.0 0.0 1.0 1.0 1.0 1.0 0.0 4.5 2.0 1.0 1.0 0.0 0.5 1.0

# Caluculating p value
(Perm.test.stat1 >= test.stat1)[1:15]

##  [1] FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE FALSE
## [13] FALSE FALSE FALSE

mean((Perm.test.stat1 >= test.stat1)[1:15])

## [1] 0.4

mean(Perm.test.stat1 >= test.stat1)

## [1] 0.28505

mean(Perm.test.stat2 >= test.stat2)

## [1] 0.11823

Implementing Bootstrap (Non-parametric)

# Simulating data from distribution
set.seed(1989)
n <- 135
observed <- MassShootings$total_victims


# Create a raincloud plot to visualize the simulated data. Describe the data in terms of its number of peaks (e.g., unimodal, bimodal, etc.) and its symmetry.

tibble(value=observed) %>% ggplot(aes(y = value)) +
  ggdist::stat_halfeye(adjust = .5, width = 2*.3, .width = c(0.5, 1)) + 
    geom_boxplot(width = .3, outlier.shape = NA) +
  ggdist::stat_dots(side = "left", dotsize = 6, justification = 1.05, binwidth = .1,
                    color = "black") +
    coord_flip() +
    labs(y = "Value",
         title = "simulated values from non-normal distribution") + 
  theme_bw() +
  theme(legend.position = "none")

# What is the sample median for this data?

median(observed)

## [1] 10

# Number of bootstrap samples
B <- 10000

# Instantiating matrix for bootstrap samples
boots <- matrix(NA, nrow = n, ncol = B)

# Sampling with replacement B times
for(b in 1:B) {
boots[, b] <- observed[sample(1:n, size = n, replace = TRUE)]
}


#  Using the generated bootstrap samples, create a bootstrap distribution of sample medians, and visualize this distribution using a histogram.

# Installing vector for bootstrap medians
bootmedians <- vector(length=B)

#sampling with replacement B times

for (b in 1:B){
  bootmedians[b]<-median(boots[,b])
}

# creating a histogram

tibble(Median=bootmedians) %>% ggplot(aes(x = Median)) +
  geom_histogram(color="white")+
  labs(y = "frequency",
       title = "Distribution of bootstrap medians") + 
  theme_bw()

# Use the bootstrap samples to obtain a nonparametric estimate of the standard error of the sample median.

SEestimate<-sd(bootmedians)
SEestimate

## [1] 0.9319044

# Use the bootstrap samples to obtain a nonparametric 95% confidence interval for the population median.

lowerBoundMed <- quantile(bootmedians,probs=0.025)
upperBoundMed <- quantile(bootmedians,probs=0.975)

We are 95% confident that the true median is in between 8 and 11.

# creating a histogram

tibble(Median=bootmedians) %>% ggplot(aes(x = Median)) +
  geom_histogram(color="white")+
  geom_vline(xintercept= lowerBoundMed,size=1,
             color="dodgerblue",linetype="solid")+
  geom_vline(xintercept= upperBoundMed,size=1,
             color="dodgerblue",linetype="solid")+
  labs(y = "frequency",
       title = "Distribution of bootstrap medians") + 
  theme_bw()

Parametric Bootstrapping

# Generate B=10,000 samples each of size 30 from a normal distribution using the estimated mean and standard deviation from the observed data set, and visualize this distribution using a histogram.

B <- 10000

# Instantiating matrix for bootstrap samples
paramBoots <- matrix(NA, nrow = n, ncol = B)
xBar<-mean(observed)
s<-sd(observed)

# Sampling a normal set of n values, B times
for(b in 1:B) {
paramBoots[, b] <- rnorm(n=n,mean=xBar,sd=s)
}


# Installing vector for bootstrap medians
bootParmedians <- vector(length=B)

# caluculating median for each simulated data set
for (b in 1:B){
  bootParmedians[b]<-median(paramBoots[,b])
}


# creating a histogram

tibble(Median=bootParmedians) %>% ggplot(aes(x = Median)) +
  geom_histogram(color="white")+
  labs(y = "frequency",
       title = "Distribution of Parametric bootstrap medians") + 
  theme_bw()

# Obtain a parametric bootstrap estimate of the standard error of the sample median.

SEparestimate<-sd(bootmedians)
SEparestimate

## [1] 0.9319044

# Obtain a parametric bootstrap 95% confidence interval for the sample median.

lowerBoundParMed <- quantile(bootParmedians,probs=0.025)
upperBoundParMed <- quantile(bootParmedians,probs=0.975)


# creating a histogram

tibble(Median=bootParmedians) %>% ggplot(aes(x = Median)) +
  geom_histogram(color="white")+
  geom_vline(xintercept= lowerBoundParMed,size=1,
             color="dodgerblue",linetype="solid")+
   geom_vline(xintercept= upperBoundParMed,size=1,
             color="dodgerblue",linetype="solid")+
  labs(y = "frequency",
       title = "Distribution of parametric bootstrap medians") + 
  theme_bw()

We are 95% confident that the true median is in between 8.16 and 29.96.

The parametric bootstrap operates under the assumption of data being in a normal distribution but since the data is skewed in nature the score of parametric bootstrap is very bad.

Group Project - Contributions

Collaboration was critical to the project’s success. Narender Tumu and Sujith Prakash Parsa are members of the project team. Because the team was small, communication was much easier. Throughout the project, we always reviewed each other’s work and made improvements. We used several online tools, such as “Google Drive” and “Google Meet,” to share project files and meet on a regular basis.

Data and idea for project - Narender,Sujith (Thank you professor for helping us to find the data)

Data dictionary and exploratory data analysis - Narender

Data visualization (plot 1&2) - Sujith

Data visualization (plot 3,4,5) - Narender

Tables of summary statistics, merging data sets, pivoting data, Date/time manipulation, string manipulation, Dashboard - Narender

Frequency table,Project report,Implementing permutation and Bootstrap(Parametric and Non-parametric) methods - Sujith

References

(https://www.motherjones.com/politics/2012/12/mass-shootings-mother-jones-full-data/)

(https://www.rand.org/pubs/tools/TL354.html)

(https://www.geeksforgeeks.org/permutation-hypothesis-test-in-r-programming/)

(https://r-graph-gallery.com/)

(https://dplyr.tidyverse.org/index.html)