Analysis of Mass shootings in USA
Loading packages
This project makes use of the following libraries. We used “readr” and “readxl” to read the data, “skimr” to explore and find different data characteristics, “datadictonary” and “flextable” to create and display data dictionary containing a description of each variable in the dataset, and “lubridate” and “stringr” to manipulate the data. These are some of the primary functions carried out with the assistance of the libraries listed below.
library(readr)
library(readxl)
library(skimr)
library(flextable)
library(dplyr)
library(stringr)
library(tidyr)
library(lubridate)
library(stringr)
library(ggplot2)
library(forcats)
library(datadictionary)
library(tidyverse)Importing the data
These are the datasets that were used in this project. First, there is “Mass Shootings.csv,” which contains data on mass shootings in the United States of America from 1982 to 2022. The second dataset was “State level data.xlsx,” which contains estimates of the proportion of adult, noninstitutionalized residents who live in a household with a firearm in each U.S. state and year (1980-2016), as well as the data used to produce these estimates.
MassShootings <- read_csv("Mother Jones - Mass Shootings Database, 1982 - 2022 - Sheet1.csv")
State_level_data <- read_excel("Sate level data & factor score.xlsx")Exploring the data
Exploring and displaying high-level characteristics of the data sets, e.g., important variables and their types, different levels for factor variables, any patterns of missing values.
After Exploring the data by the use of skim function, I can observe that there are no missing values in the datasets I have used for this project.
skim(MassShootings)| Name | MassShootings |
| Number of rows | 135 |
| Number of columns | 24 |
| _______________________ | |
| Column type frequency: | |
| character | 22 |
| numeric | 2 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| case | 0 | 1.00 | 12 | 45 | 0 | 135 | 0 |
| location | 0 | 1.00 | 12 | 36 | 0 | 125 | 0 |
| date | 0 | 1.00 | 8 | 10 | 0 | 135 | 0 |
| summary | 0 | 1.00 | 83 | 908 | 0 | 135 | 0 |
| injured | 0 | 1.00 | 1 | 21 | 0 | 32 | 0 |
| total_victims | 0 | 1.00 | 1 | 3 | 0 | 41 | 0 |
| Place | 0 | 1.00 | 5 | 10 | 0 | 10 | 0 |
| age_of_shooter | 0 | 1.00 | 1 | 2 | 0 | 45 | 0 |
| prior_signs_mental_health_issues | 0 | 1.00 | 1 | 7 | 0 | 7 | 0 |
| mental_health_details | 0 | 1.00 | 1 | 619 | 0 | 93 | 0 |
| weapons_obtained_legally | 0 | 1.00 | 1 | 131 | 0 | 9 | 0 |
| where_obtained | 0 | 1.00 | 1 | 124 | 0 | 75 | 0 |
| weapon_type | 0 | 1.00 | 1 | 389 | 0 | 72 | 0 |
| weapon_details | 1 | 0.99 | 1 | 328 | 0 | 108 | 0 |
| race | 0 | 1.00 | 1 | 15 | 0 | 10 | 0 |
| gender | 0 | 1.00 | 1 | 13 | 0 | 5 | 0 |
| sources | 0 | 1.00 | 41 | 951 | 0 | 135 | 0 |
| mental_health_sources | 0 | 1.00 | 1 | 444 | 0 | 85 | 0 |
| sources_additional_age | 0 | 1.00 | 1 | 951 | 0 | 101 | 0 |
| latitude | 0 | 1.00 | 1 | 11 | 0 | 127 | 0 |
| longitude | 0 | 1.00 | 1 | 12 | 0 | 127 | 0 |
| type | 0 | 1.00 | 4 | 5 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| fatalities | 0 | 1 | 7.87 | 7.60 | 3 | 4.0 | 6 | 8.5 | 58 | ▇▁▁▁▁ |
| year | 0 | 1 | 2009.84 | 10.71 | 1982 | 2003.5 | 2014 | 2018.0 | 2022 | ▁▂▂▃▇ |
skim(State_level_data)| Name | State_level_data |
| Number of rows | 1850 |
| Number of columns | 20 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 19 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| STATE | 0 | 1 | 4 | 14 | 0 | 50 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| FIP | 0 | 1 | 29.32 | 15.63 | 1.00 | 17.00 | 29.50 | 42.00 | 56.00 | ▆▇▇▇▇ |
| Year | 0 | 1 | 1998.00 | 10.68 | 1980.00 | 1989.00 | 1998.00 | 2007.00 | 2016.00 | ▇▇▇▇▇ |
| HFR | 0 | 1 | 0.44 | 0.14 | 0.03 | 0.37 | 0.46 | 0.55 | 0.80 | ▂▂▇▇▁ |
| HFR_se | 0 | 1 | 0.03 | 0.01 | 0.02 | 0.03 | 0.03 | 0.04 | 0.05 | ▂▂▇▅▃ |
| universl | 0 | 1 | 0.18 | 0.38 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▂ |
| permit | 0 | 1 | 0.22 | 0.42 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▂ |
| Fem_FS_S | 0 | 1 | -0.05 | 1.96 | -9.00 | 0.24 | 0.35 | 0.49 | 0.92 | ▁▁▁▁▇ |
| Male_FS_S | 0 | 1 | 0.62 | 0.13 | 0.14 | 0.56 | 0.63 | 0.71 | 0.87 | ▁▂▃▇▃ |
| BRFSS | 0 | 1 | -8.25 | 2.55 | -9.00 | -9.00 | -9.00 | -9.00 | 0.64 | ▇▁▁▁▁ |
| GALLUP | 0 | 1 | -5.73 | 4.51 | -9.00 | -9.00 | -9.00 | 0.43 | 0.80 | ▇▁▁▁▅ |
| GSS | 0 | 1 | -4.75 | 4.68 | -9.00 | -9.00 | -9.00 | 0.40 | 0.73 | ▇▁▁▁▆ |
| PEW | 0 | 1 | -6.02 | 4.39 | -9.00 | -9.00 | -9.00 | 0.35 | 0.73 | ▇▁▁▁▃ |
| HuntLic | 0 | 1 | 0.37 | 0.28 | -9.00 | 0.25 | 0.34 | 0.46 | 1.14 | ▁▁▁▁▇ |
| GunsAmmo | 0 | 1 | 0.00 | 0.99 | -3.47 | -0.65 | -0.03 | 0.42 | 3.93 | ▁▅▇▁▁ |
| BackChk | 0 | 1 | -4.62 | 4.55 | -9.00 | -9.00 | -9.00 | -0.13 | 2.82 | ▇▁▁▆▂ |
| PewQChng | 0 | 1 | 0.11 | 0.31 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| BS1 | 0 | 1 | 9.76 | 3.58 | 0.00 | 8.73 | 12.00 | 12.00 | 12.00 | ▁▁▁▁▇ |
| BS2 | 0 | 1 | 6.00 | 5.06 | 0.00 | 0.27 | 6.00 | 11.73 | 12.00 | ▇▂▂▂▇ |
| BS3 | 0 | 1 | 2.24 | 3.58 | 0.00 | 0.00 | 0.00 | 3.27 | 12.00 | ▇▁▁▁▁ |
Using Lubridate package and String modification
Cleaning the data set and modifying the variables and data to match our requirements for the Project
# Removing the two values as they were not provided and given as character. So replacing them with 0
MassShootings[7,6] = "0"
MassShootings[7,7] = "0"
# Changing the variable type character to numeric to perform group level summary statistics
MassShootings$injured <- as.numeric(as.character(MassShootings$injured))
MassShootings$total_victims <- as.numeric(as.character(MassShootings$total_victims))
# separating the location variable into two variables city and state and removing the space before state values using str_trim
# Using lubridate package and the function mdy to modify the date variable
MassShootings<- separate(MassShootings,location,c("City","STATE"),sep=",") %>% mutate(STATE=str_trim(STATE),date=mdy(date), Year=year(date))
# Making adjustments to the race variable that are capitalized differently and has no values using mutate and stringr function str_replace
MassShootings <- MassShootings %>% mutate(across('race',str_replace,"black","Black"))
MassShootings <- MassShootings %>% mutate(across('race',str_replace,"unclear","Unknown"))
MassShootings <- MassShootings %>% mutate(across('race',str_replace,"-","Unknown"))
MassShootings <- MassShootings %>% mutate(across('race',str_replace,"white","White"))
# Modifying the place variable with different capitalizations and wordings
MassShootings <- MassShootings %>% mutate(across('Place',str_replace,"workplace","Workplace"))
MassShootings <- MassShootings %>% mutate(across('Place',str_replace,"\nWorkplace","Workplace"))
MassShootings <- MassShootings %>% mutate(across('Place',str_replace,"religious","Religious"))
MassShootings <- MassShootings %>% mutate(across('Place',str_replace,"Other\n","Other"))Data Dictionary for each of the datasets
# creating a data dictionary using the function datadictionary and keeping only required data using slice function
Dictionary1<-datadictionary::create_dictionary(MassShootings)
# Slicing only required rows of the data dictionary
Dictionary1<- Dictionary1 %>% slice(3,5,7,9,11,13,18,23,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,65)
# updating the description of variables to the dictionary
Dictionary1$label<-c("case name assigned to the mass shooting ",
"City",
"State",
"Date of incident",
"Summary of the Mass shooting",
"The number of people dead",
"The number of people injured",
"Total number of victims",
"Type of the place",
"Age of the shooter",
"Any prior signs for the offender's mental issues",
"Mental health details of the shooter",
"Legality of the weapons obtained",
"where was the weapons obtained from",
"Type of the weapon used",
"Details of the weapon",
"Race of the shooter",
"Gender of the shooter",
"Source of the data",
"Sources of the mental health",
"Additional sources",
"Latitude of the place of incident",
"Longitude of the place of incident",
"Type of Mass shooting",
"Year of mass shooting")
Dictionary1<- select(Dictionary1,1:3)
# renaming the variables
Dictionary1<- Dictionary1 %>% rename(Variable = item,Description=label,Type=class)
Dictionary1[Dictionary1 == "character"] <- "Categorical"
Dictionary1[Dictionary1 == "numeric"] <- "Quantitative"
flextable(Dictionary1,cwidth = 4,cheight = 1)Variable | Description | Type |
|---|---|---|
case | case name assigned to the mass shooting | Categorical |
City | City | Categorical |
STATE | State | Categorical |
date | Date of incident | Date |
summary | Summary of the Mass shooting | Categorical |
fatalities | The number of people dead | Quantitative |
injured | The number of people injured | Quantitative |
total_victims | Total number of victims | Quantitative |
Place | Type of the place | Categorical |
age_of_shooter | Age of the shooter | Categorical |
prior_signs_mental_health_issues | Any prior signs for the offender's mental issues | Categorical |
mental_health_details | Mental health details of the shooter | Categorical |
weapons_obtained_legally | Legality of the weapons obtained | Categorical |
where_obtained | where was the weapons obtained from | Categorical |
weapon_type | Type of the weapon used | Categorical |
weapon_details | Details of the weapon | Categorical |
race | Race of the shooter | Categorical |
gender | Gender of the shooter | Categorical |
sources | Source of the data | Categorical |
mental_health_sources | Sources of the mental health | Categorical |
sources_additional_age | Additional sources | Categorical |
latitude | Latitude of the place of incident | Categorical |
longitude | Longitude of the place of incident | Categorical |
type | Type of Mass shooting | Categorical |
Year | Year of mass shooting | Quantitative |
# Creating data dictionary for 2nd dataset of Household firearm estimate
Dictionary2<-datadictionary::create_dictionary(State_level_data)
#Slicing the data dictionary to keep the required rows
Dictionary2<- Dictionary2 %>% slice(3,8,13,15,20,25,30,35,40,45,50,55,60,65,70,75,80,85,90,95)
# Selecting only required columns of the data ditionary
Dictionary2<- select(Dictionary2,1:3)
# renaming the variables
Dictionary2<- Dictionary2 %>% rename(Variable = item,Description=label,Type=class)
Dictionary2[Dictionary2 == "character"] <- "Categorical"
Dictionary2[Dictionary2 == "numeric"] <- "Quantitative"
# updating the description of variables to the dictionary
Dictionary2$Description<-c("State FIP value ",
"Year",
"Name of the State",
"Factor scores for household firearm ownership latent factor",
"Standard errors of factor scores for household firearm ownership latent factor",
"State has universal background checks law (1=yes; 0=no)",
"State has permit to purchase law (1=yes; 0=no)",
"Female Firearm Suicide/Total Female Suicide *100",
"Male Firearm Suicide/Total Male Suicide *100",
"BRFSS state-level survey estimate",
"Gallup state-level survey estimate",
"General Social Survey state-level survey estimate",
"Pew state-level survey estimate",
"Square root of Hunting Licenses/Population",
"Within-year standardization of the square root of Guns and Ammo Subscriptions/Population",
"Within-year standardization of Background Checks (without checks for permits)/Population",
"Indicator for Pew firearm ownership question wording change",
"Blended linear spline 1 represents roughly 1980–1992",
"Blended linear spline 1 represents roughly 1993–2004",
"Blended linear spline 1 represents roughly 2005–2016")
flextable(Dictionary2,cwidth = 4,cheight = 1)Variable | Description | Type |
|---|---|---|
FIP | State FIP value | Quantitative |
Year | Year | Quantitative |
STATE | Name of the State | Categorical |
HFR | Factor scores for household firearm ownership latent factor | Quantitative |
HFR_se | Standard errors of factor scores for household firearm ownership latent factor | Quantitative |
universl | State has universal background checks law (1=yes; 0=no) | Quantitative |
permit | State has permit to purchase law (1=yes; 0=no) | Quantitative |
Fem_FS_S | Female Firearm Suicide/Total Female Suicide *100 | Quantitative |
Male_FS_S | Male Firearm Suicide/Total Male Suicide *100 | Quantitative |
BRFSS | BRFSS state-level survey estimate | Quantitative |
GALLUP | Gallup state-level survey estimate | Quantitative |
GSS | General Social Survey state-level survey estimate | Quantitative |
PEW | Pew state-level survey estimate | Quantitative |
HuntLic | Square root of Hunting Licenses/Population | Quantitative |
GunsAmmo | Within-year standardization of the square root of Guns and Ammo Subscriptions/Population | Quantitative |
BackChk | Within-year standardization of Background Checks (without checks for permits)/Population | Quantitative |
PewQChng | Indicator for Pew firearm ownership question wording change | Quantitative |
BS1 | Blended linear spline 1 represents roughly 1980–1992 | Quantitative |
BS2 | Blended linear spline 1 represents roughly 1993–2004 | Quantitative |
BS3 | Blended linear spline 1 represents roughly 2005–2016 | Quantitative |
Creating 5 unique visualizations
Plot 1
From the below graph we can observe most number of shootings took place over the time (1982-2022)
# plot 1: Line chart showing in which year the most shootings took place
# creating a new df Yearcount that has the Count of mass shootings in the specific year
YearCount <- MassShootings %>% count(year)
massshootings_year<-YearCount %>% ggplot(aes(x = year, y = n)) +
geom_line(size=1,color="Red") +
scale_y_continuous(breaks = c(2,4,6,8,10,12))+
scale_x_continuous(breaks = c(1980,1985,1990,1995,2000,2005,2010,2015,2020,2025))+
labs(title = "Mass shootings in the United States from 1980-2022",
x = "Year",
y = "Number of incidents",
caption = "Data source: www.motherjones.com") +
theme(plot.title = element_text(hjust = 0.5))+
theme_get()
massshootings_yearPlot 2
From the below graph we can observe that number of mass shooting incidents took place in every state and compare them.
# creating a new df locationcount that has the values of number of mass shootings occured in each state
locationCount <- MassShootings %>% count(STATE)
massshootings_state<-locationCount %>% ggplot(aes(x = fct_reorder(STATE, n), y = n)) +
geom_col(color="black",fill="#87CEEB") +
coord_flip()+
scale_y_continuous(breaks = c(0,4,8,12,16,20,24,28))+
labs(title = "Number of shootings per state in the United States",
x = "States",
y = "Number of incidents",
caption = "Data source: www.motherjones.com") +
theme_bw()+
theme(axis.text.y = element_text(size = 7))
massshootings_statePlot 3
From this plot we can see the mass shootings committed by suspects of different ethnicity.
# Removing the Las vegas mass shooting data as it was only big outlier that alters the results of analysis and for better visualization
massshootings2 <- MassShootings[-45,]
massshootings_race<-massshootings2 %>% mutate(race=fct_relevel(race,"White","Black","Unknown","Latino","Asian","Other","Native American"))%>%
ggplot(aes(x=race,y=total_victims,fill=race)) +
geom_boxplot(color="black") +
scale_y_continuous(breaks = c(0,10,20,30,40,50,60,70,80,90,100,110,120))+
scale_fill_viridis_d()+
labs(title = "Victims in mass shootings by suspects of different ethnicities ",
x = "Race of the criminal",
y = "Number of victims",
caption = "Data source: www.motherjones.com")+
theme_light()+
theme( legend.position = "bottom", panel.border = element_rect(color = "black", fill = NA),
axis.line = element_line(color = "black"),
axis.ticks = element_line(color = "black"),
axis.text = element_text(color = "black",size = 7))
massshootings_racePlot4
Pivotting data into longer format
This visualization was created to identify firearm suicide commited by people in mississippi based on their gender.
# Creating a new table from the data set with data required for the plot from the state of mississippi
Mississippi_firearm_suicide<-filter(State_level_data,STATE=="Mississippi")%>%select("STATE","Year","Male_FS_S","Fem_FS_S")
# Multiplying the variables with 100 as they are very small in number and hard to visualize if plotted with same values
Mississippi_firearm_suicide <- Mississippi_firearm_suicide %>% mutate(Male_FS_S=Male_FS_S*100,Fem_FS_S=Fem_FS_S*100)
# renaming column names to male and female instead of Male_FS_S, Fem_FS_S
Mississippi_firearm_suicide <- rename(Mississippi_firearm_suicide, Male=Male_FS_S,Female=Fem_FS_S)
# Pivoting the variables to make it easier to visualize both genders in the same plot
Mississippi_firearm_suicide <- Mississippi_firearm_suicide %>% pivot_longer(cols = c(Male,Female),names_to = "Gender",values_to ="Firearm_sucide_rate")
# Plot 4
mississippi_suicide_gender<-Mississippi_firearm_suicide %>% ggplot(aes(x=Year,y=Firearm_sucide_rate,color=Gender))+
geom_line(size=1)+
geom_point(shape=16, color="black", fill="black", size=1)+
scale_x_continuous(breaks = c(1980,1985,1990,1995,2000,2005,2010,2015,2020))+
scale_color_discrete(labels = c("Female", "Male"))+
labs(title = "Firearm suicide rate in mississippi based on gender",
x="Year",
Y="Rate of firearm suicide",
caption = "RAND : State-Level Estimates of Household Firearm Ownership")+
theme_light()+
theme(legend.position = "bottom",panel.border = element_rect(color = "black", fill = NA),
axis.line = element_line(color = "black"),
axis.ticks = element_line(color = "black"),
axis.text = element_text(color = "black",size = 7))
mississippi_suicide_genderplot 5
Joining two datasets using the function Left_join()
This plot is to identify the firearm suicides in males per year
# Joining the two data sets by Year and STATE variables to keep the Variables Year,STATE,Fem_FS_S,Male_FS_S to find out male suicides from 1980-2020
Newtable <- left_join(MassShootings,select(State_level_data,c(Year,STATE,Fem_FS_S,Male_FS_S)),by=c("Year","STATE"))
# Removing the NA values in the newtable as there are 53 missing values after left joining the data sets
Newtable <- Newtable[complete.cases(Newtable),]
# Multiplying the variable by 100 to make it visually better in the plot
Newtable$Fem_FS_S <- Newtable$Fem_FS_S*100
Newtable$Male_FS_S <- Newtable$Male_FS_S*100
# The Male_FS_S variable is - Male Firearm Suicide/Total Male Suicide *100
# creating a plot to show the male fire arm suicide percentage
male_suicide_percentage<-Newtable %>% ggplot(aes(x=Year,y=Male_FS_S))+
geom_col(color="red",fill="red")+
scale_x_continuous(breaks = c(1980,1985,1990,1995,2000,2005,2010,2015,2020))+
labs(title = " Percentage of Male firearm suicide per year ",
subtitle = "The male firearm suicide variable is multiplied by 100 to make it visually better",
x = "Year",
y = "percentage of Male firearm suicides",
caption = "Data source: www.motherjones.com & RAND : State-Level Estimates of Household Firearm Ownership") +
theme_bw()
male_suicide_percentageTwo tables of group-level summary statistics
The two summary tables shows the summary statistics of of firearm suicide rate based on gender and number of victims in mass shootings by ethnicity of suspect.
# 1- This data frame is for the summary statistics of firearm suicide rate in mississipi based on gender
firearm_suicide_gender<- Mississippi_firearm_suicide %>% group_by(Gender) %>%
summarise(mean_f=round(mean(Firearm_sucide_rate),2),
median_f=round(median(Firearm_sucide_rate),2),
sd_f=round(sd(Firearm_sucide_rate),2),
max_f=round(max(Firearm_sucide_rate),2),
min_f=round(min(Firearm_sucide_rate),2),.groups = 'drop') %>%
as.data.frame()
firearm_suicide_gender## Gender mean_f median_f sd_f max_f min_f
## 1 Female 66.61 66.67 9.87 85.25 49.06
## 2 Male 78.03 79.09 5.52 87.05 67.53
# 2- Summary statistics data frame of number of victims in mass shootings by ethnicity of suspect
total_victims_race<- massshootings2 %>% group_by(race) %>%
summarise(mean_v=round(mean(total_victims),2),
median_v=round(median(total_victims),2),
sd_v=round(sd(total_victims),2),
max_v=round(max(total_victims),2),
min_v=round(min(total_victims),2),.groups = 'drop') %>%
as.data.frame()
total_victims_race## race mean_v median_v sd_v max_v min_v
## 1 Asian 13.75 7 17.29 55 3
## 2 Black 9.30 7 6.06 25 0
## 3 Latino 10.45 7 9.84 38 3
## 4 Native American 9.00 6 5.20 15 6
## 5 Other 41.20 35 36.91 102 7
## 6 Unknown 9.46 6 8.16 34 4
## 7 White 16.79 11 14.43 82 3
Frequency table
Frequency table to show the frequency of massshootings occuring at places by the suspects of different ethnicities.
From the below table we can observe that most number of incidents took place in the workplace
freq_table<- table(MassShootings$Place, MassShootings$race)
freq_table ##
## Asian Black Latino Native American Other Unknown White
## Airport 0 0 1 0 0 0 0
## Military 0 1 1 0 2 1 1
## Other 2 9 3 1 1 3 32
## Religious 0 0 0 0 0 1 7
## School 3 1 1 2 1 1 11
## Workplace 3 12 5 0 1 7 21
as.data.frame(freq_table)## Var1 Var2 Freq
## 1 Airport Asian 0
## 2 Military Asian 0
## 3 Other Asian 2
## 4 Religious Asian 0
## 5 School Asian 3
## 6 Workplace Asian 3
## 7 Airport Black 0
## 8 Military Black 1
## 9 Other Black 9
## 10 Religious Black 0
## 11 School Black 1
## 12 Workplace Black 12
## 13 Airport Latino 1
## 14 Military Latino 1
## 15 Other Latino 3
## 16 Religious Latino 0
## 17 School Latino 1
## 18 Workplace Latino 5
## 19 Airport Native American 0
## 20 Military Native American 0
## 21 Other Native American 1
## 22 Religious Native American 0
## 23 School Native American 2
## 24 Workplace Native American 0
## 25 Airport Other 0
## 26 Military Other 2
## 27 Other Other 1
## 28 Religious Other 0
## 29 School Other 1
## 30 Workplace Other 1
## 31 Airport Unknown 0
## 32 Military Unknown 1
## 33 Other Unknown 3
## 34 Religious Unknown 1
## 35 School Unknown 1
## 36 Workplace Unknown 7
## 37 Airport White 0
## 38 Military White 1
## 39 Other White 32
## 40 Religious White 7
## 41 School White 11
## 42 Workplace White 21
Implementing a Permutation test
# 2b) Implement at least one permutation test based on a traditional hypothesis test, such as a two-sample t-test or a chi-squared test of independence, to test a hypothesis of interest for your data.
# creating a new variable shooter which determines if the shooter is minor or adult
MassShootings <- mutate(MassShootings, Shooter = ifelse(age_of_shooter >= 18, "Adult", "Minor"))
# creating a new dataframe with fatalities and shooter variables
TableA <- data.frame(MassShootings$fatalities,MassShootings$Shooter)
TableA<-TableA %>% rename(fatalities=MassShootings.fatalities,shooter=MassShootings.Shooter)
table(TableA$Shooter)## < table of extent 0 >
# creating a box plot of fatalities by those shooters
boxplot(TableA$fatalities~TableA$shooter, las = 1,
ylab = "fatalities ",
xlab = "shooter",
main = "fatalities by age of shooter")# difference in sample MEANS
mean(TableA$fatalities[TableA$shooter == "Minor"]) # mean for minor## [1] 5.769231
mean(TableA$fatalities[TableA$shooter == "Adult"]) # mean for Adult## [1] 8.098361
# absolute difference in means
test.stat1 <- abs(mean(TableA$fatalities[TableA$shooter == "Adult"]) -
mean(TableA$fatalities[TableA$shooter == "Minor"]))
test.stat1## [1] 2.32913
# difference in sample MEDIANS
median(TableA$fatalities[TableA$shooter == "Adult"]) # median for Adult## [1] 6
median(TableA$fatalities[TableA$shooter == "Minor"]) # median for minor## [1] 4
# absolute difference in medians
test.stat2 <- abs(median(TableA$fatalities[TableA$shooter == "Adult"]) -
median(TableA$fatalities[TableA$shooter == "Minor"]))
test.stat2## [1] 2
# Permutation Test
# for getting the same results by using the seed
set.seed(1979)
# number of observations to sample
n <- length(TableA$shooter)
# number of permutation samples to take
P <- 100000
# the variable we will resample from
variable <- TableA$fatalities
# initialize a matrix to store the permutation data
PermSamples <- matrix(0, nrow = n, ncol = P)
# each column is a permutation sample of data. now, get those permutation samples, using a loop
for(i in 1:P)
{
PermSamples[, i] <- sample(variable,
size = n,
replace = FALSE)
}
# first 5 of the samples of first 5 observations
PermSamples[1:5, 1:5]## [,1] [,2] [,3] [,4] [,5]
## [1,] 5 7 12 3 4
## [2,] 6 9 7 3 9
## [3,] 8 5 4 9 7
## [4,] 8 9 3 5 3
## [5,] 4 27 4 5 5
# initialize vectors to store all of the Test-stats
Perm.test.stat1 <- Perm.test.stat2 <- rep(0, P)
# loop thru, and calculate the test-stats
for (i in 1:P)
{
# calculate the perm-test-stat1 and save it
Perm.test.stat1[i] <- abs(mean(PermSamples[TableA$shooter == "Adult",i]) -
mean(PermSamples[TableA$shooter == "Minor",i]))
# calculate the perm-test-stat2 and save it
Perm.test.stat2[i] <- abs(median(PermSamples[TableA$shooter == "Adult",i]) -
median(PermSamples[TableA$shooter == "Minor",i]))
}
# The test stats
test.stat1; test.stat2## [1] 2.32913
## [1] 2
# First 15 observations of both the test stats
round(Perm.test.stat1[1:15], 1)## [1] 2.2 1.0 0.3 4.8 6.4 3.4 1.3 0.1 6.9 3.1 2.9 1.0 0.3 0.2 0.1
round(Perm.test.stat2[1:15], 1)## [1] 0.0 0.0 0.0 1.0 1.0 1.0 1.0 0.0 4.5 2.0 1.0 1.0 0.0 0.5 1.0
# Caluculating p value
(Perm.test.stat1 >= test.stat1)[1:15]## [1] FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE FALSE
## [13] FALSE FALSE FALSE
mean((Perm.test.stat1 >= test.stat1)[1:15])## [1] 0.4
mean(Perm.test.stat1 >= test.stat1)## [1] 0.28505
mean(Perm.test.stat2 >= test.stat2)## [1] 0.11823
Implementing Bootstrap (Non-parametric)
# Simulating data from distribution
set.seed(1989)
n <- 135
observed <- MassShootings$total_victims
# Create a raincloud plot to visualize the simulated data. Describe the data in terms of its number of peaks (e.g., unimodal, bimodal, etc.) and its symmetry.
tibble(value=observed) %>% ggplot(aes(y = value)) +
ggdist::stat_halfeye(adjust = .5, width = 2*.3, .width = c(0.5, 1)) +
geom_boxplot(width = .3, outlier.shape = NA) +
ggdist::stat_dots(side = "left", dotsize = 6, justification = 1.05, binwidth = .1,
color = "black") +
coord_flip() +
labs(y = "Value",
title = "simulated values from non-normal distribution") +
theme_bw() +
theme(legend.position = "none")# What is the sample median for this data?
median(observed)## [1] 10
# Number of bootstrap samples
B <- 10000
# Instantiating matrix for bootstrap samples
boots <- matrix(NA, nrow = n, ncol = B)
# Sampling with replacement B times
for(b in 1:B) {
boots[, b] <- observed[sample(1:n, size = n, replace = TRUE)]
}
# Using the generated bootstrap samples, create a bootstrap distribution of sample medians, and visualize this distribution using a histogram.
# Installing vector for bootstrap medians
bootmedians <- vector(length=B)
#sampling with replacement B times
for (b in 1:B){
bootmedians[b]<-median(boots[,b])
}
# creating a histogram
tibble(Median=bootmedians) %>% ggplot(aes(x = Median)) +
geom_histogram(color="white")+
labs(y = "frequency",
title = "Distribution of bootstrap medians") +
theme_bw()# Use the bootstrap samples to obtain a nonparametric estimate of the standard error of the sample median.
SEestimate<-sd(bootmedians)
SEestimate## [1] 0.9319044
# Use the bootstrap samples to obtain a nonparametric 95% confidence interval for the population median.
lowerBoundMed <- quantile(bootmedians,probs=0.025)
upperBoundMed <- quantile(bootmedians,probs=0.975)We are 95% confident that the true median is in between 8 and 11.
# creating a histogram
tibble(Median=bootmedians) %>% ggplot(aes(x = Median)) +
geom_histogram(color="white")+
geom_vline(xintercept= lowerBoundMed,size=1,
color="dodgerblue",linetype="solid")+
geom_vline(xintercept= upperBoundMed,size=1,
color="dodgerblue",linetype="solid")+
labs(y = "frequency",
title = "Distribution of bootstrap medians") +
theme_bw()Parametric Bootstrapping
# Generate B=10,000 samples each of size 30 from a normal distribution using the estimated mean and standard deviation from the observed data set, and visualize this distribution using a histogram.
B <- 10000
# Instantiating matrix for bootstrap samples
paramBoots <- matrix(NA, nrow = n, ncol = B)
xBar<-mean(observed)
s<-sd(observed)
# Sampling a normal set of n values, B times
for(b in 1:B) {
paramBoots[, b] <- rnorm(n=n,mean=xBar,sd=s)
}
# Installing vector for bootstrap medians
bootParmedians <- vector(length=B)
# caluculating median for each simulated data set
for (b in 1:B){
bootParmedians[b]<-median(paramBoots[,b])
}
# creating a histogram
tibble(Median=bootParmedians) %>% ggplot(aes(x = Median)) +
geom_histogram(color="white")+
labs(y = "frequency",
title = "Distribution of Parametric bootstrap medians") +
theme_bw()# Obtain a parametric bootstrap estimate of the standard error of the sample median.
SEparestimate<-sd(bootmedians)
SEparestimate## [1] 0.9319044
# Obtain a parametric bootstrap 95% confidence interval for the sample median.
lowerBoundParMed <- quantile(bootParmedians,probs=0.025)
upperBoundParMed <- quantile(bootParmedians,probs=0.975)
# creating a histogram
tibble(Median=bootParmedians) %>% ggplot(aes(x = Median)) +
geom_histogram(color="white")+
geom_vline(xintercept= lowerBoundParMed,size=1,
color="dodgerblue",linetype="solid")+
geom_vline(xintercept= upperBoundParMed,size=1,
color="dodgerblue",linetype="solid")+
labs(y = "frequency",
title = "Distribution of parametric bootstrap medians") +
theme_bw()We are 95% confident that the true median is in between 8.16 and 29.96.
The parametric bootstrap operates under the assumption of data being in a normal distribution but since the data is skewed in nature the score of parametric bootstrap is very bad.
Group Project - Contributions
Collaboration was critical to the project’s success. Narender Tumu and Sujith Prakash Parsa are members of the project team. Because the team was small, communication was much easier. Throughout the project, we always reviewed each other’s work and made improvements. We used several online tools, such as “Google Drive” and “Google Meet,” to share project files and meet on a regular basis.
Data and idea for project - Narender,Sujith (Thank you professor for helping us to find the data)
Data dictionary and exploratory data analysis - Narender
Data visualization (plot 1&2) - Sujith
Data visualization (plot 3,4,5) - Narender
Tables of summary statistics, merging data sets, pivoting data, Date/time manipulation, string manipulation, Dashboard - Narender
Frequency table,Project report,Implementing permutation and Bootstrap(Parametric and Non-parametric) methods - Sujith