Analysis of Mass shootings in USA

Dashboard

Use the below link for our visualization dashboard

https://rpubs.com/tumun/981145

Loading packages

This project makes use of the following libraries. We used “readr” and “readxl” to read the data, “skimr” to explore and find different data characteristics, “datadictonary” and “flextable” to create and display  data dictionary containing a description of each variable in the dataset, and “lubridate” and “stringr” to manipulate the data. These are some of the primary functions carried out with the assistance of the libraries listed below.

library(readr)
library(readxl)
library(skimr)
library(flextable)
library(dplyr)
library(stringr)
library(tidyr)
library(lubridate)
library(stringr)
library(ggplot2)
library(forcats)
library(datadictionary)
library(tidyverse)

Importing the data

These are the datasets that were used in this project. First, there is “Mass Shootings.csv,” which contains data on mass shootings in the United States of America from 1982 to 2022. The second dataset was “State level data.xlsx,” which contains estimates of the proportion of adult, noninstitutionalized residents who live in a household with a firearm in each U.S. state and year (1980-2016), as well as the data used to produce these estimates.

MassShootings <- read_csv("Mother Jones - Mass Shootings Database, 1982 - 2022 - Sheet1.csv")

State_level_data <- read_excel("Sate level data & factor score.xlsx")

Exploring the data

Exploring and displaying high-level characteristics of the data sets, e.g., important variables and their types, different levels for factor variables, any patterns of missing values.

After Exploring the data by the use of skim function, I can observe that there are no missing values in the datasets I have used for this project.

skim(MassShootings)
Data summary
Name MassShootings
Number of rows 135
Number of columns 24
_______________________
Column type frequency:
character 22
numeric 2
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
case 0 1.00 12 45 0 135 0
location 0 1.00 12 36 0 125 0
date 0 1.00 8 10 0 135 0
summary 0 1.00 83 908 0 135 0
injured 0 1.00 1 21 0 32 0
total_victims 0 1.00 1 3 0 41 0
Place 0 1.00 5 10 0 10 0
age_of_shooter 0 1.00 1 2 0 45 0
prior_signs_mental_health_issues 0 1.00 1 7 0 7 0
mental_health_details 0 1.00 1 619 0 93 0
weapons_obtained_legally 0 1.00 1 131 0 9 0
where_obtained 0 1.00 1 124 0 75 0
weapon_type 0 1.00 1 389 0 72 0
weapon_details 1 0.99 1 328 0 108 0
race 0 1.00 1 15 0 10 0
gender 0 1.00 1 13 0 5 0
sources 0 1.00 41 951 0 135 0
mental_health_sources 0 1.00 1 444 0 85 0
sources_additional_age 0 1.00 1 951 0 101 0
latitude 0 1.00 1 11 0 127 0
longitude 0 1.00 1 12 0 127 0
type 0 1.00 4 5 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
fatalities 0 1 7.87 7.60 3 4.0 6 8.5 58 ▇▁▁▁▁
year 0 1 2009.84 10.71 1982 2003.5 2014 2018.0 2022 ▁▂▂▃▇
skim(State_level_data)
Data summary
Name State_level_data
Number of rows 1850
Number of columns 20
_______________________
Column type frequency:
character 1
numeric 19
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
STATE 0 1 4 14 0 50 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
FIP 0 1 29.32 15.63 1.00 17.00 29.50 42.00 56.00 ▆▇▇▇▇
Year 0 1 1998.00 10.68 1980.00 1989.00 1998.00 2007.00 2016.00 ▇▇▇▇▇
HFR 0 1 0.44 0.14 0.03 0.37 0.46 0.55 0.80 ▂▂▇▇▁
HFR_se 0 1 0.03 0.01 0.02 0.03 0.03 0.04 0.05 ▂▂▇▅▃
universl 0 1 0.18 0.38 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▂
permit 0 1 0.22 0.42 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▂
Fem_FS_S 0 1 -0.05 1.96 -9.00 0.24 0.35 0.49 0.92 ▁▁▁▁▇
Male_FS_S 0 1 0.62 0.13 0.14 0.56 0.63 0.71 0.87 ▁▂▃▇▃
BRFSS 0 1 -8.25 2.55 -9.00 -9.00 -9.00 -9.00 0.64 ▇▁▁▁▁
GALLUP 0 1 -5.73 4.51 -9.00 -9.00 -9.00 0.43 0.80 ▇▁▁▁▅
GSS 0 1 -4.75 4.68 -9.00 -9.00 -9.00 0.40 0.73 ▇▁▁▁▆
PEW 0 1 -6.02 4.39 -9.00 -9.00 -9.00 0.35 0.73 ▇▁▁▁▃
HuntLic 0 1 0.37 0.28 -9.00 0.25 0.34 0.46 1.14 ▁▁▁▁▇
GunsAmmo 0 1 0.00 0.99 -3.47 -0.65 -0.03 0.42 3.93 ▁▅▇▁▁
BackChk 0 1 -4.62 4.55 -9.00 -9.00 -9.00 -0.13 2.82 ▇▁▁▆▂
PewQChng 0 1 0.11 0.31 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
BS1 0 1 9.76 3.58 0.00 8.73 12.00 12.00 12.00 ▁▁▁▁▇
BS2 0 1 6.00 5.06 0.00 0.27 6.00 11.73 12.00 ▇▂▂▂▇
BS3 0 1 2.24 3.58 0.00 0.00 0.00 3.27 12.00 ▇▁▁▁▁

Using Lubridate package and String modification

Cleaning the data set and modifying the variables and data to match our requirements for the Project

# Removing the two values as they were not provided and given as character. So replacing them with 0

MassShootings[7,6] = "0"
MassShootings[7,7] = "0"

# Changing the variable type character to numeric to perform group level summary statistics

MassShootings$injured <- as.numeric(as.character(MassShootings$injured))
MassShootings$total_victims <- as.numeric(as.character(MassShootings$total_victims))


# separating the location variable into two variables city and state and removing the space before state values using str_trim

# Using lubridate package and the function mdy to modify the date variable
MassShootings<- separate(MassShootings,location,c("City","STATE"),sep=",") %>% mutate(STATE=str_trim(STATE),date=mdy(date), Year=year(date))



# Making adjustments to the race variable that are capitalized differently and has no values using mutate and stringr function str_replace


MassShootings <- MassShootings %>% mutate(across('race',str_replace,"black","Black"))
MassShootings <- MassShootings %>% mutate(across('race',str_replace,"unclear","Unknown"))
MassShootings <- MassShootings %>% mutate(across('race',str_replace,"-","Unknown"))
MassShootings <- MassShootings %>% mutate(across('race',str_replace,"white","White"))

# Modifying the place variable with different capitalizations and wordings

MassShootings <- MassShootings %>% mutate(across('Place',str_replace,"workplace","Workplace"))
MassShootings <- MassShootings %>% mutate(across('Place',str_replace,"\nWorkplace","Workplace"))
MassShootings <- MassShootings %>% mutate(across('Place',str_replace,"religious","Religious"))
MassShootings <- MassShootings %>% mutate(across('Place',str_replace,"Other\n","Other"))

Data Dictionary for each of the datasets

# creating a data dictionary using the function datadictionary and keeping only required data using slice function

Dictionary1<-datadictionary::create_dictionary(MassShootings)

# Slicing only required rows of the data dictionary

Dictionary1<- Dictionary1 %>% slice(3,5,7,9,11,13,18,23,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,65)

# updating the description of variables to the dictionary

Dictionary1$label<-c("case name assigned to the mass shooting ",
                     "City", 
                     "State",
                     "Date of incident",
                    "Summary of the Mass shooting",
                    "The number of people dead",
                    "The number of people injured",
                    "Total number of victims",
                     "Type of the place",
                    "Age of the shooter",
                    "Any prior signs for the offender's mental issues",
                    "Mental health details of the shooter",
                     "Legality of the weapons obtained",
                    "where was the weapons obtained from",
                    "Type of the weapon used",
                    "Details of the weapon",
                    "Race of the shooter",
                     "Gender of the shooter",
                    "Source of the data",
                    "Sources of the mental health",
                    "Additional sources",
                    "Latitude of the place of incident",
                     "Longitude of the place of incident",
                    "Type of Mass shooting",
                    "Year of mass shooting")

Dictionary1<- select(Dictionary1,1:3)

# renaming the variables

Dictionary1<- Dictionary1 %>% rename(Variable = item,Description=label,Type=class)
Dictionary1[Dictionary1 == "character"] <- "Categorical"
Dictionary1[Dictionary1 == "numeric"] <- "Quantitative"

flextable(Dictionary1,cwidth = 4,cheight = 1)
# Creating data dictionary for 2nd dataset of Household firearm estimate

Dictionary2<-datadictionary::create_dictionary(State_level_data)

#Slicing the data dictionary to keep the required rows 
Dictionary2<- Dictionary2 %>% slice(3,8,13,15,20,25,30,35,40,45,50,55,60,65,70,75,80,85,90,95)

# Selecting only required columns of the data ditionary
Dictionary2<- select(Dictionary2,1:3)

# renaming the variables

Dictionary2<- Dictionary2 %>% rename(Variable = item,Description=label,Type=class)
Dictionary2[Dictionary2 == "character"] <- "Categorical"
Dictionary2[Dictionary2 == "numeric"] <- "Quantitative"

# updating the description of variables to the dictionary

Dictionary2$Description<-c("State FIP value ",
                           "Year",
                          "Name of the State",
                          "Factor scores for household firearm ownership latent factor",
                          "Standard errors of factor scores for household firearm ownership latent factor",
                          "State has universal background checks law (1=yes; 0=no)",
                          "State has permit to purchase law (1=yes; 0=no)",
                          "Female Firearm Suicide/Total Female Suicide *100",
                          "Male Firearm Suicide/Total Male Suicide *100",
                          "BRFSS state-level survey estimate",
                          "Gallup state-level survey estimate",
                          "General Social Survey state-level survey estimate",
                          "Pew state-level survey estimate",
                          "Square root of Hunting Licenses/Population",
                          "Within-year standardization of the square root of Guns and Ammo Subscriptions/Population",
                          "Within-year standardization of Background Checks (without checks for permits)/Population",
                          "Indicator for Pew firearm ownership question wording change",
                          "Blended linear spline 1 represents roughly 1980–1992",
                          "Blended linear spline 1 represents roughly 1993–2004",
                          "Blended linear spline 1 represents roughly 2005–2016")

flextable(Dictionary2,cwidth = 4,cheight = 1)

Creating 5 unique visualizations

Plot 1

From the below graph we can observe most number of shootings took place over the time (1982-2022)

# plot 1: Line chart showing in which year the most shootings took place

# creating a new df Yearcount that has the Count of mass shootings in the specific year

YearCount <- MassShootings %>% count(year)

massshootings_year<-YearCount %>% ggplot(aes(x = year, y = n)) +
  geom_line(size=1,color="Red") +
  scale_y_continuous(breaks = c(2,4,6,8,10,12))+
  scale_x_continuous(breaks = c(1980,1985,1990,1995,2000,2005,2010,2015,2020,2025))+
  labs(title = "Mass shootings in the United States from 1980-2022",
       x = "Year",
       y = "Number of incidents",
       caption = "Data source: www.motherjones.com") +
  theme(plot.title = element_text(hjust = 0.5))+
  theme_get()

massshootings_year

Plot 2

From the below graph we can observe that number of mass shooting incidents took place in every state and compare them.

# creating a new df locationcount that has the values of number of mass shootings occured in each state
locationCount <- MassShootings %>% count(STATE)

massshootings_state<-locationCount %>% ggplot(aes(x = fct_reorder(STATE, n), y = n)) +
  geom_col(color="black",fill="#87CEEB") +
  coord_flip()+
  scale_y_continuous(breaks = c(0,4,8,12,16,20,24,28))+
  labs(title = "Number of shootings per state in the United States",
       x = "States",
       y = "Number of incidents",
       caption = "Data source: www.motherjones.com") +
  theme_bw()+
  theme(axis.text.y  = element_text(size = 7))

massshootings_state

Plot 3

From this plot we can see the mass shootings committed by suspects of different ethnicity.

# Removing the Las vegas mass shooting data as it was only big outlier that alters the results of analysis and for better visualization 
massshootings2 <- MassShootings[-45,]

massshootings_race<-massshootings2 %>% mutate(race=fct_relevel(race,"White","Black","Unknown","Latino","Asian","Other","Native American"))%>%
  ggplot(aes(x=race,y=total_victims,fill=race)) +
  geom_boxplot(color="black") +
  scale_y_continuous(breaks = c(0,10,20,30,40,50,60,70,80,90,100,110,120))+
  scale_fill_viridis_d()+
  labs(title = "Victims in mass shootings by suspects of different ethnicities ",
       x = "Race of the criminal",
       y = "Number of victims",
       caption = "Data source: www.motherjones.com")+
  theme_light()+
  theme( legend.position = "bottom", panel.border = element_rect(color = "black", fill = NA),
         axis.line = element_line(color = "black"),
      axis.ticks = element_line(color = "black"),
      axis.text = element_text(color = "black",size = 7))

massshootings_race

Plot4

Pivotting data into longer format

This visualization was created to identify firearm suicide commited by people in mississippi based on their gender.

# Creating a new table from the data set with data required for the plot from the state of mississippi

Mississippi_firearm_suicide<-filter(State_level_data,STATE=="Mississippi")%>%select("STATE","Year","Male_FS_S","Fem_FS_S")

# Multiplying the variables with 100 as they are very small in number and hard to visualize if plotted with same values
Mississippi_firearm_suicide <- Mississippi_firearm_suicide %>% mutate(Male_FS_S=Male_FS_S*100,Fem_FS_S=Fem_FS_S*100)

# renaming column names to male and female instead of Male_FS_S, Fem_FS_S

Mississippi_firearm_suicide <-  rename(Mississippi_firearm_suicide, Male=Male_FS_S,Female=Fem_FS_S)

# Pivoting the variables to make it easier to visualize both genders in the same plot
Mississippi_firearm_suicide <- Mississippi_firearm_suicide %>% pivot_longer(cols = c(Male,Female),names_to = "Gender",values_to ="Firearm_sucide_rate")

# Plot 4
mississippi_suicide_gender<-Mississippi_firearm_suicide %>% ggplot(aes(x=Year,y=Firearm_sucide_rate,color=Gender))+
  geom_line(size=1)+
  geom_point(shape=16, color="black", fill="black", size=1)+
  scale_x_continuous(breaks = c(1980,1985,1990,1995,2000,2005,2010,2015,2020))+
  scale_color_discrete(labels = c("Female", "Male"))+
  labs(title = "Firearm suicide rate in mississippi based on gender",
       x="Year",
       Y="Rate of firearm suicide",
       caption = "RAND : State-Level Estimates of Household Firearm Ownership")+
  theme_light()+
  theme(legend.position = "bottom",panel.border = element_rect(color = "black", fill = NA),
         axis.line = element_line(color = "black"),
      axis.ticks = element_line(color = "black"),
      axis.text = element_text(color = "black",size = 7))

mississippi_suicide_gender

plot 5

Joining two datasets using the function Left_join()

This plot is to identify the firearm suicides in males per year

# Joining the two data sets by Year and STATE variables to keep the Variables Year,STATE,Fem_FS_S,Male_FS_S to find out male suicides from 1980-2020

Newtable <- left_join(MassShootings,select(State_level_data,c(Year,STATE,Fem_FS_S,Male_FS_S)),by=c("Year","STATE"))

# Removing the NA values in the newtable as there are  53 missing values after left joining the data sets

Newtable <- Newtable[complete.cases(Newtable),]

# Multiplying the variable by 100 to make it visually better in the plot

Newtable$Fem_FS_S <- Newtable$Fem_FS_S*100
Newtable$Male_FS_S <- Newtable$Male_FS_S*100


# The  Male_FS_S    variable is - Male Firearm Suicide/Total Male Suicide *100

# creating a plot to show the male fire arm suicide percentage

male_suicide_percentage<-Newtable %>% ggplot(aes(x=Year,y=Male_FS_S))+
  geom_col(color="red",fill="red")+
  scale_x_continuous(breaks = c(1980,1985,1990,1995,2000,2005,2010,2015,2020))+
  labs(title = " Percentage of Male firearm suicide per year ",
       subtitle = "The male firearm suicide variable is multiplied by 100 to make it visually better",
       x = "Year",
       y = "percentage of Male firearm suicides",
       caption = "Data source: www.motherjones.com & RAND : State-Level Estimates of Household Firearm Ownership") +
  theme_bw()
  

male_suicide_percentage

Two tables of group-level summary statistics

The two summary tables shows the summary statistics of of firearm suicide rate based on gender and number of victims in mass shootings by ethnicity of suspect.

# 1- This data frame is for the summary statistics of firearm suicide rate in mississipi based on gender

firearm_suicide_gender<- Mississippi_firearm_suicide %>% group_by(Gender) %>% 
  summarise(mean_f=round(mean(Firearm_sucide_rate),2),
            median_f=round(median(Firearm_sucide_rate),2),
            sd_f=round(sd(Firearm_sucide_rate),2),
            max_f=round(max(Firearm_sucide_rate),2),
            min_f=round(min(Firearm_sucide_rate),2),.groups = 'drop') %>%
  as.data.frame()

 firearm_suicide_gender
##   Gender mean_f median_f sd_f max_f min_f
## 1 Female  66.61    66.67 9.87 85.25 49.06
## 2   Male  78.03    79.09 5.52 87.05 67.53
# 2- Summary statistics data frame of number of victims in mass shootings by ethnicity of suspect
total_victims_race<- massshootings2 %>% group_by(race) %>% 
  summarise(mean_v=round(mean(total_victims),2),
            median_v=round(median(total_victims),2),
            sd_v=round(sd(total_victims),2),
            max_v=round(max(total_victims),2),
            min_v=round(min(total_victims),2),.groups = 'drop') %>%
  as.data.frame()

total_victims_race
##              race mean_v median_v  sd_v max_v min_v
## 1           Asian  13.75        7 17.29    55     3
## 2           Black   9.30        7  6.06    25     0
## 3          Latino  10.45        7  9.84    38     3
## 4 Native American   9.00        6  5.20    15     6
## 5           Other  41.20       35 36.91   102     7
## 6         Unknown   9.46        6  8.16    34     4
## 7           White  16.79       11 14.43    82     3

Frequency table

Frequency table to show the frequency of massshootings occuring at places by the suspects of different ethnicities.

From the below table we can observe that most number of incidents took place in the workplace

freq_table<- table(MassShootings$Place, MassShootings$race)
freq_table 
##            
##             Asian Black Latino Native American Other Unknown White
##   Airport       0     0      1               0     0       0     0
##   Military      0     1      1               0     2       1     1
##   Other         2     9      3               1     1       3    32
##   Religious     0     0      0               0     0       1     7
##   School        3     1      1               2     1       1    11
##   Workplace     3    12      5               0     1       7    21
as.data.frame(freq_table)
##         Var1            Var2 Freq
## 1    Airport           Asian    0
## 2   Military           Asian    0
## 3      Other           Asian    2
## 4  Religious           Asian    0
## 5     School           Asian    3
## 6  Workplace           Asian    3
## 7    Airport           Black    0
## 8   Military           Black    1
## 9      Other           Black    9
## 10 Religious           Black    0
## 11    School           Black    1
## 12 Workplace           Black   12
## 13   Airport          Latino    1
## 14  Military          Latino    1
## 15     Other          Latino    3
## 16 Religious          Latino    0
## 17    School          Latino    1
## 18 Workplace          Latino    5
## 19   Airport Native American    0
## 20  Military Native American    0
## 21     Other Native American    1
## 22 Religious Native American    0
## 23    School Native American    2
## 24 Workplace Native American    0
## 25   Airport           Other    0
## 26  Military           Other    2
## 27     Other           Other    1
## 28 Religious           Other    0
## 29    School           Other    1
## 30 Workplace           Other    1
## 31   Airport         Unknown    0
## 32  Military         Unknown    1
## 33     Other         Unknown    3
## 34 Religious         Unknown    1
## 35    School         Unknown    1
## 36 Workplace         Unknown    7
## 37   Airport           White    0
## 38  Military           White    1
## 39     Other           White   32
## 40 Religious           White    7
## 41    School           White   11
## 42 Workplace           White   21

Implementing a Permutation test

# 2b) Implement at least one permutation test based on a traditional hypothesis test, such as a two-sample t-test or a chi-squared test of independence, to test a hypothesis of interest for your data.
# creating a new variable shooter which determines if the shooter is minor or adult

MassShootings <- mutate(MassShootings, Shooter = ifelse(age_of_shooter >= 18, "Adult", "Minor"))

# creating a new dataframe with fatalities and shooter variables

TableA <- data.frame(MassShootings$fatalities,MassShootings$Shooter)

TableA<-TableA %>% rename(fatalities=MassShootings.fatalities,shooter=MassShootings.Shooter)

table(TableA$Shooter)
## < table of extent 0 >
# creating a box plot of fatalities by those shooters

boxplot(TableA$fatalities~TableA$shooter, las = 1, 
        ylab = "fatalities ", 
        xlab = "shooter",
        main = "fatalities by age of shooter")

# difference in sample MEANS
mean(TableA$fatalities[TableA$shooter == "Minor"]) # mean for minor
## [1] 5.769231
mean(TableA$fatalities[TableA$shooter == "Adult"]) # mean for Adult
## [1] 8.098361
# absolute difference in means
test.stat1 <- abs(mean(TableA$fatalities[TableA$shooter == "Adult"]) - 
                  mean(TableA$fatalities[TableA$shooter == "Minor"]))
test.stat1
## [1] 2.32913
# difference in sample MEDIANS
median(TableA$fatalities[TableA$shooter == "Adult"]) # median for Adult
## [1] 6
median(TableA$fatalities[TableA$shooter == "Minor"]) # median for minor
## [1] 4
# absolute difference in medians
test.stat2 <- abs(median(TableA$fatalities[TableA$shooter == "Adult"]) - 
                  median(TableA$fatalities[TableA$shooter == "Minor"]))  
test.stat2
## [1] 2
# Permutation Test
  
# for getting the same results by using the seed
set.seed(1979)  
  
# number of observations to sample
n <- length(TableA$shooter)  
  
# number of permutation samples to take
P <- 100000
  
# the variable we will resample from 
variable <- TableA$fatalities  
  
# initialize a matrix to store the permutation data
PermSamples <- matrix(0, nrow = n, ncol = P)

  
# each column is a permutation sample of data. now, get those permutation samples, using a loop

for(i in 1:P)
  {
    PermSamples[, i] <- sample(variable, 
                               size = n, 
                               replace = FALSE)
  }
  
# first 5 of the samples of first 5 observations
PermSamples[1:5, 1:5]
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    5    7   12    3    4
## [2,]    6    9    7    3    9
## [3,]    8    5    4    9    7
## [4,]    8    9    3    5    3
## [5,]    4   27    4    5    5
# initialize vectors to store all of the Test-stats
Perm.test.stat1 <- Perm.test.stat2 <- rep(0, P)
  
# loop thru, and calculate the test-stats
for (i in 1:P)
  {
    # calculate the perm-test-stat1 and save it
    Perm.test.stat1[i] <- abs(mean(PermSamples[TableA$shooter == "Adult",i]) - 
                              mean(PermSamples[TableA$shooter == "Minor",i]))
      
    # calculate the perm-test-stat2 and save it
    Perm.test.stat2[i] <- abs(median(PermSamples[TableA$shooter == "Adult",i]) - 
                              median(PermSamples[TableA$shooter == "Minor",i]))
  }
  

# The test stats

test.stat1; test.stat2
## [1] 2.32913
## [1] 2
# First 15 observations of both the test stats
round(Perm.test.stat1[1:15], 1)
##  [1] 2.2 1.0 0.3 4.8 6.4 3.4 1.3 0.1 6.9 3.1 2.9 1.0 0.3 0.2 0.1
round(Perm.test.stat2[1:15], 1)
##  [1] 0.0 0.0 0.0 1.0 1.0 1.0 1.0 0.0 4.5 2.0 1.0 1.0 0.0 0.5 1.0
# Caluculating p value
(Perm.test.stat1 >= test.stat1)[1:15]
##  [1] FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE FALSE
## [13] FALSE FALSE FALSE
mean((Perm.test.stat1 >= test.stat1)[1:15])
## [1] 0.4
mean(Perm.test.stat1 >= test.stat1)
## [1] 0.28505
mean(Perm.test.stat2 >= test.stat2)
## [1] 0.11823

Implementing Bootstrap (Non-parametric)

# Simulating data from distribution
set.seed(1989)
n <- 135
observed <- MassShootings$total_victims


# Create a raincloud plot to visualize the simulated data. Describe the data in terms of its number of peaks (e.g., unimodal, bimodal, etc.) and its symmetry.

tibble(value=observed) %>% ggplot(aes(y = value)) +
  ggdist::stat_halfeye(adjust = .5, width = 2*.3, .width = c(0.5, 1)) + 
    geom_boxplot(width = .3, outlier.shape = NA) +
  ggdist::stat_dots(side = "left", dotsize = 6, justification = 1.05, binwidth = .1,
                    color = "black") +
    coord_flip() +
    labs(y = "Value",
         title = "simulated values from non-normal distribution") + 
  theme_bw() +
  theme(legend.position = "none")

# What is the sample median for this data?

median(observed)
## [1] 10
# Number of bootstrap samples
B <- 10000

# Instantiating matrix for bootstrap samples
boots <- matrix(NA, nrow = n, ncol = B)

# Sampling with replacement B times
for(b in 1:B) {
boots[, b] <- observed[sample(1:n, size = n, replace = TRUE)]
}


#  Using the generated bootstrap samples, create a bootstrap distribution of sample medians, and visualize this distribution using a histogram.

# Installing vector for bootstrap medians
bootmedians <- vector(length=B)

#sampling with replacement B times

for (b in 1:B){
  bootmedians[b]<-median(boots[,b])
}

# creating a histogram

tibble(Median=bootmedians) %>% ggplot(aes(x = Median)) +
  geom_histogram(color="white")+
  labs(y = "frequency",
       title = "Distribution of bootstrap medians") + 
  theme_bw()

# Use the bootstrap samples to obtain a nonparametric estimate of the standard error of the sample median.

SEestimate<-sd(bootmedians)
SEestimate
## [1] 0.9319044
# Use the bootstrap samples to obtain a nonparametric 95% confidence interval for the population median.

lowerBoundMed <- quantile(bootmedians,probs=0.025)
upperBoundMed <- quantile(bootmedians,probs=0.975)

We are 95% confident that the true median is in between 8 and 11.

# creating a histogram

tibble(Median=bootmedians) %>% ggplot(aes(x = Median)) +
  geom_histogram(color="white")+
  geom_vline(xintercept= lowerBoundMed,size=1,
             color="dodgerblue",linetype="solid")+
  geom_vline(xintercept= upperBoundMed,size=1,
             color="dodgerblue",linetype="solid")+
  labs(y = "frequency",
       title = "Distribution of bootstrap medians") + 
  theme_bw()

Parametric Bootstrapping

# Generate B=10,000 samples each of size 30 from a normal distribution using the estimated mean and standard deviation from the observed data set, and visualize this distribution using a histogram.

B <- 10000

# Instantiating matrix for bootstrap samples
paramBoots <- matrix(NA, nrow = n, ncol = B)
xBar<-mean(observed)
s<-sd(observed)

# Sampling a normal set of n values, B times
for(b in 1:B) {
paramBoots[, b] <- rnorm(n=n,mean=xBar,sd=s)
}


# Installing vector for bootstrap medians
bootParmedians <- vector(length=B)

# caluculating median for each simulated data set
for (b in 1:B){
  bootParmedians[b]<-median(paramBoots[,b])
}


# creating a histogram

tibble(Median=bootParmedians) %>% ggplot(aes(x = Median)) +
  geom_histogram(color="white")+
  labs(y = "frequency",
       title = "Distribution of Parametric bootstrap medians") + 
  theme_bw()

# Obtain a parametric bootstrap estimate of the standard error of the sample median.

SEparestimate<-sd(bootmedians)
SEparestimate
## [1] 0.9319044
# Obtain a parametric bootstrap 95% confidence interval for the sample median.

lowerBoundParMed <- quantile(bootParmedians,probs=0.025)
upperBoundParMed <- quantile(bootParmedians,probs=0.975)


# creating a histogram

tibble(Median=bootParmedians) %>% ggplot(aes(x = Median)) +
  geom_histogram(color="white")+
  geom_vline(xintercept= lowerBoundParMed,size=1,
             color="dodgerblue",linetype="solid")+
   geom_vline(xintercept= upperBoundParMed,size=1,
             color="dodgerblue",linetype="solid")+
  labs(y = "frequency",
       title = "Distribution of parametric bootstrap medians") + 
  theme_bw()

We are 95% confident that the true median is in between 8.16 and 29.96.

The parametric bootstrap operates under the assumption of data being in a normal distribution but since the data is skewed in nature the score of parametric bootstrap is very bad.

Group Project - Contributions

Collaboration was critical to the project’s success. Narender Tumu and Sujith Prakash Parsa are members of the project team. Because the team was small, communication was much easier. Throughout the project, we always reviewed each other’s work and made improvements. We used several online tools, such as “Google Drive” and “Google Meet,” to share project files and meet on a regular basis.

Data and idea for project - Narender,Sujith (Thank you professor for helping us to find the data)

Data dictionary and exploratory data analysis - Narender

Data visualization (plot 1&2) - Sujith

Data visualization (plot 3,4,5) - Narender

Tables of summary statistics, merging data sets, pivoting data, Date/time manipulation, string manipulation, Dashboard - Narender

Frequency table,Project report,Implementing permutation and Bootstrap(Parametric and Non-parametric) methods - Sujith