Purpose

The advantage of this RMarkdown approach is that any analysis we know we’ll want in the future can be coded now. By replacing only a few lines of code at the start, we can run this same tailored analysis on an entirely different dataset. Switching datasets (e.g., combining years or adding new BARD data) then takes only a few clicks.

The primary goals of this approach are:

  1. Ease of reproducibility
  2. Automation of analyses to reduce potential user error

Section 1. Checking for Variable Missingness

The first thing we’re going to do is check the dataset for missing observations across variables. The function below returns a graph that shows missingness for every variable in the given dataset, which helps us decide which variables can be analyzed in the following sections.

check_missing_variables(Overview_deaths) # checking which variables are missing data
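The definition of check_missing_variables is not shown in this document, so the following is only a sketch of what such a helper might do. It assumes the goal is a bar chart of the share of missing values per variable; the names missing_share and plot_missing are hypothetical and are not part of the original analysis.

```r
library(ggplot2)

# Hypothetical helper: share of missing (NA) values in each column,
# returned as a data frame sorted from most to least missing.
missing_share <- function(data) {
  share <- colMeans(is.na(data))
  out <- data.frame(
    variable = names(share),
    prop_missing = unname(share),
    stringsAsFactors = FALSE
  )
  out[order(-out$prop_missing), ]
}

# Hypothetical plotting wrapper: one horizontal bar per variable.
plot_missing <- function(data) {
  ggplot(missing_share(data), aes(x = reorder(variable, prop_missing), y = prop_missing)) +
    geom_col(fill = "dodgerblue") +
    coord_flip() +
    scale_y_continuous(labels = scales::percent) +
    labs(x = "Variable", y = "Share of observations missing")
}

# Toy data standing in for Overview_deaths
toy <- data.frame(a = c(1, NA, 3, NA), b = c("x", "y", NA, "z"), c = 1:4)
missing_share(toy)
```

Variables near the top of the resulting table (or chart) are the ones to treat with caution in later sections.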

Section 2. Quick Counts

Now that we know what we can work with, let’s make some basic tables and crosstabs, primarily with the count function. (Only a few variables are shown here, to illustrate the capabilities.)

Overview_deaths %>%
  count(WaterConditions)
## # A tibble: 5 x 2
##   WaterConditions     n
##   <chr>           <int>
## 1 Calm              293
## 2 Choppy            157
## 3 Rough              66
## 4 Unknown            76
## 5 Very rough          9
Overview_deaths %>%
  count(NumberDeaths)
## # A tibble: 4 x 2
##   NumberDeaths     n
##          <dbl> <int>
## 1            1   542
## 2            2    50
## 3            3     6
## 4            4     3
table(Overview_deaths$WaterConditions, Overview_deaths$NumberDeaths)
##             
##                1   2   3   4
##   Calm       264  24   3   2
##   Choppy     140  13   3   1
##   Rough       58   8   0   0
##   Unknown     72   4   0   0
##   Very rough   8   1   0   0
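The same crosstab can also be produced as a data frame, which is easier to filter, join, or export than the output of table(). The sketch below uses toy data in place of Overview_deaths and assumes the dplyr and tidyr packages are available.

```r
library(dplyr)
library(tidyr)

# Toy data standing in for Overview_deaths
toy <- tibble(
  WaterConditions = c("Calm", "Calm", "Calm", "Rough", "Rough"),
  NumberDeaths    = c(1, 1, 2, 1, 1)
)

# Count each combination, then spread NumberDeaths across columns.
# values_fill = 0 fills combinations that never occur.
crosstab <- toy %>%
  count(WaterConditions, NumberDeaths) %>%
  pivot_wider(names_from = NumberDeaths, values_from = n, values_fill = 0)

crosstab
```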

Section 3. Key Graphs

Now, the interesting part.

# n() counts records, not deaths: a record with NumberDeaths = 2 still counts once
df_WaterConditions <- Overview_deaths %>%
  filter(WaterConditions != "Unknown") %>%  # drop records with unknown conditions
  group_by(WaterConditions) %>%
  summarise(counts = n())

ggplot(df_WaterConditions, aes(x = WaterConditions, y = counts)) +
  geom_col(fill = "dodgerblue") +
  geom_text(aes(label = counts), vjust = -0.3) +
  labs(y = "Total Deaths", x = "Water Conditions",
       title = "Number of Fatalities by Water Conditions at Time of Incident")

df_drownings <- Overview_deaths %>%
  group_by(NumberDrownings) %>%
  summarise(counts = n())

ggplot(df_drownings, aes(x = NumberDrownings, y = counts)) +
  geom_col(fill = "dodgerblue") +
  geom_text(aes(label = counts), vjust = -0.3) +
  labs(y = "Total Deaths", x = "Number of Drownings Reported", title = "Number of Fatalities")

df_DayofWeek <- Overview_deaths %>%
  group_by(DayofWeek) %>%
  summarise(counts = n())

ggplot(df_DayofWeek, aes(x = DayofWeek, y = counts)) +
  geom_col(fill = "dodgerblue") +
  geom_text(aes(label = counts), vjust = -0.3) +
  labs(y = "Total Number of Deaths", x = "", title = "Number of Fatalities by Day of the Week")

df_CauseofDeath <- Overview_deaths %>%
  group_by(CauseCat) %>%
  summarise(counts = n())

ggplot(df_CauseofDeath, aes(x = CauseCat, y = counts)) +
  geom_col(fill = "dodgerblue") +
  geom_text(aes(label = counts), vjust = -0.3) +
  labs(y = "Total Number of Deaths", x = "Primary Cause of Accident",
       title = "Number of Fatalities by Accident Cause") +
  theme(axis.text.x = element_text(angle = 15, hjust = 1))

df_TimeCat <- Overview_deaths %>%
  group_by(TimeCat) %>%
  summarise(counts = n())

# Labels for the 13 TimeCat codes, in code order (1 to 13)
time_labels <- c("12:00am to 2:30am", "2:31am to 4:30am", "4:31am to 6:30am", "6:31am to 8:30am",
                 "8:31am to 10:30am", "10:31am to 12:30pm", "12:31pm to 2:30pm", "2:31pm to 4:30pm",
                 "4:31pm to 6:30pm", "6:31pm to 8:30pm", "8:31pm to 10:30pm", "10:31pm to 11:59pm",
                 "Unknown")

# Convert the numeric codes to a labelled factor so the x axis is truly discrete
ggplot(df_TimeCat, aes(x = factor(TimeCat, levels = 1:13, labels = time_labels), y = counts)) +
  geom_col(fill = "dodgerblue") +
  geom_text(aes(label = counts), vjust = -0.3) +
  scale_x_discrete(drop = FALSE) +  # keep time slots that have no records
  theme(axis.text.x = element_text(angle = 55, hjust = 1)) +
  labs(x = "Time Observed", y = "Number of Fatalities", title = "Number of Fatalities by Time of Day")

df_BodyofWaterType <- Overview_deaths %>%
  group_by(TypeOfBodyOfWater) %>%
  summarise(counts = n())

ggplot(df_BodyofWaterType, aes(x = TypeOfBodyOfWater, y = counts)) +
  geom_col(fill = "dodgerblue") +
  geom_text(aes(label = counts), vjust = -0.3) +
  labs(y = "Total Deaths", x = "Body of Water", title = "Number of Fatalities by Body of Water Type") +
  theme(axis.text.x = element_text(angle = 55, hjust = 1))
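The charts above all follow the same count-then-plot pattern, so in keeping with the automation goal they could be generated by one small helper. The sketch below is an assumption about how that might look, not part of the original analysis: count_by and plot_counts are hypothetical names, and toy data stands in for Overview_deaths.

```r
library(dplyr)
library(ggplot2)

# Hypothetical helper: number of records per level of `var`.
count_by <- function(data, var) {
  count(data, {{ var }}, name = "counts")
}

# Hypothetical helper: the labelled bar chart used throughout this section.
# `angle` rotates the x-axis labels for variables with long category names.
plot_counts <- function(data, var, x_lab, y_lab, title, angle = 0) {
  ggplot(count_by(data, {{ var }}), aes(x = {{ var }}, y = counts)) +
    geom_col(fill = "dodgerblue") +
    geom_text(aes(label = counts), vjust = -0.3) +
    labs(x = x_lab, y = y_lab, title = title) +
    theme(axis.text.x = element_text(angle = angle, hjust = if (angle == 0) 0.5 else 1))
}

# Toy data standing in for Overview_deaths
toy <- tibble(DayofWeek = c("Sat", "Sat", "Sun", "Mon"))

p <- plot_counts(toy, DayofWeek,
                 x_lab = "", y_lab = "Number of Records",
                 title = "Records by Day of the Week")
```

With a helper like this, each additional chart becomes a single call, and a change to the styling only has to be made in one place.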

There are many listed primary accident causes (AccidentCause1). Let’s get a quick sense of which causes are most common before graphing them below.

table(Overview_deaths$AccidentCause1)
## 
##                           Alcohol use                              Dam/lock 
##                                    87                                     6 
##                              Drug use                     Equipment failure 
##                                     5                                     7 
##                       Excessive speed                       Failure to vent 
##                                    21                                     1 
##                    Force of wake/wave                      Hazardous waters 
##                                    11                                    58 
##                          Hull failure             Ignition of fuel or vapor 
##                                     5                                     4 
##                    Improper anchoring                      Improper loading 
##                                     3                                    16 
##                      Improper lookout                     Machinery failure 
##                                    23                                     9 
## Missing/inadequate aids to navigation            Navigation rules violation 
##                                     1                                    15 
##                  Operator inattention                 Operator inexperience 
##                                    50                                    42 
##                                 Other                           Overloading 
##                                    51                                    15 
##    People on gunwale, bow, or transom                     Restricted vision 
##                                    12                                     2 
##                            Sharp turn                      Starting in gear 
##                                     7                                     1 
##              Sudden medical condition                               Unknown 
##                                    18                                    93 
##                               Weather 
##                                    38

Okay, given that, let’s graph the four most common causes, setting aside the catch-all Unknown and Other categories.

df_MainCause <- Overview_deaths %>%
  filter(AccidentCause1 %in% c("Alcohol use", "Operator inattention",
                               "Operator inexperience", "Hazardous waters")) %>%
  group_by(AccidentCause1) %>%
  summarise(counts = n())

ggplot(df_MainCause, aes(x = AccidentCause1, y = counts)) +
  geom_col(fill = "dodgerblue") +
  geom_text(aes(label = counts), vjust = -0.3) +
  labs(title = "Primary Accident Cause of Fatalities", y = "Total Number of Deaths", x = "Primary Accident Cause")
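The four causes filtered above are the most frequent ones in the table once Unknown and Other are set aside. Since the goal is to rerun this analysis on new datasets, the selection could be computed rather than typed in. The following is a sketch under that assumption; top_levels is a hypothetical helper and toy data stands in for Overview_deaths.

```r
library(dplyr)

# Hypothetical helper: the n most frequent values of `var`,
# skipping catch-all categories listed in `exclude`.
top_levels <- function(data, var, n = 4, exclude = c("Unknown", "Other")) {
  data %>%
    filter(!({{ var }} %in% exclude)) %>%
    count({{ var }}, name = "counts", sort = TRUE) %>%
    slice_head(n = n)
}

# Toy data standing in for Overview_deaths
toy <- tibble(
  AccidentCause1 = c(rep("Alcohol use", 3), rep("Weather", 2),
                     rep("Unknown", 4), "Sharp turn")
)

top_causes <- top_levels(toy, AccidentCause1, n = 2)
top_causes
```

The result has the same shape as df_MainCause (one row per cause with a counts column), so it could be passed to the same ggplot call.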

To illustrate how ad hoc requests can be handled in code, I have created the example below. Let’s say we want to see the total damages reported by each state.

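# Sum the reported damages within each state (records with no reported damage are dropped)
Total_Damage_State <- Overview %>%
  filter(TotalDamage > 0) %>%
  group_by(State) %>%
  summarise(TotalDamage = sum(TotalDamage))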

# This will quickly inform us of the states with the highest damage amounts recorded
Total_Damage_State_Threshold <- Total_Damage_State %>%
  # filter(TotalDamage >= 10000) %>%
  ggplot(aes(x = State, y = TotalDamage)) +
  geom_col(fill = "dodgerblue") +
  scale_y_continuous(labels = scales::comma)

# Same chart with the states reordered by TotalDamage and axis labels added
Total_Damage_State_Threshold1 <- Total_Damage_State %>%
  # filter(TotalDamage >= 10000) %>%
  ggplot(aes(x = reorder(State, -TotalDamage), y = TotalDamage)) +
  geom_col(fill = "dodgerblue") +
  scale_y_continuous(labels = scales::comma) +
  labs(x = "State (Abbreviation)", y = "Total Damages Reported", title = "Total Damages Reported by State") +
  theme(axis.text.x = element_text(angle = 75, hjust = 1))

Total_Damage_State_Threshold1

Lastly, given that we have robust coordinate data for fatalities in BARD, let’s take a quick look at what this looks like once plotted.