Reproducible Research: Course Project 2 - Analysis of health and financial implications of weather events across the USA (1951 - 2011)

Synopsis

This report describes the procedure and results of an analysis of the health and financial implications of weather events across the USA between 1951 and 2011.

To begin, Storm Data taken from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) database was unzipped, read in and tidied, keeping only the variables of most importance (EVTYPE, FATALITIES, INJURIES, PROPDMG and CROPDMG). The tidied data was then categorised, creating a new variable (EVCAT) of umbrella categories for the originally logged event types (EVTYPE). The umbrella categories were created by studying the list of distinct event types and grouping them manually with the help of regular expression searches. The data was then summarised to answer the two proposed questions, as follows:

For Question 1, the total numbers of fatalities and injuries were calculated for each event category and plotted as side-by-side bar graphs, in descending order. Among the categories retained for plotting, the top three causes of fatality were Extreme Heat, Fire and Flooding, while the top three causes of injury were Extreme Heat, Flooding and Fire.

For Question 2, the total amounts of property and crop damage were calculated for each event category and plotted as side-by-side bar graphs, showing the damage totals in descending order. Among the categories retained for plotting, the top three causes of property damage were Fire, Flooding and Icy Weather, while the top three causes of crop damage were Fire, Flooding and Drought.

Data Processing

The data was loaded and processed using the vroom package for speed. No preprocessing was done outside this document, beyond the unzipping of the original data file. The dplyr package was also used for its convenient data manipulation functions. If these two packages are not present on the machine where the analysis is being repeated, they should be installed using the install.packages("vroom") and install.packages("dplyr") commands respectively.

# install first if not installed
library("vroom")
library("dplyr")
# mode = "wb" keeps the compressed file intact when downloading on Windows
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", "StormData.csv.bz2", mode = "wb")
data <- vroom("StormData.csv.bz2")
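
The install-if-missing step described above could be automated along the following lines. This is only a sketch: missing_packages() is a hypothetical helper name, not part of vroom or dplyr, and the install.packages() call is left commented out so that nothing is installed unintentionally.

```r
# Hypothetical helper: returns the subset of `pkgs` that is not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("vroom", "dplyr"))
# Uncomment to install whatever is missing:
# if (length(to_install) > 0) install.packages(to_install)
```

Checking with requireNamespace() avoids reinstalling packages that are already available.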

Next, from the original data, only the most important variables were selected for analysis (EVTYPE, FATALITIES, INJURIES, PROPDMG and CROPDMG):

tidy_data <- data %>% select(EVTYPE, FATALITIES, INJURIES, PROPDMG, CROPDMG)

Question 1

From this foundation, we were able to begin answering question 1:

Across the USA, which types of events are most harmful with respect to population health?

For this question we were interested in the FATALITIES and INJURIES variables in the dataset. From the tidy data, we selected only the events which had health implications (i.e. those that resulted in at least one injury or fatality). We then converted any remaining zero values to NA, as follows:

filtered_data_q1 <- tidy_data %>% 
  filter(FATALITIES > 0 | INJURIES > 0) %>%
  mutate(FATALITIES = replace(FATALITIES, FATALITIES == 0, NA)) %>%
  mutate(INJURIES = replace(INJURIES, INJURIES == 0, NA))
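
To see what the filtering step does, here is a small sketch on a made-up data frame (toy_events and its values are invented for illustration, not taken from the NOAA data). It also suggests that the zero-to-NA conversion does not change the totals, since the later sums use na.rm = TRUE.

```r
library(dplyr)

# Invented example rows, not taken from the NOAA data
toy_events <- tibble(
  EVTYPE     = c("HEAT", "FLOOD", "HAIL", "FOG"),
  FATALITIES = c(2, 0, 0, 1),
  INJURIES   = c(0, 5, 0, 3)
)

toy_filtered <- toy_events %>%
  filter(FATALITIES > 0 | INJURIES > 0) %>%                      # drops the HAIL row
  mutate(FATALITIES = replace(FATALITIES, FATALITIES == 0, NA),  # FLOOD fatalities become NA
         INJURIES   = replace(INJURIES, INJURIES == 0, NA))      # HEAT injuries become NA

# Totals are the same with or without the NA conversion
total_fatalities <- sum(toy_filtered$FATALITIES, na.rm = TRUE)
total_injuries   <- sum(toy_filtered$INJURIES, na.rm = TRUE)
```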

From the resulting data, we saw that the event types (EVTYPE) had not been logged in the dataset in a structured way, i.e. they appeared in many different forms, with different spellings, in upper or lower case, and sometimes containing errors. To address this problem, we produced umbrella categories for the event types by manually performing simple regular expression searches on the list of distinct event types.
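
A minimal sketch of that search process is shown here, using an invented vector of event types rather than the real EVTYPE column. The ignore.case argument of grep() is one way to catch upper-case and lower-case variants with a single pattern.

```r
# Invented event types standing in for the real EVTYPE column
toy_evtypes <- c("TSTM WIND", "High Wind", "wind damage", "FLASH FLOOD",
                 "Heavy Snow", "High Wind")

# List the distinct event types once, then search them
distinct_types <- unique(toy_evtypes)
wind_like <- grep("wind", distinct_types, ignore.case = TRUE, value = TRUE)
wind_like
```

Each search result can then be inspected by eye before deciding which umbrella category the matches belong to.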

NOTE: In our analysis, there may be room for improvement in how the umbrella categories are grouped. For example, it was not clear whether events such as tornadoes should belong in a broader category called ‘wind’. In the end we settled on 21 categories that we felt allowed trends to be seen but still captured some of the nuances within the data.

categorised_data_q1 <- filtered_data_q1 %>% 
  mutate(EVCAT = 
    case_when(
      grepl("TORNADO|HURRICANE|TYPHOON|Hurricane|Whirlwind|TORNDAO|GUSTNADO", EVTYPE) == TRUE ~ "Tornadoes",
      grepl("ICE|ICY|FREEZING|Freezing|FREEZE|Freeze|FROST|Frost", EVTYPE) == TRUE ~ "Icy Weather",
      grepl("TSUNAMI", EVTYPE) == TRUE ~ "Tsunamis",
      grepl("WIND|Wind|wind", EVTYPE) == TRUE ~ "Wind",
      grepl("SNOW|Snow|snow|BLIZZARD|HAIL|SLEET|WINT|Wintry", EVTYPE) == TRUE ~ "Winter Weather", 
      # "WINT" matches "WINTER WEATHER" and "WINTER STORM/S"
      grepl("FIRE|SMOKE|ASH", EVTYPE) == TRUE ~ "Fire",
      grepl("HEAT|Heat|WARM", EVTYPE) == TRUE ~ "Extreme Heat",
      grepl("RAIN|Rainfall|SHOWER|PRECIPITATION|Precipitation", EVTYPE) == TRUE ~ "Rain",
      grepl("FOG", EVTYPE) == TRUE ~ "Fog",
      grepl("SURF", EVTYPE) == TRUE ~ "Surf",
      grepl("LIGHTNING|LIGNTNING|LIGHTING", EVTYPE) == TRUE ~ "Lightning",
      grepl("FLOOD|Flood|flood|FLD", EVTYPE) == TRUE ~ "Flooding",
      grepl("COLD|Cold|EXPOSURE|Exposure|LOW TEMPERATURE|HYPOTHERMIA", EVTYPE) == TRUE ~ "Extreme Cold",
      grepl("AVALANCHE|AVALANCE|LANDSLIDE|Mudslide|MUD SLIDE|MUDSLIDES|MUDSLIDE|Landslump|ROCK SLIDE|Erosion", EVTYPE) == TRUE ~ "Landslides",
      grepl("DUST|Dust", EVTYPE) == TRUE ~ "Dust Storms",
      grepl("THUNDERSTORM|THUNDERSTORMW|TROPICAL", EVTYPE) == TRUE ~ "Thunderstorms",
      grepl("SEA|WAVE|Surf|DROWNING|RISING|SURGE|HIGH SWELLS|HIGH WATER|TIDE", EVTYPE) == TRUE ~ "Rough/High Seas or Waves",
      grepl("DROUGHT", EVTYPE) == TRUE ~ "Drought",
      grepl("RIP CURRENT", EVTYPE) == TRUE ~ "Rip Currents",
      grepl("MARINE|Marine", EVTYPE) == TRUE ~ "Marine Accidents",
      grepl("COASTAL|Coastal", EVTYPE) == TRUE ~ "Coastal Storms"
    )
) %>% mutate(EVCAT = as.factor(EVCAT))
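
One property of case_when() worth keeping in mind is that it assigns the first condition that matches, so the order of the rules affects the result. The sketch below uses invented event names and only two rules to illustrate this; it is not the full rule set used above.

```r
library(dplyr)

# Invented examples; an event mentioning both a thunderstorm and wind matches two rules
toy_events <- c("THUNDERSTORM WIND", "THUNDERSTORM", "HIGH WIND")

# "WIND" rule listed first, as in the categorisation above
wind_first <- case_when(
  grepl("WIND", toy_events) ~ "Wind",
  grepl("THUNDERSTORM", toy_events) ~ "Thunderstorms"
)

# Same rules in the opposite order
storm_first <- case_when(
  grepl("THUNDERSTORM", toy_events) ~ "Thunderstorms",
  grepl("WIND", toy_events) ~ "Wind"
)
```

Here the first element ends up in "Wind" under wind_first but in "Thunderstorms" under storm_first, so the order of the rules is itself a categorisation decision.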

The umbrella categories can be shown as follows:

categories_q1 <- categorised_data_q1 %>% pull(EVCAT)
levels(categories_q1)

##  [1] "Coastal Storms"           "Drought"                 
##  [3] "Dust Storms"              "Extreme Cold"            
##  [5] "Extreme Heat"             "Fire"                    
##  [7] "Flooding"                 "Fog"                     
##  [9] "Icy Weather"              "Landslides"              
## [11] "Lightning"                "Marine Accidents"        
## [13] "Rain"                     "Rip Currents"            
## [15] "Rough/High Seas or Waves" "Surf"                    
## [17] "Thunderstorms"            "Tornadoes"               
## [19] "Tsunamis"                 "Wind"                    
## [21] "Winter Weather"

It is also possible to see which distinct events fall under each of these categories, as follows (example for the umbrella category “Fire”). Note that the output includes flash flood events, because the "ASH" pattern also matches "FLASH":

categorised_data_q1 %>% filter(EVCAT == "Fire") %>% distinct(EVTYPE) %>% pull(EVTYPE)

##  [1] "FLASH FLOOD"          "WILD FIRES"           "FLOOD/FLASH FLOOD"   
##  [4] "FLASH FLOODING"       "FLASH FLOOD/FLOOD"    "FLASH FLOODING/FLOOD"
##  [7] "FLASH FLOODS"         "WILD/FOREST FIRE"     "BRUSH FIRE"          
## [10] "WILDFIRE"
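
The flash flood entries in this list appear because the pattern "ASH" is also found inside "FLASH". A possible refinement, sketched here on invented event names and not applied to the results in this report, is to anchor the pattern with a word boundary (\\b) so that it only matches "ASH" at the start of a word.

```r
# Invented event names for illustration
toy_events <- c("FLASH FLOOD", "VOLCANIC ASH", "WILDFIRE")

# Pattern as used above: "ASH" matches inside "FLASH"
loose_match <- grepl("FIRE|SMOKE|ASH", toy_events)

# Assumed refinement: require a word boundary before "ASH"
strict_match <- grepl("FIRE|SMOKE|\\bASH", toy_events, perl = TRUE)

data.frame(toy_events, loose_match, strict_match)
```

With the stricter pattern, "FLASH FLOOD" would fall through to the later "FLOOD" rule instead of being labelled as Fire.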

Now only a few difficult-to-categorise events remain, which are given an EVCAT value of NA and which we will subsequently ignore in our analysis:

uncategorised_events_q1 <- categorised_data_q1 %>% filter(is.na(EVCAT)) %>% distinct(EVTYPE) %>% pull(EVTYPE)
uncategorised_events_q1

## [1] "HIGH"           "FUNNEL CLOUD"   "DRY MICROBURST" "WATERSPOUT"    
## [5] "GLAZE"          "MIXED PRECIP"   "OTHER"

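Once we were happy with the categorisation procedure, we summed the fatalities and injuries within each umbrella category and kept ten categories, as follows. Note that slice(1:10) selects rows by position, and the summary is ordered by EVCAT level (alphabetically), so these are the first ten categories alphabetically rather than the ten with the largest totals: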

summarised_data_q1 <- categorised_data_q1 %>% 
  group_by(EVCAT) %>% 
  summarise(FATALITIES = sum(FATALITIES, na.rm = TRUE), INJURIES = sum(INJURIES, na.rm = TRUE)) %>%
  filter(!is.na(EVCAT)) %>% # remove uncategorised events
  slice(1:10) # keeps the first ten rows in EVCAT (alphabetical) order, not the ten largest totals
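
For comparison, the sketch below contrasts slice(), which selects rows by position, with dplyr's slice_max(), which selects rows by value. The category names and totals are invented, and this alternative is not what produced the results reported here.

```r
library(dplyr)

# Invented summary table; rows are in alphabetical order of EVCAT
toy_summary <- tibble(
  EVCAT      = factor(c("Alpha", "Bravo", "Charlie", "Delta")),
  FATALITIES = c(5, 50, 1, 500)
)

# By position: the first two rows, whatever their totals
first_two <- toy_summary %>% slice(1:2) %>% pull(EVCAT) %>% as.character()

# By value: the two rows with the largest FATALITIES
top_two <- toy_summary %>% slice_max(FATALITIES, n = 2) %>% pull(EVCAT) %>% as.character()
```

In this toy table, first_two contains Alpha and Bravo, whereas top_two contains Delta and Bravo.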

The filter() line removes events that don’t belong to any category (i.e. the uncategorised_events_q1 list we saw previously).

Please see the results section for a bar plot of this data.

Question 2

We will now move on to question 2, following the same procedure of filtering, categorising and summarising. The question is as follows:

Across the USA, which types of events have the greatest economic consequences?

To address this question, we first filtered the data to keep only the events that caused property or crop damage (i.e. those with PROPDMG or CROPDMG > 0). We then converted any remaining zero values to NA.

filtered_data_q2 <- tidy_data %>% 
  filter(PROPDMG > 0 | CROPDMG > 0)  %>%
  mutate(PROPDMG = replace(PROPDMG, PROPDMG == 0, NA)) %>%
  mutate(CROPDMG = replace(CROPDMG, CROPDMG == 0, NA))
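
One caveat is that, in the NOAA Storm Data, the PROPDMG and CROPDMG values are accompanied by magnitude columns (PROPDMGEXP and CROPDMGEXP) holding codes such as K, M and B. These columns were not among those selected above, so the totals in this report are sums of the raw values. If the exponent columns were retained, the conversion could be sketched as follows. The toy rows and the damage_in_dollars() helper are hypothetical, and treating any unrecognised code as a multiplier of 1 is an assumption made for this sketch.

```r
# Invented rows; the EXP codes follow the K/M/B convention (thousand, million, billion)
toy_damage <- data.frame(
  PROPDMG    = c(25, 2.5, 1, 10),
  PROPDMGEXP = c("K", "M", "B", ""),
  stringsAsFactors = FALSE
)

multipliers <- c(K = 1e3, M = 1e6, B = 1e9)

# Hypothetical helper: scales a damage value by its magnitude code
damage_in_dollars <- function(value, exponent) {
  mult <- multipliers[toupper(exponent)]
  mult[is.na(mult)] <- 1   # assumption: unknown or empty codes mean "no scaling"
  unname(value * mult)
}

toy_damage$PROPDMG_USD <- damage_in_dollars(toy_damage$PROPDMG, toy_damage$PROPDMGEXP)
```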

Then we performed the same categorisation as for question 1, but on the question 2 data:

categorised_data_q2 <- filtered_data_q2 %>%
  mutate(EVCAT = 
    case_when(
      grepl("TORNADO|HURRICANE|TYPHOON|Hurricane|Whirlwind|TORNDAO|GUSTNADO", EVTYPE) == TRUE ~ "Tornadoes",
      grepl("ICE|ICY|FREEZING|Freezing|FREEZE|Freeze|FROST|Frost", EVTYPE) == TRUE ~ "Icy Weather",
      grepl("TSUNAMI", EVTYPE) == TRUE ~ "Tsunamis",
      grepl("WIND|Wind|wind", EVTYPE) == TRUE ~ "Wind",
      grepl("SNOW|Snow|snow|BLIZZARD|HAIL|SLEET|WINT|Wintry", EVTYPE) == TRUE ~ "Winter Weather", 
      # "WINT" matches "WINTER WEATHER" and "WINTER STORM/S"
      grepl("FIRE|SMOKE|ASH", EVTYPE) == TRUE ~ "Fire",
      grepl("HEAT|Heat|WARM", EVTYPE) == TRUE ~ "Extreme Heat",
      grepl("RAIN|Rainfall|SHOWER|PRECIPITATION|Precipitation", EVTYPE) == TRUE ~ "Rain",
      grepl("FOG", EVTYPE) == TRUE ~ "Fog",
      grepl("SURF", EVTYPE) == TRUE ~ "Surf",
      grepl("LIGHTNING|LIGNTNING|LIGHTING", EVTYPE) == TRUE ~ "Lightning",
      grepl("FLOOD|Flood|flood|FLD", EVTYPE) == TRUE ~ "Flooding",
      grepl("COLD|Cold|EXPOSURE|Exposure|LOW TEMPERATURE|HYPOTHERMIA", EVTYPE) == TRUE ~ "Extreme Cold",
      grepl("AVALANCHE|AVALANCE|LANDSLIDE|Mudslide|MUD SLIDE|MUDSLIDES|MUDSLIDE|Landslump|ROCK SLIDE|Erosion", EVTYPE) == TRUE ~ "Landslides",
      grepl("DUST|Dust", EVTYPE) == TRUE ~ "Dust Storms",
      grepl("THUNDERSTORM|THUNDERSTORMW|TROPICAL", EVTYPE) == TRUE ~ "Thunderstorms",
      grepl("SEA|WAVE|Surf|DROWNING|RISING|SURGE|HIGH SWELLS|HIGH WATER|TIDE", EVTYPE) == TRUE ~ "Rough/High Seas or Waves",
      grepl("DROUGHT", EVTYPE) == TRUE ~ "Drought",
      grepl("RIP CURRENT", EVTYPE) == TRUE ~ "Rip Currents",
      grepl("MARINE|Marine", EVTYPE) == TRUE ~ "Marine Accidents",
      grepl("COASTAL|Coastal", EVTYPE) == TRUE ~ "Coastal Storms"
    )
) %>% mutate(EVCAT = as.factor(EVCAT))
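
Since the same rules are applied to both questions, they could be wrapped in a function and reused, which would keep the two categorisations in step. The sketch below uses a deliberately abbreviated rule set and invented data; categorise_events() and add_category() are hypothetical names.

```r
library(dplyr)

# Abbreviated, illustrative rule set; the full list of patterns is given above
categorise_events <- function(evtype) {
  case_when(
    grepl("TORNADO|HURRICANE", evtype) ~ "Tornadoes",
    grepl("FLOOD|FLD", evtype, ignore.case = TRUE) ~ "Flooding",
    grepl("DROUGHT", evtype) ~ "Drought"
  )
}

# Adds EVCAT to any data frame that has an EVTYPE column
add_category <- function(df) {
  df %>% mutate(EVCAT = as.factor(categorise_events(EVTYPE)))
}

# Invented data standing in for the question 1 and question 2 tables
toy_q1 <- tibble(EVTYPE = c("TORNADO", "Flash Flood"), FATALITIES = c(3, 1))
toy_q2 <- tibble(EVTYPE = c("DROUGHT", "SEICHE"), PROPDMG = c(10, 2))

categorised_toy_q1 <- add_category(toy_q1)
categorised_toy_q2 <- add_category(toy_q2)
```

Events that match none of the rules, such as "SEICHE" here, receive an EVCAT of NA, as in the analysis above.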

The umbrella categories are the same as for question 1, so we have not reproduced them again here. However, the uncategorised events differ slightly (there are more of them):

uncategorised_events_q2 <- categorised_data_q2 %>% filter(is.na(EVCAT)) %>% distinct(EVTYPE) %>% pull(EVTYPE)
uncategorised_events_q2

##  [1] "WATERSPOUT"         "SEVERE TURBULENCE"  "APACHE COUNTY"     
##  [4] "FUNNEL CLOUD"       "WATERSPOUT-"        "COOL AND WET"      
##  [7] "EXCESSIVE WETNESS"  "GLAZE"              "MICROBURST"        
## [10] "URBAN AND SMALL"    "HEAVY MIX"          "DRY MICROBURST"    
## [13] "TSTMW"              "?"                  "URBAN/SMALL STREAM"
## [16] "HEAVY SWELLS"       "URBAN SMALL"        "Other"             
## [19] "Glaze"              "DOWNBURST"          "Microburst"        
## [22] "OTHER"              "DAM BREAK"          "WET MICROBURST"    
## [25] "SEICHE"             "LANDSPOUT"

Despite the longer list, we were satisfied that these items did not fit clearly into any of our established categories.
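
One way to check how much the uncategorised events matter is to compute the share of the total that they account for. The sketch below does this on invented values. The same idea could be applied to categorised_data_q2, although that check was not part of the original analysis.

```r
# Invented rows: NA in EVCAT marks an uncategorised event
toy_categorised <- data.frame(
  EVCAT   = c("Fire", NA, "Flooding", NA),
  PROPDMG = c(100, 5, 200, 15)
)

uncategorised_total <- sum(toy_categorised$PROPDMG[is.na(toy_categorised$EVCAT)])
uncategorised_share <- uncategorised_total / sum(toy_categorised$PROPDMG)
uncategorised_share
```

A small share would support the decision to ignore these events. A large one would suggest revisiting the rules.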

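Next we summarised the total property and crop damage amounts by event category, keeping ten categories. As for question 1, slice(1:10) keeps the first ten rows in EVCAT (alphabetical) order rather than the ten largest totals: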

summarised_data_q2 <- categorised_data_q2 %>% 
  group_by(EVCAT) %>% 
  summarise(PROPDMG = sum(PROPDMG, na.rm = TRUE), CROPDMG = sum(CROPDMG, na.rm = TRUE)) %>%
  filter(!is.na(EVCAT)) %>% # remove uncategorised events
  slice(1:10) # keeps the first ten rows in EVCAT (alphabetical) order, not the ten largest totals

See the results section for a bar plot of this data.

Results

Here we present two sets of barplots, one addressing the question of health implications and the other addressing financial implications.

Question 1

Across the USA, which types of events are most harmful with respect to population health?

library("ggplot2")
library("ggpubr") # for ggarrange
# NOTE: reorder ensures graphs are displayed in descending order of fatalities/injuries
xlabel <- "Weather Category"
fatalities_plot <- ggplot(summarised_data_q1, aes(x = reorder(EVCAT, FATALITIES), y = FATALITIES)) +
  geom_col(fill = "red") + 
  xlab(xlabel) + 
  ylab("No of Fatalities") + 
  coord_flip()
injuries_plot <- ggplot(summarised_data_q1, aes(x = reorder(EVCAT, INJURIES), y = INJURIES)) + 
  geom_col(fill = "blue") + 
  xlab(xlabel) + 
  ylab("No of Injuries") + 
  coord_flip()
figure_q1 <- ggarrange(NULL, NULL, fatalities_plot, injuries_plot, ncol = 2, nrow = 2, heights = c(1,20))
# NOTE: NULLs here leave space where title can then be annotated
title_text <- "Health implications from weather events across the USA (1951 - 2011)"
annotate_figure(figure_q1, top = text_grob(title_text, size = 16))
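
The ordering of the bars comes from reorder(), which sorts the levels of a factor by a second variable in ascending order. After coord_flip(), the last level is drawn at the top, which gives the descending appearance. A small sketch with invented values:

```r
# Invented categories and totals
toy_cat    <- factor(c("Fire", "Fog", "Drought"))
toy_totals <- c(30, 5, 12)

levels(toy_cat)                       # default: alphabetical
levels(reorder(toy_cat, toy_totals))  # ascending by total: smallest first, largest last
```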

In addition, we can programmatically pull out the top three events that cause the most fatalities:

top_three_fatality_causes <- summarised_data_q1 %>% arrange(desc(FATALITIES)) %>% pull(EVCAT)
top_three_fatality_causes <- as.character(top_three_fatality_causes)[1:3]
top_three_fatality_causes

## [1] "Extreme Heat" "Fire"         "Flooding"

And likewise for those that cause the most injuries:

top_three_injury_causes <- summarised_data_q1 %>% arrange(desc(INJURIES)) %>% pull(EVCAT)
top_three_injury_causes <- as.character(top_three_injury_causes)[1:3]
top_three_injury_causes

## [1] "Extreme Heat" "Flooding"     "Fire"

This shows that, among the ten categories retained in the summary, Extreme Heat, Fire and Flooding cause the most deaths, whilst Extreme Heat, Flooding and Fire cause the most injuries.

Question 2

Across the USA, which types of events have the greatest economic consequences?

“Financial implications of weather events across the USA (1951 - 2011)”

library("ggplot2")
library("ggpubr") # for ggarrange
# NOTE: reorder ensures graphs are displayed in descending order of property/crop damage
xlabel <- "Weather Category"
propdmg_plot <- ggplot(summarised_data_q2, aes(x = reorder(EVCAT, PROPDMG), y = PROPDMG)) +
  geom_col(fill = "red") + 
  xlab(xlabel) + 
  ylab("Property damage (summed PROPDMG values)") + 
  coord_flip()
cropdmg_plot <- ggplot(summarised_data_q2, aes(x = reorder(EVCAT, CROPDMG), y = CROPDMG)) + 
  geom_col(fill = "blue") + 
  xlab(xlabel) + 
  ylab("Crop damage (summed CROPDMG values)") + 
  coord_flip()
figure_q2 <- ggarrange(NULL, NULL, propdmg_plot, cropdmg_plot, ncol = 2, nrow = 2, heights = c(1,20))
# NOTE: NULLs here leave space where title can then be annotated
title_text <- "Financial implications of weather events across the USA (1951 - 2011)"
annotate_figure(figure_q2, top = text_grob(title_text, size = 16))

In addition, we can programmatically pull out the top three categories that cause the most property damage:

top_three_propdmg_causes <- summarised_data_q2 %>% arrange(desc(PROPDMG)) %>% pull(EVCAT)
top_three_propdmg_causes <- as.character(top_three_propdmg_causes)[1:3]
top_three_propdmg_causes

## [1] "Fire"        "Flooding"    "Icy Weather"

And likewise for those that cause the most crop damage:

top_three_cropdmg_causes <- summarised_data_q2 %>% arrange(desc(CROPDMG)) %>% pull(EVCAT)
top_three_cropdmg_causes <- as.character(top_three_cropdmg_causes)[1:3]
top_three_cropdmg_causes

## [1] "Fire"     "Flooding" "Drought"

This shows that, among the ten categories retained in the summary, Fire, Flooding and Icy Weather cause the most property damage, whilst Fire, Flooding and Drought cause the most crop damage.

We hope these results provide food for thought that will help inform policy on future investment in weather protection programmes.

Graham Booth

22/08/2021
