Synopsis

This analysis aims to identify the storm weather events in the United States that have historically caused the most damage to human health and the economy, and to quantify the extent of that damage. It is based on the storm database of the U.S. National Oceanic and Atmospheric Administration (NOAA), which covers events recorded across the United States between 1950 and November 2011.

Raw data

The raw data used is the compressed CSV file downloaded in the Raw Data Extraction section below.

Data Processing

Required R Packages

The analysis relies on the following R packages:

library(tidyverse)
library(ggplot2)
library(ggpubr)
library(RColorBrewer)

Raw Data Extraction

The data was downloaded into the working directory and read into R with the following code:

url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(url, destfile = "weatherdata.csv.bz2", method = "curl")
df0 <- read.csv("weatherdata.csv.bz2", stringsAsFactors=FALSE, header=TRUE)
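
As written, the download is repeated every time the report is re-run. A minimal sketch of a guard that skips the download when the file is already present is shown below. The helper name `fetch_once` and its `downloader` argument are hypothetical and are not part of the original analysis.

```r
# Hypothetical helper (not in the original analysis): download a file only
# if it is not already present, so re-running the report reuses the local copy.
# Extra arguments (for example method = "curl") are passed to the downloader.
fetch_once <- function(url, destfile, ..., downloader = utils::download.file) {
  if (!file.exists(destfile)) {
    downloader(url, destfile = destfile, ...)
  }
  invisible(destfile)
}

# Possible usage with the objects defined above:
# fetch_once(url, "weatherdata.csv.bz2", method = "curl")
```

The `downloader` argument exists only so that the helper can be exercised without network access.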

Grouping and Summing Data for Damage to Health

The data was then grouped by weather event to make inter-event comparison possible. The event type column is first converted to lowercase and stripped of spaces to avoid matching problems. Next, fatalities and injuries are summed for each event type.

df.health <- df0 %>%
        mutate(EVTYPE = tolower(gsub(" ", "", EVTYPE))) %>%
        group_by(EVTYPE) %>%
        select(EVTYPE, FATALITIES, INJURIES) %>%
        summarise_all(list(sum)) %>%
        na.omit() %>%
        arrange(desc(INJURIES)) 
head(df.health)
## # A tibble: 6 x 3
##   EVTYPE        FATALITIES INJURIES
##   <chr>              <dbl>    <dbl>
## 1 tornado             5633    91346
## 2 tstmwind             504     6957
## 3 flood                470     6789
## 4 excessiveheat       1903     6525
## 5 lightning            816     5230
## 6 heat                 937     2100
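
The effect of the lowercase and space-removal step can be illustrated on a few toy labels. The labels below are invented for illustration and the helper name `normalise_evtype` is hypothetical; only the `tolower(gsub(...))` expression is taken from the pipeline above.

```r
# Toy labels (invented for illustration) that differ only in case and spacing.
labels <- c("TSTM WIND", "Tstm Wind", "tstm wind", "FLOOD")

# Same transformation as in the pipeline: drop spaces, then lowercase.
normalise_evtype <- function(x) tolower(gsub(" ", "", x))

# The three wind spellings collapse to a single key.
unique(normalise_evtype(labels))
```

This step only removes differences in case and spacing. Differently named categories, such as heat and excessiveheat in the output above, remain separate rows.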

Grouping Data and Summing for Economic Damage

The same strategy is used to make inter-event comparison possible in terms of economic damage. There are four variables in the data set related to economic damage, two of which are numeric (PROPDMG and CROPDMG) and two of which are character variables (PROPDMGEXP and CROPDMGEXP). The supplementary documentation states that:

“Estimates should be rounded to three significant digits, followed by an alphabetical character signifying the magnitude of the number, i.e., 1.55B for $1,550,000,000. Alphabetical characters used to signify magnitude include ‘K’ for thousands, ‘M’ for millions, and ‘B’ for billions” (p. 12)

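Accordingly, we merge the four variables into two numeric variables, one for damage to property and one for damage to crops, by multiplying each damage figure by the magnitude its code indicates. Only the codes K, M and B are converted; figures with any other code are left unscaled.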

Finally, the resulting damage variables are summed for each event type.

df.econ <- df0 %>%
        mutate(EVTYPE = tolower(gsub(" ", "", EVTYPE))) %>%
        group_by(EVTYPE) %>%
        select(EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP) %>%
        mutate(PROPDMG = ifelse(PROPDMGEXP == "K", 
                                PROPDMG *1000, PROPDMG)) %>%
        mutate(PROPDMG = ifelse(PROPDMGEXP == "M", 
                                PROPDMG *1000000, PROPDMG)) %>%
        mutate(PROPDMG = ifelse(PROPDMGEXP == "B", 
                                PROPDMG *1000000000, PROPDMG)) %>%
        mutate(CROPDMG = ifelse(CROPDMGEXP == "K", 
                                CROPDMG *1000, CROPDMG)) %>%
        mutate(CROPDMG = ifelse(CROPDMGEXP == "M", 
                                CROPDMG *1000000, CROPDMG)) %>%
        mutate(CROPDMG = ifelse(CROPDMGEXP == "B", 
                                CROPDMG *1000000000, CROPDMG)) %>%
        select(EVTYPE, PROPDMG, CROPDMG) %>%
        summarise_all(list(sum)) %>%
        mutate(TOTDMG = CROPDMG + PROPDMG) %>%
        arrange(desc(PROPDMG))
head(df.econ)
## # A tibble: 6 x 4
##   EVTYPE                  PROPDMG    CROPDMG        TOTDMG
##   <chr>                     <dbl>      <dbl>         <dbl>
## 1 flood             144657709807  5661968450 150319678257 
## 2 hurricane/typhoon  69305840000  2607872800  71913712800 
## 3 tornado            56925660790.  414953270  57340614060.
## 4 stormsurge         43323536000        5000  43323541000 
## 5 flashflood         16140862067. 1421317100  17562179167.
## 6 hail               15727367053. 3025537890  18752904943.
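
The magnitude conversion can also be sketched with a lookup vector instead of repeated `ifelse` calls. This is an alternative formulation for illustration, not the code used in the analysis. The names `magnitude` and `scale_damage` are hypothetical and the input values are toy data, apart from 1.55B, which is the example quoted from the documentation.

```r
# Multipliers for the three magnitude codes named in the documentation.
magnitude <- c(K = 1e3, M = 1e6, B = 1e9)

# Hypothetical helper: scale a damage figure by its magnitude code.
# Codes other than K, M or B (including the empty string) leave the
# value unchanged.
scale_damage <- function(value, code) {
  mult <- unname(magnitude[code])  # NA for codes not in the lookup
  mult[is.na(mult)] <- 1
  value * mult
}

# Toy values: 1.55 billion, 25 thousand, 2.5 million, and an uncoded 10.
scale_damage(c(1.55, 25, 2.5, 10), c("B", "K", "M", ""))
```

A lookup keeps the list of recognised codes in one place, which makes it easier to see which codes are handled.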

Results

Harmful Weather Events in terms of Human Health

The harm caused to human health by weather events can take the form of either fatality or injury. It is difficult to create a single variable that captures both, since the two counts should plausibly be weighted differently (a fatality is much graver than an injury), and determining such weights remains controversial. Furthermore, as Figure 1 indicates, the strength of the relationship between fatalities and injuries is not consistent across weather events, although there is, as expected, a positive relationship between the two.

ggplot(df.health, aes(x = log(FATALITIES), y = log(INJURIES), label = EVTYPE)) +
        geom_text(size = 3, nudge_x = 0.5, alpha = 1/2) +
        labs(title = "Figure 1: Relationship between fatalities and injuries\nper type of weather event in the US",
             x = "Log of Total Fatalities (between 1950 and 2011)",
             y = "Log of Total Injuries\n(between 1950 and 2011)",
             subtitle = "There is variability in the strength of the relationship between fatality and\ninjury across different types of storm events. This means some weather\nevents have a larger tendency to cause fatalities whilst others have a\nlarger tendency to generate injury. ") +
        theme_classic()
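
The variability shown in Figure 1 can also be expressed as a simple ratio of injuries to fatalities. The sketch below uses only the six rows printed by `head(df.health)` earlier, re-entered by hand, so it illustrates the idea rather than the full data set. The column name `INJ_PER_FATALITY` is an assumption introduced here.

```r
# The six rows printed by head(df.health) above, re-entered by hand.
health <- data.frame(
  EVTYPE     = c("tornado", "tstmwind", "flood", "excessiveheat", "lightning", "heat"),
  FATALITIES = c(5633, 504, 470, 1903, 816, 937),
  INJURIES   = c(91346, 6957, 6789, 6525, 5230, 2100)
)

# Injuries recorded per fatality: a rough, unweighted descriptive ratio.
health$INJ_PER_FATALITY <- health$INJURIES / health$FATALITIES

# Sort from the most injury-heavy to the most fatality-heavy event type.
health[order(-health$INJ_PER_FATALITY), c("EVTYPE", "INJ_PER_FATALITY")]
```

Within these six rows the ratio ranges from roughly 2 injuries per fatality (heat) to roughly 16 (tornado), which is why a single combined measure is hard to justify.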

Accordingly, the impact of weather events on human health is considered separately for injuries and fatalities. Figure 2 shows the fatality and injury totals for the seven most harmful event types. These seven were selected by merging the top five event types by total fatalities with the top five by total injuries and removing duplicates.

df.top5fatalities <- df.health %>%
        arrange(desc(FATALITIES)) %>%
        select(EVTYPE) %>%
        slice(1:5) %>%
        rowid_to_column("rank2")
df.top5injuries <- df.health %>%
        arrange(desc(INJURIES)) %>%
        select(EVTYPE) %>%
        slice(1:5) %>%
        rowid_to_column("rank1")

df.top5 <- df.health %>%
        # keep event types that appear in either top-five list
        filter(EVTYPE %in% df.top5fatalities$EVTYPE | 
                       EVTYPE %in% df.top5injuries$EVTYPE) %>%
        left_join(df.top5injuries, by = "EVTYPE") %>%
        # events missing from a list get a placeholder rank of 11
        mutate(rank1 = ifelse(is.na(rank1), 11, rank1)) %>%
        left_join(df.top5fatalities, by = "EVTYPE") %>%
        mutate(rank2 = ifelse(is.na(rank2), 11, rank2)) %>%
        # order events by the mean of their two ranks
        mutate(average = (rank1 + rank2)/2) %>%
        arrange(desc(average)) %>%
        rowid_to_column("rank") %>%
        select(rank, EVTYPE, FATALITIES, INJURIES) %>%
        pivot_longer(c(FATALITIES,INJURIES)) 

ggplot(df.top5, aes(x = rank, y = value, fill = EVTYPE)) +
        theme_classic() +
        facet_wrap(~name, scales = "free", drop = TRUE) +
        geom_col(position = "dodge") +
        coord_flip() +
        labs(title = "Figure 2: Top seven types of weather events in terms\nof total historic fatalities or injuries in the US",
             subtitle = "Tornadoes and excessive heat waves stand out as most \nthreatening to human life. The seven weather events were \nselected by merging the top five events with the highest \ntotal injuries with the top five events with the highest total fatalities.",
             y = "Total between 1950 and 2011") +
        theme(axis.title.y = element_blank(),
              axis.text.y = element_blank(),
              axis.ticks.y = element_blank()) +
        guides(fill = guide_legend(title = "Weather Event\nType")) +
        scale_fill_manual(values = brewer.pal(n = 7, "Pastel2"))

Figure 2 indicates that tornadoes and heat waves (heat and excessive heat) have caused by far the highest number of fatalities. Tornadoes also overshadow all other event types in terms of injuries, whereas heat waves have not caused more injuries than most of the other event types on this top seven list.
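
The selection rule behind Figure 2 (the union of two top-n lists) can be sketched in base R. The example below uses only the six rows printed by `head(df.health)` earlier and a top three instead of a top five, so its result is illustrative and is not the list of seven events in the figure. The helper name `top_n_events` is hypothetical.

```r
# The six rows printed by head(df.health) above, re-entered by hand.
health <- data.frame(
  EVTYPE     = c("tornado", "tstmwind", "flood", "excessiveheat", "lightning", "heat"),
  FATALITIES = c(5633, 504, 470, 1903, 816, 937),
  INJURIES   = c(91346, 6957, 6789, 6525, 5230, 2100)
)

# Hypothetical helper: names of the n rows with the largest values in `column`.
top_n_events <- function(df, column, n) {
  head(df$EVTYPE[order(df[[column]], decreasing = TRUE)], n)
}

# Union of the two top-three lists; union() drops the duplicates.
selected <- union(top_n_events(health, "FATALITIES", 3),
                  top_n_events(health, "INJURIES", 3))
selected
```

Because tornado appears in both lists, the union of two lists of three contains five event types, in the same way that two lists of five yield seven in the analysis.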

Harmful Weather Events in terms of Economic Damage

Economic damage is measured here as the sum of property damage and crop damage. Figure 3 shows the ten event types with the highest total, with each bar split into its property and crop components.

df.top10econ <- df.econ %>%
        arrange(desc(TOTDMG)) %>%
        slice(1:10) %>%
        arrange(TOTDMG) %>%
        rowid_to_column("rank")%>%
        pivot_longer(c(PROPDMG, CROPDMG))

df.top10econ$EVTYPE <- reorder(df.top10econ$EVTYPE, df.top10econ$rank) 

# axis breaks, in billions of US $
ylab <- c(50, 100, 150, 200, 250)

ggplot(df.top10econ, aes(x = EVTYPE, y = value, fill = name)) +
        theme_classic() +
        geom_bar(stat = "identity", position = "stack") +
        coord_flip() +
        labs(title = "Figure 3: Top ten types of weather events in terms\nof economic damage in the US",
             subtitle = "Floods and hurricanes stand out as the most threatening to\neconomic prosperity. Furthermore, the top 6 weather types\nprimarily cause property damage rather than crop damage.\nThe ten weather events were selected by ranking event types\nby the sum of crop and property damage.",
             y = "Economic damage caused between\n1950 and 2011 (in billions of US $)",
             x = "Event Type") +
        scale_y_continuous(labels = paste0(ylab, "B"),
                           breaks = 10^9 * ylab) +
        scale_fill_manual(values = rev(brewer.pal(n = 2, "Pastel2")),
                          name = "Type of\nDamage",
                          labels = c("Crop Damage", "Property Damage"))

Figure 3 indicates that floods have caused the most economic damage, followed by hurricanes/typhoons and tornadoes. Most of the damage caused by these event types is property damage rather than crop damage. The event types that have caused primarily crop damage are droughts, river floods and ice storms.
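
Ranking by total damage rather than by property damage alone can change the order of event types. The sketch below uses only the six rows printed by `head(df.econ)` earlier, re-entered by hand as displayed, so it illustrates the effect rather than reproducing Figure 3. The column name `CROP_SHARE` is an assumption introduced here.

```r
# The six rows printed by head(df.econ) above, re-entered by hand as displayed.
econ <- data.frame(
  EVTYPE  = c("flood", "hurricane/typhoon", "tornado",
              "stormsurge", "flashflood", "hail"),
  PROPDMG = c(144657709807, 69305840000, 56925660790,
              43323536000, 16140862067, 15727367053),
  CROPDMG = c(5661968450, 2607872800, 414953270,
              5000, 1421317100, 3025537890)
)
econ$TOTDMG <- econ$PROPDMG + econ$CROPDMG

# Order of event types under the two ranking criteria.
by_property <- econ$EVTYPE[order(econ$PROPDMG, decreasing = TRUE)]
by_total    <- econ$EVTYPE[order(econ$TOTDMG, decreasing = TRUE)]

# Share of each event type's total that is crop damage, in percent.
econ$CROP_SHARE <- round(100 * econ$CROPDMG / econ$TOTDMG, 1)

by_property
by_total
```

In these rows, hail ranks below flash floods on property damage but above them on total damage, because a larger share of hail damage falls on crops.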

Acknowledgements

This project was submitted as part of the Reproducible Research course created by Johns Hopkins University. The two questions answered in this analysis and the links to the raw data were supplied by the course organisers. All aforementioned code and analysis choices are my own.