Summary

We explore U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. The targeted database lists characteristics of major storms and weather events in the United States. Our analysis shows that temperature extremes (e.g. extreme heat, winter weather mix) and severe storm events (e.g. hurricane/typhoon) cause the most damage, both regarding population health and material damage. Therefore, any policymaking should adequately allocate available resources.

Data Processing

Data were obtained from Coursera’s Reproducible Research webpage on the 21st June 2018 as a comma-separated-value file compressed via the bzip2 algorithm (https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2). Required CRAN packages were loaded into R.

url = c("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2")
dest = c("storms.csv.bz2")
# download.file(url, dest)  #uncomment if you wish to download data [47 Mb]

library(data.table)
library(dplyr)
library(ggplot2)
library(ggrepel)
library(magrittr)
library(R.utils)
library(reshape2)
library(stringr)

Reading in the data

The data were uncompressed and loaded into R using fread() function of the data.table package, which greatly expedited processing of the dataset.

bunzip2("storms.csv.bz2", remove = FALSE, skip = TRUE)
storms <- fread("storms.csv") 

Data transformation

The following section will detailly describe data transformations that were performed to estimate the effect of event types on population health. Then, the document will present the code required for estimating the impact of event types on the economy, but with much scantier description.

Data transformation for estimating the effect of event types on population health

Relevant data were subsetted. Variable names and names of event types were converted to lower case. Event types with an associated number of fatalities and number of injuries were subsetted and transformed according to the “tidy data” philosophy.

health.subset <- storms %>% 
    rename_all(tolower) %>% 
    mutate(evtype = tolower(evtype)) %>%
    select(evtype, fatalities, injuries) %>%
    melt(variable.name = "effect",
         value.name = "casualty",
         id.vars = "evtype")

Exploratory analysis showed that majority of the recorded events (> 98%) did not affect population health, i.e. they did not produce injuries or fatalities. More than half of the remaining events counted just a single casualty.

health.subset %>% filter(casualty == 0) %>% nrow(.)/nrow(health.subset)
## [1] 0.9863803
health.subset %>% filter(casualty == 1) %>% nrow(.)/nrow(health.subset)
## [1] 0.007074167
health.subset %>% filter(casualty >1) %>% nrow(.)/nrow(health.subset)
## [1] 0.006545517

Therefore, we constructed special dataframe which indicated whether one event (TRUE) or more events (FALSE) within each event type caused damage.

casualties <- health.subset %>%
    filter(casualty != 0) %>% 
    group_by(evtype, effect) %>%
    summarise(events.with.casualties = n()) %>%
    mutate(events.with.casualties.is.1 = if_else(events.with.casualties == 1, TRUE, FALSE))

We created a new dataframe ‘health’ which summarised events based on their type. For each type of event, it calculated the mean number of injuries and fatalities (mean), as well as the total number of injuries and fatalities (sum). We also calculated the 90th quantile to exclude the effect of outliers (uncommonly disruptive events/storms). Median and maximal values for each event type failed to uncover new insights (not shown). Created dataframe also included information on how often event types cause injuries and fatalities (proportion).

Unused dataframes were deleted.

health <- health.subset %>% 
    group_by(evtype, effect) %>%
    summarise(mean = mean(casualty), 
              median = median(casualty),
              max = max(casualty),
              sum = sum(casualty),
              q90 = quantile(casualty, 0.9),
              proportion = 1 - mean(!casualty, na.rm = TRUE)) %>%
    left_join(., casualties, by = c("evtype", "effect"))

rm(casualties, health.subset)

Generation of labels

For clarity, figures that describe the analysis will include labelled points for the most disruptive event types. We define these labels in additional columns of the ‘health’ dataframe.

health %<>% 
    mutate(mean.label = case_when(mean > 4 & events.with.casualties.is.1 == FALSE ~ evtype)) %>%
    mutate(sum.label = case_when((sum > 1500 & events.with.casualties.is.1 == FALSE) | (proportion > 0.5 & proportion < 1 & sum > 100 & events.with.casualties.is.1 == FALSE) ~ evtype)) %>%
    mutate(q90.label = case_when(q90 > 8 & events.with.casualties.is.1 == FALSE ~ evtype))

multiplot function

We estimate of event types on population health on a three-facet figure generated in ggplot2, with the help of ‘multiplot()’ function (adopted from http://www.cookbook-r.com/Graphs/Multiple_graphs_on_one_page_(ggplot2)/).

multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
    library(grid)
    
    # Make a list from the ... arguments and plotlist
    plots <- c(list(...), plotlist)
    
    numPlots = length(plots)
    
    # If layout is NULL, then use 'cols' to determine layout
    if (is.null(layout)) {
        # Make the panel
        # ncol: Number of columns of plots
        # nrow: Number of rows needed, calculated from # of cols
        layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
                         ncol = cols, nrow = ceiling(numPlots/cols))
    }
    
    if (numPlots==1) {
        print(plots[[1]])
        
    } else {
        # Set up the page
        grid.newpage()
        pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))
        
        # Make each plot, in the correct location
        for (i in 1:numPlots) {
            # Get the i,j matrix positions of the regions that contain this subplot
            matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))
            
            print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                            layout.pos.col = matchidx$col))
        }
    }
}

Generation of figures

The following code generates three ggplot2 facets that show which event types most commonly produce casualties and how many casualties they produce.

## plot health data: mean
plot.health.mean <- ggplot(health, aes(proportion, mean)) + 
    geom_point(aes(colour = events.with.casualties.is.1, shape = effect, size = effect)) +
    geom_rug() + 
    geom_text_repel(aes(label = mean.label),
                    box.padding   = 0.4, 
                    point.padding = 0.5,
                    vjust = 1,
                    nudge_x = 0.05,
                    segment.color = "grey50") +
    labs(title = "Mean") +
    labs(x = "Proportion of events with casualties") + 
    labs(y = "# of casualties") +
    theme(legend.position = "none") +
    scale_colour_manual(name  = "# of events with casualties",
                        values = c("indianred", alpha("indianred", 0.2), alpha("indianred", 0.1)),
                        labels = c(">1", "1", "0")) +
    scale_shape_manual(values = c(16, 1),
                       labels = c("fatalities", "injuries")) +
    scale_size_manual(values = c(4, 3))

## plot health data: sum
plot.health.sum <- ggplot(health, aes(proportion, log10(sum))) + 
    geom_point(aes(colour = events.with.casualties.is.1, shape = effect, size = effect)) +
    geom_rug() + 
    geom_text_repel(aes(label = sum.label),
                    box.padding   = 0.4, 
                    point.padding = 0.5,
                    hjust = 0,
                    segment.color = "grey50",
                    nudge_x = 0.1) +
    labs(title = "Sum per type of event") +
    labs(x = "Proportion of events with casualties") + 
    labs(y = "Log10 of total casualties") +
    theme(legend.position = "none") +
    scale_colour_manual(name  = "# of events with casualties",
                        values = c("indianred", alpha("indianred", 0.2), alpha("indianred", 0.1)),
                        labels = c(">1", "1", "0")) +
    scale_shape_manual(values = c(16, 1),
                       labels = c("fatalities", "injuries")) +
    scale_size_manual(values = c(4, 3))

## plot health data: q90
plot.health.q90 <- ggplot(health, aes(proportion, q90)) + 
    geom_point(aes(colour = events.with.casualties.is.1, shape = effect, size = effect)) +
    geom_rug() + 
    geom_text_repel(aes(label = q90.label),
                    box.padding   = 0.4, 
                    point.padding = 0.5,
                    hjust = 0,
                    vjust = 1,
                    segment.color = "grey50") +
    labs(title = "90th Quantile") +
    labs(x = "Proportion of events with casualties") + 
    labs(y = "# of casualties") +
    theme(legend.position = "bottom") +
    scale_colour_manual(name  = "# of events with casualties",
                        values = c("indianred", alpha("indianred", 0.2), alpha("indianred", 0.1)),
                        labels = c(">1", "1", "0")) +
    scale_shape_manual(values = c(16, 1),
                       labels = c("fatalities", "injuries")) +
    scale_size_manual(values = c(4, 3)) +
    ylim(c(0,50))

Data transformation for estimating effect of event types on economy

Relevant data were once again subsetted and transformed according to the “tidy data” philosophy. Next, we generated new variables that were much more suitable for further analysis (e.g. we multiplied propdmg and propdmgexp cells to obtain final amount of damage generated per event).

damage.subset <- storms %>% 
    rename_all(tolower) %>% 
    mutate(evtype = tolower(evtype)) %>%
    select(evtype, propdmg, propdmgexp, cropdmg, cropdmgexp) %>%
    mutate(propdmgexp = str_replace_all(propdmgexp, c("K" = "1000", 
                                                      "M" = "1000000",
                                                      "B" = "1000000000",
                                                      "^$" = "0"))) %>%
    mutate(cropdmgexp = str_replace_all(cropdmgexp, c("K" = "1000", 
                                                      "M" = "1000000",
                                                      "B" = "1000000000",
                                                      "^$" = "0"))) %>%
    mutate(propdmg = propdmg * as.numeric(propdmgexp)) %>%
    mutate(cropdmg = cropdmg * as.numeric(cropdmgexp)) %>%
    select(evtype, property = propdmg, crop = cropdmg) %>% 
    melt(variable.name = "economy",
         value.name = "money",
         id.vars = "evtype")

Exploratory analysis showed that majority of the recorded events (> 85%) did not produce any damage.

damage.subset %>% filter(money == 0) %>% nrow(.)/nrow(damage.subset)
## [1] 0.8553758
damage.subset %>% filter(money >0) %>% nrow(.)/nrow(damage.subset)
## [1] 0.1445926

Once again, we constructed special dataframe which indicated whether one event (TRUE) or more events (FALSE) within each event type caused damage.

money <- damage.subset %>%
    filter(money != 0) %>% 
    group_by(evtype, economy) %>%
    summarise(events.with.damage = n()) %>%
    mutate(events.with.damage.is.1 = if_else(events.with.damage == 1, TRUE, FALSE))

We created a new dataframe ‘damage’ which summarised events based on their type, and we deleted unused dataframes.

damage <- damage.subset %>% 
    group_by(evtype, economy) %>%
    summarise(mean = mean(money, na.rm = TRUE), 
              median = median(money, na.rm = TRUE),
              max = max(money, na.rm = TRUE),
              sum = sum(money, na.rm = TRUE),
              q90 = quantile(money, 0.9, na.rm = TRUE),
              proportion = 1 - mean(!money, na.rm = TRUE)) %>%
    left_join(., money, by = c("evtype", "economy"))

rm(money, damage.subset)

Generation of labels

damage %<>% 
    mutate(mean.label = case_when(mean > 10^7.8 & events.with.damage.is.1 == FALSE ~ evtype)) %>%
    mutate(sum.label = case_when(sum > 10^9.8 & events.with.damage.is.1 == FALSE ~ evtype)) %>%
    mutate(q90.label = case_when(q90 > 10^7.8 & events.with.damage.is.1 == FALSE ~ evtype))

Generation of figures

The following code generates three ggplot2 facets that show which how often event types produce damage.

## plot damage data: mean
plot.damage.mean <- ggplot(damage, aes(proportion, log10(mean))) + 
    geom_point(size = 3, aes(colour = events.with.damage.is.1, shape = economy)) +
    geom_rug() + 
    geom_text_repel(aes(label = mean.label),
                    box.padding   = 0.5,
                    point.padding = 0.5,
                    segment.color = "grey50") +
    labs(title = "Mean") +
    labs(x = "Proportion of events that ended in damage") + 
    labs(y = "Log10 of USD") +
    theme(legend.position = "none") +
    scale_colour_manual(name  = "# of events with damage",
                        values = c("darkorange", alpha("darkorange", 0.2), alpha("darkorange", 0.1)),
                        labels = c(">1", "1", "0")) +
    scale_shape_manual(values = c(16, 1),
                       labels = c("property", "crop"))


## plot damage data: sum
plot.damage.sum <- ggplot(damage, aes(proportion, log10(sum))) + 
    geom_point(size = 3, aes(colour = events.with.damage.is.1, shape = economy)) +
    geom_rug() + 
    geom_text_repel(aes(label = sum.label),
                    box.padding   = 0.5,
                    point.padding = 0.5,
                    segment.color = "grey50") +
    labs(title = "Sum per type of event") +
    labs(x = "Proportion of events that ended in damage") + 
    labs(y = "Log10 of USD") +
    theme(legend.position = "none") +
    scale_colour_manual(name  = "# of events with damage",
                        values = c("darkorange", alpha("darkorange", 0.2), alpha("darkorange", 0.1)),
                        labels = c(">1", "1", "0")) +
    scale_shape_manual(values = c(16, 1),
                       labels = c("property", "crop"))


## plot damage data: q90
plot.damage.q90 <- ggplot(damage, aes(proportion, log10(q90))) + 
    geom_point(size = 3, aes(colour = events.with.damage.is.1, shape = economy)) +
    geom_rug() + 
    geom_text_repel(aes(label = q90.label),
                    box.padding   = 0.5,
                    point.padding = 0.5,
                    segment.color = "grey50") +
    labs(title = "90th Quantile") +
    labs(x = "Proportion of events that ended in damage") + 
    labs(y = "Log10 of USD") +
    theme(legend.position = "bottom") +
    scale_colour_manual(name  = "# of events with damage",
                        values = c("darkorange", alpha("darkorange", 0.2), alpha("darkorange", 0.1)),
                        labels = c(">1", "1", "0")) +
    scale_shape_manual(values = c(16, 1),
                       labels = c("property", "damage"))

Results

Effect of event types on population health

We generated three graphs to summarise the effect of event types on population health. On the x-axis, the graphs show how often a specific type of event ends in casualties. For example, if each event of a particular type produces casualties, it will be placed on the right end of the axis and will have the probability of 1. The y-axes show either total or mean number of casualties type events produced, as well as 90th quantile for each type of event. The last metric excludes events within the event type that produced the top 10% casualties to eliminate outliers, i.e. events that have generated an inordinately large number of casualties.

multiplot(plot.health.sum, 
          plot.health.mean,
          plot.health.q90, 
          layout = matrix(c(1,2,2,1,3,3), nrow = 2, byrow = TRUE))

Figure 1: Effect of event types on population health, as the total and mean amount of casualties an event type produced, as well as the 90th quantile of casualties for each event type. Filled circles correspond to fatalities, while the empty ones correspond to injuries. Transparent circles indicate that event type produced casualties only once and therefore correspond to extreme weather events (outliers). We omit event types that did not produce casualties. Margins of the graphs show rug plots, thus allowing for assesment of 1D distributions.

The results of the data analysis show that the type of events that generated the highest amount of both injuries and fatalities over the last 50 years is tornado. It has produced ca. 100,000 injuries and 10,000 fatalities. However, less than 15% of tornadoes end in casualties. Another event type that generates a notable number of fatalities and injuries is excessive heat (a combination of high temperatures and high humidity). While it produces fatalities in 25% of cases and injuries in 10% of cases, it causes 10 times fewer fatalities than do tornadoes. Other types of events that caused more than 3,000 casualties in the last 50 years are flood, thunderstorm wind and lightnings. It is also interesting to mention rip currents which are the type of event that causes fatalities in more than 60% of cases.

When looking at the mean number of casualties, it is evident that the number of injuries is much higher than the number of fatalities. Once again, extreme heat causes the largest number of fatalities. Large number of injuries is also caused by strong winds (e.g. hurricane/typhon, tornado f2), but also by winter weather (e.g. snow/high winds, winter weather mix, glaze).

Interestingly, 90th quantile graph clarifies that extreme heat and cold have a much higher effect on population health than do any other type of events, including tornadoes.

Effect of event types on economy

As in the case of population health investigation, three graphs were produced when investigating the effect of event types on the economy. On the x-axis, they showed how often did a specific type of event produced damage, while on the y-axes they showed total and mean amount of damage the event types produced. In each case, we distinguished between property and crop (agricultural) damage. The third graph plots 90th quantile data, again to exclude unusually severe weather events.

multiplot(plot.damage.sum, 
          plot.damage.mean,
          plot.damage.q90, 
          layout = matrix(c(1,2,2,1,3,3), nrow = 2, byrow = TRUE))

Figure 2: Effect of event types on the economy (material damage), as the total and mean amount of damage an event type produced, as well as the 90th quantile of damage for each event type. Filled circles correspond to property damage, while the empty ones correspond to crop damage. Transparent circles indicate that event type damaged property or crops only once and therefore correspond to extreme weather events (outliers). We omit event types that did not produce damage. Margins of the graphs show rug plots, thus allowing for assesment of 1D distributions.

Event types damage property more often than crops. There are a few event types that have resulted in total property damage higher than 10 billion USD over the last 50 years. These are mostly related to severe storms (e.g. hurricane/typhoon, tornado, hurricane, tropical storm, storm surge, hail, winter storm) and floods (e.g. flood, flash flood). Drought is the only event type that caused crop damage over 10 billion USD.

The effect of storms is also apparent when assessed by the mean amount of caused damage. However, here it becomes clear that certain event types will cause much damage whenever they occur, and therefore it is of great interest to try to prevent them. These are hailstorm and wildfires.

The same event types are striking in the outliers-free 90th quantile graph. This graph also highlights damaging freeze as a cause of massive crop damage.

Conclusion

Event types that the most affect population health can be grouped into those that are caused by extreme hot and cold (e.g. extreme heat, snow/high winds) and to those that are caused by severe storms (e.g. hurricane/typhoon, tornadoes).

We can employ the same classification for the event types that cause the greatest amount of damage. Here, property damage is mostly caused by severe storms (e.g. hurricane/typhoon) rather than by temperature extremes. On the other hand, crops are mostly affected by extreme hot and cold (e.g. drought, damaging freeze).

Note of interest

Despite extensive classification instructions, event types in the dataset suffer from vast inconsistencies. They vary from spelling differences (e.g. wild fires/wildfires) to improperly labelled events (e.g. extreme heat is not a valid event type). Subsequent analyses should be aware of this problem beforehand and perform extensive dataset check and cleaning.