Synopsis

This analysis makes use of data from the National Oceanic and Atmospheric Administration (NOAA) Storm Database to determine which types of extreme weather events are most harmful to life and property. This dataset contains information on individual weather events in the US, including estimates of fatalities, injuries, property damage and crop damage for each event. In this analysis, only the most recent ten years of data (2002-2011) are considered. Weather event records are matched to a number of general weather event types based on keywords in their description, and, for each of these types, the sum total and the mean are calculated for each variable of interest mentioned above. The sums indicate the overall casualties and damage due to each type of event, while the means give an idea of the effect of individual events of each type. These quantities are plotted as bar charts to give a visual representation of which kinds of weather event are most hazardous to health and the economy.

Data Processing

Firstly, the CSV-formatted data is downloaded if necessary and loaded into R.

data_file <- "StormData.csv.bz2"

if(!file.exists(data_file)) {
    url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
    download.file(url = url, destfile = data_file, method = "curl")
}

data <- read.csv(bzfile(data_file), na.strings = "?", stringsAsFactors = FALSE)

After being loaded, the data is processed. To ensure the results are relevant to modern times, only data from the most recent ten years (2002-2011) is kept.

library(dplyr)

data <- data %>%
    mutate(BGN_DATE = as.Date(BGN_DATE, "%m/%d/%Y %H:%M:%S")) %>%
    filter(as.numeric(format(BGN_DATE, format = "%Y")) %in% 2002:2011)
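The date handling can be checked on a few hand-written strings. The following is a minimal sketch in base R; the three date values are made up for illustration and only mimic the layout that the format string above expects.

```r
# Made-up dates in the same "%m/%d/%Y %H:%M:%S" layout as BGN_DATE
bgn <- c("4/18/1950 0:00:00", "6/1/2002 0:00:00", "12/31/2011 0:00:00")

# as.Date() parses the full string but keeps only the calendar date
parsed <- as.Date(bgn, format = "%m/%d/%Y %H:%M:%S")

# Extract the year as a number and test it against the ten-year window
years <- as.numeric(format(parsed, format = "%Y"))
keep <- years %in% 2002:2011

print(data.frame(bgn, parsed, keep))
```

Only the second and third rows fall inside 2002-2011, so only those would survive the filter.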

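The second step in processing the data is to select the relevant variables. These are the event type (EVTYPE), the numbers of fatalities and injuries (FATALITIES, INJURIES), and the property and crop damage estimates together with their unit codes (PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP).

These variables are also given tidier, more descriptive names.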

data <- data %>%
    select(EVTYPE, FATALITIES:CROPDMGEXP)
colnames(data) <- c("event.type", "fatalities", "injuries", "property.damage",
                    "property.damage.units", "crop.damage", "crop.damage.units")

Next, in preparation for filtering the data, the textual variables (the event type description and the property and crop damage units) are converted to lowercase and trimmed of leading and trailing whitespace.

data <- data %>%
    mutate(event.type = trimws(tolower(event.type)),
           property.damage.units = trimws(tolower(property.damage.units)),
           crop.damage.units = trimws(tolower(crop.damage.units)))

The data is then filtered to remove any records which refer to summaries of time periods instead of specific weather events. Additionally, only records with unambiguous property and crop damage units - ‘k’, ‘m’, or ‘b’, or blank when there is no damage - are kept.

units <- list(k = 1e3, m = 1e6, b = 1e9)

data <- data %>%
    filter(!grepl("summary", event.type),
           property.damage.units %in% c("", names(units)),
           crop.damage.units %in% c("", names(units)),
           !(property.damage > 0 & property.damage.units == ""),
           !(crop.damage > 0 & crop.damage.units == ""))
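To make the filtering rules concrete, here is a small sketch in base R that applies the same conditions to a toy data frame. The rows and the `toy` and `keep` names are invented for illustration, and only the property damage columns are shown; the crop damage conditions work in the same way.

```r
units <- list(k = 1e3, m = 1e6, b = 1e9)

# Invented records covering each case the filter has to handle
toy <- data.frame(
    event.type = c("flood", "summary of june 6", "hail", "tornado", "heat"),
    property.damage = c(25, 0, 10, 5, 0),
    property.damage.units = c("k", "", "", "h", ""),
    stringsAsFactors = FALSE
)

keep <- !grepl("summary", toy$event.type) &                      # drop summaries
    toy$property.damage.units %in% c("", names(units)) &         # known units only
    !(toy$property.damage > 0 & toy$property.damage.units == "") # damage needs a unit

print(toy[keep, ])
```

Here the summary row, the hail row (non-zero damage with a blank unit) and the tornado row (unrecognised unit 'h') are dropped, leaving the flood and heat rows.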

The penultimate preprocessing step is to combine the property and crop damage variables with their corresponding units to obtain damage estimates in dollars.

convert_unit <- function(unit) {
    if(unit %in% names(units)) {
        return(units[[unit]])
    }
    return(0)
}

data <- data %>%
    mutate(property.damage.units = sapply(property.damage.units, 
                                          convert_unit, 
                                          USE.NAMES = FALSE),
           crop.damage.units = sapply(crop.damage.units,
                                      convert_unit,
                                      USE.NAMES = FALSE),
           property.damage = property.damage * property.damage.units,
           crop.damage = crop.damage * crop.damage.units) %>%
    select(-c(property.damage.units, crop.damage.units))
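The same conversion can be expressed as a vectorised lookup rather than a per-element function call. This is a sketch of an alternative, not the code used to produce the results in this report; `multiplier`, `amount`, `unit` and the toy values are illustrative assumptions.

```r
# Dollar multipliers for each recognised unit code
multiplier <- c(k = 1e3, m = 1e6, b = 1e9)

# Invented damage amounts and their unit codes (blank means no damage)
amount <- c(25, 2.5, 1, 0)
unit <- c("k", "m", "b", "")

# match() returns NA for codes that are not recognised, including ""
idx <- match(unit, names(multiplier))

# Unrecognised or blank codes get a multiplier of zero, as in convert_unit()
mult <- ifelse(is.na(idx), 0, multiplier[idx])

dollars <- amount * mult
print(dollars)
```

Because the earlier filter has already removed records with non-zero damage and a blank unit, a zero multiplier for blank codes only ever applies to records whose damage is zero.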

Finally, the event type variable, containing a description of the kind of weather event observed, is updated. At this point, there are 121 unique event descriptions, many of which refer to very similar types of event. To remedy this, a number of general weather event types are defined, and the subsets of data whose descriptions match each of these types are extracted, with the event type description changed to the relevant general name. These subsets are then recombined. Note that some records are duplicated as they match more than one general weather event type.

name <- c("hurricane/typhoon", "tornado", "thunderstorm", "tropical storm",
          "wind", "storm surge", "low tide", "high tide","flood", "snow/sleet",
          "rain", "hail", "cold/wintry weather", "ice", "drought", "heat",
          "wildfires", "dust storm/devil", "erosion", "volcanic eruption/ash",
          "mudslide", "fog")
regexp <- c("hurricane|typhoon", "tornado", "tstm|thunderstorm|lightning", 
            "tropical storm", "wind", "surge", "low tide", 
            "high tide", "flood", "snow|blizzard|sleet", "rain|wet", "hail", 
            "cold|freez|(wint(e)?r(y)?.*(weather|mix))", "ice", "drought|dry", 
            "heat|warmth", "fire", "dust", "erosion", "volcanic", "mud", "fog")
weather.types <- data.frame(name, regexp, stringsAsFactors = FALSE)

data_list <- vector(mode = "list", length = nrow(weather.types))

for(i in seq_len(nrow(weather.types))) {
    data_list[[i]] <- data %>% 
        filter(grepl(weather.types[i,"regexp"], event.type)) %>%
        mutate(event.type = weather.types[i,"name"])
}

data <- do.call("rbind", data_list)
rm(data_list)
data$event.type <- as.factor(data$event.type)
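The duplication of records across general types follows directly from the keyword matching. The sketch below uses three of the patterns defined above on a handful of invented descriptions and counts how many general types each one matches; `descriptions`, `matches` and `n_types` are hypothetical names used only here.

```r
# A subset of the general types and patterns defined above
name <- c("thunderstorm", "wind", "flood")
regexp <- c("tstm|thunderstorm|lightning", "wind", "flood")

# Invented event descriptions, already lowercased and trimmed
descriptions <- c("tstm wind", "flash flood", "high wind", "lightning")

# One column per general type: TRUE where the description matches its pattern
matches <- sapply(regexp, grepl, x = descriptions)
dimnames(matches) <- list(descriptions, name)
print(matches)

# Number of general types each description is assigned to
n_types <- rowSums(matches)
print(n_types)
```

In this toy case 'tstm wind' matches both the thunderstorm and the wind pattern, so a record with that description would appear twice in the recombined data and contribute to the totals of both types.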

Analysis

The analysis of this data is straightforward, consisting of calculating the sum and mean of each variable of interest for each type of weather event. Firstly, the data on casualties is analysed. The data is grouped by event type, and the sum total and the mean are calculated for both fatalities and injuries for each of these event types. Event types are ordered by total fatalities.

health.data <- data %>%
    select(event.type, fatalities, injuries) %>%
    group_by(event.type) %>%
    # funs() matches the dplyr 0.7.5 session listed below; it is deprecated from dplyr 0.8.0
    summarize_all(funs(sum, mean)) %>%
    arrange(desc(fatalities_sum))

print(health.data)
## # A tibble: 21 x 5
##    event.type    fatalities_sum injuries_sum fatalities_mean injuries_mean
##    <fct>                  <dbl>        <dbl>           <dbl>         <dbl>
##  1 tornado                 1112        13588         0.0733         0.896 
##  2 heat                     920         4019         0.567          2.48  
##  3 flood                    789          820         0.0144         0.0150
##  4 wind                     665         3402         0.00381        0.0195
##  5 thunderstorm             597         4849         0.00367        0.0298
##  6 cold/wintry ~            296          454         0.0268         0.0411
##  7 wildfires                 76         1051         0.0247         0.342 
##  8 hurricane/ty~             67         1291         0.523         10.1   
##  9 snow/sleet                36          258         0.00296        0.0212
## 10 fog                       31          289         0.0237         0.221 
## # ... with 11 more rows
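The difference between the two summaries can be seen in a toy example. This is a minimal base R sketch with invented numbers, intended only to show why a frequent but mild event type can lead on the sum while a rare but severe one leads on the mean.

```r
# Invented records: eight wind events and a single hurricane
toy <- data.frame(
    event.type = c(rep("wind", 8), "hurricane"),
    fatalities = c(0, 1, 0, 1, 0, 1, 0, 1, 3),
    stringsAsFactors = FALSE
)

# Sum and mean of fatalities within each event type
totals <- tapply(toy$fatalities, toy$event.type, sum)
means <- tapply(toy$fatalities, toy$event.type, mean)

print(rbind(totals, means))
```

With these numbers, wind has the larger total (4 against 3) but the hurricane has the larger mean (3 against 0.5), the same pattern seen for wind and hurricanes in the table above.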

The data on economic damage is analysed in the same way. For each weather event type, the sum total and the mean are calculated for both property damage and crop damage. Here, event types are ordered by the sum of the total property and crop damage.

economic.data <- data %>%
    select(event.type, property.damage, crop.damage) %>%
    group_by(event.type) %>%
    # funs() matches the dplyr 0.7.5 session listed below; it is deprecated from dplyr 0.8.0
    summarize_all(funs(sum, mean)) %>%
    arrange(desc(property.damage_sum + crop.damage_sum))

print(economic.data)
## # A tibble: 21 x 5
##    event.type       property.damage_s~ crop.damage_sum property.damage_me~
##    <fct>                         <dbl>           <dbl>               <dbl>
##  1 flood                  144321511800      4404421400            2638467.
##  2 hurricane/typho~        72342695010      3056382800          565177305.
##  3 storm surge             47809503000          850000          190476108.
##  4 tornado                 18406922660       220589910            1213537.
##  5 wind                    10018541390      1155172600              57399.
##  6 hail                     9189944470      1394738150              64327.
##  7 drought                   846041000      5423626000             453641.
##  8 thunderstorm             5512802350       579269900              33928.
##  9 wildfires                4959547000       297479430            1611812.
## 10 tropical storm           2008360550       410061000            3438974.
## # ... with 11 more rows, and 1 more variable: crop.damage_mean <dbl>

Results

A bar plot of the sum total and the mean of fatalities and injuries, for the 10 most hazardous event types by total fatalities, gives a visual representation of which types of extreme weather are most harmful to life.

library(reshape2)
library(tidyr)
library(ggplot2)

health.worst10 <- health.data %>%
    .[1:10,] %>%
    melt(id.vars = "event.type") %>%
    separate(variable, into = c("variable", "summary.type"), sep = "_")
health.worst10$event.type <- factor(health.worst10$event.type,
                                    levels = rev(health.worst10$event.type[1:10]))
health.worst10$summary.type <- factor(health.worst10$summary.type,
                                      levels = c("sum", "mean"))

ggplot(health.worst10, aes(event.type, value, fill = variable)) + 
    geom_col(position = "dodge") + 
    facet_grid(. ~ summary.type, scales = "free") + 
    coord_flip() + 
    theme(legend.position = "bottom", legend.title = element_blank()) +
    ggtitle("Casualties from US Weather Events") +
    xlab("") +
    ylab("Casualties") +
    labs(caption = paste("Total and average fatalities and injuries", 
                         "for the 10 most hazardous weather event",
                         "types by total fatalities caused")) +
    scale_fill_manual(labels = c("Fatalities", "Injuries"),
                      values = c("red3", "orange1"))

This plot shows that tornadoes, extreme heat and flooding are responsible for the greatest number of fatalities. In terms of injuries, tornadoes cause the most by a wide margin, with thunderstorms, heat and wind also causing substantial numbers. However, the means show that, per event, hurricanes and heat cause the largest number of casualties.

A similar bar plot for the economic damage data, containing information on the 10 most harmful events by total damage caused, shows which kinds of weather are most destructive to property and crops.

economic.worst10 <- economic.data %>%
    .[1:10,] %>%
    melt(id.vars = "event.type") %>%
    separate(variable, into = c("variable", "summary.type"), sep = "_")
economic.worst10$event.type <- factor(economic.worst10$event.type,
                                   levels = rev(economic.worst10$event.type[1:10]))
economic.worst10$summary.type <- factor(economic.worst10$summary.type,
                                     levels = c("sum", "mean"))

ggplot(economic.worst10, aes(event.type, value / 1e6, fill = variable)) + 
    geom_col(position = "dodge") + 
    facet_grid(. ~ summary.type, scales = "free") + 
    coord_flip() + 
    theme(legend.position = "bottom", legend.title = element_blank()) + 
    ggtitle("Damage from US Weather Events") +
    xlab("") +
    ylab("Damage (Millions of Dollars)") +
    scale_fill_manual(labels = c("Crop Damage", "Property Damage"), 
                      values = c("wheat3", "steelblue")) +
    labs(caption = paste("Total and average property and crop damage",
                         "for the 10 most harmful weather event",
                         "types by total damage caused"))

This plot shows that floods cause the most economic harm overall, with hurricanes and storm surges also responsible for a large amount of damage. However, it can be seen that droughts cause the greatest damage to crops. In terms of damage per event, hurricanes are the most harmful by a vast margin, followed by storm surges. Other event types cause relatively little destruction per event.

Session Information

To aid reproducibility, information on the system and packages used to perform this analysis is provided below.

sessionInfo()
## R version 3.5.0 (2018-04-23)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17134)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United Kingdom.1252 
## [2] LC_CTYPE=English_United Kingdom.1252   
## [3] LC_MONETARY=English_United Kingdom.1252
## [4] LC_NUMERIC=C                           
## [5] LC_TIME=English_United Kingdom.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggplot2_2.2.1  tidyr_0.8.1    reshape2_1.4.3 bindrcpp_0.2.2
## [5] dplyr_0.7.5   
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.17     knitr_1.20       bindr_0.1.1      magrittr_1.5    
##  [5] munsell_0.4.3    tidyselect_0.2.4 colorspace_1.3-2 R6_2.2.2        
##  [9] rlang_0.2.0      stringr_1.3.1    plyr_1.8.4       tools_3.5.0     
## [13] grid_3.5.0       gtable_0.2.0     htmltools_0.3.6  lazyeval_0.2.1  
## [17] yaml_2.1.19      assertthat_0.2.0 rprojroot_1.3-2  digest_0.6.15   
## [21] tibble_1.4.2     purrr_0.2.4      glue_1.2.0       evaluate_0.10.1 
## [25] rmarkdown_1.9    stringi_1.1.7    compiler_3.5.0   pillar_1.2.3    
## [29] scales_0.5.0     backports_1.1.2  pkgconfig_2.0.1