Synopsis

Weather event data from the United States was analyzed to determine public health and economic impacts. Total fatalities were chosen as a proxy for public health impact, and property and crop damages were summed to measure economic impact. Under these assumptions, heat, tornado, and flood events were found to have the largest public health consequences, while flood, storm, and hurricane events had the largest adverse economic effects.

Data Processing

Data for this analysis was obtained from the National Oceanic and Atmospheric Administration (NOAA). The original data set was downloaded from the Coursera site and read into R as a data frame.

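# Download the compressed CSV to a temporary file
temp <- tempfile()
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", temp, mode = "wb")
# read.csv() reads the bzip2-compressed file directly
noaa <- read.csv(temp, header = TRUE, na.strings = c("", ".", "NA"))
# Delete the temporary file, then drop the path variable
unlink(temp)
rm(temp)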

For this analysis, we rely on several tidyverse packages.

library(lubridate)
library(dplyr)
library(forcats)
library(ggplot2)

Per NOAA documentation, all event types were recorded starting in 1996. Before then, only tornado, thunderstorm wind, and hail events were recorded. To avoid skewing the results toward these three event types, we keep only events that began in 1996 or later. Because the BGN_DATE variable in the NOAA data set was read in as a factor, we first convert it to POSIXct and then filter on the year.

noaa_96 <- noaa %>%
        mutate(BGN_DATE = mdy_hms(BGN_DATE)) %>%
        filter(year(BGN_DATE) >= 1996)
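
As a quick illustration of the date handling, the sketch below applies the same conversion to a small made-up data frame. The rows and the names toy and toy_96 are invented for the example, and the date strings assume the month/day/year hour:minute:second layout that mdy_hms() expects.

```r
library(lubridate)
library(dplyr)

# Hypothetical rows; the strings assume an "m/d/Y H:M:S" layout
toy <- data.frame(
        BGN_DATE = factor(c("4/18/1950 0:00:00", "1/6/1996 0:00:00", "11/28/2011 0:00:00")),
        EVTYPE = c("TORNADO", "WINTER STORM", "HAIL")
)

toy_96 <- toy %>%
        mutate(BGN_DATE = mdy_hms(BGN_DATE)) %>%   # factor -> POSIXct
        filter(year(BGN_DATE) >= 1996)             # keep 1996 onward
```

Only the 1996 and 2011 rows survive the filter, and BGN_DATE is now a date-time column rather than a factor.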

Our goal is to study the public health effects and the economic impacts of recorded weather events, so we construct two data frames, one for each question. To measure public health impacts, we consider the FATALITIES variable in the filtered data set. We begin by grouping by event type and summarising to find the total number of fatalities corresponding to all occurrences of that event type. In addition, we make a count column in our summary in order to compute average fatalities later. We then filter for event types with nonzero total fatalities.

noaa_fatal <- noaa_96 %>%
        group_by(EVTYPE) %>%
        summarise(TOT_FATAL = sum(FATALITIES), COUNT = n()) %>%
        filter(TOT_FATAL > 0) %>%
        droplevels(.)

In the data set filtered for events from 1996 onward, EVTYPE had 985 factor levels, and even after filtering for nonzero total fatalities and dropping empty levels, 109 levels remain. Further, many of these levels describe identical or similar event types.

head(noaa_fatal$EVTYPE, n = 10)
##  [1] AVALANCHE        BLACK ICE        BLIZZARD         blowing snow    
##  [5] COASTAL FLOOD    Coastal Flooding COASTAL FLOODING COASTAL STORM   
##  [9] COASTALSTORM     Cold            
## 109 Levels: AVALANCHE BLACK ICE BLIZZARD blowing snow ... WINTRY MIX

Looking at the first 10 event types in the fatalities data frame shows levels for “COASTAL FLOOD”, “Coastal Flooding”, and “COASTAL FLOODING”, all of which appear to describe the same weather event. Other event types are similar enough that we want to group them together, for example “BLIZZARD” and “blowing snow”. Thus, we manually collapse the existing factor levels into a smaller set of broader levels. Because the labels also differ in letter case, spacing, and spelling, we use case-insensitive regular expressions to assign them to the new levels. Finally, we regroup by event type and summarise again to compute the total fatalities and mean fatalities for our new factor levels.

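# Collapse the raw labels into broader groups using case-insensitive regular expressions.
# A label can match more than one pattern (for example, one containing both "storm" and "wind").
noaa_fatal <- noaa_fatal %>%
        mutate(EVTYPE = fct_collapse(.$EVTYPE,
                "WINTER" = grep("winter|wintry|cold|ice|icy|windchill|snow|frost|glaze|blizzard|freeze|freezing|hypothermia|low|sleet|mixed", .$EVTYPE, ignore.case = TRUE, value = TRUE),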
                "SEAS" = grep("seas|surf|marine|rip|tsunami|wave|swell", .$EVTYPE, ignore.case = TRUE, value = TRUE),
                "WIND" = grep("wind|winds|microburst", .$EVTYPE, ignore.case = TRUE, value = TRUE),
                "HEAT" = grep("heat|warm|hyperthermia", .$EVTYPE, ignore.case = TRUE, value = TRUE),
                "FIRE" = grep("fire|wildfire", .$EVTYPE, ignore.case = TRUE, value = TRUE),
                "STORM" = grep("storm|hail", .$EVTYPE, ignore.case = TRUE, value = TRUE),
                "TORNADO" = grep("tornado|waterspout|dust", .$EVTYPE, ignore.case = TRUE, value = TRUE),
                "HURRICANE" = grep("hurricane|typhoon", .$EVTYPE, ignore.case = TRUE, value = TRUE),
                "FLOOD" = grep("flood|flooding|floods|fld|rain|water|drown", .$EVTYPE, ignore.case = TRUE, value = TRUE),
                "LIGHTNING" = grep("lightning", .$EVTYPE, ignore.case = TRUE, value = TRUE),
                "AVALANCHE" = grep("aval", .$EVTYPE, ignore.case = TRUE, value = TRUE),
                "FOG" = grep("fog", .$EVTYPE, ignore.case = TRUE, value = TRUE),
                "LANDSLIDE" = grep("landslide|mud", .$EVTYPE, ignore.case = TRUE, value = TRUE)
        )) %>%
        group_by(EVTYPE) %>%
        summarise(TOT_FATAL = sum(TOT_FATAL), COUNT = sum(COUNT)) %>%
        mutate(AVG_FATAL = TOT_FATAL/COUNT)
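
The reason for carrying COUNT through both summaries can be seen on a toy table. All numbers below are invented, and the single grepl() rule merely stands in for the full set of patterns above. Summing totals and counts before dividing gives the mean per recorded event, whereas averaging the per-label means would weight a rare spelling as heavily as a common one.

```r
library(dplyr)

# Invented per-label summary, in the same shape as noaa_fatal
toy_fatal <- data.frame(
        EVTYPE = c("COASTAL FLOOD", "Coastal Flooding", "AVALANCHE"),
        TOT_FATAL = c(6, 2, 10),
        COUNT = c(3, 5, 4)
)

merged <- toy_fatal %>%
        # Stand-in for the collapse step: merge both coastal flood spellings
        mutate(EVTYPE = ifelse(grepl("coastal", EVTYPE, ignore.case = TRUE), "FLOOD", EVTYPE)) %>%
        group_by(EVTYPE) %>%
        summarise(TOT_FATAL = sum(TOT_FATAL), COUNT = sum(COUNT)) %>%
        mutate(AVG_FATAL = TOT_FATAL / COUNT)

# FLOOD: 8 fatalities over 8 events, so the mean is 1,
# not the 1.2 obtained by averaging 6/3 and 2/5
```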

To measure economic impacts of weather events, we process the data in a similar way. In this case, however, we have four variables to account for in the original data set: PROPDMG, CROPDMG, PROPDMGEXP, and CROPDMGEXP. The first two give numerical values for property and crop damage, while the latter two give the units of those values as thousands (K), millions (M), or billions (B) of dollars. We keep events from 1996 onward with nonzero property or crop damage.

It is not uncommon to have missing values for crop damage when we have nonzero property damage, or vice versa. To avoid issues with missing values in calculations like sum(), we coerce missing values to 0. We also need to have consistent units for our analysis, so we define two new variables, PROPDMG_K and CROPDMG_K, for property damage and crop damage in thousands of dollars.

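noaa_econ <- noaa_96 %>%
        # Keep events with some recorded property or crop damage
        filter(PROPDMG > 0 | CROPDMG > 0) %>%
        # Property damage in thousands of dollars; a missing unit code counts as 0.
        # case_when() returns NA for any unit code not listed below.
        mutate(PROPDMG_K = case_when(
                PROPDMGEXP == 1 ~ PROPDMG,
                PROPDMGEXP == "K" ~ PROPDMG,
                PROPDMGEXP == "M" ~ PROPDMG * 1000,
                PROPDMGEXP == "B" ~ PROPDMG * 1000000,
                is.na(PROPDMGEXP) ~ 0
                )
        ) %>%
        # Crop damage in thousands of dollars, with the same conventions
        mutate(CROPDMG_K = case_when(
                CROPDMGEXP == "K" ~ CROPDMG,
                CROPDMGEXP == "M" ~ CROPDMG * 1000,
                CROPDMGEXP == "B" ~ CROPDMG * 1000000,
                is.na(CROPDMGEXP) ~ 0
                )
        ) %>%
        droplevels(.)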
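
The unit conversion can also be written as a lookup table, which keeps the multipliers in one place. The following is a minimal base R sketch on invented rows, not the code used for the results. The names to_thousands, toy_dmg, and mult are hypothetical. Unlike the case_when() version, it treats every unrecognised code as 0 damage, not just a missing one.

```r
# Multipliers that express each unit code in thousands of dollars
to_thousands <- c(K = 1, M = 1e3, B = 1e6)

# Invented rows in the shape of the NOAA damage columns
toy_dmg <- data.frame(
        PROPDMG = c(25, 1.5, 2, 10),
        PROPDMGEXP = c("K", "M", "B", NA)
)

# Named-vector lookup; codes absent from the table (including NA) give NA
mult <- unname(to_thousands[as.character(toy_dmg$PROPDMGEXP)])

# Treat a missing or unrecognised unit as 0 damage (an assumption of this sketch)
toy_dmg$PROPDMG_K <- ifelse(is.na(mult), 0, toy_dmg$PROPDMG * mult)
```

Here 1.5 with unit M becomes 1500 and 2 with unit B becomes 2000000, both in thousands of dollars.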

As before, we group by event type, then summarise. In this case, we compute the total damage by summing property damage and crop damage in thousands of dollars. After defining new factor levels for event types, we regroup and summarise to find the total damage and the average damage.

noaa_econ <- noaa_econ %>%
        group_by(EVTYPE) %>%
        summarise(
                TOT_DMG_K = sum(PROPDMG_K + CROPDMG_K),
                COUNT = n()
        ) %>%
        # Same collapsing approach as for fatalities, with additional patterns and groups (DROUGHT, VOLCANO, OTHER)
        mutate(EVTYPE = fct_collapse(.$EVTYPE,
                "WINTER" = grep("winter|wintry|cold|ice|icy|windchill|snow|frost|glaze|blizzard|freeze|freezing|hypothermia|low|sleet|mixed", .$EVTYPE, ignore.case = TRUE, value = TRUE),
                "SEAS" = grep("seas|surf|marine|rip|tsunami|wave|swell|tide|beach|coastal", .$EVTYPE, ignore.case = TRUE, value = TRUE),
                "WIND" = grep("wind|winds|microburst|downburst", .$EVTYPE, ignore.case = TRUE, value = TRUE),
                "HEAT" = grep("heat|warm|hyperthermia", .$EVTYPE, ignore.case = TRUE, value = TRUE),
                "FIRE" = grep("fire|wildfire|smoke", .$EVTYPE, ignore.case = TRUE, value = TRUE),
                "STORM" = grep("storm|hail|depression", .$EVTYPE, ignore.case = TRUE, value = TRUE),
                "TORNADO" = grep("tornado|waterspout|dust|funnel|landspout", .$EVTYPE, ignore.case = TRUE, value = TRUE),
                "HURRICANE" = grep("hurricane|typhoon", .$EVTYPE, ignore.case = TRUE, value = TRUE),
                "FLOOD" = grep("flood|flooding|floods|fld|rain|water|drown|dam|seiche", .$EVTYPE, ignore.case = TRUE, value = TRUE),
                "LIGHTNING" = grep("lightning", .$EVTYPE, ignore.case = TRUE, value = TRUE),
                "AVALANCHE" = grep("aval", .$EVTYPE, ignore.case = TRUE, value = TRUE),
                "FOG" = grep("fog", .$EVTYPE, ignore.case = TRUE, value = TRUE),
                "LANDSLIDE" = grep("landslide|mud|landslump|slide", .$EVTYPE, ignore.case = TRUE, value = TRUE),
                "DROUGHT" = grep("drought", .$EVTYPE, ignore.case = TRUE, value = TRUE),
                "VOLCANO" = grep("volcan", .$EVTYPE, ignore.case = TRUE, value = TRUE),
                "OTHER" = grep("other", .$EVTYPE, ignore.case = TRUE, value = TRUE)
        )) %>%
        group_by(EVTYPE) %>%
        summarise(TOT_DMG = sum(TOT_DMG_K), COUNT = sum(COUNT)) %>%
        mutate(AVG_DMG = TOT_DMG/COUNT)
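
Several of the patterns used for collapsing can match the same label. Which group such a label ends up in depends on how fct_collapse() resolves a level that appears in more than one group, so it is worth checking against the installed forcats version. The base R sketch below lists the ambiguous labels for a handful of illustrative labels and shortened patterns. The labels, the shortened patterns, and the names hits and ambiguous are chosen for the example only.

```r
# Illustrative labels and abbreviated versions of the patterns
labels <- c("THUNDERSTORM WIND", "WINTER STORM", "WATERSPOUT", "FLASH FLOOD", "LIGHTNING")
patterns <- c(
        WIND = "wind|microburst",
        STORM = "storm|hail",
        WINTER = "winter|snow|ice",
        TORNADO = "tornado|waterspout",
        FLOOD = "flood|rain|water"
)

# Logical matrix: one row per label, one column per group
hits <- sapply(patterns, function(p) grepl(p, labels, ignore.case = TRUE))
rownames(hits) <- labels

# Labels claimed by more than one group
ambiguous <- rownames(hits)[rowSums(hits) > 1]
```

In this toy case, "THUNDERSTORM WIND", "WINTER STORM", and "WATERSPOUT" each match two groups. Running the same check on the real labels would show where the order or wording of the patterns affects the totals.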

Results

To evaluate the event types by public health threat, we plot the total fatalities across each event type.

ggplot(noaa_fatal, aes(x = EVTYPE, y = TOT_FATAL)) + 
        geom_col() + 
        coord_flip() +
        labs(x = "Event Type",
                y = "Total Fatalities",
                caption = "Total number of fatalities by weather event type since 1996 in the United States."
             )

We see that heat events have the largest number of fatalities, followed by tornado and flood events. However, when we instead look at mean fatalities per recorded event of each type, hurricane and avalanche events follow heat as the most deadly. This can be explained by the relatively low frequency of hurricanes and avalanches compared to tornadoes and floods.

ggplot(noaa_fatal, aes(x = EVTYPE, y = AVG_FATAL)) + 
        geom_col() + 
        coord_flip() +
        labs(x = "Event Type",
                y = "Mean Fatalities",
                caption = "Mean number of fatalities by weather event type since 1996 in the United States."
             )
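
Rankings such as these are easier to read when the bars are sorted by the plotted value. A minimal sketch, using invented totals and base R's reorder() inside aes(), might look as follows. The data frame toy_fatal and the object p are hypothetical, and the same idea would apply to AVG_FATAL.

```r
library(ggplot2)

# Invented totals, in the same shape as noaa_fatal
toy_fatal <- data.frame(
        EVTYPE = c("FOG", "HEAT", "TORNADO"),
        TOT_FATAL = c(5, 30, 20)
)

# reorder() sorts the event types by the plotted value,
# so after coord_flip() the largest bar appears at the top
p <- ggplot(toy_fatal, aes(x = reorder(EVTYPE, TOT_FATAL), y = TOT_FATAL)) +
        geom_col() +
        coord_flip() +
        labs(x = "Event Type", y = "Total Fatalities")
```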

Total economic damages for the event types with nonzero damage estimates span eight orders of magnitude. Thus, we visualize the total damages by plotting the base-10 logarithm of the total damage variable.

ggplot(noaa_econ, aes(x = EVTYPE, y = log10(TOT_DMG))) + 
        geom_col() + 
        coord_flip() +
        labs(x = "Event Type",
                y = "Total Damage (log10 of thousands of US dollars)",
                caption = "Log of sum of property and crop economic damage (thousands $US) by weather event type since 1996 in the United States."
             )
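
To make the log scale concrete, the sketch below works through the arithmetic on three invented totals, given in thousands of dollars like TOT_DMG. The vector tot_dmg_k and its labels are hypothetical.

```r
# Invented totals in thousands of dollars
tot_dmg_k <- c(LOW = 1.5, MID = 4e3, HIGH = 1.5e8)

# Orders of magnitude between the smallest and largest totals
span <- log10(max(tot_dmg_k) / min(tot_dmg_k))

# Bar heights on the log scale; 10^height recovers thousands of dollars
heights <- log10(tot_dmg_k)
back <- 10^heights
```

Here span is 8, so on a linear axis the smallest bar would be invisible next to the largest. Note also that any total below one thousand dollars has a negative base-10 logarithm, so its bar points in the opposite direction.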

Floods account for the largest total economic losses, followed by storm and hurricane events. This is not unexpected, as those three event types tend to cover large geographic areas. Although winter events can also affect large areas, we expect them to cause relatively smaller economic damage because they are much less likely to damage crops. Conversely, events occurring over small geographic regions would be expected to have small direct economic impacts. This is borne out by the data shown in the plot.