The impact of severe weather events to health and economy in the US.

Synopsys

The objective of this study is to identify which severe weather events had the most impact on peoples overall health and on economy in the US between the years 1950 and 2011. To answer this questions we will rely on the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, which contains detailed data about major storm and weather events across multiple regions in the United States, as well as estimate measurements of the number of injuries and fatalities and the financial consequences from damages caused by each event. With the results of this study, we hope to bring insights to governmental managers all over the US so they can better prepare for future incidents.

Loading and Processing the Data

Database Source and additional information

From the Storm Data Download we can obtain the storm database just as specified above. For further and updated information about this database you can refer to the NOAA Database Website. You can also read the Storm Data Documentation and National Climatic Datacenter Storm Events FAQ.

Required R Packages for this analysis

library(tidyverse)
library(scales)
library(lubridate)

Loading the data

First we get the work directory ready and organized to receive the data. Then we download the .bz2 into the data folder.

# Check if data folder exists in the work directory
if(!file.exists("./data")){
        dir.create("./data")
}

# Check if project file exists within the data folder. If not, download it.
if(!file.exists("./data/StormData.csv.bz2")){
        file_url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
        download.file(file_url, destfile = "./data/StormData.csv.bz2")
}

Reading the data

To read the data we will use the readr package (automatically loaded with the tidyverse package). It is important to note that the dataset has some pattern issues and changes in some of its data across all the years, so the guess_max argument in the following code chunk came in handy for that problem, specially in the early years. The reading process may take a bit longer though.

# The following guess_max arg from the readr::read_csv function was helpful to determine the overall data types for each variable in the dataset.
# The number 902297 is the number of obs in the dataset.
storm_data <- read_csv("./data/StormData.csv.bz2", guess_max = 902297)

The data frame dimensions and structure can be checked as follows.

dim(storm_data) # Check dimensions

## [1] 902297     37

str(storm_data, give.attr = FALSE) # Check structure

## tibble [902,297 x 37] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ STATE__   : num [1:902297] 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr [1:902297] "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr [1:902297] "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr [1:902297] "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num [1:902297] 97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr [1:902297] "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr [1:902297] "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr [1:902297] "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num [1:902297] 0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr [1:902297] NA NA NA NA ...
##  $ BGN_LOCATI: chr [1:902297] NA NA NA NA ...
##  $ END_DATE  : chr [1:902297] NA NA NA NA ...
##  $ END_TIME  : chr [1:902297] NA NA NA NA ...
##  $ COUNTY_END: num [1:902297] 0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi [1:902297] NA NA NA NA NA NA ...
##  $ END_RANGE : num [1:902297] 0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr [1:902297] NA NA NA NA ...
##  $ END_LOCATI: chr [1:902297] NA NA NA NA ...
##  $ LENGTH    : num [1:902297] 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num [1:902297] 100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : num [1:902297] 3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num [1:902297] 0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num [1:902297] 0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num [1:902297] 15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num [1:902297] 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr [1:902297] "K" "K" "K" "K" ...
##  $ CROPDMG   : num [1:902297] 0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr [1:902297] NA NA NA NA ...
##  $ WFO       : chr [1:902297] NA NA NA NA ...
##  $ STATEOFFIC: chr [1:902297] NA NA NA NA ...
##  $ ZONENAMES : chr [1:902297] NA NA NA NA ...
##  $ LATITUDE  : num [1:902297] 3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num [1:902297] 8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num [1:902297] 3051 0 0 0 0 ...
##  $ LONGITUDE_: num [1:902297] 8806 0 0 0 0 ...
##  $ REMARKS   : chr [1:902297] NA NA NA NA ...
##  $ REFNUM    : num [1:902297] 1 2 3 4 5 6 7 8 9 10 ...

unique_events <- length(unique(storm_data$EVTYPE)) # Calculate unique event types

We can also see that, in its current state, the storm_data data frame contains 977 unique event types, which is a large number of event types if compared to the 48 official event types mentioned in the NOAA Database Website and Storm Data Documentation. This is probably due to lack of event names standardization along the years.

Processing the data

We have a total of 37 variables with 902,297 observations. To properly answer our questions we will only need 9 of them, which are:

BGN_DATE → Begin Date
END_DATE → End Date
EVTYPE → Event Type
FATALITIES → Estimate number of fatalities caused by the event
INJURIES → Estimate number of injuries caused by the event
PROPDMG → Value of property damage caused by the event
PROPDMGEXP → Exponent of the property damage value (e.g., K fo thousand, M for million, B for billion, H for hundred, integer i for 10^i)¹
CROPDMG → Value of crop damage caused by the event
CROPDMGEXP → Exponent of the crop damage value (e.g., K fo thousand, M for million, B for billion, H for hundred, integer i for 10^i)²

In order to subset and improve the originally created storm_data data frame we will use the dplyr package (also automatically loaded with the tidyverse package). For that we will: keep only the variables mentioned above, create two new damage value variables based on their respective exponent, create a new variable with the sum of injuries and fatalities together for each observation, create a new variable with the sum of damage value of property and crop categories together, properly format the dates variables as date class and set the event types as upper case character class. We will also filter out observations in which both health and economic impact equals 0. Finally, to avoid possible distortion on the events impact proportionality, we will only keep the records of events happening from 1996 to 2011. As stated in the NOAA Database Website, the properly recording of the complete list of event types only started in 1996 as defined in NWS Directive 10-1605.
As a last complementary step, we will also try to reduce some of the standardization problem by changing some event type names that should belong in the same group. Those changes were chosen by analyzing the preprocessed data and identifying some health and economic relevant events that should be grouped together.

storm_data_clean <- storm_data %>%
        # Subset needed variables
        select(BGN_DATE, END_DATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP) %>% 
        # This step will take the property damage value and multiply it by is exponent based on the rules previously mentioned
        mutate(
                prop_dmg_val = PROPDMG * case_when(
                        tolower(PROPDMGEXP) == "k" ~ 10^3,
                        tolower(PROPDMGEXP) == "m" ~ 10^6,
                        tolower(PROPDMGEXP) == "b" ~ 10^9,
                        tolower(PROPDMGEXP) == "h" ~ 10^2,
                        !is.na(as.numeric(PROPDMGEXP)) ~ 10^as.numeric(PROPDMGEXP),
                        TRUE ~ 10^0
                ),
                # This step will take the crop damage value and multiply it by is exponent based on the rules previously mentioned
                crop_dmg_val = CROPDMG * case_when(
                        tolower(CROPDMGEXP) == "k" ~ 10^3,
                        tolower(CROPDMGEXP) == "m" ~ 10^6,
                        tolower(CROPDMGEXP) == "b" ~ 10^9,
                        tolower(CROPDMGEXP) == "h" ~ 10^2,
                        !is.na(as.numeric(CROPDMGEXP)) ~ 10^as.numeric(CROPDMGEXP),
                        TRUE ~ 10^0
                )
        ) %>%
        # Sum of health related variables into the health_harm var and economic related variables into the economic_conseq var
        mutate(health_harm = FATALITIES + INJURIES, economic_conseq = prop_dmg_val + crop_dmg_val) %>%
        # Overall formatting
        mutate(BGN_DATE = as.Date(BGN_DATE, format = "%m/%d/%Y %H:%M:%S"),
               END_DATE = as.Date(END_DATE, format = "%m/%d/%Y %H:%M:%S"),
               EVTYPE = toupper(EVTYPE)
        ) %>%
        # Filter irrelevant observations and keep only those happening between 1996 and 2011
        filter(health_harm + economic_conseq > 0, year(BGN_DATE) >= 1996) %>%
        # Standardize names of relevant event types that belong to the same event
        mutate(EVTYPE = case_when(
                EVTYPE == "HURRICANE/TYPHOON" ~ "HURRICANE",
                EVTYPE == "WILD/FOREST FIRE" ~ "WILDFIRE",
                EVTYPE == "RIP CURRENTS" ~ "RIP CURRENT",
                EVTYPE == "STRONG WINDS" ~ "STRONG WIND",
                substr(EVTYPE,1, 13) == "COASTAL FLOOD" ~ "COASTAL FLOOD",
                TRUE ~ EVTYPE
        )
        )

# Calculates the number of unique events in the new data frame
unique_events <- length(unique(storm_data_clean$EVTYPE))

With the cleaning, subset and formatting of the original data frame, we will end up with 177 types of events, which is much better than the original number, but yet not equal to the number reported in the official resources. Despite this fact, we will advance in the analysis as it is in its current state, since it would be pretty work intensive to manually summarize that variable into 48 categories, specially considering that those changes would probably be irrelevant for the results we seek.

Results

Across the United States, which types of events are most harmful with respect to population health?

To answer the first question, we will analyze the health harm separately from the economic consequences. For that we will subset and summarize the previously created cleaned data to make it suitable to our needs. We will also visualize only the 20 most meaningful events and will be able to check how relevant they are to the fatalities/injuries numbers.

storm_data_health <- storm_data_clean %>%
        group_by(EVTYPE) %>%
        summarise(total_health = sum(health_harm)) %>%
        arrange(desc(total_health)) %>%
        mutate(EVTYPE = fct_reorder(EVTYPE, desc(total_health))) %>% # Reorder levels of event types by relevance
        mutate(cum_percent = cumsum( total_health / sum(total_health)) ) %>% # Calculate the cumulative sum over the total
        slice_max(total_health, n = 20) # Keep only the 20 most relevant event types

The health harm data frame can be seen below.

storm_data_health

## # A tibble: 20 x 3
##    EVTYPE            total_health cum_percent
##    <fct>                    <dbl>       <dbl>
##  1 TORNADO                  22178       0.332
##  2 EXCESSIVE HEAT            8188       0.455
##  3 FLOOD                     7172       0.563
##  4 LIGHTNING                 4792       0.635
##  5 TSTM WIND                 3870       0.693
##  6 FLASH FLOOD               2561       0.731
##  7 WILDFIRE                  1543       0.754
##  8 THUNDERSTORM WIND         1530       0.777
##  9 WINTER STORM              1483       0.799
## 10 HEAT                      1459       0.821
## 11 HURRICANE                 1446       0.843
## 12 HIGH WIND                 1318       0.863
## 13 RIP CURRENT               1045       0.878
## 14 HEAVY SNOW                 805       0.890
## 15 FOG                        772       0.902
## 16 HAIL                       720       0.913
## 17 BLIZZARD                   455       0.919
## 18 STRONG WIND                409       0.926
## 19 ICE STORM                  400       0.932
## 20 TROPICAL STORM             395       0.938

With the new data frame we can now plot the data to find answers to our first question.

max_health_harm = max(storm_data_health$total_health) # This code is used to create a second proportional y label in the following plot

ggplot(storm_data_health, aes(x = EVTYPE)) + 
        theme_light() +
        geom_col(mapping = aes(y = total_health), fill = "steelblue", color = "black", alpha = 0.5) +
        geom_line(aes(y = cum_percent * max_health_harm, group = NA), size = 0.5, color = "red") +
        labs(x = "Event Type", y = "Total Health Impact (Fatalities + Injuries)") +
        # Format label scales
        scale_y_continuous(
                labels = scales::comma, 
                sec.axis = sec_axis( trans=~./max_health_harm, name="Percent", labels = scales::percent_format(accuracy = 1))
        ) +
        theme(
                axis.text.x = element_text(angle = 90),
                axis.text.y.right = element_text(color = "red"), 
                axis.title.y.right = element_text(color = "red")
        )

As we can see, regarding population health (this includes the sum of fatalities and injuries), from 1996 to 2011 tornadoes were, by far, the most harmful event, being the cause of 33.2% of total deaths plus injuries in that period, followed by excessive heat with 12.3% and flood with 10.8%. The top 20 storm events were responsible for 93.8% of total fatalities plus injuries.

Across the United States, which types of events have the greatest economic consequences?

The second question in our analysis will be answered in a similar way as the previous one, considering, however, only the observations with economical consequences.

storm_data_econ <- storm_data_clean %>%
        group_by(EVTYPE) %>%
        summarise(total_econ = sum(economic_conseq)) %>%
        arrange(desc(total_econ)) %>%
        mutate(EVTYPE = fct_reorder(EVTYPE, desc(total_econ)), total_econ = total_econ / 10^9) %>% # Reorder levels of event types by relevance and change total_econ scale
        mutate(cum_percent = cumsum( total_econ / sum(total_econ)) ) %>% # Calculate the cumulative sum over the total
        slice_max(total_econ, n = 20) # Keep only the 20 most relevant event types

The economical impact data frame can be seen below (total_econ var in Billions).

storm_data_econ

## # A tibble: 20 x 3
##    EVTYPE            total_econ cum_percent
##    <fct>                  <dbl>       <dbl>
##  1 FLOOD                149.          0.371
##  2 HURRICANE             86.5         0.586
##  3 STORM SURGE           43.2         0.694
##  4 TORNADO               24.9         0.756
##  5 HAIL                  17.1         0.798
##  6 FLASH FLOOD           16.6         0.840
##  7 DROUGHT               14.4         0.875
##  8 TROPICAL STORM         8.32        0.896
##  9 WILDFIRE               8.16        0.917
## 10 HIGH WIND              5.88        0.931
## 11 TSTM WIND              5.04        0.944
## 12 STORM SURGE/TIDE       4.64        0.955
## 13 THUNDERSTORM WIND      3.78        0.965
## 14 ICE STORM              3.66        0.974
## 15 WINTER STORM           1.54        0.978
## 16 EXTREME COLD           1.33        0.981
## 17 HEAVY RAIN             1.31        0.984
## 18 FROST/FREEZE           1.10        0.987
## 19 LIGHTNING              0.750       0.989
## 20 HEAVY SNOW             0.706       0.991

Now, to the plot that will answer our second question.

max_econ_conseq = max(storm_data_econ$total_econ) # This code is used to create a second proportional y label in the following plot

ggplot(storm_data_econ, aes(x = EVTYPE)) + 
        theme_light() +
        geom_col(mapping = aes(y = total_econ), fill = "seagreen", color = "black", alpha = 0.5) + 
        geom_line(aes(y = cum_percent * max_econ_conseq, group = NA), size = 0.5, color = "red") +
        labs(x = "Event Type", y = "Total Economic Impact (Billions)") +
        # Format label scales
        scale_y_continuous(
                labels = unit_format(unit = "B"),
                sec.axis = sec_axis( trans=~./max_econ_conseq, name="Percent", labels = scales::percent_format(accuracy = 1))
        ) +
        theme(
                axis.text.x = element_text(angle = 90),
                axis.text.y.right = element_text(color = "red"), 
                axis.title.y.right = element_text(color = "red")
        )

Regarding the economical impact, we can clearly see that floods was the main event type to damage properties and crops between 1996 and 2011, being responsible for 37.1% of total damage value, followed by hurricanes with 21.5% and storm surges with 10.8%. In this case, the top 20 storm events were responsible for 99.1% of total damage value.

For those variables, any value different from the ones mentioned in the examples were considered as 10^0↩︎
For those variables, any value different from the ones mentioned in the examples were considered as 10^0↩︎

Coursera - Reproducible Research - Course Project 2

FKurzhals

18th October, 2020