Synopsis

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

This project explores the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, which tracks characteristics of major storms and weather events in the US as well as estimates of any fatalities, injuries, property and crop damages.

The analysis aims to identify which types of events are most harmful to population health and which have the greatest economic consequences across the United States. To this end, the required information was extracted from the database and processed, and the results are presented in two graphs.

The results show that tornadoes are the most harmful events with respect to population health, whereas floods have the greatest economic consequences.

Data Processing

Loading all the required R packages:

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
library(stringdist)
## 
## Attaching package: 'stringdist'
## The following object is masked from 'package:tidyr':
## 
##     extract
library(stringr)
library(ggplot2)

Loading the data:

data <- read.csv("repdata_data_StormData.csv.bz2")

The data analysis aims to address the following questions:

1. Across the US, which types of events are most harmful with respect to population health?

2. Across the US, which types of events have the greatest economic consequences?

Since we are interested only in events across the whole US, not in individual states or points in time, the dataset is first reduced to the columns needed (the event type plus the fatality, injury and damage columns). Rows in which all four measures (FATALITIES, INJURIES, PROPDMG, CROPDMG) are missing (NA) or zero are then removed:

data <- data %>%
        select(EVTYPE, FATALITIES:CROPDMGEXP) %>%
        filter(!if_all(
                all_of(c("FATALITIES","INJURIES", "PROPDMG", "CROPDMG")),
                ~ is.na(.) | . == 0)
               )
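
To illustrate the row filter, here is a minimal sketch on a made-up three-row table (the toy data frame and its values are assumptions for illustration, not taken from the storm database). Only rows in which at least one measure is neither missing nor zero survive:

```r
library(dplyr)

# Hypothetical toy data: row "a" is all zero, row "c" is NA/zero
toy <- data.frame(EVTYPE     = c("a", "b", "c"),
                  FATALITIES = c(0, 1, NA),
                  INJURIES   = c(0, 0, 0))

# Drop rows where every listed column is NA or 0
kept <- toy %>%
        filter(!if_all(all_of(c("FATALITIES", "INJURIES")),
                       ~ is.na(.x) | .x == 0))

kept$EVTYPE # only "b" remains
```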

In the original dataset, event types are highly redundant: many entries describe the same kind of event but are spelled differently. To simplify the dataset, the event types are standardized. First, all entries in the EVTYPE column are converted to lowercase and stripped of surrounding whitespace, and a list of reference event types is created based on the documentation of the database. Then, the amatch() function from the stringdist package (method "jw", maximum distance 0.2) is used to correct typos and spelling variations; entries without a sufficiently close match keep their original value. The corrected event types are stored in the column evtype_corrected.

data$EVTYPE <- data$EVTYPE %>%
        tolower() %>%
        str_trim()
events <- c("avalanche",
            "blizzard",
            "cold",
            "debris flow",
            "drought",
            "dust",
            "flood",
            "fog",
            "freeze",
            "funnel cloud",
            "hail",
            "heat",
            "surf",
            "hurricane",
            "ice storm",
            "lightning",
            "low tide",
            "marine events",
            "rain",
            "rip current",
            "seiche",
            "sleet",
            "smoke",
            "snow",
            "storm tide",
            "thunderstorm",
            "tornado",
            "tropical events",
            "tsunami",
            "volcanic ash",
            "waterspout",
            "wildfire",
            "wind",
            "winter")
matches <- amatch(data$EVTYPE, events, method = "jw", maxDist = 0.2)
data$evtype_corrected <- ifelse(
        is.na(matches),
        data$EVTYPE,
        events[matches]
)
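
The matching step can be tried in isolation. The sketch below uses a shortened, hypothetical reference list and two made-up entries: one is a near miss of a reference entry, the other shares no letters with any of them and therefore stays as it is.

```r
library(stringdist)

# Hypothetical reference list and raw entries, for illustration only
reference <- c("tornado", "hail", "flood")
raw <- c("tornadoe", "xyzq")

# Index of the closest reference entry, or NA if none is within maxDist
idx <- amatch(raw, reference, method = "jw", maxDist = 0.2)

# Keep the raw entry where no reference entry is close enough
corrected <- ifelse(is.na(idx), raw, reference[idx])
corrected
```

A larger maxDist corrects more spelling variants but also raises the risk that unrelated short entries are merged, so the threshold is a trade-off rather than a fixed rule.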

Then, the corrected event types are grouped into broader categories:

data <- data %>%
  mutate(
    evtype_final = case_when(
      str_detect(evtype_corrected, "thunderstorm|tstm") ~ "thunderstorm",
      str_detect(evtype_corrected, "wind|storm") ~ "wind",
      str_detect(evtype_corrected, "cold") ~ "cold",
      str_detect(evtype_corrected, "rain") ~ "rain",
      str_detect(evtype_corrected, "flood") ~ "flood",
      str_detect(evtype_corrected, "hail") ~ "hail",
      str_detect(evtype_corrected, "dust") ~ "dust",
      str_detect(evtype_corrected, "fog") ~ "fog",
      str_detect(evtype_corrected, "heat") ~ "heat",
      str_detect(evtype_corrected, "tide") ~ "tide",
      str_detect(evtype_corrected, "snow|ice") ~ "snow/ice",
      str_detect(evtype_corrected, "winter") ~ "winter",
      str_detect(evtype_corrected, "freez|frost") ~ "freeze",
      str_detect(evtype_corrected, "smoke") ~ "smoke",
      str_detect(evtype_corrected, "fire") ~ "wildfire",
      str_detect(evtype_corrected, "surf") ~ "surf",
      str_detect(evtype_corrected, "tropic") ~ "tropical events",
      str_detect(evtype_corrected, "marine") ~ "marine events",
      TRUE ~ evtype_corrected
    )
  )
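
Note that case_when() assigns the value of the first condition that matches, so the order of the patterns determines the category. The following sketch applies three of the patterns to a few made-up entries; an entry such as "ice storm" contains "storm" and therefore ends up in "wind" before the "snow|ice" pattern is checked.

```r
library(dplyr)
library(stringr)

# Hypothetical entries, for illustration only
x <- c("thunderstorm wind", "ice storm", "heavy snow")

# Conditions are evaluated top to bottom; the first match wins
grouped <- case_when(
        str_detect(x, "thunderstorm|tstm") ~ "thunderstorm",
        str_detect(x, "wind|storm")        ~ "wind",
        str_detect(x, "snow|ice")          ~ "snow/ice",
        TRUE                               ~ x
)
grouped
```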

Damage amounts are stored in two parts: a number (columns PROPDMG and CROPDMG) and a letter giving its order of magnitude (columns PROPDMGEXP and CROPDMGEXP; K = thousand, M = million, B = billion). The two parts are combined, and the resulting dollar amounts are stored in the new columns PROPDMG_USD and CROPDMG_USD.

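# Multipliers for the magnitude codes
mult <- c(K = 1e3,
          M = 1e6,
          B = 1e9)
# Codes are matched case-insensitively; any other code (including an empty
# entry) yields NA and is skipped by the sums in the next step
data <- data %>%
        mutate(PROPDMG_USD = PROPDMG * unname(mult[toupper(PROPDMGEXP)]),
               CROPDMG_USD = CROPDMG * unname(mult[toupper(CROPDMGEXP)]))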
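
As a small worked example of the conversion, the sketch below looks up a few made-up values and codes in the multiplier vector. A code that is not a name of the vector (here an empty string) gives NA, which the later sums drop via na.rm = TRUE.

```r
mult <- c(K = 1e3, M = 1e6, B = 1e9)

# Hypothetical damage values and magnitude codes, for illustration only
dmg  <- c(25, 2.5, 1, 10)
code <- c("K", "M", "B", "")

# Look up each code by name and multiply
usd <- dmg * unname(mult[code])
usd                     # the last value is NA: "" matches no name in mult
sum(usd, na.rm = TRUE)  # NA entries are skipped when summing
```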

Next, a summary table is created summarizing the number of fatalities, injuries and the sum of expenses per event type:

data_summary <- data %>%
        group_by(evtype_final) %>%
        summarize(fatalities = sum(FATALITIES, na.rm = TRUE),
                  injuries = sum(INJURIES, na.rm = TRUE),
                  prop_dmg = sum(PROPDMG_USD, na.rm = TRUE),
                  crop_dmg = sum(CROPDMG_USD, na.rm = TRUE)
        )

Next, fatalities and injuries are combined into one column (harmful_events), and property and crop damage expenses into another (damage_expenses):

data_summary <- data_summary %>%
        mutate(harmful_events = fatalities + injuries,
               damage_expenses = prop_dmg + crop_dmg)

Last, the summary table is divided into two distinct tables (one for harmful events and one for damage expenses) and converted into long format for easier plotting:

harmful_events <- data_summary %>%
        select(evtype_final, fatalities, injuries, harmful_events) %>%
        pivot_longer(
                cols = c(fatalities, injuries),
                names_to = "type",
                values_to = "cases"
                )

damage_expenses <- data_summary %>%
        select(evtype_final, prop_dmg, crop_dmg, damage_expenses) %>%
        pivot_longer(
                cols = c(prop_dmg, crop_dmg),
                names_to = "type",
                values_to = "expenses"
        )
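
To show what the long format looks like, here is a minimal sketch on a made-up two-row summary table (the values are assumptions for illustration). After pivoting, each event type occupies two rows: the per-type count sits in cases, while the combined total in harmful_events is repeated on both rows, so adding it up across rows would count it twice.

```r
library(tidyr)

# Hypothetical summary table, for illustration only
toy <- data.frame(evtype_final = c("tornado", "heat"),
                  fatalities   = c(5, 3),
                  injuries     = c(50, 7))
toy$harmful_events <- toy$fatalities + toy$injuries

# One row per event type and measure
long <- pivot_longer(toy,
                     cols = c(fatalities, injuries),
                     names_to = "type",
                     values_to = "cases")
long
```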

Results

1. Across the US, which types of events are most harmful with respect to population health?

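First, the long table of harmful events is sorted in decreasing order of the combined number of fatalities and injuries. Because each event type occupies two rows (one for fatalities, one for injuries), the first 10 rows correspond to the top 5 event types, which are then plotted: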

harmful_events <- harmful_events %>%
        arrange(desc(harmful_events)) %>%
        slice_head(n = 10) # 5 event types x 2 rows (fatalities, injuries)
# Treat the event types as a factor to keep the sorted order for plotting
harmful_events$evtype_final <- factor(harmful_events$evtype_final,
                                      levels = unique(harmful_events$evtype_final))
# y is the per-type count (cases); mapping the combined total instead would
# stack it once per type and double the bar height
ggplot(harmful_events, aes(x = evtype_final, y = cases, fill = type)) +
                geom_col() +
                labs(
                        title = "Influence of weather events on population health",
                        x = "Event type",
                        y = "Number of cases",
                        fill = "Type"
                ) +
                scale_fill_manual(
                        values = c("red", "black")
                        ) +
                theme(
                        title = element_text(size = 16),
                        axis.text = element_text(size = 12),
                        axis.title.x = element_text(size = 16),
                        axis.title.y = element_text(size = 16),
                        legend.text = element_text(size = 12)
                )

As the plot shows, tornadoes are by far the most harmful events with respect to population health, as they cause the most fatalities and injuries.
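
The top-5 selection used for the plot relies on every event type contributing exactly two rows to the long table. As a hedged alternative sketch (not the method used in this analysis), the top event types could be picked from the wide summary table with slice_max() before pivoting, which makes the number of event types explicit. The toy table below is made up for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical summary table, for illustration only
toy <- data.frame(evtype_final = c("a", "b", "c"),
                  fatalities   = c(1, 5, 2),
                  injuries     = c(10, 50, 1))
toy$harmful_events <- toy$fatalities + toy$injuries

# Pick the top 2 event types first, then convert to long format
top_long <- toy %>%
        slice_max(harmful_events, n = 2, with_ties = FALSE) %>%
        pivot_longer(cols = c(fatalities, injuries),
                     names_to = "type",
                     values_to = "cases")

unique(top_long$evtype_final) # "b" then "a"
```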

2. Across the US, which types of events have the greatest economic consequences?

To answer the second question, the long table of damage expenses is sorted in decreasing order of the combined damage expenses, the values are divided by 1,000,000,000 to express them in billion USD, and the top 5 event types (again the first 10 rows) are plotted:

damage_expenses <- damage_expenses %>%
        arrange(desc(damage_expenses)) %>%
        mutate(damage_expenses = damage_expenses / 1e9, # convert to billion USD
               expenses = expenses / 1e9) %>%
        slice_head(n = 10) # 5 event types x 2 rows (property, crop)
# Treat the event types as a factor to keep the sorted order for plotting
damage_expenses$evtype_final <- factor(damage_expenses$evtype_final,
                                       levels = unique(damage_expenses$evtype_final))
# y is the per-type amount (expenses); mapping the combined total instead
# would stack it once per type and double the bar height
ggplot(damage_expenses, aes(x = evtype_final, y = expenses, fill = type)) +
                geom_col() +
                labs(
                        title = "Damage expenses per weather event type",
                        x = "Event type",
                        y = "Expenses in billion USD",
                        fill = "Type"
                ) +
                scale_fill_manual(
                        labels = c("Crop damage", "Property damage"),
                        values = c("red", "black")
                        ) +
                theme(
                        title = element_text(size = 18),
                        axis.text = element_text(size = 12),
                        axis.title.x = element_text(size = 16),
                        axis.title.y = element_text(size = 16),
                        legend.text = element_text(size = 12)
                )

The graph shows that flood events cause the greatest damage-related expenses and therefore have the greatest economic consequences. They are followed by hurricanes, wind events and tornadoes.