Synopsis

In this report we aim to describe the effects of storms across the United States on population health and the economy, between January 1950 to November 2011.

Our overall hypothesis is that certain storm event types have a noticeably more negatitve impact on either population health or the economy.

To investigate our hypothesis, we obtained USA storm data orginally from the US National Oceanic and Atmospheric Administration (NOAA), downloaded from the Cousera course website for the period January 1950 to November 2011.

From these data, we found that:

Impact on Population Health: Tornado events are by far the most harmful events for population health, for both fatalities and for injuries. Besides tornados: the events heat, flooding, thunderstorms, lightning and other storms are the most harmful events for population health.
Impact on the Economy: When considering total economic damage (crops and property), then floods are the most damaging, followed by hurricanes, storms, tornados, precipitation and dry weather. However if considering only crop damage, then dry weather is the most damaging event, followed by flooding, hurricanes, ice, precipitation.

note: because less data is available for earlier years, the data may under-represent the actual impact for early years compared to later years - in this study, we consider the totals for the entire period.

Data Processing

Loading and Processing the Raw Data

From the Cousera course website we obtained NOAA data on storms across the USA for the period January 1950 to November 2011.

source("utils/utils.download.R")
data_url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
data_dir <- "./temp/raw-data"
data_filename <- "StormData.csv.bz2"
raw_data_file <- download_if_missing(data_dir, data_filename, data_url)

## [1] "File already downloaded!"
## [1] "dest_file:  ./temp/raw-data/StormData.csv.bz2"

We also obtained documentation about the data:

National Weather Service Storm Data Documentation
National Climatic Data Center Storm Events FAQ

Reading in the data

r_mirror_nl <- "https://mirrors.evoluso.com/CRAN/"

library_quiet <- function(lib_name) {
  if (!requireNamespace(lib_name, quietly = TRUE)) {
    install.packages(lib_name, repos = r_mirror_nl)
  }

  suppressWarnings(suppressPackageStartupMessages(library(lib_name, character.only = TRUE)))
}

library_quiet("readr")
df_storms <- read_csv(raw_data_file, show_col_types = FALSE)

Interesting Fields

EVTYPE: TORNADO, …
- FAQ: event name of Landslide was renamed to Debris Flow (checked: the data was already updated)
- some event names are synonyms, need to normalize
BGN_DATE, END_DATE - useful to see what years have data
- assumptions: time is not significant, accurate enough if use the year of BGN_DATE
Health: FATALITIES, INJURIES
Economy: POPDMG, POPDMGEXP, CROPDMG, CROPDMGEXP
- need to normalize the damage values, so can aggregate
other fields:
- STATE - the US state name
- MAG - magnitude of the event (a further analysis could consider different magnitudes of the same event type)
- REMARKS - long commentaries, not relevant
Check: how much data is available, by year
note: Any new columns added by this analysis have the prefix comp_ (‘computed’)

# Normalize the dates

library_quiet("dplyr")
library_quiet("lubridate")

df_storms <- df_storms %>%
  mutate(begin_date = mdy_hms(BGN_DATE))

print(paste("Raw-data Date range:", min(df_storms$begin_date, na.rm = TRUE),
  max(df_storms$begin_date, na.rm = TRUE)))

## [1] "Raw-data Date range: 1950-01-03 2011-11-30"

# Check: how much data is available, by year

# Convert BGN_DATE to a proper Date object and extract year
df_storms$comp_year <- year(mdy_hms(df_storms$BGN_DATE))

df_storms %>%
  group_by(comp_year) %>%
  summarise(count = n()) %>%
  group_by(comp_year) %>%
  summarise(total_count = sum(count)) %>%
  arrange(comp_year) -> df_yearly_counts

sort(unique(df_storms$comp_year))

##  [1] 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964
## [16] 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979
## [31] 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994
## [46] 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009
## [61] 2010 2011

print(df_yearly_counts)

## # A tibble: 62 × 2
##    comp_year total_count
##        <dbl>       <int>
##  1      1950         223
##  2      1951         269
##  3      1952         272
##  4      1953         492
##  5      1954         609
##  6      1955        1413
##  7      1956        1703
##  8      1957        2184
##  9      1958        2213
## 10      1959        1813
## # ℹ 52 more rows

# Normalize crop-damage

crop_exp_map <- c(
  "?" = 1,
  "0" = 1,
  "2" = 100,
  "B" = 1e9,
  "k" = 1e3,
  "K" = 1e3,
  "m" = 1e6,
  "M" = 1e6
)

# Safe lookup function for CROPDMGEXP
crop_multiplier_fn <- function(x) {
  if (is.na(x)) return(1)
  if (!is.null(crop_exp_map[[x]])) return(crop_exp_map[[x]])
  return(1)
}

# Compute normalized crop damage
df_storms$comp_CROPDMG_NORM <- df_storms$CROPDMG * sapply(as.character(df_storms$CROPDMGEXP), crop_multiplier_fn)

# Normalize property damage

exp_map <- c(
  "-" = 1,
  "?" = 1,
  "+" = 1,
  "0" = 1,
  "1" = 10,
  "2" = 100,
  "3" = 1000,
  "4" = 1e4,
  "5" = 1e5,
  "6" = 1e6,
  "7" = 1e7,
  "8" = 1e8,
  "B" = 1e9,
  "h" = 100,
  "H" = 100,
  "K" = 1e3,
  "m" = 1e6,
  "M" = 1e6
)

# Safer lookup function with empty string handled explicitly
multiplier_fn <- function(x) {
  if (is.na(x) || x == "") return(1)
  if (!is.null(exp_map[[x]])) return(exp_map[[x]])
  return(1)
}

df_storms$comp_PROPDMG_NORM <- df_storms$PROPDMG * sapply(as.character(df_storms$PROPDMGEXP), multiplier_fn)

# Compute Month, Season for exploration
df_storms <- df_storms %>%
  mutate(
    BGN_DATE_parsed = mdy_hms(BGN_DATE),
    comp_MONTH = month(BGN_DATE_parsed),
    comp_SEASON = case_when(
      comp_MONTH %in% c(12, 1, 2)  ~ "Winter",
      comp_MONTH %in% 3:5          ~ "Spring",
      comp_MONTH %in% 6:8          ~ "Summer",
      comp_MONTH %in% 9:11         ~ "Autumn",
      TRUE                    ~ NA_character_
    )
  )

# Select columns - check date range for relevant data
df_select <- df_storms %>%
  select(begin_date,comp_year,comp_MONTH,comp_SEASON,EVTYPE,MAG,STATE,FATALITIES,
    INJURIES,comp_PROPDMG_NORM,comp_CROPDMG_NORM)

print(paste("Select-data Date range WITHOUT filter:", 
  min(df_select$begin_date, na.rm = TRUE), max(df_select$begin_date, na.rm = TRUE)))

## [1] "Select-data Date range WITHOUT filter: 1950-01-03 2011-11-30"

# Check dates when we remove rows that have 0 or NA for all metrics:
df_filtered_health <- df_select %>%
  filter(!( (is.na(FATALITIES) | FATALITIES == 0) &
              (is.na(INJURIES) | INJURIES == 0)  ))

print(paste("Date range for relevant data [population health]:", 
  min(df_filtered_health$begin_date, na.rm = TRUE), max(df_filtered_health$begin_date, na.rm = TRUE)))

## [1] "Date range for relevant data [population health]: 1950-01-03 2011-11-30"

df_filtered_economy <- df_select %>%
  filter(!( (is.na(comp_PROPDMG_NORM) | comp_PROPDMG_NORM == 0) &
              (is.na(comp_CROPDMG_NORM) | comp_CROPDMG_NORM == 0) ))

print(paste("Date range for relevant data [economy]:", 
  min(df_filtered_economy$begin_date, na.rm = TRUE), max(df_filtered_economy$begin_date, na.rm = TRUE) ))

## [1] "Date range for relevant data [economy]: 1950-01-03 2011-11-30"

# Normalize EVTYPE -> COMP_EVTYPE

# The type EVTYPE has too many discrete values, many synonyms and some mis-spellings.
# Merge down for consistency and a data set that is simpler to report on.
library_quiet("stringr")
df_select <- df_select %>%
  mutate(
    COMP_EVTYPE = tolower(EVTYPE),                       # Step 1: lowercase
    COMP_EVTYPE = str_split_fixed(COMP_EVTYPE, "/", 2)[,1],  # Step 2: take part before "/"
    COMP_EVTYPE = case_when(                             # Step 3: standardize categories
      str_detect(COMP_EVTYPE, "cold|cool|low temp|hypothermia|freez|frost") ~ "cold",
      str_detect(COMP_EVTYPE, "dry|drought|dust|driest") ~ "dry",
      str_detect(COMP_EVTYPE, "erosion|erosin") ~ "erosion",
      str_detect(COMP_EVTYPE, "fire") ~ "fire",
      str_detect(COMP_EVTYPE, "flood|stream|drown|dam|water") ~ "flooding",
      str_detect(COMP_EVTYPE, "fog") ~ "fog",
      str_detect(COMP_EVTYPE, "hot|heat|high temp|hyperthermia|record temp|warm") ~ "heat",
      str_detect(COMP_EVTYPE, "hurricane") ~ "hurricane",
      str_detect(COMP_EVTYPE, "ice|icy|glaze") ~ "ice",
      str_detect(COMP_EVTYPE, "landslide|landslides|mudslide|landslump|mud") ~ "landslide",
      str_detect(COMP_EVTYPE, "lightning|ligntning") ~ "lightning",
      str_detect(COMP_EVTYPE, "marine accident|marine mishap") ~ "marine accident",
      str_detect(COMP_EVTYPE, "none|no severe weather") ~ "none",
      str_detect(COMP_EVTYPE, "precipitation|hail|precip|rain|sleet|shower|wet") ~ "precipitation",
      str_detect(COMP_EVTYPE, "snow|blizzard|avalanche") ~ "snow",
      str_detect(COMP_EVTYPE, "storm") ~ "storm",
      str_detect(COMP_EVTYPE, "summary") ~ "summary",
      str_detect(COMP_EVTYPE, "thunderstorm|tstm|tunderstorm|thunderstrom|thundestrom|thundertorm|thundersnow|thundeerstorm|microburst|downburst")
        ~ "thunderstorm",
      str_detect(COMP_EVTYPE, "tornado|funnel|torndao") ~ "tornado",
      str_detect(COMP_EVTYPE, "tsunami|sea|surf|swell|surge|current|tide|wave") ~ "sea waves",
      str_detect(COMP_EVTYPE, "unknown|/?|other|unusual") ~ "unknown",
      str_detect(COMP_EVTYPE, "volcan") ~ "volcano",
      str_detect(COMP_EVTYPE, "waterspout|spout") ~ "waterspout",
      str_detect(COMP_EVTYPE, "wind|wnd|gust") ~ "wind",
      TRUE ~ COMP_EVTYPE
    )
  )
sort(unique(df_select$COMP_EVTYPE))

##  [1] "cold"            "dry"             "erosion"         "fire"           
##  [5] "flooding"        "fog"             "heat"            "hurricane"      
##  [9] "ice"             "landslide"       "lightning"       "marine accident"
## [13] "none"            "precipitation"   "sea waves"       "snow"           
## [17] "storm"           "summary"         "thunderstorm"    "tornado"        
## [21] "unknown"

Results

Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

# Plot event impact on population health

health_summary <- df_select %>%
  group_by(COMP_EVTYPE) %>%
  summarise(
    total_fatalities = sum(FATALITIES, na.rm = TRUE),
    total_injuries = sum(INJURIES, na.rm = TRUE),
    total_harmed = total_fatalities + total_injuries
  ) %>%
  arrange(desc(total_harmed))

library_quiet("ggplot2")
library_quiet("tidyr")

# Prepare top 10 event types by total_harmed
top10 <- health_summary %>%
  slice_max(order_by = total_harmed, n = 10) %>%
  select(COMP_EVTYPE, total_fatalities, total_injuries) %>%
  pivot_longer(
    cols = c(total_fatalities, total_injuries),
    names_to = "harm_type",
    values_to = "count"
  ) %>%
  mutate(harm_type = recode(harm_type,
                            total_fatalities = "Fatalities",
                            total_injuries = "Injuries"))

# Plot grouped bar chart with spacing between groups
ggplot(top10, aes(x = reorder(COMP_EVTYPE, -count), y = count, fill = harm_type)) +
  geom_col(position = position_dodge(width = 0.8), width = 0.35) +  # width controls bar width
  scale_fill_manual(values = c("Fatalities" = "red", "Injuries" = "yellow")) +
  labs(
    title = "Population Health: Top 10 Most Harmful Event Types",
    subtitle = "by Fatalities and Injuries (NOAA: Jan 1950 to Nov 2011)",
    x = "Event Type",
    y = "Count",
    fill = "Harm Type"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    panel.grid.major.x = element_blank()
  )

Answer: Tornado events are by far the most harmful events for population health, for both fatalities and for injuries. Besides tornados: the events heat, flooding, thunderstorms, lightning and other storms are the most harmful events for population health.

Across the United States, which types of events have the greatest economic consequences?

# Plot event impact on the economy

cost_summary <- df_select %>%
  group_by(COMP_EVTYPE) %>%
  summarise(
    total_crop_damage = sum(comp_CROPDMG_NORM, na.rm = TRUE),
    total_property_damage = sum(comp_PROPDMG_NORM, na.rm = TRUE),
    total_damage = total_crop_damage + total_property_damage
  ) %>%
  arrange(desc(total_damage))

# Prepare top 10 event types by total_damage
top10 <- cost_summary %>%
  slice_max(order_by = total_damage, n = 10) %>%
  select(COMP_EVTYPE, total_crop_damage, total_property_damage) %>%
  pivot_longer(
    cols = c(total_crop_damage, total_property_damage),
    names_to = "harm_type",
    values_to = "count"
  ) %>%
  mutate(harm_type = recode(harm_type,
                            total_crop_damage = "Crop Damage",
                            total_property_damage = "Property Damage"),
                            count = count / 1e6  # Convert to millions
                            )

# Plot grouped bar chart with spacing between groups
ggplot(top10, aes(x = reorder(COMP_EVTYPE, -count), y = count, fill = harm_type)) +
  geom_col(position = position_dodge(width = 0.8), width = 0.35) +  # width controls bar width
  scale_fill_manual(values = c("Crop Damage" = "orange", "Property Damage" = "yellow")) +
  labs(
    title = "Economy: Top 10 Most Harmful Event Types",
    subtitle = "by Crop and Property Damage (NOAA: Jan 1950 to Nov 2011)",
    x = "Event Type",
    y = "Damage (Millions of USD)",
    fill = "Harm Type"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    panel.grid.major.x = element_blank()
  )

note from the FAQ: “Property and Crop damage should be considered as a broad estimate”.

Answer: When considering total economic damage (crops and property), then floods are the most damaging, followed by hurricanes, storms, tornados, precipitation and dry weather. However if considering only crop damage, then dry weather is the most damaging event, followed by flooding, hurricanes, ice, precipitation.

Health and Economic Effects of Storms Across USA 1950-2011

Sean Ryan

2025-08-04