In this report we aim to describe the effects of storms across the United States on population health and the economy, between January 1950 to November 2011.
Our overall hypothesis is that certain storm event types have a noticeably more negatitve impact on either population health or the economy.
To investigate our hypothesis, we obtained USA storm data orginally from the US National Oceanic and Atmospheric Administration (NOAA), downloaded from the Cousera course website for the period January 1950 to November 2011.
From these data, we found that:
Impact on Population Health: Tornado events are by far the most harmful events for population health, for both fatalities and for injuries. Besides tornados: the events heat, flooding, thunderstorms, lightning and other storms are the most harmful events for population health.
Impact on the Economy: When considering total economic damage (crops and property), then floods are the most damaging, followed by hurricanes, storms, tornados, precipitation and dry weather. However if considering only crop damage, then dry weather is the most damaging event, followed by flooding, hurricanes, ice, precipitation.
note: because less data is available for earlier years, the data may under-represent the actual impact for early years compared to later years - in this study, we consider the totals for the entire period.
From the Cousera course website we obtained NOAA data on storms across the USA for the period January 1950 to November 2011.
source("utils/utils.download.R")
data_url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
data_dir <- "./temp/raw-data"
data_filename <- "StormData.csv.bz2"
raw_data_file <- download_if_missing(data_dir, data_filename, data_url)
## [1] "File already downloaded!"
## [1] "dest_file: ./temp/raw-data/StormData.csv.bz2"
We also obtained documentation about the data:
r_mirror_nl <- "https://mirrors.evoluso.com/CRAN/"
library_quiet <- function(lib_name) {
if (!requireNamespace(lib_name, quietly = TRUE)) {
install.packages(lib_name, repos = r_mirror_nl)
}
suppressWarnings(suppressPackageStartupMessages(library(lib_name, character.only = TRUE)))
}
library_quiet("readr")
df_storms <- read_csv(raw_data_file, show_col_types = FALSE)
EVTYPE: TORNADO, …
BGN_DATE, END_DATE - useful to see what years have data
Health: FATALITIES, INJURIES
Economy: POPDMG, POPDMGEXP, CROPDMG, CROPDMGEXP
other fields:
Check: how much data is available, by year
note: Any new columns added by this analysis have the prefix comp_ (‘computed’)
# Normalize the dates
library_quiet("dplyr")
library_quiet("lubridate")
df_storms <- df_storms %>%
mutate(begin_date = mdy_hms(BGN_DATE))
print(paste("Raw-data Date range:", min(df_storms$begin_date, na.rm = TRUE),
max(df_storms$begin_date, na.rm = TRUE)))
## [1] "Raw-data Date range: 1950-01-03 2011-11-30"
# Check: how much data is available, by year
# Convert BGN_DATE to a proper Date object and extract year
df_storms$comp_year <- year(mdy_hms(df_storms$BGN_DATE))
df_storms %>%
group_by(comp_year) %>%
summarise(count = n()) %>%
group_by(comp_year) %>%
summarise(total_count = sum(count)) %>%
arrange(comp_year) -> df_yearly_counts
sort(unique(df_storms$comp_year))
## [1] 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964
## [16] 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979
## [31] 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994
## [46] 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009
## [61] 2010 2011
print(df_yearly_counts)
## # A tibble: 62 × 2
## comp_year total_count
## <dbl> <int>
## 1 1950 223
## 2 1951 269
## 3 1952 272
## 4 1953 492
## 5 1954 609
## 6 1955 1413
## 7 1956 1703
## 8 1957 2184
## 9 1958 2213
## 10 1959 1813
## # ℹ 52 more rows
# Normalize crop-damage
crop_exp_map <- c(
"?" = 1,
"0" = 1,
"2" = 100,
"B" = 1e9,
"k" = 1e3,
"K" = 1e3,
"m" = 1e6,
"M" = 1e6
)
# Safe lookup function for CROPDMGEXP
crop_multiplier_fn <- function(x) {
if (is.na(x)) return(1)
if (!is.null(crop_exp_map[[x]])) return(crop_exp_map[[x]])
return(1)
}
# Compute normalized crop damage
df_storms$comp_CROPDMG_NORM <- df_storms$CROPDMG * sapply(as.character(df_storms$CROPDMGEXP), crop_multiplier_fn)
# Normalize property damage
exp_map <- c(
"-" = 1,
"?" = 1,
"+" = 1,
"0" = 1,
"1" = 10,
"2" = 100,
"3" = 1000,
"4" = 1e4,
"5" = 1e5,
"6" = 1e6,
"7" = 1e7,
"8" = 1e8,
"B" = 1e9,
"h" = 100,
"H" = 100,
"K" = 1e3,
"m" = 1e6,
"M" = 1e6
)
# Safer lookup function with empty string handled explicitly
multiplier_fn <- function(x) {
if (is.na(x) || x == "") return(1)
if (!is.null(exp_map[[x]])) return(exp_map[[x]])
return(1)
}
df_storms$comp_PROPDMG_NORM <- df_storms$PROPDMG * sapply(as.character(df_storms$PROPDMGEXP), multiplier_fn)
# Compute Month, Season for exploration
df_storms <- df_storms %>%
mutate(
BGN_DATE_parsed = mdy_hms(BGN_DATE),
comp_MONTH = month(BGN_DATE_parsed),
comp_SEASON = case_when(
comp_MONTH %in% c(12, 1, 2) ~ "Winter",
comp_MONTH %in% 3:5 ~ "Spring",
comp_MONTH %in% 6:8 ~ "Summer",
comp_MONTH %in% 9:11 ~ "Autumn",
TRUE ~ NA_character_
)
)
# Select columns - check date range for relevant data
df_select <- df_storms %>%
select(begin_date,comp_year,comp_MONTH,comp_SEASON,EVTYPE,MAG,STATE,FATALITIES,
INJURIES,comp_PROPDMG_NORM,comp_CROPDMG_NORM)
print(paste("Select-data Date range WITHOUT filter:",
min(df_select$begin_date, na.rm = TRUE), max(df_select$begin_date, na.rm = TRUE)))
## [1] "Select-data Date range WITHOUT filter: 1950-01-03 2011-11-30"
# Check dates when we remove rows that have 0 or NA for all metrics:
df_filtered_health <- df_select %>%
filter(!( (is.na(FATALITIES) | FATALITIES == 0) &
(is.na(INJURIES) | INJURIES == 0) ))
print(paste("Date range for relevant data [population health]:",
min(df_filtered_health$begin_date, na.rm = TRUE), max(df_filtered_health$begin_date, na.rm = TRUE)))
## [1] "Date range for relevant data [population health]: 1950-01-03 2011-11-30"
df_filtered_economy <- df_select %>%
filter(!( (is.na(comp_PROPDMG_NORM) | comp_PROPDMG_NORM == 0) &
(is.na(comp_CROPDMG_NORM) | comp_CROPDMG_NORM == 0) ))
print(paste("Date range for relevant data [economy]:",
min(df_filtered_economy$begin_date, na.rm = TRUE), max(df_filtered_economy$begin_date, na.rm = TRUE) ))
## [1] "Date range for relevant data [economy]: 1950-01-03 2011-11-30"
# Normalize EVTYPE -> COMP_EVTYPE
# The type EVTYPE has too many discrete values, many synonyms and some mis-spellings.
# Merge down for consistency and a data set that is simpler to report on.
library_quiet("stringr")
df_select <- df_select %>%
mutate(
COMP_EVTYPE = tolower(EVTYPE), # Step 1: lowercase
COMP_EVTYPE = str_split_fixed(COMP_EVTYPE, "/", 2)[,1], # Step 2: take part before "/"
COMP_EVTYPE = case_when( # Step 3: standardize categories
str_detect(COMP_EVTYPE, "cold|cool|low temp|hypothermia|freez|frost") ~ "cold",
str_detect(COMP_EVTYPE, "dry|drought|dust|driest") ~ "dry",
str_detect(COMP_EVTYPE, "erosion|erosin") ~ "erosion",
str_detect(COMP_EVTYPE, "fire") ~ "fire",
str_detect(COMP_EVTYPE, "flood|stream|drown|dam|water") ~ "flooding",
str_detect(COMP_EVTYPE, "fog") ~ "fog",
str_detect(COMP_EVTYPE, "hot|heat|high temp|hyperthermia|record temp|warm") ~ "heat",
str_detect(COMP_EVTYPE, "hurricane") ~ "hurricane",
str_detect(COMP_EVTYPE, "ice|icy|glaze") ~ "ice",
str_detect(COMP_EVTYPE, "landslide|landslides|mudslide|landslump|mud") ~ "landslide",
str_detect(COMP_EVTYPE, "lightning|ligntning") ~ "lightning",
str_detect(COMP_EVTYPE, "marine accident|marine mishap") ~ "marine accident",
str_detect(COMP_EVTYPE, "none|no severe weather") ~ "none",
str_detect(COMP_EVTYPE, "precipitation|hail|precip|rain|sleet|shower|wet") ~ "precipitation",
str_detect(COMP_EVTYPE, "snow|blizzard|avalanche") ~ "snow",
str_detect(COMP_EVTYPE, "storm") ~ "storm",
str_detect(COMP_EVTYPE, "summary") ~ "summary",
str_detect(COMP_EVTYPE, "thunderstorm|tstm|tunderstorm|thunderstrom|thundestrom|thundertorm|thundersnow|thundeerstorm|microburst|downburst")
~ "thunderstorm",
str_detect(COMP_EVTYPE, "tornado|funnel|torndao") ~ "tornado",
str_detect(COMP_EVTYPE, "tsunami|sea|surf|swell|surge|current|tide|wave") ~ "sea waves",
str_detect(COMP_EVTYPE, "unknown|/?|other|unusual") ~ "unknown",
str_detect(COMP_EVTYPE, "volcan") ~ "volcano",
str_detect(COMP_EVTYPE, "waterspout|spout") ~ "waterspout",
str_detect(COMP_EVTYPE, "wind|wnd|gust") ~ "wind",
TRUE ~ COMP_EVTYPE
)
)
sort(unique(df_select$COMP_EVTYPE))
## [1] "cold" "dry" "erosion" "fire"
## [5] "flooding" "fog" "heat" "hurricane"
## [9] "ice" "landslide" "lightning" "marine accident"
## [13] "none" "precipitation" "sea waves" "snow"
## [17] "storm" "summary" "thunderstorm" "tornado"
## [21] "unknown"
# Plot event impact on population health
health_summary <- df_select %>%
group_by(COMP_EVTYPE) %>%
summarise(
total_fatalities = sum(FATALITIES, na.rm = TRUE),
total_injuries = sum(INJURIES, na.rm = TRUE),
total_harmed = total_fatalities + total_injuries
) %>%
arrange(desc(total_harmed))
library_quiet("ggplot2")
library_quiet("tidyr")
# Prepare top 10 event types by total_harmed
top10 <- health_summary %>%
slice_max(order_by = total_harmed, n = 10) %>%
select(COMP_EVTYPE, total_fatalities, total_injuries) %>%
pivot_longer(
cols = c(total_fatalities, total_injuries),
names_to = "harm_type",
values_to = "count"
) %>%
mutate(harm_type = recode(harm_type,
total_fatalities = "Fatalities",
total_injuries = "Injuries"))
# Plot grouped bar chart with spacing between groups
ggplot(top10, aes(x = reorder(COMP_EVTYPE, -count), y = count, fill = harm_type)) +
geom_col(position = position_dodge(width = 0.8), width = 0.35) + # width controls bar width
scale_fill_manual(values = c("Fatalities" = "red", "Injuries" = "yellow")) +
labs(
title = "Population Health: Top 10 Most Harmful Event Types",
subtitle = "by Fatalities and Injuries (NOAA: Jan 1950 to Nov 2011)",
x = "Event Type",
y = "Count",
fill = "Harm Type"
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1),
panel.grid.major.x = element_blank()
)
Answer: Tornado events are by far the most harmful events for population health, for both fatalities and for injuries. Besides tornados: the events heat, flooding, thunderstorms, lightning and other storms are the most harmful events for population health.
# Plot event impact on the economy
cost_summary <- df_select %>%
group_by(COMP_EVTYPE) %>%
summarise(
total_crop_damage = sum(comp_CROPDMG_NORM, na.rm = TRUE),
total_property_damage = sum(comp_PROPDMG_NORM, na.rm = TRUE),
total_damage = total_crop_damage + total_property_damage
) %>%
arrange(desc(total_damage))
# Prepare top 10 event types by total_damage
top10 <- cost_summary %>%
slice_max(order_by = total_damage, n = 10) %>%
select(COMP_EVTYPE, total_crop_damage, total_property_damage) %>%
pivot_longer(
cols = c(total_crop_damage, total_property_damage),
names_to = "harm_type",
values_to = "count"
) %>%
mutate(harm_type = recode(harm_type,
total_crop_damage = "Crop Damage",
total_property_damage = "Property Damage"),
count = count / 1e6 # Convert to millions
)
# Plot grouped bar chart with spacing between groups
ggplot(top10, aes(x = reorder(COMP_EVTYPE, -count), y = count, fill = harm_type)) +
geom_col(position = position_dodge(width = 0.8), width = 0.35) + # width controls bar width
scale_fill_manual(values = c("Crop Damage" = "orange", "Property Damage" = "yellow")) +
labs(
title = "Economy: Top 10 Most Harmful Event Types",
subtitle = "by Crop and Property Damage (NOAA: Jan 1950 to Nov 2011)",
x = "Event Type",
y = "Damage (Millions of USD)",
fill = "Harm Type"
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1),
panel.grid.major.x = element_blank()
)
Answer: When considering total economic damage (crops and property), then floods are the most damaging, followed by hurricanes, storms, tornados, precipitation and dry weather. However if considering only crop damage, then dry weather is the most damaging event, followed by flooding, hurricanes, ice, precipitation.