This report explores the NOAA Storm Database to identify which types
of severe weather events are most harmful to population health and which
have the greatest economic consequences in the United States between
1950 and 2011.
The analysis includes all recorded events in the database and covers
fatalities, injuries, property damage, and crop damage.
Data were cleaned and standardized to correct inconsistencies in event
type names and to convert damage exponents into numerical values.
Two main questions are addressed:
1. Which types of events are most harmful to population health?
2. Which types of events have the greatest economic impact?
Results show that tornadoes are by far the leading cause of fatalities
and injuries, while floods, hurricanes, and tornadoes cause the highest
economic losses.
Together, the top 10 event types account for 81.3% of all fatalities and
91.0% of all injuries, as well as the vast majority of economic losses,
highlighting the concentration of risk among a few major weather phenomena.
# Packages ----
library(data.table)
library(tidyverse)
library(lubridate)
library(scales)
library(janitor)
library(stringr)
library(patchwork)
# Replace the path below with the location of the data file on your machine
dt <- read_csv("C:/Users/Thais/Desktop/Final/RepData_PeerAssessment2/repdata_data_StormData.csv.bz2")
## Rows: 902297 Columns: 37
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (18): BGN_DATE, BGN_TIME, TIME_ZONE, COUNTYNAME, STATE, EVTYPE, BGN_AZI,...
## dbl (18): STATE__, COUNTY, BGN_RANGE, COUNTY_END, END_RANGE, LENGTH, WIDTH, ...
## lgl (1): COUNTYENDN
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Inspect the dataset ----
dim(dt) # Check the number of rows and columns
## [1] 902297 37
names(dt) # List column names
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
str(dt) # Display structure and data types
## spc_tbl_ [902,297 × 37] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ STATE__ : num [1:902297] 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr [1:902297] "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr [1:902297] "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr [1:902297] "CST" "CST" "CST" "CST" ...
## $ COUNTY : num [1:902297] 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr [1:902297] "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr [1:902297] "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr [1:902297] "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num [1:902297] 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr [1:902297] NA NA NA NA ...
## $ BGN_LOCATI: chr [1:902297] NA NA NA NA ...
## $ END_DATE : chr [1:902297] NA NA NA NA ...
## $ END_TIME : chr [1:902297] NA NA NA NA ...
## $ COUNTY_END: num [1:902297] 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi [1:902297] NA NA NA NA NA NA ...
## $ END_RANGE : num [1:902297] 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr [1:902297] NA NA NA NA ...
## $ END_LOCATI: chr [1:902297] NA NA NA NA ...
## $ LENGTH : num [1:902297] 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num [1:902297] 100 150 123 100 150 177 33 33 100 100 ...
## $ F : num [1:902297] 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num [1:902297] 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num [1:902297] 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num [1:902297] 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num [1:902297] 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr [1:902297] "K" "K" "K" "K" ...
## $ CROPDMG : num [1:902297] 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr [1:902297] NA NA NA NA ...
## $ WFO : chr [1:902297] NA NA NA NA ...
## $ STATEOFFIC: chr [1:902297] NA NA NA NA ...
## $ ZONENAMES : chr [1:902297] NA NA NA NA ...
## $ LATITUDE : num [1:902297] 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num [1:902297] 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num [1:902297] 3051 0 0 0 0 ...
## $ LONGITUDE_: num [1:902297] 8806 0 0 0 0 ...
## $ REMARKS : chr [1:902297] NA NA NA NA ...
## $ REFNUM : num [1:902297] 1 2 3 4 5 6 7 8 9 10 ...
## - attr(*, "spec")=
## .. cols(
## .. STATE__ = col_double(),
## .. BGN_DATE = col_character(),
## .. BGN_TIME = col_character(),
## .. TIME_ZONE = col_character(),
## .. COUNTY = col_double(),
## .. COUNTYNAME = col_character(),
## .. STATE = col_character(),
## .. EVTYPE = col_character(),
## .. BGN_RANGE = col_double(),
## .. BGN_AZI = col_character(),
## .. BGN_LOCATI = col_character(),
## .. END_DATE = col_character(),
## .. END_TIME = col_character(),
## .. COUNTY_END = col_double(),
## .. COUNTYENDN = col_logical(),
## .. END_RANGE = col_double(),
## .. END_AZI = col_character(),
## .. END_LOCATI = col_character(),
## .. LENGTH = col_double(),
## .. WIDTH = col_double(),
## .. F = col_double(),
## .. MAG = col_double(),
## .. FATALITIES = col_double(),
## .. INJURIES = col_double(),
## .. PROPDMG = col_double(),
## .. PROPDMGEXP = col_character(),
## .. CROPDMG = col_double(),
## .. CROPDMGEXP = col_character(),
## .. WFO = col_character(),
## .. STATEOFFIC = col_character(),
## .. ZONENAMES = col_character(),
## .. LATITUDE = col_double(),
## .. LONGITUDE = col_double(),
## .. LATITUDE_E = col_double(),
## .. LONGITUDE_ = col_double(),
## .. REMARKS = col_character(),
## .. REFNUM = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
# View the first few rows of the variables of interest
dt %>%
select(EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP) %>%
head()
## # A tibble: 6 × 7
## EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## <chr> <dbl> <dbl> <dbl> <chr> <dbl> <chr>
## 1 TORNADO 0 15 25 K 0 <NA>
## 2 TORNADO 0 0 2.5 K 0 <NA>
## 3 TORNADO 0 2 25 K 0 <NA>
## 4 TORNADO 0 2 2.5 K 0 <NA>
## 5 TORNADO 0 2 2.5 K 0 <NA>
## 6 TORNADO 0 6 2.5 K 0 <NA>
# Check how the event types are recorded
unique_event_types <- unique(dt$EVTYPE)
length(unique_event_types) # number of unique event types
## [1] 977
head(unique_event_types, 20) # first 20 event types
## [1] "TORNADO" "TSTM WIND"
## [3] "HAIL" "FREEZING RAIN"
## [5] "SNOW" "ICE STORM/FLASH FLOOD"
## [7] "SNOW/ICE" "WINTER STORM"
## [9] "HURRICANE OPAL/HIGH WINDS" "THUNDERSTORM WINDS"
## [11] "RECORD COLD" "HURRICANE ERIN"
## [13] "HURRICANE OPAL" "HEAVY RAIN"
## [15] "LIGHTNING" "THUNDERSTORM WIND"
## [17] "DENSE FOG" "RIP CURRENT"
## [19] "THUNDERSTORM WINS" "FLASH FLOOD"
# Check the range of fatalities and injuries
summary(dt$FATALITIES)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.01678 0.00000 583.00000
summary(dt$INJURIES)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1557 0.0000 1700.0000
# Check property and crop damage columns
summary(dt$PROPDMG)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 0.00 12.06 0.50 5000.00
summary(dt$PROPDMGEXP)
## Length Class Mode
## 902297 character character
summary(dt$CROPDMG)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 1.527 0.000 990.000
summary(dt$CROPDMGEXP)
## Length Class Mode
## 902297 character character
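Because `summary()` on a character column reports only its length and class, a frequency table is more informative for the exponent codes. The following is a minimal sketch on a toy vector; the values are illustrative and not taken from the database.

```r
# Toy exponent codes standing in for dt$PROPDMGEXP (illustrative values only)
exp_codes <- c("K", "K", "M", NA, "B", "k", "K", NA)

# Tabulate every distinct code, keeping missing values visible
exp_counts <- table(exp_codes, useNA = "ifany")
print(exp_counts)
```

On the real data the equivalent call would be `table(dt$PROPDMGEXP, useNA = "ifany")`, and likewise for `CROPDMGEXP`.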
# Find out the period of data you are analysing
# Convert to Date format ----
dt$BGN_DATE <- as.Date(dt$BGN_DATE, format = "%m/%d/%Y %H:%M:%S")
dt$END_DATE <- as.Date(dt$END_DATE, format = "%m/%d/%Y %H:%M:%S")
# Check the date range ----
range(dt$BGN_DATE, dt$END_DATE, na.rm = TRUE)
## [1] "1950-01-03" "2011-11-30"
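Since `lubridate` is already loaded, the same conversion could also be written with its parsers. This is only a sketch on two sample timestamps in the `BGN_DATE` format, not a replacement for the base R call above.

```r
library(lubridate)

# Two raw timestamps in the format used by BGN_DATE
raw_dates <- c("4/18/1950 0:00:00", "2/20/1951 0:00:00")

# mdy_hms() parses month/day/year plus a time; as_date() drops the time part
parsed_dates <- as_date(mdy_hms(raw_dates))
print(parsed_dates)

# Years covered by the sample
print(range(year(parsed_dates)))
```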
EVTYPE contains many near-duplicates (e.g., “TSTM WIND”, “THUNDERSTORM WIND”, extra spaces). We apply lightweight normalization that preserves meaning while merging obvious variants:
- lowercase, trim spaces, collapse multiple spaces
- alias common synonyms (e.g., tstm → thunderstorm)
- replace punctuation with spaces
- map a few high-frequency patterns to canonical names
This reduces fragmentation and yields more interpretable totals without over-engineering a full taxonomy.
# Clean EVTYPE ----
# Basic text normalization
dt$EVTYPE <- tolower(dt$EVTYPE) # convert to lowercase
dt$EVTYPE <- str_trim(dt$EVTYPE) # remove leading/trailing spaces
dt$EVTYPE <- str_replace_all(dt$EVTYPE, "[[:punct:]]", " ") # remove punctuation
dt$EVTYPE <- str_replace_all(dt$EVTYPE, "\\s+", " ") # collapse multiple spaces
# Standardize common event types
dt$EVTYPE <- str_replace_all(dt$EVTYPE, "tstm", "thunderstorm") # expand abbreviation
dt$EVTYPE <- str_replace_all(dt$EVTYPE, "floods?", "flood") # plural -> singular
dt$EVTYPE <- str_replace_all(dt$EVTYPE, "winds?", "wind") # plural -> singular
# The two compound names below are appended to every match, so labels that were
# already compound (e.g. "HURRICANE/TYPHOON", by now "hurricane typhoon") end
# with a repeated word and stay separate categories
dt$EVTYPE <- str_replace_all(dt$EVTYPE, "hurricane", "hurricane/typhoon")
dt$EVTYPE <- str_replace_all(dt$EVTYPE, "fires?", "fire") # plural -> singular
dt$EVTYPE <- str_replace_all(dt$EVTYPE, "storm surge", "storm surge/tide")
# Check how many unique event types remain
length(unique(dt$EVTYPE))
## [1] 813
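Because punctuation is replaced by spaces before the synonym step, a label such as "HURRICANE/TYPHOON" reaches that step as "hurricane typhoon", and an unanchored replacement of "hurricane" turns it into "hurricane/typhoon typhoon". One possible way to avoid this is sketched below on a toy vector. The function name `normalize_evtype` is hypothetical, the sample labels are illustrative, and this variant was not used to produce the counts reported here. It also leaves named storms such as "hurricane opal" untouched, which is a design choice rather than a requirement.

```r
library(stringr)

# Toy labels (illustrative); some already carry the canonical compound name
raw_types <- c("TSTM WIND", " Thunderstorm Winds ", "HURRICANE/TYPHOON",
               "HURRICANE", "STORM SURGE/TIDE", "STORM SURGE", "FLASH FLOODS.")

normalize_evtype <- function(x) {
  x <- str_to_lower(x)
  x <- str_replace_all(x, "[[:punct:]]", " ")  # punctuation -> space
  x <- str_squish(x)                           # trim ends, collapse inner runs
  x <- str_replace_all(x, "\\btstm\\b", "thunderstorm")
  x <- str_replace_all(x, "\\b(flood|wind|fire)s\\b", "\\1")  # plural -> singular
  # Whole-label matches only, so compound labels are not extended a second time
  x <- str_replace(x, "^hurricane( typhoon)?$", "hurricane/typhoon")
  x <- str_replace(x, "^storm surge( tide)?$", "storm surge/tide")
  x
}

cleaned <- normalize_evtype(raw_types)
print(unique(cleaned))
```

Running `str_squish()` after the punctuation step, rather than trimming before it, also removes the trailing space left behind by labels that end in punctuation.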
# Aggregate fatalities and injuries by event type
health_impact <- dt %>%
group_by(EVTYPE) %>%
summarise(
total_fatalities = sum(FATALITIES, na.rm = TRUE),
total_injuries = sum(INJURIES, na.rm = TRUE)
) %>%
arrange(desc(total_fatalities), desc(total_injuries)) # sort by fatalities, then injuries
# Top 10 fatal events
top_fatal <- health_impact %>%
slice_head(n = 10)
# Top 10 injury events
top_injury <- health_impact %>%
arrange(desc(total_injuries)) %>%
slice_head(n = 10)
# Plot fatalities
p1 <- ggplot(top_fatal, aes(x = reorder(EVTYPE, total_fatalities), y = total_fatalities)) +
geom_col(fill = "firebrick") +
coord_flip() +
labs(title = "Top 10 Deadliest Storm Events", x = "Event Type", y = "Total Fatalities") +
theme_minimal(base_size = 12)
# Plot injuries
p2 <- ggplot(top_injury, aes(x = reorder(EVTYPE, total_injuries), y = total_injuries)) +
geom_col(fill = "steelblue") +
coord_flip() +
labs(title = "Top 10 Storm Events Causing Most Injuries", x = "Event Type", y = "Total Injuries") +
theme_minimal(base_size = 12)
# Combine the two plots side by side
combined_plot <- p1 + p2 +
plot_layout(ncol = 2) +
plot_annotation(title = "Top 10 Storm Events by Fatalities and Injuries (1950-2011)")
# Display the combined plot
combined_plot
Tornadoes are the leading cause of fatalities and injuries.
# Total fatalities and injuries
total_fatalities_all <- sum(dt$FATALITIES, na.rm = TRUE)
total_injuries_all <- sum(dt$INJURIES, na.rm = TRUE)
# Fatalities top 10 percentage
top10_fatal_pct <- sum(top_fatal$total_fatalities) / total_fatalities_all * 100
# Injuries top 10 percentage
top10_injury_pct <- sum(top_injury$total_injuries) / total_injuries_all * 100
cat(sprintf(
"The top 10 deadliest event types account for %.1f%% of all fatalities, and the top 10 injury-causing event types account for %.1f%% of all injuries.\n",
top10_fatal_pct, top10_injury_pct
))
## The top 10 deadliest event types account for 81.3% of all fatalities, and the top 10 injury-causing event types account for 91.0% of all injuries.
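Property and crop damage magnitudes are encoded as exponents in PROPDMGEXP / CROPDMGEXP. The `exp_to_num()` function converts K = 10³, M = 10⁶, and B = 10⁹ (in either case) and treats every other value, including H, digits 0–9, blanks, and other symbols, as a multiplier of 1. Common practice additionally reads H as 10² and a digit d as 10^d, but those codes are not converted here.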
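For comparison, a lookup-based sketch that covers H, K, M, B, and single-digit codes could look like the following. The name `exp_multiplier` is hypothetical, the sample codes are illustrative, and this is not necessarily the function that produced the results in this report.

```r
# Hypothetical helper: map a damage exponent code to a numeric multiplier
exp_multiplier <- function(code) {
  code[is.na(code)] <- ""                 # treat missing codes like blanks
  code <- toupper(code)
  letters_map <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)

  out <- rep(1, length(code))             # default: blanks and other symbols -> 1

  is_letter <- code %in% names(letters_map)
  out[is_letter] <- unname(letters_map[code[is_letter]])

  is_digit <- grepl("^[0-9]$", code)      # a single digit d means 10^d
  out[is_digit] <- 10^as.numeric(code[is_digit])

  out
}

codes <- c("K", "m", "B", "h", "5", NA, "+", "")
print(exp_multiplier(codes))
```

A named vector keeps the mapping in one place, so adding or removing a code does not require another nested `ifelse()`.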
# Create a function to convert exponents to numeric multipliers
exp_to_num <- function(exp) {
ifelse(exp %in% c('k', 'K'), 1e3,
ifelse(exp %in% c('m', 'M'), 1e6,
ifelse(exp %in% c('b', 'B'), 1e9, 1)))
}
# Compute damage in numeric form
dt$PROPDMG_num <- dt$PROPDMG * exp_to_num(dt$PROPDMGEXP)
dt$CROPDMG_num <- dt$CROPDMG * exp_to_num(dt$CROPDMGEXP)
# Total economic damage
dt$TOTALDMG <- dt$PROPDMG_num + dt$CROPDMG_num
# Aggregate economic damage by event type ----
economic_impact <- dt %>%
group_by(EVTYPE) %>%
summarise(
total_damage = sum(TOTALDMG, na.rm = TRUE) # just total
) %>%
arrange(desc(total_damage))
# Top 10 most costly events
top10_econ <- economic_impact %>%
slice_head(n = 10)
# Plot top 10 economic damage events ----
ggplot(top10_econ, aes(x = reorder(EVTYPE, total_damage), y = total_damage / 1e9)) + # USD converted to billions
geom_col(fill = "darkgreen") +
coord_flip() +
labs(
title = "Top 10 Storm Events by Economic Damage (USD Billions)",
x = "Event Type",
y = "Total Damage (billions USD)"
) +
theme_minimal(base_size = 12)
Floods, hurricanes, and tornadoes cause the highest economic losses.
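The share of total damage held by the top 10 event types can be checked the same way it was for fatalities and injuries. A minimal sketch of a reusable helper follows. The name `top_share` and the toy numbers are assumptions made for illustration, not results from the database.

```r
library(dplyr)

# Hypothetical helper: percentage of a column's total held by its n largest values
top_share <- function(data, value, n = 10) {
  vals <- sort(pull(data, {{ value }}), decreasing = TRUE)
  sum(head(vals, n)) / sum(vals) * 100
}

# Toy totals (illustrative numbers only)
toy_impact <- tibble(
  EVTYPE = c("a", "b", "c", "d"),
  total_damage = c(50, 30, 15, 5)
)

# Share of total damage held by the two largest categories
print(top_share(toy_impact, total_damage, n = 2))
```

With the objects built in this report, `top_share(economic_impact, total_damage)` would give the percentage of all recorded damage attributable to the ten costliest event types.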