Synopsis

This report explores the NOAA Storm Database to identify which types of severe weather events are most harmful to population health and which have the greatest economic consequences in the United States between 1950 and 2011.
The analysis includes all recorded events in the database, covering fatalities, injuries, and property and crop damages.
Data were cleaned and standardized to correct inconsistencies in event type names and to convert damage exponents into numerical values.
Two main questions are addressed: 1. Which types of events are most harmful to population health? 2. Which types of events have the greatest economic impact?
Results show that tornadoes are by far the leading cause of fatalities and injuries, while floods, hurricanes, and tornadoes cause the highest economic losses.
The top 10 event types account for 81.3% of all fatalities and 91.0% of all injuries, highlighting the concentration of risk among a few major weather phenomena.


1. Data Processing

# Packages ----
library(data.table)
library(tidyverse)
library(lubridate)
library(scales)
library(janitor)
library(stringr)
library(patchwork)

1a. Load and inspect the dataset

# Replace the path below with the location of your copy of the data file
dt <- read_csv("C:/Users/Thais/Desktop/Final/RepData_PeerAssessment2/repdata_data_StormData.csv.bz2")
## Rows: 902297 Columns: 37
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (18): BGN_DATE, BGN_TIME, TIME_ZONE, COUNTYNAME, STATE, EVTYPE, BGN_AZI,...
## dbl (18): STATE__, COUNTY, BGN_RANGE, COUNTY_END, END_RANGE, LENGTH, WIDTH, ...
## lgl  (1): COUNTYENDN
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Inspect the dataset ----
dim(dt)       # Check the number of rows and columns
## [1] 902297     37
names(dt)     # List column names
##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"
str(dt)       # Display structure and data types
## spc_tbl_ [902,297 × 37] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ STATE__   : num [1:902297] 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr [1:902297] "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr [1:902297] "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr [1:902297] "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num [1:902297] 97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr [1:902297] "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr [1:902297] "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr [1:902297] "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num [1:902297] 0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr [1:902297] NA NA NA NA ...
##  $ BGN_LOCATI: chr [1:902297] NA NA NA NA ...
##  $ END_DATE  : chr [1:902297] NA NA NA NA ...
##  $ END_TIME  : chr [1:902297] NA NA NA NA ...
##  $ COUNTY_END: num [1:902297] 0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi [1:902297] NA NA NA NA NA NA ...
##  $ END_RANGE : num [1:902297] 0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr [1:902297] NA NA NA NA ...
##  $ END_LOCATI: chr [1:902297] NA NA NA NA ...
##  $ LENGTH    : num [1:902297] 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num [1:902297] 100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : num [1:902297] 3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num [1:902297] 0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num [1:902297] 0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num [1:902297] 15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num [1:902297] 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr [1:902297] "K" "K" "K" "K" ...
##  $ CROPDMG   : num [1:902297] 0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr [1:902297] NA NA NA NA ...
##  $ WFO       : chr [1:902297] NA NA NA NA ...
##  $ STATEOFFIC: chr [1:902297] NA NA NA NA ...
##  $ ZONENAMES : chr [1:902297] NA NA NA NA ...
##  $ LATITUDE  : num [1:902297] 3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num [1:902297] 8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num [1:902297] 3051 0 0 0 0 ...
##  $ LONGITUDE_: num [1:902297] 8806 0 0 0 0 ...
##  $ REMARKS   : chr [1:902297] NA NA NA NA ...
##  $ REFNUM    : num [1:902297] 1 2 3 4 5 6 7 8 9 10 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   STATE__ = col_double(),
##   ..   BGN_DATE = col_character(),
##   ..   BGN_TIME = col_character(),
##   ..   TIME_ZONE = col_character(),
##   ..   COUNTY = col_double(),
##   ..   COUNTYNAME = col_character(),
##   ..   STATE = col_character(),
##   ..   EVTYPE = col_character(),
##   ..   BGN_RANGE = col_double(),
##   ..   BGN_AZI = col_character(),
##   ..   BGN_LOCATI = col_character(),
##   ..   END_DATE = col_character(),
##   ..   END_TIME = col_character(),
##   ..   COUNTY_END = col_double(),
##   ..   COUNTYENDN = col_logical(),
##   ..   END_RANGE = col_double(),
##   ..   END_AZI = col_character(),
##   ..   END_LOCATI = col_character(),
##   ..   LENGTH = col_double(),
##   ..   WIDTH = col_double(),
##   ..   F = col_double(),
##   ..   MAG = col_double(),
##   ..   FATALITIES = col_double(),
##   ..   INJURIES = col_double(),
##   ..   PROPDMG = col_double(),
##   ..   PROPDMGEXP = col_character(),
##   ..   CROPDMG = col_double(),
##   ..   CROPDMGEXP = col_character(),
##   ..   WFO = col_character(),
##   ..   STATEOFFIC = col_character(),
##   ..   ZONENAMES = col_character(),
##   ..   LATITUDE = col_double(),
##   ..   LONGITUDE = col_double(),
##   ..   LATITUDE_E = col_double(),
##   ..   LONGITUDE_ = col_double(),
##   ..   REMARKS = col_character(),
##   ..   REFNUM = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
# View the first few rows of the variables of interest
dt %>%
  select(EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP) %>%
  head()
## # A tibble: 6 × 7
##   EVTYPE  FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
##   <chr>        <dbl>    <dbl>   <dbl> <chr>        <dbl> <chr>     
## 1 TORNADO          0       15    25   K                0 <NA>      
## 2 TORNADO          0        0     2.5 K                0 <NA>      
## 3 TORNADO          0        2    25   K                0 <NA>      
## 4 TORNADO          0        2     2.5 K                0 <NA>      
## 5 TORNADO          0        2     2.5 K                0 <NA>      
## 6 TORNADO          0        6     2.5 K                0 <NA>
# Check how the event types are recorded
unique_event_types <- unique(dt$EVTYPE)
length(unique_event_types)      # number of unique event types
## [1] 977
head(unique_event_types, 20)    # first 20 event types
##  [1] "TORNADO"                   "TSTM WIND"                
##  [3] "HAIL"                      "FREEZING RAIN"            
##  [5] "SNOW"                      "ICE STORM/FLASH FLOOD"    
##  [7] "SNOW/ICE"                  "WINTER STORM"             
##  [9] "HURRICANE OPAL/HIGH WINDS" "THUNDERSTORM WINDS"       
## [11] "RECORD COLD"               "HURRICANE ERIN"           
## [13] "HURRICANE OPAL"            "HEAVY RAIN"               
## [15] "LIGHTNING"                 "THUNDERSTORM WIND"        
## [17] "DENSE FOG"                 "RIP CURRENT"              
## [19] "THUNDERSTORM WINS"         "FLASH FLOOD"
# Check the range of fatalities and injuries
summary(dt$FATALITIES)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##   0.00000   0.00000   0.00000   0.01678   0.00000 583.00000
summary(dt$INJURIES)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##    0.0000    0.0000    0.0000    0.1557    0.0000 1700.0000
# Check property and crop damage columns
summary(dt$PROPDMG)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    0.00   12.06    0.50 5000.00
summary(dt$PROPDMGEXP)
##    Length     Class      Mode 
##    902297 character character
summary(dt$CROPDMG)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   1.527   0.000 990.000
summary(dt$CROPDMGEXP)
##    Length     Class      Mode 
##    902297 character character
# Find out the period of data you are analysing
# Convert to Date format ----
dt$BGN_DATE <- as.Date(dt$BGN_DATE, format = "%m/%d/%Y %H:%M:%S")
dt$END_DATE <- as.Date(dt$END_DATE, format = "%m/%d/%Y %H:%M:%S")

# Check the date range ----
range(dt$BGN_DATE, dt$END_DATE, na.rm = TRUE)
## [1] "1950-01-03" "2011-11-30"
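
The summary() calls on PROPDMGEXP and CROPDMGEXP above report only the length and class of each column, because both are character vectors. A frequency table is one way to see which exponent codes actually occur, which matters for the damage conversion in section 2b. The following is a minimal sketch on a made-up vector; the values in `exp_codes` are illustrative and are not taken from the dataset.

```r
# Hypothetical sample of exponent codes; NA stands for a blank entry
exp_codes <- c("K", "K", "M", NA, "k", "B", "0", "K", NA, "H")

# Count each code, keeping NA as its own category
code_counts <- table(exp_codes, useNA = "ifany")

# Most frequent codes first
sort(code_counts, decreasing = TRUE)
```

On the real data the equivalent call would be table(dt$PROPDMGEXP, useNA = "ifany").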

1b. Cleaning and Preparing Variables

Event type standardization (justification)

EVTYPE contains many near-duplicates (e.g., “TSTM WIND”, “THUNDERSTORM WIND”, extra spaces). We apply lightweight normalization that preserves meaning while merging obvious variants:

- lowercase, trim spaces, collapse multiple spaces

- alias common synonyms (e.g., tstm → thunderstorm)

- replace punctuation that doesn’t convey type with spaces

- map a few high-frequency patterns to canonical names

This reduces fragmentation and yields more interpretable totals without over-engineering a full taxonomy.

# Clean EVTYPE ----

# Basic text normalization
dt$EVTYPE <- tolower(dt$EVTYPE)                # convert to lowercase
dt$EVTYPE <- str_trim(dt$EVTYPE)               # remove leading/trailing spaces
dt$EVTYPE <- str_replace_all(dt$EVTYPE, "[[:punct:]]", " ")  # remove punctuation
dt$EVTYPE <- str_replace_all(dt$EVTYPE, "\\s+", " ")         # collapse multiple spaces

# Standardize common event types
dt$EVTYPE <- str_replace_all(dt$EVTYPE, "tstm", "thunderstorm")
dt$EVTYPE <- str_replace_all(dt$EVTYPE, "flood(s)?", "flood")   # floods -> flood
dt$EVTYPE <- str_replace_all(dt$EVTYPE, "wind(s)?", "wind")     # winds -> wind
# The next rename matches substrings: an entry already cleaned to
# "hurricane typhoon" becomes "hurricane/typhoon typhoon" and stays separate
dt$EVTYPE <- str_replace_all(dt$EVTYPE, "hurricane", "hurricane/typhoon")
dt$EVTYPE <- str_replace_all(dt$EVTYPE, "fire(s)?", "fire")     # fires -> fire
# Same substring behavior: "storm surge tide" becomes "storm surge/tide tide"
dt$EVTYPE <- str_replace_all(dt$EVTYPE, "storm surge", "storm surge/tide")

# Check how many unique event types remain
length(unique(dt$EVTYPE))
## [1] 813
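
The replacements above match substrings and run after punctuation has been turned into spaces, so some variants are not merged (for example, a trailing space left by a final "/", or "hurricane typhoon" next to "hurricane"). A stricter variant could look like the sketch below. `normalize_evtype` is a hypothetical helper, not the code that produced the counts in this report, and merging named storms such as "hurricane opal" into one category is an assumption about what is wanted.

```r
library(stringr)

# Hypothetical helper; not the function used to produce the results above
normalize_evtype <- function(x) {
  x <- str_to_lower(x)
  x <- str_replace_all(x, "[[:punct:]]", " ")  # punctuation becomes a space
  x <- str_squish(x)                           # trim ends, collapse runs of spaces
  x <- str_replace_all(x, "\\btstm\\b", "thunderstorm")
  x <- str_replace_all(x, "\\b(flood|wind|fire)s\\b", "\\1")  # simple plurals
  # Assumption: every hurricane or typhoon label, named or not, is one category
  x <- ifelse(str_detect(x, "hurricane|typhoon"), "hurricane/typhoon", x)
  # Assumption: every label starting with "storm surge" is one category
  x <- ifelse(str_detect(x, "^storm surge"), "storm surge/tide", x)
  x
}

# Made-up raw labels covering the cases discussed above
raw <- c("TSTM WIND", " Thunderstorm Winds ", "HURRICANE/TYPHOON",
         "HURRICANE OPAL", "STORM SURGE/TIDE", "FLASH FLOODS", "HIGH WINDS/")

unique(normalize_evtype(raw))
```

Applying a helper like this to dt$EVTYPE would change the number of unique event types and possibly the aggregated totals, so the figures reported here would need to be regenerated.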

2. Results

2a. Impact on Population Health

# Aggregate fatalities and injuries by event type
health_impact <- dt %>%
  group_by(EVTYPE) %>%
  summarise(
    total_fatalities = sum(FATALITIES, na.rm = TRUE),
    total_injuries = sum(INJURIES, na.rm = TRUE)
  ) %>%
  arrange(desc(total_fatalities), desc(total_injuries))  # sort by fatalities, then injuries

# Top 10 fatal events
top_fatal <- health_impact %>%
  slice_head(n = 10)

# Top 10 injury events
top_injury <- health_impact %>%
  arrange(desc(total_injuries)) %>%
  slice_head(n = 10)

# Plot fatalities
p1 <- ggplot(top_fatal, aes(x = reorder(EVTYPE, total_fatalities), y = total_fatalities)) +
  geom_col(fill = "firebrick") +
  coord_flip() +
  labs(title = "Top 10 Deadliest Storm Events", x = "Event Type", y = "Total Fatalities") +
  theme_minimal(base_size = 12)

# Plot injuries
p2 <- ggplot(top_injury, aes(x = reorder(EVTYPE, total_injuries), y = total_injuries)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(title = "Top 10 Storm Events Causing Most Injuries", x = "Event Type", y = "Total Injuries") +
  theme_minimal(base_size = 12)

# Combine the two plots side by side
combined_plot <- p1 + p2 + 
  plot_layout(ncol = 2) + 
  plot_annotation(title = "Top 10 Storm Events by Fatalities and Injuries (1950-2011)")

# Display the combined plot
combined_plot

Summary

Tornadoes are the leading cause of fatalities and injuries.

# Total fatalities and injuries
total_fatalities_all <- sum(dt$FATALITIES, na.rm = TRUE)
total_injuries_all <- sum(dt$INJURIES, na.rm = TRUE)

# Fatalities top 10 percentage
top10_fatal_pct <- sum(top_fatal$total_fatalities) / total_fatalities_all * 100

# Injuries top 10 percentage
top10_injury_pct <- sum(top_injury$total_injuries) / total_injuries_all * 100

cat(sprintf(
  "The top 10 deadliest event types account for %.1f%% of all fatalities, and the top 10 injury-causing event types account for %.1f%% of all injuries.\n",
  top10_fatal_pct, top10_injury_pct
))
## The top 10 deadliest event types account for 81.3% of all fatalities, and the top 10 injury-causing event types account for 91.0% of all injuries.

2b. Economic Impact

Damage exponents (justification)

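Property and crop damage amounts are scaled by the codes in PROPDMGEXP / CROPDMGEXP. Common practice reads H=10², K=10³, M=10⁶, B=10⁹ and digits 0–9 as 10^digit. The function below converts only K, M, and B (in either case); every other value, including H, digits, blanks, and missing entries, is given a multiplier of 1.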

# Create a function to convert exponents to numeric multipliers
exp_to_num <- function(exp) {
  ifelse(exp %in% c('k', 'K'), 1e3,
         ifelse(exp %in% c('m', 'M'), 1e6,
                ifelse(exp %in% c('b', 'B'), 1e9, 1)))
}

# Compute damage in numeric form
dt$PROPDMG_num <- dt$PROPDMG * exp_to_num(dt$PROPDMGEXP)
dt$CROPDMG_num <- dt$CROPDMG * exp_to_num(dt$CROPDMGEXP)

# Total economic damage
dt$TOTALDMG <- dt$PROPDMG_num + dt$CROPDMG_num
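
For comparison, the fuller convention described above (H and the digits 0–9 in addition to K, M, and B) could be expressed as a named lookup vector. This is only a sketch: `exp_lookup` and `exp_to_num_full` are hypothetical names, and this variant was not used to compute the totals in this report.

```r
# Hypothetical lookup table; the H and digit entries follow the convention
# described in the text, not the conversion actually used for the results
exp_lookup <- c(
  H = 1e2, K = 1e3, M = 1e6, B = 1e9,
  setNames(10^(0:9), as.character(0:9))
)

exp_to_num_full <- function(exp) {
  mult <- unname(exp_lookup[toupper(exp)])  # NA for unknown codes and missing values
  mult[is.na(mult)] <- 1                    # blanks and other symbols count as 1
  mult
}

# Mixed-case letters, a digit, an unknown symbol, and a missing value
exp_to_num_full(c("K", "m", "h", "2", "+", NA))
```

A lookup vector keeps the mapping in one place, which makes it easier to audit than nested ifelse() calls when more codes are added.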

# Aggregate economic damage by event type ----
economic_impact <- dt %>%
  group_by(EVTYPE) %>%
  summarise(
    total_damage = sum(TOTALDMG, na.rm = TRUE)   # just total
  ) %>%
  arrange(desc(total_damage))

# Top 10 most costly events
top10_econ <- economic_impact %>%
  slice_head(n = 10)

# Plot top 10 economic damage events ----
ggplot(top10_econ, aes(x = reorder(EVTYPE, total_damage), y = total_damage / 1e9)) +  # in billions
  geom_col(fill = "darkgreen") +
  coord_flip() +
  labs(
    title = "Top 10 Storm Events by Economic Damage (USD Billions)",
    x = "Event Type",
    y = "Total Damage (billions USD)"
  ) +
  theme_minimal(base_size = 12)

Summary

Floods, hurricanes, and tornadoes cause the highest economic losses.
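
The health section reports the share of fatalities and injuries covered by the top 10 event types, but no equivalent share is computed for economic damage. One way to do so is sketched below on made-up totals. `toy_impact` and `top_share` are hypothetical names, and the numbers are illustrative, not results from the dataset.

```r
library(dplyr)

# Hypothetical totals per event type (USD); a stand-in for economic_impact
toy_impact <- tibble(
  EVTYPE = c("flood", "tornado", "hail", "drought", "lightning"),
  total_damage = c(50, 30, 10, 8, 2)
)

# Percentage of all damage captured by the n most costly event types
top_share <- function(impact, n) {
  top_n_total <- impact %>%
    arrange(desc(total_damage)) %>%
    slice_head(n = n) %>%
    pull(total_damage) %>%
    sum()
  top_n_total / sum(impact$total_damage) * 100
}

top_share(toy_impact, n = 2)
```

With the objects from this report, the equivalent call would be top_share(economic_impact, n = 10).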

3. Conclusion

Across the United States, tornadoes are the most harmful events to population health, while floods cause the greatest economic damage.