NOAA’s Storm Events Analysis

This project examines NOAA’s Storm Events Database, which documents significant weather occurrences across the United States, detailing their timing, locations, and associated damages. The analysis aims to identify the event types that pose the greatest threats to public health and those that lead to substantial economic losses.

Data Source

Data Format: The dataset is provided as a comma-separated values (CSV) file, compressed with the bzip2 algorithm to reduce its size.

Download Link: The file can be downloaded from the course website; the URL appears in the download code under Downloading Data.

Data Coverage: The database records events starting from 1950 up to November 2011. Earlier years may have fewer recorded events, likely due to less comprehensive record-keeping, whereas more recent years are considered more complete.

Questions

  1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

  2. Across the United States, which types of events have the greatest economic consequences?

Downloading Data

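# bunzip2() comes from the R.utils package, which must be installed.
if (!file.exists("stormdata.csv.bz2")) {
    url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
    # mode = "wb" keeps the compressed (binary) file intact on Windows.
    download.file(url, "stormdata.csv.bz2", mode = "wb")
}
if (!file.exists("stormdata.csv")) {
    R.utils::bunzip2("stormdata.csv.bz2", "stormdata.csv", remove = FALSE)
}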
StormInfo <- data.table::fread("stormdata.csv", fill=TRUE, header=TRUE)
head(StormInfo)
##    STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME  STATE
##     <char>             <char>   <char>    <char> <char>     <char> <char>
## 1:    1.00  4/18/1950 0:00:00     0130       CST  97.00     MOBILE     AL
## 2:    1.00  4/18/1950 0:00:00     0145       CST   3.00    BALDWIN     AL
## 3:    1.00  2/20/1951 0:00:00     1600       CST  57.00    FAYETTE     AL
## 4:    1.00   6/8/1951 0:00:00     0900       CST  89.00    MADISON     AL
## 5:    1.00 11/15/1951 0:00:00     1500       CST  43.00    CULLMAN     AL
## 6:    1.00 11/15/1951 0:00:00     2000       CST  77.00 LAUDERDALE     AL
##     EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
##     <char>    <char>  <char>     <char>   <char>   <char>      <num>     <lgcl>
## 1: TORNADO      0.00                                               0         NA
## 2: TORNADO      0.00                                               0         NA
## 3: TORNADO      0.00                                               0         NA
## 4: TORNADO      0.00                                               0         NA
## 5: TORNADO      0.00                                               0         NA
## 6: TORNADO      0.00                                               0         NA
##    END_RANGE END_AZI END_LOCATI LENGTH WIDTH     F   MAG FATALITIES INJURIES
##        <num>  <char>     <char>  <num> <num> <int> <num>      <num>    <num>
## 1:         0                      14.0   100     3     0          0       15
## 2:         0                       2.0   150     2     0          0        0
## 3:         0                       0.1   123     2     0          0        2
## 4:         0                       0.0   100     2     0          0        2
## 5:         0                       0.0   150     2     0          0        2
## 6:         0                       1.5   177     2     0          0        6
##    PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP    WFO STATEOFFIC ZONENAMES LATITUDE
##      <num>     <char>   <num>     <char> <char>     <char>    <char>    <num>
## 1:    25.0          K       0                                            3040
## 2:     2.5          K       0                                            3042
## 3:    25.0          K       0                                            3340
## 4:     2.5          K       0                                            3458
## 5:     2.5          K       0                                            3412
## 6:     2.5          K       0                                            3450
##    LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
##        <num>      <num>      <num>  <char>  <num>
## 1:      8812       3051       8806              1
## 2:      8755          0          0              2
## 3:      8742          0          0              3
## 4:      8626          0          0              4
## 5:      8642          0          0              5
## 6:      8748          0          0              6
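
Because the file is bzip2-compressed, the explicit decompression step is optional when reading with base R: read.csv() opens compressed files transparently. The sketch below shows this on a small temporary file rather than the real dataset; the toy columns and values are made up for illustration.

```r
# Write a tiny bzip2-compressed CSV to a temporary file (illustrative data only).
toy <- data.frame(EVTYPE = c("TORNADO", "FLOOD"),
                  FATALITIES = c(1, 0),
                  stringsAsFactors = FALSE)
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() detects the compression from the file contents
# and decompresses on the fly.
toy_back <- read.csv(tmp, stringsAsFactors = FALSE)
print(toy_back)
```

On the real file the equivalent call would be read.csv("stormdata.csv.bz2"), which is likely to be noticeably slower than fread on the decompressed CSV for a file of this size.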

Next, we check the variable names:

names(StormInfo)
##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"

Next, we use dplyr to create a subset containing only the variables relevant to our analysis, with their names converted to lowercase.

library(dplyr)
## Warning: package 'dplyr' was built under R version 4.4.3
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
Stormsubset <- StormInfo %>% 
  select(EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP) %>%
  rename_with(tolower)
str(Stormsubset)
## Classes 'data.table' and 'data.frame':   902297 obs. of  7 variables:
##  $ evtype    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ fatalities: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ injuries  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ propdmg   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ propdmgexp: chr  "K" "K" "K" "K" ...
##  $ cropdmg   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ cropdmgexp: chr  "" "" "" "" ...
##  - attr(*, ".internal.selfref")=<externalptr>

Data Processing

Analysis of Population Health Impact

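We proceed in three steps. First, we select the columns related to population health. Second, we total fatalities and injuries for each event type. Third, we sort the totals in descending order of fatalities (with injuries breaking ties) and keep the top 10 event types. The result is then reshaped to long format for a bar plot.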

library(tidyr)
## Warning: package 'tidyr' was built under R version 4.4.3
step1 <- Stormsubset %>% 
    select(evtype, fatalities, injuries) 
print(head(step1))
##     evtype fatalities injuries
##     <char>      <num>    <num>
## 1: TORNADO          0       15
## 2: TORNADO          0        0
## 3: TORNADO          0        2
## 4: TORNADO          0        2
## 5: TORNADO          0        2
## 6: TORNADO          0        6
step2 <- step1 %>%
    group_by(evtype) %>%
    summarize(fatalities = sum(fatalities), injuries = sum(injuries), .groups = 'drop')
print(head(step2))
## # A tibble: 6 × 3
##   evtype                  fatalities injuries
##   <chr>                        <dbl>    <dbl>
## 1 "   HIGH SURF ADVISORY"          0        0
## 2 " COASTAL FLOOD"                 0        0
## 3 " FLASH FLOOD"                   0        0
## 4 " LIGHTNING"                     0        0
## 5 " TSTM WIND"                     0        0
## 6 " TSTM WIND (G45)"               0        0
step3 <- step2 %>%
    arrange(desc(fatalities), desc(injuries)) %>%
    slice(1:10)
print(step3)
## # A tibble: 10 × 3
##    evtype         fatalities injuries
##    <chr>               <dbl>    <dbl>
##  1 TORNADO              5633    91346
##  2 EXCESSIVE HEAT       1903     6525
##  3 FLASH FLOOD           978     1777
##  4 HEAT                  937     2100
##  5 LIGHTNING             816     5230
##  6 TSTM WIND             504     6957
##  7 FLOOD                 470     6789
##  8 RIP CURRENT           368      232
##  9 HIGH WIND             248     1137
## 10 AVALANCHE             224      170
Effect_health <- step3 %>%
    pivot_longer(cols = c(fatalities, injuries), names_to = "type", values_to = "value")
print(head(Effect_health))
## # A tibble: 6 × 3
##   evtype         type       value
##   <chr>          <chr>      <dbl>
## 1 TORNADO        fatalities  5633
## 2 TORNADO        injuries   91346
## 3 EXCESSIVE HEAT fatalities  1903
## 4 EXCESSIVE HEAT injuries    6525
## 5 FLASH FLOOD    fatalities   978
## 6 FLASH FLOOD    injuries    1777
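
The step2 output above shows event types with leading spaces (for example " TSTM WIND"), which are grouped separately from their trimmed counterparts. One possible cleanup, not applied in this analysis, is to trim whitespace and normalize case before grouping. The sketch below uses base R on made-up toy data; the column evtype_clean is a hypothetical name, and whether this changes the top 10 in the real data is not checked here.

```r
# Toy data: the same event type spelled with stray spaces and mixed case.
toy <- data.frame(
  evtype     = c(" TSTM WIND", "TSTM WIND", "Tstm Wind", "TORNADO"),
  fatalities = c(1, 2, 3, 4),
  stringsAsFactors = FALSE
)

# Trim surrounding whitespace and upper-case the labels before grouping.
toy$evtype_clean <- toupper(trimws(toy$evtype))

# Sum fatalities per cleaned label.
totals <- aggregate(fatalities ~ evtype_clean, data = toy, FUN = sum)
print(totals)
```

In the dplyr pipeline, the same idea would amount to a mutate(evtype = toupper(trimws(evtype))) step placed before group_by(evtype).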

1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

# Checking if ggplot2 is already installed
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.4.3
# Checking the data
if (exists("Effect_health")) {
  str(Effect_health)
  head(Effect_health)
  
  # Plot
  ggplot(data = Effect_health, aes(x = reorder(evtype, -value), y = value, fill = type)) +
    geom_bar(position = "dodge", stat = "identity") +
    labs(x = "Event Type", y = "Count") +
    theme_bw() +
    theme(axis.text.x = element_text(angle = 20, vjust = 0.7)) +
    ggtitle("Total Number of Fatalities and Injuries of Top 10 Storm Event Types") +
    scale_fill_manual(values = c("blue", "gray"))
} else {
  stop("The object 'Effect_health' is not defined. Please make sure the data have been processed correctly.")
}
## tibble [20 × 3] (S3: tbl_df/tbl/data.frame)
##  $ evtype: chr [1:20] "TORNADO" "TORNADO" "EXCESSIVE HEAT" "EXCESSIVE HEAT" ...
##  $ type  : chr [1:20] "fatalities" "injuries" "fatalities" "injuries" ...
##  $ value : num [1:20] 5633 91346 1903 6525 978 ...

Tornadoes have by far the greatest impact on public health: they account for 5,633 fatalities and 91,346 injuries, the highest totals of any event type. Excessive heat is a distant second, with 1,903 fatalities and 6,525 injuries.

Analysis of Economic Impact

The variables PROPDMG and CROPDMG record property and crop damage amounts, while PROPDMGEXP and CROPDMGEXP hold the magnitude code for each amount. Together they identify the events with the greatest economic impact.

The magnitude codes are inconsistent, so I created a function that maps each code to a multiplier expressed in millions of dollars: H (hundreds), K (thousands), M (millions), and B (billions). Any other value, including blanks, digits, and lowercase letters, is treated as plain dollars.

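# Convert a damage magnitude code into a multiplier in millions of dollars.
cost_economy <- function(x) {
  if (x == "H")        # hundreds
    1E-4
  else if (x == "K")   # thousands
    1E-3
  else if (x == "M")   # millions
    1
  else if (x == "B")   # billions
    1E3
  else                 # anything else: plain dollars
    1E-6  
}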
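
The same mapping can also be written as a vectorized lookup, which avoids calling the function once per row through sapply. This is only a sketch: exp_lookup and exp_multiplier are hypothetical names, and the ignore_case option is an assumption that is not part of the original function (by default the helper is case-sensitive, like cost_economy).

```r
# Multipliers in millions of dollars, matching cost_economy().
exp_lookup <- c(H = 1e-4, K = 1e-3, M = 1, B = 1e3)

# Hypothetical helper: vectorized over a character vector of magnitude codes.
# ignore_case = TRUE is an assumption, not behavior of the original function.
exp_multiplier <- function(x, ignore_case = FALSE) {
  if (ignore_case) x <- toupper(x)
  m <- unname(exp_lookup[x])   # codes without an entry come back as NA
  m[is.na(m)] <- 1e-6          # default: plain dollars
  m
}

print(exp_multiplier(c("K", "M", "B", "H", "", "m")))
print(exp_multiplier("m", ignore_case = TRUE))
```

With such a helper, the mutate step could use propdmg * exp_multiplier(propdmgexp) directly. Setting ignore_case = TRUE would count a lowercase "m" as millions, which may change the totals relative to the results reported here.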

With the exponents standardized, we build Effect_economy: total property and crop damage for each event type (in millions of dollars), limited to the ten event types with the highest property damage and reshaped to long format for plotting.

Effect_economy <-
    Stormsubset %>% 
    select("evtype", "propdmg", "propdmgexp", "cropdmg", "cropdmgexp") %>% 
    mutate(prop_dmg = propdmg * sapply(propdmgexp, FUN = cost_economy), 
           crop_dmg = cropdmg * sapply(cropdmgexp, FUN = cost_economy), .keep = "unused") %>%
    group_by(evtype) %>% 
    summarize(property = sum(prop_dmg), crop = sum(crop_dmg), .groups = 'drop') %>%
    arrange(desc(property), desc(crop)) %>%
    slice(1:10) %>% 
    pivot_longer(cols = c(property, crop), names_to = "type", values_to = "value")

2. Across the United States, which types of events have the greatest economic consequences?

ggplot(data=Effect_economy, aes(reorder(evtype, -value), value, fill=type)) +
  geom_bar(position = "dodge", stat="identity") + 
  labs(x="Event Type", y="Cost (millions of USD)") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 25, vjust=0.5)) + 
  ggtitle("Total Cost of Property and Crop Damage by Top 10 Storm Event Types") +
  scale_fill_manual(values=c("blue", "grey"))

The bar plot shows that floods and hurricanes/typhoons incur the highest property and crop damage costs, making them the most economically impactful events.
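
Note that Effect_economy ranks event types by property damage first, with crop damage used only to break ties, so an event type with large crop losses but modest property losses may not reach the top ten. One alternative, not used in this analysis, is to rank by the combined total. The sketch below illustrates the difference on made-up numbers; the event labels and values are hypothetical and are not results from the dataset.

```r
# Toy totals in millions of dollars (made-up numbers for illustration).
toy <- data.frame(
  evtype   = c("A", "B", "C"),
  property = c(100, 80, 10),
  crop     = c(0, 50, 5),
  stringsAsFactors = FALSE
)

# Rank by combined property + crop damage instead of property alone.
toy$total <- toy$property + toy$crop
ranked <- toy[order(-toy$total), ]
print(ranked)
```

Here event type A leads on property damage alone, but B ranks first once crop damage is included.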

Conclusion

The analysis of NOAA’s Storm Events Database highlights significant findings regarding the health and economic impacts of severe weather events in the United States:

  1. Impact on Public Health: Tornadoes are the most detrimental to public health, causing the highest number of fatalities and injuries across the country. Their unpredictable nature and intensity make them a critical focus area for disaster preparedness and response efforts.

  2. Economic Consequences: Floods and hurricanes/typhoons lead to the most substantial property and crop damage, resulting in billions of dollars in losses. These events underscore the need for enhanced infrastructure, insurance mechanisms, and climate adaptation strategies to mitigate future economic impacts.

This analysis serves as a foundation for policymakers and stakeholders to prioritize resources, improve resilience, and protect communities from the adverse effects of natural disasters. By understanding the patterns of harm and cost, targeted interventions can be designed to minimize both human and financial losses effectively.