Finding the Most Harmful Storm Events in The United States

The objective of this research was to find the most harmful storm events in the United States in two senses, health and material. For the health damages caused by storms, there are values that account for cases of Injuries and Deaths related to each storm event. For Material damages there is monetary values that account for the loss caused by a storm event, either on properties or crops.

The data base used for this Analysis was the Storm Data Base of The United States from NATIONAL WEATHER SERVICE INSTRUCTION, composed by the Department of Commerce, the National Oceanic & Atmospheric Administration and the National Weather Service. The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.

Downloading the data from cloudfront, Storm Data and decompressing it:

Data Processing

#First we load R.utils to decompress the bz2 doc, then the data.table to read it.
library(R.utils)
## Warning: package 'R.utils' was built under R version 4.3.3
## Loading required package: R.oo
## Warning: package 'R.oo' was built under R version 4.3.3
## Loading required package: R.methodsS3
## Warning: package 'R.methodsS3' was built under R version 4.3.3
## R.methodsS3 v1.8.2 (2022-06-13 22:00:14 UTC) successfully loaded. See ?R.methodsS3 for help.
## R.oo v1.26.0 (2024-01-24 05:12:50 UTC) successfully loaded. See ?R.oo for help.
## 
## Attaching package: 'R.oo'
## The following object is masked from 'package:R.methodsS3':
## 
##     throw
## The following objects are masked from 'package:methods':
## 
##     getClasses, getMethods
## The following objects are masked from 'package:base':
## 
##     attach, detach, load, save
## R.utils v2.12.3 (2023-11-18 01:00:02 UTC) successfully loaded. See ?R.utils for help.
## 
## Attaching package: 'R.utils'
## The following object is masked from 'package:utils':
## 
##     timestamp
## The following objects are masked from 'package:base':
## 
##     cat, commandArgs, getOption, isOpen, nullfile, parse, warnings
library(data.table)
## Warning: package 'data.table' was built under R version 4.3.3
#Now we save the info we will need as arguments to the downloading and reading func.
url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
path <- getwd()
destfile <- paste(path, "StormData.bz2", sep = "/")

#Download
download.file(url = url, destfile = destfile)

#Decompress
R.utils::bunzip2(destfile, remove = FALSE, overwrite = TRUE)
decompressed_file <- paste0(path, "/StormData.csv")

#Read
StormData <- data.table::as.data.table(x = read.csv(file = "StormData"))

str(StormData)
## Classes 'data.table' and 'data.frame':   902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  "" "" "" "" ...
##  $ BGN_LOCATI: chr  "" "" "" "" ...
##  $ END_DATE  : chr  "" "" "" "" ...
##  $ END_TIME  : chr  "" "" "" "" ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  "" "" "" "" ...
##  $ END_LOCATI: chr  "" "" "" "" ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ WFO       : chr  "" "" "" "" ...
##  $ STATEOFFIC: chr  "" "" "" "" ...
##  $ ZONENAMES : chr  "" "" "" "" ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  "" "" "" "" ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...
##  - attr(*, ".internal.selfref")=<externalptr>
head(StormData)
##    STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME  STATE
##      <num>             <char>   <char>    <char>  <num>     <char> <char>
## 1:       1  4/18/1950 0:00:00     0130       CST     97     MOBILE     AL
## 2:       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN     AL
## 3:       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE     AL
## 4:       1   6/8/1951 0:00:00     0900       CST     89    MADISON     AL
## 5:       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN     AL
## 6:       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE     AL
##     EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
##     <char>     <num>  <char>     <char>   <char>   <char>      <num>     <lgcl>
## 1: TORNADO         0                                               0         NA
## 2: TORNADO         0                                               0         NA
## 3: TORNADO         0                                               0         NA
## 4: TORNADO         0                                               0         NA
## 5: TORNADO         0                                               0         NA
## 6: TORNADO         0                                               0         NA
##    END_RANGE END_AZI END_LOCATI LENGTH WIDTH     F   MAG FATALITIES INJURIES
##        <num>  <char>     <char>  <num> <num> <int> <num>      <num>    <num>
## 1:         0                      14.0   100     3     0          0       15
## 2:         0                       2.0   150     2     0          0        0
## 3:         0                       0.1   123     2     0          0        2
## 4:         0                       0.0   100     2     0          0        2
## 5:         0                       0.0   150     2     0          0        2
## 6:         0                       1.5   177     2     0          0        6
##    PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP    WFO STATEOFFIC ZONENAMES LATITUDE
##      <num>     <char>   <num>     <char> <char>     <char>    <char>    <num>
## 1:    25.0          K       0                                            3040
## 2:     2.5          K       0                                            3042
## 3:    25.0          K       0                                            3340
## 4:     2.5          K       0                                            3458
## 5:     2.5          K       0                                            3412
## 6:     2.5          K       0                                            3450
##    LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
##        <num>      <num>      <num>  <char>  <num>
## 1:      8812       3051       8806              1
## 2:      8755          0          0              2
## 3:      8742          0          0              3
## 4:      8626          0          0              4
## 5:      8642          0          0              5
## 6:      8748          0          0              6

The first step of the analysis is to find the total harm (to health or economics). Thus we need the total effects of the events, found by compressing all the rows that account for the same event type and sum their values. I will also add a column with the value of 1 in each cell, when summed, this will count for the occurrence of each event type, by adding the rows with 1 in them. Other modification is on the economic harms that will have a column to count for the size of loss column float, being k for thousand, M for million and B for billion. To be able to sum the correct values I created a correspondent column with the numerical version of these thousand, million and billion to serve as a multiplier and construct a fourth and a fifth columns of with the total economic losses for crops and properties. Then these values will be summed to correspond to each type event total economic harms.

library(dplyr)
## Warning: package 'dplyr' was built under R version 4.3.3
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:data.table':
## 
##     between, first, last
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
#Let`s add NA on every blank space for consistency sake.
StormData[StormData == "" | StormData == "?"] <- NA

#Now, since we are only interested on the type of event v.s it`s damage nationwide, lets make a table with events and  total damage as the cols. Also lets find the total occurrences of each event to have a mean damage per recurrence.

#First we create a col with one`s that will be summed when we subset the data frame.
StormData$EVCOUNTER <- 1

#Now lets convert the letters of cols PROPDMGEXP and CROPDMGEXP and make them into multipliers to correctly sum the total damage in dollars.

StormData <- StormData %>%
  mutate(MULTPROP = case_when(
    PROPDMGEXP == "K" ~ 1e3,
    PROPDMGEXP == "M" ~ 1e6,
    PROPDMGEXP == "B" ~ 1e9,
    TRUE ~ 0  # When we have no multiplier we have to discard the value
  )) %>%
  mutate(FULLPROPDMG = PROPDMG * MULTPROP)

StormData <- StormData %>%
  mutate(MULTCROP = case_when(
    CROPDMGEXP == "K" ~ 1e3,
    CROPDMGEXP == "M" ~ 1e6,
    CROPDMGEXP == "B" ~ 1e9,
    TRUE ~ 0  # When we have no multiplier we have to discard the value
  )) %>%
  mutate(FULLCROPDMG = CROPDMG * MULTCROP)


#Now we subset events as rows and effects + EVCOUNTER as cols:
TotalDMG <- StormData[, lapply(.SD, sum, na.rm=TRUE), .SDcols = c("FATALITIES","INJURIES","FULLPROPDMG","FULLCROPDMG","EVCOUNTER"), by = EVTYPE]

head(TotalDMG)
##                   EVTYPE FATALITIES INJURIES FULLPROPDMG FULLCROPDMG EVCOUNTER
##                   <char>      <num>    <num>       <num>       <num>     <num>
## 1:               TORNADO       5633    91346 56925660480   414953110     60652
## 2:             TSTM WIND        504     6957  4484928440   554007350    219940
## 3:                  HAIL         15     1361 15727366720  3025537450    288661
## 4:         FREEZING RAIN          7       23     8111500           0       250
## 5:                  SNOW          5       29    14762550       10000       587
## 6: ICE STORM/FLASH FLOOD          0        2           0           0         1
#Now we can filter for the most recurring events, to observe how their total health harms and economic losses are displayed.

Frequent_Events <- TotalDMG[order(TotalDMG$EVCOUNTER, decreasing = TRUE)][1:20]
head(Frequent_Events, 20)
##                       EVTYPE FATALITIES INJURIES  FULLPROPDMG FULLCROPDMG
##                       <char>      <num>    <num>        <num>       <num>
##  1:                     HAIL         15     1361  15727366720  3025537450
##  2:                TSTM WIND        504     6957   4484928440   554007350
##  3:        THUNDERSTORM WIND        133     1488   3483121140   414843050
##  4:                  TORNADO       5633    91346  56925660480   414953110
##  5:              FLASH FLOOD        978     1777  16140811510  1421317100
##  6:                    FLOOD        470     6789 144657709800  5661968450
##  7:       THUNDERSTORM WINDS         64      908   1733452850   190650700
##  8:                HIGH WIND        248     1137   5270046260   638571300
##  9:                LIGHTNING        816     5230    928659280    12092090
## 10:               HEAVY SNOW        127     1021    932589140   134653100
## 11:               HEAVY RAIN         98      251    694248090   733399800
## 12:             WINTER STORM        206     1321   6688497250    26944000
## 13:           WINTER WEATHER         33      398     20866000    15000000
## 14:             FUNNEL CLOUD          0        3       194600           0
## 15:         MARINE TSTM WIND          9        8      5421000           0
## 16: MARINE THUNDERSTORM WIND         10       26       436400       50000
## 17:               WATERSPOUT          3       29      9353700           0
## 18:              STRONG WIND        103      280    175241450    64953500
## 19:     URBAN/SML STREAM FLD         28       79     58309650     8488100
## 20:                 WILDFIRE         75      911   4765114000   295472800
##                       EVTYPE FATALITIES INJURIES  FULLPROPDMG FULLCROPDMG
##     EVCOUNTER
##         <num>
##  1:    288661
##  2:    219940
##  3:     82563
##  4:     60652
##  5:     54277
##  6:     25326
##  7:     20843
##  8:     20212
##  9:     15754
## 10:     15708
## 11:     11723
## 12:     11433
## 13:      7026
## 14:      6839
## 15:      6175
## 16:      5812
## 17:      3796
## 18:      3566
## 19:      3392
## 20:      2761
##     EVCOUNTER
#I'm not plotting it yet because the visualization will not be as good.
#To better explore the data lets add some relative values to the cols to a better analysis of the effects of the events on average.

RelativeDMG <- TotalDMG %>%
  mutate(Mean_Fatalities = FATALITIES/EVCOUNTER,
         Mean_Injuries = INJURIES/EVCOUNTER,
         Mean_PropDMG = FULLPROPDMG/EVCOUNTER,
         Mean_CropDMG = FULLCROPDMG/EVCOUNTER)

#To expand the possibility of enlightening plots we add cols for total health harm and total economic losses per event and also their average per recurrence.

RelativeDMG$HARM <- RelativeDMG$FATALITIES + RelativeDMG$INJURIES
RelativeDMG$LOSS <- RelativeDMG$FULLPROPDMG + RelativeDMG$FULLCROPDMG
RelativeDMG$Mean_Harm <- RelativeDMG$HARM/RelativeDMG$EVCOUNTER
RelativeDMG$Mean_Loss <- RelativeDMG$LOSS/RelativeDMG$EVCOUNTER

str(RelativeDMG)
## Classes 'data.table' and 'data.frame':   985 obs. of  14 variables:
##  $ EVTYPE         : chr  "TORNADO" "TSTM WIND" "HAIL" "FREEZING RAIN" ...
##  $ FATALITIES     : num  5633 504 15 7 5 ...
##  $ INJURIES       : num  91346 6957 1361 23 29 ...
##  $ FULLPROPDMG    : num  5.69e+10 4.48e+09 1.57e+10 8.11e+06 1.48e+07 ...
##  $ FULLCROPDMG    : num  4.15e+08 5.54e+08 3.03e+09 0.00 1.00e+04 ...
##  $ EVCOUNTER      : num  60652 219940 288661 250 587 ...
##  $ Mean_Fatalities: num  0.092874 0.002292 0.000052 0.028 0.008518 ...
##  $ Mean_Injuries  : num  1.50607 0.03163 0.00471 0.092 0.0494 ...
##  $ Mean_PropDMG   : num  938562 20392 54484 32446 25149 ...
##  $ Mean_CropDMG   : num  6842 2519 10481 0 17 ...
##  $ HARM           : num  96979 7461 1376 30 34 ...
##  $ LOSS           : num  5.73e+10 5.04e+09 1.88e+10 8.11e+06 1.48e+07 ...
##  $ Mean_Harm      : num  1.59894 0.03392 0.00477 0.12 0.05792 ...
##  $ Mean_Loss      : num  945404 22911 64965 32446 25166 ...
##  - attr(*, ".internal.selfref")=<externalptr>

Results

Here we plot the filtered table based on the events that cause the biggest health effects Injuries + Deaths. And we observe that Tornadoes, Excessive Heat and Floods are the events that cause the biggest total effects to the health of people in the US.

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.3.3
#Now let's plot the most Dangerous events for health:

Harmful_absolute <- RelativeDMG[order(RelativeDMG$HARM, decreasing = TRUE), ][1:10, ]

ggplot(Harmful_absolute, aes(x = factor(EVTYPE),y = HARM), fill = EVTYPE) + 
  geom_bar(aes(fill = EVTYPE), stat = "identity") +
  labs(x="Storm Events", y= "Health Harm Related to the Events") + 
  labs(title=expression("Most Harmful Storm Events in the US"))

# Now we show which ones are the most dangerous or generate the most losses among them.

Doing the same filtering for total economic losses (crop losses + property damage), we can observe that the events that cause the total biggest economic harm in the US are Flood, Hurricane/Typhoon and Tornado, respectively.

#

Loss_absolute <- RelativeDMG[order(RelativeDMG$LOSS, decreasing = TRUE), ][1:10, ]

ggplot(Loss_absolute, aes(x = factor(EVTYPE),y = LOSS), fill = EVTYPE) + 
  geom_bar(aes(fill = EVTYPE), stat = "identity") +
  labs(x="Storm Events", y= "Economic Losses Related to the Events") + 
  labs(title=expression("Biggest Economic Losses by Storm Events in the US"))

# Now we show which ones are the most dangerous or generate the most losses among them.

For curiosity sake I filtered and plotted the most harmful events per occurrence. Meaning that I averaged the impact over the occurrence of the event and found a measure of the relative harm of each event. Finding out that the biggest relative health impact per occurrence (the most dangerous events) were caused by a Heat Wave, the Tropical storm Gordon and Wild Fires (in general). The most costly events were Tornadoes, Heavy Rain and Huricame/Typhoon

library(gridExtra)
## Warning: package 'gridExtra' was built under R version 4.3.3
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
# Lets plot the two effects relative to the occurence:
Health_relative <- RelativeDMG[order(RelativeDMG$Mean_Harm, decreasing = TRUE), ][1:8, ]
Loss_relative <- RelativeDMG[order(RelativeDMG$Mean_Loss, decreasing = TRUE), ][1:8, ]


p1 <- ggplot(Health_relative, aes(x = factor(EVTYPE),y = Mean_Harm), fill = EVTYPE) + 
  geom_bar(aes(fill = EVTYPE), stat = "identity") +
  labs(x="Storm Events", y= "Relative Harm") + 
  labs(title=expression("Relative Harm"))

p2 <- ggplot(Loss_relative, aes(x = factor(EVTYPE),y = Mean_Loss), fill = EVTYPE) + 
  geom_bar(aes(fill = EVTYPE), stat = "identity") +
  labs(x="Storm Events", y= "Relative Economic Losses") + 
  labs(title=expression("Relative Economic Losses"))

grid.arrange(p1, p2, ncol = 1)