The objective of this research was to find the most harmful storm events in the United States in two senses, health and material. For the health damages caused by storms, there are values that account for cases of Injuries and Deaths related to each storm event. For Material damages there is monetary values that account for the loss caused by a storm event, either on properties or crops.
The data base used for this Analysis was the Storm Data Base of The United States from NATIONAL WEATHER SERVICE INSTRUCTION, composed by the Department of Commerce, the National Oceanic & Atmospheric Administration and the National Weather Service. The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
Downloading the data from cloudfront, Storm Data and decompressing it:
#First we load R.utils to decompress the bz2 doc, then the data.table to read it.
library(R.utils)
## Warning: package 'R.utils' was built under R version 4.3.3
## Loading required package: R.oo
## Warning: package 'R.oo' was built under R version 4.3.3
## Loading required package: R.methodsS3
## Warning: package 'R.methodsS3' was built under R version 4.3.3
## R.methodsS3 v1.8.2 (2022-06-13 22:00:14 UTC) successfully loaded. See ?R.methodsS3 for help.
## R.oo v1.26.0 (2024-01-24 05:12:50 UTC) successfully loaded. See ?R.oo for help.
##
## Attaching package: 'R.oo'
## The following object is masked from 'package:R.methodsS3':
##
## throw
## The following objects are masked from 'package:methods':
##
## getClasses, getMethods
## The following objects are masked from 'package:base':
##
## attach, detach, load, save
## R.utils v2.12.3 (2023-11-18 01:00:02 UTC) successfully loaded. See ?R.utils for help.
##
## Attaching package: 'R.utils'
## The following object is masked from 'package:utils':
##
## timestamp
## The following objects are masked from 'package:base':
##
## cat, commandArgs, getOption, isOpen, nullfile, parse, warnings
library(data.table)
## Warning: package 'data.table' was built under R version 4.3.3
#Now we save the info we will need as arguments to the downloading and reading func.
url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
path <- getwd()
destfile <- paste(path, "StormData.bz2", sep = "/")
#Download
download.file(url = url, destfile = destfile)
#Decompress
R.utils::bunzip2(destfile, remove = FALSE, overwrite = TRUE)
decompressed_file <- paste0(path, "/StormData.csv")
#Read
StormData <- data.table::as.data.table(x = read.csv(file = "StormData"))
str(StormData)
## Classes 'data.table' and 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
## - attr(*, ".internal.selfref")=<externalptr>
head(StormData)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## <num> <char> <char> <char> <num> <char> <char>
## 1: 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2: 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3: 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## 4: 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL
## 5: 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL
## 6: 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL
## EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
## <char> <num> <char> <char> <char> <char> <num> <lgcl>
## 1: TORNADO 0 0 NA
## 2: TORNADO 0 0 NA
## 3: TORNADO 0 0 NA
## 4: TORNADO 0 0 NA
## 5: TORNADO 0 0 NA
## 6: TORNADO 0 0 NA
## END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES
## <num> <char> <char> <num> <num> <int> <num> <num> <num>
## 1: 0 14.0 100 3 0 0 15
## 2: 0 2.0 150 2 0 0 0
## 3: 0 0.1 123 2 0 0 2
## 4: 0 0.0 100 2 0 0 2
## 5: 0 0.0 150 2 0 0 2
## 6: 0 1.5 177 2 0 0 6
## PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE
## <num> <char> <num> <char> <char> <char> <char> <num>
## 1: 25.0 K 0 3040
## 2: 2.5 K 0 3042
## 3: 25.0 K 0 3340
## 4: 2.5 K 0 3458
## 5: 2.5 K 0 3412
## 6: 2.5 K 0 3450
## LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## <num> <num> <num> <char> <num>
## 1: 8812 3051 8806 1
## 2: 8755 0 0 2
## 3: 8742 0 0 3
## 4: 8626 0 0 4
## 5: 8642 0 0 5
## 6: 8748 0 0 6
The first step of the analysis is to find the total harm (to health or economics). Thus we need the total effects of the events, found by compressing all the rows that account for the same event type and sum their values. I will also add a column with the value of 1 in each cell, when summed, this will count for the occurrence of each event type, by adding the rows with 1 in them. Other modification is on the economic harms that will have a column to count for the size of loss column float, being k for thousand, M for million and B for billion. To be able to sum the correct values I created a correspondent column with the numerical version of these thousand, million and billion to serve as a multiplier and construct a fourth and a fifth columns of with the total economic losses for crops and properties. Then these values will be summed to correspond to each type event total economic harms.
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.3.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:data.table':
##
## between, first, last
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
#Let`s add NA on every blank space for consistency sake.
StormData[StormData == "" | StormData == "?"] <- NA
#Now, since we are only interested on the type of event v.s it`s damage nationwide, lets make a table with events and total damage as the cols. Also lets find the total occurrences of each event to have a mean damage per recurrence.
#First we create a col with one`s that will be summed when we subset the data frame.
StormData$EVCOUNTER <- 1
#Now lets convert the letters of cols PROPDMGEXP and CROPDMGEXP and make them into multipliers to correctly sum the total damage in dollars.
StormData <- StormData %>%
mutate(MULTPROP = case_when(
PROPDMGEXP == "K" ~ 1e3,
PROPDMGEXP == "M" ~ 1e6,
PROPDMGEXP == "B" ~ 1e9,
TRUE ~ 0 # When we have no multiplier we have to discard the value
)) %>%
mutate(FULLPROPDMG = PROPDMG * MULTPROP)
StormData <- StormData %>%
mutate(MULTCROP = case_when(
CROPDMGEXP == "K" ~ 1e3,
CROPDMGEXP == "M" ~ 1e6,
CROPDMGEXP == "B" ~ 1e9,
TRUE ~ 0 # When we have no multiplier we have to discard the value
)) %>%
mutate(FULLCROPDMG = CROPDMG * MULTCROP)
#Now we subset events as rows and effects + EVCOUNTER as cols:
TotalDMG <- StormData[, lapply(.SD, sum, na.rm=TRUE), .SDcols = c("FATALITIES","INJURIES","FULLPROPDMG","FULLCROPDMG","EVCOUNTER"), by = EVTYPE]
head(TotalDMG)
## EVTYPE FATALITIES INJURIES FULLPROPDMG FULLCROPDMG EVCOUNTER
## <char> <num> <num> <num> <num> <num>
## 1: TORNADO 5633 91346 56925660480 414953110 60652
## 2: TSTM WIND 504 6957 4484928440 554007350 219940
## 3: HAIL 15 1361 15727366720 3025537450 288661
## 4: FREEZING RAIN 7 23 8111500 0 250
## 5: SNOW 5 29 14762550 10000 587
## 6: ICE STORM/FLASH FLOOD 0 2 0 0 1
#Now we can filter for the most recurring events, to observe how their total health harms and economic losses are displayed.
Frequent_Events <- TotalDMG[order(TotalDMG$EVCOUNTER, decreasing = TRUE)][1:20]
head(Frequent_Events, 20)
## EVTYPE FATALITIES INJURIES FULLPROPDMG FULLCROPDMG
## <char> <num> <num> <num> <num>
## 1: HAIL 15 1361 15727366720 3025537450
## 2: TSTM WIND 504 6957 4484928440 554007350
## 3: THUNDERSTORM WIND 133 1488 3483121140 414843050
## 4: TORNADO 5633 91346 56925660480 414953110
## 5: FLASH FLOOD 978 1777 16140811510 1421317100
## 6: FLOOD 470 6789 144657709800 5661968450
## 7: THUNDERSTORM WINDS 64 908 1733452850 190650700
## 8: HIGH WIND 248 1137 5270046260 638571300
## 9: LIGHTNING 816 5230 928659280 12092090
## 10: HEAVY SNOW 127 1021 932589140 134653100
## 11: HEAVY RAIN 98 251 694248090 733399800
## 12: WINTER STORM 206 1321 6688497250 26944000
## 13: WINTER WEATHER 33 398 20866000 15000000
## 14: FUNNEL CLOUD 0 3 194600 0
## 15: MARINE TSTM WIND 9 8 5421000 0
## 16: MARINE THUNDERSTORM WIND 10 26 436400 50000
## 17: WATERSPOUT 3 29 9353700 0
## 18: STRONG WIND 103 280 175241450 64953500
## 19: URBAN/SML STREAM FLD 28 79 58309650 8488100
## 20: WILDFIRE 75 911 4765114000 295472800
## EVTYPE FATALITIES INJURIES FULLPROPDMG FULLCROPDMG
## EVCOUNTER
## <num>
## 1: 288661
## 2: 219940
## 3: 82563
## 4: 60652
## 5: 54277
## 6: 25326
## 7: 20843
## 8: 20212
## 9: 15754
## 10: 15708
## 11: 11723
## 12: 11433
## 13: 7026
## 14: 6839
## 15: 6175
## 16: 5812
## 17: 3796
## 18: 3566
## 19: 3392
## 20: 2761
## EVCOUNTER
#I'm not plotting it yet because the visualization will not be as good.
#To better explore the data lets add some relative values to the cols to a better analysis of the effects of the events on average.
RelativeDMG <- TotalDMG %>%
mutate(Mean_Fatalities = FATALITIES/EVCOUNTER,
Mean_Injuries = INJURIES/EVCOUNTER,
Mean_PropDMG = FULLPROPDMG/EVCOUNTER,
Mean_CropDMG = FULLCROPDMG/EVCOUNTER)
#To expand the possibility of enlightening plots we add cols for total health harm and total economic losses per event and also their average per recurrence.
RelativeDMG$HARM <- RelativeDMG$FATALITIES + RelativeDMG$INJURIES
RelativeDMG$LOSS <- RelativeDMG$FULLPROPDMG + RelativeDMG$FULLCROPDMG
RelativeDMG$Mean_Harm <- RelativeDMG$HARM/RelativeDMG$EVCOUNTER
RelativeDMG$Mean_Loss <- RelativeDMG$LOSS/RelativeDMG$EVCOUNTER
str(RelativeDMG)
## Classes 'data.table' and 'data.frame': 985 obs. of 14 variables:
## $ EVTYPE : chr "TORNADO" "TSTM WIND" "HAIL" "FREEZING RAIN" ...
## $ FATALITIES : num 5633 504 15 7 5 ...
## $ INJURIES : num 91346 6957 1361 23 29 ...
## $ FULLPROPDMG : num 5.69e+10 4.48e+09 1.57e+10 8.11e+06 1.48e+07 ...
## $ FULLCROPDMG : num 4.15e+08 5.54e+08 3.03e+09 0.00 1.00e+04 ...
## $ EVCOUNTER : num 60652 219940 288661 250 587 ...
## $ Mean_Fatalities: num 0.092874 0.002292 0.000052 0.028 0.008518 ...
## $ Mean_Injuries : num 1.50607 0.03163 0.00471 0.092 0.0494 ...
## $ Mean_PropDMG : num 938562 20392 54484 32446 25149 ...
## $ Mean_CropDMG : num 6842 2519 10481 0 17 ...
## $ HARM : num 96979 7461 1376 30 34 ...
## $ LOSS : num 5.73e+10 5.04e+09 1.88e+10 8.11e+06 1.48e+07 ...
## $ Mean_Harm : num 1.59894 0.03392 0.00477 0.12 0.05792 ...
## $ Mean_Loss : num 945404 22911 64965 32446 25166 ...
## - attr(*, ".internal.selfref")=<externalptr>
Here we plot the filtered table based on the events that cause the biggest health effects Injuries + Deaths. And we observe that Tornadoes, Excessive Heat and Floods are the events that cause the biggest total effects to the health of people in the US.
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.3.3
#Now let's plot the most Dangerous events for health:
Harmful_absolute <- RelativeDMG[order(RelativeDMG$HARM, decreasing = TRUE), ][1:10, ]
ggplot(Harmful_absolute, aes(x = factor(EVTYPE),y = HARM), fill = EVTYPE) +
geom_bar(aes(fill = EVTYPE), stat = "identity") +
labs(x="Storm Events", y= "Health Harm Related to the Events") +
labs(title=expression("Most Harmful Storm Events in the US"))
# Now we show which ones are the most dangerous or generate the most losses among them.
Doing the same filtering for total economic losses (crop losses + property damage), we can observe that the events that cause the total biggest economic harm in the US are Flood, Hurricane/Typhoon and Tornado, respectively.
#
Loss_absolute <- RelativeDMG[order(RelativeDMG$LOSS, decreasing = TRUE), ][1:10, ]
ggplot(Loss_absolute, aes(x = factor(EVTYPE),y = LOSS), fill = EVTYPE) +
geom_bar(aes(fill = EVTYPE), stat = "identity") +
labs(x="Storm Events", y= "Economic Losses Related to the Events") +
labs(title=expression("Biggest Economic Losses by Storm Events in the US"))
# Now we show which ones are the most dangerous or generate the most losses among them.
For curiosity sake I filtered and plotted the most harmful events per occurrence. Meaning that I averaged the impact over the occurrence of the event and found a measure of the relative harm of each event. Finding out that the biggest relative health impact per occurrence (the most dangerous events) were caused by a Heat Wave, the Tropical storm Gordon and Wild Fires (in general). The most costly events were Tornadoes, Heavy Rain and Huricame/Typhoon
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 4.3.3
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
# Lets plot the two effects relative to the occurence:
Health_relative <- RelativeDMG[order(RelativeDMG$Mean_Harm, decreasing = TRUE), ][1:8, ]
Loss_relative <- RelativeDMG[order(RelativeDMG$Mean_Loss, decreasing = TRUE), ][1:8, ]
p1 <- ggplot(Health_relative, aes(x = factor(EVTYPE),y = Mean_Harm), fill = EVTYPE) +
geom_bar(aes(fill = EVTYPE), stat = "identity") +
labs(x="Storm Events", y= "Relative Harm") +
labs(title=expression("Relative Harm"))
p2 <- ggplot(Loss_relative, aes(x = factor(EVTYPE),y = Mean_Loss), fill = EVTYPE) +
geom_bar(aes(fill = EVTYPE), stat = "identity") +
labs(x="Storm Events", y= "Relative Economic Losses") +
labs(title=expression("Relative Economic Losses"))
grid.arrange(p1, p2, ncol = 1)