In this report we identify the most harmful severe weather event types in the United States during the years 1950-2011. Harm is examined in two respects: public health and economic cost. The data for this analysis come from the publicly available U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, which tracks characteristics of major storms and weather events in the United States. From these data we found that tornadoes are the single most fatal event type and also cause the most injuries. However, the most costly event type, when damage to property and crops is combined, is flood.
The storm data is available in comma-separated-value format in a file compressed via the bzip2 algorithm at this web site:
The documentation of the database can be found in:
The code for loading the data directly from the above-mentioned source is presented below. The analysis was run with R version 3.1.2 (2014-10-31). The data.table package is used for subsequent processing and ggplot2 for plotting.
library(data.table)
library(ggplot2)
# Fetch the dataset if it does not already exist in the current working directory
filename <- "repdata_data_StormData.csv.bz2"
if (!file.exists(filename)) {
  fileURL <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
  download.file(fileURL, destfile=filename, method="curl")
}
data <- read.csv(bzfile(filename, open="r"), stringsAsFactors=FALSE)
data <- as.data.table(data)
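Because the rest of the analysis depends on a handful of columns, a quick sanity check after loading can catch a truncated or changed download early. The following is a minimal sketch in base R; the helper name `check_columns` and the toy data frame are illustrative assumptions and not part of the original analysis.

```r
# Hypothetical helper: stop with a message if any required column is missing.
check_columns <- function(df, required) {
  missing_cols <- setdiff(required, names(df))
  if (length(missing_cols) > 0) {
    stop("Missing columns: ", paste(missing_cols, collapse = ", "))
  }
  invisible(TRUE)
}

# Columns used later in this report.
required <- c("EVTYPE", "FATALITIES", "INJURIES",
              "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")

# Toy stand-in for the loaded data (one row, values from the first record
# of the data set).
toy <- data.frame(EVTYPE = "TORNADO", FATALITIES = 0, INJURIES = 15,
                  PROPDMG = 25, PROPDMGEXP = "K",
                  CROPDMG = 0, CROPDMGEXP = "",
                  stringsAsFactors = FALSE)

check_columns(toy, required)  # passes silently when all columns are present
```

On the real data the call would be `check_columns(data, required)`.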
Before the actual study, let’s take a quick look at the data.
The structure of the data looks like this:
str(data)
## Classes 'data.table' and 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
## - attr(*, ".internal.selfref")=<externalptr>
The relevant variables in this study are EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG and CROPDMGEXP.
The number of different event types:
length(unique(data$EVTYPE))
## [1] 985
The time range of the data:
range(as.Date(data$BGN_DATE, format="%m/%d/%Y"))
## [1] "1950-01-03" "2011-11-30"
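The date parsing used above relies on `as.Date()` reading only the month/day/year part of each string and ignoring the trailing time. The sketch below illustrates this on three sample strings in the same format and counts events per year, which could be a useful check if recording practices changed over the period. The variable names are illustrative assumptions.

```r
# Sample BGN_DATE strings in the format shown in the structure listing.
bgn <- c("4/18/1950 0:00:00", "2/20/1951 0:00:00", "6/8/1951 0:00:00")

# The format string covers only the date part; the trailing time is ignored.
dates <- as.Date(bgn, format = "%m/%d/%Y")

# Count events per calendar year.
events_per_year <- table(format(dates, "%Y"))
print(events_per_year)
```

On the full data set, `bgn` would be replaced by `data$BGN_DATE`.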
The harm caused to public health by severe weather events comes in two forms: fatalities and injuries. Both are first summed by event type over the entire data set.
fatal <- data[, .(fatalities=sum(FATALITIES), injuries=sum(INJURIES)), by=EVTYPE]
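For readers less familiar with data.table syntax, the grouped sum can be illustrated in base R. The sketch below is an assumed equivalent on a made-up toy data frame (the rows and the names `toy`, `totals` and `ranked` are illustrative, not taken from the storm database); `aggregate()` plays the role of the `by=EVTYPE` grouping.

```r
# Made-up toy records, one row per event.
toy <- data.frame(
  EVTYPE     = c("TORNADO", "TORNADO", "FLOOD", "HEAT", "FLOOD"),
  FATALITIES = c(1, 0, 2, 3, 0),
  INJURIES   = c(14, 15, 0, 1, 4),
  stringsAsFactors = FALSE
)

# Sum both harm measures within each event type.
totals <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = toy, FUN = sum)

# Order by fatalities first, using injuries to break ties.
ranked <- totals[order(-totals$FATALITIES, -totals$INJURIES), ]
print(ranked)
```

In this toy example HEAT ranks first with 3 fatalities, followed by FLOOD with 2 and TORNADO with 1.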
The summed data are then ordered by fatality count to show the 20 most fatal event types over the whole period.
fatal <- fatal[order(-fatalities, -injuries)]
ggplot(fatal[1:20], aes(x=reorder(EVTYPE, fatalities), y=fatalities)) +
geom_bar(stat="identity", fill="grey30") + coord_flip() +
ylab("Fatalities") +
theme(axis.title.y=element_blank(), plot.title=element_text(face="bold")) +
ggtitle("Top 20 most fatal event types")
Exact numbers of event types ordered by fatality:
fatal[1:20]
## EVTYPE fatalities injuries
## 1: TORNADO 5633 91346
## 2: EXCESSIVE HEAT 1903 6525
## 3: FLASH FLOOD 978 1777
## 4: HEAT 937 2100
## 5: LIGHTNING 816 5230
## 6: TSTM WIND 504 6957
## 7: FLOOD 470 6789
## 8: RIP CURRENT 368 232
## 9: HIGH WIND 248 1137
## 10: AVALANCHE 224 170
## 11: WINTER STORM 206 1321
## 12: RIP CURRENTS 204 297
## 13: HEAT WAVE 172 309
## 14: EXTREME COLD 160 231
## 15: THUNDERSTORM WIND 133 1488
## 16: HEAVY SNOW 127 1021
## 17: EXTREME COLD/WIND CHILL 125 24
## 18: STRONG WIND 103 280
## 19: BLIZZARD 101 805
## 20: HIGH SURF 101 152
The ranking of event types is different for injuries, so it is shown next.
fatal <- fatal[order(-injuries, -fatalities)]
ggplot(fatal[1:20], aes(x=reorder(EVTYPE, injuries), y=injuries)) +
geom_bar(stat="identity", fill="tomato2") + coord_flip() +
ylab("Injuries") +
theme(axis.title.y=element_blank(), plot.title=element_text(face="bold")) +
ggtitle("Top 20 most injury causing event types")
Exact numbers of event types ordered by injuries:
fatal[1:20]
## EVTYPE fatalities injuries
## 1: TORNADO 5633 91346
## 2: TSTM WIND 504 6957
## 3: FLOOD 470 6789
## 4: EXCESSIVE HEAT 1903 6525
## 5: LIGHTNING 816 5230
## 6: HEAT 937 2100
## 7: ICE STORM 89 1975
## 8: FLASH FLOOD 978 1777
## 9: THUNDERSTORM WIND 133 1488
## 10: HAIL 15 1361
## 11: WINTER STORM 206 1321
## 12: HURRICANE/TYPHOON 64 1275
## 13: HIGH WIND 248 1137
## 14: HEAVY SNOW 127 1021
## 15: WILDFIRE 75 911
## 16: THUNDERSTORM WINDS 64 908
## 17: BLIZZARD 101 805
## 18: FOG 62 734
## 19: WILD/FOREST FIRE 12 545
## 20: DUST STORM 22 440
To calculate the economic consequences of different weather events, two variables are added together, PROPDMG and CROPDMG, which denote the damage in dollars caused to property and crops, respectively. According to the database documentation, these values should be scaled by the contents of the PROPDMGEXP and CROPDMGEXP fields, whose documented values are K for thousands, M for millions and B for billions.
However, by taking a look at the data it can be seen that there are other (undocumented) values as well.
table(data$PROPDMGEXP)
##
##             -      ?      +      0      1      2      3      4      5
## 465934      1      8      5    216     25     13      4      4     28
##      6      7      8      B      h      H      K      m      M
##      4      5      1     40      1      6 424665      7  11330
table(data$CROPDMGEXP)
##
##             ?      0      2      B      k      K      m      M
## 618413      7     19      1      9     21 281832      1   1994
nrow(data[CROPDMGEXP %in% c("", "K", "M", "B"), ])/nrow(data)
## [1] 0.9999457
nrow(data[PROPDMGEXP %in% c("", "K", "M", "B"), ])/nrow(data)
## [1] 0.9996365
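The two fractions above can be cross-checked by hand from the frequency tables: adding the counts for the blank, K, M and B codes and dividing by the 902297 records gives the same values. The sketch below does this arithmetic with the counts copied from the tables; the variable names are illustrative assumptions.

```r
n_total <- 902297  # number of records reported by str(data)

# Counts of the documented codes, copied from the two frequency tables.
prop_documented <- c(blank = 465934, K = 424665, M = 11330, B = 40)
crop_documented <- c(blank = 618413, K = 281832, M = 1994, B = 9)

prop_share <- sum(prop_documented) / n_total
crop_share <- sum(crop_documented) / n_total

print(round(c(PROPDMGEXP = prop_share, CROPDMGEXP = crop_share), 7))

# Rows carrying an undocumented code.
print(n_total - c(PROPDMGEXP = sum(prop_documented),
                  CROPDMGEXP = sum(crop_documented)))
```

This leaves 328 records with an undocumented property exponent and 49 with an undocumented crop exponent. On the data itself, the same share can be written more compactly as `mean(data$PROPDMGEXP %in% c("", "K", "M", "B"))`.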
The lowercase letters may simply be typing errors, but we cannot be sure, so the safest option is to ignore all undocumented codes and leave the corresponding damage values unscaled. The number of records with undocumented codes is so small that ignoring them is well justified.
The damage costs are multiplied by the given scale and summed over all the data by weather event type. After that, property and crop damage are added into a new variable holding the total damage cost of each event type.
data[PROPDMGEXP == "B", PROPDMG:=PROPDMG * 1e9]
data[PROPDMGEXP == "M", PROPDMG:=PROPDMG * 1e6]
data[PROPDMGEXP == "K", PROPDMG:=PROPDMG * 1e3]
data[CROPDMGEXP == "B", CROPDMG:=CROPDMG * 1e9]
data[CROPDMGEXP == "M", CROPDMG:=CROPDMG * 1e6]
data[CROPDMGEXP == "K", CROPDMG:=CROPDMG * 1e3]
costly <- data[, .(propdmg=sum(PROPDMG), cropdmg=sum(CROPDMG)), by=EVTYPE]
costly[, totaldmg:=propdmg+cropdmg]
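The scaling step above can also be expressed as a lookup from exponent code to multiplier. The following is a minimal sketch in base R on made-up values; the names `multiplier` and `scale_damage` are illustrative assumptions. It treats blank and undocumented codes the same way as the code above, by leaving the value unscaled.

```r
# Documented exponent codes and their multipliers.
multiplier <- c(K = 1e3, M = 1e6, B = 1e9)

# Hypothetical helper: return scaled damage without modifying the inputs.
scale_damage <- function(value, exponent) {
  m <- multiplier[exponent]  # NA for "" and for any code not in the lookup
  m[is.na(m)] <- 1           # leave those values unscaled
  unname(value * m)
}

# Made-up example values, including a blank and an undocumented code.
value    <- c(25, 2.5, 10, 3, 7)
exponent <- c("K", "M", "B", "", "h")

scaled <- scale_damage(value, exponent)
print(scaled)
```

Because the result is returned as a new vector rather than written back into PROPDMG or CROPDMG, running the step a second time does not scale the values twice, which the in-place assignments would do.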
The event types are then ordered by total damage.
costly <- costly[order(-totaldmg)]
ggplot(costly[1:20], aes(x=reorder(EVTYPE, totaldmg), y=totaldmg/1e9)) +
geom_bar(stat="identity", fill="steelblue3") + coord_flip() +
ylab("Total cost (billions of $)") +
theme(axis.title.y=element_blank(), plot.title=element_text(face="bold")) +
ggtitle("Top 20 most damaging event types in total")
Exact numbers of event types ordered by total cost:
costly[1:20]
## EVTYPE propdmg cropdmg totaldmg
## 1: FLOOD 144657709807 5661968450 150319678257
## 2: HURRICANE/TYPHOON 69305840000 2607872800 71913712800
## 3: TORNADO 56925660790 414953270 57340614060
## 4: STORM SURGE 43323536000 5000 43323541000
## 5: HAIL 15727367053 3025537890 18752904943
## 6: FLASH FLOOD 16140812067 1421317100 17562129167
## 7: DROUGHT 1046106000 13972566000 15018672000
## 8: HURRICANE 11868319010 2741910000 14610229010
## 9: RIVER FLOOD 5118945500 5029459000 10148404500
## 10: ICE STORM 3944927860 5022113500 8967041360
## 11: TROPICAL STORM 7703890550 678346000 8382236550
## 12: WINTER STORM 6688497251 26944000 6715441251
## 13: HIGH WIND 5270046295 638571300 5908617595
## 14: WILDFIRE 4765114000 295472800 5060586800
## 15: TSTM WIND 4484928495 554007350 5038935845
## 16: STORM SURGE/TIDE 4641188000 850000 4642038000
## 17: THUNDERSTORM WIND 3483121284 414843050 3897964334
## 18: HURRICANE OPAL 3152846020 9000010 3161846030
## 19: WILD/FOREST FIRE 3001829500 106796830 3108626330
## 20: HEAVY RAIN/SEVERE WEATHER 2500000000 0 2500000000
We were able to present the most harmful weather event types with respect to both public health and economic consequences with high certainty, even though a small part of the data had to be ignored because the meaning of certain values found in the exponent fields is not documented. There also seem to be some overlapping event types with very similar names. However, in this study they were all treated as separate event types, since the available documentation does not allow us to be sure which of those event types are actually identical.
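To give a feel for how much the similar names could matter, the sketch below merges two pairs of names and re-ranks the result. It uses fatality counts copied from the top-20 table in this report. The `alias` mapping is purely a hypothetical assumption made for illustration, since the documentation does not confirm that these names refer to the same event type, and the merged totals cover only the rows listed here.

```r
# Fatality counts copied from the top-20 table in this report.
fatalities <- c("TORNADO" = 5633, "EXCESSIVE HEAT" = 1903, "FLASH FLOOD" = 978,
                "HEAT" = 937, "LIGHTNING" = 816, "TSTM WIND" = 504,
                "FLOOD" = 470, "RIP CURRENT" = 368, "RIP CURRENTS" = 204,
                "THUNDERSTORM WIND" = 133)

# Hypothetical mapping from a variant name to an assumed canonical name.
alias <- c("RIP CURRENTS" = "RIP CURRENT",
           "TSTM WIND"    = "THUNDERSTORM WIND")

# Replace variant names by their canonical name, keep all others unchanged.
canonical <- names(fatalities)
hit <- canonical %in% names(alias)
canonical[hit] <- alias[canonical[hit]]

# Sum within canonical names and re-rank.
merged <- sort(tapply(fatalities, canonical, sum), decreasing = TRUE)
print(merged)
```

Under this assumed mapping, THUNDERSTORM WIND (637) and RIP CURRENT (572) would both move above FLOOD (470), while TORNADO would remain first.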