This document examines NOAA data on Severe Weather Events and attempts to ascertain what kinds of severe weather cause the most economic and population health damage. The data is fairly messy, so a decent amount of pre-processing and cleaning must be done before we can gain much insight from the data. This cleaning is detailed below, then the data is split into an sdh data set containing the data relevant to population health, and an sdp data set containing the data relevant to economic damage. Then, in the Results section, we examine total damage by event type for each data set.
The raw data and documentation can be found and downloaded and unzipped by running the following code. NOTE: you will need the R.utils package, which can be installed through CRAN.
download.file("http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", "StormData.csv.bz2")
download.file("http://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf", "StormDocumentation.pdf")
library(R.utils)
bunzip2("StormData.csv.bz2")
Then the data can be loaded in R as follows.
stormData <- read.csv("StormData.csv")
The BGN_DATE column is a factor variable. We will convert it to an object of class “Date” and put the data in chronological order.
stormData$BGN_DATE <- as.character(stormData$BGN_DATE)
stormData$BGN_DATE <- as.Date(stormData$BGN_DATE, "%m/%d/%Y")
stormData <- stormData[order(stormData$BGN_DATE),]
The following is a histogram which shows the number of severe weather events documented in each calendar year. We are looking at the first 300,000 entries, which takes us from January 1950 through July 1997.
From this we can see that the number of events reported in this data set rises sharply beginning in the mid 1970’s. For this reason, we will subset the data to include only those reports from 1975 and later. This will give us a more accurate picture because it will filter out data from the earlier years when reporting was much more scattered and therefore cannot be counted on to be fully representitive.
sd <- stormData[54960:dim(stormData)[1],]
head(sd[,1:5])
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY
## 171797 49 1975-01-04 1030 CST 45
## 171798 49 1975-01-04 1040 CST 35
## 171799 49 1975-01-04 1050 CST 11
## 61785 22 1975-01-07 1532 CST 113
## 61786 22 1975-01-07 1930 CST 55
## 155753 48 1975-01-07 0958 CST 245
tail(sd[,1:5])
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY
## 902209 6 2011-11-30 12:40:00 PM PST 520
## 902260 6 2011-11-30 10:00:00 AM PST 506
## 902266 30 2011-11-30 02:00:00 PM MST 63
## 902270 30 2011-11-30 06:00:00 PM MST 56
## 902271 56 2011-11-30 04:00:00 PM MST 99
## 902293 56 2011-11-30 10:30:00 PM MST 7
For this analysis, we will be examining what types of severe weather events caused the most damage. Unfortunately, the EVTYPE variable is rather messy. We will clean it up by adding another variable called event and putting broader, more useful, labels for each event into that variable.
levels(sd$EVTYPE) <- tolower(levels(sd$EVTYPE))
sd$event <- "other"
sd$event[grep("lightning", sd$EVTYPE)] <- "lightning"
sd$event[grep("thunder|rain", sd$EVTYPE)] <- "heavy rain"
sd$event[grep("snow|blizzard|chill|wint|cold|sleet|ice|freez", sd$EVTYPE)] <- "winter storm"
sd$event[grep("hail", sd$EVTYPE)] <- "hail"
sd$event[grep("seas|surf|tide|swell", sd$EVTYPE)] <- "coastal event"
sd$event[grep("tornado|gustnado|funnel|spout", sd$EVTYPE)] <- "tornado"
sd$event[grep("fire|smoke", sd$EVTYPE)] <- "fire"
sd$event[grep("wind", sd$EVTYPE)] <- "heavy wind"
sd$event[grep("volcan", sd$EVTYPE)] <- "volcano"
sd$event[grep("flood|stream", sd$EVTYPE)] <- "flood"
sd$event[grep("heat|temp|hot", sd$EVTYPE)] <- "heat wave"
sd$event[grep("dry|drought", sd$EVTYPE)] <- "drought"
sd$event[grep("microburst", sd$EVTYPE)] <- "microburst"
sd$event[grep("tropical", sd$EVTYPE)] <- "tropical storm"
sd$event[grep("hurricane|cyclone|surge", sd$EVTYPE)] <- "hurricane"
sd$event <- as.factor(sd$event)
By looking at the column names of our sd data set, and consulting the documentation provided, we conclude that the columns related to property damage are 25 through 28 and those related to population health are 23 and 24.
names(sd)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM" "event"
We will then create two data sets, subsetting out those two groups, as well as our recently created event variable.
sdp <- sd[, c(2, 25:28, 38)]
sdh <- sd[, c(2, 23, 24, 38)]
We have to do some cleaning before we can calculate the Crop Damage and Property Damage properly. We learn from the page 12 of the documentation that the DMG variables represent US Dollar amounts rounded to three significant digits and that the EXP variables should contain either “K”, “M”, or “B” to signify the order of magnitude of the number being “thousand”, “million”, or “billion” respectively. However, by doing a summary of those variables, we note that there are a number of other values as well.
summary(sdp$CROPDMGEXP)
## ? 0 2 B k K m M
## 563454 7 19 1 9 21 281832 1 1994
summary(sdp$PROPDMGEXP)
## - ? + 0 1 2 3 4 5
## 428242 1 8 5 216 25 13 4 4 28
## 6 7 8 B h H K m M
## 4 5 1 40 1 6 408319 7 10409
None of these values are mentioned in the documentation. However, we notice that none of the other values are present in even 0.01% of the rows. This fact, coupled with their lack of explanation in the documentation, leads us to decide that they are both misleading and statistically insignificant, so we will purge them from the data set. The following code does this, after first transforming all letter characters to capitals for ease of processing.
sdp$CROPDMGEXP <- toupper(as.character(sdp$CROPDMGEXP))
sdp$PROPDMGEXP <- toupper(as.character(sdp$PROPDMGEXP))
sdp <- sdp[-grep("[^K|M|B]", sdp$CROPDMGEXP),]
sdp <- sdp[-grep("[^K|M|B]", sdp$PROPDMGEXP),]
Now that the unsavory factors have been eliminated, and all of the remaining characters in the EXP columns have been converted to upper case, we can calculate the actual damage amounts in US Dollars.
sdp$propmult <- 1
sdp$propmult[sdp$PROPDMGEXP == "K"] <- 1000
sdp$propmult[sdp$PROPDMGEXP == "M"] <- 1000000
sdp$propmult[sdp$PROPDMGEXP == "B"] <- 1000000000
sdp$PROPDMG <- sdp$PROPDMG*sdp$propmult
sdp$cropmult <- 1
sdp$cropmult[sdp$CROPDMGEXP == "K"] <- 1000
sdp$cropmult[sdp$CROPDMGEXP == "M"] <- 1000000
sdp$cropmult[sdp$CROPDMGEXP == "B"] <- 1000000000
sdp$CROPDMG <- sdp$CROPDMG*sdp$cropmult
Finally we will combine PROPDMG and CROPDMG into a variable called totaldmg which we will use to compare different event types’ economic damge.
sdp$totaldmg <- sdp$CROPDMG+sdp$PROPDMG
Before we go any further, we will check the totaldmg variable to make sure we don’t have any unexplained outliers. We do this by looking at the Top 10 totaldmg events make sure they all seem realistic.
sdpo <- sdp[,c(1,2,4,6,9)]
sdpo <- sdpo[order(sdpo$totaldmg, decreasing=TRUE),]
rownames(sdpo) <- 1:nrow(sdpo)
top10p <- sdpo[1:10,]
top10p
## BGN_DATE PROPDMG CROPDMG event totaldmg
## 1 2006-01-01 1.150e+11 3.25e+07 flood 115032500000
## 2 2005-08-29 3.130e+10 0.00e+00 hurricane 31300000000
## 3 2005-08-28 1.693e+10 0.00e+00 hurricane 16930000000
## 4 2005-08-29 1.126e+10 0.00e+00 hurricane 11260000000
## 5 1993-08-31 5.000e+09 5.00e+09 flood 10000000000
## 6 2005-10-24 1.000e+10 0.00e+00 hurricane 10000000000
## 7 2005-08-29 5.880e+09 1.51e+09 hurricane 7390000000
## 8 2005-08-28 7.350e+09 0.00e+00 hurricane 7350000000
## 9 2004-08-13 5.420e+09 2.85e+08 hurricane 5705000000
## 10 2001-06-05 5.150e+09 0.00e+00 tropical storm 5150000000
Unfortunately, things seem strange. There is one event - on January 1, 2006 - that registers over $115 Billion in damage. This seems unlikely, especially when we rack our memories and can’t remember any catastrophic weather events happening on New Year’s Day less than 10 years ago.
We go back to the sd data set which still has a REMARKS variable, to see if we can learn anything.
Napa <- sd[sd$BGN_DATE == "2006-01-01",]
Napa <- Napa[23,]
Napa[,c(25:28, 36)]
## PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 605953 115 B 32.5 M
## REMARKS
## 605953 Major flooding continued into the early hours of January 1st, before the Napa River finally fell below flood stage and the water receeded. Flooding was severe in Downtown Napa from the Napa Creek and the City and Parks Department was hit with $6 million in damage alone. The City of Napa had 600 homes with moderate damage, 150 damaged businesses with costs of at least $70 million.
And here is the event: a flood in Napa, CA. From the remarks, we can see that it did “at least $70 million” in property damage. This is significant, but nowhere near the $115 Billion that’s reflected in the PROPDMG AND PROPDMGEXP variables. We have to assume that this was a typo, the most likely situation being that someone typed a “B” instead of an “M” in the CROPDMGEXP column. This is fixed with the code below and we move on.
sdp$PROPDMG[sdp$totaldmg==115032500000] <- 1.15e+8
sdp$PROPDMGEXP[sdp$totaldmg==115032500000] <- "M"
sdp$propmult[sdp$totaldmg==115032500000] <- 1e+06
sdp$totaldmg[sdp$totaldmg==115032500000] <- 115032500
To calculate the population health effects, we will be looking at the FATALITIES and INJURIES columns of our sdh data set. In order to compare types of events, we will create a totaldmg variable. It seems wrong to weigh fatalies and injuries as the same amount of “health damage”, so we arbitrarily multiply the fatalities by 3 to give us a better picture of overall damage. Everyone knows dying is three times worse than being injured.
sdh$totaldmg <- sdh$INJURIES+sdh$FATALITIES*3
Again, before we go any further, we will check the totaldmg variable to make sure we don’t have any unexplained outliers.
sdho <- sdh
sdho <- sdho[order(sdho$totaldmg, decreasing=TRUE),]
rownames(sdho) <- 1:nrow(sdho)
top10h <- sdho[1:10,]
top10h
## BGN_DATE FATALITIES INJURIES event totaldmg
## 1 1979-04-10 42 1700 tornado 1826
## 2 1995-07-12 583 0 heat wave 1749
## 3 2011-05-22 158 1150 tornado 1624
## 4 1994-02-08 1 1568 winter storm 1571
## 5 2011-04-27 44 800 tornado 932
## 6 1998-10-17 2 800 flood 806
## 7 2004-08-13 7 780 hurricane 801
## 8 2011-04-27 20 700 tornado 760
## 9 1998-10-17 0 750 flood 750
## 10 1998-10-17 11 600 flood 633
Luckily, this time everything seems normal.
Now that everything is cleaned up, we create a new object sdpa which aggregates the total damage for each type of event that we have defined in our event variable.
sdpa <- sdp[,c("event", "totaldmg")]
sdpa <- aggregate(sdpa$totaldmg, list(sdpa$event), sum)
colnames(sdpa) <- c("event", "totaldmg")
sdpa <- sdpa[order(sdpa$totaldmg, decreasing=TRUE),]
rownames(sdpa) <- 1:nrow(sdpa)
sdpa
## event totaldmg
## 1 hurricane 138237551810
## 2 flood 65057699894
## 3 tornado 47108721273
## 4 winter storm 21092558560
## 5 heavy wind 19671868158
## 6 hail 18995930230
## 7 drought 15018927780
## 8 fire 8905010130
## 9 tropical storm 8411023550
## 10 heavy rain 5238188040
## 11 other 1392274180
## 12 lightning 940781373
## 13 heat wave 924539250
## 14 coastal event 166588000
## 15 microburst 7312600
## 16 volcano 500000
The figure below shows the total damage done by each category of severe weather event.
It’s clear that Hurricanes do the most economic damage, far ahead of anything else. Hurricane Katrina alone did more total damage than many of the other entire categories in the previous plot.
Katrina <- rbind(sd[sd$BGN_DATE == "2005-08-29",],sd[sd$BGN_DATE == "2005-08-28",])
Katrina <- Katrina[grep("hurricane|surge", Katrina$EVTYPE),]
Katrina[,c(2,7,8,25,26)]
## BGN_DATE STATE EVTYPE PROPDMG PROPDMGEXP
## 566200 2005-08-29 AR hurricane/typhoon 400.00 K
## 570290 2005-08-29 GA hurricane/typhoon 0.00
## 577676 2005-08-29 LA storm surge 31.30 B
## 577677 2005-08-29 LA hurricane/typhoon 30.00 K
## 577678 2005-08-29 LA hurricane/typhoon 3.60 M
## 581535 2005-08-29 MS storm surge 11.26 B
## 581536 2005-08-29 MS hurricane/typhoon 0.00
## 581537 2005-08-29 MS hurricane/typhoon 5.88 B
## 581545 2005-08-29 MS hurricane/typhoon 0.00
## 581551 2005-08-29 MS hurricane/typhoon 0.00
## 598405 2005-08-29 TX storm surge 40.00 K
## 569215 2005-08-28 FL storm surge 0.00
## 569217 2005-08-28 FL hurricane/typhoon 1.70 M
## 569218 2005-08-28 FL storm surge 200.00 K
## 569219 2005-08-28 FL storm surge 200.00 K
## 569221 2005-08-28 FL storm surge 750.00 K
## 569222 2005-08-28 FL storm surge 250.00 K
## 569223 2005-08-28 FL storm surge 300.00 K
## 569224 2005-08-28 FL storm surge 0.00
## 569225 2005-08-28 FL storm surge 0.00
## 577675 2005-08-28 LA hurricane/typhoon 16.93 B
## 581533 2005-08-28 MS hurricane/typhoon 7.35 B
Floods do the second most damage and tornadoes the third. Winter storms come in a distant fourth, although if you combined them with hail (which is often, but not always, the case) they would be very close to tornadoes.
We create a new object sdha which aggregates the total health damage for each type of event that we have defined in our event variable.
sdha <- sdh[,c("event", "totaldmg")]
sdha <- aggregate(sdha$totaldmg, list(sdha$event), sum)
colnames(sdha) <- c("event", "totaldmg")
sdha <- sdha[order(sdha$totaldmg, decreasing=TRUE),]
rownames(sdha) <- 1:nrow(sdha)
sdha
## event totaldmg
## 1 tornado 53112
## 2 heat wave 18636
## 3 heavy wind 15844
## 4 flood 13342
## 5 winter storm 8939
## 6 lightning 7682
## 7 other 5801
## 8 fire 1878
## 9 hurricane 1848
## 10 hail 1416
## 11 coastal event 855
## 12 heavy rain 623
## 13 tropical storm 581
## 14 drought 125
## 15 microburst 37
## 16 volcano 0
The figure below shows the total health damage done by each category of severe weather event.
By this measure, tornadoes cause by the far the most population health damage. Heat waves are a distant second, followed by heavy winds and then floods.
Considering hurricanes did so much property damage, we are confused that they are so far down the list for health damage. To look into this, we check the health data for the ten most destructive hurricanes in our data set.
hurhealth <- sdh[grep("hurricane", sdh$event),]
hurhealth <- hurhealth[order(sdh$totaldmg),]
hurhealth <- hurhealth[order(hurhealth$totaldmg, decreasing=TRUE),]
hurhealth[1:10,]
## BGN_DATE FATALITIES INJURIES event totaldmg
## 529351 2004-08-13 7 780 hurricane 801
## 484801 2002-12-08 1 316 hurricane 319
## 581537 2005-08-29 15 104 hurricane 149
## 366678 1999-09-14 13 0 hurricane 39
## 529499 2004-09-15 6 16 hurricane 34
## 739576 2008-09-12 11 0 hurricane 33
## 196184 1993-03-13 10 0 hurricane 30
## 282882 1997-01-01 0 27 hurricane 27
## 267894 1996-09-05 7 2 hurricane 23
## 281013 1996-09-09 7 0 hurricane 21
Among other things, we notice that Hurricane Katrina (August 29th, 2005) registers only 15 fatalities and 104 injuries. We know - from our own memory, from Wikipedia, and from the REMARKS column in our Katrina data set above - that this wildly underestimates the fatalities and injuries. Wikipedia tells us that 1,833 fatalities were directly associated with Hurricane Katrina. Our data set registers only 15.
These discrepencies cast some doubt on this whole portion of the analysis, but there seems to be no easy way to fix these entries, short of scanning through the REMARKS column of our original sd set and manually double-checking it against FATALITIES and INJURIES. Unfortunately, with 847,338 observations, that seems unreasonably time-consuming. We therefore accept our current analysis but take it, as they say, with a grain of salt. Tornadoes are bad.
In conclusion, don’t live anywhere with hurricanes, floods, or tornadoes. Unless it’s New Orleans, because New Orleans is the bomb.
Exhibit A: http://www.youtube.com/watch?v=cLPXcYo3uvc