In this paper I strive to answer two questions: 1. What weather events cause the most human tolls and 2. what weather events cause the most property and crop damages? The analysis is based on a data set provided by the National Oceanic and Atmospheric Administration (NOAA) which covers severe weather events in the USA from 1950 through 2011. The results show that tornadoes by far cause the most human fatalities and injuries while floods and droughts cause the most economic damages.
For this analysis I used the “Storm Data” provided by NOAA. The comma-separated data file is available on the Internet:
https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2
Documentation is available at:
https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2FNCDC%20Storm%20Events-FAQ%20Page.pdf
The events in the database start in the year 1950 and end in November 2011. For the earlier years there are generally fewer events recorded in the data set, most likely due to a lack of good records. More recent years should be considered more complete.
The following code snippet shows how I loaded the data using read.csv. I created a subset of relevant variables and included only records which had data in at least one of the following columns: FATALITIES, INJURIES, PROPDMG, CROPDMG. This step reduced the data size significantly.
NOAA.DF <- read.csv(bzfile("NOAAdata/repdata-data-StormData.csv.bz2"))
STRM.DF <- subset(NOAA.DF,
FATALITIES>0 | INJURIES>0 | PROPDMG>0 | CROPDMG>0,
select = c(BGN_DATE, STATE, EVTYPE, MAG, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP))
STRM.DF <- droplevels(STRM.DF) #get rid of unused factor levels
I needed to clean up two data problems before starting the analysis: 1. dollar values needed to calculated from two variables, and 2. the catagories for weather events were quite messy.
The NOAA data set captures dollar values for property and crop damages in a scientific format of sorts. That is, one set of variables (PROPDMG, CROPDMG) shows the significant digits (e.g 1.23) and a second set of variables (PROPDMGEXP, CROPDMGEXP) shows the exponent (e.g. k for thousand or M for million). Therefore, 1.23 k means $1.23*10^3 or $1,230.
However, the data for the exponents is messy. I applied the following transformations:
exp <- function(y){
exp <- NULL
ifelse(y %in% c("h", "H"), exp <- 2,
ifelse(y %in% c("k", "K"), exp <- 3,
ifelse(y %in% c("m", "M"), exp <- 6,
ifelse(y %in% c("b", "B"), exp <- 9,
exp <- 0))))
return(exp)
}
# new variables for the combined dollar values
STRM.DF$prop.damage <- STRM.DF$PROPDMG*10^sapply(STRM.DF$PROPDMGEXP, exp)
STRM.DF$crop.damage <- STRM.DF$CROPDMG*10^sapply(STRM.DF$CROPDMGEXP, exp)
The second issue was to clean up the EVTYPE variable. EVTYPE had 488 levels, with many of them being redundant because of different spelling or because the event label has changed over time. Therefore, I first changed all categories to upper case and replaced all “/” with spaces.
My goal was to reduce the 488 levels to the 48 weather event categories listed in section 2.1.1 of the “Strom Data Preparation” document. Due to limited time, I chose to apply a coarse string search (grepl) to map the original levels to the final 48 categories.
My method has two downsides: First some of the string patters are overlapping and therefore the sequence in which the searches are executed matters. Second, not all categories in the origninal variable EVTYPE were mapped to the new event variable. I controlled for that error by comparing the sum total of fatalities, injuries, propery damages and crop damages in the complete data set with the ones that were unmapped (see below).
STRM.DF$EVTYPE1 <- toupper(STRM.DF$EVTYPE)
STRM.DF$EVTYPE1 <- gsub("/", " ", STRM.DF$EVTYPE1)
STRM.DF$event<-NA
STRM.DF$event[grepl("LOW TIDE", STRM.DF$EVTYPE1)] <- "Astronomical Low Tide"
STRM.DF$event[grepl("AVAL", STRM.DF$EVTYPE1)] <- "Avalanche"
STRM.DF$event[grepl("BLIZ", STRM.DF$EVTYPE1)] <- "Blizzard"
STRM.DF$event[grepl("FLOOD|FLD", STRM.DF$EVTYPE1)] <- "Flood"
STRM.DF$event[grepl("COAST.*FLO", STRM.DF$EVTYPE1)] <- "Coastal Flood"
STRM.DF$event[grepl("^COLD WIND CHILL|^COLD|^LOW TEMP|HYPOTH", STRM.DF$EVTYPE1)] <- "Cold/Wind Chill"
STRM.DF$event[grepl("DEBR|LANDSL|MUD", STRM.DF$EVTYPE1)] <- "Debris Flow"
STRM.DF$event[grepl("^FOG|^DENSE FOG", STRM.DF$EVTYPE1)] <- "Dense Fog"
STRM.DF$event[grepl("SMOKE", STRM.DF$EVTYPE1)] <- "Dense Smoke"
STRM.DF$event[grepl("DROU", STRM.DF$EVTYPE1)] <- "Drought"
STRM.DF$event[grepl("DEV", STRM.DF$EVTYPE1)] <- "Dust Devil"
STRM.DF$event[grepl("DUST S|BLOWING D", STRM.DF$EVTYPE1)] <- "Dust Storm"
STRM.DF$event[grepl(".+HEAT|HEAT.+|^UNSEASONABLY WARM", STRM.DF$EVTYPE1)] <- "Excessive Heat"
STRM.DF$event[grepl("^EXTR.*CHILL|^EXTR.*COLD|^UN.*COLD|^REC.*COLD", STRM.DF$EVTYPE1)] <- "Extreme Cold/Wind Chill"
STRM.DF$event[grepl("FLAS", STRM.DF$EVTYPE1)] <- "Flash Flood"
STRM.DF$event[grepl("FRO|FRE", STRM.DF$EVTYPE1)] <- "Frost/Freeze"
STRM.DF$event[grepl("FUNN", STRM.DF$EVTYPE1)] <- "Funnel Cloud"
STRM.DF$event[grepl("FREEZING FOG|GLAZE|ICY ROADS", STRM.DF$EVTYPE1)] <- "Freezing Fog"
STRM.DF$event[grepl("HAIL", STRM.DF$EVTYPE1)] <- "Hail"
STRM.DF$event[grepl("^HEAT$", STRM.DF$EVTYPE1)] <- "Heat"
STRM.DF$event[grepl("RAIN|EXCESSIVE WET", STRM.DF$EVTYPE1)] <- "Heavy Rain"
STRM.DF$event[grepl("SNOW", STRM.DF$EVTYPE1)] <- "Heavy Snow"
STRM.DF$event[grepl("SURF", STRM.DF$EVTYPE1)] <- "High Surf"
STRM.DF$event[grepl("HI.*WIN|^WIND$|GUSTY", STRM.DF$EVTYPE1)] <- "High Wind"
STRM.DF$event[grepl("HURR|TYP", STRM.DF$EVTYPE1)] <- "Hurricane (Typhoon)"
STRM.DF$event[grepl("ICE.*STO|^ICE$|^BLACK ICE", STRM.DF$EVTYPE1)] <- "Ice Storm"
STRM.DF$event[grepl("LAKE.*EF", STRM.DF$EVTYPE1)] <- "Lake-Effect Snow"
STRM.DF$event[grepl("LAKE.*FLO", STRM.DF$EVTYPE1)] <- "Lakeshore Flood"
STRM.DF$event[grepl("LIGHT", STRM.DF$EVTYPE1)] <- "Lightning"
STRM.DF$event[grepl("MAR.*HAI", STRM.DF$EVTYPE1)] <- "Marine Hail"
STRM.DF$event[grepl("MAR.*HI.*W|MARINE MISH|COASTAL *ST", STRM.DF$EVTYPE1)] <- "Marine High Wind"
STRM.DF$event[grepl("MAR.*STR.*W|HIGH SEAS|ROUGH SEAS|ROUGH WAVES|HIGH WATER|HEAVY SEAS", STRM.DF$EVTYPE1)] <- "Marine Strong Wind"
STRM.DF$event[grepl("MAR.*TS.*W|MAR.*TH.*W", STRM.DF$EVTYPE1)] <- "Marine Thunderstorm Wind"
STRM.DF$event[grepl("RIP", STRM.DF$EVTYPE1)] <- "Rip Current"
STRM.DF$event[grepl("SEI", STRM.DF$EVTYPE1)] <- "Seiche"
STRM.DF$event[grepl("SLE", STRM.DF$EVTYPE1)] <- "Sleet"
STRM.DF$event[grepl("SURG|^ASTRONOMICAL HIGH", STRM.DF$EVTYPE1)] <- "Storm Surge/Tide"
STRM.DF$event[grepl("^STRONG WIND", STRM.DF$EVTYPE1)] <- "Strong Wind"
STRM.DF$event[grepl("^ *TH.*WIND|^ *TSTM.*WIND|MICROB|^THUNDERSTORM|^SEVERE TH", STRM.DF$EVTYPE1)] <- "Thunderstorm Wind"
STRM.DF$event[grepl("TORNADO", STRM.DF$EVTYPE1)] <- "Tornado"
STRM.DF$event[grepl("TROP.*DEP", STRM.DF$EVTYPE1)] <- "Tropical Depression"
STRM.DF$event[grepl("TROP.*STO", STRM.DF$EVTYPE1)] <- "Tropical Storm"
STRM.DF$event[grepl("TSUN", STRM.DF$EVTYPE1)] <- "Tsunami"
STRM.DF$event[grepl("VOL", STRM.DF$EVTYPE1)] <- "Volcanic Ash"
STRM.DF$event[grepl("WATERSP", STRM.DF$EVTYPE1)] <- "Waterspout"
STRM.DF$event[grepl("FIRE", STRM.DF$EVTYPE1)] <- "Wildfire"
STRM.DF$event[grepl("WINTER.*STO", STRM.DF$EVTYPE1)] <- "Winter Storm"
STRM.DF$event[grepl("WINTER.*WEA|^WINTRY", STRM.DF$EVTYPE1)] <- "Winter Weather"
# calculate the number of fatalities etc. missing because of
# event categories that were not mapped to the new "event" variable.
STRM.NA <- STRM.DF[is.na(STRM.DF$event),]
na.fat <- sum(STRM.NA$FATALITIES)/sum(STRM.DF$FATALITIES)*100
na.inj <- sum(STRM.NA$INJURIES)/sum(STRM.DF$INJURIES)*100
na.prop <- sum(STRM.NA$prop.damage)/sum(STRM.DF$prop.damage)*100
na.crop <- sum(STRM.NA$crop.damage)/sum(STRM.DF$crop.damage)*100
Percentage of fatalities etc. missing in the analysis because some orignial event categories in EVTYPE were not mapped to the new “event” variable:
These errors are so small that they do not impact the findings. Therefore, I decided to not spend more time for cleaning up the EVTYPE categories. However, in an ideal case, one would map every single category in EVTYPe to exactly one of the 48 final event catagories.
library(ggplot2)
library(reshape2)
library(xtable)
health <- aggregate(cbind(FATALITIES, INJURIES) ~ event, data = STRM.DF, sum)
# change column labels and factor order for nicer chart printing
health$event <- as.factor(health$event)
health$event <- reorder(health$event, -(health$FATALITIES + health$INJURIES))
health <- health[order(health$event),]
colnames(health) <- c("Event", "Fatalities", "Injuries")
health1 <- melt(health[1:10,], id.var="Event")
health1$Event = with(health1, factor(Event, levels = rev(levels(Event))))
ggplot(health1, aes(x=Event, y=value/1000, fill=variable)) +
geom_bar(stat="identity") +
scale_y_continuous("Count of Fatalities and Injuries (in Thousands)") +
labs(x = "") +
coord_flip() +
labs(title = "Human Cost of Servere Weather Events") +
theme_bw() +
theme(
legend.title=element_blank(),
legend.position = c(1, 0),
legend.justification = c("right", "bottom"),
plot.title=element_text(size=14, face="bold", vjust=2),
axis.title.x=element_text(size=10, vjust=-0.4)
)
Tornadoes are by far the most harmful weather events for human health. In total tornadoes cost 5658 fatalites and 91364 injuries in the period from 1950 through 2011. Excessive heat and flash floods are next in causing human deaths. In terms of injuries thunderstorm winds and excessive heat are on rank two and three.
print(xtable(head(health, 10), digits=0),
type="html", include.rowname=F,
html.table.attributes = "border = '2', align = 'center'")
| Event | Fatalities | Injuries |
|---|---|---|
| Tornado | 5658 | 91364 |
| Thunderstorm Wind | 714 | 9536 |
| Excessive Heat | 2241 | 7141 |
| Flood | 511 | 6874 |
| Lightning | 818 | 5234 |
| Heat | 937 | 2100 |
| Flash Flood | 1035 | 1800 |
| Ice Storm | 96 | 2153 |
| High Wind | 324 | 1605 |
| Wildfire | 90 | 1608 |
The NOAA data set distinguishes general property damages and crop damages. Property damages are an order of magnitude higher than crop damages.
prop <- aggregate(prop.damage ~ event, data = STRM.DF, sum)
# transform to billion dollars; change factor order for chart, table
prop$prop.damage <- prop$prop.damage/10^9
prop$event <- as.factor(prop$event)
prop$event <- reorder(prop$event, prop$prop.damage)
ggplot(prop[prop$prop.damage > 1,], aes(event, prop.damage)) +
geom_bar(stat="identity", fill="turquoise3") +
coord_flip() +
scale_y_continuous("Billion Dollars") +
labs(x = "",
title = "Property Damage Caused By Servere Weather Events"
) +
theme_bw() +
theme(
plot.title=element_text(size=14, face="bold", vjust=1.5),
axis.title.x=element_text(size=10, vjust=0)
)
Floods, huricanes and tornadoes are the three leading weather events causing property damages.
prop <- prop[order(prop$event, decreasing = T),]
tab.prop <- xtable(prop[prop$prop.damage > 1,], digits=3,
caption = "Property Damages in Billion Dollars")
print(tab.prop, type="html", include.rowname=F, include.colnames=F,
caption.placement = "top",
html.table.attributes = "border = '2', align = 'center'")
| Flood | 150.218 |
| Hurricane (Typhoon) | 85.356 |
| Tornado | 58.542 |
| Storm Surge/Tide | 47.975 |
| Flash Flood | 16.908 |
| Hail | 15.975 |
| Thunderstorm Wind | 10.978 |
| Wildfire | 8.502 |
| Tropical Storm | 7.714 |
| Winter Storm | 6.750 |
| High Wind | 6.015 |
| Ice Storm | 3.958 |
| Heavy Rain | 3.236 |
| Drought | 1.046 |
crop <- aggregate(crop.damage ~ event, data = STRM.DF, sum)
crop$crop.damage <- crop$crop.damage/10^9
crop$event <- as.factor(crop$event)
crop$event <- reorder(crop$event, crop$crop.damage)
ggplot(crop[crop$crop.damage > 1,], aes(event, crop.damage)) +
geom_bar(stat="identity", fill="turquoise3") +
coord_flip() +
scale_y_continuous("Billion Dollars") +
labs(x = "",
title = "Crop Damage Caused By Servere Weather Events"
) +
theme_bw() +
theme(
plot.title=element_text(size=14, face="bold", vjust=1.5),
axis.title.x=element_text(size=10, vjust=0)
)
Droughts, floods, and hurricanes are the three leading weather events causing crop damages.
crop <- crop[order(crop$event, decreasing = T),]
tab.crop <- xtable(crop[crop$crop.damage > 1,], digits=3,
caption = "Crop Damages in Billion Dollars")
print(tab.crop, type="html", include.rowname=F, , include.colnames=F,
caption.placement = "top",
html.table.attributes = "border = '2', align = 'center'")
| Drought | 13.973 |
| Flood | 10.743 |
| Hurricane (Typhoon) | 5.516 |
| Ice Storm | 5.022 |
| Hail | 3.047 |
| Frost/Freeze | 1.997 |
| Flash Flood | 1.532 |
| Extreme Cold/Wind Chill | 1.360 |
| Thunderstorm Wind | 1.272 |
| Heavy Rain | 1.060 |
sessionInfo()
## R version 3.1.0 (2014-04-10)
## Platform: x86_64-apple-darwin13.1.0 (64-bit)
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] xtable_1.7-3 reshape2_1.2.2 knitr_1.6 ggplot2_1.0.0
##
## loaded via a namespace (and not attached):
## [1] codetools_0.2-8 colorspace_1.2-4 dichromat_2.0-0
## [4] digest_0.6.4 evaluate_0.5.3 formatR_0.10
## [7] grid_3.1.0 gtable_0.1.2 htmltools_0.2.4
## [10] labeling_0.2 MASS_7.3-31 munsell_0.4.2
## [13] plyr_1.8.1 proto_0.3-10 RColorBrewer_1.0-5
## [16] Rcpp_0.11.1 rmarkdown_0.2.46 scales_0.2.3
## [19] stringr_0.6.2 tools_3.1.0 yaml_2.1.13