Synopsis

In this paper I strive to answer two questions: 1. What weather events cause the most human tolls and 2. what weather events cause the most property and crop damages? The analysis is based on a data set provided by the National Oceanic and Atmospheric Administration (NOAA) which covers severe weather events in the USA from 1950 through 2011. The results show that tornadoes by far cause the most human fatalities and injuries while floods and droughts cause the most economic damages.

Data Processing

Data Import

For this analysis I used the “Storm Data” provided by NOAA. The comma-separated data file is available on the Internet:

https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2

Documentation is available at:

https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2FNCDC%20Storm%20Events-FAQ%20Page.pdf

The events in the database start in the year 1950 and end in November 2011. For the earlier years there are generally fewer events recorded in the data set, most likely due to a lack of good records. More recent years should be considered more complete.

The following code snippet shows how I loaded the data using read.csv. I created a subset of relevant variables and included only records which had data in at least one of the following columns: FATALITIES, INJURIES, PROPDMG, CROPDMG. This step reduced the data size significantly.

NOAA.DF <- read.csv(bzfile("NOAAdata/repdata-data-StormData.csv.bz2"))

STRM.DF <- subset(NOAA.DF, 
                  FATALITIES>0 | INJURIES>0 | PROPDMG>0 | CROPDMG>0, 
                  select = c(BGN_DATE, STATE, EVTYPE, MAG, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP))

STRM.DF <- droplevels(STRM.DF) #get rid of unused factor levels

Data clean up

I needed to clean up two data problems before starting the analysis: 1. dollar values needed to calculated from two variables, and 2. the catagories for weather events were quite messy.

The NOAA data set captures dollar values for property and crop damages in a scientific format of sorts. That is, one set of variables (PROPDMG, CROPDMG) shows the significant digits (e.g 1.23) and a second set of variables (PROPDMGEXP, CROPDMGEXP) shows the exponent (e.g. k for thousand or M for million). Therefore, 1.23 k means $1.23*10^3 or $1,230.

However, the data for the exponents is messy. I applied the following transformations:

  • h, H: 2
  • k, K: 3
  • m, M: 6
  • b, B: 9
  • all other values: 0
exp <- function(y){
    exp <- NULL
    ifelse(y %in% c("h", "H"), exp <- 2,
           ifelse(y %in% c("k", "K"), exp <- 3,
                  ifelse(y %in% c("m", "M"), exp <- 6,
                         ifelse(y %in% c("b", "B"), exp <- 9,
                                exp <- 0))))
    return(exp)
}

# new variables for the combined dollar values
STRM.DF$prop.damage <- STRM.DF$PROPDMG*10^sapply(STRM.DF$PROPDMGEXP, exp)
STRM.DF$crop.damage <- STRM.DF$CROPDMG*10^sapply(STRM.DF$CROPDMGEXP, exp)

The second issue was to clean up the EVTYPE variable. EVTYPE had 488 levels, with many of them being redundant because of different spelling or because the event label has changed over time. Therefore, I first changed all categories to upper case and replaced all “/” with spaces.

My goal was to reduce the 488 levels to the 48 weather event categories listed in section 2.1.1 of the “Strom Data Preparation” document. Due to limited time, I chose to apply a coarse string search (grepl) to map the original levels to the final 48 categories.

My method has two downsides: First some of the string patters are overlapping and therefore the sequence in which the searches are executed matters. Second, not all categories in the origninal variable EVTYPE were mapped to the new event variable. I controlled for that error by comparing the sum total of fatalities, injuries, propery damages and crop damages in the complete data set with the ones that were unmapped (see below).

STRM.DF$EVTYPE1 <- toupper(STRM.DF$EVTYPE)
STRM.DF$EVTYPE1 <- gsub("/", " ", STRM.DF$EVTYPE1)

STRM.DF$event<-NA

STRM.DF$event[grepl("LOW TIDE", STRM.DF$EVTYPE1)] <- "Astronomical Low Tide"
STRM.DF$event[grepl("AVAL", STRM.DF$EVTYPE1)] <- "Avalanche"
STRM.DF$event[grepl("BLIZ", STRM.DF$EVTYPE1)] <- "Blizzard"
STRM.DF$event[grepl("FLOOD|FLD", STRM.DF$EVTYPE1)] <- "Flood"
STRM.DF$event[grepl("COAST.*FLO", STRM.DF$EVTYPE1)] <- "Coastal Flood"
STRM.DF$event[grepl("^COLD WIND CHILL|^COLD|^LOW TEMP|HYPOTH", STRM.DF$EVTYPE1)] <- "Cold/Wind Chill"
STRM.DF$event[grepl("DEBR|LANDSL|MUD", STRM.DF$EVTYPE1)] <- "Debris Flow"
STRM.DF$event[grepl("^FOG|^DENSE FOG", STRM.DF$EVTYPE1)] <- "Dense Fog"
STRM.DF$event[grepl("SMOKE", STRM.DF$EVTYPE1)] <- "Dense Smoke"
STRM.DF$event[grepl("DROU", STRM.DF$EVTYPE1)] <- "Drought"
STRM.DF$event[grepl("DEV", STRM.DF$EVTYPE1)] <- "Dust Devil"
STRM.DF$event[grepl("DUST S|BLOWING D", STRM.DF$EVTYPE1)] <- "Dust Storm"
STRM.DF$event[grepl(".+HEAT|HEAT.+|^UNSEASONABLY WARM", STRM.DF$EVTYPE1)] <- "Excessive Heat"
STRM.DF$event[grepl("^EXTR.*CHILL|^EXTR.*COLD|^UN.*COLD|^REC.*COLD", STRM.DF$EVTYPE1)] <- "Extreme Cold/Wind Chill"
STRM.DF$event[grepl("FLAS", STRM.DF$EVTYPE1)] <- "Flash Flood"
STRM.DF$event[grepl("FRO|FRE", STRM.DF$EVTYPE1)] <- "Frost/Freeze"
STRM.DF$event[grepl("FUNN", STRM.DF$EVTYPE1)] <- "Funnel Cloud"
STRM.DF$event[grepl("FREEZING FOG|GLAZE|ICY ROADS", STRM.DF$EVTYPE1)] <- "Freezing Fog"
STRM.DF$event[grepl("HAIL", STRM.DF$EVTYPE1)] <- "Hail"
STRM.DF$event[grepl("^HEAT$", STRM.DF$EVTYPE1)] <- "Heat"
STRM.DF$event[grepl("RAIN|EXCESSIVE WET", STRM.DF$EVTYPE1)] <- "Heavy Rain"
STRM.DF$event[grepl("SNOW", STRM.DF$EVTYPE1)] <- "Heavy Snow"
STRM.DF$event[grepl("SURF", STRM.DF$EVTYPE1)] <- "High Surf"
STRM.DF$event[grepl("HI.*WIN|^WIND$|GUSTY", STRM.DF$EVTYPE1)] <- "High Wind"
STRM.DF$event[grepl("HURR|TYP", STRM.DF$EVTYPE1)] <- "Hurricane (Typhoon)"
STRM.DF$event[grepl("ICE.*STO|^ICE$|^BLACK ICE", STRM.DF$EVTYPE1)] <- "Ice Storm"
STRM.DF$event[grepl("LAKE.*EF", STRM.DF$EVTYPE1)] <- "Lake-Effect Snow"
STRM.DF$event[grepl("LAKE.*FLO", STRM.DF$EVTYPE1)] <- "Lakeshore Flood"
STRM.DF$event[grepl("LIGHT", STRM.DF$EVTYPE1)] <- "Lightning"
STRM.DF$event[grepl("MAR.*HAI", STRM.DF$EVTYPE1)] <- "Marine Hail"
STRM.DF$event[grepl("MAR.*HI.*W|MARINE MISH|COASTAL *ST", STRM.DF$EVTYPE1)] <- "Marine High Wind"
STRM.DF$event[grepl("MAR.*STR.*W|HIGH SEAS|ROUGH SEAS|ROUGH WAVES|HIGH WATER|HEAVY SEAS", STRM.DF$EVTYPE1)] <- "Marine Strong Wind"
STRM.DF$event[grepl("MAR.*TS.*W|MAR.*TH.*W", STRM.DF$EVTYPE1)] <- "Marine Thunderstorm Wind"
STRM.DF$event[grepl("RIP", STRM.DF$EVTYPE1)] <- "Rip Current"
STRM.DF$event[grepl("SEI", STRM.DF$EVTYPE1)] <- "Seiche"
STRM.DF$event[grepl("SLE", STRM.DF$EVTYPE1)] <- "Sleet"
STRM.DF$event[grepl("SURG|^ASTRONOMICAL HIGH", STRM.DF$EVTYPE1)] <- "Storm Surge/Tide"
STRM.DF$event[grepl("^STRONG WIND", STRM.DF$EVTYPE1)] <- "Strong Wind"
STRM.DF$event[grepl("^ *TH.*WIND|^ *TSTM.*WIND|MICROB|^THUNDERSTORM|^SEVERE TH", STRM.DF$EVTYPE1)] <- "Thunderstorm Wind"
STRM.DF$event[grepl("TORNADO", STRM.DF$EVTYPE1)] <- "Tornado"
STRM.DF$event[grepl("TROP.*DEP", STRM.DF$EVTYPE1)] <- "Tropical Depression"
STRM.DF$event[grepl("TROP.*STO", STRM.DF$EVTYPE1)] <- "Tropical Storm"
STRM.DF$event[grepl("TSUN", STRM.DF$EVTYPE1)] <- "Tsunami"
STRM.DF$event[grepl("VOL", STRM.DF$EVTYPE1)] <- "Volcanic Ash"
STRM.DF$event[grepl("WATERSP", STRM.DF$EVTYPE1)] <- "Waterspout"
STRM.DF$event[grepl("FIRE", STRM.DF$EVTYPE1)] <- "Wildfire"
STRM.DF$event[grepl("WINTER.*STO", STRM.DF$EVTYPE1)] <- "Winter Storm"
STRM.DF$event[grepl("WINTER.*WEA|^WINTRY", STRM.DF$EVTYPE1)] <- "Winter Weather"

# calculate the number of fatalities etc. missing because of
# event categories that were not mapped to the new "event" variable.
STRM.NA <- STRM.DF[is.na(STRM.DF$event),]
na.fat <- sum(STRM.NA$FATALITIES)/sum(STRM.DF$FATALITIES)*100
na.inj <- sum(STRM.NA$INJURIES)/sum(STRM.DF$INJURIES)*100
na.prop <- sum(STRM.NA$prop.damage)/sum(STRM.DF$prop.damage)*100
na.crop <- sum(STRM.NA$crop.damage)/sum(STRM.DF$crop.damage)*100

Percentage of fatalities etc. missing in the analysis because some orignial event categories in EVTYPE were not mapped to the new “event” variable:

  • Fatalities: 0.0792%
  • Injuries: 0.0342%
  • Property damages: 0.0026%
  • Crop damages: 0.0134%

These errors are so small that they do not impact the findings. Therefore, I decided to not spend more time for cleaning up the EVTYPE categories. However, in an ideal case, one would map every single category in EVTYPe to exactly one of the 48 final event catagories.

Results

Harm to population health

library(ggplot2)
library(reshape2)
library(xtable)

health <- aggregate(cbind(FATALITIES, INJURIES) ~ event, data = STRM.DF, sum)

# change column labels and factor order for nicer chart printing
health$event <- as.factor(health$event)
health$event <- reorder(health$event, -(health$FATALITIES + health$INJURIES))
health <- health[order(health$event),]
colnames(health) <- c("Event", "Fatalities", "Injuries")

health1 <- melt(health[1:10,], id.var="Event")
health1$Event = with(health1, factor(Event, levels = rev(levels(Event))))


ggplot(health1, aes(x=Event, y=value/1000, fill=variable)) + 
    geom_bar(stat="identity") + 
    scale_y_continuous("Count of Fatalities and Injuries (in Thousands)") +
    labs(x = "") +
    coord_flip() +
    labs(title = "Human Cost of Servere Weather Events") +
    theme_bw() +
    theme(
        legend.title=element_blank(),
        legend.position = c(1, 0),
        legend.justification = c("right", "bottom"),
        plot.title=element_text(size=14, face="bold", vjust=2),
        axis.title.x=element_text(size=10, vjust=-0.4)
        )

plot of chunk health

Tornadoes are by far the most harmful weather events for human health. In total tornadoes cost 5658 fatalites and 91364 injuries in the period from 1950 through 2011. Excessive heat and flash floods are next in causing human deaths. In terms of injuries thunderstorm winds and excessive heat are on rank two and three.

print(xtable(head(health, 10), digits=0), 
      type="html", include.rowname=F,
      html.table.attributes = "border = '2', align = 'center'")
Event Fatalities Injuries
Tornado 5658 91364
Thunderstorm Wind 714 9536
Excessive Heat 2241 7141
Flood 511 6874
Lightning 818 5234
Heat 937 2100
Flash Flood 1035 1800
Ice Storm 96 2153
High Wind 324 1605
Wildfire 90 1608

Economic impacts

Property Damages

The NOAA data set distinguishes general property damages and crop damages. Property damages are an order of magnitude higher than crop damages.

prop <- aggregate(prop.damage ~ event, data = STRM.DF, sum)


# transform to billion dollars; change factor order for chart, table
prop$prop.damage <- prop$prop.damage/10^9
prop$event <- as.factor(prop$event)
prop$event <- reorder(prop$event, prop$prop.damage)


ggplot(prop[prop$prop.damage > 1,], aes(event, prop.damage)) + 
    geom_bar(stat="identity", fill="turquoise3") + 
    coord_flip() +
    scale_y_continuous("Billion Dollars") +
    labs(x = "",
         title = "Property Damage Caused By Servere Weather Events"
         ) +
    theme_bw() +
    theme(
        plot.title=element_text(size=14, face="bold", vjust=1.5),
        axis.title.x=element_text(size=10, vjust=0)
        )

plot of chunk property

Floods, huricanes and tornadoes are the three leading weather events causing property damages.

prop <- prop[order(prop$event, decreasing = T),]
tab.prop <- xtable(prop[prop$prop.damage > 1,], digits=3, 
                   caption = "Property Damages in Billion Dollars")
print(tab.prop, type="html", include.rowname=F, include.colnames=F,
      caption.placement = "top",
      html.table.attributes = "border = '2', align = 'center'")
Property Damages in Billion Dollars
Flood 150.218
Hurricane (Typhoon) 85.356
Tornado 58.542
Storm Surge/Tide 47.975
Flash Flood 16.908
Hail 15.975
Thunderstorm Wind 10.978
Wildfire 8.502
Tropical Storm 7.714
Winter Storm 6.750
High Wind 6.015
Ice Storm 3.958
Heavy Rain 3.236
Drought 1.046

Crop Damages

crop <- aggregate(crop.damage ~ event, data = STRM.DF, sum)
crop$crop.damage <- crop$crop.damage/10^9
crop$event <- as.factor(crop$event)
crop$event <- reorder(crop$event, crop$crop.damage)

ggplot(crop[crop$crop.damage > 1,], aes(event, crop.damage)) + 
    geom_bar(stat="identity", fill="turquoise3") + 
    coord_flip() +
    scale_y_continuous("Billion Dollars") +
    labs(x = "",
         title = "Crop Damage Caused By Servere Weather Events"
         ) +
    theme_bw() +
    theme(
        plot.title=element_text(size=14, face="bold", vjust=1.5),
        axis.title.x=element_text(size=10, vjust=0)
        )

plot of chunk crop

Droughts, floods, and hurricanes are the three leading weather events causing crop damages.

crop <- crop[order(crop$event, decreasing = T),]
tab.crop <- xtable(crop[crop$crop.damage > 1,], digits=3,
                   caption = "Crop Damages in Billion Dollars")
print(tab.crop, type="html", include.rowname=F, , include.colnames=F,
      caption.placement = "top",
      html.table.attributes = "border = '2', align = 'center'")
Crop Damages in Billion Dollars
Drought 13.973
Flood 10.743
Hurricane (Typhoon) 5.516
Ice Storm 5.022
Hail 3.047
Frost/Freeze 1.997
Flash Flood 1.532
Extreme Cold/Wind Chill 1.360
Thunderstorm Wind 1.272
Heavy Rain 1.060

Session Info

sessionInfo()
## R version 3.1.0 (2014-04-10)
## Platform: x86_64-apple-darwin13.1.0 (64-bit)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] xtable_1.7-3   reshape2_1.2.2 knitr_1.6      ggplot2_1.0.0 
## 
## loaded via a namespace (and not attached):
##  [1] codetools_0.2-8    colorspace_1.2-4   dichromat_2.0-0   
##  [4] digest_0.6.4       evaluate_0.5.3     formatR_0.10      
##  [7] grid_3.1.0         gtable_0.1.2       htmltools_0.2.4   
## [10] labeling_0.2       MASS_7.3-31        munsell_0.4.2     
## [13] plyr_1.8.1         proto_0.3-10       RColorBrewer_1.0-5
## [16] Rcpp_0.11.1        rmarkdown_0.2.46   scales_0.2.3      
## [19] stringr_0.6.2      tools_3.1.0        yaml_2.1.13