Synopsis

The purpose of this paper is to examine the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. We have summarized the impact to human population and econmic consequence by weather event type. The results show that the most harmful event to population health are tornadoes. Since 1996, tornadoes have injured 20,667 individuals, and killed 1,511. Excessive heat is the deadliest event, and second most harmful overall, with 1,797 fatalities and 6,461 injuries. With respect to economic consequences, floods have had the greatest impact, resulting in close to $150 billion in total damage. Drought has had the greatest consequence on crop damage, with $13 billion.

Data Processing

Set global options for R code

library(knitr)
library(reshape2)
library(lubridate)
library(ggplot2)

opts_chunk$set(echo = TRUE, cache = TRUE, cache.path = "cache/", 
               fig.path = "figure/", fig.width = 7, fig.height = 7)

The data for this paper is the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. The data contains storm events that start in the year 1950 and end in November 2011. The earlier years of the database have generally fewer events recorded, this is likely due to a lack of good records.

First we download the data if it does not already exist. Then we read in the bz2 file directly, without unzipping.

if (! file.exists("../data/StormData.csv.bz2")){
  download.file("http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
               destfile = "../data/StormData.csv.bz2")
} 

dat <- read.csv(bzfile("../data/StormData.csv.bz2", "rt"))

After reading the data, we check the first few rows and columns.

dim(dat)
## [1] 902297     37
head(dat[, 1:7])
##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL
## 6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL

According to the NOAA’s description of the database, not all data has been tracked the entire time. This will cause skew in the numbers. Tornados have been tracked sinced the begining of the database, in 1950. Thunderstorm, Wind, and Hail started being tracked in 1955. But it wasn’t until 1996 that all the other event types appeared as well. We cannot say which event type had the most damange, without a data set that takes all 48 event types into account. For this reason, we filter out all data pre-1996.

dat <- subset(dat, year(mdy_hms(BGN_DATE)) >= 1996)
nrow(dat)
## [1] 653530

Let’s take a look at some of the event types that we’ll be working with. This will be one of the primary columns used in this analysis.

length(table(dat$EVTYPE))
## [1] 985
tail(levels(dat$EVTYPE), 16)
##  [1] "WINDS"                   "WINTER MIX"             
##  [3] "WINTER STORM"            "WINTER STORM HIGH WINDS"
##  [5] "WINTER STORM/HIGH WIND"  "WINTER STORM/HIGH WINDS"
##  [7] "WINTER STORMS"           "Winter Weather"         
##  [9] "WINTER WEATHER"          "WINTER WEATHER MIX"     
## [11] "WINTER WEATHER/MIX"      "WINTERY MIX"            
## [13] "Wintry mix"              "Wintry Mix"             
## [15] "WINTRY MIX"              "WND"

We can see this is not a clean data set. There are 985 different event types, but there at least 14 different ways to express bad winter weather, with 4 of those being variations on spellings and capitalizations of “Wintry mix”.

We’ll do some light data clean-up to try and match more closely the official 48 Event Types specified here. That document describes each of the event types in detail. For the non-matching types to clean, we try to match them to the descriptions provided in the document. We have also used the “REMAKRS” column in the data set to find more information of the non-conforming event type.

#fix upper/lower case mix by making all names upper case
dat$EVTYPE <- toupper(dat$EVTYPE)

#fix events with spaces in front of the names (ie " COASTAL FLOOD")
dat$EVTYPE <- sub("^ *", "", dat$EVTYPE)

#match event types with 48 official types
dat$EVTYPE <- sub(".*blizzard.*", "BLIZZARD", dat$EVTYPE,  ignore.case = T)
dat$EVTYPE <- sub(".*coastal.*flood.*", "COASTAL FLOOD", dat$EVTYPE, ignore.case = T)
dat$EVTYPE <- sub(".*extreme.*(cold|wind.*chill).*", "EXTREME COLD/WIND CHILL", dat$EVTYPE, ignore.case = T)
dat$EVTYPE <- sub(".*frost.*|.*freeze.*", "FROST/FREEZE", dat$EVTYPE, ignore.case = T)

#replace all "cold" or "wind chill", unless it has "extreme" in the name
x <- grep("extreme", dat$EVTYPE, ignore.case = T)
dat[-x, "EVTYPE"] <- sub(".*(cold|wind.*chill).*", "COLD/WIND CHILL", dat[-x, "EVTYPE"], ignore.case = T)

dat$EVTYPE <- sub("tornado debris", "TORNADO", dat$EVTYPE, ignore.case = T)
dat$EVTYPE <- sub(".*dense.*fog.*", "DENSE FOG", dat$EVTYPE, ignore.case = T)
dat$EVTYPE <- sub(".*ice.*fog.*", "FREEZING FOG", dat$EVTYPE, ignore.case = T)
dat$EVTYPE <- sub(".*drought.*", "DROUGHT", dat$EVTYPE, ignore.case = T)
dat$EVTYPE <- sub(".*dust.*devel.*", "DUST DEVIL", dat$EVTYPE, ignore.case = T)
dat$EVTYPE <- sub(".*blowing.*dust.*", "DUST STORM", dat$EVTYPE, ignore.case = T)
dat$EVTYPE <- sub(".*saharan.*dust.*", "DUST DEVIL", dat$EVTYPE, ignore.case = T)
dat$EVTYPE <- sub(".*heat.*wave.*", "EXCESSIVE HEAT", dat$EVTYPE, ignore.case = T)
x <- grep("excessive", dat$EVTYPE, ignore.case = T)
dat[-x, "EVTYPE"] <- sub(".*heat.*", "HEAT", dat[-x, "EVTYPE"], ignore.case = T)
dat$EVTYPE <- sub(".*flash.*", "FLASH FLOOD", dat$EVTYPE,  ignore.case = T)
x <- grep("flash", dat$EVTYPE, ignore.case = T)
dat[-x, "EVTYPE"] <- sub(".*flood.*", "FLOOD", dat[-x, "EVTYPE"], ignore.case = T)
dat$EVTYPE <- sub(".*flash.*", "FLASH FLOOD", dat$EVTYPE,  ignore.case = T)
dat$EVTYPE <- sub(".*funnel.*cloud.*", "FUNNEL CLOUD", dat$EVTYPE,  ignore.case = T)
dat$EVTYPE <- sub(".*freezing.*(drizzle|rain|spray|precip).*", "WINTER STORM", dat$EVTYPE,  ignore.case = T)
dat$EVTYPE <- sub(".*hail.*", "HAIL", dat$EVTYPE,  ignore.case = T)

#Here we classify all remaining events with "rain" into "heavy rain". For the 
#non-impacting rows (ie 0 crop/property damage, 0 injury/fatality), there is no 
#difference what group they belong to. The non-zero rows, which have impact, 
#should be classified as "heavy rain" anyway, per the NOAA documentation: 
#"Heavy Rain (C). Unusually large amount of rain which does not cause a flash 
#flood or flood, but causes damage, e.g., roof collapse or other human/economic 
#impact."
dat$EVTYPE <- sub(".*rain.*", "HEAVY RAIN", dat$EVTYPE,  ignore.case = T)
dat$EVTYPE <- sub(".*snow.*", "HEAVY SNOW", dat$EVTYPE,  ignore.case = T)
dat$EVTYPE <- sub(".*surf.*", "HIGH SURF", dat$EVTYPE,  ignore.case = T)
dat$EVTYPE <- sub(".*high wind.*", "HIGH WIND", dat$EVTYPE,  ignore.case = T)
x <- grep("marine|non", dat$EVTYPE, ignore.case = T)
dat[-x, "EVTYPE"] <- sub(".*(tstm|thunderstorm).*wind.*", "THUNDERSTORM WIND", dat[-x, "EVTYPE"], ignore.case = T)
dat$EVTYPE <- sub(".*strong.*wind.*", "STRONG WIND", dat$EVTYPE, ignore.case = T)
dat$EVTYPE <- sub(".*(gradient|gusty|wind and wave|wind damage).*", "STRONG WIND", dat$EVTYPE,  ignore.case = T)
dat$EVTYPE <- sub(".*(whirlwind|wind gusts).*", "THUNDERSTORM WIND", dat$EVTYPE, ignore.case = T)
dat$EVTYPE <- sub(".*winds.*", "STRONG WIND", dat$EVTYPE, ignore.case = T)

#filter out wind advisory and wake low wind, since they have 0 impact and
#do not belong in any of the official 48 types
dat <- dat[! dat$EVTYPE %in% c("WIND ADVISORY", "WAKE LOW WIND"), ]
dat$EVTYPE <- sub(".*marine tstm wind.*", "MARINE THUNDERSTORM WIND", dat$EVTYPE, ignore.case = T)
dat$EVTYPE <- sub(".*non.*tstm wind.*", "HIGH WIND", dat$EVTYPE, ignore.case = T)
dat$EVTYPE <- sub("^WIND$", "HIGH WIND", dat$EVTYPE, ignore.case = T)
dat$EVTYPE <- sub("^fog$", "DENSE FOG", dat$EVTYPE, ignore.case = T)
dat$EVTYPE <- sub(".*forest.*", "WILDFIRE", dat$EVTYPE, ignore.case = T)
dat$EVTYPE <- sub(".*currents.*", "RIP CURRENT", dat$EVTYPE,  ignore.case = T)
dat$EVTYPE <- sub(".*glaze.*", "WINTER WEATHER", dat$EVTYPE,  ignore.case = T)
dat$EVTYPE <- sub(".*(hurricane|typhoon).*", "HURRICANE/TYPHOON", dat$EVTYPE,  ignore.case = T)
dat$EVTYPE <- sub(".*stream.*", "HEAVY RAIN", dat$EVTYPE,  ignore.case = T)
dat$EVTYPE <- sub(".*slide.*", "DEBRIS FLOW", dat$EVTYPE,  ignore.case = T)
dat$EVTYPE <- sub(".*wint.*mix.*", "WINTER WEATHER", dat$EVTYPE,  ignore.case = T)
dat$EVTYPE <- sub(".*surge.*", "STORM SURGE/TIDE", dat$EVTYPE,  ignore.case = T)

length(table(dat$EVTYPE))
## [1] 219

We have now greatly reduced the number of distinct events.

Processing for harm to population health

To see the effects of each event on human population, we group by event (EVTYPE) and get the number of fatalities and injuries. We do this by melting the data, specifying event type as the id, and the measures as fatalities and injuries. Then we reshape it with dcast so we can see the total fatalities and injuries for each event type. We add a new column, EFFECTED, which represents the sum of fatalities and injuries. This lets us avoid Simpson’s paradox in case any events have a high total effected, but not high in fatalities or injuries alone.

dat.melt <- melt(dat, id.vars = c("EVTYPE"), measure.vars = c("FATALITIES", 
                                                              "INJURIES"))
dat.agg <- dcast(dat.melt, EVTYPE ~ variable, sum)

dat.agg$EFFECTED <- dat.agg$FATALITIES + dat.agg$INJURIES

Processing for economic consequences

First we will slim the data set down by taking only the columns we need. Then we will get the property and crop damage for each event by multiplying the damage columns by the exponent columns. The definitions of values in the exponent fields can be found here. We use a chained IFELSE statement, which supports vectorized operations.

dat.damage <- dat[, 
                  c("EVTYPE", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")]

dat.damage$TOTAL_CROP <- 
    ifelse(dat.damage$CROPDMGEXP %in% c('h', 'H'),
      dat.damage$CROPDMG * 100,
    ifelse(dat.damage$CROPDMGEXP %in% c('k', 'K'),
      dat.damage$CROPDMG * 1000, 
    ifelse(dat.damage$CROPDMGEXP %in% c('m', 'M'),
      dat.damage$CROPDMG * 1000000,
    ifelse(dat.damage$CROPDMGEXP %in% c('b', 'B'),
      dat.damage$CROPDMG * 1000000000,
    ifelse(dat.damage$CROPDMGEXP %in% c('-', '?'),
      dat.damage$CROPDMG * 0 ,
    ifelse(dat.damage$CROPDMGEXP %in% c('+'),
       dat.damage$CROPDMG * 1,
    ifelse(dat.damage$CROPDMGEXP %in% 0:8,
        dat.damage$CROPDMG * 10,
    ifelse(dat.damage$CROPDMGEXP == "" & dat.damage$CROPDMG != 0,
        dat.damage$CROPDMG * 0,
        0))))))))

dat.damage$TOTAL_PROP <- 
    ifelse(dat.damage$PROPDMGEXP %in% c('h', 'H'),
      dat.damage$PROPDMG * 100,
    ifelse(dat.damage$PROPDMGEXP %in% c('k', 'K'),
      dat.damage$PROPDMG * 1000, 
    ifelse(dat.damage$PROPDMGEXP %in% c('m', 'M'),
      dat.damage$PROPDMG * 1000000,
    ifelse(dat.damage$PROPDMGEXP %in% c('b', 'B'),
      dat.damage$PROPDMG * 1000000000,
    ifelse(dat.damage$PROPDMGEXP %in% c('-', '?'),
      dat.damage$PROPDMG * 0 ,
    ifelse(dat.damage$PROPDMGEXP %in% c('+'),
       dat.damage$PROPDMG * 1,
    ifelse(dat.damage$PROPDMGEXP %in% 0:8,
        dat.damage$PROPDMG * 10,
    ifelse(dat.damage$PROPDMGEXP == "" & dat.damage$PROPDMG != 0,
        dat.damage$PROPDMG * 0,
        0))))))))

Similar to processing for population harm, we melt the dataset by event type, and reshape it by event type. This aggregates all the events and gives us the total property and crop damage per event. We also add a new column, TOTAL_DAMAGE, which is the sum of property and crop damage.

dat.dam.melt <- melt(dat.damage, id.vars = c("EVTYPE"), measure.vars = 
                   c("TOTAL_PROP", "TOTAL_CROP"))
dat.dam.agg <- dcast(dat.dam.melt, EVTYPE ~ variable, sum)

dat.dam.agg$TOTAL_DAMAGE <- dat.dam.agg$TOTAL_PROP + dat.dam.agg$TOTAL_CROP

Results

Harm to population health

We sort the aggregated results by fatalities, injuries, and total effected. This allows us to compare across each of those categories. The final sort, by total effected, is the metric we will use in determining what events are the most harmful to population health.

dat.ord <- dat.agg[order(dat.agg$FATALITIES, decreasing = TRUE), ]
temp <- dat.ord[1:10, c("EVTYPE")]

dat.ord <- dat.agg[order(dat.agg$INJURIES, decreasing = TRUE), ]
temp <- cbind(temp, dat.ord[1:10, c("EVTYPE")])

dat.ord <- dat.agg[order(dat.agg$EFFECTED, decreasing = TRUE), ]
temp <- cbind(temp, dat.ord[1:10, c("EVTYPE")])

rownames(temp) <- 1:10
colnames(temp) <- c("Fatalities", "Injuries", "Total Effected")

temp
##    Fatalities                Injuries            Total Effected     
## 1  "EXCESSIVE HEAT"          "TORNADO"           "TORNADO"          
## 2  "TORNADO"                 "FLOOD"             "EXCESSIVE HEAT"   
## 3  "FLASH FLOOD"             "EXCESSIVE HEAT"    "FLOOD"            
## 4  "LIGHTNING"               "THUNDERSTORM WIND" "THUNDERSTORM WIND"
## 5  "RIP CURRENT"             "LIGHTNING"         "LIGHTNING"        
## 6  "FLOOD"                   "FLASH FLOOD"       "FLASH FLOOD"      
## 7  "THUNDERSTORM WIND"       "WILDFIRE"          "WILDFIRE"         
## 8  "EXTREME COLD/WIND CHILL" "HURRICANE/TYPHOON" "WINTER STORM"     
## 9  "HIGH WIND"               "WINTER STORM"      "HEAT"             
## 10 "HEAT"                    "HEAT"              "HURRICANE/TYPHOON"

Sorting by fatality, injury, or total, the top five most-harmful events are Tornado, Flood, Heat, Wind, Lightning. Winter Weather and Thunderstorms also appear in the top 10 for each category.

We will take the top 10 most harmful events overall, and melt that dataframe to get Fatalities and Injuries back in the same column. This makes it easy to stack.

top <- head(dat.ord, 10)
top.melt <- melt(top, id.vars = c("EVTYPE"), measure.vars = c("FATALITIES", 
                                                              "INJURIES"))
g <- ggplot(top.melt, aes(x = EVTYPE, y = value, fill = variable))
g + geom_bar(stat = "identity") + 
  labs(y = "Number of People Effected", x = "Event", 
       title = "Most Harmful Events to Population Health\n1996 - 2011") +
  theme(axis.text.x = element_text(angle = 45, hjust=1), plot.title = 
          element_text(size = 18)) + 
  scale_y_continuous(expand = c(0, 0), limits = c(0, 22600))

Figure depicts the most harmful weather events for the human population. Tornados are the most harmful with respect to injury and overall. Heat is the most deadly.

It is clear from this figure that tornados are the most harmful with regards to injury and overall. However heat is the deadliest weather event.

Economic consequences

We order by TOTAL_DAMAGE, the metric we will use in determing the events with the largest economic consequences. Again, we melt by property and crop damage, to stack these in the plot.

dat.dam.ord <- dat.dam.agg[order(dat.dam.agg$TOTAL_DAMAGE, decreasing = TRUE), ]
top.dam <- head(dat.dam.ord, 10)

top.dam.melt <- melt(top.dam, id.vars = c("EVTYPE"), measure.vars = 
                       c("TOTAL_PROP", "TOTAL_CROP"))

h <- ggplot(top.dam.melt, aes(x = EVTYPE, y = value, fill = variable))
h + geom_bar(stat = "identity") + 
  labs(y = "U.S. Dollars of Damage", x = "Event", 
       title = "Events with Greatest Economic Consequences\n1996 - 2011") +
  theme(axis.text.x = element_text(angle = 45, hjust=1), plot.title = 
          element_text(size = 18))  +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 1.8E11))

Figure depicts the weather events with the greatest economic impact. Flood is the most impactful overall, but drought has the biggest impact on crops

It is clear from the figure that flooding is the most impactful weather event, both for property damage, and overall. However, drought has the largest economic impact on crop damage.