Synopsis

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern. This study will explore the US NOAA storm database to answer the following:

  1. Which types of events are most harmful with respect to population health?
  2. Which types of events have the greatest economic consequences?

In the section Data we will process the data and prepare it for suitable exploration, and in the Results section we will show our descriptive analysis with the help of graphical plots. Finally, in the Conclusions sections we briefly summarize our findings and comment on further studies.

Data Processing

The data for this assignment come from the NOAA Storm Database in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size (Download dataset here). Codebook and data dictionary can be found here. The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete. The relevant columns for our analysis are the following:

  1. BGN_DATE: date of occurrence or first observation of the phenomenon. Date.
  2. BGN_TIME: time of occurrence or first observation. Time.
  3. EVTYPE: Type of event. Categorical.
  4. FATALITIES: Number of human lives lost to the phenomenon. Numeric.
  5. INJURIES: Number of people that were injured in the presence of the phenomenon. Numeric.
  6. PROPDMG: Amount of USD in damage to property in the presence of the phenomenon. Numeric.
  7. PROPDMGEXP: Scale of the PROPDMG field. NA=1, K=1,000, M=1,000,000, B=1,000,000,000. Numeric (will be fused with PROPDMG field during processing).
  8. CROPDMG: Amount of USD in damage to agricultural crops. Numeric.
  9. CROPDMGEXP: Scale of the CROPDMG field. NA=1, K=1,000, M=1,000,000, B=1,000,000,000. Numeric (will be fused with CROPDMG field during processing).
options(scipen = 999, digits = 2) # Set scientific notation digits

# Load required libraries
if (!require(lubridate)) {stop('Package lubridate must be installed before proceeding.')}
if (!require(dplyr)) {stop('Package dplyr must be installed before proceeding.')}
if (!require(ggplot2)) {stop('Package ggplot2 must be installed before proceeding.')}
if (!require(stringr)) {stop('Package stringr must be installed before proceeding.')}
if (!require(tidyr)) {stop('Package tidyr must be installed before proceeding.')}
if (!require(xtable)) {stop('Package xtable must be installed before proceeding.')}
if (!require(gridExtra)) {stop('Package gridExtra must be installed before proceeding.')}

# Download and process data
downloadedFilename <- 'StormData.csv.bz2'
if (!file.exists(downloadedFilename)) {
    dataUrl <- paste('https://d396qusza40orc.cloudfront.net/repdata/data/',downloadedFilename, sep = '')
    download.file(dataUrl, destfile = downloadedFilename, method='curl')
}

# Read-in file, converting empty strings and NA to NA
if (!exists('crude')) {
  crude <- read.csv(downloadedFilename, stringsAsFactors = F, strip.white=T, na.strings=c('NA',''))
}

# Build character vector by concatenating BGN_DATE and BGN_TIME.
stormCharTime <- paste(unlist(strsplit(crude$BGN_DATE, split = ' '))[rep(c(T,F), (nrow(crude)/2))],
                       crude$BGN_TIME, crude$TIME_ZONE)

# Create new dataframe with only the columns we're interested in. We're also merging crop and property damage amount with their respective scales.
if (!exists('storms')) {
  storms <- data.frame(
    timestamp=as.factor(stormCharTime),
    state=as.factor(crude$STATE),
    eventtype=as.factor(crude$EVTYPE),
    fatalities=as.numeric(crude$FATALITIES),
    injuries=as.numeric(crude$INJURIES),
    propertydamage=as.numeric(crude$PROPDMG)*as.numeric(ifelse(crude$PROPDMGEXP %in% c('k','K'), 1000,
                                         ifelse(crude$PROPDMGEXP %in% c('M','m'), 1000000,
                                         ifelse(crude$PROPDMGEXP %in% c('B','b'), 1000000000,
                                         ifelse(crude$PROPDMGEXP %in% c('h','H'), 100,1))))),
    cropdamage=as.numeric(crude$CROPDMG)*as.numeric(ifelse(crude$CROPDMGEXP %in% c('k','K'), 1000,
                                         ifelse(crude$CROPDMGEXP %in% c('M','m'), 1000000,
                                         ifelse(crude$CROPDMGEXP %in% c('B','b'), 1000000000,
                                         ifelse(crude$CROPDMGEXP %in% c('h','H'), 100,1)))))
  )
}

# Columns with no NAs in them
colnames(crude)[unlist(lapply(crude, function(x){!any(is.na(x))}))]
##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "STATE"      "EVTYPE"     "BGN_RANGE"  "COUNTY_END" "END_RANGE" 
## [11] "LENGTH"     "WIDTH"      "MAG"        "FATALITIES" "INJURIES"  
## [16] "PROPDMG"    "CROPDMG"    "LONGITUDE"  "LONGITUDE_" "REFNUM"

Considering that the columns holding data on type of weather phenomena, injuries, fatalities, and damage to property and crops don’t have missing data in them, and given that our questions of interest concern only human and economic cost, the entire dataset has been narrowed down to the proper variables.

Results

Which types of events are most harmful with respect to population health?

Definition of ‘population health’

For the purpose of this study, we define ‘population health’ as any injury to human lives caused by the weather phenomenon that results in temporary or permanent physical damage, or a fatality.

Injuries and fatalities by type of weather event

# Group by eventtype, summarize by injuries + fatalities and arrange by injuries + fatalities
harmfulevents <- storms %>% group_by(eventtype) %>% 
  summarise(dead=sum(fatalities),  
            wounded=sum(injuries), 
            dandw=sum(dead+wounded),
            deadpct=(dead/dandw),
            woundedpct=(wounded/dandw), 
            dandwpct=(dandw/dandw)) %>% 
  arrange(desc(dandw))

# Events with no human cost
nocost <- nrow(filter(harmfulevents, dandw==0))

We’ll tackle this question first by plotting the total count of both injured and fatal losses per event type. Also, we will focus on the 20 most harmful events and not on the entire set due to 765 events out of 985 claiming no human lives nor injured population.

# Top 20 harmful events
top20 <- slice(harmfulevents, 1:20)

# make data long and narrow for ggplot2
gathered <- gather(top20, type, count, -c(eventtype, dandw, dandwpct, deadpct, woundedpct))

# rearrange event types to sort them in the plot 
gathered <- mutate(gathered, eventtype=factor(eventtype, levels = top20$eventtype))

ggplot(gathered, aes(x=eventtype, y=count, fill=type)) +
  geom_bar(stat="identity") +
  theme(axis.text.x=element_text(angle=90, hjust=1)) +
  labs(x='Type of weather event', y='Fatalities + Injured count', title='Top 20 weather phenomena most damaging\nto human lives in the US (1950-2011)')

# Total human cost for tornadoes
totalhumancosttornadoes <- (filter(top20, eventtype=='TORNADO') %>% select(dandw))
totalhumancost <- sum(harmfulevents$dandw)

We can see that tornadoes are by far the weather phenomenon that inflicts the most damage on human lives, with 96979 people either injuried or killed, a whopping 62.3% of the human cost recorded by the entire dataset.

The effect of so high a number hides the contribution to human cost of the rest of the phenomena, so we’ll show a 2nd plot WITHOUT data for tornadoes.

# remove rows for tornadoes
notornado <- filter(gathered, eventtype!='TORNADO')

ggplot(notornado, aes(x=eventtype, y=count, fill=type)) +
  geom_bar(stat="identity") +
  theme(axis.text.x=element_text(angle=90, hjust=1)) +
  labs(x='Type of weather event', y='Fatalities + Injured count', title='Lower 19 of the top 20 weather phenomena most damaging\nto human lives in the US (1950-2011)')

The plot shows that the weather phenomena that inflict the most human suffering (combining human losses and injuries), after tornadoes are:

  1. Intense heat waves
  2. Thunderstorm winds
  3. Floods
  4. Lightning
  5. Heat waves
  6. Flash floods

Finally, to summarize these findings, we’ll show a table for these top 20 phenomena, ordered by fatalities, with their total cost, and percent of fatalities and injuries as fractions of total costs.

tab <- transmute(top20, WeatherPhenomenon=eventtype, TotalHumanCost=dandw, PercentInjured=woundedpct*100, PercentKilled=deadpct*100) %>% arrange(desc(PercentKilled)) %>% slice(1:20)

print(xtable(tab, 
             caption = 'Top 20 most harmful phenomena for human lives\n(ordered by % fatalities as fraction of total human cost',), 
      type='html', include.rownames=FALSE)
Top 20 most harmful phenomena for human lives (ordered by % fatalities as fraction of total human cost
WeatherPhenomenon TotalHumanCost PercentInjured PercentKilled
RIP CURRENT 600.00 38.67 61.33
FLASH FLOOD 2755.00 64.50 35.50
HEAT 3037.00 69.15 30.85
EXCESSIVE HEAT 8428.00 77.42 22.58
HIGH WIND 1385.00 82.09 17.91
LIGHTNING 6046.00 86.50 13.50
WINTER STORM 1527.00 86.51 13.49
BLIZZARD 906.00 88.85 11.15
HEAVY SNOW 1148.00 88.94 11.06
THUNDERSTORM WIND 1621.00 91.80 8.20
FOG 796.00 92.21 7.79
WILDFIRE 986.00 92.39 7.61
TSTM WIND 7461.00 93.24 6.76
THUNDERSTORM WINDS 972.00 93.42 6.58
FLOOD 7259.00 93.53 6.47
TORNADO 96979.00 94.19 5.81
HURRICANE/TYPHOON 1339.00 95.22 4.78
ICE STORM 2064.00 95.69 4.31
WILD/FOREST FIRE 557.00 97.85 2.15
HAIL 1376.00 98.91 1.09

In contrast with previous plots, in this table we can clearly see that in terms of fatal incidents, tornadoes are not the most dangerous to human life with just 1 fatality per 20 incidents, while rip currents are almost consistently fatal, with more than 1 casualty per 2 incidents, followed by flash floods and heat haves, with 1 fatality per 3 incidents and 1 casualty per 4 incidents, respectively.

Which types of events have the greatest economic consequences?

Definition of ‘economic consequences’

For the purpose of this study, we define ‘economic consequences’ as any damage inflicted to property and crops. We are not making any distinction as to whether one is more important than the other, neither are we bringing the costs to present value, so the numbers are reported in terms of cost at the time of the inflicted damage and no weight is assigned to neither.

Economic cost per type of event

# Group by eventtype, summarize by injuries + fatalities and arrange by injuries + fatalities
costlyevents <- storms %>% group_by(eventtype) %>% 
  summarise(propertydamage=sum(propertydamage),  
            cropdamage=sum(cropdamage), 
            totaldamage=sum(propertydamage+cropdamage),
            propertydamagepct=(propertydamage/totaldamage),
            cropdamagepct=(cropdamage/totaldamage)) %>%
  arrange(desc(totaldamage))
  
# Top 12 costly events by property damage
top12crop <- costlyevents %>% arrange(desc(cropdamage)) %>% slice(1:12) 

# Top 12 costly events by property damage
top12prop <- costlyevents %>% arrange(desc(propertydamage)) %>% slice(1:12) 

# Top 12 costly events
top12cost <- slice(costlyevents, 1:12)

# make data long and narrow for ggplot2 for each top 12 dataset 
gatheredcost <- gather(top12cost, type, count, -c(eventtype, totaldamage, propertydamagepct, cropdamagepct))

# rearrange event types to sort them in the plot 
gatheredcost <- mutate(gatheredcost, eventtype=factor(eventtype, levels = top12cost$eventtype))
gatheredprop <- mutate(top12prop, eventtype=factor(eventtype, levels = top12prop$eventtype))
gatheredcrop <- mutate(top12crop, eventtype=factor(eventtype, levels = top12crop$eventtype))


totalplot <- ggplot(gatheredcost, aes(x=eventtype, y=count, fill=type)) +
  geom_bar(stat="identity", aes(order=desc(type))) +
  theme(axis.text.x=element_text(angle=90, hjust=1), axis.title.x = element_blank(),
        legend.position='none') + 
  scale_y_continuous(breaks=seq(0,150000000000, 50000000000), labels=c(0, 50, 100, 150)) +
  labs(y='Property + Crop Damage')

propplot <- ggplot(gatheredprop, aes(x=eventtype, y=(propertydamage/1000000000))) +
  geom_bar(stat="identity", fill='#F8766D') +  
  theme(axis.text.x=element_text(angle=90, hjust=1), axis.title.x = element_blank()) +
  ylim(c(0,150)) +
  labs(y='Property Damage')

cropplot <- ggplot(gatheredcrop, aes(x=eventtype, y=(cropdamage/1000000000))) +
  geom_bar(stat="identity", fill='#00BFC4') +
  theme(axis.text.x=element_text(angle=90, hjust=1), axis.title.x = element_blank()) +
  ylim(c(0,150)) +
  labs(y='Crop Damage')

grid.arrange(propplot, cropplot, totalplot, ncol=3, main = 'Property & Crop Damage (billion $) per Event Type (1950-2011)')

We can conclude from these plots that:

  1. Crop damage is orders of magnitude lower than property damage.
  2. The phenomena inflicting the most property damage are 1) floods, 2) typhoons, 3) tornados, 4) storm surges.
  3. The phenomena inflicting the most crop damage are 1) drought, 2) flood, 3) riverflood, 4) icestorm.
  4. The impact of drought is mainly to crop damage with 93% of the total cost of $15 bn., as opposed to property damage, with the remaining 7%.
  5. The impact of flood is mostly localized to property damage, with 96% of the total cost of $150 bn.
  6. Riverfloods and icestorms are the events for which property and crop damage are balanced around 50%-50%.
  7. Save for drought, the top 5 events with the highest material cost involve strong winds and very heavy rain.

Conclusions

We observe that the combination of strong rains and winds are responsible for the most part of both human and material losses, so prevention and relief efforts should center around these 2 forces of nature, although a further, broader study on 1) disaster relief budget, 2) weather monitoring locations, and 3) early warning systems is required to make adequate recommendations that truly result in reduced material and human losses.