Synopsis

I analysed the impact of natural disasters on the United States using the NOAAstorm database. Specifically, I looked at which types of natural disasters caused the most human and / or economic damage across the United States in the last 60 years. I show that (various types of) floods caused the most economic damage, while (various types of) tornados caused the most human damage and also significant economic damage. These effects are dependent on geography and so vary state by state.

Prerequisites

First, load some packages:

library(stringr)
library(lubridate)
library(maps)
library(mapdata)
library(ggplot2)

Time of analysis: 2016-02-15 18:41:11.

Data Processing

The data for this assignment came from the NOAAstorm database in the form of a comma-separated-value file compressed via the bzip2 algorithm. This database contains observations from 1950 to 2011. The file was downloaded from the web and read into R. I also downloaded a file containing information about the different state names and their abbreviations; I used this to make some map plots later.

# download the data
url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(url, destfile="repdata-data-StormData.csv.bz2", method="curl")

# read the data into R - skipping the first line
stormdata <- read.csv2("repdata-data-StormData.csv.bz2", header = FALSE, skip = 1, sep=",", na.strings="", stringsAsFactors=FALSE)

# I will also download and read in a list of states for some map plots
url2 <- "http://www.fonz.net/blog/wp-content/uploads/2008/04/states.csv"
download.file(url2, destfile="states.csv", method="curl")
states <- read.csv2(url2,sep=",", stringsAsFactors = F)
states$Abbreviation <- str_to_lower(states$Abbreviation)
states$State <- str_to_lower(states$State)

A header for the storm dataset was generated from a trimmed version of the first line of the bzip2 file:

# read the header
h <- readLines("repdata-data-StormData.csv.bz2",n=1)
h <- strsplit(h, split=",")
# remove all the weird characters
h <- gsub("[[:punct:]]","",h[[1]])
# replace duplicate names with unique names
h[7]="STATECODE"
# add header to the stormdata table
names(stormdata) <- h

Finally, the data was transformed according to type. Because of the questions studied, only the columns pertaining to date and location (DATE, STATECODE) and the columns pertaining to event type and impact (EVTYPE, FATALITIES, INJURIES, PROPDMG, CROPDMG, PROPDMGEXP and CROPDMGEXP are of interest. These columns were converted from characters to factors, dates and numbers (as appropriate).

stormdata$STATECODE = as.factor(stormdata$STATECODE)
stormdata$BGNDATE = as.Date(stormdata$BGNDATE,"%m/%d/%Y")
stormdata$FATALITIES = as.integer(stormdata$FATALITIES)
stormdata$INJURIES = as.integer(stormdata$INJURIES)

The event type (EVTYPE) column contains 985 unique strings. Some of these are duplicates; for example, there are hundreds of different thunderstorm classifications, because for each thunderstorm, the wind velocity was recorded within the same column. Other categories exist more than once just due to spelling variations. I reduced the number of event types to around 40 by grouping duplicate entries under single names. All evtypes with fewer than 10 occurrences were grouped under “other”.

stormdata$EVTYPE[grepl("tornado|landspout|torndao|waterspout|water spout",stormdata$EVTYPE,ignore.case = TRUE)] <- "tornado"
stormdata$EVTYPE[grepl("hurricane|typhoon|cyclone",stormdata$EVTYPE,ignore.case = TRUE)] <- "hurricane"
stormdata$EVTYPE[grepl("thunderstorm|THUNDERSTROM|THUNDEERSTORM",stormdata$EVTYPE,ignore.case = TRUE)] <- "thunderstorm"
stormdata$EVTYPE[grepl("tropical storm",stormdata$EVTYPE,ignore.case = TRUE)] <- "tropical storm"
stormdata$EVTYPE[grepl("blizzard",stormdata$EVTYPE,ignore.case = TRUE)] <- "blizzard"
stormdata$EVTYPE[grepl("tsunami",stormdata$EVTYPE,ignore.case = TRUE)] <- "tsunami"
stormdata$EVTYPE[grepl("funnel",stormdata$EVTYPE,ignore.case = TRUE)] <- "funnel cloud"
stormdata$EVTYPE[grepl("earth quake",stormdata$EVTYPE,ignore.case = TRUE)] <- "earth quake"
stormdata$EVTYPE[grepl("volcan",stormdata$EVTYPE,ignore.case = TRUE)] <- "volcano"
stormdata$EVTYPE[grepl("landslide|mudslide|landslump|rock slide|mud/rock slide|beach erosion|beach erosin|mud slide|avalanche|avalance",stormdata$EVTYPE,ignore.case = TRUE)] <- "landslide"
stormdata$EVTYPE[grepl("flood",stormdata$EVTYPE,ignore.case = TRUE)] <- "flood"
stormdata$EVTYPE[grepl("fire",stormdata$EVTYPE,ignore.case = TRUE)] <- "fire"
stormdata$EVTYPE[grepl("lightning|LIGNTNING|LIGHTING",stormdata$EVTYPE,ignore.case = TRUE)] <- "lightning"
stormdata$EVTYPE[grepl("high wind|wind|wnd",stormdata$EVTYPE,ignore.case = TRUE)] <- "high wind"
stormdata$EVTYPE[grepl("wave",stormdata$EVTYPE,ignore.case = TRUE)] <- "wave"
stormdata$EVTYPE[grepl("snow|ice|icy",stormdata$EVTYPE,ignore.case = TRUE)] <- "snow | ice"
stormdata$EVTYPE[grepl("hail|sleet",stormdata$EVTYPE,ignore.case = TRUE)] <- "hail"
stormdata$EVTYPE[grepl("rain|precipitation|wet|shower|PRECIPATATION",stormdata$EVTYPE,ignore.case = TRUE)] <- "rain"
stormdata$EVTYPE[grepl("cold|chill|winter|cool|freeze|hypothermia|glaze|freezing|frost|wintry|low temperature|low temp",stormdata$EVTYPE,ignore.case = TRUE)] <- "cold"
stormdata$EVTYPE[grepl("heat|warm|hot|record temperature|hyperthermia|temperature record",stormdata$EVTYPE,ignore.case = TRUE)] <- "heat"
stormdata$EVTYPE[grepl("dry|drought",stormdata$EVTYPE,ignore.case = TRUE)] <- "drought"
stormdata$EVTYPE[grepl("summary",stormdata$EVTYPE,ignore.case = TRUE)] <- NA

# find evtypes with less than 10 occurrences and set their type to "other"
t <- tapply(stormdata$EVTYPE,stormdata$EVTYPE,function(x){length(x)})
for (i in seq_along(t)) {
        if (t[i]<10) {
                stormdata$EVTYPE[grepl(names(t[i]),stormdata$EVTYPE,fixed=TRUE)] <- "other"
        }
}
stormdata$EVTYPE <- str_to_lower(stormdata$EVTYPE)

The PROPDMG column (property damage estimates) and PROPDMGEXP column (a column of multipliers, e.g. “m” = millions) were used to create a single numeric column (“PROPDAMAGE”), and similarly, the CROPDMG (crop damage) and CROPDMGEXP columns were used to create a single numeric column (“CROPDAMAGE”):

# convert the DMG and EXP columns into numeric vectors
stormdata$PROPDMG = as.numeric(stormdata$PROPDMG)
stormdata$PROPDMGEXP[stormdata$PROPDMGEXP=="B"] <- 1000000000
        stormdata$PROPDMGEXP[stormdata$PROPDMGEXP=="M" | stormdata$PROPDMGEXP=="m"] <- 1000000
        stormdata$PROPDMGEXP[stormdata$PROPDMGEXP=="K" | stormdata$PROPDMGEXP=="k"] <- 1000
        stormdata$PROPDMGEXP[stormdata$PROPDMGEXP=="H" | stormdata$PROPDMGEXP=="h"] <- 100
        stormdata$PROPDMGEXP[stormdata$PROPDMGEXP=="0"] <- 1

stormdata$CROPDMG = as.numeric(stormdata$CROPDMG)
stormdata$CROPDMGEXP[stormdata$CROPDMGEXP=="B"] <- 1000000000
        stormdata$CROPDMGEXP[stormdata$CROPDMGEXP=="M" | stormdata$CROPDMGEXP=="m"] <- 1000000
        stormdata$CROPDMGEXP[stormdata$CROPDMGEXP=="K" | stormdata$CROPDMGEXP=="k"] <- 1000
        stormdata$CROPDMGEXP[stormdata$CROPDMGEXP=="H" | stormdata$CROPDMGEXP=="h"] <- 100
        stormdata$CROPDMGEXP[stormdata$CROPDMGEXP=="0"] <- 1

# calculate total property and crop damage in new columns
stormdata$PROPDMGEXP <- as.numeric(stormdata$PROPDMGEXP) # this will coerce unclear values such as -, ? or + to NA
stormdata$CROPDMGEXP <- as.numeric(stormdata$CROPDMGEXP) # this will coerce unclear values such as -, ? or + to NA
stormdata$PROPDAMAGE <- stormdata$PROPDMG * stormdata$PROPDMGEXP 
stormdata$CROPDAMAGE <- stormdata$CROPDMG * stormdata$CROPDMGEXP 

Note: This code generates “coerced to NA” warnings (printed to console), because the EXP columns contain strings such as “-”, “?” and “+” which have no numeric equivalent and therefore are coerced to NA. This is intentional, because this information is unusable.

Finally, a clean dataset (“stormdataclean”) was generated, containing only the columns of interest:

# only keep columns that are interesting for this analysis
stormdataclean <- stormdata[,c("STATECODE","BGNDATE","EVTYPE","FATALITIES","INJURIES","PROPDAMAGE","CROPDAMAGE")]

Results

1. Across the United States, which types of events are most harmful with respect to population health?

To address this question, I calculated the total sum of fatalities and injuries per event type:

humdamage <- tapply(stormdataclean$FATALITIES+stormdataclean$INJURIES,stormdataclean$EVTYPE,sum,na.rm=TRUE)

And then asked which three event types cause the most fatalities and injuries:

head(sort(humdamage,decreasing = TRUE),3)
##   tornado      heat high wind 
##     97100     11847     10274

So, tornados, heat and high wind cause the most human damage (fatalities and injuries) in the United States. This depends on geography. Tornados cause much more damage in the south-east than in the west of the United States. To show this, I plotted the total human damage caused by tornados per state. (Bear with me, I just wanted to make a cool plot.)

# first, calculate the total human damage caused by tornados per state
tornadoDamagePerState <- aggregate(x = stormdataclean$FATALITIES+stormdataclean$INJURIES,by = list("tornado" = stormdataclean$EVTYPE=="tornado","state" = stormdataclean$STATECODE),sum,na.rm=TRUE)

# keep only tornado-related data
tornadoDamagePerState = data.frame(region=tolower(tornadoDamagePerState$state[tornadoDamagePerState$tornado]), 
    humanDamage=tornadoDamagePerState$x[tornadoDamagePerState$tornado], 
    stringsAsFactors=F)

dm <- merge(x=tornadoDamagePerState,y=states,by.x="region",by.y="Abbreviation")

states_map <- map_data("state")
ggplot(dm, aes(map_id = State)) + 
        geom_map(aes(fill = humanDamage), map = states_map) +
        scale_fill_gradientn(colours=c("white","red")) + 
        expand_limits(x = states_map$long, y = states_map$lat) +
        ggtitle("Human damage caused by tornados")

By contrast, the damage caused by high winds is slightly more equally distributed, although the same trend can be seen (south-eastern states still get slightly more damage).

# first, calculate the total human damage caused by tornados per state
windDamagePerState <- aggregate(x = stormdataclean$FATALITIES+stormdataclean$INJURIES,by = list("wind" = stormdataclean$EVTYPE=="high wind","state" = stormdataclean$STATECODE),sum,na.rm=TRUE)

# keep only tornado-related data
windDamagePerState = data.frame(region=tolower(windDamagePerState$state[windDamagePerState$wind]), 
    humanDamage=windDamagePerState$x[windDamagePerState$wind], 
    stringsAsFactors=F)

dm <- merge(x=windDamagePerState,y=states,by.x="region",by.y="Abbreviation")

states_map <- map_data("state")
ggplot(dm, aes(map_id = State)) + 
        geom_map(aes(fill = humanDamage), map = states_map) +
        scale_fill_gradientn(colours=c("white","blue")) + 
        expand_limits(x = states_map$long, y = states_map$lat) +
        ggtitle("Human damage caused by high winds")

2. Across the United States, which types of events have the greatest economic consequences?

To address this question, I calculated the total sum of property and crop damage per event type:

econdamage <- tapply(stormdataclean$PROPDAMAGE+stormdataclean$CROPDAMAGE,stormdataclean$EVTYPE,sum,na.rm=TRUE)
head(sort(econdamage,decreasing = TRUE),3)
##        flood    hurricane      tornado 
## 157764000819  44330000800  18178050063

This shows that (various types of) floods, hurricanes and tornados (in that order) caused the most economic damage in the United States. Here is a panel plot illustrating human and economic damage:

par(mfrow=c(1,2), mar=c(5, 4, 4, 1))
barplot(sort(humdamage,decreasing = TRUE)[1:3], main="Total human damages \nby type of natural disaster", ylab="Number of fatalities + injuries")
barplot(sort(econdamage,decreasing = TRUE)[1:3], main="Total economic damages \nby type of natural disaster", ylab="USD")