Synopsis

This is an assignment for the Johns Hopkins University “Reproducible Research” course provided by Coursera.

Main questions:

  1. Which types of events have the greatest economic consequences?
  2. Which types of events are most harmful with respect to population health?

To answer the first question, this report examines financial damage to property and loss of crops. To answer the second question, this report focuses on injuries and fatalities.

Data Processing

Read in the data and select the columns we’ll be using for our analysis:

data = read.table('stormdata.csv.bz2', sep = ",", quote = "\"", header = TRUE)
data <- select(data, BGN_DATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP,
               CROPDMG, CROPDMGEXP)

Since the data prior to 1996 is incomplete, we can discard it.

data$date <- as.Date(data$BGN_DATE, '%m/%d/%Y')
data <- filter(data, year(date) >= 1996)

The event type (EVTYPE) column is quite messy. It’s supposed to have 48 unique event types, but it has many more (nearly 1000). We need to clean it up as follows:

Let’s start by looking at the 48 “official” event types:

official_event_types = c("astronomical low tide", "avalanche", "blizzard", "coastal flood", "cold/wind chill", "debris flow", "dense fog", "dense smoke", "drought", "dust devil", "dust storm", "excessive heat", "extreme cold/wind chill", "flash flood", "flood", "freezing fog", "frost/freeze", "funnel cloud", "hail", "heat", "heavy rain", "heavy snow", "high surf", "high wind", "hurricane/typhoon", "ice storm", "lakeshore flood", "lake-effect snow", "lightning", "marine hail", "marine high wind", "marine strong wind", "marine thunderstorm wind", "rip current", "seiche", "sleet", "storm surge/tide", "strong wind", "thunderstorm wind", "tornado", "tropical depression", "tropical storm", "tsunami", "volcanic ash", "waterspout", "wildfire", "winter storm", "winter weather")
length(official_event_types)
## [1] 48
table(casefold(data$EVTYPE) %in% official_event_types)
## 
##  FALSE   TRUE 
## 148815 504715

A significant part dataset doesn’t match to an official event type, even with case folding. Let’s see what the most frequent offenders are.

sort(table(as.character(data$EVTYPE[! casefold(data$EVTYPE) %in% official_event_types])), decreasing = TRUE)
## 
##              TSTM WIND       MARINE TSTM WIND   URBAN/SML STREAM FLD 
##                 128662                   6175                   3392 
##       WILD/FOREST FIRE     WINTER WEATHER/MIX         TSTM WIND/HAIL 
##                   1443                   1104                   1028 
##           EXTREME COLD              LANDSLIDE                    FOG 
##                    615                    588                    532 
##                   SNOW                   WIND           RIP CURRENTS 
##                    395                    320                    302 
##            STORM SURGE   HEAVY SURF/HIGH SURF      EXTREME WINDCHILL 
##                    253                    228                    204 
##           STRONG WINDS          FREEZING RAIN         DRY MICROBURST 
##                    184                    176                    173 
##              HURRICANE             LIGHT SNOW          RECORD WARMTH 
##                    170                    152                    135 
##      UNSEASONABLY WARM       COASTAL FLOODING ASTRONOMICAL HIGH TIDE 
##                    113                    107                    103 
##            RIVER FLOOD 
##                    102 
##  [ reached getOption("max.print") -- omitted 430 entries ]

Let’s do some mapping and scrubbing:

hashmap <- new.env(hash <- TRUE)
hashmap[["tstm wind"]] <- "thunderstorm wind"
hashmap[["thunderstorm winds"]] <- "thunderstorm wind"
hashmap[["marine tstm wind"]] <- "marine thunderstorm wind"
hashmap[["urban/sml stream fld"]] <- "flood"
hashmap[["high winds"]] <- "high wind"
hashmap[["wild/forest fire"]] <- "wildfire"
hashmap[["winter weather/mix"]] <- "winter weather"
hashmap[["tstm wind/hail"]] <- "thunderstorm wind"
hashmap[["flash flooding"]] <- "flash flood"
hashmap[["extreme cold"]] <- "extreme cold/wind chill"
hashmap[["flood/flash flood"]] <- "flood"
hashmap[["landslide"]] <- "debris flow"
hashmap[["snow"]] <- "heavy snow"
hashmap[["fog"]] <- "dense fog"
hashmap[["wind"]] <- "strong wind"
hashmap[["rip currents"]] <- "rip current"
hashmap[["storm surge"]] <- "storm surge/tide"
hashmap[["freezing rain"]] <- "winter storm"
hashmap[["urban flood"]] <- "flood"
hashmap[["heavy surf/high surf"]] <- "high surf"
hashmap[["extreme windchill"]] <- "extreme cold/wind chill"
hashmap[["strong winds"]] <- "strong wind"
hashmap[["hurricane"]] <- "hurricane/typhoon"
hashmap[["river flood"]] <- "flood"
hashmap[["light snow"]] <- "winter weather"
hashmap[["record warmth"]] <- "excessive heat"
hashmap[["coastal flooding"]] <- "flood"
hashmap[["flooding"]] <- "flood"
hashmap[["unseasonably warm"]] <- "excessive heat"
hashmap[["astronomical high tide"]] <- "high surf"
hashmap[["moderate snowfall"]] <- "winter weather"

scrub_event_type <- function(evtype) {
    folded = casefold(evtype)
    scrubbed <- hashmap[[folded]]
    if (startsWith(folded, 'hurricane')) {
        'HURRICANE'
    } else if (folded %in% official_event_types) {
        toupper(folded)
    } else if (is.null(scrubbed)) {
        'OTHER'
    } else {
        toupper(scrubbed)
    }
}

scrub_event_type_wrapper <- function(evtype_vector) {
    as.factor(sapply(evtype_vector, scrub_event_type))
}

data <- mutate(data, evtype=scrub_event_type_wrapper(EVTYPE))
table(data$evtype %in% casefold(official_event_types, upper = TRUE))
## 
##  FALSE   TRUE 
##   2353 651177

The vast majority of our dataset now contains valid event type labels.

Next, we need to scrub economic and crop damages. They are expressed using two columns: a value and a multiplier (“exponent”):

sort(table(data$PROPDMGEXP), decreasing = TRUE)
## 
##      K             M      B      0      -      ?      +      1      2 
## 369938 276185   7374     32      1      0      0      0      0      0 
##      3      4      5      6      7      8      h      H      m 
##      0      0      0      0      0      0      0      0      0
sort(table(data$CROPDMGEXP), decreasing = TRUE)
## 
##             K      M      B      ?      0      2      k      m 
## 373069 278686   1771      4      0      0      0      0      0

From the above, we can see that the main multipliers are B, M, and K. The rest occur relatively rarely, so ignoring them is OK.

get_multiplier <- function(exp) {
    exp <- tolower(exp)
    if (exp == 'b') {
        1e9
    } else if (exp == 'm') {
        1e6
    } else if (exp == 'k') {
        1e3
    } else {
        0
    }
}

get_multiplier_wrapper <- function(exp_vector) {
    sapply(exp_vector, get_multiplier)
}

data <- mutate(data, crop_mult = get_multiplier_wrapper(CROPDMGEXP),
               prop_mult = get_multiplier_wrapper(PROPDMGEXP))
data <- mutate(data, damage = (CROPDMG * crop_mult + PROPDMG * prop_mult))

Results

Plot the major causes of economic damage:

totals_by_event <- data %>% group_by(evtype) %>% summarize(damage = sum(damage, na.RM = TRUE))
major_causes <- totals_by_event %>% filter(damage > 1e9)
ggplot(data = major_causes, mapping = aes(y = damage, x = reorder(evtype, damage))) +
    geom_bar(stat='identity') + coord_flip()

Floods cause the most economic damage overall, followed by hurricanes and storms.

Plot the major causes of fatalities:

fatalities <- data %>% group_by(evtype) %>% summarize(total = sum(FATALITIES))
fatalities %>% filter(total > 100) %>% 
    ggplot(mapping = aes(y = total, x = reorder(evtype, total))) + 
    geom_bar(stat = 'identity', position = 'stack', fill = 'red') + coord_flip()

The leading cause of fatalities is excessive heat, closely followed by tornadoes and flash floods.

Next, plot the major causes of injuries:

injuries <- data %>% group_by(evtype) %>% summarize(total = sum(INJURIES))
injuries %>% filter(total > 1000) %>% 
    ggplot(mapping = aes(y = total, x = reorder(evtype, total))) + 
    geom_bar(stat = 'identity', position = 'stack', fill = 'orange') + coord_flip()