Synopsis

This report explores storm data in order to analyze the impacts of different event types in the US on both human population health and economic health. The data are from the NOAA storm database as described here: https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf Specifically, the report aims to answer the following questions:

  1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
  2. Across the United States, which types of events have the greatest economic consequences?

The analysis excludes territories of the US, e.g. Puerto Rico.

We found that the type of event with the highest number of total fatalities and injuries is a tornado, but wild fires, while less common, on average result in the greatest number of deaths and injuries. With respect to economic cost we found that floods have had the greatest impact in absolute terms, while hurricanes have the highest average impact on property damage and crop damage.

Data Processing

We first read in the data from the raw csv file from the NOAA.

filename <- "repdata%2Fdata%2FStormData.csv.bz2"
fullpath <- paste0("data/", filename)
if (!file.exists(fullpath)) {
  dir.create("data")
  fileUrl <- paste0("https://d396qusza40orc.cloudfront.net/", filename)
  download.file(fileUrl, destfile = fullpath, method = "curl")
}
storm_data <- read.csv(fullpath, na.strings=c("NA", ""), strip.white = TRUE, stringsAsFactors = FALSE)
num_rows <- nrow(storm_data)

A lot of processing then had to be done on the data. Looking at the EVTYPE variable, for example:

eventTypeFactor <- as.factor(storm_data$EVTYPE)
length(levels(eventTypeFactor))
## [1] 985

This is a huge number of event types compared to the 48 types listed in the National Weather Service documentation for the data. Looking at what these types actually were showed a lot of useless types that told us nothing about the actual type of event, e.g. “Summary of June 16”. An initial clean-up was done to remove these, and also to reduce the number of states being analyzed to just the 50 states plus DC.

cleanData <- function(data) {
  filtered_states <- rbind(filter(data, STATE %in% state.abb),
                           filter(data, STATE == "DC"))
  filter(filtered_states, !grepl(".*summary.*", EVTYPE, 
                          ignore.case = TRUE))
}
storm_data <- cleanData(storm_data)

This still left us with a very messy EVTYPE variable. However, the questions this report addresses are about events that have human or economic impact. It was therefore deemed appropriate to filter the data to only those events which had such impacts.

For this we looked at the FATALITIES, INJURIES, PROPDMG and CROPDMG variables. The latter two however needed to be interpreted and some assumptions had to be made here. Each of them had a corresponding “EXP” column whose values could be numeric, or a character like “m”, “b”, “k”, or “h” (or the uppercase equivalent), or something like a question mark. We made the assumption that in the case of a numeric entry it should be taken as an exponent of 10, whereas the single characters referred to “million”, “billion”, etc. We wrote a function to convert them to multipliers for the PROPDMG and CROPDMG variables. Where the character could not be interpreted in this way, e.g. the “?” character, a mulitplier of 1 was used.

getMultiplier <- function(ex) {
  # Default multiplier is 1
  multiplier <- 1
  asnum <- suppressWarnings(as.numeric(ex))
  if (!is.na(asnum)) {
    multiplier <- 10^asnum
  }
  else {
    if (ex %in% c("h", "H")) {
      multiplier = 100
    }
    else if (ex %in% c("k", "K")) {
      multiplier = 1000
    }
    else if (ex %in% c("m", "M")) {
      multiplier = 10^6
    }
    else if (ex %in% c("b", "B")) {
      multiplier = 10^9
    }
  }
  multiplier
}

We could now use this to create proper numeric columns for property damage and crop damage.

mutated <- mutate(storm_data, property_damage = PROPDMG * sapply(PROPDMGEXP, getMultiplier), crop_damage = CROPDMG * sapply(CROPDMGEXP, getMultiplier))

Now we could filter the data to only those rows whether there was either human impact, as judged by number of fatalities or injuries, or economic impact, as judged by property damage or crop damage.

filtered <- filter(mutated, FATALITIES > 0 | INJURIES > 0 | property_damage > 0 | crop_damage > 0)
num_rows_filtered <- nrow(filtered)

With this filtering we have reduced the number of records from 902297 to 253352. This also had a huge impact on the number of different event type values in the data we had to deal with.

eventTypeFactor <- as.factor(filtered$EVTYPE)
length(levels(eventTypeFactor))
## [1] 471

However there was still massive cleanup to do on these event types and we had to create a mapping function to attempt to map the given values to values in the list of 48.

For the sake of space, the mapping function is now shown here. Please see the appendix. We converted the event types to proper case before passing them to the mapper.

evtypesProper <- capwords(filtered$EVTYPE, strict = TRUE)
mapped <- mapEventTypes(evtypesProper)
withEventTypeFactor <- mutate(filtered, eventTypeFactor=as.factor(mapped))
numEventTypes <- length(levels(withEventTypeFactor$eventTypeFactor))

This gets us down to 185 event types. Clearly more work could be done here but for the purposes of this analysis this was deemed a sufficient cleaup.

Exploratory Analysis

We could now start doing some analysis of the data.

summarized <- ddply(withEventTypeFactor, .(eventTypeFactor), summarise, 
                    totalFatal = sum(FATALITIES), 
                    totalInjured = sum(INJURIES), 
                    totalAll = sum(FATALITIES) + sum(INJURIES),
                    meanFatal = mean(FATALITIES), 
                    meanInjured = mean(INJURIES),
                    meanAll = mean(FATALITIES + INJURIES)) %>% 
  arrange(desc(totalAll), desc(totalFatal), desc(totalInjured))
top_ten <- summarized[1:10,]
top_ten
##      eventTypeFactor totalFatal totalInjured totalAll   meanFatal
## 1            Tornado       5636        91407    97043 0.141044571
## 2               Heat       3132         9209    12341 3.232198142
## 3  Thunderstorm Wind        705         9404    10109 0.005921285
## 4              Flood       1475         8593    10068 0.046034768
## 5          Lightning        807         5226     6033 0.060841375
## 6          Ice Storm         89         1990     2079 0.124824684
## 7          High Wind        296         1518     1814 0.047849984
## 8       Winter Storm        216         1338     1554 0.143141153
## 9               Hail         45         1465     1510 0.001688429
## 10        Heavy Snow        129         1034     1163 0.092539455
##    meanInjured     meanAll
## 1   2.28751971  2.42856428
## 2   9.50361197 12.73581011
## 3   0.07898406  0.08490534
## 4   0.26818763  0.31422240
## 5   0.39399879  0.45484017
## 6   2.79102384  2.91584853
## 7   0.24539282  0.29324281
## 8   0.88667992  1.02982107
## 9   0.05496773  0.05665616
## 10  0.74175036  0.83428981

The following figure plots the top ten event types affecting human population health as measured in thousands of deaths and injuries

barplot(top_ten$totalAll / 1000, names.arg = top_ten$eventTypeFactor, main="Deaths and Injuries (in 1000s) by event type", las=2, ylim=c(0, 100), col=top_ten$eventTypeFactor)

Next we’ll look at the economic impacts of the events using property damage and crop damage.

summarized_economic <- ddply(withEventTypeFactor, .(eventTypeFactor), summarise, 
                             totalPropertyDamage = sum(property_damage)/(10^6), 
                             totalCropDamage = sum(crop_damage)/(10^6), 
                             totalAll = (sum(property_damage) + sum(crop_damage))/(10^6),
                             meanPropertyDamage = mean(property_damage), 
                             meanCropDamage = mean(crop_damage), 
                             meanAll = mean(property_damage + crop_damage)) %>% 
  arrange(desc(totalAll), desc(totalPropertyDamage), desc(totalCropDamage))
top_ten_economic <- summarized_economic[1:10,]
nrow(summarized_economic)
## [1] 185
barplot(top_ten_economic$totalAll, names.arg = top_ten_economic$eventTypeFactor, main="Cost of property and crop damage (in millions of dollars) by event type", las=2, col=top_ten_economic$eventTypeFactor)

Results

Looking just at totals, we learned that tornados have had the biggest impact on human popluation health and floods have had the biggest economic impact.

It is also useful to look at the average impact of each type of event and see which type of event has the highest impact on average.

summarized[which.max(summarized$meanAll),]
##    eventTypeFactor totalFatal totalInjured totalAll meanFatal meanInjured
## 28      Wild Fires          3          150      153      0.75        37.5
##    meanAll
## 28   38.25

Here we see that although wild fires do not account for a large total number of fatalities and injuries, the average wild fire results in a higher number of injuries and deaths than other types of events.

summarized_economic[which.max(summarized_economic$meanAll),]
##       eventTypeFactor totalPropertyDamage totalCropDamage totalAll
## 2 Hurricane (Typhoon)            82429.12        4948.941 87378.06
##   meanPropertyDamage meanCropDamage   meanAll
## 2          436132905       26184872 462317777

Hurricanes on average result in the highest economic cost.

Appendix

The functions for mapping the event types to known types and for proper casing them.

mapEventTypes <- function(data) {
  pattern <- c(
    "Astronomical Low Tide.*",
    ".*Avalanche.*",
    ".*Blizzard.*",
    ".*Coastal Flood.*",
    "(^Extreme).*Chill.*",
    ".*landslide.*",
    ".*Dense Fog.*",
    ".*Dense Smoke.*",
    ".*Drought.*",
    ".*Dust Devil.*",
    ".*Dust Storm.*",
    ".*Excessive Heat.*",
    "(.*Extreme.*Cold.*|Hyp(o|er)thermia.*)",
    ".*Flash Flood.*",
    ".*Flood.*",
    ".*(Frost|Freeze).*",
    ".*Funnel Cloud.*",
    ".*Freezing Fog.*",
    ".*Hail.*",
    ".*Heat.*",
    ".*Heavy Rain.*",
    ".*Heavy Snow.*",
    ".*High Surf.*",
    ".*High Wind.*",
    ".*(Hurricane|Typhoon).*",
    ".*Ice Storm.*",
    ".*Lake-Effect Snow.*",
    ".*Lakeshore Flood.*",
    ".*Lightning.*",
    ".*Marine Hail.*",
    ".*Marine High Wind.*",
    ".*Marine Strong Wind.*",
    ".*Marine Thunderstorm Wind.*",
    ".*Rip Current.*",
    ".*Seiche.*",
    ".*Sleet.*",
    ".*Storm (Surge|Tide).*",
    ".*Strong Wind.*",
    ".*(thunderstorm|tstm).*",
    ".*Tornado.*",
    ".*Tropical Depression.*",
    ".*Tropical Storm.*",
    ".*Tsunami.*",
    ".*Volcanic Ash.*",
    ".*Waterspout.*",
    ".*Wildfire.*",
    ".*Winter Storm.*",
    ".*Winter Weather.*"
    )
  replace <- c(
    "Astronomical Low Tide",
    "Avalanche",
    "Blizzard",
    "Coastal Flood",
    "Cold/Wind Chill",
    "Debris Flow",
    "Dense Fog",
    "Dense Smoke",
    "Drought",
    "Dust Devil",
    "Dust Storm",
    "Excessive Heat",
    "Extreme Cold/Wind Chill",
    "Flash Flood",
    "Flood",
    "Frost/Freeze",
    "Funnel Cloud",
    "Freezing Fog",
    "Hail",
    "Heat",
    "Heavy Rain",
    "Heavy Snow",
    "High Surf",
    "High Wind",
    "Hurricane (Typhoon)",
    "Ice Storm",
    "Lake-Effect Snow",
    "Lakeshore Flood",
    "Lightning",
    "Marine Hail",
    "Marine High Wind",
    "Marine Strong Wind",
    "Marine Thunderstorm Wind",
    "Rip Current",
    "Seiche",
    "Sleet",
    "Storm Surge/Tide",
    "Strong Wind",
    "Thunderstorm Wind",
    "Tornado",
    "Tropical Depression",
    "Tropical Storm",
    "Tsunami",
    "Volcanic Ash",
    "Waterspout",
    "Wildfire",
    "Winter Storm",
    "Winter Weather"
    )

  qdap::mgsub(pattern, replace, data, fixed = FALSE, ignore.case = TRUE)
}

# Taken from the help page for tolower, this function
# converts strings to proper case.
capwords <- function(s, strict = FALSE) {
  cap <- function(s) paste(toupper(substring(s, 1, 1)),
{s <- substring(s, 2); if(strict) tolower(s) else s},
sep = "", collapse = " " )
sapply(strsplit(s, split = " "), cap, USE.NAMES = !is.null(names(s)))
}