Synopsis

The NOAA records data on U.S. weather. Storms and other extreme weather events are reported along with their associated effects on human life, injury, property, and crops. In this report, we explore the data from 1950–2011 and determine those weather events that have caused the most harm in terms of human and economic costs.

Data Processing

We load the dplyr package for data manipulation, the lubridate package for working with dates, and the ggplot2 package for graphing. We will also load Winston Chang’s helpful multiplot function:

library("dplyr")
library("lubridate")
library("ggplot2")
# Multiple plot function
#
# ggplot objects can be passed in ..., or to plotlist (as a list of ggplot objects)
# - cols:   Number of columns in layout
# - layout: A matrix specifying the layout. If present, 'cols' is ignored.
#
# If the layout is something like matrix(c(1,2,3,3), nrow=2, byrow=TRUE),
# then plot 1 will go in the upper left, 2 will go in the upper right, and
# 3 will go all the way across the bottom.
#
multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
  require(grid)

  # Make a list from the ... arguments and plotlist
  plots <- c(list(...), plotlist)

  numPlots = length(plots)

  # If layout is NULL, then use 'cols' to determine layout
  if (is.null(layout)) {
    # Make the panel
    # ncol: Number of columns of plots
    # nrow: Number of rows needed, calculated from # of cols
    layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
                    ncol = cols, nrow = ceiling(numPlots/cols))
  }

 if (numPlots==1) {
    print(plots[[1]])

  } else {
    # Set up the page
    grid.newpage()
    pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))

    # Make each plot, in the correct location
    for (i in 1:numPlots) {
      # Get the i,j matrix positions of the regions that contain this subplot
      matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))

      print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                      layout.pos.col = matchidx$col))
    }
  }
}

Although the data comes in a compressed format, read.csv has no trouble parsing it.

storm_data_raw <- read.csv("repdata-data-StormData.csv.bz2",
                           stringsAsFactors = FALSE)

We will select only the columns that pertain to our questions:

storm_data <- storm_data_raw %>%
    select(EVTYPE, BGN_DATE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP,
           CROPDMG, CROPDMGEXP, REMARKS)

First, we’ll deal with the BGN_DATE variable. As we’re not concerned about the time of day (and it’s not even recorded in many cases), we will extract only the date.

date <- storm_data$BGN_DATE %>%
    strsplit(" ") %>%
    unlist %>%
    matrix(nrow = length(storm_data$BGN_DATE), ncol = 2, byrow = TRUE)
date <- as.Date(date[,1], format = "%m/%d/%Y")
storm_data <- storm_data %>%
    mutate(date = date)
rm(date)

Next, we’ll calculate the property and crop damage in dollars. In the instructions for storm data preparation (located here), we read,

Estimates should be rounded to three significant digits, followed by an alphabetical character signifying the magnitude of the number, i.e., 1.55B for $1,550,000,000. Alphabetical characters used to signify magnitude include “K” for thousands, “M” for millions, and “B” for billions.

Therefore, to get accurate damage estimates, we need to multiply our damages by the appropriate factor.

The bad news is that there is some messiness in the variables PROPDMGEXP and CROPDMGEXP.

table(storm_data$PROPDMGEXP)
## 
##             -      ?      +      0      1      2      3      4      5 
## 465934      1      8      5    216     25     13      4      4     28 
##      6      7      8      B      h      H      K      m      M 
##      4      5      1     40      1      6 424665      7  11330
table(storm_data$CROPDMGEXP)
## 
##             ?      0      2      B      k      K      m      M 
## 618413      7     19      1      9     21 281832      1   1994

The good news is that there aren’t many too many cases that have strange entries in this field. One possibiility is that the numerical digits may have been recorded by people assuming that this field was for the power of ten required for scientific notation (i.e., \(3000 = 3.00 \times 10^{3}\)). This would be especially problematic for our analysis for large powers of 10, indicating dollar figures possibly in the millions or more. We examine these specific entries for clues:

storm_data %>%
    filter(PROPDMGEXP %in% 5:8) %>%
    select(EVTYPE, PROPDMG, PROPDMGEXP) %>%
    arrange(PROPDMGEXP)
##                         EVTYPE PROPDMG PROPDMGEXP
## 1  URBAN/SMALL STREAM FLOODING     0.0          5
## 2                   HEAVY SNOW     1.7          5
## 3                         HAIL    30.0          5
## 4           THUNDERSTORM WINDS     0.1          5
## 5           THUNDERSTORM WINDS     0.0          5
## 6           THUNDERSTORM WINDS    10.0          5
## 7                      TORNADO    14.0          5
## 8           THUNDERSTORM WINDS    12.0          5
## 9                      TORNADO    88.0          5
## 10          THUNDERSTORM WINDS    12.0          5
## 11                   LIGHTNING     3.0          5
## 12                   LIGHTNING     0.2          5
## 13                 FLASH FLOOD    12.0          5
## 14          THUNDERSTORM WINDS     1.0          5
## 15                   LIGHTNING     1.0          5
## 16                    FLOODING     5.2          5
## 17                        HAIL     0.0          5
## 18                        HAIL     0.0          5
## 19                        HAIL     0.0          5
## 20                        HAIL     0.0          5
## 21                        HAIL     0.0          5
## 22                        HAIL     0.0          5
## 23                        HAIL     0.0          5
## 24          THUNDERSTORM WINDS     0.0          5
## 25                     TORNADO     0.2          5
## 26                    FLOODING     0.7          5
## 27                   LIGHTNING    13.0          5
## 28                 FLASH FLOOD     6.4          5
## 29          THUNDERSTORM WINDS    24.0          6
## 30          THUNDERSTORM WINDS    26.0          6
## 31          THUNDERSTORM WINDS    15.0          6
## 32                     TORNADO     0.0          6
## 33          THUNDERSTORM WINDS    14.0          7
## 34                 FLASH FLOOD    68.0          7
## 35                     TORNADO     0.0          7
## 36                        HAIL     0.0          7
## 37                   LIGHTNING     0.0          7
## 38                        HAIL     0.0          8

To begin with, the multipliers will make no difference to a bunch of entries where the damage in PROPDMG is recorded as 0.0, even though it is clear in most of the accompanying remarks that there was, indeed, damage. (I have not printed the remarks here because they are not fomatted very nicely.) But even for those cases with nonzero figures, the remarks are not easy to reconcile to the figures. For example, the entry labeled ## 2 in the output above mentions several destroyed buildings including a home valued at $250,000 to $300,000. However, the propery damage is recorded as PROPDMG = 1.7 and PROPDMGEXP = 5. Note that \(1.7 \times 10^5\) is 170,000, nowhere near high enough. Similar problems exist in trying to reconcile most of the other damage figures above with remarks that seem off by orders of magnitude.

Given the relatively small number of cases involved here, it seems best to ignore the mysterious entries and focus on the cases with “k”, “K”, “m”, “M”, and “B”. (Anything else will get assigned to 1 so that the multiplier causes no change in the value.) We will do this with PROPDMG and CROPDMG.

PROPDMGEXP_value <- sapply(storm_data$PROPDMGEXP,
    function(x) {switch(x,  "k" = 1000, "K" = 1000,
                            "m" = 1000000, "M" = 1000000, 
                            "B" = 1000000000, 1)})

CROPDMGEXP_value <- sapply(storm_data$CROPDMGEXP,
    function(x) {switch(x,  "k" = 1000, "K" = 1000,
                            "m" = 1000000, "M" = 1000000, 
                            "B" = 1000000000, 1)})
storm_data <- storm_data %>%
    mutate(PROPDMG_value = PROPDMGEXP_value,
           CROPDMG_value = CROPDMGEXP_value,
           PROPDMG_actual = PROPDMG * PROPDMG_value,
           CROPDMG_actual = CROPDMG * CROPDMG_value)
rm(PROPDMGEXP_value)
rm(CROPDMGEXP_value)

Results

1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

First, we get totals for all the quantities of interest.

storm_impact <- storm_data %>%
    group_by(EVTYPE) %>%
    summarize(Fatalities = sum(FATALITIES),
              Injuries = sum(INJURIES),
              Property_damage = sum(PROPDMG_actual),
              Crop_damage = sum(CROPDMG_actual))

Simple tables suffice here. We look at the top ten events that have caused the most fatalities and injuries respectively.

storm_impact %>%
    select(EVTYPE, Fatalities) %>%
    arrange(desc(Fatalities)) %>%
    head(10) %>%
    format(big.mark = ",") %>%
    knitr::kable("markdown", align = c('l', 'r'))
EVTYPE Fatalities
TORNADO 5,633
EXCESSIVE HEAT 1,903
FLASH FLOOD 978
HEAT 937
LIGHTNING 816
TSTM WIND 504
FLOOD 470
RIP CURRENT 368
HIGH WIND 248
AVALANCHE 224
storm_impact %>%
    select(EVTYPE, Injuries) %>%
    arrange(desc(Injuries)) %>%
    head(10) %>%
    format(big.mark = ",") %>%
    knitr::kable("markdown", align = c('l', 'r'))
EVTYPE Injuries
TORNADO 91,346
TSTM WIND 6,957
FLOOD 6,789
EXCESSIVE HEAT 6,525
LIGHTNING 5,230
HEAT 2,100
ICE STORM 1,975
FLASH FLOOD 1,777
THUNDERSTORM WIND 1,488
HAIL 1,361

We can see that there is some overlap in the categories. For example, one could argue that “HEAT” and “EXCESSIVE HEAT” could be counted as one category for purposes of assessing impact on human suffering. “THUNDERSTORM WIND” and “TSTM WIND” are exactly the same thing. (Some data recorders abbreviated and others did not.) Nevertheless, even when combining like categories, it is clear that tornadoes are by far the most destructive natural force when it comes to safety and human life.

tornadoes <- storm_data %>%
    filter(EVTYPE == "TORNADO") %>%
    group_by(Year = year(date)) %>%
    summarize(Fatalities = sum(FATALITIES), Injuries = sum(INJURIES))
g1 <- ggplot(tornadoes, aes(x = Year, y = Fatalities)) +
    ggtitle("Tornadoes in the U.S. (1950-2011)") +
    xlab(NULL) +
    scale_x_continuous(breaks=seq(1950,2011,10)) +
    geom_line()
g2 <- ggplot(tornadoes, aes(x = Year, y = Injuries)) +
    scale_x_continuous(breaks=seq(1950,2011,10)) +
    geom_line()
multiplot(g1, g2)

plot of chunk unnamed-chunk-10

The figure shows two interesting things. One is that there are specific years in which tornadoes are particularly devastating. Note that 2011 was one of the worst years on record. (See this Wikipedia page.) The other observation is that there is a clear change in the magnitude of the effect in the more severe years before and after the 1970s. This is due to the widespread adoption around 1970 of using air raid sirens as tornado warnings. (See the paper by Coleman et al.)

2. Across the United States, which types of events have the greatest economic consequences?

We look at similar tables for property damage and crop damage (in dollars) to assess economic impact.

options(scipen = 10)
storm_impact %>%
    select(EVTYPE, Property_damage) %>%
    arrange(desc(Property_damage)) %>%
    head(10) %>%
    format(big.mark = ",") %>%
    knitr::kable("markdown", align = c('l', 'r'))
EVTYPE Property_damage
FLOOD 144,657,709,807
HURRICANE/TYPHOON 69,305,840,000
TORNADO 56,937,160,779
STORM SURGE 43,323,536,000
FLASH FLOOD 16,140,812,067
HAIL 15,732,267,048
HURRICANE 11,868,319,010
TROPICAL STORM 7,703,890,550
WINTER STORM 6,688,497,251
HIGH WIND 5,270,046,295
storm_impact %>%
    select(EVTYPE, Crop_damage) %>%
    arrange(desc(Crop_damage)) %>%
    head(10) %>%
    format(big.mark = ",") %>%
    knitr::kable("markdown", align = c('l', 'r'))
EVTYPE Crop_damage
DROUGHT 13,972,566,000
FLOOD 5,661,968,450
RIVER FLOOD 5,029,459,000
ICE STORM 5,022,113,500
HAIL 3,025,954,473
HURRICANE 2,741,910,000
HURRICANE/TYPHOON 2,607,872,800
FLASH FLOOD 1,421,317,100
EXTREME COLD 1,292,973,000
FROST/FREEZE 1,094,086,000

There is a difference in the weather event taking the top spot for property damage versus crop damage. Floods cause more than double the amount of property damage than the next highest event (hurricane/typhoon). Flooding also causes a lot of crop damage, but drought is the most deleterious.