Identifying NOAA storm database events of greatest risk

Terms, acronyms and references:

Introduction

In this project, I explore the NOAA's storm database, which contains records of major storms and other weather events along with associated fatalities, injuries and damage estimates to crops and property. From an exploration of the severity of individual events I develop a measure of risk associated with each event type, which I feel would be of greater significance for decisions involving distribution of resources to preventing or reducing the harm resulting from weather events.

Data processing

It is assumed that the storm database has been downloaded into the R working directory. For this project I only read in the variables:

For the variables PROPDMG and CROPDMG I'm not interested in the units, only in that the data may - like number of fatalities or injuries - be interpreted as a measure of severity of the associated event.

setwd('/Users/rambler/Documents/Coursera/Data Science/05_ReproducibleResearch/Project2')
data <- subset(
  read.csv('repdata-data-StormData.csv.bz2'), 
  select = c(EVTYPE, FATALITIES, INJURIES, PROPDMG, CROPDMG),
)

# remove "summary" events
# These appear to be summaries of event data already accounted for.
data$EVTYPE <- tolower(data$EVTYPE)
data <- subset(
  data,
  subset = !grepl("summary", data$EVTYPE)
)

There are several issues with the variable EVTYPE. To illustrate a few, I've written a function that will display the head of means of a selected variable grouped by the raw category in EVTYPE arranged in descending order.

disp.head.by.mean.cat <- function(variable) {
  mfbt <- data.frame()
  invisible(
    lapply(
      split(
        subset(data, select = c('EVTYPE', variable)), 
        data$EVTYPE
      ), 
      function(x) mfbt <<- rbind(
        mfbt, 
        data.frame(
          type = x$EVTYPE[1], 
          mean.var = mean(x[[variable]]),
          instances = nrow(x)
        )
      )
    )
  )

  head(mfbt[order(-mfbt$mean.var), ], 30)
}

Let us use the function to explore the mean value of FATALITIES by raw category.

disp.head.by.mean.cat('FATALITIES')
                          type mean.var instances
699 tornadoes, tstm wind, hail   25.000         1
61               cold and snow   14.000         1
708      tropical storm gordon    8.000         1
518      record/excessive heat    5.667         3
126               extreme heat    4.364        22
245          heat wave drought    4.000         1
333             high wind/seas    4.000         1
440              marine mishap    3.500         2
825              winter storms    3.333         3
302        heavy surf and wind    3.000         1
326         high wind and seas    3.000         1
533                 rough seas    2.667         3
246                 heat waves    2.500         2
526    rip currents/heavy surf    2.500         2
244                  heat wave    2.293        75
744  unseasonably warm and dry    2.231        13
369  hurricane opal/high winds    2.000         1
733                    tsunami    1.650        20
270                 heavy seas    1.500         2
66                cold weather    1.250         4
242                       heat    1.222       767
375       hypothermia/exposure    1.167         6
115             excessive heat    1.134      1678
18                    avalance    1.000         1
55                coastalstorm    1.000         1
63            cold temperature    1.000         2
65                   cold wave    1.000         3
69                  cold/winds    1.000         1
85                    drowning    1.000         1
123              extended cold    1.000         1

Notice that many of the “categories” of events are actually unique cases (e.g., the event categorised as TORNADOES, TSTM WIND, HAIL with 25 fatalities) or very small, either as a result of rarity or inconsistent labelling techniques. Further, many of the categories, e.g. HEAT, HEAT WAVE and EXCESSIVE HEAT appear to refer to the same category of event, and it might make sense to merge them.

It is still a useful exploration, however, in that it suggests which types of events are particularly severe. Let us similarly explore mean values of INJURIES, PROPDMG and CROPDMG.

disp.head.by.mean.cat('INJURIES')
                       type mean.var instances
708   tropical storm gordon   43.000         1
805              wild fires   37.500         4
679           thunderstormw   27.000         1
326      high wind and seas   20.000         1
584         snow/high winds   18.000         2
196         glaze/ice storm   15.000         1
245       heat wave drought   15.000         1
822 winter storm high winds   15.000         1
371       hurricane/typhoon   14.489        88
827      winter weather mix   11.333         6
126            extreme heat    7.045        22
472  non-severe wind damage    7.000         1
733                 tsunami    6.450        20
825           winter storms    5.667         3
695              tornado f2    5.333         3
119      excessive rainfall    5.250         4
795      waterspout/tornado    5.250         8
244               heat wave    5.053        75
194                   glaze    5.023        43
703     torrential rainfall    4.000         1
115          excessive heat    3.889      1678
242                    heat    2.738       767
452            mixed precip    2.600        10
440           marine mishap    2.500         2
376                     ice    2.246        61
282       heavy snow shower    2.000         1
294          heavy snow/ice    2.000         5
355         high winds/snow    2.000         3
388   ice storm/flash flood    2.000         1
437         marine accident    2.000         1
disp.head.by.mean.cat('PROPDMG')
                              type mean.var instances
47                 coastal erosion    766.0         1
254           heavy rain and flood    600.0         1
527         river and stream flood    600.0         2
35           blizzard/winter storm    500.0         1
142                   flash flood/    500.0         1
150 flash flooding/thunderstorm wi    500.0         1
165              flood/river flood    500.0         1
187                  frost\\freeze    500.0         1
264                heavy rain/snow    500.0         1
299        heavy snow/winter storm    500.0         1
333                 high wind/seas    500.0         1
367               hurricane gordon    500.0         1
548                sleet/ice storm    500.0         1
568             snow and ice storm    500.0         1
581                      snow/cold    500.0         2
583                snow/heavy snow    500.0         1
708          tropical storm gordon    500.0         1
696                     tornado f3    362.5         2
169                         floods    335.0         3
817                     wind storm    300.0         1
401                      landslump    285.0         2
52                   coastal surge    250.0         2
237                     hail/winds    250.0         2
366                hurricane felix    250.0         2
149           flash flooding/flood    218.8         8
694                     tornado f1    217.9         4
685              thundertorm winds    202.0         3
709           tropical storm jerry    201.3         3
695                     tornado f2    200.3         3
1               high surf advisory    200.0         1
disp.head.by.mean.cat('CROPDMG')
                             type mean.var instances
105         dust storm/high winds   500.00         1
172                  forest fires   500.00         1
708         tropical storm gordon   500.00         1
352               high winds/cold   401.00         5
366               hurricane felix   250.00         2
825                 winter storms   166.67         3
121             excessive wetness   142.00         1
735                       typhoon    75.00        11
62        cold and wet conditions    66.00         1
809                     wildfires    62.50         8
371             hurricane/typhoon    54.53        88
75                damaging freeze    53.26         8
245             heat wave drought    50.00         1
616             thunderstorm hail    50.00         1
484                          rain    46.88        16
529                river flooding    41.79        29
269          heavy rains/flooding    39.22         9
551                    small hail    32.68        53
362                     hurricane    30.69       174
167                      flooding    28.01       120
145             flash flood/flood    25.23        22
237                    hail/winds    25.02         2
221                      hail 150    25.00         2
707           tropical storm dean    25.00         2
149          flash flooding/flood    21.88         8
268                   heavy rains    21.54        26
108                   early frost    21.00         2
365                hurricane erin    20.86         7
528                   river flood    20.17       173
292 heavy snow/high winds & flood    20.00         1

Cleaning up the data…

Based on the output above, let us create a new variable cat (category) that is a bit more in line with the directives (p. 6) with respect to categorising events. The following code assigns each event a category based on pattern matching on the value stored in EVTYPE.

I generally stuck to the directives (p. 6). See comments for exceptions and notes.

# *mudslide changed name to debris flow
# *instance of "devil" spelled incorrectly
# cold/wind chill and extreme cold/wind chill merged:
#   the deaths/injuries/damage appear to be very similarly sustained
# strong high thunderstorm etc. winds to "Strong Wind"

# issues: mutually exclusive treatment - some events of more than one
#   type categorised arbitrarily as one or the other by the order
#   of the pattern matching below
data$category <- sapply(
  data$EVTYPE, 
  function(x) {
    if (grepl("astronomical", x)) "Astronomical High Tide"
    else if (grepl("avalanche", x)) "Avalanche" 
    else if (grepl("blizzard", x)) "Blizzard" 
    else if (grepl("coastal", x) & grepl("flood", x)) "Coastal Flood" 
    else if (grepl("cold/wind", x) | grepl("wind chill", x)) "Cold/Wind Chill" 
    else if (grepl("debris", x) | grepl("slide", x)) "Debris Flow" 
    else if (grepl("dense fog", x)) "Dense Fog"
    else if (grepl("smoke", x)) "Dense Smoke"
    else if (grepl("drought", x)) "Drought"
    else if (grepl("dust dev", x)) "Dust Devil" 
    else if (grepl("dust storm", x)) "Dust Storm"
    else if (grepl("heat", x)) "Heat"
    else if (grepl("flash", x)) "Flash Flood"
    else if (grepl("lakeshore", x)) "Lakeshore Flood"
    else if (grepl("flood", x)) "Flood"
    else if (grepl("frost", x) | grepl("freeze", x)) "Frost/Freeze"
    else if (grepl("funnel", x)) "Funnel Cloud"
    else if (grepl("freezing", x) | grepl("sleet", x)) "Freezing Precipitation"
    else if (grepl("hail", x)) "Hail"
    else if (grepl("rain", x)) "Heavy Rain"
    else if (grepl("snow", x)) "Heavy Snow"
    else if (grepl("surf", x)) "High Surf"
    else if (grepl("wind", x)) "Strong Wind"
    else if (grepl("hurricane", x) | grepl("typhoon", x)) "Hurricane"
    else if (grepl("tornado", x)) "Tornado"
    else if (grepl("waterspout", x)) "Waterspout"
    else if (grepl("lightning", x)) "Lightning"
    else if (grepl("seiche", x)) "Seiche"
    else if (grepl("storm surge", x)) "Storm Surge/Tide"
    else if (grepl("tsunami", x)) "Tsunami"
    else if (grepl("volcanic ash", x)) "Volcanic Ash"
    else if (grepl("fire", x)) "Wildfire"
    else if (grepl("tropical storm", x)) "Tropical Storm"
    else if (grepl("depression", x)) "Tropical Depression"
    else if (grepl("rip", x)) "Rip Current"
    else if (grepl("ice storm", x)) "Ice Storm"
    else if (grepl("winter storm", x)) "Winter Storm"
    else if (grepl("winter weather", x)) "Winter Weather"
    else "Not Classified"
  }
)

A new data set

Based on the above categorisation, let us create a new data set that stores the mean values by category for the respective variables:

means.by.category <- data.frame()
invisible(
  lapply(
    split(data, data$category),
    function(x) {
      means.by.category <<- rbind(
        means.by.category,
        data.frame(
          cat = x$category[1],
          freq = nrow(x),
          fat = mean(x$FATALITIES),
          inj = mean(x$INJURIES),
          prop = mean(x$PROPDMG),
          crop = mean(x$CROPDMG)
        )  
      )
    }
  )  
)

# Place "Not Classified" category in last row...
# These are events that are not categorised by the pattern matching above.
means.by.category <- rbind(
  means.by.category[means.by.category$cat != "Not Classified", ],
  means.by.category[means.by.category$cat == "Not Classified", ]
)

means.by.category
                      cat   freq       fat       inj     prop      crop
1  Astronomical High Tide    277 0.0000000 0.0000000  4.52527  0.000000
2               Avalanche    387 0.5788114 0.4418605  4.20904  0.000000
3                Blizzard   2743 0.0368210 0.2934743  9.48541  0.062705
4           Coastal Flood    852 0.0070423 0.0082160 21.24529  0.065728
5         Cold/Wind Chill   1584 0.1395202 0.0227273  2.96338  0.410354
6             Debris Flow    653 0.0673813 0.0842266 31.00925  0.056662
7               Dense Fog   1296 0.0138889 0.2638889  6.34680  0.000000
8             Dense Smoke     21 0.0000000 0.0000000  4.76190  0.000000
9                 Drought   2512 0.0023885 0.0075637  1.71141 13.516879
10             Dust Devil    151 0.0132450 0.2847682  4.76245  0.000000
11             Dust Storm    429 0.0512821 1.0256410 11.88695  4.898601
12            Flash Flood  55675 0.0185900 0.0323664 26.48088  3.349514
13                  Flood  26176 0.0184902 0.2595889 36.04845  6.798843
14 Freezing Precipitation    504 0.0257937 0.0753968  9.42514  0.000000
15           Frost/Freeze   1512 0.0013228 0.0019841  1.12270  6.375933
16           Funnel Cloud   6991 0.0000000 0.0004291  0.02855  0.000000
17                   Hail 290399 0.0001550 0.0050517  2.40814  2.017764
18                   Heat   2631 1.1904219 3.5001900  1.15274  0.538731
19             Heavy Rain  11875 0.0088421 0.0237474  4.55311  1.049036
20             Heavy Snow  17620 0.0093076 0.0660045  8.59663  0.123480
21              High Surf   1064 0.1560150 0.2312030  6.06214  0.000000
22              Hurricane    298 0.4463087 4.4731544 84.51862 39.019430
23              Ice Storm   2007 0.0443448 0.9915296 32.88524  0.841530
24        Lakeshore Flood     23 0.0000000 0.0000000  2.06522  0.000000
25              Lightning  15761 0.0518368 0.3319586 38.28353  0.227182
27            Rip Current    774 0.7390181 0.6834625  0.21059  0.000000
28                 Seiche     21 0.0000000 0.0000000 46.66667  0.000000
29       Storm Surge/Tide    409 0.0586797 0.1051345 63.98665  2.090465
30            Strong Wind 362120 0.0032890 0.0312686  8.61428  0.602366
31                Tornado  60698 0.0928531 1.5059310 52.97947  1.647942
32    Tropical Depression     60 0.0000000 0.0000000 12.30000  0.000000
33         Tropical Storm    697 0.0946915 0.5494978 71.63943  9.275638
34                Tsunami     20 1.6500000 6.4500000 45.26500  1.000000
35           Volcanic Ash     27 0.0000000 0.0000000 18.51852  0.000000
36             Waterspout   3845 0.0007802 0.0075423  2.48744  0.000000
37               Wildfire   4239 0.0212314 0.3793347 29.53958  2.256603
38           Winter Storm  11436 0.0188877 0.1169990 11.64923  0.216771
39         Winter Weather   8155 0.0074801 0.0659718  2.07338  0.001839
26         Not Classified   6279 0.0675267 0.2739290  9.73426  1.641263

Results

I wrote the following function to draw a bar plot of the means for the n greatest categories.

plot.top.n <- function(variable, n, ...) {
  order.by.variable <- means.by.category[order(-means.by.category[[variable]]), ]
  r <- nrow(order.by.variable)
  order.by.variable <- rbind(
    order.by.variable[1:n,],
    data.frame(
      cat = "Other",
      freq = sum(order.by.variable$freq[(n + 1):r]),
      fat = sum(order.by.variable$fat[(n + 1):r]),
      inj = sum(order.by.variable$inj[(n + 1):r]),
      prop = sum(order.by.variable$prop[(n + 1):r]),
      crop = sum(order.by.variable$crop[(n + 1):r])
    )
  )
  barplot(
    order.by.variable[[variable]][order.by.variable$cat != "Other"], 
    names.arg = order.by.variable$cat[order.by.variable$cat != "Other"], 
    ...
  )
}

We can now use this function to draw the bar plots showing the five events with the greatest nett severity in each category.

par(mfrow = c(2, 2))
plot.top.n('fat', 5, main = "Fatalities")
plot.top.n('inj', 5, main = "Injuries")
plot.top.n('prop', 5, main = "Property damage")
plot.top.n('crop', 5, main = "Crop damage")

plot of chunk unnamed-chunk-10

Fig. 1: A comparison between the mean severity of events in terms of fatalities, injuries, property damage and crop damage.

Problem: Pertinence to resource allocation question…

The graphical summary above might not be most appropriate for decisions regarding distribution of resources. For example, even though The category Tsunami has a high associated fatality rate, it is actually a relatively rare event; it might make more sense - that is, save more lives in the long term - to distribute resources to the reduction of fatalities during less deadly - but more frequent - events.

Thus, I feel that the risk - by which I mean the product of the severity and the relative frequency - associated with each event would be a more important measure of how resources might be directed than raw severity.

In order to provide a summary that provides some indication of both frequency and severity of events, I wrote the following function to position each event type according to its log-severity and log-frequency (the logarithmic scale was used because of the great range of values in each variable). Categories that appear near the top right corner may then be, in terms of both frequency and severity, the most appropriate candidates for our attention.

plot.risk.n <- function(variable, ...) {
  cond <- means.by.category[[variable]] != 0 & means.by.category$freq != 0 
  x <- log(means.by.category[[variable]][cond])
  y <- log(means.by.category$freq[cond] / sum(means.by.category$freq[cond]))

  cats <- (means.by.category$cat[cond])
  min.x <- min(x)
  max.x <- max(x)
  space.x <- 0.1 * (max.x - min.x)
  min.y <- min(y)
  max.y <- max(y)
  space.y <- 0.1 * (max.y - min.y)

  dists <- sapply(
    1: length(x),
    function(i) {
      x0 <- x[i]
      y0 <- y[i]
      sqrt((x0 - min.x)^2 + (y0 - min.y)^2)
    }
  )

  max.dist <- max(dists)
  min.dist <- min(dists)

  my.col.scheme <- sapply(dists, function(d) {
    p <- (d - min.dist) / (max.dist - min.dist)
    rgb(p, 0.4 * (1 - p), 0.2 * (1 - p))
  })

  my.size.scheme <- sapply(dists, function(d) {
    p <- (d - min.dist) / (max.dist - min.dist)
    1 + 0.5 * p
  })

  plot(
    x,
    y,
    type = "n",
    bty = "n",
    xlab = "Log Mean Severity",
    ylab = "Log Frequency",
    xlim = c(min.x - space.x, max.x + space.x),
    ylim = c(min.y - space.y, max.y + space.y),
    ...
  )

  text(
    x,
    y,
    cats,
    col = my.col.scheme,
    cex = my.size.scheme
  )

}

Here are the associated plots. I've included some redundancy in that the position (top-right greater), colour (red greater) and relative size of the category names are selected to convey severity in terms of risk associated with each event.

par(mfrow = c(2, 2))
plot.risk.n('fat', main = "Fatalities")
plot.risk.n('inj', main = "Injuries")
plot.risk.n('prop', main = "Property damage")
plot.risk.n('crop', main = "Crop damage")

plot of chunk unnamed-chunk-12

Fig. 2: A word plot of category by log-severity and log-frequency in terms of fatalities, injuries, property damage and crop damage.

Notice, to return to the previous example, that even though Tsunamis are the deadliest event types in terms of net fatality per event, the risk of fatality associated with Tsunami events is not the greatest. Tornado, Strong Wind and Heat appear to have greater associated risks.

To demonstrate this more clearly, let us reproduce the plot in figure 1, but this time with associated risk rather than raw severity.

plot.top.risk.n <- function(variable, n, ...) {
  s <- sum(means.by.category$freq)
  x <- -means.by.category[[variable]] * means.by.category$freq / s
  order.by.variable <- means.by.category[
    order(
      x
    ), ]
  r <- nrow(order.by.variable)
  order.by.variable <- rbind(
    order.by.variable[1:n,],
    data.frame(
      cat = "Other",
      freq = sum(order.by.variable$freq[(n + 1):r]),
      fat = sum(order.by.variable$fat[(n + 1):r]),
      inj = sum(order.by.variable$inj[(n + 1):r]),
      prop = sum(order.by.variable$prop[(n + 1):r]),
      crop = sum(order.by.variable$crop[(n + 1):r])
    )
  )
  barplot(
    order.by.variable[[variable]][order.by.variable$cat != "Other"]
    * order.by.variable$freq[order.by.variable$cat != "Other"] / s, 
    names.arg = order.by.variable$cat[order.by.variable$cat != "Other"], 
    ...
  )
}
par(mfrow = c(2, 2))
plot.top.risk.n('fat', 5, main = "Fatalities")
plot.top.risk.n('inj', 5, main = "Injuries")
plot.top.risk.n('prop', 5, main = "Property damage")
plot.top.risk.n('crop', 5, main = "Crop damage")

plot of chunk unnamed-chunk-14

Fig. 3: The risk (severity-relative frequency product) in terms of fatalities, injuries, property damage and crop damage of the worst five events in each category

Summary

Although figure 3 does not necessarily show the five most severe events in each category on a per-event basis (as does figure 1), it does show the five events that would produce the best pay-off (in terms of lives saved, injuries prevented, costs avoided in the long term) if we were able to somehow curb the mean severity of the event. Thus, in terms of policies regarding the allocation of resources to prevent or reduce fatalities, injuries or damage to property and crops, the results from figures 2 and 3 would be of greater importance.

Events most harmful to population health in the long term

Once again, these are not necessarily the most deadly or dangerous per event, but can be expected to produce the most fatalities or injuries over the long term.

Events with greatest economic consequences in the long term

Once again, these events are not necessarily the most damaging per event, but can be expected to produce the most damage over the long term.