Synopsis

The National Weather Service has compiled data on damaging weather events in the United States since 1950 into a freely-available product, Storm Data. For this analysis, I determined the 10 most dangerous types of weather events from the Storm Data product, from both the perspective of total casualties (people hurt and killed) and the total damage to material goods (property and crops) caused by these events over the period of record. According to this analysis, tornadoes caused both more casualties and more damage than any other category of weather event. Thunderstorms and floods also appeared on both lists; extreme heat presents a danger in terms of the number of casualties caused. Officials with responsibility for managing the risks posed by weather events may wish to focus on these types of events.

This project was undertaken as part of the Reproducible Research course on Coursera, taught by Roger D. Peng at Johns Hopkins. The problem statement is online at https://class.coursera.org/repdata-015/human_grading/view/courses/973516/assessments/4/submissions.

Data Processing

The Storm Data product takes the form of a large, compressed .csv file (a bit like an Excel spreadsheet, only much bigger). The dplyr package by Hadley Wickham provides facilities for analyzing this type of data in R. We first load the dplyr package into R, then read in the Storm Data table.

# Mise en place.  
require(dplyr)
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
options(warn = -1)
graphics.off()

# If storm.data doesn't already exist in memory, load it and make
# the column names neater.  
if (!exists('storm.data')) {
  storm.data <- read.csv('repdata-data-StormData.csv.bz2')
  colnames(storm.data) <- tolower(colnames(storm.data))
}

The Storm Data table is too large to work with conveniently. Inspection of the column names suggests that the relevant columns are labeled evtype (event type), fatalities, injuries, propdmg (property damage), propdmgexp (property damage exponent), cropdmg (crop damage), and cropdmgexp (crop damage exponent).

# Look at the column names in the data set.  
colnames(storm.data)
##  [1] "state__"    "bgn_date"   "bgn_time"   "time_zone"  "county"    
##  [6] "countyname" "state"      "evtype"     "bgn_range"  "bgn_azi"   
## [11] "bgn_locati" "end_date"   "end_time"   "county_end" "countyendn"
## [16] "end_range"  "end_azi"    "end_locati" "length"     "width"     
## [21] "f"          "mag"        "fatalities" "injuries"   "propdmg"   
## [26] "propdmgexp" "cropdmg"    "cropdmgexp" "wfo"        "stateoffic"
## [31] "zonenames"  "latitude"   "longitude"  "latitude_e" "longitude_"
## [36] "remarks"    "refnum"
# Pull out the relevant columns and put them in sub.data.  
sub.data <- select(storm.data, evtype, fatalities, injuries, propdmg, propdmgexp, cropdmg, cropdmgexp)
sub.data <- mutate(sub.data, evtype = tolower(evtype))

The fatalities and injuries columns are straightforward; they record the numbers of people killed or hurt in each weather event. However, the evtype, propdmgexp, and cropdmgexp require more explanation. The following R commands generate a frequency table of the unique entries in each column.

# Look at some of the event types in sub.data$evtype.  
print(head(arrange(as.data.frame(table(sub.data$evtype)), desc(Freq)), 25))
##                        Var1   Freq
## 1                      hail 288661
## 2                 tstm wind 219942
## 3         thunderstorm wind  82564
## 4                   tornado  60652
## 5               flash flood  54277
## 6                     flood  25327
## 7        thunderstorm winds  20843
## 8                 high wind  20214
## 9                 lightning  15754
## 10               heavy snow  15708
## 11               heavy rain  11742
## 12             winter storm  11433
## 13           winter weather   7045
## 14             funnel cloud   6844
## 15         marine tstm wind   6175
## 16 marine thunderstorm wind   5812
## 17               waterspout   3796
## 18              strong wind   3569
## 19     urban/sml stream fld   3392
## 20                 wildfire   2761
## 21                 blizzard   2719
## 22                  drought   2488
## 23                ice storm   2006
## 24           excessive heat   1678
## 25               high winds   1533
# Note unexplained characters in the propdmgexp and cropdmgexp columns.  No idea
# what to do about these.  
print(arrange(as.data.frame(table(sub.data$propdmgexp)), desc(Freq)))
##    Var1   Freq
## 1       465934
## 2     K 424665
## 3     M  11330
## 4     0    216
## 5     B     40
## 6     5     28
## 7     1     25
## 8     2     13
## 9     ?      8
## 10    m      7
## 11    H      6
## 12    +      5
## 13    7      5
## 14    3      4
## 15    4      4
## 16    6      4
## 17    -      1
## 18    8      1
## 19    h      1
print(arrange(as.data.frame(table(sub.data$cropdmgexp)), desc(Freq)))
##   Var1   Freq
## 1      618413
## 2    K 281832
## 3    M   1994
## 4    k     21
## 5    0     19
## 6    B      9
## 7    ?      7
## 8    2      1
## 9    m      1

We can see problems in the data already.
1. The evtype field contains separate entries for different events that should perhaps be lumped together (are “flood” and “urban/sml stream fld” really different enough that they should have two separate categories?) 2. We need to calculate total damages to property and crops in order to answer the question, but we can’t calculate numerical values from the strings in the exponent columns.
3. Besides the K, M, and B values that we expect to see in the exponent columns, there are many other characters that aren’t explained in the documentation.

To address these problems, we
1. use grep to replace apparent duplicates in the evtype column 2. define a function that, given K (thousands), M (millions), or B (billions), returns the appropriate exponent (3, 6, or 9), or returns 0 if some other character is supplied 3. use the mutate command from the dplyr package to calculate total casualties and total property and crop damage from each event

# Use grep to replace apparent duplicates in the evtype column.  
sub.data$evtype[grep('tstm', sub.data$evtype)] <- 'thunderstorm'
sub.data$evtype[grep('thunderstorm', sub.data$evtype)] <- 'thunderstorm'
sub.data$evtype[grep('flood', sub.data$evtype)] <- 'flood'
sub.data$evtype[grep('heat', sub.data$evtype)] <- 'heat'
sub.data$evtype[grep('ice storm', sub.data$evtype)] <- 'winter storm'
sub.data$evtype[grep('heavy snow', sub.data$evtype)] <- 'winter storm'
sub.data$evtype[grep('wind', sub.data$evtype)] <- 'wind'

# Define a function for turning the letter codes in the *exp columns into
# exponents.  
letter2exponent <- function(letter) {
  
  # /--- function letter2exponent ---/
  # 
  # Given the letters 'K' (kilo-, 10^3), 'M' (mega-, 10^6), or 'B' (billion, 
  # 10^9), returns the appropriate exponent.  If none of those characters,
  # returns 0.  
  # 
  # Syntax: exponent <- letter2exponent(letter)
  # exponent, one of 3, 6, 9, or 0
  # letter, preferably one of 'K', 'M', or 'B'
  
  if (letter == 'K') {
    exponent <- 3
  } else if (letter == 'M') {
    exponent <- 6
  } else if (letter == 'B') {
    exponent <- 9
  } else {
    exponent <- 0
  }
  
  return(exponent)
  
} # end function letter2exponent

# Create two new columns that represent the total number of people hurt or 
# killed and the total damage associated with each event.  
sub.data <- mutate(sub.data, total.hurt = fatalities+ injuries, total.dmg = propdmg* 10^letter2exponent(propdmgexp)+ cropdmg* 10^letter2exponent(cropdmgexp))

The Storm Data table is arranged by individual events, but we want to assess the relative danger posed by different event types. dplyr provides a facility for summarizing the data in this way.

# Use the dplyr package to find the total number of casualties and the total 
# damage due to each type of event.  
group.evtype <- group_by(sub.data, evtype)
casualties.by.evtype <- summarize(group.evtype, total.hurt = sum(total.hurt))
damage.by.evtype <- summarize(group.evtype, total.dmg = sum(total.dmg))

# Extract the top 10 most dangerous event types, both by numbers of people
# hurt and killed and by monetary damages.  
top10.casualty.evtypes <- head(arrange(casualties.by.evtype, desc(total.hurt)), 10)
top10.damage.evtypes <- head(arrange(damage.by.evtype, desc(total.dmg)), 10)

Finally, we can plot the summarized Storm Data for the top 10 types of events in each category.

# Plot the results.  See http://www.statmethods.net/graphs/bar.html
par(las = 1, mar = c(5, 8, 4, 2))
barplot(top10.casualty.evtypes$total.hurt, names.arg = top10.casualty.evtypes$evtype, horiz = TRUE, main = 'Casualties due to event types, 1950-2011', xlab = 'Total casualties (people hurt or killed)')

Figure 1. Number of casualties (people hurt or killed) in weather events of various types between 1950 and 2011, as recorded in the National Weather Service’s Storm Data product. Only the top 10 most hazardous event types are shown.

par(las = 1, mar = c(5, 8, 4, 2))
barplot(top10.damage.evtypes$total.dmg, names.arg = top10.damage.evtypes$evtype, horiz = TRUE, main = 'Damages due to event types, 1950-2011', xlab = 'Total damages (USD)')  

Figure 2. Damages due to weather events of various types between 1950 and 2011, as recorded in the National Weather Service’s Storm Data product. Only the top 10 most damaging event types are shown.

Results

Tornadoes are clearly the most severe weather events recorded in the National Weather Service’s Storm Data product, both from the perspective of the total number of people hurt and killed (Fig. 1) and the monetary damages (Fig. 2) caused by events of different types. Other types of events that appear to present considerable hazards include thunderstorms and floods, which occur in both lists’ five most hazardous event types. Heat also causes many casualties, though it does not appear to be a major source of property or crop damage.

Discussion

It should be noted that the analysis above neglects changes in the completeness of reporting and in the value of the dollar over time. In general, reporting completeness has improved over time, suggesting that this analysis gives greater weight to later years in the period of record (1950-2011). On the other hand, the purchasing power of a dollar has decreased considerably between 1950 and the present, meaning that an event that caused $1,000 worth of damage in 1950 was actually much worse than an event with the same reported damage in 2011. To correct for this bias, the dollar values could be adjusted to represent the value of a dollar in a particular year, perhaps 2005 or 2014.