Synopsis

This report attempts to determine which weather events are the most costly (in terms of property, crops, and in human lives and injuries). The knowledge gained can be used to direct resources to where they will have the most effect. It turns out that only a few weather event types are responsible for most of the damage, to property, crops, and human health.

The project uses data from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, which tracks major storms and weather events in the US. The raw data used for this project is here. For more information on the data, consult the National Weather Service Storm Data Documentation and the National Climatic Data Center Storm Events FAQ.

The raw data is extremely untidy; the bulk of this report describes the methods used to process the data before analysis. Data processing is implemented in a manner which allows refining and reproducing the results in the future.

Library loading

First things first - load required libraries.

library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Data Processing

Read the input file

This report assumes that the storm data file has been downloaded, and is in the current directory. If needed, the data can be downloaded here. The data is loaded into R.

raw_data <- read.csv("repdata_data_StormData.csv.bz2")

Delete unused columns

The raw data contains 37 variables, but we only care about a handful of them. I create a second, working dataset, and keep the original raw data around as a reference.

keepers <- c(
  "BGN_DATE",
  "EVTYPE",
  "FATALITIES",
  "INJURIES",
  "PROPDMG",
  "PROPDMGEXP",
  "CROPDMG",
  "CROPDMGEXP",
  "REFNUM"
)
data <- raw_data[, keepers]

Deal with Dates

I convert the BGN_DATE variable to an R Date object, and also extract the year from that Date as a new variable.

data$BGN_DATE <- as.Date(data$BGN_DATE, "%m/%d/%Y")
data$year <- as.integer(format(data$BGN_DATE, "%Y"))

The data is untidy, part 1: PROPDMGEXP, CROPDMGEXP

PROPDMGEXP, CROPDMGEXP are two variables which serve as multipliers on the PROPDMG and CROPDMG variables, respectively. Despite the “EXP” (presumably standing for “exponent”), the values are letters, with multiplier values according to the following table:

EXP code multiplier value
h, H 100
k, K 1000
m, M 1e6
b, B 1e9

A very small minority of records have odd values in PROPDMGEXP, CROPDMGEXP, such as numerals 0-9, ‘?’, ‘+’, ‘-’. These records occur in only a few years, and even within that subset are a tiny minority of records. The table shows the ratio of “odd” values to normal values, over the years in which those values occur, for PROPDMGEXP:

odd_names <- c(0:9, "+", "-", "?")
odd_exp <- data[data$PROPDMGEXP %in% odd_names,]
odd_years <- unique(odd_exp$year)
table(odd_exp$year) / table(data[data$year %in% odd_years,]$year)
## 
##         1993         1994         1995         2011 
## 3.172840e-04 1.599535e-03 9.867715e-03 1.608389e-05

The next table shows the same ratio, for CROPDMGEXP:

odd_exp <- data[data$CROPDMGEXP %in% odd_names,]
odd_years <- unique(odd_exp$year)
table(odd_exp$year) / table(data[data$year %in% odd_years,]$year)
## 
##         1993         1994         1995 
## 0.0003172840 0.0004362367 0.0005005363

Because the oddball values for the EXP values are so rare, I’ll remove records from the dataset that contain them. For the purposes of this analysis, this should pose no problem.

data <- data[!data$PROPDMGEXP %in% odd_names,]
data <- data[!data$CROPDMGEXP %in% odd_names,]

Compute numerical property and crop damage variables

Next I convert the CROPDMG, CROPDMGEXP, PROPDMG, PROPDMGEXP variables into new numerical variables.

data = within(data, {
    PROP_MULT = NA
    PROP_MULT[PROPDMGEXP == ""] = 1
    PROP_MULT[PROPDMGEXP %in% c('h', 'H')] = 100
    PROP_MULT[PROPDMGEXP %in% c('k', 'K')] = 1000
    PROP_MULT[PROPDMGEXP %in% c('m', 'M')] = 1000000
    PROP_MULT[PROPDMGEXP %in% 'B'] = 1000000000
})
data = within(data, {
    CROP_MULT = NA
    CROP_MULT[CROPDMGEXP == ""] = 1
    CROP_MULT[CROPDMGEXP %in% c('h', 'H')] = 100
    CROP_MULT[CROPDMGEXP %in% c('k', 'K')] = 1000
    CROP_MULT[CROPDMGEXP %in% c('m', 'M')] = 1000000
    CROP_MULT[CROPDMGEXP %in% 'B'] = 1000000000
})

data$PROP_COST <- data$PROPDMG * data$PROP_MULT
data$CROP_COST <- data$CROPDMG * data$CROP_MULT

Rremove events with no damage costs, fatalities or injuries

For purposes of this analysis, only events that had economic impact, injuries or fatalities are of interest.

data <- data[
  data$FATALITIES > 0 |
  data$INJURIES > 0 |
  data$PROP_COST > 0 |
  data$CROP_COST > 0,
]

The data is untidy, part 2: $115 Billion Flood in Napa, CA!

A summary of property damage reveals an outlier:

summary(data$PROP_COST)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.00e+00 2.00e+03 1.00e+04 1.68e+06 3.50e+04 1.15e+11
m <- data[data$PROP_COST == max(data$PROP_COST),]
m
##          BGN_DATE EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG
## 605953 2006-01-01  FLOOD          0        0     115          B    32.5
##        CROPDMGEXP REFNUM year PROP_MULT CROP_MULT PROP_COST CROP_COST
## 605953          M 605943 2006     1e+09     1e+06  1.15e+11  32500000
raw_data[raw_data$REFNUM==m$REFNUM, "REMARKS"]
## [1] Major flooding continued into the early hours of January 1st, before the Napa River finally fell below flood stage and the water receeded. Flooding was severe in Downtown Napa from the Napa Creek and the City and Parks Department was hit with $6 million in damage alone. The City of Napa had 600 homes with moderate damage, 150 damaged businesses with costs of at least $70 million.
## 436781 Levels:  -2 at Deer Park\n ... Zones 22 and 23 were added to the high wind warning of  January 26. Peak winds Sitka 55MPH, Cape Decision 58MPH, and Cape Spencer 64MPH.\n

The remarks section mentions a few costs; $6 million, $70 million. It seems obvious that the total damage wasn’t $115 billion! I’m assuming that $115 million was intended, and will make that correction for this analysis.

data[data$REFNUM == 605943, "PROP_COST"] <- 1.15e8

What’s the next largest property damage?

m <- data[data$PROP_COST == max(data$PROP_COST),]
m
##          BGN_DATE      EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP
## 577676 2005-08-29 STORM SURGE          0        0    31.3          B
##        CROPDMG CROPDMGEXP REFNUM year PROP_MULT CROP_MULT PROP_COST
## 577676       0            577616 2005     1e+09         1  3.13e+10
##        CROP_COST
## 577676         0
raw_data[raw_data$REFNUM==m$REFNUM, "REMARKS"]
## [1] Storm surge damage in southeast Louisiana, especially in the New Orleans area and the coastal parishes, was catastrophic.  Hurricane protection levees and floodwalls were overtopped and/or breached resulting in widespread and deep flooding of homes and businesses.  Much of Orleans and Plaquemines Parishes and nearly all of St. Bernard Parish were flooded by storm surge. Approximately 80 percent of the city of New Orleans was flooded.  Thousands of people were stranded by the flood waters in homes and buildings and on rooftops for several days and had to be rescued by boat and helicopter. In Jefferson Parish, levees were not compromised, however many homes were flooded by either heavy rain overwhelming limited pumping capacity or storm surge water moving through in-operable pumps into the parish.  Severe storm surge damage also occurred along the north shore of Lake Pontchartrain from Mandeville to Slidell with storm surge water moving inland as far as Old Towne Slidell with water up to 6 feet deep in some locations\n\nPost storm high water surveys of the area conducted by FEMA indicated the following storm surge estimates:  Orleans Parish - 12-15 feet in east New Orleans to 9 to 12 feet along the Lakefront; St. Bernard Parish - 14 to 17 feet; Jefferson Parish - 6 to 9 feet along the lakefront to 5 to 8 feet from Lafitte to Grand Isle; Plaquemines Parish - 15 to 17 feet; St. Tammany Parish - 11 to 16 feet in southeast portion to 7 to 10 feet in western portion. All storm surge heights are still water elevations referenced to NAVD88 datum.
## 436781 Levels:  -2 at Deer Park\n ... Zones 22 and 23 were added to the high wind warning of  January 26. Peak winds Sitka 55MPH, Cape Decision 58MPH, and Cape Spencer 64MPH.\n

And that’s Hurricane Katrina - $31.3 billion seems reasonable.

Focus on recent, more-complete data

Data per year was collected according to the following:

  • 1950-1954: tornado events only
  • 1955 - 1995: tornado, thunderstorm, wind and hail vents only
  • 1996 - present: 48 event types recorded (as defined in NWS Directive 10-1605).

Because not all event types were recorded until 1996, I will exclude earlier years from this analysis.

data <- data[data$year >= 1996,]

Remove any remaining unused variables

keepers2 <- c(
  "BGN_DATE",
  "EVTYPE",
  "FATALITIES",
  "INJURIES",
  "year",
  "PROP_COST",
  "CROP_COST",
  "REFNUM"
)
data <- data[, keepers2]

The data is untidy, part 3: EVTYPE

The input data EVTYPE variable (Event Type) contains over 900 unique values, many of which are clearly misspelled, or are worded slightly differently. 48 official event types are defined for data from 1996 to the present, in NWS Directive 10-1605.
Event types:

  • Astronomical Low Tide
  • Avalanche
  • Blizzard
  • Coastal Flood
  • Cold/Wind Chill
  • Debris Flow
  • Dense Fog
  • Dense Smoke
  • Drought
  • Dust Devil
  • Dust Storm
  • Excessive Heat
  • Extreme Cold/Wind Chill
  • Flash Flood
  • Flood
  • Frost/Freeze
  • Funnel Cloud
  • Freezing Fog
  • Hail
  • Heat
  • Heavy Rain
  • Heavy Snow
  • High Surf
  • High Wind
  • Hurricane (Typhoon)
  • Ice Storm
  • Lake-Effect Snow
  • Lakeshore Flood
  • Lightning
  • Marine Hail
  • Marine High Wind
  • Marine Strong Wind
  • Marine Thunderstorm Wind
  • Rip Current
  • Seiche
  • Sleet
  • Storm Surge/Tide
  • Strong Wind
  • Thunderstorm Wind
  • Tornado
  • Tropical Depression
  • Tropical Storm
  • Tsunami
  • Volcanic Ash
  • Waterspout
  • Wildfire
  • Winter Storm
  • Winter Weather

EVTYPE values are transformed into values from the official list of 48 by text transformation and string matching, using a couple of utility functions.

data$EVTYPE <- toupper(data$EVTYPE)
# Replace events in *data* with EVTYPE matching *pattern* with *replacement*.
merge_events <- function(pattern, replacement, data) {
  ast <- grepl(pattern, data$EVTYPE, ignore.case=T)
  data$EVTYPE[ast] <- replacement
  return(data)
}

# Replace events in *data* matching various patterns with replacements
# from the official list of 48 types.
quash_evtype <- function(data) {

  data <- merge_events("ASTRONOMICAL", "ASTRONOMICAL LOW/HIGH TIDE", data)
  data <- merge_events("AVALANC", "AVALANCHE", data)
  data <- merge_events("mud slide", "AVALANCHE", data)
  data <- merge_events("BLIZZARD", "BLIZZARD", data)
  data <- merge_events("coastal.*flood", "COASTAL FLXXD", data)
  data <- merge_events("cstl flood", "COASTAL FLXXD", data)
  data <- merge_events("flash.*flood", "FLASH FLXXD", data)
  data <- merge_events("flood.*flash", "FLASH FLXXD", data)
  data <- merge_events("lake.*flood", "LAKESHORE FLXXD", data)
  data <- merge_events("flood", "FLXXD", data)
  data <- merge_events("dam break", "FLXXD", data)
  data <- merge_events("urban.*stream", "FLXXD", data)
  data$EVTYPE <- gsub("FLXXD", "FLOOD", data$EVTYPE, fixed=T)

  data <- merge_events("^COLD/WIND CHILL", "COLD/WIND CHILL", data)
  data <- merge_events("EXTREME.*CHILL", "EXTREME COLD/WIND CHILL", data)
  data <- merge_events("EXTREME COLD", "EXTREME COLD/WIND CHILL", data)
  data <- merge_events("^FOG$", "DENSE FOG", data)
  data <- merge_events("^FOG AND COLD TEMPERATURES$", "FREEZING FOG", data)
  data <- merge_events("DROUGHT", "DROUGHT", data)
  data <- merge_events("DUST DEVIL", "DUST DEVIL", data)
  data <- merge_events("DUST STORM", "DUST STORM", data)
  data <- merge_events("blowing dust", "DUST STORM", data)
  data <- merge_events("EXCESSIVE HEAT", "EXCESSIVE HEAT", data)
  data <- merge_events("frost", "FROST/FREEZE", data)
  data <- merge_events("freeze", "FROST/FREEZE", data)
  data <- merge_events("waterspout", "WATERSPOUT", data)
  data <- merge_events("tornad", "TORNADO", data)
  data <- merge_events("gustnado", "TORNADO", data)
  data <- merge_events("lightning", "LIGHTNING", data)
  data <- merge_events("marine tstm", "MARINE GUHAQREFGBEZ WIND", data)
  data <- merge_events("marine thunderstorm", "MARINE GUHAQREFGBEZ WIND", data)

  data <- merge_events("thunderstorm", "THUNDERSTORM", data)
  data <- merge_events("MARINE GUHAQREFGBEZ WIND", "MARINE THUNDERSTORM WIND", data)
  data <- merge_events("MARINE GUHAQREFGBEZ WIND", "MARINE THUNDERSTORM WIND", data)
  data <- merge_events("thundersnow", "THUNDERSTORM", data)
  data <- merge_events("thundertorm", "THUNDERSTORM", data)
  data <- merge_events("thunderstrom", "THUNDERSTORM", data)
  data <- merge_events("thunerstrom", "THUNDERSTORM", data)
  data <- merge_events("thunerstorm", "THUNDERSTORM", data)
  data <- merge_events("thuderstorm", "THUNDERSTORM", data)
  data <- merge_events("thundeerstorm", "THUNDERSTORM", data)
  data <- merge_events("tunderstorm", "THUNDERSTORM", data)
  data <- merge_events("tstm", "THUNDERSTORM", data)

  data <- merge_events("marine hail", "MARINE H8IL", data)
  data <- merge_events("hail", "HAIL", data)
  data <- merge_events("MARINE H8IL", "MARINE HAIL", data)
  data <- merge_events("excessive heat", "monkey attack", data)
  data <- merge_events("heat", "HEAT", data)
  data <- merge_events("unseasonably warm", "HEAT", data)
  data <- merge_events("monkey attack", "EXCESSIVE HEAT", data)
  data <- merge_events("hurricane", "HURRICANE", data)

  # Wind.
  data <- merge_events("marine high wind", "znevar uvtu jvaq", data)
  data <- merge_events("high.*wind", "HIGH WIND", data)
  data <- merge_events("znevar uvtu jvaq", "MARINE HIGH WIND", data)
  data <- merge_events("gusty", "HIGH WIND", data)
  data <- merge_events("winds", "HIGH WIND", data)
  data <- merge_events("wind and wave", "MARINE HIGH WIND", data)
  data <- merge_events("strong wave", "HIGH WIND", data)
  data <- merge_events("rain/wind", "HIGH WIND", data)
  data <- merge_events("wind damage", "HIGH WIND", data)
  data <- merge_events("whirlwind", "HIGH WIND", data)
  data <- merge_events("wind storm", "HIGH WIND", data)
  data <- merge_events("gradient wind", "HIGH WIND", data)
  data <- merge_events("^wind$", "HIGH WIND", data)
  data <- merge_events("strong wind", "HIGH WIND", data)

  data <- merge_events("surf", "HIGH SURF", data)
  data <- merge_events("heavy swells", "HIGH SURF", data)
  data <- merge_events("heavy seas", "HIGH SURF", data)
  data <- merge_events("heavy rain", "HEAVY RAIN", data)
  data <- merge_events("heavy.*snow", "HEAVY SNOW", data)
  data <- merge_events("heavy shower", "HEAVY RAIN", data)
  data <- merge_events("heavy precipitation", "HEAVY RAIN", data)
  data <- merge_events("mixed precip", "HEAVY RAIN", data)
  data <- merge_events("heavy mix", "HEAVY RAIN", data)
  data <- merge_events("wet", "HEAVY RAIN", data)
  data <- merge_events("downburst", "HEAVY RAIN", data)

  # Fire
  data <- merge_events("fire", "WILDFIRE", data)

  # Snow
  data <- merge_events("la.*effect snow", "YNXR-RSSRPG FABJ", data)
  data <- merge_events("snow", "HEAVY SNOW", data)
  data <- merge_events("YNXR-RSSRPG FABJ", "LAKE-EFFECT SNOW", data)

  # Rip Current
  data <- merge_events("rip current", "RIP CURRENT", data)

  # Rain
  data <- merge_events("rain", "HEAVY RAIN", data)

  # Tropical Storm
  data <- merge_events("tropical storm", "TROPICAL STORM", data)
  data <- merge_events("ice", "ICE STORM", data)
  data <- merge_events("icy", "ICE STORM", data)
  data <- merge_events("glaze", "ICE STORM", data)
  data <- merge_events("coastalstorm", "COASTAL STORM", data)
  data <- merge_events("surge", "STORM SURGE/TIDE", data)
  data <- merge_events("rapidly rising water", "STORM SURGE/TIDE", data)
  data <- merge_events("high water", "STORM SURGE/TIDE", data)

  return(data)
}
data <- quash_evtype(data)

The number of unique values for EVTYPE is much reduced:

length(unique(data$EVTYPE))
## [1] 78

There are still EVTYPE values outside of the official list, but for purposes of this analysis I’ll ignore them as outliers. This report is implemented in a reproducible manner; if necessary, further work can be done on refining the list of EVTYPE values by modifying the quash_evtype function.

Results

The importance of various weather event types can be measured by harm to human beings, or by economic damage. Plotting the top event types by fatalities, injuries, property damange and crop damage should provide some quick insight into which types of weather events are of most concern.

Fatalities, injuries by event type

par(bg = "light gray")
par(mfrow = c(2, 1))
par(mar=c(5.1, 7.1, 4.1, 2.1))
num_event_types <- 5
fatalities <- aggregate(FATALITIES ~ EVTYPE, FUN=sum, data=data)
fatalities <- arrange(fatalities, desc(FATALITIES))
injuries <- aggregate(INJURIES ~ EVTYPE, FUN=sum, data=data)
injuries <- arrange(injuries, desc(INJURIES))

barplot(
  col = "red",
  xlab = "Fatalities per event type",
  fatalities$FATALITIES[1:num_event_types],
  horiz=T,
  names.arg=fatalities$EVTYPE[1:num_event_types],
  las=1,
  xlim = c(0, 1.2 * max(fatalities$FATALITIES)),
  cex.names=0.5
)
barplot(
  col = "red",
  xlab = "Injuries per event type",
  injuries$INJURIES[1:num_event_types],
  horiz=T,
  names.arg=injuries$EVTYPE[1:num_event_types],
  las=1,
  xlim = c(0, 1.3 * max(injuries$INJURIES)),
  cex.names=0.5
)

The plots show that excessive heat, tornadoes, and floods/flash floods are of the most concern, with tornadoes in the top three list for both fatalities and injuries. These facts may be used to motivate increased investment in early warning systems for tornado and flood events.

Property damage by event type

num_event_types <- 10
prop <- aggregate(PROP_COST ~ EVTYPE, FUN=sum, data=data)
prop$PROP_COST <- prop$PROP_COST / 1e9
prop <- arrange(prop, desc(PROP_COST))
par(bg = "light gray")
par(mfrow = c(1, 1))
par(mar=c(5.1, 6.1, 4.1, 2.1))
barplot(
  col = "red",
  xlab = "Property damage by event type ($US billion)",
  prop$PROP_COST[1:num_event_types],
  horiz=T,
  names.arg=prop$EVTYPE[1:num_event_types],
  las=1,
  cex.names=0.5
)

The plot shows that hurricanes are by a strong margin the major cause of property damage, followed by the water-related events storm surge/tide, flood and flash flood. Tornadoes rate a bit less, perhaps due to their localized effects. It should be noted that the range of dates in which data were collected included Hurricane Katrina, which by itself cost more than $30 billion of the approximately $80 billion dollars in property damage; if not for Katrina, hurricane property damage would be about on par with storm surge/tide.

The data could suggest that infrastructural investment to reduce property damage should focus on rather costly improvements to coastal barriers. Key point: as a society, would we rather invest, say, $100 billion in coastal improvements, or $100 billion in cleanup costs?

Crop damage by event type

par(bg = "light gray")
par(mar=c(5.1, 7.1, 4.1, 2.1))
crop <- aggregate(CROP_COST ~ EVTYPE, FUN=sum, data=data)
crop$CROP_COST = crop$CROP_COST / 1e9
crop <- arrange(crop, desc(CROP_COST))
barplot(
  col = "red",
  xlab = "Crop damage by event type ($US billion)",
  crop$CROP_COST[1:num_event_types],
  horiz=T,
  names.arg=crop$EVTYPE[1:num_event_types],
  las=1,
  xlim = c(0, 1.1 * max(crop$CROP_COST)),
  cex.names=0.5
)

For crop damage, drought is the clear winner, followed by hurricane and flood. If only we could introduce some of the hurricane rain and flood waters to the drought-stricken regions, we’d have a nice savings. The results could be used in a few ways: for example, the cost of drought might motivate dam construction. A different strategy would be to promote the use of more drought-tolerant crops in regions that have historically suffered from drought.

Further work

The work can be extended in several obvious ways.

  1. Further refinement of EVTYPE tidying. Instrumentation of EVTYPE processing would be useful - for example, to enable a report of which EVTYPE values from the raw data were converted into the final values.
  2. Further investigation into “oddball” PROPDMGEXP and CROPDMGEXP values. Where do these values come from? Are original, scanned documents available, which might reveal that transcription errors occurred?