28 December 2017

Synopsis

We used the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database1 to examine some of the health and economic consequences of severe weather events over the period from 1950 to 2011. Weather events from the NOAA database were reclassified into 13 categories for analysis. Hurricanes and extreme heat events had the greatest health consequences in terms of number of fatalities and number of injuries caused. Hurricanes by far had the greatest economic impact in terms of property damage, whereas ice-related events caused the greatest amount of crop damage.

Data Processing

(Note: In this document, the text is accompanied by blocks of R code that perform computations and produce tables and graphics. Comments within the code, denoted by leading # symbols, provide further details and documentation.)

The raw data were downloaded and uncompressed to a comma-separated values file named StormData.csv. Because no information was available as to the number of data records or variables in the dataset, we initially read just the first 10 rows of StormData.csv into a data.table named preliminarydata. (We use data.table instead of data.frame because of data.table’s enhanced speed and memory conservation when used with large datasets.)2 This allowed us to generate a list of column names from the raw data file. By using the command-line tool head on the raw data file StormData.csv we determined that missing values were denoted by empty strings. Thus we use the na.strings = "" option of the fread function3 to code empty strings as NA missing values.

# The code for downloading and extracting the raw data is wrapped in an if statement 
# that tests for the existence of the file `StormData.csv`. 
# The download code will only run if the CSV file does not already exist.

# temp <- if (!exists("StormData.csv")) # We capture annoying return values in variable `temp`.
# {
#     download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
#         "StormData.bz2", quiet = TRUE)
#     bunzip2("StormData.bz2", "StormData.csv", remove = FALSE, skip = TRUE)
# }
# 
# rm(temp)

# Load a small part of the dataset and generate a list of variables.

preliminarydata <- fread("StormData.csv", nrows = 10, na.strings = "", showProgress = FALSE)
n <- names(preliminarydata)

There were 37 variables in the raw data file. Variable names are: STATE__, BGN_DATE, BGN_TIME, TIME_ZONE, COUNTY, COUNTYNAME, STATE, EVTYPE, BGN_RANGE, BGN_AZI, BGN_LOCATI, END_DATE, END_TIME, COUNTY_END, COUNTYENDN, END_RANGE, END_AZI, END_LOCATI, LENGTH, WIDTH, F, MAG, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP, WFO, STATEOFFIC, ZONENAMES, LATITUDE, LONGITUDE, LATITUDE_E, _LONGITUDE__, REMARKS and REFNUM.

The NOAA website publishes detailed information about the fields in their storm events database.4 From this documentation we determined that a subset (Table 1) of the 37 original variables were most relevant to questions regarding the health and economic consequences of severe weather events.

# Create a data frame containing the relevant variable names and their descriptions
# and create a descriptive table.

varNames <- c("BGN_DATE", "STATE", "EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", 
              "CROPDMG", "CROPDMGEXP")

desc <- c("Date the storm event began", "Two-letter state abbreviation", "Type of storm event", 
          "Number directly killed", "Number directly injured", "Property damage in whole numbers 
          and hundredths", "Multiplier for PROPDMG", "Crop damage in whole numbers and hundredths", 
          "Multiplier for CROPDMG")

units <- c("Character", "Character", "Character", "Numeric", "Numeric",
           "Numeric", "Character", "Numeric", "Character")

tab1 <- cbind(varNames, desc, units)

tab1 %>%
  kable("html", caption = "Table 1. Variables selected for analysis.", 
    col.names = c("Variable name", "Description", "Data type")) %>%   
    kable_styling(bootstrap_options = "bordered", full_width = F, position="center")
Table 1. Variables selected for analysis.
Variable name Description Data type
BGN_DATE Date the storm event began Character
STATE Two-letter state abbreviation Character
EVTYPE Type of storm event Character
FATALITIES Number directly killed Numeric
INJURIES Number directly injured Numeric
PROPDMG Property damage in whole numbers and hundredths Numeric
PROPDMGEXP Multiplier for PROPDMG Character
CROPDMG Crop damage in whole numbers and hundredths Numeric
CROPDMGEXP Multiplier for CROPDMG Character

The NOAA website states that variables PROPDMGEXP and CROPDMGEXP are multipliers coded as H, K, M, and B for Hundreds, Thousands, Millions and Billions respectively. The numeric values of PROPDMG and CROPDMG are to be multiplied by 100, 1000, 1000000, or 1000000000 according to the codes in their respective PROPDMGEXP and CROPDMGEXP variables. This calculation generates the variables PROPCASH and CROPCASH, which contain the actual dollar amounts of property and crop damage.

# Read in the full dataset `StormData.csv`, retaining only relevant variables.
# Convert empty strings to NA.

options(scipen=999) # Discourage scientific notation.

# Read the raw data into a data.table.
fulldata <- fread("StormData.csv", na.strings = "", verbose = FALSE,
              showProgress = FALSE,
              select = c("BGN_DATE", "STATE", "EVTYPE", "FATALITIES", 
                         "INJURIES", "PROPDMG", "PROPDMGEXP",
                         "CROPDMG", "CROPDMGEXP"))

# Recode numeric multipliers and create new variables.
fulldata[, c("PROPCASH", "CROPCASH", "PROPMULT", "CROPMULT") := 0]
fulldata[, PROPDMGEXP := toupper(PROPDMGEXP)]
fulldata[, EVTYPE := toupper(EVTYPE)]
fulldata[, CROPDMGEXP := toupper(CROPDMGEXP)]
fulldata[PROPDMGEXP == "H", PROPMULT := 100]
fulldata[PROPDMGEXP == "K", PROPMULT := 1000]
fulldata[PROPDMGEXP == "M", PROPMULT := 1000000]
fulldata[PROPDMGEXP == "B", PROPMULT := 1000000000]
fulldata[CROPDMGEXP == "H", CROPMULT := 100]
fulldata[CROPDMGEXP == "K", CROPMULT := 1000]
fulldata[CROPDMGEXP == "M", CROPMULT := 1000000]
fulldata[CROPDMGEXP == "B", CROPMULT := 1000000000]
fulldata[, PROPCASH := PROPDMG * PROPMULT]
fulldata[, CROPCASH := CROPDMG * CROPMULT]

# Create EVENT variable and collapse weather event categories
# from variable EVTYPE. Everything else is `OTHER`.
fulldata[, "EVENT" := "OTHER"]
fulldata[str_detect(EVTYPE, "TORNADO"), EVENT := "TORNADO"]
fulldata[str_detect(EVTYPE, "FUNNEL"), EVENT := "TORNADO"]
fulldata[str_detect(EVTYPE, "WIND"), EVENT := "WIND"]
fulldata[str_detect(EVTYPE, "COLD"), EVENT := "COLD"]
fulldata[str_detect(EVTYPE, "FREEZ"), EVENT := "COLD"]
fulldata[str_detect(EVTYPE, "CHILL"), EVENT := "COLD"]
fulldata[str_detect(EVTYPE, "FROST"), EVENT := "COLD"]
fulldata[str_detect(EVTYPE, "HEAT"), EVENT := "HEAT"]
fulldata[str_detect(EVTYPE, "DROUGHT"), EVENT := "HEAT"]
fulldata[str_detect(EVTYPE, "DRY"), EVENT := "HEAT"]
fulldata[str_detect(EVTYPE, "WARM"), EVENT := "HEAT"]
fulldata[str_detect(EVTYPE, "HOT"), EVENT := "HEAT"]
fulldata[str_detect(EVTYPE, "FLOOD"), EVENT := "FLOOD"]
fulldata[str_detect(EVTYPE, "HURRICANE"), EVENT := "HURRICANE"]
fulldata[str_detect(EVTYPE, "DEPRESSION"), EVENT := "HURRICANE"]
fulldata[str_detect(EVTYPE, "BLIZZARD"), EVENT := "WINTER STORM"]
fulldata[str_detect(EVTYPE, "SNOW"), EVENT := "WINTER STORM"]
fulldata[str_detect(EVTYPE, "WINTER"), EVENT := "WINTER STORM"]
fulldata[str_detect(EVTYPE, "HAIL"), EVENT := "HAIL"]
fulldata[str_detect(EVTYPE, "RAIN"), EVENT := "RAIN"]
fulldata[str_detect(EVTYPE, "LIGHTNING"), EVENT := "LIGHTNING"]
fulldata[str_detect(EVTYPE, "THUNDERSTORM"), EVENT := "THUNDERSTORM"]
fulldata[str_detect(EVTYPE, "ICE"), EVENT := "ICE"]
fulldata[str_detect(EVTYPE, "SLEET"), EVENT := "ICE"]
fulldata[str_detect(EVTYPE, "ICY"), EVENT := "ICE"]

# Delete records that are not in the contiguous 48 states.
nope <- c("AK", "HI")
contiguous <- setdiff(state.abb, nope)
fulldata <- subset(fulldata, STATE %in% contiguous)

However, when we created tables of the variables PROPDMGEXP and CROPDMGEXP using the full dataset, we found that the data for these two variables were quite messy.

Values of PROPDMGEXP:

- ? + 0 1 2 3 4 5 6 7 8 B H K M
1 8 3 215 25 13 4 4 28 4 5 1 39 7 412574 11201

Values of CROPDMGEXP:

? 0 2 B K M
7 19 1 9 270857 1966

According to NOAA’s documentation, variables PROPDMGEXP and CROPDMGEXP are supposed to contain only the values H, K, M, or B. With no other information available, we assume that data entry errors caused the wide range of spurious values shown in the frequency tables above. To control some of this, we converted PROPDMGEXP and CROPDMGEXP to upper case and disregarded all other values that were not H, K, M, or B. (We do the disregarding in the code for producing graphics; see below.)

Additional preliminary analysis showed that the variable EVTYPE, containing the type of storm event (Table 1), had nearly 1000 distinct types of weather events. Many of these weather events occurred only once over the time period from 1950 to 2011. We thus created a new variable, EVENT, and collapsed the 985 categories of EVTYPE into 13 for our analysis. The categories of EVENT were: COLD, FLOOD, HAIL, HEAT, HURRICANE, ICE, LIGHTNING, RAIN, THUNDERSTORM, TORNADO, WIND, WINTER STORM and OTHER. The data.table transformation statements used to create the EVENT variable are shown in the above R code chunk.

Results

Our analysis focused on the following two questions:

  1. Across the United States, which types of weather events are most harmful with respect to population health?

  2. Across the United States, which types of weather events have the greatest economic consequences?

In our analysis, “population health” is represented by the variables FATALITIES and INJURIES, and “economic consequences” by the variables PROPCASH and CROPCASH. Analysis was limited to the 48 contiguous states.

Heat and hurricane events were the most harmful in terms of population health (Figure 1), causing the most fatalities and injuries of the 13 categories of weather events examined. Tornadoes were the third most harmful weather event in terms of number of injuries caused.

In terms of economic consequences, hurricanes were by far the most harmful weather event, causing on average nearly 400 million dollars worth of property damage per event (Figure 2). Various types of ice-related events caused the greatest amount of crop damage, followed by hurricanes and heat events.

options(scipen=999) # Discourage scientific notation.
par(mfrow=c(2,1), mai=c(1.4, 1, 1, 1))

# Compute means of FATALITIES and INJURIES for the 13 weather events.
# We then order the data.table by EVENT so that weather events will
# be in the same order along the x-axis of each bar plot.
fig1a <- fulldata[, mean(FATALITIES, na.rm = TRUE), by=EVENT]
fig1a <- fig1a[order(EVENT),]
fig1b <- fulldata[, mean(INJURIES, na.rm = TRUE), by=EVENT]
fig1b <- fig1b[order(EVENT),]
barplot(fig1a$V1, names.arg = fig1a$EVENT, cex.names = 0.7, 
        main = "Fatalities", las = 2, ylab = "Mean number of fatalities per event")
barplot(fig1b$V1, names.arg = fig1b$EVENT, cex.names = 0.7,
        main = "Injuries", las = 2, ylab = "Mean number of injuries per event")
Figure 1. Mean number of fatalities and injuries directly caused by various weather events in the contiguous 48 states.

Figure 1. Mean number of fatalities and injuries directly caused by various weather events in the contiguous 48 states.

options(scipen=999) # Discourage scientific notation.
par(mfrow=c(2,1), mai=c(1.4, 1, 1, 1))

# Compute means of PROPCASH and CROPCASH for the 13 weather events.
# The statement `PROPCASH >= 100` eliminates any records in which
# the multiplier PROPDMGEXP was not `H`, `K`, `M`, or `B`.
# In other words, we only use weather events that caused at least
# $100 in property damage.
# We then order the data.table by EVENT so that weather events will
# be in the same order along the x-axis of each bar plot.
fig2a <- fulldata[PROPCASH >= 100, mean(PROPCASH, na.rm = TRUE), by=EVENT]
fig2a <- fig2a[order(EVENT),]
fig2b <- fulldata[CROPCASH >= 100, mean(CROPCASH, na.rm = TRUE), by=EVENT]
fig2b <- fig2b[order(EVENT),]
barplot(fig2a$V1/1000000, names.arg = fig2a$EVENT, cex.names = 0.7, 
        main = "Property damage", las = 2, ylim = c(0, 400), ylab = "Millions of dollars per event")
barplot(fig2b$V1/1000000, names.arg = fig2b$EVENT, cex.names = 0.7,
        main = "Crop damage", las = 2, ylim = c(0, 200), ylab = "Millions of dollars per event")
Figure 2. Average property and crop damage (in millions of dollars) caused by various weather events in the contiguous 48 states.

Figure 2. Average property and crop damage (in millions of dollars) caused by various weather events in the contiguous 48 states.


  1. See https://www.ncdc.noaa.gov/stormevents/.

  2. See data.table() vs data.frame(): Learn to work on large data sets in R.

  3. See fread.

  4. See Storm Events Database.