Synopsis

Severe weather events can cause both population health and economic problems. This report analyzed the data from the U.S. National Oceanic and Atmospheric Administration to identify the events that are most impactful to the population health and economy. The impact was measured by the total number of fatalities and injuries for the population health, and the total dollar amount of property and crop damage for economic impact between 1950 and 2011. Tornado is identified as the event with the greatest population health impact, and flooding is for the economic impact.

Introduction

This report was written as the final project for Reproducible Research course on coursera, taught by Roger D. Peng, PhD, Jeff Leek, PhD, and Brian Caffo, PhD at Johns Hopkins University Bloomberg School of Public Health.

The objective of this report is to address following two questions:

This report analyzes the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

The rest of the report describes the steps and results of the analysis in details, including R scripts and outputs to reproduce the results.

Data Processing

Loading raw data set

A copy of the data is available from the course web site.

The text file is in a comma separated value (CSV) format, and it is compressed with bzip2 algorithm. The following R code downloads, decompress, and loads the data into R:

library(knitr)
opts_chunk$set(echo=TRUE, cache=TRUE, fig.height=15)

options(scipen = 20, digits = 4)

knit_hooks$set(html.cap = function(before, options, envir) {
  if(!before) {
    paste('<p class="caption">',options$html.cap,"</p>",sep="")
    }
})

# install.packages("R.utils")
library(R.utils)
library(RCurl)

data_url = paste("https://d396qusza40orc.cloudfront.net",
                 "repdata%2Fdata%2FStormData.csv.bz2",
                 sep="/")
data_bz2 = "repdata-data-StormData.csv.bz2"
data_csv = "repdata-data-StormData.csv"
data_raw = "data_raw.rd"
if (file.exists(data_raw)) {
    # Loading from cached raw data.
    load(file=data_raw)
} else {
    if (!file.exists(data_csv)) {
        if (!file.exists(data_bz2)) {
            # Download the data
            download.file(url=data_url, destfile=data_bz2, method="wget")
            }
        bunzip2(data_bz2, data_csv)
        }
    data = read.csv(data_csv, header=TRUE)
    save(data, file=data_raw)
}

The loaded data has 902297 entries:

length(data[,1])
## [1] 902297

Processing raw data for the analysis

Identifying NOAA event types

Types of events are indicated in the EVTYPE variable. NOAA document document specifies 48 types of events. However, the raw data contains 985 different type of entries in EVTYPE:

length(levels(data$EVTYPE))
## [1] 985

Cleaning up the white space and make all upper case will give a slightly smaller number:

library(stringr)
data$evtype_clean = as.factor(toupper(str_trim(data$EVTYPE)))
length(levels(data$evtype_clean))
## [1] 890

Based on the names of 48 NOAA event types, the following regular expression table is created:

# Rerodered so that simple word like "Flood" won't miscategrize "Coastal Flood"
regex_list = '
"index","name","regex"
1,"Astronomical Low Tide","(\\W|^)low tide(\\W|$)"
2,"Avalanche","(\\W|^)avalanche(\\W|$)"
3,"Blizzard","(\\W|^)blizzard(\\W|$)"
4,"Coastal Flood","(\\W|^)coastal(\\W|$)"
6,"Debris Flow","(\\W|^)debris(\\W|$)"
7,"Dense Fog","(\\W|^)(dense fog|fog dense)(\\W|$)"
8,"Dense Smoke","(\\W|^)(dense smoke|smoke dense)(\\W|$)"
9,"Drought","(\\W|^)drought(\\W|$)"
10,"Dust Devil","(\\W|^)(dust devil|devil dust)(\\W|$)"
11,"Dust Storm","(\\W|^)(dust storm|storm dust)(\\W|$)"
12,"Excessive Heat","(\\W|^)(excessive heat|heat excessive)(\\W|$)"
13,"Extreme Cold/Wind Chill","(\\W|^)extreme.*(cold|wind[s]* chill)(\\W|$)"
14,"Flash Flood","(\\W|^)flash flood(\\W|$)"
16,"Frost/Freeze","(\\W|^)(frost|freeze)(\\W|$)"
17,"Funnel Cloud","(\\W|^)funnel(\\W|$)"
18,"Freezing Fog","(\\W|^)(freezing fog|fog freezing)(\\W|$)"
21,"Heavy Rain","(\\W|^)(heavy rain|rain heavy)(\\W|$)"
22,"Heavy Snow","(\\W|^)(heavy snow|snow heavy)(\\W|$)"
23,"High Surf","(\\W|^)surf(\\W|$)"
24,"High Wind","(\\W|^)(high wind[s]*|wind[s]* high)(\\W|$)"
25,"Hurricane (Typhoon)","(\\W|^)(hurricane|typhoon)(\\W|$)"
26,"Ice Storm","(\\W|^)ice storm(\\W|$)"
27,"Lake-Effect Snow","(\\W|^)lake[ -]effect snow(\\W|$)"
28,"Lakeshore Flood","(\\W|^)lakeshore flood(\\W|$)"
29,"Lightning","(\\W|^)lightning(\\W|$)"
30,"Marine Hail","(\\W|^)(marine hail|hail marine)(\\W|$)"
31,"Marine High Wind","(\\W|^)(marine high wind[s]*|high wind[s]*[ ,]marine)(\\W|$)"
32,"Marine Strong Wind","(\\W|^)(marine strong wind|strong wind[s]*[ ,]marine)(\\W|$)"
33,"Marine Thunderstorm Wind","(\\W|^)(marine thunderstorm wind[s]*|thunderstorm wind[s]*[ ,]marine)(\\W|$)"
34,"Rip Current","(\\W|^)rip current(\\W|$)"
35,"Seiche","(\\W|^)seiche(\\W|$)"
36,"Sleet","(\\W|^)sleet(\\W|$)"
37,"Storm Surge/Tide","(\\W|^)(storm surge|storm tide)(\\W|$)"
38,"Strong Wind","(\\W|^)(strong wind[s]*|wind[s]*[ ,]strong)(\\W|$)"
39,"Thunderstorm Wind","(\\W|^)(tstm wind[s]*|thunderstorm[s]* wind[s]*|wind[s]*[ ,]thunderstorm)(\\W|$)"
40,"Tornado","(\\W|^)tornado(\\W|$)"
41,"Tropical Depression","(\\W|^)tropical depression(\\W|$)"
42,"Tropical Storm","(\\W|^)(tropical storm|storm[ ,]tropical)(\\W|$)"
43,"Tsunami","(\\W|^)tsunami(\\W|$)"
44,"Volcanic Ash","(\\W|^)volcanic ash(\\W|$)"
45,"Waterspout","(\\W|^)waterspout(\\W|$)"
46,"Wildfire","(\\W|^)wildfire(\\W|$)"
47,"Winter Storm","(\\W|^)(winter storm|storm[ ,]winter)(\\W|$)"
48,"Winter Weather","(\\W|^)(winter weather|weather[ ,]winter)(\\W|$)"
5,"Cold/Wind Chill","(\\W|^)(low temperature[s]*|cold|wind[s]* chill)(\\W|$)"
15,"Flood","(\\W|^)(flood|flooding)(\\W|$)"
19,"Hail","(\\W|^)hail(\\W|$)"
20,"Heat","(\\W|^)(heat|high temperature[s]*)(\\W|$)"
49,"Unclassified",".*"
'
regex_df = read.csv(text=regex_list, header=T)

The regex filter is applied to categorize the result into 48 event types. Rather than directly applying the regex classifier to all 902297 entries, it was first applied to 985 unique EVTYPE, create a hash table so it has the quick look up when applied to the entire data set in the later use:

library(hash)
# Create factor of cleaned events
evtype = data.frame(key=unique(as.character(data$evtype_clean)))
evtype$clean = vector(mode="character", length(evtype[,1]))
evtype_hash = hash()
for (i in 1:length(evtype[,1])) {
    for (j in 1:length(regex_df[,1])) {
        if (grepl(as.character(regex_df[j,]$regex), tolower(evtype[i,]$key))){
            evtype[i,]$clean = as.character(regex_df[j,]$name)
            evtype_hash[[as.character(evtype[i,]$key)]] =
                as.character(regex_df[j,]$name)
            break
        }
    }
}

Justification of this approach

Among the entire NOAA 48 types, 48 types appeared in the unique 985 EVTYPE. Among 985 unique EVTYPE, 404 were categorized as “Unclassified”. But, many of the entries are actually difficult to categorize into one of 48 NOAA types even with human judgement. Here are some of such entries:

head(evtype[evtype$clean=="Unclassified",]$key, 50)
##  [1] FREEZING RAIN        SNOW                 SNOW/ICE            
##  [4] THUNDERSTORM WINS    WIND                 LIGHTING            
##  [7] HEAVY RAINS          WALL CLOUD           THUNDERSTORM        
## [10] HIGH TIDES           RECORD HIGH          RECORD LOW          
## [13] MARINE MISHAP        HIGH SEAS            SEVERE TURBULENCE   
## [16] RECORD RAINFALL      RECORD SNOWFALL      RECORD WARMTH       
## [19] WIND DAMAGE          APACHE COUNTY        FLASH FLOODS        
## [22] GUSTY WINDS          SNOW AND WIND        HEAVY PRECIPATATION 
## [25] BLOWING DUST         URBAN/SMALL          WILD FIRES          
## [28] HIGH                 WATER SPOUT          WINTER STORMS       
## [31] MUDSLIDES            RAINSTORM            SEVERE THUNDERSTORM 
## [34] SEVERE THUNDERSTORMS DRY MICROBURST       WINDS               
## [37] DRY MICROBURST 61    THUNDERSTORMS        MICROBURST          
## [40] URBAN AND SMALL      HEAVY SNOWPACK       ICE                 
## [43] DOWNBURST            GUSTNADO AND         WET MICROBURST      
## [46] DOWNBURST WINDS      DRY MICROBURST WINDS DRY MIRCOBURST WINDS
## [49] DRY MICROBURST 53    MICROBURST WINDS    
## 890 Levels: ? ABNORMAL WARMTH ABNORMALLY DRY ... WND

Results

Impact to population health

Q1. Across the United States, which types of events are most harmful with respect to population health?

Among the columns, the measurement for population health are FATALITIES, and INJURIES.

fatalities_sum = aggregate(data$FATALITIES, by=list(data$evtype_clean), FUN=sum)
colnames(fatalities_sum) = c("evtype", "sum")
injuries_sum = aggregate(data$INJURIES, by=list(data$evtype_clean), FUN=sum)
colnames(injuries_sum) = c("evtype", "sum")

The event types here are all pre-classification. So apply the hash table to consolidate the sum by NOAA category.

classify_evtype = function(sum_data) {
    sum_data$noaa_evtype = vector(mode="character", length=length(sum_data[,1]))        
    for (i in 1:length(sum_data[,1])){
        sum_data[i,]$noaa_evtype =
            as.character(evtype_hash[[as.character(sum_data[i,]$evtype)]])
        }
    sum_data = aggregate(sum_data$sum, by=list(sum_data$noaa_evtype), FUN=sum)
    colnames(sum_data) = c("noaa_evtype", "sum")
    # Order by impact
    sum_data = sum_data[with(sum_data, order(sum)),]
    # Relevel based on the order
    sum_data$noaa_evtype =
        factor(sum_data$noaa_evtype, levels=sum_data$noaa_evtype)
    return(sum_data)
}
fatalities_sum = classify_evtype(fatalities_sum)
injuries_sum = classify_evtype(injuries_sum)

Figure 1 is the plot of the total number of fatalities and injuries between 1950 and 2011 by event types:

library(gridExtra)
library(ggplot2)

fatalities_plot = ggplot(data=fatalities_sum, aes(x=noaa_evtype, y=sum)) +
    geom_bar(stat="identity") + coord_flip() +
    labs(x="NOAA Event Type", y="Total fatalities",
        title="Total fatalities")
injuries_plot = ggplot(data=injuries_sum, aes(x=noaa_evtype, y=sum)) +
           geom_bar(stat="identity") + coord_flip() +
    labs(x="NOAA Event Type", y="Total injuries",
        title="Total injuries")
grid.arrange(fatalities_plot, injuries_plot, nrow=2)

Figure 1: The total number of fatalities and injuries between 1950 and 2011 by event types

Tornado is by far the most harmful to population health both from fatalities and injuries.

Impact to the economy

Q2. Across the United States, which types of events have the greatest economic consequences?

The NOAA document says:

“Estimates should be rounded to three significant digits, followed by an alphabetical character signifying the magnitude of the number, i.e., 1.55B for $1,550,000,000. Alphabetical characters used to signify magnitude include”K" for thousands, “M” for millions, and “B” for billions." (p.12)

Among the columns, the measurement for economic consequences are the damage to properties and crops, and their 3 significant digits are stored at the columns PROPDMG and CROPDMG, respectively. The magnitude of the number is stored at PROPDMGEXP and CROPDMGEXP.

The symbols found in PROPDMGEXP and CROPDMGEXP contains more than K, M, or B:

unique(data$PROPDMGEXP)
##  [1] K M   B m + 0 5 6 ? 4 2 3 h 7 H - 1 8
## Levels:  - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
unique(data$CROPDMGEXP)
## [1]   M K m B ? 0 k 2
## Levels:  ? 0 2 B k K m M

So in this report, I will convert to:

  • H or h to 10^2
  • K or k to 10^3
  • M or m to 10^6
  • B or b to 10^9
  • 1~8 to 10^(1~8)
  • Everything else to 1

and store as propdmg_mag and cropdmg_mag.

After the conversion, the dollar value is calculated by multiplying PROPDMG and propdmg_mag, also CROPDMG and cropdmg_mag:

library(plyr)
data$propdmg_mag = as.numeric(as.character(revalue(
    data$PROPDMGEXP,
    c(
        "H"="100",
        "h"="100",
        "K"="1000",
        "k"="1000",
        "M"="1000000",
        "m"="1000000",
        "B"="1000000000",
        "b"="1000000000",
        "0"="1",
        "1"="10",
        "2"="100",
        "3"="1000",
        "4"="10000",
        "5"="100000",
        "6"="1000000",
        "7"="10000000",
        "8"="100000000",
        "+"="1",
        "-"="1",
        "?"="1",
        " "="1"
        ))))
## The following `from` values were not present in `x`: k, b,
# NA for MAG is 1 not zero
data[is.na(data$propdmg_mag),]$propdmg_mag = 1

data$cropdmg_mag = as.numeric(as.character(revalue(
    data$CROPDMGEXP,
    c(
        "H"="100",
        "h"="100",
        "K"="1000",
        "k"="1000",
        "M"="1000000",
        "m"="1000000",
        "B"="1000000000",
        "b"="1000000000",
        "0"="1",
        "1"="10",
        "2"="100",
        "3"="1000",
        "4"="10000",
        "5"="100000",
        "6"="1000000",
        "7"="10000000",
        "8"="100000000",
        "+"="1",
        "-"="1",
        "?"="1",
        " "="1"
        ))))
## The following `from` values were not present in `x`: H, h, b, 1, 3, 4, 5, 6, 7, 8, +, -,
# NA for MAG is 1 not zero
data[is.na(data$cropdmg_mag),]$cropdmg_mag = 1

data$propdmg_dollar = data$PROPDMG * data$propdmg_mag
data$cropdmg_dollar = data$CROPDMG * data$cropdmg_mag

As I did in population health impact, aggregate the sum, classifying into 48 NOAA types, then ordering by the impact yields the following plots as in Fig. 2.

propdmg_sum = aggregate(data$propdmg_dollar, by=list(data$evtype_clean), FUN=sum)
colnames(propdmg_sum) = c("evtype", "sum")
propdmg_sum = classify_evtype(propdmg_sum)

cropdmg_sum = aggregate(data$cropdmg_dollar, by=list(data$evtype_clean), FUN=sum)
colnames(cropdmg_sum) = c("evtype", "sum")
cropdmg_sum = classify_evtype(cropdmg_sum)

propdmg_plot = ggplot(data=propdmg_sum, aes(x=noaa_evtype, y=sum)) +
           geom_bar(stat="identity") + coord_flip() +
    labs(x="NOAA Event Type", y="Total damage(US dollar)",
        title="Total property damange")
cropdmg_plot = ggplot(data=cropdmg_sum, aes(x=noaa_evtype, y=sum)) +
           geom_bar(stat="identity") + coord_flip() +
    labs(x="NOAA Event Type", y="Total damage(US dollar)",
        title="Total crop damange")
grid.arrange(propdmg_plot, cropdmg_plot, nrow=2)

Figure 2: The total damange of property and crop in US dollar between 1950 and 2011 by event types

As a total between property and crop damage, flooding made the greatest economic impact. (Flooding made the greatest total property damage, and drought was the greatest for crop damage. Flooding is the second in the crop damage, making total the greatest.)