Contents

  1. Synopsis: a brief overview of the questions and goals of the analysis
  2. Scope and Limitations: links and notes on accuracy
  3. Data Processing: reproducible instructions of how data was processed, from raw to tidier data formats
  4. Results: reproducible exploratory data analysis of three research questions with tidy data

Part 1. Synopsis

This analysis explores the National Oceanic and Atmospheric Administration’s (NOAA) Storm Database in answering three research questions:

  1. Across the United States, which types of events are most harmful with respect to population health?
  2. Across the United States, which types of events have the greatest economic consequences?
  3. How do the impacts on the U.S. population and economy compare across the various weather event types?

According to this analysis, and keeping in mind its scope and limitations (see Part 2.), brief answers to these questions are (see Part 4. for a fuller account):

  1. The most harmful weather events seem to be: tornadoes, extreme heat, flooding, strong winds (other than hurricanes, tornadoes), and lightning, and most of the harm is injurious, not fatal.
  2. The most expensive weather events seem to be: flooding, tropical storms (including hurricanes), coastal and marine weather, tornadoes, hail, and drought, and most of the damage is to properties, not crops.
  3. Since the impact of weather events vary greatly depending on whether one considers their cost or their harm, event types can be categorized into those dominated by harm to the population or economic cost.

Part 2. Scope and Limitations

This is the final course project for Course 5: Reproducible Research in the 9-Course ‘Data Science Specialization’ from Johns Hopkins University in Coursera. As such, the scope and aim of this project is not toward industry-standard accuracy in its findings, but merely whether the analysis is fully reproducible or not. These are a few limitations when considering the results of this project’s analysis:


Part 3. Data Processing

This analysis was performed with the following system specifications:

sessionInfo()
## R version 3.4.3 (2017-11-30)
## Platform: i386-w64-mingw32/i386 (32-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] compiler_3.4.3  backports_1.1.2 magrittr_1.5    rprojroot_1.3-1
##  [5] tools_3.4.3     htmltools_0.3.6 yaml_2.1.16     Rcpp_0.12.14   
##  [9] stringi_1.1.6   rmarkdown_1.8   knitr_1.17      stringr_1.2.0  
## [13] digest_0.6.13   evaluate_0.10.1

Clear Workspace and Load Packages

Caution: the first command will clear your R workspace.

Credit to Steven Worthington for the “ipak” function.

rm(list = ls())
ipak <- function(pkg){
    new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
    if (length(new.pkg)) 
        install.packages(new.pkg, dependencies = TRUE, warn.conflicts = FALSE)
    sapply(pkg, require, character.only = TRUE)
}
packages <- c("dplyr", "data.table", "ggplot2")
ipak(packages)
##      dplyr data.table    ggplot2 
##       TRUE       TRUE       TRUE

Download Data

if (!file.exists("data")) {
    dir.create("data")
}
fileURL <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(fileURL, destfile = "./data/StormData.csv.bz2")

Read Data into R

Since the storm data is large and takes several minutes to load, I subset the data by preprocessing steps determined after some initial exploratory analysis (not included). Here are some initial findings and how they guided the choice of subset:

  1. Between 1950 and 1990 there were only three types of events in the storm data: TORNADO, TSTM WIND, and HAIL. After 1990, the number of events spike to 913. The subset below (“noaa”) starts at 1990 to minimize bias toward those three events, creating a DATE variable from the BGN_DATE, which gets discarded.

  2. The Storm Data Documentation and FAQ page provide information about how the storm data is collected, but fails to describe the variables in the dataset. Lacking a proper CODEBOOK, the initial exploratory analysis showed that the appropriate variables for determining population harm are FATALITIES and INJURIES, and those for the cost of damages are PROPDMG (property damage) and CROPDMG. These variables contain simplified figures which need to be raised by the appropriate value in the corresponding exponent variables: PROPDMGEXP and CROPDMGEXP. Finally, the EVTYPE is a conditioning variable with names of all the types of weather events.

noaa <- data.table(read.table("./data/StormData.csv.bz2", sep = ",", header = TRUE, nrows = 902298,
                                                     na.strings = "", stringsAsFactors = FALSE)) %>%
        select(BGN_DATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP) %>%
        mutate(DATE = as.Date(BGN_DATE, format = "%m/%d/%Y %H:%M:%S")) %>%
        filter(year(DATE) > 1989) 

# discard old date variable BGN_DATE
noaa <- noaa[, -1] 

str(noaa)
## 'data.frame':    751740 obs. of  8 variables:
##  $ EVTYPE    : chr  "HAIL" "TSTM WIND" "TSTM WIND" "TSTM WIND" ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ INJURIES  : num  0 0 0 0 28 0 0 0 0 0 ...
##  $ PROPDMG   : num  0 0 0 0 2.5 0 0 0 25 25 ...
##  $ PROPDMGEXP: chr  NA NA NA NA ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  NA NA NA NA ...
##  $ DATE      : Date, format: "1990-01-05" "1990-01-20" ...

We immediately see that there are a lot of missing values and zeros in the data.


Cleaning the Data

Handling Missing Values

mean(is.na(noaa$PROPDMGEXP))
## [1] 0.460618
mean(is.na(noaa$CROPDMGEXP))
## [1] 0.6223641

Since a large portion of the data is missing for the exponent variables (46% and 62%), the wrong choice of replacement would have disastrous consequences. The integer 1 is chosen so if any figures were entered for property or crop damages with an NA exponent, we would not lose that figure. There are no other missing data in the “noaa” subset.

noaa$PROPDMGEXP[is.na(noaa$PROPDMGEXP)] <- 1L
noaa$CROPDMGEXP[is.na(noaa$CROPDMGEXP)] <- 1L

Handling Zeroes

We discard those observations in which all variables under consideration have a zero, considerably reducing the size of the “noaa” subset from roughly 752 to 230 thousand observations.

noaa <- noaa[noaa$FATALITIES != 0 | noaa$INJURIES != 0 | noaa$PROPDMG != 0 | noaa$CROPDMG != 0, ]
dim(noaa)
## [1] 230112      8

Handling the EVTYPE variable

The EVTYPE variable has several flaws, the first of which is random capitalization. For ease in cleaning, I capitalize everything.

noaa$EVTYPE <- toupper(noaa$EVTYPE)
length(unique(noaa$EVTYPE))
## [1] 447

Still contending with 447 unique event types, an exploratory analysis (not included) reveals a high number of misspellings, redundancies in terminology, and obscure entries.

We filter out a few obscure entries (such as “APACHE COUNTY”) and proceed to re-categorize various types of event types into broader categories for a streamlining the analysis. This process has a deep impact on the analysis as choices in categorization will lead to changes in results.

The new variable EVCAT has 23 broad event categories. A few notes on categorization choices:

  • cold takes precedence over other types of events when the word cold is included
  • heat takes precedence over other types of events, except for events that include the word drought
  • dust refers to events such as dust storms
  • hail refers to events that include hail but exclude tornadoes
  • hurricane refers to all tropical storm events
  • ice includes events that involved icy conditions, except those that included the word snow
  • landslide includes mudslides
  • lightning is considered separately from events which might involve lightning such as tornadoes, t-storms, etc.
  • ocean is a broad category including coastal and marine events; see code for further details on what words were included =TRUE and not included = FALSE
  • rain is a small subset of all events that might include rain, as flood or wind, among others, take precedence
  • wind burst includes events such as microbursts, downbursts and gustnadoes
  • wind includes events with wind speeds below those such as tornadoes and hurricanes, and is limited to inland events
  • winter includes observations named for that season, such as wintry mix
  • other includes the rest (listed below)
noaa <- noaa %>%    
    filter(EVTYPE != "APACHE COUNTY" & EVTYPE != "?" & EVTYPE != "HIGH" & EVTYPE != "URBAN AND SMALL" 
                                                                     & EVTYPE !=  "URBAN SMALL") %>%
    mutate(EVCAT = ifelse(grepl("COLD|CHILL|HYPOTHERMIA|LOW TEMP", EVTYPE) == TRUE, "cold", 
            ifelse(grepl("HEAT|HYPERTHERMIA|WARM", EVTYPE) == TRUE 
                & grepl("DROUGHT", EVTYPE) == FALSE, "heat", 
            ifelse(grepl("DROUGHT", EVTYPE) == TRUE, "drought",
            ifelse(grepl("DROWN", EVTYPE) == TRUE, "drowning",
            ifelse(grepl("DUST|DRY", EVTYPE) == TRUE 
                & grepl("WARM", EVTYPE) == FALSE, "dust",
            ifelse(grepl("FLOOD|FLD|RISING WATER|HIGH WATER", EVTYPE) == TRUE, "flood", 
            ifelse(grepl("FIRE|SMOKE", EVTYPE) == TRUE, "fire",
            ifelse(grepl("FOG", EVTYPE) == TRUE, "fog",
            ifelse(grepl("FREEZ|FROST", EVTYPE) == TRUE, "frost",
            ifelse(grepl("HAIL", EVTYPE) == TRUE 
                & grepl("TORNADO", EVTYPE) == FALSE, "hail",
            ifelse(grepl("HURRICANE|TYPHOON|TROPICAL", EVTYPE) == TRUE, "hurricane",
            ifelse(grepl("ICE|ICY|GLAZE", EVTYPE) == TRUE 
                & grepl("SNOW|FLOOD", EVTYPE) == FALSE, "ice",
            ifelse(grepl("LANDSLIDE|MUDSLIDE|MUD SLIDE|LANDSLUMP", EVTYPE) == TRUE, "landslide",
            ifelse(grepl("LIGHTNING|LIGHTING|LIGNTNING", EVTYPE) == TRUE, "lightning",
            ifelse(grepl("BEACH|COASTAL|CURRENT|MARINE|SEAS|SURGE|SURF|TIDE|TSUN|WATERSPOUT|WAVE",
                EVTYPE) == TRUE & grepl("UNSEASON|HEAT|COLD|FLOOD", EVTYPE) == FALSE, "ocean",
            ifelse(grepl("RAIN|SHOWER|PRECIP", EVTYPE) == TRUE 
                & grepl("SNOW|FLOOD|FREEZ|WIND", EVTYPE) == FALSE, "rain",
            ifelse(grepl("AVALAN|BLIZZ|SNOW", EVTYPE) == TRUE 
                & grepl("COLD|FREEZ", EVTYPE) == FALSE, "snow",
            ifelse(grepl("THUNDERSTORM|THUNDERSTORMS", EVTYPE) == TRUE, "t-storm",
            ifelse(grepl("TORNADO|TORNDAO|WHIRLWIND|FUNNEL|LANDSPOUT", EVTYPE) == TRUE, "tornado",
            ifelse(grepl("BURST|GUSTNADO", EVTYPE) == TRUE, "wind burst",
            ifelse(grepl("WIND|TURBULENCE|TSTMW", EVTYPE) == TRUE 
                & grepl("MARINE|SEAS|SURF|HURRIC|BURST|TORNADO|WHIRL|SNOW", EVTYPE) == FALSE, "wind",
            ifelse(grepl("WINTER|WINTRY", EVTYPE) == TRUE, "winter", "other")))))))))))))))))))))))


# a look at "other" events:         
unique(noaa[noaa$EVCAT == "other", ]$EVTYPE)
##  [1] "SLEET"              "COOL AND WET"       "EXCESSIVE WETNESS" 
##  [4] "HEAVY MIX"          "URBAN/SMALL STREAM" "HEAVY SWELLS"      
##  [7] "OTHER"              "DAM BREAK"          "HIGH SWELLS"       
## [10] "SEICHE"             "ROCK SLIDE"         "VOLCANIC ASH"

Handling the Exponent Variables

For handling of exponent variables I partly consulted this GitHub repo, among other sources.

# values in the exponent variables
table(noaa$PROPDMGEXP) 
## 
##      -      +      0      1      2      3      4      5      6      7 
##      1      5    210  10717      1      1      4     18      3      3 
##      B      h      H      K      m      M 
##     40      1      6 210144      7   8946
table(noaa$CROPDMGEXP)
## 
##      ?      0      1      B      k      K      m      M 
##      6     17 128138      7     21  99932      1   1985

The number of observations classified with extraneous values beyond the clear-cut letter labels (B = billion, M = million, etc.) is small, however, this is another source of uncertainty in this analysis. After some exploratory analysis (not included), I decided to proceed with the approach that numbered exponent values mean actual exponents in base 10, so a 5 means 10,000 times the value in the property or crop variables.

# substituting code symbols for integer multipliers
noaa$PROPDMGEXP <- recode(noaa$PROPDMGEXP, B = 1e+09L, M = 1e+06L, m = 1e+06L, K = 1e+03L, H = 1e+02L,
                                h = 1e+02L, "2" = 1e+02L, "3" = 1e+03L, "4" = 1e+04L, "5" = 1e+05L, 
                             "6" = 1e+06L, "7" = 1e+07L, "0" = 1L, "+" = 1L, "-" = 0L, .default = 1L)

noaa$CROPDMGEXP <- recode(noaa$CROPDMGEXP, B = 1e+09L, M = 1e+06L, m = 1e+06L, K = 1e+03L, k = 1e+03L, 
                                                                   "0" = 1L, "?" = 0L, .default = 1L)

# multiplying by exponent and further narrowing the "noaa" subset
noaa$PROPDMG <- noaa$PROPDMG * noaa$PROPDMGEXP
noaa$CROPDMG <- noaa$CROPDMG * noaa$CROPDMGEXP

# removing unnecessary variables 
noaa <- noaa[, -c(1,5,7,8)]
head(noaa)
##   FATALITIES INJURIES PROPDMG CROPDMG   EVCAT
## 1          0       28 2500000       0 tornado
## 2          0        0   25000       0 tornado
## 3          0        0   25000       0 tornado
## 4          0        3 2500000       0 tornado
## 5          0        2 2500000       0 tornado
## 6          0       15 2500000       0 tornado

With the final figures for property and crop damage computed for every observation, the data is primed for final tidy subsets.


Part 4. Results

Question 1: Across the United States, which types of events are most harmful with respect to population health?

We assess harm to population health by considering total number of fatalities and injuries per event category. Noticing that fatality and injury are actually two levels of harm, I created a single index variable (ind) with those two levels.

# set print max options for tibbles
options('tibble.print_max' = 100) 

harm <- noaa %>%
    select(EVCAT, FATALITIES, INJURIES) %>%
    group_by(EVCAT) %>%
    filter(FATALITIES != 0 | INJURIES != 0) %>%
    summarise_all(sum) %>% 
    arrange(desc(FATALITIES))

# final tidy dataset 
tidyharm <- data.frame(harm, stack(harm[2:3]))[, c(1, 4, 5)]
head(tidyharm, 3)
##     EVCAT values        ind
## 1    heat   3173 FATALITIES
## 2 tornado   1778 FATALITIES
## 3   flood   1557 FATALITIES
tail(tidyharm, 3)
##       EVCAT values      ind
## 42  drought     19 INJURIES
## 43    other      4 INJURIES
## 44 drowning      0 INJURIES

Plot 1: POPULATION HARM

ggplot(tidyharm, aes(x = reorder(EVCAT, +values), y = values, fill = tolower(ind))) +
        geom_bar(stat = "identity") + theme_minimal() + coord_flip() + 
        scale_fill_manual(values=c("#510A1A", "#CB103B")) +
        labs(title = "Weather-Related Fatalities and Injuries in the U.S.", 
             subtitle = "Summary of NOAA Storm Data from 1990 to 2011", 
             x = "Event Type", y = "Number of Incidents Reported") + 
             theme(legend.title = element_blank())

The most harmful weather events, according to this analysis and keeping in mind its scope and limitations, seem to be tornadoes, extreme heat, flooding, strong winds (other than hurricanes, tornadoes), and lightning. Noticeably, heat causes more deaths than tornadoes. This could partly be due to inconsistent methodologies and incentives in classifying mortality by heat or tornado. Perhaps surprisingly, cold is further down the list of most harmful events; this is partly due to other classifications of events (as previously listed in this analysis) that include cold, such as: wind, snow, ice, and winter.


Question 2: Across the United States, which types of events have the greatest economic consequences?

The cost of weather-related damage to property and crops is considered using the PROPDMG and CROPDMG variables, as previously computed. Similarly, a single cost variable can be computed using two levels: property, and crop.

cost <- noaa %>% 
    select(EVCAT, PROPDMG, CROPDMG) %>%
    filter(PROPDMG != 0 | CROPDMG != 0) %>%
    group_by(EVCAT) %>%
    summarise_all(sum) %>%
    arrange(desc(PROPDMG))

# final tidy dataset    
tidycost <- data.frame(cost, stack(cost[2:3]))[, c(1, 4, 5)]
tidycost <- tidycost %>%
    mutate(ind = ifelse(tidycost$ind == "CROPDMG", "crops", "properties")) 
head(tidycost)
##       EVCAT       values        ind
## 1     flood 168271030485 properties
## 2 hurricane  93072537560 properties
## 3     ocean  48305183590 properties
## 4   tornado  32073720807 properties
## 5      hail  16022991537 properties
## 6      wind  10630829763 properties
tail(tidycost)
##         EVCAT     values   ind
## 39      frost 1997061000 crops
## 40        fog          0 crops
## 41       heat  904423500 crops
## 42       dust    3615000 crops
## 43      other  148034400 crops
## 44 wind burst       1550 crops

Plot 2: ECONOMIC COST

ggplot(tidycost, aes(x = reorder(EVCAT, +values), y = values/1e+09, fill = ind)) +
        geom_bar(stat = "identity") + theme_minimal() + coord_flip() + 
        scale_fill_manual(values=c("#0E348B", "#1976EC")) +
        labs(title = "Cost of Weather-Related Damages to Crops and Properties in the U.S.", 
             subtitle = "Summary of NOAA Storm Data from 1990 to 2011", 
             x = "Event Type", y = "Cost of Damages (in Billions of Dollars)") +
             guides(fill=guide_legend(title="Damage to"))

The most expensive weather-related events, according to this analysis and considering its limitations, seem to be: flooding, tropical storms (including hurricanes), coastal and marine weather, tornadoes, hail, and drought. Most of the damage is done to properties, however, most of the drought damage, as expected, is done to crops (just as frost and cold). Noticeably, tornado damages rank much lower than flooding, a distinct difference from harm done to the population. This and other possible comparisons led me to pursue a third question of how population harm and economic damage relate to each other.


Question 3: How do the impacts on the U.S. population and economy compare across the various weather event types?

This question requires a comparison of number of incidents with dollar value of incidents, which is somewhat contrived since one can’t put a price on lives, among other considerations. For this comparison to work I calculated the proportional impact separately and plotted these proportions.

# Recalculating "harm" and "cost" subsets without removing observations of zeros separately 
# This step avoids discarding event types that zeroes across one subset but not the other
harm <- noaa %>%
    select(EVCAT, FATALITIES, INJURIES) %>%
    group_by(EVCAT) %>%
    summarise_all(sum) %>% 
    arrange(desc(FATALITIES))
    
cost <- noaa %>% 
    select(EVCAT, PROPDMG, CROPDMG) %>%
    group_by(EVCAT) %>%
    summarise_all(sum) %>%
    arrange(desc(PROPDMG))

# Creating tidy set of totals and their proportional impact for each case 
totharm <- harm %>% 
    mutate(TOT.harm = FATALITIES + INJURIES) %>%
    mutate(prop = prop.table(TOT.harm)) %>%
    select(EVCAT, prop) %>%
    arrange(desc(prop))

totcost <- cost %>% 
    mutate(TOT.cost = PROPDMG + CROPDMG) %>%
    mutate(prop = prop.table(TOT.cost)) %>%
    select(EVCAT, prop) %>%
    arrange(desc(prop)) 

# Merging data subsets
harmcost <- merge(totharm[-c(22,23),], totcost[-c(22,23),], by = "EVCAT") %>%
    arrange(desc(prop.x)) 

# Recreating tidy dataset   
tidyharmcost <- data.frame(harmcost, stack(harmcost[2:3]))[, c(1, 4, 5)]
tidyharmcost <- tidyharmcost %>%
    mutate(ind = ifelse(tidyharmcost$ind == "prop.x", "total harm", "total cost"))

head(tidyharmcost)
##       EVCAT     values        ind
## 1   tornado 0.33582591 total harm
## 2      heat 0.14626408 total harm
## 3     flood 0.12077608 total harm
## 4      wind 0.09015746 total harm
## 5 lightning 0.07134517 total harm
## 6   t-storm 0.03126732 total harm
tail(tidyharmcost)
##        EVCAT       values        ind
## 37      dust 3.700946e-05 total cost
## 38      rain 8.911674e-03 total cost
## 39 landslide 7.702402e-04 total cost
## 40     frost 4.508665e-03 total cost
## 41   drought 3.331244e-02 total cost
## 42     other 3.372558e-04 total cost

Plot 3: HARM versus COST

ggplot(tidyharmcost, aes(x = reorder(EVCAT, -values), y = values, fill = ind)) +
        geom_bar(stat = "identity", position = "dodge") + theme_minimal() +
        scale_fill_manual(values=c("#1244B8", "#A40C2E")) +
        labs(title = "Impacts of Weather Events on the U.S. Population and Economy", 
             subtitle = "Summary of NOAA Storm Data from 1990 to 2011", 
             x = "Event Type", y = "Proportional Distribution of Totals") + 
             theme(legend.title = element_blank()) +
             theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

This plot exemplifies how the impact of weather-related events vary greatly depending on whether one considers their cost or their harm. A further classification of weather events could be made by separating them into two broad categories, those that cause more population harm than economic cost, and those that do the opposite. In the former group, we can include: tornadoes, heat, wind, lightning, icy conditions, thunderstorms and snow; while in the latter group (more money loss than life loss) we can include: flood, tropical storms, coastal weather, hail, drought, and rain. Fire seems to be the only type of event that causes equal harm and cost to society.

FINAL NOTE: Further exploration and better methodology and consideration of fallacies and biases in this analysis are warranted.