This document describes an analysis completed as an assignment for the Coursera Reproducible Research course from Johns Hopkins University.
Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage. Preparing for unprecedented extreme weather is a key concern of public officials.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. Over 900 event types are contained in this database, which covers the 1950-2011 time period. It tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
Our goal here is to answer two basic questions about losses generated from severe weather events.
Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?
More information about the data is available in the National Weather Service Storm Data Documentation and the National Climatic Data Center Storm Events FAQ.
Before beginning the project, load the required R libraries and set any environment variables. Setting message=FALSE in the chunk options suppresses library-loading messages such as version numbers and dependency notes. Using up-to-date versions of these libraries should make it easier to reproduce results similar to those shown here. All code chunks are set to echo=TRUE so that the calculations and assumptions made are visible.
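For reference, a minimal knitr setup chunk matching these options would look like the sketch below; the exact setup chunk used in the original source is not shown here, so this is an assumed equivalent.
# assumed setup chunk: echo = TRUE reveals code, message = FALSE suppresses library-loading messages
knitr::opts_chunk$set(echo = TRUE, message = FALSE)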
# Clear environment of prior calculations.
rm(list=ls())
library(dplyr)      # data manipulation (data.table + dplyr syntax is also available via the dtplyr package)
library(data.table)
library(knitr)      # for output to html and pdf
library(ggplot2)    # enhanced graphics
library(ggthemes)   # additional ggplot2 themes
library(readr)      # for faster/consistent reads and writes
library(reshape2)   # for melt(), used when reshaping the plotting datasets
library(RCurl)      # for loading the external dataset (getBinaryURL)
library(R.utils)    # for bunzip2
Load the data from the previously prepared stormData.rds file, or reload and rescrub the raw data. Loading the cached file saves considerable time during the report-writing phase.
dataReload <- TRUE
dataRescrub <- TRUE
# check if a scrubbed stormData.rds already exists
if(file.exists("./data/stormData.rds")){
    stormData <- read_rds("./data/stormData.rds")
    dataReload <- FALSE  # set this to FALSE for publication, TRUE for debugging
    dataRescrub <- FALSE # set this to FALSE for publication, TRUE for debugging
}
if(dataReload){
    # create a data dir if it doesn't exist
    if(!file.exists("./data")){
        dir.create("./data")
    }
    # download the compressed file and unzip it to ./data/StormData.csv
    if(!file.exists("./data/StormData.csv")){
        print("Saving new stormData files...")
        temp <- tempfile()
        download.file("https://d396qusza40orc.cloudfront.net/repdata/data/StormData.csv.bz2",
                      temp, mode = "wb")
        destPath <- "./data/StormData.csv"
        bunzip2(temp, destPath, overwrite = TRUE, remove = FALSE)
        unlink(temp)
    }
    # read the csv file into stormData
    if(file.exists("./data/StormData.csv")){
        print("Reading new stormData csv file...")
        stormData <- read.csv("./data/StormData.csv", stringsAsFactors = FALSE)
    }
}
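A quick structural check (optional, not part of the original script) confirms that the raw data loaded with the expected rows and columns:
# optional check of raw data dimensions and column names
dim(stormData)
names(stormData)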
For the purposes of this analysis we will focus only on the following fields: BGN_DATE, END_DATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, and CROPDMGEXP.
if(dataRescrub){
    # A. Drop columns not wanted based on knowledge of the dataset and the research focus.
    cols.want <- stormData[, grep('BGN_DATE|END_DATE|EVTYPE|FATALITIES|INJURIES|PROPDMG|PROPDMGEXP|CROPDMG|CROPDMGEXP', x = names(stormData))]
    cols.dont.want <- stormData[, -grep('BGN_DATE|END_DATE|EVTYPE|FATALITIES|INJURIES|PROPDMG|PROPDMGEXP|CROPDMG|CROPDMGEXP', x = names(stormData))]
    # use cols.want to reset the base data
    stormData <- cols.want
}
Since the damage amounts are stored as a base value (PROPDMG, CROPDMG) together with a character exponent code (PROPDMGEXP, CROPDMGEXP), create summary numeric fields that hold the total damages in dollars.
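As a toy illustration (the values here are invented for clarity, not taken from the dataset), the exponent codes act as powers of ten applied to the base damage amount:
# illustration only: "K" means thousands (10^3), "M" means millions (10^6)
example_dmg <- c(25, 1.5)                      # hypothetical base damage amounts
example_exp <- c("K", "M")                     # hypothetical exponent codes
example_dmg * 10^c(K = 3, M = 6)[example_exp]  # 25,000 and 1,500,000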
if(dataRescrub){
    # B. Reformat property damage columns
    # fill in the blanks
    stormData$PROPDMGEXP[is.na(stormData$PROPDMGEXP)] <- "0"
    # work with the multiplier column as character for recoding
    stormData$PROPDMGEXP <- as.character(stormData$PROPDMGEXP)
    # map both lower- and upper-case exponent letters to their numeric meaning
    stormData$PROPDMGEXP[toupper(stormData$PROPDMGEXP) == 'H'] <- "2"
    stormData$PROPDMGEXP[toupper(stormData$PROPDMGEXP) == 'K'] <- "3"
    stormData$PROPDMGEXP[toupper(stormData$PROPDMGEXP) == 'M'] <- "6"
    stormData$PROPDMGEXP[toupper(stormData$PROPDMGEXP) == 'B'] <- "9"
    # treat any remaining blank or non-numeric codes as 10^0
    stormData$PROPDMGEXP[!grepl("^[0-9]+$", stormData$PROPDMGEXP)] <- "0"
    # set the multipliers back to numeric values
    stormData$PROPDMGEXP <- as.numeric(stormData$PROPDMGEXP)
    # calculate total damages and place them into a new column
    stormData$TOTALPROPDMG <- stormData$PROPDMG * 10^stormData$PROPDMGEXP

    # C. Reformat crop damage columns
    stormData$CROPDMGEXP[is.na(stormData$CROPDMGEXP)] <- "0"
    # work with the multiplier column as character for recoding
    stormData$CROPDMGEXP <- as.character(stormData$CROPDMGEXP)
    # map both lower- and upper-case exponent letters to their numeric meaning
    stormData$CROPDMGEXP[toupper(stormData$CROPDMGEXP) == 'H'] <- "2"
    stormData$CROPDMGEXP[toupper(stormData$CROPDMGEXP) == 'K'] <- "3"
    stormData$CROPDMGEXP[toupper(stormData$CROPDMGEXP) == 'M'] <- "6"
    stormData$CROPDMGEXP[toupper(stormData$CROPDMGEXP) == 'B'] <- "9"
    # treat any remaining blank or non-numeric codes as 10^0
    stormData$CROPDMGEXP[!grepl("^[0-9]+$", stormData$CROPDMGEXP)] <- "0"
    # set the multipliers back to numeric values
    stormData$CROPDMGEXP <- as.numeric(stormData$CROPDMGEXP)
    # calculate total damages and place them into a new column
    stormData$TOTALCROPDMG <- as.numeric(stormData$CROPDMG * 10^stormData$CROPDMGEXP)
}
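As a quick sanity check (not part of the original pipeline), confirm that no exponent codes failed to convert to numeric multipliers:
# both counts should be zero after the recoding above
sum(is.na(stormData$TOTALPROPDMG))
sum(is.na(stormData$TOTALCROPDMG))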
For historical and data-entry reasons, the event labels assigned to records are not always consistent with the current official event categories. Beginning in 1996, the number of recorded event types was expanded from three to 48. For consistency, records for events prior to 1996 will be excluded.
if(dataRescrub){
    # assign a value to the cut-off year
    splitYear <- 1996
    # add a numeric year dimension derived from the begin date
    stormData$EVT_YR <- as.numeric(format(as.Date(stormData$BGN_DATE, format = "%m/%d/%Y"), "%Y"))
    # keep only records beginning in or after the cut-off year
    stormData <- stormData %>%
        filter(EVT_YR >= splitYear)
}
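A brief check (optional) confirms that only events from 1996 onward remain after the filter:
# expected to span 1996 through 2011
range(stormData$EVT_YR)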
The raw data initially contained over 900 event type labels. Refactoring and reviewing these labels is essential for reducing classification error and properly allocating damages to each category. The code below documents the grouping decisions used to create the new event categories. This process reduces the categories to 13, including one "OTHER" category for hard-to-classify event types.
if(dataRescrub){
    # create temporary processing sets
    stormData2 <- stormData          # keeps a pre-factored copy in case the edits need to be reset
    reducedStormData <- stormData2   # working copy whose labels will be consolidated
    # convert labels to upper case to normalize them (alternatively, set ignore.case = TRUE in grepl)
    reducedStormData$EVTYPE <- toupper(reducedStormData$EVTYPE)
    reducedStormData[grepl("HIGH WIND*|WND|GUSTY|HIGH WINDS|GRADIENT WIND|WIND DAMAGE|GUSTY LAKE WIND|WIND GUSTS|WIND ADVISORY|RED FLAG CRITERIA|STRONG WIND.*|STRONG WIND GUSTS|WINDS",
                           reducedStormData$EVTYPE), "EVTYPE"] <- "WIND"
    reducedStormData[grepl("^STORM|HURRICANE|TYPHOON|TROPICAL STORM|REMNANTS OF FLOYD",
                           reducedStormData$EVTYPE), "EVTYPE"] <- "HURRICANE/TYPHOON"
    reducedStormData[grepl("SUMMARY.*|.*OTHER.*|NONE|NORTHERN LIGHTS|NO SEVERE WEATHER",
                           reducedStormData$EVTYPE), "EVTYPE"] <- "OTHER"
    reducedStormData[grepl("TSTM|THUNDERSTORM|LIGHTNING|^METRO STORM.*|COASTAL STORM|COASTALSTORM",
                           reducedStormData$EVTYPE), "EVTYPE"] <- "THUNDERSTORM"
    reducedStormData[grepl("TORNADO|SPOUT|FUNNEL|WHIRLWIND",
                           reducedStormData$EVTYPE), "EVTYPE"] <- "TORNADO"
    reducedStormData[grepl("FLOOD|SURF|BLOW_OUT|SWELLS|FLD|DAM BREAK",
                           reducedStormData$EVTYPE), "EVTYPE"] <- "FLOOD"
    reducedStormData[grepl("FIRE|SMOKE|VOLCANIC",
                           reducedStormData$EVTYPE), "EVTYPE"] <- "FIRE"
    reducedStormData[grepl("PRECIPITATION|RAIN|HAIL|DRIZZLE|WET|PERCIP|BURST|DEPRESSION|FOG|CLOUD|VOG",
                           reducedStormData$EVTYPE), "EVTYPE"] <- "RAIN/FOG"
    reducedStormData[grepl("COLD|COOL|ICE|ICY|FROST|FREEZE|SNOW|WINTER|WINTRY|WINTERY|BLIZZARD|CHILL|FREEZING|AVALANCHE|GLAZE|SLEET|MIXED PRECIP",
                           reducedStormData$EVTYPE), "EVTYPE"] <- "SNOW/ICE"
    reducedStormData[grepl("WARMTH|WARM|HEAT|DRY|HOT|DROUGHT|THERMIA|TEMPERATURE RECORD|RECORD TEMPERATURE|RECORD HIGH|MONTHLY TEMPERATURE|DRIEST MONTH",
                           reducedStormData$EVTYPE), "EVTYPE"] <- "HEAT/DROUGHT"
    reducedStormData[grepl("SEAS|HIGH WATER|TIDE|TSUNAMI|WAVE|CURRENT|MARINE|DROWNING|SURF|SEICHE|WAKE LOW WIND",
                           reducedStormData$EVTYPE), "EVTYPE"] <- "WATER-RELATED"
    reducedStormData[grepl("DUST.*|SAHARA.*",
                           reducedStormData$EVTYPE), "EVTYPE"] <- "DUST"
    reducedStormData[grepl("SLIDE|EROSION|SLUMP",
                           reducedStormData$EVTYPE), "EVTYPE"] <- "SLIDES/EROSION"
    # verify that all records fall into a proper category
    unique(reducedStormData$EVTYPE)
    # assign the reduced data back to the analysis dataset
    stormData <- reducedStormData
}
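To review how the records are distributed across the 13 consolidated categories (a quick check, not part of the original pipeline):
# count records per consolidated event category, largest first
stormData %>% count(EVTYPE, sort = TRUE)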
To improve processing time, the cleaned, reduced dataset is saved to an rds-format file. To re-run the scrubbing operations and refresh the data, delete the local stormData.rds file from the data folder.
if(dataRescrub){
    write_rds(stormData, "./data/stormData.rds") # the data could alternatively be stored as rda or csv
}
# create graphing dataset for fatalities
sumFatalities <- stormData %>%
    select(Event = EVTYPE, FATALITIES) %>%
    group_by(Event) %>%
    summarise(Fatalities = as.numeric(sum(FATALITIES, na.rm = TRUE)))
sumFatalities <- sumFatalities[order(-sumFatalities$Fatalities), ][1:10, ]
# create graphing dataset for injuries
sumInjuries <- stormData %>%
    select(Event = EVTYPE, INJURIES) %>%
    group_by(Event) %>%
    summarise(Injuries = as.numeric(sum(INJURIES, na.rm = TRUE)))
sumInjuries <- sumInjuries[order(-sumInjuries$Injuries), ][1:10, ]
# create combined graphing dataset of fatalities and injuries
deadAndInjured <- merge(x = sumFatalities, y = sumInjuries, by = "Event", all = TRUE)
deadAndInjured <- melt(deadAndInjured, id.vars = 'Event')
names(deadAndInjured) <- c("Event", "Loss", "Value")
deadAndInjured$Value <- deadAndInjured$Value/(10^3)
# plot both sets
ggplot(deadAndInjured, aes(x = Event, y = Value, fill = Loss)) +
    geom_bar(stat = "identity") +
    labs(fill = "") +
    labs(x = "", y = "Thousands - (Source: NOAA Storm Data 1996-2011)") +
    ggtitle("Weather-related Fatality & Injury Counts") +
    coord_flip() +
    theme_economist()
Answer: This plot shows that heat/drought and tornado events together caused the most fatalities, while tornado disasters alone represented the largest cause of weather-related injuries in the United States from 1996 to 2011.
# create graphing dataset for crop damages
cropSum <- stormData %>%
    select(Event = EVTYPE, TOTALCROPDMG) %>%
    group_by(Event) %>%
    summarise(Crop = as.numeric(sum(TOTALCROPDMG, na.rm = TRUE)))
cropSum <- cropSum[order(-cropSum$Crop), ][1:10, ]
# create graphing dataset for property damages
propSum <- stormData %>%
    select(Event = EVTYPE, TOTALPROPDMG) %>%
    group_by(Event) %>%
    summarise(Property = as.numeric(sum(TOTALPROPDMG, na.rm = TRUE)))
propSum <- propSum[order(-propSum$Property), ][1:10, ]
# create combined graphing dataset of crop and property damages
cropAndProp <- merge(x = cropSum, y = propSum, by = "Event", all = TRUE)
cropAndProp <- melt(cropAndProp, id.vars = 'Event')
names(cropAndProp) <- c("Event", "Loss", "Value")
cropAndProp$Value <- cropAndProp$Value/(10^9)
# plot property and crop losses
ggplot(cropAndProp, aes(Event, Value)) +
    geom_bar(aes(fill = Loss), position = "dodge", stat = "identity") +
    labs(fill = "") +
    labs(x = "", y = "Damage, Billion $USD - (Source: NOAA Storm Data 1996-2011)") +
    ggtitle("Weather-related Property & Crop Losses") +
    coord_flip() +
    theme_economist()
Answer: Based on the chart above, frequent flood and hurricane/typhoon disasters accounted for most of the property damage, while severe drought and flood events contributed most to the crop damage in the United States from 1996 to 2011.
The main challenges of working with this data included inconsistent event labeling and differing conventions for recording damage and human-loss estimates. In addition, more recent data may be affected by factors not recorded in the dataset, such as the increasing concentration of property along coastal and flood-prone areas that are exposed to natural events.
Finally, further study of this data should reveal more specific regional and local differences in weather-related risks; one possible approach is sketched below.
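The following sketch shows one way a state-level damage summary could be built. Note that it assumes the STATE column is retained during the column-selection step above, which the current pipeline drops, so the selection pattern would need to be extended first.
# hypothetical regional summary; requires keeping STATE in the column selection
stateDamage <- stormData %>%
    group_by(STATE, EVTYPE) %>%
    summarise(TotalDamage = sum(TOTALPROPDMG + TOTALCROPDMG, na.rm = TRUE)) %>%
    ungroup() %>%
    arrange(desc(TotalDamage))
head(stateDamage, 10)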
RPubs page: http://rpubs.com/iwebconsultant/storm-data