This analysis looks at Storm Data published by the National Oceanic and Atmospheric Administration (NOAA), which details significant storms, rare or unusual weather phenomena, and other significant meteorological events (such as record maximum or minimum temperatures or precipitation) recorded between 1950 and 2011.

Documentation relating to the storm data can be found here

It asks which types of events across the United States:

  1. are the most harmful with respect to population health?

  2. have the greatest economic consequences?

It does this by comparing fatalities, injuries, property damage and crop damage across the different types of events.

Data Processing

The following steps are carried out to process and tidy the data:

1. Setup Environment

Load required libraries.

##########
# 1. Setup
##########

library(stringr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
library(scales)
library(reshape2)

# Required to allow publishing to RPubs
devtools::install_github("rstudio/rsconnect", ref = "bugfix/multi-status-header")
## Skipping install of 'rsconnect' from a github remote, the SHA1 (c5fdee00) has not changed since last install.
##   Use `force = TRUE` to force installation
# Required to create PDFs - Install Once Only
# install.packages("tinytex")
# tinytex::install_tinytex()  # install TinyTeX

2. Define Functions

Define specialist functions for:

  1. Interpreting Damage Exponents

    Property and Crop Damage are each expressed as a value and an exponent. Exponents can be characters such as H for Hundred, K for Thousand, M for Million and B for Billion. The function GetExponent converts the exponent character into a power of ten, which is then used to calculate the damage.

  2. Simplifying Event Classifications

    The function SimplifyEvent is used to create a new Event Type, grouping the original EVTYPE values into similar types of events and absorbing spelling errors.

##############
# 2. Functions
##############

GetExponent <- function(exponent){
    uppercaseExponent <- toupper(exponent)
    power <- case_when(
        uppercaseExponent == "H" ~ 2,
        uppercaseExponent == "K" ~ 3,
        uppercaseExponent == "M" ~ 6,
        uppercaseExponent == "B" ~ 9,
        TRUE ~ 0
    )
    return(power)
}

SimplifyEvent <- function(event){
    newEvent <- toupper(event)
    newEvent <- case_when(
        str_detect(newEvent, "HAIL")  ~ "HAIL",
        str_detect(newEvent, "THUND|TSTM|WIND|STORM|MICROBURST|WND|TROPICAL DEPRESSION")  ~ "WIND STORM",
        str_detect(newEvent, "TORNADO|TORNDAO|LANDSPOUT|WATERSPOUT|WAYTERSPOUT|FUNNEL CLOUD|DUST DEVIL|DUST DEVEL|TYPHOON|FUNNEL|WALL CLOUD|GUSTNADO")  ~ "WIND: TORNADO / TYPHOON",
        str_detect(newEvent, "FLOOD|FLD")  ~ "WIND: FLOOD",
        str_detect(newEvent, "DROUGHT")  ~ "DROUGHT",
        str_detect(newEvent, "LIGHTNING|LIGHTING|LIGNTNING")  ~ "LIGHTNING",
        str_detect(newEvent, "SNOW|BLIZZARD|WINTER|SLEET|FREEZING DRIZZLE|WINTRY MIX")  ~ "SNOW / SLEET",
        str_detect(newEvent, "HURRICANE|FLOYD")  ~ "WIND: HURRICANE", 
        str_detect(newEvent, "RAIN|PRECIP|WET|WATER|DOWNBURST|HEAVY SHOWER")  ~ "RAIN",
        str_detect(newEvent, "FOG|VOG")  ~ "FOG",
        str_detect(newEvent, "ICE|FROST|FREEZE|ICY|LOW")  ~ "FREEZE",  
        str_detect(newEvent, "RIP CURRENT|SURF|TIDE|TSUNAMI|SWELL|SEICHE|SEAS|WAVES|SURGE")  ~ "OCEAN CURRENT",  
        str_detect(newEvent, "COLD|COOL|HYP")  ~ "TEMPERATURE - LOW",  
        str_detect(newEvent, "HEAT|HOT|RECORD HIGH|WARM|DRY")  ~ "TEMPERATURE- HIGH",  
        str_detect(newEvent, "TEMPERATURE")  ~ "TEMPERATURE - UNKNOWN",  
        str_detect(newEvent, "FIRE")  ~ "FIRE",  
        str_detect(newEvent, "SLIDE|SLUMP")  ~ "LANDSLIDE",  
        str_detect(newEvent, "AVALANCE|AVALANCHE")  ~ "AVALANCHE",
        str_detect(newEvent, "EROS")  ~ "COASTAL EROSION",
        # Note: separate LANDSLIDE and HEAT cases would never be reached here,
        # because "SLIDE" and "HEAT" are already matched by the patterns above.
        TRUE ~ "OTHER"
    )
    return(newEvent)
}
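The damage calculation described above is value × 10^power. The following is a minimal base R sketch of that arithmetic, using a lookup table rather than case_when; the names exponentLookup and toPower and the sample values are hypothetical and are not part of the analysis. As in GetExponent, any character that is not H, K, M or B (including an empty string) is treated as a power of 0, so the value is left unscaled.

```r
# Hypothetical lookup-table equivalent of GetExponent (base R only)
exponentLookup <- c(H = 2, K = 3, M = 6, B = 9)

toPower <- function(exponent) {
    # Look up each upper-cased character; unknown characters give NA
    power <- exponentLookup[toupper(as.character(exponent))]
    power[is.na(power)] <- 0   # anything unrecognised becomes 10^0
    unname(power)
}

# Made-up sample values, for illustration only
value    <- c(25, 2.5, 1.2, 10)
exponent <- c("K", "m", "B", "")

damage <- value * 10 ^ toPower(exponent)
print(damage)
```

Under these assumptions, 25 with exponent K becomes 25,000 and 1.2 with exponent B becomes 1.2 billion, while the value with an empty exponent stays at 10.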

3. Load Data

##############
# 3. Load Data
##############

# Read storm data - this can be done directly from the zipped version
stormDF <- read.csv("repdata-data-StormData.csv.bz2")

originalColumns <- ncol(stormDF)

# See how many different event types there are
rawEventTypes <-stormDF %>%
    count(EVTYPE, sort = TRUE)
## Warning: The `printer` argument is soft-deprecated as of rlang 0.3.0.
## This warning is displayed once per session.
stormDF <- stormDF %>%
    select(EVTYPE,BGN_DATE,FATALITIES,INJURIES,PROPDMG,PROPDMGEXP,CROPDMG,
           CROPDMGEXP, LONGITUDE, LONGITUDE_, LATITUDE, LATITUDE_E) %>%
    mutate(
        eventType = as.factor(SimplifyEvent(EVTYPE)),
        propertyDamage = PROPDMG * (10 ^ GetExponent(PROPDMGEXP)),
        cropDamage = CROPDMG * (10 ^ GetExponent(CROPDMGEXP)),
        beginDate = as.Date(BGN_DATE, format="%m/%d/%Y"),
        year = year(beginDate),
        month = month(beginDate)
    )

# See how many different simplified event types there are
tidyEventTypes <-stormDF %>%
    count(eventType, sort = TRUE)
  1. Load the zipped data (902k observations of 37 variables)
  2. Count the number of different event types. There are 985 types of event in the raw data, due to multiple classifications of similar events resulting from a lack of consistency during data capture (e.g. LIGHTNING, LIGHTING and LIGNTNING).
  3. Apply the SimplifyEvent function to reduce this down to 20 types by combining similar types (e.g. those containing SNOW, BLIZZARD, WINTER, SLEET, FREEZING DRIZZLE, WINTRY MIX) and absorbing errors (e.g. misspellings of LIGHTNING).
  4. Select just the variables of interest and add the new ones derived from them, giving 18 variables in total.
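Because SimplifyEvent uses case_when, each raw event is assigned to the first pattern it matches, so the order of the patterns matters. The sketch below illustrates that first-match-wins behaviour in base R with only three of the patterns; the names rules and classifyOne and the sample events are hypothetical. Note that "WINTER STORM" ends up as WIND STORM rather than SNOW / SLEET, because the STORM pattern is tested first, which is also the order used in SimplifyEvent.

```r
# Ordered pattern -> label rules (a small, illustrative subset)
rules <- c(
    "HAIL"                  = "HAIL",
    "THUND|TSTM|WIND|STORM" = "WIND STORM",
    "SNOW|BLIZZARD|WINTER"  = "SNOW / SLEET"
)

classifyOne <- function(event) {
    event <- toupper(event)
    # Return the label of the first pattern that matches
    for (i in seq_along(rules)) {
        if (grepl(names(rules)[i], event)) return(unname(rules[i]))
    }
    "OTHER"
}

# Made-up sample events, for illustration only
events <- c("Thunderstorm Winds/Hail", "WINTER STORM", "heavy snow", "VOLCANIC ASH")
simplified <- vapply(events, classifyOne, character(1), USE.NAMES = FALSE)
print(data.frame(events, simplified))
```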

4. Review the Date Range

The brief stated that ‘In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.’ The chart below shows that records were created more fully from 1993. Analysis has therefore been limited to records from 1993 onwards.
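One way to check such a cut-off numerically, rather than by eye, is to count how many distinct event types appear in each year. The following is a minimal sketch on made-up records; the data frame records, the threshold of 3 types and the name firstFullYear are assumptions for illustration and are not taken from the storm data.

```r
# Made-up records, for illustration only
records <- data.frame(
    year      = c(1990, 1990, 1991, 1993, 1993, 1993, 1994, 1994, 1994),
    eventType = c("HAIL", "HAIL", "WIND STORM", "HAIL", "RAIN", "FOG",
                  "HAIL", "FIRE", "RAIN"),
    stringsAsFactors = FALSE
)

# Number of distinct event types recorded in each year
typesPerYear <- tapply(records$eventType, records$year,
                       function(x) length(unique(x)))
print(typesPerYear)

# First year in which at least 3 distinct types were recorded (assumed threshold)
firstFullYear <- min(as.integer(names(typesPerYear)[typesPerYear >= 3]))
print(firstFullYear)
```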

############################
# Plot A - Recording By Year
############################

# Summarise by year
eventsByYear <- stormDF %>% 
    group_by(year, eventType) %>% 
    summarise(totalEvents = n())
## Warning: The `printer` argument is soft-deprecated as of rlang 0.3.0.
## This warning is displayed once per session.
p <- ggplot(data = eventsByYear, aes(x=year, y=totalEvents, color = eventType))
p + geom_line() +
    labs(x = "Year", 
         y = "Events Recorded",
         title = "Records by Event Type") +
    scale_color_discrete("Event Type") +
    scale_x_continuous(breaks=seq(1950,2011,by=10)) +
    scale_y_continuous(labels = comma) + # Add commas to numbers - uses scales package
    geom_vline(xintercept=1993, colour="grey") +
    annotate(geom="text", x=1993, y=20000, label="\nFull Results Recorded from 1993", angle=90, color="red", size = 3)

################################
# Limit Analysis to 1993 Onwards
################################

stormDF <- stormDF %>%
    filter(year >= 1993)

5. Review the Event Types with the Highest Impact

##########################
# Limit by Top Event Types
##########################

# Get the impact totals by Event Type
impactByType <- stormDF %>%
    group_by(eventType) %>%
    summarise(fatalitiesTotal = sum(FATALITIES),
              injuriesTotal = sum(INJURIES),
              propertyDamageTotal = sum(propertyDamage),
              cropDamageTotal = sum(cropDamage))

# Get the Events with the highest fatality impact
topByFatality <- impactByType %>%
    top_n(5, wt = fatalitiesTotal) %>%
    arrange(desc(fatalitiesTotal)) %>%
    select(eventType)

# Get the Events with the highest injury impact
topByInjury <- impactByType %>%
    top_n(5, wt = injuriesTotal) %>%
    arrange(desc(injuriesTotal)) %>%
    select(eventType)

# Get the Events with the highest property damage impact
topByPropertyDamage <- impactByType %>%
    top_n(5, wt = propertyDamageTotal) %>%
    arrange(desc(propertyDamageTotal)) %>%
    select(eventType)

# Get the Events with the highest crop damage impact
topByCropDamage <- impactByType %>%
    top_n(5, wt = cropDamageTotal) %>%
    arrange(desc(cropDamageTotal)) %>%
    select(eventType)

# Merge the 4 different top 5s
topEvents <- unique(do.call("rbind", list(topByFatality,topByInjury, topByPropertyDamage, topByCropDamage)))

# Filter events based on the Top Event Types
stormDF <- stormDF %>% 
    filter(eventType %in% topEvents$eventType)

# Get the events for the top event types
topImpactByType <- stormDF %>%
    group_by(eventType) %>%
    summarise(fatalitiesTotal = sum(FATALITIES),
              injuriesTotal = sum(INJURIES),
              propertyDamageTotal = sum(propertyDamage),
              cropDamageTotal = sum(cropDamage))

# Melt the totals so that they can be plotted side by side
topImpactByTypeMelted <- melt(topImpactByType, id = "eventType")

As we are looking for the event types with the highest impact, it makes sense to focus on those rather than on all 20 event types the data was tidied into. The top 5 event types for each of the 4 measures (fatalities, injuries, property damage and crop damage) were therefore identified. Because the highest-impact types differed between measures, the four lists were merged to give the 9 overall highest-impact types, and only records for these were then analysed.

By selecting only events from 1993 onwards and relating to the top 9 event types, analysis is then performed on the most relevant 664k observations.
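The merge of the per-measure top lists is a union: an event type is kept if it ranks highly on any measure, so the merged list can be longer than any single list but shorter than their combined length. The sketch below shows the idea in base R on made-up totals with two measures and a top 2; the names totals and topN are hypothetical. Unlike top_n, this helper does not keep ties.

```r
# Made-up totals, for illustration only
totals <- data.frame(
    eventType  = c("A", "B", "C", "D"),
    fatalities = c(10, 50, 5, 1),
    damage     = c(9e8, 2e3, 5e8, 7e7),
    stringsAsFactors = FALSE
)

# Event types with the n largest values of one measure
topN <- function(df, measure, n) {
    df$eventType[order(df[[measure]], decreasing = TRUE)][seq_len(n)]
}

topByFatalities <- topN(totals, "fatalities", 2)   # B, A
topByDamage     <- topN(totals, "damage", 2)       # A, C

# Union of the two lists: A appears in both but is kept once
topEvents <- unique(c(topByFatalities, topByDamage))
print(topEvents)
```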

Results

Question 1: Across the United States, which types of events are most harmful with respect to population health?

#######################################
# Plot B - Health Impacts by Event Type
#######################################

pHealth <- ggplot(
    data = filter(topImpactByTypeMelted,variable == "fatalitiesTotal" | variable == "injuriesTotal"),
    aes(x=eventType, y=value, fill=variable))
pHealth +
    geom_bar(position="dodge", stat="identity") +
    labs(y = "Number of Fatalities / Injuries",
         fill = "",
         title = "Health Impacts across the US since 1993") +
    scale_y_continuous(labels = comma) + # Add commas to numbers - uses scales package
    scale_fill_discrete(labels=c("Fatalities", "Injuries")) +
    theme(legend.position="bottom", 
          axis.title.x=element_blank(),
          axis.text.x = element_text(size=6, angle=90, hjust=1))

As can be seen, the most harmful event types with respect to population health were TEMPERATURE- HIGH for Fatalities and WIND: TORNADO / TYPHOON for Injuries.

Question 2: Across the United States, which types of events have the greatest economic consequences?

As can be seen, the event types with the greatest economic consequences were WIND: FLOOD for Property Damage and DROUGHT for Crop Damage. Costs are shown on a log10 scale because they span several orders of magnitude.
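The reason for the logarithmic scale can be illustrated with a few made-up totals: on a linear axis the smaller totals would be almost invisible next to the largest, whereas their log10 values sit within a few units of each other. The figures and the name damageTotals below are hypothetical and are not results from this analysis.

```r
# Made-up damage totals in dollars, for illustration only
damageTotals <- c(small = 2.5e6, medium = 4.0e8, large = 1.5e11)

# On a linear scale, each bar's height relative to the tallest bar
share <- damageTotals / max(damageTotals)
print(signif(share, 2))

# On a log10 scale the same totals are only a few units apart
logged <- log10(damageTotals)
print(round(logged, 1))
```

With values like these, the smallest total is well under 0.1% of the largest on a linear axis, which is why a log10 axis makes the comparison readable.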