Peer Assessment 2: An Analysis of Which Weather Events Are The Most Damaging

Synopsis

This paper examines which types of weather events are the most harmful, both with respect to population health, and to the economy. It finds that tornadoes cause the most injuries and deaths, but floods cause the most economic damage.

The data comes from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, which tracks the characteristics of major storms and weather events in the United States.

Data Processing

This assignment asks two questions:

  1. Across the United States, which types of events (as indicated in the EVTYPE
    variable) are most harmful with respect to population health?
  1. Across the United States, which types of events have the greatest economic
    consequences?

First, let’s download and unzip the data file.

# Get the data
    require(utils)
    ## Set up a data folder
    if(!file.exists("./data")) {
        dir.create("./data")
    }

    ## Download data file and unzip
    if( !file.exists("./data/stormData.csv")) {
        urlData <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
        temp <- tempfile()
        download.file( url = urlData
                       , destfile = temp
                       , method = "curl"
                       )
        unzip(temp, "./data/")
        unlink(temp)
        dateDownloaded <- date()
    }

Next, let’s open the data file.

# Open the data file
    stormDataRaw <- read.csv(file="./data/repdata-data-StormData.csv.bz2")
    dim(stormDataRaw)
## [1] 902297     37

There are 902,297 rows of data, which track 37 different variables.

Question 1

The first question is:

  1. Across the United States, which types of events (as indicated in the EVTYPE
    variable) are most harmful with respect to population health?

Let’s define “harm to popualation health” as either death or injury directly caused by the event. Let us weight them the same, so that one death is considered the same amount of harm as an injury. While it’s much worse to die, the economic impact of a prolonged injury can cause more economic damage, so it is hard to tell which has a greater impact on the economy. Consequently, we’ll weight them the same.

We will need information from three columns:
* EVTYPE - The type of event
* FATALITIES - Fatalities that were directly caused by the event.
* INJURIES - Injuries that were directly caused by the event.

To calculate the weather event with the greatest health impact, we will need to sum deaths, injuries, and the sum of both, for each event type. Then we’ll have a nice data set we can sort in descending order to find the worst health events.

# Load necessary packages
    require(dplyr)
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
    require(tidyr)
## Loading required package: tidyr
    require(ggplot2)
## Loading required package: ggplot2
    require(scales)
## Loading required package: scales
# Clean the data
    q1Data <- stormDataRaw %>% 
        select(EVTYPE, FATALITIES, INJURIES) %>%
        group_by(EVTYPE) %>%
        summarize(Death = sum(FATALITIES)
                  , Injury = sum(INJURIES) ) %>%
        mutate(HealthEffects = Death + Injury) %>%
        arrange(desc(HealthEffects))
    head(q1Data)
## Source: local data frame [6 x 4]
## 
##           EVTYPE Death Injury HealthEffects
## 1        TORNADO  5633  91346         96979
## 2 EXCESSIVE HEAT  1903   6525          8428
## 3      TSTM WIND   504   6957          7461
## 4          FLOOD   470   6789          7259
## 5      LIGHTNING   816   5230          6046
## 6           HEAT   937   2100          3037

The table above shows the most harmful health effects broken down by event type. Tornadoes are the worst.

Question 2

The second question is:

  1. Across the United States, which types of events have the greatest economic
    consequences?

Let’s define “economic consequences” as either property or crop damage, as per the columns “PROPDMG” and “CROPDMG”. (That is, we are excluding the cost to the economy of injuries and deaths, as we do not have recorded data about how much injuries and deaths have cost.)

The table records property and crop damage as decimal numbers, with
separate columns telling us the exponent (i.e., thousands, millions, or billions). It looks like this:

    exponentExample <- stormDataRaw %>%
        select(EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP) %>%
        filter(!is.na(PROPDMGEXP)) %>%
        filter(!is.na(CROPDMGEXP)) %>%
        filter(CROPDMG > 0)

    head(exponentExample)
##                      EVTYPE PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 HURRICANE OPAL/HIGH WINDS     0.1          B      10          M
## 2        THUNDERSTORM WINDS     5.0          M     500          K
## 3            HURRICANE ERIN    25.0          M       1          M
## 4            HURRICANE OPAL    48.0          M       4          M
## 5            HURRICANE OPAL    20.0          m      10          m
## 6        THUNDERSTORM WINDS    50.0          K      50          K

To find the events with the greatest economic impact, first, let’s prepare a data set with just the columns we want:

    q2Columns <- stormDataRaw %>%
        select(EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP) 
    
    head(q2Columns)
##    EVTYPE PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 TORNADO    25.0          K       0           
## 2 TORNADO     2.5          K       0           
## 3 TORNADO    25.0          K       0           
## 4 TORNADO     2.5          K       0           
## 5 TORNADO     2.5          K       0           
## 6 TORNADO     2.5          K       0

We need to find whole numbers for property damage and crop damage, by multiplying PROPDMG by the number represented in PROPDMGEXP. For example, if PROPDMGEXP has an M, we must multiply PROPDMG by 1,000,000. We’ll need to prepare a lookup table to do this.

Let’s look at all the unique values in either PROPDMGEXP or CROPDMGEXP to see what we have to translate.

    unique(q2Columns$PROPDMGEXP)
##  [1] K M   B m + 0 5 6 ? 4 2 3 h 7 H - 1 8
## Levels:  - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
    unique(q2Columns$CROPDMGEXP)
## [1]   M K m B ? 0 k 2
## Levels:  ? 0 2 B k K m M

Let’s prepare lookup tables for both Property damage and Crop damage, by substituting the unique characters listed above for exponents they represent. For instance, we’ll tell the computer to substitute an “M” for ten to the power of “6”, so we can multiply that exponent by the value in either PROPDMG or CROPDMG.

# Prepare Lookup table
    letter <- c("-", "?", "+", "0", "1", "2", "3", "4", "5", "6", "7", "8"
                    , "B", "h", "H", "k", "K", "m", "M", "")
    multiplier <- c(0
                        , 0
                        , 0
                        , 0
                        , 1
                        , 2
                        , 3
                        , 4
                        , 5
                        , 6
                        , 7
                        , 8
                        , 9
                        , 2
                        , 2
                        , 3
                        , 3
                        , 6
                        , 6
                        , 0)
    lookup <- as.data.frame(cbind(letter, multiplier))

    # Format the lookup table to character and numeric, rather than factors
    lookup$letter <- as.character(lookup$letter)
    lookup$multiplier <- as.numeric(lookup$multiplier)
    
    #Format the original table 
    q2Columns$PROPDMGEXP <- as.character(q2Columns$PROPDMGEXP, na.rm=TRUE)
    q2Columns$CROPDMGEXP <- as.character(q2Columns$CROPDMGEXP, na.rm=TRUE)

# Replace the letters in PROPDMGEXP and CROPDMGEXP with the numbers in our 
#  lookup table
    q2ColumnsA <- q2Columns %>%
        left_join(x = q2Columns
                  , y = lookup
                  , by = c("PROPDMGEXP" = "letter")
                  , copy=FALSE) %>%
        rename(PropertyExp = multiplier) 
    head(q2ColumnsA)
##   PROPDMGEXP  EVTYPE PROPDMG CROPDMG CROPDMGEXP PropertyExp
## 1          K TORNADO    25.0       0                      4
## 2          K TORNADO     2.5       0                      4
## 3          K TORNADO    25.0       0                      4
## 4          K TORNADO     2.5       0                      4
## 5          K TORNADO     2.5       0                      4
## 6          K TORNADO     2.5       0                      4
    q2ColumnsB <- q2ColumnsA %>%
        left_join(x = q2ColumnsA
                  , y = lookup
                  , by = c("CROPDMGEXP" = "letter")
                  , copy = FALSE
                  , na.rm= TRUE) %>%
        rename(CropExp = multiplier)

    q2ColumnsC <- q2ColumnsB %>%
        mutate(TotalProperty = PROPDMG * 10 ^ PropertyExp
               , TotalCrop = CROPDMG * 10 ^ CropExp
               , TotalCost = TotalProperty + TotalCrop) %>%
        select(EVTYPE, PROPDMG, PropertyExp, TotalProperty, CROPDMG, CropExp, TotalCrop, TotalCost) %>%
        group_by(EVTYPE) %>%
        summarize(cost = sum(TotalCost)) %>%
        arrange(desc(cost))
    head(q2ColumnsC)
## Source: local data frame [6 x 2]
## 
##              EVTYPE         cost
## 1             FLOOD 1.503197e+12
## 2 HURRICANE/TYPHOON 7.191371e+11
## 3           TORNADO 5.736233e+11
## 4       STORM SURGE 4.332354e+11
## 5              HAIL 1.876122e+11
## 6       FLASH FLOOD 1.824399e+11

After our transformation, the table shows that the most costly storm events are flood, hurricane, tornado, storm surge, and hail. Floods cause roughly 3 times the economic damage as tornadoes.

Results

Tornadoes have the biggest effect on human life, as can be seen by this chart:

# Prepare the data to be in a stacked bar chart.
    q1DataChart <- q1Data %>%
        gather(key="DeathOrInjury", value="Count", -EVTYPE, -HealthEffects) %>%
        filter(Count > 0) %>%
        arrange(desc(HealthEffects)) %>%
        select(EVTYPE, DeathOrInjury, Count, HealthEffects)

# Order the result from largest to smallest
    q1Data$EVTYPE <- as.character(q1Data$EVTYPE)
    topFiveEffects <- as.character(q1Data$EVTYPE[1:5])
    q1DataChart$EVTYPE <- factor(q1DataChart$EVTYPE
                                 , levels = topFiveEffects
                                 )
# Format the Events in comma form
    q1DataChart$Count <- comma(q1DataChart$Count)

# Plot in a stacked bar chart
    q1Plot <- ggplot(q1DataChart
                     , aes(x=q1DataChart$EVTYPE
                           , y=comma(q1DataChart$Count)
                           , fill=q1DataChart$DeathOrInjury
                           )
                     )

    g <- testPlot + 
        geom_bar(stat = "identity") +
        xlab("Events") +
        ylab("Number of People Directly Affected") +
        ggtitle("Top 5 Most Dangerous Storm Events") + 
        labs(fill="") +
        theme_bw() +
        scale_fill_manual(values=c("red", "black"), limits=c("Injury", "Death")) +
        scale_y_continuous(labels=comma)
    g        

We can see that tornadoes cause more than ten times the number of deaths as the next most harmful event, which is excessive heat.

Floods have the greatest economic impact, as can be seen by this chart.

    q2ColumnsD <- q2ColumnsC[q2ColumnsC$cost == q2ColumnsC$cost[1:5], ]

# Order the result from largest to smallest
    q2ColumnsD$EVTYPE <- as.character(q2ColumnsD$EVTYPE)
    topFiveEffectsEconomic <- as.character(q2ColumnsD$EVTYPE[1:5])
    q2ColumnsD$EVTYPE <- factor(q2ColumnsD$EVTYPE
                                 , levels = topFiveEffectsEconomic
                                 )

    q2Data <- ggplot(q2ColumnsD
                     , aes(x=q2ColumnsD$EVTYPE
                           , y=q2ColumnsD$cost
                           )
                     )
    q2Plot <- q2Data + 
        geom_bar(stat = "identity") +
        xlab("Events") +
        ylab("Economic Impact ($)") +
        ggtitle("Top 5 Most Costly Storm Events") + 
        labs(fill="") +
        theme_bw() 
    q2Plot

Floods cause roughly three times the economic damage as tornadoes.