SYNOPSIS

In this report I am describing the hydro/meteorological events occurring in the United States between 1950 and 2011 that are a) most harmful with respect to population health and b) have the greatest economic consequences. The data source used in this analysis is from U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. The data indicated that the most harmful event by far with respect to fatalities and injuries is a tornado. In terms of property and crop damage, flooding has the highest economic impact.

DATA PROCESSING

The data used in this analysis can be downloaded in bz2 format from here: Storm Data

It should be noted that there are numerous inconsistencies and overlap in some of the Event categories. For example, we have FLOOD FLASH, FLOOD FLOOD/FLASH, FLOOD/FLASH, Flood/Flash Flood, FLOOD/FLASH FLOOD, FLOOD/FLASH FLOODING, FLOOD/FLASH/FLOOD, FLOOD/FLASHFLOOD, FLOODING and finally, FLOODS. These are all listed as separate events yet clearly there is overlap. It would be beyond the scope of this report to determine whether any of these overlapping categories could be combined into a more clear delineation of separate events. These inconsistencies however do not significantly change the results presented below as many of the duplicate event categories do not contain large numbers that would otherwise skew the data as it is presented here.

Unzip and read data

I am making the assumption that you have the bz2 zip file in the working directory.

if (!exists("stormData")) {
    stormData <- read.csv("repdata%2Fdata%2FStormData.csv.bz2", na.strings = "")
}

We check the dimensions of the dataset. There are 902297 rows and 37 columns.

dim(stormData)
## [1] 902297     37

Events most harmful with respect to population health

I’m defining “population health” to be fatalities or injuries directly attributable to the hydro/meteorological event itself or impact by debris due to these events as described in section 2.6.1 of the Storm Data Documentation. This data is recorded in the FATALITIES and INJURIES columns in the data set.

library(dplyr)
library(ggplot2)

I’m going to take a subset of the data that includes the event, fatalities and injuries and calculate the sum of fatalities and injuries by event as well as calculating the total number of both fatalities and injuries for each event.

healthDataByEvent <- select(stormData, 
                     EVTYPE, 
                     FATALITIES, 
                     INJURIES) %>%
    group_by(EVTYPE) %>%
    summarise( Fatalities = sum(FATALITIES), Injuries = sum(INJURIES), total.carnage = sum(FATALITIES) + sum(INJURIES))

We’ll sort that new dataset from most to least damage with regards to the sum total of injuries and fatalities.

healthDataSorted <- healthDataByEvent[order(healthDataByEvent$total.carnage, decreasing = TRUE),]

RESULTS (damage to health)

This is what we have.

head(healthDataSorted, 10)
## # A tibble: 10 x 4
##    EVTYPE            Fatalities Injuries total.carnage
##    <fct>                  <dbl>    <dbl>         <dbl>
##  1 TORNADO               5633      91346         96979
##  2 EXCESSIVE HEAT        1903       6525          8428
##  3 TSTM WIND              504       6957          7461
##  4 FLOOD                  470       6789          7259
##  5 LIGHTNING              816       5230          6046
##  6 HEAT                   937       2100          3037
##  7 FLASH FLOOD            978       1777          2755
##  8 ICE STORM               89.0     1975          2064
##  9 THUNDERSTORM WIND      133       1488          1621
## 10 WINTER STORM           206       1321          1527
library(reshape2)

We’re going to drop the total carnage column as we were only using it to get top health damage events. We then melt the data in order to be able to plot both injuries and fatalities in a single bar in the bar plot. We have to resort after the melt.

healthDataSorted <- healthDataSorted[, -4]

healthMelted <- melt(head(healthDataSorted, 10), id.vars = "EVTYPE")
healthMelted <- healthMelted[order(-healthMelted$value),]

We’ll make a bar plot of the top 10 damaging weather events with regard to both injuries and fatalities.

healthPlot <- ggplot(data=healthMelted, aes(x=reorder(EVTYPE, value, sum), value, fill=variable)) +
    geom_bar(stat="identity") +
    coord_flip() +
    labs(title="Injuries and Fatalities by Event",
         subtitle="USA - 1950-2011",
         y ="Total", 
         x = "Event Type") +
    theme(legend.title=element_blank())

print(healthPlot)

The plot above clearly shows that tornados are by far the most damaging with regards to both injuries and fatalities with 96,979 in total.

Events across the United States that have the greatest economic consequences.

I’m going to be using the variables EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP so let’s get a subset of the data with these variables.

economicCols <- c("EVTYPE", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")
economicDmg <- stormData[economicCols]

It is necessary to use the values in the exponent (EXP) variables to translate the values in the DMG variables into actual dollar amounts.

The possible values in the exponent columns are as follows:

exponentValues <- c(as.character(economicDmg$PROPDMGEXP), as.character(economicDmg$CROPDMGEXP))
unique(exponentValues)
##  [1] "K" "M" NA  "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-"
## [18] "1" "8" "k"

I’m calculating the monetary value of damage by multiplying damage by 10 to the power of the exponent. Eg. PROPDMG * 10 ^ PROPDMGEXP

I’m making the assumptions that:

  • H or h=Hundreds (10^2)
  • K or k=Thousands (10^3)
  • M or m=Millions (10^6)
  • B=Billions (10^9)
  • Other numeric values will be used as the exponent eg 6 -> 10^6

NA, +, - and ? will be changed to 0 in order that the value in the DMG column will be taken as is.

So here I am replacing the letters with their exponential equivalents and non-numeric characters with 0.

library(car)
economicDmg$PROPDMGEXP <- recode(economicDmg$PROPDMGEXP, "'K'=3;'k'=3;'H'=2;'h'=2; 'M'=6;'m'=6;'B'=9;c('+','-', '?',NA) = 0")

economicDmg$CROPDMGEXP <- recode(economicDmg$CROPDMGEXP, "'K'=3;'k'=3;'H'=2;'h'=2; 'M'=6;'m'=6;'B'=9;c('+','-', '?',NA) = 0")

Now I’ll calculate the dollar values and append these values to the dataset along with a column representing the total of both crop and property damage for each event.

economicDmg$property.damage.dollars <- economicDmg$PROPDMG * 10 ^ as.numeric(as.character(economicDmg$PROPDMGEXP))

economicDmg$crop.damage.dollars <- economicDmg$CROPDMG * 10 ^ as.numeric(as.character(economicDmg$CROPDMGEXP))

economicDmg$total.damage.amount <- economicDmg$property.damage.dollars + economicDmg$crop.damage.dollars

ecoDmgSorted <- economicDmg[order(-economicDmg$total.damage.amount),]

Now we’ll group the damage costs by event.

ecoDmgByEvent <- select(ecoDmgSorted, 
                     EVTYPE, 
                     property.damage.dollars, 
                     crop.damage.dollars,
                     total.damage.amount) %>%
    group_by(EVTYPE) %>%
    summarise( Property = sum(property.damage.dollars), Crop = sum(crop.damage.dollars), Total = sum(property.damage.dollars) + sum(crop.damage.dollars))

ecoDmgSorted <- ecoDmgByEvent[order(-ecoDmgByEvent$Total),]

top10eco <- head(ecoDmgSorted, 10)

RESULTS (economic damage)

These are the top 10 most damaging events by cost in dollars (USD).

top10eco
## # A tibble: 10 x 4
##    EVTYPE                Property        Crop        Total
##    <fct>                    <dbl>       <dbl>        <dbl>
##  1 FLOOD             144657709807  5661968450 150319678257
##  2 HURRICANE/TYPHOON  69305840000  2607872800  71913712800
##  3 TORNADO            56947380676   414953270  57362333946
##  4 STORM SURGE        43323536000        5000  43323541000
##  5 HAIL               15735267513  3025954473  18761221986
##  6 FLASH FLOOD        16822673978  1421317100  18243991078
##  7 DROUGHT             1046106000 13972566000  15018672000
##  8 HURRICANE          11868319010  2741910000  14610229010
##  9 RIVER FLOOD         5118945500  5029459000  10148404500
## 10 ICE STORM           3944927860  5022113500   8967041360

I’m removing the Total column in order to melt the data for the plot.

top10eco <- top10eco[, -4]

ecoMelted <- melt(top10eco, id.vars = "EVTYPE")
ecoMelted <- ecoMelted[order(-ecoMelted$value),]
ecoMelted$value <- ecoMelted$value / 10^9

ecoPlot <- ggplot(data=ecoMelted, aes(x=reorder(EVTYPE, value / 10 ^ 9, sum), value, fill=variable)) +
    geom_bar(stat="identity") +
    coord_flip() +
    labs(title="Economic Damage by Event",
         subtitle="USA - 1950-2011",
         y ="USD Cost (Billions)", 
         x = "Event Type") +
    theme(legend.title=element_blank())

print(ecoPlot)

The plot above shows that flooding causes the most economic damage when both property and crop damage are considered. Approximately 150 Billion USD of damage was caused between the years 1950 and 2011. The data also indicates that drought causes higher damage to crops than the other most damaging weather events.