Introduction

The purpose of this analysis is to look at data from the NOAA Storm Database from years 1950-2011 and analyze two questions of interest:

  1. Across the US, which types of events are most harmful with respect to population health?
  2. Across the US, which types of events have the greatest economic consequences?

To answer these questions, I will organize the data from the database. To find which events are most harmful with respect to health, I will look at which types of events have historically caused the most fatalities. To find which events cause the greatest economic consequences, I will look at which events have historically caused the greatest monetary loss to crop damage and property damage.

Data Processing

I will be using the reshape2, ggplot2, dplyr, and knitr libraries throughout this analysis.

#Load necessary libraries to be used
library(knitr)
library(reshape2)
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

First, I read in the data and take a look at it.

df <- read.csv("repdata_data_StormData.csv")
str(df)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 35 levels "","  N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_LOCATI: Factor w/ 54429 levels ""," Christiansburg",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_DATE  : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_TIME  : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_LOCATI: Factor w/ 34506 levels ""," CANTON"," TULIA",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WFO       : Factor w/ 542 levels ""," CI","%SD",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ZONENAMES : Factor w/ 25112 levels "","                                                                                                                               "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436781 levels "","\t","\t\t",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

It turns out the event type classifications are messy and inconsistent as shown below:

unique(df$EVTYPE)[1:20]
##  [1] TORNADO                   TSTM WIND                
##  [3] HAIL                      FREEZING RAIN            
##  [5] SNOW                      ICE STORM/FLASH FLOOD    
##  [7] SNOW/ICE                  WINTER STORM             
##  [9] HURRICANE OPAL/HIGH WINDS THUNDERSTORM WINDS       
## [11] RECORD COLD               HURRICANE ERIN           
## [13] HURRICANE OPAL            HEAVY RAIN               
## [15] LIGHTNING                 THUNDERSTORM WIND        
## [17] DENSE FOG                 RIP CURRENT              
## [19] THUNDERSTORM WINS         FLASH FLOOD              
## 985 Levels:    HIGH SURF ADVISORY  COASTAL FLOOD ... WND

We can see that a lot of events are categorized with different and sometimes incorrect spellings or capitalization, so need to be cleaned up. I have done some cleaning up shown by the code below:

#Cleaning up of EVTYPE
winterstorm <- c("icy", "ice", "snow", "wint", "hail", "freez", "frost", 
                 "blizzard", "glaze", "sleet")
coastalstorm <- c("coastal", "surf", "waves", "current", "tsunami", "waterspout",
                  "high", "seas", "tide", "surge", "water", "wave", "typhoon")
marine <- c("marine", "drowning")
rainstorm <- c("rain", "precip")
cold <- c("cold", "hypothermia", "hyperthermia", "low temperature")
flooding <- c("floo", "fld")

df$EVTYPE <- as.character(df$EVTYPE)
df$EVTYPE[grepl("aval", df$EVTYPE, ignore.case=TRUE)]  <- "Avalanche"
df$EVTYPE[grepl(paste(cold, collapse = "|"), 
               df$EVTYPE, ignore.case=TRUE)]  <- "Cold"
df$EVTYPE[grepl("drough", df$EVTYPE, ignore.case=TRUE)]  <- "Drought"
df$EVTYPE[grepl("dust", df$EVTYPE, ignore.case=TRUE)]  <- "Dust Related"
df$EVTYPE[grepl(paste(flooding, collapse = "|"), 
               df$EVTYPE, ignore.case=TRUE)]  <- "Flooding"
df$EVTYPE[grepl("fire", df$EVTYPE, ignore.case=TRUE)]  <- "Fire/Wildfire"
df$EVTYPE[grepl("fog", df$EVTYPE, ignore.case=TRUE)]  <- "Fog"
df$EVTYPE[grepl("heat|warm", df$EVTYPE, ignore.case=TRUE)]  <- "Heat"
df$EVTYPE[grepl("mudslide", df$EVTYPE, ignore.case=TRUE)]  <- "Mudslide"
df$EVTYPE[grepl("landslide", df$EVTYPE, ignore.case=TRUE)]  <- "Landslide"
df$EVTYPE[grepl("lightning|thunder", df$EVTYPE, ignore.case=TRUE)]  <- "Lightning"
df$EVTYPE[grepl("torn", df$EVTYPE, ignore.case=TRUE)]  <- "Tornado"
df$EVTYPE[grepl(paste(winterstorm, collapse = "|"), 
               df$EVTYPE, ignore.case=TRUE)]  <- "Winter Storm"
df$EVTYPE[grepl(paste(rainstorm, collapse = "|"), 
               df$EVTYPE, ignore.case=TRUE)]  <- "Rain storm"
df$EVTYPE[grepl("wind|microburst|funnel", df$EVTYPE, ignore.case=TRUE)]  <- "Winds"
df$EVTYPE[grepl("hurric", df$EVTYPE, ignore.case=TRUE)]  <- "Hurricane"
df$EVTYPE[grepl("tropical", df$EVTYPE, ignore.case=TRUE)]  <- "Tropical Storm"
df$EVTYPE[grepl(paste(coastalstorm, collapse = "|"), 
               df$EVTYPE, ignore.case=TRUE)]  <- "Coastal Storm"
df$EVTYPE[grepl(paste(marine, collapse = "|"), 
               df$EVTYPE, ignore.case=TRUE)]  <- "Marine Related"

The second messy part of this data set is related to the economic figures. The table contains two “EXP” columns which indicates a power of 10 multiplier for each of DMG values. There are some odd characters here, which need to be cleaned up

unique(df$PROPDMGEXP)
##  [1] K M   B m + 0 5 6 ? 4 2 3 h 7 H - 1 8
## Levels:  - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
unique(df$CROPDMGEXP)
## [1]   M K m B ? 0 k 2
## Levels:  ? 0 2 B k K m M

The numeric values represent powers of 10. It should follow that the values h = hundreds, k = thousands, m = millions, b = billions. There is no way to know the values of -, ?, + so I will assign them values of 0.

#Clean PROPDMGEXP
df$PROPDMGEXP <- as.character(df$PROPDMGEXP)
df$PROPDMGEXP[df$PROPDMGEXP == "" | df$PROPDMGEXP == "+" | df$PROPDMGEXP == "-" | df$PROPDMGEXP == "?"] <- 0
df$PROPDMGEXP[df$PROPDMGEXP == "H" | df$PROPDMGEXP == "h"] <- 2
df$PROPDMGEXP[df$PROPDMGEXP == "K"] <- 3
df$PROPDMGEXP[df$PROPDMGEXP == "M" | df$PROPDMGEXP == "m"] <- 6
df$PROPDMGEXP[df$PROPDMGEXP == "B"] <- 9
df$PROPDMGEXP <- as.numeric(df$PROPDMGEXP)
unique(df$PROPDMGEXP)
##  [1] 3 6 0 9 5 4 2 7 1 8
#Clean CROPDMGEXP
df$CROPDMGEXP <- as.character(df$CROPDMGEXP)
df$CROPDMGEXP[df$CROPDMGEXP == "?" | df$CROPDMGEXP == ""] <- 0
df$CROPDMGEXP[df$CROPDMGEXP == "K" | df$CROPDMGEXP == "k"] <- 3
df$CROPDMGEXP[df$CROPDMGEXP == "M" | df$CROPDMGEXP == "m"] <- 6
df$CROPDMGEXP[df$CROPDMGEXP == "B"] <- 9
df$CROPDMGEXP <- as.numeric(df$CROPDMGEXP)
unique(df$CROPDMGEXP)
## [1] 0 6 3 9 2

Results

Health consequences of storms

For our first question of interest, we are only concerned with the data with health consequences. So I have subsetted the data by those storms that caused any fatalities or injuries and aggregated the sums for each event type:

health <- subset(df, FATALITIES > 0 | INJURIES > 0)
X <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = health, FUN = sum)
names(X) <- c("EventType", "Fatalities", "Injuries")
X <- arrange(X, desc(Fatalities + Injuries))[1:10,]

Here is a table of the top 10 events with the highest counts related to health consequences:

print(X)
##        EventType Fatalities Injuries
## 1        Tornado       5661    91407
## 2           Heat       3172     9228
## 3       Flooding       1553     8683
## 4          Winds        986     8897
## 5      Lightning       1029     7711
## 6   Winter Storm        676     7843
## 7  Coastal Storm        821      999
## 8  Fire/Wildfire         90     1608
## 9      Hurricane        133     1328
## 10           Fog         80     1076

It may be useful to visualize the data as well:

healthmelted <- melt(X, id = "EventType")
ggplot(data = healthmelted, aes(x = EventType, y = value, fill = variable)) +
        geom_bar(stat = "identity") +
        scale_fill_brewer(palette = "Set1") + 
        ggtitle("Harmful US Storms 1950-2011") +
        coord_flip()

Economic Damages of Storms

For our second question of interest, we are only concerned with the data causing economic damage. I subsetted the data by those storms that caused any damage.

econ <- select(df, c(EVTYPE, contains("DMG")))               
econ <- subset(econ, PROPDMG > 0 | CROPDMG > 0)

Now I aggregate the sums for each event type (in billions) and we can see this data in a table:

Y <- aggregate(cbind(CROPDMG*10^CROPDMGEXP, PROPDMG*10^PROPDMGEXP) ~ EVTYPE, data = econ, FUN = sum)
Y[,2:3] <- Y[,2:3]/1000000000
names(Y) <- c("EventType", "CropDamage", "PropertyDamage")
Y <- arrange(Y, desc(CropDamage + PropertyDamage))[1:10,]
print(Y)
##         EventType CropDamage PropertyDamage
## 1        Flooding 12.3885972     168.270525
## 2       Hurricane  5.5052928      84.656180
## 3         Tornado  0.4174615      58.603318
## 4   Coastal Storm  0.0017000      48.831805
## 5    Winter Storm 10.4249263      28.469780
## 6         Drought 13.9726218       1.046306
## 7           Winds  1.3313427      10.679604
## 8   Fire/Wildfire  0.4032816       8.501629
## 9  Tropical Storm  0.6948960       7.716128
## 10      Lightning  0.6650675       7.576885

And to visualize the data:

econmelted <- melt(Y, id = "EventType")
ggplot(data = econmelted, aes(x = EventType, y = value, fill = variable)) +
        geom_bar(stat = "identity") +
        scale_fill_brewer(palette = "Set1") + 
        ggtitle("Economic Damage (Billions) from US Storms 1950-2011") +
        coord_flip()

I think the data organized here paints some clear pictures that will be useful for government officials to use as they prioritize resources for different types of future events.