Synopsis

The goal of this analysis is to examine the health and economic effects of various weather events on the United States population from 1950 to 2011. Data were obtained from the NOAA Storm Database, along with the supporting reference instruction from the National Weather Service (NWS 10-1605, August 17, 2007). The data were considered accurate enough for the purpose of general insights, although improvements over time have increased reporting, granularity (more events), and the level of detail captured.

Tornadoes cause by far the most fatalities and injuries of all weather-related events. It should be noted that, although heat-related injuries are about half the number of wind-related injuries, heat events are significantly more fatal. Overall, drought has the largest impact on crops, and flooding causes the largest losses in property damage. Flooding is also the leading cause of economic damage overall, having caused some $149 billion US worth of combined damage in 2014 dollars.

Data Processing

Loading the Data

# Load Libraries for use
library(plyr)
library(ggplot2)
library(grid)


#Only download/extract the first time

if(!file.exists("repdata-data-StormData.csv.bz2")) {
  url="https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
  
  download.file(url, destfile="repdata-data-StormData.csv.bz2", method="curl") 
}

if(!exists("StormData")){
  StormData <- read.csv("repdata-data-StormData.csv.bz2")
}

# Quick look at the data columns
names(StormData)
##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"

Subset to columns/rows worth keeping

The questions being investigated concern the date, event type, health impact, and financial impact, so to simplify the data set we’ll drop the columns we’re not going to use, standardize EVTYPE to uppercase, and reformat the date. Then we’ll take a look at the structure of what’s left.

subStormData <- StormData[,c("BGN_DATE", "EVTYPE","FATALITIES","INJURIES","PROPDMG","PROPDMGEXP","CROPDMG","CROPDMGEXP")]

subStormData$EVTYPE <- toupper(subStormData$EVTYPE)
subStormData$EVTYPE <- as.factor(subStormData$EVTYPE)
subStormData$BGN_DATE <- format(as.Date(subStormData$BGN_DATE, format="%m/%d/%Y %H:%M:%S"), "%m/%d/%Y")

# str() prints the structure itself; wrapping it in print() only adds a stray NULL
str(subStormData)
## 'data.frame':    902297 obs. of  8 variables:
##  $ BGN_DATE  : chr  "04/18/1950" "04/18/1950" "02/20/1951" "06/08/1951" ...
##  $ EVTYPE    : Factor w/ 898 levels "   HIGH SURF ADVISORY",..: 758 758 758 758 758 758 758 758 758 758 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...

Since we are only interested in events that had a health or financial impact, we can drop all events where neither occurred. We will then go through this reduced set and normalize the property and crop damage exponents, which contain 19 and 9 levels respectively (values such as K, m, B, etc.).

subStormData.sig <- subset(subStormData, FATALITIES>0 | INJURIES>0 | PROPDMG>0 | CROPDMG>0)

Normalizing the cost exponent value

# Make all characters upper 
subStormData.sig$PROPDMGEXP <- toupper(subStormData.sig$PROPDMGEXP)
subStormData.sig$CROPDMGEXP <- toupper(subStormData.sig$CROPDMGEXP)

# Convert each code to a numeric exponent. For ?, + and -, rather than drop the rows, set the exponent to 0
subStormData.sig$PROPDMGEXP[subStormData.sig$PROPDMGEXP == "H"] <- 2
subStormData.sig$PROPDMGEXP[subStormData.sig$PROPDMGEXP == "K"] <- 3
subStormData.sig$PROPDMGEXP[subStormData.sig$PROPDMGEXP == "M"] <- 6
subStormData.sig$PROPDMGEXP[subStormData.sig$PROPDMGEXP == "B"] <- 9
subStormData.sig$PROPDMGEXP[subStormData.sig$PROPDMGEXP == "?"] <- 0
subStormData.sig$PROPDMGEXP[subStormData.sig$PROPDMGEXP == "+"] <- 0
subStormData.sig$PROPDMGEXP[subStormData.sig$PROPDMGEXP == "-"] <- 0
subStormData.sig$PROPDMGEXP[subStormData.sig$PROPDMGEXP == ""] <- 0

subStormData.sig$CROPDMGEXP[subStormData.sig$CROPDMGEXP == "H"] <- 2
subStormData.sig$CROPDMGEXP[subStormData.sig$CROPDMGEXP == "K"] <- 3
subStormData.sig$CROPDMGEXP[subStormData.sig$CROPDMGEXP == "M"] <- 6
subStormData.sig$CROPDMGEXP[subStormData.sig$CROPDMGEXP == "B"] <- 9
subStormData.sig$CROPDMGEXP[subStormData.sig$CROPDMGEXP == "?"] <- 0
subStormData.sig$CROPDMGEXP[subStormData.sig$CROPDMGEXP == ""] <- 0

# Store as numbers 
subStormData.sig$PROPDMGEXP <- as.numeric(subStormData.sig$PROPDMGEXP)
subStormData.sig$CROPDMGEXP <- as.numeric(subStormData.sig$CROPDMGEXP)
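
The same mapping could also be written as a named lookup vector, which keeps the codes and their exponents in one place. The sketch below is an alternative, not the code used for this report; `exp_lookup` and `exp_to_power` are hypothetical names, and it assumes (as above) that single digits are already exponents and that any other unrecognized code should become 0.

```r
# Hypothetical helper (not part of the original analysis): map an exponent
# code such as "K" or "m" to a power of ten via a named lookup vector.
exp_lookup <- c(H = 2, K = 3, M = 6, B = 9)

exp_to_power <- function(code) {
  code <- toupper(as.character(code))
  is_digit <- grepl("^[0-9]$", code)     # digits are taken as exponents as-is
  power <- unname(exp_lookup[code])      # NA for anything not in the lookup
  power[is_digit] <- as.numeric(code[is_digit])
  power[is.na(power)] <- 0               # "", "?", "+", "-" fall back to 0
  power
}

exp_to_power(c("K", "m", "B", "", "?", "5"))
```

With this helper, each exponent column could be converted in a single call instead of one assignment per code.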

Calculating present value of money

Now that the crop/property damage units have been fixed, we must account for the “present value” of past money, because the cost of goods and services goes up (on average) over time. To do this, we normalize all dollar amounts to a single reference point in time using the annual average Consumer Price Index (CPI). All values are expressed in 2014 dollars using the formula: Present$ = Past$ * (PresentCPI / PastCPI).

CPI TABLE* (year, annual average CPI):

1950, 24.1      1951, 26.0      1952, 26.5      1953, 26.7      1954, 26.9
1955, 26.8      1956, 27.2      1957, 28.1      1958, 28.9      1959, 29.1
1960, 29.6      1961, 29.9      1962, 30.2      1963, 30.6      1964, 31.0
1965, 31.5      1966, 32.4      1967, 33.4      1968, 34.8      1969, 36.7
1970, 38.8      1971, 40.5      1972, 41.8      1973, 44.4      1974, 49.3
1975, 53.8      1976, 56.9      1977, 60.6      1978, 65.2      1979, 72.6
1980, 82.4      1981, 90.9      1982, 96.5      1983, 99.6      1984, 103.9
1985, 107.6     1986, 109.6     1987, 113.6     1988, 118.3     1989, 124.0
1990, 130.7     1991, 136.2     1992, 140.3     1993, 144.5     1994, 148.2
1995, 152.4     1996, 156.9     1997, 160.5     1998, 163.0     1999, 166.6
2000, 172.2     2001, 177.1     2002, 179.9     2003, 184.0     2004, 188.9
2005, 195.3     2006, 201.6     2007, 207.342   2008, 215.303   2009, 214.537
2010, 218.056   2011, 224.939   2012, 229.594   2013, 232.957   2014, 236.384
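
As a quick worked example of the formula, the sketch below converts $1,000 of damage reported in 1950 and in 2011 into 2014 dollars, using CPI values taken from the table. The function name `to_2014_dollars` is hypothetical and only serves the illustration.

```r
# Minimal sketch of the CPI adjustment; to_2014_dollars is a hypothetical helper.
cpi_2014 <- 236.384   # annual average CPI for 2014, from the table

to_2014_dollars <- function(past_dollars, past_cpi) {
  # Present$ = Past$ * (PresentCPI / PastCPI)
  past_dollars * (cpi_2014 / past_cpi)
}

# $1,000 of damage reported in 1950 (CPI 24.1) and in 2011 (CPI 224.939)
to_2014_dollars(1000, c(24.1, 224.939))
```

The 1950 amount grows to roughly $9,808 and the 2011 amount to roughly $1,051, which shows why older events would be badly understated without the adjustment.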

# First read the table in, then merge with the Storm Data
cpi <- read.csv("cpi-data.csv", skip=1, header=T)
cpi$BGN_DATE <- as.factor(cpi$BGN_DATE)
cpi$BGN_DATE <- format(as.Date(cpi$BGN_DATE, format="%Y"), "%Y")
subStormData.sig$BGN_DATE <- format(as.Date(subStormData.sig$BGN_DATE, format="%m/%d/%Y"), "%Y")
StormData.2014dollars <- merge(subStormData.sig, cpi, by="BGN_DATE")

# Adjust for inflation: create one cost value each for crops/property in 2014 dollars (Present Value: PV)
# Formula: Present$ = Past$ * (PresentCPI / PastCPI), where PresentCPI (2014) = 236.384 and PastCPI = CPIAVG
StormData.2014dollars$PROPDMG_PV <- with(StormData.2014dollars, PROPDMG * 10^PROPDMGEXP * 236.384 / CPIAVG)
StormData.2014dollars$CROPDMG_PV <- with(StormData.2014dollars, CROPDMG * 10^CROPDMGEXP * 236.384 / CPIAVG)
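
One thing worth checking after a step like this: merge() performs an inner join by default, so any event year without a matching CPI row is silently dropped. The sketch below shows the check on toy data. The values are made up and the names `events` and `cpi_toy` are hypothetical; only the column names mirror the ones used in this analysis.

```r
# Toy data (made-up values) to illustrate the row-count check after a merge.
events <- data.frame(BGN_DATE = c("1950", "1951", "2011", "1949"),
                     PROPDMG  = c(25, 2.5, 10, 1),
                     stringsAsFactors = FALSE)
cpi_toy <- data.frame(BGN_DATE = c("1950", "1951", "2011"),
                      CPIAVG   = c(24.1, 26.0, 224.939),
                      stringsAsFactors = FALSE)

merged  <- merge(events, cpi_toy, by = "BGN_DATE")    # inner join by default
dropped <- setdiff(events$BGN_DATE, cpi_toy$BGN_DATE) # years with no CPI entry

nrow(events) - nrow(merged)   # number of rows lost in the merge
dropped
```

If rows should be kept even without a CPI match, `all.x = TRUE` in merge() keeps them, with NA in the CPI column.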

Cleaning up the EVTYPE

The EVTYPE column has ~900 levels. Even after normalizing to all uppercase, there are many “unique” data names. Looking at the data in a text editor, for instance, reveals that “wind” might be thunderstorm wind, marine wind, high wind, etc… Flooding might be river, coastal, generic “flood”, etc.

To make matters more confusing, according to National Weather Service Instruction 10-1605 (8/17/2007) on Storm Data Preparation*, flooding may be the cause of an injury, but if it was part of a hurricane event, the numbers are assigned to the hurricane and the flooding is only found in the narrative. Because the reporting mechanism changed so that the narrative description could carry the reasoning and a more granular categorization, the permutations of the event types have proliferated.

While tornadoes and hurricanes might be special types of wind/storm events, we can try to lump together things like flooding and wind as a start toward broader, more generalized categories. This really was an iterative process of condensing the data, checking, adding new search strings, reading through the narratives in the raw file, checking again, and so on. The code below reflects the point at which this process was stopped. In some cases it is a little higher level than the data description document.

temp <- StormData.2014dollars

# The first gsub also cheaply converts EVTYPE from a factor to a character vector ;)
temp$EVTYPE <- gsub("COASTAL FLOOD", "FLOOD", temp$EVTYPE)
# Chain on temp$EVTYPE so the previous substitution is not overwritten
temp$EVTYPE <- gsub("HEAVY PRECIPITATION", "RAIN", temp$EVTYPE)
# Rules are applied in order: once a row is relabeled, later patterns see the new label
temp$EVTYPE[grepl("FLOOD|HIGH WATER|RAPIDLY RISING WATER|FLD|URBAN", temp$EVTYPE)] <- "FLOOD"
temp$EVTYPE[grepl("WIND|WIN|DOWNBURST|SEVERE TURBULENCE|MICROBURST|APACHE|HIGH", temp$EVTYPE)] <- "WIND"
temp$EVTYPE[grepl("TORNADO|DUST DEVIL|GUSTNADO|WATERSPOUT|LANDSPOUT", temp$EVTYPE)] <- "TORNADO"
temp$EVTYPE[grepl("FREEZ|FROST", temp$EVTYPE)] <- "FROST/FREEZE"
temp$EVTYPE[grepl("FIRE", temp$EVTYPE)] <- "FIRE"
temp$EVTYPE[grepl("DROUGHT", temp$EVTYPE)] <- "DROUGHT"
temp$EVTYPE[grepl("HAIL", temp$EVTYPE)] <- "HAIL"
temp$EVTYPE[grepl("SNOW|WINTER|HEAVY MIX|SLEET|BLIZZARD", temp$EVTYPE)] <- "BLIZZARD/WINTER STORM"
temp$EVTYPE[grepl("HURRICANE", temp$EVTYPE)] <- "HURRICANE/TYPHOON"
temp$EVTYPE[grepl("TYPHOON", temp$EVTYPE)] <- "HURRICANE/TYPHOON"
temp$EVTYPE[grepl("ICE|ICY|GLAZE", temp$EVTYPE)] <- "ICE"
temp$EVTYPE[grepl("RAIN|SHOWER|MIXED PRECIP|COASTALSTORM|COASTAL STORM|EXCESSIVE WETNESS", temp$EVTYPE)] <- "RAIN"
temp$EVTYPE[grepl("TROPICAL STORM", temp$EVTYPE)] <- "TROPICAL STORM"
temp$EVTYPE[grepl("SURF|TIDE|SWELLS|SEA|WAVE|SURGE|STORM SURGE|RIP CUR", temp$EVTYPE)] <- "STORM SURGE/TIDE"
temp$EVTYPE[grepl("COLD|COOL|HYPOTHERMIA|LOW TEMP", temp$EVTYPE)] <- "COLD/WINDCHILL"
temp$EVTYPE[grepl("LIGHTN|LIGNTNING", temp$EVTYPE)] <- "LIGHTNING"
temp$EVTYPE[grepl("THUNDER|TSTMW", temp$EVTYPE)] <- "THUNDERSTORM"
temp$EVTYPE[grepl("FOG", temp$EVTYPE)] <- "FOG"
temp$EVTYPE[grepl("HEAT|WARM WEATHER|HYPERTHERMIA", temp$EVTYPE)] <- "HEAT"
temp$EVTYPE[grepl("AVALANC|MUD SLIDE|LANDSLIDE|MUDSLIDE|ROCK SLIDE|EROSION", temp$EVTYPE)] <- "LANDSLIDE/AVALANCHE"
temp$EVTYPE[grepl("DUST|ASH|SMOKE", temp$EVTYPE)] <- "DUST/ASH"
temp$EVTYPE[grepl("MARINE MISHAP|ACCIDENT|\\?", temp$EVTYPE)] <- "OTHER"

temp$EVTYPE <- as.factor(temp$EVTYPE)
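
Because the patterns are applied one after another, the order of the rules affects the outcome: a row relabeled by an early rule is no longer visible to later ones. The toy sketch below illustrates this with three of the patterns in shortened form. `apply_rules` and `rules` are hypothetical names introduced only for this example.

```r
# Hypothetical helper: apply (label = pattern) rules in order to a character vector.
apply_rules <- function(x, rules) {
  for (label in names(rules)) {
    x[grepl(rules[[label]], x)] <- label   # later rules see the relabeled values
  }
  x
}

rules <- c("FLOOD"                 = "FLOOD|FLD",
           "WIND"                  = "WIND|WIN",
           "BLIZZARD/WINTER STORM" = "SNOW|WINTER")

apply_rules(c("FLASH FLOOD", "HIGH WIND", "WINTER STORM", "HEAVY SNOW"), rules)
```

Here "WINTER STORM" ends up as "WIND" because the short pattern "WIN" matches it before the winter rule runs, while "HEAVY SNOW" still reaches "BLIZZARD/WINTER STORM". Short patterns like this are worth reviewing when reading the grouped totals.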

Results

Question 1: Which events are most harmful to human health?

We will look at the top seven events for Fatalities and Injuries by weather events in Figure 1 below:

#Summarize the fatalities and injuries
sum.health <- ddply(temp, "EVTYPE", summarize, sumFATALITIES=sum(FATALITIES),sumINJURY=sum(INJURIES))

#Look at the top seven of each for good measure
fatal <- sum.health[order(sum.health$sumFATALITIES,decreasing=T),]
injury <- sum.health[order(sum.health$sumINJURY,decreasing=T),]

top7fatal <- fatal[1:7, c("EVTYPE", "sumFATALITIES")]
top7injury <- injury[1:7, c("EVTYPE", "sumINJURY")]

Figure 1: Fatalities and Injuries Comparison by Weather Events (1950-2011)

p1 <- ggplot(top7fatal, aes(x=reorder(EVTYPE,-sumFATALITIES),y=sumFATALITIES, fill=sumFATALITIES )) +
     geom_bar(stat="identity") +
     ggtitle("Fatalities") +
     xlab("") +
     ylab("Count") +
     theme(axis.text.x = element_text(angle = 45, hjust=1)) +
     theme(legend.position="none")  

p2 <- ggplot(top7injury, aes(x=reorder(EVTYPE,-sumINJURY),y=sumINJURY, fill=sumINJURY )) +
     geom_bar(stat="identity") +
     ggtitle("Injuries") +
     xlab("") +
     ylab("Count") +
     theme(axis.text.x = element_text(angle = 45, hjust=1)) +
     theme(legend.position="none")  

# Reference to panel layout plotting:
# http://zevross.com/blog/2014/08/04/beautiful-plotting-in-r-a-ggplot2-cheatsheet-3/#put-two-potentially-unrelated-plots-side-by-side-pushviewport
pushViewport(viewport(layout = grid.layout(1, 2)))
print(p1, vp = viewport(layout.pos.row = 1, layout.pos.col = 1))
print(p2, vp = viewport(layout.pos.row = 1, layout.pos.col = 2))

Question 2: Which events have the greatest economic impact on property and crops?

We will look at the top seven events for Property and Crop damage in 2014 US Dollars by weather events in Figure 2 below:

#Summarize the Property and Crop damage costs
sum.cost <- ddply(temp, "EVTYPE", summarize, sumPROPERTY=sum(PROPDMG_PV),sumCROP=sum(CROPDMG_PV))

#Look at the top seven of each for good measure
property <- sum.cost[order(sum.cost$sumPROPERTY,decreasing=T),]
crop <- sum.cost[order(sum.cost$sumCROP,decreasing=T),]

top7property <- property[1:7, c("EVTYPE", "sumPROPERTY")]
top7crop <- crop[1:7, c("EVTYPE", "sumCROP")]

# Let's look at things in billions of dollars (10^9)
top7property$sumPROPERTY <- with(top7property, sumPROPERTY/10^9)
top7crop$sumCROP <- with(top7crop, sumCROP/10^9)

Figure 2: Property and Crop damage costs by Weather Event (1950-2011) in 2014 US Dollars

p1 <- ggplot(top7property, aes(x=reorder(EVTYPE,-sumPROPERTY),y=sumPROPERTY, fill=sumPROPERTY )) +
     geom_bar(stat="identity") +
     ggtitle("Property Damage") +
     xlab("") +
     ylab("Billions (2014 US Dollars)") +
     theme(axis.text.x = element_text(angle = 45, hjust=1)) +
     theme(legend.position="none")  

p2 <- ggplot(top7crop, aes(x=reorder(EVTYPE,-sumCROP),y=sumCROP, fill=sumCROP )) +
     geom_bar(stat="identity") +
     ggtitle("Crop Damage") +
     xlab("") +
     ylab("Billions (2014 US Dollars)") +
     theme(axis.text.x = element_text(angle = 45, hjust=1)) +
     theme(legend.position="none")  

# Reference to panel layout plotting:
# http://zevross.com/blog/2014/08/04/beautiful-plotting-in-r-a-ggplot2-cheatsheet-3/#put-two-potentially-unrelated-plots-side-by-side-pushviewport
pushViewport(viewport(layout = grid.layout(1, 2)))
print(p1, vp = viewport(layout.pos.row = 1, layout.pos.col = 1))
print(p2, vp = viewport(layout.pos.row = 1, layout.pos.col = 2))

Conclusions

Question 1: Health Effects

Based on the plots, tornadoes cause by far the most fatalities and injuries of all weather-related events. It should be noted that, although heat-related injuries are about half the number of wind-related injuries, heat events are significantly more fatal.
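
One way to make the "more fatal" comparison explicit is to compute fatalities per recorded injury for each event type. The sketch below does this on a toy data frame whose column names mirror the health summary. The numbers are made up for illustration and are not results from the storm data.

```r
# Made-up totals for illustration only; these are NOT results from the storm data.
toy <- data.frame(EVTYPE        = c("A", "B", "C"),
                  sumFATALITIES = c(50, 30, 5),
                  sumINJURY     = c(1000, 100, 500),
                  stringsAsFactors = FALSE)

# Fatalities per recorded injury: higher values mean harm is more often deadly.
toy$fatal_per_injury <- toy$sumFATALITIES / toy$sumINJURY
ranked <- toy[order(toy$fatal_per_injury, decreasing = TRUE),
              c("EVTYPE", "fatal_per_injury")]
ranked
```

In this toy example, event B has fewer fatalities than A in absolute terms but the highest ratio, which is the kind of pattern described for heat versus wind.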

Question 2: Economic Impact

Flooding is the leading cause of economic damage, having caused some $149 billion US worth of combined damage in 2014 dollars. Overall, drought has the largest impact on crops, and it is interesting to note that tornadoes do not seem to have much of an impact on them. Hurricanes, however, do have a big impact on crops, most likely owing to the southern US coastal states they affect; a further breakdown by geographic distribution would be an interesting exercise.