Synopsis

This report examines the average annual cost of each type of weather event tracked by the National Weather Service. We will show that while flooding has the strongest cost in terms of damage to property it is extreme heat that has the largest impact on population health.

Data Processing

For this analysis I’ve utilized the Storm Data set from the National Weather Service’s National Climatic Data Center.

Here is the code to obtain the dataset:

filename <- "StormData.csv.bz2"
fileURL <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
if(!file.exists(filename)){
                download.file(fileURL, filename)
}

Since the data is compressed as a .csv.bz2 file we can read it directly into R:

SData <- read.csv(filename)

This is a very large data set and most of it doesn’t pertain to the two questions we are asking, namely, which types of events are the most harmful to public health and which events have the greatest economic impact. I’ll subset the data so I only have to work with the pertinent information. If an event didn’t cause any fatalities, injuries, property damage or crop damage it isn’t interesting to us and I’ll subset them out as well.

SDataSub <- subset(SData, FATALITIES+INJURIES+PROPDMG+CROPDMG > 0, 
                   select = c(BGN_DATE, EVTYPE, FATALITIES:CROPDMGEXP))

It’s possible that some individual events could have a very large impact but not happen very often. To make sure our results aren’t skewed by an outlier I’m going to be looking at the average harm per type by year. Since the date data is stored as a factor variable I’ll need to convert it to a date variable and then extract the year. I find the lubridate package the easiest to do this with so I’ll load that package, extract the dates and then bind on just the year to the data we’re working with.

library(lubridate)
datelist <- mdy_hms(SDataSub$BGN_DATE)
yearlist <- year(datelist)
SDataSub <- cbind(SDataSub, YEAR = yearlist)

The economic impact data is stored in the original dataset rounded to 3 significant digits followed by an alphabetical signifying the magnitude of the number, “K” for thousands, “M” for millions and “B” for billions. There is also a lot of garbage coding in the field that has no explanation in the documentation that we need to clean up.

I’m going to assume the lower case letters should be their upper case equivalents and that all the rest yield an exponent of 1 except H which I’ll guess means hundred. If it’s blank we’ll use 0.

propexplist <- names(table(SDataSub$PROPDMGEXP))
ExpOne <- propexplist[!propexplist %in% c("m", "M", "k", "K", "b", "B", "h", "H", "0", "")]
propexp <- as.character(SDataSub$PROPDMGEXP)
propexp[propexp == "H"] <- 100
propexp[propexp == "h"] <- 100
propexp[propexp == "K"] <- 1000
propexp[propexp == "k"] <- 1000
propexp[propexp == "M"] <- 1000000
propexp[propexp == "m"] <- 1000000
propexp[propexp == "B"] <- 1000000000
propexp[propexp == "b"] <- 1000000000
propexp[propexp == ""] <- 0
propexp[propexp %in% ExpOne] <- 1
cropexplist <- names(table(SDataSub$CROPDMGEXP))
ExpOne <- cropexplist[!cropexplist %in% c("m", "M", "k", "K", "b", "B", "h", "H", "0", "")]
cropexp <- as.character(SDataSub$CROPDMGEXP)
cropexp[cropexp == "H"] <- 100
cropexp[cropexp == "h"] <- 100
cropexp[cropexp == "K"] <- 1000
cropexp[cropexp == "k"] <- 1000
cropexp[cropexp == "M"] <- 1000000
cropexp[cropexp == "m"] <- 1000000
cropexp[cropexp == "B"] <- 1000000000
cropexp[cropexp == "b"] <- 1000000000
cropexp[cropexp == ""] <- 0
cropexp[cropexp %in% ExpOne] <- 1

Now we have new character vectors for the property damage and crop damage multiples. I’ll now calculate the total damage for each event by multiplying the new multiples by the PRPDMG and CROPDMG and sum the result. I’ll bind this as a new column called TTLDMG.

TTLDMG <- SDataSub$PROPDMG * as.numeric(propexp) + SDataSub$CROPDMG * as.numeric(cropexp)
SDataSub <- cbind(SDataSub, TTLDMG)

Time for the real cleaning. The EVTYPE is ultimately what we’re interested in. However, the way it’s split up is far too granular. For example, the name of each hurricane is included in the field instead of just hurricane. What we need is a clean list of names that separates the EVTYPEs into higher level buckets. I’m no meteorologist but I’ll attempt to make a new vector that does this. I’ll use the stringr package here. Note, some of this is up for interpretation.

library(stringr)
NEVT <- as.character(SDataSub$EVTYPE)
NEVT <- str_to_upper(NEVT)
hurricane <- unique(str_subset(NEVT, "HURRICANE|TYPHOON|TSUNAMI"))
NEVT[NEVT %in% hurricane] <- "HURRICANE"
tornado <- unique(str_subset(NEVT, "TORNADO|WATERSPOUT|FUNNEL|WHIRL|TORNDAO|GUSTNADO"))
NEVT[NEVT %in% tornado] <- "TORNADO"
flood <- unique(str_subset(NEVT, "FLOOD|SURGE|FLD|DAM|RAPIDLY RISING WATER|HIGH WATER"))
NEVT[NEVT %in% flood] <- "FLOOD"
tstm <- unique(str_subset(NEVT, "TSTM|THUNDERSTORM|LIGHTNING|THUDERSTORM|THUNDEERSTORM|THUNDEERSTORM|THUNDEERSTORM|THUNDEERSTORM|THUNDEERSTORM|THUNDERSTROM|THUNDERTORM|THUNERSTORM|TUNDERSTORM|LIGHTING|LIGNTNING"))
NEVT[NEVT %in% tstm] <- "THUNDERSTORM"
winter <- unique(str_subset(NEVT, "WINTER|SNOW|ICE|FREEZ|BLIZZ|COLD|CHILL|HAIL|WINTRY|FROST|HYPOTHERMIA|ICY|LOW TEMPERATURE|SLEET|COOL|GLAZE|AVALANCE|AVALANCHE"))
NEVT[NEVT %in% winter] <- "WINTER WEATHER"
trop <- unique(str_subset(NEVT, "TROP"))
NEVT[NEVT %in% trop] <- "TROP STORM/DEP."
fire <- unique(str_subset(NEVT, "FIRE|SMOKE"))
NEVT[NEVT %in% fire] <- "FIRE"
wind <- unique(str_subset(NEVT, "WIND|MICROBURST|TURBULENCE"))
NEVT[NEVT %in% wind] <- "WIND"
coast <- unique(str_subset(NEVT, "COAST|TIDE|BEACH|SEA|CURRENT|SURF|SWELLS|WAVES|ROGUE WAVE|DROWNING|SEICHE|MARINE"))
NEVT[NEVT %in% coast] <- "COASTAL"
rain <- unique(str_subset(NEVT, "RAIN|PRECIPITATION|SHOWER|MIXED PRECIP|DOWNBURST|EXCESSIVE WETNESS|HEAVY MIX"))
NEVT[NEVT %in% rain] <- "RAIN"
heat <- unique(str_subset(NEVT, "HEAT|WARM|HYPERTHERMIA"))
NEVT[NEVT %in% heat] <- "HEAT"
dust <- unique(str_subset(NEVT, "DUST"))
NEVT[NEVT %in% dust] <- "DUST"
fog <- unique(str_subset(NEVT, "FOG"))
NEVT[NEVT %in% fog] <- "FOG"
others <- unique(str_subset(NEVT, "APACHE COUNTY|HIGH|LANDSLIDE|LANDSLIDES|LANDSLUMP|LANDSPOUT|MUD|OTHER|ROCK SLIDE|URBAN|ASH|^\\?"))
NEVT[NEVT %in% others] <- "OTHER"
SDataSub <- cbind(SDataSub, NEVT)

The final piece of manipulation we need to do is somehow combine the number of fatalities and the number of injuries to derive a measure of harm to population health. OSHA sets the average cost of a workplace injury at $38,000. I’ll figure storm related injuries are around the same. According to this article “the Environmental Protection Agency set the value of a human life at 9.1 million. Meanwhile, the Food and Drug Administration put it at 7.9 million - and the Department of Transportation figure was around 6 million.” I’ll take the average of those 3 and say it’s:

life <- mean(c(9100000, 7900000, 6000000))
injury <- 38000
popcost <- SDataSub$FATALITIES * life + SDataSub$INJURIES * injury
SDataSub <- cbind(SDataSub, POPCOST = popcost)

Looking at the data there really isn’t much before 1993 so I’m going to subset out all those early years

storm <- subset(SDataSub, YEAR > 1992)

We want the average per year so I’ll tapply over 2 dimensions for the cost to health and the cost to property.

propcost <- with(storm, tapply(TTLDMG, list(YEAR, NEVT), sum))
popcost <- with(storm, tapply(POPCOST, list(YEAR, NEVT), sum))

Now I’ll use the reshape2 package to melt those back into nice tidy data-sets.

library(reshape2)
propcostMelt <- melt(propcost, value.name = "Cost")
popcostMelt <- melt(popcost, value.name = "Cost")

Finally, we can do one more set of tapplys to take the mean cost for each EVTYPE.

AvgPopCost <- with(popcostMelt, tapply(Cost, Var2, mean, na.rm=TRUE))
AvgPopCost <- sort(AvgPopCost, decreasing = TRUE)
AvgPropCost <- with(propcostMelt, tapply(Cost, Var2, mean, na.rm=TRUE))
AvgPropCost <- sort(AvgPropCost, decreasing = TRUE)

Results

Using the monetary factors for injuries and human life described in the data processing section we determined the average annual cost to population health. As you can see extreme heat takes the biggest toll on population health each year.

par(mar=c(10,6,4,4), mgp=c(5,1,0))
barplot(AvgPopCost, las=2, col = "steelblue")

In terms of the cost to property the results clearly show that flooding is far and away the most costly type of event on an average annual basis.

par(mar=c(10,6,4,4), mgp=c(5,1,0))
barplot(AvgPropCost, las=2, col = "steelblue")