Weather Impacts To Population Health and The Economy In The United States, 1950-2011

Synopsis

For this data set, the main challenge was grouping the unclean data into manageable weather event categories. There were initially 985 distinct event types, including duplicates, misspellings, and variations of the same event. With the help of a CSV file from Mauricio Linhares (http://mauricio.github.io/2014/12/23/getting-and-clearning-data.html), the event categories were reduced to a more manageable 100. Mauricio’s CSV file, a text file with over 890 rows, maps the many spellings and variations of each weather event to a single canonical event name. This saved me a significant amount of time compared with writing regular expressions to perform the mapping. Once the mapping was complete, I used the plyr package to summarise the data and drew bar plots in the style of Pareto charts to answer each of the questions.

Data Processing

The data file was loaded into a data frame:

# load bzfile directly, no need to unzip
proj.data <- read.csv(bzfile("repdata-data-StormData.csv.bz2"))

Next, the data needs to be cleaned. EVTYPE has 985 distinct values, which need to be reduced to a smaller set of consistent categories.

# we only want the rows that have damage or injury
proj.data <- proj.data[which(proj.data$FATALITIES > 0 | proj.data$INJURIES > 0 |
                               proj.data$PROPDMG > 0 | proj.data$CROPDMG > 0),]
# remove extra spaces and make events upper case
proj.data$EVTYPE <- toupper(gsub("^\\s+|\\s+$", "", proj.data$EVTYPE))
# helpful code from http://mauricio.github.io/2014/12/23/getting-and-clearning-data.html on classifying the groups.  Both CSVs can be downloaded from this URL.
# the replacement.csv is a helpful mapping table that saves a lot of time,
# why reinvent the wheel right?
replacements <- read.csv("replacements.csv", stringsAsFactors=FALSE)

eventFor <- function( evtype ) {
  replacements[replacements$event == evtype,]$actual
}
proj.data$CLEANEV <- sapply(proj.data$EVTYPE, eventFor)
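
The row-by-row lookup above can be slow on a large data frame, and sapply returns a list rather than a character vector if any EVTYPE has no entry (or more than one entry) in the mapping. A minimal sketch of a vectorised alternative using match() is shown here. The toy `replacements` rows and the `lookupEvents` helper are illustrative assumptions, not the contents of the actual replacements.csv.

```r
# Toy stand-in for replacements.csv (rows are made up for illustration)
replacements <- data.frame(
  event  = c("TSTM WIND", "THUNDERSTORM WINDS", "FLASH FLOODING"),
  actual = c("THUNDERSTORM WIND", "THUNDERSTORM WIND", "FLASH FLOOD"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: vectorised lookup; unmatched event types become NA
lookupEvents <- function(evtype, mapping) {
  mapping$actual[match(evtype, mapping$event)]
}

evtypes <- c("TSTM WIND", "FLASH FLOODING", "SOMETHING UNMAPPED")
cleaned <- lookupEvents(evtypes, replacements)
print(cleaned)
```

With this approach, any event type missing from the mapping shows up as NA in the result, which makes gaps in the mapping table easy to spot.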

Next, each DMG value is multiplied by the multiplier that its DMGEXP code stands for, so that total property damage and total crop damage can be calculated in dollars.

proj.data$PROPDMGEXP <- toupper(proj.data$PROPDMGEXP)
proj.data$CROPDMGEXP <- toupper(proj.data$CROPDMGEXP)

# more useful code from the above author.  Instead of mapping everything out myself,
# he already created a csv that maps the damage (either crop or property) to the exponent
# so we can get a total damage number
multipliers <- read.csv("multipliers.csv", colClasses=c("character", "numeric"))

mapDamage <- function(damage, mapping) {
  damage * multipliers[multipliers$key == mapping,]$number
}
proj.data$property_damage <- mapply(mapDamage, proj.data$PROPDMG, proj.data$PROPDMGEXP)
proj.data$crop_damage <- mapply(mapDamage, proj.data$CROPDMG, proj.data$CROPDMGEXP)
# This will be used for question 2
proj.data$total_damage <- proj.data$property_damage + proj.data$crop_damage
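
The same multiplication can be done in one vectorised step instead of calling mapply once per row. The sketch below assumes a multipliers table with the `key` and `number` columns used above. The three rows shown (K, M, B) and the `scaleDamage` helper are assumptions for illustration, not the contents of the actual multipliers.csv.

```r
# Toy stand-in for multipliers.csv (values are assumptions for illustration)
multipliers <- data.frame(
  key    = c("K", "M", "B"),
  number = c(1e3, 1e6, 1e9),
  stringsAsFactors = FALSE
)

# Hypothetical helper: scale each damage figure by the multiplier for its code
scaleDamage <- function(damage, exp_code, mapping) {
  damage * mapping$number[match(exp_code, mapping$key)]
}

propdmg    <- c(25, 2.5, 1.2)
propdmgexp <- c("K", "M", "B")
property_damage <- scaleDamage(propdmg, propdmgexp, multipliers)
print(property_damage)
```

Any exponent code that is missing from the table yields NA, so unmapped codes can be found with is.na() before summing.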

The plyr package must be installed and loaded in order to summarise the data:

library(plyr)

To answer the first question, which events were most harmful with respect to population health, I looked at FATALITIES and INJURIES from the data set.

pop.harm <- ddply(
  proj.data,
  c("CLEANEV"),
  summarise,
  total_deaths=sum(FATALITIES),
  total_injuries=sum(INJURIES)
  )
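
For readers without plyr, a roughly equivalent summary can be sketched in base R with aggregate(). The `toy` data frame below is made up for illustration and only mimics the three columns used above.

```r
# Toy data standing in for proj.data (values are made up)
toy <- data.frame(
  CLEANEV    = c("TORNADO", "HEAT", "TORNADO", "FLOOD"),
  FATALITIES = c(3, 2, 1, 0),
  INJURIES   = c(10, 0, 5, 1),
  stringsAsFactors = FALSE
)

# Sum fatalities and injuries per event type with base R
pop.harm.base <- aggregate(
  toy[c("FATALITIES", "INJURIES")],
  by = list(CLEANEV = toy$CLEANEV),
  FUN = sum
)
names(pop.harm.base) <- c("CLEANEV", "total_deaths", "total_injuries")
print(pop.harm.base)
```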

Results

  1. The first two Pareto charts show the events that have caused the most harm to people, in terms of both FATALITIES and INJURIES.
fatality.pareto <- pop.harm[order(-pop.harm$total_deaths),]
injury.pareto <- pop.harm[order(-pop.harm$total_injuries),]
barplot(
  fatality.pareto[1:5,2],
  names.arg=fatality.pareto[1:5,1],
  cex.names=0.75,
  main="Fatalities Caused By Weather Events, 1950-2011",
  xlab="Event",
  ylab="Fatalities",
  ylim=c(0,6000))

barplot(
  injury.pareto[1:5,3],
  names.arg=injury.pareto[1:5,1],
  cex.names=0.75,
  main="Injuries Caused By Weather Events, 1950-2011",
  xlab="Event",
  ylab="Injuries",
  ylim=c(0,95000))

As the charts show, tornados and heat are the two biggest contributors to both fatalities and injuries in the United States. Tornados do overwhelmingly more harm than any other weather event on both measures.
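
A Pareto chart conventionally pairs the sorted bars with a cumulative-share line, which the bar plots above leave out. The sketch below shows one way to compute that cumulative percentage. The `toy.harm` totals are made-up numbers, not results from the storm data.

```r
# Toy totals (made-up numbers, not results from the storm data)
toy.harm <- data.frame(
  CLEANEV      = c("A", "B", "C", "D"),
  total_deaths = c(10, 50, 30, 10),
  stringsAsFactors = FALSE
)

# Sort descending, then compute each event's cumulative share of the total
toy.pareto <- toy.harm[order(-toy.harm$total_deaths), ]
toy.pareto$cum_pct <- 100 * cumsum(toy.pareto$total_deaths) /
  sum(toy.pareto$total_deaths)
print(toy.pareto)
```

The cum_pct column could then be overlaid on the bar plot, for example with lines() and a secondary axis.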

  2. For economic damage, I summed the crop damage and the property damage to get a “total damage” figure for each event.
econ.harm <- ddply(
  proj.data,
  c("CLEANEV"),
  summarise,
  total_damage=sum(total_damage)
  )
econ.pareto <- econ.harm[order(-econ.harm$total_damage),]
# divide by 10^9 to get dollars in Billions
econ.pareto$total_damage <- econ.pareto$total_damage / 10^9
barplot(
  econ.pareto[1:5,2],
  names.arg=econ.pareto[1:5,1],
  cex.names=0.75,
  main="Total Economic Damage Caused By Weather Events, 1950-2011",
  xlab="Event",
  ylab="Cost (In Billions)",
  ylim=c(0,150))

Floods have had the biggest economic impact, causing over $150 billion in damages in the United States. Hurricanes, tornados, storms, and hail round out the top 5.

Overall, I would argue that tornados have had the biggest weather impact in the United States. They are by far the most dangerous in terms of death and injury, and they rank third in terms of economic impact.