Executive Summary

The dataset obtained from NOAA is a report of weather events and their impact, with varying quality of data. After cleaning and analysis of the data, we determined that excessive heat is the most dangerous event for human health on a per-event basis (i.e. the average heat wave is more dangerous than the average hurricane), while the most impactful event type for human health considering event frequency is the tornado (there are more tornados than heat events, so the cumulative impact is greater). The most expensive type of weather event on a per-event basis is the hurricane, while floods create greater economic impact overall, given their frequency.

Data Processing

Obtaining Data and Preprocessing

We download the data from the host server and load it into a data frame.

# Downloading adds computation time and server load, so we only fetch the file if it is not already cached locally.
library(utils)
if (!file.exists("stormdata.csv.bz2")) {
  download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", "stormdata.csv.bz2", method="curl")
}
weatherEvents<-read.csv("stormdata.csv.bz2")

Next, we will select just the columns that interest us:

EVTYPE records the type of weather event.

FATALITIES and INJURIES give us insight into the public health consequences of events.

PROPDMG, PROPDMGEXP, CROPDMG, and CROPDMGEXP provide information about the economic impact of weather events.

library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
weatherEvents<-select(weatherEvents, c(EVTYPE, FATALITIES, INJURIES,
PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP))

Next, we will handle exponent data following the lead of a previous student’s work on this dataset, adopting the same conventions for converting the respective “DMG” and “DMGEXP” fields:

DMGEXP field value            Multiply DMG amount by
B, b (billions)               1,000,000,000
M, m (millions)               1,000,000
K, k (kilos = thousands)      1,000
H, h (hundreds)               100
+                             1
-                             0
?                             0
blank or empty character      0
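The table above can be read as a simple lookup. The following is a minimal sketch of that idea, not the code used in this report: the helper name expToMultiplier and the toy vectors are made up for illustration, and treating any code that the table does not list (a digit, for example) as a multiplier of 0 is an assumption.

```r
# Hypothetical helper: map a DMGEXP code to the multiplier from the table above.
expToMultiplier <- function(expCode) {
  lookup <- c(b = 1e9, m = 1e6, k = 1e3, h = 1e2, "+" = 1)
  mult <- lookup[tolower(as.character(expCode))]
  # "-", "?", blank, and anything else missing from the lookup come back as NA.
  # Mapping all of them to 0 is an assumption for codes the table does not list.
  mult[is.na(mult)] <- 0
  unname(mult)
}

# Toy data, not taken from the NOAA file
dmg    <- c(25, 2.5, 10, 3, 7)
dmgExp <- c("K", "m", "", "+", "?")
scaled <- dmg * expToMultiplier(dmgExp)
scaled
```

A lookup like this keeps the whole convention in one place, which makes it easier to see which codes are scaled and which are zeroed out.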

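Once we correct for exponents, we’ll sum property damage and crop damage to come up with a total economic impact figure. Note that the code below applies only the B, M, K, and H multipliers; amounts with any other exponent value are left unchanged rather than being multiplied by 1 or 0.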

# Keep a backup copy of the data before modifying it
weatherEventsBak <- weatherEvents
# Scale damage amounts by their exponent code (B, M, K, H); rows with other codes are left unchanged
multipliers <- c(b = 1000000000, m = 1000000, k = 1000, h = 100)
for (code in names(multipliers)) {
  propRows <- which(tolower(weatherEvents$PROPDMGEXP) == code)
  cropRows <- which(tolower(weatherEvents$CROPDMGEXP) == code)
  weatherEvents$PROPDMG[propRows] <- weatherEvents$PROPDMG[propRows] * multipliers[[code]]
  weatherEvents$CROPDMG[cropRows] <- weatherEvents$CROPDMG[cropRows] * multipliers[[code]]
}
# Create a "total cost" column
weatherEvents<-mutate(weatherEvents, TOTALCOST = PROPDMG + CROPDMG)
# Remove extraneous columns
weatherEvents<-select(weatherEvents, c(EVTYPE, FATALITIES, INJURIES, TOTALCOST))

Now we have to come up with a way to quantify the public health cost in deaths and injuries. There are a number of ways to calculate total casualties, and it’s tricky because we have both deaths and injuries. For the sake of simplicity, we consider a death to be the public health equivalent of 200 injuries (a figure that is completely arbitrary).

We will calculate total casualties using this method:

weatherEvents<-mutate(weatherEvents, TOTALCASUALTY = as.integer(200*FATALITIES + INJURIES))
# Remove extraneous columns
weatherEvents<-select(weatherEvents, -c(FATALITIES, INJURIES))
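Because the weight of 200 is arbitrary, it can help to see how the score behaves under a different weight. The sketch below is illustrative only: the helper name casualtyScore, its deathWeight parameter, and the toy vectors are assumptions, not part of the analysis above.

```r
# Hypothetical helper; deathWeight defaults to the report's arbitrary choice of 200.
casualtyScore <- function(fatalities, injuries, deathWeight = 200) {
  as.integer(deathWeight * fatalities + injuries)
}

# Toy events, not taken from the NOAA file
fatalities <- c(0, 1, 2)
injuries   <- c(15, 0, 30)

weighted  <- casualtyScore(fatalities, injuries)                   # weight of 200
headCount <- casualtyScore(fatalities, injuries, deathWeight = 1)  # plain head count
weighted
headCount
```

With a weight of 200, a single death outweighs the 15 injuries in the first toy event; with a weight of 1 the ordering flips, so rankings based on this score depend on the chosen weight.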

At this point, we have a data frame of just three columns, which will be easy to analyze. Let’s look at the first few rows, and then at the weather event types listed:

head(weatherEvents)
##    EVTYPE TOTALCOST TOTALCASUALTY
## 1 TORNADO     25000            15
## 2 TORNADO      2500             0
## 3 TORNADO     25000             2
## 4 TORNADO      2500             2
## 5 TORNADO      2500             2
## 6 TORNADO      2500             6
head(table(weatherEvents$EVTYPE), 10)
## 
##    HIGH SURF ADVISORY         COASTAL FLOOD           FLASH FLOOD 
##                     1                     1                     1 
##             LIGHTNING             TSTM WIND       TSTM WIND (G45) 
##                     1                     4                     1 
##            WATERSPOUT                  WIND                     ? 
##                     1                     1                     1 
##       ABNORMAL WARMTH 
##                     4
tail(table(weatherEvents$EVTYPE), 10)
## 
##      WINTER STORMS     Winter Weather     WINTER WEATHER 
##                  3                 19               7026 
## WINTER WEATHER MIX WINTER WEATHER/MIX        WINTERY MIX 
##                  6               1104                  2 
##         Wintry mix         Wintry Mix         WINTRY MIX 
##                  3                  1                 90 
##                WND 
##                  1

The event types are a bit messy. For example, we have “Winter Weather” and “WINTER WEATHER” as two separate categories, and “WINTER WEATHER MIX”, “WINTER WEATHER/MIX”, “WINTERY MIX”, “WINTRY MIX”, “Wintry mix”, and “Wintry Mix” all as separate categories. We can fix part of this problem by standardizing capitalization. Let’s first check how many categories we begin with, then convert everything to lower case and see how much that helps.

length(table(weatherEvents$EVTYPE))
## [1] 985
weatherEvents$EVTYPE<-tolower(weatherEvents$EVTYPE)
length(table(weatherEvents$EVTYPE))
## [1] 898

Not that much of an improvement. A quick perusal of the table of weather events shows us some patterns, however. We’ll remove all the “summary” rows and consolidate many categories according to a string search (for example, anything containing “volcan” becomes “volcano-related”, anything with “flood” becomes “flood-related”, “thund” becomes “thunderstorm-related”, etc.).

weatherEvents<-filter(weatherEvents, !grepl("summary",EVTYPE))
weatherEvents$EVTYPE[which(grepl("flood", weatherEvents$EVTYPE) | grepl("floood", weatherEvents$EVTYPE)| grepl("fld", weatherEvents$EVTYPE))]<-"flood-related"
weatherEvents$EVTYPE[which(grepl("volcan", weatherEvents$EVTYPE))]<-"volcano-related"
weatherEvents$EVTYPE[which(grepl("thund", weatherEvents$EVTYPE) | grepl("tstm", weatherEvents$EVTYPE) )]<-"thunderstorm-related"
weatherEvents$EVTYPE[which(grepl("torn", weatherEvents$EVTYPE) | grepl("funnel", weatherEvents$EVTYPE) | grepl("gustnado", weatherEvents$EVTYPE))]<-"tornado-related"
weatherEvents$EVTYPE[which(grepl("wind", weatherEvents$EVTYPE) | grepl("wnd", weatherEvents$EVTYPE))]<-"wind-related"
weatherEvents$EVTYPE[which(grepl("snow", weatherEvents$EVTYPE) | grepl("ice", weatherEvents$EVTYPE) | grepl("blizz", weatherEvents$EVTYPE) | grepl("icy", weatherEvents$EVTYPE) | grepl("glaze", weatherEvents$EVTYPE) | grepl("winter", weatherEvents$EVTYPE) | (grepl("wint", weatherEvents$EVTYPE) & grepl("mix", weatherEvents$EVTYPE)))]<-"winter-weather-related"
weatherEvents$EVTYPE[which(grepl("slide", weatherEvents$EVTYPE) | grepl("slump", weatherEvents$EVTYPE) )]<-"land or mudslide-related"
weatherEvents$EVTYPE[which(grepl("tropical", weatherEvents$EVTYPE) | grepl("typhoon", weatherEvents$EVTYPE) | grepl("hurr", weatherEvents$EVTYPE) | grepl("floyd", weatherEvents$EVTYPE))]<-"tropical weather / hurricane-related"
weatherEvents$EVTYPE[which(grepl("hot", weatherEvents$EVTYPE) | grepl("heat", weatherEvents$EVTYPE) | grepl("high temperature", weatherEvents$EVTYPE) | grepl("warm", weatherEvents$EVTYPE) | grepl("hypertherm", weatherEvents$EVTYPE)  | grepl("record high", weatherEvents$EVTYPE))]<-"heat-related"
weatherEvents$EVTYPE[which(grepl("cold", weatherEvents$EVTYPE) |  grepl("hypotherm", weatherEvents$EVTYPE) | grepl("cool", weatherEvents$EVTYPE) |  grepl("freeze", weatherEvents$EVTYPE) | grepl("low temp", weatherEvents$EVTYPE) | grepl("record low", weatherEvents$EVTYPE) | grepl("frost", weatherEvents$EVTYPE))]<-"cold-related"
weatherEvents$EVTYPE[which(grepl("fire", weatherEvents$EVTYPE) | grepl("smoke", weatherEvents$EVTYPE))]<-"fire-related"
weatherEvents$EVTYPE[which(grepl("spout", weatherEvents$EVTYPE))]<-"waterspout-related"
weatherEvents$EVTYPE[which(grepl("dust", weatherEvents$EVTYPE))]<-"dust-related"
weatherEvents$EVTYPE[which(grepl("fog", weatherEvents$EVTYPE))]<-"fog-related"
weatherEvents$EVTYPE[which(grepl("surge", weatherEvents$EVTYPE) | grepl("tsun", weatherEvents$EVTYPE) | grepl("surf", weatherEvents$EVTYPE) | grepl("current", weatherEvents$EVTYPE) | grepl("wave", weatherEvents$EVTYPE) | grepl("seiche", weatherEvents$EVTYPE) | grepl("swell", weatherEvents$EVTYPE) | grepl("tide", weatherEvents$EVTYPE) | grepl("high water", weatherEvents$EVTYPE) | grepl("eros", weatherEvents$EVTYPE) | grepl("seas", weatherEvents$EVTYPE))]<-"tide/wave-related"
weatherEvents$EVTYPE[which(grepl("dry", weatherEvents$EVTYPE) | grepl("drought", weatherEvents$EVTYPE)  | grepl("below normal precip", weatherEvents$EVTYPE))]<-"drought-related"
weatherEvents$EVTYPE[which(grepl("wet", weatherEvents$EVTYPE))]<-"excessively wet-related"
weatherEvents$EVTYPE[which(grepl("hail", weatherEvents$EVTYPE) | grepl("freezing", weatherEvents$EVTYPE) | grepl("sleet", weatherEvents$EVTYPE))]<-"sleet/hail-related"
weatherEvents$EVTYPE[which(grepl("rain", weatherEvents$EVTYPE) | grepl("burst", weatherEvents$EVTYPE) | grepl("storm", weatherEvents$EVTYPE) | grepl("shower", weatherEvents$EVTYPE))]<-"rain-related"
weatherEvents$EVTYPE[which(grepl("light", weatherEvents$EVTYPE))]<-"lightning-related"
weatherEvents$EVTYPE[which(grepl("aval", weatherEvents$EVTYPE))]<-"avalanche-related"
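Each line above matches against labels that earlier lines may already have rewritten, so the outcome depends on rule order and on the label text itself (a pattern like "storm", for instance, also matches "thunderstorm-related"). One alternative is a first-match-wins pass over the original text. The sketch below illustrates that idea under stated assumptions: the rules vector, the consolidate helper, and the toy labels are hypothetical and cover only three of the categories.

```r
# Hypothetical rule table: regular expression -> consolidated label.
rules <- c(
  "thund|tstm" = "thunderstorm-related",
  "wind|wnd"   = "wind-related",
  "rain|storm" = "rain-related"
)

consolidate <- function(evtype, rules) {
  out <- rep(NA_character_, length(evtype))
  for (pattern in names(rules)) {
    # Only label rows that no earlier rule has claimed, and always match
    # against the original text rather than an already-assigned label.
    hit <- is.na(out) & grepl(pattern, evtype)
    out[hit] <- rules[[pattern]]
  }
  out
}

# Toy labels, not taken from the NOAA file
consolidated <- consolidate(c("thunderstorm winds", "high wind", "heavy rain", "fog"), rules)
consolidated
```

Rows that match no rule stay NA, which makes the unclassified remainder easy to inspect or drop.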

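While this does not account for some events such as “record temperatures” (high or low?), we now have 20 categories, each with a “-related” suffix, which account for the vast majority (more than 900,000) of cases. Thunderstorm events do not appear as a category of their own: the “storm” pattern also matches the “thunderstorm-related” label, so those events are counted under “rain-related”. We’ll perform analyses on these 20 categories. Let’s filter out the rest.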

weatherEvents<-filter(weatherEvents, grepl("-related",EVTYPE))
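One way to check how much data a filter like this keeps is to compute the share of rows whose label carries the suffix before dropping the rest. This is a small sketch on made-up labels; the vector evtype and its values are assumptions for illustration.

```r
# Toy labels, not taken from the NOAA file
evtype <- c("flood-related", "heat-related", "other", "rain-related", "record temperature")

kept <- grepl("-related", evtype)
nKept     <- sum(kept)    # number of rows the filter would keep
shareKept <- mean(kept)   # share of rows the filter would keep
dropped   <- evtype[!kept] # labels that would be dropped
nKept
shareKept
dropped
```

Listing the dropped labels is a quick way to spot event types that still need a consolidation rule.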

Analysis

To analyze the relative impact of each event type on both public health and the economy, we will use dplyr’s group_by function. This lets us compute summary statistics per weather event type.

groupedWeather<-group_by(weatherEvents, EVTYPE)
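To make the grouping step concrete, here is a minimal sketch on a toy data frame that computes a sum, a mean, and a count per group. The data frame toy and its values are made up; only the column names mirror the ones used in this report.

```r
library(dplyr)

# Toy data, not taken from the NOAA file
toy <- data.frame(
  EVTYPE    = c("flood-related", "flood-related", "heat-related"),
  TOTALCOST = c(1000, 3000, 500),
  stringsAsFactors = FALSE
)

toySummary <- toy %>%
  group_by(EVTYPE) %>%
  summarise(TotalCost = sum(TOTALCOST),
            AvgCost   = mean(TOTALCOST),
            NumEvents = n()) %>%
  arrange(desc(TotalCost))
toySummary
```

Within each group, TotalCost equals AvgCost times NumEvents, which is why a frequent event type can lead the totals while a rare one leads the averages.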

We can now do summary statistics. Let’s first find out totals (across many occurrences of the same weather event type):

summarise(groupedWeather, TotalCost = sum(TOTALCOST)) %>% arrange(desc(TotalCost))
## Source: local data frame [20 x 2]
## 
##                                  EVTYPE    TotalCost
##                                   (chr)        (dbl)
## 1                         flood-related 179976647782
## 2  tropical weather / hurricane-related  99173551360
## 3                       tornado-related  57408358249
## 4                     tide/wave-related  48248930650
## 5                    sleet/hail-related  19029673516
## 6                          rain-related  17862150634
## 7                winter-weather-related  17671448413
## 8                       drought-related  15025419600
## 9                          fire-related   8905010130
## 10                         wind-related   7158842673
## 11                         cold-related   3554767900
## 12                    lightning-related    940786537
## 13                         heat-related    924805030
## 14             land or mudslide-related    347413100
## 15              excessively wet-related    142035000
## 16                          fog-related     25011500
## 17                   waterspout-related      9571700
## 18                         dust-related      9387630
## 19                    avalanche-related      3721800
## 20                      volcano-related       500000
summarise(groupedWeather, TotalCasualties = sum(TOTALCASUALTY)) %>% arrange(desc(TotalCasualties))
## Source: local data frame [20 x 2]
## 
##                                  EVTYPE TotalCasualties
##                                   (chr)           (int)
## 1                       tornado-related         1218610
## 2                          heat-related          645043
## 3                         flood-related          319283
## 4                          rain-related          181827
## 5                     lightning-related          168631
## 6                     tide/wave-related          163562
## 7                          wind-related          141394
## 8                winter-weather-related          136479
## 9                          cold-related           46683
## 10                    avalanche-related           45170
## 11 tropical weather / hurricane-related           41516
## 12                         fire-related           19608
## 13                          fog-related           17076
## 14             land or mudslide-related            8855
## 15                   sleet/hail-related            6809
## 16                         dust-related            5283
## 17                      drought-related             632
## 18                   waterspout-related             629
## 19              excessively wet-related               0
## 20                      volcano-related               0

Floods are far ahead in total cost, while tornados have the highest casualty count. Keep in mind that these are sums over all events of a given type, not the average impact of a single event. It’s unsurprising, for example, that rain has a higher total economic impact than volcano-related events, given how much more frequent rain events are than volcanic eruptions. Let’s also look at means to see the average impact of a single event:

summarise(groupedWeather, AvgCost = mean(TOTALCOST)) %>% arrange(desc(AvgCost))
## Source: local data frame [20 x 2]
## 
##                                  EVTYPE      AvgCost
##                                   (chr)        (dbl)
## 1  tropical weather / hurricane-related 93825497.975
## 2                     tide/wave-related 17969806.574
## 3               excessively wet-related  7475526.316
## 4                       drought-related  5530150.755
## 5                          fire-related  2089887.381
## 6                         flood-related  2089641.554
## 7                          cold-related  1456872.090
## 8                       tornado-related   848044.291
## 9              land or mudslide-related   536131.327
## 10               winter-weather-related   418100.800
## 11                         heat-related   307448.481
## 12                         wind-related   254130.020
## 13                   sleet/hail-related    65693.866
## 14                    lightning-related    59683.216
## 15                         rain-related    51227.037
## 16                      volcano-related    17241.379
## 17                         dust-related    16019.846
## 18                          fog-related    13303.989
## 19                    avalanche-related     9617.054
## 20                   waterspout-related     2486.156
summarise(groupedWeather, AvgCasualties = mean(TOTALCASUALTY)) %>% arrange(desc(AvgCasualties))
## Source: local data frame [20 x 2]
## 
##                                  EVTYPE AvgCasualties
##                                   (chr)         (dbl)
## 1                          heat-related   214.4424867
## 2                     avalanche-related   116.7183463
## 3                     tide/wave-related    60.9169460
## 4  tropical weather / hurricane-related    39.2771996
## 5                          cold-related    19.1323770
## 6                       tornado-related    18.0014772
## 7              land or mudslide-related    13.6651235
## 8                     lightning-related    10.6979001
## 9                           fog-related     9.0829787
## 10                         dust-related     9.0153584
## 11                         wind-related     5.0193113
## 12                         fire-related     4.6017367
## 13                        flood-related     3.7070755
## 14               winter-weather-related     3.2290494
## 15                         rain-related     0.5214634
## 16                      drought-related     0.2326095
## 17                   waterspout-related     0.1633766
## 18                   sleet/hail-related     0.0235059
## 19              excessively wet-related     0.0000000
## 20                      volcano-related     0.0000000

Tropical weather / hurricane events are the most expensive on a per-event (not summed) basis, and by a very large margin. Heat events are the most dangerous for human health on a per-event basis.

Results

The analysis shows that heat events and tornados have the greatest impact on human health on a per-event and overall basis, respectively, while hurricanes and floods are the most economically damaging, again on a per-event and overall basis, respectively.

This can be hard to understand, so let’s visualize the average cost and frequency of our events together, using a bubble plot. First we’ll consolidate all our info, and review it:

overallData<-summarise(groupedWeather, TotalCost = sum(TOTALCOST), TotalCasualties = sum(TOTALCASUALTY), AvgCost = as.integer(mean(TOTALCOST)),  AvgCasualties = as.integer(mean(TOTALCASUALTY)), NumEvents=as.integer(n()))
overallData
## Source: local data frame [20 x 6]
## 
##                                  EVTYPE    TotalCost TotalCasualties
##                                   (chr)        (dbl)           (int)
## 1                     avalanche-related      3721800           45170
## 2                          cold-related   3554767900           46683
## 3                       drought-related  15025419600             632
## 4                          dust-related      9387630            5283
## 5               excessively wet-related    142035000               0
## 6                          fire-related   8905010130           19608
## 7                         flood-related 179976647782          319283
## 8                           fog-related     25011500           17076
## 9                          heat-related    924805030          645043
## 10             land or mudslide-related    347413100            8855
## 11                    lightning-related    940786537          168631
## 12                         rain-related  17862150634          181827
## 13                   sleet/hail-related  19029673516            6809
## 14                    tide/wave-related  48248930650          163562
## 15                      tornado-related  57408358249         1218610
## 16 tropical weather / hurricane-related  99173551360           41516
## 17                      volcano-related       500000               0
## 18                   waterspout-related      9571700             629
## 19                         wind-related   7158842673          141394
## 20               winter-weather-related  17671448413          136479
## Variables not shown: AvgCost (int), AvgCasualties (int), NumEvents (int)

Now we’ll construct a bubble plot to show the average cost and frequency together, and use color to show the total cost.

library(ggplot2)
p <- ggplot(overallData, aes(EVTYPE, AvgCost/10000, size=NumEvents/1000, colour=TotalCost/1000000000)) + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  geom_point() +
  xlab("Event Type") + 
  ylab("Average Cost in Tens of Thousands of Dollars") +
  scale_size_continuous(range = c(2,12), "Thousands of Events", breaks = c(50, 100, 150, 200, 250, 300, 350)) +
  scale_colour_gradient(name="Total Cost in Billions", low="#99D19A", high="#0E8710")
p

Let’s do the same with casualties.

p <- ggplot(overallData, aes(EVTYPE, AvgCasualties, size=NumEvents/1000, colour=TotalCasualties)) + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  geom_point() +
  xlab("Event Type") + 
  ylab("Average Number of Casualties") +
  scale_size_continuous(range = c(2,12), "Thousands of Events", breaks = c(50, 100, 150, 200, 250, 300, 350)) +
  scale_colour_gradient(name="Total Casualties", low="grey", high="red")
p