The dataset obtained from NOAA is a report of weather events and their impact, with varying quality of data. After cleaning and analysis of the data, we determined that excessive heat is the most dangerous event for human health on a per-event basis (i.e. the average heat wave is more dangerous than the average hurricane), while the most impactful event type for human health considering event frequency is the tornado (there are more tornados than heat events, so the cumulative impact is greater). The most expensive type of weather event on a per-event basis is the hurricane, while floods create greater economic impact overall, given their frequency.
We download the data from the host server and load it into a data frame.
# Downloading adds time to the computation and load on the server, so we keep
# a local copy and only download the file if it is not already present.
library(utils)
if (!file.exists("stormdata.csv.bz2")) {
  download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", "stormdata.csv.bz2", method="curl")
}
weatherEvents<-read.csv("stormdata.csv.bz2")
Next, we will select just the columns that interest us:
EVTYPE records the type of weather event.
FATALITIES and INJURIES give us insight into the public health consequences of events.
PROPDMG, PROPDMGEXP, CROPDMG, and CROPDMGEXP provide information about the economic impact of weather events.
library(dplyr)
weatherEvents<-select(weatherEvents, c(EVTYPE, FATALITIES, INJURIES,
PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP))
Next, we will handle exponent data following the lead of a previous student’s work on this dataset, adopting the same conventions for converting the respective “DMG” and “DMGEXP” fields:
| DMGEXP field value | Means multiply DMG amount by |
|---|---|
| B,b = billions | 1,000,000,000 |
| M,m = millions | 1,000,000 |
| K,k = kilos = thousands | 1,000 |
| H,h = hundreds | 100 |
| (+) | 1 |
| (-) | 0 |
| (?) | 0 |
| blank or empty character | 0 |
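Once we correct for exponents, we’ll sum property damage and crop damage to come up with a total economic impact figure. Note that the code below rescales only the rows marked B, M, K, or H; amounts carrying any other exponent value are left unchanged rather than set to zero.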
# Keep a backup copy before modifying the data in place
weatherEventsBak <- weatherEvents
# Scale damage amounts by their exponent code (billions, millions, thousands, hundreds)
multipliers <- c(b = 1e9, m = 1e6, k = 1e3, h = 1e2)
for (code in names(multipliers)) {
  propRows <- which(tolower(weatherEvents$PROPDMGEXP) == code)
  cropRows <- which(tolower(weatherEvents$CROPDMGEXP) == code)
  weatherEvents$PROPDMG[propRows] <- weatherEvents$PROPDMG[propRows] * multipliers[[code]]
  weatherEvents$CROPDMG[cropRows] <- weatherEvents$CROPDMG[cropRows] * multipliers[[code]]
}
# Create a "total cost" column
weatherEvents<-mutate(weatherEvents, TOTALCOST = PROPDMG + CROPDMG)
# Remove extraneous columns
weatherEvents<-select(weatherEvents, c(EVTYPE, FATALITIES, INJURIES, TOTALCOST))
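The conversion table above could also be expressed as a single vectorized lookup. The following is only a sketch on toy data: the helper name `damage_multiplier` is hypothetical, and treating every code that is missing from the table (including digits) as 0 is an assumption. Unlike the step-by-step code above, this version also zeroes out blank, “-”, and “?” entries, as the table prescribes.

```r
# Hypothetical helper (not part of the original analysis): map DMGEXP codes
# to multipliers following the conversion table.
damage_multiplier <- function(exp_codes) {
  lookup <- c(b = 1e9, m = 1e6, k = 1e3, h = 1e2, "+" = 1)
  codes <- tolower(trimws(as.character(exp_codes)))
  mult <- unname(lookup[codes])
  # Assumption: anything not listed in the lookup (blank, "-", "?", digits) counts as 0
  mult[is.na(mult)] <- 0
  mult
}

# Toy data for illustration only
toy <- data.frame(
  PROPDMG = c(2.5, 10, 3, 7, 4),
  PROPDMGEXP = c("K", "m", "", "+", "B"),
  stringsAsFactors = FALSE
)
toy$PROPCOST <- toy$PROPDMG * damage_multiplier(toy$PROPDMGEXP)
print(toy)
```

Keeping the multipliers in one named vector makes the convention easy to audit and to change if a different reading of the exponent codes is preferred.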
Now we have to come up with a way to quantify the public health cost in deaths and injuries. There are a number of ways to calculate total casualties, and the choice is tricky because we have both deaths and injuries. For the sake of simplicity, we treat one death as the public health equivalent of 200 injuries (a completely arbitrary figure).
We will calculate total casualties using this method:
weatherEvents<-mutate(weatherEvents, TOTALCASUALTY = as.integer(200*FATALITIES + INJURIES))
# Remove extraneous columns
weatherEvents<-select(weatherEvents, -c(FATALITIES, INJURIES))
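Because the weight of 200 is arbitrary, it can be useful to see how sensitive a ranking is to that choice. The sketch below uses a hypothetical helper, `weighted_casualties`, and made-up numbers (not taken from the NOAA data) to show that a different weight can reverse the order of two event types.

```r
# Hypothetical helper: weighted casualty score with an adjustable death weight
weighted_casualties <- function(fatalities, injuries, death_weight = 200) {
  as.integer(death_weight * fatalities + injuries)
}

# Made-up figures for illustration only
toy <- data.frame(
  EVTYPE = c("heat", "tornado"),
  FATALITIES = c(3, 1),
  INJURIES = c(10, 500),
  stringsAsFactors = FALSE
)

scores_200 <- weighted_casualties(toy$FATALITIES, toy$INJURIES)         # weight used in this report
scores_1000 <- weighted_casualties(toy$FATALITIES, toy$INJURIES, 1000)  # a much heavier weight on deaths

# With weight 200 the toy tornado row scores higher; with weight 1000 the toy heat row does
print(data.frame(EVTYPE = toy$EVTYPE, scores_200, scores_1000))
```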
At this point, we have a data frame of just three columns, which will be easy to analyze. Let’s look at the first few rows, and then at the weather event types listed:
head(weatherEvents)
## EVTYPE TOTALCOST TOTALCASUALTY
## 1 TORNADO 25000 15
## 2 TORNADO 2500 0
## 3 TORNADO 25000 2
## 4 TORNADO 2500 2
## 5 TORNADO 2500 2
## 6 TORNADO 2500 6
head(table(weatherEvents$EVTYPE), 10)
##
## HIGH SURF ADVISORY COASTAL FLOOD FLASH FLOOD
## 1 1 1
## LIGHTNING TSTM WIND TSTM WIND (G45)
## 1 4 1
## WATERSPOUT WIND ?
## 1 1 1
## ABNORMAL WARMTH
## 4
tail(table(weatherEvents$EVTYPE), 10)
##
## WINTER STORMS Winter Weather WINTER WEATHER
## 3 19 7026
## WINTER WEATHER MIX WINTER WEATHER/MIX WINTERY MIX
## 6 1104 2
## Wintry mix Wintry Mix WINTRY MIX
## 3 1 90
## WND
## 1
The event types are a bit messy. For example, we have “Winter Weather” and “WINTER WEATHER” as two separate categories, and “WINTER WEATHER MIX”, “WINTER WEATHER/MIX”, “WINTERY MIX”, “WINTRY MIX”, “Wintry mix”, and “Wintry Mix” all as separate categories. We can fix part of this problem by standardizing capitalization. Let’s first check how many categories we begin with, then convert everything to lower case, and see what improvement that gives us.
length(table(weatherEvents$EVTYPE))
## [1] 985
weatherEvents$EVTYPE<-tolower(weatherEvents$EVTYPE)
length(table(weatherEvents$EVTYPE))
## [1] 898
Not that much of an improvement. A quick perusal of the table of weather events shows us some patterns, however. We’ll remove all the “summary” rows, and consolidate many categories according to a string search (for example, anything with “volcan” becomes “volcano-related”, anything with “flood” becomes “flood-related”, anything with “thund” becomes “thunderstorm-related”, etc.).
weatherEvents<-filter(weatherEvents, !grepl("summary",EVTYPE))
weatherEvents$EVTYPE[which(grepl("flood", weatherEvents$EVTYPE) | grepl("floood", weatherEvents$EVTYPE)| grepl("fld", weatherEvents$EVTYPE))]<-"flood-related"
weatherEvents$EVTYPE[which(grepl("volcan", weatherEvents$EVTYPE))]<-"volcano-related"
weatherEvents$EVTYPE[which(grepl("thund", weatherEvents$EVTYPE) | grepl("tstm", weatherEvents$EVTYPE) )]<-"thunderstorm-related"
weatherEvents$EVTYPE[which(grepl("torn", weatherEvents$EVTYPE) | grepl("funnel", weatherEvents$EVTYPE) | grepl("gustnado", weatherEvents$EVTYPE))]<-"tornado-related"
weatherEvents$EVTYPE[which(grepl("wind", weatherEvents$EVTYPE) | grepl("wnd", weatherEvents$EVTYPE))]<-"wind-related"
weatherEvents$EVTYPE[which(grepl("snow", weatherEvents$EVTYPE) | grepl("ice", weatherEvents$EVTYPE) | grepl("blizz", weatherEvents$EVTYPE) | grepl("icy", weatherEvents$EVTYPE) | grepl("glaze", weatherEvents$EVTYPE) | grepl("winter", weatherEvents$EVTYPE) | (grepl("wint", weatherEvents$EVTYPE) & grepl("mix", weatherEvents$EVTYPE)))]<-"winter-weather-related"
weatherEvents$EVTYPE[which(grepl("slide", weatherEvents$EVTYPE) | grepl("slump", weatherEvents$EVTYPE) )]<-"land or mudslide-related"
weatherEvents$EVTYPE[which(grepl("tropical", weatherEvents$EVTYPE) | grepl("typhoon", weatherEvents$EVTYPE) | grepl("hurr", weatherEvents$EVTYPE) | grepl("floyd", weatherEvents$EVTYPE))]<-"tropical weather / hurricane-related"
weatherEvents$EVTYPE[which(grepl("hot", weatherEvents$EVTYPE) | grepl("heat", weatherEvents$EVTYPE) | grepl("high temperature", weatherEvents$EVTYPE) | grepl("warm", weatherEvents$EVTYPE) | grepl("hypertherm", weatherEvents$EVTYPE) | grepl("record high", weatherEvents$EVTYPE))]<-"heat-related"
weatherEvents$EVTYPE[which(grepl("cold", weatherEvents$EVTYPE) | grepl("hypotherm", weatherEvents$EVTYPE) | grepl("cool", weatherEvents$EVTYPE) | grepl("freeze", weatherEvents$EVTYPE) | grepl("low temp", weatherEvents$EVTYPE) | grepl("record low", weatherEvents$EVTYPE) | grepl("frost", weatherEvents$EVTYPE))]<-"cold-related"
weatherEvents$EVTYPE[which(grepl("fire", weatherEvents$EVTYPE) | grepl("smoke", weatherEvents$EVTYPE))]<-"fire-related"
weatherEvents$EVTYPE[which(grepl("spout", weatherEvents$EVTYPE))]<-"waterspout-related"
weatherEvents$EVTYPE[which(grepl("dust", weatherEvents$EVTYPE))]<-"dust-related"
weatherEvents$EVTYPE[which(grepl("fog", weatherEvents$EVTYPE))]<-"fog-related"
weatherEvents$EVTYPE[which(grepl("surge", weatherEvents$EVTYPE) | grepl("tsun", weatherEvents$EVTYPE) | grepl("surf", weatherEvents$EVTYPE) | grepl("current", weatherEvents$EVTYPE) | grepl("wave", weatherEvents$EVTYPE) | grepl("seiche", weatherEvents$EVTYPE) | grepl("swell", weatherEvents$EVTYPE) | grepl("tide", weatherEvents$EVTYPE) | grepl("high water", weatherEvents$EVTYPE) | grepl("eros", weatherEvents$EVTYPE) | grepl("seas", weatherEvents$EVTYPE))]<-"tide/wave-related"
weatherEvents$EVTYPE[which(grepl("dry", weatherEvents$EVTYPE) | grepl("drought", weatherEvents$EVTYPE) | grepl("below normal precip", weatherEvents$EVTYPE))]<-"drought-related"
weatherEvents$EVTYPE[which(grepl("wet", weatherEvents$EVTYPE))]<-"excessively wet-related"
weatherEvents$EVTYPE[which(grepl("hail", weatherEvents$EVTYPE) | grepl("freezing", weatherEvents$EVTYPE) | grepl("sleet", weatherEvents$EVTYPE))]<-"sleet/hail-related"
weatherEvents$EVTYPE[which(grepl("rain", weatherEvents$EVTYPE) | grepl("burst", weatherEvents$EVTYPE) | grepl("storm", weatherEvents$EVTYPE) | grepl("shower", weatherEvents$EVTYPE))]<-"rain-related"
weatherEvents$EVTYPE[which(grepl("light", weatherEvents$EVTYPE))]<-"lightning-related"
weatherEvents$EVTYPE[which(grepl("aval", weatherEvents$EVTYPE))]<-"avalanche-related"
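While this does not account for some events such as “record temperatures” (high or low?), we now have 20 categories, each with a “-related” suffix, which account for the vast majority (more than 900,000) of cases. Note that the “storm” pattern also matches the label “thunderstorm-related” assigned a few steps earlier, so thunderstorm events end up folded into “rain-related”; this is why no separate thunderstorm category appears in the results below. We’ll perform analyses on these 20 categories. Let’s filter out the rest.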
weatherEvents<-filter(weatherEvents, grepl("-related",EVTYPE))
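The sequential replacements above are order-dependent: a label assigned by one step can be matched again by a later pattern, as happens with “storm” and “thunderstorm-related”. One possible alternative, sketched here on toy labels with a hypothetical helper `categorize_events` and only a handful of the patterns, is to assign each event the first rule it matches and never revisit it. This is not the method used for the results in this report, and it would keep thunderstorms as their own category.

```r
# Hypothetical helper: assign each event type the FIRST matching category.
# `rules` is a named character vector: names are categories, values are regex patterns.
categorize_events <- function(evtype, rules) {
  cleaned <- tolower(trimws(evtype))
  category <- rep(NA_character_, length(cleaned))
  for (label in names(rules)) {
    # Only events that have no category yet can be assigned
    hit <- is.na(category) & grepl(rules[[label]], cleaned)
    category[hit] <- label
  }
  category
}

# A small subset of the patterns used above, in priority order
rules <- c(
  "flood-related"        = "flood|floood|fld",
  "thunderstorm-related" = "thund|tstm",
  "wind-related"         = "wind|wnd",
  "rain-related"         = "rain|burst|storm|shower"
)

# Toy labels for illustration only
toy <- c("TSTM WIND", "Flash Flood", " heavy rain", "HIGH WIND", "volcanic ash")
toy_categories <- categorize_events(toy, rules)
print(data.frame(toy, toy_categories))
```

Events that match no rule stay `NA`, which makes the leftover labels easy to inspect before deciding whether to drop them.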
In order to analyze the relative impacts in both public health casualties and economic impact by event type, we will need to use dplyr’s group_by function. This allows us to compute summary statistics for each weather event type.
groupedWeather<-group_by(weatherEvents, EVTYPE)
We can now do summary statistics. Let’s first find out totals (across many occurrences of the same weather event type):
summarise(groupedWeather, TotalCost = sum(TOTALCOST)) %>% arrange(desc(TotalCost))
## Source: local data frame [20 x 2]
##
## EVTYPE TotalCost
## (chr) (dbl)
## 1 flood-related 179976647782
## 2 tropical weather / hurricane-related 99173551360
## 3 tornado-related 57408358249
## 4 tide/wave-related 48248930650
## 5 sleet/hail-related 19029673516
## 6 rain-related 17862150634
## 7 winter-weather-related 17671448413
## 8 drought-related 15025419600
## 9 fire-related 8905010130
## 10 wind-related 7158842673
## 11 cold-related 3554767900
## 12 lightning-related 940786537
## 13 heat-related 924805030
## 14 land or mudslide-related 347413100
## 15 excessively wet-related 142035000
## 16 fog-related 25011500
## 17 waterspout-related 9571700
## 18 dust-related 9387630
## 19 avalanche-related 3721800
## 20 volcano-related 500000
summarise(groupedWeather, TotalCasualties = sum(TOTALCASUALTY)) %>% arrange(desc(TotalCasualties))
## Source: local data frame [20 x 2]
##
## EVTYPE TotalCasualties
## (chr) (int)
## 1 tornado-related 1218610
## 2 heat-related 645043
## 3 flood-related 319283
## 4 rain-related 181827
## 5 lightning-related 168631
## 6 tide/wave-related 163562
## 7 wind-related 141394
## 8 winter-weather-related 136479
## 9 cold-related 46683
## 10 avalanche-related 45170
## 11 tropical weather / hurricane-related 41516
## 12 fire-related 19608
## 13 fog-related 17076
## 14 land or mudslide-related 8855
## 15 sleet/hail-related 6809
## 16 dust-related 5283
## 17 drought-related 632
## 18 waterspout-related 629
## 19 excessively wet-related 0
## 20 volcano-related 0
Floods are out way ahead in costs, while tornados have the highest casualty count. Let’s keep in mind that we’re talking about all related events summed together, not the average impact of a given type of weather event. It’s unsurprising, for example, that mere rain has higher economic impact than volcano-related events, given the relative frequency of rain events compared to volcanic eruptions. Let’s also take a look at means to see what the average impact of a given event comes out to be:
summarise(groupedWeather, AvgCost = mean(TOTALCOST)) %>% arrange(desc(AvgCost))
## Source: local data frame [20 x 2]
##
## EVTYPE AvgCost
## (chr) (dbl)
## 1 tropical weather / hurricane-related 93825497.975
## 2 tide/wave-related 17969806.574
## 3 excessively wet-related 7475526.316
## 4 drought-related 5530150.755
## 5 fire-related 2089887.381
## 6 flood-related 2089641.554
## 7 cold-related 1456872.090
## 8 tornado-related 848044.291
## 9 land or mudslide-related 536131.327
## 10 winter-weather-related 418100.800
## 11 heat-related 307448.481
## 12 wind-related 254130.020
## 13 sleet/hail-related 65693.866
## 14 lightning-related 59683.216
## 15 rain-related 51227.037
## 16 volcano-related 17241.379
## 17 dust-related 16019.846
## 18 fog-related 13303.989
## 19 avalanche-related 9617.054
## 20 waterspout-related 2486.156
summarise(groupedWeather, AvgCasualties = mean(TOTALCASUALTY)) %>% arrange(desc(AvgCasualties))
## Source: local data frame [20 x 2]
##
## EVTYPE AvgCasualties
## (chr) (dbl)
## 1 heat-related 214.4424867
## 2 avalanche-related 116.7183463
## 3 tide/wave-related 60.9169460
## 4 tropical weather / hurricane-related 39.2771996
## 5 cold-related 19.1323770
## 6 tornado-related 18.0014772
## 7 land or mudslide-related 13.6651235
## 8 lightning-related 10.6979001
## 9 fog-related 9.0829787
## 10 dust-related 9.0153584
## 11 wind-related 5.0193113
## 12 fire-related 4.6017367
## 13 flood-related 3.7070755
## 14 winter-weather-related 3.2290494
## 15 rain-related 0.5214634
## 16 drought-related 0.2326095
## 17 waterspout-related 0.1633766
## 18 sleet/hail-related 0.0235059
## 19 excessively wet-related 0.0000000
## 20 volcano-related 0.0000000
Looks like tropical weather / hurricanes make up the most expensive weather event on a per-event (not summed) basis, and by a very large margin. Excessive heat makes up the most dangerous event for human health on a per-event basis.
We’ve established that heat events and tornados have the greatest impact on human health on a per-event and overall basis, respectively, while hurricanes and flooding are the most economically damaging, again on a per-event and overall basis, respectively.
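The difference between overall and per-event impact can be illustrated with a tiny made-up example. The sketch below uses base R’s `tapply` rather than dplyr, and the numbers are invented purely for illustration: a frequent, moderately costly event type leads on total cost, while a rare, very costly one leads on average cost.

```r
# Made-up costs for illustration only (not NOAA figures)
toy <- data.frame(
  EVTYPE = c("flood", "flood", "flood", "flood", "hurricane"),
  TOTALCOST = c(10, 20, 30, 40, 90),
  stringsAsFactors = FALSE
)

# Per-type totals, means, and event counts
total_cost <- tapply(toy$TOTALCOST, toy$EVTYPE, sum)
avg_cost <- tapply(toy$TOTALCOST, toy$EVTYPE, mean)
n_events <- tapply(toy$TOTALCOST, toy$EVTYPE, length)

summary_table <- data.frame(
  EVTYPE = names(total_cost),
  TotalCost = as.numeric(total_cost),
  AvgCost = as.numeric(avg_cost),
  NumEvents = as.integer(n_events),
  stringsAsFactors = FALSE
)
print(summary_table)

# The total is just the average multiplied by the number of events,
# so frequency alone can move a type up the "overall" ranking.
most_costly_overall <- summary_table$EVTYPE[which.max(summary_table$TotalCost)]
most_costly_per_event <- summary_table$EVTYPE[which.max(summary_table$AvgCost)]
```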
This can be hard to understand, so let’s visualize the average cost and frequency of our events together, using a bubble plot. First we’ll consolidate all our info, and review it:
overallData<-summarise(groupedWeather, TotalCost = sum(TOTALCOST), TotalCasualties = sum(TOTALCASUALTY), AvgCost = as.integer(mean(TOTALCOST)), AvgCasualties = as.integer(mean(TOTALCASUALTY)), NumEvents=as.integer(n()))
overallData
## Source: local data frame [20 x 6]
##
## EVTYPE TotalCost TotalCasualties
## (chr) (dbl) (int)
## 1 avalanche-related 3721800 45170
## 2 cold-related 3554767900 46683
## 3 drought-related 15025419600 632
## 4 dust-related 9387630 5283
## 5 excessively wet-related 142035000 0
## 6 fire-related 8905010130 19608
## 7 flood-related 179976647782 319283
## 8 fog-related 25011500 17076
## 9 heat-related 924805030 645043
## 10 land or mudslide-related 347413100 8855
## 11 lightning-related 940786537 168631
## 12 rain-related 17862150634 181827
## 13 sleet/hail-related 19029673516 6809
## 14 tide/wave-related 48248930650 163562
## 15 tornado-related 57408358249 1218610
## 16 tropical weather / hurricane-related 99173551360 41516
## 17 volcano-related 500000 0
## 18 waterspout-related 9571700 629
## 19 wind-related 7158842673 141394
## 20 winter-weather-related 17671448413 136479
## Variables not shown: AvgCost (int), AvgCasualties (int), NumEvents (int)
Now we’ll construct a bubble plot to show the average cost and frequency together, and use color to show the total cost.
library(ggplot2)
# Bubble plot: y = average cost, bubble size = number of events, colour = total cost.
# TotalCost is divided by 1e9 so that the colour legend really is in billions.
p <- ggplot(overallData, aes(EVTYPE, AvgCost/10000, size=NumEvents/1000, colour=TotalCost/1000000000)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  geom_point() +
  xlab("Event Type") +
  ylab("Average Cost in Tens of Thousands of Dollars") +
  scale_size_continuous(name="Thousands of Events", range = c(2,12), breaks = c(50, 100, 150, 200, 250, 300, 350)) +
  scale_colour_gradient(name="Total Cost in Billions", low="#99D19A", high="#0E8710")
p
Let’s do the same with casualties.
p <- ggplot(overallData, aes(EVTYPE, AvgCasualties, size=NumEvents/1000, colour=TotalCasualties)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  geom_point() +
  xlab("Event Type") +
  ylab("Average Number of Casualties") +
  scale_size_continuous(name="Thousands of Events", range = c(2,12), breaks = c(50, 100, 150, 200, 250, 300, 350)) +
  scale_colour_gradient(name="Total Casualties", low="grey", high="red")
p