We clean up the storm data from the NOAA and look at damages both in terms of economic damages as well as harm to human health. We find that, since 1996, the major storm causes of damage to human health are tornadoes and excessive heat, while the causers of the greatest economic consequences are floods and hurricanes, but that drought is the leading cause of crop damages in the United States.
I’ll be using a series of R packages in the course of the analysis. magrittr, dplyr, stringr, lubridate, and tidyr will all come in handy during the various manipulations we need to perform to clean up the data and add new columns. I use the approximate matching capabilities in stringdist in order to create a cleaner verion of the event codings, and quantmod to get historical inflation numbers in order to index all dollar damage values to 1996 dollars. Finally, I use ggplot2 to plot results.
library(magrittr)
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(stringr)
library(lubridate)
library(tidyr)
library(stringdist)
## Loading required package: parallel
library(quantmod)
## Loading required package: Defaults
## Loading required package: xts
## Loading required package: zoo
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
##
## Attaching package: 'xts'
##
## The following objects are masked from 'package:dplyr':
##
## first, last
##
## Loading required package: TTR
## Version 0.4-0 included new data defaults. See ?getSymbols.
library(ggplot2)
library(scales)
I start by getting the data and reading it into a data.frame, and taking a quick look at the structure:
storm_url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
destfile <- "data/storm_data.csv.bz2"
if (!file.exists(destfile)) download.file(storm_url, destfile, mode="wb")
storm <- read.csv("data/storm_data.csv.bz2",
header=TRUE,
stringsAsFactors=FALSE) %>%
tbl_df
names(storm) <- tolower(names(storm))
storm
## Source: local data frame [902,297 x 37]
##
## state__ bgn_date bgn_time time_zone county countyname state
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL
## 7 1 11/16/1951 0:00:00 0100 CST 9 BLOUNT AL
## 8 1 1/22/1952 0:00:00 0900 CST 123 TALLAPOOSA AL
## 9 1 2/13/1952 0:00:00 2000 CST 125 TUSCALOOSA AL
## 10 1 2/13/1952 0:00:00 2000 CST 57 FAYETTE AL
## .. ... ... ... ... ... ... ...
## Variables not shown: evtype (chr), bgn_range (dbl), bgn_azi (chr),
## bgn_locati (chr), end_date (chr), end_time (chr), county_end (dbl),
## countyendn (lgl), end_range (dbl), end_azi (chr), end_locati (chr),
## length (dbl), width (dbl), f (int), mag (dbl), fatalities (dbl),
## injuries (dbl), propdmg (dbl), propdmgexp (chr), cropdmg (dbl),
## cropdmgexp (chr), wfo (chr), stateoffic (chr), zonenames (chr), latitude
## (dbl), longitude (dbl), latitude_e (dbl), longitude_ (dbl), remarks
## (chr), refnum (dbl)
str(storm)
## Classes 'tbl_df', 'tbl' and 'data.frame': 902297 obs. of 37 variables:
## $ state__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ bgn_date : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ bgn_time : chr "0130" "0145" "1600" "0900" ...
## $ time_zone : chr "CST" "CST" "CST" "CST" ...
## $ county : num 97 3 57 89 43 77 9 123 125 57 ...
## $ countyname: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ state : chr "AL" "AL" "AL" "AL" ...
## $ evtype : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ bgn_range : num 0 0 0 0 0 0 0 0 0 0 ...
## $ bgn_azi : chr "" "" "" "" ...
## $ bgn_locati: chr "" "" "" "" ...
## $ end_date : chr "" "" "" "" ...
## $ end_time : chr "" "" "" "" ...
## $ county_end: num 0 0 0 0 0 0 0 0 0 0 ...
## $ countyendn: logi NA NA NA NA NA NA ...
## $ end_range : num 0 0 0 0 0 0 0 0 0 0 ...
## $ end_azi : chr "" "" "" "" ...
## $ end_locati: chr "" "" "" "" ...
## $ length : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ width : num 100 150 123 100 150 177 33 33 100 100 ...
## $ f : int 3 2 2 2 2 2 2 1 3 3 ...
## $ mag : num 0 0 0 0 0 0 0 0 0 0 ...
## $ fatalities: num 0 0 0 0 0 0 0 0 1 0 ...
## $ injuries : num 15 0 2 2 2 6 1 0 14 0 ...
## $ propdmg : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ propdmgexp: chr "K" "K" "K" "K" ...
## $ cropdmg : num 0 0 0 0 0 0 0 0 0 0 ...
## $ cropdmgexp: chr "" "" "" "" ...
## $ wfo : chr "" "" "" "" ...
## $ stateoffic: chr "" "" "" "" ...
## $ zonenames : chr "" "" "" "" ...
## $ latitude : num 3040 3042 3340 3458 3412 ...
## $ longitude : num 8812 8755 8742 8626 8642 ...
## $ latitude_e: num 3051 0 0 0 0 ...
## $ longitude_: num 8806 0 0 0 0 ...
## $ remarks : chr "" "" "" "" ...
## $ refnum : num 1 2 3 4 5 6 7 8 9 10 ...
summary(storm)
## state__ bgn_date bgn_time time_zone
## Min. : 1.0 Length:902297 Length:902297 Length:902297
## 1st Qu.:19.0 Class :character Class :character Class :character
## Median :30.0 Mode :character Mode :character Mode :character
## Mean :31.2
## 3rd Qu.:45.0
## Max. :95.0
##
## county countyname state evtype
## Min. : 0 Length:902297 Length:902297 Length:902297
## 1st Qu.: 31 Class :character Class :character Class :character
## Median : 75 Mode :character Mode :character Mode :character
## Mean :101
## 3rd Qu.:131
## Max. :873
##
## bgn_range bgn_azi bgn_locati end_date
## Min. : 0 Length:902297 Length:902297 Length:902297
## 1st Qu.: 0 Class :character Class :character Class :character
## Median : 0 Mode :character Mode :character Mode :character
## Mean : 1
## 3rd Qu.: 1
## Max. :3749
##
## end_time county_end countyendn end_range
## Length:902297 Min. :0 Mode:logical Min. : 0
## Class :character 1st Qu.:0 NA's:902297 1st Qu.: 0
## Mode :character Median :0 Median : 0
## Mean :0 Mean : 1
## 3rd Qu.:0 3rd Qu.: 0
## Max. :0 Max. :925
##
## end_azi end_locati length width
## Length:902297 Length:902297 Min. : 0.0 Min. : 0
## Class :character Class :character 1st Qu.: 0.0 1st Qu.: 0
## Mode :character Mode :character Median : 0.0 Median : 0
## Mean : 0.2 Mean : 8
## 3rd Qu.: 0.0 3rd Qu.: 0
## Max. :2315.0 Max. :4400
##
## f mag fatalities injuries
## Min. :0 Min. : 0 Min. : 0 Min. : 0.0
## 1st Qu.:0 1st Qu.: 0 1st Qu.: 0 1st Qu.: 0.0
## Median :1 Median : 50 Median : 0 Median : 0.0
## Mean :1 Mean : 47 Mean : 0 Mean : 0.2
## 3rd Qu.:1 3rd Qu.: 75 3rd Qu.: 0 3rd Qu.: 0.0
## Max. :5 Max. :22000 Max. :583 Max. :1700.0
## NA's :843563
## propdmg propdmgexp cropdmg cropdmgexp
## Min. : 0 Length:902297 Min. : 0.0 Length:902297
## 1st Qu.: 0 Class :character 1st Qu.: 0.0 Class :character
## Median : 0 Mode :character Median : 0.0 Mode :character
## Mean : 12 Mean : 1.5
## 3rd Qu.: 0 3rd Qu.: 0.0
## Max. :5000 Max. :990.0
##
## wfo stateoffic zonenames latitude
## Length:902297 Length:902297 Length:902297 Min. : 0
## Class :character Class :character Class :character 1st Qu.:2802
## Mode :character Mode :character Mode :character Median :3540
## Mean :2875
## 3rd Qu.:4019
## Max. :9706
## NA's :47
## longitude latitude_e longitude_ remarks
## Min. :-14451 Min. : 0 Min. :-14455 Length:902297
## 1st Qu.: 7247 1st Qu.: 0 1st Qu.: 0 Class :character
## Median : 8707 Median : 0 Median : 0 Mode :character
## Mean : 6940 Mean :1452 Mean : 3509
## 3rd Qu.: 9605 3rd Qu.:3549 3rd Qu.: 8735
## Max. : 17124 Max. :9706 Max. :106220
## NA's :40
## refnum
## Min. : 1
## 1st Qu.:225575
## Median :451149
## Mean :451149
## 3rd Qu.:676723
## Max. :902297
##
The data is a bit messy. The questions we are looking to answer are about which event types are the most harmful and the most expensive. So I’ll focus on cleaning up the columns in the data related to event type coding, cost estimates, and the injury/fatality counts. I do this in several steps:
propdmg and cropdmg fields, where each field is accompanied by a multiplier. We use a lookup vector to convert the single letter codes to the correct multiplier number (for instance, “M” gets mapped to 1,000,000).I remove event types listed as “summary,” as the aggregated data they represent should be included unaggregated in other rows. I also replace “tstm” in the event names with “thunderstorm,” which matches the coding in NWS Directive 10-1605. Finally, I create more meaningful date fields, and I convert the propdmg and cropdmg variables to actual numbers that can be used in calculations.
# lookup vector:
expref <- c("0" = 1, "K" = 1000, "M" = 1E6, "B" = 1E9)
# filter and add cost numbers:
storm_clean <- storm %>%
mutate(event = str_trim(tolower(evtype)),
event = str_replace(event, "tstm", "thunderstorm"),
bgn_date = mdy_hms(bgn_date),
year = year(bgn_date),
propdmgexp = ifelse(propdmgexp == "", 0, propdmgexp),
property_damage = propdmg * expref[propdmgexp],
cropdmgexp = ifelse(cropdmgexp == "", 0, cropdmgexp),
crop_damage = cropdmg * expref[cropdmgexp],
total_damage = property_damage+crop_damage) %>%
filter(year >= 1996,
!str_detect(event, "summary"))
I use the package quantmod to get the monthly inflation numbers from FRED, and use them to create real dollar values for the damages:
cpi <- data.frame(getSymbols("CPIAUCSL", src='FRED', auto.assign=FALSE, warnings=FALSE))
## As of 0.4-0, 'getSymbols' uses env=parent.frame() and
## auto.assign=TRUE by default.
##
## This behavior will be phased out in 0.5-0 when the call will
## default to use auto.assign=FALSE. getOption("getSymbols.env") and
## getOptions("getSymbols.auto.assign") are now checked for alternate defaults
##
## This message is shown once per session and may be disabled by setting
## options("getSymbols.warning4.0"=FALSE). See ?getSymbol for more details
cpi$date <- ymd(row.names(cpi))
cpi$index <- cpi$CPIAUCSL/cpi["1996-01-01","CPIAUCSL"]
storm_clean <- storm_clean %>%
mutate(date = floor_date(bgn_date, unit = "month")) %>%
inner_join(cpi, by="date") %>%
mutate(property_damage_real = property_damage / index,
crop_damage_real = crop_damage / index,
total_damage_real = total_damage / index)
The event types are all over the place. A complete cleanup would require domain knowledge that I don’t have access to, but we can use approximate string matching to map a good chunk of the event types to the 48 official event types. I start by creating a vector of the official event types, from the NWS Directive, and then mapping each event name that appears in the storm data to one of the official event types. Those event names that can not be mapped are left as is, but we still are able to reduce the number of obvious duplicates. The official event names are only available in a PDF, so I copied and pasted the values from the PDF directly into my R code, assuring we’ll be able to re-create them at any time. Because of their origin, the event names data need to be tidied up (split the single string into a vector of event names, remove the C/Z/M designator at the end of the name and convert everything to lower case as I did with the event data in the storm dataset):
storm_events <- unique(storm_clean$event)
event_names <- "Astronomical Low Tide Z
Avalanche Z
Blizzard Z
Coastal Flood Z
Cold/Wind Chill Z
Debris Flow C
Dense Fog Z
Dense Smoke Z
Drought Z
Dust Devil C
Dust Storm Z
Excessive Heat Z
Extreme Cold/Wind Chill Z
Flash Flood C
Flood C
Frost/Freeze Z
Funnel Cloud C
Freezing Fog Z
Hail C
Heat Z
Heavy Rain C
Heavy Snow Z
High Surf Z
High Wind Z
Hurricane (Typhoon) Z
Ice Storm Z
Lake-Effect Snow Z
Lakeshore Flood Z
Lightning C
Marine Hail M
Marine High Wind M
Marine Strong Wind M
Marine Thunderstorm Wind M
Rip Current Z
Seiche Z
Sleet Z
Storm Surge/Tide Z
Strong Wind Z
Thunderstorm Wind C
Tornado C
Tropical Depression Z
Tropical Storm Z
Tsunami Z
Volcanic Ash Z
Waterspout M
Wildfire Z
Winter Storm Z
Winter Weather Z "
event_names <- event_names %>%
str_split("\n") %>%
`[[`(1) %>%
str_match("(.*?) [ZCM] $") %>%
`[`(,2) %>%
tolower
names(event_names) <- event_names
try_match <- sapply(storm_events, amatch, event_names, method="cosine", maxDist=.1)
clean_events <- structure(event_names[try_match], names = storm_events)
I made one more change. The only hurricane/typhoon entry in the official event list is hurricane (typhoon), but visual inspection showed that the storm data has numerous instances of hurricane as well as typhoon, both of which should be mapped to hurricane (typhoon):
clean_events["hurricane"] <- "hurricane (typhoon)"
clean_events["typhoon"] <- "hurricane (typhoon)"
Now clean_events is a vector that maps event names found in the storm data to their closest counterparts in the official event names. I modify the storm_clean data frame to add a cleaned up event column, prioritizing exact matches, then approximate matches, and finally leaving the field as is if no better option is found:
storm_clean <- storm_clean %>%
mutate(event_clean = event_names[event],
event_clean = ifelse(is.na(event_clean), clean_events[event], event_clean),
event_clean = ifelse(is.na(event_clean), event, event_clean))
Here’s a quick example of the sorts of changes we made:
storm_clean %>%
filter(event_clean != event) %>%
group_by(event, event_clean) %>%
summarise(total_records = n()) %>%
arrange(desc(total_records)) %>%
head
## Source: local data frame [6 x 3]
## Groups: event
##
## event event_clean total_records
## 1 winter weather/mix winter weather 1104
## 2 thunderstorm wind/hail marine thunderstorm wind 1028
## 3 rip currents rip current 302
## 4 heavy surf/high surf high surf 228
## 5 extreme windchill extreme cold/wind chill 204
## 6 strong winds strong wind 192
It appears that we can take the injury and fatality numbers as they are from the storm data, so now we’re ready to answer the original questions.
The chart below shows the top causers of injuries and fatalities, sorted by number of events, since 1996. Overall, tornadoes take the highest human toll, but excessive heat is disproportionately fatal when it does strike.
storm_clean %>%
group_by(event_clean) %>%
summarise(events = n(),
fatalities = sum(fatalities),
injuries=sum(injuries),
human_damages = sum(fatalities+injuries)) %>%
filter(events > 25 & human_damages > 100) %>%
select(event_type = event_clean,
events,
Fatalities = fatalities,
Injuries = injuries,
human_damages) %>%
gather(damage_type, count, Fatalities, Injuries, -human_damages, -events) %>%
ggplot(aes(y = reorder(event_type, events), x = count)) +
geom_point() +
geom_segment(aes(xend = 0, yend=reorder(event_type, events))) +
scale_x_continuous(labels=comma, expand=c(0.01,0)) +
facet_wrap(~damage_type, ncol=2, scale="free_x") +
labs(title="Injuries and fatalities by event type\nsorted by number of events", x = "Total", y = "Event Type") +
theme(rect=element_blank(),
plot.title = element_text(hjust=0),
axis.ticks = element_blank(),
strip.text = element_text(hjust=0),
axis.title.x = element_text(hjust=0),
panel.grid.major.x = element_line(size = .1, colour="gray50"))
Events types are sorted by number of event ocurrences. We can see that tornadoes and excessive heat have the highest human cost.
From the chart below, showing the top 10 event types in terms of overall dollars of damage, we can see that floods and hurricanes and typhoons have the greatest economic consequences overall, but that, as expected, droughts are by far the leading cause of crop damages. The dollar amounts are in constant 1996 dollars.
storm_clean %>%
group_by(event_clean) %>%
summarise(events = n(),
crop_damages = sum(crop_damage_real),
property_damages=sum(property_damage_real),
total_damages = sum(total_damage_real)) %>%
select(event_type = event_clean,
events,
crop_damages,
property_damages,
total_damages) %>%
top_n(10) %>%
gather(damage_type, dollars, crop_damages, property_damages, -total_damages, -events) %>%
ggplot(aes(y = reorder(event_type, total_damages), x = dollars)) +
geom_point() +
geom_segment(aes(xend=0, yend=reorder(event_type, total_damages))) +
scale_x_continuous(labels=comma, expand=c(0.01,0)) +
facet_wrap(~damage_type, ncol=2, scale="free_x") +
labs(title="Monetary damages by event type\nsorted by total cost of damages", x = "Cost in 1996 dollars", y = "Event Type") +
theme(rect=element_blank(),
plot.title = element_text(hjust=0),
axis.ticks = element_blank(),
strip.text = element_text(hjust=0),
axis.title.x = element_text(hjust=0),
panel.grid.major.x = element_line(size = .1, colour="gray50"))
## Selecting by total_damages
Top 10 economic damage-causers. Floods and hurricanes/typhoons have the greatest economic consequences overall, but droughts cause by far the most crop damage.