Weather events most harmful to population health and economic damage in the United States, 1997 to 2011

Synopsis

The three major causes of weather-related mortality and morbidity in the United States are excessive heat (18.0%), tornados (15.2%) and flash flooding (8.1%). Similar events are also among the three major causes of storm-related injuries (tornados 33.0%, flood 11.2% and excessive heat 10.5%).

Total weather-related fatalities are approximately 650 deaths annually. There are approximately 4025 storm-related injuries annually.

The three weather-related events causing the most economic damage are flood (34.3%), hurricanes (21.2%) and storm surges (15.0%). Total annual economic damage totals approximately $28.7 billion per year.

Crop damage, which is a subset of total economic damage, is also significantly affected by drought, which causes about 36% of the total $2.4 billion of weather-related crop damage occurring annually.

Data sourced from NOAA (National Weather Service), 1950-2011.

Data Processing

### start necessary libraries
library(tidyverse)
library(dplyr)
library(lubridate)
library(rlist)
library(gdata)
library(ggplot2)
library(knitr)

### set figure save path to figure/
knitr::opts_chunk$set(fig.path = 'figure/')
knitr::opts_chunk$set(fig.width=12, fig.height=8) 

Download data

Storm Data is from the National Weather Service.

Documentation for the data can be found at the National Weather Service and National Climatic Data Center Storm Events FAQ

Data comes in the form of a comma-separated-value file compressed via the bzip2 algorithm.

### download data ZIP file if necessary, and extract if necessary
if (!file.exists('StormData.csv.bz2')) {
  if (download.file('https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2','StormData.csv.bz2')) {
    stop('Unable to download data file')
  }
}

StormData <- as.tibble(read.csv(bzfile('StormData.csv.bz2')))

Variables of interest

Variable Description
BGN_DATE date of event starting
BGN_TIME time of event starting
TIME_ZONE time zone of event
COUNTYNAME name of county
STATE state of the United States
EVTYPE type of event
FATALITIES number of fatalities
INJURIES number of injuries
PROPDMG value of property damage in dollars
PROPDMGEXP property damage dollar multiplier (K thousands, M millions, B billions)
CROPDMG value of crop damage in dollars
CROPDMGEXP crop damage dollar multiplier (K thousands, M millions, B billions)
REFNUM unique reference number for event

Extract times of storm commencement

Extract time information from BGN_DATE, BGN_TIME and TIME_ZONE, and store in BEGIN_TIME.

### Information in BGN_TIME is not consistently stored
### some are in four-digit format e.g. '0000', '1300', '2350'. there is even one three-digit '000'
### others are in 11-character format e.g. '03:25:00 PM'
### convert all to 11-character format

StormData$BGN_calcTIME <- ifelse(str_length(StormData$BGN_TIME)<=4,
                             strftime(strptime(StormData$BGN_TIME,'%H%M'),'%I:%M:%S %p'),
                             as.character(StormData$BGN_TIME))
### more times which are in 'invalid' AM/PM format
### e.g. 15:30:00 PM, or 00:00:10 AM
### re-interpret these as 24-hour times
invalid.time.list <- is.na(strptime(StormData$BGN_calcTIME,'%I:%M:%S %p'))
StormData$BGN_calcTIME[invalid.time.list] <- 
  strftime(strptime(StormData[invalid.time.list,]$BGN_calcTIME,'%H:%M'), '%I:%M:%S %p')

### some times still remain invalid
kable(StormData[is.na(StormData$BGN_calcTIME),c('BGN_TIME','COUNTYNAME','STATE','REFNUM')],
      title = 'Invalid times to be manually adjusted')
BGN_TIME COUNTYNAME STATE REFNUM
1990 SCREVEN GA 197399
9999 OHZ060 - 070>078 OH 223421
2090 FAIRFAX VA 244548
1580 KENOSHA WI 247251
0572 WASHINGTON WI 247767
13O0 PRZ001 PR 248507
### these times are 'impossible' e.g. 19:90 hr, or sometimes subsituted a '0 (zero)' for 'O (letter)'
### manually adjust these times, assuming simple transposition or substitution errors
StormData$BGN_calcTIME[StormData$REFNUM == 197399] <- '07:09:00 PM' ### was 1990
StormData$BGN_calcTIME[StormData$REFNUM == 223421] <- '12:01:00 AM' ### was 9999
StormData$BGN_calcTIME[StormData$REFNUM == 244548] <- '08:09:00 PM' ### was 2090
StormData$BGN_calcTIME[StormData$REFNUM == 247251] <- '03:08:00 PM' ### was 1580
StormData$BGN_calcTIME[StormData$REFNUM == 247767] <- '05:27:00 AM' ### was 0572
StormData$BGN_calcTIME[StormData$REFNUM == 248507] <- '01:00:00 PM' ### was 13O0

StormData$BEGIN_TIME <- as.POSIXct(strptime(paste(sub(' .*','',StormData$BGN_DATE),
                                       StormData$BGN_calcTIME),'%m/%d/%Y %I:%M:%S %p'))

TIME_ZONE information, in particular, requires tidying up and adjustment.

List all current timezones, and standardize capitalization

kable(unique(StormData$TIME_ZONE))
x
CST
EST
PST
MST
CDT
PDT
EDT
UNK
HST
GMT
MDT
AST
ADT
CSt
ESt
CSC
SCT
ESY
UTC
SST
AKS
GST
### some timezone names use 'lower-case' characters, convert all to upper-case
StormData$TIME_ZONE <- as.factor(toupper(StormData$TIME_ZONE))
kable(StormData[StormData$TIME_ZONE %in% c('UNK','CSC','ESY','SCT'), c('COUNTYNAME','STATE','TIME_ZONE','REFNUM')],
      title = 'List of records with dubious timezone names')
COUNTYNAME STATE TIME_ZONE REFNUM
OCONEE GA UNK 23087
HAWAII HI UNK 27702
COLES IL UNK 30669
RUSH IN UNK 35244
HAMPDEN MA UNK 67899
LOVE OK UNK 120704
PERKINS SD UNK 143198
MCLENNAN TX UNK 154637
ARMSTRONG TX UNK 160236
CHICKASAW IA CSC 200720
IAZ033 - 034 - 045>052 - 057>068 - 070>078 - 081>089 - 092>099 IA SCT 201257
CAMBRIA PA ESY 229824
SHAWANO WI SCT 247612

Correct/adjust timezones.

Convert ‘daylight saving’ timezones to ‘standard’ timezones, as Lubridate adjusts for daylight saving times automatically.

### replace missing timezones 'UNK'
### Oconee County, Georgia
StormData$TIME_ZONE[StormData$REFNUM == 23087] <- 'EST'
### Hawaii country, Hawaii
StormData$TIME_ZONE[StormData$REFNUM == 27702] <- 'HST'
### Coles county, Illinois
StormData$TIME_ZONE[StormData$REFNUM == 30669] <- 'CST'
### Rush County, Indiana
StormData$TIME_ZONE[StormData$REFNUM == 35244] <- 'EST'
### Hampden County, Massachusetts
StormData$TIME_ZONE[StormData$REFNUM == 67899] <- 'EST'
### Love County, Oklahoma
StormData$TIME_ZONE[StormData$REFNUM == 120704] <- 'CST'
### Perkins County, South Dakota
StormData$TIME_ZONE[StormData$REFNUM == 143198] <- 'MST'
### McLennan County, Texas
StormData$TIME_ZONE[StormData$REFNUM == 154637] <- 'CST'
### Armstrong, Texas
StormData$TIME_ZONE[StormData$REFNUM == 160236] <- 'CST'

### Chickasaw, Mississippi. likely mistaken timezone 'CSC'
StormData$TIME_ZONE[StormData$REFNUM == 200720] <- 'CST'
### Cambria, Pennsylvania, likely mistaken timezone 'ESY'
StormData$TIME_ZONE[StormData$REFNUM == 229824] <- 'EST'
### Shawano, Wisconsin, likely mistaken timezone 'SCT'
StormData$TIME_ZONE[StormData$REFNUM == 247612] <- 'CST'
### IAZ033, Indiana, likely mistaken timezone 'SCT'
StormData$TIME_ZONE[StormData$REFNUM == 201257] <- 'EST'

### convert daylight saving timezones to 'standard' time
StormData$TIME_ZONE[StormData$TIME_ZONE == 'ADT'] <- 'AST'
StormData$TIME_ZONE[StormData$TIME_ZONE == 'CDT'] <- 'CST'
StormData$TIME_ZONE[StormData$TIME_ZONE == 'EDT'] <- 'EST'
StormData$TIME_ZONE[StormData$TIME_ZONE == 'MDT'] <- 'MST'
StormData$TIME_ZONE[StormData$TIME_ZONE == 'PDT'] <- 'PST'

Adjust BEGIN_TIME according to TIME_ZONE.

### adjust time zones. roll=TRUE prevents 'NA' times as result of daylight savings
StormData[StormData$TIME_ZONE == 'AKS',]$BEGIN_TIME <- 
  force_tz(StormData[StormData$TIME_ZONE == 'AKS',]$BEGIN_TIME, tzone = 'America/Anchorage', roll = TRUE)
StormData[StormData$TIME_ZONE == 'AST',]$BEGIN_TIME <- 
  force_tz(StormData[StormData$TIME_ZONE == 'AST',]$BEGIN_TIME, tzone = 'America/Halifax', roll = TRUE)
StormData[StormData$TIME_ZONE == 'CST',]$BEGIN_TIME <- 
  force_tz(StormData[StormData$TIME_ZONE == 'CST',]$BEGIN_TIME, tzone = 'America/Chicago', roll = TRUE)
StormData[StormData$TIME_ZONE == 'EST',]$BEGIN_TIME <- 
  force_tz(StormData[StormData$TIME_ZONE == 'EST',]$BEGIN_TIME, tzone = 'America/New_York', roll = TRUE)
StormData[StormData$TIME_ZONE == 'HST',]$BEGIN_TIME <- 
  force_tz(StormData[StormData$TIME_ZONE == 'HST',]$BEGIN_TIME, tzone = 'Pacific/Honolulu', roll = TRUE)
StormData[StormData$TIME_ZONE == 'MST',]$BEGIN_TIME <- 
  force_tz(StormData[StormData$TIME_ZONE == 'MST',]$BEGIN_TIME, tzone = 'America/Denver', roll = TRUE)
StormData[StormData$TIME_ZONE == 'PST',]$BEGIN_TIME <- 
  force_tz(StormData[StormData$TIME_ZONE == 'PST',]$BEGIN_TIME, tzone = 'America/Los_Angeles', roll = TRUE)
StormData[StormData$TIME_ZONE == 'SST',]$BEGIN_TIME <- 
  force_tz(StormData[StormData$TIME_ZONE == 'SST',]$BEGIN_TIME, tzone = 'Pacific/Samoa', roll = TRUE)
### other timezones 'GMT', 'GST', and 'UTC' all appear to be UTC+0, and so left alone

Event synonyms

Some events in EVTYPE which appear to be the same are recorded with different names (synonyms)

e.g. ‘TSTM WIND’, ‘THUNDERSTORM WIND’, ‘THUNDERSTORM WINDS’

These synonyms should be merged.

In this section ‘close’ synonyms will not be merged e.g. ‘MARINE THUNDERSTORM WIND’ will not be merged with ‘THUNDERSTORM WIND’, ‘THUNDERSTORM WIND/HAIL’ will not be merged into ‘THUNDERSTORM WIND’ and ‘STRONG WIND’ will not be merged into ‘WIND’, although arguably some of these close synonyms are the same.

### Most events are recorded in capitals ('upper case'), but some are recorded in mixed (upper and lower) case
### Covert all to upper case, strip all numbers, periods and trailing/leading whitespace
### numbers and parentheses are often used to describe sub-types within an event type e.g. size of hail
### there are some subtypes which are not yet merged with this logic e.g. 'G45' sub-types
StormData$EVTYPE <- as.factor(toupper(trimws(gsub('[0-9,.,(,)]+','',StormData$EVTYPE))))

### list of synonyms
synonyms <- list(c('THUNDERSTORM WIND','TSTM WIND','THUNDERSTORM WINDS','THUNDERSTORM WINDSS','THUNDERSTORMS WINDS','THUNDERSTROM WIND','THUNDERTSORM WIND','THUNERSTORM WINDS','TUNDERSTORM WIND','THUNDERSTORM WINS','TSTM WND'),
                 c('MARINE THUNDERSTORM WIND','MARINE TSTM WIND','MARINE TSTM WINDS'),
                 c('HIGH WIND','STRONG WIND','HIGH WINDS','STRONG WINDS'),
                 c('FLASH FLOOD','FLASH FLOODING','FLASH FLOODS'),
                 c('FLOOD','FLOODING'),
                 c('URBAN FLOOD','URBAN FLOODING'),
                 c('WILDFIRE','WILD/FOREST FIRE'),
                 c('RECORD HEAT','RECORD WARMTH'),
                 c('THUNDERSTORM WINDS HAIL','TSTM WIND/HAIL','THUNDERSTORM WINDS/HAIL'),
                 c('WIND GUSTS','GUSTY WIND'),
                 c('TORRENTIAL RAIN','TORRENTIAL RAINFALL'),
                 c('TORNADO','TORNADOS','TORNDAO'),
                 c('WIND','WND'),
                 c('THUNDERSTORM','THUNDERSTORMS'),
                 c('RIP CURRENT', 'RIP CURRENTS'),
                 c('HURRICANE','TYPHOON','HURRICANE/TYPHOON'))

### merge synonyms into one synonym
for (events in synonyms) {
  StormData[as.character(StormData$EVTYPE) %in% events,]$EVTYPE <- events[1]
}

Economic damage value calculations

Calculate ‘true’ dollar values of economic damage, combining PROPDMG with PROPDMGEXP and CROPDMG and CROPDMGEXP

In National Weather Service there are only three recognized abbreviations for PROPDMGEXP and CROPDMGEXP, being ‘K’ for thousands, ‘M’ for millions and ‘B’ for billions.

### convert the property and crop damage variables to true numeric values
StormData <- StormData %>%
  mutate(PropertyValue = PROPDMG * 1000^match(toupper(PROPDMGEXP),c('K','M','B'), nomatch=0)) %>%
  mutate(CropValue = CROPDMG * 1000^match(toupper(CROPDMGEXP),c('K','M','B'), nomatch=0))

Finding data which appears to have been collected consistently with the most recent years

Data has been collected from 1950 to 2011. Has data been consistently collected during this time? Compare the change in events recorded annually from 1950 to 2011, including the total number of events and the five most commonly recorded events.

### find the most common event types
frequent.events <- StormData %>%
  count(EVTYPE) %>%
  arrange(desc(n))

# show the twenty most frequent event types
print(head(frequent.events,20))
## # A tibble: 20 x 2
##    EVTYPE                        n
##    <fct>                     <int>
##  1 THUNDERSTORM WIND        323458
##  2 HAIL                     288761
##  3 TORNADO                   60654
##  4 FLASH FLOOD               54992
##  5 HIGH WIND                 25536
##  6 FLOOD                     25447
##  7 LIGHTNING                 15756
##  8 HEAVY SNOW                15708
##  9 MARINE THUNDERSTORM WIND  11987
## 10 HEAVY RAIN                11742
## 11 WINTER STORM              11433
## 12 WINTER WEATHER             7045
## 13 FUNNEL CLOUD               6845
## 14 WILDFIRE                   4218
## 15 WATERSPOUT                 3797
## 16 URBAN/SML STREAM FLD       3392
## 17 BLIZZARD                   2719
## 18 DROUGHT                    2488
## 19 ICE STORM                  2006
## 20 EXCESSIVE HEAT             1678

Plot the yearly event count for the top ten events, and the total event count (the total event count includes both ‘common’ and ‘uncommon’ events)

In order to make comparison easier, the event counts are ‘normalized’ to the 2011 counts i.e. the 2011 counts are the ‘baseline’.

# the ten most frequent events
most.frequent.events <- as.list(as.character(frequent.events[1:10,]$EVTYPE))

# count the events in 'most.frequent.events' in each year of recording
frequent.event.count <- StormData %>%
  filter(EVTYPE %in% most.frequent.events) %>%
  mutate(event.year = year(BEGIN_TIME)) %>%
  group_by(event.year, EVTYPE) %>%
  summarise(events = n())

# and the total events for each year
yearly.event.count <- StormData %>%
  mutate(event.year = year(BEGIN_TIME)) %>%
  group_by(event.year) %>%
  summarise(events = n())

yearly.event.count[,'EVTYPE'] <- 'TOTAL'

event.count <- bind_rows(frequent.event.count,yearly.event.count)

# 'normalize the events' to the frequency observed in 2011
# store the normalized event count in 'event.norm'
event.count[,'event.norm'] <- NA

for (event in list.append(most.frequent.events,'TOTAL')) {
  event.2011.count <- with(event.count, events[EVTYPE == event & event.year == 2011])
  for (year in with(event.count, event.year[EVTYPE == event])) {
    index <- with(event.count,(EVTYPE == event) & (event.year == year))
    event.count$event.norm[index] <- (event.count$events[index]/event.2011.count)
  }
}

ggplot(data=event.count,aes(x=event.year,y=event.norm,group=EVTYPE, colour=EVTYPE)) +
  geom_line()+geom_point() +
  labs(title = "Ten most frequent storm events by year, and total events, from 1950 to 2011",
       subtitle = "NOAA storm database. Yearly event frequency in comparison to 2011 event frequency") +
  labs(x = "Year", y = "Frequency in comparison to 2011 frequency") +
  guides(fill = guide_legend(title="")) +
  theme(legend.position=c(0.2,0.6))

Many of the frequent events are not recorded in the current form until at the mid-1990s. ‘MARINE THUNDERSTORM WIND’ is not even recorded until 2001.

Restrict data analysis from 1997 to 2011 (fifteen years in total)

StormSubset <- StormData[year(StormData$BEGIN_TIME)>=1997,
                         c('COUNTYNAME','STATE','EVTYPE','FATALITIES','INJURIES',
                           'PROPDMG','PROPDMGEXP','CROPDMG','CROPDMGEXP',
                           'REFNUM','PropertyValue','CropValue','BEGIN_TIME')]

Results

Which types of events are most harmful with respect to population health?

Rank the twenty events with the greatest number of fatalities or injuries per year. Population health harm contained in variables FATALITIES and INJURIES.

### calculate the sum of fatalities and injuries for each event for each year
HealthSum <- StormSubset %>%
  mutate(event.year = year(BEGIN_TIME)) %>%
  group_by(event.year, EVTYPE) %>%
  summarise(fatalities = sum(FATALITIES), injuries = sum(INJURIES))

### calculate the mean ('average') annual fatalities and injuries for each event
### calculate the proportion of annual fatalities or injuries caused by the event 
HealthMean <- HealthSum %>%
  group_by(EVTYPE) %>%
  summarise(mean_fatalities = mean(fatalities), mean_injuries = mean(injuries)) %>%
  mutate(proportion_fatalities = mean_fatalities/sum(mean_fatalities)*100) %>%
  mutate(proportion_injuries = mean_injuries/sum(mean_injuries)*100) %>%
  arrange(desc(mean_fatalities))

kable(HealthMean[1:20,c('EVTYPE','mean_fatalities','proportion_fatalities')], caption = 'Weather events causing most fatalities, annual mean fatalities, and percentage of fatalities caused by event')
Weather events causing most fatalities, annual mean fatalities, and percentage of fatalities caused by event
EVTYPE mean_fatalities proportion_fatalities
EXCESSIVE HEAT 117.400000 17.968625
TORNADO 99.000000 15.152418
FLASH FLOOD 53.000000 8.111901
LIGHTNING 39.866667 6.101782
RIP CURRENT 34.466667 5.275286
FLOOD 25.466667 3.897794
HEAT 23.700000 3.627397
THUNDERSTORM WIND 23.266667 3.561073
HIGH WIND 21.000000 3.214149
AVALANCHE 14.533333 2.224395
COLD AND SNOW 14.000000 2.142766
COLD/WIND CHILL 13.571429 2.077171
EXTREME COLD/WIND CHILL 12.500000 1.913184
WINTER STORM 11.133333 1.704009
HEAVY SURF/HIGH SURF 10.500000 1.607075
EXTREME COLD 9.285714 1.421223
FOG 8.285714 1.268168
TSUNAMI 8.250000 1.262701
HYPOTHERMIA/EXPOSURE 7.000000 1.071383
HIGH SURF 6.692308 1.024289
total_mean_fatalities <- sum(HealthMean$mean_fatalities)
sprintf('Total mean number of fatalities annually: %.1f',total_mean_fatalities)
## [1] "Total mean number of fatalities annually: 653.4"
kable(arrange(HealthMean,desc(mean_injuries))[1:20,c('EVTYPE','mean_injuries','proportion_injuries')],caption = 'Weather events causing the most injuries, annual mean injuries and percentage of injuries caused by event')
Weather events causing the most injuries, annual mean injuries and percentage of injuries caused by event
EVTYPE mean_injuries proportion_injuries
TORNADO 1330.80000 33.0652423
FLOOD 449.33333 11.1641986
EXCESSIVE HEAT 422.13333 10.4883836
THUNDERSTORM WIND 313.66667 7.7934057
LIGHTNING 255.06667 6.3374219
HEAT 122.20000 3.0361982
FLASH FLOOD 108.66667 2.6999471
FOG 95.71429 2.3781305
WILDFIRE 94.86667 2.3570704
HURRICANE 93.50000 2.3231140
HIGH WIND 84.60000 2.1019834
WINTER STORM 69.46667 1.7259785
GLAZE 53.00000 1.3168454
HAIL 42.06667 1.0451943
HEAVY SNOW 37.86667 0.9408405
WINTER WEATHER MIX 34.00000 0.8447687
TSUNAMI 32.25000 0.8012880
RIP CURRENT 29.20000 0.7255073
WINTER WEATHER 26.38462 0.6555558
MIXED PRECIP 26.00000 0.6459996
total_mean_injuries <- sum(HealthMean$mean_injuries)
sprintf('Total mean number of injuries annually: %.1f',total_mean_injuries)
## [1] "Total mean number of injuries annually: 4024.8"

Display results graphically of the events causing the most fatalities and injuries

### the (union) of the ten events causing the most fatalities and ten events causing the most injuries
most_health_events <- union(HealthMean$EVTYPE[1:10],arrange(HealthMean,desc(mean_injuries))$EVTYPE[1:10])

### gather data in preparation for bar-charting
HealthImpact <- HealthMean %>%
  filter(HealthMean$EVTYPE %in% most_health_events) %>%
  mutate(Injuries = -proportion_injuries) %>% ### negative values will be used for 'bar-negative-stack' later
  rename(Fatalities = proportion_fatalities) %>%
  select(one_of('EVTYPE','Injuries','Fatalities')) %>%
  gather(key = 'harm', value = 'proportion', Injuries, Fatalities)

# arrange factors by fatality order
HealthImpact$EVTYPE = reorder.factor(HealthImpact$EVTYPE, new.order = arrange(HealthMean,mean_fatalities)$EVTYPE)
arrange(HealthImpact, EVTYPE)
## # A tibble: 26 x 3
##    EVTYPE    harm       proportion
##    <fct>     <chr>           <dbl>
##  1 WILDFIRE  Injuries       -2.36 
##  2 WILDFIRE  Fatalities      0.888
##  3 HURRICANE Injuries       -2.32 
##  4 HURRICANE Fatalities      0.973
##  5 FOG       Injuries       -2.38 
##  6 FOG       Fatalities      1.27 
##  7 AVALANCHE Injuries       -0.250
##  8 AVALANCHE Fatalities      2.22 
##  9 HIGH WIND Injuries       -2.10 
## 10 HIGH WIND Fatalities      3.21 
## # ... with 16 more rows
subtitle <- sprintf('Data from NOAA storm database 1997 to 2011\n
                    Total mean fatalities per year : %.1f\n
                    Total mean injuries per year : %.1f',
                    total_mean_fatalities, total_mean_injuries)  

ggplot(data=HealthImpact, aes(x = EVTYPE, y = proportion, fill = harm)) +
  geom_bar(stat = 'identity') + ### 'identity' needed for values (rather than counts)
  labs(title = "The storm events causing the highest proportions of fatalities and injuries",
       subtitle = subtitle) +
  labs(x = "Event", y = "Percentage proportion of annual fatalities or injuries") +
  scale_y_continuous(labels=abs) + 
                     ### remove negative values for axis for 'negative' part of chart
  guides(fill = guide_legend(title="")) +
  theme(legend.position=c(0.8,0.5)) +
  coord_flip(ylim=c(-35,35))

Which types of events have the greatest economic consequences?

Rank the twenty events with the greatest value of property or crop damage per year Economic damage information is stored in PropertyValue and CropValue.

### remove variables no longer required
DamageSubset <- StormSubset %>%
  select(-c(PROPDMG,PROPDMGEXP,CROPDMG,CROPDMGEXP,FATALITIES,INJURIES))

### calculate the sum of damage for each event for each year
DamageSum <- DamageSubset %>%
  mutate(event.year = year(BEGIN_TIME)) %>%
  group_by(event.year, EVTYPE) %>%
  summarise(property = sum(PropertyValue), crop = sum(CropValue))

### calculate the mean ('average') damage for each event (for each of property and crop)
### calculate the proportion of annual damage caused by event (for each of property and crop)
### and also calculate combined mean damage (property+crop) and proportion
DamageMean <- DamageSum %>%
  group_by(EVTYPE) %>%
  summarise(mean_property = mean(property), mean_crop = mean(crop)) %>%
  mutate(mean_combined = mean_property + mean_crop) %>%
  mutate(proportion_property = mean_property/sum(mean_property)*100) %>%
  mutate(proportion_crop = mean_crop/sum(mean_crop)*100) %>%
  mutate(proportion_combined = mean_combined/sum(mean_combined)*100) %>%
  arrange(desc(mean_property))

kable(DamageMean[1:20,c('EVTYPE','mean_property','proportion_property')], caption = 'Weather events causing most property damage, annual mean property damage, and percentage of property damage caused by event')
Weather events causing most property damage, annual mean property damage, and percentage of property damage caused by event
EVTYPE mean_property proportion_property
FLOOD 9531642097 36.2058371
HURRICANE 5738335644 21.7970045
STORM SURGE 4318594900 16.4041350
TORNADO 1593155877 6.0515850
FLASH FLOOD 945352367 3.5909105
HAIL 932133795 3.5406999
STORM SURGE/TIDE 663026857 2.5185002
WILDFIRE 512436633 1.9464849
TROPICAL STORM 506411703 1.9235993
THUNDERSTORM WIND 497588195 1.8900833
HIGH WIND 354029326 1.3447765
ICE STORM 236839987 0.8996341
WINTER STORM 99690550 0.3786735
DROUGHT 60713400 0.2306192
LIGHTNING 47244955 0.1794594
HEAVY RAIN 36684229 0.1393446
TSUNAMI 36015500 0.1368045
HEAVY SNOW 33396323 0.1268556
BLIZZARD 32326930 0.1227935
LANDSLIDE 29507091 0.1120824
total_mean_property <- sum(DamageMean$mean_property)
sprintf('Total mean value of property damage annually: $%.0f',total_mean_property)
## [1] "Total mean value of property damage annually: $26326258003"
kable(arrange(DamageMean,desc(mean_crop))[1:20,c('EVTYPE','mean_crop','proportion_crop')],caption = 'Weather events causing the most crop damage, annual mean crop damage and percentage of crop damage caused by event')
Weather events causing the most crop damage, annual mean crop damage and percentage of crop damage caused by event
EVTYPE mean_crop proportion_crop
DROUGHT 857565733 35.8547176
HURRICANE 357161629 14.9328837
FLOOD 310131493 12.9665595
EXTREME COLD 162917714 6.8115696
HAIL 152800840 6.3885843
FROST/FREEZE 99471455 4.1588893
FLASH FLOOD 84755447 3.5436148
THUNDERSTORM WIND 61726533 2.5807788
HEAVY RAIN 48544387 2.0296348
TROPICAL STORM 45114067 1.8862136
HIGH WIND 39428147 1.6484860
EXCESSIVE HEAT 32826800 1.3724845
AGRICULTURAL FREEZE 28820000 1.2049606
WILDFIRE 26612009 1.1126448
FREEZE 24370833 1.0189415
TORNADO 18061054 0.7551304
THUNDERSTORM WINDS HAIL 7164464 0.2995454
HARD FREEZE 6450000 0.2696737
UNSEASONAL RAIN 5000000 0.2090494
HEAVY SNOW 4726807 0.1976272
total_mean_crop <- sum(DamageMean$mean_crop)
sprintf('Total mean value of crop damage annually: $%.0f',total_mean_crop)
## [1] "Total mean value of crop damage annually: $2391779355"
kable(arrange(DamageMean,desc(mean_combined))[1:20,c('EVTYPE','mean_combined','proportion_combined')],caption = 'Weather events causing the most economic damage, annual mean economic damage and percentage of economic damage caused by event')
Weather events causing the most economic damage, annual mean economic damage and percentage of economic damage caused by event
EVTYPE mean_combined proportion_combined
FLOOD 9841773590 34.2703558
HURRICANE 6095497272 21.2253268
STORM SURGE 4318594900 15.0379180
TORNADO 1611216931 5.6104702
HAIL 1084934635 3.7778857
FLASH FLOOD 1030107814 3.5869715
DROUGHT 918279133 3.1975693
STORM SURGE/TIDE 663148286 2.3091699
THUNDERSTORM WIND 559314729 1.9476078
TROPICAL STORM 551525770 1.9204856
WILDFIRE 539048642 1.8770386
HIGH WIND 393457473 1.3700709
ICE STORM 237503987 0.8270203
EXTREME COLD 163610286 0.5697126
FROST/FREEZE 100424182 0.3496903
WINTER STORM 100243950 0.3490627
HEAVY RAIN 85228616 0.2967773
LIGHTNING 47701395 0.1661026
HEAVY SNOW 38123129 0.1327498
TSUNAMI 36020500 0.1254281
total_mean_combined <- sum(DamageMean$mean_combined)
sprintf('Total mean economic damage annually: $%.0f',total_mean_combined)
## [1] "Total mean economic damage annually: $28718037357"

Display results graphically of the events causing the most property and crop damage

### the (union) of the ten events causing the most property, crop or combined economic damage
most_damage_events <- Reduce(union, list(DamageMean$EVTYPE[1:10],arrange(DamageMean,desc(mean_combined))$EVTYPE[1:10],
                            arrange(DamageMean,desc(mean_combined))$EVTYPE[1:10]))

### gather data in preparation for bar-charting
DamageImpact <- DamageMean %>%
  filter(DamageMean$EVTYPE %in% most_damage_events) %>%
  rename(Property = proportion_property, Crop = proportion_crop, Combined = proportion_combined) %>%
  select(one_of('EVTYPE','Property','Crop','Combined')) %>%
  gather(key = 'damage', value = 'proportion', Property, Crop, Combined)

### order the event types by the 'mean_combined' economic impact
DamageImpact$EVTYPE <- reorder.factor(DamageImpact$EVTYPE, new.order = arrange(DamageMean,mean_combined)$EVTYPE)
DamageImpact <- arrange(DamageImpact,EVTYPE)

subtitle <- sprintf('Data from NOAA storm database 1997 to 2011\n
                    Total mean property damage per year \t: $%12.0f\n
                    Total mean crop damage per year \t\t: $%12.0f\n
                    Total combined damage per year \t\t: $%12.0f\n',
                    total_mean_property, total_mean_crop, total_mean_combined)  

ggplot(data=DamageImpact, aes(x = EVTYPE, y = proportion, fill = damage)) +
  geom_bar(stat = 'identity', position = 'dodge') + ### 'identity' needed for values (rather than counts)
  ### 'dodge' for side-by-side bar chart
  labs(title = "The storm events causing the highest proportions of economic impact",
       subtitle = subtitle) +
  labs(x = "Event", y = "Percentage proportion of property, crop or combined value of loss") +
  guides(fill = guide_legend(title="")) +
  theme(legend.position=c(0.8,0.6)) +
  coord_flip()