Based on storm data from the National Oceanic and Atmospheric Administration’s (NOAA) National Weather Service (NWS), this report assesses the effects that severe weather events have had on US population health and the economy between 1996 and 2011. Out of the 66,700 storm-related casualties, excessive heat was the main cause of death, accounting for around 2,000 fatalities, whilst tornadoes caused the highest number of injuries (over 20,000 injured). Property and crop damage amounted to an estimated total of slightly over 400 billion US dollars over the same period, mostly due to floods, with the damage mainly affecting property.
Before reading and processing the data, we first load the R libraries that we’ll be using.
library(dplyr)
library(ggplot2)
library(tidyr)
library(lubridate)
The software versions that were used in this analysis, as returned by sessionInfo(), are listed in the appendices.
We download the raw storm data file, which is published online and is a bz2-compressed CSV file.
dataUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
dataFile <- "StormData.csv.bz2"
if (!file.exists(dataFile)) {
download.file(dataUrl, dataFile, mode = "wb")
}
We then read the raw data from the file into a table.
# read data from compressed data file (this may take up to a few minutes)
# (compressed data size: 48 MB, uncompressed data size: 548 MB)
rawStorm <- read.csv(dataFile)
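read.csv() is handed the .bz2 file name directly because R detects the compression when it opens the file for reading. The following is a minimal sketch of that behaviour on a small made-up table (the toy data frame and the temporary file are assumptions for illustration, not part of the storm data). stringsAsFactors is set explicitly because its default changed from TRUE to FALSE in R 4.0.0.

```r
# hypothetical toy table standing in for the storm data
toy <- data.frame(EVTYPE = c("TORNADO", "HAIL"),
                  FATALITIES = c(0, 1),
                  stringsAsFactors = FALSE)

# write the table as a bz2-compressed CSV in a temporary file
tmpFile <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmpFile, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read the compressed file back by name, without decompressing it first
toyRead <- read.csv(tmpFile, stringsAsFactors = FALSE)
toyRead

# remove the temporary file
unlink(tmpFile)
```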
For exploratory purposes, we now display the dimensions of the data set as well as the first few rows.
dim(rawStorm)
## [1] 902297 37
head(rawStorm)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL
## EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO 0 0
## 2 TORNADO 0 0
## 3 TORNADO 0 0
## 4 TORNADO 0 0
## 5 TORNADO 0 0
## 6 TORNADO 0 0
## COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1 NA 0 14.0 100 3 0 0
## 2 NA 0 2.0 150 2 0 0
## 3 NA 0 0.1 123 2 0 0
## 4 NA 0 0.0 100 2 0 0
## 5 NA 0 0.0 150 2 0 0
## 6 NA 0 1.5 177 2 0 0
## INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1 15 25.0 K 0
## 2 0 2.5 K 0
## 3 2 25.0 K 0
## 4 2 2.5 K 0
## 5 2 2.5 K 0
## 6 6 2.5 K 0
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3040 8812 3051 8806 1
## 2 3042 8755 0 0 2
## 3 3340 8742 0 0 3
## 4 3458 8626 0 0 4
## 5 3412 8642 0 0 5
## 6 3450 8748 0 0 6
Having the column names in upper case is somewhat inconvenient, so we’ll convert them to lower case.
# convert column names to lower case for ease of use
colnames(rawStorm) <- tolower(colnames(rawStorm))
We also want to see where values are missing:
# count missing values by column
numNAs <- colSums(is.na(rawStorm))
# extract columns with missing values
numMissingValuesOnly <- numNAs[numNAs != 0]
# display number of missing values
numMissingValuesOnly
## countyendn f latitude latitude_e
## 902297 843563 47 40
# show as ratio to total number of values
numMissingValuesOnly / nrow(rawStorm)
## countyendn f latitude latitude_e
## 1.000000e+00 9.349061e-01 5.208928e-05 4.433130e-05
All the values in the countyendn column, most (93.5%) of those in the f column, and a few (around 0.005%) of the latitude and latitude_e values are missing.
These columns hold geographical data or, in the case of f, the tornado F-scale rating; we will not be using any of them in our analysis, so the missing values can safely be ignored.
Based on the Storm Events Database web page, we consider that the storm event database is highly biased towards some types of events prior to 1996, as:
From 1950 through 1954, only tornado events were recorded.
From 1955 through 1995, only tornado, thunderstorm wind and hail events were recorded.
From 1996 to present, 48 event types are recorded as defined in NWS Directive 10-1605.
For this reason, we choose to ignore all the events before 1996, in order to have a more representative selection of event types.
# add a year column with year of event
rawStorm$year <- year(mdy_hms(rawStorm$bgn_date))
# ratio of preserved events within the data set
ratioKeptEvents <- sum(rawStorm$year >= 1996) / nrow(rawStorm)
ratioKeptEvents
## [1] 0.7242959
# keep only data from 1996 onward for our analysis
storm <- subset(rawStorm, year >= 1996)
Our analysis will be based on 72.4% of the original data set, which is most of the data and therefore considered acceptable.
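For readers without lubridate, the year extraction and the filter can be sketched in base R. This is only an illustration on three date strings in the same format as bgn_date (the third one is a made-up value); it relies on as.Date() ignoring the trailing time part once the format has been consumed.

```r
# date strings in the bgn_date format; the last value is hypothetical
bgnDate <- c("4/18/1950 0:00:00", "11/15/1951 0:00:00", "1/6/1996 0:00:00")

# parse month/day/year; the trailing " 0:00:00" is ignored by as.Date()
parsed <- as.Date(bgnDate, format = "%m/%d/%Y")

# extract the year as an integer
eventYear <- as.integer(format(parsed, "%Y"))
eventYear

# flag the events that would be kept (1996 onward) and the kept ratio
keep <- eventYear >= 1996
mean(keep)
```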
Let us consider the number of different event types in the data set.
eventTypesLevels <- nlevels(storm$evtype)
eventTypesLevels
## [1] 985
The evtype factor has 985 levels (a count that covers the complete raw data set, since subset() keeps unused factor levels), whereas the data should only contain a maximum of 48 event types, namely the ones listed in section 2.1.1 of the Storm Data Documentation, as per the information on event types given on the Storm Events Database web page.
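How nlevels() behaves on a subset can be checked on a toy example. A minimal sketch, assuming made-up values (not taken from the data set) and an event type column stored as a factor, which read.csv() did by default before R 4.0.0:

```r
# hypothetical toy data with three event types
toy <- data.frame(
  evtype = factor(c("TORNADO", "HAIL", "FLOOD", "HAIL")),
  year = c(1950, 1996, 2001, 2010))

# keep recent events only: the TORNADO row is dropped
recent <- subset(toy, year >= 1996)

# the factor still carries all three original levels
levelsKept <- nlevels(recent$evtype)
levelsKept

# number of event types actually present in the subset
typesPresent <- length(unique(recent$evtype))
typesPresent

# droplevels() removes the unused levels
levelsDropped <- nlevels(droplevels(recent$evtype))
levelsDropped
```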
We will therefore proceed to match up the event types from the data set to the official event types, noting that this approach should really be reviewed by a subject matter expert.
# extract event types for further processing
evtypes <- storm$evtype
### general cleaning
## remove whitespace
cleanEventType <- trimws(evtypes)
## convert to lower case
cleanEventType <- tolower(cleanEventType)
### first match the most specific event types...
## astronomical low tide
cleanEventType <- gsub(".*blow-out tide.*", "astronomical low tide", cleanEventType)
## avalanche
cleanEventType <- gsub(".*(avalance|slide|landslump|landspout).*",
"avalanche", cleanEventType)
## blizzard
cleanEventType <- gsub(".*blizzard.*",
"blizzard", cleanEventType)
## coastal flood
cleanEventType <- gsub(
paste(".*(astronomical high tide|beach eros|beach flood|cstl|tidal|high tides",
"|coastal flooding|coastal surge|coastal/tidal|coastalfl).*"),
"coastal flood", cleanEventType)
## dust storm
cleanEventType <- gsub(".*(blowing dust|dust ?storm|saharan).*",
"dust storm", cleanEventType)
## dense fog
cleanEventType <- gsub(".*fog.*",
"dense fog", cleanEventType)
## dense smoke
cleanEventType <- gsub(".*smoke.*",
"dense smoke", cleanEventType)
## drought
cleanEventType <- gsub(".*(dry|drought).*",
"drought", cleanEventType)
## dust devil
cleanEventType <- gsub(".*dust dev.*",
"dust devil", cleanEventType)
## flash flood
cleanEventType <- gsub(".*flash.?flooo?d.*",
"flash flood", cleanEventType)
## flood
cleanEventType <- gsub(".*(?<!flash )flood.*",
"flood", cleanEventType, perl = TRUE)
cleanEventType <- gsub("^urban.*",
"flood", cleanEventType)
## ice storm (mapped to frost/freeze)
cleanEventType <- gsub(".*(glaze|ice ?storm).*",
"frost/freeze", cleanEventType)
## frost/freeze
cleanEventType <- gsub(".*(freez|frost|ice(?! storm)|icy).*",
"frost/freeze", cleanEventType, perl = TRUE)
## funnel cloud
cleanEventType <- gsub(".*(funnel|wall cloud).*",
"funnel cloud", cleanEventType)
## hail
cleanEventType <- gsub(".*hail.*",
"hail", cleanEventType)
## high surf
cleanEventType <- gsub(
".*(surf|swell|high seas|high water|wave|rough seas).*",
"high surf", cleanEventType)
## high wind
cleanEventType <- gsub("^high$", "high wind", cleanEventType)
cleanEventType <- gsub("^(high ? ?wind).*",
"high wind", cleanEventType)
## hurricane (typhoon)
cleanEventType <- gsub(".*(hurricane|typhoon).*",
"hurricane (typhoon)", cleanEventType)
## lake-effect snow
cleanEventType <- gsub(".*(lake snow|lake effect).*",
"lake-effect snow", cleanEventType)
## lightning
cleanEventType <- gsub("^(lightning|ligntning).*",
"lightning", cleanEventType)
## marine strong wind
cleanEventType <- gsub("^(heavy seas).*",
"marine strong wind", cleanEventType)
## marine thunderstorm wind
cleanEventType <- gsub("^(coastal ?storm).*",
"marine thunderstorm wind", cleanEventType)
## rip current
cleanEventType <- gsub("^(rip current).*",
"rip current", cleanEventType)
## sleet
cleanEventType <- gsub("^(sleet).*",
"sleet", cleanEventType)
## storm surge/tide
cleanEventType <- gsub("^(storm surge).*",
"storm surge/tide", cleanEventType)
## tornado
cleanEventType <- gsub(".*torn(ado|dao).*",
"tornado", cleanEventType)
## tropical storm
cleanEventType <- gsub(".*tropical storm.*",
"tropical storm", cleanEventType)
## volcanic ash
cleanEventType <- gsub(".*(vog|volcanic).*",
"volcanic ash", cleanEventType)
## waterspout
cleanEventType <- gsub(".*way?ter ?spout.*",
"waterspout", cleanEventType)
## wildfire
cleanEventType <- gsub(".*fire.*",
"wildfire", cleanEventType)
## winter storm
cleanEventType <- gsub(".*(thundersnow|winter storm).*",
"winter storm", cleanEventType)
## winter weather
cleanEventType <- gsub(".*(winter.*mix|wintery|wintry).*",
"winter weather", cleanEventType)
### now match less specific event types...
## extreme cold/wind chill
cleanEventType <- gsub(
paste(".*(wind ?chill|extreme cold|low temperature|record cool",
"|record low|unseasonably co|unseasonal low|unseasonable cold",
"|unusually cold).*"),
"extreme cold/wind chill", cleanEventType)
## cold/wind chill
cleanEventType <- gsub(".*(cold|cool|hypothermia).*",
"cold/wind chill", cleanEventType)
## excessive heat
cleanEventType <- gsub(
paste(".*(heat|high temperature|record high|record warm",
"|unseasonably Warm|unseasonably hot|warmth|unusually warm",
"|very warm|warm weather|hyperthermia).*"),
"excessive heat", cleanEventType)
## heavy rain
cleanEventType <- gsub(
paste("^(abnormally wet|excessive precip|excessive rain",
"|excessive wet|extremely wet|heavy mix",
"|heavy shower|hvy rain|prolonged rain|rain",
"|record rain|torrential rain|unseasonably wet|unseasonal rain",
"|wet).*"),
"heavy rain", cleanEventType)
cleanEventType <- gsub(
".*(heavy precip|heavy rain|excessive rain|record precip).*",
"heavy rain", cleanEventType)
## heavy snow
cleanEventType <- gsub(
paste("^(blowing snow|excessive snow|snow).*"),
"heavy snow", cleanEventType)
cleanEventType <- gsub(".*(heavy snow|record snow).*",
"heavy snow", cleanEventType)
## strong wind
cleanEventType <- gsub(
paste(".*(gradient wind|gusty|micr?oburst|non-?thunderstorm wind|turbulence",
"|downburst|strong wind).*"),
"strong wind", cleanEventType)
cleanEventType <- gsub("^(wind|wnd).*",
"strong wind", cleanEventType)
## thunderstorm wind
cleanEventType <- gsub(
paste(".*(tstm|th?und?ee?r?e?s?torm|thuderstorm|thunderstrom|thundertsorm",
"|gustnado|whirlwind|metro storm",
"|rotating wall cloud|storm force wind",
"|windstorm).*"),
"thunderstorm wind", cleanEventType)
A number of events could not be tagged (e.g. event types starting with “summary for”, or labelled “excessive” without mentioning what was in excess): these events will be labelled with the catch-all event type “other”.
officialEventTypes <- c("astronomical low tide", "avalanche", "blizzard",
"coastal flood", "cold/wind chill", "debris flow", "dense fog",
"dense smoke", "drought", "dust devil", "dust storm", "excessive heat",
"extreme cold/wind chill", "flash flood", "flood", "frost/freeze",
"funnel cloud", "freezing fog", "hail", "heat", "heavy rain", "heavy snow",
"high surf", "high wind", "hurricane (typhoon)", "ice storm",
"lake-effect snow", "lakeshore flood", "lightning", "marine hail",
"marine high wind", "marine strong wind", "marine thunderstorm wind",
"rip current", "seiche", "sleet", "storm surge/tide", "strong wind",
"thunderstorm wind", "tornado", "tropical depression", "tropical storm",
"tsunami", "volcanic ash", "waterspout", "wildfire", "winter storm",
"winter weather")
# anything that doesn't match an official event type gets assigned the
# type "other"
eventsWithUnmatchedEventTypesVector <- !(cleanEventType %in% officialEventTypes)
cleanEventType[eventsWithUnmatchedEventTypesVector] <- "other"
# calculate some indicators on the impact of this approach on the data set
numEventsWithUnmatchedEventTypes <- sum(eventsWithUnmatchedEventTypesVector)
numEventsWithUnmatchedEventTypes
## [1] 761
ratioUnmatchedEventTypes <- sum(eventsWithUnmatchedEventTypesVector) /
nrow(storm)
ratioUnmatchedEventTypes
## [1] 0.001164445
Our strategy has resulted in 761 events (i.e. 0.116%) having unmatched event types, and being assigned the type other. This is a very low ratio, which is deemed satisfactory for our analysis.
# calculate number of distinct event types
numEventTypes <- length(unique(cleanEventType))
numEventTypes
## [1] 39
The clean-up operation produced a total of 39 distinct event types (including the catch-all “other” event type).
We finally add the cleaned event types as a new column in the data set.
storm$cleanEventType <- cleanEventType
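The clean-up rules above depend on two details: the order in which the rules run (most specific first) and the way multi-part patterns are joined. The following is a minimal sketch on made-up event labels (hypothetical values, not from the data set). It joins the pattern pieces with paste0(), since paste() with its default sep = " " would put a space at each join and that space would become part of the regular expression.

```r
# hypothetical raw labels
toyTypes <- c("Flash Flood ", "URBAN FLOOD", "Record Warmth",
              "unseasonably hot", "river flood")

# general cleaning: trim whitespace, convert to lower case
clean <- tolower(trimws(toyTypes))

# a long pattern split over two strings, joined without a separator
heatPattern <- paste0(".*(heat|record warm",
                      "|unseasonably hot|warmth).*")
# the same pieces joined with paste(): note the space before the "|"
spacedPattern <- paste(".*(heat|record warm",
                       "|unseasonably hot|warmth).*")

# the extra space changes what the pattern matches
grepl(heatPattern, "record warm")
grepl(spacedPattern, "record warm")

# apply the rules, most specific first
clean <- gsub(heatPattern, "excessive heat", clean)
clean <- gsub(".*flash.?flooo?d.*", "flash flood", clean)
# negative lookbehind (requires perl = TRUE): "flood" not preceded by "flash "
clean <- gsub(".*(?<!flash )flood.*", "flood", clean, perl = TRUE)
clean
```

In this sketch, swapping the last two rules would not change the result, because the lookbehind protects "flash flood"; any other label containing "flood", however, is collapsed to "flood" by the last rule, so more specific flood types would need their own exclusion.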
Economic impacts of storm-related events are measured as property damage and crop damage, each of which is coded in the data set as a base value and a magnitude (B for billions, M for millions, K for thousands), as shown in this extract:
storm %>%
select(propdmg, propdmgexp, cropdmg, cropdmgexp) %>%
slice(1:10)
## propdmg propdmgexp cropdmg cropdmgexp
## 1 380 K 38 K
## 2 100 K 0
## 3 3 K 0
## 4 5 K 0
## 5 2 K 0
## 6 0 0
## 7 400 K 0
## 8 12 K 0
## 9 8 K 0
## 10 12 K 0
Let us check if the data set consistently applies this coding scheme by checking the values used for the orders of magnitude.
# orders of magnitude for property damage
unique(storm$propdmgexp)
## [1] K M B 0
## Levels: - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
# orders of magnitude for crop damage
unique(storm$cropdmgexp)
## [1] K M B
## Levels: ? 0 2 B k K m M
The coding scheme is mostly respected (we note that subsetting the data set to events from 1996 onwards has filtered out unclean magnitude specifiers such as -, ?, 5 and h), but we still need to investigate what the value 0 and the absence of value correspond to.
Let us show events where the property damage or crop damage base value is non-zero and the order of magnitude is either 0 or missing.
storm %>% select(propdmg, propdmgexp) %>%
filter(propdmg != 0 & propdmgexp %in% c("", 0))
## [1] propdmg propdmgexp
## <0 rows> (or 0-length row.names)
storm %>% select(cropdmg, cropdmgexp) %>%
filter(cropdmg != 0 & cropdmgexp %in% c("", 0))
## [1] cropdmg cropdmgexp
## <0 rows> (or 0-length row.names)
No events match these conditions, meaning that we can ignore the superfluous or missing magnitude specifiers.
We add two new columns to the data set, one for property damage and one for crop damage, containing the magnitudes expressed as a multiplier, defaulting to 0. We then update this multiplier when the magnitude specifier is either B, M or K.
# create new columns with multiplier, defaulting to 0
storm$propDmgMultiplier <- 0
storm$cropDmgMultiplier <- 0
# update multiplier based on magnitude
storm$propDmgMultiplier[storm$propdmgexp == "B"] <- 1e9
storm$propDmgMultiplier[storm$propdmgexp == "M"] <- 1e6
storm$propDmgMultiplier[storm$propdmgexp == "K"] <- 1e3
storm$cropDmgMultiplier[storm$cropdmgexp == "B"] <- 1e9
storm$cropDmgMultiplier[storm$cropdmgexp == "M"] <- 1e6
storm$cropDmgMultiplier[storm$cropdmgexp == "K"] <- 1e3
Finally we create two new columns containing the dollar-value amounts of damage, obtained by multiplying the base values by the multiplier, for both property damage and crop damage.
storm$propDamageDollars <- storm$propdmg * storm$propDmgMultiplier
storm$cropDamageDollars <- storm$cropdmg * storm$cropDmgMultiplier
# display first few rows to cross-check
storm %>%
select(propdmg, propdmgexp, propDamageDollars,
cropdmg, cropdmgexp, cropDamageDollars) %>%
slice(1:10)
## propdmg propdmgexp propDamageDollars cropdmg cropdmgexp
## 1 380 K 380000 38 K
## 2 100 K 100000 0
## 3 3 K 3000 0
## 4 5 K 5000 0
## 5 2 K 2000 0
## 6 0 0 0
## 7 400 K 400000 0
## 8 12 K 12000 0
## 9 8 K 8000 0
## 10 12 K 12000 0
## cropDamageDollars
## 1 38000
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
## 7 0
## 8 0
## 9 0
## 10 0
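The same conversion can be expressed with a lookup vector instead of one assignment per specifier. This is a sketch only, on made-up base values and specifiers; as above, any specifier other than B, M or K (including a missing one) gets a multiplier of 0.

```r
# multipliers for the three valid magnitude specifiers
multipliers <- c(K = 1e3, M = 1e6, B = 1e9)

# hypothetical base values and specifiers
dmg    <- c(380, 0, 2.5, 1.2, 0)
dmgExp <- c("K", "", "M", "B", "0")

# look up each specifier; unknown or missing specifiers give NA
dmgMultiplier <- unname(multipliers[match(dmgExp, names(multipliers))])
# default to 0 when the specifier is not B, M or K
dmgMultiplier[is.na(dmgMultiplier)] <- 0

# dollar amounts
dmgDollars <- dmg * dmgMultiplier
dmgDollars
```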
In order to have a general idea of the harm caused by storms to population health in the United States, we will first calculate the total number of fatalities, injuries, and overall casualties (i.e. injuries and fatalities) that storms have caused from 1996 to 2011.
# calculate total fatalities, injuries, and overall casualties
totalFatalities <- sum(storm$fatalities)
totalFatalities
## [1] 8732
totalInjuries <- sum(storm$injuries)
totalInjuries
## [1] 57975
totalCasualties <- sum(storm$fatalities) + sum(storm$injuries)
totalCasualties
## [1] 66707
ratioFatalities <- totalFatalities / totalCasualties
ratioFatalities
## [1] 0.1309008
During the analysed period, storm-related events were responsible for 66,707 casualties, broken down into 57,975 injuries and 8,732 deaths (13.1% of total casualties).
We will estimate the harm caused by storms by event type, expressed as the total number of casualties.
# casualties by event type, ignoring events that have caused no casualties
casualtiesByEventType <- storm %>%
select(cleanEventType, fatalities, injuries) %>%
group_by(cleanEventType) %>%
mutate(casualties = fatalities + injuries) %>%
filter(casualties != 0) %>%
summarise_each(funs(sum))
numEventTypeCausingCasualties <- nrow(casualtiesByEventType)
numEventTypeCausingCasualties
## [1] 32
Out of the 39 event types (including the “other” generic event type) in the cleaned data set, 32 have caused casualties.
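The grouping step can also be sketched in base R with aggregate(). The example below uses a made-up toy table (hypothetical values) rather than the storm data, and drops the event types without casualties after summing instead of before, which gives the same totals.

```r
# hypothetical events
toy <- data.frame(
  type = c("tornado", "flood", "tornado", "hail", "flood"),
  fatalities = c(1, 0, 2, 0, 3),
  injuries = c(10, 0, 5, 0, 1),
  stringsAsFactors = FALSE)

# casualties are the sum of fatalities and injuries
toy$casualties <- toy$fatalities + toy$injuries

# total fatalities, injuries and casualties by event type
byType <- aggregate(cbind(fatalities, injuries, casualties) ~ type,
                    data = toy, FUN = sum)

# drop event types that caused no casualties, then sort in decreasing order
byType <- byType[byType$casualties != 0, ]
byType <- byType[order(-byType$casualties), ]
byType
```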
The following graph represents the total number of casualties (fatalities and injuries) in the United States, by event type, during the studied period.
# reorder event types by number of casualties
casualtiesByEventType$cleanEventType <- factor(
casualtiesByEventType$cleanEventType,
levels = casualtiesByEventType$cleanEventType[order(
casualtiesByEventType$casualties)])
# rename columns and reshape data in preparation of stacked bar plot
casualtiesByEventTypeReshaped <- casualtiesByEventType %>%
rename(fatality = fatalities, injury = injuries) %>%
gather("casualtyType", "numCasualties", c(fatality, injury))
# create bar plot of casualties, stacked by type of casualty
ggplot(casualtiesByEventTypeReshaped, aes(cleanEventType, numCasualties)) +
geom_bar(aes(fill = casualtyType), stat = "identity") +
coord_flip() +
scale_fill_discrete(name="Type of casualty") +
theme(legend.position="bottom") +
labs(title = "Casualties due to storms in the US, by type of event, from 1996 to 2011",
x = "Type of event", y = "Total number of casualties from 1996 to 2011")
Figure: Casualties due to storms in the US, by type of event, from 1996 to 2011 (horizontal stacked bar chart of fatalities and injuries per event type).
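The bar order in the plot comes from the order of the factor levels, not from the order of the rows. A minimal sketch of the reordering step, using three casualty totals from this analysis (tornado, excessive heat and flood) in a small stand-alone data frame whose name is an assumption for illustration:

```r
# three event types with their casualty totals, deliberately unsorted
eventTotals <- data.frame(
  type = c("flood", "tornado", "excessive heat"),
  casualties = c(7296, 22178, 9652),
  stringsAsFactors = FALSE)

# set the factor levels in increasing order of casualties
eventTotals$type <- factor(
  eventTotals$type,
  levels = eventTotals$type[order(eventTotals$casualties)])

# the levels now run from the smallest to the largest total
levels(eventTotals$type)
```

With the levels in increasing order, the flipped bar chart should show the event type with the most casualties at the top.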
The top 10 causes of casualties (fatalities and injuries) are:
top10CasualtiesByEventType <- casualtiesByEventType %>%
arrange(desc(casualties)) %>%
slice(1:10)
top10CasualtiesByEventType
## Source: local data frame [10 x 4]
##
## cleanEventType fatalities injuries casualties
## (fctr) (dbl) (dbl) (dbl)
## 1 tornado 1511 20667 22178
## 2 excessive heat 2037 7615 9652
## 3 flood 450 6846 7296
## 4 thunderstorm wind 398 5071 5469
## 5 lightning 651 4141 4792
## 6 flash flood 887 1674 2561
## 7 wildfire 87 1458 1545
## 8 winter storm 191 1292 1483
## 9 hurricane (typhoon) 125 1328 1453
## 10 high wind 235 1083 1318
# calculate ratio of casualties due to leading cause compared to total
# casualties
ratioLeadingCauseCasualties <- top10CasualtiesByEventType$casualties[1] /
totalCasualties
ratioLeadingCauseCasualties
## [1] 0.3324689
With 22,178 casualties, tornadoes were the leading cause of casualties during the considered period, accounting for 33.2% of all casualties.
The top 10 causes of death (fatalities) are:
top10FatalitiesByEventType <- casualtiesByEventType %>%
arrange(desc(fatalities)) %>%
slice(1:10) %>%
select(cleanEventType, fatalities)
top10FatalitiesByEventType
## Source: local data frame [10 x 2]
##
## cleanEventType fatalities
## (fctr) (dbl)
## 1 excessive heat 2037
## 2 tornado 1511
## 3 flash flood 887
## 4 lightning 651
## 5 rip current 542
## 6 flood 450
## 7 thunderstorm wind 398
## 8 cold/wind chill 396
## 9 avalanche 266
## 10 high wind 235
# calculate ratio of fatalities due to leading cause compared to total
# fatalities
ratioLeadingCauseFatalities <- top10FatalitiesByEventType$fatalities[1] /
totalFatalities
ratioLeadingCauseFatalities
## [1] 0.2332799
The leading cause of death, excessive heat, accounted for 2,037 fatalities (23.3% of total fatalities).
The top 10 causes of injuries are:
top10InjuriesByEventType <- casualtiesByEventType %>%
arrange(desc(injuries)) %>%
slice(1:10) %>%
select(cleanEventType, injuries)
top10InjuriesByEventType
## Source: local data frame [10 x 2]
##
## cleanEventType injuries
## (fctr) (dbl)
## 1 tornado 20667
## 2 excessive heat 7615
## 3 flood 6846
## 4 thunderstorm wind 5071
## 5 lightning 4141
## 6 flash flood 1674
## 7 wildfire 1458
## 8 hurricane (typhoon) 1328
## 9 winter storm 1292
## 10 high wind 1083
# calculate ratio of injuries due to leading cause compared to total
# injuries
ratioLeadingCauseInjuries <- top10InjuriesByEventType$injuries[1] /
totalInjuries
ratioLeadingCauseInjuries
## [1] 0.3564812
Most injuries were caused by tornadoes, which were responsible for 20,667 injuries (35.6% of all injuries).
As we did for casualties, we will first assess the overall economic consequences of storms in the United States by calculating the total property damage, crop damage, and overall damage (i.e. property and crop) that storms have caused from 1996 to 2011.
# calculate total property damage, crop damage, and overall damage
totalPropertyDamage <- sum(storm$propDamageDollars)
totalPropertyDamage
## [1] 366767615380
totalCropDamage <- sum(storm$cropDamageDollars)
totalCropDamage
## [1] 34752728730
totalDamage <- totalPropertyDamage + totalCropDamage
totalDamage
## [1] 401520344110
# crop damage as ratio of total damage
ratioCropDamage <- totalCropDamage / totalDamage
ratioCropDamage
## [1] 0.08655285
Total damage caused by storms during the analysed period amounts to 402 billion US dollars, combining property damage (US$367B) and crop damage (US$34.8B, or 8.66% of the total damage).
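The rounded figures quoted in the sentence above follow from the totals printed earlier, rounded to three significant figures. A short worked check, using only the numbers from the output above:

```r
# totals as printed above, in US dollars
totalPropertyDamage <- 366767615380
totalCropDamage <- 34752728730
totalDamage <- totalPropertyDamage + totalCropDamage

# amounts in billions of dollars, rounded to three significant figures
billions <- signif(c(total = totalDamage,
                     property = totalPropertyDamage,
                     crop = totalCropDamage) / 1e9, 3)
billions

# crop damage as a percentage of total damage
pctCrop <- signif(100 * totalCropDamage / totalDamage, 3)
pctCrop
```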
We will now estimate the economic damage caused by storms by event type, expressed as the total dollar amount of property and crop damage.
damageByEventType <- storm %>%
select(cleanEventType, propDamageDollars, cropDamageDollars) %>%
group_by(cleanEventType) %>%
mutate(totalDamageDollars = propDamageDollars + cropDamageDollars) %>%
filter(totalDamageDollars != 0) %>%
summarise_each(funs(sum))
The following graph represents the total damage (property damage and crop damage) in the United States, by event type, during the studied period.
# reorder event types by damage amount
damageByEventType$cleanEventType <- factor(
damageByEventType$cleanEventType,
levels = damageByEventType$cleanEventType[order(
damageByEventType$totalDamageDollars)])
# rename columns and reshape data in preparation of stacked bar plot
damageByEventTypeReshaped <- damageByEventType %>%
rename(property = propDamageDollars, crop = cropDamageDollars) %>%
gather("damageType", "damageDollars", c(property, crop))
# create bar plot of damage, stacked by type of damage
ggplot(damageByEventTypeReshaped, aes(cleanEventType, damageDollars/1e9)) +
geom_bar(aes(fill = damageType), stat = "identity") +
coord_flip() +
scale_fill_discrete(name="Scope of damage") +
theme(legend.position="bottom") +
labs(title = "Damage due to storms in the US, by type of event, from 1996 to 2011",
x = "Type of event", y = "Total damage in billions of US dollars from 1996 to 2011")
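Figure: Damage due to storms in the US, by type of event, from 1996 to 2011 (horizontal stacked bar chart of property and crop damage per event type, in billions of US dollars).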
The top 10 causes of overall damage are:
top10DamageByEventType <- damageByEventType %>%
arrange(desc(totalDamageDollars)) %>%
select(cleanEventType, totalDamageDollars) %>%
slice(1:10)
top10DamageByEventType
## Source: local data frame [10 x 2]
##
## cleanEventType totalDamageDollars
## (fctr) (dbl)
## 1 flood 149566260260
## 2 hurricane (typhoon) 87068996810
## 3 storm surge/tide 47835579000
## 4 tornado 24900370720
## 5 hail 17201091620
## 6 flash flood 16557170610
## 7 drought 14415414600
## 8 thunderstorm wind 8827476130
## 9 tropical storm 8320186550
## 10 wildfire 8162704630
# calculate ratio of damage due to leading cause compared to total
# damage
ratioLeadingCauseDamage <- top10DamageByEventType$totalDamageDollars[1] /
totalDamage
ratioLeadingCauseDamage
## [1] 0.3724998
With an estimated 150 billion US dollars’ worth of damage, floods were the costliest events during the considered period, accounting for 37.2% of the total US dollar amount of damage.
The top 10 causes of property damage are:
top10PropertyDamageByEventType <- damageByEventType %>%
arrange(desc(propDamageDollars)) %>%
slice(1:10) %>%
select(cleanEventType, propDamageDollars)
top10PropertyDamageByEventType
## Source: local data frame [10 x 2]
##
## cleanEventType propDamageDollars
## (fctr) (dbl)
## 1 flood 144553098760
## 2 hurricane (typhoon) 81718889010
## 3 storm surge/tide 47834724000
## 4 tornado 24616945710
## 5 flash flood 15222268910
## 6 hail 14639572920
## 7 thunderstorm wind 7875179780
## 8 wildfire 7760449500
## 9 tropical storm 7642475550
## 10 high wind 5248378360
# calculate ratio of property damage due to leading cause compared to total
# property damage
ratioLeadingCausePropertyDamage <-
top10PropertyDamageByEventType$propDamageDollars[1] /
totalPropertyDamage
ratioLeadingCausePropertyDamage
## [1] 0.3941272
The leading cause of property damage, flooding, accounted for 145 billion US dollars’ worth of property damage (39.4% of overall property damage).
The top 10 causes of crop damage are:
top10CropDamageByEventType <- damageByEventType %>%
arrange(desc(cropDamageDollars)) %>%
slice(1:10) %>%
select(cleanEventType, cropDamageDollars)
top10CropDamageByEventType
## Source: local data frame [10 x 2]
##
## cleanEventType cropDamageDollars
## (fctr) (dbl)
## 1 drought 13367581000
## 2 hurricane (typhoon) 5350107800
## 3 flood 5013161500
## 4 hail 2561518700
## 5 frost/freeze 1384421000
## 6 cold/wind chill 1356765500
## 7 flash flood 1334901700
## 8 thunderstorm wind 952296350
## 9 heavy rain 728169800
## 10 tropical storm 677711000
# calculate ratio of crop damage due to leading cause compared to total
# crop damage
ratioLeadingCauseCropDamage <-
top10CropDamageByEventType$cropDamageDollars[1] /
totalCropDamage
ratioLeadingCauseCropDamage
## [1] 0.3846484
Drought was the main cause of crop damage, and was responsible for 13.4 billion US dollars’ worth of crop damage (38.5% of all crop damage).
Taking into account the complete data set, without cleaning the event types or restricting our analysis to the dates when all event types were taken into account (i.e. 1996 to 2011), would have produced somewhat different results.
We illustrate this by considering the number of casualties by event type in the original (raw) data set.
# by event type
rawCasualtiesByEventType <- rawStorm %>%
select(evtype, fatalities, injuries) %>%
group_by(evtype) %>%
mutate(casualties = fatalities + injuries) %>%
summarise_each(funs(sum))
# number of events that have caused casualties
rawNumEventsCasualties <- sum(rawCasualtiesByEventType$casualties != 0)
rawNumEventsCasualties
## [1] 220
Out of the 985 uncleaned event types in the raw data set, 220 have caused casualties. We will extract the 20 event types that caused the highest number of casualties (fatalities and injuries).
# extract top 20 casualty-causing event types
top20RawCasualtyEventTypes <- rawCasualtiesByEventType %>%
arrange(desc(casualties)) %>%
slice(1:20)
top20RawCasualtyEventTypes
## Source: local data frame [20 x 4]
##
## evtype fatalities injuries casualties
## (fctr) (dbl) (dbl) (dbl)
## 1 TORNADO 5633 91346 96979
## 2 EXCESSIVE HEAT 1903 6525 8428
## 3 TSTM WIND 504 6957 7461
## 4 FLOOD 470 6789 7259
## 5 LIGHTNING 816 5230 6046
## 6 HEAT 937 2100 3037
## 7 FLASH FLOOD 978 1777 2755
## 8 ICE STORM 89 1975 2064
## 9 THUNDERSTORM WIND 133 1488 1621
## 10 WINTER STORM 206 1321 1527
## 11 HIGH WIND 248 1137 1385
## 12 HAIL 15 1361 1376
## 13 HURRICANE/TYPHOON 64 1275 1339
## 14 HEAVY SNOW 127 1021 1148
## 15 WILDFIRE 75 911 986
## 16 THUNDERSTORM WINDS 64 908 972
## 17 BLIZZARD 101 805 906
## 18 FOG 62 734 796
## 19 RIP CURRENT 368 232 600
## 20 WILD/FOREST FIRE 12 545 557
Observing this top 20, we note that:
The thunderstorm wind event type appears under three different labels: TSTM WIND, THUNDERSTORM WIND and THUNDERSTORM WINDS, making this event type seem less significant than it actually was. Similarly, EXCESSIVE HEAT and HEAT are seen as different event types, as are WILDFIRE and WILD/FOREST FIRE.
The order of the causes of casualties (e.g. thunderstorm wind in third position) is different from the one found based on the cleaned data (where the events causing the third highest number of casualties are floods).
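The effect of the duplicated labels can be quantified from the top-20 table itself. The sketch below copies the relevant casualty totals from that table (raw data set, all years) and sums them under merged labels; the grouping vector is an assumption made for illustration and only covers these seven rows.

```r
# casualty totals copied from the top-20 table of the raw data set
rawCasualties <- c("TSTM WIND" = 7461, "THUNDERSTORM WIND" = 1621,
                   "THUNDERSTORM WINDS" = 972, "EXCESSIVE HEAT" = 8428,
                   "HEAT" = 3037, "WILDFIRE" = 986,
                   "WILD/FOREST FIRE" = 557)

# illustrative merged label for each raw label, in the same order
mergedLabel <- c("thunderstorm wind", "thunderstorm wind",
                 "thunderstorm wind", "excessive heat", "excessive heat",
                 "wildfire", "wildfire")

# sum the raw totals under each merged label
mergedTotals <- tapply(rawCasualties, mergedLabel, sum)
sort(mergedTotals, decreasing = TRUE)
```

Once merged, thunderstorm wind totals 10,054 casualties in the raw data set instead of the 7,461 shown for TSTM WIND alone, and heat-related events total 11,465 instead of 8,428.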
Furthermore, let us compare the ratio between casualties caused by tornadoes and the total number of casualties.
totalCasualtiesRaw <- sum(rawStorm$fatalities) + sum(rawStorm$injuries)
ratioTornadoCasualtiesRaw <- top20RawCasualtyEventTypes$casualties[1] /
totalCasualtiesRaw
ratioTornadoCasualtiesRaw
## [1] 0.6229661
ratioTornadoCasualties <- top10CasualtiesByEventType$casualties[1] /
totalCasualties
ratioTornadoCasualties
## [1] 0.3324689
62.3% of casualties were caused by tornadoes according to the raw data set, whereas this figure is 33.2% (about half as much) using the clean data set, thus clearly showing that the raw data set is strongly biased towards tornadoes.
In fact, if we extract the top three causes of death, tornadoes also come in at first place with the raw data set, well ahead of the second cause of death, whereas they are only the second cause of fatality in the clean data set.
# top 3 causes of fatality in raw data set
rawCasualtiesByEventType %>%
arrange(desc(fatalities)) %>%
slice(1:3)
## Source: local data frame [3 x 4]
##
## evtype fatalities injuries casualties
## (fctr) (dbl) (dbl) (dbl)
## 1 TORNADO 5633 91346 96979
## 2 EXCESSIVE HEAT 1903 6525 8428
## 3 FLASH FLOOD 978 1777 2755
# top 3 causes of fatality in clean data set
casualtiesByEventType %>%
arrange(desc(fatalities)) %>%
slice(1:3)
## Source: local data frame [3 x 4]
##
## cleanEventType fatalities injuries casualties
## (fctr) (dbl) (dbl) (dbl)
## 1 excessive heat 2037 7615 9652
## 2 tornado 1511 20667 22178
## 3 flash flood 887 1674 2561
This report was produced using the following software versions:
sessionInfo()
## R version 3.2.2 (2015-08-14)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 8 x64 (build 9200)
##
## locale:
## [1] LC_COLLATE=French_France.1252 LC_CTYPE=French_France.1252
## [3] LC_MONETARY=French_France.1252 LC_NUMERIC=C
## [5] LC_TIME=French_France.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] lubridate_1.5.0 tidyr_0.3.1 ggplot2_2.0.0 dplyr_0.4.3
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.3 knitr_1.12.3 magrittr_1.5 munsell_0.4.2
## [5] colorspace_1.2-6 R6_2.1.1 stringr_1.0.0 plyr_1.8.3
## [9] tools_3.2.2 parallel_3.2.2 grid_3.2.2 gtable_0.1.2
## [13] DBI_0.3.1 htmltools_0.3 yaml_2.1.13 lazyeval_0.1.10
## [17] assertthat_0.1 digest_0.6.9 formatR_1.2.1 evaluate_0.8
## [21] rmarkdown_0.9.2 labeling_0.3 stringi_1.0-1 scales_0.3.0