This data analysis looks at storm data collected by the US Weather Service between 1950 and 2011, which contains over 900,000 records of extreme weather events, their location, fatalities and injuries and economic damage estimations in the US over that period. This data is available at the following url: https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2
The goal of the analysis is to see which types of events cause the most damage. The data was downloaded and events were categorized into a reasonable number (10-15) of distinct categories in order to look at the types of events that caused the most death and injuries, and the types of events that had the most economic impact.
Our results show that in terms of human health, tornadoes, hurricanes and wind related events have by far the largest amount of casualties, followed by heatwaves, floods, and cold related events. In terms of economic impact, floods, hurricanes, tornadoes, and other marine events such as coastal surges represent a huge share of the economic impact of extreme weather events. Most of the economic impact comes from a few extreme events which have a disproportional cost, the most expensive event representing more than 20% of total costs registered by itself.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
fname<-"repdata_data_StormData.csv.bz2"
if(!file.exists(fname))
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
"repdata_data_StormData.csv.bz2")
stormData <- read.csv(fname)
dim(stormData)
## [1] 902297 37
names(stormData)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
Our analysis deals with types of events, not with individual events per se. Therefore, all of the information regarding the geographical location of the events or the time at which the event occurs is not really relevant to this analysis and can be removed.
stormtbl <- select(stormData, EVTYPE, FATALITIES, INJURIES, PROPDMG:CROPDMGEXP)
stormtbl <- as_tibble(stormtbl)
stormtbl
## # A tibble: 902,297 x 7
## EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## <fct> <dbl> <dbl> <dbl> <fct> <dbl> <fct>
## 1 TORNADO 0 15 25 K 0 ""
## 2 TORNADO 0 0 2.5 K 0 ""
## 3 TORNADO 0 2 25 K 0 ""
## 4 TORNADO 0 2 2.5 K 0 ""
## 5 TORNADO 0 2 2.5 K 0 ""
## 6 TORNADO 0 6 2.5 K 0 ""
## 7 TORNADO 0 1 2.5 K 0 ""
## 8 TORNADO 0 0 2.5 K 0 ""
## 9 TORNADO 1 14 25 K 0 ""
## 10 TORNADO 0 0 25 K 0 ""
## # ... with 902,287 more rows
The codebook tells us that economic damage should be expressed as a number, followed by a letter in the next column, with K, M and B representing a thousand, million and billion. However, we can see that this is not always the case:
table(stormtbl$PROPDMGEXP)
##
## - ? + 0 1 2 3 4 5 6
## 465934 1 8 5 216 25 13 4 4 28 4
## 7 8 B h H K m M
## 5 1 40 1 6 424665 7 11330
table(stormtbl$CROPDMGEXP)
##
## ? 0 2 B k K m M
## 618413 7 19 1 9 21 281832 1 1994
We’ll assume that lower case k and m can safely be converted to upper case. Since other values represent at most a few hundred cases out of hundreds of thousands, we’ll remove them (counting the multiplier as 1 so that it is neutral). Blanks will be counted as 1, so that if there is a number next to it, it won’t be lost in the calculation. Then we calculate the total economic impact of each event.
stormtbl$cropmult <- rep(1,902297)
stormtbl$propmult <- rep(1,902297)
stormtbl[stormtbl$PROPDMGEXP %in% c("K", "k"),]$propmult <-1000
stormtbl[stormtbl$PROPDMGEXP %in% c("m", "M"),]$propmult<- 1000000
stormtbl[stormtbl$PROPDMGEXP=="B",]$propmult <- 1000000000
stormtbl[stormtbl$CROPDMGEXP %in% c("K", "k"),]$cropmult <-1000
stormtbl[stormtbl$CROPDMGEXP %in% c("m", "M"),]$cropmult<- 1000000
stormtbl[stormtbl$CROPDMGEXP=="B",]$cropmult <- 1000000000
stormtbl <-mutate(stormtbl, property_damage = PROPDMG * propmult, crop_damage = CROPDMG*cropmult, total_economic_impact = property_damage + crop_damage)
stormclean <- select(stormtbl, -(PROPDMG:propmult))
stormclean
## # A tibble: 902,297 x 6
## EVTYPE FATALITIES INJURIES property_damage crop_damage total_economic_impact
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 TORNADO 0 15 25000 0 25000
## 2 TORNADO 0 0 2500 0 2500
## 3 TORNADO 0 2 25000 0 25000
## 4 TORNADO 0 2 2500 0 2500
## 5 TORNADO 0 2 2500 0 2500
## 6 TORNADO 0 6 2500 0 2500
## 7 TORNADO 0 1 2500 0 2500
## 8 TORNADO 0 0 2500 0 2500
## 9 TORNADO 1 14 25000 0 25000
## 10 TORNADO 0 0 25000 0 25000
## # ... with 902,287 more rows
Now that we have a much smaller table with only variables we’re interested in, we’re going to need to look at the contents of EVTYPE.
length(unique(stormtbl$EVTYPE))
## [1] 985
As we can see EVTYPE contains too many unique names to sort through them right now, so we’ll start by looking at the events which caused the most deaths and categorize these, then apply our categories in general.
mostfatal <- stormtbl %>% filter(FATALITIES>0) %>% arrange(desc(FATALITIES))
length(unique(mostfatal$EVTYPE))
## [1] 168
Now that we have just 168, we look at the individual categories to find patterns.
unique(mostfatal$EVTYPE)
## [1] HEAT TORNADO
## [3] EXCESSIVE HEAT EXTREME HEAT
## [5] HEAT WAVE TSUNAMI
## [7] UNSEASONABLY WARM AND DRY TORNADOES, TSTM WIND, HAIL
## [9] TROPICAL STORM FLASH FLOOD
## [11] HEAVY RAIN RECORD/EXCESSIVE HEAT
## [13] FLOOD HURRICANE/TYPHOON
## [15] COLD AND SNOW WILDFIRE
## [17] LANDSLIDE HURRICANE
## [19] TSTM WIND FOG
## [21] STORM SURGE/TIDE DUST STORM
## [23] WINTER STORMS STORM SURGE
## [25] HIGH SURF UNSEASONABLY WARM
## [27] EXTREME COLD/WIND CHILL TROPICAL STORM GORDON
## [29] BLIZZARD THUNDERSTORM WINDS
## [31] HEAVY SNOW HIGH WIND
## [33] EXTREME COLD THUNDERSTORM WIND
## [35] MARINE MISHAP FLASH FLOOD/FLOOD
## [37] ICE STORM AVALANCHE
## [39] RIP CURRENT WINTER WEATHER/MIX
## [41] FLOOD/FLASH FLOOD HEAT WAVES
## [43] GLAZE LIGHTNING
## [45] WINTER STORM STRONG WIND
## [47] MARINE TSTM WIND HEAVY SURF/HIGH SURF
## [49] COLD/WIND CHILL HAIL
## [51] HIGH WIND/SEAS RIP CURRENTS/HEAVY SURF
## [53] FLASH FLOODING HEAT WAVE DROUGHT
## [55] Mudslide WILD/FOREST FIRE
## [57] ROUGH SEAS WILD FIRES
## [59] HURRICANE ERIN WIND
## [61] SNOW AND ICE WATERSPOUT/TORNADO
## [63] COLD WAVE HIGH WINDS/SNOW
## [65] FLASH FLOODING/FLOOD HIGH WIND AND SEAS
## [67] RAIN/SNOW Heavy surf and wind
## [69] RIP CURRENTS EXTREME WINDCHILL
## [71] URBAN/SML STREAM FLD HURRICANE OPAL/HIGH WINDS
## [73] FLOODING SLEET
## [75] HIGH WINDS DENSE FOG
## [77] ICE COLD
## [79] HEAVY SNOW AND HIGH WINDS SNOW
## [81] FREEZING RAIN UNSEASONABLY COLD
## [83] EXCESSIVE RAINFALL DROUGHT/EXCESSIVE HEAT
## [85] HEAVY SEAS ROUGH SURF
## [87] Cold Coastal Flooding
## [89] ICY ROADS Hypothermia/Exposure
## [91] SNOW SQUALL RECORD HEAT
## [93] WATERSPOUT HIGH WATER
## [95] COLD WEATHER WINTER WEATHER
## [97] MARINE STRONG WIND COASTAL FLOOD
## [99] MARINE THUNDERSTORM WIND HIGH SEAS
## [101] GUSTY WINDS WINTER STORM HIGH WINDS
## [103] STRONG WINDS HURRICANE OPAL
## [105] FREEZING RAIN/SNOW THUNDERSNOW
## [107] WIND STORM URBAN AND SMALL STREAM FLOODIN
## [109] FREEZE LIGHTNING.
## [111] THUNDERTORM WINDS FLOOD/RIVER FLOOD
## [113] RIVER FLOOD AVALANCE
## [115] FOG AND COLD TEMPERATURES HEAVY SURF
## [117] HURRICANE FELIX BLOWING SNOW
## [119] RAIN/WIND THUNDERSTORM WIND G52
## [121] HIGH WAVES LOW TEMPERATURE
## [123] HYPOTHERMIA COLD/WINDS
## [125] RECORD COLD SNOW/ BITTER COLD
## [127] RAPIDLY RISING WATER RIVER FLOODING
## [129] MINOR FLOODING FLOOD & HEAVY RAIN
## [131] FLASH FLOODS Marine Accident
## [133] Heavy Surf DRY MICROBURST
## [135] WINDS COASTAL STORM
## [137] High Surf Extreme Cold
## [139] Strong Winds Extended Cold
## [141] Whirlwind MIXED PRECIP
## [143] Freezing Spray Mudslides
## [145] Cold Temperature HYPOTHERMIA/EXPOSURE
## [147] BLACK ICE COASTALSTORM
## [149] blowing snow FREEZING DRIZZLE
## [151] FROST Snow Squalls
## [153] LANDSLIDES HIGH SWELLS
## [155] TSTM WIND/HAIL TSTM WIND (G35)
## [157] HYPERTHERMIA/EXPOSURE GUSTY WIND
## [159] WINTRY MIX THUNDERSTORM WIND (G40)
## [161] LIGHT SNOW THUNDERSTORM
## [163] FALLING SNOW/ICE COASTAL FLOODING
## [165] ICE ON ROAD DROWNING
## [167] DUST DEVIL MARINE HIGH WIND
## 985 Levels: HIGH SURF ADVISORY COASTAL FLOOD FLASH FLOOD ... WND
We see that there are a lot of repetitions of terms like heat, tornado, hurricane, flood, etc… From our most fatal events, we can create the following categories:
- Tornado / Wind
- Hurricane / Tropical storm
- Heat events
- Floods
- Cold/ Snow / Ice
- Marine events
- Fire
- Lightning
- Land/Mudslides
- Fog
- Rain
tornadocat <- grep("tornad|wind", stormtbl$EVTYPE, ignore.case = TRUE)
heatcat <- grep("heat|hot|warm|hyperthermia", stormtbl$EVTYPE, ignore.case = TRUE)
floodcat<- grep("flood|fld|high water|rising water", stormtbl$EVTYPE, ignore.case = TRUE)
hurricanecat <- grep("hurricane|typhoon|tropical storm", stormtbl$EVTYPE, ignore.case = TRUE)
coldcat<- grep("snow|ice|icy|sleet|freez|cold|frost|glaze|low temp|avalan|winter|wintry|hypothermia|blizzard", stormtbl$EVTYPE, ignore.case = TRUE)
seacat <- grep("current|surf|sea|marine|coastal|wave|swell|surge|tsunami", stormtbl$EVTYPE, ignore.case = TRUE)
firecat <- grep("fire", stormtbl$EVTYPE, ignore.case = TRUE)
lightningcat<- grep("lightning|thunder", stormtbl$EVTYPE, ignore.case = TRUE)
slidecat <- grep("landslide|mudslide", stormtbl$EVTYPE, ignore.case = TRUE)
raincat <- grep("rain|precip|waterspout", stormtbl$EVTYPE, ignore.case = TRUE)
fogcat <- grep("fog", stormtbl$EVTYPE, ignore.case = TRUE)
Now let’s check that our categories gather most of the data.
stormclean$category <- rep(NA, 902297)
stormclean$category[tornadocat] <- "Tornado or Wind"
stormclean$category[heatcat] <- "Heat related"
stormclean$category[floodcat] <- "Flood related"
stormclean$category[hurricanecat] <- "Hurricane"
stormclean$category[coldcat] <- "Cold related"
stormclean$category[seacat] <- "Sea related"
stormclean$category[firecat] <- "Fire related"
stormclean$category[lightningcat] <- "Lightning related"
stormclean$category[slidecat] <- "Land and mud slides"
stormclean$category[raincat] <- "Rain related"
stormclean$category[fogcat] <- "Fog related"
uncategorized <-which(is.na(stormclean$category))
totaldead <- sum(stormclean$FATALITIES)
totalcost<- sum(stormclean$total_economic_impact)
sum(stormclean[uncategorized,]$FATALITIES) / totaldead
## [1] 0.002839221
sum(stormclean[uncategorized,]$total_economic_impact) / totalcost
## [1] 0.0718294
We’ve captured over 99% of fatalities with our categories, however, 7% of the economic impact remains uncategorized. Let’s see if we need to add categories by sorting the uncategorized data by economic impact.
othercats <- stormclean[uncategorized,] %>% arrange(desc(total_economic_impact))
head(othercats)
## # A tibble: 6 x 7
## EVTYPE FATALITIES INJURIES property_damage crop_damage total_economic_~
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 HAIL 0 1 1800000000 0 1800000000
## 2 DROUG~ 0 0 0 1000000000 1000000000
## 3 HAIL 0 0 900000000 0 900000000
## 4 DROUG~ 0 0 645150000 0 645150000
## 5 DROUG~ 0 0 0 578850000 578850000
## 6 DROUG~ 0 0 0 515000000 515000000
## # ... with 1 more variable: category <chr>
We can see drought and hail events which have no fatalities but major economic impact, in the billions or hundreds of millions. We add these two categories to our table.
droughtcat <- grep("drought", stormtbl$EVTYPE, ignore.case = TRUE)
hailcat <- grep("hail", stormtbl$EVTYPE, ignore.case=TRUE)
stormclean$category[droughtcat] <- "Drought"
stormclean$category[hailcat] <- "Hail"
uncategorized <-which(is.na(stormclean$category))
sum(stormclean[uncategorized,]$total_economic_impact) / totalcost
## [1] 0.000381211
We now have over 99% of our economic damage categorized, so our categories cover virtually all of the relevant data.
To present the results in graphs, we’ll need to summarize the data by category.
Here we create one table for the most deadly events, and one for those with the most economic impact. We also include a column for rank, which will be useful for graphing, and a column for cumulative share, in order to avoid putting too many unrelevant categories in our graphs.
mostdeadly <- stormclean %>% group_by(category) %>%
summarize(totaldeaths=sum(FATALITIES), totalinjury = sum(INJURIES)) %>%
arrange(desc(totaldeaths), desc(totalinjury))
r <- nrow(mostdeadly)
mostdeadly<- mutate(mostdeadly, rank=1:r, cumulative_share_deaths = cumsum(totaldeaths)/totaldead)
mostdeadly
## # A tibble: 14 x 5
## category totaldeaths totalinjury rank cumulative_share_deaths
## <chr> <dbl> <dbl> <int> <dbl>
## 1 Tornado or Wind 6580 100179 1 0.434
## 2 Heat related 2956 8832 2 0.630
## 3 Flood related 1550 8674 3 0.732
## 4 Cold related 1329 6844 4 0.820
## 5 Sea related 1084 1427 5 0.891
## 6 Lightning related 1029 7710 6 0.959
## 7 Hurricane 201 1714 7 0.973
## 8 Rain related 122 403 8 0.981
## 9 Fire related 90 1608 9 0.987
## 10 Fog related 81 1077 10 0.992
## 11 Hail 45 1467 11 0.995
## 12 Land and mud slides 44 55 12 0.998
## 13 <NA> 28 519 13 1.00
## 14 Drought 6 19 14 1
mostcostly <- stormclean %>% group_by(category) %>%
summarize(total.property.damage = sum(property_damage), total.crop.damage= sum(crop_damage), total.economic.cost= sum(total_economic_impact)) %>%
arrange(desc(total.economic.cost))
s<- nrow(mostcostly)
mostcostly<- mutate(mostcostly, rank= 1:s, cumulative_share = cumsum(total.economic.cost)/totalcost)
mostcostly
## # A tibble: 14 x 6
## category total.property.~ total.crop.dama~ total.economic.~ rank
## <chr> <dbl> <dbl> <dbl> <int>
## 1 Flood r~ 167127574520. 12270338200 179397912720. 1
## 2 Hurrica~ 93070725560 6211013800 99281739360 2
## 3 Tornado~ 67505857757. 1736789220 69242646977. 3
## 4 Sea rel~ 48663842950 36633500 48700476450 4
## 5 Cold re~ 12698320263. 8704985450 21403305713. 5
## 6 Hail 17619991072. 3114212873 20734203945. 6
## 7 Drought 1046306000 13972621780 15018927780 7
## 8 Fire re~ 8496628500 403281630 8899910130 8
## 9 Lightni~ 7365931906. 664968478 8030900384. 9
## 10 Rain re~ 3332250892 919315800 4251566692 10
## 11 Heat re~ 9665700 898863500 908529200 11
## 12 Land an~ 326077000 20017000 346094000 12
## 13 <NA> 30466680 151150950 181617630 13
## 14 Fog rel~ 25011500 0 25011500 14
## # ... with 1 more variable: cumulative_share <dbl>
par(mar=c(8,4,4,1))
with(mostdeadly[1:10,], plot(totaldeaths, type="h", lwd =5,
main="Events with the most impact on human health", xaxt="n", xlab="",
ylim=c(0,10000),ylab ="total number of casualties"))
lines(mostdeadly$totalinjury, lwd=3, lty=2, col="red")
legend("topright", lwd=c(5,3), lty=c(1,2), col=c("black", "red"), legend=c("deaths", "injuries"))
text(x=1:10, y=par()$usr[3]-0.1*(par()$usr[4]-par()$usr[3]),
labels=mostdeadly[1:10,]$category, srt=45, adj=1, xpd=TRUE)
We can see that tornadoes and wind related events have by far the highest impact on human health, followed by heat, floods and cold related events. While they cause the most deaths, the number of injuries caused by tornadoes is also off the charts, more than ten times the number caused by any other type of event. Lightning is a secondary cause of death but causes many injuries.
par(mar=c(8,4,4,1))
with(mostcostly[1:10,], plot(total.economic.cost/1000000000, type="h", lwd =5,
main="Events with the most harmful economic impact ", xaxt="n", xlab="",
ylab ="total economic cost (in B$)"))
lines(mostcostly$total.property.damage/1000000000, col="red")
lines(mostcostly$total.crop.damage/1000000000, col="green")
text(x=1:10, y=par()$usr[3]-0.1*(par()$usr[4]-par()$usr[3]),
labels=mostcostly[1:10,]$category, srt=45, adj=1, xpd=TRUE)
legend("topright", lwd=1, col=c("red", "green"), legend=c("property damage", "crop damage"))
We can see that floods cause a big share of the economic damage, followed by hurricanes and tornadoes or wind. Other marine events, such as tides and surges, also cause significant damage, and cold related events come in fifth place. Hail and drought, both of which are negligible in terms of their impact on human health, have also caused several billions of dollars of damage. Most damage is property damage, only droughts have more economic impact due to crop losses.
While exploring the data, I noticed that particularly for economic damages, a few individual events seemed to have a huge share of the costs. Does this also apply to human health?
max(stormclean$FATALITIES) /totaldead
## [1] 0.03849455
max(stormclean$total_economic_impact) / totalcost
## [1] 0.2414504
fatal_individual <- arrange(stormclean, desc(FATALITIES))
costly_individual <- arrange(stormclean, desc(total_economic_impact))
sum(fatal_individual[1:100,]$FATALITIES) / totaldead
## [1] 0.2408055
sum(costly_individual[1:100,]$total_economic_impact) / totalcost
## [1] 0.67708
This is an interesting result: a few extreme individual events represent the majority (67%) of the economic damages, but not the majority (24%) of deaths. The worst event in terms of economic consequences caused 24% of all economic impacts recorded, but the worst event in terms of death caused just 3% of total deaths.
This can probably be explained by the fact that people can be evacuated when an event is predicted, but property is more difficult to move.