This report analyzes data from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database to determine which types of weather events are most harmful to human health and which have the greatest economic impact. It concludes that tornadoes, heat, and flooding are the most harmful to human health, and that hurricanes, tornadoes, and flooding are the most costly.
Data for this analysis is retrieved from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
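# create the target directory if it does not exist, then download the compressed data
dir.create("Data", showWarnings = FALSE)
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2","Data/StormData.csv.bz2",method="curl")
# read text columns as factors; the exponent recoding below relies on levels()
sd <- read.csv(bzfile("Data/StormData.csv.bz2"), stringsAsFactors = TRUE)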
# make a copy of storm data containing only the fields we will work with in the analysis
pd <- data.frame(sd[,c("EVTYPE","FATALITIES", "INJURIES", "PROPDMG","PROPDMGEXP","CROPDMG","CROPDMGEXP")])
# the original data frame could be discarded here to save some memory;
# it is kept because the supplementary analysis below still uses sd
# sd <- NULL
Economic damage is recorded as a base value plus a coded exponent (for example, K for thousands, M for millions, B for billions). We convert each pair to a single numeric dollar amount so that the impact of events can be compared directly. We then sum property and crop damage as a measure of total economic impact, and sum fatalities and injuries as a measure of total health impact.
transform_exponents <- function(level)
{
  level[level==""] <- "0"
  level[level=="-"] <- "0"
  level[level=="?"] <- "0"
  level[level=="+"] <- "0"
  level[level=="B"] <- "9"
  level[level=="h"] <- "2"
  level[level=="H"] <- "2"
  level[level=="K"] <- "3"
  level[level=="k"] <- "3"
  level[level=="m"] <- "6"
  level[level=="M"] <- "6"
  level
}
# recode all the exponent levels as numeric strings (e.g. "K" becomes "3", blank becomes "0")
levels(pd$PROPDMGEXP) <- transform_exponents(levels(pd$PROPDMGEXP))
levels(pd$CROPDMGEXP) <- transform_exponents(levels(pd$CROPDMGEXP))
# convert the exponent to a number
pd$PROPDMGEXP <- as.numeric(levels(pd$PROPDMGEXP))[pd$PROPDMGEXP]
# Transform the damage by multiplying it by the specified power. We can then work with the PROPDMG directly
pd$PROPDMG <- pd$PROPDMG*10^pd$PROPDMGEXP
# get a numeric crop damage value also
pd$CROPDMGEXP <- as.numeric(levels(pd$CROPDMGEXP))[pd$CROPDMGEXP]
pd$CROPDMG <- pd$CROPDMG*10^pd$CROPDMGEXP
# create a total damage variable
pd$TOTDMG <- pd$PROPDMG + pd$CROPDMG
summary(pd$TOTDMG)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00e+00 0.00e+00 0.00e+00 5.29e+05 1.00e+03 1.15e+11
# create a total casualties variable (fatalities + injuries)
pd$TOTCAS <- pd$FATALITIES + pd$INJURIES
summary(pd$TOTCAS)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1725 0.0000 1742.0000
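The decoding step above can also be sketched with a named lookup vector instead of recoding factor levels. This is only an illustrative alternative, not the method used in this report: the `toy` data frame, the `exp_lookup` vector, and the `decode_exponent` helper are hypothetical names, and the values are made up rather than taken from the NOAA data.

```r
# Hypothetical toy records; values are illustrative, not from the NOAA data
toy <- data.frame(
  PROPDMG    = c(25, 2.5, 1.2, 10),
  PROPDMGEXP = c("K", "M", "B", ""),
  stringsAsFactors = FALSE
)

# Power of ten for each letter code (same mapping as transform_exponents)
exp_lookup <- c(H = 2, K = 3, M = 6, B = 9)

# Hypothetical helper: letter codes use the lookup, single digits are used
# as-is, and anything else (blank, "-", "?", "+") falls back to 0
decode_exponent <- function(code) {
  code  <- toupper(as.character(code))
  power <- unname(exp_lookup[code])
  digit <- grepl("^[0-9]$", code)
  power[digit] <- as.numeric(code[digit])
  power[is.na(power)] <- 0
  power
}

# Damage in dollars = base value * 10^exponent
toy$PROPDMG_USD <- toy$PROPDMG * 10^decode_exponent(toy$PROPDMGEXP)
toy
```

Working on character codes rather than factor levels means the sketch does not depend on how `read.csv` handles strings, at the cost of recomputing the mapping for every row.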
There are over 900 types of events in the database, most of which contribute little to the question of which types of atmospheric events have the greatest health and economic impact. We will analyze only the top 10 types of events for health impact and the top 10 types of events for economic impact.
library(dplyr)
# summarize the total casualties and economic damage by event type
pdsumcas <- pd %>% group_by(EVTYPE) %>% summarize(SUMTOTCAS=sum(TOTCAS)) %>% arrange(desc(SUMTOTCAS))
pdsumdmg <- pd %>% group_by(EVTYPE) %>% summarize(SUMTOTDMG=sum(TOTDMG)) %>% arrange(desc(SUMTOTDMG))
# get the top 10 events for casualties and economic damage
pdsumcas10 <- pdsumcas[1:10,]
pdsumdmg10 <- pdsumdmg[1:10,]
# order the factors so that the plots come out ordered on the event type magnitude
pdsumcas10$EVTYPE <- factor(pdsumcas10$EVTYPE, levels = pdsumcas10$EVTYPE, ordered=TRUE)
pdsumdmg10$EVTYPE <- factor(pdsumdmg10$EVTYPE, levels = pdsumdmg10$EVTYPE, ordered=TRUE)
Now that we have the data summarized by top 10 total damages and casualties we can construct plots of the results.
# Create bar plots for casualties and damage
library(ggplot2)
ggplot(pdsumcas10, aes(EVTYPE, y=SUMTOTCAS)) + geom_bar(stat="identity") + coord_flip() +
labs(title="Top 10 Event Types Causing Injuries and Fatalities") +
labs(y="Sum of Injuries and Fatalities") +
labs(x="Event Type")
The chart demonstrates that tornadoes are the most dangerous atmospheric events in the United States by far. The next most dangerous events are heat and flooding.
ggplot(pdsumdmg10, aes(EVTYPE, y=SUMTOTDMG)) + geom_bar(stat="identity") + coord_flip() +
labs(title="Top 10 Event Types Causing Economic Damage") +
labs(y="Sum of Property and Crop Damages $") +
labs(x="Event Type")
The chart displays floods as the most economically damaging weather event, but this is likely due to a single erroneous data entry (see the supplementary analysis). This report therefore concludes that hurricanes and typhoons are the most damaging events. This is particularly true if we sum the “HURRICANE/TYPHOON” category with the related categories “STORM SURGE” and “HURRICANE.” The next most damaging events are tornadoes and floods (once the erroneous entry is corrected).
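Summing the related hurricane categories can be sketched as follows. This is a minimal illustration in base R, not part of the original analysis: the `toy` totals are hypothetical dollar amounts, and the `hurricane_types` grouping is an assumption based on the three categories named above.

```r
# Hypothetical per-event-type totals in dollars; illustrative values only
toy <- data.frame(
  EVTYPE    = c("HURRICANE/TYPHOON", "STORM SURGE", "HURRICANE", "TORNADO", "FLOOD"),
  SUMTOTDMG = c(70e9, 40e9, 15e9, 55e9, 35e9),
  stringsAsFactors = FALSE
)

# Assumed grouping: the three categories named in the text
hurricane_types <- c("HURRICANE/TYPHOON", "STORM SURGE", "HURRICANE")

# Relabel the related categories, then re-total and re-rank
toy$GROUP <- ifelse(toy$EVTYPE %in% hurricane_types,
                    "HURRICANE (combined)", toy$EVTYPE)
combined <- aggregate(SUMTOTDMG ~ GROUP, data = toy, FUN = sum)
combined <- combined[order(-combined$SUMTOTDMG), ]
combined
```

The same relabeling could be applied to `pd$EVTYPE` before the `group_by` step, so that the top 10 ranking reflects the combined category.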
# Let's check and see how the exponents are encoded
summary(sd$PROPDMGEXP)
## - ? + 0 1 2 3 4 5
## 465934 1 8 5 216 25 13 4 4 28
## 6 7 8 B h H K m M
## 4 5 1 40 1 6 424665 7 11330
summary(sd$CROPDMGEXP)
## ? 0 2 B k K m M
## 618413 7 19 1 9 21 281832 1 1994
# Check the values in the storm types
summary(sd$EVTYPE)
## HAIL TSTM WIND THUNDERSTORM WIND
## 288661 219940 82563
## TORNADO FLASH FLOOD FLOOD
## 60652 54277 25326
## THUNDERSTORM WINDS HIGH WIND LIGHTNING
## 20843 20212 15754
## HEAVY SNOW HEAVY RAIN WINTER STORM
## 15708 11723 11433
## WINTER WEATHER FUNNEL CLOUD MARINE TSTM WIND
## 7026 6839 6175
## MARINE THUNDERSTORM WIND WATERSPOUT STRONG WIND
## 5812 3796 3566
## URBAN/SML STREAM FLD WILDFIRE BLIZZARD
## 3392 2761 2719
## DROUGHT ICE STORM EXCESSIVE HEAT
## 2488 2006 1678
## HIGH WINDS WILD/FOREST FIRE FROST/FREEZE
## 1533 1457 1342
## DENSE FOG WINTER WEATHER/MIX TSTM WIND/HAIL
## 1293 1104 1028
## EXTREME COLD/WIND CHILL HEAT HIGH SURF
## 1002 767 725
## TROPICAL STORM FLASH FLOODING EXTREME COLD
## 690 682 655
## COASTAL FLOOD LAKE-EFFECT SNOW FLOOD/FLASH FLOOD
## 650 636 624
## LANDSLIDE SNOW COLD/WIND CHILL
## 600 587 539
## FOG RIP CURRENT MARINE HAIL
## 538 470 442
## DUST STORM AVALANCHE WIND
## 427 386 340
## RIP CURRENTS STORM SURGE FREEZING RAIN
## 304 261 250
## URBAN FLOOD HEAVY SURF/HIGH SURF EXTREME WINDCHILL
## 249 228 204
## STRONG WINDS DRY MICROBURST ASTRONOMICAL LOW TIDE
## 196 186 174
## HURRICANE RIVER FLOOD LIGHT SNOW
## 174 173 154
## STORM SURGE/TIDE RECORD WARMTH COASTAL FLOODING
## 148 146 143
## DUST DEVIL MARINE HIGH WIND UNSEASONABLY WARM
## 141 135 126
## FLOODING ASTRONOMICAL HIGH TIDE MODERATE SNOWFALL
## 120 103 101
## URBAN FLOODING WINTRY MIX HURRICANE/TYPHOON
## 98 90 88
## FUNNEL CLOUDS HEAVY SURF RECORD HEAT
## 87 84 81
## FREEZE HEAT WAVE COLD
## 74 74 72
## RECORD COLD ICE THUNDERSTORM WINDS HAIL
## 64 61 61
## TROPICAL DEPRESSION SLEET UNSEASONABLY DRY
## 60 59 56
## FROST GUSTY WINDS THUNDERSTORM WINDSS
## 53 53 51
## MARINE STRONG WIND OTHER SMALL HAIL
## 48 48 47
## FUNNEL FREEZING FOG THUNDERSTORM
## 46 45 45
## Temperature record TSTM WIND (G45) Coastal Flooding
## 43 39 38
## WATERSPOUTS MONTHLY PRECIPITATION WINDS
## 37 36 36
## (Other)
## 2940
# check the flood record that is recorded as $115B in damages to see if it is an outlier
pd[pd$PROPDMG==max(pd$PROPDMG),]
## EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 605953 FLOOD 0 0 1.15e+11 9 32500000 6
## TOTDMG TOTCAS
## 605953 115032500000 0
sd[605953,]
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME
## 605953 6 1/1/2006 0:00:00 12:00:00 AM PST 55 NAPA
## STATE EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE
## 605953 CA FLOOD 0 COUNTYWIDE 1/1/2006 0:00:00
## END_TIME COUNTY_END COUNTYENDN END_RANGE END_AZI END_LOCATI
## 605953 07:00:00 AM 0 NA 0 COUNTYWIDE
## LENGTH WIDTH F MAG FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG
## 605953 0 0 NA 0 0 0 115 B 32.5
## CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE LONGITUDE
## 605953 M MTR CALIFORNIA, Western 3828 12218
## LATITUDE_E LONGITUDE_
## 605953 3828 12218
## REMARKS
## 605953 Major flooding continued into the early hours of January 1st, before the Napa River finally fell below flood stage and the water receeded. Flooding was severe in Downtown Napa from the Napa Creek and the City and Parks Department was hit with $6 million in damage alone. The City of Napa had 600 homes with moderate damage, 150 damaged businesses with costs of at least $70 million.
## REFNUM
## 605953 605943
The remarks for record 605953 cite one component of the damage as $70M. Most likely the total damage is $115M, and the PROPDMGEXP value was mistakenly entered as B (billions) when it should be M (millions).
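The report does not show the correction itself. A minimal sketch of how it might be applied is given below, under the assumption that the exponent is fixed before the damage values are decoded. The `toy` data frame is hypothetical, with row 1 standing in for record 605953, and the other values are made up for illustration.

```r
# Hypothetical toy records; row 1 stands in for the suspect flood entry
toy <- data.frame(
  EVTYPE     = c("FLOOD", "FLOOD", "TORNADO"),
  PROPDMG    = c(115, 50, 2),
  PROPDMGEXP = c("B", "M", "B"),
  stringsAsFactors = FALSE
)

# Assumed correction: treat the suspect exponent as millions, not billions
suspect <- 1
toy$PROPDMGEXP[suspect] <- "M"

# Decode the damage and re-total by event type
power <- unname(c(M = 6, B = 9)[toy$PROPDMGEXP])
toy$PROPDMG_USD <- toy$PROPDMG * 10^power
totals <- sort(tapply(toy$PROPDMG_USD, toy$EVTYPE, sum), decreasing = TRUE)
totals
```

In the report's own pipeline the exponent column has already been converted to numbers by the time the outlier is found, so an equivalent fix there would be to set the numeric exponent of that row from 9 to 6 and recompute PROPDMG and TOTDMG.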
# check that we don't have an outlier driving the TORNADO event casualties
max(pd$TOTCAS)
## [1] 1742
Total tornado casualties exceed 90,000, while the largest single entry in the data set accounts for 1,742 casualties. The tornado total is therefore driven by many entries rather than by a single outlier.
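The check above takes the maximum over all event types. A tornado-specific version, which measures how much of the tornado total comes from its single largest record, might look like the sketch below. The `toy` data frame and its values are hypothetical and are not results from the NOAA data.

```r
# Hypothetical toy records; values are illustrative only
toy <- data.frame(
  EVTYPE = c("TORNADO", "TORNADO", "TORNADO", "HEAT", "FLOOD"),
  TOTCAS = c(40, 35, 25, 80, 20),
  stringsAsFactors = FALSE
)

# Casualties for tornado records only
tornado <- toy$TOTCAS[toy$EVTYPE == "TORNADO"]

# Share of the tornado total contributed by its largest single record;
# a value near 1 would mean one record dominates the total
largest_share <- max(tornado) / sum(tornado)
largest_share
```

Applied to `pd`, a small share would support the conclusion that no single record drives the tornado total.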
sessionInfo()
## R version 3.1.2 (2014-10-31)
## Platform: x86_64-apple-darwin10.8.0 (64-bit)
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggplot2_1.0.1 dplyr_0.4.1
##
## loaded via a namespace (and not attached):
## [1] assertthat_0.1 codetools_0.2-11 colorspace_1.2-6 DBI_0.3.1
## [5] digest_0.6.8 evaluate_0.7 formatR_1.2 grid_3.1.2
## [9] gtable_0.1.2 htmltools_0.2.6 knitr_1.10 labeling_0.3
## [13] lazyeval_0.1.10 magrittr_1.5 MASS_7.3-40 munsell_0.4.2
## [17] parallel_3.1.2 plyr_1.8.2 proto_0.3-10 Rcpp_0.11.5
## [21] reshape2_1.4.1 rmarkdown_0.5.1 scales_0.2.4 stringr_0.6.2
## [25] tools_3.1.2 yaml_2.1.13