This report is produced in partial fulfillment of the requirements for the Reproducible Research Course offered by John Hopkins Bloomberg School of Public Health and Coursera.
This report describes processing and analytic steps performed on the Storm Events Database provided by the National Climatic Data Center and strives to provide analysis on the impact of weather events to the United States in terms of population health and economic consequences.
This report concludes with displaying the top 10 events that cause the greatest casualties and greatest monetary damage.
## Download and read raw data
url <- "http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(url, destfile="stormdata.csv.bz2")
raw <- read.csv(bzfile("stormdata.csv.bz2"), header=TRUE)
1. Column names are normalized using the make.names function and then converted to lowercase. Dates are normalized to R Date() objects 2. Required columns are then selected as part of this research report 3. Based on the National Climatic Data Center’s descriptions of the event types as described in source [1], only data recorded after 01 January 1996 is taken into account. This is done to correct potential bias of data to Tornados, Thunderstorm Wind and Hail.
## Normalization of names and required date
names(raw) <- tolower(make.names(names(raw)))
raw$bgn_date <- as.Date(raw$bgn_date, format="%m/%d/%Y")
## Extract required time frame and data variables
data <- subset(raw, bgn_date >= as.Date("1996-01-01"), select=c(bgn_date, countyname, state, evtype, fatalities, injuries, propdmg, propdmgexp, cropdmg, cropdmgexp))
4. Derived values for combined population casualties and combined economic damage are then calculated.
5.Combined population casualties is calculated by summing fatalities and injuries.
6.Combined economic damage is calculated by summing property damage and crop damage. Units interpretation is based on source [2]
## Calculate derived values
data$propdmgexpvalue[data$propdmgexp == "K"] <- 1e3
data$propdmgexpvalue[data$propdmgexp == "M"] <- 1e6
data$propdmgexpvalue[data$propdmgexp == "B"] <- 1e9
data$propdmgexpvalue[data$propdmgexp == "0" | data$propdmgexp == ""] <- 1
data$cropdmgexpvalue[data$cropdmgexp == "K"] <- 1e3
data$cropdmgexpvalue[data$cropdmgexp == "M"] <- 1e6
data$cropdmgexpvalue[data$cropdmgexp == "B"] <- 1e9
data$cropdmgexpvalue[data$cropdmgexp == "0" | data$cropdmgexp == ""] <- 1
## calculated combined property and crop damage
data$combineddamage <- data$propdmg*data$propdmgexpvalue+data$cropdmg*data$cropdmgexpvalue
1. It is observed that there are much more event types (see output below) in the data than the described 48 official event types as per source [3].
length(unique(data$evtype))
## [1] 516
2.Therefore the event type values are harmonized to fit with the events within the official 48 types as stipulated from Directive 10-1605. Event names are also normalized to lowercase.
## harmonization
data$evtype <- tolower(data$evtype)
data$evtype[grep("*marine thunderstorm wind*", data$evtype)] <- "marine thunderstorm wind"
data$evtype[grep("*extreme cold*", data$evtype)] <- "extreme cold/wind chill"
data$evtype[grep("*extreme wind chill*", data$evtype)] <- "extreme cold/wind chill"
data$evtype[grep("*astronomical low tide*", data$evtype)] <- "astronomical low tide"
data$evtype[grep("*hurricane*", data$evtype)] <- "hurricane (typhoon)"
data$evtype[grep("*typhoon*", data$evtype)] <- "hurricane (typhoon)"
data$evtype[grep("*tropical depression*", data$evtype)] <- "tropical depression"
data$evtype[grep("*marine strong wind*", data$evtype)] <- "marine strong wind"
data$evtype[grep("*thunderstorm wind*", data$evtype)] <- "thunderstorm wind"
data$evtype[grep("*lake-effect snow*", data$evtype)] <- "lake-effect snow"
data$evtype[grep("*marine high wind*", data$evtype)] <- "marine high wind"
data$evtype[grep("*storm surge/tide*", data$evtype)] <- "storm surge/tide"
data$evtype[grep("*storm surge*", data$evtype)] <- "storm surge/tide"
data$evtype[grep("*tide*", data$evtype)] <- "tide"
data$evtype[grep("^(!extreme)*cold*", data$evtype)] <- "cold/wind chill"
data$evtype[grep("^(!extreme)*wind chill*", data$evtype)] <- "cold/wind chill"
data$evtype[grep("*lakeshore flood*", data$evtype)] <- "lakeshore flood"
data$evtype[grep("*excessive heat*", data$evtype)] <- "excessive heat"
data$evtype[grep("*tropical storm*", data$evtype)] <- "tropical storm"
data$evtype[grep("*winter weather*", data$evtype)] <- "winter weather"
data$evtype[grep("*coastal flood*", data$evtype)] <- "coastal flood"
data$evtype[grep("*frost*", data$evtype)] <- "frost/freeze"
data$evtype[grep("*freeze*", data$evtype)] <- "frost/freeze"
data$evtype[grep("*funnel cloud*", data$evtype)] <- "funnel cloud"
data$evtype[grep("*freezing fog*", data$evtype)] <- "freezing fog"
data$evtype[grep("*volcanic ash*", data$evtype)] <- "volcanic ash"
data$evtype[grep("*winter storm*", data$evtype)] <- "winter storm"
data$evtype[grep("*debris flow*", data$evtype)] <- "debris flow"
data$evtype[grep("*dense smoke*", data$evtype)] <- "dense smoke"
data$evtype[grep("*flash flood*", data$evtype)] <- "flash flood"
data$evtype[grep("*marine hail*", data$evtype)] <- "marine hail"
data$evtype[grep("*rip current*", data$evtype)] <- "rip current"
data$evtype[grep("*strong wind*", data$evtype)] <- "strong wind"
data$evtype[grep("*dust devil*", data$evtype)] <- "dust devil"
data$evtype[grep("*dust storm*", data$evtype)] <- "dust storm"
data$evtype[grep("*heavy rain*", data$evtype)] <- "heavy rain"
data$evtype[grep("*heavy snow*", data$evtype)] <- "heavy snow"
data$evtype[grep("*waterspout*", data$evtype)] <- "waterspout"
data$evtype[grep("*astronomical low tide*", data$evtype)] <- "astronomical low tide"
data$evtype[grep("*avalanche*", data$evtype)] <- "avalanche"
data$evtype[grep("*dense fog*", data$evtype)] <- "dense fog"
data$evtype[grep("*high surf*", data$evtype)] <- "high surf"
data$evtype[grep("*high wind*", data$evtype)] <- "high wind"
data$evtype[grep("*ice storm*", data$evtype)] <- "ice storm"
data$evtype[grep("*lightning*", data$evtype)] <- "lightning"
data$evtype[grep("*blizzard*", data$evtype)] <- "blizzard"
data$evtype[grep("*wildfire*", data$evtype)] <- "wildfire"
data$evtype[grep("*drought*", data$evtype)] <- "drought"
data$evtype[grep("*tornado*", data$evtype)] <- "tornado"
data$evtype[grep("*tsunami*", data$evtype)] <- "tsunami"
data$evtype[grep("*seiche*", data$evtype)] <- "seiche"
data$evtype[grep("^(!lakeshore)*flood*", data$evtype)] <- "flood"
data$evtype[grep("*sleet*", data$evtype)] <- "sleet"
data$evtype[grep("*hail*", data$evtype)] <- "hail"
data$evtype[grep("^(!excessive)*heat*", data$evtype)] <- "heat"
1. Population casualties and economic damage are tabulated for each recorded event type using the plyr package. 2. The data is then sorted by population casualties and economic damage respectively in descending order. 3. The top 10 events are then selected for the aforementioned areas of interest.
library(plyr)
populationData <- ddply(data, .(evtype), summarize, casualties=sum(fatalities, injuries))
populationData <- populationData[order(-populationData$casualties),]
## Select top 10
populationTop10 <- populationData[1:11,]
names(populationTop10) <- c("events", "casualties")
economyData <- ddply(data, .(evtype), summarize, monetarydamage=sum(combineddamage))
economyData <- economyData[order(-economyData$monetarydamage),]
## Select top 10
economyTop10 <- economyData[1:10,]
names(economyTop10) <- c("events", "monetarydamage")
Firstly, the idea of a “Top 10” is familiar with most people, therefore making the information easy to interpret.
Secondly, the top 10 events for each area of interest covers more than 80% of the casualties caused and more than 90% of monetary damages caused as shown below.
Top 10 Coverage for Population Casualty coverage:
sum(populationTop10$casualties)/sum(populationData$casualties)
## [1] 0.8589
Top 10 Coverage for monetary damage coverage:
sum(economyTop10$monetarydamage)/sum(economyData$monetarydamage)
## [1] 0.9368
It is noted that amongst the top 10 events, there is an event “tstm wind” that is not part of the official 48 events described.
In the absence of any official description, it is recommended not to be considered until further clarification. In light of this, the top eleventh event (high wind) is thus recommended to be considered.
Both events “tstm wind” and “high wind” are nevertheless still shown in the top 10 to reflect this finding.
# hist(populationTop10$casualties)
library(ggplot2)
popplot <- ggplot(populationTop10, aes(x=events, y=casualties)) + geom_point() + coord_flip()
popplot <- popplot + ggtitle("Top 10 population casualties by event")
popplot
populationTop10$rank <- 1:11
names(populationTop10) <- c("Event", "Number of Casualties", "Rank")
populationTop10
## Event Number of Casualties Rank
## 266 tornado 22178 1
## 47 excessive heat 8188 2
## 60 flood 7172 3
## 110 lightning 4792 4
## 271 tstm wind 3870 5
## 73 heat 2710 6
## 59 flash flood 2561 7
## 263 thunderstorm wind 1567 8
## 331 winter storm 1483 9
## 83 hurricane (typhoon) 1453 10
## 79 high wind 1320 11
## format monetary damage into billions of dollars
economyTop10$monetarydamage <- format(economyTop10$monetarydamage/1e9, scientific=FALSE, big.mark=",")
# hist(economyTop10$monetarydamage)
econplot <- ggplot(economyTop10, aes(x=events, y=monetarydamage)) + geom_point() + coord_flip()
econplot <- econplot + ggtitle("Top 10 monetary damage (Billions of US Dollars) by event")
econplot
economyTop10$rank <- 1:10
names(economyTop10) <- c("Events", "Billions of US Dollars", "Rank")
economyTop10
## Events Billions of US Dollars Rank
## 60 flood 148.920 1
## 83 hurricane (typhoon) 87.069 2
## 265 tide 47.845 3
## 266 tornado 24.900 4
## 71 hail 17.201 5
## 59 flash flood 16.557 6
## 32 drought 14.414 7
## 269 tropical storm 8.320 8
## 79 high wind 5.883 9
## 323 wildfire 5.054 10
sessionInfo()
## R version 3.1.1 (2014-07-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggplot2_1.0.0 plyr_1.8.1
##
## loaded via a namespace (and not attached):
## [1] colorspace_1.2-4 digest_0.6.4 evaluate_0.5.5 formatR_0.10
## [5] grid_3.1.1 gtable_0.1.2 htmltools_0.2.4 knitr_1.6
## [9] labeling_0.2 MASS_7.3-33 munsell_0.4.2 proto_0.3-10
## [13] Rcpp_0.11.2 reshape2_1.4 rmarkdown_0.2.49 scales_0.2.4
## [17] stringr_0.6.2 tools_3.1.1