In this report we aim to identify most harmful storms respect to population health and storms that has most economic consequences. The U.S. National Oceanic and Atmospheric Administration track different storm events across the United States and provide amount of death and injury within population, and property and crop damage per event. From data we investigated that Tornado has the most damage on population health: highest injury and death from this events. Tornado also has one of the biggest economic consequences, but the greatest consequences are from flood.
The raw data come form U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database for event between 1950 and November 2011. Raw data can be downloaded here. Data was originally downloaded at Jun 18 2014 21:54 (VLAT).
First of all let’s download and load file into R:
if (!file.exists("data.csv.bz2")) {
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
"data.csv.bz2", method = "curl")
}
if (!file.exists("data.csv")) {
R.utils::bunzip2("data.csv.bz2", remove = FALSE, overwrite = TRUE)
}
data <- read.csv("data.csv", na.strings = "")
This is a huge data-set (902297 rows by 37 cols):
dim(data)
## [1] 902297 37
And we will use data.table package to fasten work with it:
suppressPackageStartupMessages(require(data.table))
data <- as.data.table(data)
Also for our investigation we need only this columns:
We’ll do some transformations to data:
data[, PROPDMGEXP := toupper(PROPDMGEXP)]
data[is.na(PROPDMGEXP), eco.prop.b := PROPDMG / 1e9]
data[PROPDMGEXP == "H", eco.prop.b := PROPDMG / 1e7]
data[PROPDMGEXP == "K", eco.prop.b := PROPDMG / 1e6]
data[PROPDMGEXP =="M", eco.prop.b := PROPDMG / 1e3]
data[PROPDMGEXP == "B", eco.prop.b := PROPDMG]
data[is.na(eco.prop.b), eco.prop.b := PROPDMG / (10 ^ (9 - suppressWarnings(as.numeric(PROPDMGEXP))))]
data[is.na(eco.prop.b), eco.prop.b := PROPDMG / 1e9]
data[, CROPDMGEXP := toupper(CROPDMGEXP)]
data[is.na(CROPDMGEXP), eco.crop.b := CROPDMG / 1e9]
data[CROPDMGEXP == "H", eco.crop.b := CROPDMG / 1e7]
data[CROPDMGEXP == "K", eco.crop.b := CROPDMG / 1e6]
data[CROPDMGEXP =="M", eco.crop.b := CROPDMG / 1e3]
data[CROPDMGEXP == "B", eco.crop.b := CROPDMG]
data[is.na(eco.crop.b), eco.crop.b := CROPDMG / (10 ^ (9 - suppressWarnings(as.numeric(CROPDMGEXP))))]
data[is.na(eco.crop.b), eco.crop.b := CROPDMG / 1e9]
setnames(data, "EVTYPE", "event")
setnames(data, "FATALITIES", "people.death")
setnames(data, "INJURIES", "people.injur")
data <- data[, list(event, people.death, people.injur, eco.prop.b, eco.crop.b)]
We need some transformations on event names: * Some events are in lower case we transform them:
data[, event := toupper(event)]
data[, event := gsub("[ ]?AND[ ]?|[ ]?[&][ ]?|[ ][-][ ]", "/", event)]
data[, event := gsub("^[ ]*|[ ]$", "", event)]
data[, event := gsub("[ ]+", " ", event)]
data[, event := gsub("S$", "", event)]
It seems that there are a lot of typos, but it hard to correct, ex:
data[, .N, by = event][order(event)][82:83, event]
We then aggregate data and remove not meaningful zeros:
data.agg.eco <- data[, list(eco.prop.b.sum = sum(eco.prop.b, na.rm = TRUE),
eco.crop.b.sum = sum(eco.crop.b, na.rm = TRUE)),
by = event][eco.prop.b.sum > 0 | eco.crop.b.sum > 0]
data.agg.people <- data[, list(people.death.sum = sum(people.death),
people.injur.sum = sum(people.injur)),
by = event][people.death.sum > 0 | people.injur.sum > 0]
In this section we address events which are most harmful to population health and which has greatest economic consequences.
We have two storms effects regarding to population health: * death, that seems more harmful, * and injury.
Top 20 storms with respect to population health:
head(data.agg.people[order(-people.death.sum, -people.injur.sum)], 20)
## event people.death.sum people.injur.sum
## 1: TORNADO 5633 91346
## 2: EXCESSIVE HEAT 1903 6525
## 3: FLASH FLOOD 980 1777
## 4: HEAT 937 2100
## 5: LIGHTNING 816 5230
## 6: RIP CURRENT 572 529
## 7: TSTM WIND 504 6957
## 8: FLOOD 470 6789
## 9: HIGH WIND 283 1439
## 10: AVALANCHE 224 170
## 11: WINTER STORM 216 1338
## 12: THUNDERSTORM WIND 197 2406
## 13: HEAT WAVE 177 379
## 14: EXTREME COLD 162 231
## 15: HEAVY SNOW 127 1021
## 16: EXTREME COLD/WIND CHILL 125 24
## 17: STRONG WIND 111 301
## 18: HIGH SURF 104 156
## 19: BLIZZARD 101 805
## 20: HEAVY RAIN 98 255
Since we have a lot of events we’ll group 21th, …, 190th:
data.agg.people.group <- rbind(data.agg.people[order(-people.death.sum, -people.injur.sum)][1:20],
data.agg.people[order(-people.death.sum, -people.injur.sum)][21:nrow(data.agg.people), list(event = "OTHERS", people.death.sum = sum(people.death.sum), people.injur.sum = sum(people.injur.sum))])
data.agg.people.group[, event := factor(event, event)]
suppressPackageStartupMessages(require(reshape2))
data.agg.people.group <- melt(data.agg.people.group,
id.vars = "event",
measure.vars = c("people.death.sum", "people.injur.sum"))
data.agg.people.group[, variable := factor(variable, labels = c("Death", "Injury"))]
suppressPackageStartupMessages(require(ggplot2))
ggplot(data.agg.people.group, aes(x = reorder(event, -xtfrm(event)))) +
geom_bar(aes(y = value, fill = variable),
stat = "identity",
position = "dodge") +
coord_flip() +
scale_y_sqrt() +
labs(x = "Strom type", y = "Number of people hit", fill = "") +
theme_bw()
Figure 1. Most harmful storms with respect to peoples health, x-axis uses square root scale
We have two storms effects on economics: * damages to property, * and damages to crop.
Top 20 storms with the greatest economic consequences, in billions of dollars:
head(data.agg.eco[order(-eco.prop.b.sum - eco.crop.b.sum),
list(event,
eco.prop.b.sum = round(eco.prop.b.sum, 2),
eco.crop.b.sum = round(eco.crop.b.sum, 2))], 20)
## event eco.prop.b.sum eco.crop.b.sum
## 1: FLOOD 144.66 5.66
## 2: HURRICANE/TYPHOON 69.31 2.61
## 3: TORNADO 56.95 0.41
## 4: STORM SURGE 43.32 0.00
## 5: HAIL 15.74 3.03
## 6: FLASH FLOOD 16.83 1.42
## 7: DROUGHT 1.05 13.97
## 8: HURRICANE 11.87 2.74
## 9: RIVER FLOOD 5.12 5.03
## 10: ICE STORM 3.94 5.02
## 11: TROPICAL STORM 7.70 0.68
## 12: WINTER STORM 6.69 0.03
## 13: HIGH WIND 5.88 0.68
## 14: THUNDERSTORM WIND 5.43 0.61
## 15: WILDFIRE 4.87 0.30
## 16: TSTM WIND 4.49 0.55
## 17: STORM SURGE/TIDE 4.64 0.00
## 18: HURRICANE OPAL 3.17 0.02
## 19: WILD/FOREST FIRE 3.00 0.11
## 20: HEAVY RAIN/SEVERE WEATHER 2.50 0.00
Since we have a lot of events we’ll group 21th, …, 362th:
data.agg.eco.group <- rbind(data.agg.eco[order(-eco.prop.b.sum - eco.crop.b.sum)][1:20],
data.agg.eco[order(-eco.prop.b.sum - eco.crop.b.sum)][21:nrow(data.agg.eco), list(event = "OTHERS", eco.prop.b.sum = sum(eco.prop.b.sum), eco.crop.b.sum = sum(eco.crop.b.sum))])
data.agg.eco.group[, event := factor(event, event)]
data.agg.eco.group <- melt(data.agg.eco.group,
id.vars = "event",
measure.vars = c("eco.prop.b.sum", "eco.crop.b.sum"))
data.agg.eco.group[, variable := factor(variable, labels = c("Crop", "Property"))]
ggplot(data.agg.eco.group, aes(x = reorder(event, -xtfrm(event)))) +
geom_bar(aes(y = value, fill = variable),
stat = "identity",
position = "dodge") +
coord_flip() +
scale_y_sqrt() +
labs(x = "Strom type", y = "Damage in billions of dollars", fill = "") +
theme_bw()
Figure 2. Most harmful storms with respect to economic, x-axis uses square root scale
This analysis originally performed in a Linux environment:
sessionInfo()
## R version 3.1.0 (2014-04-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
##
## locale:
## [1] LC_CTYPE=ru_RU.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=ru_RU.UTF-8 LC_COLLATE=ru_RU.UTF-8
## [5] LC_MONETARY=ru_RU.UTF-8 LC_MESSAGES=ru_RU.UTF-8
## [7] LC_PAPER=ru_RU.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=ru_RU.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggplot2_1.0.0 reshape2_1.4 data.table_1.9.2
##
## loaded via a namespace (and not attached):
## [1] colorspace_1.2-4 digest_0.6.4 evaluate_0.5.5 formatR_0.10
## [5] grid_3.1.0 gtable_0.1.2 htmltools_0.2.4 knitr_1.6
## [9] labeling_0.2 MASS_7.3-33 munsell_0.4.2 plyr_1.8.1
## [13] proto_0.3-10 Rcpp_0.11.1 rmarkdown_0.2.46 scales_0.2.4
## [17] stringr_0.6.2 tools_3.1.0 yaml_2.1.11
We used some additional packages (all available from CRAN):