In this paper we investigate which weather events are most harmful to population health and to the economy. The analysis is based on the U.S. National Oceanic and Atmospheric Administration's (NOAA) storm database, which contains data from 1950 to 2011. The impact on population health is estimated by the number of injuries and fatalities; the impact on the economy is estimated from crop and property damage. We found that tornadoes are most harmful to population health and that floods have the greatest economic consequences.
library(ggplot2)
library(plyr)
We take the storm data from the Coursera webpage (link below) and read the bzip2 file directly in R.
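# Download the compressed data set only if it is not already present.
if (!"data_raw.csv.bz2" %in% dir("./")) {
  download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
                destfile = "./data_raw.csv.bz2")
}
# read.csv() decompresses bzip2 files transparently; skip if already loaded.
if (!exists("data_raw")) {
  data_raw <- read.csv("data_raw.csv.bz2")
}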
As also stated on the Coursera page, documentation can be found here:
Quick Inspection of the raw data:
names(data_raw)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
We focus on the data that is relevant for this analysis, i.e. the columns EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, and CROPDMGEXP.
data <- data_raw[, c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")]
Quick Inspection of the data:
str(data)
## 'data.frame': 902297 obs. of 7 variables:
## $ EVTYPE : Factor w/ 985 levels "?","ABNORMALLY DRY",..: 830 830 830 830 830 830 830 830 830 830 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
Damage value preparation
Property and crop damages are expressed as a combination of xxxDMG and xxxDMGEXP, i.e. a value and an exponent.
unique(data$PROPDMGEXP)
## [1] K M B m + 0 5 6 ? 4 2 3 h 7 H - 1 8
## Levels: - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
unique(data$CROPDMGEXP)
## [1] M K m B ? 0 k 2
## Levels: ? 0 2 B k K m M
We need to translate the exponent codes into numeric multipliers before use.
Translate function:
# Translate a single damage exponent code into a numeric multiplier.
translateDMG <- function(f_exp) {
  if (f_exp %in% c("h", "H", "2")) { return(1e+02) }  # assumption: h/H means hundreds
  if (f_exp %in% c("k", "K", "3")) { return(1e+03) }
  if (f_exp %in% c("m", "M", "6")) { return(1e+06) }
  if (f_exp %in% c("b", "B")) { return(1e+09) }
  if (f_exp %in% c("0", "")) { return(1e+00) }
  if (f_exp %in% c("1")) { return(1e+01) }
  if (f_exp %in% c("4")) { return(1e+04) }
  if (f_exp %in% c("5")) { return(1e+05) }
  if (f_exp %in% c("7")) { return(1e+07) }
  if (f_exp %in% c("8")) { return(1e+08) }
  if (f_exp %in% c("-", "?", "+")) { return(0) }  # ambiguous codes count as zero
  NA_real_  # unknown code; never reached for the levels listed above
}
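Calling a function once per row is slow on roughly 900,000 observations. As a minimal sketch of an alternative, the same mapping can be expressed as a named lookup vector and applied to the whole column at once. The names `exp_lookup` and `translate_dmg_vec` are hypothetical and not part of the original analysis, and reading "h"/"H" as hundreds is an assumption.

```r
# Named lookup table: exponent symbol -> multiplier.
# Assumptions: "h"/"H" means hundreds; "-", "?", "+" count as 0.
exp_lookup <- c(
  "h" = 1e2, "H" = 1e2,
  "k" = 1e3, "K" = 1e3,
  "m" = 1e6, "M" = 1e6,
  "b" = 1e9, "B" = 1e9,
  "-" = 0, "?" = 0, "+" = 0
)
# Digits 0..8 are read as powers of ten.
exp_lookup[as.character(0:8)] <- 10^(0:8)

translate_dmg_vec <- function(f_exp) {
  key <- as.character(f_exp)            # works for factors and character vectors
  mult <- unname(exp_lookup[key])       # symbols not in the table become NA
  mult[!is.na(key) & key == ""] <- 1    # empty exponent: take the value as-is
  mult
}

translate_dmg_vec(c("K", "m", "", "5", "?"))
```

Under these assumptions, `data$PROPDMGEXPtransl <- translate_dmg_vec(data$PROPDMGEXP)` could stand in for the `sapply` call; unknown symbols surface as `NA` rather than being silently converted.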
Creating the new columns:
data$PROPDMGEXPtransl <- sapply(data$PROPDMGEXP, translateDMG)
data$CROPDMGEXPtransl <- sapply(data$CROPDMGEXP, translateDMG)
data$PROPDMGtransl <- data$PROPDMG * data$PROPDMGEXPtransl
data$CROPDMGtransl <- data$CROPDMG * data$CROPDMGEXPtransl
head(data)
## EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 TORNADO 0 15 25.0 K 0
## 2 TORNADO 0 0 2.5 K 0
## 3 TORNADO 0 2 25.0 K 0
## 4 TORNADO 0 2 2.5 K 0
## 5 TORNADO 0 2 2.5 K 0
## 6 TORNADO 0 6 2.5 K 0
## PROPDMGEXPtransl CROPDMGEXPtransl PROPDMGtransl CROPDMGtransl
## 1 1000 1 25000 0
## 2 1000 1 2500 0
## 3 1000 1 25000 0
## 4 1000 1 2500 0
## 5 1000 1 2500 0
## 6 1000 1 2500 0
Event value preparation
As seen above, the factor variable EVTYPE has 985 distinct levels. In the following, each analysis is restricted to its 10 highest-ranking event types. A more sophisticated approach would merge related event types into subgroups to reduce the number of levels.
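The need for such grouping is visible in the results of this report, where TSTM WIND and THUNDERSTORM WIND are ranked as separate event types. A minimal sketch of a normalization step is given here; `normalize_evtype` is a hypothetical helper, not part of the original analysis, and it only covers a few obvious spelling variants.

```r
# Hypothetical helper: collapse a few spelling variants of EVTYPE into one label.
normalize_evtype <- function(evtype) {
  x <- toupper(trimws(as.character(evtype)))  # unify case, drop outer blanks
  x <- gsub("\\s+", " ", x)                   # collapse repeated whitespace
  x <- sub("^TSTM", "THUNDERSTORM", x)        # expand the TSTM abbreviation
  x <- sub("^THUNDERSTORM WINDS$", "THUNDERSTORM WIND", x)  # plural to singular
  x
}

normalize_evtype(c("TSTM WIND", " Thunderstorm  Winds", "TORNADO"))
```

Applying such a function to `data$EVTYPE` before aggregating would merge the affected categories, which may change the rankings reported in this paper.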
To answer the question of which types of events are most harmful to population health, we combine the data of injuries and fatalities.
data$FAT_AND_INJ <- data$FATALITIES + data$INJURIES
health <- ddply(data, .(EVTYPE), function(x) data.frame(FAT_AND_INJ = sum(x$FAT_AND_INJ)))
health <- health[with(health, order(-FAT_AND_INJ)), ]
health <- health[1:10, ]
health$EVTYPE <- factor(health$EVTYPE, levels = health$EVTYPE)
health
## EVTYPE FAT_AND_INJ
## 830 TORNADO 96979
## 123 EXCESSIVE HEAT 8428
## 854 TSTM WIND 7461
## 164 FLOOD 7259
## 452 LIGHTNING 6046
## 269 HEAT 3037
## 147 FLASH FLOOD 2755
## 424 ICE STORM 2064
## 759 THUNDERSTORM WIND 1621
## 972 WINTER STORM 1527
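The aggregate, sort, and truncate steps used for this table do not strictly require plyr. The following sketch shows the same pattern with base R on a small made-up data set; `toy` and `top_events` are hypothetical names, and the toy values are not taken from the NOAA data.

```r
# Toy data standing in for the storm data (values are made up).
toy <- data.frame(
  EVTYPE = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD"),
  FATALITIES = c(1, 0, 2, 3, 1),
  INJURIES = c(10, 2, 5, 0, 1)
)
toy$FAT_AND_INJ <- toy$FATALITIES + toy$INJURIES

# Sum one column per event type, sort descending, keep the top n rows.
top_events <- function(df, value_col, n = 10) {
  totals <- aggregate(df[value_col], by = list(EVTYPE = df$EVTYPE), FUN = sum)
  totals <- totals[order(-totals[[value_col]]), ]
  head(totals, n)
}

top_events(toy, "FAT_AND_INJ", n = 2)
```

Because the column name is a parameter, the same helper could serve both the health and the damage ranking.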
We show a bar plot of the top 10 in descending order.
p <- ggplot(data = health, aes(x = EVTYPE, y = FAT_AND_INJ))
p <- p + geom_bar(stat = "identity")
p <- p + theme(axis.text.x = element_text(angle = 90, hjust = 1))
p <- p + xlab("Event type")
p <- p + ylab("Fatalities and Injuries")
p <- p + ggtitle("Top 10 events for number of fatalities and injuries")
p
To answer the question of which types of events cause the greatest economic damage, we combine the available data for crop and property damage in the same way.
data$PROP_AND_CROP <- data$PROPDMGtransl + data$CROPDMGtransl
damage <- ddply(data, .(EVTYPE), function(x) data.frame(PROP_AND_CROP = sum(x$PROP_AND_CROP)))
damage <- damage[with(damage, order(-PROP_AND_CROP)), ]
damage <- damage[1:10, ]
damage$EVTYPE <- factor(damage$EVTYPE, levels = damage$EVTYPE)
damage
## EVTYPE PROP_AND_CROP
## 164 FLOOD 150319678257
## 406 HURRICANE/TYPHOON 71913712800
## 830 TORNADO 57362333886
## 666 STORM SURGE 43323541000
## 238 HAIL 18761221566
## 147 FLASH FLOOD 18243991078
## 88 DROUGHT 15018672000
## 397 HURRICANE 14610229010
## 586 RIVER FLOOD 10148404500
## 424 ICE STORM 8967041360
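Raw USD totals with twelve digits are hard to compare at a glance. One possible way to make the table and the plot axis easier to read is to express the totals in billions of USD; `to_billions` is a hypothetical helper, shown here on the two largest totals from the table above.

```r
# Hypothetical helper: convert USD amounts to billions, rounded for display.
to_billions <- function(usd, digits = 1) {
  round(usd / 1e9, digits)
}

# The two largest totals from the table above (FLOOD, HURRICANE/TYPHOON).
to_billions(c(150319678257, 71913712800))
```

A column such as `damage$PROP_AND_CROP_BN <- to_billions(damage$PROP_AND_CROP)` could then be plotted in place of the raw values, with the axis label adjusted accordingly.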
We show a bar plot of the top 10 in descending order.
p <- ggplot(data = damage, aes(x = EVTYPE, y = PROP_AND_CROP))
p <- p + geom_bar(stat = "identity")
p <- p + theme(axis.text.x = element_text(angle = 90, hjust = 1))
p <- p + xlab("Event type")
p <- p + ylab("Damage in USD (combined property and crop)")
p <- p + ggtitle("Top 10 events for combined crop and property damage")
p
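Tornadoes are by far the most harmful event type with respect to population health, measured as the combined number of injuries and fatalities: 96,979 in total, more than ten times the count for excessive heat (8,428), the second-ranked type. According to the data, floods have the greatest economic consequences, with about 150 billion USD in combined property and crop damage, roughly twice the total for hurricanes/typhoons (about 72 billion USD).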