Research Questions
The aim of this report is to address the following research questions:
- Across the United States, which types of storm events are most harmful with respect to population health?
- Across the United States, which types of storm events have the greatest economic consequences?
The analysis explores the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. It is concluded that tornadoes have the largest impact on human health and that floods have the greatest economic impact. However, this analysis is exploratory in nature only; further study using formal statistical methods should be performed to develop these conclusions.
The data.table package will be used to store the data and the ggplot2 package to plot the data:
library('data.table')
library('ggplot2')
It is assumed that the working directory has been set to the source folder. This code block downloads and reads the data from here. Caching is used since this is a time-consuming operation. Only the relevant data are stored; the relevant columns were deduced from the Storm Data Archive and Storm Data Documentation. Note that the data available seem to be a subset of the data described in the documentation.
# download data if it has not been already
if (!file.exists("./data/data.bz2")){
if (!file.exists("./data")) dir.create("data")
my.url <- paste("http://d396qusza40orc.cloudfront.net",
"/repdata%2Fdata%2FStormData.csv.bz2",
sep = "")
download.file(my.url, "./data/data.bz2")
}
# load the data if it has not been already
if (!exists("my.data")){
my.data <- read.csv(bzfile("./data/data.bz2"))
my.data <- as.data.table(my.data[,
c("EVTYPE",
"FATALITIES",
"INJURIES",
"PROPDMG",
"PROPDMGEXP",
"CROPDMG",
"CROPDMGEXP"
)]
)
}
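As a quick check that only the seven columns of interest were retained, the dimensions and structure of the loaded table can be inspected (a sketch; output not shown in the original report):
# sketch: confirm the dimensions and structure of the loaded table
dim(my.data)
str(my.data)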
The impact on population health will be quantified by the number of injuries and the number of fatalities that an event caused. This is given by the variables FATALITIES and INJURIES. As a sanity check (i.e. a quick check of the basic data structure), a summary of the variables is calculated:
# summarise population health variables as sanity check
my.data[, summary(INJURIES)]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 0.0 0.0 0.2 0.0 1700.0
my.data[, summary(FATALITIES)]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 0 0 0 583
Economic consequences will be quantified as the total cost of damage to property and to crops. The variables PROPDMG and CROPDMG quantify this; in addition to the numeric values given by these variables, an exponent is given in PROPDMGEXP and CROPDMGEXP (‘h’ for hundred, ‘k’ for thousand, ‘m’ for million, ‘b’ for billion) which must be accounted for. The total damage is the product of the numeric value and the exponent’s multiplier. The following code block performs this operation; because writing each expression by hand would be repetitive, the expressions are created as strings in a loop and then parsed to create new variables in the data.table. The expressions are displayed for the reader’s reference.
# initialise new variables
my.data[, post.PROPDMG:= 0]
my.data[, post.CROPDMG:= 0]
my.data[, post.TOTDMG:= 0]
# exponent lookup list
expon <- list('hH' = '100', 'kK' = '1e3', 'mM' = '1e6', 'bB' = '1e9')
# template expression - strings will be substituted to form expression
temp.expr <-
    "post.%sDMG := ifelse(grepl('[%s]', %sDMGEXP), %sDMG * %s, post.%sDMG)"
# this loop constructs and parses a series of expressions to compute the total
# damage
for (i in c('PROP', 'CROP')){
    for (j in seq_along(expon)){
        expr <- sprintf(temp.expr, i, names(expon[j]), i, i, expon[[j]], i)
        # print expression for the reader's reference
        print(expr)
        my.data[, eval(parse(text = expr))]
    }
}
## [1] "post.PROPDMG := ifelse(grepl('[hH]', PROPDMGEXP), PROPDMG * 100, post.PROPDMG)"
## [1] "post.PROPDMG := ifelse(grepl('[kK]', PROPDMGEXP), PROPDMG * 1e3, post.PROPDMG)"
## [1] "post.PROPDMG := ifelse(grepl('[mM]', PROPDMGEXP), PROPDMG * 1e6, post.PROPDMG)"
## [1] "post.PROPDMG := ifelse(grepl('[bB]', PROPDMGEXP), PROPDMG * 1e9, post.PROPDMG)"
## [1] "post.CROPDMG := ifelse(grepl('[hH]', CROPDMGEXP), CROPDMG * 100, post.CROPDMG)"
## [1] "post.CROPDMG := ifelse(grepl('[kK]', CROPDMGEXP), CROPDMG * 1e3, post.CROPDMG)"
## [1] "post.CROPDMG := ifelse(grepl('[mM]', CROPDMGEXP), CROPDMG * 1e6, post.CROPDMG)"
## [1] "post.CROPDMG := ifelse(grepl('[bB]', CROPDMGEXP), CROPDMG * 1e9, post.CROPDMG)"
# total damage is sum of property damage and crop damage
my.data[, post.TOTDMG := post.PROPDMG + post.CROPDMG]
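As an aside, under the assumption that only the codes h/k/m/b carry meaning, the same totals could be computed without building and parsing expressions. The sketch below (an alternative, not the method used in this report; the names mult, to.mult, alt.PROPDMG and alt.CROPDMG are illustrative) maps the exponent letters to multipliers with a named lookup vector:
# alternative sketch (not the method used above): map exponent letters to
# multipliers with a named vector; unrecognised codes contribute zero damage
mult <- c(h = 1e2, k = 1e3, m = 1e6, b = 1e9)
to.mult <- function(e){
    m <- mult[tolower(as.character(e))]
    ifelse(is.na(m), 0, m)
}
alt.PROPDMG <- my.data[, PROPDMG * to.mult(PROPDMGEXP)]
alt.CROPDMG <- my.data[, CROPDMG * to.mult(CROPDMGEXP)]
These should agree with post.PROPDMG and post.CROPDMG computed above.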
Sample results (the variables prefixed with post.) are checked here to ensure that the operation was performed as intended. A summary of post.TOTDMG is also shown as a sanity check.
# sanity checks on post processed data
subset(my.data, CROPDMGEXP == 'B')
## EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG
## 1: FREEZE 0 0 0.00 0.20
## 2: RIVER FLOOD 0 0 5.00 B 5.00
## 3: HEAT 0 0 0.00 0.40
## 4: DROUGHT 0 0 0.00 0.50
## 5: DROUGHT 0 0 0.00 1.00
## 6: DROUGHT 0 0 0.00 K 0.00
## 7: DROUGHT 0 0 0.00 K 0.00
## 8: ICE STORM 0 0 500.00 K 5.00
## 9: HURRICANE/TYPHOON 15 104 5.88 B 1.51
## CROPDMGEXP post.PROPDMG post.CROPDMG post.TOTDMG post.EVTYPE
## 1: B 0.00e+00 2.00e+08 2.00e+08 COLD
## 2: B 5.00e+09 5.00e+09 1.00e+10 FLOOD
## 3: B 0.00e+00 4.00e+08 4.00e+08 HEAT
## 4: B 0.00e+00 5.00e+08 5.00e+08 HEAT
## 5: B 0.00e+00 1.00e+09 1.00e+09 HEAT
## 6: B 0.00e+00 0.00e+00 0.00e+00 HEAT
## 7: B 0.00e+00 0.00e+00 0.00e+00 HEAT
## 8: B 5.00e+05 5.00e+09 5.00e+09 STORM
## 9: B 5.88e+09 1.51e+09 7.39e+09 TROP.STORM
my.data[, summary(post.TOTDMG)]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00e+00 0.00e+00 0.00e+00 5.28e+05 1.00e+03 1.15e+11
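Related to these checks, the raw exponent codes themselves could be tabulated to see how often each occurs (a sketch, not part of the original report; output not shown):
# sketch: tabulate the raw exponent codes (output not shown)
my.data[, table(PROPDMGEXP)]
my.data[, table(CROPDMGEXP)]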
The event type is given by EVTYPE. This is confirmed by listing the ‘levels’ for that variable:
# show extract of events and show number of levels
head(my.data[, levels(EVTYPE)], 20)
## [1] " HIGH SURF ADVISORY" " COASTAL FLOOD"
## [3] " FLASH FLOOD" " LIGHTNING"
## [5] " TSTM WIND" " TSTM WIND (G45)"
## [7] " WATERSPOUT" " WIND"
## [9] "?" "ABNORMAL WARMTH"
## [11] "ABNORMALLY DRY" "ABNORMALLY WET"
## [13] "ACCUMULATED SNOWFALL" "AGRICULTURAL FREEZE"
## [15] "APACHE COUNTY" "ASTRONOMICAL HIGH TIDE"
## [17] "ASTRONOMICAL LOW TIDE" "AVALANCE"
## [19] "AVALANCHE" "BEACH EROSIN"
sprintf("There are %i levels", length(my.data[, levels(EVTYPE)]))
## [1] "There are 985 levels"
However, as shown above, there are 985 levels; in order to draw meaningful conclusions from this study, the levels will be grouped into wider categories. Regular expressions are used to group keywords; the new categories are given by post.EVTYPE:
# initialise new categories with 'OTHER'
my.data[, post.EVTYPE := "OTHER"]
# create groups of categories defined by regular expressions
groups <- list(
FOG = "FOG",
HEAT = "HEAT|WARM|FIRE|DROUGHT|DR[YI]|HOT|HIGH TEMP|HYPERT",
COLD = "COLD|COOL|FREEZ|WINTE?R|FROST|IC[EY]|GLAZE|HYPOT|LOW TEMP",
STORM = "STORM",
RAIN = "RAIN|PREC|WET|SHOWER",
SNOW = "SNOW|BLIZZ|SLEET",
LIGHTNING = "LIGHTNING",
FLOOD = "FLOOD",
TROP.STORM = "HURRICAN|TYPHOON|CYCLONE",
TORNADO = "TORNADO|FUNNEL|SPOUT",
WIND = "WIND",
HAIL = "HAIL",
TSUNAMI = "TSUNAMI",
WATER = "CURRENT|SURF",
SLIDES = "SLIDE|AVALANCHE"
)
# apply the regular expressions in a loop to recategorise
lapply(seq_along(groups), function(i){
    my.data[, post.EVTYPE := ifelse(grepl(groups[[i]], EVTYPE, ignore.case = T),
                                    names(groups[i]), post.EVTYPE)]
})
Note that in the above, an event type can match more than one category; in that case the category assigned is the one whose pattern appears latest in groups, since later matches overwrite earlier ones. After re-categorization, a check is performed to ensure that the uncategorized data are insignificant in terms of the variables of interest; the target was that no more than 1% of the total of each variable of interest is classified as OTHER, i.e. uncategorized.
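To illustrate the precedence rule before running the check: an event type such as ‘THUNDERSTORM WINDS’ matches both the STORM and WIND patterns, and because WIND appears later in groups those rows end up categorized as WIND. The following sketch (not part of the original analysis; output not shown) could be used to confirm this:
# sketch: check which category an event type matching two patterns received
my.data[grepl("THUNDERSTORM WIND", EVTYPE), unique(post.EVTYPE)]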
# the key will be used for fast subsetting
setkey(my.data, 'post.EVTYPE')
sprintf("Percentage of uncategorized EVTYPE: %.2f%%",
100*nrow(my.data['OTHER'])/nrow(my.data))
## [1] "Percentage of uncategorized EVTYPE: 0.48%"
sprintf("Percentage of total damage cost uncategorized: %.2f%%",
100*my.data['OTHER'][, sum(post.TOTDMG)]/my.data[, sum(post.TOTDMG)])
## [1] "Percentage of total damage cost uncategorized: 0.02%"
sprintf("Percentage of total fatalities uncategorized: %.2f%%",
100*my.data['OTHER'][, sum(FATALITIES)]/my.data[, sum(FATALITIES)])
## [1] "Percentage of total fatalities uncategorized: 0.41%"
sprintf("Percentage of total injuries uncategorized: %.2f%%",
100*my.data['OTHER'][, sum(INJURIES)]/my.data[, sum(INJURIES)])
## [1] "Percentage of total injuries uncategorized: 0.11%"
As shown above, the OTHER categorization is not significant in terms of the variables of interest.
This code block aggregates the data to give the total of each variable of interest for each storm event category:
pp.data <- my.data[,
                   list(FATALITIES = sum(FATALITIES),
                        INJURIES = sum(INJURIES),
                        COST = sum(post.TOTDMG)),
                   by = post.EVTYPE]
print(pp.data)
## post.EVTYPE FATALITIES INJURIES COST
## 1: COLD 317 1322 3.543e+09
## 2: FLOOD 1525 8604 1.798e+11
## 3: FOG 80 1076 2.283e+07
## 4: HAIL 45 1467 2.073e+10
## 5: HEAT 3272 10883 2.485e+10
## 6: LIGHTNING 817 5232 9.508e+08
## 7: OTHER 62 149 8.684e+07
## 8: RAIN 109 329 4.239e+09
## 9: SLIDES 268 226 3.556e+08
## 10: SNOW 267 1927 1.926e+09
## 11: STORM 422 4235 7.329e+10
## 12: TORNADO 5639 91439 5.742e+10
## 13: TROP.STORM 133 1333 9.076e+10
## 14: TSUNAMI 33 129 1.441e+08
## 15: WATER 738 775 1.167e+08
## 16: WIND 1418 11402 1.819e+10
Below, plots of the number of fatalities and injuries for each event type are shown:
# reorder the levels
pp.data <- within(pp.data,
                  post.EVTYPE <- factor(post.EVTYPE,
                                        levels = pp.data[order(FATALITIES, decreasing = T), post.EVTYPE]))
# plot the graph
g = ggplot(pp.data, aes(x = post.EVTYPE, y = FATALITIES))
g = g + geom_bar(stat="identity")
g = g + theme(axis.text.x = element_text(angle = 90, hjust = 1))
g = g + ggtitle('Number of Fatalities')
g = g + ylab('Fatalities') + xlab('Event Type')
print(g)
# reorder the levels
pp.data <- within(pp.data,
                  post.EVTYPE <- factor(post.EVTYPE,
                                        levels = pp.data[order(INJURIES, decreasing = T), post.EVTYPE]))
# plot the graph
g = ggplot(pp.data, aes(x = post.EVTYPE, y = INJURIES))
g = g + geom_bar(stat="identity")
g = g + theme(axis.text.x = element_text(angle = 90, hjust = 1))
g = g + ggtitle('Number of Injuries')
g = g + ylab('Injuries') + xlab('Event Type')
print(g)
This exploratory analysis suggests that tornadoes are the storm events that are the most harmful with respect to population health.
# reorder the levels
pp.data <- within(pp.data,
                  post.EVTYPE <- factor(post.EVTYPE,
                                        levels = pp.data[order(COST, decreasing = T), post.EVTYPE]))
# plot the graph
g = ggplot(pp.data, aes(x = post.EVTYPE, y = COST))
g = g + geom_bar(stat="identity")
g = g + theme(axis.text.x = element_text(angle = 90, hjust = 1))
g = g + ggtitle('Total Cost')
g = g + ylab('Cost') + xlab('Event Type')
print(g)
This exploratory analysis suggests that floods are the storm events with the greatest economic consequences.
This exploratory study has found that tornadoes are the events that have the largest impact on population health and that floods are the most costly events. Robust conclusions cannot be drawn from the exploratory analysis alone; nevertheless, the exploratory evidence strongly favours the inferences above. It is recommended that a more detailed statistical study be performed to develop these conclusions further.
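As a closing design note, the three plotting blocks above follow the same pattern; a small helper along the following lines (a sketch, not used in this report; plot.by is an illustrative name) could factor out the repetition:
# sketch of a reusable helper: order the event types by a chosen variable and
# draw the corresponding bar chart
plot.by <- function(dt, var, title, ylab){
    dt <- within(dt, post.EVTYPE <- factor(post.EVTYPE,
                                           levels = dt[order(-get(var)), post.EVTYPE]))
    ggplot(dt, aes_string(x = "post.EVTYPE", y = var)) +
        geom_bar(stat = "identity") +
        theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
        ggtitle(title) + ylab(ylab) + xlab("Event Type")
}
# for example: print(plot.by(pp.data, "COST", "Total Cost", "Cost"))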