This report aims to answer the following two questions, based on the NOAA storm database, which contains information about the impact of weather events in the United States since 1950:
Which types of events are most harmful with respect to population health?
Which types of events have the greatest economic consequences?
The report answers these two questions by focusing on the most critical events along both dimensions, as defined by a risk factor index that takes into account both the total impact (harmfulness or economic damage) and the frequency, or likelihood, of each event (i.e. impact * frequency = index).
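For example, an event type accounting for 10% of all recorded occurrences and 50% of the total impact would score 0.10 * 0.50 * 100 = 5 on this index, while an equally damaging event type appearing in only 1% of records would score just 0.5, and would therefore rank lower.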
1. Data processing
# file name renamed to replace "-" with "_"
data <- read.csv("repdata_data_StormData.csv")
# loading required packages
library(ggplot2)
library(dplyr)
library(knitr)
The first step in this exercise was to identify which variables would serve to answer the two research questions. A quick look at the columns (together with a brief pass through the NOAA documentation) identifies the key variables required: BGN_DATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP and REFNUM.
names(data) <-tolower(names(data))
names(data)
## [1] "state__" "bgn_date" "bgn_time" "time_zone" "county"
## [6] "countyname" "state" "evtype" "bgn_range" "bgn_azi"
## [11] "bgn_locati" "end_date" "end_time" "county_end" "countyendn"
## [16] "end_range" "end_azi" "end_locati" "length" "width"
## [21] "f" "mag" "fatalities" "injuries" "propdmg"
## [26] "propdmgexp" "cropdmg" "cropdmgexp" "wfo" "stateoffic"
## [31] "zonenames" "latitude" "longitude" "latitude_e" "longitude_"
## [36] "remarks" "refnum"
Therefore, a new data set (data1) is created, including only the identified relevant variables.
# date column to date format
data$bgn_date <- as.POSIXlt(as.Date(data$bgn_date, format="%m/%d/%Y"))
# only interested in the year
data$bgn_date <- (data$bgn_date$year) + 1900
# filtering identified relevant variables
data1 <- data %>%
select(refnum, bgn_date, evtype, fatalities, injuries, propdmg, propdmgexp, cropdmg, cropdmgexp)
kable(head(data1,5)) # data1 selected variables
| refnum | bgn_date | evtype | fatalities | injuries | propdmg | propdmgexp | cropdmg | cropdmgexp |
|---|---|---|---|---|---|---|---|---|
| 1 | 1950 | TORNADO | 0 | 15 | 25.0 | K | 0 | |
| 2 | 1950 | TORNADO | 0 | 0 | 2.5 | K | 0 | |
| 3 | 1951 | TORNADO | 0 | 2 | 25.0 | K | 0 | |
| 4 | 1951 | TORNADO | 0 | 2 | 2.5 | K | 0 | |
| 5 | 1951 | TORNADO | 0 | 2 | 2.5 | K | 0 | |
A quick quality check of the variables (using the unique function) quickly puts in evidence many data quality issues in the observations: evtype has almost 1000 unique values, while the NOAA storm documentation describes about 50 different event types. Similar issues were found with the exp variables, which contain not only the expected B, M, K and H multipliers but 19 different characters.
# Different descriptions for event type
typesEV <- unique(data1$evtype)
length(typesEV)
## [1] 985
# Different descriptions for dollar value multiplier
typesEXP <- unique(data$propdmgexp)
length(typesEXP)
## [1] 19
However, before jumping in to clean the entire data set, the first piece of analysis will focus on determining whether the available data is “good enough” to provide the municipality with sufficient information to prioritise resources.
This approach is supported by the initial hypothesis that the available “accurate” records in the NOAA storm dataset should already put in evidence the disproportionate impact that a few natural events have relative to the total number of observations, as these events are expected to follow a Pareto distribution (i.e. ~20% of the events drive ~80% of the impact).
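As a quick sanity check of this hypothesis, a minimal sketch (assuming data1 built as above) computes the cumulative share of total fatalities contributed by event types ranked from most to least deadly, and counts how many types are needed to reach 80% of the impact:
# sketch: Pareto check on fatalities by event type
pareto <- data1 %>%
group_by(evtype) %>%
summarise(deaths = sum(fatalities)) %>%
arrange(desc(deaths)) %>%
mutate(cum.share = cumsum(deaths)/sum(deaths))
# number of event types accounting for 80% of all fatalities
sum(pareto$cum.share < 0.8) + 1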
The following table (Table 1) ranks the events in descending order of their percentage contribution to the total.
# 80/20 analysis
# Event frequency
freqType<-as.data.frame(table(data1$evtype))
freqType <- freqType %>%
arrange(desc(Freq))
Prop.type <- freqType$Freq[c(1:10)]/sum(freqType$Freq)
# Fatality sources average
fatal <- data1 %>%
group_by(evtype) %>%
summarise (Tfatal = mean(fatalities))
fatal <- fatal %>%
arrange(desc(Tfatal))
Prop.fatal <- fatal$Tfatal[c(1:10)]/sum(fatal$Tfatal)
# Fatality sources total
fatalS <- data1 %>%
group_by(evtype) %>%
summarise (Tfatals = sum(fatalities))
fatalS <- fatalS %>%
arrange(desc(Tfatals))
Prop.fatalS <- fatalS$Tfatals[c(1:10)]/sum(fatalS$Tfatals)
# Injury sources average
injur <- data1 %>%
group_by(evtype) %>%
summarise (Tinj = mean(injuries))
injur <- injur %>%
arrange(desc(Tinj))
Prop.inj <- injur$Tinj[c(1:10)]/sum(injur$Tinj)
# Injury sources total
injurS <- data1 %>%
group_by(evtype) %>%
summarise (Tinjs = sum(injuries))
injurS <- injurS %>%
arrange(desc(Tinjs))
Prop.injS <- injurS$Tinjs[c(1:10)]/sum(injurS$Tinjs)
# 80/20 table
Ranking <- cbind(freqType[1:10,1],Prop.type,fatal[1:10,1],Prop.fatal,injur[1:10,1],
Prop.inj,fatalS[1:10,1],Prop.fatalS,injurS[1:10,1],Prop.injS)
names(Ranking) <- c("ev.freq","freq.weight","ev.fatal","fatal.weight","ev.injur",
"inj.weight","ev.fatal.sum","fatal.sum.weight","ev.injur.sum","inj.sum.weight")
kable(Ranking, digits = 2, caption = "**Table 1: Weight of events' impact**")
| ev.freq | freq.weight | ev.fatal | fatal.weight | ev.injur | inj.weight | ev.fatal.sum | fatal.sum.weight | ev.injur.sum | inj.sum.weight |
|---|---|---|---|---|---|---|---|---|---|
| HAIL | 0.32 | TORNADOES, TSTM WIND, HAIL | 0.17 | Heat Wave | 0.17 | TORNADO | 0.37 | TORNADO | 0.65 |
| TSTM WIND | 0.24 | COLD AND SNOW | 0.09 | TROPICAL STORM GORDON | 0.10 | EXCESSIVE HEAT | 0.13 | TSTM WIND | 0.05 |
| THUNDERSTORM WIND | 0.09 | TROPICAL STORM GORDON | 0.05 | WILD FIRES | 0.09 | FLASH FLOOD | 0.06 | FLOOD | 0.05 |
| TORNADO | 0.07 | RECORD/EXCESSIVE HEAT | 0.04 | THUNDERSTORMW | 0.06 | HEAT | 0.06 | EXCESSIVE HEAT | 0.05 |
| FLASH FLOOD | 0.06 | EXTREME HEAT | 0.03 | HIGH WIND AND SEAS | 0.05 | LIGHTNING | 0.05 | LIGHTNING | 0.04 |
| FLOOD | 0.03 | HEAT WAVE DROUGHT | 0.03 | SNOW/HIGH WINDS | 0.04 | TSTM WIND | 0.03 | HEAT | 0.01 |
| THUNDERSTORM WINDS | 0.02 | HIGH WIND/SEAS | 0.03 | GLAZE/ICE STORM | 0.04 | FLOOD | 0.03 | ICE STORM | 0.01 |
| HIGH WIND | 0.02 | MARINE MISHAP | 0.02 | HEAT WAVE DROUGHT | 0.04 | RIP CURRENT | 0.02 | FLASH FLOOD | 0.01 |
| LIGHTNING | 0.02 | WINTER STORMS | 0.02 | WINTER STORM HIGH WINDS | 0.04 | HIGH WIND | 0.02 | THUNDERSTORM WIND | 0.01 |
| HEAVY SNOW | 0.02 | Heavy surf and wind | 0.02 | HURRICANE/TYPHOON | 0.03 | AVALANCHE | 0.01 | HAIL | 0.01 |
Description of the column ranking: columns 1 and 2 show the frequency of events based on the number of occurrences; columns 3 to 6 show the average number of fatalities and injuries per event (single-event impact); columns 7 to 10 show the total number of fatalities and injuries per event.
As can be seen from Table 1 above, and as expected, the first two to three ranked events have a disproportionate impact over the entire data set, indicating the events on which we would like to focus our analysis. It’s worth noting that Tropical Storm Gordon, as a single event, killed 8 people and injured 43 in 1994, spawning six tornadoes in the state of Florida.
kable(data1[data1$evtype=="TROPICAL STORM GORDON",])
| | refnum | bgn_date | evtype | fatalities | injuries | propdmg | propdmgexp | cropdmg | cropdmgexp |
|---|---|---|---|---|---|---|---|---|---|
| 195017 | 194949 | 1994 | TROPICAL STORM GORDON | 8 | 43 | 500 | K | 500 | K |
In addition, we want to know whether these fatalities and injuries are representative of the entire period (1950 to 2011). The time series chart below shows that the data set covers the whole period.
# time series analysis
crossF <- data1 %>%
group_by(bgn_date, evtype) %>%
summarise(fat.sum = sum(fatalities), inj.sum = sum(injuries)) %>%
arrange(desc(fat.sum),desc(inj.sum))
# linear plot across time
ggplot(crossF, aes(bgn_date)) + geom_line(aes(y=fat.sum, colour="fat.sum")) +
geom_line(aes(y=inj.sum, colour="inj.sum")) +
theme_bw() +
labs(x="Years (1950 to 2011)", y="Total number of Injuries/Fatalities",
title= "Figure 1: Historical record of Injuries and Fatalities")
Furthermore, as Figure 1 shows, injuries by themselves could be a good single proxy for general harm: more often than not, where there are injuries there are also some fatalities, while the opposite appears to be a rare case.
# Simplifying injuries vs. fatalities to use one single variable as proxy
harm <- with(data, injuries - fatalities)
table(harm < 0)
##
## FALSE TRUE
## 897838 4459
In the table above, TRUE values indicate observations where the number of fatalities was higher than the number of injuries. In the data, this happened in only 0.5% of records over the last 60 years. Hence, in order to simplify our analysis without compromising accuracy, it makes sense to drop the fatalities variable and use the injuries variable as a single proxy for population harm.
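The same 0.5% figure can be recovered in one line from the harm vector defined above:
# proportion of records where fatalities exceed injuries: 4459/902297, ~0.5%
mean(harm < 0)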
Finally, we want to select from the ranking in Table 1 the events “most harmful to population health”, so we can focus our analysis, investigate further, clean the data, etc. This selection is informed by the overlap of the following two key conditions (regularly used in [risk analysis](http://www.mindtools.com/pages/article/newPPM_78.htm)):
1. Events with the highest number of injuries (measured as a percentage of total injuries for the period)
2. Events with a high chance of happening (measured as a percentage of total event occurrences during the studied period).
Given these criteria, we end up with the following table of top 10 events:
# Ranking with Freq and Injuries only
harm <- Ranking %>%
select(1:2,9:10)
kable(harm, digits = 2, caption = "**Table 2: Top 10 most harmful events (Frequency vs. Impact)**")
| ev.freq | freq.weight | ev.injur.sum | inj.sum.weight |
|---|---|---|---|
| HAIL | 0.32 | TORNADO | 0.65 |
| TSTM WIND | 0.24 | TSTM WIND | 0.05 |
| THUNDERSTORM WIND | 0.09 | FLOOD | 0.05 |
| TORNADO | 0.07 | EXCESSIVE HEAT | 0.05 |
| FLASH FLOOD | 0.06 | LIGHTNING | 0.04 |
| FLOOD | 0.03 | HEAT | 0.01 |
| THUNDERSTORM WINDS | 0.02 | ICE STORM | 0.01 |
| HIGH WIND | 0.02 | FLASH FLOOD | 0.01 |
| LIGHTNING | 0.02 | THUNDERSTORM WIND | 0.01 |
| HEAVY SNOW | 0.02 | HAIL | 0.01 |
Let’s now clean up the event types that clearly refer to the same phenomenon.
# Cleaning same phenomena
harm$ev.freq <- gsub("TSTM", "THUNDERSTORM", harm$ev.freq)
harm$ev.freq <- gsub("WINDS", "WIND", harm$ev.freq)
harm$ev.injur.sum <- gsub("TSTM", "THUNDERSTORM", harm$ev.injur.sum)
The data needed to answer the question about the most harmful events is now ready for analysis. Let’s go back to the data regarding economic impact.
Following our previous approach, we first want to assess whether the available data is “good enough”, and which events drive the highest economic impact, so as to help the administration prioritise resources.
As previously mentioned, based on the NOAA storm documentation, we expect the monetary impact to be classified into four categories of multipliers: H, K, M and B (where H = 10^2, K = 10^3, M = 10^6 and B = 10^9).
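A concise way to express this mapping, once the codes are cleaned, is a named lookup vector (a minimal sketch; the analysis below uses gsub substitutions instead):
# sketch: named vector mapping exponent codes to numeric multipliers
multipliers <- c("H" = 1e2, "K" = 1e3, "M" = 1e6, "B" = 1e9)
multipliers["K"] # returns 1000
# any code outside H/K/M/B indexes to NA, flagging records that still need cleaning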
The tables below show that there are quite a number of typos, some requiring easy cleaning.
# Quick cleaning
data1$propdmgexp <- toupper(data1$propdmgexp)
data1$cropdmgexp <- toupper(data1$cropdmgexp)
However, the tables also show that the biggest issue in both cases is the absence of any categorisation (52% of property damage records and 69% of crop damage records):
mult.prop <- data1 %>%
count(propdmgexp) %>%
mutate(weight.prop = round(prop.table(n),2)) %>%
arrange(desc(weight.prop))
mult.prop
## Source: local data frame [17 x 3]
##
## propdmgexp n weight.prop
## 1 465934 0.52
## 2 K 424665 0.47
## 3 M 11337 0.01
## 4 + 5 0.00
## 5 - 1 0.00
## 6 0 216 0.00
## 7 1 25 0.00
## 8 2 13 0.00
## 9 3 4 0.00
## 10 4 4 0.00
## 11 5 28 0.00
## 12 6 4 0.00
## 13 7 5 0.00
## 14 8 1 0.00
## 15 ? 8 0.00
## 16 B 40 0.00
## 17 H 7 0.00
mult.crop <- data1 %>%
count(cropdmgexp) %>%
mutate(weight.crop = round(prop.table(n),2)) %>%
arrange(desc(weight.crop))
mult.crop
## Source: local data frame [7 x 3]
##
## cropdmgexp n weight.crop
## 1 618413 0.69
## 2 K 281853 0.31
## 3 0 19 0.00
## 4 2 1 0.00
## 5 ? 7 0.00
## 6 B 9 0.00
## 7 M 1995 0.00
For the property data the problem is not as large, as we can rely on roughly half (~48%) of the records. For the crop data we will have to work with about a third of the records, and see how we can extrapolate some information if required.
Let’s then work with the information we have, try to gather some insights, and see whether we can (and still need to) extrapolate to the uncategorised data.
# Economic analysis based on valid multipliers
exp <- c("H", "K", "M", "B")
prop <- data1 %>%
select(1:3,6:7) %>%
filter(propdmgexp %in% exp)
crop <- data1 %>%
select(1:3,8:9) %>%
filter(cropdmgexp %in% exp)
eco <- full_join(prop,crop,by="refnum")
# rebuilding evtypes and dates from Crop not existent on Prop
for (i in 1:nrow(eco)){
if(is.na(eco$evtype.x[i])){
x <- eco$refnum[i]
row <- which(eco$refnum==x)
y <- eco$evtype.y[row]
eco$evtype.x[i] <- y
}
}
for (i in 1:nrow(eco)){
if(is.na(eco$bgn_date.x[i])){
x <- eco$refnum[i]
row <- which(eco$refnum==x)
y <- eco$bgn_date.y[row]
eco$bgn_date.x[i] <- y
}
}
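# Note: these two loops simply back-fill evtype and bgn_date from the crop
# side of the join wherever the property side is missing. A more concise
# equivalent (a sketch, assuming a dplyr version >= 0.5 for coalesce, and
# that the evtype columns are character rather than factor) would be:
# eco <- eco %>%
#     mutate(evtype.x = coalesce(evtype.x, evtype.y),
#            bgn_date.x = coalesce(bgn_date.x, bgn_date.y))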
# filtering relevant columns
eco <- eco %>%
select(1:5, 8:9)
# converting categories into numeric multipliers
eco$propdmgexp <- gsub("H", 100, eco$propdmgexp)
eco$propdmgexp <- gsub("K", 1000, eco$propdmgexp)
eco$propdmgexp <- gsub("M", 1000000, eco$propdmgexp)
eco$propdmgexp <- gsub("B", 1000000000, eco$propdmgexp)
eco$cropdmgexp <- gsub("H", 100, eco$cropdmgexp)
eco$cropdmgexp <- gsub("K", 1000, eco$cropdmgexp)
eco$cropdmgexp <- gsub("M", 1000000, eco$cropdmgexp)
eco$cropdmgexp <- gsub("B", 1000000000, eco$cropdmgexp)
# to numeric
eco[is.na(eco[])] <- 0
eco$propdmgexp <- as.numeric(eco$propdmgexp)
eco$cropdmgexp <- as.numeric(eco$cropdmgexp)
eco$cropdmg <- as.numeric(eco$cropdmg)
# Proportions between Prop and Crop impact
e.prop <- sum(eco$propdmg * eco$propdmgexp)
e.crop <- sum(eco$cropdmg * eco$cropdmgexp)
The total crop impact in the available data is a mere 0.11% of the combined economic damage. Even knowing that the crop records with valid multipliers cover only ~31% of the data, scaling for this fact still makes crops account for only ~0.17% of the total economic impact.
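One plausible reading of that scaling, as a sketch (assuming the ~48% property and ~31% crop coverage rates from the multiplier tables above), rescales each observed total by its coverage before recomputing the crop share:
# sketch: crop share of observed impact, then rescaled by data coverage
crop.share <- e.crop/(e.prop + e.crop) # observed share, ~0.0011
(e.crop/0.31)/(e.prop/0.48 + e.crop/0.31) # rescaled share, ~0.0017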
With this information in hand, we can simplify the comparison between the subset of our data with valid multipliers (H, K, M, B) and the subset without them: we focus on property damage only (as it carries by far the greatest overall impact) and compare the distribution of event types in both subsets, so as to decide whether similar economic impacts can, or cannot, be extrapolated.
# Comparison of frequency of non available to available multipliers
eco1 <- anti_join(data1, prop, by = "refnum")
eco1.freq <- eco1 %>%
count(evtype) %>%
mutate(weight = prop.table(n)) %>%
arrange(desc(weight))
eco.freq <- prop %>%
count(evtype) %>%
mutate(weight = prop.table(n)) %>%
arrange(desc(weight))
comparison <- cbind(eco1.freq[1:10, c(1,3)], eco.freq[1:10, c(1,3)])
names(comparison) <- c("ev.non.data", "weight.na", "ev.data.avail", "weight")
kable(comparison, digits = 2, caption = "**Table 3: Top 10 events frequency comparison: data set with multipliers (H, K, M & B) vs. data set without**")
| ev.non.data | weight.na | ev.data.avail | weight |
|---|---|---|---|
| HAIL | 0.42 | HAIL | 0.21 |
| TSTM WIND | 0.34 | THUNDERSTORM WIND | 0.19 |
| FLASH FLOOD | 0.05 | TSTM WIND | 0.14 |
| THUNDERSTORM WINDS | 0.02 | TORNADO | 0.12 |
| TORNADO | 0.02 | FLASH FLOOD | 0.08 |
| HEAVY SNOW | 0.02 | FLOOD | 0.04 |
| FLOOD | 0.02 | HIGH WIND | 0.03 |
| HIGH WIND | 0.01 | THUNDERSTORM WINDS | 0.03 |
| MARINE TSTM WIND | 0.01 | LIGHTNING | 0.03 |
| HEAVY RAIN | 0.01 | WINTER STORM | 0.02 |
As can be seen from Table 3, the distribution of events is very similar between the two sets of data. Hence, in the absence of more information and resources, we can assume with some confidence that the data without multipliers should show a similar distribution of economic impact to the half we analysed. Therefore, the events identified as having a high economic impact should not change even if new data were available to fill in the gaps.
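One simple way to quantify this similarity, as a sketch using the frequency tables built above, is a rank correlation between the event weights of the two sets (a value close to 1 would support the extrapolation):
# sketch: rank correlation of event-type frequencies across the two sets
both <- inner_join(eco1.freq, eco.freq, by = "evtype")
cor(both$weight.x, both$weight.y, method = "spearman")
Table 4 below shows the top 10 events by this impact.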
# consolidation of economic impact
eco <- eco %>%
mutate(impact = ((propdmg * propdmgexp) + (cropdmg * cropdmgexp))) %>%
select(1:3,impact) %>%
group_by(evtype.x) %>%
summarise(e.impact = sum(impact)) %>%
arrange(desc(e.impact))
eco$weight <- eco$e.impact/sum(eco$e.impact)
kable(eco[1:10,c(1,3)], digits = 2, caption = "**Table 4: Top 10 events with the highest economic impact**")
| evtype.x | weight |
|---|---|
| FLOOD | 0.32 |
| HURRICANE/TYPHOON | 0.15 |
| TORNADO | 0.12 |
| STORM SURGE | 0.09 |
| HAIL | 0.04 |
| FLASH FLOOD | 0.04 |
| DROUGHT | 0.03 |
| HURRICANE | 0.03 |
| RIVER FLOOD | 0.02 |
| ICE STORM | 0.02 |
Table 4 above clearly shows that the total combined economic impact (crop + property damage) is highly concentrated in four events, which together contribute roughly two thirds (~68%) of all economic impact, namely: floods, hurricanes/typhoons, tornadoes and storm surges.
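This concentration can be read directly off the weights computed above:
# combined weight of the top four events in Table 4
sum(eco$weight[1:4]) # ~0.68
The following section will evaluate these findings against their respective frequencies.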
2. Analysis
We can consider the frequency weight as the probability of the event happening; therefore, we can multiply this value by the injury weight, obtaining a final harmfulness index, which we will use to identify the final list of the most harmful weather event types.
# final table
names(harm) <- c("event.type","a","event.type","b")
top.harm <- full_join(harm[,1:2], harm[,3:4], by = "event.type") %>%
mutate(harm.index = (a * b)*100) %>%
select(event.type, harm.index) %>%
arrange(desc(harm.index)) %>%
group_by(event.type) %>%
summarise(harm.index = sum(harm.index)) %>%
arrange(desc(harm.index)) %>%
filter(!is.na(harm.index))
kable(top.harm, digits = 2, caption = "**Table 5: Top 5 most harmful events by index ranking**")
| event.type | harm.index |
|---|---|
| TORNADO | 4.37 |
| THUNDERSTORM WIND | 2.15 |
| HAIL | 0.31 |
| FLOOD | 0.14 |
| FLASH FLOOD | 0.08 |
| LIGHTNING | 0.06 |
top.harm$event.type <- as.factor(top.harm$event.type)
# plotting chart
ggplot(top.harm[1:5,], aes(event.type, harm.index)) + geom_bar(stat = "identity", alpha = .5, colour = "black") +
theme_bw() +
labs(x="Weather event types", y="Harmfulness Index (Freq * Injuries)",
title= "Figure 2. Top five most harmful weather events in the USA (1950 to 2011)")
As can be clearly seen from this analysis, tornadoes and thunderstorm winds have been by far the two most harmful weather events in the USA over the past half century, driving the highest numbers of injuries and deaths.
Following the same line of reasoning, we also multiply the total economic cost per event by its respective frequency (a proxy for likelihood), deriving an economic impact index, which we will use to identify the final list of the top five most economically impactful weather event types.
# accounting for frequencies
d <- eco.freq[,c(1,3)]
names(d) <- c("event.type", "weight.freq")
e <- eco[,c(1,3)]
names(e) <- c("event.type", "weight.cost")
f <- left_join(e, d, by = "event.type")
final.eco <- f %>%
mutate(eco.index = (weight.cost * weight.freq)*100) %>%
arrange(desc(eco.index)) %>%
select(-(2:3))
kable(final.eco[1:5,], digits = 2, caption = "**Table 6: Top 5 events with the highest economic impact by index ranking**")
| event.type | eco.index |
|---|---|
| TORNADO | 1.43 |
| FLOOD | 1.26 |
| HAIL | 0.83 |
| FLASH FLOOD | 0.28 |
| THUNDERSTORM WIND | 0.15 |
# plotting chart
ggplot(final.eco[1:5,], aes(event.type, eco.index)) + geom_bar(stat = "identity", alpha = .5, colour = "black") +
theme_bw() +
labs(x="Weather event types", y="Economic impact Index (Freq * Total cost)",
title= "Figure 3. Top five weather events with the highest economic impact in the USA (1950 to 2011)")
The consolidating chart above, Figure 3, shows the five event types with the highest economic impact, moderated by their respective frequencies. Not surprisingly, the top five events are the same as those in the harmfulness ranking, as factoring in the frequency of each event has a large influence on this classification.
Nevertheless, while Tornadoes are the most critical event on both dimensions (economic and health), Floods and Hail ranked higher than Thunderstorms on total economic impact.
3. Results
In conclusion, the analysis of the NOAA storm database (1950-2011) shows the following three relevant findings:
1. Tornadoes and thunderstorm winds have been the events most harmful to population health once impact is weighted by frequency (Table 5, Figure 2).
2. Tornadoes, floods and hail have had the greatest frequency-weighted economic impact (Table 6, Figure 3), with floods, hurricanes/typhoons, tornadoes and storm surges alone driving roughly two thirds of the raw economic damage (Table 4).
3. The top five events are the same on both dimensions, so prioritising resources against these events addresses both population health and economic risk at once.
These rankings are based on data covering the entire United States, from 1950 to 2011. Further analysis would be required to identify variances across specific areas, as well as any particular trends over time.