This report analyzes the NOAA storm database between 1950 and November 2011 in order to find the most harmful weather event types regarding population health and economic damage.
Tornadoes are the top cause of harm to population health.
Floods are the top cause of economic damage.
original_data <- read.csv("repdata-data-StormData.csv.bz2")
The event type is stored in the column EVTYPE. Harm to health combines the columns FATALITIES and INJURIES, and economic damage combines the columns PROPDMG and CROPDMG.
Each damage column has a corresponding exponent column (PROPDMGEXP and CROPDMGEXP) that scales the amount. The valid exponents are ‘K’, ‘M’ and ‘B’ for thousands, millions and billions respectively.
To get raw dollar values for the damages, we multiply each value by the multiplier that corresponds to its exponent. Any other exponent code, including an empty or lowercase one, is treated as a multiplier of 1.
numeric_exponent <- function(exponent) {
  # Map an exponent code to its multiplier; any other code counts as 1
  if (exponent == 'K') {
    1000
  } else if (exponent == 'M') {
    1000000
  } else if (exponent == 'B') {
    1000000000
  } else {
    1
  }
}
data <- data.frame(evtype = original_data$EVTYPE,
                   fatalities = original_data$FATALITIES,
                   injuries = original_data$INJURIES,
                   propdmg = original_data$PROPDMG * sapply(original_data$PROPDMGEXP,
                                                            numeric_exponent),
                   cropdmg = original_data$CROPDMG * sapply(original_data$CROPDMGEXP,
                                                            numeric_exponent))
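Calling `numeric_exponent` once per row through `sapply` can be slow on a data set of this size. As a minimal sketch of a vectorized alternative, a named lookup vector could do the same job. The names `multipliers` and `numeric_exponent_vec` are hypothetical and not used elsewhere in this report, and the toy values are made up. The sketch is meant to apply the same mapping as `numeric_exponent`, including the fallback to 1 for every other code.

```r
# Lookup table: exponent code -> multiplier (same mapping as numeric_exponent)
multipliers <- c(K = 1e3, M = 1e6, B = 1e9)

# Hypothetical vectorized helper, not part of the original analysis
numeric_exponent_vec <- function(exponent) {
  # Indexing by name yields NA for codes that are not in the table
  m <- multipliers[as.character(exponent)]
  m[is.na(m)] <- 1  # any other code (including "") counts as 1
  unname(m)
}

# Made-up toy values, for illustration only
toy_value <- c(25, 2.5, 1, 3)
toy_exp   <- c("K", "M", "", "B")
toy_dollars <- toy_value * numeric_exponent_vec(toy_exp)
toy_dollars
```

Because the helper accepts a whole vector, it could be applied to a full exponent column in one call instead of going through `sapply`.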
We add a total damage column containing the sum of propdmg and cropdmg, and a total harm column containing the sum of fatalities and injuries (although one could well question whether it is correct to simply add these two types of harm).
data$dmg <- data$cropdmg + data$propdmg
data$harm <- data$fatalities + data$injuries
str(data)
## 'data.frame': 902297 obs. of 7 variables:
## $ evtype : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ fatalities: num 0 0 0 0 0 0 0 0 1 0 ...
## $ injuries : num 15 0 2 2 2 6 1 0 14 0 ...
## $ propdmg : num 25000 2500 25000 2500 2500 2500 2500 2500 25000 25000 ...
## $ cropdmg : num 0 0 0 0 0 0 0 0 0 0 ...
## $ dmg : num 25000 2500 25000 2500 2500 2500 2500 2500 25000 25000 ...
## $ harm : num 15 0 2 2 2 6 1 0 15 0 ...
We see that there are slightly more than 900,000 events (902,297) in the data.
To find the event types with the most harm or damage, we group the data by event type and calculate the total harm and damage for each group.
agg <- aggregate(list(fatalities = data$fatalities,
                      injuries = data$injuries,
                      harm = data$harm,
                      propdmg = data$propdmg,
                      cropdmg = data$cropdmg,
                      dmg = data$dmg),
                 list(evtype = data$evtype),
                 sum)
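To make the grouping step concrete, here is a minimal sketch on a made-up toy data frame. It uses the formula interface of `aggregate`, which is an alternative spelling and not the call used in this report. Note that the formula interface drops rows with missing values by default.

```r
# Made-up toy data, for illustration only
toy <- data.frame(
  evtype = c("FLOOD", "TORNADO", "FLOOD", "HAIL"),
  harm   = c(2, 10, 3, 0),
  dmg    = c(100, 50, 200, 10)
)

# Sum harm and dmg within each event type
toy_agg <- aggregate(cbind(harm, dmg) ~ evtype, data = toy, FUN = sum)
toy_agg
```

The result has one row per event type, with the two FLOOD rows collapsed into a single row holding their sums.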
Now let's extract the top five event types for each harm and damage column.
top_fatalities <- head(agg[order(agg$fatalities, decreasing = TRUE), c('evtype', 'fatalities')], n = 5)
top_injuries <- head(agg[order(agg$injuries, decreasing = TRUE), c('evtype', 'injuries')], n = 5)
top_harm <- head(agg[order(agg$harm, decreasing = TRUE), c('evtype', 'harm')], n = 5)
top_propdmg <- head(agg[order(agg$propdmg, decreasing = TRUE), c('evtype', 'propdmg')], n = 5)
top_cropdmg <- head(agg[order(agg$cropdmg, decreasing = TRUE), c('evtype', 'cropdmg')], n = 5)
top_dmg <- head(agg[order(agg$dmg, decreasing = TRUE), c('evtype', 'dmg')], n = 5)
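The six lines above repeat the same order-then-head pattern. As a sketch, that pattern could be wrapped in a small helper; `top_n_by` is a hypothetical name, and the toy aggregate below is made up for illustration.

```r
# Hypothetical helper: the n rows with the largest values in `column`
top_n_by <- function(df, column, n = 5) {
  ranked <- df[order(df[[column]], decreasing = TRUE), c("evtype", column)]
  head(ranked, n = n)
}

# Made-up toy aggregate, for illustration only
toy_agg <- data.frame(
  evtype = c("FLOOD", "HAIL", "TORNADO"),
  harm   = c(5, 0, 10),
  dmg    = c(300, 10, 50)
)

top_harm_toy <- top_n_by(toy_agg, "harm", n = 2)
top_dmg_toy  <- top_n_by(toy_agg, "dmg", n = 2)
top_harm_toy
```

With such a helper, each of the six tables would come from one short call that differs only in the column name.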
Tornado is the top cause of both fatalities and injuries. Consequently, the total harm (defined ad hoc as the sum of fatalities and injuries) is also highest for tornado.
top_fatalities
## evtype fatalities
## 834 TORNADO 5633
## 130 EXCESSIVE HEAT 1903
## 153 FLASH FLOOD 978
## 275 HEAT 937
## 464 LIGHTNING 816
top_injuries
## evtype injuries
## 834 TORNADO 91346
## 856 TSTM WIND 6957
## 170 FLOOD 6789
## 130 EXCESSIVE HEAT 6525
## 464 LIGHTNING 5230
top_harm
## evtype harm
## 834 TORNADO 96979
## 130 EXCESSIVE HEAT 8428
## 856 TSTM WIND 7461
## 170 FLOOD 7259
## 464 LIGHTNING 6046
Doing the same for the economic damages, we see that property damage and crop damage differ: flood caused the highest property damage total, while drought caused the highest crop damage total. Taken together, the top cause of economic damage is flood, because property damage dominates the combined total.
top_propdmg
## evtype propdmg
## 170 FLOOD 144657709807
## 411 HURRICANE/TYPHOON 69305840000
## 834 TORNADO 56925660790
## 670 STORM SURGE 43323536000
## 153 FLASH FLOOD 16140812067
top_cropdmg
## evtype cropdmg
## 95 DROUGHT 13972566000
## 170 FLOOD 5661968450
## 590 RIVER FLOOD 5029459000
## 427 ICE STORM 5022113500
## 244 HAIL 3025537890
top_dmg
## evtype dmg
## 170 FLOOD 150319678257
## 411 HURRICANE/TYPHOON 71913712800
## 834 TORNADO 57340614060
## 670 STORM SURGE 43323541000
## 244 HAIL 18752904943
palette <- c(rgb(1, 0, 0),
rgb(0.8, 0.5, 0),
rgb(0.7, 0.45, 0),
rgb(0.6, 0.4, 0),
rgb(0.5, 0.35, 0))
barplot(top_harm$harm,
        legend.text = top_harm$evtype,
        col = palette,
        ylab = 'Harm (Fatalities + Injuries)',
        main = 'Top five causes of storm related fatalities and injuries')
The barplot above shows the total harm (fatalities plus injuries) for each of the top five event types over the full data period (1950 through November 2011).
barplot(top_dmg$dmg / 1000000000,
        legend.text = top_dmg$evtype,
        col = palette,
        ylab = 'Damage [B$]',
        main = 'Top five causes of storm related damage')
The barplot above shows the total damage (property plus crop) for each of the top five event types over the full data period (1950 through November 2011).
This report only looks at totals over the whole period from 1950 to 2011. One obvious question to investigate is whether the damage totals changed over this time. In other words, were the last ten years, for example, different from the 1950s?
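One way to approach this question would be to total the damage per decade. The sketch below assumes that each event carries a date (the raw file would need a date column, which this report does not load or inspect) and uses made-up toy values. The column names `date` and `decade` are hypothetical.

```r
# Made-up toy data, for illustration only
toy <- data.frame(
  date = as.Date(c("1953-06-08", "1957-05-20", "2005-08-29", "2011-04-27")),
  dmg  = c(10, 5, 200, 40)
)

# Derive the decade from the year, e.g. 1957 -> 1950
toy$decade <- (as.integer(format(toy$date, "%Y")) %/% 10) * 10

# Total damage per decade
dmg_by_decade <- aggregate(dmg ~ decade, data = toy, FUN = sum)
dmg_by_decade
```

The same grouping could include the event type as a second grouping variable to see whether the ranking of event types shifts between decades.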
Another question to pursue is which event type has the highest harm or damage per event.
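A per-event figure could be obtained by dividing each total by the number of events of that type. The following is a minimal sketch on made-up toy data, not a result from the storm database; `per_event` and `mean_harm` are hypothetical names.

```r
# Made-up toy data, for illustration only
toy <- data.frame(
  evtype = c(rep("TORNADO", 5), rep("HEAT", 2)),
  harm   = c(0, 3, 0, 12, 0, 8, 4)
)

# Total harm and number of events per event type
totals <- aggregate(harm ~ evtype, data = toy, FUN = sum)
counts <- aggregate(harm ~ evtype, data = toy, FUN = length)

per_event <- data.frame(
  evtype    = totals$evtype,
  total     = totals$harm,
  events    = counts$harm,
  mean_harm = totals$harm / counts$harm
)

# Rank by harm per event instead of by total harm
per_event[order(per_event$mean_harm, decreasing = TRUE), ]
```

In this toy example the event type with the larger total (TORNADO) has the smaller harm per event, which illustrates why the two rankings can differ.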
Finally, as far as I can tell, the damage numbers are dollar values for the time when the event happened. If we were to start comparing earlier periods with later periods, then we would have to adjust the values for inflation.