Synopsis

The United States National Oceanic and Atmospheric Administration’s (NOAA) storm database tracks characteristics of major storms and weather events in the United States. This analysis assesses which types of events are, on average, most harmful with respect to injuries, fatalities and economic consequences. Events related to extreme heat are most harmful with respect to population health (injuries and fatalities). Events related to hurricanes (typhoons) and storm surge/tide are most harmful with respect to economic consequences.

Data Processing

The data from the NOAA Storm Database is provided as a bzip2-compressed file. The file is loaded with read.csv, which decompresses it transparently while reading, so no separate decompression step is needed.

library(data.table)
library(xtable)
data <- read.csv("E:/Coursera/RR/repdata-data-StormData.csv.bz2")
data <- as.data.table(data)
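The transparent decompression can be checked on a small file. The following is a minimal sketch, assuming a throwaway temporary file and a made-up two-row table (toy and toy_back are illustrative names, not part of the analysis). It writes a bzip2-compressed CSV through a bzfile connection and reads it back with read.csv by file name.

```r
# Hypothetical toy data, not taken from the NOAA file.
toy <- data.frame(EVTYPE = c("HAIL", "TORNADO"), INJURIES = c(0, 3))

# Write the table as a bzip2-compressed CSV in a temporary location.
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv opens the file by name and decompresses it on the fly.
toy_back <- read.csv(tmp)
print(toy_back)

# Remove the temporary file again.
unlink(tmp)
```

The same call pattern applies to the full storm data file; only the path differs.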

According to the documentation of the database, there are 48 permitted event types. The list of permitted event types has been copied from the PDF documentation to a flat file and stored so that it is accessible for this analysis. Here is what the first five entries of the list look like.

permitted_events <- read.csv("E:/Coursera/RR/permitted_events.csv")
print(xtable(head(permitted_events, 5)), type="html")
  Event.Name             Designator
1 Astronomical Low Tide  Z
2 Avalanche              Z
3 Blizzard               Z
4 Coastal Flood          Z
5 Cold/Wind Chill        Z
unique_event_types_db <- length(unique(data$EVTYPE))
correct_cat <- round(nrow(subset(data, EVTYPE %in% permitted_events$Event.Name))/nrow(data)*100,1)

A quick look at the event types in the database indicates that a great many of the assigned event types do not correspond to the permitted list. The number of unique event categories used in the database, 985, far exceeds 48, and the proportion of records with a correctly used event type is remarkably low, at 0 percent.
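The share of permitted labels is simply the proportion of records whose event type appears in the permitted list. Below is a minimal sketch on made-up labels (observed and permitted are illustrative vectors, not the real data). Because %in% requires an exact, case-sensitive match, differences in casing alone may account for a very low initial share.

```r
# Hypothetical event labels and a shortened permitted list, for illustration only.
observed  <- c("Hail", "TSTM WIND", "Tornado", "Hail", "HIGH WINDS")
permitted <- c("Hail", "Tornado", "Thunderstorm Wind", "High Wind")

# %in% is case-sensitive and requires an exact match.
is_permitted <- observed %in% permitted

# Share of records with a permitted label, in percent.
share_permitted <- round(mean(is_permitted) * 100, 1)
print(share_permitted)

# Labels that still need cleaning.
needs_cleaning <- unique(observed[!is_permitted])
print(needs_cleaning)
```

Listing the unmatched labels, as in the last step, is a quick way to decide which corrections are worth writing first.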

This calls for some cleaning. First we ensure consistent casing and make the obvious corrections by removing all digits, commas, periods and ampersands, together with leading and trailing spaces.

data$EVTYPE <- toupper(data$EVTYPE)
permitted_events$Event.Name <- toupper(permitted_events$Event.Name)
data$EVTYPE <- gsub("[[:digit:],.\\&]|^\\s+|\\s+$", "", data$EVTYPE)
correct_cat <- round(nrow(subset(data, EVTYPE %in% permitted_events$Event.Name))/nrow(data)*100,1)

After these operations, 70.4 percent of the records have an event type in the permitted categories.
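One detail of single-pass cleaning is worth noting: a pattern anchored with $ only trims whitespace that was already at the end of the original string, so a label such as "TSTM WIND 50" may keep a trailing space once its digits are removed. The sketch below shows a possible variation that strips characters first and trims afterwards. It uses made-up labels and is not the code behind the figures reported here.

```r
# Hypothetical raw labels, for illustration only.
raw <- c("  tstm wind 50", "Hail 1.75", "Flash Flood.")

# Step 1: consistent casing.
clean <- toupper(raw)

# Step 2: drop digits, commas, periods and ampersands.
clean <- gsub("[[:digit:],.&]", "", clean)

# Step 3: collapse runs of whitespace, then trim both ends.
clean <- trimws(gsub("\\s+", " ", clean))

print(clean)
```

Trimming as a separate last step means whitespace exposed by the earlier removals is also cleaned up.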

Then we systematically go through the largest groups of obvious misclassifications and create correcting operations. These include expanding the abbreviation TSTM to the full name THUNDERSTORM, and renaming event categories that contain both WILD and FIRE to WILDFIRE.

data$EVTYPE <- gsub("TSTM", "THUNDERSTORM", data$EVTYPE)
data$EVTYPE <- gsub(".*THUNDERSTORM WIND.*", "THUNDERSTORM WIND", data$EVTYPE)
data$EVTYPE <- gsub(".*HIGH WIND.*", "HIGH WIND", data$EVTYPE)
data$EVTYPE <- gsub(".*WILD.*FIRE.*", "WILDFIRE", data$EVTYPE)
data$EVTYPE <- gsub(".*WINTER.*WEATHER.*", "WINTER WEATHER", data$EVTYPE)
data$EVTYPE <- gsub(".*EXTREME COLD.*", "EXTREME COLD/WIND CHILL", data$EVTYPE)
data$EVTYPE <- gsub(".*FLASH.*FLOOD.*", "FLASH FLOOD", data$EVTYPE)
correct_cat <- round(nrow(subset(data, EVTYPE %in% permitted_events$Event.Name))/nrow(data)*100,1)

These operations raise the share of records in permitted categories to 98.6 percent. This is a much more acceptable level and should make the results of this analysis more useful.
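The corrections are order-dependent: TSTM has to be expanded before the THUNDERSTORM WIND pattern can catch those labels. One way to make that order explicit is to keep the rules in a named vector and apply them in sequence. This is a sketch on made-up labels; the rules vector is an illustrative restructuring, not the form used in the analysis.

```r
# Ordered correction rules: names are regular expressions, values are replacements.
rules <- c(
  "TSTM"                  = "THUNDERSTORM",
  ".*THUNDERSTORM WIND.*" = "THUNDERSTORM WIND",
  ".*WILD.*FIRE.*"        = "WILDFIRE"
)

# Hypothetical labels, already upper-cased.
labels <- c("TSTM WIND", "THUNDERSTORM WINDS", "WILD/FOREST FIRE", "HAIL")

# Apply each rule in turn; later rules see the output of earlier ones.
for (pattern in names(rules)) {
  labels <- gsub(pattern, rules[[pattern]], labels)
}

print(labels)
```

Labels that match no rule, such as HAIL here, pass through unchanged.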

After grouping and summing the data across the variables of interest for this analysis, we further improved the quality of the classifications with the following corrections. Without them, even more unpermitted categories would have appeared among the most harmful event types.

data$EVTYPE <- gsub(".*HURRICANE.*|.*TYPHOON.*", "HURRICANE (TYPHOON)", data$EVTYPE)
data$EVTYPE <- gsub(".*SEVERE THUNDERSTORM.*", "THUNDERSTORM WIND", data$EVTYPE)
data$EVTYPE <- gsub(".*THUNDERSTORMW.*", "THUNDERSTORM WIND", data$EVTYPE)
data$EVTYPE <- gsub(".*STORM SURGE.*", "STORM SURGE/TIDE", data$EVTYPE)
data$EVTYPE <- gsub(".*DAMAGING FREEZE.*", "FROST/FREEZE", data$EVTYPE)
data$EVTYPE <- gsub(".*WATERSPOUT.*", "WATERSPOUT", data$EVTYPE)
data$EVTYPE <- gsub(".*TROPICAL STORM.*", "TROPICAL STORM", data$EVTYPE)
data$EVTYPE <- gsub(".*WINTER STORMS.*", "WINTER STORM", data$EVTYPE)
correct_cat <- round(nrow(subset(data, EVTYPE %in% permitted_events$Event.Name))/nrow(data)*100,1)

These operations do not change the share of records in permitted categories much. The level is now 98.7 percent.

For the last part of this analysis we look into the economic consequences of the extreme weather events captured in the database. The economic damage is captured in variables expressing expenses related to property damage and expenses related to crop damage. The unit used in these variables differs across events, so in order to compare expenses across events we need to translate the values into the same unit, with the help of the unit category variables. We choose million US dollars (10^6) as the base unit. Some expenses have an unknown unit; these are translated into NA and counted as 0 when calculating total damage.

data$CROPDMGEXP <- toupper(data$CROPDMGEXP)
data$PROPDMGEXP <- toupper(data$PROPDMGEXP)

letter <- c("K","M","B")
scale <- c(10^-3,10^0,10^3)

data$PropDmgVal <- data$PROPDMG * scale[match(data$PROPDMGEXP, letter)]
data$CropDmgVal <- data$CROPDMG * scale[match(data$CROPDMGEXP, letter)]
data$TotalDmg <- ifelse(is.na(data$PropDmgVal),0,data$PropDmgVal) + ifelse(is.na(data$CropDmgVal),0,data$CropDmgVal)
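The conversion relies on match() returning NA for any unit code outside K, M and B, which in turn makes the converted value NA. A minimal sketch on made-up records (the dmg data frame and the "?" code are illustrative) shows the effect:

```r
# Hypothetical damage records; "?" stands in for an unknown unit code.
dmg <- data.frame(
  PROPDMG    = c(25, 2.5, 1, 10),
  PROPDMGEXP = c("K", "M", "B", "?"),
  stringsAsFactors = FALSE
)

letter <- c("K", "M", "B")
scale  <- c(10^-3, 10^0, 10^3)  # multipliers that convert to million US dollars

# match() gives the position of each code in `letter`, or NA if it is absent.
dmg$PropDmgVal <- dmg$PROPDMG * scale[match(dmg$PROPDMGEXP, letter)]

# Unknown units count as 0 in the total.
total <- sum(ifelse(is.na(dmg$PropDmgVal), 0, dmg$PropDmgVal))

print(dmg)
print(total)
```

Here 25 K becomes 0.025 million, 1 B becomes 1000 million, and the record with the unknown code contributes nothing to the total.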

Questions

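The following questions are addressed in this analysis: which types of events are most harmful with respect to population health (injuries and fatalities), and which types of events have the greatest economic consequences?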

Results

Population health is measured as the count of injuries and fatalities corresponding to each event. The two quantities might not be directly summable, so the most harmful types of events have been assessed for each variable separately. The results are presented as the 8 most harmful types with respect to average number of injuries per event and average number of fatalities per event, as a table and a corresponding boxplot. For both injuries and fatalities, event types with fewer than 10 events in the database are filtered out. The argument behind this decision is twofold: all of these are non-permitted categories, and the average of just a few counts might not be a representative estimate of the true mean for any category.
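The effect of the minimum-count filter can be illustrated on made-up data. The sketch below uses base R only so that it runs without extra packages (the analysis itself uses data.table); events, summ and min_count are illustrative names.

```r
# Hypothetical injury counts per event; "RARE TYPE" has only 2 records.
events <- data.frame(
  EVTYPE   = c(rep("HEAT", 12), rep("HAIL", 15), rep("RARE TYPE", 2)),
  INJURIES = c(rep(c(0, 6), 6), rep(c(0, 0, 3), 5), 40, 60),
  stringsAsFactors = FALSE
)

# Per-type total, mean and record count.
by_type <- split(events$INJURIES, events$EVTYPE)
summ <- data.frame(
  EVTYPE = names(by_type),
  sum    = sapply(by_type, sum),
  mean   = round(sapply(by_type, mean), 2),
  count  = sapply(by_type, length),
  row.names = NULL,
  stringsAsFactors = FALSE
)

# Drop thinly populated types, then rank by mean.
min_count <- 10
summ <- summ[summ$count >= min_count, ]
summ <- summ[order(-summ$mean), ]
print(summ)
```

Without the filter, RARE TYPE would top the ranking on the strength of just two records.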

Injuries

data <- as.data.table(data)
evtypeInj <- data[,.(sum=sum(INJURIES), mean = round(mean(INJURIES),2), sd = round(sd(INJURIES),2), count=sum(!(is.na(INJURIES)))),by=list(EVTYPE)][order(-mean)]
evtypeInj <- subset(evtypeInj, count>=10)
print(xtable(evtypeInj[1:8]), type="html")
  EVTYPE                   sum   mean     sd  count
1 EXTREME HEAT          155.00   7.05  18.83     22
2 TSUNAMI               129.00   6.45  28.85     20
3 HEAT WAVE             379.00   5.05  25.06     75
4 GLAZE                 216.00   5.02  12.25     43
5 HURRICANE (TYPHOON)  1333.00   4.47  49.05    298
6 EXCESSIVE HEAT       6525.00   3.89  26.13   1678
7 HEAT                 2100.00   2.74  20.34    767
8 MIXED PRECIP           26.00   2.60   4.77     10
inj <- subset(data, EVTYPE %in% evtypeInj[1:8]$EVTYPE)
inj$EVTYPE = factor(inj$EVTYPE, evtypeInj[8:1]$EVTYPE)
par(mar=c(4.5,12,3,1))
boxplot(INJURIES~EVTYPE, inj, las=1, horizontal=TRUE, xlab="Number of injuries", main="8 most harmful types of events with respect to number of injuries")

From the table, both extreme heat and tsunami look very harmful on average, but the boxplot shows that the high average for tsunami is mostly due to a single high value.
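How strongly one extreme event can pull up a mean is easy to see on a small made-up vector. The numbers below are hypothetical and are not the actual tsunami records.

```r
# Hypothetical injury counts for 20 events of one type: mostly zero, one extreme value.
injuries <- c(rep(0, 17), 1, 2, 97)

# The mean is dominated by the single large event.
mean_all <- mean(injuries)

# The median is unaffected by it.
median_all <- median(injuries)

# Mean after leaving out the largest event.
mean_without_max <- mean(injuries[-which.max(injuries)])

print(c(mean = mean_all, median = median_all, mean_without_max = mean_without_max))
```

This is why the boxplot is a useful companion to the table of means.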

Looking at averages is a good way to address which types of events are most harmful, but it is by no means the only way. Another approach is to look at the events that cause the most damage overall. By that measure, the most harmful event type with respect to injuries is

evtypeInj[which.max(evtypeInj$sum),]
##     EVTYPE   sum mean    sd count
## 1: TORNADO 91346 1.51 17.18 60652
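Ranking by total and ranking by mean can disagree because the total also reflects how often an event type occurs. A minimal sketch with two made-up event types (FREQUENT and RARE are illustrative labels):

```r
# Hypothetical data: a frequent type with few injuries per event
# and a rare type with many injuries per event.
events <- data.frame(
  EVTYPE   = c(rep("FREQUENT", 1000), rep("RARE", 10)),
  INJURIES = c(rep(1, 1000), rep(20, 10)),
  stringsAsFactors = FALSE
)

# Total and mean injuries per event type.
totals <- tapply(events$INJURIES, events$EVTYPE, sum)
means  <- tapply(events$INJURIES, events$EVTYPE, mean)

# The top type depends on the measure chosen.
top_by_total <- names(totals)[which.max(totals)]
top_by_mean  <- names(means)[which.max(means)]

print(c(by_total = top_by_total, by_mean = top_by_mean))
```

Which measure is more appropriate depends on whether the interest lies in the typical severity of a single event or in the cumulative burden over time.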

Fatalities

evtypeFat <- data[,.(sum=sum(FATALITIES), mean = round(mean(FATALITIES),2), sd = round(sd(FATALITIES),2), count=sum(!(is.na(FATALITIES)))),by=list(EVTYPE)][order(-mean)]
evtypeFat <- subset(evtypeFat, count>=10)
print(xtable(evtypeFat[1:8]), type="html")
  EVTYPE                         sum   mean     sd  count
1 EXTREME HEAT                 96.00   4.36  12.42     22
2 HEAT WAVE                   172.00   2.29   5.54     75
3 UNSEASONABLY WARM AND DRY    29.00   2.23   8.04     13
4 TSUNAMI                      33.00   1.65   7.15     20
5 HEAT                        937.00   1.22  21.10    767
6 EXCESSIVE HEAT             1903.00   1.13   4.78   1678
7 RIP CURRENT                 368.00   0.78   0.63    470
8 RIP CURRENTS                204.00   0.67   0.63    304
fat <- subset(data, EVTYPE %in% evtypeFat[1:8]$EVTYPE)
fat$EVTYPE = factor(fat$EVTYPE, evtypeFat[8:1]$EVTYPE)
par(mar=c(4.5,15,3,1))
boxplot(FATALITIES~EVTYPE, fat, las=1, horizontal=TRUE, xlab="Number of fatalities", main="8 most harmful types of events with respect to number of fatalities")

From the table and the boxplot we see that extreme heat and other heat-related events look most harmful with regard to fatalities, as they do for injuries.

Again, when we look at the event types that cause the most damage overall, a different event type turns out to be most harmful. With respect to fatalities we find

evtypeFat[which.max(evtypeFat$sum),]
##     EVTYPE  sum mean   sd count
## 1: TORNADO 5633 0.09 1.41 60652

Economic Consequences

Extreme weather events do not only harm living creatures; they also have measurable economic consequences for property and crops. We take the total economic consequences to be the sum of property damage and crop damage expenses and repeat the process from above. Looking at average expenses per event in million (10^6) US dollars, the picture is as follows:

evtypeEco <- data[,.(sum=round(sum(TotalDmg),2), mean = round(mean(TotalDmg),2), sd = round(sd(TotalDmg),2), count=sum(!(is.na(TotalDmg)))),by=list(EVTYPE)][order(-mean)]
evtypeEco <- subset(evtypeEco, count>=10)
print(xtable(evtypeEco[1:8]), type="html")
  EVTYPE                     sum    mean       sd  count
1 HURRICANE (TYPHOON)   90762.53  304.57  1419.51    298
2 STORM SURGE/TIDE      47965.58  117.28  1654.84    409
3 RIVER FLOOD           10148.40   58.66   760.23    173
4 TROPICAL STORM         8409.29   12.06   196.93    697
5 TSUNAMI                 144.08    7.20    18.62     20
6 DROUGHT               15018.67    6.04    43.97   2488
7 FREEZE                  456.93    6.01    25.58     76
8 FLOOD                150319.68    5.94   723.40  25327
eco <- subset(data, EVTYPE %in% evtypeEco[1:8]$EVTYPE)
eco$EVTYPE = factor(eco$EVTYPE, evtypeEco[8:1]$EVTYPE)
par(mar=c(4.5,11,3,1))
boxplot(TotalDmg~EVTYPE, eco, las=1, horizontal=TRUE, xlab="Expenses in million US dollars (10^6)", main="8 most harmful types of events with respect to total damage\n(property damage + crop damage)")

Again, when we look at the event types that cause the most damage overall, a different event type turns out to be most harmful. With respect to economic consequences, we find

evtypeEco[which.max(evtypeEco$sum),]
##    EVTYPE      sum mean    sd count
## 1:  FLOOD 150319.7 5.94 723.4 25327

Some of the categories reported in this analysis are still not in line with the documented list of permitted event types. Correctly classifying or removing them requires domain expertise.