The United States National Oceanic and Atmospheric Administration’s (NOAA) storm database tracks characteristics of major storms and weather events in the United States. This analysis assesses which types of events are, on average, most harmful with respect to injuries, fatalities and economic consequences. With respect to population health (injuries and fatalities), the most harmful events are those related to extreme heat. With respect to economic consequences, the most harmful events are hurricanes (typhoons) and storm surges/tides.
The data from the NOAA Storm Database is provided as a bzip2-compressed file. The file is loaded with read.csv, which decompresses it on the fly, so no separate decompression step is needed.
library(data.table)
library(xtable)
data <- read.csv("E:/Coursera/RR/repdata-data-StormData.csv.bz2")
data <- as.data.table(data)
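The following is a minimal sketch of why no manual decompression is needed. It writes a small, invented data frame to a temporary bzip2 file and reads it back with read.csv. The toy values and the temporary path are assumptions for illustration only, not part of the storm data.

```r
# Toy data; the column names mimic the storm database, the values are invented.
toy <- data.frame(EVTYPE = c("TORNADO", "HAIL"), INJURIES = c(3, 0))

# Write a bzip2-compressed CSV to a temporary file.
path <- tempfile(fileext = ".csv.bz2")
con <- bzfile(path, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() detects the compression and decompresses while reading.
roundtrip <- read.csv(path, stringsAsFactors = FALSE)
print(roundtrip)
```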
According to the database documentation, there are 48 permitted event types. The list of permitted event types has been copied from the PDF documentation into a flat file and stored so that it is accessible for this analysis. Here is what the first five entries of the list look like.
permitted_events <- read.csv("E:/Coursera/RR/permitted_events.csv")
print(xtable(head(permitted_events, 5)), type="html")
| | Event.Name | Designator |
|---|---|---|
| 1 | Astronomical Low Tide | Z |
| 2 | Avalanche | Z |
| 3 | Blizzard | Z |
| 4 | Coastal Flood | Z |
| 5 | Cold/Wind Chill | Z |
unique_event_types_db <- length(unique(data$EVTYPE))
correct_cat <- round(nrow(subset(data, EVTYPE %in% permitted_events$Event.Name))/nrow(data)*100,1)
A quick look at the event types in the database shows that a great many of the assigned event types do not correspond to the permitted list. The number of unique event categories used in the database, 985, far exceeds 48, and the proportion of records with a permitted event type is remarkably low, at 0 percent.
This calls for some cleaning. First we harmonise the casing and make the obvious corrections by removing all digits, commas, periods and ampersands, together with leading and trailing whitespace.
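As a rough illustration of how this proportion is computed, the sketch below applies the same %in% comparison to two invented vectors. The helper share_permitted and the toy values are hypothetical. It also shows that the comparison is exact and case-sensitive, which is one plausible reason for a very low initial match rate when the two sources use different casing.

```r
# Hypothetical miniature versions of the permitted list and the EVTYPE column.
permitted <- c("Tornado", "Hail", "Flood")
evtype <- c("TORNADO", "HAIL", "Flood", "TSTM WIND", "Hail")

# Percentage of records whose event type exactly matches a permitted name.
share_permitted <- function(x, allowed) round(mean(x %in% allowed) * 100, 1)

share_raw   <- share_permitted(evtype, permitted)                    # case-sensitive
share_upper <- share_permitted(toupper(evtype), toupper(permitted))  # same casing
print(c(raw = share_raw, upper = share_upper))
```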
data$EVTYPE <- toupper(data$EVTYPE)
permitted_events$Event.Name <- toupper(permitted_events$Event.Name)
data$EVTYPE <- gsub("[[:digit:],.\\&]|^\\s+|\\s+$", "", data$EVTYPE)
correct_cat <- round(nrow(subset(data, EVTYPE %in% permitted_events$Event.Name))/nrow(data)*100,1)
These operations bring the share of records with a permitted event type to 70.4 percent.
Then we systematically go through the largest groups of obvious misclassifications and create correcting operations, including expanding the abbreviation TSTM to the full name THUNDERSTORM and renaming event categories that contain both WILD and FIRE to WILDFIRE.
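To make the effect of the cleaning expression concrete, here is a small sketch that applies the same pattern to a few invented strings. Because gsub works in a single pass, a space that precedes removed trailing digits is left behind. The extra trimws call is an optional second pass shown only for illustration and is not part of the analysis above.

```r
# Invented examples of messy event type labels.
raw <- c("  Thunderstorm Wind ", "FLOOD.", "HAIL 175")

# Same casing and the same pattern as in the cleaning step above.
cleaned <- gsub("[[:digit:],.\\&]|^\\s+|\\s+$", "", toupper(raw))
print(cleaned)

# "HAIL 175" keeps the space that stood before the digits;
# a second trimming pass removes it (illustrative, not in the original).
trimmed <- trimws(cleaned)
print(trimmed)
```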
data$EVTYPE <- gsub("TSTM", "THUNDERSTORM", data$EVTYPE)
data$EVTYPE <- gsub(".*THUNDERSTORM WIND.*", "THUNDERSTORM WIND", data$EVTYPE)
data$EVTYPE <- gsub(".*HIGH WIND.*", "HIGH WIND", data$EVTYPE)
data$EVTYPE <- gsub(".*WILD.*FIRE.*", "WILDFIRE", data$EVTYPE)
data$EVTYPE <- gsub(".*WINTER.*WEATHER.*", "WINTER WEATHER", data$EVTYPE)
data$EVTYPE <- gsub(".*EXTREME COLD.*", "EXTREME COLD/WIND CHILL", data$EVTYPE)
data$EVTYPE <- gsub(".*FLASH.*FLOOD.*", "FLASH FLOOD", data$EVTYPE)
correct_cat <- round(nrow(subset(data, EVTYPE %in% permitted_events$Event.Name))/nrow(data)*100,1)
These operations bring the share of records with a permitted event type to 98.6 percent. This is a much more acceptable level and makes the results of this analysis more useful.
After grouping and summing the data across the variables of interest for this analysis, we further improved the quality of the classifications with the following corrections. Without them, even more unpermitted categories would have appeared among the most harmful event types.
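The pattern-based corrections can be sketched on a handful of invented labels. Wrapping the key words in `.*` makes the pattern match the whole string, so the entire label is replaced by the permitted name. The input vector is an assumption for illustration.

```r
# Invented labels resembling the misclassified categories.
evtype <- c("TSTM WIND", "TSTM WIND/HAIL", "WILD/FOREST FIRE",
            "FLASH FLOODING", "HAIL")

# Expand the abbreviation first, then collapse variants to one permitted name.
evtype <- gsub("TSTM", "THUNDERSTORM", evtype)
evtype <- gsub(".*THUNDERSTORM WIND.*", "THUNDERSTORM WIND", evtype)
evtype <- gsub(".*WILD.*FIRE.*", "WILDFIRE", evtype)
evtype <- gsub(".*FLASH.*FLOOD.*", "FLASH FLOOD", evtype)
print(evtype)
```

Note that the order matters: the TSTM expansion has to run before the THUNDERSTORM WIND pattern can catch those labels.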
data$EVTYPE <- gsub(".*HURRICANE.*|.*TYPHOON.*", "HURRICANE (TYPHOON)", data$EVTYPE)
data$EVTYPE <- gsub(".*SEVERE THUNDERSTORM.*", "THUNDERSTORM WIND", data$EVTYPE)
data$EVTYPE <- gsub(".*THUNDERSTORMW.*", "THUNDERSTORM WIND", data$EVTYPE)
data$EVTYPE <- gsub(".*STORM SURGE.*", "STORM SURGE/TIDE", data$EVTYPE)
data$EVTYPE <- gsub(".*DAMAGING FREEZE.*", "FROST/FREEZE", data$EVTYPE)
data$EVTYPE <- gsub(".*WATERSPOUT.*", "WATERSPOUT", data$EVTYPE)
data$EVTYPE <- gsub(".*TROPICAL STORM.*", "TROPICAL STORM", data$EVTYPE)
data$EVTYPE <- gsub(".*WINTER STORMS.*", "WINTER STORM", data$EVTYPE)
correct_cat <- round(nrow(subset(data, EVTYPE %in% permitted_events$Event.Name))/nrow(data)*100,1)
These operations do not change the share of records in permitted categories much. The level is now 98.7 percent.
For the last part of this analysis we look into the economic consequences of the extreme weather events captured in the database. The economic damage is captured in variables expressing expenses related to property damage and expenses related to crop damage. The unit used in these variables differs across events, so in order to compare expenses across events we need to translate the values into the same unit with the help of the unit category variables. We choose million US dollars (10^6) as the base unit. Some expenses have an unknown unit; these are translated into NA and counted as 0 when calculating total damage.
data$CROPDMGEXP <- toupper(data$CROPDMGEXP)
data$PROPDMGEXP <- toupper(data$PROPDMGEXP)
letter <- c("K","M","B")
scale <- c(10^-3,10^0,10^3)
data$PropDmgVal <- data$PROPDMG * scale[match(data$PROPDMGEXP, letter)]
data$CropDmgVal <- data$CROPDMG * scale[match(data$CROPDMGEXP, letter)]
data$TotalDmg <- ifelse(is.na(data$PropDmgVal),0,data$PropDmgVal) + ifelse(is.na(data$CropDmgVal),0,data$CropDmgVal)
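A small worked example may help to check the unit conversion. The amounts and unit codes below are invented, and the name scale_to_million is a hypothetical stand-in for the scale vector used above. Codes other than K, M and B have no match and therefore become NA.

```r
# Invented damage amounts with their unit codes.
dmg  <- c(25, 2.5, 1.2, 10, 5)
unit <- c("K", "m", "B", "", "?")

# Multipliers that express each unit in million US dollars.
letter <- c("K", "M", "B")
scale_to_million <- c(10^-3, 10^0, 10^3)

# match() returns NA for unknown codes, so the value becomes NA as well.
value <- dmg * scale_to_million[match(toupper(unit), letter)]
print(value)

# Unknown units are counted as 0 in the total.
total <- sum(ifelse(is.na(value), 0, value))
print(total)
```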
The following questions are addressed in this analysis:
Across the United States, which types of events are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?
Population health is measured as the count of injuries and fatalities for each event. The two quantities might not be directly summable, so the most harmful types of events have been assessed separately for each variable. The results are presented as the 8 most harmful types with respect to the average number of injuries per event and the average number of fatalities per event, each as a table and a corresponding boxplot. For both injuries and fatalities, event types with fewer than 10 events in the database are filtered out. This decision rests on two observations: all of these are non-permitted categories, and the average of just a few counts might not be a representative estimate of the true mean for a category.
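The effect of the minimum-count filter can be sketched with invented data. This example uses base R rather than the data.table syntax of the analysis, and the event types A, B and C are hypothetical. Type C has the highest mean but only two events, so it is dropped before ranking.

```r
# Invented events: A has 12 events, B has 10, C has only 2.
toy <- data.frame(
  EVTYPE   = c(rep("A", 12), rep("B", 10), rep("C", 2)),
  INJURIES = c(rep(1, 12), rep(c(0, 4), 5), 50, 30)
)

# Per-type sum, mean and number of events.
groups <- split(toy$INJURIES, toy$EVTYPE)
stats <- data.frame(
  EVTYPE = names(groups),
  sum    = sapply(groups, sum),
  mean   = sapply(groups, mean),
  count  = sapply(groups, length),
  row.names = NULL
)

# Keep types with at least 10 events, then rank by mean (highest first).
stats <- stats[stats$count >= 10, ]
stats <- stats[order(-stats$mean), ]
print(stats)
```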
Injuries
data <- as.data.table(data)
evtypeInj <- data[,.(sum=sum(INJURIES), mean = round(mean(INJURIES),2), sd = round(sd(INJURIES),2), count=sum(!(is.na(INJURIES)))),by=list(EVTYPE)][order(-mean)]
evtypeInj <- subset(evtypeInj, count>=10)
print(xtable(evtypeInj[1:8]), type="html")
| | EVTYPE | sum | mean | sd | count |
|---|---|---|---|---|---|
| 1 | EXTREME HEAT | 155.00 | 7.05 | 18.83 | 22 |
| 2 | TSUNAMI | 129.00 | 6.45 | 28.85 | 20 |
| 3 | HEAT WAVE | 379.00 | 5.05 | 25.06 | 75 |
| 4 | GLAZE | 216.00 | 5.02 | 12.25 | 43 |
| 5 | HURRICANE (TYPHOON) | 1333.00 | 4.47 | 49.05 | 298 |
| 6 | EXCESSIVE HEAT | 6525.00 | 3.89 | 26.13 | 1678 |
| 7 | HEAT | 2100.00 | 2.74 | 20.34 | 767 |
| 8 | MIXED PRECIP | 26.00 | 2.60 | 4.77 | 10 |
inj <- subset(data, EVTYPE %in% evtypeInj[1:8]$EVTYPE)
inj$EVTYPE = factor(inj$EVTYPE, evtypeInj[8:1]$EVTYPE)
par(mar=c(4.5,12,3,1))
boxplot(INJURIES~EVTYPE, inj, las=1, horizontal=TRUE, xlab="Number of injuries", main="8 most harmful types of events with respect to number of injuries")
From the table, both extreme heat and tsunami look very harmful on average, but the boxplot shows that the high average for tsunami is mostly due to a single high value.
Looking at the averages is a good way to address which types of events are most harmful, but it is by no means the only way. Another approach is to look at the events causing the most harm overall; by that measure, the most harmful event type with respect to injuries is
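The sensitivity of the mean to a single extreme event can be illustrated with a hypothetical vector. The values below are invented so that the sum and count agree with the TSUNAMI row of the table; the actual distribution in the database may differ.

```r
# Hypothetical injuries for 20 events: 19 without injuries, one extreme event.
injuries <- c(rep(0, 19), 129)

mean_inj   <- mean(injuries)    # pulled up by the single large value
median_inj <- median(injuries)  # unaffected by the single large value
sd_inj     <- round(sd(injuries), 2)

print(c(mean = mean_inj, median = median_inj, sd = sd_inj))
```

A median or a boxplot alongside the mean makes this kind of skew visible.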
evtypeInj[which.max(evtypeInj$sum),]
## EVTYPE sum mean sd count
## 1: TORNADO 91346 1.51 17.18 60652
Fatalities
evtypeFat <- data[,.(sum=sum(FATALITIES), mean = round(mean(FATALITIES),2), sd = round(sd(FATALITIES),2), count=sum(!(is.na(FATALITIES)))),by=list(EVTYPE)][order(-mean)]
evtypeFat <- subset(evtypeFat, count>=10)
print(xtable(evtypeFat[1:8]), type="html")
| | EVTYPE | sum | mean | sd | count |
|---|---|---|---|---|---|
| 1 | EXTREME HEAT | 96.00 | 4.36 | 12.42 | 22 |
| 2 | HEAT WAVE | 172.00 | 2.29 | 5.54 | 75 |
| 3 | UNSEASONABLY WARM AND DRY | 29.00 | 2.23 | 8.04 | 13 |
| 4 | TSUNAMI | 33.00 | 1.65 | 7.15 | 20 |
| 5 | HEAT | 937.00 | 1.22 | 21.10 | 767 |
| 6 | EXCESSIVE HEAT | 1903.00 | 1.13 | 4.78 | 1678 |
| 7 | RIP CURRENT | 368.00 | 0.78 | 0.63 | 470 |
| 8 | RIP CURRENTS | 204.00 | 0.67 | 0.63 | 304 |
fat <- subset(data, EVTYPE %in% evtypeFat[1:8]$EVTYPE)
fat$EVTYPE = factor(fat$EVTYPE, evtypeFat[8:1]$EVTYPE)
par(mar=c(4.5,15,3,1))
boxplot(FATALITIES~EVTYPE, fat, las=1, horizontal=TRUE, xlab="Number of fatalities", main="8 most harmful types of events with respect to number of fatalities")
From the table and the boxplot we see that extreme heat and other heat-related events look most harmful with regard to fatalities, as they do for injuries.
Again, looking at the events causing the most harm overall, a different event type turns out to be the most harmful. With respect to fatalities we find
evtypeFat[which.max(evtypeFat$sum),]
## EVTYPE sum mean sd count
## 1: TORNADO 5633 0.09 1.41 60652
Economic Consequences
Extreme weather events do not only harm living creatures; they also have measurable economic consequences for property and crops. Taking the total economic consequences as the sum of property damage expenses and crop damage expenses, and repeating the process from above by looking at average expenses in million (10^6) US dollars per event, the picture is as follows:
evtypeEco <- data[,.(sum=round(sum(TotalDmg),2), mean = round(mean(TotalDmg),2), sd = round(sd(TotalDmg),2), count=sum(!(is.na(TotalDmg)))),by=list(EVTYPE)][order(-mean)]
evtypeEco <- subset(evtypeEco, count>=10)
print(xtable(evtypeEco[1:8]), type="html")
| | EVTYPE | sum | mean | sd | count |
|---|---|---|---|---|---|
| 1 | HURRICANE (TYPHOON) | 90762.53 | 304.57 | 1419.51 | 298 |
| 2 | STORM SURGE/TIDE | 47965.58 | 117.28 | 1654.84 | 409 |
| 3 | RIVER FLOOD | 10148.40 | 58.66 | 760.23 | 173 |
| 4 | TROPICAL STORM | 8409.29 | 12.06 | 196.93 | 697 |
| 5 | TSUNAMI | 144.08 | 7.20 | 18.62 | 20 |
| 6 | DROUGHT | 15018.67 | 6.04 | 43.97 | 2488 |
| 7 | FREEZE | 456.93 | 6.01 | 25.58 | 76 |
| 8 | FLOOD | 150319.68 | 5.94 | 723.40 | 25327 |
eco <- subset(data, EVTYPE %in% evtypeEco[1:8]$EVTYPE)
eco$EVTYPE = factor(eco$EVTYPE, evtypeEco[8:1]$EVTYPE)
par(mar=c(4.5,11,3,1))
boxplot(TotalDmg~EVTYPE, eco, las=1, horizontal=TRUE, xlab="Expenses in million US dollars (10^6)", main="8 most harmful types of events with respect to total damage\n(property damage + crop damage)")
Again, looking at the events causing the most damage overall, a different event type turns out to be the most harmful. With respect to economic consequences, we find
evtypeEco[which.max(evtypeEco$sum),]
## EVTYPE sum mean sd count
## 1: FLOOD 150319.7 5.94 723.4 25327
Some of the categories reported in this analysis are still not in line with the documented list of permitted event types. Correctly classifying or removing them requires domain-level expertise.