A short analysis is presented of the NOAA (National Oceanic and Atmospheric Administration) storm database that included recorded storm events in the USA from 1950 to 2011. Two questions were addressed: first, What are the event types that were most harmful to human health? Second, What are the event types that caused most economic damage? The data was first filtered to only contain events that caused harm to human health or caused economic damage. The remaining events were then calssified based on the 48 official event types of the NOAA. Events that could not be classified were ignored. Then the injuries and fatalities caused by the events of the different types were calculated and summaries presented. The same was done for the property and crop damage caused by the events. The actual costs had to be calculated first based on the units given in the data.
Loading the used R packages
require(dplyr)
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
require(ggplot2)
## Loading required package: ggplot2
require(xtable)
## Loading required package: xtable
require(tidyr)
## Loading required package: tidyr
require(gridExtra)
## Loading required package: gridExtra
## Loading required package: grid
# R can read compressed files directly, so no need for unzipping
# bzfile() can be used but is not necessary either
Data <- read.csv('repdata-data-StormData.csv.bz2', sep = ",", header = TRUE, na.strings = c("", "NA"))
I first removed all rows (events) from the data that caused neither harm nor damage as they are irrelevant to this analysis.
Data <- tbl_df(Data)
DataRelevant <- filter(Data, FATALITIES != 0 | INJURIES != 0 |
PROPDMG != 0 | CROPDMG != 0)
The data contained the categorial variable EVTYPE in which the event types were listed. However, there were 985 different types listed whereas the NOAA considers only 48 official events. The official events can be found via their website (http://www.ncdc.noaa.gov/stormevents/details.jsp). I saved the official events in the variable Events.
Events = c("Astronomical Low Tide", "Avalanche", "Blizzard",
"Coastal Flood", "Cold/Wind Chill", "Debris Flow",
"Dense Fog", "Dense Smoke", "Drought", "Dust Devil",
"Dust Storm", "Excessive Heat", "Extreme Cold/Wind Chill",
"Flash Flood", "Flood", "Frost/Freeze", "Funnel Cloud",
"Freezing Fog", "Hail", "Heat", "Heavy Rain", "Heavy Snow",
"High Surf", "High Wind", "Hurricane (Typhoon)", "Ice Storm",
"Lake-Effect Snow", "Lakeshore Flood", "Lightning",
"Marine Hail", "Marine High Wind", "Marine Strong Wind",
"Marine Thunderstorm Wind", "Rip Current", "Seiche",
"Sleet", "Storm Surge/Tide", "Strong Wind",
"Thunderstorm Wind", "Tornado", "Tropical Depression",
"Tropical Storm", "Tsunami", "Volcanic Ash", "Waterspout",
"Wildfire", "Winter Storm", "Winter Weather")
To facilitate the search for these official events in the events listed in the Data, I had to consider terms such as flood that are part of several types, e.g. Flood and Flash Flood. I adjusted the order so flood was first, and the more specific flood times came later in the list. This was also done for “Heat”. Furthermore, the type Hurricane (Typhone) was changed to the regular expression Hurricane|Typhoon to allow the following regular expression search. The new list with these few changes was called EventsSearch, it was used to assign the listed event types with the official types. Events that could not be assigned were not considered.
# the adjusted search list of the 48 event types
EventsSearch = c("Hurricane|Typhoon", "Low Tide", "Avalanche", "Blizzard", "Flood",
"Coastal Flood", "Cold/Wind Chill", "Debris Flow",
"Dense Fog", "Dense Smoke", "Drought", "Dust Devil",
"Dust Storm", "Heat", "Excessive Heat", "Extreme Cold/Wind Chill",
"Flash Flood", "Frost/Freeze", "Funnel Cloud",
"Freezing Fog", "Hail", "Heavy Rain", "Heavy Snow",
"High Surf", "High Wind", "Ice Storm",
"Lake-Effect Snow", "Lakeshore Flood", "Lightning",
"Marine Hail", "Marine High Wind", "Marine Strong Wind",
"Marine Thunderstorm Wind", "Rip Current", "Seiche",
"Sleet", "Storm Surge/Tide", "Strong Wind",
"Thunderstorm Wind", "Tornado", "Tropical Depression",
"Tropical Storm", "Tsunami", "Volcanic Ash", "Waterspout",
"Wildfire", "Winter Storm", "Winter Weather")
# add a column with the 48 official storm types to the Data
DataRelevant$OffEvent = vector(mode = "character",
length = dim(DataRelevant)[1])
for (i in EventsSearch){
DataRelevant$OffEvent[grep(i, DataRelevant$EVTYPE,
ignore.case = TRUE)] = i
}
# Events that could not be assigned to the official events were
# removed
DataRelevant <- filter(DataRelevant, OffEvent != "")
The adjusted data still contained 188352 events that belonged to 45 of the official 48 types.
I only kept the relevant columns of the data.
DataRelevant <- select(DataRelevant, OffEvent, FATALITIES, INJURIES,
PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)
The data contains columns that specify the costs that were caused by damage of property (PROPDMG) or crops (CROPDMG). The units of the costs are given in two separate columns, PROPDMGEXP and CROPDMGEXP, respectively. A K stands for Thousand, a M for million, and a B for billion dollars. However, a look in the unit colums shows that also other “values” appear.
table(DataRelevant$PROPDMGEXP)
##
## + - 0 1 2 3 4 5 6 7
## 5 1 208 0 1 1 4 18 3 3
## 8 ? B H K M h m
## 0 0 36 6 169246 10574 1 7
table(DataRelevant$CROPDMGEXP)
##
## 0 2 ? B K M k m
## 17 0 6 6 94237 1772 20 1
For the purpose of this analysis I set the few damage values without correct unit to NA. In the analysis of the damage, these events will be ignored.
So I added columns with the values of the damage in US Dollar to receive the final Data.frame DataRelevant that was used for the subsequent analyses.
DataRelevant$PROPDMGEXP <- as.character(DataRelevant$PROPDMGEXP)
DataRelevant$CROPDMGEXP <- as.character(DataRelevant$CROPDMGEXP)
DataK <- filter(DataRelevant, PROPDMGEXP == "K")
DataK <- mutate(DataK, Prop_Damage = PROPDMG*1000)
DataM <- filter(DataRelevant, PROPDMGEXP == "M" | PROPDMGEXP == "m")
DataM <- mutate(DataM, Prop_Damage = PROPDMG*1000000)
DataB <- filter(DataRelevant, PROPDMGEXP == "B")
DataB <- mutate(DataB, Prop_Damage = PROPDMG*1000000000)
DataOther <- filter(DataRelevant, PROPDMGEXP != "K", PROPDMGEXP != "M",
PROPDMGEXP != "m", PROPDMGEXP != "B")
DataOther <- mutate(DataOther, Prop_Damage = NA)
DataNA <- filter(DataRelevant, is.na(PROPDMGEXP))
DataNA <- mutate(DataNA, Prop_Damage = NA)
DataRelevant <- rbind(DataK, DataM, DataB, DataOther, DataNA)
rm(DataK, DataB, DataM, DataOther, DataNA)
# same procedure for the crops damage cost
DataK <- filter(DataRelevant, CROPDMGEXP == "K" | CROPDMGEXP == "k")
DataK <- mutate(DataK, Crop_Damage = CROPDMG*1000)
DataM <- filter(DataRelevant, CROPDMGEXP == "M" | CROPDMGEXP == "m")
DataM <- mutate(DataM, Crop_Damage = CROPDMG*1000000)
DataB <- filter(DataRelevant, CROPDMGEXP == "B")
DataB <- mutate(DataB, Crop_Damage = CROPDMG*1000000000)
DataOther <- filter(DataRelevant, CROPDMGEXP != "K", CROPDMGEXP != "M",
CROPDMGEXP != "m", CROPDMGEXP != "B",
CROPDMGEXP != "k")
DataOther <- mutate(DataOther, Crop_Damage = NA)
DataNA <- filter(DataRelevant, is.na(CROPDMGEXP))
DataNA <- mutate(DataNA, Crop_Damage = NA)
DataRelevant <- rbind(DataK, DataM, DataB, DataOther, DataNA)
rm(DataK, DataB, DataM, DataOther, DataNA)
For all of the events classified into the 48 different categories, I calculated the mean and the sum of injuries and fatalities.
Group_Data <- group_by(DataRelevant, OffEvent)
SummaryPersonDamageSum <- summarise(Group_Data,
Sum_Injuries = sum(INJURIES),
Sum_Fatalities = sum(FATALITIES))
SummaryPersonDamageMean <- summarise(Group_Data,
Mean_Injuries = mean(INJURIES),
Mean_Fatalities = mean(FATALITIES))
SummaryPersonDamageSum <- arrange(SummaryPersonDamageSum,
desc(Sum_Injuries),
desc(Sum_Fatalities))
SummaryPersonDamageMean <- arrange(SummaryPersonDamageMean,
desc(Mean_Injuries),
desc(Mean_Fatalities))
The results are shown in the following two tables, that were ordered in decending order after injuries and then fatalities. First the Table summarising the sum of all Injuries and Fatalities caused by all events of the different types from 1950 - 2011 in the USA
table_sum <- xtable(SummaryPersonDamageSum, digits = 0)
print(table_sum, type = 'html')
| OffEvent | Sum_Injuries | Sum_Fatalities | |
|---|---|---|---|
| 1 | Tornado | 91364 | 5658 |
| 2 | Flood | 6795 | 483 |
| 3 | Excessive Heat | 6525 | 1922 |
| 4 | Lightning | 5232 | 817 |
| 5 | Heat | 2699 | 1216 |
| 6 | Thunderstorm Wind | 2428 | 209 |
| 7 | Ice Storm | 1992 | 89 |
| 8 | Flash Flood | 1800 | 1035 |
| 9 | High Wind | 1507 | 297 |
| 10 | Hail | 1466 | 20 |
| 11 | Winter Storm | 1353 | 217 |
| 12 | Hurricane|Typhoon | 1333 | 133 |
| 13 | Heavy Snow | 1034 | 127 |
| 14 | Wildfire | 911 | 75 |
| 15 | Blizzard | 805 | 101 |
| 16 | Winter Weather | 538 | 61 |
| 17 | Rip Current | 529 | 577 |
| 18 | Dust Storm | 440 | 22 |
| 19 | Tropical Storm | 383 | 66 |
| 20 | Dense Fog | 342 | 18 |
| 21 | Strong Wind | 323 | 125 |
| 22 | Heavy Rain | 255 | 99 |
| 23 | High Surf | 204 | 146 |
| 24 | Avalanche | 170 | 224 |
| 25 | Tsunami | 129 | 33 |
| 26 | Waterspout | 72 | 6 |
| 27 | Dust Devil | 43 | 2 |
| 28 | Extreme Cold/Wind Chill | 24 | 125 |
| 29 | Cold/Wind Chill | 12 | 95 |
| 30 | Coastal Flood | 7 | 6 |
| 31 | Storm Surge/Tide | 5 | 11 |
| 32 | Drought | 4 | 0 |
| 33 | Funnel Cloud | 3 | 0 |
| 34 | Marine High Wind | 1 | 1 |
| 35 | Sleet | 0 | 2 |
| 36 | Dense Smoke | 0 | 0 |
| 37 | Freezing Fog | 0 | 0 |
| 38 | Frost/Freeze | 0 | 0 |
| 39 | Lake-Effect Snow | 0 | 0 |
| 40 | Lakeshore Flood | 0 | 0 |
| 41 | Low Tide | 0 | 0 |
| 42 | Marine Hail | 0 | 0 |
| 43 | Seiche | 0 | 0 |
| 44 | Tropical Depression | 0 | 0 |
| 45 | Volcanic Ash | 0 | 0 |
Second the Table showing the mean numbers of Injuries and Fatalities caused by the events of the different types from 1950 - 2011
table_mean <- xtable(SummaryPersonDamageMean, digits = 2)
print(table_mean, type = 'html')
| OffEvent | Mean_Injuries | Mean_Fatalities | |
|---|---|---|---|
| 1 | Heat | 9.96 | 4.49 |
| 2 | Tsunami | 9.21 | 2.36 |
| 3 | Excessive Heat | 9.20 | 2.71 |
| 4 | Hurricane|Typhoon | 5.75 | 0.57 |
| 5 | Dense Fog | 4.62 | 0.24 |
| 6 | Dust Storm | 4.27 | 0.21 |
| 7 | Blizzard | 3.16 | 0.40 |
| 8 | Ice Storm | 2.79 | 0.12 |
| 9 | Tornado | 2.29 | 0.14 |
| 10 | Waterspout | 1.12 | 0.09 |
| 11 | High Surf | 1.09 | 0.78 |
| 12 | Wildfire | 1.06 | 0.09 |
| 13 | Winter Weather | 0.98 | 0.11 |
| 14 | Tropical Storm | 0.91 | 0.16 |
| 15 | Winter Storm | 0.89 | 0.14 |
| 16 | Rip Current | 0.82 | 0.90 |
| 17 | Heavy Snow | 0.74 | 0.09 |
| 18 | Flood | 0.64 | 0.05 |
| 19 | Avalanche | 0.63 | 0.84 |
| 20 | Dust Devil | 0.45 | 0.02 |
| 21 | Lightning | 0.39 | 0.06 |
| 22 | High Wind | 0.24 | 0.05 |
| 23 | Funnel Cloud | 0.23 | 0.00 |
| 24 | Heavy Rain | 0.22 | 0.09 |
| 25 | Extreme Cold/Wind Chill | 0.22 | 1.13 |
| 26 | Cold/Wind Chill | 0.13 | 1.06 |
| 27 | Storm Surge/Tide | 0.11 | 0.23 |
| 28 | Strong Wind | 0.09 | 0.04 |
| 29 | Flash Flood | 0.08 | 0.05 |
| 30 | Hail | 0.06 | 0.00 |
| 31 | Marine High Wind | 0.05 | 0.05 |
| 32 | Thunderstorm Wind | 0.04 | 0.00 |
| 33 | Coastal Flood | 0.03 | 0.03 |
| 34 | Drought | 0.02 | 0.00 |
| 35 | Sleet | 0.00 | 0.20 |
| 36 | Dense Smoke | 0.00 | 0.00 |
| 37 | Freezing Fog | 0.00 | 0.00 |
| 38 | Frost/Freeze | 0.00 | 0.00 |
| 39 | Lake-Effect Snow | 0.00 | 0.00 |
| 40 | Lakeshore Flood | 0.00 | 0.00 |
| 41 | Low Tide | 0.00 | 0.00 |
| 42 | Marine Hail | 0.00 | 0.00 |
| 43 | Seiche | 0.00 | 0.00 |
| 44 | Tropical Depression | 0.00 | 0.00 |
| 45 | Volcanic Ash | 0.00 | 0.00 |
I calculated the mean and the sum of the property and crop damages caused by the events of the different event types, and ordered the results after the total damage (property plus crop) with the event types causing the highest damage on top.
SummaryCostSum <- summarise(Group_Data,
Sum_Property_Damage = sum(Prop_Damage, na.rm = TRUE),
Sum_Crop_Damage = sum(Crop_Damage, na.rm = TRUE))
SummaryCostSum <- mutate(SummaryCostSum, Total = Sum_Property_Damage +
Sum_Crop_Damage)
SummaryCostSum <- arrange(SummaryCostSum, desc(Total),
desc(Sum_Property_Damage))
SummaryCostMean <- summarise(Group_Data,
Mean_Property_Damage = mean(Prop_Damage, na.rm = TRUE),
Mean_Crop_Damage = mean(Crop_Damage, na.rm = TRUE))
SummaryCostMean <- mutate(SummaryCostMean, Total = Mean_Property_Damage +
Mean_Crop_Damage)
SummaryCostMean <- arrange(SummaryCostMean, desc(Total),
desc(Mean_Property_Damage),
desc(Mean_Crop_Damage))
### For the Plots the Data was rearranged and the levels of the OffEvent factor ordered based on the Total damage
names(SummaryCostSum)[2:3] = c("Property", "Crop")
Plot_SummaryCostSum <- SummaryCostSum[1:15,]
Plot_SummaryCostSum <- gather(Plot_SummaryCostSum, Type, Damage,
-OffEvent, -Total)
Plot_SummaryCostSum <- arrange(Plot_SummaryCostSum,
desc(Total))
Plot_SummaryCostSum$Damage <- Plot_SummaryCostSum$Damage/10^9
Plot_SummaryCostSum$OffEvent <- as.factor(Plot_SummaryCostSum$OffEvent)
# Reorder the levels
Plot_SummaryCostSum$OffEvent <-
reorder(Plot_SummaryCostSum$OffEvent,-Plot_SummaryCostSum$Total)
### same procedure for the mean data
names(SummaryCostMean)[2:3] = c("Property", "Crop")
Plot_SummaryCostMean <- SummaryCostMean[1:15,]
Plot_SummaryCostMean <- gather(Plot_SummaryCostMean, Type, Damage,
-OffEvent, -Total)
Plot_SummaryCostMean <- arrange(Plot_SummaryCostMean,
desc(Total))
Plot_SummaryCostMean$Damage <- Plot_SummaryCostMean$Damage/1000000
Plot_SummaryCostMean$OffEvent <- as.factor(Plot_SummaryCostMean$OffEvent)
Plot_SummaryCostMean$OffEvent <-
reorder(Plot_SummaryCostMean$OffEvent,-Plot_SummaryCostMean$Total)
cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73",
"#F0E442", "#0072B2", "#D55E00", "#CC79A7")
TrS <- ggplot(Plot_SummaryCostSum,
aes(x = OffEvent, y = Damage, fill = Type)) +
geom_bar(stat = "identity", position = position_dodge(),
alpha = 3/4) +
ylab("Damage in Billion US Dollar") +
xlab("NOAA Storm Event Type") +
ggtitle("Total Damage (1950 - 2011)") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
scale_fill_manual(values=cbPalette[2:3]) +
theme(legend.position="top")
TrS2 <- TrS +
coord_cartesian(ylim=c(0, 20))
grid.arrange(TrS, TrS2, ncol = 2)
Figure Caption: The 15 storm event types whose recorded events caused the highest total damage of property and crop in the USA from 1950 - 2011 are shown. The figure on the right is the same as on the left just with an adjusted y-axis scale.
TrM <- ggplot(Plot_SummaryCostMean,
aes(x = OffEvent, y = Damage, fill = Type)) +
geom_bar(stat = "identity", position = position_dodge(),
alpha = 3/4) +
ylab("Damage in Million US Dollar") +
xlab("NOAA Storm Event Type") +
ggtitle("Mean Damage Caused by Events (1950 - 2011)") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
scale_fill_manual(values=cbPalette[2:3])
#theme(legend.position="top")
TrM
Figure Caption: The 15 storm event types whose recorded events caused the highest average damage of property and crop in the USA from 1950 - 2011.