Data on storm events have been collected by the National Oceanic and Atmospheric Administration (NOAA) [1] since the 1950s. After cleaning and analyzing the collected storm data, one event type stands out above all others. Tornadoes have the greatest effect on human health, with the highest numbers of fatalities and injuries. Tornadoes also cause the largest amount of property damage and of total damage. For crop damage, tornadoes play a smaller role; the largest source is drought.
The data analysis starts by loading the following R libraries:
library(dplyr)
library(lubridate)
library(ggplot2)
library(ggdendro)
library(ggpubr)
The NOAA storm database is downloaded into a folder called data if it is not already there. The data is then loaded into R as the data frame “Storm”.
url<-"https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
destfile<-file.path("data","stormdata.bz2")
# Create the folder and download the file only when each is missing
if(!dir.exists("data"))
{
dir.create("data")
}
if(!file.exists(destfile))
{
download.file(url,destfile,mode="wb")
}
Storm<-read.csv(destfile)
For this analysis, the date and time were not used, but they were cleaned up nonetheless with the lubridate library. This makes time-dependent analysis possible in the future.
Storm$BGN_DATE<-gsub(" 0:00:00","",Storm$BGN_DATE)
Storm$BGN_DATE<-mdy(Storm$BGN_DATE)
Storm$BGN_TIME<-hm(paste0(substring(Storm$BGN_TIME,1,2),":",substring(Storm$BGN_TIME,3,4)))
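As a cross-check on the date cleanup, the following is a minimal base R sketch. It assumes the dates are month/day/year strings followed by " 0:00:00", which is the text the gsub call above removes. The sample values are made up for illustration and are not taken from the NOAA file. Base R's as.Date is swapped in for mdy here; it stops reading once the format has been matched, so the trailing time text can stay in place.

```r
# Illustrative values only; not taken from the NOAA file
raw_dates <- c("4/18/1950 0:00:00", "12/5/1999 0:00:00")

# Parsing stops after "%m/%d/%Y" has been consumed, so the trailing
# " 0:00:00" does not have to be stripped beforehand
clean_dates <- as.Date(raw_dates, format = "%m/%d/%Y")

print(clean_dates)
```

If the result agrees with the mdy output on the same rows, the date column was parsed as intended.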
There are 985 unique values for event type. Some are misspelled, while others are data entry errors. These values need to be cleaned up. A verified list of event names, taken from the Storm Data Documentation [2], was entered to clean up the data.
Type<-list("ASTRONOMICAL LOW TIDE","AVALANCHE","BLIZZARD","COASTAL FLOOD",
"COLD/WIND CHILL","DEBRIS FLOW","DENSE FOG", "DENSE SMOKE", "DROUGHT",
"DUST DEVIL", "DUST STORM","EXCESSIVE HEAT", "EXTREME COLD/WIND CHILL",
"FLASH FLOOD","FLOOD", "FREEZING FOG","FROST/FREEZE","FUNNEL CLOUD","HAIL",
"HEAT", "HEAVY RAIN", "HEAVY SNOW","HIGH SURF", "HIGH WIND",
"HURRICANE/TYPHOON", "ICE STORM", "LAKESHORE FLOOD", "LAKE-EFFECT SNOW",
"LIGHTNING", "MARINE HAIL", "MARINE HIGH WIND","MARINE STRONG WIND",
"MARINE THUNDERSTORM WIND", "RIP CURRENT", "SEICHE","SLEET", "STORM TIDE",
"STRONG WIND","THUNDERSTORM WIND", "TORNADO", "TROPICAL DEPRESSION",
"TROPICAL STORM", "TSUNAMI","VOLCANIC ASH", "WATERSPOUT", "WILDFIRE",
"WINTER STORM","WINTER WEATHER")
The events are then compared to the verified list with the agrep function. agrep is similar to grep but requires only an approximate match, which fixes simple mistakes such as misspellings. A tolerance of 30% was used; a higher tolerance would match similar but distinct event types such as “high tide” and “low tide”. Code was included to account for the “/” in some event names by matching on the part before the “/” and the part after it separately. This code uses a for loop over the verified event names and can therefore take a long time to run.
for (i in seq_along(Type)) {
# Note: agrep defaults to fixed = TRUE, so ".*" is matched as literal text here
# Match the full verified name
Storm$EVTYPE[agrep(paste0(".*",Type[i],".*"),Storm$EVTYPE,ignore.case = TRUE,max.distance = 0.3)]<-as.character(Type[i])
# Match the part of the name after the "/"
Storm$EVTYPE[agrep(paste0(".*",sub(".*/","",Type[i]),".*"),Storm$EVTYPE,ignore.case = TRUE,max.distance = 0.3)]<-as.character(Type[i])
# Match the part of the name before the "/"
Storm$EVTYPE[agrep(paste0(".*",sub("/.*","",Type[i]),".*"),Storm$EVTYPE,ignore.case = TRUE,max.distance = 0.3)]<-as.character(Type[i])
}
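To illustrate the approximate matching used in the loop, here is a small sketch on made-up strings. The vector observed and its values are hypothetical, and the bare pattern is used without the surrounding ".*" to keep the example simple.

```r
# Hypothetical raw event labels, including a plural and a transposition typo
verified <- "TORNADO"
observed <- c("TORNADO", "TORNADOES", "TORNDAO", "HAIL")

# max.distance = 0.3 allows edits up to 30% of the pattern length
hits <- agrep(verified, observed, ignore.case = TRUE, max.distance = 0.3)

# Entries judged close enough to the verified name
matched <- observed[hits]
print(matched)
```

Running the same call with a larger max.distance on a few raw labels is a quick way to see where distinct event types start to merge.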
After the cleaning, 73.5% of the records carry a verified event type; the rest are removed from the analysis.
The data has two variables to consider for human health: number of fatalities and number of injuries. A new data frame is constructed as a subset of the total data with only the necessary information.
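The retained share can be computed as the fraction of records whose event type appears in the verified list. The sketch below uses a small made-up vector in place of Storm$EVTYPE and a shortened verified list, so the number it prints is illustrative only.

```r
# Shortened, hypothetical stand-ins for Type and Storm$EVTYPE
verified <- c("TORNADO", "HAIL", "FLOOD")
events <- c("TORNADO", "HAIL", "SUMMARY OF MAY 5", "FLOOD", "???")

# TRUE where the event type is one of the verified names
kept <- events %in% verified

# Percentage of records that survive the filter
retained_pct <- 100 * mean(kept)
print(retained_pct)
```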
DF1<-select(Storm[Storm$EVTYPE %in% Type,],Event=EVTYPE,Fatalities=FATALITIES,Injuries=INJURIES)
The analysis for fatalities and injuries is the same: the totals per event type are computed and then grouped with hierarchical clustering.
Sums<-aggregate(Fatalities ~ Event, data = DF1,sum)
Distance<-dist(Sums)
hclustering<-hclust(Distance)
Sums2<-aggregate(Injuries ~ Event, data = DF1,sum)
Distance2<-dist(Sums2)
hclustering2<-hclust(Distance2)
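One possible variation, sketched here on toy data, is to set the event names as row names and compute distances on the numeric column only, so that the dendrogram leaves carry event names rather than row numbers. The data frame toy and its values are invented for illustration; this is not the code that produced the figures in this report.

```r
# Toy records; values are invented for illustration
toy <- data.frame(
  Event = c("TORNADO", "HAIL", "FLOOD", "TORNADO", "HAIL"),
  Fatalities = c(50, 0, 3, 40, 1),
  stringsAsFactors = FALSE
)

# Total fatalities per event type (rows come back sorted by Event)
toy_sums <- aggregate(Fatalities ~ Event, data = toy, sum)

# Use the event names as row names so they become the leaf labels
rownames(toy_sums) <- toy_sums$Event

# Distances on the numeric column only, then hierarchical clustering
toy_clust <- hclust(dist(toy_sums["Fatalities"]))
print(toy_clust$labels)
```

With labels attached this way, the plot can be read directly without looking up row numbers in the summed table.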
From these clusters, dendrograms were constructed showing the relative role of each event type. These plots use the ggdendro and ggpubr libraries, which build on ggplot2.
g1<-ggdendrogram(hclustering)+labs(title = "Fatalities Dendrogram")
g2<-ggdendrogram(hclustering2)+labs(title = "Injuries Dendrogram")
ggarrange(g1,g2,nrow=2)
The leaves of the dendrograms are labeled by row number in the summed tables. Event type 34, which is “TORNADO”, clearly plays the biggest role in both fatalities and injuries.
Similar to the population health analysis, the economic analysis is based on two factors: property damage and crop damage. The economic damages require an additional step because the columns “PROPDMGEXP” and “CROPDMGEXP” give the scale of the “PROPDMG” and “CROPDMG” values. A function was created to simplify this conversion. It returns 1000 for k or K, 1000000 for m or M, and 0 for anything else. There are no other unique values in the database.
convert<-function(char){
# Map a single exponent code to its multiplier; anything other than K or M gives 0
if (char=="K"||char=="k"){
result<-1000
}
else if (char=="M"||char=="m"){
result<-1000000
}
else{
result<-0
}
as.numeric(result)
}
The new dataframe is then constructed as a subset of the main database with the conversions for the EXP columns.
DF2<-select(Storm[Storm$EVTYPE %in% Type,],Event=EVTYPE,PropertyDmg=PROPDMG,
Propexp=PROPDMGEXP,CropDmg=CROPDMG,Cropexp=CROPDMGEXP)
DF2$Propexp<-as.numeric(lapply(DF2$Propexp, convert))
DF2$PropertyDmg<-as.numeric(DF2$PropertyDmg)
DF2$PropertyDmg<-DF2$PropertyDmg*DF2$Propexp
DF2$Cropexp<-as.numeric(lapply(DF2$Cropexp, convert))
DF2$CropDmg<-as.numeric(DF2$CropDmg)
DF2$CropDmg<-DF2$CropDmg*DF2$Cropexp
DF2<-mutate(DF2,Propexp=NULL,Cropexp=NULL)
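The same conversion can be sketched in vectorized form with a named lookup vector, which avoids calling a function once per row. The name exp_multiplier is hypothetical and not part of the original analysis; the mapping mirrors convert, with K and M in either case scaled and every other code set to 0.

```r
# Hypothetical vectorized counterpart to convert()
exp_multiplier <- function(codes) {
  lookup <- c(K = 1e3, M = 1e6)
  # Index the lookup by upper-cased code; unknown codes come back as NA
  mult <- lookup[toupper(codes)]
  # Mirror convert(): any code other than K or M contributes 0
  mult[is.na(mult)] <- 0
  unname(mult)
}

# Made-up codes to exercise each branch
multipliers <- exp_multiplier(c("K", "k", "M", "m", "", "B"))
print(multipliers)
```

Applied to a whole column, this returns a numeric vector of the same length that can be multiplied directly against the damage values.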
Once the dataframe has been constructed, the clustering can be found as in the population health section.
Sums3<-aggregate(PropertyDmg ~ Event, data = DF2,sum)
Distance3<-dist(Sums3)
hclustering3<-hclust(Distance3)
Sums4<-aggregate(CropDmg ~ Event, data = DF2,sum)
Distance4<-dist(Sums4)
hclustering4<-hclust(Distance4)
The economic damage dendrograms were then plotted.
g3<-ggdendrogram(hclustering3) + labs(title = "Property Damage Dendrogram")
g4<-ggdendrogram(hclustering4)+ labs(title = "Crop Damage Dendrogram")
ggarrange(g3,g4,nrow=2)
Property damage is highest for event 34, “TORNADO”, while crop damage is highest for event 7, “DROUGHT”. To get the full economic picture, the combined property and crop damages were analyzed with the same steps.
Sums5<-aggregate(cbind(CropDmg,PropertyDmg) ~ Event, data = DF2,sum)
Distance5<-dist(Sums5)
hclustering5<-hclust(Distance5)
g5<-ggdendrogram(hclustering5) + labs(title = "Total Damage Dendrogram")
plot(g5)
From the total economic damage dendrogram, it is clear that “TORNADO” still takes the largest role, almost twice the cost of all other events combined.
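A direct ranking offers a simple numeric cross-check on the dendrogram reading. The sketch below adds the two damage columns and sorts by the total; the data frame toy_dmg and its values are invented for illustration. Note that clustering on two columns groups events by both dimensions at once, whereas this ranking uses their sum.

```r
# Toy per-event totals; values are invented for illustration
toy_dmg <- data.frame(
  Event = c("TORNADO", "DROUGHT", "HAIL"),
  PropertyDmg = c(5e6, 1e5, 2e6),
  CropDmg = c(1e5, 4e6, 5e5),
  stringsAsFactors = FALSE
)

# Combined damage per event type
toy_dmg$TotalDmg <- toy_dmg$PropertyDmg + toy_dmg$CropDmg

# Rank event types from most to least total damage
ranked <- toy_dmg[order(toy_dmg$TotalDmg, decreasing = TRUE), c("Event", "TotalDmg")]

# Share of all damage attributable to the top-ranked event type
top_share <- ranked$TotalDmg[1] / sum(ranked$TotalDmg)
print(ranked)
```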
From the analysis performed on the NOAA storm data, it is clear that a single event type accounts for the majority of the risk to health and of the economic damage in the United States. Most of the damage reported in the storm data comes from tornadoes. The only category that tornadoes do not lead is damage to crops, where droughts are the leading cause. It is the recommendation of this author and this paper that policy and resources should be directed to addressing the effects of tornadoes. For policies concerned with reducing damage to crops, the focus should be on addressing drought.