This report presents a short study of the National Weather Service Storm Data. The data cover the years 1950 up to November 2011, with a more irregular number of recordings during the early years. The data was first downloaded and cleaned into a format more suitable for analysis, and graphs were then plotted showing the 10 most severe types of events in each case. The results of the study show that, in terms of number of fatalities, tornadoes are almost three times as severe as excessive heat, which is in second place. In financial terms, flooding has the highest impact, around 5 times higher than hurricane/typhoon.
The original data file was downloaded from the official website as shown below and read into a local variable. For all details about the original dataset, the reader is referred to the National Weather Service Storm Data Documentation (https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf).
#if(!file.exists("./data")){dir.create("./data")}
#dataUrl<-"https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
#download.file(dataUrl,destfile="./data/stormdata.bz2",method="curl")
#list.files("./data")
#dateDownloaded<-date()
#dateDownloaded
## read the bz2-compressed file into memory (read.csv decompresses it transparently)
stormdata<-read.csv("./data/stormdata.bz2")
In order to make the variables easier to read and speed up processing,
variable names were adjusted and irrelevant variables dropped.
colnames(stormdata)<-tolower(colnames(stormdata))
## gsub removes every underscore; sub would only remove the first one in each name
colnames(stormdata)<-gsub("_","",colnames(stormdata))
stormdata<-stormdata[,c("evtype","fatalities","propdmg","propdmgexp","cropdmg","cropdmgexp")]
Note that this table contains 985 different event factors, whereas the original instructions say there should be only 48. A study of the frequencies shows, however, that the extra factors are often a result of misspelling or
Some further processing of the data will be done later on, where it is needed for further analysis.
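As a rough illustration of how the number of distinct event factors could be inspected and reduced, the sketch below works on a small made-up vector and not on the real data. The sample values and the helper name `normalise_evtype` are assumptions for illustration only. It applies nothing more than whitespace trimming and upper-casing, so genuine misspellings would still remain as separate factors.

```r
# Hypothetical sample of event labels (not taken from the real dataset)
evtype_sample <- c("TORNADO", "tornado", " Tornado", "FLOOD", "Flood ", "HAIL")

# Hypothetical helper: trim outer whitespace, collapse runs of inner
# whitespace to a single space, and convert to upper case
normalise_evtype <- function(x) {
  x <- trimws(x)
  x <- gsub("\\s+", " ", x)
  toupper(x)
}

# Number of distinct labels before and after normalisation
n_before <- length(unique(evtype_sample))
evtype_clean <- normalise_evtype(evtype_sample)
n_after <- length(unique(evtype_clean))

# Frequency table with the most common label first
freq <- sort(table(evtype_clean), decreasing = TRUE)
print(freq)
```

Comparing `n_before` with `n_after` gives a quick feel for how many of the extra factors come from casing and spacing alone.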
First, the data was aggregated into a new set with a summary of the fatalities for each event type. The summary was then sorted so that the ten events with the highest number of recorded fatalities could be found.
fatal<-(aggregate(fatalities ~ evtype, stormdata, FUN=sum))
fatal<-fatal[order(fatal$fatalities,decreasing = TRUE),]
fatalten<-head(fatal,10)
## keep the ten rows sorted (index fatalten itself, not the full fatal table)
fatalten<-fatalten[order(fatalten$fatalities,decreasing = TRUE),]
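The aggregate, sort and truncate steps can be followed on a toy data frame. This is a minimal sketch with invented numbers; the names `toy`, `toy_fatal` and `toy_top` are hypothetical and only the pattern matches the code above.

```r
# Toy data standing in for the evtype and fatalities columns (made-up values)
toy <- data.frame(
  evtype = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD", "HAIL"),
  fatalities = c(5, 1, 7, 4, 2, 0),
  stringsAsFactors = FALSE
)

# Sum fatalities per event type
toy_fatal <- aggregate(fatalities ~ evtype, toy, FUN = sum)

# Sort from most to fewest fatalities and keep the top two rows
toy_fatal <- toy_fatal[order(toy_fatal$fatalities, decreasing = TRUE), ]
toy_top <- head(toy_fatal, 2)
print(toy_top)
```

Because the table is already sorted before `head()` is called, no second sort of the truncated table is needed.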
From this dataset, the ten events have been plotted as below. We can clearly see that tornadoes are the event type with the most fatalities overall in the country during this period.
library(ggplot2)
dd <- ggplot(fatalten, aes(x=evtype,y=fatalities,group = 1))
## a constant colour is set outside aes(); inside aes() it would be treated as a data value
dd <- dd + geom_line(colour="red",lwd=2)
dd <- dd + theme(legend.position="none",axis.text.x = element_text(angle = 45, hjust = 1))
dd <- dd + xlab("Event type")
dd <- dd + ylab("Number of fatalities")
dd <- dd + geom_text(aes(label=fatalities), vjust=0, size=4)
dd <- dd + ggtitle("Ten most fatal events in the U.S.")
print(dd)
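A discrete x axis in ggplot2 follows the factor levels of the variable, and a plain character column is ordered alphabetically, so the plotted order does not necessarily follow the ranking by fatalities. The sketch below shows, in base R only, one way the levels could be fixed to the ranked order before plotting. The data frame `ranked` and its numbers are made up for illustration.

```r
# Made-up ranked table, already sorted by fatalities (values are not real)
ranked <- data.frame(
  evtype = c("TORNADO", "EXCESSIVE HEAT", "FLASH FLOOD"),
  fatalities = c(30, 20, 10),
  stringsAsFactors = FALSE
)

# Default factor conversion sorts the levels alphabetically
default_levels <- levels(factor(ranked$evtype))

# Setting the levels explicitly keeps the order of the sorted table
ranked$evtype <- factor(ranked$evtype, levels = ranked$evtype)
ranked_levels <- levels(ranked$evtype)

print(default_levels)
print(ranked_levels)
```

Passing a data frame prepared this way to `ggplot()` would place the event types on the x axis from highest to lowest count.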
#### Events and their financial influences

When looking at the data, it was clear that a number of erroneous entries had been made, and this needed to be addressed. It was also necessary to recode some of the economic data, since each damage figure is split across two columns: one holds the numerical value and the other a multiplier.
stormdata$propdmgexp <- as.character(stormdata$propdmgexp)
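## mark empty values as "?" so they are handled with the other unknown symbols further down
## (comparing with NULL matches nothing; empty entries are empty strings)
stormdata[stormdata$propdmgexp == "", c("propdmgexp")] <- "?"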
## now assign the values their correct exponential factor for property
stormdata[stormdata$propdmgexp == "0", c("propdmgexp")] <- "1e0"
stormdata[stormdata$propdmgexp == "1", c("propdmgexp")] <- "1e1"
stormdata[stormdata$propdmgexp == "2"|stormdata$propdmgexp == "h"|stormdata$propdmgexp == "H", c("propdmgexp")] <- "1e2"
stormdata[stormdata$propdmgexp == "3"|stormdata$propdmgexp == "k"|stormdata$propdmgexp == "K", c("propdmgexp")] <- "1e3"
stormdata[stormdata$propdmgexp == "4", c("propdmgexp")] <- "1e4"
stormdata[stormdata$propdmgexp == "5", c("propdmgexp")] <- "1e5"
stormdata[stormdata$propdmgexp == "6"|stormdata$propdmgexp == "m"|stormdata$propdmgexp == "M", c("propdmgexp")] <- "1e6"
stormdata[stormdata$propdmgexp == "7", c("propdmgexp")] <- "1e7"
stormdata[stormdata$propdmgexp == "8", c("propdmgexp")] <- "1e8"
stormdata[stormdata$propdmgexp == "9"|stormdata$propdmgexp == "b"|stormdata$propdmgexp == "B", c("propdmgexp")] <- "1e9"
## Assume that "?", "+" and "-" mean the multiplier is unknown and give them a multiplier of 0
stormdata[stormdata$propdmgexp == "?"|stormdata$propdmgexp == "+"|stormdata$propdmgexp == "-", c("propdmgexp")] <- "0"
## to numeric / this will introduce NA by coercion on the other values
stormdata$propdmgexp <- as.numeric(stormdata$propdmgexp)
## and again same for crops
stormdata$cropdmgexp <- as.character(stormdata$cropdmgexp)
stormdata[stormdata$cropdmgexp == "0", c("cropdmgexp")] <- "1e0"
stormdata[stormdata$cropdmgexp == "1", c("cropdmgexp")] <- "1e1"
stormdata[stormdata$cropdmgexp == "2"|stormdata$cropdmgexp == "h"|stormdata$cropdmgexp == "H", c("cropdmgexp")] <- "1e2"
stormdata[stormdata$cropdmgexp == "3"|stormdata$cropdmgexp == "k"|stormdata$cropdmgexp == "K", c("cropdmgexp")] <- "1e3"
stormdata[stormdata$cropdmgexp == "4", c("cropdmgexp")] <- "1e4"
stormdata[stormdata$cropdmgexp == "5", c("cropdmgexp")] <- "1e5"
stormdata[stormdata$cropdmgexp == "6"|stormdata$cropdmgexp == "m"|stormdata$cropdmgexp == "M", c("cropdmgexp")] <- "1e6"
stormdata[stormdata$cropdmgexp == "7", c("cropdmgexp")] <- "1e7"
stormdata[stormdata$cropdmgexp == "8", c("cropdmgexp")] <- "1e8"
stormdata[stormdata$cropdmgexp == "9"|stormdata$cropdmgexp == "b"|stormdata$cropdmgexp == "B", c("cropdmgexp")] <- "1e9"
stormdata[stormdata$cropdmgexp == "?"|stormdata$cropdmgexp == "+"|stormdata$cropdmgexp == "-", c("cropdmgexp")] <- "0"
stormdata$cropdmgexp <- as.numeric(stormdata$cropdmgexp)
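The same recoding can be written more compactly with a lookup table. The following is only a sketch of that alternative, not the code used for the results in this report. The names `exp_lookup` and `exp_to_multiplier` are hypothetical; the code-to-multiplier pairs are the ones used above. In this sketch any code that is not in the table, including the empty string, comes back as NA.

```r
# Named vector mapping exponent codes to numeric multipliers
exp_lookup <- c(
  "0" = 1e0, "1" = 1e1, "2" = 1e2, "3" = 1e3, "4" = 1e4,
  "5" = 1e5, "6" = 1e6, "7" = 1e7, "8" = 1e8, "9" = 1e9,
  "H" = 1e2, "K" = 1e3, "M" = 1e6, "B" = 1e9,
  "?" = 0, "+" = 0, "-" = 0
)

# Hypothetical helper: upper-case the code and look it up;
# codes that are not in the table (for example "") give NA
exp_to_multiplier <- function(code) {
  code <- toupper(as.character(code))
  unname(exp_lookup[code])
}

# Small made-up set of codes to show the behaviour
codes <- c("K", "m", "5", "?", "")
mult <- exp_to_multiplier(codes)
print(mult)
```

A helper like this could be applied to both `propdmgexp` and `cropdmgexp`, which avoids repeating the block of assignments for each column.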
Then, data was processed so that a new set was created with a summary of the economic impact for each event. The summary was then sorted so that the ten events with highest recorded economic damages were found.
# Now calculate a new column with the total of damage costs
stormdata$totaldamage<-((stormdata$cropdmg*stormdata$cropdmgexp)+(stormdata$propdmg*stormdata$propdmgexp))
## Summarise the damages over Event type
dam<-aggregate(totaldamage ~ evtype, stormdata, FUN=sum)
dam<-dam[order(dam$totaldamage,decreasing = TRUE),]
## Let's list the 10 worst, since there are more than 900 different factors in evtype
damten<-head(dam,10)
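A small worked example of the total damage calculation, using made-up numbers, may help in reading the code above. It also shows one behaviour worth being aware of: a missing multiplier makes the row total NA, and the formula interface of `aggregate` drops such rows by default (`na.action = na.omit`). The second half treats a missing multiplier as 0, which is an assumption made for this sketch and not necessarily the choice made in this analysis.

```r
# Toy data with one missing crop multiplier (all values are made up)
toy <- data.frame(
  evtype     = c("FLOOD", "FLOOD", "HAIL"),
  propdmg    = c(2, 5, 1),
  propdmgexp = c(1e3, 1e6, 1e3),
  cropdmg    = c(0, 1, 3),
  cropdmgexp = c(NA, 1e3, 1e3),
  stringsAsFactors = FALSE
)

# Total damage per row; the first row becomes NA because 0 * NA is NA
toy$totaldamage <- toy$cropdmg * toy$cropdmgexp + toy$propdmg * toy$propdmgexp

# The formula interface omits rows with NA before summing
dam_default <- aggregate(totaldamage ~ evtype, toy, FUN = sum)

# Variant (assumption): treat a missing multiplier as 0 so the row is kept
toy$cropdmgexp[is.na(toy$cropdmgexp)] <- 0
toy$totaldamage <- toy$cropdmg * toy$cropdmgexp + toy$propdmg * toy$propdmgexp
dam_zero <- aggregate(totaldamage ~ evtype, toy, FUN = sum)

print(dam_default)
print(dam_zero)
```

In the toy data the first FLOOD row contributes its property damage only in the second variant, so the two FLOOD totals differ.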
In the resulting bar plot, it is clear that flooding is the event type with the highest financial impact on the country during the period.
dd <- ggplot(damten, aes(x=evtype,y=totaldamage))
dd <- dd + geom_bar(stat="identity",colour="red",fill="red")
dd <- dd + theme(legend.position="none",axis.text.x = element_text(angle = 45, hjust = 1))
dd <- dd + xlab("Event type")
dd <- dd + ylab("Financial damage in USD")
dd <- dd + ggtitle("Ten most financially damaging events in the U.S.")
print(dd)
While it has not been done here, it might be of interest for a future study to group the events further into related areas. In the list of financial damage, the top ten events contain the categories hurricane, hurricane/typhoon, thunderstorm wind and tornado. It is outside the researcher's expertise to judge whether these categories might be merged, and they have for that reason been left as they are. A similar consideration applies to the categories flood, flash flood, river flood and storm surge/tide. However, recategorising the more than 900 factors is outside the time limit of this project and has for that reason been left out.