Introduction

Storm Data is published by the National Oceanic and Atmospheric Administration (NOAA) to document:
1. The occurrence of storms and other events that may cause loss of life, injuries, and property or crop damage.
2. Rare and unusual weather events.

Data pertaining to various weather events are publicly available through the Storm Events Database as CSV files. The latest database contains information for the period from January 1950 through October 2016. Information regarding the database can also be found through NOAA’s NWS documentation services.

Information in the database is collected mainly through the National Weather Service (NWS), but data can also be obtained from other sources such as the media, law enforcement, and other governmental agencies.

Methodology

Database

The database used in the current study contains data from January 1950 to November 2011. It is distributed as a bz2-compressed file, available at the following link.

Preprocessing

  1. The libraries needed for the analysis were first loaded.
#libraries
library(car)
library(ggplot2)
library(dplyr)
library(data.table)
library(stringdist)
library(Hmisc)
library(gridExtra)
  2. Data was downloaded and extracted, then read as CSV using fread() from the data.table package, which reads large data sets quickly.
setwd("c:/users/ahmed/desktop/assignment 4")
storm<- fread("storm.csv", sep = ",")
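As a side note, base R can read a bz2-compressed CSV without a separate extraction step. The sketch below is only an illustration under stated assumptions: it writes a tiny made-up file to a temporary location (the three rows are hypothetical, not taken from the storm data) and reads it back through a bzfile() connection.

```r
# Self-contained illustration: write a tiny bz2-compressed CSV, then read it back.
# The rows below are made up for the example.
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "wt")
writeLines(c("EVTYPE,FATALITIES,INJURIES", "TORNADO,1,3", "FLOOD,0,2"), con)
close(con)

# read.csv() decompresses on the fly when given a bzfile() connection.
toy <- read.csv(bzfile(tmp), stringsAsFactors = FALSE)
unlink(tmp)

toy
```

For a file the size of the storm database, extracting once and using fread() as above is likely faster; the connection approach mainly saves the manual extraction step.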
  3. The needed columns were extracted from the data: date, state, event type, fatalities, injuries, and the property and crop damage columns (amounts in USD and their exponent codes).
cols<- c("BGN_DATE", "STATE", "EVTYPE", "FATALITIES", "INJURIES", grep("DMG", names(storm), value = TRUE))
  4. Data was processed to determine public health consequences, namely injuries and fatalities.

  • Events that caused either fatalities or injuries were extracted, and all others excluded, to reduce the dataset to as few observations as possible.

health<- storm %>% select(one_of(cols)) %>% filter(INJURIES >0 | FATALITIES >0)
length(unique(health$EVTYPE))
## [1] 220

  • The number of distinct event types was too high because of spelling variants and inconsistent recording conventions over time, so similar events were grouped together using the stringdist package. This was done with a function, clustfunc(). Inside that function a distance matrix is built with the "jw" (Jaro-Winkler) method, hierarchical clustering is applied, and the tree is cut at a height of 0.32, the value found to produce the best results.

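#Create function that takes a data frame and the index of the column to cluster on,
#and returns the data frame with an added Cluster column
clustfunc<- function(data, x, i){
#work on a plain data frame (fread() returns a data.table, which indexes columns differently)
data<- as.data.frame(data)
data[[x]]<- tolower(data[[x]])
events<- unique(data[[x]])
#pairwise string distances between the unique event names
matrix<- stringdistmatrix(events, events, method = "jw")
rownames(matrix)<- events
#hierarchical clustering, cut at height i
clus<- hclust(as.dist(matrix))
clusters<- data.frame(Cluster = unname(cutree(clus, h=i)), Event = events, stringsAsFactors = FALSE)
#attach the cluster id to every row of the original data and return the result
merge(x=data, y=clusters, by.x = names(data)[x], by.y = "Event", all.x = TRUE, all.y = FALSE)
}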
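To make the clustering step easier to follow, here is a minimal sketch on a toy vector. The five labels are hypothetical stand-ins for EVTYPE values, and 0.32 is simply the cut height used in this analysis; the sketch assumes the stringdist package is installed.

```r
library(stringdist)

# Hypothetical event labels, standing in for the EVTYPE column.
events <- c("tstm wind", "tstm winds", "flood", "floods", "lightning")

# Pairwise "jw" distances (0 = identical strings, 1 = no matching characters).
d <- stringdistmatrix(events, events, method = "jw")
rownames(d) <- events

# Hierarchical clustering (hclust() defaults to complete linkage),
# then cut the tree at height 0.32.
tree <- hclust(as.dist(d))
cluster <- cutree(tree, h = 0.32)

# List the labels that ended up in each cluster.
split(events, cluster)
```

Near-identical spellings sit at a very small distance and merge early, while unrelated labels only merge above the cut height and so stay separate. Raising the cut height merges more aggressively, which is why the value has to be tuned by inspecting the resulting groups.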

  • Now we apply clustfunc() to the health data frame.

df<- clustfunc(data=health, x=3, i=0.32)
length(unique(df$Cluster))
## [1] 80
  • TSTM wind and Thunderstorm wind are the same so they can be further combined (clusters 9 and 2).
df$Cluster[df$Cluster==9]<- 2
  • A data summary should now be constructed: the sum() of fatalities and injuries by cluster, with the most common event in each cluster used to represent it. The eventfreq() function was written for that purpose.
#Most common event in a cluster
eventfreq<- function(x){
        names(sort(table(x), decreasing = T)[1])
}
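A quick check of what eventfreq() returns, using a hypothetical cluster that mixes three spellings of the same event type (the helper is repeated so the snippet runs on its own):

```r
# Same helper as above: the most frequent value in a character vector.
eventfreq <- function(x) {
  names(sort(table(x), decreasing = TRUE)[1])
}

# Hypothetical cluster containing three spellings of one event type.
cluster_events <- c("tstm wind", "thunderstorm wind", "tstm wind", "tstm winds")

label <- eventfreq(cluster_events)
label
```

Here "tstm wind" occurs twice and the other spellings once, so it becomes the label for the whole cluster. When two spellings are equally frequent, the label is whichever one sort() happens to place first.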
  • Summarize data then derive 2 data frames with fatalities and injuries sorted in a descending order, respectively.
#Summary
healthcon<- df %>% arrange(Cluster) %>% group_by(Cluster) %>% summarise(Event= eventfreq(EVTYPE), Fatalities= sum(FATALITIES), Injuries=sum(INJURIES))
#Fatalities df
fats<- arrange(healthcon,desc(Fatalities))
#Injuries df
injs<- arrange(healthcon, desc(Injuries))
  5. Data was processed to determine economic consequences, namely property and crop damage.
  • Select events that damaged property or crops
dmg<- storm %>% select(one_of(cols)) %>% filter(PROPDMG > 0 | CROPDMG > 0) 
  • Use clustfunc() to cluster events as before
# a cut height of 0.42 gave better clusters for this subset
df2<-clustfunc(dmg,3, i=0.42)
length(unique(df2$EVTYPE)) #before clustering
## [1] 397
length(unique(df2$Cluster)) #after clustering
## [1] 70
  • Recode() from the car package was used to transform the exponent codes k = 1000, m = 10^6 and b = 10^9 after converting all characters to lower case. Other values were assigned NA.
#convert exponent codes to numeric multipliers
df2<-df2 %>% mutate(PROPDMGEXP=tolower(PROPDMGEXP), CROPDMGEXP=tolower(CROPDMGEXP))
df2$PROPDMGEXP<- Recode(df2$PROPDMGEXP, "'k'=1000; 'm'=10^6; 'b'=10^9; else=NA", as.factor.result = F)
df2$CROPDMGEXP<- Recode(df2$CROPDMGEXP, "'k'=1000; 'm'=10^6; 'b'=10^9; else=NA", as.factor.result = F)
  • Two variables, property.damage and crop.damage, were computed to estimate damage in USD.
#Calculate property and crop damage
df2<- df2 %>% mutate(property.damage=PROPDMG*PROPDMGEXP, 
                     crop.damage=CROPDMGEXP*CROPDMG)
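The exponent handling can be illustrated on a few made-up rows. This sketch swaps in a named lookup vector from base R instead of Recode(), purely to show the arithmetic; the values in toy are hypothetical.

```r
# Multipliers for the exponent codes; any other code maps to NA.
exp_map <- c(k = 1e3, m = 1e6, b = 1e9)

# Hypothetical rows: amount plus exponent code, as in PROPDMG / PROPDMGEXP.
toy <- data.frame(PROPDMG = c(25, 2.5, 1, 10),
                  PROPDMGEXP = c("K", "M", "B", "?"),
                  stringsAsFactors = FALSE)

# Lower-case the code, look up its multiplier, and compute damage in USD.
toy$multiplier <- unname(exp_map[tolower(toy$PROPDMGEXP)])
toy$property.damage <- toy$PROPDMG * toy$multiplier

toy
```

The first three rows come out as 25,000, 2.5 million and 1 billion USD. The row with the unrecognised code "?" gets NA, which is why the later sums use na.rm = TRUE.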
  • Data was summarized by calculating the sum() of property and crop damage by cluster, with the most common event in each cluster used to label it via eventfreq(). Two data frames, prop and crop, were generated, sorted in descending order of damage.
#Calculate most frequent event in the cluster and aggregate by cluster
ecocon<- df2 %>% arrange(Cluster) %>% group_by(Cluster) %>% 
        summarise(Event= eventfreq(EVTYPE), property.damage= sum(property.damage, na.rm=T),
                  crop.damage=sum(crop.damage, na.rm= T))

#Create 2 data frames for property and crop damage arranged in descending order
prop<- ecocon %>% arrange(desc(property.damage))
crop<- ecocon %>% arrange(desc(crop.damage))

Plots

  1. Public health consequences plots. Bar charts were chosen to visualize fatalities and injuries for the top 10 events.
##ggplot
p1<- ggplot(data=fats[1:10,], aes(y=Fatalities, x=reorder(capitalize(Event), Fatalities)))+
                    geom_bar(stat="identity", fill="cyan3")+ coord_flip() +theme_light() + 
        geom_text(data=fats[1:10,],aes(label= Fatalities,vjust=0)) +
        labs(x="Events", y="Fatalities") +
        ggtitle("Fatalities by Event type", 
subtitle = "This chart shows top 10 causes of fatalities from 1950 to 2011")

p2<- ggplot(data=injs[1:10,], aes(y=Injuries, x=reorder(capitalize(Event), Injuries)))+
        geom_bar(stat="identity", fill= "#FF9999")+ coord_flip() +labs(x="Events", y="Injuries") +
                geom_text(data=injs[1:10,],aes(label= Injuries,vjust=0)) +
        ggtitle("Injuries by Event type",
subtitle = "This chart shows top 10 causes of injuries from 1950 to 2011") +theme_light()

p<- grid.arrange(p1,p2, ncol = 2)

print(p)
## TableGrob (1 x 2) "arrange": 2 grobs
##   z     cells    name           grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
  2. Economic consequences plots
p1<- ggplot(data=prop[1:10,], aes(y=round(property.damage/10^9, digits = 0), x=reorder(capitalize(Event), property.damage)))+
        geom_bar(stat="identity", fill="cyan3")+ coord_flip() +theme_light() + 
        labs(x="Events", y="Property damage in billions USD") +
        geom_text(data=prop[1:10,],aes(label=round(property.damage/10^9, digits = 0),vjust=0)) +
        ggtitle("Property damage by Event type", 
                subtitle = "This chart shows top 10 causes of property damage from 1950 to 2011")

p2<- ggplot(data=crop[1:10,], aes(y=round(crop.damage/10^9, digits = 0), x=reorder(capitalize(Event), crop.damage)))+
        geom_bar(stat="identity", fill= "#FF9999")+ coord_flip() +
        labs(x="Events", y="Crop damage in billions USD") +
        geom_text(data=crop[1:10,],aes(label=round(crop.damage/10^9, digits = 0),vjust=0)) +
        ggtitle("Crop damage by Event type",
                subtitle = "This chart shows top 10 causes of crop damage from 1950 to 2011") +theme_light()

p<- grid.arrange(p1,p2, ncol = 2)

Results and discussion

  1. Public health consequences
    Results show that tornadoes are the leading cause of both fatalities and injuries, accounting for nearly 5,000 fatalities and 91,000 injuries. Excessive heat was the second leading cause of fatalities, while thunderstorms ranked second as a cause of injury. Floods, wind, lightning and extreme cold were also major causes of fatalities and injuries.

  2. Economic consequences
    The analysis showed that floods are the leading cause of property damage, with an estimated 145 billion USD in damage from 1950 to 2011. Estimated property damage from hurricanes and tornadoes was 85 and 45 billion USD, respectively.
    The leading cause of crop damage was drought (14 billion USD), followed by floods (6 billion USD) and hurricanes (6 billion USD).