Storm data is published by the National Oceanic and Atmospheric Administration (NOAA) to document:
1. The occurrence of storms and other events that may cause loss of life, injuries, and property or crop damage.
2. Rare and unusual weather events.
Data pertaining to various weather events are publicly available through the Storm Events Database as CSV files. The latest database contains information for the period from January 1950 through October 2016. Information regarding the database can also be found through NOAA’s NWS documentation services.
Information in the database is mainly collected by the National Weather Service, but data can also be obtained from other sources such as the media, law enforcement and other governmental agencies.
The database used in the current study contains data from January 1950 to November 2011. It is available as a bz2 file at the following link.
#libraries
library(car)
library(ggplot2)
library(dplyr)
library(data.table)
library(stringdist)
library(Hmisc)
library(gridExtra)
fread() from the data.table package allows reading large data sets in a small amount of time.
setwd("c:/users/ahmed/desktop/assignment 4")
storm<- fread("storm.csv", sep = ",")
cols<- c("BGN_DATE", "STATE", "EVTYPE", "FATALITIES", "INJURIES", agrep("DMG", names(storm), value = T))
*Extract events that caused either fatalities or injuries and exclude the others, to reduce the dataset to the lowest number of observations possible.
health<- storm %>% select(one_of(cols)) %>% filter(INJURIES >0 | FATALITIES >0)
length(unique(health$EVTYPE))
## [1] 220
*The number of distinct event types was too high because of spelling variations and non-unified recording methods over time, so similar events were clustered together using the stringdist package. This was done with the function clustfunc(). Distance matrices were constructed inside that function using the jw method, and the cluster tree was cut at a height of 0.32 (h = 0.32 in cutree()), as this value was found to produce the best results.
#Create a function that takes a data frame, the index of the column to cluster on and a cut height, and returns the data frame with a Cluster column
clustfunc<- function(data, x, i){
  data<- as.data.frame(data) #plain data frame so that data[,x] returns a vector
  data[,x]<- tolower(data[,x])
  events<- unique(data[,x])
  distmat<- stringdistmatrix(events, events, method = "jw")
  rownames(distmat)<- events
  clus<- hclust(as.dist(distmat))
  clusters<- as.data.frame(cutree(clus, h=i))
  clusters$Event<- row.names(clusters)
  row.names(clusters)<- NULL
  colnames(clusters)<- c("Cluster", "Event")
  #Attach the cluster number to every row of the original data
  merge(x=data, y=clusters, by.x = names(data)[x], by.y = "Event", all.x = TRUE, all.y = FALSE)
}
*Now we apply clustfunc() to the health data frame.
df<- clustfunc(data=health, x=3, i=0.32)
length(unique(df$Cluster))
## [1] 80
#Manually merge cluster 9 into cluster 2
df$Cluster[df$Cluster==9]<- 2
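As a rough illustration of what the clustering step does, the sketch below applies the same stringdistmatrix(), hclust() and cutree() calls to a handful of made-up event names. The `events` vector is an assumption chosen for illustration and is not taken from the storm data. Spelling variants of the same event should receive the same cluster number.

```r
library(stringdist)

# Made-up event names (illustrative only, not from the storm data)
events <- c("tornado", "tornados", "tornadoes", "flood", "flooding", "heat")

# Pairwise string distances using stringdist's "jw" method
d <- stringdistmatrix(events, events, method = "jw")
rownames(d) <- events

# Hierarchical clustering, then cut the tree at height 0.32
hc <- hclust(as.dist(d))
clusters <- cutree(hc, h = 0.32)
print(clusters)
```

Lowering h tends to split the names into more clusters, while raising it merges more distinct names together, which is why the cut height has to be tuned by inspecting the resulting groups.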
Fatalities and injuries were summed with sum() by cluster, and the most common event in each cluster was used to represent it. The eventfreq() function was written for that purpose.
#Most common event in a cluster
eventfreq<- function(x){
names(sort(table(x), decreasing = T)[1])
}
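To make the labelling rule concrete, here is a small sketch that runs the same helper on a made-up cluster. The `toy_cluster` vector is hypothetical and only serves to show which name is picked.

```r
# Same helper as in the analysis: the most frequent value in a vector
eventfreq <- function(x) {
  names(sort(table(x), decreasing = TRUE)[1])
}

# Made-up cluster contents (illustrative only)
toy_cluster <- c("tstm wind", "thunderstorm wind", "tstm wind", "tstm wind")
label <- eventfreq(toy_cluster)
print(label)
```

Here "tstm wind" occurs three times and "thunderstorm wind" once, so "tstm wind" becomes the label for the whole cluster.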
#Summary
healthcon<- df %>% arrange(Cluster) %>% group_by(Cluster) %>% summarise(Event= eventfreq(EVTYPE), Fatalities= sum(FATALITIES), Injuries=sum(INJURIES))
#Fatalities df
fats<- arrange(healthcon,desc(Fatalities))
#Injuries df
injs<- arrange(healthcon, desc(Injuries))
dmg<- storm %>% select(one_of(cols)) %>% filter(PROPDMG > 0 | CROPDMG > 0)
# A cut height of 0.42 gave better clusters for these events
df2<-clustfunc(dmg,3, i=0.42)
length(unique(df2$EVTYPE)) #before clustering
## [1] 397
length(unique(df2$Cluster)) #after clustering
## [1] 70
Recode() from the car package was used to transform k = 1000, m = 10^6 and b = 10^9 after converting all characters to lower case. Other values were assigned as NA.
#Convert exponents to numeric values
df2<-df2 %>% mutate(PROPDMGEXP=tolower(PROPDMGEXP), CROPDMGEXP=tolower(CROPDMGEXP))
df2$PROPDMGEXP<- Recode(df2$PROPDMGEXP, "'k'=1000; 'm'=10^6; 'b'=10^9; else=NA", as.factor.result = F)
df2$CROPDMGEXP<- Recode(df2$CROPDMGEXP, "'k'=1000; 'm'=10^6; 'b'=10^9; else=NA", as.factor.result = F)
property.damage and crop.damage were then calculated in USD.
#Calculate property and crop damage
df2<- df2 %>% mutate(property.damage=PROPDMG*PROPDMGEXP,
crop.damage=CROPDMGEXP*CROPDMG)
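The conversion above can be checked by hand on a few values. The sketch below is not the code used in the analysis: it swaps Recode() for a base R named lookup vector, and the `dmg_value` and `dmg_exp` inputs are made up for illustration.

```r
# Multipliers for the recognised exponent codes
exp_map <- c(k = 1e3, m = 1e6, b = 1e9)

# Made-up damage figures and exponent codes (illustrative only)
dmg_value <- c(25, 2.5, 1, 10)
dmg_exp <- c("K", "m", "B", "+")

# Lower-case the codes, then look them up; unknown codes become NA
multiplier <- unname(exp_map[tolower(dmg_exp)])
damage_usd <- dmg_value * multiplier
print(damage_usd)
```

The first three values become 25,000, 2.5 million and 1 billion USD, while the unrecognised code "+" yields NA, matching the `else=NA` rule in the Recode() calls. Such NA rows are later dropped from the totals by `na.rm = T`.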
Property and crop damages were summed with sum() by cluster, and the most common event in each cluster, found with eventfreq(), was used to define the cluster. Two data frames, prop and crop, were generated and sorted in descending order of damage.
#Calculate most frequent event in the cluster and aggregate by cluster
ecocon<- df2 %>% arrange(Cluster) %>% group_by(Cluster) %>%
summarise(Event= eventfreq(EVTYPE), property.damage= sum(property.damage, na.rm=T),
crop.damage=sum(crop.damage, na.rm= T))
#Create 2 data frames for property and crop damage arranged in descending order
prop<- ecocon %>% arrange(desc(property.damage))
crop<- ecocon %>% arrange(desc(crop.damage))
##ggplot
p1<- ggplot(data=fats[1:10,], aes(y=Fatalities, x=reorder(capitalize(Event), Fatalities)))+
geom_bar(stat="identity", fill="cyan3")+ coord_flip() +theme_light() +
geom_text(data=fats[1:10,],aes(label= Fatalities,vjust=0)) +
labs(x="Events", y="Fatalities") +
ggtitle("Fatalities by Event type",
subtitle = "This chart shows the top 10 causes of fatalities from 1950 to 2011")
p2<- ggplot(data=injs[1:10,], aes(y=Injuries, x=reorder(capitalize(Event), Injuries)))+
geom_bar(stat="identity", fill= "#FF9999")+ coord_flip() +labs(x="Events", y="Injuries") +
geom_text(data=injs[1:10,],aes(label= Injuries,vjust=0)) +
ggtitle("Injuries by Event type",
subtitle = "This chart shows the top 10 causes of injuries from 1950 to 2011") +theme_light()
p<- grid.arrange(p1,p2, ncol = 2)
print(p)
## TableGrob (1 x 2) "arrange": 2 grobs
## z cells name grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
p1<- ggplot(data=prop[1:10,], aes(y=round(property.damage/10^9, digits = 0), x=reorder(capitalize(Event), property.damage)))+
geom_bar(stat="identity", fill="cyan3")+ coord_flip() +theme_light() +
labs(x="Events", y="Property damage in billions USD") +
geom_text(data=prop[1:10,],aes(label=round(property.damage/10^9, digits = 0),vjust=0)) +
ggtitle("Property damage by Event type",
subtitle = "This chart shows the top 10 causes of property damage from 1950 to 2011")
p2<- ggplot(data=crop[1:10,], aes(y=round(crop.damage/10^9, digits = 0), x=reorder(capitalize(Event), crop.damage)))+
geom_bar(stat="identity", fill= "#FF9999")+ coord_flip() +
labs(x="Events", y="Crop damage in billions USD") +
geom_text(data=crop[1:10,],aes(label=round(crop.damage/10^9, digits = 0),vjust=0)) +
ggtitle("Crop damage by Event type",
subtitle = "This chart shows the top 10 causes of crop damage from 1950 to 2011") +theme_light()
p<- grid.arrange(p1,p2, ncol = 2)
Public health consequences
Results show that tornadoes are the leading cause of both injuries and fatalities, accounting for nearly 5000 fatalities and 91,000 injuries. Excessive heat was the second leading cause of fatalities, while thunderstorms came in second place as a cause of injury. Floods, wind, lightning and extreme cold were also major causes of fatalities and injuries.
Economic consequences
Analysis showed that floods are the leading cause of property damage, causing an estimated 145 billion USD in damage from 1950 to 2011. Estimated property damage produced by hurricanes and tornadoes was 85 and 45 billion USD, respectively.
The leading cause of crop damage was drought (14 billion USD), followed by floods (6 billion USD) and hurricanes (6 billion USD).