The data analysed here span about 50 years of records, and over such a long period the accuracy and methodology of data collection changed, which affects the reliability of the earlier records. I therefore used the most complete interval available, which runs from 1990 to the end of the dataset. The raw data are poorly organised and required a significant amount of cleaning. Not every event type was considered: I kept the main event types and merged closely related ones into them. After aggregation, the bar charts in this report show that, for population health, tornadoes cause the most injuries while heat events account for the most fatalities. For economic damage, tornadoes, hurricanes and floods dominate property losses, while floods and cold are the only major contributors to crop losses.
After looking at the contents and the table provided by NOAA, I decided to use the data from 1990 to 2011, which means working with about 90% of the data.
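Data download and date filtering were done with:
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", destfile = "rawdata.csv.bz2")
# Read the file that was just downloaded; read.csv decompresses bz2 files directly
rawset <- read.csv("rawdata.csv.bz2", stringsAsFactors = FALSE)
rawset$BGN_DATE <- as.Date(rawset$BGN_DATE, format = "%m/%d/%Y %H:%M:%S")
# Install dplyr only if it is missing
if (!requireNamespace("dplyr", quietly = TRUE)) install.packages("dplyr")
library(dplyr)
# Keep events that began on or after 1 February 1990
subsetted <- filter(rawset, BGN_DATE >= as.Date("1990-02-01"), BGN_DATE < as.Date("2013-02-03"))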
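The share of records retained by a date cutoff can be checked directly. The sketch below uses a small invented data frame (`toy`, with made-up dates) rather than the real storm data, and a hypothetical cutoff of 1 January 1990; it only illustrates the calculation.

```r
# Toy stand-in for the storm data; the dates are invented for illustration
toy <- data.frame(
  BGN_DATE = c("4/18/1950 0:00:00", "6/1/1985 0:00:00",
               "3/2/1993 0:00:00", "8/29/2005 0:00:00", "5/22/2011 0:00:00"),
  stringsAsFactors = FALSE
)

# Parse the date part; as.Date discards the time of day
toy$BGN_DATE <- as.Date(toy$BGN_DATE, format = "%m/%d/%Y %H:%M:%S")

# Logical vector: TRUE for rows on or after the (assumed) cutoff
keep <- toy$BGN_DATE >= as.Date("1990-01-01")

# The mean of a logical vector is the proportion of TRUE values
share_kept <- mean(keep)
print(share_kept)
```

Running the same two lines on the full dataset (with `rawset` in place of `toy`) would give the actual proportion of rows kept.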
The dataset was also trimmed down to the columns needed for the analysis:
subsetted1<-subsetted[,c("BGN_DATE","FATALITIES","INJURIES","PROPDMG","PROPDMGEXP","CROPDMG","CROPDMGEXP","EVTYPE")]
The columns PROPDMGEXP and CROPDMGEXP contain non-standard values other than (h, k, m, b). These values were set to zero using the following code:
# Unify PROPDMGEXP and CROPDMGEXP to lower case
subsetted1$PROPDMGEXP <- tolower(subsetted1$PROPDMGEXP)
subsetted1$CROPDMGEXP <- tolower(subsetted1$CROPDMGEXP)
expense <- c("h", "k", "m", "b")
# Zero every value that is not a valid code; %in% tests membership, whereas != would recycle expense element by element
subsetted1$PROPDMGEXP[!(subsetted1$PROPDMGEXP %in% expense)] <- 0
subsetted1$CROPDMGEXP[!(subsetted1$CROPDMGEXP %in% expense)] <- 0
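When a column is checked against a set of allowed codes, the comparison operator matters. The sketch below, on an invented vector `codes`, contrasts an element-by-element `!=` comparison (which silently recycles the shorter vector) with a membership test using `%in%`. The example values are assumptions chosen for illustration.

```r
# Invented exponent codes; "?" and "5" stand in for non-standard values
codes   <- c("k", "m", "b", "h", "?", "k", "m", "5")
allowed <- c("h", "k", "m", "b")

# Element-by-element comparison: allowed is recycled to h,k,m,b,h,k,m,b,
# so valid codes are flagged whenever they sit in the "wrong" position
elementwise <- codes != allowed

# Membership test: TRUE only for codes that are not in the allowed set
invalid <- !(codes %in% allowed)

# Replace only the genuinely invalid codes
cleaned <- codes
cleaned[invalid] <- "0"
print(cleaned)
```

Only the membership test isolates the two non-standard entries; the element-by-element comparison also flags the first four valid codes.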
I also created the columns CROPCASH and PROPCASH, where CROPCASH = CROPDMG x CROPDMGEXP and PROPCASH = PROPDMG x PROPDMGEXP, using the following code:
#Creating the total cash column
subsetted1$PROPDMGEXP[subsetted1$PROPDMGEXP=="h"]<-100
subsetted1$PROPDMGEXP[subsetted1$PROPDMGEXP=="k"]<-1000
subsetted1$PROPDMGEXP[subsetted1$PROPDMGEXP=="m"]<-1000000
subsetted1$PROPDMGEXP[subsetted1$PROPDMGEXP=="b"]<-1000000000
subsetted1$PROPDMGEXP<-as.numeric(subsetted1$PROPDMGEXP)
subsetted1$PROPCASH<-subsetted1$PROPDMGEXP*subsetted1$PROPDMG
subsetted1$CROPDMGEXP[subsetted1$CROPDMGEXP=="h"]<-100
subsetted1$CROPDMGEXP[subsetted1$CROPDMGEXP=="k"]<-1000
subsetted1$CROPDMGEXP[subsetted1$CROPDMGEXP=="m"]<-1000000
subsetted1$CROPDMGEXP[subsetted1$CROPDMGEXP=="b"]<-1000000000
subsetted1$CROPDMGEXP<-as.numeric(subsetted1$CROPDMGEXP)
subsetted1$CROPCASH<-subsetted1$CROPDMGEXP*subsetted1$CROPDMG
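The same conversion can also be expressed with a lookup vector, which keeps the code-to-multiplier mapping in one place. This is a minimal sketch on an invented data frame `toy`; the names `multiplier`, `exp_code` and `factor_value` are hypothetical and not part of the analysis above.

```r
# Named vector mapping each exponent code to its multiplier
multiplier <- c(h = 1e2, k = 1e3, m = 1e6, b = 1e9)

# Invented damage records; "?" stands in for a non-standard code
toy <- data.frame(
  PROPDMG    = c(2.5, 10, 1, 3, 7),
  PROPDMGEXP = c("K", "m", "B", "h", "?"),
  stringsAsFactors = FALSE
)

# Look up the multiplier by name; codes missing from the lookup give NA
exp_code     <- tolower(toy$PROPDMGEXP)
factor_value <- unname(multiplier[exp_code])

# Treat unknown codes as zero, matching the zeroing step in the text
factor_value[is.na(factor_value)] <- 0

toy$PROPCASH <- toy$PROPDMG * factor_value
print(toy)
```

The one lookup vector can be reused for CROPDMGEXP, so the four assignment lines do not need to be repeated per column.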
After looking at the injury counts and making sense of the EVTYPE values, I created a reduced set of event categories to scale the dataset down to. Several raw event types are merged into one category, as listed in the table below. The reason is that, for example, I think thunderstorm and lightning events can be grouped under one category.
| Name in Dataset | Selected Category |
|---|---|
| “FLOOD” | “FLOOD” |
| “TORNA” | “TORNADO” |
| “STORM” | “STORM” |
| “RAIN” | “RAIN” |
| “COLD” | “COLD” |
| “HEAT” | “HEAT” |
| “SNOW” | “SNOW” |
| “HAIL” | “SNOW” |
| “HURRI” | “HURRICANE” |
| “WIND” | “WIND” |
| “COL” | “COLD” |
| “FREEZ” | “COLD” |
| “WATER” | “FLOOD” |
| “HOT” | “HEAT” |
| “ICE” | “SNOW” |
| “WINT” | “COLD” |
| “FIRE” | “FIRE” |
| “BLIZZ” | “SNOW” |
| “LIGHT” | “THUNDER” |
| “AVALA” | “AVALANCHE” |
The following code chunk performs the above substitution:
subsetted1$EVTYPE[grep("FLOOD",subsetted1$EVTYPE, ignore.case = T)]<-"FLOOD"
subsetted1$EVTYPE[grep("TORNA",subsetted1$EVTYPE,ignore.case = T)]<-"TORNADO"
subsetted1$EVTYPE[grep("STORM",subsetted1$EVTYPE,ignore.case = T)]<-"STORM"
subsetted1$EVTYPE[grep("RAIN",subsetted1$EVTYPE,ignore.case = T)]<-"RAIN"
subsetted1$EVTYPE[grep("COLD",subsetted1$EVTYPE,ignore.case = T)]<-"COLD"
subsetted1$EVTYPE[grep("HEAT",subsetted1$EVTYPE,ignore.case = T)]<-"HEAT"
subsetted1$EVTYPE[grep("SNOW",subsetted1$EVTYPE,ignore.case = T)]<-"SNOW"
subsetted1$EVTYPE[grep("HAIL",subsetted1$EVTYPE,ignore.case = T)]<-"SNOW"
subsetted1$EVTYPE[grep("HURRI",subsetted1$EVTYPE,ignore.case = T)]<-"HURRICANE"
subsetted1$EVTYPE[grep("WIND",subsetted1$EVTYPE,ignore.case = T)]<-"WIND"
subsetted1$EVTYPE[grep("COL",subsetted1$EVTYPE,ignore.case = T)]<-"COLD"
subsetted1$EVTYPE[grep("FREEZ",subsetted1$EVTYPE,ignore.case = T)]<-"COLD"
subsetted1$EVTYPE[grep("WATER",subsetted1$EVTYPE,ignore.case = T)]<-"FLOOD"
subsetted1$EVTYPE[grep("HOT",subsetted1$EVTYPE,ignore.case = T)]<-"HEAT"
subsetted1$EVTYPE[grep("ICE",subsetted1$EVTYPE,ignore.case = T)]<-"SNOW"
subsetted1$EVTYPE[grep("WINT",subsetted1$EVTYPE,ignore.case = T)]<-"COLD"
subsetted1$EVTYPE[grep("FIRE",subsetted1$EVTYPE,ignore.case = T)]<-"FIRE"
subsetted1$EVTYPE[grep("AVALA",subsetted1$EVTYPE,ignore.case = T)]<-"AVALANCHE"
subsetted1$EVTYPE[grep("BLIZZ",subsetted1$EVTYPE,ignore.case = T)]<-"SNOW"
subsetted1$EVTYPE[grep("LIGHT",subsetted1$EVTYPE,ignore.case = T)]<-"THUNDER"
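A pattern-by-pattern recode like this can be written as a loop over a named vector of rules. The sketch below is an illustration on invented event names, with a shortened rule set; `recode_rules` and `recode_events` are hypothetical names. Rules are applied in order, so an earlier rule can claim a value before a later one sees it.

```r
# Pattern (name) -> category (value); a shortened, illustrative rule set
recode_rules <- c(FLOOD = "FLOOD", TORNA = "TORNADO", STORM = "STORM",
                  WIND = "WIND", WINT = "COLD")

# Apply each rule in turn, overwriting every value that matches the pattern
recode_events <- function(evtype, rules) {
  for (pattern in names(rules)) {
    hits <- grepl(pattern, evtype, ignore.case = TRUE)
    evtype[hits] <- rules[[pattern]]
  }
  evtype
}

# Invented raw event names
toy <- c("FLASH FLOOD", "Tornado F2", "THUNDERSTORM WINDS",
         "HIGH WIND", "WINTER WEATHER", "DROUGHT")
recoded <- recode_events(toy, recode_rules)
print(recoded)

# Plain substrings are used as patterns because regex quantifiers can widen
# a match: in "WINT* *" both "T*" and " *" may match nothing, so it matches "WIN"
print(grepl("WINT* *", "WIND"))
```

Because the STORM rule runs before the WIND rule, "THUNDERSTORM WINDS" ends up as STORM, while "DROUGHT" matches no rule and is left unchanged.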
I created a vector cats that contains the categories kept for the analysis, and then dropped all rows whose EVTYPE is NA or is not in that vector (this includes THUNDER and AVALANCHE, which are not listed in cats):
cats<-c("TORNADO","STORM","RAIN","HEAT","HURRICANE","WIND","FLOOD","COLD","FIRE","SNOW")
key <- which(subsetted1$EVTYPE %in% cats) # rows whose event type is one of the kept categories
subsetted2 <- subsetted1[key, ]
Now that the dataset is clean, the variables can be aggregated by event type to measure both the human impact and the financial impact, using the code:
# Sum fatalities, injuries, crop losses and property losses per event category
final <- aggregate(cbind(FATALITIES, INJURIES, CROPCASH, PROPCASH) ~ EVTYPE, data = subsetted2, FUN = sum)
colnames(final) <- c("Event", "Fatalities", "Injuries", "Crop Total Cash ($)", "Property Total Cash ($)")
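To make the aggregation step concrete, here is a small sketch on invented numbers (the data frame `toy` and its values are made up). It sums two columns per event type with the formula interface of `aggregate`, which by default drops rows containing NA in the variables used.

```r
# Invented per-event records
toy <- data.frame(
  EVTYPE     = c("HEAT", "TORNADO", "HEAT", "FLOOD", "TORNADO"),
  FATALITIES = c(10, 2, 5, 1, 3),
  INJURIES   = c(4, 50, 6, 2, 40),
  stringsAsFactors = FALSE
)

# One row per event type, with each measure summed within the group
totals <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = toy, FUN = sum)
print(totals)
```

The result has one row each for FLOOD, HEAT and TORNADO, sorted by event name.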
To answer the question of which event types are most harmful to population health, the aggregated totals in final were reshaped and plotted with the following code:
# Install the reshaping and plotting packages only if they are missing
for (pkg in c("reshape", "ggplot2")) if (!requireNamespace(pkg, quietly = TRUE)) install.packages(pkg)
library(reshape)
library(ggplot2)
# Long format: one row per event and measure (Fatalities or Injuries)
finalh <- melt(final[, 1:3], id.vars = "Event")
ggplot(finalh, aes(x = Event, y = value, fill = variable)) +
  geom_bar(position = "dodge", stat = "identity") +
  ggtitle("Most harmful events to population health") +
  labs(x = "", y = "Number of people affected") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
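The grouped bar chart needs the data in long format, with one row per event and measure. The sketch below builds that shape by hand in base R on invented totals, purely to show what the reshaped table looks like; it is not the reshape package's implementation, and the numbers are made up.

```r
# Invented wide table: one row per event, one column per measure
wide <- data.frame(
  Event      = c("HEAT", "TORNADO"),
  Fatalities = c(15, 5),
  Injuries   = c(10, 90),
  stringsAsFactors = FALSE
)

# Long table: the measure name moves into "variable", its number into "value"
long <- data.frame(
  Event    = rep(wide$Event, times = 2),
  variable = rep(c("Fatalities", "Injuries"), each = nrow(wide)),
  value    = c(wide$Fatalities, wide$Injuries),
  stringsAsFactors = FALSE
)
print(long)
```

Each event now appears once per measure, which is what lets `fill = variable` draw side-by-side bars.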
The bar chart above shows that tornadoes have the greatest impact on population health in terms of injuries, while heat events account for the majority of fatalities.
# Long format: one row per event and loss type (crop or property)
finalp <- melt(final[, c(1, 4, 5)], id.vars = "Event")
ggplot(finalp, aes(x = Event, y = value, fill = variable)) +
  geom_bar(position = "dodge", stat = "identity") +
  ggtitle("Most harmful events to Property and Crops") +
  labs(x = "", y = "Loss Value ($)") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Here we can see that several event types dominate property damage (tornadoes, hurricanes and floods), while for crops only floods and cold have a major effect.