Storm Events Database Analysis Project

Summary

The data analysed here cover more than 50 years of recording. Over such a long span, changes in the accuracy and methodology of data collection are likely to affect the reliability of the records. I therefore picked the period with the most complete data, 1990 to 2011, which runs to the end of the dataset. The data are poorly organized and required a significant amount of cleaning, as shown below. Not all event types were considered: I selected the main event types and merged many of the remaining ones into them where they were closely related.

After aggregating the data, the results show that, in terms of population health, tornadoes cause the most injuries while heat causes the most fatalities. In terms of economic impact, tornadoes, hurricanes and floods are the main sources of property damage, while only floods and cold have a major effect on crops.

Data Processing

Based on the contents of the dataset and the table provided by NOAA, I decided to use the data from 1990 to 2011, which means working with about 90% of the data.

[Image: table provided by NOAA]

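Data download and date filtering were done by:

# install.packages("dplyr")  # run once if dplyr is not installed
library(dplyr)

# Download the compressed CSV and read it back from the same local file name
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", destfile = "rawdata.csv.bz2")
rawset <- read.csv("rawdata.csv.bz2")
# BGN_DATE is stored as text such as "4/18/1950 0:00:00"; keep only the date
rawset$BGN_DATE <- as.Date(rawset$BGN_DATE, format = "%m/%d/%Y %H:%M:%S")
# Keep events from February 1990 onward (the upper bound lies past the end of the data)
subsetted <- filter(rawset, BGN_DATE >= as.Date("1990-02-01"), BGN_DATE < as.Date("2013-02-03"))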
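The date handling can be checked on a small example before running it on the full file. The sketch below is only an illustration: the data frame toy and its three dates are made up, and it uses base R subsetting in place of dplyr so that it runs without extra packages.

```r
# Toy stand-in for the BGN_DATE column (values are made up for illustration)
toy <- data.frame(
  BGN_DATE = c("4/18/1950 0:00:00", "2/1/1990 0:00:00", "11/30/2011 0:00:00"),
  stringsAsFactors = FALSE
)

# as.Date parses the full timestamp and keeps only the calendar date
toy$BGN_DATE <- as.Date(toy$BGN_DATE, format = "%m/%d/%Y %H:%M:%S")

# Same bounds as the filter in this report, written with base R subsetting
in_range <- toy$BGN_DATE >= as.Date("1990-02-01") & toy$BGN_DATE < as.Date("2013-02-03")
kept <- toy[in_range, , drop = FALSE]
print(kept)
```

The 1950 row falls before the lower bound and is dropped; the other two rows are kept.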

The dataset was also trimmed down to the columns we need:

subsetted1<-subsetted[,c("BGN_DATE","FATALITIES","INJURIES","PROPDMG","PROPDMGEXP","CROPDMG","CROPDMGEXP","EVTYPE")]

The columns PROPDMGEXP and CROPDMGEXP contain non-standard values other than (h, k, m, b). These values were set to zero using the following code:

# Unify PROPDMGEXP and CROPDMGEXP to lower case
subsetted1$PROPDMGEXP <- tolower(subsetted1$PROPDMGEXP)
subsetted1$CROPDMGEXP <- tolower(subsetted1$CROPDMGEXP)
expense <- c("h", "k", "m", "b")
# %in% tests each value for membership; != would recycle expense along the column
subsetted1$PROPDMGEXP[!(subsetted1$PROPDMGEXP %in% expense)] <- "0"  # zero all other values
subsetted1$CROPDMGEXP[!(subsetted1$CROPDMGEXP %in% expense)] <- "0"  # zero all other values
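Whether a value belongs to the set (h, k, m, b) is a membership test, which R expresses with %in%. Comparing a column to a four-element vector with != instead pairs the values position by position and recycles the shorter vector, so valid codes can be zeroed by accident. A minimal sketch on a made-up vector exp_toy (not taken from the dataset):

```r
expense <- c("h", "k", "m", "b")

# Made-up sample of exponent codes, including non-standard ones
exp_toy <- c("k", "K", "", "m", "5", "+", "b", "h")
exp_toy <- tolower(exp_toy)

# TRUE where the code is one of h, k, m, b
keep <- exp_toy %in% expense

# Replace everything else with "0" (the vector is still character here)
exp_toy[!keep] <- "0"
print(exp_toy)
```

Only the empty string, "5" and "+" are replaced; every h, k, m or b survives regardless of its position in the vector.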

I also created the columns CROPCASH and PROPCASH, where CROPCASH = CROPDMG x CROPDMGEXP and PROPCASH = PROPDMG x PROPDMGEXP, using the following code:

#Creating the total cash column
subsetted1$PROPDMGEXP[subsetted1$PROPDMGEXP=="h"]<-100
subsetted1$PROPDMGEXP[subsetted1$PROPDMGEXP=="k"]<-1000
subsetted1$PROPDMGEXP[subsetted1$PROPDMGEXP=="m"]<-1000000
subsetted1$PROPDMGEXP[subsetted1$PROPDMGEXP=="b"]<-1000000000
subsetted1$PROPDMGEXP<-as.numeric(subsetted1$PROPDMGEXP)
subsetted1$PROPCASH<-subsetted1$PROPDMGEXP*subsetted1$PROPDMG

subsetted1$CROPDMGEXP[subsetted1$CROPDMGEXP=="h"]<-100
subsetted1$CROPDMGEXP[subsetted1$CROPDMGEXP=="k"]<-1000
subsetted1$CROPDMGEXP[subsetted1$CROPDMGEXP=="m"]<-1000000
subsetted1$CROPDMGEXP[subsetted1$CROPDMGEXP=="b"]<-1000000000
subsetted1$CROPDMGEXP<-as.numeric(subsetted1$CROPDMGEXP)
subsetted1$CROPCASH<-subsetted1$CROPDMGEXP*subsetted1$CROPDMG
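The same conversion can be written more compactly with a named lookup vector. This is a sketch of an alternative, not the code used for the results in this report; the names multiplier and toy, and the sample values, are assumptions made for illustration.

```r
# Named lookup: exponent code -> dollar multiplier ("0" marks removed codes)
multiplier <- c("0" = 0, h = 1e2, k = 1e3, m = 1e6, b = 1e9)

# Made-up rows standing in for PROPDMG and PROPDMGEXP
toy <- data.frame(
  PROPDMG    = c(25, 2.5, 10, 3),
  PROPDMGEXP = c("k", "m", "0", "b"),
  stringsAsFactors = FALSE
)

# Index the lookup by code, then multiply; unname() drops the code labels
toy$PROPCASH <- toy$PROPDMG * unname(multiplier[toy$PROPDMGEXP])
print(toy)
```

Indexing by name keeps the column numeric from the start, so no as.numeric conversion is needed, and a code missing from the lookup yields NA instead of a silent wrong value.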

Choosing the categories to scale the data down to

After looking at the number of injuries and making sense of all the EVTYPE values, I created a set of event categories to scale the dataset down to. Several event types are merged into one category, as listed in the table below. The reason is that some of them are closely related; for example, thunderstorm and lightning events can be summed up under one category.

Pattern in EVTYPE	Selected Category
"FLOOD"	"FLOOD"
"TORNA"	"TORNADO"
"STORM"	"STORM"
"RAIN"	"RAIN"
"COLD"	"COLD"
"HEAT"	"HEAT"
"SNOW"	"SNOW"
"HAIL"	"SNOW"
"HURRI"	"HURRICANE"
"WIND"	"WIND"
"COL"	"COLD"
"FREEZ"	"COLD"
"WATER"	"FLOOD"
"HOT"	"HEAT"
"ICE"	"SNOW"
"WINT"	"COLD"
"FIRE"	"FIRE"
"BLIZZ"	"SNOW"
"LIGHT"	"THUNDER"
"AVALA"	"AVALANCHE"

The following code chunk performs the substitutions above. The order matters, because each line only sees the values left by the lines before it:

# Each grep returns the rows whose EVTYPE contains the pattern (case-insensitive)
subsetted1$EVTYPE[grep("FLOOD", subsetted1$EVTYPE, ignore.case = TRUE)] <- "FLOOD"
subsetted1$EVTYPE[grep("TORNA", subsetted1$EVTYPE, ignore.case = TRUE)] <- "TORNADO"
subsetted1$EVTYPE[grep("STORM", subsetted1$EVTYPE, ignore.case = TRUE)] <- "STORM"
subsetted1$EVTYPE[grep("RAIN", subsetted1$EVTYPE, ignore.case = TRUE)] <- "RAIN"
subsetted1$EVTYPE[grep("COLD", subsetted1$EVTYPE, ignore.case = TRUE)] <- "COLD"
subsetted1$EVTYPE[grep("HEAT", subsetted1$EVTYPE, ignore.case = TRUE)] <- "HEAT"
subsetted1$EVTYPE[grep("SNOW", subsetted1$EVTYPE, ignore.case = TRUE)] <- "SNOW"
subsetted1$EVTYPE[grep("HAIL", subsetted1$EVTYPE, ignore.case = TRUE)] <- "SNOW"
subsetted1$EVTYPE[grep("HURRI", subsetted1$EVTYPE, ignore.case = TRUE)] <- "HURRICANE"
subsetted1$EVTYPE[grep("WIND", subsetted1$EVTYPE, ignore.case = TRUE)] <- "WIND"
subsetted1$EVTYPE[grep("COL", subsetted1$EVTYPE, ignore.case = TRUE)] <- "COLD"
subsetted1$EVTYPE[grep("FREEZ", subsetted1$EVTYPE, ignore.case = TRUE)] <- "COLD"
subsetted1$EVTYPE[grep("WATER", subsetted1$EVTYPE, ignore.case = TRUE)] <- "FLOOD"
subsetted1$EVTYPE[grep("HOT", subsetted1$EVTYPE, ignore.case = TRUE)] <- "HEAT"
subsetted1$EVTYPE[grep("ICE", subsetted1$EVTYPE, ignore.case = TRUE)] <- "SNOW"
# Plain "WINT": the regex "WINT* *" also matches "WIND" and would turn WIND into COLD
subsetted1$EVTYPE[grep("WINT", subsetted1$EVTYPE, ignore.case = TRUE)] <- "COLD"
subsetted1$EVTYPE[grep("FIRE", subsetted1$EVTYPE, ignore.case = TRUE)] <- "FIRE"
subsetted1$EVTYPE[grep("BLIZZ", subsetted1$EVTYPE, ignore.case = TRUE)] <- "SNOW"
subsetted1$EVTYPE[grep("LIGHT", subsetted1$EVTYPE, ignore.case = TRUE)] <- "THUNDER"
subsetted1$EVTYPE[grep("AVALA", subsetted1$EVTYPE, ignore.case = TRUE)] <- "AVALANCHE"
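The pattern-to-category table can also be applied with a small loop, which keeps the mapping in one place and makes the order of substitution explicit. The sketch below is an illustration under assumptions: the helper recode_events, the shortened mapping and the sample event names are hypothetical and cover only a few of the patterns in the table.

```r
# Ordered mapping: earlier rows are applied first (subset of the table, for illustration)
mapping <- data.frame(
  pattern  = c("FLOOD", "TORNA", "STORM", "WIND", "WINT"),
  category = c("FLOOD", "TORNADO", "STORM", "WIND", "COLD"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: replace every value that contains a pattern with its category
recode_events <- function(evtype, mapping) {
  evtype <- as.character(evtype)
  for (i in seq_len(nrow(mapping))) {
    hit <- grepl(mapping$pattern[i], evtype, ignore.case = TRUE)
    evtype[hit] <- mapping$category[i]
  }
  evtype
}

# Made-up event names resembling raw EVTYPE values
ev <- c("Flash Flood", "TORNADO F2", "THUNDERSTORM WIND",
        "HIGH WIND", "WINTER WEATHER", "DROUGHT")
recoded <- recode_events(ev, mapping)
print(recoded)
```

"THUNDERSTORM WIND" ends up as STORM because the STORM row comes before the WIND row, and "DROUGHT" is left unchanged because no pattern matches it. The patterns are treated as regular expressions, so a quantifier changes their meaning: in "WINT* *" the T is optional, which means that pattern also matches "WIND".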

I created a vector cats containing the categories kept for the analysis, then dropped every row whose EVTYPE was NA or fell outside those categories. Note that THUNDER and AVALANCHE are not in cats, so those events are dropped at this step:

cats<-c("TORNADO","STORM","RAIN","HEAT","HURRICANE","WIND","FLOOD","COLD","FIRE","SNOW")

key<-which ((subsetted1$EVTYPE %in% cats)) # indices of the rows that fall in one of the categories above

subsetted2<-subsetted1[key,]

Now that the dataset is clean, the variables are aggregated by event type to measure both the human impact and the financial impact:

# Sum fatalities, injuries and both damage totals within each event category
final <- aggregate(cbind(FATALITIES, INJURIES, CROPCASH, PROPCASH) ~ EVTYPE,
                   data = subsetted2, FUN = sum)

colnames(final) <- c("Event", "Fatalities", "Injuries", "Crop Total Cash ($)", "Property Total Cash ($)")
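To make the aggregation step concrete, here is a minimal sketch on a made-up table. The data frame toy and its numbers are invented for illustration and are not results from the dataset.

```r
# Made-up events: two HEAT rows, two TORNADO rows, one FLOOD row
toy <- data.frame(
  EVTYPE     = c("HEAT", "TORNADO", "HEAT", "FLOOD", "TORNADO"),
  FATALITIES = c(3, 1, 2, 0, 4),
  INJURIES   = c(10, 50, 5, 2, 30),
  stringsAsFactors = FALSE
)

# One output row per EVTYPE, with each measure summed within the group
totals <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = toy, FUN = sum)
print(totals)
```

The result has one row each for FLOOD, HEAT and TORNADO. With the formula interface, rows containing NA in any of the listed columns are omitted by default, so missing damage values reduce the totals silently instead of raising an error.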

Results

1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

To answer this question, the following code was used:

# install.packages(c("reshape", "ggplot2"))  # run once if not installed
library(reshape)
library(ggplot2)

final <- aggregate(cbind(FATALITIES, INJURIES, CROPCASH, PROPCASH) ~ EVTYPE,
                   data = subsetted2, FUN = sum)

colnames(final) <- c("Event", "Fatalities", "Injuries", "Crop Total Cash ($)", "Property Total Cash ($)")

# Long format: one row per (Event, measure) pair, as needed for grouped bars
finalh <- melt(final[, 1:3], id.vars = "Event")

# binwidth is dropped: it has no effect on bars drawn with stat = "identity"
ggplot(finalh, aes(x = Event, y = value, fill = variable)) +
  geom_bar(position = "dodge", stat = "identity") +
  ggtitle("Most harmful events to population health") +
  labs(x = "", y = "Number of people affected") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

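[Figure: bar chart "Most harmful events to population health", showing fatalities and injuries per event; y-axis: number of people affected]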

The bar chart above shows that tornadoes have the greatest impact on health in terms of injuries, while heat events account for the largest number of fatalities.

2. Across the United States, which types of events have the greatest economic consequences?

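# Long format for the two damage totals (columns 4 and 5 of final)
finalp <- melt(final[, c(1, 4, 5)], id.vars = "Event")

# binwidth is dropped: it has no effect on bars drawn with stat = "identity"
ggplot(finalp, aes(x = Event, y = value, fill = variable)) +
  geom_bar(position = "dodge", stat = "identity") +
  ggtitle("Most harmful events to Property and Crops") +
  labs(x = "", y = "Loss Value ($)") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))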

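[Figure: bar chart "Most harmful events to Property and Crops", showing crop and property losses per event; y-axis: loss value ($)]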

The chart shows that tornadoes, hurricanes and floods are the main sources of property damage, while only floods and cold have a major effect on crops.

Atheer M. Al Attar

October 24, 2015