Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
The goal of this document (a course assignment) is to answer the following questions about severe weather events using the NOAA Storm Database:
- Across the United States, which types of events are most harmful with respect to population health?
- Across the United States, which types of events have the greatest economic consequences?
According to the analysis that follows, which covers the years 2007-2011, “Tornado” caused the most harm to public health, while “Flood” had the greatest economic consequences.
Loading the required packages:
library(dplyr)
library(tidyr)
library(ggplot2)
library(scales)
The data for this analysis come from the NOAA Storm Database as a comma-separated-value (CSV) file compressed with bzip2, which we download from the course website.
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
## Loading and pre-processing are resource-intensive, so the loading and pre-processing chunks use the knitr option cache=TRUE.
## Download the compressed file only if it is not already present (mode="wb" keeps the binary file intact on Windows).
if(!file.exists("noaa.csv.bz2"))
{download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",destfile = "noaa.csv.bz2",mode = "wb")}
##reading as CSV
noaa<-read.csv("noaa.csv.bz2",stringsAsFactors=FALSE,na.strings=c(""))
##Converting BGN_DATE from string to Date type so we can later subset based on the year.
noaa<-transform(noaa,BGN_DATE=as.Date(BGN_DATE,format="%m/%d/%Y %H:%M:%S"))
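The date conversion can be checked on a few hand-written strings before it is applied to the full table. This is a minimal sketch; the values in `raw_dates` are made up for illustration and are not taken from the dataset, and `n_failed` is a hypothetical helper name.

```r
# Toy strings in the same "month/day/year hour:minute:second" layout as BGN_DATE
# (illustrative values, not taken from the dataset).
raw_dates <- c("4/18/1950 0:00:00", "11/30/2011 0:00:00", "not a date")

# as.Date() keeps only the calendar date; strings that do not match
# the format become NA rather than raising an error.
parsed <- as.Date(raw_dates, format = "%m/%d/%Y %H:%M:%S")
print(parsed)

# Count the values that failed to parse.
n_failed <- sum(is.na(parsed))
print(n_failed)
```

Running the same `sum(is.na(...))` check on the converted BGN_DATE column would reveal whether any rows were silently turned into NA.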
Here we do some basic analysis of the raw dataset:
unqEventType<-unique(noaa$EVTYPE)
unqCounty<-unique(noaa$COUNTY)
##date range of the raw data
dateRange<-range(noaa$BGN_DATE)
- The observations in this dataset span the range 1950-01-03 to 2011-11-30.
- There are 985 distinct weather event types (EVTYPE values).
- Records are available for 557 distinct county codes (COUNTY values).
Now we select only the columns that are related to our research questions:
- EVTYPE: type of event
- FATALITIES: number of deaths
- INJURIES: number of injuries
- PROPDMG: property damage in USD, to be scaled by PROPDMGEXP
- PROPDMGEXP: magnitude (exponent code) of the property damage
- CROPDMG: crop damage in USD, to be scaled by CROPDMGEXP
- CROPDMGEXP: magnitude (exponent code) of the crop damage
STATE and BGN_DATE are also kept; BGN_DATE is used to derive the event year (BGN_YEAR).
noaa.selected<-select(noaa,STATE,EVTYPE,BGN_DATE,FATALITIES,INJURIES,PROPDMG,PROPDMGEXP,CROPDMG,CROPDMGEXP)
noaa.selected<-transform(noaa.selected,BGN_YEAR=as.numeric(format(BGN_DATE,"%Y")))
As stated in the assignment, observations from recent years are more complete. To avoid skewing the analysis, we use the subset [BGN_YEAR>2006], i.e. the years 2007-2011, for the rest of this assignment.
noaa.selected2007<-subset(noaa.selected,BGN_YEAR>2006)
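To make the year filter concrete, here is a small sketch on a toy data frame. The rows in `toy` are invented for illustration; only the column names mirror the ones used in this report.

```r
# Toy rows (invented) with the same column names as noaa.selected.
toy <- data.frame(
  EVTYPE = c("TORNADO", "FLOOD", "HAIL", "FLOOD"),
  BGN_DATE = as.Date(c("1999-05-03", "2006-12-31", "2007-01-01", "2011-11-30")),
  stringsAsFactors = FALSE
)

# Derive the numeric year from the date, then keep only years after 2006.
toy$BGN_YEAR <- as.numeric(format(toy$BGN_DATE, "%Y"))
recent <- subset(toy, BGN_YEAR > 2006)
print(recent)
```

The event dated 2006-12-31 is excluded and the one dated 2007-01-01 is kept, which confirms that the condition BGN_YEAR>2006 selects 2007 onward.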
The columns PROPDMGEXP and CROPDMGEXP represent the property damage exponent and the crop damage exponent, respectively.
PROPDMGEXP.Noise<-unique(noaa.selected2007$PROPDMGEXP)
PROPDMGEXP.Noise ##Show
## [1] "K" "M" "B" "0"
CROPDMGEXP.Noise<-unique(noaa.selected2007$CROPDMGEXP)
CROPDMGEXP.Noise ##Show
## [1] "K" "M" "B"
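Besides listing the distinct codes, it can help to see how often each one occurs, including missing values if there are any. The sketch below uses a made-up vector `codes`; it is not the actual column content.

```r
# Made-up exponent codes, including two missing values.
codes <- c("K", "K", "M", NA, "B", "0", NA, "K")

# table() drops NA by default; useNA = "ifany" adds an NA bucket when needed.
counts <- table(codes, useNA = "ifany")
print(counts)

# Number of missing codes, counted directly.
n_missing <- sum(is.na(codes))
print(n_missing)
```

Applied to PROPDMGEXP or CROPDMGEXP, the same call would show how much of the damage is recorded under each magnitude code.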
Here we add columns holding a numeric version of PROPDMGEXP and CROPDMGEXP, filled using a function that does the following:
- it considers the values NA, “+”, “?” and “-” as noise and returns 0.
- it considers “h” and “H” as hundreds and returns 2.
- it considers “k” and “K” as thousands and returns 3.
- it considers “m” and “M” as millions and returns 6.
- it considers “b” and “B” as billions and returns 9.
- for any other value in 0:9 it returns the same value as a number.
translateExponent<-function(x) {
  ##NA is noise and maps to 0
  if(is.na(x)) {
    0
  } else {
    ##letter codes map to powers of ten; digits fall through to as.numeric(x)
    switch(toupper(x),"+"=0,"?"=0,"-"=0,"H"=2,"K"=3,"M"=6,"B"=9,as.numeric(x))
  }
}
##adding two numeric equivalent columns for PROPDMGEXP and CROPDMGEXP using the function translateExponent.
noaa.selected2007<-transform(noaa.selected2007,PROPDMEXPNumeric=sapply(PROPDMGEXP,translateExponent),CROPDMGEXPNumeric=sapply(CROPDMGEXP,translateExponent))
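The mapping rules listed above can also be written as a vectorised lookup, which avoids calling a function once per row. This is only a sketch of an alternative, not the method used in this report: `translate_exponent_vec` and `letters_map` are hypothetical names, and unlike translateExponent it maps any unrecognised code to 0 instead of passing it to as.numeric().

```r
# Hypothetical vectorised alternative to translateExponent().
translate_exponent_vec <- function(x) {
  # Lookup table for the letter codes (powers of ten).
  letters_map <- c(H = 2, K = 3, M = 6, B = 9)
  key <- toupper(x)
  key[is.na(x)] <- NA                 # keep missing values missing
  out <- unname(letters_map[key])     # NA for anything that is not H/K/M/B
  # Single digits "0".."9" map to themselves.
  is_digit <- !is.na(key) & grepl("^[0-9]$", key)
  out[is_digit] <- as.numeric(key[is_digit])
  # Everything else (NA, "+", "?", "-", unknown codes) is treated as noise.
  out[is.na(out)] <- 0
  out
}

result <- translate_exponent_vec(c("K", "m", "B", "0", "5", "+", NA, "h"))
print(result)
```

On large tables a lookup like this is typically faster than sapply(), at the cost of being slightly less explicit about each rule.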
Now that we have translated the exponent values, we use the formula “PROPDMG*(10^PROPDMEXPNumeric) + CROPDMG*(10^CROPDMGEXPNumeric)” to calculate the overall damage per observation and save it as a column for further analysis.
noaa.selected2007<-transform(noaa.selected2007,totalDMG.Economic=(PROPDMG*(10^PROPDMEXPNumeric)+CROPDMG*(10^CROPDMGEXPNumeric)))
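A short worked example of the damage formula, using invented numbers: a property damage of 25 with exponent 3 (the "K" code) counts as 25,000 USD. The data frame `toy` and its values are assumptions made for illustration; the column names match the ones used in this report.

```r
# Invented observations: damage figures with already-translated exponents.
toy <- data.frame(
  PROPDMG = c(25, 2.5, 0),
  PROPDMEXPNumeric = c(3, 6, 0),
  CROPDMG = c(0, 10, 1.2),
  CROPDMGEXPNumeric = c(0, 3, 9)
)

# Same formula as in the analysis: scale each figure by its power of ten, then add.
toy$totalDMG.Economic <- toy$PROPDMG * 10^toy$PROPDMEXPNumeric +
  toy$CROPDMG * 10^toy$CROPDMGEXPNumeric
print(toy)
```

The three rows evaluate to 25,000, 2,510,000 and 1,200,000,000 USD respectively.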
Since there is no explicit instruction on how to weight injuries against fatalities, we simply add the columns “FATALITIES” and “INJURIES” to get the public health damage (totalDMG.PBH):
noaa.selected2007<-transform(noaa.selected2007,totalDMG.PBH=FATALITIES+INJURIES)
To finalize the data processing section, we aggregate the dataset by event type.
noaa.summurized2007<-summarise(group_by(noaa.selected2007,EVTYPE),DMG.Economic=sum(totalDMG.Economic),DMG.PBHealth=sum(totalDMG.PBH))
##converting to factor for better plotting results
noaa.summurized2007<-transform(noaa.summurized2007,EVTYPE=as.factor(EVTYPE))
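The group-and-sum step, followed by a top-N ranking, can be illustrated on a handful of invented rows. The sketch below uses base R (`aggregate` and `order`) as an assumed equivalent of the dplyr calls in this report, so that it runs without extra packages; the data in `toy` are made up.

```r
# Invented per-event rows with the two damage columns built earlier.
toy <- data.frame(
  EVTYPE = c("TORNADO", "FLOOD", "TORNADO", "HAIL", "FLOOD"),
  totalDMG.Economic = c(100, 500, 50, 20, 700),
  totalDMG.PBH = c(12, 1, 30, 0, 2),
  stringsAsFactors = FALSE
)

# Sum both damage columns within each event type.
by_type <- aggregate(cbind(totalDMG.Economic, totalDMG.PBH) ~ EVTYPE,
                     data = toy, FUN = sum)

# Rank by public health damage (descending) and keep the top two types.
top_health <- head(by_type[order(-by_type$totalDMG.PBH), ], 2)
print(by_type)
print(top_health)
```

Here TORNADO ranks first for public health damage (42) while FLOOD has the largest economic total (1200), which mirrors how the two rankings in this report can differ.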
Now we answer the research questions.
The following figure shows which types of events had the highest combined number of injuries and fatalities (public health damage) during the years 2007-2011.
Top10.PBHealth<-head(arrange(noaa.summurized2007,desc(DMG.PBHealth)),10)
##show
print(Top10.PBHealth[,c(1,3)])
## EVTYPE DMG.PBHealth
## 1 TORNADO 10471
## 2 THUNDERSTORM WIND 1521
## 3 LIGHTNING 1082
## 4 EXCESSIVE HEAT 999
## 5 HEAT 884
## 6 FLASH FLOOD 609
## 7 WILDFIRE 455
## 8 WINTER WEATHER 354
## 9 RIP CURRENT 334
## 10 FLOOD 332
options(scipen=10000)
gplot.PBHealth<-ggplot(data=Top10.PBHealth,aes(y=DMG.PBHealth,x=reorder(EVTYPE,DMG.PBHealth)))
gplot.PBHealth<-gplot.PBHealth+geom_bar(stat="identity",width=.5,fill="red")+coord_flip()
gplot.PBHealth<-gplot.PBHealth+labs(y="count of deaths and injuries",x="Event Type")+scale_y_continuous(labels = comma)
gplot.PBHealth<-gplot.PBHealth+ggtitle(label = "public health damage by event(2007-2011)")
gplot.PBHealth
The following figure shows which types of events caused the most property and crop damage in USD (economic damage) during the years 2007-2011.
Top10.Economic<-head(arrange(noaa.summurized2007,desc(DMG.Economic)),10)
##show
print(Top10.Economic[,c(1,2)])
## EVTYPE DMG.Economic
## 1 FLOOD 16855415800
## 2 TORNADO 14732283740
## 3 HAIL 6967790600
## 4 FLASH FLOOD 5752614130
## 5 STORM SURGE/TIDE 4641493000
## 6 THUNDERSTORM WIND 3771560690
## 7 HURRICANE 2648110000
## 8 WILDFIRE 2221507470
## 9 HIGH WIND 1292629040
## 10 FROST/FREEZE 941281000
options(scipen=10000)
gplot.Economic<-ggplot(data=Top10.Economic,aes(y=DMG.Economic,x=reorder(EVTYPE,DMG.Economic)))
gplot.Economic<-gplot.Economic+geom_bar(stat="identity",width=.5,fill="steelblue")+coord_flip()
gplot.Economic<-gplot.Economic+labs(y="Total economic damage in USD",x="Event Type")+scale_y_continuous(labels = comma)
gplot.Economic<-gplot.Economic+ggtitle(label = "Total Economic damage by event(2007-2011)")
gplot.Economic