=========================================================================================
Storm events can cause both public health and economic problems for communities. Many severe events can result in fatalities, injuries, and property damage.
Our hypothesis is that only few storm types highly contributed to the overall health and economic damage in the United States recorded between 1950 and 2011. So, we will explore the US National Oceanic and Atmospheric Administration’s (NOAA) storm database and use the Pareto principle to prove our hypothesis.
From this database we found that there is one undisputable type that was most harmful to public health. Whereas, economic consequences where highly effected by three specific types.
The data for this analysis comes in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. The data file-existing under the link Storm Data- will be automatically downloaded in the next section of the report.
The National Weather Service Storm Data Documentation is also a useful documentation of the database where you will find how some of the variables are constructed/defined.
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
This section describes (in words and code) how the data was downloaded from the source, loaded into R, and processed for analysis.
Downloading the raw data
The analysis starts by downloading the zipped raw data file. Then, the file is unzipped.
fileURL <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
dataFileZIP <- "./repdata_data_StormData.csv.bz2"
if (file.exists(dataFileZIP) == FALSE) {
download.file(fileURL, destfile = dataFileZIP)
}
Loading the raw data file into R
storms <- read.csv("repdata_data_StormData.csv.bz2", stringsAsFactors=FALSE)
Loading necessary R packages
The analysis uses some packages that need to be installed before running the code. The code below loads these packages. Yet, you need to make sure they are installed in your R version before loading.
library(dplyr)
library(ggplot2)
library(reshape2)
Processing data for analysis
To get an overview of the data before any processing let’s see the data characteristics in terms of dimensions (rows and columns) and some of the variables existing in the data.
dim<-dim(storms)
obs<-dim[1]
var<-dim[2]
We have 902297 observations with 37 variables.
To get sense of the data let’s take a look at some of the variables.
options(width = 100)
head(storms[,1:8],5)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE EVTYPE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL TORNADO
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL TORNADO
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL TORNADO
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL TORNADO
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL TORNADO
In this report we are concerned with the events that are most harmful to public health and those having the greatest economic consequences. By reading through the Storm Data Documentation and exploring the variables in the raw data we can conclude the following:
Impact on public health is represented in the following variables:
Impact on economy is represented in the following variables:
Besides, we will use some of the following variables in the analysis:
Hence, we will create a subset of the raw data containing the required variables only.
stormsSub<-storms[,c(2,3,7,8,23:28)]
By looking at the event type (EVTYPE) in the dataset we can see that some descriptions contain spaces that need trimming. So, as an initial processing step, we better remove the leading and trailing spaces.
stormsSub[,"EVTYPE"]<-gsub("(^ +)|( +$)", "", stormsSub$EVTYPE)
As the events in the raw data were not entered consistently throughout the years we need to devise a strategy to replace the EVTYPE descriptions with standard ones as listed in the Storm Data Documentation(2.1.1 Storm Data Event Table, pp.6).
However, replacing all EVTYPE descriptions is a tedious process. So, we will follow the vital few principle to correct the event descriptions by correcting the names of the descriptions that have the highest count of observations in the raw data.
To achieve this we need to calculate the frequency of events descriptions in the raw data and their cumulative share to the overall count of observations.
First, we create a dataset that summarizes the available descriptions with each one’s count of observations sorted in descending order.
EVType<-as.factor(stormsSub$EVTYPE)
EVFreq<-arrange(as.data.frame(table(EVType)),desc(Freq))
Second, we calculate the cumulative sum of observations across all events then its cumulative share of the total number of observations.
TotObs<-nrow(storms) # total number of observations
EVFreq$CumSum<- cumsum(EVFreq$Freq)
EVFreq$CumShare<- round(EVFreq$CumSum/TotObs,digits = 3)
By looking at this dataset (EVFreq) we find that the first 47 descriptions contribute to 99% of total observations as shown below.
head(EVFreq,47)
## EVType Freq CumSum CumShare
## 1 HAIL 288661 288661 0.320
## 2 TSTM WIND 219944 508605 0.564
## 3 THUNDERSTORM WIND 82563 591168 0.655
## 4 TORNADO 60652 651820 0.722
## 5 FLASH FLOOD 54278 706098 0.783
## 6 FLOOD 25326 731424 0.811
## 7 THUNDERSTORM WINDS 20843 752267 0.834
## 8 HIGH WIND 20212 772479 0.856
## 9 LIGHTNING 15755 788234 0.874
## 10 HEAVY SNOW 15708 803942 0.891
## 11 HEAVY RAIN 11723 815665 0.904
## 12 WINTER STORM 11433 827098 0.917
## 13 WINTER WEATHER 7026 834124 0.924
## 14 FUNNEL CLOUD 6839 840963 0.932
## 15 MARINE TSTM WIND 6175 847138 0.939
## 16 MARINE THUNDERSTORM WIND 5812 852950 0.945
## 17 WATERSPOUT 3797 856747 0.950
## 18 STRONG WIND 3566 860313 0.953
## 19 URBAN/SML STREAM FLD 3392 863705 0.957
## 20 WILDFIRE 2761 866466 0.960
## 21 BLIZZARD 2719 869185 0.963
## 22 DROUGHT 2488 871673 0.966
## 23 ICE STORM 2006 873679 0.968
## 24 EXCESSIVE HEAT 1678 875357 0.970
## 25 HIGH WINDS 1533 876890 0.972
## 26 WILD/FOREST FIRE 1457 878347 0.973
## 27 FROST/FREEZE 1342 879689 0.975
## 28 DENSE FOG 1293 880982 0.976
## 29 WINTER WEATHER/MIX 1104 882086 0.978
## 30 TSTM WIND/HAIL 1028 883114 0.979
## 31 EXTREME COLD/WIND CHILL 1002 884116 0.980
## 32 HEAT 767 884883 0.981
## 33 HIGH SURF 725 885608 0.982
## 34 TROPICAL STORM 690 886298 0.982
## 35 FLASH FLOODING 682 886980 0.983
## 36 EXTREME COLD 655 887635 0.984
## 37 COASTAL FLOOD 651 888286 0.984
## 38 LAKE-EFFECT SNOW 636 888922 0.985
## 39 FLOOD/FLASH FLOOD 624 889546 0.986
## 40 LANDSLIDE 600 890146 0.987
## 41 SNOW 587 890733 0.987
## 42 COLD/WIND CHILL 539 891272 0.988
## 43 FOG 538 891810 0.988
## 44 RIP CURRENT 470 892280 0.989
## 45 MARINE HAIL 442 892722 0.989
## 46 DUST STORM 427 893149 0.990
## 47 AVALANCHE 386 893535 0.990
Hence, applying the vital few principle we need to make sure these 47 descriptions match the standard ones in the Documentation table. This gives us high confidence in the analysis accuracy as well as efficiency in data processing.
After examining these 47 events descriptions and comparing them to those in the Storm Data Event Table we find that only 13 of them need to be renamed (shown below). Re-naming mis-spelt descriptions will take place in the subsetted data (stormsSub).
stormsSub[,"EVTYPE"]<-gsub("TSTM WIND","THUNDERSTORM WIND",stormsSub[,"EVTYPE"])
stormsSub[,"EVTYPE"]<-gsub("THUNDERSTORM WINDS","THUNDERSTORM WIND",stormsSub[,"EVTYPE"])
stormsSub[,"EVTYPE"]<-gsub("MARINE TSTM WIND","MARINE THUNDERSTORM WIND",stormsSub[,"EVTYPE"])
stormsSub[,"EVTYPE"]<-gsub("URBAN/SML STREAM FLD","FLOOD",stormsSub[,"EVTYPE"])
stormsSub[,"EVTYPE"]<-gsub("HIGH WINDS","HIGH WIND",stormsSub[,"EVTYPE"])
stormsSub[,"EVTYPE"]<-gsub("WILD/FOREST FIRE","WILDFIRE",stormsSub[,"EVTYPE"])
stormsSub[,"EVTYPE"]<-gsub("WINTER WEATHER/MIX","WINTER WEATHER",stormsSub[,"EVTYPE"])
stormsSub[,"EVTYPE"]<-gsub("TSTM WIND/HAIL","THUNDERSTORM WIND",stormsSub[,"EVTYPE"])
stormsSub[,"EVTYPE"]<-gsub("FLASH FLOODING","FLASH FLOOD",stormsSub[,"EVTYPE"])
stormsSub[,"EVTYPE"]<-gsub("EXTREME COLD","EXTREME COLD/WIND CHILL",stormsSub[,"EVTYPE"])
stormsSub[,"EVTYPE"]<-gsub("FLOOD/FLASH FLOOD","FLASH FLOOD",stormsSub[,"EVTYPE"])
stormsSub[,"EVTYPE"]<-gsub("SNOW","HEAVY SNOW",stormsSub[,"EVTYPE"])
stormsSub[,"EVTYPE"]<-gsub("FOG","DENSE FOG",stormsSub[,"EVTYPE"])
Another aspect to be processed before analysis is the unification of damage estimate numbers. Since the significance of damage estimate is represented in a character (K,M, or B) for both property and crop damages we need to convert this significance into numeric values then reflect it to the damage estimates (PROPDMG and CROPDMG).
We start by re-calculating the property damage estimates.
stormsSub$PROPDMG[stormsSub$PROPDMGEXP=="K"]<- stormsSub$PROPDMG[stormsSub$PROPDMGEXP=="K"]*1000
stormsSub$PROPDMG[stormsSub$PROPDMGEXP=="M"]<- stormsSub$PROPDMG[stormsSub$PROPDMGEXP=="M"]*1000000
stormsSub$PROPDMG[stormsSub$PROPDMGEXP=="B"]<- stormsSub$PROPDMG[stormsSub$PROPDMGEXP=="B"]*1000000000
Then, we re-calculate the crop damage estimates.
stormsSub$CROPDMG[stormsSub$CROPDMGEXP=="K"]<- stormsSub$CROPDMG[stormsSub$CROPDMGEXP=="K"]*1000
stormsSub$CROPDMG[stormsSub$CROPDMGEXP=="M"]<- stormsSub$CROPDMG[stormsSub$CROPDMGEXP=="M"]*1000000
stormsSub$CROPDMG[stormsSub$CROPDMGEXP=="B"]<- stormsSub$CROPDMG[stormsSub$CROPDMGEXP=="B"]*1000000000
Now we have same-unit property and crop damage estimates, we assume that overall economic damage can be represented by the sum of both property and crop damages. So, we create a new variable (ECODMG) calculated by the sum of both variables (PROPDMG and CROPDMG) for each event type.
stormsSub$ECODMG<-stormsSub$PROPDMG+stormsSub$CROPDMG
Similarly, we will assume that total event impact on public health can be represented by both fatalities and injuries caused by that event. So, we will create a new variable (HLTHDMG) calculated by the sum of both variables (FATALITIES and INJURIES) for each event type.
stormsSub$HLTHDMG<-stormsSub$FATALITIES+stormsSub$INJURIES
Now, we can drop the property and crop significance variables as they become irrelevant.
stormsSub<-stormsSub[,-c(8,10)]
Summarizing Impact of Events
So far, we have a processed dataset (stormsSub) which we can use to do the calculations needed for the analysis. In order to see the overall impact of each event type we will create a summary dataset (stormsSummary) grouped by event type along with its overall damage estimates to each of public health and to economy.
stormsSummary<-stormsSub %>%
group_by(EVTYPE)%>%
summarize(Total_FATAL=sum(FATALITIES),
Total_INJR=sum(INJURIES),
Total_HLTHDMG=sum(HLTHDMG),
Total_PROPDMG=sum(PROPDMG),
Total_CROPDMG=sum(CROPDMG),
Total_ECODMG=sum(ECODMG))
Finding Vital Few Events
We need to find out the events that are most harmful to public health. To do that we will arrange events in descending order by their total health damage (Total_HLTHDMG), and, for easier analysis, store the ordered list in a new dataset (OrdHLTH).
OrdHLTH<-arrange(stormsSummary,desc(Total_HLTHDMG))
On the other hand, we need to see the events that have the greatest economic consequences. So, we will also arrange events in descending order by their total economic damage(Total_ECODMG) and store the ordered list in a new dataset (OrdECO).
OrdECO<-arrange(stormsSummary,desc(Total_ECODMG))
Then, we will apply Pareto principle and find out the vital few events most harmful to each of health and economy. We calculate the cumulative sum of health damage across all events then calculate its share across all events top-down. We’ll do this for both ordered datasets (OrdHLTH,OrdECO). A subset of each dataset is shown below.
TotHLTHDMG<-sum(OrdHLTH$Total_HLTHDMG) # overall health damage
OrdHLTH$HLTHCumSum<- cumsum(OrdHLTH$Total_HLTHDMG)
OrdHLTH$HLTHCumShare<- round(OrdHLTH$HLTHCumSum/TotHLTHDMG,digits = 3)
head(OrdHLTH[,c(1:4,9)],20)
## Source: local data frame [20 x 5]
##
## EVTYPE Total_FATAL Total_INJR Total_HLTHDMG HLTHCumShare
## 1 TORNADO 5633 91346 96979 0.623
## 2 THUNDERSTORM WIND 701 9353 10054 0.688
## 3 EXCESSIVE HEAT 1903 6525 8428 0.742
## 4 FLOOD 498 6868 7366 0.789
## 5 LIGHTNING 816 5230 6046 0.828
## 6 HEAT 937 2100 3037 0.847
## 7 FLASH FLOOD 1014 1800 2814 0.865
## 8 ICE STORM 89 1975 2064 0.879
## 9 HIGH WIND 283 1439 1722 0.890
## 10 WILDFIRE 87 1456 1543 0.900
## 11 WINTER STORM 206 1321 1527 0.909
## 12 HAIL 15 1361 1376 0.918
## 13 HURRICANE/TYPHOON 64 1275 1339 0.927
## 14 HEAVY HEAVY SNOW 127 1021 1148 0.934
## 15 BLIZZARD 101 805 906 0.940
## 16 DENSE FOG 62 734 796 0.945
## 17 RIP CURRENT 368 232 600 0.949
## 18 WINTER WEATHER 61 470 531 0.952
## 19 RIP CURRENTS 204 297 501 0.956
## 20 HEAT WAVE 172 309 481 0.959
TotECODMG<-sum(OrdECO$Total_ECODMG) # overall economic damage
OrdECO$ECOCumSum<- cumsum(OrdECO$Total_ECODMG)
OrdECO$ECOCumShare<- round(OrdECO$ECOCumSum/TotECODMG,digits = 3)
head(OrdECO[,c(1,5:7,9)],17)
## Source: local data frame [17 x 5]
##
## EVTYPE Total_PROPDMG Total_CROPDMG Total_ECODMG ECOCumShare
## 1 FLOOD 144716019457 5670456550 150386476007 0.316
## 2 HURRICANE/TYPHOON 69305840000 2607872800 71913712800 0.467
## 3 TORNADO 56925660790 414953270 57340614060 0.587
## 4 STORM SURGE 43323536000 5000 43323541000 0.678
## 5 HAIL 15727367053 3025537890 18752904943 0.717
## 6 FLASH FLOOD 16622664680 1531457160 18154121840 0.755
## 7 DROUGHT 1046106000 13972566000 15018672000 0.787
## 8 HURRICANE 11868319010 2741910000 14610229010 0.818
## 9 THUNDERSTORM WIND 9709667805 1159501192 10869168997 0.840
## 10 RIVER FLOOD 5118945500 5029459000 10148404500 0.862
## 11 ICE STORM 3944927860 5022113500 8967041360 0.881
## 12 TROPICAL STORM 7703890550 678346000 8382236550 0.898
## 13 WILDFIRE 7766943500 402269630 8169213130 0.915
## 14 WINTER STORM 6688497251 26944000 6715441251 0.929
## 15 HIGH WIND 5878370043 679291900 6557661943 0.943
## 16 STORM SURGE/TIDE 4641188000 850000 4642038000 0.953
## 17 HURRICANE OPAL 3152846020 9000010 3161846030 0.960
Event types that are most harmful with respect to population health, and those that have the greatest economic consequences can be easily picked in a bar chart constructed in descending order. Therefore, we will plot the significant event types in 2 bar charts; one representing the impact on public health and another on economy. Then, we can pin-point those vital few event types that will be reported as most harmful to health and those with greatest economic consequences.
Notice that the significant event types plotted in the bar charts are those having 96% of the total impact on either public health or economy.
PLOT-1:
Using the ggplot2 package we will plot the events having about 96% of total impact on public health.
mltdHL<-melt(OrdHLTH[1:20,],id=c(1),measure.vars=c(2,3)) # convert data to long form
g<-ggplot(mltdHL,aes(EVTYPE,value,fill=factor(variable,labels=c("Fatalities","Injuries"))))
g<-g+geom_bar(stat="identity")+
labs(x="Event Type")+
labs(y="Fatalities and Injuries")+
theme(axis.text.x = element_text(angle=45, hjust=1, vjust=1),
plot.title = element_text(size = 18, face = "bold", colour = "black", vjust = +1),legend.title=element_blank(),legend.position=c(0.9,0.8))+
ggtitle(expression(atop("Events Impact on Population Health",
atop(italic("Events that Contributed to 96% of Total Health Damage"),
atop("Across USA, 1950-2011")))))+
annotate("text",x=8,y=60000,label= "Tornado is the most harmful to public health",size=3)
# order the bar chart in descending order to see impact of the vital few events
mltdHL0<- mltdHL
mltdHL0$EVTYPE <- with(mltdHL0, reorder(EVTYPE, -value))
g <- g %+% mltdHL0
g
PLOT-2:
Using the ggplot2 package we will plot the events having about 96% of total impact on economy.
mltdEC<-melt(OrdECO[1:17,],id=c(1),measure.vars=c(5,6)) # convert data to long form
p<-ggplot(mltdEC,aes(EVTYPE,value,fill=factor(variable,labels=c("Property","Crop"))))
p<-p+geom_bar(stat="identity")+
labs(x="Event Type")+
labs(y="Property and Crop Damages (USD)")+
theme(axis.text.x = element_text(angle=45, hjust=1, vjust=1),
plot.title = element_text(size = 18, face = "bold", colour = "black", vjust = +1),legend.title=element_blank(),legend.position=c(0.9,0.8))+
ggtitle(expression(atop("Events Impact on Economy",
atop(italic("Events that Contributed to 96% of Total Economic Damage"),
atop("Across USA, 1950-2011")))))+
annotate("text",x=8,y=1.0e+11,label= "Flood has the greatest economic \nconsequences followed by Hurricane and Tornado",size=3)
# order the bar chart in descending order to see impact of the vital few events
mltdEC0<- mltdEC
mltdEC0$EVTYPE <- with(mltdEC0, reorder(EVTYPE, -value))
p <- p %+% mltdEC0
p
Although the vital few events are clear in Plot-2 (Events Impact on Economy), we can add another plot showing the cumulative share to the total economic damage across all events. This plot will guide us to select the vital few events based on their impact share (percentage) compared to others.
PLOT-3:
s<-ggplot(OrdECO[1:17,],aes(EVTYPE,ECOCumShare,color=ECOCumShare<=0.60))
s<-s+geom_point()+
labs(x="Event Type")+
labs(y="Cumulative Share")+
theme(legend.position="none", axis.text.x = element_text(angle=45, hjust=1, vjust=1),
plot.title = element_text(size = 18, face = "bold", colour = "black", vjust = +1))+
ggtitle(expression(atop("Cumulative Share of Events Impact on Economy",
atop(italic("Events that Contributed to 96% of Total Economic Damage"),
atop("Across USA, 1950-2011")))))+
annotate("text",x=10,y=.6,label= "Flood, Hurricane and Tornado are the vital few\nthat have the greatest economic consequences",size=3)
# order the bar chart in descending order to see impact of the vital few events
OrdECO0<- OrdECO[1:17,]
OrdECO0$EVTYPE <- with(OrdECO0, reorder(EVTYPE, ECOCumShare))
s <- s %+% OrdECO0
s
Conclusion
Events impact on population health
Based on the analysis presented in this report we can conclude that Tornado is the undisputable event type that has been most harmful to population health amongst all other events. Tornado got the highest contribution of 62.3% to the overall health damage. The following figures summarizes the impact of this harmful event:
Events impact on economy
Based on the analysis presented in this report we can conclude that the greatest economic consequences were caused by three major events; namely, Flood, Hurricane/Typhoon, and Tornado with a total contribution of 58.7% to the overall economic damage. The following figures summarizes the impact of these harmful events:
Flood event recorded the highest economic impact of USD 150.4 billion. And, the overall economic damage caused by these three major events was about USD 279.6 billion for the entire span of data from 1950 till November 2011.
======================================END OF REPORT======================================