Analyzing he NOAA Storm Database
The goal here is to expore the NOAA Storm Database and answer some basic questions about severe weather events. Furthermore, the two questions of interest here are:
This analysis condluded that, Wind/Torados are responsible for 75% of the total effect on population health and 49% of the total economic consequence for all weather related events in the United States. It is to be noted that, Wind/Tornado events have extremely more imapct to property than crops in the United Stated and cause more injuries then deaths.
Other major events that are noteworthy include Heat/Drought due to their impact to population health, accounting for 9% of total incidences. While from an economic consequence standpoint, hurricanes and floods account for a total of 37% of the total economic damange.
Overall, injuries from wether events have a greater impact on population health. In similar respect, property damage from weather events has a greater economic consequence than crop damages.
Provided below are the list of libraries required for this analysis
## Loading required package: survival
## Loading required package: splines
## Loading required package: MASS
The code below extract the rae file and loads it in the StormData data.frame
OrigDataURL = "http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
if (!file.exists("dataFile.bz2"))
download.file(OrigDataURL, "dataFile.bz2")
# Read downloaded data file
StormData <- read.csv("dataFile.bz2")
In order to simplify the number of classes of events, reclassification of the $EVTYPE variable was done to accomodate the following 11 classes: Flood, Rain, Hurricane, Heat/Drought, Cold, Fire, Wind/tornado, Marine, Dust/Fog, Slides, and Other. The new calssifications were stored in the $TYPE varaible.
StormData$TYPE <- StormData$EVTYPE
StormData$TYPE<-rep("Other",length(StormData$EVTYPE)) # Inizialize with a generic value
StormData$TYPE[grep("*flood*",StormData$EVTYPE,ignore.case=TRUE)]<-"Flood"
StormData$TYPE[grep("(*rain*|*hail*|*storm*|*lightning*)",StormData$EVTYPE,ignore.case=TRUE)]<-"Percipitation"
StormData$TYPE[grep("*hurrica*",StormData$EVTYPE,ignore.case=TRUE)]<-"Hurricane"
StormData$TYPE[grep("(*heat*|*drought*)",StormData$EVTYPE,ignore.case=TRUE)]<-"Heat/Drought"
StormData$TYPE[grep("(*cold*|*freeze*)",StormData$EVTYPE,ignore.case=TRUE)]<-"Cold"
StormData$TYPE[grep("*fire*",StormData$EVTYPE,ignore.case=TRUE)]<-"Fire"
StormData$TYPE[grep("(*blizzard*|*wind*|*torn*)",StormData$EVTYPE,ignore.case=TRUE)]<-"Wind/Tornado"
StormData$TYPE[grep("(*marine*|*tsunami*|*seiche*|*tide*|*coast*)",StormData$EVTYPE,ignore.case=TRUE)]<-"Marine"
StormData$TYPE[grep("(*dust*|*smoke*|*fog*|*ash*)",StormData$EVTYPE,ignore.case=TRUE)]<-"Dust/Fog"
StormData$TYPE[grep("(*debris*|*mud*|*slide*|*avalanche*)",StormData$EVTYPE,ignore.case=TRUE)]<-"Slides"
StormData$TYPE<-as.factor(StormData$TYPE)
To make data processing faster, the main columns from the original dataset were extracted. The columns related to health data that will help us answer our first question are:
The columns related to economic data that will help us answer our second question are:
Both these datasets have the event classification column:
HealthData <- StormData[,c("TYPE", "FATALITIES", "INJURIES")]
EconData <- StormData[,c("TYPE", "PROPDMG", "CROPDMG", "PROPDMGEXP", "CROPDMGEXP")]
We can now see overall statistics for the two data sets created
summary(HealthData)
## TYPE FATALITIES INJURIES
## Wind/Tornado :438977 Min. : 0 Min. : 0.0
## Percipitation:304659 1st Qu.: 0 1st Qu.: 0.0
## Dust/Fog : 61092 Median : 0 Median : 0.0
## Heat/Drought : 33077 Mean : 0 Mean : 0.2
## Flood : 26141 3rd Qu.: 0 3rd Qu.: 0.0
## Other : 18705 Max. :583 Max. :1700.0
## (Other) : 19646
summary(EconData)
## TYPE PROPDMG CROPDMG PROPDMGEXP
## Wind/Tornado :438977 Min. : 0 Min. : 0.0 :465934
## Percipitation:304659 1st Qu.: 0 1st Qu.: 0.0 K :424665
## Dust/Fog : 61092 Median : 0 Median : 0.0 M : 11330
## Heat/Drought : 33077 Mean : 12 Mean : 1.5 0 : 216
## Flood : 26141 3rd Qu.: 0 3rd Qu.: 0.0 B : 40
## Other : 18705 Max. :5000 Max. :990.0 5 : 28
## (Other) : 19646 (Other): 84
## CROPDMGEXP
## :618413
## K :281832
## M : 1994
## k : 21
## 0 : 19
## B : 9
## (Other): 9
The table below shows the break down of exponents coded in the $PROPDMGEXP field. Here, we see that there are issues with letter cases and annomaly values such as '-' or '+'.
table(EconData$PROPDMGEXP)
##
## - ? + 0 1 2 3 4 5
## 465934 1 8 5 216 25 13 4 4 28
## 6 7 8 B h H K m M
## 4 5 1 40 1 6 424665 7 11330
To get the true value of property damage, the $PROPDMG variable was raised to the power of $PROPDMGEXP. A intermidiate step where a numeric value was assigned to the character values in the $PROPDMGEXP variable was performed. The table above should provide some insight on what to code the values in the $PROPDMGEXP field with.
# Adding the exponenet to relavent values
EconData$PROPDMG_VAL[EconData$PROPDMGEXP=="K"]<-EconData$PROPDMG*10^3
EconData$PROPDMG_VAL[EconData$PROPDMGEXP=="M"]<-EconData$PROPDMG*10^6
EconData$PROPDMG_VAL[EconData$PROPDMGEXP=="m"]<-EconData$PROPDMG*10^6
EconData$PROPDMG_VAL[EconData$PROPDMGEXP=="B"]<-EconData$PROPDMG*10^9
EconData$PROPDMG_VAL[EconData$PROPDMGEXP=="0"]<-EconData$PROPDMG*1^0
EconData$PROPDMG_VAL[EconData$PROPDMGEXP=="1"]<-EconData$PROPDMG*1^1
EconData$PROPDMG_VAL[EconData$PROPDMGEXP=="2"]<-EconData$PROPDMG*1^2
EconData$PROPDMG_VAL[EconData$PROPDMGEXP=="3"]<-EconData$PROPDMG*1^3
EconData$PROPDMG_VAL[EconData$PROPDMGEXP=="4"]<-EconData$PROPDMG*1^4
EconData$PROPDMG_VAL[EconData$PROPDMGEXP=="5"]<-EconData$PROPDMG*1^5
EconData$PROPDMG_VAL[EconData$PROPDMGEXP=="6"]<-EconData$PROPDMG*1^6
EconData$PROPDMG_VAL[EconData$PROPDMGEXP=="7"]<-EconData$PROPDMG*1^7
EconData$PROPDMG_VAL[EconData$PROPDMGEXP=="8"]<-EconData$PROPDMG*1^8
The table below shows the break down of exponents coded in the $CROPDMGEXP field. Here, we see that there are issues with letter cases and annomaly values such as '-' or '+'.
table(EconData$CROPDMGEXP)
##
## ? 0 2 B k K m M
## 618413 7 19 1 9 21 281832 1 1994
To get the true value of crop damage, the $CROPDMG variable was raised to the power of $CROPDMGEXP. A intermidiate step where a numeric value was assigned to the character values in the $CROPDMGEXP variable was performed.
#EconData$CROPDMGEXP[is.na(EconData$CROPDMGEXP)]<-"1"
EconData$CROPDMG_VAL[EconData$CROPDMGEXP=="K"]<-EconData$CROPDMG*10^3
EconData$CROPDMG_VAL[EconData$CROPDMGEXP=="k"]<-EconData$CROPDMG*10^3
EconData$CROPDMG_VAL[EconData$CROPDMGEXP=="M"]<-EconData$CROPDMG*10^6
EconData$CROPDMG_VAL[EconData$CROPDMGEXP=="m"]<-EconData$CROPDMG*10^6
EconData$CROPDMG_VAL[EconData$CROPDMGEXP=="B"]<-EconData$CROPDMG*10^9
EconData$CROPDMG_VAL[EconData$CROPDMGEXP=="m"]<-EconData$CROPDMG*10^6
EconData$CROPDMG_VAL[EconData$CROPDMGEXP=="0"]<-EconData$CROPDMG*1^0
EconData$CROPDMG_VAL[EconData$CROPDMGEXP=="2"]<-EconData$CROPDMG*1^2
Added a total and percentage column to aid in analysis in further secions.
HealthData_Tidy <- HealthData[]
HealthData_Tidy <- aggregate(.~TYPE, data= HealthData_Tidy, FUN=sum)
HealthData_Tidy$TOTAL <- HealthData_Tidy$FATALITIES + HealthData_Tidy$INJURIES
HealthData_Tidy$PERCENT_TOTAL <- round((HealthData_Tidy$TOTAL/sum(HealthData_Tidy$TOTAL))*100,2)
#HealthData_Tidy[order(HealthData_Tidy$TOTAL, decreasing = TRUE),]
Added a total and percentage column to aid in analysis in further secions.
EconData_Tidy <- EconData[]
EconData_Tidy <- aggregate(.~TYPE, data= EconData_Tidy, FUN=sum)
EconData_Tidy$TOTAL <- EconData_Tidy$PROPDMG_VAL + EconData_Tidy$CROPDMG_VAL
EconData_Tidy$PERCENT_TOTAL <- round((EconData_Tidy$TOTAL/sum(EconData_Tidy$TOTAL))*100,2)
#EconData_Tidy[order(EconData_Tidy$TOTAL, decreasing = TRUE),]
HealthData_Tidy_Graph <-HealthData_Tidy[,1:3]
HealthData_Tidy_Graph <- melt(HealthData_Tidy_Graph, id = c("TYPE"))
HealthData_Tidy_Graph[order(HealthData_Tidy_Graph$TYPE, HealthData_Tidy_Graph$value, HealthData_Tidy_Graph$variable, decreasing = TRUE),]
## TYPE variable value
## 22 Wind/Tornado INJURIES 108055
## 11 Wind/Tornado FATALITIES 7596
## 10 Slides FATALITIES 268
## 21 Slides INJURIES 226
## 20 Percipitation INJURIES 6625
## 9 Percipitation FATALITIES 838
## 19 Other INJURIES 1351
## 8 Other FATALITIES 778
## 18 Marine INJURIES 199
## 7 Marine FATALITIES 86
## 17 Hurricane INJURIES 1328
## 6 Hurricane FATALITIES 133
## 16 Heat/Drought INJURIES 10604
## 5 Heat/Drought FATALITIES 3419
## 15 Flood INJURIES 6794
## 4 Flood FATALITIES 483
## 14 Fire INJURIES 1063
## 3 Fire FATALITIES 78
## 13 Dust/Fog INJURIES 3966
## 2 Dust/Fog FATALITIES 1227
## 12 Cold INJURIES 317
## 1 Cold FATALITIES 239
g <- ggplot(HealthData_Tidy_Graph, aes(x =TYPE, y=log(value), fill = variable ))
g <- g + geom_bar(stat="identity")
g <- g + guides(fill=guide_legend(reverse=TRUE, title = "Type of Incident"))
g <- g + scale_fill_brewer(palette="Pastel1")
g <- g + theme(legend.position="bottom")
g <- g + xlab("Type of Event") + ylab("Log Number of Incidences [Units: log(count)]") + labs(title = "Number of incidences to population health by event type")
print(g)
Figure 1. Effects of weather related events on public health
x1 <- xtable(HealthData_Tidy[order(HealthData_Tidy$TOTAL, decreasing = TRUE),])
print(x1,floating=FALSE, type = "html")
| TYPE | FATALITIES | INJURIES | TOTAL | PERCENT_TOTAL | |
|---|---|---|---|---|---|
| 11 | Wind/Tornado | 7596.00 | 108055.00 | 115651.00 | 74.29 |
| 5 | Heat/Drought | 3419.00 | 10604.00 | 14023.00 | 9.01 |
| 9 | Percipitation | 838.00 | 6625.00 | 7463.00 | 4.79 |
| 4 | Flood | 483.00 | 6794.00 | 7277.00 | 4.67 |
| 2 | Dust/Fog | 1227.00 | 3966.00 | 5193.00 | 3.34 |
| 8 | Other | 778.00 | 1351.00 | 2129.00 | 1.37 |
| 6 | Hurricane | 133.00 | 1328.00 | 1461.00 | 0.94 |
| 3 | Fire | 78.00 | 1063.00 | 1141.00 | 0.73 |
| 1 | Cold | 239.00 | 317.00 | 556.00 | 0.36 |
| 10 | Slides | 268.00 | 226.00 | 494.00 | 0.32 |
| 7 | Marine | 86.00 | 199.00 | 285.00 | 0.18 |
Analysis: From figure 1 and the table above, we see that Wind/Tornado events are the most harmful and account for about 74% of the total damage. Furthremore, the number of fatalaties and injuriesinduvidually from this type of event are greater than any other type of event. Even on a log scale the total number of Wind/Tornado events are over an unit more than the scond highest event type.
Heat/Drought event type is the second most harful. Although these types of events only account for roughly 9% of the total, they are double of the next highest event type. Futhermore, the number of fatalites in this type of event type constitute for a large percentage, about 23%, of the total fatality amount.
The rest of that event types account for 16.5% of the total.
EconData_Tidy_Graph <-EconData_Tidy[,c(1,6,7)]
EconData_Tidy_Graph <- melt(EconData_Tidy_Graph, id = c("TYPE"))
#so as not to break log function
EconData_Tidy_Graph[EconData_Tidy_Graph$value==0, 3] <- 0.0001
#EconData_Tidy_Graph[order(EconData_Tidy_Graph$TYPE, EconData_Tidy_Graph$value, EconData_Tidy_Graph$variable, decreasing = TRUE),]
h <- ggplot(EconData_Tidy_Graph, aes(x =TYPE, y=log(value), fill = variable ))
h <- h + geom_bar(stat="identity")
h <- h + guides(fill=guide_legend(reverse=TRUE, title = "Type of Cost"))
h <- h + scale_fill_brewer(palette="Pastel2")
h <- h + theme(legend.position="bottom")
h <- h + xlab("Type of Event") + ylab("Log Cost [Units: log($ USD)]") + labs(title = "Economic consequence by event type")
print(h)
Figure 2. Economic consequence of weather related events in the United Stated
x2 <- xtable(EconData_Tidy[order(EconData_Tidy$TOTAL, decreasing = TRUE),c(1,6:9)])
print(x2,floating=FALSE, type = "html")
| TYPE | PROPDMG_VAL | CROPDMG_VAL | TOTAL | PERCENT_TOTAL | |
|---|---|---|---|---|---|
| 11 | Wind/Tornado | 299071791907.50 | 128504070.00 | 299200295977.50 | 48.55 |
| 6 | Hurricane | 182977647500.00 | 0.00 | 182977647500.00 | 29.69 |
| 4 | Flood | 42709046140.00 | 14031200.00 | 42723077340.00 | 6.93 |
| 9 | Percipitation | 30347558105.00 | 60427750.00 | 30407985855.00 | 4.93 |
| 3 | Fire | 26638610080.00 | 3447100.00 | 26642057180.00 | 4.32 |
| 7 | Marine | 25342471330.00 | 7980300.00 | 25350451630.00 | 4.11 |
| 2 | Dust/Fog | 7460721977.50 | 16308910.00 | 7477030887.50 | 1.21 |
| 5 | Heat/Drought | 791327070.00 | 10588300.00 | 801915370.00 | 0.13 |
| 8 | Other | 365058310.00 | 4011900.00 | 369070210.00 | 0.06 |
| 10 | Slides | 257828350.00 | 170500.00 | 257998850.00 | 0.04 |
| 1 | Cold | 13439710.00 | 526500.00 | 13966210.00 | 0.00 |
Analysis: From figure 2 and the table above, we see that Wind/Tornados account for about 49% of the total economic impact. This is significantly greater than the next event type. In this type of event, property damage is significantly more than crop damage. The second highest amount of economic consequence comes from Hurricanes, which accounts for about 30% of the total economic consequence. It is to be noted here that all the economic consequence for this event type comes solely from property damage.
The top two event tyes account for 78.24% of the total damage, while the rest account for 21.73%.
sessionInfo()
## R version 3.1.0 (2014-04-10)
## Platform: i386-w64-mingw32/i386 (32-bit)
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] splines stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] xtable_1.7-3 doBy_4.5-10 MASS_7.3-33 survival_2.37-7
## [5] reshape2_1.4 plyr_1.8.1 ggplot2_1.0.0 knitr_1.6
##
## loaded via a namespace (and not attached):
## [1] colorspace_1.2-4 digest_0.6.4 evaluate_0.5.5
## [4] formatR_0.10 grid_3.1.0 gtable_0.1.2
## [7] lattice_0.20-29 lme4_1.1-6 Matrix_1.1-3
## [10] minqa_1.2.3 munsell_0.4.2 nlme_3.1-117
## [13] proto_0.3-10 Rcpp_0.11.1 RcppEigen_0.3.2.1.2
## [16] scales_0.2.4 stringr_0.6.2 tools_3.1.0
Done By: Jovan Sardinha (jovan.sardinha@gmail.com)
Created On: June 16, 2014
Last Updated: June 22, 2014