The basic goal of this analysis is to explore the NOAA Storm Database and answer some basic questions about severe weather events. In this analysis we want to find out, accross the US, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health and which types of events have the greatest economic consequences. The Event Type variable was cleaned and the variables regarding property and crop damages were cleaned as well and aggregated in a single variable economiccost. Injuries and Fatalities were aggregated as well in a single variable healthcost. The analysis found that Tornadoes and heavy winds are most harmful to population health and Hurricanes as well as tornados have the greatest economic consequences.
We first load the raw data
data<-read.csv('repdata-data-StormData.csv')
After a brief inspection, we find that Event Types and the other relevant variables, namely, the ones regarding Fatalities, Injuries and Property and Crop damage needed some cleaning. In this section we briefly explain and go trhough the cleaning process of the stated variables.
According to the Storm Data Documentation [1], the number of accepted event types are 48, however the dataset has 985 unique different types in the variable EVTYPE. We show the frist 20 values to get an idea.
uniqueEVTYPE <- unique(data$EVTYPE)
length(uniqueEVTYPE)
## [1] 985
uniqueEVTYPE[1:20]
## [1] TORNADO TSTM WIND
## [3] HAIL FREEZING RAIN
## [5] SNOW ICE STORM/FLASH FLOOD
## [7] SNOW/ICE WINTER STORM
## [9] HURRICANE OPAL/HIGH WINDS THUNDERSTORM WINDS
## [11] RECORD COLD HURRICANE ERIN
## [13] HURRICANE OPAL HEAVY RAIN
## [15] LIGHTNING THUNDERSTORM WIND
## [17] DENSE FOG RIP CURRENT
## [19] THUNDERSTORM WINS FLASH FLOOD
## 985 Levels: HIGH SURF ADVISORY COASTAL FLOOD ... WND
To clean the Event Type variable we use extensively the grep command to substitute many minor difference in a variety of terms. For simplicity, we also aggregate all easily unrelated types to the new category “other” and we also assign this value to event types that occur less than 30 times.
data$evtype <- tolower(data$EVTYPE)
data$evtype[grep("blizzard",data$evtype)]<-"blizzard"
data$evtype[grep("^heavy snow",data$evtype)] <- "heavy snow"
data$evtype[grep("^heavy rain",data$evtype)] <- "heavy rain"
data$evtype[grep("heavy rain",data$evtype)] <- "heavy rain"
data$evtype[grep("surf",data$evtype)] <- "high surf"
data$evtype[grep("flash",data$evtype)] <- "flash"
data$evtype[grep("tornado",data$evtype)] <- "tornado"
data$evtype[grep("hurricane",data$evtype)] <- "hurricane"
data$evtype[grep("chil",data$evtype)] <- "chill"
data$evtype[grep("winter",data$evtype)] <- "heavy snow"
data$evtype[grep("snow",data$evtype)] <- "heavy snow"
data$evtype[grep("wind",data$evtype)] <- "high wind"
data$evtype[grep("hail",data$evtype)] <- "hail"
data$evtype[grep("mud",data$evtype)] <- "debris flow"
data$evtype[grep("fog",data$evtype)] <- "dense fog"
data$evtype[grep("coastal",data$evtype)] <- "coastal"
data$evtype[grep("^thunder.*wind",data$evtype)] <- "wind"
data$evtype[grep("^tstm.*wind",data$evtype)] <- "wind"
data$evtype[grep("wind",data$evtype)] <- "wind"
data$evtype[grep("flood",data$evtype)] <- "flood"
data$evtype[grep("rain",data$evtype)] <- "heavy rain"
data$evtype[grep("summary",data$evtype)] <- "other"
data$evtype[grep("unseason",data$evtype)] <- "other"
data$evtype[grep("unusual",data$evtype)] <- "other"
data$evtype[grep("abnormal",data$evtype)] <- "other"
data$evtype[grep("month",data$evtype)] <- "other"
data$evtype[grep("record",data$evtype)] <- "other"
data$evtype[grep("dust",data$evtype)] <- "dust"
data$evtype[grep("tropical",data$evtype)] <- "tropical"
data$evtype[grep("ice storm",data$evtype)] <- "heavy snow"
data$evtype[grep("storm",data$evtype)] <- "heavy rain"
data$evtype[grep("fire",data$evtype)] <- "wildfire"
data$evtype[grep("shower",data$evtype)] <- "heavy rain"
data$evtype[grep("spout",data$evtype)] <- "tornado"
data$evtype[grep("fld",data$evtype)] <- "flood"
data$evtype[grep("cloud",data$evtype)] <- "funnel cloud"
data$evtype[grep("funnel",data$evtype)] <- "funnel cloud"
data$evtype[grep("flash",data$evtype)] <- "flash flood"
tableev<-table(data$evtype)
for (i in 1:length(tableev)) {
if (tableev[i]<35) {
# print(tableev[i])
# print(names(tableev[i]))
data$evtype[data$evtype==names(tableev[i])] <- "other"
}
}
unique(data$evtype)
## [1] "tornado" "wind"
## [3] "hail" "heavy rain"
## [5] "heavy snow" "flash flood"
## [7] "hurricane" "other"
## [9] "lightning" "dense fog"
## [11] "rip current" "funnel cloud"
## [13] "heat" "flood"
## [15] "cold" "extreme cold"
## [17] "blizzard" "chill"
## [19] "freeze" "coastal"
## [21] "avalanche" "dust"
## [23] "sleet" "excessive heat"
## [25] "high surf" "wildfire"
## [27] "debris flow" "dry microburst"
## [29] "ice" "glaze"
## [31] "heat wave" "wintry mix"
## [33] "drought" "tropical"
## [35] "frost" "rip currents"
## [37] "landslide" "frost/freeze"
## [39] "mixed precipitation" "astronomical high tide"
## [41] "astronomical low tide"
table(data$evtype)
##
## astronomical high tide astronomical low tide avalanche
## 103 174 386
## blizzard chill coastal
## 2745 1807 864
## cold debris flow dense fog
## 82 37 1883
## drought dry microburst dust
## 2488 186 587
## excessive heat extreme cold flash flood
## 1678 657 55674
## flood freeze frost
## 29560 76 57
## frost/freeze funnel cloud glaze
## 1343 6997 43
## hail heat heat wave
## 289277 767 75
## heavy rain heavy snow high surf
## 12707 39302 1065
## hurricane ice landslide
## 288 61 600
## lightning mixed precipitation other
## 15754 37 1423
## rip current rip currents sleet
## 470 304 59
## tornado tropical wildfire
## 64550 757 4240
## wind wintry mix
## 363040 94
Exploring fatalities and injuries in the dataset. Those variables seem mostly clean and therefore we only need to aggregrate injuries and fatalities in a single variable to help with our analysis.
summary(data$FATALITIES)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.0168 0.0000 583.0000
summary(data$INJURIES)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1557 0.0000 1700.0000
Total health harm aggregating injuries and fatalities in a single variable to aid our analysis,
data$healthcost <- data$FATALITIES + data$INJURIES
As we would like to assess the economic impact of severe weather events, we need to check the relevant variables, to evaluate if the data is clean and tidy. We explore the variables PROPDMG,PROPDMGEXP and crop damage in CROPDMG and CROPDMGEXP. According to the Storm Data Documentation [1], property damage estimates are calculated in the following way: PROPDMG contains the property damage estimate rounded to three significant digits and PRPDMGEXP contains the magnitude of the estimate.
data$PROPDMGEXP<-as.character(data$PROPDMGEXP)
table(data$PROPDMGEXP)
##
## - ? + 0 1 2 3 4 5
## 465934 1 8 5 216 25 13 4 4 28
## 6 7 8 B h H K m M
## 4 5 1 40 1 6 424665 7 11330
In the table we can see that the majority of the entries are empty. As it turns out, all but three entries with an empty value have an estimate PROPDMG of 0. Also, according to the property damage estimates [1], the minimal quantity is thousands of dollars, i.e. ‘K’. With this in mind, we assign 0 to PROPDMGEXP with 0 in PROPDMG but assing a value of 3 (K) to values of PROPDMG with non-zero values.
data$PROPDMGEXP[which(data$PROPDMG==0 & data$PROPDMGEXP=="")]<-0
#values with missing PROPDMGEXP but non-zero PROPDMG
data$PROPDMGEXP[which(data$PROPDMGEXP=="")]<-3
Next, we substitute the alphabetic entries for its numeric equivalents.
data$PROPDMGEXP<-gsub("^[mM]","6",data$PROPDMGEXP)
data$PROPDMGEXP<-gsub("^[hH]","2",data$PROPDMGEXP)
data$PROPDMGEXP<-gsub("^[kK]","3",data$PROPDMGEXP)
data$PROPDMGEXP<-gsub("^[bB]","9",data$PROPDMGEXP)
After a closer look to values with an entry of ‘-’,‘+’ and ‘?’ it is safe to assume that ‘-’ and ‘+’ are equivalent to thousands. We will assume that ‘?’ corresponds to zero.
data$PROPDMGEXP<-gsub("^[-|+]","3",data$PROPDMGEXP)
data$PROPDMGEXP<-gsub("^[?]","0",data$PROPDMGEXP)
Our table now looks like,
table(data$PROPDMGEXP)
##
## 0 1 2 3 4 5 6 7 8 9
## 466082 25 20 424751 4 28 11341 5 1 40
It is safe now to create a property damage variable as PROPDMG * 10^(PROPDMGEXP)
data$PROPDMGEXP<-as.integer(data$PROPDMGEXP)
data$propertydamage<-data$PROPDMG*10^(data$PROPDMGEXP)
After exploring briefly the new variable propertydamage, we found 1 outlier that did not seem to be correct. The orignial value after our processing was of $115 billion dollars. Further research revealed that it was a major flooding in the area of Western California in new year’s eve 2006 around napa valley,
outlierdamage <- which(data$propertydamage==max(data$propertydamage))
Exploring neighboring rows, it was clear that the the value was an error and the exponent was assumed to be in the order of millions. Furthermore a government estimate was found around $300 million dollars [3], corroborating our assumptions.
data[outlierdamage,"PROPDMGEXP"]<-6
data[outlierdamage,"propertydamage"]<-115*10^6
The total property damage is 313 billion dollars.
totalpropertydamage<-sum(data$propertydamage)/1e9
All but 3 empty columns “” are zero the other three are non-zero. All “?” columns are zero CROPDMG
data$CROPDMGEXP<-as.character(data$CROPDMGEXP)
table(data$CROPDMG[data$CROPDMGEXP==""])
##
## 0 3 4
## 618410 1 2
table(data$CROPDMG[data$CROPDMGEXP=="?"])
##
## 0
## 7
#zero to empty strings with a cropdmg estimate of 0
data$CROPDMGEXP[which(data$CROPDMG==0 & data$CROPDMGEXP=="")]<-0
#assing a value of 3 (the minimal) to missing PROPDMGEXP but non-zero PROPDMG values
data$CROPDMGEXP[which(data$CROPDMGEXP=="")]<-3
data$CROPDMGEXP<-gsub("^[mM]","6",data$CROPDMGEXP)
data$CROPDMGEXP<-gsub("^[kK]","3",data$CROPDMGEXP)
data$CROPDMGEXP<-gsub("^[bB]","7",data$CROPDMGEXP)
data$CROPDMGEXP<-gsub("^[?]","0",data$CROPDMGEXP)
data$CROPDMGEXP<-as.integer(data$CROPDMGEXP)
table(data$CROPDMGEXP)
##
## 0 2 3 6 7
## 618436 1 281856 1995 9
data$cropdamage<-data$CROPDMG*10^(data$CROPDMGEXP)
Now, we can aggregate property and crop damage costs into a single variable describing total economic cost
data$economiccost <- data$propertydamage + data$cropdamage
We create a new tidy-er dataset containing only the relevant variables to aid our analysis,
tidydata <- data[,c("evtype", "healthcost", "economiccost")]
tidydata[sample(900000,20),]
## evtype healthcost economiccost
## 423353 hail 0 0
## 561166 heavy snow 0 0
## 804710 flash flood 0 0
## 668896 tornado 0 250
## 638361 tornado 0 0
## 375818 flash flood 1 50000
## 694476 heat 0 0
## 612280 wind 0 2000
## 385481 wind 0 0
## 631983 wind 0 25000
## 517213 wind 0 6000
## 360142 hail 0 0
## 506768 wind 0 0
## 631874 hail 0 0
## 704595 hail 0 0
## 674944 wind 0 10000
## 212051 hail 0 0
## 83376 tornado 0 25000
## 243973 flash flood 0 50000
## 721298 hail 0 0
In the following, we can see both, the Event Types with the greatest economic consequences and the Event Types most harmful to population health.
library(plyr)
EconomicCostEventType<-ddply(data,"evtype",summarize,economiccost=sum(economiccost))
HealthCostEventType<-ddply(data,"evtype",summarize,healthcost=sum(healthcost))
EconomicCostEventType<-EconomicCostEventType[order(EconomicCostEventType$economiccost,decreasing = TRUE),]
HealthCostEventType<-HealthCostEventType[order(HealthCostEventType$healthcost,decreasing = TRUE),]
head(EconomicCostEventType)
## evtype economiccost
## 28 hurricane 88776572810
## 37 tornado 59030413584
## 25 heavy rain 53261262040
## 16 flood 41134225350
## 15 flash flood 19122214438
## 22 hail 19023995926
head(HealthCostEventType)
## evtype healthcost
## 37 tornado 97100
## 40 wind 12577
## 13 excessive heat 8428
## 16 flood 7385
## 31 lightning 6046
## 26 heavy snow 5582
The event types with greatest economic consequences are: Hurricanes with a cost of 88 billion dollars followed by tornadoes with a total cost of 59 billioni dollars.
The event types most harmful to population health are: Tornados with 97100 injuries/casualties combined followed by Heavy Wind with 12577 injuries/casualties.
barplot(EconomicCostEventType[1:10,2]/1e6,
ylab="millions of US$",
main="Costliest Severe Weather Events",
names.arg=EconomicCostEventType$evtype[1:10],
las=2)
barplot(HealthCostEventType[1:10,2],
ylab="Number of Injuries and Casualties",
main="Most Harmful Severe Weather Events",
names.arg=HealthCostEventType$evtype[1:10],
las=2)
[1] National Weather Service Storm Data Documentation https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf
[2] National Climatic Data Center Storm Events FAQ https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2FNCDC%20Storm%20Events-FAQ%20Page.pdf
[3] Storms and Flooding in California in December 2005 and January 2006–a Preliminary Assessment’http://pubs.usgs.gov/of/2006/1182/pdf/ofr2006-1182.pdf’ accessed: Oct 25,2014