Summary

The basic goal of this analysis is to explore the NOAA Storm Database and answer some basic questions about severe weather events. In this analysis we want to find out, accross the US, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health and which types of events have the greatest economic consequences. The Event Type variable was cleaned and the variables regarding property and crop damages were cleaned as well and aggregated in a single variable economiccost. Injuries and Fatalities were aggregated as well in a single variable healthcost. The analysis found that Tornadoes and heavy winds are most harmful to population health and Hurricanes as well as tornados have the greatest economic consequences.

Data Preprocessing

We first load the raw data

data<-read.csv('repdata-data-StormData.csv')

After a brief inspection, we find that Event Types and the other relevant variables, namely, the ones regarding Fatalities, Injuries and Property and Crop damage needed some cleaning. In this section we briefly explain and go trhough the cleaning process of the stated variables.

Cleaning Event Types

According to the Storm Data Documentation [1], the number of accepted event types are 48, however the dataset has 985 unique different types in the variable EVTYPE. We show the frist 20 values to get an idea.

uniqueEVTYPE <- unique(data$EVTYPE)
length(uniqueEVTYPE)
## [1] 985
uniqueEVTYPE[1:20]
##  [1] TORNADO                   TSTM WIND                
##  [3] HAIL                      FREEZING RAIN            
##  [5] SNOW                      ICE STORM/FLASH FLOOD    
##  [7] SNOW/ICE                  WINTER STORM             
##  [9] HURRICANE OPAL/HIGH WINDS THUNDERSTORM WINDS       
## [11] RECORD COLD               HURRICANE ERIN           
## [13] HURRICANE OPAL            HEAVY RAIN               
## [15] LIGHTNING                 THUNDERSTORM WIND        
## [17] DENSE FOG                 RIP CURRENT              
## [19] THUNDERSTORM WINS         FLASH FLOOD              
## 985 Levels:    HIGH SURF ADVISORY  COASTAL FLOOD ... WND

To clean the Event Type variable we use extensively the grep command to substitute many minor difference in a variety of terms. For simplicity, we also aggregate all easily unrelated types to the new category “other” and we also assign this value to event types that occur less than 30 times.

data$evtype <- tolower(data$EVTYPE)
data$evtype[grep("blizzard",data$evtype)]<-"blizzard"
data$evtype[grep("^heavy snow",data$evtype)] <- "heavy snow"
data$evtype[grep("^heavy rain",data$evtype)] <- "heavy rain"
data$evtype[grep("heavy rain",data$evtype)] <- "heavy rain"
data$evtype[grep("surf",data$evtype)] <- "high surf"
data$evtype[grep("flash",data$evtype)] <- "flash"
data$evtype[grep("tornado",data$evtype)] <- "tornado"
data$evtype[grep("hurricane",data$evtype)] <- "hurricane"
data$evtype[grep("chil",data$evtype)] <- "chill"
data$evtype[grep("winter",data$evtype)] <- "heavy snow"
data$evtype[grep("snow",data$evtype)] <- "heavy snow"
data$evtype[grep("wind",data$evtype)] <- "high wind"
data$evtype[grep("hail",data$evtype)] <- "hail"
data$evtype[grep("mud",data$evtype)] <- "debris flow"
data$evtype[grep("fog",data$evtype)] <- "dense fog"
data$evtype[grep("coastal",data$evtype)] <- "coastal"
data$evtype[grep("^thunder.*wind",data$evtype)] <- "wind"
data$evtype[grep("^tstm.*wind",data$evtype)] <- "wind"
data$evtype[grep("wind",data$evtype)] <- "wind"
data$evtype[grep("flood",data$evtype)] <- "flood"
data$evtype[grep("rain",data$evtype)] <- "heavy rain"
data$evtype[grep("summary",data$evtype)] <- "other"
data$evtype[grep("unseason",data$evtype)] <- "other"
data$evtype[grep("unusual",data$evtype)] <- "other"
data$evtype[grep("abnormal",data$evtype)] <- "other"
data$evtype[grep("month",data$evtype)] <- "other"
data$evtype[grep("record",data$evtype)] <- "other"
data$evtype[grep("dust",data$evtype)] <- "dust"
data$evtype[grep("tropical",data$evtype)] <- "tropical"
data$evtype[grep("ice storm",data$evtype)] <- "heavy snow"
data$evtype[grep("storm",data$evtype)] <- "heavy rain"  
data$evtype[grep("fire",data$evtype)] <- "wildfire"  
data$evtype[grep("shower",data$evtype)] <- "heavy rain"  
data$evtype[grep("spout",data$evtype)] <- "tornado"  
data$evtype[grep("fld",data$evtype)] <- "flood"  
data$evtype[grep("cloud",data$evtype)] <- "funnel cloud"  
data$evtype[grep("funnel",data$evtype)] <- "funnel cloud" 
data$evtype[grep("flash",data$evtype)] <- "flash flood" 

tableev<-table(data$evtype)
for (i in 1:length(tableev)) {
    if (tableev[i]<35) {
      #  print(tableev[i])
      #  print(names(tableev[i]))
        data$evtype[data$evtype==names(tableev[i])] <- "other"
    }    
}
unique(data$evtype)
##  [1] "tornado"                "wind"                  
##  [3] "hail"                   "heavy rain"            
##  [5] "heavy snow"             "flash flood"           
##  [7] "hurricane"              "other"                 
##  [9] "lightning"              "dense fog"             
## [11] "rip current"            "funnel cloud"          
## [13] "heat"                   "flood"                 
## [15] "cold"                   "extreme cold"          
## [17] "blizzard"               "chill"                 
## [19] "freeze"                 "coastal"               
## [21] "avalanche"              "dust"                  
## [23] "sleet"                  "excessive heat"        
## [25] "high surf"              "wildfire"              
## [27] "debris flow"            "dry microburst"        
## [29] "ice"                    "glaze"                 
## [31] "heat wave"              "wintry mix"            
## [33] "drought"                "tropical"              
## [35] "frost"                  "rip currents"          
## [37] "landslide"              "frost/freeze"          
## [39] "mixed precipitation"    "astronomical high tide"
## [41] "astronomical low tide"
table(data$evtype)
## 
## astronomical high tide  astronomical low tide              avalanche 
##                    103                    174                    386 
##               blizzard                  chill                coastal 
##                   2745                   1807                    864 
##                   cold            debris flow              dense fog 
##                     82                     37                   1883 
##                drought         dry microburst                   dust 
##                   2488                    186                    587 
##         excessive heat           extreme cold            flash flood 
##                   1678                    657                  55674 
##                  flood                 freeze                  frost 
##                  29560                     76                     57 
##           frost/freeze           funnel cloud                  glaze 
##                   1343                   6997                     43 
##                   hail                   heat              heat wave 
##                 289277                    767                     75 
##             heavy rain             heavy snow              high surf 
##                  12707                  39302                   1065 
##              hurricane                    ice              landslide 
##                    288                     61                    600 
##              lightning    mixed precipitation                  other 
##                  15754                     37                   1423 
##            rip current           rip currents                  sleet 
##                    470                    304                     59 
##                tornado               tropical               wildfire 
##                  64550                    757                   4240 
##                   wind             wintry mix 
##                 363040                     94

Injuries and Fatalities

Exploring fatalities and injuries in the dataset. Those variables seem mostly clean and therefore we only need to aggregrate injuries and fatalities in a single variable to help with our analysis.

summary(data$FATALITIES)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   0.0000   0.0000   0.0000   0.0168   0.0000 583.0000
summary(data$INJURIES)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##    0.0000    0.0000    0.0000    0.1557    0.0000 1700.0000

Total health harm aggregating injuries and fatalities in a single variable to aid our analysis,

data$healthcost <- data$FATALITIES + data$INJURIES

Exploring Economic impact

As we would like to assess the economic impact of severe weather events, we need to check the relevant variables, to evaluate if the data is clean and tidy. We explore the variables PROPDMG,PROPDMGEXP and crop damage in CROPDMG and CROPDMGEXP. According to the Storm Data Documentation [1], property damage estimates are calculated in the following way: PROPDMG contains the property damage estimate rounded to three significant digits and PRPDMGEXP contains the magnitude of the estimate.

data$PROPDMGEXP<-as.character(data$PROPDMGEXP)
table(data$PROPDMGEXP)
## 
##             -      ?      +      0      1      2      3      4      5 
## 465934      1      8      5    216     25     13      4      4     28 
##      6      7      8      B      h      H      K      m      M 
##      4      5      1     40      1      6 424665      7  11330

In the table we can see that the majority of the entries are empty. As it turns out, all but three entries with an empty value have an estimate PROPDMG of 0. Also, according to the property damage estimates [1], the minimal quantity is thousands of dollars, i.e. ‘K’. With this in mind, we assign 0 to PROPDMGEXP with 0 in PROPDMG but assing a value of 3 (K) to values of PROPDMG with non-zero values.

data$PROPDMGEXP[which(data$PROPDMG==0 & data$PROPDMGEXP=="")]<-0
#values with missing PROPDMGEXP but non-zero PROPDMG
data$PROPDMGEXP[which(data$PROPDMGEXP=="")]<-3

Next, we substitute the alphabetic entries for its numeric equivalents.

data$PROPDMGEXP<-gsub("^[mM]","6",data$PROPDMGEXP)
data$PROPDMGEXP<-gsub("^[hH]","2",data$PROPDMGEXP)
data$PROPDMGEXP<-gsub("^[kK]","3",data$PROPDMGEXP)
data$PROPDMGEXP<-gsub("^[bB]","9",data$PROPDMGEXP)

After a closer look to values with an entry of ‘-’,‘+’ and ‘?’ it is safe to assume that ‘-’ and ‘+’ are equivalent to thousands. We will assume that ‘?’ corresponds to zero.

data$PROPDMGEXP<-gsub("^[-|+]","3",data$PROPDMGEXP)
data$PROPDMGEXP<-gsub("^[?]","0",data$PROPDMGEXP)

Our table now looks like,

table(data$PROPDMGEXP)
## 
##      0      1      2      3      4      5      6      7      8      9 
## 466082     25     20 424751      4     28  11341      5      1     40

It is safe now to create a property damage variable as PROPDMG * 10^(PROPDMGEXP)

data$PROPDMGEXP<-as.integer(data$PROPDMGEXP)
data$propertydamage<-data$PROPDMG*10^(data$PROPDMGEXP)

After exploring briefly the new variable propertydamage, we found 1 outlier that did not seem to be correct. The orignial value after our processing was of $115 billion dollars. Further research revealed that it was a major flooding in the area of Western California in new year’s eve 2006 around napa valley,

outlierdamage <- which(data$propertydamage==max(data$propertydamage))

Exploring neighboring rows, it was clear that the the value was an error and the exponent was assumed to be in the order of millions. Furthermore a government estimate was found around $300 million dollars [3], corroborating our assumptions.

data[outlierdamage,"PROPDMGEXP"]<-6
data[outlierdamage,"propertydamage"]<-115*10^6

The total property damage is 313 billion dollars.

totalpropertydamage<-sum(data$propertydamage)/1e9

All but 3 empty columns “” are zero the other three are non-zero. All “?” columns are zero CROPDMG

data$CROPDMGEXP<-as.character(data$CROPDMGEXP)
table(data$CROPDMG[data$CROPDMGEXP==""])
## 
##      0      3      4 
## 618410      1      2
table(data$CROPDMG[data$CROPDMGEXP=="?"])
## 
## 0 
## 7
#zero to empty strings with a cropdmg estimate of 0
data$CROPDMGEXP[which(data$CROPDMG==0 & data$CROPDMGEXP=="")]<-0
#assing a value of 3 (the minimal) to missing PROPDMGEXP but non-zero PROPDMG values
data$CROPDMGEXP[which(data$CROPDMGEXP=="")]<-3
data$CROPDMGEXP<-gsub("^[mM]","6",data$CROPDMGEXP)
data$CROPDMGEXP<-gsub("^[kK]","3",data$CROPDMGEXP)
data$CROPDMGEXP<-gsub("^[bB]","7",data$CROPDMGEXP)
data$CROPDMGEXP<-gsub("^[?]","0",data$CROPDMGEXP)
data$CROPDMGEXP<-as.integer(data$CROPDMGEXP)
table(data$CROPDMGEXP)
## 
##      0      2      3      6      7 
## 618436      1 281856   1995      9
data$cropdamage<-data$CROPDMG*10^(data$CROPDMGEXP)

Now, we can aggregate property and crop damage costs into a single variable describing total economic cost

data$economiccost <- data$propertydamage + data$cropdamage

We create a new tidy-er dataset containing only the relevant variables to aid our analysis,

tidydata <- data[,c("evtype", "healthcost", "economiccost")]
tidydata[sample(900000,20),]
##             evtype healthcost economiccost
## 423353        hail          0            0
## 561166  heavy snow          0            0
## 804710 flash flood          0            0
## 668896     tornado          0          250
## 638361     tornado          0            0
## 375818 flash flood          1        50000
## 694476        heat          0            0
## 612280        wind          0         2000
## 385481        wind          0            0
## 631983        wind          0        25000
## 517213        wind          0         6000
## 360142        hail          0            0
## 506768        wind          0            0
## 631874        hail          0            0
## 704595        hail          0            0
## 674944        wind          0        10000
## 212051        hail          0            0
## 83376      tornado          0        25000
## 243973 flash flood          0        50000
## 721298        hail          0            0

Results

In the following, we can see both, the Event Types with the greatest economic consequences and the Event Types most harmful to population health.

library(plyr)
EconomicCostEventType<-ddply(data,"evtype",summarize,economiccost=sum(economiccost))
HealthCostEventType<-ddply(data,"evtype",summarize,healthcost=sum(healthcost))
EconomicCostEventType<-EconomicCostEventType[order(EconomicCostEventType$economiccost,decreasing = TRUE),]
HealthCostEventType<-HealthCostEventType[order(HealthCostEventType$healthcost,decreasing = TRUE),]
head(EconomicCostEventType)
##         evtype economiccost
## 28   hurricane  88776572810
## 37     tornado  59030413584
## 25  heavy rain  53261262040
## 16       flood  41134225350
## 15 flash flood  19122214438
## 22        hail  19023995926
head(HealthCostEventType)
##            evtype healthcost
## 37        tornado      97100
## 40           wind      12577
## 13 excessive heat       8428
## 16          flood       7385
## 31      lightning       6046
## 26     heavy snow       5582

The event types with greatest economic consequences are: Hurricanes with a cost of 88 billion dollars followed by tornadoes with a total cost of 59 billioni dollars.

The event types most harmful to population health are: Tornados with 97100 injuries/casualties combined followed by Heavy Wind with 12577 injuries/casualties.

barplot(EconomicCostEventType[1:10,2]/1e6,
        ylab="millions of US$",
        main="Costliest Severe Weather Events",
        names.arg=EconomicCostEventType$evtype[1:10],
        las=2)

barplot(HealthCostEventType[1:10,2],
        ylab="Number of Injuries and Casualties",
        main="Most Harmful Severe Weather Events",
        names.arg=HealthCostEventType$evtype[1:10],        
        las=2)

References

[1] National Weather Service Storm Data Documentation https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf

[2] National Climatic Data Center Storm Events FAQ https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2FNCDC%20Storm%20Events-FAQ%20Page.pdf

[3] Storms and Flooding in California in December 2005 and January 2006–a Preliminary Assessment’http://pubs.usgs.gov/of/2006/1182/pdf/ofr2006-1182.pdf’ accessed: Oct 25,2014