Report on Human and Economic Damages of Natural Disasters

Synopsis

The data were cleaned by matching the recorded event types to the standardized list of weather events maintained by NOAA. To estimate costs, the data were aggregated by taking the mean of each cost measure (fatalities, injuries, property damage and crop damage) for each event type. On average, excessive heat, tsunamis, and hurricanes have the greatest effect on human lives.

On average, volcanic ash, waterspouts, hurricanes, and droughts have the greatest economic costs when damage to property and crops is totaled.

###Data Preprocessing

The following steps were taken to clean and preprocess the data:

1) Subset the data to the event type and the columns that record human injuries and fatalities as well as property and crop damage.

2) Keep only the rows where at least one of fatalities, injuries, property damage or crop damage is greater than zero. This reduces the dataset to about a third of its original size.

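# read the compressed storm data (read.csv handles .bz2 files directly)
storm <- read.csv("repdata-data-StormData.csv.bz2")
# keep the event type plus the casualty and damage columns
relevant = c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")
stDm = storm[, relevant]
# keep rows with at least one casualty or some recorded damage
good = stDm$FATALITIES > 0 | stDm$INJURIES > 0 | stDm$PROPDMG > 0 | stDm$CROPDMG > 0
stDm = stDm[good, ]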

3) The function below tries to match each EVTYPE to the standardized list published by NOAA (http://www.ncdc.noaa.gov/stormevents/pd01016005curr.pdf). It does some basic cleaning, splits the character string into words, and then picks the best match it can from the list of 48 names. The function is not perfect: it cannot handle abbreviations, so a term such as “tstm” is not recognized as thunderstorm, and an event type with no matching words is coded as “other”.

getBestEVTYPE <- function( sEVTYPE)  {
  #this function tries to match the event types in this dataset to the 48 standardized categories 
  #in the NOAA database (http://www.ncdc.noaa.gov/stormevents/pd01016005curr.pdf)
evt_groups <- tolower(c("Astronomical Low Tide", "Avalanche", "Blizzard",
                "Coastal Flood", "Cold/Wind Chill", "Debris Flow", 
                "Dense Fog", "Dense Smoke", "Drought", 
                "Dust Devil", "Dust Storm", "Excessive Heat",
                "Extreme Cold/Wind Chill", "Flash Flood", "Flood",
                "Frost/Freeze", "Funnel Cloud", "Freezing Fog", 
                "Hail", "Heat", "Heavy Rain", "Heavy Snow",
                "High Surf", "High Wind", "Hurricane", 
                "Ice Storm", "Lake-Effect Snow", "Lakeshore Flood",
                "Lightning", "Marine Hail", "Marine High Wind", 
                "Marine Strong Wind", "Marine Thunderstorm Wind", 
                "Rip Current", "Seiche", "Sleet", 
                "Storm Surge/Tide", "Strong Wind", "Thunderstorm Wind",
                "Tornado Typhoon", "Tropical Depression", "Tropical Storm",
                "Tsunami", "Volcanic Ash", "Waterspout",
                "Wildfire", "Winter Storm", "Winter Weather" ))  


#sEVTYPE <- tolower("HIGH SURF ADVISORY")    ## set manually here for testing

sEVTYPE <- as.character(sEVTYPE)
## tidy up the logged EVTYPE so we process each word - still can fail if words don't match afterwards

sEVTYPE <- gsub("\\(", " ", sEVTYPE)
sEVTYPE <- gsub("\\.", " ", sEVTYPE)
sEVTYPE <- gsub("\\\\", " ", sEVTYPE)
sEVTYPE <- gsub("-", " ", sEVTYPE)

## split into words, get a vector of words back
vStr <- unlist(strsplit(sEVTYPE, " "))

bestMatch <- ""
iMatches <- 0
matched_evtypes <- c()
## loop through EVTYPE words
for(word in vStr) {
  ## some words may be empty due to double/treble SPACing
  if (nchar(word)>0) {
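    ## remove ING or S at the end of the word otherwise it won't match (eg. STORMS, RAINING)
    ## note: the comparison is case-sensitive, so it only fires for upper-case input;
    ## EVTYPE is lower-cased before this function is applied to the dataset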
    if (substr(word, nchar(word)-3+1, nchar(word))=="ING") {
      word <- substr(word, 1, nchar(word)-3)
    } else if (substr(word, nchar(word)-1+1, nchar(word))=="S") {
      word <- substr(word, 1, nchar(word)-1)
    }
    ## is the word in any of the event groups
    ## go through groups and build matching word groups
    matched_evtypes <- c(matched_evtypes, evt_groups[grepl(word,evt_groups)])   
  }    
}
best_evtype <- "OTHER" ## default return value where words didn't match anything

if (length(matched_evtypes)>0) {
  ## create a data frame containing a table of frequencies of matching words
  ## from the event type 
  mevt <- as.data.frame(table(matched_evtypes))
  ## sort by frequency - descending - to give top matching event type
  mevt <- mevt[order(mevt$Freq, decreasing=TRUE),]
  ## return best matching event type name - it could still be wrong but better to have 48 than 205/397 events
  best_evtype <- as.character(mevt[1,c(1)])

}

best_evtype
}
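
To illustrate the word-voting idea in isolation, here is a minimal sketch on a reduced list of event groups. The helper name `vote_event_type` and the five-group list are hypothetical and chosen only for illustration; the function above uses all 48 groups and does more cleaning.

```r
# Hypothetical, reduced version of the word-voting idea (not the full function).
vote_event_type <- function(evtype, groups) {
  # split the logged event type into words, dropping empty strings
  words <- unlist(strsplit(tolower(evtype), " ", fixed = TRUE))
  words <- words[nchar(words) > 0]
  # each word "votes" for every group whose name contains it
  votes <- unlist(lapply(words, function(w) groups[grepl(w, groups, fixed = TRUE)]))
  if (length(votes) == 0) return("OTHER")
  tally <- table(votes)
  # table() sorts names alphabetically, so ties go to the first name
  names(tally)[which.max(tally)]
}

groups <- c("flash flood", "flood", "heavy rain", "high wind", "thunderstorm wind")

vote_event_type("flash flood", groups)  # "flash" and "flood" both vote for "flash flood"
vote_event_type("tstm wind", groups)    # only "wind" votes; the tie goes to "high wind"
vote_event_type("landslide", groups)    # no word matches, so "OTHER"
```

The second call shows why abbreviations are a weak spot: "tstm" contributes no votes, so the result is decided by the remaining words and by alphabetical tie-breaking.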

The following code does some pre-cleaning of the data (lower-casing, trimming white space, replacing slashes with spaces) and then applies the function above to every row of the subset.

stDm$EVTYPE = tolower(stDm$EVTYPE)
# remove leading and trailing white space
stDm$EVTYPE = gsub("^\\s+|\\s+$", "", stDm$EVTYPE)
# replace / or // with a space
stDm$EVTYPE = gsub("/|//", " ", stDm$EVTYPE)

# vapply returns a plain character vector (lapply would give a list column)
stDm$newEVTYPE <- vapply(as.character(stDm$EVTYPE), getBestEVTYPE, character(1), USE.NAMES = FALSE)

Summarizing the results: the code below aggregates the data by taking the mean of each column (fatalities, injuries, property damage and crop damage) for each event type and stores the result in a new data frame.

library(reshape2)
library(plyr)
library(ggplot2)

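# replace spaces in the matched names with underscores
stDm$newEVTYPE = gsub("\\s+", "_", stDm$newEVTYPE)
# long format: one row per event record and measure
m.stDm = melt(stDm, measure.vars = c("FATALITIES", "INJURIES", "PROPDMG", "CROPDMG"))

# wide format again: mean of each measure per event type
stmean = dcast(m.stDm, newEVTYPE ~ variable, fun.aggregate = mean)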
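
As a cross-check on the reshape step, the same kind of per-event-type mean can be computed with base R's `aggregate()`. The sketch below uses a small made-up data frame (`toy`, with invented values that are not taken from the storm data) so that the result can be verified by eye.

```r
# Toy data: the values are invented purely for illustration.
toy <- data.frame(
  newEVTYPE  = c("flood", "flood", "hail", "hail", "hail"),
  FATALITIES = c(1, 3, 0, 0, 3),
  INJURIES   = c(2, 4, 1, 1, 1),
  PROPDMG    = c(10, 30, 5, 5, 5),
  CROPDMG    = c(0, 2, 6, 0, 0)
)

# Mean of every measure column within each event type
toy_mean <- aggregate(cbind(FATALITIES, INJURIES, PROPDMG, CROPDMG) ~ newEVTYPE,
                      data = toy, FUN = mean)
toy_mean
```

The result has one row per event type and one column per measure, which is the same shape as `stmean`.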

####Results

1) The most damaging events in terms of human cost are excessive heat, tsunami, hurricane and dense fog. For this analysis, I combined the fatality and injury columns into a single total human cost. I arbitrarily decided that a death is twice as bad as an injury, so the total human cost is not an actual count of fatalities or injuries but a weighted score that is indicative of the total effect on humans.

stmean$totalHumanCost =  stmean$FATALITIES*2 + stmean$INJURIES
bytot = arrange(stmean, desc(totalHumanCost))

bytot[1:10, ]

##                  newEVTYPE FATALITIES INJURIES PROPDMG CROPDMG
## 1           excessive_heat     3.1173   9.1769   4.938  1.5558
## 2                  tsunami     2.3571   9.2143  64.664  1.4286
## 3                hurricane     0.5964   5.9552 106.534 48.4430
## 4                dense_fog     0.4420   5.9448  94.338  0.0000
## 5               dust_storm     0.2095   4.1905  50.471 20.0143
## 6              marine_hail     1.6000   1.4000  10.800  0.0000
## 7                    sleet     2.0000   0.0000   0.000  0.0000
## 8                 blizzard     0.3961   3.1569  99.680  0.6745
## 9                ice_storm     0.1280   2.8693 100.543  2.2597
## 10 extreme_cold/wind_chill     0.9212   0.7879  33.686 18.8125
##    totalHumanCost
## 1          15.412
## 2          13.929
## 3           7.148
## 4           6.829
## 5           4.610
## 6           4.600
## 7           4.000
## 8           3.949
## 9           3.125
## 10          2.630
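
The weighted score is simple enough to check by hand. The short sketch below recomputes it for the first two rows, using the rounded means copied from the table above; the vector names are only illustrative.

```r
# Means copied from the first two rows of the table above (already rounded)
fatalities <- c(excessive_heat = 3.1173, tsunami = 2.3571)
injuries   <- c(excessive_heat = 9.1769, tsunami = 9.2143)

# A fatality is weighted twice as heavily as an injury
total_human_cost <- 2 * fatalities + injuries
round(total_human_cost, 2)
```

Up to rounding, the values agree with the totalHumanCost column printed for excessive heat and tsunami.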

p1 = ggplot(bytot[1:10,], aes(x= newEVTYPE, y = totalHumanCost) ) + geom_bar(stat= "identity")

mytheme = theme_bw(base_size= 15) +theme(axis.text.x = element_text(angle = 90))
p1 + labs(title = "Top 10 Natural Disasters with Human cost", x = "") + mytheme

[Figure: bar chart of the top 10 event types by total human cost]

2) Economic impact: I ran out of time to do this properly, so deduct points as you see fit. I completely ignored the exponent columns (e.g. PROPDMGEXP) when tabulating the total economic cost. All I did was add up the crop and property damage columns and graph the top 10 event types by that total, so the figures below are unscaled PROPDMG and CROPDMG values rather than dollar amounts.

stmean$totcost = stmean$PROPDMG + stmean$CROPDMG
bycost = arrange(stmean, desc(totcost))

bycost[1:10, ]

##           newEVTYPE FATALITIES INJURIES PROPDMG CROPDMG totalHumanCost
## 1      volcanic_ash    0.00000  0.00000  250.00   0.000         0.0000
## 2        waterspout    0.07273  0.52727  173.89   0.000         0.6727
## 3         hurricane    0.59641  5.95516  106.53  48.443         7.1480
## 4           drought    0.05882  0.09191   15.84 124.811         0.2096
## 5    tropical_storm    0.15677  0.90974  118.60  15.357         1.2233
## 6  storm_surge/tide    0.10714  0.19196  116.83   3.817         0.4062
## 7            seiche    0.00000  0.00000  108.89   0.000         0.0000
## 8          wildfire    0.07183  1.28332   99.44   6.836         1.4270
## 9     coastal_flood    0.04509  0.63611   88.72  16.151         0.7263
## 10        ice_storm    0.12800  2.86933  100.54   2.260         3.1253
##    totcost
## 1    250.0
## 2    173.9
## 3    155.0
## 4    140.7
## 5    134.0
## 6    120.6
## 7    108.9
## 8    106.3
## 9    104.9
## 10   102.8
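
For reference, here is a minimal sketch of how the exponent columns could be applied, which this analysis does not do. It assumes that the letter codes K, M and B stand for thousands, millions and billions, and, as a simplifying assumption made here, treats every other code as a multiplier of 1. The helper name `exp_multiplier`, the column `PROPDMG_USD` and the toy rows are hypothetical.

```r
# Hypothetical helper: map an exponent code to a numeric multiplier.
# Assumption: codes other than K, M, B (in any case) are treated as 1.
exp_multiplier <- function(code) {
  lookup <- c(K = 1e3, M = 1e6, B = 1e9)
  mult <- lookup[toupper(as.character(code))]
  mult[is.na(mult)] <- 1
  unname(mult)
}

# Toy rows, invented for illustration
toy <- data.frame(PROPDMG = c(25, 2.5, 10),
                  PROPDMGEXP = c("K", "M", ""),
                  stringsAsFactors = FALSE)
toy$PROPDMG_USD <- toy$PROPDMG * exp_multiplier(toy$PROPDMGEXP)
toy
```

Applying such a multiplier before taking the means would put property and crop damage on a common dollar scale, which would likely change the ranking shown above.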

p2 = ggplot(bycost[1:10,], aes(x= newEVTYPE, y = totcost) ) + geom_bar(stat= "identity")

p2 + labs(title = "Top 10 Natural Disasters with Total Economic Cost", x = "", y = "Total Cost") + mytheme

[Figure: bar chart of the top 10 event types by total economic cost]