This data was cleaned by matching event types, as far as possible, to the now-standardized list of weather events maintained by NOAA. To determine costs, the data was aggregated by taking the mean of each cost measure for each event type. On average, excessive heat, tsunamis, and hurricanes have the greatest effect on human lives.
On average, volcanic ash, waterspouts, hurricanes, and droughts have the greatest economic costs when damage to property and agriculture is totaled.
###Data Preprocessing
The following steps were taken to clean and preprocess the data:
1) Subset the data to the event type and the columns that pertain to human injuries and fatalities as well as property and crop damages.
2) Select only the rows where at least one of fatalities, injuries, or damages is greater than zero. This reduces the dataset to about a third of its original size.
storm <- read.csv("repdata-data-StormData.csv.bz2")
relevant = c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")
stDm = storm[, relevant ]
good = stDm$FATALITIES >0 | stDm$INJURIES > 0 |stDm$PROPDMG > 0 | stDm$CROPDMG > 0
stDm = stDm[good, ]
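To make the row filter in step 2 concrete, here is a minimal sketch on a made-up data frame. The `toy` values are invented for illustration and are not taken from the storm data. A row is kept when at least one of the four measures is positive.

```r
# Toy data: hypothetical values, not from the NOAA file
toy <- data.frame(
  EVTYPE     = c("HAIL", "FLOOD", "HEAT", "FOG"),
  FATALITIES = c(0, 0, 2, 0),
  INJURIES   = c(0, 1, 0, 0),
  PROPDMG    = c(0, 25, 0, 0),
  CROPDMG    = c(0, 0, 0, 0),
  stringsAsFactors = FALSE
)

# Keep a row if any of the four measures is greater than zero
keep <- toy$FATALITIES > 0 | toy$INJURIES > 0 | toy$PROPDMG > 0 | toy$CROPDMG > 0
kept <- toy[keep, ]

# Share of rows retained by the filter
mean(keep)
```

Running `mean(good)` on the real data in the same way would give the retained share directly.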
3) The following function tries to match each EVTYPE to the standardized list in the NOAA database (http://www.ncdc.noaa.gov/stormevents/pd01016005curr.pdf). It does some basic cleaning of the data, separates the character strings into words, and then tries to match them, as best it can, to the list of 48 names. The function is not perfect: it cannot handle certain cases, such as abbreviations. For example, “tstm” on its own matches none of the names and would be coded as “other” rather than as a thunderstorm.
getBestEVTYPE <- function( sEVTYPE) {
#this function tries to match the event types in this dataset to the 48 standardized categories
#in the NOAA database (http://www.ncdc.noaa.gov/stormevents/pd01016005curr.pdf)
evt_groups <- tolower(c("Astronomical Low Tide", "Avalanche", "Blizzard",
"Coastal Flood", "Cold/Wind Chill", "Debris Flow",
"Dense Fog", "Dense Smoke", "Drought",
"Dust Devil", "Dust Storm", "Excessive Heat",
"Extreme Cold/Wind Chill", "Flash Flood", "Flood",
"Frost/Freeze", "Funnel Cloud", "Freezing Fog",
"Hail", "Heat", "Heavy Rain", "Heavy Snow",
"High Surf", "High Wind", "Hurricane",
"Ice Storm", "Lake-Effect Snow", "Lakeshore Flood",
"Lightning", "Marine Hail", "Marine High Wind",
"Marine Strong Wind", "Marine Thunderstorm Wind",
"Rip Current", "Seiche", "Sleet",
"Storm Surge/Tide", "Strong Wind", "Thunderstorm Wind",
"Tornado Typhoon", "Tropical Depression", "Tropical Storm",
"Tsunami", "Volcanic Ash", "Waterspout",
"Wildfire", "Winter Storm", "Winter Weather" ))
#sEVTYPE <- tolower("HIGH SURF ADVISORY") ## set manually here for testing
sEVTYPE <- as.character(sEVTYPE)
## tidy up the logged EVTYPE so we process each word - still can fail if words don't match afterwards
sEVTYPE <- gsub("\\(", " ", sEVTYPE)
sEVTYPE <- gsub("\\.", " ", sEVTYPE)
sEVTYPE <- gsub("\\\\", " ", sEVTYPE)
sEVTYPE <- gsub("-", " ", sEVTYPE)
## split into words, get a vector of words back
vStr <- unlist(strsplit(sEVTYPE, " "))
bestMatch <- ""
iMatches <- 0
matched_evtypes <- c()
## loop through EVTYPE words
for(word in vStr) {
## some words may be empty due to double/triple spacing
if (nchar(word)>0) {
## remove ING or S at the end of the word otherwise it won't match (eg. STORMS, RAINING)
## note: the comparison is case-sensitive, so only upper-case suffixes are removed
if (substr(word, nchar(word)-3+1, nchar(word))=="ING") {
word <- substr(word, 1, nchar(word)-3)
} else if (substr(word, nchar(word)-1+1, nchar(word))=="S") {
word <- substr(word, 1, nchar(word)-1)
}
## is the word in any of the event groups
## go through groups and build matching word groups
matched_evtypes <- c(matched_evtypes, evt_groups[grepl(word,evt_groups)])
}
}
best_evtype <- "OTHER" ## default return value where words didn't match anything
if (length(matched_evtypes)>0) {
## create a data frame containing a table of frequencies of matching words
## from the event type
mevt <- as.data.frame(table(matched_evtypes))
## sort by frequency - descending - to give top matching event type
mevt <- mevt[order(mevt$Freq, decreasing=TRUE),]
## return best matching event type name - it could still be wrong but better to have 48 than 205/397 events
best_evtype <- as.character(mevt[1,c(1)])
}
best_evtype
}
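One possible way to reduce the abbreviation problem is to expand known abbreviations before matching. The following is only a sketch under stated assumptions: `expandAbbrev` and the `abbrev` lookup are hypothetical names, and apart from “tstm” the lookup entries are illustrative guesses rather than abbreviations confirmed in the dataset.

```r
# Hypothetical lookup of abbreviations and their expansions (only "tstm" comes from the text)
abbrev <- c(tstm = "thunderstorm", hvy = "heavy", fld = "flood")

expandAbbrev <- function(evtype, lookup = abbrev) {
  # Lower-case and split on single spaces, mirroring the splitting in getBestEVTYPE
  words <- unlist(strsplit(tolower(evtype), " "))
  # Drop empty strings produced by repeated spaces
  words <- words[nchar(words) > 0]
  # Replace each word found in the lookup; leave the rest unchanged
  hit <- words %in% names(lookup)
  words[hit] <- lookup[words[hit]]
  paste(words, collapse = " ")
}

expandAbbrev("TSTM WIND")
```

The expanded string could then be passed to `getBestEVTYPE` in place of the raw event type.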
This code does some pre-cleaning of the data and then applies the above function to the complete dataset.
stDm$EVTYPE = tolower(stDm$EVTYPE)
#removes leading and trailing white spaces
stDm$EVTYPE = gsub("^\\s+|\\s+$", "", stDm$EVTYPE)
#replaces / or // with spaces
stDm$EVTYPE = gsub("/|//", " ", stDm$EVTYPE)
#applies the matching function to every row; vapply returns a character vector rather than a list
stDm$newEVTYPE <- vapply(stDm$EVTYPE, getBestEVTYPE, character(1), USE.NAMES = FALSE)
Summarizing the results: This code aggregates the data by finding the mean of each column (fatalities, injuries, and economic costs) for each event type and puts it in a new data frame.
library(reshape2)
library(plyr)
library(ggplot2)
stDm$newEVTYPE = gsub("\\s+", "_", stDm$newEVTYPE)
m.stDm = melt(stDm, measure.vars = c("FATALITIES", "INJURIES", "PROPDMG", "CROPDMG"))
stmean = dcast(m.stDm, newEVTYPE ~ variable, fun.aggregate = mean)
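For readers unfamiliar with melt and dcast, the same per-event means can be illustrated with base R's `aggregate`. This is a sketch on invented numbers (the `toy` data frame is hypothetical), not the code used to produce the results in this report.

```r
# Toy data: hypothetical values, not from the NOAA file
toy <- data.frame(
  newEVTYPE  = c("flood", "flood", "heat"),
  FATALITIES = c(0, 2, 4),
  INJURIES   = c(1, 3, 10),
  PROPDMG    = c(10, 30, 0),
  CROPDMG    = c(0, 4, 0)
)

# Mean of every measure within each event type
toyMean <- aggregate(cbind(FATALITIES, INJURIES, PROPDMG, CROPDMG) ~ newEVTYPE,
                     data = toy, FUN = mean)
toyMean
```

Each output row holds one event type and the average of each measure over all events of that type.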
####Results
1) The most damaging events in terms of human cost are excessive heat, tsunami, hurricane and dense fog. For my analysis, I combined the fatality and injury columns into one measure of total human cost. I arbitrarily decided that a death was twice as bad as an injury, so the total human cost is not an actual count of fatalities or injuries but a scaled factor that is indicative of the total effect on humans.
stmean$totalHumanCost = stmean$FATALITIES*2 + stmean$INJURIES
bytot = arrange(stmean, desc(totalHumanCost))
bytot[1:10, ]
## newEVTYPE FATALITIES INJURIES PROPDMG CROPDMG
## 1 excessive_heat 3.1173 9.1769 4.938 1.5558
## 2 tsunami 2.3571 9.2143 64.664 1.4286
## 3 hurricane 0.5964 5.9552 106.534 48.4430
## 4 dense_fog 0.4420 5.9448 94.338 0.0000
## 5 dust_storm 0.2095 4.1905 50.471 20.0143
## 6 marine_hail 1.6000 1.4000 10.800 0.0000
## 7 sleet 2.0000 0.0000 0.000 0.0000
## 8 blizzard 0.3961 3.1569 99.680 0.6745
## 9 ice_storm 0.1280 2.8693 100.543 2.2597
## 10 extreme_cold/wind_chill 0.9212 0.7879 33.686 18.8125
## totalHumanCost
## 1 15.412
## 2 13.929
## 3 7.148
## 4 6.829
## 5 4.610
## 6 4.600
## 7 4.000
## 8 3.949
## 9 3.125
## 10 2.630
p1 = ggplot(bytot[1:10,], aes(x= newEVTYPE, y = totalHumanCost) ) + geom_bar(stat= "identity")
mytheme = theme_bw(base_size= 15) +theme(axis.text.x = element_text(angle = 90))
p1 + labs(title = "Top 10 Natural Disasters with Human cost", x = "") + mytheme
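Because the weight of 2 is arbitrary, it may help to see the human cost score as a function of that weight. The sketch below uses a hypothetical helper, `humanCost`, with the weight exposed as the parameter `w`. With the mean values for excessive heat from the table above, the default weight reproduces the reported 15.412 up to rounding.

```r
# Weighted human cost: each fatality counts `w` times as much as an injury
humanCost <- function(fatalities, injuries, w = 2) {
  w * fatalities + injuries
}

# Mean fatalities and injuries for excessive heat, taken from the table above
humanCost(3.1173, 9.1769)         # weight used in this analysis
humanCost(3.1173, 9.1769, w = 5)  # a heavier, equally arbitrary weight
```

Re-ranking the events under a few different weights would show how sensitive the top 10 is to this choice.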
2) Economic impact: I ran out of time to do this properly, so deduct points as you see fit. I completely ignored the exponent columns (e.g. PROPDMGEXP) in tabulating the total economic cost. All I did was add up the crop and property damage columns and graph the top 10 in terms of economic cost.
stmean$totcost = stmean$PROPDMG + stmean$CROPDMG
bycost = arrange(stmean, desc(totcost))
bycost[1:10, ]
## newEVTYPE FATALITIES INJURIES PROPDMG CROPDMG totalHumanCost
## 1 volcanic_ash 0.00000 0.00000 250.00 0.000 0.0000
## 2 waterspout 0.07273 0.52727 173.89 0.000 0.6727
## 3 hurricane 0.59641 5.95516 106.53 48.443 7.1480
## 4 drought 0.05882 0.09191 15.84 124.811 0.2096
## 5 tropical_storm 0.15677 0.90974 118.60 15.357 1.2233
## 6 storm_surge/tide 0.10714 0.19196 116.83 3.817 0.4062
## 7 seiche 0.00000 0.00000 108.89 0.000 0.0000
## 8 wildfire 0.07183 1.28332 99.44 6.836 1.4270
## 9 coastal_flood 0.04509 0.63611 88.72 16.151 0.7263
## 10 ice_storm 0.12800 2.86933 100.54 2.260 3.1253
## totcost
## 1 250.0
## 2 173.9
## 3 155.0
## 4 140.7
## 5 134.0
## 6 120.6
## 7 108.9
## 8 106.3
## 9 104.9
## 10 102.8
p2 = ggplot(bycost[1:10,], aes(x= newEVTYPE, y = totcost) ) + geom_bar(stat= "identity")
p2 + labs(title = "Top 10 Natural Disasters with Total Economic Cost", x = "", y = "Total Cost") + mytheme
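For completeness, here is a rough sketch of how the exponent columns could be applied. It was not used for the results above. It assumes the documented codes K, M, and B stand for thousands, millions, and billions, and that any other code (blank, digits, symbols) is treated as a multiplier of 1, which is a simplifying assumption. The helper name `expToMultiplier` and the toy values are hypothetical.

```r
# Map exponent codes to multipliers; unknown codes fall back to 1 (assumption)
expToMultiplier <- function(exp_code) {
  code <- toupper(as.character(exp_code))
  mult <- c(K = 1e3, M = 1e6, B = 1e9)[code]
  mult[is.na(mult)] <- 1
  unname(mult)
}

# Toy values, not taken from the storm data
dmg     <- c(25, 2.5, 1, 10)
dmg_exp <- c("K", "m", "B", "")

# Damage in dollars after applying the exponent
dollars <- dmg * expToMultiplier(dmg_exp)
dollars
```

On the real data the same idea would be `stDm$PROPDMG * expToMultiplier(stDm$PROPDMGEXP)`, and likewise for the crop columns, before taking means.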