Based on the following analysis, Tornadoes should be the focus of resource allocation. It has the most bodily harm and economic impact that any other event. Thunderstorm Wind, Heat and Flood (flash and regular) should be prioritized next. The Results section will illustrate the quantifiable impact of the aforementioned events. The report tabulates data spanning over 60 years, and ensures proper analysis by reclassifying it to the latest NWS event categories.
The data processing will take place in 4 stages:
The 48 classifications were copied from page 6 of the NWS Storm Data Classification document
# Loading necessary libraries for the analysis --------------------------------------------------------------
library(stringr)
library(ggplot2)
library(scales)
# Loading raw data and creating the data set of the 48 categories -------------------------------------------
storm <- read.csv("repdata_data_StormData.csv.bz2")
storm.official <- c("Astronomical Low Tide","Avalanche","Blizzard","Coastal Flood","Cold/Wind Chill","Debris Flow","Dense Fog","Dense Smoke","Drought","Dust Devil","Dust Storm","Excessive Heat","Extreme Cold/Wind Chill","Flash Flood","Flood","Frost/Freeze","Funnel Cloud","Freezing Fog","Hail","Heat","Heavy Rain","Heavy Snow","High Surf","High Wind","Hurricane (Typhoon)","Ice Storm","Lake-Effect Snow","Lakeshore Flood","Lightning","Marine Hail","Marine High Wind","Marine Strong Wind","Marine Thunderstorm Wind","Rip Current","Seiche","Sleet","Storm Surge/Tide","Strong Wind","Thunderstorm Wind","Tornado","Tropical Depression","Tropical Storm","Tsunami","Volcanic Ash","Waterspout","Wildfire","Winter Storm","Winter Weather")
Created function that will update the event name in both data sets, to ensure consistency in the transformation
transformation <- function(original){
original <- toupper(original)
original <- str_replace_all(original , "[[:punct:]]", "")
original <- str_replace_all(original , " ", "")
original <- str_replace_all(original ,"TSTM", "THUNDERSTORM")
original <- str_replace_all(original ,"DENSEFOG", "FOG")
original <- str_replace_all(original ,"WAVE", "")
original <- str_replace_all(original ,"EXTREMECOLDWINDCHILL", "EXTREMECOLD")
original <- str_replace_all(original ,"HURRICANETYPHOON", "HURRICANE")
original <- str_replace_all(original ,"EXTREMEHEAT", "HEAT")
original <- str_replace_all(original ,"S", "")
}
Applying the new function to both data sets.
# Appending a new column (NewEV) using the 'transformation' function, passing the current event classification
storm <- cbind(storm, NewEV=transformation(storm[,8]))
# Doing the same with the data set of the 48 classifications to create a common key
storm.official <- as.data.frame(storm.official)
storm.official <- cbind(storm.official,transformation(storm.official[,1]))
colnames(storm.official) <- c("Official","NewEV")
The following section of code will aggregate the 3 main facts, namely fatalities, injuries and damages
# Aggregating Injuries by Event, and sorting by total -------------------------------------------------------
injuries.aggregate <-aggregate(INJURIES~NewEV, data=storm, FUN="sum")
injuries.sorted<- injuries.aggregate[order(injuries.aggregate$INJURIES,decreasing=TRUE),]
injuries.sorted<- cbind(injuries.sorted, rank.injuries=c(1:nrow(injuries.sorted)))
# Aggregating Fatalities by Event, and sorting by total -----------------------------------------------------
fatalities.aggregate <- aggregate(FATALITIES~NewEV, data=storm, FUN="sum")
fatalities.sorted<- fatalities.aggregate[order(fatalities.aggregate$FATALITIES,decreasing=TRUE),]
fatalities.sorted<- cbind(fatalities.sorted, rank.fatalities=c(1:nrow(fatalities.sorted)))
# Merging Injuries and Fatalities results and removing 0 totals ---------------------------------------------
sorted.table <- merge(injuries.sorted, fatalities.sorted, byx=NewEV, byy=NewEV)
sorted.table <- subset(sorted.table, INJURIES >=1 & FATALITIES >= 1)
# Finding the matches between the Official table and the new fatalities+injuries data set -------------------
sorted.matches <- merge(sorted.table, storm.official, byx=NewEV, byy=NewEV)
sorted.matches <- sorted.matches[order(sorted.matches$FATALITIES, decreasing = TRUE),]
# Calculating the percent of injuries and fatalities that reclassified into the 48 --------------------------
percent.injuries <- sum(sorted.matches$INJURIES)/sum(sorted.table$INJURIES)
percent.fatalities <- sum(sorted.matches$FATALITIES)/sum(sorted.table$FATALITIES)
# Obetining the total unmatched -----------------------------------------------------------------------------
difference.injuries <- sum(sorted.table$INJURIES)-sum(sorted.matches$INJURIES)
difference.fatalities <- sum(sorted.table$FATALITIES)-sum(sorted.matches$FATALITIES)
# Aggregating dollar damages by event ----------------------------------------------------------------------
damage.aggregate <- aggregate(PROPDMG+CROPDMG~NewEV, data=storm, FUN="sum")
colnames(damage.aggregate) <- c("NewEV", "DAMAGES")
# Finding the matches between the Official table and the aggregate damages table ----------------------------
damage.matched <- merge(damage.aggregate, storm.official, byx=NewEV, byY=NewEV)
damage.matched <- damage.matched[order(damage.matched$DAMAGES, decreasing=TRUE),]
# Estimating percent and amount difference between the full set and the matched set -------------------------
percent.damages <- sum(damage.matched$DAMAGES)/ sum(damage.aggregate$DAMAGES)
difference.damages <- sum(damage.aggregate$DAMAGES) - sum(damage.matched$DAMAGES)
The description of the planned transformations were outlined in the beginning of this section (Data Processing), also through the inline code comments, and also with the intoductions to each chunk.
As far as justification, since outlier event types within the data will be harder to reclassify, we need to find out if the approach above captures most of the quantifiable data we need analyze. The best way to confirm, is to take the totals of the matched categories for each measure and compare against the total counts.
paste(round(percent.fatalities*100, digits=2), "% of the total Injuries matched to one of the 48 official Classifications.", comma_format()(difference.fatalities), "of ", comma_format()(sum(fatalities.sorted$FATALITIES)), "total fatalities remain unmatched");paste(round(percent.injuries*100, digits=2), "% of the total Injuries matched to one of the 48 official Classifications.", comma_format()(difference.injuries), "of ", comma_format()(sum(injuries.sorted$INJURIES)), "total injuries remain unmatched") ;paste(round(percent.damages*100, digits=2), "% of the total Injuries matched to one of the 48 official Classifications.", comma_format()(difference.damages), "of ", comma_format()(sum(damage.aggregate$DAMAGES)), "total damages remain unmatched")
## [1] "97.02 % of the total Injuries matched to one of the 48 official Classifications. 446 of 15,145 total fatalities remain unmatched"
## [1] "98.46 % of the total Injuries matched to one of the 48 official Classifications. 2,158 of 140,528 total injuries remain unmatched"
## [1] "97.37 % of the total Injuries matched to one of the 48 official Classifications. 322,964 of 12,262,327 total damages remain unmatched"
As shown above, the data transformation and cleanup process reclassified the vast majority of the injuries, fatalities and damages into one of the 48 classification, and thus justifying this simple approach to conform to the official list and provide an accurate ranking.
# Selecting the top 10 events from the Injury/Fatality table ------------------------------------------------
top10.matches <- sorted.matches[1:10,]
# Calculating percentage and total injuries and fatalities that the top 10 represent ------------------------
top10.percent.injuries <- sum(top10.matches$INJURIES)/sum(sorted.table$INJURIES)
top10.injuries <- sum(top10.matches$INJURIES)
top10.percent.fatalities <- sum(top10.matches$FATALITIES)/sum(sorted.table$FATALITIES)
top10.fatalities <- sum(top10.matches$FATALITIES)
# Selecting top 5 from the damages table and calculating the percentage of total damages that represent -----
top5.damages <- damage.matched[1:5,]
percent.top5 <- sum(top5.damages$DAMAGES) / sum(damage.aggregate$DAMAGES)
paste("The top 10 events represent ", round(top10.percent.injuries*100, digits=2), "% of total Injuries and",round(top10.percent.fatalities*100, digits=2) , "% of the total fatalities");paste("The top 5 events represent ", round(percent.top5*100, digits=2), "% of the total in damages")
## [1] "The top 10 events represent 89.7 % of total Injuries and 85.91 % of the total fatalities"
## [1] "The top 5 events represent 82.43 % of the total in damages"
Was not appropriate to add fatalities and injuries together, since the severity of each is completely different. It was also a challenge to solely based this assessment on fatalities, since a high number of injuries should merit consideration for resource allocation. The plot below shows the injuries and fatalities in their own axis.
ggplot(data=top10.matches, aes(FATALITIES,INJURIES, colour=Official))+geom_point(size=5)+labs(title="Top 10 Health Damage Metrics",x="Fatalities",y="Injuries")+geom_text(label=top10.matches$Official, vjust=-1, size=4)
As the plot shows, Tornadoes are by far the most harmful on both fatalities and injuries. Thunderstorm Wind has a higher injury to fatality ratio and Exesive heat has an opposite ratio, both should be considered second. The rest in the ranking are Extreme Cold, Flash Flood, Flood, Heat, High Wind, Lightning and Rip Current.
ggplot(data=top5.damages, aes(y=DAMAGES, x=Official,fill=factor(Official)))+geom_bar(stat="identity")+ geom_text(label=comma_format()(top5.damages$DAMAGES), vjust=-1, size=3) +theme(legend.position="none",axis.text.y=element_blank()) +labs(title="Top 5 Monetary Damages",x="Event Name",y="Damage $")
Once again Tornado is ahead in economic consequences. Thunderstorm Wind follows behind, just as in with harmful events. Flash Flood and Flood are in both top lists, but Hail appears in 3rd place only on this list. This makes sense due to property damage hail can cause, without necessarily injuring a significant number of individuals.