National Weather Service - Top Severe Events

Data Processing

The data processing will take place in 4 stages:

Load the main storm raw data set, and create a secondary data set containing the current 48 categories
Through a fully repeatable string manipulation, match as many “event types” between the two data sets
Aggregate the totals by those that matched the 48 categories, and asses if the percentage of amount of bodily and dollar damage, not the 958 categories, is high enough to answer the questions we are attempting to answer
Find out which top number, top 5 or top 10, contains at least 80% of the measurements, to name the top of the 48 categories

Loading raw data

The 48 classifications were copied from page 6 of the NWS Storm Data Classification document

# Loading necessary libraries for the analysis --------------------------------------------------------------
library(stringr)
library(ggplot2)
library(scales)
# Loading raw data and creating the data set of the 48 categories -------------------------------------------
storm <- read.csv("repdata_data_StormData.csv.bz2")
storm.official <- c("Astronomical Low Tide","Avalanche","Blizzard","Coastal Flood","Cold/Wind Chill","Debris Flow","Dense Fog","Dense Smoke","Drought","Dust Devil","Dust Storm","Excessive Heat","Extreme Cold/Wind Chill","Flash Flood","Flood","Frost/Freeze","Funnel Cloud","Freezing Fog","Hail","Heat","Heavy Rain","Heavy Snow","High Surf","High Wind","Hurricane (Typhoon)","Ice Storm","Lake-Effect Snow","Lakeshore Flood","Lightning","Marine Hail","Marine High Wind","Marine Strong Wind","Marine Thunderstorm Wind","Rip Current","Seiche","Sleet","Storm Surge/Tide","Strong Wind","Thunderstorm Wind","Tornado","Tropical Depression","Tropical Storm","Tsunami","Volcanic Ash","Waterspout","Wildfire","Winter Storm","Winter Weather")

Data transfomation and cleanup

Created function that will update the event name in both data sets, to ensure consistency in the transformation

transformation <- function(original){
  original <- toupper(original)
  original  <- str_replace_all(original , "[[:punct:]]", "")
  original  <- str_replace_all(original ,  " ", "")
  original  <- str_replace_all(original ,"TSTM", "THUNDERSTORM")
  original  <- str_replace_all(original ,"DENSEFOG", "FOG")
  original  <- str_replace_all(original ,"WAVE", "")
  original  <- str_replace_all(original ,"EXTREMECOLDWINDCHILL", "EXTREMECOLD")
  original  <- str_replace_all(original ,"HURRICANETYPHOON", "HURRICANE")
  original  <- str_replace_all(original ,"EXTREMEHEAT", "HEAT")
  original  <- str_replace_all(original ,"S", "")
}

Applying the new function to both data sets.

# Appending a new column (NewEV) using the 'transformation' function, passing the current event classification
storm <- cbind(storm, NewEV=transformation(storm[,8]))
# Doing the same with the data set of the 48 classifications to create a common key
storm.official <- as.data.frame(storm.official)
storm.official <- cbind(storm.official,transformation(storm.official[,1]))
colnames(storm.official) <- c("Official","NewEV")

The following section of code will aggregate the 3 main facts, namely fatalities, injuries and damages

# Aggregating Injuries by Event, and sorting by total -------------------------------------------------------
injuries.aggregate <-aggregate(INJURIES~NewEV, data=storm, FUN="sum")
injuries.sorted<- injuries.aggregate[order(injuries.aggregate$INJURIES,decreasing=TRUE),]
injuries.sorted<- cbind(injuries.sorted, rank.injuries=c(1:nrow(injuries.sorted)))
# Aggregating Fatalities by Event, and sorting by total -----------------------------------------------------
fatalities.aggregate <- aggregate(FATALITIES~NewEV, data=storm, FUN="sum")
fatalities.sorted<- fatalities.aggregate[order(fatalities.aggregate$FATALITIES,decreasing=TRUE),]
fatalities.sorted<- cbind(fatalities.sorted, rank.fatalities=c(1:nrow(fatalities.sorted)))
# Merging Injuries and Fatalities results and removing 0 totals ---------------------------------------------
sorted.table <- merge(injuries.sorted, fatalities.sorted, byx=NewEV, byy=NewEV)
sorted.table <- subset(sorted.table, INJURIES >=1 & FATALITIES >= 1)
# Finding the matches between the Official table and the new fatalities+injuries data set -------------------
sorted.matches <- merge(sorted.table, storm.official, byx=NewEV, byy=NewEV)
sorted.matches <- sorted.matches[order(sorted.matches$FATALITIES, decreasing = TRUE),]
# Calculating the percent of injuries and fatalities that reclassified into the 48 --------------------------
percent.injuries <- sum(sorted.matches$INJURIES)/sum(sorted.table$INJURIES)
percent.fatalities <- sum(sorted.matches$FATALITIES)/sum(sorted.table$FATALITIES)
# Obetining the total unmatched -----------------------------------------------------------------------------
difference.injuries <- sum(sorted.table$INJURIES)-sum(sorted.matches$INJURIES)
difference.fatalities <- sum(sorted.table$FATALITIES)-sum(sorted.matches$FATALITIES)
# Aggregating dollar damages by event ----------------------------------------------------------------------
damage.aggregate <- aggregate(PROPDMG+CROPDMG~NewEV, data=storm, FUN="sum")
colnames(damage.aggregate) <- c("NewEV", "DAMAGES")
# Finding the matches between the Official table and the aggregate damages table ----------------------------
damage.matched <- merge(damage.aggregate, storm.official, byx=NewEV, byY=NewEV)
damage.matched <- damage.matched[order(damage.matched$DAMAGES, decreasing=TRUE),]
# Estimating percent and amount difference between the full set and the matched set -------------------------
percent.damages <- sum(damage.matched$DAMAGES)/ sum(damage.aggregate$DAMAGES)
difference.damages <-  sum(damage.aggregate$DAMAGES) - sum(damage.matched$DAMAGES)

Transformation Justification

The description of the planned transformations were outlined in the beginning of this section (Data Processing), also through the inline code comments, and also with the intoductions to each chunk.
As far as justification, since outlier event types within the data will be harder to reclassify, we need to find out if the approach above captures most of the quantifiable data we need analyze. The best way to confirm, is to take the totals of the matched categories for each measure and compare against the total counts.

paste(round(percent.fatalities*100, digits=2), "% of the total Injuries matched to one of the 48 official Classifications.", comma_format()(difference.fatalities), "of ", comma_format()(sum(fatalities.sorted$FATALITIES)), "total fatalities remain unmatched");paste(round(percent.injuries*100, digits=2), "% of the total Injuries matched to one of the 48 official Classifications.", comma_format()(difference.injuries), "of ", comma_format()(sum(injuries.sorted$INJURIES)), "total injuries remain unmatched") ;paste(round(percent.damages*100, digits=2), "% of the total Injuries matched to one of the 48 official Classifications.", comma_format()(difference.damages), "of ", comma_format()(sum(damage.aggregate$DAMAGES)), "total damages remain unmatched")

## [1] "97.02 % of the total Injuries matched to one of the 48 official Classifications. 446 of  15,145 total fatalities remain unmatched"

## [1] "98.46 % of the total Injuries matched to one of the 48 official Classifications. 2,158 of  140,528 total injuries remain unmatched"

## [1] "97.37 % of the total Injuries matched to one of the 48 official Classifications. 322,964 of  12,262,327 total damages remain unmatched"

As shown above, the data transformation and cleanup process reclassified the vast majority of the injuries, fatalities and damages into one of the 48 classification, and thus justifying this simple approach to conform to the official list and provide an accurate ranking.

Event ranking

# Selecting the top 10 events from the Injury/Fatality table ------------------------------------------------ 
top10.matches <- sorted.matches[1:10,]
# Calculating percentage and total injuries and fatalities that the top 10 represent ------------------------
top10.percent.injuries <- sum(top10.matches$INJURIES)/sum(sorted.table$INJURIES)
top10.injuries <- sum(top10.matches$INJURIES)
top10.percent.fatalities <- sum(top10.matches$FATALITIES)/sum(sorted.table$FATALITIES)
top10.fatalities <- sum(top10.matches$FATALITIES)
# Selecting top 5 from the damages table and calculating the percentage of total damages that represent -----
top5.damages <- damage.matched[1:5,]
percent.top5 <- sum(top5.damages$DAMAGES) /  sum(damage.aggregate$DAMAGES)
paste("The top 10 events represent ", round(top10.percent.injuries*100, digits=2), "% of total Injuries and",round(top10.percent.fatalities*100, digits=2) , "% of the total fatalities");paste("The top 5 events represent ", round(percent.top5*100, digits=2), "% of the total in damages")

## [1] "The top 10 events represent  89.7 % of total Injuries and 85.91 % of the total fatalities"

## [1] "The top 5 events represent  82.43 % of the total in damages"

National Weather Service - Top Severe Events

Edgar Ruiz

Saturday, August 22, 2015

Synopsis

Data Processing

Loading raw data

Data transfomation and cleanup

Transformation Justification

Event ranking

Results

Most harmful events

Events with the greatest economic consequences