Synopsis

The goal of this analysis is to find the weather events that create the most amount of damage, both economically and to personal health.
We will attempt to answer two questions directly. They are

1. Across the United States, which types of events (as indicated by the EVTYPE variable) are most harmful with respect to population health?
2. Across the United States, which types of events have the greatest economic consequences?

To accomplish this we look at data containing information regarding amounts of varying types of damage and the weather that caused the damages. The data used in this analysis is taken from the NOAA Storm Database. This data was not altered before the following process.



Loading libraries

install_load <- function(pkgs) {
    # Find which packages are not yet installed
    new_pkgs <- pkgs[!(pkgs %in% installed.packages()[, "Package"])]
    
    # Install missing ones
    if (length(new_pkgs)) {
        install.packages(new_pkgs, dependencies = TRUE)
    }
    
    # Load all packages
    sapply(pkgs, require, character.only = TRUE)
}

# Use the function
install_load(c("tidyverse", "scales","tidytext"))
## tidyverse    scales  tidytext 
##      TRUE      TRUE      TRUE



Data Processing


We begin by reading in the data using fread from the data.table package. We select nine columns to read to avoid bringing in any extra data. These columns contain the dates of the events, the state where the event occurred, the events themselves, and then six columns that all concern the impacts. We also remove data for years 1992 and prior as the number of records increases dramatically after this, likely indicating improved processes with regards to collecting data.

#   Read in dataset, select columns to read, and rename variables.
StormDataOriginal <- data.table::fread("repdata_data_StormData.csv.bz2", 
                       select = c(2,7,8,23,24,25,26,27,28))
names_vect <- c("Date", "State", "Event.Type", "Deaths", 
        "Injuries", "Prop.Dam.Significand", "Prop.Dam.Exponent",
        "Crop.Dam.Significand", "Crop.Dam.Exponent")

StormData <- StormDataOriginal
colnames(StormData) <- names_vect


#   Convert dates to date format and add a column for the year.
#   Subset the dataset for years after 1992.
StormData$Date <- as.Date(StormData$Date, format = "%m/%d/%Y")
StormData$Year <- with(StormData, format(Date, "%Y"))
StormData$Year <- as.numeric(StormData$Year)
StormData <- StormData[StormData$Year >= 1993]             
summary(StormData)
##       Date               State            Event.Type            Deaths            Injuries        
##  Min.   :1993-01-01   Length:714738      Length:714738      Min.   :  0.0000   Min.   :0.000e+00  
##  1st Qu.:1999-06-23   Class :character   Class :character   1st Qu.:  0.0000   1st Qu.:0.000e+00  
##  Median :2004-06-14   Mode  :character   Mode  :character   Median :  0.0000   Median :0.000e+00  
##  Mean   :2003-12-14                                         Mean   :  0.0152   Mean   :9.621e-02  
##  3rd Qu.:2008-06-22                                         3rd Qu.:  0.0000   3rd Qu.:0.000e+00  
##  Max.   :2011-11-30                                         Max.   :583.0000   Max.   :1.568e+03  
##  Prop.Dam.Significand Prop.Dam.Exponent  Crop.Dam.Significand Crop.Dam.Exponent       Year     
##  Min.   :   0.00      Length:714738      Min.   :  0.000      Length:714738      Min.   :1993  
##  1st Qu.:   0.00      Class :character   1st Qu.:  0.000      Class :character   1st Qu.:1999  
##  Median :   0.00      Mode  :character   Median :  0.000      Mode  :character   Median :2004  
##  Mean   :  12.68                         Mean   :  1.928                         Mean   :2003  
##  3rd Qu.:   2.00                         3rd Qu.:  0.000                         3rd Qu.:2008  
##  Max.   :5000.00                         Max.   :990.000                         Max.   :2011


Next, we transform the data. The values for the amount of damage (in dollars) for both property damage and damage to crops are stored across two variables each. One contains information on the exponent or magnitude of the damages. Another contains the coefficient of the damages. If we combine them, through multiplication, we get the amount of damage, to property or to crops.

#   Create list of unique exponent values and corresponding numerical values.
Dam.Exponent <- union(StormData$Prop.Dam.Exponent,StormData$Crop.Dam.Exponent)

Dam.Values <- c( 1e3, 1e6, 1, 1e9, 1e6, 1, 1, 1e5, 
              1e6, 1, 1e4, 1e2, 1e3, 1e2, 1e7, 
              1e2, 1, 10, 1e8, 1e3)

#   Create relational table to change the exponent symbol to its numerical meaning
Dam.Multiplier <- cbind(Dam.Exponent, Dam.Values)

StormData <- left_join(StormData, Dam.Multiplier, 
               join_by("Prop.Dam.Exponent" == "Dam.Exponent"), 
               copy = TRUE)
colnames(StormData)[11] <- "Prop.Dam.Values"
StormData <- left_join(StormData, Dam.Multiplier, 
        join_by("Crop.Dam.Exponent" == "Dam.Exponent"),
        copy = TRUE)
colnames(StormData)[12] <- "Crop.Dam.Values"
StormData$Prop.Dam.Values <- as.numeric(StormData$Prop.Dam.Values)
StormData$Crop.Dam.Values <- as.numeric(StormData$Crop.Dam.Values)

#   Multiply the exponent value by the coefficient.  
StormData$Prop.Dam.Cost <- StormData$Prop.Dam.Significand * 
            StormData$Prop.Dam.Values
StormData$Crop.Dam.Cost <- StormData$Crop.Dam.Significand * 
    StormData$Crop.Dam.Values


#   Convert economic damages to trillions of dollars.
StormData$Prop.Dam.Cost <- StormData$Prop.Dam.Cost / (1e12)
StormData$Crop.Dam.Cost <- StormData$Crop.Dam.Cost / (1e12)

#   Remove obsolete variables.
StormData <- subset(StormData, 
            select = -c( Prop.Dam.Significand, 
                 Prop.Dam.Exponent,
                 Crop.Dam.Significand, 
                 Crop.Dam.Exponent,
                 Prop.Dam.Values,
                 Crop.Dam.Values))           

We also converted damages to trillions of dollars and removed any extra columns that are a result of the transformations.


There are 985 unique entries in row containing all of the events. Many of them are redundant or abbreviations of same event or misspellings of the same event. First we will subset the data so there are no rows with all zeros in the impact columns. That is to say only rows where there is at least one death or injury, or some monetary damage recorded for either property or crops greater than zero. The rest of the steps are taken to combine and reduce the number of different events.

#   Subset only rows with some form of impact.  If any of the impact variables
#   (Deaths, Injuries, Prop.Dam.Cost, Crop.Dam.Cost) are non zero they are kept. 
SD <- StormData[!rowSums(StormData[, c(4,5,7,8)]) == 0,]

#   Make all Events uppercase to avoid distinction by case.
SD$Event.Type <- toupper(SD$Event.Type)

#   Group all Events by meaning to reduce buckets and fix misspellings. 
SD$Event.Type <- gsub(".*THUN.*|.*LIGHTNING.*|.*TSTM.*|.*LIGNTNING.*|.*LIGHTING.*|.*MICRO.*|.*DOWNBURST.*|.*TURBULENCE.*"
              , "THUNDER/LIGHTNING", SD$Event.Type)
SD$Event.Type <- gsub(".*BLIZZARD.*", "BLIZZARD", SD$Event.Type)
SD$Event.Type <- gsub(".*SNOW.*|.*ICE.*|.*WINT.*|.*FROST.*|.*FREEZE.*|.*ICY.*|.*GLAZE.*|.*HEAVY MIX.*|.*FREEZING SPRAY.*", 
              "ICE/SNOW", SD$Event.Type)
SD$Event.Type <- gsub(".*TORNADO.*|.*WATERSPOUT.*|.*DUST.*|.*GUSTNADO.*|.*LANDSPOUT.*|.*FUNNEL.*|.*TORNDAO.*", "TORNADO", SD$Event.Type)
SD$Event.Type <- gsub(".*HURR.*|.*TYPH.*", "HURRICANE", SD$Event.Type)
SD$Event.Type <- gsub(".*HAIL.*", "HAIL", SD$Event.Type)
SD$Event.Type <- gsub(".*FLOOD.*|.*MUD.*|.*SLIDE.*|.*DAM.*|.*RISING.*|.*EROSION.*|.*LANDSLUMP.*|.*DROWN.*|.*HIGH WATER.*"
              , "FLOODS/DEBRIS FLOW", SD$Event.Type)
SD$Event.Type <- gsub(".*HEAT.*|.*DROUGHT.*|.*HYPER.*|.*WARM.*|.*DENSE SMOKE.*", "HEAT", SD$Event.Type)
SD$Event.Type <- gsub(".*RAIN.*|.*DRIZZ.*|.*SLEET.*|.*PRECIP.*|.*SHOWER.*|.*COOL AND WET.*|.*WETNESS.*", 
              "RAIN/SLEET", SD$Event.Type)
SD$Event.Type <- gsub(".*FIRE.*", "WILDFIRE", SD$Event.Type)
SD$Event.Type <- gsub(".*FOG.*", "FOG", SD$Event.Type)
SD$Event.Type <- gsub(".*COLD.*|.*WIND.*|.*HYPO.*|.*LOW TEMP.*", "COLD/WIND", SD$Event.Type)
SD$Event.Type <- gsub(".*SURF.*|.*TIDE.*|.*COAST.*|.*WAVE.*|.*RIP CUR.*", "COASTAL/SURF/TIDE", SD$Event.Type)
SD$Event.Type <- gsub(".*TROPIC.*", "TROPICAL STORMS", SD$Event.Type)
SD$Event.Type <- gsub(".*OTHER.*|.*\\?.*|.*APACHE.*|.*ACCIDENT.*|.*URBAN.*|.*MISHAP.*|^HIGH$", "OTHER", SD$Event.Type)
SD$Event.Type <- gsub(".*SEA.*|.*SWELL.*|.*STORM SURGE.*", "MARINE/SEAS", SD$Event.Type)
SD$Event.Type <- gsub(".*AVALAN.*", "AVALANCHE", SD$Event.Type)        


We now sum the deaths and injuries by the type of weather event. We also sum the damages to property and crops by the type of weather event. Then we take the top ten events in damages and casualties.

#   Combine Deaths and Injuries to one variable called Casualties
SD$Casualties <- SD$Deaths + SD$Injuries
#   Combine Property Damage and Crop Damage to give Total Damage.
SD$Dam.Cost <- (SD$Prop.Dam.Cost + SD$Crop.Dam.Cost)

#   Aggregate Deaths and Injuries by Event Type. Rename new columns.
#   Create variable summing the aggregated Deaths and Injuries.
#   Take top 10 events with regard to Casualties (Deaths + Injuries).
SD.Agg.DI <- aggregate(cbind(SD$Deaths,SD$Injuries)~SD$Event.Type, FUN = sum)
names(SD.Agg.DI) <- c("Event", "Deaths", "Injuries")
SD.Agg.DI$Cas <- SD.Agg.DI$Deaths + SD.Agg.DI$Injuries
SD.Agg.DI <- slice_max(.data = SD.Agg.DI, order_by = SD.Agg.DI$Cas, n = 10)


#   Aggregate Property and Crop damages by Event Type. Rename new columns.
#   Create variable summing the aggregated Property and Crop damages.
#   Take top 10 events with regard to Total damage (Property damage + Crop damage).
SD.Agg.PC <- aggregate(cbind(SD$Prop.Dam.Cost,SD$Crop.Dam.Cost)~SD$Event.Type, FUN = sum)
names(SD.Agg.PC) <- c("Event", "Prop.Dam", "Crop.Dam")
SD.Agg.PC$Dam.Cost <- SD.Agg.PC$Crop.Dam + SD.Agg.PC$Prop.Dam
SD.Agg.PC <- slice_max(.data = SD.Agg.PC, order_by = SD.Agg.PC$Dam.Cost, n = 10)


Next, we tidy the data. We make the dataset longer but reduce its width by collapse two variables into one. This will allow us to plot totals for damages and casualties but also show the components.

#   Tidy the data by taking the two variables Deaths and Injuries and combining 
#   them into one variable and then making the new variable a factor.
SD.Agg.byCas <- pivot_longer(data = SD.Agg.DI,cols = c("Deaths", "Injuries"), 
                 names_to = "Casualty.Type", values_to = "Casualties")

SD.Agg.byCas$Casualty.Type <- factor(SD.Agg.byCas$Casualty.Type, 
                     levels = c("Injuries", "Deaths"))

#   Tidy the data by taking the two variables Property Damage and Crop Damage
#   and combining them into one variable and then making the new variable a factor.
SD.Agg.byDam <- pivot_longer(data = SD.Agg.PC,cols = c("Prop.Dam", "Crop.Dam"), 
                 names_to = "Damage.Type", values_to = "Damage.Cost")

SD.Agg.byDam$Damage.Type <- factor(SD.Agg.byDam$Damage.Type, 
                     levels = c("Prop.Dam", "Crop.Dam"))



Results


Now we are ready to plot the impacts by event. First we will plot Casualties by event.

#   Plot stacked bar plot showing Casualties (stacking deaths and injuries) by Event Type.
k <- ggplot(SD.Agg.byCas, aes(x = reorder(Event, -Casualties, sum), y = Casualties, fill = Casualty.Type))+
    geom_bar(stat = "identity")+
    xlab("Weather Event")+
    ylab("Casualties")+
    theme(plot.title = element_text(hjust = 0.5) , 
          axis.text.x = element_text(angle = 45, vjust = 1, 
                        hjust = 1, size = 11), legend.position = "bottom", 
                    legend.title = element_blank(), 
                    axis.ticks.length.x = unit(0.5, "cm"))+
    labs(title = "Casualties by Weather Event")+
    ylim(0,30000)+
    scale_x_discrete(labels = c("Tornado", "Thunder\nLightning", 
                "Heat", "Floods\nDebris Flow", 
                "Ice\nSnow", "Cold\nWind", 
                "Wildfire", "Coastal\nSurf/Tide", 
                "Hurricane", "Fog"))+
    stat_summary(fun = sum, aes(label = after_stat(prettyNum(y, big.mark = ",")), group = Event), 
             geom = "text", vjust = -0.5)

k

Tornadoes are responsible for over 25,000 casualties. Thunder and Lightning are the next closest with less than 13,000. The event most harmful with respect to population health is a tornado.

Next we plot the damages by the event that caused the damages.

#   Plot stacked bar plot showing Total Damage (stacking property and crop damage) by Event Type.
l <- ggplot(SD.Agg.byDam, aes(x = reorder(Event, -Damage.Cost, sum), y = Damage.Cost, fill = Damage.Type))+
    geom_bar(stat = "identity")+
    xlab("Weather Event")+
    ylab("Damages (trillions of dollars)")+
    theme(plot.title = element_text(hjust = 0.5) , 
          axis.text.x = element_text(angle = 45, 
          vjust = 1, hjust = 1, size = 11), legend.position = "bottom", 
          legend.title = element_blank(),
          axis.ticks.length.x = unit(0.5, "cm"))+
    ylim(0, 50)+
    labs(title = "Economic Impact by Weather Event")+
    scale_x_discrete(labels = c("Floods\nDebris Flow", "Tornado", "Hail",
            "Hurricane",  "Heat", "Thunder\nLightning",
            "Ice\nSnow", "Cold\nWind",
            "Wildfire", "Tropical \n Storms"))+
    stat_summary(fun = sum, aes(label = dollar(round(after_stat(y), 2)), group = Event), 
             geom = "text", vjust = -0.5)

l

Floods (and flood-like weather) are the most economically consequential events in the United States. They have caused almost 44 trillion dollars in damage from 1993 to 2011 accounting for around 30% of all damages.



Additional Insights


We have now answered the questions set forth at the beginning of this analysis. Beyond this, we will look at the impacts of these weather events after grouping them by the state they happened in.

#   Aggregate all variables measuring impact by State. Rename variables. 
#   Scale from trillions to billions
SD.State <- aggregate(cbind(SD$Deaths, SD$Injuries,SD$Casualties,
                SD$Prop.Dam.Cost, SD$Crop.Dam.Cost, 
                SD$Dam.Cost)~SD$State, FUN = sum)

names(SD.State) <- c("State", "Deaths", "Injuries","Casualties", "Property.Damage",
             "Crop.Damage", "Total.Damage")
SD.State$Total.Damage <- SD.State$Total.Damage * 1e3
SD.State$Property.Damage <- SD.State$Property.Damage * 1e3
SD.State$Crop.Damage <- SD.State$Crop.Damage * 1e3



Again we will take the aggregated data and “unpivot” it so that the impacts are collapsed to one variable. This time we will take the states with the ten highest amounts of casualties and also damages.

#   Tidy the data by combining the four impact variables into one and making that 
#   new variable a factor.  Also add a variable classifying the type of impact.
SD.State.combined <- pivot_longer(data = SD.State, cols = c("Deaths","Injuries", "Property.Damage", "Crop.Damage"),
                  names_to = "Type", values_to = "Impact")
SD.State.combined$Type <- factor(SD.State.combined$Type, 
                       levels = c("Injuries", "Deaths","Crop.Damage", "Property.Damage"))
SD.State.combined$Facet <- ifelse(SD.State.combined$Type == "Deaths" | SD.State.combined$Type == "Injuries", "Personal", "Economic")


#   Split by type of impact.  Find 10 highest impacts in each type and then
#   recombine the split datasets.
SD.Personal <- SD.State.combined[SD.State.combined$Facet == "Personal",]
SD.Economic <- SD.State.combined[SD.State.combined$Facet == "Economic",]

SD.Personal.T10 <- slice_max(.data = SD.Personal, order_by = Casualties, n = 20)
SD.Economic.T10 <- slice_max(.data = SD.Economic, order_by = Total.Damage, n = 20)

SD.State.combined.Top10 <- rbind(SD.Personal.T10, SD.Economic.T10)



Now we can plot the ten most impacted states, both economically and with regards to health.

#   Plot a bar plot showing 10 most impacted states by both Economic and Personal Health
n <- ggplot(data = SD.State.combined.Top10, aes(x = reorder_within(State,-Impact,Facet), y = Impact, fill = Type))+
    geom_bar(stat = "identity")+
    facet_wrap(~Facet, scales = "free", labeller = as_labeller(
        c(Economic = "Economic Damage (billions of dollars)", Personal = "Casualties")))+
    xlab("State")+
    theme(plot.title = element_text(hjust = 0.5) , 
          axis.text.x = element_text(angle = 0, 
                       vjust = 1, hjust = 0.5, size = 11), legend.position = "bottom", 
          legend.title = element_blank(),
          axis.ticks.length.x = unit(0.5, "cm"))+
    labs(title = "Economic and Personal Health Damage by State")+
    scale_x_discrete(labels = function(x) gsub("__Economic|__Personal|___Economic|___Personal","",x))

n

Texas, Florida, California and Ohio are in the top ten for both economic damage and number of casualties. Increased efforts to buttress protections against major weather events in these states would likely yield the greatest return.