Synopsis

This data analysis uses the NOAA Storm Database to address two fundamental questions about severe weather events: Across the US, which types of events are most harmful with respect to human health? And which types of events have the greatest economic consequences?

The analysis is fully reproducible, beginning with downloading and reading the raw NOAA storm data CSV file, which is compressed in bz2 format. Next, the data is processed by removing variables that are unnecessary for the analysis, formatting certain fields, and subsetting the data to cover the period from 1996 (when all event types were first recorded) to 2011 (the most recent data in the file). Significant effort is spent cleaning the EVTYPE values to match the names in the most recent storm data event table. The EVTYPE values are further condensed by removing ambiguous qualifiers, which gives a clearer picture of the types of events causing the most damage.

The end result is a quantitative and visual comparison of the human and economic damage inflicted by severe weather event types: heat and tornadoes cause the greatest human impact, while floods and hurricanes cause the greatest economic impact.

Data Processing

First, the appropriate packages are loaded and the raw data is read into a tibble. The raw data format is shown for reference.

#load necessary packages

library(tidyverse)
library(lubridate)
library(stringdist)
library(ggplot2)

#download the data (if not already present) and read it into a tibble

url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"

#set the working directory (machine-specific path; adjust before running)
setwd("C:/Users/207014104/Desktop/DataScience/Reproduceable Research/Project2")

filename <- "stormData.csv.bz2"

if (!file.exists(filename)){
        download.file(url, filename)
}

stormData <- as_tibble(read.csv(filename, stringsAsFactors = FALSE))

print(stormData)
## # A tibble: 902,297 x 37
##    STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE EVTYPE
##      <dbl> <chr>    <chr>    <chr>      <dbl> <chr>      <chr> <chr> 
##  1       1 4/18/19~ 0130     CST           97 MOBILE     AL    TORNA~
##  2       1 4/18/19~ 0145     CST            3 BALDWIN    AL    TORNA~
##  3       1 2/20/19~ 1600     CST           57 FAYETTE    AL    TORNA~
##  4       1 6/8/195~ 0900     CST           89 MADISON    AL    TORNA~
##  5       1 11/15/1~ 1500     CST           43 CULLMAN    AL    TORNA~
##  6       1 11/15/1~ 2000     CST           77 LAUDERDALE AL    TORNA~
##  7       1 11/16/1~ 0100     CST            9 BLOUNT     AL    TORNA~
##  8       1 1/22/19~ 0900     CST          123 TALLAPOOSA AL    TORNA~
##  9       1 2/13/19~ 2000     CST          125 TUSCALOOSA AL    TORNA~
## 10       1 2/13/19~ 2000     CST           57 FAYETTE    AL    TORNA~
## # ... with 902,287 more rows, and 29 more variables: BGN_RANGE <dbl>,
## #   BGN_AZI <chr>, BGN_LOCATI <chr>, END_DATE <chr>, END_TIME <chr>,
## #   COUNTY_END <dbl>, COUNTYENDN <lgl>, END_RANGE <dbl>, END_AZI <chr>,
## #   END_LOCATI <chr>, LENGTH <dbl>, WIDTH <dbl>, F <int>, MAG <dbl>,
## #   FATALITIES <dbl>, INJURIES <dbl>, PROPDMG <dbl>, PROPDMGEXP <chr>,
## #   CROPDMG <dbl>, CROPDMGEXP <chr>, WFO <chr>, STATEOFFIC <chr>,
## #   ZONENAMES <chr>, LATITUDE <dbl>, LONGITUDE <dbl>, LATITUDE_E <dbl>,
## #   LONGITUDE_ <dbl>, REMARKS <chr>, REFNUM <dbl>

Then, significant effort is placed on cleaning the data so that records are matched in an efficient and logical manner.

#select only the variables relevant to determining the greatest economic consequences and harm to health
storm <- select(stormData, BGN_DATE, STATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP,CROPDMG, CROPDMGEXP)

#format dates as a date class
storm$BGN_DATE <- as.Date(storm$BGN_DATE, "%m/%d/%Y")

#remove events recorded in years when not all weather types were recorded (prior to 1996)
storm <- storm[year(storm$BGN_DATE) >= 1996,]

#remove the summary rows (they were all in OK and had no data)
storm <- storm[!str_detect(storm$EVTYPE, "^Summary"),]

##CLEANING UP EVTYPE

#remove errors in entry of EVTYPE, such as abbreviations and extra info
storm$EVTYPE <- str_to_upper(storm$EVTYPE) #make all upper case
storm$EVTYPE <- str_replace_all(storm$EVTYPE, "TSTM", "THUNDERSTORM")
storm$EVTYPE <- str_replace_all(storm$EVTYPE, "FLD", "FLOOD")
storm$EVTYPE <- str_replace_all(storm$EVTYPE, "URBAN/SML STREAM", "")
storm$EVTYPE <- str_replace_all(storm$EVTYPE, "/FOREST", "")
storm$EVTYPE <- str_replace_all(storm$EVTYPE, "/MIX", "")
storm$EVTYPE <- str_replace_all(storm$EVTYPE, "THUNDERSTORM WIND/", "")

#remove ambiguous qualifiers that result in separation of similar events
storm$EVTYPE <- str_replace_all(storm$EVTYPE, "EXTREME", "")
storm$EVTYPE <- str_replace_all(storm$EVTYPE, "HIGH", "")
storm$EVTYPE <- str_replace_all(storm$EVTYPE, "DENSE", "")
storm$EVTYPE <- str_replace_all(storm$EVTYPE, "EXCESSIVE", "")
storm$EVTYPE <- str_replace_all(storm$EVTYPE, "STRONG", "")
storm$EVTYPE <- str_replace_all(storm$EVTYPE, "UNSEASONABLY", "")
storm$EVTYPE <- str_replace_all(storm$EVTYPE, "HEAVY", "")
storm$EVTYPE <- str_replace_all(storm$EVTYPE, "RECORD", "")
storm$EVTYPE <- str_replace_all(storm$EVTYPE, "WAVE", "")

#recode certain events to match more appropriate and general EVTYPES
storm$EVTYPE <- str_replace_all(storm$EVTYPE, "LANDSLIDE", "DEBRIS FLOW")
storm$EVTYPE <- str_replace_all(storm$EVTYPE, "STORM SURGE$", "STORM SURGE/TIDE")
storm$EVTYPE <- str_replace_all(storm$EVTYPE, "GLAZE", "ICE STORM")
storm$EVTYPE <- str_replace_all(storm$EVTYPE, "SNOW SQUALL", "WINTER STORM")
storm$EVTYPE <- str_replace_all(storm$EVTYPE, "WINTER WEATHER MIX", "WINTER WEATHER")
storm$EVTYPE <- str_replace_all(storm$EVTYPE, "WINTRY MIX", "WINTER WEATHER")
storm$EVTYPE <- str_replace_all(storm$EVTYPE, "COASTAL FLOODING/EROSION", "FLOOD")
storm$EVTYPE <- str_replace_all(storm$EVTYPE, "COASTAL FLOODING", "FLOOD")
storm$EVTYPE <- str_replace_all(storm$EVTYPE, "RIVER FLOOD", "FLOOD")
storm$EVTYPE <- str_replace_all(storm$EVTYPE, "FLASH FLOOD", "FLOOD")
storm$EVTYPE <- str_replace_all(storm$EVTYPE, "HURRICANE$", "HURRICANE/TYPHOON")

#remove extra spaces from both sides of the character string
storm$EVTYPE <- str_trim(storm$EVTYPE, side = "both")

#adjust certain ending or beginning words that inappropriately separate groups
storm$EVTYPE <- str_replace_all(storm$EVTYPE, "^COLD$", "COLD/WIND CHILL")
storm$EVTYPE <- str_replace_all(storm$EVTYPE, "^TYPHOON", "HURRICANE/TYPHOON")
storm$EVTYPE <- str_replace_all(storm$EVTYPE, "FREEZE$", "FROST/FREEZE")
storm$EVTYPE <- str_replace_all(storm$EVTYPE, "FROST/FROST/FREEZE", "FROST/FREEZE")
storm$EVTYPE <- str_replace_all(storm$EVTYPE, "FLOODING$", "FLOOD")
storm$EVTYPE <- str_replace_all(storm$EVTYPE, "FLOOD$", "FLOOD")


#load all of the 48 official EVTYPE codes 
EVcodes <- c("Astronomical Low Tide", "Avalanche", "Blizzard", "Coastal Flood", "Cold/Wind Chill",
             "Debris Flow", "Dense Fog", "Dense Smoke", "Drought", "Dust Devil", "Dust Storm", 
             "Excessive Heat", "Extreme Cold/Wind Chill", "Flash Flood", "Flood", "Frost/Freeze", 
             "Funnel Cloud", "Freezing Fog", "Hail", "Heat", "Heavy Rain", "Heavy Snow", "High Surf",
             "High Wind", "Hurricane (Typhoon)", "Ice Storm", "Lake-Effect Snow", "Lakeshore Flood", 
             "Lightning", "Marine Hail", "Marine High Wind", "Marine Strong Wind", 
             "Marine Thunderstorm Wind", "Rip Current", "Seiche", "Sleet", "Storm Surge/Tide", 
             "Strong Wind", "Thunderstorm Wind", "Tornado", "Tropical Depression", "Tropical Storm",
             "Tsunami", "Volcanic Ash", "Waterspout", "Wildfire", "Winter Storm", "Winter Weather") 

#remove ambiguous qualifiers from the official EVTYPES to group like events
EVcodes <- str_to_upper(EVcodes)
EVcodes <- str_replace_all(EVcodes, "EXTREME", "")
EVcodes <- str_replace_all(EVcodes, "HIGH", "")
EVcodes <- str_replace_all(EVcodes, "DENSE", "")
EVcodes <- str_replace_all(EVcodes, "EXCESSIVE", "")
EVcodes <- str_replace_all(EVcodes, "STRONG", "")
EVcodes <- str_replace_all(EVcodes, "HEAVY", "")
EVcodes <- str_replace_all(EVcodes, " \\(TYPHOON\\)", "/TYPHOON")
EVcodes <- str_trim(EVcodes, side = "both")

#create a separate field that matches the processed EVTYPES to the official EV codes, including those that almost match
storm <- mutate(storm, EVTYPE1 = EVcodes[amatch(storm$EVTYPE, EVcodes, maxDist = 1)])

#fraction of records without a matched EVTYPE
unassigned <- mean(is.na(storm$EVTYPE1))

Out of the 653,458 records included in the analysis, about 0.42 percent do not have a matched event name. Care was also taken to ensure that those few unmatched events were not among the highest-impact events: event names whose errors affected just a small number of records but carried a large impact were recoded.
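The clean-then-match approach can be sketched on a handful of labels. This is only a minimal illustration, not the full pipeline: the `raw` labels, the shortened `rules` vector, and the four-entry `codes` table are made up for the example, and the substitutions are passed to `str_replace_all()` as a single named vector rather than one call per pattern.

```r
library(stringr)
library(stringdist)

# Made-up raw labels (assumption: illustrative only, not drawn from the data)
raw <- c("Tstm Wind", "EXCESSIVE HEAT", "Heat Wave", "TORNADOS", "VOLCANIC ASHFALL")

# A few of the substitutions from the cleaning step, applied in order
rules <- c("TSTM" = "THUNDERSTORM", "EXCESSIVE" = "", "WAVE" = "")
cleaned <- str_trim(str_replace_all(str_to_upper(raw), rules))

# Shortened stand-in for the official event table (assumption)
codes <- c("THUNDERSTORM WIND", "HEAT", "TORNADO", "VOLCANIC ASH")

# amatch() returns the index of the closest code within maxDist edits, or NA
matched <- codes[amatch(cleaned, codes, maxDist = 1)]

# Compare the raw, cleaned, and matched labels side by side
print(data.frame(raw, cleaned, matched))

# Share of labels left without a match
mean(is.na(matched))
```

In this sketch "TORNADOS" is one edit away from "TORNADO" and is therefore matched, while "VOLCANIC ASHFALL" is too far from any code and stays `NA`, which is the kind of record counted as unassigned.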

The final cleaning step is to calculate the dollar ($) value of the economic impact of events where multipliers are located in separate columns.

# replace multipliers in PROPDMGEXP with numeric values
storm$PROPDMGEXP <- str_replace_all(storm$PROPDMGEXP, "K", "1000")
storm$PROPDMGEXP <- str_replace_all(storm$PROPDMGEXP, "M", "1000000")
storm$PROPDMGEXP <- str_replace_all(storm$PROPDMGEXP, "B", "1000000000")
storm$PROPDMGEXP <- as.integer(storm$PROPDMGEXP)
storm$PROPDMGEXP[is.na(storm$PROPDMGEXP)] <- 1

# replace multipliers in CROPDMGEXP with numeric values
storm$CROPDMGEXP <- str_replace_all(storm$CROPDMGEXP, "K", "1000")
storm$CROPDMGEXP <- str_replace_all(storm$CROPDMGEXP, "M", "1000000")
storm$CROPDMGEXP <- str_replace_all(storm$CROPDMGEXP, "B", "1000000000")
storm$CROPDMGEXP <- as.integer(storm$CROPDMGEXP)
storm$CROPDMGEXP[is.na(storm$CROPDMGEXP)] <- 1

# calculate the property damage and crop damage for each event
storm <- mutate(storm, PropertyDamage = PROPDMG * PROPDMGEXP) %>%
        mutate(CropDamage = CROPDMG * CROPDMGEXP) %>%
        mutate(TotalDamage = PropertyDamage + CropDamage) %>%
        select(BGN_DATE, EVTYPE, EVTYPE1, FATALITIES, INJURIES, PropertyDamage, CropDamage, TotalDamage, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)
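The multiplier conversion can also be expressed as a lookup table. The sketch below is an alternative formulation on toy data, not the code used to produce the results; the `multipliers` vector and the `toy` records are hypothetical names and values introduced for illustration.

```r
# Hypothetical lookup table for the exponent codes handled above
multipliers <- c(K = 1e3, M = 1e6, B = 1e9)

# Toy records (made up for illustration)
toy <- data.frame(PROPDMG = c(25, 2.5, 1, 10),
                  PROPDMGEXP = c("K", "M", "B", ""),
                  stringsAsFactors = FALSE)

# Look up each code; anything not in the table (including "") yields NA
mult <- unname(multipliers[toy$PROPDMGEXP])

# Treat missing or unrecognized codes as a multiplier of 1
mult[is.na(mult)] <- 1

# Dollar value of the damage for each record
toy$PropertyDamage <- toy$PROPDMG * mult
print(toy)
```

One design difference worth noting: here every unrecognized code falls back to a multiplier of 1, whereas in the string-replacement version a code such as "0" is parsed as the integer 0 and therefore zeroes out the damage for that record.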

Results

First, the weather events with the greatest impact on human health are examined. Events are ordered by total fatalities, with the number of injuries shown alongside.

#remove measurements with no matching event type after all cleaning efforts
stormCat <- storm[!is.na(storm$EVTYPE1),]

#create a table that summarizes total deaths and injuries for each event type, sorted by deaths from most to least
HarmSum <- group_by(stormCat, EVTYPE1) %>%
        summarise(totalDeaths = sum(FATALITIES), 
                  totalInjuries = sum(INJURIES)) %>%
        arrange(desc(totalDeaths))

#print the top 15 events in terms of fatalities, showing associated injuries
print(HarmSum[1:15,])
## # A tibble: 15 x 3
##    EVTYPE1           totalDeaths totalInjuries
##    <chr>                   <dbl>         <dbl>
##  1 HEAT                     2036          7683
##  2 TORNADO                  1511         20667
##  3 FLOOD                    1334          8517
##  4 LIGHTNING                 651          4141
##  5 RIP CURRENT               542           503
##  6 THUNDERSTORM WIND         371          5029
##  7 WIND                      364          1466
##  8 COLD/WIND CHILL           353           127
##  9 AVALANCHE                 223           156
## 10 WINTER STORM              194          1327
## 11 HURRICANE/TYPHOON         125          1326
## 12 SNOW                      109           712
## 13 SURF                       96           190
## 14 RAIN                       94           230
## 15 WILDFIRE                   87          1456

The three event types with the greatest impact on human health are heat, tornadoes, and floods, which lead in both injuries and fatalities. Heat is the #1 cause of death. It should be noted that this includes both “excessive heat” and “heat” from the official NOAA Event Table; the decision to combine them is due to the ambiguous difference in naming. If a heat event is classified as a severe event and results in deaths, it certainly could be considered excessive, whether explicit in name or not.

Chart 1 identifies both injuries and fatalities by event type on the x and y axes.

chart1 <- ggplot(data = HarmSum, 
                 aes(x=totalInjuries, y=totalDeaths, label=EVTYPE1)) + 
        geom_point() + 
        geom_text(data=subset(HarmSum,
                              totalInjuries > 2500 | totalDeaths > 250),
                  aes(label=EVTYPE1), 
                  hjust = -.1, 
                  vjust = -.5, 
                  check_overlap = TRUE,
                  size = 3) +
        coord_cartesian(xlim = c(0,25000)) +
        labs(x = "Total Human Injuries", 
             y = "Total Human Deaths", 
             title = "Chart 1: Greatest human harm by event type ('96-'11)") +
        theme_minimal()

plot(chart1)

Next, total economic damage is analyzed by summarizing the cost of property damage and crop damage for each event type. Because both are in dollar ($) terms, they can be added together to calculate a total damage cost. A summary table of the event types causing more than $1 billion in total damage is shown below.

#create a summary table with total damage, and its two components
TotalSum <- group_by(stormCat, EVTYPE1) %>%
        summarise(TotalDamage = sum(TotalDamage), 
                  totalCropDamage = sum(CropDamage), 
                  totalPropDamage = sum(PropertyDamage)) %>%
        arrange(desc(TotalDamage)) %>%
        filter(TotalDamage > 1000000000)

print(TotalSum)
## # A tibble: 15 x 4
##    EVTYPE1            TotalDamage totalCropDamage totalPropDamage
##    <chr>                    <dbl>           <dbl>           <dbl>
##  1 FLOOD             165823736310      6348063200    159475673110
##  2 HURRICANE/TYPHOON  87068996810      5350107800     81718889010
##  3 STORM SURGE/TIDE   47835579000          855000     47834724000
##  4 TORNADO            24900370720       283425010     24616945710
##  5 HAIL               17180204620      2540725700     14639478920
##  6 DROUGHT            14413667000     13367566000      1046101000
##  7 THUNDERSTORM WIND   8821057230       952246350      7868810880
##  8 TROPICAL STORM      8320186550       677711000      7642475550
##  9 WILDFIRE            8162704630       402255130      7760449500
## 10 WIND                6126458900       698814800      5427644100
## 11 ICE STORM           3658058810        15660000      3642398810
## 12 WINTER STORM        1544787250        11944000      1532843250
## 13 COLD/WIND CHILL     1365617900      1334665500        30952400
## 14 RAIN                1313584240       728419800       585164440
## 15 FROST/FREEZE        1261591000      1250911000        10680000

#gather the data into a tidy table for plotting
tidyDamage <- select(TotalSum, EVTYPE1, totalCropDamage, totalPropDamage) %>%
        gather(damageType, cost, 2:3)

Floods are clearly the costliest severe weather events. Note that several types of floods have been combined: certain entries, such as “river flood”, are not official event types, and once they are added to general flooding it is only logical to also fold “coastal” and “flash” flooding into the same general flood category. While drought causes the most crop damage, its total cost is much lower than that of floods, hurricanes/typhoons, storm surge, tornadoes, and hail.
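The wide-to-long reshaping that prepares the damage table for plotting can be sketched on toy data. The example below uses `pivot_longer()`, the newer tidyr interface that supersedes `gather()`, as an alternative formulation; the `toy` table and its small round numbers are made up for illustration and are not the values from the summary table.

```r
library(tidyr)

# Toy wide table: one row per event type, one column per damage component
toy <- data.frame(EVTYPE1 = c("FLOOD", "DROUGHT"),
                  totalCropDamage = c(6, 13),
                  totalPropDamage = c(159, 1))

# Stack the two damage columns into key/value pairs
long <- pivot_longer(toy,
                     cols = c(totalCropDamage, totalPropDamage),
                     names_to = "damageType",
                     values_to = "cost")
print(long)
```

The result has one row per event type and damage component, which is the shape a stacked bar chart with `fill = damageType` expects. Row order differs from `gather()`, but that does not affect the plot.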

Chart 2 highlights the total economic damage of the top severe weather event types, with color identifying the proportion of damage resulting from property or crop impacts.

chart2 <- ggplot(tidyDamage, 
                 aes(x=reorder(EVTYPE1, cost), 
                     y=cost/1000000000, 
                     fill = damageType)) +
        geom_col(color = "black") +
        coord_flip() + 
        labs(y = "Total Damage ($B)", 
             x = "Type of Event", 
             title = "Chart 2: Greatest economic damage by event type ('96-'11)") +
        scale_fill_discrete(name = "Type of damage", 
                            labels = c("Crop loss", "Property loss"), 
                            direction = -1, l = 75) +
        theme_minimal()

plot(chart2)

Closing

This analysis provides a starting point for understanding the impact of severe weather in both human and economic terms. It must be noted that many discretionary decisions were made in cleaning and recoding the event type data to improve the clarity of the analysis and results; alternate interpretations of the data could lead to different findings. With that said, this analysis provides evidence that heat and tornadoes are the leading causes of human harm, and that floods and hurricanes/typhoons cause the greatest economic impact.