Synopsis

We analyzed the NOAA storm database to answer two questions: which type of weather event is most harmful with respect to population health and which events have the greatest economic impact? For the first question we assigned weather event categories to the data and calculated the sums of the number of fatalities (FATALITIES) and injuries (INJURIES) by event type (EVTYPE). Categories were used instead of the straight EVTYPE because of duplicate or similar entries. For example there were 213 events types with “wind” in their name. In this analysis Tornados, Wind and Heat were the three most harmful weather events, assuming the sum of injuries and deaths was the best metric for harmfulness. A similar analysis was performed looking at the property damage (PROPDMG) and agricultural damage (CROPDMG). The most expensive weather events are Floods, Hurricanes and Tornados.

Data Processing

We began by downloading the file from the course webpage. The file path to the data was saved to dlDest.

library(tidyverse)
## -- Attaching packages ---------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.1.0     v purrr   0.2.5
## v tibble  1.4.2     v dplyr   0.7.8
## v tidyr   0.8.2     v stringr 1.3.1
## v readr   1.3.1     v forcats 0.3.0
## -- Conflicts ------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(knitr)

dl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
dlDest <- "./data/StormData.csv.bz2"

if(!dir.exists("./data")){
    dir.create("./data")
}

if(!file.exists(dlDest)){
    download.file(url = dl,
                  destfile = dlDest)
}

The first line was read in, split up and turned into a character vector of column names. The data was then loaded into the variable storm using readr::read_csv with column types set based on the types of data found therein.

cnames <- read_lines(dlDest,
                     n_max = 1)
cnames <- unlist(strsplit(cnames, ",", fixed = TRUE))
cnames <- gsub("\"", "", cnames)

storm <- read_csv(dlDest,
                  col_types = "ncccncffcfccccccfcdncnnndfdfffcnnnncn")
names(storm) <- make.names(cnames)

Analyzing the health impacts of EVTYPEs

To examine how harmful each event type (EVTYPE) was, we calculated total fatalities (Event.F), injuries (Event.I) and both (Event.T), grouped by event type. Event.r.F, Event.r.I and Event.r.T give the rate of death and/or injury per event.

EventSummary <- storm %>%
    select(EVTYPE, FATALITIES, INJURIES) %>%
    group_by(EVTYPE) %>%
    summarize(Event.Count = n(),
              Event.F   = sum(FATALITIES),
              Event.I   = sum(INJURIES),
              Event.T   = Event.I + Event.F,
              Event.r.F = format(Event.F / Event.Count, digits = 3),
              Event.r.I = format(Event.I / Event.Count, digits = 3),
              Event.r.T = format(Event.T / Event.Count, digits = 3)) %>%
    arrange(desc(Event.T))
kable(head(EventSummary[,1:5], 10),
      caption = "Table of 10 Most Harmful Weather Events, by EVTYPE")
Table of 10 Most Harmful Weather Events, by EVTYPE
EVTYPE Event.Count Event.F Event.I Event.T
TORNADO 60652 5633 91346 96979
EXCESSIVE HEAT 1678 1903 6525 8428
TSTM WIND 219944 504 6957 7461
FLOOD 25326 470 6789 7259
LIGHTNING 15755 816 5230 6046
HEAT 767 937 2100 3037
FLASH FLOOD 54278 978 1777 2755
ICE STORM 2006 89 1975 2064
THUNDERSTORM WIND 82563 133 1488 1621
WINTER STORM 11433 206 1321 1527

Looking at the top ten list, sorted by total fatalities and injuries (Event.T), there are a few interesting features. First off, those data indicated that tornadoes were clearly responsible for the most deaths and injuries (being approximately 12 times more harmful than the next entry). It also suggests that human error or inconsistent (and poor) data entry practices resulted in similar EVTYPE factors (e.g. “FLOOD” and “FLASH FLOOD”) and for there to be duplicates (“TSTM WIND” and “THUNDERSTORM WIND”). We had to take these issue into consideration.

Grouping similar EVTYPEs

There are 977 factor levels in EVTYPE, but as noted above the data quality looked to be low, with duplicate entries and inconsistent naming. To improve this we utilized grepl() to append columns of logical values that indicate whether a given search string has a hit for a grouping term.
See the appendix for the method used for choosing the search terms. The following nine categories of event were determined to cover the majority of the events: WIND, TORNADO, HEAT, FLOOD, HURRICANE, LIGHTNING, CURRENTS, FIRE and SNOW. grepl() returns a TRUE if the string is found in the EVTYPE, which allowed us to use that column as a filter. SNOW includes events that with names containing ice, snow, blizzard or winter. All of the values not caught by one of the search terms (i.e. was FALSE for WIND:SNOW) was loaded into subEV.

EVSum2 <- EventSummary %>%
    mutate(WIND      = grepl("WIND", EVTYPE),
           TORNADO   = grepl("TORNADO", EVTYPE),
           HEAT      = grepl("HEAT", EVTYPE),
           FLOOD     = grepl("FLOOD", EVTYPE),
           HURRICANE = grepl("HURRICANE|TYPHOON|TROPICAL|CYCLONE", EVTYPE),
           LIGHTNING = grepl("LIGHTNING", EVTYPE),
           CURRENTS  = grepl("RIP CURRENT", EVTYPE),
           FIRE      = grepl("FIRE", EVTYPE),
           SNOW      = grepl("ICE|SNOW|BLIZZARD|WINTER|COLD", EVTYPE))
subEV <- EVSum2[!(EVSum2$WIND|
                  EVSum2$HEAT|
                  EVSum2$TORNADO|
                  EVSum2$FLOOD|
                  EVSum2$HURRICANE|
                  EVSum2$LIGHTNING|
                  EVSum2$CURRENTS|
                  EVSum2$FIRE|
                  EVSum2$SNOW),] %>%
    arrange(desc(Event.F))

subEV %>%
    select(EVTYPE:Event.T) %>%
    head(10) %>%
    kable(caption = "Table of Top 10 Ungrouped Weather Events")
Table of Top 10 Ungrouped Weather Events
EVTYPE Event.Count Event.F Event.I Event.T
AVALANCHE 386 224 170 394
HIGH SURF 725 101 152 253
HEAVY RAIN 11723 98 251 349
FOG 538 62 734 796
HEAVY SURF/HIGH SURF 228 42 48 90
LANDSLIDE 600 38 52 90
TSUNAMI 20 33 129 162
UNSEASONABLY WARM AND DRY 13 29 0 29
URBAN/SML STREAM FLD 3392 28 79 107
DUST STORM 427 22 440 462

The ungrouped categories covered only 5.78% of the total number of fatalities and 3.49% of all fatalities and injuries. This is low enough that we can ignore those events for now, although it would be simple to include more categories if deemed necessary.

Checking for overlaps

Because it was possible for more than one of the category grepl() to be true (e.g. an event of “SNOW/WIND/FLOODING” would be true for each of SNOW, WIND and FLOOD) it was important to establish the potential impact from choices made about how to handle these special cases. OVERLAP counts the number of categories. By summing the combined fatalities and injuries over the subset of EVTYPES with more than one category we can see how many cases are effected.

EVSum2 <- EVSum2 %>%
    mutate(OVERLAP = WIND +
                     TORNADO +
                     HEAT +
                     FLOOD +
                     HURRICANE +
                     LIGHTNING +
                     CURRENTS +
                     FIRE +
                     SNOW)
sum(EVSum2[EVSum2$OVERLAP > 1,]$Event.T)
## [1] 354

With only 354 total injuries or death impacted by the multiple category designations we won’t worry about it too much. The method used below for assigning the category imposes an ordering/hierachy and prevents double counting, but other methods (like splitting values) might make sense as well and aren’t examined in this analysis.

Recalculate Fatalities and Injuries by Category

Using a series of nested ifelse() the following categories were assigned, with the given order:
HURRICANE > SNOW > FLOOD > TORNADO > LIGHTNING > WIND > FIRE > HEAT > OTHER
This ordering was chosen because WIND was commonly one of the two or three overlapping categories, and in cases where it was WIND + TORNADO or WIND + HURRICANE it made the most sense to group it with the primary event (the tornado or hurricane, in those cases). HURRICANE, SNOW, FLOOD, TORNADO and LIGHTNING are primary descriptors with minimal internal overlap, while FIRE and HEAT have little to no overlap with other categories. Anything not assigned a category was called OTHER and contains the same events as subEV.

CatSum <- EVSum2 %>%
    mutate(CATEGORY = ifelse(HURRICANE,
                          "HURRICANE",
                          ifelse(SNOW,
                          "SNOW",
                          ifelse(FLOOD,
                          "FLOOD",
                          ifelse(TORNADO,
                          "TORNADO",
                          ifelse(LIGHTNING,
                          "LIGHTNING",
                          ifelse(CURRENTS,
                          "RIP CURRENTS",
                          ifelse(WIND,
                          "WIND",
                          ifelse(FIRE,
                          "FIRE",
                          ifelse(HEAT,
                          "HEAT",
                          "OTHER")))))))))) %>%
    select(-(WIND:SNOW)) %>%
    group_by(CATEGORY) %>%
    summarize(`Count` = sum(Event.Count),
              `Fatalities`    = sum(Event.F),
              `Injuries`      = sum(Event.I),
              `Total`         = `Injuries` + `Fatalities`,
              `Fatality Rate` = format(`Fatalities` / `Count`, digits = 3),
              `Injury Rate`   = format(`Injuries` / `Count`, digits = 3),
              `Total Rate`    = format(`Total` / `Count`, digits = 3)) %>%
    arrange(desc(`Total`))

The result of these calculations can be found in the Results section.

Economic Impact

Economic impact was calculated in a similar manner, albeit with some modifications. Because of the range of values on the economic impact property and crop damage were each encoded into two columns, one for the coefficient (PROPDMG or CROPDMG) and one for the scale/exponent indicator (PROPDMGEXP or CROPDMGEXP). For example, a damage may be written as 5.2 M, representing 5.2 million USD or $5,200,000. Unfortunately PROPDMGEXP encodes a weird range of characters. As a result of the US Government partial shutdown it is not possible to access the NOAA website for supplementary data and tables, but we were able to find a reference to someone else who has analyzed the same thing here: https://rpubs.com/flyingdisc/PROPDMGEXP. Those levels were used to replace the factor values with the correct multiplier.

EconSummary <- storm %>%
    select(EVTYPE, PROPDMG:CROPDMGEXP)

levels(EconSummary$PROPDMGEXP) <- c(1000, 1000000, 1000000000, 1000000, 1, 0,
                                    10, 10, 0, 10, 10, 10, 100, 10, 100, 0, 10,
                                    10)
levels(EconSummary$CROPDMGEXP) <- c(1000000, 1000, 1000000, 1000000000, 0, 0,
                                    1000, 10)
EconSummary <- EconSummary %>%
    mutate(PropertyDamage = PROPDMG * as.numeric(as.character(PROPDMGEXP)),
           CropDamage     = CROPDMG * as.numeric(as.character(CROPDMGEXP)),
           TotalDamage    = ifelse(is.na(PropertyDamage), 0, PropertyDamage) +
                            ifelse(is.na(CropDamage), 0, CropDamage))

A similar grouping process was run, although a different set of terms were used. Hurricanes, typhoons, cyclones are the same thing, and tropical storms (and depressions) represent lower category versions of the same type of storm, so those were grouped together. Freeze and cold were added to the SNOW search term.

EconSum2 <- EconSummary %>%
    mutate(WIND      = grepl("WIND", EVTYPE),
           TORNADO   = grepl("TORNADO", EVTYPE),
           FLOOD     = grepl("FLOOD", EVTYPE),
           HURRICANE = grepl("HURRICANE|TYPHOON|TROPICAL|CYCLONE", EVTYPE),
           FIRE      = grepl("FIRE", EVTYPE),
           DROUGHT   = grepl("DROUGHT", EVTYPE),
           HAIL      = grepl("HAIL", EVTYPE),
           SURGE     = grepl("SURGE", EVTYPE),
           SNOW      = grepl("ICE|SNOW|BLIZZARD|WINTER|FREEZE|COLD",EVTYPE))

subEC <- EconSum2[!(EconSum2$WIND|
                    EconSum2$TORNADO|
                    EconSum2$FLOOD|
                    EconSum2$HURRICANE|
                    EconSum2$FIRE|
                    EconSum2$DROUGHT|
                    EconSum2$HAIL|
                    EconSum2$SURGE|
                    EconSum2$SNOW),] %>% arrange(desc(TotalDamage))

subEC %>%
    select(EVTYPE, PropertyDamage:TotalDamage ) %>%
    head(10) %>%
    kable(caption = "Table of Top 10 Ungrouped Weather Events")
Table of Top 10 Ungrouped Weather Events
EVTYPE PropertyDamage CropDamage TotalDamage
HEAVY RAIN/SEVERE WEATHER 2.500e+09 NA 2500000000
SEVERE THUNDERSTORM 1.200e+09 NA 1200000000
EXCESSIVE HEAT 1.700e+05 492400000 492570000
HEAT NA 400000000 400000000
HEAVY RAIN NA 200000000 200000000
HEAVY RAIN NA 190000000 190000000
EXCESSIVE WETNESS NA 142000000 142000000
River Flooding 7.874e+07 26840000 105580000
TSUNAMI 8.100e+07 20000 81020000
HEAVY RAIN NA 73600000 73600000

Like subEV before, subEC contained the ungrouped levels of EVTYPE. Those ungrouped EVTYPE factors only covered 1.76% of the total damage and were not worth the time to pull apart any further.

The dataset was then assigned CATEGORY values using a similar ifelse() structure, and the damage amounts were then subtotaled for each entry.

CatEconSum <- EconSum2 %>%
    mutate(CATEGORY = ifelse(HURRICANE,
                             "HURRICANE",
                             ifelse(SURGE,
                             "SURGE",
                             ifelse(SNOW,
                             "SNOW",
                             ifelse(FLOOD,
                             "FLOOD",
                             ifelse(HAIL,
                             "HAIL",
                             ifelse(TORNADO,
                             "TORNADO",
                             ifelse(WIND,
                             "WIND",
                             ifelse(DROUGHT,
                             "DROUGHT",
                             ifelse(FIRE,
                             "FIRE",
                             "OTHER")))))))))) %>%
    select(-(WIND:SNOW)) %>%
    group_by(CATEGORY) %>%
    summarize(`Property Damage` = sum(PropertyDamage, na.rm = TRUE),
              `Crop Damage`     = sum(CropDamage, na.rm = TRUE),
              `Total Damage`    = `Property Damage` + `Crop Damage`) %>%
    arrange(desc(`Total Damage`))

Results

Most Harmful Weather Events

Answering the question of which weather event is most harmful first requires that we choose the criteria for “most harmful.” If want to look at fatalities primarily we would find that Tornados, Heat and Floods are the three most deadly types of events.

kable(arrange(CatSum, desc(CatSum$Fatalities)),
      padding = 2,
      caption = "Most harmful weather events, ranked by fatalities")
Most harmful weather events, ranked by fatalities
CATEGORY Count Fatalities Injuries Total Fatality Rate Injury Rate Total Rate
TORNADO 60699 5661 91407 97068 0.0933 1.51 1.6
HEAT 2645 3138 9154 12292 1.19 3.46 4.65
FLOOD 82648 1523 8601 10124 0.0184 0.104 0.122
WIND 363224 1192 11397 12589 0.00328 0.0314 0.0347
SNOW 44462 1071 6331 7402 0.0241 0.142 0.166
OTHER 326773 875 4555 5430 0.00268 0.0139 0.0166
LIGHTNING 15776 817 5232 6049 0.0518 0.332 0.383
RIP CURRENTS 777 577 529 1106 0.743 0.681 1.42
HURRICANE 1054 201 1714 1915 0.191 1.63 1.82
FIRE 4239 90 1608 1698 0.0212 0.379 0.401

Another way we can think about how harmful a weather events is by looking at the number of dead and injured per event. The top three events that have the highest Total Rate are hurricanes, heat and tornado. While hurricanes and heat are much less common than tornados they are 3.2 and 2.9 times as likely to cause harm, compared against tornadoes.

kable(arrange(CatSum, desc(CatSum$`Total Rate`)),
      padding = 2,
      caption = "Most harmful weather events, ranked by total death and injury rate")
Most harmful weather events, ranked by total death and injury rate
CATEGORY Count Fatalities Injuries Total Fatality Rate Injury Rate Total Rate
HEAT 2645 3138 9154 12292 1.19 3.46 4.65
HURRICANE 1054 201 1714 1915 0.191 1.63 1.82
TORNADO 60699 5661 91407 97068 0.0933 1.51 1.6
RIP CURRENTS 777 577 529 1106 0.743 0.681 1.42
FIRE 4239 90 1608 1698 0.0212 0.379 0.401
LIGHTNING 15776 817 5232 6049 0.0518 0.332 0.383
SNOW 44462 1071 6331 7402 0.0241 0.142 0.166
FLOOD 82648 1523 8601 10124 0.0184 0.104 0.122
WIND 363224 1192 11397 12589 0.00328 0.0314 0.0347
OTHER 326773 875 4555 5430 0.00268 0.0139 0.0166

Finally, If we instead wanted to find the events with the highest total impact (fatalities and injuries combined) then we would get tornados, wind and heat as the most harmful.

kable(arrange(CatSum, desc(CatSum$Total)),
      padding = 2,
      caption = "Most harmful weather events, ranked by total death and injury")
Most harmful weather events, ranked by total death and injury
CATEGORY Count Fatalities Injuries Total Fatality Rate Injury Rate Total Rate
TORNADO 60699 5661 91407 97068 0.0933 1.51 1.6
WIND 363224 1192 11397 12589 0.00328 0.0314 0.0347
HEAT 2645 3138 9154 12292 1.19 3.46 4.65
FLOOD 82648 1523 8601 10124 0.0184 0.104 0.122
SNOW 44462 1071 6331 7402 0.0241 0.142 0.166
LIGHTNING 15776 817 5232 6049 0.0518 0.332 0.383
OTHER 326773 875 4555 5430 0.00268 0.0139 0.0166
HURRICANE 1054 201 1714 1915 0.191 1.63 1.82
FIRE 4239 90 1608 1698 0.0212 0.379 0.401
RIP CURRENTS 777 577 529 1106 0.743 0.681 1.42

This can also be seen graphically:

CatSum %>%
    select(CATEGORY,`Fatalities`, `Injuries`) %>%
    gather(key = `Casualty Type`, value = `Count`, -CATEGORY) %>%
    ggplot() +
    geom_col(mapping = aes(x = fct_reorder(CATEGORY, `Count`),
                           y = `Count`,
                           fill = `Casualty Type`)) +
    coord_flip() +
    labs(y = "Number of Casualties (Injured or Killed)",
         x = "",
         title = "Cost to Health of Extreme Weather Events",
         subtitle = "U.S. NOAA Storm Database (1950-2011)",
         fill = "Casualty Type") +
    theme_bw() +
    guides(fill=guide_legend(title = NULL)) +
    theme(legend.justification=c(1,0), legend.position=c(0.9,0.1))

Economic Damages

The same type of analysis was performed on the economic data, specifically the property damage and agricultural damage. Flooding was by far the most costly, almost twice the damage of hurricanes. Since the values were so large we scaled them back by \(10^9\), making the values representative of billions of dollars.

CatEconSum$`Property Damage` <- CatEconSum$`Property Damage`/1000000000
CatEconSum$`Crop Damage` <- CatEconSum$`Crop Damage`/1000000000
CatEconSum$`Total Damage` <- CatEconSum$`Total Damage`/1000000000
kable(CatEconSum,
      caption = "Summary of Economic Damages, by Category")
Summary of Economic Damages, by Category
CATEGORY Property Damage Crop Damage Total Damage
FLOOD 167.366829 12.3470691 179.71390
HURRICANE 93.072538 6.2110138 99.28355
TORNADO 56.993099 0.4149613 57.40806
SURGE 47.965224 0.0008550 47.96608
SNOW 12.699465 8.5572980 21.25676
HAIL 17.619992 3.1142128 20.73420
WIND 15.802228 1.9564790 17.75871
DROUGHT 1.046306 13.9726218 15.01893
FIRE 8.501629 0.4032816 8.90491
OTHER 6.251340 2.1263994 8.37774

A column chart can be made to show the how the totals relate and how much the property and crop components contribute to the total.

CatEconSum %>%
    select(CATEGORY:`Crop Damage`) %>%
    gather(key = `Damage Type`, value = `Amount`, -CATEGORY) %>%
    ggplot() +
    geom_col(mapping = aes(x = fct_reorder(CATEGORY, `Amount`),
                           y = `Amount`,
                           fill = `Damage Type`)) +
    coord_flip() +
    labs(y = "Cost of Damages (Billions $)",
         x = "",
         title = "Cost of Damages of Extreme Weather Events",
         subtitle = "U.S. NOAA Storm Database (1950-2011)",
         fill = "Damage Type") +
    theme_bw() +
    guides(fill=guide_legend(title = NULL)) +
    theme(legend.justification=c(1,0), legend.position=c(0.9,0.1))

Acknowledgments

The following page has a data dictionary was useful in determining the column types for the initial loading of the data.

http://rstudio-pubs-static.s3.amazonaws.com/23028_821a9591f7de4289a21c4ec8a41604c3.html

As stated above, the following website has a useful analysis of the meanings of the exponent values for the damages in this dataset.

https://rpubs.com/flyingdisc/PROPDMGEXP

Appendix

Selection criteria for grouping

Choosing the groups was done manually. From Table 1 it was clear that there were name-based issues, so terms like “heat”, “flood” and “wind” looked like they should catch multiple top 10 events. After using mutate to add the logical columns for the search terms, those terms were used to exclude “found” rows.

EventExample <- EventSummary %>%
    mutate(WIND      = grepl("WIND", EVTYPE),
           HEAT      = grepl("HEAT", EVTYPE),
           FLOOD     = grepl("FLOOD", EVTYPE))
subEvent <- EventExample[!(EventExample$WIND|
                           EventExample$HEAT|
                           EventExample$FLOOD),] %>%
    arrange(desc(Event.F))
kable(subEvent[1:10,1:8])
EVTYPE Event.Count Event.F Event.I Event.T Event.r.F Event.r.I Event.r.T
TORNADO 60652 5633 91346 96979 0.0929 1.51 1.6
LIGHTNING 15755 816 5230 6046 0.0518 0.332 0.384
RIP CURRENT 470 368 232 600 0.783 0.494 1.28
AVALANCHE 386 224 170 394 0.58 0.44 1.02
WINTER STORM 11433 206 1321 1527 0.018 0.116 0.134
RIP CURRENTS 304 204 297 501 0.671 0.977 1.65
EXTREME COLD 655 160 231 391 0.244 0.353 0.597
HEAVY SNOW 15708 127 1021 1148 0.00809 0.065 0.0731
BLIZZARD 2719 101 805 906 0.0371 0.296 0.333
HIGH SURF 725 101 152 253 0.139 0.21 0.349

This was continued until most outcomes were accounted for.