Reproducible Research, Course Project 2

Synopsis

We analyzed the NOAA storm database to answer two questions: which type of weather event is most harmful with respect to population health and which events have the greatest economic impact? For the first question we assigned weather event categories to the data and calculated the sums of the number of fatalities (FATALITIES) and injuries (INJURIES) by event type (EVTYPE). Categories were used instead of the straight EVTYPE because of duplicate or similar entries. For example there were 213 events types with “wind” in their name. In this analysis Tornados, Wind and Heat were the three most harmful weather events, assuming the sum of injuries and deaths was the best metric for harmfulness. A similar analysis was performed looking at the property damage (PROPDMG) and agricultural damage (CROPDMG). The most expensive weather events are Floods, Hurricanes and Tornados.

Data Processing

We began by downloading the file from the course webpage. The file path to the data was saved to dlDest.

library(tidyverse)

## -- Attaching packages ---------------------------------------------------------------- tidyverse 1.2.1 --

## v ggplot2 3.1.0     v purrr   0.2.5
## v tibble  1.4.2     v dplyr   0.7.8
## v tidyr   0.8.2     v stringr 1.3.1
## v readr   1.3.1     v forcats 0.3.0

## -- Conflicts ------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(knitr)

dl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
dlDest <- "./data/StormData.csv.bz2"

if(!dir.exists("./data")){
    dir.create("./data")
}

if(!file.exists(dlDest)){
    download.file(url = dl,
                  destfile = dlDest)
}

The first line was read in, split up and turned into a character vector of column names. The data was then loaded into the variable storm using readr::read_csv with column types set based on the types of data found therein.

cnames <- read_lines(dlDest,
                     n_max = 1)
cnames <- unlist(strsplit(cnames, ",", fixed = TRUE))
cnames <- gsub("\"", "", cnames)

storm <- read_csv(dlDest,
                  col_types = "ncccncffcfccccccfcdncnnndfdfffcnnnncn")
names(storm) <- make.names(cnames)

Analyzing the health impacts of EVTYPEs

To examine how harmful each event type (EVTYPE) was, we calculated total fatalities (Event.F), injuries (Event.I) and both (Event.T), grouped by event type. Event.r.F, Event.r.I and Event.r.T give the rate of death and/or injury per event.

EventSummary <- storm %>%
    select(EVTYPE, FATALITIES, INJURIES) %>%
    group_by(EVTYPE) %>%
    summarize(Event.Count = n(),
              Event.F   = sum(FATALITIES),
              Event.I   = sum(INJURIES),
              Event.T   = Event.I + Event.F,
              Event.r.F = format(Event.F / Event.Count, digits = 3),
              Event.r.I = format(Event.I / Event.Count, digits = 3),
              Event.r.T = format(Event.T / Event.Count, digits = 3)) %>%
    arrange(desc(Event.T))
kable(head(EventSummary[,1:5], 10),
      caption = "Table of 10 Most Harmful Weather Events, by EVTYPE")

Table of 10 Most Harmful Weather Events, by EVTYPE
EVTYPE	Event.Count	Event.F	Event.I	Event.T
TORNADO	60652	5633	91346	96979
EXCESSIVE HEAT	1678	1903	6525	8428
TSTM WIND	219944	504	6957	7461
FLOOD	25326	470	6789	7259
LIGHTNING	15755	816	5230	6046
HEAT	767	937	2100	3037
FLASH FLOOD	54278	978	1777	2755
ICE STORM	2006	89	1975	2064
THUNDERSTORM WIND	82563	133	1488	1621
WINTER STORM	11433	206	1321	1527

Looking at the top ten list, sorted by total fatalities and injuries (Event.T), there are a few interesting features. First off, those data indicated that tornadoes were clearly responsible for the most deaths and injuries (being approximately 12 times more harmful than the next entry). It also suggests that human error or inconsistent (and poor) data entry practices resulted in similar EVTYPE factors (e.g. “FLOOD” and “FLASH FLOOD”) and for there to be duplicates (“TSTM WIND” and “THUNDERSTORM WIND”). We had to take these issue into consideration.

Grouping similar EVTYPEs

There are 977 factor levels in EVTYPE, but as noted above the data quality looked to be low, with duplicate entries and inconsistent naming. To improve this we utilized grepl() to append columns of logical values that indicate whether a given search string has a hit for a grouping term.
See the appendix for the method used for choosing the search terms. The following nine categories of event were determined to cover the majority of the events: WIND, TORNADO, HEAT, FLOOD, HURRICANE, LIGHTNING, CURRENTS, FIRE and SNOW. grepl() returns a TRUE if the string is found in the EVTYPE, which allowed us to use that column as a filter. SNOW includes events that with names containing ice, snow, blizzard or winter. All of the values not caught by one of the search terms (i.e. was FALSE for WIND:SNOW) was loaded into subEV.

EVSum2 <- EventSummary %>%
    mutate(WIND      = grepl("WIND", EVTYPE),
           TORNADO   = grepl("TORNADO", EVTYPE),
           HEAT      = grepl("HEAT", EVTYPE),
           FLOOD     = grepl("FLOOD", EVTYPE),
           HURRICANE = grepl("HURRICANE|TYPHOON|TROPICAL|CYCLONE", EVTYPE),
           LIGHTNING = grepl("LIGHTNING", EVTYPE),
           CURRENTS  = grepl("RIP CURRENT", EVTYPE),
           FIRE      = grepl("FIRE", EVTYPE),
           SNOW      = grepl("ICE|SNOW|BLIZZARD|WINTER|COLD", EVTYPE))
subEV <- EVSum2[!(EVSum2$WIND|
                  EVSum2$HEAT|
                  EVSum2$TORNADO|
                  EVSum2$FLOOD|
                  EVSum2$HURRICANE|
                  EVSum2$LIGHTNING|
                  EVSum2$CURRENTS|
                  EVSum2$FIRE|
                  EVSum2$SNOW),] %>%
    arrange(desc(Event.F))

subEV %>%
    select(EVTYPE:Event.T) %>%
    head(10) %>%
    kable(caption = "Table of Top 10 Ungrouped Weather Events")

Table of Top 10 Ungrouped Weather Events
EVTYPE	Event.Count	Event.F	Event.I	Event.T
AVALANCHE	386	224	170	394
HIGH SURF	725	101	152	253
HEAVY RAIN	11723	98	251	349
FOG	538	62	734	796
HEAVY SURF/HIGH SURF	228	42	48	90
LANDSLIDE	600	38	52	90
TSUNAMI	20	33	129	162
UNSEASONABLY WARM AND DRY	13	29	0	29
URBAN/SML STREAM FLD	3392	28	79	107
DUST STORM	427	22	440	462

The ungrouped categories covered only 5.78% of the total number of fatalities and 3.49% of all fatalities and injuries. This is low enough that we can ignore those events for now, although it would be simple to include more categories if deemed necessary.

Checking for overlaps

Because it was possible for more than one of the category grepl() to be true (e.g. an event of “SNOW/WIND/FLOODING” would be true for each of SNOW, WIND and FLOOD) it was important to establish the potential impact from choices made about how to handle these special cases. OVERLAP counts the number of categories. By summing the combined fatalities and injuries over the subset of EVTYPES with more than one category we can see how many cases are effected.

EVSum2 <- EVSum2 %>%
    mutate(OVERLAP = WIND +
                     TORNADO +
                     HEAT +
                     FLOOD +
                     HURRICANE +
                     LIGHTNING +
                     CURRENTS +
                     FIRE +
                     SNOW)
sum(EVSum2[EVSum2$OVERLAP > 1,]$Event.T)

## [1] 354

With only 354 total injuries or death impacted by the multiple category designations we won’t worry about it too much. The method used below for assigning the category imposes an ordering/hierachy and prevents double counting, but other methods (like splitting values) might make sense as well and aren’t examined in this analysis.

Recalculate Fatalities and Injuries by Category

Using a series of nested ifelse() the following categories were assigned, with the given order:
HURRICANE > SNOW > FLOOD > TORNADO > LIGHTNING > WIND > FIRE > HEAT > OTHER
This ordering was chosen because WIND was commonly one of the two or three overlapping categories, and in cases where it was WIND + TORNADO or WIND + HURRICANE it made the most sense to group it with the primary event (the tornado or hurricane, in those cases). HURRICANE, SNOW, FLOOD, TORNADO and LIGHTNING are primary descriptors with minimal internal overlap, while FIRE and HEAT have little to no overlap with other categories. Anything not assigned a category was called OTHER and contains the same events as subEV.

CatSum <- EVSum2 %>%
    mutate(CATEGORY = ifelse(HURRICANE,
                          "HURRICANE",
                          ifelse(SNOW,
                          "SNOW",
                          ifelse(FLOOD,
                          "FLOOD",
                          ifelse(TORNADO,
                          "TORNADO",
                          ifelse(LIGHTNING,
                          "LIGHTNING",
                          ifelse(CURRENTS,
                          "RIP CURRENTS",
                          ifelse(WIND,
                          "WIND",
                          ifelse(FIRE,
                          "FIRE",
                          ifelse(HEAT,
                          "HEAT",
                          "OTHER")))))))))) %>%
    select(-(WIND:SNOW)) %>%
    group_by(CATEGORY) %>%
    summarize(`Count` = sum(Event.Count),
              `Fatalities`    = sum(Event.F),
              `Injuries`      = sum(Event.I),
              `Total`         = `Injuries` + `Fatalities`,
              `Fatality Rate` = format(`Fatalities` / `Count`, digits = 3),
              `Injury Rate`   = format(`Injuries` / `Count`, digits = 3),
              `Total Rate`    = format(`Total` / `Count`, digits = 3)) %>%
    arrange(desc(`Total`))

The result of these calculations can be found in the Results section.

Economic Impact

Economic impact was calculated in a similar manner, albeit with some modifications. Because of the range of values on the economic impact property and crop damage were each encoded into two columns, one for the coefficient (PROPDMG or CROPDMG) and one for the scale/exponent indicator (PROPDMGEXP or CROPDMGEXP). For example, a damage may be written as 5.2 M, representing 5.2 million USD or $5,200,000. Unfortunately PROPDMGEXP encodes a weird range of characters. As a result of the US Government partial shutdown it is not possible to access the NOAA website for supplementary data and tables, but we were able to find a reference to someone else who has analyzed the same thing here: https://rpubs.com/flyingdisc/PROPDMGEXP. Those levels were used to replace the factor values with the correct multiplier.

EconSummary <- storm %>%
    select(EVTYPE, PROPDMG:CROPDMGEXP)

levels(EconSummary$PROPDMGEXP) <- c(1000, 1000000, 1000000000, 1000000, 1, 0,
                                    10, 10, 0, 10, 10, 10, 100, 10, 100, 0, 10,
                                    10)
levels(EconSummary$CROPDMGEXP) <- c(1000000, 1000, 1000000, 1000000000, 0, 0,
                                    1000, 10)
EconSummary <- EconSummary %>%
    mutate(PropertyDamage = PROPDMG * as.numeric(as.character(PROPDMGEXP)),
           CropDamage     = CROPDMG * as.numeric(as.character(CROPDMGEXP)),
           TotalDamage    = ifelse(is.na(PropertyDamage), 0, PropertyDamage) +
                            ifelse(is.na(CropDamage), 0, CropDamage))

A similar grouping process was run, although a different set of terms were used. Hurricanes, typhoons, cyclones are the same thing, and tropical storms (and depressions) represent lower category versions of the same type of storm, so those were grouped together. Freeze and cold were added to the SNOW search term.

EconSum2 <- EconSummary %>%
    mutate(WIND      = grepl("WIND", EVTYPE),
           TORNADO   = grepl("TORNADO", EVTYPE),
           FLOOD     = grepl("FLOOD", EVTYPE),
           HURRICANE = grepl("HURRICANE|TYPHOON|TROPICAL|CYCLONE", EVTYPE),
           FIRE      = grepl("FIRE", EVTYPE),
           DROUGHT   = grepl("DROUGHT", EVTYPE),
           HAIL      = grepl("HAIL", EVTYPE),
           SURGE     = grepl("SURGE", EVTYPE),
           SNOW      = grepl("ICE|SNOW|BLIZZARD|WINTER|FREEZE|COLD",EVTYPE))

subEC <- EconSum2[!(EconSum2$WIND|
                    EconSum2$TORNADO|
                    EconSum2$FLOOD|
                    EconSum2$HURRICANE|
                    EconSum2$FIRE|
                    EconSum2$DROUGHT|
                    EconSum2$HAIL|
                    EconSum2$SURGE|
                    EconSum2$SNOW),] %>% arrange(desc(TotalDamage))

subEC %>%
    select(EVTYPE, PropertyDamage:TotalDamage ) %>%
    head(10) %>%
    kable(caption = "Table of Top 10 Ungrouped Weather Events")

Table of Top 10 Ungrouped Weather Events
EVTYPE	PropertyDamage	CropDamage	TotalDamage
HEAVY RAIN/SEVERE WEATHER	2.500e+09	NA	2500000000
SEVERE THUNDERSTORM	1.200e+09	NA	1200000000
EXCESSIVE HEAT	1.700e+05	492400000	492570000
HEAT	NA	400000000	400000000
HEAVY RAIN	NA	200000000	200000000
HEAVY RAIN	NA	190000000	190000000
EXCESSIVE WETNESS	NA	142000000	142000000
River Flooding	7.874e+07	26840000	105580000
TSUNAMI	8.100e+07	20000	81020000
HEAVY RAIN	NA	73600000	73600000

Like subEV before, subEC contained the ungrouped levels of EVTYPE. Those ungrouped EVTYPE factors only covered 1.76% of the total damage and were not worth the time to pull apart any further.

The dataset was then assigned CATEGORY values using a similar ifelse() structure, and the damage amounts were then subtotaled for each entry.

CatEconSum <- EconSum2 %>%
    mutate(CATEGORY = ifelse(HURRICANE,
                             "HURRICANE",
                             ifelse(SURGE,
                             "SURGE",
                             ifelse(SNOW,
                             "SNOW",
                             ifelse(FLOOD,
                             "FLOOD",
                             ifelse(HAIL,
                             "HAIL",
                             ifelse(TORNADO,
                             "TORNADO",
                             ifelse(WIND,
                             "WIND",
                             ifelse(DROUGHT,
                             "DROUGHT",
                             ifelse(FIRE,
                             "FIRE",
                             "OTHER")))))))))) %>%
    select(-(WIND:SNOW)) %>%
    group_by(CATEGORY) %>%
    summarize(`Property Damage` = sum(PropertyDamage, na.rm = TRUE),
              `Crop Damage`     = sum(CropDamage, na.rm = TRUE),
              `Total Damage`    = `Property Damage` + `Crop Damage`) %>%
    arrange(desc(`Total Damage`))

Results

Most Harmful Weather Events

Answering the question of which weather event is most harmful first requires that we choose the criteria for “most harmful.” If want to look at fatalities primarily we would find that Tornados, Heat and Floods are the three most deadly types of events.

kable(arrange(CatSum, desc(CatSum$Fatalities)),
      padding = 2,
      caption = "Most harmful weather events, ranked by fatalities")

Most harmful weather events, ranked by fatalities
CATEGORY	Count	Fatalities	Injuries	Total	Fatality Rate	Injury Rate	Total Rate
TORNADO	60699	5661	91407	97068	0.0933	1.51	1.6
HEAT	2645	3138	9154	12292	1.19	3.46	4.65
FLOOD	82648	1523	8601	10124	0.0184	0.104	0.122
WIND	363224	1192	11397	12589	0.00328	0.0314	0.0347
SNOW	44462	1071	6331	7402	0.0241	0.142	0.166
OTHER	326773	875	4555	5430	0.00268	0.0139	0.0166
LIGHTNING	15776	817	5232	6049	0.0518	0.332	0.383
RIP CURRENTS	777	577	529	1106	0.743	0.681	1.42
HURRICANE	1054	201	1714	1915	0.191	1.63	1.82
FIRE	4239	90	1608	1698	0.0212	0.379	0.401

Another way we can think about how harmful a weather events is by looking at the number of dead and injured per event. The top three events that have the highest Total Rate are hurricanes, heat and tornado. While hurricanes and heat are much less common than tornados they are 3.2 and 2.9 times as likely to cause harm, compared against tornadoes.

kable(arrange(CatSum, desc(CatSum$`Total Rate`)),
      padding = 2,
      caption = "Most harmful weather events, ranked by total death and injury rate")

Most harmful weather events, ranked by total death and injury rate
CATEGORY	Count	Fatalities	Injuries	Total	Fatality Rate	Injury Rate	Total Rate
HEAT	2645	3138	9154	12292	1.19	3.46	4.65
HURRICANE	1054	201	1714	1915	0.191	1.63	1.82
TORNADO	60699	5661	91407	97068	0.0933	1.51	1.6
RIP CURRENTS	777	577	529	1106	0.743	0.681	1.42
FIRE	4239	90	1608	1698	0.0212	0.379	0.401
LIGHTNING	15776	817	5232	6049	0.0518	0.332	0.383
SNOW	44462	1071	6331	7402	0.0241	0.142	0.166
FLOOD	82648	1523	8601	10124	0.0184	0.104	0.122
WIND	363224	1192	11397	12589	0.00328	0.0314	0.0347
OTHER	326773	875	4555	5430	0.00268	0.0139	0.0166

Finally, If we instead wanted to find the events with the highest total impact (fatalities and injuries combined) then we would get tornados, wind and heat as the most harmful.

kable(arrange(CatSum, desc(CatSum$Total)),
      padding = 2,
      caption = "Most harmful weather events, ranked by total death and injury")

Most harmful weather events, ranked by total death and injury
CATEGORY	Count	Fatalities	Injuries	Total	Fatality Rate	Injury Rate	Total Rate
TORNADO	60699	5661	91407	97068	0.0933	1.51	1.6
WIND	363224	1192	11397	12589	0.00328	0.0314	0.0347
HEAT	2645	3138	9154	12292	1.19	3.46	4.65
FLOOD	82648	1523	8601	10124	0.0184	0.104	0.122
SNOW	44462	1071	6331	7402	0.0241	0.142	0.166
LIGHTNING	15776	817	5232	6049	0.0518	0.332	0.383
OTHER	326773	875	4555	5430	0.00268	0.0139	0.0166
HURRICANE	1054	201	1714	1915	0.191	1.63	1.82
FIRE	4239	90	1608	1698	0.0212	0.379	0.401
RIP CURRENTS	777	577	529	1106	0.743	0.681	1.42

This can also be seen graphically:

CatSum %>%
    select(CATEGORY,`Fatalities`, `Injuries`) %>%
    gather(key = `Casualty Type`, value = `Count`, -CATEGORY) %>%
    ggplot() +
    geom_col(mapping = aes(x = fct_reorder(CATEGORY, `Count`),
                           y = `Count`,
                           fill = `Casualty Type`)) +
    coord_flip() +
    labs(y = "Number of Casualties (Injured or Killed)",
         x = "",
         title = "Cost to Health of Extreme Weather Events",
         subtitle = "U.S. NOAA Storm Database (1950-2011)",
         fill = "Casualty Type") +
    theme_bw() +
    guides(fill=guide_legend(title = NULL)) +
    theme(legend.justification=c(1,0), legend.position=c(0.9,0.1))

Economic Damages

The same type of analysis was performed on the economic data, specifically the property damage and agricultural damage. Flooding was by far the most costly, almost twice the damage of hurricanes. Since the values were so large we scaled them back by $10^9$, making the values representative of billions of dollars.

CatEconSum$`Property Damage` <- CatEconSum$`Property Damage`/1000000000
CatEconSum$`Crop Damage` <- CatEconSum$`Crop Damage`/1000000000
CatEconSum$`Total Damage` <- CatEconSum$`Total Damage`/1000000000
kable(CatEconSum,
      caption = "Summary of Economic Damages, by Category")

Summary of Economic Damages, by Category
CATEGORY	Property Damage	Crop Damage	Total Damage
FLOOD	167.366829	12.3470691	179.71390
HURRICANE	93.072538	6.2110138	99.28355
TORNADO	56.993099	0.4149613	57.40806
SURGE	47.965224	0.0008550	47.96608
SNOW	12.699465	8.5572980	21.25676
HAIL	17.619992	3.1142128	20.73420
WIND	15.802228	1.9564790	17.75871
DROUGHT	1.046306	13.9726218	15.01893
FIRE	8.501629	0.4032816	8.90491
OTHER	6.251340	2.1263994	8.37774

A column chart can be made to show the how the totals relate and how much the property and crop components contribute to the total.

CatEconSum %>%
    select(CATEGORY:`Crop Damage`) %>%
    gather(key = `Damage Type`, value = `Amount`, -CATEGORY) %>%
    ggplot() +
    geom_col(mapping = aes(x = fct_reorder(CATEGORY, `Amount`),
                           y = `Amount`,
                           fill = `Damage Type`)) +
    coord_flip() +
    labs(y = "Cost of Damages (Billions $)",
         x = "",
         title = "Cost of Damages of Extreme Weather Events",
         subtitle = "U.S. NOAA Storm Database (1950-2011)",
         fill = "Damage Type") +
    theme_bw() +
    guides(fill=guide_legend(title = NULL)) +
    theme(legend.justification=c(1,0), legend.position=c(0.9,0.1))

Acknowledgments

The following page has a data dictionary was useful in determining the column types for the initial loading of the data.

http://rstudio-pubs-static.s3.amazonaws.com/23028_821a9591f7de4289a21c4ec8a41604c3.html

As stated above, the following website has a useful analysis of the meanings of the exponent values for the damages in this dataset.

https://rpubs.com/flyingdisc/PROPDMGEXP

Appendix

Selection criteria for grouping

Choosing the groups was done manually. From Table 1 it was clear that there were name-based issues, so terms like “heat”, “flood” and “wind” looked like they should catch multiple top 10 events. After using mutate to add the logical columns for the search terms, those terms were used to exclude “found” rows.

EventExample <- EventSummary %>%
    mutate(WIND      = grepl("WIND", EVTYPE),
           HEAT      = grepl("HEAT", EVTYPE),
           FLOOD     = grepl("FLOOD", EVTYPE))
subEvent <- EventExample[!(EventExample$WIND|
                           EventExample$HEAT|
                           EventExample$FLOOD),] %>%
    arrange(desc(Event.F))
kable(subEvent[1:10,1:8])

EVTYPE	Event.Count	Event.F	Event.I	Event.T	Event.r.F	Event.r.I	Event.r.T
TORNADO	60652	5633	91346	96979	0.0929	1.51	1.6
LIGHTNING	15755	816	5230	6046	0.0518	0.332	0.384
RIP CURRENT	470	368	232	600	0.783	0.494	1.28
AVALANCHE	386	224	170	394	0.58	0.44	1.02
WINTER STORM	11433	206	1321	1527	0.018	0.116	0.134
RIP CURRENTS	304	204	297	501	0.671	0.977	1.65
EXTREME COLD	655	160	231	391	0.244	0.353	0.597
HEAVY SNOW	15708	127	1021	1148	0.00809	0.065	0.0731
BLIZZARD	2719	101	805	906	0.0371	0.296	0.333
HIGH SURF	725	101	152	253	0.139	0.21	0.349

This was continued until most outcomes were accounted for.