Synopsis

An analysis is performed on the National Weather Service Storm Data, which contains recorded weather events from 1950 to 2011. The objective is to understand which types of events are most harmful to people, and which have the greatest economic consequences, across the United States.

The event types contained in the data are supposed to be restricted to a set of 48 possible types; however, the manual entry of the data resulted in a set of 985 distinct values. Furthermore, many event types are fairly similar from the high-level view of this analysis. Thus, a large part of the processing done for this analysis was targeted at consolidating the data accurately into the top 20 contributing event types. The consolidated data, which accounted for over 98% of the people harmed and over 99% of the damage to crops and property, was then presented using bar charts.

The results show that tornadoes are the most dangerous and destructive weather type, with floods and thunderstorm winds also contributing strongly to both metrics.

Data Processing

The storm data was downloaded from the course website on January 20, 2016. The original .bz2 file was uncompressed using the Linux bunzip2 utility. The bunzip2 output is the comma-separated-value (CSV) file repdata-data-StormData.csv. The file contains numerous variables for each storm event recorded from 1950 to 2011. This analysis focuses on people harmed and property damaged by different event types across the US over the entire time span, so most of the variables are not of present interest.

The storm data is loaded into R using the following statement:

storms <- read.csv("./repdata-data-StormData.csv", 
                   stringsAsFactors = FALSE) 
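As an aside, read.csv can generally read a bzip2-compressed file directly, so the separate bunzip2 step is optional. The sketch below shows this on a small temporary file; the toy contents and the toy, toyPath, and toyStorms names are made up for illustration and are not part of the original analysis.

```r
# Toy stand-in for the real storm file (hypothetical contents)
toy <- data.frame(EVTYPE = c("TORNADO", "HAIL"),
                  FATALITIES = c(3, 0),
                  stringsAsFactors = FALSE)

# Write the toy data to a bzip2-compressed CSV in a temporary location
toyPath <- tempfile(fileext = ".csv.bz2")
con <- bzfile(toyPath, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() opens the path through a connection that detects the
# compression, so the file is decompressed while it is read
toyStorms <- read.csv(toyPath, stringsAsFactors = FALSE)
print(toyStorms)
```

Reading the compressed file directly avoids keeping a second, much larger uncompressed copy of the data on disk.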

Next, the variables of interest are extracted and renamed using more readable strings in camelCase format:

library(dplyr)  # provides %>%, select(), mutate(), group_by(), summarize(), top_n()
storms <- storms %>%
    select(startDate = BGN_DATE, 
           countyName = COUNTYNAME, 
           state = STATE, 
           eventType = EVTYPE, 
           fatalities = FATALITIES, 
           injuries = INJURIES, 
           propertyDamage = PROPDMG, 
           cropDamage = CROPDMG)
cat("The storm data contains", nlevels(as.factor(storms$eventType)), "unique weather event types.")
The storm data contains 985 unique weather event types.

This unwieldy number of event types is a consequence of human data entry, which allowed both creative combinations and subclassing of official event types, together with the unavoidable misspellings that occur when data entry is unchecked. Misspellings matter little here, since they are mainly individual events that contribute little to the total harm or damage of the intended event type. Of more interest is consolidating the frequently used event types that, from a high level, can be considered essentially the same type of event.
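One way to check how much a rare variant contributes is to compare its share of the total with that of the intended event type. The sketch below uses a made-up toy table (the counts and the misspelled "TORNDAO" are hypothetical) and is not the check performed in the original analysis.

```r
# Toy data: one misspelled record among correctly labeled ones (made-up counts)
toy <- data.frame(
    eventType = c("TORNADO", "TORNADO", "TORNADO", "TORNDAO", "HAIL", "HAIL"),
    harmed    = c(40, 30, 25, 1, 3, 1),
    stringsAsFactors = FALSE
)

# Total harm per event type, then each type's percentage of the overall total
harmedByType <- tapply(toy$harmed, toy$eventType, sum)
sharePct <- round(100 * harmedByType / sum(harmedByType), 1)

# Largest contributors first; the misspelled variant lands at the bottom
print(sort(sharePct, decreasing = TRUE))
```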

To consolidate the data into a small yet informative and accurate number of event types, I applied the iterative approach described below until I had the top twenty distinct contributors to each of the two metrics: people harmed and property damaged.

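My ranking metrics were defined as follows: the number of people harmed (numHarmed) is the total of fatalities and injuries, and damage is the total of the recorded property damage and crop damage values.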

Top 20 Harming Event Types

Focusing on the number of people harmed by different event types, I iterated over the following steps:

  1. Apply the latest “mutate” rules that modify event type values (in the first iteration there are no rules).
  2. Group the storm data by event type, and summarize the deaths and injuries into a numHarmed variable.
  3. Rank the result and keep the top twenty values of the numHarmed variable.
  4. Identify event types in the top twenty that are essentially the same event; for example, “TSTM WIND” is the same as “THUNDERSTORM WIND”, and all types of winter-related events are considered to be “WINTER STORM”.
  5. Add “mutate” rules that modify the event type values in the storm data table to consolidate those similar or identical top 20 events into a single event type.
  6. Redo the procedure to get a new top 20 set of event types. Repeat until there are 20 distinct event types.
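A single pass through these steps can be sketched on a toy table. The counts below are made up and the toyStorms, consolidated, and ranking names are hypothetical; only the column names and the two example rules follow the report.

```r
library(dplyr)

# Toy data with made-up counts; only the column names follow the report
toyStorms <- data.frame(
    eventType  = c("TSTM WIND", "THUNDERSTORM WIND", "TORNADO",
                   "FLASH FLOOD", "FLOOD"),
    fatalities = c(1, 2, 10, 3, 4),
    injuries   = c(5, 1, 50, 2, 6),
    stringsAsFactors = FALSE
)

# Steps 1 and 5: apply the consolidation rules collected so far
consolidated <- toyStorms %>%
    mutate(eventType = ifelse(eventType %in% c("TSTM WIND"),
                              "THUNDERSTORM WIND", eventType)) %>%
    mutate(eventType = ifelse(eventType %in% c("FLASH FLOOD"),
                              "FLOOD", eventType))

# Steps 2 and 3: summarize harm per event type and rank the result
ranking <- consolidated %>%
    group_by(eventType) %>%
    summarize(numHarmed = sum(fatalities) + sum(injuries)) %>%
    arrange(desc(numHarmed))

# Step 4 is a manual inspection of this ranking for near-duplicate types
print(ranking)
```

After the two rules are applied, the five raw labels collapse into three event types, which is the effect each iteration aims for on the real data.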

Iterating over these steps produced the following R code, which is discussed chunk by chunk.

Apply “mutate” Rules

The first step is to apply the mutate rules that are built up by successive iterations of the described approach. This code represents the final set of mutate rules.

storms2 <- storms %>%
    mutate(eventType = ifelse(eventType %in% c("TSTM WIND", 
                                               "THUNDERSTORM WINDS"), 
                              "THUNDERSTORM WIND", eventType)) %>%
    mutate(eventType = ifelse(eventType %in% c("TSTM WIND/HAIL"), 
                              "HAIL", eventType)) %>%
    mutate(eventType = ifelse(eventType %in% c("STRONG WIND", 
                                               "HIGH WINDS", 
                                               "HIGH WIND"), 
                              "WIND", eventType)) %>%
    mutate(eventType = ifelse(eventType %in% c("HEAT WAVE", 
                                               "EXCESSIVE HEAT", 
                                               "EXTREME HEAT"), 
                              "HEAT", eventType)) %>%
    mutate(eventType = ifelse(eventType %in% c("WILD/FOREST FIRE", 
                                               "WILD FIRES"), 
                              "WILDFIRE", eventType)) %>%
    mutate(eventType = ifelse(eventType %in% c("RIP CURRENTS"), 
                              "RIP CURRENT", eventType)) %>%
    mutate(eventType = ifelse(eventType %in% c("BLIZZARD", 
                                               "WINTER WEATHER", 
                                               "WINTER WEATHER/MIX", 
                                               "HEAVY SNOW", 
                                               "ICE STORM", 
                                               "GLAZE", 
                                               "ICE", 
                                               "LAKE-EFFECT SNOW"), 
                              "WINTER STORM", eventType)) %>%
    mutate(eventType = ifelse(eventType %in% c("DENSE FOG"), 
                              "FOG", eventType)) %>%
    mutate(eventType = ifelse(eventType %in% c("FLASH FLOOD", 
                                               "FLASH FLOODING", 
                                               "FLOOD/FLASH FLOOD", 
                                               "URBAN/SML STREAM FLD", 
                                               "URBAN FLOOD", 
                                               "URBAN FLOODING", 
                                               "RIVER FLOOD", 
                                               "COASTAL FLOOD", 
                                               "COASTAL FLOODING", 
                                               "FLOODING"), 
                              "FLOOD", eventType)) %>%
    mutate(eventType = ifelse(eventType %in% c("HURRICANE/TYPHOON"), 
                              "HURRICANE", eventType)) %>%
    mutate(eventType = ifelse(eventType %in% c("STORM SURGE/TIDE"), 
                              "STORM SURGE", eventType)) %>%
    mutate(eventType = ifelse(eventType %in% c("HEAVY SURF/HIGH SURF"), 
                              "HIGH SURF", eventType)) %>%
    mutate(eventType = ifelse(eventType %in% c("EXTREME COLD/WIND CHILL", 
                                               "COLD/WIND CHILL", 
                                               "FROST/FREEZE"), 
                              "EXTREME COLD", eventType))
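The same kind of consolidation could also be expressed as a lookup table, which keeps each variant next to its consolidated type. This is a sketch of an alternative, not the code used in the analysis; eventLookup and consolidateEvents are hypothetical names, and only a few of the rules are reproduced.

```r
# Hypothetical lookup: names are the variants, values the consolidated types
eventLookup <- c("TSTM WIND"          = "THUNDERSTORM WIND",
                 "THUNDERSTORM WINDS" = "THUNDERSTORM WIND",
                 "RIP CURRENTS"       = "RIP CURRENT",
                 "DENSE FOG"          = "FOG")

# Replace every value that has a rule and leave all other values unchanged
consolidateEvents <- function(eventType, lookup) {
    mapped <- unname(lookup[eventType])   # NA where no rule applies
    ifelse(is.na(mapped), eventType, mapped)
}

example <- consolidateEvents(c("TSTM WIND", "TORNADO", "DENSE FOG"), eventLookup)
print(example)
```

Unlike a chain of mutate calls, the lookup is applied in a single pass, so each variant has to map straight to its final event type.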

Group and Summarize

The second step is to group and summarize the modified storm data:

harmed <- storms2 %>%
    group_by(eventType) %>% 
    summarize(numHarmed = sum(fatalities) + sum(injuries)) 

harmedTotal <- sum(harmed$numHarmed)
cat("The total number of people harmed by all event types is:", harmedTotal)
The total number of people harmed by all event types is: 155673

Rank the Top 20

The third step is to rank the top twenty event types based on the number of people harmed:

harmed20 <- harmed %>%
    top_n(20, numHarmed) %>%
    arrange(desc(numHarmed))

harmed20Pct <- paste0(round(100*sum(harmed20$numHarmed)/harmedTotal, digits=1), "%")
cat("These top twenty event types contributed", harmed20Pct, "of the total number of people harmed by all event types.")
These top twenty event types contributed 98.9% of the total number of people harmed by all event types.
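The ranking and the percentage calculation can be illustrated on a toy table. The sketch below uses slice_max, which replaces top_n in dplyr 1.0.0 and later; the toy values and the toyHarmed, toyTop2, and toyPct names are made up for illustration.

```r
library(dplyr)  # slice_max() assumes dplyr 1.0.0 or later

# Toy summary table with made-up counts
toyHarmed <- data.frame(eventType = c("A", "B", "C", "D"),
                        numHarmed = c(70, 20, 6, 4),
                        stringsAsFactors = FALSE)

# Keep the two largest contributors; slice_max() returns them largest first
toyTop2 <- toyHarmed %>%
    slice_max(numHarmed, n = 2)

# Share of the overall total covered by the retained event types
toyPct <- round(100 * sum(toyTop2$numHarmed) / sum(toyHarmed$numHarmed),
                digits = 1)
cat("The top two toy event types contribute", paste0(toyPct, "%"), "\n")
```

Like top_n, slice_max keeps tied rows by default, so a tie at the cutoff can return more rows than requested.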

Identify Similar Event Types

As an example, during the first iteration using this approach, the top ten events causing harm are:

eventType            numHarmed
TORNADO                  96979
EXCESSIVE HEAT            8428
TSTM WIND                 7461
FLOOD                     7259
LIGHTNING                 6046
HEAT                      3037
FLASH FLOOD               2755
ICE STORM                 2064
THUNDERSTORM WIND         1621
WINTER STORM              1527

I will consider “TSTM WIND” to be the same as “THUNDERSTORM WIND”, “FLASH FLOOD” to be the same as “FLOOD”, “EXCESSIVE HEAT” to be the same as “HEAT”, and “ICE STORM” to be the same as “WINTER STORM”. Thus I will write mutate rules to modify them to have the same eventType value for the next iteration.

Add Mutate Rules

Continuing the example from the previous step, I would append the following mutate rules to capture the indicated equivalences.

storms2 <- storms %>%
    mutate(eventType = ifelse(eventType %in% c("TSTM WIND"), 
                              "THUNDERSTORM WIND", eventType)) %>%
    mutate(eventType = ifelse(eventType %in% c("FLASH FLOOD"), 
                              "FLOOD", eventType)) %>%
    mutate(eventType = ifelse(eventType %in% c("EXCESSIVE HEAT"), 
                              "HEAT", eventType)) %>%
    mutate(eventType = ifelse(eventType %in% c("ICE STORM"), 
                              "WINTER STORM", eventType))

Each iteration will contribute new mutate rules or add event types to existing rules. The set of rules described in the first step above is the result after the final iteration.

Top 20 Harming Event Type Keywords

To keep the bar charts of these event types readable, I created a two-letter keyword corresponding to each of the top twenty event types, and added this as another variable to the top twenty table. These keywords can then be used on the x-axis of the bar charts. The eventKeyword variable is also made a factor whose levels follow the top 20 rank order.

# Two-letter keyword for each top-20 harm event type
harmKeywords <- c("TORNADO" = "TN", "HEAT" = "HE", "FLOOD" = "FL",
                  "THUNDERSTORM WIND" = "TW", "WINTER STORM" = "WS",
                  "LIGHTNING" = "LI", "WIND" = "WI", "WILDFIRE" = "WF",
                  "HAIL" = "HL", "HURRICANE" = "HU", "FOG" = "FG",
                  "RIP CURRENT" = "RC", "EXTREME COLD" = "EC",
                  "DUST STORM" = "DS", "TROPICAL STORM" = "TS",
                  "AVALANCHE" = "AV", "HEAVY RAIN" = "HR",
                  "HIGH SURF" = "HS", "TSUNAMI" = "TI", "LANDSLIDE" = "LS")
harmed20 <- harmed20 %>%
    # Look up each keyword; any event type without one is labeled "XX"
    mutate(eventKeyword = ifelse(eventType %in% names(harmKeywords),
                                 unname(harmKeywords[eventType]), "XX")) %>%
    select(eventKeyword, eventType, numHarmed)
# Fix the factor levels in rank order so the bars are plotted in that order
harmed20$eventKeyword <- factor(harmed20$eventKeyword, levels = harmed20$eventKeyword)
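The reason for setting the factor levels explicitly can be seen in a small sketch. It uses the first four keywords from the harm ranking; the keywords, defaultOrder, and rankedOrder names are hypothetical.

```r
# First four keywords, listed in ranked order
keywords <- c("TN", "HE", "FL", "TW")

# Default: levels are sorted alphabetically, so a plot would reorder the bars
defaultOrder <- levels(factor(keywords))

# Explicit levels: the ranked order is preserved on the x-axis
rankedOrder <- levels(factor(keywords, levels = keywords))

print(defaultOrder)
print(rankedOrder)
```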

For later reference when reading the graphs in the Results section, the table of event types and keywords is printed here.

library(knitr)  # provides kable()
kable(select(harmed20, eventKeyword, eventType), 
      caption = "Top 20 Harm Event Types", 
      col.names = c("Keyword ", "Event Type"))

Top 20 Harm Event Types

Keyword  Event Type
TN       TORNADO
HE       HEAT
FL       FLOOD
TW       THUNDERSTORM WIND
WS       WINTER STORM
LI       LIGHTNING
WI       WIND
WF       WILDFIRE
HL       HAIL
HU       HURRICANE
FG       FOG
RC       RIP CURRENT
EC       EXTREME COLD
DS       DUST STORM
TS       TROPICAL STORM
AV       AVALANCHE
HR       HEAVY RAIN
HS       HIGH SURF
TI       TSUNAMI
LS       LANDSLIDE

Top 20 Damage Event Types

The same approach was used to consolidate the event types contributing the most damage. Here the damage metric was used instead of the numHarmed metric; however, the same iterative approach was used to rank the top 20, identify essentially similar event types, and write mutate rules to consolidate those event types. The resulting top 20 damage event types overlapped considerably with the top 20 harming event types: only 3 event types differed between the two lists.

The mutate rules from the top 20 harm iterations were reused and extended for the top 20 damage event types. The listing above that produced the storms2 table contains all mutate rules from both sets of iterations.
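The overlap between the two lists can be checked with set operations. In the sketch below the two keyword vectors are copied by hand from the top 20 harm and top 20 damage tables in this report; the vector names are hypothetical.

```r
# Keywords copied from the two top-20 tables in this report, in rank order
harmKeys <- c("TN", "HE", "FL", "TW", "WS", "LI", "WI", "WF", "HL", "HU",
              "FG", "RC", "EC", "DS", "TS", "AV", "HR", "HS", "TI", "LS")
damageKeys <- c("TN", "TW", "FL", "HL", "LI", "WI", "WS", "WF", "HR", "TS",
                "DR", "HU", "EC", "SS", "LS", "FG", "WA", "DS", "HE", "HS")

# Event types that appear in only one of the two lists
harmOnly   <- setdiff(harmKeys, damageKeys)
damageOnly <- setdiff(damageKeys, harmKeys)

cat("Only in the harm list:  ", harmOnly, "\n")
cat("Only in the damage list:", damageOnly, "\n")
```

Rip current, avalanche, and tsunami appear only in the harm list, while drought, storm surge, and waterspout appear only in the damage list.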

Below is the final R code for producing the top 20 damage event types. Refer to the detailed discussion above to understand the steps involved.

damages <- storms2 %>%
    group_by(eventType) %>% 
    summarize(damage = sum(propertyDamage) + sum(cropDamage)) 
damageTotal <- sum(damages$damage)
cat("The total of damage by all event types is:", paste0("$", damageTotal))
The total of damage by all event types is: $12262327.33
damages20 <- damages %>%
    top_n(20, damage) %>%
    arrange(desc(damage))
damages20Pct <- paste0(round(100*sum(damages20$damage)/damageTotal, digits=1), "%")
cat("These top twenty event types contributed", damages20Pct, "of the total damage by all event types.")
These top twenty event types contributed 99.2% of the total damage by all event types.

Top 20 Damage Event Type Keywords

As with the top 20 harming event types, I created a two-letter keyword corresponding to each of the top twenty damage event types, and added this as another variable to the top twenty table. These keywords can then be used on the x-axis of the bar charts. The eventKeyword variable is also made a factor whose levels follow the top 20 rank order.

# Two-letter keyword for each top-20 damage event type
damageKeywords <- c("TORNADO" = "TN", "HEAT" = "HE", "FLOOD" = "FL",
                    "THUNDERSTORM WIND" = "TW", "WINTER STORM" = "WS",
                    "LIGHTNING" = "LI", "WIND" = "WI", "WILDFIRE" = "WF",
                    "HAIL" = "HL", "HURRICANE" = "HU", "FOG" = "FG",
                    "DROUGHT" = "DR", "EXTREME COLD" = "EC",
                    "DUST STORM" = "DS", "TROPICAL STORM" = "TS",
                    "STORM SURGE" = "SS", "HEAVY RAIN" = "HR",
                    "HIGH SURF" = "HS", "WATERSPOUT" = "WA", "LANDSLIDE" = "LS")
damages20 <- damages20 %>%
    # Look up each keyword; any event type without one is labeled "XX"
    mutate(eventKeyword = ifelse(eventType %in% names(damageKeywords),
                                 unname(damageKeywords[eventType]), "XX")) %>%
    select(eventKeyword, eventType, damage)
# Fix the factor levels in rank order so the bars are plotted in that order
damages20$eventKeyword <- factor(damages20$eventKeyword, levels = damages20$eventKeyword)

For later reference when reading the graphs in the Results section, the table of event types and keywords is printed here.

kable(select(damages20, eventKeyword, eventType), 
      caption = "Top 20 Damage Event Types", 
      col.names = c("Keyword ", "Event Type"))

Top 20 Damage Event Types

Keyword  Event Type
TN       TORNADO
TW       THUNDERSTORM WIND
FL       FLOOD
HL       HAIL
LI       LIGHTNING
WI       WIND
WS       WINTER STORM
WF       WILDFIRE
HR       HEAVY RAIN
TS       TROPICAL STORM
DR       DROUGHT
HU       HURRICANE
EC       EXTREME COLD
SS       STORM SURGE
LS       LANDSLIDE
FG       FOG
WA       WATERSPOUT
DS       DUST STORM
HE       HEAT
HS       HIGH SURF

The harmed20 and damages20 tables are the processed data that will be used to produce the graphs in the Results section.

Results

For this analysis I used the two tables produced by the top 20 iterations described in the previous section, and created bar charts to show the contribution of each event type. The base-10 logarithm of the damage and of the number of people harmed is plotted so that the contributions of all of the top 20 event types to each metric remain visible.

library(ggplot2)  # provides ggplot(), geom_bar(), labs(), theme()
ggplot(harmed20, aes(eventKeyword, log10(numHarmed))) +
    geom_bar(stat = "identity", col = "black", fill = "blue") +
    labs(title = "# People Harmed by Weather Event Type",
         x = "Event Type Keyword (see table)",
         y = "Total # People Harmed (log10 scale)") +
    theme(plot.title = element_text(size=12)) +
    theme(axis.title = element_text(size=10)) +
    theme(axis.text = element_text(size=6))

ggplot(damages20, aes(eventKeyword, log10(damage))) + 
    geom_bar(stat = "identity", col = "black", fill = "blue") +
    labs(title = "Total $ Damages by Weather Event Type",
         x = "Event Type Keyword (see table)",
         y = "Total $ Damages (log10 scale)")+
    theme(plot.title = element_text(size=12)) +
    theme(axis.title = element_text(size=10)) +
    theme(axis.text = element_text(size=6))
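The effect of the log10 scale on bar heights can be illustrated with two counts taken from the first-iteration table in the Data Processing section. This is only a numeric sketch of the rationale; the variable names are hypothetical.

```r
# Harm counts taken from the first-iteration table earlier in this report
numHarmed <- c("TORNADO" = 96979, "WINTER STORM" = 1527)

# On a linear axis the smaller bar would be under 2% of the taller one
linearRatio <- numHarmed[["WINTER STORM"]] / numHarmed[["TORNADO"]]

# On a log10 axis the two bar heights are of comparable size
logHeights <- log10(numHarmed)
logRatio <- logHeights[["WINTER STORM"]] / logHeights[["TORNADO"]]

print(round(c(linear = linearRatio, log10 = logRatio), 2))
```

The trade-off is that equal steps on the log10 axis correspond to factors of ten, so visually small differences between bars can hide large differences in the underlying totals.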

Tornadoes stand out as the top weather event for both harm to people and damage to crops and property, while flood events rank third for both metrics. Thunderstorm winds rank in the top four of both harm and damage metrics.

Some event types rank high in one metric but not in the other. For example, heat-related events rank second in the amount of harm caused to people, but only 19th in the amount of damage. Conversely, drought ranks 11th in the amount of damage caused, but does not appear in the top 20 for harm caused to people.