Weather is never the same – similarly the threat from major weather events is always changing as well. There are some types of weather events year after year that can be expected to wreak more mayhem than others, regularly more people killed by lightning events than by landslides, commonly more property damage from tornadoes than other wind-related events. However, things at the top of the list, the most harmful events, here things are not readily predictable. When looking at the events with the greatest impact, there is no one event type that stands above the rest. Mother Nature has shown that she has lots of tricks she can play, and dealing with her worst will likely have more to do with preparing for the scale of the problems than worrying about the specific type of phenomenon that was the trigger.

Data Processing

It has been said that 80% of Data Science is just getting to the data of interest. This dataset is no different. However, for this report the steps and the transformations have been kept simple, and each part of the process has been implemented with an intent to follow best practices of Reproducible Research.

Preparation

The processing methods in the report rely on
* dplyr – a library of functions to help wrestle datasets into better shape
* ggplot2 – a powerful and flexible graphics library

# Load required libraries
suppressPackageStartupMessages(library(dplyr))   # Favorite way to tidy datasets
suppressPackageStartupMessages(library(ggplot2)) # Favorite graphing library

Access and Load the Dataset

The canonical source for this material is NOAA’s Storm Events Database, https://www.ncdc.noaa.gov/stormevents/

This report is generated from a specifically provided dataset, the exact URL and filename follow.

downloadfile <- 'repdata-data-StormData.csv.bz2'
url <- 'https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2'

Before the dataset can be processed, it first must be obtained and read into R.

Thankfully, the raw form this data as provided is friendly to R. The file is heavily compressed, but other than the compression the data is in a basic CSV format and there are no complications that require anything more complicated than a simple call to read.csv().

if (!file.exists(downloadfile)) {
    # Then we need to (re)fetch from the last known URL
    download.file(url, file)
}

# Though the file is compressed, the format follows read.csv defaults
SD <- read.csv(downloadfile)
dim(SD)
## [1] 902297     37

Note: there are nearly a million events in this dataset.

Pre-Processing

With a dataset of this size, it is often wise to trim the working set based on the parts of the dataset needed for the desired analyses. The rest of the processing can be a lot easier once the data is managed into a more effective form.

Correction

Errors can be found in almost any dataset. In this case there is a typo that inflates the reported property damage by three orders of magnitude for the Napa County flood that occurred over New Years 2006 The error makes this event the single most costly event in the database, far more costly than Hurricane Katrina.

# Typo: Napa Flood of 01/01/2006 is listed as 115 Billion of Property Damage
#   - makes this event the most expensive, ever, including Hurricane Katrina

# Check the remarks for this event coded as 115 Billion in property damage
as.character(SD[SD$PROPDMG==115 & SD$PROPDMGEXP=='B', 36])
## [1] "Major flooding continued into the early hours of January 1st, before the Napa River finally fell below flood stage and the water receeded. Flooding was severe in Downtown Napa from the Napa Creek and the City and Parks Department was hit with $6 million in damage alone. The City of Napa had 600 homes with moderate damage, 150 damaged businesses with costs of at least $70 million."

The specific error is the use of the ‘B’ (for Billions) suffix for the Property Damage Exponent. From the remarks field for this same record we can read that the largest components of the damage estimates are in the millions of dollars, so for the purposes of this report we have changed that ‘B’ to an ‘M’ for millions.

As is noted in a “Storm Data Disclaimer” that is part of the FAQ provided with the dataset, https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2FNCDC%20Storm%20Events-FAQ%20Page.pdf, “An effort is made to use the best available information but because of time and resource constraints, information from these sources may be unverified by the NWS. Therefore, when using information from Storm Data, customers should be cautious as the NWS does not guarantee the accuracy or validity of the information.” Therefore, the author of this report believes the onus is placed on us to perform corrections that may be necessary for the intended uses of this data.

For this report we have chosen to make the minimal possible step to correct, and have retained an unchanged copy of the original dataframe that can be used for comparison if there are concerns about the change.

# Handle a typo that changes the most expensive event in database
fixSD <- SD                                                  # Keep the original
fixSD[fixSD$PROPDMG==115 & fixSD$PROPDMGEXP=='B', 26] <- 'M' # Correct to Millions

Selection

To facilitate processing selections are made from the full dataset, cutting down the almost 40 columns to a more manageable dozen, and a simpler (faster) dataframe is built for our work.

Since this report is working towards broad generalizations, we are not concerned at this time with many of the minor specifics for each event (e.g. exact location, jurisdictions, and remarks) We can reduce the columns down to focus on a few.

# Columns we need to keep
colids <- c(2,7,8,23,24,25,26,27,28,30,37)
names(fixSD)[colids]
##  [1] "BGN_DATE"   "STATE"      "EVTYPE"     "FATALITIES" "INJURIES"  
##  [6] "PROPDMG"    "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "STATEOFFIC"
## [11] "REFNUM"

Derived Values

In addition to focusing down to a few columns, there are some new columns that could be useful:
* year – we don’t need the full dates for our charts
* tot_cash – compute a numeric total from the encoded dollar values
* people – a single sum of all people reported harmed

In order to decode the dollar values, it is necessary to properly manage the values in the ‘EXP’ fields, e.g. PROPDMGEXP and CROPDMGEXP.

There are some values in the ‘EXP’ fields that defy simple translation, e.g. a few are encoded as ‘?’. In these cases the function returns a multiplier of 1.0

row_count <- dim(fixSD)[1]
known_exp <- c('B','b','M','m','K','k','H','h','',      # few letters and blank
               '0','1','2','3','4','5','6','7','8','9') # and all digits
exp_prop_unknown <- sum(!fixSD$PROPDMGEXP %in% known_exp)
exp_crop_unknown <- sum(!fixSD$CROPDMGEXP %in% known_exp)
cbind(
    c('Total Event Count','Unknown PROPDMGEXP', 'Unknown CROPDMGEXP'),
    c(row_count, exp_prop_unknown, exp_crop_unknown)
)
##      [,1]                 [,2]    
## [1,] "Total Event Count"  "902297"
## [2,] "Unknown PROPDMGEXP" "14"    
## [3,] "Unknown CROPDMGEXP" "7"

The unknown cases amount to 21 out of 902297 total events (each event having two ‘EXP’ codes).

For this we create a simple-minded, but ugly looking, function that takes in one of these ‘EXP’ code characters and will return a multiplicative value to use to translate the damage fields into a native numeric value.

# Find Magnitude -- helper function to convert B/M/K exponentiation
find_magnitude <- function(char) {
    ifelse(
        # if blank, just assume we multiply by 1
        char == '', 1,
        ifelse(
            # Billions
            toupper(char) == 'B', 10 ^ 9,
            ifelse(
                # Millions
                toupper(char) == 'M', 10 ^ 6,
                ifelse(
                    # Thousands
                    toupper(char) == 'K', 1000,
                    ifelse(
                        # Hundreds
                        toupper(char) == 'H', 100,
                        ifelse(
                            # Numeric Exponent provided, use it
                            is.numeric(char), 10 ^ as.numeric(char),
                            
                            # fall thru case, just assume exponent == 0
                            # hence we are telling to multiply result by 1
                            # Only a tiny number of cases fall thru to here
                            1
                        )
                    )
                )
            )
        )
    )
}

Filtering

Also, the reporting for years prior to 1993 was limited to just a few categories of tornadoes and storms which severely limits the comparability of values before and after that year. This is not a new observation, the assignment itself notes “In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.”

For the purposes of this report will will restrict the analyses to the years after 1992 where there is more consistency for comparisons between event types. Note, by creating a simple ‘year’ value, the earlier years are easy to filter out.

Execution

The complete set of changes can be performed in one pipeline of ‘dplyr’ functions.

# Manipulate df into a more usable form
df <- fixSD %>%
    select(colids) %>%
    mutate(
        # Simplify dates into just a year
        year = format(as.Date(BGN_DATE, format = '%m/%d/%Y'), '%Y'),
        # Calculate a total cash cost for property + crop damage
        prop_cash = PROPDMG * find_magnitude(PROPDMGEXP),
        crop_cash = CROPDMG * find_magnitude(CROPDMGEXP),
        tot_cash = prop_cash + crop_cash,
        # Track total number of people physically harmed
        people = INJURIES + FATALITIES
    ) %>%
    # Entries before 1993 are much more limited, not readily comparable
    filter(year > 1992)

Simplifying

Weather is rarely a discrete event. There are lots of definitions, but it still can be difficult to draw an edge that separates a rainstorm from thunderstorm (how often and how close does the thunder need to be?) It is clear looking at the EVTYPE values in the NOAA Storm Database that an effort has been made to classify all the events, but it is also clear that this classification is not an exact process. Any changes that might be made as part of the processing for this report are just one more human interpretation in a database of values that has taken hundreds if not thousands of people to enter, collate, and process.

Still for this report, one of the potentially significant transformations involves decisions of how to manage the ~1,000 different values used to encode the EVTYPE field.

length(unique(df$EVTYPE))
## [1] 985

Some of the differences are simple:
* Many of the types are entered in upper case, but not all, so there is ‘COLD’ and ‘Cold’; we chose to use toupper() on all types before processing.
* Others differ just by minor spelling, e.g. “WIND” and “WINDS”; this report used patterns that will match “WINDS” as well as “WIND”

Some differences are much more of a potential judgement call.
* There are ~90 different forms of ‘THUNDERSTORM’, can this be simplified? Or are there important differences between “THUNDERSTORM WIND” and “GUSTY THUNDERSTORM WIND”

For the purposes of this report, because the intent is to stay high level, the judgement calls have been made to keep things as simple as is reasonable.

We have chosen to use pattern matching to create a ‘category’ field. For each new type, there is a vector of patterns that will be used; those values of EVTYPES that match a pattern will be classified in that category. In this way we reduce the 985 EVTYPEs down to 16 categories.

The categories and their patterns are kept in a list. Note, the list is in order of execution, and this does have potentially subtle impacts, for example since ‘Winter’ is processed before ‘Wind’ EVTYPES that include ‘WINDCHILL’ will fall into the category of ‘Winter’ rather than ‘Wind’.

# Create lists of text patterns to replace
# - note careful choices of patterns will also catch typos in original types
# - also be careful of ordering 'WINDCHILL' a 'Wind' or a 'Winter' event?
event_classes <- list()
event_classes[['Hail']] <- c('HAIL')
event_classes[['Thunderstorm']] <- c('THUNDERSTORM', 'TSTM')
event_classes[['Flood']] <- c('FLOOD', 'FLD')
event_classes[['Tornado']] <- c('TORNADO', 'FUNNEL', 'WATERSPOUT', 'DEVIL')
event_classes[['Storm']] <- c('HURRICANE', 'TROPICAL', 'TYPHOON')
event_classes[['Fire']] <- c('FIRE')
event_classes[['Water']] <- c('RIP', 'SURF', 'TIDE', 'SURGE', 'TSUNAMI', 'SEAS')
event_classes[['Winter']] <- c('WINTER', 'WINTRY', 'BLIZZARD', 'SNOW', 'SLEET',
                            'ICE', 'ICY', 'FROST', 'COLD', 'AVALANCHE',
                            'FREEZE', 'CHILL', 'GLAZE')
event_classes[['Wind']] <- c('WIND') # After 'WIND CHILL' is processed...
event_classes[['Fog']] <- c('FOG')
event_classes[['Lightning']] <- c('LIGHTNING')
event_classes[['Heat']] <- c('DROUGHT', 'HEAT', 'WARM')
event_classes[['Earth']] <- c('SLIDE')
event_classes[['Rain']] <- c('RAIN')
event_classes[['Dust']] <- c('DUST')
event_classes[['Other']] <- c('')   # Catchall -- everything left is 'Other'
str(event_classes)
## List of 16
##  $ Hail        : chr "HAIL"
##  $ Thunderstorm: chr [1:2] "THUNDERSTORM" "TSTM"
##  $ Flood       : chr [1:2] "FLOOD" "FLD"
##  $ Tornado     : chr [1:4] "TORNADO" "FUNNEL" "WATERSPOUT" "DEVIL"
##  $ Storm       : chr [1:3] "HURRICANE" "TROPICAL" "TYPHOON"
##  $ Fire        : chr "FIRE"
##  $ Water       : chr [1:6] "RIP" "SURF" "TIDE" "SURGE" ...
##  $ Winter      : chr [1:13] "WINTER" "WINTRY" "BLIZZARD" "SNOW" ...
##  $ Wind        : chr "WIND"
##  $ Fog         : chr "FOG"
##  $ Lightning   : chr "LIGHTNING"
##  $ Heat        : chr [1:3] "DROUGHT" "HEAT" "WARM"
##  $ Earth       : chr "SLIDE"
##  $ Rain        : chr "RAIN"
##  $ Dust        : chr "DUST"
##  $ Other       : chr ""

Once the list is defined, it is a fairly simple process to go through and apply each category’s patterns to the EVTYPE list and then classify those that match.

# Two new vectors: old classifiers and new classifiers
df$pre_class <- toupper(df$EVTYPE)            # convert to all upper case
df$category <- rep(NA, length(df$EVTYPE))     # start new with all NAs
progress <- vector()                          # a place to hold status info

# Run through all of the EVTYPES and reclassify
for (label in names(event_classes)) {
    # Work through each pattern in its list
    for (pattern in event_classes[[label]]) {
        # Find where in the dataframe we have matches
        map <- grepl(pattern, df$pre_class, fixed = TRUE)
        # Set the corresponding simple type to our label
        df$category[map] <- label
        # Clear the corresponding entry (so event goes in 1 class only!)
        df$pre_class[map] <- NA
    }
    # Retain status information
    progress <- c(progress, round(mean(is.na(df$category)), 3))
}
cbind(CategoryLabel = names(event_classes), NotYetClassified = progress)
##       CategoryLabel  NotYetClassified
##  [1,] "Hail"         "0.68"          
##  [2,] "Thunderstorm" "0.338"         
##  [3,] "Flood"        "0.217"         
##  [4,] "Tornado"      "0.166"         
##  [5,] "Storm"        "0.164"         
##  [6,] "Fire"         "0.158"         
##  [7,] "Water"        "0.154"         
##  [8,] "Winter"       "0.089"         
##  [9,] "Wind"         "0.052"         
## [10,] "Fog"          "0.049"         
## [11,] "Lightning"    "0.027"         
## [12,] "Heat"         "0.02"          
## [13,] "Earth"        "0.019"         
## [14,] "Rain"         "0.002"         
## [15,] "Dust"         "0.001"         
## [16,] "Other"        "0"

Results

To address the questions of which types of events are most harmful and have the greatest economic consequences we can rework our dataframe of NOAA values and calculate the sums of people injured or killed and the total dollars of damage, and we can do this by year and by type of event.

# Simplify grouping for where we will group the results
df$category <- factor(df$category)   # Categories should factor easily
df$year <- as.integer(df$year)       # year works as an integer

# Create a table of summary values
summ <- df %>%
    group_by(year, category) %>%
    summarise(
        n = n(),                     # a count of the events
        people = sum(people),        # count of people (hurt or killed)
        cost = sum(tot_cash)         # sum of all dollar costs
    )
print(summary(summ))
##       year         category         n             people      
##  Min.   :1993   Dust   : 19   Min.   :    1   Min.   :   0.0  
##  1st Qu.:1997   Fire   : 19   1st Qu.:   66   1st Qu.:  17.5  
##  Median :2002   Flood  : 19   Median :  381   Median :  92.0  
##  Mean   :2002   Fog    : 19   Mean   : 2359   Mean   : 262.8  
##  3rd Qu.:2007   Hail   : 19   3rd Qu.: 1976   3rd Qu.: 257.0  
##  Max.   :2011   Heat   : 19   Max.   :23223   Max.   :6758.0  
##                 (Other):189                                   
##       cost          
##  Min.   :0.000e+00  
##  1st Qu.:5.580e+06  
##  Median :9.759e+07  
##  Mean   :1.092e+09  
##  3rd Qu.:7.677e+08  
##  Max.   :5.209e+10  
## 

Property Damage

For the question of greatest economic consequences the closest proxy that exists in the NOAA Storm Database is the PROPDMG and CROPDMG, the dollar costs of damage to property and crops respectively. Summing these two together gives a value that can be used as a rough relative scale for economic consequences (the actual consequences, including following effects such as decreased productivity, may reach well beyond what NOAA includes but for the sake of argument this report proposes that the economic consequences will track with the sum of NOAA’s values).

Of course, storm events happen over a relatively short timescale, but consequences build over time. This report looks at total costs over a year’s time as a way of balancing the expenses of a major event versus the cumulative impact of many smaller events.

Another feature of storm events is their variety. Even taking sums over a year’s time frame, the values for each category can vary by two orders of magnitude or more.

To try to help make some sense out of all the different values, we display the costs in a panel of charts
* one chart for each of the 16 categories
* each chart shows total cost over years
* and each chart uses a logarithmic Y axis to show the range of costs

# Build a plot based on summ showing cost over the years
ggplot(summ, aes(x = year, y = cost)) +
    # Display each value as a point
    geom_point() + 
    # Use a log scale on Y axis (range of Y values is huge)
    scale_y_log10(breaks = c(10^6, 10^9)) +
    labs(
        x = '',
        y = 'Annual Sum of Costs of Events (dollars) [log scale]',
        title = 'Economic Consequences of Storm Events by Category'
    ) +
    # Make a panel plot by category
    facet_wrap(~category) +
    # The data is complex enough, keep the visual theme very simple
    theme_bw()
Sum of Event Costs by Category over Years

Sum of Event Costs by Category over Years

The panels in this figure enable quick comparisons of the relative costs by year for each category of event. Note that some categories exhibit relatively regular behavior, but that the most expensive categories are commonly less regular.

The water related categories are the ones revealed to be commonly at the top of the charts.
* Floods are commonly in the billions of dollars each year.
* Storms vary wildly, but the worst storms total 10s and 100s of billions in a year.
* Not surprisingly, the events in the Water category, e.g. storm surge, tidal damage, etc, also add billions in certain stormy years.
* Hail steadily remains just under the biggest items, almost reliable at damaging billions each year.
* Winter events used to be amongst the top, but the damage costs have generally fallen over the years though now not far below the billion/year level.
* Tornadoes are the other category commonly at/near the billion/year level

Population Health

To derive harm to the population health from the values in the NOAA Storm Database, we can use the count of deaths and injuries reported. Rather than make any attempt to weigh the relative pain of and injury versus a death, this report uses a simple sum of the two categories as a proxy for the number of individuals ill affected by the events.

As was done for damage costs, we use a panel plot to show the impacts across multiple categories of events.
* Again, each panel holds totals for events in one category.
* Each panel plots sums of individuals impacted across each year
* Each panel is using a logarithmic Y axis to show the full ranges

# Build a plot based on summ, showing people over the years
ggplot(summ, aes(x = year, y = people)) +
    # Display each value with a point
    geom_point() +
    # Use a log scale for the Y axis (the ranges are still huge)
    scale_y_log10(breaks = c(1000, 100, 10)) +
    labs(
        x = '',
        y = 'Sum of People Injured or Killed [log scale]',
        title = 'Population Harmed by Storm Events by Category'
    ) +
    # Make a panel plot by category
    facet_wrap(~category) +
    # Data is complex, use a very simple visual theme
    theme_bw()
Sum of People Injured or Killed in Storm Events by Category over Years

Sum of People Injured or Killed in Storm Events by Category over Years

The panels in this figure enable quick comparisons of the people harmed by year for each category of event. In many cases the counts exhibit stable patterns, but the most painful events tend not to follow the norms, the events with the heavy consequences are the outlier cases.

From these charts we can compare some of the human toll of various kinds of events.
* Floods are painful, but occasionally they can be extremely harmful.
* Tornadoes have a similar profile.
* Storms vary widely.
* Good news that both Lightning and Winter related events appear to be hurting fewer numbers of people over time.
* Hail has a lower relative impact on people than it does on dollar costs (comparing relative position on this chart to the cost charts above)
* Fog, on the other hand, appears to have a more serious impact on people than it showed for dollar costs, though the recent years have been much better.

Top Events

To get an idea of what dominates the yearly costs, we can create a small summary made up of just the Top 5 worst events, based on their total dollar costs.

# Create a Top-5 summary
top5 <- df %>% 
    # For this highlight we only need a few columns
    select(c(2,3,12,15,16,18)) %>%
    # Look at the results in each year
    group_by(year) %>%
    # Keep only the top 5 events, as defined by the event's dollar cost
    top_n(n = 5, wt = tot_cash)

# Reorder the list to be in decreasing order of dollar cost
top5 <- top5[order(top5$year, top5$tot_cash, decreasing = TRUE),]

# Provide a quick summary of these results
summary(top5)
##      STATE                  EVTYPE        year         tot_cash        
##  FL     :11   FLOOD            :14   Min.   :1993   Min.   :5.500e+07  
##  CA     : 9   DROUGHT          :12   1st Qu.:1997   1st Qu.:3.749e+08  
##  TX     : 9   HAIL             :11   Median :2002   Median :6.000e+08  
##  AL     : 8   HURRICANE        :10   Mean   :2002   Mean   :1.883e+09  
##  LA     : 6   HURRICANE/TYPHOON: 8   3rd Qu.:2006   3rd Qu.:1.500e+09  
##  OK     : 6   TORNADO          : 7   Max.   :2011   Max.   :3.130e+10  
##  (Other):50   (Other)          :37                                     
##      people           category 
##  Min.   :   0.00   Storm  :23  
##  1st Qu.:   0.00   Flood  :18  
##  Median :   0.00   Heat   :13  
##  Mean   :  65.71   Hail   :12  
##  3rd Qu.:   7.50   Tornado: 7  
##  Max.   :1569.00   Winter : 7  
##                    (Other):19

Placing all of these events on a single chart enables comparisons.

# Build a plot based on the Top-5, showing tot_cash over the years
ggplot(top5, aes(x = year, y = tot_cash)) +
    # Mark each point in color based on its category
    geom_point(size = 5, alpha = .75, aes(color = category)) +
    # Utilize a log scale on the Y axis to manage the wide range
    scale_y_log10(breaks = c(10^7, 10^8, 10^9, 10^10, 10^11)) + 
    labs(
        x = '',
        y = 'Costs of Major Events (dollars) [log scale]',
        title = 'Economic Consequences of Top-5 Storm Events Each Year'
    ) +
    # Use a simple visual so the attention is on the data
    theme_bw()
View of Top-5 Costly Events by Category over Years

View of Top-5 Costly Events by Category over Years

The most expensive events for each year are plotted together, enabling a quick comparison through the years of which event categories appear above each other. Not surprisingly the big Storm events show frequently amongst the top events, but there are many other categories also in these cost leaders.

Reviewing these comparisons it is worth taking a moment to remember that these costs are shown on a log scale, and that there are two orders of magnitude between the values for different years and often at least an order of magnitude of separation between events within individual years. These single large events can greatly overwhelm entire categories of other event costs – and these single large events come in many forms.

Conclusion

Mother Nature does present us with a very wide variety of challenges. Preparing to manage her worst looks to be far more of an exercise in preparation for the tremendous scale of possible events and would frustrate attempts to focus down on any single kind of event.