Synopsis

This report aims to determine the severe weather events that cause the most problems in terms of public health and economic impact. The conclusion is that tornadoes are the most injurous events with heat being the main cause of death. The most damaging events in terms of cost are caused by floods impacting property and drought events impact on agriculture. To investigate these events a data set from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database was obtained. This contains data for major storms and weather events in the United States during the years from 1950 to 2011 and contains estimates on of any fatalities, injuries, and property damage. The storm data was processed and cleaned (detailed below) and then aggregated by weather event type to discover the most harmful and costly events.

Data Processing

The raw data set was originally sourced from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. For this report, the data was downloaded from this link.

Other documentation that was used to create this report were:

Load and attach required packages

The following libraries were used in creating the report.

library(readr)     # for reading in the BZipped CSV data
library(lubridate) # for handling dates
library(dplyr)     # for processing data (aggregating, filtering etc)
library(ggplot2)   # for plotting
library(stringr)   # for string replacement when cleaing the EVTYPE data
library(tidyr)     # for tidying and shaping data (gather) for plotting
library(knitr)     # for using kable to generate tables

Loading the storm data set

The zipped (BZ2) CSV file was loaded using readr::read_csv into a data set (tibble) called csv_storm_data. The file is a comma delimited file with missing values given as ‘NA’. The BGN_DATE column was used to determine when the weather event happened. The format of this date is explicity specified when reading in the data set since it is in a non-default format.

storm_data_col_types = cols(
    BGN_DATE = col_date(format = "%m/%d/%Y %H:%M:%S")
)

csv_storm_data <- read_csv("repdata-data-StormData.csv.bz2", col_types = storm_data_col_types)

This creates a large object in memory.

format(object.size(csv_storm_data) , units = "Mb")
## [1] "469 Mb"

Since we do not require all the columns, only the ones directly required for analysis are selected. These are:

  • BGN_DATE - the date the event happened
  • EVTYPE - the weather event type
  • FATALITIES - an estimate of the number of fatalities
  • INJURIES - an estimate of the number of injuries
  • PROPDMG - an estimate of the property damage in US Dollars
  • PROPDMGEXP - an exponent for multiplying PROPDMG
  • CROPDMG - an estimate of the crop damage in US Dollars
  • CROPDMGEXP - an exponent for multiplying CROPDMG
storm_data <- csv_storm_data %>%
                  select(BGN_DATE,   EVTYPE,
                         FATALITIES, INJURIES,
                         PROPDMG,    PROPDMGEXP,
                         CROPDMG,    CROPDMGEXP)

This greatly reduced the size of the data set.

format(object.size(storm_data) , units = "Mb")
## [1] "55.1 Mb"

Filtering out older years

The report looked at the number of different event types (EVTYPE) by year.

storm_year_event <- storm_data %>%
                        mutate(event_year = year(BGN_DATE)) %>%
                        group_by(event_year, EVTYPE) %>%
                        summarise(event_count = n())

# Show a count of the different EVTYPEs by year
table(storm_year_event$event_year)
## 
## 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 
##    1    1    1    1    1    3    3    3    3    3    3    3    3    3    3 
## 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 
##    3    3    3    3    3    3    3    3    3    3    3    3    3    3    3 
## 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 
##    3    3    3    3    3    3    3    3    3    3    3    3    3  160  267 
## 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 
##  387  227  169  124  119  110  120   99   51   38   46   50   46   46   46 
## 2010 2011 
##   46   46

From the table above, many more types of event were recorded from 1994 onwards.

storm_year_event %>% 
    group_by(event_year) %>% 
    summarise(n_events = sum(event_count)) %>%
    arrange(desc(event_year)) %>% 
    head(20)
## # A tibble: 20 × 2
##    event_year n_events
##         <dbl>    <int>
## 1        2011    62174
## 2        2010    48161
## 3        2009    45817
## 4        2008    55663
## 5        2007    43289
## 6        2006    44034
## 7        2005    39184
## 8        2004    39363
## 9        2003    39752
## 10       2002    36293
## 11       2001    34962
## 12       2000    34471
## 13       1999    31289
## 14       1998    38128
## 15       1997    28680
## 16       1996    32270
## 17       1995    27970
## 18       1994    20631
## 19       1993    12607
## 20       1992    13534

There is also a jump in the number of events between 1993 and 1994 and future years. Therefore, the report looked only at data from 1994 onwards to 2011.

# extract data from 1994 onwards
storm_data <- storm_data %>% filter(BGN_DATE >= "1994-01-01")

Check the first few rows:

kable(head(storm_data), caption = "Filtered storm data from 1994 onwards (first few rows)")
Filtered storm data from 1994 onwards (first few rows)
BGN_DATE EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
1995-01-06 FREEZING RAIN 0 0 0.0 NA 0 NA
1995-01-22 SNOW 0 0 0.0 NA 0 NA
1994-02-09 ICE STORM/FLASH FLOOD 0 2 0.0 NA 0 NA
1995-02-06 SNOW/ICE 0 0 0.0 NA 0 NA
1995-02-11 SNOW/ICE 0 0 0.0 NA 0 NA
1995-10-04 HURRICANE OPAL/HIGH WINDS 2 0 0.1 B 10 M

This reduces the size of the storm_data set to 702,131 rows.

Cleaning EVTYPE

1. Converting to Upper Case

The EVTYPE data has a mixture of upper and lower case. To make this consistent, the event types were converted to all upper case.

storm_data <- storm_data %>% mutate(EVTYPE = toupper(EVTYPE))

2. Matching to standard EVTYPEs

A list of standard EVTYPE is provided in the “National Weather Service Instruction document”, on “Table 2.1.1 Storm Data Event Table”.

By looking at the EVTYPE in the data set, it can be seen that there are quite a few values that do not correspond to the documented list. For example, “THUNDERSTORMW WINDS”, “TSTM”, and “THUNDERTORM WINDS” instead of “THUNDERSTORM WIND”.

Also, there are some EVTYPEs that use a different terminology to the standard list. The main ones that were found were:

  • Dry Microburst - this was matched with the standard “THUNDERSTORM WIND” by using information on Wikipedia
  • Gustnado - again, this was matched with “THUNDERSTORM WIND” by using information on Wikipedia
  • Winter Storm events were examined using the following link to the NOAA storm events database. This database is referenced by the “How To Handle Exponent Value of PROPDMGEXP and CROPDMGEXP” web page

To clean this data, a set of transformations were created to map non-standard EVTYPE values from the data set to the standard ones. This consisted of a regular expression to match the non-standard value and a corresponding string with the standard name. This was not an exact process as the regular expressions could be greedy and match too much of the EVTYPE value. I also looked at the function str_replace_all() - this was much faster, but it would match multiple expressions. This set of transformations was then used by a function to replace non-standard EVTYPEs with standard ones.

Tidy event types to match those in documentation

# A list of expressions to match and their replacement
event_translations <- read_csv(
"pattern,event_type
(.*FIRE.*),WILDFIRE
(.*HIGH WIND.*),HIGH WIND
(.*HEAVY RAIN.*),HEAVY RAIN
(.*(COLD|CHILL).*),COLD/WIND CHILL
(.*(FREEZ|FROST).*),FROST/FREEZE
(.*FLASH FLOOD.*),FLASH FLOOD
(.*FLOOD.*),FLOOD
(.*STREAM FLD.*),FLOOD
(.*HAIL.*),HAIL
(^TORNADO.*),TORNADO
(LANDSLIDE),DEBRIS FLOW
(HURRICANE.*),HURRICANE (TYPHOON)
(.*TYPHOON.*),HURRICANE (TYPHOON)
(S.*THUNDER.*),THUNDERSTORM WIND
(THUNDERSTORM.*WIN.*),THUNDERSTORM WIND
(^TSTM.*),THUNDERSTORM WIND
(^TROPICAL STORM.*),TROPICAL STORM
(.*WIND.*),STRONG WIND
(.*EX.*HEAT.*),EXCESSIVE HEAT
(.*HEAT.*),HEAT
(.*RAIN.*),HEAVY RAIN
(.*WET.*),HEAVY RAIN
(.*SURGE.*),STORM SURGE/TIDE
(.*THUNDE.*),THUNDERSTORM WIND
(.*SNOW.*),HEAVY SNOW
(.*FOG.*),DENSE FOG
(.*SURF.*),HIGH SURF
(.*HIGH TIDE.*),STORM SURGE/TIDE
(.*WINTER WEATHER.*),WINTER WEATHER
(.*MICROBURST.*),THUNDERSTORM WIND
(.*WATERSPOUT.*),WATERSPOUT
(.*GUSTNADO.*),THUNDERSTORM WIND
(.*ICE.*),WINTER STORM")

translate_event_type <- function(event_string) {

    # save the maximum number of regular expressions we have to process for each EVTYPE
    event_translations_length <- nrow(event_translations)

    unlist(
        lapply(event_string, function(x) {
            for (ix in 1:event_translations_length) {

               # Search for a matching string and replace if found
               event_tidy <- str_replace(x,
                                         event_translations[ix,1]$pattern,
                                         event_translations[ix,2]$event_type)

               # If we have replaced the text, then exit out of the for-loop
               if ( event_tidy != x ) {
                   break
               }
            }
            # Return the EVTYPE or the cleaned EVTYPE
            event_tidy
        })
    )
}

Note that there are 702,131 rows of data in storm_data. Running the translate_event_type() function on this data set took a long time (more than 10 minutes). So, the actual cleaning of EVTYPE was done on aggregated data as below.

Process Fatalities and Injuries

Events where no fatalities or injuries occurred were filtered out - again reducing the number of rows

storm_harm <- storm_data %>% filter(FATALITIES > 0 | INJURIES > 0)

To reduce the size of the data sets, separate data sets were created for Fatalities and Injuries. These were then cleaned as described above, the EVTYPE was standardised. The data set was aggregated based on the EVTYPE.

Fatalities

# Aggregate by EVTYPE to remove date data
fatal_sum <- storm_harm %>% filter(FATALITIES > 0) %>%
                            group_by(EVTYPE) %>%
                            summarise(fatalities_count = sum(FATALITIES))

# Aggregate by cleaned EVTYPE
fatal_sum_clean <- fatal_sum %>% mutate(EVTYPE = translate_event_type(EVTYPE)) %>%
                                 group_by(EVTYPE) %>%
                                 summarise(fatalities_total = sum(fatalities_count)) %>%
                                 arrange(desc(fatalities_total))

Injuries

The same steps were performed for injuries:

injury_sum <- storm_harm %>% filter(INJURIES > 0) %>%
                             group_by(EVTYPE) %>%
                             summarise(injuries_count = sum(INJURIES))

injury_sum_clean <- injury_sum %>%
                      mutate(EVTYPE = translate_event_type(EVTYPE)) %>%
                      group_by(EVTYPE) %>%
                      summarise(injuries_total = sum(injuries_count)) %>%
                      arrange(desc(injuries_total))

Process Crop and Property Damage

Similar to fatalities and injuries, the EVTYPE values were cleaned and the data sets were aggregated by that variable.

Events where no damage occurred were filtered out.

storm_damage <- storm_data %>% filter(CROPDMG > 0 | PROPDMG > 0)

To calculate damage, the cost (CROPDMG and PROPDMG) had to be multiplied by a factor that was determined by a corresponding exponent column (CROPDMGEXP and PROPDMGEXP).

There were some issues related to this calculation and “How To Handle Exponent Value of PROPDMGEXP and CROPDMGEXP” web page was very helpful.

For example, there are some storms where property damage is non-zero, but the exponent is NA

storm_damage %>% filter(PROPDMG != 0) %>% filter(is.na(PROPDMGEXP))
## # A tibble: 70 × 8
##      BGN_DATE             EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP
##        <date>              <chr>      <dbl>    <dbl>   <dbl>      <chr>
## 1  1995-03-09  FLASH FLOOD WINDS          0        0    0.41       <NA>
## 2  1995-08-19  THUNDERSTORM WIND          0        0    2.00       <NA>
## 3  1995-05-14        FLASH FLOOD          0        0   10.00       <NA>
## 4  1995-05-18        FLASH FLOOD          0        0   10.00       <NA>
## 5  1995-05-17        FLASH FLOOD          0        0   10.00       <NA>
## 6  1995-05-28 THUNDERSTORM WINDS          0        0    4.00       <NA>
## 7  1995-05-12 THUNDERSTORM WINDS          0        0    5.00       <NA>
## 8  1995-04-08               HAIL          0        0   10.00       <NA>
## 9  1995-04-08               HAIL          0        0   35.00       <NA>
## 10 1995-05-23        FLASH FLOOD          0        0   75.00       <NA>
## # ... with 60 more rows, and 2 more variables: CROPDMG <dbl>,
## #   CROPDMGEXP <chr>

There are similar cases for crop damage data.

storm_damage %>% filter(CROPDMG != 0) %>% filter(is.na(CROPDMGEXP)) %>% select(BGN_DATE, EVTYPE, CROPDMG, CROPDMGEXP)
## # A tibble: 3 × 4
##     BGN_DATE             EVTYPE CROPDMG CROPDMGEXP
##       <date>              <chr>   <dbl>      <chr>
## 1 1994-07-04               HAIL       3       <NA>
## 2 1994-04-05 THUNDERSTORM WINDS       4       <NA>
## 3 1994-04-15 THUNDERSTORM WINDS       4       <NA>

To calculate the crop and property cost, a function to translate the exponent to a number was created:

# See https://rstudio-pubs-static.s3.amazonaws.com/58957_37b6723ee52b455990e149edde45e5b6.html
# Overall conclusion for all numeric, exp 0, 1, 2, 3, 4, 5, 6, 7, 8, they are multiplier of 10.
#   e.g. 50 exp 0 is 500, 88 exp 5 is 880
# empty-character () is == multiplier of 0.

# Translate exponent character to a number
exp_translations <- c("+" = 1,
                      "H" = 100,
                      "K" = 1000,
                      "M" = 1000000,
                      "B" = 1000000000)

convert_exp <- function(exp_str) {
    # check if the EXP is a number
    exp_number <- as.integer(exp_str)
    # check if the EXP can be mapped to a number
    exp_factor <- ifelse(is.na(exp_number), exp_translations[toupper(exp_str)], 10)
    # otherwise set it to 0
    exp_factor <- ifelse(is.na(exp_factor), 0, exp_factor)
    exp_factor
}

The crop and damage costs were then calculated using the above function as follows:

storm_damage <- storm_damage %>%
                    mutate(crop_damage = convert_exp(CROPDMGEXP) * CROPDMG,
                           prop_damage = convert_exp(PROPDMGEXP) * PROPDMG)
## Warning in convert_exp(c("M", NA, "K", NA, NA, "M", NA, "M", "m", NA, NA, :
## NAs introduced by coercion
## Warning in convert_exp(c("B", "K", "M", "K", "K", "M", "K", "M", "m",
## "K", : NAs introduced by coercion

Crop Damage

A data set containing events that caused crop damage was created. This was aggregated by the event type (EVTYPE).

crop_sum <- storm_damage %>% filter(CROPDMG > 0) %>%
                group_by(EVTYPE) %>%
                summarise(crop_damage_total = sum(crop_damage))

This is now a much smaller data set (123 rows) and the translate_event_type() function can be executed and run with in a short time (a few seconds).

crop_sum_clean <- crop_sum %>%
                      mutate(EVTYPE = translate_event_type(EVTYPE)) %>%
                      group_by(EVTYPE) %>%
                      summarise(crop_damage_cost = sum(crop_damage_total)) %>%
                      arrange(desc(crop_damage_cost))

Property Damage

The same steps were performed for property damage:

prop_sum <- storm_damage %>% filter(PROPDMG > 0) %>%
                group_by(EVTYPE) %>%
                summarise(prop_damage_total = sum(prop_damage))

prop_sum_clean <- prop_sum %>%
                      mutate(EVTYPE = translate_event_type(EVTYPE)) %>%
                      group_by(EVTYPE) %>%
                      summarise(prop_damage_cost = sum(prop_damage_total)) %>%
                      arrange(desc(prop_damage_cost))

Results

Weather events that are most harmful with respect to population health

From the fatal_sum_clean and injury_sum_clean data sets, it can be seen that injuries are more common than fatalities.

Fatalities

kable(head(fatal_sum_clean),
      caption = "Fatalities by Weather Event (first few rows)",
      format.args = list(big.mark = ','))
Fatalities by Weather Event (first few rows)
EVTYPE fatalities_total
HEAT 3,016
TORNADO 1,593
FLOOD 1,443
LIGHTNING 794
STRONG WIND 645
RIP CURRENT 368

Injuries

kable(head(injury_sum_clean),
      caption = "Injuries by Weather Event (first few rows)",
      format.args = list(big.mark = ','))
Injuries by Weather Event (first few rows)
EVTYPE injuries_total
TORNADO 22,589
HEAT 9,064
FLOOD 8,623
LIGHTNING 5,116
THUNDERSTORM WIND 4,445
WINTER STORM 3,375

The two data sets were cut off data points at an appropriate quantile. In this way, only the top most harmful weather events were plotted and tabulated below. This was found by trying different quantiles and plotting the results.

quantile(fatal_sum_clean$fatalities_total)
##     0%    25%    50%    75%   100% 
##    1.0    2.0   12.0  112.5 3016.0
quantile(injury_sum_clean$injuries_total)
##      0%     25%     50%     75%    100% 
##     1.0     5.0   129.0   488.5 22589.0

Plot a facet grid between injuries and fatalities

fatal_cut_off  <- quantile(fatal_sum_clean$fatalities_total)[4]
injury_cut_off <- quantile(injury_sum_clean$injuries_total)[4]

# Join the two data sets into one
storm_harm_total <- full_join(fatal_sum_clean, injury_sum_clean, by = c("EVTYPE") )

# Combine fatalities and injuries into one column
storm_harm_type <- storm_harm_total %>%
        select(event = EVTYPE,
               Fatalities = fatalities_total,
               Injuries   = injuries_total) %>%
        gather(harm_type, harm_total, -event) %>%
        filter( (harm_type == 'Fatalities'  & harm_total >= fatal_cut_off) |
                (harm_type == 'Injuries'    & harm_total >= injury_cut_off) )

# Plot results - note that reoder() does not work as expected - perhaps due to facet_grid()
storm_harm_type %>%
    ggplot(aes(reorder(event, harm_total), harm_total, fill = event) ) +
    geom_bar(stat = "identity") +
    coord_flip() +
    theme(legend.position = "none") +
    facet_grid(harm_type ~ .) +
    labs(x = "Weather Event Type",
         y = "Estimated accounts of injuries and fatalities") +
    ggtitle("Most harmful weather events between 1994 and 2011")

Tabulated results show the most harmful events.

harm <- as.data.frame(storm_harm_type %>% arrange(desc(harm_total)))
kable(head(harm,10), caption = "Most harmful weather events ordered by most number of cases first", format.args = list(big.mark = ','))
Most harmful weather events ordered by most number of cases first
event harm_type harm_total
TORNADO Injuries 22,589
HEAT Injuries 9,064
FLOOD Injuries 8,623
LIGHTNING Injuries 5,116
THUNDERSTORM WIND Injuries 4,445
WINTER STORM Injuries 3,375
STRONG WIND Injuries 3,050
HEAT Fatalities 3,016
TORNADO Fatalities 1,593
WILDFIRE Injuries 1,458

So, tornadoes cause the most injuries, while heat events cause the more fatalities.

Weather events that have the greatest economic consequences

From the prop_sum_clean and prop_sum_clean data sets, it can be seen that damage to property is larger than crop damage.

Crop Damage

kable(head(crop_sum_clean), caption = "Crop damage per weather event in US Dollars (first few rows, ordered by most expensive first)", format.args = list(big.mark = ','))
Crop damage per weather event in US Dollars (first few rows, ordered by most expensive first)
EVTYPE crop_damage_cost
DROUGHT 13,922,066,000
FLOOD 7,097,621,550
HURRICANE (TYPHOON) 5,505,617,800
WINTER STORM 5,034,058,300
HAIL 3,068,457,700
FROST/FREEZE 1,886,061,000

Property Damage

kable(head(prop_sum_clean), caption = "Property damage per weather event in US Dollars  (first few rows, ordered by most expensive first)", format.args = list(big.mark = ','))
Property damage per weather event in US Dollars (first few rows, ordered by most expensive first)
EVTYPE prop_damage_cost
FLOOD 160,707,771,901
HURRICANE (TYPHOON) 85,200,910,010
STORM SURGE/TIDE 47,844,150,500
TORNADO 25,625,141,797
HAIL 15,622,770,827
STRONG WIND 8,945,965,810

For generating a plot of crop and property damage, the data sets were filtered based on a appropriate value from the above quantiles. This was found by trying different quantiles and plotting the results.

quantile(crop_sum_clean$crop_damage_cost)
##          0%         25%         50%         75%        100% 
##       10000     9523545   361834110  1273116700 13922066000
quantile(prop_sum_clean$prop_damage_cost)
##           0%          25%          50%          75%         100% 
##           50        30000       712630    144062000 160707771901
crop_cut_off_cost <- quantile(crop_sum_clean$crop_damage_cost)[3]
prop_cut_off_cost <- quantile(prop_sum_clean$prop_damage_cost)[4]

A facet grid between crop and property damage was plotted.

storm_damage_total <- full_join(crop_sum_clean, prop_sum_clean, by = c("EVTYPE") )

storm_damage_type <- storm_damage_total %>%
        select(event    = EVTYPE,
               Crop     = crop_damage_cost,
               Property = prop_damage_cost) %>%
        gather(damage_type, damage_cost, -event) %>%
        filter( (damage_type == 'Crop'     & damage_cost >= crop_cut_off_cost) |
                (damage_type == 'Property' & damage_cost >= prop_cut_off_cost) )

storm_damage_type %>%
    ggplot(aes(reorder(event, damage_cost), damage_cost/1000000, fill = event) ) +
    geom_bar(stat = "identity") +
    coord_flip() +
    theme(legend.position = "none") +
    facet_grid(damage_type ~ .) +
    labs(x = "Weather Event Type",
         y = "Estimated cost of crop and property damage in US Dollars (Millions)") +
    ggtitle("Highest economic impact weather events between 1994 and 2011")

damage <- as.data.frame(storm_damage_type %>% arrange(desc(damage_cost)))
kable(head(damage,10),
      caption = "Top most damage causing weather events by type of damage in US Dollars",
      format.args = list(big.mark = ','))
Top most damage causing weather events by type of damage in US Dollars
event damage_type damage_cost
FLOOD Property 160,707,771,901
HURRICANE (TYPHOON) Property 85,200,910,010
STORM SURGE/TIDE Property 47,844,150,500
TORNADO Property 25,625,141,797
HAIL Property 15,622,770,827
DROUGHT Crop 13,922,066,000
STRONG WIND Property 8,945,965,810
WILDFIRE Property 7,882,628,500
TROPICAL STORM Property 7,713,885,550
FLOOD Crop 7,097,621,550

So, floods have the greatest economic impact on property, while droughts have the greatest economic impact on crops.

Notes

Created using RStudio with the following version of R:

R.version
##                _                           
## platform       x86_64-apple-darwin16.5.0   
## arch           x86_64                      
## os             darwin16.5.0                
## system         x86_64, darwin16.5.0        
## status                                     
## major          3                           
## minor          4.0                         
## year           2017                        
## month          04                          
## day            21                          
## svn rev        72570                       
## language       R                           
## version.string R version 3.4.0 (2017-04-21)
## nickname       You Stupid Darkness