Synopsis

This analysis explores the NOAA Storm Database from 1950 to 2011 to determine 1) acrosss the United States, which weather events are the most harmful to the health of the population and 2) which have the greatest economic impact. Tornados and heat emerge with the highest cumulative fatalities, while hurricanes and ice storms generate the greatest economic damage. The analysis first identifies all weather events that resulted in at least one fatality, cleans that subset so it aligns with the standard NOAA weather event coding, and then summarizes the events in the 90th percentile. The analysis of the economic impact follows a similar process, with an additional step that combines the property damage cost with crop damage cost in $ billions. The summary of the economic impact focuses on the events in the 95th percentile of total damage cost in billions of dollars. The biggest challenge for both analyses is the data cleaning. The raw information comes from a variety of sources including: county, state and federal emergency management officials, local law enforcement, skywarn spotters, NWS damage surveys, newspaper clipping services, the insurance industry and the general public. As a result the coding of event types varies wildly from NOAA standards and needs reconciling.

Results

Tornados are the most deadly of the events tracked by NOAA, leading all others with a wide margin on cumulative deaths and total number of deadly events. Heat, and excessive heat, follow tornados in overall death count, but take the lead on the highest death rate associated with a single event.

The greatest economic impact, however, comes from hurricanes (typhoons).Hurricanes lead all other events in terms of the cumulative cost since 1950. Thunderstorm and Wind events, however, are far more commonly recorded since 1950 - but much less costly. Ice Storms share the lead with Hurricanes (Typhoons) on the maximum damages for any single event.

Data Processing

See http://www.ncdc.noaa.gov/stormevents/details.jsp?type=eventtype to learn more about the 55 official event types.

library(plyr)
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.3.1
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(stringdist)
## Warning: package 'stringdist' was built under R version 3.3.1
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.3.1
library(lubridate)
## Warning: package 'lubridate' was built under R version 3.3.1
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:plyr':
## 
##     here
## The following object is masked from 'package:base':
## 
##     date
library(tidyr)
## Warning: package 'tidyr' was built under R version 3.3.1
library(stringr)
library(formattable)
## Warning: package 'formattable' was built under R version 3.3.1
###  Load the data, explore data structure
NOAA <- read.csv("repdata%2Fdata%2FStormData.csv.bz2")
str(NOAA)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 35 levels "","  N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_DATE  : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_TIME  : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WFO       : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ZONENAMES : Factor w/ 25112 levels "","                                                                                                                               "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

The official NOAA guide indicates that there are 55 levels for EVTYPE. However, in the NOAA event database file there are 985 EVTYPE levels! This discrepancy reflects the human error present in the coding process. Some of these levels deviate greatly from the list, others simply have extra spaces and small typos. The following code helps refine the descrepancy in the Event Type coding.

NOAA55 <- c("Astronomical Low Tide", "Avalanche", "Blizzard", "Coastal Flood",
              "Cold/Wind Chill", "Debris Flow", "Dense Fog", "Dense Smoke", 
              "Drought","Dust Devil", "Dust Storm", "Excessive Heat",
              "Extreme Cold/Wind Chill", "Flash Flood", "Flood", "Frost/Freeze",
              "Funnel Cloud", "Freezing Fog", "Hail", "Heat", "Heavy Rain", 
              "Heavy Snow", "High Surf", "High Wind", "Hurricane (Typhoon)", 
              "Ice Storm", "Lake-Effect Snow", "Lakeshore Flood", "Lightning",
              "Marine Dense Fog", "Marine Hail", "Marine Heavy Freezing Spray", 
              "Marine High Wind", "Marine Hurricane/Typhoon", "Marine Lightning",
              "Marine Strong Wind", "Marine Thunderstorm Wind", 
              "Marine Tropical Depression","Marine Tropical Storm", "Rip Current",
              "Seiche", "Sleet", "Sneaker Wave", "Storm Surge/Tide", "Strong Wind",
              "Thunderstorm Wind", "Tornado", "Tropical Depression",
              "Tropical Storm", "Tsunami", "Volcanic Ash", "Waterspout",
              "Wildfire","Winter Storm", "Winter Weather")

### Transform NOAA55 into a df, and process to make it easier
### to match up with EVTYPE in the data set
NOAA55              <- as.data.frame(NOAA55)
NOAA55$nopunct55    <- gsub("[/()]", " ", NOAA55$NOAA55)
NOAA55$nospaces55   <- as.factor(tolower(gsub("\\s", "", NOAA55$nopunct55)))
NOAA55$match55      <- c(1:55)
NOAA55              <- select(NOAA55, -nopunct55)

### Clean 985 EVTYPE factors and reduce to 701 factors
noaa_evtype           <- as.factor(levels(NOAA$EVTYPE))
noaa_evtype           <- as.data.frame(noaa_evtype)
noaa_evtype$nopunct   <- gsub("[]$*+.?[^{|\\#%&~_/<=>'!,:;`\"()}@-]", " ", noaa_evtype$noaa_evtype)
noaa_evtype$cleaned   <- tolower(gsub("^([ \t\n\r\f\v]+)|([ \t\n\r\f\v]+)$", "", 
                                      noaa_evtype$nopunct))
noaa_evtype$cleaned   <- gsub("[0-9]", "", noaa_evtype$cleaned)
noaa_evtype$nospaces  <- tolower(gsub("\\s", "", noaa_evtype$cleaned))
noaa_evtype$nospaces  <- as.factor(noaa_evtype$nospaces)
noaa_evtype           <- select(noaa_evtype, -nopunct, -cleaned)

Fatality Analysis

This code pulls all of the events that had fatalities. Then it recodes the events to match the NOAA 55 events. A key step in this process is isolating the EVTYPE factors in the 90th percentile of the cumulative weather related fatalities. This step is important because it simplifies the reconciliation of the 985 EVTYPEs recorded wtih the official NOAA 55 events.

### Extract fatal events from the NOAA data base
fatal_events_raw <- NOAA %>% 
  select(EVTYPE, FATALITIES) %>% 
  filter(FATALITIES>0)

### Match fatal events EVTYPE to the official 55 NOAA event codes
### as closely as possible. This brings the EVTYPE factors down
### from 985 to 157.
fatal_events_raw1          <- merge(fatal_events_raw, noaa_evtype, 
                                    by.x="EVTYPE",
                                    by.y="noaa_evtype", 
                                    all.x = TRUE)
fatal_events_raw1$nospaces <- as.factor(as.character(fatal_events_raw1$nospaces))
fatal_events_raw1$match    <- amatch(fatal_events_raw1$nospaces,
                                     NOAA55$nospaces55, 
                                     maxDist = 3)
fatal_events_raw2          <- merge(fatal_events_raw1, NOAA55, 
                                    by.x = "match", 
                                    by.y = "match55", 
                                    all.x = TRUE)

## Summarize the deathtoll for 157 event types
fatal_events_summary      <- fatal_events_raw2 %>%
  group_by(nospaces)%>%
  summarize(count = n(),
            max_death_single_event = max(FATALITIES, na.rm = TRUE),
            total_fatalities_1950_2011 = sum(FATALITIES, na.rm=TRUE)
  )  
fatal_events_summary$match <- amatch(fatal_events_summary$nospaces, 
                                     NOAA55$nospaces55, maxDist=3)
fatal_events_summary1      <- merge(fatal_events_summary, NOAA55, 
                                    by.x="match", by.y="match55", all.x=TRUE )

### Isolate the most deadly events among the 157 EVTYPES - now the
### factors are reduced to just 14.
fatality_quantile  <- .90
fatality_count     <- quantile(fatal_events_summary1$total_fatalities_1950_2011, 
                          probs=fatality_quantile)
fatal_topcounts    <- filter(fatal_events_summary1,
                    total_fatalities_1950_2011 > fatality_count)


### Correct event coding for the 14 most deadly EVTYPEs 
### to match the NOAA 55 list
fatal_topcounts$nospaces  <- gsub("([a-z]+|^)extremecold([a-z]+|$)", 
                                  "coldwindchill",    fatal_topcounts$nospaces)
fatal_topcounts$nospaces  <- gsub("([a-z]+|^)heatwave([a-z]+|$)", 
                                    "heat",           fatal_topcounts$nospaces)
fatal_topcounts$nospaces  <- gsub("([a-z]+|^)tstm([a-z]+|$)",
                                    "thunderstormwind", fatal_topcounts$nospaces)


fatal_topcounts$nospaces  <- as.factor((fatal_topcounts$nospaces))
fatal_topcounts           <- select(fatal_topcounts, -match, -NOAA55, -nospaces55)
fatal_topcounts$match     <- amatch(fatal_topcounts$nospaces, 
                                     NOAA55$nospaces55, maxDist=3)
fatal_topcounts       <- merge(fatal_topcounts, NOAA55, 
                               by.x = "match", by.y="match55",
                               all.x = TRUE)

The summary calculates the total death count from 1950 to 2011, the # of deadly events, and the maximum death count in a single weather event for each type.

### Calculate summary statistics for most deadly events
fatal_events_final <- fatal_topcounts %>%
  group_by(NOAA55)%>%
  summarize (total_events_with_fatalities=sum(count),
             highest_deathtoll_in_single_event = max(max_death_single_event),
             cumulative_fatalities_1950_2011 = sum(total_fatalities_1950_2011)
             )

fatal_events_final <- fatal_events_final %>% 
  arrange(cumulative_fatalities_1950_2011)

maxdeath       <- max(fatal_events_final$
                  cumulative_fatalities_1950_2011)
maxdeathevents <- max(fatal_events_final$
                        total_events_with_fatalities)
maxdeathsper   <- max(fatal_events_final$
                        highest_deathtoll_in_single_event)

### Plot fatality analysis
par(mfrow=c(1,3), mar=c(4,4,1,1), oma=c(1,6,4,2))
barplot(height= fatal_events_final$cumulative_fatalities_1950_2011, 
        names = fatal_events_final$NOAA55,
        horiz = TRUE,
        main  = "Cumulative Deaths", 
        cex.axis = 1, cex.main=1, las = 2)
barplot(height= fatal_events_final$total_events_with_fatalities, 
        horiz = TRUE,
        main  = "# Deadly Events", 
        cex.axis = 1, cex.main=1, cex.lab=.5, las=2)

barplot(height= fatal_events_final$highest_deathtoll_in_single_event, 
        horiz = TRUE,
        main  = "Max Deaths/Event", 
        cex.axis = 1,cex.main=1, cex.lab=.5, las=2)
title("NOAA Event Deathtoll from 1950 to 2011 by Event Type", outer=TRUE)

Analysis of Economic Impact of Weather Events

The analysis of the cost of the damage follows the same process as the fatality analysis, with a few extra steps to standardize the value calculation and total the value of damage to property and crops into one total cost of damage.

### Extract the damage data, standardize cost variables, and
### summarize by EVTYPE
damage_events_raw <- NOAA %>% 
  select(EVTYPE, PROPDMG:CROPDMGEXP) %>% 
  filter(PROPDMG>0|CROPDMG>0)%>%
  mutate(propcost=(as.character(gsub("[0-8]", "", gsub("[-?+h]","", tolower(PROPDMGEXP))))))%>%
  mutate(cropcost=(as.character(gsub("[0-2]", "", gsub("[?]", "",   tolower(CROPDMGEXP))))))

costcode       <- c("", "k", "m", "b")
unitsp         <- c(1, 1000, 1000000, 1000000000)
unitsc         <- unitsp
props_units    <- data.frame(costcode, unitsc)
crops_units    <- data.frame(costcode, unitsp)
  
damage_events_raw  <- merge(damage_events_raw, crops_units, 
                            by.x = "cropcost", by.y = "costcode", 
                            all.x = TRUE)
damage_events_raw   <- merge(damage_events_raw, props_units, 
                            by.x = "propcost", by.y=  "costcode", 
                            all.x = TRUE)

damage_events_raw$propclaims   <- damage_events_raw$unitsp *
  damage_events_raw$PROPDMG
damage_events_raw$cropclaims   <- damage_events_raw$unitsc *
  damage_events_raw$CROPDMG
damage_events_raw$total_claims <- damage_events_raw$propclaims +
  damage_events_raw$cropclaims

damage_events_raw$total_claims[is.na(damage_events_raw$total_claims)] <-0


damage_events_summary <- damage_events_raw %>%
  group_by(EVTYPE)%>%
  summarize(count = n(),
            maxclaim = max(total_claims, na.rm = TRUE),
            totalclaims1950_2011 = sum(total_claims, na.rm=TRUE)
  )

The next chunk of code will recode the events so they match the NOAA 55 standards.

### Refine codes, bringing list of 985 factors in alignment with 55 NOAA standards
damage_events_wip       <- merge(damage_events_summary, 
                                 noaa_evtype, by.x="EVTYPE", by.y="noaa_evtype")
damage_events_wip$match <- amatch(damage_events_wip$nospaces, 
                                  NOAA55$nospaces55, maxDist=3)
damage_events_wip1      <- merge(damage_events_wip, 
                                 NOAA55, by.x="match", by.y="match55", all.x=TRUE )

### Introduce a threshold for isolating the events that are in the
### 95th percentile ### in terms of the total value of the damages
### to property and crops combined.
damage_quantile <- .95   
Claims_value    <- quantile(damage_events_wip1$totalclaims1950_2011,
                            probs=damage_quantile)
topcounts       <- filter(damage_events_wip1,totalclaims1950_2011>Claims_value)

### Further refine the event coding to match the 55 NOAA events
topcounts$nospaces  <- gsub("([a-z]+|^)fire([a-z]+|$)", 
                            "wildfire", 
                            topcounts$nospaces)
topcounts$nospaces  <- gsub("([a-z]+|^)urban([a-z]+|$)|([a-z]+|^)heavyrains([a-z]+|$)",
                           "heavyrain", 
                           topcounts$nospaces)
topcounts$nospaces  <- gsub("([a-z]+|^)flash([a-z]+|$)|([a-z]+|^)river([a-z]+|$)", 
                           "flashflood", 
                           topcounts$nospaces)
topcounts$nospaces  <- gsub("([a-z]+|^)hurricane([a-z]+|$)|([a-z]+|^)typhoon([a-z]+|$)",
                           "hurricanetyphoon", 
                           topcounts$nospaces)
topcounts$nospaces  <- gsub("([a-z]+|^)tornado([a-z]+|$)", 
                            "tornado", 
                            topcounts$nospaces)
topcounts$nospaces  <- gsub("([a-z]+|^)tropical([a-z]+|$)", 
                            "tropicalstorm", 
                            topcounts$nospaces)
topcounts$nospaces  <- gsub("([a-z]+|^)thunderstorm([a-z]+|$)|([a-z]+|^)tstm([a-z]+|$)",
                           "thunderstormwind", 
                           topcounts$nospaces)
topcounts$nospaces  <- gsub("([a-z]+|^)cold([a-z]+|$)", 
                            "extremecoldwindchill",
                            topcounts$nospaces)

topcounts$nospaces  <- as.factor((topcounts$nospaces))
topcounts           <- select(topcounts, -match, -NOAA55, -nospaces55)
topcounts$match     <- amatch(topcounts$nospaces, 
                              NOAA55$nospaces55, maxDist=3)
topcounts           <- merge(topcounts, NOAA55, 
                               by.x = "match", by.y="match55",
                               all.x = TRUE)

The following code calcuates the total value of the weather events with the greatest economic impact, the # of the events, and the maximum cost of any event by type of event.

### Summarize the damage data by event
damage_events_final <- topcounts %>%
  select(match, EVTYPE, count, maxclaim, totalclaims1950_2011, nospaces, NOAA55,
         nospaces55)%>%
  group_by(NOAA55)%>%
  summarize (total_number_of_events=sum(count),
            highest_cost_single_event = max(maxclaim),
            cumulative_value_of_claims_1950_2011 = sum(totalclaims1950_2011))
damage_events_final <- damage_events_final %>%
  arrange(cumulative_value_of_claims_1950_2011)%>%
  mutate(cumulative_value_of_claims_1950_2011 = 
           cumulative_value_of_claims_1950_2011/1000000000) %>%
  mutate(highest_cost_single_event = 
           highest_cost_single_event/1000000000)%>%
  mutate(total_number_of_events = 
           total_number_of_events/1000)

maxdamages        <- max(damage_events_final$
                           cumulative_value_of_claims_1950_2011)
mostcommondamages <- max(damage_events_final$
                           total_number_of_events)
maxdamagesper     <- max(damage_events_final$
                           highest_cost_single_event)

The code below creates a panel plot of the 3 key metrics for the events in the 95th percentile of total economic cost.

### Plot the results of damage analysis
par(mfrow=c(1,3), mar=c(4,4,1,1), oma=c(1,6,4,2))
barplot(height = damage_events_final$cumulative_value_of_claims_1950_2011, 
        names  = damage_events_final$NOAA55,
        horiz  = TRUE,
        main   = "Cumulative Cost $B", 
        cex.axis = 1, cex.main=1, cex.lab=1, las=2)
barplot(height = damage_events_final$total_number_of_events, 
        horiz  = TRUE,
        main   = "#K of Damaging Events ", 
        cex.axis = 1, cex.main=1, cex.lab=.5, las=2)
barplot(height = damage_events_final$highest_cost_single_event, 
        horiz  = TRUE,
        main   = "Max $B Damages/Event", 
        cex.axis = 1,cex.main=1, cex.lab=.5, las=2)
title("NOAA Economic Impact,  from 1950 to 2011 by Event Type", outer=TRUE)