Reproduceable Research - NOAA Storms Data

Synopsis

In this report we look at data from the National Oceanographic & Atmospheric Administration (NOAA) storm data this data analysis intends to answer these questions:

Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?

To answer these questions we standardized and summarized the data to compare the impact of the various storm event types. We found that excessive heat and tornadoes are most harmful to human life and that hurricanes and storm tides have the greatest economic impact.

Data Sources

Storm event data is provided by the NOAA at this location. The combined event data starting from 1950 is included in the zip file used in the code below.

48 event types have been used from 1996 to present are provided by the NOAA at this location. The event type values provided in a Word document have been stored in a .csv file for use here.Prior to 1996 the only three event types used were “Tornado”, “Thunderstorm Wind” and “Hail”. A larger set of possible values might have changed the distribution events by event type over this period.

Data Processing

Reading the NOAA data file

Reading the compressed zip file - this includes data from 1950:

setwd("C:/Users/IBM_ADMIN/Documents/Self/Training/JohnsHopkinsDataScience/05 - Reproducable Research/Prog 2")
dt <- read.csv("repdata_data_StormData.csv.bz2", stringsAsFactors = FALSE, strip.white = TRUE)
dim(dt)
## [1] 902297     37
str(dt)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  "" "" "" "" ...
##  $ BGN_LOCATI: chr  "" "" "" "" ...
##  $ END_DATE  : chr  "" "" "" "" ...
##  $ END_TIME  : chr  "" "" "" "" ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  "" "" "" "" ...
##  $ END_LOCATI: chr  "" "" "" "" ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ WFO       : chr  "" "" "" "" ...
##  $ STATEOFFIC: chr  "" "" "" "" ...
##  $ ZONENAMES : chr  "" "" "" "" ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  "" "" "" "" ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

Event type

Both questions focus on the event type determined for each storm record in the database. A quick check confirms there are no missing values for this variable but that the number of event types used in the data is much larger than the official 48 event types.

sum(is.na(dt$EVTYPE))
## [1] 0
length(table(dt$EVTYPE))
## [1] 985

Retrieve the official event type values

Build a vector of the 48 official event types can be built from the source identified earlier:

evType <- read.csv("eventtypes.csv", header=FALSE, stringsAsFactors = FALSE, strip.white = TRUE)
names(evType) <- c("eventType", "zone")
#Take out leading blanks and upshift
trim.leading <- function (x)  sub("^\\s+", "", x)
ev <- toupper(trim.leading(evType$eventType))

Prepare impact amounts

Both property and crop damage are represented by and amount value and a scalar code such as “K” meaning thousand or “8” meaning multiply by ten to the eighth power. We compute the amount for property and crop damage: $^8 $

#Extract the year
dt$dat <- as.Date(dt$BGN_DATE, '%m/%d/%Y')
dt$yr <- year(dt$dat)

dt$PROPDMGEXP <- toupper(dt$PROPDMGEXP)
dt$CROPDMGEXP <- toupper(dt$CROPDMGEXP)

dt$propExp <- 1
dt$propExp [dt$PROPDMGEXP == "1"] <- 10
dt$propExp [dt$PROPDMGEXP == "2"] <- 100
dt$propExp [dt$PROPDMGEXP == "3"] <- 1000
dt$propExp [dt$PROPDMGEXP == "4"] <- 10000
dt$propExp [dt$PROPDMGEXP == "5"] <- 100000
dt$propExp [dt$PROPDMGEXP == "6"] <- 1000000
dt$propExp [dt$PROPDMGEXP == "7"] <- 10000000
dt$propExp [dt$PROPDMGEXP == "8"] <- 100000000
dt$propExp [dt$PROPDMGEXP == "B"] <- 1000000000
dt$propExp [dt$PROPDMGEXP == "H"] <- 100
dt$propExp [dt$PROPDMGEXP == "K"] <- 1000
dt$propExp [dt$PROPDMGEXP == "M"] <- 1000000

dt$propDamage <- dt$PROPDMG * dt$propExp

dt$cropExp <- 1
dt$cropExp [dt$CROPDMGEXP == "2"] <- 100
dt$cropExp [dt$CROPDMGEXP == "K"] <- 1000
dt$cropExp [dt$CROPDMGEXP == "M"] <- 1000000
dt$cropExp [dt$CROPDMGEXP == "B"] <- 1000000000

dt$cropDamage <- dt$CROPDMG * dt$cropExp

dt$damage <- dt$propDamage + dt$cropDamage

Looking at the US Dollar impact of storm damage

summary(dt$propDamage)

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.000e+00 0.000e+00 0.000e+00 4.746e+05 5.000e+02 1.150e+11

summary(dt$cropDamage)

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.000e+00 0.000e+00 0.000e+00 5.442e+04 0.000e+00 5.000e+09

Standardize the event types

There are a few event types which describe themselves as “summary” - these won’t match any of the event types so we remove them. From 1950 to 1995 there were only three event types used (sometimes not even all three in a given year) so we won’t learn enough from this earlier period - these are also removed.

dt$event <- toupper(trim.leading(dt$EVTYPE))
dt$mevent <- ""

#remove"Summary" event types - they won't match any event type
dt <- dt [!grepl("^SUMMARY", dt$event), ]

#remove data prior to 1996 - we won't learn about different event types
dt <- dt [dt$yr >= 1996, ]

Match event type values to the standard codes

The remaining more than 900 event type values can be largely matched using a character matching routine in R named “amatch”.

#match the event type from the database to the official event type list (amatch)
dt$ematch <- amatch(dt$event, ev, nomatch=0, maxDist = 6)
dt$mevent [dt$ematch > 0] <- ev[dt$ematch]
length(dt$ematch [dt$ematch != 0]) / length(dt$ematch)

## [1] 0.9873411

length(dt$mevent [dt$mevent != ""]) / length(dt$mevent)

## [1] 0.9873411

How well did the automated matching work? After a few trials with the maxDist parameter, the matching rate seems acceptable. More than 98% of the event types are matched. This approach has several advantages over observing patterns and manually creating string manipulation commands. Event types from the database still unmatched are combined in a synthetic category for “all other”.

#calculate % of unmatched damage
sum(dt$damage [dt$mevent == ""]) / sum(dt$damage)
## [1] 0.01241253
#calculate % of unmatched fatality
sum(dt$FATALITIES [dt$mevent == ""]) / sum(dt$FATALITIES)
## [1] 0.02874485
#calculate % of unmatched injury
sum(dt$INJURIES [dt$mevent == ""]) / sum(dt$INJURIES)
## [1] 0.0189392

#how many records remain unmatched? 
length(table(dt$event [dt$mevent == ""]))
## [1] 197
#Mark event types not matched as 'all other' 
dt$mevent[dt$mevent == ""] <- "ALL-OTHER-TYPES"

Summarizing the data

The data can be then summarized by event type and the highest impact even types for fatality, injury, property damage and crop damage can be extracted (separate sets of event types for each of the four measures):

nt <- select(dt, mevent, damage, propDamage, cropDamage, FATALITIES, INJURIES)
gt <- group_by(nt, mevent)
dsumm <- summarise(group_by(gt, mevent),
                   sum(damage),
                   sum(propDamage), 
                   sum(cropDamage), 
                   sum(FATALITIES), 
                   sum(INJURIES)
                   )
names(dsumm) <- c("evType", "damage", "propDamage", "cropDamage", "fatalities", "injuries")
sl <- 6
#Select the top property damage event type
dsumm <- arrange(dsumm, desc(propDamage))
sumProp <- data.frame(typ="Property", level=dsumm$propDamage[1:sl], evType=dsumm$evType[1:sl])
#and crop damage event type
dsumm <- arrange(dsumm, desc(cropDamage))
sumCrop <- data.frame(typ="Crops",    level=dsumm$cropDamage[1:sl], evType=dsumm$evType[1:sl])
#and top event type for fatalities
dsumm<- arrange(dsumm, desc(fatalities))
sumFatl <- data.frame(typ="Fatalities",level=dsumm$fatalities[1:sl], evType=dsumm$evType[1:sl])
#and injuries
dsumm<- arrange(dsumm, desc(injuries))
sumInju <- data.frame(typ="Injuries",level=dsumm$injuries[1:sl], evType=dsumm$evType[1:sl])
d10 <- rbind(sumFatl, sumInju, sumProp, sumCrop)

In addition to looking at the summary over all years it might be instructive to summarize the damage data by year to view impact of event types varies over time. Here we select event types if they were in the top 10% of damage in any given year:

ny <- select(dt, mevent, yr, damage, propDamage, cropDamage, FATALITIES, INJURIES)
gy <- group_by(ny, yr, mevent)
dsumy <- summarise(group_by(gy, yr, mevent),
                   sum(damage),
                   sum(propDamage), 
                   sum(cropDamage), 
                   sum(FATALITIES), 
                   sum(INJURIES)
                   )
names(dsumy) <- c("year", "evType", "damage", "propDamage", "cropDamage", "fatalities", "injuries")

#Select data for event types if they were in the top 10% in any year
sumyFtype <- dsumy$evType [dsumy$fatalities > quantile(dsumy$fatalities, .95)]
sumyFatalities <- dsumy [(dsumy$evType %in% sumyFtype), c("year", "evType", "fatalities")]

#Select data for event types if they were in the top 10% in any year
sumyDtype <- dsumy$evType [dsumy$damage > quantile(dsumy$damage, .95)]
sumyDamage <- dsumy [(dsumy$evType %in% sumyDtype), c("year", "evType", "damage")]

Results

From the summarized data across all years we can see the highest impact event types across all years.

ggplot(d10, aes(x=evType, y=level)) + 
        geom_bar(stat="identity") +
        facet_wrap(~typ, scales="free", ncol=1) + 
        ggtitle("Fatalities, Injuries, Property Damage and Crop Damage (1996-2011)") + 
        xlab("Highest impact event types - Property and Crop damage in US Dollars")

Fatalities from storm events can be seen over time:

ggplot(sumyFatalities, aes(x=year, y=fatalities, group=evType, colour=evType, fill=evType)) + 
        geom_bar(stat="identity", position="dodge") + 
        ggtitle("Fatalities from Storms by top event types (1996-2011)") +
        ylab("Number of Fatalities")

The damage by event type can be seen over time:

ggplot(sumyDamage, aes(x=year, y=damage, group=evType, colour=evType, fill=evType)) + 
        geom_bar(stat="identity", position="dodge") + 
        ggtitle("Property + Crop damage produced by top event types (1996-2011)") +
        ylab("Damage in US Dollars")

Conculsions

Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

Excessive heat and tornadoes have more fatalities than other event types.
Tornadoes cause the most injuries.
There is a broad pattern to the number of fatalities over time with a notable large number of deaths due to tornado and another due to excessive heat.

Across the United States, which types of events have the greatest economic consequences?

One flood, from hurricane Katrina in 2005, caused damage that dwarfs all other storm damage events.
Aside from that one flood, hurricanes cause the greatest economic impact with storm surge/tide the next greatest impact.
There is no steady state pattern. Year to year variation in the type and value of storm damage is quite significant. Large individual events greatly effect damage totals.
Crop damage even for the most significant event types is somewhat smaller than property damage with hurricanes and storm/surge times causing the most damage.

Appendix

Thanks to Robert Carman for the suggestion in the discussion forum to use the amatch routine for event type matching. After trying a few different parameters this approach seemed to be effective and reproduceable.

My session info:

sessionInfo()
## R version 3.2.1 (2015-06-18)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggplot2_2.0.0    dplyr_0.4.3      lubridate_1.5.0  stringdist_0.9.4
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.2      digest_0.6.8     assertthat_0.1   plyr_1.8.3      
##  [5] grid_3.2.1       R6_2.1.1         gtable_0.1.2     DBI_0.3.1       
##  [9] formatR_1.2.1    magrittr_1.5     scales_0.3.0     evaluate_0.8    
## [13] stringi_1.0-1    lazyeval_0.1.10  rmarkdown_0.8.1  labeling_0.3    
## [17] tools_3.2.1      stringr_1.0.0    munsell_0.4.2    yaml_2.1.13     
## [21] parallel_3.2.1   colorspace_1.2-6 htmltools_0.2.6  knitr_1.11

```