Identifying the Most Damaging Types of Storm Events Across the US

1. Synopsis

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. In this report we explore the US NOAA storm database to identify the most damaging types of storms across the US. We begin by identifying the top 10 most frequent types of storm events in the US from 1996 to 2011. We then assess all 10 storm types’ impact, first on population health by reporting casualty distributions, then on the economy by reporting crop and property damage cost distributions. We conclude with a brief commentary on the shape of the distributions.

2. Data Processing

We downloaded the data from the course website and read the bzip2-compressed CSV file directly with read.csv:

setwd("~/coursera/Reproducible Research/Week4")
d<-read.csv("StormData.csv.bz2")
str(d)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 35 levels "","  N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_DATE  : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_TIME  : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WFO       : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ZONENAMES : Factor w/ 25112 levels "","                                                                                                                               "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

After examining the supporting documentation on the NOAA website, we determined that crop and property damage records were more consistent and that event types were more precise by 1996. We decided to focus our efforts on summarizing this more accurate subset of the data.

# subset the data frame to events beginning in 1996 or later
d$BGN_DATE<-as.Date(as.character(d$BGN_DATE),"%m/%d/%Y")
d0<-subset(d, BGN_DATE>=as.Date("1996-01-01"),
           select=c(BGN_DATE,EVTYPE,FATALITIES,INJURIES,PROPDMG,PROPDMGEXP,
                    CROPDMG,CROPDMGEXP))
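
As a quick check of the date handling, the following sketch parses a few strings in the BGN_DATE format and applies a 1996 cutoff. The sample strings are made up for illustration and are not taken from the dataset; the point is that as.Date matches only the date part of the format and ignores the trailing time component.

```r
# illustrative BGN_DATE-style strings (made up, not from the storm data)
raw <- c("12/31/1995 0:00:00", "1/1/1996 0:00:00", "4/27/2011 0:00:00")
# the format describes only the date part; the trailing time is ignored
dates <- as.Date(raw, format = "%m/%d/%Y")
# flag events that began on or after 1 January 1996
keep <- dates >= as.Date("1996-01-01")
data.frame(raw, dates, keep)
```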

We decided to focus our analysis on the 10 most frequent storm events. Frequent events tend to have better records, making it easier to study their impact. Exploration of the most frequent event types revealed that thunderstorm wind events appeared under multiple labels, which occupied several spots on the top 10 list. We therefore consolidated these labels under a single, consistent one using the revalue function in the plyr package. This increased our confidence that each of the top 10 most frequent labels describes a distinct type of storm event.

# consolidate different levels representing the same event
library(plyr)  # provides revalue()
d0$EVTYPE<-revalue(d0$EVTYPE, c("TSTM WIND"="THUNDERSTORM WIND",
                                "THUNDERSTORM WINDS"="THUNDERSTORM WIND"))
# calculate event frequencies
tab <- table(d0$EVTYPE)
# sort decreasing
tab_s <- sort(tab,decreasing=TRUE)
# extract 10 most frequent events
top10 <- head(names(tab_s), 10)
top10
##  [1] "THUNDERSTORM WIND" "HAIL"              "FLASH FLOOD"      
##  [4] "FLOOD"             "TORNADO"           "HIGH WIND"        
##  [7] "HEAVY SNOW"        "LIGHTNING"         "HEAVY RAIN"       
## [10] "WINTER STORM"
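
One way to look for labels that may still describe the same kind of event is to search the labels for a common pattern. The sketch below works on a small made-up vector of labels rather than the storm data, uses base R's ifelse as a stand-in for revalue, and the regular expression is only one plausible choice.

```r
# made-up labels for illustration; the real EVTYPE column has hundreds of levels
labels <- c("TSTM WIND", "THUNDERSTORM WINDS", "THUNDERSTORM WIND", "HAIL",
            "TSTM WIND/HAIL", "FLOOD")
# labels that start with a thunderstorm wind phrase under either abbreviation
variants <- grep("^(TSTM|THUNDERSTORM) WINDS?", labels, value = TRUE)
variants
# map the two exact variants to one label, then tabulate frequencies
clean <- ifelse(labels %in% c("TSTM WIND", "THUNDERSTORM WINDS"),
                "THUNDERSTORM WIND", labels)
sort(table(clean), decreasing = TRUE)
```

Compound labels such as the made-up "TSTM WIND/HAIL" are matched by the search but left untouched by the mapping, so they would need a separate decision.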

2.1. Population Health

We then prepared the population health data for visualization. These data consist of the “FATALITIES” and “INJURIES” columns. To create a single histogram for each of the top 10 event types, we first added a column named “casualties”, the sum of the fatalities and injuries columns. We then subset the post-1996 dataset to the top 10 event types and to events with at least one casualty. This subset was used to generate the casualty histograms.

#calculate casualties column to measure harmfulness
d0$casualties<-d0$FATALITIES+d0$INJURIES
# subset of data frame, top10 events with at least some casualties
d_s <- subset(d0, EVTYPE %in% top10 & casualties>0)

2.2. Economic Consequences

The last bit of data processing involved calculating the economic impact of storm events. The relevant columns were “PROPDMG” and “CROPDMG”, along with their respective exponent columns ending in “EXP”. The exponent columns hold character codes that had to be converted into numeric multipliers. This was fairly straightforward for the records after 1996, in which the codes were used more consistently.

Once the codes were converted to multipliers, we multiplied each “DMG” column by its “EXP” multiplier, separately for crop and property damage, and added the two products. The result, the total cost of damage in US dollars, was stored in a new “dollars” column in the post-1996 dataset. We then subset this dataset to the top 10 most frequent event types with nonzero damage. This subset was used to generate the cost histograms.

# lookup table: exponent code (case-insensitive) -> numeric multiplier
mult<-c(K=1e3,M=1e6,B=1e9)

# property damage; codes other than K, M, B (e.g. blank) get a multiplier of 1
prop<-unname(mult[toupper(as.character(d0$PROPDMGEXP))])
prop[is.na(prop)]<-1
propdmg<-prop*d0$PROPDMG

# crop damage, converted the same way
crop<-unname(mult[toupper(as.character(d0$CROPDMGEXP))])
crop[is.na(crop)]<-1
cropdmg<-crop*d0$CROPDMG

# total damage in US dollars
d0$dollars<-cropdmg+propdmg

d_s2 <- subset(d0, EVTYPE %in% top10 & dollars>0)
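
To make the conversion concrete, the sketch below applies the K/M/B rule to a few made-up damage records. The helper name to_multiplier is hypothetical, and treating blank or unrecognized codes as a multiplier of 1 is an assumption made for this illustration.

```r
# hypothetical helper: exponent code -> numeric multiplier
to_multiplier <- function(code) {
        mult <- c(K = 1e3, M = 1e6, B = 1e9)[toupper(as.character(code))]
        mult[is.na(mult)] <- 1   # assumption: blank/unknown codes leave the value as recorded
        unname(mult)
}
# made-up records: damage value and exponent code
toy <- data.frame(PROPDMG = c(2.5, 25, 1.2, 0),
                  PROPDMGEXP = c("K", "m", "B", ""),
                  stringsAsFactors = FALSE)
toy$dollars <- toy$PROPDMG * to_multiplier(toy$PROPDMGEXP)
toy$dollars
# 2.5 K -> 2500, 25 m -> 2.5e7, 1.2 B -> 1.2e9, 0 with a blank code -> 0
```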

3. Results

The results of the analysis are presented as two panels of smoothed histograms (density estimates scaled to counts), one for each of the top 10 most common types of storm events. Both the x and y axes are log-transformed, using log(1 + x), in order to visualize more of the data. The x and y scales are equal across histograms within a given value type (casualties or dollars) to make it easier to visually compare across different event types.
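
The plotting code in this report uses trans="log1p", which maps a value v to log(1 + v), so zero stays at zero instead of going to negative infinity. A minimal illustration with arbitrary values:

```r
v <- c(0, 9, 99, 999)
log(v)     # first element is -Inf, which cannot be placed on an axis
log1p(v)   # log(1 + v): 0 maps to 0
# expm1 is the inverse of log1p and recovers the original values
expm1(log1p(v))
```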

Histograms have the advantage of increased granularity over single-valued metrics such as historic totals or measures of central tendency. This helps avoid the commonly seen error of overestimating the impact of tornados. Historically, NOAA recorded only tornados, hail, and thunderstorm wind for the years prior to 1996. Thus, any historic totals not taking this into account overestimate the present-day dangers of these three storm events relative to the other types of events recorded from 1996 onward.

Histograms also visually remind the reader of the long-tailed nature of most of the distributions, which can have a large effect on the mean. Although extreme values are sometimes ignored in favor of describing the central tendency of some large middle fraction of the data, the trade-off in this dataset would be ignoring especially devastating events, which is unacceptable. Lacking a suitable measure of central tendency that can compromise between the desire to summarize the data and the desire to highlight especially devastating events, we instead present panels of histograms.
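
The sensitivity of the mean to a long tail can be seen in a small made-up example, in which a single extreme event moves the mean far from the median while a trimmed mean discards that event altogether:

```r
# made-up casualty counts: many small events and one extreme event
x <- c(rep(1, 90), rep(5, 9), 1500)
mean(x)                # pulled upward by the single extreme value
median(x)              # unaffected by the extreme value
mean(x, trim = 0.05)   # drops the lowest and highest 5%, including the extreme event
```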

3.1. Population Health

# plot histograms of casualties by top 10 most frequent storm event types.
library(ggplot2)
ggplot(d_s, aes(casualties)) +
        stat_density(aes(y=..count..), color="black", fill="blue", alpha=0.3) +
        facet_wrap(~EVTYPE) +
        scale_x_continuous(breaks=c(3,10,30,100,300,1000), 
                           trans="log1p", 
                           expand=c(0,0)) +
        scale_y_continuous(breaks=c(0,10,30,100,300,1000,3000,10000),
                           trans="log1p") +
        theme_bw()

Floods, thunderstorm winds, tornados, and winter storms all have especially long, fat tails. This suggests that these storm types can be especially deadly.

Because we examined only the top 10 most frequent storm event types, we ignored a fraction of the total casualties in the post-1996 dataset. The fraction of total casualties that these histograms cover is computed as follows:

# fraction of total casualties accounted for by these events 
sum(d_s$casualties)/sum(d0$casualties)
## [1] 0.700862
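
The same ratio can be broken down by event type. The sketch below does this on a small made-up data frame (the event names and counts are illustrative only, and top is a stand-in for the top10 vector), using tapply to sum casualties within each type:

```r
toy <- data.frame(EVTYPE = c("TORNADO", "TORNADO", "FLOOD", "HEAT", "HAIL"),
                  casualties = c(40, 10, 20, 25, 5),
                  stringsAsFactors = FALSE)
top <- c("TORNADO", "FLOOD", "HAIL")   # stand-in for the top10 vector
in_top <- toy$EVTYPE %in% top
# overall fraction of casualties covered by the selected event types
covered <- sum(toy$casualties[in_top]) / sum(toy$casualties)
covered
# share of all casualties contributed by each selected type
shares <- tapply(toy$casualties[in_top], toy$EVTYPE[in_top], sum) / sum(toy$casualties)
sort(shares, decreasing = TRUE)
```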

3.2. Economic Consequences

# plot histograms of damage costs (dollars) by top 10 most frequent storm event types.
ggplot(d_s2, aes(dollars)) +
        stat_density(aes(y=..count..), color="black", fill="blue", alpha=0.3) +
        facet_wrap(~EVTYPE) +
        scale_x_continuous(breaks=c(1e3,1e6,1e9),trans="log1p",expand=c(0,0)) +
        scale_y_continuous(breaks=c(3,10,30,100,300,1000,3000,10000),
                           trans="log1p") +
        theme_bw()

Flash floods, floods, hail, thunderstorm winds, and tornados can have especially severe economic impact.

The fraction of total post-1996 storm event costs represented in this panel is computed as follows:

# fraction of total costs accounted for by these events
sum(d_s2$dollars)/sum(d0$dollars)
## [1] 0.5639959