Do most frequent types of storm cause most harm and damage?



Synopsis

Severe weather events, and storms in particular, can cause significant harm to public health and damage to national economy. It might be tempting to allocate larger chunks of public budgets to preparation for those types of storms that occur more frequently.

This research paper switches focus from frequency of occurrence of a particular storm to measurable consequences. By analyzing NOAA Storm Database maintained since 1950, this paper answers two questions:

Introduction

Data Processing

The aim of this section is to show and explain logic behind of all steps involved in obtaining tidy data set suitable for further analysis. The author of the report assumes that no other information than the:

is known to the reader at this point. So I will read, explore, and transform data together with the reader of the report.

Load data

# download bzipped file if it has not been downlowded before
if (!file.exists("./data/StormData.csv.bz2")) {
        url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
        dir.create("data")
        download.file(url, "./data/StormData.csv.bz2", method="curl")
        }

Read CSV

# read StormData.csv into R object stormData if it has not been read already
# no manipulations on stormData object are allowed !
# before making any change to data, copy stormData into another object 
if (!exists("stormData")) {
        stormData <- read.csv(bzfile("./data/StormData.csv.bz2"),
                              stringsAsFactors=F)
}

Transform data

# explore stormData object
str(stormData)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  "" "" "" "" ...
##  $ BGN_LOCATI: chr  "" "" "" "" ...
##  $ END_DATE  : chr  "" "" "" "" ...
##  $ END_TIME  : chr  "" "" "" "" ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  "" "" "" "" ...
##  $ END_LOCATI: chr  "" "" "" "" ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ WFO       : chr  "" "" "" "" ...
##  $ STATEOFFIC: chr  "" "" "" "" ...
##  $ ZONENAMES : chr  "" "" "" "" ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  "" "" "" "" ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

After perusing these two documents (Storm Data Documentation, FAQ) and looking at the structure of stormData, it is becoming clear that harm to public health and damage to economy contain in variables that have to do with “FATALITIES”, “INJURIES”, and those names containing “DMG” string. Let’s subset these variables together with date and event type.

# subset data related to health and economic damages
ind <- grep("fat|inj|dmg", colnames(stormData), ignore.case = T)
str(stormData[,ind])
## 'data.frame':    902297 obs. of  6 variables:
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
damage <- stormData[,c(2, 8, ind)]  # 2 and 8 for date and event type
colnames(damage) <- tolower(colnames(damage))
# check if we got what we intended to
str(damage)
## 'data.frame':    902297 obs. of  8 variables:
##  $ bgn_date  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ evtype    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ fatalities: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ injuries  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ propdmg   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ propdmgexp: chr  "K" "K" "K" "K" ...
##  $ cropdmg   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ cropdmgexp: chr  "" "" "" "" ...

Storm Data Documentation further tells us that dollar value of property and crop damages consists of numeric figure and character label: K for ’000, M for ’000’000, and B for ’000’000’000. Let’s explore if we should use this line verbatim in our analysis or rather as a hint.

# check do we have "K", "M", "B" labels consistently
table(damage$propdmgexp)
## 
##             -      ?      +      0      1      2      3      4      5 
## 465934      1      8      5    216     25     13      4      4     28 
##      6      7      8      B      h      H      K      m      M 
##      4      5      1     40      1      6 424665      7  11330

As it turns out people who entered data made quite a few mistakes. Thus, our rule for combining figures and labels into dollar values will be as follows:

  • use “k” or “K” for ’000

  • use “m” or “M” for ’000’000

  • use “b” or “B” for ’000’000’000

  • substitute all other with numeric “0”

# calculate tables of multiples for property and crop 
propMultiples <- data.frame(propdmgexp=c("k","K","m","M","b","B"),
                            propMultiples= c(1000,
                                             1000,
                                             1000000,
                                             1000000,
                                             1000000000,
                                             1000000000))
# explore
propMultiples
##   propdmgexp propMultiples
## 1          k         1e+03
## 2          K         1e+03
## 3          m         1e+06
## 4          M         1e+06
## 5          b         1e+09
## 6          B         1e+09
table(damage$cropdmgexp)
## 
##             ?      0      2      B      k      K      m      M 
## 618413      7     19      1      9     21 281832      1   1994
cropMultiples <- data.frame(cropdmgexp=c("k","K","m","M","b","B"),
                            cropMultiples= c(1000,
                                             1000,
                                             1000000,
                                             1000000,
                                             1000000000,
                                             1000000000))
# explore
cropMultiples
##   cropdmgexp cropMultiples
## 1          k         1e+03
## 2          K         1e+03
## 3          m         1e+06
## 4          M         1e+06
## 5          b         1e+09
## 6          B         1e+09
library(plyr)
# join multiples with damage table
damage <- join(damage, propMultiples)
## Joining by: propdmgexp
damage <- join(damage, cropMultiples)
## Joining by: cropdmgexp
#explore
str(damage)
## 'data.frame':    902297 obs. of  10 variables:
##  $ bgn_date     : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ evtype       : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ fatalities   : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ injuries     : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ propdmg      : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ propdmgexp   : chr  "K" "K" "K" "K" ...
##  $ cropdmg      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ cropdmgexp   : chr  "" "" "" "" ...
##  $ propMultiples: num  1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 ...
##  $ cropMultiples: num  NA NA NA NA NA NA NA NA NA NA ...
# calculate dollar values of property and crop damage
damageValue <- mutate(damage,
                      propDamage = propdmg * propMultiples,
                      cropDamage = cropdmg * cropMultiples)

# explore
str(damageValue)
## 'data.frame':    902297 obs. of  12 variables:
##  $ bgn_date     : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ evtype       : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ fatalities   : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ injuries     : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ propdmg      : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ propdmgexp   : chr  "K" "K" "K" "K" ...
##  $ cropdmg      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ cropdmgexp   : chr  "" "" "" "" ...
##  $ propMultiples: num  1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 ...
##  $ cropMultiples: num  NA NA NA NA NA NA NA NA NA NA ...
##  $ propDamage   : num  25000 2500 25000 2500 2500 2500 2500 2500 25000 25000 ...
##  $ cropDamage   : num  NA NA NA NA NA NA NA NA NA NA ...
# explore data
damageValue[1:5,1]
## [1] "4/18/1950 0:00:00"  "4/18/1950 0:00:00"  "2/20/1951 0:00:00" 
## [4] "6/8/1951 0:00:00"   "11/15/1951 0:00:00"
# coerce date column to Date class
damageValue$bgn_date <- as.Date(damageValue$bgn_date, format="%m/%d/%Y")

# substitute NA's with 0
damageValue[is.na(damageValue)] <-0

Before we can proceed to the next step of creating tidy data set we need to define harm to public health and damage to economy due to a particular event.

For the sake of this report, I use the following definitions:

Harm to public health: total amount of fatalities and injuries occurred due to a particular storm event

Damage to economy: total dollar amount of crop and property damage occurred due to a particular storm event

Now let’s calculate a tidy data set containing one column for every variable: date, type of event, health damage, and economic damage

# create tidy data set for further analysis
data <- with(damageValue, data.frame(date=bgn_date,
                                     eventType=evtype,
                                     healthDamage = fatalities + injuries,
                                     economicDamage = propDamage + cropDamage)
             )
        
summary(data)
##       date                        eventType       healthDamage   
##  Min.   :1950-01-03   HAIL             :288661   Min.   :   0.0  
##  1st Qu.:1995-04-20   TSTM WIND        :219940   1st Qu.:   0.0  
##  Median :2002-03-18   THUNDERSTORM WIND: 82563   Median :   0.0  
##  Mean   :1998-12-27   TORNADO          : 60652   Mean   :   0.2  
##  3rd Qu.:2007-07-28   FLASH FLOOD      : 54277   3rd Qu.:   0.0  
##  Max.   :2011-11-30   FLOOD            : 25326   Max.   :1742.0  
##                       (Other)          :170878                   
##  economicDamage    
##  Min.   :0.00e+00  
##  1st Qu.:0.00e+00  
##  Median :0.00e+00  
##  Mean   :5.28e+05  
##  3rd Qu.:1.00e+03  
##  Max.   :1.15e+11  
## 

Data Analysis

Different types of storm by annual frequency

Let’s visualize types of storms by average annual occurrences. To do so, I calculate an eventFreq object that contains two variables, type of event and frequency of occurrence, and subset it for top 10 events.

eventFreq <- arrange(count(data, "eventType"), desc(freq))
topEvents <- eventFreq[1:10,]
# save top 3 for results presentation
results <- data.frame(eventFreq = eventFreq[1:3, "eventType"])
library(ggplot2)
ggplot(data=topEvents, aes(y=freq/62, x = reorder(eventType, freq))) +  #calculate mean over 62 years
        geom_bar(stat="identity",
                 fill="#0D3F65",
                 width=.7) +
        coord_flip() +
        labs(y="Number of storm occurrences",
             x=NULL,
             title = "Average number of occurrences per annum \n of a particular type of storm, US, 1950 - 2011")

plot of chunk unnamed-chunk-7

Harm to public health

Let’s analize vizually what are the top 10 types of storm that cause most damage to public health as defined earlier:

healthDamage <- ddply(data,
                      "eventType",
                      summarize, 
                      healthDamage=sum(healthDamage))
healthDamage <- arrange(healthDamage, desc(healthDamage))
topHealthDamage <- healthDamage[1:10,]
topHealthDamage
##            eventType healthDamage
## 1            TORNADO        96979
## 2     EXCESSIVE HEAT         8428
## 3          TSTM WIND         7461
## 4              FLOOD         7259
## 5          LIGHTNING         6046
## 6               HEAT         3037
## 7        FLASH FLOOD         2755
## 8          ICE STORM         2064
## 9  THUNDERSTORM WIND         1621
## 10      WINTER STORM         1527
# save top3 for results presentation
results$healthDamage <- topHealthDamage[1:3, "eventType"]
# plot top 10 as average per annum
ggplot(data=topHealthDamage, aes(y=healthDamage/62, 
                              x = reorder(eventType, healthDamage))) +
        geom_bar(stat="identity",
                 fill="#0D3F65",
                 width=.7) +
        coord_flip() +
        labs(y="Number of injuries and fatalities per annum",
             x=NULL,
             title = "Average annual harm to public health\n by a particular type of storm, US, 1950 - 2011")

plot of chunk unnamed-chunk-8

Damage to economy

Let’s analyze vizually what are the top 10 event causing most damage to economy

head(data)
##         date eventType healthDamage economicDamage
## 1 1950-04-18   TORNADO           15          25000
## 2 1950-04-18   TORNADO            0           2500
## 3 1951-02-20   TORNADO            2          25000
## 4 1951-06-08   TORNADO            2           2500
## 5 1951-11-15   TORNADO            2           2500
## 6 1951-11-15   TORNADO            6           2500
economicDamage <- ddply(data,
                        "eventType",
                        summarize,
                        economicDamage=sum(economicDamage))

economicDamage <- arrange(economicDamage, desc(economicDamage))
topEconomicDamage <- economicDamage[1:10,]
# save top3 for results presentation
results$economicDamage <- topEconomicDamage[1:3, "eventType"]
# plot top 10 as average per annum
ggplot(data=topEconomicDamage, aes(y=economicDamage/62, 
                              x = reorder(eventType, economicDamage))) +
        geom_bar(stat="identity",
                 fill="#0D3F65",
                 width=.7) +
        coord_flip() +
        labs(y="Economic damage per annum, USD",
             x=NULL,
             title = "Average annual economic damage \n by a particular type of storm, US, 1950 - 2011")

plot of chunk unnamed-chunk-9

Results

Let’s summarize in a table top 3 types of storms in all the categories that we have vizualized before:

library(xtable)
colnames(results) <- c("By frequency of occurrence", 
                      "By harm to public health",
                      "By damage to economy")
table <- xtable(results,
                caption = "Table 1: Top 3 types of storm")
print(table, type="html")
Table 1: Top 3 types of storm
By frequency of occurrence By harm to public health By damage to economy
1 HAIL TORNADO FLOOD
2 TSTM WIND EXCESSIVE HEAT HURRICANE/TYPHOON
3 THUNDERSTORM WIND TSTM WIND TORNADO


It is becoming clear now, after analyzing 60 years worth of data, that most attention need to be paid to preparation for Tornadoes and Floods. These events do not occur as frequently as hails, but the consequences are by far more damaging.