Synopsis

The aim of this report is to address the following research questions:

Research Qustions

  1. Across the United States, which types of storm events are most harmful with respect to population health?
  2. Across the United States, which types of storm events have the greatest economic consequences?

The analysis explores the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. It is concluded that tornadoes have the largest impact on human health and that floods have the greatest economic impact. However, this analysis is limited in that it is of an exploratory nature only. Further study should be performed to develop these conclusions using a statistical analysis.

Data Processing

The data.table package will be used to store the data and the ggplot2 package to plot the data:

library('data.table')
library('ggplot2')

Load Data

It is assumed that the working directory has been set to the source folder. This code block downloads and reads the data from here. Caching is used since this is time consuming operation. Only the relevant data are stored; the relevant data were deduced from Storm Data Archive and Storm Data Documentation. Note that the data available seems to be a subset of the data described in the documentation.

# download data if it has not been already
if (!file.exists("./data/data.bz2")){
        if (!file.exists("./data")) dir.create("data")
        my.url <- paste("http://d396qusza40orc.cloudfront.net",
                        "/repdata%2Fdata%2FStormData.csv.bz2",
                        sep = "")
        download.file(my.url, "./data/data.bz2")
}
# load the data if it has not been already
if (!exists("my.data")){
    my.data <- read.csv(bzfile("./data/data.bz2"))
    my.data <- as.data.table(my.data[,
        c("EVTYPE", 
          "FATALITIES",
          "INJURIES",
          "PROPDMG",
          "PROPDMGEXP",
          "CROPDMG",
          "CROPDMGEXP"
         )]
    )
}

Clean Data

Population Health

The impact on population health will be quantified by the number of injuries and the number of fatalities that an event caused. This is given by the variables FATALITIES and INJURIES. As a sanity check (i.e. a quick check of the basic data structure), a summary of the variables is calculated:

# summarise population health varibles as sanity check
my.data[, summary(INJURIES)]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0     0.0     0.0     0.2     0.0  1700.0
my.data[, summary(FATALITIES)]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0       0       0       0       0     583

Economic Consequences

Economic consequences will be quantified as the total cost of damage to property and to crops. The variables PROPDMG and CROPDMG quantify this; in addition to the numerical data given by these variables, an exponent is given in PROPDMGEXP and CROPDMGEXP (‘h’ for hundred, ‘k’ for thousand, ‘m’ for million, ‘b’ for billion) which must be accounted for. The total damage is the product of the numeric variable and the exponent. The following code block performs the operation; writing each expression is repetitive so they are created as strings in a loop and then parsed to create a new variable in the data.table. The expressions are displayed for the readers reference.

# intialise new variables
my.data[, post.PROPDMG:= 0]
my.data[, post.CROPDMG:= 0]
my.data[, post.TOTDMG:= 0]
# exponent lookup list
expon <- list('hH' = '100', 'kK' = '1e3', 'mM' = '1e6', 'bB' = '1e9')

# template expression - strings will be subsituted to form expression
temp.expr <- 
    "post.%sDMG := ifelse(grepl('[%s]', %sDMGEXP), %sDMG * %s, post.%sDMG)"

# this loop constructs and parses a series of expressions to compute the total 
# damage
for (i in c('PROP', 'CROP')){
   for (j in seq_along(expon)){
        expr <- sprintf(temp.expr, i, names(expon[j]), i, i, expon[[j]], i)
       
        # print expression for readers reference
        print(expr)
        my.data[, eval(parse(text = expr))]
   }
}
## [1] "post.PROPDMG := ifelse(grepl('[hH]', PROPDMGEXP), PROPDMG * 100, post.PROPDMG)"
## [1] "post.PROPDMG := ifelse(grepl('[kK]', PROPDMGEXP), PROPDMG * 1e3, post.PROPDMG)"
## [1] "post.PROPDMG := ifelse(grepl('[mM]', PROPDMGEXP), PROPDMG * 1e6, post.PROPDMG)"
## [1] "post.PROPDMG := ifelse(grepl('[bB]', PROPDMGEXP), PROPDMG * 1e9, post.PROPDMG)"
## [1] "post.CROPDMG := ifelse(grepl('[hH]', CROPDMGEXP), CROPDMG * 100, post.CROPDMG)"
## [1] "post.CROPDMG := ifelse(grepl('[kK]', CROPDMGEXP), CROPDMG * 1e3, post.CROPDMG)"
## [1] "post.CROPDMG := ifelse(grepl('[mM]', CROPDMGEXP), CROPDMG * 1e6, post.CROPDMG)"
## [1] "post.CROPDMG := ifelse(grepl('[bB]', CROPDMGEXP), CROPDMG * 1e9, post.CROPDMG)"
# total damage is sum of property damage and crop damage
my.data[, post.TOTDMG := post.PROPDMG + post.CROPDMG]

Sample results (those variables prepended by post.) are checked here to ensure that the operation was performed as intended. Also, a summary of post.TOTDMG is shown as a sanity check.

# sanity checks on post processed data
subset(my.data, CROPDMGEXP == 'B')
##               EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG
## 1:            FREEZE          0        0    0.00               0.20
## 2:       RIVER FLOOD          0        0    5.00          B    5.00
## 3:              HEAT          0        0    0.00               0.40
## 4:           DROUGHT          0        0    0.00               0.50
## 5:           DROUGHT          0        0    0.00               1.00
## 6:           DROUGHT          0        0    0.00          K    0.00
## 7:           DROUGHT          0        0    0.00          K    0.00
## 8:         ICE STORM          0        0  500.00          K    5.00
## 9: HURRICANE/TYPHOON         15      104    5.88          B    1.51
##    CROPDMGEXP post.PROPDMG post.CROPDMG post.TOTDMG post.EVTYPE
## 1:          B     0.00e+00     2.00e+08    2.00e+08        COLD
## 2:          B     5.00e+09     5.00e+09    1.00e+10       FLOOD
## 3:          B     0.00e+00     4.00e+08    4.00e+08        HEAT
## 4:          B     0.00e+00     5.00e+08    5.00e+08        HEAT
## 5:          B     0.00e+00     1.00e+09    1.00e+09        HEAT
## 6:          B     0.00e+00     0.00e+00    0.00e+00        HEAT
## 7:          B     0.00e+00     0.00e+00    0.00e+00        HEAT
## 8:          B     5.00e+05     5.00e+09    5.00e+09       STORM
## 9:          B     5.88e+09     1.51e+09    7.39e+09  TROP.STORM
my.data[, summary(post.TOTDMG)]
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.00e+00 0.00e+00 0.00e+00 5.28e+05 1.00e+03 1.15e+11

Storm Events

The event type is given by EVTYPE. This is confirmed by listing the ‘levels’ for that variable:

# show extract of events and show number of levels
head(my.data[, levels(EVTYPE)], 20)
##  [1] "   HIGH SURF ADVISORY"  " COASTAL FLOOD"        
##  [3] " FLASH FLOOD"           " LIGHTNING"            
##  [5] " TSTM WIND"             " TSTM WIND (G45)"      
##  [7] " WATERSPOUT"            " WIND"                 
##  [9] "?"                      "ABNORMAL WARMTH"       
## [11] "ABNORMALLY DRY"         "ABNORMALLY WET"        
## [13] "ACCUMULATED SNOWFALL"   "AGRICULTURAL FREEZE"   
## [15] "APACHE COUNTY"          "ASTRONOMICAL HIGH TIDE"
## [17] "ASTRONOMICAL LOW TIDE"  "AVALANCE"              
## [19] "AVALANCHE"              "BEACH EROSIN"
sprintf("There are %i levels", length(my.data[, levels(EVTYPE)]))
## [1] "There are 985 levels"

However, as shown in the block, there are 985 levels; in order to draw meaningful conclusions from this study, the levels will be grouped into wider categories. Regular expressions are used to group key words; the new categories are given by post.EVTYPE:

# initialise new categories with 'OTHER'
my.data[, post.EVTYPE := "OTHER"]

# create groups of categrories defined by regular expressions
groups <- list(
    FOG = "FOG",
    HEAT = "HEAT|WARM|FIRE|DROUGHT|DR[YI]|HOT|HIGH TEMP|HYPERT",
    COLD = "COLD|COOL|FREEZ|WINTE?R|FROST|IC[EY]|GLAZE|HYPOT|LOW TEMP",
    STORM = "STORM",
    RAIN = "RAIN|PREC|WET|SHOWER",
    SNOW = "SNOW|BLIZZ|SLEET",
    LIGHTNING = "LIGHTNING",
    FLOOD = "FLOOD",
    TROP.STORM = "HURRICAN|TYPHOON|CYCLONE",
    TORNADO = "TORNADO|FUNNEL|SPOUT",
    WIND = "WIND",
    HAIL = "HAIL",
    TSUNAMI = "TSUNAMI",
    WATER = "CURRENT|SURF",
    SLIDES = "SLIDE|AVALANCHE"
)

# apply the regular expressions in a loop to recategorise
lapply(seq_along(groups), function(i){
    my.data[, post.EVTYPE := ifelse(grepl(groups[[i]], EVTYPE, ignore.case = T), 
                                            names(groups[i]), post.EVTYPE)]
})

Note that in the above, it can happen that two categories apply. In that case, the order of precedence of categorization is from the bottom-up in groups. After re-categorization, a check is performed to ensure that the uncategorized data is insignificant in terms of the variables of interest. The target was to ensure no more than 1% of the total variables of interest class as OTHER i.e. uncategorized.

# the key will be used for fast subsetting
setkey(my.data, 'post.EVTYPE')

sprintf("Percentage of uncategorized EVTYPE: %.2f%%", 
        100*nrow(my.data['OTHER'])/nrow(my.data))
## [1] "Percentage of uncategorized EVTYPE: 0.48%"
sprintf("Percentage of total damage cost uncategorized: %.2f%%",
        100*my.data['OTHER'][, sum(post.TOTDMG)]/my.data[, sum(post.TOTDMG)])
## [1] "Percentage of total damage cost uncategorized: 0.02%"
sprintf("Percentage of total fatalities uncategorized: %.2f%%",
        100*my.data['OTHER'][, sum(FATALITIES)]/my.data[, sum(FATALITIES)])
## [1] "Percentage of total fatalities uncategorized: 0.41%"
sprintf("Percentage of total injuries uncategorized: %.2f%%",
        100*my.data['OTHER'][, sum(INJURIES)]/my.data[, sum(INJURIES)])
## [1] "Percentage of total injuries uncategorized: 0.11%"

As shown above, the OTHER categorization is not significant in terms of the variables of interest.

Summarised Data

This code block aggregates the data so as to evaluate the impact of each variable for each storm event categorization:

pp.data <- my.data[, 
    list(FATALITIES = sum(FATALITIES),
         INJURIES = sum(INJURIES),
         COST = sum(post.TOTDMG)),
    by = post.EVTYPE]
print(pp.data)
##     post.EVTYPE FATALITIES INJURIES      COST
##  1:        COLD        317     1322 3.543e+09
##  2:       FLOOD       1525     8604 1.798e+11
##  3:         FOG         80     1076 2.283e+07
##  4:        HAIL         45     1467 2.073e+10
##  5:        HEAT       3272    10883 2.485e+10
##  6:   LIGHTNING        817     5232 9.508e+08
##  7:       OTHER         62      149 8.684e+07
##  8:        RAIN        109      329 4.239e+09
##  9:      SLIDES        268      226 3.556e+08
## 10:        SNOW        267     1927 1.926e+09
## 11:       STORM        422     4235 7.329e+10
## 12:     TORNADO       5639    91439 5.742e+10
## 13:  TROP.STORM        133     1333 9.076e+10
## 14:     TSUNAMI         33      129 1.441e+08
## 15:       WATER        738      775 1.167e+08
## 16:        WIND       1418    11402 1.819e+10

Results

Population Health

Below, plots of the number of fatalities and injuries for each event type are shown:

# reorder the levels
pp.data <- within(pp.data, 
    post.EVTYPE <- factor(post.EVTYPE, 
        levels=pp.data[order(FATALITIES, decreasing = T), post.EVTYPE]))

# plot the graph
g = ggplot(pp.data, aes(x = post.EVTYPE, y = FATALITIES))
g = g + geom_bar(stat="identity")
g = g + theme(axis.text.x = element_text(angle = 90, hjust = 1))
g = g + ggtitle('Number of Fatalities')
g = g + ylab('Fatalities') + xlab('Event Type')
print(g)

plot of chunk unnamed-chunk-12

# reorder the levels
pp.data <- within(pp.data, 
    post.EVTYPE <- factor(post.EVTYPE, 
        levels=pp.data[order(INJURIES, decreasing = T), post.EVTYPE]))

# plot the graph
g = ggplot(pp.data, aes(x = post.EVTYPE, y = INJURIES))
g = g + geom_bar(stat="identity")
g = g + theme(axis.text.x = element_text(angle = 90, hjust = 1))
g = g + ggtitle('Number of Injuries')
g = g + ylab('Injuries') + xlab('Event Type')
print(g)

plot of chunk unnamed-chunk-12

This exploratory analysis suggests that tornadoes are the storm events that are the most harmful with respect to population health.

Economic Consequences

# reorder the levels
pp.data <- within(pp.data, 
    post.EVTYPE <- factor(post.EVTYPE, 
        levels=pp.data[order(COST, decreasing = T), post.EVTYPE]))

# plot the graph
g = ggplot(pp.data, aes(x = post.EVTYPE, y = COST))
g = g + geom_bar(stat="identity")
g = g + theme(axis.text.x = element_text(angle = 90, hjust = 1))
g = g + ggtitle('Total Cost')
g = g + ylab('Cost') + xlab('Event Type')
print(g)

plot of chunk unnamed-chunk-13

This exploratory analysis suggests that flooding is the storm events that have the greatest economic consequences.

Conclusion

This exploratory study has found that tornadoes are the events that have the largest impact on poluation health and that flooding are the most costly events. Robust conclusions cannot be drawn from the exploratory analysis alone. Nevertheless, the exploratory evidence is strong in favour of the inferences found above. It is recommended that a more detailed statistical study should be performed in order to develop these conclusions further.