This document is located at http://rpubs.com/minglho/RRPeer2.

Synopsis

Severe weather events can affect the health and economy of a community through injuries, fatalities, and property damage. Examining which types of weather events have the most impact can inform policy decisions aimed at minimizing the damage from these events. We examine the US National Oceanic and Atmospheric Administration’s (NOAA) storm database, which contains events from 1950 through November 2011. To evaluate which types of events are most harmful to population health, we examine the total number of fatalities and injuries for each event type. Tornado is the leading cause of fatalities and injuries to people, accounting for 37% of all fatalities and 65% of all injuries recorded in the dataset. To evaluate which types of events have the greatest economic consequences, we examine the total value of property and crop damages. Flood has the greatest economic consequences, accounting for about 31% of the total damage value.

Data Processing

Reading from the data file.

The data file was provided by the course website.

## Load libraries and set Global R Markdown setting
library(knitr)
library(dplyr)
library(stringr)
opts_chunk$set(echo = TRUE)
## Read data file. Assume the file is in the same directory as 
## this R Markdown file.
raw.data <- read.csv("repdata-data-StormData.csv.bz2")

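## unique() works whether EVTYPE is read as a factor or as character
## (read.csv no longer creates factors by default in R >= 4.0).
length(unique(raw.data$EVTYPE))  # Number of event types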

After reading in the raw data, we find that EVTYPE has 985 different levels, far more than the 48 event types described in the National Weather Service Storm Data Documentation.

Subsetting the raw data

Before attempting to clean up the variable EVTYPE, we subset the raw data to contain only the observations and variables relevant to the analysis of population health and economic consequences. The hope is that reducing the dataset also reduces the number of event types, making EVTYPE easier to clean up. We also convert the data to upper case and remove leading and trailing spaces as an easy cleanup step.

In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete. However, not being a discipline expert, I have no idea how to judge completeness, so I have decided for this analysis to use all the data, rather than leaving out the early years.

The subsetting result is in the data frame df.

# Select relevant variables.
df <- select(raw.data, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP) 

# Keep only observations that record at least one injury, fatality, or damage amount.
mask = df$PROPDMG>0 | df$CROPDMG>0 | df$INJURIES>0 | df$FATALITIES>0
df <- df[mask, ]

# Convert appropriate variables to upper case.
df <- mutate(df, EVTYPE = factor(str_trim(toupper(EVTYPE))),  
             PROPDMGEXP = toupper(PROPDMGEXP),
             CROPDMGEXP = toupper(CROPDMGEXP))
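
To see why upper-casing and trimming is a worthwhile first step, here is a minimal sketch on a handful of made-up labels (they are hypothetical, not taken from the storm database). It uses base R's trimws() in place of str_trim() so that it runs without stringr.

```r
# Hypothetical raw labels; these variants are made up for illustration.
raw_labels <- c("Tstm Wind", "TSTM WIND", " TSTM WIND", "tstm wind ",
                "Flood", "FLOOD")

# Number of distinct labels before any cleanup.
n_before <- length(unique(raw_labels))

# Upper-case, then strip leading and trailing whitespace.
clean_labels <- trimws(toupper(raw_labels))
n_after <- length(unique(clean_labels))

print(c(before = n_before, after = n_after))
```

Labels that differ only in case or surrounding spaces collapse into one, which is the effect we are after when applying the same step to EVTYPE.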

Preparing Data for Population Health Analysis

To examine the effect of weather events on population health, we sum up the number of injuries and fatalities by event type.

## summing up the number of injuries and fatalities 
health.sum <- 
    select(df, EVTYPE, FATALITIES, INJURIES) %>%
    filter(FATALITIES > 0 | INJURIES > 0) %>%
    group_by(EVTYPE) %>%
    summarize(FATALITIES = sum(FATALITIES), INJURIES = sum(INJURIES))

Preparing Data for Economic Consequences Analysis

To examine the economic consequences of weather events, we sum up the amount of property and crop damage by event type.

Processing Damage Value

Damages for property and crop are recorded in two parts: a real number (the variables PROPDMG and CROPDMG) and a power of 10 (the variables PROPDMGEXP and CROPDMGEXP).
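
As a small worked example of this two-part encoding, the sketch below multiplies a numeric part by a power of 10. The records are made up, and the reading of the letter codes (K for thousands, M for millions, B for billions) is the interpretation adopted in this report rather than something verified against the official documentation.

```r
# Illustrative (made-up) records: a numeric part and an exponent code.
dmg     <- c(25, 2.5, 1.2, 10)
dmg_exp <- c("K", "M", "B", "3")

# Assumed power of 10 for each code above:
# K = 10^3, M = 10^6, B = 10^9; a digit is used as the power itself.
power <- c(3, 6, 9, 3)

value <- dmg * 10^power
print(data.frame(dmg, dmg_exp, value))
```

So an entry of 25 with exponent "K" stands for 25,000 dollars, and 1.2 with exponent "B" stands for 1.2 billion dollars.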

## Select relevant variables for economic analysis and filter out observations with no damages recorded.
df.econ <- 
    select(df, EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP) %>%
    filter(PROPDMG > 0 | CROPDMG > 0)

table(df.econ$PROPDMGEXP) 
## 
##             -      +      0      2      3      4      5      6      7 
##   4357      1      5    209      1      1      4     18      3      2 
##      B      H      K      M 
##     40      7 229057  11326
table(df.econ$CROPDMGEXP) 
## 
##             ?      0      B      K      M 
## 145037      6     17      7  97981   1983

The table for PROPDMGEXP above shows that data entry has not been consistent. A single digit can reasonably be interpreted as a power of 10, and the letters as units: “B” for billions, “M” for millions, “K” for thousands, and “H” for hundreds. An empty entry “” is reasonably interpreted as “0,” but I have no idea what the symbols ‘-’, ‘+’, and ‘?’ mean. The following code extracts the real-number part of the damage for entries with the symbols ‘-’, ‘+’, or ‘?’ as the exponent.

odd.exp.ind = df.econ$PROPDMGEXP %in% c("?","+","-")  ### Boolean index for weird exponent.
propdmg.odd.exp = df.econ$PROPDMG[odd.exp.ind]
propdmg.odd.exp
## [1] 20 20  2 15 60 15
odd.exp.ind = df.econ$CROPDMGEXP %in% c("?","+","-")
cropdmg.odd.exp = df.econ$CROPDMG[odd.exp.ind]
cropdmg.odd.exp
## [1] 0 0 0 0 0 0

As we can see above, there are very few damage entries with the symbols ‘-’, ‘+’, and ‘?’ for exponent, and only 6 entries are non-zero for the real part, all with values no larger than 60. Therefore, we will exclude those damage entries when calculating the damage values per observation.

getValue <- function(M, x) {
## 'M' is a number.
## 'x' is a character representing an exponent.
## getValue returns M times 10^N, where N is the power of 10 
##   represented by x.
## If x is "?", "+", or "-", then the function returns 0.
    
    value = M
    
    if (x == "K")       {N = 3L}
    else if (x == "")   {N = 0L}
    else if (x == "M")  {N = 6L}
    else if (x == "B")  {N = 9L}
    else if (x == "H")  {N = 2L}
    else {
        ## Digits convert directly; anything else becomes NA.
        N = suppressWarnings(as.integer(x))
        if (is.na(N)) {
            value = 0            
            N = 0L
        }
    }            
    ## Return the result explicitly rather than as an invisible assignment.
    value * 10^N
}

N = length(df.econ$EVTYPE) 
for (k in 1:N){
    df.econ$PROPDMGVAL[k] = getValue(df.econ$PROPDMG[k], df.econ$PROPDMGEXP[k])
    df.econ$CROPDMGVAL[k] = getValue(df.econ$CROPDMG[k], df.econ$CROPDMGEXP[k])
}
econ.sum <- 
    select(df.econ, EVTYPE, PROPDMGVAL, CROPDMGVAL) %>%
    group_by(EVTYPE) %>%
    summarize(PROPDMGVAL = sum(PROPDMGVAL), CROPDMGVAL = sum(CROPDMGVAL))
econ.sum <- mutate(econ.sum, TOTALDMGVAL = PROPDMGVAL + CROPDMGVAL) 
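
The row-by-row loop above can be slow on a data frame of this size. A vectorized alternative might look like the sketch below. The helper names exp_to_power and damage_value are hypothetical and not part of the original analysis. Unlike getValue, this sketch only accepts single digits as numeric exponents, which covers every digit code seen in the tables above.

```r
# Hypothetical helper: map exponent codes to powers of 10, vectorized.
exp_to_power <- function(x) {
    x <- toupper(as.character(x))
    # Letter codes as interpreted in this report.
    lookup <- c(H = 2, K = 3, M = 6, B = 9)
    power <- unname(lookup[x])        # NA for anything not in the lookup
    power[x == ""] <- 0               # empty exponent: multiply by 1
    # Single digits are taken as the power of 10 directly.
    is_digit <- grepl("^[0-9]$", x)
    power[is_digit] <- as.numeric(x[is_digit])
    power                             # "?", "+", "-" remain NA
}

# Hypothetical helper: numeric part times 10^power, with 0 for
# unrecognized exponents (the same convention as getValue).
damage_value <- function(M, x) {
    power <- exp_to_power(x)
    ifelse(is.na(power), 0, M * 10^power)
}

print(damage_value(c(25, 2, 15, 4, 7), c("K", "", "+", "2", "M")))
```

With the data frame from this report, the two value columns could then be filled in one step each, for example df.econ$PROPDMGVAL <- damage_value(df.econ$PROPDMG, df.econ$PROPDMGEXP).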

Results

I have not taken the course “Getting and Cleaning Data,” as it was not listed as either a hard or soft dependency for this course in the course dependencies chart for the Data Specialization. As a result, I did not consolidate the many variant labels in EVTYPE, so the impact of an event type may appear smaller than it really is because not all the codes for that event type were aggregated. In the interest of completing the assignment in a timely manner, I will continue with the analysis without cleaning EVTYPE. Yes, it’s bad. I am sorry, but I just don’t have the time now.
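
For reference, a cleanup of EVTYPE could start from a few pattern-based rules, roughly as in the sketch below. It is not applied anywhere in this analysis. The function consolidate and the two rules are illustrative assumptions, not an official mapping onto the 48 documented event types.

```r
# Hypothetical consolidation rules: each regular expression maps matching
# raw labels to one canonical name. Illustrative only.
rules <- c(
    "^(TSTM|THUNDERSTORM) WIND" = "THUNDERSTORM WIND",
    "^HURRICANE"                = "HURRICANE/TYPHOON"
)

# Apply the rules in order; labels matching no rule are left unchanged.
consolidate <- function(evtype, rules) {
    out <- as.character(evtype)
    for (pattern in names(rules)) {
        out[grepl(pattern, out)] <- rules[[pattern]]
    }
    out
}

raw <- c("TSTM WIND", "THUNDERSTORM WIND", "HURRICANE",
         "HURRICANE/TYPHOON", "FLOOD")
print(consolidate(raw, rules))
```

Summing by the consolidated label instead of the raw EVTYPE would merge rows such as TSTM WIND and THUNDERSTORM WIND, which appear separately in the tables of this report.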

Health Effect by Weather Events

A quick dotplot by event type of the number of fatalities and injuries (with log10 transformation) shows that the top 10 event types easily cover at least the top order of magnitude of the fatality and injury counts. The log10 transformation is used because the values range over many orders of magnitude. Thus, we will later show only the top 10 entries in table form.

par(mfrow = c(1,2), mar = c(4,4,2,1), oma = c(0,0,2,0))
with(health.sum, {
    dotchart(log10(FATALITIES), main = "Log(FATALITIES) by EVTYPE")
    dotchart(log10(INJURIES), main = "Log(INJURIES) by EVTYPE")
    mtext("Number of Fatalities and Injuries by Event Type", outer = TRUE)
    box(which = "outer")
})
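
One thing to keep in mind with the log10 transformation: an event type with injuries but no fatalities (or the reverse) has a count of zero in one column, and log10(0) is -Inf, so it has no finite position on the log axis. The short sketch below uses made-up counts to show this.

```r
# Made-up totals, including a zero.
counts <- c(0, 1, 10, 1000)

log_counts <- log10(counts)
print(log_counts)             # the zero count becomes -Inf
print(is.finite(log_counts))  # FALSE for the zero count only
```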

Below is a table sorted in descending number of fatalities, showing the top 10 entries:

health.sum <- mutate(health.sum, 
                     FATALITIES_PCT = FATALITIES / sum(FATALITIES), 
                     INJURIES_PCT   = INJURIES   / sum(INJURIES) ) 
fatTop10 <- head(as.data.frame(health.sum[order(health.sum$FATALITIES, decreasing = TRUE),]), n = 10)
fatTop10
##            EVTYPE FATALITIES INJURIES FATALITIES_PCT INJURIES_PCT
## 1         TORNADO       5633    91346     0.37193793  0.650019925
## 2  EXCESSIVE HEAT       1903     6525     0.12565203  0.046432028
## 3     FLASH FLOOD        978     1777     0.06457577  0.012645167
## 4            HEAT        937     2100     0.06186860  0.014943641
## 5       LIGHTNING        816     5230     0.05387917  0.037216782
## 6       TSTM WIND        504     6957     0.03327831  0.049506148
## 7           FLOOD        470     6789     0.03103334  0.048310657
## 8     RIP CURRENT        368      232     0.02429845  0.001650917
## 9       HIGH WIND        248     1137     0.01637504  0.008090914
## 10      AVALANCHE        224      170     0.01479036  0.001209723
sum(fatTop10$FATALITIES_PCT)
## [1] 0.797689

Below is a table sorted in descending number of injuries, showing the top 10 entries:

injTop10 <- head(as.data.frame(health.sum[order(health.sum$INJURIES, decreasing = TRUE),]), n = 10)
injTop10
##               EVTYPE FATALITIES INJURIES FATALITIES_PCT INJURIES_PCT
## 1            TORNADO       5633    91346   0.3719379333  0.650019925
## 2          TSTM WIND        504     6957   0.0332783097  0.049506148
## 3              FLOOD        470     6789   0.0310333443  0.048310657
## 4     EXCESSIVE HEAT       1903     6525   0.1256520304  0.046432028
## 5          LIGHTNING        816     5230   0.0538791680  0.037216782
## 6               HEAT        937     2100   0.0618686035  0.014943641
## 7          ICE STORM         89     1975   0.0058765269  0.014054139
## 8        FLASH FLOOD        978     1777   0.0645757676  0.012645167
## 9  THUNDERSTORM WIND        133     1488   0.0087817762  0.010588637
## 10              HAIL         15     1361   0.0009904259  0.009684903
sum(injTop10$INJURIES_PCT)
## [1] 0.893402

From the tables above, we see that tornado is the leading cause of fatalities and injuries to people, accounting for 37% of all fatalities and 65% of all injuries recorded in the dataset. The top 10 weather events by fatalities account for about 80% of all fatalities, while each of the other events accounts for no more than 1.5% of all fatalities. The top 10 weather events by injuries account for about 89% of all injuries, while each of the other events accounts for no more than 1% of all injuries.
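
The bound on "each of the other events" follows from the sort order: no event outside the top 10 can have a larger share than the tenth entry. A minimal sketch of that check, using made-up totals and a top 3 instead of a top 10, might look like this (all names and numbers here are illustrative):

```r
# Made-up totals per category, for illustration only.
totals <- c(A = 500, B = 200, C = 150, D = 100, E = 30, F = 20)

share <- sort(totals / sum(totals), decreasing = TRUE)
n_top <- 3

top_share      <- sum(head(share, n_top))    # combined share of the top n
max_rest_share <- max(tail(share, -n_top))   # largest single share outside the top n

print(c(top_share = top_share, max_rest_share = max_rest_share))
```

Here the top 3 categories cover 85% of the total, and no remaining category exceeds 10%, which is below the share of the third entry (15%).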

Economic Consequences by Weather Events

A quick dotplot by event type of the damage values (with log10 transformation) shows that the top 10 event types easily cover at least the top order of magnitude of the total damage values. The log10 transformation is used because the values range over many orders of magnitude. Thus, we will later show only the top 10 entries in table form.

par(mfrow = c(1,1), mar = c(4,4,2,1), oma = c(0,0,2,0))
with(econ.sum, {
    dotchart(log10(TOTALDMGVAL), main = "Log(TOTALDMGVAL) by EVTYPE")
    mtext("Total Property and Crop Damages by Event Type", outer = TRUE)
})

Below is a table sorted in descending value of combined property and crop damages, showing the top 10 entries:

econ.sum <- mutate(econ.sum, TOTALDMGVAL_PCT  = TOTALDMGVAL / sum(TOTALDMGVAL))
totalTop10 <- head(as.data.frame(econ.sum[order(econ.sum$TOTALDMGVAL, decreasing = TRUE), c("EVTYPE" , "TOTALDMGVAL", "TOTALDMGVAL_PCT")]), n = 10)
totalTop10
##               EVTYPE  TOTALDMGVAL TOTALDMGVAL_PCT
## 1              FLOOD 150319678257      0.31491835
## 2  HURRICANE/TYPHOON  71913712800      0.15065857
## 3            TORNADO  57362333887      0.12017356
## 4        STORM SURGE  43323541000      0.09076242
## 5               HAIL  18761221986      0.03930459
## 6        FLASH FLOOD  18244041079      0.03822110
## 7            DROUGHT  15018672000      0.03146398
## 8          HURRICANE  14610229010      0.03060830
## 9        RIVER FLOOD  10148404500      0.02126081
## 10         ICE STORM   8967041360      0.01878587
sum(totalTop10$TOTALDMGVAL_PCT)
## [1] 0.8561575

From the table above, we see that flood has the greatest economic consequences, as measured by the total damage value (combining property and crop damages), accounting for about 31% of the total damage value. The top 10 weather events by total damage value account for about 86% of the total damage value.