Notice

I joined the Reproducible Research class in February 2015 and finished it. This time (March 2015), I am applying for a verified certificate. This report is based on my own February submission, with typos fixed and some minor improvements.

Synopsis

This document explores weather events in the USA between 1950 and 2011 to identify which weather event types caused the most severe consequences for population health and the economy. Using the Storm dataset of the National Weather Service, the analysis found that tornado is the most harmful to population health, followed by TSTM wind and excessive heat. In terms of economic impact, flood is the weather event type that had the greatest economic consequences, followed by hurricane typhoon and tornado.

Analysis questions

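The analysis must address the following 2 questions:

  • Which types of events are most harmful with respect to population health?

  • Which types of events have the greatest economic consequences?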

Data used for the analysis

The data for this assignment comes in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. You can download the file from the course web site: https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2

There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.

The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.

Data Processing

Download and load data

library(plyr)
library(dplyr)
library(ggplot2)
library(knitr)
library(reshape2)
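# Create the data directory if needed, then download the file only once
if (!file.exists("./data")) dir.create("./data")
if (!file.exists("./data/repdata_data_StormData.csv.bz2")) {
        message("Download data file ...")
        # mode = "wb" writes the compressed file in binary mode
        download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
              "./data/repdata_data_StormData.csv.bz2",
              mode = "wb")
} else {
        message("Data file already exists")
}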
data <- read.csv(bzfile("./data/repdata_data_StormData.csv.bz2"), header = TRUE, 
                 stringsAsFactors = FALSE)
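
As a small check that read.csv can read a bzip2-compressed file through bzfile, here is a minimal round trip. The two-row table and the temporary file are made up for illustration and are not part of the storm data.

```r
# Write a tiny made-up table to a bzip2-compressed CSV, then read it back.
sample.rows <- data.frame(EVTYPE = c("TORNADO", "FLOOD"),
                          FATALITIES = c(1, 0),
                          stringsAsFactors = FALSE)
tmp.file <- tempfile(fileext = ".csv.bz2")

con <- bzfile(tmp.file, open = "w")   # compressed connection, opened for writing
write.csv(sample.rows, con, row.names = FALSE)
close(con)

# read.csv opens the bzfile connection itself and closes it when done
round.trip <- read.csv(bzfile(tmp.file), header = TRUE, stringsAsFactors = FALSE)
print(round.trip)
unlink(tmp.file)
```

The same call shape is used on the real file above; only the path differs.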

The analysis requires the following variables:

  • Event type information (EVTYPE)

  • Population health information (FATALITIES, INJURIES)

  • Economic damage information (PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)

We subset the data set and keep only the necessary variables (plus BGN_DATE). The restriction to entries with a positive value for the health variables is applied later, in the health analysis. There are no missing values in these variables.

analysis.data <- data %>% select(BGN_DATE, EVTYPE, FATALITIES, INJURIES, PROPDMG,
                                 PROPDMGEXP, CROPDMG, CROPDMGEXP)

Looking at EVTYPE, the values are inconsistent:

  • There are values with leading and trailing white space.

  • A mix of upper case and lower case.

  • Invalid values.

  • Special characters (&, /, -, ;, (, ), .)

Therefore, we run through some steps to clean up values of EVTYPE:

  • Upper-case all values.

  • Trim white space.

  • Replace special characters and invalid values with white space.

analysis.data$EVTYPE <- gsub("[^A-Z]", " ", toupper(analysis.data$EVTYPE))
analysis.data$EVTYPE <- gsub("[ ]+", " ", analysis.data$EVTYPE)
analysis.data$EVTYPE <- gsub("^\\s+", "", analysis.data$EVTYPE)
analysis.data$EVTYPE <- gsub("\\s+$", "", analysis.data$EVTYPE)
analysis.data$EVTYPE <- as.factor(analysis.data$EVTYPE)
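
To make the effect of these steps concrete, here is the same sequence of substitutions applied to a few made-up EVTYPE strings. The sample values are assumptions chosen to show each kind of inconsistency, and the two trimming calls are merged into one pattern for brevity.

```r
# Made-up EVTYPE values showing the inconsistencies listed above
raw.evtype <- c("  Tstm Wind ", "HURRICANE/TYPHOON", "Flood/Flash Flood", "WINTER STORM.")

clean.evtype <- gsub("[^A-Z]", " ", toupper(raw.evtype))  # upper-case, then non-letters -> space
clean.evtype <- gsub("[ ]+", " ", clean.evtype)           # collapse runs of spaces
clean.evtype <- gsub("^\\s+|\\s+$", "", clean.evtype)     # trim both ends

print(clean.evtype)
# yields: "TSTM WIND" "HURRICANE TYPHOON" "FLOOD FLASH FLOOD" "WINTER STORM"
```

Note that the cleaning does not merge synonyms: spellings such as TSTM WIND and THUNDERSTORM WIND remain separate event types.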

Health impact across weather events

The first analysis question concerns the events that caused the most damage to human health. We therefore focus only on entries that have a positive value for FATALITIES or INJURIES.

analysis.data.health <- analysis.data %>% select(EVTYPE, FATALITIES, INJURIES) %>%
    filter(FATALITIES > 0 | INJURIES > 0) %>%
    group_by(EVTYPE) %>%
    summarise(sum.FATALITIES = sum(FATALITIES), sum.INJURIES = sum(INJURIES))
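
The filter-then-sum step can also be sketched without dplyr. The example below uses base R's aggregate on a made-up four-row table; the rows and the names toy, harmful and toy.health are assumptions for illustration.

```r
# Made-up rows: one event type (HAIL) has no fatalities and no injuries
toy <- data.frame(EVTYPE = c("TORNADO", "TORNADO", "FLOOD", "HAIL"),
                  FATALITIES = c(2, 0, 1, 0),
                  INJURIES = c(10, 0, 0, 0),
                  stringsAsFactors = FALSE)

# Keep only entries with at least one fatality or injury
harmful <- toy[toy$FATALITIES > 0 | toy$INJURIES > 0, ]

# Sum both measures per event type
toy.health <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = harmful, FUN = sum)
print(toy.health)
```

HAIL drops out entirely because none of its rows pass the filter, which is also why event types with no recorded casualties never appear in analysis.data.health.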

We cannot directly compare a FATALITIES case to an INJURIES case. To find the event types with the highest health impact, we go through these steps:

  • Calculate sum.FATALITIES and sum.INJURIES for each value of EVTYPE.

  • Identify top 15 of sum.FATALITIES.

  • Identify top 15 of sum.INJURIES.

  • Combine those 2 sets of event types to get the event types that caused the most damage to human health.

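# Caution: top_n(15) without `wt` ranks by the last column of the table
# (sum.INJURIES here), so both selections below rank by injuries and the
# results shown later reflect that. top_n(15, sum.FATALITIES) ranks by fatalities.
top.fatal.et <- analysis.data.health %>% arrange(desc(sum.FATALITIES)) %>% top_n(15) %>%
    select(EVTYPE)
top.injur.et <- analysis.data.health %>% arrange(desc(sum.INJURIES)) %>% top_n(15) %>%
    select(EVTYPE)
# Union of the two sets of event types, then keep their summary rows
top.health.impact.et <- unique(rbind(top.fatal.et, top.injur.et))
top.health.impact <- analysis.data.health[analysis.data.health$EVTYPE %in% top.health.impact.et$EVTYPE,]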
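
The "top N by each measure, then combine" idea can be sketched in base R with the ranking column named explicitly for each selection. The table below is made up (event types A to E, N = 2), and all names are assumptions for illustration.

```r
# Made-up per-event totals
toy.health <- data.frame(EVTYPE = c("A", "B", "C", "D", "E"),
                         sum.FATALITIES = c(50, 5, 40, 1, 2),
                         sum.INJURIES = c(10, 900, 20, 800, 3),
                         stringsAsFactors = FALSE)
n <- 2

# Rank by each measure separately, naming the ranking column each time
top.by.fatal <- head(toy.health$EVTYPE[order(-toy.health$sum.FATALITIES)], n)
top.by.injur <- head(toy.health$EVTYPE[order(-toy.health$sum.INJURIES)], n)

# Union of both sets, then keep the matching summary rows
top.either <- union(top.by.fatal, top.by.injur)
toy.top <- toy.health[toy.health$EVTYPE %in% top.either, ]
print(toy.top)
```

Because the two rankings can disagree, the combined set may hold anywhere between N and 2N event types; here it holds four (A and C by fatalities, B and D by injuries).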

Economic impact across weather events

To measure the economic impact of weather events, 4 variables are used:

  • PROPDMG: property damage

  • CROPDMG: crop damage

  • PROPDMGEXP: unit used for PROPDMG

  • CROPDMGEXP: unit used for CROPDMG

We subset the data set to concentrate on the variables that describe economic damage.

analysis.data.economic <- analysis.data %>% select(EVTYPE, PROPDMG, CROPDMG,
                                                   PROPDMGEXP, CROPDMGEXP)

PROPDMGEXP and CROPDMGEXP should have the values below:

  • H: hundreds (10^2)

  • K: thousands (10^3)

  • M: millions (10^6)

  • B: billions (10^9)

  • A positive integer indicates an exponent of 10

In practice, PROPDMGEXP and CROPDMGEXP contain not only upper-case and lower-case characters but also invalid values such as blank, ?, + and -. We need to convert each unit character to the proper exponent of 10 and replace invalid ones with 0 (meaning the unit is 10^0 = 1). Then,

  • properties.damage.amount = PROPDMG * 10^exp.Value(PROPDMGEXP)
  • crop.damage.amount = CROPDMG * 10^exp.Value(CROPDMGEXP)

To identify which weather event types caused the most economic damage, we calculate the sum of Total.Damage for each weather event type and choose the 15 highest. Because PROPDMG and CROPDMG can both be converted to the same unit (USD), Total.Damage = properties.damage.amount + crop.damage.amount.
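
As a worked illustration of the conversion described above, the sketch below maps unit codes to exponents with a named vector and applies the formula to a few made-up values. The helper to.exponent and the sample numbers are hypothetical and are not the code used for the report.

```r
# Named vector: unit letter -> exponent of 10
exp.lookup <- c(H = 2, K = 3, M = 6, B = 9)

# Hypothetical helper: convert unit codes to exponents of 10
to.exponent <- function(code) {
    code <- toupper(code)                        # treat "k" like "K", and so on
    exponent <- unname(exp.lookup[code])         # NA for anything that is not H/K/M/B
    digit <- grepl("^[0-9]$", code)
    exponent[digit] <- as.numeric(code[digit])   # a digit is the exponent itself
    exponent[is.na(exponent)] <- 0               # blank, ?, +, - -> 10^0 = 1
    exponent
}

# Made-up damage figures and unit codes
dmg <- c(25, 2.5, 1, 10)
dmg.exp <- c("K", "m", "", "?")
amount <- dmg * 10^to.exponent(dmg.exp)
print(amount)
```

For these inputs the amounts are 25 * 10^3 = 25000, 2.5 * 10^6 = 2500000, 1 * 10^0 = 1 and 10 * 10^0 = 10 USD.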

# Unit codes found in PROPDMGEXP / CROPDMGEXP and the exponent of 10 each one maps to
exp_code <- c("-", "?", "+", "0", "H", "h", "K", "k", "M", "m",
              "B", "b", "1", "2", "3", "4", "5", "6", "7", "8")
exp_code_value <- data.frame(exp_code,
                    exp_values = c(0, 0, 0, 0, 
                    2, 2, 3, 3, 6, 6, 9, 9, 1, 2, 3, 4, 5, 6, 7, 8), 
              row.names = exp_code)
# Treat a blank unit as exponent 0
analysis.data.economic[analysis.data.economic$PROPDMGEXP=="","PROPDMGEXP"] <- 0
analysis.data.economic[analysis.data.economic$CROPDMGEXP=="","CROPDMGEXP"] <- 0
# Look up the exponent for each entry by row name
analysis.data.economic$converted.PROPDMGEXP <- exp_code_value[analysis.data.economic$PROPDMGEXP,"exp_values"]
analysis.data.economic$converted.CROPDMGEXP <- exp_code_value[analysis.data.economic$CROPDMGEXP,"exp_values"]
# Total damage in USD = property damage + crop damage
analysis.data.economic$Total.Damage <- analysis.data.economic$PROPDMG * 10^analysis.data.economic$converted.PROPDMGEXP +
    analysis.data.economic$CROPDMG * 10 ^ analysis.data.economic$converted.CROPDMGEXP
# Sum per event type and keep the 15 largest
# (top_n ranks by the last column, which here is sum.Total.Damage)
top.economic.damage <- analysis.data.economic %>% group_by(EVTYPE) %>%
                        summarise(sum.Total.Damage = sum(Total.Damage)) %>%
                        arrange(desc(sum.Total.Damage)) %>% 
                        top_n(15)

Results

Which types of events are most harmful with respect to population health?

kable(top.health.impact, caption = "Most harmful to population health \nTop 15 weather events type")

Most harmful to population health Top 15 weather events type

EVTYPE              sum.FATALITIES  sum.INJURIES
EXCESSIVE HEAT                1903          6525
FLASH FLOOD                    978          1777
FLOOD                          470          6789
HAIL                            15          1361
HEAT                           937          2100
HEAVY SNOW                     127          1021
HIGH WIND                      248          1138
HURRICANE TYPHOON               64          1275
ICE STORM                       89          1975
LIGHTNING                      817          5230
THUNDERSTORM WIND              133          1488
THUNDERSTORM WINDS              64           919
TORNADO                       5633         91346
TSTM WIND                      504          6957
WINTER STORM                   206          1321

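# Reshape to long format: one row per (event type, measure)
top.health.impact.long <- melt(top.health.impact, id.vars = c("EVTYPE"),
                               measure.vars = c("sum.FATALITIES", "sum.INJURIES"))
# Drop the literal "sum." prefix from the measure names
top.health.impact.long$variable <- gsub("sum.", "", top.health.impact.long$variable, fixed = TRUE)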
ggplot(top.health.impact.long, aes(x=EVTYPE, y=value, fill = variable)) +
    geom_bar(stat="identity", position="dodge") +
    labs(title="Most harmful to population health \nTop 15 weather events type",
         x = "Weather event",
         y = "Number of cases") +
    coord_flip()

TORNADO is the weather event type that caused the most damage to human health, followed by TSTM WIND and EXCESSIVE HEAT:

top.health.impact[top.health.impact$EVTYPE=="TORNADO" | top.health.impact$EVTYPE=="TSTM WIND" 
                  |top.health.impact$EVTYPE=="EXCESSIVE HEAT", ]
## Source: local data frame [3 x 3]
## 
##           EVTYPE sum.FATALITIES sum.INJURIES
## 1 EXCESSIVE HEAT           1903         6525
## 2        TORNADO           5633        91346
## 3      TSTM WIND            504         6957

Which types of events have the greatest economic consequences?

ggplot(top.economic.damage, aes(x=EVTYPE, y=sum.Total.Damage)) +
    geom_bar(stat="identity", fill="lightgreen") +
    labs(title="Greatest economic damage \nTop 15 weather events type",
         x = "Weather event",
         y = "Economic damage(USD)") +
    geom_text(aes(label=sum.Total.Damage), colour = "black", size=3) +
    coord_flip()

Flood, hurricane typhoon and tornado are the weather events that caused the most economic damage. The event type with the greatest economic consequences is:

top.economic.damage[top.economic.damage$sum.Total.Damage == max(top.economic.damage$sum.Total.Damage),]
## Source: local data frame [1 x 2]
## 
##   EVTYPE sum.Total.Damage
## 1  FLOOD     150319678257
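
A total this large is easier to read in billions. The short sketch below converts the FLOOD total printed above; the variable names are assumptions for illustration.

```r
# FLOOD total in USD, copied from the output above
flood.total <- 150319678257

# Express the total in billions of USD with one decimal place
flood.label <- sprintf("%.1f billion USD", flood.total / 1e9)
print(flood.label)
```

Dividing by 1e9 gives about 150.3, so the FLOOD total is roughly 150.3 billion USD.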