Synopsis

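The purpose of this data analysis is to answer the following two questions by analyzing the NOAA Storm Database: which types of events are most harmful with respect to population health, and which types of events have the greatest economic consequences?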

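The results of the data analysis: TORNADO is the most harmful event type with respect to population health, and FLOOD has the greatest economic consequences.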

Data Processing

Software Environment

sessionInfo()
## R version 3.2.2 (2015-08-14)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 8 x64 (build 9200)
## 
## locale:
## [1] LC_COLLATE=Chinese (Traditional)_Taiwan.950 
## [2] LC_CTYPE=Chinese (Traditional)_Taiwan.950   
## [3] LC_MONETARY=Chinese (Traditional)_Taiwan.950
## [4] LC_NUMERIC=C                                
## [5] LC_TIME=Chinese (Traditional)_Taiwan.950    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] magrittr_1.5    formatR_1.2     tools_3.2.2     htmltools_0.2.6
##  [5] yaml_2.1.13     stringi_0.5-5   rmarkdown_0.8   knitr_1.11     
##  [9] stringr_1.0.0   digest_0.6.8    evaluate_0.7.2

Loading Data

The NOAA storm database can be downloaded from the URL used in the code below.

Note that the file must be read with ASCII encoding; otherwise some characters are treated as EOF and the data is read incompletely.

library(R.utils)  # for bunzip2()
if (!file.exists("repdata_data_StormData.csv.bz2")) {
    download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
                  "repdata_data_StormData.csv.bz2")
}
if (!file.exists("repdata_data_StormData.csv")) {
    bunzip2("repdata_data_StormData.csv.bz2", remove=FALSE)
}
storm <- read.csv("repdata_data_StormData.csv", fileEncoding = "ascii")
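As a side note to the loading step, the sketch below shows two things on toy data rather than on the storm file: read.csv() can read a bzip2-compressed CSV directly, and on R 4.0.0 or later stringsAsFactors = TRUE has to be passed explicitly to get factor columns, which the levels() calls later in this report rely on. The temporary file and the toy values are assumptions made for illustration only.

```r
# Write a tiny CSV into a bzip2-compressed temporary file (toy data, not the storm file).
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "wt")
write.csv(data.frame(EVTYPE = c("TORNADO", "FLOOD"), FATALITIES = c(1, 0)),
          con, row.names = FALSE)
close(con)

# read.csv() opens the path with file(), which detects bzip2 compression when reading.
# stringsAsFactors = TRUE reproduces the pre-4.0 default used in this report.
toy <- read.csv(tmp, stringsAsFactors = TRUE)
str(toy)

unlink(tmp)  # remove the temporary file
```

Whether direct reading interacts well with fileEncoding = "ascii" on the real file has not been checked here, so the explicit bunzip2() step above remains the documented path.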

Since we only care about health damage and economic damage, only the following columns are kept; all other columns are dropped.

Column Name   Type     Meaning
EVTYPE        Factor   Event type
FATALITIES    num      Number of fatalities
INJURIES      num      Number of injuries
PROPDMG       num      Property damage
PROPDMGEXP    Factor   Property damage exponent
CROPDMG       num      Crop damage
CROPDMGEXP    Factor   Crop damage exponent

library(dplyr)  # for select, arrange, mutate, summarise
storm <- storm %>% select(EVTYPE, FATALITIES, INJURIES,
                        PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)

Preprocessing Data

The meanings of PROPDMGEXP and CROPDMGEXP are explained in the article “How To Handle Exponent Value of PROPDMGEXP and CROPDMGEXP”.

PROPDMGEXP or CROPDMGEXP   Multiplying factor
-, ?, +, empty             0
digit 0 to 8               10
H, h                       100
K, k                       1,000
M, m                       1,000,000
B, b                       1,000,000,000
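The table can also be read as a lookup from exponent code to multiplying factor. The sketch below is one way to express that mapping in base R; exp_to_multiplier is a hypothetical helper name, not part of the original analysis, and treating any unlisted code as 0 is an assumption.

```r
# Hypothetical helper: map exponent codes to multiplying factors,
# following the table above. Unlisted codes are treated as 0 (an assumption).
exp_to_multiplier <- function(code) {
    code <- toupper(as.character(code))
    lookup <- c(H = 100, K = 1e3, M = 1e6, B = 1e9)
    out <- rep(0, length(code))                 # "-", "?", "+", "" and unlisted codes
    out[grepl("^[0-8]$", code)] <- 10           # single digits 0 to 8
    known <- code %in% names(lookup)
    out[known] <- unname(lookup[code[known]])   # H, K, M, B (case-insensitive)
    out
}

multipliers <- exp_to_multiplier(c("", "+", "5", "h", "K", "m", "B"))
multipliers   # 0, 0, 10, 100, 1000, 1e6, 1e9
```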

The original levels of PROPDMGEXP are:

levels(storm$PROPDMGEXP)
##  [1] ""  "-" "?" "+" "0" "1" "2" "3" "4" "5" "6" "7" "8" "B" "h" "H" "K"
## [18] "m" "M"

The modified levels of PROPDMGEXP are:

tmpLvl <- levels(storm$PROPDMGEXP)
tmpLvl <- sub("[0-8]", "10", tmpLvl)
tmpLvl <- sub("h|H", "100", tmpLvl)
tmpLvl <- sub("k|K", "1000", tmpLvl)
tmpLvl <- sub("m|M", "1000000", tmpLvl)
tmpLvl <- sub("b|B", "1000000000", tmpLvl)
tmpLvl <- sub("^$|-|\\?|\\+", "0", tmpLvl)
levels(storm$PROPDMGEXP) <- tmpLvl
levels(storm$PROPDMGEXP)
## [1] "0"          "10"         "1000000000" "100"        "1000"      
## [6] "1000000"

The original levels of CROPDMGEXP are:

levels(storm$CROPDMGEXP)
## [1] ""  "?" "0" "2" "B" "k" "K" "m" "M"

The modified levels of CROPDMGEXP are:

tmpLvl <- levels(storm$CROPDMGEXP)
tmpLvl <- sub("[0-8]", "10", tmpLvl)
tmpLvl <- sub("h|H", "100", tmpLvl)
tmpLvl <- sub("k|K", "1000", tmpLvl)
tmpLvl <- sub("m|M", "1000000", tmpLvl)
tmpLvl <- sub("b|B", "1000000000", tmpLvl)
tmpLvl <- sub("^$|-|\\?|\\+", "0", tmpLvl)
levels(storm$CROPDMGEXP) <- tmpLvl
levels(storm$CROPDMGEXP)
## [1] "0"          "10"         "1000000000" "1000"       "1000000"

Multiply each damage value by its multiplying factor to obtain the full damage amount, then drop the exponent columns.

storm <- storm %>% mutate(
    PROPDMG = PROPDMG * as.numeric(as.character(PROPDMGEXP)),
    CROPDMG = CROPDMG * as.numeric(as.character(CROPDMGEXP))) %>%
    select(-PROPDMGEXP, -CROPDMGEXP)
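The as.numeric(as.character(...)) wrapping in the step above matters because the exponent columns are factors: calling as.numeric() directly on a factor returns its internal level codes, not the numbers written in its labels. A small illustration with made-up values (mult and dmg are toy vectors, not storm data):

```r
# Toy factor whose labels are multiplying factors.
mult <- factor(c("1000", "0", "1000000", "1000"))

codes  <- as.numeric(mult)                 # level codes: 2, 1, 3, 2
values <- as.numeric(as.character(mult))   # labelled values: 1000, 0, 1e6, 1000

# Toy damage values scaled by their multiplying factors.
dmg <- c(25, 3, 1.5, 0)
scaled <- dmg * values                     # 25000, 0, 1500000, 0
scaled
```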

Summarize Data

For each event type, sum the fatalities, injuries, property damage, and crop damage. Then keep the top ten event types and use tidyr::gather to reshape the data frame into long format. ans1 holds the fatality and injury counts for the ten event types with the most fatalities and is used to answer the first question. ans2 holds the property and crop damage for the ten event types with the most property damage and is used to answer the second question.

library(tidyr)  # for gather()
# Totals per event type
storm2 <- storm %>% group_by(EVTYPE) %>% summarise(
    TOTAL_FATAL = sum(FATALITIES),
    TOTAL_INJURY = sum(INJURIES),
    TOTAL_PROP = sum(PROPDMG),
    TOTAL_CROP = sum(CROPDMG))

# Top 10 event types by fatalities, in long format
ans1 <- storm2 %>%
        arrange(desc(TOTAL_FATAL)) %>%
        select(-TOTAL_PROP, -TOTAL_CROP) %>% slice(1:10) %>%
        gather(TYPE, VALUE, TOTAL_FATAL:TOTAL_INJURY)

# Top 10 event types by property damage, in long format
ans2 <- storm2 %>%
        arrange(desc(TOTAL_PROP)) %>%
        select(-TOTAL_FATAL, -TOTAL_INJURY) %>% slice(1:10) %>%
        gather(TYPE, VALUE, TOTAL_PROP:TOTAL_CROP)
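To make the wide-to-long reshaping concrete, here is a minimal sketch on made-up numbers (toy, toy_wide and toy_long are illustrative names, and the values are not taken from the storm database). It assumes dplyr and tidyr are installed.

```r
library(dplyr)
library(tidyr)

# Toy events (made-up numbers).
toy <- data.frame(
    EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HEAT"),
    FATALITIES = c(3, 1, 2, 4),
    INJURIES   = c(10, 0, 5, 1),
    stringsAsFactors = FALSE)

# Wide format: one row per event type, one column per total.
toy_wide <- toy %>% group_by(EVTYPE) %>% summarise(
    TOTAL_FATAL  = sum(FATALITIES),
    TOTAL_INJURY = sum(INJURIES))

# Long format: one row per (event type, measure) pair,
# so 3 event types x 2 measures = 6 rows.
toy_long <- toy_wide %>%
    arrange(desc(TOTAL_FATAL)) %>%
    gather(TYPE, VALUE, TOTAL_FATAL:TOTAL_INJURY)
toy_long
```

The long format is what lets ggplot2 facet or fill by TYPE. In newer tidyr releases, pivot_longer() is the recommended replacement for gather(), which still works.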

Results

Across the United States, which types of events are most harmful with respect to population health?

library(ggplot2)
ggplot(ans1, aes(x=reorder(EVTYPE, -VALUE), y=VALUE, fill=EVTYPE)) +
    geom_bar(stat="identity") + facet_grid(. ~ TYPE) +
    ggtitle("Fatalities and Injuries Caused by Severe Events") +
    xlab("Event type") + ylab("Numbers") +
    theme(legend.position="none") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))
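One detail of the x-axis ordering is worth keeping in mind when reading the plot. Because each event type appears twice in the long data (once per TYPE), reorder(EVTYPE, -VALUE) sorts the event types by the mean of their two values, which is reorder()'s default summary function, and the same order is used in both panels. A toy illustration in base R (ev and val are made-up vectors):

```r
# Each event type appears twice: once for fatalities, once for injuries.
ev  <- c("A", "B", "C", "A", "B", "C")
val <- c(5, 4, 1, 10, 40, 2)   # first three: fatalities; last three: injuries

# reorder() summarises -val per event type with mean() and sorts ascending:
# A -> -7.5, B -> -22, C -> -1.5, so B comes first although A has more fatalities.
f <- reorder(ev, -val)
levels(f)   # "B" "A" "C"
```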

As we can see in the plot, TORNADO is the most harmful event with respect to population health.

Across the United States, which types of events have the greatest economic consequences?

ggplot(ans2, aes(x=reorder(EVTYPE, -VALUE), y=VALUE, fill=TYPE)) +
    geom_bar(stat="identity") +
    ggtitle("Property Damage and Crop Damage Caused by Severe Events") +
    xlab("Event type") + ylab("Damage") +
    theme(legend.position="none") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))

FLOOD has the greatest economic consequences, causing the largest property damage and crop damage.