The purpose of this data analysis is to answer the following questions by analyzing the NOAA Storm Database:
Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?
The results of the data analysis:
TORNADO is the most harmful event with respect to population health.
FLOOD has the greatest economic consequences.
sessionInfo()
## R version 3.2.2 (2015-08-14)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 8 x64 (build 9200)
##
## locale:
## [1] LC_COLLATE=Chinese (Traditional)_Taiwan.950
## [2] LC_CTYPE=Chinese (Traditional)_Taiwan.950
## [3] LC_MONETARY=Chinese (Traditional)_Taiwan.950
## [4] LC_NUMERIC=C
## [5] LC_TIME=Chinese (Traditional)_Taiwan.950
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] magrittr_1.5 formatR_1.2 tools_3.2.2 htmltools_0.2.6
## [5] yaml_2.1.13 stringi_0.5-5 rmarkdown_0.8 knitr_1.11
## [9] stringr_1.0.0 digest_0.6.8 evaluate_0.7.2
The NOAA storm database can be downloaded from https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2.
Please note that the file must be read with ASCII encoding; otherwise some characters are interpreted as EOF and the data is read incompletely.
library(R.utils) # for bunzip2()
if (!file.exists("repdata_data_StormData.csv.bz2")) {
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
"repdata_data_StormData.csv.bz2")
}
if (!file.exists("repdata_data_StormData.csv")) {
bunzip2("repdata_data_StormData.csv.bz2", remove=FALSE)
}
storm <- read.csv("repdata_data_StormData.csv", fileEncoding = "ascii")
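As a possible simplification, base R can usually read a bzip2-compressed CSV directly, because `read.csv()` opens the path through a connection that detects the compression. The explicit `bunzip2()` step may therefore not be strictly necessary. The sketch below shows this on a tiny made-up file; the file and its contents are illustrative assumptions, not part of the storm data.

```r
# Write a tiny made-up CSV through a bzip2 connection (illustrative data only).
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
writeLines(c("EVTYPE,FATALITIES", "TORNADO,3", "FLOOD,1"), con)
close(con)

# read.csv() reads the compressed file without a separate decompression step.
demo <- read.csv(tmp, stringsAsFactors = TRUE)
print(demo)
```

Whether this interacts well with `fileEncoding = "ascii"` on the full storm file has not been checked here, so the two-step approach above remains the safer choice.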
Since we only care about health damage and economic damage, only the following columns are kept; all other columns are dropped.
| Column Name | Type | Meaning |
|---|---|---|
| EVTYPE | Factor | Event type |
| FATALITIES | num | Number of fatalities |
| INJURIES | num | Number of injuries |
| PROPDMG | num | Property damage |
| PROPDMGEXP | Factor | Property damage exponent |
| CROPDMG | num | Crop damage |
| CROPDMGEXP | Factor | Crop damage exponent |
library(dplyr) # for select, arrange, mutate, summarise
storm <- storm %>% select(EVTYPE, FATALITIES, INJURIES,
PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)
The meanings of PROPDMGEXP and CROPDMGEXP are explained in the article “How To Handle Exponent Value of PROPDMGEXP and CROPDMGEXP”.
| PROPDMGEXP or CROPDMGEXP | Multiplying factor |
|---|---|
| -, ?, +, empty | 0 |
| digits 0 to 8 | 10 |
| H,h | 100 |
| K,k | 1,000 |
| M,m | 1,000,000 |
| B,b | 1,000,000,000 |
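The table above can also be expressed as a small lookup function. This is only a sketch: `exp_to_multiplier` is a hypothetical helper, not part of the original analysis, and it works on plain character vectors as well as factors.

```r
# Hypothetical helper (not from the original analysis): map an exponent code
# to its multiplying factor, following the table above.
exp_to_multiplier <- function(code) {
  code <- toupper(as.character(code))
  mult <- rep(NA_real_, length(code))      # unknown codes stay NA
  mult[code %in% c("", "-", "?", "+")] <- 0
  mult[code %in% as.character(0:8)]    <- 10
  mult[code %in% "H"] <- 100
  mult[code %in% "K"] <- 1e3
  mult[code %in% "M"] <- 1e6
  mult[code %in% "B"] <- 1e9
  mult
}

exp_to_multiplier(c("", "5", "h", "K", "m", "B"))
```

Leaving unrecognized codes as `NA` is a design choice of this sketch: it makes unexpected values visible instead of silently treating them as zero.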
The original levels of PROPDMGEXP are:
levels(storm$PROPDMGEXP)
## [1] "" "-" "?" "+" "0" "1" "2" "3" "4" "5" "6" "7" "8" "B" "h" "H" "K"
## [18] "m" "M"
The modified levels of PROPDMGEXP are:
tmpLvl <- levels(storm$PROPDMGEXP)
tmpLvl <- sub("[0-8]", "10", tmpLvl)
tmpLvl <- sub("h|H", "100", tmpLvl)
tmpLvl <- sub("k|K", "1000", tmpLvl)
tmpLvl <- sub("m|M", "1000000", tmpLvl)
tmpLvl <- sub("b|B", "1000000000", tmpLvl)
tmpLvl <- sub("^$|-|\\?|\\+", "0", tmpLvl)
levels(storm$PROPDMGEXP) <- tmpLvl
levels(storm$PROPDMGEXP)
## [1] "0" "10" "1000000000" "100" "1000"
## [6] "1000000"
The original levels of CROPDMGEXP are:
levels(storm$CROPDMGEXP)
## [1] "" "?" "0" "2" "B" "k" "K" "m" "M"
The modified levels of CROPDMGEXP are:
tmpLvl <- levels(storm$CROPDMGEXP)
tmpLvl <- sub("[0-8]", "10", tmpLvl)
tmpLvl <- sub("h|H", "100", tmpLvl)
tmpLvl <- sub("k|K", "1000", tmpLvl)
tmpLvl <- sub("m|M", "1000000", tmpLvl)
tmpLvl <- sub("b|B", "1000000000", tmpLvl)
tmpLvl <- sub("^$|-|\\?|\\+", "0", tmpLvl)
levels(storm$CROPDMGEXP) <- tmpLvl
levels(storm$CROPDMGEXP)
## [1] "0" "10" "1000000000" "1000" "1000000"
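The recoding above relies on a property of R factors: assigning a vector of level names that contains duplicates merges the corresponding levels, which is why the 19 original levels of PROPDMGEXP collapse to 6. A minimal illustration on a toy factor (the values are made up for this sketch):

```r
# Toy factor with both cases of the same exponent code (made-up values).
f <- factor(c("k", "K", "m", "K", ""))

# Rewrite the level names; "k" and "K" both become "1000".
new_levels <- levels(f)
new_levels <- sub("k|K", "1000", new_levels)
new_levels <- sub("m|M", "1000000", new_levels)
new_levels <- sub("^$", "0", new_levels)

# Assigning duplicated names merges levels; every observation is relabelled.
levels(f) <- new_levels

as.character(f)
nlevels(f)
```

Note that this approach assumes the column is a factor, as it is when `read.csv()` is run under R 3.2.2. In R 4.0.0 and later, `read.csv()` returns character columns by default, so `levels()` would return `NULL` unless `stringsAsFactors = TRUE` is passed.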
Multiply each damage value by its multiplying factor to obtain the full damage amount.
storm <- storm %>% mutate(
PROPDMG = PROPDMG * as.numeric(as.character(PROPDMGEXP)),
CROPDMG = CROPDMG * as.numeric(as.character(CROPDMGEXP))) %>%
select(-PROPDMGEXP, -CROPDMGEXP)
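The `as.numeric(as.character(...))` conversion in this step matters: calling `as.numeric()` directly on a factor returns its internal level codes rather than the numbers shown in the labels. The sketch below illustrates the difference on made-up values; the column name `PROPDMG_TOTAL` is an assumption introduced for this example.

```r
# Toy data (made-up values): damage figures with already-recoded multipliers.
toy <- data.frame(
  PROPDMG    = c(2.5, 10, 1),
  PROPDMGEXP = factor(c("1000", "1000000", "0"))
)

# as.numeric() on a factor returns the internal level codes, not the labels.
codes <- as.numeric(toy$PROPDMGEXP)

# Converting to character first recovers the numeric value of each label.
multipliers <- as.numeric(as.character(toy$PROPDMGEXP))

# Full damage amount = reported value times its multiplier.
toy$PROPDMG_TOTAL <- toy$PROPDMG * multipliers
toy
```

Here `codes` holds level positions (small integers), while `multipliers` holds the intended values 1000, 1000000 and 0.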
For each event type, sum up the fatalities, injuries, property damage, and crop damage. Then use tidyr::gather to transform the data.frame into long format. ans1 contains the fatality and injury counts of the ten event types with the most fatalities and is used to answer the first question. ans2 contains the property damage and crop damage of the ten event types with the most property damage and is used to answer the second question.
library(tidyr) # for gather()
storm2 <- storm %>% group_by(EVTYPE) %>% summarise(
TOTAL_FATAL = sum(FATALITIES),
TOTAL_INJURY = sum(INJURIES),
TOTAL_PROP = sum(PROPDMG),
TOTAL_CROP = sum(CROPDMG))
ans1 <- storm2 %>%
arrange(desc(TOTAL_FATAL)) %>%
select(-TOTAL_PROP,-TOTAL_CROP) %>% slice(1:10) %>%
gather(TYPE, VALUE, TOTAL_FATAL:TOTAL_INJURY)
ans2 <- storm2 %>%
arrange(desc(TOTAL_PROP)) %>%
select(-TOTAL_FATAL,-TOTAL_INJURY) %>% slice(1:10) %>%
gather(TYPE, VALUE, TOTAL_PROP:TOTAL_CROP)
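For readers on a recent tidyr, the wide-to-long step can also be sketched with `pivot_longer()`, which the tidyr documentation recommends over `gather()` for new code (tidyr 1.0.0 and later). The data frame below is made up for illustration and does not contain results from the storm data.

```r
library(tidyr)

# Toy per-event totals (made-up numbers, not results from the storm data).
toy_totals <- data.frame(
  EVTYPE       = c("A", "B"),
  TOTAL_FATAL  = c(5, 2),
  TOTAL_INJURY = c(40, 10)
)

# Wide to long: one row per (event type, measure) pair.
toy_long <- pivot_longer(toy_totals,
                         cols = c(TOTAL_FATAL, TOTAL_INJURY),
                         names_to = "TYPE", values_to = "VALUE")

toy_long
```

The row order differs from `gather()`: `pivot_longer()` keeps the rows of each event type together, whereas `gather()` stacks one measure after the other. This does not affect the plots, which reorder the event types themselves.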
library(ggplot2)
ggplot(ans1, aes(x=reorder(EVTYPE, -VALUE), y=VALUE, fill=EVTYPE)) +
geom_bar(stat="identity") + facet_grid(. ~ TYPE) +
ggtitle("Fatalities and Injuries Caused by Severe Events") +
xlab("Event type") + ylab("Numbers") +
theme(legend.position="none") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
As we can see in the plot, TORNADO is the most harmful event with respect to population health.
ggplot(ans2, aes(x=reorder(EVTYPE, -VALUE), y=VALUE, fill=TYPE)) +
geom_bar(stat="identity") +
ggtitle("Property Damage and Crop Damage Caused by Severe Events") +
xlab("Event type") + ylab("Damage") +
theme(legend.position="none") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
As the plot shows, FLOOD has the greatest economic consequences, causing the largest total of property damage and crop damage.