I joined the Reproducible Research class in February 2015 and completed it. This time (March 2015), I am applying for a verified certificate. This report is based on my own work from February, with typos fixed and some minor improvements.
This document explores weather events in the USA between 1950 and 2011 to identify which weather event types caused the most severe consequences for population health and the economy. Using the Storm dataset of the National Weather Service, the analysis found that tornado is the most harmful event type for population health, followed by TSTM wind and excessive heat. In terms of economic impact, flood is the weather event type that had the greatest economic consequences, followed by hurricane typhoon and tornado.
The analysis must address the following 2 questions:
Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?
The data for this assignment comes in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. You can download the file from the course web site:
There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
library(plyr)
library(dplyr)
library(ggplot2)
library(knitr)
library(reshape2)
dir.create("./data", showWarnings = FALSE)  # make sure the target folder exists
if (!file.exists("./data/repdata_data_StormData.csv.bz2")) {
    message("Downloading data file ...")
    # mode = "wb" writes the compressed file in binary mode
    download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
                  "./data/repdata_data_StormData.csv.bz2",
                  mode = "wb")
} else {
    message("Data file already exists")
}
data <- read.csv(bzfile("./data/repdata_data_StormData.csv.bz2"), header = TRUE,
                 stringsAsFactors = FALSE)
The analysis requires the variables below:
Event type information (EVTYPE)
Population health information (FATALITIES, INJURIES)
Economic damage information (PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)
We subset the data set and keep only the necessary variables. Entries with a positive value for the health or economic variables are the ones of interest. There are no missing values in these variables.
analysis.data <- data %>% select(BGN_DATE, EVTYPE, FATALITIES, INJURIES, PROPDMG,
                                 PROPDMGEXP, CROPDMG, CROPDMGEXP)
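The claim that the selected variables contain no missing values can be checked with a per-column count of NA values. The sketch below uses a small made-up data frame (`toy`, with one deliberately missing value) in place of the real storm data, so the numbers are illustrative only.

```r
# Toy stand-in for analysis.data (hypothetical values, not from the storm file)
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "HAIL"),
  FATALITIES = c(1, 0, 0),
  INJURIES   = c(3, 0, NA),
  stringsAsFactors = FALSE
)

# Number of missing values in each column
na.counts <- colSums(is.na(toy))
print(na.counts)

# Names of the columns that contain at least one NA
cols.with.na <- names(na.counts)[na.counts > 0]
print(cols.with.na)
```

Running the same two lines on `analysis.data` should report a count of zero for every column if the statement above holds.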
Looking at EVTYPE, the values are inconsistent:
Values with leading and trailing white space.
Mixed upper case and lower case.
Invalid values.
Special characters (&, /, -, ;, (, ), .)
Therefore, we run through some steps to clean up values of EVTYPE:
Upper-case all values.
Trim white space.
Replace special characters and invalid values with white space.
analysis.data$EVTYPE <- gsub("[^A-Z]", " ", toupper(analysis.data$EVTYPE))
analysis.data$EVTYPE <- gsub("[ ]+", " ", analysis.data$EVTYPE)
analysis.data$EVTYPE <- gsub("^\\s+", "", analysis.data$EVTYPE)
analysis.data$EVTYPE <- gsub("\\s+$", "", analysis.data$EVTYPE)
analysis.data$EVTYPE <- as.factor(analysis.data$EVTYPE)
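To make the effect of these steps concrete, here is a small sketch that applies the same substitutions to a few made-up labels. The helper name `clean_evtype` and the sample strings are assumptions for illustration; `trimws()` stands in for the two trimming `gsub()` calls.

```r
# Hypothetical raw labels showing the inconsistencies listed above
raw <- c("  Tstm Wind", "HURRICANE/TYPHOON", "thunderstorm winds.", "Flash Flood (minor) ")

# Hypothetical helper wrapping the clean-up steps
clean_evtype <- function(x) {
  x <- gsub("[^A-Z]", " ", toupper(x))  # upper-case, then turn non-letters into spaces
  x <- gsub("[ ]+", " ", x)             # collapse runs of spaces into one
  trimws(x)                             # drop leading and trailing white space
}

cleaned <- clean_evtype(raw)
print(cleaned)
```

Note that this clean-up only normalises spelling details. Labels such as TSTM WIND and THUNDERSTORM WINDS remain separate event types, which is why both appear in the results table.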
The first analysis question addresses the events that caused the most harm to human health. As a result, we focus only on entries that have a positive value for FATALITIES or INJURIES.
analysis.data.health <- analysis.data %>% select(EVTYPE, FATALITIES, INJURIES) %>%
    filter(FATALITIES > 0 | INJURIES > 0) %>%
    group_by(EVTYPE) %>%
    summarise(sum.FATALITIES = sum(FATALITIES), sum.INJURIES = sum(INJURIES))
We cannot directly compare a FATALITIES case to an INJURIES case. To find the highest health impact across weather events, we go through these steps:
Calculate the sums sum.FATALITIES and sum.INJURIES for each value of EVTYPE.
Identify the top 15 event types by sum.FATALITIES.
Identify the top 15 event types by sum.INJURIES.
Combine the 2 sets of event types above to obtain the event types that caused the most damage to human health.
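# Note: top_n(15) without `wt` ranks by the last column of the table (here sum.INJURIES);
# pass the column explicitly, e.g. top_n(15, sum.FATALITIES), to rank by fatalities.
top.fatal.et <- analysis.data.health %>% arrange(desc(sum.FATALITIES)) %>% top_n(15) %>%
    select(EVTYPE)
top.injur.et <- analysis.data.health %>% arrange(desc(sum.INJURIES)) %>% top_n(15) %>%
    select(EVTYPE)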
top.health.impact.et <- unique(rbind(top.fatal.et, top.injur.et))
top.health.impact <- analysis.data.health[analysis.data.health$EVTYPE %in% top.health.impact.et$EVTYPE,]
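The "top N by each measure, then combine" idea can be sketched in base R with the ranking column named explicitly for each selection. The data frame `toy` and its numbers are invented for illustration, and `n` is set to 2 instead of 15 to keep the output short. In dplyr, `top_n()` orders by the last column of the table when no `wt` argument is given, so naming the column avoids any ambiguity about which measure is ranked.

```r
# Toy summary table (hypothetical numbers, one row per event type)
toy <- data.frame(
  EVTYPE         = c("A", "B", "C", "D", "E"),
  sum.FATALITIES = c(50, 5, 40, 1, 30),
  sum.INJURIES   = c(10, 900, 20, 800, 5),
  stringsAsFactors = FALSE
)

n <- 2  # the report uses 15

# Rank explicitly by each measure and keep the first n event types
top.fatal <- head(toy$EVTYPE[order(-toy$sum.FATALITIES)], n)
top.injur <- head(toy$EVTYPE[order(-toy$sum.INJURIES)], n)

# Union of the two sets, then the matching rows of the summary table
top.union <- union(top.fatal, top.injur)
print(top.union)
print(toy[toy$EVTYPE %in% top.union, ])
```

Because the two rankings can disagree, the combined set may hold up to 2 * n event types.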
To measure the economic impact of weather events, 4 variables are used:
PROPDMG: property damage
CROPDMG: crop damage
PROPDMGEXP: unit used for PROPDMG
CROPDMGEXP: unit used for CROPDMG
We subset the data set to concentrate on the variables that describe economic damage.
analysis.data.economic <- analysis.data %>% select(EVTYPE, PROPDMG, CROPDMG,
PROPDMGEXP, CROPDMGEXP)
PROPDMGEXP and CROPDMGEXP should have the values below:
H: hundreds (10^2)
K: thousands (10^3)
M: millions (10^6)
B: billions (10^9)
A positive integer indicates an exponent of 10
In practice, PROPDMGEXP and CROPDMGEXP contain not only upper-case and lower-case unit characters but also invalid values such as blank, ?, + and -. We need to convert each unit character to the proper exponent of 10 and replace invalid ones with 0 (which means the unit is 10^0 = 1). Then,
properties.damage.amount = PROPDMG * 10^exp.Value(PROPDMGEXP)
crop.damage.amount = CROPDMG * 10^exp.Value(CROPDMGEXP)
Because both amounts are expressed in the same unit (USD), Total.Damage = properties.damage.amount + crop.damage.amount. To identify which weather event type caused the most economic damage, we calculate the sum of Total.Damage for each weather event type and choose the 15 highest ones.
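As a small worked example of the conversion, the sketch below maps unit codes to exponents with a named vector and computes damage amounts for a few made-up rows. The names `exp_lookup` and `exp_value` are hypothetical and only illustrate the rule described above; the report's own code uses a data frame lookup instead.

```r
# Hypothetical lookup from unit letter to exponent of 10
exp_lookup <- c(H = 2, K = 3, M = 6, B = 9)

# Hypothetical helper: exponent for each unit code
exp_value <- function(code) {
  code <- toupper(code)                        # treat "k" and "K" alike
  is.digit <- grepl("^[0-9]$", code)           # single digits are exponents themselves
  out <- unname(exp_lookup[code])              # NA for anything not in the lookup
  out[is.digit] <- as.numeric(code[is.digit])
  out[is.na(out)] <- 0                         # blank, ?, +, - fall back to 10^0 = 1
  out
}

# Made-up damage figures and unit codes
dmg  <- c(25, 2.5, 1, 10, 3)
unit <- c("K", "m", "B", "", "2")

amount <- dmg * 10^exp_value(unit)
print(amount)
```

Here 25 with unit K becomes 25,000 USD, 2.5 with unit m becomes 2,500,000 USD, and the blank unit leaves 10 unchanged.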
exp_code <- c("-", "?", "+", "0", "H", "h", "K", "k", "M", "m",
              "B", "b", "1", "2", "3", "4", "5", "6", "7", "8")
# exponent of 10 for each unit code; invalid codes map to 0
exp_code_value <- data.frame(exp_code,
                             exp_values = c(0, 0, 0, 0,
                                            2, 2, 3, 3, 6, 6, 9, 9, 1, 2, 3, 4, 5, 6, 7, 8),
                             row.names = exp_code)
analysis.data.economic[analysis.data.economic$PROPDMGEXP=="","PROPDMGEXP"] <- 0
analysis.data.economic[analysis.data.economic$CROPDMGEXP=="","CROPDMGEXP"] <- 0
analysis.data.economic$converted.PROPDMGEXP <- exp_code_value[analysis.data.economic$PROPDMGEXP,"exp_values"]
analysis.data.economic$converted.CROPDMGEXP <- exp_code_value[analysis.data.economic$CROPDMGEXP,"exp_values"]
analysis.data.economic$Total.Damage <- analysis.data.economic$PROPDMG * 10^analysis.data.economic$converted.PROPDMGEXP +
    analysis.data.economic$CROPDMG * 10^analysis.data.economic$converted.CROPDMGEXP
top.economic.damage <- analysis.data.economic %>% group_by(EVTYPE) %>%
    summarise(sum.Total.Damage = sum(Total.Damage)) %>%
    arrange(desc(sum.Total.Damage)) %>%
    top_n(15)
kable(top.health.impact, caption = "Most harmful to population health \nTop 15 weather events type")
| EVTYPE | sum.FATALITIES | sum.INJURIES |
|---|---|---|
| EXCESSIVE HEAT | 1903 | 6525 |
| FLASH FLOOD | 978 | 1777 |
| FLOOD | 470 | 6789 |
| HAIL | 15 | 1361 |
| HEAT | 937 | 2100 |
| HEAVY SNOW | 127 | 1021 |
| HIGH WIND | 248 | 1138 |
| HURRICANE TYPHOON | 64 | 1275 |
| ICE STORM | 89 | 1975 |
| LIGHTNING | 817 | 5230 |
| THUNDERSTORM WIND | 133 | 1488 |
| THUNDERSTORM WINDS | 64 | 919 |
| TORNADO | 5633 | 91346 |
| TSTM WIND | 504 | 6957 |
| WINTER STORM | 206 | 1321 |
top.health.impact.long <- melt(top.health.impact, id.vars = "EVTYPE",
                               measure.vars = c("sum.FATALITIES", "sum.INJURIES"))
# strip the literal "sum." prefix (the dot is escaped so it is not a regex wildcard)
top.health.impact.long$variable <- sub("^sum\\.", "", top.health.impact.long$variable)
ggplot(top.health.impact.long, aes(x=EVTYPE, y=value, fill = variable)) +
geom_bar(stat="identity", position="dodge") +
labs(title="Most harmful to population health \nTop 15 weather events type",
x = "Weather event",
y = "Number of cases") +
coord_flip()
TORNADO is the weather event type that caused the most damage to human health, followed by TSTM WIND and EXCESSIVE HEAT:
top.health.impact[top.health.impact$EVTYPE=="TORNADO" | top.health.impact$EVTYPE=="TSTM WIND"
|top.health.impact$EVTYPE=="EXCESSIVE HEAT", ]
## Source: local data frame [3 x 3]
##
## EVTYPE sum.FATALITIES sum.INJURIES
## 1 EXCESSIVE HEAT 1903 6525
## 2 TORNADO 5633 91346
## 3 TSTM WIND 504 6957
ggplot(top.economic.damage, aes(x=EVTYPE, y=sum.Total.Damage)) +
geom_bar(stat="identity", fill="lightgreen") +
labs(title="Greatest economic damage \nTop 15 weather events type",
x = "Weather event",
y = "Economic damage(USD)") +
geom_text(aes(label=sum.Total.Damage), colour = "black", size=3) +
coord_flip()
Flood, hurricane typhoon and tornado are the weather events that caused the most economic damage. The event type with the greatest economic consequences is:
top.economic.damage[top.economic.damage$sum.Total.Damage == max(top.economic.damage$sum.Total.Damage),]
## Source: local data frame [1 x 2]
##
## EVTYPE sum.Total.Damage
## 1 FLOOD 150319678257