Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern. In this document, we analyze the U.S. National Oceanic and Atmospheric Administration's (NOAA) storm database to address these two questions:
1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
2. Across the United States, which types of events have the greatest economic consequences?
URL <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
if (!file.exists("data")) dir.create("data")
if (!file.exists("./data/storm_data.csv.bz2")) {
  download.file(URL, destfile = "./data/storm_data.csv.bz2", method = "curl")
}
storm <- read.csv("./data/storm_data.csv.bz2")
library('dplyr')
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library('ggplot2')
To address the two research questions, we need only 7 variables from the storm dataset. The required variables are summarized below:
EVTYPE: type of event
FATALITIES: number of fatalities
INJURIES: number of injuries
PROPDMG: first few digits of property damage
PROPDMGEXP: exponent value of PROPDMG
CROPDMG: first few digits of crop damage
CROPDMGEXP: exponent value of CROPDMG
We subset the dataset to only the variables of interest and store the result in the variable df.
df <- subset(storm, select = c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP"))
The NOAA website gives no official guidance on how to interpret the exponent values in the PROPDMGEXP and CROPDMGEXP columns, so their handling is ambiguous. We follow the rules below (credit: https://github.com/flyingdisc/RepData_PeerAssessment2/blob/master/how-to-handle-PROPDMGEXP.md):
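These are the possible values of CROPDMGEXP and PROPDMGEXP:
H, h, K, k, M, m, B, b, +, -, ?, 0, 1, 2, 3, 4, 5, 6, 7, 8, and the blank character
H, h = 100
K, k = 1,000
M, m = 1,000,000
B, b = 1,000,000,000
+ = 1
- = 0
? = 0
blank = 0
numeric 0..8 = 10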
We convert PROPDMGEXP and CROPDMGEXP to numeric multipliers based on the rules defined above. We then create two new variables, PROPMUL and CROPMUL, which are the products PROPDMG * PROPDMGEXP and CROPDMG * CROPDMGEXP, respectively. We store this tidy data set in the variable df_tidy.
replace_from <- c("H", "h", "K", "k", "M", "m", "B", "b", "+", "-", "?", "", 0:8)
replace_to <- c(100, 100, 1000, 1000, 1e+06, 1e+06, 1e+09, 1e+09, 1, 0, 0, 0, rep(10,9))
df_tidy <- df %>%
  mutate(PROPDMGEXP = plyr::mapvalues(PROPDMGEXP, replace_from, replace_to)) %>%
  mutate(CROPDMGEXP = plyr::mapvalues(CROPDMGEXP, replace_from, replace_to))
## The following `from` values were not present in `x`: k, b
## The following `from` values were not present in `x`: H, h, b, +, -, 1, 3, 4, 5, 6, 7, 8
df_tidy$PROPDMGEXP <- as.numeric(as.character(df_tidy$PROPDMGEXP))
df_tidy$CROPDMGEXP <- as.numeric(as.character(df_tidy$CROPDMGEXP))
df_tidy <- df_tidy %>%
  mutate(PROPMUL = PROPDMG * PROPDMGEXP) %>%
  mutate(CROPMUL = CROPDMG * CROPDMGEXP)
head(df_tidy)
## EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 TORNADO 0 15 25.0 1000 0 0
## 2 TORNADO 0 0 2.5 1000 0 0
## 3 TORNADO 0 2 25.0 1000 0 0
## 4 TORNADO 0 2 2.5 1000 0 0
## 5 TORNADO 0 2 2.5 1000 0 0
## 6 TORNADO 0 6 2.5 1000 0 0
## PROPMUL CROPMUL
## 1 25000 0
## 2 2500 0
## 3 25000 0
## 4 2500 0
## 5 2500 0
## 6 2500 0
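As a cross-check on the conversion step, the same multipliers used in replace_from and replace_to can be written as a named lookup vector in base R. This is only a minimal sketch on made-up rows: the helper name exp_multiplier and the toy data frame are hypothetical and not part of the original analysis.

```r
# Hypothetical helper (not part of the original analysis): map an exponent
# code to its multiplier, mirroring replace_from / replace_to.
exp_multiplier <- function(code) {
  lookup <- c(
    H = 100, h = 100,
    K = 1e3, k = 1e3,
    M = 1e6, m = 1e6,
    B = 1e9, b = 1e9,
    "+" = 1, "-" = 0, "?" = 0
  )
  code <- as.character(code)
  out <- unname(lookup[code])               # NA for codes not in the table
  out[code %in% as.character(0:8)] <- 10    # digits 0..8 map to 10
  out[code %in% ""] <- 0                    # blank maps to 0
  out
}

# Made-up rows for illustration only
toy <- data.frame(
  PROPDMG    = c(25, 2.5, 1.2, 3, 7),
  PROPDMGEXP = c("K", "M", "B", "5", ""),
  stringsAsFactors = FALSE
)
toy$PROPMUL <- toy$PROPDMG * exp_multiplier(toy$PROPDMGEXP)
print(toy)
```

With these toy rows, PROPMUL comes out as 25000, 2500000, 1.2e+09, 30, and 0. Any code outside the listed set would yield NA rather than being silently kept, which makes unexpected values easy to spot.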
1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
We sum injuries and fatalities for each event type and sort by the combined total.
summary1 <- df_tidy %>%
  group_by(EVTYPE) %>%
  summarise(fatalities = sum(FATALITIES), injuries = sum(INJURIES), injuries_and_fatalities = sum(INJURIES+FATALITIES)) %>%
  arrange(desc(injuries_and_fatalities))
head(summary1,5)
## # A tibble: 5 x 4
## EVTYPE fatalities injuries injuries_and_fatalities
## <fct> <dbl> <dbl> <dbl>
## 1 TORNADO 5633 91346 96979
## 2 EXCESSIVE HEAT 1903 6525 8428
## 3 TSTM WIND 504 6957 7461
## 4 FLOOD 470 6789 7259
## 5 LIGHTNING 816 5230 6046
The plot below shows the five most harmful event types.
ggplot(head(summary1,5), aes(reorder(EVTYPE, -injuries_and_fatalities), injuries_and_fatalities)) + geom_bar(stat = "identity") + labs(title = "Injuries and Fatalities vs Event Types", x = "", y = "Number of injuries and fatalities")
From the summary and plot above, tornadoes are by far the most harmful event type with respect to population health (96,979 combined injuries and fatalities), followed by excessive heat (8,428) and thunderstorm wind (TSTM WIND; 7,461).
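To make the group-and-sum step concrete, here is a minimal sketch on a made-up data frame. It uses base R's aggregate() instead of the dplyr pipeline, so it runs without extra packages. The toy_health values are invented for illustration and are not taken from the storm database.

```r
# Made-up records for illustration only
toy_health <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "LIGHTNING", "FLOOD"),
  FATALITIES = c(1, 0, 2, 1, 3),
  INJURIES   = c(10, 4, 5, 0, 2)
)

# Sum fatalities and injuries within each event type
totals <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE,
                    data = toy_health, FUN = sum)

# Combined harm, then sort from most to least harmful
totals$injuries_and_fatalities <- totals$FATALITIES + totals$INJURIES
totals <- totals[order(-totals$injuries_and_fatalities), ]
print(totals)
```

On this toy input the ordering is TORNADO (18), FLOOD (9), LIGHTNING (1), which is the same kind of ranking that summary1 produces for the full data set.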
2. Across the United States, which types of events have the greatest economic consequences?
We sum crop and property damage for each event type and sort by the combined total.
summary2 <- df_tidy %>%
  group_by(EVTYPE) %>%
  summarise(prop_dmg = sum(PROPMUL), crop_dmg = sum(CROPMUL), crop_prop_dmg = sum(CROPMUL+PROPMUL)) %>%
  arrange(desc(crop_prop_dmg))
head(summary2,5)
## # A tibble: 5 x 4
## EVTYPE prop_dmg crop_dmg crop_prop_dmg
## <fct> <dbl> <dbl> <dbl>
## 1 FLOOD 144657709800 5661968450 150319678250
## 2 HURRICANE/TYPHOON 69305840000 2607872800 71913712800
## 3 TORNADO 56937162897 414954710 57352117607
## 4 STORM SURGE 43323536000 5000 43323541000
## 5 HAIL 15732269877 3025954650 18758224527
We then plot the five event types with the highest combined damage.
ggplot(head(summary2,5), aes(reorder(EVTYPE, -crop_prop_dmg), crop_prop_dmg)) + geom_bar(stat = "identity") + labs(title = "Economic Consequences ($)", x = "", y = "")
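From the summary and plot above, floods have the greatest economic consequences (about $150 billion in combined property and crop damage), followed by hurricanes/typhoons (about $72 billion) and tornadoes (about $57 billion).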
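The raw dollar totals are hard to read at a glance, so one option is to rescale them to billions before reporting or plotting. The sketch below does this for the five crop_prop_dmg values printed in the summary table. The data frame top5 and the column dmg_billion are names introduced here for illustration only.

```r
# Combined damage figures copied from the summary table above
top5 <- data.frame(
  EVTYPE = c("FLOOD", "HURRICANE/TYPHOON", "TORNADO", "STORM SURGE", "HAIL"),
  crop_prop_dmg = c(150319678250, 71913712800, 57352117607,
                    43323541000, 18758224527)
)

# Rescale dollars to billions of dollars, rounded to one decimal place
top5$dmg_billion <- round(top5$crop_prop_dmg / 1e9, 1)
print(top5[, c("EVTYPE", "dmg_billion")])
```

This gives 150.3, 71.9, 57.4, 43.3, and 18.8 billion dollars, respectively. The rescaled column could be passed to ggplot in place of crop_prop_dmg, with the unit stated in the axis label.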