This document is located at http://rpubs.com/minglho/RRPeer2.
Severe weather events can affect the health and economy of a community through injuries, fatalities, and property damage. Examining which types of weather events have the most impact can inform policy decisions aimed at minimizing the damage these events cause. We examine the US National Oceanic and Atmospheric Administration’s (NOAA) storm database, which contains events from 1950 through November 2011. To evaluate which types of events are most harmful to population health, we examine the total number of fatalities and injuries for each event type. Tornado is the leading cause of fatalities and injuries to people, accounting for 37% of all fatalities and 65% of all injuries recorded in the dataset. To evaluate which types of events have the greatest economic consequences, we examine the total value of property and crop damages. Flood has the greatest economic consequences, accounting for about 31% of the total damage value.
The data file was provided by the course website.
## Load libraries and set Global R Markdown setting
library(knitr)
library(dplyr)
library(stringr)
opts_chunk$set(echo = TRUE)
## Read data file. Assume the file is in the same directory as
## this R Markdown file.
raw.data <- read.csv("repdata-data-StormData.csv.bz2")
length(unique(raw.data$EVTYPE)) # Number of event types (works whether EVTYPE is a factor or character)
After reading in the raw data, we find that EVTYPE has 985 different levels, far more than the 48 event types described in the National Weather Service Storm Data Documentation.
Before attempting to clean up the variable EVTYPE, we subset the raw data to contain only the observations and variables relevant to the analysis of population health and economic consequences. The hope is that reducing the dataset also reduces the number of event types, making EVTYPE easier to clean up. We also convert the data to upper case and remove leading and trailing spaces as an easy cleanup step.
In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete. However, not being a discipline expert, I have no idea how to judge completeness, so I have decided for this analysis to use all the data, rather than leaving out the early years.
The result of the subsetting is the data frame df.
# Select relevant variables.
df <- select(raw.data, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)
# Keep only observations with at least one injury, fatality, or damage amount.
mask = df$PROPDMG>0 | df$CROPDMG>0 | df$INJURIES>0 | df$FATALITIES>0
df <- df[mask, ]
# Convert appropriate variables to upper case.
df <- mutate(df, EVTYPE = factor(str_trim(toupper(EVTYPE))),
             PROPDMGEXP = toupper(PROPDMGEXP),
             CROPDMGEXP = toupper(CROPDMGEXP))
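To illustrate why upper-casing and trimming is a worthwhile first step, the small sketch below applies the same idea to a handful of made-up labels. The vector `raw.types` is hypothetical and not taken from the dataset, and base R's `trimws` stands in for `str_trim` so that the snippet needs no packages.

```r
# Hypothetical event labels with inconsistent case and stray spaces.
raw.types <- c("Flood", "FLOOD ", " flood", "High Wind", "HIGH WIND")

# Same idea as the cleanup above: upper case, then trim both ends.
clean.types <- trimws(toupper(raw.types))

length(unique(raw.types))    # distinct labels before cleanup
length(unique(clean.types))  # distinct labels after cleanup
```

Labels that differ only in case or surrounding whitespace collapse into one, which is the effect we hope for in EVTYPE.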
To examine the effect of weather events on population health, we sum up the number of injuries and fatalities by event type.
## summing up the number of injuries and fatalities
health.sum <-
  select(df, EVTYPE, FATALITIES, INJURIES) %>%
  filter(FATALITIES > 0 | INJURIES > 0) %>%
  group_by(EVTYPE) %>%
  summarize(FATALITIES = sum(FATALITIES), INJURIES = sum(INJURIES))
To examine the economic consequences of weather events, we sum up the amount of property and crop damage by event type.
Damages for property and crop are recorded in two parts: a real number (the variables PROPDMG and CROPDMG) and a power of 10 (the variables PROPDMGEXP and CROPDMGEXP).
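As a small illustration of this two-part encoding, the sketch below decodes a few made-up entries. The vectors `mantissa` and `exponent.code` and their values are hypothetical, and the mapping from letter codes to powers of 10 (H = 2, K = 3, M = 6, B = 9) is the interpretation adopted in this analysis rather than something confirmed against the data documentation.

```r
# Hypothetical damage entries (not taken from the dataset).
mantissa      <- c(2.5, 10, 1.2)
exponent.code <- c("K", "M", "B")

# Assumed mapping from exponent code to power of 10.
power <- c(H = 2, K = 3, M = 6, B = 9)

# Damage in dollars = mantissa * 10^power.
damage <- unname(mantissa * 10^power[exponent.code])
damage
```

Under this reading, an entry of 2.5 with code "K" stands for 2,500 dollars.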
## Select relevant variables for economic analysis and filter out observations with no damages recorded.
df.econ <-
  select(df, EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP) %>%
  filter(PROPDMG > 0 | CROPDMG > 0)
table(df.econ$PROPDMGEXP)
##
##             -      +      0      2      3      4      5      6      7
##   4357      1      5    209      1      1      4     18      3      2
##      B      H      K      M
##     40      7 229057  11326
table(df.econ$CROPDMGEXP)
##
##             ?      0      B      K      M
## 145037      6     17      7  97981   1983
The table for PROPDMGEXP above shows that data entry has not been consistent. A single digit can reasonably be interpreted as a power of 10, and the letters as units: “B” for billions, “M” for millions, “K” for thousands, and “H” for hundreds. An empty entry (“”) is reasonably interpreted as a power of 0, that is, no scaling, but I have no idea what the symbols ‘-’, ‘+’, and ‘?’ mean. The following code extracts the real-number part of the damage entries that have the symbol ‘-’, ‘+’, or ‘?’ as the exponent.
odd.exp.ind = df.econ$PROPDMGEXP %in% c("?","+","-") ### Boolean index for weird exponent.
propdmg.odd.exp = df.econ$PROPDMG[odd.exp.ind]
propdmg.odd.exp
## [1] 20 20 2 15 60 15
odd.exp.ind = df.econ$CROPDMGEXP %in% c("?","+","-")
cropdmg.odd.exp = df.econ$CROPDMG[odd.exp.ind]
cropdmg.odd.exp
## [1] 0 0 0 0 0 0
As we can see above, there are very few damage entries with the symbols ‘-’, ‘+’, and ‘?’ for exponent, and only 6 entries are non-zero for the real part, all with values no larger than 60. Therefore, we will exclude those damage entries when calculating the damage values per observation.
getValue <- function(M, x) {
  ## 'M' is a number.
  ## 'x' is a character representing an exponent.
  ## getValue returns M times 10^N, where N is the power of 10
  ## represented by x.
  ## If x is "?", "+", or "-", then the function returns 0.
  value = M
  if (x == "K") {N = 3L}
  else if (x == "") {N = 0L}
  else if (x == "M") {N = 6L}
  else if (x == "B") {N = 9L}
  else if (x == "H") {N = 2L}
  else {
    ## A single digit is used as the power itself; a symbol becomes NA.
    N = as.integer(x)
    if (is.na(N)) {
      value = 0
      N = 0L
    }
  }
  ## Return the damage value explicitly.
  value * 10^N
}
N = length(df.econ$EVTYPE)
for (k in 1:N){
  df.econ$PROPDMGVAL[k] = getValue(df.econ$PROPDMG[k], df.econ$PROPDMGEXP[k])
  df.econ$CROPDMGVAL[k] = getValue(df.econ$CROPDMG[k], df.econ$CROPDMGEXP[k])
}
econ.sum <-
  select(df.econ, EVTYPE, PROPDMGVAL, CROPDMGVAL) %>%
  group_by(EVTYPE) %>%
  summarize(PROPDMGVAL = sum(PROPDMGVAL), CROPDMGVAL = sum(CROPDMGVAL))
econ.sum <- mutate(econ.sum, TOTALDMGVAL = PROPDMGVAL + CROPDMGVAL)
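The row-by-row loop above can be slow on a few hundred thousand rows. A vectorized alternative is sketched below under the same interpretation of the exponent codes. The function `getValueVec` and the data frame `toy` are hypothetical names introduced only for illustration, and the sketch has not been run against the full dataset.

```r
# Hypothetical vectorized counterpart of getValue().
getValueVec <- function(M, x) {
  x <- as.character(x)
  # Assumed powers of 10 for the letter codes.
  letter.map <- c(H = 2, K = 3, M = 6, B = 9)
  N <- rep(NA_real_, length(x))
  is.letter <- x %in% names(letter.map)
  N[is.letter] <- letter.map[x[is.letter]]
  # An empty exponent means no scaling.
  N[x == ""] <- 0
  # A single digit is read as the power itself.
  is.digit <- grepl("^[0-9]$", x)
  N[is.digit] <- as.numeric(x[is.digit])
  # Anything else ("?", "+", "-") contributes a value of 0.
  ifelse(is.na(N), 0, M * 10^N)
}

# Made-up entries, one per kind of exponent code.
toy <- data.frame(DMG = c(25, 1.5, 20, 3, 7),
                  EXP = c("K", "M", "+", "", "2"),
                  stringsAsFactors = FALSE)
toy$VAL <- getValueVec(toy$DMG, toy$EXP)
toy
```

With a function like this, each damage column could be computed in a single call instead of one call per row.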
I have not taken the course “Getting and Cleaning Data,” as it was not listed as either a hard or a soft dependency for this course in the course dependencies chart for the Data Specialization. As a result, I have not consolidated the many variant codes in EVTYPE, so the impact of an event type may appear smaller than it really is because not all the codes for that event type were aggregated. In the interest of completing the assignment in a timely manner, I will continue with the analysis without cleaning EVTYPE. Yes, it’s bad. I am sorry, but I just don’t have the time now.
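For readers curious what such a cleanup might look like, here is a minimal sketch that merges a few variant codes with regular expressions. The helper `consolidate.evtype` is hypothetical and covers only a handful of codes. Whether each merge is appropriate (for example, treating “TSTM WIND” as “THUNDERSTORM WIND”, or folding “HURRICANE” into “HURRICANE/TYPHOON”) is an assumption that has not been checked against the Storm Data Documentation. It was not applied to any result in this report.

```r
# Hypothetical helper: map a few variant codes to a single label.
consolidate.evtype <- function(evtype) {
  evtype <- toupper(trimws(as.character(evtype)))
  # Assumed merges; each rule would need checking against the documentation.
  evtype <- sub("^TSTM WIND.*$", "THUNDERSTORM WIND", evtype)
  evtype <- sub("^THUNDERSTORM WINDS$", "THUNDERSTORM WIND", evtype)
  evtype <- sub("^HURRICANE.*$", "HURRICANE/TYPHOON", evtype)
  evtype
}

# Made-up input showing which labels the rules touch.
merged <- consolidate.evtype(c("tstm wind", "THUNDERSTORM WINDS",
                               "HURRICANE", " Flood"))
merged
```

Summing by the consolidated label instead of the raw EVTYPE would combine rows that are currently counted separately.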
A quick dotplot by event type of the number of fatalities and injuries (with log10 transformation) shows that the top 10 event types easily cover at least the top order of magnitude of the number of fatalities and injuries. The log10 transformation is used because the values range over many orders of magnitude. Thus, we will later show only the top 10 entries in table form.
par(mfrow = c(1,2), mar = c(4,4,2,1), oma = c(0,0,2,0))
with(health.sum, {
  dotchart(log10(FATALITIES), main = "Log(FATALITIES) by EVTYPE")
  dotchart(log10(INJURIES), main = "Log(INJURIES) by EVTYPE")
  mtext("Number of Fatalities and Injuries by Event Type", outer = TRUE)
  box(which = "outer")
})
Below is a table sorted in descending number of fatalities, showing the top 10 entries.
health.sum <- mutate(health.sum,
                     FATALITIES_PCT = FATALITIES / sum(FATALITIES),
                     INJURIES_PCT = INJURIES / sum(INJURIES))
fatTop10 <- head(as.data.frame(health.sum[order(health.sum$FATALITIES, decreasing = TRUE),]), n = 10)
fatTop10
##            EVTYPE FATALITIES INJURIES FATALITIES_PCT INJURIES_PCT
## 1         TORNADO       5633    91346     0.37193793  0.650019925
## 2  EXCESSIVE HEAT       1903     6525     0.12565203  0.046432028
## 3     FLASH FLOOD        978     1777     0.06457577  0.012645167
## 4            HEAT        937     2100     0.06186860  0.014943641
## 5       LIGHTNING        816     5230     0.05387917  0.037216782
## 6       TSTM WIND        504     6957     0.03327831  0.049506148
## 7           FLOOD        470     6789     0.03103334  0.048310657
## 8     RIP CURRENT        368      232     0.02429845  0.001650917
## 9       HIGH WIND        248     1137     0.01637504  0.008090914
## 10      AVALANCHE        224      170     0.01479036  0.001209723
sum(fatTop10$FATALITIES_PCT)
## [1] 0.797689
Below is a table sorted in descending number of injuries, showing the top 10 entries.
injTop10 <- head(as.data.frame(health.sum[order(health.sum$INJURIES, decreasing = TRUE),]), n = 10)
injTop10
##               EVTYPE FATALITIES INJURIES FATALITIES_PCT INJURIES_PCT
## 1            TORNADO       5633    91346   0.3719379333  0.650019925
## 2          TSTM WIND        504     6957   0.0332783097  0.049506148
## 3              FLOOD        470     6789   0.0310333443  0.048310657
## 4     EXCESSIVE HEAT       1903     6525   0.1256520304  0.046432028
## 5          LIGHTNING        816     5230   0.0538791680  0.037216782
## 6               HEAT        937     2100   0.0618686035  0.014943641
## 7          ICE STORM         89     1975   0.0058765269  0.014054139
## 8        FLASH FLOOD        978     1777   0.0645757676  0.012645167
## 9  THUNDERSTORM WIND        133     1488   0.0087817762  0.010588637
## 10              HAIL         15     1361   0.0009904259  0.009684903
sum(injTop10$INJURIES_PCT)
## [1] 0.893402
From the tables above, we see that tornado is the leading cause of fatalities and injuries to people, accounting for 37% of all fatalities and 65% of all injuries recorded in the dataset. The top 10 weather events causing the most fatalities account for about 80% of all fatalities, while each of the other events contributes no more than 1.5% of all fatalities. The top 10 weather events causing the most injuries account for about 89% of all injuries, while each of the other events contributes no more than 1% of all injuries.
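To make the concentration of fatalities concrete, the sketch below accumulates the FATALITIES_PCT values copied from the top-10 fatalities table above. The names `fat.share` and `cum.share` are introduced here only for illustration.

```r
# FATALITIES_PCT values copied from the top-10 fatalities table above.
fat.share <- c("TORNADO" = 0.37193793, "EXCESSIVE HEAT" = 0.12565203,
               "FLASH FLOOD" = 0.06457577, "HEAT" = 0.06186860,
               "LIGHTNING" = 0.05387917, "TSTM WIND" = 0.03327831,
               "FLOOD" = 0.03103334, "RIP CURRENT" = 0.02429845,
               "HIGH WIND" = 0.01637504, "AVALANCHE" = 0.01479036)

# Running share of all fatalities as event types are added in rank order.
cum.share <- cumsum(fat.share)
round(cum.share, 3)

# First event type at which the running share reaches one half.
names(cum.share)[which(cum.share >= 0.5)[1]]
```

The running total crosses one half at the third-ranked event type, which shows how heavily fatalities are concentrated in the first few rows of the table.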
A quick dotplot by event type of the damage values (with log10 transformation) shows that the top 10 event types easily cover at least the top order of magnitude of the total damage values. The log10 transformation is used because the values range over many orders of magnitude. Thus, we will later show only the top 10 entries in table form.
par(mfrow = c(1,1), mar = c(4,4,2,1), oma = c(0,0,2,0))
with(econ.sum, {
  dotchart(log10(TOTALDMGVAL), main = "Log(TOTALDMGVAL) by EVTYPE")
  mtext("Total Property and Crop Damages by Event Type", outer = TRUE)
})
Below is a table sorted in descending value of combined property and crop damages, showing the top 10 entries.
econ.sum <- mutate(econ.sum, TOTALDMGVAL_PCT = TOTALDMGVAL / sum(TOTALDMGVAL))
totalTop10 <- head(as.data.frame(econ.sum[order(econ.sum$TOTALDMGVAL, decreasing = TRUE), c("EVTYPE" , "TOTALDMGVAL", "TOTALDMGVAL_PCT")]), n = 10)
totalTop10
##               EVTYPE  TOTALDMGVAL TOTALDMGVAL_PCT
## 1              FLOOD 150319678257      0.31491835
## 2  HURRICANE/TYPHOON  71913712800      0.15065857
## 3            TORNADO  57362333887      0.12017356
## 4        STORM SURGE  43323541000      0.09076242
## 5               HAIL  18761221986      0.03930459
## 6        FLASH FLOOD  18244041079      0.03822110
## 7            DROUGHT  15018672000      0.03146398
## 8          HURRICANE  14610229010      0.03060830
## 9        RIVER FLOOD  10148404500      0.02126081
## 10         ICE STORM   8967041360      0.01878587
sum(totalTop10$TOTALDMGVAL_PCT)
## [1] 0.8561575
From the table above, we see that flood has the greatest economic consequences, as measured by the total damage value (combining property and crop damages), and accounts for about 31% of the total damage value. The top 10 weather events with the highest total damage values account for about 86% of the total damage value.