This analysis explores the National Oceanic and Atmospheric Administration’s (NOAA) Storm Database in answering three research questions:
According to this analysis, and keeping in mind its scope and limitations (see Part 2.), brief answers to these questions are (see Part 4. for a fuller account):
This is the final course project for Course 5: Reproducible Research in the 9-Course ‘Data Science Specialization’ from Johns Hopkins University in Coursera. As such, the scope and aim of this project is not toward industry-standard accuracy in its findings, but merely whether the analysis is fully reproducible or not. These are a few limitations when considering the results of this project’s analysis:
A few limitations of NOAA’s Storm Data are addressed in this study by Renato dos Santos.
Among many resources, this NOAA website provides a summary statistics against which to compare any results.
This analysis does not account for the change in relative value across the two decades considered: damage figures are unadjusted.
Among many reports, this study addresses Six Fallacies of Natural Hazards Loss Data.
This analysis was performed with the following system specifications:
sessionInfo()
## R version 3.4.3 (2017-11-30)
## Platform: i386-w64-mingw32/i386 (32-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] compiler_3.4.3 backports_1.1.2 magrittr_1.5 rprojroot_1.3-1
## [5] tools_3.4.3 htmltools_0.3.6 yaml_2.1.16 Rcpp_0.12.14
## [9] stringi_1.1.6 rmarkdown_1.8 knitr_1.17 stringr_1.2.0
## [13] digest_0.6.13 evaluate_0.10.1
Caution: the first command will clear your R workspace.
Credit to Steven Worthington for the “ipak” function.
rm(list = ls())
ipak <- function(pkg){
new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
if (length(new.pkg))
install.packages(new.pkg, dependencies = TRUE, warn.conflicts = FALSE)
sapply(pkg, require, character.only = TRUE)
}
packages <- c("dplyr", "data.table", "ggplot2")
ipak(packages)
## dplyr data.table ggplot2
## TRUE TRUE TRUE
if (!file.exists("data")) {
dir.create("data")
}
fileURL <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(fileURL, destfile = "./data/StormData.csv.bz2")
Since the storm data is large and takes several minutes to load, I subset the data by preprocessing steps determined after some initial exploratory analysis (not included). Here are some initial findings and how they guided the choice of subset:
Between 1950 and 1990 there were only three types of events in the storm data: TORNADO, TSTM WIND, and HAIL. After 1990, the number of events spike to 913. The subset below (“noaa”) starts at 1990 to minimize bias toward those three events, creating a DATE
variable from the BGN_DATE
, which gets discarded.
The Storm Data Documentation and FAQ page provide information about how the storm data is collected, but fails to describe the variables in the dataset. Lacking a proper CODEBOOK, the initial exploratory analysis showed that the appropriate variables for determining population harm are FATALITIES
and INJURIES
, and those for the cost of damages are PROPDMG
(property damage) and CROPDMG
. These variables contain simplified figures which need to be raised by the appropriate value in the corresponding exponent variables: PROPDMGEXP
and CROPDMGEXP
. Finally, the EVTYPE
is a conditioning variable with names of all the types of weather events.
noaa <- data.table(read.table("./data/StormData.csv.bz2", sep = ",", header = TRUE, nrows = 902298,
na.strings = "", stringsAsFactors = FALSE)) %>%
select(BGN_DATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP) %>%
mutate(DATE = as.Date(BGN_DATE, format = "%m/%d/%Y %H:%M:%S")) %>%
filter(year(DATE) > 1989)
# discard old date variable BGN_DATE
noaa <- noaa[, -1]
str(noaa)
## 'data.frame': 751740 obs. of 8 variables:
## $ EVTYPE : chr "HAIL" "TSTM WIND" "TSTM WIND" "TSTM WIND" ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 0 0 ...
## $ INJURIES : num 0 0 0 0 28 0 0 0 0 0 ...
## $ PROPDMG : num 0 0 0 0 2.5 0 0 0 25 25 ...
## $ PROPDMGEXP: chr NA NA NA NA ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr NA NA NA NA ...
## $ DATE : Date, format: "1990-01-05" "1990-01-20" ...
We immediately see that there are a lot of missing values and zeros in the data.
Handling Missing Values
mean(is.na(noaa$PROPDMGEXP))
## [1] 0.460618
mean(is.na(noaa$CROPDMGEXP))
## [1] 0.6223641
Since a large portion of the data is missing for the exponent variables (46% and 62%), the wrong choice of replacement would have disastrous consequences. The integer 1 is chosen so if any figures were entered for property or crop damages with an NA exponent, we would not lose that figure. There are no other missing data in the “noaa” subset.
noaa$PROPDMGEXP[is.na(noaa$PROPDMGEXP)] <- 1L
noaa$CROPDMGEXP[is.na(noaa$CROPDMGEXP)] <- 1L
Handling Zeroes
We discard those observations in which all variables under consideration have a zero, considerably reducing the size of the “noaa” subset from roughly 752 to 230 thousand observations.
noaa <- noaa[noaa$FATALITIES != 0 | noaa$INJURIES != 0 | noaa$PROPDMG != 0 | noaa$CROPDMG != 0, ]
dim(noaa)
## [1] 230112 8
Handling the EVTYPE variable
The EVTYPE
variable has several flaws, the first of which is random capitalization. For ease in cleaning, I capitalize everything.
noaa$EVTYPE <- toupper(noaa$EVTYPE)
length(unique(noaa$EVTYPE))
## [1] 447
Still contending with 447 unique event types, an exploratory analysis (not included) reveals a high number of misspellings, redundancies in terminology, and obscure entries.
We filter out a few obscure entries (such as “APACHE COUNTY”) and proceed to re-categorize various types of event types into broader categories for a streamlining the analysis. This process has a deep impact on the analysis as choices in categorization will lead to changes in results.
The new variable EVCAT
has 23 broad event categories. A few notes on categorization choices:
=TRUE
and not included = FALSE
noaa <- noaa %>%
filter(EVTYPE != "APACHE COUNTY" & EVTYPE != "?" & EVTYPE != "HIGH" & EVTYPE != "URBAN AND SMALL"
& EVTYPE != "URBAN SMALL") %>%
mutate(EVCAT = ifelse(grepl("COLD|CHILL|HYPOTHERMIA|LOW TEMP", EVTYPE) == TRUE, "cold",
ifelse(grepl("HEAT|HYPERTHERMIA|WARM", EVTYPE) == TRUE
& grepl("DROUGHT", EVTYPE) == FALSE, "heat",
ifelse(grepl("DROUGHT", EVTYPE) == TRUE, "drought",
ifelse(grepl("DROWN", EVTYPE) == TRUE, "drowning",
ifelse(grepl("DUST|DRY", EVTYPE) == TRUE
& grepl("WARM", EVTYPE) == FALSE, "dust",
ifelse(grepl("FLOOD|FLD|RISING WATER|HIGH WATER", EVTYPE) == TRUE, "flood",
ifelse(grepl("FIRE|SMOKE", EVTYPE) == TRUE, "fire",
ifelse(grepl("FOG", EVTYPE) == TRUE, "fog",
ifelse(grepl("FREEZ|FROST", EVTYPE) == TRUE, "frost",
ifelse(grepl("HAIL", EVTYPE) == TRUE
& grepl("TORNADO", EVTYPE) == FALSE, "hail",
ifelse(grepl("HURRICANE|TYPHOON|TROPICAL", EVTYPE) == TRUE, "hurricane",
ifelse(grepl("ICE|ICY|GLAZE", EVTYPE) == TRUE
& grepl("SNOW|FLOOD", EVTYPE) == FALSE, "ice",
ifelse(grepl("LANDSLIDE|MUDSLIDE|MUD SLIDE|LANDSLUMP", EVTYPE) == TRUE, "landslide",
ifelse(grepl("LIGHTNING|LIGHTING|LIGNTNING", EVTYPE) == TRUE, "lightning",
ifelse(grepl("BEACH|COASTAL|CURRENT|MARINE|SEAS|SURGE|SURF|TIDE|TSUN|WATERSPOUT|WAVE",
EVTYPE) == TRUE & grepl("UNSEASON|HEAT|COLD|FLOOD", EVTYPE) == FALSE, "ocean",
ifelse(grepl("RAIN|SHOWER|PRECIP", EVTYPE) == TRUE
& grepl("SNOW|FLOOD|FREEZ|WIND", EVTYPE) == FALSE, "rain",
ifelse(grepl("AVALAN|BLIZZ|SNOW", EVTYPE) == TRUE
& grepl("COLD|FREEZ", EVTYPE) == FALSE, "snow",
ifelse(grepl("THUNDERSTORM|THUNDERSTORMS", EVTYPE) == TRUE, "t-storm",
ifelse(grepl("TORNADO|TORNDAO|WHIRLWIND|FUNNEL|LANDSPOUT", EVTYPE) == TRUE, "tornado",
ifelse(grepl("BURST|GUSTNADO", EVTYPE) == TRUE, "wind burst",
ifelse(grepl("WIND|TURBULENCE|TSTMW", EVTYPE) == TRUE
& grepl("MARINE|SEAS|SURF|HURRIC|BURST|TORNADO|WHIRL|SNOW", EVTYPE) == FALSE, "wind",
ifelse(grepl("WINTER|WINTRY", EVTYPE) == TRUE, "winter", "other")))))))))))))))))))))))
# a look at "other" events:
unique(noaa[noaa$EVCAT == "other", ]$EVTYPE)
## [1] "SLEET" "COOL AND WET" "EXCESSIVE WETNESS"
## [4] "HEAVY MIX" "URBAN/SMALL STREAM" "HEAVY SWELLS"
## [7] "OTHER" "DAM BREAK" "HIGH SWELLS"
## [10] "SEICHE" "ROCK SLIDE" "VOLCANIC ASH"
Handling the Exponent Variables
For handling of exponent variables I partly consulted this GitHub repo, among other sources.
# values in the exponent variables
table(noaa$PROPDMGEXP)
##
## - + 0 1 2 3 4 5 6 7
## 1 5 210 10717 1 1 4 18 3 3
## B h H K m M
## 40 1 6 210144 7 8946
table(noaa$CROPDMGEXP)
##
## ? 0 1 B k K m M
## 6 17 128138 7 21 99932 1 1985
The number of observations classified with extraneous values beyond the clear-cut letter labels (B = billion, M = million, etc.) is small, however, this is another source of uncertainty in this analysis. After some exploratory analysis (not included), I decided to proceed with the approach that numbered exponent values mean actual exponents in base 10, so a 5 means 10,000 times the value in the property or crop variables.
# substituting code symbols for integer multipliers
noaa$PROPDMGEXP <- recode(noaa$PROPDMGEXP, B = 1e+09L, M = 1e+06L, m = 1e+06L, K = 1e+03L, H = 1e+02L,
h = 1e+02L, "2" = 1e+02L, "3" = 1e+03L, "4" = 1e+04L, "5" = 1e+05L,
"6" = 1e+06L, "7" = 1e+07L, "0" = 1L, "+" = 1L, "-" = 0L, .default = 1L)
noaa$CROPDMGEXP <- recode(noaa$CROPDMGEXP, B = 1e+09L, M = 1e+06L, m = 1e+06L, K = 1e+03L, k = 1e+03L,
"0" = 1L, "?" = 0L, .default = 1L)
# multiplying by exponent and further narrowing the "noaa" subset
noaa$PROPDMG <- noaa$PROPDMG * noaa$PROPDMGEXP
noaa$CROPDMG <- noaa$CROPDMG * noaa$CROPDMGEXP
# removing unnecessary variables
noaa <- noaa[, -c(1,5,7,8)]
head(noaa)
## FATALITIES INJURIES PROPDMG CROPDMG EVCAT
## 1 0 28 2500000 0 tornado
## 2 0 0 25000 0 tornado
## 3 0 0 25000 0 tornado
## 4 0 3 2500000 0 tornado
## 5 0 2 2500000 0 tornado
## 6 0 15 2500000 0 tornado
With the final figures for property and crop damage computed for every observation, the data is primed for final tidy subsets.
We assess harm to population health by considering total number of fatalities and injuries per event category. Noticing that fatality and injury are actually two levels of harm, I created a single index variable (ind) with those two levels.
# set print max options for tibbles
options('tibble.print_max' = 100)
harm <- noaa %>%
select(EVCAT, FATALITIES, INJURIES) %>%
group_by(EVCAT) %>%
filter(FATALITIES != 0 | INJURIES != 0) %>%
summarise_all(sum) %>%
arrange(desc(FATALITIES))
# final tidy dataset
tidyharm <- data.frame(harm, stack(harm[2:3]))[, c(1, 4, 5)]
head(tidyharm, 3)
## EVCAT values ind
## 1 heat 3173 FATALITIES
## 2 tornado 1778 FATALITIES
## 3 flood 1557 FATALITIES
tail(tidyharm, 3)
## EVCAT values ind
## 42 drought 19 INJURIES
## 43 other 4 INJURIES
## 44 drowning 0 INJURIES
Plot 1: POPULATION HARM
ggplot(tidyharm, aes(x = reorder(EVCAT, +values), y = values, fill = tolower(ind))) +
geom_bar(stat = "identity") + theme_minimal() + coord_flip() +
scale_fill_manual(values=c("#510A1A", "#CB103B")) +
labs(title = "Weather-Related Fatalities and Injuries in the U.S.",
subtitle = "Summary of NOAA Storm Data from 1990 to 2011",
x = "Event Type", y = "Number of Incidents Reported") +
theme(legend.title = element_blank())
The most harmful weather events, according to this analysis and keeping in mind its scope and limitations, seem to be tornadoes, extreme heat, flooding, strong winds (other than hurricanes, tornadoes), and lightning. Noticeably, heat causes more deaths than tornadoes. This could partly be due to inconsistent methodologies and incentives in classifying mortality by heat or tornado. Perhaps surprisingly, cold is further down the list of most harmful events; this is partly due to other classifications of events (as previously listed in this analysis) that include cold, such as: wind, snow, ice, and winter.
The cost of weather-related damage to property and crops is considered using the PROPDMG
and CROPDMG
variables, as previously computed. Similarly, a single cost variable can be computed using two levels: property, and crop.
cost <- noaa %>%
select(EVCAT, PROPDMG, CROPDMG) %>%
filter(PROPDMG != 0 | CROPDMG != 0) %>%
group_by(EVCAT) %>%
summarise_all(sum) %>%
arrange(desc(PROPDMG))
# final tidy dataset
tidycost <- data.frame(cost, stack(cost[2:3]))[, c(1, 4, 5)]
tidycost <- tidycost %>%
mutate(ind = ifelse(tidycost$ind == "CROPDMG", "crops", "properties"))
head(tidycost)
## EVCAT values ind
## 1 flood 168271030485 properties
## 2 hurricane 93072537560 properties
## 3 ocean 48305183590 properties
## 4 tornado 32073720807 properties
## 5 hail 16022991537 properties
## 6 wind 10630829763 properties
tail(tidycost)
## EVCAT values ind
## 39 frost 1997061000 crops
## 40 fog 0 crops
## 41 heat 904423500 crops
## 42 dust 3615000 crops
## 43 other 148034400 crops
## 44 wind burst 1550 crops
Plot 2: ECONOMIC COST
ggplot(tidycost, aes(x = reorder(EVCAT, +values), y = values/1e+09, fill = ind)) +
geom_bar(stat = "identity") + theme_minimal() + coord_flip() +
scale_fill_manual(values=c("#0E348B", "#1976EC")) +
labs(title = "Cost of Weather-Related Damages to Crops and Properties in the U.S.",
subtitle = "Summary of NOAA Storm Data from 1990 to 2011",
x = "Event Type", y = "Cost of Damages (in Billions of Dollars)") +
guides(fill=guide_legend(title="Damage to"))
The most expensive weather-related events, according to this analysis and considering its limitations, seem to be: flooding, tropical storms (including hurricanes), coastal and marine weather, tornadoes, hail, and drought. Most of the damage is done to properties, however, most of the drought damage, as expected, is done to crops (just as frost and cold). Noticeably, tornado damages rank much lower than flooding, a distinct difference from harm done to the population. This and other possible comparisons led me to pursue a third question of how population harm and economic damage relate to each other.
This question requires a comparison of number of incidents with dollar value of incidents, which is somewhat contrived since one can’t put a price on lives, among other considerations. For this comparison to work I calculated the proportional impact separately and plotted these proportions.
# Recalculating "harm" and "cost" subsets without removing observations of zeros separately
# This step avoids discarding event types that zeroes across one subset but not the other
harm <- noaa %>%
select(EVCAT, FATALITIES, INJURIES) %>%
group_by(EVCAT) %>%
summarise_all(sum) %>%
arrange(desc(FATALITIES))
cost <- noaa %>%
select(EVCAT, PROPDMG, CROPDMG) %>%
group_by(EVCAT) %>%
summarise_all(sum) %>%
arrange(desc(PROPDMG))
# Creating tidy set of totals and their proportional impact for each case
totharm <- harm %>%
mutate(TOT.harm = FATALITIES + INJURIES) %>%
mutate(prop = prop.table(TOT.harm)) %>%
select(EVCAT, prop) %>%
arrange(desc(prop))
totcost <- cost %>%
mutate(TOT.cost = PROPDMG + CROPDMG) %>%
mutate(prop = prop.table(TOT.cost)) %>%
select(EVCAT, prop) %>%
arrange(desc(prop))
# Merging data subsets
harmcost <- merge(totharm[-c(22,23),], totcost[-c(22,23),], by = "EVCAT") %>%
arrange(desc(prop.x))
# Recreating tidy dataset
tidyharmcost <- data.frame(harmcost, stack(harmcost[2:3]))[, c(1, 4, 5)]
tidyharmcost <- tidyharmcost %>%
mutate(ind = ifelse(tidyharmcost$ind == "prop.x", "total harm", "total cost"))
head(tidyharmcost)
## EVCAT values ind
## 1 tornado 0.33582591 total harm
## 2 heat 0.14626408 total harm
## 3 flood 0.12077608 total harm
## 4 wind 0.09015746 total harm
## 5 lightning 0.07134517 total harm
## 6 t-storm 0.03126732 total harm
tail(tidyharmcost)
## EVCAT values ind
## 37 dust 3.700946e-05 total cost
## 38 rain 8.911674e-03 total cost
## 39 landslide 7.702402e-04 total cost
## 40 frost 4.508665e-03 total cost
## 41 drought 3.331244e-02 total cost
## 42 other 3.372558e-04 total cost
Plot 3: HARM versus COST
ggplot(tidyharmcost, aes(x = reorder(EVCAT, -values), y = values, fill = ind)) +
geom_bar(stat = "identity", position = "dodge") + theme_minimal() +
scale_fill_manual(values=c("#1244B8", "#A40C2E")) +
labs(title = "Impacts of Weather Events on the U.S. Population and Economy",
subtitle = "Summary of NOAA Storm Data from 1990 to 2011",
x = "Event Type", y = "Proportional Distribution of Totals") +
theme(legend.title = element_blank()) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
This plot exemplifies how the impact of weather-related events vary greatly depending on whether one considers their cost or their harm. A further classification of weather events could be made by separating them into two broad categories, those that cause more population harm than economic cost, and those that do the opposite. In the former group, we can include: tornadoes, heat, wind, lightning, icy conditions, thunderstorms and snow; while in the latter group (more money loss than life loss) we can include: flood, tropical storms, coastal weather, hail, drought, and rain. Fire seems to be the only type of event that causes equal harm and cost to society.
FINAL NOTE: Further exploration and better methodology and consideration of fallacies and biases in this analysis are warranted.