This analysis uses data from the National Oceanic and Atmospheric Administration (NOAA) Storm Database to determine which types of extreme weather event are most harmful to life and property. The dataset contains information on individual weather events in the US, including estimates of fatalities, injuries, property damage and crop damage for each event. Only the most recent ten years of data (2002-2011) are considered. Weather event records are matched to a number of general weather event types based on keywords in their descriptions and, for each of these types, the total and the mean are calculated for each of the four variables above. The totals indicate the overall casualties and damage due to each type of event, while the means give an idea of the effect of an individual event of each type. These quantities are plotted as bar charts to show which kinds of weather event are most hazardous to health and the economy.
First, the CSV-formatted data is downloaded if necessary and loaded into R.
data_file <- "StormData.csv.bz2"
# Download the compressed file only if it is not already present
if (!file.exists(data_file)) {
  url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
  download.file(url = url, destfile = data_file, method = "curl")
}
# bzfile() opens a connection that read.csv decompresses while reading
data <- read.csv(bzfile(data_file), na.strings = "?", stringsAsFactors = FALSE)
After being loaded, the data is processed. To ensure the results are relevant to modern times, only data from the most recent ten years (2002-2011) is kept.
library(dplyr)
# Parse the event start date, then keep only events that began in 2002-2011
data <- data %>%
  mutate(BGN_DATE = as.Date(BGN_DATE, "%m/%d/%Y %H:%M:%S")) %>%
  filter(as.numeric(format(BGN_DATE, format = "%Y")) %in% 2002:2011)
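The date handling above can be checked on a few values before running it on the full dataset. The following is a minimal base R sketch; the three date strings are made-up stand-ins for `BGN_DATE` values, and the names `bgn_date`, `parsed`, `years` and `keep` are illustrative only.

```r
# Made-up stand-ins for BGN_DATE values (not taken from the dataset)
bgn_date <- c("4/18/1950 0:00:00", "1/1/2002 0:00:00", "12/31/2011 0:00:00")

# Parse month/day/year; the time of day is matched by the format but
# dropped when the result is stored as a Date
parsed <- as.Date(bgn_date, format = "%m/%d/%Y %H:%M:%S")

# Extract the calendar year and flag the records inside the 2002-2011 window
years <- as.integer(format(parsed, "%Y"))
keep <- years %in% 2002:2011

print(data.frame(bgn_date, parsed, years, keep))
```

Both endpoints of the range are included, so an event that began on 1 January 2002 or 31 December 2011 is kept.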
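The second step in processing the data is to select the relevant variables. These are the event type (EVTYPE) and the six consecutive columns from FATALITIES to CROPDMGEXP, which hold the fatality and injury counts and the property and crop damage figures together with their units.
These variables are also given tidier, more descriptive names.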
data <- data %>%
  select(EVTYPE, FATALITIES:CROPDMGEXP)
colnames(data) <- c("event.type", "fatalities", "injuries", "property.damage",
                    "property.damage.units", "crop.damage", "crop.damage.units")
Next, in preparation for filtering the data based on these, the textual variables - the event type description and the property and crop damage units - are converted to lowercase and trimmed of leading and trailing whitespace.
data <- data %>%
  mutate(event.type = trimws(tolower(event.type)),
         property.damage.units = trimws(tolower(property.damage.units)),
         crop.damage.units = trimws(tolower(crop.damage.units)))
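As a small illustration of this normalisation, the sketch below applies the same two functions to a few made-up strings (the values in `raw` are assumptions chosen to show mixed case and stray whitespace, not records from the dataset).

```r
# Made-up examples with mixed case and stray whitespace
raw <- c(" TSTM WIND", "Flash Flood ", "K")

# Lowercase first, then strip leading and trailing whitespace
clean <- trimws(tolower(raw))

print(clean)
```

After this step, later comparisons such as matching a unit against "k" no longer depend on how the original record was capitalised or padded.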
The data is then filtered to remove any records which refer to summaries of time periods instead of specific weather events. Additionally, only records with unambiguous property and crop damage units are kept: ‘k’ (thousands), ‘m’ (millions) or ‘b’ (billions), or blank when there is no damage.
# Dollar multiplier for each recognised unit code
units <- list(k = 1e3, m = 1e6, b = 1e9)
data <- data %>%
  filter(!grepl("summary", event.type),
         property.damage.units %in% c("", names(units)),
         crop.damage.units %in% c("", names(units)),
         !(property.damage > 0 & property.damage.units == ""),
         !(crop.damage > 0 & crop.damage.units == ""))
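To make the filtering rules concrete, here is a minimal base R sketch that applies the same conditions to property damage only. The data frame `toy` and all of its values are invented for illustration; the real filter also applies the equivalent conditions to crop damage.

```r
units <- list(k = 1e3, m = 1e6, b = 1e9)

# Invented records covering each case the filter distinguishes
toy <- data.frame(
  event.type = c("flood", "summary of june 3", "hail", "tornado", "wind"),
  property.damage = c(25, 0, 10, 0, 5),
  property.damage.units = c("k", "", "", "", "h"),
  stringsAsFactors = FALSE
)

keep <- !grepl("summary", toy$event.type) &                  # drop period summaries
  toy$property.damage.units %in% c("", names(units)) &       # drop unknown unit codes
  !(toy$property.damage > 0 & toy$property.damage.units == "")  # drop damage without a unit

print(toy[keep, ])
```

In this sketch the "flood" and "tornado" rows survive: the first has a recognised unit, and the second has no damage and therefore needs none. The "hail" row is dropped because its non-zero damage has no unit, and the "wind" row because "h" is not one of the accepted codes.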
The penultimate preprocessing step is to combine the property and crop damage variables with their corresponding units to obtain damage estimates in dollars.
# Map a unit code to its dollar multiplier; a blank unit (no damage) maps to 0
convert_unit <- function(unit) {
  if (unit %in% names(units)) {
    return(units[[unit]])
  }
  return(0)
}
data <- data %>%
  mutate(property.damage.units = sapply(property.damage.units,
                                        convert_unit,
                                        USE.NAMES = FALSE),
         crop.damage.units = sapply(crop.damage.units,
                                    convert_unit,
                                    USE.NAMES = FALSE),
         property.damage = property.damage * property.damage.units,
         crop.damage = crop.damage * crop.damage.units) %>%
  select(-c(property.damage.units, crop.damage.units))
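The conversion above calls `convert_unit` once per record through `sapply`. One possible alternative, shown here only as a sketch, is a vectorised lookup in a named numeric vector. The names `unit_multipliers` and `to_dollars` are hypothetical and do not appear in the analysis; the input values are invented.

```r
# Hypothetical named vector holding the same multipliers as the units list
unit_multipliers <- c(k = 1e3, m = 1e6, b = 1e9)

# Hypothetical helper: convert amounts and unit codes to dollars in one step
to_dollars <- function(amount, unit) {
  # Indexing by name yields NA for a blank or unrecognised unit
  multiplier <- unname(unit_multipliers[unit])
  # Treat those cases as a multiplier of 0, as convert_unit does
  multiplier[is.na(multiplier)] <- 0
  amount * multiplier
}

dollars <- to_dollars(c(25, 1.5, 2, 0), c("k", "m", "b", ""))
print(dollars)
```

This should give the same result as the element-wise version for the unit codes that survive the earlier filter, while avoiding a function call per row.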
Finally, the event type variable, containing a description of the kind of weather event observed, is updated. At this point, there are 121 unique event descriptions, many of which refer to very similar types of event. To remedy this, a number of general weather event types are defined, each with a regular expression of keywords. The subset of records whose description matches each type is extracted, and its event type description is replaced with the relevant general name. These subsets are then recombined. Note that a record is duplicated if it matches more than one general type (for example, a description containing both "thunderstorm" and "wind" is counted under both), and a record that matches none of the types is dropped.
# General event types and the keyword patterns used to recognise them
name <- c("hurricane/typhoon", "tornado", "thunderstorm", "tropical storm",
          "wind", "storm surge", "low tide", "high tide","flood", "snow/sleet",
          "rain", "hail", "cold/wintry weather", "ice", "drought", "heat",
          "wildfires", "dust storm/devil", "erosion", "volcanic eruption/ash",
          "mudslide", "fog")
regexp <- c("hurricane|typhoon", "tornado", "tstm|thunderstorm|lightning",
            "tropical storm", "wind", "surge", "low tide",
            "high tide", "flood", "snow|blizzard|sleet", "rain|wet", "hail",
            "cold|freez|(wint(e)?r(y)?.*(weather|mix))", "ice", "drought|dry",
            "heat|warmth", "fire", "dust", "erosion", "volcanic", "mud", "fog")
weather.types <- data.frame(name, regexp, stringsAsFactors = FALSE)
# Extract the records matching each pattern and relabel them with the general name
data_list <- vector(mode = "list", length = nrow(weather.types))
for (i in seq_len(nrow(weather.types))) {
  data_list[[i]] <- data %>%
    filter(grepl(weather.types[i, "regexp"], event.type)) %>%
    mutate(event.type = weather.types[i, "name"])
}
# Recombine the subsets; a record matching several patterns appears once per match
data <- do.call("rbind", data_list)
rm(data_list)
data$event.type <- as.factor(data$event.type)
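The effect of matching each description against every pattern can be seen in a small base R sketch. It uses three of the patterns defined above and three example descriptions chosen for illustration; the names `types`, `descriptions`, `matches` and `copies` are assumptions introduced here.

```r
# Three of the general types, with the same patterns as in the analysis
types <- data.frame(
  name = c("thunderstorm", "wind", "flood"),
  regexp = c("tstm|thunderstorm|lightning", "wind", "flood"),
  stringsAsFactors = FALSE
)

# Example descriptions, chosen to show a double match, a single match and no match
descriptions <- c("thunderstorm wind", "flash flood", "high surf")

# One column per general type, one row per description
matches <- sapply(types$regexp, grepl, x = descriptions)
dimnames(matches) <- list(descriptions, types$name)
print(matches)

# Number of copies of each record that would end up in the recombined data
copies <- rowSums(matches)
print(copies)
```

A row sum above one corresponds to a duplicated record, and a row sum of zero to a record that is left out of the recombined data. With the full set of 22 patterns, a description may of course match types that are not included in this reduced example.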
The analysis of this data is straightforward, consisting of calculating the sum and mean of each variable of interest for each type of weather event. Firstly, the data on casualties is analysed. The data is grouped by event type, and the sum total and the mean are calculated for both fatalities and injuries for each of these event types. Event types are ordered by total fatalities.
# funs() matches the dplyr 0.7.5 used here; it is deprecated from dplyr 0.8.0 onwards
health.data <- data %>%
  select(event.type, fatalities, injuries) %>%
  group_by(event.type) %>%
  summarize_all(funs(sum, mean)) %>%
  arrange(desc(fatalities_sum))
print(health.data)
## # A tibble: 21 x 5
##    event.type    fatalities_sum injuries_sum fatalities_mean injuries_mean
##    <fct>                  <dbl>        <dbl>           <dbl>         <dbl>
##  1 tornado                 1112        13588         0.0733         0.896
##  2 heat                     920         4019         0.567          2.48
##  3 flood                    789          820         0.0144         0.0150
##  4 wind                     665         3402         0.00381        0.0195
##  5 thunderstorm             597         4849         0.00367        0.0298
##  6 cold/wintry ~            296          454         0.0268         0.0411
##  7 wildfires                 76         1051         0.0247         0.342
##  8 hurricane/ty~             67         1291         0.523         10.1
##  9 snow/sleet                36          258         0.00296        0.0212
## 10 fog                       31          289         0.0237         0.221
## # ... with 11 more rows
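Because each mean is the corresponding sum divided by the number of records of that type, the table also implies roughly how many records each type contains. The sketch below works this out for three rows using the figures printed above; since the printed means are rounded, the results are approximate, and the name `approx_records` is introduced here for illustration.

```r
# Totals and means copied from the printed table (means are rounded)
fatalities_sum  <- c(tornado = 1112, heat = 920, hurricane = 67)
fatalities_mean <- c(tornado = 0.0733, heat = 0.567, hurricane = 0.523)

# mean = sum / n, so n is approximately sum / mean
approx_records <- round(fatalities_sum / fatalities_mean)
print(approx_records)
```

This helps explain why the two rankings differ: a type with very many records, such as tornadoes, can have the highest total while its per-record mean stays low, whereas a type with few records can have a high mean from a modest total.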
The data on economic damage is analysed in the same way. For each weather event type, the sum total and the mean are calculated for both property damage and crop damage. Here, event types are ordered by the sum of the total property and crop damage.
economic.data <- data %>%
  select(event.type, property.damage, crop.damage) %>%
  group_by(event.type) %>%
  summarize_all(funs(sum, mean)) %>%
  arrange(desc(property.damage_sum + crop.damage_sum))
print(economic.data)
## # A tibble: 21 x 5
##    event.type       property.damage_s~ crop.damage_sum property.damage_me~
##    <fct>                         <dbl>           <dbl>               <dbl>
##  1 flood                  144321511800      4404421400            2638467.
##  2 hurricane/typho~        72342695010      3056382800          565177305.
##  3 storm surge             47809503000          850000          190476108.
##  4 tornado                 18406922660       220589910            1213537.
##  5 wind                    10018541390      1155172600              57399.
##  6 hail                     9189944470      1394738150              64327.
##  7 drought                   846041000      5423626000             453641.
##  8 thunderstorm             5512802350       579269900              33928.
##  9 wildfires                4959547000       297479430            1611812.
## 10 tropical storm           2008360550       410061000            3438974.
## # ... with 11 more rows, and 1 more variable: crop.damage_mean <dbl>
A bar plot of the sum total and the mean of fatalities and injuries, for the 10 most hazardous event types by total fatalities, gives a visual representation of which types of extreme weather are most harmful to life.
library(reshape2)
library(tidyr)
library(ggplot2)
# Take the top 10 types and reshape to one row per type, variable and summary
health.worst10 <- health.data %>%
  .[1:10,] %>%
  melt(id.vars = "event.type") %>%
  separate(variable, into = c("variable", "summary.type"), sep = "_")
# Reverse the level order so the worst type appears at the top after coord_flip()
health.worst10$event.type <- factor(health.worst10$event.type,
                                    levels = rev(health.worst10$event.type[1:10]))
health.worst10$summary.type <- factor(health.worst10$summary.type,
                                      levels = c("sum", "mean"))
ggplot(health.worst10, aes(event.type, value, fill = variable)) +
  geom_col(position = "dodge") +
  facet_grid(. ~ summary.type, scales = "free") +
  coord_flip() +
  theme(legend.position = "bottom", legend.title = element_blank()) +
  ggtitle("Casualties from US Weather Events") +
  xlab("") +
  ylab("Casualties") +
  labs(caption = paste("Total and average fatalities and injuries",
                       "for the 10 most hazardous weather event",
                       "types by total fatalities caused")) +
  scale_fill_manual(labels = c("Fatalities", "Injuries"),
                    values = c("red3", "orange1"))
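This plot shows that tornadoes, extreme heat and flooding are responsible for the greatest number of fatalities (1112, 920 and 789 respectively over the ten-year period). In terms of injuries, tornadoes cause the most by a wide margin (13588), with thunderstorms, heat and wind each causing several thousand. However, the means show that, per event, hurricanes and heat cause the largest number of casualties: hurricanes average roughly 0.5 fatalities and 10 injuries per record, and heat roughly 0.6 fatalities and 2.5 injuries.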
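For reference, the wide-to-long reshaping that precedes the plot (done in the analysis with `melt` and `separate`) can be sketched in base R as follows. This is an illustrative equivalent under assumed toy data, not the code used for the figure: the data frame `wide` and its values are invented, and `value_cols`, `long` and `parts` are names introduced here.

```r
# Invented summary table with the same column naming scheme as health.data
wide <- data.frame(
  event.type = c("a", "b"),
  fatalities_sum = c(10, 4),
  injuries_sum = c(30, 8),
  fatalities_mean = c(0.5, 0.1),
  injuries_mean = c(1.5, 0.2),
  stringsAsFactors = FALSE
)

# Stack every value column into one long column, remembering where it came from
value_cols <- setdiff(names(wide), "event.type")
long <- data.frame(
  event.type = rep(wide$event.type, times = length(value_cols)),
  column = rep(value_cols, each = nrow(wide)),
  value = unlist(wide[value_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)

# Split names such as "fatalities_sum" into the variable and the summary type
parts <- strsplit(long$column, "_", fixed = TRUE)
long$variable <- sapply(parts, `[`, 1)
long$summary.type <- sapply(parts, `[`, 2)
long$column <- NULL

print(long)
```

The long layout has one row per combination of event type, variable and summary type, which is what allows the plot to use the variable for the fill colour and the summary type for the facets.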
A similar bar plot for the economic damage data, containing information on the 10 most harmful events by total damage caused, shows which kinds of weather are most destructive to property and crops.
# Take the top 10 types and reshape to one row per type, variable and summary
economic.worst10 <- economic.data %>%
  .[1:10,] %>%
  melt(id.vars = "event.type") %>%
  separate(variable, into = c("variable", "summary.type"), sep = "_")
economic.worst10$event.type <- factor(economic.worst10$event.type,
                                      levels = rev(economic.worst10$event.type[1:10]))
economic.worst10$summary.type <- factor(economic.worst10$summary.type,
                                        levels = c("sum", "mean"))
# Values are divided by 1e6 so the axis reads in millions of dollars
ggplot(economic.worst10, aes(event.type, value / 1e6, fill = variable)) +
  geom_col(position = "dodge") +
  facet_grid(. ~ summary.type, scales = "free") +
  coord_flip() +
  theme(legend.position = "bottom", legend.title = element_blank()) +
  ggtitle("Damage from US Weather Events") +
  xlab("") +
  ylab("Damage (Millions of Dollars)") +
  scale_fill_manual(labels = c("Crop Damage", "Property Damage"),
                    values = c("wheat3", "steelblue")) +
  labs(caption = paste("Total and average property and crop damage",
                       "for the 10 most harmful weather event",
                       "types by total damage caused"))
This plot shows that floods cause the most economic harm overall (about $144 billion in property damage), with hurricanes (about $72 billion) and storm surges (about $48 billion) also responsible for a large amount of damage. Droughts, however, cause the greatest damage to crops, at about $5.4 billion. In terms of per-event damage, hurricanes are the most harmful by a vast margin, with mean property damage of about $565 million per record, followed by storm surges at about $190 million. Other event types cause relatively little destruction per event.
To aid reproducibility, information on the system and packages used to perform this analysis is provided below.
sessionInfo()
## R version 3.5.0 (2018-04-23)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17134)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United Kingdom.1252
## [2] LC_CTYPE=English_United Kingdom.1252
## [3] LC_MONETARY=English_United Kingdom.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United Kingdom.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggplot2_2.2.1 tidyr_0.8.1 reshape2_1.4.3 bindrcpp_0.2.2
## [5] dplyr_0.7.5
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.17 knitr_1.20 bindr_0.1.1 magrittr_1.5
## [5] munsell_0.4.3 tidyselect_0.2.4 colorspace_1.3-2 R6_2.2.2
## [9] rlang_0.2.0 stringr_1.3.1 plyr_1.8.4 tools_3.5.0
## [13] grid_3.5.0 gtable_0.2.0 htmltools_0.3.6 lazyeval_0.2.1
## [17] yaml_2.1.19 assertthat_0.2.0 rprojroot_1.3-2 digest_0.6.15
## [21] tibble_1.4.2 purrr_0.2.4 glue_1.2.0 evaluate_0.10.1
## [25] rmarkdown_1.9 stringi_1.1.7 compiler_3.5.0 pillar_1.2.3
## [29] scales_0.5.0 backports_1.1.2 pkgconfig_2.0.1