The basic goal of this report is to explore the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database and analyse the damage caused by certain types of severe weather events.
This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
This report focuses on answering two main questions: 1. Across the United States, which types of events are most harmful with respect to population health? 2. Across the United States, which types of events have the greatest economic consequences?
As we are investigating only the damage done to people and property, only a small subset of the collected data will be relevant for us.
library("dplyr")
library("ggplot2")
First of all we need the data itself. If the .bz2 archive is not yet available, we try to download it from the specified URL.
input_url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
input_archive <- "StormData.csv.bz2"
if (!file.exists(input_archive)) {
message("Downloading '", input_url, "'")
download.file(input_url, input_archive)
}
Once we have the archive, we can load its content. As there are more than 900k records, loading them takes a considerable time, so we can speed up subsequent runs by caching it in an .rds file.
We will need only 7 columns of the 37, so the memory usage can be greatly reduced if we read only the needed columns:
For more details please refer to the Raw Data Documentation.
input_rds <- "StormData.rds"
if (file.exists(input_rds)) {
storm <- readRDS(input_rds)
} else {
message("Reading '", input_archive, "', please wait")
storm <- tibble::as_tibble(read.csv(input_archive,
na.strings = c("", "0.00"),
# read only those columns we actually need
colClasses = c("NULL", # "STATE__"
"NULL", # "BGN_DATE"
"NULL", # "BGN_TIME"
"NULL", # "TIME_ZONE"
"NULL", # "COUNTY"
"NULL", # "COUNTYNAME"
"NULL", # "STATE"
"factor", # "EVTYPE"
"NULL", # "BGN_RANGE"
"NULL", # "BGN_AZI"
"NULL", # "BGN_LOCATI"
"NULL", # "END_DATE"
"NULL", # "END_TIME"
"NULL", # "COUNTY_END"
"NULL", # "COUNTYENDN"
"NULL", # "END_RANGE"
"NULL", # "END_AZI"
"NULL", # "END_LOCATI"
"NULL", # "LENGTH"
"NULL", # "WIDTH"
"NULL", # "F"
"NULL", # "MAG"
"numeric", # "FATALITIES"
"numeric", # "INJURIES"
"numeric", # "PROPDMG"
"character", # "PROPDMGEXP"
"numeric", # "CROPDMG"
"character", # "CROPDMGEXP"
"NULL", # "WFO"
"NULL", # "STATEOFFIC"
"NULL", # "ZONENAMES"
"NULL", # "LATITUDE"
"NULL", # "LONGITUDE"
"NULL", # "LATITUDE_E"
"NULL", # "LONGITUDE_"
"NULL", # "REMARKS"
"NULL") # "REFNUM"
)) %>%
mutate(FATALITIES = as.integer(FATALITIES),
INJURIES = as.integer(INJURIES),
PROPDMGEXP = as.factor(toupper(PROPDMGEXP)),
CROPDMGEXP = as.factor(toupper(CROPDMGEXP)))
message("Saving cache '", input_rds, "'")
saveRDS(storm, input_rds)
}
summary(storm)
## EVTYPE FATALITIES INJURIES
## HAIL :288661 Min. : 1.0 Min. : 1
## TSTM WIND :219940 1st Qu.: 1.0 1st Qu.: 1
## THUNDERSTORM WIND: 82563 Median : 1.0 Median : 2
## TORNADO : 60652 Mean : 2.2 Mean : 8
## FLASH FLOOD : 54277 3rd Qu.: 2.0 3rd Qu.: 4
## FLOOD : 25326 Max. :583.0 Max. :1700
## (Other) :170878 NA's :895323 NA's :884693
## PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## Min. : 0.0 K :424665 Min. : 0.0 ? : 7
## 1st Qu.: 2.5 M : 11337 1st Qu.: 5.0 0 : 19
## Median : 10.0 0 : 216 Median : 10.0 2 : 1
## Mean : 45.5 B : 40 Mean : 62.3 B : 9
## 3rd Qu.: 25.0 5 : 28 3rd Qu.: 50.0 K :281853
## Max. :5000.0 (Other): 77 Max. :990.0 M : 1995
## NA's :663123 NA's :465934 NA's :880198 NA's:618413
According to the Documentation, PROPDMGEXP and CROPDMGEXP may take only “K”, “M” or “B”, but it seems that there are about 300 cases when PROPDMGEXP has some other value, and 27 similar for CROPDMGEXP.
To figure out how to handle these cases, let’s take a look at their characteristic values:
summary(storm$PROPDMG[!is.na(storm$PROPDMGEXP) & !(storm$PROPDMGEXP %in% c("K", "M", "B"))])
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.10 10.00 25.00 30.56 50.00 150.00 70
We may compare them to the ones of the known exponents, perhaps they represent only mistyped “K”/“M”/“B” values:
summary(storm$PROPDMG[is.na(storm$PROPDMGEXP)])
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.4 3.0 5.0 6.9 7.2 75.0 465858
summary(storm$PROPDMG[storm$PROPDMGEXP == "K"])
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 2.5 10.0 47.2 30.0 5000.0 663118
summary(storm$PROPDMG[storm$PROPDMGEXP == "M"])
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 1.5 2.5 12.4 6.0 929.0 465945
summary(storm$PROPDMG[storm$PROPDMGEXP == "B"])
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.1 1.4 2.3 6.9 5.0 115.0 465934
They aren’t really similar to any of the four valid categories.
On the other hand, they represent only 1/3000 of the records, so we might even just discard them, but there is a slightly better option:
If we just omitted them, then their whole value would be lost from the totals.
If we treat them as in the “magnitude of ones”, then if they indeed are, then we saved these observations, and otherwise we saved 1/1K, 1/1M or 1/1B of their represented values. Not too much, but still more than zero.
Having made this decision, we can restore the actual property- and crop-damage values and sort the events according to the total damage.
The same type of ordering is done to the human harm values as well.
As we want to display the event types according to the total harm/damage they done, we calculate these totals and sort the lists according to them.
The cumulative totals will be needed to find the events that are responsible for the top-X-percent of the harm/damage.
magnitude <- function(x) {
ifelse(is.na(x), 1, recode(x, K=1e3, M=1e6, B=1e9, .default=1))
}
damage <- storm %>%
filter(!is.na(PROPDMG) & !is.na(PROPDMGEXP) & !is.na(CROPDMG) & !is.na(CROPDMGEXP)) %>%
transmute(EVTYPE,
PROPDMG = PROPDMG * magnitude(PROPDMGEXP),
CROPDMG = CROPDMG * magnitude(CROPDMGEXP)) %>%
group_by(EVTYPE) %>%
summarise(PROPERTY = sum(PROPDMG),
CROP = sum(CROPDMG)) %>%
mutate(TOTAL_DAMAGE = PROPERTY + CROP) %>%
arrange(desc(TOTAL_DAMAGE)) %>%
mutate(CUMULATIVE_DAMAGE = cumsum(TOTAL_DAMAGE))
damage$EVTYPE <- factor(as.character(damage$EVTYPE), levels = damage$EVTYPE)
grand_total_damage <- max(damage$CUMULATIVE_DAMAGE)
damage <- damage %>%
mutate(CUMULATIVE_PERCENTAGE = CUMULATIVE_DAMAGE / grand_total_damage,
CUMULATIVE_DAMAGE = NULL)
harm <- storm %>%
select(EVTYPE, INJURIES, FATALITIES) %>%
filter(!is.na(INJURIES) & !is.na(FATALITIES)) %>%
group_by(EVTYPE) %>%
summarise(INJURY = sum(INJURIES),
FATALITY = sum(FATALITIES)) %>%
mutate(TOTAL_HARM = INJURY + FATALITY) %>%
arrange(desc(TOTAL_HARM)) %>%
mutate(CUMULATIVE_HARM = cumsum(TOTAL_HARM))
harm$EVTYPE <- factor(as.character(harm$EVTYPE), levels = harm$EVTYPE)
grand_total_harm <- max(harm$CUMULATIVE_HARM)
harm <- harm %>%
mutate(CUMULATIVE_PERCENTAGE = CUMULATIVE_HARM / grand_total_harm,
CUMULATIVE_HARM = NULL)
# helper function to display large integers in a readable form
label_for_integer <- function(x) {
x <- x + 0.5
if (is.na(x)) {
""
} else if (x > 1e9) {
sprintf("%.1fB", x * 1e-9)
} else if (x > 1e6) {
sprintf("%.1fM", x * 1e-6)
} else if (x > 1e3) {
sprintf("%.1fK", x * 1e-3)
} else {
sprintf("%d", as.integer(x))
}
}
plot_harm <- ggplot(harm) +
geom_col(aes(x = EVTYPE,
y = TOTAL_HARM,
fill="injury")) +
geom_col(aes(x = EVTYPE,
y = FATALITY,
fill="fatality"),
position="stack") +
scale_y_log10(breaks = function(r) { signif(10 ** seq(from = log10(r[1]),
to = log10(r[2]),
length.out = 10),
digits=2)},
labels = function(x) { sapply(x, label_for_integer) }) +
scale_fill_manual(labels=c("injury", "fatality"), values=c("blue", "red")) +
theme(panel.background = element_rect(fill = "white",
colour = "grey50"),
panel.grid.major.y = element_line(colour = "grey80"),
axis.text.x = element_text(angle = 90, hjust = 1)) +
labs(x = "",
y = "Human harm (logarithmic)",
fill = "Harm type",
title = "Human harm by event type")
print(plot_harm)
harm[1:10, c("EVTYPE", "TOTAL_HARM", "CUMULATIVE_PERCENTAGE")]
## # A tibble: 10 x 3
## EVTYPE TOTAL_HARM CUMULATIVE_PERCENTAGE
## <fctr> <int> <dbl>
## 1 TORNADO 65414 0.7555761
## 2 EXCESSIVE HEAT 5193 0.8155588
## 3 FLOOD 2783 0.8477043
## 4 ICE STORM 1755 0.8679757
## 5 HEAT 1493 0.8852209
## 6 HURRICANE/TYPHOON 1251 0.8996708
## 7 LIGHTNING 932 0.9104360
## 8 TSTM WIND 845 0.9201964
## 9 FLASH FLOOD 812 0.9295755
## 10 BLIZZARD 766 0.9384233
plot_damage <- ggplot(damage) +
geom_col(aes(x = EVTYPE,
y = TOTAL_DAMAGE,
fill="property")) +
geom_col(aes(x = EVTYPE,
y = CROP,
fill="crop"),
position="stack") +
scale_y_log10(breaks = function(r) { signif(10 ** seq(from = log10(r[1]),
to = log10(r[2]),
length.out = 10),
digits=2)},
labels = function(x) { sapply(x, label_for_integer) }) +
scale_fill_manual(labels=c("property", "crop"), values=c("blue", "red")) +
theme(panel.background = element_rect(fill = "white",
colour = "grey50"),
panel.grid.major.y = element_line(colour = "grey80"),
axis.text.x = element_text(angle = 90, hjust = 1)) +
labs(x = "",
y = "Property/crop damage (logarithmic)",
fill = "Damage type",
title = "Property/crop damage by event type")
print(plot_damage)
damage[1:10, c("EVTYPE", "TOTAL_DAMAGE", "CUMULATIVE_PERCENTAGE")]
## # A tibble: 10 x 3
## EVTYPE TOTAL_DAMAGE CUMULATIVE_PERCENTAGE
## <fctr> <dbl> <dbl>
## 1 FLOOD 126044533500 0.6065763
## 2 HURRICANE/TYPHOON 29348117800 0.7478111
## 3 HURRICANE 10498188000 0.7983325
## 4 RIVER FLOOD 10108369000 0.8469780
## 5 ICE STORM 5108617550 0.8715627
## 6 FLASH FLOOD 4309261413 0.8923006
## 7 HAIL 3838952570 0.9107752
## 8 TORNADO 2385937163 0.9222572
## 9 HURRICANE OPAL 2187000000 0.9327820
## 10 HIGH WIND 1918571300 0.9420149