The goal of this report is to understand which severe weather events across the United States cause the most damage to A) the population and B) the economy. We use the NOAA Storm Database to identify the top 10 weather events causing the most damage in each category. To provide a reasonably accurate result, we look only at the most recent 15 years of data (1997 to 2011). The data goes back to 1950; however, over the last two decades our ability to record data accurately and cheaply has improved dramatically (approximately 2/3 of the available data is contained in the most recent 15 years). In addition, this is not a long-term trend analysis of natural disasters. Instead, the aim is to provide a concise picture of today’s risks due to weather events. Hopefully this will help states, counties and communities allocate the proper resources to mitigate such future risks.
Prior to loading the data, the libraries required for data processing and plotting should be installed and loaded (not shown in this report). These packages are: dplyr, ggplot2, ggrepel, lubridate.
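Since the setup chunk is not shown, a minimal sketch of it might look like the following. It only checks which of the four packages are installed and attaches those; the commented-out install step is an assumption about how one might set this up, not part of the original report.

```r
# Packages named in the report
pkgs <- c("dplyr", "ggplot2", "ggrepel", "lubridate")

# TRUE for each package that is already installed
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

# Uncomment to install anything that is missing
# install.packages(pkgs[!installed])

# Attach the packages that are available
for (p in pkgs[installed]) {
  library(p, character.only = TRUE)
}
```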
The original data is loaded from the compressed file directly into R for further processing:
# Loading Data into R
x.raw <- read.csv("Data/repdata-data-StormData.csv.bz2")
# Print dimensions of raw dataset
dim(x.raw)
## [1] 902297 37
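read.csv can read the .bz2 file directly because R detects the compression when it opens a file for reading. The sketch below demonstrates this on a tiny made-up table written to a temporary file; the values and the temporary path are illustrative only.

```r
# Tiny made-up table standing in for the storm data
toy <- data.frame(EVTYPE = c("TORNADO", "FLOOD"), FATALITIES = c(1, 0))

# Write it as a bzip2-compressed CSV to a temporary file
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# Read it back without decompressing by hand
back <- read.csv(tmp)
dim(back)   # rows, columns
unlink(tmp)
```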
# Subset dataset to the columns relevant for the population and economic analysis:
x <- select(x.raw,                 # Data input from CSV
            BGN_DATE,              # Date of event
            EVTYPE,                # Event Type
            FATALITIES, INJURIES,  # Number of fatalities and injuries
            PROPDMG, CROPDMG)      # Economic damages
The original dataset has been reduced through subsetting. Only the variables needed to answer our questions have been kept. In the next pre-processing step, we introduce a new variable, YEAR, which we will use to filter the dataset. We will keep only the latest 15 years of data.
# Add year-only column and remove BGN_DATE
x$BGN_DATE <- mdy_hms(as.character(x$BGN_DATE)) # Convert date to POSIX
x <- mutate(x, YEAR = year(BGN_DATE)) # Add new column: YEAR (YYYY format)
x$BGN_DATE <- NULL # Remove original date column
# Look only at the most recent 15 years of data
x <- x %>% filter(YEAR > 1996)
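As a small check of the date handling, the sketch below parses a few made-up dates in the month/day/year plus time layout that mdy_hms expects and applies the same YEAR > 1996 cutoff. It uses base R's as.POSIXct and format as stand-ins for mdy_hms and year so that it runs without lubridate; the dates themselves are invented.

```r
# Made-up dates in month/day/year hour:minute:second form
toy <- data.frame(BGN_DATE = c("4/18/1950 0:00:00",
                               "12/31/1996 0:00:00",
                               "1/1/1997 0:00:00",
                               "11/28/2011 0:00:00"),
                  stringsAsFactors = FALSE)

# Parse to POSIXct and extract the calendar year as a number
parsed <- as.POSIXct(toy$BGN_DATE, format = "%m/%d/%Y %H:%M:%S", tz = "UTC")
toy$YEAR <- as.numeric(format(parsed, "%Y"))

# Keep 1997 onwards, as in the report
recent <- toy[toy$YEAR > 1996, ]
recent$YEAR

# 1997 through 2011 spans 15 calendar years
length(1997:2011)
```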
Now the data has been reduced to the minimum amount required. Next we will add additional variables that will help with summarizing the data in a later step:
# Add 5-year time periods
periods <- c("1997 - 2001", "2002 - 2006", "2007 - 2011")
x$period <- periods[1]
x$period[x$YEAR >= 2002 & x$YEAR <= 2006] <- periods[2]
x$period[x$YEAR >= 2007] <- periods[3]
x$period <- factor(x$period)
# Add total impact for population and economy
x <- x %>%
  mutate(total_pop = INJURIES + FATALITIES, # total damage = sum of damages
         total_eco = PROPDMG + CROPDMG)
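The period assignments above amount to binning YEAR into fixed intervals. An equivalent way to express that, shown here on a few sample years rather than the full dataset, is base R's cut(); the breaks are chosen so that each interval is closed on the right, for example (1996, 2001].

```r
periods <- c("1997 - 2001", "2002 - 2006", "2007 - 2011")

# Sample years covering both ends of every period
years <- c(1997, 2001, 2002, 2006, 2007, 2011)

# Intervals are (1996, 2001], (2001, 2006], (2006, 2011]
period <- cut(years, breaks = c(1996, 2001, 2006, 2011), labels = periods)

table(period)
```

cut() returns a factor directly, so the separate factor() conversion would not be needed with this approach.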
The resulting dataset is summarized below.
glimpse(x)
## Observations: 621,260
## Variables: 9
## $ EVTYPE (fctr) TSTM WIND, TSTM WIND, TSTM WIND, TSTM WIND, TSTM W...
## $ FATALITIES (dbl) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ INJURIES (dbl) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ PROPDMG (dbl) 6, 5, 0, 10, 5, 15, 3, 10, 6, 8, 4, 0, 18, 7, 4, 5,...
## $ CROPDMG (dbl) 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ YEAR (dbl) 1997, 1997, 1997, 1997, 1997, 1997, 1997, 1997, 199...
## $ period (fctr) 1997 - 2001, 1997 - 2001, 1997 - 2001, 1997 - 2001...
## $ total_pop (dbl) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ total_eco (dbl) 6, 5, 0, 10, 5, 17, 3, 10, 6, 8, 4, 0, 18, 7, 4, 5,...
To look for trends over the last 15 years, the data is processed further to summarize weather events over 5-year periods, and also into a consolidated 15-year summary dataset. This section creates the data that will be used later in the visualizations for category A, population consequences.
# Split data into 3 time periods, summarize and merge back together
pop1 <- data.frame()
for (p in periods) {
  # Create data frame from x dataset
  df <- x %>%
    filter(period == p) %>%
    group_by(EVTYPE) %>%
    summarise(fat_all = sum(FATALITIES), inj_all = sum(INJURIES), totals = sum(total_pop)) %>%
    filter(totals > 0) %>%
    mutate(period = p)
  # Combine / Merge
  if (length(pop1) == 0) {
    pop1 <- df
  } else {
    pop1 <- rbind(pop1, df)
  }
}
# Filter further and convert periods to factor variable
pop1 <- pop1 %>% filter(fat_all > 100, inj_all > 100)
pop1$period <- factor(pop1$period)
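The loop builds one summary per period and stacks them. Grouping by both period and EVTYPE in a single pipeline should give the same rows without the loop. The sketch below shows this on made-up numbers and assumes dplyr 1.0 or later for the .groups argument; `toy` and `pop_toy` are illustrative names.

```r
library(dplyr)

# Made-up records: two event types across two periods
toy <- data.frame(
  period     = c("1997 - 2001", "1997 - 2001", "2002 - 2006", "2002 - 2006"),
  EVTYPE     = c("TORNADO", "TORNADO", "TORNADO", "HAIL"),
  FATALITIES = c(2, 3, 1, 0),
  INJURIES   = c(10, 20, 5, 0)
)

pop_toy <- toy %>%
  group_by(period, EVTYPE) %>%
  summarise(fat_all = sum(FATALITIES),
            inj_all = sum(INJURIES),
            totals  = sum(FATALITIES + INJURIES),
            .groups = "drop") %>%
  filter(totals > 0)      # drop event types with no casualties in a period

pop_toy
```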
To reduce the number of events to those with the largest impact, only rows with more than 100 fatalities and more than 100 injuries are kept. The same filter is applied to the consolidated dataset.
# Consolidated data - population consequences
pop2 <- x %>%
  group_by(EVTYPE) %>%
  summarise(fat_all = sum(FATALITIES), inj_all = sum(INJURIES), totals = sum(total_pop)) %>%
  filter(fat_all > 100, inj_all > 100)
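Listing two conditions in filter() combines them with a logical AND, so an event type must exceed 100 in both fatalities and injuries to be kept. A small illustration on made-up totals (the event names are placeholders, not database values):

```r
library(dplyr)

toy <- data.frame(
  EVTYPE  = c("EVENT A", "EVENT B", "EVENT C"),
  fat_all = c(150, 150, 50),
  inj_all = c(500, 50, 500)
)

# Only EVENT A passes: it is the only row above 100 on both measures
kept <- toy %>% filter(fat_all > 100, inj_all > 100)
kept$EVTYPE
```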
The processed datasets are used to create two plots, which will be shown later in the results section.
# Plot 1: Facet Plot, showing data over 5-year periods (not printed here)
p1 <- ggplot(as.data.frame(pop1), aes(x = inj_all, y = fat_all, size = totals)) +
  geom_point(color = "red", alpha = 0.4) +
  facet_grid(. ~ period) +
  scale_x_log10() +
  scale_y_log10() +
  ggtitle("Fatalities vs Injuries over 5-year Periods") +
  xlab("Number of Injured") +
  ylab("Number of Fatalities") +
  theme_bw() +
  geom_text_repel(data = filter(pop1, totals >= 100), aes(label = EVTYPE))
# Plot 2: Consolidated Bubble Chart
p2 <- ggplot(as.data.frame(pop2), aes(x = inj_all, y = fat_all, size = totals)) +
  geom_point(color = "red", alpha = 0.4) +
  scale_x_log10() +
  scale_y_log10() +
  ggtitle("Total Fatalities vs Injuries from 1997 - 2011") +
  xlab("Number of Injured") +
  ylab("Number of Fatalities") +
  theme_bw() +
  geom_text_repel(data = filter(pop2, totals >= 100), aes(label = EVTYPE))
For economic damages, the dataset is also processed further. This time, we will look at the two different types of damage (property and crop) and how they are distributed.
# Summarize and arrange the economic data - by total economic damage
eco1 <- x %>%
  group_by(EVTYPE) %>%
  summarise(crop_all = round(sum(CROPDMG), 0),
            prop_all = round(sum(PROPDMG), 0),
            totals = round(sum(total_eco), 0)) %>%
  arrange(desc(totals))
These data are combined into a single figure, shown later in the results section.
# Plot 4: Total Economic Damage
p4 <- ggplot(eco1[1:10,], aes(x = reorder(EVTYPE, -totals), y = totals, fill = "Total")) +
  geom_bar(stat = "identity", alpha = 0.5) +
  geom_bar(stat = "identity", aes(x = reorder(EVTYPE, -totals), y = prop_all, fill = "Property")) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5)) +
  ggtitle("Total and Property Damage by Event Type from 1997 - 2011") +
  xlab("Event Type") +
  ylab("Damage Amounts") +
  scale_y_continuous(labels = abbreviate) +
  scale_fill_discrete(name = "", guide = guide_legend(reverse = TRUE))
The processed data are now used to visualize our findings. Where needed, tables show all or the relevant part of the data to explain the charts.
The weather events causing the highest damage among the population across the United States are shown in the table below:
# Table of consolidated results
pop2 <- pop2 %>% arrange(desc(totals))
library(knitr)
kable(pop2[1:10,], col.names = c("Event Types", "Fatalities", "Injuries", "Total"))
| Event Types | Fatalities | Injuries | Total |
|---|---|---|---|
| TORNADO | 1485 | 19962 | 21447 |
| EXCESSIVE HEAT | 1761 | 6332 | 8093 |
| FLOOD | 381 | 6740 | 7121 |
| LIGHTNING | 598 | 3826 | 4424 |
| TSTM WIND | 219 | 3305 | 3524 |
| FLASH FLOOD | 795 | 1630 | 2425 |
| THUNDERSTORM WIND | 130 | 1400 | 1530 |
| HEAT | 237 | 1222 | 1459 |
| WINTER STORM | 167 | 1042 | 1209 |
| HIGH WIND | 211 | 970 | 1181 |
print(p1)
Figure-1: Fatalities vs. Injuries over 5-year periods from 1997 - 2011.
Figure-1 shows that two weather events cause the most fatalities: Tornado and Excessive Heat. In third place by total casualties comes Flood, which causes fewer fatalities than Lightning but many more injuries.
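These comparisons can also be read straight off the table. As a quick check, the sketch below re-enters the four relevant rows by hand (values copied from the table above) and compares them:

```r
# Values copied from the top rows of the results table
top <- data.frame(
  event      = c("TORNADO", "EXCESSIVE HEAT", "FLOOD", "LIGHTNING"),
  fatalities = c(1485, 1761, 381, 598),
  injuries   = c(19962, 6332, 6740, 3826),
  stringsAsFactors = FALSE
)
top$total <- top$fatalities + top$injuries

# Ranking by total casualties
top$event[order(-top$total)]

# Flood versus Lightning on each measure
flood <- top[top$event == "FLOOD", ]
lightning <- top[top$event == "LIGHTNING", ]
flood$fatalities < lightning$fatalities   # fewer fatalities
flood$injuries > lightning$injuries       # more injuries
```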
print(p2)
Figure-2: Consolidated Fatalities vs. Injuries from 1997 - 2011 by Severe Weather Events.
Figure-2 confirms the above interpretation; however, it is notable that Flood, while in third place, owes that rank to the 1997-2001 period. In the 10 years afterwards, flooding still appears but results only in minor harm to the population.
From the summary table below it is clear that the top three weather events causing the most severe economic damage are TSTM Wind, Flash Flood and Tornado. The damage severity declines steadily until Lightning, where it drops sharply to approximately half the total of the next more severe event (Thunderstorm Wind).
kable(eco1[1:10,], col.names = c("Event Types", "Crop Damage", "Property Damage", "Total"))
| Event Types | Crop Damage | Property Damage | Total |
|---|---|---|---|
| TSTM WIND | 98746 | 1215890 | 1314636 |
| FLASH FLOOD | 146418 | 1156462 | 1302880 |
| TORNADO | 86111 | 1125792 | 1211903 |
| HAIL | 473313 | 544038 | 1017351 |
| FLOOD | 146402 | 803805 | 950207 |
| THUNDERSTORM WIND | 66663 | 862257 | 928920 |
| LIGHTNING | 1852 | 463150 | 465002 |
| HIGH WIND | 13872 | 290030 | 303901 |
| WINTER STORM | 1538 | 115809 | 117347 |
| WILDFIRE | 4364 | 83007 | 87372 |
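The size of the drop at Lightning can be checked directly from the Total column. The sketch below re-enters the totals by hand and computes each event's total relative to the event ranked just above it:

```r
# Totals copied from the table above, in rank order
totals <- c("TSTM WIND" = 1314636, "FLASH FLOOD" = 1302880, "TORNADO" = 1211903,
            "HAIL" = 1017351, "FLOOD" = 950207, "THUNDERSTORM WIND" = 928920,
            "LIGHTNING" = 465002, "HIGH WIND" = 303901,
            "WINTER STORM" = 117347, "WILDFIRE" = 87372)

# Each total divided by the total of the event ranked one place higher
ratio <- totals[-1] / totals[-length(totals)]
round(ratio, 2)
```

The Lightning entry comes out at about 0.5, which matches the "approximately half" statement.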
print(p4)
Figure-3: Total and Property Damage by Event Type from 1997 - 2011. The main damage is done to property (compared to damage to crops).
Figure-3 shows that most of the damage is sustained by property. Only for Hail are the damage amounts for property and crops comparable.
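The same point can be put in numbers by computing the property share of total damage. The sketch below uses the crop and property values of the top five rows of the table, re-entered by hand:

```r
# Crop and property damage copied from the table above
crop <- c("TSTM WIND" = 98746, "FLASH FLOOD" = 146418, "TORNADO" = 86111,
          "HAIL" = 473313, "FLOOD" = 146402)
prop <- c("TSTM WIND" = 1215890, "FLASH FLOOD" = 1156462, "TORNADO" = 1125792,
          "HAIL" = 544038, "FLOOD" = 803805)

# Fraction of each event's damage that falls on property
prop_share <- prop / (prop + crop)
round(prop_share, 2)
```

Hail is the only one of these events where the property share is close to one half; for the others it is above 0.8.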