This project explores an influence of weather events on USA economy. We will check out impact on population heath and also the economic consequences of weather events. The analysis will answer the question of which event type has the highest impact. As a data source serves the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database.
Read the the NOAA Storm Database from the downloaded archive.
library(dplyr)
data <- read.table("repdata_data_StormData.csv.bz2", header=T, quote="\"", sep=",")
To analyze the impact of weather event types we will explore the following measurements:
EVTYPE – event typeFATALITIES – number of fatal casesINJURIES – number of injury casesCROPDMGEXP – measurement unit for
CROPDMGCROPDMG – amount of damage on cropsPROPDMGEXP – measurement unit for
PROPDMGPROPDMG – amount of damage on propertiesMeasurement units
To assess the impact, we need to compare damages using the same base,
i.e. to transform values of CROPDMG/PROPDMG
using the measurement units:
B stands for billionsM stands for millionsK/k stands for thousands (Kilo)0 stands for nominal unit, i.e. no multiplierSince economic impact from the weather events can be quite high, we’ll use millions as the base unit.
measurement_unit_multiplier <- function(unit) {
# default
multiplier <- 1/1000000
# has to use ifelse, because the input is vectorized
multiplier <- ifelse(unit == 'B', 1000, multiplier)
multiplier <- ifelse(unit == 'M', 1, multiplier)
multiplier <- ifelse(unit == 'K', 1/1000, multiplier)
multiplier <- ifelse(unit == 'k', 1/1000, multiplier)
return (multiplier)
}
analysis_data <- data %>%
select(EVTYPE, FATALITIES, INJURIES,CROPDMG, CROPDMGEXP, PROPDMG, PROPDMGEXP) %>%
mutate(
CROPDMG = CROPDMG * measurement_unit_multiplier(CROPDMGEXP),
PROPDMG = PROPDMG * measurement_unit_multiplier(PROPDMGEXP)
) %>%
select(-CROPDMGEXP, -PROPDMGEXP) %>%
group_by(EVTYPE)
summary(analysis_data)
## EVTYPE FATALITIES INJURIES CROPDMG
## Length:902297 Min. : 0.0000 Min. : 0.0000 Min. :0.0e+00
## Class :character 1st Qu.: 0.0000 1st Qu.: 0.0000 1st Qu.:0.0e+00
## Mode :character Median : 0.0000 Median : 0.0000 Median :0.0e+00
## Mean : 0.0168 Mean : 0.1557 Mean :5.4e-02
## 3rd Qu.: 0.0000 3rd Qu.: 0.0000 3rd Qu.:0.0e+00
## Max. :583.0000 Max. :1700.0000 Max. :5.0e+03
## PROPDMG
## Min. :0.00e+00
## 1st Qu.:0.00e+00
## Median :0.00e+00
## Mean :4.70e-01
## 3rd Qu.:0.00e+00
## Max. :1.15e+05
Let’s check which event types have the most impact and what’s the intersection across different impact types:
library(dplyr)
top_by <- function(column) {
total_column <- paste(column, "TOTAL", sep="_")
top <- analysis_data %>%
summarise(!!total_column := sum(get(column))) %>%
arrange(desc(get(total_column)))
return (top)
}
print_top_by <- function(column) {
head(top_by(column))
}
print_top_by("FATALITIES")
## # A tibble: 6 × 2
## EVTYPE FATALITIES_TOTAL
## <chr> <dbl>
## 1 TORNADO 5633
## 2 EXCESSIVE HEAT 1903
## 3 FLASH FLOOD 978
## 4 HEAT 937
## 5 LIGHTNING 816
## 6 TSTM WIND 504
print_top_by("INJURIES")
## # A tibble: 6 × 2
## EVTYPE INJURIES_TOTAL
## <chr> <dbl>
## 1 TORNADO 91346
## 2 TSTM WIND 6957
## 3 FLOOD 6789
## 4 EXCESSIVE HEAT 6525
## 5 LIGHTNING 5230
## 6 HEAT 2100
print_top_by("CROPDMG")
## # A tibble: 6 × 2
## EVTYPE CROPDMG_TOTAL
## <chr> <dbl>
## 1 DROUGHT 13973.
## 2 FLOOD 5662.
## 3 RIVER FLOOD 5029.
## 4 ICE STORM 5022.
## 5 HAIL 3026.
## 6 HURRICANE 2742.
print_top_by("PROPDMG")
## # A tibble: 6 × 2
## EVTYPE PROPDMG_TOTAL
## <chr> <dbl>
## 1 FLOOD 144658.
## 2 HURRICANE/TYPHOON 69306.
## 3 TORNADO 56926.
## 4 STORM SURGE 43324.
## 5 FLASH FLOOD 16141.
## 6 HAIL 15727.
For both fatalities and injuries, Tornado is the most harmful event, and way ahead of other event types by the injuries number. Let’s explore this on the plot graph:
library(tidyr)
library(ggplot2)
# to be sure both top injuries and top fatalities events be present at the plot
top_events <- union(
top_by("INJURIES") %>% top_n(7) %>% pull(EVTYPE),
top_by("FATALITIES") %>% top_n(7) %>% pull(EVTYPE)
)
analysis_data %>%
summarise(FATALITIES_TOTAL = sum(FATALITIES),
INJURIES_TOTAL = sum(INJURIES)) %>%
filter(EVTYPE %in% top_events) %>%
ggplot(aes(EVTYPE, group = 1)) +
geom_line(aes(y = FATALITIES_TOTAL, colour = "Fatalities")) +
geom_line(aes(y = INJURIES_TOTAL, colour = "Injuries")) +
labs(title = "Most harmful events across US with respect to population health", x = "Event type", y = "People impacted", color=element_blank()) +
scale_x_discrete(guide = guide_axis(n.dodge = 2))
The greatest economic consequences have:
library(gridExtra)
# to be sure both top injuries and top fatalities events be present at the plot
top_events <- union(
top_by("CROPDMG") %>% top_n(7) %>% pull(EVTYPE),
top_by("PROPDMG") %>% top_n(7) %>% pull(EVTYPE)
)
damage_labels <- c(`CROPDMG` = "Crops Damage", `PROPDMG` = "Property Damage")
analysis_data %>%
summarise(
CROPDMG = sum(CROPDMG),
PROPDMG = sum(PROPDMG)
) %>%
filter(EVTYPE %in% top_events) %>%
gather(key, value, 2:3) %>%
ggplot(aes(x=EVTYPE, y = value, fill = factor(key))) +
facet_wrap(~key, labeller = as_labeller(damage_labels)) +
geom_bar(stat = "identity", position = position_dodge2(preserve = "single")) +
coord_flip() +
labs(title = "Most harmful events across US\nwith respect to economic consequences", x = element_blank(), y = "Million $") +
theme(legend.position="none")