Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This document analyzes the impact of severe weather events on public health and the economy, based on the U.S. National Oceanic and Atmospheric Administration's (NOAA) storm database (https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2). The database documentation can be found here: https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf
Among the severe weather events considered, tornadoes have the greatest impact on public health in the U.S., measured as injuries plus fatalities. Floods have the greatest economic impact, measured as property plus crop damage.
The following R libraries are needed for the analysis. In addition to dplyr and ggplot2, the reshape2 library is loaded; its melt() function is used later to reshape the data for plotting.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(reshape2)
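The following code downloads the data from https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2, unless the file is already present in the working directory, and reads it into a data frame.
# Download the archive only if the file is not already in the working directory
if (!file.exists("StormData.csv.bz2")) {
  download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", "StormData.csv.bz2")
}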
stormdata_raw <- read.csv("StormData.csv.bz2", header = TRUE)
First, we inspect the structure, size, and column types of the data frame.
str(stormdata_raw)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
According to the documentation in https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf we can reduce the analysis to the following variables:
EVTYPE: type of weather event
FATALITIES / INJURIES: impact on population health
PROPDMG / PROPDMGEXP: property damage and its exponent code
CROPDMG / CROPDMGEXP: crop damage and its exponent code
The data frame is reduced to these relevant columns.
stormdata <- stormdata_raw %>%
select(EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)
We check whether the data contain any missing values.
sum(is.na(stormdata))
## [1] 0
There aren't any missing values.
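Note that is.na() does not flag empty strings, which do occur in the exponent columns (see the "" entries in the str() output above). The following sketch shows one way to count NA values and empty strings per column. The toy data frame and its values are made up for illustration and are not part of the storm data.

```r
# Toy data frame standing in for the storm data (values are made up)
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "HAIL"),
  PROPDMG    = c(25, 2.5, 0),
  PROPDMGEXP = c("K", "", "M"),
  stringsAsFactors = FALSE
)

# Number of NA values per column
na_per_column <- colSums(is.na(toy))

# Number of empty strings per column (is.na() does not count these)
empty_per_column <- sapply(toy, function(col) sum(as.character(col) == ""))

na_per_column
empty_per_column
```

In this analysis the empty exponent codes are not dropped; they are later treated as a multiplier of 1.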
We now turn to the impact on health. For each event type, the injuries, the fatalities, and their sum are totaled.
health_damage_per_eventtype <- stormdata %>%
group_by(EVTYPE) %>%
summarise(sum_injuries = sum(INJURIES), sum_fatalities = sum(FATALITIES), sum_damage = sum(FATALITIES) + sum(INJURIES))
## `summarise()` ungrouping output (override with `.groups` argument)
head(health_damage_per_eventtype)
## # A tibble: 6 x 4
## EVTYPE sum_injuries sum_fatalities sum_damage
## <chr> <dbl> <dbl> <dbl>
## 1 " HIGH SURF ADVISORY" 0 0 0
## 2 " COASTAL FLOOD" 0 0 0
## 3 " FLASH FLOOD" 0 0 0
## 4 " LIGHTNING" 0 0 0
## 5 " TSTM WIND" 0 0 0
## 6 " TSTM WIND (G45)" 0 0 0
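The first rows above show event types with leading spaces, so the same event can appear under several spellings. The analysis in this document keeps EVTYPE as recorded. As a minimal sketch of how the labels could be normalized before grouping, assuming that trimming whitespace and upper-casing is an acceptable simplification (the example vector is made up):

```r
# Example labels; the mixed-case entry is hypothetical
evtype <- c("   HIGH SURF ADVISORY", " FLASH FLOOD", "Flash Flood", "TSTM WIND")

# Remove leading/trailing whitespace and unify the case
evtype_clean <- toupper(trimws(evtype))

unique(evtype_clean)
```

This only merges spellings that differ in whitespace or case. Synonyms such as TSTM WIND and THUNDERSTORM WIND would still be counted separately.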
We now turn to the economic impact. First, the exponent columns PROPDMGEXP and CROPDMGEXP have to be cleaned. Both columns contain a variety of codes:
unique_p_exp <- unique(stormdata$PROPDMGEXP)
unique_c_exp <- unique(stormdata$CROPDMGEXP)
unique_p_exp
## [1] "K" "M" "" "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-" "1" "8"
unique_c_exp
## [1] "" "M" "K" "m" "B" "?" "0" "k" "2"
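unique() lists the codes but not how often each one occurs. A frequency table can show whether the unusual codes matter in practice. The sketch below uses a made-up vector of codes; in the analysis it would presumably be applied to stormdata$PROPDMGEXP and stormdata$CROPDMGEXP.

```r
# Made-up exponent codes for illustration
codes <- c("K", "K", "M", "", "k", "?", "K")

# Count each code and sort from most to least frequent
code_counts <- sort(table(codes), decreasing = TRUE)

code_counts
```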
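Following the documentation mentioned above, this analysis interprets the exponents as follows: K or k is a factor of 1,000, H or h a factor of 100, M or m a factor of 1,000,000, and B a factor of 10^9. All other codes, including the empty string, digits, and symbols, are treated as a factor of 1.
A replacement table is defined for each of PROPDMGEXP and CROPDMGEXP: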
# Map each exponent code to its numeric multiplier (case-insensitive).
# The conditions are evaluated on the character vectors directly.
propdmgexp_new <- case_when(toupper(unique_p_exp) == "K" ~ 1000,
                            toupper(unique_p_exp) == "H" ~ 100,
                            toupper(unique_p_exp) == "M" ~ 1000000,
                            unique_p_exp == "B" ~ 10^9,
                            TRUE ~ 1)  # all other codes: multiplier 1
cropdmgexp_new <- case_when(toupper(unique_c_exp) == "K" ~ 1000,
                            toupper(unique_c_exp) == "H" ~ 100,
                            toupper(unique_c_exp) == "M" ~ 1000000,
                            unique_c_exp == "B" ~ 10^9,
                            TRUE ~ 1)  # all other codes: multiplier 1
replacement_p_exp <- data.frame(unique_p_exp, propdmgexp_new)
replacement_c_exp <- data.frame(unique_c_exp, cropdmgexp_new)
replacement_p_exp
## unique_p_exp propdmgexp_new
## 1 K 1e+03
## 2 M 1e+06
## 3 1e+00
## 4 B 1e+09
## 5 m 1e+06
## 6 + 1e+00
## 7 0 1e+00
## 8 5 1e+00
## 9 6 1e+00
## 10 ? 1e+00
## 11 4 1e+00
## 12 2 1e+00
## 13 3 1e+00
## 14 h 1e+02
## 15 7 1e+00
## 16 H 1e+02
## 17 - 1e+00
## 18 1 1e+00
## 19 8 1e+00
replacement_c_exp
## unique_c_exp cropdmgexp_new
## 1 1e+00
## 2 M 1e+06
## 3 K 1e+03
## 4 m 1e+06
## 5 B 1e+09
## 6 ? 1e+00
## 7 0 1e+00
## 8 k 1e+03
## 9 2 1e+00
These replacement tables are joined to the storm data. The new columns PROPDMG_NEW and CROPDMG_NEW hold the values of PROPDMG and CROPDMG multiplied by their respective multipliers, so the values in PROPDMG_NEW and CROPDMG_NEW are now directly comparable.
replacement_p_exp <- replacement_p_exp %>%
rename(PROPDMGEXP = unique_p_exp)
replacement_c_exp <- replacement_c_exp %>%
rename(CROPDMGEXP = unique_c_exp)
stormdata <- stormdata %>%
inner_join(replacement_p_exp, by = "PROPDMGEXP") %>%
inner_join(replacement_c_exp, by = "CROPDMGEXP") %>%
mutate(PROPDMG_NEW = PROPDMG * propdmgexp_new) %>%
mutate(CROPDMG_NEW = CROPDMG * cropdmgexp_new)
head(stormdata)
## EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 TORNADO 0 15 25.0 K 0
## 2 TORNADO 0 0 2.5 K 0
## 3 TORNADO 0 2 25.0 K 0
## 4 TORNADO 0 2 2.5 K 0
## 5 TORNADO 0 2 2.5 K 0
## 6 TORNADO 0 6 2.5 K 0
## propdmgexp_new cropdmgexp_new PROPDMG_NEW CROPDMG_NEW
## 1 1000 1 25000 0
## 2 1000 1 2500 0
## 3 1000 1 25000 0
## 4 1000 1 2500 0
## 5 1000 1 2500 0
## 6 1000 1 2500 0
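The same multiplication can also be sketched without a join, using a named lookup vector. This is an alternative illustration, not the method used in this analysis; the name exp_multiplier and the toy values are assumptions.

```r
# Hypothetical lookup vector: exponent code -> multiplier
exp_multiplier <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)

# Toy data (values are made up)
toy <- data.frame(
  PROPDMG    = c(25, 2.5, 1, 3),
  PROPDMGEXP = c("K", "m", "B", "?"),
  stringsAsFactors = FALSE
)

# Look up the multiplier case-insensitively; unknown codes yield NA
multiplier <- unname(exp_multiplier[toupper(toy$PROPDMGEXP)])

# Treat codes without a defined meaning as a multiplier of 1
multiplier[is.na(multiplier)] <- 1

toy$PROPDMG_NEW <- toy$PROPDMG * multiplier
toy$PROPDMG_NEW
```

One difference from the code above: because of toupper(), this sketch would also accept a lowercase b, which the replacement tables map to 1.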
We can now total the financial impact on property and on crops for each event type.
economic_damage_per_eventtype <- stormdata %>%
group_by(EVTYPE) %>%
summarise(eco_p = sum(PROPDMG_NEW), eco_c = sum(CROPDMG_NEW), eco_damage = sum(PROPDMG_NEW + CROPDMG_NEW))
## `summarise()` ungrouping output (override with `.groups` argument)
head(economic_damage_per_eventtype)
## # A tibble: 6 x 4
## EVTYPE eco_p eco_c eco_damage
## <chr> <dbl> <dbl> <dbl>
## 1 " HIGH SURF ADVISORY" 200000 0 200000
## 2 " COASTAL FLOOD" 0 0 0
## 3 " FLASH FLOOD" 50000 0 50000
## 4 " LIGHTNING" 0 0 0
## 5 " TSTM WIND" 8100000 0 8100000
## 6 " TSTM WIND (G45)" 8000 0 8000
The 10 event types with the highest impacts in injuries plus fatalities are as follows:
health_damage_per_eventtype_top10 <- health_damage_per_eventtype %>%
arrange(desc(sum_damage)) %>%
top_n(sum_damage, n = 10)
health_damage_per_eventtype_top10
## # A tibble: 10 x 4
## EVTYPE sum_injuries sum_fatalities sum_damage
## <chr> <dbl> <dbl> <dbl>
## 1 TORNADO 91346 5633 96979
## 2 EXCESSIVE HEAT 6525 1903 8428
## 3 TSTM WIND 6957 504 7461
## 4 FLOOD 6789 470 7259
## 5 LIGHTNING 5230 816 6046
## 6 HEAT 2100 937 3037
## 7 FLASH FLOOD 1777 978 2755
## 8 ICE STORM 1975 89 2064
## 9 THUNDERSTORM WIND 1488 133 1621
## 10 WINTER STORM 1321 206 1527
Across the United States, tornadoes are the most harmful event type with respect to population health.
Here is a visualisation of the conclusion:
health_damage_per_eventtype_top10 %>%
select(EVTYPE, sum_injuries, sum_fatalities) %>%
rename(injuries = sum_injuries, fatalities = sum_fatalities) %>%
melt(id.vars = 'EVTYPE') %>%
rename(damage = variable) %>%
ggplot(aes(x=EVTYPE, y = value, fill = damage)) +
geom_bar(stat = "identity", position = "stack" ) +
coord_flip() +
labs(title = "The 10 weather events with the highest impact on health") +
ylab("number of cases") +
xlab("weather events")
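With x = EVTYPE, ggplot2 orders the bars alphabetically rather than by size. As a hedged sketch, the bars could instead be sorted by their stacked total using reorder() with FUN = sum. The data frame below reuses three rows from the table above in long format, as melt() would produce them.

```r
library(ggplot2)

# Three event types from the top-10 table, in long format
toy_long <- data.frame(
  EVTYPE = rep(c("TORNADO", "EXCESSIVE HEAT", "TSTM WIND"), times = 2),
  damage = rep(c("injuries", "fatalities"), each = 3),
  value  = c(91346, 6525, 6957, 5633, 1903, 504)
)

# Order the event types by their summed value (injuries + fatalities)
toy_long$EVTYPE <- reorder(toy_long$EVTYPE, toy_long$value, FUN = sum)

p <- ggplot(toy_long, aes(x = EVTYPE, y = value, fill = damage)) +
  geom_col(position = "stack") +  # same as geom_bar(stat = "identity")
  coord_flip() +
  ylab("number of cases") +
  xlab("weather events")
```

After coord_flip(), the event type with the largest total appears at the top of the chart.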
Next, we consider the 10 weather events with the greatest economic consequences:
economic_damage_per_eventtype_top10 <- economic_damage_per_eventtype %>%
arrange(desc(eco_damage)) %>%
top_n(eco_damage, n = 10)
economic_damage_per_eventtype_top10
## # A tibble: 10 x 4
## EVTYPE eco_p eco_c eco_damage
## <chr> <dbl> <dbl> <dbl>
## 1 FLOOD 144657709807 5661968450 150319678257
## 2 HURRICANE/TYPHOON 69305840000 2607872800 71913712800
## 3 TORNADO 56937160779. 414953270 57352114049.
## 4 STORM SURGE 43323536000 5000 43323541000
## 5 HAIL 15732267543. 3025954473 18758222016.
## 6 FLASH FLOOD 16140812067. 1421317100 17562129167.
## 7 DROUGHT 1046106000 13972566000 15018672000
## 8 HURRICANE 11868319010 2741910000 14610229010
## 9 RIVER FLOOD 5118945500 5029459000 10148404500
## 10 ICE STORM 3944927860 5022113500 8967041360
Among the weather events considered, floods cause the greatest economic damage in the U.S.
The visualisation of this result can be created as follows:
economic_damage_per_eventtype_top10 %>%
select(EVTYPE, eco_p, eco_c) %>%
rename(property = eco_p, crop = eco_c) %>%
melt(id.vars = 'EVTYPE') %>%
rename(economical_damage = variable) %>%
ggplot(aes(x=EVTYPE, y = value, fill = economical_damage)) +
geom_bar(stat = "identity", position = "stack") +
coord_flip() +
labs(title = "The 10 weather events with the greatest economic consequences") +
ylab("economic loss (in $)") +
xlab("weather events")