Synopsis

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

This document analyzes the impact of severe weather events on public health and the economy, based on the U.S. National Oceanic and Atmospheric Administration's (NOAA) storm database (https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2). The documentation of the database can be found here: https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf

Among the weather events considered, tornadoes have the greatest impact on public health in the U.S., measured by injuries and fatalities. Floods have the greatest economic impact, measured by damage to property and crops.

Data Processing

The following R libraries are needed for the analysis. In addition to dplyr and ggplot2, the reshape2 library is loaded; it is used later to reshape the data for plotting.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(reshape2)

With the following code the data is downloaded from https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2.

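# Download the file only if it is not already in the working directory
if(!file.exists("StormData.csv.bz2")) {
  download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", "StormData.csv.bz2", mode = "wb")
}
# read.csv() can read the bzip2-compressed file directly
stormdata_raw <- read.csv("StormData.csv.bz2", header = TRUE)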

First, we check the structure, size and column types of the data frame.

str(stormdata_raw)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  "" "" "" "" ...
##  $ BGN_LOCATI: chr  "" "" "" "" ...
##  $ END_DATE  : chr  "" "" "" "" ...
##  $ END_TIME  : chr  "" "" "" "" ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  "" "" "" "" ...
##  $ END_LOCATI: chr  "" "" "" "" ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ WFO       : chr  "" "" "" "" ...
##  $ STATEOFFIC: chr  "" "" "" "" ...
##  $ ZONENAMES : chr  "" "" "" "" ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  "" "" "" "" ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

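According to the documentation (https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf), the analysis can be restricted to the following variables:

  • EVTYPE: the type of the weather event.
  • FATALITIES and INJURIES: the impact on health.
  • PROPDMG and PROPDMGEXP: the property damage and its exponent.
  • CROPDMG and CROPDMGEXP: the crop damage and its exponent.

The data frame is therefore shortened to these columns.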

stormdata <- stormdata_raw %>%
  select(EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)

We check whether the data contains missing values.

sum(is.na(stormdata))
## [1] 0

There aren't any missing values.
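
Note that `is.na()` only counts true `NA` values, whereas the `str()` output shows that several character columns hold empty strings instead. The following minimal sketch, on a small made-up data frame (the name `toy` and its values are illustrative and not taken from the storm data), shows how both could be counted per column:

```r
# Toy data frame (hypothetical values) mimicking two columns of the storm data
toy <- data.frame(PROPDMG    = c(25, 2.5, NA),
                  PROPDMGEXP = c("K", "", "M"),
                  stringsAsFactors = FALSE)

# Number of NA values per column
na_per_column <- colSums(is.na(toy))

# Number of empty strings per column; NA entries are not counted as empty
empty_per_column <- sapply(toy, function(col) sum(col == "", na.rm = TRUE))

na_per_column
empty_per_column
```

Whether empty strings should be treated as missing depends on the column; for the exponent columns, this analysis later interprets them as a multiplier of 1.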

Data Processing for Health Damage Analysis

The focus is now on the impact on health. For each event type, we sum the injuries, the fatalities, and both together.

health_damage_per_eventtype <- stormdata %>% 
  group_by(EVTYPE) %>% 
  summarise(sum_injuries = sum(INJURIES), sum_fatalities = sum(FATALITIES), sum_damage = sum(FATALITIES) + sum(INJURIES)) 
## `summarise()` ungrouping output (override with `.groups` argument)
head(health_damage_per_eventtype)
## # A tibble: 6 x 4
##   EVTYPE                  sum_injuries sum_fatalities sum_damage
##   <chr>                          <dbl>          <dbl>      <dbl>
## 1 "   HIGH SURF ADVISORY"            0              0          0
## 2 " COASTAL FLOOD"                   0              0          0
## 3 " FLASH FLOOD"                     0              0          0
## 4 " LIGHTNING"                       0              0          0
## 5 " TSTM WIND"                       0              0          0
## 6 " TSTM WIND (G45)"                 0              0          0
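
The first rows show that some EVTYPE labels carry leading spaces, so the same event can appear under several spellings. This analysis keeps the labels as recorded. As a hedged sketch of one possible cleanup, the labels could be trimmed and upper-cased before grouping; the vector `evtype_raw` is a made-up sample modelled on the output above:

```r
# Made-up sample of labels modelled on the output above
evtype_raw <- c(" TSTM WIND", "TSTM WIND", "   HIGH SURF ADVISORY", " LIGHTNING")

# Remove surrounding whitespace and unify the case
evtype_clean <- toupper(trimws(evtype_raw))

# Number of distinct labels before and after the cleanup
n_before <- length(unique(evtype_raw))
n_after  <- length(unique(evtype_clean))
c(before = n_before, after = n_after)
```

Applying such a step to the full data would change the group totals, so it is shown here only as an option.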

Data Processing for Economic Damage Analysis

The focus is now on the economic impact. First we have to clean the exponent columns, i.e. PROPDMGEXP and CROPDMGEXP. Both columns contain several different symbols:

unique_p_exp <- unique(stormdata$PROPDMGEXP)
unique_c_exp <- unique(stormdata$CROPDMGEXP)
unique_p_exp
##  [1] "K" "M" ""  "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-" "1" "8"
unique_c_exp
## [1] ""  "M" "K" "m" "B" "?" "0" "k" "2"

According to the documentation mentioned above, the exponents can be interpreted as follows:

  • "H" or "h": The unit is $100, i.e. 10^2 dollars.
  • "K" or "k": The unit is $1,000, i.e. 10^3 dollars.
  • "M" or "m": The unit is $1,000,000, i.e. 10^6 dollars.
  • "B": The unit is $1,000,000,000, i.e. 10^9 dollars.
  • All other symbols are interpreted as a unit of $1.
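
As a worked example of this rule, the sketch below converts four made-up damage records into dollars. The helper `exp_multiplier()` and the sample values are illustrative only and use base R; this is not the code used in the analysis itself.

```r
# Hypothetical helper: translate an exponent symbol into a multiplier
exp_multiplier <- function(symbol) {
  lookup <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)
  # Accept lower-case h, k, m as well; "B" is matched in upper case only
  key <- ifelse(symbol %in% c("h", "k", "m"), toupper(symbol), symbol)
  multiplier <- unname(lookup[key])
  # Every symbol without an entry in the lookup counts as 1
  ifelse(is.na(multiplier), 1, multiplier)
}

# Made-up records: damage figure and its exponent symbol
damage   <- c(25, 2.5, 10, 3)
exponent <- c("K", "m", "B", "?")

damage_in_dollars <- damage * exp_multiplier(exponent)
damage_in_dollars
```

A named lookup vector keeps the rule in one place, which makes it easy to compare against the list above.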

A replacement table is defined for each of the two exponent columns:

# Map each exponent symbol to its numeric multiplier.
# Any symbol other than H/h, K/k, M/m or B is treated as a multiplier of 1.
exp_to_multiplier <- function(x) {
  case_when(x %in% c("K", "k") ~ 1000,
            x %in% c("H", "h") ~ 100,
            x %in% c("M", "m") ~ 1000000,
            x == "B" ~ 10^9,
            TRUE ~ 1)
}

propdmgexp_new <- exp_to_multiplier(unique_p_exp)
cropdmgexp_new <- exp_to_multiplier(unique_c_exp)

replacement_p_exp <- data.frame(unique_p_exp, propdmgexp_new)
replacement_c_exp <- data.frame(unique_c_exp, cropdmgexp_new)

replacement_p_exp
##    unique_p_exp propdmgexp_new
## 1             K          1e+03
## 2             M          1e+06
## 3                        1e+00
## 4             B          1e+09
## 5             m          1e+06
## 6             +          1e+00
## 7             0          1e+00
## 8             5          1e+00
## 9             6          1e+00
## 10            ?          1e+00
## 11            4          1e+00
## 12            2          1e+00
## 13            3          1e+00
## 14            h          1e+02
## 15            7          1e+00
## 16            H          1e+02
## 17            -          1e+00
## 18            1          1e+00
## 19            8          1e+00
replacement_c_exp
##   unique_c_exp cropdmgexp_new
## 1                       1e+00
## 2            M          1e+06
## 3            K          1e+03
## 4            m          1e+06
## 5            B          1e+09
## 6            ?          1e+00
## 7            0          1e+00
## 8            k          1e+03
## 9            2          1e+00

These replacement tables are now joined to the storm data. The new columns PROPDMG_NEW and CROPDMG_NEW contain PROPDMG and CROPDMG multiplied by their respective multipliers, so all values are expressed in dollars and are directly comparable.

replacement_p_exp <- replacement_p_exp %>% 
  rename(PROPDMGEXP = unique_p_exp) 
replacement_c_exp <- replacement_c_exp %>% 
  rename(CROPDMGEXP = unique_c_exp)

stormdata <- stormdata %>%
  inner_join(replacement_p_exp, by = "PROPDMGEXP") %>%
  inner_join(replacement_c_exp, by = "CROPDMGEXP") %>%
  mutate(PROPDMG_NEW = PROPDMG * propdmgexp_new) %>%
  mutate(CROPDMG_NEW = CROPDMG * cropdmgexp_new)

head(stormdata)
##    EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 TORNADO          0       15    25.0          K       0           
## 2 TORNADO          0        0     2.5          K       0           
## 3 TORNADO          0        2    25.0          K       0           
## 4 TORNADO          0        2     2.5          K       0           
## 5 TORNADO          0        2     2.5          K       0           
## 6 TORNADO          0        6     2.5          K       0           
##   propdmgexp_new cropdmgexp_new PROPDMG_NEW CROPDMG_NEW
## 1           1000              1       25000           0
## 2           1000              1        2500           0
## 3           1000              1       25000           0
## 4           1000              1        2500           0
## 5           1000              1        2500           0
## 6           1000              1        2500           0

Now we can compute the financial impact on property, on crops, and on both combined for each weather event.

economic_damage_per_eventtype <- stormdata %>%
  group_by(EVTYPE) %>%
  summarise(eco_p = sum(PROPDMG_NEW), eco_c = sum(CROPDMG_NEW), eco_damage = sum(PROPDMG_NEW + CROPDMG_NEW)) 
## `summarise()` ungrouping output (override with `.groups` argument)
head(economic_damage_per_eventtype)
## # A tibble: 6 x 4
##   EVTYPE                    eco_p eco_c eco_damage
##   <chr>                     <dbl> <dbl>      <dbl>
## 1 "   HIGH SURF ADVISORY"  200000     0     200000
## 2 " COASTAL FLOOD"              0     0          0
## 3 " FLASH FLOOD"            50000     0      50000
## 4 " LIGHTNING"                  0     0          0
## 5 " TSTM WIND"            8100000     0    8100000
## 6 " TSTM WIND (G45)"         8000     0       8000

Results

Results in Health Damage

The 10 event types with the highest impacts in injuries plus fatalities are as follows:

health_damage_per_eventtype_top10 <- health_damage_per_eventtype %>% 
  arrange(desc(sum_damage)) %>% 
  top_n(sum_damage, n = 10)
health_damage_per_eventtype_top10
## # A tibble: 10 x 4
##    EVTYPE            sum_injuries sum_fatalities sum_damage
##    <chr>                    <dbl>          <dbl>      <dbl>
##  1 TORNADO                  91346           5633      96979
##  2 EXCESSIVE HEAT            6525           1903       8428
##  3 TSTM WIND                 6957            504       7461
##  4 FLOOD                     6789            470       7259
##  5 LIGHTNING                 5230            816       6046
##  6 HEAT                      2100            937       3037
##  7 FLASH FLOOD               1777            978       2755
##  8 ICE STORM                 1975             89       2064
##  9 THUNDERSTORM WIND         1488            133       1621
## 10 WINTER STORM              1321            206       1527

Across the United States, tornadoes are by far the most harmful weather events with respect to population health.
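
To put this gap in numbers, the following sketch uses only the sum_damage figures from the table above. It computes the ratio between tornadoes and the second-ranked event type, and the share of tornadoes within the top 10. The variable names are illustrative.

```r
# Injuries plus fatalities per event type, copied from the top 10 table
top10_cases <- c("TORNADO" = 96979, "EXCESSIVE HEAT" = 8428, "TSTM WIND" = 7461,
                 "FLOOD" = 7259, "LIGHTNING" = 6046, "HEAT" = 3037,
                 "FLASH FLOOD" = 2755, "ICE STORM" = 2064,
                 "THUNDERSTORM WIND" = 1621, "WINTER STORM" = 1527)

# How many times more cases tornadoes caused than the runner-up
ratio_to_second <- top10_cases[["TORNADO"]] / top10_cases[["EXCESSIVE HEAT"]]

# Share of tornadoes among all cases in the top 10
tornado_share <- top10_cases[["TORNADO"]] / sum(top10_cases)

round(ratio_to_second, 1)
round(100 * tornado_share, 1)
```

The share refers to the top 10 only, not to all event types in the database.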

Here is a visualisation of the conclusion:

health_damage_per_eventtype_top10 %>% 
  select(EVTYPE, sum_injuries, sum_fatalities) %>%
  rename(injuries = sum_injuries, fatalities = sum_fatalities) %>%
  melt(id.vars = 'EVTYPE') %>%
  rename(damage = variable) %>%
  ggplot(aes(x=EVTYPE, y =  value, fill = damage)) + 
  geom_bar(stat = "identity", position = "stack" ) + 
  coord_flip() + 
  labs(title = "The 10 weather events with the highest impact on health") +
  ylab("number of cases") +
  xlab("weather events")
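
By default, ggplot2 orders the bars alphabetically by EVTYPE. As a hedged sketch of one possible refinement, `reorder()` from base R can turn the labels into a factor whose levels follow the number of cases, so that the bars would appear in ranked order. The small data frame `toy_top` reuses three rows of the table above.

```r
# Three rows taken from the top 10 table, deliberately not in ranked order
toy_top <- data.frame(EVTYPE     = c("FLOOD", "TORNADO", "LIGHTNING"),
                      sum_damage = c(7259, 96979, 6046),
                      stringsAsFactors = FALSE)

# Factor levels sorted by ascending sum_damage instead of alphabetically
toy_top$EVTYPE <- reorder(toy_top$EVTYPE, toy_top$sum_damage)

levels(toy_top$EVTYPE)
```

In the plot above, the same idea could be applied after `melt()` with `aes(x = reorder(EVTYPE, value, FUN = sum), ...)`, which ranks each event type by its stacked total.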

Results in Economic Damage

Now we consider the 10 weather events with the greatest economic consequences:

economic_damage_per_eventtype_top10 <- economic_damage_per_eventtype %>%
  arrange(desc(eco_damage)) %>%
  top_n(eco_damage, n = 10)

economic_damage_per_eventtype_top10  
## # A tibble: 10 x 4
##    EVTYPE                    eco_p       eco_c    eco_damage
##    <chr>                     <dbl>       <dbl>         <dbl>
##  1 FLOOD             144657709807   5661968450 150319678257 
##  2 HURRICANE/TYPHOON  69305840000   2607872800  71913712800 
##  3 TORNADO            56937160779.   414953270  57352114049.
##  4 STORM SURGE        43323536000         5000  43323541000 
##  5 HAIL               15732267543.  3025954473  18758222016.
##  6 FLASH FLOOD        16140812067.  1421317100  17562129167.
##  7 DROUGHT             1046106000  13972566000  15018672000 
##  8 HURRICANE          11868319010   2741910000  14610229010 
##  9 RIVER FLOOD         5118945500   5029459000  10148404500 
## 10 ICE STORM           3944927860   5022113500   8967041360

We can see that floods cause the highest economic damage in the U.S. among the weather events considered.
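
The table also shows that closely related labels are counted separately, for example FLOOD, FLASH FLOOD and RIVER FLOOD, or HURRICANE/TYPHOON and HURRICANE. The following sketch adds up the rounded eco_damage figures from the table above per hand-picked group. The grouping itself is an assumption made for illustration and is not part of this analysis.

```r
# eco_damage figures (in dollars, as printed) for the related event types
eco_damage <- c("FLOOD" = 150319678257, "HURRICANE/TYPHOON" = 71913712800,
                "FLASH FLOOD" = 17562129167, "HURRICANE" = 14610229010,
                "RIVER FLOOD" = 10148404500)

# Hand-picked grouping (assumption): labels containing FLOOD versus the rest
group <- ifelse(grepl("FLOOD", names(eco_damage)),
                "flood-related", "hurricane-related")

# Total damage per group, expressed in billions of dollars
combined_billion <- sapply(split(eco_damage, group), sum) / 1e9
round(combined_billion, 1)
```

Under this grouping, flood-related events remain ahead of hurricane-related events, which is consistent with the result above.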

The visualisation of this result can be created as follows:

economic_damage_per_eventtype_top10 %>%
  select(EVTYPE, eco_p, eco_c) %>%
  rename(property = eco_p, crop = eco_c) %>%
  melt(id.vars = 'EVTYPE') %>%
  rename(economical_damage = variable) %>%
  ggplot(aes(x=EVTYPE, y = value, fill = economical_damage)) + 
  geom_bar(stat = "identity", position = "stack") + 
  coord_flip() + 
  labs(title = "The 10 weather events with the highest economic consequences") +
  ylab("economic loss (in $)") +
  xlab("weather events")