Synopsis

Storm data provides information on a large number of weather events and their consequences. Health outcomes are measured by the number of deaths and the number of injuries caused by each event. Economic consequences are measured by the amount in USD of damages to properties and to crops. The aim of this study was to find which type of event caused most victims and damages. I found that tornados are responsible for 62% of all injuries and deaths, tropical storms are responsible for 37% of all property damages and floods are responsible for 38% of all agricultural damages. Unclassified events are resposible for 8% of victims and 30% of damages.

Data Processing

# importing some libraries:
library(tidyverse) # ggplot, dplyr and related
## ── Attaching packages ───────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.0       ✔ purrr   0.3.1  
## ✔ tibble  2.0.1       ✔ dplyr   0.8.0.1
## ✔ tidyr   0.8.3       ✔ stringr 1.4.0  
## ✔ readr   1.3.1       ✔ forcats 0.4.0
## ── Conflicts ──────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
# I use theese packages to create nice tables
library(knitr)
library(kableExtra)
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows
# I use this package to automatically number captions.
library(captioner)
fig_caption <- captioner::captioner("Figure.")
tab_caption <- captioner::captioner("Table.")

# force US style dates and times
Sys.setlocale("LC_TIME","en_US.UTF-8") 
## [1] "en_US.UTF-8"
# set the working directory to the appropriate location on your system
setwd("~/Desktop/coursera/reproductible_research")
# I assume there is the required file in your woring directory
# it is a tar file (.bz2) which contains a .csv file
# these 2 steps take long to execute, so chache / comment them as necesary
#untar(tarfile = "repdata_data_StormData.csv.bz2")
storm_data <- read.csv("repdata_data_StormData.csv", as.is = T, stringsAsFactors = F)

# work on a copy to avoid reading the data again if I mess up any trasformations
SD <- storm_data

# I'll experiment on a sample and move on the real data when I'm ready
SD <- sample_n(SD, 1000)


#str(SD)

Results

Which types of events are most harmful to population health?

Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

There are 2 variables that relate to populaton health: INJURIES and FATALITIES. Both are numeric (number of people harmed by each event) and are every skewed. Most events had no victims, while some had. I will just count the total number of victims for each.

Event type is just a mess. In the documentation, there should be just 48 event types (page 6, Table 1). In this list, the event types are about 1000 distinct ones. Some are just spelling mistakes, some are written differently, some have extra details and some are just plain wrong. I copied the 48 events in a file named “events.txt”, which the script will read and will try to match ‘Event type’ against the predefined list. I’ll use fuzzy matches (agrep) in order to catch some simple mistakes and I manage t catch about half of the distinct event types. There should be some manual curation and/or a smarter way to do this in order to recode all events to one of the predefined event types. I’ll run this as is because it’s still better than nothing.

# recode event types
predefined_levels <- read_lines("events.txt")
bad_levels <- storm_data$EVTYPE %>% str_squish() %>% tolower() %>% factor() %>% levels()
recoded_levels = rep(NA, length(bad_levels)) %>% as.character()

for (w in predefined_levels) { # find fuzzy matches
  positions <- agrep(x = bad_levels, pattern = w, ignore.case = T, value = F, max.distance = 0.01)
  recoded_levels[positions][is.na(recoded_levels[positions])] <- w
}

#sum(is.na(recoded_levels))
names(recoded_levels) = bad_levels

evtypes <- as.character(storm_data$EVTYPE) %>% str_squish() %>% tolower()
for (x in seq_along(evtypes)) {
  #x <- recoding_rules$recoded_levels[recoding_rules$bad_levels==x]
  evtypes[x] <- recoded_levels[evtypes[x]]
}
 evtypes[is.na(evtypes)] <- "(unclassified)"
# make a temporary dataset
DF <- storm_data %>% select(
  #`Event type` = EVTYPE, # just in case you want to use the original event types
  Injuries = INJURIES,
  Fatalities = FATALITIES
) %>% mutate(
  #`Event type` = tolower(`Event type`) %>% str_squish(), # if you want to use the original event types
  `Event type` = evtypes) %>%
  group_by(`Event type`) %>%
  summarise(
    Injuries = sum(Injuries), 
    Fatalities = sum(Fatalities)
  ) #%>% View()

After recoding, the nost dangerous events are tornados, responsible for almost 10,000 victims (mostly injuries).

# plot the previously generated data
## ggplot needs the data to be long, not wide

DF %>% 
  tidyr::gather("outcome", "victims", -c("Event type")) %>%
  ggplot(aes(x=`Event type`, y=`victims`, fill=`outcome`)) +
  geom_bar(stat="identity", position = "stack") +
  #scale_y_continuous(limits = c(0, 25000))+
  labs(title = "Health outcomes of storm-related events",
       subtitle = "data restricted to a predefined list of events",
       y="Total number of victims",
       x="Event type (restricted)")+
  coord_flip()+
  theme(legend.position = c(1,0), legend.justification = c(1, 0), 
        legend.background = element_blank(),
        legend.title = element_blank())
Figure. 1: Health efects (total number of injries and fatalities) atributable to various event types. These event types are obtained by coercing the original event types to a predefined list of events as suggested by the documentation (page 6, Table 1).

Figure. 1: Health efects (total number of injries and fatalities) atributable to various event types. These event types are obtained by coercing the original event types to a predefined list of events as suggested by the documentation (page 6, Table 1).

#sum(storm_data$INJURIES[storm_data$EVTYPE=="TORNADO"])

Which types of events have the greatest economic consequences?

Across the United States, which types of events have the greatest economic consequences?

In a similar manner to health effect, economc consequences are recoded by ‘PROPDMG’ (Propery Damages) and ‘CROPDMG’ (Crops Damage), each with a indicator for the mesurement unit (‘PROPDMGEXP’ and ‘CROPDMGEXP’). The units are interesting, because they code for the exponent of the damage in USD. They have several levels: K for thousands, M for millions, and B for billions. I will suppose that no exponent means plain USD. I will ignore all other data in these variables.

DF2 <- storm_data %>% select(
  #EVTYPE, 
  PROPDMG,PROPDMGEXP, 
  CROPDMG, CROPDMGEXP)  %>%
  mutate(
    `Event type` = evtypes, # same as befre
    prop.exp = ifelse(PROPDMGEXP %in% c("K", "k"), 3, 
                  ifelse(PROPDMGEXP %in% c("M", "m"), 6, 
                    ifelse(PROPDMGEXP %in% c("B", "b"), 9, 
                      ifelse(PROPDMGEXP == "", 0, NA)))),
    crop.exp = ifelse(CROPDMGEXP %in% c("K", "k"), 3, 
                  ifelse(CROPDMGEXP %in% c("M", "m"), 6, 
                    ifelse(CROPDMGEXP %in% c("B", "b"), 9, 
                      ifelse(CROPDMGEXP == "", 0, NA)))),
    prop = PROPDMG*10^prop.exp,
    crop = CROPDMG*10^crop.exp
    ) %>% group_by(`Event type`) %>%
  summarise(
    `Property Damages (USD)` = sum(prop),
    `Crops Damages (USD)` = sum(crop))

After recoding, the nost dangerous events are floods, responsible for almost 10 billion USD of crop damages and tropical storms, resposible for more than 7.5 billion USD in property damages and more than 1 billion USD in crops damages.

# plot the previously generated data
## ggplot needs the data to be long, not wide

DF2 %>% 
  tidyr::gather("outcome", "damage", -c("Event type")) %>%
  ggplot(aes(x=`Event type`, y=`damage`/10^9, fill=`outcome`)) +
  geom_bar(stat="identity", position = "stack") +
  scale_y_continuous(breaks=1:12)+
  labs(title = "Damages from storm-related events",
       subtitle = "data restricted to a predefined list of events",
       y="Total damage (Billion dollars)",
       x="Event type (restricted)")+
  coord_flip()+
  theme(legend.position = c(1,0.05), legend.justification = c(1, 0), 
        legend.background = element_blank(),
        legend.title = element_blank())
## Warning: Removed 17 rows containing missing values (position_stack).
Figure. 2: Damages (total cost in billion USD in propertis and crops) atributable to various event types. These event types are obtained by coercing the original event types to a predefined list of events as suggested by the documentation (page 6, Table 1).

Figure. 2: Damages (total cost in billion USD in propertis and crops) atributable to various event types. These event types are obtained by coercing the original event types to a predefined list of events as suggested by the documentation (page 6, Table 1).

#sum(storm_data$INJURIES[storm_data$EVTYPE=="TORNADO"])

Summary table

The table bellow summarizes both halth effects and economic damages related to various strorm events.

# make a nice table to summarize health effects

merge(DF, DF2, by = "Event type", all=T) %>%
  mutate(
    .victims = Injuries+Fatalities,
    `Injuries` = paste0(Injuries, " (", scales::percent(Injuries/sum(Injuries), 0.01), ")" ),
    `Fatalities` = paste0(Fatalities, " (", scales::percent(Fatalities/sum(Fatalities), 0.01), ")" ),
    `All victims` = paste0(.victims, " (", scales::percent(.victims/sum(.victims), 0.01), ")" ), 
    
    `Property Damages (USD)` = `Property Damages (USD)`/10^9,
    `Crops Damages (USD)` = `Crops Damages (USD)`/10^9,
    .damages = `Property Damages (USD)`+`Crops Damages (USD)`,
    `Property Damages (B USD)` = paste0(scales::number(`Property Damages (USD)`,0.001), " (", 
                                      scales::percent(`Property Damages (USD)`/sum(`Property Damages (USD)`, na.rm=T), 0.01), ")" ),
    `Crops Damages (B USD)` = paste0(scales::number(`Crops Damages (USD)`,0.001), " (", 
                                   scales::percent(`Crops Damages (USD)`/sum(`Crops Damages (USD)`, na.rm=T), 0.01), ")" ),
    `Total Damages (B USD)` = paste0(scales::number(.damages,0.001), " (", 
                                   scales::percent(.damages/sum(.damages, na.rm=T), 0.01), ")")) %>%
  select(`Event type`, `Injuries`, `Fatalities`,`All victims`, 
         `Property Damages (B USD)`, `Crops Damages (B USD)`, `Total Damages (B USD)`) %>%
  mutate_all(str_replace_all, pattern="NA \\(NA%\\)", replacement="-") %>%
  
  kable(escape = T, align = "l", booktabs=T, format = "html", row.names=F,
        caption=tab_caption("t2", "Health efects (total number of injries and fatalities) and damages (total cost in billion USD in propertis and crops) atributable to various event types. These event types are obtained by coercing the original event types to a predefined list of events as suggested by the documentation (page 6, Table 1).")) %>%
  kableExtra::row_spec(1, italic=T) %>%
  kable_styling(bootstrap_options = c("striped", "hover","condensed", full_width = T)) 
Table. 1: Health efects (total number of injries and fatalities) and damages (total cost in billion USD in propertis and crops) atributable to various event types. These event types are obtained by coercing the original event types to a predefined list of events as suggested by the documentation (page 6, Table 1).
Event type Injuries Fatalities All victims Property Damages (B USD) Crops Damages (B USD) Total Damages (B USD)
(unclassified) 10925 (7.77%) 1219 (8.05%) 12144 (7.80%)
8.686 (30.25%)
Astronomical Low Tide 0 (0.00%) 0 (0.00%) 0 (0.00%) 0.000 (0.00%) 0.000 (0.00%) 0.000 (0.00%)
Avalanche 171 (0.12%) 225 (1.49%) 396 (0.25%) 0.009 (0.04%) 0.000 (0.00%) 0.009 (0.04%)
Blizzard 805 (0.57%) 101 (0.67%) 906 (0.58%) 0.660 (3.22%) 0.112 (0.39%) 0.772 (3.47%)
Coastal Flood 7 (0.00%) 6 (0.04%) 13 (0.01%) 0.433 (2.11%) 0.000 (0.00%) 0.433 (1.95%)
Cold/Wind Chill 36 (0.03%) 220 (1.45%) 256 (0.16%) 0.011 (0.05%) 0.001 (0.00%) 0.011 (0.05%)
Dense Fog 342 (0.24%) 18 (0.12%) 360 (0.23%) 0.010 (0.05%) 0.000 (0.00%) 0.010 (0.04%)
Dense Smoke 0 (0.00%) 0 (0.00%) 0 (0.00%) 0.000 (0.00%) 0.000 (0.00%) 0.000 (0.00%)
Drought 19 (0.01%) 6 (0.04%) 25 (0.02%) 1.046 (5.10%)
Dust Devil 43 (0.03%) 2 (0.01%) 45 (0.03%) 0.001 (0.00%) 0.000 (0.00%) 0.001 (0.00%)
Dust Storm 440 (0.31%) 22 (0.15%) 462 (0.30%) 0.006 (0.03%) 0.004 (0.01%) 0.009 (0.04%)
Excessive Heat 6525 (4.64%) 1920 (12.68%) 8445 (5.42%) 0.008 (0.04%) 0.492 (1.71%) 0.500 (2.25%)
Flash Flood 1802 (1.28%) 1035 (6.83%) 2837 (1.82%)
Flood 6795 (4.84%) 484 (3.20%) 7279 (4.68%)
10.848 (37.77%)
Freezing Fog 0 (0.00%) 0 (0.00%) 0 (0.00%) 0.002 (0.01%) 0.000 (0.00%) 0.002 (0.01%)
Frost/Freeze 0 (0.00%) 0 (0.00%) 0 (0.00%) 0.011 (0.05%) 1.094 (3.81%) 1.105 (4.97%)
Funnel Cloud 3 (0.00%) 0 (0.00%) 3 (0.00%) 0.000 (0.00%) 0.000 (0.00%) 0.000 (0.00%)
Hail 1472 (1.05%) 62 (0.41%) 1534 (0.99%)
Heat 4601 (3.27%) 1566 (10.34%) 6167 (3.96%)
1.357 (4.73%)
High Surf 156 (0.11%) 104 (0.69%) 260 (0.17%) 0.090 (0.44%) 0.000 (0.00%) 0.090 (0.41%)
High Wind 1522 (1.08%) 297 (1.96%) 1819 (1.17%)
Ice Storm 1990 (1.42%) 89 (0.59%) 2079 (1.34%)
5.022 (17.49%)
Lake-Effect Snow 0 (0.00%) 0 (0.00%) 0 (0.00%) 0.040 (0.20%) 0.000 (0.00%) 0.040 (0.18%)
Lightning 5232 (3.72%) 817 (5.39%) 6049 (3.89%)
0.012 (0.04%)
Marine Strong Wind 22 (0.02%) 14 (0.09%) 36 (0.02%) 0.000 (0.00%) 0.000 (0.00%) 0.000 (0.00%)
Marine Thunderstorm Wind 26 (0.02%) 10 (0.07%) 36 (0.02%) 0.000 (0.00%) 0.000 (0.00%) 0.000 (0.00%)
Rip Current 529 (0.38%) 572 (3.78%) 1101 (0.71%) 0.000 (0.00%) 0.000 (0.00%) 0.000 (0.00%)
Seiche 0 (0.00%) 0 (0.00%) 0 (0.00%) 0.001 (0.00%) 0.000 (0.00%) 0.001 (0.00%)
Sleet 0 (0.00%) 2 (0.01%) 2 (0.00%) 0.002 (0.01%) 0.000 (0.00%) 0.002 (0.01%)
Storm Surge/Tide 5 (0.00%) 11 (0.07%) 16 (0.01%) 4.641 (22.63%) 0.001 (0.00%) 4.642 (20.88%)
Strong Wind 301 (0.21%) 111 (0.73%) 412 (0.26%) 0.181 (0.88%) 0.070 (0.24%) 0.251 (1.13%)
Thunderstorm Wind 2412 (1.72%) 200 (1.32%) 2612 (1.68%)
Tornado 91407 (65.05%) 5636 (37.21%) 97043 (62.34%)
Tropical Depression 0 (0.00%) 0 (0.00%) 0 (0.00%) 0.002 (0.01%) 0.000 (0.00%) 0.002 (0.01%)
Tropical Storm 383 (0.27%) 66 (0.44%) 449 (0.29%) 7.714 (37.61%) 0.695 (2.42%) 8.409 (37.83%)
Tsunami 129 (0.09%) 33 (0.22%) 162 (0.10%) 0.144 (0.70%) 0.000 (0.00%) 0.144 (0.65%)
Volcanic Ash 0 (0.00%) 0 (0.00%) 0 (0.00%) 0.000 (0.00%) 0.000 (0.00%) 0.000 (0.00%)
Waterspout 29 (0.02%) 3 (0.02%) 32 (0.02%) 0.010 (0.05%) 0.000 (0.00%) 0.010 (0.04%)
Wildfire 1061 (0.76%) 78 (0.52%) 1139 (0.73%) 5.490 (26.76%) 0.296 (1.03%) 5.786 (26.03%)
Winter Storm 1338 (0.95%) 216 (1.43%) 1554 (1.00%)
0.027 (0.10%)