Synopsis

This report explores the NOAA (U.S. National Oceanic and Atmospheric Administration) Storm Database to answer the following questions:

  1. Across the United States, which types of events are most harmful with respect to population health?
  2. Across the United States, which types of events have the greatest economic consequences?

The report is written according to the principles of “Reproducible Reports”.

Data Processing

I use the tidyverse package and follow its conventions, in particular the pipe syntax.

The data is downloaded from the URL below and read into a tibble. Both steps are only executed when their result isn’t already present.

library(tidyverse)
## -- Attaching packages --------------------------------------------------------------------- tidyverse 1.2.0 --
## v ggplot2 2.2.1     v purrr   0.2.4
## v tibble  1.3.4     v dplyr   0.7.4
## v tidyr   0.7.2     v stringr 1.2.0
## v readr   1.1.1     v forcats 0.2.0
## -- Conflicts ------------------------------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
setwd("D:/Studium/DataScience/Course Material/DataScience/05 Reproducible Research")
url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
file <- "stormData.bz2"
if(!file.exists(file)) {
  download.file(url, file)
}
if(!exists("storm_data")){
  storm_data <- as_tibble(read.csv(file))
}

Looking at the data, it becomes clear that only the following columns are needed for the purpose of this analysis: EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, and CROPDMGEXP.

Unfortunately, there doesn’t seem to be any documentation on the exponent columns (PROPDMGEXP and CROPDMGEXP). Clearly, their values need to be translated into numeric factors and then, I assume, multiplied with the base damage values; under this assumption, a PROPDMG of 25 with PROPDMGEXP “K”, for example, corresponds to 25 * 1e3 = 25,000 USD. I translate the values as well as I can. Values I can’t interpret get the factor 1, meaning that the exponent has no effect.

To clearly show the translation, I create a data set with all distinct exponent values and their translations. I then join this translation data set to the original data and use the translated factors in the mutations.

str(storm_data)
## Classes 'tbl_df', 'tbl' and 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 35 levels "","  N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_DATE  : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_TIME  : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WFO       : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ZONENAMES : Factor w/ 25112 levels "","                                                                                                               "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...
# collect all exponents in one distinct set
exp_prop <- rename(distinct(storm_data, PROPDMGEXP), exp = PROPDMGEXP)
exp_crop <- rename(distinct(storm_data, CROPDMGEXP), exp = CROPDMGEXP)
exp <- distinct(rbind(exp_prop, exp_crop), exp)
(exp <- exp %>% mutate_if(is.factor, as.character))
## # A tibble: 20 x 1
##      exp
##    <chr>
##  1     K
##  2     M
##  3      
##  4     B
##  5     m
##  6     +
##  7     0
##  8     5
##  9     6
## 10     ?
## 11     4
## 12     2
## 13     3
## 14     h
## 15     7
## 16     H
## 17     -
## 18     1
## 19     8
## 20     k
# function to translate the exponents as far as feasible and set the rest to 1
clean_exp <- function(exp) {
  # default of 1 means the exponent has no effect; exp arrives as a character,
  # so the is.numeric() branch below never matches and the digit codes also
  # fall back to this default (see the translation table below)
  exp_clean <- 1
  if (toupper(exp) == "H") exp_clean <- 1e02
  else if (toupper(exp) == "K") exp_clean <- 1e03
  else if (toupper(exp) == "M") exp_clean <- 1e06
  else if (toupper(exp) == "B") exp_clean <- 1e09
  else if (is.numeric(exp)) exp_clean <- 10 ** exp
  return(exp_clean)
}
# The function needs to be applied rowwise, which is expensive.
# Do the rowwise cleaning on this small dataset instead of the large one!
(exp <- exp %>% rowwise() %>% mutate(exp_clean = clean_exp(exp)))
## Source: local data frame [20 x 2]
## Groups: <by row>
## 
## # A tibble: 20 x 2
##      exp exp_clean
##    <chr>     <dbl>
##  1     K     1e+03
##  2     M     1e+06
##  3           1e+00
##  4     B     1e+09
##  5     m     1e+06
##  6     +     1e+00
##  7     0     1e+00
##  8     5     1e+00
##  9     6     1e+00
## 10     ?     1e+00
## 11     4     1e+00
## 12     2     1e+00
## 13     3     1e+00
## 14     h     1e+02
## 15     7     1e+00
## 16     H     1e+02
## 17     -     1e+00
## 18     1     1e+00
## 19     8     1e+00
## 20     k     1e+03
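
As a quick, optional sanity check, a few of the translations can be verified directly; stopifnot() raises an error if any of these expectations fails.

# spot check the translation for a few representative codes
stopifnot(
  clean_exp("k") == 1e03,
  clean_exp("M") == 1e06,
  clean_exp("B") == 1e09,
  clean_exp("?") == 1,
  clean_exp("5") == 1
)
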
# get a data set with all relevant columns and clean values
(storm_data_clean <- storm_data %>%
  select(EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP) %>%
  mutate_if(is.factor, as.character) %>%
  left_join(exp, c("PROPDMGEXP" = "exp")) %>%
  mutate(PROPDMG_CLEAN = PROPDMG * exp_clean) %>%
  select(-exp_clean) %>%
  left_join(exp, c("CROPDMGEXP" = "exp")) %>%
  mutate(CROPDMG_CLEAN = CROPDMG * exp_clean) %>%
  select(-exp_clean))
## # A tibble: 902,297 x 9
##     EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
##      <chr>      <dbl>    <dbl>   <dbl>      <chr>   <dbl>      <chr>
##  1 TORNADO          0       15    25.0          K       0           
##  2 TORNADO          0        0     2.5          K       0           
##  3 TORNADO          0        2    25.0          K       0           
##  4 TORNADO          0        2     2.5          K       0           
##  5 TORNADO          0        2     2.5          K       0           
##  6 TORNADO          0        6     2.5          K       0           
##  7 TORNADO          0        1     2.5          K       0           
##  8 TORNADO          0        0     2.5          K       0           
##  9 TORNADO          1       14    25.0          K       0           
## 10 TORNADO          0        0    25.0          K       0           
## # ... with 902,287 more rows, and 2 more variables: PROPDMG_CLEAN <dbl>,
## #   CROPDMG_CLEAN <dbl>
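
Because left_join() fills unmatched keys with NA, an optional check that no missing values slipped into the cleaned damage columns is cheap insurance here:

# the joins should not have produced any missing values
stopifnot(
  !anyNA(storm_data_clean$PROPDMG_CLEAN),
  !anyNA(storm_data_clean$CROPDMG_CLEAN)
)
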
# sum the relevant values per EVTYPE
(evtype_sums <- storm_data_clean %>%
  rename(`Event Type` = EVTYPE) %>%
  group_by(`Event Type`) %>%
  summarize(
    Fatalities = sum(FATALITIES),
    Injuries = sum(INJURIES),
    `Property Damage` = sum(PROPDMG_CLEAN),
    `Crop Damage` = sum(CROPDMG_CLEAN)))
## # A tibble: 985 x 5
##             `Event Type` Fatalities Injuries `Property Damage`
##                    <chr>      <dbl>    <dbl>             <dbl>
##  1    HIGH SURF ADVISORY          0        0            200000
##  2         COASTAL FLOOD          0        0                 0
##  3           FLASH FLOOD          0        0             50000
##  4             LIGHTNING          0        0                 0
##  5             TSTM WIND          0        0           8100000
##  6       TSTM WIND (G45)          0        0              8000
##  7            WATERSPOUT          0        0                 0
##  8                  WIND          0        0                 0
##  9                     ?          0        0              5000
## 10       ABNORMAL WARMTH          0        0                 0
## # ... with 975 more rows, and 1 more variables: `Crop Damage` <dbl>

Results

Recall the questions to be answered with this report:

  1. Across the United States, which types of events are most harmful with respect to population health?
  2. Across the United States, which types of events have the greatest economic consequences?

I answer the questions with top-10 lists, presented as charts.

Population health can be measured in the dimensions “fatalities” and “injuries” with the given data. I report these dimensions separately because it isn’t feasible to aggregate them with a formula like 1 fatality = 10 injuries. Tornadoes are clearly the most harmful events in both dimensions.
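
To back this claim with numbers, the single most harmful event type in each dimension can be pulled from the summary directly (an optional check that only reuses the columns computed above):

evtype_sums %>% arrange(desc(Fatalities)) %>% select(`Event Type`, Fatalities) %>% head(1)
evtype_sums %>% arrange(desc(Injuries)) %>% select(`Event Type`, Injuries) %>% head(1)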

evtype_sums %>%
  arrange(desc(Fatalities)) %>%
  head(10) %>%
  ggplot(mapping = aes(x = reorder(`Event Type`, Fatalities),
                       y = Fatalities)) +
    # a constant colour belongs outside aes(); mapping it would only create a legend
    geom_bar(stat = "identity", fill = "red") + coord_flip() +
    labs(x = "Event Type", title = "Top 10 Storm Event Types by Count of Fatalities")

evtype_sums %>%
  arrange(desc(Injuries)) %>%
  head(10) %>%
  ggplot(mapping = aes(x = reorder(`Event Type`, Injuries),
                       y = Injuries)) +
    # a constant colour belongs outside aes(); mapping it would only create a legend
    geom_bar(stat = "identity", fill = "blue") + coord_flip() +
    labs(x = "Event Type", title = "Top 10 Storm Event Types by Count of Injuries")

Economic consequences can be measured in the dimensions “property damage” and “crop damage”. I assume the respective values are given in USD and add them together to select the top-10 list. Floods have the greatest economic consequences.
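
As an optional cross-check of this claim, the top event type by combined damage can be listed before plotting (again reusing only columns computed above):

evtype_sums %>%
  mutate(`Total Damage` = `Property Damage` + `Crop Damage`) %>%
  arrange(desc(`Total Damage`)) %>%
  select(`Event Type`, `Total Damage`) %>%
  head(1)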

evtype_sums %>%
  # plain addition works column-wise, so the expensive rowwise() step isn't needed
  mutate(`Total Damage` = `Property Damage` + `Crop Damage`) %>%
  arrange(desc(`Total Damage`)) %>%
  head(10) %>%
  mutate(`Event Type` = factor(
    `Event Type`,
    levels = `Event Type`[order(`Total Damage`)],
    ordered = TRUE)) %>%
  select(`Event Type`, `Property Damage`, `Crop Damage`) %>%
  gather(`Property Damage`, `Crop Damage`,
         key = "Damage Type", value = "Damage Value") %>%
  ggplot(mapping = aes(x = `Event Type`, y = `Damage Value`, fill = `Damage Type`)) +
  geom_bar(stat = "identity") + coord_flip() +
  labs(x = "Event Type",
       y = "Property Damage in USD",
       title = "Top 10 Storm Event Types by Damage Value")