Synopsis

The National Oceanic and Atmospheric Administration (NOAA) is tasked with tracking and recording extreme weather events and the damage they cause. It publishes monthly updates to Storm Data, which records significant weather events at the county level in the United States. The data are freely available to the public and are used in a wide range of research and analysis projects.

This document uses Storm Data reports from 1950 to 2011 to answer some basic questions about the effect of weather events on human health and the economy. The analysis found that flooding was the most damaging weather event in terms of financial losses, while tornadoes were the most damaging to human health.

Data Processing

Loading the raw dataset

First, the necessary libraries were loaded: tidyverse and stringr for data wrangling and plotting, and lubridate for date parsing.

library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag():    dplyr, stats
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
library(stringr)

Then the original bzip2-compressed CSV file was downloaded from the course website, and the two date columns, BGN_DATE and END_DATE, were converted from character strings into a proper date-time format.

download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
              "stormdata.csv.bz2")
data <- read_csv("stormdata.csv.bz2", locale=locale(tz="UTC"))
## Parsed with column specification:
## cols(
##   .default = col_character(),
##   STATE__ = col_double(),
##   COUNTY = col_double(),
##   BGN_RANGE = col_double(),
##   COUNTY_END = col_double(),
##   END_RANGE = col_double(),
##   LENGTH = col_double(),
##   WIDTH = col_double(),
##   F = col_integer(),
##   MAG = col_double(),
##   FATALITIES = col_double(),
##   INJURIES = col_double(),
##   PROPDMG = col_double(),
##   CROPDMG = col_double(),
##   LATITUDE = col_double(),
##   LONGITUDE = col_double(),
##   LATITUDE_E = col_double(),
##   LONGITUDE_ = col_double(),
##   REFNUM = col_double()
## )
## See spec(...) for full column specifications.
data$BGN_DATE <- parse_date_time(data$BGN_DATE, "mdyHMS")
data$END_DATE <- parse_date_time(data$END_DATE, "mdyHMS")
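
As a small illustration of this conversion, the sketch below parses two made-up timestamp strings with the same "mdyHMS" order string. The sample values are assumptions chosen to resemble a month/day/year layout; they are not rows copied from the raw file.

```r
library(lubridate)

# Hypothetical sample strings in month/day/year hour:minute:second layout
sample_dates <- c("4/18/1950 0:00:00", "11/28/2011 0:00:00")

# Same order string as used for BGN_DATE and END_DATE above
parsed <- parse_date_time(sample_dates, "mdyHMS")

year(parsed)   # calendar year of each parsed timestamp
month(parsed)  # calendar month of each parsed timestamp
class(parsed)  # date-time class rather than character
```

Once the columns hold date-times, accessors such as year() make it straightforward to group or filter events by period.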

Exploratory analysis

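Once the data were loaded, a brief exploration was performed to get a better understanding of the major event types and to check for missing data in the event type, health, and damage columns. Here, we can see common weather events such as hail, thunderstorm wind, tornadoes, and flash floods dominating the event count. Thunderstorm wind appears under several labels (TSTM WIND, THUNDERSTORM WIND, and THUNDERSTORM WINDS), each of which is counted as a separate event type.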

group_by(data, EVTYPE) %>%
    summarize(Total = n()) %>%
    filter(Total > 10000)
## # A tibble: 12 × 2
##                EVTYPE  Total
##                 <chr>  <int>
## 1         FLASH FLOOD  54278
## 2               FLOOD  25326
## 3                HAIL 288661
## 4          HEAVY RAIN  11723
## 5          HEAVY SNOW  15708
## 6           HIGH WIND  20212
## 7           LIGHTNING  15755
## 8   THUNDERSTORM WIND  82563
## 9  THUNDERSTORM WINDS  20843
## 10            TORNADO  60652
## 11          TSTM WIND 219944
## 12       WINTER STORM  11433
sum(is.na(data$EVTYPE))
## [1] 0
sum(is.na(data$FATALITIES))
## [1] 0
sum(is.na(data$INJURIES))
## [1] 0
sum(is.na(data$PROPDMG))
## [1] 0
sum(is.na(data$CROPDMG))
## [1] 0
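
The same missing-value check can be run over several columns in one call. The sketch below is a compact alternative to the column-by-column checks above, not the method used in this report; the toy data frame and its values are made up purely for illustration.

```r
# Toy data frame standing in for the storm data (values are made up)
toy <- data.frame(
  EVTYPE     = c("HAIL", "FLOOD", NA),
  FATALITIES = c(0, 2, 1),
  INJURIES   = c(NA, 5, 0)
)

# is.na() returns a logical matrix; colSums() counts the TRUEs per column
na_counts <- colSums(is.na(toy))
na_counts
```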

Preprocessing: Effects on Population Health

The NOAA data include columns for both injury and fatality counts, each of which reflects an impact on population health. For the purposes of this analysis, a health score was calculated as a weighted sum of injuries and fatalities: injury counts were reduced by 80% (multiplied by 0.2) before being added to fatality counts. This health score serves as a proxy for the impact on population health.

data <- mutate(data, HEALTH_SCORE = 0.2 * INJURIES + FATALITIES)
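
To make the weighting concrete, the sketch below applies the same formula to two made-up events. The toy table and its counts are assumptions for illustration only.

```r
library(dplyr)

# Two hypothetical events (counts are made up)
toy <- tibble(
  EVTYPE     = c("TORNADO", "FLOOD"),
  INJURIES   = c(10, 0),
  FATALITIES = c(2, 3)
)

# Injuries count for one fifth of a fatality
scored <- mutate(toy, HEALTH_SCORE = 0.2 * INJURIES + FATALITIES)
scored
```

Here ten injuries contribute 2 points, so the first event scores 2 + 2 = 4, while the second scores 3 from fatalities alone.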

Preprocessing: Economic Consequences

Economic losses are reported for both property and crops. Each numerical loss value is paired with an alphabetical character that signifies its magnitude: according to the data documentation, K stands for thousands, M for millions, and B for billions. The full set of characters recorded for crop and property damage includes many other designations. Since NOAA provides no description of them, records carrying those designations were excluded from the analysis. Because the filter requires a recognized character in both columns, records with a missing (NA) magnitude in either column were excluded as well. Lowercase letters (k, m, b) were kept and converted to uppercase.

group_by(data, CROPDMGEXP) %>% summarize(Total = n())
## # A tibble: 9 × 2
##   CROPDMGEXP  Total
##        <chr>  <int>
## 1          ?      7
## 2          0     19
## 3          2      1
## 4          B      9
## 5          k     21
## 6          K 281832
## 7          m      1
## 8          M   1994
## 9       <NA> 618413
group_by(data, PROPDMGEXP) %>% summarize(Total = n())
## # A tibble: 19 × 2
##    PROPDMGEXP  Total
##         <chr>  <int>
## 1           -      1
## 2           ?      8
## 3           +      5
## 4           0    216
## 5           1     25
## 6           2     13
## 7           3      4
## 8           4      4
## 9           5     28
## 10          6      4
## 11          7      5
## 12          8      1
## 13          B     40
## 14          h      1
## 15          H      6
## 16          K 424665
## 17          m      7
## 18          M  11330
## 19       <NA> 465934
correct_abbrv <- c("K", "k", "M", "m", "B", "b")
clean_data <- 
    filter(data, PROPDMGEXP %in% correct_abbrv) %>%
    filter(CROPDMGEXP %in% correct_abbrv) %>%
    mutate(PROPDMGEXP = str_to_upper(PROPDMGEXP)) %>%
    mutate(CROPDMGEXP = str_to_upper(CROPDMGEXP))
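
The behavior of this filter on unusual values can be checked on a few hypothetical magnitude codes. The codes vector below is made up for illustration; it includes a missing value and one undocumented symbol.

```r
correct_abbrv <- c("K", "k", "M", "m", "B", "b")

# Hypothetical magnitude codes, including NA and an undocumented symbol
codes <- c("K", NA, "m", "?")

# %in% never returns NA: a missing code simply counts as "not in the set"
keep <- codes %in% correct_abbrv
keep

# Only the recognized codes survive, and lowercase is normalized
toupper(codes[keep])
```

Since %in% returns FALSE rather than NA for a missing value, filter() drops those rows along with the undocumented codes.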

Once the magnitude characters were cleaned, they were converted to numerical multipliers, and the property and crop losses for each event were combined into a single dollar value so that losses are comparable across observations.

# Create loss multipliers
clean_data$PROPMULTI <- with(clean_data, if_else(PROPDMGEXP == "K", 1000, 
                                                 if_else(PROPDMGEXP == "M", 1000000,
                                                         1000000000)))
clean_data$CROPMULTI <- with(clean_data, if_else(CROPDMGEXP == "K", 1000,
                                                 if_else(CROPDMGEXP == "M", 1000000,
                                                         1000000000)))
# Combine all losses to a single column
clean_data <- 
    mutate(clean_data, ECONOMIC = (PROPDMG * PROPMULTI) + (CROPDMG * CROPMULTI))
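
The mapping from magnitude character to multiplier can also be written as a named-vector lookup. The sketch below is an alternative to the nested if_else() calls above, not the method used in this report; the names multiplier, dmg, and dmg_exp and the sample values are hypothetical.

```r
# Lookup table: magnitude character -> numerical multiplier
multiplier <- c(K = 1e3, M = 1e6, B = 1e9)

# Hypothetical damage values and their cleaned (uppercase) magnitude codes
dmg     <- c(25, 2.5, 1)
dmg_exp <- c("K", "M", "B")

# Index the lookup table by code, then scale each damage value
loss <- dmg * unname(multiplier[dmg_exp])
loss
```

So a reported value of 25 with code K becomes 25,000 dollars, and 2.5 with code M becomes 2,500,000 dollars. One difference from the nested if_else() version is that an unexpected code yields NA here instead of silently falling through to the billions multiplier.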

Results

Effects on Population Health

After the proxy health score was summed by event type across the cleaned data, the top ten events were compared. The plot below shows the top ten event types by health score, where higher scores indicate more damage to population health. Tornadoes have by far the most negative impact on population health; their score was more than twice that of the second most devastating event type, floods.

group_by(clean_data, EVTYPE) %>%
    summarize(Health = sum(HEALTH_SCORE)) %>%
    arrange(desc(Health)) %>%
    top_n(10, Health) %>%
    ggplot(aes(x = EVTYPE, y = Health)) + 
        geom_col() +
        theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
        labs(x = "", y = "Negative Health Impact",
             title = "Most Harmful Weather Events for Population Health, 1950-2011")
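
One detail of the plotting step is that ggplot2 places a character x variable in alphabetical order, so the sort from arrange() does not carry over to the bars. If bars ranked from highest to lowest are preferred, a possible approach is to reorder the event type by its score. The sketch below assumes a made-up three-row table and is not the code used to produce the figure in this report.

```r
library(ggplot2)

# Hypothetical summary table (values are made up)
toy <- data.frame(
  EVTYPE = c("A", "B", "C"),
  Health = c(1, 3, 2)
)

# reorder() sorts factor levels by the second argument;
# negating the score puts the largest value first
ordered_types <- reorder(toy$EVTYPE, -toy$Health)
levels(ordered_types)

p <- ggplot(toy, aes(x = reorder(EVTYPE, -Health), y = Health)) +
  geom_col() +
  labs(x = "", y = "Negative Health Impact")
```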

Economic Consequences

The weather events with the most severe economic consequences were more straightforward to identify. Since the economic losses from damage to property and crops were directly available, no proxy calculation was needed. The plot below shows the total economic damage caused by each of the top ten event types from 1950 to 2011. Among the top loss-causing events, floods clearly have the strongest economic impact in the United States.

group_by(clean_data, EVTYPE) %>%
    summarize(Loss = sum(ECONOMIC)) %>%
    top_n(10, Loss) %>%
    ggplot(aes(x = EVTYPE, y = Loss)) +
        geom_col() + 
        theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
        labs(x = "", y = "Economic Loss, USD", 
             title = "Top Economic Loss causing Weather Events, 1950-2011")