Assignment

The basic goal of this assignment is to explore the NOAA Storm Database and answer some basic questions about severe weather events. You must use the database to answer the questions below and show the code for your entire analysis. Your analysis can consist of tables, figures, or other summaries. You may use any R package you want to support your analysis.

Questions

Your data analysis must address the following questions:

  1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

  2. Across the United States, which types of events have the greatest economic consequences?

Introduction

Title: Exploratory Analysis of the NOAA Storm Database

Synopsis: This report explores the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, which tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage. Severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern. The analysis asks which event types are most harmful with respect to population health and which have the greatest economic consequences. Using events recorded from January 1996 to November 2011, tornadoes were the most harmful with respect to population health (measured by total fatalities), and floods had the greatest economic consequences (measured by combined property and crop damage).

Data Processing

In this section, we show how the data were loaded into R and processed for analysis.

The data come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. The code below downloads the file from the course web site if a local copy is not already present.

The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.4     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   2.0.1     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(stringr)

if (!file.exists("data")) {
        dir.create("data")
}

# Prepare the data. If the data file already exists in ./data,
# it is read from there.
# Otherwise, it is downloaded first and then read.

if (file.exists("./data/StormData.csv.bz2")) {
        cat("Retrieving data file\n")
        storm <- read.csv("./data/StormData.csv.bz2")
} else {
        cat("No valid data file found in the local directory. Downloading from the original file on the internet. \n")
        fileUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
        download.file(fileUrl, destfile = "./data/StormData.csv.bz2")
        storm <- read.csv("./data/StormData.csv.bz2")
} 
## Retrieving data file
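
The loading step relies on `read.csv()` reading a bzip2-compressed file directly, without a separate decompression step. A minimal sketch of that behaviour, using a made-up two-row data frame and a temporary file rather than the real storm data:

```r
# Toy data frame standing in for the storm data (values are made up).
toy <- data.frame(EVTYPE = c("TORNADO", "FLOOD"),
                  FATALITIES = c(3, 1),
                  stringsAsFactors = FALSE)

# Write it as a bzip2-compressed CSV in a temporary location.
path <- tempfile(fileext = ".csv.bz2")
con <- bzfile(path, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() detects the compression and decompresses while reading.
toy_back <- read.csv(path)
print(toy_back)
```

The same call works on an uncompressed CSV, so the reading code does not need to know how the file was stored.
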
## Data Processing
df1 <- storm

# Before Jan-1996 only a few event types (tornado, thunderstorm wind, hail)
# were recorded, so remove data before Jan-1996
df1$BGN_DATE <- str_replace(df1$BGN_DATE, pattern = " 0:00:00", replacement = "")
df1$END_DATE <- str_replace(df1$END_DATE, pattern = " 0:00:00", replacement = "")
df1$BGN_DATE <- mdy(df1$BGN_DATE)
df1$END_DATE <- mdy(df1$END_DATE)
df1 <- filter(df1, BGN_DATE >= as.Date("1996-01-01"))
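
The date handling above can be tried on a few made-up strings. This sketch uses base R's `as.Date()` in place of lubridate's `mdy()` so that it runs without extra packages; the sample dates are illustrative and not taken from the database.

```r
# Made-up date strings in the "m/d/Y 0:00:00" layout used by BGN_DATE.
raw <- c("4/18/1950 0:00:00", "1/6/1996 0:00:00", "11/28/2011 0:00:00")

# Drop the constant time suffix, then parse month/day/year.
dates <- as.Date(sub(" 0:00:00", "", raw, fixed = TRUE), format = "%m/%d/%Y")

# Keep only events that began on or after 1 January 1996.
kept <- dates[dates >= as.Date("1996-01-01")]
print(kept)
```

Comparing against a `Date` value rather than a string keeps the cutoff a true date comparison, not an alphabetical one.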

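# Preprocess "PROPDMGEXP" and "CROPDMGEXP" columns: convert each exponent code
# to a numeric multiplier (h = 100, k = 1,000, m = 1,000,000, b = 1,000,000,000).
# Digit codes 0-8 are treated as a multiplier of 10, and "-", "?", "+", and
# blank codes as a multiplier of 0.
# Calculate "PROPDMG_TOTAL", "CROPDMG_TOTAL", and "DMG_TOTAL"
numbers <- as.character(0:8)
others <- c("-", "?", "+", "")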
df1 <-  select(df1, STATE__:REFNUM) %>%
        mutate(PROPDMGEXP = tolower(PROPDMGEXP)) %>% 
        mutate(CROPDMGEXP = tolower(CROPDMGEXP)) %>%
        mutate(PROPDMGEXP = ifelse(PROPDMGEXP %in% numbers, "10", PROPDMGEXP)) %>%
        mutate(CROPDMGEXP = ifelse(CROPDMGEXP %in% numbers, "10", CROPDMGEXP)) %>%
        mutate(PROPDMGEXP = ifelse(PROPDMGEXP %in% others, "0", PROPDMGEXP)) %>%
        mutate(CROPDMGEXP = ifelse(CROPDMGEXP %in% others, "0", CROPDMGEXP)) %>%
        mutate(PROPDMGEXP = str_replace_all(PROPDMGEXP, 
                                            c("h" = "100",
                                              "k" = "1000",
                                              "m" = "1000000",
                                              "b" = "1000000000"))) %>%
        mutate(CROPDMGEXP = str_replace_all(CROPDMGEXP, 
                                            c("h" = "100",
                                              "k" = "1000",
                                              "m" = "1000000",
                                              "b" = "1000000000"))) %>%
        mutate(PROPDMG_TOTAL = PROPDMG * as.numeric(PROPDMGEXP)) %>%
        mutate(CROPDMG_TOTAL = CROPDMG * as.numeric(CROPDMGEXP)) %>%
        mutate(DMG_TOTAL = PROPDMG_TOTAL + CROPDMG_TOTAL)
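
The exponent conversion above can also be written as a lookup table. The following is a minimal base R sketch under the same mapping; the helper name `exp_to_multiplier` and the sample values are hypothetical, and unlike the pipeline above it sends any unrecognised code to 0.

```r
# Multipliers for the letter codes, matching the mapping used above.
multiplier <- c(h = 1e2, k = 1e3, m = 1e6, b = 1e9)

# Hypothetical helper: convert exponent codes to numeric multipliers.
exp_to_multiplier <- function(code) {
  code <- tolower(code)
  out <- unname(multiplier[code])          # NA for codes not in the table
  out[code %in% as.character(0:8)] <- 10   # digit codes count as a factor of 10
  out[is.na(out)] <- 0                     # "", "-", "?", "+" and unknown codes
  out
}

# Made-up damage figures and exponent codes.
dmg  <- c(2.5, 1.2, 10, 7, 3)
code <- c("K", "B", "", "5", "m")
total <- dmg * exp_to_multiplier(code)
print(total)
```

For instance, a damage value of 2.5 with code "K" becomes 2.5 * 1,000 = 2,500 dollars, and 1.2 with code "B" becomes 1.2 billion dollars.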

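# Select relevant columns and keep only high-impact events: more than 100
# fatalities, more than 1,000 injuries, or more than USD 500,000 in total damage.
# Totals computed later therefore cover only these events.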
columns <- c("REFNUM", "EVTYPE", "FATALITIES", "INJURIES", 
             "PROPDMG_TOTAL", "CROPDMG_TOTAL", "DMG_TOTAL")
df2 <- select(df1, all_of(columns)) %>%
        filter((FATALITIES > 100 | INJURIES > 1000 | DMG_TOTAL > 500000)) 
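
Because this filter is applied to individual events before any aggregation, an event type whose harm is spread over many small events can drop out of the later totals. A small sketch with made-up records (not drawn from the database) shows the effect:

```r
# Made-up event records: one catastrophic event and several smaller ones.
events <- data.frame(
  EVTYPE     = c("tornado", "heat", "heat", "heat", "flood"),
  FATALITIES = c(150, 40, 35, 30, 2),
  INJURIES   = c(1100, 0, 0, 0, 5),
  DMG_TOTAL  = c(2e9, 0, 0, 1e4, 8e5),
  stringsAsFactors = FALSE
)

# Same thresholds as the filter above.
high_impact <- subset(events,
                      FATALITIES > 100 | INJURIES > 1000 | DMG_TOTAL > 500000)

# Fatalities per event type, before and after filtering.
all_totals  <- tapply(events$FATALITIES, events$EVTYPE, sum)
kept_totals <- tapply(high_impact$FATALITIES, high_impact$EVTYPE, sum)
print(all_totals)
print(kept_totals)
```

In this toy data the three "heat" events total 105 fatalities, yet none of them passes the per-event thresholds, so "heat" does not appear in the filtered totals.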

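# Preprocess "EVTYPE" column: lower-case the labels, strip parenthesised text,
# "g<digits>" tokens, and other digits, then merge spelling variants of the
# same event type into one label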
df3 <- mutate(df2, EVTYPE = tolower(EVTYPE)) %>%
        mutate(EVTYPE = str_replace(EVTYPE, 
                                    pattern = "[(].+[)]", 
                                    replacement = "")) %>%
        mutate(EVTYPE = str_replace(EVTYPE, 
                                    pattern = "g[0-9]+", 
                                    replacement = "")) %>%
        mutate(EVTYPE = str_replace(EVTYPE, 
                                    pattern = "[0-9]+", 
                                    replacement = "")) %>%
        mutate(EVTYPE = str_replace(EVTYPE, 
                                    pattern = "  ", 
                                    replacement = " ")) %>%
        mutate(EVTYPE = str_replace(EVTYPE, 
                                    pattern = "agricultural freeze", 
                                    replacement = "freeze")) %>%
        mutate(EVTYPE = str_replace(EVTYPE, 
                                    pattern = "coastal erosion", 
                                    replacement = "coastal flood")) %>%
        mutate(EVTYPE = str_replace(EVTYPE, 
                                    pattern = "coastal flooding", 
                                    replacement = "coastal flood")) %>%
        mutate(EVTYPE = str_replace(EVTYPE, 
                                    pattern = "coastal flood/erosion", 
                                    replacement = "coastal flood")) %>%
        mutate(EVTYPE = str_replace(EVTYPE, 
                                    pattern = "coastal flooding/erosion", 
                                    replacement = "coastal flood")) %>%
        mutate(EVTYPE = str_replace(EVTYPE, 
                                    pattern = "cold/wind chill", 
                                    replacement = "cold")) %>%
        mutate(EVTYPE = str_replace(EVTYPE, 
                                    pattern = "damaging freeze", 
                                    replacement = "freeze")) %>%
        mutate(EVTYPE = str_replace(EVTYPE, 
                                    pattern = "erosion/cstl flood", 
                                    replacement = "coastal flood")) %>%
        mutate(EVTYPE = str_replace(EVTYPE, 
                                    pattern = "extreme windchill", 
                                    replacement = "extreme cold")) %>%
        mutate(EVTYPE = str_replace(EVTYPE, 
                                    pattern = "flooding", 
                                    replacement = "flood")) %>%
        mutate(EVTYPE = str_replace(EVTYPE, 
                                    pattern = "frost/freeze", 
                                    replacement = "freeze")) %>%
        mutate(EVTYPE = str_replace(EVTYPE, 
                                    pattern = "hard freeze", 
                                    replacement = "freeze")) %>%
        mutate(EVTYPE = str_replace(EVTYPE, 
                                    pattern = "heavy rain", 
                                    replacement = "rain")) %>%
        mutate(EVTYPE = str_replace(EVTYPE, 
                                    pattern = "heavy snow", 
                                    replacement = "snow")) %>%
        mutate(EVTYPE = str_replace(EVTYPE, 
                                    pattern = "hurricane/typhoon", 
                                    replacement = "hurricane")) %>%
        mutate(EVTYPE = str_replace(EVTYPE, 
                                    pattern = "heavy surf/high surf", 
                                    replacement = "high surf")) %>%
        mutate(EVTYPE = str_replace(EVTYPE, 
                                    pattern = "landslides", 
                                    replacement = "landslide")) %>%
        mutate(EVTYPE = str_replace(EVTYPE, 
                                    pattern = "mud slide", 
                                    replacement = "mudslide")) %>%
        mutate(EVTYPE = str_replace(EVTYPE, 
                                    pattern = "mudslides", 
                                    replacement = "mudslide")) %>%
        mutate(EVTYPE = str_replace(EVTYPE, 
                                    pattern = "river flood", 
                                    replacement = "flood")) %>%
        mutate(EVTYPE = str_replace(EVTYPE, 
                                    pattern = "river flooding", 
                                    replacement = "flood")) %>%
        mutate(EVTYPE = str_replace(EVTYPE, 
                                    pattern = "lakeshore flood", 
                                    replacement = "flood")) %>%
        mutate(EVTYPE = str_replace(EVTYPE, 
                                    pattern = "small hail", 
                                    replacement = "hail")) %>%
        mutate(EVTYPE = str_replace(EVTYPE, 
                                    pattern = "strong winds", 
                                    replacement = "strong wind" )) %>%        
        mutate(EVTYPE = str_replace(EVTYPE, 
                                    pattern = "storm surge/tide", 
                                    replacement = "storm surge" )) %>%
        mutate(EVTYPE = str_replace(EVTYPE, 
                                    pattern = "typhoon", 
                                    replacement = "hurricane")) %>%
        mutate(EVTYPE = str_replace(EVTYPE, 
                                    pattern = "unseasonable", 
                                    replacement = "" )) %>%
        mutate(EVTYPE = str_replace(EVTYPE, 
                                    pattern = "unseasonably", 
                                    replacement = "" )) %>%
        mutate(EVTYPE = str_replace(EVTYPE, 
                                    pattern = "unseasonal", 
                                    replacement = "" )) %>%
        mutate(EVTYPE = str_replace(EVTYPE, 
                                    pattern = "warm weather", 
                                    replacement = "warm" )) %>%
        mutate(EVTYPE = str_replace(EVTYPE, 
                                    pattern = "wild/forest fire", 
                                    replacement = "wildfire" )) %>%
        mutate(EVTYPE = str_replace(EVTYPE, 
                                    pattern = "wind and wave", 
                                    replacement = "wind" )) %>%
        mutate(EVTYPE = str_replace(EVTYPE, 
                                    pattern = "winter weather/mix", 
                                    replacement = "winter weather" )) %>%
        mutate(EVTYPE = str_replace(EVTYPE, 
                                    pattern = "tstm", 
                                    replacement = "thunderstorm" )) %>%
        mutate(EVTYPE = str_trim(EVTYPE, side = "both"))
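
The chain of substitutions above can be pictured as an ordered rule table applied one rule at a time. The following base R sketch uses only four of the rules and made-up labels; the helper name `normalise_evtype` is hypothetical, and it matches patterns as fixed strings with `sub()` instead of the regular-expression matching of `str_replace()`.

```r
# A few of the substitutions from the chain above, as an ordered rule table.
rules <- c("tstm"              = "thunderstorm",
           "flooding"          = "flood",
           "river flood"       = "flood",
           "hurricane/typhoon" = "hurricane")

# Hypothetical helper: lower-case, apply each rule in order, tidy whitespace.
normalise_evtype <- function(x, rules) {
  x <- tolower(x)
  for (i in seq_along(rules)) {
    # sub() replaces only the first match, as str_replace() does.
    x <- sub(names(rules)[i], rules[[i]], x, fixed = TRUE)
  }
  trimws(gsub(" +", " ", x))
}

raw <- c("TSTM WIND", "River Flooding", "HURRICANE/TYPHOON", " Flash  Flood ")
clean <- normalise_evtype(raw, rules)
print(clean)
```

Order matters here: "river flooding" first becomes "river flood" through the "flooding" rule and is then reduced to "flood", so a later rule for the original spelling would never match.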

Results

In this section, we show two plots to answer the following questions:

  1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

  2. Across the United States, which types of events have the greatest economic consequences?

## Results Section

# Aggregate data by "EVTYPE"
df4 <- group_by(df3, EVTYPE) %>%
        summarise(n.events = n(),
                  total.fatalities = sum(FATALITIES),
                  total.injuries = sum(INJURIES),
                  total.propdmg.bn = sum(PROPDMG_TOTAL)/1000000000,
                  total.cropdmg.bn = sum(CROPDMG_TOTAL)/1000000000,
                  total.dmg.bn = sum(DMG_TOTAL)/1000000000)
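
The aggregate-then-rank pattern used here and in the plots below can be sketched in base R on a handful of made-up records. The column names mirror those above, but the values are illustrative only.

```r
# Made-up event records.
events <- data.frame(
  EVTYPE     = c("tornado", "flood", "tornado", "hail", "flood"),
  FATALITIES = c(5, 2, 7, 0, 1),
  DMG_TOTAL  = c(2e9, 5e9, 1.5e9, 3e9, 4e9),
  stringsAsFactors = FALSE
)

# Sum fatalities and damage per event type; damage is rescaled to USD billion.
totals <- aggregate(cbind(FATALITIES, DMG_TOTAL) ~ EVTYPE,
                    data = events, FUN = sum)
totals$total.dmg.bn <- totals$DMG_TOTAL / 1e9

# Rank by total damage, largest first, and keep the top two types.
top2 <- head(totals[order(-totals$total.dmg.bn), ], 2)
print(top2)
```

Dividing by 1e9 once, after summing, keeps the per-event values in dollars and only changes the unit of the reported totals.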

# Plot total fatalities to show which events are most harmful with respect to population health
df5 <- select(df4, EVTYPE, total.fatalities) %>%
        top_n(10, total.fatalities) %>%
        arrange(desc(total.fatalities))
head(df5)
## # A tibble: 6 × 2
##   EVTYPE      total.fatalities
##   <chr>                  <dbl>
## 1 tornado                 1238
## 2 flash flood              274
## 3 flood                    228
## 4 hurricane                 99
## 5 high wind                 69
## 6 wildfire                  56
g1 <- ggplot(df5, aes(x = reorder(EVTYPE, total.fatalities), y = total.fatalities))
g1 <- g1 + geom_bar(stat = "identity")
g1 <- g1 + coord_flip()
g1 <- g1 + labs(title = "Total fatalities by top 10 types of events in the United States")
g1 <- g1 + labs(x = "Types of events")
g1 <- g1 + labs(y = "Total fatalities from Jan-1996 to Nov-2011")
plot(g1)

# Plot total damages to show which events have the greatest economic consequences
df6 <- select(df4, EVTYPE, contains("dmg")) %>%
        top_n(10, total.dmg.bn) %>%
        arrange(desc(total.dmg.bn))
head(df6)
## # A tibble: 6 × 4
##   EVTYPE      total.propdmg.bn total.cropdmg.bn total.dmg.bn
##   <chr>                  <dbl>            <dbl>        <dbl>
## 1 flood                  144.          4.95            148. 
## 2 hurricane               81.7         5.35             87.1
## 3 storm surge             47.8         0.000855         47.8
## 4 tornado                 23.7         0.248            24.0
## 5 hail                    14.2         2.13             16.3
## 6 flash flood             14.2         1.27             15.5
g2 <- ggplot(df6, aes(x = reorder(EVTYPE, total.dmg.bn), y = total.dmg.bn))
g2 <- g2 + geom_bar(stat = "identity")
g2 <- g2 + coord_flip()
g2 <- g2 + labs(title = "Total damages by top 10 types of events in the United States")
g2 <- g2 + labs(x = "Types of events")
g2 <- g2 + labs(y = "Total damages from Jan-1996 to Nov-2011 (USD billion)")
plot(g2)

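As shown in the plots, among events recorded from January 1996 to November 2011, tornadoes were the most harmful with respect to population health, with 1,238 total fatalities, and floods had the greatest economic consequences, with about USD 148 billion in combined property and crop damage.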