Synopsis

Extreme weather events can be catastrophic for both public health and municipal economies. The most severe events can result in fatalities, injuries, and property damage, so preventing such outcomes to the extent possible is a key concern.

This project analyses a dataset of storm events to answer 2 questions:

  1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

  2. Across the United States, which types of events have the greatest economic consequences?

Data Processing

Load required libraries

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.2.0     v purrr   0.3.2
## v tibble  2.1.1     v dplyr   0.8.1
## v tidyr   0.8.3     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
library(ggplot2)

Download the initial csv data and read into a dataset

# Assign URL to variable
URL<-"https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"

# If the file doesn't exist locally, download it
# (check the same file name that download.file() writes to)
if (!file.exists("storm_data.csv.bz2")) {
        download.file(URL, "storm_data.csv.bz2")
}

# Read file into dataframe
storm_data <- read_csv("storm_data.csv.bz2")
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   BGN_DATE = col_character(),
##   BGN_TIME = col_character(),
##   TIME_ZONE = col_character(),
##   COUNTYNAME = col_character(),
##   STATE = col_character(),
##   EVTYPE = col_character(),
##   BGN_AZI = col_logical(),
##   BGN_LOCATI = col_logical(),
##   END_DATE = col_logical(),
##   END_TIME = col_logical(),
##   COUNTYENDN = col_logical(),
##   END_AZI = col_logical(),
##   END_LOCATI = col_logical(),
##   PROPDMGEXP = col_character(),
##   CROPDMGEXP = col_logical(),
##   WFO = col_logical(),
##   STATEOFFIC = col_logical(),
##   ZONENAMES = col_logical(),
##   REMARKS = col_logical()
## )
## See spec(...) for full column specifications.
## Warning: 5255570 parsing failures.
##  row col           expected actual                 file
## 1671 WFO 1/0/T/F/TRUE/FALSE     NG 'storm_data.csv.bz2'
## 1673 WFO 1/0/T/F/TRUE/FALSE     NG 'storm_data.csv.bz2'
## 1674 WFO 1/0/T/F/TRUE/FALSE     NG 'storm_data.csv.bz2'
## 1675 WFO 1/0/T/F/TRUE/FALSE     NG 'storm_data.csv.bz2'
## 1678 WFO 1/0/T/F/TRUE/FALSE     NG 'storm_data.csv.bz2'
## .... ... .................. ...... ....................
## See problems(...) for more details.
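
The parsing failures reported above arise because read_csv() guesses each column's type from the first rows of the file, so columns that start out empty (such as WFO and CROPDMGEXP) are guessed as logical. One way to avoid this is to declare the types of the needed columns up front. The following is a minimal sketch, assuming readr's cols_only(); sample_path and its three rows are hypothetical values invented for illustration, not taken from the dataset.

```r
library(readr)

# Hypothetical sample standing in for the real file; the values are made up.
sample_path <- tempfile(fileext = ".csv")
writeLines(c(
  "BGN_DATE,EVTYPE,FATALITIES,INJURIES,PROPDMG,PROPDMGEXP,CROPDMG,CROPDMGEXP",
  "1/1/1990 0:00:00,TORNADO,0,2,25,K,0,",
  "2/1/1995 0:00:00,HAIL,0,0,0,,0,",
  "3/1/2000 0:00:00,FLOOD,1,0,1.5,M,10,K"
), sample_path)

# Declare only the columns the analysis uses, with explicit types,
# so nothing is left to type guessing.
storm_cols <- cols_only(
  BGN_DATE   = col_character(),
  EVTYPE     = col_character(),
  FATALITIES = col_double(),
  INJURIES   = col_double(),
  PROPDMG    = col_double(),
  PROPDMGEXP = col_character(),
  CROPDMG    = col_double(),
  CROPDMGEXP = col_character()
)

sample_data <- read_csv(sample_path, col_types = storm_cols)

# CROPDMGEXP stays character even though its first rows are empty.
str(sample_data$CROPDMGEXP)
```

The same col_types argument could be passed when reading storm_data.csv.bz2, which would also make the later select() step unnecessary.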

Not all columns are required to answer the 2 questions given in the synopsis, so we extract only the ones we need.

storm_data_subset <- storm_data %>%
        select(BGN_DATE, EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP, FATALITIES, INJURIES)

Convert the BGN_DATE column to date format and create a YEAR column

storm_data_subset$BGN_DATE <- as.Date(storm_data_subset$BGN_DATE, "%m/%d/%Y")
storm_data_subset$YEAR <- year(storm_data_subset$BGN_DATE)
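
As a small illustration of the conversion above: as.Date() matches the format string against the start of each value and ignores anything that follows, so a trailing time component does not get in the way. This sketch assumes BGN_DATE values look like the sample strings below, which are illustrative rather than copied from the dataset.

```r
library(lubridate)

# Illustrative strings; the trailing time component is an assumption about BGN_DATE.
raw_dates <- c("4/18/1950 0:00:00", "1/6/1996 0:00:00")

# Parse month/day/year from the start of each string; trailing text is ignored.
parsed <- as.Date(raw_dates, format = "%m/%d/%Y")

# Extract the calendar year from each parsed date.
years <- year(parsed)
print(years)
```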

Check frequency of extreme events for each year

ggplot(storm_data_subset) +
        aes(x = YEAR) +
        geom_histogram(bins = 30) +
        labs(
                title = "Number of Extreme Weather Events by Year\n",
                x = "\nYear",
                y = "Number of Events\n"
        ) +
        theme_minimal()
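
A table of counts per year can complement the histogram when choosing a cut-off, since it shows exact numbers rather than binned bars. A minimal sketch using dplyr's count(), with hypothetical YEAR values (toy_years and events_per_year are illustrative names):

```r
library(dplyr)

# Hypothetical YEAR values for illustration only.
toy_years <- tibble(YEAR = c(1989, 1990, 1990, 1996, 1996, 1996))

# One row per year with the number of events recorded in that year.
events_per_year <- toy_years %>%
  count(YEAR, name = "events") %>%
  arrange(YEAR)

print(events_per_year)
```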

Based on the histogram above, the data from 1990 onwards looks to be the most complete, so 1990 is used as the cut-off for our analysis.

storm_data_subset <- storm_data_subset %>%
        filter(YEAR >= 1990)

Remove rows with invalid event types (rows whose EVTYPE contains "Summary" rather than an actual event type)

storm_data_subset <- storm_data_subset %>%
        filter(!str_detect(EVTYPE, "Summary"))
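
Note that str_detect() is case-sensitive by default, so the filter above would keep a value such as "SUMMARY ..." if one existed. The sketch below shows a case-insensitive variant using stringr's regex() helper; the toy EVTYPE values are hypothetical, and whether upper-case variants actually occur in the data is an assumption.

```r
library(dplyr)
library(stringr)

# Hypothetical event types, including two differently cased summary rows.
toy_events <- tibble(
  EVTYPE = c("TORNADO", "Summary of May 1", "SUMMARY OF JUNE 3", "HAIL")
)

# Drop every row whose EVTYPE contains "summary" in any letter case.
kept <- toy_events %>%
  filter(!str_detect(EVTYPE, regex("summary", ignore_case = TRUE)))

print(kept)
```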

Convert the PROPDMGEXP column from exponent codes (such as K, M and B) to numeric multipliers. This ensures the correct magnitudes are used when working out property damage.

storm_data_subset <- storm_data_subset %>%
        mutate(PROPDMGFACTOR = recode(PROPDMGEXP, 
                '-' = 0, '?' = 0, '+' = 0, '0' = 1, '1' = 10, '2' = 100, '3' = 1000, '4' = 10000, '5' = 100000, '6' = 1000000, 
                '7' = 10000000, '8' = 100000000, 'B' = 1000000000, 'h' = 100, 'H' = 100, 'k' = 1000, 'K' = 1000, 'm' = 1000000,
                'M' = 1000000, ' ' = 0)
        )
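
The same idea can be expressed as a lookup in a named vector, which also makes the handling of unknown and missing codes explicit. This is a sketch under stated assumptions, not the method used above: exp_lookup and exp_to_multiplier are hypothetical names, the mapping is reduced to the letter codes only, and sending unknown or missing codes to 0 is an assumption that mirrors the 0 assigned to '-', '?' and '+' above.

```r
# Reduced, letter-only mapping from exponent code to multiplier (assumption).
exp_lookup <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)

# Hypothetical helper: convert exponent codes to numeric multipliers.
exp_to_multiplier <- function(exp_code) {
  # Upper-case the codes so "k" and "K" share one entry, then look them up.
  multiplier <- unname(exp_lookup[toupper(exp_code)])
  # Codes missing from the lookup (and NA inputs) come back as NA; map them to 0.
  multiplier[is.na(multiplier)] <- 0
  multiplier
}

print(exp_to_multiplier(c("K", "m", NA, "?")))
```

Making the fallback explicit matters because a missing multiplier would otherwise propagate as NA into any product computed from it.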

Create 2 new columns: one for the combined number of deaths and injuries, and one for total property damage in dollars.

storm_data_subset <- storm_data_subset %>%
        mutate(DEADORINJURED = FATALITIES + INJURIES)

storm_data_subset <- storm_data_subset %>%
        mutate(TOTALPROPERTYDAMAGE = PROPDMG * PROPDMGFACTOR)
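
One point worth checking at this step: if a multiplier is missing, PROPDMG * PROPDMGFACTOR is NA, and sum() then returns NA for any group containing such a row unless na.rm = TRUE is set. The following minimal sketch uses hypothetical toy values to show the difference; toy, damage_by_type, total_strict and total_na_rm are illustrative names.

```r
library(dplyr)

# Hypothetical rows; the second FLOOD row has no multiplier.
toy <- tibble(
  EVTYPE        = c("FLOOD", "FLOOD", "HAIL"),
  PROPDMG       = c(2, 5, 3),
  PROPDMGFACTOR = c(1000, NA, 1000)
)

damage_by_type <- toy %>%
  mutate(TOTALPROPERTYDAMAGE = PROPDMG * PROPDMGFACTOR) %>%
  group_by(EVTYPE) %>%
  summarise(
    total_strict = sum(TOTALPROPERTYDAMAGE),               # NA if any row is NA
    total_na_rm  = sum(TOTALPROPERTYDAMAGE, na.rm = TRUE)  # skips NA rows
  ) %>%
  arrange(desc(total_na_rm))

print(damage_by_type)
```

Because arrange(desc()) places NA values last, groups with an NA total would silently fall out of a top-10 ranking, so it may be worth confirming how many rows have a missing multiplier before aggregating.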

Results

  1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
storm_data_subset %>% 
        select(EVTYPE, DEADORINJURED) %>%
        group_by(EVTYPE) %>%
        summarise(count = sum(DEADORINJURED)) %>%
        arrange(desc(count)) %>%
        slice(1:10) %>%
        ggplot() +
                aes(x = reorder(EVTYPE, -count), y = count) +
                geom_bar(stat = "identity") +
                labs(
                        title = "Top 10 Event Types by Number of Deaths and Injuries\n",
                        x = "\nEvent Type",
                        y = "Number of Deaths & Injuries\n"
                ) +
                theme_minimal() +
                theme(axis.text.x = element_text(angle = 90, hjust = 1))

We can see from the above bar plot that tornadoes are, by some distance, the leading cause of death and injury among all the extreme weather event types.

  2. Across the United States, which types of events have the greatest economic consequences?
storm_data_subset %>% 
        select(EVTYPE, TOTALPROPERTYDAMAGE) %>%
        group_by(EVTYPE) %>%
        summarise(count = sum(TOTALPROPERTYDAMAGE)) %>%
        arrange(desc(count)) %>%
        slice(1:10) %>%
        ggplot() +
                aes(x = reorder(EVTYPE, -count), y = count) +
                geom_bar(stat = "identity") +
                labs(
                        title = "Top 10 Event Types by Property Damage Cost\n",
                        x = "\nEvent Type",
                        y = "Cost in Dollars\n"
                ) +
                theme_minimal() +
                theme(axis.text.x = element_text(angle = 90, hjust = 1))

The above graph indicates that tornadoes/wind/hail have the greatest economic impact on the United States in terms of property damage.