Assessment of the Most Harmful Atmospheric Events for Humans and the US Economy According to the NOAA Storm Database (1950 - 2011)

Synopsis

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

This project explores the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

The basic goal of this assignment is to explore the NOAA Storm Database and answer some basic questions about severe weather events.

Key questions to address

Across the United States, which types of events are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?

Loading the Libraries and the Data

library(tidyverse)
library(lubridate)
library(pdftools)
library(tm)
library(stringdist)
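# Alternative: read the compressed file directly (readr decompresses .bz2 automatically)
# NOAA <- read_csv("./repdata_data_StormData.csv.bz2", col_types = cols(CROPDMGEXP = col_character()))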
NOAA <- read_csv("repdata_data_StormData.csv", 
                 col_types = cols(BGN_TIME = col_time(format = "%H%M"), 
                                  CROPDMGEXP = col_character()))

Cleaning the Data

We want to determine the most costly disasters, so we have to examine the property damage (PROPDMG) and crop damage (CROPDMG) values. These are plain numbers that must be multiplied by a magnitude given in another field (PROPDMGEXP and CROPDMGEXP, respectively). For example, a PROPDMG of 25 with a PROPDMGEXP of “K” stands for 25,000 USD. Unfortunately, the exponent values are messy and need to be converted properly before we can calculate the actual dollar amounts.

raw <- as_tibble(NOAA)  # as.tibble() is deprecated in favour of as_tibble()
head(raw)

## # A tibble: 6 x 37
##   STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE EVTYPE BGN_RANGE
##     <dbl> <chr>    <time>   <chr>      <dbl> <chr>      <chr> <chr>      <dbl>
## 1       1 4/18/19~ 01:30    CST           97 MOBILE     AL    TORNA~         0
## 2       1 4/18/19~ 01:45    CST            3 BALDWIN    AL    TORNA~         0
## 3       1 2/20/19~ 16:00    CST           57 FAYETTE    AL    TORNA~         0
## 4       1 6/8/195~ 09:00    CST           89 MADISON    AL    TORNA~         0
## 5       1 11/15/1~ 15:00    CST           43 CULLMAN    AL    TORNA~         0
## 6       1 11/15/1~ 20:00    CST           77 LAUDERDALE AL    TORNA~         0
## # ... with 28 more variables: BGN_AZI <lgl>, BGN_LOCATI <lgl>, END_DATE <lgl>,
## #   END_TIME <lgl>, COUNTY_END <dbl>, COUNTYENDN <lgl>, END_RANGE <dbl>,
## #   END_AZI <lgl>, END_LOCATI <lgl>, LENGTH <dbl>, WIDTH <dbl>, F <dbl>,
## #   MAG <dbl>, FATALITIES <dbl>, INJURIES <dbl>, PROPDMG <dbl>,
## #   PROPDMGEXP <chr>, CROPDMG <dbl>, CROPDMGEXP <chr>, WFO <lgl>,
## #   STATEOFFIC <lgl>, ZONENAMES <lgl>, LATITUDE <dbl>, LONGITUDE <dbl>,
## #   LATITUDE_E <dbl>, LONGITUDE_ <dbl>, REMARKS <lgl>, REFNUM <dbl>

raw$PROPDMGEXP <- as.factor(raw$PROPDMGEXP)
levels(raw$PROPDMGEXP)

##  [1] "-" "?" "+" "0" "1" "2" "3" "4" "5" "6" "7" "8" "B" "h" "H" "K" "m" "M"

raw$CROPDMGEXP <- as.factor(raw$CROPDMGEXP)
levels(raw$CROPDMGEXP)

## [1] "?" "0" "2" "B" "k" "K" "m" "M"

raw <- raw %>% mutate(PROPDMG = ifelse(PROPDMGEXP == "K", PROPDMG*1000, PROPDMG),
                      PROPDMG = ifelse(PROPDMGEXP == "M", PROPDMG*1000000, PROPDMG),
                      PROPDMG = ifelse(PROPDMGEXP == "B", PROPDMG*1000000000, PROPDMG),
                      PROPDMG = ifelse(PROPDMGEXP == "H", PROPDMG*100, PROPDMG),
                      PROPDMG = ifelse(PROPDMGEXP == "h", PROPDMG*100, PROPDMG),
                      PROPDMG = ifelse(PROPDMGEXP == "m", PROPDMG*1000000, PROPDMG),
                      PROPDMG = ifelse(PROPDMGEXP == "8", PROPDMG*100000000, PROPDMG),
                      PROPDMG = ifelse(PROPDMGEXP == "7", PROPDMG*10000000, PROPDMG),
                      PROPDMG = ifelse(PROPDMGEXP == "6", PROPDMG*1000000, PROPDMG),
                      PROPDMG = ifelse(PROPDMGEXP == "5", PROPDMG*100000, PROPDMG),
                      PROPDMG = ifelse(PROPDMGEXP == "4", PROPDMG*10000, PROPDMG),
                      PROPDMG = ifelse(PROPDMGEXP == "3", PROPDMG*1000, PROPDMG),
                      PROPDMG = ifelse(PROPDMGEXP == "2", PROPDMG*100, PROPDMG),
                      PROPDMG = ifelse(PROPDMGEXP == "1", PROPDMG*10, PROPDMG),
                      PROPDMG = ifelse(PROPDMGEXP == "0", PROPDMG*1, PROPDMG)
                      ) %>%
               mutate(CROPDMG = ifelse(CROPDMGEXP == "B", CROPDMG*1000000000, CROPDMG),
                      CROPDMG = ifelse(CROPDMGEXP == "M", CROPDMG*1000000, CROPDMG),
                      CROPDMG = ifelse(CROPDMGEXP == "m", CROPDMG*1000000, CROPDMG),
                      CROPDMG = ifelse(CROPDMGEXP == "K", CROPDMG*1000, CROPDMG),
                      CROPDMG = ifelse(CROPDMGEXP == "k", CROPDMG*1000, CROPDMG),
                      CROPDMG = ifelse(CROPDMGEXP == "2", CROPDMG*100, CROPDMG),
                      CROPDMG = ifelse(CROPDMGEXP == "0", CROPDMG*1, CROPDMG),
                      CROPDMG = ifelse(CROPDMGEXP == "?", CROPDMG*0, CROPDMG)
                      )
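
The chain of ifelse() calls above can also be written as a lookup table. The following is a minimal sketch on toy data, not the code used for this analysis. The helper name exp_multiplier is hypothetical, and treating unrecognized or missing exponent codes as a multiplier of 1 is an assumption that differs slightly from the chain above (which, for example, sets crop damage to 0 for "?").

```r
# Hypothetical helper: map an exponent code to a numeric multiplier.
exp_multiplier <- function(code) {
  code <- toupper(as.character(code))          # treat "k"/"K", "m"/"M", "h"/"H" alike
  lookup <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)
  mult <- unname(lookup[code])                 # NA for codes not in the lookup
  # Single digits are read as powers of ten, as in the mutate() chain above.
  is_digit <- !is.na(code) & grepl("^[0-9]$", code)
  mult[is_digit] <- 10^as.numeric(code[is_digit])
  # Assumption: unknown or missing codes leave the value unchanged.
  mult[is.na(mult)] <- 1
  mult
}

# Toy data, made up for illustration
toy <- data.frame(PROPDMG    = c(25, 2.5, 1, 3, 7),
                  PROPDMGEXP = c("K", "m", "B", "2", NA))
toy$PROPDMG_USD <- toy$PROPDMG * exp_multiplier(toy$PROPDMGEXP)
toy
```

Because the multiplier is computed once per row, a missing exponent cannot turn the damage value into NA, which can happen when ifelse() is given an NA condition.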

NOAA Events

Next, a complete listing of the storm event types is needed. The Storm Data Documentation gives the complete list on page 6, and the event types are read in from that file.

The required libraries were loaded above. The pdftools package reads in the .pdf file, ‘stripWhitespace’ from the tm package collapses runs of whitespace in the text, and the stringr functions then mine the text with regular expressions.

events <- pdf_text("https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf")
events <- events[6]
events <- stripWhitespace(events)
events <- str_extract(events, pattern = "Astronomical.*Winter Weather")
events <- str_split(events, pattern = "\\s[A-Z]\\s")
events <- unlist(events)
events <- tolower(events)

The object returned by ‘pdf_text’ is a character vector, with each page of the original pdf being one element. The vector therefore has 97 elements.

Page 6 is extracted from this vector and its whitespace is collapsed with ‘stripWhitespace’, which leaves a single string that can be easily mined with regular expressions.

Now, using regular expressions, the complete listing of event types will be created for use later. The final list is set to lower case letters to allow for easier matching with the storm data.
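
To illustrate the extraction and splitting steps, here is a small sketch on a made-up string laid out like the event table, where each event name is followed by a one-letter designator. The sample text and the name toy_events are assumptions for illustration, not the actual content of page 6.

```r
library(stringr)

# Made-up sample resembling the event table after whitespace has been collapsed
page <- "Event Name Designator Astronomical Low Tide Z Avalanche Z Blizzard Z Winter Weather Z Legend"

# Keep only the span from the first event name to the last one
tbl <- str_extract(page, "Astronomical.*Winter Weather")

# Split on the single capital letter designators that separate the names,
# then lower-case the result for matching against the storm data
toy_events <- tolower(unlist(str_split(tbl, "\\s[A-Z]\\s")))
toy_events
```

The pattern "\\s[A-Z]\\s" only matches a capital letter that stands alone between two whitespace characters, so capitals at the start of words such as "Low" or "Tide" do not trigger a split.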

data <- raw %>%
        mutate(EVTYPE = str_to_lower(EVTYPE),
                       EVTYPE = str_trim(EVTYPE, side = "left"),
                       EVTYPE = str_replace_all(EVTYPE, pattern = "tstm", replacement = "thunderstorm"),
                       EVTYPE = str_replace_all(EVTYPE, pattern = "\\s\\(g\\d*\\)", replacement = ""),
                       EVTYPE = str_replace_all(EVTYPE, pattern = "^heat", replacement = "excessive heat"),
                       EVTYPE = str_replace_all(EVTYPE, pattern = "wild/forest fire", replacement = "wildfire"),
                       EVTYPE = str_replace_all(EVTYPE, pattern = "strong wind", replacement = "high wind"),
                       EVTYPE = str_replace_all(EVTYPE, pattern = "winter weather", replacement = "winter storm"),
                       EVTYPE = str_replace_all(EVTYPE, pattern = "^hurricane$", replacement = "hurricane (typhoon)"))

The “EVTYPE” variable contains numerous different spellings and classifications of the event types. Since there are only 48 official event types, the data in this variable must be cleaned to allow for proper classification. Several string replacements are made to frequently occurring entries that have a significant impact on the final analysis. The entries are also converted to lower case and stripped of leading whitespace so that they match the events list.

data <- data %>%
        mutate(Eventcode = amatch(EVTYPE, events, maxDist = 5)) %>%
        mutate(Event = ifelse(is.na(Eventcode), EVTYPE, events[Eventcode]))

The “EVTYPE” variable is matched to the events list using the ‘amatch’ function, and the index of the closest entry (within a string distance of 5) is stored in a new variable, ‘Eventcode’. Finally, if ‘Eventcode’ is NA, the original ‘EVTYPE’ entry is kept; otherwise, the matching entry in the events list is used. The result is stored in the ‘Event’ variable.
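
The matching step can be tried on a few made-up labels. This is a minimal sketch that assumes the stringdist package and its default distance method; the vectors official and observed are hypothetical and much shorter than the real lists.

```r
library(stringdist)

# Hypothetical, shortened list of official event names
official <- c("flood", "hail", "thunderstorm wind", "tornado")

# Made-up raw labels in the style of the cleaned EVTYPE column
observed <- c("thunderstorm winds", "tornadoes", "volcanic ash plume")

# Index of the closest official name within distance 5, NA when nothing is close enough
code <- amatch(observed, official, maxDist = 5)

# Fall back to the raw label when no official name matches
matched <- ifelse(is.na(code), observed, official[code])
matched
```

A maximum distance of 5 is fairly permissive: two unrelated four-letter labels are never more than 4 edits apart, so short entries can be assigned to the wrong event. Spot-checking a sample of the matches is a reasonable precaution.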

Impact on Public Health

The data set is first tidied by combining the ‘FATALITIES’ and ‘INJURIES’ variables into a single column and their values into another variable called ‘Number’. Any observation in the ‘Number’ variable equal to zero will be removed, since there were no casualties. The data will be grouped by ‘Event’ and ‘Number’ will be summarized. The top 10 will be selected, first letters will be capitalized and only the ‘Event’ column will be output.

The data are then processed a second time to provide output for charting. They are tidied as above and the first letters of the ‘Event’ column are capitalized. The data set is then reduced to only those events appearing in the top 10 list created above, and the ‘Event’ column is converted to a factor variable, with the levels set to the order of the top 10 list. This puts the bars in order in the chart.

top10_casualty <- data %>%
        gather(FATALITIES, INJURIES, key = Casualty, value = Number) %>%
        filter(Number > 0) %>%
        group_by(Event) %>%
        summarize(Total = sum(Number)) %>%
        arrange(Total) %>%
        top_n(10, Total) %>%
        mutate(Event = str_to_title(Event)) %>%
        select(1)

Casualty <- data %>%
        gather(FATALITIES, INJURIES, key = Casualty, value = Number) %>%
        mutate(Event = str_to_title(Event)) %>%
        filter(Number > 0, Event %in% top10_casualty$Event) %>%
        mutate(Event = factor(Event, top10_casualty$Event))
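
The aggregation and ordering logic can be checked on a toy table. This sketch uses pivot_longer() and slice_max(), the current tidyr and dplyr counterparts of gather() and top_n(), rather than the calls in the code above. The toy data and the object names are made up for illustration.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Made-up casualty records
toy <- tibble(
  Event      = c("tornado", "tornado", "flood", "hail", "heat"),
  FATALITIES = c(2, 0, 1, 0, 5),
  INJURIES   = c(10, 3, 0, 0, 0)
)

# One row per event and casualty type
toy_long <- toy %>%
  pivot_longer(c(FATALITIES, INJURIES), names_to = "Casualty", values_to = "Number") %>%
  filter(Number > 0)

# Top 2 events by total casualties, sorted ascending for plotting
toy_top <- toy_long %>%
  group_by(Event) %>%
  summarize(Total = sum(Number), .groups = "drop") %>%
  slice_max(Total, n = 2) %>%
  arrange(Total)

# Keep only the top events and fix the bar order through the factor levels
toy_chart <- toy_long %>%
  filter(Event %in% toy_top$Event) %>%
  mutate(Event = factor(Event, levels = toy_top$Event))

levels(toy_chart$Event)
```

Sorting in ascending order before setting the levels means that, once the axes are flipped with coord_flip(), the event with the largest total ends up at the top of the chart.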

Impact on US Economy

The steps for determining the financial impact of weather events are similar to those above. Property and crop damage are combined, and the top 10 most costly events are identified and ordered. The data are then reduced to those events and output for charting.

top10_econ <- data %>%
        gather(CROPDMG, PROPDMG, key = Damage, value = Number) %>%
        filter(Number > 0) %>%
        group_by(Event) %>%
        summarize(Total = sum(Number)) %>%
        arrange(Total) %>%
        top_n(10, Total) %>%
        mutate(Event = str_to_title(Event)) %>%
        select(1)

Damage <- data %>%
        gather(CROPDMG, PROPDMG, key = Damage, value = Number) %>%
        mutate(Event = str_to_title(Event)) %>%
        filter(Number > 0, Event %in% top10_econ$Event) %>%
        mutate(Event = factor(Event, top10_econ$Event))

Results

The economic impact of the ten most costly event types is presented in the following chart:

ggplot(Damage) +
 aes(x = Event, fill = Damage, weight = Number) +
 geom_bar() +
 scale_fill_hue() +
 labs(x = "Events", y = "Damage (USD)", title = "Top 10 Events Harming the US Economy") +
 coord_flip() +
 theme_minimal()

The health impact of the ten most harmful event types is presented in the following chart:

ggplot(Casualty) +
 aes(x = Event, fill = Casualty, weight = Number) +
 geom_bar() +
 scale_fill_hue() +
 labs(x = "Events", y = "Number of casualties", title = "Top 10 Atmospheric Events Causing Casualties") +
 coord_flip() +
 theme_minimal()

Conclusions

As the charts clearly show, floods, hurricanes, and tornadoes are the three most harmful event types for the American economy. With respect to public health, tornadoes, excessive heat, and thunderstorms cause the most casualties and are the most dangerous for human life.


Gaspare Mattarella