Synopsis: executive summary

In this report we analyse the weather events database taken from the U.S. National Oceanic and Atmospheric Administration (NOAA). Our goal is to determine which type of event is most harmful to US population (health) and which type of event has the largest consequences for the American economy. On one hand, the analysis shows that tornados are the most harmful weather events for people’s health. On the other hand, the analysis shows that floods are the weather events that cause the biggest economical damages.

Initial setup: environment and packages required

knitr::opts_chunk$set(echo = TRUE)

# 
if (!require(tidyverse)) {
    install.packages("tidyverse")
    library(tidyverse, warn.conflicts = FALSE)
}
## Loading required package: tidyverse
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Data Processing, explained

Data presentation

The NOAA weather events database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage. The source raw data file is available online and can be downloaded in a remote server location.

There is also some documentation about the database available. Here you will find how some of the variables are constructed/defined.

  • National Weather Service Storm Data Documentation.
  • National Climatic Data Center Storm Events FAQ.

Downloading and reading raw data

With the following code we configure the file structure of our working repository (creating data folder), download the raw data from the web, and read it into R environment for further processing and analysis.

# Save & download URL as object
URLfile <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
# Save file name and location as object
name_location <- "data/storm-data.csv.bz2"

# Build conditional process to download file from URL, name and save it in specific location

### If /data dont exist in directory: create it
if (!file.exists('data')) {
    dir.create('data')
}
#### If the file (name_location) doesnt exist, download "URLfile" with saved name and location (name_location)
if (!file.exists(name_location)) {
    download.file(url = URLfile, destfile = name_location)
}

# Load data to local environment
weatherevents <- read.csv(name_location, sep = ",", header = TRUE)

# Remove further innecessary objects from environment
remove(name_location, URLfile)

Here we can see the structure of the dataset:

# Show the structure of the dataset
str(weatherevents)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  "" "" "" "" ...
##  $ BGN_LOCATI: chr  "" "" "" "" ...
##  $ END_DATE  : chr  "" "" "" "" ...
##  $ END_TIME  : chr  "" "" "" "" ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  "" "" "" "" ...
##  $ END_LOCATI: chr  "" "" "" "" ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ WFO       : chr  "" "" "" "" ...
##  $ STATEOFFIC: chr  "" "" "" "" ...
##  $ ZONENAMES : chr  "" "" "" "" ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  "" "" "" "" ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

There are 902.297 observations with 37 variables in the file. Only a subset of this data is requrired for the analysis. There are a lot of different variables which register information that is not relevant for our research puropses. Relevant for our analysis are the following variables:

  1. Date (BGN_DATE) is an important variable. According to the NOAA the full set of wheather events (48 event types) is available since 1996. Between 1950 and 1995 only a subset (Tornado, Thunderstorm Wind and Hail) of these events is available in the storm database. In order to have o comparable basis for the analysis the dataset is limited to the observations between 1996 and 2011. So, we need to process date variable (BGN_DATE) in order to filter the appropriate years for analysis.
  2. We need the weather event type (EVTYPE) variable, in order to identify the kind of weather event more harmful in terms of population and economic damage.
  3. Based on variables about health impact (FATALITIES and INJURIES) we can estimate population damages.
  4. Based on variables that register monetary impact on crop and property (PROPDMG and CROPDMG) as well as their corresponding exponents (PROPDMGEXP and CROPDMGEXP), we can calculate the economic damages according each typo of weather event.
  5. The dataset contains a lot of observations without any information about health and/or economic damages. These observations need to be excluded from the analysis.

Selecting cases for analysis

In the following codes we select only the relevant variables, making easy to deal -in terms of computation resources- with a lighter database; after that, we configure the date variable in order to be able to select only the period of years with full weather events report; finall, we select the events with population impact or economic damage.

initial_data <- weatherevents %>% select(BGN_DATE, EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP, FATALITIES, INJURIES)

# Remove the original raw dataset from environment
remove(weatherevents)

# Format the BGN_DATE variable as a date
initial_data$BGN_DATE <- as.Date(initial_data$BGN_DATE, "%m/%d/%Y")
initial_data$YEAR <- year(initial_data$BGN_DATE)

# Only use events since 1996
# Only use events with either health impact or economic damage
analysis_data <- initial_data %>% 
  filter(YEAR >= 1996) %>% 
  filter(PROPDMG > 0 | CROPDMG > 0 | FATALITIES > 0 | INJURIES > 0)

Economical damages: preparation for analysis

The economical damages variables provided in the weather events dataset require some configurations. As you can see below, each variable - CROPDMG and PROPDMG - comes with a separate exponent - CROPDMGEXP and PROPDMGEXP. In both cases, the coded exponents are: “” (empty), B, K and M. We replace that character strings for numeric values in both variables, in order to be able to calculate the total economic damages by each weather event.

# Original exponents in CROPDMGEXP
table(analysis_data$CROPDMGEXP)
## 
##             B      K      M 
## 102767      2  96787   1762
# Create new variable (CROPDMGFACTOR) replacing coded exponents with numeric values
analysis_data$CROPDMGFACTOR[(analysis_data$CROPDMGEXP == "")] <- 10^0
analysis_data$CROPDMGFACTOR[(analysis_data$CROPDMGEXP == "K")] <- 10^3
analysis_data$CROPDMGFACTOR[(analysis_data$CROPDMGEXP == "M")] <- 10^6
analysis_data$CROPDMGFACTOR[(analysis_data$CROPDMGEXP == "B")] <- 10^9

# New variable CROPDMGFACTOR
table(analysis_data$CROPDMGFACTOR)
## 
##      1   1000  1e+06  1e+09 
## 102767  96787   1762      2
# Original exponents in PROPDMGEXP
table(analysis_data$PROPDMGEXP)
## 
##             B      K      M 
##   8448     32 185474   7364
# Create new variable (PROPDMGFACTOR) replacing coded exponents with numeric values
analysis_data$PROPDMGFACTOR[(analysis_data$PROPDMGEXP == "")] <- 10^0
analysis_data$PROPDMGFACTOR[(analysis_data$PROPDMGEXP == "K")] <- 10^3
analysis_data$PROPDMGFACTOR[(analysis_data$PROPDMGEXP == "M")] <- 10^6
analysis_data$PROPDMGFACTOR[(analysis_data$PROPDMGEXP == "B")] <- 10^9

# New variable PROPDMGFACTOR
table(analysis_data$PROPDMGFACTOR)
## 
##      1   1000  1e+06  1e+09 
##   8448 185474   7364     32

As far as both crop and property damages (in USD) refer to an economic cost, the distinction is irrelevant. Therefore, both variables are multiplied by their corresponding factor and added to form a new variable ECONOMIC_COST.

analysis_data <- mutate(analysis_data, ECONOMIC_COST = PROPDMG * PROPDMGFACTOR + CROPDMG * CROPDMGFACTOR)

Population health impacts: preparation for analysis

The distinction between injuries and fatalities is irrelevant for our analysis: for exploratory purposes we want an aggregated indicator about population health impact of weather events. Therefore both variables are added to form a new variable HEALTHIMP.

analysis_data <- mutate(analysis_data, HEALTHIMP = FATALITIES + INJURIES)

Event Types: preparation for analysis

The weather events variable requires further examination. After converting the variable EVTYPE to uppercase, there are still 186 different event types listed. According to the NOAA there should be only 48. So there are a lot of duplicates.

analysis_data$EVTYPE <- toupper(analysis_data$EVTYPE)
dim(data.frame(table(analysis_data$EVTYPE)))
## [1] 186   2

As we are interested in the most harmful weather events there is no point cleaning all the cases; that task would require quite a big effort as well. We will aggregate the dataset by the event type variable (EVTYPE), summing up in one case the economic costs, and the health impact on the other case. Only event types above the 95% quantile both in economic costs and health impact are going to be cleaned for visualization purposes.

Within the economic cost database, there is only one event type above the 95% quantile, which is not compliant to the official types: HURRICANE.

economic <- analysis_data %>% group_by(EVTYPE) %>% summarise(cost = sum(ECONOMIC_COST)) 
economic <- filter(economic, cost > quantile(economic$cost, probs = 0.95))
economic
## # A tibble: 10 × 2
##    EVTYPE                    cost
##    <chr>                    <dbl>
##  1 DROUGHT            14413667000
##  2 FLASH FLOOD        16557105610
##  3 FLOOD             148919611950
##  4 HAIL               17071172870
##  5 HIGH WIND           5881421660
##  6 HURRICANE          14554229010
##  7 HURRICANE/TYPHOON  71913712800
##  8 STORM SURGE        43193541000
##  9 TORNADO            24900370720
## 10 TROPICAL STORM      8320186550

With the following code chunk the event type was unified according the official type: HURRICANE (TYPHOON). This, within the original analysis_data.

analysis_data$EVTYPE[(analysis_data$EVTYPE == "HURRICANE")] <- "HURRICANE (TYPHOON)"
analysis_data$EVTYPE[(analysis_data$EVTYPE == "HURRICANE/TYPHOON")] <- "HURRICANE (TYPHOON)"

Within the health impact database, there are only two event types above the 95% quantile of impact, which are not compliant to the official types:TSTM WIND.

health <- analysis_data %>% group_by(EVTYPE) %>% summarise(impact = sum(HEALTHIMP)) 
health <- filter(health, impact > quantile(health$impact, probs = 0.95))
health
## # A tibble: 10 × 2
##    EVTYPE              impact
##    <chr>                <dbl>
##  1 EXCESSIVE HEAT        8188
##  2 FLASH FLOOD           2561
##  3 FLOOD                 7172
##  4 HEAT                  1459
##  5 HURRICANE (TYPHOON)   1446
##  6 LIGHTNING             4792
##  7 THUNDERSTORM WIND     1530
##  8 TORNADO              22178
##  9 TSTM WIND             3870
## 10 WINTER STORM          1483

With the following code chunk the event type was unified according the official type: THUNDERSTORM WIND. This, within the original analysis_data.

analysis_data$EVTYPE[(analysis_data$EVTYPE == "TSTM WIND")] <- "THUNDERSTORM WIND"

With the configurations above, the processing of our analysis data is ready. As we make the modifications over the original analysis data it will be needed to recalculate the economic damage and health impact to build the definitive results.

Results

Question 1: Across the United States, which types of events are most harmful with respect to population health?

The cleaned data frame analysis_data is been aggregated per EVTYPE and provided in a descending order in the new data frame healthImpact. Then, a visualization of the weather events and their respective health impact over population is built.

The barchart shows that Tornados are the most harmful weather events for people’s health.

# Clean and ordered dataframe
healthImpact <- analysis_data %>% 
                group_by(EVTYPE) %>% 
                summarise(HEALTHIMP = sum(HEALTHIMP)) %>% 
                arrange(desc(HEALTHIMP))

# Visualization: health impact over population
ggplot(healthImpact[1:10,], aes(x=reorder(EVTYPE, -HEALTHIMP),y=HEALTHIMP,color=EVTYPE)) + 
      geom_bar(stat="identity", fill="white") + 
      theme(axis.text.x = element_text(angle = 90, hjust = 1)) + 
      xlab("Event") + ylab("Number of fatalities and injuries") +
      theme(legend.position="none") +
      ggtitle("Fatalities and injuries in the US caused by severe weather events")

Question 2: Across the United States, which types of events have the greatest economic consequences?

The cleaned data frame analysis_data is been aggregated per EVTYPE and provided in a descending order in the new data frame economicCost Then, a visualization of the weather events and their respective economic costs is built.

The barchart shows that Floods cause the biggest economical damages.

# Clean and ordered dataframe
economicCost <- analysis_data %>% 
                group_by(EVTYPE) %>% 
                summarise(ECONOMIC_COST = sum(ECONOMIC_COST)) %>% 
                arrange(desc(ECONOMIC_COST))

# Visualization: health impact over population
ggplot(economicCost[1:10,], aes(x=reorder(EVTYPE, -ECONOMIC_COST),y=ECONOMIC_COST,color=EVTYPE)) + 
      geom_bar(stat="identity", fill="white") + 
      theme(axis.text.x = element_text(angle = 90, hjust = 1)) + 
      xlab("Event") + ylab("Economic cost in USD") +
      theme(legend.position="none") +
      ggtitle("Economic cost in the US caused by severe weather events")