Abstract

The National Oceanic and Atmospheric Administration (NOAA) has a database of weather events that happened in the last 70 years. It is possible to find all kind of events and information on the impact on the public health and economics for each event that occurred. This database is very important to understand what happened and visualize which can be done to better distribute the economical resources among the states. In order to show what events are the most harmful to the economics and public health, is necessary to make an analysis of the database, making all the transformation that are needed to get to the best results. There will be some graphs to evidence which events are the harmful and they will be ordered to show the real relevance at the time of assign the efforts. I won’t be giving any recommendation, but I will show the reality about the weather most dangerous events. R makes this work easier and totally reproducible with the help of the markdown language.

Data Processing

1. Data Loading

For this study I am using the following file provided by The National Oceanic and Atmospheric Administration (NOAA). The file (https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2) is being downloaded directly by the R script and decompressed by the bunzip2 command from the R.utils package.

Besides, I created a csv file called “event_table.csv” which contains the official list of the 48 weather events from the National Weather Service.

library(stringr)
library(dplyr)
library(lubridate)
library(R.utils)
library(fuzzyjoin)
library(ggplot2)

if (!file.exists("data/data.csv.bz2"))
{
  download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", "data/data.csv.bz2")
}
 
R.utils::bunzip2("data/data.csv.bz2", remove = FALSE, skip = TRUE)

Both of the csv files are loaded in 2 new variables called “data” and “events”, which are going to be used along the study.

  data <- read.csv("data/data.csv")
events <- read.csv("data/event_table.csv")

2. Data transformations

In order to work with the data it is necessary to make some transformations to the original data. The good news is that R makes that very easy to do.

First, I transform the EVTYPE column to be uppercase in both data frames (data and events). Also, I update the EVTYPE in the events data frame to contain only the string value that is before the “/” character. This is to make the comparison easier at the time of joining both data frames.

I also formatted the BGN_DATE column in the data data frame to have date format (MM/DD/YYYY). Then, I am creating a new column called “YEAR” which contains the year of the BGN_DATE column.

events$EVTYPE <- stringr::str_to_upper(events$EVTYPE)
events$EVTYPE = gsub("\\/.*", "", events$EVTYPE)
data$EVTYPE <- stringr::str_to_upper(data$EVTYPE)
data$BGN_DATE <- as.Date(data$BGN_DATE, format="%m/%d/%Y")
data$YEAR <- lubridate::year(data$BGN_DATE)

Last, but not less important, it is a transformation on the data data frame to the EVTYPE column, where I replace the TSTM value with THUNDERSTORM, which is a valid transformation according to several sources.

#https://www.severe-weather.eu/severe-weather-outlooks-faq/
data$EVTYPE <- gsub("TSTM", "THUNDERSTORM", data$EVTYPE)
data$EVTYPE <- gsub("\\/.*", "", data$EVTYPE)

The new data frame is called “data_health” and is created subsetting the original data frame including only the rows with INJURIES or FATALITIES values greater than zero. In this subsetting, I also make sure to select only those values that are going to be use.

The next line of code is used to JOIN the data_health data frame with the events data frame. This JOIN is very nice and helpful and works very well to group the events in the data_health data frame (which are written in an non standard way) with the events in the events data frame. I achieved the best results using this function.

data_health <- subset(data, INJURIES > 0 | FATALITIES >0, select=c(YEAR, STATE, EVTYPE, INJURIES, FATALITIES))
data_health <- data_health %>% regex_inner_join(events, by=c(EVTYPE="EVTYPE"))

Here now I prepare a new data frame called data_econom which will be used to hold the data related to economic damage. I select (as in the data_health data frame) only those values with property or crop damage greater than zero. Also I will use in this new data frame only the column that are helpful to the analisys.

data_econom <- subset(data, PROPDMG > 0 | CROPDMG >0, select=c(YEAR, STATE, EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP))
data_econom <- data_econom %>% regex_inner_join(events, by=c(EVTYPE="EVTYPE"))

Here are some transformation which took a great amount of time to realize how to do it. Doing some research I could arrive to a conclusion and could confirm it on the discussion forum of rstudio.

The PROPDMGEXP and CROPDMGEXP columns hold values in the form of a character that has to be interpreted and then apply a multiplication against the PROPDMG and CROPDMG values which are expressed in US dollars. The values of these character are the following:
“H” and “h” means 100
“K” and “k” means 1.000
“M” and “m” means 1.000.000
“B” and “b” means 1.000.000.000
“+” means 1
“-”, “?” and "" means 0
everything else (there were numbers ranging from zero to nine means 10^N)

With this on mind, I did transformations on the CALCPROPEXP and CALCCROPEXP columns to store their real value. Once these transformation are done, I just created a couple of new calculated columns called PROPDMG_CALC and CROPDMG_CALC which holds the calculated value in US dollars of the damage weather events caused on both properties and crops.

data_econom <- data_econom %>% mutate(CALCPROPEXP = case_when (
  PROPDMGEXP == "H" | PROPDMGEXP == "h" ~ 100,
  PROPDMGEXP == "K" | PROPDMGEXP == "k" ~ 1000,
  PROPDMGEXP == "M" | PROPDMGEXP == "m" ~ 1000000,
  PROPDMGEXP == "B" | PROPDMGEXP == "b" ~ 1000000000,
  PROPDMGEXP == "+" ~ 1,
  PROPDMGEXP == "-" | PROPDMGEXP == "?" | PROPDMGEXP == "" ~ 0,
  PROPDMGEXP %in% c("0", "1", "2", "3", "4", "5", "6", "7", "8", "9") ~ 10^as.numeric(PROPDMGEXP)
))
## Warning: Problem with `mutate()` input `CALCPROPEXP`.
## i NAs introducidos por coerción
## i Input `CALCPROPEXP` is `case_when(...)`.
## Warning in eval_tidy(pair$rhs, env = default_env): NAs introducidos por coerción
data_econom <- data_econom %>% mutate(CALCCROPEXP = case_when (
  CROPDMGEXP == "H" | CROPDMGEXP == "h" ~ 100,
  CROPDMGEXP == "K" | CROPDMGEXP == "k" ~ 1000,
  CROPDMGEXP == "M" | CROPDMGEXP == "m" ~ 1000000,
  CROPDMGEXP == "B" | CROPDMGEXP == "b" ~ 1000000000,
  CROPDMGEXP == "+" ~ 1,
  CROPDMGEXP == "-" | CROPDMGEXP == "?" | CROPDMGEXP == "" ~ 0,
  CROPDMGEXP %in% c("0", "1", "2", "3", "4", "5", "6", "7", "8", "9") ~ 10^as.numeric(CROPDMGEXP)
))
## Warning: Problem with `mutate()` input `CALCCROPEXP`.
## i NAs introducidos por coerción
## i Input `CALCCROPEXP` is `case_when(...)`.

## Warning: NAs introducidos por coerción
data_econom$PROPDMG_CALC <- data_econom$PROPDMG * data_econom$CALCPROPEXP
data_econom$CROPDMG_CALC <- data_econom$CROPDMG * data_econom$CALCCROPEXP

3. Injuries and fatalities

Finally the two last lines are used to create two new data frames (data_ev_inj for injuries and data_ev_fat for fatalities). We will work with this two data frames which are loaded with summarized data on them. Also, for purpose of this assignment I created a data frame called “data_ev_tmp” which holds the sum of the injuries and the fatalities.

data_ev_inj <- data_health %>% group_by(EVTYPE.y) %>% summarize(N=sum(INJURIES))
## `summarise()` ungrouping output (override with `.groups` argument)
data_ev_fat <- data_health %>% group_by(EVTYPE.y) %>% summarize(N=sum(FATALITIES))
## `summarise()` ungrouping output (override with `.groups` argument)
data_ev_tmp <- data_health %>% group_by(EVTYPE.y) %>% summarize(N=sum(INJURIES+FATALITIES))
## `summarise()` ungrouping output (override with `.groups` argument)

Using the data_ev_inj, data_ev_fat and data_ev_tmp data frames, I calculate the quintiles, as a way to show which events are the most harmful in terms of injuries. Then in the plot below you can clearly see the diference in term of injuries and fatalities of the fifth quintile compared to the others quintiles. As a result, I decided to select the fifth quintile as the selected in order to prioritize the resources to be spent by the NOAA.

data_ev_inj$QNT <- cut(data_ev_inj$N, breaks=quantile(data_ev_inj$N, probs=(0:5)/5), labels=1:5, include.lowest=TRUE)

data_ev_fat$QNT <- cut(data_ev_fat$N, breaks=quantile(data_ev_fat$N, probs=(0:5)/5), labels=1:5, include.lowest=TRUE)

data_ev_tmp$QNT <- cut(data_ev_tmp$N, breaks=quantile(data_ev_tmp$N, probs=(0:5)/5), labels=1:5, include.lowest=TRUE)

Here is the graph showing the quantiles/weather events vs number of injuries and fatalities. One can see clearly that the 5th quintil has the most injuries and fatalities.

ggplot(data=data_ev_tmp, aes(fill=QNT, x=QNT, y=N, group=factor(EVTYPE.y))) + geom_bar(position="dodge", stat="identity") + ggtitle("Quintiles for INJURIES+FATALITIES\ndue to weather events") + ylab("Number of injuries+fatalities") + xlab("Quintiles/weather events") + scale_fill_brewer(name="Quintiles", palette="Set2")

4. Economical damage

Beginning with the economical damage analysis, the first thing I do is to create two new data frames called data_ev_prd (used to store information related to property damage) and data_ev_crd (used to store information related to crop damage). Both of this data frames hold the summarized view of the original data_econom data frame, grouping by the event type.

data_ev_prd <- data_econom %>% group_by(EVTYPE.y) %>% summarize(N=sum(PROPDMG_CALC))
## `summarise()` ungrouping output (override with `.groups` argument)
data_ev_prd <- data_ev_prd[data_ev_prd$N>0, ]
data_ev_crd <- data_econom %>% group_by(EVTYPE.y) %>% summarize(N=sum(CROPDMG_CALC))
## `summarise()` ungrouping output (override with `.groups` argument)
data_ev_crd <- data_ev_crd[data_ev_crd$N>0, ]

Next, as in the public health case, I calculated the quantiles and asigned the quantile obtainted into a new column called QNT. This will allow me to select the most economical harmful events.

data_ev_crd$QNT <- cut(data_ev_crd$N, breaks=quantile(data_ev_crd$N, probs=(0:5)/5), labels=1:5, include.lowest=TRUE)
data_ev_prd$QNT <- cut(data_ev_prd$N, breaks=quantile(data_ev_prd$N, probs=(0:5)/5), labels=1:5, include.lowest=TRUE)

Here are the two graphs showing the quantiles/weather events vs properties and crop damage expressed in US dollars. One can clearly see that the 5th quintile has the most damage to both properties and crops.

ggplot(data=data_ev_prd, aes(fill=QNT, x=QNT, y=N, group=factor(N))) + geom_bar(position="dodge", stat="identity") + ggtitle("Quintiles for economic damage to properties due to weather events") + ylab("Damage to properties in USD") + xlab("Quintiles/weater events") + scale_fill_brewer(name="Weather events", palette = "Set1") + scale_y_continuous(labels = function(x) format(x, scientific = FALSE, big.mark = ","))

ggplot(data=data_ev_crd, aes(fill=QNT, x=QNT, y=N, group=factor(N))) + geom_bar(position="dodge", stat="identity") + ggtitle("Quintiles for economic damage to crops due to weather events") + ylab("Damage to properties in USD") + xlab("Quintiles/weater events") + scale_fill_brewer(name="Weather events", palette = "Set2") + scale_y_continuous(labels = function(x) format(x, scientific = FALSE, big.mark = ","))

Results

1. Injuries and Fatalities Results

Once I have decided to work on the 5th quintile, I update the data_ev_inj and data_ev_fat data frames to hold only the data corresponding to this quintile. So we can show a more detailed view of the weather events in this quintile and thus being able to better understand the data.

data_ev_inj <- data_ev_inj[data_ev_inj$QNT==5, ]
data_ev_inj <- data_ev_inj[order(-data_ev_inj$N), ]
data_ev_fat <- data_ev_fat[data_ev_fat$QNT==5, ]
data_ev_fat <- data_ev_fat[order(-data_ev_fat$N), ]

Both of the tables below show the details of which events caused the most harm to the public health. Here is easy to see that the TORNADOS, HEAT and FLOODS are the top 3 harmful weather events recorded in the data of the NOAA

knitr::kable(data_ev_inj, col.names=c("EVENT","NUMBER OF INJURIES","QUINTILE"), caption="Detail of INJURIES due to weather events in the 5th quintile")
Detail of INJURIES due to weather events in the 5th quintile
EVENT NUMBER OF INJURIES QUINTILE
TORNADO 91365 5
THUNDERSTORM WIND 9493 5
HEAT 9224 5
FLOOD 8602 5
EXCESSIVE HEAT 6525 5
LIGHTNING 5232 5
ICE STORM 1977 5
FLASH FLOOD 1785 5
knitr::kable(data_ev_fat, col.names=c("EVENT","NUMBER OF FATALITIES","QUINTILE"), caption="Detail of FATALITIES due to weather events in the 5th quintile")
Detail of FATALITIES due to weather events in the 5th quintile
EVENT NUMBER OF FATALITIES QUINTILE
TORNADO 5658 5
HEAT 3119 5
EXCESSIVE HEAT 1903 5
FLOOD 1525 5
FLASH FLOOD 1018 5
LIGHTNING 817 5
THUNDERSTORM WIND 753 5
RIP CURRENT 577 5

2. Economical Damage Results

Once I have decided to work on the 5th quintile, I just updates both data frames (data_ev_prd and data_ev_crd) to only hold the data corresponding to this quintile. This way I can show a more detailed view of the weather events that are the most harmful in terms of economical damage

data_ev_prd <- data_ev_prd[data_ev_prd$QNT==5, ]
data_ev_prd <- data_ev_prd[order(-data_ev_prd$N), ]
data_ev_crd <- data_ev_crd[data_ev_crd$QNT==5, ]
data_ev_crd <- data_ev_crd[order(-data_ev_crd$N), ]

Both of the tables below show the details of which events caused the most harm to properties and crops in terms of US dollars. Here you can see both TORNADOES and FLOOD being the two most harmful events.

knitr::kable(data_ev_prd, col.names=c("EVENT","PROPERTY DAMAGE IN USD","QUINTILE"), caption="Detail of economic damage to properties due to weather events in the 5th quintile", format.args = list(big.mark = ",", scientific = FALSE))
Detail of economic damage to properties due to weather events in the 5th quintile
EVENT PROPERTY DAMAGE IN USD QUINTILE
FLOOD 168,184,579,561 5
TORNADO 58,552,206,924 5
STORM SURGE 47,964,724,000 5
HAIL 17,578,269,956 5
FLASH FLOOD 17,414,730,872 5
THUNDERSTORM WIND 11,575,230,873 5
TROPICAL STORM 7,714,390,550 5
WINTER STORM 6,748,997,251 5
HIGH WIND 6,064,650,035 5
knitr::kable(data_ev_crd, col.names=c("EVENT","CROP DAMAGE IN USD","QUINTILE"), caption="Detail of economic damage to crops due to weather events in the 5th quintile", format.args = list(big.mark = ",", scientific = FALSE))
Detail of economic damage to crops due to weather events in the 5th quintile
EVENT CROP DAMAGE IN USD QUINTILE
DROUGHT 13,972,621,780 5
FLOOD 12,379,706,100 5
ICE STORM 5,022,113,500 5
HAIL 3,049,486,620 5
FLASH FLOOD 1,437,163,150 5
COLD 1,409,765,550 5