The National Oceanic and Atmospheric Administration (NOAA) has a database of weather events that happened in the last 70 years. It is possible to find all kind of events and information on the impact on the public health and economics for each event that occurred. This database is very important to understand what happened and visualize which can be done to better distribute the economical resources among the states. In order to show what events are the most harmful to the economics and public health, is necessary to make an analysis of the database, making all the transformation that are needed to get to the best results. There will be some graphs to evidence which events are the harmful and they will be ordered to show the real relevance at the time of assign the efforts. I won’t be giving any recommendation, but I will show the reality about the weather most dangerous events. R makes this work easier and totally reproducible with the help of the markdown language.
For this study I am using the following file provided by The National Oceanic and Atmospheric Administration (NOAA). The file (https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2) is being downloaded directly by the R script and decompressed by the bunzip2 command from the R.utils package.
Besides, I created a csv file called “event_table.csv” which contains the official list of the 48 weather events from the National Weather Service.
library(stringr)
library(dplyr)
library(lubridate)
library(R.utils)
library(fuzzyjoin)
library(ggplot2)
if (!file.exists("data/data.csv.bz2"))
{
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", "data/data.csv.bz2")
}
R.utils::bunzip2("data/data.csv.bz2", remove = FALSE, skip = TRUE)
Both of the csv files are loaded in 2 new variables called “data” and “events”, which are going to be used along the study.
data <- read.csv("data/data.csv")
events <- read.csv("data/event_table.csv")
In order to work with the data it is necessary to make some transformations to the original data. The good news is that R makes that very easy to do.
First, I transform the EVTYPE column to be uppercase in both data frames (data and events). Also, I update the EVTYPE in the events data frame to contain only the string value that is before the “/” character. This is to make the comparison easier at the time of joining both data frames.
I also formatted the BGN_DATE column in the data data frame to have date format (MM/DD/YYYY). Then, I am creating a new column called “YEAR” which contains the year of the BGN_DATE column.
events$EVTYPE <- stringr::str_to_upper(events$EVTYPE)
events$EVTYPE = gsub("\\/.*", "", events$EVTYPE)
data$EVTYPE <- stringr::str_to_upper(data$EVTYPE)
data$BGN_DATE <- as.Date(data$BGN_DATE, format="%m/%d/%Y")
data$YEAR <- lubridate::year(data$BGN_DATE)
Last, but not less important, it is a transformation on the data data frame to the EVTYPE column, where I replace the TSTM value with THUNDERSTORM, which is a valid transformation according to several sources.
#https://www.severe-weather.eu/severe-weather-outlooks-faq/
data$EVTYPE <- gsub("TSTM", "THUNDERSTORM", data$EVTYPE)
data$EVTYPE <- gsub("\\/.*", "", data$EVTYPE)
The new data frame is called “data_health” and is created subsetting the original data frame including only the rows with INJURIES or FATALITIES values greater than zero. In this subsetting, I also make sure to select only those values that are going to be use.
The next line of code is used to JOIN the data_health data frame with the events data frame. This JOIN is very nice and helpful and works very well to group the events in the data_health data frame (which are written in an non standard way) with the events in the events data frame. I achieved the best results using this function.
data_health <- subset(data, INJURIES > 0 | FATALITIES >0, select=c(YEAR, STATE, EVTYPE, INJURIES, FATALITIES))
data_health <- data_health %>% regex_inner_join(events, by=c(EVTYPE="EVTYPE"))
Here now I prepare a new data frame called data_econom which will be used to hold the data related to economic damage. I select (as in the data_health data frame) only those values with property or crop damage greater than zero. Also I will use in this new data frame only the column that are helpful to the analisys.
data_econom <- subset(data, PROPDMG > 0 | CROPDMG >0, select=c(YEAR, STATE, EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP))
data_econom <- data_econom %>% regex_inner_join(events, by=c(EVTYPE="EVTYPE"))
Here are some transformation which took a great amount of time to realize how to do it. Doing some research I could arrive to a conclusion and could confirm it on the discussion forum of rstudio.
The PROPDMGEXP and CROPDMGEXP columns hold values in the form of a character that has to be interpreted and then apply a multiplication against the PROPDMG and CROPDMG values which are expressed in US dollars. The values of these character are the following:
“H” and “h” means 100
“K” and “k” means 1.000
“M” and “m” means 1.000.000
“B” and “b” means 1.000.000.000
“+” means 1
“-”, “?” and "" means 0
everything else (there were numbers ranging from zero to nine means 10^N)
With this on mind, I did transformations on the CALCPROPEXP and CALCCROPEXP columns to store their real value. Once these transformation are done, I just created a couple of new calculated columns called PROPDMG_CALC and CROPDMG_CALC which holds the calculated value in US dollars of the damage weather events caused on both properties and crops.
data_econom <- data_econom %>% mutate(CALCPROPEXP = case_when (
PROPDMGEXP == "H" | PROPDMGEXP == "h" ~ 100,
PROPDMGEXP == "K" | PROPDMGEXP == "k" ~ 1000,
PROPDMGEXP == "M" | PROPDMGEXP == "m" ~ 1000000,
PROPDMGEXP == "B" | PROPDMGEXP == "b" ~ 1000000000,
PROPDMGEXP == "+" ~ 1,
PROPDMGEXP == "-" | PROPDMGEXP == "?" | PROPDMGEXP == "" ~ 0,
PROPDMGEXP %in% c("0", "1", "2", "3", "4", "5", "6", "7", "8", "9") ~ 10^as.numeric(PROPDMGEXP)
))
## Warning: Problem with `mutate()` input `CALCPROPEXP`.
## i NAs introducidos por coerción
## i Input `CALCPROPEXP` is `case_when(...)`.
## Warning in eval_tidy(pair$rhs, env = default_env): NAs introducidos por coerción
data_econom <- data_econom %>% mutate(CALCCROPEXP = case_when (
CROPDMGEXP == "H" | CROPDMGEXP == "h" ~ 100,
CROPDMGEXP == "K" | CROPDMGEXP == "k" ~ 1000,
CROPDMGEXP == "M" | CROPDMGEXP == "m" ~ 1000000,
CROPDMGEXP == "B" | CROPDMGEXP == "b" ~ 1000000000,
CROPDMGEXP == "+" ~ 1,
CROPDMGEXP == "-" | CROPDMGEXP == "?" | CROPDMGEXP == "" ~ 0,
CROPDMGEXP %in% c("0", "1", "2", "3", "4", "5", "6", "7", "8", "9") ~ 10^as.numeric(CROPDMGEXP)
))
## Warning: Problem with `mutate()` input `CALCCROPEXP`.
## i NAs introducidos por coerción
## i Input `CALCCROPEXP` is `case_when(...)`.
## Warning: NAs introducidos por coerción
data_econom$PROPDMG_CALC <- data_econom$PROPDMG * data_econom$CALCPROPEXP
data_econom$CROPDMG_CALC <- data_econom$CROPDMG * data_econom$CALCCROPEXP
Finally the two last lines are used to create two new data frames (data_ev_inj for injuries and data_ev_fat for fatalities). We will work with this two data frames which are loaded with summarized data on them. Also, for purpose of this assignment I created a data frame called “data_ev_tmp” which holds the sum of the injuries and the fatalities.
data_ev_inj <- data_health %>% group_by(EVTYPE.y) %>% summarize(N=sum(INJURIES))
## `summarise()` ungrouping output (override with `.groups` argument)
data_ev_fat <- data_health %>% group_by(EVTYPE.y) %>% summarize(N=sum(FATALITIES))
## `summarise()` ungrouping output (override with `.groups` argument)
data_ev_tmp <- data_health %>% group_by(EVTYPE.y) %>% summarize(N=sum(INJURIES+FATALITIES))
## `summarise()` ungrouping output (override with `.groups` argument)
Using the data_ev_inj, data_ev_fat and data_ev_tmp data frames, I calculate the quintiles, as a way to show which events are the most harmful in terms of injuries. Then in the plot below you can clearly see the diference in term of injuries and fatalities of the fifth quintile compared to the others quintiles. As a result, I decided to select the fifth quintile as the selected in order to prioritize the resources to be spent by the NOAA.
data_ev_inj$QNT <- cut(data_ev_inj$N, breaks=quantile(data_ev_inj$N, probs=(0:5)/5), labels=1:5, include.lowest=TRUE)
data_ev_fat$QNT <- cut(data_ev_fat$N, breaks=quantile(data_ev_fat$N, probs=(0:5)/5), labels=1:5, include.lowest=TRUE)
data_ev_tmp$QNT <- cut(data_ev_tmp$N, breaks=quantile(data_ev_tmp$N, probs=(0:5)/5), labels=1:5, include.lowest=TRUE)
Here is the graph showing the quantiles/weather events vs number of injuries and fatalities. One can see clearly that the 5th quintil has the most injuries and fatalities.
ggplot(data=data_ev_tmp, aes(fill=QNT, x=QNT, y=N, group=factor(EVTYPE.y))) + geom_bar(position="dodge", stat="identity") + ggtitle("Quintiles for INJURIES+FATALITIES\ndue to weather events") + ylab("Number of injuries+fatalities") + xlab("Quintiles/weather events") + scale_fill_brewer(name="Quintiles", palette="Set2")
Beginning with the economical damage analysis, the first thing I do is to create two new data frames called data_ev_prd (used to store information related to property damage) and data_ev_crd (used to store information related to crop damage). Both of this data frames hold the summarized view of the original data_econom data frame, grouping by the event type.
data_ev_prd <- data_econom %>% group_by(EVTYPE.y) %>% summarize(N=sum(PROPDMG_CALC))
## `summarise()` ungrouping output (override with `.groups` argument)
data_ev_prd <- data_ev_prd[data_ev_prd$N>0, ]
data_ev_crd <- data_econom %>% group_by(EVTYPE.y) %>% summarize(N=sum(CROPDMG_CALC))
## `summarise()` ungrouping output (override with `.groups` argument)
data_ev_crd <- data_ev_crd[data_ev_crd$N>0, ]
Next, as in the public health case, I calculated the quantiles and asigned the quantile obtainted into a new column called QNT. This will allow me to select the most economical harmful events.
data_ev_crd$QNT <- cut(data_ev_crd$N, breaks=quantile(data_ev_crd$N, probs=(0:5)/5), labels=1:5, include.lowest=TRUE)
data_ev_prd$QNT <- cut(data_ev_prd$N, breaks=quantile(data_ev_prd$N, probs=(0:5)/5), labels=1:5, include.lowest=TRUE)
Here are the two graphs showing the quantiles/weather events vs properties and crop damage expressed in US dollars. One can clearly see that the 5th quintile has the most damage to both properties and crops.
ggplot(data=data_ev_prd, aes(fill=QNT, x=QNT, y=N, group=factor(N))) + geom_bar(position="dodge", stat="identity") + ggtitle("Quintiles for economic damage to properties due to weather events") + ylab("Damage to properties in USD") + xlab("Quintiles/weater events") + scale_fill_brewer(name="Weather events", palette = "Set1") + scale_y_continuous(labels = function(x) format(x, scientific = FALSE, big.mark = ","))
ggplot(data=data_ev_crd, aes(fill=QNT, x=QNT, y=N, group=factor(N))) + geom_bar(position="dodge", stat="identity") + ggtitle("Quintiles for economic damage to crops due to weather events") + ylab("Damage to properties in USD") + xlab("Quintiles/weater events") + scale_fill_brewer(name="Weather events", palette = "Set2") + scale_y_continuous(labels = function(x) format(x, scientific = FALSE, big.mark = ","))
Once I have decided to work on the 5th quintile, I update the data_ev_inj and data_ev_fat data frames to hold only the data corresponding to this quintile. So we can show a more detailed view of the weather events in this quintile and thus being able to better understand the data.
data_ev_inj <- data_ev_inj[data_ev_inj$QNT==5, ]
data_ev_inj <- data_ev_inj[order(-data_ev_inj$N), ]
data_ev_fat <- data_ev_fat[data_ev_fat$QNT==5, ]
data_ev_fat <- data_ev_fat[order(-data_ev_fat$N), ]
Both of the tables below show the details of which events caused the most harm to the public health. Here is easy to see that the TORNADOS, HEAT and FLOODS are the top 3 harmful weather events recorded in the data of the NOAA
knitr::kable(data_ev_inj, col.names=c("EVENT","NUMBER OF INJURIES","QUINTILE"), caption="Detail of INJURIES due to weather events in the 5th quintile")
EVENT | NUMBER OF INJURIES | QUINTILE |
---|---|---|
TORNADO | 91365 | 5 |
THUNDERSTORM WIND | 9493 | 5 |
HEAT | 9224 | 5 |
FLOOD | 8602 | 5 |
EXCESSIVE HEAT | 6525 | 5 |
LIGHTNING | 5232 | 5 |
ICE STORM | 1977 | 5 |
FLASH FLOOD | 1785 | 5 |
knitr::kable(data_ev_fat, col.names=c("EVENT","NUMBER OF FATALITIES","QUINTILE"), caption="Detail of FATALITIES due to weather events in the 5th quintile")
EVENT | NUMBER OF FATALITIES | QUINTILE |
---|---|---|
TORNADO | 5658 | 5 |
HEAT | 3119 | 5 |
EXCESSIVE HEAT | 1903 | 5 |
FLOOD | 1525 | 5 |
FLASH FLOOD | 1018 | 5 |
LIGHTNING | 817 | 5 |
THUNDERSTORM WIND | 753 | 5 |
RIP CURRENT | 577 | 5 |
Once I have decided to work on the 5th quintile, I just updates both data frames (data_ev_prd and data_ev_crd) to only hold the data corresponding to this quintile. This way I can show a more detailed view of the weather events that are the most harmful in terms of economical damage
data_ev_prd <- data_ev_prd[data_ev_prd$QNT==5, ]
data_ev_prd <- data_ev_prd[order(-data_ev_prd$N), ]
data_ev_crd <- data_ev_crd[data_ev_crd$QNT==5, ]
data_ev_crd <- data_ev_crd[order(-data_ev_crd$N), ]
Both of the tables below show the details of which events caused the most harm to properties and crops in terms of US dollars. Here you can see both TORNADOES and FLOOD being the two most harmful events.
knitr::kable(data_ev_prd, col.names=c("EVENT","PROPERTY DAMAGE IN USD","QUINTILE"), caption="Detail of economic damage to properties due to weather events in the 5th quintile", format.args = list(big.mark = ",", scientific = FALSE))
EVENT | PROPERTY DAMAGE IN USD | QUINTILE |
---|---|---|
FLOOD | 168,184,579,561 | 5 |
TORNADO | 58,552,206,924 | 5 |
STORM SURGE | 47,964,724,000 | 5 |
HAIL | 17,578,269,956 | 5 |
FLASH FLOOD | 17,414,730,872 | 5 |
THUNDERSTORM WIND | 11,575,230,873 | 5 |
TROPICAL STORM | 7,714,390,550 | 5 |
WINTER STORM | 6,748,997,251 | 5 |
HIGH WIND | 6,064,650,035 | 5 |
knitr::kable(data_ev_crd, col.names=c("EVENT","CROP DAMAGE IN USD","QUINTILE"), caption="Detail of economic damage to crops due to weather events in the 5th quintile", format.args = list(big.mark = ",", scientific = FALSE))
EVENT | CROP DAMAGE IN USD | QUINTILE |
---|---|---|
DROUGHT | 13,972,621,780 | 5 |
FLOOD | 12,379,706,100 | 5 |
ICE STORM | 5,022,113,500 | 5 |
HAIL | 3,049,486,620 | 5 |
FLASH FLOOD | 1,437,163,150 | 5 |
COLD | 1,409,765,550 | 5 |