In this report we analyse the weather events database taken from the U.S. National Oceanic and Atmospheric Administration (NOAA). Our goal is to determine which type of event is most harmful to US population (health) and which type of event has the largest consequences for the American economy. On one hand, the analysis shows that tornados are the most harmful weather events for people’s health. On the other hand, the analysis shows that floods are the weather events that cause the biggest economical damages.
knitr::opts_chunk$set(echo = TRUE)
#
if (!require(tidyverse)) {
install.packages("tidyverse")
library(tidyverse, warn.conflicts = FALSE)
}
## Loading required package: tidyverse
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
The NOAA weather events database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage. The source raw data file is available online and can be downloaded in a remote server location.
There is also some documentation about the database available. Here you will find how some of the variables are constructed/defined.
With the following code we configure the file structure of our working repository (creating data folder), download the raw data from the web, and read it into R environment for further processing and analysis.
# Save & download URL as object
URLfile <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
# Save file name and location as object
name_location <- "data/storm-data.csv.bz2"
# Build conditional process to download file from URL, name and save it in specific location
### If /data dont exist in directory: create it
if (!file.exists('data')) {
dir.create('data')
}
#### If the file (name_location) doesnt exist, download "URLfile" with saved name and location (name_location)
if (!file.exists(name_location)) {
download.file(url = URLfile, destfile = name_location)
}
# Load data to local environment
weatherevents <- read.csv(name_location, sep = ",", header = TRUE)
# Remove further innecessary objects from environment
remove(name_location, URLfile)
Here we can see the structure of the dataset:
# Show the structure of the dataset
str(weatherevents)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
There are 902.297 observations with 37 variables in the file. Only a subset of this data is requrired for the analysis. There are a lot of different variables which register information that is not relevant for our research puropses. Relevant for our analysis are the following variables:
In the following codes we select only the relevant variables, making easy to deal -in terms of computation resources- with a lighter database; after that, we configure the date variable in order to be able to select only the period of years with full weather events report; finall, we select the events with population impact or economic damage.
initial_data <- weatherevents %>% select(BGN_DATE, EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP, FATALITIES, INJURIES)
# Remove the original raw dataset from environment
remove(weatherevents)
# Format the BGN_DATE variable as a date
initial_data$BGN_DATE <- as.Date(initial_data$BGN_DATE, "%m/%d/%Y")
initial_data$YEAR <- year(initial_data$BGN_DATE)
# Only use events since 1996
# Only use events with either health impact or economic damage
analysis_data <- initial_data %>%
filter(YEAR >= 1996) %>%
filter(PROPDMG > 0 | CROPDMG > 0 | FATALITIES > 0 | INJURIES > 0)
The economical damages variables provided in the weather events dataset require some configurations. As you can see below, each variable - CROPDMG and PROPDMG - comes with a separate exponent - CROPDMGEXP and PROPDMGEXP. In both cases, the coded exponents are: “” (empty), B, K and M. We replace that character strings for numeric values in both variables, in order to be able to calculate the total economic damages by each weather event.
# Original exponents in CROPDMGEXP
table(analysis_data$CROPDMGEXP)
##
## B K M
## 102767 2 96787 1762
# Create new variable (CROPDMGFACTOR) replacing coded exponents with numeric values
analysis_data$CROPDMGFACTOR[(analysis_data$CROPDMGEXP == "")] <- 10^0
analysis_data$CROPDMGFACTOR[(analysis_data$CROPDMGEXP == "K")] <- 10^3
analysis_data$CROPDMGFACTOR[(analysis_data$CROPDMGEXP == "M")] <- 10^6
analysis_data$CROPDMGFACTOR[(analysis_data$CROPDMGEXP == "B")] <- 10^9
# New variable CROPDMGFACTOR
table(analysis_data$CROPDMGFACTOR)
##
## 1 1000 1e+06 1e+09
## 102767 96787 1762 2
# Original exponents in PROPDMGEXP
table(analysis_data$PROPDMGEXP)
##
## B K M
## 8448 32 185474 7364
# Create new variable (PROPDMGFACTOR) replacing coded exponents with numeric values
analysis_data$PROPDMGFACTOR[(analysis_data$PROPDMGEXP == "")] <- 10^0
analysis_data$PROPDMGFACTOR[(analysis_data$PROPDMGEXP == "K")] <- 10^3
analysis_data$PROPDMGFACTOR[(analysis_data$PROPDMGEXP == "M")] <- 10^6
analysis_data$PROPDMGFACTOR[(analysis_data$PROPDMGEXP == "B")] <- 10^9
# New variable PROPDMGFACTOR
table(analysis_data$PROPDMGFACTOR)
##
## 1 1000 1e+06 1e+09
## 8448 185474 7364 32
As far as both crop and property damages (in USD) refer to an economic cost, the distinction is irrelevant. Therefore, both variables are multiplied by their corresponding factor and added to form a new variable ECONOMIC_COST.
analysis_data <- mutate(analysis_data, ECONOMIC_COST = PROPDMG * PROPDMGFACTOR + CROPDMG * CROPDMGFACTOR)
The distinction between injuries and fatalities is irrelevant for our analysis: for exploratory purposes we want an aggregated indicator about population health impact of weather events. Therefore both variables are added to form a new variable HEALTHIMP.
analysis_data <- mutate(analysis_data, HEALTHIMP = FATALITIES + INJURIES)
The weather events variable requires further examination. After converting the variable EVTYPE to uppercase, there are still 186 different event types listed. According to the NOAA there should be only 48. So there are a lot of duplicates.
analysis_data$EVTYPE <- toupper(analysis_data$EVTYPE)
dim(data.frame(table(analysis_data$EVTYPE)))
## [1] 186 2
As we are interested in the most harmful weather events there is no point cleaning all the cases; that task would require quite a big effort as well. We will aggregate the dataset by the event type variable (EVTYPE), summing up in one case the economic costs, and the health impact on the other case. Only event types above the 95% quantile both in economic costs and health impact are going to be cleaned for visualization purposes.
Within the economic cost database, there is only one event type above the 95% quantile, which is not compliant to the official types: HURRICANE.
economic <- analysis_data %>% group_by(EVTYPE) %>% summarise(cost = sum(ECONOMIC_COST))
economic <- filter(economic, cost > quantile(economic$cost, probs = 0.95))
economic
## # A tibble: 10 × 2
## EVTYPE cost
## <chr> <dbl>
## 1 DROUGHT 14413667000
## 2 FLASH FLOOD 16557105610
## 3 FLOOD 148919611950
## 4 HAIL 17071172870
## 5 HIGH WIND 5881421660
## 6 HURRICANE 14554229010
## 7 HURRICANE/TYPHOON 71913712800
## 8 STORM SURGE 43193541000
## 9 TORNADO 24900370720
## 10 TROPICAL STORM 8320186550
With the following code chunk the event type was unified according the official type: HURRICANE (TYPHOON). This, within the original analysis_data.
analysis_data$EVTYPE[(analysis_data$EVTYPE == "HURRICANE")] <- "HURRICANE (TYPHOON)"
analysis_data$EVTYPE[(analysis_data$EVTYPE == "HURRICANE/TYPHOON")] <- "HURRICANE (TYPHOON)"
Within the health impact database, there are only two event types above the 95% quantile of impact, which are not compliant to the official types:TSTM WIND.
health <- analysis_data %>% group_by(EVTYPE) %>% summarise(impact = sum(HEALTHIMP))
health <- filter(health, impact > quantile(health$impact, probs = 0.95))
health
## # A tibble: 10 × 2
## EVTYPE impact
## <chr> <dbl>
## 1 EXCESSIVE HEAT 8188
## 2 FLASH FLOOD 2561
## 3 FLOOD 7172
## 4 HEAT 1459
## 5 HURRICANE (TYPHOON) 1446
## 6 LIGHTNING 4792
## 7 THUNDERSTORM WIND 1530
## 8 TORNADO 22178
## 9 TSTM WIND 3870
## 10 WINTER STORM 1483
With the following code chunk the event type was unified according the official type: THUNDERSTORM WIND. This, within the original analysis_data.
analysis_data$EVTYPE[(analysis_data$EVTYPE == "TSTM WIND")] <- "THUNDERSTORM WIND"
With the configurations above, the processing of our analysis data is ready. As we make the modifications over the original analysis data it will be needed to recalculate the economic damage and health impact to build the definitive results.
The cleaned data frame analysis_data is been aggregated per EVTYPE and provided in a descending order in the new data frame healthImpact. Then, a visualization of the weather events and their respective health impact over population is built.
The barchart shows that Tornados are the most harmful weather events for people’s health.
# Clean and ordered dataframe
healthImpact <- analysis_data %>%
group_by(EVTYPE) %>%
summarise(HEALTHIMP = sum(HEALTHIMP)) %>%
arrange(desc(HEALTHIMP))
# Visualization: health impact over population
ggplot(healthImpact[1:10,], aes(x=reorder(EVTYPE, -HEALTHIMP),y=HEALTHIMP,color=EVTYPE)) +
geom_bar(stat="identity", fill="white") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
xlab("Event") + ylab("Number of fatalities and injuries") +
theme(legend.position="none") +
ggtitle("Fatalities and injuries in the US caused by severe weather events")
The cleaned data frame analysis_data is been aggregated per EVTYPE and provided in a descending order in the new data frame economicCost Then, a visualization of the weather events and their respective economic costs is built.
The barchart shows that Floods cause the biggest economical damages.
# Clean and ordered dataframe
economicCost <- analysis_data %>%
group_by(EVTYPE) %>%
summarise(ECONOMIC_COST = sum(ECONOMIC_COST)) %>%
arrange(desc(ECONOMIC_COST))
# Visualization: health impact over population
ggplot(economicCost[1:10,], aes(x=reorder(EVTYPE, -ECONOMIC_COST),y=ECONOMIC_COST,color=EVTYPE)) +
geom_bar(stat="identity", fill="white") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
xlab("Event") + ylab("Economic cost in USD") +
theme(legend.position="none") +
ggtitle("Economic cost in the US caused by severe weather events")