Weather events are often thought of as some of the most fickle, yet unavoidable realities of living on this earth. Weather can bring joy and prosperity to the world, but also tragedy and destruction. Having a greater understanding of which events contribute to the latter may prove to be beneficial when determining what to expect and how to properly anticipate the implications of such events. In this analysis, I analyzed data from the U.S National Oceanic and Atmospheric Administration’s storm database spanning from the years 1950 to 2011 in order to answer two main questions:
After transforming and aggregating the data into analyzable formats, I created two tables that helped to answer the questions at hand. Through my analysis, I found that: tornadoes, by far, account for the largest amounts of deaths and injuries to the US population, while flooding and hurricanes account for the highest amount of damages to property and crops. The rest of this document details the entire process of the analysis as well as the entirely of the code used.
To being, I will first load in the packages I will be using. Tidyverse and magrittr are used for most of my general analyses as they streamline much of the process through useful functions and sub-packages contained within them (namely tidyverse). This is also my first time using the gt package, which is what will be used in order to create the tables for this analysis. The gtExtras package is an add-on to the gt package (includes some additional customizable features like themes) and RColorBrewer is used to add color palettes to the cells of the tables.
library(tidyverse)
library(magrittr)
library(gt)
library(gtExtras)
library(RColorBrewer)
We can now start to load/read in the data. I first code a simple “if” condition that creates a directory named “CourseProject” if there isn’t one that exists already. It then downloads the data to that directory under the file name “storm_data.csv”. From there, the working directory is set and the data is read in and assigned to a dataframe named data.
if(!file.exists("CourseProject")){
dir.create("CourseProject")
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
"./CourseProject/storm_data.csv")
}
setwd("~/CourseProject")
data <- read.csv("storm_data.csv")
Now to run some preliminary tests/checks to gain some insight on our data. As you can see, our dataset consists of 902,297 rows and 37 columns:
dim(data)
## [1] 902297 37
Previewing the first 6 rows of the dataset:
head(data)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE EVTYPE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL TORNADO
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL TORNADO
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL TORNADO
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL TORNADO
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL TORNADO
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL TORNADO
## BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
## 1 0 0 NA
## 2 0 0 NA
## 3 0 0 NA
## 4 0 0 NA
## 5 0 0 NA
## 6 0 0 NA
## END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES PROPDMG
## 1 0 14.0 100 3 0 0 15 25.0
## 2 0 2.0 150 2 0 0 0 2.5
## 3 0 0.1 123 2 0 0 2 25.0
## 4 0 0.0 100 2 0 0 2 2.5
## 5 0 0.0 150 2 0 0 2 2.5
## 6 0 1.5 177 2 0 0 6 2.5
## PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE LONGITUDE
## 1 K 0 3040 8812
## 2 K 0 3042 8755
## 3 K 0 3340 8742
## 4 K 0 3458 8626
## 5 K 0 3412 8642
## 6 K 0 3450 8748
## LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3051 8806 1
## 2 0 0 2
## 3 0 0 3
## 4 0 0 4
## 5 0 0 5
## 6 0 0 6
Running a summary on the variables of the dataset:
summary(data)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE
## Min. : 1.0 Length:902297 Length:902297 Length:902297
## 1st Qu.:19.0 Class :character Class :character Class :character
## Median :30.0 Mode :character Mode :character Mode :character
## Mean :31.2
## 3rd Qu.:45.0
## Max. :95.0
##
## COUNTY COUNTYNAME STATE EVTYPE
## Min. : 0.0 Length:902297 Length:902297 Length:902297
## 1st Qu.: 31.0 Class :character Class :character Class :character
## Median : 75.0 Mode :character Mode :character Mode :character
## Mean :100.6
## 3rd Qu.:131.0
## Max. :873.0
##
## BGN_RANGE BGN_AZI BGN_LOCATI END_DATE
## Min. : 0.000 Length:902297 Length:902297 Length:902297
## 1st Qu.: 0.000 Class :character Class :character Class :character
## Median : 0.000 Mode :character Mode :character Mode :character
## Mean : 1.484
## 3rd Qu.: 1.000
## Max. :3749.000
##
## END_TIME COUNTY_END COUNTYENDN END_RANGE
## Length:902297 Min. :0 Mode:logical Min. : 0.0000
## Class :character 1st Qu.:0 NA's:902297 1st Qu.: 0.0000
## Mode :character Median :0 Median : 0.0000
## Mean :0 Mean : 0.9862
## 3rd Qu.:0 3rd Qu.: 0.0000
## Max. :0 Max. :925.0000
##
## END_AZI END_LOCATI LENGTH WIDTH
## Length:902297 Length:902297 Min. : 0.0000 Min. : 0.000
## Class :character Class :character 1st Qu.: 0.0000 1st Qu.: 0.000
## Mode :character Mode :character Median : 0.0000 Median : 0.000
## Mean : 0.2301 Mean : 7.503
## 3rd Qu.: 0.0000 3rd Qu.: 0.000
## Max. :2315.0000 Max. :4400.000
##
## F MAG FATALITIES INJURIES
## Min. :0.0 Min. : 0.0 Min. : 0.0000 Min. : 0.0000
## 1st Qu.:0.0 1st Qu.: 0.0 1st Qu.: 0.0000 1st Qu.: 0.0000
## Median :1.0 Median : 50.0 Median : 0.0000 Median : 0.0000
## Mean :0.9 Mean : 46.9 Mean : 0.0168 Mean : 0.1557
## 3rd Qu.:1.0 3rd Qu.: 75.0 3rd Qu.: 0.0000 3rd Qu.: 0.0000
## Max. :5.0 Max. :22000.0 Max. :583.0000 Max. :1700.0000
## NA's :843563
## PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## Min. : 0.00 Length:902297 Min. : 0.000 Length:902297
## 1st Qu.: 0.00 Class :character 1st Qu.: 0.000 Class :character
## Median : 0.00 Mode :character Median : 0.000 Mode :character
## Mean : 12.06 Mean : 1.527
## 3rd Qu.: 0.50 3rd Qu.: 0.000
## Max. :5000.00 Max. :990.000
##
## WFO STATEOFFIC ZONENAMES LATITUDE
## Length:902297 Length:902297 Length:902297 Min. : 0
## Class :character Class :character Class :character 1st Qu.:2802
## Mode :character Mode :character Mode :character Median :3540
## Mean :2875
## 3rd Qu.:4019
## Max. :9706
## NA's :47
## LONGITUDE LATITUDE_E LONGITUDE_ REMARKS
## Min. :-14451 Min. : 0 Min. :-14455 Length:902297
## 1st Qu.: 7247 1st Qu.: 0 1st Qu.: 0 Class :character
## Median : 8707 Median : 0 Median : 0 Mode :character
## Mean : 6940 Mean :1452 Mean : 3509
## 3rd Qu.: 9605 3rd Qu.:3549 3rd Qu.: 8735
## Max. : 17124 Max. :9706 Max. :106220
## NA's :40
## REFNUM
## Min. : 1
## 1st Qu.:225575
## Median :451149
## Mean :451149
## 3rd Qu.:676723
## Max. :902297
##
The main variable of interest to us is EVTYPE, which simply denotes the name of the weather event. Notice the summaries for the FATALITIES and INJURIES variables (the ones we will be examining to answer our first question). The maximum values are both significantly higher than their respective mean values. This can often indicate a large skew in the data that should be taken into consideration, especially when it comes to choosing how to graph/plot our findings (more on this later).
Before we can create our final polished tables that will be presented as part of the “results” section, we first need to transform the dataset to create the sub-tables necessary to answer our questions.
There are two relevant variables available to us that can help answer our first question: FATALITIES and INJURIES. Instead of simply grouping on the EVTYPE variable and creating two summary columns (sum of deaths and sum of fatalities) for the same table, I decided to create two separate tables that grouped the EVTYPE variable on deaths and injuries independently. These two separate tables will then be column-binded to each other into one table.
First, we will create the sub-table named most_deaths that calculates the total amount of deaths for each event (arranged from highest to lowest). Additionally, a new variable named PercentShare will be created within this table that calculates the percent share of deaths of each event to the total deaths of all events. The “head(10)” line of code returns only the top 10 events. Notice that in this case there is a “1” at the end of two of the variable names, “PercentShare1” and “Event1”. This is simply so that each column name is unique when we go to column-bind both tables later on (otherwise we would encounter errors when attempting to create the gt tables).
most_deaths <- data %>% group_by(EVTYPE) %>%
summarise(Deaths = sum(FATALITIES)) %>%
arrange(desc(Deaths)) %>%
mutate(PercentShare1 = round((Deaths/sum(data$FATALITIES)*100), digits = 1)) %>%
head(10) %>%
rename("Event1" = "EVTYPE")
Then, we will create the sub-table named most_injuries that calculates the total amount of injuries for each event. This table follows the same exact format as the last one, just with a different variable.
most_injuries <- data %>% group_by(EVTYPE) %>%
summarise(Injuries = sum(INJURIES)) %>%
arrange(desc(Injuries)) %>%
mutate(PercentShare2 = round((Injuries/sum(data$INJURIES)*100), digits = 1)) %>%
head(10) %>%
rename("Event2" = "EVTYPE")
Lastly, we combine both tables into one final table named harmful_events by simply merging the most_deaths and most_injuries tables together using the cbind() function:
harmful_events <- cbind(most_deaths, most_injuries)
To analyze this question, there are four main variables (besides EVTYPE) that are of importance to us: PROPDMG, PROPDMGEXP, CROPDMG, and CROPDMGEXP.
Lets take a quick look at these variables to gain a better understanding:
summary(data$PROPDMG)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 0.00 12.06 0.50 5000.00
summary(data$CROPDMG)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 1.527 0.000 990.000
unique(data$PROPDMGEXP)
## [1] "K" "M" "" "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-" "1" "8"
unique(data$CROPDMGEXP)
## [1] "" "M" "K" "m" "B" "?" "0" "k" "2"
As you can see, there are multiple different characters that the -EXP variables can take on, some of which are obvious (k = thousands, M = millions, B = billions) while others not so much.
At this point, there was nothing explicitly stated in the NOAA documentation that clarified what these values should represent. I did however end up finding this very helpful article from someone who had already tackled this problem in the past. Essentially, they simply compared specific observations of each unique -EXP value from the raw dataset to NOAA’s official online database. By comparing the total dollar amounts shown on NOAA’s online database, it is then easy to piece together what the multiplier should be for any given -EXP symbol. The author’s name is not listed on the article, but they credit David Hood for inspiring the question and Eddie Song for doing a more accurate analysis.
Now that we have a better understanding of these variables, we can continue making our sub-tables. First, we will simply create a subset of our data named sub containing only the relevant columns necessary for this question:
sub <- select(data, EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)
We now mutate our sub table in order to create three new columns: prop_damage, crop_damage, and total_damage. The variable prop_damage uses the case_when() function to multiply PROPDMG by the corresponding value dictated by each PROPDMGEXP variable, depending on what value it takes (as described in the article). This gives us the final dollar amount in damage for each observation/event. The same is done to calculate crop damage in the new crop_damage variable. Lastly, the total_damage variable is simply a result of summing the prop_damage and crop_damage variables.
sub %<>% mutate(prop_damage = case_when(PROPDMGEXP == "H" ~ PROPDMG*100,
PROPDMGEXP == "h" ~ PROPDMG*100,
PROPDMGEXP == "K" ~ PROPDMG*1000,
PROPDMGEXP == "M" ~ PROPDMG*1000000,
PROPDMGEXP == "m" ~ PROPDMG*1000000,
PROPDMGEXP == "B" ~ PROPDMG*1000000000,
PROPDMGEXP %in% c(0:8) ~ PROPDMG*10,
PROPDMGEXP == "+" ~ PROPDMG*1,
PROPDMGEXP == "-" ~ PROPDMG*0,
PROPDMGEXP == "" ~ PROPDMG*0,
PROPDMGEXP == "?" ~ PROPDMG*0,
.default = as.numeric(PROPDMG))) %>%
mutate(crop_damage = case_when(CROPDMGEXP == "k" ~ CROPDMG*1000,
CROPDMGEXP == "K" ~ CROPDMG*1000,
CROPDMGEXP == "M" ~ CROPDMG*1000000,
CROPDMGEXP == "m" ~ CROPDMG*1000000,
CROPDMGEXP == "B" ~ CROPDMG*1000000000,
CROPDMGEXP %in% c(0:8) ~ CROPDMG*10,
CROPDMGEXP == "" ~ CROPDMG*0,
CROPDMGEXP == "?" ~ CROPDMG*0,
.default = as.numeric(CROPDMG))) %>%
mutate(total_damage = (prop_damage + crop_damage))
We can now finally create the last aggregate table which lists the top 10 events by total damage in descending order. Once again, the process is virtually identical to the other two tables that we created to answer the first question.
most_damage <- sub %>% group_by(EVTYPE) %>%
summarise(total_damage = sum(total_damage)) %>%
arrange(desc(total_damage)) %>%
mutate(PercentShare = round((total_damage/sum(sub$total_damage)*100), digits = 1)) %>%
head(10)
When deciding whether to create bar plots (or any other visualization) for this analysis, I first previewed the aggregate tables in order to see if there was any noticeable skew. Indeed there was, as you will soon see in a second. There are several ways to tackle skew in visualizations, one of the best being plotting the log values of each metric (deaths, injuries, and total damage) instead of their raw values. Even then, I still was not satisfied by the way this method visualized this particular dataset. Because of this, I simply decided to leave the data in tabular format (albeit in a streamlined and presentable manner) in order to capture the nuance and magnitude of the data.
We can now use our harmful_events table and pipe it into our gt() function that initiates the creation of a gt table. Every other line/function of code are all add-ons that contribute to the look and text of the table. I decided to use the 538 theme, which mimics the style of tables/visualizations used by the online publication site FiveThirtyEight.
harmful_events %>% gt() %>%
tab_header(title = md("**Top Ten Most Harmful Weather Events**"),
subtitle = md("**From 1950 to 2011**")) %>%
tab_source_note(source_note = md("**Source: U.S. National Oceanic and
Atmospheric Administration (NOAA)**")) %>%
tab_spanner(label = md("**By Deaths**"),
columns = 1:3) %>%
tab_spanner(label = md("**By Injuries**"),
columns = 4:6) %>%
cols_label(Event1 = md("**Event**"),
Event2 = md("**Event**"),
PercentShare1 = md("**% Share**"),
PercentShare2 = md("**% Share**"),
Deaths = md("**Deaths**"),
Injuries = md("**Injuries**")) %>%
fmt_percent(columns = c(3,6),
scale_values = F,
decimals = 1) %>%
cols_align(align = "center",
columns = c(3,6,2,5)) %>%
tab_style(style = cell_borders(sides = "right",
weight = px(2)),
locations = cells_body(columns = 3)) %>%
gt_theme_538() %>%
gt_color_rows(columns = c(3,6),
palette = "ggsci::red_material",
pal_type = "continuous")
| Top Ten Most Harmful Weather Events | |||||
| From 1950 to 2011 | |||||
| By Deaths | By Injuries | ||||
|---|---|---|---|---|---|
| Event | Deaths | % Share | Event | Injuries | % Share |
| TORNADO | 5633 | 37.2% | TORNADO | 91346 | 65.0% |
| EXCESSIVE HEAT | 1903 | 12.6% | TSTM WIND | 6957 | 5.0% |
| FLASH FLOOD | 978 | 6.5% | FLOOD | 6789 | 4.8% |
| HEAT | 937 | 6.2% | EXCESSIVE HEAT | 6525 | 4.6% |
| LIGHTNING | 816 | 5.4% | LIGHTNING | 5230 | 3.7% |
| TSTM WIND | 504 | 3.3% | HEAT | 2100 | 1.5% |
| FLOOD | 470 | 3.1% | ICE STORM | 1975 | 1.4% |
| RIP CURRENT | 368 | 2.4% | FLASH FLOOD | 1777 | 1.3% |
| HIGH WIND | 248 | 1.6% | THUNDERSTORM WIND | 1488 | 1.1% |
| AVALANCHE | 224 | 1.5% | HAIL | 1361 | 1.0% |
| Source: U.S. National Oceanic and Atmospheric Administration (NOAA) | |||||
As you can see, tornadoes take the #1 spot of being the most fatal and destructive weather events by a large margin. Excessive heat takes the #2 spot w.r.t. fatalities, followed by flooding and more heat. As far as injuries go, things such as thunderstorm winds, floods, and excessive heat are culpable for thousands of injuries, yet none of them come close to tornadoes.
We can now use our most_damage sub-table to create another gt table, the same as we did previously.
most_damage %>% gt() %>%
tab_header(title = md("**Top Ten Most Costly Weather Events**"),
subtitle = md("**From 1950 to 2011**")) %>%
tab_source_note(source_note = md("**Source: U.S. National Oceanic and
Atmospheric Administration (NOAA)**")) %>%
cols_label(EVTYPE = md("**Event**"),
total_damage = md("**Total Damage (Dollars)**"),
PercentShare = md("**% Share**")) %>%
fmt_percent(columns = 3,
scale_values = F,
decimals = 1) %>%
fmt_currency(columns = 2,
decimals = 2,
suffixing = T) %>%
cols_align(align = "center",
columns = c(2,3)) %>%
gt_theme_538() %>%
gt_color_rows(columns = 3,
palette = brewer.pal(9, "Greens"),
pal_type = "continuous",
domain = c(0:32)) %>%
tab_footnote(footnote = "Total Damage = Property Damage + Crop Damage",
locations = cells_column_labels(columns = 2))
| Top Ten Most Costly Weather Events | ||
| From 1950 to 2011 | ||
| Event | Total Damage (Dollars)1 | % Share |
|---|---|---|
| FLOOD | $150.32B | 31.6% |
| HURRICANE/TYPHOON | $71.91B | 15.1% |
| TORNADO | $57.35B | 12.0% |
| STORM SURGE | $43.32B | 9.1% |
| HAIL | $18.76B | 3.9% |
| FLASH FLOOD | $17.56B | 3.7% |
| DROUGHT | $15.02B | 3.2% |
| HURRICANE | $14.61B | 3.1% |
| RIVER FLOOD | $10.15B | 2.1% |
| ICE STORM | $8.97B | 1.9% |
| Source: U.S. National Oceanic and Atmospheric Administration (NOAA) | ||
| 1 Total Damage = Property Damage + Crop Damage | ||
As you can see, flooding easily takes the #1 spot of being the most costly weather event to property and crops at an astounding $150 billion worth of damages, making up ~31% of all weather-related damages. Hurricanes/typhoons take the #2 spot with a share of ~15% of all damages, and Tornadoes a close third at ~12%.