Synopsis

Weather events are often thought of as some of the most fickle, yet unavoidable realities of living on this earth. Weather can bring joy and prosperity to the world, but also tragedy and destruction. Having a greater understanding of which events contribute to the latter may prove to be beneficial when determining what to expect and how to properly anticipate the implications of such events. In this analysis, I analyzed data from the U.S National Oceanic and Atmospheric Administration’s storm database spanning from the years 1950 to 2011 in order to answer two main questions:

  1. Which events are most harmful to population health?
  2. Which types of events have the greatest economic consequences?

After transforming and aggregating the data into analyzable formats, I created two tables that helped to answer the questions at hand. Through my analysis, I found that: tornadoes, by far, account for the largest amounts of deaths and injuries to the US population, while flooding and hurricanes account for the highest amount of damages to property and crops. The rest of this document details the entire process of the analysis as well as the entirely of the code used.

Data Processing

Loading in Packages

To being, I will first load in the packages I will be using. Tidyverse and magrittr are used for most of my general analyses as they streamline much of the process through useful functions and sub-packages contained within them (namely tidyverse). This is also my first time using the gt package, which is what will be used in order to create the tables for this analysis. The gtExtras package is an add-on to the gt package (includes some additional customizable features like themes) and RColorBrewer is used to add color palettes to the cells of the tables.

library(tidyverse)
library(magrittr)
library(gt)
library(gtExtras)
library(RColorBrewer)

Loading in Data

We can now start to load/read in the data. I first code a simple “if” condition that creates a directory named “CourseProject” if there isn’t one that exists already. It then downloads the data to that directory under the file name “storm_data.csv”. From there, the working directory is set and the data is read in and assigned to a dataframe named data.

if(!file.exists("CourseProject")){
        dir.create("CourseProject")
        download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
                        "./CourseProject/storm_data.csv")
        }
setwd("~/CourseProject")
data <- read.csv("storm_data.csv")

Preliminary Analysis

Now to run some preliminary tests/checks to gain some insight on our data. As you can see, our dataset consists of 902,297 rows and 37 columns:

dim(data)
## [1] 902297     37

Previewing the first 6 rows of the dataset:

head(data)
##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE  EVTYPE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL TORNADO
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL TORNADO
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL TORNADO
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL TORNADO
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL TORNADO
## 6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL TORNADO
##   BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
## 1         0                                               0         NA
## 2         0                                               0         NA
## 3         0                                               0         NA
## 4         0                                               0         NA
## 5         0                                               0         NA
## 6         0                                               0         NA
##   END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES PROPDMG
## 1         0                      14.0   100 3   0          0       15    25.0
## 2         0                       2.0   150 2   0          0        0     2.5
## 3         0                       0.1   123 2   0          0        2    25.0
## 4         0                       0.0   100 2   0          0        2     2.5
## 5         0                       0.0   150 2   0          0        2     2.5
## 6         0                       1.5   177 2   0          0        6     2.5
##   PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE LONGITUDE
## 1          K       0                                         3040      8812
## 2          K       0                                         3042      8755
## 3          K       0                                         3340      8742
## 4          K       0                                         3458      8626
## 5          K       0                                         3412      8642
## 6          K       0                                         3450      8748
##   LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1       3051       8806              1
## 2          0          0              2
## 3          0          0              3
## 4          0          0              4
## 5          0          0              5
## 6          0          0              6

Running a summary on the variables of the dataset:

summary(data)
##     STATE__       BGN_DATE           BGN_TIME          TIME_ZONE        
##  Min.   : 1.0   Length:902297      Length:902297      Length:902297     
##  1st Qu.:19.0   Class :character   Class :character   Class :character  
##  Median :30.0   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :31.2                                                           
##  3rd Qu.:45.0                                                           
##  Max.   :95.0                                                           
##                                                                         
##      COUNTY       COUNTYNAME           STATE              EVTYPE         
##  Min.   :  0.0   Length:902297      Length:902297      Length:902297     
##  1st Qu.: 31.0   Class :character   Class :character   Class :character  
##  Median : 75.0   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :100.6                                                           
##  3rd Qu.:131.0                                                           
##  Max.   :873.0                                                           
##                                                                          
##    BGN_RANGE          BGN_AZI           BGN_LOCATI          END_DATE        
##  Min.   :   0.000   Length:902297      Length:902297      Length:902297     
##  1st Qu.:   0.000   Class :character   Class :character   Class :character  
##  Median :   0.000   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :   1.484                                                           
##  3rd Qu.:   1.000                                                           
##  Max.   :3749.000                                                           
##                                                                             
##    END_TIME           COUNTY_END COUNTYENDN       END_RANGE       
##  Length:902297      Min.   :0    Mode:logical   Min.   :  0.0000  
##  Class :character   1st Qu.:0    NA's:902297    1st Qu.:  0.0000  
##  Mode  :character   Median :0                   Median :  0.0000  
##                     Mean   :0                   Mean   :  0.9862  
##                     3rd Qu.:0                   3rd Qu.:  0.0000  
##                     Max.   :0                   Max.   :925.0000  
##                                                                   
##    END_AZI           END_LOCATI            LENGTH              WIDTH         
##  Length:902297      Length:902297      Min.   :   0.0000   Min.   :   0.000  
##  Class :character   Class :character   1st Qu.:   0.0000   1st Qu.:   0.000  
##  Mode  :character   Mode  :character   Median :   0.0000   Median :   0.000  
##                                        Mean   :   0.2301   Mean   :   7.503  
##                                        3rd Qu.:   0.0000   3rd Qu.:   0.000  
##                                        Max.   :2315.0000   Max.   :4400.000  
##                                                                              
##        F               MAG            FATALITIES          INJURIES        
##  Min.   :0.0      Min.   :    0.0   Min.   :  0.0000   Min.   :   0.0000  
##  1st Qu.:0.0      1st Qu.:    0.0   1st Qu.:  0.0000   1st Qu.:   0.0000  
##  Median :1.0      Median :   50.0   Median :  0.0000   Median :   0.0000  
##  Mean   :0.9      Mean   :   46.9   Mean   :  0.0168   Mean   :   0.1557  
##  3rd Qu.:1.0      3rd Qu.:   75.0   3rd Qu.:  0.0000   3rd Qu.:   0.0000  
##  Max.   :5.0      Max.   :22000.0   Max.   :583.0000   Max.   :1700.0000  
##  NA's   :843563                                                           
##     PROPDMG         PROPDMGEXP           CROPDMG         CROPDMGEXP       
##  Min.   :   0.00   Length:902297      Min.   :  0.000   Length:902297     
##  1st Qu.:   0.00   Class :character   1st Qu.:  0.000   Class :character  
##  Median :   0.00   Mode  :character   Median :  0.000   Mode  :character  
##  Mean   :  12.06                      Mean   :  1.527                     
##  3rd Qu.:   0.50                      3rd Qu.:  0.000                     
##  Max.   :5000.00                      Max.   :990.000                     
##                                                                           
##      WFO             STATEOFFIC         ZONENAMES            LATITUDE   
##  Length:902297      Length:902297      Length:902297      Min.   :   0  
##  Class :character   Class :character   Class :character   1st Qu.:2802  
##  Mode  :character   Mode  :character   Mode  :character   Median :3540  
##                                                           Mean   :2875  
##                                                           3rd Qu.:4019  
##                                                           Max.   :9706  
##                                                           NA's   :47    
##    LONGITUDE        LATITUDE_E     LONGITUDE_       REMARKS         
##  Min.   :-14451   Min.   :   0   Min.   :-14455   Length:902297     
##  1st Qu.:  7247   1st Qu.:   0   1st Qu.:     0   Class :character  
##  Median :  8707   Median :   0   Median :     0   Mode  :character  
##  Mean   :  6940   Mean   :1452   Mean   :  3509                     
##  3rd Qu.:  9605   3rd Qu.:3549   3rd Qu.:  8735                     
##  Max.   : 17124   Max.   :9706   Max.   :106220                     
##                   NA's   :40                                        
##      REFNUM      
##  Min.   :     1  
##  1st Qu.:225575  
##  Median :451149  
##  Mean   :451149  
##  3rd Qu.:676723  
##  Max.   :902297  
## 

The main variable of interest to us is EVTYPE, which simply denotes the name of the weather event. Notice the summaries for the FATALITIES and INJURIES variables (the ones we will be examining to answer our first question). The maximum values are both significantly higher than their respective mean values. This can often indicate a large skew in the data that should be taken into consideration, especially when it comes to choosing how to graph/plot our findings (more on this later).

 

Data Transformation & Table Preparation

Before we can create our final polished tables that will be presented as part of the “results” section, we first need to transform the dataset to create the sub-tables necessary to answer our questions.

 

Creating the tables to answer our first question (Which events are most harmful to human population health?)

There are two relevant variables available to us that can help answer our first question: FATALITIES and INJURIES. Instead of simply grouping on the EVTYPE variable and creating two summary columns (sum of deaths and sum of fatalities) for the same table, I decided to create two separate tables that grouped the EVTYPE variable on deaths and injuries independently. These two separate tables will then be column-binded to each other into one table.

First, we will create the sub-table named most_deaths that calculates the total amount of deaths for each event (arranged from highest to lowest). Additionally, a new variable named PercentShare will be created within this table that calculates the percent share of deaths of each event to the total deaths of all events. The “head(10)” line of code returns only the top 10 events. Notice that in this case there is a “1” at the end of two of the variable names, “PercentShare1” and “Event1”. This is simply so that each column name is unique when we go to column-bind both tables later on (otherwise we would encounter errors when attempting to create the gt tables).

most_deaths <- data %>% group_by(EVTYPE) %>% 
                        summarise(Deaths = sum(FATALITIES)) %>%
                        arrange(desc(Deaths)) %>%
                        mutate(PercentShare1 = round((Deaths/sum(data$FATALITIES)*100), digits = 1)) %>%
                        head(10) %>%
                        rename("Event1" = "EVTYPE")

Then, we will create the sub-table named most_injuries that calculates the total amount of injuries for each event. This table follows the same exact format as the last one, just with a different variable.

most_injuries <- data %>% group_by(EVTYPE) %>% 
                          summarise(Injuries = sum(INJURIES)) %>%
                          arrange(desc(Injuries)) %>%
                          mutate(PercentShare2 = round((Injuries/sum(data$INJURIES)*100), digits = 1)) %>%
                          head(10) %>%
                          rename("Event2" = "EVTYPE")

Lastly, we combine both tables into one final table named harmful_events by simply merging the most_deaths and most_injuries tables together using the cbind() function:

harmful_events <- cbind(most_deaths, most_injuries)

 

Creating the tables to answer our second question (Which events have the greatest economic consequences?)

To analyze this question, there are four main variables (besides EVTYPE) that are of importance to us: PROPDMG, PROPDMGEXP, CROPDMG, and CROPDMGEXP.

  • PROPDMG and CROPDMG are numeric variables that represent estimated dollar amounts of damage to property and crops for each event, respectively.
  • PROPDMGEXP and CROPDMGEXP on the other hand, are character variables that are comprised of some symbols and numeric values which represent a multiplier to which the values of PROPDMG and CROPDMG should be multiplied by depending on what the value is.

Lets take a quick look at these variables to gain a better understanding:

summary(data$PROPDMG)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    0.00   12.06    0.50 5000.00
summary(data$CROPDMG)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   1.527   0.000 990.000
unique(data$PROPDMGEXP)
##  [1] "K" "M" ""  "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-" "1" "8"
unique(data$CROPDMGEXP)
## [1] ""  "M" "K" "m" "B" "?" "0" "k" "2"

As you can see, there are multiple different characters that the -EXP variables can take on, some of which are obvious (k = thousands, M = millions, B = billions) while others not so much.

 

At this point, there was nothing explicitly stated in the NOAA documentation that clarified what these values should represent. I did however end up finding this very helpful article from someone who had already tackled this problem in the past. Essentially, they simply compared specific observations of each unique -EXP value from the raw dataset to NOAA’s official online database. By comparing the total dollar amounts shown on NOAA’s online database, it is then easy to piece together what the multiplier should be for any given -EXP symbol. The author’s name is not listed on the article, but they credit David Hood for inspiring the question and Eddie Song for doing a more accurate analysis.

 

Now that we have a better understanding of these variables, we can continue making our sub-tables. First, we will simply create a subset of our data named sub containing only the relevant columns necessary for this question:

sub <- select(data, EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)

We now mutate our sub table in order to create three new columns: prop_damage, crop_damage, and total_damage. The variable prop_damage uses the case_when() function to multiply PROPDMG by the corresponding value dictated by each PROPDMGEXP variable, depending on what value it takes (as described in the article). This gives us the final dollar amount in damage for each observation/event. The same is done to calculate crop damage in the new crop_damage variable. Lastly, the total_damage variable is simply a result of summing the prop_damage and crop_damage variables.

sub %<>% mutate(prop_damage = case_when(PROPDMGEXP == "H" ~ PROPDMG*100,
                                        PROPDMGEXP == "h" ~ PROPDMG*100,
                                        PROPDMGEXP == "K" ~ PROPDMG*1000,
                                        PROPDMGEXP == "M" ~ PROPDMG*1000000,
                                        PROPDMGEXP == "m" ~ PROPDMG*1000000,
                                        PROPDMGEXP == "B" ~ PROPDMG*1000000000,
                                        PROPDMGEXP %in% c(0:8) ~ PROPDMG*10, 
                                        PROPDMGEXP == "+" ~ PROPDMG*1,
                                        PROPDMGEXP == "-" ~ PROPDMG*0,
                                        PROPDMGEXP == "" ~ PROPDMG*0,
                                        PROPDMGEXP == "?" ~ PROPDMG*0,
                                        .default = as.numeric(PROPDMG))) %>%
        mutate(crop_damage = case_when(CROPDMGEXP == "k" ~ CROPDMG*1000,
                                       CROPDMGEXP == "K" ~ CROPDMG*1000,
                                       CROPDMGEXP == "M" ~ CROPDMG*1000000,
                                       CROPDMGEXP == "m" ~ CROPDMG*1000000,
                                       CROPDMGEXP == "B" ~ CROPDMG*1000000000,
                                       CROPDMGEXP %in% c(0:8) ~ CROPDMG*10,
                                       CROPDMGEXP == "" ~ CROPDMG*0,
                                       CROPDMGEXP == "?" ~ CROPDMG*0,
                                       .default = as.numeric(CROPDMG))) %>%
        mutate(total_damage = (prop_damage + crop_damage))

We can now finally create the last aggregate table which lists the top 10 events by total damage in descending order. Once again, the process is virtually identical to the other two tables that we created to answer the first question.

most_damage <- sub %>% group_by(EVTYPE) %>%
                       summarise(total_damage = sum(total_damage)) %>%
                       arrange(desc(total_damage)) %>%
                       mutate(PercentShare = round((total_damage/sum(sub$total_damage)*100), digits = 1)) %>%
                       head(10)

 

Results

A quick note regarding data visualization and skew:

When deciding whether to create bar plots (or any other visualization) for this analysis, I first previewed the aggregate tables in order to see if there was any noticeable skew. Indeed there was, as you will soon see in a second. There are several ways to tackle skew in visualizations, one of the best being plotting the log values of each metric (deaths, injuries, and total damage) instead of their raw values. Even then, I still was not satisfied by the way this method visualized this particular dataset. Because of this, I simply decided to leave the data in tabular format (albeit in a streamlined and presentable manner) in order to capture the nuance and magnitude of the data.

Which events are most harmful to population health?

We can now use our harmful_events table and pipe it into our gt() function that initiates the creation of a gt table. Every other line/function of code are all add-ons that contribute to the look and text of the table. I decided to use the 538 theme, which mimics the style of tables/visualizations used by the online publication site FiveThirtyEight.

harmful_events %>% gt() %>% 
                tab_header(title = md("**Top Ten Most Harmful Weather Events**"),
                           subtitle = md("**From 1950 to 2011**")) %>%
                tab_source_note(source_note = md("**Source: U.S. National Oceanic and 
                                                 Atmospheric Administration (NOAA)**")) %>%
                tab_spanner(label = md("**By Deaths**"),
                            columns = 1:3) %>%
                tab_spanner(label = md("**By Injuries**"),
                            columns = 4:6) %>%
                cols_label(Event1 = md("**Event**"),
                           Event2 = md("**Event**"),
                           PercentShare1 = md("**% Share**"),
                           PercentShare2 = md("**% Share**"),
                           Deaths = md("**Deaths**"),
                           Injuries = md("**Injuries**")) %>%
                fmt_percent(columns = c(3,6), 
                            scale_values = F, 
                            decimals = 1) %>%
                cols_align(align = "center",
                           columns = c(3,6,2,5)) %>%
                tab_style(style = cell_borders(sides = "right",
                                               weight = px(2)),
                          locations = cells_body(columns = 3)) %>%
                gt_theme_538() %>%
                gt_color_rows(columns = c(3,6),
                              palette = "ggsci::red_material",
                              pal_type = "continuous")
Top Ten Most Harmful Weather Events
From 1950 to 2011
By Deaths By Injuries
Event Deaths % Share Event Injuries % Share
TORNADO 5633 37.2% TORNADO 91346 65.0%
EXCESSIVE HEAT 1903 12.6% TSTM WIND 6957 5.0%
FLASH FLOOD 978 6.5% FLOOD 6789 4.8%
HEAT 937 6.2% EXCESSIVE HEAT 6525 4.6%
LIGHTNING 816 5.4% LIGHTNING 5230 3.7%
TSTM WIND 504 3.3% HEAT 2100 1.5%
FLOOD 470 3.1% ICE STORM 1975 1.4%
RIP CURRENT 368 2.4% FLASH FLOOD 1777 1.3%
HIGH WIND 248 1.6% THUNDERSTORM WIND 1488 1.1%
AVALANCHE 224 1.5% HAIL 1361 1.0%
Source: U.S. National Oceanic and Atmospheric Administration (NOAA)

As you can see, tornadoes take the #1 spot of being the most fatal and destructive weather events by a large margin. Excessive heat takes the #2 spot w.r.t. fatalities, followed by flooding and more heat. As far as injuries go, things such as thunderstorm winds, floods, and excessive heat are culpable for thousands of injuries, yet none of them come close to tornadoes.

 

Which types of events have the greatest economic consequences?

We can now use our most_damage sub-table to create another gt table, the same as we did previously.

most_damage %>% gt() %>% 
                tab_header(title = md("**Top Ten Most Costly Weather Events**"),
                           subtitle = md("**From 1950 to 2011**")) %>%
                tab_source_note(source_note = md("**Source: U.S. National Oceanic and 
                                                 Atmospheric Administration (NOAA)**")) %>%
                cols_label(EVTYPE = md("**Event**"),
                           total_damage = md("**Total Damage (Dollars)**"),
                           PercentShare = md("**% Share**")) %>%
                fmt_percent(columns = 3,
                            scale_values = F,
                            decimals = 1) %>%
                fmt_currency(columns = 2,
                             decimals = 2,
                             suffixing = T) %>%
                cols_align(align = "center",
                           columns = c(2,3)) %>%
                gt_theme_538() %>%
                gt_color_rows(columns = 3,
                              palette = brewer.pal(9, "Greens"),
                              pal_type = "continuous",
                              domain = c(0:32)) %>%
                tab_footnote(footnote = "Total Damage = Property Damage + Crop Damage",
                             locations = cells_column_labels(columns = 2))
Top Ten Most Costly Weather Events
From 1950 to 2011
Event Total Damage (Dollars)1 % Share
FLOOD $150.32B 31.6%
HURRICANE/TYPHOON $71.91B 15.1%
TORNADO $57.35B 12.0%
STORM SURGE $43.32B 9.1%
HAIL $18.76B 3.9%
FLASH FLOOD $17.56B 3.7%
DROUGHT $15.02B 3.2%
HURRICANE $14.61B 3.1%
RIVER FLOOD $10.15B 2.1%
ICE STORM $8.97B 1.9%
Source: U.S. National Oceanic and Atmospheric Administration (NOAA)
1 Total Damage = Property Damage + Crop Damage

As you can see, flooding easily takes the #1 spot of being the most costly weather event to property and crops at an astounding $150 billion worth of damages, making up ~31% of all weather-related damages. Hurricanes/typhoons take the #2 spot with a share of ~15% of all damages, and Tornadoes a close third at ~12%.