Using the NOAA Storm Database to assess the impact of severe weather events on Public Health and the Economy in the United States

Synopsis

Severe weather events can have a substantial impact on both the health of the public and the economy. Events such as these can cause property and crop damage as well as injuries and even fatalities. Minimising the impact of such outcomes is of great concern the the USA as a whole.

In this project we use the U.S. National Oceanic and Atmospheric Administrations (NOAA) Storm Database the assess the impact of these severe weather events.

Data Processing

Libraries

library(ggplot2)
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.0.5
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(knitr)
## Warning: package 'knitr' was built under R version 4.0.5
library(markdown)
library(rmarkdown)
## Warning: package 'rmarkdown' was built under R version 4.0.5
library(RColorBrewer)

Downloading data

fileurl = 'https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2'
data_dir <- getwd()
if (!file.exists('./storm_data.bzip2')){
        download.file(fileurl,'./storm_data.csv.bz2', mode = 'wb')
}

Loading data

if(!exists("storm_data")) {
storm_data <- read.csv(bzfile("storm_data.csv.bz2"),header = TRUE)
}

Initial Data Exploration

We first run str() to get an overview of the data.

str(storm_data)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  "" "" "" "" ...
##  $ BGN_LOCATI: chr  "" "" "" "" ...
##  $ END_DATE  : chr  "" "" "" "" ...
##  $ END_TIME  : chr  "" "" "" "" ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  "" "" "" "" ...
##  $ END_LOCATI: chr  "" "" "" "" ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ WFO       : chr  "" "" "" "" ...
##  $ STATEOFFIC: chr  "" "" "" "" ...
##  $ ZONENAMES : chr  "" "" "" "" ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  "" "" "" "" ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

We interested in the most harmful events (indicated by the EVTYPE variable) with respect to the population health and economic consequences across the entirety of the United States; therefore, we will keep the following variables:

  • EVTYPE: Event Type e.g. Flood
  • FATALITIES: Number of fatalities caused by Event
  • INJURIES: Number of injuries caused by Event
  • PROPDMG: Property damage caused by event
  • PROPDMGEXP: The units for the property damage i.e. K = Thousands, M = Millions, B = Billions
  • CROPDMG: Crop damage caused by event
  • CROPDMGEXP: The units for the crop damage i.e. K = Thousands, M = Millions, B = Billions

We subset our data to only keep these variables.

variables <- c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")
data <- storm_data[variables]
dim(data)
## [1] 902297      7

Let’s take a look at the first few observations.

head(data)
##    EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 TORNADO          0       15    25.0          K       0           
## 2 TORNADO          0        0     2.5          K       0           
## 3 TORNADO          0        2    25.0          K       0           
## 4 TORNADO          0        2     2.5          K       0           
## 5 TORNADO          0        2     2.5          K       0           
## 6 TORNADO          0        6     2.5          K       0

We check for missing values in any of the seven variables. We see there are no missing values for any of the variables - this is an exceptionally nice dataset!

for (var in variables){
        print(sum(is.na(data$var)))
}
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 0

As noted above, the damage variables have associated exponent variables for thousands, millions and billions; we will have to combine these later for true damage amounts. Let’s take a look at the different possible Units as this will be important.

sort(table(data$PROPDMGEXP), decreasing = TRUE)
## 
##             K      M      0      B      5      1      2      ?      m      H 
## 465934 424665  11330    216     40     28     25     13      8      7      6 
##      +      7      3      4      6      -      8      h 
##      5      5      4      4      4      1      1      1
sort(table(data$CROPDMGEXP), decreasing = TRUE)
## 
##             K      M      k      0      B      ?      2      m 
## 618413 281832   1994     21     19      9      7      1      1

We can see the expected K, M and B exponents. However, there are other unexpected results. Based on these tables and the documentation we will take the following:

  • K, k: Thousands
  • M, m: Millions
  • B: Billions
  • Those entries with blank EXP observations will be taken as dollars

Data Transformation

We remove all entries that are not in one of the four categories listed above as we cannot be sure about what the costs should be. We also check to confirm we are left with only these categories and the new dimensions.

exp <- c("K", "k", "M", "m", "B", "")
data <- subset(data, PROPDMGEXP %in% exp & CROPDMGEXP %in% exp)
sort(table(data$PROPDMGEXP), decreasing = TRUE)
## 
##             K      M      B      m 
## 465928 424645  11329     40      7
sort(table(data$CROPDMGEXP), decreasing = TRUE)
## 
##             K      M      k      B      m 
## 618100 281826   1992     21      9      1
dim(data)
## [1] 901949      7

We’ve only lost about 300 entries - very well. Now we turn our attention to the EVTYPE variable. We see that there are many different types of events so let’s just look at the first 20. We will need to do some categorisation.

length(unique(data$EVTYPE))
## [1] 981
sort(table(data$EVTYPE), decreasing = TRUE)[1:20]
## 
##                     HAIL                TSTM WIND        THUNDERSTORM WIND 
##                   288625                   219939                    82559 
##                  TORNADO              FLASH FLOOD                    FLOOD 
##                    60628                    54261                    25325 
##       THUNDERSTORM WINDS                HIGH WIND                LIGHTNING 
##                    20628                    20210                    15732 
##               HEAVY SNOW               HEAVY RAIN             WINTER STORM 
##                    15705                    11723                    11432 
##           WINTER WEATHER             FUNNEL CLOUD         MARINE TSTM WIND 
##                     7026                     6839                     6175 
## MARINE THUNDERSTORM WIND               WATERSPOUT              STRONG WIND 
##                     5812                     3796                     3566 
##     URBAN/SML STREAM FLD                 WILDFIRE 
##                     3392                     2761

Using this, and the documentation, we can summarise most of these events into seven categories:

  • HAIL
  • WIND/THUNDERSTORM/LIGHTNING
  • TORNADO
  • FLOOD/RAIN
  • WINTER/SNOW/ICE/ICY
  • HEAT/WARM/FIRE
  • All others set to Other

We will create a new variable, EVENT, which categorises as such.

# Create the new variable with the default value of OTHER
data$EVENT <- "OTHER"
# Categorise by keywords listed above
data$EVENT[grep("HAIL", data$EVTYPE, ignore.case = TRUE)] <- "HAIL"
data$EVENT[grep("WIND|THUNDERSTORM|LIGHTNING", data$EVTYPE, ignore.case = TRUE)] <- "STORM"
data$EVENT[grep("TORNADO", data$EVTYPE, ignore.case = TRUE)] <- "TORNADO"
data$EVENT[grep("FLOOD|RAIN", data$EVTYPE, ignore.case = TRUE)] <- "FLOOD"
data$EVENT[grep("WINTER|SNOW|ICE|ICY", data$EVTYPE, ignore.case = TRUE)] <- "FREEZE"
data$EVENT[grep("HEAT|WARM|FIRE", data$EVTYPE, ignore.case = TRUE)] <- "HEAT"
sort(table(data$EVENT), decreasing = TRUE)
## 
##   STORM    HAIL   FLOOD TORNADO  FREEZE   OTHER    HEAT 
##  380443  289238   94854   60676   39447   30076    7215

We also need to transform the damage variables to their full numbers as mentioned before. We do this by converting the exponents to numeric values and then raising the damage variables to these powers.

# Property
# Set all none K, M, B to be 0 exponent
data$PROPDMGEXPNUM[!grepl("K|k|M|m|B", data$PROPDMGEXP, ignore.case = TRUE)] <- 0
data$PROPDMGEXPNUM[grep("K|k", data$PROPDMGEXP, ignore.case = TRUE)] <- 3
data$PROPDMGEXPNUM[grep("M|m", data$PROPDMGEXP, ignore.case = TRUE)] <- 6
data$PROPDMGEXPNUM[grep("B", data$PROPDMGEXP, ignore.case = TRUE)] <- 9
data$PROPERTY_DAMAGE <- data$PROPDMG * 10^(as.numeric(data$PROPDMGEXPNUM))

# Crops
# Set all none K, M, B to be 0 exponent
data$CROPDMGEXPNUM[!grepl("K|k|M|m|B", data$CROPDMGEXP, ignore.case = TRUE)] <- 0
data$CROPDMGEXPNUM[grep("K|k", data$CROPDMGEXP, ignore.case = TRUE)] <- 3
data$CROPDMGEXPNUM[grep("M|m", data$CROPDMGEXP, ignore.case = TRUE)] <- 6
data$CROPDMGEXPNUM[grep("B", data$CROPDMGEXP, ignore.case = TRUE)] <- 9
data$CROP_DAMAGE <- data$CROPDMG * 10^(as.numeric(data$CROPDMGEXPNUM))

# Class check
class(data$PROPERTY_DAMAGE)
## [1] "numeric"
class(data$CROP_DAMAGE)
## [1] "numeric"

Our data is now suitable to start summarising to answer the questions posed. For economic impact we simply care about the total damage cost caused by the event. For Public Health impact we care about both the number of injuries and the number of fatalities. So we aim to make three summary datasets:

  • Total number of injuries by Event
  • Total number of fatalities by Event
  • Total damage cost by Event

Total number of injuries by Event

We take the sum of the number of injuries by event.

data_injuries <- data %>%
        select(EVENT, INJURIES) %>%
        group_by(EVENT) %>%
        summarise(TOTAL_INJURIES = sum(INJURIES))

head(data_injuries)
## # A tibble: 6 x 2
##   EVENT  TOTAL_INJURIES
##   <chr>           <dbl>
## 1 FLOOD            8905
## 2 FREEZE           5237
## 3 HAIL             1368
## 4 HEAT            10851
## 5 OTHER            6041
## 6 STORM           16688

Total number of fatalities by Event

We take the sum of the number of fatalities by event.

data_fatalities <- data %>%
        select(EVENT, FATALITIES) %>%
        group_by(EVENT) %>%
        summarise(TOTAL_FATALITIES = sum(FATALITIES))

head(data_fatalities)
## # A tibble: 6 x 2
##   EVENT  TOTAL_FATALITIES
##   <chr>             <dbl>
## 1 FLOOD              1633
## 2 FREEZE              547
## 3 HAIL                 15
## 4 HEAT               3268
## 5 OTHER              1782
## 6 STORM              2232

Total cost of damage by Event

We take the sum of both the property and crop damage by event.

data_damage <- data %>%
        select(EVENT, PROPERTY_DAMAGE, CROP_DAMAGE) %>%
        group_by(EVENT) %>%
        summarise(TOTAL_DAMAGE = sum(PROPERTY_DAMAGE, CROP_DAMAGE))

head(data_damage)
## # A tibble: 6 x 2
##   EVENT  TOTAL_DAMAGE
##   <chr>         <dbl>
## 1 FLOOD  183943763634
## 2 FREEZE  16986995410
## 3 HAIL    18995875230
## 4 HEAT     9829715160
## 5 OTHER  167360037440
## 6 STORM   20269204081

Results

We are now ready to present the results. We create three plots using ggplot2

injuries <- ggplot(data_injuries, aes(x=reorder(EVENT, -TOTAL_INJURIES), y=TOTAL_INJURIES/1000, fill = EVENT)) +
        geom_col() +
        xlab("Event") + 
        ylab("Total Injuries (Thousands)") +
        ggtitle("Top 7 Severe Storm Events - Injuries") + 
        scale_fill_brewer(palette = "Set1")

fatalities <- ggplot(data_fatalities, aes(x=reorder(EVENT, -TOTAL_FATALITIES), y=TOTAL_FATALITIES/1000, fill = EVENT)) +
        geom_col() +
        xlab("Event") + 
        ylab("Total Fatalities (Thousands)") +
        ggtitle("Top 7 Severe Storm Events - Fatalities") + 
        scale_fill_brewer(palette = "Set2")

damages <- ggplot(data_damage, aes(x=reorder(EVENT, -TOTAL_DAMAGE), y=TOTAL_DAMAGE/10^9, fill = EVENT)) +
        geom_col() +
        xlab("Event") + 
        ylab("Total Cost of Damages ($ in Billions)") +
        ggtitle("Top 7 Severe Storm Events - Damages") + 
        scale_fill_brewer(palette = "Set3")

Public Health

Based on these two graphs we can see that Tornados are by far the biggest threat to public health of all major weather events.

print(injuries)

print(fatalities)

Economic Impact

In terms of Economic Impact, it is clear that Floods have the largest impact.

print(damages)