Synopsis

This report explores damages incurred by weather phenomena based on the National Oceanic and Atmospheric Administration storm assessment database. Two specific questions that are examined are which types of events caused the most harm to the public, and which types of events caused the greatest economic impact. The relevant variables are extracted and reduced to produce the statistics to answer these questions.

Data Processing

The specific questions that will be the focus of this examination are:

1 Across the United States, which types of events are most harmful with respect to population health?
2 Across the United States, which types of events have the greatest economic consequences?

The download and unpack (unzip) of the source data is seen as a one-time process and is not deemed necessary to be performed for the purpose of this study. The download/unpack sequence is documented as a block quote for completeness.

source <- “https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2
destination <- “TheData.csv.bz2”
download.file(source,destination)

The R packages used in this study are:

library(dplyr)
library(data.table)
library(RColorBrewer)

And the versions of all software used in this study are as follows:

sessionInfo()
## R version 4.4.0 (2024-04-24)
## Platform: aarch64-apple-darwin20
## Running under: macOS Ventura 13.6
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/Phoenix
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] RColorBrewer_1.1-3 data.table_1.15.4  dplyr_1.1.4       
## 
## loaded via a namespace (and not attached):
##  [1] vctrs_0.6.5       cli_3.6.2         knitr_1.46        rlang_1.1.3      
##  [5] xfun_0.43         generics_0.1.3    jsonlite_1.8.8    glue_1.7.0       
##  [9] htmltools_0.5.8.1 sass_0.4.9        fansi_1.0.6       rmarkdown_2.26   
## [13] evaluate_0.23     jquerylib_0.1.4   tibble_3.2.1      fastmap_1.1.1    
## [17] yaml_2.3.8        lifecycle_1.0.4   compiler_4.4.0    pkgconfig_2.0.3  
## [21] digest_0.6.35     R6_2.5.1          tidyselect_1.2.1  utf8_1.2.4       
## [25] pillar_1.9.0      magrittr_2.0.3    bslib_0.7.0       tools_4.4.0      
## [29] cachem_1.0.8

We load the original raw dataset into memory and review it’s essential structure:

foo <- read.csv("TheData.csv.bz2")
str(foo)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  "" "" "" "" ...
##  $ BGN_LOCATI: chr  "" "" "" "" ...
##  $ END_DATE  : chr  "" "" "" "" ...
##  $ END_TIME  : chr  "" "" "" "" ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  "" "" "" "" ...
##  $ END_LOCATI: chr  "" "" "" "" ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ WFO       : chr  "" "" "" "" ...
##  $ STATEOFFIC: chr  "" "" "" "" ...
##  $ ZONENAMES : chr  "" "" "" "" ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  "" "" "" "" ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

Question 1

We begin with a focus on question 1: Across the United States, which types of events are most harmful with respect to population health?

We first examine the quality of the data. Are there missing values?

# any missing values?
anyNA(foo)
## [1] TRUE
# how many?
sum(is.na(foo))
## [1] 1745947
# what percentage of total do these NA's represent?
sum(is.na(foo)) / (ncol(foo) * nrow(foo)) * 100.0
## [1] 5.229737

It must be pointed out that the National Oceanic and Atmospheric Administration’s storm data document is an instruction guide for storm data input preparers. It is not a code book, so certain latitude must be taken in the interpretation of what the variables mean. This study is working with NATIONAL WEATHER SERVICE INSTRUCTION 10-1605, AUGUST 17, 2007.

Specifically, from page 9 in the NWS document:

“Fatalities and injuries directly caused by the weather event will be entered in the Storm Data software “fatality” and “injury” entry fields, respectively.”

We find two columns that are interpreted to indicate levels of bodily harm: ‘FATALITIES’ and ‘INJURIES’.

# any missing values in the 'FATALITIES' or the 'INJURIES' columns?
sum(is.na(foo$FATALITIES))
## [1] 0
sum(is.na(foo$INJURIES))
## [1] 0
# how many fatal injuries are logged into the NWS database?
sum(foo$FATALITIES > 0)
## [1] 6974
# how many non-fatal injuries are logged into the NWS database?
sum(foo$INJURIES > 0)
## [1] 17604

The first question to explore is: are these two separate variables mutually exclusive? Without an effective code book that could tell us otherwise, we should verify whether INJURIES and FATALITIES are mutually exclusive. Are they?

X <- foo[(foo$FATALITIES > 0 & foo$INJURIES) > 0,]
nrow(X)
## [1] 2649

If INJURIES and FATALITIES are mutually exclusive, then table X should have 0 rows. It doesn’t. 2649 entries out of 902297 observations list both INJURIES and FATALITIES together. This comes to 2.7027027% of the total observations. In the absence of a code book, we deem this an acceptable redundancy. Since these bodily harm variables are collected separately, this study deems it fair to treat them separately.

So we extract those columns we need to proceed to study question 1.

Harm <- foo[(foo$FATALITIES > 0 | foo$INJURIES > 0),c(8,23,24)]

The variable ‘EVTYPE’ documents the type of weather phenomena associated with an observation. There are 985 unique values for EVTYPE. We sum both INJURIES and FATALITIES together, grouped by weather phenomena ‘EVTYPE’.

Harm <- foo %>% group_by(EVTYPE) %>% summarise(HarmCounts = sum(INJURIES,FATALITIES))

We sort the grouped sums in descending order, thereby bringing the highest contenders to the top of the list.

Harm <- arrange(Harm,desc(HarmCounts))

Question 2

We now turn our focus to question 2: Across the United States, which types of events have the greatest economic consequences?

Page 12 of the NWS document discusses damages from storms and how to assess monetary damages. One particular section that appears pertinent to our investigation here:

“Estimates should be rounded to three significant digits, followed by an alphabetccal character signifying the magnitude of the number, i.e., 1.55B for $1,550,000,000. Alphabetical characters used to signify magnitude include “K” for thousands, “M” for millions, and “B” for billions. If additional precision is available, it may be provided in the narrative part of the entry. When damage is due to more than one element of the storm, indicate, when possible, the amount of damage caused by each element. If the dollar amount of damage is unknown, or not available, check the “no information available” box.”

From the str() output at the beginning of this paper, we must interpret “economic consequences” to be accounted for in the four columns: PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP or columns 25, 26, 27, and 28. These appear to separate property damage from crop damage. Like fatal vs non-fatal injuries we deem it fair to assume that if these variables are collected separately, we can process them separately.

See why a code book is handy.

We account for missing values in these columns:

# any missing values in the these columns?
sum(is.na(foo$PROPDMG))
## [1] 0
sum(is.na(foo$PROPDMGEXP))
## [1] 0
sum(is.na(foo$CROPDMG))
## [1] 0
sum(is.na(foo$CROPDMGEXP))
## [1] 0

The unique values of ‘PROPDMG’ are large, but the unique values for ‘PROPDMGEXP’ and ‘CROPDMGEXP’ are interesting:

unique(foo$PROPDMGEXP)
##  [1] "K" "M" ""  "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-" "1" "8"
unique(foo$CROPDMGEXP)
## [1] ""  "M" "K" "m" "B" "?" "0" "k" "2"

We begin with PROPDMGEXP:

“K” “M” “” “B” “m” “+” “0” “5” “6” “?” “4” “2” “3” “h” “7” “H” “-” “1” “8”

Certain editorial license will need to be applied here. How to interpret tags such as +“,”-“,”?“,”h”?

How do we interpret “0”, “1”, “2”, “3”, etc? Simply eliminating these seems unfair. How to interpret these “tags”? This report takes the position that the “0”, “1”, “2”, etc “tags” are powers of ten. Without an effective code book to guide us, this assumption could be wrong, but how do we tell?

We start with the property data. We assume crop entries may be different and will break out crop damage in a separate table.

# we subset those columns that describe property damage...
P <- foo[,c(8,25,26)]
# ...and let's verify we have the correct columns
str(P)
## 'data.frame':    902297 obs. of  3 variables:
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
# replace all character designations to numeric equivalents and then convert the data type of 
# PROPDMGEXP to numeric to compute the full dollar amount by denomination.
P$PROPDMGEXP[P$PROPDMGEXP == "K"] <- 10^3
P$PROPDMGEXP[P$PROPDMGEXP == "M"] <- 10^6
P$PROPDMGEXP[P$PROPDMGEXP == "m"] <- 10^6
P$PROPDMGEXP[P$PROPDMGEXP == "B"] <- 10^9

# we interpret the "numeric" designations as powers of ten(?)
P$PROPDMGEXP[P$PROPDMGEXP == "0"]  <- 10^0
P$PROPDMGEXP[P$PROPDMGEXP == "1"]  <- 10^1
P$PROPDMGEXP[P$PROPDMGEXP == "2"]  <- 10^2
P$PROPDMGEXP[P$PROPDMGEXP == "3"]  <- 10^3
P$PROPDMGEXP[P$PROPDMGEXP == "4"]  <- 10^4
P$PROPDMGEXP[P$PROPDMGEXP == "5"]  <- 10^5
P$PROPDMGEXP[P$PROPDMGEXP == "6"]  <- 10^6
P$PROPDMGEXP[P$PROPDMGEXP == "7"]  <- 10^7
P$PROPDMGEXP[P$PROPDMGEXP == "8"]  <- 10^8

# and the remaining (undocumented) "tags" will just be set to 1.
# we do this to preserve all PROPDMG values rather than just eliminate them.
P$PROPDMGEXP[P$PROPDMGEXP == ""]   <- 1
P$PROPDMGEXP[P$PROPDMGEXP == "+"]  <- 1
P$PROPDMGEXP[P$PROPDMGEXP == "-"]  <- 1
P$PROPDMGEXP[P$PROPDMGEXP == "?"]  <- 1
P$PROPDMGEXP[P$PROPDMGEXP == "H"]  <- 1
P$PROPDMGEXP[P$PROPDMGEXP == "h"]  <- 1

To calculate dollar property damage, we now convert ‘PROPDMGEXP’ to numeric:

P$PROPDMGEXP <- as.numeric(P$PROPDMGEXP)

Now we add a new column that reflects the full dollar amount of property damage.

P$DollarDamage <- P$PROPDMG * P$PROPDMGEXP

Finally, let’s give everyone meaningful column names.

colnames(P) <- c("Event","DamageAmt","DamageFactor","DollarDamage")

We now turn our attention to crop damage. We first subset the columns we will be working with:

# we subset those columns that describe crop damage...
C <- foo[,c(8,27,28)]
# ...and let's verify we have the correct columns
str(C)
## 'data.frame':    902297 obs. of  3 variables:
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...

As with PROPDMGEXP, the “tags” of ‘CROPDMGEXP’ are:

“” “M” “K” “m” “B” “?” “0” “k” “2”

# as with PROPDMGEXP, we replace each "character-denomination" with it's numerical equivalent. 
# and replace the character designations with their equivalent numbers to convert the data type 
# of CROPDMGEXP to numeric to compute the final full dollar amount.
C$CROPDMGEXP[C$CROPDMGEXP == "K"] <- 10^3
C$CROPDMGEXP[C$CROPDMGEXP == "k"] <- 10^3
C$CROPDMGEXP[C$CROPDMGEXP == "M"] <- 10^6
C$CROPDMGEXP[C$CROPDMGEXP == "m"] <- 10^6
C$CROPDMGEXP[C$CROPDMGEXP == "B"] <- 10^9
C$CROPDMGEXP[C$CROPDMGEXP == "0"] <- 10^0
C$CROPDMGEXP[C$CROPDMGEXP == "2"] <- 10^2

# and the remaining (undocumented) "tags" will just be set to 1.
# we do this to preserve all CROPDMG values rather than just eliminate them.
C$CROPDMGEXP[C$CROPDMGEXP == ""]  <- 1
C$CROPDMGEXP[C$CROPDMGEXP == "?"] <- 1

To calculate dollar crop damage, we now convert ‘CROPDMGEXP’ to numeric:

C$CROPDMGEXP <- as.numeric(C$CROPDMGEXP)

Now we add a new column that reflects the full dollar amount of property damage.

C$DollarDamage <- C$CROPDMG * C$CROPDMGEXP

Finally, let’s give everyone meaningful column names.

colnames(C) <- c("Event","DamageAmt","DamageFactor","DollarDamage")

Given question 2, as posed, is rather generic, we will combine total damage costs of
property and crop to determine the overall top weather event that incurs the greatest damage costs.

Cost <- rbind(P,C)
Cost <- Cost %>% group_by(Event) %>% summarise(Costs = sum(DollarDamage))
Cost <- arrange(Cost,desc(Costs))
Cost$Costs <- as.integer(Cost$Costs / 10^9)
colnames(Cost) <- c("Weather","Cost(Billions)")

While not strictly necessary to answer question 2, it would be interesting to compare the damage sources between property damage and crop damage. The source of damage to agricultural crops may well be different than to personal property in urban and suburban regions.

P <- P %>% group_by(Event) %>% summarise(Costs = sum(DollarDamage))
P <- arrange(P,desc(Costs))
P$Costs <- as.integer(P$Costs / 10^9)
colnames(P) <- c("Weather","Cost(Billions)")

C <- C %>% group_by(Event) %>% summarise(Costs = sum(DollarDamage))
C <- arrange(C,desc(Costs))
C$Costs <- as.integer(C$Costs / 10^9)
colnames(C) <- c("Weather","Cost(Billions)")

Results

Question 1

To answer question one we only need display, say, the first 5 entries to answer the question Across the United States, which types of events are most harmful with respect to population health?

A visual display of the top five contenders reveal:

Harm <- Harm[1:5,]
Harm$Tag <- 1:5
barplot(Harm$HarmCounts ~ Harm$Tag, 
        xlab = "Weather Event", 
        ylab = "Harm Count", 
        col = brewer.pal(5, "Pastel1"), 
        legend.text = Harm$EVTYPE)
title("Top Five Contenders for Weather Events\nCausing Greatest Harm to Population Health")

From the graph it is seen that tornados are, by far, the leading cause of harm with respect to population health.

Question 2

We briefly compare the weather phenomena impacting property versus crop damages to see how each region type is affected by their respective weather type.

## What do the top ten weather phenomena with the greatest impact on property damage look like?
P[1:10,]
## # A tibble: 10 × 2
##    Weather           `Cost(Billions)`
##    <chr>                        <int>
##  1 FLOOD                          144
##  2 HURRICANE/TYPHOON               69
##  3 TORNADO                         56
##  4 STORM SURGE                     43
##  5 FLASH FLOOD                     16
##  6 HAIL                            15
##  7 HURRICANE                       11
##  8 TROPICAL STORM                   7
##  9 WINTER STORM                     6
## 10 HIGH WIND                        5
## What do the top ten weather phenomena with the greatest impact on crop damage look like?
C[1:10,]
## # A tibble: 10 × 2
##    Weather           `Cost(Billions)`
##    <chr>                        <int>
##  1 DROUGHT                         13
##  2 FLOOD                            5
##  3 RIVER FLOOD                      5
##  4 ICE STORM                        5
##  5 HAIL                             3
##  6 HURRICANE                        2
##  7 HURRICANE/TYPHOON                2
##  8 FLASH FLOOD                      1
##  9 EXTREME COLD                     1
## 10 FROST/FREEZE                     1

To the answer to question 2, we also only need display the first 5 entries to answer the question Across the United States, which types of events have the greatest economic consequences?

Cost[1:5,]
## # A tibble: 5 × 2
##   Weather           `Cost(Billions)`
##   <chr>                        <int>
## 1 FLOOD                          150
## 2 HURRICANE/TYPHOON               71
## 3 TORNADO                         57
## 4 STORM SURGE                     43
## 5 HAIL                            18

A visual display of the top five contenders to the question Across the United States, which types of events have the greatest economic consequences?

Cost <- Cost[1:5,]
Cost$Tag <- 1:5
barplot(Cost$`Cost(Billions)` ~ Cost$Tag, 
        xlab = "Weather Event", 
        ylab = "Dollar Costs (in billions)", 
        col = brewer.pal(5, "Pastel2"), 
        legend.text = Cost$Weather)
title("Top Five Contenders for Weather Events\nCausing Greatest Economic Impact")

From the graph it is seen that weather that results in flooding incurs the greatest economic consequence in the United States.