The purpose of this project is to demonstrate some basic techniques and considerations of exploratory data analysis using the R programming language. The analysis looks at weather-related property and crop damage in the United States over the 26-year period from 1996 through 2021, using the National Oceanic and Atmospheric Administration (NOAA) Storm Events Database. It was performed in R Studio version 1.4.1106 with R version 4.1.2 (2021-11-01) on a MacBook Air with 4 gigabytes of RAM, running MacOS version 11.6. Select “Show All Code” from the Code drop-down at the top right of this document to see the source code.
The analysis considers questions of subsetting the analytical data set from the full data set; how innate characteristics of weather data might affect analysis; some key characteristics and quirks of the NOAA database; management of missing data; visualization of the NOAA data; and, the importance of performing exploratory analysis before beginning to use a data set for quantitative studies.
NOAA’s Storm Events database might almost more properly be called the “Natural Environmental Phenomena” database or the “Physical Habitat Events” database. Originally created in 1950 to track the impact of tornados, the NOAA Storm Events Database has evolved considerably over time; in addition to actual weather events such as wind, snow, or tornados, it records effects of weather events (e.g., avalanches, floods, droughts); non-weather events that interact intimately with weather and its effects (e.g., extreme tides, wildfires); and geological phenomena, (e.g., tsunamis, volcanic eruptions). A number of the components of actual weather are not systematically recorded in the storm database, although some such information may be arbitrarily included in the narrative text of a data record.
In order to think clearly about the data, it is helpful to define terms as they are used in this analysis.
Weather: the state of the atmosphere at a particular time and place, comprising phenomena such as temperature, humidity, precipitation, wind, barometric pressure, dewpoint, and airborne particles (fog, dust, smoke).
Weather Event: an incident of specific notable weather conditions, such as a thunderstorm, hail, precipitation, fog.
Weather Effect: direct or indirect effect of a weather event, e.g., flooding (effect) due to prolonged heavy rainfall (event), drought (effect) due to reduced/deficient of precipitation (event), avalanche (effect) due to rapid accumulation of heavy snow (event).
Interacting Event: a non-weather event that interacts intimately with weather. The most notable is wildfire. Wildfires are often started by weather (lightning); their severity is strongly influenced by weather conditions (termperature, humidity), weather events (precipitation), and weather effects (drought, flooding); the heat and smoke from a fire can affect local weather. Tides are also interacting events.
Weather Incident: the combined actions and effects of an identifiable weather entity; may span dates, geographic boundaries, and event types, e.g., Hurricane Sandy, an Alberta clipper system, a drought.
Non-weather Event: some events recorded in the Storm Events database are not weather at all: tsunami, “sneakerwave”, northern lights.
This analysis evaluates weather events and weather effects over land. Most non-weather events are excluded, but wildfires are included due the connection between weather and wildfire and because the damage caused by the one cannot be understood without the other.
library(data.table)
library(R.utils)
library(tidyr)
library(usmap)
library(ggplot2)
library(knitr)
library(kableExtra)
library(tidytext)
# Ordinarily, the following would be read in from configuration files; they are hard-coded here to make the Rmd file self-contained.
YEARS <- 1996:2021
# Columns to read
VAR_NAMES <- c("EVENT_ID",
"EPISODE_ID",
"STATE_FIPS",
"BEGIN_YEARMONTH",
"BEGIN_DAY",
"EVENT_TYPE",
"DAMAGE_PROPERTY",
"DAMAGE_CROPS")
# Rename columns on read.
COL_NAMES = c("EventID",
"EpisodeID",
"FIPS",
"BeginYearMonth",
"BeginDay",
"EventType",
"PropertyDamage",
"CropDamage")
FILES <- data.table(
Year = YEARS,
Filename = c("StormEvents_details-ftp_v1.0_d1996_c20210803.csv.gz",
"StormEvents_details-ftp_v1.0_d1997_c20210803.csv.gz",
"StormEvents_details-ftp_v1.0_d1998_c20210803.csv.gz",
"StormEvents_details-ftp_v1.0_d1999_c20210803.csv.gz",
"StormEvents_details-ftp_v1.0_d2000_c20210803.csv.gz",
"StormEvents_details-ftp_v1.0_d2001_c20220107.csv.gz",
"StormEvents_details-ftp_v1.0_d2002_c20211102.csv.gz",
"StormEvents_details-ftp_v1.0_d2003_c20210803.csv.gz",
"StormEvents_details-ftp_v1.0_d2004_c20210803.csv.gz",
"StormEvents_details-ftp_v1.0_d2005_c20210803.csv.gz",
"StormEvents_details-ftp_v1.0_d2006_c20220107.csv.gz",
"StormEvents_details-ftp_v1.0_d2007_c20220107.csv.gz",
"StormEvents_details-ftp_v1.0_d2008_c20220107.csv.gz",
"StormEvents_details-ftp_v1.0_d2009_c20220107.csv.gz",
"StormEvents_details-ftp_v1.0_d2010_c20220107.csv.gz",
"StormEvents_details-ftp_v1.0_d2011_c20220107.csv.gz",
"StormEvents_details-ftp_v1.0_d2012_c20220107.csv.gz",
"StormEvents_details-ftp_v1.0_d2013_c20220124.csv.gz",
"StormEvents_details-ftp_v1.0_d2014_c20211217.csv.gz",
"StormEvents_details-ftp_v1.0_d2015_c20211217.csv.gz",
"StormEvents_details-ftp_v1.0_d2016_c20211217.csv.gz",
"StormEvents_details-ftp_v1.0_d2017_c20220124.csv.gz",
"StormEvents_details-ftp_v1.0_d2018_c20220124.csv.gz",
"StormEvents_details-ftp_v1.0_d2019_c20220124.csv.gz",
"StormEvents_details-ftp_v1.0_d2020_c20220124.csv.gz",
"StormEvents_details-ftp_v1.0_d2021_c20220124.csv.gz")
)
NSE_URL <- "https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/"
FIPS_PATH <- "StateFIPSCodes.txt"
FIPS_URL <- "https://www2.census.gov/geo/docs/reference/state.txt"
CPI_PATH <- "CPILFESL.csv"
CPI_URL <- "https://fred.stlouisfed.org/graph/fredgraph.csv?bgcolor=%23e1e9f0&chart_type=line&drp=0&fo=open%20sans&graph_bgcolor=%23ffffff&height=450&mode=fred&recession_bars=on&txtcolor=%23444444&ts=12&tts=12&width=1168&nt=0&thu=0&trc=0&show_legend=yes&show_axis_titles=yes&show_tooltip=yes&id=CPILFESL&scale=left&cosd=1957-01-01&coed=2021-12-01&line_color=%234572a7&link_values=false&line_style=solid&mark_type=none&mw=3&lw=2&ost=-99999&oet=99999&mma=0&fml=a&fq=Monthly&fam=avg&fgst=lin&fgsnd=2020-02-01&line_index=1&transformation=lin&vintage_date=2022-02-08&revision_date=2022-02-08&nd=1957-01-01"
# Event-specific colors for maps and plots.
EVENT_COLORS <- data.table(EventType = c("Avalanche",
"Blizzard",
"Coastal Flood",
"Cold/Wind Chill",
"Debris Flow",
"Drought",
"Flash Flood",
"Flood",
"Frost/Freeze",
"Hail",
"Heat",
"Heavy Rain",
"High Surf",
"High Wind",
"Hurricane",
"Ice Storm",
"Lightning",
"Snow",
"Storm Surge/Tide",
"Strong Wind",
"Thunderstorm Wind",
"Tornado",
"Tropical Storm",
"Wildfire",
"Winter Storm"),
EventColor = c("plum",
"#443C84FF",
"greenyellow",
"#A4DDEF",
"#641A0A",
"#A6761D",
"#26A63A",
"darkgreen",
"aliceblue",
"#B3B3B3",
"khaki",
"slategrey",
"blueviolet",
"#FDE725FF",
"#E7298A",
"#00BFC4",
"black",
"ivory",
"aquamarine",
"#FDE9AAFF",
"#FDB030FF",
"tomato",
"#FDDAEC",
"#E31A1C",
"#2F6C8EFF")
)
##Define named functions used in more than one code chunk
# Retrieve/read data from NSE website.
readData <- function(idx) {
if(!(idx %in% 1950:2021)) return(data.table(NA))
fn <- FILES[Year == idx, Filename]
if(!file.exists(fn)) {
url <- paste0(NSE_URL, fn)
print(url)
download.file(url, fn, method = "curl")
}
fread(file = fn, col.names = COL_NAMES, select = VAR_NAMES)
}
# Extract the appropriate event colors for a specific data table
# containing event types. Data must be sorted by value of interest
# before being passed to this function.
eventColorsForDataTable <- function(dt) {
colors <- NA
if("EventType" %chin% names(dt)) {
colors <- unique(
dt[,EVENT_COLORS[EventType == .BY, EventColor],by=EventType])
setnames(colors, 2, "EventColor")
}
}
weightedColor <- function(color, alpha) {
weighted <- vector(mode = "character", length = 0L)
if(length(color) != length(alpha)) {
print("Colors and alphas don't match.")
} else {
crgb <- t(col2rgb(color))
calpha <- alpha * 255
weighted <- c(weighted, rgb(crgb, alpha = calpha, maxColorValue = 255))
}
return(weighted)
}
# Extract data for a specified state and type (property or crop)
dataTableForStateType <- function(state, type) {
err <- 0
ST <- NULL
if(!(state %in% states)) {
err <- err + 1
cat(state, "is not a valid state name.")
}
if(!(type %in% c("property", "crop"))) {
err <- err + 1
cat(type, "Invalid type: ", type,
"; type argument must be either property or crop.")
}
if(err == 0) {
if(type == "property") {
ST <- property[State == state]
} else {
ST <- crop[State == state]
}
}
return(ST)
}
# Calculate total, count, and percent crop or property damage by event type.
eventStatsForState <- function(state, type, drop = TRUE) {
st_data <- dataTableForStateType(state, type)
if(!(is.null(st_data))) {
st_state <- st_data[1, State]
ev_count <- st_data[, .N, by = EventType]
setnames(ev_count, "N", "Count")
ev_total <- st_data[, sum(DamageValue), by = EventType]
setnames(ev_total, "V1", "Total")
st_total = st_data[, sum(DamageValue)]
if(st_total == 0) {
ev_count_total <- sum(ev_count$Count)
ev_percent <- ev_count[, round(Count/ev_count_total, 3), by = EventType]
} else {
ev_percent <- ev_total[, round(Total/st_total, 3), by = EventType]
}
if(drop) {ev_percent <- ev_percent[V1 >= 0.01]}
else {ev_percent <- ev_percent[V1 > 0.0]}
setnames(ev_percent, "V1", "Percent")
st_data <- merge(ev_count, ev_total, by = "EventType")
st_data <- merge(st_data, ev_percent, by = "EventType")
st_data[, State := st_state]
setorder(st_data, -Percent)
# Display strings for tables.
if( type == "property") {
st_data[, "PropertyDamage" := paste0(Percent*100, "%")]
} else {
st_data[, "CropDamage" := paste0(Percent*100, "%")]
}
}
return(st_data)
}
# Retrieve episode or event narratives given the type, year and id.
retrieveNarrative <- function(type, year, id) {
err <- 0
sel <- 0
col <- ""
narrative <- data.table(NA)
if(!(year %in% YEARS)) {
err <- err + 1
cat(year, "is not a valid year.")
}
if(id %% 1L > 0) {
err <- err + 2
cat(id, "is not a valid event or episode ID")
}
if(type == "event") {
sel <- 50
col <- "EventNarrative"
} else {
if(type == "episode") {
sel <- 49
col <- "EpisodeNarrative"
} else {
err <- err + 3
cat(type, "is not a valid narrative type.")
}
}
if(err == 0) {
fn <- FILES[Year == year, Filename]
dn <- paste0(year,".csv")
if(!file.exists(dn)) gunzip(fn, dn, remove = FALSE)
command <- sprintf("grep --text ',%i,' %s", id, dn)
narrative <- tryCatch(
fread(cmd = command, select = sel, nrows = 1L),
message = "",
warning = "",
error = function(e) {
cat("No ", type, " narrative available in ", year, "for id", id, "\n")
narrative <- data.table(NA)
}
)
setnames(narrative, 1, col)
}
file.remove(dn)
return(narrative)
}
The NOAA Storm Events data is organized in separate CSV files by year. A data record consists of 51 variables containing structured numeric or character data, or unstructured narrative text data. The codebook describing these variables is available here. Data files for the years 1996-2021 contain a total of 1,484,121 records. For this analysis, only structured data related to property and crop damage, event type, date, and location (state-level) are loaded for all records. Where needed, narrative text for specific records is loaded on demand from the data files on disk.
The NOAA data for 1996-2021 contain 57 unique EVENT_TYPE values. The event types are not entirely discrete, due to the complexity of weather and its effects and interactions with land, bodies of water, and natural phenomena such as tides and geologic events. There is no single obvious or correct way to classify weather-related observations. Event types in the NOAA Storm Events database reflect certain choices on how to organize the data. For example, two weather events, cold and wind, combine to create an effect, windchill, but cold and windchill are combined in the Cold/Wind Chill and Extreme Cold/Wind Chill event types. Dense Smoke is an effect of fire, but is recorded as a separate event type from Wildfire.
The following event types are simply excluded from this analysis, which focuses on property and crop damage on land that is attributable to weather.
This analysis is interested in weather-related crop and property damage, and of the latter, more infrastructure damage, or damage with societal impact. No definitive distinctions can be made between primarily individual property damage and primarily infrastructural property damage in the NOAA data set. Damage to a sewage plant or destruction of a road from high surf and coastal flooding has societal impact. The loss of homes due to wildfire has both individual and large societal impact. The effects on human health of dense smoke generated by wildfires has significant societal impact, but not in the categories of crop or property damage. Fog, in itself, does not damage crops or property; vehicle damage due to collisions in dense fog has large individual impact but less large societal impact. Considerations of this nature have resulted in the following event types being excluded from the analytical data:
Furthermore, several event types are combined where the distinctions do not seem to provide significant information for this particular analysis:
nse <- rbindlist(lapply(YEARS, readData))
# Check for missing data in EventType column before using it
if(sum(is.na(nse$EventType)) > 0) print("Missing EventType data.")
# Combine snow events.
setorder(nse, EventType)
idx <- grep("Heavy Snow", nse$EventType)
nse[idx, EventType := "Snow"]
idx <- grep("Lake-Effect Snow", nse$EventType)
nse[idx, EventType := "Snow"]
# Combine hurricanes and typhoons.
idx <- grep("Typhoon", nse$EventType)
nse[idx, EventType := "Hurricane"]
# Combine certain winter precipitation event types.
idx <- grep("Freezing Fog", nse$EventType)
nse[idx, EventType := "Wintry Mix"]
idx <- grep("Sleet", nse$EventType)
nse[idx, EventType := "Wintry Mix"]
idx <- grep("Winter Weather", nse$EventType)
nse[idx, EventType := "Wintry Mix"]
# EventType exclusion list
exclude <- c("Astronomical Low Tide",
"Dense Fog",
"Dense Smoke",
"Funnel Cloud",
"Marine Dense Fog",
"Marine Hail",
"Marine High Wind",
"Marine Hurricane/Typhoon",
"Marine Lightning",
"Marine Strong Wind",
"Marine Thunderstorm Wind",
"Marine Tropical Depression",
"Marine Tropical Storm",
"Northern Lights",
"Rip Current",
"Sneakerwave",
"Tsunami",
"Volcanic Ash",
"Volcanic Ashfall",
"Waterspout"
)
nse <- nse[!(EventType %chin% exclude), ]
# Check for missing data in columns that shouldn't have missing data
if(sum(is.na(nse$EventID)) > 0) print("Missing EventID data.")
if(sum(is.na(nse$BeginYearMonth)) > 0) print("Missing BeginYearMonth data.")
if(sum(is.na(nse$FIPS)) > 0) print("Missing FIPS data.")
# Download the state FIPS codes from the U.S. Census Bureau
if(!file.exists(FIPS_PATH)) {
download.file(FIPS_URL, FIPS_PATH, method = "curl")
}
fipsData <- fread(FIPS_PATH, select = c("STATE", "STATE_NAME"))
# Restrict analytical data to the 50 U.S. states and District of Columbia
setorder(nse, FIPS)
us50 <- fipsData[STATE <= 56]
nse <- nse[FIPS %in% us50$STATE,]
# Exclude rows with missing PropertyDamage and CropDamage values.
# Convert empty strings to NA.
nse[!nzchar(PropertyDamage), PropertyDamage := NA]
nse[!nzchar(CropDamage), CropDamage := NA]
# Count missing values before removing them.
propertyCounts <- data.table(Year = YEARS)
counts <- nse[is.na(PropertyDamage), .N, by = trunc(BeginYearMonth/100)]
setorder(counts, trunc)
propertyCounts[, Missing := counts$N]
cropCounts <- data.table(Year = YEARS)
counts <- nse[is.na(CropDamage), .N, by = trunc(BeginYearMonth/100)]
setorder(counts, trunc)
cropCounts[, Missing := counts$N]
# Exclude rows where both PropertyDamage and CropDamage are NA
idx <- nse[, (is.na(PropertyDamage) & is.na(CropDamage))]
nse <- nse[!idx,]
# Get the properly capitalized state names from the fips data.
nse[, State := fipsData[STATE == .BY, STATE_NAME], by=FIPS]
# Create a variable of type IDate from BeginYearMonth and BeginDay variables.
nse[, BeginDate := as.IDate(as.character(BeginYearMonth * 100 + BeginDay), format = "%Y%m%d")]
nse <- nse[, !c("BeginDay")]
states <- unique(nse$State)
opar <- par(no.readonly = TRUE)
The property and crop damage variables (DAMAGE_PROPERTY and DAMAGE_CROP, respectively) are stored in the database as strings, which may be empty, indicating missing data; which may contain numerals, with or without a decimal marker (“.”); and, which may end in a letter indicating the magnitude of the numeric portion of the string (“K” = 103, “M” = 106, “B” = 109).
# Exclude the 50 observations where PropertyDamage or CropDamage have
# ambiguous values.
exclude <- c("K", "M", "0T")
nse <- nse[!(PropertyDamage %chin% exclude), ]
nse <- nse[!(CropDamage %chin% exclude), ]
calculateDamage <- function(type) {
damageName <- quote(PropertyDamage)
costName <- quote(PropertyDamageValue)
costStr <- "PropertyDamageValue"
if(type == "crop") {
damageName <- quote(CropDamage)
costName <- quote(CropDamageValue)
costStr <- "CropDamageValue"
}
nse[, eval(costStr) := as.double(NA)]
# Convert a variety of representations of zero to zero
nse[grep("^0$|^0K|^0M|^0.00", eval(damageName)), eval(costStr) := 0]
# Convert damage strings to numeric values
patterns <- c("[K]$", "[M]$", "[B]$")
magnitudes <- c(10^3, 10^6, 10^9)
for(i in 1:3) {
eventIDs <- nse[is.na(eval(costName))][grep(patterns[i],eval(damageName)), EventID]
nse[EventID %in% eventIDs,
eval(costStr) :=
round(
as.numeric(
unlist(
strsplit(eval(damageName), patterns[i])))
* magnitudes[i])]
}
}
calculateDamage("property")
calculateDamage("crop")
After converting property and crop damage data to numeric values, the results are adjusted for inflation to the average 2021 USD value using CPILFESL, the “U.S. Bureau of Labor Statistics Consumer Price Index for All Urban Consumers: All Items Less Food & Energy”, sometimes called the “core CPI”, retrieved from FRED, Federal Reserve Bank of St. Louis, February 7, 2022.
# Download CPI data from the St. Louis Federal Reserve
if(!file.exists(CPI_PATH)) {
download.file(CPI_URL, CPI_PATH, method = "curl")
}
# Load data from 1996 through 2021.
cpi <- fread(file = CPI_PATH,
skip = ((1996 - 1957) * 12) + 1,
nrows = (2022-1996) * 12,
col.names = c("Date", "CPI"))
# Convert date strings to date objects.
cpi[, Date := as.IDate(Date, format = "%Y-%m-%d")]
# Calculate inflation rates.
ref_cpi <- cpi[year(Date) == 2021, mean(CPI)]
cpi[, rate := ref_cpi/CPI]
cpi[, ym := year(Date)*10^2 + month(Date)]
# Adjust damage values.
setorder(nse, BeginYearMonth)
nse[, rate := cpi[ym == .BY, rate], by=list(BeginYearMonth)]
nse[, CropDamageValue := CropDamageValue*rate]
nse[, PropertyDamageValue := PropertyDamageValue*rate]
nse[, BeginYearMonth := NULL]
# Avoid having to constantly check for NA in the damage values
property <- nse[!is.na(PropertyDamage), !c("CropDamage", "CropDamageValue")]
crop <- nse[!is.na(CropDamage), !c("PropertyDamage", "PropertyDamageValue")]
setnames(property, "PropertyDamageValue", "DamageValue")
setnames(crop, "CropDamageValue", "DamageValue")
rm(nse, cpi)
Simple aggregate statistics from numerical data are easily generated and displayed as box plots.
# Total crop and property damage by year.
par(mfrow = c(1,2))
par(mar = c(4, 4, 4, 2))
aticks = c(0.0e+00, 5.0e+09, 1.0e+10, 1.5e+10, 2.0e+10, 2.5e+10)
alabels = c("0", "5", "10", "15", "20", "25")
boxplot(DamageValue ~ year(BeginDate), data = property,
xlab = NULL,
main = "Property Damage",
ylab = "Damage Value (USD Billions)", yaxt = "n")
axis(2, at = aticks, labels = alabels, las = 1)
aticks = c(0.0e+00, 0.5e+09, 1.0e+09, 1.5e+09)
alabels = c("0", "0.5", "1.0", "1.5")
boxplot(DamageValue ~ year(BeginDate), data = crop,
xlab = NULL,
main = "Crop Damage",
ylab = "Damage Value (USD Billions)", yaxt = "n")
axis(2, at = aticks, labels = alabels, las = 1)
Fig. 1. Summary Statistics for Weather Damage by Year
par(opar)
The results in Figure 1 are not at all useful. The data are so dominated by extreme values that the box plots are flattened into the baseline. Replotting the data using the base 10 logarithm (log10) of the damage values provides a better view of the distribution of the data.
# Total crop and property damage by year.
par(mfrow = c(1,2))
par(mar = c(4, 4, 4, 2))
property[, log10 := fifelse(DamageValue > 0, log10(DamageValue), 0)]
boxplot(log10 ~ year(BeginDate), data = property,
xlab = NULL,
main = "Property Damage",
ylab = "log10 Damage Value", yaxt = "n")
crop[, log10 := fifelse(DamageValue > 0, log10(DamageValue), 0)]
boxplot(log10 ~ year(BeginDate), data = crop,
xlab = NULL,
main = "Crop Damage",
ylab = "log10 Damage Value", yaxt = "n")
Fig. 1, Revised. Summary Statistics for Weather Damage by Year (log10)
par(opar)
property[, log10 := NULL]
crop[, log10 := NULL]
Clearly, there is something odd about these data which requires investigation. After 2006, the value distribution is very strongly weighted towards zero, so the first thing to check is the distribution of zero values across the data.
counts <- property[DamageValue == 0, .N, by = year(BeginDate)]
setorder(counts, year)
propertyCounts[, Zero := counts$N]
counts <- property[DamageValue > 0, .N, by = year(BeginDate)]
setorder(counts, year)
propertyCounts[, NonZero := counts$N]
par(mfrow = c(1,2))
par(mar = c(6, 5, 4, 2))
pcolors <- viridisLite::viridis(3, option = "D")
aticks = c(0.0e+00, 1.0e+04, 2.0e+04, 3.0e+04, 4.0e+04, 5.0e+04, 6.0e+04, 7.0e+04)
alabels = c("0", "10", "20", "30", "40", "50", "60", "70")
barplot(height = as.matrix(t(propertyCounts[, .(Missing, Zero, NonZero)])),
names.arg = YEARS, las = 2, col = pcolors,
main = "Property",
yaxt = "n",
ylab = "Number of Values (Thousands)")
axis(2, at = aticks, labels = alabels, las = 1)
counts <- crop[DamageValue == 0, .N, by = year(BeginDate)]
setorder(counts, year)
cropCounts[, Zero := counts$N]
counts <- crop[DamageValue > 0, .N, by = year(BeginDate)]
setorder(counts, year)
cropCounts[, NonZero := counts$N]
barplot(height = as.matrix(t(cropCounts[, .(Missing, Zero, NonZero)])),
names.arg = YEARS,
las = 2, col = pcolors,
main = "Crop",
yaxt = "n",
ylab = "Number of Values (Thousands)")
axis(2, at = aticks, labels = alabels, las = 1)
par(fig = c(0, 1, 0, 1),
oma = c(0, 0, 0, 0),
mar = c(1, 0, 1, 0),
new = TRUE)
plot(0, 0, type = 'l', bty = 'n', xaxt = 'n', yaxt = 'n')
legend(x = "bottom",
ncol = 3,
legend = c("Missing", "Zero", "NonZero"),
fill = pcolors,
cex = 0.8,
xpd = TRUE)
Fig. 2. Numbers of Missing, Zero, and NonZero Damage Values by Year
par(opar)
Three important characteristics of this data set are illustrated in Figure 2:
Something changed in 2007, possibly a move to consistently distinguish missing values (no available data) from zero values (no damage occurred).
Judging from the 2007-2021 data, either most severe weather events do not result in property or crop damage, or damage is not reliably reported to the NOAA Storm Events Database.
Only a small percentage of weather events (observations) have non-missing, nonzero damage values, in other words, positive data with which calculations can be performed.
Events that did not result in damage are excluded from the remainder of this analysis. The paucity of the resulting analytical data set is important with respect to the usefulness or credibility of statistics generated from these values: the sample size is small - or, in the case of crop damage, very small. In fact, this effect is even more pronounced than it appears in Figure 2 for several reasons which will be explored later in this analysis.
#Exclude weather events that did not result in crop or property damage.
idx <- property[, DamageValue == 0]
property <- property[!idx,]
# Rhode Island poses something of a problem: all crop damage values are
# either missing or zero.
idx <- crop[FIPS != 44 & DamageValue == 0, EventID]
crop <- crop[!(EventID %in% idx),]
# Select only events that are not also in the property damage table.
idx <- intersect(property[FIPS == 44]$EventID, crop[FIPS == 44]$EventID)
crop <- crop[!(EventID %in% idx)]
rm(idx)
layout(mat = matrix(c(1, 2, # property by year, crop by year
3, 3, # property by state
4, 4),# crop by state
nrow = 3, ncol = 2,
byrow = TRUE),
heights = c(1, 1, 1)
)
par(oma = c(4, 0, 0, 0))
aticks = c(0.0e+00, 2.5e+10, 5.0e+10, 7.5e+10, 1.0e+11, 1.25e+11)
alabels = c("0", "25", "50", "75", "100", "125")
tcolors <- viridisLite::viridis(100, option = "D")
propertyByYear <- property[, sum(DamageValue), by = year(BeginDate)]
setnames(propertyByYear, "V1", "Total")
setorder(propertyByYear, year)
tmax <- max(propertyByYear$Total)
propertyByYear[, Color := tcolors[round((Total/tmax)*100)]]
barplot(height = propertyByYear$Total,
main = "Property Damage by Year",
names.arg = YEARS, las = 2,
col = propertyByYear$Color,
xlab = "",
yaxt = "n",
ylab = "Total Damage (USD Billions)")
axis(2, at = aticks, labels = alabels, las = 1)
aticks = c(0.0e+00, 1.0e+09, 2.0e+09, 3.0e+09, 4.0e+09, 5.0e+09, 6.0e+09, 7.0e+09)
alabels = c("0", "1", "2", "3", "4", "5", "6", "7")
cropByYear <- crop[, sum(DamageValue), by = year(BeginDate)]
setnames(cropByYear, "V1", "Total")
setorder(cropByYear, year)
tmax <- max(cropByYear$Total)
cropByYear[, Color := tcolors[round((Total/tmax)*100)]]
barplot(height = cropByYear$Total,
main = "Crop Damage by Year",
names.arg = YEARS,
las = 2,
col = cropByYear$Color,
xlab = "",
yaxt = "n",
ylab = "Total Damage (USD Billions)")
axis(2, at = aticks, labels = alabels, las = 1)
tcolors <- viridisLite::viridis(101, option = "D")
aticks = c(0.0e+00, 2.5e+10, 5.0e+10, 7.5e+10, 1.0e+11, 1.25e+11)
alabels = c("0", "25", "50", "75", "100", "125")
propertyByState <- property[, sum(DamageValue), by = FIPS]
setorder(propertyByState, FIPS)
setnames(propertyByState, "V1", "Total")
tmax <- max(propertyByState$Total)
propertyByState[, Color := tcolors[round((Total/tmax)*100)+1]]
barplot(height = propertyByState$Total,
main = "Property Damage by State",
names.arg = states,
las = 2,
col = propertyByState$Color,
xlab = "",
yaxt = "n",
ylim = c(0, 1.25*10^11),
ylab = "Total Damage (USD Billions)")
axis(2, at = aticks, labels = alabels, las = 1)
aticks = c(0.0e+00, 5.0e+09, 1.0e+10, 1.5e+10, 2.0e+10)
alabels = c("0", "5", "10", "15", "20")
cropByState <- crop[, sum(DamageValue), by = FIPS]
setnames(cropByState, "V1", "Total")
setorder(cropByState, FIPS)
tmax <- max(cropByState$Total)
cropByState[, Color := tcolors[round((Total/tmax)*100)+1]]
barplot(height = cropByState$Total,
names.arg = states, las = 2,
col = cropByState$Color,
xlab = "",
yaxt = "n",
ylim = c(0, 2.0*10^10),
ylab = "Total Damage (USD Billions)")
title(main = "Crop Damage by State", line = -1)
axis(2, at = aticks, labels = alabels, las = 1)
mtext("Fig. 3, Damage Totals by Year and State",
side = 1,
line = +3,
cex = 0.75,
outer = TRUE)
par(opar)
Figure 3 gives totals by year and by state for property damage and crop damage. Comparison of simple totals by state is not particularly useful. There are numerous confounding variables that affect the results, the most obvious being the vast differences in size, topography, and latitude of the different states. Social, economic, and historical factors such as degree of urban development, area under cultivation, population density, real estate prices, agricultural supply and demand, and so on, obscure the influence of weather alone.
What types of weather events have caused the most damage?
# Calculate percentage of property damage, and crop damage by event type.
total_property_damage <- sum(property$DamageValue)
pctPropertyDamage <-
property[, round(sum(DamageValue)/total_property_damage, 2),
by = EventType][V1 > 0]
setnames(pctPropertyDamage, "V1", "Percent")
total_crop_damage <- sum(crop$DamageValue)
pctCropDamage <-
crop[, round(sum(DamageValue)/total_crop_damage, 2),
by = EventType][V1 > 0]
setnames(pctCropDamage, "V1", "Percent")
par(mfrow = c(1,2))
setorder(pctPropertyDamage, Percent)
aticks = c(0, 0.1, 0.2, 0.3, 0.4, 0.5)
alabels = c("0", "0.1", "0.2", "0.3", "0.4", "0.5")
plotColors1 <- eventColorsForDataTable(pctPropertyDamage)
with(pctPropertyDamage,
barplot(Percent,
col = plotColors1$EventColor,
xlab = "",
xaxt = "n",
ylim = c(0.00, 0.50), yaxt = "n",
ylab = "% Total Damage",
main = "Property"))
axis(2, at = aticks, labels = alabels, las = 1)
setorder(pctCropDamage, Percent)
plotColors2 <- eventColorsForDataTable(pctCropDamage)
with(pctCropDamage,
barplot(Percent,
col = plotColors2$EventColor,
xlab = "",
xaxt = "n",
ylim = c(0.00, 0.50), yaxt = "n",
ylab = "% Total Damage",
main = "Crop"))
axis(2, at = aticks, labels = alabels, las = 1)
plotColors <- rbind(plotColors1, plotColors2)
plotColors <- unique(plotColors)
par(fig = c(0, 1, 0, 1),
oma = c(0, 0, 0, 0),
mar = c(1, 0, 1, 0),
new = TRUE)
plot(0, 0, type = 'l', bty = 'n', xaxt = 'n', yaxt = 'n')
legend(x = "bottom",
ncol = 6,
legend = plotColors$EventType,
fill = plotColors$EventColor,
cex = 0.5,
xpd = TRUE)
Fig. 4. Property and Crop Damage by Weather Event Type
par(opar)
Figure 4 shows the percentage of total property and crop damage caused by various types of weather events over the period 1996-2021. According to this data set:
Hurricanes, flooding, and tropical storm surge caused the greatest percentage of property damage nationally. Indeed, the plot of property damage by year in Figure 3 (upper right) is dominated by the huge impact of Hurricane Katrina in 2005.
Drought caused the greatest percentage of crop damage nationally, followed by hurricanes, flooding, and cold.
In a country as large and geographically diverse as the United States, considering the impacts of weather at a national level is of somewhat limited utility. The fact that hurricanes accounted for greatest percentage of property damage from weather nationally means little in states such as Alaska or Montana. Which event types are associated with the greatest economic impact at the state level?
Below are two maps showing which weather event type caused the greatest percentage of damage in each state to property and crops, respectively, over the period of 1996-2021.
cols <- c("FIPS", "EventType", "DamageValue")
stmap <- property[, sum(DamageValue),
by = list(FIPS, EventType)][ , .SD[which.max(V1)], by = FIPS]
setnames(stmap, "V1", "Damage")
stmap[, Type := 1L] #property damage
stmap[, Value := 1L] #event type
setorder(stmap, FIPS)
cropMax <-
crop[, sum(DamageValue),
by = list(FIPS, EventType)][ , .SD[which.max(V1)], by = FIPS]
setnames(cropMax, "V1", "Damage")
cropMax[, Type := 2L] #crop damage
cropMax[, Value := 1L] #event type
setorder(cropMax, FIPS)
stmap <- rbind(stmap, cropMax)
stmap$Type <- factor(stmap$Type,
levels = 1:2,
labels = c("Property", "Crop"))
stmap$Value <- factor(stmap$Value,
levels = 1:2,
labels = c("Event Type", "Single Event"))
# plot_usmap requires lowercase "state" or "fips" column
setnames(stmap, 1, "fips")
setorder(stmap, EventType)
dataColors <- eventColorsForDataTable(stmap)
mapColors = setNames(dataColors$EventColor, dataColors$EventType)
plot_usmap(regions = "states",
data = stmap,
values = "EventType") +
scale_fill_manual(values = mapColors, drop = TRUE) +
facet_wrap("Type") +
theme(legend.position = "bottom",
legend.key = element_rect(color = "black", size = 0.25),
legend.justification = "center",
legend.key.size = unit(0.4, 'cm'),
plot.title = element_text(hjust = 0.5)) +
guides(fill = guide_legend(nrow = 3,
label.hjust = 0,
title = "Event Type",
title.position = "bottom",
title.hjust = 0.5)
)
Fig. 5. Property and Crop Damage by Weather Event Type, by State
This seems like a reasonable approach to obtain an overview of weather impact by state, until you look more closely at the results. These superficial summary results appear informative, and to some extent are informative; however, they mask a great deal of information about the types of weather events experienced by states, how often they occur, and the relative amounts of damage they cause, and about how weather data are recorded in the NOAA Storm Events Database.
The maps in Figure 5, above, show the most costly weather event type by state; how much more costly are these maximum event types than other types of weather events in each state?
propertyStats <- rbindlist(lapply(states, function(x) {
st <- eventStatsForState(x, "property")
setorder(st, -Percent)
st[1,]
}))
propertyStats[, Type := 1]
cropStats <- rbindlist(lapply(states, function(x) {
st <- eventStatsForState(x, "crop")
setorder(st, -Percent)
st[1,]
}))
cropStats[, Type := 2]
allStats <- rbind(propertyStats[, !"PropertyDamage"], cropStats[, !"CropDamage"])
allStats$Type <- factor(allStats$Type,
levels = 1:2,
labels = c("Property", "Crop"))
# plot_usmap requires lowercase "state" or "fips" column.
setnames(allStats, "State", "state")
plot_usmap(regions = "states",
data = allStats,
values = "Percent",
show.legend = FALSE) +
theme(legend.position = "right",
plot.title = element_text(hjust = 0.5),
plot.margin = margin(t = 0, r = 0, b = 0, l = 0)) +
scale_fill_viridis_c(direction = 1, alpha = 0.95, option = "magma") +
facet_wrap("Type")
Fig. 6. Masking Effect: Worst Weather Event Type
The gradient maps in Figure 6 depict the relative masking effect of worst weather event type for each state: the darker the color, the more information about other event types is being hidden. Some specific examples will help clarify this effect.
Low and High Event Type Masking: Crop Damage in Oklahoma and Utah
On the crop damage map in Figure 6, Oklahoma shows the lowest level of event type masking, and Utah shows the highest level. In Oklahoma, 99.4% of damage was caused by a single event type, drought; summary statistics based on worst event type only exclude, or mask, 0.6% of all Oklahoma crop damage.
ok <- eventStatsForState("Oklahoma", "crop", drop = FALSE)
ut <- eventStatsForState("Utah", "crop")
par(mfrow = c(1,2))
# Oklahoma
setorder(ok, -Total)
aticks = c(0.0e+00, 0.5e+09, 1.0e+09, 1.5e+09, 2.0e+09)
alabels = c("0", "0.5", "1.0", "1.5", "2.0")
plotColors1 <- eventColorsForDataTable(ok)
p <- barplot(ok$Total,
col = plotColors1$EventColor,
xaxt = "n",
cex.lab = 0.75,
ylab = "Event Type Damage (USD Billions)",
yaxt = "n",
main = "Oklahoma")
axis(2, at = aticks, labels = alabels, las = 1)
# Utah
setorder(ut, -Total)
aticks = c(0.0e+00, 0.5e+06, 1.0e+06, 1.5e+06)
alabels = c("0", "0.5", "1.0", "1.5")
plotColors2 <- eventColorsForDataTable(ut)
plotColors <- rbind(plotColors1, plotColors2)
plotColors <- unique(plotColors)
p <- barplot(ut$Total,
col = plotColors2$EventColor,
xaxt = "n",
cex.lab = 0.75,
ylab = "Event Type Damage (USD Millions)",
yaxt = "n",
main = "Utah")
axis(2, at = aticks, labels = alabels, las = 1)
par(fig = c(0, 1, 0, 1),
oma = c(0, 0, 0, 0),
mar = c(1, 0, 1, 0),
new = TRUE)
plot(0, 0, type = 'l', bty = 'n', xaxt = 'n', yaxt = 'n')
legend(x = "bottom",
legend = plotColors$EventType,
fill = plotColors$EventColor,
ncol = 7,
cex = 0.5,
xpd = TRUE)
Fig. 7. Low (left) and High (right) Masking by Worst Event Type
par(opar)
rm(ok, ut)
In contrast, crop damage in Utah shows a high degree of masking by worst event type. Here, the most costly event type, high wind, accounted for only 25.6% of total damage; the second-worst event type, tornado, accounted for 25.3%, only 0.3% less than high wind. Summary statistics for Utah based on the worst event type exclude, or mask, 74.4% of all crop damage, in other words, most of the available information.
A single catastrophic weather event may cause so much damage that it continues to dominate statistics for a location for many years. An easy assumption is that these single dominant weather events are outliers that are masking or distorting the “true” effects of weather on crops and property. In some cases, this is true.
Low and High Single Event Masking: Crop Damage in Nebraska and Arizona
ne <- crop[State == "Nebraska"]
setorder(ne, -DamageValue)
ne <- ne[1:6]
az <- crop[State == "Arizona"]
setorder(az, -DamageValue)
az <- az[1:6]
# Nebraska
par(mfrow = c(1,2))
plotColors1 <- eventColorsForDataTable(ne)
ne[, EventColor := plotColors1[EventType == .BY, EventColor],by=EventType]
aticks = c(0.0e+00, 5.0e+06, 1.0e+07, 1.5e+07, 2.0e+07, 2.5e+07, 3.0e+07, 3.5e+07)
alabels = c("0", "5", "10", "15", "20", "25", "30", "35")
p <- barplot(ne$DamageValue,
col = ne$EventColor,
xaxt = "n",
cex.lab = 0.75,
ylab = "Single Event Damage (USD Millions)",
yaxt = "n",
main = "Nebraska")
axis(2, at = aticks, labels = alabels, las = 1)
# Arizona
aticks = c(0.0e+00, 0.5e+08, 1.0e+08, 1.5e+08, 2.0e+08, 2.5e+08, 3.0e+08)
alabels = c("0", "50", "100", "150", "200", "250", "300")
plotColors2 <- eventColorsForDataTable(az)
az[, EventColor := plotColors2[EventType == .BY, EventColor], by=EventType]
plotColors <- rbind(plotColors1, plotColors2)
plotColors <- unique(plotColors)
p <- barplot(az$DamageValue,
col = az$EventColor,
xaxt = "n",
cex.lab = 0.75,
ylab = "Single Event Damage (USD Millions)",
yaxt = "n",
main = "Arizona")
axis(2, at = aticks, labels = alabels, las = 1)
par(fig = c(0, 1, 0, 1),
oma = c(0, 0, 0, 0),
mar = c(1, 0, 1, 0),
new = TRUE)
plot(0, 0, type = 'l', bty = 'n', xaxt = 'n', yaxt = 'n')
legend(x = "bottom",
legend = plotColors$EventType,
fill = plotColors$EventColor,
ncol = 5,
cex = 0.6,
xpd = TRUE)
Fig. 8. Low (left) and High (right) Masking by Worst Single Event
par(opar)
rm(ne, az)
The single worst weather event in Nebraska from 1996 through 2021 was a hailstorm in July of 1996, which caused less than one percent of total crop damage. That single worst event is not an outlier. Hail is the leading cause of crop damage in Nebraska and that particular hailstorm was just slightly more destructive than others. Removing this event would not have a significant effect on statistics for crop damage in Nebraska.
Somewhat surprisingly, the worst weather event in Arizona from 1996-2021 was a tropical storm. In 1997, Hurricane Nora, passing from the Pacific Ocean up the narrow Sea of Cortés between Baja California and mainland México, made landfall as a tropical storm in Arizona. This single storm event caused 91.6% of all crop damage reported for the period. It is an undoubted outlier, the only tropical storm event from 1996 through 2021. Removing it from the analysis will certainly affect the Arizona crop damage results.
az <- crop[State == "Arizona"]
nora <- az[DamageValue == max(DamageValue)]$EventID
az_total_noNora <- az[EventID != nora, sum(DamageValue)]
az_noNora <- az[EventID != nora,
round(sum(DamageValue)/az_total_noNora, 3),
by = EventType][V1 > 0.01]
setnames(az_noNora, "V1", "Percent")
setorder(az_noNora, -Percent)
az_crop <- eventStatsForState("Arizona", "crop")
par(mfrow = c(1,2))
par(mar = c(4, 4, 2, 2) + 0.1)
# Arizona - all
plotColors <- eventColorsForDataTable(az_crop)
p <- barplot(az_crop$Percent,
col = plotColors$EventColor,
xaxt = "n",
ylim = c(0.0, 1.0),
ylab = "% Crop Damage",
main = "All Event Types")
text(x = p, y = az_crop$Percent + 0.05,
labels = az_crop$CropDamage)
# Arizona - without tropical storm event
plotColors <- eventColorsForDataTable(az_noNora)
p <- barplot(az_noNora$Percent,
col = plotColors$EventColor,
xaxt = "n",
ylim = c(0.0, 1.0),
ylab = "% Crop Damage",
main = "Minus Tropical Storm")
noNoraLabels <- paste0(az_noNora$Percent * 100, "%")
text(x = p, y = az_noNora$Percent + 0.05,
labels = noNoraLabels)
# Pick up all event color for the legend.
plotColors <- eventColorsForDataTable(az_crop)
par(fig = c(0, 1, 0, 1),
oma = c(0, 0, 0, 0),
mar = c(1, 0, 1, 0),
new = TRUE)
plot(0, 0, type = 'l', bty = 'n', xaxt = 'n', yaxt = 'n')
legend(x = "bottom",
legend = plotColors$EventType,
fill = plotColors$EventColor,
ncol = 4,
cex = 0.6,
xpd = TRUE)
Fig. 9. Unmasking Crop Damage in Arizona
par(opar)
rm(az)
Figure 9 illustrates the degree to which the single tropical storm event masks the causes of crop damage that Arizona farmers face year after year: frost, hail, and flooding. The conclusion that tropical storm was the leading cause of crop damage in Arizona for the period is true but not necessarily relevant. The striking nature of the result should not be mistaken for importance, which depends on the question being researched. For example, this fact would probably be a poor guide for decision-making about protecting or insuring crops in Arizona.
The “unmasked” Arizona crop damage summary provides useful, actionable information. The predominance of frost/freeze damage suggests that investments in cold-tolerant crop strains, greenhouses, row covers, and methods for preventing or mitigating freeze damage to orchards could provide the real benefits; the statistic for hail indicates that it would be advisable that protective structures such as greenhouses (which are commonly made of glass) be built to withstand hailstorms.
A less immediately obvious contributor to the extreme effect of Arizona’s single worst event on crop statistics is this: that tropical storm was one of 31 total crop-damage events reported from 1996-2021, or 3.2% of all events. In contrast, Nebraska’s worst event was just one of 2,164 hail events (0.046%), and 3,445 total events (0.029%)) that affected crops in Nebraska during the years 1996-2021. Statistics on Nebraska are based on more data than statistics on Arizona. In part, this is due to the differences in climate/weather, but here again, the difference in land area under cultivation is likely a confounding variable. Arizona is largely desert unsuitable for commercial agriculture, whereas Nebraska is intensively cultivated.
Periodic catastrophic events are characteristic of weather data in general and their effects on statistical analysis may persist over long periods of time. This effect should be borne in mind when formulating questions and methods for analysis of weather data in order to obtain the most meaningful and useful results. To gauge the significance of this effect in each state, it is useful to look at the percentage of total damage caused by the single most costly weather event.
processMaxEvent <- function(dt, type = "property") {
setnames(maxEvent, "DamageValue", "Damage")
setnames(maxEvent, "FIPS", "fips")
if( type == "crop") {
maxEvent[, Type := 2L] #crop damage
maxEvent[, Value := 2L] #single worst event
maxType = stmap[Type == "Crop"]
} else {
maxEvent[, Type := 1L] #property damage
maxEvent[, Value := 2L] #single worst event
maxType = stmap[Type == "Property"]
}
setorder(maxEvent, fips)
maxEvent$Type <- factor(maxEvent$Type,
levels = 1:2,
labels = c("Property", "Crop"))
maxEvent$Value <- factor(maxEvent$Value,
levels = 1:2,
labels = c("Event Type", "Single Event"))
return(maxEvent)
}
cols <- c("FIPS", "EventType", "DamageValue")
maxEvent <- rbindlist(lapply(states, function(x) {
st <- property[State == x, ..cols]
setorder(st, -DamageValue)
st[1,]
}))
maxEvent <- processMaxEvent(maxEvent, "property")
stmap <- rbind(stmap, maxEvent)
maxEvent <- rbindlist(lapply(states, function(x) {
st <- crop[State == x, ..cols]
setorder(st, -DamageValue)
st[1,]
}))
maxEvent <- processMaxEvent(maxEvent, "crop")
stmap <- rbind(stmap, maxEvent)
setorder(stmap, EventType)
pctByStateP <- stmap[(Type == "Property" & Value == "Single Event"),
.(fips, Damage, Type)]
totalByState <- property[, sum(DamageValue), by = FIPS]
setnames(totalByState, "FIPS", "fips")
setorder(pctByStateP, fips)
setorder(totalByState, fips)
pctByStateP <- cbind(pctByStateP, totalByState$V1)
setnames(pctByStateP, "V2", "StateTotal")
pctByStateP[, MaxPercent := Damage/StateTotal]
pctByStateC <- stmap[(Type == "Crop" & Value == "Single Event"),
.(fips, Damage, Type)]
totalByState <- crop[, sum(DamageValue), by = FIPS]
setnames(totalByState, "FIPS", "fips")
setorder(pctByStateC, fips)
setorder(totalByState, fips)
pctByStateC <- cbind(pctByStateC, totalByState$V1)
setnames(pctByStateC, "V2", "StateTotal")
pctByStateC[, MaxPercent := Damage/StateTotal]
pctByStateC[StateTotal == 0, MaxPercent := 0] # Fix Rhode Island
stpct <- rbind(pctByStateP, pctByStateC)
stpct[, MaxDisplay := MaxPercent * 100]
plot_usmap(regions = "states",
data = stpct,
values = "MaxDisplay") +
theme(legend.position = "right",
plot.title = element_text(hjust = 0.5),
plot.margin = margin(t = 0, r = 0, b = 0, l = 0)) +
scale_fill_viridis_c(name = "% Damage", direction = -1, option = "magma") +
facet_wrap("Type")
Fig. 10. Masking Effect: Single Catastrophic Weather Event
In Figure 10, the greater the percentage of total damage that was caused by a single event, the darker the state color. For crop damage, Arizona stands out strongly, as expected, being dominated by its freak tropical storm; Pennsylvania and Connecticut also catch the eye. The effects of single events on property damage are not quite as striking, but New Mexico, Nevada, California, and North Dakota are noticeable as states where the relevance of dominating events should be carefully evaluated.
Comparison of the current version of the NOAA Storm Events Database with earlier versions reveals that the National Oceanic and Atmospheric Administration has done an enormous amount of work to standardize and rationalize the weather-related data reported by hundreds, or possibly thousands, of people over a period of 71 years. Among many achievements of this effort, the Event Type data have been cleaned up dramatically. The standardization of the event type data has highlighted an issue inherent to the recording of weather data: assigning a single, discrete event type to a weather incident is not always realistic.
Property Damage: High Wind in Oregon
One unexpected result on the property damage map in Figure 4 is the event type responsible for the most property damage in Oregon: high wind. This is not an obvious finding; Oregon is not renowned for wind in the way that, for instance, Wyoming is.
st_data <- dataTableForStateType("Oregon", "property")
st_total <- sum(st_data$DamageValue)
st_data[, Year := year(BeginDate)]
maxByYear <- st_data[, sum(DamageValue),
by = list(Year,
EventType)][,.SD[which.max(V1)], by = Year]
setnames(maxByYear, "V1", "Damage")
stColors <- eventColorsForDataTable(maxByYear)
par(mar = c(6, 4.1, 2.1, 2.1))
layout(matrix(c(1, 2),
nrow = 1,
ncol = 2,
byrow = TRUE),
widths = c(1, 2))
yearCounts <- maxByYear[, .N, by = "EventType"]
setorder(yearCounts, -N)
setnames(yearCounts, "N", "Count")
plotColors <- stColors[yearCounts, .(EventColor), on = "EventType"]
with(yearCounts, plot(Count, pch = 21,
col = "black",
bg = plotColors$EventColor,
ylab = "Number of Years",
xlab = "",
xaxt = "n")
)
title(xlab = "Event Type", line = 1)
plotColors <- stColors[maxByYear, .(EventColor), on = "EventType"]
with(maxByYear, plot(Damage/10^9, pch = 21,
col = "black",
bg = plotColors$EventColor,
ylim = c(0.00, 3.50),
ylab = "Total Damage (USD Billions)",
xlab = "",
xaxt = "n")
)
axis(side = 1, at = 1:26, labels = as.character(YEARS), las = 2)
par(fig = c(0, 1, 0, 1),
oma = c(0, 0, 0, 0),
mar = c(1, 0, 3, 0),
new = TRUE)
plot(0, 0, type = 'l', bty = 'n', xaxt = 'n', yaxt = 'n')
legend(x = "bottom",
legend = unique(maxByYear$EventType),
fill = unique(plotColors$EventColor),
ncol = 6,
cex = 0.5,
xpd = TRUE)
Fig. 11. Most Destructive Event Type by Year in Oregon
par(opar)
cols <- c("EventID", "EpisodeID", "EventType", "DamageValue", "BeginDate")
dateRange <- as.IDate("20200101", format = "%Y%m%d"):
as.IDate("20201231", format = "%Y%m%d")
or20 <- property[State == "Oregon"][BeginDate %in% dateRange,..cols]
or20_total <- sum(or20$DamageValue)
evcounts <- or20[, .N, by = EventType]
evType <- or20[, round(sum(DamageValue)/or20_total, 3) *100,
by=EventType][V1 > 0]
setnames(evType, "V1", "Damage")
evType <- merge(evType, evcounts)
setnames(evType, "N", "Count")
evType[, DamagePct := paste0(Damage, "%")]
setcolorder(evType, c("EventType", "Count", "Damage", "DamagePct"))
setorder(evType, -Damage)
kbl(evType[, !c("Damage")],
align = "rcl",
col.names = c("Event Type", "Count", "% Damage"),
caption = "Table 1a: Oregon Summary 2020") %>%
row_spec(1:2, bold = TRUE) %>%
kable_classic_2(full_width = FALSE,
position = "float_left")
| Event Type | Count | % Damage |
|---|---|---|
| High Wind | 6 | 87.7% |
| Strong Wind | 2 | 10.9% |
| Flood | 3 | 1.1% |
| Wildfire | 1 | 0.3% |
A variety of weather event types dominated property damage year by year in Oregon but the damage from high wind in 2020 was an order of magnitude worse than the damage from any other year. That year saw some of the worst wildfires in U.S. history, which were particularly devastating in Oregon.
Table 1a shows that wind damage (high wind plus strong wind) accounts for 98.6% of total property damage for 2020 in Oregon. The worst wildfires in the state’s history account for less than 1%. What wind event(s) could have been so bad as to dominate not only the 2020 wildfires but everything in the preceding 24 years, by an order of magnitude? Is this an error in the NOAA database? Table 1b, below, shows what weather events were reported in that year and how much damage they caused.
# Format damage for display
or20[, Damage := paste0("$", format(round(DamageValue), big.mark = ","))]
setorder(or20, BeginDate, EventType, -DamageValue)
wind_episode <- unique(or20[EventType == "High Wind", EpisodeID])
bold_rows <- which(or20$EpisodeID == wind_episode)
or20$EpisodeID <-
cell_spec(or20$EpisodeID,
color = ifelse(or20$EpisodeID == wind_episode, "blue", "black"))
kbl(or20[, !("DamageValue")],
escape = FALSE,
align = "cclcr",
col.names = c("Event ID", "Episode ID", "Event Type", "Date", "Damage"),
caption = "Table 1b: Oregon Weather Events 2020") %>%
row_spec(bold_rows, bold = TRUE) %>%
kable_classic_2(full_width = FALSE,
position = "center")
| Event ID | Episode ID | Event Type | Date | Damage |
|---|---|---|---|---|
| 871967 | 145223 | Flood | 2020-02-06 | $ 41,245,201 |
| 871972 | 145223 | Flood | 2020-02-06 | $ 779,191 |
| 871973 | 145223 | Flood | 2020-02-07 | $ 134,021 |
| 922952 | 153297 | High Wind | 2020-09-07 | $2,060,250,289 |
| 922953 | 153297 | High Wind | 2020-09-07 | $1,030,125,145 |
| 922949 | 153297 | High Wind | 2020-09-07 | $ 103,012,514 |
| 922958 | 153297 | High Wind | 2020-09-07 | $ 515,063 |
| 922957 | 153297 | High Wind | 2020-09-07 | $ 515,063 |
| 922956 | 153297 | Strong Wind | 2020-09-07 | $ 206,025,029 |
| 922954 | 153297 | Strong Wind | 2020-09-07 | $ 206,025,029 |
| 922948 | 153297 | High Wind | 2020-09-08 | $ 123,615,017 |
| 909610 | 150909 | Wildfire | 2020-09-08 | $ 10,301,251 |
| 922977 | 153300 | Thunderstorm Wind | 2020-09-17 | $ 2,060 |
| 922972 | 153300 | Thunderstorm Wind | 2020-09-18 | $ 25,753 |
| 922973 | 153300 | Thunderstorm Wind | 2020-09-18 | $ 10,301 |
| 922971 | 153300 | Thunderstorm Wind | 2020-09-18 | $ 3,090 |
| 922984 | 153300 | Thunderstorm Wind | 2020-09-18 | $ 1,030 |
| 921015 | 152909 | Tornado | 2020-11-10 | $ 41,104 |
It is notable that all of the 2020 wind events have the same Episode ID. The codebook for the NOAA Storm Events Database explains the relationship between episodes and events as follows.
excerpts <- data.table(
Variable = c("episode_id", "event_id"),
Description = c("ID assigned by NWS to denote the storm episode; Episodes may contain multiple Events. The occurrence of storms and other significant weather phenomena having sufficient intensity to cause loss of life, injuries, significant property damage, and/or disruption to commerce.", "ID assigned by NWS for each individual storm event contained within a storm episode; links the record with the same event in the storm_event_details, storm_event_locations and storm_event_fatalities tables (Primary database key field)."))
kbl(excerpts) %>%
kable_paper(full_width = FALSE, html_font = "Helvetica") %>%
row_spec(1:2, background = "cornsilk") %>%
column_spec(1, bold = TRUE, border_right = TRUE)
| Variable | Description |
|---|---|
| episode_id | ID assigned by NWS to denote the storm episode; Episodes may contain multiple Events. The occurrence of storms and other significant weather phenomena having sufficient intensity to cause loss of life, injuries, significant property damage, and/or disruption to commerce. |
| event_id | ID assigned by NWS for each individual storm event contained within a storm episode; links the record with the same event in the storm_event_details, storm_event_locations and storm_event_fatalities tables (Primary database key field). |
Given that the wind events in question are all part of a single episode, it is worth retrieving the episode narrative for further information. The narrative is quite long, but the first and last sentences succinctly identify the issue:
narrative <- retrieveNarrative("episode", 2020, wind_episode)
sentences <- unnest_tokens(narrative,
sentence,
EpisodeNarrative,
token = "sentences",
to_lower = FALSE)
s <- sentences[[1,1]]
excerpts <- data.table(
Key = c("1", "9"),
Excerpt = c(paste0("...[I]n ", trimws(strsplit(s, " in")[[1]][2])),
sentences[9,1]))
kbl(excerpts) %>%
kable_paper(full_width = FALSE, html_font = "Helvetica") %>%
row_spec(1:2, background = "aliceblue") %>%
column_spec(1, bold = TRUE, border_right = TRUE)
| Key | Excerpt |
|---|---|
| 1 | …[I]n early September, very strong easterly downslope and offshore winds off the Cascades and Coastal Ranges occurred. |
| 9 | Rapidly spreading wildfires resulted in multiple fatalities, hundreds of displaced persons for many weeks, and billions of dollars in damage. |
The episode narrative states that the damage resulted from rapidly spreading wildfires. Should these events have been classified as wildfire events rather than wind events? As it turns out, there is no single, obviously “correct” way to classify these events, as can be seen from further key phrases excerpted from the episode narrative.
one <- paste0("...", trimws(strsplit(sentences[[3,1]], "the")[[1]][3]), "...")
two <- sentences[[4,1]]
s <- gsub("Widespread", "", sentences[[5,1]])
three <- paste0("...", trimws(strsplit(s, " on")[[1]][1]), "...", trimws(strsplit(s, "other")[[1]][2]))
four <- sentences[[6,1]]
excerpts <- data.table(
Key = c("3", "4", "5", "6"),
Excerpt = c(one, two, three, four)
)
kbl(excerpts) %>%
kable_paper(full_width = FALSE, html_font = "Helvetica") %>%
row_spec(1:4, background = "aliceblue") %>%
column_spec(1, bold = TRUE, border_right = TRUE)
| Key | Excerpt |
|---|---|
| 3 | …strong winds combined with extremely low relative humidity and exceptionally dry existing fuel conditions…. |
| 4 | The result was explosive growth of ongoing wildfires, and the new start and explosive spread of numerous new wildfires. |
| 5 | …wind gusts from 50-70 mph were common…in exposed areas, including portions of the greater Portland metro area, the Willamette Valley, and areas of the Oregon coast. |
| 6 | Strong winds caused widespread damage to trees, and downed numerous power lines across the region, which started at least 13 additional wildfires. |
This episode narrative describes complex, synergistic interactions of wind and fire, occurring in exceptionally dry conditions. Cause and effect cannot be clearly separated. The existing fires would not have done as much damage in the absence of high winds; the winds alone would not have done as much damage without the pre-existing fires; but the winds also started new fires by dropping trees on power lines. Neither wind nor fire would have been so destructive if conditions had not been abnormally dry.
A single event type cannot accurately represent such a complex incident. It would be neither more nor less correct to classify these events as wildfire rather than wind: information is lost either way. Some weather events are irreducibly complex. In the NOAA Storm Events Database, the necessity of assigning a single event type can misrepresent the data without being incorrect. This is a non-trivial problem in the representation and analysis of weather information. In the NOAA storm data records, key information that is needed to understand statistics that can be generated from the structured numeric data is contained in unstructured episode and event text narratives, where it is not readily accessible. Over time, NOAA has been extracting some of these data into structured variables, such as the FLOOD_CAUSE (self-explanatory), CATEGORY (drought classification), MAGNITUDE and MAGNITUDE_TYPE (hail size and wind speed), and various TOR_* (quantitative aspects of tornados) variables.
The episode identifier is very useful for understanding the weather data for Oregon in 2020; however, this variable cannot be relied on for aggregating events belonging to the same weather incident, at least not in this version of the database (downloaded 7 Feb. 2022).
Inconsistent Usage: 2014 Crop Damage in Idaho
cols <- c("EventID", "EpisodeID", "EventType", "DamageValue", "BeginDate")
id14 <- crop[State == "Idaho" & year(BeginDate) == 2014, ..cols][DamageValue > 0]
id14[, Damage := paste0("$", format(round(DamageValue), big.mark = ","))]
setorder(id14, BeginDate, EventType, -DamageValue)
pcolors = c("orangered", "green", "blue", "purple")
hc2 <- rgb(t(col2rgb("red")), alpha = 0.25*255, maxColorValue = 255)
hc1 <- rgb(t(col2rgb("white")), alpha = 0, maxColorValue = 255)
id14[, EventColor := pcolors[.GRP], by=EventType]
id14[, DateColor := pcolors[.GRP], by=BeginDate]
rainIDs <- id14[EventType == "Heavy Rain"]$EpisodeID
nonRainIDs <- id14[!(EpisodeID %in% rainIDs)]$EpisodeID
id14$BeginDate <-
cell_spec(id14$BeginDate, color = id14$DateColor,
background = ifelse(id14$EventID == 541154, hc2, hc1))
id14$EpisodeID <-
cell_spec(id14$EpisodeID,
color = ifelse(id14$EventType == "Heavy Rain", "black", id14$EventColor),
background = factor(id14$EpisodeID, c(rainIDs, nonRainIDs), c(rep_len(hc2, 17), rep_len(hc1, 6))),
bold = ifelse(id14$EventType == "Heavy Rain", FALSE, TRUE))
kbl(id14[, !c("DamageValue", "DateColor", "EventColor")],
escape = FALSE,
align = "cclcr",
col.names = c("Event ID", "Episode ID", "Event Type", "Date", "Damage"),
caption = "Table 2: Idaho Crop Damage 2014") %>%
column_spec(4, bold = TRUE) %>%
kable_classic_2(full_width = FALSE,
position = "float_right")
| Event ID | Episode ID | Event Type | Date | Damage |
|---|---|---|---|---|
| 540728 | 89916 | Heavy Rain | 2014-08-01 | $8,140,951 |
| 540724 | 89912 | Heavy Rain | 2014-08-01 | $6,977,958 |
| 540719 | 89910 | Heavy Rain | 2014-08-01 | $6,977,958 |
| 540692 | 89908 | Heavy Rain | 2014-08-01 | $5,233,468 |
| 540725 | 89913 | Heavy Rain | 2014-08-01 | $4,651,972 |
| 540726 | 89914 | Heavy Rain | 2014-08-01 | $4,651,972 |
| 540727 | 89915 | Heavy Rain | 2014-08-01 | $4,651,972 |
| 540721 | 89911 | Heavy Rain | 2014-08-01 | $4,651,972 |
| 540730 | 89919 | Heavy Rain | 2014-08-01 | $3,488,979 |
| 540734 | 89923 | Heavy Rain | 2014-08-01 | $3,488,979 |
| 540731 | 89920 | Heavy Rain | 2014-08-01 | $3,488,979 |
| 540732 | 89921 | Heavy Rain | 2014-08-01 | $3,488,979 |
| 540736 | 89925 | Heavy Rain | 2014-08-01 | $3,488,979 |
| 540729 | 89918 | Heavy Rain | 2014-08-01 | $2,325,986 |
| 540735 | 89924 | Heavy Rain | 2014-08-01 | $2,325,986 |
| 540733 | 89922 | Heavy Rain | 2014-08-01 | $2,325,986 |
| 540737 | 89926 | Heavy Rain | 2014-08-01 | $2,325,986 |
| 540552 | 89850 | Flash Flood | 2014-08-05 | $ 2,326 |
| 540547 | 89850 | Flash Flood | 2014-08-05 | $ 1,163 |
| 541154 | 90044 | Flood | 2014-08-05 | $9,303,944 |
| 541159 | 90044 | Flood | 2014-08-06 | $9,303,944 |
| 538410 | 89374 | Hail | 2014-08-14 | $ 290,748 |
| 538273 | 89374 | Hail | 2014-08-14 | $ 203,524 |
In 2014, Idaho reported crop damage for four dates, all in August. The episode IDs correctly group the flood, flash flood, and hail events, but not the heavy rain events. The heavy rain events all have separate episode IDs although they all occurred on the same date. Only by reading each of their event and episode narratives can one confirm that these are county-by-county reports on a single, very large rainstorm that affected most of eastern Idaho on August 1, 2014. A shared episode identifier and narrative would make this obvious. Instead, the episode narratives simply repeat the event narratives, and the different episode identifiers obscure the fact that what occurred was one storm incident, not 17 unrelated storms.
Observations within the same month of the same year in a single state show inconsistencies that could alter the results of any analysis that was performed without a full understanding of these quirks in the data.
Notice that the Date (which is specifically the event begin date) will not correctly group the data either. The two flood events constitute a single flood incident which spanned two days. The first of these two days happens to be the same date as the flash flood incident, so grouping the data by Date incorrectly places the first flood event in the flash flood incident.
The result is that the dates and narratives, and possibly geographic data such as county or latitude/longitude, for every event may have to be reviewed in order to determine how to incorporate the data into an analysis.
There are 23 crop-damaging events reported for Idaho in 2014, but these represent only four weather incidents: a rainstorm on August 1; some flash flooding on August 5; a two-day flood on August 5-6, and a hail storm on Aug. 14. For some analyses, this significantly reduces the number of data points for calculations.
Geographic Boundaries
Another significant limitation of the episode ID is that it cannot be used to aggregate observations across state lines. There are only two instances in this entire analytical data set where the episode ID is shared by events in different states. This makes it much more difficult to piece together the effects of weather systems such as tornados, hurricanes, and winter storms, which often move as definite weather entities across multiple states.
RI <- crop[State == "Rhode Island"]
drought <- RI[EventType == "Drought", .N, by = .(EpisodeID, BeginDate)]
setnames(drought, "N", "Events")
drought$BeginDate <-
cell_spec(drought$BeginDate,
color = ifelse(year(drought$BeginDate) == 2012,
"green", "blue"),
bold = TRUE)
kbl(drought, escape = FALSE, align = "ccc", col.names = c("Episode ID", "Date", "Events"), caption = "Table 3. Drought in Rhode Island") %>%
kable_classic_2(full_width = FALSE,
position = "float_right")
| Episode ID | Date | Events |
|---|---|---|
| 63101 | 2012-04-12 | 8 |
| 63102 | 2012-05-01 | 1 |
| 109966 | 2016-08-02 | 1 |
| 110702 | 2016-09-01 | 1 |
| 110704 | 2016-09-13 | 7 |
| 111194 | 2016-10-01 | 1 |
| 111554 | 2016-11-01 | 8 |
Timeframe Boundaries: Drought in Rhode Island
The episode ID also cannot be used to aggregate weather across months or years. Consider crop damage data for drought in Rhode Island. There are 27 drought events reported in Rhode Island from 1996 through 2021. Aggregating by episode ID reduces these to seven drought episodes. Table 3 shows that these seven episodes occurred across consecutive months in just two years, 2012 and 2016. Rhode Island has experienced 2 droughts, not 27.
Note that there is no clear algorithmic way to define a weather incident across month or year boundaries. Rhode Island provides a “toy example” only. The total number of events/episodes/dates that comprise the two drought incidents is small, and the droughts are short-lived and confined to a single year each. Droughts in the western U.S. can persist for years or even decades. Rhode Island has such a small land area, with a range of elevation of only 126 meters, that it is safe to assume that any drought would affect the whole state - this is not a safe assumption for many states. For example, in Colorado, drought conditions west of the continental divide do not necessarily imply drought conditions east of the continental divide; drought incidents could not simply be defined by date.
So how much damage did the 2016 drought cause in Rhode Island? Impossible to say, from the NOAA data. One of the surprises in this database is that all Rhode Island crop damage values are either missing (empty strings) or zero. This might mean that weather which would damage crops tends to occur outside the outdoor growing season, or that crop varieties and agricultural practices in Rhode Island are particularly well-suited to the weather there, or that the farmers have just been lucky for 26 years. Or, it might mean that Rhode Island doesn’t report crop damage reliably to NOAA.
The episode narratives for the 2012 drought, which occurred in April and May, explicitly state that no agricultural losses occurred because the growing season had not yet begun. The story for 2016 is a bit more interesting.
ri16 <- RI[EventType == "Drought" & year(BeginDate) == 2016]
episodes <- unique(ri16$EpisodeID)
narratives <- vapply(episodes, FUN.VALUE = data.table(1), FUN = function(x) {
sentences <- data.table(NA)
narrative <- retrieveNarrative("episode", 2016, x)
if(!is.na(narrative)) {
sentences <- unnest_tokens(narrative,
sentence,
EpisodeNarrative,
token = "sentences",
to_lower = FALSE)
}
} )
overview <- paste0(narratives[[1]][1], "..T",
strsplit(narratives[[1]][3], "T")[[1]][2])
aug <- paste0("F", strsplit(narratives[[1]][8], "F")[[1]][2])
sep <- paste0("...[M]any", strsplit(narratives[[2]][10], "many")[[1]][2], " ", narratives[[2]][11], " ", narratives[[2]][12])
oct <- paste0("O", strsplit(narratives[[4]][3], "O")[[1]][2], " ", narratives[[4]][1], " ", narratives[[4]][2])
nov <- paste0("A ", strsplit(narratives[[5]][2], "A ")[[1]][2])
excerpts <- data.table(
Episode = c("Overview", "Aug", "Sep", "Oct", "Nov"),
Excerpt = c(overview, aug, sep, oct, nov)
)
kbl(excerpts) %>%
kable_paper(full_width = FALSE, html_font = "Helvetica") %>%
row_spec(1:5, background = "aliceblue") %>%
column_spec(1, bold = TRUE, border_right = TRUE)
| Episode | Excerpt |
|---|---|
| Overview | Below normal monthly precipitation that began in March of 2016, continued through August…The governor of Rhode Island issued a drought advisory for the entire state on August 17th. |
| Aug | Farmers have had to irrigate their crops much more than normal for August. |
| Sep | …[M]any farmers are having to choose which fields to irrigate and which to let go, as well as not doing 2nd and 3rd plantings of late summer crops. Providence County was declared a primary disaster county due to the drought. Farms in this county are eligible for assistance from the Farm Service Agency and the Livestock Forage Disaster Program. |
| Oct | October brought above normal rainfall to Rhode Island, but continued longer-term drought conditions. A Drought Advisory continued for the whole state of Rhode Island during the month of October. The advisory encouraged voluntary water conservation and adherence to any local water restrictions. |
| Nov | A Drought Advisory continued statewide in Rhode Island during the month of November. |
The outdoor growing season in Rhode Island is, generally speaking, May through October.
The September narrative clearly states that Rhode Island suffered direct crop damage and indirect crop losses from drought in 2016. This damage was not quantified in the reports to NOAA. This is a reporting error.
So, three levels of missing data are known to exist in the NOAA crop data: records that contain no information, represented as empty strings in the database; reporting errors, where known direct damage was incorrectly reported as zero damage; and unreported opportunity cost, where weather has prevented agricultural activity. Missing data calls into question the value of quantitative analysis - how relevant or reliable are statistics generated from partial data? Although there is significantly more positive data for property damage than crop damage in the NOAA database, the sketchy nature of the crop data also casts some uncertainty on the completeness of the property data.
This analysis has considered some issues that will be encountered when working with the NOAA Storm Events Database. A relatively small subset of the available variables were used for simplicity and clarity. The analysis demonstrated that it is quite easy to generate casual or superficial statistical results from property and crop data, which may look meaningful or even impressive. However, due both to issues innate to weather data in general, and to specific characteristics of the NOAA Storm Events database, the analyst who does this risks producing results that do not at all mean what the analyst believes they mean. A more complex or contextual research question, such as, “How much total property and crop damage was caused by Hurricane Sandy in 2012?” poses a non-trivial problem in identifying the relevant records in the NOAA database and incorporating them into a worthwhile analysis.
Weather data have some general characterisitics that may affect any analysis. Some weather events occur only rarely or very rarely (e.g., 50-year or 100-year storms). The NOAA Storm Events Database contains data since 1950, but initially only tornados were recorded. Only in 1996 was a standard implemented for collecting a broad spectrum of weather and related data. For the sake of simplicity, this analysis was restricted to the 26-year period after standardization, which is quite a short time horizon for the consideration of weather. An analysis which incorporates pre-1996 data would need to determine how to properly account for when different types of data began to be collected.
It is not exactly easy to define the set of phenomena that is collected in the Storm Events database, as it includes weather, weather effects, tides, certain geological events, and wildfires, as well as making distinctions on whether the phenomena occurred over land or at sea.
When a rare event falls within the timeframe of an analysis, the context of the analysis will affect whether the event should be included and how it should be incorporated into the analysis. A rare event that is not extreme may be easily overlooked; again, context will dictate whether this is relevant or not.
Extreme or catastrophic weather events can dramatically affect statistics, potentially obscuring valuable patterns or insights. The effects on analysis may persist over long time horizons. Extreme events may or may not be outliers depending on the analysis performed.
Classifying complex weather incidents by a single event type can be misleading. As seen with the 2020 Oregon wind/wildfire interactions, some weather incidents are irreducibly complex, and any choice of event type classification will have pros and cons. This complexity is not superficially evident; the analyst has to go looking for it, or risk misrepresenting reality.
The states and territories included in the NOAA database vary enormously in size, topography, urban development, area under agricultural production, property values, all of which may operate as confounding variables to an analysis.
There is an attribute of the NOAA database, which could be called “data shrinkage”, that the analyst would do well to keep in mind. The crop and property damage data demonstrate this attribute quite well: the vast majority of the records or observations in the database, nearly 1.5 million at this writing, have missing or zero damage values, and in 2007, something was changed which radically altered the proportions of zero and missing values. Throughout the data set, the proportion of non-zero damage data is quite small, particularly the crop damage data.
Furthermore, each observation - that is to say, each individual “event” identified by a unique event ID - may represent a distinct weather incident or only one part of an incident. Recall the 17 heavy rain events which comprise the August 1, 2014, rainstorm incident in Idaho. Depending on the question to be answered, these 17 observations might represent 17 data points or only 1.
Determining whether an event is part of a larger weather phenomenon is not straightforward. The episode identifier can be helpful but inconsistencies in usage render it unreliable when used alone. The episode identifer cannot be used to group related events across state lines, or across months or years. Other data which might be used to define a multi-event incident include the begin and end dates and times, county, latitude/longitude variables, and unstructured text data in the event and episode narratives.
Hence, the number of data points that end up comprising the inputs to a quantitative analysis can be far smaller than it originally appeared, potentially rendering the sample size suboptimal for the intended use of the analysis. These effects are unequally distributed through the data, which may affect the validity of comparisons: results for one component of the analysis, such as a state, may be based on one or more orders of magnitude fewer data points than another.
In summary, this database is best approached with a willingness to take the time required to explore the data carefully in the context of the specific questions the researcher wishes to address.