This report looks at the National Weather Service's (NWS) Storm Data Database (SDD) to determine which weather events have the biggest impact on public health and the economy across the United States. The SDD holds information on severe weather events across the United States for the period 1950 to 2011, with the current range of 48 weather event types recorded since 1996. The SDD is populated manually by NWS employees from across the country and data quality is an issue; a significant part of this analysis was therefore data cleansing, to ensure a robust analysis could be undertaken. All analysis in this report was undertaken using R and has been conducted in a reproducible way; all instructions and code to reproduce the analysis can be found in the Data Processing section. The key findings from this analysis are:
This analysis is a preliminary look at the data and at the questions around public health and economic impact; further areas of analysis are suggested at the end of the paper.
This section describes how the data were processed to produce the results in the Results section. All analysis for this report was run using the R statistical language in the RStudio editor and has been conducted in a reproducible way, with the intention that other researchers can both interrogate the robustness of the analysis and build on it for deeper insights. All instructions and code to reproduce the analysis can be found in this section; in addition, all code in this document can be run directly to recreate the analysis.
The analysis was conducted within RStudio using the following additional packages:
install.packages(c("tidyverse", "lubridate", "tidystringdist"))
library(tidyverse)
library(lubridate)
library(tidystringdist)
The data for this project were all obtained from publicly available sources. For the purposes of this analysis all of the data are provided as zip or CSV files, along with links to where they were originally obtained.
Storm data are available from the NOAA (NCDC) website: https://www.ncdc.noaa.gov/stormevents
## Download the main SDD plus help and FAQ files
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormdata.csv.bz2",
"Stormdata.csv.bz2", method = "curl")
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf",
"NWSI_10_1605.pdf")
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2FNCDC%20Storm%20Events-FAQ%20Page.pdf",
"Storm_data_FAQ.pdf")
## Load the main dataset into an R dataframe
data_0 <- read.csv("Stormdata.csv.bz2", header = TRUE)
## Load a list of event names into an R dataframe (this list is as per the NWSI_10_1605.pdf document)
data_events <- read.csv("event_names.csv", header = TRUE)
State names and abbreviations were obtained from Wikipedia: https://en.wikipedia.org/wiki/List_of_U.S._state_and_territory_abbreviations
## Load the list of 50 US states and their 2-letter abbreviations
data_states <- read.csv("state_abbreviations.csv", header = TRUE)
## Load the list of US territories and their 2-letter abbreviations
data_territories <- read.csv("territory_abbreviations.csv", header = TRUE)
Inflation rates and indices were obtained from the following website: https://www.in2013dollars.com/us/inflation/1950
## Load the table of inflation rates and indices
data_inflation <- read.csv("inflation_data.csv", header = TRUE)
For this analysis three explanatory variables were used to explore the data: the event type, the state in which the event occurred, and the year in which it occurred. Four output variables were considered to address the areas of interest: fatalities and injuries for public health, and the cost of crop damage and the cost of property damage for economic impact. Various data cleansing and processing steps were undertaken on all of these variables to produce the final analysis; these are described here for each variable in turn.
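For reference, the sketch below shows the raw SDD fields behind these variables. The FATALITIES and INJURIES fields appear later in this section; the damage field names (PROPDMG/PROPDMGEXP and CROPDMG/CROPDMGEXP) are the standard SDD names and are assumed here.

# Illustrative only: the raw fields behind the explanatory and output variables
data_0 %>%
    select(EVTYPE, STATE, BGN_DATE,  # event type, state and date (explanatory)
           FATALITIES, INJURIES,     # public health outputs
           PROPDMG, PROPDMGEXP,      # property damage value and exponent (assumed names)
           CROPDMG, CROPDMGEXP) %>%  # crop damage value and exponent (assumed names)
    head(3)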
The key explanatory variable for this analysis is the event type. A list of 48 event types is provided by the NWS in the help documentation and this is used as the categorisation for events. These event types should be used to categorise each weather event in the database, but an initial analysis shows that, although the EVTYPE field appears mandatory, it is a free-text field, which has led to a significant data quality issue.
# store a list of eventnames
eventnames <- data_events[,1]
# create a new master dataset with additional columns and add an event_match flag
data_1 <- data_0 %>% mutate(event_match = EVTYPE %in% eventnames)
# number of event types
length(eventnames)
## [1] 48
# number of observations in the SDD
nrow(data_1)
## [1] 902297
# proportion of observations with a valid event in EVTYPE field
mean(data_1$event_match)
## [1] 0.7040642
# number of different entries in EVTYPE field
length(unique(data_1$EVTYPE))
## [1] 985
# add 'other' categories to eventnames
eventnames <- c(eventnames, "OTHER/TBD", "NO EVENT")
The dataset has 902,297 rows but only 70.41% have a valid event name. 985 distinct values have been used to populate the EVTYPE field; since only 48 of these can be valid event names, at least 937 invalid descriptions appear in the field.
To ensure more complete data coverage for the analysis, data cleansing was undertaken to convert invalid entries in the EVTYPE field into valid event names. Four categories of data cleansing rules were applied to the EVTYPE field, as follows:

1. generic tidy-ups (the "g" functions): case conversion, whitespace trimming and common abbreviation/spelling fixes;
2. specific full-string replacements (the "s" functions): mapping a complete entry to a valid event type;
3. partial-match replacements (the "m" functions): mapping entries containing key words or patterns to an event type, including a first-match search (mEVNT) and fuzzy matching (mJW01);
4. 'other' categorisations (the "o" functions): mapping any remaining entries to 'OTHER/TBD'.

A short demonstration of these rule families is given after the function listing below.
The rules are implemented as functions which take a string input and return a revised string. They are shown below for completeness and transparency.
################################################################################
## The functions below perform string transformations on the EVTYPE field ######
## Initial function to test the name already in EVTYPE
nNAME <- function(x) {x}
## "g" functions to generically tidy up strings
gUPPR <- function(x) {str_to_upper(x)} #convert to UPPER CASE
gSQSH <- function(x) {str_squish(x)} #trim white space and remove double space
gCSTL <- function(x) {str_replace(x, "CSTL", "COASTAL")}
gFLDD <- function(x) {str_replace(x, "FLD", "FLOOD")}
gTHUN <- function(x) {str_replace(x, "THUNDERSTORMS", "THUNDERSTORM")}
gTIDL <- function(x) {str_replace(x, "TIDAL", "COASTAL")}
gTSTM <- function(x) {str_replace(x, "TSTM", "THUNDERSTORM")}
gWNDS <- function(x) {str_replace(x, "WINDS", "WIND")}
## "s" functions to replace specific full strings with event types
sCOLD <- function(x) {ifelse( x == "COLD" | x == "PROLONG COLD/SNOW"|
x == "UNSEASONABLY COOL" | x == "COLD WEATHER" | x == "PROLONG COLD" |
x == "UNSEASONABLY COLD" | x == "COLD WAVE" | x == "EXTENDED COLD" |
x == "UNSEASONABLE COLD" | x == "SNOW/COLD" | x == "LOW TEMPERATURE" |
x == "UNSEASONAL LOW TEMP"| x == "SNOW\\COLD" | x == "COLD TEMPERATURE" |
x == "UNUSUALLY COLD" | x == "SNOW AND COLD" | x == "COLD TEMPERATURES"|
x == "LOW WIND CHILL" | x == "COLD WIND CHILL TEMPERATURES" |
x == "WIND CHILL" | x == "FOG AND COLD TEMPERATURES", "COLD/WIND CHILL", x)}
sDDVL <- function(x) {ifelse(x == "WHIRLWIND"| x == "LANDSPOUT", "DUST DEVIL", x)}
sDFLW <- function(x) {ifelse(x == "LANDSLUMP", "DEBRIS FLOW", x)}
sDFOG <- function(x) {ifelse(x == "FOG"| x == "VOG", "DENSE FOG", x)}
sDRGT <- function(x) {ifelse(x == "RECORD LOW RAINFALL", "DROUGHT", x)}
sDUST <- function(x) {ifelse(x == "SAHARAN DUST", "DUST STORM", x)}
sFFLD <- function(x) {ifelse(x == "ICE JAM", "FLASH FLOOD", x)}
sFFOG <- function(x) {ifelse(x == "GLAZE"| x == "ICE FOG"| x == "GLAZE ICE",
"FREEZING FOG", x)}
sFLDD <- function(x) {ifelse(x == "HIGH WATER" |x == "RAPIDLY RISING WATER"|
x == "SMALL STREAM" |x == "DAM BREAK" |
x == "SMALL STREAM AND"|x == "DAM FAILURE",
"FLOOD", x)}
sFRFR <- function(x) {ifelse(x == "FREEZE"| x == "ICY ROADS" |x == "PATCHY ICE"|
x == "FROST" | x == "ICE ROADS" |x == "BLACK ICE" |
x == "COLD AND FROST" | x == "ICE ON ROAD"|x == "ICE",
"FROST/FREEZE", x)}
sHAIL <- function(x) {ifelse(x == "ICE PELLETS", "HAIL", x)}
sHEAT <- function(x) {ifelse(x == "PROLONG WARMTH" | x == "UNSEASONABLY HOT" |
x == "ABNORMAL WARMTH" | x == "UNSEASONABLY WARM" |
x == "VERY WARM" | x == "UNUSUAL WARMTH" |
x == "HOT SPELL" | x == "UNUSUALLY WARM" |
x == "HOT WEATHER", "HEAT", x)}
sHSRF <- function(x) {ifelse(x == "HEAVY SURF" | x == "HIGH WAVES" |
x == "HIGH SWELLS" | x == "HAZARDOUS SURF" | x == "WIND AND WAVE" |
x == "HEAVY SWELLS"| x == "ROUGH SURF" | x == "ROGUE WAVE" |
x == "HURRICANE-GENERATED SWELLS" | x == "HEAVY SURF AND WIND",
"HIGH SURF", x)}
sHURR <- function(x) {ifelse(x == "TYPHOON" | x == "HURRICANE/TYPHOON" |
x == "HURRICANE" | x == "REMNANTS OF FLOYD",
"HURRICANE (TYPHOON)", x)}
sHVRN <- function(x) {ifelse(x == "RAIN" | x == "UNSEASONABLY WET" |
x == "EXCESSIVE RAIN" | x == "RAIN (HEAVY)" | x == "UNSEASONABLE RAIN" |
x == "EXCESSIVE RAINFALL" | x == "HVY RAIN" | x == "ABNORMALLY WET" |
x == "TORRENTIAL RAINFALL"| x == "HEAVY SHOWER" | x == "EXTREMELY WET" |
x == "TORRENTIAL RAIN" | x == "HEAVY SHOWERS" | x == "DOWNBURST" |
x == "PROLONGED RAIN" | x == "RECORD RAINFALL" | x == "RAINSTORM" |
x == "RAIN DAMAGE" | x == "RECORD/EXCESSIVE RAINFALL", "HEAVY RAIN", x)}
sHVSN <- function(x) {ifelse(x == "RECORD SNOW" | x == "SNOW" |
x == "HEAVY WET SNOW" | x == "RECORD SNOW/COLD"| x == "EXCESSIVE SNOW" |
x == "NEAR RECORD SNOW" | x == "RECORD SNOWFALL" | x == "SNOWFALL RECORD" |
x =="ACCUMULATED SNOWFALL"| x == "RECORD WINTER SNOW", "HEAVY SNOW", x)}
sLESN <- function(x) {ifelse(x == "HEAVY LAKE SNOW"| x == "LAKE EFFECT SNOW",
"LAKE-EFFECT SNOW", x)}
sLGTN <- function(x) {ifelse(x == "LIGNTNING", "LIGHTNING", x)}
sSMKE <- function(x) {ifelse(x == "SMOKE", "DENSE SMOKE", x)}
sSRGE <- function(x) {ifelse(x == "STORM SURGE" | x == "ASTRONOMICAL HIGH TIDE"|
x == "HIGH SEAS" | x == "COASTAL SURGE"| x == "BLOW-OUT TIDES" |
x == "ROUGH SEAS"| x == "COASTAL STORM"| x == "BLOW-OUT TIDE" |
x == "HEAVY SEAS"| x == "COASTALSTORM" | x == "HIGH TIDES",
"STORM SURGE/TIDE", x)}
sSWND <- function(x) {ifelse(x == "WIND"| x == "GUSTY WIND"|
x == "WND" | x == "WIND GUSTS", "STRONG WIND", x)}
sTHNW <- function(x) {ifelse(x == "THUNDERSTORM"| x == "THUNDERSTORM DAMAGE TO" |
x == "GUSTNADO" | x == "SEVERE THUNDERSTORM"| x == "THUNDERSTORMW 50"|
x == "GUSTNADO AND"| x == "THUNDERSTORM DAMAGE"| x == "WIND DAMAGE",
"THUNDERSTORM WIND", x)}
sVLCA <- function(x) {ifelse(x == "VOLCANIC ERUPTION", "VOLCANIC ASH", x)}
sWSTM <- function(x) {ifelse(x == "THUNDERSNOW"| x == "SNOWSTORM"|
x == "SNOW AND WIND", "WINTER STORM", x)}
sWWTH <- function(x) {ifelse(x == "WINTRY MIX" | x == "HEAVY PRECIPITATION" |
x == "SNOW AND ICE"| x == "WINTER MIX" | x == "HEAVY PRECIPATATION" |
x == "SNOW/ICE" | x == "WINTERY MIX"| x == "MIXED PRECIPITATION" |
x == "SNOW/ ICE" | x == "HEAVY MIX" | x == "EXCESSIVE PRECIPITATION" |
x == "ICE/SNOW" | x == "SNOW/RAIN" | x == "MIXED PRECIP" |
x == "RAIN/SNOW", "WINTER WEATHER", x)}
sXCLD <- function(x) {ifelse(x == "RECORD COLD"| x == "RECORD COLD/FROST"|
x == "SEVERE COLD" | x == "RECORD COOL"| x == "LOW TEMPERATURE RECORD"|
x == "SNOW/ BITTER COLD" | x == "RECORD LOW" | x == "EXCESSIVE COLD"|
x == "BITTER WIND CHILL" | x == "BITTER WIND CHILL TEMPERATURES",
"EXTREME COLD/WIND CHILL", x)}
sXHET <- function(x) {ifelse( x == "RECORD WARMTH" | x == "RECORD HIGH" |
x == "RECORD HIGH TEMPERATURE" | x == "RECORD WARM" | x == "RECORD HEAT" |
x == "RECORD HIGH TEMPERATURES"| x == "RECORD WARM TEMPS."| x == "EXTREME HEAT" |
x == "HIGH TEMPERATURE RECORD" | x == "UNUSUAL/RECORD WARMTH",
"EXCESSIVE HEAT", x)}
## "m" functions to replace partial match strings with event types
m1MCB <- function(x) {ifelse(str_detect(x, "MICR*OBURST"), "THUNDERSTORM WIND", x)}
mBLZD <- function(x) {ifelse(str_detect(x, "BLOWING|SQUALL"), "BLIZZARD", x)}
mDFLW <- function(x) {ifelse(str_detect(x, "SLIDE"), "DEBRIS FLOW", x)}
mDRGT <- function(x) {ifelse(str_detect(x, "DRY"), "DROUGHT", x)}
mECCW <- function(x) {ifelse(str_detect(x, ".*EXTREME.+(COLD|CHILL|WIND)"), "EXTREME COLD/WIND CHILL", x)}
mFIRE <- function(x) {ifelse(str_detect(x, "FIRE"), "WILDFIRE", x)}
mFLSH <- function(x) {ifelse(str_detect(x, "FLASH"), "FLASH FLOOD", x)}
mFNNL <- function(x) {ifelse(str_detect(x, "FUNNEL"), "FUNNEL CLOUD", x)}
mFRST <- function(x) {ifelse(str_detect(x, "FROST"), "FROST/FREEZE", x)}
mFRZE <- function(x) {ifelse(str_detect(x, "FREEZE"), "FROST/FREEZE", x)}
mHURR <- function(x) {ifelse(str_detect(x, "HURRICANE .+"), "HURRICANE (TYPHOON)", x)}
mSLET <- function(x) {ifelse(str_detect(x, "FREEZING"), "SLEET", x)}
mTNDO <- function(x) {ifelse(str_detect(x, "WALL CLOUD"), "TORNADO", x)}
mURBN <- function(x) {ifelse(str_detect(x, ".*(URBAN|UBN).+(SMALL|SML).*"), "FLOOD", x)}
# The mEVNT function searches a string for valid event names; it returns the event
# found earliest in the string, or the original string if there is no match
mEVNT <- function(x) {
match0 <- 100
eventm <- 0
counter <- 0
for (ee in eventnames) {
match1 <- str_locate(x, ee)[1]
counter <- counter + 1
if (!is.na(match1)) {
if (match1 < match0) {
match0 <- match1
eventm <- counter
}
}
}
if (match0 == 100) {
return(x)
} else {
return(eventnames[eventm])
}
}
# The mJW01 function returns the best match for a given string from the event list using
# the Jaro-Winkler distance (implemented using the tidystringdist package). The function
# only returns a match if its score is at or below a given threshold, otherwise it returns
# the original string. The threshold is set at 0.1, which requires a close match, i.e.
# roughly a one-character difference.
mJW01 <- function(x) {jw_match(x, eventnames, 0.1)}
jw_match <- function(name, list, score) {
output <- name
matchlist <- tidy_comb(name, list)
scorelist <- tidy_stringdist(matchlist)
minlist = slice_min(scorelist, jw, n=1, with_ties = FALSE)
if (minlist$jw <= score) {output <- minlist$V1}
output
}
## "o" functions to replace partial match strings with 'OTHER/TBD' for later classification
oOTH0 <- function(x) {ifelse(str_detect(x, "OTHER"), "OTHER/TBD", x)}
oOTH1 <- function(x) {ifelse(str_detect(x, "SUMMARY"), "OTHER/TBD", x)}
oOTH2 <- function(x) {ifelse(str_detect(x, "(MONTH|YEAR|SEASON)"), "OTHER/TBD", x)}
oOTH3 <- function(x) {ifelse(str_detect(x, "(REC|TEMP).*(TEMP|REC)"), "OTHER/TBD", x)}
oOTH4 <- function(x) {ifelse(str_detect(x, "GRADIENT"), "OTHER/TBD", x)}
oOTH5 <- function(x) {ifelse(str_detect(x, "(EROSION|EROSIN)"), "OTHER/TBD", x)}
oOTH6 <- function(x) {ifelse(str_detect(x, "SNOW"), "OTHER/TBD", x)}
oOTH7 <- function(x) {ifelse(str_detect(x, "WIND"), "OTHER/TBD", x)}
oOTH8 <- function(x) {"OTHER/TBD"}
################################################################################
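As a minimal illustration of the rule families, the calls below apply one rule from each category to hypothetical EVTYPE-style strings; the expected results, shown as comments, follow from the function definitions above.

gTSTM("TSTM WIND")                  # generic tidy-up        -> "THUNDERSTORM WIND"
sHAIL("ICE PELLETS")                # specific full string   -> "HAIL"
mFIRE("GRASS FIRES")                # partial match          -> "WILDFIRE"
mEVNT("FLOOD/FLASH FLOOD")          # earliest valid event   -> "FLOOD"
jw_match("TORNDO", eventnames, 0.1) # fuzzy match within 0.1 -> "TORNADO"
oOTH1("SUMMARY OF MARCH 24")        # catch-all              -> "OTHER/TBD"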
To understand the impact of these rules two dataframes are used: events1, which records how each unique EVTYPE string is transformed, which rule produced the transformation and how many records it covers; and rules1, which records, for each rule, how many records were newly matched and the cumulative coverage.
A routine called applyrules() applies each of the string transformation rules above in turn, calling stringtransform() and recording the results in events1 and rules1. The rules are applied in order, and once a rule has produced a valid event type for a given string no later rule overwrites it.
# get list of unique strings entered into the EVTYPE datafield
e0 <- unique(data_0$EVTYPE)
# create a dataframe to track string transformation (will become events1)
events0 <- data.frame(orig = e0, #list of unique entry strings
count = rep(0,length(e0)), #how many times the entry appears
test = e0, #each rule applied to this string
transform = rep("-",length(e0)), #event type string for this string
match = rep(0,length(e0)), #flag whether event type has been found
rule = rep(0,length(e0)), #record rule used to create event type
mcount = rep(0,length(e0))) #track how many records have been matched
events0$count <- sapply(events0$orig, function(x) {sum(data_0$EVTYPE == x)}) #initialise count
events0 <- events0[order(-events0$count),] #rank strings by most common to least
# This function applies the transformation rules and updates the events1 and rules1 dataframes
applyrules <- function() {
events1 <<- events0
rvec <- c( nNAME, gUPPR, gSQSH, gCSTL, gFLDD, gTHUN, gTIDL, gTSTM, gWNDS,
sCOLD, sDDVL, sDFLW, sDFOG, sDRGT, sDUST, sFFLD, sFFOG, sFLDD,
sFRFR, sHAIL, sHEAT, sHSRF, sHURR, sHVRN, sHVSN, sLESN, sLGTN,
sSMKE, sSRGE, sSWND, sTHNW, sVLCA, sWSTM, sWWTH, sXCLD, sXHET,
m1MCB, mBLZD, mDFLW, mDRGT, mECCW, mFIRE, mFLSH, mFNNL, mFRST,
mFRZE, mHURR, mSLET, mTNDO, mURBN, mEVNT, mJW01,
oOTH0, oOTH1, oOTH2, oOTH3, oOTH4, oOTH5, oOTH6, oOTH7, oOTH8)
rnames <- c("nNAME", "gUPPR", "gSQSH", "gCSTL", "gFLDD", "gTHUN", "gTIDL", "gTSTM", "gWNDS",
"sCOLD", "sDDVL", "sDFLW", "sDFOG", "sDRGT", "sDUST", "sFFLD", "sFFOG", "sFLDD",
"sFRFR", "sHAIL", "sHEAT", "sHSRF", "sHURR", "sHVRN", "sHVSN", "sLESN", "sLGTN",
"sSMKE", "sSRGE", "sSWND", "sTHNW", "sVLCA", "sWSTM", "sWWTH", "sXCLD", "sXHET",
"m1MCB", "mBLZD", "mDFLW", "mDRGT", "mECCW", "mFIRE", "mFLSH", "mFNNL", "mFRST",
"mFRZE", "mHURR", "mSLET", "mTNDO", "mURBN", "mEVNT", "mJW01",
"oOTH0", "oOTH1", "oOTH2", "oOTH3", "oOTH4", "oOTH5", "oOTH6", "oOTH7", "oOTH8")
rules1 <- data.frame("rule_name" = rnames,
"impacted" = rep(0, length(rnames)),
"pct_coverage" = rep(0, length(rnames)),
"covered" = rep(0, length(rnames)),
"not_covered" = rep(0, length(rnames)))
rcounter = 0
# loop over every rule in the list - note this is done in order of rvec
for (rr in rvec) {
# set a counter and track the name of the rule we are applying
rcounter <- rcounter + 1
rname <- rnames[rcounter]
# populate the events1 dataframe using the stringtransform function
events1$test <<- sapply(events1$test, rr)
tempevents1_transform <- events1$transform
events1$transform <<- mapply(stringtransform, tempevents1_transform, events1$test, events1$rule, rname,1)
events1$rule <<- mapply(stringtransform, tempevents1_transform, events1$test, events1$rule, rname, 2)
events1$match <<- sapply(events1$transform, function(x) {ifelse(x %in% eventnames, 1, 0)})
events1$mcount <<- events1$match * events1$count
# populate the rules1 dataframe
rules1[rcounter, "pct_coverage"] <- as.numeric(round(sum(events1$mcount)/sum(events1$count), 6))
rules1[rcounter, "covered"] <- as.numeric(sum(events1$mcount))
rules1[rcounter, "not_covered"] <- length(data_0$EVTYPE) - as.numeric(sum(events1$mcount))
prior = 0
if(rcounter != 1) {prior <- as.numeric(rules1[rcounter-1, "covered"])}
rules1[rcounter, "impacted"] <- sum(events1$mcount) - prior
}
return(rules1)
}
# This function performs the string transformation and records the result(1) and rule used(2)
# The function compares the existing transformed value with a test value for the latest
# string transformation to see whether it should be overwritten.
stringtransform <- function(trans_str, test_str, rule_trans, rule_test, answer) {
outputs <- character(2)
if (trans_str %in% eventnames) { #if current value is valid then
outputs[1] <- trans_str #keep string and rule as is
outputs[2] <- rule_trans
}
else if (test_str %in% eventnames) { #elseif new value is valid then
outputs[1] <- test_str #update string and rule
outputs[2] <- rule_test
}
else { #else no valid string yet
outputs[1] <- "-"
outputs[2] <- "-"
}
output1 <- outputs[answer]
output1
}
# run the 'applyrules' routine
rules1 <- applyrules()
The events1 dataframe shows how each unique event string is transformed; it can also be used to see any strings still to be categorised/transformed.
head(events1, 10)
## orig count test transform match rule mcount
## 3 HAIL 288661 OTHER/TBD HAIL 1 nNAME 288661
## 2 TSTM WIND 219940 OTHER/TBD THUNDERSTORM WIND 1 gTSTM 219940
## 16 THUNDERSTORM WIND 82563 OTHER/TBD THUNDERSTORM WIND 1 nNAME 82563
## 1 TORNADO 60652 OTHER/TBD TORNADO 1 nNAME 60652
## 20 FLASH FLOOD 54277 OTHER/TBD FLASH FLOOD 1 nNAME 54277
## 36 FLOOD 25326 OTHER/TBD FLOOD 1 nNAME 25326
## 10 THUNDERSTORM WINDS 20843 OTHER/TBD THUNDERSTORM WIND 1 gWNDS 20843
## 46 HIGH WIND 20212 OTHER/TBD HIGH WIND 1 nNAME 20212
## 15 LIGHTNING 15754 OTHER/TBD LIGHTNING 1 nNAME 15754
## 53 HEAVY SNOW 15708 OTHER/TBD HEAVY SNOW 1 nNAME 15708
events1 %>% filter(rule == "oOTH8") %>% arrange(desc(count))
## orig count test transform match rule mcount
## 1 NORMAL PRECIPITATION 3 OTHER/TBD OTHER/TBD 1 oOTH8 3
## 2 MARINE MISHAP 2 OTHER/TBD OTHER/TBD 1 oOTH8 2
## 3 ICE FLOES 2 OTHER/TBD OTHER/TBD 1 oOTH8 2
## 4 BELOW NORMAL PRECIPITATION 2 OTHER/TBD OTHER/TBD 1 oOTH8 2
## 5 NONE 2 OTHER/TBD OTHER/TBD 1 oOTH8 2
## 6 RED FLAG CRITERIA 2 OTHER/TBD OTHER/TBD 1 oOTH8 2
## 7 SEVERE TURBULENCE 1 OTHER/TBD OTHER/TBD 1 oOTH8 1
## 8 APACHE COUNTY 1 OTHER/TBD OTHER/TBD 1 oOTH8 1
## 9 HIGH 1 OTHER/TBD OTHER/TBD 1 oOTH8 1
## 10 COOL AND WET 1 OTHER/TBD OTHER/TBD 1 oOTH8 1
## 11 COLD AND WET CONDITIONS 1 OTHER/TBD OTHER/TBD 1 oOTH8 1
## 12 EXCESSIVE WETNESS 1 OTHER/TBD OTHER/TBD 1 oOTH8 1
## 13 SOUTHEAST 1 OTHER/TBD OTHER/TBD 1 oOTH8 1
## 14 WET WEATHER 1 OTHER/TBD OTHER/TBD 1 oOTH8 1
## 15 EXCESSIVE 1 OTHER/TBD OTHER/TBD 1 oOTH8 1
## 16 ? 1 OTHER/TBD OTHER/TBD 1 oOTH8 1
## 17 HOT PATTERN 1 OTHER/TBD OTHER/TBD 1 oOTH8 1
## 18 MILD PATTERN 1 OTHER/TBD OTHER/TBD 1 oOTH8 1
## 19 Marine Accident 1 OTHER/TBD OTHER/TBD 1 oOTH8 1
## 20 Metro Storm, May 26 1 OTHER/TBD OTHER/TBD 1 oOTH8 1
## 21 No Severe Weather 1 OTHER/TBD OTHER/TBD 1 oOTH8 1
## 22 EARLY RAIN 1 OTHER/TBD OTHER/TBD 1 oOTH8 1
## 23 HYPERTHERMIA/EXPOSURE 1 OTHER/TBD OTHER/TBD 1 oOTH8 1
## 24 COOL SPELL 1 OTHER/TBD OTHER/TBD 1 oOTH8 1
## 25 WARM WEATHER 1 OTHER/TBD OTHER/TBD 1 oOTH8 1
## 26 NORTHERN LIGHTS 1 OTHER/TBD OTHER/TBD 1 oOTH8 1
## 27 DROWNING 1 OTHER/TBD OTHER/TBD 1 oOTH8 1
The rules1 dataframe can be reviewed to see how each rule impacted the transformation.
rules1
## rule_name impacted pct_coverage covered not_covered
## 1 nNAME 635275 0.704064 635275 267022
## 2 gUPPR 76 0.704148 635351 266946
## 3 gSQSH 4 0.704153 635355 266942
## 4 gCSTL 0 0.704153 635355 266942
## 5 gFLDD 0 0.704153 635355 266942
## 6 gTHUN 6 0.704159 635361 266936
## 7 gTIDL 1 0.704161 635362 266935
## 8 gTSTM 226121 0.954767 861483 40814
## 9 gWNDS 22608 0.979823 884091 18206
## 10 sCOLD 203 0.980048 884294 18003
## 11 sDDVL 5 0.980053 884299 17998
## 12 sDFLW 2 0.980055 884301 17996
## 13 sDFOG 539 0.980653 884840 17457
## 14 sDRGT 2 0.980655 884842 17455
## 15 sDUST 4 0.980659 884846 17451
## 16 sFFLD 4 0.980664 884850 17447
## 17 sFFOG 47 0.980716 884897 17400
## 18 sFLDD 14 0.980731 884911 17386
## 19 sFRFR 253 0.981012 885164 17133
## 20 sHAIL 1 0.981013 885165 17132
## 21 sHEAT 162 0.981192 885327 16970
## 22 sHSRF 108 0.981312 885435 16862
## 23 sHURR 275 0.981617 885710 16587
## 24 sHVRN 77 0.981702 885787 16510
## 25 sHVSN 667 0.982441 886454 15843
## 26 sLESN 48 0.982495 886502 15795
## 27 sLGTN 1 0.982496 886503 15794
## 28 sSMKE 11 0.982508 886514 15783
## 29 sSRGE 394 0.982945 886908 15389
## 30 sSWND 476 0.983472 887384 14913
## 31 sTHNW 128 0.983614 887512 14785
## 32 sVLCA 2 0.983616 887514 14783
## 33 sWSTM 3 0.983620 887517 14780
## 34 sWWTH 212 0.983855 887729 14568
## 35 sXCLD 88 0.983952 887817 14480
## 36 sXHET 276 0.984258 888093 14204
## 37 m1MCB 218 0.984500 888311 13986
## 38 mBLZD 125 0.984638 888436 13861
## 39 mDFLW 652 0.985361 889088 13209
## 40 mDRGT 110 0.985483 889198 13099
## 41 mECCW 893 0.986472 890091 12206
## 42 mFIRE 1479 0.988111 891570 10727
## 43 mFLSH 1397 0.989660 892967 9330
## 44 mFNNL 147 0.989823 893114 9183
## 45 mFRST 4 0.989827 893118 9179
## 46 mFRZE 23 0.989853 893141 9156
## 47 mHURR 23 0.989878 893164 9133
## 48 mSLET 364 0.990281 893528 8769
## 49 mTNDO 11 0.990294 893539 8758
## 50 mURBN 3458 0.994126 896997 5300
## 51 mEVNT 4584 0.999206 901581 716
## 52 mJW01 34 0.999244 901615 682
## 53 oOTH0 59 0.999310 901674 623
## 54 oOTH1 75 0.999393 901749 548
## 55 oOTH2 78 0.999479 901827 470
## 56 oOTH3 65 0.999551 901892 405
## 57 oOTH4 17 0.999570 901909 388
## 58 oOTH5 6 0.999577 901915 382
## 59 oOTH6 323 0.999935 902238 59
## 60 oOTH7 25 0.999962 902263 34
## 61 oOTH8 34 1.000000 902297 0
The rules cover 99.92% of the data before the ‘other’ categorisations are reached, which appears to be a very solid basis for moving forward. Ideally every rule application would be reviewed by an expert familiar with the weather event classifications, but given that the majority of the transformation comes from converting ‘TSTM’ to ‘THUNDERSTORM’ and ‘WINDS’ to ‘WIND’, it seems a good basis on which to proceed.
To allow easier charting of the event data it has been sorted into the following event groups:
| Event Groups | Weather Events |
|---|---|
| Dry/Heat | Drought, Dust Devil, Dust Storm, Excessive Heat, Heat, Volcanic Ash, Wildfire |
| Flood/Rain | Coastal Flood, Flash Flood, Flood, Heavy Rain, Lakeshore Flood |
| Marine/Lake | Astronomical Low Tide, High Surf, Lake-Effect Snow, Marine Hail, Marine High Wind, Marine Strong Wind, Marine Thunderstorm Wind, Rip Current, Seiche, Storm Surge/Tide, Tsunami, Waterspout |
| Wind | High Wind, Hurricane (Typhoon), Strong Wind, Thunderstorm Wind, Tornado, Tropical Depression, Tropical Storm |
| Snow/Storm | Avalanche, Blizzard, Hail, Heavy Snow, Ice Storm, Lightning, Sleet, Winter Storm, Winter Weather |
| Cold/Freeze | Cold/Wind Chill, Extreme Cold/Wind Chill, Freezing Fog, Frost/Freeze |
| Other | Debris Flow, Dense Fog, Dense Smoke, Funnel Cloud, Other/TBD, No Event |
The above grouping, and a sensible ordering of the groups, is applied via an event_grouping.csv file. The transformed events from applyrules() and the event grouping are added to the main data_1 dataset.
eventgrouping <- read.csv("event_grouping.csv", header = TRUE)
eventnamessorted <- eventgrouping$event
eventgrouping
## INDEX event code group chart_order
## 1 9 DROUGHT Z Dry/Heat 1
## 2 10 DUST DEVIL C Dry/Heat 2
## 3 11 DUST STORM Z Dry/Heat 3
## 4 12 EXCESSIVE HEAT Z Dry/Heat 4
## 5 20 HEAT Z Dry/Heat 5
## 6 44 VOLCANIC ASH Z Dry/Heat 6
## 7 46 WILDFIRE Z Dry/Heat 7
## 8 4 COASTAL FLOOD Z Flood/Rain 8
## 9 14 FLASH FLOOD C Flood/Rain 9
## 10 15 FLOOD C Flood/Rain 10
## 11 21 HEAVY RAIN Z Flood/Rain 11
## 12 28 LAKESHORE FLOOD Z Flood/Rain 12
## 13 1 ASTRONOMICAL LOW TIDE Z Marine/Lake 13
## 14 23 HIGH SURF Z Marine/Lake 14
## 15 27 LAKE-EFFECT SNOW Z Marine/Lake 15
## 16 30 MARINE HAIL M Marine/Lake 16
## 17 31 MARINE HIGH WIND M Marine/Lake 17
## 18 32 MARINE STRONG WIND M Marine/Lake 18
## 19 33 MARINE THUNDERSTORM WIND M Marine/Lake 19
## 20 34 RIP CURRENT Z Marine/Lake 20
## 21 35 SEICHE Z Marine/Lake 21
## 22 37 STORM SURGE/TIDE Z Marine/Lake 22
## 23 43 TSUNAMI Z Marine/Lake 23
## 24 45 WATERSPOUT M Marine/Lake 24
## 25 24 HIGH WIND Z Wind 25
## 26 25 HURRICANE (TYPHOON) Z Wind 26
## 27 38 STRONG WIND Z Wind 27
## 28 39 THUNDERSTORM WIND C Wind 28
## 29 40 TORNADO C Wind 29
## 30 41 TROPICAL DEPRESSION Z Wind 30
## 31 42 TROPICAL STORM Z Wind 31
## 32 2 AVALANCHE Z Snow/Storm 32
## 33 3 BLIZZARD Z Snow/Storm 33
## 34 19 HAIL C Snow/Storm 34
## 35 22 HEAVY SNOW Z Snow/Storm 35
## 36 26 ICE STORM Z Snow/Storm 36
## 37 29 LIGHTNING C Snow/Storm 37
## 38 36 SLEET Z Snow/Storm 38
## 39 47 WINTER STORM Z Snow/Storm 39
## 40 48 WINTER WEATHER Z Snow/Storm 40
## 41 5 COLD/WIND CHILL Z Cold/Freeze 41
## 42 13 EXTREME COLD/WIND CHILL Z Cold/Freeze 42
## 43 18 FREEZING FOG Z Cold/Freeze 43
## 44 16 FROST/FREEZE Z Cold/Freeze 44
## 45 6 DEBRIS FLOW C Other 45
## 46 7 DENSE FOG Z Other 46
## 47 8 DENSE SMOKE Z Other 47
## 48 17 FUNNEL CLOUD C Other 48
## 49 49 OTHER/TBD O Other 49
## 50 50 NO EVENT O Other 50
data_1 <- mutate(data_1,
event = events1$transform[match(EVTYPE, events1$orig)],
eventGrp = eventgrouping$group[match(event, eventgrouping$event)],
rule = events1$rule[match(EVTYPE, events1$orig)]
)
The coverage of the transformed events, and how they were transformed by the different rules, can be seen by running the following ruletable() function (and two supporting functions):
# This function filters grouped data by different rule conditions
filter_grp <- function(data_grp, condition) {
output <- data_grp %>% filter(str_detect(rule, condition)) %>% summarise(count = n())
output
}
# This function performs an indexmatch allowing non-matches to be defined
index_match <- function(index_vec, match_DF, missing.val = NA) {
index_vec2 <- match_DF[2][[1]][match(index_vec, match_DF[1][[1]])]
output <- replace(index_vec2, is.na(index_vec2), missing.val)
output
}
# This function generates a table of events (events2) showing how rule groupings
# have been applied. It is written as a function so that it can be applied to other
# variables later on in the analysis.
ruletable <- function(nameslist, variable) { # (eventnames, "events")
# generate a dataframe with events as rows and rule types as columns
nEvents <- length(nameslist)
events2row <- data.frame(rowtype = rep("row", nEvents),
name = nameslist,
orig = rep(0, nEvents),
generic = rep(0, nEvents),
specific = rep(0, nEvents),
match = rep(0, nEvents),
other = rep(0, nEvents))
# group the main datatable by the variable (events) column
data_grp1 <- group_by(data_1, across(variable))
# for each family of rules count how many are applied to each event
events2row$orig <- index_match(nameslist, filter_grp(data_grp1, "^n"), 0)
events2row$generic <- index_match(nameslist, filter_grp(data_grp1, "^g"), 0)
events2row$specific <- index_match(nameslist, filter_grp(data_grp1, "^s"), 0)
events2row$match <- index_match(nameslist, filter_grp(data_grp1, "^m"), 0)
events2row$other <- index_match(nameslist, filter_grp(data_grp1, "^o"), 0)
# calculate the final row (NO EVENTS) by subtracting rule EVENTS from previous rule NO EVENTS
events2row$orig[nEvents] <- nrow(data_0) - sum(events2row$orig[-nEvents])
events2row$generic[nEvents] <- events2row$orig[nEvents] - sum(events2row$generic[-nEvents])
events2row$specific[nEvents] <- events2row$generic[nEvents] - sum(events2row$specific[-nEvents])
events2row$match[nEvents] <- events2row$specific[nEvents] - sum(events2row$match[-nEvents])
events2row$other[nEvents] <- events2row$match[nEvents] - sum(events2row$other[-nEvents])
# create total rows showing number covered (ie with valid event name), total, and percent covered
events2total <- events2row[1:3,]
events2total[["rowtype"]] <- rep("total", 3)
events2total[["name"]] <- c("covered", "total", "pct_covered")
# calculate the total rows
colNames <- names(events2row)
for (col1 in colNames[-(1:2)]) {
if (col1 == colNames[3]) {rowtotal = 0}
events2total[1,col1] <- sum(events2row[-nEvents, col1]) + rowtotal
rowtotal <- events2total[1,col1]
events2total[2,col1] <- events2total[1,col1] + sum(events2row[nEvents, col1])
events2total[3,col1] <- round(100 * events2total[1,col1] / events2total[2,col1], 0)
}
# join the row and totals together
events2 <- rbind(events2row, events2total)
events2 <- cbind(events2, pct_gen = rep(0, nEvents + 3))
# add a column to show the % generated from rules
for (row1 in c(1:nEvents, nEvents+3)) {
rowtotal <- sum(events2[row1, 3:7])
events2$pct_gen[row1] <- round(100 * (1 - (sum(events2$orig[row1]/rowtotal))), 1)
}
return(events2)
}
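Before running ruletable(), a brief illustration of the index_match() helper may be useful (the lookup values are made up for the example): it matches a vector against the first column of a two-column dataframe, returns the corresponding values from the second column, and substitutes a default where there is no match.

lookup <- data.frame(code = c("AL", "AK"), name = c("Alabama", "Alaska"))
index_match(c("AK", "ZZ"), lookup, missing.val = "Unknown") # returns "Alaska" "Unknown"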
events2 <- ruletable(eventnames, "event")
events2[-1]
## name orig generic specific match other pct_gen
## 1 ASTRONOMICAL LOW TIDE 174 0 0 0 0 0.0
## 2 AVALANCHE 386 0 0 1 0 0.3
## 3 BLIZZARD 2719 0 0 135 0 4.7
## 4 COASTAL FLOOD 650 8 0 222 0 26.1
## 5 COLD/WIND CHILL 539 0 203 0 0 27.4
## 6 DEBRIS FLOW 0 0 2 652 0 100.0
## 7 DENSE FOG 1293 0 539 3 0 29.5
## 8 DENSE SMOKE 10 0 11 0 0 52.4
## 9 DROUGHT 2488 0 2 130 0 5.0
## 10 DUST DEVIL 141 8 5 2 0 9.6
## 11 DUST STORM 427 0 4 2 0 1.4
## 12 EXCESSIVE HEAT 1678 0 276 4 0 14.3
## 13 EXTREME COLD/WIND CHILL 1002 0 88 893 0 49.5
## 14 FLASH FLOOD 54277 1 4 1397 0 2.5
## 15 FLOOD 25326 1 14 4211 0 14.3
## 16 FROST/FREEZE 1342 1 253 27 0 17.3
## 17 FUNNEL CLOUD 6839 5 0 147 0 2.2
## 18 FREEZING FOG 45 1 47 0 0 51.6
## 19 HAIL 288661 0 1 180 0 0.1
## 20 HEAT 767 0 162 82 0 24.1
## 21 HEAVY RAIN 11723 19 77 62 0 1.3
## 22 HEAVY SNOW 15708 0 667 32 0 4.3
## 23 HIGH SURF 725 9 108 234 0 32.6
## 24 HIGH WIND 20212 1536 0 58 0 7.3
## 25 HURRICANE (TYPHOON) 0 0 275 23 0 100.0
## 26 ICE STORM 2006 0 0 20 0 1.0
## 27 LAKE-EFFECT SNOW 636 0 48 0 0 7.0
## 28 LAKESHORE FLOOD 23 0 0 0 0 0.0
## 29 LIGHTNING 15754 1 1 13 0 0.1
## 30 MARINE HAIL 442 0 0 0 0 0.0
## 31 MARINE HIGH WIND 135 0 0 0 0 0.0
## 32 MARINE STRONG WIND 48 0 0 0 0 0.0
## 33 MARINE THUNDERSTORM WIND 5812 6175 0 0 0 51.5
## 34 RIP CURRENT 470 0 0 307 0 39.5
## 35 SEICHE 21 0 0 0 0 0.0
## 36 SLEET 59 0 0 400 0 87.1
## 37 STORM SURGE/TIDE 148 0 394 0 0 72.7
## 38 STRONG WIND 3566 207 476 3 0 16.1
## 39 THUNDERSTORM WIND 82563 240823 128 1566 0 74.6
## 40 TORNADO 60652 0 0 47 0 0.1
## 41 TROPICAL DEPRESSION 60 0 0 0 0 0.0
## 42 TROPICAL STORM 690 0 0 7 0 1.0
## 43 TSUNAMI 20 0 0 0 0 0.0
## 44 VOLCANIC ASH 22 1 2 4 0 24.1
## 45 WATERSPOUT 3796 1 0 63 0 1.7
## 46 WILDFIRE 2761 0 0 1479 0 34.9
## 47 WINTER STORM 11433 0 3 6 0 0.1
## 48 WINTER WEATHER 7026 19 212 1110 0 16.0
## 49 OTHER/TBD 0 0 0 0 682 100.0
## 50 NO EVENT 267022 18206 14204 682 0 11.0
## 51 covered 635275 884091 888093 901615 902297 0.0
## 52 total 902297 902297 902297 902297 902297 0.0
## 53 pct_covered 70 98 98 100 100 85.0
Some events, such as ‘Avalanche’, ‘Dust Storm’ and ‘Ice Storm’, are mostly keyed in correctly and would have been picked up without any data cleansing. However, events such as ‘Debris Flow’ and ‘Hurricane (Typhoon)’ were not keyed correctly and are generated entirely by the data cleansing routine. An extreme example is ‘Thunderstorm Wind’, which is the largest category yet has 74.6% of its values generated by the data cleansing. This shows how important the data cleansing is for obtaining an accurate dataset, and also that the data verification approach will need validation from external advisers. The ‘Snow/Storm’ and ‘Wind’ event groups are much larger than the others because the pre-1992 database recorded only three weather event types. If these event groups are to be used in further analysis they should be reviewed by experts for suitability.
table(data_1$eventGrp)
##
## Cold/Freeze Dry/Heat Flood/Rain Marine/Lake Other Snow/Storm
## 4441 10447 98015 19766 10183 346553
## Wind
## 412892
Each entry in the SDD has a corresponding date. This is used in the analysis to understand how weather has changed over time, as well as how the quality of the data has changed over time. The year variable is created by taking the year element of the BGN_DATE field using lubridate functions. Time of day is not analysed, so the time element of the BGN_DATE timestamp is ignored, as is the BGN_TIME field. The year variable is then grouped into five-year blocks counting backwards from the final year of data (2011). There is relatively little data per year before 1992, so pre-1992 is treated as a single group. Offsetting the five-year blocks in this way aligns them with the introduction of the current 48 event classifications in 1996, giving three five-year groups with full data.
data_1 <- mutate(data_1,
year = year(mdy(str_sub(BGN_DATE, end = -8))),
yearGrp = ifelse(year < 1992, "pre 1992",
paste0(floor((year+3) / 5) * 5 -3,"-", floor((year+3) / 5) * 5+1))
)
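As a quick sanity check of the grouping formula, the standalone helper below (written for illustration only, not part of the processing) shows how boundary years fall into groups.

yeargrp <- function(y) ifelse(y < 1992, "pre 1992",
                       paste0(floor((y+3)/5)*5 - 3, "-", floor((y+3)/5)*5 + 1))
yeargrp(c(1991, 1992, 1996, 1997, 2011))
# "pre 1992" "1992-1996" "1992-1996" "1997-2001" "2007-2011"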
yearnames <- c(as.character(seq(1950, 2011, 1)), "NO YEAR")
It is useful to see how the data coverage has changed over the years, and also which years have events which were transformed by the rules that were applied.
years2 <- ruletable(yearnames, "year")
years2
## rowtype name orig generic specific match other pct_gen
## 1 row 1950 223 0 0 0 0 0.0
## 2 row 1951 269 0 0 0 0 0.0
## 3 row 1952 272 0 0 0 0 0.0
## 4 row 1953 492 0 0 0 0 0.0
## 5 row 1954 609 0 0 0 0 0.0
## 6 row 1955 992 421 0 0 0 29.8
## 7 row 1956 968 735 0 0 0 43.2
## 8 row 1957 1409 775 0 0 0 35.5
## 9 row 1958 1314 899 0 0 0 40.6
## 10 row 1959 1161 652 0 0 0 36.0
## 11 row 1960 1226 719 0 0 0 37.0
## 12 row 1961 1494 752 0 0 0 33.5
## 13 row 1962 1559 830 0 0 0 34.7
## 14 row 1963 1145 823 0 0 0 41.8
## 15 row 1964 1439 909 0 0 0 38.7
## 16 row 1965 1800 1055 0 0 0 37.0
## 17 row 1966 1338 1050 0 0 0 44.0
## 18 row 1967 1730 958 0 0 0 35.6
## 19 row 1968 1783 1529 0 0 0 46.2
## 20 row 1969 1416 1510 0 0 0 51.6
## 21 row 1970 1421 1794 0 0 0 55.8
## 22 row 1971 1927 1544 0 0 0 44.5
## 23 row 1972 1456 712 0 0 0 32.8
## 24 row 1973 2297 2166 0 0 0 48.5
## 25 row 1974 2783 2603 0 0 0 48.3
## 26 row 1975 2336 2639 0 0 0 53.0
## 27 row 1976 2026 1742 0 0 0 46.2
## 28 row 1977 2005 1723 0 0 0 46.2
## 29 row 1978 1899 1758 0 0 0 48.1
## 30 row 1979 2233 2046 0 0 0 47.8
## 31 row 1980 2965 3181 0 0 0 51.8
## 32 row 1981 2324 2193 0 0 0 48.5
## 33 row 1982 3562 3570 0 0 0 50.1
## 34 row 1983 3329 4993 0 0 0 60.0
## 35 row 1984 3769 3566 0 0 0 48.6
## 36 row 1985 4152 3827 0 0 0 48.0
## 37 row 1986 4361 4365 0 0 0 50.0
## 38 row 1987 3111 4256 0 0 0 57.8
## 39 row 1988 3310 3947 0 0 0 54.4
## 40 row 1989 4699 5711 0 0 0 54.9
## 41 row 1990 4882 6064 0 0 0 55.4
## 42 row 1991 6019 6503 0 0 0 51.9
## 43 row 1992 7091 6443 0 0 0 47.6
## 44 row 1993 7827 4142 194 440 4 37.9
## 45 row 1994 11684 7489 149 1297 12 43.4
## 46 row 1995 14617 11105 481 1726 41 47.7
## 47 row 1996 20879 10030 215 971 175 35.3
## 48 row 1997 17335 9887 410 967 81 39.6
## 49 row 1998 22608 13610 482 1373 55 40.7
## 50 row 1999 19221 10359 438 1185 86 38.6
## 51 row 2000 20399 12175 527 1293 77 40.8
## 52 row 2001 21382 11842 429 1220 89 38.8
## 53 row 2002 21791 12976 361 1103 62 40.0
## 54 row 2003 25774 13438 81 459 0 35.2
## 55 row 2004 25613 13225 75 450 0 34.9
## 56 row 2005 25043 13437 122 582 0 36.1
## 57 row 2006 29739 14138 5 152 0 32.5
## 58 row 2007 43217 0 5 67 0 0.2
## 59 row 2008 55606 0 23 34 0 0.1
## 60 row 2009 45771 0 1 45 0 0.1
## 61 row 2010 48093 0 1 67 0 0.1
## 62 row 2011 62080 0 3 91 0 0.2
## 63 row NO YEAR 267022 18206 14204 682 0 11.0
## 64 total covered 635275 884091 888093 901615 902297 0.0
## 65 total total 902297 902297 902297 902297 902297 0.0
## 66 total pct_covered 70 98 98 100 100 85.0
From this table it can be seen how errors and the recording of events have changed over time. This aligns with the data-entry history available on the NOAA website.
It can also be seen that the coverage across the year groupings is fairly even.
table(data_1$yearGrp)
##
## 1992-1996 1997-2001 2002-2006 2007-2011 pre 1992
## 107012 167530 198626 255104 174025
Each entry in the SDD has a corresponding state code in the STATE field. This is used in the analysis to understand how weather differs across states, as well as how reporting differs between states. There are 72 state codes used in the SDD. To understand what the codes beyond the 50 states represent, they are mapped against three datasets: state codes, territory codes, and the most common (modal) NWS office recorded in the SDD against each code.
# get the list of state codes from the SDD and count how many times each occurs
statelist <- unique(data_1$STATE)
countlist <- data_1 %>% group_by(STATE) %>% summarise(count = n()) %>% ungroup()
statecount <- countlist$count[match(statelist, countlist$STATE)]
# This function returns the mode from a vector
statmode <- function(vec) {
unq <- unique(vec)
unq[which.max(tabulate(match(vec, unq)))]
}
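# e.g. statmode(c("NY", "NY", "PA")) returns "NY" (illustrative values, not from the SDD)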
# get the list of states, territories and office locations
statematchS <- index_match(statelist, data_states, "")
statematchT <- index_match(statelist, data_territories, "")
officelist <- data_1 %>% group_by(STATE) %>% summarise(mode = statmode(STATEOFFIC)) %>% ungroup()
statematchO <- index_match(statelist, officelist, "")
# combine for review
statelist <- data.frame(code = statelist,
count = statecount,
state = statematchS,
territory = statematchT,
office = str_sub(statematchO,1,20))
statelist
## code count state territory office
## 1 AL 22739 Alabama ALABAMA, Central
## 2 AZ 6156 Arizona ARIZONA, Central and
## 3 AR 27102 Arkansas ARKANSAS, Central an
## 4 CA 10780 California CALIFORNIA, South Ce
## 5 CO 20473 Colorado COLORADO, Central an
## 6 CT 3294 Connecticut CONNECTICUT, Souther
## 7 DE 1913 Delaware DELAWARE
## 8 DC 437 DIST COLUMBIA
## 9 FL 22124 Florida
## 10 GA 25259 Georgia GEORGIA, North and C
## 11 HI 2547 Hawaii HAWAII
## 12 ID 4767 Idaho IDAHO, Southeast
## 13 IL 28488 Illinois
## 14 IN 21506 Indiana INDIANA, Central
## 15 IA 31069 Iowa IOWA, Central
## 16 KS 53440 Kansas
## 17 KY 22092 Kentucky KENTUCKY, Eastern
## 18 LA 17323 Louisiana
## 19 ME 4524 Maine MAINE, South
## 20 MD 8185 Maryland MARYLAND, Central
## 21 MA 5652 Massachusetts MASSACHUSETTS, Centr
## 22 MI 17910 Michigan
## 23 MN 23609 Minnesota MINNESOTA, Central a
## 24 MS 22192 Mississippi MISSISSIPPI, Central
## 25 MO 35648 Missouri MISSOURI, Southwest
## 26 MT 14695 Montana MONTANA, South
## 27 NE 30271 Nebraska
## 28 NV 3139 Nevada NEVADA, West
## 29 NH 3022 New Hampshire NEW HAMPSHIRE, North
## 30 NJ 8075 New Jersey NEW JERSEY, South an
## 31 NM 7129 New Mexico NEW MEXICO, Central
## 32 NY 21058 New York
## 33 NC 25351 North Carolina NORTH CAROLINA, Sout
## 34 ND 14632 North Dakota NORTH DAKOTA, Centra
## 35 OH 24922 Ohio
## 36 OK 46802 Oklahoma
## 37 OR 4821 Oregon OREGON, Central and
## 38 PA 22226 Pennsylvania
## 39 RI 839 Rhode Island RHODE ISLAND
## 40 SC 17126 South Carolina
## 41 SD 21727 South Dakota SOUTH DAKOTA, West
## 42 TN 21721 Tennessee TENNESSEE, Central
## 43 TX 83728 Texas
## 44 UT 4135 Utah UTAH, West and Centr
## 45 VT 3871 Vermont VERMONT, North and C
## 46 VA 21189 Virginia VIRGINIA, Southwest
## 47 WA 3312 Washington WASHINGTON, Northeas
## 48 WV 9099 West Virginia WEST VIRGINIA, West
## 49 WI 19781 Wisconsin
## 50 WY 7332 Wyoming WYOMING, Southeast
## 51 PR 3015 Puerto Rico PUERTO RICO
## 52 AK 4391 Alaska ALASKA, Southern
## 53 ST 1
## 54 AS 257 America Samoa AMERICAN SAMOA
## 55 GU 306 Guam PACIFIC
## 56 MH 1 Marshall Islands
## 57 VI 338 U.S. Virgin Islands VIRGIN ISLANDS
## 58 AM 1879 ATLANTIC, Caribbean
## 59 LC 274 LAKE ST CLAIR
## 60 PH 28 HAWAII WATERS
## 61 GM 5337 GULF OF MEXICO
## 62 PZ 96 PACIFIC OCEAN
## 63 AN 3250 ATLANTIC, North
## 64 LH 654 LAKE HURON
## 65 LM 1347 LAKE MICHIGAN
## 66 LE 1526 LAKE ERIE
## 67 LS 262 LAKE SUPERIOR
## 68 SL 7 ST LAWRENCE RIVER
## 69 LO 70 LAKE ONTARIO
## 70 PM 1 PACIFIC ISLAND WATER
## 71 PK 23 GULF OF ALASKA
## 72 XX 2
This shows only 3 entries against unmatched state codes: ST (1) and XX (2). The analysis above is used to create a full list of state/territory codes, which includes territories and locations identified by NWS offices (namely the Great Lakes and other marine locations).
This full set of state/territory codes is applied via a state_abbreviations_full.csv file. Two further levels of grouping are then applied. Firstly, the state/territory codes are mapped to state codes only, plus inshore and offshore lake/sea groupings. Secondly, the state codes are mapped to the geographic groupings used by the Agricultural Research Service (ARS), namely Pacific West Area, Plains Area, Midwest Area, Southeast Area and Northeast Area (https://en.wikipedia.org/wiki/List_of_regions_of_the_United_States). These groupings are applied via a state_grouping.csv file.
data_states_full <- read.csv("state_abbreviations_full.csv", header = TRUE)
data_states_full
## code state statecode
## 1 AL Alabama AL
## 2 AK Alaska AK
## 3 AZ Arizona AZ
## 4 AR Arkansas AR
## 5 CA California CA
## 6 CO Colorado CO
## 7 CT Connecticut CT
## 8 DE Delaware DE
## 9 DC District of Columbia MA
## 10 FL Florida FL
## 11 GA Georgia GA
## 12 HI Hawaii HI
## 13 ID Idaho ID
## 14 IL Illinois IL
## 15 IN Indiana IN
## 16 IA Iowa IA
## 17 KS Kansas KS
## 18 KY Kentucky KY
## 19 LA Louisiana LA
## 20 ME Maine ME
## 21 MD Maryland MD
## 22 MA Massachusetts MA
## 23 MI Michigan MI
## 24 MN Minnesota MN
## 25 MS Mississippi MS
## 26 MO Missouri MO
## 27 MT Montana MT
## 28 NE Nebraska NE
## 29 NV Nevada NV
## 30 NH New Hampshire NH
## 31 NJ New Jersey NJ
## 32 NM New Mexico NM
## 33 NY New York NY
## 34 NC North Carolina NC
## 35 ND North Dakota ND
## 36 OH Ohio OH
## 37 OK Oklahoma OK
## 38 OR Oregon OR
## 39 PA Pennsylvania PA
## 40 RI Rhode Island RI
## 41 SC South Carolina SC
## 42 SD South Dakota SD
## 43 TN Tennessee TN
## 44 TX Texas TX
## 45 UT Utah UT
## 46 VT Vermont VT
## 47 VA Virginia VA
## 48 WA Washington WA
## 49 WV West Virginia WV
## 50 WI Wisconsin WI
## 51 WY Wyoming WY
## 52 AM Atlantic - Caribbean and Tropical ZO
## 53 AN Atlantic Ocean ZO
## 54 AS America Samoa ZO
## 55 GU Guam ZO
## 56 PK Gulf of Alaska ZO
## 57 GM Gulf of Mexico ZO
## 58 PH Hawaii Off Shore ZO
## 59 LE Lake Erie ZI
## 60 LH Lake Huron ZI
## 61 LM Lake Michigan ZI
## 62 LO Lake Ontario ZI
## 63 LC Lake St Clair ZI
## 64 LS Lake Superior ZI
## 65 MP Northern Marina Islands ZO
## 66 PM Pacific Island Waters ZO
## 67 PZ Pacific Ocean ZO
## 68 PR Puerto Rico ZO
## 69 VI U.S. Virgin Islands ZO
## 70 UM U.S. Minor Outlying Islands ZO
## 71 MH Marshall Islands ZO
## 72 FM Micronesia ZO
## 73 PW Palsu ZO
## 74 NB Nebraska ZO
## 75 CM Northern Marina Islands ZO
## 76 CZ Panama Canal Zone ZO
## 77 PI Philippine Islands ZO
## 78 SL St Lawrence River ZI
## 79 TT Trust Territory of the Pacific Islands ZO
stategrouping <- read.csv("state_grouping.csv", header = TRUE)
stategrouping
## INDEX state_abr state_name region chart_order
## 1 2 AK Alaska Pacific West Area 1
## 2 47 WA Washington Pacific West Area 2
## 3 37 OR Oregon Pacific West Area 3
## 4 12 ID Idaho Pacific West Area 4
## 5 5 CA California Pacific West Area 5
## 6 28 NV Nevada Pacific West Area 6
## 7 44 UT Utah Pacific West Area 7
## 8 3 AZ Arizona Pacific West Area 8
## 9 11 HI Hawaii Pacific West Area 9
## 10 26 MT Montana Plains Area 10
## 11 34 ND North Dakota Plains Area 11
## 12 50 WY Wyoming Plains Area 12
## 13 41 SD South Dakota Plains Area 13
## 14 27 NE Nebraska Plains Area 14
## 15 6 CO Colorado Plains Area 15
## 16 16 KS Kansas Plains Area 16
## 17 31 NM New Mexico Plains Area 17
## 18 43 TX Texas Plains Area 18
## 19 36 OK Oklahoma Plains Area 19
## 20 23 MN Minnesota Midwest Area 20
## 21 49 WI Wisconsin Midwest Area 21
## 22 15 IA Iowa Midwest Area 22
## 23 13 IL Illinois Midwest Area 23
## 24 25 MO Missouri Midwest Area 24
## 25 22 MI Michigan Midwest Area 25
## 26 14 IN Indiana Midwest Area 26
## 27 35 OH Ohio Midwest Area 27
## 28 17 KY Kentucky Midwest Area 28
## 29 4 AR Arkansas Southeast Area 29
## 30 18 LA Louisiana Southeast Area 30
## 31 42 TN Tennessee Southeast Area 31
## 32 24 MS Mississippi Southeast Area 32
## 33 1 AL Alabama Southeast Area 33
## 34 9 FL Florida Southeast Area 34
## 35 10 GA Georgia Southeast Area 35
## 36 40 SC South Carolina Southeast Area 36
## 37 33 NC North Carolina Southeast Area 37
## 38 46 VA Virginia Northeast Area 38
## 39 48 WV West Virginia Northeast Area 39
## 40 20 MD Maryland Northeast Area 40
## 41 8 DE Delaware Northeast Area 41
## 42 38 PA Pennsylvania Northeast Area 42
## 43 30 NJ New Jersey Northeast Area 43
## 44 32 NY New York Northeast Area 44
## 45 7 CT Connecticut Northeast Area 45
## 46 39 RI Rhode Island Northeast Area 46
## 47 21 MA Massachusetts Northeast Area 47
## 48 45 VT Vermont Northeast Area 48
## 49 29 NH New Hampshire Northeast Area 49
## 50 19 ME Maine Northeast Area 50
## 51 51 ZI Inshore Other 51
## 52 52 ZO Offshore Other 52
## 53 53 ZZ Not Matched Other 53
stateabrs <- stategrouping[order(stategrouping$state_name),]$state_abr
stateabrssorted <- stategrouping$state_abr
statenames <- sort(stategrouping$state_name)
statenamessorted <- stategrouping$state_name
# create a mapping table for the state hierarchy
sl_L3code <- index_match(statelist[,1], data.frame(data_states_full[,1],data_states_full[,1]), "XX")
sl_L3name <- index_match(statelist[,1], data_states_full[,c(1,2)], "Unknown")
sl_L2code <- index_match(statelist[,1], data_states_full[,c(1,3)], "ZZ")
sl_L2name <- index_match(sl_L2code, stategrouping[,c(2,3)], "")
sl_L1name <- index_match(sl_L2code, stategrouping[,c(2,4)], "")
sl_order <- index_match(sl_L2code, stategrouping[,c(2,5)], "")
statelist_all <- data.frame(D1code = statelist[,1],
D1count = statelist[,2],
L3code = sl_L3code,
L3name = sl_L3name,
L2code = sl_L2code,
L2name = sl_L2name,
L1name = sl_L1name,
c_order = as.integer(sl_order))
statelist_all <- arrange(statelist_all, c_order, L1name, L2name, L3name, D1code)
# show statelist in one table by reducing columns
statelist_display <- statelist_all[,c("D1code", "D1count", "L3name", "L2name", "L1name")]
statelist_display[,"L3name"] <- str_sub(statelist_display[,"L3name"],1,20)
statelist_display[,"L1name"] <- str_replace(statelist_display[,"L1name"]," Area","")
statelist_display
## D1code D1count L3name L2name L1name
## 1 AK 4391 Alaska Alaska Pacific West
## 2 WA 3312 Washington Washington Pacific West
## 3 OR 4821 Oregon Oregon Pacific West
## 4 ID 4767 Idaho Idaho Pacific West
## 5 CA 10780 California California Pacific West
## 6 NV 3139 Nevada Nevada Pacific West
## 7 UT 4135 Utah Utah Pacific West
## 8 AZ 6156 Arizona Arizona Pacific West
## 9 HI 2547 Hawaii Hawaii Pacific West
## 10 MT 14695 Montana Montana Plains
## 11 ND 14632 North Dakota North Dakota Plains
## 12 WY 7332 Wyoming Wyoming Plains
## 13 SD 21727 South Dakota South Dakota Plains
## 14 NE 30271 Nebraska Nebraska Plains
## 15 CO 20473 Colorado Colorado Plains
## 16 KS 53440 Kansas Kansas Plains
## 17 NM 7129 New Mexico New Mexico Plains
## 18 TX 83728 Texas Texas Plains
## 19 OK 46802 Oklahoma Oklahoma Plains
## 20 MN 23609 Minnesota Minnesota Midwest
## 21 WI 19781 Wisconsin Wisconsin Midwest
## 22 IA 31069 Iowa Iowa Midwest
## 23 IL 28488 Illinois Illinois Midwest
## 24 MO 35648 Missouri Missouri Midwest
## 25 MI 17910 Michigan Michigan Midwest
## 26 IN 21506 Indiana Indiana Midwest
## 27 OH 24922 Ohio Ohio Midwest
## 28 KY 22092 Kentucky Kentucky Midwest
## 29 AR 27102 Arkansas Arkansas Southeast
## 30 LA 17323 Louisiana Louisiana Southeast
## 31 TN 21721 Tennessee Tennessee Southeast
## 32 MS 22192 Mississippi Mississippi Southeast
## 33 AL 22739 Alabama Alabama Southeast
## 34 FL 22124 Florida Florida Southeast
## 35 GA 25259 Georgia Georgia Southeast
## 36 SC 17126 South Carolina South Carolina Southeast
## 37 NC 25351 North Carolina North Carolina Southeast
## 38 VA 21189 Virginia Virginia Northeast
## 39 WV 9099 West Virginia West Virginia Northeast
## 40 MD 8185 Maryland Maryland Northeast
## 41 DE 1913 Delaware Delaware Northeast
## 42 PA 22226 Pennsylvania Pennsylvania Northeast
## 43 NJ 8075 New Jersey New Jersey Northeast
## 44 NY 21058 New York New York Northeast
## 45 CT 3294 Connecticut Connecticut Northeast
## 46 RI 839 Rhode Island Rhode Island Northeast
## 47 DC 437 District of Columbia Massachusetts Northeast
## 48 MA 5652 Massachusetts Massachusetts Northeast
## 49 VT 3871 Vermont Vermont Northeast
## 50 NH 3022 New Hampshire New Hampshire Northeast
## 51 ME 4524 Maine Maine Northeast
## 52 LE 1526 Lake Erie Inshore Other
## 53 LH 654 Lake Huron Inshore Other
## 54 LM 1347 Lake Michigan Inshore Other
## 55 LO 70 Lake Ontario Inshore Other
## 56 LC 274 Lake St Clair Inshore Other
## 57 LS 262 Lake Superior Inshore Other
## 58 SL 7 St Lawrence River Inshore Other
## 59 AS 257 America Samoa Offshore Other
## 60 AM 1879 Atlantic - Caribbean Offshore Other
## 61 AN 3250 Atlantic Ocean Offshore Other
## 62 GU 306 Guam Offshore Other
## 63 PK 23 Gulf of Alaska Offshore Other
## 64 GM 5337 Gulf of Mexico Offshore Other
## 65 PH 28 Hawaii Off Shore Offshore Other
## 66 MH 1 Marshall Islands Offshore Other
## 67 PM 1 Pacific Island Water Offshore Other
## 68 PZ 96 Pacific Ocean Offshore Other
## 69 PR 3015 Puerto Rico Offshore Other
## 70 VI 338 U.S. Virgin Islands Offshore Other
## 71 ST 1 Unknown Not Matched Other
## 72 XX 2 Unknown Not Matched Other
This shows the mapping hierarchy for the state information. It is added to the main data_1 table, with state_L2name and state_L1name selected as the key fields for the state and stateGrp analysis variables.
data_1 <- mutate(data_1,
state_L3code = index_match(STATE, statelist_all[, c("D1code","L3code")]),
state_L3name = index_match(STATE, statelist_all[, c("D1code","L3name")]),
state_L2code = index_match(STATE, statelist_all[, c("D1code","L2code")]),
state_L2name = index_match(STATE, statelist_all[, c("D1code","L2name")]),
state_L1name = index_match(STATE, statelist_all[, c("D1code","L1name")]),
state = state_L2name,
stateGrp = state_L1name)
It can be seen how the data coverage is spread across states and also which states have events which were transformed by the rules that were applied.
states2 <- ruletable(statenames, "state")
states2
## rowtype name orig generic specific match other pct_gen
## 1 row Alabama 15268 7261 23 187 0 32.9
## 2 row Alaska 4004 145 67 124 51 8.8
## 3 row Arizona 4306 1639 13 197 1 30.1
## 4 row Arkansas 17936 9101 12 53 0 33.8
## 5 row California 8653 530 483 1095 19 19.7
## 6 row Colorado 17765 2041 35 632 0 13.2
## 7 row Connecticut 2398 796 52 48 0 27.2
## 8 row Delaware 1264 382 124 136 7 33.9
## 9 row Florida 14563 6310 266 984 1 34.2
## 10 row Georgia 15594 9316 60 284 5 38.3
## 11 row Hawaii 2057 33 18 426 13 19.2
## 12 row Idaho 3354 1123 69 220 1 29.6
## 13 row Illinois 19099 9051 46 290 2 33.0
## 14 row Indiana 13584 7688 29 203 2 36.8
## 15 row Inshore 1615 2525 0 0 0 61.0
## 16 row Iowa 22286 8154 34 591 4 28.3
## 17 row Kansas 41367 11861 31 174 7 22.6
## 18 row Kentucky 14249 7473 76 275 19 35.5
## 19 row Louisiana 10526 6562 68 167 0 39.2
## 20 row Maine 3283 1037 72 69 63 27.4
## 21 row Maryland 5574 2273 149 183 6 31.9
## 22 row Massachusetts 4491 1435 91 69 3 26.2
## 23 row Michigan 10816 6888 86 120 0 39.6
## 24 row Minnesota 16898 6602 6 102 1 28.4
## 25 row Mississippi 14065 7968 27 131 1 36.6
## 26 row Missouri 26053 9318 40 236 1 26.9
## 27 row Montana 11955 2538 4 198 0 18.6
## 28 row Nebraska 24089 6018 18 145 1 20.4
## 29 row Nevada 2658 415 13 50 3 15.3
## 30 row New Hampshire 2060 807 39 64 52 31.8
## 31 row New Jersey 5578 1849 223 412 13 30.9
## 32 row New Mexico 6200 827 14 75 13 13.0
## 33 row New York 13087 7268 180 441 82 37.9
## 34 row North Carolina 17518 7257 218 355 3 30.9
## 35 row North Dakota 11621 2939 5 67 0 20.6
## 36 row Not Matched 2 1 0 0 0 33.3
## 37 row Offshore 9798 3737 91 904 1 32.6
## 38 row Ohio 14657 10017 102 101 45 41.2
## 39 row Oklahoma 34008 12568 41 124 61 27.3
## 40 row Oregon 4060 314 36 410 1 15.8
## 41 row Pennsylvania 12806 8624 200 574 22 42.4
## 42 row Rhode Island 637 163 32 7 0 24.1
## 43 row South Carolina 10983 5851 96 191 5 35.9
## 44 row South Dakota 17125 4447 16 138 1 21.2
## 45 row Tennessee 12648 8776 13 284 0 41.8
## 46 row Texas 64104 18946 75 582 21 23.4
## 47 row Utah 3316 628 11 180 0 19.8
## 48 row Vermont 2587 942 78 157 107 33.2
## 49 row Virginia 14878 5822 151 327 11 29.8
## 50 row Washington 2863 249 25 173 2 13.6
## 51 row West Virginia 6184 2539 163 185 28 32.0
## 52 row Wisconsin 12542 6779 168 289 3 36.6
## 53 row Wyoming 273265 25432 21443 8014 7332 18.5
## 54 total covered 629032 876865 880854 894283 894965 0.0
## 55 total total 902297 902297 902297 902297 902297 0.0
## 56 total pct_covered 70 97 98 99 99 84.9
It can also be seen that the coverage across the state groupings is fairly even.
table(data_1$stateGrp)
##
## Midwest Area Northeast Area Other Pacific West Area
## 225025 113384 18674 44048
## Plains Area Southeast Area
## 300229 200937
The fatalities and injuries data are explored using a function called element_yearly() to see how complete they are.
element_yearly <- function(element1){
# coerce the chosen column to numeric and store it as element2
data_2 <- data_1 %>% mutate(element2 = as.numeric(data_1[,element1]))
# drop the final "NO YEAR" entry from the list of years
yearnames1 = yearnames[-length(yearnames)]
# one row per year: record counts, completeness, sum and mean of non-zero records
elyr <- data.frame(year = yearnames1,
count = rep(0,length(yearnames1)), #records in the year
count_0 = rep(0,length(yearnames1)), #records with a zero value
count_1 = rep(0,length(yearnames1)), #records with a non-zero value
complete_1 = rep(0,length(yearnames1)), #proportion of records that are non-zero
sum_1 = rep(0,length(yearnames1)), #total across the year
mean_1 = rep(0,length(yearnames1))) #mean of the non-zero records
for (yy in yearnames1) {
data_2f <- data_2 %>% filter(year == yy)
elyr[(elyr$year == yy),"count"] = nrow(data_2f)
elyr[(elyr$year == yy),"count_0"] = sum(data_2f$element2 == 0)
elyr[(elyr$year == yy),"count_1"] = elyr[(elyr$year == yy),"count"] - elyr[(elyr$year == yy),"count_0"]
elyr[(elyr$year == yy),"complete_1"] = ifelse(elyr[(elyr$year == yy),"count"] == 0,
0, round(elyr[(elyr$year == yy),"count_1"] / elyr[(elyr$year == yy),"count"],2))
elyr[(elyr$year == yy),"sum_1"] = round(sum(data_2f$element2),2)
elyr[(elyr$year == yy),"mean_1"] = ifelse(elyr[(elyr$year == yy),"count_1"] == 0,
0, round(elyr[(elyr$year == yy),"sum_1"]/elyr[(elyr$year == yy),"count_1"],2))
}
elyr
}
element_yearly("FATALITIES")
## year count count_0 count_1 complete_1 sum_1 mean_1
## 1 1950 223 200 23 0.10 70 3.04
## 2 1951 269 252 17 0.06 34 2.00
## 3 1952 272 236 36 0.13 230 6.39
## 4 1953 492 439 53 0.11 519 9.79
## 5 1954 609 588 21 0.03 36 1.71
## 6 1955 1413 1391 22 0.02 129 5.86
## 7 1956 1703 1683 20 0.01 83 4.15
## 8 1957 2184 2133 51 0.02 193 3.78
## 9 1958 2213 2192 21 0.01 67 3.19
## 10 1959 1813 1797 16 0.01 58 3.62
## 11 1960 1945 1928 17 0.01 46 2.71
## 12 1961 2246 2225 21 0.01 52 2.48
## 13 1962 2389 2379 10 0.00 30 3.00
## 14 1963 1968 1951 17 0.01 31 1.82
## 15 1964 2348 2330 18 0.01 73 4.06
## 16 1965 2855 2798 57 0.02 301 5.28
## 17 1966 2388 2376 12 0.01 98 8.17
## 18 1967 2688 2658 30 0.01 114 3.80
## 19 1968 3312 3284 28 0.01 131 4.68
## 20 1969 2926 2911 15 0.01 66 4.40
## 21 1970 3215 3192 23 0.01 73 3.17
## 22 1971 3471 3440 31 0.01 159 5.13
## 23 1972 2168 2156 12 0.01 27 2.25
## 24 1973 4463 4421 42 0.01 89 2.12
## 25 1974 5386 5288 98 0.02 366 3.73
## 26 1975 4975 4945 30 0.01 60 2.00
## 27 1976 3768 3739 29 0.01 44 1.52
## 28 1977 3728 3713 15 0.00 43 2.87
## 29 1978 3657 3638 19 0.01 53 2.79
## 30 1979 4279 4258 21 0.00 84 4.00
## 31 1980 6146 6126 20 0.00 28 1.40
## 32 1981 4517 4503 14 0.00 24 1.71
## 33 1982 7132 7099 33 0.00 64 1.94
## 34 1983 8322 8292 30 0.00 37 1.23
## 35 1984 7335 7270 65 0.01 160 2.46
## 36 1985 7979 7940 39 0.00 112 2.87
## 37 1986 8726 8692 34 0.00 51 1.50
## 38 1987 7367 7328 39 0.01 89 2.28
## 39 1988 7257 7218 39 0.01 55 1.41
## 40 1989 10410 10372 38 0.00 79 2.08
## 41 1990 10946 10901 45 0.00 95 2.11
## 42 1991 12522 12480 42 0.00 73 1.74
## 43 1992 13534 13504 30 0.00 54 1.80
## 44 1993 12607 12435 172 0.01 298 1.73
## 45 1994 20631 20432 199 0.01 344 1.73
## 46 1995 27970 27618 352 0.01 1491 4.24
## 47 1996 32270 31904 366 0.01 542 1.48
## 48 1997 28680 28343 337 0.01 601 1.78
## 49 1998 38128 37764 364 0.01 687 1.89
## 50 1999 31289 30949 340 0.01 908 2.67
## 51 2000 34471 34151 320 0.01 477 1.49
## 52 2001 34962 34664 298 0.01 469 1.57
## 53 2002 36293 35995 298 0.01 498 1.67
## 54 2003 39752 39468 284 0.01 443 1.56
## 55 2004 39363 39098 265 0.01 370 1.40
## 56 2005 39184 38944 240 0.01 469 1.95
## 57 2006 44034 43745 289 0.01 599 2.07
## 58 2007 43289 43001 288 0.01 421 1.46
## 59 2008 55663 55340 323 0.01 488 1.51
## 60 2009 45817 45562 255 0.01 333 1.31
## 61 2010 48161 47866 295 0.01 425 1.44
## 62 2011 62174 61778 396 0.01 1002 2.53
element_yearly("INJURIES")
## year count count_0 count_1 complete_1 sum_1 mean_1
## 1 1950 223 148 75 0.34 659 8.79
## 2 1951 269 206 63 0.23 524 8.32
## 3 1952 272 166 106 0.39 1915 18.07
## 4 1953 492 341 151 0.31 5131 33.98
## 5 1954 609 495 114 0.19 715 6.27
## 6 1955 1413 1316 97 0.07 926 9.55
## 7 1956 1703 1587 116 0.07 1355 11.68
## 8 1957 2184 2012 172 0.08 1976 11.49
## 9 1958 2213 2113 100 0.05 535 5.35
## 10 1959 1813 1732 81 0.04 734 9.06
## 11 1960 1945 1849 96 0.05 737 7.68
## 12 1961 2246 2088 158 0.07 1087 6.88
## 13 1962 2389 2313 76 0.03 551 7.25
## 14 1963 1968 1892 76 0.04 538 7.08
## 15 1964 2348 2211 137 0.06 1148 8.38
## 16 1965 2855 2668 187 0.07 5197 27.79
## 17 1966 2388 2300 88 0.04 2030 23.07
## 18 1967 2688 2525 163 0.06 2144 13.15
## 19 1968 3312 3180 132 0.04 2522 19.11
## 20 1969 2926 2822 104 0.04 1311 12.61
## 21 1970 3215 3082 133 0.04 1355 10.19
## 22 1971 3471 3292 179 0.05 2723 15.21
## 23 1972 2168 2031 137 0.06 976 7.12
## 24 1973 4463 4199 264 0.06 2406 9.11
## 25 1974 5386 5086 300 0.06 6824 22.75
## 26 1975 4975 4809 166 0.03 1457 8.78
## 27 1976 3768 3620 148 0.04 1195 8.07
## 28 1977 3728 3589 139 0.04 771 5.55
## 29 1978 3657 3552 105 0.03 919 8.75
## 30 1979 4279 4150 129 0.03 3014 23.36
## 31 1980 6146 6003 143 0.02 1157 8.09
## 32 1981 4517 4416 101 0.02 798 7.90
## 33 1982 7132 6975 157 0.02 1276 8.13
## 34 1983 8322 8152 170 0.02 816 4.80
## 35 1984 7335 7059 276 0.04 2858 10.36
## 36 1985 7979 7806 173 0.02 1513 8.75
## 37 1986 8726 8513 213 0.02 915 4.30
## 38 1987 7367 7174 193 0.03 1416 7.34
## 39 1988 7257 7038 219 0.03 1030 4.70
## 40 1989 10410 10108 302 0.03 1675 5.55
## 41 1990 10946 10627 319 0.03 1825 5.72
## 42 1991 12522 12241 281 0.02 1355 4.82
## 43 1992 13534 13262 272 0.02 1754 6.45
## 44 1993 12607 12312 295 0.02 2149 7.28
## 45 1994 20631 20027 604 0.03 4161 6.89
## 46 1995 27970 27267 703 0.03 4480 6.37
## 47 1996 32270 31631 639 0.02 2717 4.25
## 48 1997 28680 27983 697 0.02 3800 5.45
## 49 1998 38128 37264 864 0.02 11177 12.94
## 50 1999 31289 30580 709 0.02 5148 7.26
## 51 2000 34471 33842 629 0.02 2803 4.46
## 52 2001 34962 34304 658 0.02 2721 4.14
## 53 2002 36293 35736 557 0.02 3155 5.66
## 54 2003 39752 39192 560 0.01 2931 5.23
## 55 2004 39363 38897 466 0.01 2426 5.21
## 56 2005 39184 38758 426 0.01 1834 4.31
## 57 2006 44034 43511 523 0.01 3368 6.44
## 58 2007 43289 42893 396 0.01 2191 5.53
## 59 2008 55663 55137 526 0.01 2703 5.14
## 60 2009 45817 45379 438 0.01 1354 3.09
## 61 2010 48161 47688 473 0.01 1855 3.92
## 62 2011 62174 61544 630 0.01 7792 12.37
There are a lot of zero entries in the fatalities data, with only around 0.5% of events leading to fatalities. The picture is similar for injuries, where around 2% of events lead to injuries. These proportions would need to be verified by external expertise, but there is nothing obvious in the data to suggest a quality issue, so both fields are added to the data_1 dataset as-is.
## carry the public health fields through to data_1 unchanged
data_1 <- mutate(data_1,
injuries = INJURIES,
fatalities = FATALITIES)
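As a quick sanity check on the proportions quoted above, the overall share of events with a non-zero value can be computed directly:
## Overall percentage of events with at least one fatality / injury
round(100 * mean(data_1$fatalities > 0), 2)
round(100 * mean(data_1$injuries > 0), 2)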
The PROPDMG and CROPDMG fields are similar to the fatalities and injuries fields: they appear fully populated, but contain a lot of zero values.
element_yearly("CROPDMG")
## year count count_0 count_1 complete_1 sum_1 mean_1
## 1 1950 223 223 0 0.00 0.00 0.00
## 2 1951 269 269 0 0.00 0.00 0.00
## 3 1952 272 272 0 0.00 0.00 0.00
## 4 1953 492 492 0 0.00 0.00 0.00
## 5 1954 609 609 0 0.00 0.00 0.00
## 6 1955 1413 1413 0 0.00 0.00 0.00
## 7 1956 1703 1703 0 0.00 0.00 0.00
## 8 1957 2184 2184 0 0.00 0.00 0.00
## 9 1958 2213 2213 0 0.00 0.00 0.00
## 10 1959 1813 1813 0 0.00 0.00 0.00
## 11 1960 1945 1945 0 0.00 0.00 0.00
## 12 1961 2246 2246 0 0.00 0.00 0.00
## 13 1962 2389 2389 0 0.00 0.00 0.00
## 14 1963 1968 1968 0 0.00 0.00 0.00
## 15 1964 2348 2348 0 0.00 0.00 0.00
## 16 1965 2855 2855 0 0.00 0.00 0.00
## 17 1966 2388 2388 0 0.00 0.00 0.00
## 18 1967 2688 2688 0 0.00 0.00 0.00
## 19 1968 3312 3312 0 0.00 0.00 0.00
## 20 1969 2926 2926 0 0.00 0.00 0.00
## 21 1970 3215 3215 0 0.00 0.00 0.00
## 22 1971 3471 3471 0 0.00 0.00 0.00
## 23 1972 2168 2168 0 0.00 0.00 0.00
## 24 1973 4463 4463 0 0.00 0.00 0.00
## 25 1974 5386 5386 0 0.00 0.00 0.00
## 26 1975 4975 4975 0 0.00 0.00 0.00
## 27 1976 3768 3768 0 0.00 0.00 0.00
## 28 1977 3728 3728 0 0.00 0.00 0.00
## 29 1978 3657 3657 0 0.00 0.00 0.00
## 30 1979 4279 4279 0 0.00 0.00 0.00
## 31 1980 6146 6146 0 0.00 0.00 0.00
## 32 1981 4517 4517 0 0.00 0.00 0.00
## 33 1982 7132 7132 0 0.00 0.00 0.00
## 34 1983 8322 8322 0 0.00 0.00 0.00
## 35 1984 7335 7335 0 0.00 0.00 0.00
## 36 1985 7979 7979 0 0.00 0.00 0.00
## 37 1986 8726 8726 0 0.00 0.00 0.00
## 38 1987 7367 7367 0 0.00 0.00 0.00
## 39 1988 7257 7257 0 0.00 0.00 0.00
## 40 1989 10410 10410 0 0.00 0.00 0.00
## 41 1990 10946 10946 0 0.00 0.00 0.00
## 42 1991 12522 12522 0 0.00 0.00 0.00
## 43 1992 13534 13534 0 0.00 0.00 0.00
## 44 1993 12607 11769 838 0.07 51661.15 61.65
## 45 1994 20631 19001 1630 0.08 88819.98 54.49
## 46 1995 27970 27030 940 0.03 35493.00 37.76
## 47 1996 32270 31162 1108 0.03 69291.90 62.54
## 48 1997 28680 27536 1144 0.04 74074.56 64.75
## 49 1998 38128 36036 2092 0.05 102235.21 48.87
## 50 1999 31289 30541 748 0.02 62707.15 83.83
## 51 2000 34471 33083 1388 0.04 79413.61 57.21
## 52 2001 34962 33668 1294 0.04 48248.35 37.29
## 53 2002 36293 35451 842 0.02 49230.64 58.47
## 54 2003 39752 38792 960 0.02 52322.19 54.50
## 55 2004 39363 37922 1441 0.04 72438.97 50.27
## 56 2005 39184 38454 730 0.02 48443.05 66.36
## 57 2006 44034 43075 959 0.02 65850.56 68.67
## 58 2007 43289 42258 1031 0.02 76768.00 74.46
## 59 2008 55663 53848 1815 0.03 183821.00 101.28
## 60 2009 45817 44803 1014 0.02 59683.00 58.86
## 61 2010 48161 47203 958 0.02 59014.00 61.60
## 62 2011 62174 61007 1167 0.02 98311.00 84.24
element_yearly("PROPDMG")
## year count count_0 count_1 complete_1 sum_1 mean_1
## 1 1950 223 28 195 0.87 16999.15 87.18
## 2 1951 269 39 230 0.86 10560.99 45.92
## 3 1952 272 47 225 0.83 16679.74 74.13
## 4 1953 492 88 404 0.82 19182.20 47.48
## 5 1954 609 127 482 0.79 23367.82 48.48
## 6 1955 1413 980 433 0.31 27715.63 64.01
## 7 1956 1703 1280 423 0.25 27002.35 63.84
## 8 1957 2184 1366 818 0.37 44568.89 54.49
## 9 1958 2213 1674 539 0.24 26597.11 49.35
## 10 1959 1813 1311 502 0.28 25015.54 49.83
## 11 1960 1945 1393 552 0.28 28314.24 51.29
## 12 1961 2246 1626 620 0.28 39528.73 63.76
## 13 1962 2389 1988 401 0.17 22245.73 55.48
## 14 1963 1968 1597 371 0.19 24793.08 66.83
## 15 1964 2348 1758 590 0.25 38618.32 65.45
## 16 1965 2855 2147 708 0.25 46716.54 65.98
## 17 1966 2388 1970 418 0.18 27079.36 64.78
## 18 1967 2688 2010 678 0.25 40056.72 59.08
## 19 1968 3312 2799 513 0.15 27762.41 54.12
## 20 1969 2926 2470 456 0.16 33354.23 73.15
## 21 1970 3215 2712 503 0.16 30800.98 61.23
## 22 1971 3471 2775 696 0.20 44778.05 64.34
## 23 1972 2168 1595 573 0.26 36287.57 63.33
## 24 1973 4463 3454 1009 0.23 75056.56 74.39
## 25 1974 5386 4565 821 0.15 57905.28 70.53
## 26 1975 4975 4232 743 0.15 52498.79 70.66
## 27 1976 3768 3065 703 0.19 56249.69 80.01
## 28 1977 3728 3037 691 0.19 54811.73 79.32
## 29 1978 3657 3043 614 0.17 49135.66 80.03
## 30 1979 4279 3632 647 0.15 54633.74 84.44
## 31 1980 6146 5423 723 0.12 64904.35 89.77
## 32 1981 4517 3942 575 0.13 46608.31 81.06
## 33 1982 7132 6004 1128 0.16 83065.50 73.64
## 34 1983 8322 7328 994 0.12 64058.09 64.44
## 35 1984 7335 6539 796 0.11 63241.39 79.45
## 36 1985 7979 7506 473 0.06 39667.21 83.86
## 37 1986 8726 8170 556 0.06 50897.34 91.54
## 38 1987 7367 6955 412 0.06 33212.95 80.61
## 39 1988 7257 6722 535 0.07 50289.65 94.00
## 40 1989 10410 9826 584 0.06 49263.55 84.36
## 41 1990 10946 10166 780 0.07 69560.57 89.18
## 42 1991 12522 11844 678 0.05 60888.93 89.81
## 43 1992 13534 12674 860 0.06 70526.40 82.01
## 44 1993 12607 7049 5558 0.44 477319.20 85.88
## 45 1994 20631 11574 9057 0.44 622212.70 68.70
## 46 1995 27970 18331 9639 0.34 319225.80 33.12
## 47 1996 32270 22981 9289 0.29 424503.33 45.70
## 48 1997 28680 19225 9455 0.33 337999.53 35.75
## 49 1998 38128 25099 13029 0.34 545844.65 41.89
## 50 1999 31289 21481 9808 0.31 364611.53 37.17
## 51 2000 34471 23961 10510 0.30 403471.23 38.39
## 52 2001 34962 25655 9307 0.27 383596.77 41.22
## 53 2002 36293 26535 9758 0.27 355741.34 36.46
## 54 2003 39752 29400 10352 0.26 437754.86 42.29
## 55 2004 39363 29618 9745 0.25 412837.59 42.36
## 56 2005 39184 29743 9441 0.24 428309.11 45.37
## 57 2006 44034 32696 11338 0.26 449583.65 39.65
## 58 2007 43289 31958 11331 0.26 438170.28 38.67
## 59 2008 55663 38686 16977 0.30 690246.11 40.66
## 60 2009 45817 32087 13730 0.30 553432.58 40.31
## 61 2010 48161 32866 15295 0.32 594847.94 38.89
## 62 2011 62174 42271 19903 0.32 820290.74 41.21
The CROPDMG field is only populated from 1993 onwards, with around 5% of events having crop damage data. The PROPDMG field is better populated, with around 30% of events having property damage data. Again, this would need to be verified by external expertise, but there is nothing obvious in the data to suggest a quality issue. The values appear to be entered in thousands of dollars, although this is not explicitly documented, so this is taken as a working assumption.
We also adjust the dollar values in each of these fields to 2021 US dollars, using an index of the value of 100 US dollars in each year from 1950 to 2021. This ensures a fair comparison across historic data.
data_1 <- mutate(data_1,
## index value for the year the event occurred
infyrX = index_match(year, data_inflation[,c(1,2)], data_inflation[nrow(data_inflation),2]),
## index value for 2021 (the last row of data_inflation)
inf2021 = data_inflation[nrow(data_inflation),2],
## restate damage values (recorded in '000s of dollars) in 2021 dollars
CROPDMG21 = round(1000 * as.numeric(CROPDMG) * inf2021/infyrX,0),
PROPDMG21 = round(1000 * as.numeric(PROPDMG) * inf2021/infyrX,0))
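As a worked illustration of this conversion (using hypothetical index values; the real figures come from data_inflation), a property damage entry of 50, i.e. $50,000, recorded in 1980 would be restated as follows:
## Hypothetical index values, for illustration only
prop_1980 <- 50      # PROPDMG as recorded, in thousands of dollars
idx_1980  <- 24.5    # assumed index value for 1980
idx_2021  <- 77.0    # assumed index value for 2021
round(1000 * prop_1980 * idx_2021 / idx_1980, 0)
## gives 157143, i.e. roughly $157,000 in 2021 dollars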
This section considers some initial results from analysing the SDD data. Specifically, it sets out three pieces of analysis, as follows:
The first analysis considers the full dataset across events, years and states. It highlights where event data has been cleansed, so that the impact of cleansing can be understood and compared against analysis that has not undergone data cleansing.
# set up the margins for a triple plot (vertically)
par(mfrow = c(3,1), mar = c(1, 2, 4, 1), oma = c(3, 3, 5, 1))
options(scipen=999)
# build the chart1 dataset for analysis by event
data_c <- data_1 %>% group_by(event_match, event, .drop = FALSE) %>% summarise(count = n()) %>% ungroup()
cdata1 <- matrix(0, ncol = length(eventnamessorted), nrow = 2)
cdata1[1,] <- index_match(eventnamessorted, filter(data_c, event_match == "TRUE")[-1],0)
cdata1[2,] <- index_match(eventnamessorted, filter(data_c, event_match == "FALSE")[-1],0)
rownames(cdata1) <- c("original", "generated")
colnames(cdata1) <- eventnamessorted
# draw chart1 for analysis by event
b1 <- barplot(cdata1,
ylim = c(0, 100000),
col = c("blue", "light blue"),
border = NA,
ylab = "frequency",
xaxt = "n",
yaxt = "n",
xaxs = "i",
yaxs = "i")
axis(side = 2,
las = 2,
cex.axis = 1,
at = seq(0, 100000, by = 20000),
labels = format(axTicks(2), big.mark = ","),
mgp = c(3, 1, 0))
text( x = b1,
y = par("usr")[3] - 1,
labels = eventnamessorted,
cex = 1,
xpd = NA,
srt = 90,
adj = 1)
legend("topright",
legend = c("original", "generated"),
fill = c("blue", "light blue"),
border = NA,
box.lty = 0)
text( x = c(32,39.25),
y = c(90000,90000),
label = c("325,080", "288,842"),
cex = 1,
srt = 90)
title("Figure 1: Number of different weather events recorded in SDD from 1950 to 2011",
cex = 3, line = 3, outer = TRUE)
# build the chart2 dataset for analysis by year
data_c <- data_1 %>% group_by(event_match, year, .drop = FALSE) %>% summarise(count = n()) %>% ungroup()
yearnames1 <- yearnames[-length(yearnames)]
cdata2 <- matrix(0, ncol = length(yearnames1), nrow = 2)
cdata2[1,] <- index_match(yearnames1, filter(data_c, event_match == "TRUE")[-1],0)
cdata2[2,] <- index_match(yearnames1, filter(data_c, event_match == "FALSE")[-1],0)
rownames(cdata2) <- c("original", "generated")
colnames(cdata2) <- yearnames1
# draw chart2 for analysis by year
b2 <- barplot(cdata2,
ylim = c(0, 100000),
col = c("blue", "light blue"),
border = NA,
ylab = "frequency",
xaxt = "n",
yaxt = "n",
xaxs = "i",
yaxs = "i")
axis(side = 2,
las = 2,
cex.axis = 1,
at = seq(0, 100000, by = 20000),
labels = format(axTicks(2), big.mark = ","),
mgp = c(3, 1, 0))
text( x = b2,
y = par("usr")[3] - 1,
labels = yearnames1,
cex = 1,
xpd = NA,
srt = 90,
adj = 1)
# build the chart3 dataset for analysis by state
data_c <- data_1 %>% group_by(event_match, state, .drop = FALSE) %>% summarise(count = n()) %>% ungroup()
cdata3 <- matrix(0, ncol = length(statenamessorted), nrow = 2)
cdata3[1,] <- index_match(statenamessorted, filter(data_c, event_match == "TRUE")[-1],0)
cdata3[2,] <- index_match(statenamessorted, filter(data_c, event_match == "FALSE")[-1],0)
rownames(cdata3) <- c("original", "generated")
colnames(cdata3) <- statenamessorted
# draw chart3 for analysis by state
b3 <- barplot(cdata3,
ylim = c(0, 100000),
col = c("blue", "light blue"),
border = NA,
ylab = "frequency",
xaxt = "n",
yaxt = "n",
xaxs = "i",
yaxs = "i")
axis(side = 2,
las = 2,
cex.axis = 1,
at = seq(0, 100000, by = 20000),
labels = format(axTicks(2), big.mark = ","),
mgp = c(3, 1, 0))
text( x = b3,
y = par("usr")[3] - 1,
labels = statenamessorted,
cex = 1,
xpd = NA,
srt = 90,
adj = 1)
The analysis in Figure 1 shows that:
The second analysis looks at the fatalities and injuries data. Given the improvement in the data from 1992 onwards, the pre-1992 data is ignored for this analysis. The analysis considers the total number of fatalities and injuries per event type, alongside the fatalities and injuries per individual event. This allows a full comparison of the events from a public health perspective.
# This stores the data for charts 4 to 7
perevent1 <- data.frame(event = eventnames,
count = rep(0, length(eventnames)),
injuries = rep(0, length(eventnames)),
fatalities = rep(0, length(eventnames)),
CROPDMG21 = rep(0, length(eventnames)),
PROPDMG21 = rep(0, length(eventnames)),
injuries_e = rep(0, length(eventnames)),
fatalities_e = rep(0, length(eventnames)),
CROPDMG21_e = rep(0, length(eventnames)),
PROPDMG21_e = rep(0, length(eventnames)))
for (ee in eventnames) {
data_pe <- data_1 %>% filter(yearGrp != "pre 1992", event == ee)
perevent1[(perevent1$event == ee), "count"] <- nrow(data_pe)
cEvents <- nrow(data_pe)
perevent1[(perevent1$event == ee), "injuries"] <- sum(as.numeric(data_pe$injuries))
perevent1[(perevent1$event == ee), "fatalities"] <- sum(as.numeric(data_pe$fatalities))
perevent1[(perevent1$event == ee), "CROPDMG21"] <- sum(as.numeric(data_pe$CROPDMG21))
perevent1[(perevent1$event == ee), "PROPDMG21"] <- sum(as.numeric(data_pe$PROPDMG21))
if (cEvents == 0) {
perevent1[(perevent1$event == ee), "injuries_e"] <- 0
perevent1[(perevent1$event == ee), "fatalities_e"] <- 0
perevent1[(perevent1$event == ee), "CROPDMG21_e"] <- 0
perevent1[(perevent1$event == ee), "PROPDMG21_e"] <- 0
} else {
perevent1[(perevent1$event == ee), "injuries_e"] <-
round(perevent1[perevent1$event == ee, "injuries"] / cEvents,2)
perevent1[(perevent1$event == ee), "fatalities_e"] <-
round(perevent1[perevent1$event == ee, "fatalities"] / cEvents,2)
perevent1[(perevent1$event == ee), "CROPDMG21_e"] <-
round(perevent1[perevent1$event == ee, "CROPDMG21"] / cEvents,2)
perevent1[(perevent1$event == ee), "PROPDMG21_e"] <-
round(perevent1[perevent1$event == ee, "PROPDMG21"] / cEvents,2)
}
}
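The same aggregation can be written more compactly with dplyr; a sketch is shown below, assuming data_1 as above. Note that, unlike the loop, this version omits event types with no post-1992 rows, so the loop output is the one used for the charts.
# Compact dplyr equivalent of the loop above (comparison only)
perevent_alt <- data_1 %>%
  filter(yearGrp != "pre 1992") %>%
  group_by(event) %>%
  summarise(count = n(),
            injuries = sum(as.numeric(injuries)),
            fatalities = sum(as.numeric(fatalities)),
            CROPDMG21 = sum(as.numeric(CROPDMG21)),
            PROPDMG21 = sum(as.numeric(PROPDMG21)),
            .groups = "drop") %>%
  mutate(injuries_e = round(injuries / count, 2),
         fatalities_e = round(fatalities / count, 2),
         CROPDMG21_e = round(CROPDMG21 / count, 2),
         PROPDMG21_e = round(PROPDMG21 / count, 2))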
# set up parameters for 2 charts side by side
par(mfrow = c(1,2), mar = c(1, 5, 2, 1), oma = c(3, 3, 5, 1))
eventnamesreversed = rev(eventnames)
eventnamessortedrev = rev(eventnamessorted)
# build the chart4 dataset for analysis of injuries and fatalities
cdata4 <- matrix(0, ncol = length(eventnamessortedrev), nrow = 2)
# cdata4[1,] <- rev(perevent1[,"injuries"])
cdata4[1,] <- index_match(eventnamessortedrev, perevent1[,c("event", "injuries")])
# cdata4[2,] <- rev(perevent1[,"fatalities"])
cdata4[2,] <- index_match(eventnamessortedrev, perevent1[,c("event", "fatalities")])
rownames(cdata4) <- c("injuries", "fatalities")
colnames(cdata4) <- eventnamessortedrev
# draw chart4
b4 <- barplot(cdata4,
horiz = TRUE,
beside=TRUE,
xlim = c(0, 10000),
col = c("blue", "red"),
border = NA,
xlab = "frequency",
xaxt = "n",
yaxt = "n",
xaxs = "i",
yaxs = "i")
axis(side = 1,
cex.axis = 0.7,
at = seq(0, 10000, by = 2000),
labels = format(axTicks(1), big.mark = ","),
mgp = c(3, 1, 0))
text( y = (b4[1,]+b4[2,])/2,
x = -100,
labels = eventnamessortedrev,
cex = 0.6,
xpd = NA,
adj = 1)
mtext("Total incidents",
side = 3,
line = 1,
adj = 0)
legend("top",
legend = c("fatalaties", "injuries"),
fill = c("red", "blue"),
border = NA,
box.lty = 0,
cex = 0.7)
abline(v = seq(0, 10000, by = 2000),
col = "grey",
lty = 3,
lwd = 1)
title("Figure 2: Injuries and fatalities from different weather events recorded in SDD from 1996 to 2011", cex = 1.5, line = 3, outer = TRUE)
# build the chart5 dataset for analysis of injuries and fatalities by event
cdata5 <- matrix(0, ncol = length(eventnamessortedrev), nrow = 2)
# cdata5[1,] <- rev(perevent1[,"injuries_e"])
cdata5[1,] <- index_match(eventnamessortedrev, perevent1[,c("event", "injuries_e")])
# cdata5[2,] <- rev(perevent1[,"fatalities_e"])
cdata5[2,] <- index_match(eventnamessortedrev, perevent1[,c("event", "fatalities_e")])
rownames(cdata5) <- c("injuries_e", "fatalities_e")
colnames(cdata5) <- eventnamessortedrev
# draw chart5
b5 <- barplot(cdata5,
horiz = TRUE,
beside=TRUE,
xlim = c(0, 10),
col = c("blue", "red"),
border = NA,
xlab = "frequency",
xaxt = "n",
yaxt = "n",
xaxs = "i",
yaxs = "i")
axis(side = 1,
cex.axis = 0.7,
at = seq(0, 10, by = 2),
labels = format(axTicks(1), big.mark = ","),
mgp = c(3, 1, 0))
text( y = (b5[1,]+b5[2,])/2,
x = -0.1,
labels = eventnamessortedrev,
cex = 0.6,
xpd = NA,
adj = 1)
mtext("Incidents per event",
side = 3,
line = 1,
adj = 0)
legend("top",
legend = c("fatalaties", "injuries"),
fill = c("red", "blue"),
border = NA,
box.lty = 0,
cex = 0.7)
# add vertical gridlines
abline(v = seq(0, 10, by = 2),
col = "grey",
lty = 3,
lwd = 1)
The analysis in Figure 2 shows that:
The third analysis looks at the property damage and crop damage data. Given the improvement in the data from 1992 onwards, the pre-1992 data is ignored for this analysis. The analysis considers the total economic impact of property damage and crop damage, alongside the economic impact per individual event. This allows a full comparison of the events from an economic perspective.
par(mfrow = c(1,2), mar = c(1, 5, 2, 1), oma = c(3, 3, 5, 1))
# build the chart6 dataset for analysis of crop damage and property damage
cdata6 <- matrix(0, ncol = length(eventnamessortedrev), nrow = 2)
# cdata6[1,] <- rev(perevent1[,"CROPDMG21"])
cdata6[1,] <- index_match(eventnamessortedrev, perevent1[,c("event", "CROPDMG21")])
# cdata6[2,] <- rev(perevent1[,"PROPDMG21"])
cdata6[2,] <- index_match(eventnamessortedrev, perevent1[,c("event", "PROPDMG21")])
rownames(cdata6) <- c("CROPDMG21", "PROPDMG21")
colnames(cdata6) <- eventnamessortedrev
# draw chart 6
b6 <- barplot(cdata6,
horiz = TRUE,
beside=TRUE,
xlim = c(0, 5000000000),
col = c("forestgreen", "green"),
border = NA,
xlab = "frequency",
xaxt = "n",
yaxt = "n",
xaxs = "i",
yaxs = "i")
# axis(side = 1, labels = FALSE)
axis(side = 1,
cex.axis = 0.7,
at = seq(0, 5000000000, by = 1000000000),
labels = c("0", "1bn", "2bn", "3bn", "4bn", "5bn"),
mgp = c(3, 1, 0))
text( y = (b6[1,]+b6[2,])/2,
x = -5e7, # offset ~1% of the axis range, matching charts 4 and 5
labels = eventnamessortedrev,
cex = 0.6,
xpd = NA,
adj = 1)
mtext("Total economic impact (USD 2021)",
side = 3,
line = 1,
adj = 0)
legend("top",
legend = c("crop damage", "property damage"),
fill = c("forestgreen", "green"),
border = NA,
box.lty = 0,
cex = 0.7)
# add vertical gridlines
abline(v = seq(0, 5000000000, by = 1000000000),
col = "grey",
lty = 3,
lwd = 1)
title("Figure 3: Crop damage and property damage from different weather events recorded in SDD from 1996 to 2011", cex = 1.5, line = 3, outer = TRUE)
# build the chart7 dataset for analysis of crop damage and property damage per event
cdata7 <- matrix(0, ncol = length(eventnamessortedrev), nrow = 2)
# cdata7[1,] <- rev(perevent1[,"CROPDMG21_e"])
cdata7[1,] <- index_match(eventnamessortedrev, perevent1[,c("event", "CROPDMG21_e")])
# cdata7[2,] <- rev(perevent1[,"PROPDMG21_e"])
cdata7[2,] <- index_match(eventnamessortedrev, perevent1[,c("event", "PROPDMG21_e")])
rownames(cdata7) <- c("CROPDMG21_e", "PROPDMG21_e")
colnames(cdata7) <- eventnamessortedrev
# draw chart 7
b7 <- barplot(cdata7,
horiz = TRUE,
beside = TRUE,
xlim = c(0, 150000),
col = c("forestgreen", "green"),
border = NA,
xlab = "frequency",
xaxt = "n",
yaxt = "n",
xaxs = "i",
yaxs = "i")
# axis(side = 1, labels = FALSE)
axis(side = 1,
cex.axis = 0.7,
at = seq(0, 150000, by = 25000),
mgp = c(3, 1, 0))
text( y = (b7[1,]+b7[2,])/2,
x = -1500, # offset ~1% of the axis range, matching charts 4 and 5
labels = eventnamessortedrev,
cex = 0.6,
xpd = NA,
adj = 1)
mtext("Economic impact per event (USD 2021)",
side = 3,
line = 1,
adj = 0)
legend("top",
legend = c("crop damage", "property damage"),
fill = c("forestgreen", "green"),
border = NA,
box.lty = 0,
cex = 0.7)
# add vertical gridlines
abline(v = seq(0, 150000, by = 25000),
col = "grey",
lty = 3,
lwd = 1)
The analysis in Figure 3 shows that:
The initial three pieces of analysis are interesting and point towards directions for further investigation around public health and economic impact policy. Further analysis is suggested in the following areas: