Synopsis

This report looks at the National Weather Services (NWS) Storm Data Database (SDD) to determine which weather events have the biggest impact on public health and the economy across the United States. The SDD holds information on severe weather events across United States for the period 1953 to 2011 with the current range of 48 weather events being recorded since 1996. The SDD is manually populated by NWS employees from across the country and data quality is an issue - therefore a significant part of this analysis was data cleansing to ensure a robust analysis could be undertaken. All analysis in this report was undertaken using R and has been conducted in a reproducible way, all instructions and code to reproduce the analysis can be found in the Data Processing section. The key findings from this analysis are

This analysis was a preliminary look into the data and the questions around public health and economic impact and at the end of the paper further areas of analysis are suggested.



Data Processing

This section describes how data was processed to produce the results in the Results section. All analysis for this report was run using the R statistical language in the R Studio editor. All analysis has been conducted in a reproducible way, with the intention that other researchers can both interrogate the robustness of the analysis and build further on the analysis for deeper insights. All instructions and code to reproduce the analysis can be found in this section, in addition all code in this document can be run directly to recreate the analysis.


Setting up the R environment

The analysis was conducted within R Studio using the following additional packages:

  • tidyverse to support with the manipulating and formatting data in a reproducible way
  • lubridate to support the handling of dates and times
  • tidystringdist to enable the quantification of string matching.
install.packages("tidyverse", "lubridate", "tidystringdist")
library(tidyverse)
library(lubridate)
library(tidystringdist)


Sourcing and uploading the data

The data for this project were all obtained from publicly available sources. For the purpose of this analysis all of the data is provided as zip or csv files along with links to where it was originally obtained.

Storm data available from the NWS website: https://www.ncdc.noaa.gov/stormevents

## Download the main SDD plus help and FAQ files

download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormdata.csv.bz2", 
                "Stormdata.csv.bz2", method = "curl")
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf", 
                "NWSI_10_1605.pdf")
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2FNCDC%20Storm%20Events-FAQ%20Page.pdf",
                "Storm_data_FAQ.pdf")

## Upload main dataset to an R dataframe
data_0 <- read.csv("Stormdata.csv.bz2", header = TRUE)

## Upload a list of event names to an R dataframe (this list is as per the NWSI_10_1605.pdf document)
data_events <- read.csv("event_names.csv", header = TRUE)

State names and abbreviations were obtained from Wikipedia: https://en.wikipedia.org/wiki/List_of_U.S._state_and_territory_abbreviations

## Upload list of 50 US states and their 2-letter abbreviations
data_states <- read.csv("state_abbreviations.csv", header = TRUE)

## Upload list of US territories and their 2-letter abbreviations
data_territories <- read.csv("territory_abbreviations.csv", header = TRUE)

Inflation rates and indices were obtained from the following website: https://www.in2013dollars.com/us/inflation/1950

## Upload table of inflation and index
data_inflation <- read.csv("inflation_data.csv", header = TRUE)


Processing the data

For this analysis three explanatory variables were used to explore the data: the event type, the state the event occurred in and the year the event occurred in. Four output variables were considered to address the areas of interest: fatalities and injuries for public health, and cost of crop damage and cost of property damage for economic impact. Various data cleansing and data processing was undertaken on all of these variables to produce the final analysis and this is described here for each variable in turn.


Processing - events

The key explanatory variable for this analysis is the event type. A list of 48 event types is provided by the NWS in the help documentation and this is used as the categorisation for events. These event types are used to categorise each weather event in the database but an initial analysis of the EVTYPE field shows that although it appears mandatory it is a free text field which has led to a significant data quality issue.

# store a list of eventnames
eventnames <- data_events[,1]
# create a new master dataset with additional columns and add an event_match flag
data_1 <- data_0 %>% mutate(event_match = EVTYPE %in% eventnames)

# number of event types
length(eventnames)
## [1] 48
# number of observations in the SDD
nrow(data_1)
## [1] 902297
# proportion of observations with a valid event in EVTYPE field
mean(data_1$event_match)
## [1] 0.7040642
# number of different entries in EVTYPE field
length(unique(data_1$EVTYPE))
## [1] 985
# add 'other' categories to eventnames
eventnames <- c(eventnames, "OTHER/TBD", "NO EVENT") 

The dataset has 902,297 rows but only 70.41% have a valid event name. 985 different values have been used to populate the EVTYPE field, meaning that there are 935 invalid descriptions used in the EVTYPE field.

To ensure a more complete data coverage for the analysis data cleansing was undertaken to convert non-valid entries in the EVTYPE field into valid event names. Four categories of data cleansing rules were applied to the EVTYPE data field as follows:

  • Generic - A set of 8 rules to tidy up all strings (e.g. convert all to UPPER CASE)
  • Specific - A set of 27 rules to replace specific strings (e.g. convert ‘ICE FOG’ to ‘FREEZING FOG’)
  • Matching - A set of 16 rules to replace partially matched strings (e.g. convert any string containing ‘SLIDE’ to ‘DEBRIS FLOW’)
  • Other - A set of 9 rules to replace some strings with ‘other’ for review at a later date (e.g. convert any string containing ‘GRADIENT’ to ‘OTHER/TBD’)

The rules are implemented as functions which take a string input and return a revised string. They are shown below for completeness and transparency.

################################################################################
## The functions below perform string transformations on the EVTYPE field ######

## Initial function to test the name already in EVTYPE
nNAME <- function(x) {x}

## "g" functions to generically tidy up strings 
gUPPR <- function(x) {str_to_upper(x)}  #convert to UPPER CASE
gSQSH <- function(x) {str_squish(x)}   #trim white space and remove double space
gCSTL <- function(x) {str_replace(x, "CSTL", "COASTAL")}
gFLDD <- function(x) {str_replace(x, "FLD", "FLOOD")}
gTHUN <- function(x) {str_replace(x, "THUNDERSTORMS", "THUNDERSTORM")}
gTIDL <- function(x) {str_replace(x, "TIDAL", "COASTAL")}
gTSTM <- function(x) {str_replace(x, "TSTM", "THUNDERSTORM")}
gWNDS <- function(x) {str_replace(x, "WINDS", "WIND")}

## "s" functions to replace specific full strings with event types
sCOLD <- function(x) {ifelse(   x == "COLD"           | x == "PROLONG COLD/SNOW"|
    x == "UNSEASONABLY COOL"  | x == "COLD WEATHER"   | x == "PROLONG COLD"     |
    x == "UNSEASONABLY COLD"  | x == "COLD WAVE"      | x == "EXTENDED COLD"    |
    x == "UNSEASONABLE COLD"  | x == "SNOW/COLD"      | x == "LOW TEMPERATURE"  |
    x == "UNSEASONAL LOW TEMP"| x == "SNOW\\COLD"     | x == "COLD TEMPERATURE" |
    x == "UNUSUALLY COLD"     | x == "SNOW AND COLD"  | x == "COLD TEMPERATURES"|
    x == "LOW WIND CHILL"     | x == "COLD WIND CHILL TEMPERATURES"             |         
    x == "WIND CHILL" | x == "FOG AND COLD TEMPERATURES",         "COLD/WIND CHILL", x)}                     
                                                            
sDDVL <- function(x) {ifelse(x == "WHIRLWIND"| x == "LANDSPOUT",       "DUST DEVIL", x)}

sDFLW <- function(x) {ifelse(x == "LANDSLUMP",                        "DEBRIS FLOW", x)}

sDFOG <- function(x) {ifelse(x == "FOG"| x == "VOG",                    "DENSE FOG", x)}

sDRGT <- function(x) {ifelse(x == "RECORD LOW RAINFALL",                  "DROUGHT", x)}

sDUST <- function(x) {ifelse(x == "SAHARAN DUST",                      "DUST STORM", x)}

sFFLD <- function(x) {ifelse(x == "ICE JAM",                          "FLASH FLOOD", x)}

sFFOG <- function(x) {ifelse(x == "GLAZE"| x == "ICE FOG"| x == "GLAZE ICE",
                                                                     "FREEZING FOG", x)}                                                                 
sFLDD <- function(x) {ifelse(x == "HIGH WATER"      |x == "RAPIDLY RISING WATER"| 
                             x == "SMALL STREAM"    |x == "DAM BREAK"           |  
                             x == "SMALL STREAM AND"|x == "DAM FAILURE",
                                                                            "FLOOD", x)}                                                                       
sFRFR <- function(x) {ifelse(x == "FREEZE"| x == "ICY ROADS"  |x == "PATCHY ICE"|       
                             x == "FROST" | x == "ICE ROADS"  |x == "BLACK ICE" |
                    x == "COLD AND FROST" | x == "ICE ON ROAD"|x == "ICE", 
                                                                     "FROST/FREEZE", x)}                                                                
sHAIL <- function(x) {ifelse(x == "ICE PELLETS",                             "HAIL", x)}

sHEAT <- function(x) {ifelse(x == "PROLONG WARMTH"  | x == "UNSEASONABLY HOT"   |
                             x == "ABNORMAL WARMTH" | x == "UNSEASONABLY WARM"  |   
                             x == "VERY WARM"       | x == "UNUSUAL WARMTH"     |
                             x == "HOT SPELL"       | x == "UNUSUALLY WARM"     |
                             x == "HOT WEATHER",                             "HEAT", x)}
  
sHSRF <- function(x) {ifelse(x == "HEAVY SURF"      | x == "HIGH WAVES"         |
        x == "HIGH SWELLS" | x == "HAZARDOUS SURF"  | x == "WIND AND WAVE"      | 
        x == "HEAVY SWELLS"| x == "ROUGH SURF"      | x == "ROGUE WAVE"         |
        x == "HURRICANE-GENERATED SWELLS"           | x == "HEAVY SURF AND WIND",     
                                                                        "HIGH SURF", x)}

sHURR <- function(x) {ifelse(x == "TYPHOON"   | x == "HURRICANE/TYPHOON"        | 
                             x == "HURRICANE" | x == "REMNANTS OF FLOYD", 
                                                              "HURRICANE (TYPHOON)", x)}

sHVRN <- function(x) {ifelse(x == "RAIN"            | x == "UNSEASONABLY WET"   |
 x == "EXCESSIVE RAIN"     | x == "RAIN (HEAVY)"    | x == "UNSEASONABLE RAIN"  |  
 x == "EXCESSIVE RAINFALL" | x == "HVY RAIN"        | x == "ABNORMALLY WET"     |
 x == "TORRENTIAL RAINFALL"| x == "HEAVY SHOWER"    | x == "EXTREMELY WET"      |  
 x == "TORRENTIAL RAIN"    | x == "HEAVY SHOWERS"   | x == "DOWNBURST"          |
 x == "PROLONGED RAIN"     | x == "RECORD RAINFALL" | x == "RAINSTORM"          |    
 x == "RAIN DAMAGE"        | x == "RECORD/EXCESSIVE RAINFALL",         "HEAVY RAIN", x)}
 
sHVSN <- function(x) {ifelse(x == "RECORD SNOW"     | x == "SNOW"               | 
 x == "HEAVY WET SNOW"     | x == "RECORD SNOW/COLD"| x == "EXCESSIVE SNOW"     | 
 x == "NEAR RECORD SNOW"   | x == "RECORD SNOWFALL" | x == "SNOWFALL RECORD"    |                               
 x =="ACCUMULATED SNOWFALL"| x == "RECORD WINTER SNOW",                "HEAVY SNOW", x)}                     
                                                                  
sLESN <- function(x) {ifelse(x == "HEAVY LAKE SNOW"| x == "LAKE EFFECT SNOW", 
                                                                 "LAKE-EFFECT SNOW", x)}

sLGTN <- function(x) {ifelse(x == "LIGNTNING",                          "LIGHTNING", x)}

sSMKE <- function(x) {ifelse(x == "SMOKE",                            "DENSE SMOKE", x)}

sSRGE <- function(x) {ifelse(x == "STORM SURGE"  | x == "ASTRONOMICAL HIGH TIDE"|  
          x == "HIGH SEAS" | x == "COASTAL SURGE"| x == "BLOW-OUT TIDES"        |                                 
          x == "ROUGH SEAS"| x == "COASTAL STORM"| x == "BLOW-OUT TIDE"         |      
          x == "HEAVY SEAS"| x == "COASTALSTORM" | x == "HIGH TIDES",                              
                                                                 "STORM SURGE/TIDE", x)}
                                 
sSWND <- function(x) {ifelse(x == "WIND"| x == "GUSTY WIND"|
                             x == "WND" | x == "WIND GUSTS",          "STRONG WIND", x)}
                                                                 
sTHNW <- function(x) {ifelse(x == "THUNDERSTORM"| x == "THUNDERSTORM DAMAGE TO" | 
        x == "GUSTNADO"    | x == "SEVERE THUNDERSTORM"| x == "THUNDERSTORMW 50"|
        x == "GUSTNADO AND"| x == "THUNDERSTORM DAMAGE"| x == "WIND DAMAGE",
                                                                "THUNDERSTORM WIND", x)}

sVLCA <- function(x) {ifelse(x == "VOLCANIC ERUPTION",               "VOLCANIC ASH", x)}

sWSTM <- function(x) {ifelse(x == "THUNDERSNOW"| x == "SNOWSTORM"|
                             x == "SNOW AND WIND",                   "WINTER STORM", x)}
                              
sWWTH <- function(x) {ifelse(x == "WINTRY MIX" | x == "HEAVY PRECIPITATION"     |
        x == "SNOW AND ICE"| x == "WINTER MIX" | x == "HEAVY PRECIPATATION"     |
        x == "SNOW/ICE"    | x == "WINTERY MIX"| x == "MIXED PRECIPITATION"     |   
        x == "SNOW/ ICE"   | x == "HEAVY MIX"  | x == "EXCESSIVE PRECIPITATION" |   
        x == "ICE/SNOW"    | x == "SNOW/RAIN"  | x == "MIXED PRECIP"            |                       
        x == "RAIN/SNOW",                                          "WINTER WEATHER", x)}
                                                              
sXCLD <- function(x) {ifelse(x == "RECORD COLD"| x == "RECORD COLD/FROST"|
  x == "SEVERE COLD"       | x == "RECORD COOL"| x == "LOW TEMPERATURE RECORD"|   
  x == "SNOW/ BITTER COLD" | x == "RECORD LOW" | x == "EXCESSIVE COLD"|
  x == "BITTER WIND CHILL" | x == "BITTER WIND CHILL TEMPERATURES",    
                                                          "EXTREME COLD/WIND CHILL", x)}

sXHET <- function(x) {ifelse(    x == "RECORD WARMTH"     | x == "RECORD HIGH"  |
x == "RECORD HIGH TEMPERATURE" | x == "RECORD WARM"       | x == "RECORD HEAT"  |                               
x == "RECORD HIGH TEMPERATURES"| x == "RECORD WARM TEMPS."| x == "EXTREME HEAT" |                                       
x == "HIGH TEMPERATURE RECORD" | x == "UNUSUAL/RECORD WARMTH",                               
                                                                   "EXCESSIVE HEAT", x)}

## "m" functions to replace partial match strings with event types
m1MCB <- function(x) {ifelse(str_detect(x, "MICR*OBURST"), "THUNDERSTORM WIND", x)}
mBLZD <- function(x) {ifelse(str_detect(x, "BLOWING|SQUALL"), "BLIZZARD", x)}
mDFLW <- function(x) {ifelse(str_detect(x, "SLIDE"), "DEBRIS FLOW", x)}
mDRGT <- function(x) {ifelse(str_detect(x, "DRY"), "DROUGHT", x)}
mECCW <- function(x) {ifelse(str_detect(x, ".*EXTREME.+(COLD|CHILL|WIND)"), "EXTREME COLD/WIND CHILL", x)}
mFIRE <- function(x) {ifelse(str_detect(x, "FIRE"), "WILDFIRE", x)}
mFLSH <- function(x) {ifelse(str_detect(x, "FLASH"), "FLASH FLOOD", x)}
mFNNL <- function(x) {ifelse(str_detect(x, "FUNNEL"), "FUNNEL CLOUD", x)}
mFRST <- function(x) {ifelse(str_detect(x, "FROST"), "FROST/FREEZE", x)}
mFRZE <- function(x) {ifelse(str_detect(x, "FREEZE"), "FROST/FREEZE", x)}
mHURR <- function(x) {ifelse(str_detect(x, "HURRICANE .+"), "HURRICANE (TYPHOON)", x)}
mSLET <- function(x) {ifelse(str_detect(x, "FREEZING"), "SLEET", x)}
mTNDO <- function(x) {ifelse(str_detect(x, "WALL CLOUD"), "TORNADO", x)}
mURBN <- function(x) {ifelse(str_detect(x, ".*(URBAN|UBN).+(SMALL|SML).*"), "FLOOD", x)}

 # The mEVENT function searches through a string for a valid event name, it returns the
 # event which is found first in a given string, if there is no match it returns the original string
mEVNT <- function(x) {
    match0 <- 100
    eventm <- 0
    counter <- 0
    for (ee in eventnames) {
        match1 <- str_locate(x, ee)[1]
        counter <- counter + 1
        if (!is.na(match1)) {
            if (match1 < match0) {
                match0 <- match1
                eventm <- counter
            }    
        }    
    }
    if (match0 == 100) {
        return(x)
    } else {
        return(eventnames[eventm])
    }
}

 # The mJWO1 function returns the best match for a given string from the event list using 
 # the Jaro–Winkler distance (implemented using the tidystringdist package). The function
 # only returns a match if it is below a given score otherwise it returns the original string.
 # The threshold score is set at 0.1 which means a close match i.e. 1 character difference is required
mJW01 <- function(x) {jw_match(x, eventnames, 0.1)}
jw_match <- function(name, list, score) {
    output <- name
    matchlist <- tidy_comb(name, list)
    scorelist <- tidy_stringdist(matchlist)
    minlist = slice_min(scorelist, jw, n=1, with_ties = FALSE)
    if (minlist$jw <= score) {output <- minlist$V1}
    output
}

## "o" functions to replace partial match strings with 'OTHER/TBD' for later classification
oOTH0 <- function(x) {ifelse(str_detect(x, "OTHER"), "OTHER/TBD", x)}
oOTH1 <- function(x) {ifelse(str_detect(x, "SUMMARY"), "OTHER/TBD", x)}
oOTH2 <- function(x) {ifelse(str_detect(x, "(MONTH|YEAR|SEASON)"), "OTHER/TBD", x)}
oOTH3 <- function(x) {ifelse(str_detect(x, "(REC|TEMP).*(TEMP|REC)"), "OTHER/TBD", x)}
oOTH4 <- function(x) {ifelse(str_detect(x, "GRADIENT"), "OTHER/TBD", x)}
oOTH5 <- function(x) {ifelse(str_detect(x, "(EROSION|EROSIN)"), "OTHER/TBD", x)}
oOTH6 <- function(x) {ifelse(str_detect(x, "SNOW"), "OTHER/TBD", x)}
oOTH7 <- function(x) {ifelse(str_detect(x, "WIND"), "OTHER/TBD", x)}
oOTH8 <- function(x) {"OTHER/TBD"}

################################################################################

To understand the impact of these rules two dataframes are used:

  • events1 tracks the list of unique entries in the EVTYPES field to show which have been successfully transformed and which have not.
  • rules1 tracks the impact of each rule by recording how many entries it has successfully transformed.

A routine called applyrules() applies each of the string transformation rules above in turn by calling stringtransform() and records the results in events1 and rules1. The rules are implemented in order and if a rule has successfully created a valid event type for a given event string then no further rules are applied.

# get list of unique strings entered into the EVTYPE datafield
e0 <- unique(data_0$EVTYPE)

# create a dataframe to track string transformation (will become events1)
events0 <- data.frame(orig = e0,                        #list of unique entry strings 
                      count = rep(0,length(e0)),        #how many time entry appears
                      test = e0,                        #each rule applied to this string
                      transform = rep("-",length(e0)),  #event type string for this string
                      match = rep(0,length(e0)),        #flag whether event type has been found
                      rule = rep(0,length(e0)),         #record rule used to create event type
                      mcount = rep(0,length(e0)))       #track how many records have been matched
events0$count <- sapply(events0$orig, function(x) {sum(data_0$EVTYPE == x)})  #initialise count
events0 <- events0[order(-events0$count),]              #rank strings by most common to least

# This function applies the transformation rules and updates the events1 and rules1 dataframes
applyrules <- function() {
    
    events1 <<- events0
    rvec <- c(    nNAME, gUPPR, gSQSH, gCSTL, gFLDD, gTHUN, gTIDL, gTSTM, gWNDS,  
                  sCOLD, sDDVL, sDFLW, sDFOG, sDRGT, sDUST, sFFLD, sFFOG, sFLDD, 
                  sFRFR, sHAIL, sHEAT, sHSRF, sHURR, sHVRN, sHVSN, sLESN, sLGTN,
                  sSMKE, sSRGE, sSWND, sTHNW, sVLCA, sWSTM, sWWTH, sXCLD, sXHET,
                  m1MCB, mBLZD, mDFLW, mDRGT, mECCW, mFIRE, mFLSH, mFNNL, mFRST, 
                  mFRZE, mHURR, mSLET, mTNDO, mURBN, mEVNT, mJW01,
                  oOTH0, oOTH1, oOTH2, oOTH3, oOTH4, oOTH5, oOTH6, oOTH7, oOTH8)
    
    rnames <- c("nNAME", "gUPPR", "gSQSH", "gCSTL", "gFLDD", "gTHUN", "gTIDL", "gTSTM", "gWNDS", 
                "sCOLD", "sDDVL", "sDFLW", "sDFOG", "sDRGT", "sDUST", "sFFLD", "sFFOG", "sFLDD",
                "sFRFR", "sHAIL", "sHEAT", "sHSRF", "sHURR", "sHVRN", "sHVSN", "sLESN", "sLGTN",
                "sSMKE", "sSRGE", "sSWND", "sTHNW", "sVLCA", "sWSTM", "sWWTH", "sXCLD", "sXHET",
                "m1MCB", "mBLZD", "mDFLW", "mDRGT", "mECCW", "mFIRE", "mFLSH", "mFNNL", "mFRST",
                "mFRZE", "mHURR", "mSLET", "mTNDO", "mURBN", "mEVNT", "mJW01",
                "oOTH0", "oOTH1", "oOTH2", "oOTH3", "oOTH4", "oOTH5", "oOTH6", "oOTH7", "oOTH8")
        
    rules1 <- data.frame("rule_name" = rnames, 
                         "impacted" = rep(0, length(rnames)), 
                         "pct_coverage" = rep(0, length(rnames)),
                         "covered" = rep(0, length(rnames)), 
                         "not_covered" = rep(0, length(rnames)))
    rcounter = 0
    
    # loop over every rule in the list - note this is done in order of rvec
    for (rr in rvec) {
        
        # set a counter and track the name of the rule we are applying 
        rcounter <- rcounter + 1
        rname <- rnames[rcounter]
        
        # populate the events1 dataframe using the stringtransform function
        events1$test <<- sapply(events1$test, rr)
        tempevents1_transform <- events1$transform
        events1$transform <<- mapply(stringtransform, tempevents1_transform, events1$test, events1$rule, rname,1)
        events1$rule <<- mapply(stringtransform, tempevents1_transform, events1$test, events1$rule, rname, 2)
        events1$match <<- sapply(events1$transform, function(x) {ifelse(x %in% eventnames, 1, 0)})
        events1$mcount <<- events1$match * events1$count
        
        # populate the rule1 dataframe
        rules1[rcounter, "pct_coverage"] <- as.numeric(round(sum(events1$mcount)/sum(events1$count), 6))
        rules1[rcounter, "covered"] <- as.numeric(sum(events1$mcount))
        rules1[rcounter, "not_covered"] <- length(data_0$EVTYPE) - as.numeric(sum(events1$mcount))        
        prior = 0
        if(rcounter != 1) {prior <- as.numeric(rules1[rcounter-1, "covered"])}
        rules1[rcounter, "impacted"] <- sum(events1$mcount) - prior
    
    }
    return(rules1)
}

# This function performs the string transformation and records the result(1) and rule used(2)
# The function compares the existing transformed value with a test value for the latest
# string transformation to see whether it should be overwritten.
stringtransform <- function(trans_str, test_str, rule_trans, rule_test, answer) {
    outputs <- character(2)
    if (trans_str %in% eventnames) {        #if current value is valid then
        outputs[1] <- trans_str             #keep string and rule as is
        outputs[2] <- rule_trans
    } 
    else if (test_str %in% eventnames) {    #elseif new value is valid then
        outputs[1] <- test_str              #update string and rule 
        outputs[2] <- rule_test
    }
    else {                                  #else no valid string yet
        outputs[1] <- "-"
        outputs[2] <- "-"
    }
    output1 <- outputs[answer]
    output1
}

# run the 'applyrules' routine  
rules1 <- applyrules()

The events1 dataframe shows how each unique events string is transformed, it can also be used to see any strings left to be categorised/transformed.

head(events1, 10)
##                  orig  count      test         transform match  rule mcount
## 3                HAIL 288661 OTHER/TBD              HAIL     1 nNAME 288661
## 2           TSTM WIND 219940 OTHER/TBD THUNDERSTORM WIND     1 gTSTM 219940
## 16  THUNDERSTORM WIND  82563 OTHER/TBD THUNDERSTORM WIND     1 nNAME  82563
## 1             TORNADO  60652 OTHER/TBD           TORNADO     1 nNAME  60652
## 20        FLASH FLOOD  54277 OTHER/TBD       FLASH FLOOD     1 nNAME  54277
## 36              FLOOD  25326 OTHER/TBD             FLOOD     1 nNAME  25326
## 10 THUNDERSTORM WINDS  20843 OTHER/TBD THUNDERSTORM WIND     1 gWNDS  20843
## 46          HIGH WIND  20212 OTHER/TBD         HIGH WIND     1 nNAME  20212
## 15          LIGHTNING  15754 OTHER/TBD         LIGHTNING     1 nNAME  15754
## 53         HEAVY SNOW  15708 OTHER/TBD        HEAVY SNOW     1 nNAME  15708
events1 %>% filter(rule == "oOTH8") %>% arrange(desc(count))
##                          orig count      test transform match  rule mcount
## 1        NORMAL PRECIPITATION     3 OTHER/TBD OTHER/TBD     1 oOTH8      3
## 2               MARINE MISHAP     2 OTHER/TBD OTHER/TBD     1 oOTH8      2
## 3                   ICE FLOES     2 OTHER/TBD OTHER/TBD     1 oOTH8      2
## 4  BELOW NORMAL PRECIPITATION     2 OTHER/TBD OTHER/TBD     1 oOTH8      2
## 5                        NONE     2 OTHER/TBD OTHER/TBD     1 oOTH8      2
## 6           RED FLAG CRITERIA     2 OTHER/TBD OTHER/TBD     1 oOTH8      2
## 7           SEVERE TURBULENCE     1 OTHER/TBD OTHER/TBD     1 oOTH8      1
## 8               APACHE COUNTY     1 OTHER/TBD OTHER/TBD     1 oOTH8      1
## 9                        HIGH     1 OTHER/TBD OTHER/TBD     1 oOTH8      1
## 10               COOL AND WET     1 OTHER/TBD OTHER/TBD     1 oOTH8      1
## 11    COLD AND WET CONDITIONS     1 OTHER/TBD OTHER/TBD     1 oOTH8      1
## 12          EXCESSIVE WETNESS     1 OTHER/TBD OTHER/TBD     1 oOTH8      1
## 13                  SOUTHEAST     1 OTHER/TBD OTHER/TBD     1 oOTH8      1
## 14                WET WEATHER     1 OTHER/TBD OTHER/TBD     1 oOTH8      1
## 15                  EXCESSIVE     1 OTHER/TBD OTHER/TBD     1 oOTH8      1
## 16                          ?     1 OTHER/TBD OTHER/TBD     1 oOTH8      1
## 17                HOT PATTERN     1 OTHER/TBD OTHER/TBD     1 oOTH8      1
## 18               MILD PATTERN     1 OTHER/TBD OTHER/TBD     1 oOTH8      1
## 19            Marine Accident     1 OTHER/TBD OTHER/TBD     1 oOTH8      1
## 20        Metro Storm, May 26     1 OTHER/TBD OTHER/TBD     1 oOTH8      1
## 21          No Severe Weather     1 OTHER/TBD OTHER/TBD     1 oOTH8      1
## 22                 EARLY RAIN     1 OTHER/TBD OTHER/TBD     1 oOTH8      1
## 23      HYPERTHERMIA/EXPOSURE     1 OTHER/TBD OTHER/TBD     1 oOTH8      1
## 24                 COOL SPELL     1 OTHER/TBD OTHER/TBD     1 oOTH8      1
## 25               WARM WEATHER     1 OTHER/TBD OTHER/TBD     1 oOTH8      1
## 26            NORTHERN LIGHTS     1 OTHER/TBD OTHER/TBD     1 oOTH8      1
## 27                   DROWNING     1 OTHER/TBD OTHER/TBD     1 oOTH8      1

The rules1 dataframe can be reviewed to see how each rule impacted the transformation.

rules1
##    rule_name impacted pct_coverage covered not_covered
## 1      nNAME   635275     0.704064  635275      267022
## 2      gUPPR       76     0.704148  635351      266946
## 3      gSQSH        4     0.704153  635355      266942
## 4      gCSTL        0     0.704153  635355      266942
## 5      gFLDD        0     0.704153  635355      266942
## 6      gTHUN        6     0.704159  635361      266936
## 7      gTIDL        1     0.704161  635362      266935
## 8      gTSTM   226121     0.954767  861483       40814
## 9      gWNDS    22608     0.979823  884091       18206
## 10     sCOLD      203     0.980048  884294       18003
## 11     sDDVL        5     0.980053  884299       17998
## 12     sDFLW        2     0.980055  884301       17996
## 13     sDFOG      539     0.980653  884840       17457
## 14     sDRGT        2     0.980655  884842       17455
## 15     sDUST        4     0.980659  884846       17451
## 16     sFFLD        4     0.980664  884850       17447
## 17     sFFOG       47     0.980716  884897       17400
## 18     sFLDD       14     0.980731  884911       17386
## 19     sFRFR      253     0.981012  885164       17133
## 20     sHAIL        1     0.981013  885165       17132
## 21     sHEAT      162     0.981192  885327       16970
## 22     sHSRF      108     0.981312  885435       16862
## 23     sHURR      275     0.981617  885710       16587
## 24     sHVRN       77     0.981702  885787       16510
## 25     sHVSN      667     0.982441  886454       15843
## 26     sLESN       48     0.982495  886502       15795
## 27     sLGTN        1     0.982496  886503       15794
## 28     sSMKE       11     0.982508  886514       15783
## 29     sSRGE      394     0.982945  886908       15389
## 30     sSWND      476     0.983472  887384       14913
## 31     sTHNW      128     0.983614  887512       14785
## 32     sVLCA        2     0.983616  887514       14783
## 33     sWSTM        3     0.983620  887517       14780
## 34     sWWTH      212     0.983855  887729       14568
## 35     sXCLD       88     0.983952  887817       14480
## 36     sXHET      276     0.984258  888093       14204
## 37     m1MCB      218     0.984500  888311       13986
## 38     mBLZD      125     0.984638  888436       13861
## 39     mDFLW      652     0.985361  889088       13209
## 40     mDRGT      110     0.985483  889198       13099
## 41     mECCW      893     0.986472  890091       12206
## 42     mFIRE     1479     0.988111  891570       10727
## 43     mFLSH     1397     0.989660  892967        9330
## 44     mFNNL      147     0.989823  893114        9183
## 45     mFRST        4     0.989827  893118        9179
## 46     mFRZE       23     0.989853  893141        9156
## 47     mHURR       23     0.989878  893164        9133
## 48     mSLET      364     0.990281  893528        8769
## 49     mTNDO       11     0.990294  893539        8758
## 50     mURBN     3458     0.994126  896997        5300
## 51     mEVNT     4584     0.999206  901581         716
## 52     mJW01       34     0.999244  901615         682
## 53     oOTH0       59     0.999310  901674         623
## 54     oOTH1       75     0.999393  901749         548
## 55     oOTH2       78     0.999479  901827         470
## 56     oOTH3       65     0.999551  901892         405
## 57     oOTH4       17     0.999570  901909         388
## 58     oOTH5        6     0.999577  901915         382
## 59     oOTH6      323     0.999935  902238          59
## 60     oOTH7       25     0.999962  902263          34
## 61     oOTH8       34     1.000000  902297           0

The rules applied cover 99.92% of the data before getting to the ‘other’ categorisations. This appears to be a very solid basis for moving forwards. There would ideally be a review of all rule applications by an expert familiar with the weather event classifications, but given the majority of the transformation comes from converting ‘TSTM’ to ‘THUNDERSTORM’ and ‘WINDS’ to ‘WIND’ it seems like a good basis to proceed on.

To allow easier charting of the event data it has been sorted into the following event groups:

Event Groups Weather Events
Dry/Heat Drought, Dust Devil, Dust Storm, Excessive Heat, Heat, Volcanic Ash, Wildfire
Flood/Rain Coastal Flood, Flash Flood, Flood, Heavy Rain, Lakeshore Flood
Marine/Lake Astronomical Low Tide, High Surf, Lake-Effect Snow, Marine Hail, Marine High Wind, High Surf, Lake-Effect Snow, Marine Hail, Marine High Wind, Marine Strong Wind, Marine Thunderstorm Wind, Rip Current, Seiche, Storm Surge/Tide, Tsunami, Waterspout
Wind High Wind, Hurricane (Typhoon), Strong Wind, Thunderstorm Wind, Tornado, Tropical Depression, Tropical Storm
Snow/Storm Avalanche, Blizzard, Hail, Heavy Snow, Ice Storm, Lightning, Sleet, Winter Storm, Winter Weather
Cold/Freeze Cold/Wind Chill, Freezing Fog, Frost/Freeze
Other Debris Flow, Dense Fog, Dense Smoke, Funnel Cloud, Other/TBD

The above grouping and a sensible ordering of the groups is applied via an events_grouping.csv file. The transformed events from applytransform() and the event grouping are added to the main data_1 dataset.

eventgrouping <- read.csv("event_grouping.csv", header = TRUE)
eventnamessorted <- eventgrouping$event
eventgrouping
##    INDEX                    event code       group chart_order
## 1      9                  DROUGHT    Z    Dry/Heat           1
## 2     10               DUST DEVIL    C    Dry/Heat           2
## 3     11               DUST STORM    Z    Dry/Heat           3
## 4     12           EXCESSIVE HEAT    Z    Dry/Heat           4
## 5     20                     HEAT    Z    Dry/Heat           5
## 6     44             VOLCANIC ASH    Z    Dry/Heat           6
## 7     46                 WILDFIRE    Z    Dry/Heat           7
## 8      4            COASTAL FLOOD    Z  Flood/Rain           8
## 9     14              FLASH FLOOD    C  Flood/Rain           9
## 10    15                    FLOOD    C  Flood/Rain          10
## 11    21               HEAVY RAIN    Z  Flood/Rain          11
## 12    28          LAKESHORE FLOOD    Z  Flood/Rain          12
## 13     1    ASTRONOMICAL LOW TIDE    Z Marine/Lake          13
## 14    23                HIGH SURF    Z Marine/Lake          14
## 15    27         LAKE-EFFECT SNOW    Z Marine/Lake          15
## 16    30              MARINE HAIL    M Marine/Lake          16
## 17    31         MARINE HIGH WIND    M Marine/Lake          17
## 18    32       MARINE STRONG WIND    M Marine/Lake          18
## 19    33 MARINE THUNDERSTORM WIND    M Marine/Lake          19
## 20    34              RIP CURRENT    Z Marine/Lake          20
## 21    35                   SEICHE    Z Marine/Lake          21
## 22    37         STORM SURGE/TIDE    Z Marine/Lake          22
## 23    43                  TSUNAMI    Z Marine/Lake          23
## 24    45               WATERSPOUT    M Marine/Lake          24
## 25    24                HIGH WIND    Z        Wind          25
## 26    25      HURRICANE (TYPHOON)    Z        Wind          26
## 27    38              STRONG WIND    Z        Wind          27
## 28    39        THUNDERSTORM WIND    C        Wind          28
## 29    40                  TORNADO    C        Wind          29
## 30    41      TROPICAL DEPRESSION    Z        Wind          30
## 31    42           TROPICAL STORM    Z        Wind          31
## 32     2                AVALANCHE    Z  Snow/Storm          32
## 33     3                 BLIZZARD    Z  Snow/Storm          33
## 34    19                     HAIL    C  Snow/Storm          34
## 35    22               HEAVY SNOW    Z  Snow/Storm          35
## 36    26                ICE STORM    Z  Snow/Storm          36
## 37    29                LIGHTNING    C  Snow/Storm          37
## 38    36                    SLEET    Z  Snow/Storm          38
## 39    47             WINTER STORM    Z  Snow/Storm          39
## 40    48           WINTER WEATHER    Z  Snow/Storm          40
## 41     5          COLD/WIND CHILL    Z Cold/Freeze          41
## 42    13  EXTREME COLD/WIND CHILL    Z Cold/Freeze          42
## 43    18             FREEZING FOG    Z Cold/Freeze          43
## 44    16             FROST/FREEZE    Z Cold/Freeze          44
## 45     6              DEBRIS FLOW    C       Other          45
## 46     7                DENSE FOG    Z       Other          46
## 47     8              DENSE SMOKE    Z       Other          47
## 48    17             FUNNEL CLOUD    C       Other          48
## 49    49                OTHER/TBD    O       Other          49
## 50    50                 NO EVENT    O       Other          50
data_1 <- mutate(data_1, 
                event = events1$transform[match(EVTYPE, events1$orig)],
                eventGrp = eventgrouping$group[match(event, eventgrouping$event)],
                rule = events1$rule[match(EVTYPE, events1$orig)]
)

The coverage of the transformed events, and how they were transformed by the different rules, can be seen by running the following ruletable() function (and two supporting functions):

# This function filters grouped data by different rule conditions
filter_grp <- function(data_grp, condition) {
    output <- data_grp %>% filter(str_detect(rule, condition)) %>% summarise(count = n())
    output
}

# This function performs an indexmatch allowing non-matches to be defined
index_match <- function(index_vec, match_DF, missing.val = NA) {
    index_vec2 <- match_DF[2][[1]][match(index_vec, match_DF[1][[1]])]
    output <- replace(index_vec2, is.na(index_vec2), missing.val)
    output
}

# This function generates a table of events (events2) showing how rule groupings
# have been applied.  It is written as a function so that it can be applied to other 
# variables later on in the analysis.
ruletable <- function(nameslist, variable) { # (eventnames, "events")
    # generate a dataframe with events as rows and rule types as columns
    nEvents <- length(nameslist)
    events2row <- data.frame(rowtype = rep("row", nEvents),
                         name = nameslist, 
                         orig = rep(0, nEvents), 
                         generic = rep(0, nEvents), 
                         specific = rep(0, nEvents), 
                         match = rep(0, nEvents), 
                         other = rep(0, nEvents))

    # group the main datatable by the variable (events) column
    data_grp1 <- group_by(data_1, across(variable))
    
    # for each family of rules count how many are applied to each event
    events2row$orig <- index_match(nameslist,  filter_grp(data_grp1, "^n"), 0)      
    events2row$generic <- index_match(nameslist,  filter_grp(data_grp1, "^g"), 0)
    events2row$specific <- index_match(nameslist,  filter_grp(data_grp1, "^s"), 0)
    events2row$match <- index_match(nameslist,  filter_grp(data_grp1, "^m"), 0)
    events2row$other <- index_match(nameslist,  filter_grp(data_grp1, "^o"), 0)    
    
    # calculate the final row (NO EVENTS) by subtracting rule EVENTS from previous rule NO EVENTS
    events2row$orig[nEvents] <- nrow(data_0) - sum(events2row$orig[-nEvents])
    events2row$generic[nEvents] <- events2row$orig[nEvents] - sum(events2row$generic[-nEvents])
    events2row$specific[nEvents] <- events2row$generic[nEvents] - sum(events2row$specific[-nEvents])
    events2row$match[nEvents] <- events2row$specific[nEvents] - sum(events2row$match[-nEvents])
    events2row$other[nEvents] <- events2row$match[nEvents] - sum(events2row$other[-nEvents])
    
    # create total rows showing number covered (ie with valid event name), total, and percent covered
    events2total <- events2row[1:3,]
    events2total[["rowtype"]] <- rep("total", 3)
    events2total[["name"]] <- c("covered", "total", "pct_covered")
    
    # calculate the total rows
    colNames <- names(events2row)
    for (col1 in colNames[-(1:2)]) {
        if (col1 == colNames[3]) {rowtotal = 0}
        events2total[1,col1] <- sum(events2row[-nEvents, col1]) + rowtotal
        rowtotal <- events2total[1,col1]
        events2total[2,col1] <- events2total[1,col1] + sum(events2row[nEvents, col1])
        events2total[3,col1] <- round(100 * events2total[1,col1] / events2total[2,col1], 0)
    }
    
    # join the row and totals together
    events2 <- rbind(events2row, events2total)
    events2 <- cbind(events2, pct_gen = rep(0, nEvents + 3))
    
    # add a column to show the % generated from rules
    for (row1 in c(1:nEvents, nEvents+3)) {
        rowtotal <- sum(events2[row1, 3:7])
        events2$pct_gen[row1] <- round(100 * (1 - (sum(events2$orig[row1]/rowtotal))), 1)
    }
    
    return(events2)

}

events2 <- ruletable(eventnames, "event")
events2[-1]
##                        name   orig generic specific  match  other pct_gen
## 1     ASTRONOMICAL LOW TIDE    174       0        0      0      0     0.0
## 2                 AVALANCHE    386       0        0      1      0     0.3
## 3                  BLIZZARD   2719       0        0    135      0     4.7
## 4             COASTAL FLOOD    650       8        0    222      0    26.1
## 5           COLD/WIND CHILL    539       0      203      0      0    27.4
## 6               DEBRIS FLOW      0       0        2    652      0   100.0
## 7                 DENSE FOG   1293       0      539      3      0    29.5
## 8               DENSE SMOKE     10       0       11      0      0    52.4
## 9                   DROUGHT   2488       0        2    130      0     5.0
## 10               DUST DEVIL    141       8        5      2      0     9.6
## 11               DUST STORM    427       0        4      2      0     1.4
## 12           EXCESSIVE HEAT   1678       0      276      4      0    14.3
## 13  EXTREME COLD/WIND CHILL   1002       0       88    893      0    49.5
## 14              FLASH FLOOD  54277       1        4   1397      0     2.5
## 15                    FLOOD  25326       1       14   4211      0    14.3
## 16             FROST/FREEZE   1342       1      253     27      0    17.3
## 17             FUNNEL CLOUD   6839       5        0    147      0     2.2
## 18             FREEZING FOG     45       1       47      0      0    51.6
## 19                     HAIL 288661       0        1    180      0     0.1
## 20                     HEAT    767       0      162     82      0    24.1
## 21               HEAVY RAIN  11723      19       77     62      0     1.3
## 22               HEAVY SNOW  15708       0      667     32      0     4.3
## 23                HIGH SURF    725       9      108    234      0    32.6
## 24                HIGH WIND  20212    1536        0     58      0     7.3
## 25      HURRICANE (TYPHOON)      0       0      275     23      0   100.0
## 26                ICE STORM   2006       0        0     20      0     1.0
## 27         LAKE-EFFECT SNOW    636       0       48      0      0     7.0
## 28          LAKESHORE FLOOD     23       0        0      0      0     0.0
## 29                LIGHTNING  15754       1        1     13      0     0.1
## 30              MARINE HAIL    442       0        0      0      0     0.0
## 31         MARINE HIGH WIND    135       0        0      0      0     0.0
## 32       MARINE STRONG WIND     48       0        0      0      0     0.0
## 33 MARINE THUNDERSTORM WIND   5812    6175        0      0      0    51.5
## 34              RIP CURRENT    470       0        0    307      0    39.5
## 35                   SEICHE     21       0        0      0      0     0.0
## 36                    SLEET     59       0        0    400      0    87.1
## 37         STORM SURGE/TIDE    148       0      394      0      0    72.7
## 38              STRONG WIND   3566     207      476      3      0    16.1
## 39        THUNDERSTORM WIND  82563  240823      128   1566      0    74.6
## 40                  TORNADO  60652       0        0     47      0     0.1
## 41      TROPICAL DEPRESSION     60       0        0      0      0     0.0
## 42           TROPICAL STORM    690       0        0      7      0     1.0
## 43                  TSUNAMI     20       0        0      0      0     0.0
## 44             VOLCANIC ASH     22       1        2      4      0    24.1
## 45               WATERSPOUT   3796       1        0     63      0     1.7
## 46                 WILDFIRE   2761       0        0   1479      0    34.9
## 47             WINTER STORM  11433       0        3      6      0     0.1
## 48           WINTER WEATHER   7026      19      212   1110      0    16.0
## 49                OTHER/TBD      0       0        0      0    682   100.0
## 50                 NO EVENT 267022   18206    14204    682      0    11.0
## 51                  covered 635275  884091   888093 901615 902297     0.0
## 52                    total 902297  902297   902297 902297 902297     0.0
## 53              pct_covered     70      98       98    100    100    85.0

Some events such as ‘Avalanche’, ‘Dust Storm’ and ‘Ice Storm’ are mostly keyed in correctly and would have been picked up without any data cleansing. However events such as ‘Debris Flow’ and ‘Hurricane (Typhoon)’ have not been keyed correctly and are generated from the data cleansing routine. An extreme example of this is ‘Thunderstorm Wind’ event which is the largest category and has 74.6% of its values generated from the data cleansing. This shows how important the data cleansing is to get a more accurate dataset. And also shows that the data verification approach will need validation from other external advisers. The event groupings show much larger groupings for ‘Snow/Storm’ and ‘Wind’ due to the pre-1992 database only having 3 weather event types. If these event groupings are to be used to describe analysis in further anlaysis they should be reviewed by experts for suitablility.

table(data_1$eventGrp)
## 
## Cold/Freeze    Dry/Heat  Flood/Rain Marine/Lake       Other  Snow/Storm 
##        4441       10447       98015       19766       10183      346553 
##        Wind 
##      412892


Processing - years

Each entry in the SDD has a corresponding date. This is used in the analysis to understand how weather has changed over time, as well as see how the quality of the data has changed over time. The year variable is created by taking the year element of the BGN_DATE field using lubridate functions. Time of day is not analysed, so the time element of the BGN_DATE timestamp is ignored as is the BGN_TIME field. A grouping is applied for the years variable, is in 5 year blocks counting backwards from final year of data (2011). There is a relatively small amount of data per year before 1992 so pre-1992 is counted as a single group. By grouping to five year blocks offset by 1 it aligns with the current 48 event classifications being brought in in 1996 so this provides 3 x five year groupings with full data.

data_1 <- mutate(data_1, 
                 year = year(mdy(str_sub(BGN_DATE, end = -8))),
                 yearGrp = ifelse(year < 1992, "pre 1992", 
                                    paste0(floor((year+3) / 5) * 5 -3,"-", floor((year+3) / 5) * 5+1))      
                 )        
yearnames <- c(as.character(seq(1950, 2011, 1)), "NO YEAR")

It is useful to see how the data coverage has changed over the years, and also which years have events which were transformed by the rules that were applied.

years2 <- ruletable(yearnames, "year")
years2
##    rowtype        name   orig generic specific  match  other pct_gen
## 1      row        1950    223       0        0      0      0     0.0
## 2      row        1951    269       0        0      0      0     0.0
## 3      row        1952    272       0        0      0      0     0.0
## 4      row        1953    492       0        0      0      0     0.0
## 5      row        1954    609       0        0      0      0     0.0
## 6      row        1955    992     421        0      0      0    29.8
## 7      row        1956    968     735        0      0      0    43.2
## 8      row        1957   1409     775        0      0      0    35.5
## 9      row        1958   1314     899        0      0      0    40.6
## 10     row        1959   1161     652        0      0      0    36.0
## 11     row        1960   1226     719        0      0      0    37.0
## 12     row        1961   1494     752        0      0      0    33.5
## 13     row        1962   1559     830        0      0      0    34.7
## 14     row        1963   1145     823        0      0      0    41.8
## 15     row        1964   1439     909        0      0      0    38.7
## 16     row        1965   1800    1055        0      0      0    37.0
## 17     row        1966   1338    1050        0      0      0    44.0
## 18     row        1967   1730     958        0      0      0    35.6
## 19     row        1968   1783    1529        0      0      0    46.2
## 20     row        1969   1416    1510        0      0      0    51.6
## 21     row        1970   1421    1794        0      0      0    55.8
## 22     row        1971   1927    1544        0      0      0    44.5
## 23     row        1972   1456     712        0      0      0    32.8
## 24     row        1973   2297    2166        0      0      0    48.5
## 25     row        1974   2783    2603        0      0      0    48.3
## 26     row        1975   2336    2639        0      0      0    53.0
## 27     row        1976   2026    1742        0      0      0    46.2
## 28     row        1977   2005    1723        0      0      0    46.2
## 29     row        1978   1899    1758        0      0      0    48.1
## 30     row        1979   2233    2046        0      0      0    47.8
## 31     row        1980   2965    3181        0      0      0    51.8
## 32     row        1981   2324    2193        0      0      0    48.5
## 33     row        1982   3562    3570        0      0      0    50.1
## 34     row        1983   3329    4993        0      0      0    60.0
## 35     row        1984   3769    3566        0      0      0    48.6
## 36     row        1985   4152    3827        0      0      0    48.0
## 37     row        1986   4361    4365        0      0      0    50.0
## 38     row        1987   3111    4256        0      0      0    57.8
## 39     row        1988   3310    3947        0      0      0    54.4
## 40     row        1989   4699    5711        0      0      0    54.9
## 41     row        1990   4882    6064        0      0      0    55.4
## 42     row        1991   6019    6503        0      0      0    51.9
## 43     row        1992   7091    6443        0      0      0    47.6
## 44     row        1993   7827    4142      194    440      4    37.9
## 45     row        1994  11684    7489      149   1297     12    43.4
## 46     row        1995  14617   11105      481   1726     41    47.7
## 47     row        1996  20879   10030      215    971    175    35.3
## 48     row        1997  17335    9887      410    967     81    39.6
## 49     row        1998  22608   13610      482   1373     55    40.7
## 50     row        1999  19221   10359      438   1185     86    38.6
## 51     row        2000  20399   12175      527   1293     77    40.8
## 52     row        2001  21382   11842      429   1220     89    38.8
## 53     row        2002  21791   12976      361   1103     62    40.0
## 54     row        2003  25774   13438       81    459      0    35.2
## 55     row        2004  25613   13225       75    450      0    34.9
## 56     row        2005  25043   13437      122    582      0    36.1
## 57     row        2006  29739   14138        5    152      0    32.5
## 58     row        2007  43217       0        5     67      0     0.2
## 59     row        2008  55606       0       23     34      0     0.1
## 60     row        2009  45771       0        1     45      0     0.1
## 61     row        2010  48093       0        1     67      0     0.1
## 62     row        2011  62080       0        3     91      0     0.2
## 63     row     NO YEAR 267022   18206    14204    682      0    11.0
## 64   total     covered 635275  884091   888093 901615 902297     0.0
## 65   total       total 902297  902297   902297 902297 902297     0.0
## 66   total pct_covered     70      98       98    100    100    85.0

From this table it can be seen how errors and recording of events has changed over time. This aligns with the data-entry history available on the NOAA website.

  • Pre-1992 information was hand keyed by SPC staff and there were only three event types (Thunderstorm Wind, Tornado, Hail).
  • From 1993-1995, the NWS Weather Offices sent data in directly to the database as Unformatted Text Files These data had many inconsistencies in the spelling of event types.
  • From 1996-1999, the event type field was a free-text field so there were many, many variations of event types.
  • In 2000 the NWS added a drop-down selector for Event Type on the data entry interface, which standardized the Event Type values.
  • In October 2006, the NWS supplied NCEI with comma separated (CSV) text files that NCEI would import into their own database for the Storm Data publication and Storm Events Database.

It can also be seen that the coverage across the year groupings is fairly even.

table(data_1$yearGrp)
## 
## 1992-1996 1997-2001 2002-2006 2007-2011  pre 1992 
##    107012    167530    198626    255104    174025


Processing - states

Each entry in the SDD has a corresponding state code in the STATE field. This is used in the analysis to understand how weather is different across different states, as well as how different states report. There are 72 state codes used in the SDD. To understand what the additional codes above 50 are they are mapped against three datasets: state codes, territory codes, and the most common (modal) office recorded in the SDD against each code.

# get the list of state codes from the SDD and count how many each occurs
statelist <- unique(data_1$STATE)
countlist <- data_1 %>% group_by(STATE) %>% summarise(count = n()) %>% ungroup()
statecount <- countlist$count[match(statelist, countlist$STATE)]

# This function returns the mode from a vector
statmode <- function(vec) {
    unq <- unique(vec)
    unq[which.max(tabulate(match(vec, unq)))]
}

# get the list of states, territories and office locations
statematchS <- index_match(statelist, data_states, "")
statematchT <- index_match(statelist, data_territories, "")
officelist <- data_1 %>% group_by(STATE) %>% summarise(mode = statmode(STATEOFFIC)) %>% ungroup()
statematchO <- index_match(statelist, officelist, "")

# combine for review
statelist <- data.frame(code = statelist, 
                        count = statecount, 
                        state = statematchS, 
                        territory = statematchT, 
                        office = str_sub(statematchO,1,20))
statelist
##    code count          state           territory               office
## 1    AL 22739        Alabama                         ALABAMA, Central
## 2    AZ  6156        Arizona                     ARIZONA, Central and
## 3    AR 27102       Arkansas                     ARKANSAS, Central an
## 4    CA 10780     California                     CALIFORNIA, South Ce
## 5    CO 20473       Colorado                     COLORADO, Central an
## 6    CT  3294    Connecticut                     CONNECTICUT, Souther
## 7    DE  1913       Delaware                                 DELAWARE
## 8    DC   437                                           DIST COLUMBIA
## 9    FL 22124        Florida                                         
## 10   GA 25259        Georgia                     GEORGIA, North and C
## 11   HI  2547         Hawaii                                   HAWAII
## 12   ID  4767          Idaho                         IDAHO, Southeast
## 13   IL 28488       Illinois                                         
## 14   IN 21506        Indiana                         INDIANA, Central
## 15   IA 31069           Iowa                            IOWA, Central
## 16   KS 53440         Kansas                                         
## 17   KY 22092       Kentucky                        KENTUCKY, Eastern
## 18   LA 17323      Louisiana                                         
## 19   ME  4524          Maine                             MAINE, South
## 20   MD  8185       Maryland                        MARYLAND, Central
## 21   MA  5652  Massachusetts                     MASSACHUSETTS, Centr
## 22   MI 17910       Michigan                                         
## 23   MN 23609      Minnesota                     MINNESOTA, Central a
## 24   MS 22192    Mississippi                     MISSISSIPPI, Central
## 25   MO 35648       Missouri                      MISSOURI, Southwest
## 26   MT 14695        Montana                           MONTANA, South
## 27   NE 30271       Nebraska                                         
## 28   NV  3139         Nevada                             NEVADA, West
## 29   NH  3022  New Hampshire                     NEW HAMPSHIRE, North
## 30   NJ  8075     New Jersey                     NEW JERSEY, South an
## 31   NM  7129     New Mexico                     NEW MEXICO, Central 
## 32   NY 21058       New York                                         
## 33   NC 25351 North Carolina                     NORTH CAROLINA, Sout
## 34   ND 14632   North Dakota                     NORTH DAKOTA, Centra
## 35   OH 24922           Ohio                                         
## 36   OK 46802       Oklahoma                                         
## 37   OR  4821         Oregon                     OREGON, Central and 
## 38   PA 22226   Pennsylvania                                         
## 39   RI   839   Rhode Island                             RHODE ISLAND
## 40   SC 17126 South Carolina                                         
## 41   SD 21727   South Dakota                       SOUTH DAKOTA, West
## 42   TN 21721      Tennessee                       TENNESSEE, Central
## 43   TX 83728          Texas                                         
## 44   UT  4135           Utah                     UTAH, West and Centr
## 45   VT  3871        Vermont                     VERMONT, North and C
## 46   VA 21189       Virginia                      VIRGINIA, Southwest
## 47   WA  3312     Washington                     WASHINGTON, Northeas
## 48   WV  9099  West Virginia                      WEST VIRGINIA, West
## 49   WI 19781      Wisconsin                                         
## 50   WY  7332        Wyoming                       WYOMING, Southeast
## 51   PR  3015                        Puerto Rico          PUERTO RICO
## 52   AK  4391         Alaska                         ALASKA, Southern
## 53   ST     1                                                        
## 54   AS   257                      America Samoa       AMERICAN SAMOA
## 55   GU   306                               Guam              PACIFIC
## 56   MH     1                   Marshall Islands                     
## 57   VI   338                U.S. Virgin Islands       VIRGIN ISLANDS
## 58   AM  1879                                    ATLANTIC, Caribbean 
## 59   LC   274                                           LAKE ST CLAIR
## 60   PH    28                                           HAWAII WATERS
## 61   GM  5337                                          GULF OF MEXICO
## 62   PZ    96                                           PACIFIC OCEAN
## 63   AN  3250                                         ATLANTIC, North
## 64   LH   654                                              LAKE HURON
## 65   LM  1347                                           LAKE MICHIGAN
## 66   LE  1526                                               LAKE ERIE
## 67   LS   262                                           LAKE SUPERIOR
## 68   SL     7                                       ST LAWRENCE RIVER
## 69   LO    70                                            LAKE ONTARIO
## 70   PM     1                                    PACIFIC ISLAND WATER
## 71   PK    23                                          GULF OF ALASKA
## 72   XX     2

This shows only 3 entries against unmatched state codes ST(1) and XX(2). The analysis above is used to create a full list of state/territory codes which include territories and locations identified by NWS offices (namely the great lakes and other marine locations).

This full set of state/territory codes is applied via a state_abbreviations_full.csv file. Two further levels of grouping are applied to the state/territory codes. Firstly the state/territory codes are mapped to state codes only plus inshore and offshore lake/sea groupings. Secondly the state codes are mapped to geographic groupings as used by the Agricultural Research Service (ARS), namely Pacific West Area, Plains Area, Midwest Area, Southeast Area, Northeast Area. https://en.wikipedia.org/wiki/List_of_regions_of_the_United_States. These groupings are applied via a state_grouping.csv.

data_states_full <- read.csv("state_abbreviations_full.csv", header = TRUE)
data_states_full
##    code                                  state statecode
## 1    AL                                Alabama        AL
## 2    AK                                 Alaska        AK
## 3    AZ                                Arizona        AZ
## 4    AR                               Arkansas        AR
## 5    CA                             California        CA
## 6    CO                               Colorado        CO
## 7    CT                            Connecticut        CT
## 8    DE                               Delaware        DE
## 9    DC                   District of Columbia        MA
## 10   FL                                Florida        FL
## 11   GA                                Georgia        GA
## 12   HI                                 Hawaii        HI
## 13   ID                                  Idaho        ID
## 14   IL                               Illinois        IL
## 15   IN                                Indiana        IN
## 16   IA                                   Iowa        IA
## 17   KS                                 Kansas        KS
## 18   KY                               Kentucky        KY
## 19   LA                              Louisiana        LA
## 20   ME                                  Maine        ME
## 21   MD                               Maryland        MD
## 22   MA                          Massachusetts        MA
## 23   MI                               Michigan        MI
## 24   MN                              Minnesota        MN
## 25   MS                            Mississippi        MS
## 26   MO                               Missouri        MO
## 27   MT                                Montana        MT
## 28   NE                               Nebraska        NE
## 29   NV                                 Nevada        NV
## 30   NH                          New Hampshire        NH
## 31   NJ                             New Jersey        NJ
## 32   NM                             New Mexico        NM
## 33   NY                               New York        NY
## 34   NC                         North Carolina        NC
## 35   ND                           North Dakota        ND
## 36   OH                                   Ohio        OH
## 37   OK                               Oklahoma        OK
## 38   OR                                 Oregon        OR
## 39   PA                           Pennsylvania        PA
## 40   RI                           Rhode Island        RI
## 41   SC                         South Carolina        SC
## 42   SD                           South Dakota        SD
## 43   TN                              Tennessee        TN
## 44   TX                                  Texas        TX
## 45   UT                                   Utah        UT
## 46   VT                                Vermont        VT
## 47   VA                               Virginia        VA
## 48   WA                             Washington        WA
## 49   WV                          West Virginia        WV
## 50   WI                              Wisconsin        WI
## 51   WY                                Wyoming        WY
## 52   AM      Atlantic - Caribbean and Tropical        ZO
## 53   AN                         Atlantic Ocean        ZO
## 54   AS                          America Samoa        ZO
## 55   GU                                   Guam        ZO
## 56   PK                         Gulf of Alaska        ZO
## 57   GM                         Gulf of Mexico        ZO
## 58   PH                       Hawaii Off Shore        ZO
## 59   LE                              Lake Erie        ZI
## 60   LH                             Lake Huron        ZI
## 61   LM                          Lake Michigan        ZI
## 62   LO                           Lake Ontario        ZI
## 63   LC                          Lake St Clair        ZI
## 64   LS                          Lake Superior        ZI
## 65   MP                Northern Marina Islands        ZO
## 66   PM                  Pacific Island Waters        ZO
## 67   PZ                          Pacific Ocean        ZO
## 68   PR                            Puerto Rico        ZO
## 69   VI                    U.S. Virgin Islands        ZO
## 70   UM            U.S. Minor Outlying Islands        ZO
## 71   MH                       Marshall Islands        ZO
## 72   FM                             Micronesia        ZO
## 73   PW                                  Palsu        ZO
## 74   NB                               Nebraska        ZO
## 75   CM                Northern Marina Islands        ZO
## 76   CZ                      Panama Canal Zone        ZO
## 77   PI                     Philippine Islands        ZO
## 78   SL                      St Lawrence River        ZI
## 79   TT Trust Territory of the Pacific Islands        ZO
stategrouping <- read.csv("state_grouping.csv", header = TRUE)
stategrouping
##    INDEX state_abr     state_name            region chart_order
## 1      2        AK         Alaska Pacific West Area           1
## 2     47        WA     Washington Pacific West Area           2
## 3     37        OR         Oregon Pacific West Area           3
## 4     12        ID          Idaho Pacific West Area           4
## 5      5        CA     California Pacific West Area           5
## 6     28        NV         Nevada Pacific West Area           6
## 7     44        UT           Utah Pacific West Area           7
## 8      3        AZ        Arizona Pacific West Area           8
## 9     11        HI         Hawaii Pacific West Area           9
## 10    26        MT        Montana       Plains Area          10
## 11    34        ND   North Dakota       Plains Area          11
## 12    50        WY        Wyoming       Plains Area          12
## 13    41        SD   South Dakota       Plains Area          13
## 14    27        NE       Nebraska       Plains Area          14
## 15     6        CO       Colorado       Plains Area          15
## 16    16        KS         Kansas       Plains Area          16
## 17    31        NM     New Mexico       Plains Area          17
## 18    43        TX          Texas       Plains Area          18
## 19    36        OK       Oklahoma       Plains Area          19
## 20    23        MN      Minnesota      Midwest Area          20
## 21    49        WI      Wisconsin      Midwest Area          21
## 22    15        IA           Iowa      Midwest Area          22
## 23    13        IL       Illinois      Midwest Area          23
## 24    25        MO       Missouri      Midwest Area          24
## 25    22        MI       Michigan      Midwest Area          25
## 26    14        IN        Indiana      Midwest Area          26
## 27    35        OH           Ohio      Midwest Area          27
## 28    17        KY       Kentucky      Midwest Area          28
## 29     4        AR       Arkansas    Southeast Area          29
## 30    18        LA      Louisiana    Southeast Area          30
## 31    42        TN      Tennessee    Southeast Area          31
## 32    24        MS    Mississippi    Southeast Area          32
## 33     1        AL        Alabama    Southeast Area          33
## 34     9        FL        Florida    Southeast Area          34
## 35    10        GA        Georgia    Southeast Area          35
## 36    40        SC South Carolina    Southeast Area          36
## 37    33        NC North Carolina    Southeast Area          37
## 38    46        VA       Virginia    Northeast Area          38
## 39    48        WV  West Virginia    Northeast Area          39
## 40    20        MD       Maryland    Northeast Area          40
## 41     8        DE       Delaware    Northeast Area          41
## 42    38        PA   Pennsylvania    Northeast Area          42
## 43    30        NJ     New Jersey    Northeast Area          43
## 44    32        NY       New York    Northeast Area          44
## 45     7        CT    Connecticut    Northeast Area          45
## 46    39        RI   Rhode Island    Northeast Area          46
## 47    21        MA  Massachusetts    Northeast Area          47
## 48    45        VT        Vermont    Northeast Area          48
## 49    29        NH  New Hampshire    Northeast Area          49
## 50    19        ME          Maine    Northeast Area          50
## 51    51        ZI        Inshore             Other          51
## 52    52        ZO       Offshore             Other          52
## 53    53        ZZ    Not Matched             Other          53
stateabrs <- stategrouping[order(stategrouping$state_name),]$state_abr
stateabrssorted <- stategrouping$state_abr
statenames <- sort(stategrouping$state_name)
statenamessorted <- stategrouping$state_name
    
# create a mapping table for the state hierarchy
sl_L3code <- index_match(statelist[,1], data.frame(data_states_full[,1],data_states_full[,1]), "XX")
sl_L3name <- index_match(statelist[,1], data_states_full[,c(1,2)], "Unknown")
sl_L2code <- index_match(statelist[,1], data_states_full[,c(1,3)], "ZZ")
sl_L2name <- index_match(sl_L2code, stategrouping[,c(2,3)], "")
sl_L1name <- index_match(sl_L2code, stategrouping[,c(2,4)], "")
sl_order <- index_match(sl_L2code, stategrouping[,c(2,5)], "")

statelist_all <- data.frame(D1code = statelist[,1],
                            D1count = statelist[,2],
                            L3code = sl_L3code,
                            L3name = sl_L3name,
                            L2code = sl_L2code,
                            L2name = sl_L2name,
                            L1name = sl_L1name,
                            c_order = as.integer(sl_order))

statelist_all <- arrange(statelist_all, c_order, L1name, L2name, L3name, D1code)

# show statelist in one table by reducing columns
statelist_display <- statelist_all[,c("D1code", "D1count", "L3name", "L2name", "L1name")]
statelist_display[,"L3name"] <- str_sub(statelist_display[,"L3name"],1,20)
statelist_display[,"L1name"] <- str_replace(statelist_display[,"L1name"]," Area","")
statelist_display
##    D1code D1count               L3name         L2name       L1name
## 1      AK    4391               Alaska         Alaska Pacific West
## 2      WA    3312           Washington     Washington Pacific West
## 3      OR    4821               Oregon         Oregon Pacific West
## 4      ID    4767                Idaho          Idaho Pacific West
## 5      CA   10780           California     California Pacific West
## 6      NV    3139               Nevada         Nevada Pacific West
## 7      UT    4135                 Utah           Utah Pacific West
## 8      AZ    6156              Arizona        Arizona Pacific West
## 9      HI    2547               Hawaii         Hawaii Pacific West
## 10     MT   14695              Montana        Montana       Plains
## 11     ND   14632         North Dakota   North Dakota       Plains
## 12     WY    7332              Wyoming        Wyoming       Plains
## 13     SD   21727         South Dakota   South Dakota       Plains
## 14     NE   30271             Nebraska       Nebraska       Plains
## 15     CO   20473             Colorado       Colorado       Plains
## 16     KS   53440               Kansas         Kansas       Plains
## 17     NM    7129           New Mexico     New Mexico       Plains
## 18     TX   83728                Texas          Texas       Plains
## 19     OK   46802             Oklahoma       Oklahoma       Plains
## 20     MN   23609            Minnesota      Minnesota      Midwest
## 21     WI   19781            Wisconsin      Wisconsin      Midwest
## 22     IA   31069                 Iowa           Iowa      Midwest
## 23     IL   28488             Illinois       Illinois      Midwest
## 24     MO   35648             Missouri       Missouri      Midwest
## 25     MI   17910             Michigan       Michigan      Midwest
## 26     IN   21506              Indiana        Indiana      Midwest
## 27     OH   24922                 Ohio           Ohio      Midwest
## 28     KY   22092             Kentucky       Kentucky      Midwest
## 29     AR   27102             Arkansas       Arkansas    Southeast
## 30     LA   17323            Louisiana      Louisiana    Southeast
## 31     TN   21721            Tennessee      Tennessee    Southeast
## 32     MS   22192          Mississippi    Mississippi    Southeast
## 33     AL   22739              Alabama        Alabama    Southeast
## 34     FL   22124              Florida        Florida    Southeast
## 35     GA   25259              Georgia        Georgia    Southeast
## 36     SC   17126       South Carolina South Carolina    Southeast
## 37     NC   25351       North Carolina North Carolina    Southeast
## 38     VA   21189             Virginia       Virginia    Northeast
## 39     WV    9099        West Virginia  West Virginia    Northeast
## 40     MD    8185             Maryland       Maryland    Northeast
## 41     DE    1913             Delaware       Delaware    Northeast
## 42     PA   22226         Pennsylvania   Pennsylvania    Northeast
## 43     NJ    8075           New Jersey     New Jersey    Northeast
## 44     NY   21058             New York       New York    Northeast
## 45     CT    3294          Connecticut    Connecticut    Northeast
## 46     RI     839         Rhode Island   Rhode Island    Northeast
## 47     DC     437 District of Columbia  Massachusetts    Northeast
## 48     MA    5652        Massachusetts  Massachusetts    Northeast
## 49     VT    3871              Vermont        Vermont    Northeast
## 50     NH    3022        New Hampshire  New Hampshire    Northeast
## 51     ME    4524                Maine          Maine    Northeast
## 52     LE    1526            Lake Erie        Inshore        Other
## 53     LH     654           Lake Huron        Inshore        Other
## 54     LM    1347        Lake Michigan        Inshore        Other
## 55     LO      70         Lake Ontario        Inshore        Other
## 56     LC     274        Lake St Clair        Inshore        Other
## 57     LS     262        Lake Superior        Inshore        Other
## 58     SL       7    St Lawrence River        Inshore        Other
## 59     AS     257        America Samoa       Offshore        Other
## 60     AM    1879 Atlantic - Caribbean       Offshore        Other
## 61     AN    3250       Atlantic Ocean       Offshore        Other
## 62     GU     306                 Guam       Offshore        Other
## 63     PK      23       Gulf of Alaska       Offshore        Other
## 64     GM    5337       Gulf of Mexico       Offshore        Other
## 65     PH      28     Hawaii Off Shore       Offshore        Other
## 66     MH       1     Marshall Islands       Offshore        Other
## 67     PM       1 Pacific Island Water       Offshore        Other
## 68     PZ      96        Pacific Ocean       Offshore        Other
## 69     PR    3015          Puerto Rico       Offshore        Other
## 70     VI     338  U.S. Virgin Islands       Offshore        Other
## 71     ST       1              Unknown    Not Matched        Other
## 72     XX       2              Unknown    Not Matched        Other

This shows the mapping hierarchy for the state information. This is added to the main data_1 table, with state_L2name and state_L1name selected as the key fields for state and stateGrp variables for analysis.

data_1 <- mutate(data_1, 
                 state_L3code = index_match(STATE, statelist_all[, c("D1code","L3code")]),
                 state_L3name = index_match(STATE, statelist_all[, c("D1code","L3name")]),
                 state_L2code = index_match(STATE, statelist_all[, c("D1code","L2code")]),
                 state_L2name = index_match(STATE, statelist_all[, c("D1code","L2name")]),
                 state_L1name = index_match(STATE, statelist_all[, c("D1code","L1name")]),
                 state = state_L2name,
                 stateGrp = state_L1name)

It can be seen how the data coverage is spread across states and also which states have events which were transformed by the rules that were applied.

states2 <- ruletable(statenames, "state")
states2
##    rowtype           name   orig generic specific  match  other pct_gen
## 1      row        Alabama  15268    7261       23    187      0    32.9
## 2      row         Alaska   4004     145       67    124     51     8.8
## 3      row        Arizona   4306    1639       13    197      1    30.1
## 4      row       Arkansas  17936    9101       12     53      0    33.8
## 5      row     California   8653     530      483   1095     19    19.7
## 6      row       Colorado  17765    2041       35    632      0    13.2
## 7      row    Connecticut   2398     796       52     48      0    27.2
## 8      row       Delaware   1264     382      124    136      7    33.9
## 9      row        Florida  14563    6310      266    984      1    34.2
## 10     row        Georgia  15594    9316       60    284      5    38.3
## 11     row         Hawaii   2057      33       18    426     13    19.2
## 12     row          Idaho   3354    1123       69    220      1    29.6
## 13     row       Illinois  19099    9051       46    290      2    33.0
## 14     row        Indiana  13584    7688       29    203      2    36.8
## 15     row        Inshore   1615    2525        0      0      0    61.0
## 16     row           Iowa  22286    8154       34    591      4    28.3
## 17     row         Kansas  41367   11861       31    174      7    22.6
## 18     row       Kentucky  14249    7473       76    275     19    35.5
## 19     row      Louisiana  10526    6562       68    167      0    39.2
## 20     row          Maine   3283    1037       72     69     63    27.4
## 21     row       Maryland   5574    2273      149    183      6    31.9
## 22     row  Massachusetts   4491    1435       91     69      3    26.2
## 23     row       Michigan  10816    6888       86    120      0    39.6
## 24     row      Minnesota  16898    6602        6    102      1    28.4
## 25     row    Mississippi  14065    7968       27    131      1    36.6
## 26     row       Missouri  26053    9318       40    236      1    26.9
## 27     row        Montana  11955    2538        4    198      0    18.6
## 28     row       Nebraska  24089    6018       18    145      1    20.4
## 29     row         Nevada   2658     415       13     50      3    15.3
## 30     row  New Hampshire   2060     807       39     64     52    31.8
## 31     row     New Jersey   5578    1849      223    412     13    30.9
## 32     row     New Mexico   6200     827       14     75     13    13.0
## 33     row       New York  13087    7268      180    441     82    37.9
## 34     row North Carolina  17518    7257      218    355      3    30.9
## 35     row   North Dakota  11621    2939        5     67      0    20.6
## 36     row    Not Matched      2       1        0      0      0    33.3
## 37     row       Offshore   9798    3737       91    904      1    32.6
## 38     row           Ohio  14657   10017      102    101     45    41.2
## 39     row       Oklahoma  34008   12568       41    124     61    27.3
## 40     row         Oregon   4060     314       36    410      1    15.8
## 41     row   Pennsylvania  12806    8624      200    574     22    42.4
## 42     row   Rhode Island    637     163       32      7      0    24.1
## 43     row South Carolina  10983    5851       96    191      5    35.9
## 44     row   South Dakota  17125    4447       16    138      1    21.2
## 45     row      Tennessee  12648    8776       13    284      0    41.8
## 46     row          Texas  64104   18946       75    582     21    23.4
## 47     row           Utah   3316     628       11    180      0    19.8
## 48     row        Vermont   2587     942       78    157    107    33.2
## 49     row       Virginia  14878    5822      151    327     11    29.8
## 50     row     Washington   2863     249       25    173      2    13.6
## 51     row  West Virginia   6184    2539      163    185     28    32.0
## 52     row      Wisconsin  12542    6779      168    289      3    36.6
## 53     row        Wyoming 273265   25432    21443   8014   7332    18.5
## 54   total        covered 629032  876865   880854 894283 894965     0.0
## 55   total          total 902297  902297   902297 902297 902297     0.0
## 56   total    pct_covered     70      97       98     99     99    84.9

It can also be seen that the the coverage across the state groupings is fairly even.

table(data_1$stateGrp)
## 
##      Midwest Area    Northeast Area             Other Pacific West Area 
##            225025            113384             18674             44048 
##       Plains Area    Southeast Area 
##            300229            200937


Processing - fatalties and injuries

The fatalities and injuries data is explored using a function called element_yearly() to see how complete it is.

element_yearly <- function(element1){
        
        data_2 <- data_1 %>% mutate(element2 = as.numeric(data_1[,element1]))
        yearnames1 = yearnames[-length(yearnames)]
        elyr <- data.frame(year = yearnames1,
                           count = rep(0,length(yearnames1)),
                           count_0 = rep(0,length(yearnames1)),
                           count_1 = rep(0,length(yearnames1)),
                           complete_1 = rep(0,length(yearnames1)),
                           sum_1 = rep(0,length(yearnames1)),
                           mean_1 = rep(0,length(yearnames1)))
                           
                                                                   
                                         
        for (yy in yearnames1) {
            data_2f <- data_2 %>% filter(year == yy)
            elyr[(elyr$year == yy),"count"] = nrow(data_2f)
            elyr[(elyr$year == yy),"count_0"] = sum(data_2f$element2 == 0)
            elyr[(elyr$year == yy),"count_1"] = elyr[(elyr$year == yy),"count"] - elyr[(elyr$year == yy),"count_0"]
            elyr[(elyr$year == yy),"complete_1"] = ifelse(elyr[(elyr$year == yy),"count"] == 0, 
                            0, round(elyr[(elyr$year == yy),"count_1"] / elyr[(elyr$year == yy),"count"],2))
            elyr[(elyr$year == yy),"sum_1"] = round(sum(data_2f$element2),2)
            elyr[(elyr$year == yy),"mean_1"] = ifelse(elyr[(elyr$year == yy),"count_1"] == 0, 
                            0, round(elyr[(elyr$year == yy),"sum_1"]/elyr[(elyr$year == yy),"count_1"],2))
            }
        
        elyr
}

element_yearly("FATALITIES")
##    year count count_0 count_1 complete_1 sum_1 mean_1
## 1  1950   223     200      23       0.10    70   3.04
## 2  1951   269     252      17       0.06    34   2.00
## 3  1952   272     236      36       0.13   230   6.39
## 4  1953   492     439      53       0.11   519   9.79
## 5  1954   609     588      21       0.03    36   1.71
## 6  1955  1413    1391      22       0.02   129   5.86
## 7  1956  1703    1683      20       0.01    83   4.15
## 8  1957  2184    2133      51       0.02   193   3.78
## 9  1958  2213    2192      21       0.01    67   3.19
## 10 1959  1813    1797      16       0.01    58   3.62
## 11 1960  1945    1928      17       0.01    46   2.71
## 12 1961  2246    2225      21       0.01    52   2.48
## 13 1962  2389    2379      10       0.00    30   3.00
## 14 1963  1968    1951      17       0.01    31   1.82
## 15 1964  2348    2330      18       0.01    73   4.06
## 16 1965  2855    2798      57       0.02   301   5.28
## 17 1966  2388    2376      12       0.01    98   8.17
## 18 1967  2688    2658      30       0.01   114   3.80
## 19 1968  3312    3284      28       0.01   131   4.68
## 20 1969  2926    2911      15       0.01    66   4.40
## 21 1970  3215    3192      23       0.01    73   3.17
## 22 1971  3471    3440      31       0.01   159   5.13
## 23 1972  2168    2156      12       0.01    27   2.25
## 24 1973  4463    4421      42       0.01    89   2.12
## 25 1974  5386    5288      98       0.02   366   3.73
## 26 1975  4975    4945      30       0.01    60   2.00
## 27 1976  3768    3739      29       0.01    44   1.52
## 28 1977  3728    3713      15       0.00    43   2.87
## 29 1978  3657    3638      19       0.01    53   2.79
## 30 1979  4279    4258      21       0.00    84   4.00
## 31 1980  6146    6126      20       0.00    28   1.40
## 32 1981  4517    4503      14       0.00    24   1.71
## 33 1982  7132    7099      33       0.00    64   1.94
## 34 1983  8322    8292      30       0.00    37   1.23
## 35 1984  7335    7270      65       0.01   160   2.46
## 36 1985  7979    7940      39       0.00   112   2.87
## 37 1986  8726    8692      34       0.00    51   1.50
## 38 1987  7367    7328      39       0.01    89   2.28
## 39 1988  7257    7218      39       0.01    55   1.41
## 40 1989 10410   10372      38       0.00    79   2.08
## 41 1990 10946   10901      45       0.00    95   2.11
## 42 1991 12522   12480      42       0.00    73   1.74
## 43 1992 13534   13504      30       0.00    54   1.80
## 44 1993 12607   12435     172       0.01   298   1.73
## 45 1994 20631   20432     199       0.01   344   1.73
## 46 1995 27970   27618     352       0.01  1491   4.24
## 47 1996 32270   31904     366       0.01   542   1.48
## 48 1997 28680   28343     337       0.01   601   1.78
## 49 1998 38128   37764     364       0.01   687   1.89
## 50 1999 31289   30949     340       0.01   908   2.67
## 51 2000 34471   34151     320       0.01   477   1.49
## 52 2001 34962   34664     298       0.01   469   1.57
## 53 2002 36293   35995     298       0.01   498   1.67
## 54 2003 39752   39468     284       0.01   443   1.56
## 55 2004 39363   39098     265       0.01   370   1.40
## 56 2005 39184   38944     240       0.01   469   1.95
## 57 2006 44034   43745     289       0.01   599   2.07
## 58 2007 43289   43001     288       0.01   421   1.46
## 59 2008 55663   55340     323       0.01   488   1.51
## 60 2009 45817   45562     255       0.01   333   1.31
## 61 2010 48161   47866     295       0.01   425   1.44
## 62 2011 62174   61778     396       0.01  1002   2.53
element_yearly("INJURIES")
##    year count count_0 count_1 complete_1 sum_1 mean_1
## 1  1950   223     148      75       0.34   659   8.79
## 2  1951   269     206      63       0.23   524   8.32
## 3  1952   272     166     106       0.39  1915  18.07
## 4  1953   492     341     151       0.31  5131  33.98
## 5  1954   609     495     114       0.19   715   6.27
## 6  1955  1413    1316      97       0.07   926   9.55
## 7  1956  1703    1587     116       0.07  1355  11.68
## 8  1957  2184    2012     172       0.08  1976  11.49
## 9  1958  2213    2113     100       0.05   535   5.35
## 10 1959  1813    1732      81       0.04   734   9.06
## 11 1960  1945    1849      96       0.05   737   7.68
## 12 1961  2246    2088     158       0.07  1087   6.88
## 13 1962  2389    2313      76       0.03   551   7.25
## 14 1963  1968    1892      76       0.04   538   7.08
## 15 1964  2348    2211     137       0.06  1148   8.38
## 16 1965  2855    2668     187       0.07  5197  27.79
## 17 1966  2388    2300      88       0.04  2030  23.07
## 18 1967  2688    2525     163       0.06  2144  13.15
## 19 1968  3312    3180     132       0.04  2522  19.11
## 20 1969  2926    2822     104       0.04  1311  12.61
## 21 1970  3215    3082     133       0.04  1355  10.19
## 22 1971  3471    3292     179       0.05  2723  15.21
## 23 1972  2168    2031     137       0.06   976   7.12
## 24 1973  4463    4199     264       0.06  2406   9.11
## 25 1974  5386    5086     300       0.06  6824  22.75
## 26 1975  4975    4809     166       0.03  1457   8.78
## 27 1976  3768    3620     148       0.04  1195   8.07
## 28 1977  3728    3589     139       0.04   771   5.55
## 29 1978  3657    3552     105       0.03   919   8.75
## 30 1979  4279    4150     129       0.03  3014  23.36
## 31 1980  6146    6003     143       0.02  1157   8.09
## 32 1981  4517    4416     101       0.02   798   7.90
## 33 1982  7132    6975     157       0.02  1276   8.13
## 34 1983  8322    8152     170       0.02   816   4.80
## 35 1984  7335    7059     276       0.04  2858  10.36
## 36 1985  7979    7806     173       0.02  1513   8.75
## 37 1986  8726    8513     213       0.02   915   4.30
## 38 1987  7367    7174     193       0.03  1416   7.34
## 39 1988  7257    7038     219       0.03  1030   4.70
## 40 1989 10410   10108     302       0.03  1675   5.55
## 41 1990 10946   10627     319       0.03  1825   5.72
## 42 1991 12522   12241     281       0.02  1355   4.82
## 43 1992 13534   13262     272       0.02  1754   6.45
## 44 1993 12607   12312     295       0.02  2149   7.28
## 45 1994 20631   20027     604       0.03  4161   6.89
## 46 1995 27970   27267     703       0.03  4480   6.37
## 47 1996 32270   31631     639       0.02  2717   4.25
## 48 1997 28680   27983     697       0.02  3800   5.45
## 49 1998 38128   37264     864       0.02 11177  12.94
## 50 1999 31289   30580     709       0.02  5148   7.26
## 51 2000 34471   33842     629       0.02  2803   4.46
## 52 2001 34962   34304     658       0.02  2721   4.14
## 53 2002 36293   35736     557       0.02  3155   5.66
## 54 2003 39752   39192     560       0.01  2931   5.23
## 55 2004 39363   38897     466       0.01  2426   5.21
## 56 2005 39184   38758     426       0.01  1834   4.31
## 57 2006 44034   43511     523       0.01  3368   6.44
## 58 2007 43289   42893     396       0.01  2191   5.53
## 59 2008 55663   55137     526       0.01  2703   5.14
## 60 2009 45817   45379     438       0.01  1354   3.09
## 61 2010 48161   47688     473       0.01  1855   3.92
## 62 2011 62174   61544     630       0.01  7792  12.37

There are a lot of zero entries for the fatalities data with only around 0.5% of events leading to fatalities. This is similar for injuries with around 2% of events leading to injuries. This would need to be verified as correct by some external expertise, but there is nothing obvious in the data leading us to suspect a data quality issue. Therefore these fields are added to the data_1 dataset as is.

data_1 <- mutate(data_1, 
                 injuries = INJURIES,
                 fatalities = FATALITIES)


Processing - property damage and crop damage

The PROPDMG and CROPDMG fields are similar to the fatalities and injuries fields in that they appear fully populated but there are a lot of zero values.

element_yearly("CROPDMG")
##    year count count_0 count_1 complete_1     sum_1 mean_1
## 1  1950   223     223       0       0.00      0.00   0.00
## 2  1951   269     269       0       0.00      0.00   0.00
## 3  1952   272     272       0       0.00      0.00   0.00
## 4  1953   492     492       0       0.00      0.00   0.00
## 5  1954   609     609       0       0.00      0.00   0.00
## 6  1955  1413    1413       0       0.00      0.00   0.00
## 7  1956  1703    1703       0       0.00      0.00   0.00
## 8  1957  2184    2184       0       0.00      0.00   0.00
## 9  1958  2213    2213       0       0.00      0.00   0.00
## 10 1959  1813    1813       0       0.00      0.00   0.00
## 11 1960  1945    1945       0       0.00      0.00   0.00
## 12 1961  2246    2246       0       0.00      0.00   0.00
## 13 1962  2389    2389       0       0.00      0.00   0.00
## 14 1963  1968    1968       0       0.00      0.00   0.00
## 15 1964  2348    2348       0       0.00      0.00   0.00
## 16 1965  2855    2855       0       0.00      0.00   0.00
## 17 1966  2388    2388       0       0.00      0.00   0.00
## 18 1967  2688    2688       0       0.00      0.00   0.00
## 19 1968  3312    3312       0       0.00      0.00   0.00
## 20 1969  2926    2926       0       0.00      0.00   0.00
## 21 1970  3215    3215       0       0.00      0.00   0.00
## 22 1971  3471    3471       0       0.00      0.00   0.00
## 23 1972  2168    2168       0       0.00      0.00   0.00
## 24 1973  4463    4463       0       0.00      0.00   0.00
## 25 1974  5386    5386       0       0.00      0.00   0.00
## 26 1975  4975    4975       0       0.00      0.00   0.00
## 27 1976  3768    3768       0       0.00      0.00   0.00
## 28 1977  3728    3728       0       0.00      0.00   0.00
## 29 1978  3657    3657       0       0.00      0.00   0.00
## 30 1979  4279    4279       0       0.00      0.00   0.00
## 31 1980  6146    6146       0       0.00      0.00   0.00
## 32 1981  4517    4517       0       0.00      0.00   0.00
## 33 1982  7132    7132       0       0.00      0.00   0.00
## 34 1983  8322    8322       0       0.00      0.00   0.00
## 35 1984  7335    7335       0       0.00      0.00   0.00
## 36 1985  7979    7979       0       0.00      0.00   0.00
## 37 1986  8726    8726       0       0.00      0.00   0.00
## 38 1987  7367    7367       0       0.00      0.00   0.00
## 39 1988  7257    7257       0       0.00      0.00   0.00
## 40 1989 10410   10410       0       0.00      0.00   0.00
## 41 1990 10946   10946       0       0.00      0.00   0.00
## 42 1991 12522   12522       0       0.00      0.00   0.00
## 43 1992 13534   13534       0       0.00      0.00   0.00
## 44 1993 12607   11769     838       0.07  51661.15  61.65
## 45 1994 20631   19001    1630       0.08  88819.98  54.49
## 46 1995 27970   27030     940       0.03  35493.00  37.76
## 47 1996 32270   31162    1108       0.03  69291.90  62.54
## 48 1997 28680   27536    1144       0.04  74074.56  64.75
## 49 1998 38128   36036    2092       0.05 102235.21  48.87
## 50 1999 31289   30541     748       0.02  62707.15  83.83
## 51 2000 34471   33083    1388       0.04  79413.61  57.21
## 52 2001 34962   33668    1294       0.04  48248.35  37.29
## 53 2002 36293   35451     842       0.02  49230.64  58.47
## 54 2003 39752   38792     960       0.02  52322.19  54.50
## 55 2004 39363   37922    1441       0.04  72438.97  50.27
## 56 2005 39184   38454     730       0.02  48443.05  66.36
## 57 2006 44034   43075     959       0.02  65850.56  68.67
## 58 2007 43289   42258    1031       0.02  76768.00  74.46
## 59 2008 55663   53848    1815       0.03 183821.00 101.28
## 60 2009 45817   44803    1014       0.02  59683.00  58.86
## 61 2010 48161   47203     958       0.02  59014.00  61.60
## 62 2011 62174   61007    1167       0.02  98311.00  84.24
element_yearly("PROPDMG")
##    year count count_0 count_1 complete_1     sum_1 mean_1
## 1  1950   223      28     195       0.87  16999.15  87.18
## 2  1951   269      39     230       0.86  10560.99  45.92
## 3  1952   272      47     225       0.83  16679.74  74.13
## 4  1953   492      88     404       0.82  19182.20  47.48
## 5  1954   609     127     482       0.79  23367.82  48.48
## 6  1955  1413     980     433       0.31  27715.63  64.01
## 7  1956  1703    1280     423       0.25  27002.35  63.84
## 8  1957  2184    1366     818       0.37  44568.89  54.49
## 9  1958  2213    1674     539       0.24  26597.11  49.35
## 10 1959  1813    1311     502       0.28  25015.54  49.83
## 11 1960  1945    1393     552       0.28  28314.24  51.29
## 12 1961  2246    1626     620       0.28  39528.73  63.76
## 13 1962  2389    1988     401       0.17  22245.73  55.48
## 14 1963  1968    1597     371       0.19  24793.08  66.83
## 15 1964  2348    1758     590       0.25  38618.32  65.45
## 16 1965  2855    2147     708       0.25  46716.54  65.98
## 17 1966  2388    1970     418       0.18  27079.36  64.78
## 18 1967  2688    2010     678       0.25  40056.72  59.08
## 19 1968  3312    2799     513       0.15  27762.41  54.12
## 20 1969  2926    2470     456       0.16  33354.23  73.15
## 21 1970  3215    2712     503       0.16  30800.98  61.23
## 22 1971  3471    2775     696       0.20  44778.05  64.34
## 23 1972  2168    1595     573       0.26  36287.57  63.33
## 24 1973  4463    3454    1009       0.23  75056.56  74.39
## 25 1974  5386    4565     821       0.15  57905.28  70.53
## 26 1975  4975    4232     743       0.15  52498.79  70.66
## 27 1976  3768    3065     703       0.19  56249.69  80.01
## 28 1977  3728    3037     691       0.19  54811.73  79.32
## 29 1978  3657    3043     614       0.17  49135.66  80.03
## 30 1979  4279    3632     647       0.15  54633.74  84.44
## 31 1980  6146    5423     723       0.12  64904.35  89.77
## 32 1981  4517    3942     575       0.13  46608.31  81.06
## 33 1982  7132    6004    1128       0.16  83065.50  73.64
## 34 1983  8322    7328     994       0.12  64058.09  64.44
## 35 1984  7335    6539     796       0.11  63241.39  79.45
## 36 1985  7979    7506     473       0.06  39667.21  83.86
## 37 1986  8726    8170     556       0.06  50897.34  91.54
## 38 1987  7367    6955     412       0.06  33212.95  80.61
## 39 1988  7257    6722     535       0.07  50289.65  94.00
## 40 1989 10410    9826     584       0.06  49263.55  84.36
## 41 1990 10946   10166     780       0.07  69560.57  89.18
## 42 1991 12522   11844     678       0.05  60888.93  89.81
## 43 1992 13534   12674     860       0.06  70526.40  82.01
## 44 1993 12607    7049    5558       0.44 477319.20  85.88
## 45 1994 20631   11574    9057       0.44 622212.70  68.70
## 46 1995 27970   18331    9639       0.34 319225.80  33.12
## 47 1996 32270   22981    9289       0.29 424503.33  45.70
## 48 1997 28680   19225    9455       0.33 337999.53  35.75
## 49 1998 38128   25099   13029       0.34 545844.65  41.89
## 50 1999 31289   21481    9808       0.31 364611.53  37.17
## 51 2000 34471   23961   10510       0.30 403471.23  38.39
## 52 2001 34962   25655    9307       0.27 383596.77  41.22
## 53 2002 36293   26535    9758       0.27 355741.34  36.46
## 54 2003 39752   29400   10352       0.26 437754.86  42.29
## 55 2004 39363   29618    9745       0.25 412837.59  42.36
## 56 2005 39184   29743    9441       0.24 428309.11  45.37
## 57 2006 44034   32696   11338       0.26 449583.65  39.65
## 58 2007 43289   31958   11331       0.26 438170.28  38.67
## 59 2008 55663   38686   16977       0.30 690246.11  40.66
## 60 2009 45817   32087   13730       0.30 553432.58  40.31
## 61 2010 48161   32866   15295       0.32 594847.94  38.89
## 62 2011 62174   42271   19903       0.32 820290.74  41.21

The CROPDMG field is only populated from 1993 onwards with around 5% of events having crop damage data. The PROPDMG data appears to be more populated with around 30% of the events having property damage data. Again this would need to be verified as correct by some external expertise, but there is nothing obvious in the data leading us to suspect a data quality issue. The information appears to be entered as ’000s of dollars although this is not explicitly set out. So an assumption is made.

We also make an adjustment to the dollar values in each of these fields by converting to 2021 US dollar values by using a 100 US dollar index from 1950 to 2021. This is to ensure a fair comparison for historic data.

data_1 <- mutate(data_1, 
                 infyrX = index_match(year, data_inflation[,c(1,2)], data_inflation[nrow(data_inflation),2]),
                 inf2021 = data_inflation[nrow(data_inflation),2],
                 CROPDMG21 = round(1000 * as.numeric(CROPDMG) * inf2021/infyrX,0),
                 PROPDMG21 = round(1000 * as.numeric(PROPDMG) * inf2021/infyrX,0))



Results

This section considers some initial results from analysing the SDD data. Specifically it sets out we look at three pieces of analysis as follows:


Results - High level overview

This considers the full dataset across events, years and states. The analysis highlights where event data has been cleansed during the data cleansing process so the impact can be understood, and also compared to analysis which has not undergone data cleansing.

# set up the margins for a triple plot (vertically)
    par(mfrow = c(3,1), mar = c(1, 2, 4, 1), oma = c(3, 3, 5, 1))
    options(scipen=999)
    
# build the chart1 dataset for analysis by event        
    data_c <- data_1 %>% group_by(event_match, event, .drop = FALSE) %>% summarise(count = n()) %>% ungroup()
    cdata1 <-  matrix(0, ncol = length(eventnamessorted), nrow = 2)
    cdata1[1,] <- index_match(eventnamessorted, filter(data_c, event_match == "TRUE")[-1],0)
    cdata1[2,] <- index_match(eventnamessorted, filter(data_c, event_match == "FALSE")[-1],0)
    rownames(cdata1) <- c("original", "generated")
    colnames(cdata1) <- eventnamessorted

# draw chart1 for analysis by event      
        b1 <- barplot(cdata1,
                  ylim = c(0, 100000),
                  col = c("blue", "light blue"),
                  border = NA,
                  ylab = "frequency",
                  xaxt = "n",
                  yaxt = "n",
                  xaxs = "i",
                  yaxs = "i")

    axis(side = 2, 
         las = 2, 
         cex.axis = 1, 
         at = seq(0, 100000, by = 20000),
         labels = format(axTicks(2), big.mark = ","),
         mgp = c(3, 1, 0))

        text( x = b1,
          y = par("usr")[3] - 1,
          labels = eventnamessorted,
          cex = 1,
          xpd = NA,
          srt = 90,
          adj = 1)
    
    legend("topright", 
           legend = c("original", "generated"),
           fill = c("blue", "light blue"),
           border = NA,
           box.lty = 0)
    
    text( x = c(32,39.25),
          y = c(90000,90000),
         label = c("325,080", "288,842"),
          cex = 1,
          srt = 90)
          
    title("Figure 1: Number of different weather events recorded in SDD from 1950 to 2011", 
          cex = 3, line = 3, outer = TRUE)   

# build the chart2 dataset for analysis by year
    data_c <- data_1 %>% group_by(event_match, year, .drop = FALSE) %>% summarise(count = n()) %>% ungroup()
    yearnames1 <- yearnames[-length(yearnames)]
    cdata2 <-  matrix(0, ncol = length(yearnames1), nrow = 2)
    cdata2[1,] <- index_match(yearnames1, filter(data_c, event_match == "TRUE")[-1],0)
    cdata2[2,] <- index_match(yearnames1, filter(data_c, event_match == "FALSE")[-1],0)
    rownames(cdata2) <- c("original", "generated")
    colnames(cdata2) <- yearnames1
    
# draw chart2 for analysis by year      
    b2 <- barplot(cdata2,
                  ylim = c(0, 100000),
                  col = c("blue", "light blue"),
                  border = NA,
                  ylab = "frequency",
                  xaxt = "n",
                  yaxt = "n",
                  xaxs = "i",
                  yaxs = "i")

    axis(side = 2, 
         las = 2, 
         cex.axis = 1, 
         at = seq(0, 100000, by = 20000),
         labels = format(axTicks(2), big.mark = ","),
         mgp = c(3, 1, 0))
    
    text( x = b2,
          y = par("usr")[3] - 1,
          labels = yearnames1,
          cex = 1,
          xpd = NA,
          srt = 90,
          adj = 1)
        
# build the chart3 dataset for analysis by state       
    data_c <- data_1 %>% group_by(event_match, state, .drop = FALSE) %>% summarise(count = n()) %>% ungroup()
    cdata3 <-  matrix(0, ncol = length(statenamessorted), nrow = 2)
    cdata3[1,] <- index_match(statenamessorted, filter(data_c, event_match == "TRUE")[-1],0)
    cdata3[2,] <- index_match(statenamessorted, filter(data_c, event_match == "FALSE")[-1],0)
    rownames(cdata3) <- c("original", "generated")
    colnames(cdata3) <- statenamessorted
    
# draw chart3 for analysis by state     
    b3 <- barplot(cdata3,
                  ylim = c(0, 100000), 
                  col = c("blue", "light blue"),
                  border = NA,
                  ylab = "frequency",
                  xaxt = "n",
                  yaxt = "n",
                  xaxs = "i",
                  yaxs = "i")

    axis(side = 2, 
         las = 2, 
         cex.axis = 1, 
         at = seq(0, 100000, by = 20000),
         labels = format(axTicks(2), big.mark = ","),
         mgp = c(3, 1, 0))

        text( x = b3,
          y = par("usr")[3] - 1,
          labels = statenamessorted,
          cex = 1,
          xpd = NA,
          srt = 90,
          adj = 1)        

The analysis in Figure 1 shows that:

  • The weather events are dominated by ‘Hail’, ‘Thunderstorm Wind’ and ‘Tornado’ this is not surprising since these were the only events recorded in the database up to 1992.
  • The ‘Thunderstorm Wind’ event is recorded accurately 80,000 times and inaccurately 245,000 times so data cleansing is vital.
  • The number of events recorded per year increases significantly over the period 1992 to 1996.
  • Data quality dramatically improves from 2007 onwards following the introduction of automating the data entry processes.
  • Most of the weather events are recorded in the mid-states of the US. There are relatively fewer recorded in the west coast states or the northeast states.


Results - Fatalities and Injuries

The second analysis looks at fatalities and injuries data. Given the improvement of the data from 1992 onwards the pre-1992 data is ignored for this analysis. The analysis considers the total number of fatalities and injuries per event type, and shows this alongside the fatalities and injuries per event. This allows a full comparison of the events from a public health perspective.

# This stores the data for charts 4 to 7
perevent1 <- data.frame(event = eventnames,
                        count = rep(0, length(eventnames)),
                        injuries = rep(0, length(eventnames)),
                        fatalities = rep(0, length(eventnames)),
                        CROPDMG21 = rep(0, length(eventnames)),
                        PROPDMG21 = rep(0, length(eventnames)),
                        injuries_e = rep(0, length(eventnames)),
                        fatalities_e = rep(0, length(eventnames)),
                        CROPDMG21_e = rep(0, length(eventnames)),
                        PROPDMG21_e = rep(0, length(eventnames)))     

        for (ee in eventnames) {               
            data_pe <- data_1 %>% filter(yearGrp != "pre 1992", event == ee) 
            
            perevent1[(perevent1$event == ee), "count"] <- nrow(data_pe)
            cEvents <- nrow(data_pe)
            perevent1[(perevent1$event == ee), "injuries"] <- sum(as.numeric(data_pe$injuries)) 
            perevent1[(perevent1$event == ee), "fatalities"] <- sum(as.numeric(data_pe$fatalities)) 
            perevent1[(perevent1$event == ee), "CROPDMG21"] <- sum(as.numeric(data_pe$CROPDMG21)) 
            perevent1[(perevent1$event == ee), "PROPDMG21"] <- sum(as.numeric(data_pe$PROPDMG21))
        

            if (cEvents == 0) {
                perevent1[(perevent1$event == ee), "injuries_e"] <- 0 
                perevent1[(perevent1$event == ee), "fatalities_e"] <- 0 
                perevent1[(perevent1$event == ee), "CROPDMG21_e"] <- 0 
                perevent1[(perevent1$event == ee), "PROPDMG21_e"] <- 0 
            } else {
                perevent1[(perevent1$event == ee), "injuries_e"] <- 
                               round(perevent1[perevent1$event == ee, "injuries"] / cEvents,2)
                perevent1[(perevent1$event == ee), "fatalities_e"] <- 
                                round(perevent1[perevent1$event == ee, "fatalities"] / cEvents,2)
                perevent1[(perevent1$event == ee), "CROPDMG21_e"] <- 
                                round(perevent1[perevent1$event == ee, "CROPDMG21"] / cEvents,2) 
                perevent1[(perevent1$event == ee), "PROPDMG21_e"] <- 
                               round(perevent1[perevent1$event == ee, "PROPDMG21"] / cEvents,2)
            }
        }    
# set up parameters for 2 charts side by side
par(mfrow = c(1,2), mar = c(1, 5, 2, 1), oma = c(3, 3, 5, 1))
eventnamesreversed = rev(eventnames)
eventnamessortedrev = rev(eventnamessorted)

# build the chart4 dataset for analysis of injuries and fatalities     
 cdata4 <-  matrix(0, ncol = length(eventnamessortedrev), nrow = 2)
 # cdata4[1,] <- rev(perevent1[,"injuries"])
 cdata4[1,] <- index_match(eventnamessortedrev, perevent1[,c("event", "injuries")])
 # cdata4[2,] <- rev(perevent1[,"fatalities"])
 cdata4[2,] <- index_match(eventnamessortedrev, perevent1[,c("event", "fatalities")])
 rownames(cdata4) <- c("injuries", "fatalities")
 colnames(cdata4) <- eventnamessortedrev

 # draw chart4 
  b4 <- barplot(cdata4,
               horiz = TRUE,
               beside=TRUE,
               xlim = c(0, 10000), 
               col = c("blue", "red"),
               border = NA,
               xlab = "frequency",
               xaxt = "n",
               yaxt = "n",
               xaxs = "i",
               yaxs = "i")

 axis(side = 1, 
      cex.axis = 0.7, 
      at = seq(0, 10000, by = 2000),
      labels = format(axTicks(1), big.mark = ","),
      mgp = c(3, 1, 0))
 
 text( y = (b4[1,]+b4[2,])/2,
       x = -100,
       labels = eventnamessortedrev,
       cex = 0.6,
       xpd = NA,
       adj = 1)
 
 mtext("Total incidents",
        side = 3, 
        line = 1,
        adj = 0)

 legend("top", 
        legend = c("fatalaties", "injuries"),
        fill = c("red", "blue"),
        border = NA,
        box.lty = 0,
        cex = 0.7)
 
 abline(v = seq(0, 10000, by = 2000), 
        col = "grey", 
        lty = 3, 
        lwd = 1)
 
 title("Figure 2: Injuries and fatalities from different weather events recorded in SDD from 1996 to 2011", cex = 1.5, line = 3, outer = TRUE)

# build the chart5 dataset for analysis of injuries and fatalities by event             
 cdata5 <-  matrix(0, ncol = length(eventnamessortedrev), nrow = 2)
 # cdata5[1,] <- rev(perevent1[,"injuries_e"])
 cdata5[1,] <- index_match(eventnamessortedrev, perevent1[,c("event", "injuries_e")])
 # cdata5[2,] <- rev(perevent1[,"fatalities_e"])
 cdata5[2,] <- index_match(eventnamessortedrev, perevent1[,c("event", "fatalities_e")])
 rownames(cdata5) <- c("injuries_e", "fatalities_e")
 colnames(cdata5) <- eventnamessortedrev

# draw chart5 
 b5 <- barplot(cdata5,
               horiz = TRUE,
               beside=TRUE,
               xlim = c(0, 10), 
               col = c("blue", "red"),
               border = NA,
               xlab = "frequency",
               xaxt = "n",
               yaxt = "n",
               xaxs = "i",
               yaxs = "i")

 axis(side = 1, 
      cex.axis = 0.7, 
      at = seq(0, 10, by = 2),
      labels = format(axTicks(1), big.mark = ","),
      mgp = c(3, 1, 0))
 
 text( y = (b5[1,]+b5[2,])/2,
       x = -0.1,
       labels = eventnamessortedrev,
       cex = 0.6,
       xpd = NA,
       adj = 1)
 
 mtext("Incidents per event",
       side = 3, 
       line = 1,
       adj = 0)
 
 legend("top", 
        legend = c("fatalaties", "injuries"),
        fill = c("red", "blue"),
        border = NA,
        box.lty = 0,
        cex = 0.7) 
 
 # add vertical gridlines
 abline(v = seq(0, 10, by = 2), 
        col = "grey", 
        lty = 3, 
        lwd = 1)

The analysis in Figure 2 shows that:

  • The largest number of fatalities come from ‘Excessive Heat’, ‘Heat’ and ‘Tornadoes’. (Even after discarding pre-1992 data).
  • The largest number of injuries come from ‘Excessive Heat’, ‘Flood’, ‘Lightning’, ‘Thunderstorm Wind’, and ‘Tornado’.
  • Per event the largest number of fatalities come from ‘Excessive Heat’, ‘Heat’ and ‘Tsunami’ with significant numbers also coming from ‘Avalanche’, ‘Rip Current’ and ‘Typhoon’.
  • Per event the largest number of fatalities come from ‘Tsunami’, ‘Excessive Heat’ and ‘Hurricane(Typhoon)’.
  • Overall there seems to be good reason for public health officials to focus on extreme weather events involving heat.


Results - Fatalities and Injuries

The third analysis looks at property damage and crop damage data. Given the improvement of the data from 1992 onwards the pre-1992 data is ignored for this analysis. The analysis considers the total economic impact of property damage and crop damage, and shows this alongside the economic impact per event. This allows a full comparison of the events from an economic perspective.

 par(mfrow = c(1,2), mar = c(1, 5, 2, 1), oma = c(3, 3, 5, 1))

# build the chart6 dataset for analysis of crop damage and property damage  
 cdata6 <-  matrix(0, ncol = length(eventnamessortedrev), nrow = 2)
 # cdata6[1,] <- rev(perevent1[,"CROPDMG21"])
 cdata6[1,] <- index_match(eventnamessortedrev, perevent1[,c("event", "CROPDMG21")])
 # cdata6[2,] <- rev(perevent1[,"PROPDMG21"])
 cdata6[2,] <- index_match(eventnamessortedrev, perevent1[,c("event", "PROPDMG21")])
 rownames(cdata6) <- c("CROPDMG21", "PROPDMG21")
 colnames(cdata6) <- eventnamesreversed

 # draw chart 6
 b6 <- barplot(cdata6,
               horiz = TRUE,
               beside=TRUE,
               xlim = c(0, 5000000000), 
               col = c("forestgreen", "green"),
               border = NA,
               xlab = "frequency",
               xaxt = "n",
               yaxt = "n",
               xaxs = "i",
               yaxs = "i")
 # axis(side = 1, labels = FALSE)
 axis(side = 1, 
      cex.axis = 0.7, 
      at = seq(0, 5000000000, by = 1000000000),
      labels = c("0", "1bn", "2bn", "3bn", "4bn", "5bn"),
      mgp = c(3, 1, 0))
 
 text( y = (b6[1,]+b6[2,])/2,
       x = -100,
       labels = eventnamessortedrev,
       cex = 0.6,
       xpd = NA,
       adj = 1)
 
 mtext("Total economic impact (USD 2021)",
       side = 3, 
       line = 1,
       adj = 0)
 
 legend("top", 
        legend = c("crop damage", "property damage"),
        fill = c("forestgreen", "green"),
        border = NA,
        box.lty = 0,
        cex = 0.7)
 
 # add vertical gridlines
 abline(v = seq(0, 5000000000, by = 1000000000), 
        col = "grey", 
        lty = 3, 
        lwd = 1)
 
 title("Figure 3: Crop damage and property damage from different weather events recorded in SDD from 1996 to 2011", cex = 1.5, line = 3, outer = TRUE)

# build the chart7 dataset for analysis of crop damage and property damage per event 
 cdata7 <-  matrix(0, ncol = length(eventnamessortedrev), nrow = 2)
 cdata7[1,] <- rev(perevent1[,"CROPDMG21_e"])
 cdata7[1,] <- index_match(eventnamessortedrev, perevent1[,c("event", "CROPDMG21_e")])
 cdata7[2,] <- rev(perevent1[,"PROPDMG21_e"])
 cdata7[2,] <- index_match(eventnamessortedrev, perevent1[,c("event", "PROPDMG21_e")])
 rownames(cdata7) <- c("CROPDMG21_e", "PROPDMG21_e")
 colnames(cdata7) <- eventnamessortedrev

 # draw chart 7
 b7 <- barplot(cdata7,
               horiz = TRUE,
               beside = TRUE,
               xlim = c(0, 150000), 
               col = c("forestgreen", "green"),
               border = NA,
               xlab = "frequency",
               xaxt = "n",
               yaxt = "n",
               xaxs = "i",
               yaxs = "i")
 # axis(side = 1, labels = FALSE)
 axis(side = 1, 
      cex.axis = 0.7, 
      at = seq(0, 150000, by = 25000),
      mgp = c(3, 1, 0))
 
 text( y = (b7[1,]+b7[2,])/2,
       x = -0.1,
       labels = eventnamessortedrev,
       cex = 0.6,
       xpd = NA,
       adj = 1)
 
 mtext("Economic impact per event (USD 2021)",
       side = 3, 
       line = 1,
       adj = 0)
 
 legend("top", 
        legend = c("crop damage", "property damage"),
        fill = c("forestgreen", "green"),
        border = NA,
        box.lty = 0,
        cex = 0.7) 
 
 # add vertical gridlines
 abline(v = seq(0, 150000, by = 25000), 
        col = "grey", 
        lty = 3, 
        lwd = 1)        

The analysis in Figure 3 shows that:

  • The largest property damage comes from ‘Thunderstorm Wind’, ‘Tornado’, ‘Flash Flood’ and ‘Flood’.
  • The largest crop damage comes from ‘Hail’, with significant amount also from ‘Flash Flood’ and ‘Flood’.
  • The largest property damage per event comes from ‘Hurricane (Typhoon)’ with significant impact across many other events.
  • The largest crop damage per event comes from ‘Hurricane (Typhoon)’ and ‘Tropical Storm’.
  • Overall there seems to be good reason for officials to focus on Flooding and Storms in terms of property damage, and Hail in terms of crop damage.


Results - Further Analsis

The initial three piece of analysis are intersting and point towards intial directions of investigation around policy on piublic health and economic impact. Further analysis is suggested in the following areas:

  • Significant effort to clean data pre 1992, with review from experts in the NWS SDD
  • A geographical look at weather evnts to see whsther certain events are more important to focus on in certain geographies.
  • Consideration as to some geographies shoule be more prepared for some weather events, also considering the cost/benefit anaysis for preparing for freak weather events - e.g. Heat events in northern states or cold events in southern states
  • Combining with other national databases such as state spending on sorm defences or road traffic accidents.