Exploratory Analysis of US Storm Data

Synopsis

This study presents exploratory analysis of the U.S. National Oceanic and Atmospheric Administration (NOAA) storm database. The aim of the analysis is to gain a general understanding of the health and economic consequences of major storm events.

The data set presents a number of challenges. It is a long-running database, covering six decades. Over that time, standards for reporting have changed. There is inconsisetency in the classification of events, and the values are expressed in inconsistent units without inflation adjustment. For the purposes of this study, we aggregate event data and focus on the mean and total count or count of losses, without rigorously analyzing coding errors that could misrepresent individual events. To reliably understand the true cost of any single event, it is necessary to consult the narrative text and other historical records.

The results present national level data for the most injurious events. Because of the geographically custered nature of many weather events, state and regional level aggregation would be a logical next step in data exploration and analyis.

Data Processing

Load the dataset and required packages

# load primary weather dataset
wxdata <- read.csv("repdata_data_StormData.csv.bz2")

# suppress messages and warnings about my R version when loading packages they all still work.
suppressMessages(suppressWarnings(require(plyr)))     
suppressMessages(suppressWarnings(require(quantmod))) 
suppressMessages(suppressWarnings(require(lubridate)))
suppressMessages(suppressWarnings(require(knitr)))

For this analysis, we are also going to need to adjust for inflation to calculate comparable monetary losses. We will use 2000 as the base year for CPI adjustment.

# load US Federal Reserve cpi data for use in generating CPI-adjusted damage value statistics
suppressMessages(getSymbols("CPIAUCSL", src='FRED')) # suppress long update message about function

## [1] "CPIAUCSL"

avg.cpi <- apply.yearly(CPIAUCSL, mean)
adj.cpi <- avg.cpi/as.numeric(avg.cpi['2000']) # set 2000 as the base year for CPI adjustment
adj.cpi <- data.frame(year=year(index(adj.cpi)), adj.cpi$CPIAUCSL)
cpi.adjusted.cost <- function(x, year) {
    x * adj.cpi$CPIAUCSL[adj.cpi$year == year]
}

Condition the data for analysis

For comparison and aggregation, we will need to standardize dates as well as monitary units of measure. The data contain magnitude coding for cost fields. We’ll convert those to a standard measure with millions as a unit that makes sense when focusing on the most costly events.

# adjust column names based to preferred style and convert date formats
names(wxdata) <- tolower(names(wxdata))
colnames(wxdata)[8]<-"event_type"
wxdata$bgn_date <- as.Date(wxdata$bgn_date, "%M/%d/%Y %H:%M:%S")
wxdata$end_date <- as.Date(wxdata$end_date, "%M/%d/%Y %H:%M:%S")
wxdata$event_year <- year(wxdata$bgn_date)

# standardize crop and prop exponent expressions and use millions as the unit of measure
10^-6 -> wxdata$stdcropexp[wxdata$cropdmgexp=="" | wxdata$cropdmgexp=="?" | wxdata$cropdmgexp=="0"]
10^-4 -> wxdata$stdcropexp[wxdata$cropdmgexp=="2"]
10^-3 -> wxdata$stdcropexp[wxdata$cropdmgexp=="K"]
10^0  -> wxdata$stdcropexp[wxdata$cropdmgexp=="M"]
10^3  -> wxdata$stdcropexp[wxdata$cropdmgexp=="B"]

10^-6 -> wxdata$stdpropexp[wxdata$propdmgexp=="" | wxdata$propdmgexp=="?" | wxdata$propdmgexp=="+" | 
                      wxdata$propdmgexp=="-" | wxdata$propdmgexp=="0"] 
10^-5 -> wxdata$stdpropexp[wxdata$propdmgexp=="1"]
10^-4 -> wxdata$stdpropexp[wxdata$propdmgexp=="2" | wxdata$propdmgexp=="H"]
10^-3 -> wxdata$stdpropexp[wxdata$propdmgexp=="3" | wxdata$propdmgexp=="K"]
10^-2 -> wxdata$stdpropexp[wxdata$propdmgexp=="4"]
10^-1 -> wxdata$stdpropexp[wxdata$propdmgexp=="5"]
10^0  -> wxdata$stdpropexp[wxdata$propdmgexp=="6" | wxdata$propdmgexp=="M"]
10^1  -> wxdata$stdpropexp[wxdata$propdmgexp=="7"]
10^2  -> wxdata$stdpropexp[wxdata$propdmgexp=="8"]
10^3  -> wxdata$stdpropexp[wxdata$propdmgexp=="B"]

# adjust damage amounts using standard exponent expressions
wxdata$cropdmg <- wxdata$cropdmg * wxdata$stdcropexp
wxdata$propdmg <- wxdata$propdmg * wxdata$stdpropexp

fix inconsistent event type codes

Because many reporters entered data from various regions over extended periods of time, and enumerated values do not appear to have been established or enforced, the event_type field contains a large number of synonyms, case variation, and typos. We’ll reduce some of those here.

# fix false differentiation between factors due to case
wxdata$event_type <- as.factor(tolower(wxdata$event_type))
wxdata$cropdmgexp <- as.factor(toupper(wxdata$cropdmgexp))
wxdata$propdmgexp <- as.factor(toupper(wxdata$propdmgexp))

# fix trivial (non-semantic) differences in event type coding
wxdata$event_type <- gsub(" {2,}"," ",wxdata$event_type)                    # multiple spaces
wxdata$event_type <- gsub("^\\s+","",wxdata$event_type)                     # leading spaces
wxdata$event_type <- gsub("[\\s/.]+$","",wxdata$event_type)                 # trailing oddities
wxdata$event_type <- gsub("/+","/",wxdata$event_type)                       # double slashes
wxdata$event_type <- gsub("sml","small",wxdata$event_type)                  # abbr
wxdata$event_type <- gsub("wnd","wind",wxdata$event_type)                   # abbr
wxdata$event_type <- gsub("cstl","coastal",wxdata$event_type)               # abbr
wxdata$event_type <- gsub("surge/tide","surge",wxdata$event_type)           # inconsistency
wxdata$event_type <- gsub("non.tstm","non-tstm",wxdata$event_type)          # inconsistency
wxdata$event_type <- gsub("tstm","thunderstorm",wxdata$event_type)          # abbr
wxdata$event_type <- gsub("fld","flood",wxdata$event_type)                  # abbr
wxdata$event_type <- gsub("\\\\","/",wxdata$event_type)                     # double backslash
wxdata$event_type <- gsub("&","and",wxdata$event_type)                      # inconsistency
wxdata$event_type <- gsub("devel","devil",wxdata$event_type)                # typo
wxdata$event_type <- gsub("lake effect","lake-effect",wxdata$event_type)    # inconsistency
wxdata$event_type <- gsub("sleet?storm","sleet",wxdata$event_type)          # inconsistency
wxdata$event_type <- gsub("unseasonably","unusually",wxdata$event_type)     # inconsistency
wxdata$event_type <- gsub("^.+mix","wintry mix",wxdata$event_type)          # fix abbr
wxdata$event_type <- gsub("abnormally","unusually",wxdata$event_type)       # synonym
wxdata$event_type <- gsub("wild/forest fire","wildfire",wxdata$event_type)  # synonym
wxdata$event_type <- gsub("high sea", "rough sea",wxdata$event_type)        # synonym
wxdata$event_type <- gsub("high sea", "rough sea",wxdata$event_type)        # synonym
wxdata$event_type <- gsub("high sea", "rough sea",wxdata$event_type)        # synonym
wxdata$event_type <- gsub(".*(high|hazardous) surf.*","heavy surf",
                          wxdata$event_type)                                # synonyms
wxdata$event_type <- gsub("mud slide","mudslide",wxdata$event_type)         # alternatives
wxdata$event_type <- gsub("hurricane.typhoon","hurricane",wxdata$event_type)# synonym

The amount of reporting over time als presents a data analysis challenge. A plot of the total number of observations made in a year shows a marked increase around 1995. It is therefore reasonable, and sufficient for this study, to subset the data to look only at the last decade. That gives us a reasonable baseline for comparison of event costs in terms of injuries to persons and property.

# calculate total events per year
yearly_events <- as.data.frame(table(wxdata$event_year))
colnames(yearly_events) <- c("year","recorded_events")
plot(yearly_events, main="Recorded Weather Events per Year")

plot of chunk unnamed-chunk-6

Subset and aggregate the data

recent_wxdata <- subset(wxdata,wxdata$event_year > 2000)

# add columns for CPI adjusted damages to crops and property
recent_wxdata$adj_propdmg <- mapply(cpi.adjusted.cost, 
                                    x = recent_wxdata$propdmg, year=recent_wxdata$event_year)
recent_wxdata$adj_cropdmg <- mapply(cpi.adjusted.cost, 
                                    x = recent_wxdata$cropdmg, year=recent_wxdata$event_year)

# create a dataframe that aggregates counts and means per event
eventcounts    <- as.data.frame(table(recent_wxdata$event_type))
deaths.mean    <- aggregate(fatalities  ~ event_type, data = recent_wxdata, FUN = mean)
deaths.total   <- aggregate(fatalities  ~ event_type, data = recent_wxdata, FUN =  sum)
injuries.mean  <- aggregate(injuries    ~ event_type, data = recent_wxdata, FUN = mean)
injuries.total <- aggregate(injuries    ~ event_type, data = recent_wxdata, FUN =  sum)
propdmg.mean   <- aggregate(adj_propdmg ~ event_type, data = recent_wxdata, FUN = mean)
propdmg.total  <- aggregate(adj_propdmg ~ event_type, data = recent_wxdata, FUN =  sum)
cropdmg.mean   <- aggregate(adj_cropdmg ~ event_type, data = recent_wxdata, FUN = mean)
cropdmg.total  <- aggregate(adj_cropdmg ~ event_type, data = recent_wxdata, FUN =  sum)
casualties     <- cbind(eventcounts,
                        round(deaths.mean  [,2],3), deaths.total  [,2],
                        round(injuries.mean[,2],3), injuries.total[,2], 
                        round(propdmg.mean [,2],3), propdmg.total [,2],
                        round(cropdmg.mean [,2],3), cropdmg.total [,2])
names(casualties) <- c("event","frequency",
                       "mean_deaths","total_deaths",
                       "mean_injuries","total_injuries",
                       "mean_prop_dmg","total_prop_dmg",
                       "mean_crop_dmg", "total_crop_dmg")

Results

To get a sense of the weather events that produce the most economic and health-effect damage, we’ll take the top ten events in each of four categories (loss of life, physical injury, property damage, and crop damage), base on mean loss rates. Then we’ll combine those and order by frequency. A nice R Shiny app for state-level planners might let them subset by state before performing these calculations to get a sense of the most damaging events with the highest frequency.

crop_killers <- casualties[order(casualties$mean_crop_dmg, decreasing=TRUE),]
prop_killers <-  casualties[order(casualties$mean_prop_dmg, decreasing=TRUE),]
pop_killers  <- casualties[order(casualties$mean_deaths,   decreasing=TRUE),]
maimers      <- casualties[order(casualties$mean_injuries, decreasing=TRUE),]
worst_wx     <- unique(rbind(
    head(crop_killers,10),
    head(prop_killers,10),
    head(pop_killers,10),
    head(maimers,10)))
# order worst weather events by frequency
worst_wx_ordered <- worst_wx[order(worst_wx$frequency, decreasing=TRUE),]
# because of remaining questions about coding error, discount events that occur very rarely
worst_wx_ordered <- worst_wx_ordered[worst_wx_ordered$frequency > 10,]
kable(worst_wx_ordered[,c(1,2,3,5,7,9)])

## 
## 
## |    |event             | frequency| mean_deaths| mean_injuries| mean_prop_dmg| mean_crop_dmg|
## |:---|:-----------------|---------:|-----------:|-------------:|-------------:|-------------:|
## |38  |flood             |     18822|       0.014|         0.016|         8.374|         0.232|
## |115 |tornado           |     16294|       0.070|         0.876|         1.436|         0.017|
## |134 |wildfire          |      3254|       0.025|         0.337|         1.767|         0.107|
## |19  |drought           |      1930|       0.000|         0.002|         0.478|         3.874|
## |45  |frost/freeze      |      1335|       0.000|         0.000|         0.009|         1.002|
## |63  |ice storm         |      1316|       0.021|         0.082|         1.686|         0.007|
## |28  |excessive heat    |      1031|       0.825|         3.144|         0.005|         0.559|
## |53  |heat              |       684|       0.333|         1.782|         0.003|         0.000|
## |118 |tropical storm    |       602|       0.083|         0.442|        12.553|         0.756|
## |67  |landslide         |       572|       0.065|         0.091|         0.669|         0.041|
## |92  |rip current       |       551|       0.753|         0.644|         0.000|         0.000|
## |4   |avalanche         |       295|       0.546|         0.359|         0.012|         0.000|
## |110 |storm surge       |       264|       0.042|         0.030|       207.388|         0.004|
## |39  |fog               |       195|       0.149|         1.092|         0.028|         0.000|
## |30  |extreme cold      |       133|       0.128|         0.000|         0.002|         1.205|
## |60  |hurricane         |       132|       0.515|         9.780|       617.516|        26.064|
## |40  |freeze            |        37|       0.000|         0.000|         0.000|         1.330|
## |11  |coastal flooding  |        36|       0.028|         0.000|         0.488|         0.000|
## |119 |tsunami           |        20|       1.650|         6.450|         9.102|         0.001|
## |33  |extreme windchill |        15|       0.067|         0.000|         0.000|         1.211|
## |99  |small hail        |        14|       0.000|         0.071|         0.000|         0.343|

Conclusions

From this ordering, it is no surprise that hurricandes and the often associated storm surge are the greatest cause of property damage on average. Among the events with the greatest propensity to cause death and injury, excessive heat and rip currents stand out, with 0.825 and 0.753 deaths, respectively. But no event type in this study approaches the localized lethality of rare tsumami events, with 0.753 deaths per event on average.