We undertake a brief analysis of the impact of storm events upon reported fatalities and injuries, as well as the estimated costs of damage to property and crops. Data is summarised across the entire US for the years 1950 through 2011, inclusive, by nominally categorised storm event type, regardless of magnitude. Over this term, net fatality and injury counts are in excess of 5600 and 91000, respectively, while net damage to property and crops is estimated to be in excess of US$144 bn and US$5 bn, respectively. We present descriptive statistics plots for each summarised impact area, which imply that no normal distribution can be attributed to the impacts when grouped by event type. Finally, we highlight the five storm event types with the greatest impact upon the health and economic metrics.
The analysis and charting undertaken in this report require the following library packages:
library(data.table) # Encapsulate data in table format
library(dplyr) # Table manipulation functions
library(purrr) # Functional extensions for R
library(stringr) # String manipulation extensions
library(ggplot2) # Plot generation
library(htmlTable) # Formatting tabular R output as HTML.
library(gridExtra) # Panel composition for ggplot output
library(ggpubr) # Extensions for ggplot panel composition
Data is sourced from the National Climatic Data Center and downloaded as a local, compressed CSV file. Presuming the file has been downloaded and decompressed into the current working directory, the raw data may be loaded via:
# fread returns a data.table directly; no data.table() wrapper is needed
data <- fread('repdata_data_StormData.csv',
              header = TRUE, strip.white = TRUE,
              blank.lines.skip = TRUE, check.names = FALSE,
              key = NULL, stringsAsFactors = FALSE)
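Should the compressed source file not already be present, it may first be fetched and decompressed. A minimal sketch, assuming the course mirror URL commonly used for this dataset (an assumption; substitute the NCDC source as appropriate) and the R.utils package for decompression:
# Fetch the bzip2-compressed CSV if absent (URL is an assumption)
if (!file.exists('repdata_data_StormData.csv.bz2')) {
  download.file(
    'https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2',
    destfile = 'repdata_data_StormData.csv.bz2', mode = 'wb')
}
# Decompress so that fread may read the plain CSV
R.utils::bunzip2('repdata_data_StormData.csv.bz2',
                 destname = 'repdata_data_StormData.csv',
                 remove = FALSE, overwrite = TRUE)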
The uncompressed file is rather large (535.62 MB) and will therefore require a minimum of approximately 1.05 GB of working memory. Moreover, a number of columns are of variable-length character type, so additional memory headroom beyond that figure should be allowed for.
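As a quick check of the in-memory footprint once loaded:
# Approximate size of the loaded table in memory
format(object.size(data), units = "auto")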
Essentially, we wish only to correlate event type with the population health and economic indicators. The analysis is therefore not spatially resolved but spans the entire US. As to temporal analysis, we produce no static time series or animated plots (such as bubble charts) to expose temporal cycles in event type versus the health or economic indicators; in this exploratory data analysis we consider only the impact accumulated over the entire term of the dataset.
Event type is encoded as the column EVTYPE (which our code renames to Event), a discrete, categorical variable. The indicators for ‘population health’ are encoded in the columns FATALITIES (which our code renames to Fatalities) and INJURIES (which our code renames to Injuries). These are discrete, quantitative, dimensionless counts.
There are four columns that encode economic indicators:

- PROPDMG (which our code renames to Property) is a continuous, quantitative measure of the dollar damage reported for a storm event. Associated with it is PROPDMGEXP, which has distinct values such as “K” and “M” signifying a multiplicand (thousands or millions) to be applied to the reported dollar amount.
- CROPDMG (which our code renames to Crop) and CROPDMGEXP, whose content and meaning are as for the property damage pair.

Essentially, the economic consequence reported upon will be carried by the two variables Property and Crop, giving related damage estimates in dollar values. Given the nominally encoded exponent factor present in the columns suffixed EXP, we need to normalise the scale of the dollar estimates. Unfortunately, no metadata appears to exist to characterise the distinct exponent values used (not helped by the current US government shutdown affecting access to on-line Federal resources).
For this exercise we’ll use the mapping of the values of CROPDMGEXP and PROPDMGEXP stated in a colleague’s submission, covering the exponent factors H, h, K, k, M, m, B, b, +, -, ?, the digits 0 through 8, and a blank/empty character:
- H, h -> hundreds = 10^2
- K, k -> kilos = thousands = 10^3
- M, m -> millions = 10^6
- B, b -> billions = 10^9
- “+” -> 1
- “-” -> 0
- “?” -> 0
- blank/empty -> 0
- digits 0 through 8 -> 10

Therefore, we need a function to evaluate a multiplier and apply it to scale a number, for which we encode:
scale <- function(exp, x){
  # Map the nominal exponent code to its numeric multiplier,
  # then apply it to the reported dollar amount
  case_when(
    grepl("^[Bb]", exp) ~ 10^9,
    grepl("^[Mm]", exp) ~ 10^6,
    grepl("^[Kk]", exp) ~ 10^3,
    grepl("^[Hh]", exp) ~ 10^2,
    grepl("^[+]", exp) ~ 1,
    grepl("^[012345678]", exp) ~ 10,
    TRUE ~ 0  # "-", "?" and blank all zero the estimate
  ) * x }
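For example, a reported figure of 25 with exponent “K” scales to 25,000, while unmapped exponents zero the estimate:
scale("K", 25)   # 25000
scale("m", 1.5)  # 1500000
scale("?", 10)   # 0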
Given our damage indicator scaling, we may now undertake the column sub-setting of the raw data and its aggregation by event type into summaries containing the (scale-normalised, where necessary) indicator metrics. Firstly, we produce our primary analytical data set…
subset <-
mutate(data,
Event = as.factor(str_trim(str_to_title(EVTYPE))),
Fatalities = FATALITIES,
Injuries = INJURIES,
Property = scale(PROPDMGEXP, PROPDMG),
Crop = scale(CROPDMGEXP, CROPDMG) ) %>%
select(Event,
Fatalities,
Injuries,
Property,
Crop ) %>%
group_by(Event) %>%
summarise(
Fatalities = sum(Fatalities),
Injuries = sum(Injuries),
Property = sum(Property),
Crop = sum(Crop) )
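As a sanity check against the headline figures quoted in the opening summary, the national totals can be recomputed from the aggregated set:
# National totals across all event types
summarise(subset,
          Fatalities = sum(Fatalities), Injuries = sum(Injuries),
          Property = sum(Property), Crop = sum(Crop))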
For the purposes of producing descriptive statistics, the range of values of the encoded metrics varies widely, both between and within metrics; in the latter case, by several orders of magnitude. As our purpose is an exploratory, descriptive review, we apply a log10 transformation to all our metrics so that all are similarly scaled. However, to avoid introducing negative infinities into the analytic data, we split the subset data into two tables (one containing the health indicators, the other the economic indicators) and, for each, retain only the event types where both paired metric values are positive. The subsequent merge is an inner join on Event, so only event types with all four metrics populated survive:
reportData <-
merge(subset %>%
select(Event, Property, Crop) %>%
filter(Property > 0 & Crop > 0 ) %>%
mutate(Property = log10(Property),
Crop = log10(Crop) ),
subset %>%
select(Event, Fatalities, Injuries) %>%
filter(Fatalities > 0 & Injuries > 0 ) %>%
mutate(Fatalities = log10(Fatalities),
Injuries = log10(Injuries) ) )
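A quick check confirms how many event types survive this inner join:
# Number of event types retained in the analytic report data
nrow(reportData)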
In addition to the scale function, we declare a number of supporting functions. The first, edaPlots, generates a histogram (“Hist”), box plot (“Box”) or Q-Q plot (“QQ”) for any one of the metrics Fatalities, Injuries, Property and Crop:
edaPlots <- function(metric,chart){
ds <- reportData[,c("Event",metric)]
colnames(ds) <- c("Event","Metric")
metricColour <-
switch(metric,
"Fatalities" = "red",
"Injuries" = "orange",
"Property" = "green",
"Crop" = "blue" )
xlabel <-
switch(metric,
"Fatalities" = "\nFatality Count",
"Injuries" = "\nInjury Count",
"Property" = "\nProperty Damage",
"Crop" = "\nCrop Damage")
g <- switch(chart,
"Hist" = {
ggplot(ds, aes(Metric)) + theme_bw() +
geom_histogram(
aes(y = ..density..),
binwidth = 1, fill = metricColour) +
labs(x="",y="") +
stat_function(
fun = dnorm,
args = list(mean = mean(ds$Metric,na.rm = TRUE),
sd = sd(ds$Metric, na.rm = TRUE)),
colour = "black", size = 1) +
scale_x_continuous(labels = function(x){sprintf("%.0f",x)}) +
scale_y_continuous(labels = function(y){sprintf("%.2f",y)}) },
"Box" = {
ggplot(ds) + theme_bw() +
# Draw a single box at x = mean(Metric); inheriting a continuous
# x = Metric aesthetic would otherwise split the data into degenerate boxes
geom_boxplot(aes(x=mean(Metric),y=Metric)) +
geom_point(aes(x=mean(Metric),y=Metric),
size = 1, shape = 19,show.legend = F,
colour = metricColour,alpha=0.3) +
labs(x = "",y = "") +
theme(axis.text.x = element_blank(),
axis.ticks = element_blank()) +
scale_x_continuous(labels = function(x){sprintf("%.0f",x)}) +
scale_y_continuous(labels = function(y){sprintf("%.2f",y)}) },
"QQ" = {
ggplot(data=ds,aes(sample=Metric)) + theme_bw() +
stat_qq(show.legend = F,colour=metricColour) +
stat_qq_line() +
theme(axis.text.x = element_blank(),
axis.ticks.x = element_blank()) +
labs(x=xlabel,y="") +
scale_x_continuous(labels = function(x){sprintf("%.0f",x)}) +
scale_y_continuous(labels = function(y){sprintf("%.2f",y)}) } )
return(
if(metric == "Fatalities"){
switch(chart,
"Hist" = g + labs(y="Histograms\n"),
"Box" = g + labs(y="Box Plots\n"),
"QQ" = g + labs(y="Q-Q Plots\n") )
} else g ) }
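Any single panel may be previewed in isolation before composing the full grid of charts, for example:
edaPlots("Property", "Hist")  # standalone histogram of log10 property damage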
topFiveData <- function(metric){
# Select the named metric, back-transform it from the log10 scale
# (economic metrics to billions of US$) and keep the top five values
select(reportData,
Event = c("Event"),
Metric=c(metric)) %>%
mutate(Metric =
case_when(
metric %in% c("Property","Crop") ~ 10^(Metric - 9),
metric %in% c("Fatalities","Injuries") ~ 10^Metric)) %>%
arrange(desc(Metric)) %>%
head(5) }
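For example, to list the five most lethal event types, with counts back-transformed from the log10 scale:
topFiveData("Fatalities")  # a five-row table of Event and fatality count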
topFivePlot <- function(metric){
pal <- colorRampPalette(c('yellow','red'))
ds <- topFiveData(metric)
g <- ggplot(data=ds,
aes(x=reorder(Event,Metric),y=Metric,
fill=Metric)) +
# scale_fill_gradientn accepts the full colour ramp; gradient2 expects
# single low/mid/high colours and would reject pal(100)
scale_fill_gradientn(colours=pal(100)) +
geom_bar(show.legend = F,stat='identity') +
theme_bw() +
labs(x="",
y=
case_when(
metric == "Fatalities" ~ "Fatality Count",
metric == "Injuries" ~ "Injury Count",
metric == "Property" ~ "Property Damage (Billions US$)",
metric == "Crop" ~ "Crop Damage (Billions US$)"
))
# Only one coordinate system applies per plot, so the zoomed limits for the
# economic metrics must be supplied to coord_flip itself
if(metric %in% c("Property","Crop")){
g + coord_flip(ylim=c(min(ds$Metric),max(ds$Metric)))
} else {
g + coord_flip() } }
With the foregoing, we may now present the results of the analysis…
With the merged, log10-transformed and scale-normalised metrics persisted in the reportData dataset, we may now tabulate descriptive summary statistics for each metric, summarised by the categorical event type:
options(table_counter = TRUE)
print(htmlTable(summary(reportData),
caption="Summary, descriptive statistics for storm event type metrics",
tfoot=paste("<i>All impact measurement values have been",
"log<sub>10</sub> transformed. Fatalities and",
"injuries are\nrecorded as counts. Property",
"and Crop values relate to the estimate of",
"US$ damage incurred.</i>"),
col.rgroup = c("none", "#DADADA"),
css.table = "text-align: center; margin-left: auto; margin-right: auto;",
align = "rrrrr",
css.cell = "padding-left: .5em; padding-right: .2em;"))
Table 1: Summary, descriptive statistics for storm event type metrics

| Event | Property | Crop | Fatalities | Injuries |
|---|---|---|---|---|
| Blizzard : 1 | Min. : 4.176 | Min. :3.699 | Min. :0.0000 | Min. :0.000 |
| Cold/Wind Chill: 1 | 1st Qu.: 6.765 | 1st Qu.:5.947 | 1st Qu.:0.8094 | 1st Qu.:1.176 |
| Dry Microburst : 1 | Median : 8.200 | Median :7.366 | Median :1.4828 | Median :1.956 |
| Dust Storm : 1 | Mean : 8.045 | Mean :7.222 | Mean :1.4692 | Mean :1.989 |
| Excessive Heat : 1 | 3rd Qu.: 9.583 | 3rd Qu.:8.614 | 3rd Qu.:2.0107 | 3rd Qu.:2.959 |
| Extreme Cold : 1 | Max. :11.160 | Max. :9.753 | Max. :3.7507 | Max. :4.961 |
| (Other) :52 | | | | |

*All impact measurement values have been log10 transformed. Fatalities and injuries are recorded as counts. Property and Crop values relate to the estimate of US$ damage incurred.*
Reviewing the summary statistics, it should be borne in mind that, because of the log10 transform, there remains wide variation in the underlying metric values, which are recovered by the inverse transform as 10^value. Thus, although means and medians may appear ‘close’ on the log scale, the corresponding raw values need not be. It is also not unreasonable to have event types for which there are no fatalities or injuries, or, indeed, vice versa.
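To make the back-transformation concrete, the fatality median and mean of Table 1 are separated by only 0.0136 on the log scale, yet differ by roughly a whole fatality on the count scale:
# Inverse log10 transform of the Table 1 fatality median and mean
10^1.4828  # ~30.4 fatalities (median)
10^1.4692  # ~29.5 fatalities (mean)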
In order to better visualise the dispersion of metric values by event type, we plot, for each metric, a histogram, a box and whisker plot with an accompanying scatter of points, and a quantile-quantile (Q-Q) plot against a normal distribution. The histograms are drawn on a density scale, and we overlay, for reference, the normal distribution trace implied by each metric’s mean and standard deviation.
To invoke the chart generation we use the following code, in which we lay out the histograms, box plots and Q-Q plots in a row-wise fashion, with each metric shown column-wise in turn:
gg <- grid.arrange(
edaPlots("Fatalities","Hist"), edaPlots("Injuries","Hist"),
edaPlots("Property","Hist"), edaPlots("Crop","Hist"),
edaPlots("Fatalities","Box"), edaPlots("Injuries","Box"),
edaPlots("Property","Box"), edaPlots("Crop","Box"),
edaPlots("Fatalities","QQ"), edaPlots("Injuries","QQ"),
edaPlots("Property","QQ"), edaPlots("Crop","QQ"),
nrow=3,
top = text_grob(
paste("Descriptive Statistics Visualisation:",
"Health and Economic Impacts of Storm Events"),
color = "black",
face = "bold", size = 12),
bottom = text_grob(
paste("All figures are log tranformed and represent",
"a summation of total counts/US$ values by the storm event type.",
"\nHistograms depict a value density distribution",
"based upon a nomral distribution. Box plots points",
"are centered \naround the variable mean. The Q-Q plots",
"are based upon a theoretical match against a",
"normal distibution."),
color = "black", face = "italic", size = 10) )
For planning and similar purposes, we now identify and highlight the top five most ‘costly’ impacts for each of our metrics. The code below generates another faceted plot depicting a bar chart, per metric, of the top five metric values showing the event types associated with them. Note that the values depicted in these plots are the actual counts and dollar values - not the log10 transformed values and, for economic impacts, are shown in billions of US$.
gp <- grid.arrange(
topFivePlot("Fatalities"), topFivePlot("Property"),
topFivePlot("Injuries"), topFivePlot("Crop"),
nrow=2,
top = text_grob(
paste("Top Five Storm Event Types by Impact"),
color = "black",
face = "bold", size = 12),
bottom = text_grob(
paste("All figures are a summation of total values",
" by the storm event type."),
color = "black", face = "italic", size = 10) )
It is relatively clear that, when considering the complete set of storm event types for which we have non-zero metric values, a normal distribution cannot realistically (or logically) be attributed to any of the metric distributions. Rather, this exploratory overview indicates that a more detailed, temporal analysis may be more profitable in highlighting trends and associations between metrics and particular storm event types with significantly high metric values, for example, a selection of the event types whose metric values fall above the 95th percentile.