We undertake a brief analysis of the impact of storm events upon reported fatalities and injuries, as well as the estimated costs of damage to property and crops. Data is summarised across the entire US for the years 1950 through 2011, inclusive, by nominally categorised storm event type, regardless of magnitude. Over this term, net fatality and injury counts are in excess of 5600 and 91000, respectively, while net damage to property and crops is estimated to be in excess of US$144 bn and US$5 bn, respectively. We present descriptive statistics plots for each summarised impact area, which imply that no normal distribution can be attributed to the impacts when grouped by event type. Finally, we highlight the five storm event types with the greatest impact upon the health and economic metrics.
The analysis and charting undertaken in this report require the following library packages:
library(data.table) # Encapsulate data in table format
library(dplyr) # Table manipulation functions
library(purrr) # Functional extensions for R
library(stringr) # String manipulation extensions
library(ggplot2) # Plot generation
library(htmlTable) # Formatting tabular R output as HTML.
library(gridExtra) # Panel composition for ggplot output
library(ggpubr) # Extensions for ggplot panel composition
Data is sourced from the National Climatic Data Center and downloaded as a local, compressed CSV file. Presuming the file has been downloaded and decompressed into the current working directory, the raw data may be loaded via:
# fread returns a data.table directly; no data.table() wrapper is needed
data <- fread('repdata_data_StormData.csv',
              header = TRUE, strip.white = TRUE,
              blank.lines.skip = TRUE, check.names = FALSE,
              key = NULL, stringsAsFactors = FALSE)
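Should the compressed source file not already be present, it may first be fetched and decompressed. A minimal sketch, assuming the course mirror URL commonly used for this dataset (an assumption; substitute the NCDC source as appropriate) and the R.utils package for decompression:
# Fetch the bzip2-compressed CSV if absent (URL is an assumption)
if (!file.exists('repdata_data_StormData.csv.bz2')) {
  download.file(
    'https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2',
    destfile = 'repdata_data_StormData.csv.bz2', mode = 'wb')
}
# Decompress so that fread may read the plain CSV
R.utils::bunzip2('repdata_data_StormData.csv.bz2',
                 destname = 'repdata_data_StormData.csv',
                 remove = FALSE, overwrite = TRUE)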
The uncompressed file is rather large (535.62 MB) and will therefore require a minimum of approximately 1.05 GB of working memory. Moreover, a number of columns are of variable-length character type, so additional memory headroom beyond that figure should be allowed for.
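As a quick check of the in-memory footprint once loaded:
# Approximate size of the loaded table in memory
format(object.size(data), units = "auto")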
Essentially, we wish only to correlate event type with the population health and economic indicators. The analysis is therefore not spatially resolved but spans the entire US. As to temporal analysis, we produce no static time series or animated plots (such as bubble charts) to expose temporal cycles in event type versus the health or economic indicators; in this exploratory data analysis we consider only the impact accumulated over the entire term of the dataset.
Event type is encoded as the column EVTYPE (which our code renames to Event), a discrete, categorical variable. The indicators for ‘population health’ are encoded in the columns FATALITIES (which our code renames to Fatalities) and INJURIES (which our code renames to Injuries). These are discrete, quantitative, dimensionless counts.
There are four columns that encode economic indicators:

- PROPDMG (which our code renames to Property) is a continuous, quantitative measure of the dollar damage reported for a storm event. Associated with it is PROPDMGEXP, which has distinct values such as “K” and “M” signifying a multiplicand (thousands or millions) to be applied to the reported dollar amount.
- CROPDMG (which our code renames to Crop) and CROPDMGEXP, whose content and meaning are as for the property damage pair.

Essentially, the economic consequence reported upon will be carried by the two variables Property and Crop, giving related damage estimates in dollar values. Given the nominally encoded exponent factor present in the columns suffixed EXP, we need to normalise the scale of the dollar estimates. Unfortunately, no metadata appears to exist to characterise the distinct exponent values used (not helped by the current US government shutdown affecting access to on-line Federal resources).
For this exercise we’ll use the mapping of the values of CROPDMGEXP and PROPDMGEXP stated in a colleague’s submission, covering the exponent factors H, h, K, k, M, m, B, b, +, -, ?, the digits 0 through 8, and a blank/empty character:
- H, h -> hundreds = 10^2
- K, k -> kilos = thousands = 10^3
- M, m -> millions = 10^6
- B, b -> billions = 10^9
- “+” -> 1
- “-” -> 0
- “?” -> 0
- blank/empty -> 0
- digits 0 through 8 -> 10

Therefore, we need a function to evaluate a multiplier and apply it to scale a number, for which we encode:
scale <- function(exp, x){
  # Map the nominal exponent code to its numeric multiplier,
  # then apply it to the reported dollar amount
  case_when(
    grepl("^[Bb]", exp) ~ 10^9,
    grepl("^[Mm]", exp) ~ 10^6,
    grepl("^[Kk]", exp) ~ 10^3,
    grepl("^[Hh]", exp) ~ 10^2,
    grepl("^[+]", exp) ~ 1,
    grepl("^[012345678]", exp) ~ 10,
    TRUE ~ 0  # "-", "?" and blank all zero the estimate
  ) * x }
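For example, a reported figure of 25 with exponent “K” scales to 25,000, while unmapped exponents zero the estimate:
scale("K", 25)   # 25000
scale("m", 1.5)  # 1500000
scale("?", 10)   # 0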
Given our damage indicator scaling, we may now undertake the column sub-setting of the raw data and its aggregation by event type into summaries containing the (scale-normalised, where necessary) indicator metrics. Firstly, we produce our primary analytical data set…
subset <-
mutate(data,
Event = as.factor(str_trim(str_to_title(EVTYPE))),
Fatalities = FATALITIES,
Injuries = INJURIES,
Property = scale(PROPDMGEXP, PROPDMG),
Crop = scale(CROPDMGEXP, CROPDMG) ) %>%
select(Event,
Fatalities,
Injuries,
Property,
Crop ) %>%
group_by(Event) %>%
summarise(
Fatalities = sum(Fatalities),
Injuries = sum(Injuries),
Property = sum(Property),
Crop = sum(Crop) )
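As a sanity check against the headline figures quoted in the opening summary, the national totals can be recomputed from the aggregated set:
# National totals across all event types
summarise(subset,
          Fatalities = sum(Fatalities), Injuries = sum(Injuries),
          Property = sum(Property), Crop = sum(Crop))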
For the purposes of producing descriptive statistics, the range of values of the encoded metrics varies widely, both between and within metrics; in the latter case, by several orders of magnitude. As our purpose is an exploratory, descriptive review, we apply a log10 transformation to all our metrics so that all are similarly scaled. However, to avoid introducing negative infinities into the analytic data, we split the subset data into two tables (one containing the health indicators, the other the economic indicators) and, for each, retain only the event types where both paired metric values are positive. The subsequent merge is an inner join on Event, so only event types with all four metrics populated survive:
reportData <-
merge(subset %>%
select(Event, Property, Crop) %>%
filter(Property > 0 & Crop > 0 ) %>%
mutate(Property = log10(Property),
Crop = log10(Crop) ),
subset %>%
select(Event, Fatalities, Injuries) %>%
filter(Fatalities > 0 & Injuries > 0 ) %>%
mutate(Fatalities = log10(Fatalities),
Injuries = log10(Injuries) ) )
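A quick check confirms how many event types survive this inner join:
# Number of event types retained in the analytic report data
nrow(reportData)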
In addition to the scale function, we declare a number of supporting functions. The first, edaPlots, generates a histogram (“Hist”), box plot (“Box”) or Q-Q plot (“QQ”) for any one of the metrics Fatalities, Injuries, Property and Crop:
edaPlots <- function(metric,chart){
ds <- reportData[,c("Event",metric)]
colnames(ds) <- c("Event","Metric")
metricColour <-
switch(metric,
"Fatalities" = "red",
"Injuries" = "orange",
"Property" = "green",
"Crop" = "blue" )
xlabel <-
switch(metric,
"Fatalities" = "\nFatality Count",
"Injuries" = "\nInjury Count",
"Property" = "\nProperty Damage",
"Crop" = "\nCrop Damage")
g <- switch(chart,
"Hist" = {
ggplot(ds, aes(Metric)) + theme_bw() +
geom_histogram(
aes(y = ..density..),
binwidth = 1, fill = metricColour) +
labs(x="",y="") +
stat_function(
fun = dnorm,
args = list(mean = mean(ds$Metric,na.rm = TRUE),
sd = sd(ds$Metric, na.rm = TRUE)),
colour = "black", size = 1) +
scale_x_continuous(labels = function(x){sprintf("%.0f",x)}) +
scale_y_continuous(labels = function(y){sprintf("%.2f",y)}) },
"Box" = {
ggplot(ds) + theme_bw() +
# Draw a single box at x = mean(Metric); inheriting a continuous
# x = Metric aesthetic would otherwise split the data into degenerate boxes
geom_boxplot(aes(x=mean(Metric),y=Metric)) +
geom_point(aes(x=mean(Metric),y=Metric),
size = 1, shape = 19,show.legend = F,
colour = metricColour,alpha=0.3) +
labs(x = "",y = "") +
theme(axis.text.x = element_blank(),
axis.ticks = element_blank()) +
scale_x_continuous(labels = function(x){sprintf("%.0f",x)}) +
scale_y_continuous(labels = function(y){sprintf("%.2f",y)}) },
"QQ" = {
ggplot(data=ds,aes(sample=Metric)) + theme_bw() +
stat_qq(show.legend = F,colour=metricColour) +
stat_qq_line() +
theme(axis.text.x = element_blank(),
axis.ticks.x = element_blank()) +
labs(x=xlabel,y="") +
scale_x_continuous(labels = function(x){sprintf("%.0f",x)}) +
scale_y_continuous(labels = function(y){sprintf("%.2f",y)}) } )
return(
if(metric == "Fatalities"){
switch(chart,
"Hist" = g + labs(y="Histograms\n"),
"Box" = g + labs(y="Box Plots\n"),
"QQ" = g + labs(y="Q-Q Plots\n") )
} else g ) }
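Any single panel may be previewed in isolation before composing the full grid of charts, for example:
edaPlots("Property", "Hist")  # standalone histogram of log10 property damage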
topFiveData <- function(metric){
# Select the named metric, back-transform it from the log10 scale
# (economic metrics to billions of US$) and keep the top five values
select(reportData,
Event = c("Event"),
Metric=c(metric)) %>%
mutate(Metric =
case_when(
metric %in% c("Property","Crop") ~ 10^(Metric - 9),
metric %in% c("Fatalities","Injuries") ~ 10^Metric)) %>%
arrange(desc(Metric)) %>%
head(5) }
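For example, to list the five most lethal event types, with counts back-transformed from the log10 scale:
topFiveData("Fatalities")  # a five-row table of Event and fatality count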
topFivePlot <- function(metric){
pal <- colorRampPalette(c('yellow','red'))
ds <- topFiveData(metric)
g <- ggplot(data=ds,
aes(x=reorder(Event,Metric),y=Metric,
fill=Metric)) +
# scale_fill_gradientn accepts the full colour ramp; gradient2 expects
# single low/mid/high colours and would reject pal(100)
scale_fill_gradientn(colours=pal(100)) +
geom_bar(show.legend = F,stat='identity') +
theme_bw() +
labs(x="",
y=
case_when(
metric == "Fatalities" ~ "Fatality Count",
metric == "Injuries" ~ "Injury Count",
metric == "Property" ~ "Property Damage (Billions US$)",
metric == "Crop" ~ "Crop Damage (Billions US$)"
))
# Only one coordinate system applies per plot, so the zoomed limits for the
# economic metrics must be supplied to coord_flip itself
if(metric %in% c("Property","Crop")){
g + coord_flip(ylim=c(min(ds$Metric),max(ds$Metric)))
} else {
g + coord_flip() } }
With the foregoing, we may now present the results of the analysis…
With the merged, log10-transformed and scale-normalised metrics persisted in the reportData dataset, we may now tabulate descriptive summary statistics for each metric, summarised by the categorical event type:
options(table_counter = TRUE)
print(htmlTable(summary(reportData),
caption="Summary, descriptive statistics for storm event type metrics",
tfoot=paste("<i>All impact measurement values have been",
"log<sub>10</sub> transformed. Fatalities and",
"injuries are\nrecorded as counts. Property",
"and Crop values relate to the estimate of",
"US$ damage incurred.</i>"),
col.rgroup = c("none", "#DADADA"),
css.table = "text-align: center; margin-left: auto; margin-right: auto;",
align = "rrrrr",
css.cell = "padding-left: .5em; padding-right: .2em;"))
Table 1: Summary, descriptive statistics for storm event type metrics

| Event | Property | Crop | Fatalities | Injuries |
|---|---|---|---|---|
| Blizzard : 1 | Min. : 4.176 | Min. :3.699 | Min. :0.0000 | Min. :0.000 |
| Cold/Wind Chill: 1 | 1st Qu.: 6.765 | 1st Qu.:5.947 | 1st Qu.:0.8094 | 1st Qu.:1.176 |
| Dry Microburst : 1 | Median : 8.200 | Median :7.366 | Median :1.4828 | Median :1.956 |
| Dust Storm : 1 | Mean : 8.045 | Mean :7.222 | Mean :1.4692 | Mean :1.989 |
| Excessive Heat : 1 | 3rd Qu.: 9.583 | 3rd Qu.:8.614 | 3rd Qu.:2.0107 | 3rd Qu.:2.959 |
| Extreme Cold : 1 | Max. :11.160 | Max. :9.753 | Max. :3.7507 | Max. :4.961 |
| (Other) :52 | | | | |

*All impact measurement values have been log10 transformed. Fatalities and injuries are recorded as counts. Property and Crop values relate to the estimate of US$ damage incurred.*
Reviewing the summary statistics, it should be borne in mind that, because of the log10 transform, there remains wide variation in the underlying metric values, which are recovered by the inverse transform as 10^value. Thus, although means and medians may appear ‘close’ on the log scale, the corresponding raw values need not be. It is also not unreasonable to have event types for which there are no fatalities or injuries, or, indeed, vice versa.
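To make the back-transformation concrete, the fatality median and mean of Table 1 are separated by only 0.0136 on the log scale, yet differ by roughly a whole fatality on the count scale:
# Inverse log10 transform of the Table 1 fatality median and mean
10^1.4828  # ~30.4 fatalities (median)
10^1.4692  # ~29.5 fatalities (mean)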
In order to better visualise the dispersion of metric values by event type, we plot, for each metric, a histogram, a box and whisker plot with an accompanying scatter of points, and a quantile-quantile (Q-Q) plot against a normal distribution. The histograms are drawn on a density scale, and we overlay, for reference, the normal distribution trace implied by each metric’s mean and standard deviation.
To invoke the chart generation we use the following code, in which we lay out the histograms, box plots and Q-Q plots in a row-wise fashion, with each metric shown column-wise in turn:
gg <- grid.arrange(
edaPlots("Fatalities","Hist"), edaPlots("Injuries","Hist"),
edaPlots("Property","Hist"), edaPlots("Crop","Hist"),
edaPlots("Fatalities","Box"), edaPlots("Injuries","Box"),
edaPlots("Property","Box"), edaPlots("Crop","Box"),
edaPlots("Fatalities","QQ"), edaPlots("Injuries","QQ"),
edaPlots("Property","QQ"), edaPlots("Crop","QQ"),
nrow=3,
top = text_grob(
paste("Descriptive Statistics Visualisation:",
"Health and Economic Impacts of Storm Events"),
color = "black",
face = "bold", size = 12),
bottom = text_grob(
paste("All figures are log tranformed and represent",
"a summation of total counts/US$ values by the storm event type.",
"\nHistograms depict a value density distribution",
"based upon a nomral distribution. Box plots points",
"are centered \naround the variable mean. The Q-Q plots",
"are based upon a theoretical match against a",
"normal distibution."),
color = "black", face = "italic", size = 10) )
For planning and similar purposes, we now identify and highlight the top five most ‘costly’ impacts for each of our metrics. The code below generates another faceted plot depicting a bar chart, per metric, of the top five metric values showing the event types associated with them. Note that the values depicted in these plots are the actual counts and dollar values - not the log10 transformed values and, for economic impacts, are shown in billions of US$.
gp <- grid.arrange(
topFivePlot("Fatalities"), topFivePlot("Property"),
topFivePlot("Injuries"), topFivePlot("Crop"),
nrow=2,
top = text_grob(
paste("Top Five Storm Event Types by Impact"),
color = "black",
face = "bold", size = 12),
bottom = text_grob(
paste("All figures are a summation of total values",
" by the storm event type."),
color = "black", face = "italic", size = 10) )
It is relatively clear that, when considering the complete set of storm event types for which we have non-zero metric values, a normal distribution cannot realistically (or logically) be attributed to any of the metric distributions. Rather, this exploratory overview indicates that a more detailed, temporal analysis may be more profitable in highlighting trends and associations between metrics and particular storm event types with significantly high metric values, for example, a selection of the event types whose metric values fall above the 95th percentile.