Exploring The Human and Economic Cost of Bad Weather Events.

Synopsis

The National Oceanic & Atmospheric Administration in the United States stores data of abnormal weather events, This information includes property damage caused either directly or indirectly by the event, Effects to human health such as injury or fatality that occured either directly or indirectly due to the event.
This is an Exploratory Data Analysis of one such dataset, it contains data from 1950 to 2015. We will be answering two main questions:
‘What is the most expensive(economic cost) weather event?’,
and ‘What weather event affects human health (injuries of fatalities) the most?’.
We will use first group the costs by weather event, then calculate the percentages, mean values, and total values of the costs(crop damage, property damage, fatalities, injuries), event occurences, and cost types (economic and human) to advise our final decision on what the most costly event(s) is/are. we will also employ visualisations to see what the different costs associated with the events, and the distributions of the costs.

Data Processing

Loading the data

We first download the datasetS from the repository using this link
The dataset is stored as StormData.csv.bz2 and read directly into dataset using read.csv

# If the dataset file does not exist, download it and save it as StormData
if(!file.exists('StormData.csv.bz2')){
  download.file('https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2',
                'StormData.csv.bz2')
}
# Read the dataset directly from the compressed file
dataset <- read.csv('StormData.csv.bz2', sep = ',')

Data Cleaning

In order for us to glean any useful insights from the data we must first clean it by first removing all
unnecessary data from the read dataset.
This dataset contains information about when and where the weather event occured, and on unusual events
that have occured but do not have an accompanying human, or economic cost.

We therefore need to remove all rows that do not document a cost to health. life, property, or crops.
We must also remove all rows that do not give us any information necessary to answering our question.
That means that we get rid of all columns that do not record a state name, county name, weather event,
injury, fatality, crop damage, or a property damage.

# Streamline the dataset.
# Set the names of the columns needed
needed <- c('STATE', 'COUNTYNAME', 'EVTYPE', 'FATALITIES', 'INJURIES', 'PROPDMG', 'CROPDMG')
# Set the rows with a cost to health or economy to TRUE, the rest to FALSE
wanted <- ifelse(dataset$FATALITIES > 0, TRUE,
                 ifelse(dataset$INJURIES > 0, TRUE,
                        ifelse(dataset$PROPDMG > 0, TRUE,
                               ifelse(dataset$CROPDMG > 0, TRUE, FALSE))))
# Maintain only the columns that are neccessary for this analysis, removing rows with no cost to Health
# or economy
dataset <- dataset[wanted , needed]

Data Transformation

Because we are trying to find the weather events that cost the most economically
and in terms of human health, we need to transform the dataset into a more easily summarised form.
In this case it means we will group the data by state, county, and weather event. This will lend itself
to an easier analysis.

# Create summary aiding datasets
library(dplyr, warn.conflicts = F, quietly = T)

# Get event level statistics
by.event <- dataset %>% group_by(EVTYPE) %>% 
  summarise(Fatalities = sum(FATALITIES), Injuries = sum(INJURIES), PropDMG = sum(PROPDMG)
            , CropDMG = sum(CROPDMG)) %>%
  mutate(Economic.cost = CropDMG + PropDMG, Human.cost = Injuries + Fatalities,
         Count = vector(mode = 'numeric', length = length(unique(dataset$EVTYPE))))

Data Analysis

Our next step is the creation of the variables that will store the results of our analysis
of the datasets. We will find the median, mean, and total values for our desired costs. These values will influence
our conclusion and be presented in our results in the next section.

# Nation wide statistics
# Total number of events
all.events <- sum(!is.na(dataset$EVTYPE))
# Count the total number of event occurences per event.
for(event in by.event$EVTYPE){
  by.event[event == by.event$EVTYPE, 'Count'] <- sum(dataset$EVTYPE == event)
}
# Total costs of all events
all.fatalities <- sum(dataset$FATALITIES)
all.injuries <- sum(dataset$INJURIES)
all.cropdmg <- sum(dataset$CROPDMG)
all.prop <- sum(dataset$PROPDMG)
human.cost <- sum(all.fatalities, all.injuries)
economic.cost <- sum(all.cropdmg, all.prop)
hc.mean <- mean(dataset$INJURIES + dataset$FATALITIES)
ec.mean <- mean(dataset$CROPDMG + dataset$PROPDMG)

# Percentage statistics
by.event <- mutate(by.event, perc.hc = (by.event$Human.cost / human.cost) * 100,
                   perc.ec = (by.event$Economic.cost / economic.cost) * 100,
                   perc.injur = (by.event$Injuries / all.injuries) * 100,
                   perc.fatal = (by.event$Fatalities / all.fatalities) * 100,
                   perc.crops = (by.event$CropDMG / all.cropdmg) * 100,
                   perc.prop = (by.event$PropDMG / all.prop) * 100, 
                   perc.event = (by.event$Count / all.events) * 100)

# Get event level statistics
event.mean <- dataset %>% group_by(EVTYPE) %>% 
  summarise(Fatalities = mean(FATALITIES), Injuries = mean(INJURIES), PropDMG = mean(PROPDMG)
            , CropDMG = mean(CROPDMG), Economic.cost = mean(CROPDMG + PROPDMG), 
            Human.cost = mean(INJURIES + FATALITIES))

event.mean <- mutate(event.mean, perc.hc = (event.mean$Human.cost / hc.mean) * 100,
                     perc.ec = (event.mean$Economic.cost / ec.mean) * 100)

Visualisations

Because the events each have a different number of occurences, the answer to the questions: ‘What is the most economically costly weather event?’ and
‘What is the most costly Weather event in terms of human health?’ are somewhat relative to how many
times each event occurs. We will therefore explor both questions from multiple view points and try
to see which event is more costly in total, on average, and as a percentage of the total cost incurred
(be it economical, human, or their respective constituent costs).

I order to do that however we have to prepare a few exploratory graphs so as to gain a visual
Understanding of the data.

library(ggplot2, warn.conflicts = F, quietly = T)
library(scales, warn.conflicts = F, quietly = T)
library(gridExtra, warn.conflicts = F, quietly = T)
require(grid, quietly = TRUE)

# Make the scales be real numbered instead of scientific notation
point <- format_format(big.mark = ' ', decimal.mark = '.', scientific = FALSE)

# Create a human cost plots with event type counts.
hc.fig1 <- 
ggplot(data = by.event, aes(x = Injuries, y = Fatalities, color = Human.cost)) + 
  geom_count(alpha = 0.64) + ggtitle('Total Human Cost by Event') + 
  scale_color_gradientn('Human Cost', colours = c('green', 'yellow', 'orange', 'red'), 
                        guide = 'colourbar') + scale_size(name = 'Number of Events')
  
# Create a mean human cost plot with event type counts.
hc.fig2 <-  
  ggplot(data = event.mean, aes(x = Injuries, y = Fatalities, color = Human.cost)) + 
  geom_count(alpha = 0.64) +
  ggtitle('Mean Human Cost by Event') + 
  scale_color_gradientn('Mean Human Cost', colours = c('green', 'yellow', 'orange', 'red'), 
                        guide = 'colourbar') + 
  ylab('Mean Fatalities') + xlab('Mean Injuries') + scale_size(name = 'Number of Events')

# Get the top ten human cost percentages for display.
plotdf <- by.event[order(by.event$perc.hc, decreasing = T),][1:10,]

# Make a plot of the top ten Percentages.
hc.fig3 <- ggplot(plotdf, aes(x = EVTYPE, y = perc.hc)) + 
  geom_bar(stat = "identity", width = .3) + 
  theme(axis.text.x = element_text(color = "black", angle = 90, vjust = .5, hjust = 1)) +
  labs(x = "Event type", y = "Percentage of Human Cost", 
              title = "Top 10 Events by Human Cost Percentage")

# Get the top ten mean human cost percentages for display.
plotdf <- event.mean[order(by.event$perc.hc, decreasing = T),][1:10,]

# Make a plot of the top ten Percentages.
hc.fig4 <- ggplot(plotdf, aes(x = EVTYPE, y = perc.hc)) + 
  geom_bar(stat = "identity", width = .3) + 
  theme(axis.text.x = element_text(color = "black", angle = 90, vjust = .5, hjust = 1)) +
  labs(x = "Event type", y = "Percentage of Mean Human Cost", 
              title = "Top 10 Events by Mean Human Cost Percentage")


ec.fig1 <- ggplot(data = by.event, aes(x = CropDMG, y = PropDMG, color = Economic.cost)) + 
  geom_count(alpha = 0.64) + ggtitle('Total Economic Cost by Event') + 
  scale_color_gradientn('Economic Cost', colours = c('green', 'yellow', 'orange', 'red'), 
                        guide = 'colourbar', labels = point) + 
  ylab('Property Damage') + xlab('Crop Damage') + scale_size(name = 'Number of Events') +
  scale_x_continuous(labels = point) + scale_y_continuous(labels = point)

ec.fig2 <- 
ggplot(data = event.mean, aes(x = Injuries, y = Fatalities, color = Economic.cost)) + 
  geom_count(alpha = 0.64) + ggtitle('Mean Economic Cost by Event') + 
  scale_color_gradientn('Mean Economic Cost', colours = c('green', 'yellow', 'orange', 'red'), 
                        guide = 'colourbar') + 
  ylab('Mean Property Damage') + xlab('Mean Crop Damage') + scale_size(name = 'Number of Events')

# Get the top ten economic cost percentages for display.
plotdf <- by.event[order(by.event$perc.ec, decreasing = T),][1:10,]

ec.fig3 <- ggplot(plotdf, aes(x = EVTYPE, y = perc.ec)) + 
  geom_bar(stat = "identity", width = .3) + 
  theme(axis.text.x = element_text(color = "black", angle = 90, vjust = .5, hjust = 1)) +
  labs(x = "Event type", y = "Percentage of Economic Cost", 
              title = "Top 10 Events by Economic Cost Percentage")

# Get the top ten mean economic cost percentages for display.
plotdf <- event.mean[order(by.event$perc.ec, decreasing = T),][1:10,]

# Make a plot of the top ten Percentages.
ec.fig4 <- ggplot(plotdf, aes(x = EVTYPE, y = perc.ec)) + 
  geom_bar(stat = "identity", width = .3) + 
  theme(axis.text.x = element_text(color = "black", angle = 90, vjust = .5, hjust = 1)) +
  labs(x = "Event type", y = "Percentage of Mean Economic Cost", 
              title = "Top 10 Events by Mean Economic Cost Percentage")

# Combine figures into one graph
# We're going to wrap the grid.arrange graph combinations in a ggsave()
# We do this because gridExtra ignores the 'newpage = F' and keeps printing the figures when I knit.
ggsave(filename = 'spread_analysis.png',
       plot = grid.arrange(grid.arrange(hc.fig1, hc.fig2, hc.fig3, hc.fig4, nrow = 2, newpage = F),
                     grid.arrange(ec.fig1, ec.fig2, ec.fig3, ec.fig4, nrow = 2, newpage = F),
                     nrow = 2, newpage = F, 
                     top = textGrob('Event spread analysis graphs',
                                    gp = gpar(fontsize = 15, font = 8))),
       device = 'png', path = 'figs/',
       width = 30, height = 40, units = 'cm', dpi = 320)

# Final graphs
plotdf <- by.event[order(by.event$Human.cost, decreasing = T),][1:10,]
g1 <- ggplot(plotdf, aes(x = EVTYPE, y = Human.cost)) + 
  geom_bar(stat = "identity", width = .3) + 
  theme(axis.text.x = element_text(color = "black", angle = 90, vjust = .5, hjust = 1)) +
  labs(x = "Event type", y = "Human Cost", 
              title = "Top 10 Human costs of Weather Events")

plotdf <- by.event[order(by.event$Economic.cost, decreasing = T),][1:10,]
g2 <- ggplot(plotdf, aes(x = EVTYPE, y = Economic.cost)) + 
  geom_bar(stat = "identity", width = .3) + 
  theme(axis.text.x = element_text(color = "black", angle = 90, vjust = .5, hjust = 1)) +
  labs(x = "Event type", y = "Economic Cost", 
              title = "Top 10 Economic costs of Weather Events") +
  scale_y_continuous(labels = point)

plotdf <- event.mean[order(event.mean$Human.cost, decreasing = T),][1:10,]
mg1 <- ggplot(plotdf, aes(x = EVTYPE, y = Human.cost)) + 
  geom_bar(stat = "identity", width = .3) + 
  theme(axis.text.x = element_text(color = "black", angle = 90, vjust = .5, hjust = 1)) +
  labs(x = "Event type", y = "Mean Human Cost", 
              title = "Top 10 Mean Human costs of Weather Events")

plotdf <- event.mean[order(event.mean$Economic.cost, decreasing = T),][1:10,]
mg2 <- ggplot(plotdf, aes(x = EVTYPE, y = Economic.cost)) + 
  geom_bar(stat = "identity", width = .3) + 
  theme(axis.text.x = element_text(color = "black", angle = 90, vjust = .5, hjust = 1)) +
  labs(x = "Event type", y = "Mean Economic Cost", 
              title = "Top 10 Mean Economic costs of Weather Events")


ggsave(filename = 'Human_costs.png', plot = grid.arrange(g1, mg1, nrow = 2, ncol = 1, newpage = F,
                   top = textGrob('Human Cost Graphs', gp = gpar(fontsize = 15, font = 8))),
       device = 'png', path = 'figs/', dpi = 320,
       width = 10, height = 10)
ggsave(filename = 'Economic_costs.png', plot = grid.arrange(g2, mg2, ncol = 1, nrow = 2, newpage = F,
                   top = textGrob('Economic Cost Graphs', gp = gpar(fontsize = 15, font = 8))),
       device = 'png', path = 'figs/', dpi = 320,
       width = 10, height = 10)

Explanatory variables

Visualisations are only half the picture, we also need to gain a descriptive understanding and as such we will need to get
the names of the events with the highest cost according to a certain criteria.

# Cost maximums
# Maximum Crop damage

crop.max <- max(by.event$CropDMG)
crop.maxname <- by.event[by.event$CropDMG == crop.max, 'EVTYPE']$EVTYPE
crop.max <- format(round(crop.max, 2), nsmall = 2)

# Maximum property damage
property.max <- max(by.event$PropDMG)
property.maxname <- by.event[by.event$PropDMG == property.max, 'EVTYPE']$EVTYPE
property.max <- format(round(property.max, 2), nsmall = 2)

# Maximum Economic cost
economy.max <- max(by.event$Economic.cost)
economy.maxname <- by.event[by.event$Economic.cost == economy.max, 'EVTYPE']$EVTYPE
economy.max <- format(round(economy.max, 2), nsmall = 2)

# Maximum Fatalities
fatal.max <- max(by.event$Fatalities)
fatal.maxname <- by.event[by.event$Fatalities == fatal.max, 'EVTYPE']$EVTYPE
fatal.max <- round(fatal.max, 2)

# Maximum Injuries
injury.max <- max(by.event$Injuries)
injury.maxname <- by.event[by.event$Injuries == injury.max, 'EVTYPE']$EVTYPE
injury.max <- format(round(injury.max, 2) , nsmall = 2)

# Maximum Human cost
human.max <- max(by.event$Human.cost)
human.maxname <- by.event[by.event$Human.cost == human.max, 'EVTYPE']$EVTYPE
human.max <- format(round(human.max, 2), nsmall = 2)

# Percentage maximums
# Event percentage
percevent.max <- max(by.event$perc.event)
percevent.maxname <- by.event[by.event$perc.event == percevent.max, 'EVTYPE']$EVTYPE
percevent.max <- round(percevent.max, 2)

# Property damage percentage
percprop.max <- max(by.event$perc.prop)
percprop.maxname <- by.event[by.event$perc.prop == percprop.max, 'EVTYPE']$EVTYPE
percprop.max <- round(percprop.max, 2)

# Crop damage percentage
perccrops.max <- max(by.event$perc.crops)
perccrops.maxname <- by.event[by.event$perc.crops == perccrops.max, 'EVTYPE']$EVTYPE
perccrops.max <- round(perccrops.max, 2)

# economic cost percentage
percec.max <- max(by.event$perc.ec)
percec.maxname <- by.event[by.event$perc.ec == percec.max, 'EVTYPE']$EVTYPE
percec.max <- round(percec.max, 2)

# human cost percentages
perchc.max <- max(by.event$perc.hc)
perchc.maxname <- by.event[by.event$perc.hc == perchc.max, 'EVTYPE']$EVTYPE
perchc.max <- round(perchc.max, 2)

# fatality percentages
percfatal.max <- max(by.event$perc.fatal)
percfatal.maxname <- by.event[by.event$perc.fatal == percfatal.max, 'EVTYPE']$EVTYPE
percfatal.max <- round(percfatal.max, 2)

# injury  percentages
percinjur.max <- max(by.event$perc.injur)
percinjur.maxname <- by.event[by.event$perc.injur == percinjur.max, 'EVTYPE']$EVTYPE
percinjur.max <- round(percinjur.max, 2)

# Mean Maximums
# Maximum Crop damage
crop.max_mean <- max(event.mean$CropDMG)
crop.maxname_mean <- event.mean[event.mean$CropDMG == crop.max_mean, 'EVTYPE']$EVTYPE
crop.max_mean <- round(crop.max_mean, 2)

# Maximum property damage
property.max_mean <- max(event.mean$PropDMG)
property.maxname_mean <- event.mean[event.mean$PropDMG == property.max_mean, 'EVTYPE']$EVTYPE
property.max_mean <- round(property.max_mean, 2)

# Maximum Economic cost
economy.max_mean <- max(event.mean$Economic.cost)
economy.maxname_mean <- event.mean[event.mean$Economic.cost == economy.max_mean, 'EVTYPE']$EVTYPE
economy.max_mean <- round(economy.max_mean, 2)

# Maximum Fatalities
fatal.max_mean <- max(event.mean$Fatalities)
fatal.maxname_mean <- event.mean[event.mean$Fatalities == fatal.max_mean, 
                                 'EVTYPE']$EVTYPE
fatal.max_mean <- round(fatal.max_mean, 2)

# Maximum Injuries
injury.max_mean <- max(event.mean$Injuries)
injury.maxname_mean <- event.mean[event.mean$Injuries == injury.max_mean, 'EVTYPE']$EVTYPE
injury.max_mean <- round(injury.max_mean, 2)

# Maximum Human cost
human.max_mean <- max(event.mean$Human.cost)
human.maxname_mean <- event.mean[event.mean$Human.cost == human.max_mean, 'EVTYPE']$EVTYPE
human.max_mean <- format(round(human.max_mean, 2), nsmall = 2)

Results

Interpretations

From the Event spread analysis graph, we can see that the costs of the events are
unevenly spread, so we will answer our core questions from two perspectives:
1. Overall Statistics, where we report the maximum total cost. 2. Per occurence statistics, where we report the maximum mean cost.

The Human Cost of Bad Weather Events

TORNADO caused the most fatalities with 5633 recorded
TORNADO caused the most injuries with 91346.00 recorded
TORNADO caused the most total injuries and fatalities with 96979.00 recorded
UNSEASONABLY WARM AND DRY caused the most mean fatalities with 29 recorded
Heat Wave caused the most mean injuries with 70 recorded
Heat Wave caused the most mean total injuries and fatalities with 70.00 recorded

The Economic Cost of Bad Weather Events

HAIL caused the most crop damage with 579596.28 dollars recorded in damages
TORNADO caused the most property damage with 3212258.16 dollars recorded in damages
TORNADO caused the most economic damage with 3312276.68 dollars recorded in damages
DUST STORM/HIGH WINDS, FOREST FIRES, HURRICANE FELIX, TROPICAL STORM GORDON, WINTER STORMS have a tie for the most mean crop damage with 500 dollars
COASTAL EROSION has the highest mean property damage with 3212258.16 dollars recorded in damages
HURRICANE FELIX, TROPICAL STORM GORDON, WINTER STORMS have a tie for the highest mean economic cost with 3312276.68
dollars recorded in damages

Meta Statistics

It is also important to note that

TSTM WIND had a 24.83% occurence rate of all recorded events.
TORNADO accounted for 37.19% of all recorded fatalities.
TORNADO accounted for 65% of all recorded injuries
TORNADO accounted for 62.3% of all recorded fatalities and injuries.
TORNADO cost 29.51% of all property damage recorded.
HAIL cost 42.07% of all crop damage recorded.
TORNADO cost 27.01% of all crop and property damage recorded.

After analysing the data, we can now pick the results that best answer our main questions.

Weather event with the highest human cost:

fig 1: Human Cost by event

We can Therfeore conlude that:
1. Heat Wave is responsible for the highest human cost per occurence
2. while TORNADO is responsible for the highest human cost overall.

Weather event with the highest human cost:

fig 2: Economic Cost by event

We can Therfeore conlude that:
1. HURRICANE FELIX, TROPICAL STORM GORDON, WINTER STORMS have a tie for the highest ecnonomic cost per occurence.
2. while TORNADO has the highest ecnonomic cost overall.