The National Oceanic & Atmospheric Administration in the United States stores data of abnormal weather events, This information includes property damage caused either directly or indirectly by the event, Effects to human health such as injury or fatality that occured either directly or indirectly due to the event.
This is an Exploratory Data Analysis of one such dataset, it contains data from 1950 to 2015. We will be answering two main questions:
‘What is the most expensive(economic cost) weather event?’,
and ‘What weather event affects human health (injuries of fatalities) the most?’.
We will use first group the costs by weather event, then calculate the percentages, mean values, and total values of the costs(crop damage, property damage, fatalities, injuries), event occurences, and cost types (economic and human) to advise our final decision on what the most costly event(s) is/are. we will also employ visualisations to see what the different costs associated with the events, and the distributions of the costs.
We first download the datasetS from the repository using this link
The dataset is stored as StormData.csv.bz2 and read directly into dataset using read.csv
# If the dataset file does not exist, download it and save it as StormData
if(!file.exists('StormData.csv.bz2')){
download.file('https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2',
'StormData.csv.bz2')
}
# Read the dataset directly from the compressed file
dataset <- read.csv('StormData.csv.bz2', sep = ',')
In order for us to glean any useful insights from the data we must first clean it by first removing all
unnecessary data from the read dataset.
This dataset contains information about when and where the weather event occured, and on unusual events
that have occured but do not have an accompanying human, or economic cost.
We therefore need to remove all rows that do not document a cost to health. life, property, or crops.
We must also remove all rows that do not give us any information necessary to answering our question.
That means that we get rid of all columns that do not record a state name, county name, weather event,
injury, fatality, crop damage, or a property damage.
# Streamline the dataset.
# Set the names of the columns needed
needed <- c('STATE', 'COUNTYNAME', 'EVTYPE', 'FATALITIES', 'INJURIES', 'PROPDMG', 'CROPDMG')
# Set the rows with a cost to health or economy to TRUE, the rest to FALSE
wanted <- ifelse(dataset$FATALITIES > 0, TRUE,
ifelse(dataset$INJURIES > 0, TRUE,
ifelse(dataset$PROPDMG > 0, TRUE,
ifelse(dataset$CROPDMG > 0, TRUE, FALSE))))
# Maintain only the columns that are neccessary for this analysis, removing rows with no cost to Health
# or economy
dataset <- dataset[wanted , needed]
Because we are trying to find the weather events that cost the most economically
and in terms of human health, we need to transform the dataset into a more easily summarised form.
In this case it means we will group the data by state, county, and weather event. This will lend itself
to an easier analysis.
# Create summary aiding datasets
library(dplyr, warn.conflicts = F, quietly = T)
# Get event level statistics
by.event <- dataset %>% group_by(EVTYPE) %>%
summarise(Fatalities = sum(FATALITIES), Injuries = sum(INJURIES), PropDMG = sum(PROPDMG)
, CropDMG = sum(CROPDMG)) %>%
mutate(Economic.cost = CropDMG + PropDMG, Human.cost = Injuries + Fatalities,
Count = vector(mode = 'numeric', length = length(unique(dataset$EVTYPE))))
Our next step is the creation of the variables that will store the results of our analysis
of the datasets. We will find the median, mean, and total values for our desired costs. These values will influence
our conclusion and be presented in our results in the next section.
# Nation wide statistics
# Total number of events
all.events <- sum(!is.na(dataset$EVTYPE))
# Count the total number of event occurences per event.
for(event in by.event$EVTYPE){
by.event[event == by.event$EVTYPE, 'Count'] <- sum(dataset$EVTYPE == event)
}
# Total costs of all events
all.fatalities <- sum(dataset$FATALITIES)
all.injuries <- sum(dataset$INJURIES)
all.cropdmg <- sum(dataset$CROPDMG)
all.prop <- sum(dataset$PROPDMG)
human.cost <- sum(all.fatalities, all.injuries)
economic.cost <- sum(all.cropdmg, all.prop)
hc.mean <- mean(dataset$INJURIES + dataset$FATALITIES)
ec.mean <- mean(dataset$CROPDMG + dataset$PROPDMG)
# Percentage statistics
by.event <- mutate(by.event, perc.hc = (by.event$Human.cost / human.cost) * 100,
perc.ec = (by.event$Economic.cost / economic.cost) * 100,
perc.injur = (by.event$Injuries / all.injuries) * 100,
perc.fatal = (by.event$Fatalities / all.fatalities) * 100,
perc.crops = (by.event$CropDMG / all.cropdmg) * 100,
perc.prop = (by.event$PropDMG / all.prop) * 100,
perc.event = (by.event$Count / all.events) * 100)
# Get event level statistics
event.mean <- dataset %>% group_by(EVTYPE) %>%
summarise(Fatalities = mean(FATALITIES), Injuries = mean(INJURIES), PropDMG = mean(PROPDMG)
, CropDMG = mean(CROPDMG), Economic.cost = mean(CROPDMG + PROPDMG),
Human.cost = mean(INJURIES + FATALITIES))
event.mean <- mutate(event.mean, perc.hc = (event.mean$Human.cost / hc.mean) * 100,
perc.ec = (event.mean$Economic.cost / ec.mean) * 100)
Because the events each have a different number of occurences, the answer to the questions: ‘What is the most economically costly weather event?’ and
‘What is the most costly Weather event in terms of human health?’ are somewhat relative to how many
times each event occurs. We will therefore explor both questions from multiple view points and try
to see which event is more costly in total, on average, and as a percentage of the total cost incurred
(be it economical, human, or their respective constituent costs).
I order to do that however we have to prepare a few exploratory graphs so as to gain a visual
Understanding of the data.
library(ggplot2, warn.conflicts = F, quietly = T)
library(scales, warn.conflicts = F, quietly = T)
library(gridExtra, warn.conflicts = F, quietly = T)
require(grid, quietly = TRUE)
# Make the scales be real numbered instead of scientific notation
point <- format_format(big.mark = ' ', decimal.mark = '.', scientific = FALSE)
# Create a human cost plots with event type counts.
hc.fig1 <-
ggplot(data = by.event, aes(x = Injuries, y = Fatalities, color = Human.cost)) +
geom_count(alpha = 0.64) + ggtitle('Total Human Cost by Event') +
scale_color_gradientn('Human Cost', colours = c('green', 'yellow', 'orange', 'red'),
guide = 'colourbar') + scale_size(name = 'Number of Events')
# Create a mean human cost plot with event type counts.
hc.fig2 <-
ggplot(data = event.mean, aes(x = Injuries, y = Fatalities, color = Human.cost)) +
geom_count(alpha = 0.64) +
ggtitle('Mean Human Cost by Event') +
scale_color_gradientn('Mean Human Cost', colours = c('green', 'yellow', 'orange', 'red'),
guide = 'colourbar') +
ylab('Mean Fatalities') + xlab('Mean Injuries') + scale_size(name = 'Number of Events')
# Get the top ten human cost percentages for display.
plotdf <- by.event[order(by.event$perc.hc, decreasing = T),][1:10,]
# Make a plot of the top ten Percentages.
hc.fig3 <- ggplot(plotdf, aes(x = EVTYPE, y = perc.hc)) +
geom_bar(stat = "identity", width = .3) +
theme(axis.text.x = element_text(color = "black", angle = 90, vjust = .5, hjust = 1)) +
labs(x = "Event type", y = "Percentage of Human Cost",
title = "Top 10 Events by Human Cost Percentage")
# Get the top ten mean human cost percentages for display.
plotdf <- event.mean[order(by.event$perc.hc, decreasing = T),][1:10,]
# Make a plot of the top ten Percentages.
hc.fig4 <- ggplot(plotdf, aes(x = EVTYPE, y = perc.hc)) +
geom_bar(stat = "identity", width = .3) +
theme(axis.text.x = element_text(color = "black", angle = 90, vjust = .5, hjust = 1)) +
labs(x = "Event type", y = "Percentage of Mean Human Cost",
title = "Top 10 Events by Mean Human Cost Percentage")
ec.fig1 <- ggplot(data = by.event, aes(x = CropDMG, y = PropDMG, color = Economic.cost)) +
geom_count(alpha = 0.64) + ggtitle('Total Economic Cost by Event') +
scale_color_gradientn('Economic Cost', colours = c('green', 'yellow', 'orange', 'red'),
guide = 'colourbar', labels = point) +
ylab('Property Damage') + xlab('Crop Damage') + scale_size(name = 'Number of Events') +
scale_x_continuous(labels = point) + scale_y_continuous(labels = point)
ec.fig2 <-
ggplot(data = event.mean, aes(x = Injuries, y = Fatalities, color = Economic.cost)) +
geom_count(alpha = 0.64) + ggtitle('Mean Economic Cost by Event') +
scale_color_gradientn('Mean Economic Cost', colours = c('green', 'yellow', 'orange', 'red'),
guide = 'colourbar') +
ylab('Mean Property Damage') + xlab('Mean Crop Damage') + scale_size(name = 'Number of Events')
# Get the top ten economic cost percentages for display.
plotdf <- by.event[order(by.event$perc.ec, decreasing = T),][1:10,]
ec.fig3 <- ggplot(plotdf, aes(x = EVTYPE, y = perc.ec)) +
geom_bar(stat = "identity", width = .3) +
theme(axis.text.x = element_text(color = "black", angle = 90, vjust = .5, hjust = 1)) +
labs(x = "Event type", y = "Percentage of Economic Cost",
title = "Top 10 Events by Economic Cost Percentage")
# Get the top ten mean economic cost percentages for display.
plotdf <- event.mean[order(by.event$perc.ec, decreasing = T),][1:10,]
# Make a plot of the top ten Percentages.
ec.fig4 <- ggplot(plotdf, aes(x = EVTYPE, y = perc.ec)) +
geom_bar(stat = "identity", width = .3) +
theme(axis.text.x = element_text(color = "black", angle = 90, vjust = .5, hjust = 1)) +
labs(x = "Event type", y = "Percentage of Mean Economic Cost",
title = "Top 10 Events by Mean Economic Cost Percentage")
# Combine figures into one graph
# We're going to wrap the grid.arrange graph combinations in a ggsave()
# We do this because gridExtra ignores the 'newpage = F' and keeps printing the figures when I knit.
ggsave(filename = 'spread_analysis.png',
plot = grid.arrange(grid.arrange(hc.fig1, hc.fig2, hc.fig3, hc.fig4, nrow = 2, newpage = F),
grid.arrange(ec.fig1, ec.fig2, ec.fig3, ec.fig4, nrow = 2, newpage = F),
nrow = 2, newpage = F,
top = textGrob('Event spread analysis graphs',
gp = gpar(fontsize = 15, font = 8))),
device = 'png', path = 'figs/',
width = 30, height = 40, units = 'cm', dpi = 320)
# Final graphs
plotdf <- by.event[order(by.event$Human.cost, decreasing = T),][1:10,]
g1 <- ggplot(plotdf, aes(x = EVTYPE, y = Human.cost)) +
geom_bar(stat = "identity", width = .3) +
theme(axis.text.x = element_text(color = "black", angle = 90, vjust = .5, hjust = 1)) +
labs(x = "Event type", y = "Human Cost",
title = "Top 10 Human costs of Weather Events")
plotdf <- by.event[order(by.event$Economic.cost, decreasing = T),][1:10,]
g2 <- ggplot(plotdf, aes(x = EVTYPE, y = Economic.cost)) +
geom_bar(stat = "identity", width = .3) +
theme(axis.text.x = element_text(color = "black", angle = 90, vjust = .5, hjust = 1)) +
labs(x = "Event type", y = "Economic Cost",
title = "Top 10 Economic costs of Weather Events") +
scale_y_continuous(labels = point)
plotdf <- event.mean[order(event.mean$Human.cost, decreasing = T),][1:10,]
mg1 <- ggplot(plotdf, aes(x = EVTYPE, y = Human.cost)) +
geom_bar(stat = "identity", width = .3) +
theme(axis.text.x = element_text(color = "black", angle = 90, vjust = .5, hjust = 1)) +
labs(x = "Event type", y = "Mean Human Cost",
title = "Top 10 Mean Human costs of Weather Events")
plotdf <- event.mean[order(event.mean$Economic.cost, decreasing = T),][1:10,]
mg2 <- ggplot(plotdf, aes(x = EVTYPE, y = Economic.cost)) +
geom_bar(stat = "identity", width = .3) +
theme(axis.text.x = element_text(color = "black", angle = 90, vjust = .5, hjust = 1)) +
labs(x = "Event type", y = "Mean Economic Cost",
title = "Top 10 Mean Economic costs of Weather Events")
ggsave(filename = 'Human_costs.png', plot = grid.arrange(g1, mg1, nrow = 2, ncol = 1, newpage = F,
top = textGrob('Human Cost Graphs', gp = gpar(fontsize = 15, font = 8))),
device = 'png', path = 'figs/', dpi = 320,
width = 10, height = 10)
ggsave(filename = 'Economic_costs.png', plot = grid.arrange(g2, mg2, ncol = 1, nrow = 2, newpage = F,
top = textGrob('Economic Cost Graphs', gp = gpar(fontsize = 15, font = 8))),
device = 'png', path = 'figs/', dpi = 320,
width = 10, height = 10)
Visualisations are only half the picture, we also need to gain a descriptive understanding and as such we will need to get
the names of the events with the highest cost according to a certain criteria.
# Cost maximums
# Maximum Crop damage
crop.max <- max(by.event$CropDMG)
crop.maxname <- by.event[by.event$CropDMG == crop.max, 'EVTYPE']$EVTYPE
crop.max <- format(round(crop.max, 2), nsmall = 2)
# Maximum property damage
property.max <- max(by.event$PropDMG)
property.maxname <- by.event[by.event$PropDMG == property.max, 'EVTYPE']$EVTYPE
property.max <- format(round(property.max, 2), nsmall = 2)
# Maximum Economic cost
economy.max <- max(by.event$Economic.cost)
economy.maxname <- by.event[by.event$Economic.cost == economy.max, 'EVTYPE']$EVTYPE
economy.max <- format(round(economy.max, 2), nsmall = 2)
# Maximum Fatalities
fatal.max <- max(by.event$Fatalities)
fatal.maxname <- by.event[by.event$Fatalities == fatal.max, 'EVTYPE']$EVTYPE
fatal.max <- round(fatal.max, 2)
# Maximum Injuries
injury.max <- max(by.event$Injuries)
injury.maxname <- by.event[by.event$Injuries == injury.max, 'EVTYPE']$EVTYPE
injury.max <- format(round(injury.max, 2) , nsmall = 2)
# Maximum Human cost
human.max <- max(by.event$Human.cost)
human.maxname <- by.event[by.event$Human.cost == human.max, 'EVTYPE']$EVTYPE
human.max <- format(round(human.max, 2), nsmall = 2)
# Percentage maximums
# Event percentage
percevent.max <- max(by.event$perc.event)
percevent.maxname <- by.event[by.event$perc.event == percevent.max, 'EVTYPE']$EVTYPE
percevent.max <- round(percevent.max, 2)
# Property damage percentage
percprop.max <- max(by.event$perc.prop)
percprop.maxname <- by.event[by.event$perc.prop == percprop.max, 'EVTYPE']$EVTYPE
percprop.max <- round(percprop.max, 2)
# Crop damage percentage
perccrops.max <- max(by.event$perc.crops)
perccrops.maxname <- by.event[by.event$perc.crops == perccrops.max, 'EVTYPE']$EVTYPE
perccrops.max <- round(perccrops.max, 2)
# economic cost percentage
percec.max <- max(by.event$perc.ec)
percec.maxname <- by.event[by.event$perc.ec == percec.max, 'EVTYPE']$EVTYPE
percec.max <- round(percec.max, 2)
# human cost percentages
perchc.max <- max(by.event$perc.hc)
perchc.maxname <- by.event[by.event$perc.hc == perchc.max, 'EVTYPE']$EVTYPE
perchc.max <- round(perchc.max, 2)
# fatality percentages
percfatal.max <- max(by.event$perc.fatal)
percfatal.maxname <- by.event[by.event$perc.fatal == percfatal.max, 'EVTYPE']$EVTYPE
percfatal.max <- round(percfatal.max, 2)
# injury percentages
percinjur.max <- max(by.event$perc.injur)
percinjur.maxname <- by.event[by.event$perc.injur == percinjur.max, 'EVTYPE']$EVTYPE
percinjur.max <- round(percinjur.max, 2)
# Mean Maximums
# Maximum Crop damage
crop.max_mean <- max(event.mean$CropDMG)
crop.maxname_mean <- event.mean[event.mean$CropDMG == crop.max_mean, 'EVTYPE']$EVTYPE
crop.max_mean <- round(crop.max_mean, 2)
# Maximum property damage
property.max_mean <- max(event.mean$PropDMG)
property.maxname_mean <- event.mean[event.mean$PropDMG == property.max_mean, 'EVTYPE']$EVTYPE
property.max_mean <- round(property.max_mean, 2)
# Maximum Economic cost
economy.max_mean <- max(event.mean$Economic.cost)
economy.maxname_mean <- event.mean[event.mean$Economic.cost == economy.max_mean, 'EVTYPE']$EVTYPE
economy.max_mean <- round(economy.max_mean, 2)
# Maximum Fatalities
fatal.max_mean <- max(event.mean$Fatalities)
fatal.maxname_mean <- event.mean[event.mean$Fatalities == fatal.max_mean,
'EVTYPE']$EVTYPE
fatal.max_mean <- round(fatal.max_mean, 2)
# Maximum Injuries
injury.max_mean <- max(event.mean$Injuries)
injury.maxname_mean <- event.mean[event.mean$Injuries == injury.max_mean, 'EVTYPE']$EVTYPE
injury.max_mean <- round(injury.max_mean, 2)
# Maximum Human cost
human.max_mean <- max(event.mean$Human.cost)
human.maxname_mean <- event.mean[event.mean$Human.cost == human.max_mean, 'EVTYPE']$EVTYPE
human.max_mean <- format(round(human.max_mean, 2), nsmall = 2)
From the Event spread analysis graph, we can see that the costs of the events are
unevenly spread, so we will answer our core questions from two perspectives:
1. Overall Statistics, where we report the maximum total cost. 2. Per occurence statistics, where we report the maximum mean cost.
It is also important to note that
After analysing the data, we can now pick the results that best answer our main questions.
fig 1: Human Cost by event
We can Therfeore conlude that:
1. Heat Wave is responsible for the highest human cost per occurence
2. while TORNADO is responsible for the highest human cost overall.
fig 2: Economic Cost by event
We can Therfeore conlude that:
1. HURRICANE FELIX, TROPICAL STORM GORDON, WINTER STORMS have a tie for the highest ecnonomic cost per occurence.
2. while TORNADO has the highest ecnonomic cost overall.