In this report I aim to describe the storm events in the USA that are most harmful in terms of death and injuries caused, and in terms of economic impact to property and crops. The data covers a period from 1950 to 2011.
I found in terms of harm to humans, by far the biggest cause of harm in terms of deaths and injuries are Tornado events. Tornadoes account for more than 60% of the total harm done. Excessive heat is second in terms of death only, accounting for over 10% of deaths.
In terms of economic costs, Flooding is the biggest cause of damage costs to property and agriculture, accounting for over 30% of the total costs. Hurricanes and Tornadoes each cause over 10% of the total damage costs.
An interesting extension to this analysis would be to examine any changes in harm and economic costs with respect to time and location to assist with planning how best to allocate resources.
The global settings for the Rmarkdown document used to generate this report are to print the R code, cache the data, and set the figure width and height, and suppress messages:
library(knitr)
opts_chunk$set(echo=TRUE,cache=TRUE,fig.width=12, fig.height=8, message=FALSE,
warning=FALSE)
A bzipped dataset was provided on the course website: Storm Data [47 Mb], along with links to Storm Data Documentation and a FAQ document.
The data is downloaded and loaded as follows:
fileUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
# Download and uzip data if not already present
if(!file.exists("stormData.csv.bz2")){
download.file(fileUrl,destfile="StormData.csv.bz2")
dateDownloaded <- date()
log_con <- file("storm_download.log")
cat (fileUrl,"\n","destfile= StormData.csv.bz2",
"\n","destdir =", getwd(),"\n",dateDownloaded,
file = log_con)
close(log_con)
gc(verbose = FALSE) # release memory
}
On my system in takes about 45 seconds to read in data from the csv file:
# If not already loaded, load the data
if(!exists("storm.raw")){
storm.raw <- read.csv(bzfile("StormData.csv.bz2"),
stringsAsFactors = FALSE, na.strings="NA",header=TRUE)
}
# Get dimensions
dim.raw <- dim(storm.raw)
Checking the dimensions of the dataset indicates that there are 902297 observations of 37 variables.
To answer this question we are interested in the variables EVTYPE, FATALITIES and INJURIES, representing the Storm Event type, number of deaths caused and number of injuries caused respectively.
These variables are in columns 8,23 and 24, as found by:
which(colnames(storm.raw)=="EVTYPE")
## [1] 8
which(colnames(storm.raw)=="FATALITIES")
## [1] 23
which(colnames(storm.raw)=="INJURIES")
## [1] 24
Subsetting and summarising these these columns only indicates no missing values.
health.dat <- storm.raw[,c(8,23,24)]
# Summarise the data, no NAs
summary(health.dat[,2:3])
## FATALITIES INJURIES
## Min. : 0.0000 Min. : 0.0000
## 1st Qu.: 0.0000 1st Qu.: 0.0000
## Median : 0.0000 Median : 0.0000
## Mean : 0.0168 Mean : 0.1557
## 3rd Qu.: 0.0000 3rd Qu.: 0.0000
## Max. :583.0000 Max. :1700.0000
I then did the following using the dpylr package:
EVTYPEsuppressMessages(library(dplyr)) # supress loading message
# Group by event
by_event <- health.dat %>% group_by(EVTYPE) %>%
summarise_each(funs(sum)) %>% # Sum each event
mutate(DEATH.AND.INJURY = FATALITIES+INJURIES) %>% # Add death or injury
# Select only events where death or injury occurs
filter(DEATH.AND.INJURY > 0) %>%
arrange(desc(DEATH.AND.INJURY)) # Arrange by death or injury descending
# order
# Which are the most impactful events.
max.death <- by_event[which.max(by_event$FATALITIES),]
max.injuries <- by_event[which.max(by_event$INJURIES),]
max.harm <- by_event[which.max(by_event$DEATH.AND.INJURY),]
This reveals that the storm events causing the most harm overall during this period 1950 to 2011 was by TORNADO, causing 5633 deaths and 91 thousand injuries.
I’ve then put the full output into a searchable table using the data.table and DT packages so we can see the full breakdown of events, deaths and injuries:
suppressMessages(library(data.table)) # supress loading message
library(DT)
# Create data table
harm.table <- data.table(by_event)
datatable(harm.table,rownames = FALSE,
colnames = c('Storm Event Type', 'Total Number of Deaths',
'Total Number of Injuries)',
'Total Number of Deaths and Injuries'),
class="row-border hover")
The plot uses the ggplot and GridExtra packages. First I calculated the total deaths, injuries and harm for all events, and then used these to calculate what proportion of harm each event accounted for. I found that the top 15 events accounted for over 90% of the total harm caused in terms of both deaths and injuries. This is then plotted as bar plots for Deaths, Injuries, and combined Deaths and Injuries.
library(ggplot2)
# Calculate total deaths, injuries and harm for all events
Total.Deaths <- sum(by_event$FATALITIES)
Total.Injuries <- sum(by_event$INJURIES)
Total.Death.and.Injuries <- sum(by_event$DEATH.AND.INJURY)
# Extract the top 15
top.harm <- by_event[1:15,]
# Calculate the proportion of total harm caused by each event type
top.harm <- top.harm %>%
mutate(Prop.Death = FATALITIES/Total.Deaths,
Prop.Inj = INJURIES/Total.Injuries,
Prop.Death.and.Inj = DEATH.AND.INJURY/Total.Death.and.Injuries) %>%
transmute(EVTYPE,Prop.Death,Prop.Inj,Prop.Death.and.Inj)
# Create factor variables for the events ordered by the proportions
top.harm$harm.order <- reorder(top.harm$EVTYPE, top.harm$Prop.Death.and.Inj)
top.harm$death.order <- reorder(top.harm$EVTYPE, top.harm$Prop.Death)
top.harm$injury.order <- reorder(top.harm$EVTYPE, top.harm$Prop.Inj)
# Check that the top 15 account for > 90%
sum(top.harm$Prop.Death.and.Inj)
## [1] 0.9212323
# Plot the three catergories, Deaths, Injuries and Death and Injuries
# using the proportions
plot.1 <- ggplot(top.harm, aes(y=Prop.Death.and.Inj)) +
geom_bar(aes(x=harm.order),
stat="identity", fill="white", colour="darkgreen", width=0.5) +
coord_flip() +
ggtitle("Top 15 events accounting\n for > 90% of Total Harm") +
ylab("Proportion of Total Deaths and Injuries") +
xlab("") +
theme(axis.text=element_text(size=rel(3)),
axis.title=element_text(size=rel(2.25)),
plot.title = element_text(size = rel(2.75)))
plot.2 <- ggplot(top.harm, aes(y=Prop.Death)) +
geom_bar(aes(x=death.order),
stat="identity", fill="white", colour="darkgreen", width=0.5) +
coord_flip() +
ggtitle("Top 15 events accounting\n for > 90% of Total Deaths") +
ylab("Proportion of Total Deaths") +
xlab("") +
ylim(0,0.6) + # Make axis the same as the others
theme(axis.text=element_text(size=rel(3)),
axis.title=element_text(size=rel(2.25)),
plot.title = element_text(size = rel(2.75)))
plot.3 <- ggplot(top.harm, aes(y=Prop.Inj)) +
geom_bar(aes(x=injury.order),
stat="identity", fill="white", colour="darkgreen", width=0.5) +
coord_flip() +
ggtitle("Top 15 events accounting\n for > 90% of Total Injuries") +
ylab("Proportion of Total Injuries") +
xlab("") +
theme(axis.text=element_text(size=rel(3)),
axis.title=element_text(size=rel(2.25)),
plot.title = element_text(size = rel(2.75)))
suppressMessages(library(gridExtra)) # supress loading message
# Print three panels onto a single plot
grid.arrange(plot.1,plot.2,plot.3,ncol=3,nrow=1)
Figure 1: Top 15 Storm Events that cause over 90% of recorded harm to human health
The total number of deaths in the period 1950 to 2011 was 15145.
The total number of injuries in the period 1950 to 2011 was 140528.
The total number of cases of harm in the period 1950 to 2011 was 155673.
To answer this question, in addition to the event type I needed the property damage and crop damage observations. These are recorded as PROPDMG, PROPDMGEXP, CROPDMG and CROPDMGEXP. The EXP variables provide factors by which to multiply the DMG values. For the purposes of this report I’ve used K as 1 thousand, M as 1 million and B as 1 billion, as indicated in the accompanying documentation. I ignored the other character values as it was unclear what they were.
# I need to clean-up the data to get total economic costs of the events from
# PROPDMG, PROPDMGEXP, CPROPDMG, CPROPDMGEXP
# Columns, 25:28
which(colnames(storm.raw)=="PROPDMG")
## [1] 25
which(colnames(storm.raw)=="PROPDMGEXP")
## [1] 26
which(colnames(storm.raw)=="CROPDMG")
## [1] 27
which(colnames(storm.raw)=="CROPDMGEXP")
## [1] 28
I initially split the property and crop data to process separately. Again I used dplyr for the analysis:
library(dplyr)
# Subset these Property columns
prop.costs.dat <- storm.raw[,c(8,25:26)] %>%
filter(PROPDMG > 0) # Only where cost has occured
# Subset the Crop columns
crop.costs.dat <- storm.raw[,c(8,27:28)] %>%
filter(CROPDMG > 0)
Then I did the following for the property and crop data:
EXP column and replaced them with integers.DMG and EXP columns to get the damage costs and added this as a new column of the damage costs.suppressMessages(library(dplyr))
# Find which strings to replace in property data
unique(prop.costs.dat$PROPDMGEXP)
## [1] "K" "M" "B" "m" "+" "0" "" "5" "6" "4" "h" "2" "7" "3" "H" "-"
# Replace characters K,M and B with integers, ignore others and NAs warning
prop.costs.dat$PROPDMGEXP <- gsub("[Kk]",1e3, prop.costs.dat$PROPDMGEXP)
prop.costs.dat$PROPDMGEXP <- gsub("[Mm]",1e6, prop.costs.dat$PROPDMGEXP)
prop.costs.dat$PROPDMGEXP <- gsub("[Bb]",1e9, prop.costs.dat$PROPDMGEXP)
prop.costs.dat$PROPDMGEXP <- as.integer(prop.costs.dat$PROPDMGEXP)
## Warning: NAs introduced by coercion
# Remove NAs as I don't know how to interpret the other symbols
prop.costs.dat <- na.omit(prop.costs.dat)
# Mutate the data frame
prop.costs.dat <- prop.costs.dat %>%
mutate(Prop.Damage.Costs = PROPDMG*PROPDMGEXP) %>%
arrange(desc(Prop.Damage.Costs))
# Find which strings to replace in crop data
unique(crop.costs.dat$CROPDMGEXP)
## [1] "M" "K" "m" "B" "k" "0" ""
# Replace characters K,M and B with integers, ignore others and NAs warning
crop.costs.dat$CROPDMGEXP <- gsub("[Kk]",1e3, crop.costs.dat$CROPDMGEXP)
crop.costs.dat$CROPDMGEXP <- gsub("[Mm]",1e6, crop.costs.dat$CROPDMGEXP)
crop.costs.dat$CROPDMGEXP <- gsub("B",1e9, crop.costs.dat$CROPDMGEXP)
crop.costs.dat$CROPDMGEXP <- as.integer(crop.costs.dat$CROPDMGEXP)
# Remove NAs as I don't know how to interpret them
crop.costs.dat <- na.omit(crop.costs.dat)
# Mutate the data frame
crop.costs.dat <- crop.costs.dat %>%
mutate(Crop.Damage.Costs = CROPDMG*CROPDMGEXP) %>%
arrange(desc(Crop.Damage.Costs))
I’ve then summarised this output by event as previously and merged the property and crop damage costs into a single data frame:
library(dplyr)
# Summarize data by event
prop_cost_by_event <- prop.costs.dat %>% group_by(EVTYPE) %>%
summarise_each(funs(sum)) # Sum Prop Costs for each event
crop_cost_by_event <- crop.costs.dat %>% group_by(EVTYPE) %>%
summarise_each(funs(sum)) # Sum Crop Costs for each event
# Merge the Property and Crop Costs
clean.costs <- merge(prop_cost_by_event, crop_cost_by_event,
"EVTYPE", all=TRUE) %>%
# Keep only the events and total costs columns
select(EVTYPE, Prop.Damage.Costs, Crop.Damage.Costs)
# Set NAs to zero
clean.costs[is.na(clean.costs)] <- 0
clean.costs <- clean.costs %>%
# Sum the Property and Crop costs
mutate(Total.Cost = Prop.Damage.Costs+Crop.Damage.Costs) %>%
arrange(desc(Total.Cost)) # Arrange in descending order
As before I’ve put the summary of the storm event costs into a a searchable table so we can see the full breakdown of events and costs:
library(data.table)
library(DT)
# Create data table
cost.table <- data.table(clean.costs)
cost.table$Prop.Damage.Costs <- round(cost.table$Prop.Damage.Costs/1e6,0)
cost.table$Crop.Damage.Costs <- round(cost.table$Crop.Damage.Costs/1e6,0)
cost.table$Total.Cost <- round(cost.table$Total.Cost/1e6,0)
datatable(cost.table,rownames = FALSE,
colnames = c('Storm Event Type', 'Property Damage Cost (Millions USD)',
'Crop Damage Cost (Millions USD)',
'Total Damage Cost (Millions USD)'),
class="row-border hover")
Finally, as before I calculated the total economic cost for all events, and then used this to calculate what proportion of the total each event accounted for. I found that the top 15 events accounted for over 90% of the total economic costs caused.
library(ggplot2)
# Calculate the total cost of all events
Total.Damage.Cost <- sum(clean.costs$Total.Cost)
# Subset top 15 events
top.costs <- clean.costs[1:15,]
top.costs <- top.costs %>% mutate(Proportion = Total.Cost/Total.Damage.Cost) %>%
transmute(EVTYPE,Proportion)
# Create factor variable for the events ordered by the proportions of cost
top.costs$cost.order <- reorder(top.costs$EVTYPE, top.costs$Proportion)
# Check this accounts for >90% of costs
sum(top.costs$Proportion)
## [1] 0.9216174
# Create barplot ordered by cost.order
plot.4 <- ggplot(top.costs, aes(y=Proportion))
plot.4 + geom_bar(aes(x= cost.order),
stat="identity", fill="white", colour="darkgreen", width=0.5) +
coord_flip() +
ggtitle("Top 15 events accounting > 90% of Total Damage Costs") +
ylab("Proportion of Total Damage Costs") +
xlab("")
Figure 2: Top 15 Storm Events that cause over 90% of recorded economic costs
The total economic cost of storm events in the period 1950 to 2011 was approximately 476 billion US Dollars
Here is the session information about the packages I used, their versions, and the version of R that I used for this assignment:
sessionInfo()
## R version 3.2.0 (2015-04-16)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
##
## locale:
## [1] LC_COLLATE=English_United Kingdom.1252
## [2] LC_CTYPE=English_United Kingdom.1252
## [3] LC_MONETARY=English_United Kingdom.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United Kingdom.1252
##
## attached base packages:
## [1] grid stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] gridExtra_0.9.1 ggplot2_1.0.1 DT_0.0.35 data.table_1.9.4
## [5] dplyr_0.4.1
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.11.5 knitr_1.9 magrittr_1.5
## [4] MASS_7.3-40 munsell_0.4.2 colorspace_1.2-6
## [7] stringr_0.6.2 plyr_1.8.2 tools_3.2.0
## [10] parallel_3.2.0 gtable_0.1.2 DBI_0.3.1
## [13] htmltools_0.2.6 yaml_2.1.13 lazyeval_0.1.10
## [16] assertthat_0.1 digest_0.6.8 RJSONIO_1.3-0
## [19] reshape2_1.4.1 formatR_1.2 htmlwidgets_0.3.2
## [22] evaluate_0.7 rmarkdown_0.5.1 labeling_0.3
## [25] scales_0.2.4 chron_2.3-45 proto_0.3-10