Exploration of the NOAA Storm database

Synopsis

In this report I aim to describe the storm events in the USA that are most harmful in terms of death and injuries caused, and in terms of economic impact to property and crops. The data covers a period from 1950 to 2011.

I found in terms of harm to humans, by far the biggest cause of harm in terms of deaths and injuries are Tornado events. Tornadoes account for more than 60% of the total harm done. Excessive heat is second in terms of death only, accounting for over 10% of deaths.

The total number of deaths in the period 1950 to 2011 was 15,145.
The total number of injuries in the period 1950 to 2011 was 140,528.
The total number of cases of harm in the period 1950 to 2011 was 155,673.

In terms of economic costs, Flooding is the biggest cause of damage costs to property and agriculture, accounting for over 30% of the total costs. Hurricanes and Tornadoes each cause over 10% of the total damage costs.

The total economic costs caused by storm events during this period was approximately 476 billion US Dollars.

An interesting extension to this analysis would be to examine any changes in harm and economic costs with respect to time and location to assist with planning how best to allocate resources.

Global Rmarkdown settings

The global settings for the Rmarkdown document used to generate this report are to print the R code, cache the data, and set the figure width and height, and suppress messages:

library(knitr)
opts_chunk$set(echo=TRUE,cache=TRUE,fig.width=12, fig.height=8, message=FALSE, 
               warning=FALSE)

Loading and processing the raw data

A bzipped dataset was provided on the course website: Storm Data [47 Mb], along with links to Storm Data Documentation and a FAQ document.

The data is downloaded and loaded as follows:

fileUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"

# Download and uzip data if not already present        
if(!file.exists("stormData.csv.bz2")){
        download.file(fileUrl,destfile="StormData.csv.bz2")
        dateDownloaded <- date()
        log_con <- file("storm_download.log")
        cat (fileUrl,"\n","destfile= StormData.csv.bz2",
             "\n","destdir =", getwd(),"\n",dateDownloaded, 
             file = log_con)
        close(log_con)
        gc(verbose = FALSE) # release memory
}

On my system in takes about 45 seconds to read in data from the csv file:

# If not already loaded, load the data
if(!exists("storm.raw")){
storm.raw <- read.csv(bzfile("StormData.csv.bz2"),
                      stringsAsFactors = FALSE, na.strings="NA",header=TRUE)
}

# Get dimensions
dim.raw <- dim(storm.raw)

Checking the dimensions of the dataset indicates that there are 902297 observations of 37 variables.

Results

1. Which events are most harmful to human health?

To answer this question we are interested in the variables EVTYPE, FATALITIES and INJURIES, representing the Storm Event type, number of deaths caused and number of injuries caused respectively.

These variables are in columns 8,23 and 24, as found by:

which(colnames(storm.raw)=="EVTYPE")

## [1] 8

which(colnames(storm.raw)=="FATALITIES")

## [1] 23

which(colnames(storm.raw)=="INJURIES")

## [1] 24

Subsetting and summarising these these columns only indicates no missing values.

health.dat <- storm.raw[,c(8,23,24)]
# Summarise the data, no NAs
summary(health.dat[,2:3])

##    FATALITIES          INJURIES        
##  Min.   :  0.0000   Min.   :   0.0000  
##  1st Qu.:  0.0000   1st Qu.:   0.0000  
##  Median :  0.0000   Median :   0.0000  
##  Mean   :  0.0168   Mean   :   0.1557  
##  3rd Qu.:  0.0000   3rd Qu.:   0.0000  
##  Max.   :583.0000   Max.   :1700.0000

I then did the following using the dpylr package:

Grouped by EVTYPE
Summed the total deaths and injuries for each event type.
Added the total deaths and injuries for each event type together to calculate the total harm done by each event type.
Arranged the events in descending order of the total harm.

suppressMessages(library(dplyr)) # supress loading message
# Group by event
by_event <- health.dat %>% group_by(EVTYPE) %>%
        summarise_each(funs(sum)) %>% # Sum each event
        mutate(DEATH.AND.INJURY = FATALITIES+INJURIES) %>% # Add death or injury
        # Select only events where death or injury occurs
        filter(DEATH.AND.INJURY > 0) %>%
        arrange(desc(DEATH.AND.INJURY))  # Arrange by death or injury descending
# order

# Which are the most impactful events.
max.death <- by_event[which.max(by_event$FATALITIES),]
max.injuries <- by_event[which.max(by_event$INJURIES),]
max.harm <- by_event[which.max(by_event$DEATH.AND.INJURY),]

This reveals that the storm events causing the most harm overall during this period 1950 to 2011 was by TORNADO, causing 5633 deaths and 91 thousand injuries.

Table of Storm Events causing harm to human health

I’ve then put the full output into a searchable table using the data.table and DT packages so we can see the full breakdown of events, deaths and injuries:

suppressMessages(library(data.table)) # supress loading message
library(DT)
# Create data table
harm.table <- data.table(by_event)

datatable(harm.table,rownames = FALSE,
          colnames = c('Storm Event Type', 'Total Number of Deaths', 
                       'Total Number of Injuries)',
                       'Total Number of Deaths and Injuries'),
          class="row-border hover")

Bar plot of Top 15 Storm Events that cause over 90% of recorded harm to human health

The plot uses the ggplot and GridExtra packages. First I calculated the total deaths, injuries and harm for all events, and then used these to calculate what proportion of harm each event accounted for. I found that the top 15 events accounted for over 90% of the total harm caused in terms of both deaths and injuries. This is then plotted as bar plots for Deaths, Injuries, and combined Deaths and Injuries.

library(ggplot2)
# Calculate total deaths, injuries and harm for all events
Total.Deaths <- sum(by_event$FATALITIES)
Total.Injuries <- sum(by_event$INJURIES)
Total.Death.and.Injuries <- sum(by_event$DEATH.AND.INJURY)

# Extract the top 15
top.harm <- by_event[1:15,]
# Calculate the proportion of total harm caused by each event type
top.harm <- top.harm %>% 
        mutate(Prop.Death = FATALITIES/Total.Deaths,
               Prop.Inj = INJURIES/Total.Injuries,
               Prop.Death.and.Inj = DEATH.AND.INJURY/Total.Death.and.Injuries) %>%
        transmute(EVTYPE,Prop.Death,Prop.Inj,Prop.Death.and.Inj) 

# Create factor variables for the events ordered by the proportions
top.harm$harm.order <- reorder(top.harm$EVTYPE, top.harm$Prop.Death.and.Inj)
top.harm$death.order <- reorder(top.harm$EVTYPE, top.harm$Prop.Death)
top.harm$injury.order <- reorder(top.harm$EVTYPE, top.harm$Prop.Inj)

# Check that the top 15 account for > 90%
sum(top.harm$Prop.Death.and.Inj)

## [1] 0.9212323

# Plot the three catergories, Deaths, Injuries and Death and Injuries
# using the proportions
plot.1 <- ggplot(top.harm, aes(y=Prop.Death.and.Inj)) +
        geom_bar(aes(x=harm.order),
                 stat="identity", fill="white", colour="darkgreen", width=0.5) + 
        coord_flip() +
        ggtitle("Top 15 events accounting\n for > 90% of Total Harm") +
        ylab("Proportion of Total Deaths and Injuries") +
        xlab("") +
        theme(axis.text=element_text(size=rel(3)),
        axis.title=element_text(size=rel(2.25)),
        plot.title = element_text(size = rel(2.75)))

plot.2 <- ggplot(top.harm, aes(y=Prop.Death)) +
        geom_bar(aes(x=death.order),
                 stat="identity", fill="white", colour="darkgreen", width=0.5) + 
        coord_flip() +
        ggtitle("Top 15 events accounting\n for > 90% of Total Deaths") +
        ylab("Proportion of Total Deaths") +
        xlab("") +
        ylim(0,0.6) + # Make axis the same as the others
        theme(axis.text=element_text(size=rel(3)),
        axis.title=element_text(size=rel(2.25)),
        plot.title = element_text(size = rel(2.75)))

plot.3 <- ggplot(top.harm, aes(y=Prop.Inj)) +
        geom_bar(aes(x=injury.order),
                 stat="identity", fill="white", colour="darkgreen", width=0.5) + 
        coord_flip() +
        ggtitle("Top 15 events accounting\n for > 90% of Total Injuries") +
        ylab("Proportion of Total Injuries") +
        xlab("") +
        theme(axis.text=element_text(size=rel(3)),
        axis.title=element_text(size=rel(2.25)),
        plot.title = element_text(size = rel(2.75)))

suppressMessages(library(gridExtra)) # supress loading message
# Print three panels onto a single plot
grid.arrange(plot.1,plot.2,plot.3,ncol=3,nrow=1)

Figure 1: Top 15 Storm Events that cause over 90% of recorded harm to human health

The total number of deaths in the period 1950 to 2011 was 15145.

The total number of injuries in the period 1950 to 2011 was 140528.

The total number of cases of harm in the period 1950 to 2011 was 155673.

2. Which events have the greatest economic impact?

To answer this question, in addition to the event type I needed the property damage and crop damage observations. These are recorded as PROPDMG, PROPDMGEXP, CROPDMG and CROPDMGEXP. The EXP variables provide factors by which to multiply the DMG values. For the purposes of this report I’ve used K as 1 thousand, M as 1 million and B as 1 billion, as indicated in the accompanying documentation. I ignored the other character values as it was unclear what they were.

# I need to clean-up the data to get total economic costs of the events from
# PROPDMG, PROPDMGEXP, CPROPDMG, CPROPDMGEXP
# Columns, 25:28
which(colnames(storm.raw)=="PROPDMG")

## [1] 25

which(colnames(storm.raw)=="PROPDMGEXP")

## [1] 26

which(colnames(storm.raw)=="CROPDMG")

## [1] 27

which(colnames(storm.raw)=="CROPDMGEXP")

## [1] 28

I initially split the property and crop data to process separately. Again I used dplyr for the analysis:

library(dplyr)
# Subset these Property columns
prop.costs.dat <- storm.raw[,c(8,25:26)] %>%
        filter(PROPDMG > 0) # Only where cost has occured

# Subset the Crop columns
crop.costs.dat <- storm.raw[,c(8,27:28)] %>%
        filter(CROPDMG > 0)

Then I did the following for the property and crop data:

Found which strings to replace in the EXP column and replaced them with integers.
Removed missing values.
Multiplied the DMG and EXP columns to get the damage costs and added this as a new column of the damage costs.
Arranged the data in descending order of damage costs.

suppressMessages(library(dplyr))
# Find which strings to replace in property data
unique(prop.costs.dat$PROPDMGEXP)

##  [1] "K" "M" "B" "m" "+" "0" ""  "5" "6" "4" "h" "2" "7" "3" "H" "-"

# Replace characters K,M and B with integers, ignore others and NAs warning
prop.costs.dat$PROPDMGEXP <- gsub("[Kk]",1e3, prop.costs.dat$PROPDMGEXP)
prop.costs.dat$PROPDMGEXP <- gsub("[Mm]",1e6, prop.costs.dat$PROPDMGEXP)
prop.costs.dat$PROPDMGEXP <- gsub("[Bb]",1e9, prop.costs.dat$PROPDMGEXP)
prop.costs.dat$PROPDMGEXP <- as.integer(prop.costs.dat$PROPDMGEXP)

## Warning: NAs introduced by coercion

# Remove NAs as I don't know how to interpret the other symbols
prop.costs.dat <- na.omit(prop.costs.dat)

# Mutate the data frame
prop.costs.dat <- prop.costs.dat %>% 
        mutate(Prop.Damage.Costs = PROPDMG*PROPDMGEXP) %>%
        arrange(desc(Prop.Damage.Costs))

# Find which strings to replace in crop data
unique(crop.costs.dat$CROPDMGEXP)

## [1] "M" "K" "m" "B" "k" "0" ""

# Replace characters K,M and B with integers, ignore others and NAs warning
crop.costs.dat$CROPDMGEXP <- gsub("[Kk]",1e3, crop.costs.dat$CROPDMGEXP)
crop.costs.dat$CROPDMGEXP <- gsub("[Mm]",1e6, crop.costs.dat$CROPDMGEXP)
crop.costs.dat$CROPDMGEXP <- gsub("B",1e9, crop.costs.dat$CROPDMGEXP)
crop.costs.dat$CROPDMGEXP <- as.integer(crop.costs.dat$CROPDMGEXP)

# Remove NAs as I don't know how to interpret them
crop.costs.dat <- na.omit(crop.costs.dat)

# Mutate the data frame
crop.costs.dat <- crop.costs.dat %>% 
        mutate(Crop.Damage.Costs = CROPDMG*CROPDMGEXP) %>%
        arrange(desc(Crop.Damage.Costs))

I’ve then summarised this output by event as previously and merged the property and crop damage costs into a single data frame:

library(dplyr)
# Summarize data by event
prop_cost_by_event <- prop.costs.dat %>% group_by(EVTYPE) %>%
        summarise_each(funs(sum)) # Sum Prop Costs for each event

crop_cost_by_event <- crop.costs.dat %>% group_by(EVTYPE) %>%
        summarise_each(funs(sum)) # Sum Crop Costs for each event

# Merge the Property and Crop Costs
clean.costs <- merge(prop_cost_by_event, crop_cost_by_event, 
                     "EVTYPE", all=TRUE) %>%
        # Keep only the events and total costs columns
        select(EVTYPE, Prop.Damage.Costs, Crop.Damage.Costs)
# Set NAs to zero
clean.costs[is.na(clean.costs)] <- 0
clean.costs <- clean.costs %>% 
        # Sum the Property and Crop costs
        mutate(Total.Cost = Prop.Damage.Costs+Crop.Damage.Costs) %>%
        arrange(desc(Total.Cost)) # Arrange in descending order

Table of the economic costs of Storm Events

As before I’ve put the summary of the storm event costs into a a searchable table so we can see the full breakdown of events and costs:

library(data.table)
library(DT)
# Create data table
cost.table <- data.table(clean.costs)
cost.table$Prop.Damage.Costs <- round(cost.table$Prop.Damage.Costs/1e6,0)
cost.table$Crop.Damage.Costs <- round(cost.table$Crop.Damage.Costs/1e6,0)
cost.table$Total.Cost <- round(cost.table$Total.Cost/1e6,0)
datatable(cost.table,rownames = FALSE,
          colnames = c('Storm Event Type', 'Property Damage Cost (Millions USD)', 
                       'Crop Damage Cost (Millions USD)',
                        'Total Damage Cost (Millions USD)'),
          class="row-border hover")

Bar plot of Top 15 Storm Events that cause over 90% of recorded economic costs

Finally, as before I calculated the total economic cost for all events, and then used this to calculate what proportion of the total each event accounted for. I found that the top 15 events accounted for over 90% of the total economic costs caused.

library(ggplot2)
# Calculate the total cost of all events
Total.Damage.Cost <- sum(clean.costs$Total.Cost)
# Subset top 15 events
top.costs <- clean.costs[1:15,]
top.costs <- top.costs %>% mutate(Proportion = Total.Cost/Total.Damage.Cost) %>%
        transmute(EVTYPE,Proportion)

# Create factor variable for the events ordered by the proportions of cost
top.costs$cost.order <- reorder(top.costs$EVTYPE, top.costs$Proportion)

# Check this accounts for >90% of costs
sum(top.costs$Proportion)

## [1] 0.9216174

# Create barplot ordered by cost.order
plot.4 <- ggplot(top.costs, aes(y=Proportion))
plot.4 + geom_bar(aes(x= cost.order),
                  stat="identity", fill="white", colour="darkgreen", width=0.5) + 
        coord_flip() +
        ggtitle("Top 15 events accounting > 90% of Total Damage Costs") +
        ylab("Proportion of Total Damage Costs") +
        xlab("")

Figure 2: Top 15 Storm Events that cause over 90% of recorded economic costs

The total economic cost of storm events in the period 1950 to 2011 was approximately 476 billion US Dollars

Session information

Here is the session information about the packages I used, their versions, and the version of R that I used for this assignment:

sessionInfo()

## R version 3.2.0 (2015-04-16)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
## 
## locale:
## [1] LC_COLLATE=English_United Kingdom.1252 
## [2] LC_CTYPE=English_United Kingdom.1252   
## [3] LC_MONETARY=English_United Kingdom.1252
## [4] LC_NUMERIC=C                           
## [5] LC_TIME=English_United Kingdom.1252    
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] gridExtra_0.9.1  ggplot2_1.0.1    DT_0.0.35        data.table_1.9.4
## [5] dplyr_0.4.1     
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.11.5       knitr_1.9         magrittr_1.5     
##  [4] MASS_7.3-40       munsell_0.4.2     colorspace_1.2-6 
##  [7] stringr_0.6.2     plyr_1.8.2        tools_3.2.0      
## [10] parallel_3.2.0    gtable_0.1.2      DBI_0.3.1        
## [13] htmltools_0.2.6   yaml_2.1.13       lazyeval_0.1.10  
## [16] assertthat_0.1    digest_0.6.8      RJSONIO_1.3-0    
## [19] reshape2_1.4.1    formatR_1.2       htmlwidgets_0.3.2
## [22] evaluate_0.7      rmarkdown_0.5.1   labeling_0.3     
## [25] scales_0.2.4      chron_2.3-45      proto_0.3-10

Reproducible Research: Peer Assessment 2