Synopsis

This data analysis is conducted on the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database using simple column charts to investigate which event types are associated with greatest harm to health, as determined by the number of injuries and fatalities they cause. Similarly those events with greatest economic consequence were determined based on the cost of damage to property and crops.

Our analysis showed that Tornadoes caused the most harm to population health, alongside excessive heat and flooding. Flooding had the greatest economic consequence, closely followed by extreme wind events such as Hurricanes, Typhoons & Tornadoes.

The Assignment

The basic goal of this assignment is to explore the NOAA Storm Database and answer some basic questions about severe weather events:
1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
2. Across the United States, which types of events have the greatest economic consequences?

Data Processing

First we will configure the R Markdown document knitr::opts_chunk$set(echo = TRUE) and load required libraries knitr, dplyr, reshape2, ggplot2 This will be hidden by setting ```{r setup, include=FALSE}.

Importing data

For the import code chunk I will set cache = TRUE as unzipping and reading the data file is the most resource intensive element of this analysis of this analysis. First the script will check if the file StormData.csv already exists, before downloading it if not. The CSV file is compressed using bz2 but read.csv() is able to uncompress this directly without any further processing.

I have then reduced the dataset to only the required variables (data, BGN_DATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP) to reduce memory usage and clutter.

# Specify fileUrl
fileUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
# Check if file already exists
if(!file.exists("StormData.csv")){
        # If NOT then download file using curl
        download.file(fileUrl, 
                      destfile = "StormData.csv", method = "curl")
}
# Read file and store in environment as 'data'
data <- read.csv("StormData.csv", stringsAsFactors = FALSE)
# Select only relevant cols
data <- select(data, BGN_DATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)

Cleaning Event Type (EVTYPE)

Data has length(unique(data$EVTYPE)) = 985 unique values for EVTYPE. This is despite only 48 on the official National Weather Service Storm Data Documentation.

On inspection of the database information at https://www.ncdc.noaa.gov/stormevents/details.jsp we can see that the official 48 event types were only introduced in 1996. The from 1950, as identified by data$BGN_DATE[which.min(data$BGN_DATE)]. Therefore I will subset the data to only include dates after 1996. This should hopefully have the added bonus of reducing the number of unique values for EVTYPE which I have to clean!

class(data$BGN_DATE)
## [1] "character"

The data is currently stored in the "character" class. So first we need to convert it to date using as.Date with the specified format "%m/%d/%Y". Then we can subset it for dates >= "1996-01-01"

data$BGN_DATE <- as.Date(data$BGN_DATE, "%m/%d/%Y")
data <- subset(data, data$BGN_DATE >= "1996-01-01")

This has reduced the number of unique values from 985 to length(unique(data$EVTYPE)) = 516!

Next I will capitalise everything using toupper() to eliminate any errors from inconsistent cases.

data$EVTYPE <- toupper(data$EVTYPE)

This has reduced the number of unique values to length(unique(data$EVTYPE)) = 438

I wasted a lot of time trying to reduce the list of EVTYPES down to match the official 48. This involved experimenting with filtering operations, regular expressions (to remove spaces/characters) and even using amatch() from stringist with longest common substring method. This was unnecessary but taught me a valuable lesson - remember to address the question

Our question, from the JHU Reproducible Research assignment, is:

Which types of events are most harmful with respect to population health

That is to say we only need to focus on the most harmful values, not the complete range. So, providing our output seems sensible, we can tolerate having some ‘unclean’ data in our dataset. This has the benefit of greatly simplifing our analysis, but comes at the expense of reducing the proportion of data we’re analysing (i.e. we are missing typos).

Cleaning Property & Crop Damage (PROPDMG, CROPDMG)

Damage to property & crops is currently split across four columns. PROPDMG & CROPDMG are integers and their respective exponents, PROPDMGEXP & CROPDMGEXP, which take the form of characters "K", "M" & "B" corresponding to thousands, millions & billions, respectively. For the analysis I will need a single value so I shall combine these integers and their respective exponents.

First I will use dplyr’s mutate() function in the form mutate(variable = case_when(variable == condition ~ newvalue)) to substitute "K", "M" & "B" with 1E3, 1E6 & 1E9 respectively. I will then reassign PROPDMG & CROPDMG to be the value of their original integer multiplied by their exponent. Finally I will remove the redundant exponent columns by assigning them to NULL.

# Replace exponents with respective power of 10
data <- data %>% mutate(PROPDMGEXP = case_when(PROPDMGEXP == "K" ~ 1E3, PROPDMGEXP == "M" ~ 1E6, PROPDMGEXP == "B" ~ 1E9)) %>% mutate(CROPDMGEXP = case_when(CROPDMGEXP == "K" ~ 1E3, CROPDMGEXP == "M" ~ 1E6, CROPDMGEXP == "B" ~ 1E9))
# Multiply Prop & Crop damage by their respective exponents
data$PROPDMG <- data$PROPDMG * data$PROPDMGEXP
data$CROPDMG <- data$CROPDMG * data$CROPDMGEXP
# Remove redundant EXP cols
data$PROPDMGEXP <- NULL
data$CROPDMGEXP <- NULL

Data Analysis

Events most Harmful to Population Health

Health Harm = Morbidity & Mortality = Injuries + Fatalities

First I will aggregate the data to a new dataframe harm selecting only the variables EVTYPE, FATALITIES & INJURIES.

Given that we are only interesting in the most harmful event types I will use the order function in the form order(data$col, decreasing = TRUE) to order the data for Injuries & Fatalities. With this subset harm using bracket notation, into their own dataframes, topMorB & topMorT, respectively. Using the head(data, 5) will give the top 5 most fatal/injurious event types.

I will then merge the top 5 most fatal/injurious event types contained in topMorB & topMorT into topharm using merge(x, y, all = TRUE) to ensure that all values are retained (i.e. to avoid only event types common to both “top 5s” being selected).

harm <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data, sum)
topMorB <- head(harm[order(harm$INJURIES, decreasing = TRUE),], 5)
topMorT <- head(harm[order(harm$FATALITIES, decreasing = TRUE),], 5)
topharm <- merge(topMorB, topMorT, all = TRUE)

Events with Greatest Economic Consequence

Economic Consequence = Damage to Property & Crops

My analysis will take the same form as above. First aggegating data by EVTYPE, PROPDMG & CROPDMG, then ranking & merging the top 5 for each marker of economic consequence.

dmg <- aggregate(cbind(PROPDMG, CROPDMG) ~ EVTYPE, data, sum)
topProp <- head(dmg[order(dmg$PROPDMG, decreasing = TRUE),], 5)
topCrop <- head(dmg[order(dmg$CROPDMG, decreasing = TRUE),], 5)
topdmg <- merge(topProp, topCrop, all = TRUE)

Results

Events most Harmful to Population Health

I have chosen to present my data in the form of a column chart using the ggplot2 package. This will consist of Event Type on the x axis and Number of People on the y axis. The data will be grouped by type of harm (Injuries, Fatalities). To achieve this grouping I need a column containing the variable by which I am grouping. I will use the melt function from the reshape2 package to reshape the dataframe into this format.

Stylistically I will use the position = "dodge" argument to geom_col to display the grouping side-by-side. I will add titles and axis labels using ggtitle() and then need to rotate the axis labels by 45 degrees and horizontally offset them using theme(axis.text.x = element_text(angle = 45, hjust = 1)) to ensure they do not overlap and are readable.

The default grouping colours from ggplot are Fatalities in Red and Injuries in Blue. I thought this was apt so left it as is!

g <- ggplot(data = melt(topharm))
## Using EVTYPE as id variables
g <- g + geom_col(aes(x = EVTYPE, y = value, fill = variable), position = "dodge")
g <- g + theme(axis.text.x = element_text(angle = 45, hjust = 1))
g + ggtitle("Harm to Health per Event Type") + xlab("Event Type") + ylab("Number of People") + labs(fill = "Harm")

From this plot we can see that the most fatalities occur from ‘Excessive Heat’ (1797) closely followed by ‘Tornado’ (1511). The graphical differentiation of this could be improved by either plotting separate panels for Fatalities & Injuries (to prevent the high value for injuries with Tornadoes skewing the y axis) or using a logarithmic scale.

I have summarised the top 3 causes of fatalities in tabular form below:

head(topMorT[order(topMorT$FATALITIES, decreasing = TRUE),1:2],3)
##             EVTYPE FATALITIES
## 69  EXCESSIVE HEAT       1797
## 358        TORNADO       1511
## 85     FLASH FLOOD        887

For injuries it is clear to see that Tornadoes are associated with the highest number (20667) with Flood and Excessive Heat having similar counts of 6758 & 6391 respectively. As expected injuries are far greater in number than fatalities, thankfully! Again this is summarised in tabular form below.

head(topMorB[order(topMorB$INJURIES, decreasing = TRUE),-2],3)
##             EVTYPE INJURIES
## 358        TORNADO    20667
## 88           FLOOD     6758
## 69  EXCESSIVE HEAT     6391

To conclude which event is the single most harmful to health requires a value-based judgement about the relative weighting of injuries vs fatalities. This could be an entire thesis on it’s own! Nevertheless I would suggest that, despite being associated with the 2nd highest number of fatalities by a small margin, Tornadoes are most associated with harm to health.

Events with Greatest Economic Consequence

I have chosen to present my data in the same form as above.

g <- ggplot(data = melt(topdmg))
## Using EVTYPE as id variables
g <- g + geom_col(aes(x = EVTYPE, y = value, fill = variable), position = "dodge")
g <- g + theme(axis.text.x = element_text(angle = 45, hjust = 1))
g + ggtitle("Damage to Property & Crops per Event Type") + xlab("Event Type") + ylab("Cost in US Dollars ($)") + labs(fill = "Damage")

From this plot we can see that the greatest cost of Property Damage occurs with Flooding ($132.4 BILLION!) followed by wind events such as Hurricane/Typhoon and Tornadoes ($26.7 & 16 Billion, respectively). The top 3 highest costs of property damage are summarised below in tabular form:

head(dmg[order(dmg$PROPDMG, decreasing = TRUE),1:2], 3)
##               EVTYPE      PROPDMG
## 17             FLOOD 132485992500
## 33 HURRICANE/TYPHOON  26740295000
## 54           TORNADO  16030944840

As expected the cost of damage to crops is lower than that of damage to property, but it shows a similar pattern with Flooding causing the most damage ($4.7 billion) followed again by wind events Hurricane and Hurricane/Typhoon ($2.7 & 2.6 billion, respectively)

head(dmg[order(dmg$CROPDMG, decreasing = TRUE),-2], 3)
##               EVTYPE    CROPDMG
## 17             FLOOD 4792831400
## 32         HURRICANE 2688410000
## 33 HURRICANE/TYPHOON 2607872800

There is greater concordance between the event types associated with damage to property and crops. Therefore I can conclude that the event type with greatest economic consequence is Flooding however extreme wind events such as Hurricanes, Typhoons & Tornadoes also represent have significant economic effects.