Assignment instructions

The basic goal of this assignment is to explore the NOAA Storm Database and answer some basic questions about severe weather events. You must use the database to answer the questions below and show the code for your entire analysis.

Questions

Your data analysis must address the following questions:

  1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
  2. Across the United States, which types of events have the greatest economic consequences?

Data

Source Data Documentation

Synopsis

I have analysed the NOAA Storm Database 1950-2011 using base R and the following packages: dplyr, lubridate, xtable, scales and ggplot2 for figures. Based on this analysis, it appears that the event which has caused the most significant mortality in the US is tornadoes, followed by excessive heat and flash floods. The most economically costly events are drought, extreme cold and hail. Please see below for the code and figures produced by this analysis.

Data processing

Reading in the data

I read in the data with the following R script:

fileUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
destFileName <- "stormdata.bz2"
folderName <- "CSV files"

# download file
if(!file.exists(destFileName)){download.file(fileUrl, destfile = destFileName)}

#read in file
stormData <- read.csv(destFileName)

Processing the fatalities data

I then summarised the fatalities data by looking at the top ten causes of fatalities.

# Establish total fatalities by event type
fatalitiesByEvent <- tapply(stormData$FATALITIES, stormData$EVTYPE, sum)

# Make this into a data frame
fatalitiesByEvent <- as.data.frame(fatalitiesByEvent)

# Put rownames as a value in the data frame
fatalitiesByEvent$totalFatalities <- fatalitiesByEvent$fatalitiesByEvent
fatalitiesByEvent$fatalitiesByEvent <- rownames(fatalitiesByEvent)
names(fatalitiesByEvent)[1] <- "eventType"

# This has almost 1000 events
dim(fatalitiesByEvent)
## [1] 985   2
# Let's take the top ten
library(dplyr)
fatalitiesByEvent <- arrange(fatalitiesByEvent, desc(totalFatalities))
significantFatalitiesByEvent <- fatalitiesByEvent[1:10,]

# Use this list to take a subset of data to plot in results section
significantEvents <- significantFatalitiesByEvent$eventType

# filter the data according only to significant events
significantFatalitiesforPlot <- filter(stormData, EVTYPE %in% significantEvents)

#Check top 10 values
justFatalities <- select(significantFatalitiesforPlot, EVTYPE, FATALITIES)
justFatalities <- arrange(justFatalities, desc(FATALITIES))

# Note the outlier of 583 which will squash all the data in the plot if it is not excluded.

# it would be nice to look at it by year
library(lubridate)
significantFatalitiesforPlot$StartDate <- mdy_hms(as.character(significantFatalitiesforPlot$BGN_DATE))

This concludes the processing of the fatalities data. I will discuss the Results below.

# Across the United States, which types of events have the greatest economic consequences?

I processed the data by multiplying the property damage and crop damage values by their relative exponents, then summing them to arrive at the total cost. Again, I then look at the top ten values.

# So we care about Property Damage (PROPDMG) and Crop Damage (CROPDMG), both of which have an exponent.
# We also need the event type, and let's keep the state and year for now.

damageData <- select(stormData, BGN_DATE, STATE, EVTYPE, PROPDMG:CROPDMGEXP)

# Let's look at the property damage exponent:

table(damageData$PROPDMGEXP)
## 
##             -      ?      +      0      1      2      3      4      5 
## 465934      1      8      5    216     25     13      4      4     28 
##      6      7      8      B      h      H      K      m      M 
##      4      5      1     40      1      6 424665      7  11330
# Firstly there are duplicates of upper and lower case
damageData$PROPDMGEXP <- tolower(damageData$PROPDMGEXP)

# Now my essential strategy is going to be create a new column, ignore non-alphanumeric symbols as an error as there are only 13 of them in 900k rows
# b is billion, m is million, a number is an exponent, as suggested by column name
damageData$propMultiplier[damageData$PROPDMGEXP == "b"] <- 1e9
damageData$propMultiplier[damageData$PROPDMGEXP == "m"] <- 1e6
damageData$propMultiplier[damageData$PROPDMGEXP == "k"] <- 1e3
damageData$propMultiplier[damageData$PROPDMGEXP == "h"] <- 100
damageData$propMultiplier[damageData$PROPDMGEXP == ""] <- 1
damageData$propMultiplier[damageData$PROPDMGEXP == "0"] <- 1
damageData$propMultiplier[damageData$PROPDMGEXP == "8"] <- 1e8
damageData$propMultiplier[damageData$PROPDMGEXP == "7"] <- 1e7
damageData$propMultiplier[damageData$PROPDMGEXP == "5"] <- 1e6
damageData$propMultiplier[damageData$PROPDMGEXP == "6"] <- 1e5
damageData$propMultiplier[damageData$PROPDMGEXP == "4"] <- 1e4
damageData$propMultiplier[damageData$PROPDMGEXP == "3"] <- 1e3
damageData$propMultiplier[damageData$PROPDMGEXP == "2"] <- 100
damageData$propMultiplier[damageData$PROPDMGEXP == "1"] <- 10
damageData$propMultiplier[damageData$PROPDMGEXP == "-" | damageData$PROPDMGEXP == "?" | damageData$PROPDMGEXP == "+" ] <- 1

## Then do the same exercise with crop damage

table(damageData$CROPDMGEXP)
## 
##             ?      0      2      B      k      K      m      M 
## 618413      7     19      1      9     21 281832      1   1994
# Move to lower case
damageData$CROPDMGEXP <- tolower(damageData$CROPDMGEXP)

# Write the exponent to a new multiplier

damageData$cropMultiplier[damageData$CROPDMGEXP == "b"] <- 1e9
damageData$cropMultiplier[damageData$CROPDMGEXP == "m"] <- 1e6
damageData$cropMultiplier[damageData$CROPDMGEXP == "k"] <- 1e3
damageData$cropMultiplier[damageData$CROPDMGEXP == ""] <- 1
damageData$cropMultiplier[damageData$CROPDMGEXP == "0"] <- 1
damageData$cropMultiplier[damageData$CROPDMGEXP == "2"] <- 100
damageData$cropMultiplier[damageData$CROPDMGEXP == "-" | damageData$CROPDMGEXP == "?" | damageData$CROPDMGEXP == "+" ] <- 1


# Now let's actually multiply the damage
damageData$propertyDamage <- damageData$PROPDMG * damageData$propMultiplier
damageData$cropDamage <- damageData$CROPDMG * damageData$cropMultiplier

# Create a total damage column

damageData$totalDamage <- damageData$propertyDamage + damageData$cropDamage

# Look at the top 10 events in terms of economic damage

eventsCost <- tapply(damageData$totalDamage, damageData$EVTYPE, sum, na.rm = TRUE)


# Make it a data frame with rownames as a value
eventsCost <- as.data.frame(eventsCost)
eventsCost$totalCost <- eventsCost$eventsCost
eventsCost$eventsCost <- rownames(eventsCost)
names(eventsCost)[1] <- "event"

eventsCost <- arrange(eventsCost, desc(totalCost))
topTenEventsByCost <- eventsCost[1:10,]

The data is now sufficiently processed and the results can be viewed below.

Results

Which types of events are most harmful to population health?

I have looked at fatalities data alone to answer this question. On this basis, we can see that the events which have caused the top ten fatalities are:

# In the results section I am going to want a table of the top 10

library(xtable)
# Format for printing
newTable <- xtable(significantFatalitiesByEvent)
print(newTable, type="html")
eventType totalFatalities
1 TORNADO 5633.00
2 EXCESSIVE HEAT 1903.00
3 FLASH FLOOD 978.00
4 HEAT 937.00
5 LIGHTNING 816.00
6 TSTM WIND 504.00
7 FLOOD 470.00
8 RIP CURRENT 368.00
9 HIGH WIND 248.00
10 AVALANCHE 224.00

However, this is partly caused by the increase in reporting of certain types of events in recent years. When the fatalities data is split by date, one can see this:

# Total fatalities by event
library(ggplot2)

fatalitiesPlot <- ggplot(data=significantFatalitiesforPlot, aes(x=StartDate, y=FATALITIES, color=EVTYPE)) +
                geom_point() + coord_cartesian(ylim=c(0,160)) +
                ggtitle("Number of death from top 10 fatality-causing severe weather events 1950-2011") +
                theme(axis.text.x = element_text(angle = 90, hjust=1),  plot.title = element_text(hjust = 0.5), legend.title = element_blank()) +
                ylab("Fatalities") +
                xlab("Year")
                

fatalitiesPlot + facet_grid(rows = "EVTYPE")

# N.B. Please note that I have excluded one outlier, a tragic Heat event in July 1995 which caused 583 fatalities. This figure is included in the table.

It is clear from this figure that tornadoes have been recorded consistently since 1950, whereas most other events have been recorded since around 1990. If they had been recorded since 1950, their true figures would be higher, and excessive heat in particular would be comparable to tornadoes in terms of fatality.

Which types of events are most harmful to the economy?

To answer this question, I looked at the top ten events in terms of their damage to the economy.

# Load scales package to make y axis human readable
library(scales)

costPlot <- ggplot(data = topTenEventsByCost, aes(x = event, y = totalCost)) + 
                geom_col(aes(color="red", fill="pink")) +
                scale_y_continuous(labels = dollar) +
                theme(axis.text.x = element_text(hjust=1),  plot.title = element_text(hjust = 0.5), legend.position = "none") +
                ggtitle("Top ten events severe weather events by cost in the USA 1950-2011") +
                ylab("Total cost") +
                xlab("Weather event")

costPlot

It is clear from this plot that drought has by far the highest cost to the economy. It is difficult to know what policymakers can do about drought, as currently there is no technology capable of controlling weather.

Having said that, there is technology which can mitigate the effects of severe weather, for example irrigation systems in deserts.

More research is required to establish the mechanism for the economic cost of severe weather in order that such steps can be identified and funded.