The basic goal of this assignment is to explore the NOAA Storm Database and answer some basic questions about severe weather events. You must use the database to answer the questions below and show the code for your entire analysis.
Your data analysis must address the following questions:
I have analysed the NOAA Storm Database 1950-2011 using base R and the following packages: dplyr, lubridate, xtable, scales and ggplot2 for figures. Based on this analysis, it appears that the event which has caused the most significant mortality in the US is tornadoes, followed by excessive heat and flash floods. The most economically costly events are drought, extreme cold and hail. Please see below for the code and figures produced by this analysis.
I read in the data with the following R script:
fileUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
destFileName <- "stormdata.bz2"
folderName <- "CSV files"
# download file
if(!file.exists(destFileName)){download.file(fileUrl, destfile = destFileName)}
#read in file
stormData <- read.csv(destFileName)
I then summarised the fatalities data by looking at the top ten causes of fatalities.
# Establish total fatalities by event type
fatalitiesByEvent <- tapply(stormData$FATALITIES, stormData$EVTYPE, sum)
# Make this into a data frame
fatalitiesByEvent <- as.data.frame(fatalitiesByEvent)
# Put rownames as a value in the data frame
fatalitiesByEvent$totalFatalities <- fatalitiesByEvent$fatalitiesByEvent
fatalitiesByEvent$fatalitiesByEvent <- rownames(fatalitiesByEvent)
names(fatalitiesByEvent)[1] <- "eventType"
# This has almost 1000 events
dim(fatalitiesByEvent)
## [1] 985 2
# Let's take the top ten
library(dplyr)
fatalitiesByEvent <- arrange(fatalitiesByEvent, desc(totalFatalities))
significantFatalitiesByEvent <- fatalitiesByEvent[1:10,]
# Use this list to take a subset of data to plot in results section
significantEvents <- significantFatalitiesByEvent$eventType
# filter the data according only to significant events
significantFatalitiesforPlot <- filter(stormData, EVTYPE %in% significantEvents)
#Check top 10 values
justFatalities <- select(significantFatalitiesforPlot, EVTYPE, FATALITIES)
justFatalities <- arrange(justFatalities, desc(FATALITIES))
# Note the outlier of 583 which will squash all the data in the plot if it is not excluded.
# it would be nice to look at it by year
library(lubridate)
significantFatalitiesforPlot$StartDate <- mdy_hms(as.character(significantFatalitiesforPlot$BGN_DATE))
This concludes the processing of the fatalities data. I will discuss the Results below.
I processed the data by multiplying the property damage and crop damage values by their relative exponents, then summing them to arrive at the total cost. Again, I then look at the top ten values.
# So we care about Property Damage (PROPDMG) and Crop Damage (CROPDMG), both of which have an exponent.
# We also need the event type, and let's keep the state and year for now.
damageData <- select(stormData, BGN_DATE, STATE, EVTYPE, PROPDMG:CROPDMGEXP)
# Let's look at the property damage exponent:
table(damageData$PROPDMGEXP)
##
## - ? + 0 1 2 3 4 5
## 465934 1 8 5 216 25 13 4 4 28
## 6 7 8 B h H K m M
## 4 5 1 40 1 6 424665 7 11330
# Firstly there are duplicates of upper and lower case
damageData$PROPDMGEXP <- tolower(damageData$PROPDMGEXP)
# Now my essential strategy is going to be create a new column, ignore non-alphanumeric symbols as an error as there are only 13 of them in 900k rows
# b is billion, m is million, a number is an exponent, as suggested by column name
damageData$propMultiplier[damageData$PROPDMGEXP == "b"] <- 1e9
damageData$propMultiplier[damageData$PROPDMGEXP == "m"] <- 1e6
damageData$propMultiplier[damageData$PROPDMGEXP == "k"] <- 1e3
damageData$propMultiplier[damageData$PROPDMGEXP == "h"] <- 100
damageData$propMultiplier[damageData$PROPDMGEXP == ""] <- 1
damageData$propMultiplier[damageData$PROPDMGEXP == "0"] <- 1
damageData$propMultiplier[damageData$PROPDMGEXP == "8"] <- 1e8
damageData$propMultiplier[damageData$PROPDMGEXP == "7"] <- 1e7
damageData$propMultiplier[damageData$PROPDMGEXP == "5"] <- 1e6
damageData$propMultiplier[damageData$PROPDMGEXP == "6"] <- 1e5
damageData$propMultiplier[damageData$PROPDMGEXP == "4"] <- 1e4
damageData$propMultiplier[damageData$PROPDMGEXP == "3"] <- 1e3
damageData$propMultiplier[damageData$PROPDMGEXP == "2"] <- 100
damageData$propMultiplier[damageData$PROPDMGEXP == "1"] <- 10
damageData$propMultiplier[damageData$PROPDMGEXP == "-" | damageData$PROPDMGEXP == "?" | damageData$PROPDMGEXP == "+" ] <- 1
## Then do the same exercise with crop damage
table(damageData$CROPDMGEXP)
##
## ? 0 2 B k K m M
## 618413 7 19 1 9 21 281832 1 1994
# Move to lower case
damageData$CROPDMGEXP <- tolower(damageData$CROPDMGEXP)
# Write the exponent to a new multiplier
damageData$cropMultiplier[damageData$CROPDMGEXP == "b"] <- 1e9
damageData$cropMultiplier[damageData$CROPDMGEXP == "m"] <- 1e6
damageData$cropMultiplier[damageData$CROPDMGEXP == "k"] <- 1e3
damageData$cropMultiplier[damageData$CROPDMGEXP == ""] <- 1
damageData$cropMultiplier[damageData$CROPDMGEXP == "0"] <- 1
damageData$cropMultiplier[damageData$CROPDMGEXP == "2"] <- 100
damageData$cropMultiplier[damageData$CROPDMGEXP == "-" | damageData$CROPDMGEXP == "?" | damageData$CROPDMGEXP == "+" ] <- 1
# Now let's actually multiply the damage
damageData$propertyDamage <- damageData$PROPDMG * damageData$propMultiplier
damageData$cropDamage <- damageData$CROPDMG * damageData$cropMultiplier
# Create a total damage column
damageData$totalDamage <- damageData$propertyDamage + damageData$cropDamage
# Look at the top 10 events in terms of economic damage
eventsCost <- tapply(damageData$totalDamage, damageData$EVTYPE, sum, na.rm = TRUE)
# Make it a data frame with rownames as a value
eventsCost <- as.data.frame(eventsCost)
eventsCost$totalCost <- eventsCost$eventsCost
eventsCost$eventsCost <- rownames(eventsCost)
names(eventsCost)[1] <- "event"
eventsCost <- arrange(eventsCost, desc(totalCost))
topTenEventsByCost <- eventsCost[1:10,]
The data is now sufficiently processed and the results can be viewed below.
I have looked at fatalities data alone to answer this question. On this basis, we can see that the events which have caused the top ten fatalities are:
# In the results section I am going to want a table of the top 10
library(xtable)
# Format for printing
newTable <- xtable(significantFatalitiesByEvent)
print(newTable, type="html")
| eventType | totalFatalities | |
|---|---|---|
| 1 | TORNADO | 5633.00 |
| 2 | EXCESSIVE HEAT | 1903.00 |
| 3 | FLASH FLOOD | 978.00 |
| 4 | HEAT | 937.00 |
| 5 | LIGHTNING | 816.00 |
| 6 | TSTM WIND | 504.00 |
| 7 | FLOOD | 470.00 |
| 8 | RIP CURRENT | 368.00 |
| 9 | HIGH WIND | 248.00 |
| 10 | AVALANCHE | 224.00 |
However, this is partly caused by the increase in reporting of certain types of events in recent years. When the fatalities data is split by date, one can see this:
# Total fatalities by event
library(ggplot2)
fatalitiesPlot <- ggplot(data=significantFatalitiesforPlot, aes(x=StartDate, y=FATALITIES, color=EVTYPE)) +
geom_point() + coord_cartesian(ylim=c(0,160)) +
ggtitle("Number of death from top 10 fatality-causing severe weather events 1950-2011") +
theme(axis.text.x = element_text(angle = 90, hjust=1), plot.title = element_text(hjust = 0.5), legend.title = element_blank()) +
ylab("Fatalities") +
xlab("Year")
fatalitiesPlot + facet_grid(rows = "EVTYPE")
# N.B. Please note that I have excluded one outlier, a tragic Heat event in July 1995 which caused 583 fatalities. This figure is included in the table.
It is clear from this figure that tornadoes have been recorded consistently since 1950, whereas most other events have been recorded since around 1990. If they had been recorded since 1950, their true figures would be higher, and excessive heat in particular would be comparable to tornadoes in terms of fatality.
To answer this question, I looked at the top ten events in terms of their damage to the economy.
# Load scales package to make y axis human readable
library(scales)
costPlot <- ggplot(data = topTenEventsByCost, aes(x = event, y = totalCost)) +
geom_col(aes(color="red", fill="pink")) +
scale_y_continuous(labels = dollar) +
theme(axis.text.x = element_text(hjust=1), plot.title = element_text(hjust = 0.5), legend.position = "none") +
ggtitle("Top ten events severe weather events by cost in the USA 1950-2011") +
ylab("Total cost") +
xlab("Weather event")
costPlot
It is clear from this plot that drought has by far the highest cost to the economy. It is difficult to know what policymakers can do about drought, as currently there is no technology capable of controlling weather.
Having said that, there is technology which can mitigate the effects of severe weather, for example irrigation systems in deserts.
More research is required to establish the mechanism for the economic cost of severe weather in order that such steps can be identified and funded.