This study is an analysis of the U.S. National Oceanic and Atmospheric Administration’s (NOAA) Storm Database. The aim of the study is to evaluate which types of weather events are the most harmful with respect to health, and which types of events cause the greatest financial damage.
The analysis shows that tornadoes cause the greatest harm to population health, in terms of both fatalities and injuries. In financial terms (USD), floods are the greatest cause of property damage and droughts cause the most crop damage.
Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
The aim of the project is to evaluate which types of weather events are the most harmful with respect to population health, and which types of events have the greatest economic consequences.
The data for this assignment can be downloaded from the web site:
Dataset: Storm Data [47Mb]
National Weather Service Storm Data Documentation
National Climatic Data Center Storm Events FAQ
The events in the database start in the year 1950 and end in November 2011. In the earlier years there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
The full database consists of 902297 observations of 37 variables. Of these, the principal variables required to evaluate the economic and health consequences of various weather events are EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG and CROPDMGEXP.
# Define the data directory and create it if necessary...
dataDir <- "./data"
if (!dir.exists(dataDir)) {
  dir.create(dataDir)
}
# Define the data file...
dataStorm <- paste(dataDir, "repdata-data-StormData.csv.bz2", sep="/")
# Perform datafile download if necessary (binary mode keeps the .bz2 archive intact)...
if (!file.exists(dataStorm)) {
  url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
  download.file(url, destfile = dataStorm, mode = "wb")
}
With the data downloaded, it is first loaded into R and tidied. To simplify further processing a new dataframe is created containing just the 7 relevant variables. At this stage important libraries are loaded for use with data manipulation and plotting.
# Read the storm data, preprocess into a new dataframe and load libraries...
library(plyr)
library(reshape2)
library(ggplot2)
storm.full <- read.csv(dataStorm)
# Keep only the seven variables needed for the analysis...
storm.data <- storm.full[, c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG",
                             "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")]
Start tidying the dataset by exploring the values of the property and crop exponent variables, PROPDMGEXP and CROPDMGEXP respectively.
unique(storm.data$PROPDMGEXP)
## [1] K M B m + 0 5 6 ? 4 2 3 h 7 H - 1 8
## Levels: - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
unique(storm.data$CROPDMGEXP)
## [1] M K m B ? 0 k 2
## Levels: ? 0 2 B k K m M
In addition to the numerical powers of ten, there are a number of letter designators. These are assumed to represent the standard prefixes, where H = 100, K = 1000, M = 1,000,000 and B = 1,000,000,000. Lower and upper case letters are assumed to have the same meaning, so the lower case letters are converted to upper case for simplicity. The letter codes are then converted to the corresponding power of ten (H = 2, K = 3, M = 6, B = 9). A quick analysis showed that the exponents “-”, “+” and “?” occur only a handful of times; these are set to NA so that the affected events can be omitted. Missing values are set to zero, which for an exponent means a multiplier of 10^0 = 1.
# Case convert lower case exponent letter codes...
storm.data$PROPDMGEXP <- toupper(as.character(storm.data$PROPDMGEXP))
storm.data$CROPDMGEXP <- toupper(as.character(storm.data$CROPDMGEXP))
# Set missing values to 0, as it is assumed they have no associated cost...
# (numeric columns hold NA, not "", when a value is missing, so test with is.na)
storm.data$FATALITIES[is.na(storm.data$FATALITIES)] <- 0
storm.data$INJURIES[is.na(storm.data$INJURIES)] <- 0
storm.data$PROPDMG[is.na(storm.data$PROPDMG)] <- 0
storm.data$CROPDMG[is.na(storm.data$CROPDMG)] <- 0
storm.data$PROPDMGEXP[(storm.data$PROPDMGEXP == "")] <- 0
storm.data$CROPDMGEXP[(storm.data$CROPDMGEXP == "")] <- 0
# Set exponent letter codes to correct numerical value...
storm.data$PROPDMGEXP[(storm.data$PROPDMGEXP == "H")] <- 2
storm.data$PROPDMGEXP[(storm.data$PROPDMGEXP == "K")] <- 3
storm.data$PROPDMGEXP[(storm.data$PROPDMGEXP == "M")] <- 6
storm.data$PROPDMGEXP[(storm.data$PROPDMGEXP == "B")] <- 9
storm.data$CROPDMGEXP[(storm.data$CROPDMGEXP == "H")] <- 2
storm.data$CROPDMGEXP[(storm.data$CROPDMGEXP == "K")] <- 3
storm.data$CROPDMGEXP[(storm.data$CROPDMGEXP == "M")] <- 6
storm.data$CROPDMGEXP[(storm.data$CROPDMGEXP == "B")] <- 9
# Set ill-defined exponent codes to NA so they can be omitted...
storm.data$PROPDMGEXP[storm.data$PROPDMGEXP %in% c("-", "+", "?")] <- NA
storm.data$CROPDMGEXP[storm.data$CROPDMGEXP %in% c("-", "+", "?")] <- NA
# Set exponents as integers so they can be used mathematically...
storm.data$PROPDMGEXP <- as.integer(storm.data$PROPDMGEXP)
storm.data$CROPDMGEXP <- as.integer(storm.data$CROPDMGEXP)
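The same conversion rules can also be expressed as a single lookup table. The following is only a minimal sketch under the assumptions stated above (letters are case-insensitive, blanks mean an exponent of 0, anything unrecognised becomes NA); the helper name `exp_to_power` is hypothetical and is not part of the original analysis.

```r
# Hypothetical helper: map an exponent code to a power of ten
# using a named lookup vector instead of repeated assignments.
exp_to_power <- function(code) {
  code <- toupper(as.character(code))      # treat "m" and "M" alike
  code[code == ""] <- "0"                  # blank exponent: 10^0 = 1
  lookup <- c(H = 2, K = 3, M = 6, B = 9,  # letter codes
              setNames(0:9, as.character(0:9)))  # digits map to themselves
  unname(lookup[code])                     # codes not in the table give NA
}

powers <- exp_to_power(c("K", "m", "", "5", "?", "B"))
print(powers)
```

Because indexing a named vector with an unknown name returns NA, the ill-defined codes “-”, “+” and “?” need no special handling in this version.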
For reference, of the 902297 events in the database, the total number of events omitted due to ill-defined exponent values is 21.
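A count of this kind could be obtained by flagging rows in which either exponent is NA after conversion. The sketch below uses a small made-up data frame (`toy` is an assumption for illustration, not the storm data), so the number it produces is unrelated to the 21 reported above.

```r
# Toy exponent columns after conversion (NA marks an ill-defined code).
toy <- data.frame(
  PROPDMGEXP = c(3L, NA, 6L, 0L, 9L),
  CROPDMGEXP = c(0L, 3L, NA, 0L, 6L)
)

# A row is affected when at least one of its exponents is NA.
ill_defined <- is.na(toy$PROPDMGEXP) | is.na(toy$CROPDMGEXP)
n_omitted <- sum(ill_defined)
print(n_omitted)
```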
Calculate the total Property and Crop damage in USD using the mantissa and exponent values, and then combine these to give the total financial damage in USD for each weather event.
# Calculate the actual property and crop damage values...
storm.data$PROPDMGTOTAL <- storm.data$PROPDMG * 10^storm.data$PROPDMGEXP
storm.data$CROPDMGTOTAL <- storm.data$CROPDMG * 10^storm.data$CROPDMGEXP
# Calculate the total financial value of the event damage
storm.data$TOTALDMG <- storm.data$PROPDMGTOTAL + storm.data$CROPDMGTOTAL
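As a worked illustration of the mantissa and exponent arithmetic, the sketch below applies the same formulas to three made-up events (the data frame `damage` and its values are assumptions chosen for the example). A property damage of 25 with exponent 3 becomes 25 × 10^3 = 25,000 USD, and an NA exponent propagates to an NA total.

```r
# Three illustrative events; the third has an ill-defined crop exponent.
damage <- data.frame(
  PROPDMG = c(25, 2.5, 0), PROPDMGEXP = c(3L, 6L, 0L),
  CROPDMG = c(10, 0, 5),   CROPDMGEXP = c(3L, 0L, NA)
)

# Damage in USD = mantissa * 10^exponent.
damage$PROPDMGTOTAL <- damage$PROPDMG * 10^damage$PROPDMGEXP
damage$CROPDMGTOTAL <- damage$CROPDMG * 10^damage$CROPDMGEXP

# NA in either component makes the combined total NA.
damage$TOTALDMG <- damage$PROPDMGTOTAL + damage$CROPDMGTOTAL
print(damage)
```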
Aggregate the data to give the totals as a function of EVTYPE for each of the summary variables Fatalities, Injuries, Property Damage, Crop Damage and Total Financial Damage. Merge these back into a single dataframe, then order the data by each of the summary variables in turn and extract the twenty event types with the highest impact in each case.
# Aggregate the data as a function of EVTYPE...
fatalities <- aggregate(FATALITIES ~ EVTYPE, data = storm.data, FUN=sum)
injuries <- aggregate(INJURIES ~ EVTYPE, data = storm.data, FUN=sum)
propdamage <- aggregate(PROPDMGTOTAL ~ EVTYPE, data = storm.data, FUN=sum)
cropdamage <- aggregate(CROPDMGTOTAL ~ EVTYPE, data = storm.data, FUN=sum)
sumdamage <- aggregate(TOTALDMG ~ EVTYPE, data = storm.data, FUN=sum)
# Merge the aggregations into a single summary dataframe...
storm.summ <- merge(fatalities, injuries, by="EVTYPE", all=TRUE)
storm.summ <- merge(storm.summ, propdamage, by="EVTYPE", all=TRUE)
storm.summ <- merge(storm.summ, cropdamage, by="EVTYPE", all=TRUE)
storm.summ <- merge(storm.summ, sumdamage, by="EVTYPE", all=TRUE)
# Sort the dataframe by each of the summary variables and extract the first 20 rows...
fatalities <- storm.summ[order(storm.summ$FATALITIES, decreasing=TRUE),][1:20,]
injuries <- storm.summ[order(storm.summ$INJURIES, decreasing=TRUE),][1:20,]
propdamage <- storm.summ[order(storm.summ$PROPDMGTOTAL, decreasing=TRUE),][1:20,]
cropdamage <- storm.summ[order(storm.summ$CROPDMGTOTAL, decreasing=TRUE),][1:20,]
sumdamage <- storm.summ[order(storm.summ$TOTALDMG, decreasing=TRUE),][1:20,]
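The aggregate, merge and sort pattern above can be seen more easily on a small example. This is only a sketch on made-up data (`toy`, `fat`, `inj`, `summ` and `top_fatal` are illustrative names, and the values are invented), keeping the top two event types rather than the top twenty.

```r
# Five illustrative events across three event types.
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD"),
  FATALITIES = c(5, 1, 3, 4, 0),
  INJURIES   = c(20, 2, 10, 1, 3)
)

# Sum each outcome per event type.
fat <- aggregate(FATALITIES ~ EVTYPE, data = toy, FUN = sum)
inj <- aggregate(INJURIES ~ EVTYPE, data = toy, FUN = sum)

# Combine the per-type totals into one summary table.
summ <- merge(fat, inj, by = "EVTYPE", all = TRUE)

# Sort by fatalities, largest first, and keep the top two rows.
top_fatal <- head(summ[order(summ$FATALITIES, decreasing = TRUE), ], 2)
print(top_fatal)
```

Here `head()` is used instead of indexing with `1:2` because it returns fewer rows, rather than padding with NA rows, when the table is shorter than requested.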
Plotting the twenty event types with the highest impact in terms of health shows that tornadoes have the highest impact for both fatalities and injuries.
par(mfrow=c(1,2), mar=c(8,4,3,2), oma=c(4,2,2,2), cex=0.8)
barplot(fatalities$FATALITIES, names.arg=fatalities$EVTYPE, las=3,
cex.names=0.8, xlab="", ylab="total number of Fatalities", col="red",
main="Weather Events with highest incidence of Fatalities")
barplot(injuries$INJURIES, names.arg=injuries$EVTYPE, las=3, cex.names=0.8,
xlab="", ylab="total number of Injuries", col="red",
main="Weather Events\nwith highest incidence of Injuries")
Plotting the twenty event types with the highest impact in terms of financial damage shows that Floods have the highest impact in terms of Property Damage, whilst Droughts have the highest impact in terms of Crop Damage.
par(mfrow=c(1,2), mar=c(8,4,3,2), oma=c(4,2,2,2), cex=0.8)
barplot(propdamage$PROPDMGTOTAL/10^6, names.arg=propdamage$EVTYPE, las=3,
cex.names=0.8, xlab="", ylab="total Property Damage in USD (Millions)",
col="red", main="Weather Events with highest cost in Property Damage")
barplot(cropdamage$CROPDMGTOTAL/10^6, names.arg=cropdamage$EVTYPE, las=3,
cex.names=0.8, xlab="", ylab="total Crop Damage in USD (Millions)",
col="red", main="Weather Events with highest cost in Crop Damage")