This document describes an analysis completed as an assignment for the Coursera Reproducible Research course from Johns Hopkins University.
Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage. Preparing for unprecedented extreme weather is a key concern of public officials.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. Over 900 event types are contained in this database, which covers the 1950-2011 time period. It tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
Our goal here is to answer two basic questions about losses generated from severe weather events.
Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?
More information about the data is available in the National Weather Service Storm Data Documentation and the National Climatic Data Center Storm Events FAQ.
Before beginning the project, load the required R libraries and set any environment variables. Setting message=FALSE in the chunk options suppresses library-loading messages such as version numbers and dependency notes. Using up-to-date versions of these libraries should make it easier to reproduce results similar to those shown here. All code chunks are set to echo=TRUE so that the calculations and assumptions made are visible.
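For reference, a minimal knitr setup chunk matching these options would look like the sketch below; the exact setup chunk used in the original source is not shown here, so this is an assumed equivalent.
# assumed setup chunk: echo = TRUE reveals code, message = FALSE suppresses library-loading messages
knitr::opts_chunk$set(echo = TRUE, message = FALSE)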
# Clear environment of prior calculations.
rm(list=ls())
library(dplyr)      # data manipulation (data.table + dplyr syntax is also available via the dtplyr package)
library(data.table)
library(knitr)      # for output to html and pdf
library(ggplot2)    # enhanced graphics
library(ggthemes)   # additional ggplot2 themes
library(readr)      # for faster/consistent reads and writes
library(reshape2)   # for melt(), used when reshaping the plotting datasets
library(RCurl)      # for loading the external dataset (getBinaryURL)
library(R.utils)    # for bunzip2
Load the data from the previously prepared stormData.rds file, or reload and rescrub the raw data. Loading the cached file saves considerable time during the report-writing phase.
dataReload <- TRUE
dataRescrub <- TRUE
# check if a scrubbed stormData.rds already exists
if(file.exists("./data/stormData.rds")){
    stormData <- read_rds("./data/stormData.rds")
    dataReload <- FALSE  # set this to FALSE for publication, TRUE for debugging
    dataRescrub <- FALSE # set this to FALSE for publication, TRUE for debugging
}
if(dataReload){
    # create a data dir if it doesn't exist
    if(!file.exists("./data")){
        dir.create("./data")
    }
    # download the compressed file and unzip it to ./data/StormData.csv
    if(!file.exists("./data/StormData.csv")){
        print("Saving new stormData files...")
        temp <- tempfile()
        download.file("https://d396qusza40orc.cloudfront.net/repdata/data/StormData.csv.bz2",
                      temp, mode = "wb")
        destPath <- "./data/StormData.csv"
        bunzip2(temp, destPath, overwrite = TRUE, remove = FALSE)
        unlink(temp)
    }
    # read the csv file into stormData
    if(file.exists("./data/StormData.csv")){
        print("Reading new stormData csv file...")
        stormData <- read.csv("./data/StormData.csv", stringsAsFactors = FALSE)
    }
}
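A quick structural check (optional, not part of the original script) confirms that the raw data loaded with the expected rows and columns:
# optional check of raw data dimensions and column names
dim(stormData)
names(stormData)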
For the purposes of this analysis we will focus only on the following fields: BGN_DATE, END_DATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, and CROPDMGEXP.
if(dataRescrub){
    # A. Drop columns not wanted based on knowledge of the dataset and the research focus.
    cols.want <- stormData[, grep('BGN_DATE|END_DATE|EVTYPE|FATALITIES|INJURIES|PROPDMG|PROPDMGEXP|CROPDMG|CROPDMGEXP', x = names(stormData))]
    cols.dont.want <- stormData[, -grep('BGN_DATE|END_DATE|EVTYPE|FATALITIES|INJURIES|PROPDMG|PROPDMGEXP|CROPDMG|CROPDMGEXP', x = names(stormData))]
    # use cols.want to reset the base data
    stormData <- cols.want
}
Since the damage amounts are stored as a base value (PROPDMG, CROPDMG) together with a character exponent code (PROPDMGEXP, CROPDMGEXP), create summary numeric fields that hold the total damages in dollars.
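As a toy illustration (the values here are invented for clarity, not taken from the dataset), the exponent codes act as powers of ten applied to the base damage amount:
# illustration only: "K" means thousands (10^3), "M" means millions (10^6)
example_dmg <- c(25, 1.5)                      # hypothetical base damage amounts
example_exp <- c("K", "M")                     # hypothetical exponent codes
example_dmg * 10^c(K = 3, M = 6)[example_exp]  # 25,000 and 1,500,000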
if(dataRescrub){
    # B. Reformat property damage columns
    # fill in the blanks
    stormData$PROPDMGEXP[is.na(stormData$PROPDMGEXP)] <- "0"
    # work with the multiplier column as character for recoding
    stormData$PROPDMGEXP <- as.character(stormData$PROPDMGEXP)
    # map both lower- and upper-case exponent letters to their numeric meaning
    stormData$PROPDMGEXP[toupper(stormData$PROPDMGEXP) == 'H'] <- "2"
    stormData$PROPDMGEXP[toupper(stormData$PROPDMGEXP) == 'K'] <- "3"
    stormData$PROPDMGEXP[toupper(stormData$PROPDMGEXP) == 'M'] <- "6"
    stormData$PROPDMGEXP[toupper(stormData$PROPDMGEXP) == 'B'] <- "9"
    # treat any remaining blank or non-numeric codes as 10^0
    stormData$PROPDMGEXP[!grepl("^[0-9]+$", stormData$PROPDMGEXP)] <- "0"
    # set the multipliers back to numeric values
    stormData$PROPDMGEXP <- as.numeric(stormData$PROPDMGEXP)
    # calculate total damages and place them into a new column
    stormData$TOTALPROPDMG <- stormData$PROPDMG * 10^stormData$PROPDMGEXP

    # C. Reformat crop damage columns
    stormData$CROPDMGEXP[is.na(stormData$CROPDMGEXP)] <- "0"
    # work with the multiplier column as character for recoding
    stormData$CROPDMGEXP <- as.character(stormData$CROPDMGEXP)
    # map both lower- and upper-case exponent letters to their numeric meaning
    stormData$CROPDMGEXP[toupper(stormData$CROPDMGEXP) == 'H'] <- "2"
    stormData$CROPDMGEXP[toupper(stormData$CROPDMGEXP) == 'K'] <- "3"
    stormData$CROPDMGEXP[toupper(stormData$CROPDMGEXP) == 'M'] <- "6"
    stormData$CROPDMGEXP[toupper(stormData$CROPDMGEXP) == 'B'] <- "9"
    # treat any remaining blank or non-numeric codes as 10^0
    stormData$CROPDMGEXP[!grepl("^[0-9]+$", stormData$CROPDMGEXP)] <- "0"
    # set the multipliers back to numeric values
    stormData$CROPDMGEXP <- as.numeric(stormData$CROPDMGEXP)
    # calculate total damages and place them into a new column
    stormData$TOTALCROPDMG <- as.numeric(stormData$CROPDMG * 10^stormData$CROPDMGEXP)
}
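As a quick sanity check (not part of the original pipeline), confirm that no exponent codes failed to convert to numeric multipliers:
# both counts should be zero after the recoding above
sum(is.na(stormData$TOTALPROPDMG))
sum(is.na(stormData$TOTALCROPDMG))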
For historical and data-entry reasons, the event labels assigned to records are not always consistent with the current official event categories. Beginning in 1996, the number of recorded event types was expanded from three to 48. For consistency, records for events prior to 1996 will be excluded.
if(dataRescrub){
    # assign a value to the cut-off year
    splitYear <- 1996
    # add a numeric year dimension derived from the begin date
    stormData$EVT_YR <- as.numeric(format(as.Date(stormData$BGN_DATE, format = "%m/%d/%Y"), "%Y"))
    # keep only records beginning in or after the cut-off year
    stormData <- stormData %>%
        filter(EVT_YR >= splitYear)
}
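A brief check (optional) confirms that only events from 1996 onward remain after the filter:
# expected to span 1996 through 2011
range(stormData$EVT_YR)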
The raw data initially contained over 900 event type labels. Refactoring and reviewing these labels is essential for reducing classification error and properly allocating damages to each category. The code below documents the grouping decisions used to create the new event categories. This process reduces the categories to 13, including one "OTHER" category for hard-to-classify event types.
if(dataRescrub){
    # create temporary processing sets
    stormData2 <- stormData          # keeps a pre-factored copy in case the edits need to be reset
    reducedStormData <- stormData2   # working copy whose labels will be consolidated
    # convert labels to upper case to normalize them (alternatively, set ignore.case = TRUE in grepl)
    reducedStormData$EVTYPE <- toupper(reducedStormData$EVTYPE)
    reducedStormData[grepl("HIGH WIND*|WND|GUSTY|HIGH WINDS|GRADIENT WIND|WIND DAMAGE|GUSTY LAKE WIND|WIND GUSTS|WIND ADVISORY|RED FLAG CRITERIA|STRONG WIND.*|STRONG WIND GUSTS|WINDS",
                           reducedStormData$EVTYPE), "EVTYPE"] <- "WIND"
    reducedStormData[grepl("^STORM|HURRICANE|TYPHOON|TROPICAL STORM|REMNANTS OF FLOYD",
                           reducedStormData$EVTYPE), "EVTYPE"] <- "HURRICANE/TYPHOON"
    reducedStormData[grepl("SUMMARY.*|.*OTHER.*|NONE|NORTHERN LIGHTS|NO SEVERE WEATHER",
                           reducedStormData$EVTYPE), "EVTYPE"] <- "OTHER"
    reducedStormData[grepl("TSTM|THUNDERSTORM|LIGHTNING|^METRO STORM.*|COASTAL STORM|COASTALSTORM",
                           reducedStormData$EVTYPE), "EVTYPE"] <- "THUNDERSTORM"
    reducedStormData[grepl("TORNADO|SPOUT|FUNNEL|WHIRLWIND",
                           reducedStormData$EVTYPE), "EVTYPE"] <- "TORNADO"
    reducedStormData[grepl("FLOOD|SURF|BLOW_OUT|SWELLS|FLD|DAM BREAK",
                           reducedStormData$EVTYPE), "EVTYPE"] <- "FLOOD"
    reducedStormData[grepl("FIRE|SMOKE|VOLCANIC",
                           reducedStormData$EVTYPE), "EVTYPE"] <- "FIRE"
    reducedStormData[grepl("PRECIPITATION|RAIN|HAIL|DRIZZLE|WET|PERCIP|BURST|DEPRESSION|FOG|CLOUD|VOG",
                           reducedStormData$EVTYPE), "EVTYPE"] <- "RAIN/FOG"
    reducedStormData[grepl("COLD|COOL|ICE|ICY|FROST|FREEZE|SNOW|WINTER|WINTRY|WINTERY|BLIZZARD|CHILL|FREEZING|AVALANCHE|GLAZE|SLEET|MIXED PRECIP",
                           reducedStormData$EVTYPE), "EVTYPE"] <- "SNOW/ICE"
    reducedStormData[grepl("WARMTH|WARM|HEAT|DRY|HOT|DROUGHT|THERMIA|TEMPERATURE RECORD|RECORD TEMPERATURE|RECORD HIGH|MONTHLY TEMPERATURE|DRIEST MONTH",
                           reducedStormData$EVTYPE), "EVTYPE"] <- "HEAT/DROUGHT"
    reducedStormData[grepl("SEAS|HIGH WATER|TIDE|TSUNAMI|WAVE|CURRENT|MARINE|DROWNING|SURF|SEICHE|WAKE LOW WIND",
                           reducedStormData$EVTYPE), "EVTYPE"] <- "WATER-RELATED"
    reducedStormData[grepl("DUST.*|SAHARA.*",
                           reducedStormData$EVTYPE), "EVTYPE"] <- "DUST"
    reducedStormData[grepl("SLIDE|EROSION|SLUMP",
                           reducedStormData$EVTYPE), "EVTYPE"] <- "SLIDES/EROSION"
    # verify that all records fall into a proper category
    unique(reducedStormData$EVTYPE)
    # assign the reduced data back to the analysis dataset
    stormData <- reducedStormData
}
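To review how the records are distributed across the 13 consolidated categories (a quick check, not part of the original pipeline):
# count records per consolidated event category, largest first
stormData %>% count(EVTYPE, sort = TRUE)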
To improve processing time, the cleaned, reduced dataset is saved to an rds-format file. To re-run the scrubbing operations and refresh the data, delete the local stormData.rds file from the data folder.
if(dataRescrub){
    write_rds(stormData, "./data/stormData.rds") # the data could alternatively be stored as rda or csv
}
# create graphing dataset for fatalities
sumFatalities <- stormData %>%
    select(Event = EVTYPE, FATALITIES) %>%
    group_by(Event) %>%
    summarise(Fatalities = as.numeric(sum(FATALITIES, na.rm = TRUE)))
sumFatalities <- sumFatalities[order(-sumFatalities$Fatalities), ][1:10, ]
# create graphing dataset for injuries
sumInjuries <- stormData %>%
    select(Event = EVTYPE, INJURIES) %>%
    group_by(Event) %>%
    summarise(Injuries = as.numeric(sum(INJURIES, na.rm = TRUE)))
sumInjuries <- sumInjuries[order(-sumInjuries$Injuries), ][1:10, ]
# create combined graphing dataset of fatalities and injuries
deadAndInjured <- merge(x = sumFatalities, y = sumInjuries, by = "Event", all = TRUE)
deadAndInjured <- melt(deadAndInjured, id.vars = 'Event')
names(deadAndInjured) <- c("Event", "Loss", "Value")
deadAndInjured$Value <- deadAndInjured$Value/(10^3)
# plot both sets
ggplot(deadAndInjured, aes(x = Event, y = Value, fill = Loss)) +
    geom_bar(stat = "identity") +
    labs(fill = "") +
    labs(x = "", y = "Thousands - (Source: NOAA Storm Data 1996-2011)") +
    ggtitle("Weather-related Fatality & Injury Counts") +
    coord_flip() +
    theme_economist()
Answer: This plot shows that heat/drought and tornado events together caused the most fatalities, while tornado disasters alone represented the largest cause of weather-related injuries in the United States from 1996 to 2011.
# create graphing dataset for crop damages
cropSum <- stormData %>%
    select(Event = EVTYPE, TOTALCROPDMG) %>%
    group_by(Event) %>%
    summarise(Crop = as.numeric(sum(TOTALCROPDMG, na.rm = TRUE)))
cropSum <- cropSum[order(-cropSum$Crop), ][1:10, ]
# create graphing dataset for property damages
propSum <- stormData %>%
    select(Event = EVTYPE, TOTALPROPDMG) %>%
    group_by(Event) %>%
    summarise(Property = as.numeric(sum(TOTALPROPDMG, na.rm = TRUE)))
propSum <- propSum[order(-propSum$Property), ][1:10, ]
# create combined graphing dataset of crop and property damages
cropAndProp <- merge(x = cropSum, y = propSum, by = "Event", all = TRUE)
cropAndProp <- melt(cropAndProp, id.vars = 'Event')
names(cropAndProp) <- c("Event", "Loss", "Value")
cropAndProp$Value <- cropAndProp$Value/(10^9)
# plot property and crop losses
ggplot(cropAndProp, aes(Event, Value)) +
    geom_bar(aes(fill = Loss), position = "dodge", stat = "identity") +
    labs(fill = "") +
    labs(x = "", y = "Damage, Billion $USD - (Source: NOAA Storm Data 1996-2011)") +
    ggtitle("Weather-related Property & Crop Losses") +
    coord_flip() +
    theme_economist()
Answer: Based on the chart above, frequent flood and hurricane/typhoon disasters accounted for most of the property damage, while severe drought and flood events contributed most to the crop damage in the United States from 1996 to 2011.
The main challenges of working with this data included inconsistent event labeling and differing conventions for recording damage and human-loss estimates. In addition, more recent data may be affected by factors not recorded in the dataset, such as the increasing concentration of property along coastal and flood-prone areas that are exposed to natural events.
Finally, further study of this data should reveal more specific regional and local differences in weather-related risks; one possible approach is sketched below.
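The following sketch shows one way a state-level damage summary could be built. Note that it assumes the STATE column is retained during the column-selection step above, which the current pipeline drops, so the selection pattern would need to be extended first.
# hypothetical regional summary; requires keeping STATE in the column selection
stateDamage <- stormData %>%
    group_by(STATE, EVTYPE) %>%
    summarise(TotalDamage = sum(TOTALPROPDMG + TOTALCROPDMG, na.rm = TRUE)) %>%
    ungroup() %>%
    arrange(desc(TotalDamage))
head(stateDamage, 10)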
RPubs page: http://rpubs.com/iwebconsultant/storm-data