This publication downloads weather data created by NOAA and ranks the health and economic impacts of severe weather types in the U.S.A. between 1950 & 2011. The qualitative information is of poor quality - being the worst in the earlier observations. To resolve this, a series of steps were taken to consolidate overly specific or mis-keyed categories. Another problem with the raw data was that the measures were given in a non-additive format - having varying units between hundreds and billions. To resolve this, all quantitative data was re-factored and scaled to a common denominator.
To provide complete transparency into the author’s judgments that produced the outcomes, all the steps used are given in the reproducible code that follows.
This process was created using R studio with the dplyr package. It has been tested using only the below system and versions:
platform x86_64-apple-darwin13.4.0
arch x86_64
os darwin13.4.0
system x86_64, darwin13.4.0
status
major 3
minor 2.2
year 2015
month 08
day 14
svn rev 69053
language R
R version 3.2.2 (2015-08-14) nickname Fire Safety Rstudio Version 0.99.463 Required Package: dplyr
Results were validated by setting the parameters in this code to pull specific years that could be compared with reports available from the National Weather Service’s Natural Hazard Statistics web site.
The below chunk of code allows one to set new parameters should the URL or file name ever changes for downloading this data. Also, the local file can be given a custom name by changing fileNew. After the download is complete, the csv is decompressed and loaded into the sd data frame by the read.csv command.
url <- "http://d396qusza40orc.cloudfront.net/"
file <- "repdata%2Fdata%2FStormData.csv.bz2"
fileNew <- "StormData.csv.bz2"
download.file(paste(url, file, sep=""), fileNew, method="curl")
sd <- read.csv(fileNew,head=TRUE, sep=",", stringsAsFactors=FALSE)
In our source data set, information recorded in the EVTYPE column is suffering from human error. Summation of common weather types can only be accomplished if this is cleansed. To solve the problem, the below chunk creates a new WeatherType column that is loaded with consistent weather event descriptions based upon regular expressions that interpret keywords in EVTYPE.
ts <- grep("TSTM|^TSTM|TSTM$|THUNDER|^THUNDER|THUNDER$", toupper(sd$EVTYPE), value=FALSE)
sd[ts,"WeatherType"] <- "THUNDERSTORM"
fl <- grep("FLOOD|^FLOOD|FLOOD$|RAIN|^RAIN|RAIN$", toupper(sd$EVTYPE), value=FALSE)
sd[fl,"WeatherType"] <- "FLOOD"
ff <- grep("FLASH FLOOD|^FLASH FLOOD|FLASH FLOOD$", toupper(sd$EVTYPE), value=FALSE)
sd[ff,"WeatherType"] <- "FLASH FLOOD"
ff <- grep("FLASHFLOOD|^FLASHFLOOD|FLASHFLOOD$", toupper(sd$EVTYPE), value=FALSE)
sd[ff,"WeatherType"] <- "FLASH FLOOD"
sw <- grep("SNOW|^SNOW|SNOW$", toupper(sd$EVTYPE), value=FALSE)
sd[sw,"WeatherType"] <- "SNOW"
ct <- grep("CURRENT|^CURRENT|CURRENT$", toupper(sd$EVTYPE), value=FALSE)
sd[ct,"WeatherType"] <- "RIP CURRENT"
fe <- grep("FIRE|^FIRE|FIRE$", toupper(sd$EVTYPE), value=FALSE)
sd[fe,"WeatherType"] <- "FIRE"
cd <- grep("COLD|^COLD|COLD$", toupper(sd$EVTYPE), value=FALSE)
sd[cd,"WeatherType"] <- "COLD"
wr <- grep("WINTER|^WINTER|WINTER$", toupper(sd$EVTYPE), value=FALSE)
sd[wr,"WeatherType"] <- "WINTER"
ss <- grep("SURGE|^SURGE|SURGE$", toupper(sd$EVTYPE), value=FALSE)
sd[ss,"WeatherType"] <- "STORM SURGE"
lg <- grep("LIGHTNING|^LIGHTNING|LIGHTNING$", toupper(sd$EVTYPE), value=FALSE)
sd[lg,"WeatherType"] <- "LIGHTNING"
hl <- grep("HAIL|^HAIL|HAIL$", toupper(sd$EVTYPE), value=FALSE)
sd[hl,"WeatherType"] <- "FROZEN PERCIP"
ht <- grep("HEAT|^HEAT|HEAT$", toupper(sd$EVTYPE), value=FALSE)
sd[ht,"WeatherType"] <- "HEAT"
ft <- grep("FROST|^FROST|FROST$", toupper(sd$EVTYPE), value=FALSE)
sd[ft,"WeatherType"] <- "FROST"
wd <- grep("WIND|^WIND|WIND$", toupper(sd$EVTYPE), value=FALSE)
sd[wd,"WeatherType"] <- "WIND"
fz <- grep("FREEZE|^FREEZE|FREEZE$|FREEZ|^FREEZ|FREEZ$|ICE|^ICE|ICE$|ICY|^ICY|ICY$", toupper(sd$EVTYPE), value=FALSE)
sd[fz,"WeatherType"] <- "FROZEN PERCIP"
tn <- grep("TORNADO|^TORNADO|TORNADO$|FUNNEL CLOUD", toupper(sd$EVTYPE), value=FALSE)
sd[tn,"WeatherType"] <- "TORNADO"
he <- grep("HURRICANE|^HURRICANE|HURRICANE$", toupper(sd$EVTYPE), value=FALSE)
sd[he,"WeatherType"] <- "HURRICANE"
na <- is.na(sd$WeatherType)
sd[na,"WeatherType"] <- toupper(sd[na,"EVTYPE"])
The raw data contains multiple indicators in CROPDMGEXP and PROPDMGEXP such as; B for “Billions”, M for “Millions” etc. This chunk loads a set of vectors with the expression values; y, that correspond with their multiple; z, that will load into a new field in the data frame called PROPextended. The vector; t, contains descriptions of the multiples. Finally, s allows the researcher to choose the divisor by which to scale final results.
library(dplyr)
options(scipen = 999) # Turn off scientific notation
sd$CROPextended <- 1 # Initialize new field that holds extenstion.
sd$PROPextended <- 1 # Initialize new field that holds extenstion.
s <-'B' # Assign variable s to Scale fraphs Accordingly
t <-c('(Billions)','(Millions)','(Thousands)','(Hundreds)','') # Scale Text
y <-c('B','M','K','2','') # Values to use from PROPDMGEXP and CROPDMGEXP
z <- c(1000000000, 1000000,1000,100,1)
sd$CROPextended <- (z[match(toupper(sd$CROPDMGEXP),y)] * sd$CROPDMG)
sd$PROPextended <- (z[match(toupper(sd$PROPDMGEXP),y)] * sd$PROPDMG)
scale <- z[match(s,y)] # Scale Multiple
scale_desc <- t[match(s,y)] # Scale Description
The below parameters have been given to make it easier to modify output and evaluate results.
Using the parameters above, this research can be regenerated to produce results comparable to the National Weather Service’s Natural Hazard Statistics web site. It is interesting to note that in more recent years where input validation should have been easier, our source data has puzzling anomalies. For example, there is no record of Louisiana having a Hurricane in 2005. Katrina is noted as having a minor impact in Mississippi, but there is nothing for Louisiana.
#Create four new data frames that summarize which weather events caused the highest number of fatalities, Injuries, Crop Damages & Property Damages. There are a few parameter variables for making easy changes in this chunk;
top_n <- 10 # Update this to desired length of top n list
begin_year <- 1950 #Use this as a parameter for the beginning year.
end_year <- 2011 #Use this as a parameter for the ending year.
#Full Range of years is 1950 - 2011
sd$BGN_YEAR <- as.numeric(dates <- substr(sd$BGN_DATE, nchar(sd$BGN_DATE)-11, nchar(sd$BGN_DATE)-8)) #Extract year as numeric for filtering out non-used years.
sdf <- sd %>% # Create Fatalities Dataframe
filter(BGN_YEAR >= begin_year & BGN_YEAR <= end_year) %>%
group_by(WeatherType) %>%
summarise(COUNT = sum(FATALITIES)) %>%
arrange(desc(COUNT)) %>%
top_n(top_n)
sdi <- sd %>% # Create Injuries DataFrame
filter(BGN_YEAR >= begin_year & BGN_YEAR <= end_year) %>%
group_by(WeatherType) %>%
summarise(COUNT = sum(INJURIES)) %>%
arrange(desc(COUNT)) %>%
top_n(top_n)
sdc <- sd %>% # Create Crop Damage Data Frame
filter(BGN_YEAR >= begin_year & BGN_YEAR <= end_year) %>%
group_by(WeatherType) %>%
summarise(USD = sum(CROPextended)) %>%
arrange(desc(USD)) %>%
top_n(top_n)
sdp <- sd %>% # Create Property Damage Data Frame
filter(BGN_YEAR >= begin_year & BGN_YEAR <= end_year) %>%
group_by(WeatherType) %>%
summarise(USD = sum(PROPextended)) %>%
arrange(desc(USD)) %>%
top_n(top_n)
The following two Health and Economic figures present results in bar plots where a log scale has been used for the Y axis due to the large difference between the min and max results.
It should be noted that, the topmost natural hazards for all available years do not follow the same patterns as a fewer number of years -especially the most recent years.
Figure 1 shows the top ranking weather events that caused the greatest number of injuries and fatalities for the chosen years.
# Create Figure 1 Panel Plot
par(
mfrow = c(1, 2), mar = c(9, 4, 6, 2), mgp = c(3, 1, 0), cex = 0.8, bg="gray"
)
barplot(
sdf$COUNT, las = 3, names.arg = sdf$WeatherType, main = "Top Fatalities",
ylab = "Fatality Count (log)", log = "y", col = "blue"
)
barplot(
sdi$COUNT, las = 3, names.arg = sdi$WeatherType, main = "Top Injuries",
ylab = "Injury Count (log)", log = "y", col = "blue"
)
mtext(paste("Figure 1: Health Impact",begin_year, "-",end_year, sep=" "), side=3, outer=TRUE, line=-1)
Figure 2 shows the topmost economic damage, measured in U.S Dollars, caused by severe weather events.
# Create figure 2 Panel Plot
par(
mfrow = c(1, 2), mar = c(11, 4, 6, 2), mgp = c(3, 1, 0), cex = 0.8, bg="gray"
)
barplot(
sdc$USD / scale, las = 3, names.arg = sdc$WeatherType, main = "Crop Damage",
ylab = paste("USD$",scale_desc,"(log)"), log = "y", col = "blue"
)
barplot(
sdp$USD / scale, las = 3, names.arg = sdp$WeatherType, main = "Property Damage",
ylab = paste("USD$",scale_desc,"(log)"), log = "y", col = "blue"
)
mtext(paste("Figure 2: Economic Impact",begin_year, "-",end_year, sep=" "), side=3, outer=TRUE, line=-1)