Storms and other severe weather events can cause both public health and economic problems across America. Severe weather events result in fatalities, injuries, property damage, and crop damage. Having a greater understanding of the concequences of severe weather events can help decision makers take steps to prevent these concequences for future weather events.
In this report we will analyze the public health and economic impacts of weather weather events collected between May 1, 1950 - May 1, 2011 by the National Weather Service. We analyze two metrics for measuring the impact of weather events and show that the greatest average economic impact is caused by hurricanes and the greatest average public health impact is caused by heat waves. We also look at the relative frequency of the weather events and show that when taking event frequency into account tornadoes have the greatest impact on public health while flooding has the greatest economic impact.
The dataset can be downloaded by clicking this link.
First and foremost we must load the data. As shown in the code snippet below we make sure we have access to our data by checking it exists in our working directory. The data will automatically download if it is missing. After this check we'll load the data.
data_url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
load_data <- function(filename = "data/StormData.csv.bz2") {
if (!file.exists("data")) {
dir.create("data")
}
if (!file.exists(filename)) {
download.file(data_url, filename, method = "curl")
}
read.csv(bzfile(filename))
}
data <- load_data()
With the data in hand we can see how our dataset is structured. We can see the dimension of the dataset.
dim(data)
## [1] 902297 37
We can also look at a small sample of each column in the dataset
str(data)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "10/10/1954 0:00:00",..: 6523 6523 4213 11116 1426 1426 1462 2873 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "000","0000","00:00:00 AM",..: 212 257 2645 1563 2524 3126 122 1563 3126 3126 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels "?","ABNORMALLY DRY",..: 830 830 830 830 830 830 830 830 830 830 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels "","E","Eas","EE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels "","?","(01R)AFB GNRY RNG AL",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","10/10/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels "","?","0000",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels "","(0E4)PAYSON ARPT",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels "","2","43","9V9",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436781 levels ""," "," "," ",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
Futhermore we can see a sample of what types of EVTYPEs are encoded in the dataset
uevt <- unique(data$EVTYPE)
uevt[order(uevt)][20:30]
## [1] BLACK ICE Black Ice
## [3] BLIZZARD BLIZZARD AND EXTREME WIND CHIL
## [5] BLIZZARD AND HEAVY SNOW BLIZZARD/FREEZING RAIN
## [7] BLIZZARD/HEAVY SNOW BLIZZARD/HIGH WIND
## [9] Blizzard Summary BLIZZARD WEATHER
## [11] BLIZZARD/WINTER STORM
## 985 Levels: ? ABNORMALLY DRY ABNORMALLY WET ... WND
From these exploratory results we can immediately pick out a few needed preprocessing steps
First, BGN_DATE is currently encoded as a factor. It would make more sense to encode as a Date object, lets do the conversion. We will also remove the trailing timestamp in BGN_Date
data$BGN_DATE <- as.Date(sapply(strsplit(as.character(data$BGN_DATE), " "),
function(x) x[1]), format = "%M/%d/%Y")
Second, a lot of EVTYPEs have similar name structures but use different syntax, for example BEACH EROSIN vs Beach Erosion vs BEACH EROSION. We will combine these similar event types to hand picked event types for the following broader categories:
broad_evtypes <- c("Thunderstorms", "Flooding", "Landslides", "Hurricanes",
"Dry Spells", "Cold Spells", "Tornadoes", "Blizzards", "Heavy Rain", "High Winds",
"Fog", "Heat Waves", "Hail", "Tropical Storms", "Erosion", "Rough Surf",
"Fires", "Cold Weather")
evtype_greps <- list(Thunderstorms = c("thunderstorm", "tstm", "thunder", "lightning"),
Flooding = c("flood", "rapid rising water"), Landslides = c("mudslide",
"landslide", "mud slide", "land slide", "rock slide", "rockslide"),
Hurricanes = c("hurricane"), Dry.Spells = c("dry weather", "dry spell",
"drought", "dry pattern"), Cold.Spells = c("cold weather", "cold spell",
"freezing"), Tornadoes = c("tornado", "waterspout", "tsunami"), Blizzards = c("blizzard",
"heavy snow", "winter storm"), Heavy.Rain = c("heavy rain", "downpour"),
High.Winds = c("high wind", "heavy wind"), Fog = c("fog"), Heat.Waves = c("heat wave",
"hot", "very warm", "prolong warmth"), Hail = c("hail"), Tropical.Storms = c("tropical storm"),
Erosion = c("erosion", "erode", "erosin"), Rough.Surf = c("high surf", "rough surf",
"heavy surf", "rough seas"), Fires = c("fire"), Cold.Weather = c("cold",
"low temp"))
data$EVTYPE <- as.character(data$EVTYPE)
evtgnames <- names(evtype_greps)
for (i in seq_along(evtype_greps)) {
rows <- numeric()
for (gterm in evtype_greps[[evtgnames[i]]]) {
rows <- union(rows, grep(gterm, data$EVTYPE, ignore.case = T))
}
data$EVTYPE[rows] <- broad_evtypes[i]
}
data$EVTYPE <- as.factor(data$EVTYPE)
Third, the PROPDMG and CROPDMG columns have additional columns named PROPDMGEXP and CROPDMGEXP. These columns indicate the base 10 exponant in which the damage values have been recorded. Some examples include K for thousands, M and m for millions, H and h for hundreds, etc. Futhermore, the (PROP/CROP)DMGEXP field also contains the symbols +/- to indicate the sign of the respective damage. First we will convert all character values to their respective numerical base-10 exponent. After this step, we will transform both PROPDMG and CROPDMG by multiplying by 10EXP and adjusting for the numerical sign.
data$PROPDMGEXP <- as.character(data$PROPDMGEXP)
data$CROPDMGEXP <- as.character(data$CROPDMGEXP)
data$PROPDMG[data$PROPDMGEXP == "-"] <- -1 * data$PROPDMG[data$PROPDMGEXP ==
"-"]
data$PROPDMGEXP[data$PROPDMGEXP %in% c("-", "+", "?", "")] <- "0"
data$PROPDMGEXP[data$PROPDMGEXP == "B"] <- "9"
data$PROPDMGEXP[data$PROPDMGEXP %in% c("m", "M")] <- "6"
data$PROPDMGEXP[data$PROPDMGEXP %in% c("k", "K")] <- "3"
data$PROPDMGEXP[data$PROPDMGEXP %in% c("h", "H")] <- "2"
data$CROPDMGEXP[data$CROPDMGEXP %in% c("-", "+", "?", "")] <- "0"
data$CROPDMGEXP[data$CROPDMGEXP == "B"] <- "9"
data$CROPDMGEXP[data$CROPDMGEXP %in% c("m", "M")] <- "6"
data$CROPDMGEXP[data$CROPDMGEXP %in% c("k", "K")] <- "3"
data$CROPDMGEXP[data$CROPDMGEXP %in% c("h", "H")] <- "2"
data$PROPDMG <- data$PROPDMG * 10^as.numeric(data$PROPDMGEXP)
data$CROPDMG <- data$CROPDMG * 10^as.numeric(data$CROPDMGEXP)
We will explore the historical costs on public health and property of the 18 event types generated during preprocessing. The costs metrics that we will use are: the average cost for each event, and the expected costs of weather events weighted by the frequency of the event.
Futhermore, the data uses 4 metrics for measuring costs, these are: crop damage, property damage, fatalities, and injuries. These can be grouped into two broad categories: Public Health and Economic. While in reality Economic and Public Health costs aren't independent, in this analysis we will focus defining Public Health as the total number of fatalities and injuries, while Economic will include the total crop and property damage.
First we explore average concequences of weather events on economy and public health, we will report the mean costs over all events in our dataset. This metric will give us a measure of the average costs incurred for each of the 18 listed events
aggr <- aggregate(cbind(FATALITIES, INJURIES, PROPDMG, CROPDMG) ~ EVTYPE, data = data,
mean)
aggr <- aggr[aggr$EVTYPE %in% broad_evtypes, ]
aggr$Economic <- aggr$PROPDMG + aggr$CROPDMG
aggr$Public <- aggr$FATALITIES + aggr$INJURIES
aggr2 <- data.frame(EVTYPE = rep(aggr$EVTYPE, 2), Cost = c(aggr$Economic, aggr$Public),
Cost.Type = factor(rep(c("Economic", "Public"), each = nrow(aggr))))
library(ggplot2)
plt <- ggplot(data = aggr2, aes(x = EVTYPE, y = log(Cost + 1), fill = EVTYPE)) +
geom_bar(stat = "identity") + labs(title = "Average Economic and Public Costs of Weather Events",
x = "Weather Event") + scale_fill_discrete(name = "Weather Event") + facet_grid(Cost.Type ~
., scales = "free") + theme(axis.text.x = element_blank())
print(plt)
The above figure shows that on average Hurricane events have the greatest economic impact while Heat Waves have the greatest public health impact, but hurricanes are a close second.
print(as.character(aggr$EVTYPE[which.max(aggr$Economic)]))
## [1] "Hurricanes"
print(as.character(aggr$EVTYPE[which.max(aggr$Public)]))
## [1] "Heat Waves"
Now we explore the average concequences of weather events on economy and public health while taking into account the relative frequency of a weather event occuring. We will compute the average frequency of each event type in our data, and use that as a multiplicative factor when estimating the average public and economic costs. This metric will give us a relative measure of how much a weather event costs. The motivation for this metric can be explained from a simple thought experiment:
Which type weather event costs more? A yearly mudslide that results in $1million in property damage, or a weekly tornado that on average does $20,000 in property damage?
data.ev <- data[data$EVTYPE %in% broad_evtypes, ]
tbl <- table(data.ev$EVTYPE)
tbl <- tbl/sum(tbl)
for (evtype in broad_evtypes) {
sel <- aggr$EVTYPE == evtype
aggr$Public[sel] <- aggr$Public[sel] * tbl[evtype]
aggr$Economic[sel] <- aggr$Economic[sel] * tbl[evtype]
}
aggr2 <- data.frame(EVTYPE = rep(aggr$EVTYPE, 2), Cost = c(aggr$Economic, aggr$Public),
Cost.Type = factor(rep(c("Economic", "Public"), each = nrow(aggr))))
plt <- ggplot(data = aggr2, aes(x = EVTYPE, y = log(Cost + 1), fill = EVTYPE)) +
geom_bar(stat = "identity") + labs(title = "Relative Economic and Public Costs of Weather Events",
x = "Weather Event") + scale_fill_discrete(name = "Weather Event") + facet_grid(Cost.Type ~
., scales = "free") + theme(axis.text.x = element_blank())
print(plt)
The above figure shows that when adjusting for occurance frequency Tornadoes have the greatest public health impact while Flooding has the greatest economic impact
print(as.character(aggr$EVTYPE[which.max(aggr$Economic)]))
## [1] "Flooding"
print(as.character(aggr$EVTYPE[which.max(aggr$Public)]))
## [1] "Tornadoes"