U.S. National Oceanic and Atmospheric Administration (NOAA) publishes database which describes characteristics of major storms and weather events in the United States covering timespan from 1950 to 2011. Present report investigates the nation-wide economic consequences and public health effects of adverse weather events recorded in the database.
Overall,tornadoes are the cause of majority of weather-related deaths and injuries in addition to the significant economic impact. Economically, floods are the biggest cause of loss.
Typically weather events are relatively local even though their impact can be nation wide, depending on the area they occur. Therefore it is noted, that the preparedness for severe weather events should be based on information of local weather conditions.
library(data.table)
library(ggplot2)
library(plyr)
library(maps)
setwd('/Users/majulass/Documents/2014/reproducible-research/peer-assesment-02')
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", "storm-dataset.csv.bz2", method = "curl")
weather.df <- read.csv(bzfile("./storm-dataset.csv.bz2"))
# Process multipliers in exponent field.
process.multipliers <- function(multiplier,number){
switch(tolower(as.character(multiplier)),
h={
100 * number
},
k={
1000 * number
},
m={
1000000 * number
},
b={
1000000000 * number
},
{
number
}
)
}
# Function for calculating total impact on public health
calculate.health.cost <- function(injuries,fatalities){
injuries + fatalities
}
# Utility function for creating Title Case strings.
# From: http://stackoverflow.com/a/15778614
topropper <- function(x) {
# Makes Proper Capitalization out of a string or collection of strings.
sapply(x, function(strn)
{ s <- strsplit(strn, "\\s")[[1]]
paste0(toupper(substring(s, 1,1)),
tolower(substring(s, 2)),
collapse=" ")}, USE.NAMES=FALSE)
}
# Utility function for wrapping long titles in ggplot plots.
wrapper <- function(x, ...) paste(strwrap(x, ...), collapse = "\n")
The original, unprocessed dataset contains 902297 rows. The dataset is prepared for analysis by renaming the variables following the principles of tidy data.
# Renaming the field based on information found at http://ire.org/nicar/database-library/databases/storm-events/
weather.names<-c(
'state.code',
'event.began.date',
'event.began.time',
'timezone',
'county.code',
'county.names',
'state',
'event.type',
'event.start.point',
'event.begin.azimuth',
'event.start.location',
'event.end.date',
'event.end.time',
'event.end.county',
'event.coundn.end',
'event.start.point',
'event.end.azimuth',
'event.end.location',
'tornado.path.length',
'tordano.path.width',
'tornado.intensity.scale',
'hail.magnitude',
'fatalities',
'injuries',
'property.damage',
'property.damage.exp',
'crop.damage',
'crop.damage.exp',
'wfo.office.id',
'event.us.state',
'event.zones',
'event.start.latitude',
'event.start.longitude',
'event.end.latitude',
'event.end.longitude',
'remarks',
'ncdc.reference')
names(weather.df)<-weather.names
By observing the dataset visually it can be seen that event though the dataset is published by established govermental organization, the dataset is rather messy; compared to the codebook it contains non-standard and missing values which can distort the result of analyses based on the dataset. Therefore a strategy for cleaning the dataset is needed. For serious academic research or policy-making, the chosen strategy should try to minimize the loss of observations resulted from cleaning. Eg. observations with non-standard or unknown values should be imputed with replacement values.
For the purposes of this assingment more straightforward strategy is sufficient – the ambiquous observations are removed and only the major errors are corrected.
The following table lists top 50 weather event codes in the dataset:
event.type.counts<-as.data.frame(sort(table(weather.df$event.type),decreasing=TRUE))[c(1:50),]
event.type.counts <- as.data.frame(event.type.counts)
kable(as.data.frame(event.type.counts),
format="html",
caption="Table 1: Top 50 weather events in unprocessed NOAA data.")
event.type.counts | |
---|---|
HAIL | 288661 |
TSTM WIND | 219940 |
THUNDERSTORM WIND | 82563 |
TORNADO | 60652 |
FLASH FLOOD | 54277 |
FLOOD | 25326 |
THUNDERSTORM WINDS | 20843 |
HIGH WIND | 20212 |
LIGHTNING | 15754 |
HEAVY SNOW | 15708 |
HEAVY RAIN | 11723 |
WINTER STORM | 11433 |
WINTER WEATHER | 7026 |
FUNNEL CLOUD | 6839 |
MARINE TSTM WIND | 6175 |
MARINE THUNDERSTORM WIND | 5812 |
WATERSPOUT | 3796 |
STRONG WIND | 3566 |
URBAN/SML STREAM FLD | 3392 |
WILDFIRE | 2761 |
BLIZZARD | 2719 |
DROUGHT | 2488 |
ICE STORM | 2006 |
EXCESSIVE HEAT | 1678 |
HIGH WINDS | 1533 |
WILD/FOREST FIRE | 1457 |
FROST/FREEZE | 1342 |
DENSE FOG | 1293 |
WINTER WEATHER/MIX | 1104 |
TSTM WIND/HAIL | 1028 |
EXTREME COLD/WIND CHILL | 1002 |
HEAT | 767 |
HIGH SURF | 725 |
TROPICAL STORM | 690 |
FLASH FLOODING | 682 |
EXTREME COLD | 655 |
COASTAL FLOOD | 650 |
LAKE-EFFECT SNOW | 636 |
FLOOD/FLASH FLOOD | 624 |
LANDSLIDE | 600 |
SNOW | 587 |
COLD/WIND CHILL | 539 |
FOG | 538 |
RIP CURRENT | 470 |
MARINE HAIL | 442 |
DUST STORM | 427 |
AVALANCHE | 386 |
WIND | 340 |
RIP CURRENTS | 304 |
STORM SURGE | 261 |
Based on the table, at least non-standard event codes related to thunderstorm winds (TSTM WIND, THUNDERSTORM WINDS) should be mapped to official values given in National Weather Service Instruction (NWSI) 10-1605, hereafter NWSI-10-1065. For the purposes of this assingment the effect of mapping other values is negligible taking into account the total number of observations with valid event type code.
weather.df$event.type <- mapvalues(weather.df$event.type, from = c("TSTM WIND", "THUNDERSTORM WINDS"), to = c("THUNDERSTORM WIND", "THUNDERSTORM WIND"))
Next, the dataset is filtered using permitted weather event types and exponent values.
weather.dt <- data.table(weather.df)
permitted.event.types <- c(
"ASTRONOMICAL LOW TIDE",
"AVALANCHE",
"BLIZZARD",
"COASTAL FLOOD",
"COLD/WIND CHILL",
"DEBRIS FLOW",
"DENSE FOG",
"DENSE SMOKE",
"DROUGHT",
"DUST DEVIL",
"DUST STORM",
"EXCESSIVE HEAT",
"EXTREME COLD/WIND CHILL",
"FLASH FLOOD",
"FLOOD",
"FROST/FREEZE",
"FUNNEL CLOUD",
"FREEZING FOG",
"HAIL",
"HEAT",
"HEAVY RAIN",
"HEAVY SNOW",
"HIGH SURF",
"HIGH WIND",
"HURRICANE (TYPHOON)",
"ICE STORM",
"LAKE-EFFECT SNOW",
"LAKESHORE FLOOD",
"LIGHTNING",
"MARINE HAIL",
"MARINE HIGH WIND",
"MARINE STRONG WIND",
"MARINE THUNDERSTORM WIND",
"RIP CURRENT",
"SEICHE",
"SLEET",
"STORM SURGE/TIDE",
"STRONG WIND",
"THUNDERSTORM WIND",
"TORNADO",
"TROPICAL DEPRESSION",
"TROPICAL STORM",
"TSUNAMI",
"VOLCANIC ASH",
"WATERSPOUT",
"WILDFIRE",
"WINTER STORM",
"WINTER WEATHER")
In the original data, the dollar impact values of weather events are given in a format in which the numeric value is given separatery of the exponent of the number. In the NWSI-10-1065 the permitted exponents are:
Only observations with permitted exponent values are included in the analysis, to remove potentially ambiquous observations.
multipliers=c('h','k','m','b','')
weather.dt$property.damage.exp <- tolower(weather.dt$property.damage.exp)
weather.dt$crop.damage.exp <- tolower(weather.dt$crop.damage.exp)
weather.dt.subset <- weather.dt[weather.dt$event.type %in% permitted.event.types,]
weather.dt.subset.subset <- weather.dt.subset[weather.dt.subset$property.damage.exp %in% multipliers & weather.dt.subset$crop.damage.exp %in% multipliers ,]
The original dataset had 26239 observations with ambiquous weather event type and 341 observations with and exponent value with unknown meaning. When rows with ambiquous values are removed, the resulting final dataset used in the analysis contains 875737 rows.
In the last data processing step, the exponent values are transformed to actual dollar amounts. After the last processing step, the dataset is ready for analysis, in which the following variables are used:
property.damage.dollars
– property damage in dollars (derived variable)crop.damage.dollars
– crop damage in dollars (derived variable)event.type
– type of the weather event, as described in NWSI-10-1065 codebook.injuries
– number of injuries caused by weather eventfatalities
– number of deaths caused by weather eventIn later steps, derived variable state.name
is included to the analysis to enable geographic investigation of the data.
weather.dt.subset.subset$property.damage.dollars <- mapply(
process.multipliers,
weather.dt.subset.subset$property.damage.exp,
weather.dt.subset.subset$property.damage)
weather.dt.subset.subset$crop.damage.dollars <- mapply(
process.multipliers,
weather.dt.subset.subset$crop.damage.exp,
weather.dt.subset.subset$crop.damage)
weather.dt.subset.subset$total.damage.dollars <- weather.dt.subset.subset$property.damage.dollars +
weather.dt.subset.subset$crop.damage.dollars
weather.dt.subset.subset$population.health.cost <-mapply(
calculate.health.cost,
weather.dt.subset.subset$injuries,
weather.dt.subset.subset$fatalities)
Calculating the total impact to the US economy.
# Prepare dataset for summing over event type
setkey(weather.dt.subset.subset, event.type)
## Group damages by event type, US wide.
# Damages to economy
total.us.crop.damages <- weather.dt.subset.subset[,sum(crop.damage.dollars),by = list(event.type)]
total.us.crop.damages$damage.type <- 'Crop'
total.us.property.damages <- weather.dt.subset.subset[,sum(property.damage.dollars),by = list(event.type)]
total.us.property.damages$damage.type <- 'Property'
total.us.economic.damages <- rbind(total.us.crop.damages,total.us.property.damages)
total.us.economic.damages <- total.us.economic.damages[order(total.us.economic.damages$event.type),]
total.us.economic.damages$event.name <- lapply(total.us.economic.damages$event.type,as.character)
total.us.economic.damages$event.name <- topropper(total.us.economic.damages$event.name)
Calculating the total damage to public health.
# Damages to public health
total.us.fatalities <- weather.dt.subset.subset[,sum(fatalities),by = list(event.type)]
total.us.fatalities$damage.type <- 'Fatalities'
total.us.injuries <- weather.dt.subset.subset[,sum(injuries),by = list(event.type)]
total.us.injuries$damage.type <- 'Injuries'
total.us.public.health.effect <- rbind(total.us.fatalities,total.us.injuries)
total.us.public.health.effect <- total.us.public.health.effect[order(total.us.public.health.effect$event.type),]
total.us.public.health.effect$event.name <- lapply(total.us.public.health.effect$event.type,as.character)
total.us.public.health.effect$event.name <- topropper(total.us.public.health.effect$event.name)
Overall, tornadoes are the cause of majority of weather-related deaths and injuries (Figure 1). Other significant causes are thunderstorm winds, flood and extensive heat, but in relation to the effect of tornadoes, their scale is minuscule.
dollar.impact.labels<- c('<1M','50B','100B')
dollar.impact.breaks<- c(1000000,50000000000,100000000000)
ggplot(
data = total.us.public.health.effect,
aes(
x = event.name,
y = V1,
group = damage.type,
fill = damage.type)) +
geom_bar(stat="identity") +
xlab("Event") +
ylab("Number of casualties") +
ggtitle(wrapper("Figure 1. Casualties of weather-related events from 1950 to 2011.", width=80)) +
guides(fill=guide_legend(title="Category of casualty")) +
coord_flip() +
theme_minimal()
Nation-wide, floods and tornadoes have the greatest impact on the economy (Figure 2). When damage to the agriculture is inspected separately, droughts are the number one cause of economic loss.
ggplot(
data = total.us.economic.damages,
aes(
x = event.name,
y = V1,
group = damage.type,
fill = damage.type)) +
geom_bar(stat="identity") +
scale_y_continuous(labels=dollar.impact.labels, breaks=dollar.impact.breaks) +
xlab("Event") +
ylab("Economic impact") +
ggtitle(wrapper("Figure 2. Total economic damage of weather-related events from 1950 to 2011.", width=80)) +
guides(fill=guide_legend(title="Category of damage")) +
coord_flip() +
theme_minimal()
Even though the nation-wide aggregation of the weather event statistics gives a good overview of the total economic and public health impact of weather events, it should be noted that for preparing for severe weather events, more localized information is needed.
As the geographic area of the US is vast, the possible weather conditions vary highly – there's no such thing as the weather of the USA (Tables 2 and 3). Nation-wide aggregation also hides the fact that the weather events can be extremely local; for example, floods typically affect relatively small areas but can cause serious damage. Also, tornadoes which are the major cause of deaths and injuries related to weather events nation-wide, are occuring typically in particular geographical area, known as Tornado Alley.
## Derive state names from the more detailed area descriptions.
weather.dt.subset.subset$state.name <- NA
weather.dt.subset.subset$state.name <- lapply(weather.dt.subset.subset$event.us.state,as.character)
weather.dt.subset.subset$state.name <- sub(',.*','',weather.dt.subset.subset$state.name)
weather.dt.subset.geographic <- weather.dt.subset.subset[weather.dt.subset.subset$state.name !="",]
setkey(weather.dt.subset.subset, event.type, event.us.state)
## State-part level damages
area.property.damages<-weather.dt.subset.geographic[,sum(total.damage.dollars),by = list(event.us.state,event.type,state.name)]
area.property.damages <- area.property.damages[area.property.damages$V1>1000000,]
area.property.damages<-area.property.damages[order(area.property.damages$event.us.state, -area.property.damages$V1), ]
# Detailed look on state level
new.hampshire.property.damages <- area.property.damages[area.property.damages$state.name == "NEW HAMPSHIRE",-3, with=FALSE]
new.hampshire.property.damages$event.name <- lapply(new.hampshire.property.damages$event.type,as.character)
new.hampshire.property.damages$event.name <- topropper(new.hampshire.property.damages$event.name)
new.hampshire.property.damages <- new.hampshire.property.damages[,-2, with=FALSE]
new.hampshire.property.damages <- new.hampshire.property.damages[,c(1,3,2),with=FALSE]
montana.property.damages <- area.property.damages[area.property.damages$state.name == "MONTANA",-3, with=FALSE]
montana.property.damages$event.name <- lapply(montana.property.damages$event.type,as.character)
montana.property.damages$event.name <- topropper(montana.property.damages$event.name)
montana.property.damages <- montana.property.damages[,-2, with=FALSE]
montana.property.damages <- montana.property.damages[,c(1,3,2),with=FALSE]
setnames(new.hampshire.property.damages,
c('event.us.state','event.name','V1'),
c('US state','Event','Economic impact (in USD)')
)
setnames(montana.property.damages,
c('event.us.state','event.name','V1'),
c('US state','Event','Economic impact (in USD)')
)
kable(montana.property.damages,
format = "html",
caption="Table 2: Economic effect of weather-related events in Montana")
US state | Event | Economic impact (in USD) |
---|---|---|
MONTANA, Central | Hail | 60056600 |
MONTANA, Central | Tornado | 4301000 |
MONTANA, Central | Flood | 3218400 |
MONTANA, Central | Winter Storm | 3215000 |
MONTANA, Central | High Wind | 2165000 |
MONTANA, East | Hail | 43283100 |
MONTANA, East | Thunderstorm Wind | 22990900 |
MONTANA, East | Wildfire | 4075500 |
MONTANA, East | Flash Flood | 3447000 |
MONTANA, East | Winter Storm | 3310000 |
MONTANA, East | Tornado | 1312000 |
MONTANA, South | Tornado | 63094000 |
MONTANA, South | Flood | 20430000 |
MONTANA, South | Hail | 16606000 |
MONTANA, South | Flash Flood | 4030000 |
MONTANA, South | Winter Storm | 3000000 |
MONTANA, South | Wildfire | 1700000 |
MONTANA, South | Blizzard | 1500000 |
MONTANA, South | Heavy Snow | 1214000 |
MONTANA, West | Flash Flood | 7048600 |
MONTANA, West | Flood | 3061000 |
MONTANA, West | Thunderstorm Wind | 2864000 |
MONTANA, West | Hail | 2206000 |
MONTANA, West | High Wind | 1252800 |
MONTANA, West | Heavy Snow | 1042000 |
kable(new.hampshire.property.damages,
format = "html",
caption="Table 3: Economic effect of weather-related events in New Hampshire")
US state | Event | Economic impact (in USD) |
---|---|---|
NEW HAMPSHIRE, North and Central | Flood | 57595270 |
NEW HAMPSHIRE, North and Central | Ice Storm | 27629000 |
NEW HAMPSHIRE, North and Central | Flash Flood | 21868000 |
NEW HAMPSHIRE, North and Central | Lightning | 8895000 |
NEW HAMPSHIRE, North and Central | Coastal Flood | 5304500 |
NEW HAMPSHIRE, Southern | Ice Storm | 37305000 |
NEW HAMPSHIRE, Southern | Flash Flood | 14975000 |
NEW HAMPSHIRE, Southern | Heavy Snow | 7433000 |
NEW HAMPSHIRE, Southern | High Wind | 4645000 |
NEW HAMPSHIRE, Southern | Flood | 3918000 |
NEW HAMPSHIRE, Southern | Thunderstorm Wind | 1551000 |
NEW HAMPSHIRE, Southern | Lightning | 1131500 |
## Group health costs damages by state, statewide
statewide.area.health.cost <- weather.dt.subset.subset[,sum(population.health.cost),by = list(state.name,event.type)]
statewide.area.health.cost <- statewide.area.health.cost[statewide.area.health.cost$V1>10,]
statewide.area.health.cost <- statewide.area.health.cost[order(statewide.area.health.cost$state.name, -statewide.area.health.cost$V1), ]
statewide.area.health.cost$cost.breaks<-cut(statewide.area.health.cost$V1,
breaks=c(0, 50, 250, 500, 1000, 2000, 10000),
labels=c("<50", "50-250", "250-500", "500-1000", "1000-2000", "2000<"),
include.lowest=TRUE)
tornado.health.cost.state <- weather.dt.subset.subset[weather.dt.subset.subset$event.type=="TORNADO",sum(population.health.cost),by = list(state.name)]
tornado.health.cost.state$cost.breaks<-cut(tornado.health.cost.state$V1,
breaks=c(0, 50, 250, 500, 1000, 2000, 10000),
labels=c("<50", "50-250", "250-500", "500-1000", "1000-2000", "2000<"),
include.lowest=TRUE)
setnames(tornado.health.cost.state, 'state.name','region')
tornado.health.cost.state$region<-tolower(tornado.health.cost.state$region)
all_states <- map_data("state")
map.state <- data.table(map_data('state'))
setkey(map.state,region)
tornado.health.map.df <- merge(map.state,tornado.health.cost.state, by="region")
ggplot(tornado.health.map.df, aes(x=long, y=lat, group=group, fill=cost.breaks)) +
geom_polygon() +
coord_map() +
scale_fill_brewer(type="seq",palette="Reds") +
labs(
fill = "Number of casualties",
title = "Figure 3. Geographic distribution of tornado casualties from 1950 to 2011",
x="",
y="") +
scale_y_continuous(breaks=c()) +
scale_x_continuous(breaks=c()) +
theme(panel.border = element_blank()) +
theme_minimal()