The goal of this report is to expose the results of the analysis of the NOAA Storm Database to answer the following questions about severe weather events:
Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?
We found that events categories with the most impact for the population health, are not in general the same ones having the biggest economic consecuences.
The data for this assignment come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. The file can be downloaded from the course web site:
Storm Data [47Mb]
if( !file.exists("repdata-data-StormData.csv") ) {
library(R.utils)
if( !file.exists("repdata-data-StormData.csv.bz2") ) {
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
"repdata-data-StormData.csv.bz2")
}
bunzip2("repdata-data-StormData.csv.bz2")
}
rawdata <- read.csv("repdata-data-StormData.csv")
For the current analysis we need the following information:
getcost <- function(x,KMB) {
factor_kmb <- rep( 1, length(x))
factor_kmb[KMB=="K"] <- 1000
factor_kmb[KMB=="M"] <- 1000000
factor_kmb[KMB=="B"] <- 1000000000
x*factor_kmb
}
dates = sapply( rawdata$BGN_DATE,
function(x){strsplit(strsplit(as.character(x),"/")[[1]][3]," ")[[1]][1]} )
cleandata <- data.frame( type = rawdata$EVTYPE,
years = as.factor(dates),
time = strptime( rawdata$BGN_DATE,"%m/%d/%Y %H:%M:%S" ),
fatalities = rawdata$FATALITIES,
injuries = rawdata$INJURIES,
propcost = getcost(rawdata$PROPDMG, rawdata$PROPDMGEXP),
cropcost = getcost(rawdata$CROPDMG, rawdata$CROPDMGEXP) )
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete. This can be easily shown in the following graph:
hist(as.numeric(as.character(cleandata$years)),
col = "red",
main = "Number of Weather Events per Year",
xlab = "Number of Weather Events per Year" )
So, in order to have an anlysis independent of the recording effectiveness, we will focus in the last 15 complete years going from 1995 to 2010.
filter <- as.factor(1995:2010)
cleandata <- cleandata[cleandata$years %in% filter,]
sorttime <- order(cleandata$time, decreasing = TRUE)
cleandata <- cleandata[sorttime,]
hist(as.numeric(as.character(cleandata$years)),
col = "red",
main = "Number of Weather Events per Year",
xlab = "Number of Weather Events per Year")
Looking at the following documentation:
We can see that the event types should comply to the following classification:
originaldbtypes <- length(levels(cleandata$type))
But on the other hand, the data extracted from the database shows that we have in reality 985 different event types in the dataset. In order to classify the raw data into the categories listed above, the following rules have been defined:
cleandata$type <- tolower(cleandata$type)
cleandata <- cleandata[- grep("summary", cleandata$type, perl=T),]
cleandata <- cleandata[- grep("monthly", cleandata$type, perl=T),]
# Astronomical Low Tide
cleandata$type[grep("low.*tide", cleandata$type, perl=T)] <- "Astronomical Low Tide"
# Avalanche
cleandata$type[grep("avalan", cleandata$type, perl=T)] <- "Avalanche"
# Blizzard
cleandata$type[grep("(blizzard|wintry mix)",
cleandata$type, perl=T)] <- "Blizzard"
# Coastal Flood
cleandata$type[grep("(coastal|cstl|tidal|beach).*(flood|eros)",
cleandata$type, perl=T)] <- "Coastal Flood"
# Extreme Cold/Wind Chill
cleandata$type[grep("ex.*(cold|cool|chill)",
cleandata$type, perl=T)] <- "Extreme Cold/Wind Chill"
# Cold/Wind Chill
cleandata$type[grep("(cold|cool|chill|hypothermia|low temp)",
cleandata$type, perl=T)] <- "Cold/Wind Chill"
# Debris Flow
cleandata$type[grep("(debris|lands|mud|rock)",
cleandata$type, perl=T)] <- "Debris Flow"
# Freezing Fog
cleandata$type[grep("freez.*fog", cleandata$type, perl=T)] <- "Freezing Fog"
# Dense Fog
cleandata$type[grep("fog", cleandata$type, perl=T)] <- "Dense Fog"
# Dense Smoke
cleandata$type[grep("smoke", cleandata$type, perl=T)] <- "Dense Smoke"
# Drought
cleandata$type[grep("(drought|dry|driest)",
cleandata$type, perl=T)] <- "Drought"
cleandata$type[grep("below normal precipitation",
cleandata$type, perl=T)] <- "Drought"
# Dust Devil
cleandata$type[grep("devil", cleandata$type, perl=T)] <- "Dust Devil"
# Dust Storm
cleandata$type[grep("dust", cleandata$type, perl=T)] <- "Dust Storm"
# Excessive Heat
cleandata$type[grep("(ex|abnor).*(heat|warm)",
cleandata$type, perl=T)] <- "Excessive Heat"
# Lake-Effect Snow
cleandata$type[grep("lake.*snow", cleandata$type, perl=T)] <- "Lake-Effect Snow"
# Lakeshore Flood
cleandata$type[grep("lake.*flood", cleandata$type, perl=T)] <- "Lakeshore Flood"
# Flash Flood
cleandata$type[grep("flash", cleandata$type, perl=T)] <- "Flash Flood"
# Flood
cleandata$type[grep("(flood|dam|drown|high water|rising water|high sea|stream)",
cleandata$type, perl=T)] <- "Flood"
# Frost/Freeze
cleandata$type[grep("(frost|freeze)", cleandata$type, perl=T)] <- "Frost/Freeze"
# Funnel Cloud
cleandata$type[grep("(funnel|cloud)", cleandata$type, perl=T)] <- "Funnel"
# Marine Hail
cleandata$type[grep("marine.*hail", cleandata$type, perl=T)] <- "Marine Hail"
# Marine High Wind
cleandata$type[grep("marine.*high.*wind",
cleandata$type, perl=T)] <- "Marine High Wind"
# Marine Thunderstorm Wind
cleandata$type[grep("marine.*thunder.*wind",
cleandata$type, perl=T)] <- "Marine Thunderstorm Wind"
# Marine Strong Wind
cleandata$type[grep("marine.*wind", cleandata$type, perl=T)] <- "Marine Strong Wind"
# Hail
cleandata$type[grep("hail", cleandata$type, perl=T)] <- "Hail"
# Heat
cleandata$type[grep("(heat|hot|high temperature|hyperthermia|warm|wet)",
cleandata$type, perl=T)] <- "Heat"
# High Surf
cleandata$type[grep("(surf|swells|wave)",
cleandata$type, perl=T)] <- "High Surf"
# Hurricane (Typhoon)
cleandata$type[grep("(hurricane|typhoon)",
cleandata$type, perl=T)] <- "Hurricane (Typhoon)"
# Ice Storm
cleandata$type[grep("(ice|drizzle|spray)",
cleandata$type, perl=T)] <- "Ice Storm"
# Lightning
cleandata$type[grep("(lightning|ligntning)",
cleandata$type, perl=T)] <- "Lightning"
# Rip Current
cleandata$type[grep("rip", cleandata$type, perl=T)] <- "Rip Current"
# Seiche
cleandata$type[grep("seiche", cleandata$type, perl=T)] <- "Seiche"
# Sleet
cleandata$type[grep("(sleet|glaze|icy)",
cleandata$type, perl=T)] <- "Sleet"
# Storm Surge/Tide
cleandata$type[grep("(surge|tide)", cleandata$type, perl=T)] <- "Storm Surge/Tide"
# Thunderstorm Wind
cleandata$type[grep("(thunder|tunder|tstm|downburst)",
cleandata$type, perl=T)] <- "Thunderstorm Wind"
# Tornado
cleandata$type[grep("(tornado|gustnado)",
cleandata$type, perl=T)] <- "Tornado"
# Tropical Depression
cleandata$type[grep("depression", cleandata$type, perl=T)] <- "Tropical Depression"
# Tropical Storm
cleandata$type[grep("tropical", cleandata$type, perl=T)] <- "Tropical Storm"
# Tsunami
cleandata$type[grep("tsunami", cleandata$type, perl=T)] <- "Tsunami"
# Volcanic Ash
cleandata$type[grep("volcan", cleandata$type, perl=T)] <- "Volcanic Ash"
# Waterspout
cleandata$type[grep("waterspout", cleandata$type, perl=T)] <- "Waterspout"
# Wildfire
cleandata$type[grep("fire", cleandata$type, perl=T)] <- "Wildfire"
# Winter Storm
cleandata$type[grep("winter.*storm", cleandata$type, perl=T)] <- "Winter Storm"
# Winter Weather
cleandata$type[grep("winter", cleandata$type, perl=T)] <- "Winter Weather"
# Heavy Rain
cleandata$type[grep("(rain|shower|precipitation)",
cleandata$type, perl=T)] <- "Heavy Rain"
# Heavy Snow
cleandata$type[grep("snow", cleandata$type, perl=T)] <- "Heavy Snow"
# Strong Wind
cleandata$type[grep("strong.*wind", cleandata$type, perl=T)] <- "Strong Wind"
# High Wind
cleandata$type[grep("(wind|wnd)", cleandata$type, perl=T)] <- "High Wind"
# Other/unclassified
cleandata$type[grep("^[a-z]", cleandata$type, perl=T)] <- "Other/Unclassified"
cleandata$type <- as.factor(cleandata$type)
The new dataset categories are listed below:
levels(cleandata$type)
## [1] "Astronomical Low Tide" "Avalanche"
## [3] "Blizzard" "Coastal Flood"
## [5] "Cold/Wind Chill" "Debris Flow"
## [7] "Dense Fog" "Dense Smoke"
## [9] "Drought" "Dust Devil"
## [11] "Dust Storm" "Excessive Heat"
## [13] "Extreme Cold/Wind Chill" "Flash Flood"
## [15] "Flood" "Freezing Fog"
## [17] "Frost/Freeze" "Funnel"
## [19] "Hail" "Heat"
## [21] "Heavy Rain" "Heavy Snow"
## [23] "High Surf" "High Wind"
## [25] "Hurricane (Typhoon)" "Ice Storm"
## [27] "Lake-Effect Snow" "Lakeshore Flood"
## [29] "Lightning" "Marine Hail"
## [31] "Marine High Wind" "Marine Strong Wind"
## [33] "Marine Thunderstorm Wind" "Other/Unclassified"
## [35] "Rip Current" "Seiche"
## [37] "Sleet" "Storm Surge/Tide"
## [39] "Strong Wind" "Thunderstorm Wind"
## [41] "Tornado" "Tropical Depression"
## [43] "Tropical Storm" "Tsunami"
## [45] "Volcanic Ash" "Waterspout"
## [47] "Wildfire" "Winter Storm"
## [49] "Winter Weather"
Giving the number of event types to compare, the proposed strategy is to look first at fatalities and injuries from three different angles:
The top event types for each comparison will then be compared together in a single chart.
healthdata <- aggregate(cbind(fatalities, injuries) ~ type, data=cleandata, sum)
sortfat <- order(healthdata$fatalities, decreasing = TRUE)
head(healthdata[sortfat,])
## type fatalities injuries
## 12 Excessive Heat 1958 6542
## 20 Heat 1036 1816
## 41 Tornado 958 15624
## 14 Flash Flood 883 1709
## 29 Lightning 704 4439
## 35 Rip Current 535 497
sortinj <- order(healthdata$injuries, decreasing = TRUE)
head(healthdata[sortinj,])
## type fatalities injuries
## 41 Tornado 958 15624
## 15 Flood 407 6855
## 12 Excessive Heat 1958 6542
## 40 Thunderstorm Wind 362 5186
## 29 Lightning 704 4439
## 20 Heat 1036 1816
If we consider the Total Number of Fatalities, Heat is the main event affecting health in the United State. Now looking at the Total Number of Injuries, Tornadoes and Floods need to be taken into consideration as well.
healthdata <- aggregate(cbind(fatalities, injuries) ~ type, data=cleandata, max)
sortfat <- order(healthdata$fatalities, decreasing = TRUE)
head(healthdata[sortfat,])
## type fatalities injuries
## 20 Heat 583 230
## 12 Excessive Heat 99 519
## 41 Tornado 32 293
## 44 Tsunami 32 129
## 9 Drought 29 15
## 43 Tropical Storm 22 200
sortinj <- order(healthdata$injuries, decreasing = TRUE)
head(healthdata[sortinj,])
## type fatalities injuries
## 15 Flood 11 800
## 25 Hurricane (Typhoon) 15 780
## 12 Excessive Heat 99 519
## 41 Tornado 32 293
## 20 Heat 583 230
## 43 Tropical Storm 22 200
A similar picture is seen when looking at the maximum number of people affected per single event with the exemption of having Hurricanes as the second most impacting for injuries.
nonzero <- function(x) sum(x != 0)
healthdata <- aggregate(cbind(fatalities, injuries) ~ type, data=cleandata, nonzero)
sortfat <- order(healthdata$fatalities, healthdata$injuries, decreasing = TRUE)
head(healthdata[sortfat,])
## type fatalities injuries
## 29 Lightning 656 2383
## 14 Flash Flood 575 349
## 12 Excessive Heat 561 160
## 35 Rip Current 474 186
## 41 Tornado 379 1785
## 40 Thunderstorm Wind 309 2079
sortinj <- order(healthdata$injuries, healthdata$fatalities, decreasing = TRUE)
head(healthdata[sortinj,])
## type fatalities injuries
## 29 Lightning 656 2383
## 40 Thunderstorm Wind 309 2079
## 41 Tornado 379 1785
## 24 High Wind 210 470
## 14 Flash Flood 575 349
## 47 Wildfire 33 268
If we look now at the number of events with at least one fatality/injury, the results are significantly different: Lightning, Flash Flood and Thunderstorm Wind are the top event types for this type of analysis.
library(lattice)
filter <- cleandata$type %in% c( "Excessive Heat",
"Flash Flood",
"Flood",
"Heat",
"Hurricane (Typhoon)",
"Lightning",
"Thunderstorm Wind",
"Tornado" )
# Plot data
xyplot( fatalities ~ time | type,
data = cleandata[filter,],
scales = list(y = list(log = 10)),
xlab = "Years",
ylab = "Number of Fatalities",
layout = c(4,2),
panel = function(x, y, ...) {
panel.xyplot(x, y, ...) ## First call the default panel function for 'xyplot'
panel.abline(h = 10^0.5, lty = 2, col = "black") ## Add a horizontal line for reference
} )
xyplot( injuries ~ time | type,
data = cleandata[filter,],
scales = list(y = list(log = 10)),
xlab = "Years",
ylab = "Number of Injuries",
layout = c(4,2),
panel = function(x, y, ...) {
panel.xyplot(x, y, ...) ## First call the default panel function for 'xyplot'
panel.abline(h = 10^1.5, lty = 2, col = "black") ## Add a horizontal line for reference
} )
First we need to note that the y axis is logarithmic. The main conclusions from this first section would be:
Following a similar approach to the one above, the proposed strategy is to look first at property and crop damage cost from three different angles:
And then again the top event types for each comparison will be compared together in a single chart.
costdata <- aggregate(cbind(propcost, cropcost) ~ type, data=cleandata, sum)
sortprop <- order(costdata$propcost, decreasing = TRUE)
head(costdata[sortprop,])
## type propcost cropcost
## 15 Flood 1.366e+11 5.733e+09
## 25 Hurricane (Typhoon) 8.522e+10 5.495e+09
## 38 Storm Surge/Tide 4.780e+10 8.550e+05
## 41 Tornado 1.510e+10 2.652e+08
## 19 Hail 1.488e+10 2.617e+09
## 14 Flash Flood 1.415e+10 1.361e+09
sortcrop <- order(costdata$cropcost, decreasing = TRUE)
head(costdata[sortcrop,])
## type propcost cropcost
## 9 Drought 1.048e+09 1.389e+10
## 15 Flood 1.366e+11 5.733e+09
## 25 Hurricane (Typhoon) 8.522e+10 5.495e+09
## 19 Hail 1.488e+10 2.617e+09
## 17 Frost/Freeze 5.155e+06 1.576e+09
## 14 Flash Flood 1.415e+10 1.361e+09
If we consider the Total Property Damage Cost, Floods is the main impacting weather event in the United State followed by Hurricanes. Now looking the Total Crop Damage Cost is of course heavily impacted by Drought and Floods.
costdata <- aggregate(cbind(propcost, cropcost) ~ type, data=cleandata, max)
sortprop <- order(costdata$propcost, decreasing = TRUE)
head(costdata[sortprop,])
## type propcost cropcost
## 15 Flood 1.150e+11 5.00e+08
## 38 Storm Surge/Tide 3.130e+10 7.50e+05
## 25 Hurricane (Typhoon) 1.693e+10 1.51e+09
## 43 Tropical Storm 5.150e+09 2.00e+08
## 21 Heavy Rain 2.500e+09 2.00e+08
## 19 Hail 1.800e+09 7.00e+07
sortcrop <- order(costdata$cropcost, decreasing = TRUE)
head(costdata[sortcrop,])
## type propcost cropcost
## 25 Hurricane (Typhoon) 1.693e+10 1.510e+09
## 9 Drought 6.451e+08 1.000e+09
## 13 Extreme Cold/Wind Chill 2.500e+07 5.960e+08
## 15 Flood 1.150e+11 5.000e+08
## 12 Excessive Heat 3.300e+06 4.924e+08
## 20 Heat 3.800e+06 4.000e+08
A similar picture is seen when looking at the maximum damage cost per single event with the exemption of having Storm Surge/Tide as the second most impacting for property damages.
nonzero <- function(x) sum(x != 0)
costdata <- aggregate(cbind(propcost, cropcost) ~ type, data=cleandata, nonzero)
sortprop <- order(costdata$propcost, costdata$cropcost, decreasing = TRUE)
head(costdata[sortprop,])
## type propcost cropcost
## 40 Thunderstorm Wind 96559 4375
## 19 Hail 19927 8186
## 14 Flash Flood 18384 1840
## 41 Tornado 11047 1211
## 15 Flood 8745 1562
## 29 Lightning 8676 94
sortcrop <- order(costdata$cropcost, costdata$propcost, decreasing = TRUE)
head(costdata[sortcrop,])
## type propcost cropcost
## 19 Hail 19927 8186
## 40 Thunderstorm Wind 96559 4375
## 14 Flash Flood 18384 1840
## 15 Flood 8745 1562
## 41 Tornado 11047 1211
## 9 Drought 116 243
If we look now at the number of events with any economic impact, the results are significantly different: Hail, Flash Flood and Thunderstorm Wind are the top event types for this type of analysis.
library(lattice)
filter <- cleandata$type %in% c( "Drought",
"Flood",
"Hail",
"Hurricane (Typhoon)",
"Storm Surge/Tide",
"Thunderstorm Wind",
"Extreme Cold/Wind Chill",
"Flash Flood" )
# Plot data
xyplot( propcost ~ time | type,
data = cleandata[filter,],
scales = list(y = list(log = 10)),
xlab = "Years",
ylab = "Property Damage Cost in $",
layout = c(4,2),
panel = function(x, y, ...) {
panel.xyplot(x, y, ...) ## First call the default panel function for 'xyplot'
panel.abline(h = 10^6, lty = 2, col = "black") ## Add a horizontal line for reference
} )
xyplot( cropcost ~ time | type,
data = cleandata[filter,],
scales = list(y = list(log = 10)),
xlab = "Years",
ylab = "Crop Damage Cost in $",
layout = c(4,2),
panel = function(x, y, ...) {
panel.xyplot(x, y, ...) ## First call the default panel function for 'xyplot'
panel.abline(h = 10^6, lty = 2, col = "black") ## Add a horizontal line for reference
} )
First we need to note that the y axis is logarithmic. The main conclusions from this second section would be:
We found that most harmful events for the population health are in general not the ones with the biggest economic consecuences. In particular, Heat and Excessive Heat fatalities are the most impacting in public health and have no significant economical impact in property and crop damages. A similar conclusion would apply to Tornadoes being the main event on total number of injuries in the time period.
On the other hand, Flood is the weather event with highest impact in Property and Crop Damage Cost overall. Of course Drought is the main contributor to Crop Damage cost and Hurricanes are the second biggest driver for Property Damage Cost.
Hail and Lightnings are common events with lower impact causing Property and Crop Damage Costs and Health impact respectively.
Finally, Thunderstom Wind and Flash Flood are common events with lower impact but affecting both health and economics.