Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
We will address the following questions : 1. Across the United States, which types of events are most harmful with respect to population health ? 2. Across the United States, which types of events have the greatest economic consequences ?
library("data.table")
library("ggplot2")
library("dplyr")
library("plyr")
You can download the file from the coursera web site : Storm Database.
The events in the file start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records.
In order to understand the Storm Database content, you can download these two files : National Weather Service Storm Data Documentation and National Climatic Data Center Storm Events FAQ
Download the zip file in a “data” directory
setwd("~/Coursera/5. Reproducible Research/Week4")
if (!file.exists("data")){
dir.create("data")
}
fileUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(fileUrl,destfile = "./data/data.csv.bz2")
fileUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf"
download.file(fileUrl,destfile = "./data/stormdatadoc.pdf")
fileUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2FNCDC%20Storm%20Events-FAQ%20Page.pdf"
download.file(fileUrl,destfile = "./data/faq.pdf")
Creates data table with the readcsv() and bzfile() functions to unzip the file
setwd("~/Coursera/5. Reproducible Research/Week4")
data <- data.frame(read.csv(bzfile("./data/data.csv.bz2"),stringsAsFactors = FALSE))
According to the NOAA documentation, the Storm Database is constructed with manual entries according to a table of 30 types of events. The type of events is defined in the following table :
For example there is a difference between “Cold/Wind Chill”" and “Extreme Cold/Wind Chill” or between “Heat” and “Extreme heat”.
Unfortunately, the events have been manually entered in the Storm Database through time. Therefore when we look at the event type column in the Storm Databse, we find much more than 30 events :
length(unique(data$EVTYPE))
## [1] 985
Therefore we do need to make some replacement and some corrections.
The strategy used is the following :
Look first to the most frequent event types using the command :
tail(sort(table(data$EVTYPE)),30)"If not, find the most accurate and corresponding standard event. We have to be very precautious to avoid mixing “heat” and “extreme heat” for example.
data$EVTYPE <- gsub(".*[Hh][Ii][Gg][Hh] [Ss][Uu][Rr][Ff].*", "HIGH SURF", data$EVTYPE)
data$EVTYPE <- gsub(".*[Tt][Ss][Tt][Mm].*[Ww][Ii][Nn].*", "THUNDERSTORM WIND", data$EVTYPE)
data$EVTYPE <- gsub("[Tt][Hh][Uu][Nn][Dd][Ee][Rr][Ss][Tt][Oo][Rr][Mm] [Ww][Ii][Nn][Dd][Ss]", "THUNDERSTORM WIND", data$EVTYPE)
data$EVTYPE <- gsub(".*[Ff][Ll][Dd].*", "FLOOD", data$EVTYPE)
data$EVTYPE <- gsub(".*[Ff][Ii][Rr].*", "WILDFIRE", data$EVTYPE)
data$EVTYPE <- gsub(".*[Hh][Ii].*[Ww][Ii][Nn][Dd][Ss].*", "HIGH WIND", data$EVTYPE)
data$EVTYPE <- gsub(".*[Ww][Ii][Nn][Tt].*[Ww][Ee][Aa][Tt][Hh].*", "WINTER WEATHER", data$EVTYPE)
data$EVTYPE <- gsub(".*[Ff][Ll].*[Ff][Ll].*", "FLASH FLOOD", data$EVTYPE)
data$EVTYPE <- gsub("[Ee][Xx][Tt][Rr][Ee][Mm][Ee] [Cc][Oo][Ll][Dd]$", "EXTREME COLD/WIND CHILL", data$EVTYPE)
data$EVTYPE <- gsub("^[Ss][Nn][Oo][Ww]", "HEAVY SNOW", data$EVTYPE)
data$EVTYPE <- gsub(".*[Ee][Rr][Oo][Ss].*", "COASTAL FLOOD", data$EVTYPE)
data$EVTYPE <- gsub("^[Bb][Ll][Ii][Zz].*", "BLIZZARD", data$EVTYPE)
data$EVTYPE <- gsub("^[Cc][Oo][Aa][Ss][Tt].*", "COASTAL FLOOD", data$EVTYPE)
data$EVTYPE <- gsub("^[Dd][Rr][Yy].*", "DROUGHT", data$EVTYPE)
data$EVTYPE <- gsub(".*[Dd][Rr][Yy]$", "DROUGHT", data$EVTYPE)
data$EVTYPE <- gsub(".*[Dd][Rr][Oo][Uu][Gg][Hh][Tt].*", "DROUGHT", data$EVTYPE)
data$EVTYPE <- gsub("^[Ww][Ii][Nn].*", "HIGH WIND", data$EVTYPE)
data$EVTYPE <- gsub("^[Rr][Ii][Pp].*", "RIP CURRENT", data$EVTYPE)
data$EVTYPE <- gsub("^[Ee][Xx][Tt][Rr][Ee][Mm][Ee] [Ww][Ii][Nn].*", "EXTREME COLD/WIND CHILL", data$EVTYPE)
data$EVTYPE <- gsub("^[Ff][Oo][Gg].*", "DENSE FOG", data$EVTYPE)
data$EVTYPE <- gsub(".*[Ss][Uu][Rr][Gg][Ee]$", "STORM SURGE/TIDE", data$EVTYPE)
data$EVTYPE <- gsub("^[Uu][Rr][Bb][Aa][Nn].*[Ff][Ll].*", "FLOOD", data$EVTYPE)
data$EVTYPE <- gsub(".*[Ss][Tt].*[Ww][Ii][Nn][Dd][Ss]$.*", "STRONG WIND", data$EVTYPE)
data$EVTYPE <- gsub(".*[Rr][Aa][Ii][Nn].*", "HEAVY RAIN", data$EVTYPE)
data$EVTYPE <- gsub(".*[Rr][Ii][Vv].*[Ff][Ll].*", "FLOOD", data$EVTYPE)
data$EVTYPE <- gsub(".*[Ww][Aa][Rr][Mm].*", "EXCESSIVE HEAT", data$EVTYPE)
data$EVTYPE <- gsub("[Ff][Ll][Oo][Oo][Dd][Ii][Nn][Gg]$", "FLOOD", data$EVTYPE)
data$EVTYPE <- gsub(".*[Ll][Ii][Gg].*[Ss][Nn][Oo][Ww].*", "HEAVY SNOW", data$EVTYPE)
data$EVTYPE <- gsub(".*[Ss][Nn][Oo][Ww][Ff][Aa][Ll][Ll].*", "HEAVY SNOW", data$EVTYPE)
data$EVTYPE <- gsub(".*[Cc][Ll][Oo][Uu][Dd][Ss]$", "FUNNEL CLOUD", data$EVTYPE)
data$EVTYPE <- gsub(".*[Ss][Uu][Rr][Ff]$", "HIGH SURF", data$EVTYPE)
data$EVTYPE <- gsub(".*[Rr][Ee][Cc].*[Hh][Ee][Aa][Tt].*", "EXCESSIVE HEAT", data$EVTYPE)
data$EVTYPE <- gsub("^[Hh][Ee][Aa][Tt].*", "HEAT", data$EVTYPE)
data$EVTYPE <- gsub(".*[Ff][Rr][Ee][Ee][Zz][Ee].*", "FROST/FREEZE", data$EVTYPE)
data$EVTYPE <- gsub("^[Cc][Oo][Ll][Dd].*", "COLD/WIND CHILL", data$EVTYPE)
data$EVTYPE <- gsub(".*[Rr][Ee][Cc][Oo][Rr][Dd].*[Cc][Oo][Ll][Dd].*", "EXTREME COLD/WIND CHILL", data$EVTYPE)
data$EVTYPE <- gsub(".*[Hh][Uu][Rr][Rr][Ii].*", "HURRICANE", data$EVTYPE)
data$EVTYPE <- gsub("^[Tt][Hh][Uu][Nn].*", "THUNDERSTORM WIND", data$EVTYPE)
data$EVTYPE <- gsub("^[Ii][Cc][Ee].*", "ICE STORM", data$EVTYPE)
data$EVTYPE <- gsub("^[Ff][Rr][Oo][Ss][Tt].*", "FROST/FREEZE", data$EVTYPE)
data$EVTYPE <- gsub("^[Hh][Ee][Aa][Vv][Yy].*[Ss][Nn][Oo].*", "HEAVY SNOW", data$EVTYPE)
data$EVTYPE <- gsub(".*[Ss][Mm].*[Hh][Aa][Ii][Ll].*", "HAIL", data$EVTYPE)
data$EVTYPE <- gsub("^[Ff][Uu][Nn][Nn].*", "FUNNEL CLOUD", data$EVTYPE)
data$EVTYPE <- gsub(".*[Tt][Ee][Mm][Pp].*[Rr][Ee][Cc].*", "EXCESSIVE HEAT", data$EVTYPE)
data$EVTYPE <- gsub(".*[Ss][Pp][Oo][Uu][Tt][Ss]", "WATERSPOUT", data$EVTYPE)
data$EVTYPE <- gsub(".*[Pp][Rr][Ee][Cc][Ii][Pp][Ii].*", "HEAVY RAIN", data$EVTYPE)
data$EVTYPE <- gsub(".*[Gg][Ll][Aa][Zz][Ee].*", "ICE STORM", data$EVTYPE)
data$EVTYPE <- gsub(".*[Ss][Nn][Oo][Ww].*", "HEAVY SNOW", data$EVTYPE)
data$EVTYPE <- gsub("^[Hh][Aa][Ii][Ll].*", "HAIL", data$EVTYPE)
data$EVTYPE <- gsub("^[Ii][Cc][Yy].*", "ICE STORM", data$EVTYPE)
data$EVTYPE <- gsub(".*[Ee][Xx][Cc].*[Ss][Nn][Oo][Ww].*", "HEAVY SNOW", data$EVTYPE)
data$EVTYPE <- gsub(".*[Cc][Oo][Ll][Dd]$", "COLD/WIND CHILL", data$EVTYPE)
data$EVTYPE <- gsub(".*[Ss][Ee][Vv].*[Tt][Hh][Uu][Nn][Dd].*", "THUNDERSTORM WIND", data$EVTYPE)
data$EVTYPE <- gsub(".*[Hh][Ii].*[Tt][Ii][Dd][Ee].*", "COASTAL FLOOD", data$EVTYPE)
data$EVTYPE <- gsub(".*[Gg][Uu][Ss][Tt].*[Ww][Ii][Nn][Dd].*", "HIGH WIND", data$EVTYPE)
data$EVTYPE <- gsub(".*[Ee][Xx][Tt].*[Hh][Ee][Aa][Tt].*", "EXCESSIVE HEAT", data$EVTYPE)
data$EVTYPE <- gsub(".*[Tt][Ii][Dd].*[Ff][Ll].*", "COASTAL FLOOD", data$EVTYPE)
data$EVTYPE <- gsub("^[Ff][Rr][Ee][Ee][Zz]*", "FREEZING FOG", data$EVTYPE)
length(unique(data$EVTYPE))
## [1] 341
From this point, there are less than 20 occurences of each of non referenced event types. Therefore we will now be much more general in the replacements.
data$EVTYPE <- gsub(".*[Ww][Ii][Nn].*", "HIGH WIND", data$EVTYPE)
data$EVTYPE <- gsub(".*[Ss][Nn][Oo][Ww].*", "HEAVY SNOW", data$EVTYPE)
data$EVTYPE <- gsub(".*[Bb][Ll][Ii][Zz].*", "BLIZZARD", data$EVTYPE)
data$EVTYPE <- gsub(".*[Ww][Ee][Tt].*", "HEAVY RAIN", data$EVTYPE)
data$EVTYPE <- gsub(".*[Rr][Aa][Ii].*", "HEAVY RAIN", data$EVTYPE)
data$EVTYPE <- gsub(".*[Ff][Ll][Oo][Oo].*", "FLOOD", data$EVTYPE)
data$EVTYPE <- gsub(".*[Hh][Aa][Ii].*", "HAIL", data$EVTYPE)
data$EVTYPE <- gsub(".*[Bb][Uu][Rr][Ss].*", "THUNDERSTORM WIND", data$EVTYPE)
data$EVTYPE <- gsub(".*[Oo][Rr][Nn][Aa].*", "TORNADO", data$EVTYPE)
data$EVTYPE <- gsub(".*[Ss][Uu][Rr].*", "HIGH SURF", data$EVTYPE)
data$EVTYPE <- gsub(".*[Rr][Ee][Cc].*[Tt][Ee][Mm].*", "HIGH WIND", data$EVTYPE)
data$EVTYPE <- gsub(".*[Ff][Oo][Gg].*", "DENSE FOG", data$EVTYPE)
data$EVTYPE <- gsub(".*[Vv][Oo][Ll].*", "VOLCANIC ASH", data$EVTYPE)
data$EVTYPE <- gsub(".*[Dd][Ee][Vv].*", "DUST DEVIL", data$EVTYPE)
data$EVTYPE <- gsub(".*[Dd][Rr][Yy].*", "HEAT", data$EVTYPE)
data$EVTYPE <- gsub(".*[Ss][Uu][Mm].*", "SUMMARY", data$EVTYPE)
data$EVTYPE <- gsub(".*[Ll][Ii][Gg][Hh][Tt][Nn].*", "LIGHTNING", data$EVTYPE)
data$EVTYPE <- gsub(".*[Ww][Aa][Tt][Ee][Rr].*", "WATERSPOUT", data$EVTYPE)
length(unique(data$EVTYPE))
## [1] 136
We divided the number of events by more than 7. We still have more than 30 events ; let’s see the percentage of the fewer occurences :
(sum(sort(table(data$EVTYPE)))-sum(tail(sort(table(data$EVTYPE))),30))/sum(sort(table(data$EVTYPE)))
## [1] 0.05223668
Less than 6%. As we will be looking only to major events, we could conclude that the cleaning is sufficient.
To help me to handle the exponent values of PROPDMGEXP and CROPDMGEXP, we should have a look to the PROPDMGEXP table:
table(data$PROPDMGEXP)
##
## - ? + 0 1 2 3 4 5
## 465934 1 8 5 216 25 13 4 4 28
## 6 7 8 B h H K m M
## 4 5 1 40 1 6 424665 7 11330
table(data$CROPDMGEXP)
##
## ? 0 2 B k K m M
## 618413 7 19 1 9 21 281832 1 1994
I will only take into account the “K” for kilo, “M” for mega, “B” for billion and ignore the others which only count for a very small percentage (less than 0,1%).
mean(data$PROPDMGEXP %in% setdiff(data$PROPDMGEXP, c("K", "M", "B", "")))
## [1] 0.0003635167
mean(data$CROPDMGEXP %in% setdiff(data$CROPDMGEXP, c("K", "M", "B", "")))
## [1] 5.430584e-05
# set a cropmultiplier according to crop damages exponent value
data$cropmultiplier <- 1L
data$cropmultiplier[data$CROPDMGEXP=='K'] <- 1000
data$cropmultiplier[data$CROPDMGEXP=='M'] <- 1000000
data$cropmultiplier[data$CROPDMGEXP=='B'] <- 1000000000
data$CROPDMG <- data$CROPDMG*data$cropmultiplier
# same for properties damages
data$propmultiplier <- 1L
data$propmultiplier[data$PROPDMGEXP=='K'] <- 1000
data$propmultiplier[data$PROPDMGEXP=='M'] <- 1000000
data$propmultiplier[data$PROPDMGEXP=='B'] <- 1000000000
data$PROPDMG <- data$PROPDMG*data$propmultiplier
Transforms the begin date od the event (BGN_DATE) from a character vector to a date :
data$BGN_DATE <- as.Date(data$BGN_DATE,format=("%m/%d/%Y %H:%M:%S"))
Selects only the columns that we will use in the analysis part : BGN_DATE, EVTYPE, FATALITIES, INJURIES, PROPDMG and CROPDMG :
df <- select(data, BGN_DATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, CROPDMG)
colnames(df) <- tolower(colnames(df))
According to the NOAA web site (http://www.ncdc.noaa.gov/stormevents/details.jsp) Event Types Available are :
We therefore have to categorize data into three periods of time :
df$timeperiod <- "from 1950 to 1954"
df$timeperiod[df$bgn_date >= "1955-01-01" & df$bgn_date <= "1995-12-31"] <- "from 1955 to 1995"
df$timeperiod[df$bgn_date >= "1996-01-01"] <- "since 1996"
# creates a fatalities sum table depending on event type and time period
dffatalities <- ddply(df, c("evtype", "timeperiod"), summarise,
fatalities=sum(fatalities))
tablefatalities <- arrange(dffatalities, desc(fatalities))[1:10,]
# creates an injuries sum table depending on event type and time period
dfinjuries <- ddply(df, c("evtype", "timeperiod"), summarise,
injuries=sum(injuries))
tableinjuries <- arrange(dfinjuries, desc(injuries))[1:10,]
# creates a damages sum table depending on event type
dfdmg <- ddply(df, "evtype", summarise,
cropdmg=sum(cropdmg)/10^9,
propdmg=sum(propdmg)/10^9,
sumdmg = cropdmg+propdmg)
tabledmg <- arrange(dfdmg, desc(sumdmg))[1:10,]
tablefatalities
## evtype timeperiod fatalities
## 1 TORNADO from 1955 to 1995 3236
## 2 EXCESSIVE HEAT since 1996 1799
## 3 TORNADO since 1996 1511
## 4 HIGH WIND since 1996 1427
## 5 FLOOD since 1996 1341
## 6 TORNADO from 1950 to 1954 889
## 7 HEAT from 1955 to 1995 877
## 8 LIGHTNING since 1996 651
## 9 RIP CURRENT since 1996 542
## 10 HIGH WIND from 1955 to 1995 531
graph1 <- ggplot(tablefatalities, aes(evtype, fatalities, fill=evtype)) +
geom_bar(stat = "identity", colour = "black") +
facet_grid(. ~ timeperiod) +
scale_fill_brewer(palette="Oranges") +
theme(axis.text.x = element_blank(), legend.title=element_blank()) +
xlab("Event Type") +
ylab("Fatalities") +
ggtitle("Most fatal events, 1950-1954 / 1955-1995 / 1996-2011")
print(graph1)
Nota Bene :
The most fatal event type since 1950 is "tornado" with a total of 5.636 deaths. Nevetheless, since 1996 (more event types recorded as defined in NWS Directive 10-1605), the most fatal event is "excessive heat" with 1.799 deaths.
tableinjuries
## evtype timeperiod injuries
## 1 TORNADO from 1955 to 1995 61796
## 2 TORNADO since 1996 20667
## 3 TORNADO from 1950 to 1954 8944
## 4 HIGH WIND since 1996 8656
## 5 FLOOD since 1996 8522
## 6 EXCESSIVE HEAT since 1996 6410
## 7 HIGH WIND from 1955 to 1995 5113
## 8 LIGHTNING since 1996 4141
## 9 ICE STORM from 1955 to 1995 1822
## 10 WILDFIRE since 1996 1458
graph2 <- ggplot(tableinjuries, aes(evtype, injuries, fill=evtype)) +
geom_bar(stat = "identity", colour = "black") +
facet_grid(. ~ timeperiod) +
scale_fill_brewer(palette="Oranges") +
theme(axis.text.x = element_blank(), legend.title=element_blank()) +
xlab("Event Type") +
ylab("Injuries") +
ggtitle("Most harmful events (injuries), 1950-1954 / 1955-1995 / 1996-2011")
print(graph2)
Nota Bene
The most harmful event type (in terms of injuries) since 1950 is "tornado" with a total of 91.407 injuries.
tabledmg
## evtype cropdmg propdmg sumdmg
## 1 FLOOD 12.2703842 167.566340 179.836724
## 2 HURRICANE 5.4952928 84.636180 90.131473
## 3 TORNADO 0.4149615 56.981598 57.396559
## 4 HIGH SURF 0.0023550 48.079699 48.082054
## 5 HIGH WIND 3.5162599 25.788664 29.304924
## 6 HAIL 3.0464710 15.969644 19.016115
## 7 DROUGHT 13.9726368 1.053039 15.025675
## 8 ICE STORM 5.0271143 3.969863 8.996977
## 9 WILDFIRE 0.4032816 8.501629 8.904910
## 10 TROPICAL STORM 0.6783460 7.703891 8.382237
graph3 <- ggplot(tabledmg, aes(evtype, sumdmg, fill=evtype)) +
geom_bar(stat = "identity", colour = "black") +
scale_fill_brewer(palette="Oranges") +
theme(axis.text.x = element_blank(), legend.title=element_blank()) +
xlab("Event Type") +
ylab("Total Damages (in billion $)") +
ggtitle("Greatest economic consequences events")
print(graph3)
The greatest economic consequences event type (total of properties and crop damages) since 1950 is "flood" with a total estimated cost of 179,8 billion dollars.