1. SYNOPSIS

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

We will address the following questions : 1. Across the United States, which types of events are most harmful with respect to population health ? 2. Across the United States, which types of events have the greatest economic consequences ?


2. DATA PROCESSING

2.1. Set working environment and getting the files

library("data.table")
library("ggplot2")
library("dplyr")
library("plyr")

You can download the file from the coursera web site : Storm Database.

The events in the file start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records.

In order to understand the Storm Database content, you can download these two files : National Weather Service Storm Data Documentation and National Climatic Data Center Storm Events FAQ

Download the zip file in a “data” directory

setwd("~/Coursera/5. Reproducible Research/Week4")
if (!file.exists("data")){
        dir.create("data")
}
fileUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(fileUrl,destfile = "./data/data.csv.bz2")
fileUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf"
download.file(fileUrl,destfile = "./data/stormdatadoc.pdf")
fileUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2FNCDC%20Storm%20Events-FAQ%20Page.pdf"
download.file(fileUrl,destfile = "./data/faq.pdf")

Creates data table with the readcsv() and bzfile() functions to unzip the file

setwd("~/Coursera/5. Reproducible Research/Week4")
data <- data.frame(read.csv(bzfile("./data/data.csv.bz2"),stringsAsFactors = FALSE))

2.2. Data Cleaning : Events

According to the NOAA documentation, the Storm Database is constructed with manual entries according to a table of 30 types of events. The type of events is defined in the following table :

  • Astronomical Low Tide
  • Avalanche
  • Blizzard
  • Coastal Flood
  • Cold/Wind Chill
  • Debris Flow
  • Dense Fog
  • Dense Smoke
  • Drought
  • Dust Devil
  • Dust Storm
  • Excessive Heat
  • Extreme Cold/Wind Chill
  • Flash Flood
  • Flood
  • Frost/Freeze
  • Funnel Cloud
  • Freezing Fog
  • Hail
  • Heat
  • Heavy Rain
  • Heavy Snow
  • High Surf
  • High Wind
  • Hurricane (Typhoon)
  • Ice Storm
  • Lake-Effect Snow
  • Lakeshore Flood
  • Lightning
  • Marine Hail
  • Marine High Wind
  • Marine Strong Wind
  • Marine Thunderstorm Wind
  • Rip Current
  • Seiche
  • Sleet
  • Storm Surge/Tide
  • Strong Wind
  • Thunderstorm Wind
  • Tornado
  • Tropical Depression
  • Tropical Storm
  • Tsunami
  • Volcanic Ash
  • Waterspout
  • Wildfire
  • Winter Storm
  • Winter Weather

For example there is a difference between “Cold/Wind Chill”" and “Extreme Cold/Wind Chill” or between “Heat” and “Extreme heat”.

Unfortunately, the events have been manually entered in the Storm Database through time. Therefore when we look at the event type column in the Storm Databse, we find much more than 30 events :

length(unique(data$EVTYPE))
## [1] 985

Therefore we do need to make some replacement and some corrections.

The strategy used is the following :

  • Look first to the most frequent event types using the command :

    tail(sort(table(data$EVTYPE)),30)"
  • Check that these are part of the reference event table
  • If not, find the most accurate and corresponding standard event. We have to be very precautious to avoid mixing “heat” and “extreme heat” for example.

data$EVTYPE <- gsub(".*[Hh][Ii][Gg][Hh] [Ss][Uu][Rr][Ff].*", "HIGH SURF", data$EVTYPE)
data$EVTYPE <- gsub(".*[Tt][Ss][Tt][Mm].*[Ww][Ii][Nn].*", "THUNDERSTORM WIND", data$EVTYPE)
data$EVTYPE <- gsub("[Tt][Hh][Uu][Nn][Dd][Ee][Rr][Ss][Tt][Oo][Rr][Mm] [Ww][Ii][Nn][Dd][Ss]", "THUNDERSTORM WIND", data$EVTYPE)
data$EVTYPE <- gsub(".*[Ff][Ll][Dd].*", "FLOOD", data$EVTYPE)
data$EVTYPE <- gsub(".*[Ff][Ii][Rr].*", "WILDFIRE", data$EVTYPE) 
data$EVTYPE <- gsub(".*[Hh][Ii].*[Ww][Ii][Nn][Dd][Ss].*", "HIGH WIND", data$EVTYPE)
data$EVTYPE <- gsub(".*[Ww][Ii][Nn][Tt].*[Ww][Ee][Aa][Tt][Hh].*", "WINTER WEATHER", data$EVTYPE)
data$EVTYPE <- gsub(".*[Ff][Ll].*[Ff][Ll].*", "FLASH FLOOD", data$EVTYPE)
data$EVTYPE <- gsub("[Ee][Xx][Tt][Rr][Ee][Mm][Ee] [Cc][Oo][Ll][Dd]$", "EXTREME COLD/WIND CHILL", data$EVTYPE)
data$EVTYPE <- gsub("^[Ss][Nn][Oo][Ww]", "HEAVY SNOW", data$EVTYPE)
data$EVTYPE <- gsub(".*[Ee][Rr][Oo][Ss].*", "COASTAL FLOOD", data$EVTYPE)
data$EVTYPE <- gsub("^[Bb][Ll][Ii][Zz].*", "BLIZZARD", data$EVTYPE)
data$EVTYPE <- gsub("^[Cc][Oo][Aa][Ss][Tt].*", "COASTAL FLOOD", data$EVTYPE)
data$EVTYPE <- gsub("^[Dd][Rr][Yy].*", "DROUGHT", data$EVTYPE)
data$EVTYPE <- gsub(".*[Dd][Rr][Yy]$", "DROUGHT", data$EVTYPE)
data$EVTYPE <- gsub(".*[Dd][Rr][Oo][Uu][Gg][Hh][Tt].*", "DROUGHT", data$EVTYPE)
data$EVTYPE <- gsub("^[Ww][Ii][Nn].*", "HIGH WIND", data$EVTYPE)
data$EVTYPE <- gsub("^[Rr][Ii][Pp].*", "RIP CURRENT", data$EVTYPE)
data$EVTYPE <- gsub("^[Ee][Xx][Tt][Rr][Ee][Mm][Ee] [Ww][Ii][Nn].*", "EXTREME COLD/WIND CHILL", data$EVTYPE)
data$EVTYPE <- gsub("^[Ff][Oo][Gg].*", "DENSE FOG", data$EVTYPE)
data$EVTYPE <- gsub(".*[Ss][Uu][Rr][Gg][Ee]$", "STORM SURGE/TIDE", data$EVTYPE)
data$EVTYPE <- gsub("^[Uu][Rr][Bb][Aa][Nn].*[Ff][Ll].*", "FLOOD", data$EVTYPE)
data$EVTYPE <- gsub(".*[Ss][Tt].*[Ww][Ii][Nn][Dd][Ss]$.*", "STRONG WIND", data$EVTYPE)
data$EVTYPE <- gsub(".*[Rr][Aa][Ii][Nn].*", "HEAVY RAIN", data$EVTYPE)
data$EVTYPE <- gsub(".*[Rr][Ii][Vv].*[Ff][Ll].*", "FLOOD", data$EVTYPE)
data$EVTYPE <- gsub(".*[Ww][Aa][Rr][Mm].*", "EXCESSIVE HEAT", data$EVTYPE)
data$EVTYPE <- gsub("[Ff][Ll][Oo][Oo][Dd][Ii][Nn][Gg]$", "FLOOD", data$EVTYPE)
data$EVTYPE <- gsub(".*[Ll][Ii][Gg].*[Ss][Nn][Oo][Ww].*", "HEAVY SNOW", data$EVTYPE)
data$EVTYPE <- gsub(".*[Ss][Nn][Oo][Ww][Ff][Aa][Ll][Ll].*", "HEAVY SNOW", data$EVTYPE)
data$EVTYPE <- gsub(".*[Cc][Ll][Oo][Uu][Dd][Ss]$", "FUNNEL CLOUD", data$EVTYPE)
data$EVTYPE <- gsub(".*[Ss][Uu][Rr][Ff]$", "HIGH SURF", data$EVTYPE)
data$EVTYPE <- gsub(".*[Rr][Ee][Cc].*[Hh][Ee][Aa][Tt].*", "EXCESSIVE HEAT", data$EVTYPE)
data$EVTYPE <- gsub("^[Hh][Ee][Aa][Tt].*", "HEAT", data$EVTYPE)
data$EVTYPE <- gsub(".*[Ff][Rr][Ee][Ee][Zz][Ee].*", "FROST/FREEZE", data$EVTYPE)
data$EVTYPE <- gsub("^[Cc][Oo][Ll][Dd].*", "COLD/WIND CHILL", data$EVTYPE)
data$EVTYPE <- gsub(".*[Rr][Ee][Cc][Oo][Rr][Dd].*[Cc][Oo][Ll][Dd].*", "EXTREME COLD/WIND CHILL", data$EVTYPE)
data$EVTYPE <- gsub(".*[Hh][Uu][Rr][Rr][Ii].*", "HURRICANE", data$EVTYPE)
data$EVTYPE <- gsub("^[Tt][Hh][Uu][Nn].*", "THUNDERSTORM WIND", data$EVTYPE)
data$EVTYPE <- gsub("^[Ii][Cc][Ee].*", "ICE STORM", data$EVTYPE)
data$EVTYPE <- gsub("^[Ff][Rr][Oo][Ss][Tt].*", "FROST/FREEZE", data$EVTYPE)
data$EVTYPE <- gsub("^[Hh][Ee][Aa][Vv][Yy].*[Ss][Nn][Oo].*", "HEAVY SNOW", data$EVTYPE)
data$EVTYPE <- gsub(".*[Ss][Mm].*[Hh][Aa][Ii][Ll].*", "HAIL", data$EVTYPE)
data$EVTYPE <- gsub("^[Ff][Uu][Nn][Nn].*", "FUNNEL CLOUD", data$EVTYPE)
data$EVTYPE <- gsub(".*[Tt][Ee][Mm][Pp].*[Rr][Ee][Cc].*", "EXCESSIVE HEAT", data$EVTYPE)
data$EVTYPE <- gsub(".*[Ss][Pp][Oo][Uu][Tt][Ss]", "WATERSPOUT", data$EVTYPE)
data$EVTYPE <- gsub(".*[Pp][Rr][Ee][Cc][Ii][Pp][Ii].*", "HEAVY RAIN", data$EVTYPE)
data$EVTYPE <- gsub(".*[Gg][Ll][Aa][Zz][Ee].*", "ICE STORM", data$EVTYPE)
data$EVTYPE <- gsub(".*[Ss][Nn][Oo][Ww].*", "HEAVY SNOW", data$EVTYPE)
data$EVTYPE <- gsub("^[Hh][Aa][Ii][Ll].*", "HAIL", data$EVTYPE)
data$EVTYPE <- gsub("^[Ii][Cc][Yy].*", "ICE STORM", data$EVTYPE)
data$EVTYPE <- gsub(".*[Ee][Xx][Cc].*[Ss][Nn][Oo][Ww].*", "HEAVY SNOW", data$EVTYPE)
data$EVTYPE <- gsub(".*[Cc][Oo][Ll][Dd]$", "COLD/WIND CHILL", data$EVTYPE)
data$EVTYPE <- gsub(".*[Ss][Ee][Vv].*[Tt][Hh][Uu][Nn][Dd].*", "THUNDERSTORM WIND", data$EVTYPE)
data$EVTYPE <- gsub(".*[Hh][Ii].*[Tt][Ii][Dd][Ee].*", "COASTAL FLOOD", data$EVTYPE)
data$EVTYPE <- gsub(".*[Gg][Uu][Ss][Tt].*[Ww][Ii][Nn][Dd].*", "HIGH WIND", data$EVTYPE)
data$EVTYPE <- gsub(".*[Ee][Xx][Tt].*[Hh][Ee][Aa][Tt].*", "EXCESSIVE HEAT", data$EVTYPE)
data$EVTYPE <- gsub(".*[Tt][Ii][Dd].*[Ff][Ll].*", "COASTAL FLOOD", data$EVTYPE)
data$EVTYPE <- gsub("^[Ff][Rr][Ee][Ee][Zz]*", "FREEZING FOG", data$EVTYPE)
length(unique(data$EVTYPE))
## [1] 341

From this point, there are less than 20 occurences of each of non referenced event types. Therefore we will now be much more general in the replacements.

data$EVTYPE <- gsub(".*[Ww][Ii][Nn].*", "HIGH WIND", data$EVTYPE)
data$EVTYPE <- gsub(".*[Ss][Nn][Oo][Ww].*", "HEAVY SNOW", data$EVTYPE)
data$EVTYPE <- gsub(".*[Bb][Ll][Ii][Zz].*", "BLIZZARD", data$EVTYPE)
data$EVTYPE <- gsub(".*[Ww][Ee][Tt].*", "HEAVY RAIN", data$EVTYPE)
data$EVTYPE <- gsub(".*[Rr][Aa][Ii].*", "HEAVY RAIN", data$EVTYPE)
data$EVTYPE <- gsub(".*[Ff][Ll][Oo][Oo].*", "FLOOD", data$EVTYPE)
data$EVTYPE <- gsub(".*[Hh][Aa][Ii].*", "HAIL", data$EVTYPE)
data$EVTYPE <- gsub(".*[Bb][Uu][Rr][Ss].*", "THUNDERSTORM WIND", data$EVTYPE)
data$EVTYPE <- gsub(".*[Oo][Rr][Nn][Aa].*", "TORNADO", data$EVTYPE)
data$EVTYPE <- gsub(".*[Ss][Uu][Rr].*", "HIGH SURF", data$EVTYPE)
data$EVTYPE <- gsub(".*[Rr][Ee][Cc].*[Tt][Ee][Mm].*", "HIGH WIND", data$EVTYPE)
data$EVTYPE <- gsub(".*[Ff][Oo][Gg].*", "DENSE FOG", data$EVTYPE)
data$EVTYPE <- gsub(".*[Vv][Oo][Ll].*", "VOLCANIC ASH", data$EVTYPE)
data$EVTYPE <- gsub(".*[Dd][Ee][Vv].*", "DUST DEVIL", data$EVTYPE)
data$EVTYPE <- gsub(".*[Dd][Rr][Yy].*", "HEAT", data$EVTYPE)
data$EVTYPE <- gsub(".*[Ss][Uu][Mm].*", "SUMMARY", data$EVTYPE)
data$EVTYPE <- gsub(".*[Ll][Ii][Gg][Hh][Tt][Nn].*", "LIGHTNING", data$EVTYPE)
data$EVTYPE <- gsub(".*[Ww][Aa][Tt][Ee][Rr].*", "WATERSPOUT", data$EVTYPE)
length(unique(data$EVTYPE))
## [1] 136

We divided the number of events by more than 7. We still have more than 30 events ; let’s see the percentage of the fewer occurences :

(sum(sort(table(data$EVTYPE)))-sum(tail(sort(table(data$EVTYPE))),30))/sum(sort(table(data$EVTYPE)))
## [1] 0.05223668

Less than 6%. As we will be looking only to major events, we could conclude that the cleaning is sufficient.

2.3. Data Cleaning : Damages exponents values

To help me to handle the exponent values of PROPDMGEXP and CROPDMGEXP, we should have a look to the PROPDMGEXP table:

table(data$PROPDMGEXP)
## 
##             -      ?      +      0      1      2      3      4      5 
## 465934      1      8      5    216     25     13      4      4     28 
##      6      7      8      B      h      H      K      m      M 
##      4      5      1     40      1      6 424665      7  11330
table(data$CROPDMGEXP)
## 
##             ?      0      2      B      k      K      m      M 
## 618413      7     19      1      9     21 281832      1   1994

I will only take into account the “K” for kilo, “M” for mega, “B” for billion and ignore the others which only count for a very small percentage (less than 0,1%).

mean(data$PROPDMGEXP %in% setdiff(data$PROPDMGEXP, c("K", "M", "B", "")))
## [1] 0.0003635167
mean(data$CROPDMGEXP %in% setdiff(data$CROPDMGEXP, c("K", "M", "B", "")))
## [1] 5.430584e-05
# set a cropmultiplier according to crop damages exponent value
data$cropmultiplier <- 1L
data$cropmultiplier[data$CROPDMGEXP=='K'] <- 1000
data$cropmultiplier[data$CROPDMGEXP=='M'] <- 1000000
data$cropmultiplier[data$CROPDMGEXP=='B'] <- 1000000000
data$CROPDMG <- data$CROPDMG*data$cropmultiplier

# same for properties damages
data$propmultiplier <- 1L
data$propmultiplier[data$PROPDMGEXP=='K'] <- 1000
data$propmultiplier[data$PROPDMGEXP=='M'] <- 1000000
data$propmultiplier[data$PROPDMGEXP=='B'] <- 1000000000
data$PROPDMG <- data$PROPDMG*data$propmultiplier

2.4 Data Cleaning : dates

Transforms the begin date od the event (BGN_DATE) from a character vector to a date :

data$BGN_DATE <- as.Date(data$BGN_DATE,format=("%m/%d/%Y %H:%M:%S"))

2.5. Data subsetting

Selects only the columns that we will use in the analysis part : BGN_DATE, EVTYPE, FATALITIES, INJURIES, PROPDMG and CROPDMG :

df <- select(data, BGN_DATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, CROPDMG)
colnames(df) <- tolower(colnames(df))

2.6. Data processing : periods of time

According to the NOAA web site (http://www.ncdc.noaa.gov/stormevents/details.jsp) Event Types Available are :

  • Tornado: From 1950 through 1954, only tornado events were recorded.
  • Tornado, Thunderstorm Wind and Hail: From 1955 through 1992, only tornado, thunderstorm wind and hail events were keyed from the paper publications into digital data. From 1993 to 1995, only tornado, thunderstorm wind and hail events have been extracted from the Unformatted Text Files.
  • All Event Types (48 from Directive 10-1605): From 1996 to present, 48 event types are recorded as defined in NWS Directive 10-1605.

We therefore have to categorize data into three periods of time :

  • 1950 to 1054
  • 1955 to 1992
  • since 1992
df$timeperiod <- "from 1950 to 1954"
df$timeperiod[df$bgn_date >= "1955-01-01" & df$bgn_date <= "1995-12-31"] <- "from 1955 to 1995"
df$timeperiod[df$bgn_date >= "1996-01-01"] <- "since 1996"

2.7. Data Analysis : human health (fatalities and injuries)

# creates a fatalities sum table depending on event type and time period
dffatalities <- ddply(df, c("evtype", "timeperiod"), summarise,
                fatalities=sum(fatalities))
tablefatalities <- arrange(dffatalities, desc(fatalities))[1:10,]

# creates an injuries sum table depending on event type and time period
dfinjuries <- ddply(df, c("evtype", "timeperiod"), summarise,
                injuries=sum(injuries))
tableinjuries <- arrange(dfinjuries, desc(injuries))[1:10,]

2.8. Data Analysis : economic consequences (crop and properties)

# creates a  damages sum table depending on event type
dfdmg <- ddply(df, "evtype", summarise, 
               cropdmg=sum(cropdmg)/10^9,
               propdmg=sum(propdmg)/10^9,
               sumdmg = cropdmg+propdmg)
tabledmg <- arrange(dfdmg, desc(sumdmg))[1:10,]

3. RESULTS

3.1. Across the US, which type of events are more harmful with respect to population health ?

3.1.1 Top 10 fatal event types
tablefatalities
##            evtype        timeperiod fatalities
## 1         TORNADO from 1955 to 1995       3236
## 2  EXCESSIVE HEAT        since 1996       1799
## 3         TORNADO        since 1996       1511
## 4       HIGH WIND        since 1996       1427
## 5           FLOOD        since 1996       1341
## 6         TORNADO from 1950 to 1954        889
## 7            HEAT from 1955 to 1995        877
## 8       LIGHTNING        since 1996        651
## 9     RIP CURRENT        since 1996        542
## 10      HIGH WIND from 1955 to 1995        531
graph1 <- ggplot(tablefatalities, aes(evtype, fatalities, fill=evtype)) +
        geom_bar(stat = "identity", colour = "black") +
        facet_grid(. ~ timeperiod) +
        scale_fill_brewer(palette="Oranges") +
        theme(axis.text.x = element_blank(), legend.title=element_blank()) +
        xlab("Event Type") +
        ylab("Fatalities") +
        ggtitle("Most fatal events, 1950-1954 / 1955-1995 / 1996-2011")
print(graph1)

Nota Bene :

  • From 1950 through 1954, only tornado events were recorded.
  • From 1955 through 1992, only tornado, thunderstorm wind and hail events were keyed and extracted.
  • From 1996 to present, 48 event types are recorded as defined in NWS Directive 10-1605.
The most fatal event type since 1950 is "tornado" with a total of 5.636 deaths. Nevetheless, since 1996 (more event types recorded as defined in NWS Directive 10-1605), the most fatal event is "excessive heat" with 1.799 deaths.
3.1.2 Top 10 harmful event types (injuries)
tableinjuries
##            evtype        timeperiod injuries
## 1         TORNADO from 1955 to 1995    61796
## 2         TORNADO        since 1996    20667
## 3         TORNADO from 1950 to 1954     8944
## 4       HIGH WIND        since 1996     8656
## 5           FLOOD        since 1996     8522
## 6  EXCESSIVE HEAT        since 1996     6410
## 7       HIGH WIND from 1955 to 1995     5113
## 8       LIGHTNING        since 1996     4141
## 9       ICE STORM from 1955 to 1995     1822
## 10       WILDFIRE        since 1996     1458
graph2 <- ggplot(tableinjuries, aes(evtype, injuries, fill=evtype)) +
        geom_bar(stat = "identity", colour = "black") +
        facet_grid(. ~ timeperiod) +
        scale_fill_brewer(palette="Oranges") +
        theme(axis.text.x = element_blank(), legend.title=element_blank()) +
        xlab("Event Type") +
        ylab("Injuries") +
        ggtitle("Most harmful events (injuries), 1950-1954 / 1955-1995 / 1996-2011")
print(graph2)

Nota Bene

  • From 1950 through 1954, only tornado events were recorded.
  • From 1955 through 1992, only tornado, thunderstorm wind and hail events were keyed and extracted.
  • From 1996 to present, 48 event types are recorded as defined in NWS Directive 10-1605.
The most harmful event type (in terms of injuries) since 1950 is "tornado" with a total of 91.407 injuries.

3.2. Across the US, which types of events have the greatest economic consequences ?

tabledmg
##            evtype    cropdmg    propdmg     sumdmg
## 1           FLOOD 12.2703842 167.566340 179.836724
## 2       HURRICANE  5.4952928  84.636180  90.131473
## 3         TORNADO  0.4149615  56.981598  57.396559
## 4       HIGH SURF  0.0023550  48.079699  48.082054
## 5       HIGH WIND  3.5162599  25.788664  29.304924
## 6            HAIL  3.0464710  15.969644  19.016115
## 7         DROUGHT 13.9726368   1.053039  15.025675
## 8       ICE STORM  5.0271143   3.969863   8.996977
## 9        WILDFIRE  0.4032816   8.501629   8.904910
## 10 TROPICAL STORM  0.6783460   7.703891   8.382237
graph3 <- ggplot(tabledmg, aes(evtype, sumdmg, fill=evtype)) +
        geom_bar(stat = "identity", colour = "black") +
        scale_fill_brewer(palette="Oranges") +
        theme(axis.text.x = element_blank(), legend.title=element_blank()) +
        xlab("Event Type") +
        ylab("Total Damages (in billion $)") +
        ggtitle("Greatest economic consequences events")
print(graph3)

The greatest economic consequences event type (total of properties and crop damages) since 1950 is "flood" with a total estimated cost of 179,8 billion dollars.