Synopsis

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

The objective of this analysis to identify which severe weather events 1) are most harmful with respect to population health, 2) have the greatest economic consequences.

Environment

  1. OS: Windows 10
  2. R Version: 3.2.2 (2015-08-14)
  3. Tool: R studio Version 0.99.486
  4. R packages used: dplyr, lubridate, stringr, ggplot2, scales, gridExtra
  5. Data: U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database
  6. Publishing tool: RPubs

Configuration and Required Packages

The global options for the document are set with echo = TRUE so that reviewers can see the code, data is lazy-loaded from previously saved databases with cache = TRUE, and figures sizes are pre-set.

# set the global options
knitr::opts_chunk$set(echo = TRUE, cache = TRUE, warning = FALSE, message = FALSE, fig.width=12, fig.height=8)

The analysis requires the r packages dplyr, lubridate, stringr, ggplot2, scales, and gridExtra.

# load the required packages
library(dplyr)
library(lubridate)
library(stringr)
library(ggplot2)
library(scales)
library(gridExtra)

Data Processing

The data for the analysis is downloaded and the generated csv is read in R.

# download file
if (!"StormData.csv.bz2" %in% dir("./")) {
        download.file("http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", destfile = "StormData.csv.bz2", mode="wb")
}

# read file into r
dataset <- read.csv("StormData.csv.bz2", stringsAsFactors = FALSE)

The documentation of the database is available at the following links

Exploring the original dataset, we find out that there are 902,297 rows and 37 columns in total. The events in the database start in the year 1950 and end in November 2011.

# dimensions of dataset
dim(dataset)
## [1] 902297     37
# names of the columns
names(dataset)
##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"

We create a subset with only the information relevant for the analysis,ie, types of weather events, and their impact on health and the economy. The selected columns are:

# Subset dataset
impact <- select(dataset, BGN_DATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)
dim(impact)
## [1] 902297      8

In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of reliable/complete records. This is shown by plotting the histogram of the observations over the years.

# To assist in date analysis, dates are converted to POSIXct format
impact <- mutate(impact, BGN_DATE = mdy(str_extract(impact$BGN_DATE, "[^ ]+")), YEAR = year(BGN_DATE))
# Plot the histogram of observations over the years
ggplot(data = impact, aes(impact$YEAR)) + 
        geom_histogram(binwidth = 3) +
        scale_x_continuous(limits=c(1950,2012), breaks = seq(1950,2011, by=5)) +
        geom_vline(xintercept=1995) +
        ggtitle("Histogram of Weather Events") +
        xlab("Year") +
        ylab("Frequency")+
        theme_bw() +
        theme(
                panel.background = element_blank(),
                panel.border = element_blank(),
                panel.grid.major = element_line(colour = "lightgray"),
                panel.grid.minor = element_blank(),
                panel.grid.major.x = element_blank(),
                legend.position = "bottom",
                legend.title = element_blank(),
                legend.key = element_blank()
        )
Fig.1 Frequency of Observations from 1950 to 2011

Fig.1 Frequency of Observations from 1950 to 2011

As we can see from the above chart, the dataset starts to have more records after the year 1995. So we decide to retain only the rows after 1995 and discard the remaining rows.

impact <- filter(impact, YEAR>=1995)
dim(impact)
## [1] 681500      9

The list of events within the column impact$EVTYPE include several repetitions, typos, and categories are generally not consistent.

To group and cleanse the impact$EVTYPE column, we use the list of events included in the storm data event table, available from the National Weather Service Instruction Directive.

# clean EVTYPE
impact <- mutate(impact, EVTYPE = ifelse(grepl("astro", EVTYPE, ignore.case = TRUE), "Astronomical Low Tide", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("avalanche|slide|slump|landspout", EVTYPE, ignore.case = TRUE), "Avalanche", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("blizzard", EVTYPE, ignore.case = TRUE), "Blizzard", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("coast|beach|tide", EVTYPE, ignore.case = TRUE), "Coastal Flood", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("cold|hypo|cool|low", EVTYPE, ignore.case = TRUE), "Cold/Windhill", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("debris", EVTYPE, ignore.case = TRUE), "Debris Flow", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("fog|vog", EVTYPE, ignore.case = TRUE), "Dense Fog", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("smoke", EVTYPE, ignore.case = TRUE), "Dense Smoke", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("drought|dry|driest", EVTYPE, ignore.case = TRUE), "Drought", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("devil|devel", EVTYPE, ignore.case = TRUE), "Dust Devil", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("dust storm|blowing|saharan", EVTYPE, ignore.case = TRUE), "Dust Storm", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("heat|record high|warm|hot|temperature|wet", EVTYPE, ignore.case = TRUE), "Excessive Heat", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("flood|sea|water|swell|stream|dam|floyd", EVTYPE, ignore.case = TRUE), "Flood", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("flash", EVTYPE, ignore.case = TRUE), "Flash Flood", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("freez|frost", EVTYPE, ignore.case = TRUE), "Frost/Freeze", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("funnel|cloud", EVTYPE, ignore.case = TRUE), "Funnelloud", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("freezing fog", EVTYPE, ignore.case = TRUE), "Freezing Fog", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("hail", EVTYPE, ignore.case = TRUE), "Hail", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("rain|precipitation|shower", EVTYPE, ignore.case = TRUE), "Heavy Rain", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("snow", EVTYPE, ignore.case = TRUE), "Heavy Snow", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("surf", EVTYPE, ignore.case = TRUE), "High Surf", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("wind|wnd", EVTYPE, ignore.case = TRUE), "High Wind", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("hurrican|typhoon", EVTYPE, ignore.case = TRUE), "Hurricane", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("ice|glaze|icy", EVTYPE, ignore.case = TRUE), "Ice Storm", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("lake.snow", EVTYPE, ignore.case = TRUE), "Lake-Effect Snow", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("lake.flood", EVTYPE, ignore.case = TRUE), "Lakeshore Flood", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("lightning|LIGNTNING", EVTYPE, ignore.case = TRUE), "Lightning", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("marine hail", EVTYPE, ignore.case = TRUE), "Marine Hail", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("marine high|mishap|accident|drowning|wave", EVTYPE, ignore.case = TRUE), "Marine High Wind", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("marine strong wind", EVTYPE, ignore.case = TRUE), "Marine Strong Wind", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("thunder|TSTM", EVTYPE, ignore.case = TRUE), "Thunderstorm Wind", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("MARINE TSTM WIND|MARINE THUNDERSTORM WIND", EVTYPE, ignore.case = TRUE), "Marine Thunderstorm Wind", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("rip", EVTYPE, ignore.case = TRUE), "Ripurrent", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("seiche", EVTYPE, ignore.case = TRUE), "Seiche", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("sleet", EVTYPE, ignore.case = TRUE), "Sleet", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("surge", EVTYPE, ignore.case = TRUE), "Storm Surge/Tide", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("^strong wind", EVTYPE, ignore.case = TRUE), "Strong Wind", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("tornado|burst|gustnado", EVTYPE, ignore.case = TRUE), "Tornado", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("depression", EVTYPE, ignore.case = TRUE), "Tropical Depression", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("tropical storm", EVTYPE, ignore.case = TRUE), "Tropical Storm", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("tsunami", EVTYPE, ignore.case = TRUE), "Tsunami", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("volcanic", EVTYPE, ignore.case = TRUE), "Volcanic Ash", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("waterspout", EVTYPE, ignore.case = TRUE), "Waterspout", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("fire", EVTYPE, ignore.case = TRUE), "Wildfire", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("winter storm", EVTYPE, ignore.case = TRUE), "Winter Storm", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("winter weather|wintry|wintery", EVTYPE, ignore.case = TRUE), "Winter Weather", EVTYPE))                         

The variables PROPDMGEXP and CROPDMGEXP are then converted to numeric

# Format damage valuation magnitude as characters for re-coding: 
impact$PROPDMGEXP <- as.character(impact$PROPDMGEXP)
impact$CROPDMGEXP <- as.character(impact$CROPDMGEXP)

# Create numeric values to represent the coded damage valuation magnitudes 
# (i.e. "h" = 100; "k" = 1,000; "m" = 1,000,000; etc.)
# re-code PROPDMGEXP variable 
impact$PROPDMGEXP[(impact$PROPDMGEXP == "h") | (impact$PROPDMGEXP == "H")] <- 100
impact$PROPDMGEXP[(impact$PROPDMGEXP == "k") | (impact$PROPDMGEXP == "K")] <- 1000
impact$PROPDMGEXP[(impact$PROPDMGEXP == "m") | (impact$PROPDMGEXP == "M")] <- 1000000
impact$PROPDMGEXP[(impact$PROPDMGEXP == "b") | (impact$PROPDMGEXP == "B")] <- 1000000000
impact$PROPDMGEXP[(impact$PROPDMGEXP == "0")] <- 1
impact$PROPDMGEXP[(impact$PROPDMGEXP == "1")] <- 10
impact$PROPDMGEXP[(impact$PROPDMGEXP == "2")] <- 100
impact$PROPDMGEXP[(impact$PROPDMGEXP == "3")] <- 1000
impact$PROPDMGEXP[(impact$PROPDMGEXP == "4")] <- 10000
impact$PROPDMGEXP[(impact$PROPDMGEXP == "5")] <- 100000
impact$PROPDMGEXP[(impact$PROPDMGEXP == "6")] <- 1000000
impact$PROPDMGEXP[(impact$PROPDMGEXP == "7")] <- 10000000
impact$PROPDMGEXP[(impact$PROPDMGEXP == "8")] <- 100000000
impact$PROPDMGEXP[(impact$PROPDMGEXP == "") | (impact$PROPDMGEXP == "?")] <- 0
impact$PROPDMGEXP[(impact$PROPDMGEXP == "+") | (impact$PROPDMGEXP == "-")] <- 0

# re-code CROPDMGEXP variable
impact$CROPDMGEXP[(impact$CROPDMGEXP == "h") | (impact$CROPDMGEXP == "H")] <- 100
impact$CROPDMGEXP[(impact$CROPDMGEXP == "k") | (impact$CROPDMGEXP == "K")] <- 1000
impact$CROPDMGEXP[(impact$CROPDMGEXP == "m") | (impact$CROPDMGEXP == "M")] <- 1000000
impact$CROPDMGEXP[(impact$CROPDMGEXP == "b") | (impact$CROPDMGEXP == "B")] <- 1000000000
impact$CROPDMGEXP[(impact$CROPDMGEXP == "0")] <- 1
impact$CROPDMGEXP[(impact$CROPDMGEXP == "1")] <- 10
impact$CROPDMGEXP[(impact$CROPDMGEXP == "2")] <- 100
impact$CROPDMGEXP[(impact$CROPDMGEXP == "3")] <- 1000
impact$CROPDMGEXP[(impact$CROPDMGEXP == "4")] <- 10000
impact$CROPDMGEXP[(impact$CROPDMGEXP == "5")] <- 100000
impact$CROPDMGEXP[(impact$CROPDMGEXP == "6")] <- 1000000
impact$CROPDMGEXP[(impact$CROPDMGEXP == "7")] <- 10000000
impact$CROPDMGEXP[(impact$CROPDMGEXP == "8")] <- 100000000
impact$CROPDMGEXP[(impact$CROPDMGEXP == "") | (impact$CROPDMGEXP == "?")] <- 0
impact$CROPDMGEXP[(impact$CROPDMGEXP == "+") | (impact$CROPDMGEXP == "-")] <- 0

# convert PROPDMGEXP & CROPDMGEXP variables to integers
impact$PROPDMGEXP <- as.integer(impact$PROPDMGEXP)
impact$CROPDMGEXP <- as.integer(impact$CROPDMGEXP)

We get total property and crop damage values by multiplying the valuation and magnitude columns for both the property damage (PROPDMG * PROPDMGEXP) and crop damage (CROPDMG * CROPDMGEXP) variables. Then we combine these two values to calculate an aggregated total economic damage valuation (total_property_damage + total_crop_damage).

impact$total_property_damage <- impact$PROPDMG * impact$PROPDMGEXP
impact$total_crop_damage <- impact$CROPDMG * impact$CROPDMGEXP
impact$total_economic_damage <- impact$total_property_damage + impact$total_crop_damage

Results

1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

In this section we want to find the top 10 of weather events that caused the most fatalities and injuries.

We group fatalities and injuries by event type and get the top 10 event types which are the most harmful to public health:

# group fatalities by event type and sort event types by number of fatalities in decreasing order 
aggrfatalities <- aggregate(FATALITIES~EVTYPE, data = impact, "sum")
fatalities <- aggrfatalities[order(-aggrfatalities$FATALITIES), ][1:10, ]
fatalities$EVTYPE <- factor(fatalities$EVTYPE, levels = fatalities$EVTYPE)
fatalities
##             EVTYPE FATALITIES
## 10  Excessive Heat       3092
## 105        Tornado       1545
## 11           Flood       1450
## 19       High Wind       1269
## 23       Lightning        730
## 33       Ripurrent        564
## 1        Avalanche        267
## 113   Winter Storm        195
## 18       High Surf        163
## 20       Hurricane        133
# group injuries by event type and sort event types by number of injuries in decreasing order 
aggrinjuries <- aggregate(INJURIES~EVTYPE, data = impact, "sum")
injuries <- aggrinjuries[order(-aggrinjuries$INJURIES), ][1:10, ]
injuries$EVTYPE <- factor(injuries$EVTYPE, levels = injuries$EVTYPE)
injuries
##             EVTYPE INJURIES
## 105        Tornado    21783
## 10  Excessive Heat     9107
## 11           Flood     8659
## 19       High Wind     7366
## 23       Lightning     4633
## 111       Wildfire     1458
## 20       Hurricane     1330
## 113   Winter Storm     1298
## 14            Hail     1021
## 4        Dense Fog      994

As we can see from the chart below, Excessive Heat is to blame for the largest number of fatalities, while Tornado caused the highest number of injuries.

# fatalities plot
fatalities_plot <- ggplot(fatalities, aes(x=EVTYPE, y=FATALITIES, fill=FATALITIES)) + geom_bar(stat = "identity") + 
    ylab("Number of Fatalities") +
    ggtitle ("Total fatalities per event type") +
    theme(axis.text=element_text(size=11),
          axis.title.x=element_blank(),
          panel.background = element_blank(),
                panel.border = element_blank(),
                panel.grid.major = element_line(colour = "lightgray"),
                panel.grid.minor = element_blank(),
                panel.grid.major.x = element_blank()) 
# injuries plot
injuries_plot <- ggplot(injuries, aes(x=EVTYPE, y=INJURIES, fill=INJURIES)) + geom_bar(stat = "identity") + 
    ylab("Number of Injuries") +
    ggtitle ("Total injuries per event type") +
    theme(axis.text=element_text(size=11),
          axis.title.x=element_blank(),
          panel.background = element_blank(),
                panel.border = element_blank(),
                panel.grid.major = element_line(colour = "lightgray"),
                panel.grid.minor = element_blank(),
                panel.grid.major.x = element_blank()) 
# arrange plots in one chart
grid.arrange(fatalities_plot, injuries_plot, nrow=2)
Fig.2 Impact of Weather Events on Public Health from 1995 to 2011

Fig.2 Impact of Weather Events on Public Health from 1995 to 2011

2. Which events have the greatest economic consequences?

We can apply the same logic to identify the events which cause the greatest economic impacts. In order to extract the top 10 events which cause the most economic damages, we will use the total_economic_damage variable, which included both damage to properties and to crops.

# group economic damage by event type and sort event types by economic impact in decreasing order 
aggr_economic_impact <- aggregate(total_economic_damage~EVTYPE, data = impact, "sum")
economic_impact <- aggr_economic_impact [order(-aggr_economic_impact$total_economic_damage), ][1:10, ]
economic_impact$EVTYPE <- factor(economic_impact$EVTYPE, levels = economic_impact$EVTYPE)
economic_impact
##               EVTYPE total_economic_damage
## 11             Flood          172908397730
## 20         Hurricane           90655952810
## 37  Storm Surge/Tide           43193541000
## 105          Tornado           25237340320
## 14              Hail           18035235127
## 19         High Wind           17806783671
## 6            Drought           14970175380
## 107   Tropical Storm            8352221550
## 111         Wildfire            8163274130
## 16        Heavy Rain            3895108240

Then we plot the top 10 events with the greatest economic impact.

# economic impact plot
economic_plot <- ggplot(economic_impact, aes(x=EVTYPE, y=total_economic_damage/1000000000, fill=total_economic_damage/1000000000)) + geom_bar(stat = "identity") + 
    ylab("USD (Billions)") +
    ggtitle ("Total economic damage per event type") +
        scale_fill_continuous(name="USD (Billions)") +
    theme(axis.text=element_text(size=10),
          axis.title.x=element_blank(),
          panel.background = element_blank(),
                panel.border = element_blank(),
                panel.grid.major = element_line(colour = "lightgray"),
                panel.grid.minor = element_blank(),
                panel.grid.major.x = element_blank())
economic_plot
Fig.3 Impact of Weather Events on Economy from 1995 to 2011

Fig.3 Impact of Weather Events on Economy from 1995 to 2011

As we can see from Fig.3, the weather events which caused the greatest economic damage from 1995 to 2011 were Floods (~$173 billions), followed by Hurricanes (~$91 billions), and Storm Surges (~$43 billions).