Synopsis

This analysis explores the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. The database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

Using data for the period 1996 to 2011, this analysis identifies the weather events with the greatest impact on health (fatalities and injuries) and those with the greatest economic impact in terms of damages to property and crops.

Over this period, tornados had the greatest impact on health, while floods resulted in the greatest economic impact.

Initial preparation

Set global knitr options and load libraries

options(scipen = 999)  # avoid scientific notation in printed results
knitr::opts_chunk$set(echo = TRUE)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(ggplot2)

Data processing

Loading the data

The data is downloaded into the working directory if this has not been done previously. The data is then loaded into the data frame rawdata.

projectfile <- "repdata_data_StormData.csv.bz2"

if(!file.exists(projectfile)) {
  fileURL <- 
    "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
  download.file(fileURL, projectfile, method = "curl")
}

rawdata <- read.csv(projectfile, header = TRUE)

Cleaning the data

The original data file contains 902297 observations of 37 variables recorded over the period from 1950 to 2011. Not all of these variables are required for the analysis of the health and economic impact of storms.

Variables required in the analysis are:

  • BGN_DATE
  • EVTYPE
  • FATALITIES
  • INJURIES
  • PROPDMG
  • PROPDMGEXP
  • CROPDMG
  • CROPDMGEXP

According to the National Oceanic and Atmospheric Administration (https://www.ncdc.noaa.gov/stormevents/details.jsp), only a subset of the 48 weather event types was recorded prior to January 1996. To avoid overstating the impact of that subset relative to the other event types, the data is restricted to the period from January 1996 onwards.

# select only the required variables

stormdata <- select(rawdata, BGN_DATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)

# convert BGN_DATE to date format and determine the year

stormdata$BGN_DATE <- as.Date(stormdata$BGN_DATE, "%m/%d/%Y")
stormdata$YEAR <- year(stormdata$BGN_DATE)

# omit data prior to 1996 as only a subset of weather events were recorded
# in prior years

stormdata <- filter(stormdata, YEAR>=1996)
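The conversion above relies on as.Date ignoring any trailing characters once the format has been matched. The following is a minimal sketch of the same three steps on made-up values. The sample strings are an assumption about how BGN_DATE looks in the raw file (a date followed by a time), toy is an illustrative name, and base R is used in place of lubridate and dplyr so the sketch runs on its own.

```r
# Hypothetical BGN_DATE strings; the trailing time is ignored by as.Date
toy <- data.frame(
  BGN_DATE = c("4/18/1950 0:00:00", "1/1/1996 0:00:00", "11/28/2011 0:00:00"),
  stringsAsFactors = FALSE
)

# convert to Date; only the part matched by the format is used
toy$BGN_DATE <- as.Date(toy$BGN_DATE, format = "%m/%d/%Y")

# base R equivalent of lubridate::year()
toy$YEAR <- as.integer(format(toy$BGN_DATE, "%Y"))

# keep observations from 1996 onwards (the 1950 row is dropped)
toy <- toy[toy$YEAR >= 1996, ]
toy
```
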

The resultant data frame has 653530 observations and 9 variables. There are 516 distinct values of EVTYPE.

The variable EVTYPE takes many more values than the 48 weather events listed by the NOAA. This is due to:

  • some observations have an EVTYPE that is a phrase rather than a weather event (e.g. a summary of results), suggesting that they do not follow the standard for recording observations
  • EVTYPE values are recorded inconsistently, for example in the use of upper and lower case, singular and plural forms, extra spaces, or abbreviations
  • some EVTYPE values combine two weather event types
  • some EVTYPE values do not correspond to any of the 48 NOAA weather events.

Firstly, observations that have neither a health impact nor an economic impact are omitted. These observations are not required as they do not add to the total health or economic impacts. Removing them also removes the non-standard observations identified in the first point above.

Secondly, the EVTYPE variable is converted to all upper case, which removes inconsistencies with upper and lower cases.

## only include observations that have a non-zero health 
## or non-zero economic impact

stormdata <- filter(stormdata, PROPDMG > 0 | CROPDMG > 0 | FATALITIES > 0 | INJURIES > 0)

## convert EVTYPE to all upper case

stormdata$EVTYPE <- toupper(stormdata$EVTYPE)
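As a small illustration of why the case conversion reduces the number of distinct EVTYPE values, the sketch below counts distinct values before and after toupper. The values are invented for the example and are not taken from the data set.

```r
# Illustrative EVTYPE values that differ only in letter case
evtype <- c("Heavy Snow", "HEAVY SNOW", "heavy snow",
            "Coastal Flood", "COASTAL FLOOD")

n_before <- length(unique(evtype))        # 5 distinct spellings

evtype_upper <- toupper(evtype)
n_after <- length(unique(evtype_upper))   # 2 distinct event types

# frequency of each value after conversion, most common first
sort(table(evtype_upper), decreasing = TRUE)
```

Case conversion alone does not deal with extra spaces, abbreviations or spelling variations, which is why further remapping is still needed.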

There are still 186 distinct values of EVTYPE. Comparing these against the NOAA Storm Data Event table identifies mappings that can be used for some of these values, such as correcting spelling variations. As the analysis only requires events with the greatest impact, the focus of the remapping is on weather events which have the largest effects.

Further, the remapping aims to retain NOAA weather event types rather than aggregating to more generic types. As the objective is to inform preparation for severe weather events, it is important to establish the characteristics of such events to ensure resources are allocated appropriately. Future analysis may seek to examine greater aggregation.

Features of the remapping of EVTYPE:

  • spell out abbreviations in full
  • correct variations in spelling
  • remove additional spaces
  • map non-standard values to an appropriate NOAA event type (e.g. variations of COLD are mapped to COLD/WIND CHILL)
  • where a value combines event types, select one NOAA event type.

## remap selected EVTYPE values by mapping to a subset of values 

stormdata$EVTYPE <- gsub("^MARINE TSTM.*", "MARINE THUNDERSTORM WIND",  stormdata$EVTYPE)
stormdata$EVTYPE <- gsub("^STRONG WIND.*|^NON-TSTM.*|^NON TSTM.*|^NON-SEVERE WIND.*|^WIND.*", "STRONG WIND", stormdata$EVTYPE)
stormdata$EVTYPE <- gsub("^THUNDERSTORM WIND.*|.*TSTM.*", "THUNDERSTORM WIND", stormdata$EVTYPE)
stormdata$EVTYPE <- gsub("^HIGH WIND.*", "HIGH WIND", stormdata$EVTYPE)
stormdata$EVTYPE <- gsub("^HURRICANE.*|^TYPHOON.*", "HURRICANE", stormdata$EVTYPE)
stormdata$EVTYPE <- gsub("^COASTAL FLOOD.*|^COASTAL  FLOOD.*|^TIDAL.*", "COASTAL FLOOD", stormdata$EVTYPE)
stormdata$EVTYPE <- gsub("^WINTER WEATHER.*|^WINTRY.*", "WINTER WEATHER", stormdata$EVTYPE)
stormdata$EVTYPE <- gsub("^RIP CURRENT.*", "RIP CURRENT", stormdata$EVTYPE)
stormdata$EVTYPE <- gsub("^WILD.*", "WILDFIRE", stormdata$EVTYPE)
stormdata$EVTYPE <- gsub(".*FLASH.*", "FLASH FLOOD", stormdata$EVTYPE)
stormdata$EVTYPE <- gsub("^RIVER FLOOD.*|^URBAN.*|^ICE JAM.*|^LAKESHORE FLOOD.*", "FLOOD", stormdata$EVTYPE)
stormdata$EVTYPE <- gsub("^EXTREME.*", "EXTREME COLD/WIND CHILL", stormdata$EVTYPE)
stormdata$EVTYPE <- gsub("^COLD.*|^UNSEASONABLE COLD|^UNSEASONABLY COLD|^HYPO.*", "COLD/WIND CHILL", stormdata$EVTYPE)
stormdata$EVTYPE <- gsub("^HEAT.*|^UNSEASONABLY WARM.*|^WARM.*", "HEAT", stormdata$EVTYPE)
stormdata$EVTYPE <- gsub("^RECORD.*", "EXCESSIVE HEAT", stormdata$EVTYPE)
stormdata$EVTYPE <- gsub("^ASTRONOMICAL HIGH.*|STORM SURGE.*", "STORM SURGE/TIDE", stormdata$EVTYPE)
stormdata$EVTYPE <- gsub(".*FROST.*|.*FREEZ.*", "FROST/FREEZE", stormdata$EVTYPE)
stormdata$EVTYPE <- gsub(".*SURF.*", "HIGH SURF", stormdata$EVTYPE)
stormdata$EVTYPE <- gsub(".*SNOW.*", "SNOW", stormdata$EVTYPE)
stormdata$EVTYPE <- gsub("SMALL HAIL.*", "HAIL", stormdata$EVTYPE)
stormdata$EVTYPE <- gsub("^GUSTY.*|^GRADIENT WIND.*", "STRONG WIND", stormdata$EVTYPE)
stormdata$EVTYPE <- gsub("^WHIRLWIND", "TORNADO", stormdata$EVTYPE)
stormdata$EVTYPE <- gsub(".*SLIDE.*|.*SLUMP.*", "LANDSLIDE", stormdata$EVTYPE)
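The order of the substitutions above matters, because the more specific patterns must run before the general ones. The sketch below applies three of the patterns to a handful of illustrative values (chosen for the example, not a listing of the data) and then shows what would happen if the general TSTM pattern were applied first. The helper remap is a hypothetical name used only here.

```r
# Illustrative EVTYPE values containing the TSTM abbreviation
ev <- c("MARINE TSTM WIND", "TSTM WIND", "TSTM WIND/HAIL",
        "NON-TSTM WIND", "THUNDERSTORM WIND (G40)")

# hypothetical helper: specific patterns first, general pattern last
remap <- function(x) {
  x <- gsub("^MARINE TSTM.*", "MARINE THUNDERSTORM WIND", x)
  x <- gsub("^NON-TSTM.*", "STRONG WIND", x)
  x <- gsub("^THUNDERSTORM WIND.*|.*TSTM.*", "THUNDERSTORM WIND", x)
  x
}

remapped <- remap(ev)
remapped

# applying only the general pattern first loses the marine and
# non-thunderstorm distinctions
wrong <- gsub("^THUNDERSTORM WIND.*|.*TSTM.*", "THUNDERSTORM WIND", ev)
wrong
```

In the first result the marine and non-thunderstorm values keep their own event types, whereas in the second every value collapses to THUNDERSTORM WIND.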

Calculate impact

The total health impact is calculated to be the sum of fatalities and injuries.

stormdata$TOTALhealth <- stormdata$FATALITIES + stormdata$INJURIES

The economic impact needs to be calculated from the values for property and crop damages and the corresponding exponents.

## check exponent values for property and crop damages

table(stormdata$PROPDMGEXP)
## 
##             B      K      M 
##   8448     32 185474   7364
table(stormdata$CROPDMGEXP)
## 
##             B      K      M 
## 102767      2  96787   1762

The exponents for economic damage are assumed to correspond to:

  • (blank) - 1
  • B (billions) - 10^9
  • M (millions) - 10^6
  • K (thousands) - 10^3

Create two new variables which are multipliers for economic damage based on the value of the exponent, with the resultant economic damage to be expressed in millions.

## define multipliers, set default value as 1/10^6 
## (data to be in millions)

stormdata$PROPmult <- 0.000001
stormdata$PROPmult[grep("B", stormdata$PROPDMGEXP, ignore.case=TRUE)] <- 1000
stormdata$PROPmult[grep("M", stormdata$PROPDMGEXP, ignore.case=TRUE)] <- 1
stormdata$PROPmult[grep("K", stormdata$PROPDMGEXP, ignore.case=TRUE)] <- 0.001

stormdata$CROPmult <- 0.000001
stormdata$CROPmult[grep("B", stormdata$CROPDMGEXP, ignore.case=TRUE)] <- 1000
stormdata$CROPmult[grep("M", stormdata$CROPDMGEXP, ignore.case=TRUE)] <- 1
stormdata$CROPmult[grep("K", stormdata$CROPDMGEXP, ignore.case=TRUE)] <- 0.001

## calculate damages in millions, with TOTALm being the sum
## of PROPm and CROPm

stormdata$PROPm <- stormdata$PROPDMG * stormdata$PROPmult
stormdata$CROPm <- stormdata$CROPDMG * stormdata$CROPmult
stormdata$TOTALm <- stormdata$PROPm + stormdata$CROPm
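As a worked check of the multipliers, a damage figure of 25 with exponent K is 25 x 0.001 = 0.025 million dollars, while 1.2 with exponent B is 1.2 x 1000 = 1200 million dollars. The sketch below expresses the same conversion as a named lookup vector. It is an alternative formulation rather than the code used in this analysis, the names mult and to_millions are hypothetical, and it matches the exponent exactly instead of using grep.

```r
# multipliers that convert a damage figure to millions of dollars
mult <- c("B" = 1e3, "M" = 1, "K" = 1e-3)

# hypothetical helper: look up the multiplier for each exponent
to_millions <- function(dmg, exp) {
  m <- unname(mult[toupper(exp)])
  # blank (or unrecognised) exponent: the figure is treated as dollars
  m[is.na(m)] <- 1e-6
  dmg * m
}

# 25 K -> 0.025, 2.5 M -> 2.5, 1.2 B -> 1200, 500 (blank) -> 0.0005
damage_m <- to_millions(c(25, 2.5, 1.2, 500), c("K", "M", "B", ""))
damage_m
```
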

Determine the health and economic impacts of the weather events, in descending order.

## determine the health impact of the weather events

healthImpact <- aggregate(cbind(TOTALhealth, FATALITIES, INJURIES) ~ EVTYPE,
                          data=stormdata, sum)
healthImpact <- healthImpact[order(healthImpact$TOTALhealth,
                                   decreasing= TRUE),]

## determine the economic impact of the weather events

econImpact <- aggregate(cbind(TOTALm,PROPm,CROPm) ~ EVTYPE, 
                        data = stormdata, sum)
econImpact <- econImpact[order(econImpact$TOTALm, decreasing=TRUE),]  
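To make the aggregation step concrete, the sketch below runs the same aggregate and order calls on a five-row toy data frame with invented figures. The names toy and toyImpact are illustrative only.

```r
# Invented observations for three event types
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD"),
  FATALITIES = c(2, 0, 1, 3, 1),
  INJURIES   = c(10, 4, 5, 0, 0)
)
toy$TOTALhealth <- toy$FATALITIES + toy$INJURIES

# sum each column within each event type
toyImpact <- aggregate(cbind(TOTALhealth, FATALITIES, INJURIES) ~ EVTYPE,
                       data = toy, sum)

# sort so the event with the largest total comes first
toyImpact <- toyImpact[order(toyImpact$TOTALhealth, decreasing = TRUE), ]
toyImpact
```

Here TORNADO ranks first with a total of 18, followed by FLOOD with 5 and HEAT with 3. Note that the formula interface of aggregate omits rows with missing values by default, so any NA in these columns would silently drop the observation.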

Results

The analysis identifies the top 10 weather events, in terms of their impact on health and the economy.

## graph the health impact of the top 10 weather events

g1 <- ggplot(healthImpact[1:10, ], 
             aes(x=reorder(EVTYPE, -TOTALhealth), y=TOTALhealth)) +
  geom_col(fill = "seagreen3", col = "seagreen4") +
  ylab("Total health impact (persons)") +
  xlab("Weather event") +
  theme(axis.text.x = element_text(angle=45, hjust=1)) +
  ggtitle("Weather events: health impact in the United States",
          subtitle= "Top 10 events for period 1996 to 2011")
g1

## print a table of the health impact of the top 10 weather
## events, showing total, fatalities and injuries.

print.data.frame(healthImpact[1:10, ], row.names = FALSE, digits=1)
##             EVTYPE TOTALhealth FATALITIES INJURIES
##            TORNADO       22179       1512    20667
##     EXCESSIVE HEAT        8190       1799     6391
##              FLOOD        7282        444     6838
##  THUNDERSTORM WIND        5506        378     5128
##          LIGHTNING        4792        651     4141
##        FLASH FLOOD        2561        887     1674
##               HEAT        1548        237     1311
##           WILDFIRE        1543         87     1456
##       WINTER STORM        1483        191     1292
##          HURRICANE        1453        125     1328

Over the period 1996 to 2011, tornados had the greatest overall impact on health in terms of combined fatalities and injuries (22179 persons). Excessive heat, however, caused more fatalities (1799 compared with 1512 for tornados).

## graph the economic impact of the top 10 weather events

g2 <- ggplot(econImpact[1:10, ], aes(x=reorder(EVTYPE, -TOTALm), y=TOTALm)) +
  geom_col(fill = "seagreen3", col = "seagreen4") +
  ylab("Total economic impact ($ millions)") +
  xlab("Weather event") +
  theme(axis.text.x = element_text(angle=45, hjust=1)) +
  ggtitle("Weather events: economic impact in the United States",
          subtitle= "Top 10 events for period 1996 to 2011")
g2

## print a table of the economic impact of the top 10 weather
## events, showing total damages, property damages and crop damages.

print.data.frame(econImpact[1:10, ], row.names = FALSE, digits=1)
##             EVTYPE TOTALm  PROPm   CROPm
##              FLOOD 149150 144137  5013.2
##          HURRICANE  87069  81719  5350.1
##   STORM SURGE/TIDE  47845  47844     0.9
##            TORNADO  24900  24617   283.4
##               HAIL  17092  14595  2496.8
##        FLASH FLOOD  16557  15222  1334.9
##            DROUGHT  14414   1046 13367.6
##  THUNDERSTORM WIND   8930   7914  1016.9
##     TROPICAL STORM   8320   7642   677.7
##           WILDFIRE   8163   7760   402.3

Over the period 1996 to 2011, floods had the greatest economic impact of any weather event in the United States, with total damage of about $149 billion, most of it to property.