This analysis involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. The database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
Using data for the period 1996 to 2011, this analysis identifies the weather events with the greatest impact on health (fatalities and injuries) and those with the greatest economic impact in terms of damages to property and crops.
Over this period, tornados had the greatest impact on health, while floods resulted in the greatest economic impact.
options(scipen = 999)
knitr::opts_chunk$set(echo = TRUE)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(ggplot2)
The data is downloaded into the working directory if this has not been done previously. The data is then loaded into the data frame rawdata.
projectfile <- "repdata_data_StormData.csv.bz2"
if (!file.exists(projectfile)) {
  fileURL <-
    "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
  download.file(fileURL, projectfile, method = "curl")
}
## read.csv reads the bz2-compressed file directly
rawdata <- read.csv(projectfile, header = TRUE)
The original data file contains 902297 observations of 37 variables recorded over the period from 1950 to 2011. Not all of these variables are required for the analysis of the health and economic impact of storms.
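Variables required in the analysis are BGN_DATE (the date the event began), EVTYPE (the type of weather event), FATALITIES and INJURIES (the health impact), and PROPDMG, PROPDMGEXP, CROPDMG and CROPDMGEXP (the property and crop damage values and their exponents).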
According to the National Oceanic and Atmospheric Administration (https://www.ncdc.noaa.gov/stormevents/details.jsp) only a subset of the 48 weather events were recorded prior to January 1996. Thus to avoid potentially overstating the impact of this subset of events, the data is restricted to the period from January 1996 onwards.
# select only the required variables
stormdata <- select(rawdata, BGN_DATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)
# convert BGN_DATE to date format and determine the year
stormdata$BGN_DATE <- as.Date(stormdata$BGN_DATE, "%m/%d/%Y")
stormdata$YEAR <- year(stormdata$BGN_DATE)
# omit data prior to 1996 as only a subset of weather events were recorded
# in prior years
stormdata <- filter(stormdata, YEAR>=1996)
The resultant data frame has 653530 observations and 9 variables. There are 516 distinct values of EVTYPE.
The variable EVTYPE takes many more values than the 48 weather events listed by the NOAA. This is due to:
Firstly, observations with no health impact and no economic impact are omitted. They are not required for this analysis, as they add nothing to the total health or economic impacts. Removing them also removes the non-standard observations (as identified in the first point above).
Secondly, the EVTYPE variable is converted to all upper case, which removes inconsistencies with upper and lower cases.
## only include observations that have a non-zero health
## or non-zero economic impact
stormdata <- filter(stormdata, PROPDMG > 0 | CROPDMG > 0 | FATALITIES > 0 | INJURIES > 0)
## convert EVTYPE to all upper case
stormdata$EVTYPE <- toupper(stormdata$EVTYPE)
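The number of distinct EVTYPE values can be checked before and after the conversion to upper case. The following is a minimal sketch in base R. It uses a small made-up data frame (`toy`, hypothetical and not drawn from the NOAA data) so that it runs on its own; in the analysis the same calls would be applied to `stormdata$EVTYPE`.

```r
# Toy data standing in for stormdata (hypothetical values, not NOAA data)
toy <- data.frame(
  EVTYPE = c("TSTM WIND", "Tstm Wind", "FLOOD", "Flood", "TORNADO"),
  stringsAsFactors = FALSE
)

# Number of distinct event labels as recorded
n_raw <- length(unique(toy$EVTYPE))

# Number of distinct labels once differences in case are removed
n_upper <- length(unique(toupper(toy$EVTYPE)))

# Frequency of each upper-case label, most common first
evtype_counts <- sort(table(toupper(toy$EVTYPE)), decreasing = TRUE)

print(c(raw = n_raw, upper = n_upper))
print(evtype_counts)
```

Listing the labels by frequency is one way to see which remaining variants are worth remapping.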
There are still 186 distinct values of EVTYPE. Comparing these against the NOAA Storm Data Event table identifies mappings that can be used for some of these values, such as correcting spelling variations. As the analysis only requires events with the greatest impact, the focus of the remapping is on weather events which have the largest effects.
Further, the remapping aims to retain NOAA weather event types rather than aggregating to more generic types. As the objective is to inform preparation for severe weather events, it is important to establish the characteristics of such events to ensure resources are allocated appropriately. Future analysis may seek to examine greater aggregation.
Features of the remapping of EVTYPE:
## remap selected EVTYPE values by mapping to a subset of values
stormdata$EVTYPE <- gsub("^MARINE TSTM.*", "MARINE THUNDERSTORM WIND", stormdata$EVTYPE)
stormdata$EVTYPE <- gsub("^STRONG WIND.*|^NON-TSTM.*|^NON TSTM.*|^NON-SEVERE WIND.*|^WIND.*", "STRONG WIND", stormdata$EVTYPE)
stormdata$EVTYPE <- gsub("^THUNDERSTORM WIND.*|.*TSTM.*", "THUNDERSTORM WIND", stormdata$EVTYPE)
stormdata$EVTYPE <- gsub("^HIGH WIND.*", "HIGH WIND", stormdata$EVTYPE)
stormdata$EVTYPE <- gsub("^HURRICANE.*|^TYPHOON.*", "HURRICANE", stormdata$EVTYPE)
stormdata$EVTYPE <- gsub("^COASTAL FLOOD.*|^COASTAL FLOOD.*|^TIDAL.*", "COASTAL FLOOD", stormdata$EVTYPE)
stormdata$EVTYPE <- gsub("^WINTER WEATHER.*|^WINTRY.*", "WINTER WEATHER", stormdata$EVTYPE)
stormdata$EVTYPE <- gsub("^RIP CURRENT.*", "RIP CURRENT", stormdata$EVTYPE)
stormdata$EVTYPE <- gsub("^WILD.*", "WILDFIRE", stormdata$EVTYPE)
stormdata$EVTYPE <- gsub(".*FLASH.*", "FLASH FLOOD", stormdata$EVTYPE)
stormdata$EVTYPE <- gsub("^RIVER FLOOD.*|^URBAN.*|^ICE JAM.*|^LAKESHORE FLOOD.*", "FLOOD", stormdata$EVTYPE)
stormdata$EVTYPE <- gsub("^EXTREME.*", "EXTREME COLD/WIND CHILL", stormdata$EVTYPE)
stormdata$EVTYPE <- gsub("^COLD.*|^UNSEASONABLE COLD|^UNSEASONABLY COLD|^HYPO.*", "COLD/WIND CHILL", stormdata$EVTYPE)
stormdata$EVTYPE <- gsub("^HEAT.*|^UNSEASONABLY WARM.*|^WARM.*", "HEAT", stormdata$EVTYPE)
stormdata$EVTYPE <- gsub("^RECORD.*", "EXCESSIVE HEAT", stormdata$EVTYPE)
stormdata$EVTYPE <- gsub("^ASTRONOMICAL HIGH.*|STORM SURGE.*", "STORM SURGE/TIDE", stormdata$EVTYPE)
stormdata$EVTYPE <- gsub(".*FROST.*|.*FREEZ.*", "FROST/FREEZE", stormdata$EVTYPE)
stormdata$EVTYPE <- gsub(".*SURF.*", "HIGH SURF", stormdata$EVTYPE)
stormdata$EVTYPE <- gsub(".*SNOW.*", "SNOW", stormdata$EVTYPE)
stormdata$EVTYPE <- gsub("SMALL HAIL.*", "HAIL", stormdata$EVTYPE)
stormdata$EVTYPE <- gsub("^GUSTY.*|^GRADIENT WIND.*", "STRONG WIND", stormdata$EVTYPE)
stormdata$EVTYPE <- gsub("^WHIRLWIND", "TORNADO", stormdata$EVTYPE)
stormdata$EVTYPE <- gsub(".*SLIDE.*|.*SLUMP.*", "LANDSLIDE", stormdata$EVTYPE)
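The order of the substitutions above matters, because the broad pattern `.*TSTM.*` would otherwise absorb the marine and non-thunderstorm wind labels. The sketch below illustrates this with three of the rules applied to a few made-up labels. The `labels` vector and the `remap` helper are hypothetical and exist only for illustration.

```r
# Hypothetical upper-case labels (not taken from the NOAA data)
labels <- c("MARINE TSTM WIND", "TSTM WIND/HAIL",
            "NON-TSTM WIND", "THUNDERSTORM WIND (G40)")

# Apply three of the rules in the same order as the analysis:
# specific patterns first, the broad TSTM pattern last
remap <- function(x) {
  x <- gsub("^MARINE TSTM.*", "MARINE THUNDERSTORM WIND", x)
  x <- gsub("^NON-TSTM.*", "STRONG WIND", x)
  x <- gsub("^THUNDERSTORM WIND.*|.*TSTM.*", "THUNDERSTORM WIND", x)
  x
}

remapped <- remap(labels)
print(data.frame(before = labels, after = remapped))

# Applying only the broad rule first collapses every label
# into THUNDERSTORM WIND, losing the marine and strong wind events
generic_first <- gsub("^THUNDERSTORM WIND.*|.*TSTM.*",
                      "THUNDERSTORM WIND", labels)
print(generic_first)
```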
The total health impact is calculated to be the sum of fatalities and injuries.
stormdata$TOTALhealth <- stormdata$FATALITIES + stormdata$INJURIES
The economic impact needs to be calculated from the values for property and crop damages and the corresponding exponents.
## check exponent values for property and crop damages
table(stormdata$PROPDMGEXP)
##
##             B      K      M
##   8448     32 185474   7364
table(stormdata$CROPDMGEXP)
##
##             B      K      M
## 102767      2  96787   1762
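The exponents for economic damage are assumed to correspond to: B for billions, M for millions and K for thousands of dollars. A blank exponent is taken to mean that the value is in dollars.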
Create two new variables which are multipliers for economic damage based on the value of the exponent, with the resultant economic damage to be expressed in millions.
## define multipliers, set default value as 1/10^6
## (data to be in millions)
stormdata$PROPmult <- 0.000001
stormdata$PROPmult[grep("B", stormdata$PROPDMGEXP, ignore.case=TRUE)] <- 1000
stormdata$PROPmult[grep("M", stormdata$PROPDMGEXP, ignore.case=TRUE)] <- 1
stormdata$PROPmult[grep("K", stormdata$PROPDMGEXP, ignore.case=TRUE)] <- 0.001
stormdata$CROPmult <- 0.000001
stormdata$CROPmult[grep("B", stormdata$CROPDMGEXP, ignore.case=TRUE)] <- 1000
stormdata$CROPmult[grep("M", stormdata$CROPDMGEXP, ignore.case=TRUE)] <- 1
stormdata$CROPmult[grep("K", stormdata$CROPDMGEXP, ignore.case=TRUE)] <- 0.001
## calculate damages in millions, with TOTALm being the sum
## of PROPm and CROPm
stormdata$PROPm <- stormdata$PROPDMG * stormdata$PROPmult
stormdata$CROPm <- stormdata$CROPDMG * stormdata$CROPmult
stormdata$TOTALm <- stormdata$PROPm + stormdata$CROPm
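As a worked example of the conversion, a recorded value of 25 with exponent K becomes 25 x 0.001 = 0.025 ($ millions). The sketch below reproduces the same arithmetic with a named lookup vector in place of `grep`. The `toy` data frame and the `mult` vector are hypothetical and are not part of the original analysis.

```r
# Hypothetical records (not NOAA data): damage value plus exponent code
toy <- data.frame(
  PROPDMG    = c(25, 2.5, 1.2, 500),
  PROPDMGEXP = c("K", "M", "B", ""),
  stringsAsFactors = FALSE
)

# Multipliers that convert each exponent code to $ millions
mult <- c(K = 1e-3, M = 1, B = 1e3)

# Look up the multiplier for each record; a blank exponent has no
# match and returns NA, which is then set to 1e-6 (plain dollars),
# the same default used in the analysis
toy$PROPmult <- unname(mult[toupper(toy$PROPDMGEXP)])
toy$PROPmult[is.na(toy$PROPmult)] <- 1e-6

# Damage expressed in $ millions
toy$PROPm <- toy$PROPDMG * toy$PROPmult
print(toy)
```

A lookup vector keeps the mapping in one place, which may be easier to audit than a sequence of `grep` assignments.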
Determine the health and economic impacts of the weather events, in descending order.
## determine the health impact of the weather events
healthImpact <- aggregate(cbind(TOTALhealth, FATALITIES, INJURIES) ~ EVTYPE,
                          data = stormdata, sum)
healthImpact <- healthImpact[order(healthImpact$TOTALhealth,
                                   decreasing = TRUE), ]
## determine the economic impact of the weather events
econImpact <- aggregate(cbind(TOTALm, PROPm, CROPm) ~ EVTYPE,
                        data = stormdata, sum)
econImpact <- econImpact[order(econImpact$TOTALm, decreasing=TRUE),]
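Since dplyr is already loaded, the same totals could also be produced with `group_by()` and `summarise()`. The following is a sketch of that alternative on made-up data, not the code used for the results reported here. The `toy` and `toy_health` objects are hypothetical.

```r
library(dplyr)

# Hypothetical records (not NOAA data)
toy <- data.frame(
  EVTYPE     = c("FLOOD", "TORNADO", "FLOOD", "HEAT", "TORNADO"),
  FATALITIES = c(1, 3, 0, 2, 1),
  INJURIES   = c(4, 10, 2, 0, 6),
  stringsAsFactors = FALSE
)

# Sum fatalities and injuries per event type, add the combined
# total, then sort from the largest total to the smallest
toy_health <- toy %>%
  group_by(EVTYPE) %>%
  summarise(
    FATALITIES = sum(FATALITIES),
    INJURIES   = sum(INJURIES)
  ) %>%
  mutate(TOTALhealth = FATALITIES + INJURIES) %>%
  arrange(desc(TOTALhealth))

print(toy_health)
```

Both approaches give one row per event type; the choice between `aggregate()` and dplyr is largely a matter of style.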
The analysis identifies the top 10 weather events, in terms of their impact on health and the economy.
## graph the health impact of the top 10 weather events
g1 <- ggplot(healthImpact[1:10, ],
             aes(x = reorder(EVTYPE, -TOTALhealth), y = TOTALhealth)) +
  geom_col(fill = "seagreen3", col = "seagreen4") +
  ylab("Total health impact (persons)") +
  xlab("Weather event") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  ggtitle("Weather events: health impact in the United States",
          subtitle = "Top 10 events for period 1996 to 2011")
g1
## print a table of the health impact of the top 10 weather
## events, showing total, fatalities and injuries.
print.data.frame(healthImpact[1:10, ], row.names = FALSE, digits = 1)
##             EVTYPE TOTALhealth FATALITIES INJURIES
##            TORNADO       22179       1512    20667
##     EXCESSIVE HEAT        8190       1799     6391
##              FLOOD        7282        444     6838
##  THUNDERSTORM WIND        5506        378     5128
##          LIGHTNING        4792        651     4141
##        FLASH FLOOD        2561        887     1674
##               HEAT        1548        237     1311
##           WILDFIRE        1543         87     1456
##       WINTER STORM        1483        191     1292
##          HURRICANE        1453        125     1328
Over the period 1996 to 2011, tornados had the greatest overall impact on health in terms of combined fatalities and injuries (22,179 persons). Excessive heat, however, caused more fatalities (1,799 compared with 1,512 for tornados).
## graph the economic impact of the top 10 weather events
g2 <- ggplot(econImpact[1:10, ],
             aes(x = reorder(EVTYPE, -TOTALm), y = TOTALm)) +
  geom_col(fill = "seagreen3", col = "seagreen4") +
  ylab("Total economic impact ($ millions)") +
  xlab("Weather event") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  ggtitle("Weather events: economic impact in the United States",
          subtitle = "Top 10 events for period 1996 to 2011")
g2
## print a table of the economic impact of the top 10 weather
## events, showing total damages, property damages and crop damages.
print.data.frame(econImpact[1:10, ], row.names = FALSE, digits = 1)
##             EVTYPE TOTALm  PROPm   CROPm
##              FLOOD 149150 144137  5013.2
##          HURRICANE  87069  81719  5350.1
##   STORM SURGE/TIDE  47845  47844     0.9
##            TORNADO  24900  24617   283.4
##               HAIL  17092  14595  2496.8
##        FLASH FLOOD  16557  15222  1334.9
##            DROUGHT  14414   1046 13367.6
##  THUNDERSTORM WIND   8930   7914  1016.9
##     TROPICAL STORM   8320   7642   677.7
##           WILDFIRE   8163   7760   402.3
Over the period 1996 to 2011, floods had the greatest economic impact of any weather event in the United States, with total damages of about $149 billion, most of it property damage.