This report attempts to determine which weather events are the most costly (in terms of property, crops, and in human lives and injuries). The knowledge gained can be used to direct resources to where they will have the most effect. It turns out that only a few weather event types are responsible for most of the damage, to property, crops, and human health.
The project uses data from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, which tracks major storms and weather events in the US. The raw data used for this project is here. For more information on the data, consult the National Weather Service Storm Data Documentation and the National Climatic Data Center Storm Events FAQ.
The raw data is extremely untidy; the bulk of this report describes the methods used to process the data before analysis. Data processing is implemented in a manner which allows refining and reproducing the results in the future.
First things first - load required libraries.
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
This report assumes that the storm data file has been downloaded, and is in the current directory. If needed, the data can be downloaded here. The data is loaded into R.
raw_data <- read.csv("repdata_data_StormData.csv.bz2")
The raw data contains 37 variables, but we only care about a handful of them. I create a second, working dataset, and keep the original raw data around as a reference.
keepers <- c(
"BGN_DATE",
"EVTYPE",
"FATALITIES",
"INJURIES",
"PROPDMG",
"PROPDMGEXP",
"CROPDMG",
"CROPDMGEXP",
"REFNUM"
)
data <- raw_data[, keepers]
I convert the BGN_DATE variable to an R Date object, and also extract the year from that Date as a new variable.
data$BGN_DATE <- as.Date(data$BGN_DATE, "%m/%d/%Y")
data$year <- as.integer(format(data$BGN_DATE, "%Y"))
PROPDMGEXP, CROPDMGEXP are two variables which serve as multipliers on the PROPDMG and CROPDMG variables, respectively. Despite the “EXP” (presumably standing for “exponent”), the values are letters, with multiplier values according to the following table:
| EXP code | multiplier value |
|---|---|
| h, H | 100 |
| k, K | 1000 |
| m, M | 1e6 |
| b, B | 1e9 |
A very small minority of records have odd values in PROPDMGEXP, CROPDMGEXP, such as numerals 0-9, ‘?’, ‘+’, ‘-’. These records occur in only a few years, and even within that subset are a tiny minority of records. The table shows the ratio of “odd” values to normal values, over the years in which those values occur, for PROPDMGEXP:
odd_names <- c(0:9, "+", "-", "?")
odd_exp <- data[data$PROPDMGEXP %in% odd_names,]
odd_years <- unique(odd_exp$year)
table(odd_exp$year) / table(data[data$year %in% odd_years,]$year)
##
## 1993 1994 1995 2011
## 3.172840e-04 1.599535e-03 9.867715e-03 1.608389e-05
The next table shows the same ratio, for CROPDMGEXP:
odd_exp <- data[data$CROPDMGEXP %in% odd_names,]
odd_years <- unique(odd_exp$year)
table(odd_exp$year) / table(data[data$year %in% odd_years,]$year)
##
## 1993 1994 1995
## 0.0003172840 0.0004362367 0.0005005363
Because the oddball values for the EXP values are so rare, I’ll remove records from the dataset that contain them. For the purposes of this analysis, this should pose no problem.
data <- data[!data$PROPDMGEXP %in% odd_names,]
data <- data[!data$CROPDMGEXP %in% odd_names,]
Next I convert the CROPDMG, CROPDMGEXP, PROPDMG, PROPDMGEXP variables into new numerical variables.
data = within(data, {
PROP_MULT = NA
PROP_MULT[PROPDMGEXP == ""] = 1
PROP_MULT[PROPDMGEXP %in% c('h', 'H')] = 100
PROP_MULT[PROPDMGEXP %in% c('k', 'K')] = 1000
PROP_MULT[PROPDMGEXP %in% c('m', 'M')] = 1000000
PROP_MULT[PROPDMGEXP %in% 'B'] = 1000000000
})
data = within(data, {
CROP_MULT = NA
CROP_MULT[CROPDMGEXP == ""] = 1
CROP_MULT[CROPDMGEXP %in% c('h', 'H')] = 100
CROP_MULT[CROPDMGEXP %in% c('k', 'K')] = 1000
CROP_MULT[CROPDMGEXP %in% c('m', 'M')] = 1000000
CROP_MULT[CROPDMGEXP %in% 'B'] = 1000000000
})
data$PROP_COST <- data$PROPDMG * data$PROP_MULT
data$CROP_COST <- data$CROPDMG * data$CROP_MULT
For purposes of this analysis, only events that had economic impact, injuries or fatalities are of interest.
data <- data[
data$FATALITIES > 0 |
data$INJURIES > 0 |
data$PROP_COST > 0 |
data$CROP_COST > 0,
]
A summary of property damage reveals an outlier:
summary(data$PROP_COST)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00e+00 2.00e+03 1.00e+04 1.68e+06 3.50e+04 1.15e+11
m <- data[data$PROP_COST == max(data$PROP_COST),]
m
## BGN_DATE EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG
## 605953 2006-01-01 FLOOD 0 0 115 B 32.5
## CROPDMGEXP REFNUM year PROP_MULT CROP_MULT PROP_COST CROP_COST
## 605953 M 605943 2006 1e+09 1e+06 1.15e+11 32500000
raw_data[raw_data$REFNUM==m$REFNUM, "REMARKS"]
## [1] Major flooding continued into the early hours of January 1st, before the Napa River finally fell below flood stage and the water receeded. Flooding was severe in Downtown Napa from the Napa Creek and the City and Parks Department was hit with $6 million in damage alone. The City of Napa had 600 homes with moderate damage, 150 damaged businesses with costs of at least $70 million.
## 436781 Levels: -2 at Deer Park\n ... Zones 22 and 23 were added to the high wind warning of January 26. Peak winds Sitka 55MPH, Cape Decision 58MPH, and Cape Spencer 64MPH.\n
The remarks section mentions a few costs; $6 million, $70 million. It seems obvious that the total damage wasn’t $115 billion! I’m assuming that $115 million was intended, and will make that correction for this analysis.
data[data$REFNUM == 605943, "PROP_COST"] <- 1.15e8
What’s the next largest property damage?
m <- data[data$PROP_COST == max(data$PROP_COST),]
m
## BGN_DATE EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP
## 577676 2005-08-29 STORM SURGE 0 0 31.3 B
## CROPDMG CROPDMGEXP REFNUM year PROP_MULT CROP_MULT PROP_COST
## 577676 0 577616 2005 1e+09 1 3.13e+10
## CROP_COST
## 577676 0
raw_data[raw_data$REFNUM==m$REFNUM, "REMARKS"]
## [1] Storm surge damage in southeast Louisiana, especially in the New Orleans area and the coastal parishes, was catastrophic. Hurricane protection levees and floodwalls were overtopped and/or breached resulting in widespread and deep flooding of homes and businesses. Much of Orleans and Plaquemines Parishes and nearly all of St. Bernard Parish were flooded by storm surge. Approximately 80 percent of the city of New Orleans was flooded. Thousands of people were stranded by the flood waters in homes and buildings and on rooftops for several days and had to be rescued by boat and helicopter. In Jefferson Parish, levees were not compromised, however many homes were flooded by either heavy rain overwhelming limited pumping capacity or storm surge water moving through in-operable pumps into the parish. Severe storm surge damage also occurred along the north shore of Lake Pontchartrain from Mandeville to Slidell with storm surge water moving inland as far as Old Towne Slidell with water up to 6 feet deep in some locations\n\nPost storm high water surveys of the area conducted by FEMA indicated the following storm surge estimates: Orleans Parish - 12-15 feet in east New Orleans to 9 to 12 feet along the Lakefront; St. Bernard Parish - 14 to 17 feet; Jefferson Parish - 6 to 9 feet along the lakefront to 5 to 8 feet from Lafitte to Grand Isle; Plaquemines Parish - 15 to 17 feet; St. Tammany Parish - 11 to 16 feet in southeast portion to 7 to 10 feet in western portion. All storm surge heights are still water elevations referenced to NAVD88 datum.
## 436781 Levels: -2 at Deer Park\n ... Zones 22 and 23 were added to the high wind warning of January 26. Peak winds Sitka 55MPH, Cape Decision 58MPH, and Cape Spencer 64MPH.\n
And that’s Hurricane Katrina - $31.3 billion seems reasonable.
Data per year was collected according to the following:
Because not all event types were recorded until 1996, I will exclude earlier years from this analysis.
data <- data[data$year >= 1996,]
keepers2 <- c(
"BGN_DATE",
"EVTYPE",
"FATALITIES",
"INJURIES",
"year",
"PROP_COST",
"CROP_COST",
"REFNUM"
)
data <- data[, keepers2]
The input data EVTYPE variable (Event Type) contains over 900 unique values, many of which are clearly misspelled, or are worded slightly differently. 48 official event types are defined for data from 1996 to the present, in NWS Directive 10-1605.
Event types:
EVTYPE values are transformed into values from the official list of 48 by text transformation and string matching, using a couple of utility functions.
data$EVTYPE <- toupper(data$EVTYPE)
# Replace events in *data* with EVTYPE matching *pattern* with *replacement*.
merge_events <- function(pattern, replacement, data) {
ast <- grepl(pattern, data$EVTYPE, ignore.case=T)
data$EVTYPE[ast] <- replacement
return(data)
}
# Replace events in *data* matching various patterns with replacements
# from the official list of 48 types.
quash_evtype <- function(data) {
data <- merge_events("ASTRONOMICAL", "ASTRONOMICAL LOW/HIGH TIDE", data)
data <- merge_events("AVALANC", "AVALANCHE", data)
data <- merge_events("mud slide", "AVALANCHE", data)
data <- merge_events("BLIZZARD", "BLIZZARD", data)
data <- merge_events("coastal.*flood", "COASTAL FLXXD", data)
data <- merge_events("cstl flood", "COASTAL FLXXD", data)
data <- merge_events("flash.*flood", "FLASH FLXXD", data)
data <- merge_events("flood.*flash", "FLASH FLXXD", data)
data <- merge_events("lake.*flood", "LAKESHORE FLXXD", data)
data <- merge_events("flood", "FLXXD", data)
data <- merge_events("dam break", "FLXXD", data)
data <- merge_events("urban.*stream", "FLXXD", data)
data$EVTYPE <- gsub("FLXXD", "FLOOD", data$EVTYPE, fixed=T)
data <- merge_events("^COLD/WIND CHILL", "COLD/WIND CHILL", data)
data <- merge_events("EXTREME.*CHILL", "EXTREME COLD/WIND CHILL", data)
data <- merge_events("EXTREME COLD", "EXTREME COLD/WIND CHILL", data)
data <- merge_events("^FOG$", "DENSE FOG", data)
data <- merge_events("^FOG AND COLD TEMPERATURES$", "FREEZING FOG", data)
data <- merge_events("DROUGHT", "DROUGHT", data)
data <- merge_events("DUST DEVIL", "DUST DEVIL", data)
data <- merge_events("DUST STORM", "DUST STORM", data)
data <- merge_events("blowing dust", "DUST STORM", data)
data <- merge_events("EXCESSIVE HEAT", "EXCESSIVE HEAT", data)
data <- merge_events("frost", "FROST/FREEZE", data)
data <- merge_events("freeze", "FROST/FREEZE", data)
data <- merge_events("waterspout", "WATERSPOUT", data)
data <- merge_events("tornad", "TORNADO", data)
data <- merge_events("gustnado", "TORNADO", data)
data <- merge_events("lightning", "LIGHTNING", data)
data <- merge_events("marine tstm", "MARINE GUHAQREFGBEZ WIND", data)
data <- merge_events("marine thunderstorm", "MARINE GUHAQREFGBEZ WIND", data)
data <- merge_events("thunderstorm", "THUNDERSTORM", data)
data <- merge_events("MARINE GUHAQREFGBEZ WIND", "MARINE THUNDERSTORM WIND", data)
data <- merge_events("MARINE GUHAQREFGBEZ WIND", "MARINE THUNDERSTORM WIND", data)
data <- merge_events("thundersnow", "THUNDERSTORM", data)
data <- merge_events("thundertorm", "THUNDERSTORM", data)
data <- merge_events("thunderstrom", "THUNDERSTORM", data)
data <- merge_events("thunerstrom", "THUNDERSTORM", data)
data <- merge_events("thunerstorm", "THUNDERSTORM", data)
data <- merge_events("thuderstorm", "THUNDERSTORM", data)
data <- merge_events("thundeerstorm", "THUNDERSTORM", data)
data <- merge_events("tunderstorm", "THUNDERSTORM", data)
data <- merge_events("tstm", "THUNDERSTORM", data)
data <- merge_events("marine hail", "MARINE H8IL", data)
data <- merge_events("hail", "HAIL", data)
data <- merge_events("MARINE H8IL", "MARINE HAIL", data)
data <- merge_events("excessive heat", "monkey attack", data)
data <- merge_events("heat", "HEAT", data)
data <- merge_events("unseasonably warm", "HEAT", data)
data <- merge_events("monkey attack", "EXCESSIVE HEAT", data)
data <- merge_events("hurricane", "HURRICANE", data)
# Wind.
data <- merge_events("marine high wind", "znevar uvtu jvaq", data)
data <- merge_events("high.*wind", "HIGH WIND", data)
data <- merge_events("znevar uvtu jvaq", "MARINE HIGH WIND", data)
data <- merge_events("gusty", "HIGH WIND", data)
data <- merge_events("winds", "HIGH WIND", data)
data <- merge_events("wind and wave", "MARINE HIGH WIND", data)
data <- merge_events("strong wave", "HIGH WIND", data)
data <- merge_events("rain/wind", "HIGH WIND", data)
data <- merge_events("wind damage", "HIGH WIND", data)
data <- merge_events("whirlwind", "HIGH WIND", data)
data <- merge_events("wind storm", "HIGH WIND", data)
data <- merge_events("gradient wind", "HIGH WIND", data)
data <- merge_events("^wind$", "HIGH WIND", data)
data <- merge_events("strong wind", "HIGH WIND", data)
data <- merge_events("surf", "HIGH SURF", data)
data <- merge_events("heavy swells", "HIGH SURF", data)
data <- merge_events("heavy seas", "HIGH SURF", data)
data <- merge_events("heavy rain", "HEAVY RAIN", data)
data <- merge_events("heavy.*snow", "HEAVY SNOW", data)
data <- merge_events("heavy shower", "HEAVY RAIN", data)
data <- merge_events("heavy precipitation", "HEAVY RAIN", data)
data <- merge_events("mixed precip", "HEAVY RAIN", data)
data <- merge_events("heavy mix", "HEAVY RAIN", data)
data <- merge_events("wet", "HEAVY RAIN", data)
data <- merge_events("downburst", "HEAVY RAIN", data)
# Fire
data <- merge_events("fire", "WILDFIRE", data)
# Snow
data <- merge_events("la.*effect snow", "YNXR-RSSRPG FABJ", data)
data <- merge_events("snow", "HEAVY SNOW", data)
data <- merge_events("YNXR-RSSRPG FABJ", "LAKE-EFFECT SNOW", data)
# Rip Current
data <- merge_events("rip current", "RIP CURRENT", data)
# Rain
data <- merge_events("rain", "HEAVY RAIN", data)
# Tropical Storm
data <- merge_events("tropical storm", "TROPICAL STORM", data)
data <- merge_events("ice", "ICE STORM", data)
data <- merge_events("icy", "ICE STORM", data)
data <- merge_events("glaze", "ICE STORM", data)
data <- merge_events("coastalstorm", "COASTAL STORM", data)
data <- merge_events("surge", "STORM SURGE/TIDE", data)
data <- merge_events("rapidly rising water", "STORM SURGE/TIDE", data)
data <- merge_events("high water", "STORM SURGE/TIDE", data)
return(data)
}
data <- quash_evtype(data)
The number of unique values for EVTYPE is much reduced:
length(unique(data$EVTYPE))
## [1] 78
There are still EVTYPE values outside of the official list, but for purposes of this analysis I’ll ignore them as outliers. This report is implemented in a reproducible manner; if necessary, further work can be done on refining the list of EVTYPE values by modifying the quash_evtype function.
The importance of various weather event types can be measured by harm to human beings, or by economic damage. Plotting the top event types by fatalities, injuries, property damange and crop damage should provide some quick insight into which types of weather events are of most concern.
par(bg = "light gray")
par(mfrow = c(2, 1))
par(mar=c(5.1, 7.1, 4.1, 2.1))
num_event_types <- 5
fatalities <- aggregate(FATALITIES ~ EVTYPE, FUN=sum, data=data)
fatalities <- arrange(fatalities, desc(FATALITIES))
injuries <- aggregate(INJURIES ~ EVTYPE, FUN=sum, data=data)
injuries <- arrange(injuries, desc(INJURIES))
barplot(
col = "red",
xlab = "Fatalities per event type",
fatalities$FATALITIES[1:num_event_types],
horiz=T,
names.arg=fatalities$EVTYPE[1:num_event_types],
las=1,
xlim = c(0, 1.2 * max(fatalities$FATALITIES)),
cex.names=0.5
)
barplot(
col = "red",
xlab = "Injuries per event type",
injuries$INJURIES[1:num_event_types],
horiz=T,
names.arg=injuries$EVTYPE[1:num_event_types],
las=1,
xlim = c(0, 1.3 * max(injuries$INJURIES)),
cex.names=0.5
)
The plots show that excessive heat, tornadoes, and floods/flash floods are of the most concern, with tornadoes in the top three list for both fatalities and injuries. These facts may be used to motivate increased investment in early warning systems for tornado and flood events.
num_event_types <- 10
prop <- aggregate(PROP_COST ~ EVTYPE, FUN=sum, data=data)
prop$PROP_COST <- prop$PROP_COST / 1e9
prop <- arrange(prop, desc(PROP_COST))
par(bg = "light gray")
par(mfrow = c(1, 1))
par(mar=c(5.1, 6.1, 4.1, 2.1))
barplot(
col = "red",
xlab = "Property damage by event type ($US billion)",
prop$PROP_COST[1:num_event_types],
horiz=T,
names.arg=prop$EVTYPE[1:num_event_types],
las=1,
cex.names=0.5
)
The plot shows that hurricanes are by a strong margin the major cause of property damage, followed by the water-related events storm surge/tide, flood and flash flood. Tornadoes rate a bit less, perhaps due to their localized effects. It should be noted that the range of dates in which data were collected included Hurricane Katrina, which by itself cost more than $30 billion of the approximately $80 billion dollars in property damage; if not for Katrina, hurricane property damage would be about on par with storm surge/tide.
The data could suggest that infrastructural investment to reduce property damage should focus on rather costly improvements to coastal barriers. Key point: as a society, would we rather invest, say, $100 billion in coastal improvements, or $100 billion in cleanup costs?
par(bg = "light gray")
par(mar=c(5.1, 7.1, 4.1, 2.1))
crop <- aggregate(CROP_COST ~ EVTYPE, FUN=sum, data=data)
crop$CROP_COST = crop$CROP_COST / 1e9
crop <- arrange(crop, desc(CROP_COST))
barplot(
col = "red",
xlab = "Crop damage by event type ($US billion)",
crop$CROP_COST[1:num_event_types],
horiz=T,
names.arg=crop$EVTYPE[1:num_event_types],
las=1,
xlim = c(0, 1.1 * max(crop$CROP_COST)),
cex.names=0.5
)
For crop damage, drought is the clear winner, followed by hurricane and flood. If only we could introduce some of the hurricane rain and flood waters to the drought-stricken regions, we’d have a nice savings. The results could be used in a few ways: for example, the cost of drought might motivate dam construction. A different strategy would be to promote the use of more drought-tolerant crops in regions that have historically suffered from drought.
The work can be extended in several obvious ways.