Storms and other weather events can have large economic and public health impacts. An understanding of the frequency, location, and impact of past weather events can help municipal, state, and other levels of government prepare for future events. This analysis provides an overview of storm and weather impacts across the 50 US States between January 1996 and November 2011, highlighting differences by Census Bureau region.
The data for the analysis is from the US National Oceanic & Atmospheric Administration (NOAA) Storm Data database. Most of the data is compiled by the National Weather Service (NWS). The database contains information about weather events including their type, location, and four measures of their impact:
The analysis focuses attention on the type of weather events that administrators should be most prepared for in their respective regions. It does this by showing which types of events are most harmful with respect to population health, and which types have the greatest economic consequences.
Download data file to current working directory and load into R.
#download.file('https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2', 'storm_data.csv.bz2')
storm_data <- read.csv('storm_data.csv.bz2')
Load required packages. (If they are not already installed, use
install.packages() to install them first.)
library(dplyr)
library(lubridate)
library(stringr)
library(ggplot2)
library(forcats)
library(tidyr)
library(viridis)
library(ggpubr)
library(scales)
Large datasets with multiple contributors are likely to have inconsistencies and errors. The National Weather Service Instruction 10-1605, Aug 17, 2007 begins with a disclaimer noting that some Storm Data information is provided by outside sources and the National Weather Service “does not guarantee the accuracy or validity of the information.” Inconsistencies can also arise when the database does not enforce a controlled vocabulary, or when classification and data entry rules change over time. It is therefore important to explore the data first to determine whether cleanup, filtering, or other transformation is required to provide the best answer to the questions.
Four data quality issues were identified and addressed.
1. Outlier flood data. There are occasional spikes in storm-related expenses. These usually correspond to major weather events (e.g. Hurricane Katrina) but one flood record for Napa County, California is an extreme outlier that appears to be a data error. It records property damage amounting to USD 115 billion on January 1, 2006, an amount more than three times larger than any other single record. In contrast, a 2006 US Geological Survey report refers to floods that affected Napa County, but it estimates a total of USD 300 million across multiple counties. Given this evidence it seems more likely that the Napa County damage amounted to USD 115 million, not billion; the code below changes the units from “B” to “M” accordingly.
storm_data$PROPDMGEXP[which(storm_data$PROPDMGEXP == "B" & storm_data$COUNTYNAME == "NAPA")] <- "M"
2. Unexpected unit values. The vast majority of records use one of three letters to indicate the unit for dollar amounts: ‘K’ for thousands, ‘M’ for millions, and ‘B’ for billions. However, some records from 1993-1995 have unexpected values that are not described in the storm data documentation or FAQ. Records after 1995 do not have this problem.
3. Data completeness. The instructions for this project note that “in the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.” There is an appreciable jump in the number of records per year around 1995, suggesting the database is more complete after this period.
To address issues 2 and 3, keep only records of weather events after 1995.
storm_subset <- storm_data %>%
mutate(BGN_DATE = mdy(str_extract(BGN_DATE, '^[^ ]*'))) %>%
filter(year(BGN_DATE) > 1995)
4. Inconsistent event types. The National Weather Service Instruction 10-1605, Aug 17, 2007 provides a table of event type names (p6), but the EVTYPE variable contains many other non-standard values. Differences in capitalization, extra white spaces, abbreviations, and combined entries account for some of the differences. Other differences may stem from undocumented variation in reporting practices. To calculate storm-related damage accurately it’s important to map non-standard values to the appropriate standard value wherever possible. The code below creates a new EVTYPE_NEW variable that standardizes event types according to these guidelines:
# Create list of standard event types from documentation
standard_events <- c('astronomical low tide','avalanche','blizzard','coastal flood','cold/wind chill','debris flow','dense fog','dense smoke','drought','dust devil','dust storm','excessive heat','extreme cold/wind chill','flash flood','flood','frost/freeze','funnel cloud','freezing fog','hail','heat','heavy rain','heavy snow','high surf','high wind','hurricane/typhoon','ice storm','lake-effect snow','lakeshore flood','lightning','marine hail','marine high wind','marine strong wind','marine thunderstorm wind','rip current','seiche','sleet','storm surge/tide','strong wind','thunderstorm wind','tornado','tropical depression','tropical storm','tsunami','volcanic ash','waterspout','wildfire','winter storm','winter weather')
# Create EVTYPE_STANDARD variable. Where EVTYPE is standard, populate with EVTYPE. Where EVTYPE is non-standard, populate with the best match using regular expressions (not all non-standard types have a match).
storm_subset <- storm_subset %>%
mutate(EVTYPE = str_replace(str_trim(tolower(EVTYPE)),"[ ]{2,}", " ")) %>%
mutate(STANDARD = ifelse(EVTYPE %in% standard_events,1,0)) %>%
mutate(EVTYPE_NEW = ifelse(STANDARD == 1,EVTYPE,case_when(
str_detect(EVTYPE,'(hurricane|typhoon)') ~ 'hurricane/typhoon',
str_detect(EVTYPE,'(frost|freeze)') ~ 'frost/freeze',
str_detect(EVTYPE,'(coastal|cstl|tidal) flood') ~ 'coastal flood',
str_detect(EVTYPE,'(burst|^tstm| tstm)') ~ 'thunderstorm wind',
str_detect(EVTYPE,'(slide|slump)') ~ 'debris flow',
str_detect(EVTYPE,'high wind') ~ 'high wind',
str_detect(EVTYPE,'landspout') ~ 'tornado',
str_detect(EVTYPE,'storm surge') ~ 'storm surge/tide',
str_detect(EVTYPE,'fire') ~ 'wildfire',
str_detect(EVTYPE,'flash') ~ 'flash flood',
str_detect(EVTYPE,'(flood|fld)') ~ 'flood',
str_detect(EVTYPE,'(extreme (cold|wind)|hypothermia)') ~ 'extreme cold/wind chill',
str_detect(EVTYPE,'cold') ~ 'cold/wind chill',
str_detect(EVTYPE,'lake effect snow') ~ 'lake-effect snow',
str_detect(EVTYPE,'(snow|freezing|winter|wintry|glaze)') ~ 'winter weather',
str_detect(EVTYPE,'fog') ~ 'dense fog',
str_detect(EVTYPE,'[^l]?wind') ~ 'strong wind',
str_detect(EVTYPE,'hail') ~ 'hail',
str_detect(EVTYPE,'heavy rain') ~ 'heavy rain',
str_detect(EVTYPE,'(surf|swell|seas$)') ~ 'high surf',
str_detect(EVTYPE,'rain') ~ 'heavy rain',
str_detect(EVTYPE,'(ice|icy)') ~ 'frost/freeze',
str_detect(EVTYPE,'(hyperthermia|warm|heat)') ~ 'excessive heat',
.default = EVTYPE
))) %>%
select(-STANDARD)
The previous section addressed data quality issues. The next step is to transform, reshape, and enhance the data to make it easier to work with. The code below does the following:
storm_subset <- storm_subset %>%
select(BGN_DATE,STATE,EVTYPE_NEW,FATALITIES,INJURIES,PROPDMG,PROPDMGEXP,CROPDMG,CROPDMGEXP) %>%
mutate(CROPDMGEXP = toupper(CROPDMGEXP)) %>%
mutate(PROPDMGEXP = toupper(PROPDMGEXP)) %>%
filter(PROPDMG+CROPDMG+FATALITIES+INJURIES > 0)
Group by region. The database includes storm events for the entire United States but patterns vary by region. A state-by-state analysis is not possible within the constraints of this project but it’s feasible to group states into the four Census Bureau regions (Northeast, Midwest, South, West). This more specific analysis will be useful for decisions at regional, state, and municipal levels.
Records are assigned to regions based on the STATE variable. The analysis is limited to Census Bureau regions and therefore excludes American Samoa, Guam, Puerto Rico, and the US Virgin Islands. Some records have non-standard state abbreviations. These are assigned to a region based on the corresponding COUNTY values. For example, several records have a STATE value of ‘LO’. The counties for these record are on the shores of Lake Ontario (LO), so they are assigned to the Northeast region.
# Create dataframe with state abbreviations and their corresponding regions (includes non-standard state abbreviations AM:SL)
state <- c('AL','AK','AZ','AR','CA','CO','CT','DE','FL','GA','HI','ID','IL','IN','IA','KS','KY','LA','ME','MD','MA','MI','MN','MS','MO','MT','NE','NV','NH','NJ','NM','NY','NC','ND','OH','OK','OR','PA','RI','SC','SD','TN','TX','UT','VT','VA','WA','WV','WI','WY','DC','AM','AN','GM','LE','LM','LO','LS','PH','PK','PZ','SL')
region <- c('South','West','West','South','West','West','Northeast','South','South','South','West','West','Midwest','Midwest','Midwest','Midwest','South','South','Northeast','South','Northeast','Midwest','Midwest','South','Midwest','West','Midwest','West','Northeast','Northeast','West','Northeast','South','Midwest','Midwest','South','West','Northeast','Northeast','South','Midwest','South','South','West','Northeast','South','West','South','Midwest','West','South','South','South','South','Northeast','Midwest','Northeast','Midwest','West','West','West','Northeast')
state_region <- data.frame(STATE = state, REGION = region)
# Remove American Samoa, Guam, Puerto Rico, US Virgin Islands; add region to `storm_subset`
storm_subset <- storm_subset %>%
filter(!STATE %in% c("AS","GU","PR","VI")) %>%
left_join(state_region, join_by(STATE))
Calculate damage values in dollars. Property damage values are obtained by multiplying the PROPDMG variable by the unit indicated in PROPDMGEXP (e.g. ‘K’ for thousands, ‘M’ for millions). A similar calculation is required for crop damage values. To simplify the analysis later, full dollar amounts are encoded in CROP_COST and PROP_COST variables.
storm_subset <- storm_subset %>%
mutate(CROP_COST = CROPDMG * case_when(
CROPDMGEXP == 'K' ~ 10^3,
CROPDMGEXP == 'M' ~ 10^6,
CROPDMGEXP == 'B' ~ 10^9,
.default = 1
)) %>%
mutate(PROP_COST = PROPDMG * case_when(
PROPDMGEXP == 'K' ~ 10^3,
PROPDMGEXP == 'M' ~ 10^6,
PROPDMGEXP == 'B' ~ 10^9,
.default = 1
)) %>%
select(-c(PROPDMGEXP,CROPDMGEXP,CROPDMG,PROPDMG))
Make data tidy. ‘Tidy data’ is the standard for efficient data management and analysis. In a tidy dataset each row is an observation, each column is a variable, and “each type of observational unit forms a table” (Tidy data, p4). The final data processing step is to create two tables: one for economic impact (containing property and crop damage) and one for personal harm (containing injuries and fatalities).
economic_impact <- storm_subset %>%
select(-c(INJURIES,FATALITIES)) %>%
pivot_longer(c(PROP_COST,CROP_COST), names_to = "IMPACT_TYPE", values_to = "DOLLARS") %>%
filter(DOLLARS > 0)
personal_harm <- storm_subset %>%
select(-c(PROP_COST,PROP_COST)) %>%
pivot_longer(c(INJURIES,FATALITIES), names_to = "IMPACT_TYPE", values_to = "PEOPLE") %>%
filter(PEOPLE > 0)
In the Storm Data database the economic consequences of weather events are represented with two dollar amounts: the amount of property damage and the amount of crop damage.
top_economic <- economic_impact %>%
group_by(EVTYPE_NEW) %>%
summarize(total = sum(DOLLARS)) %>%
slice_max(total, n=1)
Based on the adjusted data, the event type that had the largest economic impact between January 1996 and November 2011 was hurricane/typhoon, representing a total cost of 83.7 billion dollars. This amounts to 30% of the total economic costs reported.
top_economic_both <- economic_impact %>%
group_by(IMPACT_TYPE, EVTYPE_NEW) %>%
summarize(total = sum(DOLLARS), .groups="drop_last") %>%
slice_max(total, n=1)
top_crop_perc <- label_percent()(top_economic_both$total[1]/sum(economic_impact$DOLLARS[economic_impact$IMPACT_TYPE == "CROP_COST"]))
top_prop_perc <- label_percent()(top_economic_both$total[2]/sum(economic_impact$DOLLARS[economic_impact$IMPACT_TYPE == "PROP_COST"]))
Weather events do not affect property and crops evenly. The most costly event type for property damage across the United States was hurricane/typhoon (USD 78.9 billion, 32% of property damage), while the event type that caused the most crop damage is drought (USD 13.4 billion, 39% of crop damage).
The national totals above tell part of the story but they hide variations among regions. Figure 1 illustrates this point: even though hurricane/typhoon is the event type that caused the largest economic impact between 1996 and 2011, it was only the top event type in one of the four regions (the South). Figure 1 also makes it clear how much the total economic impact of weather events varies across regions (the x axis scale is adjusted by region).
# For each region in the economic impact data...
for (r in unique(economic_impact$REGION)) {
# Get top 10 event types by economic cost
list_name <- paste0(r,"_list")
assign(list_name, economic_impact %>%
filter(REGION == r) %>%
group_by(EVTYPE_NEW) %>%
summarize(TOTAL = sum(DOLLARS)) %>%
slice_max(TOTAL, n=10))
# Plot and save bar graph
assign(paste0(r,"_economic_bar"), economic_impact %>%
filter(REGION == r) %>%
filter(EVTYPE_NEW %in% eval(parse(text=list_name))$EVTYPE_NEW) %>%
ggplot(aes(DOLLARS/10^9, fct_reorder(EVTYPE_NEW, DOLLARS, .fun="sum"), fill=IMPACT_TYPE)) +
geom_col() +
ggtitle(r) +
xlab("USD (billions)") +
ylab(element_blank()) +
scale_fill_viridis(discrete = TRUE, begin=0.6, end=0.9,
option = "G", labels = c("crop damage","property damage")) +
theme_minimal() +
theme(legend.position = c(0.7,0.2),
legend.title = element_blank(),
axis.title=element_text(size=9),
plot.title = element_text(size = "12", color="gray30"))
)
}
# Combine individual bar graps into one plot
bar_grid <- ggarrange(Midwest_economic_bar,
Northeast_economic_bar,
West_economic_bar,
South_economic_bar,
nrow = 2, ncol = 2,
common.legend = TRUE, legend="bottom")
annotate_figure(
annotate_figure(bar_grid, top = text_grob("Costliest ten event types in each Census Bureau region", size = 10, face = "italic")),
top = text_grob("Figure 1. Economic impact of US weather events, 1996-2011 (total) "))
Figure 1 shows totals for the 15 year period under review but it doesn’t show differences from year to year. Is the high economic cost in the South evenly distributed across the period, or can it be attributed to a smaller number of extraordinary events? Figure 2 helps answer this question by showing annual patterns in each region. (To keep the graphs legible only the top four events are shown for each region.)
# assign colors ahead of time so they're consistent across quadrants
eventColors <- viridis(9, option = "H")
names(eventColors) <- unique(c(South_list$EVTYPE_NEW[1:4],
Northeast_list$EVTYPE_NEW[1:4],
Midwest_list$EVTYPE_NEW[1:4],
West_list$EVTYPE_NEW[1:4]))
custom_colors <- scale_color_manual(name = "", values=eventColors)
# plot and save line graphs
for (r in unique(economic_impact$REGION)) {
assign(paste0(r,"_economic_line"), economic_impact %>%
filter(REGION == r) %>%
filter(EVTYPE_NEW %in% eval(parse(text=paste0(r,"_list")))$EVTYPE_NEW[1:4]) %>%
group_by(YEAR = year(BGN_DATE),EVTYPE_NEW) %>%
summarize(TOTAL = sum(DOLLARS)/10^9, .groups="keep") %>%
ggplot(aes(YEAR, TOTAL, col=EVTYPE_NEW))+
geom_line() +
theme_minimal() +
ggtitle(r) +
ylab("USD (billions)") +
xlab(element_blank()) +
custom_colors +
theme(#legend.position = c(0.8, 0.73),
legend.title=element_blank(),
axis.title=element_text(size=9),
plot.title = element_text(size = "12", color="gray30"))
)
}
# arrange line graphs
line_grid <- ggarrange(Midwest_economic_line,
Northeast_economic_line,
West_economic_line,
South_economic_line,
nrow = 2, ncol = 2)
annotate_figure(
annotate_figure(line_grid, top = text_grob("Costliest four event types in each Census Bureau region", size = 10, face = "italic")),
top = text_grob("Figure 2. Economic impact of US weather events, 1996-2011 (by year) "))
When plotted by year we can see additional detail that may be useful when planning. Amounts for the Northeast are affected by unusually costly floods in 2011. Even more prominently, the graph for the South shows the massive economic impact of hurricane Katrina in 2005, which skews the overall picture.
In the Storm Data database harm to individuals is presented in two variables: injuries and fatalities.
top_personal <- personal_harm %>%
group_by(EVTYPE_NEW) %>%
summarize(total = sum(PEOPLE)) %>%
slice_max(total, n=1)
The event type that was most harmful in terms of population health was tornado, which resulted in a combined total of 22,178 thousand injuries and fatalities (34% of the total).
top_personal_both <- personal_harm %>%
group_by(IMPACT_TYPE, EVTYPE_NEW) %>%
summarize(total = sum(PEOPLE), .groups='drop_last') %>%
slice_max(total, n=1)
top_injuries_perc <- label_percent()(top_personal_both$total[2]/sum(personal_harm$PEOPLE[personal_harm$IMPACT_TYPE == "INJURIES"]))
top_fatalities_perc <- label_percent()(top_personal_both$total[1]/sum(personal_harm$PEOPLE[personal_harm$IMPACT_TYPE == "FATALITIES"]))
The event type associated with the most injuries across the United States was tornado (20,667, 36% of injuries), while the event type that caused the most fatalities was excessive heat (1,800, 21% of fatalities).
As with economic impact, the distribution of injuries and fatalities changes by geographic region. Figure 3 illustrates these differences.
# For each region in the economic impact data...
for (r in unique(personal_harm$REGION)) {
# Get top 10 event types by economic cost
list_name <- paste0(r,"_list")
assign(list_name, personal_harm %>%
filter(REGION == r) %>%
group_by(EVTYPE_NEW) %>%
summarize(TOTAL = sum(PEOPLE)) %>%
slice_max(TOTAL, n=10))
# Plot and save bar graph
assign(paste0(r,"_personal_harm"), personal_harm %>%
filter(REGION == r) %>%
filter(EVTYPE_NEW %in% eval(parse(text=list_name))$EVTYPE_NEW) %>%
ggplot(aes(PEOPLE, fct_reorder(EVTYPE_NEW, PEOPLE, .fun="sum"), fill=IMPACT_TYPE)) +
geom_col() +
ggtitle(r) +
xlab("people") +
ylab(element_blank()) +
scale_fill_viridis(discrete = TRUE, begin=0.6, end=0.9,
option = "B", labels = c("fatalities","injuries")) +
theme_minimal() +
theme(legend.position = c(0.7,0.2),
legend.title = element_blank(),
axis.title=element_text(size=9),
plot.title = element_text(size = "12", color="gray30"))
)
}
# Combine individual bar graps into one plot
bar_grid <- ggarrange(Midwest_personal_harm,
Northeast_personal_harm,
West_personal_harm,
South_personal_harm,
nrow = 2, ncol = 2,
common.legend = TRUE, legend="bottom")
annotate_figure(
annotate_figure(bar_grid, top = text_grob("Most harmful ten event types in each Census Bureau region", size = 10, face = "italic")),
top = text_grob("Figure 3. Injuries and fatalites from US weather events, 1996-2011 (total) "))
The purpose of this report is to provide a high level overview that suggests where planners may wish to allocate resources. Additional exploration that focuses on variation over time, seasonal patterns, and finer geographic subdivision may all prove helpful depending on the project.