The goal of this analysis is to find the weather events that create
the most amount of damage, both economically and to personal health.
We will attempt to answer two questions directly. They are
1.
Across the United States, which types of events (as indicated by the
EVTYPE variable) are most harmful with respect to
population health?
2. Across the United States, which types of
events have the greatest economic consequences?
To accomplish
this we look at data containing information regarding amounts of varying
types of damage and the weather that caused the damages. The data
used in this analysis is taken from the NOAA Storm Database. This data
was not altered before the following process.
install_load <- function(pkgs) {
# Find which packages are not yet installed
new_pkgs <- pkgs[!(pkgs %in% installed.packages()[, "Package"])]
# Install missing ones
if (length(new_pkgs)) {
install.packages(new_pkgs, dependencies = TRUE)
}
# Load all packages
sapply(pkgs, require, character.only = TRUE)
}
# Use the function
install_load(c("tidyverse", "scales","tidytext"))
## tidyverse scales tidytext
## TRUE TRUE TRUE
We begin by reading in the data using fread from
the data.table package. We select nine columns to read to
avoid bringing in any extra data. These columns contain the dates of the
events, the state where the event occurred, the events themselves, and
then six columns that all concern the impacts. We also remove data for
years 1992 and prior as the number of records increases dramatically
after this, likely indicating improved processes with regards to
collecting data.
# Read in dataset, select columns to read, and rename variables.
StormDataOriginal <- data.table::fread("repdata_data_StormData.csv.bz2",
select = c(2,7,8,23,24,25,26,27,28))
names_vect <- c("Date", "State", "Event.Type", "Deaths",
"Injuries", "Prop.Dam.Significand", "Prop.Dam.Exponent",
"Crop.Dam.Significand", "Crop.Dam.Exponent")
StormData <- StormDataOriginal
colnames(StormData) <- names_vect
# Convert dates to date format and add a column for the year.
# Subset the dataset for years after 1992.
StormData$Date <- as.Date(StormData$Date, format = "%m/%d/%Y")
StormData$Year <- with(StormData, format(Date, "%Y"))
StormData$Year <- as.numeric(StormData$Year)
StormData <- StormData[StormData$Year >= 1993]
summary(StormData)
## Date State Event.Type Deaths Injuries
## Min. :1993-01-01 Length:714738 Length:714738 Min. : 0.0000 Min. :0.000e+00
## 1st Qu.:1999-06-23 Class :character Class :character 1st Qu.: 0.0000 1st Qu.:0.000e+00
## Median :2004-06-14 Mode :character Mode :character Median : 0.0000 Median :0.000e+00
## Mean :2003-12-14 Mean : 0.0152 Mean :9.621e-02
## 3rd Qu.:2008-06-22 3rd Qu.: 0.0000 3rd Qu.:0.000e+00
## Max. :2011-11-30 Max. :583.0000 Max. :1.568e+03
## Prop.Dam.Significand Prop.Dam.Exponent Crop.Dam.Significand Crop.Dam.Exponent Year
## Min. : 0.00 Length:714738 Min. : 0.000 Length:714738 Min. :1993
## 1st Qu.: 0.00 Class :character 1st Qu.: 0.000 Class :character 1st Qu.:1999
## Median : 0.00 Mode :character Median : 0.000 Mode :character Median :2004
## Mean : 12.68 Mean : 1.928 Mean :2003
## 3rd Qu.: 2.00 3rd Qu.: 0.000 3rd Qu.:2008
## Max. :5000.00 Max. :990.000 Max. :2011
Next, we transform the data. The values for the amount of damage
(in dollars) for both property damage and damage to crops are stored
across two variables each. One contains information on the exponent or
magnitude of the damages. Another contains the coefficient of the
damages. If we combine them, through multiplication, we get the amount
of damage, to property or to crops.
# Create list of unique exponent values and corresponding numerical values.
Dam.Exponent <- union(StormData$Prop.Dam.Exponent,StormData$Crop.Dam.Exponent)
Dam.Values <- c( 1e3, 1e6, 1, 1e9, 1e6, 1, 1, 1e5,
1e6, 1, 1e4, 1e2, 1e3, 1e2, 1e7,
1e2, 1, 10, 1e8, 1e3)
# Create relational table to change the exponent symbol to its numerical meaning
Dam.Multiplier <- cbind(Dam.Exponent, Dam.Values)
StormData <- left_join(StormData, Dam.Multiplier,
join_by("Prop.Dam.Exponent" == "Dam.Exponent"),
copy = TRUE)
colnames(StormData)[11] <- "Prop.Dam.Values"
StormData <- left_join(StormData, Dam.Multiplier,
join_by("Crop.Dam.Exponent" == "Dam.Exponent"),
copy = TRUE)
colnames(StormData)[12] <- "Crop.Dam.Values"
StormData$Prop.Dam.Values <- as.numeric(StormData$Prop.Dam.Values)
StormData$Crop.Dam.Values <- as.numeric(StormData$Crop.Dam.Values)
# Multiply the exponent value by the coefficient.
StormData$Prop.Dam.Cost <- StormData$Prop.Dam.Significand *
StormData$Prop.Dam.Values
StormData$Crop.Dam.Cost <- StormData$Crop.Dam.Significand *
StormData$Crop.Dam.Values
# Convert economic damages to trillions of dollars.
StormData$Prop.Dam.Cost <- StormData$Prop.Dam.Cost / (1e12)
StormData$Crop.Dam.Cost <- StormData$Crop.Dam.Cost / (1e12)
# Remove obsolete variables.
StormData <- subset(StormData,
select = -c( Prop.Dam.Significand,
Prop.Dam.Exponent,
Crop.Dam.Significand,
Crop.Dam.Exponent,
Prop.Dam.Values,
Crop.Dam.Values))
We also converted damages to trillions of dollars and removed any
extra columns that are a result of the transformations.
There are 985 unique entries in row containing all of the
events. Many of them are redundant or abbreviations of same event or
misspellings of the same event. First we will subset the data so there
are no rows with all zeros in the impact columns. That is to say only
rows where there is at least one death or injury, or some monetary
damage recorded for either property or crops greater than zero. The rest
of the steps are taken to combine and reduce the number of different
events.
# Subset only rows with some form of impact. If any of the impact variables
# (Deaths, Injuries, Prop.Dam.Cost, Crop.Dam.Cost) are non zero they are kept.
SD <- StormData[!rowSums(StormData[, c(4,5,7,8)]) == 0,]
# Make all Events uppercase to avoid distinction by case.
SD$Event.Type <- toupper(SD$Event.Type)
# Group all Events by meaning to reduce buckets and fix misspellings.
SD$Event.Type <- gsub(".*THUN.*|.*LIGHTNING.*|.*TSTM.*|.*LIGNTNING.*|.*LIGHTING.*|.*MICRO.*|.*DOWNBURST.*|.*TURBULENCE.*"
, "THUNDER/LIGHTNING", SD$Event.Type)
SD$Event.Type <- gsub(".*BLIZZARD.*", "BLIZZARD", SD$Event.Type)
SD$Event.Type <- gsub(".*SNOW.*|.*ICE.*|.*WINT.*|.*FROST.*|.*FREEZE.*|.*ICY.*|.*GLAZE.*|.*HEAVY MIX.*|.*FREEZING SPRAY.*",
"ICE/SNOW", SD$Event.Type)
SD$Event.Type <- gsub(".*TORNADO.*|.*WATERSPOUT.*|.*DUST.*|.*GUSTNADO.*|.*LANDSPOUT.*|.*FUNNEL.*|.*TORNDAO.*", "TORNADO", SD$Event.Type)
SD$Event.Type <- gsub(".*HURR.*|.*TYPH.*", "HURRICANE", SD$Event.Type)
SD$Event.Type <- gsub(".*HAIL.*", "HAIL", SD$Event.Type)
SD$Event.Type <- gsub(".*FLOOD.*|.*MUD.*|.*SLIDE.*|.*DAM.*|.*RISING.*|.*EROSION.*|.*LANDSLUMP.*|.*DROWN.*|.*HIGH WATER.*"
, "FLOODS/DEBRIS FLOW", SD$Event.Type)
SD$Event.Type <- gsub(".*HEAT.*|.*DROUGHT.*|.*HYPER.*|.*WARM.*|.*DENSE SMOKE.*", "HEAT", SD$Event.Type)
SD$Event.Type <- gsub(".*RAIN.*|.*DRIZZ.*|.*SLEET.*|.*PRECIP.*|.*SHOWER.*|.*COOL AND WET.*|.*WETNESS.*",
"RAIN/SLEET", SD$Event.Type)
SD$Event.Type <- gsub(".*FIRE.*", "WILDFIRE", SD$Event.Type)
SD$Event.Type <- gsub(".*FOG.*", "FOG", SD$Event.Type)
SD$Event.Type <- gsub(".*COLD.*|.*WIND.*|.*HYPO.*|.*LOW TEMP.*", "COLD/WIND", SD$Event.Type)
SD$Event.Type <- gsub(".*SURF.*|.*TIDE.*|.*COAST.*|.*WAVE.*|.*RIP CUR.*", "COASTAL/SURF/TIDE", SD$Event.Type)
SD$Event.Type <- gsub(".*TROPIC.*", "TROPICAL STORMS", SD$Event.Type)
SD$Event.Type <- gsub(".*OTHER.*|.*\\?.*|.*APACHE.*|.*ACCIDENT.*|.*URBAN.*|.*MISHAP.*|^HIGH$", "OTHER", SD$Event.Type)
SD$Event.Type <- gsub(".*SEA.*|.*SWELL.*|.*STORM SURGE.*", "MARINE/SEAS", SD$Event.Type)
SD$Event.Type <- gsub(".*AVALAN.*", "AVALANCHE", SD$Event.Type)
We now sum the deaths and injuries by the type of
weather event. We also sum the damages to property and
crops by the type of weather event. Then we take the top ten events in
damages and casualties.
# Combine Deaths and Injuries to one variable called Casualties
SD$Casualties <- SD$Deaths + SD$Injuries
# Combine Property Damage and Crop Damage to give Total Damage.
SD$Dam.Cost <- (SD$Prop.Dam.Cost + SD$Crop.Dam.Cost)
# Aggregate Deaths and Injuries by Event Type. Rename new columns.
# Create variable summing the aggregated Deaths and Injuries.
# Take top 10 events with regard to Casualties (Deaths + Injuries).
SD.Agg.DI <- aggregate(cbind(SD$Deaths,SD$Injuries)~SD$Event.Type, FUN = sum)
names(SD.Agg.DI) <- c("Event", "Deaths", "Injuries")
SD.Agg.DI$Cas <- SD.Agg.DI$Deaths + SD.Agg.DI$Injuries
SD.Agg.DI <- slice_max(.data = SD.Agg.DI, order_by = SD.Agg.DI$Cas, n = 10)
# Aggregate Property and Crop damages by Event Type. Rename new columns.
# Create variable summing the aggregated Property and Crop damages.
# Take top 10 events with regard to Total damage (Property damage + Crop damage).
SD.Agg.PC <- aggregate(cbind(SD$Prop.Dam.Cost,SD$Crop.Dam.Cost)~SD$Event.Type, FUN = sum)
names(SD.Agg.PC) <- c("Event", "Prop.Dam", "Crop.Dam")
SD.Agg.PC$Dam.Cost <- SD.Agg.PC$Crop.Dam + SD.Agg.PC$Prop.Dam
SD.Agg.PC <- slice_max(.data = SD.Agg.PC, order_by = SD.Agg.PC$Dam.Cost, n = 10)
Next, we tidy the data. We make the dataset longer but reduce its width by collapse two variables into one. This will allow us to plot totals for damages and casualties but also show the components.
# Tidy the data by taking the two variables Deaths and Injuries and combining
# them into one variable and then making the new variable a factor.
SD.Agg.byCas <- pivot_longer(data = SD.Agg.DI,cols = c("Deaths", "Injuries"),
names_to = "Casualty.Type", values_to = "Casualties")
SD.Agg.byCas$Casualty.Type <- factor(SD.Agg.byCas$Casualty.Type,
levels = c("Injuries", "Deaths"))
# Tidy the data by taking the two variables Property Damage and Crop Damage
# and combining them into one variable and then making the new variable a factor.
SD.Agg.byDam <- pivot_longer(data = SD.Agg.PC,cols = c("Prop.Dam", "Crop.Dam"),
names_to = "Damage.Type", values_to = "Damage.Cost")
SD.Agg.byDam$Damage.Type <- factor(SD.Agg.byDam$Damage.Type,
levels = c("Prop.Dam", "Crop.Dam"))
Now we are ready to plot the impacts by event. First we will
plot Casualties by event.
# Plot stacked bar plot showing Casualties (stacking deaths and injuries) by Event Type.
k <- ggplot(SD.Agg.byCas, aes(x = reorder(Event, -Casualties, sum), y = Casualties, fill = Casualty.Type))+
geom_bar(stat = "identity")+
xlab("Weather Event")+
ylab("Casualties")+
theme(plot.title = element_text(hjust = 0.5) ,
axis.text.x = element_text(angle = 45, vjust = 1,
hjust = 1, size = 11), legend.position = "bottom",
legend.title = element_blank(),
axis.ticks.length.x = unit(0.5, "cm"))+
labs(title = "Casualties by Weather Event")+
ylim(0,30000)+
scale_x_discrete(labels = c("Tornado", "Thunder\nLightning",
"Heat", "Floods\nDebris Flow",
"Ice\nSnow", "Cold\nWind",
"Wildfire", "Coastal\nSurf/Tide",
"Hurricane", "Fog"))+
stat_summary(fun = sum, aes(label = after_stat(prettyNum(y, big.mark = ",")), group = Event),
geom = "text", vjust = -0.5)
k
Tornadoes are responsible for over 25,000 casualties. Thunder and
Lightning are the next closest with less than 13,000. The event most
harmful with respect to population health is a tornado.
Next
we plot the damages by the event that caused the damages.
# Plot stacked bar plot showing Total Damage (stacking property and crop damage) by Event Type.
l <- ggplot(SD.Agg.byDam, aes(x = reorder(Event, -Damage.Cost, sum), y = Damage.Cost, fill = Damage.Type))+
geom_bar(stat = "identity")+
xlab("Weather Event")+
ylab("Damages (trillions of dollars)")+
theme(plot.title = element_text(hjust = 0.5) ,
axis.text.x = element_text(angle = 45,
vjust = 1, hjust = 1, size = 11), legend.position = "bottom",
legend.title = element_blank(),
axis.ticks.length.x = unit(0.5, "cm"))+
ylim(0, 50)+
labs(title = "Economic Impact by Weather Event")+
scale_x_discrete(labels = c("Floods\nDebris Flow", "Tornado", "Hail",
"Hurricane", "Heat", "Thunder\nLightning",
"Ice\nSnow", "Cold\nWind",
"Wildfire", "Tropical \n Storms"))+
stat_summary(fun = sum, aes(label = dollar(round(after_stat(y), 2)), group = Event),
geom = "text", vjust = -0.5)
l
Floods (and flood-like weather) are the most economically consequential
events in the United States. They have caused almost 44 trillion dollars
in damage from 1993 to 2011 accounting for around 30% of all
damages.
We have now answered the questions set forth at the beginning of
this analysis. Beyond this, we will look at the impacts of these weather
events after grouping them by the state they happened in.
# Aggregate all variables measuring impact by State. Rename variables.
# Scale from trillions to billions
SD.State <- aggregate(cbind(SD$Deaths, SD$Injuries,SD$Casualties,
SD$Prop.Dam.Cost, SD$Crop.Dam.Cost,
SD$Dam.Cost)~SD$State, FUN = sum)
names(SD.State) <- c("State", "Deaths", "Injuries","Casualties", "Property.Damage",
"Crop.Damage", "Total.Damage")
SD.State$Total.Damage <- SD.State$Total.Damage * 1e3
SD.State$Property.Damage <- SD.State$Property.Damage * 1e3
SD.State$Crop.Damage <- SD.State$Crop.Damage * 1e3
Again we will take the aggregated data and “unpivot” it so
that the impacts are collapsed to one variable. This time we will take
the states with the ten highest amounts of casualties and also
damages.
# Tidy the data by combining the four impact variables into one and making that
# new variable a factor. Also add a variable classifying the type of impact.
SD.State.combined <- pivot_longer(data = SD.State, cols = c("Deaths","Injuries", "Property.Damage", "Crop.Damage"),
names_to = "Type", values_to = "Impact")
SD.State.combined$Type <- factor(SD.State.combined$Type,
levels = c("Injuries", "Deaths","Crop.Damage", "Property.Damage"))
SD.State.combined$Facet <- ifelse(SD.State.combined$Type == "Deaths" | SD.State.combined$Type == "Injuries", "Personal", "Economic")
# Split by type of impact. Find 10 highest impacts in each type and then
# recombine the split datasets.
SD.Personal <- SD.State.combined[SD.State.combined$Facet == "Personal",]
SD.Economic <- SD.State.combined[SD.State.combined$Facet == "Economic",]
SD.Personal.T10 <- slice_max(.data = SD.Personal, order_by = Casualties, n = 20)
SD.Economic.T10 <- slice_max(.data = SD.Economic, order_by = Total.Damage, n = 20)
SD.State.combined.Top10 <- rbind(SD.Personal.T10, SD.Economic.T10)
Now we can plot the ten most impacted states, both economically and with regards to health.
# Plot a bar plot showing 10 most impacted states by both Economic and Personal Health
n <- ggplot(data = SD.State.combined.Top10, aes(x = reorder_within(State,-Impact,Facet), y = Impact, fill = Type))+
geom_bar(stat = "identity")+
facet_wrap(~Facet, scales = "free", labeller = as_labeller(
c(Economic = "Economic Damage (billions of dollars)", Personal = "Casualties")))+
xlab("State")+
theme(plot.title = element_text(hjust = 0.5) ,
axis.text.x = element_text(angle = 0,
vjust = 1, hjust = 0.5, size = 11), legend.position = "bottom",
legend.title = element_blank(),
axis.ticks.length.x = unit(0.5, "cm"))+
labs(title = "Economic and Personal Health Damage by State")+
scale_x_discrete(labels = function(x) gsub("__Economic|__Personal|___Economic|___Personal","",x))
n
Texas, Florida, California and Ohio are in the top ten for both economic
damage and number of casualties. Increased efforts to buttress
protections against major weather events in these states would likely
yield the greatest return.