Synopsis: This analysis identifies the weather events with the
largest effects on population health and economic costs. Two separate
analyses are run on the same original dataset: one for population
health (injuries and fatalities) and one for economic costs (property
damage and crop damage). For each area, the 10 event types with the
largest totals are extracted, and the data for these
highest-contributing events are then examined separately from the
rest to see how they affect these variables.
Data Processing
library(knitr)
library(ggplot2)
library(dplyr)
library(tidyr)
library(scales)
data <- read.csv('repdata_data_StormData.csv')
#get structure of the StormData dataframe
str(data)
'data.frame': 902297 obs. of 37 variables:
$ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
$ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
$ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
$ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
$ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
$ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
$ STATE : chr "AL" "AL" "AL" "AL" ...
$ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
$ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
$ BGN_AZI : chr "" "" "" "" ...
$ BGN_LOCATI: chr "" "" "" "" ...
$ END_DATE : chr "" "" "" "" ...
$ END_TIME : chr "" "" "" "" ...
$ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
$ COUNTYENDN: logi NA NA NA NA NA NA ...
$ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
$ END_AZI : chr "" "" "" "" ...
$ END_LOCATI: chr "" "" "" "" ...
$ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
$ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
$ F : int 3 2 2 2 2 2 2 1 3 3 ...
$ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
$ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
$ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
$ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
$ PROPDMGEXP: chr "K" "K" "K" "K" ...
$ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
$ CROPDMGEXP: chr "" "" "" "" ...
$ WFO : chr "" "" "" "" ...
$ STATEOFFIC: chr "" "" "" "" ...
$ ZONENAMES : chr "" "" "" "" ...
$ LATITUDE : num 3040 3042 3340 3458 3412 ...
$ LONGITUDE : num 8812 8755 8742 8626 8642 ...
$ LATITUDE_E: num 3051 0 0 0 0 ...
$ LONGITUDE_: num 8806 0 0 0 0 ...
$ REMARKS : chr "" "" "" "" ...
$ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
Questions to answer:
- Which types of events (EVTYPE) are most harmful to population health?
- Which types of events have the greatest economic impact across the United States?
- Population Health Data Processing
#for the first question, the main columns to analyze are EVTYPE, FATALITIES, INJURIES
evtype_health <- data[, c('EVTYPE','FATALITIES','INJURIES')]
#there are too many unique event types, so filter the data to analyze only those in the top 10 for fatalities or the top 10 for injuries
#getting sums of fatalities and injuries for each unique event type
sum_fatalities <- aggregate(FATALITIES ~ EVTYPE, data = evtype_health, sum, na.rm=TRUE)
sum_injuries <- aggregate(INJURIES ~ EVTYPE, data=evtype_health, sum, na.rm=TRUE)
#get the top ten events from sums of fatalities
fatal_top10 <- sum_fatalities %>%
  slice_max(order_by = FATALITIES, n = 10, with_ties = FALSE) %>%
  pull(EVTYPE)
#get top ten events from sums of injuries
injury_top10 <- sum_injuries %>%
  slice_max(order_by = INJURIES, n = 10, with_ties = FALSE) %>%
  pull(EVTYPE)
#subset the first df to only the events in the top 10 for fatalities
#subset again to only the events in the top 10 for injuries
#subset again to the events that are in the top 10 for both fatalities and injuries
evtype_fatalities <- evtype_health[evtype_health$EVTYPE %in% fatal_top10,]
evtype_injuries <- evtype_health[evtype_health$EVTYPE %in% injury_top10,]
evtype_top10_both <- evtype_health[evtype_health$EVTYPE %in% intersect(fatal_top10, injury_top10),]
#end with three subsetted dataframes for each category (top 10 fatalities, top 10 injuries, intersection of events in top 10 fatalities and injuries)
#see the names of events with the most fatalities and injuries
print(fatal_top10)
[1] "TORNADO" "EXCESSIVE HEAT" "FLASH FLOOD" "HEAT" "LIGHTNING" "TSTM WIND"
[7] "FLOOD" "RIP CURRENT" "HIGH WIND" "AVALANCHE"
print(injury_top10)
[1] "TORNADO" "TSTM WIND" "FLOOD" "EXCESSIVE HEAT" "LIGHTNING"
[6] "HEAT" "ICE STORM" "FLASH FLOOD" "THUNDERSTORM WIND" "HAIL"
#see names of events that are in the top 10 of both fatalities and injuries
print(intersect(fatal_top10, injury_top10))
[1] "TORNADO" "EXCESSIVE HEAT" "FLASH FLOOD" "HEAT" "LIGHTNING" "TSTM WIND"
[7] "FLOOD"
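Both "TSTM WIND" and "THUNDERSTORM WIND" appear in the injury list above, which suggests that some event types are recorded under more than one label. The following is a minimal sketch, not part of the original analysis, of how such labels could be merged before aggregating. The helper `normalize_evtype` and the toy data frame are hypothetical, and TSTM is the only abbreviation it expands; whether other labels should be merged is an assumption that would need checking against the data.

```r
# Hypothetical helper (not in the original analysis): standardise event labels.
normalize_evtype <- function(x) {
  x <- toupper(trimws(x))                                  # same case, no padding
  x <- gsub("\\s+", " ", x, perl = TRUE)                   # collapse repeated spaces
  x <- gsub("\\bTSTM\\b", "THUNDERSTORM", x, perl = TRUE)  # expand the abbreviation
  x
}

# Toy records with made-up injury counts, for illustration only.
toy <- data.frame(
  EVTYPE   = c("TSTM WIND", "Thunderstorm Wind", " TORNADO", "THUNDERSTORM WIND"),
  INJURIES = c(5, 3, 10, 2),
  stringsAsFactors = FALSE
)

toy$EVTYPE <- normalize_evtype(toy$EVTYPE)

# The three wind labels now fall into a single group.
merged <- aggregate(INJURIES ~ EVTYPE, data = toy, FUN = sum)
print(merged)
```

Applied to the full dataset, a step like this would change the per-event sums, so the top 10 lists could differ from those printed above.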
- Economic Cost Data Processing
#relevant columns from the original dataset are PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP, and EVTYPE
econ <- data[, c('EVTYPE','PROPDMG', 'PROPDMGEXP','CROPDMG','CROPDMGEXP')]
#Like the previous analysis, I'm going to extract the event types with the highest total Property and Crop damage
#first make a function that will use the EXP columns (PROPDMGEXP and CROPDMGEXP) to get the actual value of the DMG columns
exp_multiplier <- function(x) {
  case_when(
    x %in% c("H", "h") ~ 1e2,
    x %in% c("K", "k") ~ 1e3,
    x %in% c("M", "m") ~ 1e6,
    x %in% c("B", "b") ~ 1e9,
    TRUE ~ 1
  )
}
#add new columns that show the DMG cost multiplied by the corresponding EXP
econ_dmg <- econ %>%
  mutate(
    prop_dmg_total = PROPDMG * exp_multiplier(PROPDMGEXP),
    crop_dmg_total = CROPDMG * exp_multiplier(CROPDMGEXP)
  )
#get sums of property damage costs for each event type
prop_dmg_by_event <- econ_dmg %>%
  group_by(EVTYPE) %>%
  summarise(
    total_prop_damage = sum(prop_dmg_total, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(desc(total_prop_damage))
#get sums of crop damage costs for each event type
crop_dmg_by_event <- econ_dmg %>%
  group_by(EVTYPE) %>%
  summarise(
    total_crop_damage = sum(crop_dmg_total, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(desc(total_crop_damage))
#from these two data frames I can get the top 10 event types for both property and crop damage
prop_top10 <- prop_dmg_by_event[1:10,]
crop_top10 <- crop_dmg_by_event[1:10,]
#look at the names of the events with the highest property and crop damage
print(prop_top10$EVTYPE)
print(crop_top10$EVTYPE)
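The conversion from a damage value and its EXP code to a dollar amount, used earlier in this section, can be sanity-checked on a few rows. The sketch below is a base R variant that uses a named lookup vector instead of `case_when`; it is meant to follow the same rule (H, K, M, B in either case, and a multiplier of 1 for anything else). The names `exp_lookup` and `to_dollars` and the toy values are hypothetical.

```r
# Multipliers for the recognised exponent codes (upper case).
exp_lookup <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)

# Hypothetical helper: convert a damage value and its code to dollars.
to_dollars <- function(value, exp_code) {
  mult <- unname(exp_lookup[toupper(exp_code)])  # NA when the code is not in the lookup
  mult[is.na(mult)] <- 1                         # unrecognised or empty code: leave value as-is
  value * mult
}

# Toy rows, for illustration only.
toy <- data.frame(
  DMG = c(25, 2.5, 1.2, 3, 7),
  EXP = c("K", "m", "B", "", "?"),
  stringsAsFactors = FALSE
)
toy$total <- to_dollars(toy$DMG, toy$EXP)
print(toy)
```

Here 25 with code "K" becomes 25,000, while the rows with an empty code or "?" keep their raw value. Treating unrecognised codes as a multiplier of 1 is a choice; running `table()` on the EXP columns of the real data would show how many records it affects.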
Results
- Population Health
#to visualize population health effects, use only the event types that are in the top 10 of BOTH fatalities and injuries
#make a stacked bar plot showing total injuries and fatalities for each of these event types
#first, reshape the evtype_top10_both dataframe into long format by adding a column 'health_type' (FATALITIES or INJURIES) and a column 'count' (the value for that type)
evtype_long <- evtype_top10_both %>%
  select(EVTYPE, FATALITIES, INJURIES) %>%
  pivot_longer(
    cols = c(FATALITIES, INJURIES),
    names_to = "health_type",
    values_to = "count"
  )
#convert the health_type column to a factor so the fill order is fixed
evtype_long <- evtype_long %>%
  mutate(health_type = factor(health_type, levels = c("INJURIES", "FATALITIES")))
#this new df is used to make a stacked bar for each top 10 event type, showing total counts of fatalities and injuries
#FUN = sum orders the bars by total count; reorder() would otherwise use the mean per record
ggplot(evtype_long, aes(x = reorder(EVTYPE, -count, FUN = sum), y = count, fill = health_type)) +
  geom_col() +
  labs(
    x = "Event Type",
    y = "Count",
    fill = "Type"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Tornadoes are clearly the most dangerous weather events in the
United States with regard to human health. Among the event types that
are in the top 10 for both injuries and fatalities, they account for
the most in each category.
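One detail worth checking in a stacked bar chart like the one above is the bar order. `reorder()` summarises its second argument with `mean` by default, so when the data holds one row per storm record, the ordering follows the average count per record rather than the total that `geom_col()` stacks. A small sketch with made-up numbers (the `toy` data frame is hypothetical) shows how the two orderings can differ:

```r
# Toy long-format data: FLOOD has many small records, HEAT one large record.
toy <- data.frame(
  EVTYPE = c("FLOOD", "FLOOD", "FLOOD", "HEAT"),
  count  = c(2, 1, 1, 3),
  stringsAsFactors = FALSE
)

# Default: levels ordered by the mean of -count within each event type.
by_mean <- levels(reorder(toy$EVTYPE, -toy$count))

# FUN = sum: levels ordered by the total, which matches the stacked bar heights.
by_sum <- levels(reorder(toy$EVTYPE, -toy$count, FUN = sum))

print(by_mean)  # "HEAT"  "FLOOD" (means: HEAT 3, FLOOD 1.33)
print(by_sum)   # "FLOOD" "HEAT"  (totals: FLOOD 4, HEAT 3)
```

Summing the counts per event type before plotting is another way to get the same result, and it also keeps the plotted data frame small.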
- Economic Impact
#visualize the property/crop damage caused by these events
p <- ggplot(prop_top10, aes(x = reorder(EVTYPE, -total_prop_damage), y = total_prop_damage, fill = EVTYPE))
p + geom_col() +
  labs(x = 'Event Type', y = 'Total Cost (billions of USD)', title = 'Total Property Damage by Event Type (top 10 events)') +
  theme(axis.text.x = element_blank()) +
  scale_y_continuous(labels = label_number(scale = 1e-9, suffix = "B"))

#named crop_plot rather than c to avoid masking base R's c() function
crop_plot <- ggplot(crop_top10, aes(x = reorder(EVTYPE, -total_crop_damage), y = total_crop_damage, fill = EVTYPE))
crop_plot + geom_col() +
  labs(x = 'Event Type', y = 'Total Cost (billions of USD)', title = 'Total Crop Damage by Event Type (top 10 events)') +
  theme(axis.text.x = element_blank()) +
  scale_y_continuous(labels = label_number(scale = 1e-9, suffix = "B"))

Floods are the largest cause of property damage, and droughts are the
largest cause of crop damage. In each case, the total cost is nearly
double that of the second most damaging event type.
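The "nearly double" comparison can be checked directly by dividing the largest total by the second largest. The sketch below uses a hypothetical helper, `top_ratio`, and made-up totals rather than the real damage figures; with the actual data it would be called on `prop_top10$total_prop_damage` or `crop_top10$total_crop_damage`.

```r
# Hypothetical helper: ratio of the largest total to the second largest.
top_ratio <- function(totals) {
  sorted <- sort(totals, decreasing = TRUE)
  unname(sorted[1] / sorted[2])
}

# Made-up totals for three event types, for illustration only.
toy_totals <- c(A = 120, B = 65, C = 30)

ratio <- top_ratio(toy_totals)
print(round(ratio, 2))
```

A ratio close to 2 would support the statement above; the toy values are chosen only to show the calculation.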
