Severe weather events impose substantial health and economic burdens on communities and municipalities. Given the scale of these impacts, it is crucial to analyze data on fatalities, injuries, and property damage to inform the development of effective response protocols. This study examines the U.S. National Oceanic and Atmospheric Administration (NOAA) Storm Database. The analysis reveals that tornadoes are the leading cause of adverse health outcomes (fatalities plus injuries), while floods are the leading cause of property damage.
I imported the necessary libraries as follows:
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
Then I set the working directory to the Downloads folder on my computer and read the CSV file into df, which stands for data frame.
setwd('~/Downloads')
# The working directory is already ~/Downloads, so a relative path is enough
df = read.csv('repdata-data-StormData.csv')
head(df)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE EVTYPE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL TORNADO
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL TORNADO
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL TORNADO
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL TORNADO
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL TORNADO
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL TORNADO
## BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
## 1 0 0 NA
## 2 0 0 NA
## 3 0 0 NA
## 4 0 0 NA
## 5 0 0 NA
## 6 0 0 NA
## END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES PROPDMG
## 1 0 14.0 100 3 0 0 15 25.0
## 2 0 2.0 150 2 0 0 0 2.5
## 3 0 0.1 123 2 0 0 2 25.0
## 4 0 0.0 100 2 0 0 2 2.5
## 5 0 0.0 150 2 0 0 2 2.5
## 6 0 1.5 177 2 0 0 6 2.5
## PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE LONGITUDE
## 1 K 0 3040 8812
## 2 K 0 3042 8755
## 3 K 0 3340 8742
## 4 K 0 3458 8626
## 5 K 0 3412 8642
## 6 K 0 3450 8748
## LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3051 8806 1
## 2 0 0 2
## 3 0 0 3
## 4 0 0 4
## 5 0 0 5
## 6 0 0 6
During data processing, I made the following changes:
PROPDMGEXP is a magnitude suffix that gives the multiplier for PROPDMG (K for thousands, M for millions, B for billions). I converted each suffix to its numeric multiplier to recover the actual property damage.
Health impact is measured as the sum of fatalities and injuries, which I defined as the health impact total.
Totals are calculated per event type: I used group_by to group the records by event type, then summarize to sum the property damage and the health impact within each group.
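As a small illustration of the suffix arithmetic described above, the sketch below applies the same K/M/B rule to a few toy rows. Only the first row mirrors a record shown in the head(df) output (PROPDMG of 25.0 with suffix K); the other rows, the `toy` data frame, and the `suffix_multiplier` lookup are assumptions made up for this example, not part of the original analysis.

```r
# Toy rows: the first mirrors head(df); the rest are hypothetical.
toy <- data.frame(
  PROPDMG    = c(25.0, 2.5, 1.2, 10),
  PROPDMGEXP = c("K", "M", "B", ""),
  stringsAsFactors = FALSE
)

# Hypothetical lookup from suffix to multiplier.
suffix_multiplier <- c(K = 1e3, M = 1e6, B = 1e9)

# Suffixes missing from the lookup come back as NA; treat them as 1.
multiplier <- unname(suffix_multiplier[toy$PROPDMGEXP])
multiplier[is.na(multiplier)] <- 1

# Actual damage = reported figure times its multiplier.
toy$property_damage_total <- toy$PROPDMG * multiplier
toy
```

With this rule, 25.0 with suffix K becomes 25,000, and a row with a blank suffix keeps its raw figure.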
library(tidyverse)
df_health = df |>
  transmute(
    event_type = EVTYPE,
    # Convert the property damage suffix to a numeric multiplier
    Multiplier_prop = case_when(
      PROPDMGEXP == "K" ~ 1000,
      PROPDMGEXP == "M" ~ 1e6,
      PROPDMGEXP == "B" ~ 1e9,
      TRUE ~ 1 # Any other value keeps the raw figure
    ),
    property_damage_total = PROPDMG * Multiplier_prop,
    # Same conversion for crop damage (computed here, not summarized below)
    Multiplier_crop = case_when(
      CROPDMGEXP == "K" ~ 1000,
      CROPDMGEXP == "M" ~ 1e6,
      CROPDMGEXP == "B" ~ 1e9,
      TRUE ~ 1 # Any other value keeps the raw figure
    ),
    crop_damage_total = CROPDMG * Multiplier_crop,
    # Health impact = fatalities + injuries
    health_impact_total = FATALITIES + INJURIES
  ) |>
  # Sum property damage and health impact within each event type
  group_by(event_type) |>
  summarize(
    Property = sum(property_damage_total),
    Health = sum(health_impact_total)
  )
head(df_health)
## # A tibble: 6 × 3
## event_type Property Health
## <chr> <dbl> <dbl>
## 1 " HIGH SURF ADVISORY" 200000 0
## 2 " COASTAL FLOOD" 0 0
## 3 " FLASH FLOOD" 50000 0
## 4 " LIGHTNING" 0 0
## 5 " TSTM WIND" 8100000 0
## 6 " TSTM WIND (G45)" 8000 0
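The first rows above show event labels with leading whitespace, which group_by treats as distinct from their trimmed counterparts. The following is a minimal sketch of one way to merge such variants, assuming they differ only in surrounding whitespace or letter case. The `toy` table and its values are hypothetical, and this step was not part of the analysis reported here; applying it would change the totals shown in this report.

```r
# Hypothetical per-event totals with inconsistent labels (illustrative values only).
toy <- data.frame(
  event_type = c("   TSTM WIND", "TSTM WIND", "tstm wind",
                 " FLASH FLOOD", "FLASH FLOOD"),
  Property   = c(100, 200, 50, 10, 40),
  Health     = c(0, 3, 1, 0, 2),
  stringsAsFactors = FALSE
)

# Trim surrounding whitespace and upper-case each label before grouping.
toy$event_type <- toupper(trimws(toy$event_type))

# Re-aggregate so that each cleaned label appears exactly once.
cleaned <- aggregate(cbind(Property, Health) ~ event_type, data = toy, FUN = sum)
cleaned
```

In the real data this would be applied to EVTYPE before the group_by step, so that the sums are computed over the cleaned labels.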
Next, I sorted df_health in two ways to show the six event types with the
largest health impact and the six with the largest property damage:
top6_health = df_health |>
  arrange(desc(Health)) |>
  head()
top6_property = df_health |>
  arrange(desc(Property)) |>
  head()
top6_health
## # A tibble: 6 × 3
## event_type Property Health
## <chr> <dbl> <dbl>
## 1 TORNADO 56925660790. 96979
## 2 EXCESSIVE HEAT 7753700 8428
## 3 TSTM WIND 4484928495 7461
## 4 FLOOD 144657709807 7259
## 5 LIGHTNING 928659447. 6046
## 6 HEAT 1797000 3037
top6_property
## # A tibble: 6 × 3
## event_type Property Health
## <chr> <dbl> <dbl>
## 1 FLOOD 144657709807 7259
## 2 HURRICANE/TYPHOON 69305840000 1339
## 3 TORNADO 56925660790. 96979
## 4 STORM SURGE 43323536000 51
## 5 FLASH FLOOD 16140812067. 2755
## 6 HAIL 15727367053. 1376
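The arrange-then-head selection used for the two tables above can also be written as a single step with dplyr's slice_max. The sketch below shows this on a hypothetical summary table; `toy_summary` and its values are made up for illustration, and this is an alternative formulation rather than the code that produced the tables above.

```r
library(dplyr)

# Hypothetical summary table standing in for df_health (illustrative values only).
toy_summary <- data.frame(
  event_type = c("A", "B", "C", "D"),
  Property   = c(10, 400, 30, 200),
  Health     = c(5, 1, 9, 7)
)

# Keep the two rows with the largest Health, sorted in descending order.
# with_ties = FALSE guarantees exactly n rows even when values are tied.
top2_health <- toy_summary |>
  slice_max(Health, n = 2, with_ties = FALSE)
top2_health
```

Stating n explicitly makes the number of rows visible in the code instead of relying on the default of head().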
ggplot(top6_health) +
  aes(
    x = event_type,
    y = Health
  ) +
  geom_col() +
  theme_classic() +
  labs(
    x = 'Event Types',
    y = 'Health Impact',
    title = 'The top 6 weather event types that contributed to the health impact',
    caption = 'Figure 1. The top 6 weather event types that contributed to the total combined health impact. Health impact is the sum of fatalities and injuries incurred by each event type. The top contributor is tornado.'
  )
ggplot(top6_property) +
  aes(
    x = event_type,
    y = Property
  ) +
  geom_col() +
  theme_classic() +
  labs(
    x = 'Event Types',
    y = 'Property Damages',
    title = 'The top 6 weather event types that contributed to the property damages',
    caption = 'Figure 2. The top 6 weather event types that contributed to the total combined property damage. The top contributor to property damage is flood.'
  )
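With a character x variable, ggplot2 orders the bars alphabetically, and a long caption is drawn on a single line. The sketch below shows one possible way to order the bars by value and wrap the caption, using reorder and strwrap from base R. The `toy_top` table, its values, and the caption text are hypothetical stand-ins, and the wrap width of 60 characters is an arbitrary choice.

```r
library(ggplot2)

# Hypothetical top-event table standing in for top6_health (illustrative values only).
toy_top <- data.frame(
  event_type = c("FLOOD", "TORNADO", "HEAT"),
  Health     = c(70, 900, 30)
)

long_caption <- paste(
  "Figure. Hypothetical caption text that is long enough to run past the",
  "edge of the plot unless it is wrapped onto several lines."
)

# Break the caption into lines of at most about 60 characters.
wrapped_caption <- paste(strwrap(long_caption, width = 60), collapse = "\n")

p <- ggplot(toy_top) +
  aes(
    x = reorder(event_type, -Health),  # order bars from largest to smallest
    y = Health
  ) +
  geom_col() +
  theme_classic() +
  labs(
    x = "Event Types",
    y = "Health Impact",
    caption = wrapped_caption
  )
p
```

Ordering the bars by value makes the leading event type stand out without the reader having to compare bar heights across the axis.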
From the graphs, I identified tornadoes as the severe weather event with the largest negative health impact, while floods are the top contributor to property damage.