Abstract

Severe weather events impose substantial health and economic burdens on communities and municipalities. Given the scale of these impacts, it is crucial to analyze data on fatalities, injuries, and property damage to inform the development of effective response protocols. This study examines the U.S. National Oceanic and Atmospheric Administration (NOAA) Storm Database. The analysis shows that tornadoes are the leading cause of adverse health outcomes (fatalities plus injuries), while floods are the leading cause of economic losses, measured here as property damage.

Data Processing

I imported the necessary libraries as follows:

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)

Then, I set the working directory to the Downloads folder on my computer and read the CSV file into a data frame named df.

setwd('~/Downloads')
df = read.csv('/Users/anthonyshin/Downloads/repdata-data-StormData.csv')
head(df)
##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE  EVTYPE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL TORNADO
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL TORNADO
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL TORNADO
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL TORNADO
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL TORNADO
## 6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL TORNADO
##   BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
## 1         0                                               0         NA
## 2         0                                               0         NA
## 3         0                                               0         NA
## 4         0                                               0         NA
## 5         0                                               0         NA
## 6         0                                               0         NA
##   END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES PROPDMG
## 1         0                      14.0   100 3   0          0       15    25.0
## 2         0                       2.0   150 2   0          0        0     2.5
## 3         0                       0.1   123 2   0          0        2    25.0
## 4         0                       0.0   100 2   0          0        2     2.5
## 5         0                       0.0   150 2   0          0        2     2.5
## 6         0                       1.5   177 2   0          0        6     2.5
##   PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE LONGITUDE
## 1          K       0                                         3040      8812
## 2          K       0                                         3042      8755
## 3          K       0                                         3340      8742
## 4          K       0                                         3458      8626
## 5          K       0                                         3412      8642
## 6          K       0                                         3450      8748
##   LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1       3051       8806              1
## 2          0          0              2
## 3          0          0              3
## 4          0          0              4
## 5          0          0              5
## 6          0          0              6

The data processing consists of the following steps:

  1. The PROPDMGEXP column holds a magnitude suffix (K for thousands, M for millions, B for billions) that scales the value in PROPDMG. I converted these suffixes into numeric multipliers to recover the actual property damage; for example, PROPDMG = 25.0 with PROPDMGEXP = K becomes 25,000. Any other code is treated as a multiplier of 1. The same conversion is applied to CROPDMG and CROPDMGEXP, although crop damage is not carried into the final summary.

  2. The health impact is measured as the sum of fatalities and injuries, stored as health_impact_total.

  3. Totals are calculated per event type: the group_by function collects the rows for each event type, and summarize adds up the property damage and the health impact within each group.

library(tidyverse)
df_health = df |> 
  transmute(
    event_type = EVTYPE,
    Multiplier_prop = case_when(
      PROPDMGEXP == "K" ~ 1000,
      PROPDMGEXP == "M" ~ 1e6,
      PROPDMGEXP == "B" ~ 1e9,
      TRUE ~ 1  # Default case if no match
    ),
    property_damage_total = PROPDMG * Multiplier_prop,
    Multiplier_crop = case_when(
      CROPDMGEXP == "K" ~ 1000,
      CROPDMGEXP == "M" ~ 1e6,
      CROPDMGEXP == "B" ~ 1e9,
      TRUE ~ 1  # Default case if no match
    ),
    crop_damage_total = CROPDMG * Multiplier_crop,
    health_impact_total = FATALITIES + INJURIES
  ) |> 
  group_by(event_type) |> 
  summarize(
    Property = sum(property_damage_total),
    Health = sum(health_impact_total)
  )
head(df_health)
## # A tibble: 6 × 3
##   event_type              Property Health
##   <chr>                      <dbl>  <dbl>
## 1 "   HIGH SURF ADVISORY"   200000      0
## 2 " COASTAL FLOOD"               0      0
## 3 " FLASH FLOOD"             50000      0
## 4 " LIGHTNING"                   0      0
## 5 " TSTM WIND"             8100000      0
## 6 " TSTM WIND (G45)"          8000      0
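
The first rows of df_health show event types with leading spaces (for example " TSTM WIND"), which are grouped separately from "TSTM WIND". One possible refinement, which is not applied in this analysis, is to normalize the labels and exponent codes before grouping. The following is a minimal sketch on a hypothetical toy data frame (toy and toy_clean are illustrative names, and the values are made up). It assumes that lowercase exponent codes such as "k" should be treated like "K".

```r
library(dplyr)

# Hypothetical toy data (not taken from the NOAA file) mimicking the raw columns.
toy <- tibble(
  EVTYPE     = c(" TSTM WIND", "TSTM WIND", "tstm wind", "FLOOD"),
  PROPDMG    = c(10, 5, 2, 1),
  PROPDMGEXP = c("K", "k", "M", "")
)

toy_clean <- toy |>
  mutate(
    # Trim surrounding whitespace and unify case so variants share one group.
    event_type = toupper(trimws(EVTYPE)),
    # Assumption: lowercase codes mean the same as their uppercase forms.
    multiplier = case_when(
      toupper(PROPDMGEXP) == "K" ~ 1e3,
      toupper(PROPDMGEXP) == "M" ~ 1e6,
      toupper(PROPDMGEXP) == "B" ~ 1e9,
      TRUE ~ 1  # any other code is left unscaled, as in the main analysis
    )
  ) |>
  group_by(event_type) |>
  summarize(Property = sum(PROPDMG * multiplier))

toy_clean
```

In this toy case the three spellings of "TSTM WIND" collapse into a single group. Applying the same normalization to the full dataset would change the totals reported here, so it is shown only as a sketch.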

Then, I sorted df_health in two ways to find the top 6 event types by health impact and by property damage. Because head() returns the first 6 rows by default, sorting in descending order and calling head() gives the top 6:

top6_health = df_health |> 
  arrange(desc(Health)) |> 
  head()

top6_property = df_health |> 
  arrange(desc(Property)) |> 
  head()

top6_health
## # A tibble: 6 × 3
##   event_type          Property Health
##   <chr>                  <dbl>  <dbl>
## 1 TORNADO         56925660790.  96979
## 2 EXCESSIVE HEAT      7753700    8428
## 3 TSTM WIND        4484928495    7461
## 4 FLOOD          144657709807    7259
## 5 LIGHTNING         928659447.   6046
## 6 HEAT                1797000    3037
top6_property
## # A tibble: 6 × 3
##   event_type             Property Health
##   <chr>                     <dbl>  <dbl>
## 1 FLOOD             144657709807    7259
## 2 HURRICANE/TYPHOON  69305840000    1339
## 3 TORNADO            56925660790.  96979
## 4 STORM SURGE        43323536000      51
## 5 FLASH FLOOD        16140812067.   2755
## 6 HAIL               15727367053.   1376
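
The arrange(desc(...)) and head() pattern above can also be written with dplyr's slice_max(). The sketch below is one possible alternative, not the method used in this analysis. It runs on a hypothetical four-row subset (summary_subset) whose values are copied, rounded as printed, from the tables above, and it picks the top 2 rather than the top 6 to keep the example short.

```r
library(dplyr)

# Subset of the summary tables above; values rounded as printed.
summary_subset <- tibble(
  event_type = c("TORNADO", "FLOOD", "EXCESSIVE HEAT", "HAIL"),
  Property   = c(56925660790, 144657709807, 7753700, 15727367053),
  Health     = c(96979, 7259, 8428, 1376)
)

# with_ties = FALSE guarantees exactly n rows even if values are tied.
top2_health   <- summary_subset |> slice_max(Health, n = 2, with_ties = FALSE)
top2_property <- summary_subset |> slice_max(Property, n = 2, with_ties = FALSE)

top2_health
top2_property
```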

Results

ggplot(top6_health) +
  aes(
    x = event_type,
    y = Health
  ) +
  geom_col() +
  theme_classic() +
  labs(
    x = 'Event Types',
    y = 'Health Impact',
    title = 'The top 6 weather event types that contributed to the health impact',
    caption = 'Figure 1. The top 6 weather event types by total health impact, defined as the sum of fatalities and injuries for each event type. Tornadoes are the top contributor.')

ggplot(top6_property) +
  aes(
    x = event_type,
    y = Property
  ) +
  geom_col() +
  theme_classic() +
  labs(
    x = 'Event Types',
    y = 'Property Damages',
    title = 'The top 6 weather event types that contributed to the property damages',
    caption = 'Figure 2. The top 6 weather event types by total property damage. Floods are the top contributor.'
    )

From the figures, tornadoes are the severe weather event with the largest negative health impact (96,979 combined fatalities and injuries), while floods are the top contributor to property damage (about 144.7 billion dollars).
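
By default, ggplot2 places a character x variable in alphabetical order, so the bars in the figures are not sorted by size. If sorted bars are preferred, one option is to wrap the x variable in reorder(). The sketch below assumes that preference and uses a hypothetical data frame (top6_health_demo) holding the health totals copied from the top6_health table.

```r
library(ggplot2)

# Health totals copied from the top6_health table above.
top6_health_demo <- data.frame(
  event_type = c("TORNADO", "EXCESSIVE HEAT", "TSTM WIND",
                 "FLOOD", "LIGHTNING", "HEAT"),
  Health     = c(96979, 8428, 7461, 7259, 6046, 3037)
)

# reorder() converts event_type to a factor whose levels follow Health;
# negating Health places the largest bar first.
p <- ggplot(top6_health_demo) +
  aes(x = reorder(event_type, -Health), y = Health) +
  geom_col() +
  theme_classic() +
  labs(x = "Event Types", y = "Health Impact")

p
```

The same approach would work for the property damage figure by replacing Health with Property.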