Synopsis

This report analyzes the U.S. National Oceanic and Atmospheric Administration (NOAA) storm database to identify which types of weather events are most harmful to population health and which have the greatest economic consequences. The dataset covers events from 1950 to 2011 and records fatalities, injuries, and property and crop damage. Data were loaded directly from the raw compressed CSV file without external preprocessing. Events were aggregated by type, and damages were converted to U.S. dollars. Health impact was measured as the sum of fatalities and injuries, and economic impact as the sum of property and crop damage. Tornadoes are the leading cause of human casualties, while floods and hurricanes are the most costly events. The correlation between health impact and economic damage per event type was also examined after removing extreme outliers. It was weak and not statistically significant (Pearson r = 0.12, p = 0.28). These findings suggest that the event types driving human losses differ from those driving economic damage.

Data Processing

In this section we describe the steps used to process the raw data. First, the required R libraries are loaded. Then the compressed dataset is downloaded from the URL shown below (unless it is already present locally) and read into R. From the raw dataset, only the relevant variables are selected, and the damage magnitude indicators are converted into numeric multipliers so that property and crop damages are expressed in U.S. dollars.

Libraries used

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(readr)

Download and read the NOAA storm database

url <- "https://d396qusza40orc.cloudfront.net/repdata/data/StormData.csv.bz2"
file <- "StormData.csv.bz2"
if (!file.exists(file)) {
  try(download.file(url, file, method = "curl"), silent = TRUE)
  if (!file.exists(file)) download.file(url, file)
}
storm <- read_csv(file)
## Rows: 902297 Columns: 37
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (18): BGN_DATE, BGN_TIME, TIME_ZONE, COUNTYNAME, STATE, EVTYPE, BGN_AZI,...
## dbl (18): STATE__, COUNTY, BGN_RANGE, COUNTY_END, END_RANGE, LENGTH, WIDTH, ...
## lgl  (1): COUNTYENDN
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Select relevant variables

storm_sel <- storm %>%
  select(EVTYPE, FATALITIES, INJURIES,
         PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)

# Convert magnitude exponents (K, M, B) into numeric multipliers
conv_exp <- function(val, exp) {
  exp <- toupper(exp)
  mult <- ifelse(exp == "K", 1e3,
                 ifelse(exp == "M", 1e6,
                        ifelse(exp == "B", 1e9, 1)))
  val * mult
}

storm_sel <- storm_sel %>%
  mutate(PROPDMG = conv_exp(PROPDMG, PROPDMGEXP),
         CROPDMG = conv_exp(CROPDMG, CROPDMGEXP))

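After the relevant variables are selected, the magnitude codes (K, M, B, matched case-insensitively) are converted into numeric multipliers of 1e3, 1e6 and 1e9, and the result is stored in storm_sel, where PROPDMG and CROPDMG now hold damages in U.S. dollars. Any other code receives a multiplier of 1, and a missing (NA) code yields NA for that damage value.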
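
To make the conversion concrete, the short check below applies the same conv_exp() logic to a few made-up values (they are illustrative and not taken from the storm data). It also defines conv_exp_na1(), a hypothetical variant that is not used anywhere in this report and that treats a missing code as a multiplier of 1.

```r
# Same logic as conv_exp() in the report, repeated so the example is self-contained
conv_exp <- function(val, exp) {
  exp <- toupper(exp)
  mult <- ifelse(exp == "K", 1e3,
                 ifelse(exp == "M", 1e6,
                        ifelse(exp == "B", 1e9, 1)))
  val * mult
}

# Hypothetical variant (an assumption, not part of the report):
# a missing code is treated as "no multiplier" instead of producing NA
conv_exp_na1 <- function(val, exp) {
  exp <- toupper(exp)
  mult <- ifelse(is.na(exp), 1,
                 ifelse(exp == "K", 1e3,
                        ifelse(exp == "M", 1e6,
                               ifelse(exp == "B", 1e9, 1))))
  val * mult
}

# Made-up inputs: a lower-case "m", an unrecognised code "0", and a missing code
amount <- c(2.5, 3, 1, 7, 4)
code   <- c("K", "m", "B", "0", NA)

as_report  <- conv_exp(amount, code)      # 2500, 3e6, 1e9, 7, NA
as_variant <- conv_exp_na1(amount, code)  # 2500, 3e6, 1e9, 7, 4
```

In the report's pipeline, damage values that become NA in this step are excluded by na.rm = TRUE when damages are summed.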

Results

Health impact

salud <- storm_sel %>%
  group_by(EVTYPE) %>%
  summarise(
    deaths = sum(FATALITIES, na.rm = TRUE),
    injuries = sum(INJURIES, na.rm = TRUE),
    total_health = deaths + injuries
  ) %>%
  arrange(desc(total_health)) %>%
  slice_head(n = 10)

This code groups the raw storm records by event type (EVTYPE) and computes, for each type, the total number of deaths and injuries, then adds them to obtain a combined health impact (total_health). It orders event types from highest to lowest total_health and keeps the top 10. The resulting salud table is therefore a ranked summary of the weather events most harmful to population health.

ggplot(salud, aes(x = reorder(EVTYPE, total_health), y = total_health, fill = deaths)) +
  geom_col() +
  coord_flip() +
  labs(title = "Top 10 Weather Events Affecting Public Health",
       x = "Event", y = "Total Casualties (deaths + injuries)")

Afterward, the summary is visualized with a bar chart: each bar represents an event type, the bar length is the total casualties (total_health), and the fill color encodes the number of deaths on a continuous scale. Event types are ordered by impact, and the axes are flipped (coord_flip()) so that the bars run horizontally and the labels are easier to read. The result is a top-10 chart of the weather events most harmful to public health.

Economic impact

economia <- storm_sel %>%
  group_by(EVTYPE) %>%
  summarise(
    property = sum(PROPDMG, na.rm = TRUE),
    crop     = sum(CROPDMG, na.rm = TRUE),
    total_econ = property + crop
  ) %>%
  arrange(desc(total_econ)) %>%
  slice_head(n = 10)

This code groups the raw observations by event type (EVTYPE) and, for each type, sums property damages (PROPDMG) and crop damages (CROPDMG). It then computes the total economic loss as property + crop, ranks event types in descending order of total_econ, and keeps the top 10. The resulting economia table is a ranked summary of the most economically costly weather events.

ggplot(economia, aes(x = reorder(EVTYPE, total_econ), y = total_econ, fill = property)) +
  geom_col() +
  coord_flip() +
  labs(title = "Top 10 Weather Events with Greatest Economic Impact",
       x = "Event", y = "Total Damages (USD)")

The bar chart visualizes this summary: each bar is an event type, the bar length is the total economic loss (total_econ, in USD), and the fill color encodes the property-damage total on a continuous scale; the bars are not stacked, so crop damage is not shown separately. Event types are ordered by impact and the axes are flipped (coord_flip()) for readability, giving a top-10 view of the hazards that drive the largest monetary losses.

Correlation between health and economic impact

We first aggregate the dataset by event type to obtain two totals per event type: the combined health impact (total_health = fatalities + injuries) and the combined economic impact (total_econ = property + crop damages). To focus on meaningful signals, we keep only event types with strictly positive health and economic totals. Because a few event types are orders of magnitude larger than the rest, we then trim the most extreme values by excluding rows whose log10(x + 1) value lies above the 99.5th percentile for either variable; this removes extreme outliers while preserving the bulk of the data. On the trimmed sample, we compute the Pearson correlation between the untransformed total_health and total_econ to assess linear association. Finally, we visualize the relationship with a scatter plot on log-log axes and add a least-squares regression line; because the axes are log-scaled, the line is fitted to the log10-transformed totals rather than to the raw totals used for the correlation.

Aggregate by event type

storm_corr <- storm_sel %>%
  group_by(EVTYPE) %>%
  summarise(
    total_health = sum(FATALITIES + INJURIES, na.rm = TRUE),
    total_econ   = sum(PROPDMG + CROPDMG,   na.rm = TRUE)
  ) %>%
  ungroup()
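
One detail of this aggregation is worth a quick check. sum(a + b, na.rm = TRUE) adds the two columns row by row before NA values are removed, so a row with NA in either column contributes nothing, whereas the ranking tables earlier sum each column separately and then add the totals. The sketch below uses made-up numbers to show that the two forms can differ; whether they differ for the storm data depends on how many NA values the columns contain, which is not examined here.

```r
# Made-up damage values (illustrative only, not from the storm data)
prop <- c(10, 20, 30)
crop <- c(1, NA, 3)

# Row-wise form, as in storm_corr: the second row becomes NA and is dropped whole
rowwise_total <- sum(prop + crop, na.rm = TRUE)                      # 11 + 33 = 44

# Column-wise form, as in the ranking tables: only the missing crop value is dropped
separate_total <- sum(prop, na.rm = TRUE) + sum(crop, na.rm = TRUE)  # 60 + 4 = 64
```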

Keep only events with positive impact

before <- storm_corr %>% filter(total_health > 0, total_econ > 0)

Remove only extremely large values (above the 99.5th percentile)

p_upper <- 0.995
thr_h <- quantile(log10(before$total_health + 1), probs = p_upper, na.rm = TRUE)
thr_e <- quantile(log10(before$total_econ  + 1), probs = p_upper, na.rm = TRUE)

after <- before %>%
  filter(log10(total_health + 1) <= thr_h,
         log10(total_econ  + 1) <= thr_e)
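
As a rough illustration of what this filter does on a small sample, the sketch below applies the same rule to six made-up totals (not taken from the storm data). With the default interpolation used by quantile(), the 99.5th percentile of a sample of up to about 200 values falls between the two largest values, so the rule removes only the single largest value of each variable, provided that value is not tied.

```r
# Made-up per-event totals (illustrative only)
toy <- c(3, 12, 40, 150, 900, 96000)

# Same threshold rule as in the report
p_upper <- 0.995
thr <- quantile(log10(toy + 1), probs = p_upper, na.rm = TRUE)

kept    <- toy[log10(toy + 1) <= thr]  # the five smaller values
dropped <- toy[log10(toy + 1) >  thr]  # only the largest value, 96000
```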

Pearson correlation

ct <- cor.test(after$total_health, after$total_econ, method = "pearson")

ggplot(after, aes(x = total_health, y = total_econ)) +
  geom_point(alpha = 0.7, color = "darkgreen") +
  scale_x_log10() + scale_y_log10() +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "Correlation Between Casualties and Economic Damages (without extreme outliers)",
       x = "Total Casualties (deaths + injuries)",
       y = "Total Economic Damages (USD)")
## `geom_smooth()` using formula = 'y ~ x'

ct
## 
##  Pearson's product-moment correlation
## 
## data:  after$total_health and after$total_econ
## t = 1.0794, df = 76, p-value = 0.2838
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.1024551  0.3362145
## sample estimates:
##       cor 
## 0.1228772
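
The test above uses the untransformed totals, while the scatter plot shows them on log-log axes. As a minimal sketch of how the choice of scale could be checked, the code below computes the Pearson correlation on both scales and the rank-based Spearman correlation, which does not change under a log transform. The numbers are made up for illustration; they are not results from the storm data, and this check is not part of the analysis reported here.

```r
# Made-up per-event totals (illustrative only, not from the storm data)
health <- c(2, 15, 40, 300, 1200, 9000)
econ   <- c(5e4, 2e6, 8e5, 3e7, 1e7, 4e9)

# Pearson correlation on the raw scale, as computed in the report
r_raw <- cor(health, econ, method = "pearson")

# Pearson correlation on the log10 scale, matching the axes of the plot
r_log <- cor(log10(health), log10(econ), method = "pearson")

# Spearman correlation depends only on ranks, so both scales give the same value
rho_raw <- cor(health, econ, method = "spearman")
rho_log <- cor(log10(health), log10(econ), method = "spearman")
```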

Discussion

The analysis shows that tornadoes cause the largest number of fatalities and injuries, while floods and hurricanes cause the greatest economic damage. Although one might expect a strong positive association between human losses and economic costs, the Pearson correlation across event types after removing extreme outliers was weak and not statistically significant (r = 0.12, 95% CI -0.10 to 0.34, p = 0.28, 78 event types). Taken together with the different top-ranked events, this suggests that health and economic impacts are driven by different kinds of weather events, which highlights the importance of mitigation strategies tailored to each type of risk.