Synopsis

Severe weather events can lead to devastating impacts on public health and the economy. This analysis uses the NOAA Storm Database to investigate:

1. Which event types are most harmful to population health (measured by fatalities and injuries)?
2. Which event types cause the greatest economic damage (measured by property and crop damage)?

The analysis starts by processing the raw NOAA dataset, transforming relevant variables, and grouping data by event type. The results are presented through figures and tables, showing that tornadoes are the leading cause of fatalities and injuries, while floods cause the greatest economic damage. The findings aim to provide clear insights for disaster preparedness and resource allocation.

1. Data Processing

1.1 Loading the Data

The code checks whether the required data file (StormData.csv.bz2) exists in the working directory. If the file is missing, it downloads it from the specified URL and saves it locally; otherwise, it skips the download. The file is then directly loaded into R as a data frame without any manipulation in between.

# Load required libraries
library(dplyr)

## Warning: package 'dplyr' was built under R version 4.4.2

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)

# Step 1: Define file URL and file path
file_url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
file_name <- "StormData.csv.bz2"

# Step 2: Get working directory and construct full file path
working_dir <- getwd()
file_path <- file.path(working_dir, file_name)

# Step 3: Check if file exists, if not, download it
if (!file.exists(file_path)) {
  cat("File not found in the working directory. Downloading...\n")
  download.file(file_url, destfile = file_path, method = "curl")
  cat("File downloaded successfully and saved as:", file_path, "\n")
} else {
  cat("File already exists in the working directory. Skipping download.\n")
}

## File already exists in the working directory. Skipping download.

# Step 4: Load the data from the working directory
cat("Loading data from: [hidden path]\n")

## Loading data from: [hidden path]

storm_data <- read.csv(file_path, stringsAsFactors = FALSE)

Dataset Structure and Overview

Dataset Dimensions: The dataset contains 902,297 rows and 37 columns.

Column Inspection: Column names and data types were inspected to understand the structure of the dataset.

Missing Values: Most columns have no missing values. COUNTYENDN is missing in every row (902,297) and F is missing in 843,563 rows; LATITUDE (47) and LATITUDE_E (40) have a small number of missing values. None of these columns is used in the analysis, so the gaps do not affect the results.

# Inspect the data
cat("Dataset dimensions:", dim(storm_data), "\n")

## Dataset dimensions: 902297 37

cat("Column names:\n")

## Column names:

print(names(storm_data))

##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"

cat("\nData types:\n")

## 
## Data types:

str(storm_data)

## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  "" "" "" "" ...
##  $ BGN_LOCATI: chr  "" "" "" "" ...
##  $ END_DATE  : chr  "" "" "" "" ...
##  $ END_TIME  : chr  "" "" "" "" ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  "" "" "" "" ...
##  $ END_LOCATI: chr  "" "" "" "" ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ WFO       : chr  "" "" "" "" ...
##  $ STATEOFFIC: chr  "" "" "" "" ...
##  $ ZONENAMES : chr  "" "" "" "" ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  "" "" "" "" ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

cat("\nMissing values in columns:\n")

## 
## Missing values in columns:

missing_values <- colSums(is.na(storm_data))
print(missing_values)

##    STATE__   BGN_DATE   BGN_TIME  TIME_ZONE     COUNTY COUNTYNAME      STATE 
##          0          0          0          0          0          0          0 
##     EVTYPE  BGN_RANGE    BGN_AZI BGN_LOCATI   END_DATE   END_TIME COUNTY_END 
##          0          0          0          0          0          0          0 
## COUNTYENDN  END_RANGE    END_AZI END_LOCATI     LENGTH      WIDTH          F 
##     902297          0          0          0          0          0     843563 
##        MAG FATALITIES   INJURIES    PROPDMG PROPDMGEXP    CROPDMG CROPDMGEXP 
##          0          0          0          0          0          0          0 
##        WFO STATEOFFIC  ZONENAMES   LATITUDE  LONGITUDE LATITUDE_E LONGITUDE_ 
##          0          0          0         47          0         40          0 
##    REMARKS     REFNUM 
##          0          0
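The full printout above is long. A shorter view that lists only the columns containing NA values could look like the sketch below. The data frame `df` and its values are made up for illustration; on the real data the same two lines would be applied to `storm_data`.

```r
# Toy data frame standing in for storm_data (values are made up)
df <- data.frame(
  EVTYPE = c("TORNADO", "FLOOD", "HAIL"),
  F = c(3L, NA, NA),
  LATITUDE = c(3040, NA, 3340),
  stringsAsFactors = FALSE
)

# Count NA values per column, then keep only columns with at least one NA
na_counts <- colSums(is.na(df))
na_only <- sort(na_counts[na_counts > 0], decreasing = TRUE)
print(na_only)
```

Note that `is.na` does not count empty strings, so character columns that contain only "" still report zero missing values.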

1.2 Data Selection and Cleaning

We focus on key columns relevant to the analysis:

Column	Description
`EVTYPE`	Event type, which categorizes severe weather events.
`FATALITIES`	Number of fatalities caused by the event (indicator of health impact).
`INJURIES`	Number of injuries caused by the event (indicator of health impact).
`PROPDMG`	Amount of property damage caused by the event.
`PROPDMGEXP`	Magnitude of property damage (e.g., thousands, millions, billions).
`CROPDMG`	Amount of crop damage caused by the event.
`CROPDMGEXP`	Magnitude of crop damage (e.g., thousands, millions, billions).

This selection reduces the dataset to only the variables needed for our analysis, improving computational efficiency and clarity.

When analyzing the event types, it became apparent that many values describe the same event but were recorded differently, through variations in spelling, pluralization, capitalization, or typos. For example: Strong Wind, STRONG WIND, Strong Winds, and STRONG WINDS. The event types were therefore converted to lowercase with tolower. This merges capitalization variants only; plural forms and misspellings remain separate categories.
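A quick check on the four spellings quoted above illustrates what lowercasing does and does not merge. This is a small standalone sketch; the vector `variants` is introduced here only for illustration.

```r
# The four spellings quoted in the text
variants <- c("Strong Wind", "STRONG WIND", "Strong Winds", "STRONG WINDS")

# Lowercasing merges the capitalization variants
lowered <- tolower(variants)
print(unique(lowered))

# Singular and plural forms are still counted as different categories
n_categories <- length(unique(lowered))
```

Merging the remaining variants would need an explicit mapping of event names, which this analysis does not attempt.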

The property and crop damage exponents (PROPDMGEXP and CROPDMGEXP) indicate damage magnitude (e.g., K for thousands, M for millions, B for billions). To standardize these values:

A function is defined to map exponents to numeric multipliers; any code other than K, M, or B is treated as a multiplier of 1. This function is applied to both columns to calculate standardized damage values.

The total economic damage (TOTAL_DAMAGE) is calculated as the sum of property and crop damage. This combined metric simplifies analysis and visualization by consolidating both types of damage into a single value.

# Select relevant columns
storm_data <- storm_data %>%
  select(EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)

# Normalize event types to lowercase for consistency
# To ensure consistency and avoid mismatches during grouping, we convert all event types to lowercase:
storm_data$EVTYPE <- tolower(storm_data$EVTYPE)

# Function to convert damage exponents to multipliers
# The property and crop damage columns include exponents (PROPDMGEXP and CROPDMGEXP) that
# represent the magnitude of damage in units like thousands (K), millions (M), or billions (B). To
# standardize these values:
# 1. We define a function to map the exponents to their corresponding numeric multipliers.
# 2. We apply this function to both the property and crop damage exponent columns.
# This step ensures that damage values can be accurately calculated as total monetary amounts.
convert_exp <- function(exp) {
  ifelse(exp %in% c("K", "k"), 1e3,
         ifelse(exp %in% c("M", "m"), 1e6,
                ifelse(exp %in% c("B", "b"), 1e9, 1)))
}

# Apply conversion and calculate total damage
# We compute the total economic damage (TOTAL_DAMAGE) as the sum of property and crop damage. This derived variable
# simplifies analysis and visualization by aggregating both types of damage into a single metric.
storm_data <- storm_data %>%
  mutate(
    PROPDMGEXP = convert_exp(PROPDMGEXP),
    CROPDMGEXP = convert_exp(CROPDMGEXP),
    TOTAL_DAMAGE = PROPDMG * PROPDMGEXP + CROPDMG * CROPDMGEXP
  )
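To make the arithmetic concrete, the sketch below applies the same conversion to a few hypothetical records. `convert_exp` is repeated from the code above so the snippet runs on its own; the data frame `toy` and its values are made up, and the base R column assignments stand in for the `mutate` call.

```r
# Copy of convert_exp from the analysis code, repeated so this runs standalone
convert_exp <- function(exp) {
  ifelse(exp %in% c("K", "k"), 1e3,
         ifelse(exp %in% c("M", "m"), 1e6,
                ifelse(exp %in% c("B", "b"), 1e9, 1)))
}

# Hypothetical records: damage amount plus exponent code
toy <- data.frame(
  PROPDMG = c(25, 2.5, 1.2, 10),
  PROPDMGEXP = c("K", "M", "B", ""),
  stringsAsFactors = FALSE
)

# Multiply each amount by its multiplier to get damage in USD
toy$multiplier <- convert_exp(toy$PROPDMGEXP)
toy$damage_usd <- toy$PROPDMG * toy$multiplier
print(toy)

# Rows whose code is outside K/M/B fall back to a multiplier of 1
n_fallback <- sum(!toupper(toy$PROPDMGEXP) %in% c("K", "M", "B"))
```

Counting the fallback rows on the real columns, before they are overwritten, would show how much of the data relies on the default multiplier of 1.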

1.3 Grouping Data by Event Type

The purpose of grouping data by event type is to analyze and compare the impact of different event types. By summarizing key metrics (total fatalities, injuries, and economic damage) and calculating a combined health impact metric, it becomes easier to identify which event types have the most significant effects on human health and the economy.

# Summarize data by event type
grouped_data <- storm_data %>%
  group_by(EVTYPE) %>%
  summarize(
    total_fatalities = sum(FATALITIES, na.rm = TRUE),
    total_injuries = sum(INJURIES, na.rm = TRUE),
    total_damage = sum(TOTAL_DAMAGE, na.rm = TRUE)
  ) %>%
  mutate(health_impact = total_fatalities + total_injuries) %>%
  arrange(desc(health_impact))
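The same aggregation can be traced by hand on a small example. The sketch below uses base R `aggregate` instead of dplyr so that it runs without extra packages; the data frame `toy` and all of its values are hypothetical.

```r
# Hypothetical per-event records (values are made up)
toy <- data.frame(
  EVTYPE = c("tornado", "flood", "tornado", "heat", "flood"),
  FATALITIES = c(1, 0, 2, 4, 1),
  INJURIES = c(15, 2, 6, 0, 0),
  TOTAL_DAMAGE = c(25000, 5e6, 2500, 0, 1e6),
  stringsAsFactors = FALSE
)

# Sum each metric within every event type
grouped <- aggregate(
  cbind(FATALITIES, INJURIES, TOTAL_DAMAGE) ~ EVTYPE,
  data = toy, FUN = sum
)

# Combined health metric, then sort from most to least harmful
grouped$health_impact <- grouped$FATALITIES + grouped$INJURIES
grouped <- grouped[order(-grouped$health_impact), ]
print(grouped)
```

In this combined metric one fatality carries the same weight as one injury, which is a simplification to keep in mind when reading the ranking.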

2. Results

2.1 Events Most Harmful to Population Health

# Top 10 events by health impact
top_health <- grouped_data %>% slice_max(order_by = health_impact, n = 10)

# Plot
ggplot(top_health, aes(x = reorder(EVTYPE, -health_impact), y = health_impact)) +
  geom_bar(stat = "identity", fill = "darkred") +
  labs(
    title = "Top 10 Events by Health Impact (Fatalities + Injuries)",
    x = "Event Type", y = "Health Impact",
    caption = "Data Source: NOAA Storm Database"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Tornadoes are the most harmful event to population health, far surpassing other event types in fatalities and injuries.

2.2 Top 10 Events with the Greatest Economic Consequences

# Top 10 events by economic damage
top_damage <- grouped_data %>% slice_max(order_by = total_damage, n = 10)

# Plot
ggplot(top_damage, aes(x = reorder(EVTYPE, -total_damage), y = total_damage / 1e9)) +
  geom_bar(stat = "identity", fill = "blue") +
  labs(
    title = "Top 10 Events by Economic Damage",
    x = "Event Type", y = "Total Damage (in Billion USD)",
    caption = "Data Source: NOAA Storm Database"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Floods cause the highest economic damage, followed by hurricanes/typhoons and tornadoes.

Discussion and Conclusion

The analysis shows that tornadoes have the greatest impact on public health, accounting for the most combined fatalities and injuries. In terms of economic damage, floods are the costliest events when property and crop damage are added together.

These findings underscore the need for targeted disaster preparedness strategies, prioritizing resources for tornado-prone and flood-affected areas.

References

NOAA Storm Database Documentation. Last accessed on January 15, 2025.
NOAA Dataset Link. Last accessed on January 15, 2025.
R Packages: dplyr, ggplot2, knitr

NOAA Storm Data Analysis – Impacts of Severe Weather Events

Reproducible Research - Course Project 2 (Data Science - Johns Hopkins University)

Marcus Naeher

2025-01-15