Severe weather events can lead to devastating impacts on public health and the economy. This analysis uses the NOAA Storm Database to investigate:
1. Which event types are most harmful to population health (measured by fatalities and injuries)?
2. Which event types cause the greatest economic damage (measured by property and crop damage)?
The analysis starts by processing the raw NOAA dataset, transforming relevant variables, and grouping data by event type. The results are presented through figures and tables, showing that tornadoes are the leading cause of fatalities and injuries, while floods cause the greatest economic damage. The findings aim to provide clear insights for disaster preparedness and resource allocation.
The code checks whether the required data file (StormData.csv.bz2) exists in the working directory. If the file is missing, it downloads it from the specified URL and saves it locally; otherwise, it skips the download. The file is then directly loaded into R as a data frame without any manipulation in between.
# Load required libraries
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.4.2
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
# Step 1: Define file URL and file path
file_url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
file_name <- "StormData.csv.bz2"
# Step 2: Get working directory and construct full file path
working_dir <- getwd()
file_path <- file.path(working_dir, file_name)
# Step 3: Check if file exists, if not, download it
if (!file.exists(file_path)) {
  cat("File not found in the working directory. Downloading...\n")
  download.file(file_url, destfile = file_path, method = "curl")
  cat("File downloaded successfully and saved as:", file_path, "\n")
} else {
  cat("File already exists in the working directory. Skipping download.\n")
}
## File already exists in the working directory. Skipping download.
# Step 4: Load the data from the working directory
cat("Loading data from: [hidden path]\n")
## Loading data from: [hidden path]
storm_data <- read.csv(file_path, stringsAsFactors = FALSE)
Dataset Dimensions: The dataset contains 902,297 rows and 37 columns.
Column Inspection: Column names and data types were inspected to understand the structure of the dataset.
Missing Values: Most columns have no missing values. COUNTYENDN is missing in every row (902,297 values) and F is missing in most rows (843,563 values); LATITUDE (47) and LATITUDE_E (40) have a small number of missing values. None of these columns is used in the analysis, so the missing values do not affect the results.
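With 37 columns, the full vector of missing-value counts is easy to misread. The sketch below shows one way to keep only the columns that actually contain missing values. It runs on a small invented data frame (`toy` and `missing_only` are hypothetical names, not part of the original analysis), but the same two lines would apply to `storm_data`.

```r
# Hypothetical toy data (values invented for illustration only)
toy <- data.frame(
  F        = c(3L, NA, NA, 2L),
  LATITUDE = c(3040, NA, 3340, 3458),
  EVTYPE   = c("TORNADO", "HAIL", "HAIL", "TORNADO"),
  stringsAsFactors = FALSE
)

# Count NA values per column, as in the analysis
missing_values <- colSums(is.na(toy))

# Keep only columns with at least one NA, largest count first
missing_only <- sort(missing_values[missing_values > 0], decreasing = TRUE)
print(missing_only)
```

Filtering to non-zero counts makes each column name sit directly next to its count, which reduces the risk of attributing a count to the wrong column.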
# Inspect the data
cat("Dataset dimensions:", dim(storm_data), "\n")
## Dataset dimensions: 902297 37
cat("Column names:\n")
## Column names:
print(names(storm_data))
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
cat("\nData types:\n")
##
## Data types:
str(storm_data)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
cat("\nMissing values in columns:\n")
##
## Missing values in columns:
missing_values <- colSums(is.na(storm_data))
print(missing_values)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 0 0 0 0 0 0 0
## EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 0 0 0 0 0 0 0
## COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F
## 902297 0 0 0 0 0 843563
## MAG FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 0 0 0 0 0 0 0
## WFO STATEOFFIC ZONENAMES LATITUDE LONGITUDE LATITUDE_E LONGITUDE_
## 0 0 0 47 0 40 0
## REMARKS REFNUM
## 0 0
We focus on key columns relevant to the analysis:
| Column | Description |
|---|---|
| EVTYPE | Event type, which categorizes severe weather events. |
| FATALITIES | Number of fatalities caused by the event (indicator of health impact). |
| INJURIES | Number of injuries caused by the event (indicator of health impact). |
| PROPDMG | Amount of property damage caused by the event. |
| PROPDMGEXP | Magnitude of property damage (e.g., thousands, millions, billions). |
| CROPDMG | Amount of crop damage caused by the event. |
| CROPDMGEXP | Magnitude of crop damage (e.g., thousands, millions, billions). |
This selection reduces the dataset to only the variables needed for our analysis, improving computational efficiency and clarity.
While inspecting the event type data (EVTYPE), it became apparent that many values refer to the same kind of event but were recorded differently, with variations in spelling, pluralization, capitalization, or typos. For example: Strong Wind, STRONG WIND, Strong Winds, and STRONG WINDS. The event types were therefore converted to lowercase using tolower. This merges variants that differ only in capitalization; plural forms and misspellings remain separate categories.
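As an illustration of how far normalization could be taken, the following sketch extends the lowercase step with whitespace cleanup and one example rule for a plural form. The helper `normalize_evtype` and its rules are assumptions made for this example; the analysis itself applies only `tolower`.

```r
# Hypothetical helper (not part of the original analysis): a stricter
# normalization than tolower() alone.
normalize_evtype <- function(x) {
  x <- tolower(x)                     # unify capitalization
  x <- trimws(x)                      # drop leading/trailing whitespace
  x <- gsub("[[:space:]]+", " ", x)   # collapse runs of whitespace
  x <- sub("winds$", "wind", x)       # example rule: fold one plural form
  x
}

raw_types  <- c("Strong Wind", "STRONG WIND", "Strong Winds", " STRONG  WINDS")
normalized <- normalize_evtype(raw_types)

# Number of distinct categories with tolower() only vs. the stricter helper
print(length(unique(tolower(raw_types))))
print(length(unique(normalized)))
```

Rules like the plural fold are dataset-specific and would need to be reviewed against the actual EVTYPE values before being applied more broadly.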
The property and crop damage exponents (PROPDMGEXP and CROPDMGEXP) indicate damage magnitude (e.g., K for thousands, M for millions, B for billions). To standardize these values, a function is defined that maps exponents to numeric multipliers. This function is applied to both columns to calculate standardized damage values.
The total economic damage (TOTAL_DAMAGE) is calculated as the sum of property and crop damage. This combined metric simplifies analysis and visualization by consolidating both types of damage into a single value.
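To make the arithmetic concrete, here is a small worked example using the same exponent mapping on three invented records (the `toy` data frame is hypothetical and for illustration only). Note that under this mapping any code other than K, M, or B, including an empty string, is treated as a multiplier of 1.

```r
# Same mapping as the analysis: K/M/B (either case) become 1e3/1e6/1e9;
# every other code, including the empty string, falls back to 1.
convert_exp <- function(exp) {
  ifelse(exp %in% c("K", "k"), 1e3,
    ifelse(exp %in% c("M", "m"), 1e6,
      ifelse(exp %in% c("B", "b"), 1e9, 1)))
}

# Hypothetical toy records (values invented for illustration only)
toy <- data.frame(
  PROPDMG    = c(25, 2.5, 1.2),
  PROPDMGEXP = c("K", "M", "B"),
  CROPDMG    = c(0, 10, 5),
  CROPDMGEXP = c("", "k", "M"),
  stringsAsFactors = FALSE
)

# Total damage in USD = property damage + crop damage, each scaled
toy$TOTAL_DAMAGE <- toy$PROPDMG * convert_exp(toy$PROPDMGEXP) +
  toy$CROPDMG * convert_exp(toy$CROPDMGEXP)
print(toy)
```

For the second record, for instance, the total is 2.5 × 1e6 + 10 × 1e3 = 2,510,000 USD.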
# Select relevant columns
storm_data <- storm_data %>%
  select(EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)
# Normalize event types to lowercase for consistency
# To ensure consistency and avoid mismatches during grouping, we convert all event types to lowercase:
storm_data$EVTYPE <- tolower(storm_data$EVTYPE)
# Function to convert damage exponents to multipliers
# The property and crop damage columns include exponents (PROPDMGEXP and CROPDMGEXP) that
# represent the magnitude of damage in units like thousands (K), millions (M), or billions (B). To
# standardize these values:
# 1. We define a function to map the exponents to their corresponding numeric multipliers.
# 2. We apply this function to both the property and crop damage exponent columns.
# This step ensures that damage values can be accurately calculated as total monetary amounts.
convert_exp <- function(exp) {
  ifelse(exp %in% c("K", "k"), 1e3,
    ifelse(exp %in% c("M", "m"), 1e6,
      ifelse(exp %in% c("B", "b"), 1e9, 1)))
}
# Apply conversion and calculate total damage
# We compute the total economic damage (TOTAL_DAMAGE) as the sum of property and crop damage. This derived variable
# simplifies analysis and visualization by aggregating both types of damage into a single metric.
storm_data <- storm_data %>%
  mutate(
    PROPDMGEXP = convert_exp(PROPDMGEXP),
    CROPDMGEXP = convert_exp(CROPDMGEXP),
    TOTAL_DAMAGE = PROPDMG * PROPDMGEXP + CROPDMG * CROPDMGEXP
  )
The purpose of grouping data by event type is to analyze and compare the impact of different event types. By summarizing key metrics (total fatalities, injuries, and economic damage) and calculating a combined health impact metric, it becomes easier to identify which event types have the most significant effects on human health and the economy.
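The grouping logic can be cross-checked without dplyr. The sketch below reproduces the same idea (sum per event type, add a combined health metric, sort descending) in base R on a few invented records. The `toy` data and the use of `aggregate` are assumptions for illustration, not the method used in the analysis.

```r
# Hypothetical toy records (values invented for illustration only)
toy <- data.frame(
  EVTYPE       = c("tornado", "flood", "tornado", "flood", "heat"),
  FATALITIES   = c(1, 0, 2, 1, 3),
  INJURIES     = c(15, 2, 6, 0, 1),
  TOTAL_DAMAGE = c(25000, 2.5e6, 2500, 1e9, 0),
  stringsAsFactors = FALSE
)

# Sum each metric within event type (base R counterpart of
# group_by() + summarize()). The formula interface drops rows
# containing NA by default; the toy data has none.
grouped <- aggregate(
  cbind(FATALITIES, INJURIES, TOTAL_DAMAGE) ~ EVTYPE,
  data = toy, FUN = sum
)

# Combined health metric, then sort from most to least harmful
grouped$health_impact <- grouped$FATALITIES + grouped$INJURIES
grouped <- grouped[order(-grouped$health_impact), ]
print(grouped)
```

Because fatalities and injuries are simply added, an injury counts as much as a fatality in `health_impact`; the ranking should be read with that simplification in mind.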
# Summarize data by event type
grouped_data <- storm_data %>%
  group_by(EVTYPE) %>%
  summarize(
    total_fatalities = sum(FATALITIES, na.rm = TRUE),
    total_injuries = sum(INJURIES, na.rm = TRUE),
    total_damage = sum(TOTAL_DAMAGE, na.rm = TRUE)
  ) %>%
  mutate(health_impact = total_fatalities + total_injuries) %>%
  arrange(desc(health_impact))
# Top 10 events by health impact
top_health <- grouped_data %>% slice_max(order_by = health_impact, n = 10)
# Plot
ggplot(top_health, aes(x = reorder(EVTYPE, -health_impact), y = health_impact)) +
  geom_bar(stat = "identity", fill = "darkred") +
  labs(
    title = "Top 10 Events by Health Impact (Fatalities + Injuries)",
    x = "Event Type", y = "Health Impact",
    caption = "Data Source: NOAA Storm Database"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Tornadoes are the most harmful event to population health, far surpassing other event types in fatalities and injuries.
# Top 10 events by economic damage
top_damage <- grouped_data %>% slice_max(order_by = total_damage, n = 10)
# Plot
ggplot(top_damage, aes(x = reorder(EVTYPE, -total_damage), y = total_damage / 1e9)) +
  geom_bar(stat = "identity", fill = "blue") +
  labs(
    title = "Top 10 Events by Economic Damage",
    x = "Event Type", y = "Total Damage (in Billion USD)",
    caption = "Data Source: NOAA Storm Database"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Floods cause the highest economic damage, followed by hurricanes/typhoons and tornadoes.
The analysis highlights the significant impact of tornadoes on public health, contributing to the most fatalities and injuries. In terms of economic damage, floods stand out as the costliest events, primarily due to their widespread effects on property and agriculture.
These findings underscore the need for targeted disaster preparedness strategies, prioritizing resources for tornado-prone and flood-affected areas.
- NOAA Dataset Link. Last accessed on January 15, 2025.
- R Packages: dplyr, ggplot2, knitr. Last accessed on January 15, 2025.