This analysis explores the NOAA Storm Database to assess the impact of severe weather events in the United States from 1950 to 2011. The study focuses on fatalities, injuries, and property damage to determine which weather events have the most significant consequences for public health and the economy. The dataset includes information on storm characteristics, location, and recorded damages. Due to incomplete records in the earlier years, more recent data is expected to be more reliable. Using R programming, we clean and process the data, visualize trends, and identify the most harmful weather events. The results highlight which types of storms require the most attention for disaster prevention and mitigation. Findings from this analysis can help policymakers and emergency response teams better allocate resources.

Data

The data for this assignment come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. The file can be downloaded from the course web site.
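As a small illustration of the file format, the sketch below writes a tiny bzip2-compressed CSV to a temporary file and reads it back through a `bzfile()` connection. The file contents and the names `tmp` and `toy` are made up for the example; only `bzfile()` and `read.csv()` are taken from the analysis itself.

```r
# Write a tiny CSV through a bzip2 connection (made-up rows, not real storm data)
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
writeLines(c("EVTYPE,FATALITIES", "TORNADO,1", "HAIL,0"), con)
close(con)

# Read it back; the bzfile() connection decompresses while read.csv() parses
toy <- read.csv(bzfile(tmp), stringsAsFactors = FALSE)
print(toy)

# Remove the temporary file
unlink(tmp)
```

Reading through the connection avoids having to decompress the file to disk first.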

There is also some documentation of the database available, which describes how some of the variables are constructed and defined.

The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
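One way to check how sparse the early years are is to count recorded events per year. The sketch below is a minimal example under two assumptions: that the begin dates are strings in month/day/year form (for example "4/18/1950 0:00:00"), and that a short made-up vector, `toy_dates`, stands in for the real date column. `events_per_year` is a hypothetical helper name.

```r
# Hypothetical helper: count recorded events per calendar year.
# Assumes date strings in month/day/year form; trailing time text is ignored.
events_per_year <- function(bgn_date) {
  parsed <- as.Date(bgn_date, format = "%m/%d/%Y")
  years <- as.integer(format(parsed, "%Y"))
  table(years)
}

# Made-up dates standing in for the real begin-date column
toy_dates <- c("4/18/1950 0:00:00", "2/20/1951 0:00:00",
               "6/8/1951 0:00:00", "11/15/2011 0:00:00")

counts <- events_per_year(toy_dates)
print(counts)
```

Applied to the full dataset, a table like this would show whether the event counts in the early decades are low enough to justify restricting the analysis to later years.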

Research Questions

  1. Which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
  2. Which types of events have the most significant economic consequences (as indicated by property and crop damage)?

Software Environment Information

sessionInfo()
## R version 4.3.1 (2023-06-16 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19045)
## 
## Matrix products: default
## 
## 
## locale:
## [1] LC_COLLATE=Spanish_Argentina.utf8  LC_CTYPE=Spanish_Argentina.utf8   
## [3] LC_MONETARY=Spanish_Argentina.utf8 LC_NUMERIC=C                      
## [5] LC_TIME=Spanish_Argentina.utf8    
## 
## time zone: Europe/Paris
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] digest_0.6.33     R6_2.5.1          fastmap_1.1.1     xfun_0.40        
##  [5] cachem_1.0.8      knitr_1.44        htmltools_0.5.6   rmarkdown_2.25   
##  [9] cli_3.6.1         sass_0.4.7        jquerylib_0.1.4   compiler_4.3.1   
## [13] rstudioapi_0.15.0 tools_4.3.1       evaluate_0.21     bslib_0.5.1      
## [17] yaml_2.3.7        rlang_1.1.1       jsonlite_1.8.7

R Packages Used for the Analysis

The R packages plyr, dplyr, ggplot2, gridExtra, knitr, and kableExtra are required to run the analysis.

library(plyr)
library(dplyr)
library(ggplot2)
library(gridExtra)
library(knitr)
library(kableExtra)

Download Source File Data

The analysis begins by downloading the source file, if it is not already present, and loading the data into the variable storm.

Url<-"https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
File<-"./stormdata.csv.bz2"
if (!file.exists(File)) {
    download.file(Url, destfile=File, method = "auto")  # Download the file
    message("File downloaded.")  # Inform the user that the file has been downloaded
} else {
  message("File already exists. Skipping download.")  # Inform user that the file is already available
}
## File already exists. Skipping download.
storm <- read.csv(bzfile(File), stringsAsFactors = FALSE)

# Capture the output of str(storm) into a variable
storm_str <- capture.output(str(storm))

# Convert the output into a dataframe to use with kable
storm_str_df <- data.frame(Output = storm_str)

# Display the output as a table using kable
kable(storm_str_df, caption = "Structure of the 'storm' object")%>%
  kable_styling("striped", full_width = F)
Structure of the ‘storm’ object
Output
‘data.frame’: 902297 obs. of 37 variables:
$ STATE__ : num 1 1 1 1 1 1 1 1 1 1 …
$ BGN_DATE : chr “4/18/1950 0:00:00” “4/18/1950 0:00:00” “2/20/1951 0:00:00” “6/8/1951 0:00:00” …
$ BGN_TIME : chr “0130” “0145” “1600” “0900” …
$ TIME_ZONE : chr “CST” “CST” “CST” “CST” …
$ COUNTY : num 97 3 57 89 43 77 9 123 125 57 …
$ COUNTYNAME: chr “MOBILE” “BALDWIN” “FAYETTE” “MADISON” …
$ STATE : chr “AL” “AL” “AL” “AL” …
$ EVTYPE : chr “TORNADO” “TORNADO” “TORNADO” “TORNADO” …
$ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 …
$ BGN_AZI : chr “” “” “” “” …
$ BGN_LOCATI: chr “” “” “” “” …
$ END_DATE : chr “” “” “” “” …
$ END_TIME : chr “” “” “” “” …
$ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 …
$ COUNTYENDN: logi NA NA NA NA NA NA …
$ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 …
$ END_AZI : chr “” “” “” “” …
$ END_LOCATI: chr “” “” “” “” …
$ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 …
$ WIDTH : num 100 150 123 100 150 177 33 33 100 100 …
$ F : int 3 2 2 2 2 2 2 1 3 3 …
$ MAG : num 0 0 0 0 0 0 0 0 0 0 …
$ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 …
$ INJURIES : num 15 0 2 2 2 6 1 0 14 0 …
$ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 …
$ PROPDMGEXP: chr “K” “K” “K” “K” …
$ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 …
$ CROPDMGEXP: chr “” “” “” “” …
$ WFO : chr “” “” “” “” …
$ STATEOFFIC: chr “” “” “” “” …
$ ZONENAMES : chr “” “” “” “” …
$ LATITUDE : num 3040 3042 3340 3458 3412 …
$ LONGITUDE : num 8812 8755 8742 8626 8642 …
$ LATITUDE_E: num 3051 0 0 0 0 …
$ LONGITUDE_: num 8806 0 0 0 0 …
$ REMARKS : chr “” “” “” “” …
$ REFNUM : num 1 2 3 4 5 6 7 8 9 10 …

Data Processing

Before answering the research questions, the dataset is cleaned and processed to extract the relevant variables for analysis. Specifically, we focus on the event type (EVTYPE) and on four outcome variables: FATALITIES, INJURIES, PROPDMG, and CROPDMG.

Data Cleaning and Variable Selection

Why These Variables?

To answer the research questions, we need to identify which events had the greatest impact on health (fatalities and injuries) and which had the most significant economic consequences (property and crop damage). These four variables—FATALITIES, INJURIES, PROPDMG, and CROPDMG—are crucial because they directly measure the consequences of each event.

We focus on FATALITIES and INJURIES for health, as these directly affect public health outcomes. For economic consequences, PROPDMG and CROPDMG give us insights into the financial losses caused by each event.
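PROPDMG and CROPDMG hold only the numeric part of each damage estimate; the magnitude is recorded as a code in PROPDMGEXP and CROPDMGEXP (for example "K" in the structure listing above). Before deciding how to decode them, it helps to see which codes actually occur. The sketch below tabulates a made-up vector, `toy_exp`, standing in for the real column; treating only K, M, and B as recognized magnitude letters is an assumption of this example.

```r
# Made-up exponent codes standing in for storm$PROPDMGEXP
toy_exp <- c("K", "K", "M", "", "k", "B", "5", "?", "K")

# Tabulate the codes after upper-casing, so "k" and "K" are counted together
code_counts <- sort(table(toupper(toy_exp)), decreasing = TRUE)
print(code_counts)

# Share of records whose code is one of the letters K, M, B
# (the set of recognized letters is an assumption of this sketch)
recognized <- toupper(toy_exp) %in% c("K", "M", "B")
share_recognized <- mean(recognized)
print(share_recognized)
```

If the share of unrecognized codes is small, the choice of how to treat them has little effect on the totals.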

Data Cleaning Steps

We apply a series of data cleaning procedures to ensure the dataset is suitable for analysis.

  1. Standardizing Event Types (EVTYPE):
    The dataset contains inconsistent and duplicate entries for event types due to variations in spelling and formatting (e.g., “TSTM WIND” vs. “THUNDERSTORM WIND”). To address this, we normalize event types by converting all text to uppercase and merging similar categories.

  2. Handling Missing and Zero Values:
    For the health analysis, an event is kept only if it records at least one fatality or at least one injury; rows where both values are missing or zero are removed, as they do not contribute to the totals. Missing property and crop damage values are ignored when summing (na.rm = TRUE).

  3. Aggregating Data by Event Type:
    We aggregate the data to calculate the total number of fatalities, injuries, property damage, and crop damage for each event type.
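The standardization in step 1 can be sketched as a small helper. This is an illustration on made-up labels, not the exact procedure used below: `normalize_evtype` is a hypothetical name, and trimming whitespace and expanding every occurrence of "TSTM" (rather than only "TSTM WIND") are assumptions of the example.

```r
# Hypothetical helper: normalize free-text event labels
normalize_evtype <- function(x) {
  x <- toupper(trimws(x))        # upper-case, drop leading/trailing spaces
  x <- gsub("\\s+", " ", x)      # collapse runs of whitespace to one space
  gsub("TSTM", "THUNDERSTORM", x, fixed = TRUE)  # expand the abbreviation
}

# Made-up labels showing the kinds of variation being merged
toy_types <- c("tstm wind", " TSTM WIND", "Thunderstorm  Wind", "TORNADO")
normalized <- normalize_evtype(toy_types)
print(normalized)

# Number of distinct categories before and after normalization
c(before = length(unique(toy_types)), after = length(unique(normalized)))
```

Comparing the number of distinct labels before and after is a quick way to judge how much a given rule consolidates the categories.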

Calculate the fatalities and injuries separately

# Standardizing the event type to uppercase and cleaning inconsistencies
storm$EVTYPE <- toupper(storm$EVTYPE)

# Combining similar event types (example: grouping "TSTM WIND" and "THUNDERSTORM WIND")
storm$EVTYPE <- gsub("TSTM WIND", "THUNDERSTORM WIND", storm$EVTYPE)

# Keeping only rows that record at least one fatality or at least one injury
storm_clean <- storm[(!is.na(storm$FATALITIES) & storm$FATALITIES > 0) | (!is.na(storm$INJURIES) & storm$INJURIES > 0), ]


# Aggregating data by EVTYPE
fatalities_data <- storm_clean %>%
  group_by(EVTYPE) %>%
  summarise(Total_Fatalities = sum(FATALITIES, na.rm = TRUE))

# Ordering events by fatalities
fatalities_data <- fatalities_data %>%
  arrange(desc(Total_Fatalities))

kable(head(fatalities_data, n = 10), caption = "Top 10 Events by Fatalities")%>%
  kable_styling("striped", full_width = F)
Top 10 Events by Fatalities
EVTYPE Total_Fatalities
TORNADO 5633
EXCESSIVE HEAT 1903
FLASH FLOOD 978
HEAT 937
LIGHTNING 816
THUNDERSTORM WIND 637
FLOOD 470
RIP CURRENT 368
HIGH WIND 248
AVALANCHE 224
# Aggregating data by EVTYPE for injuries using dplyr
injuries_data <- storm_clean %>%
  group_by(EVTYPE) %>%
  summarise(Total_Injuries = sum(INJURIES, na.rm = TRUE)) %>%
  arrange(desc(Total_Injuries))


kable(head(injuries_data, n = 10), caption = "Top 10 Events by Injuries") %>%
  kable_styling("striped", full_width = F)
Top 10 Events by Injuries
EVTYPE Total_Injuries
TORNADO 91346
THUNDERSTORM WIND 8445
FLOOD 6789
EXCESSIVE HEAT 6525
LIGHTNING 5230
HEAT 2100
ICE STORM 1975
FLASH FLOOD 1777
HAIL 1361
WINTER STORM 1321
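# Select relevant columns for economic analysis, including the magnitude codes
storm_economic_data <- storm %>%
  select(EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)

# PROPDMG and CROPDMG hold only the numeric part of each estimate; the
# magnitude is stored in PROPDMGEXP and CROPDMGEXP.
# K = thousands, M = millions, B = billions. Treating H as hundreds and
# every other code (blank, digits, "+", "-", "?") as 1 is an assumption.
exp_multiplier <- function(code) {
  mult <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)[toupper(code)]
  unname(ifelse(is.na(mult), 1, mult))
}

# Convert PROPDMG and CROPDMG to millions of dollars for easier interpretation
storm_economic_data <- storm_economic_data %>%
  mutate(
    PROPDMG = PROPDMG * exp_multiplier(PROPDMGEXP) / 1e6,  # Dollars to millions
    CROPDMG = CROPDMG * exp_multiplier(CROPDMGEXP) / 1e6   # Dollars to millions
  )

# Summarize economic impact by event type
economic_impact <- storm_economic_data %>%
  group_by(EVTYPE) %>%
  summarize(
    Total_Property_Damage = sum(PROPDMG, na.rm = TRUE), 
    Total_Crop_Damage = sum(CROPDMG, na.rm = TRUE)
  ) %>%
  arrange(desc(Total_Property_Damage), desc(Total_Crop_Damage))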

Results

Health Impact Analysis

To answer the first research question, we focus on the total number of fatalities and injuries per event type. The cleaned data for these two variables is aggregated into separate plots. The graphs below show the top weather events based on fatalities and injuries.

# Plot for fatalities
fatalities_plot <- ggplot(fatalities_data[1:10,], aes(x = reorder(EVTYPE, -Total_Fatalities))) +
  geom_bar(aes(y = Total_Fatalities), stat = "identity", fill = "lightcoral") +
  theme_minimal() +
  labs(x = "Event Type", y = "Total Fatalities", title = "Top 10 Weather Events by Fatalities") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1.2))

# Plot for injuries
injuries_plot <- ggplot(injuries_data[1:10,], aes(x = reorder(EVTYPE, -Total_Injuries))) +
  geom_bar(aes(y = Total_Injuries), stat = "identity", fill = "lightblue") +
  theme_minimal() +
  labs(x = "Event Type", y = "Total Injuries", title = "Top 10 Weather Events by Injuries") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1.2))

# Combine the two plots into a single panel (2 columns)
grid.arrange(fatalities_plot, injuries_plot, ncol = 2)
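To put the bars in proportion, the top-10 totals reported in the two tables above can be re-entered by hand and compared. The sketch below computes the tornado share of each top-10 total and the combined fatalities for the two heat categories. The shares are relative to the top 10 only, not to all events in the database.

```r
# Top-10 totals copied from the fatalities and injuries tables in this report
fatalities <- c("TORNADO" = 5633, "EXCESSIVE HEAT" = 1903, "FLASH FLOOD" = 978,
                "HEAT" = 937, "LIGHTNING" = 816, "THUNDERSTORM WIND" = 637,
                "FLOOD" = 470, "RIP CURRENT" = 368, "HIGH WIND" = 248,
                "AVALANCHE" = 224)
injuries <- c("TORNADO" = 91346, "THUNDERSTORM WIND" = 8445, "FLOOD" = 6789,
              "EXCESSIVE HEAT" = 6525, "LIGHTNING" = 5230, "HEAT" = 2100,
              "ICE STORM" = 1975, "FLASH FLOOD" = 1777, "HAIL" = 1361,
              "WINTER STORM" = 1321)

# Tornado share of the top-10 totals, in percent
tornado_fatality_share <- round(100 * fatalities[["TORNADO"]] / sum(fatalities), 1)
tornado_injury_share <- round(100 * injuries[["TORNADO"]] / sum(injuries), 1)
print(c(fatalities = tornado_fatality_share, injuries = tornado_injury_share))
# fatalities: 46.1, injuries: 72

# Combined fatalities for the two heat categories
heat_fatalities <- fatalities[["EXCESSIVE HEAT"]] + fatalities[["HEAT"]]
print(heat_fatalities)  # 2840
```

Tornadoes account for roughly 46% of top-10 fatalities and 72% of top-10 injuries, while the two heat categories together account for 2,840 fatalities.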

Economic Impact Analysis

To answer the second research question, we focus on the property damage (PROPDMG) and crop damage (CROPDMG) caused by each weather event type. The damage values are converted to a consistent scale (millions of dollars) to make them comparable.

# Plotting the economic damage data
# Reshape the top 10 events to long format so the two damage types dodge side by side
top_econ <- economic_impact[1:10, ]
econ_long <- rbind(
  data.frame(EVTYPE = top_econ$EVTYPE, Type = "Property Damage", Damage = top_econ$Total_Property_Damage),
  data.frame(EVTYPE = top_econ$EVTYPE, Type = "Crop Damage", Damage = top_econ$Total_Crop_Damage)
)
econ_long$EVTYPE <- factor(econ_long$EVTYPE, levels = top_econ$EVTYPE)  # Keep the property-damage order

ggplot(econ_long, aes(x = EVTYPE, y = Damage, fill = Type)) +
  geom_col(position = "dodge") +
  scale_fill_manual(values = c("Property Damage" = "lightgreen", "Crop Damage" = "lightpink")) +
  theme_minimal() +
  labs(x = "Event Type", y = "Damage in Millions of Dollars", title = "Top 10 Weather Events by Property and Crop Damage") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  guides(fill = guide_legend(title = "Damage Type"))
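Because the summary table is ordered by Total_Property_Damage, event types whose losses are mostly agricultural can fall outside the top 10. One alternative is to rank by the combined total. The sketch below shows the idea on a made-up table, `toy_impact`, with the same column names as economic_impact; the values are invented for illustration and are not results of this analysis.

```r
# Made-up summary table with the same columns as economic_impact
toy_impact <- data.frame(
  EVTYPE = c("FLOOD", "DROUGHT", "HAIL"),
  Total_Property_Damage = c(100, 5, 20),
  Total_Crop_Damage = c(10, 60, 15)
)

# Ranking by property damage alone
by_property <- toy_impact$EVTYPE[order(-toy_impact$Total_Property_Damage)]
print(by_property)

# Ranking by combined property and crop damage
toy_impact$Total_Damage <- toy_impact$Total_Property_Damage +
  toy_impact$Total_Crop_Damage
ranked <- toy_impact[order(-toy_impact$Total_Damage), ]
print(ranked$EVTYPE)
```

In the toy table, the crop-heavy event moves from last place to second once crop damage is included in the ranking.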

Conclusion

From the fatalities and injuries plots, we can observe that tornadoes are responsible for the highest number of both fatalities (5,633) and injuries (91,346). In the fatalities ranking, tornadoes are followed by excessive heat (1,903), flash floods (978), heat (937), and lightning (816). In the injuries ranking, tornadoes are followed by thunderstorm wind (8,445), floods (6,789), excessive heat (6,525), and lightning (5,230).

From the economic impact graphs, we see that hurricanes and floods result in the most significant property damage, whereas droughts and floods cause substantial crop damage. These trends highlight the dual nature of severe weather events: while some may have more immediate health impacts, others cause long-lasting financial consequences, particularly in agriculture and infrastructure.

  • Health Impact: Tornadoes are the most harmful event type, causing the highest number of both fatalities and injuries. Heat events (EXCESSIVE HEAT and HEAT) rank next for fatalities, while thunderstorm wind and floods rank next for injuries. These event types require focused efforts on public health preparedness.
  • Economic Impact: Hurricanes, floods, and droughts result in the highest property and crop damage. These weather events present long-term challenges for economic recovery and resource allocation, especially in affected regions.

This analysis provides valuable insight into the most devastating weather events, assisting in targeting resources effectively for both health protection and economic recovery.