The purpose of this report is to analyze the data from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, which tracks major storms and weather events in the United States. The database provides valuable insights into the characteristics, impact, and consequences of various weather events, including fatalities, injuries, and damage to property and crops.
In this analysis, we focused on understanding the types of weather events that pose the greatest danger to human life in terms of fatalities and injuries, as well as identifying the events that cause significant damage to property and crops. By examining the data, we aimed to gain a better understanding of the most hazardous weather phenomena and their impact on the population, infrastructure, and agricultural sector.
The preliminary steps are to download the relevant dataset, unzip it and read it into R. For this analysis, the tidyverse package needs to be loaded.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.0 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
file_path <- "stormData.bz2"
download.file(url, destfile = file_path)
output_file <- "stormData.csv"
system(paste("bzip2 -d", file_path))
file.rename(sub(".bz2$", "", file_path), output_file)
## [1] TRUE
data <- read.csv(output_file)
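The decompression step above depends on a `bzip2` executable being available on the system path. A minimal sketch of a more portable route, assuming only base R, is to pass the compressed file straight to `read.csv()`, since R's file connections detect bzip2 compression when reading. The toy file and the names `toy`, `toy_path` and `toy_read` are illustrative stand-ins, not part of the original analysis.

```r
# Toy stand-in for the downloaded archive (hypothetical data, not the NOAA file)
toy <- data.frame(EVTYPE = c("TORNADO", "FLOOD"), FATALITIES = c(3, 1))
toy_path <- tempfile(fileext = ".csv.bz2")

# Write a bzip2-compressed CSV through a bzfile() connection
con <- bzfile(toy_path, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() opens the path with file(), which detects the compression itself
toy_read <- read.csv(toy_path)
print(toy_read)
```

With the real download, the same idea would reduce to a single `read.csv(file_path)` call, with no external tool and no renaming step.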
After these preliminary steps, the imported data frame was reduced to the variables needed for the analysis: the event type (EVTYPE), fatalities (FATALITIES), injuries (INJURIES), property damage (PROPDMG) and its multiplier (PROPDMGEXP), and crop damage (CROPDMG) and its multiplier (CROPDMGEXP).
sub_data <- data %>%
  select(EVTYPE, FATALITIES:CROPDMGEXP)
Next, each multiplier code was mapped to its numeric value (“K” for 1,000, “M” for 1,000,000, and “B” for 1,000,000,000) and multiplied by the corresponding damage figure. This produced two extra columns, PROPDMG_comp and CROPDMG_comp, where the “_comp” suffix stands for “computed”. The lookup matches only these three upper-case codes exactly; any other value in a multiplier column, including a blank, yields NA. The columns used to compute the values were then removed. The resulting data frame, “data_4analysis”, has 902297 observations and 5 variables: type of event, number of fatalities, number of injuries, and the computed damage to property and to crops.
# Map each exponent code to its numeric multiplier (exact, case-sensitive match)
conversion_factors <- c("K" = 1000, "M" = 1000000, "B" = 1000000000)
# Computed property damage; codes missing from the lookup yield NA
sub_data$PROPDMG_comp <- sub_data$PROPDMG * conversion_factors[sub_data$PROPDMGEXP]
# Computed crop damage, using the same lookup
sub_data$CROPDMG_comp <- sub_data$CROPDMG * conversion_factors[sub_data$CROPDMGEXP]
data_4analysis <- sub_data %>%
  select(-(PROPDMG:CROPDMGEXP))
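Because a named-vector lookup is case-sensitive, a lower-case code or a blank entry falls through to NA. The following is a sketch of a more forgiving variant, not the method used in this report: the helper name `expand_damage` is hypothetical, and treating an unrecognized code as a multiplier of 1 is an assumption that would change the totals reported here.

```r
# Hypothetical helper: scale a damage value by its exponent code
expand_damage <- function(value, exp_code) {
  factors <- c("K" = 1e3, "M" = 1e6, "B" = 1e9)
  # Upper-case the code so that "k" and "K" are treated alike
  multiplier <- unname(factors[toupper(exp_code)])
  # Assumption: a blank or unrecognized code means the value is already in dollars
  multiplier[is.na(multiplier)] <- 1
  value * multiplier
}

# Toy inputs covering an upper-case, a lower-case and a blank code
toy_value <- c(25, 2.5, 1, 10)
toy_code  <- c("K", "m", "B", "")
toy_result <- expand_damage(toy_value, toy_code)
print(toy_result)
```

Whether a multiplier of 1 is the right default is a judgment call; the alternative used in the report, counting such rows as zero, is the more conservative choice.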
NAs were then replaced with zeros. In the computed columns, NAs arise wherever the multiplier code was not matched by the lookup, so those damage figures count as zero. A more sophisticated approach, such as imputing the median for the same event type over a given period, was considered but judged not worth the additional effort given the exploratory nature of the analysis.
# Check for NAs in the dataframe
has_na <- anyNA(data_4analysis)
# Replace NAs with 0
if (has_na) {
  data_4analysis[is.na(data_4analysis)] <- 0
}
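Before overwriting missing values, it can help to see how many there are and where they sit. The sketch below, on a hypothetical toy data frame, counts NAs per column and then restricts the replacement to the numeric damage columns; the names `toy`, `na_counts` and `damage_cols` are illustrative only.

```r
# Toy data frame with missing computed damage (hypothetical values)
toy <- data.frame(
  EVTYPE = c("FLOOD", "HAIL", "TORNADO"),
  PROPDMG_comp = c(5000, NA, 2e6),
  CROPDMG_comp = c(NA, NA, 1000)
)

# Count missing values per column before overwriting them
na_counts <- colSums(is.na(toy))
print(na_counts)

# Replace NAs with 0 in the numeric damage columns only
damage_cols <- c("PROPDMG_comp", "CROPDMG_comp")
toy[damage_cols][is.na(toy[damage_cols])] <- 0
print(toy)
```

Limiting the replacement to named columns avoids silently writing a 0 into a text column such as EVTYPE if it ever contained an NA.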
The data show that tornadoes are by far the most dangerous weather events in terms of human impact: they account for the highest number of both injuries (91,346) and fatalities (5,633) among all recorded event types. These violent, rapidly rotating columns of air can cause extensive damage and pose severe risks to human life. To keep the plots readable, only the top 25 event types are displayed.
# Group and summarize the data by EVTYPE, calculating the sum of injuries
injuries_by_event <- data_4analysis %>%
  group_by(EVTYPE) %>%
  summarise(total_injuries = sum(INJURIES)) %>%
  arrange(desc(total_injuries))
# Print the result or access specific rows if needed
print(injuries_by_event)
## # A tibble: 985 × 2
## EVTYPE total_injuries
## <chr> <dbl>
## 1 TORNADO 91346
## 2 TSTM WIND 6957
## 3 FLOOD 6789
## 4 EXCESSIVE HEAT 6525
## 5 LIGHTNING 5230
## 6 HEAT 2100
## 7 ICE STORM 1975
## 8 FLASH FLOOD 1777
## 9 THUNDERSTORM WIND 1488
## 10 HAIL 1361
## # ℹ 975 more rows
# Subset the data for the first 25 event types
subset_data_injuries <- injuries_by_event[1:25, ]
# Create the bar plot
ggplot(subset_data_injuries, aes(x = EVTYPE, y = total_injuries)) +
  geom_bar(stat = "identity", fill = "blue") +
  xlab("Weather Event Type") +
  ylab("Total Injuries") +
  ggtitle("Total Injuries by Weather Event Type (Top 25)") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
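When a character column is mapped to the x axis, ggplot2 arranges the bars alphabetically, so the ranking computed with `arrange()` is not reflected in the plot. One possible way to keep the bars in descending order, sketched here on hypothetical toy totals, is to select the top rows with `slice_max()` and turn the labels into a factor ordered by the total with `reorder()`. The object names and values below are illustrative, not taken from the storm database.

```r
library(dplyr)
library(ggplot2)

# Toy totals (hypothetical values)
toy_totals <- data.frame(
  EVTYPE = c("HAIL", "TORNADO", "FLOOD", "LIGHTNING"),
  total_injuries = c(10, 90, 40, 25)
)

# Keep the top 3 rows by total, then order the factor levels by descending total
top_events <- toy_totals %>%
  slice_max(total_injuries, n = 3) %>%
  mutate(EVTYPE = reorder(EVTYPE, -total_injuries))

# Bars now follow the factor levels: TORNADO, FLOOD, LIGHTNING
p <- ggplot(top_events, aes(x = EVTYPE, y = total_injuries)) +
  geom_col(fill = "blue") +
  labs(x = "Weather Event Type", y = "Total Injuries") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```

Calling `print(p)` renders the plot. `geom_col()` is used here as the shorthand for `geom_bar(stat = "identity")`.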
# Group and summarize the data by EVTYPE, calculating the sum of fatalities
fatalities_by_event <- data_4analysis %>%
  group_by(EVTYPE) %>%
  summarise(total_fatalities = sum(FATALITIES)) %>%
  arrange(desc(total_fatalities))
# Print the result or access specific rows if needed
print(fatalities_by_event)
## # A tibble: 985 × 2
## EVTYPE total_fatalities
## <chr> <dbl>
## 1 TORNADO 5633
## 2 EXCESSIVE HEAT 1903
## 3 FLASH FLOOD 978
## 4 HEAT 937
## 5 LIGHTNING 816
## 6 TSTM WIND 504
## 7 FLOOD 470
## 8 RIP CURRENT 368
## 9 HIGH WIND 248
## 10 AVALANCHE 224
## # ℹ 975 more rows
# Subset the data for the first 25 event types
subset_data_fatalities <- fatalities_by_event[1:25, ]
# Create the bar plot
ggplot(subset_data_fatalities, aes(x = EVTYPE, y = total_fatalities)) +
  geom_bar(stat = "identity", fill = "cyan") +
  xlab("Weather Event Type") +
  ylab("Total Fatalities") +
  ggtitle("Total Fatalities by Weather Event Type (Top 25)") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
For property damage, floods emerged as the most damaging weather event, with roughly $144.7 billion in recorded damage, about twice the figure for the next category (HURRICANE/TYPHOON, roughly $69.3 billion). Floods can result from heavy rainfall, storm surge, or the failure of water containment systems, leading to the inundation of residential and commercial areas. Their destructive power often causes significant damage to buildings, infrastructure, and personal belongings. Again, only the top 25 event types are visualized.
# Group and summarize the data by EVTYPE, calculating the sum of PROPDMG_comp
damage_by_event <- data_4analysis %>%
  group_by(EVTYPE) %>%
  summarise(total_damage = sum(PROPDMG_comp)) %>%
  arrange(desc(total_damage))
# Print the result or access specific rows if needed
print(damage_by_event)
## # A tibble: 985 × 2
## EVTYPE total_damage
## <chr> <dbl>
## 1 FLOOD 144657709800
## 2 HURRICANE/TYPHOON 69305840000
## 3 TORNADO 56925660480
## 4 STORM SURGE 43323536000
## 5 FLASH FLOOD 16140811510
## 6 HAIL 15727366720
## 7 HURRICANE 11868319010
## 8 TROPICAL STORM 7703890550
## 9 WINTER STORM 6688497250
## 10 HIGH WIND 5270046260
## # ℹ 975 more rows
# Subset the data for the first 25 event types
subset_data_damage <- damage_by_event[1:25, ]
# Create the bar plot
ggplot(subset_data_damage, aes(x = EVTYPE, y = total_damage)) +
  geom_bar(stat = "identity", fill = "red") +
  xlab("Weather Event Type") +
  ylab("Total Property Damage (USD)") +
  ggtitle("Total Property Damage by Weather Event Type (Top 25)") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
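The tables in this report list some hazards under more than one label, for example TSTM WIND and THUNDERSTORM WIND, or HURRICANE and HURRICANE/TYPHOON, which splits their totals across rows. The sketch below shows one way such variants could be merged before summarising. It rests on the assumption that each pair describes the same hazard, it was not part of this analysis, and the damage values are hypothetical.

```r
library(dplyr)

# Toy rows using labels seen in the report's tables (damage values are hypothetical)
toy <- data.frame(
  EVTYPE = c("TSTM WIND", "THUNDERSTORM WIND", "HURRICANE",
             "HURRICANE/TYPHOON", "FLOOD"),
  total_damage = c(4, 3, 11, 69, 144)
)

# Assumption: each pair of labels refers to the same hazard and can be merged
merged <- toy %>%
  mutate(EVTYPE = case_when(
    EVTYPE == "TSTM WIND" ~ "THUNDERSTORM WIND",
    EVTYPE == "HURRICANE" ~ "HURRICANE/TYPHOON",
    TRUE ~ EVTYPE
  )) %>%
  group_by(EVTYPE) %>%
  summarise(total_damage = sum(total_damage)) %>%
  arrange(desc(total_damage))
print(merged)
```

Applied to the full dataset, a mapping like this could change the ranking of the combined categories, so any merge rule should be stated alongside the results.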
The analysis also revealed that droughts have the most detrimental impact on crops, with roughly $14.0 billion in recorded crop damage, more than twice the figure for floods (roughly $5.7 billion). Drought conditions, characterized by prolonged water shortage and aridity, can severely hamper agricultural productivity. The lack of water needed for crop growth and maintenance leads to substantial crop losses, affecting both farmers and the overall food supply chain. Only the top 25 event types are visualized for readability.
# Group and summarize the data by EVTYPE, calculating the sum of CROPDMG_comp
crop_damage_by_event <- data_4analysis %>%
  group_by(EVTYPE) %>%
  summarise(total_crop_damage = sum(CROPDMG_comp)) %>%
  arrange(desc(total_crop_damage))
# Print the result or access specific rows if needed
print(crop_damage_by_event)
## # A tibble: 985 × 2
## EVTYPE total_crop_damage
## <chr> <dbl>
## 1 DROUGHT 13972566000
## 2 FLOOD 5661968450
## 3 RIVER FLOOD 5029459000
## 4 ICE STORM 5022113500
## 5 HAIL 3025537450
## 6 HURRICANE 2741910000
## 7 HURRICANE/TYPHOON 2607872800
## 8 FLASH FLOOD 1421317100
## 9 EXTREME COLD 1292973000
## 10 FROST/FREEZE 1094086000
## # ℹ 975 more rows
# Subset the data for the first 25 event types
subset_data_crop_damage <- crop_damage_by_event[1:25, ]
# Create the bar plot
ggplot(subset_data_crop_damage, aes(x = EVTYPE, y = total_crop_damage)) +
  geom_bar(stat = "identity", fill = "orange") +
  xlab("Weather Event Type") +
  ylab("Total Crop Damage") +
  ggtitle("Total Crop Damage by Weather Event Type (Top 25)") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
This analysis highlights the critical findings regarding the most dangerous weather events for the population, property damage, and crop loss. Tornadoes were identified as the most hazardous for human life, causing the highest number of fatalities and injuries. Floods were found to be the most damaging to property, often resulting in extensive destruction of buildings and infrastructure. Droughts were recognized as the leading cause of crop damage, adversely affecting agricultural production and food security.
Understanding the risks associated with different weather events is crucial for disaster preparedness, emergency response planning, and the implementation of preventive measures. By recognizing the specific weather events that pose the greatest threats, policymakers, emergency management authorities, and communities can work together to mitigate risks, enhance resilience, and ensure the safety and well-being of individuals, as well as the protection of property and agricultural resources.
It is important to note that this analysis is based on the available data from the NOAA storm database and should be interpreted within its limitations. Factors such as varying data collection methods, reporting biases, and regional disparities may influence the results. Nonetheless, this analysis provides valuable insights into the overall patterns and trends of weather-related hazards in the United States.