In this analysis, we explore the NOAA Storm Database to gain insights into severe weather events in the United States. The primary goal is to answer two questions: (1) Which types of events are most harmful to population health? and (2) Which types of events have the greatest economic consequences? By examining the database, we aim to provide useful information for government and municipal managers responsible for preparing for severe weather events and prioritizing resources. The analysis follows a structured approach: it starts with data processing, including loading and cleaning the raw data, and then presents the results through visualizations and statistical summaries. It identifies the event types that pose the highest risk to population health and those with the most significant economic impacts. By understanding these patterns, decision-makers can allocate resources efficiently and mitigate the adverse effects of severe weather events.
To ensure reproducibility, the analysis starts from the raw CSV file. The following steps describe how the data were loaded into R and processed for analysis:
We begin by loading the necessary packages for our analysis. In this case, we will use the tidyverse package, which provides a suite of tools for data manipulation and visualization.
# Load the required packages
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Next, we download the compressed data file containing the NOAA Storm Database ("repdata_data_StormData.csv.bz2") from the course website, decompress it, and read the resulting CSV file into R.
# Set the URL of the compressed data file
url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
# Set the path where the decompressed CSV file will be saved (in the working directory)
output_file <- "repdata_data_StormData.csv"
# Download the compressed file
download.file(url, destfile = "repdata_data_StormData.csv.bz2")
# Decompress the file
bz2file <- bzfile("repdata_data_StormData.csv.bz2", "r")
decompressed <- readLines(bz2file, n = -1L)
close(bz2file)
# Write the decompressed data to a CSV file
writeLines(decompressed, con = output_file)
# Read the decompressed CSV file
data <- read.csv(output_file, stringsAsFactors = FALSE)
Once the data is loaded, we can perform some initial exploration to understand its structure and contents. We can use functions such as head(), summary(), and str() to get a glimpse of the data set.
# Display the first few rows of the data set
head(data)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE EVTYPE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL TORNADO
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL TORNADO
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL TORNADO
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL TORNADO
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL TORNADO
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL TORNADO
## BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
## 1 0 0 NA
## 2 0 0 NA
## 3 0 0 NA
## 4 0 0 NA
## 5 0 0 NA
## 6 0 0 NA
## END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES PROPDMG
## 1 0 14.0 100 3 0 0 15 25.0
## 2 0 2.0 150 2 0 0 0 2.5
## 3 0 0.1 123 2 0 0 2 25.0
## 4 0 0.0 100 2 0 0 2 2.5
## 5 0 0.0 150 2 0 0 2 2.5
## 6 0 1.5 177 2 0 0 6 2.5
## PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE LONGITUDE
## 1 K 0 3040 8812
## 2 K 0 3042 8755
## 3 K 0 3340 8742
## 4 K 0 3458 8626
## 5 K 0 3412 8642
## 6 K 0 3450 8748
## LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3051 8806 1
## 2 0 0 2
## 3 0 0 3
## 4 0 0 4
## 5 0 0 5
## 6 0 0 6
# Summarize the main characteristics of the data set
summary(data)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE
## Min. : 1.0 Length:902297 Length:902297 Length:902297
## 1st Qu.:19.0 Class :character Class :character Class :character
## Median :30.0 Mode :character Mode :character Mode :character
## Mean :31.2
## 3rd Qu.:45.0
## Max. :95.0
##
## COUNTY COUNTYNAME STATE EVTYPE
## Min. : 0.0 Length:902297 Length:902297 Length:902297
## 1st Qu.: 31.0 Class :character Class :character Class :character
## Median : 75.0 Mode :character Mode :character Mode :character
## Mean :100.6
## 3rd Qu.:131.0
## Max. :873.0
##
## BGN_RANGE BGN_AZI BGN_LOCATI END_DATE
## Min. : 0.000 Length:902297 Length:902297 Length:902297
## 1st Qu.: 0.000 Class :character Class :character Class :character
## Median : 0.000 Mode :character Mode :character Mode :character
## Mean : 1.484
## 3rd Qu.: 1.000
## Max. :3749.000
##
## END_TIME COUNTY_END COUNTYENDN END_RANGE
## Length:902297 Min. :0 Mode:logical Min. : 0.0000
## Class :character 1st Qu.:0 NA's:902297 1st Qu.: 0.0000
## Mode :character Median :0 Median : 0.0000
## Mean :0 Mean : 0.9862
## 3rd Qu.:0 3rd Qu.: 0.0000
## Max. :0 Max. :925.0000
##
## END_AZI END_LOCATI LENGTH WIDTH
## Length:902297 Length:902297 Min. : 0.0000 Min. : 0.000
## Class :character Class :character 1st Qu.: 0.0000 1st Qu.: 0.000
## Mode :character Mode :character Median : 0.0000 Median : 0.000
## Mean : 0.2301 Mean : 7.503
## 3rd Qu.: 0.0000 3rd Qu.: 0.000
## Max. :2315.0000 Max. :4400.000
##
## F MAG FATALITIES INJURIES
## Min. :0.0 Min. : 0.0 Min. : 0.0000 Min. : 0.0000
## 1st Qu.:0.0 1st Qu.: 0.0 1st Qu.: 0.0000 1st Qu.: 0.0000
## Median :1.0 Median : 50.0 Median : 0.0000 Median : 0.0000
## Mean :0.9 Mean : 46.9 Mean : 0.0168 Mean : 0.1557
## 3rd Qu.:1.0 3rd Qu.: 75.0 3rd Qu.: 0.0000 3rd Qu.: 0.0000
## Max. :5.0 Max. :22000.0 Max. :583.0000 Max. :1700.0000
## NA's :843563
## PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## Min. : 0.00 Length:902297 Min. : 0.000 Length:902297
## 1st Qu.: 0.00 Class :character 1st Qu.: 0.000 Class :character
## Median : 0.00 Mode :character Median : 0.000 Mode :character
## Mean : 12.06 Mean : 1.527
## 3rd Qu.: 0.50 3rd Qu.: 0.000
## Max. :5000.00 Max. :990.000
##
## WFO STATEOFFIC ZONENAMES LATITUDE
## Length:902297 Length:902297 Length:902297 Min. : 0
## Class :character Class :character Class :character 1st Qu.:2802
## Mode :character Mode :character Mode :character Median :3540
## Mean :2875
## 3rd Qu.:4019
## Max. :9706
## NA's :47
## LONGITUDE LATITUDE_E LONGITUDE_ REMARKS
## Min. :-14451 Min. : 0 Min. :-14455 Length:902297
## 1st Qu.: 7247 1st Qu.: 0 1st Qu.: 0 Class :character
## Median : 8707 Median : 0 Median : 0 Mode :character
## Mean : 6940 Mean :1452 Mean : 3509
## 3rd Qu.: 9605 3rd Qu.:3549 3rd Qu.: 8735
## Max. : 17124 Max. :9706 Max. :106220
## NA's :40
## REFNUM
## Min. : 1
## 1st Qu.:225575
## Median :451149
## Mean :451149
## 3rd Qu.:676723
## Max. :902297
##
# Explore the structure of the data set
str(data)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
The raw dataset might contain missing values, inconsistencies, or irrelevant columns. We need to clean and transform the data to ensure its quality and suitability for analysis. Depending on the specific analysis requirements, this step may include tasks such as removing unnecessary columns, handling missing values, and transforming variables.
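For example, the property and crop damage amounts (PROPDMG, CROPDMG) are stored together with separate exponent codes (PROPDMGEXP, CROPDMGEXP). Below is a minimal sketch of how those codes could be decoded into dollar amounts; it assumes the common convention that "K", "M", and "B" denote thousands, millions, and billions, treats any other code as a multiplier of one, and is not applied to the results reported later in this analysis.
# Sketch: decode the damage exponent codes into numeric multipliers
# (assumption: "K" = thousands, "M" = millions, "B" = billions; any other
#  code is treated as a multiplier of 1; not used in the analysis below)
exp_to_multiplier <- function(exp_code) {
  case_when(
    toupper(exp_code) == "K" ~ 1e3,
    toupper(exp_code) == "M" ~ 1e6,
    toupper(exp_code) == "B" ~ 1e9,
    TRUE ~ 1
  )
}
# Sketch: express property and crop damage in dollars
damage_in_dollars <- data %>%
  mutate(prop_damage_usd = PROPDMG * exp_to_multiplier(PROPDMGEXP),
         crop_damage_usd = CROPDMG * exp_to_multiplier(CROPDMGEXP))
For the questions addressed here, we work directly with the reported damage values, as in the code below.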
# Clean and filter the data for relevant variables
cleaned_data <- data %>%
select(EVTYPE, FATALITIES, INJURIES, PROPDMG) %>%
filter(!is.na(EVTYPE), !is.na(FATALITIES), !is.na(INJURIES), !is.na(PROPDMG))
# Calculate the total number of fatalities and injuries for each event type
event_health <- cleaned_data %>%
group_by(EVTYPE) %>%
summarise(total_fatalities = sum(FATALITIES),
total_injuries = sum(INJURIES)) %>%
arrange(desc(total_fatalities))
# Identify the event types with the highest impact on population health
top_health_events <- event_health[1:5, ]
# Calculate the total property damage for each event type
event_economic <- cleaned_data %>%
group_by(EVTYPE) %>%
summarise(total_prop_damage = sum(PROPDMG)) %>%
arrange(desc(total_prop_damage))
# Identify the event types with the greatest economic consequences
top_economic_events <- event_economic[1:5, ]
# Most Harmful Events with Respect to Population Health
cat("Most Harmful Events with Respect to Population Health:\n")
## Most Harmful Events with Respect to Population Health:
top_health_events
## # A tibble: 5 × 3
## EVTYPE total_fatalities total_injuries
## <chr> <dbl> <dbl>
## 1 TORNADO 5633 91346
## 2 EXCESSIVE HEAT 1903 6525
## 3 FLASH FLOOD 978 1777
## 4 HEAT 937 2100
## 5 LIGHTNING 816 5230
# Events with the Greatest Economic Consequences
cat("\nEvents with the Greatest Economic Consequences:\n")
##
## Events with the Greatest Economic Consequences:
top_economic_events
## # A tibble: 5 × 2
## EVTYPE total_prop_damage
## <chr> <dbl>
## 1 TORNADO 3212258.
## 2 FLASH FLOOD 1420125.
## 3 TSTM WIND 1335966.
## 4 FLOOD 899938.
## 5 THUNDERSTORM WIND 876844.
The analysis revealed that tornadoes are by far the most harmful events with respect to population health, with 5,633 fatalities and 91,346 injuries recorded, followed by excessive heat, flash floods, heat, and lightning.
Considering the economic consequences, tornadoes again caused the highest total property damage, followed by flash floods, thunderstorm wind (recorded under both "TSTM WIND" and "THUNDERSTORM WIND"), and floods.
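Note that "TSTM WIND" and "THUNDERSTORM WIND" are two spellings of the same event category. A minimal sketch of how such duplicate labels could be merged before ranking is given below; the recoding is only illustrative and was not applied to the results above.
# Sketch: merge duplicate event-type spellings before ranking
# (assumption: "TSTM WIND" and "THUNDERSTORM WIND" describe the same
#  phenomenon; this recoding is illustrative and not applied above)
event_economic_merged <- cleaned_data %>%
  mutate(EVTYPE = if_else(EVTYPE == "TSTM WIND", "THUNDERSTORM WIND", EVTYPE)) %>%
  group_by(EVTYPE) %>%
  summarise(total_prop_damage = sum(PROPDMG)) %>%
  arrange(desc(total_prop_damage))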
To assist government and municipal managers in preparing for severe weather events, resources should be prioritized according to the potential impact on both population health and the economy.
# Plotting Most Harmful Events
library(ggplot2)
# Combine the computed summaries of the most harmful events
most_harmful <- bind_rows(
  top_health_events %>%
    transmute(Event_Type = EVTYPE,
              Category = "Population Health",
              Value = total_fatalities),
  top_economic_events %>%
    transmute(Event_Type = EVTYPE,
              Category = "Economic Consequences",
              Value = total_prop_damage)
)
# Create the bar plot
plot <- ggplot(most_harmful, aes(x = Event_Type, y = Value, fill = Category)) +
geom_bar(stat = "identity", position = "dodge", width = 0.7) +
labs(x = "Event Type", y = "Value", title = "Most Harmful Events with Respect to Population Health and Economic Consequences") +
scale_fill_manual(values = c("Population Health" = "#1f77b4", "Economic Consequences" = "#ff7f0e")) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
# Save the plot as a file
ggsave("path/to/your/figure.png", plot, width = 10, height = 6)
# Print the plot
plot
The figure above presents a visual representation of the most harmful events with respect to population health and events with the greatest economic consequences. This plot can provide a comprehensive overview to aid in decision-making and resource allocation for future severe weather event preparedness.
Please note that specific recommendations are beyond the scope of this analysis, but the results provide valuable insights for prioritizing resources and taking appropriate actions to minimize the impact of severe weather events.