This report focuses on an exploration of the U.S. National Oceanic and Atmospheric Administration (NOAA) storm database. This database records the characteristics of major storms and weather-related events in the United States, including data on fatalities, property damage, and crop damage, from April of 1950 to November 2011. The primary objective is to conduct an exploratory analysis to determine which types of events resulted in the highest number of fatalities and which types caused the most significant economic impacts.
To access the data, refer to the following link: https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2. The database is provided in a compressed format, so this process starts with unzipping the file and then loading the data into R. After loading the data, the head() function will let me visualize the top of the database and its variables for further transformation and analysis.
library(R.utils)
bunzip2("repdata_data_StormData.csv.bz2", "repdata_data_StormData.csv", remove = FALSE)
storm_data <- read.csv("./repdata_data_StormData.csv")
head(storm_data, n=3)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE EVTYPE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL TORNADO
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL TORNADO
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL TORNADO
## BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
## 1 0 0 NA
## 2 0 0 NA
## 3 0 0 NA
## END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES PROPDMG
## 1 0 14.0 100 3 0 0 15 25.0
## 2 0 2.0 150 2 0 0 0 2.5
## 3 0 0.1 123 2 0 0 2 25.0
## PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE LONGITUDE
## 1 K 0 3040 8812
## 2 K 0 3042 8755
## 3 K 0 3340 8742
## LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3051 8806 1
## 2 0 0 2
## 3 0 0 3
Now i can identify that the variables of interests are:
EVTYPE: The type of event (whether it is a tornado, a flood, an avalanche, etc.).
FATALITIES: The total number of deaths for each specific event.
PROPDMG: Indicates the monetary value of property damage incurred per event.
PROPDMGEXP: The value of PROPDMG (K = 1,000 , M = 1,000,000 , B = 1,000,000,000)
CROPDMG: Represents the monetary value of crop damage incurred per event.
CROPDMGEXP: The value of CROPDMG (K = 1,000 , M = 1,000,000 , B = 1,000,000,000)
The next step involves creating a new data set named summary_by_event, which will contain the following four variables:
EVTYPE: The complete list of event types.
event_count: The frequency of each event type.
total_fatalities: The total number of fatalities for each event type.
risk_of_death: The average number of fatalities per event, for each event type (total_fatalities / event_count).
Once the summary_by_event data set is created, it will be arranged in descending order based on the total_fatalities variable, and the top 10 events with the highest number of fatalities will be displayed.
library(dplyr)
summary_by_event <- storm_data %>%
group_by(EVTYPE) %>%
summarise(
event_count = n(),
total_fatalities = sum(FATALITIES, na.rm = TRUE),
risk_of_death = total_fatalities / event_count
) %>%
arrange(desc(total_fatalities)) %>% # Order by total_fatalities in descending order
ungroup()
print(summary_by_event)
## # A tibble: 985 × 4
## EVTYPE event_count total_fatalities risk_of_death
## <chr> <int> <dbl> <dbl>
## 1 TORNADO 60652 5633 0.0929
## 2 EXCESSIVE HEAT 1678 1903 1.13
## 3 FLASH FLOOD 54277 978 0.0180
## 4 HEAT 767 937 1.22
## 5 LIGHTNING 15754 816 0.0518
## 6 TSTM WIND 219940 504 0.00229
## 7 FLOOD 25326 470 0.0186
## 8 RIP CURRENT 470 368 0.783
## 9 HIGH WIND 20212 248 0.0123
## 10 AVALANCHE 386 224 0.580
## # ℹ 975 more rows
We can observe which are the ten types of events with the highest number of fatalities, but when we compare with the risk_of_death, interestingly the event types with most deaths differ from the deadliness of a particular event.
Now I’m going to create a panel of two bar plots to compare. One bar plot for the overall fatalities and a second plot for the average number of fatalities per event.
## FIGURE 1
library(ggplot2)
library(patchwork)
# 1st Plot: Top 10 EVTYPE by total fatalities
top_10_fatalities <- summary_by_event %>%
arrange(desc(total_fatalities)) %>%
slice(1:10)
max_fatalities <- max(top_10_fatalities$total_fatalities)
plot1 <- ggplot(top_10_fatalities, aes(x = reorder(EVTYPE, total_fatalities), y = total_fatalities)) +
geom_bar(stat = "identity", fill = "steelblue") +
geom_text(aes(x = EVTYPE, y = max_fatalities / 2, label = total_fatalities), # Centered in the middle of the plot
hjust = 0.5, color = "black", fontface = "bold") +
coord_flip() +
labs(title = "Overall Fatalities", x = "Event Type", y = "Total Fatalities") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5))
# 2nd Plot: Specific EVTYPE with their respective risk_of_death
specific_events <- summary_by_event %>%
filter(EVTYPE %in% c("TORNADO", "EXCESSIVE HEAT", "FLASH FLOOD", "HEAT", "LIGHTNING",
"TSTM WIND", "FLOOD", "RIP CURRENT", "HIGH WIND", "AVALANCHE"))
max_risk_of_death <- max(specific_events$risk_of_death)
plot2 <- ggplot(specific_events, aes(x = reorder(EVTYPE, risk_of_death), y = risk_of_death)) +
geom_bar(stat = "identity", fill = "red") +
geom_text(aes(x = EVTYPE, y = max_risk_of_death / 2, label = round(risk_of_death, 2)), # Centered in the middle of the plot
hjust = 0.5, color = "black", fontface = "bold") +
coord_flip() +
labs(title = "Total Fatalities per Event", x = "Event Type", y = "Fatalities per Event") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5))
# Combine the two plots into a panel
plot1 + plot2
In the first plot we can see that TORNADO events have caused the highest number of fatalities registered, according to NOAA, with 5,633 deaths recorded across the U.S. EXCESSIVE HEAT ranks second with 1,903, followed by FLASH FLOOD with 978. However, considering the fatality rate per event, the HEAT is the deadliest type of event, followed by the EXCESSIVE HEAT and the RIP CURRENT.
Given that the type of event with more deaths registered are not necessarily the deadliest, i will now focus on identifying which are the overall deadliest event types.
# FIGURE 2
top_10_risk <- summary_by_event %>%
arrange(desc(risk_of_death)) %>%
slice(1:10)
ggplot(top_10_risk, aes(x = reorder(EVTYPE, risk_of_death), y = risk_of_death)) +
geom_bar(stat = "identity", fill = "red") +
geom_text(aes(label = round(risk_of_death, 2)), hjust = 0.5, color = "black", fontface = "bold") +
coord_flip() +
labs(title = "Most Fatalities per Event", x = "Event Type", y = "Total Fatalities per Event") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5))
Here we observe that none of the types of events of the first group is in this list. The deadliest of all type of events is the TORNADOES, TSTM WIND, HAIL (tornado with thunderstorm wind and hail), a phenomenon that each time it happens kills 25 people on average, followed by the COLD AND SNOW with 14 deaths per event on average.
To answer this question, I’m going to create two new variables into the original data set created for this analysis (storm_data):
REAL_PROP: The number of PROPDMG times their value in the corresponding category of PROPDMGEXP (PROPDMG*PROPDMGEXP)
REAL_CROP: The number of CROPDMG times their value in the corresponding category of CROPDMGEXP (CROPDMG*CROPDMGEXP)
storm_data <- storm_data %>%
mutate(
REAL_PROP = case_when(
PROPDMGEXP == "K" ~ PROPDMG * 1000,
PROPDMGEXP == "M" ~ PROPDMG * 1000000,
PROPDMGEXP == "B" ~ PROPDMG * 1000000000,
TRUE ~ 0
),
REAL_CROP = case_when(
CROPDMGEXP == "K" ~ CROPDMG * 1000,
CROPDMGEXP == "M" ~ CROPDMG * 1000000,
CROPDMGEXP == "B" ~ CROPDMG * 1000000000,
TRUE ~ 0
)
)
Next, i will create a new data set with the selected variables relevant for this analysis, and then add a new variable called TOTALDMG, which is the sum of REAL_PROP + REAL_CROP. Finally, similar to the step i did earlier, I will create a new data set with the list of EVTYPE and their respective sum of TOTALDMG by EVTYPE in a new variable called FINAL_DAMAGE.
new_storm_data <- storm_data %>%
select(EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP, REAL_PROP, REAL_CROP)
head(new_storm_data)
## EVTYPE PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP REAL_PROP REAL_CROP
## 1 TORNADO 25.0 K 0 25000 0
## 2 TORNADO 2.5 K 0 2500 0
## 3 TORNADO 25.0 K 0 25000 0
## 4 TORNADO 2.5 K 0 2500 0
## 5 TORNADO 2.5 K 0 2500 0
## 6 TORNADO 2.5 K 0 2500 0
new_storm_data <- new_storm_data %>%
mutate(TOTALDMG = REAL_PROP + REAL_CROP)
head(new_storm_data)
## EVTYPE PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP REAL_PROP REAL_CROP TOTALDMG
## 1 TORNADO 25.0 K 0 25000 0 25000
## 2 TORNADO 2.5 K 0 2500 0 2500
## 3 TORNADO 25.0 K 0 25000 0 25000
## 4 TORNADO 2.5 K 0 2500 0 2500
## 5 TORNADO 2.5 K 0 2500 0 2500
## 6 TORNADO 2.5 K 0 2500 0 2500
event_summary <- new_storm_data %>%
group_by(EVTYPE) %>%
summarise(
event_count = n(),
FINAL_DMG = sum(TOTALDMG, na.rm = TRUE),
dmg_per_event = FINAL_DMG / event_count
) %>%
arrange(desc(FINAL_DMG))
head(event_summary)
## # A tibble: 6 × 4
## EVTYPE event_count FINAL_DMG dmg_per_event
## <chr> <int> <dbl> <dbl>
## 1 FLOOD 25326 150319678250 5935390.
## 2 HURRICANE/TYPHOON 88 71913712800 817201282.
## 3 TORNADO 60652 57340613590 945404.
## 4 STORM SURGE 261 43323541000 165990579.
## 5 HAIL 288661 18752904170 64965.
## 6 FLASH FLOOD 54277 17562128610 323565.
Now that I have my data set ready, I can elaborate a plot to see the overall economic damage by EVTYPE and the average economic damage per event of that list.
library(scales)
library(gridExtra)
# FIGURE 3
top_10_events <- event_summary %>%
arrange(desc(FINAL_DMG)) %>%
slice(1:10)
max_dmg <- max(top_10_events$FINAL_DMG)
max_dmg_per_event <- max(top_10_events$dmg_per_event)
# Plot 1
p1 <- ggplot(top_10_events, aes(x = reorder(EVTYPE, FINAL_DMG), y = FINAL_DMG)) +
geom_bar(stat = "identity", fill = "navyblue") +
coord_flip() +
geom_text(aes(x = EVTYPE, y = max_dmg / 2, label = dollar(FINAL_DMG)),
hjust = 0.5, color = "black", fontface = "bold", size = 3) +
labs(title = "Overall Economic Damage",
x = "Event Type",
y = "Total Economic Damage") +
theme_minimal()
# Plot 2
p2 <- ggplot(top_10_events, aes(x = reorder(EVTYPE, dmg_per_event), y = dmg_per_event)) +
geom_bar(stat = "identity", fill = "darkred") +
coord_flip() +
geom_text(aes(x = EVTYPE, y = max_dmg_per_event / 2, label = dollar(dmg_per_event)),
hjust = 0.5, color = "black", fontface = "bold", size = 3) +
labs(title = "Average Economic Damage per Event",
x = "Event Type",
y = "Average Damage per Event") +
theme_minimal()
grid.arrange(p1, p2, ncol = 2)
When we look at the plot we can see that from the data recorded between 1950 and 2011, floods have registered the highest total economic damage, amounting to $150,319,678,250 USD. This is followed by the hurricane/typhoon category with $71,913,712,800 USD, and in third place, tornadoes with $57,340,613,590 USD.
When we look at the average economic damage per individual event, is evident that the hurricanes have a significantly greater economic impact than most other events types, averaging $817,201,282 dollars for each hurricane. This highlights the substantial financial burden that hurricanes impose relative to other weather-related events.