Synopsis

This report focuses on an exploration of the U.S. National Oceanic and Atmospheric Administration (NOAA) storm database. This database records the characteristics of major storms and weather-related events in the United States, including data on fatalities, property damage, and crop damage, from April of 1950 to November 2011. The primary objective is to conduct an exploratory analysis to determine which types of events resulted in the highest number of fatalities and which types caused the most significant economic impacts.

Loading and Processing the Data

Loading

To access the data, refer to the following link: https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2. The database is provided in a compressed format, so this process starts with unzipping the file and then loading the data into R. After loading the data, the head() function will let me visualize the top of the database and its variables for further transformation and analysis.

library(R.utils)

bunzip2("repdata_data_StormData.csv.bz2", "repdata_data_StormData.csv", remove = FALSE)

storm_data <- read.csv("./repdata_data_StormData.csv")

head(storm_data, n=3)
##   STATE__          BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE  EVTYPE
## 1       1 4/18/1950 0:00:00     0130       CST     97     MOBILE    AL TORNADO
## 2       1 4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL TORNADO
## 3       1 2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL TORNADO
##   BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
## 1         0                                               0         NA
## 2         0                                               0         NA
## 3         0                                               0         NA
##   END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES PROPDMG
## 1         0                      14.0   100 3   0          0       15    25.0
## 2         0                       2.0   150 2   0          0        0     2.5
## 3         0                       0.1   123 2   0          0        2    25.0
##   PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE LONGITUDE
## 1          K       0                                         3040      8812
## 2          K       0                                         3042      8755
## 3          K       0                                         3340      8742
##   LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1       3051       8806              1
## 2          0          0              2
## 3          0          0              3

Now i can identify that the variables of interests are:

EVTYPE: The type of event (whether it is a tornado, a flood, an avalanche, etc.).

FATALITIES: The total number of deaths for each specific event.

PROPDMG: Indicates the monetary value of property damage incurred per event.

PROPDMGEXP: The value of PROPDMG (K = 1,000 , M = 1,000,000 , B = 1,000,000,000)

CROPDMG: Represents the monetary value of crop damage incurred per event.

CROPDMGEXP: The value of CROPDMG (K = 1,000 , M = 1,000,000 , B = 1,000,000,000)

Processing

The next step involves creating a new data set named summary_by_event, which will contain the following four variables:

EVTYPE: The complete list of event types.

event_count: The frequency of each event type.

total_fatalities: The total number of fatalities for each event type.

risk_of_death: The average number of fatalities per event, for each event type (total_fatalities / event_count).

Once the summary_by_event data set is created, it will be arranged in descending order based on the total_fatalities variable, and the top 10 events with the highest number of fatalities will be displayed.

library(dplyr)

summary_by_event <-   storm_data %>%
  group_by(EVTYPE) %>%
  summarise(
    event_count = n(),
    total_fatalities = sum(FATALITIES, na.rm = TRUE),
    risk_of_death = total_fatalities / event_count
  ) %>%
  arrange(desc(total_fatalities)) %>%  # Order by total_fatalities in descending order
  ungroup()

print(summary_by_event)
## # A tibble: 985 × 4
##    EVTYPE         event_count total_fatalities risk_of_death
##    <chr>                <int>            <dbl>         <dbl>
##  1 TORNADO              60652             5633       0.0929 
##  2 EXCESSIVE HEAT        1678             1903       1.13   
##  3 FLASH FLOOD          54277              978       0.0180 
##  4 HEAT                   767              937       1.22   
##  5 LIGHTNING            15754              816       0.0518 
##  6 TSTM WIND           219940              504       0.00229
##  7 FLOOD                25326              470       0.0186 
##  8 RIP CURRENT            470              368       0.783  
##  9 HIGH WIND            20212              248       0.0123 
## 10 AVALANCHE              386              224       0.580  
## # ℹ 975 more rows

We can observe which are the ten types of events with the highest number of fatalities, but when we compare with the risk_of_death, interestingly the event types with most deaths differ from the deadliness of a particular event.

Results

Now I’m going to create a panel of two bar plots to compare. One bar plot for the overall fatalities and a second plot for the average number of fatalities per event.

## FIGURE 1

library(ggplot2)
library(patchwork)

# 1st Plot: Top 10 EVTYPE by total fatalities


top_10_fatalities <- summary_by_event %>%
  arrange(desc(total_fatalities)) %>%
  slice(1:10)


max_fatalities <- max(top_10_fatalities$total_fatalities)


plot1 <- ggplot(top_10_fatalities, aes(x = reorder(EVTYPE, total_fatalities), y = total_fatalities)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  geom_text(aes(x = EVTYPE, y = max_fatalities / 2, label = total_fatalities),  # Centered in the middle of the plot
            hjust = 0.5, color = "black", fontface = "bold") +
  coord_flip() +  
  labs(title = "Overall Fatalities", x = "Event Type", y = "Total Fatalities") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5)) 


# 2nd Plot: Specific EVTYPE with their respective risk_of_death


specific_events <- summary_by_event %>%
  filter(EVTYPE %in% c("TORNADO", "EXCESSIVE HEAT", "FLASH FLOOD", "HEAT", "LIGHTNING",
                       "TSTM WIND", "FLOOD", "RIP CURRENT", "HIGH WIND", "AVALANCHE"))


max_risk_of_death <- max(specific_events$risk_of_death)


plot2 <- ggplot(specific_events, aes(x = reorder(EVTYPE, risk_of_death), y = risk_of_death)) +
  geom_bar(stat = "identity", fill = "red") +
  geom_text(aes(x = EVTYPE, y = max_risk_of_death / 2, label = round(risk_of_death, 2)),  # Centered in the middle of the plot
            hjust = 0.5, color = "black", fontface = "bold") +
  coord_flip() +  
  labs(title = "Total Fatalities per Event", x = "Event Type", y = "Fatalities per Event") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5)) 

# Combine the two plots into a panel
plot1 + plot2

Across the United States, which types of events (as indicated in the variable) are most harmful with respect to population health?

In the first plot we can see that TORNADO events have caused the highest number of fatalities registered, according to NOAA, with 5,633 deaths recorded across the U.S. EXCESSIVE HEAT ranks second with 1,903, followed by FLASH FLOOD with 978. However, considering the fatality rate per event, the HEAT is the deadliest type of event, followed by the EXCESSIVE HEAT and the RIP CURRENT.

Given that the type of event with more deaths registered are not necessarily the deadliest, i will now focus on identifying which are the overall deadliest event types.

# FIGURE 2

top_10_risk <- summary_by_event %>%
  arrange(desc(risk_of_death)) %>%
  slice(1:10)  


ggplot(top_10_risk, aes(x = reorder(EVTYPE, risk_of_death), y = risk_of_death)) +
  geom_bar(stat = "identity", fill = "red") +
  geom_text(aes(label = round(risk_of_death, 2)), hjust = 0.5, color = "black", fontface = "bold") +  
  coord_flip() +  
  labs(title = "Most Fatalities per Event", x = "Event Type", y = "Total Fatalities per Event") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))  

Here we observe that none of the types of events of the first group is in this list. The deadliest of all type of events is the TORNADOES, TSTM WIND, HAIL (tornado with thunderstorm wind and hail), a phenomenon that each time it happens kills 25 people on average, followed by the COLD AND SNOW with 14 deaths per event on average.

Across the United States, which types of events have the greatest economic consequences?

To answer this question, I’m going to create two new variables into the original data set created for this analysis (storm_data):

REAL_PROP: The number of PROPDMG times their value in the corresponding category of PROPDMGEXP (PROPDMG*PROPDMGEXP)

REAL_CROP: The number of CROPDMG times their value in the corresponding category of CROPDMGEXP (CROPDMG*CROPDMGEXP)

storm_data <- storm_data %>%
  mutate(
    REAL_PROP = case_when(
      PROPDMGEXP == "K" ~ PROPDMG * 1000,
      PROPDMGEXP == "M" ~ PROPDMG * 1000000,
      PROPDMGEXP == "B" ~ PROPDMG * 1000000000,
      TRUE ~ 0  
    ),
    REAL_CROP = case_when(
      CROPDMGEXP == "K" ~ CROPDMG * 1000,
      CROPDMGEXP == "M" ~ CROPDMG * 1000000,
      CROPDMGEXP == "B" ~ CROPDMG * 1000000000,
      TRUE ~ 0  
    )
  )

Next, i will create a new data set with the selected variables relevant for this analysis, and then add a new variable called TOTALDMG, which is the sum of REAL_PROP + REAL_CROP. Finally, similar to the step i did earlier, I will create a new data set with the list of EVTYPE and their respective sum of TOTALDMG by EVTYPE in a new variable called FINAL_DAMAGE.

new_storm_data <- storm_data %>%
  select(EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP, REAL_PROP, REAL_CROP)


head(new_storm_data)
##    EVTYPE PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP REAL_PROP REAL_CROP
## 1 TORNADO    25.0          K       0                25000         0
## 2 TORNADO     2.5          K       0                 2500         0
## 3 TORNADO    25.0          K       0                25000         0
## 4 TORNADO     2.5          K       0                 2500         0
## 5 TORNADO     2.5          K       0                 2500         0
## 6 TORNADO     2.5          K       0                 2500         0
new_storm_data <- new_storm_data %>%
  mutate(TOTALDMG = REAL_PROP + REAL_CROP)


head(new_storm_data)
##    EVTYPE PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP REAL_PROP REAL_CROP TOTALDMG
## 1 TORNADO    25.0          K       0                25000         0    25000
## 2 TORNADO     2.5          K       0                 2500         0     2500
## 3 TORNADO    25.0          K       0                25000         0    25000
## 4 TORNADO     2.5          K       0                 2500         0     2500
## 5 TORNADO     2.5          K       0                 2500         0     2500
## 6 TORNADO     2.5          K       0                 2500         0     2500
event_summary <- new_storm_data %>%
  group_by(EVTYPE) %>%
  summarise(
    event_count = n(),
    FINAL_DMG = sum(TOTALDMG, na.rm = TRUE),
    dmg_per_event = FINAL_DMG / event_count
  ) %>%
  arrange(desc(FINAL_DMG))


head(event_summary)
## # A tibble: 6 × 4
##   EVTYPE            event_count    FINAL_DMG dmg_per_event
##   <chr>                   <int>        <dbl>         <dbl>
## 1 FLOOD                   25326 150319678250      5935390.
## 2 HURRICANE/TYPHOON          88  71913712800    817201282.
## 3 TORNADO                 60652  57340613590       945404.
## 4 STORM SURGE               261  43323541000    165990579.
## 5 HAIL                   288661  18752904170        64965.
## 6 FLASH FLOOD             54277  17562128610       323565.

Now that I have my data set ready, I can elaborate a plot to see the overall economic damage by EVTYPE and the average economic damage per event of that list.

library(scales)
library(gridExtra)


# FIGURE 3

top_10_events <- event_summary %>%
  arrange(desc(FINAL_DMG)) %>%
  slice(1:10)


max_dmg <- max(top_10_events$FINAL_DMG)
max_dmg_per_event <- max(top_10_events$dmg_per_event)

# Plot 1
p1 <- ggplot(top_10_events, aes(x = reorder(EVTYPE, FINAL_DMG), y = FINAL_DMG)) +
  geom_bar(stat = "identity", fill = "navyblue") +
  coord_flip() +
  geom_text(aes(x = EVTYPE, y = max_dmg / 2, label = dollar(FINAL_DMG)),  
            
            hjust = 0.5, color = "black", fontface = "bold", size = 3) +  
  
  labs(title = "Overall Economic Damage", 
       x = "Event Type", 
       y = "Total Economic Damage") +
  theme_minimal()

# Plot 2
p2 <- ggplot(top_10_events, aes(x = reorder(EVTYPE, dmg_per_event), y = dmg_per_event)) +
  geom_bar(stat = "identity", fill = "darkred") +
  coord_flip() +
  geom_text(aes(x = EVTYPE, y = max_dmg_per_event / 2, label = dollar(dmg_per_event)),  
            hjust = 0.5, color = "black", fontface = "bold", size = 3) +  
  
  labs(title = "Average Economic Damage per Event", 
       x = "Event Type", 
       y = "Average Damage per Event") +
  theme_minimal()


grid.arrange(p1, p2, ncol = 2)

When we look at the plot we can see that from the data recorded between 1950 and 2011, floods have registered the highest total economic damage, amounting to $150,319,678,250 USD. This is followed by the hurricane/typhoon category with $71,913,712,800 USD, and in third place, tornadoes with $57,340,613,590 USD.

When we look at the average economic damage per individual event, is evident that the hurricanes have a significantly greater economic impact than most other events types, averaging $817,201,282 dollars for each hurricane. This highlights the substantial financial burden that hurricanes impose relative to other weather-related events.