Analysis of U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database

Synopsis

This document details an analysis of the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. Data was taken from https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2, which covers the years 1950 to November 2011. The focus is on which event types 1) are most harmful to population health and 2) have the greatest economic consequences across the United States.

It is determined that tornadoes have caused the most harm to the health of the population. Flooding has caused the most economic damage, measured here as property damage.

Data Analysis

Data has been downloaded from https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2. This is read into a data frame for analysis within R. There are 902297 records with 37 fields.

dataset <- read.csv("repdata_data_StormData.csv")
dim(dataset)
## [1] 902297     37
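
The compressed file can also be read without unpacking it first, because read.csv accepts bzip2-compressed paths. The following is a minimal sketch that uses a small stand-in file rather than the real download; the file contents and the name toy are illustrative assumptions, not part of the original analysis.

```r
# Sketch: read a bzip2-compressed CSV directly.
# The tiny file written here is a stand-in for StormData.csv.bz2.
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "wt")
write.csv(
  data.frame(EVTYPE = c("TORNADO", "FLOOD"), FATALITIES = c(1, 0)),
  con,
  row.names = FALSE
)
close(con)

# read.csv detects the compression when opening the file for reading
toy <- read.csv(tmp, stringsAsFactors = FALSE)
dim(toy)

# For the real data, the archive could be fetched first, for example:
# download.file(url, "StormData.csv.bz2", mode = "wb")
# dataset <- read.csv("StormData.csv.bz2")
```

Reading the archive directly avoids keeping an uncompressed copy of the data on disk.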

Descriptions of the fields can be found at https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf. Because we are interested in the events that cause the most economic damage and the most harm to health, we focus on the following fields: EVTYPE, which indicates the event type; FATALITIES and INJURIES, which give the number of fatalities and injuries respectively; and PROPDMG and PROPDMGEXP, which give the amount of property damage and its units respectively. We also keep the date of the event, which is contained in the field BGN_DATE.

Before processing the data, we first load the R packages used in this analysis. We use dplyr to modify the data set, lubridate to extract the year from each date, and ggplot2 to create visualisations.

# load the packages used in this analysis
library(dplyr)
library(tidyr)
library(ggplot2)
library(lubridate)

Create a new data set with only the fields we require.

data_2 <- dataset %>%
  select(BGN_DATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP)

Now perform some data formatting. We convert BGN_DATE to a valid date. Additionally, depending on the unit specified in the field PROPDMGEXP, we multiply the amount in the PROPDMG field so that all records are expressed in dollars. Only the codes 'K' (thousands) and 'M' (millions) are scaled; all other values are left unchanged.

# convert BGN_DATE to a valid date (the time component is dropped)
data_2$BGN_DATE <- as.Date(data_2$BGN_DATE, format = "%m/%d/%Y %H:%M:%S")
# make the units consistent for the property damage
data_2 <- data_2 %>%
  mutate(
    PROPDMG = case_when(
      PROPDMGEXP == 'K' ~ PROPDMG * 1000,
      PROPDMGEXP == 'M' ~ PROPDMG * 1000000,
      TRUE ~ PROPDMG
    )
  ) %>%
  select(-PROPDMGEXP)
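
The conversion above scales only the uppercase codes K and M. The NOAA documentation also describes B for billions, and the field may contain lowercase codes, so a broader mapping could change the damage totals. The following is a hedged sketch of such a mapping, not the method used in this analysis; the helper name exp_multiplier, the toy data and the fallback multiplier of 1 for unrecognised codes are assumptions.

```r
# Hypothetical helper: map an exponent code to a numeric multiplier.
# Assumption: codes other than K, M or B (in any case) fall back to 1.
exp_multiplier <- function(code) {
  lookup <- c(K = 1e3, M = 1e6, B = 1e9)
  mult <- unname(lookup[toupper(as.character(code))])
  ifelse(is.na(mult), 1, mult)
}

# Toy records standing in for the PROPDMG and PROPDMGEXP fields
toy <- data.frame(
  PROPDMG    = c(25, 2.5, 1.2, 10),
  PROPDMGEXP = c("K", "m", "B", ""),
  stringsAsFactors = FALSE
)
toy$PROPDMG_USD <- toy$PROPDMG * exp_multiplier(toy$PROPDMGEXP)
toy$PROPDMG_USD
```

A named lookup vector keeps the code-to-multiplier mapping in one place, which makes it easy to extend if other codes need handling.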

Of interest are the weather events that cause the most harm to population health. To answer this question we need some way to relate fatalities and injuries. For the purpose of this analysis it is assumed that one injury is equivalent to 0.2 of a fatality. We can then combine injuries and fatalities into a single measure of the impact on population health.

data_2 <- data_2 %>%
  mutate(FATALITIES = FATALITIES + 0.2 * INJURIES) %>%
  select(-INJURIES)
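
As a worked example of the weighting, the sketch below applies the same formula to two made-up records. The variable injury_weight, the column HARM and the toy values are illustrative assumptions; the analysis itself overwrites FATALITIES rather than adding a new column.

```r
# Weight of one injury relative to one fatality (the assumption stated above)
injury_weight <- 0.2

# Toy records: one event with many injuries, one with fatalities only
toy <- data.frame(
  EVTYPE     = c("TORNADO", "HEAT"),
  FATALITIES = c(2, 5),
  INJURIES   = c(10, 0),
  stringsAsFactors = FALSE
)

# Combined harm score: fatalities plus weighted injuries
toy$HARM <- toy$FATALITIES + injury_weight * toy$INJURIES
toy
```

Here the tornado record scores 2 + 0.2 * 10 = 4 and the heat record scores 5. Keeping the weight in a named variable makes it simple to check how sensitive the ranking is to this choice.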

We now use dplyr to aggregate by year and weather event type.

data_grouped <- data_2 %>%
  group_by(year = year(BGN_DATE), EVTYPE) %>%
  summarise(
    Total_Fatalities = sum(FATALITIES, na.rm = TRUE),
    Total_PropDMG = sum(PROPDMG, na.rm = TRUE),
    Number_Observations = n(),
    .groups = "drop"
  )

Impact on Health

A heatmap of fatalities + 0.2 * injuries by event type and year is shown below.

# find the 10 event types with the largest totals, otherwise the plot is not readable
top_events_fatalities <- data_grouped %>%
  group_by(EVTYPE) %>%
  summarise(total = sum(Total_Fatalities)) %>%
  slice_max(total, n = 10) %>%
  pull(EVTYPE)

# now filter on those events
fatalities_summary <- data_grouped %>%
  filter(EVTYPE %in% top_events_fatalities) %>%
  arrange(desc(Total_Fatalities))

# now create a heatmap
ggplot(fatalities_summary, aes(x = EVTYPE, y = year, fill = Total_Fatalities)) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "red") +
  labs(
    title = "Weather Event Fatalities by Year",
    x = "Weather Event Type",
    y = "Year",
    fill = "Fatalities + 0.2 * Injuries"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Below is a time series of fatalities + 0.2 * injuries by year.

ggplot(fatalities_summary, aes(x = year, y = Total_Fatalities, color = EVTYPE)) +
  geom_line() +
  labs(
    title = "Fatalities by Weather Event Over Time",
    x = "Year",
    y = "Fatalities + 0.2 * Injuries",
    color = "Event Type"
  ) +
  theme_minimal()

Both plots indicate that tornadoes have caused the most harm to public health over time. It is worth noting that some weather events appear to have been recorded only from the mid-1990s onward.
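
One way to check how recording coverage changes over time is to count the distinct event types present in each year. The following is a minimal sketch on made-up data; the toy data frame and the name types_per_year are assumptions, and with the real data the same pipeline could be applied to data_grouped.

```r
library(dplyr)

# Toy stand-in for the grouped data: one row per year and event type
toy <- data.frame(
  year   = c(1950, 1951, 1996, 1996, 1996),
  EVTYPE = c("TORNADO", "TORNADO", "TORNADO", "FLOOD", "HEAT"),
  stringsAsFactors = FALSE
)

# Count how many distinct event types were recorded in each year
types_per_year <- toy %>%
  group_by(year) %>%
  summarise(n_types = n_distinct(EVTYPE), .groups = "drop")

types_per_year
```

A sharp rise in the number of event types in a given year would suggest a change in recording practice rather than a change in the weather, which matters when comparing totals across the full period.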

Impact on Economy

A heatmap of property damage by event type and year is shown below.

# find the 10 event types with the largest totals, otherwise the plot is not readable
top_events_Economy <- data_grouped %>%
  group_by(EVTYPE) %>%
  summarise(total = sum(Total_PropDMG)) %>%
  slice_max(total, n = 10) %>%
  pull(EVTYPE)

# now filter on those events
PropDMG_summary <- data_grouped %>%
  filter(EVTYPE %in% top_events_Economy) %>%
  arrange(desc(Total_PropDMG))

# now create a heatmap
ggplot(PropDMG_summary, aes(x = EVTYPE, y = year, fill = Total_PropDMG)) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "red") +
  labs(
    title = "Weather Event Property Damage by Year",
    x = "Weather Event Type",
    y = "Year",
    fill = "Property Damage"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Below is a time series of property damage by year.

ggplot(PropDMG_summary, aes(x = year, y = Total_PropDMG, color = EVTYPE)) +
  geom_line() +
  labs(
    title = "Property Damage by Weather Event Over Time",
    x = "Year",
    y = "Property Damage (USD)",
    color = "Event Type"
  ) +
  theme_minimal()

Flash floods and floods have the greatest impact in terms of economic damage. This indicates that flooding causes the most economic damage.