Reproducible Research Course Project 2 Assignment: Storm Data Preparation

Synopsys

This project involves exploring the storm data collection of the National Oceanic and Atmospheric Administration (NOAA). The dataset documents the occurrence of storms and other significant or unusual weather phenomena causing loss of life, injuries, or property damage. The dataset contains a list of events observed in the US between 1950 and 2011. The analysis aims to explore which types of events are the most harmful with respect to population health and have the greatest economic consequences.

Data Processing

The dataset was downloaded in ‘bz2’ format from the Reporoducible Research Course website. There is also some documentation of the database available, explaining how some of the variables are constructed and defined:

National Weather Service Storm Data Documentation
National Climatic Data Center Storm Events FAQ

After downloading the file to our own directory, we are reading the csv file and checking the number of observations and variables.

library(ggplot2)
library(dplyr)
library(tidyverse)
library(tidytext)  

if(!file.exists("./data")){dir.create("./data")}
fileUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(fileUrl, destfile = paste0(getwd(), "./data/stormdata.bz2"), method = "libcurl")

stormdata <- read.csv(bzfile("./data/stormdata.bz2"))

str(stormdata)

## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  "" "" "" "" ...
##  $ BGN_LOCATI: chr  "" "" "" "" ...
##  $ END_DATE  : chr  "" "" "" "" ...
##  $ END_TIME  : chr  "" "" "" "" ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  "" "" "" "" ...
##  $ END_LOCATI: chr  "" "" "" "" ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ WFO       : chr  "" "" "" "" ...
##  $ STATEOFFIC: chr  "" "" "" "" ...
##  $ ZONENAMES : chr  "" "" "" "" ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  "" "" "" "" ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

We note that the dataset contains 902297 obvervations and 37 variables. Some of the columns are not required for our analysis therefore we keep only the columns describing the event type, number of fatalities and injuries, and the columns relating to property and crop damage.

storm_columns <- c("EVTYPE", "FATALITIES","INJURIES","PROPDMG","PROPDMGEXP","CROPDMG","CROPDMGEXP")
stormdata <- stormdata[,storm_columns]

We are checking the unique values of PROPDMGEXP and CROPDMGEXP.

unique(stormdata$PROPDMGEXP)

##  [1] "K" "M" ""  "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-" "1" "8"

unique(stormdata$CROPDMGEXP)

## [1] ""  "M" "K" "m" "B" "?" "0" "k" "2"

The values provided for CROPDMGEXP and PROPDMGEXP need to be converted to numerical values and then multiplied by CROPDM and PROPDMG to get the absolute values.

stormdata <- stormdata %>% mutate(PROPDMGEXP =
  case_match(PROPDMGEXP, c(2,"h","H") ~ 100,
             c("K","3") ~ 1000,
             "4" ~ 10000,
             "5" ~ 100000,
             c("M","m","6") ~ 1000000,
             "7" ~ 10000000,
             "8" ~ 100000000,
             "B" ~ 1000000000,
             .default =  1), CROPDMGEXP =
  case_match(CROPDMGEXP, "2" ~ 100,
             c("K","k") ~ 1000,
             c("M","m") ~ 1000000,
             "B" ~ 1000000000,
             .default = 1))

stormdata <- stormdata %>% 
  mutate(PROPDMG_value = PROPDMG * as.numeric(PROPDMGEXP), 
         CROPDMG_value = CROPDMG * as.numeric(CROPDMGEXP))

Data Analysis

1. Across the United States, which types of events (as indicated in the EVTYPE are most harmful with respect to population health?

As the dataset has many unique type of events, we are identifying the top 10 most impactful events for fatalities and injuries.

stormdata_harm <- stormdata %>% group_by(EVTYPE) %>% summarize(FATALITIES = sum(FATALITIES), INJURIES = sum(INJURIES))
stormdata_fatality_events <- stormdata_harm[order(stormdata_harm$FATALITIES, decreasing = T)[1:10],]
stormdata_injury_events <- stormdata_harm[order(stormdata_harm$INJURIES, decreasing = T)[1:10],]
stormdata_tops <- rbind(stormdata_fatality_events, stormdata_injury_events)
stormdata_eventlist <- unique(stormdata_tops$EVTYPE)
stormdata_eventlist

##  [1] "TORNADO"           "EXCESSIVE HEAT"    "FLASH FLOOD"      
##  [4] "HEAT"              "LIGHTNING"         "TSTM WIND"        
##  [7] "FLOOD"             "RIP CURRENT"       "HIGH WIND"        
## [10] "AVALANCHE"         "ICE STORM"         "THUNDERSTORM WIND"
## [13] "HAIL"

The list of events are partially overlapping for fatalities and injuries. We are creating a subset of data and creating a barplot comparing the event count for fatalities and injuries. We are creating a subset of data filtered to these events, and shaping the data ready for plotting.

stormdata_harm <- stormdata_harm[stormdata_harm$EVTYPE %in% stormdata_eventlist,]

storm_harm_long <- stormdata_harm %>%
  pivot_longer(cols = c(FATALITIES, INJURIES), 
               names_to = "HarmType", 
               values_to = "Count")

p <- ggplot(storm_harm_long, aes(x = reorder(EVTYPE, -Count), y = Count, fill = HarmType)) +
  geom_bar(stat = "identity") +
  facet_wrap(~HarmType, scales = "free_y") +
  labs(title = "Events most harmful to US population health", x = "Event Type", y = "Event count") +
  theme_classic() +
  scale_fill_manual(values = c("turquoise","orange")) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        plot.title = element_text(hjust = 0.5, size = 15),
        axis.title.x = element_text(size = 15),
        axis.title.y = element_text(size = 15))
p

Tornadoes, excessive heat and thunderstorm wind are the most impactful events for both fatalities and injuries. The number of injuries for these events are approximately 10 times higher than the number of fatalities.

2. Across the United States, which types of events have the greatest economic consequences?

As the dataset has many unique type of events, we are identifying the top 10 events with greatest economic consequences for property and crop damage.

stormdata_eco <- stormdata %>% group_by(EVTYPE) %>% summarize(PROPDMG_value = sum(PROPDMG_value), CROPDMG_value = sum(CROPDMG_value))

stormdata_prop <- stormdata_eco[order(stormdata_eco$PROPDMG_value, decreasing = T)[1:10],]
stormdata_crop <- stormdata_eco[order(stormdata_eco$CROPDMG_value, decreasing = T)[1:10],]
stormdata_damage <- rbind(stormdata_prop, stormdata_crop)
stormdata_eventlist <- unique(stormdata_damage$EVTYPE)

The list of events are similar to the list of most harmful events.

We are creating a subset of data filtered to these events and shaping the data for plotting.

stormdata_damage <- stormdata_damage[stormdata_damage$EVTYPE %in% stormdata_eventlist,]

storm_damage_long <- stormdata_damage %>%
  pivot_longer(cols = c(PROPDMG_value, CROPDMG_value), 
               names_to = "HarmType", 
               values_to = "Amount")

p <- ggplot(storm_damage_long, aes(x = reorder_within(EVTYPE, -Amount, HarmType), y = Amount, fill = HarmType)) +
  geom_bar(stat = "identity") +
  facet_wrap(~HarmType, scales = "free", axes = "all") +
  labs(title = "Events with greater economic consequence", x = "Event Type", y = "Cost of damage ($)") +
  theme_classic() +
  scale_x_reordered() + 
  scale_fill_manual(values = c("turquoise","orange")) +
  scale_y_continuous(labels = function(x) paste0(x / 1000000000, "B")) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        plot.title = element_text(hjust = 0.5, size = 15),
        axis.title.x = element_text(size = 15),
        axis.title.y = element_text(size = 15))
p

While flood, hurricanes and tornados had the biggest economic impact on property damage, drought, flood and hail impacted the most the damage on crops.The economic impact of property damage was higher scale than crop damage.