Synopsis

In this report, I address two questions about severe weather data obtained from the NOAA Storm Database. The goals were to identify the event types that caused the largest numbers of fatal and non-fatal injuries, and the event types that caused the greatest damage in US dollars to property and crops. Overall, high winds accounted for the largest reported damage in US dollars, whereas injuries due to weather events, both fatal and non-fatal, were mostly caused by tornadoes.

Data Processing

Process data

The table below gives an initial overview of the first seven columns of the stormdata dataset.

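## Load the packages used in this report
library(dplyr)     # %>%, group_by(), summarise(), arrange(), slice_head()
library(ggplot2)   # plots
library(huxtable)  # hux(), theme_article()
library(cowplot)   # plot_grid()
## Download data
dir.create("./data", showWarnings = FALSE)  # make sure the target folder exists
download.file(url = "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", destfile = "./data/stormdata.csv.bz2", mode = "wb")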
## Import data in R
stormdata <- read.csv("./data/stormdata.csv.bz2")
## See first few lines of data
hux(head(stormdata[, c(1:7)])) %>% theme_article()
STATE__  BGN_DATE            BGN_TIME  TIME_ZONE  COUNTY  COUNTYNAME  STATE
1        4/18/1950 0:00:00   0130      CST        97      MOBILE      AL
1        4/18/1950 0:00:00   0145      CST        3       BALDWIN     AL
1        2/20/1951 0:00:00   1600      CST        57      FAYETTE     AL
1        6/8/1951 0:00:00    0900      CST        89      MADISON     AL
1        11/15/1951 0:00:00  1500      CST        43      CULLMAN     AL
1        11/15/1951 0:00:00  2000      CST        77      LAUDERDALE  AL

Clean up

Next, the EVTYPE variable needs cleaning: it mixes upper and lower case and contains many near-duplicates caused by typos or differences in spelling.

## Clean up EVTYPE
stormdata$EVTYPE <- toupper(stormdata$EVTYPE) # convert to upper case
stormdata$EVTYPE <- gsub("[^A-Z ]", "", stormdata$EVTYPE) # remove every character except A-Z and spaces

## Stormdata is large and not all columns are needed for the analysis.
stormdata <- stormdata[, -c(2:7, 9:22, 30:37)]
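To make the effect of the two clean-up steps concrete, here is a small sketch on a few made-up labels (the sample strings are assumptions for illustration, not rows from the dataset). Because the pattern keeps spaces, leading or trailing blanks survive the clean-up; the last line shows trimws() as one optional way to remove them, which is not part of the analysis in this report.

```r
# Illustrative labels only; these strings are assumptions, not rows from stormdata
sample_evtype <- c("Tstm Wind", "TSTM WIND/HAIL", " High Surf Advisory", "?")

cleaned <- toupper(sample_evtype)        # "Tstm Wind" becomes "TSTM WIND"
cleaned <- gsub("[^A-Z ]", "", cleaned)  # drops "/", "?" and digits; keeps spaces
print(cleaned)

# Optional extra step (not used in the report): strip surrounding blanks
trimmed <- trimws(cleaned)
print(trimmed)
```

Note that a label consisting only of symbols, such as "?", ends up as an empty string, and that spelling variants like "TSTM WIND" and "THUNDERSTORM WIND" would still count as separate event types after this step.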

Results

The first research question is:

  1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

As stated in the question, EVTYPE gives the weather event type, while population health is captured in the variables FATALITIES and INJURIES. The second research question is:

  2. Across the United States, which types of events have the greatest economic consequences?

Damage per weather event type is recorded as a number under PROPDMG and CROPDMG, for damage to property and crops respectively. The magnitude (unit) of each amount is given in PROPDMGEXP and CROPDMGEXP.

# View unique values for PROPDMGEXP
unique(stormdata$PROPDMGEXP)
##  [1] "K" "M" ""  "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-" "1" "8"
# View unique values for CROPDMGEXP
unique(stormdata$CROPDMGEXP)
## [1] ""  "M" "K" "m" "B" "?" "0" "k" "2"

In these columns, most likely h = 100, k = 1,000, m = 1,000,000 and b = 1,000,000,000. All other values (digits and symbols) are treated as unknown or missing. Therefore, I created a function change_exp():

# create a function to change all relevant exponentials of dmg. 
change_exp <- function(x){
  v <- toupper(x) #change all to upper
  v <- gsub("[^a-zA-Z]", "", v) # drop every character that is not a letter
  v <- trimws(v, which = "both") # trim any remaining whitespace
  v[v == ""] <- NA # change empty strings to NA
  return(v)
}

# apply function to propdmgexp and cropdmgexp
stormdata$PROPDMGEXP <- change_exp(stormdata$PROPDMGEXP)
stormdata$CROPDMGEXP <- change_exp(stormdata$CROPDMGEXP)

Now, we need to check that only the relevant values, which indicate the number of zeros, are left:

# Check PROPDMGEXP
unique(stormdata$PROPDMGEXP)
## [1] "K" "M" NA  "B" "H"
# Check CROPDMGEXP
unique(stormdata$CROPDMGEXP)
## [1] NA  "M" "K" "B"
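The same check can be reproduced without the full dataset. The sketch below applies a condensed copy of change_exp() to the raw PROPDMGEXP values listed earlier; the variable name raw_units is an assumption introduced for this illustration.

```r
# Condensed copy of change_exp() so the snippet runs on its own
change_exp <- function(x) {
  v <- toupper(x)             # "m" and "h" become "M" and "H"
  v <- gsub("[^A-Z]", "", v)  # digits and symbols become empty strings
  v[v == ""] <- NA            # empty strings are treated as missing
  v
}

# The raw PROPDMGEXP values reported above (raw_units is a hypothetical name)
raw_units <- c("K", "M", "", "B", "m", "+", "0", "5", "6", "?",
               "4", "2", "3", "h", "7", "H", "-", "1", "8")

clean_units <- unique(change_exp(raw_units))
print(clean_units)
```

Lower-case units merge with their upper-case counterparts, and every digit or symbol collapses into NA, which matches the output shown above.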

Now, the damage needs to be converted to ‘true’ numbers. As we do not know how much damage the records with an NA unit represent, we set their damage amount to NA as well.

# Property damage first:
stormdata$PROPDMG[is.na(stormdata$PROPDMGEXP)] <- NA

# Crop damage second: 
stormdata$CROPDMG[is.na(stormdata$CROPDMGEXP)] <- NA

We can now calculate the amount of damage in US-dollars:

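dollar_damage <- function(damage, exponent){
  # A missing unit gets multiplier 0; the damage itself is already NA in that case
  if (is.na(exponent)) {v <- 0}
  # Each unit is tested on its own so that only "H" maps to 100
  else if (exponent == "H") {v <- 100}
  else if (exponent == "K") {v <- 1000}
  else if (exponent == "M") {v <- 1000000}
  else if (exponent == "B") {v <- 1000000000}
  # Any other unit is unknown, so the dollar amount stays missing
  else {v <- NA_real_}
  return(v*damage)
}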

# Apply the function to create new columns with dollars of damage to property
stormdata$PROPDMGDOLLARS <- mapply(dollar_damage, stormdata$PROPDMG, stormdata$PROPDMGEXP)

# Apply the function to create new columns with dollars of damage to crops
stormdata$CROPDMGDOLLARS <- mapply(dollar_damage, stormdata$CROPDMG, stormdata$CROPDMGEXP)
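As a worked example of the conversion, a record with a damage value of 25 and unit "K" corresponds to 25 * 1,000 = 25,000 US dollars. The sketch below shows one possible vectorised alternative to the row-by-row mapply() call, using a named lookup vector. The helper name dollar_damage_vec and the example values are assumptions made for illustration; this is not the function used in the report.

```r
# Multipliers as described in the text: H = 100, K = 1,000, M = 1e6, B = 1e9
multiplier <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)

# Hypothetical helper, not part of the original analysis
dollar_damage_vec <- function(damage, exponent) {
  # Indexing by name yields NA for a missing or unknown unit,
  # so the resulting dollar amount stays NA in those cases
  damage * unname(multiplier[exponent])
}

# Made-up example values
damage   <- c(25, 2.5, 1, 3, NA)
exponent <- c("K", "M", "B", "H", NA)

dollars <- dollar_damage_vec(damage, exponent)
print(dollars)
```

Because the lookup works on whole vectors, it avoids calling a function once per row, which may be noticeably faster on a dataset of this size.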

Now create an ‘event-type-damage’ data frame:

## Create a table 'ev_fi_dmg' that summarizes fatalities, injuries, property
## damage and crop damage per event type
ev_fi_dmg <- stormdata %>% group_by(EVTYPE) %>% 
  summarise(sum_fatalities = sum(FATALITIES),
            sum_injuries = sum(INJURIES),
            sum_dmg_property = sum(PROPDMGDOLLARS),
            sum_dmg_crops = sum(CROPDMGDOLLARS))

hux(head(ev_fi_dmg)) %>% theme_article()
EVTYPE              sum_fatalities  sum_injuries  sum_dmg_property  sum_dmg_crops
(empty)             0               0             500
HIGH SURF ADVISORY  0               0             2e+04
COASTAL FLOOD       0               0
FLASH FLOOD         0               0             5e+03
LIGHTNING           0               0
TSTM WIND           0               0
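Several damage totals in this overview are missing. A likely reason is that sum() returns NA as soon as one value in a group is NA, and the damage of records without a usable unit was set to NA earlier. The sketch below illustrates the difference on a made-up miniature dataset; it uses base R tapply() to stay self-contained, and the toy values are assumptions. Whether missing amounts should be skipped is an analysis decision, so this is shown as an option rather than a correction.

```r
# Made-up miniature of the cleaned data; values are illustrative only
toy <- data.frame(
  EVTYPE         = c("FLOOD", "FLOOD", "HAIL", "HAIL"),
  PROPDMGDOLLARS = c(5000, NA, 200, 300)
)

# Strict sum: one NA in a group makes the whole group total NA
sum_strict <- tapply(toy$PROPDMGDOLLARS, toy$EVTYPE, sum)

# Lenient sum: NA values are skipped before adding up
sum_skip_na <- tapply(toy$PROPDMGDOLLARS, toy$EVTYPE, sum, na.rm = TRUE)

print(sum_strict)
print(sum_skip_na)
```

Inside summarise(), sum(PROPDMGDOLLARS, na.rm = TRUE) behaves the same way as the lenient version. This matters for the ranking, because arrange(desc(...)) places NA totals last, so an event type with a missing total cannot appear among the ten largest.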

Answer to research question 1

To obtain an overview of fatalities and injuries per event type, we take a look at the plots below:

## Plot fatalities and injuries
plot_1a <- ev_fi_dmg %>% arrange(desc(sum_fatalities)) %>% slice_head(., n = 10) %>%
  ggplot(., aes(x = reorder(EVTYPE, -sum_fatalities), y = sum_fatalities)) + 
  geom_bar(stat = "identity") + 
  xlab("Weather event type") +
  ylab("Total sum of fatal injuries")+
  theme(axis.text.x = element_text(angle = 90))

plot_1b <- ev_fi_dmg %>% arrange(desc(sum_injuries)) %>% slice_head(., n = 10) %>%
  ggplot(., aes(x = reorder(EVTYPE, -sum_injuries), y = sum_injuries)) + 
  geom_bar(stat = "identity") +
  xlab("Weather event type") +
  ylab("Total sum of injuries")+
  theme(axis.text.x = element_text(angle = 90))

plot_1 <- plot_grid(plot_1a, plot_1b, labels = c("A", "B"))
plot_1

Here, we can see that tornadoes cause the largest number of both fatal and non-fatal injuries. Excessive heat ranks second as a cause of fatalities, but only fourth as a cause of injuries.

Answer to research question 2

Now, let us take a look at damage described in dollars to properties and crops.

## Plot property damage, crop damage and total damage
plot_2a <- ev_fi_dmg %>% arrange(desc(sum_dmg_property)) %>% slice_head(., n = 10) %>%
  ggplot(., aes(x = reorder(EVTYPE, -sum_dmg_property), y = sum_dmg_property)) + 
  geom_bar(stat = "identity") + 
  xlab("Weather event type") +
  ylab("Total of property damage")+
  theme(axis.text.x = element_text(angle = 90))

plot_2b <- ev_fi_dmg %>% arrange(desc(sum_dmg_crops)) %>% slice_head(., n = 10) %>%
  ggplot(., aes(x = reorder(EVTYPE, -sum_dmg_crops), y = sum_dmg_crops)) + 
  geom_bar(stat = "identity") +
  xlab("Weather event type") +
  ylab("Total of crop damage")+
  theme(axis.text.x = element_text(angle = 90))

ev_fi_dmg$sum_total_dmg <- rowSums(ev_fi_dmg[, c(4:5)], na.rm = TRUE)

plot_2c <- ev_fi_dmg %>% arrange(desc(sum_total_dmg)) %>% slice_head(., n = 10) %>%
  ggplot(., aes(x = reorder(EVTYPE, -sum_total_dmg), y = sum_total_dmg)) + 
  geom_bar(stat = "identity") +
  xlab("Weather event type") +
  ylab("Total of damage")+
  theme(axis.text.x = element_text(angle = 90))
plot_2 <- plot_grid(plot_2a, plot_2b, plot_2c, labels = c("A", "B", "C"), ncol = 3)
plot_2

In the second plot we can see that excessive snow and high, cold winds are the largest causes of property damage and crop damage, respectively. Plot 2C shows that high winds lead to the largest overall damage in US dollars, followed by excessive snow and flash flooding/floods.