Introduction / Synopsis

Dear reader, as part of Coursera’s Reproducible Research course assignment, this report answers the following questions:

  1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

  2. Across the United States, which types of events have the greatest economic consequences?

This document is structured accordingly. First, I describe how to download and load the data. Then I answer question 1, followed by question 2. Additional data cleaning is done within each of these steps, because it is more logical to clean the data at the point where it becomes apparent why the cleaning is necessary.

Data processing

First, we download the data file from the Coursera server (if it is not already present in the working directory). The data were downloaded on 2020-01-25 16:27:05.

url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"

if(!file.exists("data.csv.bz2")){
  download.file(url, destfile = "data.csv.bz2")
}

Because read.csv can read bz2-compressed files directly, there is no need to decompress the file first.

data <- read.csv("data.csv.bz2")
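
To illustrate why no separate decompression step is needed, here is a minimal sketch using a made-up toy table (the column values and the temporary file are assumptions for illustration only, not part of the storm data). It writes a small CSV through a bzip2 connection and reads it back with a plain read.csv call:

```r
# Toy data, invented for this sketch
toy <- data.frame(EVTYPE = c("TORNADO", "FLOOD"), FATALITIES = c(3, 1))

# Write the table as a bzip2-compressed CSV in a temporary location
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv detects the compression and reads the file directly
roundtrip <- read.csv(tmp)
print(roundtrip)

# Clean up the temporary file
unlink(tmp)
```

The same mechanism is what lets the call above read data.csv.bz2 as is.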

Results

Question 1

Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

Let us define harmfulness to population health HPE as the sum of FATALITIES and INJURIES. We start by creating a variable to measure this:

library(tidyverse)

data <- data %>%
  mutate(HPE = FATALITIES + INJURIES)

Let us now examine the total harmfulness to population health for each event type. The eventtype dataset contains the sum of HPE per event type. I sort it in descending order of that sum, so that the event types with the highest number of casualties are at the top of the data frame.

eventtype <- data %>%
    group_by(EVTYPE) %>%
    summarise(TotalHPE = sum(HPE))

eventtype <- eventtype[order(eventtype$TotalHPE, decreasing = T),]
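
As a worked example of the same steps, the following sketch runs the aggregation on a small made-up data frame (the rows are invented for illustration). It uses arrange(desc(...)) from dplyr as one possible alternative to order() for the sorting step:

```r
library(dplyr)

# Toy data, invented for this sketch
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HEAT"),
  FATALITIES = c(2, 1, 0, 3),
  INJURIES   = c(10, 0, 5, 1)
)

toy_totals <- toy %>%
  mutate(HPE = FATALITIES + INJURIES) %>%   # harm per record
  group_by(EVTYPE) %>%
  summarise(TotalHPE = sum(HPE)) %>%        # total harm per event type
  arrange(desc(TotalHPE))                   # most harmful first

print(toy_totals)
```

Here TORNADO ends up first with a total of 17 (12 from the first record plus 5 from the third), followed by HEAT with 4 and FLOOD with 1.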

Now, I visualize the first 10 observations of the dataset, using a nice theme from the hrbrthemes package.

library(hrbrthemes)
library(scales)

ggplot(eventtype[1:10,], aes(x = EVTYPE, y = TotalHPE)) + 
  geom_col(fill = "orange", alpha = 0.5) +
  coord_flip(ylim = c(0,120000)) + 
  theme_ipsum_ps() + 
  labs(y = "Total Fatalities + Injuries", x = "Type of Event", title = "Most Harmful Event Types") +
  geom_text(aes(label=scales::comma(TotalHPE)), hjust=0, nudge_y=2000) 
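
One thing to keep in mind is that ggplot2 orders a discrete axis by factor levels (alphabetically for character columns), not by the row order of the data frame. If the bars should appear sorted by size, one option is reorder(), sketched here on made-up totals (the numbers are invented for illustration):

```r
library(ggplot2)

# Toy totals, invented for this sketch
toy_totals <- data.frame(
  EVTYPE   = c("TORNADO", "FLOOD", "HEAT"),
  TotalHPE = c(17, 4, 1)
)

# reorder() sorts the levels of EVTYPE by TotalHPE (ascending),
# so after coord_flip() the largest bar is drawn at the top
bar_order <- levels(reorder(toy_totals$EVTYPE, toy_totals$TotalHPE))
print(bar_order)

sorted_plot <- ggplot(toy_totals, aes(x = reorder(EVTYPE, TotalHPE), y = TotalHPE)) +
  geom_col() +
  coord_flip() +
  labs(x = "Type of Event", y = "Fatalities + Injuries")
```

Without reorder(), the three bars would be drawn in alphabetical order (FLOOD, HEAT, TORNADO) regardless of their totals.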

It becomes clear that tornadoes are by far the most harmful type of event: they take many lives and cause many injuries.

library(kableExtra)
kable(eventtype[1:10,], caption = "Most Harmful Event Types", booktabs = TRUE, row.names = FALSE, format = "html")

Table: Most Harmful Event Types

EVTYPE               TotalHPE
TORNADO                 96979
EXCESSIVE HEAT           8428
TSTM WIND                7461
FLOOD                    7259
LIGHTNING                6046
HEAT                     3037
FLASH FLOOD              2755
ICE STORM                2064
THUNDERSTORM WIND        1621
WINTER STORM             1527

Question 2

Across the United States, which types of events have the greatest economic consequences? First, let’s define economic consequences as either damage to property, or damage to crops.

Furthermore, the brochure states that “estimates should be rounded to three significant digits, followed by an alphabetical character signifying the magnitude of the number, i.e., 1.55B for $1,550,000,000. Alphabetical characters used to signify magnitude include “K” for thousands, “M” for millions, and “B” for billions.” The variables PROPDMGEXP and CROPDMGEXP therefore contain the multipliers for the digits in PROPDMG and CROPDMG respectively. Let’s first look at the levels of these variables.

levels(data$PROPDMGEXP)
##  [1] ""  "-" "?" "+" "0" "1" "2" "3" "4" "5" "6" "7" "8" "B" "h" "H" "K" "m" "M"
levels(data$CROPDMGEXP)
## [1] ""  "?" "0" "2" "B" "k" "K" "m" "M"
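
Note that levels() only returns values when the column is a factor. Since R 4.0.0, read.csv no longer converts strings to factors by default, so on a recent R version the calls above would return NULL. A minimal sketch, on a made-up toy column, of inspecting the codes in a way that also works for character columns:

```r
# Toy column, invented for this sketch; kept as character on purpose
toy <- data.frame(PROPDMGEXP = c("K", "M", "", "K", "B"),
                  stringsAsFactors = FALSE)

# levels() is NULL for a character column
toy_levels <- levels(toy$PROPDMGEXP)
print(toy_levels)

# unique() lists the distinct codes, table() also counts them
toy_codes <- unique(toy$PROPDMGEXP)
print(toy_codes)
print(table(toy$PROPDMGEXP))
```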

Let’s write some code that maps those codes to numeric multipliers, and then create variables that contain the real crop damage and property damage.

library(plyr)
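# Map each exponent code to its numeric multiplier; digits are read as powers of ten.
# Assumption: "h" means hundreds, like "H", so both map to 1e2.
data$PROPDMGEXP <- mapvalues(data$PROPDMGEXP,
c("K","M","", "B","m","+","0","5","6","?","4","2","3","h","7","H","-","1","8"), 
c(1e3,1e6, 1, 1e9,1e6,  1,  1,1e5,1e6,  1,1e4,1e2,1e3,1e2,1e7,1e2, 1, 10,1e8))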

data$CROPDMGEXP <- mapvalues(data$CROPDMGEXP,
c("","M","K","m","B","?","0","k","2"),
c(1,1e6,1e3,1e6,1e9,1,1,1e3,1e2))

data$PROPDMGEXP <- as.numeric(as.character(data$PROPDMGEXP))
data$CROPDMGEXP <- as.numeric(as.character(data$CROPDMGEXP))

Now, create the real property and crop damage variables:

data$REALPROPDMG <- data$PROPDMG * data$PROPDMGEXP
data$REALCROPDMG <- data$CROPDMG * data$CROPDMGEXP
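
To make the arithmetic concrete, here is a small worked example on made-up rows, using a named lookup vector in base R instead of mapvalues (the toy values, and the choice to treat unknown codes as a multiplier of 1, are assumptions for illustration):

```r
# Lookup table for a subset of the exponent codes
multipliers <- c("K" = 1e3, "k" = 1e3, "M" = 1e6, "m" = 1e6, "B" = 1e9)

# Toy data, invented for this sketch
toy <- data.frame(
  PROPDMG    = c(25, 1.55, 2.5, 10),
  PROPDMGEXP = c("K", "B", "m", ""),
  stringsAsFactors = FALSE
)

# Look up each code; codes without an entry (here the empty string) give NA
mult <- unname(multipliers[toy$PROPDMGEXP])
mult[is.na(mult)] <- 1   # assumption: no recognised code means "as is"

toy$REALPROPDMG <- toy$PROPDMG * mult
print(toy)
```

The second row reproduces the example from the quotation: 1.55 with code "B" becomes 1,550,000,000.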

Now we can proceed to answer the question: which types of events have the greatest economic consequences?

detach(package:plyr)

damage <- data %>%
  group_by(EVTYPE) %>%
  summarise(TPD = sum(REALPROPDMG), TCD = sum(REALCROPDMG), TD = TPD + TCD)

Maximum damage to property

The following figure shows the ten event types with the highest total property damage. Floods are the most damaging event type for property, which is perhaps unsurprising. They exceed hurricanes/typhoons, the second most damaging type, by about a factor of 2 (meaning floods cause about twice as much damage as hurricanes and typhoons do).

damage <- damage[order(damage$TPD, decreasing = T),]

o <- damage[1:10,] %>%
  ggplot(aes(x = EVTYPE, y = TPD)) + 
  geom_col()

o + scale_y_continuous(
  labels = scales::number_format(accuracy = 1, big.mark = ",")) +
  coord_flip(ylim = c(0, 1.7e11)) +
  theme_ft_rc() +
  labs(y = "Total Property Damage", x = "Type of Event", title = "Most Damaging Event Types (Property)") +
  geom_text(aes(label=paste(scales::comma(TPD,scale = 1e-9), "bln")), hjust=0, nudge_y=2000)

Maximum damage to crops

The most damaging type of event for crops is drought, which causes about twice as much damage as floods, the second most damaging type. Crop damage is far smaller in magnitude than property damage: for the most damaging event types, property damage is roughly 10 times as large as crop damage.

damage <- damage[order(damage$TCD, decreasing = T),]

p <- damage[1:10,] %>%
  ggplot(aes(x = EVTYPE, y = TCD)) + 
  geom_col() 

p + scale_y_continuous(
  labels = scales::number_format(accuracy = 1, big.mark = ",")) + 
  coord_flip(ylim = c(0,1.7e10)) + 
  theme_ft_rc() + 
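  labs(y = "Total Crops Damage", x = "Type of Event", title = "Most Damaging Event Types (Crops)") + 
  # scale = 1e-6 expresses TCD in millions, hence the "mln" suffix;
  # hjust and nudge_y are fixed settings, so they go outside aes()
  geom_text(aes(label = paste(scales::comma(TCD, scale = 1e-6, accuracy = 1), "mln")), hjust = 0, nudge_y = 2000)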

Conclusion

Thank you for reading my report.