This report explores the NOAA (U.S. National Oceanic and Atmospheric Administration) Storm Database to answer the following questions: which storm event types are most harmful to population health, and which event types have the greatest economic consequences?
The report is written according to the principles of “Reproducible Reports”.
I use the tidyverse package and follow its conventions, including the pipe syntax.
The data is downloaded from the URL below and read into a tibble. Both steps are executed only if their result is not already present.
library(tidyverse)
## -- Attaching packages --------------------------------------------------------------------- tidyverse 1.2.0 --
## v ggplot2 2.2.1 v purrr 0.2.4
## v tibble 1.3.4 v dplyr 0.7.4
## v tidyr 0.7.2 v stringr 1.2.0
## v readr 1.1.1 v forcats 0.2.0
## -- Conflicts ------------------------------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
setwd("D:/Studium/DataScience/Course Material/DataScience/05 Reproducible Research")
url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
file <- "stormData.bz2"
if(!file.exists(file)) {
  download.file(url, file)
}
if(!exists("storm_data")){
  storm_data <- as_tibble(read.csv("stormData.bz2"))
}
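As a side note (this is not the approach used for the results below), readr::read_csv from the tidyverse can read the bz2-compressed file directly and already returns a tibble, so the as_tibble() wrapper would not be needed. A minimal sketch, reusing the file variable from above; note that read_csv keeps text columns as characters rather than factors, which would make the mutate_if(is.factor, as.character) steps further down unnecessary.

# Alternative sketch (not used for the results in this report):
# readr::read_csv handles the bz2 compression transparently and returns a tibble.
# It also reads text columns as characters instead of factors.
if(!exists("storm_data")){
  storm_data <- readr::read_csv(file)
}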
Looking at the data, it becomes clear that only the following columns are needed for this analysis: EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, and CROPDMGEXP.
Unfortunately, there does not seem to be any documentation on the exponent columns PROPDMGEXP and CROPDMGEXP. Clearly, their values need to be translated into numeric factors and then, I assume, simply multiplied with the base damage values; for example, PROPDMG = 25 with PROPDMGEXP = "K" would become 25 * 1,000 = 25,000. I translate the values as well as I can; values I cannot interpret get the factor 1, so the exponent has no effect.
To clearly show the translation, I create a data set with all distinct exponent values. I then join this translation data set to the original data and use the translated factors in the mutate steps.
str(storm_data)
## Classes 'tbl_df', 'tbl' and 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
# collect all exponents in one distinct set
exp_prop <- rename(distinct(storm_data, PROPDMGEXP), exp = PROPDMGEXP)
exp_crop <- rename(distinct(storm_data, CROPDMGEXP), exp = CROPDMGEXP)
exp <- distinct(rbind(exp_prop, exp_crop), exp)
(exp <- exp %>% mutate_if(is.factor, as.character))
## # A tibble: 20 x 1
## exp
## <chr>
## 1 K
## 2 M
## 3
## 4 B
## 5 m
## 6 +
## 7 0
## 8 5
## 9 6
## 10 ?
## 11 4
## 12 2
## 13 3
## 14 h
## 15 7
## 16 H
## 17 -
## 18 1
## 19 8
## 20 k
# function to translate the exponents as far as feasible and set the rest to 1
clean_exp <- function(exp) {
  exp_clean <- 1
  if (toupper(exp) == "H") exp_clean <- 1e02
  else if (toupper(exp) == "K") exp_clean <- 1e03
  else if (toupper(exp) == "M") exp_clean <- 1e06
  else if (toupper(exp) == "B") exp_clean <- 1e09
  # all other values (digits, "", "?", "+", "-") keep the default factor of 1
  return(exp_clean)
}
# The function needs to be applied rowwise, which is expensive.
# Do the rowwise cleaning on this small dataset instead of the large one!
(exp <- exp %>% rowwise() %>% mutate(exp_clean = clean_exp(exp)))
## Source: local data frame [20 x 2]
## Groups: <by row>
##
## # A tibble: 20 x 2
## exp exp_clean
## <chr> <dbl>
## 1 K 1e+03
## 2 M 1e+06
## 3 1e+00
## 4 B 1e+09
## 5 m 1e+06
## 6 + 1e+00
## 7 0 1e+00
## 8 5 1e+00
## 9 6 1e+00
## 10 ? 1e+00
## 11 4 1e+00
## 12 2 1e+00
## 13 3 1e+00
## 14 h 1e+02
## 15 7 1e+00
## 16 H 1e+02
## 17 - 1e+00
## 18 1 1e+00
## 19 8 1e+00
## 20 k 1e+03
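For reference, the same translation could also be written without rowwise() by using dplyr::case_when, which evaluates its conditions on whole columns at once. A sketch, assuming the exp tibble built above; it would reproduce the exp_clean column shown in the output (the new exp_vectorized object is only for illustration and is not used further).

# Vectorized sketch: case_when works column-wise, so no rowwise() grouping is needed.
exp_vectorized <- exp %>%
  mutate(exp_clean = case_when(
    toupper(exp) == "H" ~ 1e02,
    toupper(exp) == "K" ~ 1e03,
    toupper(exp) == "M" ~ 1e06,
    toupper(exp) == "B" ~ 1e09,
    TRUE ~ 1
  ))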
# get a data set with all relevant columns and clean values
(storm_data_clean <- storm_data %>%
  select(EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP) %>%
  mutate_if(is.factor, as.character) %>%
  left_join(exp, c("PROPDMGEXP" = "exp")) %>%
  mutate(PROPDMG_CLEAN = PROPDMG * exp_clean) %>%
  select(-exp_clean) %>%
  left_join(exp, c("CROPDMGEXP" = "exp")) %>%
  mutate(CROPDMG_CLEAN = CROPDMG * exp_clean) %>%
  select(-exp_clean))
## # A tibble: 902,297 x 9
## EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## <chr> <dbl> <dbl> <dbl> <chr> <dbl> <chr>
## 1 TORNADO 0 15 25.0 K 0
## 2 TORNADO 0 0 2.5 K 0
## 3 TORNADO 0 2 25.0 K 0
## 4 TORNADO 0 2 2.5 K 0
## 5 TORNADO 0 2 2.5 K 0
## 6 TORNADO 0 6 2.5 K 0
## 7 TORNADO 0 1 2.5 K 0
## 8 TORNADO 0 0 2.5 K 0
## 9 TORNADO 1 14 25.0 K 0
## 10 TORNADO 0 0 25.0 K 0
## # ... with 902,287 more rows, and 2 more variables: PROPDMG_CLEAN <dbl>,
## # CROPDMG_CLEAN <dbl>
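Because the translation table was built from the distinct exponent values of this very data set, the two left joins should not introduce any missing values. A quick sanity check, sketched with the objects above:

# Sanity check sketch: both cleaned damage columns should contain no NAs,
# since every exponent value in the data has a match in the translation table.
storm_data_clean %>%
  summarize(
    missing_prop = sum(is.na(PROPDMG_CLEAN)),
    missing_crop = sum(is.na(CROPDMG_CLEAN))
  )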
# sum the relevant values per EVTYPE
(evtype_sums <- storm_data_clean %>%
  rename(`Event Type` = EVTYPE) %>%
  group_by(`Event Type`) %>%
  summarize(
    Fatalities = sum(FATALITIES),
    Injuries = sum(INJURIES),
    `Property Damage` = sum(PROPDMG_CLEAN),
    `Crop Damage` = sum(CROPDMG_CLEAN)))
## # A tibble: 985 x 5
## `Event Type` Fatalities Injuries `Property Damage`
## <chr> <dbl> <dbl> <dbl>
## 1 HIGH SURF ADVISORY 0 0 200000
## 2 COASTAL FLOOD 0 0 0
## 3 FLASH FLOOD 0 0 50000
## 4 LIGHTNING 0 0 0
## 5 TSTM WIND 0 0 8100000
## 6 TSTM WIND (G45) 0 0 8000
## 7 WATERSPOUT 0 0 0
## 8 WIND 0 0 0
## 9 ? 0 0 5000
## 10 ABNORMAL WARMTH 0 0 0
## # ... with 975 more rows, and 1 more variables: `Crop Damage` <dbl>
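As a further consistency check, the grouped sums should add up to the overall totals of the cleaned data. A sketch (dplyr::near is used instead of == to allow for floating-point rounding):

# Consistency check sketch: totals before and after grouping must agree.
near(sum(evtype_sums$Fatalities), sum(storm_data_clean$FATALITIES))
near(sum(evtype_sums$`Property Damage`), sum(storm_data_clean$PROPDMG_CLEAN))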
Recall the questions to be answered in this report: which event types are most harmful to population health, and which have the greatest economic consequences? I answer each question with a top-10 bar chart.
With the given data, population health can be measured in the dimensions "fatalities" and "injuries". I report these dimensions separately because it is not feasible to aggregate them with a formula like 1 fatality = 10 injuries. Tornadoes are clearly the most harmful events in both dimensions.
evtype_sums %>%
  arrange(desc(Fatalities)) %>%
  head(10) %>%
  ggplot(mapping = aes(x = reorder(`Event Type`, Fatalities),
                       y = Fatalities, fill = "red")) +
  geom_bar(stat = "identity") + coord_flip() +
  labs(x = "Event Type", title = "Top 10 Storm Event Types by Count of Fatalities") +
  guides(fill = FALSE)
evtype_sums %>%
  arrange(desc(Injuries)) %>%
  head(10) %>%
  ggplot(mapping = aes(x = reorder(`Event Type`, Injuries),
                       y = Injuries, fill = "blue")) +
  geom_bar(stat = "identity") + coord_flip() +
  labs(x = "Event Type", title = "Top 10 Storm Event Types by Count of Injuries") +
  guides(fill = FALSE)
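The two charts above share the same structure, so the plotting code could be factored into a small helper function. A sketch under the assumption that plain base-R indexing and aes_string are acceptable; plot_top10 is a hypothetical name, not part of the original analysis, and the sketch omits the fill colors used above.

# Helper sketch: reusable top-10 chart for any numeric measure in evtype_sums.
plot_top10 <- function(sums, measure) {
  # pick the ten largest rows for the requested measure
  top <- head(sums[order(-sums[[measure]]), ], 10)
  # order the bars by the measure for a tidy flipped bar chart
  top$`Event Type` <- reorder(top$`Event Type`, top[[measure]])
  ggplot(top, aes_string(x = "`Event Type`", y = paste0("`", measure, "`"))) +
    geom_bar(stat = "identity") + coord_flip() +
    labs(x = "Event Type",
         title = paste("Top 10 Storm Event Types by Count of", measure))
}
plot_top10(evtype_sums, "Fatalities")
plot_top10(evtype_sums, "Injuries")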
Economic consequences can be measured in the dimensions "property damage" and "crop damage". I assume the respective values are given in USD and add them together to build the top-10 list. Floods have the greatest economic consequences.
evtype_sums %>%
  rowwise() %>%
  mutate(`Total Damage` = sum(`Property Damage`, `Crop Damage`)) %>%
  arrange(desc(`Total Damage`)) %>%
  head(10) %>%
  mutate(`Event Type` = factor(
    `Event Type`,
    levels = `Event Type`[order(`Total Damage`)],
    ordered = TRUE)) %>%
  select(`Event Type`, `Property Damage`, `Crop Damage`) %>%
  gather(`Property Damage`, `Crop Damage`,
         key = "Damage Type", value = "Damage Value") %>%
  ggplot(mapping = aes(x = `Event Type`, y = `Damage Value`, fill = `Damage Type`)) +
  geom_bar(stat = "identity") + coord_flip() +
  labs(x = "Event Type",
       y = "Damage in USD",
       title = "Top 10 Storm Event Types by Damage Value")
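Finally, in the spirit of reproducible reports, the exact R and package versions used could be recorded at the end of the analysis; a minimal sketch:

# Record the R version and loaded package versions for reproducibility.
sessionInfo()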