Synopsis

Extraordinary weather events can have devastating effects on public health and the economy. Understanding the amount of damage they cause is critical for policy making and taking preventative measures. In this analysis, we try to find out which type of weather events are most damaging to public health and economy. We use U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database and R statistical programming to answer these questions. NOAA database contains data on extreme weather events happened in USA between years 1950-2011. According to our results, tornados are by far the most devastating type of weather event in terms of public health (5633 deaths, 91346 injuries in total). When it comes to economic costs, floods cause most damage (150 billion U.S. Dollars) followed by hurricanes (71 billion U.S. Dollars) and tornados (57 billion U.S. Dollars). While droughts don’t cause much property damage, they have devastating effects on agriculture.

Required Packages and Data Loading

We start by loading tidyverse package which includes dplyr, ggplot2 and other useful packages. Since we will work on very large numbers, we run options(scipen = 999) command to change R Studio’s default behaviour of using scientific notation for large numbers.

library(tidyverse)
## Registered S3 methods overwritten by 'ggplot2':
##   method         from 
##   [.quosures     rlang
##   c.quosures     rlang
##   print.quosures rlang
## Registered S3 method overwritten by 'rvest':
##   method            from
##   read_xml.response xml2
## -- Attaching packages ----------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.1.1       v purrr   0.3.2  
## v tibble  2.1.1       v dplyr   0.8.0.1
## v tidyr   0.8.3       v stringr 1.4.0  
## v readr   1.3.1       v forcats 0.4.0
## -- Conflicts -------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
options(scipen = 999)

We download the database using “download.file” function from URL within an “if” sentence. Code tests whether there is a file with the same name, if the file doesn’t exist it downloads the file.

if (!file.exists("stormdata.csv.bz2")) {
  
  url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
  
  download.file(url, "stormdata.csv.bz2")
  
}

After downloading the file we read it to create a dataset named “stormdata”. “Read.csv” function can directly read zipped files so we don’t need to unzip it.

stormdata <- read.csv("stormdata.csv.bz2")

Data Processing and Wrangling

Our first question (“Which types of events are most harmful with respect to population health?”) doesn’t require any data wrangling. Later in the analysis we will just use dplyr functions to calculate the answer and ggplot to visualise it. On the other hand, second question (“Across the United States, which types of events have the greatest economic consequences?”) require some wrangling to compute the results. If we check relevant variables (PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP), we see they are coded rather unusually. Damage numbers are coded in pairs, one for numeric value and one for multiplier. Multiplier columns are factor variables and we can check their levels with levels function.

levels(stormdata$PROPDMGEXP)
##  [1] ""  "-" "?" "+" "0" "1" "2" "3" "4" "5" "6" "7" "8" "B" "h" "H" "K"
## [18] "m" "M"
levels(stormdata$CROPDMGEXP)
## [1] ""  "?" "0" "2" "B" "k" "K" "m" "M"

As we can see some levels are coded fairly intuitively with H,K,M,B standing for hundred, thousand, million and billion while numeric and other values are not. Documentations doesn’t help us much here too as there are no references to these values. To find out how many of them and what should we do with these confusing values we can use dplyr group_by and summarize functions.

stormdata %>% group_by(PROPDMGEXP) %>% summarize(n = n())
## # A tibble: 19 x 2
##    PROPDMGEXP      n
##    <fct>       <int>
##  1 ""         465934
##  2 -               1
##  3 ?               8
##  4 +               5
##  5 0             216
##  6 1              25
##  7 2              13
##  8 3               4
##  9 4               4
## 10 5              28
## 11 6               4
## 12 7               5
## 13 8               1
## 14 B              40
## 15 h               1
## 16 H               6
## 17 K          424665
## 18 m               7
## 19 M           11330

We can see that there are negligible amount of rows coded with these values. Same is true for CROPDMGEXP variable.

stormdata %>% group_by(CROPDMGEXP) %>% summarize(n = n())
## # A tibble: 9 x 2
##   CROPDMGEXP      n
##   <fct>       <int>
## 1 ""         618413
## 2 ?               7
## 3 0              19
## 4 2               1
## 5 B               9
## 6 k              21
## 7 K          281832
## 8 m               1
## 9 M            1994

Given their very low number (i.e. 314) we can safely remove them without any effect on our results. In order to do this we use dplyr filter function combined with fct_drop which removes unused factor levels.

stormdata_dmg <- stormdata %>% filter(PROPDMGEXP %in% c("h", "H", "K", "m", "M", "B", "")) %>% mutate(PROPDMGEXP = fct_drop(PROPDMGEXP))

Now we need to convert remaining PROPDMGEXP values to numeric. “h” and “H” stand for “hundred”, “K” for thousand, “m” and “M” for “million”, “B” for “billion”. We just need to convert them to their standard numeric values and later we can multiply them with PROPDMG variable to get standardised damage values. We will first use “fct_recode” function to change factor levels. Then we convert PROPDMGEXP to character and then to numeric because converting factor values directly to numeric values doesn’t work. After that we replace PROPDMG values with multiplication results of two variables.

stormdata_dmg$PROPDMGEXP <- fct_recode(stormdata_dmg$PROPDMGEXP, "1" = "", "1000" = "K", "100" = "h", "100" = "h", "1000000" = "M", "1000000" = "m", "1000000000" = "B") 

stormdata_dmg$PROPDMGEXP<- as.character(stormdata_dmg$PROPDMGEXP)
stormdata_dmg$PROPDMGEXP<- as.numeric(stormdata_dmg$PROPDMGEXP)
## Warning: NAs introduced by coercion
stormdata_dmg <- stormdata_dmg %>% mutate(PROPDMG = PROPDMG * PROPDMGEXP)

We do the same for CROPDMG and CROPDMGEXP variables.

stormdata_dmg <- stormdata_dmg %>% filter(CROPDMGEXP %in% c("k", "K", "m", "M", "B", "")) %>% mutate(CROPDMGEXP = fct_drop(CROPDMGEXP))


stormdata_dmg$CROPDMGEXP <- fct_recode(stormdata_dmg$CROPDMGEXP, "1" = "", "1000" = "k", "1000" = "K", "1000000" = "M", "1000000" = "m", "1000000000" = "B") 

stormdata_dmg$CROPDMGEXP<- as.character(stormdata_dmg$CROPDMGEXP)
stormdata_dmg$CROPDMGEXP<- as.numeric(stormdata_dmg$CROPDMGEXP)

stormdata_dmg <- stormdata_dmg %>% mutate(CROPDMG = CROPDMG * CROPDMGEXP)

Now that we have standardised damage values, we can go ahead and start our analysis.

Results

1.Which type of weather events are most harmful with respect to population health?

To answer this question, we use group_by and summarize functions from dplyr package. After summarizing we arrange the results by death count.

stormdata_sum <- stormdata %>% group_by(EVTYPE) %>% summarize(deaths = sum(FATALITIES), injuries = sum(INJURIES)) %>% arrange(desc(deaths)) 

We select first 10 rows showing event types with most deaths.

stormdata_sum <- stormdata_sum[1:10, ]
stormdata_sum
## # A tibble: 10 x 3
##    EVTYPE         deaths injuries
##    <fct>           <dbl>    <dbl>
##  1 TORNADO          5633    91346
##  2 EXCESSIVE HEAT   1903     6525
##  3 FLASH FLOOD       978     1777
##  4 HEAT              937     2100
##  5 LIGHTNING         816     5230
##  6 TSTM WIND         504     6957
##  7 FLOOD             470     6789
##  8 RIP CURRENT       368      232
##  9 HIGH WIND         248     1137
## 10 AVALANCHE         224      170

To visualise our results with ggplot in one graphic, we need to use gather function. “Gather” converts columns to variables.

stormdata_gather <- gather(stormdata_sum, "type", "count", 2:3)

Now we can visualise it.

ggplot(stormdata_gather, aes(x = EVTYPE, y = count, fill = type)) + geom_bar(stat = "identity", position = "dodge") +
  geom_text(aes(label = count),  position = position_dodge(0.9), vjust = -0.5) + scale_fill_brewer(palette = "Pastel1") +
  ggtitle("Ten Most Harmful Weather Events in USA with Respect to Public Health (1950-2011)") + labs(x = "Type of Event", y = "Total") + 
  theme(legend.title = element_blank())

2. Across the United States, which types of events have the greatest economic consequences?"

We can use the same method for calculating economic costs of different event types.

stormdatadmg_sum <- stormdata_dmg %>% group_by(EVTYPE) %>% summarize(PropDamage = sum(PROPDMG, na.rm = T), CropDamage = sum(CROPDMG), Total = CropDamage + PropDamage) %>% arrange(desc(Total)) 

stormdatadmg_sum <- stormdatadmg_sum[1:10,]
stormdatadmg_sum
## # A tibble: 10 x 4
##    EVTYPE              PropDamage  CropDamage        Total
##    <fct>                    <dbl>       <dbl>        <dbl>
##  1 FLOOD             144657709807  5661968450 150319678257
##  2 HURRICANE/TYPHOON  69305840000  2607872800  71913712800
##  3 TORNADO            56936985483   364950110  57301935593
##  4 STORM SURGE        43323536000        5000  43323541000
##  5 HAIL               15732261777  3000954453  18733216230
##  6 FLASH FLOOD        16140811717  1420727100  17561538817
##  7 DROUGHT             1046106000 13972566000  15018672000
##  8 HURRICANE          11868319010  2741910000  14610229010
##  9 RIVER FLOOD         5118945500  5029459000  10148404500
## 10 ICE STORM           3944927810  5022110000   8967037810

Like before we use gather function on summary data to visualise everthing in one graph.

stormdatadmg_gather <- gather(stormdatadmg_sum, "type", "damage", 2:4)

Like before it’s ready for visualisation.

ggplot(stormdatadmg_gather, aes(x = EVTYPE, y = round(damage /1000000000, digits = 2), fill = type)) + geom_bar(stat = "identity", position = "dodge") +
  geom_text(aes(label = round(damage/1000000000, digits = 2)),  position = position_dodge(0.9), vjust = -0.5) + scale_fill_brewer(palette = "Pastel2") +
  ggtitle("Ten Most Harmful Weather Events in USA with Respect to Economic Cost (1950-2011)") + labs(x = "Type of Event", y = "Damage Costs (In USD Billions)") + 
  theme(legend.title = element_blank())