Extraordinary weather events can have devastating effects on public health and the economy. Understanding the amount of damage they cause is critical for policy making and taking preventative measures. In this analysis, we try to find out which type of weather events are most damaging to public health and economy. We use U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database and R statistical programming to answer these questions. NOAA database contains data on extreme weather events happened in USA between years 1950-2011. According to our results, tornados are by far the most devastating type of weather event in terms of public health (5633 deaths, 91346 injuries in total). When it comes to economic costs, floods cause most damage (150 billion U.S. Dollars) followed by hurricanes (71 billion U.S. Dollars) and tornados (57 billion U.S. Dollars). While droughts don’t cause much property damage, they have devastating effects on agriculture.
We start by loading tidyverse package which includes dplyr, ggplot2 and other useful packages. Since we will work on very large numbers, we run options(scipen = 999) command to change R Studio’s default behaviour of using scientific notation for large numbers.
library(tidyverse)
## Registered S3 methods overwritten by 'ggplot2':
## method from
## [.quosures rlang
## c.quosures rlang
## print.quosures rlang
## Registered S3 method overwritten by 'rvest':
## method from
## read_xml.response xml2
## -- Attaching packages ----------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.1.1 v purrr 0.3.2
## v tibble 2.1.1 v dplyr 0.8.0.1
## v tidyr 0.8.3 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## -- Conflicts -------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
options(scipen = 999)
We download the database using “download.file” function from URL within an “if” sentence. Code tests whether there is a file with the same name, if the file doesn’t exist it downloads the file.
if (!file.exists("stormdata.csv.bz2")) {
url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(url, "stormdata.csv.bz2")
}
After downloading the file we read it to create a dataset named “stormdata”. “Read.csv” function can directly read zipped files so we don’t need to unzip it.
stormdata <- read.csv("stormdata.csv.bz2")
Our first question (“Which types of events are most harmful with respect to population health?”) doesn’t require any data wrangling. Later in the analysis we will just use dplyr functions to calculate the answer and ggplot to visualise it. On the other hand, second question (“Across the United States, which types of events have the greatest economic consequences?”) require some wrangling to compute the results. If we check relevant variables (PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP), we see they are coded rather unusually. Damage numbers are coded in pairs, one for numeric value and one for multiplier. Multiplier columns are factor variables and we can check their levels with levels function.
levels(stormdata$PROPDMGEXP)
## [1] "" "-" "?" "+" "0" "1" "2" "3" "4" "5" "6" "7" "8" "B" "h" "H" "K"
## [18] "m" "M"
levels(stormdata$CROPDMGEXP)
## [1] "" "?" "0" "2" "B" "k" "K" "m" "M"
As we can see some levels are coded fairly intuitively with H,K,M,B standing for hundred, thousand, million and billion while numeric and other values are not. Documentations doesn’t help us much here too as there are no references to these values. To find out how many of them and what should we do with these confusing values we can use dplyr group_by and summarize functions.
stormdata %>% group_by(PROPDMGEXP) %>% summarize(n = n())
## # A tibble: 19 x 2
## PROPDMGEXP n
## <fct> <int>
## 1 "" 465934
## 2 - 1
## 3 ? 8
## 4 + 5
## 5 0 216
## 6 1 25
## 7 2 13
## 8 3 4
## 9 4 4
## 10 5 28
## 11 6 4
## 12 7 5
## 13 8 1
## 14 B 40
## 15 h 1
## 16 H 6
## 17 K 424665
## 18 m 7
## 19 M 11330
We can see that there are negligible amount of rows coded with these values. Same is true for CROPDMGEXP variable.
stormdata %>% group_by(CROPDMGEXP) %>% summarize(n = n())
## # A tibble: 9 x 2
## CROPDMGEXP n
## <fct> <int>
## 1 "" 618413
## 2 ? 7
## 3 0 19
## 4 2 1
## 5 B 9
## 6 k 21
## 7 K 281832
## 8 m 1
## 9 M 1994
Given their very low number (i.e. 314) we can safely remove them without any effect on our results. In order to do this we use dplyr filter function combined with fct_drop which removes unused factor levels.
stormdata_dmg <- stormdata %>% filter(PROPDMGEXP %in% c("h", "H", "K", "m", "M", "B", "")) %>% mutate(PROPDMGEXP = fct_drop(PROPDMGEXP))
Now we need to convert remaining PROPDMGEXP values to numeric. “h” and “H” stand for “hundred”, “K” for thousand, “m” and “M” for “million”, “B” for “billion”. We just need to convert them to their standard numeric values and later we can multiply them with PROPDMG variable to get standardised damage values. We will first use “fct_recode” function to change factor levels. Then we convert PROPDMGEXP to character and then to numeric because converting factor values directly to numeric values doesn’t work. After that we replace PROPDMG values with multiplication results of two variables.
stormdata_dmg$PROPDMGEXP <- fct_recode(stormdata_dmg$PROPDMGEXP, "1" = "", "1000" = "K", "100" = "h", "100" = "h", "1000000" = "M", "1000000" = "m", "1000000000" = "B")
stormdata_dmg$PROPDMGEXP<- as.character(stormdata_dmg$PROPDMGEXP)
stormdata_dmg$PROPDMGEXP<- as.numeric(stormdata_dmg$PROPDMGEXP)
## Warning: NAs introduced by coercion
stormdata_dmg <- stormdata_dmg %>% mutate(PROPDMG = PROPDMG * PROPDMGEXP)
We do the same for CROPDMG and CROPDMGEXP variables.
stormdata_dmg <- stormdata_dmg %>% filter(CROPDMGEXP %in% c("k", "K", "m", "M", "B", "")) %>% mutate(CROPDMGEXP = fct_drop(CROPDMGEXP))
stormdata_dmg$CROPDMGEXP <- fct_recode(stormdata_dmg$CROPDMGEXP, "1" = "", "1000" = "k", "1000" = "K", "1000000" = "M", "1000000" = "m", "1000000000" = "B")
stormdata_dmg$CROPDMGEXP<- as.character(stormdata_dmg$CROPDMGEXP)
stormdata_dmg$CROPDMGEXP<- as.numeric(stormdata_dmg$CROPDMGEXP)
stormdata_dmg <- stormdata_dmg %>% mutate(CROPDMG = CROPDMG * CROPDMGEXP)
Now that we have standardised damage values, we can go ahead and start our analysis.
To answer this question, we use group_by and summarize functions from dplyr package. After summarizing we arrange the results by death count.
stormdata_sum <- stormdata %>% group_by(EVTYPE) %>% summarize(deaths = sum(FATALITIES), injuries = sum(INJURIES)) %>% arrange(desc(deaths))
We select first 10 rows showing event types with most deaths.
stormdata_sum <- stormdata_sum[1:10, ]
stormdata_sum
## # A tibble: 10 x 3
## EVTYPE deaths injuries
## <fct> <dbl> <dbl>
## 1 TORNADO 5633 91346
## 2 EXCESSIVE HEAT 1903 6525
## 3 FLASH FLOOD 978 1777
## 4 HEAT 937 2100
## 5 LIGHTNING 816 5230
## 6 TSTM WIND 504 6957
## 7 FLOOD 470 6789
## 8 RIP CURRENT 368 232
## 9 HIGH WIND 248 1137
## 10 AVALANCHE 224 170
To visualise our results with ggplot in one graphic, we need to use gather function. “Gather” converts columns to variables.
stormdata_gather <- gather(stormdata_sum, "type", "count", 2:3)
Now we can visualise it.
ggplot(stormdata_gather, aes(x = EVTYPE, y = count, fill = type)) + geom_bar(stat = "identity", position = "dodge") +
geom_text(aes(label = count), position = position_dodge(0.9), vjust = -0.5) + scale_fill_brewer(palette = "Pastel1") +
ggtitle("Ten Most Harmful Weather Events in USA with Respect to Public Health (1950-2011)") + labs(x = "Type of Event", y = "Total") +
theme(legend.title = element_blank())
We can use the same method for calculating economic costs of different event types.
stormdatadmg_sum <- stormdata_dmg %>% group_by(EVTYPE) %>% summarize(PropDamage = sum(PROPDMG, na.rm = T), CropDamage = sum(CROPDMG), Total = CropDamage + PropDamage) %>% arrange(desc(Total))
stormdatadmg_sum <- stormdatadmg_sum[1:10,]
stormdatadmg_sum
## # A tibble: 10 x 4
## EVTYPE PropDamage CropDamage Total
## <fct> <dbl> <dbl> <dbl>
## 1 FLOOD 144657709807 5661968450 150319678257
## 2 HURRICANE/TYPHOON 69305840000 2607872800 71913712800
## 3 TORNADO 56936985483 364950110 57301935593
## 4 STORM SURGE 43323536000 5000 43323541000
## 5 HAIL 15732261777 3000954453 18733216230
## 6 FLASH FLOOD 16140811717 1420727100 17561538817
## 7 DROUGHT 1046106000 13972566000 15018672000
## 8 HURRICANE 11868319010 2741910000 14610229010
## 9 RIVER FLOOD 5118945500 5029459000 10148404500
## 10 ICE STORM 3944927810 5022110000 8967037810
Like before we use gather function on summary data to visualise everthing in one graph.
stormdatadmg_gather <- gather(stormdatadmg_sum, "type", "damage", 2:4)
Like before it’s ready for visualisation.
ggplot(stormdatadmg_gather, aes(x = EVTYPE, y = round(damage /1000000000, digits = 2), fill = type)) + geom_bar(stat = "identity", position = "dodge") +
geom_text(aes(label = round(damage/1000000000, digits = 2)), position = position_dodge(0.9), vjust = -0.5) + scale_fill_brewer(palette = "Pastel2") +
ggtitle("Ten Most Harmful Weather Events in USA with Respect to Economic Cost (1950-2011)") + labs(x = "Type of Event", y = "Damage Costs (In USD Billions)") +
theme(legend.title = element_blank())