# libraries
library(ggplot2)
library(dplyr)
library(ggpubr)
Sys.setlocale("LC_ALL","English") # set English as locale if necessary
## [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
options(scipen=999) # turn-off scientific notation like 1e+48
The analysis on the storm event database revealed that tornadoes are the most dangerous weather event to the population health. The second most dangerous event type is the excessive heat. The economic impact of weather events was also analyzed. Flash floods and thunderstorm winds caused billions of dollars in property damages between 1950 and 2011. The largest crop damage caused by drought, followed by flood and hails.
The data come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. You can download the file here:
There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
In the follow lines of code we download the data automatically into a folder called ‘dataset’ with the name of FstormData.csv.bz2.
# create 'dataset' folder if not exists
if(!dir.exists("./dataset")){
dir.create("./dataset")
}
#download the file into 'dataset' folder if not exists
if(!file.exists("./dataset/FStormData.csv.bz2")){
download.file(url = "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
,destfile = "./dataset/FStormData.csv.bz2")
}
Now we can load the data into a variable.
# read the csv file (this can take a few seconds)
storm.data <- read.csv("./dataset/FStormData.csv.bz2")
Fact this, now we can look at a summary of the data.
str(storm.data)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
We can see that the data contains 902297 rows and 37 variables and are very well formatted, here I have an unique consideration, and is that PROPDMG and CROPDMG variables are expressed in units of type K (Thousands), M (Millions) and B (Billions) , as we can see in PROPDMGEXP and CROPDMGEXP variables. I think that the best choice is convert this values to its real value (ex. 1K = 1000).
Well, we can get this very quickly by writing a simple function to do it for us.
as.usd.dollar <- function(value, exp){
if(exp == "K"){
return(value*1000)
}else if(exp=="M"){
return(value*1000000)
}else if(exp =="B"){
return(value*1000000000)
}else{
return(value)
}
}
damage <- select(storm.data, EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP, FATALITIES, INJURIES)
damage <- mutate(damage, PROPDMG.USD = as.usd.dollar(PROPDMG, PROPDMGEXP))
damage <- mutate(damage, CROPDMG.USD = as.usd.dollar(CROPDMG, CROPDMGEXP))
# remove unused variables for a data set more clean
damage <- select(damage, EVTYPE, PROPDMG.USD, CROPDMG.USD, FATALITIES, INJURIES)
That is all, now we have a new data set with correct units and only the variables required.
str(damage)
## 'data.frame': 902297 obs. of 5 variables:
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ PROPDMG.USD: num 25000 2500 25000 2500 2500 2500 2500 2500 25000 25000 ...
## $ CROPDMG.USD: num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES : num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
Additionally we can need create some new data groups.
For fatalities:
fatalities <- damage %>% group_by(EVTYPE) %>% summarise(Count = sum(FATALITIES))
fatalities <- fatalities[fatalities$Count > 150,]
For injuries:
injuries <- damage %>% group_by(EVTYPE) %>% summarise(Count = sum(INJURIES))
injuries <- injuries[injuries$Count > 1000,]
For property damage:
prop <- damage %>% group_by(EVTYPE) %>% summarise(Count = sum(PROPDMG.USD))
prop <- prop[prop$Count > 20000000,]
And finally for Crop damage:
crop <- damage %>% group_by(EVTYPE) %>% summarise(Count = sum(CROPDMG.USD))
crop <- crop[crop$Count > 5000,]
fatalities.plot <- ggplot(fatalities) +
geom_col(aes(x=fatalities$EVTYPE, y=fatalities$Count, fill = EVTYPE)) +
xlab("") +
ylab("Fatalities") +
ggtitle("Fatalities by Event Type (greater than 150)") +
coord_flip() +
theme(legend.position = "none")
injuries.plot <- ggplot(injuries) +
geom_col(aes(x=injuries$EVTYPE, y=injuries$Count, fill = EVTYPE)) +
xlab("") +
ylab("Injuries") +
ggtitle("Injuries by Event Type (greater than 1000)") +
coord_flip() +
theme(legend.position = "none")
ggarrange(fatalities.plot, injuries.plot, ncol = 2, nrow = 1)
In the previous graphs, we have a very clear vision of the types of events that most harm the health of the population.
Among the most fatal we find the Tornado, Excessive Heat, Flash Flood, Heat and Lighting. Also, among the causes of more injuries are the Tornado, TSTM wind, Flood, Heat and Lighting.
prop.plot <- ggplot(prop) +
geom_col(aes(x=prop$EVTYPE, y=prop$Count, fill = EVTYPE)) +
xlab("") +
ggtitle("Property Damage by Event Type (greater than 20M)") +
ylab("Damage Cost (USD)") +
coord_flip() +
theme(legend.position = "none")
crop.plot <- ggplot(crop) +
geom_col(aes(x=crop$EVTYPE, y=crop$Count, fill = EVTYPE)) +
xlab("") +
ggtitle("Crop Damage by Event Type (greater than 5000)") +
ylab("Damage Cost (USD)") +
coord_flip() +
theme(legend.position = "none")
ggarrange(prop.plot, crop.plot, ncol = 2, nrow = 1)
In the graphs above we can see the most expensive events for the communities compared to their cost in US dollars. On the right we have the costs for property damage and on the left we have the costs for crop damage.
Among the most costly for property damage are the Tornado, Excessive Heat, Flash Flood, Storm Winds, Flood and Storm Wind. On the other hand, among the most expensive for damage to crops we have the Hail, Flash Flood, Flood, TSTM wind and Tornado.
#Environment specifications
sessionInfo()
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 10240)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggpubr_0.4.0 dplyr_1.0.2 ggplot2_3.3.2
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.1.0 xfun_0.16 purrr_0.3.4 haven_2.3.1
## [5] carData_3.0-4 colorspace_1.4-1 vctrs_0.3.3 generics_0.0.2
## [9] htmltools_0.5.0 yaml_2.2.1 rlang_0.4.7 pillar_1.4.6
## [13] foreign_0.8-80 glue_1.4.2 withr_2.2.0 readxl_1.3.1
## [17] lifecycle_0.2.0 stringr_1.4.0 munsell_0.5.0 ggsignif_0.6.0
## [21] gtable_0.3.0 cellranger_1.1.0 zip_2.1.1 evaluate_0.14
## [25] labeling_0.3 knitr_1.29 rio_0.5.16 forcats_0.5.0
## [29] curl_4.3 fansi_0.4.1 broom_0.7.0 Rcpp_1.0.5
## [33] backports_1.1.7 scales_1.1.1 abind_1.4-5 farver_2.0.3
## [37] hms_0.5.3 digest_0.6.25 stringi_1.4.6 openxlsx_4.1.5
## [41] rstatix_0.6.0 grid_4.0.2 cowplot_1.1.0 cli_2.0.2
## [45] tools_4.0.2 magrittr_1.5 tibble_3.0.3 crayon_1.3.4
## [49] car_3.0-9 tidyr_1.1.2 pkgconfig_2.0.3 ellipsis_0.3.1
## [53] data.table_1.13.0 assertthat_0.2.1 rmarkdown_2.3 R6_2.4.1
## [57] compiler_4.0.2