Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
The purpose of this report is to present some basic plots of the information in the storm database of the U.S. National Oceanic and Atmospheric Administration (NOAA). More information can be found on their website.
To suppress the warnings and messages produced by the code chunks, and to save computing time through caching, we set the following global options.
knitr::opts_chunk$set(warning = FALSE, message = FALSE, cache = TRUE)
First, we load the tidyverse library to manage data frames and plots, together with lubridate for date handling and gridExtra for arranging plots side by side.
library(tidyverse)
library(lubridate)
library(gridExtra)
We download the database from the URL below. We then show the dimensions of the data set and a quick view of the first 3 observations.
fileURL <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
# mode = "wb" keeps the compressed file intact on Windows
download.file(fileURL, destfile = "StormData.csv.bz2", mode = "wb")
stormData <- read.csv("StormData.csv.bz2", header = TRUE, stringsAsFactors = FALSE)
dim(stormData)
## [1] 902297 37
head(stormData, 3)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE EVTYPE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL TORNADO
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL TORNADO
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL TORNADO
## BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
## 1 0 0 NA
## 2 0 0 NA
## 3 0 0 NA
## END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES PROPDMG
## 1 0 14.0 100 3 0 0 15 25.0
## 2 0 2.0 150 2 0 0 0 2.5
## 3 0 0.1 123 2 0 0 2 25.0
## PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE LONGITUDE
## 1 K 0 3040 8812
## 2 K 0 3042 8755
## 3 K 0 3340 8742
## LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3051 8806 1
## 2 0 0 2
## 3 0 0 3
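Since the compressed file is large, the download step above could be guarded so that it only runs when no local copy exists. The following is a minimal sketch of that idea; the helper name `fetch_once` is hypothetical and not part of the original report or of any package.

```r
# Download a file only when no local copy exists yet.
# `fetch_once` is a hypothetical helper name used for illustration.
fetch_once <- function(url, destfile) {
  if (!file.exists(destfile)) {
    # mode = "wb" keeps the compressed file intact on Windows
    download.file(url, destfile = destfile, mode = "wb")
  }
  invisible(destfile)
}

# Usage (commented out because it needs network access):
# fetch_once(fileURL, "StormData.csv.bz2")
```

This complements the `cache` chunk option: caching skips re-running the chunk, while the guard skips the download even when the cache is invalidated.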
We also inspect the structure of the data set.
str(stormData)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
We convert the BGN_DATE and END_DATE columns to the proper date format.
stormData$BGN_DATE <- as.Date(stormData$BGN_DATE, format = "%m/%d/%Y")
stormData$END_DATE <- as.Date(stormData$END_DATE, format = "%m/%d/%Y")
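To illustrate the conversion, here is a minimal sketch on a few hand-written strings in the same layout as BGN_DATE. The vector `raw_dates` is made up for the example. With an explicit `format`, `as.Date` reads the leading date part and ignores the trailing time, and an empty string becomes `NA`.

```r
# Hypothetical sample values in the same layout as BGN_DATE / END_DATE
raw_dates <- c("4/18/1950 0:00:00", "2/20/1951 0:00:00", "")

# %m/%d/%Y matches month/day/four-digit year; the trailing time is ignored
parsed <- as.Date(raw_dates, format = "%m/%d/%Y")

print(parsed)
print(class(parsed))

# The year can then be extracted for filtering
years <- as.integer(format(parsed, "%Y"))
print(years)
```

The `NA` for the empty string explains why END_DATE, which is blank in the early records, ends up mostly missing for those rows.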
According to the database details provided by NOAA, only Tornado, Thunderstorm Wind and Hail events were recorded before 1996. Therefore, we keep only the observations from 1996 onward. Also, since this report aims to describe the consequences of each type of event, we flag the events that caused some property damage, crop damage, fatalities or injuries (ecoImpact), and restrict most of the analysis below to those events.
stormData2 <- stormData %>% filter(year(BGN_DATE)>=1996) %>%
mutate(ecoImpact = (PROPDMG > 0 | CROPDMG > 0 | FATALITIES > 0 | INJURIES > 0))
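The filter and the flag can be traced on a handful of rows. The sketch below uses a made-up data frame `toy` with the same column names, and base R instead of dplyr so that it runs without extra packages; it is an illustration of the logic, not the report's own code.

```r
# Toy rows (made up for illustration) with the same column names as the storm data
toy <- data.frame(
  BGN_DATE   = as.Date(c("1995-06-01", "1996-01-15", "2001-07-04", "2005-08-29")),
  PROPDMG    = c(10, 0, 0, 5),
  CROPDMG    = c(0, 0, 0, 2),
  FATALITIES = c(0, 0, 1, 0),
  INJURIES   = c(0, 0, 0, 0)
)

# Keep events from 1996 onward
recent <- toy[as.integer(format(toy$BGN_DATE, "%Y")) >= 1996, ]

# TRUE when the event caused any damage or any human harm
recent$ecoImpact <- recent$PROPDMG > 0 | recent$CROPDMG > 0 |
  recent$FATALITIES > 0 | recent$INJURIES > 0

print(recent)
```

The 1995 row is dropped by the year filter, and the 1996 row with all zeros is kept but flagged `FALSE`, so it still counts toward overall frequencies.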
Let us take a quick look at how often each event type occurs. Since the data contain hundreds of distinct event types, we show only the ten most frequent ones. The left plot covers all events, and the right plot covers only the events that had an impact (ecoImpact). In both plots, the fill color encodes the total number of fatalities and injuries.
p <- stormData2 %>% group_by(EVTYPE) %>% summarise(n=n(), damage=sum(FATALITIES, INJURIES)) %>% slice_max(n, n=10) %>% ggplot() + geom_bar(aes(x=reorder(EVTYPE, -n), y=n, fill=damage), stat = "identity") + coord_flip() + theme(legend.position = "none") + scale_fill_gradient(low = "#add8e6", high = "#f08080")
q <- stormData2 %>% filter(ecoImpact) %>% group_by(EVTYPE) %>% summarise(n=n(), damage=sum(FATALITIES, INJURIES)) %>% slice_max(n, n=10) %>% ggplot() + geom_bar(aes(x=reorder(EVTYPE, -n), y=n, fill=damage), stat = "identity") + coord_flip() + scale_fill_gradient(name = "Human Damage", low = "#add8e6", high = "#f08080")
grid.arrange(p, q, ncol = 2)
Although HAIL has the highest number of occurrences overall, thunderstorm wind (TSTM WIND) is the most frequent event type with economic impact, and HAIL ranks only fourth on that measure. In both plots, however, it is clear that TORNADO events were the most harmful to human beings.
stormData2 %>% filter(ecoImpact) %>% group_by(EVTYPE) %>% summarise(n=n(), damage=sum(FATALITIES, INJURIES))
## # A tibble: 222 x 3
## EVTYPE n damage
## <chr> <int> <dbl>
## 1 " HIGH SURF ADVISORY" 1 0
## 2 " FLASH FLOOD" 1 0
## 3 " TSTM WIND" 2 0
## 4 " TSTM WIND (G45)" 1 0
## 5 "AGRICULTURAL FREEZE" 3 0
## 6 "ASTRONOMICAL HIGH TIDE" 8 0
## 7 "ASTRONOMICAL LOW TIDE" 2 0
## 8 "AVALANCHE" 264 379
## 9 "Beach Erosion" 1 0
## 10 "BLACK ICE" 1 25
## # ... with 212 more rows
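The table above shows EVTYPE labels with leading spaces and mixed case, which split what is presumably the same event type across several groups. One possible cleanup, which the original analysis does not apply, is to trim and upper-case the labels before grouping. The sketch below uses three labels from the table plus two assumed upper-case counterparts for contrast.

```r
# Three labels taken from the table above, plus assumed clean counterparts
labels <- c(" TSTM WIND", "TSTM WIND", "Beach Erosion", "BEACH EROSION", " FLASH FLOOD")

# Trim surrounding whitespace and convert to upper case
clean <- toupper(trimws(labels))

print(unique(clean))
print(table(clean))
```

This only handles whitespace and case; near-duplicates such as "TSTM WIND (G45)" would need explicit recoding rules.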
To see how the number of events with an economic impact has evolved over the years, we plot the daily count as a time series (left) and the same counts stacked by month (right).
p <- stormData2 %>% filter(ecoImpact) %>% group_by(BGN_DATE) %>% summarise(n=n()) %>% ggplot() + geom_line(aes(x=BGN_DATE, y=n))
q <- stormData2 %>% filter(ecoImpact) %>% group_by(BGN_DATE) %>% summarise(n=n()) %>% ggplot() + geom_bar(aes(x=month(BGN_DATE), y=n, fill = year(BGN_DATE)), stat = "identity") + scale_fill_gradient() + scale_x_continuous(breaks = 1:12, name = "Month") + theme(legend.position = "none")
grid.arrange(p, q, ncol = 2)
The left plot suggests some seasonality in the number of events. The right plot confirms it: events become more frequent in the middle of the year (May, June and July).
To compute the actual economic impact, we must scale PROPDMG and CROPDMG by the exponent codes stored in the PROPDMGEXP and CROPDMGEXP columns.
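The monthly view amounts to counting events per calendar month across all years. A minimal base R sketch of that aggregation, on made-up dates (`event_dates` is hypothetical), could look like this:

```r
# Made-up event dates, one entry per event
event_dates <- as.Date(c("1996-01-10", "1996-06-02", "1996-06-15",
                         "1997-06-20", "1997-07-01", "1998-12-24"))

# Month number (1-12) of each event
event_month <- as.integer(format(event_dates, "%m"))

# Count events per month across all years; factor levels keep empty months
monthly_counts <- table(factor(event_month, levels = 1:12))
print(monthly_counts)
```

Fixing the factor levels to 1 through 12 keeps months with no events in the result, so a seasonal gap shows up as an explicit zero.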
stormData2$PROPDMGEXP %>% unique()
## [1] "K" "" "M" "B" "0"
stormData2$CROPDMGEXP %>% unique()
## [1] "K" "" "M" "B"
We interpret those multipliers as follows:
[blank] or 0 -> \(\times 1\)
K -> \(\times 10^3\)
M -> \(\times 10^6\)
B -> \(\times 10^9\)
multiplier <- function(x) {
  # Map each exponent code to its numeric multiplier
  res <- numeric(length(x))
  res[x=="K"] <- 1e3
  res[x=="M"] <- 1e6
  res[x=="B"] <- 1e9
  res[x %in% c("", "0")] <- 1  # blank or "0": no scaling, as listed above
  return(res)
}
topEconomicImpact <- stormData2 %>%
mutate(totalPropDMG = PROPDMG*multiplier(PROPDMGEXP), totalCropDMG = CROPDMG*multiplier(CROPDMGEXP)) %>%
filter(ecoImpact) %>%
group_by(EVTYPE) %>%
summarise(totalDMG = sum(totalPropDMG + totalCropDMG)) %>%
ungroup() %>%
slice_max(totalDMG, n=10)
topEconomicImpact %>%
ggplot() +
geom_bar(aes(x=reorder(EVTYPE, -totalDMG), y=totalDMG), stat = "identity") +
coord_flip()
topEconomicImpact %>%
mutate(DAMAGE.Economic = scales::dollar(totalDMG)) %>%
select(EVTYPE, DAMAGE.Economic) %>%
print()
## # A tibble: 10 x 2
## EVTYPE DAMAGE.Economic
## <chr> <chr>
## 1 FLOOD $148,919,611,950
## 2 HURRICANE/TYPHOON $71,913,712,800
## 3 STORM SURGE $43,193,541,000
## 4 TORNADO $24,900,370,720
## 5 HAIL $17,071,172,870
## 6 FLASH FLOOD $16,557,105,610
## 7 HURRICANE $14,554,229,010
## 8 DROUGHT $14,413,667,000
## 9 TROPICAL STORM $8,320,186,550
## 10 HIGH WIND $5,881,421,660
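The exponent mapping behind these totals can be checked by hand on a few made-up rows. The sketch below is a stand-alone variant based on a lookup with `match`; the name `exp_to_multiplier` and the sample values are assumptions for illustration, not the report's own function.

```r
# Stand-alone version of the mapping described in the text, using a lookup.
# `exp_to_multiplier` is a hypothetical name used only for this illustration.
exp_to_multiplier <- function(x) {
  codes   <- c("", "0", "K", "M", "B")
  factors <- c(1, 1, 1e3, 1e6, 1e9)
  factors[match(x, codes)]  # codes outside the list yield NA
}

# Made-up rows: 25 K, 2.5 M, 1 B, and 7 with a blank exponent
dmg      <- c(25, 2.5, 1, 7)
exp_code <- c("K", "M", "B", "")

dollars <- dmg * exp_to_multiplier(exp_code)
print(dollars)
```

Returning `NA` for an unexpected code makes such rows visible in the totals, rather than silently counting them as zero.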
Finally, we observe that FLOOD is the event type that causes the most economic damage, and it is also among the most frequent events with considerable human damage.
To conclude this report, we show the following scatter plot, in which the economic and human damage can be observed together.
stormData2 %>%
filter(ecoImpact) %>%
mutate(totalPropDMG = PROPDMG*multiplier(PROPDMGEXP), totalCropDMG = CROPDMG*multiplier(CROPDMGEXP)) %>%
group_by(EVTYPE) %>%
summarise(totalEcoDMG = sum(totalPropDMG + totalCropDMG),
totalHumDMG = sum(FATALITIES + INJURIES),
freq = n()) %>%
ggplot() +
geom_point(aes(x=log10(totalEcoDMG), y=log10(totalHumDMG), color=freq), alpha = 0.3, size = 2)
As possible future work, we note that there might be a linear relationship, on the log-log scale of the plot, between economic and human damage for those event types where both are nonzero.
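One way to explore that possible relationship would be a least-squares fit on the log10 scale, restricted to event types where both totals are positive. The sketch below uses synthetic data generated only to make it runnable; the column names follow the summary above, but the values and the slope used to simulate them are arbitrary assumptions and say nothing about the real storm data.

```r
set.seed(1)

# Synthetic per-event-type totals, generated only to make the sketch runnable
eco <- 10^runif(30, min = 4, max = 10)
hum <- round(10^(0.4 * log10(eco) - 1 + rnorm(30, sd = 0.3)))

toy <- data.frame(totalEcoDMG = eco, totalHumDMG = hum)

# Keep event types where both totals are positive, so log10 is finite
toy <- subset(toy, totalEcoDMG > 0 & totalHumDMG > 0)

# Ordinary least squares on the log10 scale
fit <- lm(log10(totalHumDMG) ~ log10(totalEcoDMG), data = toy)
print(coef(fit))
```

On the real data, the slope of such a fit would describe how human damage scales with economic damage across event types, and the residuals would point to event types that are unusually harmful for their cost.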