Huy NGUYEN
May 2021 - Brussels - Belgium
Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This project uses the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage. More info can be found in the web site NOAA
The present analysis addresses two questions :
- which types of events are most harmful to population health
- which types of events have the greatest economic consequences
This project was created with Rstudio 1.4.1106
The data come in the form of a comma-separated-value compressed bzip2 file, and can be downloaded from Storm Data
#show development environment
sessionInfo()
## R version 4.0.4 (2021-02-15)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18363)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_Europe.1252 LC_CTYPE=English_Europe.1252
## [3] LC_MONETARY=English_Europe.1252 LC_NUMERIC=C
## [5] LC_TIME=English_Europe.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] compiler_4.0.4 magrittr_2.0.1 tools_4.0.4 htmltools_0.5.1.1
## [5] yaml_2.2.1 stringi_1.5.3 rmarkdown_2.7 knitr_1.31
## [9] stringr_1.4.0 xfun_0.21 digest_0.6.27 rlang_0.4.10
## [13] evaluate_0.14
#load libraries
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
#load data
dataRaw <- read.csv("repdata_data_StormData.csv.bz2")
# Examine the dataset structure
str(dataRaw)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
The dataset has 902.297 observations with 37 variables. We are only interested in the variables concerning the
- EVTYPE : for events type
- FATALITIES and INJURIES : for health analysis
- PROPDMG and its corresponding PROPDMGEXP : for property damages
- CROPDMG and its corresponding CROPDMGEXP : for crop damages
We create a data subset with the concerned variables.
#subset with the concerned variables
dataSub <- dataRaw[c("EVTYPE","FATALITIES","INJURIES","CROPDMG","CROPDMGEXP","PROPDMG","PROPDMGEXP" )]
#Examine some rows of the data subset
head(dataSub)
## EVTYPE FATALITIES INJURIES CROPDMG CROPDMGEXP PROPDMG PROPDMGEXP
## 1 TORNADO 0 15 0 25.0 K
## 2 TORNADO 0 0 0 2.5 K
## 3 TORNADO 0 2 0 25.0 K
## 4 TORNADO 0 2 0 2.5 K
## 5 TORNADO 0 2 0 2.5 K
## 6 TORNADO 0 6 0 2.5 K
The health impact is addressed in term of fatalities and injuries.
We add a variable (HIMPACT) to store the sum of the fatalities and the injuries.
#add HIMPACT variable
dataSub <- mutate(dataSub, HIMPACT = FATALITIES + INJURIES)
#examine some rows
head(dataSub)
## EVTYPE FATALITIES INJURIES CROPDMG CROPDMGEXP PROPDMG PROPDMGEXP HIMPACT
## 1 TORNADO 0 15 0 25.0 K 15
## 2 TORNADO 0 0 0 2.5 K 0
## 3 TORNADO 0 2 0 25.0 K 2
## 4 TORNADO 0 2 0 2.5 K 2
## 5 TORNADO 0 2 0 2.5 K 2
## 6 TORNADO 0 6 0 2.5 K 6
Now we calculate the health impact grouping by events type, and plot the results
#grouping health impact by event type
hImpact <- dataSub %>% group_by(EVTYPE) %>%
summarise(HIMPACT = sum(HIMPACT)) %>% arrange(desc(HIMPACT))
#examine the first top most harmful events type for public health
hImpact[1:10,]
## # A tibble: 10 x 2
## EVTYPE HIMPACT
## <chr> <dbl>
## 1 TORNADO 96979
## 2 EXCESSIVE HEAT 8428
## 3 TSTM WIND 7461
## 4 FLOOD 7259
## 5 LIGHTNING 6046
## 6 HEAT 3037
## 7 FLASH FLOOD 2755
## 8 ICE STORM 2064
## 9 THUNDERSTORM WIND 1621
## 10 WINTER STORM 1527
#plot top most harmfull results
ggplot(hImpact[1:10,], aes(x=reorder(EVTYPE, -HIMPACT),y=HIMPACT,color=EVTYPE)) +
geom_bar(stat="identity", fill="white") +
ggtitle("Health impacts") +
xlab("Event") + ylab("Fatalities + Injuries") +
theme(plot.title = element_text(hjust = 0.5)) +
theme(axis.text.x = element_text(angle = 90, hjust = 0.5)) +
theme(legend.position="none")
#Close device
dev.off()
## null device
## 1
The economy impact is addressed in term of cost for crop and property damages. There are 4 variables to consider:
- CROPDMG and its corresponding CROPDMGEXP : for crop damages
- PROPDMG and its corresponding PROPDMGEXP : for property damages
The cost of a crop damage is calculated as a product of CROPDMG by its exponent CROPDMGEXP ( cost = CROPDMG * CROPDMGEXP).
Similarly, the cost of a property damage is calculated as: cost = PROPDMG * PROPDMGEXP
4.1 Crop damages
First we count the “exponent” values. We see that they are represented by a character,
- example : “B” for Billion K: Kilo etc…
For the calculation, we replace them by numeric values.
#count exp values
table(dataSub$CROPDMGEXP)
##
## ? 0 2 B k K m M
## 618413 7 19 1 9 21 281832 1 1994
#replace these "abbreviated" values by numeric values
dataSub$CROPDMGEXP[dataSub$CROPDMGEXP == "0"] <- 10^0
dataSub$CROPDMGEXP[dataSub$CROPDMGEXP == "2"] <- 10^2
dataSub$CROPDMGEXP[dataSub$CROPDMGEXP == "B"] <- 10^9
dataSub$CROPDMGEXP[dataSub$CROPDMGEXP == "k"] <- 10^3
dataSub$CROPDMGEXP[dataSub$CROPDMGEXP == "K"] <- 10^3
dataSub$CROPDMGEXP[dataSub$CROPDMGEXP == "m"] <- 10^6
dataSub$CROPDMGEXP[dataSub$CROPDMGEXP == "M"] <- 10^6
dataSub$CROPDMGEXP[dataSub$CROPDMGEXP == ""] <- 0
dataSub$CROPDMGEXP[dataSub$CROPDMGEXP == "?"] <- 0
#convert to numeric value
dataSub$CROPDMGEXP <- as.numeric(as.character(dataSub$CROPDMGEXP))
We do the same “exponent” pre-processing for the property damages.
4.2 Property damages
#count exp values
table(dataSub$PROPDMGEXP)
##
## - ? + 0 1 2 3 4 5 6
## 465934 1 8 5 216 25 13 4 4 28 4
## 7 8 B h H K m M
## 5 1 40 1 6 424665 7 11330
#replace these "abbreviated" values by numeric values
dataSub$PROPDMGEXP[dataSub$PROPDMGEXP == "0"] <- 10^0
dataSub$PROPDMGEXP[dataSub$PROPDMGEXP == "1"] <- 10^1
dataSub$PROPDMGEXP[dataSub$PROPDMGEXP == "2"] <- 10^2
dataSub$PROPDMGEXP[dataSub$PROPDMGEXP == "3"] <- 10^3
dataSub$PROPDMGEXP[dataSub$PROPDMGEXP == "4"] <- 10^4
dataSub$PROPDMGEXP[dataSub$PROPDMGEXP == "5"] <- 10^5
dataSub$PROPDMGEXP[dataSub$PROPDMGEXP == "6"] <- 10^6
dataSub$PROPDMGEXP[dataSub$PROPDMGEXP == "7"] <- 10^7
dataSub$PROPDMGEXP[dataSub$PROPDMGEXP == "8"] <- 10^8
dataSub$PROPDMGEXP[dataSub$PROPDMGEXP == "B"] <- 10^9
dataSub$PROPDMGEXP[dataSub$PROPDMGEXP == "h"] <- 10^2
dataSub$PROPDMGEXP[dataSub$PROPDMGEXP == "H"] <- 10^2
dataSub$PROPDMGEXP[dataSub$PROPDMGEXP == "K"] <- 10^3
dataSub$PROPDMGEXP[dataSub$PROPDMGEXP == "m"] <- 10^6
dataSub$PROPDMGEXP[dataSub$PROPDMGEXP == "M"] <- 10^6
dataSub$PROPDMGEXP[dataSub$PROPDMGEXP == ""] <- 0
dataSub$PROPDMGEXP[dataSub$PROPDMGEXP == "-"] <- 0
dataSub$PROPDMGEXP[dataSub$PROPDMGEXP == "?"] <- 0
dataSub$PROPDMGEXP[dataSub$PROPDMGEXP == "+"] <- 0
#convert to numeric value
dataSub$PROPDMGEXP <- as.numeric(as.character(dataSub$PROPDMGEXP))
4.3 Total damages cost
We add a variable (ECOST) to store the total costs.
#add ECOST variable
dataSub <- mutate(dataSub, ECOST = PROPDMG * PROPDMGEXP + CROPDMG * CROPDMGEXP)
#examine some rows
head(dataSub)
## EVTYPE FATALITIES INJURIES CROPDMG CROPDMGEXP PROPDMG PROPDMGEXP HIMPACT
## 1 TORNADO 0 15 0 0 25.0 1000 15
## 2 TORNADO 0 0 0 0 2.5 1000 0
## 3 TORNADO 0 2 0 0 25.0 1000 2
## 4 TORNADO 0 2 0 0 2.5 1000 2
## 5 TORNADO 0 2 0 0 2.5 1000 2
## 6 TORNADO 0 6 0 0 2.5 1000 6
## ECOST
## 1 25000
## 2 2500
## 3 25000
## 4 2500
## 5 2500
## 6 2500
Now we calculate the economy impact grouping by events type, and plot the results
#grouping economy impact by event type
eImpact <- dataSub %>% group_by(EVTYPE) %>%
summarise(ECOST = sum(ECOST)) %>% arrange(desc(ECOST))
#examine the first top most damages costs
eImpact[1:10,]
## # A tibble: 10 x 2
## EVTYPE ECOST
## <chr> <dbl>
## 1 FLOOD 150319678250
## 2 HURRICANE/TYPHOON 71913712800
## 3 TORNADO 57362335085
## 4 STORM SURGE 43323541000
## 5 HAIL 18761224047
## 6 FLASH FLOOD 18243993225
## 7 DROUGHT 15018672000
## 8 HURRICANE 14610229010
## 9 RIVER FLOOD 10148404500
## 10 ICE STORM 8967041810
#we see that the values are very high. So we scale them to Billion
eImpact$ECOST <- eImpact$ECOST / 10^9
#plot top most damages costs
ggplot(eImpact[1:10,], aes(x=reorder(EVTYPE, -ECOST),y=ECOST,color=EVTYPE)) +
geom_bar(stat="identity", fill="white") +
ggtitle("Economy impacts") +
xlab("Event") + ylab("Damages Cost (billion USD) ") +
theme(plot.title = element_text(hjust = 0.5)) +
theme(axis.text.x = element_text(angle = 90)) +
theme(legend.position="none")
#Close device
dev.off()
## null device
## 1
The figure about “Healt Impact” (section 3. Health impact analysis) shows that TORNADO is by far the most harmful wheather event causing about 10 times fatalities/injuries more than the 3 following EXCESSIVE HEAT, TSTM WIND and FLOOD.
In term of damages, as shown in the figure (section 4. Economy impact analysis), FLOOD, HURRICANE/TYPHOON and TORNADO are the most damaging wheather events.
If we consider that life has no price, then TORNADO is the most damaging.