Huy NGUYEN
May 2021 - Brussels - Belgium

Synopsis

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

This project uses the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage. More info can be found in the web site NOAA

The present analysis addresses two questions :
- which types of events are most harmful to population health
- which types of events have the greatest economic consequences

Data Processing

This project was created with Rstudio 1.4.1106

The data come in the form of a comma-separated-value compressed bzip2 file, and can be downloaded from Storm Data

1. Development environment - Loading libraries and data

#show development environment
sessionInfo()
## R version 4.0.4 (2021-02-15)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18363)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_Europe.1252  LC_CTYPE=English_Europe.1252   
## [3] LC_MONETARY=English_Europe.1252 LC_NUMERIC=C                   
## [5] LC_TIME=English_Europe.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] compiler_4.0.4    magrittr_2.0.1    tools_4.0.4       htmltools_0.5.1.1
##  [5] yaml_2.2.1        stringi_1.5.3     rmarkdown_2.7     knitr_1.31       
##  [9] stringr_1.4.0     xfun_0.21         digest_0.6.27     rlang_0.4.10     
## [13] evaluate_0.14
#load libraries
library(dplyr)  
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)  

#load data
dataRaw <- read.csv("repdata_data_StormData.csv.bz2")

# Examine the dataset structure  
str(dataRaw)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  "" "" "" "" ...
##  $ BGN_LOCATI: chr  "" "" "" "" ...
##  $ END_DATE  : chr  "" "" "" "" ...
##  $ END_TIME  : chr  "" "" "" "" ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  "" "" "" "" ...
##  $ END_LOCATI: chr  "" "" "" "" ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ WFO       : chr  "" "" "" "" ...
##  $ STATEOFFIC: chr  "" "" "" "" ...
##  $ ZONENAMES : chr  "" "" "" "" ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  "" "" "" "" ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

The dataset has 902.297 observations with 37 variables. We are only interested in the variables concerning the
- EVTYPE : for events type
- FATALITIES and INJURIES : for health analysis
- PROPDMG and its corresponding PROPDMGEXP : for property damages
- CROPDMG and its corresponding CROPDMGEXP : for crop damages

2. Subset with interested variables

We create a data subset with the concerned variables.

#subset with the concerned variables
dataSub <- dataRaw[c("EVTYPE","FATALITIES","INJURIES","CROPDMG","CROPDMGEXP","PROPDMG","PROPDMGEXP" )]  

#Examine some rows of the data subset  
head(dataSub)  
##    EVTYPE FATALITIES INJURIES CROPDMG CROPDMGEXP PROPDMG PROPDMGEXP
## 1 TORNADO          0       15       0               25.0          K
## 2 TORNADO          0        0       0                2.5          K
## 3 TORNADO          0        2       0               25.0          K
## 4 TORNADO          0        2       0                2.5          K
## 5 TORNADO          0        2       0                2.5          K
## 6 TORNADO          0        6       0                2.5          K

3. Health impact analysis

The health impact is addressed in term of fatalities and injuries.
We add a variable (HIMPACT) to store the sum of the fatalities and the injuries.

#add HIMPACT variable
dataSub <- mutate(dataSub, HIMPACT = FATALITIES + INJURIES)  

#examine some rows
head(dataSub)  
##    EVTYPE FATALITIES INJURIES CROPDMG CROPDMGEXP PROPDMG PROPDMGEXP HIMPACT
## 1 TORNADO          0       15       0               25.0          K      15
## 2 TORNADO          0        0       0                2.5          K       0
## 3 TORNADO          0        2       0               25.0          K       2
## 4 TORNADO          0        2       0                2.5          K       2
## 5 TORNADO          0        2       0                2.5          K       2
## 6 TORNADO          0        6       0                2.5          K       6

Now we calculate the health impact grouping by events type, and plot the results

#grouping health impact by event type
hImpact <- dataSub %>%   group_by(EVTYPE) %>% 
  summarise(HIMPACT = sum(HIMPACT)) %>%   arrange(desc(HIMPACT))

#examine the first top most harmful events type for public health   
hImpact[1:10,]
## # A tibble: 10 x 2
##    EVTYPE            HIMPACT
##    <chr>               <dbl>
##  1 TORNADO             96979
##  2 EXCESSIVE HEAT       8428
##  3 TSTM WIND            7461
##  4 FLOOD                7259
##  5 LIGHTNING            6046
##  6 HEAT                 3037
##  7 FLASH FLOOD          2755
##  8 ICE STORM            2064
##  9 THUNDERSTORM WIND    1621
## 10 WINTER STORM         1527
#plot top most harmfull results
ggplot(hImpact[1:10,], aes(x=reorder(EVTYPE, -HIMPACT),y=HIMPACT,color=EVTYPE)) + 
  geom_bar(stat="identity", fill="white") + 
  ggtitle("Health impacts") +
  xlab("Event") + ylab("Fatalities + Injuries") +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 0.5)) + 
  theme(legend.position="none") 

#Close device
dev.off()
## null device 
##           1

4. Economy impact analysis

The economy impact is addressed in term of cost for crop and property damages. There are 4 variables to consider:
- CROPDMG and its corresponding CROPDMGEXP : for crop damages
- PROPDMG and its corresponding PROPDMGEXP : for property damages

The cost of a crop damage is calculated as a product of CROPDMG by its exponent CROPDMGEXP ( cost = CROPDMG * CROPDMGEXP).

Similarly, the cost of a property damage is calculated as: cost = PROPDMG * PROPDMGEXP

4.1 Crop damages
First we count the “exponent” values. We see that they are represented by a character,
- example : “B” for Billion K: Kilo etc…
For the calculation, we replace them by numeric values.

#count exp values
table(dataSub$CROPDMGEXP)
## 
##             ?      0      2      B      k      K      m      M 
## 618413      7     19      1      9     21 281832      1   1994
#replace these "abbreviated" values by numeric values
dataSub$CROPDMGEXP[dataSub$CROPDMGEXP == "0"] <- 10^0
dataSub$CROPDMGEXP[dataSub$CROPDMGEXP == "2"] <- 10^2
dataSub$CROPDMGEXP[dataSub$CROPDMGEXP == "B"] <- 10^9
dataSub$CROPDMGEXP[dataSub$CROPDMGEXP == "k"] <- 10^3
dataSub$CROPDMGEXP[dataSub$CROPDMGEXP == "K"] <- 10^3
dataSub$CROPDMGEXP[dataSub$CROPDMGEXP == "m"] <- 10^6
dataSub$CROPDMGEXP[dataSub$CROPDMGEXP == "M"] <- 10^6
dataSub$CROPDMGEXP[dataSub$CROPDMGEXP == ""] <- 0
dataSub$CROPDMGEXP[dataSub$CROPDMGEXP == "?"] <- 0

#convert to numeric value
dataSub$CROPDMGEXP <- as.numeric(as.character(dataSub$CROPDMGEXP))

We do the same “exponent” pre-processing for the property damages.

4.2 Property damages

#count exp values
table(dataSub$PROPDMGEXP)
## 
##             -      ?      +      0      1      2      3      4      5      6 
## 465934      1      8      5    216     25     13      4      4     28      4 
##      7      8      B      h      H      K      m      M 
##      5      1     40      1      6 424665      7  11330
#replace these "abbreviated" values by numeric values
dataSub$PROPDMGEXP[dataSub$PROPDMGEXP == "0"] <- 10^0
dataSub$PROPDMGEXP[dataSub$PROPDMGEXP == "1"] <- 10^1
dataSub$PROPDMGEXP[dataSub$PROPDMGEXP == "2"] <- 10^2
dataSub$PROPDMGEXP[dataSub$PROPDMGEXP == "3"] <- 10^3
dataSub$PROPDMGEXP[dataSub$PROPDMGEXP == "4"] <- 10^4
dataSub$PROPDMGEXP[dataSub$PROPDMGEXP == "5"] <- 10^5
dataSub$PROPDMGEXP[dataSub$PROPDMGEXP == "6"] <- 10^6
dataSub$PROPDMGEXP[dataSub$PROPDMGEXP == "7"] <- 10^7
dataSub$PROPDMGEXP[dataSub$PROPDMGEXP == "8"] <- 10^8
dataSub$PROPDMGEXP[dataSub$PROPDMGEXP == "B"] <- 10^9
dataSub$PROPDMGEXP[dataSub$PROPDMGEXP == "h"] <- 10^2
dataSub$PROPDMGEXP[dataSub$PROPDMGEXP == "H"] <- 10^2
dataSub$PROPDMGEXP[dataSub$PROPDMGEXP == "K"] <- 10^3
dataSub$PROPDMGEXP[dataSub$PROPDMGEXP == "m"] <- 10^6
dataSub$PROPDMGEXP[dataSub$PROPDMGEXP == "M"] <- 10^6
dataSub$PROPDMGEXP[dataSub$PROPDMGEXP == ""] <- 0
dataSub$PROPDMGEXP[dataSub$PROPDMGEXP == "-"] <- 0
dataSub$PROPDMGEXP[dataSub$PROPDMGEXP == "?"] <- 0
dataSub$PROPDMGEXP[dataSub$PROPDMGEXP == "+"] <- 0

#convert to numeric value
dataSub$PROPDMGEXP <- as.numeric(as.character(dataSub$PROPDMGEXP))

4.3 Total damages cost
We add a variable (ECOST) to store the total costs.

#add ECOST variable
dataSub <- mutate(dataSub, ECOST = PROPDMG * PROPDMGEXP + CROPDMG * CROPDMGEXP)

#examine some rows
head(dataSub) 
##    EVTYPE FATALITIES INJURIES CROPDMG CROPDMGEXP PROPDMG PROPDMGEXP HIMPACT
## 1 TORNADO          0       15       0          0    25.0       1000      15
## 2 TORNADO          0        0       0          0     2.5       1000       0
## 3 TORNADO          0        2       0          0    25.0       1000       2
## 4 TORNADO          0        2       0          0     2.5       1000       2
## 5 TORNADO          0        2       0          0     2.5       1000       2
## 6 TORNADO          0        6       0          0     2.5       1000       6
##   ECOST
## 1 25000
## 2  2500
## 3 25000
## 4  2500
## 5  2500
## 6  2500

Now we calculate the economy impact grouping by events type, and plot the results

#grouping economy impact by event type
eImpact <- dataSub %>%   group_by(EVTYPE) %>% 
  summarise(ECOST = sum(ECOST)) %>%   arrange(desc(ECOST))

#examine the first top most damages costs  
eImpact[1:10,]
## # A tibble: 10 x 2
##    EVTYPE                   ECOST
##    <chr>                    <dbl>
##  1 FLOOD             150319678250
##  2 HURRICANE/TYPHOON  71913712800
##  3 TORNADO            57362335085
##  4 STORM SURGE        43323541000
##  5 HAIL               18761224047
##  6 FLASH FLOOD        18243993225
##  7 DROUGHT            15018672000
##  8 HURRICANE          14610229010
##  9 RIVER FLOOD        10148404500
## 10 ICE STORM           8967041810
#we see that the values are very high. So we scale them to Billion
eImpact$ECOST <- eImpact$ECOST / 10^9

#plot top most damages costs  
ggplot(eImpact[1:10,], aes(x=reorder(EVTYPE, -ECOST),y=ECOST,color=EVTYPE)) + 
  geom_bar(stat="identity", fill="white") + 
  ggtitle("Economy impacts") +
  xlab("Event") + ylab("Damages Cost  (billion  USD) ") +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(axis.text.x = element_text(angle = 90)) + 
  theme(legend.position="none") 

#Close device
dev.off()
## null device 
##           1

Results

The figure about “Healt Impact” (section 3. Health impact analysis) shows that TORNADO is by far the most harmful wheather event causing about 10 times fatalities/injuries more than the 3 following EXCESSIVE HEAT, TSTM WIND and FLOOD.

In term of damages, as shown in the figure (section 4. Economy impact analysis), FLOOD, HURRICANE/TYPHOON and TORNADO are the most damaging wheather events.

If we consider that life has no price, then TORNADO is the most damaging.