Synopsis

In this report, we would like to answer the following two questions:

  1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

  2. Across the United States, which types of events have the greatest economic consequences?

To answer them, we use analytic graphics to identify the types of severe weather events that were most harmful to population health and the economy in the United States from 1950 to 2011. The total number of victims (fatalities plus injuries) measures the harm of an event type to population health, and the sum of the estimated property and crop damage measures its effect on the U.S. economy. The graphics show that tornadoes are the most harmful type of severe weather event for population health, and that floods have inflicted the heaviest loss on the U.S. economy. The detailed data processing and analysis follow.

Data Processing

To begin with, let’s read in the data set from the U.S. National Oceanic and Atmospheric Administration (NOAA), which describes the characteristics of major storms and weather events in the United States.

if(!file.exists("StormData.csv.bz2")) {
    url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
    download.file(url=url, destfile = "StormData.csv.bz2", method = "curl")
}
storm_data <- read.table("StormData.csv.bz2", header=TRUE, sep=",")
sel_col <- colnames(storm_data)[c(1:2,7,8,23:28,36)]

This is a fairly large data set (>400 MB), so let’s subset the portion of the data that we are interested in. Among the 37 features, we select 11 to address our questions. The selected features: **STATE__, BGN_DATE, STATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP, REMARKS**
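The selection above relies on column positions. As a minimal sketch of an alternative, the same kind of subset can be taken by column name, which fails loudly if a column is missing or the column order changes. The data frame `toy` and the vector `wanted` are made up for illustration and are not part of the original analysis.

```r
# Toy stand-in for the storm data; only the column names matter here.
toy <- data.frame(STATE__ = 1, BGN_DATE = "4/18/1950 0:00:00", COUNTY = 97,
                  EVTYPE = "TORNADO", FATALITIES = 0, INJURIES = 15,
                  stringsAsFactors = FALSE)

# Columns we want to keep, given by name instead of by position.
wanted <- c("STATE__", "BGN_DATE", "EVTYPE", "FATALITIES", "INJURIES")

# Stop early if any requested column is absent, then subset by name.
missing_cols <- setdiff(wanted, names(toy))
stopifnot(length(missing_cols) == 0)
toy_sel <- toy[, wanted, drop = FALSE]
names(toy_sel)
```

Selecting by name costs a few more keystrokes but keeps the code readable next to the list of selected features.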

library(data.table)
DT <- data.table(storm_data[sel_col])
str(DT)
## Classes 'data.table' and 'data.frame':   902297 obs. of  11 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ REMARKS   : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, ".internal.selfref")=<externalptr>

To handle a large data set efficiently, we use the data.table package, which is optimized for this type of task.

Let’s execute a series of pre-processing steps for the following analysis. First, create a new variable ‘pHealth’, the sum of FATALITIES and INJURIES, to address the first question regarding the impact of the events on population health.

DT[, pHealth := FATALITIES + INJURIES]

To address the economic consequence question, we need both the PROPDMG and CROPDMG variables. However, those variables come with separate PROPDMGEXP and CROPDMGEXP variables that hold a magnitude abbreviation (for example, K for thousands), and we need to take those exponent variables into account to get the right estimates of property and crop damage.

PropDmg_exp <- unique(DT[,PROPDMGEXP]) # codes found in PROPDMGEXP
# rows whose property exponent is "-", "?" or "+" (meaning unknown)
rowN_unknown <- DT[PROPDMGEXP=="-" | PROPDMGEXP=="?"|PROPDMGEXP=="+", which=TRUE]
# map each exponent code to a multiplier; unmatched codes become NA
DT[, pDMG:=c(1, 1e0,1e1,1e2,1e3,1e4,1e5,1e6,1e7,1e8, 1e9,1e2,1e2,1e3,1e6,1e6)[match(PROPDMGEXP, c("","0","1","2","3","4","5","6","7","8","B","h","H","K","m","M"))]]
DT[, pDMG:=pDMG*PROPDMG]
DT[rowN_unknown, pDMG:=PROPDMG] # unknown exponent: use PROPDMG as is
CropDmg_exp <- unique(DT[,CROPDMGEXP]) # codes found in CROPDMGEXP
# rows whose crop exponent is "?" (meaning unknown)
rowN_unknown_cDMG <- DT[CROPDMGEXP=="?", which=TRUE]
DT[, cDMG:=c(1, 1e0,1e2, 1e9,1e3,1e3,1e6,1e6)[match(CROPDMGEXP, c("","0","2","B","k","K","m","M"))]]
DT[, cDMG:=cDMG*CROPDMG]
DT[rowN_unknown_cDMG, cDMG:=CROPDMG ] # unknown exponent: use CROPDMG as is
DT[, totalDMG := (pDMG+cDMG)] # total damage

To handle this issue, let’s look at PROPDMGEXP first.
PROPDMGEXP is a factor variable with the levels K, M, “”, B, m, +, 0, 5, 6, ?, 4, 2, 3, h, 7, H, -, 1, 8. The empty string “” is treated as a multiplier of 1. We ignore -, ?, and + because their meaning is not clear (I failed to find a description of these notations in either the field/column information document provided by NOAA or the document provided by Coursera), so only the PROPDMG value is used in these cases. Based on the contents of the REMARKS variable, the digits 0 to 8 are guessed to be powers of ten, from 1e0 (1) to 1e8 (100000000), respectively. The abbreviations B, h, H, K, m, M are interpreted as 1000000000, 100, 100, 1000, 1000000, and 1000000, respectively. Once PROPDMGEXP has been converted into the corresponding multiplier, we multiply PROPDMG by it and assign the result to a new variable ‘pDMG’.

We deal with CROPDMGEXP in a similar way. CROPDMGEXP is a factor variable with the levels “”, M, K, m, B, ?, 0, k, 2. “?” is ignored, and “0” and “2” are taken as 1 and 100. The abbreviations “B”, “k”, “K”, “m”, “M” are interpreted as 1000000000, 1000, 1000, 1000000, and 1000000, respectively. Then, we multiply CROPDMG by the converted CROPDMGEXP and assign the result to ‘cDMG’.

Finally, we add cDMG to pDMG and assign the result to ‘totalDMG’.
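The mapping described above can also be sketched as one small lookup function shared by both exponent variables. This is only an illustration under the interpretation stated in this report: the function name `exp_to_multiplier` and the example vectors are hypothetical, and the lookup table is the union of the property and crop codes.

```r
# Hypothetical helper: map a damage-exponent code to a numeric multiplier.
# The codes in `lookup` follow the interpretation described in this report;
# any other code ("", "-", "?", "+", or NA) falls back to a multiplier of 1,
# so the raw damage value is used as is.
exp_to_multiplier <- function(code) {
  lookup <- c("0" = 1e0, "1" = 1e1, "2" = 1e2, "3" = 1e3, "4" = 1e4,
              "5" = 1e5, "6" = 1e6, "7" = 1e7, "8" = 1e8,
              "h" = 1e2, "H" = 1e2, "k" = 1e3, "K" = 1e3,
              "m" = 1e6, "M" = 1e6, "B" = 1e9)
  code <- as.character(code)        # factors are converted to their labels
  mult <- unname(lookup[code])      # unknown codes come back as NA
  mult[is.na(mult)] <- 1
  mult
}

# Made-up damage values and exponent codes for illustration.
damage <- c(25, 2.5, 10, 3)
codes  <- c("K", "M", "", "?")
damage * exp_to_multiplier(codes)
```

Keeping the codes and multipliers side by side in one named vector makes it easier to check that every level of the factor has been accounted for.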

Results

OK, now we are ready to generate the analytic graphics. First, let’s look at the impact of weather events on population health by counting the total number of victims (fatalities and injuries) for each weather event type.

public_health <- DT[, .(pHealth=sum(pHealth), death=sum(FATALITIES), injury=sum(INJURIES)), by=EVTYPE][order(-pHealth)]
head(public_health,5)
##            EVTYPE pHealth death injury
## 1:        TORNADO   96979  5633  91346
## 2: EXCESSIVE HEAT    8428  1903   6525
## 3:      TSTM WIND    7461   504   6957
## 4:          FLOOD    7259   470   6789
## 5:      LIGHTNING    6046   816   5230
name_EVTYPE_top5 <- public_health[1:5, EVTYPE]
library(RColorBrewer)
cols <- brewer.pal(3,"YlGnBu")
pal <- colorRampPalette(cols[3:1]) # inverse of cols
with(public_health, {
    barplot(pHealth[1:5], ylim=c(0,100000), yaxt='n',
            xlab="", ylab="Total number of victims [people]", col=pal(5),
            main = "Top 5 harmful weather events on population health",
            legend.text = name_EVTYPE_top5)
    axis(2, at=pretty(public_health[1:5, pHealth]),
         labels=paste0(pretty(public_health[1:5, pHealth])/1000, "K"))
})

[Fig.1. Total number of victims (deaths and injuries) for the top 5 most harmful types of severe weather event from 1950 to 2011.]

We plot the top 5 events to see which event types are most detrimental. Tornadoes caused the largest number of victims: 96979 deaths and injuries in the United States. Excessive heat, thunderstorm wind (TSTM WIND), flood, and lightning follow in that order.
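To put the gap between tornadoes and the other event types into numbers, here is a small worked calculation using only the totals printed by head(public_health, 5). The vector `victims` is typed in by hand from that output, and the share is computed among these five event types only, not among all events in the data set.

```r
# Victim totals copied from the head(public_health, 5) output.
victims <- c(TORNADO = 96979, "EXCESSIVE HEAT" = 8428, "TSTM WIND" = 7461,
             FLOOD = 7259, LIGHTNING = 6046)

# Share of tornado victims among the top 5 event types, in percent.
tornado_share_top5 <- victims[["TORNADO"]] / sum(victims)
round(100 * tornado_share_top5, 1)
```

Among these five event types, tornadoes account for roughly three quarters of the victims.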

How about the economic consequences inflicted by the severe weather events? To address this question, let’s look at the total loss (property and crop damage) caused by each type of weather event.

economy_consq <- DT[, .(totalDMG=sum(totalDMG), pDMG=sum(pDMG), cDMG=sum(cDMG)), by=EVTYPE][order(-totalDMG)]
head(economy_consq,5)
##               EVTYPE     totalDMG         pDMG       cDMG
## 1:             FLOOD 150319678257 144657709807 5661968450
## 2: HURRICANE/TYPHOON  71913712800  69305840000 2607872800
## 3:           TORNADO  57362333947  56947380677  414953270
## 4:       STORM SURGE  43323541000  43323536000       5000
## 5:              HAIL  18761221986  15735267513 3025954473
name_EVTYPE_top5_DMG <- economy_consq[1:5, EVTYPE]
cols_DMG <- brewer.pal(3,"BuGn")
pal_DMG <- colorRampPalette(cols_DMG[3:1]) # inverse of cols
with(economy_consq, {
    barplot(totalDMG[1:5], ylim=c(0,160000000000), yaxt='n',
            xlab="", ylab="Total estimate of damage [dollar]",
            main="Top 5 harmful weather events on economy",
            col=pal_DMG(5),
            legend.text = name_EVTYPE_top5_DMG)
    axis(2, at=pretty(economy_consq[1:5, totalDMG]),
         labels=paste0(pretty(economy_consq[1:5, totalDMG])/1000000000, "B"))
})

[Fig.2. Total estimated economic loss inflicted by the top 5 harmful types of severe weather event from 1950 to 2011.]

When we calculate the total estimated property and crop damage for each event type, we can identify the top 5 most detrimental weather events for the U.S. economy. FLOOD ranks at the top, followed by HURRICANE/TYPHOON, TORNADO, STORM SURGE, and HAIL. FLOOD caused more than $150 billion in damage to the U.S. economy from 1950 to 2011.

In conclusion, we find that tornadoes are the most harmful severe weather events for population health in the United States, and floods are the most detrimental to the U.S. economy.

Finally, this report has been prepared under the following environment:

sessionInfo()
## R version 3.5.0 (2018-04-23)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 10586)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] RColorBrewer_1.1-2 data.table_1.11.0 
## 
## loaded via a namespace (and not attached):
##  [1] compiler_3.5.0  backports_1.1.2 magrittr_1.5    rprojroot_1.3-2
##  [5] tools_3.5.0     htmltools_0.3.6 yaml_2.1.19     Rcpp_0.12.16   
##  [9] stringi_1.1.7   rmarkdown_1.9   knitr_1.20      stringr_1.3.0  
## [13] digest_0.6.15   evaluate_0.10.1