In this report, we would like to answer the following two questions:
Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?
To answer them, we use analytic graphics to identify the types of severe weather events that were most harmful to population health and the economy in the United States from 1950 to 2011. The total number of fatalities and injuries attributed to an event type serves as the measure of its harm to population health, and the sum of the estimated property and crop damage serves as the measure of its effect on the U.S. economy. The graphics show that tornadoes are the most harmful type of severe weather event for population health and that floods have inflicted the heaviest loss on the U.S. economy. The data processing and analysis are detailed below.
To begin, we read in the data set from the U.S. National Oceanic and Atmospheric Administration (NOAA), which describes the characteristics of major storms and weather events in the United States.
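# download the compressed file only if it is not already present
if(!file.exists("StormData.csv.bz2")) {
  url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
  download.file(url=url, destfile = "StormData.csv.bz2", method = "curl")
}
# read.csv reads the bz2 file directly and, unlike the read.table defaults,
# does not treat apostrophes or '#' inside REMARKS as quote or comment characters
storm_data <- read.csv("StormData.csv.bz2", header=TRUE)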
sel_col <- colnames(storm_data)[c(1:2,7,8,23:28,36)]
The data set is fairly large (>400 MB), so we keep only the columns we are interested in. Of the 37 variables, we select the 11 needed for the analysis: **STATE__, BGN_DATE, STATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP, REMARKS**
library(data.table)
DT <- data.table(storm_data[sel_col])
str(DT)
## Classes 'data.table' and 'data.frame': 902297 obs. of 11 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REMARKS : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
## - attr(*, ".internal.selfref")=<externalptr>
To handle the large data set efficiently, we use the data.table package, which is optimized for this type of task.
Let’s run a series of pre-processing steps for the analysis that follows. First, we create a new variable ‘pHealth’, the sum of FATALITIES and INJURIES, to address the first question regarding the impact of the events on population health.
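As an aside on the column selection described above, unneeded columns can also be dropped while the file is being read rather than afterwards. The following is only a minimal base R sketch on a made-up three-row CSV (the text in `csv_text` is an assumption for illustration, not the NOAA file); it relies on the documented behavior of `colClasses`, where `"NULL"` skips a column and `NA` lets R guess the type.

```r
# Toy CSV standing in for the storm file; only the column names are borrowed from it.
csv_text <- "STATE,EVTYPE,FATALITIES,INJURIES,REMARKS
AL,TORNADO,0,15,none
AL,TORNADO,1,14,none
TX,FLOOD,0,2,none"

# Read a single row just to learn the column names.
header <- names(read.csv(text = csv_text, nrows = 1))

# Start by skipping every column, then mark the ones to keep with NA (auto-detect).
keep <- c("EVTYPE", "FATALITIES", "INJURIES")
col_classes <- rep("NULL", length(header))
col_classes[header %in% keep] <- NA

toy <- read.csv(text = csv_text, colClasses = col_classes, stringsAsFactors = FALSE)
str(toy)
```

On the real file the same idea would apply with `file = "StormData.csv.bz2"` in place of `text = csv_text`, which would avoid holding all 37 columns in memory.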
DT[, pHealth := FATALITIES + INJURIES]
To address the question of economic consequences, we need both the PROPDMG and CROPDMG variables. However, each comes with a separate exponent variable (PROPDMGEXP and CROPDMGEXP) that encodes the magnitude as a financial abbreviation, and we need to take these exponent variables into account to get correct estimates of property and crop damage.
# --- property damage ---
PropDmg_exp <- unique(DT[,PROPDMGEXP]) # exponent codes present in the data
# rows whose code ("-", "?", "+") has no documented meaning
rowN_unknown <- DT[PROPDMGEXP=="-" | PROPDMGEXP=="?" | PROPDMGEXP=="+", which=TRUE]
# look up the multiplier for each code; codes not listed here become NA
DT[, pDMG:=c(1, 1e0,1e1,1e2,1e3,1e4,1e5,1e6,1e7,1e8, 1e9,1e2,1e2,1e3,1e6,1e6)[match(PROPDMGEXP, c("","0","1","2","3","4","5","6","7","8","B","h","H","K","m","M"))]]
DT[, pDMG:=pDMG*PROPDMG]
DT[rowN_unknown, pDMG:=PROPDMG] # unknown code: use PROPDMG as-is
# --- crop damage (same procedure) ---
CropDmg_exp <- unique(DT[,CROPDMGEXP]) # exponent codes present in the data
rowN_unknown_cDMG <- DT[CROPDMGEXP=="?", which=TRUE]
DT[, cDMG:=c(1, 1e0,1e2, 1e9,1e3,1e3,1e6,1e6)[match(CROPDMGEXP, c("","0","2","B","k","K","m","M"))]]
DT[, cDMG:=cDMG*CROPDMG]
DT[rowN_unknown_cDMG, cDMG:=CROPDMG] # unknown code: use CROPDMG as-is
DT[, totalDMG := (pDMG+cDMG)] # total damage
To handle this issue, let’s look at PROPDMGEXP first.
PROPDMGEXP is a factor variable with the levels K, M, “” (empty string), B, m, +, 0, 5, 6, ?, 4, 2, 3, h, 7, H, -, 1, and 8. The empty string “” is treated as a multiplier of 1. We ignore -, ?, and + because their meaning is not clear (I failed to find a description of these notations in the field/column documentation provided by NOAA or in the document provided by Coursera), so only the PROPDMG value is used in these cases. Based on the contents of the REMARKS variable, the digits 0 to 8 are assumed to be powers of ten, from 1e0 (1) to 1e8 (100,000,000), respectively. The abbreviations B, h, H, K, m, and M are interpreted as 1,000,000,000, 100, 100, 1,000, 1,000,000, and 1,000,000, respectively. After converting PROPDMGEXP into the corresponding multiplier, we multiply it by PROPDMG and assign the result to a new variable, ‘pDMG’.
We deal with CROPDMGEXP in a similar way. CROPDMGEXP is a factor variable with the levels “” (empty string), M, K, m, B, ?, 0, k, and 2. “?” is ignored, so only the CROPDMG value is used in that case, and “0” and “2” are taken as 1 and 100. The abbreviations “B”, “k”, “K”, “m”, and “M” are interpreted as 1,000,000,000, 1,000, 1,000, 1,000,000, and 1,000,000, respectively. Then we multiply CROPDMG by the CROPDMGEXP multiplier and assign the result to ‘cDMG’.
Finally, we add cDMG to pDMG and assign the result to ‘totalDMG’.
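The exponent interpretation described above can be condensed into a small lookup helper. This is only a sketch in base R: the function name `exp_multiplier` and the toy vectors are hypothetical, and it upper-cases the code before the lookup, so it also accepts a lower-case “b”, which the original `match()` call does not.

```r
# Hypothetical helper: map an exponent code to a numeric multiplier, following
# the interpretation described in the text. Codes with no known meaning
# ("", "-", "?", "+", or anything unlisted) fall back to a multiplier of 1.
exp_multiplier <- function(code) {
  lookup <- c("0" = 1e0, "1" = 1e1, "2" = 1e2, "3" = 1e3, "4" = 1e4,
              "5" = 1e5, "6" = 1e6, "7" = 1e7, "8" = 1e8,
              "H" = 1e2, "K" = 1e3, "M" = 1e6, "B" = 1e9)
  mult <- unname(lookup[toupper(as.character(code))])
  mult[is.na(mult)] <- 1
  mult
}

# Toy values, not taken from the storm data.
dmg  <- c(25, 2.5, 10, 3, 7)
code <- c("K", "M", "", "?", "5")
damage_dollars <- dmg * exp_multiplier(code)
print(damage_dollars)
```

Keeping the mapping in one named vector makes the assumptions about each code visible in a single place, which is convenient if the interpretation of the digits or of “-”, “?”, and “+” is later revised.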
OK, now we are ready to generate the analytic graphics. First, let’s look at the impact of weather events on population health by summing the number of victims (fatalities and injuries) for each weather event type.
public_health <- DT[, .(pHealth=sum(pHealth), death=sum(FATALITIES), injury=sum(INJURIES)), by=EVTYPE][order(-pHealth)]
head(public_health,5)
## EVTYPE pHealth death injury
## 1: TORNADO 96979 5633 91346
## 2: EXCESSIVE HEAT 8428 1903 6525
## 3: TSTM WIND 7461 504 6957
## 4: FLOOD 7259 470 6789
## 5: LIGHTNING 6046 816 5230
name_EVTYPE_top5 <- public_health[1:5, EVTYPE]
library(RColorBrewer)
cols <- brewer.pal(3,"YlGnBu")
pal <- colorRampPalette(cols[3:1]) # inverse of cols
with(public_health,
{barplot(pHealth[1:5], ylim=c(0,100000), yaxt='n',
xlab="", ylab="Total number of victims [people]", col=pal(5),
main = "Top 5 harmful weather events on population health",
legend.text = name_EVTYPE_top5)
axis(2, at=pretty(public_health[1:5, pHealth]),
labels=paste0(pretty(public_health[1:5, pHealth])/1000, "K"))
})
[Fig.1. Total number of fatalities and injuries caused by the top 5 most harmful types of severe weather event from 1950 to 2011.]
We plot the top 5 events to see which event types are most detrimental. Tornadoes caused by far the largest number of victims: 96,979 deaths and injuries in the United States. Excessive heat, thunderstorm wind (TSTM WIND), flood, and lightning follow in that order.
What about the economic consequences inflicted by severe weather events? To address this question, let’s look at the total loss (property and crop damage) caused by each type of weather event.
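The per-event totals above come from a grouped sum followed by a descending sort. As a cross-check of that pattern, the same steps can be sketched in base R with `tapply()`; the five rows below are made-up toy values, not records from the storm data.

```r
# Toy records; the column names mirror the storm data, the values are invented.
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "LIGHTNING", "FLOOD"),
  FATALITIES = c(1, 0, 2, 1, 0),
  INJURIES   = c(14, 2, 6, 0, 3),
  stringsAsFactors = FALSE
)

# Victims per record, then summed per event type and sorted in descending order.
toy$pHealth <- toy$FATALITIES + toy$INJURIES
victims <- sort(tapply(toy$pHealth, toy$EVTYPE, sum), decreasing = TRUE)
print(victims)
```

If the base R totals and the data.table totals agree on the real data, that gives some confidence that the grouping was done as intended.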
economy_consq <- DT[, .(totalDMG=sum(totalDMG), pDMG=sum(pDMG), cDMG=sum(cDMG)), by=EVTYPE][order(-totalDMG)]
head(economy_consq,5)
## EVTYPE totalDMG pDMG cDMG
## 1: FLOOD 150319678257 144657709807 5661968450
## 2: HURRICANE/TYPHOON 71913712800 69305840000 2607872800
## 3: TORNADO 57362333947 56947380677 414953270
## 4: STORM SURGE 43323541000 43323536000 5000
## 5: HAIL 18761221986 15735267513 3025954473
name_EVTYPE_top5_DMG <- economy_consq[1:5, EVTYPE]
cols_DMG <- brewer.pal(3,"BuGn")
pal_DMG <- colorRampPalette(cols_DMG[3:1]) # inverse of cols
with(economy_consq,
{barplot(totalDMG[1:5], ylim=c(0,160000000000), yaxt='n',
xlab="", ylab="Total estimate of damage [dollar]",
main="Top 5 harmful weather events on economy",
col=pal_DMG(5),
legend.text = name_EVTYPE_top5_DMG)
axis(2, at=pretty(economy_consq[1:5, totalDMG]),
labels=paste0(pretty(economy_consq[1:5, totalDMG])/1000000000, "B"))
})
[Fig.2. Total estimated economic loss inflicted by the top 5 harmful types of severe weather event from 1950 to 2011.]
When we sum the estimated property and crop damage for each event type, we can identify the 5 weather events most detrimental to the U.S. economy. FLOOD ranks first, followed by HURRICANE/TYPHOON, TORNADO, STORM SURGE, and HAIL. FLOOD caused more than $150 billion in damage to the U.S. economy from 1950 to 2011.
In conclusion, we find that tornadoes are the most harmful severe weather events for population health in the United States and that floods are the most detrimental weather events for the U.S. economy.
Finally, this report was prepared in the following environment:
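The split between property and crop damage can be read off the table printed above. The sketch below re-enters the five printed rows by hand (the data frame `top5` and the column `crop_share_pct` are names introduced here for illustration) and computes the crop share of each total.

```r
# Figures copied from the head(economy_consq, 5) output shown earlier.
top5 <- data.frame(
  EVTYPE = c("FLOOD", "HURRICANE/TYPHOON", "TORNADO", "STORM SURGE", "HAIL"),
  pDMG   = c(144657709807, 69305840000, 56947380677, 43323536000, 15735267513),
  cDMG   = c(5661968450, 2607872800, 414953270, 5000, 3025954473),
  stringsAsFactors = FALSE
)

# Recompute the total and the percentage of it that is crop damage.
top5$totalDMG <- top5$pDMG + top5$cDMG
top5$crop_share_pct <- round(100 * top5$cDMG / top5$totalDMG, 1)
print(top5[, c("EVTYPE", "totalDMG", "crop_share_pct")])
```

In these five rows property damage makes up most of each total, and HAIL has the largest crop share, so the economic ranking is driven mainly by property damage.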
sessionInfo()
## R version 3.5.0 (2018-04-23)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 10586)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] RColorBrewer_1.1-2 data.table_1.11.0
##
## loaded via a namespace (and not attached):
## [1] compiler_3.5.0 backports_1.1.2 magrittr_1.5 rprojroot_1.3-2
## [5] tools_3.5.0 htmltools_0.3.6 yaml_2.1.19 Rcpp_0.12.16
## [9] stringi_1.1.7 rmarkdown_1.9 knitr_1.20 stringr_1.3.0
## [13] digest_0.6.15 evaluate_0.10.1