(this paper is based on a Coursera Data Science - Reproducible research peer graded assignment). The goal of this study is to explore the effects of climate events on US population health as well as the economic impacts of such events (damage to properties), in order to determine which type of events have been the most impacting in each case. Such study should be able to inform administration actions toward preventing or minimising the effects of such natural events.
The source code for this analysis is available on GithUb.
NOAA source data is available here. The documentation explaining the fields is availalble here, along with an FAQ.
suppressWarnings(suppressMessages(library(R.utils)))
projectData <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
bunzip2File <- "./StormData.csv.bz2"
if (!file.exists(bunzip2File)) {
download.file(projectData, bunzip2File)
message("Source File downloaded")
} else {
message("Source File already present in this folder.")
}
## Source File already present in this folder.
if (file.exists(bunzip2File)) {
bunzip2(bunzip2File, ext="bz2", FUN=bzfile, skip=TRUE, remove=FALSE)
message("Bunzip2 file content extracted")
}
## Bunzip2 file content extracted
library(data.table)
system.time(noaa <- read.csv("StormData.csv", header=TRUE))
## user system elapsed
## 61.03 0.86 63.11
system.time(noaa2 <- fread("StormData.csv", showProgress = FALSE))
## user system elapsed
## 3.03 0.83 5.27
noaa <- noaa2
rm(noaa2)
Due to the speed gains, this analysis will use data.table package
dim(noaa)
## [1] 902297 37
There are 902297 observations of 37 variables. However, only 8 variables will present an interest for this analysis : State, Event type (EVTYPE), # of Fatalities (FATALITIES), # of Injuries (INJURIES), Property damages (PROPDMG), Property damage exponent (PROPDMGEXP), Crop damages (CROPDMG) and Crop damages exponent (CROPDMGEXP).
noaa <- noaa[,.(STATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)]
head(noaa)
## STATE EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1: AL TORNADO 0 15 25.0 K 0
## 2: AL TORNADO 0 0 2.5 K 0
## 3: AL TORNADO 0 2 25.0 K 0
## 4: AL TORNADO 0 2 2.5 K 0
## 5: AL TORNADO 0 2 2.5 K 0
## 6: AL TORNADO 0 6 2.5 K 0
str(noaa)
## Classes 'data.table' and 'data.frame': 902297 obs. of 8 variables:
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## - attr(*, ".internal.selfref")=<externalptr>
sum(is.na(noaa))
## [1] 0
unique(noaa$PROPDMGEXP)
## [1] "K" "M" "" "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-" "1" "8"
unique(noaa$CROPDMGEXP)
## [1] "" "M" "K" "m" "B" "?" "0" "k" "2"
head(unique(noaa$EVTYPE),60)
## [1] "TORNADO" "TSTM WIND"
## [3] "HAIL" "FREEZING RAIN"
## [5] "SNOW" "ICE STORM/FLASH FLOOD"
## [7] "SNOW/ICE" "WINTER STORM"
## [9] "HURRICANE OPAL/HIGH WINDS" "THUNDERSTORM WINDS"
## [11] "RECORD COLD" "HURRICANE ERIN"
## [13] "HURRICANE OPAL" "HEAVY RAIN"
## [15] "LIGHTNING" "THUNDERSTORM WIND"
## [17] "DENSE FOG" "RIP CURRENT"
## [19] "THUNDERSTORM WINS" "FLASH FLOOD"
## [21] "FLASH FLOODING" "HIGH WINDS"
## [23] "FUNNEL CLOUD" "TORNADO F0"
## [25] "THUNDERSTORM WINDS LIGHTNING" "THUNDERSTORM WINDS/HAIL"
## [27] "HEAT" "WIND"
## [29] "LIGHTING" "HEAVY RAINS"
## [31] "LIGHTNING AND HEAVY RAIN" "FUNNEL"
## [33] "WALL CLOUD" "FLOODING"
## [35] "THUNDERSTORM WINDS HAIL" "FLOOD"
## [37] "COLD" "HEAVY RAIN/LIGHTNING"
## [39] "FLASH FLOODING/THUNDERSTORM WI" "WALL CLOUD/FUNNEL CLOUD"
## [41] "THUNDERSTORM" "WATERSPOUT"
## [43] "EXTREME COLD" "HAIL 1.75)"
## [45] "LIGHTNING/HEAVY RAIN" "HIGH WIND"
## [47] "BLIZZARD" "BLIZZARD WEATHER"
## [49] "WIND CHILL" "BREAKUP FLOODING"
## [51] "HIGH WIND/BLIZZARD" "RIVER FLOOD"
## [53] "HEAVY SNOW" "FREEZE"
## [55] "COASTAL FLOOD" "HIGH WIND AND HIGH TIDES"
## [57] "HIGH WIND/BLIZZARD/FREEZING RA" "HIGH TIDES"
## [59] "HIGH WIND AND HEAVY SNOW" "RECORD COLD AND HIGH WIND"
No missing data, however data about exponent is encoded, and no usable as such. EVTYPE also requires some cleanup (presence of summary lines, etc…)
expTable will be built accordingly.symbols <- unique(c(unique(noaa$PROPDMGEXP), unique(noaa$CROPDMGEXP)))
#"K" "M" "" "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-" "1" "8" "k"
# The list of symbols below has been rearranged manually, to ease maintenance
symbols <- c("0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "h", "H", "k", "K", "m", "M", "b", "B","", "+", "-", "?")
expValue <- c(seq(0,9),2,2,3,3,6,6,9,9,0,0,0,0)
expTable <- data.frame(symbols, expValue)
Two new columns have been created :
PROPDMGVALUE : contains the USD value of Property damagesCROPDMGVALUE : contains the USD value of Crop damagesBoth columns have been created according to the formula :
xxxDMG * 10 ^ xxxEXP
where the xxx is PROP|CROP, and where xxxEXP is the translated exponent value found in the matching table built above.
noaa[, PROPDMGVALUE := PROPDMG * 10 ^ expTable[match(PROPDMGEXP, expTable$symbols),2]]
noaa[, CROPDMGVALUE := CROPDMG * 10 ^ expTable[match(CROPDMGEXP, expTable$symbols),2]]
SUMMARY which are likely polluting the resultslibrary(stringr)
noaa[, EVTYPE:=toupper(EVTYPE)]
noaa[, EVTYPE:=trimws(EVTYPE)]
noaa <- noaa[!(EVTYPE %like% "SUMMARY")]
noaa[EVTYPE %like% "TSTM WIND", EVTYPE := "THUNDERSTORM WINDS"]
noaa[EVTYPE %like% "THUNDERSTORM", EVTYPE := "THUNDERSTORM WINDS"]
noaa[EVTYPE %like% "TORNADO", EVTYPE := "TORNADO"]
noaa[EVTYPE %like% "FLOOD", EVTYPE := "FLOOD"]
We now have a clean dataset to start this analysis.
Two types of healths harm are present in the dataset : fatalities and injuries. They will thus be analyzed separately.
A fatalities data.table is created by summing all fatalities numbers grouped by event type. The dataframe is then reordered in descending order of total fatalities.
library(ggplot2)
fatalities <- noaa[, .(FATALITIES=sum(FATALITIES)), by=EVTYPE]
fatalities <- fatalities[order(-FATALITIES)]
fatalities <- transform(fatalities, EVTYPE=reorder(EVTYPE, -FATALITIES))
g <- ggplot(fatalities[1:20], aes(EVTYPE,FATALITIES))
g + geom_bar(stat="identity", aes(fill=FATALITIES)) +
theme_light() +
theme(axis.text.x= element_text(angle=90, vjust=0.5, size=8)) +
xlab("Type of events") +
ylab("Total number of fatalities") +
ggtitle("Total number of fatalities in the US between 1950 and 2011 by type of climate event \n (Top 20 events)")
We can see the top 3 main causes for casulaties are tornadoes, excessive heat and floods. Tornadoes are a key contributor to casulaties, being responsible for more casulaties than the next 3 type of events combined.
Next the injuries. An injuries data.table is created by summing all injuries numbers grouped by event type. The dataframe is then reordered in descending order of total injuries.
injuries <- noaa[, .(INJURIES=sum(INJURIES)), by=EVTYPE]
injuries <- injuries[order(-INJURIES)]
injuries <- transform(injuries, EVTYPE=reorder(EVTYPE, -INJURIES))
g <- ggplot(injuries[1:20], aes(EVTYPE,INJURIES))
g + geom_bar(stat="identity", aes(fill=INJURIES)) +
theme_light() +
theme(axis.text.x= element_text(angle=90, vjust=0.5, size=8)) +
xlab("Type of events") +
ylab("Total number of injuries") +
ggtitle("Total number of injuries in the US between 1950 and 2011 by type of climate event \n (Top 20 events)")
We can see here that the top 3 main causes for injuries are tornadoes, thunderstorm winds, and floods. Tordano is a massive contributor, causing more injuries alone that all other top 20 types combined.
There are two types of economic effects on this dataset :
The figure below will show the cumulated total damages for this two types.
A new column TOTALDMGVALUE is created by summing both Properties damages value and crop damages value. The resulting data.table economic is then sorted in descending order of total damage value.
noaa[, TOTALDMGVALUE := PROPDMGVALUE + CROPDMGVALUE]
economic <- noaa[, .(TOTALDMGVALUE=sum(TOTALDMGVALUE)), by=EVTYPE]
economic[, TOTALDMGVALUEBN := TOTALDMGVALUE / 1e9]
economic <- economic[order(-TOTALDMGVALUE)]
economic <- transform(economic, EVTYPE=reorder(EVTYPE, -TOTALDMGVALUE))
g <- ggplot(economic[1:20], aes(EVTYPE,TOTALDMGVALUEBN))
g + geom_bar(stat="identity", aes(fill=TOTALDMGVALUEBN)) +
theme_light() +
theme(axis.text.x= element_text(angle=90, vjust=0.5, size=8)) +
xlab("Type of events") +
ylab("Total economic damage in $Bn") +
labs(fill="Economic damages value") +
ggtitle("Total economic damage in the US between 1950 and 2011 \n by type of climate event (Top 20 events)")
This time we can see that tornadoes are only the 3rd major contributor in terms of economic damages, behind floods and hurricanes.
A strategy aiming at reducing health hazard of natural events would thus need to focus on tornadoes / winds, heat and floods prevention measures, while a strategy aiming at reducing economic impact of such events would focus on floods, typhoon and tornadoes prevention measures.