author : @StefMT2970
Ref : R Core Team (2014). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.
Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage.
This report involves exploring the U.S. National Oceanic and Atmospheric Administration's (NOAA) storm database.It tracks the impact of major storms and weather events in the United States, with regards to estimates of any fatalities, injuries, and property and crop damage.
Data is reported as-is, records have not been removed or updated. The goal of the analysis is to detect the most important events.
It is direclty visible that tornados are the events which cause most human damage whilst flooding is responsible for most economic impact.
The dataset under examination is read from the web and loaded into the storm_tbl data table.
library(ggplot2)
library(data.table)
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:data.table':
##
## between, last
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(lubridate)
##
## Attaching package: 'lubridate'
##
## The following objects are masked from 'package:data.table':
##
## hour, mday, month, quarter, wday, week, yday, year
s <-sessionInfo()
s[1]$R.version$version.string
## [1] "R version 3.1.2 (2014-10-31)"
s[1]$R.version$nickname
## [1] "Pumpkin Helmet"
s[1]$R.version$platform
## [1] "x86_64-w64-mingw32"
oldwd <- getwd()
setwd("d:/dev/app/gitrepo/RepData_PeerAssessment2")
# dir()
if(!file.exists("stormData.csv.bz2")) {
file_url <-
"https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(url = file_url, destfile = "stormData.csv.bz2")
}
storm_tbl <- data.table(read.csv(bzfile("stormData.csv.bz2")))
# str(storm_tbl)
# storm data is a large dataset, the next line of code gives an overview
# of memory usage.
# sort( sapply(ls(),function(x){object.size(get(x))}))
# tables()
Certain variables are not in the best format to work with. Only few variables are required in this analysis. For the impact on health, 2 variables (fatalities and injuries) are added together. For the impact on economy, asset property and crop damage are counted together. Dollar amounts have to be calculated by multiplying the estimate (rounded to three significant digits), with an alphabetical character signifying the magnitude of the number, i.e., 1.55B for $1,550,000,000. Alphabetical characters used to signify magnitude include “K” for thousands, “M” for millions, and “B” for billions. Details are in National Climatic Data Center Storm Events
workset <- storm_tbl %>%
select(BGN_DATE,
REFNUM,
STATE,
EVTYPE,
FATALITIES,
INJURIES,
PROPDMG,
PROPDMGEXP,
CROPDMG,
CROPDMGEXP)
workset$BGN_DATE <- as.Date(workset$BGN_DATE, "%m/%d/%Y")
workset$BGN_YEAR <- year(workset$BGN_DATE)
workset$HEALTH <- workset$FATALITIES + workset$INJURIES
unitConversion <- function(uom){
multiplicator <- vector(mode="numeric", length=length(uom))
for ( i in 1:length(uom)){
if (uom[i] == "K") multiplicator[i] = 1000
else if (uom[i] == "M") multiplicator[i] = 1000000
else if (uom[i] == "B") multiplicator[i] = 1000000000
else multiplicator[i] = 0
}
multiplicator
}
workset[ , ECONOMICAL:=PROPDMG * unitConversion(PROPDMGEXP) +
CROPDMG * unitConversion(CROPDMGEXP)]
## BGN_DATE REFNUM STATE EVTYPE FATALITIES INJURIES PROPDMG
## 1: 1950-04-18 1 AL TORNADO 0 15 25.0
## 2: 1950-04-18 2 AL TORNADO 0 0 2.5
## 3: 1951-02-20 3 AL TORNADO 0 2 25.0
## 4: 1951-06-08 4 AL TORNADO 0 2 2.5
## 5: 1951-11-15 5 AL TORNADO 0 2 2.5
## ---
## 902293: 2011-11-30 902293 WY HIGH WIND 0 0 0.0
## 902294: 2011-11-10 902294 MT HIGH WIND 0 0 0.0
## 902295: 2011-11-08 902295 AK HIGH WIND 0 0 0.0
## 902296: 2011-11-09 902296 AK BLIZZARD 0 0 0.0
## 902297: 2011-11-28 902297 AL HEAVY SNOW 0 0 0.0
## PROPDMGEXP CROPDMG CROPDMGEXP BGN_YEAR HEALTH ECONOMICAL
## 1: K 0 1950 15 25000
## 2: K 0 1950 0 2500
## 3: K 0 1951 2 25000
## 4: K 0 1951 2 2500
## 5: K 0 1951 2 2500
## ---
## 902293: K 0 K 2011 0 0
## 902294: K 0 K 2011 0 0
## 902295: K 0 K 2011 0 0
## 902296: K 0 K 2011 0 0
## 902297: K 0 K 2011 0 0
We are mainly interested in health and economical impact so let's investigate for any outliers in those 2 calculated variables. The data remains untouched, the tables show the largest values for both impact categories. For the oulier (refnum 605943) the orginal remark is shown. It is not clear from this remark if the reported economical damage has a data entry error or not. To validate the large events, historical research would be required which is outside the scope of this report.
summary(workset$HEALTH)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1725 0.0000 1742.0000
summary(workset$ECONOMICAL)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00e+00 0.00e+00 0.00e+00 5.28e+05 1.00e+03 1.15e+11
workset[HEALTH > 1000] %>%
select(BGN_DATE, REFNUM, EVTYPE, HEALTH, FATALITIES, INJURIES, ECONOMICAL)
## BGN_DATE REFNUM EVTYPE HEALTH FATALITIES INJURIES ECONOMICAL
## 1: 1953-06-09 67884 TORNADO 1318 90 1228 2.5e+08
## 2: 1974-04-03 116011 TORNADO 1186 36 1150 2.5e+08
## 3: 1979-04-10 157885 TORNADO 1742 42 1700 2.5e+08
## 4: 1994-02-08 223436 ICE STORM 1569 1 1568 5.5e+07
## 5: 2011-05-22 862563 TORNADO 1308 158 1150 2.8e+09
workset[ECONOMICAL > 9000000000] %>%
select(BGN_DATE, REFNUM, EVTYPE, ECONOMICAL, HEALTH)
## BGN_DATE REFNUM EVTYPE ECONOMICAL HEALTH
## 1: 1993-08-31 198375 RIVER FLOOD 10000000000 0
## 2: 2005-10-24 569288 HURRICANE/TYPHOON 10000000000 5
## 3: 2005-08-28 577615 HURRICANE/TYPHOON 16930000000 0
## 4: 2005-08-29 577616 STORM SURGE 31300000000 0
## 5: 2005-08-29 581535 STORM SURGE 11260000000 0
## 6: 2006-01-01 605943 FLOOD 115032500000 0
storm_tbl [REFNUM== 605943] %>% select(REFNUM, EVTYPE, REMARKS)
## REFNUM EVTYPE
## 1: 605943 FLOOD
## REMARKS
## 1: Major flooding continued into the early hours of January 1st, before the Napa River finally fell below flood stage and the water receeded. Flooding was severe in Downtown Napa from the Napa Creek and the City and Parks Department was hit with $6 million in damage alone. The City of Napa had 600 homes with moderate damage, 150 damaged businesses with costs of at least $70 million.
Although the official guidelines only make notice of a small number of event types, the dataset contains 985 types. However, some quick calculations show that top 10 event types account for 89% of all registered events. The purpose of this analysis is to detect the most important event, therefore cleaning the last 10% does not seem to be necessary.
evtype <- table(workset$EVTYPE)
evtype <- evtype[order(evtype, decreasing=T)]
evtype[1:10]
##
## HAIL TSTM WIND THUNDERSTORM WIND
## 288661 219940 82563
## TORNADO FLASH FLOOD FLOOD
## 60652 54277 25326
## THUNDERSTORM WINDS HIGH WIND LIGHTNING
## 20843 20212 15754
## HEAVY SNOW
## 15708
sum(evtype[1:10])/nrow(workset)
## [1] 0.8909882
# barplot(evtype[1:10], las=3, cex.names = 0.5)
Observations have started as early as 1950, but the histogram shows that the mass of observations only comes in after 1990. Again no data has been removed as we already noticed that major events took place prior 1990.
hist(workset$BGN_YEAR, breaks=50, col = "blue",
main = "number of registered extreme weather events per year",
xlab = "year of observation",
ylab = "number of events")
The HEALTH variable contains the sum of fatalities and injuries for all measurements in the data set (from 1950). This variable is examined per EVTYPE (for the entire dataset, so spanning all years).The table output shows for the top event types:
Tornado's by far outnumber any other event type with regards to impact on public health.
health <- workset %>%
select(EVTYPE, HEALTH) %>%
group_by(EVTYPE) %>%
summarise(totalImpactHealth = sum(HEALTH),
avgImpactHealth = round(mean(HEALTH),2),
freqOfEvents = n())
health <- health[order(totalImpactHealth, decreasing=T)]
health$EVTYPE <- factor(health$EVTYPE, levels = health$EVTYPE)
health[totalImpactHealth > 999]
## Source: local data table [14 x 4]
##
## EVTYPE totalImpactHealth avgImpactHealth freqOfEvents
## 1 TORNADO 96979 1.60 60652
## 2 EXCESSIVE HEAT 8428 5.02 1678
## 3 TSTM WIND 7461 0.03 219940
## 4 FLOOD 7259 0.29 25326
## 5 LIGHTNING 6046 0.38 15754
## 6 HEAT 3037 3.96 767
## 7 FLASH FLOOD 2755 0.05 54277
## 8 ICE STORM 2064 1.03 2006
## 9 THUNDERSTORM WIND 1621 0.02 82563
## 10 WINTER STORM 1527 0.13 11433
## 11 HIGH WIND 1385 0.07 20212
## 12 HAIL 1376 0.00 288661
## 13 HURRICANE/TYPHOON 1339 15.22 88
## 14 HEAVY SNOW 1148 0.07 15708
glp <- ggplot(health[1:10], aes(EVTYPE, totalImpactHealth))
glp2 <- glp + geom_bar(stat= 'identity')+
labs(x = "event",
y = "Total number of fatalities and injuries") +
theme(axis.text.x = element_text(angle=90))
glp2
The analysis for the econonical impact shows a different picture. Again a table is constructed showing per event (spanning all events registered from 1950 onwards) :
Flooding has the biggest negative impact on economy, tornado's are a distant third. Hurricanes and typhoons occur rarely but are devastating.
econ <- workset %>%
select(EVTYPE, ECONOMICAL) %>%
group_by(EVTYPE) %>%
summarise(totalImpactEconMD = round(sum(ECONOMICAL)/1000000, 0),
avgImpactEconMD = round(mean(ECONOMICAL)/1000000,2),
freqOfEvents = n())
econ <- econ[order(totalImpactEconMD, decreasing=T)]
econ$EVTYPE <- factor(econ$EVTYPE, levels = econ$EVTYPE)
econ[totalImpactEconMD > 4999]
## Source: local data table [15 x 4]
##
## EVTYPE totalImpactEconMD avgImpactEconMD freqOfEvents
## 1 FLOOD 150320 5.94 25326
## 2 HURRICANE/TYPHOON 71914 817.20 88
## 3 TORNADO 57341 0.95 60652
## 4 STORM SURGE 43324 165.99 261
## 5 HAIL 18753 0.06 288661
## 6 FLASH FLOOD 17562 0.32 54277
## 7 DROUGHT 15019 6.04 2488
## 8 HURRICANE 14610 83.97 174
## 9 RIVER FLOOD 10148 58.66 173
## 10 ICE STORM 8967 4.47 2006
## 11 TROPICAL STORM 8382 12.15 690
## 12 WINTER STORM 6715 0.59 11433
## 13 HIGH WIND 5909 0.29 20212
## 14 WILDFIRE 5061 1.83 2761
## 15 TSTM WIND 5039 0.02 219940
glpe <- ggplot(econ[1:10], aes(EVTYPE, totalImpactEconMD))
glpe2 <- glpe + geom_bar(stat= 'identity')+
labs(x = "event",
y = "Total amount of economical damage in million dollars") +
theme(axis.text.x = element_text(angle=90))
glpe2