By Sentil Pillai
August 2014
This analysis explores the U.S. National Oceanic and Atmospheric Administration's (NOAA) storm database, which tracks characteristics of extreme weather events in the United States in the period 1950 to 2011. It includes when and where they occur, as well as estimates of any fatalities, injuries, and property and crop damage.
The top three types of extreme weather events that are most harmful with respect to population health (fatalities and injuries) are tornado, heat and thunderstorm.
The top three types of extreme weather events that have the greatest economic (property and crop damage) consequences are flood, hurricane and tornado.
Initialize, load libraries, session information
library(tcltk)
library(scales)
library(ggplot2)
library(sqldf)
## Loading required package: gsubfn
## Loading required package: proto
## Loading required package: RSQLite
## Loading required package: DBI
## Loading required package: RSQLite.extfuns
sessionInfo()
## R version 3.0.3 (2014-03-06)
## Platform: x86_64-apple-darwin10.8.0 (64-bit)
##
## locale:
## [1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8
##
## attached base packages:
## [1] tcltk stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] sqldf_0.4-7.1 RSQLite.extfuns_0.0.1 RSQLite_0.11.4
## [4] DBI_0.2-7 gsubfn_0.6-5 proto_0.3-10
## [7] ggplot2_1.0.0 scales_0.2.4 knitr_1.6
##
## loaded via a namespace (and not attached):
## [1] chron_2.3-45 colorspace_1.2-4 digest_0.6.4 evaluate_0.5.5
## [5] formatR_0.10 grid_3.0.3 gtable_0.1.2 MASS_7.3-33
## [9] munsell_0.4.2 plyr_1.8.1 Rcpp_0.11.1 reshape2_1.4
## [13] stringr_0.6.2 tools_3.0.3
Download the data file and the load the data
## Generated temporary file name and location
tmpFile <- tempfile(fileext = ".bz2")
## source data URL
fileURL <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
## download the data set and to the temporary file
download.file(fileURL, destfile = tmpFile, method = "curl")
## unzip the temporary file, read the file
rawData <- read.csv(bzfile(tmpFile),header=TRUE)
dim(rawData)
## [1] 902297 37
The raw NOAA dataset shows a large number of observations ( ~900 thousand rows with 37 columns )
Subset the data required for the analysis
For the analysis only the seven columns (event type, injuries, fatalities, property damage, crop damage and exponents) and rows with value greater than zero are needed.
## subset the columns and the data rows
stormData <- sqldf('select EVTYPE, FATALITIES, INJURIES,
PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP
from rawData
where FATALITIES > 0 or INJURIES >0
or CROPDMG > 0 or PROPDMG > 0')
str(stormData)
## 'data.frame': 254633 obs. of 7 variables:
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
Transform the event type data for analysis
There are 985 distinct weather event types in the raw dataset. There are data entry mistakes and several variations in terminology. The following cleans the EVTYPE data and consolidates / categories / groups them into 20 new events types.
eventData <- sqldf('select EVTYPE, EVTYPE as newEventType, EventCount
from (select EVTYPE, count(*) as EventCount
from stormData
group by EVTYPE)')
## remove all non alphabaltic characters
eventData <- sqldf(c('update eventData
set newEventType = UPPER(TRIM(
REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
REPLACE(REPLACE(REPLACE( newEventType
,"0","") ,"1","") ,"2","") ,"3","") ,"4","")
,"5","") ,"6","") ,"7","") ,"8","") ,"9","")
,"(","") ,")","") ,"."," ") ,"/"," ") ,"&"," ")
,"\"," ") ,"-"," ") ,"/"," ") ,","," ") ,"?"," ")
,":"," ") ,","," ") ," "," ") )) '
, 'select * from main.eventData' ))
## create new event type TORNADO
eventData <- sqldf(c('update eventData set newEventType = "TORNADO"
where newEventType like "%TORNADO%"'
, 'select * from main.eventData' ))
## create new event type WATERSPOUT
eventData <- sqldf(c('update eventData set newEventType = "WATERSPOUT"
where newEventType like "%SPOUT%"'
, 'select * from main.eventData' ))
## create new event type FLOOD
eventData <- sqldf(c('update eventData set newEventType = "FLOOD"
where newEventType like "%FLOOD%"
or newEventType like "%FLD%"'
, 'select * from main.eventData' ))
## create new event type THUNDERSTORM
eventData <- sqldf(c('update eventData set newEventType = "THUNDERSTORM"
where newEventType like "%THUNDERSTORM%"
or newEventType like "%TSTM%"
or newEventType like "%TROPICAL DEPRESSION%"
or newEventType like "%GUSTNADO%"'
, 'select * from main.eventData' ))
## create new event type HURRICANE
eventData <- sqldf(c('update eventData set newEventType = "HURRICANE"
where newEventType like "%HURRICANE%"
or newEventType like "%TYPHOON%"'
, 'select * from main.eventData' ))
## create new event type HAIL
eventData <- sqldf(c('update eventData set newEventType = "HAIL"
where newEventType like "%HAIL%"'
, 'select * from main.eventData' ))
## create new event type WIND
eventData <- sqldf(c('update eventData set newEventType = "WIND"
where newEventType like "%WIND%"
or newEventType like "%DUST%"'
, 'select * from main.eventData' ))
## create new event type FIRE
eventData <- sqldf(c('update eventData set newEventType = "FIRE"
where newEventType like "%FIRE%"'
, 'select * from main.eventData' ))
## create new event type RAIN
eventData <- sqldf(c('update eventData set newEventType = "RAIN"
where newEventType like "%RAIN%"'
, 'select * from main.eventData' ))
## create new event type HEAT
eventData <- sqldf(c('update eventData set newEventType = "HEAT"
where newEventType like "%HEAT%"
or newEventType like "%HOT%"
or newEventType like "%WARM%"'
, 'select * from main.eventData' ))
## create new event type WINTER
eventData <- sqldf(c('update eventData set newEventType = "WINTER"
where newEventType like "%COLD%"
or newEventType like "%FREEZ%"
or newEventType like "%WINT%"
or newEventType like "%FROST%"
or newEventType like "%GLAZE%"
or newEventType like "%SNOW%"
or newEventType like "%SLEET%"
or newEventType like "%ICY%"
or newEventType like "%ICE%"
or newEventType like "%BLIZZARD%"'
, 'select * from main.eventData' ))
## create new event type STORM
eventData <- sqldf(c('update eventData set newEventType = "STORM"
where newEventType like "%STORM%"
and newEventType != "THUNDERSTORM"'
, 'select * from main.eventData' ))
## create new event type COASTAL
eventData <- sqldf(c('update eventData set newEventType = "COASTAL"
where newEventType like "%CURRENT%"
or newEventType like "%SURF%"
or newEventType like "%COASTAL%"
or newEventType like "%SEICHE%"
or newEventType like "% SEA%"
or newEventType like "%SWELL%"
or newEventType like "%TIDE%"
or newEventType like "%WAVE%"
or newEventType like "%BEACH%"
or newEventType like "%TSUNAMI%"'
, 'select * from main.eventData' ))
## create new event type LANDSLIDE
eventData <- sqldf(c('update eventData set newEventType = "LANDSLIDE"
where newEventType like "%MUD%"
or newEventType like "%LAND%"'
, 'select * from main.eventData' ))
## create new event type FOG
eventData <- sqldf(c('update eventData set newEventType = "FOG"
where newEventType like "%FOG%"'
, 'select * from main.eventData' ))
## create new event type FUNNEL CLOUD
eventData <- sqldf(c('update eventData set newEventType = "FUNNEL CLOUD"
where newEventType like "%FUNNEL%"'
, 'select * from main.eventData' ))
## create new event type LIGHTNING
eventData <- sqldf(c('update eventData set newEventType = "LIGHTNING"
where newEventType like "%LIGHTNING%"'
, 'select * from main.eventData' ))
## create new event type AVALANCHE
eventData <- sqldf(c('update eventData set newEventType = "AVALANCHE"
where newEventType like "%AVALAN%"'
, 'select * from main.eventData' ))
## create new event type DROUGHT
eventData <- sqldf(c('update eventData set newEventType = "DROUGHT"
where newEventType like "%DROUGHT%"
or newEventType like "%DRY%"
or newEventType like "%DUST%"'
, 'select * from main.eventData' ))
## create new event type WATER
eventData <- sqldf(c('update eventData set newEventType = "WATER"
where (newEventType like "%WATER%"
or newEventType like "%WET%"
or newEventType like "%PRECIP%")
and newEventType != "WATERSPOUT"'
, 'select * from main.eventData' ))
## create new event type OTHER, include all un categorized events
eventData <- sqldf(c('update eventData set newEventType = "OTHER"
where newEventType not in (
"TORNADO", "WATERSPOUT", "FLOOD"
,"THUNDERSTORM","HURRICANE", "HAIL"
,"WIND", "FIRE", "RAIN", "HEAT"
,"WINTER", "STORM","COASTAL", "LANDSLIDE"
,"FOG", "FUNNEL CLOUD", "LIGHTNING"
, "AVALANCHE", "DROUGHT", "WATER" )'
, 'select * from main.eventData' ))
## Examine the new event types
sqldf('select newEventType, count(EVTYPE) as countOfEVTYPE
from eventData
group by newEventType')
## newEventType countOfEVTYPE
## 1 AVALANCHE 2
## 2 COASTAL 26
## 3 DROUGHT 2
## 4 FIRE 9
## 5 FLOOD 64
## 6 FOG 2
## 7 FUNNEL CLOUD 1
## 8 HAIL 19
## 9 HEAT 13
## 10 HURRICANE 11
## 11 LANDSLIDE 9
## 12 LIGHTNING 4
## 13 OTHER 29
## 14 RAIN 26
## 15 STORM 10
## 16 THUNDERSTORM 71
## 17 TORNADO 12
## 18 WATER 9
## 19 WATERSPOUT 4
## 20 WIND 68
## 21 WINTER 97
Transform the data for analysis of impact on human health.
Sum the number of human fatalities and injuries by grouping it by the new event type.
causalityData <- sqldf('select e.newEventType,
sum(s.FATALITIES) TotalFATALITIES,
sum(s.INJURIES) TotalINJURIES,
sum(s.FATALITIES + s.INJURIES) totalCausality
from stormData s, eventData e
where (FATALITIES > 0 or INJURIES >0)
and e.EVTYPE = s.EVTYPE
group by e.newEventType')
dim(causalityData)
## [1] 21 4
## Order by the total causality and subset the data greater than one thousand persons.
causalityDataResult <- sqldf('select newEventType,
totalCausality
from causalityData
where totalCausality > 999
order by totalCausality Desc')
## create factors for ploting newEventType
causalityDataResult$newEventType <- factor(causalityDataResult$newEventType, levels=causalityDataResult$newEventType)
dim(causalityDataResult)
## [1] 12 2
Transform the data for analysis of economic cost calculation.
The crop and property damage dollar amount have a exponent component. The exponent is de-coded and multiplied to the respective amount.
damageDollarData <- sqldf('select CROPDMG,
CROPDMGEXP,
case upper(CROPDMGEXP)
when "B" then (CROPDMG * 1000000000 )
when "M" then (CROPDMG * 1000000 )
when "K" then (CROPDMG * 1000 )
when "H" then (CROPDMG * 100 )
when "2" then (CROPDMG * 100 )
when "3" then (CROPDMG * 1000 )
when "4" then (CROPDMG * 10000 )
when "5" then (CROPDMG * 100000 )
when "6" then (CROPDMG * 1000000 )
when "7" then (CROPDMG * 10000000 )
else (CROPDMG * 1)
end as CROPtotal,
PROPDMG,
PROPDMGEXP,
case upper(PROPDMGEXP)
when "B" then (PROPDMG * 1000000000 )
when "M" then (PROPDMG * 1000000 )
when "K" then (PROPDMG * 1000 )
when "H" then (PROPDMG * 100 )
when "2" then (PROPDMG * 100 )
when "3" then (PROPDMG * 1000 )
when "4" then (PROPDMG * 10000 )
when "5" then (PROPDMG * 100000 )
when "6" then (PROPDMG * 1000000 )
when "7" then (PROPDMG * 10000000 )
else (PROPDMG * 1)
end as PROPtotal,
EVTYPE
from stormData
where CROPDMG > 0 or PROPDMG > 0')
dim(damageDollarData)
## [1] 245031 7
Sum the crop damage and property damage by grouping it by the new event type.
damageData <- sqldf('select e.newEventType,
sum(d.CROPtotal) TotalCrop,
sum(d.PROPtotal) TotalProperty,
sum(d.CROPtotal + d.PROPtotal) TotalDollar
from damageDollarData d, eventData e
where e.EVTYPE = d.EVTYPE
group by e.newEventType')
dim(damageData)
## [1] 21 4
## Order the economic damage data and subset it over one billion dollars.
damageDataResult <- sqldf('select newEventType,
(TotalDollar / 1000000000) as inBillionDollar
from damageData
where TotalDollar > 999999999
order by inBillionDollar Desc')
## create factors for ploting newEventType
damageDataResult$newEventType <- factor(damageDataResult$newEventType, levels = damageDataResult$newEventType)
dim(damageDataResult)
## [1] 11 2
The top most extreme weather events which are most harmful to human health.
sqldf('select newEventType, totalCausality from causalityDataResult order by totalCausality desc')
## newEventType totalCausality
## 1 TORNADO 97068
## 2 HEAT 12421
## 3 THUNDERSTORM 10273
## 4 FLOOD 10236
## 5 WINTER 7445
## 6 LIGHTNING 6048
## 7 WIND 3197
## 8 COASTAL 1705
## 9 FIRE 1698
## 10 HURRICANE 1468
## 11 HAIL 1386
## 12 FOG 1156
The above list displays the top most extreme weather events; accounting for more than 1,000 recorded fatalities and injuries to human population USA in the period of 1950 to 2011 recorded by NOAA. The following plots the data.
require(scales)
ggplot(causalityDataResult, aes(newEventType, totalCausality)) +
labs(x="Weather event", y="Number of persons health effected"
, title="Most harmfull weather events to US population during (1950-2011)") +
theme(axis.text.x = element_text( angle = 30, hjust = 1)) +
geom_bar(stat = "identity" , aes(fill = newEventType)) +
scale_y_continuous(labels = comma )
The top most extreme weather events which have the greatest economic (property and crop damage) consequences.
sqldf('select newEventType, inBillionDollar from damageDataResult order by inBillionDollar desc')
## newEventType inBillionDollar
## 1 FLOOD 180.659
## 2 HURRICANE 90.873
## 3 TORNADO 59.021
## 4 STORM 56.375
## 5 WINTER 21.216
## 6 HAIL 19.024
## 7 DROUGHT 15.025
## 8 THUNDERSTORM 12.458
## 9 FIRE 8.905
## 10 WIND 7.058
## 11 RAIN 4.053
The above list displays the top most extreme weather events; which have the greatest economic (property and crop damage) cost above a billion dollars to USA in the period of 1950 to 2011 recorded by NOAA. The dollar amount is not adjusted for inflation over the 60 year period. The economic cost does not account other factors. The following plots the data.
require(scales)
ggplot(damageDataResult, aes(newEventType, inBillionDollar)) +
labs(x="Weather event", y="Economic cost (crop and property) in billion dollars"
, title="Most costly weather events to US ecomony during (1950-2011)") +
theme(axis.text.x = element_text( angle = 30, hjust = 1)) +
geom_bar(aes(fill = newEventType), stat = "identity" ) +
scale_y_continuous(labels = dollar )