This project use the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. The fatalities, injuries, and property/crop damage during major storms and weather events in US were recorded by The NOAA since 1950. This analysis will present the ecnomic costs as well as human life cost of these storm events. Prior to 1996, Events were recorded in very few types, and many reports of storm events are not fully compliance to 48 official list of events. In this analysis, we will convert event type to standadize them and present an overview of most common events and their impact on both human life and economy. R language will be used to do data processing and basic analysis.
The data can be downloaded from https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2. Detailed information about those variables can be found at: https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf, which include the 48 event types. This source is also available at: http://www.nws.noaa.gov/directives/.
First dowadload and read data into R and explort data.
# check work directory
if(!getwd()=="C:/Users/song/Dropbox/Coursera/Reproduce/assign2")
setwd("C:/Users/song/Dropbox/Coursera/Reproduce/assign2")
dataurl <- "http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
destfile <- "C:/Users/song/Dropbox/Coursera/Reproduce/assign2/stormData.csv.bz2"
download.file(dataurl, destfile)
storm= read.table("C:/Users/song/Dropbox/Coursera/Reproduce/assign2/stormData.csv.bz2", header=T, sep = ",")
We got 902297 rows of observation, for this project, we will keep only health and property/crop related information.
#subsetting data to contain only those relevant columns
stormdata <- storm[,c('BGN_DATE','EVTYPE','END_DATE',
'FATALITIES', 'INJURIES', 'PROPDMG', 'PROPDMGEXP',
'CROPDMG','CROPDMGEXP')]
#change data format
stormdata$EVTYPE=tolower(stormdata$EVTYPE)
stormdata$BGN_DATE=as.Date(stormdata$BGN_DATE,format='%m/%d/%Y')
#review data structure
str(stormdata)
## 'data.frame': 902297 obs. of 9 variables:
## $ BGN_DATE : Date, format: "1950-04-18" "1950-04-18" ...
## $ EVTYPE : chr "tornado" "tornado" "tornado" "tornado" ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
Per http://www.ncdc.noaa.gov/stormevents/details.jsp, From 1950 through 1954, only tornado events were recorded. From 1955 through 1992, only tornado, thunderstorm wind and hail events were keyed from the paper publications into digital data. From 1993 to 1995, only tornado, thunderstorm wind and hail events have been extracted from the Unformatted Text Files. From 1996 to present, 48 event types are recorded as defined in NWS Directive 10-1605.
data1996 <- stormdata[stormdata$BGN_DATE>='1996-01-01',]
After subseeting, we get 653530 rows of observation, we’ll use year 1996 and later data to conduct our analysis.
Second, we need to standardize the event type to make them consistent. From the event types table, we created one spreadsheet to use for data processing.
#read in official eventtype
event <- read.table('C:/Users/song/Dropbox/Coursera/Reproduce/assign2/eventtype.csv',head=T,sep=',')
event$eventtype=as.character(tolower(event$eventtype) )
#trim event type
library(stringr)
event$eventtype=str_trim(event$eventtype)
event
## eventtype
## 1 astronomical low tide
## 2 avalanche
## 3 blizzard
## 4 coastal flood
## 5 cold/wind chill
## 6 debris flow
## 7 dense fog
## 8 dense smoke
## 9 drought
## 10 dust devil
## 11 dust storm
## 12 excessive heat
## 13 extreme cold/wind chill
## 14 flash flood
## 15 flood
## 16 frost/freeze
## 17 funnel cloud
## 18 freezing fog
## 19 hail
## 20 heat
## 21 heavy rain
## 22 heavy snow
## 23 high surf
## 24 high wind
## 25 hurricane (typhoon)
## 26 ice storm
## 27 lake-effect snow
## 28 lakeshore flood
## 29 lightning
## 30 marine hail
## 31 marine high wind
## 32 marine strong wind
## 33 marine thunderstorm wind
## 34 rip current
## 35 seiche
## 36 sleet
## 37 storm surge/tide
## 38 strong wind
## 39 thunderstorm wind
## 40 tornado
## 41 tropical depression
## 42 tropical storm
## 43 tsunami
## 44 volcanic ash
## 45 waterspout
## 46 wildfire
## 47 winter storm
## 48 winter weather
We shall use the actual dollars of damage for each category. Since damage in dollars are in different unit, we are using multipliers provided in the PROPDMGEXP for property damage and CROPDMGEXP for crop damage to calculate the really dollars..
Per documentation, “K” means thousand US dollars(1e+03), “M” means million US dollars(1e+06), and “B” means billion US dollars(1e+09), The conversion from characters to their numeric dollar value can be done by following:
multiplier = c(1e+03, 1e+06, 1e+09,0)
mcharacter = c("K", "M", "B","")
#property damage
data1996$pdamage = multiplier[match(data1996$PROPDMGEXP, mcharacter)]
#crop damage
data1996$cdamage = multiplier[match(data1996$CROPDMGEXP, mcharacter)]
#calculate total damage for property and crop seprately
data1996$totalpdamage <- data1996$PROPDMG * data1996$pdamage
data1996$totalcdamage <- data1996$CROPDMG * data1996$cdamage
#calcuate total damage combining property and crop damage
data1996$totaldamage=data1996$totalpdamage + data1996$totalcdamage
#Get total ecnomic damage due to storm after 1996
sum(data1996$totaldamage,na.rm=T)
## [1] 4.015e+11
Storms caused total of over 400 billions unadjusted US dollars damage since year 1996. The damage by event types can be seen as follows:
typedamage=aggregate(totaldamage~EVTYPE, data1996, sum)
str(typedamage)
## 'data.frame': 438 obs. of 2 variables:
## $ EVTYPE : chr " high surf advisory" " coastal flood" " flash flood" " lightning" ...
## $ totaldamage: num 200000 0 50000 0 8100000 8000 0 0 0 0 ...
Let’s look the top 50 most costly economic damage by event type.
#remove those with missing or 0 damage
eventcost <- typedamage[with(typedamage,totaldamage>0),]
topeco <- eventcost[with(eventcost,order(-totaldamage)),]
str(eventcost)
## 'data.frame': 155 obs. of 2 variables:
## $ EVTYPE : chr " high surf advisory" " flash flood" " tstm wind" " tstm wind (g45)" ...
## $ totaldamage: num 200000 50000 8100000 8000 28820000 ...
eco50<- topeco[1:50,]
#calculate the percentage of damage from those top 50 costly event type account for total damge from all type of damage
sum(eco50$totaldamage)/sum(typedamage$totaldamage)*100
## [1] 99.97
Since top 50 event type cover about 99.97% of total damage, we will standarize them into 48 offical event types and find out the most costly event type.
codes <- event$eventtype
Di <- adist(eco50$EVTYPE, codes)
colnames(Di) <- codes
rownames(Di) <- eco50$EVTYPE
j <- apply(Di, 1, which.min)
eco50c <- data.frame(eco50,rawtext = eco50$EVTYPE, coded= codes[j])
eco50c$coded = as.vector(eco50c$coded)
eco50c$rawtext = as.vector(eco50c$rawtext)
#str(eco50c)
Standandize event type manually if above fuzzy matching failed.
#find those not equal with official event type
fuzn <- subset(eco50c,eco50c$rawtext != eco50c$coded)
#keep those with correct event type
fuzm <- subset(eco50c,eco50c$rawtext == eco50c$coded)
#replace manually with official type
fuznl=data.frame(coded=ifelse(fuzn$rawtext %in% c('tstm wind','tstm wind/hail'),'thunderstorm wind',
ifelse(fuzn$rawtext %in% c('freeze','early frost'),'frost/freeze',
ifelse(fuzn$rawtext %in% c('landslide'),'debris flow',
ifelse(fuzn$rawtext %in% c('hurricane','typhoon'),'hurricane (typhoon)',
ifelse(fuzn$rawtext %in% c('unseasonably cold','cold'),'cold/wind chill',
ifelse(fuzn$rawtext %in% c('urban/sml stream fld','river flood','river flooding'),'flood',
ifelse(fuzn$rawtext %in% c('wind'),'strong wind',
ifelse(fuzn$rawtext %in% c('fog'),'dense fog',
ifelse(fuzn$rawtext %in% c('small hail'),'hail',
ifelse(fuzn$rawtext %in% c('extreme cold'),'extreme cold/wind chill',
fuzn$coded)))))))))))
fuzmerg<-cbind(fuzn[,1:3],fuznl)
#adding back to create newly coded event type
fuzrank <- rbind(fuzm,fuzmerg)
#str(fuzrank)
Let’s redo the calculation after official remapping event type.
# total damage by newly coded event type
ecosum <- aggregate(totaldamage~coded,fuzrank,sum)
#sort by total damage
ecorank <- ecosum[with(ecosum,order(-totaldamage)),]
Similialy, for health cost we first calculate the total mortality and injury lose, then standardize the top 50 most deadly event type to official 48 event types.
#totalhealth <- data.frame(data1996,totaldamage=(data1996$FATALITIES+data1996$INJURIES))
#calculate total damage based on coded offical eventy type and create analytic dataset
sum(data1996$FATALITIES,na.rm=T)+sum(data1996$INJURIES,na.rm=T)
## [1] 66707
##COUNT TOTAL SUMMARY
fatility <- aggregate(FATALITIES~EVTYPE,data1996,sum)
injury <- aggregate(INJURIES~EVTYPE,data1996,sum)
fatinj <- merge(fatility,injury,by='EVTYPE',all=T)
# total death due to storm since 1996
sum(fatility$FATALITIES,na.rm=T)
## [1] 8732
# tatal injury due to storm since 1996
sum(injury$INJURIES,na.rm=T)
## [1] 57975
As shown above, There are 57975 injuries and 8732 deaths since 1996.
Let’s Sort data based on fatalities and injuries and find out what are the top 50 most costly event type.
#order by descending to determin most type
fatinjs <- fatinj[with(fatinj,order(-FATALITIES,-INJURIES)),]
top50 <- fatinjs[1:50,]
top50$EVTYPE = as.character(tolower(top50$EVTYPE))
str(top50)
## 'data.frame': 50 obs. of 3 variables:
## $ EVTYPE : chr "excessive heat" "tornado" "flash flood" "lightning" ...
## $ FATALITIES: num 1797 1511 887 651 414 ...
## $ INJURIES : num 6391 20667 1674 4141 6758 ...
We will use the similar strategy to remap those event types with official event types.
codes <- event$eventtype
D <- adist(top50$EVTYPE, codes)
colnames(D) <- codes
rownames(D) <- top50$EVTYPE
i <- apply(D, 1, which.min)
top50c <- data.frame(top50,rawtext = top50$EVTYPE, coded= codes[i])
top50c$coded = as.vector(top50c$coded)
top50c$rawtext = as.vector(top50c$rawtext)
#str(top50c)
#find those not equal with official event type
trym <- subset(top50c,top50c$rawtext != top50c$coded)
#keep those with correct event type
topm <- subset(top50c,top50c$rawtext == top50c$coded)
#replace manually with official type
tryml=data.frame(coded=ifelse(trym$rawtext %in% c('tstm wind','tstm wind/hail'),'thunderstorm wind',
ifelse(trym$rawtext %in% c('extreme cold','extreme windchill','hypothermia/exposure'),'extreme cold/wind chill',
ifelse(trym$rawtext %in% c('landslide'),'debris flow',
ifelse(trym$rawtext %in% c('hurricane'),'hurricane (typhoon)',
ifelse(trym$rawtext %in% c('unseasonably warm and dry'),'drought',
ifelse(trym$rawtext %in% c('urban/sml stream fld'),'flood',
ifelse(trym$rawtext %in% c('wind'),'strong wind',
ifelse(trym$rawtext %in% c('fog'),'dense fog',
ifelse(trym$rawtext %in% c('cold','cold/snow','cold and snow'),'cold/wind chill',
ifelse(trym$rawtext %in% c('heavy surf','rough seas'),'high surf',
trym$coded)))))))))))
trymerg<-cbind(trym[,1:4],tryml)
#adding back to create newly coded event type
toprank <- rbind(topm,trymerg)
After remapping, we can recalculate the total health damage based on official event type and create an analytic dataset for health damage analysis.
#get total fatalities and injuries
topdamage <- data.frame(toprank,totaldamage=(toprank$FATALITIES+toprank$INJURIES))
#calculate total damage based on coded offical eventy type and create analytic dataset
sum(topdamage$totaldamage,na.rm=T)
## [1] 65836
healthdamage <- aggregate(totaldamage~coded,topdamage,sum)
fatalitysum <-aggregate(FATALITIES~coded,topdamage,sum)
injurysum <-aggregate(INJURIES~coded,topdamage,sum)
#combine them into one single data
library(sqldf)
## Loading required package: gsubfn
## Loading required package: proto
## Loading required package: RSQLite
## Loading required package: DBI
## Loading required package: RSQLite.extfuns
fatinjtotlong = sqldf("select coded as event,totaldamage,'total' as Damage
from healthdamage
union
select coded as event,FATALITIES as totaldamage,'fatality' as Damage
from fatalitysum
union
select coded as event,INJURIES as totaldamage,'injury' as Damage
from injurysum
")
## Loading required package: tcltk
#sort to make the most damage by descending order
healthorder <- healthdamage[with(healthdamage,order(-totaldamage)),]
#show top 10 event type that caused most deaths and injuries
healthorder[1:10,]
## coded totaldamage
## 27 tornado 22178
## 7 excessive heat 8188
## 10 flood 7279
## 26 thunderstorm wind 5500
## 19 lightning 4792
## 9 flash flood 2561
## 30 wildfire 1543
## 31 winter storm 1483
## 12 heat 1459
## 17 hurricane (typhoon) 1446
top10 <- data.frame(event=healthorder[1:10,'coded'])
fatalityorder <- fatalitysum[with(fatalitysum,order(-FATALITIES)),]
#show top 10 event types that caused most deaths
fatalityorder[1:10,]
## coded FATALITIES
## 7 excessive heat 1797
## 27 tornado 1511
## 9 flash flood 887
## 19 lightning 651
## 23 rip current 542
## 10 flood 442
## 26 thunderstorm wind 376
## 8 extreme cold/wind chill 264
## 12 heat 237
## 16 high wind 235
injuryorder <- injurysum[with(injurysum,order(-INJURIES)),]
#show top 10 event types that caused most injury
injuryorder[1:10,]
## coded INJURIES
## 27 tornado 20667
## 10 flood 6837
## 7 excessive heat 6391
## 26 thunderstorm wind 5124
## 19 lightning 4141
## 9 flash flood 1674
## 30 wildfire 1456
## 17 hurricane (typhoon) 1321
## 31 winter storm 1292
## 12 heat 1222
fatinjorder=merge(fatalityorder,injuryorder,by='coded')
#chose only those events that has most total damage
fatinjfinal <- sqldf("select * from fatinjtotlong where event in (select distinct event from top10)")
So far we have finished data processing and ready for our analysis.
After data processing, we can look at the health cost and economy cost separately.
Using standardized event tpye, the top 10 most costly events can be seen as:
#show top 10 event type that caused most deaths and injuries
healthorder[1:10,]
## coded totaldamage
## 27 tornado 22178
## 7 excessive heat 8188
## 10 flood 7279
## 26 thunderstorm wind 5500
## 19 lightning 4792
## 9 flash flood 2561
## 30 wildfire 1543
## 31 winter storm 1483
## 12 heat 1459
## 17 hurricane (typhoon) 1446
#show top 10 event types that caused most deaths
fatalityorder[1:10,]
## coded FATALITIES
## 7 excessive heat 1797
## 27 tornado 1511
## 9 flash flood 887
## 19 lightning 651
## 23 rip current 542
## 10 flood 442
## 26 thunderstorm wind 376
## 8 extreme cold/wind chill 264
## 12 heat 237
## 16 high wind 235
#show top 10 event types that caused most injury
injuryorder[1:10,]
## coded INJURIES
## 27 tornado 20667
## 10 flood 6837
## 7 excessive heat 6391
## 26 thunderstorm wind 5124
## 19 lightning 4141
## 9 flash flood 1674
## 30 wildfire 1456
## 17 hurricane (typhoon) 1321
## 31 winter storm 1292
## 12 heat 1222
Let’s look at the health damage by event type.
#require(gridExtra)
library(ggplot2)
fatinjfinal <- transform(fatinjfinal, event=reorder(event, -totaldamage) )
#ggplot(fatinjfinal, aes(factor(event),totaldamage),stat='source')+geom_bar()
qplot(x=event, y=totaldamage, fill=Damage,
data=fatinjfinal[fatinjfinal$Damage!='total',], geom="bar", stat="identity",
position="stack",las=2)+theme(axis.text.x = element_text(angle = 90, hjust = 1))+labs(title="Top 10 most Mortalitt and Injury Damage event")
#plot2 <- qplot(x=event, y=totaldamage, fill=Damage,
# data=fatinjfinal[fatinjfinal$Damage=='total',], geom="bar", stat="identity",
# position="stack",las=2)+theme(axis.text.x = element_text(angle = 45, hjust = 2))+labs(title#="Top 10 most cost health Damage events")
#grid.arrange(plot1, plot2, ncol=2)
Total economic damage we calculated based on property and crop damage.
# show top 10 costly event
ecorank[1:10,-c(3)]
## coded totaldamage
## 10 flood 1.491e+11
## 17 hurricane (typhoon) 8.707e+10
## 21 storm surge/tide 4.784e+10
## 24 tornado 2.490e+10
## 12 hail 1.709e+10
## 9 flash flood 1.656e+10
## 6 drought 1.441e+10
## 23 thunderstorm wind 8.922e+09
## 25 tropical storm 8.320e+09
## 27 wildfire 8.163e+09
ecofigure <- ecorank[1:10,-c(3)]
#change scales
ecofigure$total = ecofigure$totaldamage/1e+09
#produce a figure to show the total ecnomy damage
ecofigure <- transform(ecofigure, event=reorder(coded, -total) )
#show top
barplot(ecofigure$total,names.arg=ecofigure$coded, las=2,
xlab='', ylab='Cost damage in Billion',
main="Top 10 property and crop damage by events")
As you can see, it turns out that flood is the most costly that has over 149 Billion US dollars damage, followed by hurricane(typhoon), which has 87 billion US dollars damage. Since that flood is the most costly event, let’s see how it changes over time.
require(sqldf)
flood <- sqldf("select EVTYPE,totalpdamage,totalcdamage,totaldamage,BGN_DATE from data1996 where EVTYPE in (select distinct EVTYPE from fuzrank where coded='flood')")
## Loading required package: tcltk
str(flood)
## 'data.frame': 27749 obs. of 5 variables:
## $ EVTYPE : chr "urban/sml stream fld" "urban/sml stream fld" "flood" "flood" ...
## $ totalpdamage: num 0e+00 0e+00 2e+05 0e+00 0e+00 0e+00 0e+00 0e+00 1e+03 0e+00 ...
## $ totalcdamage: num 0 0 0 0 0 0 0 0 0 0 ...
## $ totaldamage : num 0e+00 0e+00 2e+05 0e+00 0e+00 0e+00 0e+00 0e+00 1e+03 0e+00 ...
## $ BGN_DATE : Date, format: "1996-08-18" "1996-08-18" ...
#get year of those data
library(lubridate)
floodyear = data.frame(flood,year=year(flood$BGN_DATE))
# calculate the total damage by year
floodp <- aggregate(totalpdamage~year,floodyear,sum)
floodc <- aggregate(totalcdamage~year,floodyear,sum)
floodt <- aggregate(totaldamage~year,floodyear,sum)
#combine together to one file
floodfigure = sqldf("select year, totalcdamage as totaldamage,'Crop' as Damage
from floodc
union
select year,totalpdamage as totaldamage,'Property' as Damage
from floodp
")
#plot time by year
library( lattice )
xyplot(totaldamage ~ year|Damage, floodfigure, type='l',xlab="Year",
ylab=paste("Total damage(",expression(log[10]),")"), main="Flood damage by years(log scale) ",scales = list(y = list(log = 10)) )
# show top 10 costly event
ecorank[1:10,-c(3)]
## coded totaldamage
## 10 flood 1.491e+11
## 17 hurricane (typhoon) 8.707e+10
## 21 storm surge/tide 4.784e+10
## 24 tornado 2.490e+10
## 12 hail 1.709e+10
## 9 flash flood 1.656e+10
## 6 drought 1.441e+10
## 23 thunderstorm wind 8.922e+09
## 25 tropical storm 8.320e+09
## 27 wildfire 8.163e+09
In summary, Tornado(Typhoon) is the most dangerous storm causing most mortality and injury while flood impose most costly damage to property and crops.
This analysis was bsaed on storm data after 1996 due to the facts that data before 1996 were not recorded completely and officail event type were not reported consistetly. The Economic damage was estimated using really dollars and ignored the inflation over year,so it might be underestimate early years damage. The strategy of remapping event type to 48 event type also might be faulty leading to bias. We only look at flood economic damage and year trend, it might worth to see other event type’s trent over time.