The aim of this report is to show the effect of the most dangerous weather events on population health and the economy across the US. As shown below, tornadoes cause more deaths and injuries than any other weather event. The greatest economic damage (property and crop damage combined) came from floods. A plot of the yearly frequency of the most dangerous weather events is also presented.
The data was obtained from the Peer Assignment 2 webpage and saved as ‘stormdata.csv.bz2’. It is a database of major storms and weather events in the United States from the beginning of 1950 to November 2011. Because the file is compressed, it was read using the read.csv and bzfile commands.
dat <- read.csv(bzfile("stormdata.csv.bz2"))
The data.frame dat has dimensions:
dim(dat)
## [1] 902297 37
The columns already have names, so there is no need to set them:
names(dat)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
The variables I’ll work with are BGN_DATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG and CROPDMGEXP.
suppressMessages(library(dplyr))
dat <- select(dat, BGN_DATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)
Before starting, I need to modify some variables. First, the dates:
dat$BGN_DATE <- as.Date(as.character(dat$BGN_DATE), "%m/%d/%Y")
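A quick check of the date conversion on a small sample. This is only a sketch: the two strings below are made up, and I assume the raw BGN_DATE values carry a trailing time part such as "0:00:00". as.Date reads only as much of the string as the format needs and ignores the rest.

```r
# Made-up sample values; the "m/d/Y H:M:S" layout is an assumption
raw <- c("4/18/1950 0:00:00", "11/28/2011 0:00:00")

# Same call as in the report: only month/day/year is parsed,
# trailing characters are ignored
parsed <- as.Date(raw, "%m/%d/%Y")
parsed

# The year can be extracted later with format()
format(parsed, "%Y")
```

If a value did not follow this layout, as.Date would return NA for it, so counting NAs with sum(is.na(dat$BGN_DATE)) is a cheap sanity check.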
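Next, the damage exponent codes are converted to numeric multipliers. The code is not neat, but it is explicit. Blank and symbol codes ("+", "-", "?") get a multiplier of 0, so those records contribute no damage. After that I added a new variable TD (total damage), which is the sum of property and crop damage in dollars.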
dat$PROPDMGEXP <- as.character(dat$PROPDMGEXP)
dat$PROPDMGEXP[dat$PROPDMGEXP %in% c("", "+", "-", "?")] <- "0"
dat$PROPDMGEXP[dat$PROPDMGEXP == "B"] <- "1000000000"
dat$PROPDMGEXP[dat$PROPDMGEXP %in% c("M", "m")] <- "1000000"
dat$PROPDMGEXP[dat$PROPDMGEXP == "K"] <- "1000"
dat$PROPDMGEXP[dat$PROPDMGEXP %in% c("H", "h")] <- "100"
dat$PROPDMGEXP[dat$PROPDMGEXP == "1"] <- "10"
dat$PROPDMGEXP[dat$PROPDMGEXP == "2"] <- "100"
dat$PROPDMGEXP[dat$PROPDMGEXP == "3"] <- "1000"
dat$PROPDMGEXP[dat$PROPDMGEXP == "4"] <- "10000"
dat$PROPDMGEXP[dat$PROPDMGEXP == "5"] <- "100000"
dat$PROPDMGEXP[dat$PROPDMGEXP == "6"] <- "1000000"
dat$PROPDMGEXP[dat$PROPDMGEXP == "7"] <- "10000000"
dat$PROPDMGEXP[dat$PROPDMGEXP == "8"] <- "100000000"
dat$PROPDMGEXP <- as.numeric(dat$PROPDMGEXP)
levels(factor(dat$PROPDMGEXP))
## [1] "0" "10" "100" "1000" "10000" "1e+05" "1e+06" "1e+07"
## [9] "1e+08" "1e+09"
dat$CROPDMGEXP <- as.character(dat$CROPDMGEXP)
dat$CROPDMGEXP[dat$CROPDMGEXP %in% c("", "?")] <- "0"
dat$CROPDMGEXP[dat$CROPDMGEXP == "B"] <- "1000000000"
dat$CROPDMGEXP[dat$CROPDMGEXP %in% c("M", "m")] <- "1000000"
dat$CROPDMGEXP[dat$CROPDMGEXP %in% c("K", "k")] <- "1000"
dat$CROPDMGEXP[dat$CROPDMGEXP == "2"] <- "100"
dat$CROPDMGEXP <- as.numeric(dat$CROPDMGEXP)
levels(factor(dat$CROPDMGEXP))
## [1] "0" "100" "1000" "1e+06" "1e+09"
dat$TD <- dat$PROPDMG*dat$PROPDMGEXP + dat$CROPDMG*dat$CROPDMGEXP
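The same mapping could probably be written more compactly with a lookup vector. The sketch below is not the code used for the results in this report; exp_to_multiplier is a hypothetical helper name, and it is meant to reproduce the rules above (letters H, K, M, B in either case, digits 1 to 8 as powers of ten, everything else 0).

```r
# Hypothetical helper: convert exponent codes to numeric multipliers
exp_to_multiplier <- function(code) {
  lookup <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)
  code <- toupper(as.character(code))

  # Letter codes: NA for anything that is not in the lookup
  mult <- unname(lookup[code])

  # Digit codes 1..8 are treated as powers of ten, as in the code above
  digit <- grepl("^[1-8]$", code)
  mult[digit] <- 10 ^ as.numeric(code[digit])

  # Blank, "0" and symbol codes get a multiplier of 0
  mult[is.na(mult)] <- 0
  mult
}

# Made-up values for illustration
codes  <- c("K", "m", "B", "", "?", "2", "h", "0")
amount <- c(25, 2.5, 1, 10, 10, 3, 4, 5)
amount * exp_to_multiplier(codes)
```

With such a helper, TD would be PROPDMG times the property multiplier plus CROPDMG times the crop multiplier, exactly as in the line above.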
Here are the top 10 event types by fatalities and by total damage:
suppressMessages(library(dplyr))
head(dat %>% group_by(EVTYPE) %>% summarise(Fatal=sum(FATALITIES)) %>% arrange(desc(Fatal)), 10)
## Source: local data frame [10 x 2]
##
## EVTYPE Fatal
## (fctr) (dbl)
## 1 TORNADO 5633
## 2 EXCESSIVE HEAT 1903
## 3 FLASH FLOOD 978
## 4 HEAT 937
## 5 LIGHTNING 816
## 6 TSTM WIND 504
## 7 FLOOD 470
## 8 RIP CURRENT 368
## 9 HIGH WIND 248
## 10 AVALANCHE 224
head(dat %>% group_by(EVTYPE) %>% summarise(Damage=sum(TD)) %>% arrange(desc(Damage)), 10)
## Source: local data frame [10 x 2]
##
## EVTYPE Damage
## (fctr) (dbl)
## 1 FLOOD 150319678250
## 2 HURRICANE/TYPHOON 71913712800
## 3 TORNADO 57362333590
## 4 STORM SURGE 43323541000
## 5 HAIL 18761221670
## 6 FLASH FLOOD 18243990610
## 7 DROUGHT 15018672000
## 8 HURRICANE 14610229010
## 9 RIVER FLOOD 10148404500
## 10 ICE STORM 8967041310
It’s easy to see that some types must be merged (for example, HEAT and EXCESSIVE HEAT, or HURRICANE and HURRICANE/TYPHOON). Which types should be merged is an open question; I decided to merge the data this way:
dat$type <- as.character(dat$EVTYPE)
dat$type[grepl("torn", dat$type, ignore.case = TRUE)] <- "TORNADO"
dat$type[grepl("thunderstorm|tstm", dat$type, ignore.case = TRUE)] <- "THUNDERSTORM"
dat$type[grepl("flood", dat$type, ignore.case = TRUE)] <- "FLOOD"
dat$type[grepl("heat|hot", dat$type, ignore.case = TRUE)] <- "HEAT"
dat$type[grepl("hail", dat$type, ignore.case = TRUE)] <- "HAIL"
dat$type[grepl("hurric", dat$type, ignore.case = TRUE)] <- "HURRICANE"
dat$type[grepl("snow|ice|blizzard", dat$type, ignore.case = TRUE)] <- "SNOW"
dat$type[grepl("lightning", dat$type, ignore.case = TRUE)] <- "LIGHTNING"
dat$type[grepl("current", dat$type, ignore.case = TRUE)] <- "RIP CURRENT"
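One thing worth keeping in mind is that the order of these rules matters: once a label is overwritten, later patterns are tested against the new label, so for mixed event names the first matching rule wins. A small sketch with made-up labels (not taken from the dataset) and only four of the rules shows the effect:

```r
# Made-up labels for illustration only
labels <- c("TORNADO F1", "TSTM WIND",
            "THUNDERSTORM WINDS/FLASH FLOOD", "Heavy Snow", "DROUGHT")

# Ordered rules: regular expression -> merged label
rules <- c("torn"              = "TORNADO",
           "thunderstorm|tstm" = "THUNDERSTORM",
           "flood"             = "FLOOD",
           "snow|ice|blizzard" = "SNOW")

merged <- labels
for (pattern in names(rules)) {
  # Patterns are matched against the current (possibly already merged) label
  hit <- grepl(pattern, merged, ignore.case = TRUE)
  merged[hit] <- rules[[pattern]]
}

data.frame(original = labels, merged = merged)
```

Here the mixed thunderstorm/flood label ends up as THUNDERSTORM because that rule runs before the flood rule, and labels that match no rule (DROUGHT) are left unchanged. Swapping the rule order would move such mixed events to a different group.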
The next two variables, together with ggplot2, clarify the impact on public health and the economy:
pop <- dat %>% group_by(type) %>% summarise(Fatalities = sum(FATALITIES), Injured = sum(INJURIES)) %>% arrange(desc(Fatalities))
dam <- dat %>% group_by(type) %>% summarise(Damage = sum(TD)) %>% arrange(desc(Damage))
library(ggplot2)
Top 10 types of weather events that cause the most injuries and deaths:
(pop %>% arrange(desc(Injured)) %>% select(type, Injured))[1:10,]
## Source: local data frame [10 x 2]
##
## type Injured
## (chr) (dbl)
## 1 TORNADO 91407
## 2 THUNDERSTORM 9544
## 3 HEAT 9224
## 4 FLOOD 8604
## 5 LIGHTNING 5231
## 6 SNOW 4123
## 7 HAIL 1371
## 8 HURRICANE 1328
## 9 WINTER STORM 1321
## 10 HIGH WIND 1137
(pop %>% arrange(desc(Fatalities)) %>% select(type, Fatalities))[1:10,]
## Source: local data frame [10 x 2]
##
## type Fatalities
## (chr) (dbl)
## 1 TORNADO 5661
## 2 HEAT 3138
## 3 FLOOD 1525
## 4 LIGHTNING 817
## 5 THUNDERSTORM 729
## 6 RIP CURRENT 577
## 7 SNOW 367
## 8 HIGH WIND 248
## 9 AVALANCHE 224
## 10 WINTER STORM 206
To compare the 5 event types with the most fatalities, look at the plot below (note that tornado injuries go off the scale; see the table above).
library(reshape2)
pop5 <- melt(pop[1:5, ], id = "type")
ggplot(pop5, aes(x=type, y = value, fill=variable)) + geom_bar(position = "dodge", stat = "identity") + coord_cartesian(ylim = c(0,10000)) + ggtitle("5 Most harmful disasters for population health")
The 10 types of weather events that cause the most damage. The totals include both property damage and crop damage.
(dam %>% arrange(desc(Damage)) %>% select(type, Damage))[1:10,]
## Source: local data frame [10 x 2]
##
## type Damage
## (chr) (dbl)
## 1 FLOOD 180591769420
## 2 HURRICANE 90271472810
## 3 TORNADO 59020779590
## 4 STORM SURGE 43323541000
## 5 HAIL 19024451820
## 6 DROUGHT 15018672000
## 7 THUNDERSTORM 12456456330
## 8 SNOW 10917175400
## 9 TROPICAL STORM 8382236550
## 10 WINTER STORM 6715441250
The next plot compares the 5 most damaging weather event types, in dollars.
ggplot(dam[1:5, ], aes(x = type, y = Damage, fill = type)) + geom_bar(stat = "identity") + ggtitle("Damage from weather events 1950-2011")
First, I need the most dangerous weather events for both the economy and the population. Second, I extract the frequency of each of these events per year. These new variables make it possible to plot the frequencies of disasters per year.
mostDang <- union(dam$type[1:5], pop$type[1:5])
disFreq <- dat %>% filter(type %in% mostDang) %>% mutate(Year = format(BGN_DATE, "%Y")) %>% group_by(Year, type) %>% summarise(Frequency = n())
qplot(Year, Frequency, data = disFreq, fill = type, group = type, geom = "line", colour = type)
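To make the counting step easier to follow, here is the same pipeline on a tiny made-up data frame. The toy rows and the names toy and toy_freq are invented for illustration; only the dplyr verbs are the ones used above.

```r
suppressMessages(library(dplyr))

# Four made-up events: two tornadoes in 1950, one tornado and one flood in 1951
toy <- data.frame(
  BGN_DATE = as.Date(c("1950-04-18", "1950-06-08", "1951-02-20", "1951-07-04")),
  type = c("TORNADO", "TORNADO", "TORNADO", "FLOOD"),
  stringsAsFactors = FALSE
)

toy_freq <- toy %>%
  mutate(Year = format(BGN_DATE, "%Y")) %>%  # year as a character string
  group_by(Year, type) %>%
  summarise(Frequency = n()) %>%             # number of events per year and type
  ungroup()

toy_freq
```

Each row of the result is one (Year, type) pair with its event count, which is the shape the line plot needs. Note that Year is a character column here, so it is treated as a discrete axis; converting it with as.integer would give a continuous one.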
Fin.