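For this assessment we use the Storm Data database collected by the US National Oceanic and Atmospheric Administration (NOAA). We analyzed the data to answer two questions: which types of events are most harmful to human health, and which types of events have the greatest economic consequences.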
We found that tornadoes cause by far the most harm to humans of any environmental event, accounting for 46% of all harm. The other top sources of harm (> 4% of all harm) were excessive heat (10%), lightning (5%), flash flood (5%), and heat (5%).
We found that the largest sources of economic damage (> 4% of all damage) were flood (31%), hurricane/typhoon (15%), tornado (12%), and storm surge (9%).
We begin by loading the packages used in the analysis.
require(dplyr)
require(ggplot2)
We download the dataset from: https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2
Next we load the data into R.
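# stringsAsFactors = TRUE keeps text columns as factors (the default before R 4.0); levels() below relies on this
df <- read.csv(bzfile("repdata-data-StormData.csv.bz2"), stringsAsFactors = TRUE)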
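The download step itself is not shown in code. A minimal sketch of one way to do it, assuming the archive is saved under the file name used in the `read.csv` call above; the helper name `fetch_if_missing` is hypothetical and not part of the original analysis:

```r
# Hypothetical helper: download `url` to `dest` only when `dest` is absent,
# so re-running the report does not fetch the file again.
fetch_if_missing <- function(url, dest) {
  if (!file.exists(dest)) {
    download.file(url, destfile = dest, mode = "wb")  # "wb" keeps the bz2 archive binary-safe
  }
  invisible(dest)
}

storm_url  <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
storm_file <- "repdata-data-StormData.csv.bz2"  # name assumed from the read.csv call
# fetch_if_missing(storm_url, storm_file)       # uncomment to actually download
```

Skipping the download when the file already exists keeps repeated runs of the report fast and avoids depending on the network each time.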
We do some preliminary inspection of the data.
dim(df)
## [1] 902297 37
df[1,]
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO 0 0
## COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1 NA 0 14 100 3 0 0
## INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1 15 25 K 0
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3040 8812 3051 8806 1
levels(df$EVTYPE) %>% length
## [1] 985
A quick look at the basics is revealing. The data set has more variables than we need, and with 985 event types there are too many to analyze individually.
For human health, the key variables are EVTYPE, FATALITIES and INJURIES. We measure how harmful an event type is by the number of fatalities and injuries it causes.
Injuries and fatalities are not equally bad for human health, so we use a completely subjective scaling factor that relates injuries to fatalities. We choose \(F = 20 I\), meaning one fatality counts as much as 20 injuries.
Please note that this ratio reflects personal judgment, not scientific evidence. I am open to different scaling factors for fatalities versus injuries.
With this, we find the fraction of all harm to human health that each event type accounts for.
health <- df %>%
  select(c(EVTYPE, FATALITIES, INJURIES)) %>%
  mutate(HARMFACTOR = FATALITIES + INJURIES/20)
health <- health %>%
  mutate(HARMpcnt = HARMFACTOR / (HARMFACTOR %>% sum))
healthImpact <- group_by(health, EVTYPE) %>%
  summarize(HARMsum = sum(HARMpcnt)) %>%
  arrange(desc(HARMsum))
healthImpact %>% as.tbl
## Source: local data frame [985 x 2]
##
## EVTYPE HARMsum
## (fctr) (dbl)
## 1 TORNADO 0.46006567
## 2 EXCESSIVE HEAT 0.10054620
## 3 LIGHTNING 0.04859865
## 4 FLASH FLOOD 0.04811830
## 5 HEAT 0.04699748
## 6 TSTM WIND 0.03842112
## 7 FLOOD 0.03650875
## 8 RIP CURRENT 0.01712116
## 9 HIGH WIND 0.01374970
## 10 WINTER STORM 0.01227031
## .. ... ...
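Because the 20:1 weighting is subjective, it can help to see how a ranking responds to other weights. The following is a minimal sketch on a made-up data frame (the counts are illustrative, not taken from Storm Data), written in base R so it runs without extra packages. The function name `harm_share` and its `weight` argument are assumptions introduced here.

```r
# Toy data: illustrative counts only, NOT values from the NOAA data set.
toy <- data.frame(
  EVTYPE     = c("A", "B", "A", "C"),
  FATALITIES = c(2, 0, 1, 1),
  INJURIES   = c(10, 40, 0, 10),
  stringsAsFactors = FALSE
)

# Share of total harm per event type, where one fatality counts as
# `weight` injuries (weight = 20 matches the HARMFACTOR definition above).
harm_share <- function(data, weight = 20) {
  harm  <- data$FATALITIES + data$INJURIES / weight
  share <- tapply(harm, data$EVTYPE, sum) / sum(harm)
  sort(share, decreasing = TRUE)
}

harm_share(toy, weight = 20)  # event type A ranks first
harm_share(toy, weight = 1)   # event type B ranks first
```

In this toy case the top-ranked event type changes with the weight, which is the kind of sensitivity worth checking before drawing conclusions from a single scaling factor.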
For economic consequences, the key variables are EVTYPE, PROPDMG and PROPDMGEXP, CROPDMG and CROPDMGEXP.
We treat property damage and crop damage as equally bad, dollar for dollar. To add them, we first need to put all the damage figures onto a common scale, so we look at the levels of our EXP variables.
econ <- df %>%
  select(c(EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP))
levels(econ$PROPDMGEXP)
## [1] "" "-" "?" "+" "0" "1" "2" "3" "4" "5" "6" "7" "8" "B" "h" "H" "K"
## [18] "m" "M"
levels(econ$CROPDMGEXP)
## [1] "" "?" "0" "2" "B" "k" "K" "m" "M"
The EXP variables tell us which power of 10 to multiply the damage figure by. Most EXP values make sense, but some are confusing. We look at those cases in closer detail.
filter(econ,PROPDMGEXP %in% c("?","+", "-"))
## EVTYPE PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 BREAKUP FLOODING 20 + 0
## 2 HIGH WIND 20 + 0
## 3 FLOODING/HEAVY RAIN 2 + 0
## 4 THUNDERSTORM WINDS 0 ? 0
## 5 HIGH WINDS 15 + 0
## 6 TORNADO 60 + 0
## 7 FLASH FLOOD 0 ? 0
## 8 FLASH FLOOD 0 ? 0
## 9 HIGH WIND 15 - 0
## 10 THUNDERSTORM WIND 0 ? 0
## 11 HAIL 0 ? 0
## 12 HAIL 0 ? 0
## 13 HAIL 0 ? 0
## 14 THUNDERSTORM WINDS 0 ? 0
filter(econ,CROPDMGEXP %in% c("?","+", "-"))
## EVTYPE PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 FLASH FLOOD WINDS 0.41 0 ?
## 2 THUNDERSTORM WINDS 0.50 K 0 ?
## 3 THUNDERSTORM WINDS 0.50 K 0 ?
## 4 THUNDERSTORM WINDS 0.00 0 ?
## 5 FLOOD/FLASH FLOOD 400.00 K 0 ?
## 6 FLOOD/FLASH FLOOD 0.50 M 0 ?
## 7 THUNDERSTORM WINDS 80.00 K 0 ?
filter(econ,PROPDMGEXP == "") %>% head
## EVTYPE PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 TSTM WIND 0 0
## 2 HAIL 0 0
## 3 HAIL 0 0
## 4 TSTM WIND 0 0
## 5 HAIL 0 0
## 6 TSTM WIND 0 0
The three most confusing symbols are ?, + and -. For these symbols, and for the empty string, we multiply by 1. The rows marked ? all report zero damage in that column, so the choice makes no difference for them.
Now we’ll create a lookup table that lets us convert EXP values to real numbers.
lookup <- data.frame(EXP = 0:9) %>%
  mutate(REPLACE = 10^EXP)
lookup <- rbind(lookup,
                c("",1),
                c("+",1),
                c("-",1),
                c("?",1),
                c("h",100),
                c("H",100),
                c("k",1000),
                c("K",1000),
                c("m",10^6),
                c("M",10^6),
                c("b",10^9),
                c("B",10^9))
lookup$REPLACE <- as.numeric(lookup$REPLACE)
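The `rbind` call above appends character vectors, so it appears to rely on R coercing both columns to text and on `as.numeric` converting `REPLACE` back afterwards. An alternative sketch that builds the same mapping with explicit column types; the name `lookup2` is hypothetical, chosen so that it does not overwrite `lookup`:

```r
# Same symbol-to-multiplier mapping as above, built with explicit types.
symbols     <- c("", "+", "-", "?", "h", "H", "k", "K", "m", "M", "b", "B")
multipliers <- c(1,  1,   1,   1,   1e2, 1e2, 1e3, 1e3, 1e6, 1e6, 1e9, 1e9)

lookup2 <- data.frame(
  EXP     = c(as.character(0:9), symbols),  # a digit d means a multiplier of 10^d
  REPLACE = c(10^(0:9), multipliers),
  stringsAsFactors = FALSE
)

str(lookup2)  # EXP is character, REPLACE is numeric
```

Keeping `EXP` as character and `REPLACE` as numeric from the start removes the need for the conversion step at the end.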
We define a helper that converts each EXP code to its multiplier, and we replace the EXP columns with the resulting multipliers PFACTOR and CFACTOR.
hash <- function(x) {
  sel <- which(lookup$EXP == x)
  lookup$REPLACE[sel]
}
PFACTOR <- sapply(econ$PROPDMGEXP,hash)
CFACTOR <- sapply(econ$CROPDMGEXP,hash)
econ <- cbind(econ %>% select(-c(PROPDMGEXP,CROPDMGEXP)),PFACTOR,CFACTOR)
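Calling `hash` through `sapply` scans the lookup table once for each of the roughly 900,000 records. A vectorized sketch of the same conversion using `match()`, shown on a small made-up vector; the names `exp_to_factor` and `lut` are assumptions for illustration, and `lut` holds only a subset of the symbols:

```r
# Small stand-in for the lookup table (a subset of the symbols above).
lut <- data.frame(
  EXP     = c("", "0", "2", "K", "M", "B"),
  REPLACE = c(1, 1, 100, 1e3, 1e6, 1e9),
  stringsAsFactors = FALSE
)

# Map every EXP code to its multiplier in a single call.
# as.character() makes this work for factor and character input alike.
exp_to_factor <- function(codes, table = lut) {
  codes <- as.character(codes)
  idx   <- match(codes, table$EXP)
  if (anyNA(idx)) stop("unknown EXP code: ", codes[is.na(idx)][1])
  table$REPLACE[idx]
}

codes <- factor(c("K", "", "M", "K", "2"))  # illustrative codes
dmg   <- c(25, 0, 1.5, 2.5, 3)              # illustrative damage figures
dmg * exp_to_factor(codes)                  # 25000 0 1500000 2500 300
```

Stopping on an unknown code, rather than silently returning a shorter or empty result, makes a gap in the lookup table visible immediately.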
With these multipliers we compute the total economic damage for each record, express it as a share of all damage, and sum the shares by event type.
econ <- mutate(econ, DMGtotal = PROPDMG*PFACTOR + CROPDMG*CFACTOR)
econ <- mutate(econ, DMGpcnt = DMGtotal / (sum(econ$DMGtotal)))
econImpact <- group_by(econ, EVTYPE) %>%
  summarize(DMGsum = sum(DMGpcnt)) %>%
  arrange(desc(DMGsum))
econImpact %>% as.tbl
## Source: local data frame [985 x 2]
##
## EVTYPE DMGsum
## (fctr) (dbl)
## 1 FLOOD 0.31491835
## 2 HURRICANE/TYPHOON 0.15065857
## 3 TORNADO 0.12017356
## 4 STORM SURGE 0.09076242
## 5 HAIL 0.03930459
## 6 FLASH FLOOD 0.03822099
## 7 DROUGHT 0.03146398
## 8 HURRICANE 0.03060830
## 9 RIVER FLOOD 0.02126081
## 10 ICE STORM 0.01878587
## .. ... ...
Here we present the top sources of harm to humans. A source of harm is considered a top source if it contributed to more than 4% of all harm caused to humans.
p <- ggplot(data = filter(healthImpact, HARMsum>.04),
            aes(x=EVTYPE, y=HARMsum*100, fill = EVTYPE))
p + geom_bar(stat = "identity") +
  labs(y = "Percent of Total Harm Caused") +
  labs(title = "Largest Sources of Environmental Harm") +
  scale_fill_brewer(palette = "Set1")
We see that of the 5 top sources of harm to humans, the lion’s share comes from tornadoes at 46%, followed by excessive heat at 10%, with each of the remaining three sources at about 5%.
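In a chart like this the bars follow the order of the `EVTYPE` factor levels, which is typically alphabetical rather than by size. One possible refinement, sketched here with the five shares from the healthImpact output rounded to three decimals, is to reorder the factor with `reorder()` before plotting. The data frame name `top` is hypothetical, and the plotting call only runs if ggplot2 is installed.

```r
# Top five shares, copied (rounded) from the healthImpact output.
top <- data.frame(
  EVTYPE  = c("TORNADO", "EXCESSIVE HEAT", "LIGHTNING", "FLASH FLOOD", "HEAT"),
  HARMsum = c(0.460, 0.101, 0.049, 0.048, 0.047),
  stringsAsFactors = FALSE
)

# reorder() sets the factor levels by descending share, which is the
# order ggplot2 uses to place the bars along the x axis.
top$EVTYPE <- reorder(top$EVTYPE, -top$HARMsum)
levels(top$EVTYPE)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(top, aes(x = EVTYPE, y = HARMsum * 100, fill = EVTYPE)) +
    geom_bar(stat = "identity") +
    labs(y = "Percent of Total Harm Caused")
  # print(p) draws the chart with the largest bar first
}
```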
Here we present the top sources of economic damage. A source of damage is considered a top source if it contributed to more than 4% of all economic damage.
p <- ggplot(data = filter(econImpact, DMGsum>.04),
            aes(x=EVTYPE, y=DMGsum*100, fill = EVTYPE))
p + geom_bar(stat = "identity") +
  labs(y = "Percent of Economic Damage") +
  labs(title = "Largest Sources of Economic Damage") +
  scale_fill_brewer(palette = "Dark2")
We found that the largest sources of economic damage were flood (31%), hurricane/typhoon (15%), tornado (12%), and storm surge (9%).