library(dplyr)
library(ggplot2)
From the storm dataset, I found that HAIL, TSTM WIND and THUNDERSTORM WIND (the latter two both denote thunderstorm wind) are the three most common event types by number of occurrences; however, they do not necessarily cause the most harm to human health or the economy.
I used the fatalities and injuries attributed to each event type and found that tornadoes caused the most fatalities and injuries combined, with almost 100K in total (96,979), followed by excessive heat and thunderstorm wind (TSTM WIND) with about 8K and 7K cases respectively. This shows that the most common natural disaster is not necessarily the deadliest.
Lastly, I measured the impact on the economy as property and crop damage in millions of dollars. Interestingly, Flood caused the most combined property and crop damage, followed by Tornado and Hurricane/Typhoon.
In summary, Tornado is the event type that caused the most harm to human health, while Flood caused the largest economic damage.
Let’s read in the data.
## Download the data if necessary (download.file returns a status code, not the data itself)
#download.file('https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2',
# destfile = '/Users/garymu/Dropbox/Coursera/DS/reproducible_research/final/data.csv.bz2')
df <- read.csv('/Users/garymu/Dropbox (Personal)/Coursera/DS/reproducible_research/final/data.csv.bz2', stringsAsFactors = FALSE)
dim(df)
## [1] 902297 37
It looks like we have 37 variables and 902,297 observations.
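The download step above is commented out and uses a machine-specific path. A minimal sketch of a guarded download, assuming a hypothetical helper `fetch_if_missing` and a temporary-directory destination (neither appears in the original analysis), could look like this:

```r
# Hypothetical helper (not part of the original analysis):
# download `url` to `dest` only when `dest` does not already exist.
fetch_if_missing <- function(url, dest) {
  if (!file.exists(dest)) {
    # mode = "wb" keeps the bz2 archive intact on Windows
    download.file(url, destfile = dest, mode = "wb")
  }
  invisible(dest)
}

storm_url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
# Assumed location; substitute any writable path
storm_file <- file.path(tempdir(), "StormData.csv.bz2")

# Uncomment to fetch and load the data:
# fetch_if_missing(storm_url, storm_file)
# df <- read.csv(storm_file, stringsAsFactors = FALSE)
```

Because the helper checks `file.exists()` first, re-running the report does not repeat the large download.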
Now let’s dive into the data to find which disasters happened most often and which ones caused the most damage to human health and the economy.
#which kind of event occurs the most- top 10
library(dplyr)
top10 <- df %>%
group_by(EVTYPE) %>%
summarise(cnt = n()) %>%
arrange(desc(cnt)) %>%
head(10)
top10$EVTYPE <- factor(top10$EVTYPE, levels = top10$EVTYPE[order(top10$cnt, decreasing = T)])
top10
## # A tibble: 10 x 2
## EVTYPE cnt
## <fctr> <int>
## 1 HAIL 288661
## 2 TSTM WIND 219940
## 3 THUNDERSTORM WIND 82563
## 4 TORNADO 60652
## 5 FLASH FLOOD 54277
## 6 FLOOD 25326
## 7 THUNDERSTORM WINDS 20843
## 8 HIGH WIND 20212
## 9 LIGHTNING 15754
## 10 HEAVY SNOW 15708
#plot it
p1 <- ggplot(top10, aes( x= EVTYPE, y = cnt)) +
geom_bar(stat = 'identity') +
theme(axis.text.x = element_text(angle = 90, hjust = 1))+
ggtitle('Top 10 most common types of events')
p1
Hail and thunderstorm wind (recorded as TSTM WIND) look to be the most common natural disasters by volume across the country. However, to examine harmful effects we need to look further than raw counts, at signals that indicate harm to human health. I will use fatalities and injuries combined as a proxy.
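The top-10 table lists thunderstorm wind under three spellings (TSTM WIND, THUNDERSTORM WIND, THUNDERSTORM WINDS). As a rough sketch only, and not a step the analysis itself performs, the labels could be folded together before counting. The helper name `normalise_evtype` is hypothetical, and the counts are copied from the table above:

```r
# Hypothetical helper: fold the thunderstorm-wind spellings into one label.
normalise_evtype <- function(x) {
  x <- toupper(trimws(x))
  x[grepl("^(TSTM|THUNDERSTORM) WINDS?$", x)] <- "THUNDERSTORM WIND"
  x
}

# Counts taken from the top-10 table printed above
counts <- c("HAIL" = 288661, "TSTM WIND" = 219940,
            "THUNDERSTORM WIND" = 82563, "THUNDERSTORM WINDS" = 20843)

# Sum the counts within each normalised label
merged <- tapply(counts, normalise_evtype(names(counts)), sum)
merged[order(merged, decreasing = TRUE)]
```

With these four rows, the merged thunderstorm wind total is 323,346, which is larger than the 288,661 hail events, so the ranking by frequency depends on how the labels are grouped.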
To do this, I need a new variable that combines fatalities and injuries, and a plot showing which events have the highest totals of this new fatality + injury variable.
#get top 10 disaster with the highest fatality and injury combined count
fat_inj_df <- df %>%
mutate(fatality_injury = FATALITIES + INJURIES) %>%
group_by(EVTYPE) %>%
summarise(fat_inj = sum(fatality_injury)) %>%
arrange(desc(fat_inj)) %>%
head(10)
fat_inj_df$EVTYPE <- factor(fat_inj_df$EVTYPE, levels = fat_inj_df$EVTYPE[order(fat_inj_df$fat_inj, decreasing = T)])
fat_inj_df
## # A tibble: 10 x 2
## EVTYPE fat_inj
## <fctr> <dbl>
## 1 TORNADO 96979
## 2 EXCESSIVE HEAT 8428
## 3 TSTM WIND 7461
## 4 FLOOD 7259
## 5 LIGHTNING 6046
## 6 HEAT 3037
## 7 FLASH FLOOD 2755
## 8 ICE STORM 2064
## 9 THUNDERSTORM WIND 1621
## 10 WINTER STORM 1527
## plot it
p2 <- ggplot(fat_inj_df, aes( x= EVTYPE, y = fat_inj)) +
geom_bar(stat = 'identity') +
theme(axis.text.x = element_text(angle = 90, hjust = 1))+
ggtitle('Top 10 Events With the Most Fatalities and Injuries Combined')
p2
It looks like Tornado caused more fatalities and injuries combined than any other kind of event, with close to 100K cases, far ahead of every other event type. The events with the second- and third-highest totals are excessive heat and thunderstorm wind (TSTM WIND), with about 8K and 7K cases respectively.
I will use the property and crop damage variables, “PROPDMG” and “CROPDMG”, combined to assess the total economic damage. First, the data needs cleaning so that all rows are in the same unit; I will use millions of dollars (M) as the unified unit for damage accounting.
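To make the unit conversion concrete, here is a minimal sketch of turning a damage value and its exponent code into millions of dollars. The function name `to_millions` and the lookup-table approach are illustrative assumptions; only the codes k, m and b are treated as meaningful, as in the analysis:

```r
# Hypothetical helper: convert (value, exponent code) pairs to millions of dollars.
# k = thousands, m = millions, b = billions; any other code yields NA.
to_millions <- function(value, exp_code) {
  mult <- c(k = 1e-3, m = 1, b = 1e3)
  value * unname(mult[tolower(exp_code)])
}

# 25 thousand, 2.5 million, 1.2 billion, and one unusable code
to_millions(c(25, 2.5, 1.2, 10), c("K", "m", "B", "?"))
```

The call gives 0.025, 2.5 and 1200 million for the first three entries and `NA` for the unrecognised code, so a thousand-dollar figure is divided by 1000 to be expressed in millions.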
#clean up "PROPDMG"
summary(df$PROPDMG)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 0.00 12.06 0.50 5000.00
unique(df$PROPDMGEXP)
## [1] "K" "M" "" "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-"
## [18] "1" "8"
#there are a lot of inconsistencies in the units;
#I will unify all units to millions and drop rows with nonsensical units
prop_dmg <- df %>%
select(PROPDMG, PROPDMGEXP, EVTYPE) %>%
mutate(PROPDMGEXP = tolower(PROPDMGEXP)) %>%
filter(PROPDMGEXP %in% c('k','m', 'b'))
crop_dmg <- df %>%
select(CROPDMG, CROPDMGEXP, EVTYPE) %>%
mutate(CROPDMGEXP = tolower(CROPDMGEXP)) %>%
filter(CROPDMGEXP %in% c('k','m', 'b'))
#unify property & crop damage units to millions
#'k' means thousands, so divide by 1000 (not 100) to get millions;
#the totals printed below were produced with a divisor of 100 and should be regenerated
prop_dmg <- prop_dmg %>%
  mutate(dmg = ifelse(PROPDMGEXP == 'k', PROPDMG/1000, ifelse(PROPDMGEXP == 'b', PROPDMG*1000, PROPDMG))) %>%
  select(dmg, EVTYPE)
crop_dmg <- crop_dmg %>%
  mutate(dmg = ifelse(CROPDMGEXP == 'k', CROPDMG/1000, ifelse(CROPDMGEXP == 'b', CROPDMG*1000, CROPDMG))) %>%
  select(dmg, EVTYPE)
total_dmg <- rbind(prop_dmg, crop_dmg)
total_dmg_plt <- total_dmg %>%
group_by(EVTYPE) %>%
summarise(total_dmg = sum(dmg)) %>%
arrange(desc(total_dmg)) %>%
head(10)
total_dmg_plt$EVTYPE <- factor(total_dmg_plt$EVTYPE, levels = total_dmg_plt$EVTYPE[order(total_dmg_plt$total_dmg, decreasing = T)])
total_dmg_plt
## # A tibble: 10 x 2
## EVTYPE total_dmg
## <fctr> <dbl>
## 1 FLOOD 159689.29
## 2 TORNADO 86719.33
## 3 HURRICANE/TYPHOON 71964.77
## 4 STORM SURGE 43491.04
## 5 FLASH FLOOD 31816.23
## 6 HAIL 30028.30
## 7 TSTM WIND 18012.58
## 8 DROUGHT 15239.19
## 9 HURRICANE 14717.77
## 10 THUNDERSTORM WIND 12362.77
## plot it
p3 <- ggplot(total_dmg_plt, aes( x= EVTYPE, y = total_dmg)) +
geom_bar(stat = 'identity') +
theme(axis.text.x = element_text(angle = 90, hjust = 1))+
ggtitle('Top 10 Events With the Most Economic Loss (million dollars)')
p3
According to the table above, Flood had the highest impact on economic loss, with more than $150B in losses, followed by Tornado (about $87B) and Hurricane/Typhoon (about $72B).