library(dplyr)
library(ggplot2)

Summary Result

From the storm dataset, I found that hail and thunderstorm wind (recorded under both the TSTM WIND and THUNDERSTORM WIND labels) are the most common event types by number of occurrences; however, they do not necessarily cause the greatest harm to human health or the economy.

I measured health impact with the fatalities and injuries attributed to each event type. Tornadoes caused the most fatalities and injuries combined, almost 100K in total, followed by excessive heat and thunderstorm wind (TSTM WIND) with about 8K and 7K respectively. This shows that the most common event type is not necessarily the deadliest.

Lastly, I measured economic impact as property damage plus crop damage, in millions of dollars. Flood caused the most combined damage, followed by Tornado and Hurricane/Typhoon.

In summary, tornadoes caused the most harm to human health, while floods caused the largest economic damage.

Data Processing

Let’s read in the data.

## Download the data if necessary
#raw <- download.file('https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2',
#                     destfile = '/Users/garymu/Dropbox/Coursera/DS/reproducible_research/final/data.csv.bz2')

df <- read.csv('/Users/garymu/Dropbox (Personal)/Coursera/DS/reproducible_research/final/data.csv.bz2', stringsAsFactors = F)

dim(df)
## [1] 902297     37

Looks like we have 37 variables and 902,297 observations.
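
The commented-out download step above could be made conditional, so the file is fetched only when it is missing. The following is a minimal sketch under assumptions: `fetch_if_missing` is a hypothetical helper, and the relative `data/StormData.csv.bz2` location is an assumed layout rather than the path used in this report.

```r
# Hypothetical helper: download a file only when it is not already on disk.
fetch_if_missing <- function(url, destfile) {
  if (!file.exists(destfile)) {
    # Create the target folder if needed, then download in binary mode.
    dir.create(dirname(destfile), recursive = TRUE, showWarnings = FALSE)
    download.file(url, destfile = destfile, mode = "wb")
  }
  invisible(destfile)
}

storm_url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
# Assumed project-relative location; adjust to your own layout.
storm_file <- file.path("data", "StormData.csv.bz2")

# Uncomment to run; read.csv() reads the .bz2 file directly, as in the call above.
# df <- read.csv(fetch_if_missing(storm_url, storm_file), stringsAsFactors = FALSE)
```

A relative path keeps the report runnable on another machine, which the absolute paths above do not.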

Now let’s dive into the data to find which disasters happened most often and which ones caused the most damage to human health and the economy.

EDA- explore which event type has the highest population health impact

#which kind of event occurs the most- top 10
library(dplyr)
top10 <- df %>% 
  group_by(EVTYPE) %>%
  summarise(cnt = n()) %>%
  arrange(desc(cnt)) %>%
  head(10)

top10$EVTYPE <- factor(top10$EVTYPE, levels = top10$EVTYPE[order(top10$cnt, decreasing = T)])

top10
## # A tibble: 10 x 2
##                EVTYPE    cnt
##                <fctr>  <int>
##  1               HAIL 288661
##  2          TSTM WIND 219940
##  3  THUNDERSTORM WIND  82563
##  4            TORNADO  60652
##  5        FLASH FLOOD  54277
##  6              FLOOD  25326
##  7 THUNDERSTORM WINDS  20843
##  8          HIGH WIND  20212
##  9          LIGHTNING  15754
## 10         HEAVY SNOW  15708
#plot it
p1 <- ggplot(top10, aes( x= EVTYPE, y = cnt)) +
                  geom_bar(stat = 'identity') + 
                  theme(axis.text.x = element_text(angle = 90, hjust = 1))+
                  ggtitle('Top 10 most common types of events')
p1

Hail and thunderstorm wind (split across the TSTM WIND, THUNDERSTORM WIND and THUNDERSTORM WINDS labels) look to be the most common event types by volume across the country. However, frequency alone does not show harm, so we need to look for signals of harm to human health. I will use fatalities and injuries combined as a proxy.
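
The top-10 table lists thunderstorm wind under three labels (TSTM WIND, THUNDERSTORM WIND, THUNDERSTORM WINDS), so the per-label counts understate it. As a rough sketch of how such labels could be folded together before counting, the example below uses a hypothetical helper, `normalise_evtype`, on toy rows; it is not part of the analysis in this report, and the toy counts are illustrative only.

```r
library(dplyr)

# Toy stand-in for df$EVTYPE; these rows are illustrative only.
toy <- data.frame(
  EVTYPE = c("TSTM WIND", "THUNDERSTORM WIND", "THUNDERSTORM WINDS",
             " Thunderstorm Wind", "HAIL", "HAIL"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: trim whitespace, upper-case, and map the
# thunderstorm-wind spellings to a single label.
normalise_evtype <- function(x) {
  x <- toupper(trimws(x))
  sub("^(TSTM|THUNDERSTORM) WINDS?$", "THUNDERSTORM WIND", x)
}

toy_counts <- toy %>%
  mutate(EVTYPE = normalise_evtype(EVTYPE)) %>%
  count(EVTYPE, sort = TRUE, name = "cnt")

toy_counts
```

The pattern here covers only the spellings visible in the table; the full dataset likely needs more rules, which would have to be checked against the actual labels.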

To build that proxy, I need a new variable that adds fatalities and injuries together, and then I plot which event types have the highest totals of this new variable.

Getting the top 10 event types with the highest tally of fatalities and injuries

#get top 10 disaster with the highest fatality and injury combined count
fat_inj_df <- df %>%
        mutate(fatality_injury = FATALITIES + INJURIES) %>%
        group_by(EVTYPE) %>%
        summarise(fat_inj = sum(fatality_injury)) %>%
        arrange(desc(fat_inj)) %>%
        head(10)

fat_inj_df$EVTYPE <- factor(fat_inj_df$EVTYPE, levels = fat_inj_df$EVTYPE[order(fat_inj_df$fat_inj, decreasing = T)])

fat_inj_df
## # A tibble: 10 x 2
##               EVTYPE fat_inj
##               <fctr>   <dbl>
##  1           TORNADO   96979
##  2    EXCESSIVE HEAT    8428
##  3         TSTM WIND    7461
##  4             FLOOD    7259
##  5         LIGHTNING    6046
##  6              HEAT    3037
##  7       FLASH FLOOD    2755
##  8         ICE STORM    2064
##  9 THUNDERSTORM WIND    1621
## 10      WINTER STORM    1527
## plot it
p2 <- ggplot(fat_inj_df, aes( x= EVTYPE, y = fat_inj)) +
                  geom_bar(stat = 'identity') + 
                  theme(axis.text.x = element_text(angle = 90, hjust = 1))+
                  ggtitle('Top 10 Events With the Most Fatalities and Injuries Combined')
p2

Tornadoes caused more fatalities and injuries combined than any other event type, with close to 100K cases, far ahead of everything else. Excessive heat and thunderstorm wind (TSTM WIND) come second and third, with about 8K and 7K cases respectively.
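
Because the combined tally weights a fatality and an injury equally, it can help to keep the two components visible next to the total. The sketch below shows one way to do that on toy rows; the numbers are made up for illustration and are not taken from the storm data.

```r
library(dplyr)

# Toy rows; the numbers are illustrative, not taken from the storm data.
toy <- data.frame(
  EVTYPE     = c("TORNADO", "TORNADO", "EXCESSIVE HEAT", "LIGHTNING"),
  FATALITIES = c(2, 0, 5, 1),
  INJURIES   = c(30, 12, 3, 4),
  stringsAsFactors = FALSE
)

# Sum each component per event type and keep the combined total alongside.
toy_harm <- toy %>%
  group_by(EVTYPE) %>%
  summarise(
    fatalities = sum(FATALITIES),
    injuries   = sum(INJURIES),
    fat_inj    = sum(FATALITIES + INJURIES)
  ) %>%
  arrange(desc(fat_inj))

toy_harm
```

With the components side by side, it is easy to see whether a ranking is driven mainly by injuries or by deaths.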

EDA- explore which event type has the greatest economic consequences

I will combine the property and crop damage variables, “PROPDMG” and “CROPDMG”, to assess total economic damage. First the data need cleaning so that every row uses the same unit; I use millions of dollars (M) as the common unit for damage accounting.
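
The unit conversion can be sketched with a lookup vector of multipliers. One thousand dollars is 0.001 million and one billion is 1,000 million, so the codes map to 1e-3, 1 and 1e3. The names `to_millions` and `toy` are illustrative, and the toy amounts are not taken from the storm data.

```r
# Multipliers that convert each unit code to millions of dollars.
to_millions <- c(k = 1e-3, m = 1, b = 1e3)

# Toy rows; the amounts are illustrative, not taken from the storm data.
toy <- data.frame(
  PROPDMG    = c(25, 2.5, 1.2, 10),
  PROPDMGEXP = c("K", "m", "B", "?"),
  stringsAsFactors = FALSE
)

# Look up the multiplier by lower-cased code; unrecognised codes become NA.
toy$dmg <- toy$PROPDMG * unname(to_millions[tolower(toy$PROPDMGEXP)])

toy
```

Rows with an unrecognised code end up as `NA` rather than being silently treated as millions, so they can be counted before they are dropped.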

#clean up "PROPDMG"
summary(df$PROPDMG)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    0.00   12.06    0.50 5000.00
unique(df$PROPDMGEXP) 
##  [1] "K" "M" ""  "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-"
## [18] "1" "8"
#the unit codes are inconsistent
#I will convert all amounts to millions and drop rows whose unit code is not k, m or b

prop_dmg <- df %>%
        select(PROPDMG, PROPDMGEXP, EVTYPE) %>%
        mutate(PROPDMGEXP = tolower(PROPDMGEXP)) %>%
        filter(PROPDMGEXP %in% c('k','m', 'b'))

crop_dmg <- df %>%
        select(CROPDMG, CROPDMGEXP, EVTYPE) %>%
        mutate(CROPDMGEXP = tolower(CROPDMGEXP)) %>%
        filter(CROPDMGEXP %in% c('k','m', 'b'))

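#unify property & crop damage units to millions of dollars
#'k' = thousands, so divide by 1000; 'b' = billions, so multiply by 1000
#NOTE: the totals printed below came from a run that divided 'k' rows by 100,
#so thousand-dollar entries are overstated tenfold there; re-run to refresh them
prop_dmg <- prop_dmg %>% 
    mutate(dmg = ifelse(PROPDMGEXP =='k', PROPDMG/1000, ifelse(PROPDMGEXP =='b', PROPDMG*1000, PROPDMG))) %>%
    select(dmg, EVTYPE)

crop_dmg <- crop_dmg %>% 
    mutate( dmg= ifelse(CROPDMGEXP =='k', CROPDMG/1000, ifelse(CROPDMGEXP =='b', CROPDMG*1000, CROPDMG)))%>%
    select(dmg, EVTYPE)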


total_dmg <- rbind(prop_dmg, crop_dmg)

total_dmg_plt <- total_dmg %>%
                    group_by(EVTYPE) %>%
                    summarise(total_dmg = sum(dmg)) %>%
                    arrange(desc(total_dmg)) %>%
                    head(10)

total_dmg_plt$EVTYPE <- factor(total_dmg_plt$EVTYPE, levels = total_dmg_plt$EVTYPE[order(total_dmg_plt$total_dmg, decreasing = T)])

total_dmg_plt
## # A tibble: 10 x 2
##               EVTYPE total_dmg
##               <fctr>     <dbl>
##  1             FLOOD 159689.29
##  2           TORNADO  86719.33
##  3 HURRICANE/TYPHOON  71964.77
##  4       STORM SURGE  43491.04
##  5       FLASH FLOOD  31816.23
##  6              HAIL  30028.30
##  7         TSTM WIND  18012.58
##  8           DROUGHT  15239.19
##  9         HURRICANE  14717.77
## 10 THUNDERSTORM WIND  12362.77
## plot it
p3 <- ggplot(total_dmg_plt, aes( x= EVTYPE, y = total_dmg)) +
                  geom_bar(stat = 'identity') + 
                  theme(axis.text.x = element_text(angle = 90, hjust = 1))+
                  ggtitle('Top 10 Events With the Most Economic Loss (million dollars)')
p3

In the table above, Flood had the highest economic loss at more than $150B, followed by Tornado ($87B) and Hurricane/Typhoon ($72B).