Synopsis

This project is the final requirement of the Reproducible Research course offered on coursera.org, one of nine courses in the Data Science Specialization of Johns Hopkins University. We aim to describe the effect of severe weather events, such as storms, on public health and the economy in the United States. Specifically, the analysis answers two questions: (1) across the United States, which types of events are most harmful with respect to population health? and (2) which types of events have the greatest economic consequences? To answer these questions, we used the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, which tracks major storms and weather events in the U.S. The database starts in 1950 and ends in November 2011.

In the analysis, we found that tornadoes are the most harmful weather events across the United States, with 96,979 total casualties (fatalities plus injuries), followed by excessive heat and thunderstorm wind. In terms of economic consequences, flash floods have the greatest impact, with a combined property and crop damage cost of 68,203.79 billion dollars, followed by thunderstorm winds and tornadoes.

Data Processing

We obtained the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database from the course web site. It contains the characteristics of major storms and weather events in the United States from 1950 to November 2011, including when and where they occurred, as well as estimates of any fatalities, injuries, and property damage.

Documentation for the database is available from:

* National Weather Service
* National Climatic Data Center Storm Events

We use the following packages to process and analyze the data:

library(dplyr)
library(ggplot2)
library(cowplot)

Reading the Data

Since the data come as a comma-separated-value file compressed with the bzip2 algorithm to reduce its size, we read it with the read.csv() function, wrapping the file name in bzfile() to indicate that the file is bzip2-compressed.

storm_data<-read.csv(bzfile("repdata%2Fdata%2FStormData.csv.bz2"))

We can see that there are 902,297 observations with 37 variables.

dim(storm_data)
## [1] 902297     37
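
The way read.csv() and bzfile() work together can be tried on a small throwaway file before loading the full database. The sketch below is only an illustration: the temporary file and the two-row data frame are made up and are not part of the storm data.

```r
# Toy round trip: write a small CSV through a bzip2 connection, then read it back.
tmp <- tempfile(fileext = ".csv.bz2")   # hypothetical throwaway file

toy <- data.frame(EVTYPE = c("TORNADO", "HAIL"),
                  FATALITIES = c(1, 0))

con <- bzfile(tmp, open = "w")          # open a compressing connection for writing
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() accepts the bzfile() connection and decompresses on the fly
toy_back <- read.csv(bzfile(tmp))
dim(toy_back)
```

The same pattern applies to the real file: only the path passed to bzfile() changes.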

Subsetting the Data

Since the project will focus only on the population health and economic consequences of the major storm and other severe weather events, we will make a subset from the Storm Data based on the available variables related to the analysis.

names(storm_data)
##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"

We need only the EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG and CROPDMGEXP columns (or variables). EVTYPE is the event type. FATALITIES and INJURIES will be used to investigate population health consequences, and the remaining four variables, which measure damage, will be used to analyze the effect on the US economy.

sub_storm_data<-select(storm_data, c(EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG,CROPDMGEXP))

head(sub_storm_data)
##    EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 TORNADO          0       15    25.0          K       0           
## 2 TORNADO          0        0     2.5          K       0           
## 3 TORNADO          0        2    25.0          K       0           
## 4 TORNADO          0        2     2.5          K       0           
## 5 TORNADO          0        2     2.5          K       0           
## 6 TORNADO          0        6     2.5          K       0

Further Data Transformation

The subset is further transformed to answer the 2 key questions.

  1. Population Health

The fatalities and injuries are summed per event type.

harm_health<-group_by(sub_storm_data, EVTYPE) %>% summarise(Fatalities=sum(FATALITIES), Injuries=sum(INJURIES))

h1<-arrange(harm_health,desc(Fatalities))[,-3]
h2<-arrange(harm_health,desc(Injuries))[,-2]
head(h1)
## # A tibble: 6 x 2
##           EVTYPE Fatalities
##           <fctr>      <dbl>
## 1        TORNADO       5633
## 2 EXCESSIVE HEAT       1903
## 3    FLASH FLOOD        978
## 4           HEAT        937
## 5      LIGHTNING        816
## 6      TSTM WIND        504
head(h2)
## # A tibble: 6 x 2
##           EVTYPE Injuries
##           <fctr>    <dbl>
## 1        TORNADO    91346
## 2      TSTM WIND     6957
## 3          FLOOD     6789
## 4 EXCESSIVE HEAT     6525
## 5      LIGHTNING     5230
## 6           HEAT     2100
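
The per-event summation can be illustrated on a tiny made-up data set. This is a minimal sketch in base R (so it runs without extra packages), equivalent in spirit to the group_by() and summarise() steps; the numbers are invented and the names toy, toy_sum and toy_ranked are hypothetical.

```r
# Toy data (made up for illustration; not taken from the storm database)
toy <- data.frame(
    EVTYPE     = c("TORNADO", "HAIL", "TORNADO", "HAIL", "FLOOD"),
    FATALITIES = c(2, 0, 2, 0, 3),
    INJURIES   = c(10, 1, 5, 0, 2)
)

# Sum both columns within each event type
toy_sum <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = toy, FUN = sum)

# Rank event types by total fatalities, largest first
toy_ranked <- toy_sum[order(-toy_sum$FATALITIES), ]
toy_ranked
```

Each event type ends up on one row holding its totals, which is the shape that the h1 and h2 tables are built from.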

  2. Economic Consequences

The actual cost of property damage (PROPDMG) and crop damage (CROPDMG) per event type is recomputed. The columns ending in EXP hold the magnitude of each damage figure, expressed as a letter code such as h/H = hundreds, k/K = thousands, m/M = millions and b/B = billions.

The first step is to convert these letters to actual values and then multiply them by the respective columns (PROPDMG or CROPDMG).

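# Map one damage-exponent code to its numeric multiplier.
# Note: as.numeric() on a factor returns the level code rather than the digit,
# so the digit branch assumes x is text; use as.character(x) for factor columns.
exp_fun <- function(x) {
    if (x %in% c("h", "H"))                # hundreds
        return(100)
    else if (x %in% c("k", "K"))           # thousands
        return(1000)
    else if (x %in% c("m", "M"))           # millions
        return(1e+06)
    else if (x %in% c("b", "B"))           # billions
        return(1e+09)
    else if (!is.na(as.numeric(x)))        # a digit d is read as 10^d
        return(10^as.numeric(x))
    else if (x %in% c("", "-", "?", "+"))  # blank or symbol: no scaling
        return(1)
    else {
        stop("Invalid value.")
    }
}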
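
As an alternative illustration of the same mapping, the codes can be converted with a named lookup vector applied to a whole column at once. This is only a sketch under stated assumptions, not the function used in this analysis: exp_lookup and exp_multiplier() are hypothetical names, the codes are assumed to be text (factors are converted with as.character()), and the toy rows are made up apart from the first, which mirrors the 25.0 / K entry in the first row of the data.

```r
# Named lookup table for the letter codes (same multipliers as exp_fun())
exp_lookup <- c(h = 1e2, k = 1e3, m = 1e6, b = 1e9)

# Hypothetical vectorized helper; works on a whole column of codes
exp_multiplier <- function(codes) {
    codes <- tolower(as.character(codes))
    out <- unname(exp_lookup[codes])            # NA where the code is not a letter
    digit <- grepl("^[0-9]$", codes)
    out[digit] <- 10^as.numeric(codes[digit])   # a digit d is read as 10^d
    out[codes %in% c("", "-", "?", "+")] <- 1   # blank or symbol: no scaling
    if (anyNA(out)) stop("Invalid value.")
    out
}

toy <- data.frame(PROPDMG    = c(25, 2.5, 3, 1.5),
                  PROPDMGEXP = c("K", "M", "", "B"))

# Damage in dollars = reported figure times the multiplier for its code
toy$prop_dollars <- toy$PROPDMG * exp_multiplier(toy$PROPDMGEXP)
toy$prop_dollars
```

For the first row, 25 with code K becomes 25 × 1,000 = 25,000 dollars.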

Now we use the exp_fun() function to replace all letters and other characters with their respective multipliers, then recalculate the damage variables.

# express property damage in billions and crop damage in millions of dollars
sub1 <- sub_storm_data %>%
    mutate(propExp = sapply(sub_storm_data$PROPDMGEXP, FUN = exp_fun)) %>%
    mutate(prop_damage = propExp * PROPDMG / 1e+9) %>%
    mutate(cropExp = sapply(sub_storm_data$CROPDMGEXP, FUN = exp_fun)) %>%
    mutate(crop_damage = cropExp * CROPDMG / 1e+6) %>%
    select(-c(3:7))

Now that the damage cost has been recalculated to actual values, we summarise the total property and crop damage cost per event type.

econ_con<-group_by(sub1, EVTYPE) %>% summarise(prop_damage2=sum(prop_damage), crop_damage2=sum(crop_damage)) %>% arrange(desc(prop_damage2))

ec1<-arrange(econ_con,desc(prop_damage2))[,-3]
ec2<-arrange(econ_con,desc(crop_damage2))[,-2]

head(ec1) # property damage, in billions of dollars
## # A tibble: 6 x 2
##               EVTYPE prop_damage2
##               <fctr>        <dbl>
## 1        FLASH FLOOD   68202.3670
## 2 THUNDERSTORM WINDS   20865.3168
## 3            TORNADO    1078.9511
## 4               HAIL     315.7558
## 5          LIGHTNING     172.9433
## 6              FLOOD     144.6577
head(ec2) # crop damage, in millions of dollars
## # A tibble: 6 x 2
##        EVTYPE crop_damage2
##        <fctr>        <dbl>
## 1     DROUGHT    13972.566
## 2       FLOOD     5661.968
## 3 RIVER FLOOD     5029.459
## 4   ICE STORM     5022.114
## 5        HAIL     3025.974
## 6   HURRICANE     2741.910

Now, we are ready to visualize the data.

Results

Effect on Population Health

Below are the five most harmful severe weather events with respect to population health. Tornadoes rank first in both fatalities and injuries, with 5,633 and 91,346 cases, respectively.

h1[1:5,] # worst 5 in terms of fatalities
## # A tibble: 5 x 2
##           EVTYPE Fatalities
##           <fctr>      <dbl>
## 1        TORNADO       5633
## 2 EXCESSIVE HEAT       1903
## 3    FLASH FLOOD        978
## 4           HEAT        937
## 5      LIGHTNING        816
h2[1:5,] # worst 5 in terms of Injuries
## # A tibble: 5 x 2
##           EVTYPE Injuries
##           <fctr>    <dbl>
## 1        TORNADO    91346
## 2      TSTM WIND     6957
## 3          FLOOD     6789
## 4 EXCESSIVE HEAT     6525
## 5      LIGHTNING     5230

The figure below shows the top 10 most harmful weather events with respect to population health.

theme_set(theme_gray())
p1 <- ggplot(data = h1[1:10, ], aes(x = reorder(EVTYPE, Fatalities), y = Fatalities)) +
    geom_bar(fill = "blue1", stat = "identity") +
    ylab("Total number of fatalities") +
    xlab("Event type") +
    ggtitle("Health impact of weather events in US - Top 10") +
    theme(legend.position = "none") +
    scale_y_continuous(expand = c(0, 0), limits = c(0, max(h1$Fatalities) + 1000)) +
    coord_flip() +
    geom_text(aes(label = Fatalities), position = position_dodge(width = 0.9),
              vjust = .5, hjust = -.1, cex = 3)

p2 <- ggplot(data = h2[1:10, ], aes(x = reorder(EVTYPE, Injuries), y = Injuries)) +
    geom_bar(fill = "green1", stat = "identity") +
    ylab("Total number of injuries") +
    xlab("Event type") +
    theme(legend.position = "none") +
    scale_y_continuous(expand = c(0, 0), limits = c(0, max(h2$Injuries) + 10000)) +
    coord_flip() +
    geom_text(aes(label = Injuries), position = position_dodge(width = 0.9),
              vjust = .5, hjust = -.1, cex = 3)

plot_grid(p1, p2, ncol=1, align="v")

We can also look at total casualties (the sum of fatalities and injuries) to see which events harm population health the most.

mrg_harm<-mutate(harm_health, mrg_h=Fatalities+Injuries) %>% select(-(2:3)) %>% arrange(desc(mrg_h))

head(mrg_harm,5)
## # A tibble: 5 x 2
##           EVTYPE mrg_h
##           <fctr> <dbl>
## 1        TORNADO 96979
## 2 EXCESSIVE HEAT  8428
## 3      TSTM WIND  7461
## 4          FLOOD  7259
## 5      LIGHTNING  6046

As expected, the result above shows that tornadoes remain the weather event most harmful to population health in terms of total casualties.
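
The casualty totals can be cross-checked by hand from the fatalities and injuries tables shown earlier. The sketch below simply re-adds those published figures; the vector names are hypothetical and nothing is read from the database.

```r
# Figures copied from the fatalities and injuries tables above
fatalities <- c(TORNADO = 5633, "EXCESSIVE HEAT" = 1903,
                "TSTM WIND" = 504, LIGHTNING = 816)
injuries   <- c(TORNADO = 91346, "EXCESSIVE HEAT" = 6525,
                "TSTM WIND" = 6957, LIGHTNING = 5230)

# Total casualties = fatalities + injuries, per event type
casualties <- fatalities + injuries
casualties

# How many times larger the tornado total is than the runner-up
ratio <- unname(casualties["TORNADO"] / casualties["EXCESSIVE HEAT"])
round(ratio, 1)
```

The sums reproduce the mrg_h column (96,979 for tornadoes and 8,428 for excessive heat), so the tornado total is roughly 11.5 times that of the next event type.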

Economic Consequences

Below are the five weather events with the greatest economic consequences in the US, in terms of the value of property and crop damage.

ec1[1:5,] # worst 5 in terms of property damage
## # A tibble: 5 x 2
##               EVTYPE prop_damage2
##               <fctr>        <dbl>
## 1        FLASH FLOOD   68202.3670
## 2 THUNDERSTORM WINDS   20865.3168
## 3            TORNADO    1078.9511
## 4               HAIL     315.7558
## 5          LIGHTNING     172.9433
ec2[1:5,] # worst 5 in terms of crop damage
## # A tibble: 5 x 2
##        EVTYPE crop_damage2
##        <fctr>        <dbl>
## 1     DROUGHT    13972.566
## 2       FLOOD     5661.968
## 3 RIVER FLOOD     5029.459
## 4   ICE STORM     5022.114
## 5        HAIL     3025.974

The figure below shows the top 10 weather events with the greatest economic consequences in terms of property and crop damage. For property damage, flash floods rank first, with damage worth 68,202.37 billion dollars, while drought causes the greatest crop damage, worth 13,972.57 million dollars.

p1 <- ggplot(data = ec1[1:10, ], aes(x = reorder(EVTYPE, prop_damage2), y = prop_damage2)) +
    geom_bar(fill = "burlywood4", stat = "identity") +
    ylab("Property Damage (Billion $)") +
    xlab("Event type") +
    ggtitle("Economic Consequences in terms of weather events in US - Top 10") +
    theme(legend.position = "none") +
    scale_y_continuous(expand = c(0, 0), limits = c(0, max(ec1$prop_damage2) + 10000)) +
    coord_flip() +
    geom_text(aes(label = round(prop_damage2, 2)), position = position_dodge(width = 0.9),
              vjust = .5, hjust = -.1, cex = 3)

p2 <- ggplot(data = ec2[1:10, ], aes(x = reorder(EVTYPE, crop_damage2), y = crop_damage2)) +
    geom_bar(fill = "chartreuse3", stat = "identity") +
    ylab("Crop Damage (Million $)") +
    xlab("Event type") +
    theme(legend.position = "none") +
    scale_y_continuous(expand = c(0, 0), limits = c(0, max(ec2$crop_damage2) + 2000)) +
    coord_flip() +
    geom_text(aes(label = round(crop_damage2, 2)), position = position_dodge(width = 0.9),
              vjust = .5, hjust = -.1, cex = 3)

plot_grid(p1, p2, ncol=1, align="v")

We can also see which extreme events cause the most damage to property and crops combined.

# Property and crop damage were computed above in billions and millions of dollars, respectively, so we convert both back to dollars before summing, then express the total in billions again.

mrg_econ<-mutate(econ_con, mrg_e=(prop_damage2*1e+9+crop_damage2*1e+6)/1e+9) %>% select(-(2:3)) %>% arrange(desc(mrg_e))

head(mrg_econ,5)
## # A tibble: 5 x 2
##               EVTYPE      mrg_e
##               <fctr>      <dbl>
## 1        FLASH FLOOD 68203.7883
## 2 THUNDERSTORM WINDS 20865.5075
## 3            TORNADO  1079.3662
## 4               HAIL   318.7818
## 5          LIGHTNING   172.9554

As expected, the weather event with the greatest impact is still flash flood, with damage worth 68,203.79 billion dollars.
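
The unit conversion behind the combined figure can be checked on a single event type. The sketch below uses the hail figures from the property and crop damage tables above, since hail appears in both; the variable names are hypothetical.

```r
# Hail figures copied from the tables above
prop_billion <- 315.7558   # property damage, in billions of dollars
crop_million <- 3025.974   # crop damage, in millions of dollars

# Same formula as mrg_e: convert both to dollars, add, convert back to billions
combined_billion <- (prop_billion * 1e9 + crop_million * 1e6) / 1e9
round(combined_billion, 4)

# Equivalent shortcut: one billion is 1,000 million
shortcut_billion <- prop_billion + crop_million / 1000
```

The result agrees with the 318.7818 reported for hail in the combined table, and the shortcut gives the same value without going through raw dollars.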