This project is the final requirement of the Reproducible Research course offered on coursera.org, one of nine courses in the Data Science Specialization of Johns Hopkins University. Here we aim to describe the effect of severe weather events, such as storms, on public health and the economy in the United States. Specifically, the analysis answers two questions: 1.) across the United States, which types of events are most harmful with respect to population health? and 2.) which types of events have the greatest economic consequences? To answer these questions, we used the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, which tracks major storm and weather events in the U.S. The database starts in the year 1950 and ends in November 2011.
In the analysis, we found that tornadoes are the most harmful weather events across the United States, with total casualties of 96,979 (fatalities + injuries), followed by excessive heat and thunderstorm wind. In terms of economic consequences, flash floods have the greatest impact, with a combined property and crop damage cost of 68,203.79 billion dollars, followed by thunderstorm winds and tornadoes.
We obtained the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database from the course web site. It contains the characteristics of major storms and weather events in the United States, including when and where they occurred, as well as estimates of any fatalities, injuries, and property damage, from 1950 to November 2011.
Documentation for the database is available from:
* National Weather Service
* National Climatic Data Center Storm Events
We will be using the following packages to process and analyze the data:
library(dplyr)
library(ggplot2)
library(cowplot)
Since the data is a comma-separated-value file compressed with the bzip2 algorithm to reduce its size, we read it with the read.csv() function, wrapping the file name in bzfile() to indicate that it is bzip2-compressed.
storm_data<-read.csv(bzfile("repdata%2Fdata%2FStormData.csv.bz2"))
We can see that there are 902,297 observations with 37 variables.
dim(storm_data)
## [1] 902297 37
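The compressed read above can be tried on a small scale without downloading the full file. The following is a minimal sketch, assuming a made-up two-row data frame (`toy`) and a temporary file path; neither is part of the original analysis.

```r
# Toy demonstration: write a small CSV through a bzip2 connection, then read it back.
tmp <- tempfile(fileext = ".csv.bz2")   # hypothetical temporary file

toy <- data.frame(EVTYPE = c("TORNADO", "HAIL"),
                  FATALITIES = c(0, 1))

con <- bzfile(tmp, open = "w")          # bzip2-compressed connection opened for writing
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() accepts the bzfile() connection and decompresses while reading
toy_back <- read.csv(bzfile(tmp), stringsAsFactors = FALSE)
dim(toy_back)
```

The same pattern applies to the storm data file: only the path passed to bzfile() changes.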
Since the project will focus only on the population health and economic consequences of the major storm and other severe weather events, we will make a subset from the Storm Data based on the available variables related to the analysis.
names(storm_data)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
We need only the EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG and CROPDMGEXP columns (or variables). EVTYPE identifies the event type. FATALITIES and INJURIES will be used to investigate population health consequences, and the remaining variables, which measure damage, will be used to analyze the effect on the US economy.
sub_storm_data<-select(storm_data, c(EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG,CROPDMGEXP))
head(sub_storm_data)
## EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 TORNADO 0 15 25.0 K 0
## 2 TORNADO 0 0 2.5 K 0
## 3 TORNADO 0 2 25.0 K 0
## 4 TORNADO 0 2 2.5 K 0
## 5 TORNADO 0 2 2.5 K 0
## 6 TORNADO 0 6 2.5 K 0
The subset is further transformed to answer the 2 key questions.
The fatalities and injuries are summed per event type.
harm_health<-group_by(sub_storm_data, EVTYPE) %>% summarise(Fatalities=sum(FATALITIES), Injuries=sum(INJURIES))
h1<-arrange(harm_health,desc(Fatalities))[,-3]
h2<-arrange(harm_health,desc(Injuries))[,-2]
head(h1)
## # A tibble: 6 x 2
## EVTYPE Fatalities
## <fctr> <dbl>
## 1 TORNADO 5633
## 2 EXCESSIVE HEAT 1903
## 3 FLASH FLOOD 978
## 4 HEAT 937
## 5 LIGHTNING 816
## 6 TSTM WIND 504
head(h2)
## # A tibble: 6 x 2
## EVTYPE Injuries
## <fctr> <dbl>
## 1 TORNADO 91346
## 2 TSTM WIND 6957
## 3 FLOOD 6789
## 4 EXCESSIVE HEAT 6525
## 5 LIGHTNING 5230
## 6 HEAT 2100
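The per-event totals above follow a group-then-sum pattern. As a minimal sketch of the same idea in base R, assuming a small made-up data frame (`toy`) rather than the real storm data:

```r
# Made-up toy data; the values are illustrative only.
toy <- data.frame(
  EVTYPE     = c("TORNADO", "HEAT", "TORNADO", "HEAT", "HAIL"),
  FATALITIES = c(2, 1, 3, 0, 0),
  INJURIES   = c(10, 0, 5, 4, 1)
)

# Sum both columns within each event type
totals <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = toy, FUN = sum)

# Sort from the most to the fewest fatalities
by_fatalities <- totals[order(-totals$FATALITIES), ]
by_fatalities
```

Here aggregate() plays the role of group_by() plus summarise(), and order() plays the role of arrange(desc(...)).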
The actual costs of property damage (PROPDMG) and crop damage (CROPDMG) per event type are recomputed. The columns ending in EXP contain the magnitude of the damage values, expressed as letters such as h/H = hundreds, k/K = thousands, m/M = millions and b/B = billions.
The first step is to convert these letters to actual values, then multiply them by the respective columns (PROPDMG or CROPDMG).
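# map one exponent code to its multiplier
exp_fun <- function(x) {
  if (x %in% c("h", "H"))
    return(100)                    # hundreds
  else if (x %in% c("k", "K"))
    return(1000)                   # thousands
  else if (x %in% c("m", "M"))
    return(1e+06)                  # millions
  else if (x %in% c("b", "B"))
    return(1e+09)                  # billions
  else if (!is.na(as.numeric(x)))
    return(10^as.numeric(x))       # numeric code used as a power of ten
  else if (x %in% c("", "-", "?", "+"))
    return(1)                      # blank or symbol: leave the value unscaled
  else {
    stop("Invalid value.")
  }
}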
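One detail worth checking in exp_fun(): the tibble output in this report shows EVTYPE as `<fctr>`, which suggests text columns were read as factors. If the EXP columns are factors too, as.numeric(x) returns the internal level code rather than the digit shown, so the multipliers may differ from what is intended. The following is a minimal sketch of a vectorized lookup that converts to character first; `exp_multiplier` is a hypothetical helper, not part of the original analysis, and unlike exp_fun() it returns NA for unrecognized codes instead of stopping.

```r
# Hypothetical vectorized alternative to exp_fun(); not the author's code.
exp_multiplier <- function(x) {
  x <- toupper(as.character(x))            # factor -> label text, not level code
  out <- rep(NA_real_, length(x))          # unrecognized codes stay NA
  out[x %in% c("", "-", "?", "+")] <- 1    # blank or symbol: no scaling
  letter_map <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)
  is_letter <- x %in% names(letter_map)
  out[is_letter] <- unname(letter_map[x[is_letter]])
  is_digit <- grepl("^[0-9]$", x)          # single digit: power of ten
  out[is_digit] <- 10^as.numeric(x[is_digit])
  out
}

f <- factor("5")
as.numeric(f)                  # 1: the level code, not the digit
as.numeric(as.character(f))    # 5: the digit itself

mult <- exp_multiplier(factor(c("", "K", "5", "B")))
mult
```

Because the helper is vectorized, it can be applied to a whole column at once instead of through sapply().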
Now we use the exp_fun() function to replace all letters and other characters with their respective multipliers, then recalculate the damage variables.
# express property damage in billions and crop damage in millions of dollars
sub1 <- mutate(sub_storm_data, propExp = sapply(sub_storm_data$PROPDMGEXP, FUN = exp_fun)) %>%
  mutate(prop_damage = propExp * PROPDMG / 1e+9) %>%
  mutate(cropExp = sapply(sub_storm_data$CROPDMGEXP, FUN = exp_fun)) %>%
  mutate(crop_damage = cropExp * CROPDMG / 1e+6) %>%
  select(-c(3:7))
Now that the damage costs are recalculated to actual values, we create a summary of the total property and crop damage cost per type of event.
econ_con<-group_by(sub1, EVTYPE) %>% summarise(prop_damage2=sum(prop_damage), crop_damage2=sum(crop_damage)) %>% arrange(desc(prop_damage2))
ec1<-arrange(econ_con,desc(prop_damage2))[,-3]
ec2<-arrange(econ_con,desc(crop_damage2))[,-2]
head(ec1) ## expressed in billions of dollars
## # A tibble: 6 x 2
## EVTYPE prop_damage2
## <fctr> <dbl>
## 1 FLASH FLOOD 68202.3670
## 2 THUNDERSTORM WINDS 20865.3168
## 3 TORNADO 1078.9511
## 4 HAIL 315.7558
## 5 LIGHTNING 172.9433
## 6 FLOOD 144.6577
head(ec2) ## expressed in millions of dollars
## # A tibble: 6 x 2
## EVTYPE crop_damage2
## <fctr> <dbl>
## 1 DROUGHT 13972.566
## 2 FLOOD 5661.968
## 3 RIVER FLOOD 5029.459
## 4 ICE STORM 5022.114
## 5 HAIL 3025.974
## 6 HURRICANE 2741.910
Now, we are ready to visualize the data.
Below are the top 5 most harmful severe weather events with respect to population health. Tornadoes rank first in both fatalities and injuries, with 5,633 and 91,346 cases, respectively.
h1[1:5,] # worst 5 in terms of fatalities
## # A tibble: 5 x 2
## EVTYPE Fatalities
## <fctr> <dbl>
## 1 TORNADO 5633
## 2 EXCESSIVE HEAT 1903
## 3 FLASH FLOOD 978
## 4 HEAT 937
## 5 LIGHTNING 816
h2[1:5,] # worst 5 in terms of Injuries
## # A tibble: 5 x 2
## EVTYPE Injuries
## <fctr> <dbl>
## 1 TORNADO 91346
## 2 TSTM WIND 6957
## 3 FLOOD 6789
## 4 EXCESSIVE HEAT 6525
## 5 LIGHTNING 5230
The figure below shows the top 10 weather events most harmful to population health.
theme_set(theme_gray())
p1 <- ggplot(data=h1[1:10,], aes(x=reorder(EVTYPE, Fatalities), y=Fatalities)) + geom_bar(fill="blue1",stat="identity") + ylab("Total number of fatalities") + xlab("Event type") + ggtitle("Health impact of weather events in US - Top 10") + theme(legend.position="none")+scale_y_continuous(expand = c(0, 0), limits = c(0,max(h1$Fatalities)+1000)) + coord_flip()+geom_text(aes(label=Fatalities), position=position_dodge(width=0.9), vjust=.5, hjust=-.1, cex=3)
p2 <- ggplot(data=h2[1:10,], aes(x=reorder(EVTYPE, Injuries), y=Injuries)) + geom_bar(fill="green1",stat="identity") + ylab("Total number of injuries") + xlab("Event type") + theme(legend.position="none")+scale_y_continuous(expand = c(0, 0), limits = c(0,max(h2$Injuries)+10000)) + coord_flip()+geom_text(aes(label=Injuries), position=position_dodge(width=0.9), vjust=.5, hjust=-.1, cex=3)
plot_grid(p1, p2, ncol=1, align="v")
We can also investigate the total casualties (the sum of fatalities and injuries) to see which events harm population health the most.
mrg_harm<-mutate(harm_health, mrg_h=Fatalities+Injuries) %>% select(-(2:3)) %>% arrange(desc(mrg_h))
head(mrg_harm,5)
## # A tibble: 5 x 2
## EVTYPE mrg_h
## <fctr> <dbl>
## 1 TORNADO 96979
## 2 EXCESSIVE HEAT 8428
## 3 TSTM WIND 7461
## 4 FLOOD 7259
## 5 LIGHTNING 6046
As expected, the result above shows that tornadoes are still the most harmful weather events to population health in terms of total casualties.
Below are the top 5 weather events with the greatest economic consequences in the US in terms of the value of property and crop damage.
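The casualty totals can be cross-checked by hand from the fatality and injury counts reported earlier. A minimal sketch, using only figures that appear in the tables of this report (the vector names are illustrative):

```r
# Counts copied from the fatality and injury tables in this report
fatalities <- c("TORNADO" = 5633, "EXCESSIVE HEAT" = 1903,
                "TSTM WIND" = 504, "LIGHTNING" = 816)
injuries   <- c("TORNADO" = 91346, "EXCESSIVE HEAT" = 6525,
                "TSTM WIND" = 6957, "LIGHTNING" = 5230)

# Total casualties = fatalities + injuries, matched by position
casualties <- fatalities + injuries
sort(casualties, decreasing = TRUE)
```

The sums agree with the mrg_h column: 96,979 for tornado, 8,428 for excessive heat, 7,461 for TSTM wind and 6,046 for lightning.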
ec1[1:5,] # worst 5 in terms of property damage
## # A tibble: 5 x 2
## EVTYPE prop_damage2
## <fctr> <dbl>
## 1 FLASH FLOOD 68202.3670
## 2 THUNDERSTORM WINDS 20865.3168
## 3 TORNADO 1078.9511
## 4 HAIL 315.7558
## 5 LIGHTNING 172.9433
ec2[1:5,] # worst 5 in terms of crop damage
## # A tibble: 5 x 2
## EVTYPE crop_damage2
## <fctr> <dbl>
## 1 DROUGHT 13972.566
## 2 FLOOD 5661.968
## 3 RIVER FLOOD 5029.459
## 4 ICE STORM 5022.114
## 5 HAIL 3025.974
The figure below shows the top 10 weather events with the greatest economic consequences in terms of property and crop damage. In terms of property damage, flash floods rank first, with damage worth 68,202.37 billion dollars, while drought causes the greatest crop damage, worth 13,972.57 million dollars.
p1 <- ggplot(data=ec1[1:10,], aes(x=reorder(EVTYPE, prop_damage2), y=prop_damage2)) + geom_bar(fill="burlywood4",stat="identity") + ylab("Property Damage (Billion $)") + xlab("Event type") + ggtitle("Economic Consequences in terms of weather events in US - Top 10") + theme(legend.position="none")+scale_y_continuous(expand = c(0, 0), limits = c(0,max(ec1$prop_damage2)+10000)) + coord_flip()+geom_text(aes(label=round(prop_damage2,2)), position=position_dodge(width=0.9), vjust=.5, hjust=-.1, cex=3)
p2 <- ggplot(data=ec2[1:10,], aes(x=reorder(EVTYPE, crop_damage2), y=crop_damage2)) + geom_bar(fill="chartreuse3",stat="identity") + ylab("Crop Damage (Million $)") + xlab("Event type") + theme(legend.position="none")+scale_y_continuous(expand = c(0, 0), limits = c(0,max(ec2$crop_damage2)+2000)) + coord_flip()+geom_text(aes(label=round(crop_damage2,2)), position=position_dodge(width=0.9), vjust=.5, hjust=-.1, cex=3)
plot_grid(p1, p2, ncol=1, align="v")
We can also see which extreme events cause the most damage to properties and crops combined.
# property and crop damage were computed above in billions and millions, respectively, so we convert both back to dollars before summing, then express the total in billions
mrg_econ<-mutate(econ_con, mrg_e=(prop_damage2*1e+9+crop_damage2*1e+6)/1e+9) %>% select(-(2:3)) %>% arrange(desc(mrg_e))
head(mrg_econ,5)
## # A tibble: 5 x 2
## EVTYPE mrg_e
## <fctr> <dbl>
## 1 FLASH FLOOD 68203.7883
## 2 THUNDERSTORM WINDS 20865.5075
## 3 TORNADO 1079.3662
## 4 HAIL 318.7818
## 5 LIGHTNING 172.9554
As expected, the weather event with the greatest combined impact is still flash flood, with damage worth 68,203.79 billion dollars.
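The unit conversion behind mrg_e can be verified on a single row. A minimal sketch using the HAIL figures from the tables in this report (the variable names are illustrative): converting both amounts to dollars and back to billions is equivalent to adding the crop amount divided by 1,000.

```r
prop_billion <- 315.7558   # HAIL property damage in billions, from the table above
crop_million <- 3025.974   # HAIL crop damage in millions, from the table above

# Long form, as in the mrg_e computation: to dollars, sum, back to billions
combined_long <- (prop_billion * 1e9 + crop_million * 1e6) / 1e9

# Short form: one billion is 1,000 million
combined_short <- prop_billion + crop_million / 1e3

round(combined_long, 4)
```

Both forms give about 318.7818 billion, which matches the HAIL row of mrg_econ.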