This report addresses two questions using a storm database that covers 1950 through November 2011: 1. Which event types are most harmful to population health? 2. Which event types have the greatest economic consequences? To answer the first question, the analysis sums the fatalities and injuries recorded for each event type. To answer the second, it applies the same method to crop damage and property damage. For each question, one panel of bar plots illustrates the result.
The data for this assignment come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. The file can be downloaded from the course web site; the URL appears in the download code further down.
There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.
National Weather Service Storm Data Documentation
National Climatic Data Center Storm Events FAQ
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
library(dplyr)
library(RSQLite)
library(R.utils)
library(ggplot2)
library(gridExtra)
library(plyr)
if(!file.exists("repdata-data-StormData.csv.bz2")){
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2","repdata-data-StormData.csv.bz2")
}
rawdata<-read.csv("repdata-data-StormData.csv.bz2")
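read.csv can be pointed at the .bz2 file directly because R's file connections detect bzip2 compression when reading. A minimal sketch of that behaviour on a throwaway file; the toy data frame and the temporary path are made up for illustration and are not part of the storm database:

```r
# Toy data standing in for the storm database (hypothetical values).
toy <- data.frame(EVTYPE = c("TORNADO", "FLOOD"), FATALITIES = c(3, 1))

# Write a bzip2-compressed CSV to a temporary file.
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() decompresses transparently; no separate bunzip2 step is needed.
back <- read.csv(tmp, stringsAsFactors = FALSE)
print(back)

# Clean up the temporary file.
unlink(tmp)
```

The same call on the full file takes noticeably longer, since the whole archive is decompressed and parsed in one pass.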
Check whether the data is complete.
rows <- dim(rawdata)[1]
cols <- dim(rawdata)[2]
If the data is complete, rawdata should be a data frame with 902297 rows and 37 columns.
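The dimension check can also be made to fail loudly instead of relying on a visual comparison. The sketch below uses a hypothetical helper, check_dims, which is not part of the original analysis, and demonstrates it on a toy data frame:

```r
# Hypothetical helper: stop with a clear message if the dimensions are off.
check_dims <- function(df, expected_rows, expected_cols) {
  ok <- nrow(df) == expected_rows && ncol(df) == expected_cols
  if (!ok) {
    stop(sprintf("expected %d x %d, got %d x %d",
                 expected_rows, expected_cols, nrow(df), ncol(df)))
  }
  invisible(TRUE)
}

# Toy example; for the real data the call would be
# check_dims(rawdata, 902297, 37).
toy <- data.frame(a = 1:3, b = 4:6)
check_dims(toy, 3, 2)
```

A truncated download would then stop the script before any totals are computed.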
Sum the fatalities and injuries caused by each event type. Keep only the event types whose fatalities and injuries are not both zero. Sort by fatalities in descending order, breaking ties by injuries, and select the top 10 event types; then repeat with injuries as the primary sort key.
total <- rawdata %>% group_by(EVTYPE) %>% dplyr::summarize(FATALITIES=sum(FATALITIES),INJURIES=sum(INJURIES))
iszero<- total$FATALITIES==0 & total$INJURIES==0
total<- total[!iszero,]
total<-arrange(total,desc(FATALITIES),desc(INJURIES))
top10 <- total[1:10,]
total2<-arrange(total,desc(INJURIES),desc(FATALITIES))
top10byinjuries <- total2[1:10,]
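The group, filter, sort, and truncate steps can be traced on a small example. This sketch uses base R (aggregate and order) rather than the dplyr pipeline above, and the six toy rows are invented for illustration:

```r
# Invented rows; HEAT and TORNADO tie on fatalities so the tie-break is visible.
toy <- data.frame(
  EVTYPE     = c("TORNADO", "HEAT", "TORNADO", "HAIL", "HEAT", "FOG"),
  FATALITIES = c(5, 2, 1, 0, 4, 0),
  INJURIES   = c(10, 0, 3, 0, 1, 0)
)

# Sum both columns per event type.
totals <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = toy, FUN = sum)

# Drop event types with neither fatalities nor injuries (HAIL and FOG here).
totals <- totals[!(totals$FATALITIES == 0 & totals$INJURIES == 0), ]

# Sort by fatalities, then injuries, both descending, and keep the top 2.
by_fatalities <- totals[order(-totals$FATALITIES, -totals$INJURIES), ]
top2 <- head(by_fatalities, 2)
print(top2)
```

Both remaining event types total 6 fatalities, so the secondary key (injuries) decides that TORNADO ranks ahead of HEAT.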
Bar plots in one panel:
# Each plot uses the table ranked by the measure it displays:
# top10byinjuries for INJURIES, top10 (ranked by fatalities) for FATALITIES.
p1<-ggplot(top10byinjuries,aes(x=EVTYPE,y=INJURIES,fill=INJURIES))+geom_bar(stat="identity")+labs(title="INJURIES")+scale_fill_gradient("INJURIES",low="blue",high="darkblue")+theme(axis.text.x=element_text(angle=45,hjust=1))
p2<-ggplot(top10,aes(x=EVTYPE,y=FATALITIES,fill=FATALITIES))+geom_bar(stat="identity")+labs(title="FATALITIES")+scale_fill_gradient("FATALITIES",low="blue",high="darkblue")+theme(axis.text.x=element_text(angle=45,hjust=1))
grid.arrange(p1,p2,nrow=2)
The event types most harmful to population health, measured by fatalities, are TORNADO, EXCESSIVE HEAT, FLASH FLOOD, HEAT, LIGHTNING, TSTM WIND, FLOOD, RIP CURRENT, HIGH WIND, and AVALANCHE.
The event types most harmful to population health, measured by injuries, are TORNADO, TSTM WIND, FLOOD, EXCESSIVE HEAT, LIGHTNING, HEAT, ICE STORM, FLASH FLOOD, THUNDERSTORM WIND, and HAIL.
Convert PROPDMGEXP and CROPDMGEXP from characters to numeric multipliers using mapvalues. Calculate totalpropdmg as PROPDMG times propdmgexp, and totalcropdmg as CROPDMG times cropdmgexp. Then sum totalpropdmg and totalcropdmg for each EVTYPE.
ecnmc<-rawdata[,c("EVTYPE","CROPDMG","CROPDMGEXP","PROPDMG","PROPDMGEXP")]
# "h" is mapped to 1e2, consistent with "H"; digits d are mapped to 10^d; symbols and blanks to 1
propdmgexp<-mapvalues(ecnmc$PROPDMGEXP,c("K","M","", "B","m","+","0","5","6","?","4","2","3","h","7","H","-","1","8"),c(1e3,1e6, 1, 1e9,1e6, 1, 1,1e5,1e6, 1,1e4,1e2,1e3,1e2,1e7,1e2, 1, 10,1e8))
cropdmgexp<- mapvalues(ecnmc$CROPDMGEXP,c("","M","K","m","B","?","0","k","2"),c( 1,1e6,1e3,1e6,1e9,1,1,1e3,1e2))
propdmgexp<-as.numeric(as.character(propdmgexp))
cropdmgexp<-as.numeric(as.character(cropdmgexp))
ecnmc<-mutate(ecnmc,totalcropdmg=CROPDMG*cropdmgexp,totalpropdmg=PROPDMG*propdmgexp)
ecnmc <- ecnmc %>% group_by(EVTYPE) %>% dplyr::summarise(totalcropdmg=sum(totalcropdmg),totalpropdmg=sum(totalpropdmg),cropandprop=sum(totalcropdmg)+sum(totalpropdmg))
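The exponent conversion can also be written as a small lookup function, which makes the rule for each code explicit. This is an alternative sketch, not the mapvalues approach used above: exp_multiplier is a hypothetical helper, it treats letters case-insensitively, reads a digit d as 10^d, and assumes every other code (blank, "+", "-", "?") means a multiplier of 1. The toy rows are invented:

```r
# Hypothetical helper mapping damage exponent codes to numeric multipliers.
exp_multiplier <- function(code) {
  lookup <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)
  code <- toupper(as.character(code))
  mult <- unname(lookup[code])               # NA for codes not in the lookup
  digit <- grepl("^[0-9]$", code)            # single digits mean 10^digit
  mult[digit] <- 10^as.numeric(code[digit])
  mult[is.na(mult)] <- 1                     # assumption: unknown codes -> 1
  mult
}

# Invented rows showing one code of each kind.
toy <- data.frame(PROPDMG = c(25, 2.5, 10, 3),
                  PROPDMGEXP = c("K", "M", "", "5"),
                  stringsAsFactors = FALSE)
toy$totalpropdmg <- toy$PROPDMG * exp_multiplier(toy$PROPDMGEXP)
print(toy)
```

For example, a PROPDMG of 25 with exponent "K" becomes 25 * 1e3 = 25000 dollars, and 2.5 with "M" becomes 2.5 * 1e6 = 2500000 dollars.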
Subset the top 10 events that have the greatest economic consequences, measured by totalcropdmg, by totalpropdmg, and by their sum.
top10<-arrange(ecnmc,desc(totalcropdmg),desc(totalpropdmg))[1:10,] # arrange by totalcropdmg
top10bypropdmg<-arrange(ecnmc,desc(totalpropdmg),desc(totalcropdmg))[1:10,]
top10bypropandcrop<-arrange(ecnmc,desc(cropandprop))[1:10,]
Make 3 plots in one panel:
# top10 was reassigned above and now holds the ranking by crop damage
p3<-ggplot(top10,aes(x=EVTYPE,y=totalcropdmg,fill=totalcropdmg))+geom_bar(stat="identity")+labs(title="Crop Damage")+labs(y="Crop Damage")+scale_fill_gradient("Damage USD",low="red",high="darkred")+theme(axis.text.x=element_text(angle=45,hjust=1))
p4<-ggplot(top10bypropdmg,aes(x=EVTYPE,y=totalpropdmg,fill=totalpropdmg))+geom_bar(stat="identity")+labs(title="Prop Damage")+labs(y="Property Damage")+scale_fill_gradient("Damage USD",low="red",high="darkred")+theme(axis.text.x=element_text(angle=45,hjust=1))
p5<-ggplot(top10bypropandcrop,aes(x=EVTYPE,y=cropandprop,fill=cropandprop))+geom_bar(stat="identity")+labs(title="Prop and Crop Damage")+labs(y="Crop and Property Damage")+scale_fill_gradient("Damage USD",low="red",high="darkred")+theme(axis.text.x=element_text(angle=45,hjust=1))
grid.arrange(p3,p4,p5,nrow=2,ncol=2)
Events that have the greatest economic consequences measured by CROPDMG are DROUGHT, FLOOD, RIVER FLOOD, ICE STORM, HAIL, HURRICANE, HURRICANE/TYPHOON, FLASH FLOOD, EXTREME COLD, FROST/FREEZE.
Events that have the greatest economic consequences measured by PROPDMG are FLOOD, HURRICANE/TYPHOON, TORNADO, STORM SURGE, FLASH FLOOD, HAIL, HURRICANE, TROPICAL STORM, WINTER STORM, HIGH WIND.
Events that have the greatest economic consequences measured by the sum of PROPDMG and CROPDMG are FLOOD, HURRICANE/TYPHOON, TORNADO, STORM SURGE, HAIL, FLASH FLOOD, DROUGHT, HURRICANE, RIVER FLOOD, ICE STORM.
sessionInfo()
## R version 3.2.2 (2015-08-14)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.3 LTS
##
## locale:
## [1] LC_CTYPE=zh_CN.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=zh_CN.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=zh_CN.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] plyr_1.8.3 gridExtra_2.0.0 ggplot2_1.0.1 R.utils_2.2.0
## [5] R.oo_1.19.0 R.methodsS3_1.7.0 RSQLite_1.0.0 DBI_0.3.1
## [9] dplyr_0.4.3
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.2 knitr_1.11 magrittr_1.5 MASS_7.3-44
## [5] munsell_0.4.2 colorspace_1.2-6 R6_2.1.1 stringr_1.0.0
## [9] tools_3.2.2 parallel_3.2.2 grid_3.2.2 gtable_0.1.2
## [13] htmltools_0.2.6 yaml_2.1.13 assertthat_0.1 digest_0.6.8
## [17] reshape2_1.4.1 formatR_1.2.1 evaluate_0.8 rmarkdown_0.8.1
## [21] labeling_0.3 stringi_1.0-1 scales_0.3.0 proto_0.3-10