The purpose of this document is to explore the U.S. National Oceanic and Atmospheric Administration (NOAA) Storm Database and answer some basic questions about severe weather events.
We need a few libraries for the data exploration:
library(knitr)
## Warning: package 'knitr' was built under R version 3.1.2
library(plyr)
library(ggplot2)
library(bitops)
library(RCurl)
## Warning: package 'RCurl' was built under R version 3.1.1
Assuming you have downloaded the file and uncompressed it using bunzip2 or similar, the next step is to load the data. The data can be found here: https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2 [47 MB]
Loading of the data and pre-processing:
data <- read.csv("repdata-data-StormData.csv")
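As an aside, the manual bunzip2 step may be optional: R's read.csv() can generally read a bzip2-compressed file directly, because the underlying file connection detects the compression. The sketch below demonstrates this on a tiny temporary file rather than the real download; the toy rows and the temporary path are made up for illustration.

```r
# Toy stand-in for the storm data (made-up rows, not real records)
toy <- data.frame(EVTYPE = c("TORNADO", "HAIL"), FATALITIES = c(1, 0))

# Write it out bzip2-compressed, mimicking the .csv.bz2 download
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() reads the compressed file without a separate decompression step
roundtrip <- read.csv(tmp, stringsAsFactors = FALSE)
print(roundtrip)
```

With the real data this would amount to passing the path of the compressed download to read.csv(), for example "repdata-data-StormData.csv.bz2" if the file was saved under that (assumed) name.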
A summary of the loaded data:
#summary(data)
The output is suppressed, but feel free to run it yourself. I left the code in because it was relevant to my data exploration. (knitr ignored my out.height=10 option and my other attempts to collapse the output.)
We will concentrate on total HUMAN damage and ECONOMIC impact.
First let’s put the cost estimates on property and crops in a useful format:
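# Convert the property and crop damage figures to dollars using their exponent
# columns, adding the results as new columns at the end of the data frame.
# Only '' (x1), 'K' (x1,000) and 'M' (x1,000,000) are handled here; a row with any
# other exponent code (for example 'B') gets a multiplier of 0 and adds nothing to the totals.
data$real_propdmg<-data$PROPDMG*(1*(data$PROPDMGEXP=='') + 1000*(data$PROPDMGEXP=='K') + 1000000*(data$PROPDMGEXP=='M'))
data$real_cropdmg<-data$CROPDMG*(1*(data$CROPDMGEXP=='') + 1000*(data$CROPDMGEXP=='K') + 1000000*(data$CROPDMGEXP=='M'))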
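The conversion above only recognises '', 'K' and 'M'. A lookup-based helper is one way to extend it; the sketch below is an illustration on made-up rows, not the method used for the results in this report. The function name exp_multiplier is hypothetical, the 'B' (billions) code follows the NWS Storm Data documentation, and treating lowercase codes like their uppercase forms and unrecognised codes as NA are assumptions.

```r
# Hypothetical helper: turn a damage exponent code into a numeric multiplier.
exp_multiplier <- function(code) {
  code <- toupper(as.character(code))      # assumption: 'k' means the same as 'K'
  lookup <- c(K = 1e3, M = 1e6, B = 1e9)
  mult <- unname(lookup[code])             # unrecognised codes become NA
  mult[!is.na(code) & code == ""] <- 1     # blank exponent: use the value as-is
  mult
}

# Made-up rows covering each handled case
toy <- data.frame(PROPDMG    = c(25, 2.5, 1.2, 10, 3),
                  PROPDMGEXP = c("K", "M", "B", "", "k"),
                  stringsAsFactors = FALSE)
toy$real_propdmg <- toy$PROPDMG * exp_multiplier(toy$PROPDMGEXP)
print(toy)
```

An NA multiplier propagates into the damage column, so a sum with na.rm=TRUE would skip that row, much as the zero multiplier does in the original expression, while keeping the unhandled codes visible for inspection.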
Now create a per-event-type summary of the quantities we care about, HUMAN damage (fatalities and injuries) and ECONOMIC impact (property damage plus crop damage):
sumdata <- ddply(data, .(EVTYPE), summarise, sum_fatal = sum(FATALITIES, na.rm=TRUE), sum_injury = sum(INJURIES, na.rm=TRUE), prop_damage = sum(real_propdmg, na.rm=TRUE), crop_damage = sum(real_cropdmg, na.rm=TRUE))
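For readers who prefer not to depend on plyr, a similar per-event-type total can be sketched with base R's aggregate(). The rows below are made up for illustration, and only two of the four summed columns are shown.

```r
# Made-up rows standing in for the storm data
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HAIL"),
  FATALITIES = c(2, 1, 3, NA),
  INJURIES   = c(10, 0, 5, 1),
  stringsAsFactors = FALSE
)

# Sum each measure within EVTYPE; na.rm = TRUE is passed through to sum()
toy_sum <- aggregate(toy[c("FATALITIES", "INJURIES")],
                     by = list(EVTYPE = toy$EVTYPE),
                     FUN = sum, na.rm = TRUE)
names(toy_sum) <- c("EVTYPE", "sum_fatal", "sum_injury")
print(toy_sum)
```

The result has one row per event type, which is the same shape as the sumdata table built with ddply.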
The first question: across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
# Order the data and take the top 5 to determine MOST HARMFUL to human life
orderHuman <- sumdata[with(sumdata, order(-sum_fatal,-sum_injury)),]
# Output the data summary
smallSubset <- head(orderHuman, 5)
smallSubset
##             EVTYPE sum_fatal sum_injury prop_damage crop_damage
## 834        TORNADO      5633      91346 51625660483   414953110
## 130 EXCESSIVE HEAT      1903       6525     7753700   492402000
## 153    FLASH FLOOD       978       1777 15140811717  1421317100
## 275           HEAT       937       2100     1797000     1461500
## 464      LIGHTNING       816       5230   928659283    12092090
Let’s take a look at that visually:
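# Plot combined fatalities and injuries on a numeric axis (wrapping y in factor() would make the bar heights ordinal, not proportional)
ggplot(smallSubset, aes(EVTYPE, sum_injury + sum_fatal, fill = EVTYPE)) + geom_bar(stat = "identity") + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + xlab("Event Type") + ylab("Fatalities and Injuries")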
Ranked by fatalities (with injuries as a tie-breaker), tornadoes, excessive heat, flash floods, heat, and lightning are the five event types most damaging to human life and health, in that order.
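The table is ordered by fatalities first, while the plot shows fatalities plus injuries. As a quick check of how much the criterion matters, the sketch below re-ranks the five rows printed above by their combined total. It re-enters those five rows by hand (the names top5 and casualties are made up here), so it says nothing about event types outside this top five.

```r
# The five rows from the summary output above, typed in by hand
top5 <- data.frame(
  EVTYPE     = c("TORNADO", "EXCESSIVE HEAT", "FLASH FLOOD", "HEAT", "LIGHTNING"),
  sum_fatal  = c(5633, 1903, 978, 937, 816),
  sum_injury = c(91346, 6525, 1777, 2100, 5230),
  stringsAsFactors = FALSE
)

# Combined casualties, then sort from most to fewest
top5$casualties <- top5$sum_fatal + top5$sum_injury
top5 <- top5[order(-top5$casualties), ]
print(top5)
```

Within these five rows, TORNADO and EXCESSIVE HEAT keep the first two places, but LIGHTNING moves ahead of HEAT and FLASH FLOOD once injuries are counted.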
A related question: across the United States, which types of events have the greatest economic consequences?
# Order the data and take the top 5 to determine MOST HARMFUL to economic interests
orderEconomy <- sumdata[with(sumdata, order(-prop_damage,-crop_damage)),]
# Output the data summary
smallSubset2 <- head(orderEconomy, 5)
smallSubset2
##          EVTYPE sum_fatal sum_injury prop_damage crop_damage
## 834     TORNADO      5633      91346 51625660483   414953110
## 170       FLOOD       470       6789 22157709807  5661968450
## 153 FLASH FLOOD       978       1777 15140811717  1421317100
## 244        HAIL        15       1361 13927366777  3025537453
## 402   HURRICANE        61         46  6168319010  2741910000
And to see that visually:
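# Plot property damage on a numeric axis (wrapping y in factor() would make the bar heights ordinal, not proportional)
ggplot(smallSubset2, aes(EVTYPE, prop_damage, fill = EVTYPE)) + geom_bar(stat = "identity") + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + xlab("Event Type") + ylab("Property damage (USD)")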
Ranked by property damage (with crop damage as a tie-breaker), TORNADO, FLOOD, FLASH FLOOD, HAIL, and HURRICANE have the greatest economic impact.
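Because the ordering uses property damage first, crop damage only acts as a tie-breaker. One alternative is to rank by the sum of the two. The sketch below does this for the five rows printed above, re-entered by hand (econ5 and total_damage are names made up for this example); event types outside this top five are not considered.

```r
# The five rows from the economic summary output above, typed in by hand
econ5 <- data.frame(
  EVTYPE      = c("TORNADO", "FLOOD", "FLASH FLOOD", "HAIL", "HURRICANE"),
  prop_damage = c(51625660483, 22157709807, 15140811717, 13927366777, 6168319010),
  crop_damage = c(414953110, 5661968450, 1421317100, 3025537453, 2741910000),
  stringsAsFactors = FALSE
)

# Total economic damage = property + crop, sorted from largest to smallest
econ5$total_damage <- econ5$prop_damage + econ5$crop_damage
econ5 <- econ5[order(-econ5$total_damage), ]
print(econ5)
```

Within these five rows, HAIL moves ahead of FLASH FLOOD when crop damage is included; the other positions are unchanged.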
Given the purpose of this report, there is no need to make specific recommendations.
In summary, TORNADO and EXCESSIVE HEAT are the most damaging to human life, and TORNADO and FLOOD are the most damaging to property and crops.