In this report, we aim to answer a few questions based on the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database which can be downloaded here
This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage. The events in the database start in the year 1950 and end in November 2011. The questions are: 1) Across the United States, which types of events are most harmful with respect to population health? 2) Across the United States, which types of events have the greatest economic consequences?
We first read the data from the CSV file.
sdata1 <- read.csv("repdata-data-StormData.csv.bz2")
After reading the data, we check the dimension of the dataset.
dim(sdata1)
## [1] 902297 37
We will also have a look at the names of the variables and the summary of the data set.
names(sdata1)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
To answer our questions, 7 variables of concern, i.e. EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG and CROPDMGEXP.
First we load the plyr package, and we put a higher weight on the number of fatalities based on the mean of injuries over mean of fatalities.
library(plyr)
weightf <- mean(sdata1[,"INJURIES"], na.rm = TRUE)/mean(sdata1[,"FATALITIES"], na.rm =TRUE)
print(weightf)
## [1] 9.278838
Then, we do a transformation on the fatalities, by multiplying the fatalities by the weight.
sdata1[,"fw"] <- sdata1[,"FATALITIES"]*weightf
Next, we sum both weighted fatalities and injuries as a measure of severity to population health.
sdata1[,"sev"] <- sdata1[,"fw"]+sdata1[,"INJURIES"]
Next, we sum the severity across time based on each event and arrange the severity in descending order.
eventsum <- aggregate(sdata1[,"sev"], list(events = sdata1[,"EVTYPE"]), FUN = sum)
eventsum <- arrange(eventsum, -x)
Finally, we answer our question by plotting the bar chart of top 5 events which are most harmful to population health.
par(mar=c(11,7,4,2) , mgp = c(6, 1, 0))
barplot(eventsum$x[1:5], main="Barplot of severity to population health against events", xlab="Events", ylab="Severity", names.arg=eventsum$events[1:5], las=2)
First we look at the possible property and crop damage exponents.
summary(sdata1[,"PROPDMGEXP"])
## - ? + 0 1 2 3 4 5
## 465934 1 8 5 216 25 13 4 4 28
## 6 7 8 B h H K m M
## 4 5 1 40 1 6 424665 7 11330
summary(sdata1[,"CROPDMGEXP"])
## ? 0 2 B k K m M
## 618413 7 19 1 9 21 281832 1 1994
Then, we make the necessary adjustment. For example, “B” stands for billion, and we replace “B” with 9, so that we can use exp(multiplier) later. Anything that is not known will be assigned 0.
multiplier <- function(x){sapply(x, function(x){
if(x %in% "B"){9}
else if(x %in% c("M","m")){6}
else if(x %in% c("K","k")){3}
else if(x %in% c("H","h")){2}
else if(is.numeric(x)){x}
else{0}
})}
sdata1[,"newpde"] <- multiplier(sdata1[,"PROPDMGEXP"])
sdata1[,"newcde"] <- multiplier(sdata1[,"CROPDMGEXP"])
Now we apply the transformation totaldmg = PROPDMG*exp(PROPDMGEXP) + CROPDMG(CROPDMGEXP)
sdata1[,"totaldmg"] <- sdata1[,"PROPDMG"]*exp(sdata1[,"newpde"]) + sdata1[,"CROPDMG"]*exp(sdata1[,"newcde"])
Now we sum the total economic damage based on each event and arrange the total economic damage in descending order.
evdmg <- aggregate(sdata1[,"totaldmg"],list(events = sdata1[,"EVTYPE"]), FUN = sum)
evdmg <- arrange(evdmg, -x)
Finally, we answer our question by plotting a bar chart of the top 5 events which have the greatest economic consequences.
par(mar=c(11,7,4,2) , mgp = c(6, 1, 0))
barplot(evdmg$x[1:5], main="Barplot of total economic cost against events", xlab="Events", ylab="Total Economic Cost", names.arg=evdmg$events[1:5], las=2,)