Coursera - Reproducible Research Assignment 2

niczky12 25/Nov/2014

Synopsis

This document analyses data from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database.
The main aim of this analysis is to identify which types of weather events are most harmful to population health and which have the greatest economic consequences. The dataset is freely available from:
https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2

Data Processing

The original NOAA data set has to be pre-processed for our analysis. This involves converting the dates into a more manageable POSIXct format and combining the property damage and crop damage columns into a single total COST column. We do this by using the PROPDMGEXP and CROPDMGEXP columns to multiply the PROPDMG/CROPDMG values by the appropriate power of ten (H = 100, K = 1,000, M = 1,000,000, B = 1,000,000,000). Any other exponent code, including blanks, digits and symbols, is treated as a multiplier of 1.

#download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", "StormData.csv.bz2", method="wget")

StormData <- read.table("StormData.csv.bz2", sep = ",", stringsAsFactors = FALSE, header = TRUE)

require(lubridate)
## Loading required package: lubridate
days <- unlist(strsplit(StormData[,"BGN_DATE"], " "))
days <- days[seq(1,length(days), by=2)]

StormData$DATE <- mdy(days)

StormData$multiple1 <- 1
StormData$multiple1[StormData$PROPDMGEXP == "B"] <- 1000000000
StormData$multiple1[StormData$PROPDMGEXP == "h" | StormData$PROPDMGEXP == "H"] <- 100
StormData$multiple1[StormData$PROPDMGEXP == "K"] <- 1000
StormData$multiple1[StormData$PROPDMGEXP == "m" | StormData$PROPDMGEXP == "M"] <- 1000000

StormData$multiple2 <- 1
StormData$multiple2[StormData$CROPDMGEXP == "B"] <- 1000000000
StormData$multiple2[StormData$CROPDMGEXP == "h" | StormData$CROPDMGEXP == "H"] <- 100
StormData$multiple2[StormData$CROPDMGEXP == "K" | StormData$CROPDMGEXP == "k"] <- 1000
StormData$multiple2[StormData$CROPDMGEXP == "m" | StormData$CROPDMGEXP == "M"] <- 1000000

StormData$COST <- StormData$PROPDMG * StormData$multiple1 + StormData$CROPDMG * StormData$multiple2

StormData <- StormData[ , c("DATE", "STATE", "COUNTYNAME", "EVTYPE", "FATALITIES", "INJURIES", "COST")]
StormData$STATE <- as.factor(StormData$STATE)
StormData$COUNTYNAME <- as.factor(StormData$COUNTYNAME)
StormData$EVTYPE <- as.factor(StormData$EVTYPE)
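
The two pre-processing steps can also be sketched on a toy data frame. This is only an illustration under stated assumptions: `exp_multiplier` and `toy` are hypothetical names that do not appear in the analysis, base R date parsing stands in for `lubridate::mdy`, and the helper treats upper and lower case codes alike, which is slightly more permissive than the code above.

```r
# Hypothetical helper (not part of the original analysis): map an exponent
# code to its multiplier; anything unrecognised falls back to 1.
exp_multiplier <- function(code) {
  lookup <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)
  mult <- unname(lookup[toupper(code)])
  mult[is.na(mult)] <- 1
  mult
}

# Made-up rows that mimic the layout of the NOAA columns used above.
toy <- data.frame(
  BGN_DATE   = c("4/18/1950 0:00:00", "11/25/2011 0:00:00"),
  PROPDMG    = c(25, 2.5),
  PROPDMGEXP = c("K", "m"),
  CROPDMG    = c(0, 10),
  CROPDMGEXP = c("", "k"),
  stringsAsFactors = FALSE
)

# Drop the time part, then parse the month/day/year date.
toy$DATE <- as.POSIXct(sub(" .*$", "", toy$BGN_DATE),
                       format = "%m/%d/%Y", tz = "UTC")

# Total cost = property damage + crop damage, each scaled by its multiplier.
toy$COST <- toy$PROPDMG * exp_multiplier(toy$PROPDMGEXP) +
  toy$CROPDMG * exp_multiplier(toy$CROPDMGEXP)
```

A single lookup vector keeps the mapping in one place, so the property and crop columns cannot drift apart in how they handle each code.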

Now the data set has been prepared for further analysis.

Results

In this section, we aim to answer two main questions about this data set.

Across the United States, which types of events are most harmful with respect to population health?

First, let us look at the summary of the fatalities and injuries columns to get a general idea of the problem.

summary(StormData$FATALITIES)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   0.0000   0.0000   0.0000   0.0168   0.0000 583.0000
summary(StormData$INJURIES)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##    0.0000    0.0000    0.0000    0.1557    0.0000 1700.0000

The third quartile is zero for both columns, so at least 75% of the recorded events caused no fatalities, and at least 75% caused no injuries.
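
The quartiles only give a lower bound on the share of harmless events. A minimal sketch of how the exact share could be computed is shown below on a made-up vector; `toy_fatalities` is a hypothetical stand-in for `StormData$FATALITIES`.

```r
# Made-up fatality counts for ten events (not real NOAA data).
toy_fatalities <- c(0, 0, 0, 0, 0, 0, 0, 2, 0, 1)

# Exact share of events with no fatalities.
share_zero <- mean(toy_fatalities == 0)

# The third quartile is zero whenever at least 75% of the values are zero.
third_quartile <- quantile(toy_fatalities, 0.75, names = FALSE)
```

Applied to the real data, the same `mean(x == 0)` expression would give the precise proportion for each column.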

Let us now aggregate our data and look at the total number of fatalities and injuries per event type. Is there a correlation between the number of fatalities and injuries?

aggdata <- aggregate(StormData[, c("FATALITIES", "INJURIES", "COST")], by=list(StormData$EVTYPE), sum)
names(aggdata)[1] <- "EVTYPE"

require(ggplot2)
## Loading required package: ggplot2
ggplot(aes(FATALITIES, INJURIES), data=aggdata) + geom_point() + ggtitle("Total number of INJURIES/FATALITIES per event type")

aggdata[aggdata$INJURIES == max(aggdata$INJURIES), ]
##      EVTYPE FATALITIES INJURIES        COST
## 830 TORNADO       5633    91346 57352114049
worst_event <- as.character(aggdata[aggdata$INJURIES == max(aggdata$INJURIES),"EVTYPE" ])

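The graph shows one point far away from all the others. That point is TORNADO, which accounts for 5,633 fatalities and 91,346 injuries in total, making it the event type with the most severe human consequences.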
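
The aggregation and ranking used in this section can be sketched on a small made-up data set. The names `toy`, `toy_agg`, `by_fatalities` and `by_injuries` are hypothetical; the sketch also computes a correlation coefficient, which is one possible way to quantify the relationship asked about above.

```r
# Made-up events (not real NOAA data).
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD"),
  FATALITIES = c(3, 1, 2, 4, 0),
  INJURIES   = c(40, 2, 10, 5, 1)
)

# Sum fatalities and injuries per event type.
toy_agg <- aggregate(toy[, c("FATALITIES", "INJURIES")],
                     by = list(EVTYPE = toy$EVTYPE), FUN = sum)

# Rank event types separately by each measure.
by_fatalities <- toy_agg[order(-toy_agg$FATALITIES), ]
by_injuries   <- toy_agg[order(-toy_agg$INJURIES), ]

# Pearson correlation between the per-type totals.
fat_inj_cor <- cor(toy_agg$FATALITIES, toy_agg$INJURIES)
```

Ranking by both columns guards against the case where the event type with the most injuries is not the one with the most fatalities.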

Across the United States, which types of events have the greatest economic consequences?

Using our aggregate data set we shall investigate which event type has the most severe economic consequences. Namely, we are looking for the event type that accounts for the largest total economic cost (property damage plus crop damage).

most_expensive <- aggdata[order(-aggdata$COST), ][1:10,]

most_expensive$EVTYPE <- factor(most_expensive$EVTYPE, levels= most_expensive$EVTYPE)
ggplot(aes(EVTYPE,COST),data=most_expensive) + geom_point() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) + xlab("")+
    ylab("Total damage in US dollars") + ggtitle("Top 10 most expensive disasters")

me <- most_expensive[1,"COST"]

The plot shows that FLOOD has been the most devastating weather event type in the US in terms of economic cost. Floods are responsible for a total of 150,319,678,257 dollars (roughly 150 billion) worth of damage. It is important to note that tornadoes are 3rd on the list, and they caused the most human casualties as well.
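
Large dollar totals are easier to read with thousands separators or expressed in billions. A minimal sketch of both, using the flood total reported above; `flood_cost`, `cost_label` and `cost_billions` are hypothetical names (in the analysis itself the value is stored in `me`).

```r
# Flood total as reported in the text above.
flood_cost <- 150319678257

# Full figure with thousands separators, avoiding scientific notation.
cost_label <- format(flood_cost, big.mark = ",", scientific = FALSE)

# The same figure in billions of dollars, rounded to one decimal place.
cost_billions <- round(flood_cost / 1e9, 1)
```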