Reproducible Research Assignment 2

Synopsis

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

This document seeks to determine which types of weather event have the greatest impact on the economy and human health. A date restriction was applied (1993-2011) to ensure completeness of the dataset over the period being analysed. The event type that had the greatest total impact on health over the period was tornadoes with 25000 injuries and deaths recorded.The event type that had the greatest total economic impact was flooding, causing $150Bn of damage.

Data Processing

General Preprocessing

Set libraries and read the data

library(xtable)
library(plyr)
require(ggplot2)
library(gdata)

## gdata: Unable to locate valid perl interpreter
## gdata: 
## gdata: read.xls() will be unable to read Excel XLS and XLSX files
## gdata: unless the 'perl=' argument is used to specify the location
## gdata: of a valid perl intrpreter.
## gdata: 
## gdata: (To avoid display of this message in the future, please
## gdata: ensure perl is installed and available on the executable
## gdata: search path.)
## gdata: Unable to load perl libaries needed by read.xls()
## gdata: to support 'XLX' (Excel 97-2004) files.
## 
## gdata: Unable to load perl libaries needed by read.xls()
## gdata: to support 'XLSX' (Excel 2007+) files.
## 
## gdata: Run the function 'installXLSXsupport()'
## gdata: to automatically download and install the perl
## gdata: libaries needed to support Excel XLS and XLSX formats.
## 
## Attaching package: 'gdata'
## 
## The following object is masked from 'package:stats':
## 
##     nobs
## 
## The following object is masked from 'package:utils':
## 
##     object.size

setwd("~/datasciencecoursera/RepRes/Ass2")
txt_file<- "repdata-data-StormData.csv"
data_set <- read.csv(txt_file)

The validity and completeness of the data in earlier years is in question. It was therefore decided to restrict the analysis to the more recent data. Through exploratory data analysis it was discovered that for many event types events started being recorded in 1993. This analysis considers the period January 1993 - November 2011

data_set$BGN_DATE<-as.Date(data_set$BGN_DATE, format = "%m/%d/%Y")
filterdate<-as.Date("01/01/1993", format ="%m/%d/%Y")
data_subset<-data_set[data_set$BGN_DATE>filterdate,]

The dataset is not very clean and requires some cleansing. The EVTYPE classification, which is key to analysing by type of weather event has 985 different classes, rather than the 48 defined in documentation. This is a result of typographical mistakes, inconsistancy of capitalisation and inconsistancy of naming conventions. To remove the effect of capitalisaiton differences all event types were converted to lower case letters.

data_subset$EVTYPE<-tolower(data_subset$EVTYPE)
data_subset$EVTYPE<-as.factor(data_subset$EVTYPE)

Whilst there was an exhaustive list of event types present in the literature supporting the database due to the nature of the data entry it would be necessary to have someone with specialist knowledge map from the entered classes to the given list. However it is possible to refine the list from certain types of data entry errors. In order to further refine the dataset it was assumed that if the number of occurances of an event type was sufficiently low then it was either a typo or an incorrect naming convention. In order to avoid losing events with low frequency as a result of this approach the threshold was set deliberately low. Once these events had been identified the levinson distance was used to reclassify them.

#define column of interest & set the threshold for occurances
col<-grep("EVTYPE", colnames(data_subset))
threshold<-10

dataclean<-data_subset
occurances<-as.data.frame(table(dataclean[,col]))

while(sum(occurances[,2]>threshold)!= nrow(occurances)){
      
      minocc<-which(occurances[,2]==min(occurances[,2]))
      dist<-data.frame(distance=numeric(nrow(occurances)))

      for (i in 1:nrow(occurances)){
            dist[i,]<-adist(occurances[minocc[1],1],occurances[i,1])
            if (dist[i,]==0){dist[i,]<-999} 
      } 
      
      mindis<-which(dist[,1]==min(dist[,1]))
      dataclean[dataclean[,col]==occurances[minocc[1],1],col]<-occurances[mindis[1],1]
      lvldrop<-drop.levels(dataclean[,col])
      dataclean[,col]<-lvldrop
      occurances<-as.data.frame(table(dataclean[,col]))
}

data_subset<-dataclean

The “Exponent” data which supports the property & crop damage data also required cleansing.

data_subset$PROPDMGEXP<-tolower(data_subset$PROPDMGEXP)
data_subset$PROPDMGEXP[data_subset$PROPDMGEXP == "h"] <- 2
data_subset$PROPDMGEXP[data_subset$PROPDMGEXP == "k"] <- 3
data_subset$PROPDMGEXP[data_subset$PROPDMGEXP == "m"] <- 6
data_subset$PROPDMGEXP[data_subset$PROPDMGEXP == "b"] <- 9
data_subset$PROPDMGEXP[data_subset$PROPDMGEXP == "-"] <- 0
data_subset$PROPDMGEXP[data_subset$PROPDMGEXP == "?"] <- 0
data_subset$PROPDMGEXP[data_subset$PROPDMGEXP == "+"] <- 0
data_subset$PROPDMGEXP[data_subset$PROPDMGEXP == ""] <- 0
data_subset$PROPDMGEXP[data_subset$PROPDMGEXP == " "] <- 0
data_subset$PROPDMGEXP<- as.numeric(data_subset$PROPDMGEXP)

data_subset$CROPDMGEXP<-tolower(data_subset$CROPDMGEXP)
data_subset$CROPDMGEXP[data_subset$CROPDMGEXP == "h"] <- 2
data_subset$CROPDMGEXP[data_subset$CROPDMGEXP == "k"] <- 3
data_subset$CROPDMGEXP[data_subset$CROPDMGEXP == "m"] <- 6
data_subset$CROPDMGEXP[data_subset$CROPDMGEXP == "b"] <- 9
data_subset$CROPDMGEXP[data_subset$CROPDMGEXP == "-"] <- 0
data_subset$CROPDMGEXP[data_subset$CROPDMGEXP == "?"] <- 0
data_subset$CROPDMGEXP[data_subset$CROPDMGEXP == "+"] <- 0
data_subset$CROPDMGEXP[data_subset$CROPDMGEXP == ""] <- 0
data_subset$CROPDMGEXP[data_subset$CROPDMGEXP == " "] <- 0
data_subset$CROPDMGEXP<- as.numeric(data_subset$CROPDMGEXP)

Population Health

In order to consider which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health we will use the sum of fatalities & injuries.

We create a new data frame (healthoutcomes) which contains only the event type and fatality & injury numbers. We also then create a table that summarises the combined fatality & injury numbers.

datarows<-nrow(data_subset)
healthoutcomes<-data.frame(EVTYPE=factor(datarows), FATALITIES=numeric(datarows),INJURIES=numeric(datarows),INCIDENTS=numeric(datarows), BGN_DATE=date())
healthoutcomes[,1]<-data_subset$EVTYPE
healthoutcomes[,2]<-data_subset$FATALITIES
healthoutcomes[,3]<-data_subset$INJURIES
healthoutcomes[,4]<-healthoutcomes$FATALITIES + healthoutcomes$INJURIES
healthoutcomes[,5]<-data_subset$BGN_DATE

noeventtypes<-nlevels(healthoutcomes$EVTYPE)
healthsummary<-data.frame(EVTYPE=factor(noeventtypes), Min=numeric(noeventtypes), Median=numeric(noeventtypes), Mean=numeric(noeventtypes), Max=numeric(noeventtypes), Sum=numeric(noeventtypes))

min<-aggregate(healthoutcomes$INCIDENTS,by= list(healthoutcomes$EVTYPE), FUN=min,na.rm=TRUE)
median<-aggregate(healthoutcomes$INCIDENTS,by= list(healthoutcomes$EVTYPE), FUN=median,na.rm=TRUE)
mean<-aggregate(healthoutcomes$INCIDENTS,by= list(healthoutcomes$EVTYPE), FUN=mean,na.rm=TRUE)
max<-aggregate(healthoutcomes$INCIDENTS,by= list(healthoutcomes$EVTYPE), FUN=max,na.rm=TRUE)
sum<-aggregate(healthoutcomes$INCIDENTS,by= list(healthoutcomes$EVTYPE), FUN=sum,na.rm=TRUE)

healthsummary[,1:2]<-min
healthsummary[,3]<-median[,2]
healthsummary[,4]<-mean[,2]
healthsummary[,5]<-max[,2]
healthsummary[,6]<-sum[,2]
healthsummaryordered<-healthsummary[with(healthsummary,order(-healthsummary[,6])),]
healthsummary5<-healthsummaryordered[1:5,]

#use top 5 list to subset 
test<-which(healthoutcomes[,1]%in% healthsummary5[,1])
healthoutcomes5<-healthoutcomes[test,]
healthoutcomes5<-healthoutcomes5[with(healthoutcomes5,order(EVTYPE,BGN_DATE)),]
healthoutcomes5 = ddply(healthoutcomes5, c('EVTYPE'), transform, INCIDENTSsum=cumsum(INCIDENTS))

Economic Impact

In order to consider which types of events (as indicated in the EVTYPE variable) have the greatest impact with respect to economic impact we will use the sum of property & crop damage. It should be noted that this measure will be an innaccurate measure of the true economic impact. This is because the economic impact is affected by many other factors. It will however present a basis on which to compare the effects of different events.

We create a new data frame (economicoutcomes) which contains only the event type and damage numbers.

datarows<-nrow(data_subset)
economicoutcomes<-data.frame(EVTYPE=factor(datarows), PROPDMG=numeric(datarows),PROPDMGEXP=character(datarows) ,CROPDMG=numeric(datarows),CROPDMGEXP=character(datarows),TTLDMG=numeric(datarows), BGN_DATE=date())
economicoutcomes[,1]<-data_subset$EVTYPE
economicoutcomes[,2]<-data_subset$PROPDMG
economicoutcomes[,3]<-data_subset$PROPDMGEXP
economicoutcomes[,4]<-data_subset$CROPDMG
economicoutcomes[,5]<-data_subset$CROPDMGEXP
economicoutcomes[,6]<-(data_subset$PROPDMG*10^data_subset$PROPDMGEXP+data_subset$CROPDMG*10^data_subset$CROPDMGEXP)/1000000
economicoutcomes[,7]<-data_subset$BGN_DATE

noeventtypes<-nlevels(economicoutcomes$EVTYPE)
economicsummary<-data.frame(EVTYPE=factor(noeventtypes), Min=numeric(noeventtypes), Median=numeric(noeventtypes), Mean=numeric(noeventtypes), Max=numeric(noeventtypes), Sum=numeric(noeventtypes))

min<-aggregate(economicoutcomes$TTLDMG,by= list(economicoutcomes$EVTYPE), FUN=min,na.rm=TRUE)
median<-aggregate(economicoutcomes$TTLDMG,by= list(economicoutcomes$EVTYPE), FUN=median,na.rm=TRUE)
mean<-aggregate(economicoutcomes$TTLDMG,by= list(economicoutcomes$EVTYPE), FUN=mean,na.rm=TRUE)
max<-aggregate(economicoutcomes$TTLDMG,by= list(economicoutcomes$EVTYPE), FUN=max,na.rm=TRUE)
sum<-aggregate(economicoutcomes$TTLDMG,by= list(economicoutcomes$EVTYPE), FUN=sum,na.rm=TRUE)


economicsummary[,1:2]<-min
economicsummary[,3]<-median[,2]
economicsummary[,4]<-mean[,2]
economicsummary[,5]<-max[,2]
economicsummary[,6]<-sum[,2]
economicsummaryordered<-economicsummary[with(economicsummary,order(-economicsummary[,6])),]
economicsummary5<-economicsummaryordered[1:5,]

#use 5 list to subset 
test<-which(economicoutcomes[,1]%in% economicsummary5[,1])
economicoutcomes5<-economicoutcomes[test,]
economicoutcomes5<-economicoutcomes5[with(economicoutcomes5,order(EVTYPE,BGN_DATE)),]
economicoutcomes5 = ddply(economicoutcomes5, c('EVTYPE'), transform, TTLDMGsum=cumsum(TTLDMG))

Results

To show the relative effects of different event types it was decided to use cumulative totals. By viewing the data across time it is possible to observe if there are occurances of

Population Health

The table below shows the top 5 event types based on total number of fatalities & injuries.

xt<-xtable(healthsummary5,digits=2)
print(xt, type = "html", include.rownames=FALSE)

EVTYPE	Mean	Max	Sum
tornado	0.96	1308.00	24931.00
excessive heat	4.99	521.00	8451.00
flood	0.29	802.00	7259.00
lightning	0.38	51.00	6048.00
tstm wind	0.03	60.00	3873.00

The plot below tracks the cumulative number of fatalities & injuries for these five event types across the 30 year period.

qplot(x=BGN_DATE,y=INCIDENTSsum, size = I(1.5), data=healthoutcomes5, color = EVTYPE, xlab="Date", ylab="Cumulative Number of Incidents", main = "Health Impact")

plot of chunk unnamed-chunk-8

Economic Impact

The table below shows the top 5 event types based on

xt<-xtable(economicsummary5,digits=2)
print(xt, type = "html", include.rownames=FALSE)

EVTYPE	Median	Mean	Max	Sum
flood	0.00	5.93	115032.50	150320.73
hurricane/typhoon	7.62	817.20	16930.00	71913.71
storm surge	0.04	165.99	31300.00	43323.54
tornado	0.00	1.03	2800.00	26764.24
hail	0.00	0.08	1800.00	18761.22

The plot below tracks the total damage caused for these five event types across the 30 year period.

qplot(x=BGN_DATE,y=TTLDMGsum, size = I(1.5), data=economicoutcomes5, color = EVTYPE, xlab="Date", ylab="Cumulative Total Damage /$M's", main = "Economic Impact")

plot of chunk unnamed-chunk-9