Storm Data Analysis: An Exercise in Reproducible Research and Data Analysis

Introduction

This document is produced for the Johns Hopkins with Coursera Reproducible Research course, repdata-011, for the second peer assessment project. I will be conducting an analysis of data from the National Oceanic and Atmospheric Administration (NOAA) storm database.

Specifically, I will be answering two questions:

Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?

This document will be split into two sections: Data Processing and Results. Naturally, all the data processing techniques are shown in the data processing section, and the corresponding analysis from the processed data are indicated in the results section.

Data Processing

To begin, I will load all necessary packages and display a snapshot of my current session.

suppressMessages(require(knitr))
suppressMessages(require(ggplot2))
suppressMessages(require(dplyr))
suppressMessages(require(R.utils))
sessionInfo()

## R version 3.1.1 (2014-07-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] dplyr_0.4.1       ggplot2_1.0.0     knitr_1.9         R.utils_1.34.0   
## [5] R.oo_1.18.0       R.methodsS3_1.6.1
## 
## loaded via a namespace (and not attached):
##  [1] assertthat_0.1   colorspace_1.2-4 DBI_0.3.1        digest_0.6.8    
##  [5] evaluate_0.5.5   formatR_1.0      grid_3.1.1       gtable_0.1.2    
##  [9] htmltools_0.2.6  magrittr_1.5     MASS_7.3-33      munsell_0.4.2   
## [13] parallel_3.1.1   plyr_1.8.1       proto_0.3-10     Rcpp_0.11.4     
## [17] reshape2_1.4.1   rmarkdown_0.5.1  scales_0.2.4     stringr_0.6.2   
## [21] tools_3.1.1      yaml_2.1.13

Naturally, data acquisition is the first step. All necessary information for data acquisition and upload are in the following code block.

setwd('~/R/repdata-011/peer_assessment_2')
file.url <- 'https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2'
bz2.file.name <- '~/R/repdata-011/peer_assessment_2/repdata-data-StormData.csv.bz2'
csv.file.name <- '~/R/repdata-011/peer_assessment_2/repdata-data-StormData.csv'
if(!file.exists(csv.file.name){
  download.file(url=file.url, destfile=bz2.file.name, method='curl')
  bunzip2(filename=bz2.file.name, destname=csv.file.name, overwrite=TRUE)
}

I upload into R using the standard read.csv function. Note that I include the ‘nrow’ argument to speed up the data upload.

raw.data <- read.csv(file=csv.file.name, header=TRUE, nrow=1000000, na.strings="")

Of the 37 variables present in the dataset, I create a smaller subset with the variables necessary only to complete the analysis and reduce computational load. The all upper-casing of the names is not in line with best practices, so I do a brief bit of data cleansing on the names; furthermore, I strip the factorization of the ‘EVTYPE’ variable such that the entries are more easily changed. I will demonstrate the need for such change later.

raw.subset <- select(raw.data, EVTYPE, FATALITIES, INJURIES, PROPDMG,
                     PROPDMGEXP, CROPDMG, CROPDMGEXP)
names(raw.subset) <- tolower(names(raw.subset))
raw.subset$evtype <- as.character(raw.subset$evtype)

While digging into this set, I noticed that there is very little standardization concerning entry into the ‘evtype’ variable. Thus, it is very difficult to conduct accurate categorical analyses. One evtype variable that was noticeably scattered, if you will, concerned Thunderstorm Wind events. Some events were marked exactly as “THUNDERSTORM WIND”, while others were marked “TSTM WIND” or some other derivative thereof. These entries are necessary to be standardized for the purposes of this analysis. In addition, all entries in the ‘evtype’ variable should be aligned for truly accurate analyses.

tstm.rows <- grep(pattern="^TSTM WIND|^TSTM WND|^THUNDERSTORM WINDS$",
                   x=raw.subset$evtype
                  )
num.tstm.rows <- length(tstm.rows)

There are 241888 rows that fit the pattern in the above grep statement. Each row will be updated to the standard “THUNDERSTORM WIND” entry. NOTE: the following code chunk has a rather lengthy execution time.

for(i in 1:num.tstm.rows){
    raw.subset[tstm.rows[i], "evtype"] <- "THUNDERSTORM WIND"
}

Now that the thunderstorm related entries in the ‘evtype’ variable have been updated, I will split the dataset into two smaller sets to answer the two questions. I will start with the fatality and injury data.

harm.data <- summarize(group_by(raw.subset, evtype),
                       fatalities = sum(fatalities),
                       injuries = sum(injuries)
                       )

harm.data <- mutate(harm.data, harm.totals = harm.data$fatalities + harm.data$injuries)

I will consider the top ten event types by total harm events, which I consider the sum of the fatality and injury events. In addition, I refactorize the evtype variables so that they are displayed in descending order on the plots.

harm.top.ten <- arrange(harm.data, desc(harm.totals))[1:10, ]
harm.top.ten$evtype <- factor(harm.top.ten$evtype,
                              levels=harm.top.ten$evtype[10:1])

Switching gears from the harm event data, I create a subset to answer the question about economic consequences.

econ.data <- select(raw.subset, evtype, propdmg, propdmgexp, cropdmg,
                    cropdmgexp)

The two variables ‘propdmgexp’ and ‘cropdmgexp’ contain information that affects the scale of ‘propdmg’ and ‘cropdmg’, respectively. Thus, to obtain true damage amounts, those scales must be read and applied to their corresponding values. To do this in a way that is aggreeable with my computer, I will split the dataset into two sets: those with prefixes and those with no prefixes/nonsensical values. I will update both according to their prefix and reassemble the dataset. I will begin with the property damage.

#splitting dataset by propdmgexp
with.prefixes <- filter(econ.data,
                        propdmgexp=="k"|propdmgexp=="K"|
                          propdmgexp=="m"|propdmgexp=="M"|
                          propdmgexp=="b"|propdmgexp=="B"
                        )

without.prefixes <- filter(econ.data,
                        !(propdmgexp=="k"|propdmgexp=="K"|
                          propdmgexp=="m"|propdmgexp=="M"|
                          propdmgexp=="b"|propdmgexp=="B") |
                          is.na(propdmgexp)
                        )

#adding true property damage values
with.prefixes <- mutate(with.prefixes, true.prop.dmg = 
         if(propdmgexp=="k"||propdmgexp=="K"){
           propdmg * 1000
         } else if (propdmgexp=="m"||propdmgexp=="M"){
           propdmg * 10^6
         } else if (propdmgexp=="b"||propdmgexp=="B"){
           propdmg * 10^9
         }
      )

without.prefixes <- mutate(without.prefixes, true.prop.dmg=propdmg)

#recombining
new.econ.data <- rbind(with.prefixes, without.prefixes)

I repeat the process with this new dataset again for the crop damage entries.

with.prefixes <- filter(new.econ.data,
                        cropdmgexp=="k"|cropdmgexp=="K"|
                        cropdmgexp=="m"|cropdmgexp=="M"|
                        cropdmgexp=="b"|cropdmgexp=="B"
                        )

without.prefixes <- filter(new.econ.data,
                          !(cropdmgexp=="k"|cropdmgexp=="K"|
                          cropdmgexp=="m"|cropdmgexp=="M"|
                          cropdmgexp=="b"|cropdmgexp=="B") |
                          is.na(cropdmgexp)
                        )

with.prefixes <- mutate(with.prefixes, true.crop.dmg = 
         if(cropdmgexp=="k"||cropdmgexp=="K"){
           cropdmg * 1000
         } else if (cropdmgexp=="m"||cropdmgexp=="M"){
           cropdmg * 10^6
         } else if (cropdmgexp=="b"||cropdmgexp=="B"){
           cropdmg * 10^9
         }
      )

without.prefixes <- mutate(without.prefixes, true.crop.dmg = cropdmg)
new.econ.data <- rbind(with.prefixes, without.prefixes)

Now that the adjustments are made to the data, I will create a final column that considers the total economic damages from each entry and report from that metric. I adjust the evtype factor to display the results in a descending order when plotted.

econ.totals <- summarize(group_by(new.econ.data, evtype),
                         total.prop.dmg=sum(true.prop.dmg),
                         total.crop.dmg=sum(true.crop.dmg)
                         )

econ.totals <- mutate(econ.totals,
                      total.damages=econ.totals$total.crop.dmg +
                        econ.totals$total.prop.dmg)

top.ten.econ <- arrange(econ.totals, desc(total.damages))[1:10,]
top.ten.econ$evtype <- factor(top.ten.econ$evtype,
                              levels=top.ten.econ$evtype[10:1])

That final adjustment to the economic data is the last step in the data processing.

Results

The top ten harmful events are plotted below.

harm <- ggplot(harm.top.ten, aes(x=evtype, y=harm.totals/1000))
harm.plot <- harm + 
  geom_bar(stat="identity", fill="#3366AA", color="Black") +
  coord_flip() +
  ggtitle("Top Ten Events by Total Harm Events (fatality and injury)") +
  xlab("Event type") +
  ylab("Total Harm Events (in Thousands)")
print(harm.plot)

The following chart summarizes the top ten events which have the greatest economic consequences.

econ <- ggplot(top.ten.econ, aes(x=evtype, y=total.damages/(10^9)))
econ.plot <- econ +
  geom_bar(stat="identity", color="Black", fill="#3366AA") +
  coord_flip() +
  ylab("Total Economic Damages (in Billions of Dollars)") +
  xlab("Event Category")+
  ggtitle("Top Ten Events by Total Economic Damages")
print(econ.plot)