Synopsis

This is the R Markdown document prepared for the purposed of the second peer assessment for the Reproducible Reasearch course offered via the Coursera. The analysis focuses on events exerting greatest harm, where harm is defined as \(harm = injury + fatality\) and events exerting greatest damage where damage is defined as \(damage = property_damge + crops damge\).

Author: Konrad Zdeb Date: July, 2014

Introduction

The following analysis is concerned with answering the requirements of the Reproducible Research course delivered via Coursera. The analysis analyses the provided according to the extent of economic damage and associated harm with a set of events provided in the data. The events accounting for most damage/harm are graphed and subsequently listed in separate tables.

Data Processing

The section is concerned with processing the data. The data is processed separately for the questions in the assignment.

Functions

I’m using this section to declare functions that I call later during the analysis. For instance, I use packages that may not be installed so I’m using this function to test whether the package is installed.

# Function to check whether the package is installed
package.test <- function(x) { 
    if (!require(x,character.only = TRUE)) 
      { 
        install.packages(x,dep=TRUE) 
          if(!require(x,character.only = TRUE)) stop("Package not found")
      } 
    }

# Capitalise each word
proper <- function(x)
  paste0(toupper(substr(x, 1, 1)), tolower(substring(x, 2)))

Data Sourcing

The code below checks whether the required data set is available in the working directory. If the data file is available you can skip this part of code and move to the next section that reads the data.

# I'm using the RCurl, the code below will install it if not availble.
package.test("RCurl")
## Loading required package: RCurl
## Loading required package: bitops
## 
## Attaching package: 'RCurl'
## 
## The following object is masked from 'package:R.utils':
## 
##     reset
## 
## The following object is masked from 'package:R.oo':
## 
##     clone
# I will use bunzip from R Utils to unpack the the bz archive
package.test("R.utils")
# Now i check whether the data exists and download if it doesn't exists
if (!file.exists("repdata_data_StormData.csv")) {
        # Check wthether the zip archive is available.
        if (!file.exists("repdata_data_StormData.csv.bz2")) {
                # Define data source.
                dataurl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"  
                # Download, save and unzip the file.
                library(RCurl)
                bin <- getBinaryURL(dataurl, ssl.verifypeer=FALSE, 
                                    noprogress = FALSE)
                con <- file("repdata_data_StormData.csv.bz2",
                            open = "wb")
                writeBin(bin, con)
                close(con)
                bunzip2("repdata_data_StormData.csv.bz2")
        } else
                bunzip2("repdata_data_StormData.csv.bz2")
}
# Read the file path
file <- list.files(path=getwd(), pattern = "repdata_data_StormData.csv", 
                   all.files = TRUE, recursive = TRUE, full.names = TRUE)
# Clean redundant objects after download
if (exists("bin")) rm(bin)
if (exists("con")) rm(con)
if (exists("dataurl")) rm(dataurl)

Reading Data

The code below reads the data to the data frame.

# The code simply reads the CSV file
dta <- read.csv(file = file, header = TRUE)
# I'm also renaming one column that I noticed has underscore at the end
names(dta)[1] <- "STATE"

Analysis

Relative Harmffulness of The Events

This section answer the first question of the assignment: Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?.

Data Preparation

Before progressing with the analysis the data is formatted accordingly. First, the classes of columns in the data frame are verified.

sapply(dta, class)
##      STATE   BGN_DATE   BGN_TIME  TIME_ZONE     COUNTY COUNTYNAME 
##  "numeric"   "factor"   "factor"   "factor"  "numeric"   "factor" 
##      STATE     EVTYPE  BGN_RANGE    BGN_AZI BGN_LOCATI   END_DATE 
##   "factor"   "factor"  "numeric"   "factor"   "factor"   "factor" 
##   END_TIME COUNTY_END COUNTYENDN  END_RANGE    END_AZI END_LOCATI 
##   "factor"  "numeric"  "logical"  "numeric"   "factor"   "factor" 
##     LENGTH      WIDTH          F        MAG FATALITIES   INJURIES 
##  "numeric"  "numeric"  "integer"  "numeric"  "numeric"  "numeric" 
##    PROPDMG PROPDMGEXP    CROPDMG CROPDMGEXP        WFO STATEOFFIC 
##  "numeric"   "factor"  "numeric"   "factor"   "factor"   "factor" 
##  ZONENAMES   LATITUDE  LONGITUDE LATITUDE_E LONGITUDE_    REMARKS 
##   "factor"  "numeric"  "numeric"  "numeric"  "numeric"   "factor" 
##     REFNUM 
##  "numeric"

Relevant columns are kept to reflect sum of events.

# Aggregated data table is created
dta.evs <- aggregate(cbind(INJURIES,FATALITIES)~EVTYPE, data = dta, 
                     FUN = sum, na.rm = TRUE)
# Only the highest values are kept
dta.evs <- subset(dta.evs , FATALITIES > quantile(dta.evs$FATALITIES, 0.95)
                  | INJURIES > quantile(dta.evs$INJURIES, 0.95))
# Melting data for the graph
package.test("reshape2")
## Loading required package: reshape2
dta.evs.melt <- melt(dta.evs,id="EVTYPE")
# Words are being capitalised - aesthetics
dta.evs.melt <- data.frame(sapply(dta.evs.melt, proper))
# Numeric values are converted back to numeric
dta.evs.melt$value <- as.numeric(dta.evs.melt$value)

Graphing

Visual analysis of the data accounting for the Event Type and the Total Harm which is derived from the \(harm = injuries + fatalities\)

package.test("ggplot2")
## Loading required package: ggplot2
ggplot(data = dta.evs.melt, aes(x = factor(EVTYPE), y = value,
                                fill = factor(variable))) +
  geom_bar(position="stack", stat="identity") +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5),
        panel.background = element_rect(fill = "white", colour = "grey"),
        legend.position = "bottom",
        legend.background = element_rect(colour = 'grey', 
                                         fill = 'white'),
        axis.title.x = element_text(size = 10, colour = 'black'),
        axis.title.y = element_text(size = 10, colour = 'black')) +
  scale_fill_discrete(name="Harm Type") +
  xlab("Event type") +
  ylab("Total Harm") +
  ggtitle("Event Types with Fatalities and Injuries\n(Only Highest)")

plot of chunk Events

Getting the highest values

# Aggregating the data to provide count for total fatalities and casulaties
dta.evs.melt.sum <- subset(dta.evs.melt, select=c("EVTYPE", "value"))
row.names(dta.evs.melt.sum) <- NULL
dta.evs.melt.sum[] <- lapply(dta.evs.melt.sum, function(x) type.convert(as.character(x)))
dta.evs.melt.sum <- aggregate(. ~ EVTYPE, dta.evs.melt.sum, sum)

# Getting highest values.
dta.evs.melt.sum <- subset(dta.evs.melt.sum, 
                        value > quantile(dta.evs.melt.sum$value, 0.90))

Answer

Assuming that harm defined as \(harm = injuries + fatalities\) The highest harm, above 90th percentile of total harm, is associated with those events

Eventype Harm
Fog 157
Lightning 157
Thunderstorm winds 167
Tornado 164
Tstm wind/hail 160
Wildfire 177

Economic Consequences

This section is concerned with the second part of the assignment and addresses the question about the economic impact of events.

Data Preparation

Initial master data table is manipulated in order to create a data frame with information pertaining to economic harmfulness of the events that would be further exploited.

Used variables

The analysis utilises the following variables - PROPDMG Property-related damage - CROPDMG Crops-related damge - PROPDMGEXP Multiplier - CROPDMGEXP Multiplier

# Relevant variables are extracted from the master data frame
dta.econ <- dta[,c("EVTYPE","PROPDMG","CROPDMG","PROPDMGEXP","CROPDMGEXP")]
# Letters are replaced with corresponding figures
dta.econ$PROPDMGEXP <- as.character(dta.econ$PROPDMGEXP)
dta.econ$PROPDMGEXP[dta.econ$PROPDMGEXP == "K"] <- "1000"
dta.econ$PROPDMGEXP[dta.econ$PROPDMGEXP == "M"] <- "1000000"
dta.econ$CROPDMGEXP <- as.character(dta.econ$CROPDMGEXP)
dta.econ$CROPDMGEXP[dta.econ$CROPDMGEXP == "K"] <- "1000"
dta.econ$CROPDMGEXP[dta.econ$CROPDMGEXP == "M"] <- "1000000"
# Cosmetic changes to names in the table
dta.econ <- data.frame(sapply(dta.econ, proper))
# Columns are converted to numbers
dta.econ$PROPDMGEXP <- as.numeric(dta.econ$PROPDMGEXP)
dta.econ$CROPDMGEXP <- as.numeric(dta.econ$CROPDMGEXP)
dta.econ$CROPDMG <- as.numeric(dta.econ$CROPDMG)
dta.econ$PROPDMG <- as.numeric(dta.econ$PROPDMG)
# Actual demage value is obtained
dta.econ$Property <- dta.econ$PROPDMGEXP * dta.econ$PROPDMG
dta.econ$Crops <- dta.econ$CROPDMGEXP * dta.econ$CROPDMG
# Aggregated dat frame is created for the analysis
dta.econ.evs <- aggregate(cbind(Property,Crops)~EVTYPE, data = dta.econ, 
                     FUN = sum, na.rm = TRUE)
# Subsetting the data to get highest values only
dta.econ.evs <- subset(dta.econ.evs, 
                       Property > quantile(dta.econ.evs$Property, 0.95) | 
                       Crops > quantile(dta.econ.evs$Crops, 0.95))
# Data is melted to make it easier to graph
dta.econ.melt <- melt(dta.econ.evs,id="EVTYPE")

Graphing

The data corresponding to the damage exerted by different events is graphed in the figure below.

# Force R not to use scientific notation
options("scipen"=100, "digits"=4)
# Make chart
ggplot(data = dta.econ.melt, aes(x = factor(EVTYPE), y = value,
                                fill = factor(variable))) +
  geom_bar(position="stack", stat="identity") +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5),
        panel.background = element_rect(fill = "white", colour = "grey"),
        legend.position = "bottom",
        legend.background = element_rect(colour = 'grey', 
                                         fill = 'white'),
        axis.title.x = element_text(size = 10, colour = 'black'),
        axis.title.y = element_text(size = 10, colour = 'black')) +
  scale_fill_discrete(name="Damage") +
  xlab("Event Type") +
  ylab("Total Damage") +
  ggtitle("Event Types According to Damage")

plot of chunk GraphEcon

Greatest economic consequences

# Get events with highest total damage
dta.econ.melt <- subset(dta.econ.melt, 
                        value > quantile(dta.econ.melt$value, 0.90))
# Delete column
dta.econ.melt$variable <- NULL

Answer

The events listed below accounts for the events where the total damage was above 90 percentile of the distribution, with total damage defined as \(damage = crops damage + property damage\).

Eventype EconomicDamage
Flash flood 82770199
Flood 40774629
Hail 82960317
Heavy snow 6061175
High wind 21419573
Lightning 41722998
Strong wind 9808685
Thunderstorm wind 151928497
Thunderstorm winds 55202603
Tornado 142318860
Tstm wind 215970405
Hail 8001752

Results

In summary most harmful events are:

Eventype Harm
Fog 157
Lightning 157
Thunderstorm winds 167
Tornado 164
Tstm wind/hail 160
Wildfire 177

with respect to the economic damage, most damaging events are

Eventype EconomicDamage
Flash flood 82770199
Flood 40774629
Hail 82960317
Heavy snow 6061175
High wind 21419573
Lightning 41722998
Strong wind 9808685
Thunderstorm wind 151928497
Thunderstorm winds 55202603
Tornado 142318860
Tstm wind 215970405
Hail 8001752