Reproducible Research: Peer Assessment 2

Synopsis

The primary purpose of this report is to explore and process the NOAA Storm Database in order to answer the following questions:

Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?

In addition, the following report contains a Results section and several plots to illustrate the analysis and the answers to these two questions. All computer code used in the analysis, as well as links to the original data, is included.

Accomplishing these two objectives took the following steps:

Downloading the data set
Performing an initial inspection of the data
Extracting data relevant for the analysis
Providing an overview of deaths, injuries, and damage
Performing the analyses needed to answer the two questions
Providing a summary of the analysis

Downloading the data set

The data was downloaded as a bzip2-compressed file from https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2. The CSV file it contains, ‘repdata-data-StormData.csv’, was then loaded in RStudio into the data frame ‘rawdata’.

# unzip() cannot open bzip2 archives, but read.csv() decompresses them transparently
data.file <- if (file.exists('repdata-data-StormData.csv')) 'repdata-data-StormData.csv' else 'repdata-data-StormData.csv.bz2'
rawdata <- read.csv(data.file)
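The download step itself is not shown in the code. A minimal sketch of how it could be scripted follows; `load.storm.data` is a hypothetical helper name, and the demonstration runs on a tiny made-up table written to a temporary bzip2 file rather than on the real archive. It relies on `read.csv()` detecting bzip2 compression when it opens a file, so no separate decompression step is needed.

```r
# Hypothetical helper: fetch the archive once, then read it without unpacking.
# The url argument would be the cloudfront link given in the text.
load.storm.data <- function(url, archive = "repdata-data-StormData.csv.bz2") {
    if (!file.exists(archive)) {
        download.file(url, destfile = archive, mode = "wb")
    }
    # read.csv() opens a text-mode file connection, which detects bzip2 compression
    read.csv(archive, stringsAsFactors = FALSE)
}

# Self-contained demonstration on a made-up table (not the real storm data)
demo.archive <- tempfile(fileext = ".csv.bz2")
con <- bzfile(demo.archive, "w")
write.csv(data.frame(EVTYPE = c("TORNADO", "HAIL"), FATALITIES = c(2, 0)),
          con, row.names = FALSE)
close(con)

# The archive already exists, so nothing is downloaded here
demo <- load.storm.data(url = NA, archive = demo.archive)
```

Setting `stringsAsFactors = FALSE` is a choice made for this sketch; it loads text columns such as EVTYPE as character vectors from the start.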

Initial inspection of the data

raw.cols <- ncol(rawdata) # Number of raw data columns
raw.rows <- nrow(rawdata) # Number of raw data rows
event.names.unique <- length(unique(rawdata$EVTYPE)) # Unique event types

paste("After loading the data, a review of the data set revealed that it contained",  raw.cols, "columns and",format(raw.rows, big.mark=',') ,"rows of information, with each row having a unique reference number corresponding to a particular event.", sep=" ")

## [1] "After loading the data, a review of the data set revealed that it contained 37 columns and 902,297 rows of information, with each row having a unique reference number corresponding to a particular event."

paste("These events were associated with", event.names.unique,"unique event types.", sep=" ")

## [1] "These events were associated with 985 unique event types."

Data processing

The purpose of the data processing was to convert the raw data into a format suitable for an analysis that would address the two key questions of this report. The data processing activities included two actions:

Extracting the relevant data for further evaluation
Adding or removing data to aid in the analysis

Extracting relevant data

A review of the 37 columns revealed that only four columns contained information that was relevant to the questions at hand:

EVTYPE - Event type
FATALITIES - Number of deaths
INJURIES - Number of injuries
PROPDMG - Amount of property damage

These four columns, along with a fifth column, REFNUM, which had a unique reference number for each observation, were placed into the data frame object ‘keydata’ for further analysis. Also, the column names were changed for clarity to the following:

EVTYPE - event.type
FATALITIES - deaths
INJURIES - injuries
PROPDMG - damage
REFNUM - event.id

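# Included only relevant columns plus reference ID numbers for further analysis
# (selected by name; these are columns 8, 23:25, and 37 of the raw data)
keydata <- rawdata[,c("EVTYPE","FATALITIES","INJURIES","PROPDMG","REFNUM")]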

# Changed the column names of the keydata data frame
colnames(keydata) <- c("event.type", "deaths", "injuries", "damage", "event.id")
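In the NOAA file, the column next to PROPDMG, named PROPDMGEXP, holds a magnitude code for each damage figure, and the extraction above does not use it. A minimal sketch of how the two columns could be combined into dollar amounts follows. It assumes that K, M, and B stand for thousands, millions, and billions, and it treats every other code as a multiplier of one, which is a simplification. The function name `scale.damage` and the toy rows are made up for illustration.

```r
# Made-up rows in the shape of the PROPDMG / PROPDMGEXP columns
toy <- data.frame(PROPDMG    = c(25, 2.5, 1, 10),
                  PROPDMGEXP = c("K", "M", "B", ""),
                  stringsAsFactors = FALSE)

# Assumed multipliers; any code not listed here is treated as 1 (plain dollars)
multiplier <- c(K = 1e3, M = 1e6, B = 1e9)

# Hypothetical helper: scale each amount by the multiplier for its code
scale.damage <- function(amount, code) {
    m <- multiplier[toupper(code)]   # NA where the code is not K, M, or B
    m[is.na(m)] <- 1
    unname(amount * m)
}

toy$damage.usd <- scale.damage(toy$PROPDMG, toy$PROPDMGEXP)
```

Whether codes outside K, M, and B should be mapped to one, dropped, or handled some other way is a decision the analyst would need to make after inspecting the column.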

Adding or removing data to aid in the analysis

To aid in further analysis, three columns of logical vectors (death.event, injury.event, and damage.event) were added to identify observations involving fatalities, injuries, or damage. Also, the ‘event.type’ column was converted from a factor, which R stores as integer codes, to type character.

# Added columns of logical vectors to indicate the presence of deaths, injuries, or damage
keydata$death.event <- keydata$deaths > 0
keydata$injury.event <- keydata$injuries > 0
keydata$damage.event <- keydata$damage > 0

# Changing the event.type column from a factor (integer codes) to type character
keydata$event.type <- as.character(keydata$event.type)

# Logical vector identifies observations (rows) without deaths, injuries, or damage
no.harm <- ((!keydata$death.event)&(!keydata$injury.event)&(!keydata$damage.event))
removed.rows <- sum(no.harm) # Number of rows to be removed

keydata <- keydata[!no.harm,] # Removed rows lacking deaths, injuries, or damage
unique.event.names <- unique(keydata$event.type)

paste("There were",format(removed.rows, big.mark=','), "events that did not result in any deaths, injuries, or damage, and they were removed from the 'keydata' data frame.", sep=" ")

## [1] "There were 653,495 events that did not result in any deaths, injuries, or damage, and they were removed from the 'keydata' data frame."

paste("The remaining ", format(nrow(keydata), big.mark=','), " rows of data, which accounted for ", format(100*nrow(keydata)/nrow(rawdata), digits = 4), "% of the ",format(nrow(rawdata), big.mark=',')," observations, represented events associated with at least one death or injury to a person, or some level of damage to property, and is the group of observations that would be subjected to further analysis.", sep = "")

## [1] "The remaining 248,802 rows of data, which accounted for 27.57% of the 902,297 observations, represented events associated with at least one death or injury to a person, or some level of damage to property, and is the group of observations that would be subjected to further analysis."
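The row filter above can also be written without the three intermediate logical columns. A minimal sketch on made-up rows (the `toy` data frame is not taken from the storm data), using `rowSums()` to count how many of the three measures are positive in each row:

```r
# Made-up observations: the first and last rows record no harm at all
toy <- data.frame(deaths   = c(0, 1, 0, 0, 0),
                  injuries = c(0, 0, 3, 0, 0),
                  damage   = c(0, 0, 0, 5, 0))

# TRUE for rows where at least one of the three measures is above zero
any.harm <- rowSums(toy[, c("deaths", "injuries", "damage")] > 0) > 0

kept         <- toy[any.harm, ]   # rows retained for analysis
removed.rows <- sum(!any.harm)    # rows dropped
```

This is equivalent to negating the conjunction of the three "no harm" tests, and it extends to more measures by adding column names.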

Analysis - Performing an overview of the data related to deaths, injuries, and damage

The analysis of the relevant information in the data set started with a general overview of the events resulting in deaths, injuries, and damage to determine how the magnitudes of those deaths, injuries, and damage events were distributed. Most of these events were damage events, and a relatively small proportion involved death or injury.

Of particular interest was whether the distributions of these variables had similar or dissimilar shapes, and what kinds of events were associated with the greatest magnitudes of deaths, injuries, and damage.

Death event details

Below is a summary of the highlights of the death event details, including a histogram of the log of the number of deaths per death event. The log was used because a histogram on a linear scale would have been skewed very heavily to the right, with most of the values below five.

number.death.events <- sum(keydata$deaths>0) 
number.death.vector <- keydata[keydata$deaths>0,"deaths"]
deaths.total <- sum(keydata[keydata$deaths>0,"deaths"])
death.90th.percentile <- quantile(keydata[keydata$deaths>0,"deaths"],0.9)

paste("- ",format(100*number.death.events/nrow(rawdata), digits = 2), "% of the ",format(nrow(rawdata), big.mark=',')," observations, representing ", format(number.death.events, big.mark=','), " events, involved one or more deaths.", sep = "")

## [1] "- 0.77% of the 902,297 observations, representing 6,974 events, involved one or more deaths."

paste("- There was a total of ",format(deaths.total, big.mark=',')," deaths. The mean number of deaths per death event was ",format((deaths.total/number.death.events), digits=3), " and the median number of deaths was ", median(keydata[keydata$deaths>0,"deaths"]),".", sep = "")

## [1] "- There was a total of 15,145 deaths. The mean number of deaths per death event was 2.17 and the median number of deaths was 1."

paste("- ", death.90th.percentile," represents the 90th percentile for the number of deaths in a death event, and the ",sum(keydata$deaths>=death.90th.percentile), " events that had a number of deaths at or above the 90th percentile, representing ",format((100*sum(keydata$deaths>=death.90th.percentile)/number.death.events),digits=4),"% of all death events, accounted for ",format(100*sum(keydata[keydata$deaths>=death.90th.percentile,"deaths"])/deaths.total, digits=4),"% of all deaths.", sep = "")

## [1] "- 3 represents the 90th percentile for the number of deaths in a death event, and the 968 events that had a number of deaths at or above the 90th percentile, representing 13.88% of all death events, accounted for 53.77% of all deaths."

# Histogram  of the log10 value of the number of deaths
hist(log10(number.death.vector), breaks=40, col="red",
     main="Histogram of log10 of death event values with 90th percentile bar",
     xlab="Log10 of number of deaths in a death event")
abline(v=log10(death.90th.percentile),lwd=3)

Injury event details

Below is a summary of the highlights of the injury event details, including a histogram of the log of the number of injuries per injury event. The log was used because a histogram on a linear scale would have been skewed very heavily to the right, with most of the values below five.

number.injury.events <- sum(keydata$injuries>0)
number.injury.vector <- keydata[keydata$injuries>0,"injuries"]
injuries.total <- sum(keydata[keydata$injuries>0,"injuries"])
injury.90th.percentile <- quantile(keydata[keydata$injuries>0,"injuries"],0.9)

paste("- ",format(100*number.injury.events/nrow(rawdata), digits = 2), "% of the ",format(nrow(rawdata), big.mark=',')," observations, representing ", format(number.injury.events, big.mark=','), " events, involved one or more injuries.", sep = "")

## [1] "- 2% of the 902,297 observations, representing 17,604 events, involved one or more injuries."

paste("- There was a total of ",format(injuries.total, big.mark=',')," injuries. The mean number of injuries per injury event was ",format((injuries.total/number.injury.events), digits=3), " and the median number of injuries was ", median(keydata[keydata$injuries>0,"injuries"]),".", sep = "")

## [1] "- There was a total of 140,528 injuries. The mean number of injuries per injury event was 7.98 and the median number of injuries was 2."

paste("- ", injury.90th.percentile," represents the 90th percentile for the number of injuries in an injury event, and the ",format(sum(keydata$injuries>=injury.90th.percentile), big.mark=','), " events that had a number of injuries at or above the 90th percentile, representing ",format((100*sum(keydata$injuries>=injury.90th.percentile)/number.injury.events),digits=4),"% of all injury events, accounted for ",format(100*sum(keydata[keydata$injuries>=injury.90th.percentile,"injuries"])/injuries.total, digits=3),"% of all injuries.", sep = "")

## [1] "- 12 represents the 90th percentile for the number of injuries in an injury event, and the 1,892 events that had a number of injuries at or above the 90th percentile, representing 10.75% of all injury events, accounted for 72.5% of all injuries."

hist(log10(number.injury.vector), breaks=40, col="blue",
     main="Histogram of log10 of injury event values with 90th percentile bar",
     xlab="Log10 of number of injuries")
abline(v=log10(injury.90th.percentile),lwd=3)

Damage event details

Below is a summary of the highlights of the damage event details, including a histogram of the log of the damage amounts. The log was used because the highest-level damage events were several orders of magnitude larger than the lowest-level ones, and many of the events were concentrated in the lower range of values.

number.damage.events <- sum(keydata$damage>0)
number.damage.vector <- keydata[keydata$damage>0,"damage"]
damage.total <- sum(keydata[keydata$damage>0,"damage"])
damage.90th.percentile <- quantile(keydata[keydata$damage>0,"damage"],0.9)

paste("- ",format(100*number.damage.events/nrow(rawdata), digits = 3), "% of the ",format(nrow(rawdata), big.mark=',')," observations, representing ", format(number.damage.events, big.mark=','), " events, involved some level of damage.", sep = "")

## [1] "- 26.5% of the 902,297 observations, representing 239,174 events, involved some level of damage."

paste("- There was a total of $",format(damage.total, big.mark=',')," million worth of damage. The mean amount of damage per damage event was $",format((damage.total/number.damage.events), digits=3), " million and the median damage amount was $", median(keydata[keydata$damage>0,"damage"])," million.", sep = "")

## [1] "- There was a total of $10,884,500 million worth of damage. The mean amount of damage per damage event was $45.5 million and the median damage amount was $10 million."

paste("- $", damage.90th.percentile," million represents the 90th percentile for the cost of a damage event, and the ",format(sum(keydata$damage>=damage.90th.percentile), big.mark=','), " events that had an amount of damage at or above the 90th percentile, representing ",format((100*sum(keydata$damage>=damage.90th.percentile)/number.damage.events),digits=4),"% of all events causing damage, accounted for ",format(100*sum(keydata[keydata$damage>=damage.90th.percentile,"damage"])/damage.total, digits=3),"% of all damage costs.", sep = "")

## [1] "- $100 million represents the 90th percentile for the cost of a damage event, and the 29,666 events that had an amount of damage at or above the 90th percentile, representing 12.4% of all events causing damage, accounted for 74% of all damage costs."

hist(log10(number.damage.vector), breaks=40, col="green",
     main="Histogram of log10 of damage event values with 90th percentile bar",
     xlab="Log10 of damage value")
abline(v=log10(damage.90th.percentile),lwd=3)
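The death, injury, and damage summaries above repeat the same calculations on different columns. One way to avoid the repetition is a small helper function. In the sketch below, `summarize.harm` is a hypothetical name, and the input is made up for illustration rather than taken from the storm data.

```r
# Hypothetical helper: summary statistics for the positive values of one harm measure
summarize.harm <- function(values, prob = 0.9) {
    positive <- values[values > 0]                 # events with some harm
    cutoff   <- unname(quantile(positive, prob))   # e.g. the 90th percentile
    top      <- positive[positive >= cutoff]       # events at or above the cutoff
    list(events        = length(positive),
         total         = sum(positive),
         mean          = mean(positive),
         median        = median(positive),
         cutoff        = cutoff,
         top.events    = length(top),
         top.share.pct = 100 * sum(top) / sum(positive))
}

# Made-up death counts, one value per event
toy.deaths <- c(0, 0, 1, 1, 1, 1, 2, 2, 3, 10)
s <- summarize.harm(toy.deaths)
```

With the real data, the same call could be applied in turn to `keydata$deaths`, `keydata$injuries`, and `keydata$damage`, and the returned list could feed the `paste()` statements.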

Analysis

The raw data contained a substantial amount of information that was not needed to answer the two questions addressed by this report. Once the unnecessary information was eliminated, and once flags were added to identify the harmful events (specifically, events that involved death, injury, or measurable damage), it became possible to review how these events of interest were distributed. The distributions of the magnitudes of deaths, injuries, and damage were highly skewed, with a large proportion of these events having a small number of deaths or injuries, or a relatively low level of economic damage.

Events that cause harm

Events that are harmful to public health are assumed to be those that cause deaths, injuries, or economic damage. The type of each such event is recorded in the ‘event.type’ variable. However, the preceding summaries showed that most of these events cause relatively low levels of harm, and that a relatively small proportion of events accounts for a large fraction of the total harm.

For deaths, injuries, and damage alike, the events at or above the respective 90th percentile caused more than half of all harm. Focusing on the events that were most harmful to public health will help to identify the types of events that cause the most harm.

In the remainder of this report, events that are most harmful to public health will be defined as those that were associated with death, injury, or damage events that were at or above the 90th percentile for their respective category.

keydata$percentile90 <- keydata$deaths>=death.90th.percentile | keydata$injuries>=injury.90th.percentile | keydata$damage >= damage.90th.percentile
number.significant.harm <- sum(keydata$percentile90)

paste("- ",format(100*number.significant.harm/nrow(rawdata), digits = 3), "% of the ",format(nrow(rawdata), big.mark=',')," observations, representing ", format(number.significant.harm, big.mark=','), " events, are considered to be those that were most harmful to public health.", sep = "")

## [1] "- 3.5% of the 902,297 observations, representing 31,621 events, are considered to be those that were most harmful to public health."

paste("- These noteworthy events represented ", format(100*number.significant.harm/nrow(keydata), digits = 3), "% of those ",format(nrow(keydata), big.mark=',')," events that caused at least one death or injury, or that caused some level of damage.", sep = "")

## [1] "- These noteworthy events represented 12.7% of those 248,802 events that caused at least one death or injury, or that caused some level of damage."

paste("There were ", length(unique(keydata[keydata$percentile90, "event.type"])), " unique descriptions used for those events that were considered most harmful to human health.", sep = "")

## [1] "There were 192 unique descriptions used for those events that were considered most harmful to human health."

# Object containing only those events that were most harmful to human health
big.events <- keydata[keydata$percentile90,c("event.type","deaths","injuries","damage")]
rownames(big.events) <- NULL

# Unique descriptions of the most harmful events
unique.desc <- unique(keydata[keydata$percentile90, "event.type"])

# Number of unique descriptions
unique.desc.num <- length(unique.desc)

# Create an object for summary information for the three kinds of 90th percentile plus events
big.event.summary <- NULL
big.event.summary <- cbind(big.event.summary,unique.desc)

# Death sums from these unique events
big.death.sum <- sapply(1:unique.desc.num, function(x) {
        sum(big.events[big.events$event.type==unique.desc[x],"deaths"])
})
big.event.summary <- cbind(big.event.summary,big.death.sum)

# Injury sums from these unique events
big.injury.sum <- sapply(1:unique.desc.num, function(x) {
        sum(big.events[big.events$event.type==unique.desc[x],"injuries"])
})
big.event.summary <- cbind(big.event.summary,big.injury.sum)

# Damage sums from these unique events
big.damage.sum <- sapply(1:unique.desc.num, function(x) {
        sum(big.events[big.events$event.type==unique.desc[x],"damage"])
})
big.event.summary <- cbind(big.event.summary,big.damage.sum)

# Ensure that the big.event.summary object is a data frame
big.event.summary <- as.data.frame(big.event.summary)


# big.event.summary by deaths
big.event.death.sort <- big.event.summary[order(-big.death.sum),]
# big.event.summary by injuries
big.event.injury.sort <- big.event.summary[order(-big.injury.sum),]
# big.event.summary by damage
big.event.damage.sort <- big.event.summary[order(-big.damage.sum),]
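Because `cbind()` combines a character vector with numeric vectors into a character matrix, the totals in `big.event.summary` are stored as text, which is why the sorting above relies on the separate numeric vectors. As an alternative sketch, `aggregate()` can compute the per-type sums in one step and keep them numeric. The `toy.events` data frame below is made up and is not taken from the storm data.

```r
# Made-up events, one row per event
toy.events <- data.frame(event.type = c("TORNADO", "FLOOD", "TORNADO", "HAIL", "FLOOD"),
                         deaths     = c(5, 1, 3, 0, 2),
                         injuries   = c(40, 0, 12, 1, 3),
                         damage     = c(250, 500, 100, 50, 75),
                         stringsAsFactors = FALSE)

# One row per event type, with numeric totals in every column
toy.summary <- aggregate(cbind(deaths, injuries, damage) ~ event.type,
                         data = toy.events, FUN = sum)

# Sort by total deaths, largest first
toy.by.deaths <- toy.summary[order(-toy.summary$deaths), ]
```

Sorting on a column of the summary itself, as in the last line, removes the need to keep the standalone sum vectors in step with the data frame.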

paste("Top 10 event types for total deaths in 90th percentile or above death events:")

## [1] "Top 10 event types for total deaths in 90th percentile or above death events:"

print(big.event.death.sort[1:10,c("unique.desc","big.death.sum")], row.names=FALSE)

##     unique.desc big.death.sum
##         TORNADO          5071
##  EXCESSIVE HEAT          1405
##            HEAT           761
##     FLASH FLOOD           398
##           FLOOD           195
##       HEAT WAVE           161
##       TSTM WIND           122
##    WINTER STORM            94
##    EXTREME HEAT            91
##       HIGH WIND            89

paste("Top 10 event types for total injuries in 90th percentile or above injury events:")

## [1] "Top 10 event types for total injuries in 90th percentile or above injury events:"

print(big.event.injury.sort[1:10,c("unique.desc","big.injury.sum")], row.names=FALSE)

##        unique.desc big.injury.sum
##            TORNADO          79864
##              FLOOD           6601
##     EXCESSIVE HEAT           6202
##          TSTM WIND           2246
##               HEAT           2004
##          ICE STORM           1889
##  HURRICANE/TYPHOON           1256
##        FLASH FLOOD           1163
##       WINTER STORM           1011
##               HAIL            858

paste("Top 10 event types for total damage amounts in 90th percentile or above damage events:")

## [1] "Top 10 event types for total damage amounts in 90th percentile or above damage events:"

print(big.event.damage.sort[1:10,c("unique.desc","big.damage.sum")], row.names=FALSE)

##         unique.desc big.damage.sum
##             TORNADO     2772450.25
##         FLASH FLOOD     1122434.52
##               FLOOD      776705.14
##           TSTM WIND      692550.19
##                HAIL       451042.5
##   THUNDERSTORM WIND       434193.9
##           LIGHTNING       432924.4
##  THUNDERSTORM WINDS         291592
##           HIGH WIND       245969.6
##        WINTER STORM       109352.9
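Several descriptions in the tables above look like spelling variants of one event type, for example TSTM WIND, THUNDERSTORM WIND, and THUNDERSTORM WINDS. Whether such variants should be merged is a judgment call that this report does not make. The sketch below only illustrates how a merge could be done: `normalize.event` is a hypothetical function, its two patterns are illustrative rather than a complete mapping, and the four rows are copied from the damage table above.

```r
# Hypothetical clean-up; the patterns are illustrative, not an official mapping
normalize.event <- function(x) {
    x <- toupper(trimws(x))
    x <- sub("^TSTM", "THUNDERSTORM", x)                        # expand the abbreviation
    x <- sub("^THUNDERSTORM WINDS?$", "THUNDERSTORM WIND", x)   # fold singular and plural
    x
}

# Descriptions and damage sums copied from the damage table
damage.top <- data.frame(
    event.type = c("TSTM WIND", "THUNDERSTORM WIND", "THUNDERSTORM WINDS", "HAIL"),
    damage     = c(692550.19, 434193.9, 291592, 451042.5),
    stringsAsFactors = FALSE)

# Group the variants and total the damage for each group
damage.top$event.group <- normalize.event(damage.top$event.type)
grouped <- aggregate(damage ~ event.group, data = damage.top, FUN = sum)
```

Applying a mapping like this before the sums are computed could change the ranking of event types, so any such mapping would need to be documented alongside the results.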

Results

The analysis section identified the types of events associated with the greatest magnitudes of harmful effects on the public, specifically the events with the greatest magnitudes of deaths, injuries, and damage.

Government or municipal managers who are responsible for preparing for severe weather events may need to prioritize resources among different types of events. As a first step toward managing and reducing threats to public health and safety, it may be prudent to identify the types of events that have been associated with the greatest magnitude of harmful weather-related outcomes across the United States and to see which, if any, are relevant to their stakeholders, especially where a type of event is associated with significant deaths, injuries, and economic losses.

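# Flagging top-10 injury event types that also appear in both the death and damage top-10 lists
# (%in% tests cannot be chained, so the two membership tests are combined with &)
top.injury.types <- as.character(big.event.injury.sort$unique.desc[1:10])
matching.events.vector <- top.injury.types %in% as.character(big.event.death.sort$unique.desc[1:10]) & top.injury.types %in% as.character(big.event.damage.sort$unique.desc[1:10])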


# Event types present in each top ten list
big.event.intersection.death.injury <- intersect(as.character(big.event.injury.sort$unique.desc[1:10]), as.character(big.event.death.sort$unique.desc[1:10]))

big.event.intersection.all.harm <- intersect(big.event.intersection.death.injury, as.character(big.event.damage.sort$unique.desc[1:10]))

paste("The following ",length(big.event.intersection.all.harm), " harmful event descriptions appeared in all three top 10 lists (most deaths, most injuries, and highest damage costs) for events whose magnitude of harm was at or above the 90th percentile for the respective type of harm.", sep="")

## [1] "The following 5 harmful event descriptions appeared in all three top 10 lists (most deaths, most injuries, and highest damage costs) for events whose magnitude of harm was at or above the 90th percentile for the respective type of harm."

cat(sort(big.event.intersection.all.harm),sep="\n")

## FLASH FLOOD
## FLOOD
## TORNADO
## TSTM WIND
## WINTER STORM
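The two nested `intersect()` calls can be generalized to any number of lists with `Reduce()`. The sketch below uses the three top 10 lists exactly as they appear in the tables of the Analysis section; the variable names are chosen for this example only.

```r
# Top 10 lists copied from the three tables in the Analysis section
top.deaths   <- c("TORNADO", "EXCESSIVE HEAT", "HEAT", "FLASH FLOOD", "FLOOD",
                  "HEAT WAVE", "TSTM WIND", "WINTER STORM", "EXTREME HEAT", "HIGH WIND")
top.injuries <- c("TORNADO", "FLOOD", "EXCESSIVE HEAT", "TSTM WIND", "HEAT",
                  "ICE STORM", "HURRICANE/TYPHOON", "FLASH FLOOD", "WINTER STORM", "HAIL")
top.damage   <- c("TORNADO", "FLASH FLOOD", "FLOOD", "TSTM WIND", "HAIL",
                  "THUNDERSTORM WIND", "LIGHTNING", "THUNDERSTORM WINDS",
                  "HIGH WIND", "WINTER STORM")

# Reduce() applies intersect() pairwise across the whole list
common <- sort(Reduce(intersect, list(top.deaths, top.injuries, top.damage)))
```

The result holds the same five descriptions listed above, and adding a fourth ranking would only require appending it to the list.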

Todd Curtis

February 16, 2015
