This is the R Markdown document prepared for the purposed of the second peer assessment for the Reproducible Reasearch course offered via the Coursera. The analysis focuses on events exerting greatest harm, where harm is defined as \(harm = injury + fatality\) and events exerting greatest damage where damage is defined as \(damage = property_damge + crops damge\).
Author: Konrad Zdeb Date: July, 2014
The following analysis is concerned with answering the requirements of the Reproducible Research course delivered via Coursera. The analysis analyses the provided according to the extent of economic damage and associated harm with a set of events provided in the data. The events accounting for most damage/harm are graphed and subsequently listed in separate tables.
The section is concerned with processing the data. The data is processed separately for the questions in the assignment.
I’m using this section to declare functions that I call later during the analysis. For instance, I use packages that may not be installed so I’m using this function to test whether the package is installed.
# Function to check whether the package is installed
package.test <- function(x) {
if (!require(x,character.only = TRUE))
{
install.packages(x,dep=TRUE)
if(!require(x,character.only = TRUE)) stop("Package not found")
}
}
# Capitalise each word
proper <- function(x)
paste0(toupper(substr(x, 1, 1)), tolower(substring(x, 2)))
The code below checks whether the required data set is available in the working directory. If the data file is available you can skip this part of code and move to the next section that reads the data.
# I'm using the RCurl, the code below will install it if not availble.
package.test("RCurl")
## Loading required package: RCurl
## Loading required package: bitops
##
## Attaching package: 'RCurl'
##
## The following object is masked from 'package:R.utils':
##
## reset
##
## The following object is masked from 'package:R.oo':
##
## clone
# I will use bunzip from R Utils to unpack the the bz archive
package.test("R.utils")
# Now i check whether the data exists and download if it doesn't exists
if (!file.exists("repdata_data_StormData.csv")) {
# Check wthether the zip archive is available.
if (!file.exists("repdata_data_StormData.csv.bz2")) {
# Define data source.
dataurl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
# Download, save and unzip the file.
library(RCurl)
bin <- getBinaryURL(dataurl, ssl.verifypeer=FALSE,
noprogress = FALSE)
con <- file("repdata_data_StormData.csv.bz2",
open = "wb")
writeBin(bin, con)
close(con)
bunzip2("repdata_data_StormData.csv.bz2")
} else
bunzip2("repdata_data_StormData.csv.bz2")
}
# Read the file path
file <- list.files(path=getwd(), pattern = "repdata_data_StormData.csv",
all.files = TRUE, recursive = TRUE, full.names = TRUE)
# Clean redundant objects after download
if (exists("bin")) rm(bin)
if (exists("con")) rm(con)
if (exists("dataurl")) rm(dataurl)
The code below reads the data to the data frame.
# The code simply reads the CSV file
dta <- read.csv(file = file, header = TRUE)
# I'm also renaming one column that I noticed has underscore at the end
names(dta)[1] <- "STATE"
This section answer the first question of the assignment: Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?.
Before progressing with the analysis the data is formatted accordingly. First, the classes of columns in the data frame are verified.
sapply(dta, class)
## STATE BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME
## "numeric" "factor" "factor" "factor" "numeric" "factor"
## STATE EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE
## "factor" "factor" "numeric" "factor" "factor" "factor"
## END_TIME COUNTY_END COUNTYENDN END_RANGE END_AZI END_LOCATI
## "factor" "numeric" "logical" "numeric" "factor" "factor"
## LENGTH WIDTH F MAG FATALITIES INJURIES
## "numeric" "numeric" "integer" "numeric" "numeric" "numeric"
## PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC
## "numeric" "factor" "numeric" "factor" "factor" "factor"
## ZONENAMES LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS
## "factor" "numeric" "numeric" "numeric" "numeric" "factor"
## REFNUM
## "numeric"
Relevant columns are kept to reflect sum of events.
# Aggregated data table is created
dta.evs <- aggregate(cbind(INJURIES,FATALITIES)~EVTYPE, data = dta,
FUN = sum, na.rm = TRUE)
# Only the highest values are kept
dta.evs <- subset(dta.evs , FATALITIES > quantile(dta.evs$FATALITIES, 0.95)
| INJURIES > quantile(dta.evs$INJURIES, 0.95))
# Melting data for the graph
package.test("reshape2")
## Loading required package: reshape2
dta.evs.melt <- melt(dta.evs,id="EVTYPE")
# Words are being capitalised - aesthetics
dta.evs.melt <- data.frame(sapply(dta.evs.melt, proper))
# Numeric values are converted back to numeric
dta.evs.melt$value <- as.numeric(dta.evs.melt$value)
Visual analysis of the data accounting for the Event Type and the Total Harm which is derived from the \(harm = injuries + fatalities\)
package.test("ggplot2")
## Loading required package: ggplot2
ggplot(data = dta.evs.melt, aes(x = factor(EVTYPE), y = value,
fill = factor(variable))) +
geom_bar(position="stack", stat="identity") +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5),
panel.background = element_rect(fill = "white", colour = "grey"),
legend.position = "bottom",
legend.background = element_rect(colour = 'grey',
fill = 'white'),
axis.title.x = element_text(size = 10, colour = 'black'),
axis.title.y = element_text(size = 10, colour = 'black')) +
scale_fill_discrete(name="Harm Type") +
xlab("Event type") +
ylab("Total Harm") +
ggtitle("Event Types with Fatalities and Injuries\n(Only Highest)")
# Aggregating the data to provide count for total fatalities and casulaties
dta.evs.melt.sum <- subset(dta.evs.melt, select=c("EVTYPE", "value"))
row.names(dta.evs.melt.sum) <- NULL
dta.evs.melt.sum[] <- lapply(dta.evs.melt.sum, function(x) type.convert(as.character(x)))
dta.evs.melt.sum <- aggregate(. ~ EVTYPE, dta.evs.melt.sum, sum)
# Getting highest values.
dta.evs.melt.sum <- subset(dta.evs.melt.sum,
value > quantile(dta.evs.melt.sum$value, 0.90))
Assuming that harm defined as \(harm = injuries + fatalities\) The highest harm, above 90th percentile of total harm, is associated with those events
| Eventype | Harm |
|---|---|
| Fog | 157 |
| Lightning | 157 |
| Thunderstorm winds | 167 |
| Tornado | 164 |
| Tstm wind/hail | 160 |
| Wildfire | 177 |
This section is concerned with the second part of the assignment and addresses the question about the economic impact of events.
Initial master data table is manipulated in order to create a data frame with information pertaining to economic harmfulness of the events that would be further exploited.
The analysis utilises the following variables - PROPDMG Property-related damage - CROPDMG Crops-related damge - PROPDMGEXP Multiplier - CROPDMGEXP Multiplier
# Relevant variables are extracted from the master data frame
dta.econ <- dta[,c("EVTYPE","PROPDMG","CROPDMG","PROPDMGEXP","CROPDMGEXP")]
# Letters are replaced with corresponding figures
dta.econ$PROPDMGEXP <- as.character(dta.econ$PROPDMGEXP)
dta.econ$PROPDMGEXP[dta.econ$PROPDMGEXP == "K"] <- "1000"
dta.econ$PROPDMGEXP[dta.econ$PROPDMGEXP == "M"] <- "1000000"
dta.econ$CROPDMGEXP <- as.character(dta.econ$CROPDMGEXP)
dta.econ$CROPDMGEXP[dta.econ$CROPDMGEXP == "K"] <- "1000"
dta.econ$CROPDMGEXP[dta.econ$CROPDMGEXP == "M"] <- "1000000"
# Cosmetic changes to names in the table
dta.econ <- data.frame(sapply(dta.econ, proper))
# Columns are converted to numbers
dta.econ$PROPDMGEXP <- as.numeric(dta.econ$PROPDMGEXP)
dta.econ$CROPDMGEXP <- as.numeric(dta.econ$CROPDMGEXP)
dta.econ$CROPDMG <- as.numeric(dta.econ$CROPDMG)
dta.econ$PROPDMG <- as.numeric(dta.econ$PROPDMG)
# Actual demage value is obtained
dta.econ$Property <- dta.econ$PROPDMGEXP * dta.econ$PROPDMG
dta.econ$Crops <- dta.econ$CROPDMGEXP * dta.econ$CROPDMG
# Aggregated dat frame is created for the analysis
dta.econ.evs <- aggregate(cbind(Property,Crops)~EVTYPE, data = dta.econ,
FUN = sum, na.rm = TRUE)
# Subsetting the data to get highest values only
dta.econ.evs <- subset(dta.econ.evs,
Property > quantile(dta.econ.evs$Property, 0.95) |
Crops > quantile(dta.econ.evs$Crops, 0.95))
# Data is melted to make it easier to graph
dta.econ.melt <- melt(dta.econ.evs,id="EVTYPE")
The data corresponding to the damage exerted by different events is graphed in the figure below.
# Force R not to use scientific notation
options("scipen"=100, "digits"=4)
# Make chart
ggplot(data = dta.econ.melt, aes(x = factor(EVTYPE), y = value,
fill = factor(variable))) +
geom_bar(position="stack", stat="identity") +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5),
panel.background = element_rect(fill = "white", colour = "grey"),
legend.position = "bottom",
legend.background = element_rect(colour = 'grey',
fill = 'white'),
axis.title.x = element_text(size = 10, colour = 'black'),
axis.title.y = element_text(size = 10, colour = 'black')) +
scale_fill_discrete(name="Damage") +
xlab("Event Type") +
ylab("Total Damage") +
ggtitle("Event Types According to Damage")
# Get events with highest total damage
dta.econ.melt <- subset(dta.econ.melt,
value > quantile(dta.econ.melt$value, 0.90))
# Delete column
dta.econ.melt$variable <- NULL
The events listed below accounts for the events where the total damage was above 90 percentile of the distribution, with total damage defined as \(damage = crops damage + property damage\).
| Eventype | EconomicDamage |
|---|---|
| Flash flood | 82770199 |
| Flood | 40774629 |
| Hail | 82960317 |
| Heavy snow | 6061175 |
| High wind | 21419573 |
| Lightning | 41722998 |
| Strong wind | 9808685 |
| Thunderstorm wind | 151928497 |
| Thunderstorm winds | 55202603 |
| Tornado | 142318860 |
| Tstm wind | 215970405 |
| Hail | 8001752 |
In summary most harmful events are:
| Eventype | Harm |
|---|---|
| Fog | 157 |
| Lightning | 157 |
| Thunderstorm winds | 167 |
| Tornado | 164 |
| Tstm wind/hail | 160 |
| Wildfire | 177 |
with respect to the economic damage, most damaging events are
| Eventype | EconomicDamage |
|---|---|
| Flash flood | 82770199 |
| Flood | 40774629 |
| Hail | 82960317 |
| Heavy snow | 6061175 |
| High wind | 21419573 |
| Lightning | 41722998 |
| Strong wind | 9808685 |
| Thunderstorm wind | 151928497 |
| Thunderstorm winds | 55202603 |
| Tornado | 142318860 |
| Tstm wind | 215970405 |
| Hail | 8001752 |