Author: Murtuza Ali Lakhani Date: September 26, 2015
The fundamental goal of this assignment is to explore the U.S. National Oceanic and Atmospheric Administration (NOAA) Storm Database and answer the following questions about the consequences of severe weather events:
Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?
We are to use the database to answer certain questions and show the code for our entire analysis. Our analysis is to consist of tables, figures, or other summaries. We have been given the freedom to use any R package needed to support our analysis.
A public database maintained by the NOAA contains quantitative data on the type of storm events, details like location, date, estimates for damage to property, and the number of human victims of the storm. The purpose of this study is to explore the type of events that are most harmful to the economy and human life.
In terms of the event most harmful with respect to population health, data show that Heat and Drought cause the highest number of fatalities. Tornadoes and Flooding & High Surf are the next leading causes of fatalities. Tornadoes cause the highest number of injuries, followed by Thunderstorm & Lightning and Flooding & High Surf. In contrast, Landslide & Erosion and Dust & Saharan Winds rank the lowest on the list of human fatality and injuries.
In terms of economic damage, data show Flooding & High Surf and Wind & Storm rank as the top two categories, in that order. Further, data show that Heat & Drought causes the most damage to crops. In contrast, Landslide & Erosion and Dust & Saharan Winds rank the lowest on the list of economic damage caused.
Part of the work below has been adapted from Tom Lous’s project.
These are libraries needed to load, compute, transform, and plot data
if (!("RCurl" %in% rownames(installed.packages())) ) {
install.packages("RCurl")
} else {
library(RCurl)
}
if (!("R.utils" %in% rownames(installed.packages())) ) {
install.packages("R.utils")
} else {
library(R.utils)
}
if (!("plyr" %in% rownames(installed.packages())) ) {
install.packages("plyr")
} else {
library(plyr)
}
if (!("reshape2" %in% rownames(installed.packages())) ) {
install.packages("reshape2")
} else {
library(reshape2)
}
if (!("ggplot2" %in% rownames(installed.packages())) ) {
install.packages("ggplot2")
} else {
library(ggplot2)
}
if (!("grid" %in% rownames(installed.packages())) ) {
install.packages("grid")
} else {
library(grid)
}
if (!("gridExtra" %in% rownames(installed.packages())) ) {
install.packages("gridExtra")
} else {
library(gridExtra)
}
if (!("scales" %in% rownames(installed.packages())) ) {
install.packages("scales")
} else {
library(scales)
}
RData (binary format) file is used for faster processing. RData file is loaded only once, but simply invoked for all subsequent uses.
dataExists <- TRUE
# Check if the directory path leading up to the Rdata file exists. Load data if it does.
if(file.exists("C:/Users/murlakha/Desktop/DATA SCIENCE/REPRODUCIBLE RESEARCH/PROJECT TWO/data/StormData.RData")){
load("C:/Users/murlakha/Desktop/DATA SCIENCE/REPRODUCIBLE RESEARCH/PROJECT TWO/data/StormData.RData")
dataExists <- FALSE
}
if(dataExists){
# Create a directory to hold data if it doesn't already exist
if(!file.exists("C:/Users/murlakha/Desktop/DATA SCIENCE/REPRODUCIBLE RESEARCH/PROJECT TWO/data")){
dir.create("C:/Users/murlakha/Desktop/DATA SCIENCE/REPRODUCIBLE RESEARCH/PROJECT TWO/data")
}
# Load file from URL to bz2 file in data directory. (Flags adopted from Tom Lous' work.)
if(!file.exists("C:/Users/murlakha/Desktop/DATA SCIENCE/REPRODUCIBLE RESEARCH/PROJECT TWO/data/StormData.csv.bz2")){
fileUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
destPath <- "C:/Users/murlakha/Desktop/DATA SCIENCE/REPRODUCIBLE RESEARCH/PROJECT TWO/data/StormData.csv.bz2"
binData <- getBinaryURL(fileUrl, ssl.verifypeer=0L, followlocation=1L)
destFileHandle <- file(destPath, open="wb")
writeBin(binData,destFileHandle)
close(destFileHandle)
}
# Unzip bz2 file to csv
if(!file.exists("C:/Users/murlakha/Desktop/DATA SCIENCE/REPRODUCIBLE RESEARCH/PROJECT TWO/data/StormData.csv")){
filePath <- "C:/Users/murlakha/Desktop/DATA SCIENCE/REPRODUCIBLE RESEARCH/PROJECT TWO/data/StormData.csv.bz2"
destPath <- "C:/Users/murlakha/Desktop/DATA SCIENCE/REPRODUCIBLE RESEARCH/PROJECT TWO/data/StormData.csv"
bunzip2(filePath,destPath,overwrite=TRUE, remove=FALSE)
}
}
Read the source .csv file
if(dataExists){
csv_StormData <- read.csv("C:/Users/murlakha/Desktop/DATA SCIENCE/REPRODUCIBLE RESEARCH/PROJECT TWO/data/StormData.csv")
}
Keep only the columns needed for this analysis, namely BGN_DATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, and CROPDMGEXP.
if(dataExists){
needed_Columns <- c("BGN_DATE", "EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")
truncated_StormData <- csv_StormData[, needed_Columns]
truncated_StormData <- rename(truncated_StormData, c("FATALITIES"="fatalities", "INJURIES"="injuries"))
}
Since most of the data is from later years, drop 85% of data (cutoff tentatively set at 85%) from the earlier period.
if(dataExists){
# get some counts
total_number_of_observations <- nrow(truncated_StormData)
cutoff_percentage = 0.85
cutoff_observations = round(total_number_of_observations * cutoff_percentage)
# add columns for date calculations based on BGN_DATEro
truncated_StormData$year = as.numeric(format(as.Date(truncated_StormData$BGN_DATE, format = "%m/%d/%Y"), "%Y"))
# create dataset with count per year, reverse the recordset, create running total
year_Records <- count(truncated_StormData, "year")
year_Records <- year_Records[order(year_Records$year, decreasing=TRUE), ]
year_Records$runningTotal = cumsum(year_Records$freq)
cutoff_Year <- min(year_Records[year_Records$runningTotal < cutoff_observations, 1])
# reduce the dataset
truncated_StormData <- truncated_StormData[truncated_StormData$year >= cutoff_Year, ]
endYear <- max(truncated_StormData$year)
# clean truncated_StormData
truncated_StormData$BGN_DATE <- NULL
rownames(truncated_StormData) <- NULL
}
The EVTYPE variable contains hundreds of unique source events, many of which can be clustered together. We will define 11 levels, covering effectively the majority of the data records. Summaries and combinations will be skipped.
if(dataExists){
truncated_StormData$damageSource <- NA
truncated_StormData[grepl("precipitation|rain|hail|drizzle|wet|percip|burst|depression|fog|wall cloud",
truncated_StormData$EVTYPE, ignore.case = TRUE), "damageSource"] <- "Precipitation & Fog"
truncated_StormData[grepl("wind|storm|wnd|hurricane|typhoon",
truncated_StormData$EVTYPE, ignore.case = TRUE), "damageSource"] <- "Wind & Storm"
truncated_StormData[grepl("slide|erosion|slump",
truncated_StormData$EVTYPE, ignore.case = TRUE), "damageSource"] <- "Landslide & Erosion"
truncated_StormData[grepl("warmth|warm|heat|dry|hot|drought|thermia|temperature record|record temperature|record high",
truncated_StormData$EVTYPE, ignore.case = TRUE), "damageSource"] <- "Heat & Drought"
truncated_StormData[grepl("cold|cool|ice|icy|frost|freeze|snow|winter|wintry|wintery|blizzard|chill|freezing|avalanche|glaze|sleet",
truncated_StormData$EVTYPE, ignore.case = TRUE), "damageSource"] <- "Snow & Ice"
truncated_StormData[grepl("flood|surf|blow-out|swells|fld|dam break",
truncated_StormData$EVTYPE, ignore.case = TRUE), "damageSource"] <- "Flooding & High Surf"
truncated_StormData[grepl("seas|high water|tide|tsunami|wave|current|marine|drowning",
truncated_StormData$EVTYPE, ignore.case = TRUE), "damageSource"] <- "High Seas"
truncated_StormData[grepl("dust|saharan",
truncated_StormData$EVTYPE, ignore.case = TRUE), "damageSource"] <- "Dust & Saharan Winds"
truncated_StormData[grepl("tstm|thunderstorm|lightning",
truncated_StormData$EVTYPE, ignore.case = TRUE), "damageSource"] <- "Thunderstorm & Lightning"
truncated_StormData[grepl("tornado|spout|funnel|whirlwind",
truncated_StormData$EVTYPE, ignore.case = TRUE), "damageSource"] <- "Tornado"
truncated_StormData[grepl("fire|smoke|volcanic",
truncated_StormData$EVTYPE, ignore.case = TRUE), "damageSource"] <- "Fire & Volcanic Activity"
# Remove uncategorized (damageSource = NA) records. Cast damageSource as a factor.
truncated_StormData <- truncated_StormData[complete.cases(truncated_StormData[, "damageSource"]), ]
truncated_StormData$damageSource <- as.factor(truncated_StormData$damageSource)
# Clean EVTYPE variable in truncated_StormData
truncated_StormData$EVTYPE <- NULL
}
Assign absolute values to DMG and DMGEXP fields. Impute NA for all undefined EXP properties, such as +, ?.
if(dataExists){
# Function to convert symbol to a power of 10 (hundreds, thousands, millions, and billions). (Adapted from Tom Lous' work.)
to_the_power_of_ten <- function(x){
if(is.numeric(x)) {
x <- x
}
else if(grepl("h", x, ignore.case=TRUE)) {
x <- 2
}
else if(grepl("k", x, ignore.case=TRUE)) {
x <- 3
}
else if(grepl("m", x, ignore.case=TRUE)) {
x <- 6
}
else if(grepl("b", x, ignore.case=TRUE)) {
x <- 9
}
else if(x == "" || x == " "){
x <- 0
}
else{
x <- NA
}
x
}
# Function to take two parameters num and exp and convert them to one absolute value. Assign zero to non-integers.
calculate_damage_amount <- function(num, exp){
raise_to_this_power <- to_the_power_of_ten(exp)
if(is.numeric(num)){
num <- num * (10 ^ raise_to_this_power)
}
if(!is.numeric(num)){
num <- 0
}
num
}
# Create two new fields for calculated propDamage & cropDamage and add them to the damageTotal field.
truncated_StormData$propDamage <- mapply(calculate_damage_amount, truncated_StormData$PROPDMG, truncated_StormData$PROPDMGEXP)
truncated_StormData$cropDamage <- mapply(calculate_damage_amount, truncated_StormData$CROPDMG, truncated_StormData$CROPDMGEXP)
truncated_StormData$damageTotal = truncated_StormData$propDamage + truncated_StormData$cropDamage
# Clean out property/crop and damage/expense fields in truncated_StormData
truncated_StormData$PROPDMG <- NULL
truncated_StormData$PROPDMGEXP <- NULL
truncated_StormData$CROPDMG <- NULL
truncated_StormData$CROPDMGEXP <- NULL
}
Recast final data frames to be used in certain plot funtions.
if(dataExists){
# Aggregate economic damage per damageSource
Sum_of_Economic_Damage <- aggregate(formula=cbind(propDamage, cropDamage, damageTotal) ~ damageSource, data=truncated_StormData, FUN=sum, na.rm=TRUE)
Sum_of_Economic_Damage <- Sum_of_Economic_Damage[order(Sum_of_Economic_Damage$damageTotal, decreasing=TRUE),]
rownames(Sum_of_Economic_Damage) <- NULL
Sum_of_Economic_Damage$damageSource <- factor(Sum_of_Economic_Damage$damageSource, levels=rev(Sum_of_Economic_Damage$damageSource))
# Melt the Sum_of_Economic_Damage into data frame to be used as bar chart
meltSum_of_Economic_Damage <- melt(Sum_of_Economic_Damage, id.vars=c("damageSource"), measure.vars=c("propDamage","cropDamage"), variable.name="damageType", value.name="damage")
levels(meltSum_of_Economic_Damage$damageType)[levels(meltSum_of_Economic_Damage$damageType)=="propDamage"] <- "property"
levels(meltSum_of_Economic_Damage$damageType)[levels(meltSum_of_Economic_Damage$damageType)=="cropDamage"] <- "crops"
# Aggregate humanitarian damage per damageSource
Sum_of_Human_Damage <-aggregate(formula=cbind(injuries, fatalities) ~ damageSource, data=truncated_StormData, FUN=sum, na.rm=TRUE)
Sum_of_Human_Damage <- Sum_of_Human_Damage[order(Sum_of_Human_Damage$injuries, decreasing=TRUE),]
Sum_of_Human_Damage_by_Fatalities <- Sum_of_Human_Damage[order(Sum_of_Human_Damage$fatalities, decreasing=TRUE),]
rownames(Sum_of_Human_Damage) <- NULL
rownames(Sum_of_Human_Damage_by_Fatalities) <- NULL
Sum_of_Human_Damage$damageSource <- factor(Sum_of_Human_Damage$damageSource, levels=rev(Sum_of_Human_Damage$damageSource))
# Define maximum values for bar chart axis
maxInjuries <- max(Sum_of_Human_Damage$injuries)
maxInjuries <- maxInjuries + round(maxInjuries * 0.25)
maxFatalities <- max(Sum_of_Human_Damage$fatalities)
maxFatalities <- maxFatalities + round(maxFatalities * 0.25)
}
Save the processed data to RData file.
if(dataExists){
save(truncated_StormData,
Sum_of_Human_Damage,
Sum_of_Human_Damage_by_Fatalities,
meltSum_of_Economic_Damage,
Sum_of_Economic_Damage,
maxInjuries,
maxFatalities,
cutoff_Year,
endYear,
file="C:/Users/murlakha/Desktop/DATA SCIENCE/REPRODUCIBLE RESEARCH/PROJECT TWO/data/StormData.RData")
}
Display a few records of the cleaned, reformatted stormData to be used for analysis.
head(truncated_StormData, n=5L)
## fatalities injuries year damageSource propDamage cropDamage
## 1 0 0 1989 Tornado 2500000 0
## 2 0 0 1989 Tornado 250000 0
## 3 0 0 1989 Thunderstorm & Lightning 0 0
## 4 0 0 1989 Tornado 2500000 0
## 5 0 0 1989 Thunderstorm & Lightning 0 0
## damageTotal
## 1 2500000
## 2 250000
## 3 0
## 4 2500000
## 5 0
tail(truncated_StormData, n=5L)
## fatalities injuries year damageSource propDamage cropDamage
## 762146 0 0 2011 Wind & Storm 0 0
## 762147 0 0 2011 Wind & Storm 0 0
## 762148 0 0 2011 Wind & Storm 0 0
## 762149 0 0 2011 Snow & Ice 0 0
## 762150 0 0 2011 Snow & Ice 0 0
## damageTotal
## 762146 0
## 762147 0
## 762148 0
## 762149 0
## 762150 0
Show the injuries and fatalaties for each major weather event. Tornadoes cause the highest number of injuries, followed by Thunderstorm & Lightning and Flooding & High Surf. In contrast, Landslide & Erosion and Dust & Saharan Winds rank the lowest on the list of causes of human injuries. Heat and Drought cause the highest number of fatalities. Tornadoes and Flooding & High Surf are the next leading causes of fatalities. In contrast, Dust & Saharan Winds and Landslide & Erosion rank the lowest on the list of causes of human fatalities.
# Plot injuries
ggplot(data=Sum_of_Human_Damage, aes(x=damageSource, y=injuries)) +
geom_bar(stat = "identity", fill="lightblue", colour = 'red') +
geom_text(aes(label=injuries), size=3, vjust=0.5, hjust=2.0) +
ggtitle("Injuries") +
scale_y_continuous(expand=c(0, 0), limits=c(0, maxInjuries)) +
coord_flip()
# Plot fatalities.
ggplot(data=Sum_of_Human_Damage[order(Sum_of_Human_Damage$fatalities, decreasing=TRUE),], aes(x=damageSource, y=fatalities)) +
geom_bar(stat = "identity", fill="magenta", colour='blue') +
geom_text(aes(label=fatalities), size=3, vjust=0.5, hjust=-1.0) +
ggtitle("Fatalities") +
scale_y_continuous(expand=c(0, 0), limits=c(0,maxFatalities)) +
coord_flip()
The underlying data
Sum_of_Human_Damage
## damageSource injuries fatalities
## 1 Tornado 28037 1834
## 2 Thunderstorm & Lightning 13222 1398
## 3 Flooding & High Surf 8931 1715
## 4 Heat & Drought 8865 2969
## 5 Snow & Ice 6874 1344
## 6 Wind & Storm 3601 644
## 7 Precipitation & Fog 2528 195
## 8 Fire & Volcanic Activity 1608 90
## 9 High Seas 1139 898
## 10 Dust & Saharan Winds 483 24
## 11 Landslide & Erosion 55 44
In terms of economic damage, data show Flooding & High Surf and Wind & Storm rank as the top two categories, in that order. Further, data show that Heat & Drought causes the most damage to crops. In relative terms, the damage to property far outweighs that to crops. Contrastingly, Landslide & Erosion and Dust & Saharan Winds rank the lowest on the list of the causes of economic damage.
ggplot(meltSum_of_Economic_Damage, aes(x=damageSource, y=damage/1000000)) +
geom_bar(stat = "identity", aes(fill=damageType)) +
xlab("Event Type") +
theme(axis.text.x = element_text(angle = 45, size=8, hjust = 1, vjust = 1)) +
ylab("Total Economic Damage (millions of USD)") +
ggtitle(paste("Aggregated property and crop damage involving weather events from ",cutoff_Year," to ",endYear, sep=""))
The underlying data
Sum_of_Economic_Damage
## damageSource propDamage cropDamage damageTotal
## 1 Flooding & High Surf 167702697194 12389477200 180092174394
## 2 Wind & Storm 142713420213 6961399350 149674819563
## 3 Tornado 33884111913 367458360 34251570273
## 4 Precipitation & Fog 18970846467 3958490253 22929336720
## 5 Snow & Ice 12705867660 8721961900 21427829560
## 6 Heat & Drought 1062504300 14871450280 15933954580
## 7 Thunderstorm & Lightning 11909700240 1283798498 13193498738
## 8 Fire & Volcanic Activity 8502228500 403281630 8905510130
## 9 High Seas 4809787890 46622500 4856410390
## 10 Landslide & Erosion 328262100 20017000 348279100
## 11 Dust & Saharan Winds 6337630 3600000 9937630