The basic goal of this assignment is to explore the NOAA Storm Database and answer two questions concerning severe weather events:
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
## Load the necessary packages:
if(!require(sqldf)){install.packages("sqldf")}
## Loading required package: sqldf
## Loading required package: gsubfn
## Loading required package: proto
## Loading required package: RSQLite
## Loading required package: DBI
library(sqldf)
if(!require(ggplot2)){install.packages("ggplot2")}
## Loading required package: ggplot2
library(ggplot2)
if(!require(gridExtra)){install.packages("gridExtra")}
## Loading required package: gridExtra
## Loading required package: grid
library(gridExtra)
if(!require(tidyr)){install.packages("tidyr")}
## Loading required package: tidyr
if(!require(dplyr)){install.packages("dplyr")}
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(dplyr)
library(tidyr)
if(!require(tcltk)){install.packages("tcltk")}
## Loading required package: tcltk
library(tcltk)
## Pull down the dataset into the working directory
dataset_url <-"http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(dataset_url, destfile="StormData.csv")
if(!file.exists("StormData.csv.bz2")){download.file(dataset_url, destfile="StormData.csv.bz2")}
storm <- read.csv("StormData.csv.bz2", header = TRUE, sep = ",", strip.white = TRUE)
## Create the timestamp of the data download and publish
dateDownloaded <- date()
cat("Data file downloaded & evaluated ", dateDownloaded," from ", dataset_url)
## Data file downloaded & evaluated Sun Feb 15 12:50:40 2015 from http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2
## Subset the data to reduce the processing load.
SubsetData <- sqldf("
select *
from (
SELECT
EVTYPE as EventType
,sum(FATALITIES) as Fatalities
,sum(INJURIES) as Injuries
,sum(FATALITIES)+sum(INJURIES) as Injured_n_Dead
,sum(PROPDMG) as PropertyDamage
,sum(CROPDMG) as CropDamage
,sum(PROPDMG)+sum(CROPDMG) as EconomicDamage
FROM storm
GROUP BY EVTYPE)
where Injured_n_Dead > 0 or EconomicDamage > 0 ")
## Table summarizing overall Deaths, Injuries, and Damage
tots<-as.data.frame(cbind(
FA=sum(SubsetData$Fatalities)
,I=sum(SubsetData$Injuries)
,InD=sum(SubsetData$Injured_n_Dead)
,PD=sum(SubsetData$PropertyDamage)
,CD=sum(SubsetData$CropDamage)
,ED=sum(SubsetData$EconomicDamage)
))
Processing will consist of creating several file data frames as well as temporary data frames. The overall data will be subsetted out so that only those Events which had either an economic impact or a population health impact will be reviewed. Those categories resulted in 488 observations of 7 variables, a reduction from the original data of 902,297 observations of 37 variables.
Two new variables are introduced in processing the data for summarization. Population Health is now defined as the sum of the original categories of “Fatalities” and “Injuries”. Economic Consequences will be measured by the new variable called “Economic Damage” created by summing the original categories of “Property Damage” and “Crop Damage”.
In the process of analysis the Top 5 values were chosen because of the significance of the overall quantities / loads these Event Types had on the overall measures to population health and economic consequence. These impacts were not uniform across all the individual categories, but null values are allowed to report in the plot to show the change in distribution. Therefore the plots did not remove those Event Types which had no impact in an individual category which summarized to the overall Affected value.
Analysis will be added which shows the effect the Top 5 had in the overall calculation of Affected.
# Data work to solve for the questions
## Tidying data for Question 1 & setting up plot.
### The data will be reordered by category type for ranking. Then it will be
## appended to the overall file made for plotting the results.
popsort <- sqldf("select EventType, Injured_n_Dead from SubsetData order by Injured_n_Dead desc, EventType")
Top5.ppl<-as.data.frame(c(popsort[1:5,],"Injured_n_Dead"))
names(Top5.ppl)<-c("EventType","Affected","Factor")
## This interim table is used to save overall results for future chunk
## calculations of the percentages represented by the Top N values.
save.ppl<-Top5.ppl
popsort <- sqldf("select EventType, Fatalities from SubsetData order by Fatalities desc, EventType")
top.5<-as.data.frame(c(popsort[1:5,],"Fatalities"))
names(top.5)<-c("EventType","Affected","Factor")
Top5.ppl<-rbind(Top5.ppl, top.5)
popsort <- sqldf("select EventType, Injuries from SubsetData order by Injuries desc, EventType")
top.5<-as.data.frame(c(popsort[1:5,],"Injuries"))
names(top.5)<-c("EventType","Affected","Factor")
Top5.ppl<-rbind(Top5.ppl, top.5)
plot1<-ggplot(Top5.ppl, aes(EventType, Affected)) +
geom_bar(stat="identity", width=1) +
facet_grid(. ~ Factor) +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
aes(fill = factor(EventType))+
labs(fill = "Event Types")
## Data work for Question #2 - duplicate of Question #1 work except for
## different subsetting rules.
popsort <- sqldf("select EventType, EconomicDamage from SubsetData order by EconomicDamage desc, EventType")
Top5.econ<-as.data.frame(c(popsort[1:5,], "EconomicDamage"))
names(Top5.econ)<-c("EventType","Affected","Factor")
save.econ<-Top5.econ
popsort <- sqldf("select EventType, PropertyDamage from SubsetData order by PropertyDamage desc, EventType")
top.5<-as.data.frame(c(popsort[1:5,],"PropertyDamage"))
names(top.5)<-c("EventType","Affected","Factor")
Top5.econ<-rbind(Top5.econ, top.5)
popsort <- sqldf("select EventType, CropDamage from SubsetData order by CropDamage desc, EventType")
top.5<-as.data.frame(c(popsort[1:5,],"CropDamage"))
names(top.5)<-c("EventType","Affected","Factor")
Top5.econ<-rbind(Top5.econ, top.5)
plot2<-ggplot(Top5.econ, aes(EventType, Affected/1000000)) +
geom_bar(stat="identity", width=1) +
facet_grid(. ~ Factor) +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
aes(fill = factor(EventType))+
labs(fill = "Event Types", y = "Affected - in Millions of Dollars", x = "Event Type")
Two categories have been identified as affecting “population health”. These are the catagories of Fatalities and Injuries.
## create code to output the percentages of the defined Top N values for reporting
cat(c("The Top 5 Events for Fatalities and Injuries represents",
round(100*(sum(save.ppl$Affected)/tots$InD), digits = 1),
"percent of all Fatalities and Injuries. Injuries
represent ",round(100*(tots$I/tots$InD), digits = 1),"percent of the combined Fatalities and Injuries total. While Fatalities remainder with",round(100*(tots$FA/tots$InD), digits = 1),"percent of the combined."))
## The Top 5 Events for Fatalities and Injuries represents 81.1 percent of all Fatalities and Injuries. Injuries
## represent 90.3 percent of the combined Fatalities and Injuries total. While Fatalities remainder with 9.7 percent of the combined.
The results of looking at the Top 5 events for each category resulted in 7 types of events which had impacts. Injuries, by far, outweighed fatalitites, however Tornadoes were the single event which was the top event in both Injuries & Fatalities.
## plot the output of the Top N values
print(plot1)
Two types of Damage are analyzed for the overall definition of “economic consequences”. These are the categories of Property Damage and Crop Damage.
## create code to output the percentages of the defined Top N values for reporting
cat(c("The Top 5 Events for Economic Damages represents",
round(100*(sum(save.econ$Affected)/tots$ED), digits = 1),
"percent of all Property and Crop Damages. Property
Damages represent ",round(100*(tots$PD/tots$ED), digits = 1),"percent of the combined Damage total. While Crop Damages remainder with",round(100*(tots$CD/tots$ED), digits = 1),"
percent of the combined."))
## The Top 5 Events for Economic Damages represents 70.9 percent of all Property and Crop Damages. Property
## Damages represent 88.8 percent of the combined Damage total. While Crop Damages remainder with 11.2
## percent of the combined.
The results of looking at the Top 5 events for each category resulted in 7 types of events which had the most significant impact. This time, while Tornadoes had the highest contribution to the Property Damage and Property Damage drove the greatest amount of overall Economic Damage, it was the Event Type of Hail which drove the most significant economic impacts to the category of Crop Damage.
## plot the output of the Top N values
print(plot2)