Course - Reproducible Research
Week 4 Assignment
Suleman Wadur
This report explores the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm data for the years 1950 to 2011. The analysis identifies the event types that most affect population health and those that cause the greatest economic losses.
The data come from the NOAA storm database, which tracks major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
We first download the data from the source and load it into R. The data is a bzip2-compressed CSV file, with fields delimited by commas.
Downloading and Loading of data file…
## Load needed libraries
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
## Turns off exponential notation of numeric values such as when using the mean function
options(scipen = 999)
## Set working directory
workdir <- "C:/Move 4/Coursera/DataScience/Course5-ReproducibleResearch/Week4/Assignment/"
setwd(workdir)
fileUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
if(!file.exists("StormData.csv.bz2")) {
print("downloading file.....")
download.file(fileUrl, destfile="./StormData.csv.bz2")
}
if (file.exists("StormData.csv.bz2")) {
RawStormData <- read.csv("StormData.csv.bz2")
print("Data load completed.")
}
## [1] "Data load completed."
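The chunk above depends on a machine-specific working directory. As a minimal sketch, the download-if-missing step could be wrapped in a helper that takes the destination path as an argument, so the report runs from any project folder. The name `load_cached_csv` and its parameters are hypothetical and not part of the original analysis.

```r
## Hypothetical helper (not in the original report): download a file only when
## it is not already cached locally, then read it as CSV.
load_cached_csv <- function(url, destfile) {
  if (!file.exists(destfile)) {
    ## mode = "wb" requests a binary transfer so the compressed file stays intact
    download.file(url, destfile = destfile, mode = "wb")
  }
  ## read.csv can read a bzip2-compressed file directly from its path
  read.csv(destfile)
}

## Example call, using the URL defined in this report and a relative path:
## RawStormData <- load_cached_csv(fileUrl, "StormData.csv.bz2")
```

Passing a relative path keeps the cached file next to the report source, which avoids the call to `setwd`.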
After reading the data, we check that it has 902297 rows and 37 variables, and display the first few rows.
dim(RawStormData)
## [1] 902297 37
## Display the first few records for some important columns
head(RawStormData[,c(1,2,6,7,8,23,24,25)])
## STATE__ BGN_DATE COUNTYNAME STATE EVTYPE FATALITIES INJURIES
## 1 1 4/18/1950 0:00:00 MOBILE AL TORNADO 0 15
## 2 1 4/18/1950 0:00:00 BALDWIN AL TORNADO 0 0
## 3 1 2/20/1951 0:00:00 FAYETTE AL TORNADO 0 2
## 4 1 6/8/1951 0:00:00 MADISON AL TORNADO 0 2
## 5 1 11/15/1951 0:00:00 CULLMAN AL TORNADO 0 2
## 6 1 11/15/1951 0:00:00 LAUDERDALE AL TORNADO 0 6
## PROPDMG
## 1 25.0
## 2 2.5
## 3 25.0
## 4 2.5
## 5 2.5
## 6 2.5
In order to determine the events that are most harmful to humans, we look at the corresponding number of fatalities and injuries according to event types.
This section aggregates fatalities and injuries by event type and sorts each result in descending order to find the events that cause the most harm.
## Aggregate the fatalities by event types.
EventsByFatalities <- aggregate(FATALITIES~EVTYPE, data = RawStormData, FUN=sum)
## Sort the data in decreasing order by fatalities
EventsByFatalities <- EventsByFatalities[order(EventsByFatalities$FATALITIES, decreasing = TRUE),]
## Aggregate the injuries by event types.
EventsByInjuries <- aggregate(INJURIES~EVTYPE, data = RawStormData, FUN=sum)
## Sort the data in decreasing order by injuries
EventsByInjuries <- EventsByInjuries[order(EventsByInjuries$INJURIES, decreasing = TRUE),]
## Display the top 10 events by fatalities, then the top 10 by injuries
head(EventsByFatalities, 10)
## EVTYPE FATALITIES
## 834 TORNADO 5633
## 130 EXCESSIVE HEAT 1903
## 153 FLASH FLOOD 978
## 275 HEAT 937
## 464 LIGHTNING 816
## 856 TSTM WIND 504
## 170 FLOOD 470
## 585 RIP CURRENT 368
## 359 HIGH WIND 248
## 19 AVALANCHE 224
head(EventsByInjuries, 10)
## EVTYPE INJURIES
## 834 TORNADO 91346
## 856 TSTM WIND 6957
## 170 FLOOD 6789
## 130 EXCESSIVE HEAT 6525
## 464 LIGHTNING 5230
## 275 HEAT 2100
## 427 ICE STORM 1975
## 153 FLASH FLOOD 1777
## 760 THUNDERSTORM WIND 1488
## 244 HAIL 1361
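The tables above list TSTM WIND and THUNDERSTORM WIND as separate event types, which suggests that spelling variants split the counts. A minimal sketch of one way to merge such variants before aggregating is shown below. The helper `normalize_evtype` and its rules are illustrative assumptions, not an official NOAA mapping, and the data frame is toy data rather than values from the storm database.

```r
## Hypothetical helper: collapse a few obvious spelling variants of event types.
## These rules are assumptions made for illustration only.
normalize_evtype <- function(x) {
  x <- toupper(trimws(x))               # uniform case, no outer whitespace
  x <- gsub("\\s+", " ", x)             # collapse repeated whitespace
  x <- sub("^TSTM", "THUNDERSTORM", x)  # expand the TSTM abbreviation
  x <- sub("WINDS$", "WIND", x)         # treat plural WINDS as WIND
  x
}

## Toy data for illustration only
toy <- data.frame(
  EVTYPE = c("TSTM WIND", "Thunderstorm Winds", " THUNDERSTORM WIND", "HAIL"),
  INJURIES = c(3, 2, 1, 4)
)
toy$EVTYPE <- normalize_evtype(toy$EVTYPE)

## The three wind variants now fall into a single group
aggregate(INJURIES ~ EVTYPE, data = toy, FUN = sum)
```

Applying a mapping like this to `RawStormData$EVTYPE` would change the rankings, so any rule set should be checked against the actual list of event types first.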
To determine the economic damage caused by an event, we look at two variables:
* Property Damage (PROPDMG)
* Crop Damage (CROPDMG)
This section sums the property and crop damage for each record, then aggregates across event types to find the events with the most damage. The two columns are summed as recorded, without applying any magnitude multiplier.
Damages <- select(mutate(RawStormData, TOTALDMG = PROPDMG+CROPDMG), EVTYPE,TOTALDMG)
## Aggregate the damages by event types.
EventsByTotalDamages <- aggregate(TOTALDMG~EVTYPE, data = Damages, FUN=sum)
## Sort the data in decreasing order by total damages
EventsByTotalDamages <- EventsByTotalDamages[order(EventsByTotalDamages$TOTALDMG, decreasing = TRUE),]
## Display the top 10 events by total damages.
head(EventsByTotalDamages, 10)
## EVTYPE TOTALDMG
## 834 TORNADO 3312276.7
## 153 FLASH FLOOD 1599325.1
## 856 TSTM WIND 1445168.2
## 244 HAIL 1268289.7
## 170 FLOOD 1067976.4
## 760 THUNDERSTORM WIND 943635.6
## 464 LIGHTNING 606932.4
## 786 THUNDERSTORM WINDS 464978.1
## 359 HIGH WIND 342014.8
## 972 WINTER STORM 134699.6
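The totals above add PROPDMG and CROPDMG as recorded. The storm data also carries PROPDMGEXP and CROPDMGEXP columns, which are commonly read as magnitude codes for those amounts. The sketch below shows how such codes could be applied before summing, assuming K means thousands, M millions, and B billions, and treating every other code as a multiplier of 1. The helper `exp_to_multiplier` is hypothetical and the data frame is toy data.

```r
## Hypothetical helper: convert a magnitude code to a numeric multiplier.
## Assumes K = 1e3, M = 1e6, B = 1e9; any other code (blank, digits, symbols)
## is treated as 1, which is a simplifying assumption.
exp_to_multiplier <- function(code) {
  code <- toupper(as.character(code))
  lookup <- c(K = 1e3, M = 1e6, B = 1e9)
  mult <- unname(lookup[code])  # unmatched codes yield NA
  mult[is.na(mult)] <- 1
  mult
}

## Toy data for illustration only
toy <- data.frame(
  EVTYPE = c("FLOOD", "FLOOD", "HAIL"),
  PROPDMG = c(25, 2.5, 10),
  PROPDMGEXP = c("K", "M", ""),
  CROPDMG = c(0, 1, 5),
  CROPDMGEXP = c("", "K", "k")
)

## Scale each amount by its own multiplier, then add the two damage types
toy$TOTALDMG <- toy$PROPDMG * exp_to_multiplier(toy$PROPDMGEXP) +
  toy$CROPDMG * exp_to_multiplier(toy$CROPDMGEXP)

aggregate(TOTALDMG ~ EVTYPE, data = toy, FUN = sum)
```

With multipliers applied, the result is a dollar amount, and the ranking of event types may differ from the unscaled ranking shown above.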
Here we present the top 10 events that are most harmful to people:
* First, the events causing the most fatalities.
* Second, the events causing the most injuries.
Next, the total economic losses by event type are displayed.
## Define a list of hex colors to use for the bars in the next plots
fills <- c("#a8e4b1", "#a89447", "#ffc0cb", "#0da7f2", "#ffcc98", "#da5c53", "#4aa3ba","#fa8072", "#d4af37", "#ff00ff")
## Using ggplot, create a bar plot showing the top 10 events with the most fatalities
## Use the fill option to set the color of each bar
## Use the theme feature to angle the x-axis labels at 25 degrees and center the plot title
ggplot(data=head(EventsByFatalities, 10), aes(x=EVTYPE, y=FATALITIES)) +
ylab("Fatalities") + xlab("Event Type") +
geom_bar(stat = "identity", fill=fills) +
ylim(0,7000) +
theme(axis.text.x = element_text(angle = 25, hjust = 1), plot.title = element_text(hjust = 0.5)) +
ggtitle("Top Ten Events with major fatalities Across the U.S")
ggplot(data=head(EventsByInjuries, 10), aes(x=EVTYPE, y=INJURIES)) +
ylab("Injuries") + xlab("Event Type") +
geom_bar(stat = "identity", fill=fills) +
ylim(0,100000) +
theme(axis.text.x = element_text(angle = 25, hjust = 1), plot.title = element_text(hjust = 0.5)) +
ggtitle("Top Ten Events with major Injuries Across the U.S")
ggplot(data=head(EventsByTotalDamages, 10), aes(x=EVTYPE, y=TOTALDMG/1000)) +
ylab("Total recorded damage (thousands, unscaled)") + xlab("Event Type") +
geom_bar(stat = "identity", fill=fills) +
theme(axis.text.x = element_text(angle = 25, hjust = 1), plot.title = element_text(hjust = 0.5)) +
ggtitle("Top Ten Events with major Economic losses Across the U.S")
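With `x=EVTYPE` mapped directly, ggplot2 places the bars in alphabetical order of event type rather than by size. As a minimal sketch, reordering the factor levels by the plotted value puts the largest bar first. The data frame below reuses the first five fatality counts reported earlier, and the single fill color is an arbitrary choice.

```r
library(ggplot2)

## First five fatality counts as reported earlier in this report
top <- data.frame(
  EVTYPE = c("TORNADO", "EXCESSIVE HEAT", "FLASH FLOOD", "HEAT", "LIGHTNING"),
  FATALITIES = c(5633, 1903, 978, 937, 816)
)

## Order the factor levels by decreasing count so bars appear largest first
top$EVTYPE <- reorder(top$EVTYPE, -top$FATALITIES)

## geom_col draws bars whose heights are the values in the data
p <- ggplot(top, aes(x = EVTYPE, y = FATALITIES)) +
  geom_col(fill = "#0da7f2") +
  xlab("Event Type") + ylab("Fatalities") +
  theme(axis.text.x = element_text(angle = 25, hjust = 1))
print(p)
```

The same `reorder` call could be applied to the injuries and damages tables before plotting.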