Course - Reproducible Research
Week 4 Assignment
Suleman Wadur

Synopsis

This report explores the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm data for the years 1950 to 2011. The analysis identifies the event types that most affect population health and the event types with the greatest economic impact.

The data:

The data comes from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, which tracks information on major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

Data Processing

We first download the data from the data source and load it into R. The data is a bzip2-compressed CSV file, with the variables delimited by commas ‘,’.

Downloading and Loading of data file…

## Load needed libraries
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
## Turns off exponential notation of numeric values such as when using the mean function
options(scipen = 999)

## Set working directory
workdir <- "C:/Move 4/Coursera/DataScience/Course5-ReproducibleResearch/Week4/Assignment/"
setwd(workdir)


## Download the compressed file only if it is not already present
fileUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
if (!file.exists("StormData.csv.bz2")) {
  print("downloading file.....")
  download.file(fileUrl, destfile = "./StormData.csv.bz2")
}

## read.csv reads the .bz2 file directly, so no separate decompression step is needed
if (file.exists("StormData.csv.bz2")) {
  RawStormData <- read.csv("StormData.csv.bz2")
  print("Data load completed.")
}
## [1] "Data load completed."
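
The load step relies on read.csv opening a bzip2-compressed file without an explicit decompression call. A minimal sketch of that behaviour on a small temporary file follows; the file and its two rows are made up for illustration and are not part of the storm data.

```r
# Write a tiny, made-up CSV through a bzip2 connection to a temporary file.
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "wt")
write.csv(
  data.frame(EVTYPE = c("TORNADO", "HAIL"), FATALITIES = c(0, 1)),
  con,
  row.names = FALSE
)
close(con)

# read.csv is given only the path; the compression is detected when the file is opened.
roundtrip <- read.csv(tmp, stringsAsFactors = FALSE)
print(roundtrip)
```

The same call pattern is what the report uses on StormData.csv.bz2, only on a much larger file.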

After reading the data, we check that there are 902297 rows and 37 variables, and display the first few rows for a selection of columns.

dim(RawStormData)
## [1] 902297     37
##display few records for some important columns
head(RawStormData[,c(1,2,6,7,8,23,24,25)])
##   STATE__           BGN_DATE COUNTYNAME STATE  EVTYPE FATALITIES INJURIES
## 1       1  4/18/1950 0:00:00     MOBILE    AL TORNADO          0       15
## 2       1  4/18/1950 0:00:00    BALDWIN    AL TORNADO          0        0
## 3       1  2/20/1951 0:00:00    FAYETTE    AL TORNADO          0        2
## 4       1   6/8/1951 0:00:00    MADISON    AL TORNADO          0        2
## 5       1 11/15/1951 0:00:00    CULLMAN    AL TORNADO          0        2
## 6       1 11/15/1951 0:00:00 LAUDERDALE    AL TORNADO          0        6
##   PROPDMG
## 1    25.0
## 2     2.5
## 3    25.0
## 4     2.5
## 5     2.5
## 6     2.5

In order to determine the events that are most harmful to humans, we look at the corresponding number of fatalities and injuries according to event types.

This section aggregates fatalities and injuries by event type and sorts each result in descending order to find the event types that cause the most harm.

## Aggregate the fatalities by event types.
EventsByFatalities <- aggregate(FATALITIES~EVTYPE, data = RawStormData, FUN=sum)

## Sort the data in decreasing order by fatalities
EventsByFatalities <- EventsByFatalities[order(EventsByFatalities$FATALITIES, decreasing = TRUE),]

## Aggregate the injuries by event types.
EventsByInjuries <- aggregate(INJURIES~EVTYPE, data = RawStormData, FUN=sum)

## Sort the data in decreasing order by injuries
EventsByInjuries <- EventsByInjuries[order(EventsByInjuries$INJURIES, decreasing = TRUE),]


##Display top 10 events by fatalities and another by injuries
head(EventsByFatalities, 10)
##             EVTYPE FATALITIES
## 834        TORNADO       5633
## 130 EXCESSIVE HEAT       1903
## 153    FLASH FLOOD        978
## 275           HEAT        937
## 464      LIGHTNING        816
## 856      TSTM WIND        504
## 170          FLOOD        470
## 585    RIP CURRENT        368
## 359      HIGH WIND        248
## 19       AVALANCHE        224
head(EventsByInjuries, 10)
##                EVTYPE INJURIES
## 834           TORNADO    91346
## 856         TSTM WIND     6957
## 170             FLOOD     6789
## 130    EXCESSIVE HEAT     6525
## 464         LIGHTNING     5230
## 275              HEAT     2100
## 427         ICE STORM     1975
## 153       FLASH FLOOD     1777
## 760 THUNDERSTORM WIND     1488
## 244              HAIL     1361
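
The injury table above lists TSTM WIND and THUNDERSTORM WIND as separate event types, which suggests that spelling variants of EVTYPE split what may be a single category. The following is a rough sketch of one way to merge such variants before aggregating. The helper normalize_evtype and its three rules are assumptions made for illustration, not part of the original analysis, and the toy rows are invented.

```r
# Made-up rows standing in for RawStormData.
toy <- data.frame(
  EVTYPE   = c("TSTM WIND", "THUNDERSTORM WIND", "THUNDERSTORM WINDS", " tornado"),
  INJURIES = c(3, 2, 1, 5),
  stringsAsFactors = FALSE
)

# Hypothetical helper: upper-case, trim whitespace, expand the TSTM
# abbreviation, and drop a trailing plural on WINDS.
normalize_evtype <- function(x) {
  x <- toupper(trimws(as.character(x)))
  x <- gsub("TSTM", "THUNDERSTORM", x, fixed = TRUE)
  x <- sub("WINDS$", "WIND", x)
  x
}

toy$EVTYPE <- normalize_evtype(toy$EVTYPE)

# Same aggregate-then-sort pattern as in the report.
injuries_by_event <- aggregate(INJURIES ~ EVTYPE, data = toy, FUN = sum)
injuries_by_event <- injuries_by_event[order(injuries_by_event$INJURIES, decreasing = TRUE), ]
print(injuries_by_event)
```

Whether a given pair of labels really denotes the same event type is a judgment call, so any such mapping would need to be checked against the full list of EVTYPE values.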

To determine the economic damage caused by an event, we look at two variables:
Property Damage (PROPDMG)
Crop Damage (CROPDMG)
This section adds the property and crop damage values for each record, then aggregates across event types to find the events with the most damage. The sums use the PROPDMG and CROPDMG values as recorded, without applying the magnitude codes stored in PROPDMGEXP and CROPDMGEXP.

Damages <- select(mutate(RawStormData, TOTALDMG = PROPDMG+CROPDMG), EVTYPE,TOTALDMG)

## Aggregate the damages by event types.
EventsByTotalDamages <- aggregate(TOTALDMG~EVTYPE, data = Damages, FUN=sum)

## Sort the data in decreasing order by total damages
EventsByTotalDamages <- EventsByTotalDamages[order(EventsByTotalDamages$TOTALDMG, decreasing = TRUE),]

##Display top 10 events by total damages.
head(EventsByTotalDamages, 10)
##                 EVTYPE  TOTALDMG
## 834            TORNADO 3312276.7
## 153        FLASH FLOOD 1599325.1
## 856          TSTM WIND 1445168.2
## 244               HAIL 1268289.7
## 170              FLOOD 1067976.4
## 760  THUNDERSTORM WIND  943635.6
## 464          LIGHTNING  606932.4
## 786 THUNDERSTORM WINDS  464978.1
## 359          HIGH WIND  342014.8
## 972       WINTER STORM  134699.6
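
The totals above add the PROPDMG and CROPDMG magnitudes directly. The storm data also carries PROPDMGEXP and CROPDMGEXP columns holding a magnitude code for each value. A hedged sketch of how those codes could be turned into dollar amounts before summing is shown below. The helper exp_multiplier, the K/M/B mapping, and the choice to treat every other code as a multiplier of 1 are assumptions for illustration, and the toy rows are invented.

```r
# Made-up rows standing in for RawStormData.
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "FLOOD", "HAIL"),
  PROPDMG    = c(25, 2.5, 1, 10),
  PROPDMGEXP = c("K", "M", "B", ""),
  CROPDMG    = c(0, 5, 0, 2),
  CROPDMGEXP = c("", "K", "", "k"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: map a magnitude code to a multiplier.
# Assumption: K = thousands, M = millions, B = billions; anything else counts as 1.
exp_multiplier <- function(code) {
  lookup <- c(K = 1e3, M = 1e6, B = 1e9)
  mult <- lookup[toupper(as.character(code))]
  mult[is.na(mult)] <- 1
  unname(mult)
}

# Scale each damage value by its own multiplier, then add property and crop damage.
toy$TOTALUSD <- toy$PROPDMG * exp_multiplier(toy$PROPDMGEXP) +
  toy$CROPDMG * exp_multiplier(toy$CROPDMGEXP)

# Same aggregate-then-sort pattern as in the report.
usd_by_event <- aggregate(TOTALUSD ~ EVTYPE, data = toy, FUN = sum)
usd_by_event <- usd_by_event[order(usd_by_event$TOTALUSD, decreasing = TRUE), ]
print(usd_by_event)
```

Applied to the full data set, a scaling step like this could change the ranking considerably, since a single record coded in billions outweighs many records coded in thousands.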

Results

Here, we present the top 10 events that are most harmful to people:
* First, events causing fatalities.
* Second, events causing injuries.

Next, the total economic losses by event type are displayed.

## Define a list of hex colors to use for the bars in the next plots
fills <- c("#a8e4b1", "#a89447", "#ffc0cb", "#0da7f2", "#ffcc98", "#da5c53", "#4aa3ba","#fa8072", "#d4af37", "#ff00ff")

## Using ggplot, create a bar plot showing the top 10 events with the most fatalities
## Uses the fill option to set the colors of the bars
## Uses the theme feature to angle the x-axis labels at 25 degrees and right-justify them
ggplot(data=head(EventsByFatalities, 10), aes(x=EVTYPE, y=FATALITIES)) +
  ylab("Fatalities") + xlab("Event Type") + 
  geom_bar(stat = "identity", fill=fills) + 
  ylim(0,7000) +
  theme(axis.text.x = element_text(angle = 25, hjust = 1), plot.title = element_text(hjust = 0.5)) +
  ggtitle("Top Ten Events with major fatalities Across the U.S")

ggplot(data=head(EventsByInjuries, 10), aes(x=EVTYPE, y=INJURIES)) +
  ylab("Injuries") + xlab("Event Type") + 
  geom_bar(stat = "identity", fill=fills) + 
  ylim(0,100000) +
  theme(axis.text.x = element_text(angle = 25, hjust = 1), plot.title = element_text(hjust = 0.5)) +
  ggtitle("Top Ten Events with major Injuries Across the U.S")

ggplot(data=head(EventsByTotalDamages, 10), aes(x=EVTYPE, y=TOTALDMG/1000)) +
  ylab("Economic losses in millions") + xlab("Event Type") + 
  geom_bar(stat = "identity", fill=fills) + 
  theme(axis.text.x = element_text(angle = 25, hjust = 1), plot.title = element_text(hjust = 0.5)) +
  ggtitle("Top Ten Events with major Economic losses Across the U.S")
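
With x=EVTYPE, ggplot2 places the bars in alphabetical order of the event names, so the ranking is not visible from left to right. A small sketch of one way to order the bars by value is given below, using reorder inside aes. The three rows are taken from the fatalities table earlier in this report; the object names top_events and p are chosen here for illustration.

```r
library(ggplot2)

# Three rows copied from the top of the fatalities table.
top_events <- data.frame(
  EVTYPE     = c("TORNADO", "EXCESSIVE HEAT", "FLASH FLOOD"),
  FATALITIES = c(5633, 1903, 978),
  stringsAsFactors = FALSE
)

# reorder() turns EVTYPE into a factor whose levels are sorted by -FATALITIES,
# so the event with the most fatalities is drawn first.
p <- ggplot(top_events, aes(x = reorder(EVTYPE, -FATALITIES), y = FATALITIES)) +
  geom_col() +
  xlab("Event Type") + ylab("Fatalities") +
  theme(axis.text.x = element_text(angle = 25, hjust = 1))

print(p)
```

If a fill vector is kept as in the plots above, its colors are assigned in data row order, so the rows and the colors should be arranged together.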