This file documents the process behind the data analysis for the second peer assessment of the Reproducible Research course, part of the Coursera and Johns Hopkins Data Science Specialization.
This analysis uses the NOAA Storm Database to understand which types of weather events cause the most damage, both to the economy and to population health. It addresses two questions: 1. Across the United States, which types of events are most harmful with respect to population health? 2. Across the United States, which types of events have the greatest economic consequences?
To do the analysis, it was first necessary to download the data from the link provided on the course's Peer Assessment 2 page. Cleaning the data was a fairly straightforward process: older records that were less complete than the most recent data were removed.
Finally, the records were aggregated by EVTYPE (event type), summing FATALITIES and INJURIES for the first analysis, and PROPDMG (property damage) and CROPDMG (crop damage) for the second. The results were then arranged in descending order, and the top 10 were displayed in a graph.
In order to process the data, it was necessary to acquire the NOAA Storm Database from the link provided in the course's Peer Assessment 2 page (link to database).
if (!file.exists("Storm Data.csv.bz2")) {
download.file(url="https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", "Storm Data.csv.bz2")
}
The next thing to do was to create a data frame object by calling the read.csv function with the recently downloaded file as a parameter.
if (!exists("stormData")) {
stormData <- read.csv(bzfile("Storm Data.csv.bz2"))
}
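The synopsis mentions dropping older, less complete records, but the report does not show that step. The following is a minimal base R sketch of how such a filter might look, run on a toy data frame rather than on stormData. The cutoff year of 1996 and the date format assumed for BGN_DATE are illustrative assumptions, not values taken from this report.

```r
# Toy stand-in for stormData; the "month/day/year time" layout of BGN_DATE is an assumption.
toy <- data.frame(
  BGN_DATE = c("4/18/1950 0:00:00", "6/1/1995 0:00:00",
               "1/6/1996 0:00:00", "11/28/2011 0:00:00"),
  EVTYPE   = c("TORNADO", "HAIL", "FLOOD", "TORNADO"),
  stringsAsFactors = FALSE
)

# Parse the date part (the trailing time is ignored) and extract the year.
toy$YEAR <- as.integer(format(as.Date(toy$BGN_DATE, format = "%m/%d/%Y"), "%Y"))

# Hypothetical cutoff: the report does not state which year was used.
cutoff_year <- 1996
toy_recent <- toy[toy$YEAR >= cutoff_year, ]
```

Applying the same two steps to stormData would keep only records from the chosen year onward; the right cutoff depends on how complete the earlier records are judged to be.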
The data frame was then summarized into two separate data frames, one for each of the proposed questions. This required some external libraries.
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.1.2
##
## Attaching package: 'dplyr'
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.1.2
library(scales)
For the first sub-dataframe, the records were grouped by EVTYPE (event type), and FATALITIES and INJURIES were summed within each group. The two sums were then added to get a TOTAL. Finally, the results were ordered by TOTAL in descending order and the top 10 were picked.
#AGGREGATE AND SUMMARIZE
stormData_byEvent_health <- group_by(stormData, EVTYPE) %>% summarize(FATALITIES = sum(FATALITIES), INJURIES = sum(INJURIES), TOTAL = sum(FATALITIES, INJURIES))
#ORDER TOP 10 WITH MOST FATALITIES AND INJURIES
top10_health_byEvent <- arrange(stormData_byEvent_health, desc(TOTAL), desc(FATALITIES), desc(INJURIES))[1:10,c("EVTYPE","TOTAL")]
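To make the group, sum and rank steps easier to follow, here is a small sketch of the same idea on a toy data frame. It uses base R's aggregate() and order() rather than dplyr, so it is an equivalent illustration and not the code used in this report. The toy values are made up.

```r
# Made-up records standing in for stormData.
toy <- data.frame(
  EVTYPE     = c("TORNADO", "HEAT", "TORNADO", "FLOOD", "HEAT"),
  FATALITIES = c(2, 5, 1, 0, 3),
  INJURIES   = c(10, 0, 4, 2, 1),
  stringsAsFactors = FALSE
)

# Sum both columns within each event type.
by_event <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = toy, FUN = sum)

# Add the two per-event sums to get the combined total.
by_event$TOTAL <- by_event$FATALITIES + by_event$INJURIES

# Sort by TOTAL in descending order and keep at most the first 10 rows.
ranked <- by_event[order(-by_event$TOTAL), ]
top <- head(ranked, 10)[, c("EVTYPE", "TOTAL")]
```

With these toy values, TORNADO ranks first (17), followed by HEAT (9) and FLOOD (2).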
The second sub-dataframe was processed in the same way, this time summing PROPDMG and CROPDMG within each event type and then adding both sums into a TOTAL column. The results were again ordered by TOTAL in descending order and the top 10 were picked.
#AGGREGATE AND SUMMARIZE
#PROPDMGEXP AND CROPDMGEXP HOLD MAGNITUDE CODES (E.G. K, M, B), NOT NUMBERS, SO THEY ARE NOT SUMMED
stormData_byEvent_economic <- group_by(stormData, EVTYPE) %>% summarize(PROPDMG = sum(PROPDMG), CROPDMG = sum(CROPDMG), TOTAL = sum(PROPDMG, CROPDMG))
#ORDER TOP 10 WITH MOST PROPERTY AND CROP DAMAGE
top10_economic_byEvent <- arrange(stormData_byEvent_economic, desc(TOTAL), desc(PROPDMG), desc(CROPDMG))[1:10,c("EVTYPE","TOTAL")]
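The aggregation above sums the raw PROPDMG and CROPDMG figures. In the NOAA data these figures are paired with magnitude codes in PROPDMGEXP and CROPDMGEXP, so a possible refinement is to scale each figure before summing. The sketch below shows one way this might be done in base R. The helper exp_to_multiplier is hypothetical, and its mapping (H, K, M, B as hundreds, thousands, millions, billions, with every other code treated as 1) is an assumption rather than something this report establishes.

```r
# Hypothetical helper: translate a magnitude code into a numeric multiplier.
# Assumed mapping; codes outside H/K/M/B (including blanks) fall back to 1.
exp_to_multiplier <- function(code) {
  lookup <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)
  mult <- lookup[toupper(as.character(code))]
  mult[is.na(mult)] <- 1
  unname(mult)
}

# Made-up records standing in for stormData.
toy <- data.frame(
  EVTYPE     = c("FLOOD", "TORNADO", "FLOOD", "HAIL"),
  PROPDMG    = c(2.5, 10, 1, 5),
  PROPDMGEXP = c("M", "K", "B", ""),
  stringsAsFactors = FALSE
)

# Scale each damage figure by its multiplier to get an amount in dollars.
toy$PROPDMG_USD <- toy$PROPDMG * exp_to_multiplier(toy$PROPDMGEXP)
```

The scaled column could then be summed per EVTYPE in place of the raw PROPDMG. How to treat the less common codes is a judgment call that would change the ranking, so it should be stated explicitly in any analysis that adopts this step.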
With both data frames ready, the next step is to plot the events by their TOTAL columns to see which EVTYPE caused the most fatalities and injuries, and which caused the most property and crop damage.
#PLOT TOP 10 EVENT TYPES BY FATALITIES AND INJURIES
ggplot(top10_health_byEvent, aes(y=TOTAL, x=EVTYPE, fill=EVTYPE)) +
geom_bar(stat="identity") +
scale_x_discrete(name = "Event Type", limits=top10_health_byEvent$EVTYPE[10:1]) +
theme(axis.text.x = element_text(angle = 90, hjust = 1), legend.position="none") +
ggtitle("Fatalities and Injuries by event type in the U.S. \nfrom 1950 to 2011") +
coord_flip()
#PLOT TOP 10 EVENT TYPES BY PROPERTY AND CROP DAMAGE
ggplot(top10_economic_byEvent, aes(y=TOTAL, x=EVTYPE, fill=EVTYPE)) +
geom_bar(stat="identity") +
scale_x_discrete(name = "Event Type", limits=top10_economic_byEvent$EVTYPE[10:1]) +
scale_y_continuous(name="Total Property and Crop Damage", labels = comma) +
theme(axis.text.x = element_text(angle = 90, hjust = 1), legend.position="none") +
ggtitle("Property and Crop damage by event type in the U.S. \nfrom 1950 to 2011") +
coord_flip()
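Both plots pass the top 10 event types in reverse order (`[10:1]`) as the axis limits, so that the largest total ends up at the top of the chart once coord_flip() is applied. A possible alternative, sketched here in base R on made-up values, is to encode that order in the factor levels of EVTYPE before plotting. This is an illustration of the ordering idea only, not the approach used in this report.

```r
# Made-up ranking, already sorted by TOTAL in descending order.
top <- data.frame(
  EVTYPE = c("TORNADO", "HEAT", "FLOOD"),
  TOTAL  = c(17, 9, 2),
  stringsAsFactors = FALSE
)

# Reverse the ranking in the factor levels: a flipped discrete axis draws the
# first level at the bottom, so the largest total (last level) lands at the top.
top$EVTYPE <- factor(top$EVTYPE, levels = rev(top$EVTYPE))
plot_order <- levels(top$EVTYPE)
```

With the levels set this way, the ranking travels with the data frame and does not need to be repeated in the scale limits.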