In this assignment we explore the U.S. National Oceanic and Atmospheric Administration's (NOAA) storm database. This database records the most harmful weather events in the history of the United States of America. Throughout this document we consider the consequences of these events from both a health and an economic perspective.
First we load the libraries that we are going to use throughout the analysis:
library(proto) # Required dependency
library(bitops) # Required dependency
library(RCurl) # Get file by HTTP
library(ggplot2) # For visualizations
library(plyr) # Data manipulation
library(gsubfn) # Find and sub strings - cleaning data
## Loading required namespace: tcltk
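The analysis uses the following files: the storm data (StormData.csv.bz2), the accompanying documentation (01016005curr.pdf) and the NCDC Storm Events FAQ (FAQ_Page.pdf).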
Before starting, remember to set your working directory with the following command:
setwd("/Path/Where/You/Execute/Code")
Now we download those files to our workspace:
destfile="StormData.csv.bz2"
fileURL="https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
if(!file.exists(destfile)){download.file(url=fileURL, destfile=destfile, method='curl')}
destfile="01016005curr.pdf"
fileURL="https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf"
if(!file.exists(destfile)){download.file(url=fileURL,destfile=destfile,method='curl')}
destfile="FAQ_Page.pdf"
fileURL="https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2FNCDC%20Storm%20Events-FAQ%20Page.pdf"
if(!file.exists(destfile)){download.file(url=fileURL,destfile=destfile,method='curl')}
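The three download steps above repeat the same pattern, so they could be wrapped in a small helper. The following is only a sketch: `download_if_missing` and the `files` vector are hypothetical names introduced here for illustration, and `mode = "wb"` is our assumption so that the binary files are written unchanged on every platform.

```r
# Hypothetical helper (not part of the original analysis):
# download a file only when it is not already in the working directory.
download_if_missing <- function(url, destfile, ...) {
  if (!file.exists(destfile)) {
    download.file(url = url, destfile = destfile, mode = "wb", ...)
  }
  invisible(destfile)
}

# Destination file names mapped to the URLs used above
files <- c(
  "StormData.csv.bz2" = "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
  "01016005curr.pdf"  = "https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf",
  "FAQ_Page.pdf"      = "https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2FNCDC%20Storm%20Events-FAQ%20Page.pdf"
)
```

Calling `for (f in names(files)) download_if_missing(files[[f]], f)` would then fetch whichever of the three files is still missing.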
Next, we load the data into the workspace:
events <- read.csv2(bzfile("StormData.csv.bz2"), sep = ",", na.strings = "NA")
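One point worth checking after loading: `read.csv2` expects a comma as the decimal mark by default, so values written with a decimal point may not be parsed as numbers. Depending on the R version, such columns can arrive as factors, and `as.numeric` applied to a factor returns the internal level codes rather than the values. The toy vector below is made up to illustrate the difference; it is not taken from the storm data.

```r
# Toy column standing in for a numeric column that was read as a factor
x <- factor(c("0.00", "25.00", "3.00", "0.00"))

codes  <- as.numeric(x)                # level codes: 1, 2, 3, 1
values <- as.numeric(as.character(x))  # actual values: 0, 25, 3, 0

sum(codes)   # 7
sum(values)  # 28
```

Running `str(events$FATALITIES)` shows how the column was actually imported; if it is a factor, converting through `as.character` first avoids summing level codes.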
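Now we have to clean the data, because the event type names are not consistent enough to group events reliably. For that purpose we perform two main actions: we trim leading and trailing whitespace from the event type names, and we lowercase every letter except the first one of each word.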
# Remove leading and trailing whitespace
trim <- function(x) {
  gsub("(^[[:space:]]+|[[:space:]]+$)", "", x)
}
events <- transform(events, EVTYPE = trim(EVTYPE) )
events <- transform(events, EVTYPE = gsubfn("\\B.", tolower, EVTYPE, perl = TRUE ) )
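To see what the two cleaning steps do, here is a small sketch on a made-up vector that uses base R only (`trimws` and a Perl-style `\\L` replacement in place of `gsubfn`). The last two substitutions are an extra step that the analysis above does not perform: they assume that "Tstm Wind", "Thunderstorm Wind" and "Thunderstorm Winds" denote the same kind of event and merge them under one label.

```r
# Made-up event type names with inconsistent spacing and case
raw <- c("  TSTM WIND", "THUNDERSTORM WINDS ", "Thunderstorm Wind")

# Step 1: trim leading and trailing whitespace
clean <- trimws(raw)

# Step 2: lowercase every character that does not start a word
clean <- gsub("\\B(.)", "\\L\\1", clean, perl = TRUE)

# Assumed extra step: merge spelling variants of the same event
merged <- sub("^Tstm\\b", "Thunderstorm", clean, perl = TRUE)
merged <- sub("\\bWinds$", "Wind", merged, perl = TRUE)

unique(merged)
```

Without such a merge, the totals of one phenomenon are split across several rows of the summaries.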
The next step is to summarise the data in a format that makes the results easy to present. First we calculate the total number of fatalities and injuries per event type:
injuries_summary <- ddply(events, .(EVTYPE), summarise, tot_fatalities = sum(as.numeric(FATALITIES)), tot_injuries = sum(as.numeric(INJURIES)))
injuries_summary <- transform(injuries_summary, total = tot_fatalities + tot_injuries)
Then we calculate the total economic damage per event type:
economic_summary <- ddply(events, .(EVTYPE), summarise, tot_properties_dmg = sum(as.numeric(PROPDMG)), tot_crop_dmg = sum(as.numeric(CROPDMG)))
economic_summary <- transform(economic_summary, total = tot_properties_dmg + tot_crop_dmg )
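The sums above add the raw `PROPDMG` and `CROPDMG` figures. In the NOAA storm data these columns come with magnitude columns (`PROPDMGEXP` and `CROPDMGEXP`), so the raw figures are not dollar amounts by themselves. The sketch below shows one way the scaling could be applied. It assumes that the codes K, M and B stand for thousands, millions and billions and treats every other code as a multiplier of 1, which is a simplification; `exp_to_multiplier` and the toy data frame are hypothetical.

```r
# Hypothetical helper: translate a magnitude code into a multiplier.
# Assumption: K = 1e3, M = 1e6, B = 1e9; anything else counts as 1.
exp_to_multiplier <- function(code) {
  code <- toupper(as.character(code))
  mult <- c(K = 1e3, M = 1e6, B = 1e9)[code]
  mult[is.na(mult)] <- 1
  unname(mult)
}

# Made-up rows, not taken from the storm data
toy <- data.frame(PROPDMG = c(25, 2.5, 10),
                  PROPDMGEXP = c("K", "m", ""))
toy$prop_dmg_usd <- toy$PROPDMG * exp_to_multiplier(toy$PROPDMGEXP)
toy
```

The same helper could be applied to `CROPDMG` and `CROPDMGEXP` before the per-event totals are computed.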
Now we are ready to move on to the results.
For the results we focus on the impact of these events on two main issues: population health and economic consequences.
The 10 most harmful events are the following:
most_harmful <- injuries_summary[sort(injuries_summary$total, decreasing = TRUE, index.return=TRUE)$ix,][1:10,]
most_harmful
## EVTYPE tot_fatalities tot_injuries total
## 767 Tornado 79987 586102 666089
## 213 Hail 288719 301257 589976
## 788 Tstm Wind 221439 334853 556292
## 694 Thunderstorm Wind 82944 103332 186276
## 134 Flash Flood 58316 70707 129023
## 421 Lightning 17273 96160 113433
## 150 Flood 27283 34159 61442
## 322 High Wind 21123 38413 59536
## 720 Thunderstorm Winds 21021 34499 55520
## 276 Heavy Snow 16196 24309 40505
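The ranking above relies on the index returned by `sort`. An equivalent and perhaps more common idiom uses `order` together with `head`. The sketch below runs on a made-up summary table; `top_n_rows` is a hypothetical helper name.

```r
# Made-up summary table, not taken from the storm data
toy_summary <- data.frame(EVTYPE = c("A", "B", "C", "D"),
                          total  = c(10, 40, 20, 30))

# Hypothetical helper: return the n rows with the largest value in `column`
top_n_rows <- function(df, column, n = 10) {
  head(df[order(df[[column]], decreasing = TRUE), ], n)
}

top2 <- top_n_rows(toy_summary, "total", n = 2)
top2
```

With the real data, `top_n_rows(injuries_summary, "total")` should select the same ten rows, apart from possible differences in how ties are ordered.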
Let's plot the most harmful events:
p_health <- ggplot(data = most_harmful, aes(x = EVTYPE)) +
  geom_point(aes(y = total/1000, color = "Total"), size = 5) +
  geom_point(aes(y = tot_injuries/1000, color = "Injuries"), size = 5, shape = 24) +
  geom_point(aes(y = tot_fatalities/1000, color = "Fatalities"), size = 5, shape = 8) +
  ylab("Thousands of People") +
  xlab("Event Type") +
  theme_bw(base_family = "Times") +
  ggtitle("Top 10 most harmful events for health")
plot(p_health)
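ggplot2 places the categories of a discrete x axis in the order of the factor levels, which is alphabetical by default, so the ranking is not visible in the chart. If we wanted the axis to follow the ranking, the event types could be reordered by their totals before plotting. This is a sketch on made-up values using `reorder` from base R:

```r
# Made-up totals, not taken from the storm data
toy_rank <- data.frame(EVTYPE = factor(c("Flood", "Hail", "Tornado")),
                       total  = c(5, 30, 20))

# Order the factor levels by decreasing total
toy_rank$EVTYPE <- reorder(toy_rank$EVTYPE, -toy_rank$total)

levels(toy_rank$EVTYPE)
```

Applying the same call to `most_harmful$EVTYPE` before building the plot would put the event with the largest total first on the x axis.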
The 10 events with the greatest economic impact are:
most_economic <- economic_summary[sort(economic_summary$total, decreasing = TRUE, index.return=TRUE)$ix,][1:10,]
most_economic
## EVTYPE tot_properties_dmg tot_crop_dmg total
## 788 Tstm Wind 30962589 831227 31793816
## 694 Thunderstorm Wind 21693662 248867 21942529
## 767 Tornado 20043516 318943 20362459
## 213 Hail 11965969 2135137 14101106
## 134 Flash Flood 11759910 456151 12216061
## 720 Thunderstorm Winds 7928514 181908 8110422
## 150 Flood 5727502 359155 6086657
## 421 Lightning 5960289 33401 5993690
## 322 High Wind 3045928 58634 3104562
## 612 Strong Wind 1400798 16728 1417526
Let's represent those events in a graph:
p_economy <- ggplot(data = most_economic, aes(x = EVTYPE)) +
  geom_point(aes(y = total/10^6, color = "Total Damages"), size = 5) +
  geom_point(aes(y = tot_properties_dmg/10^6, color = "Properties Damages"), size = 5, shape = 24) +
  geom_point(aes(y = tot_crop_dmg/10^6, color = "Crop Damages"), size = 8, shape = 4) +
  ylab("Event Damages [Millions of $]") +
  xlab("Event Type") +
  theme_bw(base_family = "Times") +
  ggtitle("The 10 events with more economic impact")
plot(p_economy)
We found that Tstm Wind, Thunderstorm Wind, Tornado, Hail and Flash Flood are the top 5 most harmful events for people and also the ones with the greatest economic impact. Although they do not occupy exactly the same positions in the ranking for both criteria, the two rankings largely overlap, which suggests a relationship between the economic and the human damage caused by these events. Our findings suggest that we should focus our efforts on improving the response when these phenomena occur.