“Peer Graded Assignment: Course Project 2” (c) by “Daniel A. Cialdella C.”, licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. You should have received a copy of the license along with this work. If not, see http://creativecommons.org/licenses/by-sa/3.0/.
The basic goal of this document is to explore the NOAA Storm Database and answer only two (2) basic questions about severe weather events (see Section “Questions to answer”).
We used the data provided by NOAA (a U.S. government agency) and “R” as the statistics application to collect, analyze and report the results in a reproducible way.
Knitr was used to prepare the report as a single document, which is also published on RPubs: http://rpubs.com/dcialdella
The original data file was obtained from this link: https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2 I did not verify the authenticity or content of the file provided; I used it as the starting point.
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
I checked the page “https://www.ncdc.noaa.gov/stormevents/ftp.jsp”, which seems to be related to this data, and the following link holds a large amount of compressed historical data: http://www1.ncdc.noaa.gov/pub/data/swdi/stormevents/csvfiles/
National Weather Service Storm Data Documentation. https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf
National Climatic Data Center Storm Events FAQ. https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2FNCDC%20Storm%20Events-FAQ%20Page.pdf
Information about the disasters covered: http://www.ncdc.noaa.gov/stormevents/details.jsp Take care with the period covered: from 1996 onward, 48 types of disaster are recorded.
Questions to answer:
Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?
This document could be used by any government or municipal manager, it’s generated
Information about the computer used to process the data.
sessionInfo()
## R version 3.2.5 (2016-04-14)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.1 LTS
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=es_ES.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=es_ES.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=es_ES.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] magrittr_1.5 formatR_1.3 tools_3.2.5 htmltools_0.3.5
## [5] yaml_2.1.13 Rcpp_0.12.5 stringi_1.1.1 rmarkdown_0.9.6
## [9] knitr_1.12.3 stringr_1.0.0 digest_0.6.9 evaluate_0.9
The original compressed file is “repdata_data_StormData.csv.bz2” (49.2 MB). Decompressed and ready to be processed in R, it is “repdata_data_StormData.csv” (561.6 MB).
The data load takes a few minutes to complete.
#####################################################
# Download the data from the internet (comment this line out if the file is already present)
# Compressed file obtained from the original page ( 49025 kb )
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", "origen")
#####################################################
# Decompress it and load it into an object called "datos" (very creative)
#
# Trying not to reload the data if it is not needed
#
result = tryCatch(
{
# This fails with an error if the variable "datos" has not been created yet.
# The error handler prints "E"; "F" is printed as well because `finally` always runs.
testing <- sum(datos$FATALITIES) >0
}, warning = function(w) {
print ("W")
}, error = function(e) {
print ("E")
}, finally = {
print ("F")
}
)
## [1] "E"
## [1] "F"
datos <- read.csv( bzfile("origen"), sep="," , header = T)
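The check above reports whether `datos` exists, but the file is still read every time. A minimal sketch of one way to skip the repeated work, assuming the same file name "origen"; `load_storm_data` and its `url` and `path` arguments are hypothetical names for illustration, not part of the original report:

```r
# Hypothetical helper (not from the original report): download the file
# only when it is missing on disk, then read the bzip2-compressed CSV.
load_storm_data <- function(url, path = "origen") {
  if (!file.exists(path)) {
    # mode = "wb" keeps the compressed file intact on Windows
    download.file(url, path, mode = "wb")
  }
  read.csv(bzfile(path), header = TRUE)
}

# Usage (commented out because it downloads about 49 MB):
# if (!exists("datos")) {
#   datos <- load_storm_data(
#     "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
#   )
# }
```

`exists("datos")` returns FALSE when the object has not been created, so the guard does the job of the `tryCatch` test without raising an error.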
The complete data set should contain 902297 records and 37 columns.
dim(datos)
## [1] 902297 37
datos2 <- datos [ , c("BGN_DATE", "EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")]
# Remove the object "datos" and keep datos2; the idea is to reduce RAM usage.
remove ( datos )
# Fix dates; not sure if needed, but it seems to be relevant data
datos2$BGN_DATE <- as.Date(datos2$BGN_DATE , format = "%m/%d/%Y")
# Identify values in special columns
unique( datos2$PROPDMGEXP )
## [1] K M B m + 0 5 6 ? 4 2 3 h 7 H - 1 8
## Levels: - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
# - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
unique( datos2$CROPDMGEXP )
## [1] M K m B ? 0 k 2
## Levels: ? 0 2 B k K m M
# ? 0 2 B k K m M
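With the dates parsed, the caveat about records from 1996 and later can be applied as a simple filter. The sketch below is not part of the original analysis: `toy` and `recent` are made-up names, the three rows are invented, and the "month/day/year time" layout of the raw strings is an assumption about the file.

```r
# Toy rows; the date layout ("m/d/Y H:M:S") is an assumption about BGN_DATE.
toy <- data.frame(
  BGN_DATE = c("4/18/1950 0:00:00", "1/1/1996 0:00:00", "11/28/2011 0:00:00"),
  EVTYPE   = c("TORNADO", "FLOOD", "HAIL"),
  stringsAsFactors = FALSE
)

# as.Date() reads only as far as the format requires,
# so the trailing time of day is ignored.
toy$BGN_DATE <- as.Date(toy$BGN_DATE, format = "%m/%d/%Y")

# Keep only events that began in 1996 or later.
recent <- toy[toy$BGN_DATE >= as.Date("1996-01-01"), ]
print(recent)
```

The same comparison on `datos2$BGN_DATE` would restrict the totals to the period in which all event types are recorded, which may change the rankings.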
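Convert the damage amounts to plain numbers by multiplying each value by the factor that its exponent letter represents (H = 100, K = 1,000, M = 1,000,000, B = 1,000,000,000).
# VERY slow process: the loop visits every row one by one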
Cantidad <- dim(datos2)
for (i in 1:Cantidad[1] )
{
# print(i)
if ( datos2$PROPDMGEXP[i] == "K") { datos2$PROPDMG[i] <- (datos2$PROPDMG[i] * 1000) }
if ( datos2$PROPDMGEXP[i] == "k") { datos2$PROPDMG[i] <- (datos2$PROPDMG[i] * 1000) }
if ( datos2$PROPDMGEXP[i] == "M") { datos2$PROPDMG[i] <- (datos2$PROPDMG[i] * 1000000) }
if ( datos2$PROPDMGEXP[i] == "m") { datos2$PROPDMG[i] <- (datos2$PROPDMG[i] * 1000000) }
if ( datos2$PROPDMGEXP[i] == "B") { datos2$PROPDMG[i] <- (datos2$PROPDMG[i] * 1000000000) }
if ( datos2$PROPDMGEXP[i] == "b") { datos2$PROPDMG[i] <- (datos2$PROPDMG[i] * 1000000000) }
if ( datos2$PROPDMGEXP[i] == "H") { datos2$PROPDMG[i] <- (datos2$PROPDMG[i] * 100) }
if ( datos2$PROPDMGEXP[i] == "h") { datos2$PROPDMG[i] <- (datos2$PROPDMG[i] * 100) }
if ( datos2$CROPDMGEXP[i] == "K") { datos2$CROPDMG[i] <- (datos2$CROPDMG[i] * 1000) }
if ( datos2$CROPDMGEXP[i] == "k") { datos2$CROPDMG[i] <- (datos2$CROPDMG[i] * 1000) }
if ( datos2$CROPDMGEXP[i] == "M") { datos2$CROPDMG[i] <- (datos2$CROPDMG[i] * 1000000) }
if ( datos2$CROPDMGEXP[i] == "m") { datos2$CROPDMG[i] <- (datos2$CROPDMG[i] * 1000000) }
if ( datos2$CROPDMGEXP[i] == "B") { datos2$CROPDMG[i] <- (datos2$CROPDMG[i] * 1000000000) }
if ( datos2$CROPDMGEXP[i] == "b") { datos2$CROPDMG[i] <- (datos2$CROPDMG[i] * 1000000000) }
if ( datos2$CROPDMGEXP[i] == "H") { datos2$CROPDMG[i] <- (datos2$CROPDMG[i] * 100 )}
if ( datos2$CROPDMGEXP[i] == "h") { datos2$CROPDMG[i] <- (datos2$CROPDMG[i] * 100 )}
# as in an RDBMS, a single bulk update would reduce the time this change takes.
}
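The row-by-row loop above can probably be replaced by a vectorized lookup, which usually runs much faster in R. This is a sketch under the same letter-to-multiplier rules as the loop; `exp_multiplier`, `scale_damage` and the `toy` data frame are hypothetical names and values for illustration only.

```r
# Hypothetical lookup table: exponent letter -> multiplier (same rules as the loop).
exp_multiplier <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)

# Scale a whole column at once. Codes that are not H/K/M/B
# (digits, "+", "-", "?", empty) leave the amount unchanged, as in the loop.
scale_damage <- function(amount, exp_code) {
  mult <- exp_multiplier[toupper(as.character(exp_code))]
  mult[is.na(mult)] <- 1
  amount * unname(mult)
}

# Toy data, not taken from the NOAA file.
toy <- data.frame(
  PROPDMG    = c(25, 2.5, 1, 3, 10),
  PROPDMGEXP = c("K", "m", "B", "h", "?")
)
toy$PROPDMG <- scale_damage(toy$PROPDMG, toy$PROPDMGEXP)
print(toy$PROPDMG)
```

On the real data the equivalent calls would be `scale_damage(datos2$PROPDMG, datos2$PROPDMGEXP)` and the same for the `CROPDMG` columns.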
Now calculate the sum of the values by event type and take the top three event types in each category: economic damage, injuries and fatalities.
# Damage calculated by kind of disaster
# Calculate total damage (PROP + CROP) and store it in a new col
datos2$total <- datos2$PROPDMG + datos2$CROPDMG
# In Persons and in Economic values
# Calculate FATALITIES by Event type
# fat <- tail( aggregate( FATALITIES ~ EVTYPE, datos2, sum, na.rm=T) ,3 )
fat <- aggregate( FATALITIES ~ EVTYPE, datos2, sum, na.rm=T)
fat2 <- tail( fat [ order( fat$FATALITIES), ], 3)
# Calculate INJURED by Event type
inj <- aggregate( INJURIES ~ EVTYPE, datos2, sum, na.rm=T)
inj2 <- tail( inj [ order( inj$INJURIES), ], 3)
# (Disabled) Estimate the ratio of fatalities to injuries for each kind of disaster
# datos2$rel <- datos2$FATALITIES / datos2$INJURIES
# RelFatInj <- aggregate( ( FATALITIES/INJURIES ) ~ EVTYPE, datos2, sum, na.rm=T)
# names( RelFatInj ) <- c("EVTYPE", "REL")
# RelFatInj2 <- tail( RelFatInj [ order( RelFatInj$REL), ], 3)
# Calculate TOTAL DAMAGE by Event type
dam <- aggregate( total ~ EVTYPE, datos2, sum, na.rm=T)
dam2 <- tail( dam [ order( dam$total), ], 3)
# Fatalities
fat2
## EVTYPE FATALITIES
## 147 FLASH FLOOD 978
## 123 EXCESSIVE HEAT 1903
## 830 TORNADO 5633
# Injured
inj2
## EVTYPE INJURIES
## 164 FLOOD 6789
## 854 TSTM WIND 6957
## 830 TORNADO 91346
# Ratio of fatalities to injuries by disaster (disabled)
# RelFatInj2
# Damage
dam2
## EVTYPE total
## 830 TORNADO 57352114049
## 406 HURRICANE/TYPHOON 71913712800
## 164 FLOOD 150319678257
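The three rankings above repeat the same aggregate, order and take-the-last-three pattern. One way to express it once is a small helper, sketched here on made-up data; `top_events` and `toy` are hypothetical names, not part of the original code.

```r
# Hypothetical helper: sum `value` per EVTYPE and return the n largest totals,
# largest first.
top_events <- function(data, value, n = 3) {
  totals <- aggregate(data[[value]], by = list(EVTYPE = data$EVTYPE),
                      FUN = sum, na.rm = TRUE)
  names(totals)[2] <- value
  head(totals[order(totals[[value]], decreasing = TRUE), ], n)
}

# Toy data, not taken from the NOAA file.
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HAIL", "HEAT", "FLOOD"),
  FATALITIES = c(5, 2, 4, 0, 3, 2)
)
top3 <- top_events(toy, "FATALITIES")
print(top3)
```

With the real data, `top_events(datos2, "FATALITIES")`, `top_events(datos2, "INJURIES")` and `top_events(datos2, "total")` should give the same three rows as `fat2`, `inj2` and `dam2`, listed in descending rather than ascending order.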
Question 1.
library(ggplot2)
ggplot( data=fat2, aes(x=EVTYPE, y=FATALITIES)) + geom_bar(stat="identity") + xlab("Kind of Event") + ylab("Fatalities") + labs(title="Fatalities ( big 3 )")
# aggregate(FATALITIES ~ EVTYPE, myData, sum)
# hist( tapply(activityConNA$steps, activityConNA$date,sum), main = "Plot 4.4", xlab="Steps", ylim=c(0,40), labels=T )
Answer 1. The most dangerous event type is “TORNADO” (5633 fatalities), followed by “EXCESSIVE HEAT” (1903) and “FLASH FLOOD” (978).
More information about injuries.
library( ggplot2)
ggplot( data=inj2, aes(x=EVTYPE, y=INJURIES)) + geom_bar(stat="identity") + xlab("Kind of Event") + ylab("Injuries") + labs(title="Injuries ( big 3 )")
Another graph to explain the information collected (injuries). Ordered by number of injuries: “TORNADO”, “TSTM WIND” and “FLOOD”.
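By default ggplot2 places the bars in alphabetical order of `EVTYPE`, not in order of size. A minimal sketch of one way to sort the bars by value uses `reorder()`; `inj_top` is a hypothetical name, and its three totals are copied from the `inj2` output shown earlier.

```r
library(ggplot2)

# Top-3 injury totals copied from the inj2 output in this report.
inj_top <- data.frame(
  EVTYPE   = c("FLOOD", "TSTM WIND", "TORNADO"),
  INJURIES = c(6789, 6957, 91346)
)

# reorder() sorts the factor levels by INJURIES;
# the minus sign puts the largest total first.
inj_top$EVTYPE <- reorder(inj_top$EVTYPE, -inj_top$INJURIES)

p <- ggplot(inj_top, aes(x = EVTYPE, y = INJURIES)) +
  geom_bar(stat = "identity") +
  xlab("Kind of Event") + ylab("Injuries")
p
```

The same `reorder()` call applied to `fat2` or `dam2` would sort the other two charts in the same way.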
Question 2.
library( ggplot2)
ggplot( data=dam2, aes(x=EVTYPE, y=total/1000000000)) + geom_bar(stat="identity") + xlab("Kind of Event") + ylab("Billions of US dollars") + labs(title="Damage (PROPDMG + CROPDMG )")
Answer 2. The event types with the greatest economic losses are “FLOOD”, “HURRICANE/TYPHOON” and “TORNADO”.