Peer Graded Assignment: Course Project 2

License

“Peer Graded Assignment: Course Project 2” (c) by “Daniel A. Cialdella C.” is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. You should have received a copy of the license along with this work. If not, see http://creativecommons.org/licenses/by-sa/3.0/.

Goal

The basic goal of this document is to explore the NOAA Storm Database and answer only two (2) basic questions about severe weather events (see Section “Questions to answer”).

We used the information provided by NOAA (a U.S. organization) and used “R” as the statistics application to collect, analyze and report the required information in a reproducible way.

Knitr was used to prepare the report (all in one document), which was also published on RPubs: http://rpubs.com/dcialdella

The original data file was obtained from this link: https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2 I did not verify the authenticity or content of the original file; I used it as the starting point.

The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.

I checked the page “https://www.ncdc.noaa.gov/stormevents/ftp.jsp”, which seems to be related to this information, and the following link has a lot of compressed historical data: http://www1.ncdc.noaa.gov/pub/data/swdi/stormevents/csvfiles/

National Weather Service Storm Data Documentation. https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf

National Climatic Data Center Storm Events FAQ. https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2FNCDC%20Storm%20Events-FAQ%20Page.pdf

Information about the disasters covered: http://www.ncdc.noaa.gov/stormevents/details.jsp Note that from 1996 onward, 48 types of disaster are covered.

Questions to answer.

Questions to solve/Answer:

1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
2. Across the United States, which types of events have the greatest economic consequences?

This document could be used by any government or municipal manager, it’s generated

Details on how the questions were answered.

Information about the computer used to run the analysis.

sessionInfo()

## R version 3.2.5 (2016-04-14)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.1 LTS
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=es_ES.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=es_ES.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=es_ES.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] magrittr_1.5    formatR_1.3     tools_3.2.5     htmltools_0.3.5
##  [5] yaml_2.1.13     Rcpp_0.12.5     stringi_1.1.1   rmarkdown_0.9.6
##  [9] knitr_1.12.3    stringr_1.0.0   digest_0.6.9    evaluate_0.9

Obtaining the source data.

The original compressed file is “repdata_data_StormData.csv.bz2” (49.2 MB). Decompressed and ready to be processed in R, it is “repdata_data_StormData.csv” (561.6 MB).

Loading the data takes a few minutes.

#####################################################
# Download the compressed file from the original page ( 49025 kb ).
# Comment the download.file() line out once the file is on disk.

download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", "origen") 

#####################################################

# unzip it and load into object called "datos" (very creative)

#
# try not to reload the data if it is already in memory
#
result <- tryCatch(
{
  # This fails with an error if the object "datos" does not exist yet,
  # so "E" is printed. "F" is printed either way, because the
  # "finally" block always runs.
   testing <- sum(datos$FATALITIES) >0
}, warning = function(w) {
    print ("W")
}, error = function(e) {
    print ("E")
}, finally = {
    print ("F")
}
)

## [1] "E"
## [1] "F"

datos <- read.csv( bzfile("origen"), sep="," , header = T)
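
The tryCatch probe above is one way to detect whether “datos” is already in memory. A minimal alternative sketch, using the base R function exists(), is shown here. The helper name load_once, the environment demo_env and the toy loader are hypothetical and are not part of the original analysis; the toy loader stands in for the real download so the example runs offline.

```r
# Hypothetical helper: run `loader` only when no object called `name`
# exists yet in environment `envir`; otherwise reuse the existing object.
load_once <- function(name, loader, envir = globalenv()) {
  if (!exists(name, envir = envir, inherits = FALSE)) {
    assign(name, loader(), envir = envir)
  }
  get(name, envir = envir, inherits = FALSE)
}

# Toy demonstration: a counter records how often the loader really runs.
demo_env <- new.env()
calls <- 0
toy_loader <- function() {
  calls <<- calls + 1
  data.frame(FATALITIES = c(0, 1, 3))
}

first  <- load_once("datos", toy_loader, envir = demo_env)
second <- load_once("datos", toy_loader, envir = demo_env)  # reuses the object
```

With the real file, the loader could be something like function() read.csv(bzfile("origen")), so the slow read happens only on the first call.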

The complete dataset should contain 902297 records and 37 columns.

dim(datos)

## [1] 902297     37

Data Processing

datos2 <- datos [ , c("BGN_DATE", "EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")]

# Remove the object "datos" and keep only datos2, to reduce RAM usage.
remove ( datos )

# convert dates to Date class; not sure if needed, but it seems relevant data
datos2$BGN_DATE <- as.Date(datos2$BGN_DATE , format = "%m/%d/%Y")


# Identify values in special columns 
unique( datos2$PROPDMGEXP )

##  [1] K M   B m + 0 5 6 ? 4 2 3 h 7 H - 1 8
## Levels:  - ? + 0 1 2 3 4 5 6 7 8 B h H K m M

# - ? + 0 1 2 3 4 5 6 7 8 B h H K m M

unique( datos2$CROPDMGEXP )

## [1]   M K m B ? 0 k 2
## Levels:  ? 0 2 B k K m M

# ? 0 2 B k K m M

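Convert the damage values to full amounts by multiplying each value by the factor that the letter in its exponent column represents (H = 100, K = 1,000, M = 1,000,000, B = 1,000,000,000, in upper or lower case). Values with any other code are left unchanged.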

# VERY SLOW PROCESS: the loop visits every row, one at a time
Cantidad <- dim(datos2)
for (i in 1:Cantidad[1] )
    {
    # print(i)
    if ( datos2$PROPDMGEXP[i] == "K")  { datos2$PROPDMG[i]  <- (datos2$PROPDMG[i] * 1000) }
    if ( datos2$PROPDMGEXP[i] == "k")  { datos2$PROPDMG[i]  <- (datos2$PROPDMG[i] * 1000) } 
    if ( datos2$PROPDMGEXP[i] == "M")  { datos2$PROPDMG[i]  <- (datos2$PROPDMG[i] * 1000000) }
    if ( datos2$PROPDMGEXP[i] == "m")  { datos2$PROPDMG[i]  <- (datos2$PROPDMG[i] * 1000000) }
    if ( datos2$PROPDMGEXP[i] == "B")  { datos2$PROPDMG[i]  <- (datos2$PROPDMG[i] * 1000000000) }
    if ( datos2$PROPDMGEXP[i] == "b")  { datos2$PROPDMG[i]  <- (datos2$PROPDMG[i] * 1000000000) }
    if ( datos2$PROPDMGEXP[i] == "H")  { datos2$PROPDMG[i]  <- (datos2$PROPDMG[i] * 100) }
    if ( datos2$PROPDMGEXP[i] == "h")  { datos2$PROPDMG[i]  <- (datos2$PROPDMG[i] * 100) }
      
    if ( datos2$CROPDMGEXP[i] == "K")  { datos2$CROPDMG[i]  <- (datos2$CROPDMG[i] * 1000) }
    if ( datos2$CROPDMGEXP[i] == "k")  { datos2$CROPDMG[i]  <- (datos2$CROPDMG[i] * 1000) }
    if ( datos2$CROPDMGEXP[i] == "M")  { datos2$CROPDMG[i]  <- (datos2$CROPDMG[i] * 1000000) }
    if ( datos2$CROPDMGEXP[i] == "m")  { datos2$CROPDMG[i]  <- (datos2$CROPDMG[i] * 1000000) }
    if ( datos2$CROPDMGEXP[i] == "B")  { datos2$CROPDMG[i]  <- (datos2$CROPDMG[i] * 1000000000) }
    if ( datos2$CROPDMGEXP[i] == "b")  { datos2$CROPDMG[i]  <- (datos2$CROPDMG[i] * 1000000000) }
    if ( datos2$CROPDMGEXP[i] == "H")  { datos2$CROPDMG[i]  <- (datos2$CROPDMG[i] * 100 )}
    if ( datos2$CROPDMGEXP[i] == "h")  { datos2$CROPDMG[i]  <- (datos2$CROPDMG[i] * 100 )}
  
    # As in an RDBMS, a single bulk update would reduce the time this change takes.
}
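
The row-by-row loop above can also be written as one vectorized update. The following is a minimal sketch, not the method used to produce the results in this report. The function name exp_multiplier and the toy data frame are hypothetical. It assumes, as the loop does, that only H, K, M and B (in either case) carry a multiplier and that every other code leaves the value unchanged.

```r
# Hypothetical helper: translate exponent codes into numeric multipliers.
# Codes without an entry in the lookup table (digits, "+", "-", "?", "")
# get a multiplier of 1, so those values stay unchanged.
exp_multiplier <- function(code) {
  lookup <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)
  mult <- lookup[toupper(as.character(code))]
  mult[is.na(mult)] <- 1
  unname(mult)
}

# Toy data with the same column names as datos2.
toy <- data.frame(
  PROPDMG    = c(25, 2.5, 1, 10, 7),
  PROPDMGEXP = c("K", "m", "B", "", "5"),
  stringsAsFactors = FALSE
)

# One multiplication for the whole column instead of one per row.
toy$PROPDMG <- toy$PROPDMG * exp_multiplier(toy$PROPDMGEXP)
```

The same call could be applied to CROPDMG with CROPDMGEXP. Because as.character() is applied first, the helper should work whether the exponent column is a factor or a character vector.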

Now calculate the sum of the values by event type and obtain the top 3 event types in each category: economic damage, injuries and fatalities.

# Damage calculated by kind of disaster

# Calculate total damage (PROP + CROP) and store it in a new col
datos2$total <- datos2$PROPDMG + datos2$CROPDMG


# Impact on people and in economic terms

# Calculate FATALITIES by Event type
# fat <- tail(   aggregate(  FATALITIES ~ EVTYPE, datos2, sum, na.rm=T) ,3 )
fat  <- aggregate(  FATALITIES ~ EVTYPE, datos2, sum, na.rm=T)
fat2 <- tail( fat [ order( fat$FATALITIES), ], 3)


# Calculate INJURIES by Event type
inj  <- aggregate(  INJURIES ~ EVTYPE, datos2, sum, na.rm=T)
inj2 <- tail( inj [ order( inj$INJURIES), ], 3)


# (Disabled) estimate the relation between Fatalities and Injuries
# for each kind of disaster
# datos2$rel <- datos2$FATALITIES / datos2$INJURIES
# RelFatInj  <- aggregate(  ( FATALITIES/INJURIES )  ~ EVTYPE, datos2, sum, na.rm=T)
# names( RelFatInj ) <- c("EVTYPE", "REL")
# RelFatInj2 <- tail( RelFatInj [ order( RelFatInj$REL), ], 3)


# Calculate TOTAL DAMAGE by Event type
dam  <- aggregate(  total  ~ EVTYPE, datos2, sum, na.rm=T)
dam2 <- tail( dam [ order( dam$total), ], 3)

# Fatalities
fat2

##             EVTYPE FATALITIES
## 147    FLASH FLOOD        978
## 123 EXCESSIVE HEAT       1903
## 830        TORNADO       5633

# Injured
inj2

##        EVTYPE INJURIES
## 164     FLOOD     6789
## 854 TSTM WIND     6957
## 830   TORNADO    91346

# Relation between fatalities and injuries by disaster (disabled)
# RelFatInj2

# Damage
dam2

##                EVTYPE        total
## 830           TORNADO  57352114049
## 406 HURRICANE/TYPHOON  71913712800
## 164             FLOOD 150319678257
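
The three aggregate / order / tail steps above follow the same pattern, so they could be wrapped in one small function. This is a sketch only: the name top_events and the toy data are hypothetical, and the numbers in the toy data are made up for the demonstration, not taken from the NOAA file.

```r
# Hypothetical helper: sum column `value` by column `group` and return the
# `n` largest totals, in ascending order like the tail(order()) calls above.
top_events <- function(data, value, group = "EVTYPE", n = 3) {
  totals <- aggregate(data[[value]], by = list(data[[group]]),
                      FUN = sum, na.rm = TRUE)
  names(totals) <- c(group, value)
  tail(totals[order(totals[[value]]), ], n)
}

# Toy data (made-up values) with the same column names as datos2.
toy_events <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HAIL", "HEAT", "FLOOD"),
  FATALITIES = c(5, 1, 7, 0, 4, 2)
)

top3 <- top_events(toy_events, "FATALITIES")
```

On the real data, calls such as top_events(datos2, "INJURIES") or top_events(datos2, "total") would be expected to give the same kind of three-row table as inj2 and dam2.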

Results

Question 1.

library(ggplot2)

ggplot( data=fat2, aes(x=EVTYPE, y=FATALITIES)) + geom_bar(stat="identity") + xlab("Kind of Event") + ylab("Fatalities") + labs(title="Fatalities ( big 3 )")

# aggregate(FATALITIES ~ EVTYPE, myData, sum)
# hist( tapply(activityConNA$steps, activityConNA$date,sum), main = "Plot 4.4", xlab="Steps", ylim=c(0,40), labels=T )

Answer 1. The most dangerous events by number of fatalities are “TORNADO”, then “EXCESSIVE HEAT” and “FLASH FLOOD”.
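
By default ggplot2 places the bars in the order of the factor levels of EVTYPE (alphabetical for plain text), not in order of the totals. A minimal sketch of sorting the bars by value with the base R function reorder() follows. The data frame fat_demo is a hypothetical copy of the three fatality totals printed above, and the plot is only built when ggplot2 is installed.

```r
# Copy of the three fatality totals reported in the table above.
fat_demo <- data.frame(
  EVTYPE     = c("FLASH FLOOD", "EXCESSIVE HEAT", "TORNADO"),
  FATALITIES = c(978, 1903, 5633)
)

# reorder() sets the factor levels by descending FATALITIES, so the
# x axis follows the ranking instead of alphabetical order.
fat_demo$EVTYPE <- reorder(fat_demo$EVTYPE, -fat_demo$FATALITIES)

# Build the plot only when ggplot2 is available.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(fat_demo, aes(x = EVTYPE, y = FATALITIES)) +
    geom_bar(stat = "identity") +
    xlab("Kind of Event") + ylab("Fatalities")
}
```

Calling print(p) would then draw the bars from the largest total to the smallest.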

More information about injuries.

library( ggplot2)

ggplot( data=inj2, aes(x=EVTYPE, y=INJURIES)) + geom_bar(stat="identity") + xlab("Kind of Event") + ylab("Injuries") + labs(title="Injuries ( big 3 )")

Another graph to explain the information collected (injuries). In descending order: “TORNADO”, “TSTM WIND” and “FLOOD”.

Question 2.

library( ggplot2)

ggplot( data=dam2, aes(x=EVTYPE, y=total/1000000000)) + geom_bar(stat="identity") + xlab("Kind of Event") + ylab("Billions of units") + labs(title="Damage (PROPDMG + CROPDMG )")

Answer 2. The events with the greatest economic losses are “FLOOD”, then “HURRICANE/TYPHOON” and “TORNADO”.