Synopsis

Humans have very little influence over atmospheric phenomena, which is why legislation and insurance documents, for example, sometimes call them ‘acts of God’. In an effort to understand these phenomena, people have kept records of their occurrence since ancient times. This analysis inspects the storm database of the National Oceanic and Atmospheric Administration (NOAA). Since 1950, that database has recorded information about atmospheric phenomena: when and where they occurred, a short description, and their consequences in terms of crop damage, property damage, and human fatalities and injuries. The objective is to identify which events are most harmful to population health and which have the greatest economic consequences. We find that tornadoes cause the most harm to people and floods cause the most economic damage.

Creating the environment to process data

The following libraries are loaded to process the data:
plyr - Tools for Splitting, Applying and Combining Data
ggplot2 - An implementation of the Grammar of Graphics
reshape2 - Flexibly reshape data: a reboot of the Reshape Package
cowplot - Streamlined plot theme and plot annotations for ggplot2

library(plyr)
library(ggplot2)
library(reshape2)
library(cowplot)

## 
## Attaching package: 'cowplot'

## The following object is masked from 'package:ggplot2':
## 
##     ggsave

Getting the data

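Check whether the project directory exists; if it does not, create it. Then set it as the working directory.

projectdir <- "C:/Coursera/05_RepRes/CourseProject2"
if(!dir.exists(projectdir)){
        # recursive = TRUE also creates missing parent folders
        dir.create(projectdir, recursive = TRUE)
}
# set the working directory whether or not the folder already existed
setwd(projectdir)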
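
The directory check can also be wrapped in a small helper so that the path is written only once. The following is a minimal sketch, not part of the original analysis: the function name ensure_dir is hypothetical, and the session's temporary directory is used as an assumed location so the example runs on any machine.

```r
# Hypothetical helper (not from the original report): create a directory
# if it is missing and return its path.
ensure_dir <- function(path) {
  if (!dir.exists(path)) {
    # recursive = TRUE also creates missing parent directories
    dir.create(path, recursive = TRUE)
  }
  path
}

# Assumed location for illustration: a sub-folder of the session's temp dir
project_dir <- ensure_dir(file.path(tempdir(), "CourseProject2"))
print(dir.exists(project_dir))
```

Calling the helper a second time is harmless, because the directory is only created when it does not exist yet.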

Check whether the dataset exists; if it does not, download the file from CloudFront.

if(!file.exists("stormdata.csv.bz2")){
        download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2","stormdata.csv.bz2")
}
alloccurences<-read.csv(bzfile("stormdata.csv.bz2"))

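## Warning in scan(file, what, nmax, sep, dec, quote, skip, nlines,
## na.strings, : EOF within quoted string

This warning means the parser reached the end of the file while still inside a quoted field, so some records may have been merged or dropped during import. The row count reported by str() later in this report should be read with that in mind.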
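
One way to gain confidence in a compressed CSV import is a round trip on a small table whose row count is known in advance. The sketch below is only an illustration under stated assumptions: the toy data frame, its values and the file name toy.csv.bz2 are made up and are not part of the NOAA data.

```r
# Toy data standing in for the storm records (values are made up);
# the REMARKS column deliberately contains quotes and a comma.
toy <- data.frame(EVTYPE = c("TORNADO", "FLOOD", "HAIL"),
                  REMARKS = c("roof \"peeled\" off", "river, overflowed", ""),
                  stringsAsFactors = FALSE)

# Write the table to a bzip2-compressed CSV file in the temp directory
path <- file.path(tempdir(), "toy.csv.bz2")
con <- bzfile(path, "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# Read it back the same way the report reads the storm data
back <- read.csv(bzfile(path), stringsAsFactors = FALSE)

# The number of rows read should match the number of rows written
print(nrow(back) == nrow(toy))
```

Applied to real data, the same idea amounts to comparing nrow() of the imported data frame with a record count obtained independently of the parser.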

Data Processing

Selecting the relevant data for the analysis

A summary of the data is omitted from this report because it is very long, but it can easily be obtained by running summary(alloccurences). Based on that summary, the following variables (column numbers in parentheses) were selected for the analysis:
Event type (8), Fatalities (23), Injuries (24), Property Damage: value in (25) and power of 10 in (26), Crop Damage: value in (27) and power of 10 in (28)

selectdata <- alloccurences[,c(8,23,24,25,26,27,28)]
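
Selecting columns by position works as long as the column order never changes. As a hedged alternative, the sketch below selects by name on a made-up data frame; the toy values are assumptions for illustration, while the column names follow those shown in the report.

```r
# Toy stand-in for the full dataset (values are made up)
toy <- data.frame(STATE = c("AL", "TX"),
                  EVTYPE = c("TORNADO", "FLOOD"),
                  FATALITIES = c(0, 2),
                  INJURIES = c(15, 0),
                  stringsAsFactors = FALSE)

# Names of the columns to keep
wanted <- c("EVTYPE", "FATALITIES", "INJURIES")

# Fail early if any expected column is absent
stopifnot(all(wanted %in% names(toy)))

# Select by name instead of by numeric position
selected <- toy[, wanted]
print(names(selected))
```

Selecting by name keeps the code readable and makes it fail loudly if a column is missing, rather than silently picking the wrong one.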

Verify whether there is missing data among the selected occurrences

If any data are missing (NA), some imputation process should be applied.
The check sums the number of NA values found in each column. If the result is 0, the dataset is complete.

sum(is.na(selectdata$FATALITIES))+sum(is.na(selectdata$INJURIES))+sum(is.na(selectdata$PROPDMG))+sum(is.na(selectdata$PROPDMGEXP))+sum(is.na(selectdata$CROPDMG))+sum(is.na(selectdata$CROPDMGEXP))

## [1] 0

The result is 0, so no missing data were found.
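
When the total is not zero, a per-column count shows where the gaps are. The following sketch uses a made-up data frame with deliberately missing values; it illustrates the idea and is not a result from the storm data.

```r
# Toy data with two deliberately missing values (values are made up)
toy <- data.frame(FATALITIES = c(0, NA, 1),
                  INJURIES = c(2, 3, NA),
                  PROPDMG = c(25, 2.5, 0))

# is.na() gives a logical matrix; colSums() counts the TRUEs per column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# The grand total is comparable to the single number used in the report
print(sum(na_per_column))
```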

Improving the data quality of text contents

Inspecting the Event Type contents shows that, over time and across regions, the same event types were recorded with different spellings. Automatic grouping therefore gives the false impression that there are more event types than really exist.

This was observed using the command unique(selectdata$EVTYPE), omitted from this report because the result is very long.

To minimize this effect, we convert all text to upper case with the ‘toupper’ function, remove leading and trailing white space with the ‘trimws’ function, and remove spaces, slashes and hyphens with the ‘gsub’ function.

selectdata$EVTYPE <- toupper(selectdata$EVTYPE)
selectdata$EVTYPE <- trimws(selectdata$EVTYPE)
selectdata$EVTYPE <- gsub("( |/|-)","",selectdata$EVTYPE)
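
The effect of these three steps can be seen on a handful of labels. The example below is a small sketch with made-up spellings (not taken from the database) that applies the same functions and the same regular expression.

```r
# Made-up spelling variants of event labels, for illustration only
raw <- c("Tstm Wind", " TSTM WIND", "tstm wind ",
         "THUNDERSTORM WINDS/HAIL", "Heavy-Rain")

# Same cleaning steps as in the report: upper case, trim, then drop
# spaces, slashes and hyphens
clean <- gsub("( |/|-)", "", trimws(toupper(raw)))

print(unique(clean))
print(length(unique(raw)))    # 5 distinct labels before cleaning
print(length(unique(clean)))  # 3 distinct labels after cleaning
```

Note that this cleaning only merges labels that differ in case, spacing or punctuation; abbreviations such as TSTM and THUNDERSTORM remain separate types.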

Verifying the selected data

names(selectdata)

## [1] "EVTYPE"     "FATALITIES" "INJURIES"   "PROPDMG"    "PROPDMGEXP"
## [6] "CROPDMG"    "CROPDMGEXP"

str(selectdata)

## 'data.frame':    816265 obs. of  7 variables:
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...

Organizing the monetary values

The monetary values were recorded in different ways.
Here are the unique values found in PROPDMGEXP:

unique(selectdata$PROPDMGEXP)

##  [1] K M   B m + 0 5 6 ? 4 2 3 h 7 H - 1 8
## Levels:  - ? + 0 1 2 3 4 5 6 7 8 B h H K m M

And the unique values found in CROPDMGEXP

unique(selectdata$CROPDMGEXP)

## [1]   M K m B ? 0 k 2
## Levels:  ? 0 2 B k K m M

Letters represent powers of 10: h and H for 100, k and K for 1,000, m and M for 1,000,000, and B for 1,000,000,000. Digits represent the corresponding power of 10 (for example, 3 means 10^3). In the code that follows, the symbols +, - and ? are mapped to 0 and an empty exponent is mapped to 1.

The exponents are therefore converted to numeric values before the total damage is calculated.

To express the values in millions of dollars, the total is divided by 10^6.

Calculation of Property Damage in millions of dollars

# map each exponent symbol to its numeric multiplier
selectdata$PROPDMGEXP <- mapvalues(selectdata$PROPDMGEXP,
        from = c("h", "H", "k", "K", "m", "M", "b", "B",
                 "0", "1", "2", "3", "4", "5", "6", "7", "8", "9",
                 "+", "-", "?", ""),
        to = c(10^2, 10^2, 10^3, 10^3, 10^6, 10^6, 10^9, 10^9,
               1, 10, 10^2, 10^3, 10^4, 10^5, 10^6, 10^7, 10^8, 10^9,
               0, 0, 0, 1))

## The following `from` values were not present in `x`: k, b, 9

selectdata$PROPDMGEXP <- as.numeric(as.character(selectdata$PROPDMGEXP))
selectdata$PROPDMGTOT <- (selectdata$PROPDMG * selectdata$PROPDMGEXP) / 1000000

Calculation of Crop Damage in millions of dollars

# map each exponent symbol to its numeric multiplier
selectdata$CROPDMGEXP <- mapvalues(selectdata$CROPDMGEXP,
        from = c("h", "H", "k", "K", "m", "M", "b", "B",
                 "0", "1", "2", "3", "4", "5", "6", "7", "8", "9",
                 "+", "-", "?", ""),
        to = c(10^2, 10^2, 10^3, 10^3, 10^6, 10^6, 10^9, 10^9,
               1, 10, 10^2, 10^3, 10^4, 10^5, 10^6, 10^7, 10^8, 10^9,
               0, 0, 0, 1))

## The following `from` values were not present in `x`: h, H, b, 1, 3, 4, 5, 6, 7, 8, 9, +, -

selectdata$CROPDMGEXP <- as.numeric(as.character(selectdata$CROPDMGEXP))
selectdata$CROPDMGTOT <- (selectdata$CROPDMG * selectdata$CROPDMGEXP) / 1000000
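
The arithmetic behind these conversions can be checked on a few rows. The sketch below is a simplified base R variant that uses a named lookup vector instead of mapvalues; it is an assumption-laden illustration, covers only the letter codes, and uses made-up damage values. Symbols that are not in the lookup (digits, +, -, ? or an empty exponent) would come back as NA here and would need the extra handling shown in the report.

```r
# Named lookup: exponent letter -> numeric multiplier (letter codes only)
exp_lookup <- c(h = 1e2, H = 1e2, k = 1e3, K = 1e3,
                m = 1e6, M = 1e6, b = 1e9, B = 1e9)

# Toy damage records (values are made up)
toy <- data.frame(PROPDMG = c(25, 2.5, 1.2),
                  PROPDMGEXP = c("K", "M", "B"),
                  stringsAsFactors = FALSE)

# Look up the multiplier by name, then express the total in millions
multiplier <- unname(exp_lookup[toy$PROPDMGEXP])
toy$PROPDMGTOT <- toy$PROPDMG * multiplier / 1e6

# 25 K -> 0.025 million, 2.5 M -> 2.5 million, 1.2 B -> 1200 million
print(toy)
```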

End of data preparation

At this point, all the data are prepared and the calculations needed to answer the two questions can begin.

Results

Which types of events are most harmful to population health?

To answer this question, FATALITIES and INJURIES are summed for each event type and then added together. The 20 event types with the highest totals are kept.

totpeople.sum <- ddply(selectdata, "EVTYPE", summarise, Fatalities = sum(FATALITIES), Injuries = sum(INJURIES))
totpeople.sum$sum <- totpeople.sum$Fatalities + totpeople.sum$Injuries
totpeople.sum <- totpeople.sum[order(-totpeople.sum$sum),]
totpeople.sum <- totpeople.sum[1:20,]
totpeople.sum <- melt(totpeople.sum, id.vars=c("sum", "EVTYPE"))
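
The group, sum, sort and truncate steps can be followed on a few rows. The sketch below is a base R approximation that uses aggregate instead of ddply; the toy values are made up, and the top 2 rows are kept instead of the top 20 so the output stays short.

```r
# Toy records (values are made up)
toy <- data.frame(EVTYPE = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD"),
                  FATALITIES = c(1, 0, 2, 4, 1),
                  INJURIES = c(10, 3, 20, 1, 0),
                  stringsAsFactors = FALSE)

# Sum both columns within each event type
totals <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = toy, FUN = sum)

# Combined harm per event type, sorted from largest to smallest
totals$sum <- totals$FATALITIES + totals$INJURIES
totals <- totals[order(-totals$sum), ]

# Keep only the most harmful types
top2 <- head(totals, 2)
print(top2)
```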

Presenting the result graphically

graph01 <- ggplot(totpeople.sum, aes(EVTYPE, value, fill=variable)) + geom_bar(stat="identity") + coord_flip() + ylab("Population") + xlab("Event Type") + ggtitle("Top 20 sources of population damage")

graph01

Fig. 1 - Tornadoes are by far the leading cause of fatalities and injuries
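
Because EVTYPE is a character column, ggplot2 orders the bars alphabetically rather than by size. One possible refinement, sketched here on made-up totals, is to turn the labels into a factor ordered by the total with the base R function reorder. The suggestion that the same expression could be used inside aes() is an assumption that has not been applied to the figure above.

```r
# Toy totals per event type (values are made up)
toy <- data.frame(EVTYPE = c("FLOOD", "TORNADO", "HEAT"),
                  sum = c(4, 33, 5),
                  stringsAsFactors = FALSE)

# reorder() returns a factor whose levels are sorted by the second argument
toy$EVTYPE <- reorder(toy$EVTYPE, toy$sum)
print(levels(toy$EVTYPE))

# Assumed usage in a plot (not run here):
#   ggplot(data, aes(reorder(EVTYPE, sum), value, fill = variable)) + ...
# With coord_flip(), the last level is drawn at the top of the chart.
```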

totpeople.sum[totpeople.sum$EVTYPE=="TORNADO",]

##      sum  EVTYPE   variable value
## 1  89974 TORNADO Fatalities  5023
## 21 89974 TORNADO   Injuries 84951

Tornadoes caused 5,023 fatalities and 84,951 injuries.

Which types of events have the greatest economic consequences?

To answer this question, the cost of CROP DAMAGE is added to the cost of PROPERTY DAMAGE for each event type.

totdamage.sum <- ddply(selectdata, "EVTYPE", summarise, Crop=sum(CROPDMGTOT), Prop=sum(PROPDMGTOT))
totdamage.sum$sum <- totdamage.sum$Crop + totdamage.sum$Prop
totdamage.sum <- totdamage.sum[order(-totdamage.sum$sum),]
totdamage.sum <- totdamage.sum[1:20,]
totdamage.sum <- melt(totdamage.sum, id.vars=c("sum", "EVTYPE"))
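
The final melt step turns the table from wide format (one column per damage type) into long format (one row per event type and damage type), which is the shape a stacked bar chart needs. The sketch below reproduces that reshaping by hand in base R on made-up values, purely to show what the long table looks like; it is not the reshape2 implementation.

```r
# Wide table: one row per event type (values are made up, in millions)
wide <- data.frame(EVTYPE = c("FLOOD", "DROUGHT"),
                   Crop = c(5, 13),
                   Prop = c(140, 1),
                   stringsAsFactors = FALSE)

# Long table: the Crop and Prop columns are stacked into a single
# 'value' column, with 'variable' recording where each value came from
long <- data.frame(EVTYPE = rep(wide$EVTYPE, times = 2),
                   variable = rep(c("Crop", "Prop"), each = nrow(wide)),
                   value = c(wide$Crop, wide$Prop),
                   stringsAsFactors = FALSE)
print(long)
```

The long table has twice as many rows as the wide one, and the total damage is unchanged by the reshaping.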