Title: Data exploration of the US NOAA Storm Database

Reproducible Research: Project 2, Week 4

Synopsis

The following analysis draws conclusions from the NOAA storm data, with the objective of identifying the meteorological events that are most harmful to public health and the ones that cause the largest economic losses in the US.

Question 1.

Across the United States, which types of events are most harmful with respect to population health?

Data Processing

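First, the required packages are loaded and the data are downloaded and read into R.

library(dplyr)
library(data.table)
library(gridExtra)
library(ggplot2)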

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

## 
## Attaching package: 'data.table'

## The following objects are masked from 'package:dplyr':
## 
##     between, first, last

## 
## Attaching package: 'gridExtra'

## The following object is masked from 'package:dplyr':
## 
##     combine

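# Download the compressed storm data once; read.csv() decompresses bz2 files transparently
fileUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
filedest <- "dataproject.csv.bz2"
finalpath <- file.path(getwd(), filedest)
if (!file.exists(finalpath)) {
  download.file(fileUrl, destfile = finalpath, mode = "wb")
}
# stringsAsFactors = TRUE reproduces the factor columns shown below (the default before R 4.0)
tablatormenta <- read.csv(finalpath, stringsAsFactors = TRUE)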
str(tablatormenta)

## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 35 levels "","  N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_DATE  : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_TIME  : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WFO       : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ZONENAMES : Factor w/ 25112 levels "","                                                                                                               "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

To answer the question, the variables FATALITIES and INJURIES are selected, since they are the only ones related to population health. FATALITIES is the number of deaths reported for each event, and INJURIES is the number of injuries reported for each event.

summary(tablatormenta$FATALITIES)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   0.0000   0.0000   0.0000   0.0168   0.0000 583.0000

summary(tablatormenta$INJURIES)

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##    0.0000    0.0000    0.0000    0.1557    0.0000 1700.0000

A subset of the data frame is created with only the three relevant variables: EVTYPE, FATALITIES, and INJURIES.

question1<-select(tablatormenta,EVTYPE,FATALITIES,INJURIES)
question1<-as.data.table(question1)

Only observations with reported harm are relevant to the question. The filter below keeps events that report at least one fatality and at least one injury; because filter() combines comma-separated conditions with AND, events with only fatalities or only injuries are excluded from the totals that follow.

question1<-filter(question1,FATALITIES>0,INJURIES>0)
question1[,"evento"]<-as.character(question1$EVTYPE)
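
As a small illustration of how dplyr's filter() combines conditions, the sketch below uses a hypothetical four-row data frame (toy and its values are made up for this example and are not taken from the NOAA file). It contrasts comma-separated conditions, which are combined with AND, with the | operator, which keeps rows where at least one condition holds.

```r
library(dplyr)

# Hypothetical toy data, not taken from the NOAA file
toy <- data.frame(
  EVTYPE     = c("TORNADO", "HEAT", "FLOOD", "HAIL"),
  FATALITIES = c(2, 1, 0, 0),
  INJURIES   = c(10, 0, 3, 0),
  stringsAsFactors = FALSE
)

# Comma-separated conditions are combined with AND:
# only rows with fatalities and injuries are kept
both <- filter(toy, FATALITIES > 0, INJURIES > 0)

# The | operator keeps rows where at least one condition holds
either <- filter(toy, FATALITIES > 0 | INJURIES > 0)

nrow(both)    # 1 (TORNADO)
nrow(either)  # 3 (TORNADO, HEAT, FLOOD)
```

Which variant is appropriate depends on whether events with only one kind of harm should count toward the totals.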

Results Question 1

Grouping the selected variables by event type and summing them answers the proposed question.

fatal_event <- question1 %>% group_by(evento) %>% summarise(FATALITIES = sum(FATALITIES), INJURIES = sum(INJURIES))

colnames(fatal_event)<-c("evento","FATALITIES","INJURIES")
fatal_event=fatal_event %>% arrange(desc(FATALITIES))
injury_event=fatal_event %>% arrange(desc(INJURIES))

most_lethal=fatal_event[1:5,]
most_injuries=injury_event[1:5,]

most_lethal

## # A tibble: 5 x 3
##   evento         FATALITIES INJURIES
##   <chr>               <dbl>    <dbl>
## 1 TORNADO              5227    60187
## 2 EXCESSIVE HEAT        402     4791
## 3 LIGHTNING             283      649
## 4 TSTM WIND             199      646
## 5 FLASH FLOOD           171      641

most_injuries

## # A tibble: 5 x 3
##   evento         FATALITIES INJURIES
##   <chr>               <dbl>    <dbl>
## 1 TORNADO              5227    60187
## 2 EXCESSIVE HEAT        402     4791
## 3 FLOOD                 104     2679
## 4 ICE STORM              35     1720
## 5 HEAT                   73     1420
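
The group, sum, sort, and take-the-top-rows pattern used for these rankings can be sketched on a small made-up data set. The data frame toy, its event labels A, B, and C, and the choice of the top two rows are assumptions for illustration only.

```r
library(dplyr)

# Hypothetical toy data: three event types with repeated observations
toy <- data.frame(
  evento     = c("A", "B", "A", "C", "B", "A"),
  FATALITIES = c(1, 0, 2, 5, 1, 0),
  INJURIES   = c(3, 4, 0, 1, 2, 6),
  stringsAsFactors = FALSE
)

# One row per event type with the summed counts
totals <- toy %>%
  group_by(evento) %>%
  summarise(FATALITIES = sum(FATALITIES), INJURIES = sum(INJURIES))

# Sort in descending order and keep the first two rows of each ranking
top_fatal  <- totals %>% arrange(desc(FATALITIES)) %>% head(2)
top_injury <- totals %>% arrange(desc(INJURIES)) %>% head(2)

top_fatal$evento   # "C" "A"
top_injury$evento  # "A" "B"
```

The two rankings can differ, which is why fatalities and injuries are sorted separately.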

The following graphs show the top five most dangerous events for public health, by fatalities and by injuries.

g1<-ggplot(most_lethal,aes(x=evento,y=FATALITIES))
g1<-g1+geom_bar(aes(fill=evento),stat = "identity")+labs(title = "Top 5 most lethal risk events")

g2<-ggplot(most_injuries,aes(x=evento,y=INJURIES))
g2<-g2+geom_bar(aes(fill=evento),stat = "identity") + labs(title = "Top 5 most injury risk events")

grid.arrange(g1,g2,nrow=1)

From this information we can conclude that tornadoes are the most dangerous type of event: they rank first for both fatalities (5,227) and injuries (60,187) in the filtered data.

Question 2

Across the United States, which types of events have the greatest economic consequences?

Data Processing

To answer the question, the variables PROPDMG, PROPDMGEXP, CROPDMG, and CROPDMGEXP are selected, since they are the only ones related to economic consequences. PROPDMG and CROPDMG hold the numeric part of the dollar loss from property damage and crop damage, respectively. The EXP variables give the order of magnitude of that loss: for example, K corresponds to 10^3 dollars, M to 10^6 dollars, and B to 10^9 dollars.

question2<-select(tablatormenta,EVTYPE,PROPDMG,PROPDMGEXP,CROPDMG,CROPDMGEXP)

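Only events with PROPDMG or CROPDMG above 0 represent an economic loss. The filter below keeps events where both values are above 0; because filter() combines comma-separated conditions with AND, events with only property damage or only crop damage are excluded from the totals that follow.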

question2<-filter(question2,PROPDMG>0,CROPDMG>0)
question2[,"evento"]<-as.character(question2$EVTYPE)

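Both the PROPDMGEXP and CROPDMGEXP variables need to be converted to numeric multipliers before the losses can be computed. Codes containing k, m, or b (in either case) map to 10^3, 10^6, and 10^9; any other code is treated as a multiplier of 1.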

m1<-grepl("m",question2$PROPDMGEXP,ignore.case = TRUE)
question2[m1,"PROPDMGEXP_value"]<-c(10^6)
m2<-grepl("k",question2$PROPDMGEXP,ignore.case = TRUE)
question2[m2,"PROPDMGEXP_value"]<-c(10^3)
m3<-grepl("b",question2$PROPDMGEXP,ignore.case = TRUE)
question2[m3,"PROPDMGEXP_value"]<-c(10^9)

question2[is.na(question2$PROPDMGEXP_value),"PROPDMGEXP_value"]<-10^0

question2[,"property_damage"]<-question2$PROPDMGEXP_value*question2$PROPDMG

c1<-grepl("m",question2$CROPDMGEXP,ignore.case = TRUE)
question2[c1,"CROPDMGEXP_value"]<-c(10^6)
c2<-grepl("k",question2$CROPDMGEXP,ignore.case = TRUE)
question2[c2,"CROPDMGEXP_value"]<-c(10^3)
c3<-grepl("b",question2$CROPDMGEXP,ignore.case = TRUE)
question2[c3,"CROPDMGEXP_value"]<-c(10^9)

question2[is.na(question2$CROPDMGEXP_value),"CROPDMGEXP_value"]<-10^0
question2[,"crop_damage"]<-question2$CROPDMGEXP_value*question2$CROPDMG
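
The same exponent conversion can also be sketched with a named lookup vector instead of repeated grepl() calls. The names exp_lookup and exp_to_multiplier and the sample values are hypothetical and not part of the original analysis. Unlike grepl(), the lookup matches the whole code rather than a substring, which is assumed to be equivalent here because the codes are single characters.

```r
# Assumed multiplier table: K = thousand, M = million, B = billion
exp_lookup <- c(K = 1e3, M = 1e6, B = 1e9)

# Hypothetical helper: map exponent codes to numeric multipliers,
# treating any unrecognised code (including "") as a multiplier of 1
exp_to_multiplier <- function(code) {
  mult <- exp_lookup[toupper(as.character(code))]
  mult[is.na(mult)] <- 1
  unname(mult)
}

# Made-up sample values for illustration
codes   <- c("K", "m", "B", "", "?")
amounts <- c(25, 2.5, 1, 10, 4)

losses <- amounts * exp_to_multiplier(codes)
losses  # 25000, 2500000, 1e9, 10, 4
```

For example, a PROPDMG of 25 with a PROPDMGEXP of K corresponds to a loss of 25 * 10^3 = 25,000 dollars.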

Results Question 2

Economic losses in both variables need to be aggregated by event type.

damage <- question2 %>% group_by(evento) %>% summarise(crop_damage = sum(crop_damage), property_damage = sum(property_damage))

colnames(damage)<-c("evento","crop_damage","property_damage")

prop_destructive=damage %>% arrange(desc(property_damage))
crop_destructive=damage %>% arrange(desc(crop_damage))

top5_crop = crop_destructive[1:5,]
top5_prop = prop_destructive[1:5,]

top5_crop[1,];top5_prop[1,]

## # A tibble: 1 x 3
##   evento      crop_damage property_damage
##   <chr>             <dbl>           <dbl>
## 1 RIVER FLOOD  5028734000      5079635000

## # A tibble: 1 x 3
##   evento crop_damage property_damage
##   <chr>        <dbl>           <dbl>
## 1 FLOOD   4073443450    121971090050

The results are plotted in the following bar graphs:

g3<-ggplot(top5_prop,aes(x=evento,y=property_damage/10^9))
g3<-g3+geom_bar(aes(fill=evento),stat = "identity")+labs(title = "Top 5 most property destructive events",y="losses (Billions of dollars)")

g4<-ggplot(top5_crop,aes(x=evento,y=crop_damage/10^9))
g4<-g4+geom_bar(aes(fill=evento),stat = "identity") + labs(title = "Top 5 most crop destructive events",y="losses (Billions of dollars)")

grid.arrange(g3,g4,nrow=1)

The most destructive event under a single combined loss category (property damage plus crop damage) can also be calculated.

question2[,"total_losses"]=question2$property_damage+question2$crop_damage
most_destructive = question2 %>% group_by(evento) %>% summarise(losses = sum(total_losses))
colnames(most_destructive)<-c("evento","loss")
top5<-most_destructive %>% arrange(desc(loss))

destruccion<-top5[1:5,]

The result is plotted on the following bar graph:

g5<-ggplot(destruccion,aes(x=evento,y=loss/10^9))
g5<-g5 + geom_bar(aes(fill=evento),stat = "identity") + labs(title = "Top 5 most destructive events of US",y="losses (Billions of dollars)")
g5

The most destructive event is:

top5[1,]

## # A tibble: 1 x 2
##   evento         loss
##   <chr>         <dbl>
## 1 FLOOD  126044533500

Summary

Q1: Based on the data and the analysis performed, we can conclude that tornadoes are the deadliest weather events and also cause the most injuries.

Q2: Based on the data and the analysis performed, we can conclude that floods are the weather event that produces the largest economic losses.