Reproducible Research Project2 Week4
The following analysis draws conclusions from the NOAA storm data, with the objective of identifying the meteorological events that are most harmful to public health and those that cause the largest economic losses in the US.
First, the data need to be loaded.
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
pathinit<-getwd()
fileUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
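filedest<-"dataproject.csv.bz2"
finalpath<-file.path(pathinit,filedest)
# mode = "wb" keeps the compressed file intact on Windows
download.file(fileUrl, destfile = finalpath, mode = "wb")
# read.csv decompresses .bz2 files transparently
tablatormenta<-read.csv(finalpath)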
str(tablatormenta)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
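The download step above fetches the file on every run. A minimal sketch of one way to avoid that is shown here, assuming a hypothetical helper named `download_once` (not part of the original analysis) that skips the download when a local copy already exists. The demonstration uses a small temporary file and a placeholder URL, so it needs no network access.

```r
# Hypothetical helper (an assumption, not from the original report):
# download a file only when no local copy exists yet.
download_once <- function(url, dest) {
  if (!file.exists(dest)) {
    # mode = "wb" keeps the compressed file byte-for-byte intact on Windows
    download.file(url, destfile = dest, mode = "wb")
  }
  invisible(dest)
}

# Demonstration without network access: build a tiny bz2-compressed CSV,
# so the destination already exists and no download is attempted.
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, "w")
writeLines(c("EVTYPE,FATALITIES", "TORNADO,1"), con)
close(con)

# The URL is a placeholder; it is never contacted because tmp exists.
download_once("https://example.invalid/StormData.csv.bz2", tmp)

# read.csv detects the bz2 compression and decompresses transparently.
demo <- read.csv(tmp)
print(demo)
```

In the report itself, the call would presumably be `download_once(fileUrl, finalpath)` followed by `read.csv(finalpath)`.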
To answer the first question (which event types are most harmful to population health), the variables FATALITIES and INJURIES are selected, since they are the only ones related to population health. FATALITIES is the number of reported deaths for each event, and INJURIES is the number of reported injuries for each event.
summary(tablatormenta$FATALITIES)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.0168 0.0000 583.0000
summary(tablatormenta$INJURIES)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1557 0.0000 1700.0000
A subset of the data frame is created with the three relevant variables: EVTYPE, FATALITIES and INJURIES.
question1<-select(tablatormenta,EVTYPE,FATALITIES,INJURIES)
question1<-as.data.table(question1)
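Only observations with some harm to population health are relevant to this question. Note that the filter below combines its two conditions with AND, so it keeps only events that caused both at least one fatality and at least one injury; events with deaths but no injuries (or the reverse) are excluded from the totals that follow.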
question1<-filter(question1,FATALITIES>0,INJURIES>0)
question1[,"evento"]<-as.character(question1$EVTYPE)
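As an illustration of how the filter condition changes which rows survive, the sketch below compares the comma-separated form used above with an explicit OR on a small made-up data frame (`toy` and its values are assumptions for illustration only, not NOAA data).

```r
library(dplyr)

# Made-up example data; not taken from the NOAA file.
toy <- data.frame(
  EVTYPE     = c("A", "B", "C", "D"),
  FATALITIES = c(1, 0, 2, 0),
  INJURIES   = c(0, 3, 5, 0),
  stringsAsFactors = FALSE
)

# Comma-separated conditions are combined with AND:
# only rows with both a fatality and an injury remain.
both <- filter(toy, FATALITIES > 0, INJURIES > 0)

# An explicit | keeps rows with at least one fatality OR one injury.
either <- filter(toy, FATALITIES > 0 | INJURIES > 0)

print(both$EVTYPE)
print(either$EVTYPE)
```

Applying the OR form to the full data set would change the totals reported in this section, so the figures shown here should be read as belonging to the AND filter.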
Grouping the selected variables by event type and summing them answers the proposed question.
# summarise_at() replaces the deprecated summarise_each()
fatal_event = question1 %>% group_by(evento) %>% summarise_at(vars(FATALITIES,INJURIES),sum)
colnames(fatal_event)<-c("evento","FATALITIES","INJURIES")
fatal_event=fatal_event %>% arrange(desc(FATALITIES))
injury_event=fatal_event %>% arrange(desc(INJURIES))
most_lethal=fatal_event[1:5,]
most_injuries=injury_event[1:5,]
most_lethal
## # A tibble: 5 x 3
## evento FATALITIES INJURIES
## <chr> <dbl> <dbl>
## 1 TORNADO 5227 60187
## 2 EXCESSIVE HEAT 402 4791
## 3 LIGHTNING 283 649
## 4 TSTM WIND 199 646
## 5 FLASH FLOOD 171 641
most_injuries
## # A tibble: 5 x 3
## evento FATALITIES INJURIES
## <chr> <dbl> <dbl>
## 1 TORNADO 5227 60187
## 2 EXCESSIVE HEAT 402 4791
## 3 FLOOD 104 2679
## 4 ICE STORM 35 1720
## 5 HEAT 73 1420
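The `str()` output earlier shows that EVTYPE has 985 levels, including at least one label with a leading space (" HIGH SURF ADVISORY"). Labels that differ only in case or surrounding whitespace are grouped as separate events. A minimal base R sketch of one possible normalisation step follows; apart from the first label, the vector `raw` is made up for illustration, and this step is not part of the original analysis.

```r
# Illustrative labels; only " HIGH SURF ADVISORY" appears in the str() output.
raw <- c(" HIGH SURF ADVISORY", "Tstm Wind", "TSTM WIND", "tornado ")

# Trim surrounding whitespace and convert to upper case so that
# variants of the same label collapse into one group.
clean <- toupper(trimws(raw))

print(table(clean))
```

This only merges trivially different spellings. It does not merge distinct labels such as HEAT and EXCESSIVE HEAT, which would need an explicit mapping.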
The following graph shows the top 5 most dangerous event types for public health, by fatalities and by injuries.
# fill is mapped inside aes() in geom_bar; the stray fill argument to ggplot() is dropped
g1<-ggplot(most_lethal,aes(x=evento,y=FATALITIES))
g1<-g1+geom_bar(aes(fill=evento),stat = "identity")+labs(title = "Top 5 events by fatalities")
g2<-ggplot(most_injuries,aes(x=evento,y=INJURIES))
g2<-g2+geom_bar(aes(fill=evento),stat = "identity") + labs(title = "Top 5 events by injuries")
grid.arrange(g1,g2,nrow=1)
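Because `evento` is a character column, ggplot2 draws the bars in alphabetical order rather than by size. A sketch of one way to order them by value is shown below, using `reorder()` from base R and the fatality totals printed in the `most_lethal` table above. The object names `lethal_demo` and `g_ordered` are assumptions introduced for this example.

```r
library(ggplot2)

# Values copied from the most_lethal table printed above.
lethal_demo <- data.frame(
  evento     = c("TORNADO", "EXCESSIVE HEAT", "LIGHTNING", "TSTM WIND", "FLASH FLOOD"),
  FATALITIES = c(5227, 402, 283, 199, 171),
  stringsAsFactors = FALSE
)

# reorder() turns evento into a factor whose levels are sorted by the
# second argument; negating it puts the largest total first.
lethal_demo$evento <- reorder(lethal_demo$evento, -lethal_demo$FATALITIES)

# geom_col() is equivalent to geom_bar(stat = "identity").
g_ordered <- ggplot(lethal_demo, aes(x = evento, y = FATALITIES)) +
  geom_col(aes(fill = evento)) +
  labs(title = "Top 5 events by fatalities", x = "event type")

print(levels(lethal_demo$evento))
```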
From this information we can conclude that tornadoes are the most dangerous type of event.
To answer the second question (which event types have the greatest economic consequences), the variables PROPDMG, PROPDMGEXP, CROPDMG and CROPDMGEXP are selected, since they are the only ones related to economic losses. PROPDMG and CROPDMG hold the numeric part of the monetary loss from property damage and crop damage respectively. The EXP variables give the order of magnitude of that loss: for example, K corresponds to 10^3 dollars and M corresponds to 10^6 dollars.
question2<-select(tablatormenta,EVTYPE,PROPDMG,PROPDMGEXP,CROPDMG,CROPDMGEXP)
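Only events with CROPDMG or PROPDMG above 0 represent an economic loss. Note that the filter below combines its two conditions with AND, so it keeps only events with both property damage and crop damage above 0; events with only one kind of damage are excluded from the totals that follow.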
question2<-filter(question2,PROPDMG>0,CROPDMG>0)
question2[,"evento"]<-as.character(question2$EVTYPE)
Both EXP variables (PROPDMGEXP and CROPDMGEXP) need to be converted to numeric multipliers before the analysis. Codes other than K, M or B (in either case) are treated as a multiplier of 1.
m1<-grepl("m",question2$PROPDMGEXP,ignore.case = TRUE)
question2[m1,"PROPDMGEXP_value"]<-c(10^6)
m2<-grepl("k",question2$PROPDMGEXP,ignore.case = TRUE)
question2[m2,"PROPDMGEXP_value"]<-c(10^3)
m3<-grepl("b",question2$PROPDMGEXP,ignore.case = TRUE)
question2[m3,"PROPDMGEXP_value"]<-c(10^9)
question2[is.na(question2$PROPDMGEXP_value),"PROPDMGEXP_value"]<-10^0
question2[,"property_damage"]<-question2$PROPDMGEXP_value*question2$PROPDMG
c1<-grepl("m",question2$CROPDMGEXP,ignore.case = TRUE)
question2[c1,"CROPDMGEXP_value"]<-c(10^6)
c2<-grepl("k",question2$CROPDMGEXP,ignore.case = TRUE)
question2[c2,"CROPDMGEXP_value"]<-c(10^3)
c3<-grepl("b",question2$CROPDMGEXP,ignore.case = TRUE)
question2[c3,"CROPDMGEXP_value"]<-c(10^9)
question2[is.na(question2$CROPDMGEXP_value),"CROPDMGEXP_value"]<-10^0
question2[,"crop_damage"]<-question2$CROPDMGEXP_value*question2$CROPDMG
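The conversion above can also be written as a single lookup. The sketch below assumes a hypothetical helper `exp_multiplier` (not part of the original analysis) that maps K, M and B to their powers of ten and falls back to 1 for any other code, mirroring the behaviour of the `grepl` steps for single-character codes. The `codes` vector is made up for illustration.

```r
# Hypothetical helper (an assumption, not from the original report):
# map an exponent code to a numeric multiplier, defaulting to 1.
exp_multiplier <- function(code) {
  lookup <- c(K = 1e3, M = 1e6, B = 1e9)
  # Exact, case-insensitive match; unmatched codes become NA.
  mult <- unname(lookup[toupper(as.character(code))])
  mult[is.na(mult)] <- 1
  mult
}

# Illustrative codes, including empty and unrecognised ones.
codes <- c("K", "m", "B", "", "?", "5")
multipliers <- exp_multiplier(codes)
print(multipliers)

# Worked example: a recorded value of 25 with code "K" is 25 * 10^3 dollars.
print(25 * exp_multiplier("K"))
```

With such a helper, each damage column could be computed in one line, for instance `question2$PROPDMG * exp_multiplier(question2$PROPDMGEXP)`.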
Economic losses in both variables need to be aggregated by event type.
# summarise_at() replaces the deprecated summarise_each()
damage = question2 %>% group_by(evento) %>% summarise_at(vars(crop_damage,property_damage),sum)
colnames(damage)<-c("evento","crop_damage","property_damage")
prop_destructive=damage %>% arrange(desc(property_damage))
crop_destructive=damage %>% arrange(desc(crop_damage))
top5_crop = crop_destructive[1:5,]
top5_prop = prop_destructive[1:5,]
top5_crop[1,];top5_prop[1,]
## # A tibble: 1 x 3
## evento crop_damage property_damage
## <chr> <dbl> <dbl>
## 1 RIVER FLOOD 5028734000 5079635000
## # A tibble: 1 x 3
## evento crop_damage property_damage
## <chr> <dbl> <dbl>
## 1 FLOOD 4073443450 121971090050
The results are plotted on the following bar graphs:
g3<-ggplot(top5_prop,aes(x=evento,y=property_damage/10^9))
g3<-g3+geom_bar(aes(fill=evento),stat = "identity")+labs(title = "Top 5 most property destructive events",y="losses (Billions of dollars)")
g4<-ggplot(top5_crop,aes(x=evento,y=crop_damage/10^9))
g4<-g4+geom_bar(aes(fill=evento),stat = "identity") + labs(title = "Top 5 most crop destructive events",y="losses (Billions of dollars)")
grid.arrange(g3,g4,nrow=1)
The most destructive event can also be identified using a single combined loss figure (property damage plus crop damage):
question2[,"total_losses"]=question2$property_damage+question2$crop_damage
most_destructive = question2 %>% group_by(evento) %>% summarise(loss = sum(total_losses))
top5<-most_destructive %>% arrange(desc(loss))
destruccion<-top5[1:5,]
The result is plotted on the following bar graph:
g5<-ggplot(destruccion,aes(x=evento,y=loss/10^9))
g5<-g5 + geom_bar(aes(fill=evento),stat = "identity") + labs(title = "Top 5 most destructive events of US",y="losses (Billions of dollars)")
g5
The most destructive event is:
top5[1,]
## # A tibble: 1 x 2
## evento loss
## <chr> <dbl>
## 1 FLOOD 126044533500
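As a quick consistency check, the combined FLOOD figure should equal the sum of the property and crop totals printed for FLOOD earlier in this section. The sketch below redoes that arithmetic with the values copied from those tables; the variable names are introduced here for illustration only.

```r
# Values copied from the FLOOD row of the property/crop table above.
flood_property <- 121971090050
flood_crop     <- 4073443450

# Combined loss, as computed by total_losses in the analysis.
flood_total <- flood_property + flood_crop

# Format without scientific notation, and also express in billions of dollars.
print(format(flood_total, scientific = FALSE))
print(flood_total / 1e9)
```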
Q1: Based on the data and the analysis above, we can conclude that tornadoes are the deadliest weather events and also cause the most injuries.
Q2: Based on the data and the analysis above, we can conclude that floods are the weather event that causes the largest economic losses.