Created by Fabiano Silva - nov/2015
In this document, as part of the “Reproducible Research” course, I’ll explore the NOAA Storm Database and evaluate the impact of meteorological events on public health and the economy. The data file was obtained from the link. Details about this dataset can be found in this link.
After downloading the data, the first step was to load its content into a new variable:
readcsvbz2file <- read.csv(bzfile("./data/repdata-data-StormData.csv.bz2"))
With the data loaded it’s possible to start working on the details of the impacts:
The idea is to identify the events with the greatest impact on public health. To do this, the deaths (the “FATALITIES” field) are summed for each event type, and the 5 event types with the largest totals are used.
deaths<-aggregate(readcsvbz2file[, "FATALITIES"], by = list(readcsvbz2file$EVTYPE), FUN = "sum")
head(deaths[order(deaths$x, decreasing=TRUE),], n=5)
## Group.1 x
## 834 TORNADO 5633
## 130 EXCESSIVE HEAT 1903
## 153 FLASH FLOOD 978
## 275 HEAT 937
## 464 LIGHTNING 816
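To make the aggregate-then-sort step easier to follow, here is a minimal sketch on a made-up data frame. The `toy` values and the names `toy_deaths` and `top2` are illustrative assumptions and are not part of the storm data.

```r
# Toy data standing in for the storm database (values are made up)
toy <- data.frame(
  EVTYPE     = c("TORNADO", "HEAT", "TORNADO", "FLOOD", "HEAT", "WIND"),
  FATALITIES = c(3, 1, 2, 0, 3, 1)
)

# Sum fatalities per event type; aggregate() names the columns Group.1 and x
toy_deaths <- aggregate(toy[, "FATALITIES"], by = list(toy$EVTYPE), FUN = "sum")

# Sort by the summed column, largest first, and keep the top 2 rows
top2 <- head(toy_deaths[order(toy_deaths$x, decreasing = TRUE), ], n = 2)
print(top2)
```

The same pattern, with `n=5`, produces the table shown above.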
With 10267 deaths, those 5 events account for about 68% of all fatalities, as shown in the chart below.
# Labels: the 5 event types with the most fatalities, plus a catch-all slice
lbls<-as.character(head(deaths[order(deaths$x, decreasing=TRUE),c("Group.1")], n=5))
lbls[6]<-"OTHERS"
slices<-head(deaths[order(deaths$x, decreasing=TRUE),c("x")], n=5)
# Total fatalities for the top 5 event types
totalTop5 <- sum(slices)
totalFatalities <- sum(deaths$x)
# Sixth slice: all remaining event types combined
slices[6]<-totalFatalities- totalTop5
pie(slices, labels = lbls, main="Fatalities by Event Type")
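As a quick check of the 10267 figure quoted above, the five counts from the fatalities table can be added by hand. This is only a sketch; the names `top5` and `total_top5` are illustrative.

```r
# Fatality counts of the five leading event types, copied from the table above
top5 <- c(TORNADO = 5633, `EXCESSIVE HEAT` = 1903, `FLASH FLOOD` = 978,
          HEAT = 937, LIGHTNING = 816)

total_top5 <- sum(top5)   # 10267, the figure quoted in the text
print(total_top5)

# Share of each event type within the top five (not within all fatalities)
print(round(100 * top5 / total_top5, 1))
```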
For the economic impacts it’s necessary to first normalize the data and fix the “multipliers” for the dollar amounts.
For this task I started by reducing the data to the columns needed for this evaluation and removing events that have no usable data for the economic impact:
clean_data<-readcsvbz2file[, c("EVTYPE", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")]
clean_data<-clean_data[(clean_data$PROPDMGEXP != ""
& clean_data$PROPDMGEXP != "+"
& clean_data$PROPDMGEXP != "-"
& clean_data$PROPDMGEXP != "?") &
(clean_data$CROPDMGEXP != ""
& clean_data$CROPDMGEXP != "+"
& clean_data$CROPDMGEXP != "-"
& clean_data$CROPDMGEXP != "?"), ]
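The same filter can be written more compactly with `%in%`. The sketch below uses a made-up data frame; `toy`, `invalid`, `keep` and `toy_clean` are illustrative names, and this is an equivalent formulation rather than the code used for the results in this document.

```r
# Toy rows (made up) covering valid and invalid exponent codes
toy <- data.frame(
  EVTYPE     = c("A", "B", "C", "D", "E"),
  PROPDMGEXP = c("K", "",  "M", "?", "B"),
  CROPDMGEXP = c("K", "K", "+", "M", "m"),
  stringsAsFactors = FALSE
)

invalid <- c("", "+", "-", "?")   # the four codes excluded in the text

# Keep a row only when both exponent columns hold a usable code
keep <- !(toy$PROPDMGEXP %in% invalid) & !(toy$CROPDMGEXP %in% invalid)
toy_clean <- toy[keep, ]
print(toy_clean$EVTYPE)
```

Note that, as in the filter above, a row is dropped when either column holds an excluded code, so events that report only property damage or only crop damage are left out.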
Once this is done, the correct “multiplier” is added to each row:
clean_data$multiplierProp[(toupper(clean_data$PROPDMGEXP)=="K")]<-1000
clean_data$multiplierProp[(toupper(clean_data$PROPDMGEXP)=="M")]<-1000000
clean_data$multiplierProp[(toupper(clean_data$PROPDMGEXP)=="B")]<-1000000000
clean_data$multiplierCrop[(toupper(clean_data$CROPDMGEXP)=="K")]<-1000
clean_data$multiplierCrop[(toupper(clean_data$CROPDMGEXP)=="M")]<-1000000
clean_data$multiplierCrop[(toupper(clean_data$CROPDMGEXP)=="B")]<-1000000000
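An alternative way to express the same mapping is a named lookup vector. This is a minimal sketch with made-up sample codes; `multipliers`, `codes` and `mult` are illustrative names. It also makes visible what happens to codes other than K, M and B: as far as I can tell, they end up as `NA` here, just as they receive no multiplier in the assignments above.

```r
# Named lookup vector: exponent code -> dollar multiplier
multipliers <- c(K = 1e3, M = 1e6, B = 1e9)

codes <- c("K", "m", "B", "H", "5")           # made-up sample codes
mult  <- unname(multipliers[toupper(codes)])  # unknown codes become NA
print(mult)
```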
And this allows us to make the calculation for each event:
clean_data$totalDamage <- (clean_data$PROPDMG*clean_data$multiplierProp) + (clean_data$CROPDMG*clean_data$multiplierCrop)
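A small worked example of this calculation, on made-up rows, may help. The values in `toy` are assumptions for illustration only; the column names follow the ones used above.

```r
# Three made-up rows with damage amounts and their multipliers
toy <- data.frame(
  PROPDMG = c(25, 2.5, 0), multiplierProp = c(1e3, 1e6, 1e3),
  CROPDMG = c(10, 0, 5),   multiplierCrop = c(1e3, 1e3, 1e6)
)

# Total damage in dollars = property damage + crop damage, each scaled
toy$totalDamage <- toy$PROPDMG * toy$multiplierProp +
  toy$CROPDMG * toy$multiplierCrop
print(toy$totalDamage)   # 35000, 2500000, 5000000
```

The first row, for instance, is 25 x 1000 + 10 x 1000 = 35000 dollars.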
And those events are then aggregated to display the top 10:
impact<-aggregate(clean_data[, "totalDamage"], by = list(clean_data$EVTYPE), FUN = "sum")
head(impact[order(impact$x, decreasing=TRUE),c("Group.1","x")], n=10)
## Group.1 x
## 23 FLOOD 138007444500
## 62 HURRICANE/TYPHOON 29348167800
## 57 HURRICANE 12405268000
## 75 RIVER FLOOD 10108369000
## 85 STORM SURGE/TIDE 4641493000
## 89 THUNDERSTORM WIND 3813647990
## 118 WILDFIRE 3684468370
## 52 HIGH WIND 3057666640
## 60 HURRICANE OPAL 2187000000
## 11 DROUGHT 1886417000
In conclusion, floods are the most harmful events for the economy and are also among the top 10 events for health impact. On the health side, tornadoes are the most dangerous events.