This project explores the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database as required by the Coursera Course “Reproducible Research” Peer Assessment 2. Complementary information about the database can be found here. The project includes data obtaining and processing and analysis scripts and explanations to answer 2 questions about storm events: what are most harmful with respect to population health and which have the greatest economic consequences accross de USA. According to the analysis, it is possible that much information could be lost in first years of registry, so the analysis was restricted to data recorded between 1995-01-01 and 2011-11-30.
Health population impact of storm events was evaluated in terms of fatality and injury. In the first case the most harmful events were EXCESSIVE HEAT, TORNADO and FLASH FLOOD, whereas in the second they were TORNADO, FLOOD and EXCESSIVE HEAT.
Economical impact of storm events was evaluated with the properties and crops damage loses estimations in US dollars, and FLOOD, HURRICANE/TYPHOON and STORM SURGE where the most problematic in this area.
The data is processed as follows:
if “StormData.csv” file is not found in the working directory, data in bzip2 format is downloaded from Storm Data url, and the former file is extracte using R.utils::bunzip2() function.
Data is read to stormdata dataframe.
To speed up the script processiong, cache option of this chunk is set TRUE
if (!file.exists("StormData.csv")){
library (R.utils)
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2","StormData.csv.bz2")
bunzip2("StormData.csv.bz2")
}
stormdata<-read.csv("StormData.csv",
na.strings = "",
header = TRUE,
stringsAsFactors = FALSE)
stormdata$BGN_DATE<-as.Date(stormdata$BGN_DATE,format = "%m/%d/%Y")
library (ggplot2)
library (dplyr)
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
qplot (x = format(BGN_DATE,"%Y"),
data=stormdata, geom="bar",
fill=format(BGN_DATE,"%Y")>="1995")+
ggtitle("Storm events registered per year") + xlab("Year")+
scale_x_discrete(breaks=seq(1950,2011,by = 10))+
guides(fill=FALSE)
stormdata<-filter(stormdata, BGN_DATE>="1995-01-01")
First question: Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
The dataset include 2 variable related to population health outcomes: fatalities and injuries.
I check the proportion of missing values:
cat("There are ",mean(is.na(stormdata$FATALITIES))*100, " % of missing values in fatalities", sep="")
## There are 0 % of missing values in fatalities
cat("There are ",mean(is.na(stormdata$INJURIES))*100, " % of missing values in injuries",sep="")
## There are 0 % of missing values in injuries
I analyse both of them as outcomes without combining them as they have obvious different weight in terms of public health. I decided to calculate the sum of each outcome per event to summarise the result.
library (knitr)
table.fatalities<-stormdata%>%
group_by(EVTYPE)%>%
summarise(fatalities=sum (FATALITIES), eventCount=length (EVTYPE))%>%
arrange(desc(fatalities))
eventFatalitiesOrder<-table.fatalities$EVTYPE
table.fatalities<-mutate(table.fatalities,
EVTYPE=factor(EVTYPE,levels = eventFatalitiesOrder))
fatalities<-kable(head (table.fatalities, n = 10),align = "l")
The top 10 fatality related EVTYPE were:
knit_print (fatalities)
| EVTYPE | fatalities | eventCount |
|---|---|---|
| EXCESSIVE HEAT | 1903 | 1675 |
| TORNADO | 1545 | 24335 |
| FLASH FLOOD | 934 | 52673 |
| HEAT | 924 | 755 |
| LIGHTNING | 729 | 14280 |
| FLOOD | 423 | 24641 |
| RIP CURRENT | 360 | 459 |
| HIGH WIND | 241 | 19956 |
| TSTM WIND | 241 | 128923 |
| AVALANCHE | 223 | 380 |
table.injuries<-stormdata%>%
group_by(EVTYPE)%>%
summarise(injuries=sum (INJURIES), eventCount=length (EVTYPE))%>%
arrange(desc(injuries))
eventInjuriesOrder<-table.injuries$EVTYPE
table.injuries<-mutate(table.injuries,
EVTYPE=factor(EVTYPE,levels = eventInjuriesOrder))
injuries<-kable(head (table.injuries, n = 10),align = "l")
The top 10 injury related EVTYPE were:
knit_print (injuries)
| EVTYPE | injuries | eventCount |
|---|---|---|
| TORNADO | 21765 | 24335 |
| FLOOD | 6769 | 24641 |
| EXCESSIVE HEAT | 6525 | 1675 |
| LIGHTNING | 4631 | 14280 |
| TSTM WIND | 3630 | 128923 |
| HEAT | 2030 | 755 |
| FLASH FLOOD | 1734 | 52673 |
| THUNDERSTORM WIND | 1426 | 81745 |
| WINTER STORM | 1298 | 11372 |
| HURRICANE/TYPHOON | 1275 | 88 |
Thus, the storm event that resulted more harmful in terms of fatalities was EXCESSIVE HEAT and in terms of injuries was TORNADO from 1995-01-01 to 2011-11-30.
Second question: Across the United States, which types of events have the greatest economic consequences?
First, total damage economic consequences should be calculated.
In storm database, properties and crop damage estimates were registered in 4 columns: PROPDMG and PRPDMGEXP, and CROPDMG and CROPDMGEXP. In both cases the estimate is the first 2 column (PROPDMG or CROPDMG) multiplied by the EXP included in the second 2 column (PROPDMGEXP or CROPDMGEXP). In the EXP columns, alphabetical characters used to signify magnitude include “K” for thousands, “M” for millions, and “B” for billions. Alternatively, a number is included to denote the multiplier exponent (i.e., “1” is 10^1). The EXP columns have some codes that do not fall in any of these cathegories. In such cases, I assigned a NA to those estimates. If EXP variable is a missing value, I assumed that the multiplier is 1. To perform calculations I built a function (damageFun()) which perform calculations and I used apply() function to calculates by row results of properties and crop damage, and finally they were sum in totalDamage variable.
damageFun<-function (data){
x<-as.numeric(data[1])
y<-data[2]
suppressWarnings(w<-as.numeric(y))
if (grepl(pattern = "[Kk]", y)){
return(x * 1000)
}else if(grepl (pattern = "[Mm]", y)){
return(x * 1e+6)
}else if (grepl (pattern="[Bb]", y)){
return(x * 1e+9)
}else if (!is.na(w)){
return(x *10^w)
}else if (is.na(y)){
return (x)
}else{
return(NA)
}
}
propdmg<-select(.data = stormdata,PROPDMG:PROPDMGEXP)
cropdmg<-select(.data = stormdata,CROPDMG:CROPDMGEXP)
propDamage<-apply (propdmg,1, FUN = damageFun)
cropDamage<-apply (cropdmg,1, FUN = damageFun)
stormdata$properties<-propDamage
stormdata$crop<-cropDamage
stormdata$totalDamage<-rowSums(cbind(propDamage,cropDamage),na.rm=TRUE)
I check for missing values properties and crop damage variables:
cat("There are ",round(mean(is.na(stormdata$properties))*100,5),
" % of missing values in properties", sep="")
## There are 0.00235 % of missing values in properties
cat("There are ",round(mean(is.na(stormdata$crop))*100,5), " % of missing values in crop",sep="")
## There are 0.00073 % of missing values in crop
Then I made a summary table for damage estimates (properties, crop and totalDamage) by event type (EVTYPE) with sum() as summary function. This table was arranged in descendent order of totalDamage estimate loses in order to build a plot (see below).
table.damage<-select (stormdata, EVTYPE, properties, crop, totalDamage)%>%
group_by(EVTYPE)%>%
summarise(totalDamage=sum(totalDamage,na.rm=TRUE),properties=sum (properties,na.rm=TRUE),crop=sum(crop,na.rm=TRUE))%>%
arrange(desc(totalDamage))
eventDamageOrder<-table.damage$EVTYPE
table.damage<-mutate(table.damage,
EVTYPE=factor(EVTYPE,levels = eventDamageOrder))
The following plot illustrates the damage estimates from properties and crops accross US in billions of dollars of the 10 EVTYPE with larger economical consequences from 1995 to 2011.
library (tidyr)
table.damage.long<-gather(data = head(table.damage,10),
key = damage,
value = estimate, -EVTYPE, -totalDamage)
damageplot<-ggplot(data = (table.damage.long),
aes (x = EVTYPE,y = estimate/1e+9, fill=damage))+
geom_bar(stat="identity")+coord_flip()+
ggtitle("Damage estimates per event type from 1995 to 2011 accross the USA")+
xlab("")+ylab("billions of US dollars")
damageplot
FLOOD produced the largest economical loses from 1995-01-01 to 2011-11-30.