Synopsis

This project explores the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database as required by the Coursera Course “Reproducible Research” Peer Assessment 2. Complementary information about the database can be found here. The project includes data obtaining and processing and analysis scripts and explanations to answer 2 questions about storm events: what are most harmful with respect to population health and which have the greatest economic consequences accross de USA. According to the analysis, it is possible that much information could be lost in first years of registry, so the analysis was restricted to data recorded between 1995-01-01 and 2011-11-30.
Health population impact of storm events was evaluated in terms of fatality and injury. In the first case the most harmful events were EXCESSIVE HEAT, TORNADO and FLASH FLOOD, whereas in the second they were TORNADO, FLOOD and EXCESSIVE HEAT.
Economical impact of storm events was evaluated with the properties and crops damage loses estimations in US dollars, and FLOOD, HURRICANE/TYPHOON and STORM SURGE where the most problematic in this area.

Data processing

The data is processed as follows:

  1. if “StormData.csv” file is not found in the working directory, data in bzip2 format is downloaded from Storm Data url, and the former file is extracte using R.utils::bunzip2() function.

  2. Data is read to stormdata dataframe.

  3. To speed up the script processiong, cache option of this chunk is set TRUE

if (!file.exists("StormData.csv")){
    library (R.utils)
    download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2","StormData.csv.bz2")
    bunzip2("StormData.csv.bz2")
}

stormdata<-read.csv("StormData.csv",
                    na.strings = "",
                    header = TRUE,
                    stringsAsFactors = FALSE)

stormdata$BGN_DATE<-as.Date(stormdata$BGN_DATE,format = "%m/%d/%Y")
  1. Database subsetting: Storm database is a 902297x37 (rows x columns) dataframe that has data from 01/03/1950 to 11/30/2011. However, events registry increased along this time as is shown in the following plot. There is a step increase in events count in 1995. This suggests that data after this point may be more complete, so I decided to use only data obtain after 2005-01-01 by subsetting stormdata with BGN_DATE >= 1995 (green bars).
library (ggplot2)
library (dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
qplot (x = format(BGN_DATE,"%Y"),
       data=stormdata, geom="bar",
       fill=format(BGN_DATE,"%Y")>="1995")+ 
    ggtitle("Storm events registered per year") + xlab("Year")+
    scale_x_discrete(breaks=seq(1950,2011,by = 10))+
    guides(fill=FALSE)

stormdata<-filter(stormdata, BGN_DATE>="1995-01-01")

Data analysis must address the following questions:

  1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
  2. Across the United States, which types of events have the greatest economic consequences?

Results

First question: Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

The dataset include 2 variable related to population health outcomes: fatalities and injuries.
I check the proportion of missing values:

cat("There are ",mean(is.na(stormdata$FATALITIES))*100, " % of missing values in fatalities", sep="")
## There are 0 % of missing values in fatalities
cat("There are ",mean(is.na(stormdata$INJURIES))*100, " % of missing values in injuries",sep="")
## There are 0 % of missing values in injuries

I analyse both of them as outcomes without combining them as they have obvious different weight in terms of public health. I decided to calculate the sum of each outcome per event to summarise the result.

  1. I start with fatalities. I create a table of event type (EVTYPE), fatalities count and event count (eventCount), arranged in descendent order of fatality count.
library (knitr)


table.fatalities<-stormdata%>%
    group_by(EVTYPE)%>%
    summarise(fatalities=sum (FATALITIES), eventCount=length (EVTYPE))%>%
    arrange(desc(fatalities))

eventFatalitiesOrder<-table.fatalities$EVTYPE
table.fatalities<-mutate(table.fatalities,
                         EVTYPE=factor(EVTYPE,levels = eventFatalitiesOrder))

fatalities<-kable(head (table.fatalities, n = 10),align = "l")

The top 10 fatality related EVTYPE were:

knit_print (fatalities)
EVTYPE fatalities eventCount
EXCESSIVE HEAT 1903 1675
TORNADO 1545 24335
FLASH FLOOD 934 52673
HEAT 924 755
LIGHTNING 729 14280
FLOOD 423 24641
RIP CURRENT 360 459
HIGH WIND 241 19956
TSTM WIND 241 128923
AVALANCHE 223 380
  1. Then I analyse the injuries. I create a table of event type (EVTYPE), injuries and event count (eventCount), arranged in descendent order of injuries count.
table.injuries<-stormdata%>%
    group_by(EVTYPE)%>%
    summarise(injuries=sum (INJURIES), eventCount=length (EVTYPE))%>%
    arrange(desc(injuries))

eventInjuriesOrder<-table.injuries$EVTYPE
table.injuries<-mutate(table.injuries, 
                       EVTYPE=factor(EVTYPE,levels = eventInjuriesOrder))


injuries<-kable(head (table.injuries, n = 10),align = "l")

The top 10 injury related EVTYPE were:

knit_print (injuries)
EVTYPE injuries eventCount
TORNADO 21765 24335
FLOOD 6769 24641
EXCESSIVE HEAT 6525 1675
LIGHTNING 4631 14280
TSTM WIND 3630 128923
HEAT 2030 755
FLASH FLOOD 1734 52673
THUNDERSTORM WIND 1426 81745
WINTER STORM 1298 11372
HURRICANE/TYPHOON 1275 88
Thus, the storm event that resulted more harmful in terms of fatalities was  EXCESSIVE HEAT and in terms of injuries was TORNADO from 1995-01-01 to 2011-11-30.

Second question: Across the United States, which types of events have the greatest economic consequences?

First, total damage economic consequences should be calculated.
In storm database, properties and crop damage estimates were registered in 4 columns: PROPDMG and PRPDMGEXP, and CROPDMG and CROPDMGEXP. In both cases the estimate is the first 2 column (PROPDMG or CROPDMG) multiplied by the EXP included in the second 2 column (PROPDMGEXP or CROPDMGEXP). In the EXP columns, alphabetical characters used to signify magnitude include “K” for thousands, “M” for millions, and “B” for billions. Alternatively, a number is included to denote the multiplier exponent (i.e., “1” is 10^1). The EXP columns have some codes that do not fall in any of these cathegories. In such cases, I assigned a NA to those estimates. If EXP variable is a missing value, I assumed that the multiplier is 1. To perform calculations I built a function (damageFun()) which perform calculations and I used apply() function to calculates by row results of properties and crop damage, and finally they were sum in totalDamage variable.

damageFun<-function (data){
    x<-as.numeric(data[1])
    y<-data[2]
    suppressWarnings(w<-as.numeric(y)) 
    if (grepl(pattern = "[Kk]", y)){
        return(x * 1000)
    }else if(grepl (pattern = "[Mm]", y)){
        return(x * 1e+6)
    }else if (grepl (pattern="[Bb]", y)){
        return(x * 1e+9)
    }else if (!is.na(w)){
        return(x *10^w)
    }else if (is.na(y)){
        return (x)
    }else{
        return(NA)
    }
}

propdmg<-select(.data = stormdata,PROPDMG:PROPDMGEXP)
cropdmg<-select(.data = stormdata,CROPDMG:CROPDMGEXP)

propDamage<-apply (propdmg,1, FUN = damageFun)
cropDamage<-apply (cropdmg,1, FUN = damageFun)


stormdata$properties<-propDamage
stormdata$crop<-cropDamage
stormdata$totalDamage<-rowSums(cbind(propDamage,cropDamage),na.rm=TRUE)

I check for missing values properties and crop damage variables:

cat("There are ",round(mean(is.na(stormdata$properties))*100,5),
    " % of missing values in properties", sep="")
## There are 0.00235 % of missing values in properties
cat("There are ",round(mean(is.na(stormdata$crop))*100,5), " % of missing values in crop",sep="")
## There are 0.00073 % of missing values in crop

Then I made a summary table for damage estimates (properties, crop and totalDamage) by event type (EVTYPE) with sum() as summary function. This table was arranged in descendent order of totalDamage estimate loses in order to build a plot (see below).

table.damage<-select (stormdata, EVTYPE, properties, crop, totalDamage)%>%
    group_by(EVTYPE)%>%
    summarise(totalDamage=sum(totalDamage,na.rm=TRUE),properties=sum (properties,na.rm=TRUE),crop=sum(crop,na.rm=TRUE))%>%
    arrange(desc(totalDamage))

eventDamageOrder<-table.damage$EVTYPE
table.damage<-mutate(table.damage, 
                       EVTYPE=factor(EVTYPE,levels = eventDamageOrder))

The following plot illustrates the damage estimates from properties and crops accross US in billions of dollars of the 10 EVTYPE with larger economical consequences from 1995 to 2011.

library (tidyr)

table.damage.long<-gather(data = head(table.damage,10),
                          key = damage, 
                          value = estimate, -EVTYPE, -totalDamage)

damageplot<-ggplot(data = (table.damage.long), 
                   aes (x = EVTYPE,y = estimate/1e+9, fill=damage))+
    geom_bar(stat="identity")+coord_flip()+
    ggtitle("Damage estimates per event type from 1995 to 2011 accross the USA")+
    xlab("")+ylab("billions of US dollars")

damageplot

FLOOD produced the largest economical loses from 1995-01-01 to 2011-11-30.