Health and economical consequences of storm activity accross de United States

Synopsis

This project explores the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database as required by the Coursera Course “Reproducible Research” Peer Assessment 2. Complementary information about the database can be found here. The project includes data obtaining and processing and analysis scripts and explanations to answer 2 questions about storm events: what are most harmful with respect to population health and which have the greatest economic consequences accross de USA. According to the analysis, it is possible that much information could be lost in first years of registry, so the analysis was restricted to data recorded between 1995-01-01 and 2011-11-30.
Health population impact of storm events was evaluated in terms of fatality and injury. In the first case the most harmful events were EXCESSIVE HEAT, TORNADO and FLASH FLOOD, whereas in the second they were TORNADO, FLOOD and EXCESSIVE HEAT.
Economical impact of storm events was evaluated with the properties and crops damage loses estimations in US dollars, and FLOOD, HURRICANE/TYPHOON and STORM SURGE where the most problematic in this area.

Data processing

The data is processed as follows:

if “StormData.csv” file is not found in the working directory, data in bzip2 format is downloaded from Storm Data url, and the former file is extracte using R.utils::bunzip2() function.
Data is read to stormdata dataframe.
To speed up the script processiong, cache option of this chunk is set TRUE

if (!file.exists("StormData.csv")){
    library (R.utils)
    download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2","StormData.csv.bz2")
    bunzip2("StormData.csv.bz2")
}

stormdata<-read.csv("StormData.csv",
                    na.strings = "",
                    header = TRUE,
                    stringsAsFactors = FALSE)

stormdata$BGN_DATE<-as.Date(stormdata$BGN_DATE,format = "%m/%d/%Y")

Database subsetting: Storm database is a 902297x37 (rows x columns) dataframe that has data from 01/03/1950 to 11/30/2011. However, events registry increased along this time as is shown in the following plot. There is a step increase in events count in 1995. This suggests that data after this point may be more complete, so I decided to use only data obtain after 2005-01-01 by subsetting stormdata with BGN_DATE >= 1995 (green bars).

library (ggplot2)
library (dplyr)

## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

qplot (x = format(BGN_DATE,"%Y"),
       data=stormdata, geom="bar",
       fill=format(BGN_DATE,"%Y")>="1995")+ 
    ggtitle("Storm events registered per year") + xlab("Year")+
    scale_x_discrete(breaks=seq(1950,2011,by = 10))+
    guides(fill=FALSE)

stormdata<-filter(stormdata, BGN_DATE>="1995-01-01")

Data analysis must address the following questions:

Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?

Results

First question: Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

The dataset include 2 variable related to population health outcomes: fatalities and injuries.
I check the proportion of missing values:

cat("There are ",mean(is.na(stormdata$FATALITIES))*100, " % of missing values in fatalities", sep="")

## There are 0 % of missing values in fatalities

cat("There are ",mean(is.na(stormdata$INJURIES))*100, " % of missing values in injuries",sep="")

## There are 0 % of missing values in injuries

I analyse both of them as outcomes without combining them as they have obvious different weight in terms of public health. I decided to calculate the sum of each outcome per event to summarise the result.

I start with fatalities. I create a table of event type (EVTYPE), fatalities count and event count (eventCount), arranged in descendent order of fatality count.

library (knitr)


table.fatalities<-stormdata%>%
    group_by(EVTYPE)%>%
    summarise(fatalities=sum (FATALITIES), eventCount=length (EVTYPE))%>%
    arrange(desc(fatalities))

eventFatalitiesOrder<-table.fatalities$EVTYPE
table.fatalities<-mutate(table.fatalities,
                         EVTYPE=factor(EVTYPE,levels = eventFatalitiesOrder))

fatalities<-kable(head (table.fatalities, n = 10),align = "l")

The top 10 fatality related EVTYPE were:

knit_print (fatalities)

EVTYPE	fatalities	eventCount
EXCESSIVE HEAT	1903	1675
TORNADO	1545	24335
FLASH FLOOD	934	52673
HEAT	924	755
LIGHTNING	729	14280
FLOOD	423	24641
RIP CURRENT	360	459
HIGH WIND	241	19956
TSTM WIND	241	128923
AVALANCHE	223	380

Then I analyse the injuries. I create a table of event type (EVTYPE), injuries and event count (eventCount), arranged in descendent order of injuries count.

table.injuries<-stormdata%>%
    group_by(EVTYPE)%>%
    summarise(injuries=sum (INJURIES), eventCount=length (EVTYPE))%>%
    arrange(desc(injuries))

eventInjuriesOrder<-table.injuries$EVTYPE
table.injuries<-mutate(table.injuries, 
                       EVTYPE=factor(EVTYPE,levels = eventInjuriesOrder))


injuries<-kable(head (table.injuries, n = 10),align = "l")

The top 10 injury related EVTYPE were:

knit_print (injuries)

EVTYPE	injuries	eventCount
TORNADO	21765	24335
FLOOD	6769	24641
EXCESSIVE HEAT	6525	1675
LIGHTNING	4631	14280
TSTM WIND	3630	128923
HEAT	2030	755
FLASH FLOOD	1734	52673
THUNDERSTORM WIND	1426	81745
WINTER STORM	1298	11372
HURRICANE/TYPHOON	1275	88

Thus, the storm event that resulted more harmful in terms of fatalities was  EXCESSIVE HEAT and in terms of injuries was TORNADO from 1995-01-01 to 2011-11-30.

Second question: Across the United States, which types of events have the greatest economic consequences?

First, total damage economic consequences should be calculated.
In storm database, properties and crop damage estimates were registered in 4 columns: PROPDMG and PRPDMGEXP, and CROPDMG and CROPDMGEXP. In both cases the estimate is the first 2 column (PROPDMG or CROPDMG) multiplied by the EXP included in the second 2 column (PROPDMGEXP or CROPDMGEXP). In the EXP columns, alphabetical characters used to signify magnitude include “K” for thousands, “M” for millions, and “B” for billions. Alternatively, a number is included to denote the multiplier exponent (i.e., “1” is 10^1). The EXP columns have some codes that do not fall in any of these cathegories. In such cases, I assigned a NA to those estimates. If EXP variable is a missing value, I assumed that the multiplier is 1. To perform calculations I built a function (damageFun()) which perform calculations and I used apply() function to calculates by row results of properties and crop damage, and finally they were sum in totalDamage variable.

damageFun<-function (data){
    x<-as.numeric(data[1])
    y<-data[2]
    suppressWarnings(w<-as.numeric(y)) 
    if (grepl(pattern = "[Kk]", y)){
        return(x * 1000)
    }else if(grepl (pattern = "[Mm]", y)){
        return(x * 1e+6)
    }else if (grepl (pattern="[Bb]", y)){
        return(x * 1e+9)
    }else if (!is.na(w)){
        return(x *10^w)
    }else if (is.na(y)){
        return (x)
    }else{
        return(NA)
    }
}

propdmg<-select(.data = stormdata,PROPDMG:PROPDMGEXP)
cropdmg<-select(.data = stormdata,CROPDMG:CROPDMGEXP)

propDamage<-apply (propdmg,1, FUN = damageFun)
cropDamage<-apply (cropdmg,1, FUN = damageFun)


stormdata$properties<-propDamage
stormdata$crop<-cropDamage
stormdata$totalDamage<-rowSums(cbind(propDamage,cropDamage),na.rm=TRUE)

I check for missing values properties and crop damage variables:

cat("There are ",round(mean(is.na(stormdata$properties))*100,5),
    " % of missing values in properties", sep="")

## There are 0.00235 % of missing values in properties

cat("There are ",round(mean(is.na(stormdata$crop))*100,5), " % of missing values in crop",sep="")

## There are 0.00073 % of missing values in crop

Then I made a summary table for damage estimates (properties, crop and totalDamage) by event type (EVTYPE) with sum() as summary function. This table was arranged in descendent order of totalDamage estimate loses in order to build a plot (see below).

table.damage<-select (stormdata, EVTYPE, properties, crop, totalDamage)%>%
    group_by(EVTYPE)%>%
    summarise(totalDamage=sum(totalDamage,na.rm=TRUE),properties=sum (properties,na.rm=TRUE),crop=sum(crop,na.rm=TRUE))%>%
    arrange(desc(totalDamage))

eventDamageOrder<-table.damage$EVTYPE
table.damage<-mutate(table.damage, 
                       EVTYPE=factor(EVTYPE,levels = eventDamageOrder))

The following plot illustrates the damage estimates from properties and crops accross US in billions of dollars of the 10 EVTYPE with larger economical consequences from 1995 to 2011.

library (tidyr)

table.damage.long<-gather(data = head(table.damage,10),
                          key = damage, 
                          value = estimate, -EVTYPE, -totalDamage)

damageplot<-ggplot(data = (table.damage.long), 
                   aes (x = EVTYPE,y = estimate/1e+9, fill=damage))+
    geom_bar(stat="identity")+coord_flip()+
    ggtitle("Damage estimates per event type from 1995 to 2011 accross the USA")+
    xlab("")+ylab("billions of US dollars")

damageplot

FLOOD produced the largest economical loses from 1995-01-01 to 2011-11-30.