Reproducible Research - Peer Assessment 2 http://rpubs.com/develHector/PA2_template Update - 25 Apr, 2015

Synopsis

The aim of this report is to analyze both the the most harmful events with respect to population health across USA as well as the economic consequences of each type. The Questions needed to be answered are: 1. Which types of events (EVTYPE) are most harmful with respect to USA population health? 2. Which types of events have the greatest economic consequences?

Summary of the data analysis

What we are going do do here is
1. Download the raw data
2. Clean it as much as possible with a couple of basic techniques
3. Summarize (aggregate) the data according the requirements of the assignment
4. Order it in a way it presents us the more satistically relevant info
5. Plot the final report, on two ways: total, and an extra one: per year.

For the purposes of this report:
- FATALITIES AND INJURIES has been added in a single column impact value called HEALTH_EVENTS
- PROPERTY DAMAGE + CROP DAMMAGE has been added as a single column called ECONOMIC_DAMAGE

Data Processing

Raw information is obtained directly from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

Loading and preprocessing the data

URL from where the file can be downloaded from the course web site:

url <- "https://d396qusza40orc.cloudfront.net/repdata/data/StormData.csv.bz2" ;  

Download the file in case it’s not on the current directory

# Choose a local directory for storing intermediate data, I went for "~/data"
path <- "~/data/" ; 
if( !file.exists( path ) ) dir.create( path ) ; 

# Download and unpack it only once
zipfile <- paste0( path, "StormData.csv.bz2" ) ;
if( !file.exists(zipfile) ) download.file( url, zipfile, method="curl", mode="ab", quiet=TRUE );

We need to unzip the data with a spefic command bunzip2 from R.utils library

# Unzip it only if not already unzipped
file <- paste0( path, "StormData.csv" ) ;
if( !file.exists(file) )
{
  if( !("R.utils" %in% rownames( installed.packages() ) ) ) install.packages("R.utils") ;  
  library(R.utils)
  bunzip2( filename = zipfile, destname = file  ) ;
}

Store the data on a dataframe “StormData”" to be used during the rest of the report

# This needs to be cached in order to reduce load time during recursions
StormData <- read.csv( file, header = TRUE ) ;

Let’s clean a little the event factor through trimming leading and trailing spaces un turn to uppercase

StormData$EVTYPE <- toupper( gsub( "^\\s+|\\s+$", "", StormData$EVTYPE ) ) ;

Now let’s aggregate the data on the SUM of individual events

AggregatedDirtyData <- aggregate( cbind( 
    FATALITIES, INJURIES, HEALTH_EVENTS = FATALITIES+INJURIES, 
    PROPDMG, CROPDMG, ECONOMIC_DAMAGE = PROPDMG+CROPDMG ) ~ EVTYPE, data=StormData, FUN=sum ) ;

Lets filter out a little bit more by only taking those above the mean

m <- mean( AggregatedDirtyData$HEALTH_EVENTS ) ;
AggregatedCleanData <- subset( AggregatedDirtyData, HEALTH_EVENTS >  m ) ;

Lets get the most important health events (Top of the list)

TOP <- 25 ; # You may choose whatever you want, I was interested only on thwe first 25

TopEventsHealth <- head( AggregatedCleanData[ order( AggregatedCleanData$HEALTH_EVENTS, 
                                                 decreasing =  TRUE ), c(1,2:4) ], TOP );

RemainingData <- AggregatedCleanData[ !(AggregatedCleanData$EVTYPE %in% TopEventsHealth$EVTYPE), ] ;

TopEventsHealth <- rbind( TopEventsHealth, 
                          data.frame( EVTYPE=paste( "[REST", nrow( RemainingData ),"RELEVANT TYPES]" ),
                                      FATALITIES = sum(  RemainingData$FATALITIES ),
                                      INJURIES = sum( RemainingData$INJURIES ), 
                                      HEALTH_EVENTS = sum( RemainingData$HEALTH_EVENTS ) ) ) ;

# Let's renumber the results in order to be consistent with the TopN idea
rownames( TopEventsHealth ) <- c( 1:(TOP+1) ) ;

Lets filter out a little bit more by only taking those above the mean

m <- mean( AggregatedDirtyData$ECONOMIC_DAMAGE ) ;
AggregatedCleanData <- subset( AggregatedDirtyData, ECONOMIC_DAMAGE >  m ) ;

Lets get the most economic damage events (Top of the list)

TOP <- 25 ; # You may choose whatever you want, I was interested only on thwe first 25

TopEventsEconomic <- head( AggregatedCleanData[ order( AggregatedCleanData$ECONOMIC_DAMAGE, 
                                                     decreasing =  TRUE ), c(1,5:7) ], TOP );

RemainingData <- AggregatedCleanData[ !(AggregatedCleanData$EVTYPE %in% TopEventsHealth$EVTYPE), ] ;

TopEventsEconomic <- rbind( TopEventsEconomic, 
                            data.frame( EVTYPE=paste( "[REST", nrow( RemainingData ),"RELEVANT TYPES]" ),
                                        PROPDMG=sum( RemainingData$PROPDMG ),
                                        CROPDMG=sum( RemainingData$CROPDMG ),
                                        ECONOMIC_DAMAGE=sum( RemainingData$ECONOMIC_DAMAGE ) ) ) ;

# Let's renumber the results in order to be consistent with the TopN idea
rownames( TopEventsEconomic ) <- c( 1:(TOP+1) ) ;

Results

Table 1 - Events are most harmful to population health

This table protrays the conclusion of the most relevant health events per type since the begining of the measurement, check the descending order of events, numbered, portrays at a single view which are the most relevant events, its number of ocurrences both on fatalities and injuries, and the sum (Health_Events) which was the variable used for the analysis. We added at the bottom the sum of the other relevant events to have an idea of what we are missing once we are filtering by the top ones.

library(xtable);
xt<-xtable(TopEventsHealth, display=c("d", "s", "d", "d", "d"), caption = "Individual Fatalities and Injuries per Event");
print( xt, type="html" ) ;
Individual Fatalities and Injuries per Event
EVTYPE FATALITIES INJURIES HEALTH_EVENTS
1 TORNADO 5633 91346 96979
2 EXCESSIVE HEAT 1903 6525 8428
3 TSTM WIND 504 6957 7461
4 FLOOD 470 6789 7259
5 LIGHTNING 816 5230 6046
6 HEAT 937 2100 3037
7 FLASH FLOOD 978 1777 2755
8 ICE STORM 89 1975 2064
9 THUNDERSTORM WIND 133 1488 1621
10 WINTER STORM 206 1321 1527
11 HIGH WIND 248 1137 1385
12 HAIL 15 1361 1376
13 HURRICANE/TYPHOON 64 1275 1339
14 HEAVY SNOW 127 1021 1148
15 WILDFIRE 75 911 986
16 THUNDERSTORM WINDS 64 908 972
17 BLIZZARD 101 805 906
18 FOG 62 734 796
19 RIP CURRENT 368 232 600
20 WILD/FOREST FIRE 12 545 557
21 HEAT WAVE 172 379 551
22 RIP CURRENTS 204 297 501
23 DUST STORM 22 440 462
24 WINTER WEATHER 33 398 431
25 TROPICAL STORM 58 340 398
26 [REST 9 RELEVANT TYPES] 847 2103 2950

Figure 1 - Events are most harmful to population health

Let’s see that on a chart to have an grasp of the distribution of events. You will see the same data as in the previous table, but on this case, and to simplify the reading, we turned the x axis to be logarithimc, in order to let us view more clearly the descending impact of the events after the major one. We also plotted the rest of the relevant events. This graph by default orders inveresely to the table, but still presents the needed information for review.

par(ps=9, cex=1, cex.main=2, las=2, oma=c(0, 4, 0, 4)) ; # adjust graph parameters to fit best
barplot( TopEventsHealth$HEALTH_EVENTS, horiz = TRUE, names.arg = TopEventsHealth$EVTYPE, 
         cex.names = 0.8, log="x", main="Health Events per Event Type", xlab="Number of Events - Logarithmic Scale");

The conclusions you can draw from here are based on the data from the previous table and graph, Tornado causes the vast majority of the health issues, but there are other important events to consider as well.

Table 2 - Events have the greatest economic consequences

This table is shown as the conclusion we can draw from the most relevant economic data events per type since the begining of the measurement. check the descending order of events, numbered, portrays at a single view which are the most relevant events, its number of ocurrences both on “Property Damage” and “Crop Damage”, and its sum (ECONOMIC_DAMAGE) which was the variable used for the analysis. We added at the bottom the sum of the other relevant events to have an idea of what we are missing once we are filtering by the top ones.

library(xtable);
xt<-xtable(TopEventsEconomic, caption="Units in thousands $USD total");
print( xt, type="html" ) ;
Units in thousands $USD total
EVTYPE PROPDMG CROPDMG ECONOMIC_DAMAGE
1 TORNADO 3212258.16 100018.52 3312276.68
2 FLASH FLOOD 1420174.59 179200.46 1599375.05
3 TSTM WIND 1336103.61 109202.60 1445306.21
4 HAIL 688693.38 579596.28 1268289.66
5 FLOOD 899938.48 168037.88 1067976.36
6 THUNDERSTORM WIND 876844.17 66791.45 943635.62
7 LIGHTNING 603351.78 3580.61 606932.39
8 THUNDERSTORM WINDS 446293.18 18684.93 464978.11
9 HIGH WIND 324731.56 17283.21 342014.77
10 WINTER STORM 132720.59 1978.99 134699.58
11 HEAVY SNOW 122251.99 2165.72 124417.71
12 WILDFIRE 84459.34 4364.20 88823.54
13 ICE STORM 66000.67 1688.95 67689.62
14 STRONG WIND 63011.81 1616.90 64628.71
15 HEAVY RAIN 50842.14 11122.80 61964.94
16 HIGH WINDS 55625.00 1759.60 57384.60
17 TROPICAL STORM 48423.68 5899.12 54322.80
18 WILD/FOREST FIRE 39344.95 4189.54 43534.49
19 DROUGHT 4099.05 33898.62 37997.67
20 FLASH FLOODING 28497.15 5126.05 33623.20
21 URBAN/SML STREAM FLD 26051.94 2793.80 28845.74
22 BLIZZARD 25318.48 172.00 25490.48
23 HURRICANE 15513.68 5339.31 20852.99
24 FLOOD/FLASH FLOOD 19153.25 1427.70 20580.95
25 STORM SURGE 19393.49 5.00 19398.49
26 [REST 14 RELEVANT TYPES] 350096.69 73681.42 423778.11

Figure 2 - Events have the greatest economic consequences

Let’s see that on a chart to have an grasp of the distribution of events and its economic relevance. You will see the same data as in the previous table, but on this case, and to simplify the reading, we turned the x axis to be logarithimc, in order to let us view more clearly the descending impact of the events after the major one. We also plotted the rest of the relevant events. This graph by default orders inveresely to the table, but still presents the needed information for review.

par(ps=9, cex=1, cex.main=2, las=2, oma=c(0, 4, 0, 4)) ; # adjust graph parameters to fit best
barplot( TopEventsEconomic$ECONOMIC_DAMAGE / 1000, horiz = TRUE, names.arg = TopEventsEconomic$EVTYPE, 
         cex.names = 0.8, log="x", main="Economic Damange per Event Type", xlab="Millions USD in Damage - Logarithmic Scale");

The conclusions you can draw from here are based on the data from the table and graph, trying not to be repetitive, the graph speaks for itself.

There’s an extra plot allowed!

Lets have sum fun with a measurements-within-time plot, with a graph with more depth on the information, looking to present the most important overal events, but within time.

First, lets fix the BGN_DATE column in order to be a Date value instead of a string

StormData$BGN_DATE <- as.Date( StormData$BGN_DATE, "%m/%d/%Y" ) ;
divisor <- 5 ;
StormData$YEAR <- as.integer( ( as.POSIXlt( StormData$BGN_DATE )$year+1900 ) / divisor ) * divisor ;

Aggregate per year and to have an unique indicator, my proposal is to lets create a “Damage Index”, called “OVERAL_DAMAGE” wich accounts for the product of health and economic damage events, nice!

AggregatedDirtyDataPerYear <- aggregate( cbind( 
  OVERALL_DAMAGE = (FATALITIES+INJURIES) * (PROPDMG+CROPDMG) ) ~ EVTYPE  + YEAR, 
data=StormData, FUN=sum ) ;

Let’s remove such cases on where there are no health harms at all, and since this last analysis is a time based one, to detect tendencies, we can filter out old data, say, before 1980

AggregatedDataPerYear <- subset( AggregatedDirtyDataPerYear, YEAR > 1980 & OVERALL_DAMAGE > 0 ) ;
mDamage <- median( AggregatedDataPerYear$OVERALL_DAMAGE ) ;
AggregatedDataPerYear <- subset( AggregatedDataPerYear, OVERALL_DAMAGE > mDamage ) ;

Lets get the Top events within time

top <- 10 ; # You may choose to see whichever top number you want

AggregatedDataPerType <- aggregate( OVERALL_DAMAGE ~ EVTYPE, data=AggregatedDataPerYear, FUN=sum )
topDamage  <- head( AggregatedDataPerType[ order( AggregatedDataPerType$OVERALL_DAMAGE, 
                                                  decreasing = TRUE ), ], top )
topData  <- AggregatedDataPerYear[ AggregatedDataPerYear$EVTYPE %in% topDamage$EVTYPE,  ]

Figure 3 - Overal Health & Economic Damage within Time

This last plot is intended to present the behavior of the most frequent event types within time. It allows us to see which event types seem to have an upward tendency and which downward, permitting the decision maker to take previsions more on a “where-we-come-from & where-are-we-going” based model.

library(ggplot2)
qplot( YEAR, OVERALL_DAMAGE, data=topData, color=EVTYPE, geom = c( "point", "line" ), main ="Damage Index Withing Time" ) 

The main conclusion we can draw from here is that there has been some kind of seasonality of eventuality effects within time. It’s clear that the effect of some events are growing like Torando and High Winds, and the effect of some others are fading down, like Typhoons and Winter Storms.

Thanks guys!

Good luck with yours