Reproducible Research - Peer Assessment 2 http://rpubs.com/develHector/PA2_template Update - 25 Apr, 2015

Synopsis

The aim of this report is to analyze both the the most harmful events with respect to population health across USA as well as the economic consequences of each type. The Questions needed to be answered are: 1. Which types of events (EVTYPE) are most harmful with respect to USA population health? 2. Which types of events have the greatest economic consequences?

Summary of the data analysis

What we are going do do here is
1. Download the raw data
2. Clean it as much as possible with a couple of basic techniques
3. Summarize (aggregate) the data according the requirements of the assignment
4. Order it in a way it presents us the more satistically relevant info
5. Plot the final report, on two ways: total, and an extra one: per year.

For the purposes of this report:
- FATALITIES AND INJURIES has been added in a single column impact value called HEALTH_EVENTS
- PROPERTY DAMAGE + CROP DAMMAGE has been added as a single column called ECONOMIC_DAMAGE

Data Processing

Raw information is obtained directly from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

Loading and preprocessing the data

URL from where the file can be downloaded from the course web site:

url <- "https://d396qusza40orc.cloudfront.net/repdata/data/StormData.csv.bz2" ;

Download the file in case it’s not on the current directory

# Choose a local directory for storing intermediate data, I went for "~/data"
path <- "~/data/" ; 
if( !file.exists( path ) ) dir.create( path ) ; 

# Download and unpack it only once
zipfile <- paste0( path, "StormData.csv.bz2" ) ;
if( !file.exists(zipfile) ) download.file( url, zipfile, method="curl", mode="ab", quiet=TRUE );

We need to unzip the data with a spefic command bunzip2 from R.utils library

# Unzip it only if not already unzipped
file <- paste0( path, "StormData.csv" ) ;
if( !file.exists(file) )
{
  if( !("R.utils" %in% rownames( installed.packages() ) ) ) install.packages("R.utils") ;  
  library(R.utils)
  bunzip2( filename = zipfile, destname = file  ) ;
}

Store the data on a dataframe “StormData”" to be used during the rest of the report

# This needs to be cached in order to reduce load time during recursions
StormData <- read.csv( file, header = TRUE ) ;

Let’s clean a little the event factor through trimming leading and trailing spaces un turn to uppercase

StormData$EVTYPE <- toupper( gsub( "^\\s+|\\s+$", "", StormData$EVTYPE ) ) ;

Now let’s aggregate the data on the SUM of individual events

AggregatedDirtyData <- aggregate( cbind( 
    FATALITIES, INJURIES, HEALTH_EVENTS = FATALITIES+INJURIES, 
    PROPDMG, CROPDMG, ECONOMIC_DAMAGE = PROPDMG+CROPDMG ) ~ EVTYPE, data=StormData, FUN=sum ) ;

Lets filter out a little bit more by only taking those above the mean

m <- mean( AggregatedDirtyData$HEALTH_EVENTS ) ;
AggregatedCleanData <- subset( AggregatedDirtyData, HEALTH_EVENTS >  m ) ;

Lets get the most important health events (Top of the list)

TOP <- 25 ; # You may choose whatever you want, I was interested only on thwe first 25

TopEventsHealth <- head( AggregatedCleanData[ order( AggregatedCleanData$HEALTH_EVENTS, 
                                                 decreasing =  TRUE ), c(1,2:4) ], TOP );

RemainingData <- AggregatedCleanData[ !(AggregatedCleanData$EVTYPE %in% TopEventsHealth$EVTYPE), ] ;

TopEventsHealth <- rbind( TopEventsHealth, 
                          data.frame( EVTYPE=paste( "[REST", nrow( RemainingData ),"RELEVANT TYPES]" ),
                                      FATALITIES = sum(  RemainingData$FATALITIES ),
                                      INJURIES = sum( RemainingData$INJURIES ), 
                                      HEALTH_EVENTS = sum( RemainingData$HEALTH_EVENTS ) ) ) ;

# Let's renumber the results in order to be consistent with the TopN idea
rownames( TopEventsHealth ) <- c( 1:(TOP+1) ) ;

Lets filter out a little bit more by only taking those above the mean

m <- mean( AggregatedDirtyData$ECONOMIC_DAMAGE ) ;
AggregatedCleanData <- subset( AggregatedDirtyData, ECONOMIC_DAMAGE >  m ) ;

Lets get the most economic damage events (Top of the list)

TOP <- 25 ; # You may choose whatever you want, I was interested only on thwe first 25

TopEventsEconomic <- head( AggregatedCleanData[ order( AggregatedCleanData$ECONOMIC_DAMAGE, 
                                                     decreasing =  TRUE ), c(1,5:7) ], TOP );

RemainingData <- AggregatedCleanData[ !(AggregatedCleanData$EVTYPE %in% TopEventsHealth$EVTYPE), ] ;

TopEventsEconomic <- rbind( TopEventsEconomic, 
                            data.frame( EVTYPE=paste( "[REST", nrow( RemainingData ),"RELEVANT TYPES]" ),
                                        PROPDMG=sum( RemainingData$PROPDMG ),
                                        CROPDMG=sum( RemainingData$CROPDMG ),
                                        ECONOMIC_DAMAGE=sum( RemainingData$ECONOMIC_DAMAGE ) ) ) ;

# Let's renumber the results in order to be consistent with the TopN idea
rownames( TopEventsEconomic ) <- c( 1:(TOP+1) ) ;

Results

Table 1 - Events are most harmful to population health

This table protrays the conclusion of the most relevant health events per type since the begining of the measurement, check the descending order of events, numbered, portrays at a single view which are the most relevant events, its number of ocurrences both on fatalities and injuries, and the sum (Health_Events) which was the variable used for the analysis. We added at the bottom the sum of the other relevant events to have an idea of what we are missing once we are filtering by the top ones.

library(xtable);
xt<-xtable(TopEventsHealth, display=c("d", "s", "d", "d", "d"), caption = "Individual Fatalities and Injuries per Event");
print( xt, type="html" ) ;

Individual Fatalities and Injuries per Event
	EVTYPE	FATALITIES	INJURIES	HEALTH_EVENTS
1	TORNADO	5633	91346	96979
2	EXCESSIVE HEAT	1903	6525	8428
3	TSTM WIND	504	6957	7461
4	FLOOD	470	6789	7259
5	LIGHTNING	816	5230	6046
6	HEAT	937	2100	3037
7	FLASH FLOOD	978	1777	2755
8	ICE STORM	89	1975	2064
9	THUNDERSTORM WIND	133	1488	1621
10	WINTER STORM	206	1321	1527
11	HIGH WIND	248	1137	1385
12	HAIL	15	1361	1376
13	HURRICANE/TYPHOON	64	1275	1339
14	HEAVY SNOW	127	1021	1148
15	WILDFIRE	75	911	986
16	THUNDERSTORM WINDS	64	908	972
17	BLIZZARD	101	805	906
18	FOG	62	734	796
19	RIP CURRENT	368	232	600
20	WILD/FOREST FIRE	12	545	557
21	HEAT WAVE	172	379	551
22	RIP CURRENTS	204	297	501
23	DUST STORM	22	440	462
24	WINTER WEATHER	33	398	431
25	TROPICAL STORM	58	340	398
26	[REST 9 RELEVANT TYPES]	847	2103	2950

Figure 1 - Events are most harmful to population health

Let’s see that on a chart to have an grasp of the distribution of events. You will see the same data as in the previous table, but on this case, and to simplify the reading, we turned the x axis to be logarithimc, in order to let us view more clearly the descending impact of the events after the major one. We also plotted the rest of the relevant events. This graph by default orders inveresely to the table, but still presents the needed information for review.

par(ps=9, cex=1, cex.main=2, las=2, oma=c(0, 4, 0, 4)) ; # adjust graph parameters to fit best
barplot( TopEventsHealth$HEALTH_EVENTS, horiz = TRUE, names.arg = TopEventsHealth$EVTYPE, 
         cex.names = 0.8, log="x", main="Health Events per Event Type", xlab="Number of Events - Logarithmic Scale");

The conclusions you can draw from here are based on the data from the previous table and graph, Tornado causes the vast majority of the health issues, but there are other important events to consider as well.

Table 2 - Events have the greatest economic consequences

This table is shown as the conclusion we can draw from the most relevant economic data events per type since the begining of the measurement. check the descending order of events, numbered, portrays at a single view which are the most relevant events, its number of ocurrences both on “Property Damage” and “Crop Damage”, and its sum (ECONOMIC_DAMAGE) which was the variable used for the analysis. We added at the bottom the sum of the other relevant events to have an idea of what we are missing once we are filtering by the top ones.

library(xtable);
xt<-xtable(TopEventsEconomic, caption="Units in thousands $USD total");
print( xt, type="html" ) ;

Units in thousands $USD total
	EVTYPE	PROPDMG	CROPDMG	ECONOMIC_DAMAGE
1	TORNADO	3212258.16	100018.52	3312276.68
2	FLASH FLOOD	1420174.59	179200.46	1599375.05
3	TSTM WIND	1336103.61	109202.60	1445306.21
4	HAIL	688693.38	579596.28	1268289.66
5	FLOOD	899938.48	168037.88	1067976.36
6	THUNDERSTORM WIND	876844.17	66791.45	943635.62
7	LIGHTNING	603351.78	3580.61	606932.39
8	THUNDERSTORM WINDS	446293.18	18684.93	464978.11
9	HIGH WIND	324731.56	17283.21	342014.77
10	WINTER STORM	132720.59	1978.99	134699.58
11	HEAVY SNOW	122251.99	2165.72	124417.71
12	WILDFIRE	84459.34	4364.20	88823.54
13	ICE STORM	66000.67	1688.95	67689.62
14	STRONG WIND	63011.81	1616.90	64628.71
15	HEAVY RAIN	50842.14	11122.80	61964.94
16	HIGH WINDS	55625.00	1759.60	57384.60
17	TROPICAL STORM	48423.68	5899.12	54322.80
18	WILD/FOREST FIRE	39344.95	4189.54	43534.49
19	DROUGHT	4099.05	33898.62	37997.67
20	FLASH FLOODING	28497.15	5126.05	33623.20
21	URBAN/SML STREAM FLD	26051.94	2793.80	28845.74
22	BLIZZARD	25318.48	172.00	25490.48
23	HURRICANE	15513.68	5339.31	20852.99
24	FLOOD/FLASH FLOOD	19153.25	1427.70	20580.95
25	STORM SURGE	19393.49	5.00	19398.49
26	[REST 14 RELEVANT TYPES]	350096.69	73681.42	423778.11

Figure 2 - Events have the greatest economic consequences

Let’s see that on a chart to have an grasp of the distribution of events and its economic relevance. You will see the same data as in the previous table, but on this case, and to simplify the reading, we turned the x axis to be logarithimc, in order to let us view more clearly the descending impact of the events after the major one. We also plotted the rest of the relevant events. This graph by default orders inveresely to the table, but still presents the needed information for review.

par(ps=9, cex=1, cex.main=2, las=2, oma=c(0, 4, 0, 4)) ; # adjust graph parameters to fit best
barplot( TopEventsEconomic$ECONOMIC_DAMAGE / 1000, horiz = TRUE, names.arg = TopEventsEconomic$EVTYPE, 
         cex.names = 0.8, log="x", main="Economic Damange per Event Type", xlab="Millions USD in Damage - Logarithmic Scale");

The conclusions you can draw from here are based on the data from the table and graph, trying not to be repetitive, the graph speaks for itself.

There’s an extra plot allowed!

Lets have sum fun with a measurements-within-time plot, with a graph with more depth on the information, looking to present the most important overal events, but within time.

First, lets fix the BGN_DATE column in order to be a Date value instead of a string

StormData$BGN_DATE <- as.Date( StormData$BGN_DATE, "%m/%d/%Y" ) ;
divisor <- 5 ;
StormData$YEAR <- as.integer( ( as.POSIXlt( StormData$BGN_DATE )$year+1900 ) / divisor ) * divisor ;

Aggregate per year and to have an unique indicator, my proposal is to lets create a “Damage Index”, called “OVERAL_DAMAGE” wich accounts for the product of health and economic damage events, nice!

AggregatedDirtyDataPerYear <- aggregate( cbind( 
  OVERALL_DAMAGE = (FATALITIES+INJURIES) * (PROPDMG+CROPDMG) ) ~ EVTYPE  + YEAR, 
data=StormData, FUN=sum ) ;

Let’s remove such cases on where there are no health harms at all, and since this last analysis is a time based one, to detect tendencies, we can filter out old data, say, before 1980

AggregatedDataPerYear <- subset( AggregatedDirtyDataPerYear, YEAR > 1980 & OVERALL_DAMAGE > 0 ) ;
mDamage <- median( AggregatedDataPerYear$OVERALL_DAMAGE ) ;
AggregatedDataPerYear <- subset( AggregatedDataPerYear, OVERALL_DAMAGE > mDamage ) ;

Lets get the Top events within time

top <- 10 ; # You may choose to see whichever top number you want

AggregatedDataPerType <- aggregate( OVERALL_DAMAGE ~ EVTYPE, data=AggregatedDataPerYear, FUN=sum )
topDamage  <- head( AggregatedDataPerType[ order( AggregatedDataPerType$OVERALL_DAMAGE, 
                                                  decreasing = TRUE ), ], top )
topData  <- AggregatedDataPerYear[ AggregatedDataPerYear$EVTYPE %in% topDamage$EVTYPE,  ]

Figure 3 - Overal Health & Economic Damage within Time

This last plot is intended to present the behavior of the most frequent event types within time. It allows us to see which event types seem to have an upward tendency and which downward, permitting the decision maker to take previsions more on a “where-we-come-from & where-are-we-going” based model.

library(ggplot2)
qplot( YEAR, OVERALL_DAMAGE, data=topData, color=EVTYPE, geom = c( "point", "line" ), main ="Damage Index Withing Time" )

The main conclusion we can draw from here is that there has been some kind of seasonality of eventuality effects within time. It’s clear that the effect of some events are growing like Torando and High Winds, and the effect of some others are fading down, like Typhoons and Winter Storms.

Thanks guys!

Good luck with yours

Analisys of Harmful Storm Events in the USA

develHector

Apr 25, 2015