Reproducible Research - Peer Assessment 2 http://rpubs.com/develHector/PA2_template Update - 25 Apr, 2015
The aim of this report is to analyze both the the most harmful events with respect to population health across USA as well as the economic consequences of each type. The Questions needed to be answered are: 1. Which types of events (EVTYPE) are most harmful with respect to USA population health? 2. Which types of events have the greatest economic consequences?
What we are going do do here is
1. Download the raw data
2. Clean it as much as possible with a couple of basic techniques
3. Summarize (aggregate) the data according the requirements of the assignment
4. Order it in a way it presents us the more satistically relevant info
5. Plot the final report, on two ways: total, and an extra one: per year.
For the purposes of this report:
- FATALITIES AND INJURIES has been added in a single column impact value called HEALTH_EVENTS
- PROPERTY DAMAGE + CROP DAMMAGE has been added as a single column called ECONOMIC_DAMAGE
Raw information is obtained directly from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
URL from where the file can be downloaded from the course web site:
url <- "https://d396qusza40orc.cloudfront.net/repdata/data/StormData.csv.bz2" ;
Download the file in case it’s not on the current directory
# Choose a local directory for storing intermediate data, I went for "~/data"
path <- "~/data/" ;
if( !file.exists( path ) ) dir.create( path ) ;
# Download and unpack it only once
zipfile <- paste0( path, "StormData.csv.bz2" ) ;
if( !file.exists(zipfile) ) download.file( url, zipfile, method="curl", mode="ab", quiet=TRUE );
We need to unzip the data with a spefic command bunzip2 from R.utils library
# Unzip it only if not already unzipped
file <- paste0( path, "StormData.csv" ) ;
if( !file.exists(file) )
{
if( !("R.utils" %in% rownames( installed.packages() ) ) ) install.packages("R.utils") ;
library(R.utils)
bunzip2( filename = zipfile, destname = file ) ;
}
Store the data on a dataframe “StormData”" to be used during the rest of the report
# This needs to be cached in order to reduce load time during recursions
StormData <- read.csv( file, header = TRUE ) ;
Let’s clean a little the event factor through trimming leading and trailing spaces un turn to uppercase
StormData$EVTYPE <- toupper( gsub( "^\\s+|\\s+$", "", StormData$EVTYPE ) ) ;
Now let’s aggregate the data on the SUM of individual events
AggregatedDirtyData <- aggregate( cbind(
FATALITIES, INJURIES, HEALTH_EVENTS = FATALITIES+INJURIES,
PROPDMG, CROPDMG, ECONOMIC_DAMAGE = PROPDMG+CROPDMG ) ~ EVTYPE, data=StormData, FUN=sum ) ;
Lets filter out a little bit more by only taking those above the mean
m <- mean( AggregatedDirtyData$HEALTH_EVENTS ) ;
AggregatedCleanData <- subset( AggregatedDirtyData, HEALTH_EVENTS > m ) ;
Lets get the most important health events (Top of the list)
TOP <- 25 ; # You may choose whatever you want, I was interested only on thwe first 25
TopEventsHealth <- head( AggregatedCleanData[ order( AggregatedCleanData$HEALTH_EVENTS,
decreasing = TRUE ), c(1,2:4) ], TOP );
RemainingData <- AggregatedCleanData[ !(AggregatedCleanData$EVTYPE %in% TopEventsHealth$EVTYPE), ] ;
TopEventsHealth <- rbind( TopEventsHealth,
data.frame( EVTYPE=paste( "[REST", nrow( RemainingData ),"RELEVANT TYPES]" ),
FATALITIES = sum( RemainingData$FATALITIES ),
INJURIES = sum( RemainingData$INJURIES ),
HEALTH_EVENTS = sum( RemainingData$HEALTH_EVENTS ) ) ) ;
# Let's renumber the results in order to be consistent with the TopN idea
rownames( TopEventsHealth ) <- c( 1:(TOP+1) ) ;
Lets filter out a little bit more by only taking those above the mean
m <- mean( AggregatedDirtyData$ECONOMIC_DAMAGE ) ;
AggregatedCleanData <- subset( AggregatedDirtyData, ECONOMIC_DAMAGE > m ) ;
Lets get the most economic damage events (Top of the list)
TOP <- 25 ; # You may choose whatever you want, I was interested only on thwe first 25
TopEventsEconomic <- head( AggregatedCleanData[ order( AggregatedCleanData$ECONOMIC_DAMAGE,
decreasing = TRUE ), c(1,5:7) ], TOP );
RemainingData <- AggregatedCleanData[ !(AggregatedCleanData$EVTYPE %in% TopEventsHealth$EVTYPE), ] ;
TopEventsEconomic <- rbind( TopEventsEconomic,
data.frame( EVTYPE=paste( "[REST", nrow( RemainingData ),"RELEVANT TYPES]" ),
PROPDMG=sum( RemainingData$PROPDMG ),
CROPDMG=sum( RemainingData$CROPDMG ),
ECONOMIC_DAMAGE=sum( RemainingData$ECONOMIC_DAMAGE ) ) ) ;
# Let's renumber the results in order to be consistent with the TopN idea
rownames( TopEventsEconomic ) <- c( 1:(TOP+1) ) ;
This table protrays the conclusion of the most relevant health events per type since the begining of the measurement, check the descending order of events, numbered, portrays at a single view which are the most relevant events, its number of ocurrences both on fatalities and injuries, and the sum (Health_Events) which was the variable used for the analysis. We added at the bottom the sum of the other relevant events to have an idea of what we are missing once we are filtering by the top ones.
library(xtable);
xt<-xtable(TopEventsHealth, display=c("d", "s", "d", "d", "d"), caption = "Individual Fatalities and Injuries per Event");
print( xt, type="html" ) ;
| EVTYPE | FATALITIES | INJURIES | HEALTH_EVENTS | |
|---|---|---|---|---|
| 1 | TORNADO | 5633 | 91346 | 96979 |
| 2 | EXCESSIVE HEAT | 1903 | 6525 | 8428 |
| 3 | TSTM WIND | 504 | 6957 | 7461 |
| 4 | FLOOD | 470 | 6789 | 7259 |
| 5 | LIGHTNING | 816 | 5230 | 6046 |
| 6 | HEAT | 937 | 2100 | 3037 |
| 7 | FLASH FLOOD | 978 | 1777 | 2755 |
| 8 | ICE STORM | 89 | 1975 | 2064 |
| 9 | THUNDERSTORM WIND | 133 | 1488 | 1621 |
| 10 | WINTER STORM | 206 | 1321 | 1527 |
| 11 | HIGH WIND | 248 | 1137 | 1385 |
| 12 | HAIL | 15 | 1361 | 1376 |
| 13 | HURRICANE/TYPHOON | 64 | 1275 | 1339 |
| 14 | HEAVY SNOW | 127 | 1021 | 1148 |
| 15 | WILDFIRE | 75 | 911 | 986 |
| 16 | THUNDERSTORM WINDS | 64 | 908 | 972 |
| 17 | BLIZZARD | 101 | 805 | 906 |
| 18 | FOG | 62 | 734 | 796 |
| 19 | RIP CURRENT | 368 | 232 | 600 |
| 20 | WILD/FOREST FIRE | 12 | 545 | 557 |
| 21 | HEAT WAVE | 172 | 379 | 551 |
| 22 | RIP CURRENTS | 204 | 297 | 501 |
| 23 | DUST STORM | 22 | 440 | 462 |
| 24 | WINTER WEATHER | 33 | 398 | 431 |
| 25 | TROPICAL STORM | 58 | 340 | 398 |
| 26 | [REST 9 RELEVANT TYPES] | 847 | 2103 | 2950 |
Let’s see that on a chart to have an grasp of the distribution of events. You will see the same data as in the previous table, but on this case, and to simplify the reading, we turned the x axis to be logarithimc, in order to let us view more clearly the descending impact of the events after the major one. We also plotted the rest of the relevant events. This graph by default orders inveresely to the table, but still presents the needed information for review.
par(ps=9, cex=1, cex.main=2, las=2, oma=c(0, 4, 0, 4)) ; # adjust graph parameters to fit best
barplot( TopEventsHealth$HEALTH_EVENTS, horiz = TRUE, names.arg = TopEventsHealth$EVTYPE,
cex.names = 0.8, log="x", main="Health Events per Event Type", xlab="Number of Events - Logarithmic Scale");
The conclusions you can draw from here are based on the data from the previous table and graph, Tornado causes the vast majority of the health issues, but there are other important events to consider as well.
This table is shown as the conclusion we can draw from the most relevant economic data events per type since the begining of the measurement. check the descending order of events, numbered, portrays at a single view which are the most relevant events, its number of ocurrences both on “Property Damage” and “Crop Damage”, and its sum (ECONOMIC_DAMAGE) which was the variable used for the analysis. We added at the bottom the sum of the other relevant events to have an idea of what we are missing once we are filtering by the top ones.
library(xtable);
xt<-xtable(TopEventsEconomic, caption="Units in thousands $USD total");
print( xt, type="html" ) ;
| EVTYPE | PROPDMG | CROPDMG | ECONOMIC_DAMAGE | |
|---|---|---|---|---|
| 1 | TORNADO | 3212258.16 | 100018.52 | 3312276.68 |
| 2 | FLASH FLOOD | 1420174.59 | 179200.46 | 1599375.05 |
| 3 | TSTM WIND | 1336103.61 | 109202.60 | 1445306.21 |
| 4 | HAIL | 688693.38 | 579596.28 | 1268289.66 |
| 5 | FLOOD | 899938.48 | 168037.88 | 1067976.36 |
| 6 | THUNDERSTORM WIND | 876844.17 | 66791.45 | 943635.62 |
| 7 | LIGHTNING | 603351.78 | 3580.61 | 606932.39 |
| 8 | THUNDERSTORM WINDS | 446293.18 | 18684.93 | 464978.11 |
| 9 | HIGH WIND | 324731.56 | 17283.21 | 342014.77 |
| 10 | WINTER STORM | 132720.59 | 1978.99 | 134699.58 |
| 11 | HEAVY SNOW | 122251.99 | 2165.72 | 124417.71 |
| 12 | WILDFIRE | 84459.34 | 4364.20 | 88823.54 |
| 13 | ICE STORM | 66000.67 | 1688.95 | 67689.62 |
| 14 | STRONG WIND | 63011.81 | 1616.90 | 64628.71 |
| 15 | HEAVY RAIN | 50842.14 | 11122.80 | 61964.94 |
| 16 | HIGH WINDS | 55625.00 | 1759.60 | 57384.60 |
| 17 | TROPICAL STORM | 48423.68 | 5899.12 | 54322.80 |
| 18 | WILD/FOREST FIRE | 39344.95 | 4189.54 | 43534.49 |
| 19 | DROUGHT | 4099.05 | 33898.62 | 37997.67 |
| 20 | FLASH FLOODING | 28497.15 | 5126.05 | 33623.20 |
| 21 | URBAN/SML STREAM FLD | 26051.94 | 2793.80 | 28845.74 |
| 22 | BLIZZARD | 25318.48 | 172.00 | 25490.48 |
| 23 | HURRICANE | 15513.68 | 5339.31 | 20852.99 |
| 24 | FLOOD/FLASH FLOOD | 19153.25 | 1427.70 | 20580.95 |
| 25 | STORM SURGE | 19393.49 | 5.00 | 19398.49 |
| 26 | [REST 14 RELEVANT TYPES] | 350096.69 | 73681.42 | 423778.11 |
Let’s see that on a chart to have an grasp of the distribution of events and its economic relevance. You will see the same data as in the previous table, but on this case, and to simplify the reading, we turned the x axis to be logarithimc, in order to let us view more clearly the descending impact of the events after the major one. We also plotted the rest of the relevant events. This graph by default orders inveresely to the table, but still presents the needed information for review.
par(ps=9, cex=1, cex.main=2, las=2, oma=c(0, 4, 0, 4)) ; # adjust graph parameters to fit best
barplot( TopEventsEconomic$ECONOMIC_DAMAGE / 1000, horiz = TRUE, names.arg = TopEventsEconomic$EVTYPE,
cex.names = 0.8, log="x", main="Economic Damange per Event Type", xlab="Millions USD in Damage - Logarithmic Scale");
The conclusions you can draw from here are based on the data from the table and graph, trying not to be repetitive, the graph speaks for itself.
Lets have sum fun with a measurements-within-time plot, with a graph with more depth on the information, looking to present the most important overal events, but within time.
First, lets fix the BGN_DATE column in order to be a Date value instead of a string
StormData$BGN_DATE <- as.Date( StormData$BGN_DATE, "%m/%d/%Y" ) ;
divisor <- 5 ;
StormData$YEAR <- as.integer( ( as.POSIXlt( StormData$BGN_DATE )$year+1900 ) / divisor ) * divisor ;
Aggregate per year and to have an unique indicator, my proposal is to lets create a “Damage Index”, called “OVERAL_DAMAGE” wich accounts for the product of health and economic damage events, nice!
AggregatedDirtyDataPerYear <- aggregate( cbind(
OVERALL_DAMAGE = (FATALITIES+INJURIES) * (PROPDMG+CROPDMG) ) ~ EVTYPE + YEAR,
data=StormData, FUN=sum ) ;
Let’s remove such cases on where there are no health harms at all, and since this last analysis is a time based one, to detect tendencies, we can filter out old data, say, before 1980
AggregatedDataPerYear <- subset( AggregatedDirtyDataPerYear, YEAR > 1980 & OVERALL_DAMAGE > 0 ) ;
mDamage <- median( AggregatedDataPerYear$OVERALL_DAMAGE ) ;
AggregatedDataPerYear <- subset( AggregatedDataPerYear, OVERALL_DAMAGE > mDamage ) ;
Lets get the Top events within time
top <- 10 ; # You may choose to see whichever top number you want
AggregatedDataPerType <- aggregate( OVERALL_DAMAGE ~ EVTYPE, data=AggregatedDataPerYear, FUN=sum )
topDamage <- head( AggregatedDataPerType[ order( AggregatedDataPerType$OVERALL_DAMAGE,
decreasing = TRUE ), ], top )
topData <- AggregatedDataPerYear[ AggregatedDataPerYear$EVTYPE %in% topDamage$EVTYPE, ]
This last plot is intended to present the behavior of the most frequent event types within time. It allows us to see which event types seem to have an upward tendency and which downward, permitting the decision maker to take previsions more on a “where-we-come-from & where-are-we-going” based model.
library(ggplot2)
qplot( YEAR, OVERALL_DAMAGE, data=topData, color=EVTYPE, geom = c( "point", "line" ), main ="Damage Index Withing Time" )
The main conclusion we can draw from here is that there has been some kind of seasonality of eventuality effects within time. It’s clear that the effect of some events are growing like Torando and High Winds, and the effect of some others are fading down, like Typhoons and Winter Storms.
Good luck with yours