First, I’ll set some globals for this R Markdown document.
knitr::opts_chunk$set(echo = TRUE,results='markup',cache=TRUE)
options(width = 80)
library(ggplot2)
library(reshape2)
library(grid)
The data used in this analysis comes from Project 2 of the course “Reproducible Research”, offered by Johns Hopkins on Coursera. According to the assignment, this data was taken from the US National Oceanic and Atmospheric Administration’s (NOAA) storm database. Further documentation of the data can be found here:
The goal of this analysis is to answer two primary questions, both specified in Project 2:
1. Across the United States, which types of events (as indicated in the EVTYPE variable)
are most harmful with respect to population health?
2. Across the United States, which types of events have the greatest economic consequences?
For the remainder of this analysis, events that are harmful to population health (i.e., events with high injury and fatality reports) are referred to as “hazardous”. Events that produce high amounts of crop and property damage are referred to as “damaging”.
Our primary results only focused on total reported injuries, deaths, crop damage, and property damage over the entire ~60 year time period. Results from any additional analyses will not be discussed here.
With respect to population health, our results identify tornadoes as the most hazardous storm-related event. Tornado-related deaths and injures far out-numbered the second-highest, event-related statistic in both categories (‘excessive heat’ for deaths and ‘thunderstorm winds’ for injuries): \[D_{Tornadoes} / D_{2ndPlace} > 2.5\] \[I_{Tornadoes} / I_{2ndPlace} > 10\]
With respect to economic value, our analysis identifies flooding, low-temperature storms, and other coastal events as among the most damaging storm-related events. This is especially true for property damage, which constitutes 9 of the 10 highest reported damages and displays the most variance among the events. ‘Drought’ is identified as the most-damaging event with respect to crop damage (and is the 7th most damaging event, overall).
Data processing will occur in 3 steps: (1) loading, (2) exploring, and (3) processing (which will include any transformations).
We need to load the data and explore the column names.
StormData <- read.csv('repdata_data_StormData.csv.bz2',stringsAsFactors=FALSE)
dim(StormData)
## [1] 902297 37
names(StormData)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
The following columns were chosen for their relevance to our objectives. They were also picked for analysis for their similarity to one another (i.e., comparing “apples-to-apples”, and not FATALITIES-to-CROPDMG):
EVTYPE, F (Fujita scale)
INJURIES, FATALITIES
CROPDMG, CROPDMGEXP, PROPDMG, PROPDMGEXP
To save on memory and increase processing speed, we will only keep the columns we need.
StormData <- StormData[,c('EVTYPE', 'F',
'FATALITIES', 'INJURIES',
'PROPDMG', 'PROPDMGEXP',
'CROPDMG','CROPDMGEXP')]
## We can look at what this shortened data looks like
StormData[1:10,]
## EVTYPE F FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 TORNADO 3 0 15 25.0 K 0
## 2 TORNADO 2 0 0 2.5 K 0
## 3 TORNADO 2 0 2 25.0 K 0
## 4 TORNADO 2 0 2 2.5 K 0
## 5 TORNADO 2 0 2 2.5 K 0
## 6 TORNADO 2 0 6 2.5 K 0
## 7 TORNADO 2 0 1 2.5 K 0
## 8 TORNADO 1 0 0 2.5 K 0
## 9 TORNADO 3 1 14 25.0 K 0
## 10 TORNADO 3 0 0 25.0 K 0
After some exploring, a few columns seem to need attention before proceeding.
unique(StormData$F)
## [1] 3 2 1 4 NA 0 5
unique(StormData$CROPDMGEXP)
## [1] "" "M" "K" "m" "B" "?" "0" "k" "2"
unique(StormData$PROPDMGEXP)
## [1] "K" "M" "" "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-" "1" "8"
After more exploration and some additional reading, the F column likely refers to the “Fujita scale” for tornadoes and high-winds. Furthermore, it has been shown that both *EXP columns contain order-of-magnitude information for the CROPDMG and PROPDMG columns (see Section “EXP Handling”).
The F-scale analysis was removed. Feel free to look in Section “Removed Analyses” for why this was removed. Below is a replacement code block that continues the program as intended, by simply removing the F column.
## this is basically just removing the F column
StormData <- StormData[, c('EVTYPE',
'FATALITIES', 'INJURIES',
'PROPDMG', 'PROPDMGEXP',
'CROPDMG','CROPDMGEXP')]
We are able to make use of the CROPDMGEXP and PROPDMGEXP variables thanks to an analysis done by Eddie Song and a second, unlisted author, found here: https://rstudio-pubs-static.s3.amazonaws.com/58957_37b6723ee52b455990e149edde45e5b6.html
In short, each tag in either column stands for an order-of-magnitude multiplier for another, associated column (CROPDMGEXP corresponds to CROPDMG, etc.). Here are the tags and the multipliers they correspond to:
'-','?' = 0
'+', NA = 1
[0 - 8] = 10
'h','H' = 100
'k','K' = 1000
'm','M' = 1,000,000
'b','B' = 1,000,000,000
For example, if StormData$CROPDMGEXP[1]=='H' and StormData$CROPDMG[1]==2, then the true, numeric crop damage of StormData[1,] would be 200.
The processing below will add these *EXP columns’ information to the proper CROPDMG or PROPDMG columns and then eliminate the *EXP columns entirely. Note that in the R-code below, I am allowing missing *EXP values to be replaced by ‘~’, which is my own tag to multiply by 1, similar to ‘+’. This was done so that I would not be mislabelling points that previously had no label or adding data that was not originally present.
## First, I'll define these conversions as two vectors:
## tags & their corresponding values
EXPVals <- c(0, 0, 1, 1,
rep(10,9),
100, 100,
1000, 1000,
1000000, 1000000,
1000000000, 1000000000)
EXPTags <- c('?', '-', '+', '~', #note my '~'
as.character(0:8),
'h', 'H',
'k', 'K',
'm', 'M',
'b', 'B')
## For both CROPDMG and PROPDMG,
## I want to treat every *two* entries (*DMG and *DMGEXP)
## as a *single* entry for mathematic clarity.
## Thus, I'm making a function that takes in a string (character) containing both:
## example: "2;h"==200, "53;NA"==53
EvaluateDMG <- function(CombinedDMGString){
DMGVector = unlist(strsplit(CombinedDMGString,split = ';')) #will make strings later
if (length(DMGVector)==1){
DMGVector[2] <- '~'
}
if (!(DMGVector[2] %in% EXPTags)){
stop(paste('DMGVector[2] needs to be in EXPTags! The following was not:',DMGVector[2]))
}
return(as.numeric(DMGVector[1]) * EXPVals[EXPTags==DMGVector[2]])
}
This function is then applied to the CROPDMG, CROPDMGEXP, PROPDMG, and PROPDMGEXP columns.
## Here, mapply is used to combine 2 columns (at a time) into one,
## filled with strings in *preparation* for EvaluateDMG
## For example, if StormData$CROPDMGEXP[1]=='h' & StormData$CROPDMG[1]==2,
## then StormData$CROPDMG would be replaced such that StormData$CROPDMG[1]=='2;h'
StormData$CROPDMG <- mapply(paste,
StormData$CROPDMG,
StormData$CROPDMGEXP,
sep = ';',
USE.NAMES = FALSE)
StormData$PROPDMG <- mapply(paste,
StormData$PROPDMG,
StormData$PROPDMGEXP,
sep = ';',
USE.NAMES = FALSE)
## Get rid of excess columns (CROPDMGEXP and PROPDMGEXP)
StormData <- StormData[,c('EVTYPE',
'FATALITIES', 'INJURIES',
'PROPDMG', 'CROPDMG')]
## apply EvaluateDMG to the CROPDMG and PROPDMG columns
StormData$CROPDMG <- sapply(StormData$CROPDMG,
EvaluateDMG,
USE.NAMES = FALSE)
StormData$PROPDMG <- sapply(StormData$PROPDMG,
EvaluateDMG,
USE.NAMES = FALSE)
dim(StormData)
## [1] 902297 5
StormData should now be significantly smaller, cleaned, and ready for analysis.
From the data we have, there are a few combinations that can help us compare things between the 985 types of events present in the dataframe:
- Injuries vs. Fatalities
- CropDamage vs. PropertyDamage
- Totals vs. Averages (“Over 60 years” vs. “Per each occurence”)
The last combination of values will be important when judging the impact of a single occurence of any type of event. By doing this, we are enabling ourselves to identify both common, high-in-total and rare, extremely harmful events. This type of analysis (Totals vs Average) will only be covered in the final plot, inside the section “Stats Per Occurence”.
To find the most influential types of events, we will want to find outliers that have high values, compared to other events.
We’ll be able to sum and average by EVTYPE via aggregate(). Furthermore, once averages and totals have been created, I will identify and label all of the outliers per each category. Here, a category is defined as any combination of storm measurement (Injuries/Fatalities/CropDMG/PropDMG) and measurement type (Total/Average). As can be seen below, outliers will be defined as any event with a value that is within the 95th exclusive percentile of the positive numbers within its category.
It is important to note that I only choose positive numbers because we have already totalled and averaged our numbers. This way, I simply don’t include irrelevent events. For example, we wouldn’t want an event like “Perfect Weather”, with zero total fatalities, to count in our fatality distribution. One could argue that such an event should not be considered among fatal-events.
## aggregate for all totals and means
Totals <- aggregate(. ~ EVTYPE, StormData, sum, na.rm=TRUE)
Averages <- aggregate(. ~ EVTYPE, StormData, mean, na.rm=TRUE)
## Each of these should be significantly smaller than StormData
dim(StormData); dim(Totals); dim(Averages)
## [1] 902297 5
## [1] 985 5
## [1] 985 5
## to find quantiles PER measurement PER measurement type (avg or tot),
## we should reshape Totals and Averages into skinny dataframes
## and NOT merge the two dataframes until after labeling everything.
## furthermore, we will be labeling based on *subsets* of these skinny dataframes
Totals <- melt(Totals, id.vars='EVTYPE', measure.vars=c('INJURIES','FATALITIES',
'CROPDMG','PROPDMG'))
Averages <- melt(Averages, id.vars='EVTYPE', measure.vars=c('INJURIES','FATALITIES',
'CROPDMG','PROPDMG'))
## we can use aggregate to get the quantiles as well!
## THIS is where i only include positive numbers.
TotalCutoffs <- aggregate(value ~ variable, subset(Totals,value>0), quantile, probs=c(0.95), na.rm=TRUE)
AverageCutoffs <- aggregate(value ~ variable, subset(Averages,value>0), quantile, probs=c(0.95), na.rm=TRUE)
## going to reshape these *slightly* (I want a 1x4 dataframe)
TotalCutoffs <- data.frame(INJURIES=subset(TotalCutoffs,variable=='INJURIES')$value,
FATALITIES=subset(TotalCutoffs,variable=='FATALITIES')$value,
CROPDMG=subset(TotalCutoffs,variable=='CROPDMG')$value,
PROPDMG=subset(TotalCutoffs,variable=='PROPDMG')$value)
AverageCutoffs <- data.frame(INJURIES=subset(AverageCutoffs,variable=='INJURIES')$value,
FATALITIES=subset(AverageCutoffs,variable=='FATALITIES')$value,
CROPDMG=subset(AverageCutoffs,variable=='CROPDMG')$value,
PROPDMG=subset(AverageCutoffs,variable=='PROPDMG')$value)
## now we can add labels to the skinny dataframes
## i will be using ifelse() to see if the entries *are* outliers or not
## this ifelse() will look a little complicated because each entry can
## only be compared to the right cutoff
## the as.logical() part might seem weird, but it SHOULD only label if the following is satisfied:
## if each value is greater than or equal to the corresponding value under TotalCutoffs[,Totals$variable]
## the [1,] is needed because it returns a matrix, since
## it does NOT assume TotalCutoffs[,Totals$variable] has only one value (which it does!)
## by the way, this is where the "Exclusive" part of "Exclusive percentile" comes into play
Totals$Label <- ifelse(as.logical((Totals$value > TotalCutoffs[,Totals$variable])[1,]),
Totals$EVTYPE,
'')
Averages$Label <- ifelse(as.logical((Averages$value > AverageCutoffs[,Averages$variable])[1,]),
Averages$EVTYPE,
'')
We finally have two dataframes that contain all of the necessary information to plot and label our data. Lastly, I will combine the Totals and Averages dataframes into one, and then split them into Health and Econ dataframes, useful for the 2 questions. This step is actually pretty redundant for this analysis (in its current form), but can be useful for additional analyses.
AllData <- rbind(Totals,Averages)
AllData$SumType <- c(rep('TOTAL',length(Totals[,1])),
rep('AVERAGE',length(Averages[,1])))
Health <- subset(AllData, variable %in% c('INJURIES',
'FATALITIES'))
Econ <- subset(AllData, variable %in% c('CROPDMG',
'PROPDMG'))
Health <- Health[order(-Health$value),]
Econ <- Econ[order(-Econ$value),]
Econ$CostPhrase <- ifelse(Econ$value>1e+09,
paste(round(Econ$value/(1e+09),digits=1),'B $',sep=''),
paste(round(Econ$value/(1e+06),digits=1),'M $',sep=''))
Restated, question 1 asks:
Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Plotted below is the data from the Health dataframe.
g <- ggplot(subset(Health,SumType=='TOTAL')) +
geom_boxplot(aes(x=variable,y=value)) +
coord_cartesian(ylim = c(0,7500))
## I'll catch all unplotted datapoints here
Unplotted <- subset(Health, value>7500)
## to make the plot more readable, I'm going to try to alternate
## the labels, using a saved vector of random T/F
set.seed(1234)
Alternate <- rbinom(length(subset(Health,SumType=='TOTAL')$Label), 1, 0.5)
g <- g + geom_text(aes(x=variable, y=value, col=Label,
label=ifelse(Alternate,
Label, rep('',length(Label)))),
hjust=-0.1, size=2.2) +
geom_text(aes(x=variable, y=value, col=Label,
label=ifelse(!Alternate,
Label, rep('',length(Label)))),
hjust=1.1, size=2.2) +
geom_text(aes(x=variable, col=Label,
y=rep(7700,length(value)) - (10 * c(1:length(value))),
label=ifelse(value>7500,
paste(Label,value,sep=', '),
rep('',length(Label)))),
size=2.3)
## I'm going to replot the unplotted things starting at the very top
## and "working my way down" via a few new columns, Replot*
## first, need to order
#Unplotted <- Unplotted[order(-Unplotted$value),]
#Unplotted$ReplotValue <- rep(7700,dim(Unplotted)[1]) - (10*(1:dim(Unplotted)))
#Unplotted$ReplotLabel <- paste(Unplotted$EVTYPE,Unplotted$value,sep = ', ')
#g <- g + geom_text(aes(x=Unplotted$variable, y=Unplotted$ReplotValue,
# label=Unplotted$ReplotLabel),
# size=2.1)
## now for some labels
PresetCaption <- c("The reported number of injuries caused by tornadoes is far too large to be plotted.",
"Totalling to more than 10 times its contender, this statistic is described below.")
PresetCaption <- paste(PresetCaption,collapse='\n')
g <- g +
labs(title = 'Health Hazards from Storms',
subtitle = "Total Number of Injuries and Fatalities Recorded (1950 - Nov 2011)",
caption = PresetCaption,
x = '',
y = 'Counts')
g <- g + guides(col=FALSE)
print(g)
## how many were unplotted?
dim(Unplotted)
## [1] 1 5
We can see that there is only 1 point(s) that is beyond the y-limits of what is graphed. This point, understandably, is the total number of injuries from the “TORNADO” EVTYPE. This event caused a whopping 91346 reported injuries during this time period.
Restated, question 2 asks:
Across the United States, which types of events have the greatest economic consequences?
Plotted below is the data from the Econ dataframe.
g <- ggplot(subset(Econ,SumType=='TOTAL')) +
geom_boxplot(aes(x=variable,
y=value/1e+09)) +
coord_cartesian(ylim = c(0,1.75e+1)) #1.75e+10/1e+09
## catch unplotted, like before
Unplotted <- subset(Econ, value>1.75e+10)
#Unplotted$Megadollars <- Unplotted$value / 1e+06
## to make the plot more readable, I'm going to try to alternate
## the labels, using a saved vector of random T/F
set.seed(5678)
Alternate <- rbinom(length(subset(Health,SumType=='TOTAL')$Label),
1, 0.5)
g <- g +
geom_text(aes(x=variable,
y=value/1e+09,
col=Label,
label=ifelse(Alternate,
Label,
rep('',length(Label)))
),
hjust=-0.1, size=2.2) +
geom_text(aes(x=variable,
y=value/1e+09,
col=Label,
label=ifelse(!Alternate,
Label,
rep('',length(Label)))
),
hjust=1.1, size=2.2) +
geom_text(aes(x=variable, col=Label,
y=(rep(18.75,length(value)) - (0.5 * c(1:length(value)))),
label=ifelse(value>1.75e+10,
paste(Label,CostPhrase,sep=', '),
rep('',length(Label)))),
size=2)#,hjust=-0.2)
## now for some labels
n = length( subset(Econ, value>1.75e+10)$value )
PresetCaption <- c(paste(c("There are ",n," reported events with total ",
"damage estimates larger than 17.5 billion ",
"dollars.",collapse='')))
PresetCaption <- paste(PresetCaption,collapse='')
g <- g +
labs(title = 'Economic Damages from Storms',
subtitle = "Total Recorded Crop and Property Damage (1950 - Nov 2011)",
caption = PresetCaption,
x = '',
y = expression("Billions of Dollars ( 10 "^9*" $)")) +
guides(col=FALSE)
print(g)
These values also illustrate the trend of highly-damaging coastal storms, floods, and tornadoes. The unplotted points (FLOOD, HURRICANE/TYPHOON, TORNADO, STORM SURGE) are all property damage with values of (144.657709807, 69.30584, 56.9371629, 43.323536) billions of dollars, respectively.
One thing we do not see from this analysis is how influential these events are per occurence (i.e., each event’s average destructive value). Using our previous a subset of our previous AllData dataframe, we will need to normalize our values so that we can visualize their properties next to one another.
## technically, we already have an Averages dataframe
## but this is just to be consistent with what we JUST were doing
## since AllData is still just Health + Econ
Averages <- subset(AllData,SumType=='AVERAGE')
## these will be used for labelling later on
ShortName = c(' INJUR',' FATAL','M $','M $')
LongName = c('INJURIES','FATALITIES','CROPDMG','PROPDMG')
## now I will go through a semi-complicated way of normalizing
## each value (with that variable's distribution of values)
NormalizedAverages <- NULL
for (harmType in c('INJURIES','FATALITIES',
'CROPDMG','PROPDMG')){
tempRows <- subset(Averages,variable==harmType)
if (harmType %in% c('CROPDMG','PROPDMG')){
PrettyVal <- as.numeric(tempRows$value)/(1e+6)
} else { PrettyVal <- as.numeric(tempRows$value)}
tempMu <- mean(tempRows$value); tempSig <- sd(tempRows$value)
tempRows$NormVal <- (tempRows$value - tempMu)/tempSig
tempRows$Highest <- ifelse(tempRows$value==max(tempRows$value),
paste(PrettyVal,
ShortName[LongName==harmType],
sep = ''),
'')
tempMuSigRow <- data.frame(Mu=c(tempMu), Sigma=c(tempSig),
row.names = c(harmType))
if (is.null(NormalizedAverages)){
MuAndSigmaValues <- tempMuSigRow
NormalizedAverages <- tempRows
} else {
MuAndSigmaValues <- rbind(MuAndSigmaValues, tempMuSigRow)
NormalizedAverages <- rbind(NormalizedAverages, tempRows)
}
}
With this, we should be able to plot our results, similar to how we plotted figures for questions 1 & 2.
g <- ggplot(NormalizedAverages) +
geom_boxplot(aes(x=variable,y=NormVal)) +
## I'll only label things that are above a certain significance
geom_text(aes(x=variable, y=NormVal, col=Label,
#label=Label),
label=ifelse(NormVal>4.9,
Label,
'')),
hjust=-0.1, size=2)
## and I'll add some values for the highest #s
g <- g + geom_text(aes(x=variable, y=NormVal, col=Label,
label=Highest),
hjust=-0.5, vjust=2, size=2)
g <- g +
labs(title = 'Most Dangerous Storms Per Occurence',
subtitle = "1950 - Nov 2011",
x = '',
y = expression(paste('Normalized Cost Per Occurence, Z = ( ',bar(X),' - ',
mu, ' ) / ', sigma))) +
guides(col=FALSE)
print(g)
Note that the labeled points are still defined as outliers in the 95th exclusive percentile. We can see distinct differences between this analysis and those of the totals. In particular, this graph favors events that are rare and highly damaging. This explains the presence of specific events, such as Topical Storm Gordon and Hurricane Opal. In this author’s opinion, events that are consistent outliers between the plots (such as tropical storms, hurricanes, and excessive temperatures) could be be noted as regular, highly-destructive events.
For various reasons, as I go through these courses, I often need to remove or modify ideas I had. This section is devoted to those analyses that were eventually cut. In each section, the first paragraph explains why the section was cut and what (if any) results could be drawn when that section was included. Everything else is copied verbatim from the original analysis (except for the eval=FALSE option in the code block).
This section was removed simply due to redundancy and clutter in the final plots that answered questions 1 and 2. This added an F-scale to the EVTYPE value for any applicable event. While it did add an extra layer of information, there were very few datapoints with F-scales to begin with. Furthermore, by quickly looking at the final plot of total injuries per event, the highest four events (by a significant margin) are F4-, F3-, F2-, and F5-Tornadoes (in descending order of total injuries). This suggests that F4, F3, and F2 tornadoes are both harmful (in terms of # of injuries) and widespread-enough to be considered among the most harmful-to-health events during this time period. All-in-all, I removed this section so that more types of events could be revealed than just tornadoes. This section was originally placed immediately before Section “EXP Handling”.
First, inspecting the F column:
mean(is.na(StormData$F))
mean(!is.na(StormData$F))
totalFraction <- 0
for (L in levels(as.factor(StormData$F))){
print(mean(StormData$F==L,na.rm=TRUE))
totalFraction <- totalFraction + mean(StormData$F==L,na.rm=TRUE)
}
totalFraction == 1
There doesn’t seem to be anything wrong with their method of recording F-scales, but I’d prefer to combine the EVTYPE and F value whenever possible. This will help treat NA’s in F, and will allow us to use F as a classifier. This way, “F5 Tornadoes” will be treated the same (categorically) as “Floods” or “F0 High Winds” (or even “Tornadoes”, for that matter).
## we're just going to replace the `EVTYPES` column entirely
## we'll paste() 'F0' to 'Tornado' with sep=' ', making 'F0-Tornado'
## we can use an ifelse() function to paste '' whenever F is NA
StormData$EVTYPE <- paste( ifelse(is.na(StormData$F),
'',
paste( paste('F',StormData$F,sep='') ,
'-', sep = '')),
StormData$EVTYPE,
sep = '')
StormData <- StormData[, c('EVTYPE',
'FATALITIES', 'INJURIES',
'PROPDMG', 'PROPDMGEXP',
'CROPDMG','CROPDMGEXP')]
StormData[1,]