The following analysis was performed with the goal of determining the costs, both economic and population-wise, of various storm types to the United States. By the end of my analysis, you will be able to see that tornadoes pose the greatest population threat to the United States, while floods pose the greatest economic threat.
For both the health and economic damage caused per storm, I decided to use the total (sum) of the damage caused, as opposed to the mean or the median. My reasoning for this is that since certain types of storms are more likely to occur than others (i.e. P(tornado) > P(tsunami)), taking the mean/median could potentially lead to a misallocation of resources. In other words, although the mean death count of a Tsunami may (would require a separate analysis) be greater than the death count of a tornado, since tornadoes are a more probable event, the death count of a tornado should be viewed as more probable as well.
I will go into more depth about the following in the Data Processing section of this analysis, however, I decided to use the attribute PROPDMG (property damage) as opposed to CROPDMG (crop damage) when calculating the economic damage per storm type. My reasoning for this is that the percentage of the US population that are farmers (and therefore directly affected by crop damage) is relatively small compared to those who would not be directly affected by crop damage.
The following code chunk sets the working directory, reads data the data into R and creates the subset containing only the columns that will be necessary for this analysis. (EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)
setwd("/Users/marsh/data_science_coursera/reproducible_research/course_project_2/")
stormdata <- read.csv("repdata%2Fdata%2FStormData.csv.bz2")
stormdata <- stormdata[,c("EVTYPE","FATALITIES","INJURIES","PROPDMG","PROPDMGEXP","CROPDMG","CROPDMGEXP")]
The one recommendation I am going to make in this analysis has more to do with how the National Weather Service stores their data.
The columns PROPDMG and PROPDMGEXP express one attribute of each observation in the data set, and, in my opinion, should therefore be stored in one column. Although it may be easier to store them in separate columns, the following three (rather large) code chunks are performing the task of bringing the two columns into one. Should these two columns be stored together in the future, these code chunks would not be necessary (or at least would be to a lesser degree). This principal could (and should) be applied to the columns CROPDMG and CROPDMGEXP as well.
Moving on, the following code chunk removes any observation where both CROPDMG and PROPDMG are equal to zero (since I am using the sum of the data as opposed to the mean, removing these rows won’t affect the outcome). Next, it removes the columns CROPDMG and CROPDMGEXP from the data set (see Synopsis for reasoning). After that, it keeps only obersvations where an uppercase or lowercase “B”,“K” or “M” are present. My reasoning for doing this is that in the third paragraph of section 2.7 of the metadata for this dataset, it says that, “Alphabetical characters used to signify magnitude include “K” for thousands, “M” for millions, and “B” for billions.” Lastly, it creates a new column, property_damage, that pastes PROPDMG and PROPDMGEXP together.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
#Create econ_data, subset of stormdata
econ_data <- stormdata[,c("EVTYPE", "PROPDMG", "PROPDMGEXP","CROPDMG","CROPDMGEXP")]
#Remove data where both PROPDMG and CROPDMG have 0's as values
econ_data <- filter(econ_data, econ_data$PROPDMG != 0 & econ_data$CROPDMG != 0)
econ_data <- econ_data[,c("EVTYPE","PROPDMG","PROPDMGEXP")]
econ_data <- econ_data[grep("[Bb]|[Mm]|[Kk]", econ_data$PROPDMGEXP),]
econ_data <- mutate(econ_data, property_damage = paste(PROPDMG, PROPDMGEXP))
The following code chunk creates a subset of econ_data called thousands, which only contains observations where an uppercase or lowercase “K” is present in the PROPDMGEXP column. The code then creates a new column, property_damage_1, that replaces the “K” with either three zeroes (000), two zeroes (00) or one zero (0), depending on the number of decimal places.
#Create "thousands" data set and new column with real price values
thousands <- filter(econ_data, econ_data$PROPDMGEXP == "K" | econ_data$PROPDMGEXP == "k")
#"thousand" data set with no decimals
thousands_periods_0 <- thousands[grep("^[0-9]{1,3} [Kk]", thousands$property_damage),]
thousands_periods_0$property_damage_1 <- gsub("[Kk]","000", thousands_periods_0$property_damage)
thousands_periods_0$property_damage_1 <- gsub(" ","", thousands_periods_0$property_damage_1)
thousands_periods_0$property_damage_1 <- as.numeric(thousands_periods_0$property_damage_1)
#"thousand" data set with 1 decimal point
thousands_periods_1 <- thousands[grep("\\.([0-9] [Kk]){1}", thousands$property_damage),]
thousands_periods_1$property_damage_1 <- gsub("\\.","", thousands_periods_1$property_damage)
thousands_periods_1$property_damage_1 <- gsub("[Kk]","00", thousands_periods_1$property_damage_1)
thousands_periods_1$property_damage_1 <- gsub(" ","", thousands_periods_1$property_damage_1)
thousands_periods_1$property_damage_1 <- as.numeric(thousands_periods_1$property_damage_1)
#"thousand" data set with 2 decimal points
thousands_periods_2 <- thousands[grep("\\.([0-9]){2} [Kk]",
thousands$property_damage),]
thousands_periods_2$property_damage_1 <- gsub("\\.","", thousands_periods_2$property_damage)
thousands_periods_2$property_damage_1 <- gsub("[Kk]","0", thousands_periods_2$property_damage_1)
thousands_periods_2$property_damage_1 <- gsub(" ","", thousands_periods_2$property_damage_1)
thousands_periods_2$property_damage_1 <- as.numeric(thousands_periods_2$property_damage_1)
#rbind thousands_periods_0, thousands_periods_1, thousands_periods_2 and assign to thousands
thousands <- rbind(thousands_periods_0, thousands_periods_1, thousands_periods_2)
The following code chunk creates a subset of econ_data called millions, which only contains observations where an uppercase or lowercase “M” is present in the PROPDMGEXP column. The code then creates a new column, property_damage_1, that replaces the “M” with either six zeroes (000000), five zeroes (00000) or four zeroes (0000), depending on the number of decimal places.
#Millions
millions <- filter(econ_data, econ_data$PROPDMGEXP == "M" | econ_data$PROPDMGEXP == "m")
#"millions" data set with no decimals
millions_periods_0 <- millions[grep("^[0-9]{1,3} [Mm]", millions$property_damage),]
millions_periods_0$property_damage_1 <- gsub("[Mm]","000000", millions_periods_0$property_damage)
millions_periods_0$property_damage_1 <- gsub(" ","", millions_periods_0$property_damage_1)
millions_periods_0$property_damage_1 <- as.numeric(millions_periods_0$property_damage_1)
#"millions" data set with 1 decimal point
millions_periods_1 <- millions[grep("\\.([0-9] [Mm]){1}", millions$property_damage),]
millions_periods_1$property_damage_1 <- gsub("\\.","", millions_periods_1$property_damage)
millions_periods_1$property_damage_1 <- gsub("[Mm]","00000", millions_periods_1$property_damage_1)
millions_periods_1$property_damage_1 <- gsub(" ","", millions_periods_1$property_damage_1)
millions_periods_1$property_damage_1 <- as.numeric(millions_periods_1$property_damage_1)
#"millions" data set with 2 decimal points
millions_periods_2 <- millions[grep("\\.([0-9]){2} [Mm]", millions$property_damage),]
millions_periods_2$property_damage_1 <- gsub("\\.","", millions_periods_2$property_damage)
millions_periods_2$property_damage_1 <- gsub("[Mm]","0000", millions_periods_2$property_damage_1)
millions_periods_2$property_damage_1 <- gsub(" ","", millions_periods_2$property_damage_1)
millions_periods_2$property_damage_1 <- as.numeric(millions_periods_2$property_damage_1)
#rbind millions_periods_0, millions_periods_1, millions_periods_2 and assign to millions
millions <- rbind(millions_periods_0, millions_periods_1, millions_periods_2)
The following code chunk creates a subset of econ_data called bllions, which only contains observations where an uppercase or lowercase “B” is present in the PROPDMGEXP column. The code then creates a new column, property_damage_1, that replaces the “B” with either nine zeroes (000000000), eight zeroes (00000000) or seven zeroes (0000000), depending on the number of decimal places.
#Billions
billions <- filter(econ_data, econ_data$PROPDMGEXP == "B" | econ_data$PROPDMGEXP == "b")
#"billions" data set with no decimals
billions_periods_0 <- billions[grep("^[0-9]{1,3} [Bb]", billions$property_damage),]
billions_periods_0$property_damage_1 <- gsub("[Bb]","000000000", billions_periods_0$property_damage)
billions_periods_0$property_damage_1 <- gsub(" ","", billions_periods_0$property_damage_1)
billions_periods_0$property_damage_1 <- as.numeric(billions_periods_0$property_damage_1)
#"billions" data set with 1 decimal point
billions_periods_1 <- billions[grep("\\.([0-9] [Bb]){1}", billions$property_damage),]
billions_periods_1$property_damage_1 <- gsub("\\.","", billions_periods_1$property_damage)
billions_periods_1$property_damage_1 <- gsub("[Bb]","00000000", billions_periods_1$property_damage_1)
billions_periods_1$property_damage_1 <- gsub(" ","", billions_periods_1$property_damage_1)
billions_periods_1$property_damage_1 <- as.numeric(billions_periods_1$property_damage_1)
#"billions" data set with 2 decimal points
billions_periods_2 <- billions[grep("\\.([0-9]){2} [Bb]", billions$property_damage),]
billions_periods_2$property_damage_1 <- gsub("\\.","", billions_periods_2$property_damage)
billions_periods_2$property_damage_1 <- gsub("[Bb]","0000000", billions_periods_2$property_damage_1)
billions_periods_2$property_damage_1 <- gsub(" ","", billions_periods_2$property_damage_1)
billions_periods_2$property_damage_1 <- as.numeric(billions_periods_2$property_damage_1)
#rbind billions_periods_0, billions_periods_1, billions_periods_2 and assign to billions
billions <- rbind(billions_periods_0, billions_periods_1, billions_periods_2)
Finally, the following code chunk combines the thousands, millions and billions data sets into one data set, groups that data set by storm type, and then aggregates the property damage done by storm type using the property_damage_1 column. I then add a final column, rounded_billions, which expresses the number in property_damage_1 in billions. (this will be used for plotting purposes)
tidy_econ_data <- rbind(billions, millions, thousands) %>%
group_by(EVTYPE) %>%
summarize(sum(property_damage_1)) %>%
arrange(desc(`sum(property_damage_1)`))
tidy_econ_data$rounded_billions <- round(tidy_econ_data$`sum(property_damage_1)`/1000000000, 2)
The following graph shows the top ten most deadly storm types with the total death count at the top of each respective bar. As you can see, Tornadoes are more than twice as deadly as the next closes storm type (Excessive Heat).
#Create "Top 10 Most Deadly Storms Types" Graph
org_data <- group_by(stormdata, EVTYPE) %>% summarize(sum(FATALITIES)) %>% arrange(desc(`sum(FATALITIES)`))
org_data_10 <- head(org_data, n = 10)
top_10 <- org_data_10$EVTYPE
y_limit <- c(0, 1.1*max(org_data_10$`sum(FATALITIES)`))
bp <- with(org_data_10, barplot(`sum(FATALITIES)`, ylim = y_limit, main = "Top 10 Most Fatal Storm Types",
ylab = "Death Count"))
axis(1, at = bp, labels = top_10, tick = FALSE, las = 2, line = -0.75, cex.axis = 0.65)
text(x = bp, y = org_data_10$`sum(FATALITIES)`, labels = org_data_10$`sum(FATALITIES)`, pos = 3, col = "red")
This next graph shows the top ten most injury-inducing storm types. Again, Tornadoes are far and away the most harmful.
#Create "Top 10 Most Injury-Inducing Storm Types" Graph
inj_data <- group_by(stormdata, EVTYPE) %>% summarize(sum(INJURIES)) %>% arrange(desc(`sum(INJURIES)`))
inj_data_10 <- head(inj_data, n = 10)
top_10_inj <- inj_data_10$EVTYPE
y_limit_inj <- c(0, 1.08*max(inj_data_10$`sum(INJURIES)`))
bp_inj <- with(inj_data_10, barplot(`sum(INJURIES)`, ylim = y_limit_inj,
main = "Top 10 Most Injury-Inducing Storm Types", ylab = "Injury Count"))
axis(1, at = bp, labels = top_10_inj, tick = FALSE, las = 2, line = -0.85, cex.axis = 0.45)
text(x = bp, y = inj_data_10$`sum(INJURIES)`, labels = inj_data_10$`sum(INJURIES)`, pos = 3, col = "red")
Finally, this graph shows the top ten most economically damaging storm types. By more than $90 Billion, Floods are the most economically harmful storm types in this dataset.
top_10_tidy_econ_data <- head(tidy_econ_data, n = 10)
top_10_tidy <- top_10_tidy_econ_data$EVTYPE
y_limit_tidy <- c(0, 1.1*max(top_10_tidy_econ_data$rounded_billions))
bp_tidy <- with(top_10_tidy_econ_data, barplot(rounded_billions, ylim = y_limit_tidy,
main = "Total Economic Damage by Storm Type",
ylab = "Cost (US Dollars in Billions)"))
axis(1, at = bp_tidy, labels = top_10_tidy, tick = FALSE, las = 2, line = -0.75, cex.axis = 0.65)
text(x = bp_tidy, y = top_10_tidy_econ_data$rounded_billions, labels = top_10_tidy_econ_data$rounded_billions,
pos = 3, col = "red")