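Using the NOAA Storm Database, this analysis answers two questions: which types of weather events are most harmful to population health, and which have the greatest economic consequences.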
This chapter first describes how the data is loaded into R and then the processing done to answer both questions of the project.
After creating a local R project, create a folder called input_data. Store the bz2 file in that folder and call it stormdata.csv.bz2.
Following those instructions, this R project will work correctly.
stormdata <- read.csv("input_data/stormdata.csv.bz2")
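As a defensive variant of the load step, the sketch below checks that the file is where the instructions above say it should be before reading it. The helper name `load_stormdata` is hypothetical and not part of the original analysis; it relies on `read.csv` being able to read bzip2-compressed files directly.

```r
# Hypothetical helper: fail early with a clear message if the file is missing.
load_stormdata <- function(path = "input_data/stormdata.csv.bz2") {
  if (!file.exists(path)) {
    stop("Storm data not found at '", path,
         "'. Create the input_data folder and store the bz2 file there.")
  }
  # read.csv opens the path through file(), which detects bzip2 compression.
  read.csv(path)
}
```

With the file in place, `stormdata <- load_stormdata()` would be equivalent to the direct `read.csv` call.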
Question 1 is: Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
In order to answer this we will need the dplyr package.
library(dplyr)
The steps are:
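# Keep only the columns needed for question 1
question1 <- stormdata %>% select(EVTYPE,FATALITIES,INJURIES)
# Group the records by event type
group_question1 <- question1 %>% group_by(EVTYPE)
# Total fatalities and injuries per event type, sorted in descending order
question1_answer <- summarize(group_question1, sum(FATALITIES), sum(INJURIES)) %>% arrange(desc(`sum(FATALITIES)`),desc(`sum(INJURIES)`))
# Remove intermediate objects
rm(group_question1)
rm(question1)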
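The same group-and-sum pattern can be tried on a small toy data frame. This is only an illustrative sketch: the values are made up, and the column names `TOTAL_FATALITIES` and `TOTAL_INJURIES` are hypothetical alternatives to the backticked `sum(...)` names, which the rest of this report continues to rely on.

```r
library(dplyr)

# Toy data standing in for stormdata (values are made up for illustration).
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HEAT"),
  FATALITIES = c(3, 1, 2, 4),
  INJURIES   = c(10, 0, 5, 1)
)

# Sum both measures per event type, then sort by fatalities and injuries.
toy_answer <- toy %>%
  group_by(EVTYPE) %>%
  summarize(
    TOTAL_FATALITIES = sum(FATALITIES),
    TOTAL_INJURIES   = sum(INJURIES)
  ) %>%
  arrange(desc(TOTAL_FATALITIES), desc(TOTAL_INJURIES))
```

Naming the summary columns avoids having to quote them with backticks in later steps, at the cost of renaming every downstream reference.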
Question 2 is: Across the United States, which types of events have the greatest economic consequences?
This question is harder to answer because the total amount of damage is given by a combination of four variables: PROPDMG, PROPDMGEXP, CROPDMG, and CROPDMGEXP.
Besides that, not all the values in PROPDMGEXP and CROPDMGEXP are correctly coded. The vast majority of values in PROPDMGEXP and CROPDMGEXP are "K" or "k" (thousands), "M" or "m" (millions), and "B" or "b" (billions).
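By "vast majority" we mean the share of records with positive damage whose exponent is k, K, m, M, b, or B, based on the following counts: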
propdmgexp_well_coded <- nrow(filter(stormdata, PROPDMG > 0 & PROPDMGEXP %in% c("k","K","m","M","b","B")))
total_propodmgexp <- nrow(filter(stormdata, PROPDMG > 0 ))
cropdmgexp_well_coded <- nrow(filter(stormdata, CROPDMG > 0 & CROPDMGEXP %in% c("k","K","m","M","b","B")))
total_cropodmgexp <- nrow(filter(stormdata, CROPDMG > 0 ))
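The counts above can be turned into a share by dividing the well-coded count by the total, for example `propdmgexp_well_coded / total_propodmgexp`. The sketch below shows the same calculation on a toy data frame with made-up values; the names `toy`, `valid_exp`, and `share_well_coded` are illustrative only.

```r
# Toy data (made-up values); the real columns come from stormdata.
toy <- data.frame(
  PROPDMG    = c(2.5, 10, 0, 1, 7),
  PROPDMGEXP = c("M", "K", "", "5", "b")
)
valid_exp <- c("k", "K", "m", "M", "b", "B")

# Rows that report some damage, and those among them with a recognised code.
with_damage <- toy$PROPDMG > 0
well_coded  <- with_damage & toy$PROPDMGEXP %in% valid_exp

# Share of positive-damage rows whose exponent is one of the recognised codes.
share_well_coded <- sum(well_coded) / sum(with_damage)
```

In this toy example, three of the four rows with positive damage carry a recognised code, so the share is 0.75.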
To calculate damage values more accurately, a new variable holding the calculated damage value will be added. For example, if PROPDMG = 2.5 and PROPDMGEXP = “M”, the calculated damage value is 2,500,000.
After this is done for all the values of PROPDMG (in new variable PROPDAMAGE_AMOUNT) and CROPDMG (in new variable CROPDAMAGE_AMOUNT), the sum of PROPDAMAGE_AMOUNT and CROPDAMAGE_AMOUNT for each event will be the total damage value.
At the end we extract only EVTYPE and the new total damage value, group by EVTYPE, and summarize.
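The exponent-to-multiplier rule can also be expressed as a lookup table. The following is a minimal sketch, not the code used in this report: `exp_multiplier` and `damage_amount` are hypothetical names, and unrecognised codes are assumed to contribute 0, as in the steps that follow.

```r
# Multipliers for the recognised exponent codes (upper case after toupper()).
exp_multiplier <- c(K = 1e3, M = 1e6, B = 1e9)

# Hypothetical helper: damage value in dollars for each (amount, code) pair.
damage_amount <- function(dmg, dmgexp) {
  mult <- exp_multiplier[toupper(as.character(dmgexp))]
  # Codes that are not K, M or B (including empty strings) count as 0.
  mult[is.na(mult)] <- 0
  unname(dmg * mult)
}

# Worked example from the text: 2.5 with code "M" is 2.5 * 1e6 = 2,500,000.
example_value <- damage_amount(2.5, "M")
```

A lookup vector keeps the mapping in one place, which can be easier to extend than nested `ifelse` calls if more codes ever need to be handled.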
The steps are:
question2 <- stormdata %>% select(EVTYPE,PROPDMG,PROPDMGEXP,CROPDMG,CROPDMGEXP)
question2$PROPDAMAGE_AMOUNT <- with(question2,
  ifelse(PROPDMGEXP %in% c("k","K"), PROPDMG*1000,
    ifelse(PROPDMGEXP %in% c("m","M"), PROPDMG*1000000,
      ifelse(PROPDMGEXP %in% c("b","B"), PROPDMG*1000000000,
        0)
    )
  )
)
question2$CROPDAMAGE_AMOUNT <- with(question2,
  ifelse(CROPDMGEXP %in% c("k","K"), CROPDMG*1000,
    ifelse(CROPDMGEXP %in% c("m","M"), CROPDMG*1000000,
      ifelse(CROPDMGEXP %in% c("b","B"), CROPDMG*1000000000,
        0)
    )
  )
)
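# Total damage per record: property damage plus crop damage
question2$TOTAL_DAMAGE_VALUE <- question2$PROPDAMAGE_AMOUNT+question2$CROPDAMAGE_AMOUNT
# Keep only the event type and the total damage value
question2 <- select(question2, c(EVTYPE,TOTAL_DAMAGE_VALUE))
group_question2 <- question2 %>% group_by(EVTYPE)
# Total damage per event type, sorted in descending order
question2_answer <- summarize(group_question2, sum(TOTAL_DAMAGE_VALUE)) %>% arrange(desc(`sum(TOTAL_DAMAGE_VALUE)`))
# Remove intermediate objects
rm(group_question2)
rm(question2)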
This chapter presents the answers to both questions with the help of plots.
library(ggplot2)
question1_answer_plot <- question1_answer[1:20,]
question1_answer_plot <- rbind(
cbind(select(question1_answer_plot, c(EVTYPE,"COUNT" = `sum(FATALITIES)`)),"CATEGORY" = "FATALITIES"),
cbind(select(question1_answer_plot, c(EVTYPE,"COUNT" = `sum(INJURIES)`)),"CATEGORY" = "INJURIES")
)
question1_answer_plot %>%
ggplot(aes(fill=CATEGORY, y= COUNT, x=reorder(EVTYPE,-COUNT))) +
geom_bar(position="dodge", stat="identity") +
theme(axis.text.x = element_text(angle = 45,hjust = 1)) +
labs( x = "WEATHER EVENT") +
labs( title = "Question 1 - Most harmful events")
options("scipen"=100)
question2_answer_plot <- question2_answer[1:20,]
question2_answer_plot$`sum(TOTAL_DAMAGE_VALUE)` <- question2_answer_plot$`sum(TOTAL_DAMAGE_VALUE)` /1000000
question2_answer_plot %>%
ggplot(aes(y= `sum(TOTAL_DAMAGE_VALUE)`, x=reorder(EVTYPE,-`sum(TOTAL_DAMAGE_VALUE)`))) +
geom_bar(position="dodge", stat="identity") +
theme(axis.text.x = element_text(angle = 45,hjust = 1)) +
labs( x = "WEATHER EVENT") +
labs( y = "US Dollars (Millions)") +
labs( title = "Question 2 - Events with the greatest economic consequences")