Data Processing

This chapter first describes how the data is loaded into R, and then the processing done to answer both questions of the project.

Data loading

After creating a local R project, create a folder called input_data. Store the bz2 file in that folder and name it stormdata.csv.bz2.

With the file in that location, the code in this R project will run as written.

stormdata <- read.csv("input_data/stormdata.csv.bz2")
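As a minimal sketch of how the loading step could fail early when the file is not where the instructions place it, the function below wraps the same read.csv call. The name load_stormdata is hypothetical and not part of the original analysis.

```r
# Hypothetical helper (an assumption, not part of the original analysis):
# stop with a clear message when the compressed file is missing.
load_stormdata <- function(path = "input_data/stormdata.csv.bz2") {
  if (!file.exists(path)) {
    stop("Expected data file not found: ", path)
  }
  # read.csv() decompresses bzip2 files transparently
  read.csv(path)
}
```

It would be called as stormdata <- load_stormdata(), which is equivalent to the line above once the file exists.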

Answering question 1

Question 1 is: Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

To answer this we need the dplyr package.

library(dplyr)

The steps are:

Create a new data frame containing the variables EVTYPE, FATALITIES and INJURIES:

question1 <- stormdata %>% select(EVTYPE,FATALITIES,INJURIES)

Group the data using the EVTYPE variable:

group_question1 <- question1 %>% group_by(EVTYPE)

Summarize the data, storing the results in the question1_answer data frame, and then delete the auxiliary data frames:

question1_answer <- summarize(group_question1, sum(FATALITIES), sum(INJURIES)) %>% arrange(desc(`sum(FATALITIES)`),desc(`sum(INJURIES)`))

rm(group_question1)
rm(question1)
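The three steps above can also be chained into a single pipeline. The sketch below shows this on a small made-up data frame; toy, toy_answer, total_fatalities and total_injuries are illustrative names that do not appear in the original analysis, and naming the summary columns is a design choice that avoids the backticks.

```r
library(dplyr)

# Toy data standing in for stormdata (values are invented for illustration)
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HEAT"),
  FATALITIES = c(3, 1, 2, 4),
  INJURIES   = c(10, 0, 5, 2)
)

# Select, group, summarize and sort in one chain, with named summary columns
toy_answer <- toy %>%
  select(EVTYPE, FATALITIES, INJURIES) %>%
  group_by(EVTYPE) %>%
  summarize(
    total_fatalities = sum(FATALITIES),
    total_injuries   = sum(INJURIES)
  ) %>%
  arrange(desc(total_fatalities), desc(total_injuries))
```

In this toy case TORNADO comes first with 5 fatalities and 15 injuries, followed by HEAT and FLOOD.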

Answering question 2

Question 2 is: Across the United States, which types of events have the greatest economic consequences?

This question is harder to answer because the total amount of damage comes from a combination of:

PROPDMG and PROPDMGEXP
CROPDMG and CROPDMGEXP

In addition, not all the values in PROPDMGEXP and CROPDMGEXP are coded consistently. The vast majority of values in PROPDMGEXP and CROPDMGEXP are:

k or K for Thousands
m or M for Millions
b or B for Billions

By vast majority we mean:

propdmgexp_well_coded <- nrow(filter(stormdata, PROPDMG > 0 & PROPDMGEXP %in% c("k","K","m","M","b","B")))
total_propodmgexp <- nrow(filter(stormdata, PROPDMG > 0 ))
cropdmgexp_well_coded <- nrow(filter(stormdata, CROPDMG > 0 & CROPDMGEXP %in% c("k","K","m","M","b","B")))
total_cropodmgexp <- nrow(filter(stormdata, CROPDMG > 0 ))

238847 PROPDMGEXP well coded out of 239174
22084 CROPDMGEXP well coded out of 22099
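To put those counts in proportion, the short sketch below converts them to percentages. The variable names are illustrative and chosen so that they do not overwrite the ones used above.

```r
# Counts copied from the two lines above
prop_ok    <- 238847
prop_total <- 239174
crop_ok    <- 22084
crop_total <- 22099

# Share of non-zero damage records whose exponent is K, M or B, in percent
prop_share <- round(100 * prop_ok / prop_total, 2)   # 99.86
crop_share <- round(100 * crop_ok / crop_total, 2)   # 99.93
```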

To calculate damage values more accurately, a new variable holding the calculated damage value will be added. For example, if PROPDMG = 2.5 and PROPDMGEXP = “M”, the calculated damage value will be 2,500,000.

After this is done for all the values of PROPDMG (in the new variable PROPDAMAGE_AMOUNT) and CROPDMG (in the new variable CROPDAMAGE_AMOUNT), the sum of PROPDAMAGE_AMOUNT and CROPDAMAGE_AMOUNT for each event will be the total damage value.

Finally, we keep only EVTYPE and the new total damage value, group by EVTYPE, and summarize.

The steps are:

Extract the needed variables into the question2 data frame:

question2 <- stormdata %>% select(EVTYPE,PROPDMG,PROPDMGEXP,CROPDMG,CROPDMGEXP)

Create a new variable for the total property damage value. Each value of PROPDMG is multiplied according to its PROPDMGEXP code, and any other code yields 0:

if PROPDMGEXP is k or K, by 1,000
if PROPDMGEXP is m or M, by 1,000,000
if PROPDMGEXP is b or B, by 1,000,000,000

question2$PROPDAMAGE_AMOUNT <- with(question2,
    ifelse(PROPDMGEXP %in% c("k","K"), PROPDMG*1000,
        ifelse(PROPDMGEXP %in% c("m","M"), PROPDMG*1000000,
               ifelse(PROPDMGEXP %in% c("b","B"), PROPDMG*1000000000,
                      0)
        )
    )
)

Create a new variable for the total crop damage value. It follows the same principle as the property damage step:

question2$CROPDAMAGE_AMOUNT <- with(question2,
      ifelse(CROPDMGEXP %in% c("k","K"), CROPDMG*1000,
             ifelse(CROPDMGEXP %in% c("m","M"), CROPDMG*1000000,
                    ifelse(CROPDMGEXP %in% c("b","B"), CROPDMG*1000000000,
                           0)
             )
      )
)
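The nested ifelse calls above can also be written as a lookup in a named vector, which keeps the multipliers in one place. This is only a sketch of an alternative, not the method used in this analysis; exp_multiplier and damage_amount are hypothetical names.

```r
# Hypothetical lookup table: multipliers keyed by upper-case exponent code
exp_multiplier <- c(K = 1e3, M = 1e6, B = 1e9)

# Hypothetical helper: returns value * multiplier, or 0 for any other code,
# mirroring the behaviour of the nested ifelse calls
damage_amount <- function(value, exp_code) {
  mult <- exp_multiplier[toupper(as.character(exp_code))]
  mult[is.na(mult)] <- 0
  unname(value * mult)
}

# 2.5 with "M" gives 2,500,000; 10 with "k" gives 10,000; "?" gives 0
example_amounts <- damage_amount(c(2.5, 10, 3), c("M", "k", "?"))
```

With this helper, the two columns could be filled with damage_amount(PROPDMG, PROPDMGEXP) and damage_amount(CROPDMG, CROPDMGEXP) respectively.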

Add the newly calculated variables PROPDAMAGE_AMOUNT and CROPDAMAGE_AMOUNT to obtain the total damage value for each recorded event:

question2$TOTAL_DAMAGE_VALUE <- question2$PROPDAMAGE_AMOUNT+question2$CROPDAMAGE_AMOUNT

Select only the columns EVTYPE and TOTAL_DAMAGE_VALUE:

question2 <- select(question2, c(EVTYPE,TOTAL_DAMAGE_VALUE))

Group the data using the EVTYPE variable:

group_question2 <- question2 %>% group_by(EVTYPE)

Summarize the data, storing the results in the question2_answer data frame, and then delete the auxiliary data frames:

question2_answer <- summarize(group_question2, sum(TOTAL_DAMAGE_VALUE)) %>% arrange(desc(`sum(TOTAL_DAMAGE_VALUE)`))

rm(group_question2)
rm(question2)

Results

This chapter presents the answers to both questions with the help of plots.

Question 1 - Most harmful events

library(ggplot2)

question1_answer_plot <- question1_answer[1:20,]

question1_answer_plot <- rbind(
  cbind(select(question1_answer_plot, c(EVTYPE,"COUNT" = `sum(FATALITIES)`)),"CATEGORY" = "FATALITIES"),
  cbind(select(question1_answer_plot, c(EVTYPE,"COUNT" = `sum(INJURIES)`)),"CATEGORY" = "INJURIES")
)


question1_answer_plot %>%
  ggplot(aes(fill=CATEGORY, y= COUNT, x=reorder(EVTYPE,-COUNT))) +
  geom_bar(position="dodge", stat="identity") +
  theme(axis.text.x = element_text(angle = 45,hjust = 1)) +
  labs( x = "WEATHER EVENT") +
  labs( title = "Question 1 - Most harmful events")

Question 2 - Events with the greatest economic consequences

options("scipen"=100)

question2_answer_plot <- question2_answer[1:20,]

question2_answer_plot$`sum(TOTAL_DAMAGE_VALUE)` <- question2_answer_plot$`sum(TOTAL_DAMAGE_VALUE)` /1000000

question2_answer_plot %>%
  ggplot(aes(y= `sum(TOTAL_DAMAGE_VALUE)`, x=reorder(EVTYPE,-`sum(TOTAL_DAMAGE_VALUE)`))) +
  geom_bar(position="dodge", stat="identity") +
  theme(axis.text.x = element_text(angle = 45,hjust = 1)) +
  labs( x = "WEATHER EVENT") +
  labs( y = "US Dollars (Millions)") +
  labs( title = "Question 2 - Events with the greatest economic consequences")

Severe weather events, study of harmfulness and economic damage

Emilio

2022-10-29

Synopsis