The aim of this analysis is to determine which weather event is most harmful to the population health and to the economy. Firstly, the storm dataset is loaded into R and the important variables are subsetted. Then, the PROPDMG (property damage) and CROPDMG (crop damage) variables are scaled accordingly with its respective exponent PROPDMGEXP and CROPDMGEXP respectively.
Aggregate measures of harmfulness to population health and to the economy are created, i.e. HARM and ECONCONS. Both of this measures are sum up by the EVTYPE (event type). Finally, barplots of the top 10 most harmful weather events to both population health and the economy are plotted.
#Load the R packages needed
suppressMessages(library(dplyr));suppressMessages(library(plyr))
suppressWarnings(library(ggplot2))
setwd("C:/Users/User/Desktop")
df1 <- read.csv("stormdata.bz2")
#Subset variables useful to our analysis
df <- df1 %>% select(EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)
df <- as.data.frame(df)
str(df %>% select(FATALITIES))
## 'data.frame': 902297 obs. of 1 variable:
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
str(df %>% select(INJURIES))
## 'data.frame': 902297 obs. of 1 variable:
## $ INJURIES: num 15 0 2 2 2 6 1 0 14 0 ...
Both FATALITIES and INJURIES are numeric variables, transformations or imputations are not necessary.
#Take a look at the PROPDMGEXP variable
unique.propexp <- unique(df %>% select(PROPDMGEXP))
unique.propexp
## PROPDMGEXP
## 1 K
## 11 M
## 54
## 187564 B
## 187584 m
## 188780 +
## 189004 0
## 192527 5
## 198496 6
## 198689 ?
## 199104 4
## 199528 2
## 200331 3
## 209285 h
## 213319 7
## 216476 H
## 229327 -
## 233829 1
## 234228 8
#Create a numerical vector that represents this exponent data,
#where K refers to multiplier of 1e3,
#M refers to 1e6,
#B refers to 1e9,
#H refers to 1e2,
#blanks or 0 refer to multiplier of 1
#5 refers to multiplier of 1e5 etc.
#"+","-", "?" will be ignored and considered as multiplier of 0
numeric.propexp <- c(1e3, 1e6, 1, 1e9, 1e6, 0, 1, 1e5, 1e6, 0, 1e4, 1e2, 1e3, 1e2, 1e7, 1e2, 0, 1, 1e8)
L <- length(unique.propexp)
#Now we scale the original PROPDMG variable by its respective PROPDMGEXP
for(i in 1:L){
x <- unique.propexp[i]
u1 <- df[,"PROPDMGEXP"] %in% x
df[u1, "PROPDMG"] <- df[u1, "PROPDMG"]*numeric.propexp[i]
}
#We can take a look at the scaled PROPDMG
head(df %>% select(PROPDMG, PROPDMGEXP))
## PROPDMG PROPDMGEXP
## 1 25.0 K
## 2 2.5 K
## 3 25.0 K
## 4 2.5 K
## 5 2.5 K
## 6 2.5 K
#Take a look at the CROPDMGEXP variable
unique.cropexp <- unique(df %>% select(CROPDMGEXP))
unique.cropexp
## CROPDMGEXP
## 1
## 187566 M
## 187571 K
## 187584 m
## 188633 B
## 192467 ?
## 192758 0
## 195667 k
## 221151 2
#Replace these exponent data numerically,
#where K refers to multiplier of 1e3,
#M refers to 1e6,
#B refers to 1e9,
#blanks or 0 refer to multiplier of 1
#2 refers to multiplier of 1e2 etc.
#"?" will be ignored and considered as multiplier of 0
numeric.cropexp <- c(1, 1e6, 1e3, 1e6, 1e9, 0, 1, 1e3, 1e2)
L <- length(unique.cropexp)
#now we scale the original PROPDMG variable by its respective PROPDMGEXP
for(i in 1:L){
x <- unique.cropexp[i]
u1 <- df[,"CROPDMGEXP"] %in% x
df[u1, "CROPDMG"] <- df[u1, "CROPDMG"]*numeric.cropexp[i]
}
#Create a new varaible called HARM as an aggregate measure of harmfulness to population health
#Aggregate both FATALITIES and INJURIES as a measure of harmfulness to population health by allocating weights of 1.5 and 1 to FATALITIES and INJURIES respectively.
df <- df %>% mutate(HARM = FATALITIES*1.5 + INJURIES*1.0)
df <- as.data.frame(df)
To measure harmfulness to population health, we can allocate higher weight to FATALITIES and a lower weight to INJURIES, i.e. 1.5 and 1.0 respectively, and sum them up.
#Create a new variable called ECONCONS to measure the economic consequences of an event, where ECONCONS will be the sum of both PROPDMG and CROPDMG
df <- df %>% mutate(ECONCONS = PROPDMG + CROPDMG)
df <- as.data.frame(df)
sumharm <- tapply(X = df[,"HARM"], INDEX = df[, "EVTYPE"], FUN = sum)
sdf.harm <- data.frame(EVTYPE = names(sumharm), sharm = sumharm)
sumecon <- tapply(X = df[,"ECONCONS"], INDEX = df[, "EVTYPE"], FUN = sum)
sdf.econ <- data.frame(EVTYPE = names(sumecon), secon = sumecon)
head(sdf.harm)
## EVTYPE sharm
## HIGH SURF ADVISORY HIGH SURF ADVISORY 0
## COASTAL FLOOD COASTAL FLOOD 0
## FLASH FLOOD FLASH FLOOD 0
## LIGHTNING LIGHTNING 0
## TSTM WIND TSTM WIND 0
## TSTM WIND (G45) TSTM WIND (G45) 0
head(sdf.econ)
## EVTYPE secon
## HIGH SURF ADVISORY HIGH SURF ADVISORY 200
## COASTAL FLOOD COASTAL FLOOD 0
## FLASH FLOOD FLASH FLOOD 50
## LIGHTNING LIGHTNING 0
## TSTM WIND TSTM WIND 108
## TSTM WIND (G45) TSTM WIND (G45) 8
sdf1 <- as.data.frame((sdf.harm %>% arrange(-sharm))[1:10,])
g1 <- ggplot(sdf1, aes(x= EVTYPE, weight = sharm)) + geom_bar() + xlab("Event Type") + ylab("Measure of harmfulness to population health") + ggtitle("Top 10 events that are most harmful to population health") + theme(axis.text.x = element_text(size=7, angle=25))
g1
We can observe that Tornado is significantly most harmful to population health than most other events.
sdf2 <- as.data.frame((sdf.econ %>% arrange(-secon))[1:10,])
g2 <- ggplot(sdf2, aes(x=EVTYPE, weight = secon)) + geom_bar() + xlab("Event Type") + ylab("Measure of economic damage") + ggtitle("Top 10 events that has the greatest economic consequences") + theme(axis.text.x = element_text(size=7, angle=25))
g2
We can observe that Tornado has the greatest economic consequences follow by Excessive Heat (such as heat waves), then TSTM Wind (Thunderstorm Wind).
Hence, the event that are most harmful to population health and has the greatest economic consequences is Tornado.