Finding out the most harmful and costly event in the United States

Finding out which type of disaster event is most harmful or costly is always the top priorities to mitigate consequences of crisis of every countries. In this analysis, R was used to investigate disaster event accross the United States based on data in National Weather Service (https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2). Below are the result:

_Type of disaster that cause most injuries in each single event: Heat wave

_Type of disaster that cause most fatalities in each single event: Tornado with Thunderstorm wind and Hail

_Type of disaster that cause most aggregated injuries and injuries: Tornadoes

_Type of disaster that cause most economic damage in each single event: Heavy rain/severe weather

_Type of disaster that cause most aggregated economic damage: Flood

#install.packages("ggrepel")
library(ggplot2)
library(plotly)
#install.packages("devtools")
library(devtools)
#devtools::install_github('slowkow/ggrepel')
library(ggrepel)
library(dplyr)
library(xtable)
library(knitr)
library(gridExtra)
library(magrittr)

Overall data summaries: variables contain in the data and some example of the disaster events:

temp <- tempfile()
download.file(url = "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", temp, method = "curl", mode = "wb")
data <- read.csv(temp)

Data processing for question 1

Data related to public health are number of fatalities and injuries of each event. These columns would be extracted out, together of the type of disaster of the event.

healthData <- data[, c("EVTYPE", "FATALITIES", "INJURIES")]

Number of fatalities and injuries for single events are summed up to generate total number of fatalities and injuries for each type of event. To normalized the data, we generate percent of fatalities and injuries of each type of event using the formula:

_ Percent of injuries of event X= 100 x Total number of injuries for event X/Total number of injuries for all types of event.

_ Percent of fatalities of event X= 100 x Total number of fatalities for event X/Total number of fatalities for all types of event.

Average number of injuries and fatalities are also calculated:

_ Average number of injuries of event X= Total number of injuries for event X/Number of event X.

_ Average number of fatalities of event X = Total number of fatalities for event X/Number of event X.

#Sum all injuries and fatalities within each category
tbl_df(healthData) %>% group_by(EVTYPE) %>% summarise(sum(FATALITIES), sum(INJURIES)) -> healthData2
colnames(healthData2) <- c("type", "fatalities", "injuries")

#Percent injuries and fatalities of each category to total injuries and fatalities.
tbl_df(healthData) %>% group_by(EVTYPE) %>% 
        summarise(percentFatalities = 100*sum(FATALITIES)/sum(healthData2$fatalities), 
                  percentInjuries = 100*sum(INJURIES)/sum(healthData2$injuries),
                  averageFatalities = mean(FATALITIES),
                  averageInjuries = mean(INJURIES)) %>%
        arrange(desc(percentFatalities), desc(percentInjuries)) -> healthData3
arrange(healthData3, desc(averageFatalities), desc(averageInjuries) ) -> healthData4
arrange(healthData3, desc(averageInjuries), desc(averageFatalities) ) -> healthData5
colnames(healthData3)[1] <- "type"
colnames(healthData4)[1] <- "type"
colnames(healthData5)[1] <- "type"
ranks10percent <- healthData3[1:10,]
ranks10percent[,1] <- tolower(as.character(ranks10percent[,1][[1]]))

ranks10average1 <- healthData4[1:5,]
ranks10average1[,1] <- tolower(as.character(ranks10average1[,1][[1]]))
ranks10average2 <- healthData5[1:5,]
ranks10average2[,1] <- tolower(as.character(ranks10average2[,1][[1]]))
ranks10average <- cbind(rbind(ranks10average1, ranks10average2), 
                        c(rep("Top 5 highest fatalities",5), rep("Top 5 highest injuries",5)) )
colnames(ranks10average)[6] <- "Top_5"

Data processing for question 2

Variables related to this questions are: type of the disease, crop damage, crop damage expression, prop damage, prop damage expression. Crop damage expression and prop damage expression was translated:

1.“B”, “b” -> 10^9

2.“M”, “m” -> 10^6

3.“K”, “k” -> 10^3

Expression that is not “B”, “b”, “M”, “m”, “K”, “k” will be excluded.

damage <- data[,c("EVTYPE","CROPDMG", "CROPDMGEXP", "PROPDMG", "PROPDMGEXP")]
unique(damage[,3])
unique(damage[,5])
unit <- cbind.data.frame(c("B","b", "M", "m", "K", "k"), c(10^9, 10^9, 10^6, 10^6, 10^3, 10^3))
unit[,1] <- as.character(unit[,1])

Crop damage expression and prop damage expression was combined with crop damage and prop damage to create new variable crop value and prop value, respectively.

For example:

_ Crop damage = 2

_ Crop damage expression = M

_ Crop damage value = 2*10^6

cropdamage <- damage[damage[,3] %in% c("B", "b", "k", "K", "m", "M"),c(1,2,3)]
cropdamageValue <- sapply(cropdamage[,3], function(x){unit[(unit[,1] == x), 2]}) * cropdamage[,2]
cropdamage2 <- cbind(cropdamage, cropdamageValue)
totalcropdamage <- sum(cropdamageValue)
tbl_df(cropdamage2) %>% group_by(EVTYPE) %>% 
                        summarise("total"   = sum(cropdamageValue),
                                  "percent" = 100*sum(cropdamageValue)/totalcropdamage,
                                  "number of disaster" = length(EVTYPE),
                                  "average" = mean(cropdamageValue)) %>%
                        arrange(desc(percent)) -> cropdamageP 
arrange(cropdamageP,desc(average)) -> cropdamageA
cropsum <- cbind(rbind(cropdamageP[1:5,], cropdamageA[1:5,]), 
                 factor(c(rep("Sorted by percentage",5), 
                          rep("Sorted by average",5))) )

propdamage      <- damage[damage[,5] %in% c("B", "b", "k", "K", "m", "M"),c(1,4,5)]
propdamageValue <- sapply(propdamage[,3], function(x){unit[(unit[,1] == x), 2]}) * propdamage[,2]
propdamage2     <- cbind(propdamage, propdamageValue)
totalpropdamage <- sum(propdamageValue)
tbl_df(propdamage2) %>% group_by(EVTYPE) %>% 
        summarise("total"   = sum(propdamageValue),
                  "percent" = 100*sum(propdamageValue)/totalpropdamage,
                  "number of disaster" = length(EVTYPE),
                  "average" = mean(propdamageValue)) %>%
        arrange(desc(percent)) -> propdamageP 
arrange(propdamageP,desc(average)) -> propdamageA
propsum <- cbind(rbind(propdamageP[1:5,], propdamageA[1:5,]), 
                 factor(c(rep("Sorted by percentage",5), 
                          rep("Sorted by average",5))) )

propdamage3 <- propdamage2[, c(1,4)]
colnames(propdamage3) <- c("type", "value")
cropdamage3 <- cropdamage2[, c(1,4)]
colnames(cropdamage3) <- c("type", "value")
totalDamage <- tbl_df(rbind(propdamage3, cropdamage3))
allValue <- sum(totalDamage$value)
group_by(totalDamage, type) %>% summarise("percent" = 100*sum(value)/allValue,
                                         "average" = mean(value),
                                         "number of disaster" = length(type)) %>% 
        arrange(desc(percent)) -> totalDamageP
arrange(totalDamageP, desc(average)) -> totalDamageA
damagesum <- cbind(rbind(totalDamageP[1:5,], totalDamageA[1:5,]), 
                 factor(c(rep("Sorted by percentage",5), 
                          rep("Sorted by average",5))) )

Question 1: Across United States, which type of event is the most harmful to the public health

There are two variables related to public health in the data set: fatalities and injuries. We did not clumped fatalities and injuries together as they are clearly not equal: an event that cause death should be more harmful than an event that cause an injury. Therefore, there would be two ranking systems, one for fatalities and one for injuries.

The most harmful disaster should wreck a lot of havoc in a single event. That is, the average number of fatalities and injuries in that type of disaster should be the highest. On the other hand, the most harmful disaster may not be the most dangerous in its own characteristic, but because of its frequency. A single event in this type of disaster may not cause many fatalities or injuries, but because it happens so often, the aggregate number fatalities and injuries is very high. Therefore, each event will also be ranked based on the aggregated or average number of fatalities or injuries.

In short, there would be four ranking:

Percents fatalities compare to total fatalities
Percents injuries compare to total injuries
Average fatalities
Average injuries

Scatterplot of percentage of fatalities versus injuries show positive correlation between fatalities and injuries as expected. Ten most fatal disasters are labelled and it is clearly that tornado was the event that cause the most fatalities and injuries.

In contrast, scatterplot of average fatalities versus average injuries show negative correlation. Events that have high average fatalities would have low average injuries and vice versa. Tornadoes with thunderstorm wind and hail has highest average number of fatalities while heat wave has highest average number of injuries.

Figure 1:

PH1 <- ggplot(data = healthData3) +
        aes(x = percentInjuries, y = percentFatalities) + 
        geom_point(alpha = 0.3) +
        ggtitle("Percent of fatalities and injuries\n") +
        labs(x ="Percents injuries of disaster \ncompare to total injuries", 
             y = "Percents fatalities of disaster \ncompare to total fatalities") + 
        geom_text_repel(data = ranks10percent, 
                  aes(x = percentInjuries, y = percentFatalities, label = type), size = 3)
PH1g <- ggplotGrob(PH1)
PH2 <- ggplot(data = healthData4) +
        aes(x = averageInjuries, y = averageFatalities) + 
        geom_point(alpha = 0.3) +
        ggtitle("Average number of fatalities and injuries\n for each type of disaster") +
        labs(x ="Average number of injuries", 
             y ="Average number of fatalities") + 
        geom_text_repel(data = ranks10average, 
                  aes(x = averageInjuries, y = averageFatalities, label = type, colour = Top_5), 
                  size = 3) +
        theme(legend.justification=c(1,1), legend.position=c(1,0.8), legend.title=element_blank())
#PH2g <- ggplotGrob(PH2)
#PH1g$heights <- PH2g$heights
grid.arrange(PH1, PH2, ncol = 2)

PHtable <- cbind(1:5, head(arrange(healthData3[, c(1,2)], desc(percentFatalities)), 5)[,1],
            head(arrange(healthData3[, c(1,3)], desc(percentInjuries)), 5)[,1],
            head(arrange(healthData3[, c(1,4)], desc(averageFatalities)), 5)[,1],
            head(arrange(healthData3[, c(1,5)], desc(averageInjuries)), 5)[,1])
colnames(PHtable) <- c("Rank", "Top 5 percent fatalities", "Top 5 percent injuries", "Top 5 average fatalities", "Top 5 average injuries")
kable(PHtable)

Rank	Top 5 percent fatalities	Top 5 percent injuries	Top 5 average fatalities	Top 5 average injuries
1	TORNADO	TORNADO	TORNADOES, TSTM WIND, HAIL	Heat Wave
2	EXCESSIVE HEAT	TSTM WIND	COLD AND SNOW	TROPICAL STORM GORDON
3	FLASH FLOOD	FLOOD	TROPICAL STORM GORDON	WILD FIRES
4	HEAT	EXCESSIVE HEAT	RECORD/EXCESSIVE HEAT	THUNDERSTORMW
5	LIGHTNING	LIGHTNING	EXTREME HEAT	HIGH WIND AND SEAS

Question 2: Across United States, which types of events have the greatest economic consequences?

Unlike the case of fatalities and injuries, one dollar of crop damage can be assumed to be equal one dollar of prop damage. Thus, cropdamage and propdamage were merged together and summed up to create new variable called “total”. Figure 2 show total damage vs average damage. It is cleared that Heavy rain/severe weather cause the most damage per single event (2.5 billion) while flood cause the most aggregated damage 31016 event cause 150 billion damage.

all <- merge(cropdamageP[,c(1,2,4)], propdamageP[,c(1,2,4)], by = "EVTYPE", all = TRUE)
all[is.na(all)] <- 0
mutate(all, "total" = total.x +total.y, "number" = all[,3] + all[,5]) %>% mutate("average" = total/number) %>% 
        select(c(1, 6,7,8)) %>% rename(type = EVTYPE) -> all2
arrange(all2, desc(total))[1:5,] %>% mutate("sort_by" = rep("total", 5)) -> allTotal
arrange(all2, desc(average))[1:5,] %>% mutate("sort_by" = rep("average", 5)) -> allAverage
sort <- rbind(allTotal, allAverage)
Dplot <- ggplot(data = all2, aes(x = total, y = average)) + 
        geom_point(alpha = 0.5) +
        geom_text_repel(data = sort, aes(x = total, y = average, label = type, colour = sort_by),
                        size = 3) +
        ggtitle("Average damage and total damage in USD for different type of disaster") + 
        ylab("Average damage (USD) per event of a disaster type") +
        xlab("Total damage (USD) of all event of a disaster type")
print(Dplot)

damageTable <- cbind.data.frame(1:5, as.character(allTotal[,1]), as.character(allAverage[,1]))
colnames(sort) <- c("Disaster", "Total number of USD damage", "Number of event", "Average number of USD damage", "sort by")
kable(sort[,c(1:4)])

Disaster	Total number of USD damage	Number of event	Average number of USD damage
FLOOD	150319678250	31016	4846520.4
HURRICANE/TYPHOON	71913712800	103	698191386.4
TORNADO	57352113590	61420	933769.4
STORM SURGE	43323541000	182	238041434.1
HAIL	18758221170	174271	107638.2
HEAVY RAIN/SEVERE WEATHER	2500000000	1	2500000000.0
TORNADOES, TSTM WIND, HAIL	1602500000	2	801250000.0
HURRICANE/TYPHOON	71913712800	103	698191386.4
HURRICANE OPAL	3191846000	11	290167818.2
STORM SURGE	43323541000	182	238041434.1

Finding out the most harmful and costly event in the United States

NguyenDuy

19 November 2016

Data processing for question 1

Data processing for question 2

Question 1: Across United States, which type of event is the most harmful to the public health

Question 2: Across United States, which types of events have the greatest economic consequences?