Finding out which type of disaster event is most harmful or costly is always the top priorities to mitigate consequences of crisis of every countries. In this analysis, R was used to investigate disaster event accross the United States based on data in National Weather Service (https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2). Below are the result:
_Type of disaster that cause most injuries in each single event: Heat wave
_Type of disaster that cause most fatalities in each single event: Tornado with Thunderstorm wind and Hail
_Type of disaster that cause most aggregated injuries and injuries: Tornadoes
_Type of disaster that cause most economic damage in each single event: Heavy rain/severe weather
_Type of disaster that cause most aggregated economic damage: Flood
#install.packages("ggrepel")
library(ggplot2)
library(plotly)
#install.packages("devtools")
library(devtools)
#devtools::install_github('slowkow/ggrepel')
library(ggrepel)
library(dplyr)
library(xtable)
library(knitr)
library(gridExtra)
library(magrittr)
Overall data summaries: variables contain in the data and some example of the disaster events:
temp <- tempfile()
download.file(url = "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", temp, method = "curl", mode = "wb")
data <- read.csv(temp)
Data related to public health are number of fatalities and injuries of each event. These columns would be extracted out, together of the type of disaster of the event.
healthData <- data[, c("EVTYPE", "FATALITIES", "INJURIES")]
Number of fatalities and injuries for single events are summed up to generate total number of fatalities and injuries for each type of event. To normalized the data, we generate percent of fatalities and injuries of each type of event using the formula:
_ Percent of injuries of event X= 100 x Total number of injuries for event X/Total number of injuries for all types of event.
_ Percent of fatalities of event X= 100 x Total number of fatalities for event X/Total number of fatalities for all types of event.
Average number of injuries and fatalities are also calculated:
_ Average number of injuries of event X= Total number of injuries for event X/Number of event X.
_ Average number of fatalities of event X = Total number of fatalities for event X/Number of event X.
#Sum all injuries and fatalities within each category
tbl_df(healthData) %>% group_by(EVTYPE) %>% summarise(sum(FATALITIES), sum(INJURIES)) -> healthData2
colnames(healthData2) <- c("type", "fatalities", "injuries")
#Percent injuries and fatalities of each category to total injuries and fatalities.
tbl_df(healthData) %>% group_by(EVTYPE) %>%
summarise(percentFatalities = 100*sum(FATALITIES)/sum(healthData2$fatalities),
percentInjuries = 100*sum(INJURIES)/sum(healthData2$injuries),
averageFatalities = mean(FATALITIES),
averageInjuries = mean(INJURIES)) %>%
arrange(desc(percentFatalities), desc(percentInjuries)) -> healthData3
arrange(healthData3, desc(averageFatalities), desc(averageInjuries) ) -> healthData4
arrange(healthData3, desc(averageInjuries), desc(averageFatalities) ) -> healthData5
colnames(healthData3)[1] <- "type"
colnames(healthData4)[1] <- "type"
colnames(healthData5)[1] <- "type"
ranks10percent <- healthData3[1:10,]
ranks10percent[,1] <- tolower(as.character(ranks10percent[,1][[1]]))
ranks10average1 <- healthData4[1:5,]
ranks10average1[,1] <- tolower(as.character(ranks10average1[,1][[1]]))
ranks10average2 <- healthData5[1:5,]
ranks10average2[,1] <- tolower(as.character(ranks10average2[,1][[1]]))
ranks10average <- cbind(rbind(ranks10average1, ranks10average2),
c(rep("Top 5 highest fatalities",5), rep("Top 5 highest injuries",5)) )
colnames(ranks10average)[6] <- "Top_5"
Variables related to this questions are: type of the disease, crop damage, crop damage expression, prop damage, prop damage expression. Crop damage expression and prop damage expression was translated:
1.“B”, “b” -> 10^9
2.“M”, “m” -> 10^6
3.“K”, “k” -> 10^3
Expression that is not “B”, “b”, “M”, “m”, “K”, “k” will be excluded.
damage <- data[,c("EVTYPE","CROPDMG", "CROPDMGEXP", "PROPDMG", "PROPDMGEXP")]
unique(damage[,3])
unique(damage[,5])
unit <- cbind.data.frame(c("B","b", "M", "m", "K", "k"), c(10^9, 10^9, 10^6, 10^6, 10^3, 10^3))
unit[,1] <- as.character(unit[,1])
Crop damage expression and prop damage expression was combined with crop damage and prop damage to create new variable crop value and prop value, respectively.
For example:
_ Crop damage = 2
_ Crop damage expression = M
_ Crop damage value = 2*10^6
cropdamage <- damage[damage[,3] %in% c("B", "b", "k", "K", "m", "M"),c(1,2,3)]
cropdamageValue <- sapply(cropdamage[,3], function(x){unit[(unit[,1] == x), 2]}) * cropdamage[,2]
cropdamage2 <- cbind(cropdamage, cropdamageValue)
totalcropdamage <- sum(cropdamageValue)
tbl_df(cropdamage2) %>% group_by(EVTYPE) %>%
summarise("total" = sum(cropdamageValue),
"percent" = 100*sum(cropdamageValue)/totalcropdamage,
"number of disaster" = length(EVTYPE),
"average" = mean(cropdamageValue)) %>%
arrange(desc(percent)) -> cropdamageP
arrange(cropdamageP,desc(average)) -> cropdamageA
cropsum <- cbind(rbind(cropdamageP[1:5,], cropdamageA[1:5,]),
factor(c(rep("Sorted by percentage",5),
rep("Sorted by average",5))) )
propdamage <- damage[damage[,5] %in% c("B", "b", "k", "K", "m", "M"),c(1,4,5)]
propdamageValue <- sapply(propdamage[,3], function(x){unit[(unit[,1] == x), 2]}) * propdamage[,2]
propdamage2 <- cbind(propdamage, propdamageValue)
totalpropdamage <- sum(propdamageValue)
tbl_df(propdamage2) %>% group_by(EVTYPE) %>%
summarise("total" = sum(propdamageValue),
"percent" = 100*sum(propdamageValue)/totalpropdamage,
"number of disaster" = length(EVTYPE),
"average" = mean(propdamageValue)) %>%
arrange(desc(percent)) -> propdamageP
arrange(propdamageP,desc(average)) -> propdamageA
propsum <- cbind(rbind(propdamageP[1:5,], propdamageA[1:5,]),
factor(c(rep("Sorted by percentage",5),
rep("Sorted by average",5))) )
propdamage3 <- propdamage2[, c(1,4)]
colnames(propdamage3) <- c("type", "value")
cropdamage3 <- cropdamage2[, c(1,4)]
colnames(cropdamage3) <- c("type", "value")
totalDamage <- tbl_df(rbind(propdamage3, cropdamage3))
allValue <- sum(totalDamage$value)
group_by(totalDamage, type) %>% summarise("percent" = 100*sum(value)/allValue,
"average" = mean(value),
"number of disaster" = length(type)) %>%
arrange(desc(percent)) -> totalDamageP
arrange(totalDamageP, desc(average)) -> totalDamageA
damagesum <- cbind(rbind(totalDamageP[1:5,], totalDamageA[1:5,]),
factor(c(rep("Sorted by percentage",5),
rep("Sorted by average",5))) )
There are two variables related to public health in the data set: fatalities and injuries. We did not clumped fatalities and injuries together as they are clearly not equal: an event that cause death should be more harmful than an event that cause an injury. Therefore, there would be two ranking systems, one for fatalities and one for injuries.
The most harmful disaster should wreck a lot of havoc in a single event. That is, the average number of fatalities and injuries in that type of disaster should be the highest. On the other hand, the most harmful disaster may not be the most dangerous in its own characteristic, but because of its frequency. A single event in this type of disaster may not cause many fatalities or injuries, but because it happens so often, the aggregate number fatalities and injuries is very high. Therefore, each event will also be ranked based on the aggregated or average number of fatalities or injuries.
In short, there would be four ranking:
Percents fatalities compare to total fatalities
Percents injuries compare to total injuries
Average fatalities
Average injuries
Scatterplot of percentage of fatalities versus injuries show positive correlation between fatalities and injuries as expected. Ten most fatal disasters are labelled and it is clearly that tornado was the event that cause the most fatalities and injuries.
In contrast, scatterplot of average fatalities versus average injuries show negative correlation. Events that have high average fatalities would have low average injuries and vice versa. Tornadoes with thunderstorm wind and hail has highest average number of fatalities while heat wave has highest average number of injuries.
Figure 1:
PH1 <- ggplot(data = healthData3) +
aes(x = percentInjuries, y = percentFatalities) +
geom_point(alpha = 0.3) +
ggtitle("Percent of fatalities and injuries\n") +
labs(x ="Percents injuries of disaster \ncompare to total injuries",
y = "Percents fatalities of disaster \ncompare to total fatalities") +
geom_text_repel(data = ranks10percent,
aes(x = percentInjuries, y = percentFatalities, label = type), size = 3)
PH1g <- ggplotGrob(PH1)
PH2 <- ggplot(data = healthData4) +
aes(x = averageInjuries, y = averageFatalities) +
geom_point(alpha = 0.3) +
ggtitle("Average number of fatalities and injuries\n for each type of disaster") +
labs(x ="Average number of injuries",
y ="Average number of fatalities") +
geom_text_repel(data = ranks10average,
aes(x = averageInjuries, y = averageFatalities, label = type, colour = Top_5),
size = 3) +
theme(legend.justification=c(1,1), legend.position=c(1,0.8), legend.title=element_blank())
#PH2g <- ggplotGrob(PH2)
#PH1g$heights <- PH2g$heights
grid.arrange(PH1, PH2, ncol = 2)
PHtable <- cbind(1:5, head(arrange(healthData3[, c(1,2)], desc(percentFatalities)), 5)[,1],
head(arrange(healthData3[, c(1,3)], desc(percentInjuries)), 5)[,1],
head(arrange(healthData3[, c(1,4)], desc(averageFatalities)), 5)[,1],
head(arrange(healthData3[, c(1,5)], desc(averageInjuries)), 5)[,1])
colnames(PHtable) <- c("Rank", "Top 5 percent fatalities", "Top 5 percent injuries", "Top 5 average fatalities", "Top 5 average injuries")
kable(PHtable)
Rank | Top 5 percent fatalities | Top 5 percent injuries | Top 5 average fatalities | Top 5 average injuries |
---|---|---|---|---|
1 | TORNADO | TORNADO | TORNADOES, TSTM WIND, HAIL | Heat Wave |
2 | EXCESSIVE HEAT | TSTM WIND | COLD AND SNOW | TROPICAL STORM GORDON |
3 | FLASH FLOOD | FLOOD | TROPICAL STORM GORDON | WILD FIRES |
4 | HEAT | EXCESSIVE HEAT | RECORD/EXCESSIVE HEAT | THUNDERSTORMW |
5 | LIGHTNING | LIGHTNING | EXTREME HEAT | HIGH WIND AND SEAS |
Unlike the case of fatalities and injuries, one dollar of crop damage can be assumed to be equal one dollar of prop damage. Thus, cropdamage and propdamage were merged together and summed up to create new variable called “total”. Figure 2 show total damage vs average damage. It is cleared that Heavy rain/severe weather cause the most damage per single event (2.5 billion) while flood cause the most aggregated damage 31016 event cause 150 billion damage.
all <- merge(cropdamageP[,c(1,2,4)], propdamageP[,c(1,2,4)], by = "EVTYPE", all = TRUE)
all[is.na(all)] <- 0
mutate(all, "total" = total.x +total.y, "number" = all[,3] + all[,5]) %>% mutate("average" = total/number) %>%
select(c(1, 6,7,8)) %>% rename(type = EVTYPE) -> all2
arrange(all2, desc(total))[1:5,] %>% mutate("sort_by" = rep("total", 5)) -> allTotal
arrange(all2, desc(average))[1:5,] %>% mutate("sort_by" = rep("average", 5)) -> allAverage
sort <- rbind(allTotal, allAverage)
Dplot <- ggplot(data = all2, aes(x = total, y = average)) +
geom_point(alpha = 0.5) +
geom_text_repel(data = sort, aes(x = total, y = average, label = type, colour = sort_by),
size = 3) +
ggtitle("Average damage and total damage in USD for different type of disaster") +
ylab("Average damage (USD) per event of a disaster type") +
xlab("Total damage (USD) of all event of a disaster type")
print(Dplot)
damageTable <- cbind.data.frame(1:5, as.character(allTotal[,1]), as.character(allAverage[,1]))
colnames(sort) <- c("Disaster", "Total number of USD damage", "Number of event", "Average number of USD damage", "sort by")
kable(sort[,c(1:4)])
Disaster | Total number of USD damage | Number of event | Average number of USD damage |
---|---|---|---|
FLOOD | 150319678250 | 31016 | 4846520.4 |
HURRICANE/TYPHOON | 71913712800 | 103 | 698191386.4 |
TORNADO | 57352113590 | 61420 | 933769.4 |
STORM SURGE | 43323541000 | 182 | 238041434.1 |
HAIL | 18758221170 | 174271 | 107638.2 |
HEAVY RAIN/SEVERE WEATHER | 2500000000 | 1 | 2500000000.0 |
TORNADOES, TSTM WIND, HAIL | 1602500000 | 2 | 801250000.0 |
HURRICANE/TYPHOON | 71913712800 | 103 | 698191386.4 |
HURRICANE OPAL | 3191846000 | 11 | 290167818.2 |
STORM SURGE | 43323541000 | 182 | 238041434.1 |