Synopsis
This analysis identifies the weather event types that are most harmful to population health and those with the greatest economic consequences. For each outcome, the top 15 event types are ranked. Tornado is by far the most harmful to population health (96,979 fatalities and injuries combined). The economic burden is led by flood, whose cost is far higher than that of any other event type.
Data processing
The dataset is downloaded directly from the internet. The second step unzips the file with a program installed on the computer. The code for this second step may need to be adapted: find the path of the '.exe' file used to open zip files and split it into its components, as in the following command: executable <- file.path("C:", "Program Files", "WinRAR", "WinRAR.exe")
setwd("G:/My Drive/From Dropbox/Training/Data science specialization/Assignments/Project 2")
# Load the packages used below (%>%, arrange, select, ggplot)
library(dplyr)
library(ggplot2)
# Load the data
fileUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
# Save the archive in binary mode; the name must match the one passed to WinRAR below
download.file(fileUrl, file.path("./", "repdata_data_StormData.csv.bz2"), mode = "wb")
# Unzip using WinRAR utility on Windows 11:
executable <- file.path("C:", "Program Files", "WinRAR", "WinRAR.exe")
cmd <- paste(paste0("\"", executable, "\""), "x",
paste0("\"", file.path("./", "repdata_data_StormData.csv.bz2"), "\""))
system(cmd)
## [1] 1
data <- read.csv("repdata_data_StormData.csv", sep = ",")
#Create a folder for figures
if (!dir.exists("figure")) dir.create("figure")
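Base R can usually read a bzip2-compressed CSV directly, which would avoid the external unzip step. The following is a minimal sketch on a tiny stand-in file; the file contents and the name storm_sample are made up for illustration and are not part of the original analysis.

```r
# Write a tiny bzip2-compressed CSV (hypothetical stand-in for the real download)
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "wt")
writeLines(c("EVTYPE,FATALITIES,INJURIES",
             "TORNADO,1,2",
             "FLOOD,0,3"), con)
close(con)

# read.csv() opens the path through file(), which detects the compression
storm_sample <- read.csv(tmp, stringsAsFactors = FALSE)
print(storm_sample)
```

Under the same assumption, the downloaded archive could be passed to read.csv() in place of the extracted CSV file.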
To find the most harmful events with respect to population health, we compute the sum of fatalities and injuries for each record and store it in the variable pophealth. Then, we total this variable by event type and sort the totals in descending order to select the top 15.
#Sum of fatalities and injuries
data$pophealth <- data$INJURIES + data$FATALITIES
pophealth <- (tapply(data$pophealth, data$EVTYPE, sum))
str(pophealth)
## num [1:985(1d)] 0 0 0 0 0 0 0 0 0 0 ...
## - attr(*, "dimnames")=List of 1
## ..$ : chr [1:985] " HIGH SURF ADVISORY" " COASTAL FLOOD" " FLASH FLOOD" " LIGHTNING" ...
dim(pophealth)
## [1] 985
pophealth <-as.data.frame(pophealth)
pophealth$evtype <- rownames(pophealth)
rownames(pophealth) <- NULL
pophealth <- pophealth %>%
arrange(desc(pophealth))
pophealth <- pophealth %>% select(evtype, pophealth)
top_15_ph <- pophealth[1:15, ]
print(top_15_ph)
## evtype pophealth
## 1 TORNADO 96979
## 2 EXCESSIVE HEAT 8428
## 3 TSTM WIND 7461
## 4 FLOOD 7259
## 5 LIGHTNING 6046
## 6 HEAT 3037
## 7 FLASH FLOOD 2755
## 8 ICE STORM 2064
## 9 THUNDERSTORM WIND 1621
## 10 WINTER STORM 1527
## 11 HIGH WIND 1385
## 12 HAIL 1376
## 13 HURRICANE/TYPHOON 1339
## 14 HEAVY SNOW 1148
## 15 WILDFIRE 986
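The table lists TSTM WIND and THUNDERSTORM WIND as separate events, and the str() output shows labels with leading spaces. The sketch below shows one possible way to normalize labels before totalling, using a dplyr pipeline on toy rows. The values, the column name evtype, and the rule that rewrites a leading "TSTM" are assumptions for illustration, not the method used in this report.

```r
library(dplyr)

# Toy rows (hypothetical values) mimicking label variants seen in the real data
toy <- data.frame(
  EVTYPE     = c(" TORNADO", "TORNADO", "TSTM WIND", "THUNDERSTORM WIND", "FLOOD"),
  FATALITIES = c(1, 2, 0, 1, 1),
  INJURIES   = c(4, 10, 3, 2, 0),
  stringsAsFactors = FALSE
)

top_ph_sketch <- toy %>%
  mutate(
    evtype = toupper(trimws(EVTYPE)),              # drop padding, unify case
    evtype = sub("^TSTM", "THUNDERSTORM", evtype)  # assumed abbreviation rule
  ) %>%
  group_by(evtype) %>%
  summarise(pophealth = sum(FATALITIES + INJURIES), .groups = "drop") %>%
  arrange(desc(pophealth)) %>%
  slice_head(n = 2)                                # keep the two largest totals

print(top_ph_sketch)
```

Whether such merging is appropriate depends on how the event types are meant to be interpreted; the ranking above is reported on the raw labels.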
The pie chart below shows the distribution of fatalities and injuries across the top 15 events. Value labels were omitted to avoid overlap. We can clearly see that TORNADO is the most harmful event.
#Define palette colors for the top 15
custom_colors <- c("TORNADO"="#669E40", "EXCESSIVE HEAT"="#FFD966", "TSTM WIND"="#FFD966", "FLOOD"="#FFD966", "LIGHTNING"="#FFD966", "HEAT"="#FFD966", "FLASH FLOOD"="#FFD966", "ICE STORM"="#FFD966", "THUNDERSTORM WIND"="#FD8D77", "WINTER STORM"="#FD8D77", "HIGH WIND"="#FD8D77", "HAIL"="#FD8D77", "HURRICANE/TYPHOON"="#FD8D77", "HEAVY SNOW"="#EB0335", "WILDFIRE"="#EB0335")
ggplot(top_15_ph, aes(x = "", y = pophealth, fill = evtype)) +
geom_bar(stat = "identity", width = 1) + # Create a bar chart
coord_polar(theta = "y") + # Convert to a pie chart
theme_void() + # Remove unnecessary grid elements
labs(fill = "Events", title = "Distribution of the top 15 events with respect to the impact on the population health") +
scale_fill_manual(values = custom_colors) # Add custom colors
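The named color vector has to match the event labels exactly, since scale_fill_manual() looks colors up by name. One way to keep the two in sync is to build the vector from the ranked labels rather than typing each name by hand. This is a sketch only: the names evtypes, tier_colors, and custom_colors_sketch are hypothetical, and the labels are copied from the printed top 15.

```r
# Event labels in rank order, copied from the printed top 15
evtypes <- c("TORNADO", "EXCESSIVE HEAT", "TSTM WIND", "FLOOD", "LIGHTNING",
             "HEAT", "FLASH FLOOD", "ICE STORM", "THUNDERSTORM WIND",
             "WINTER STORM", "HIGH WIND", "HAIL", "HURRICANE/TYPHOON",
             "HEAVY SNOW", "WILDFIRE")

# One color per rank tier: rank 1, ranks 2-8, ranks 9-13, ranks 14-15
tier_colors <- c("#669E40", rep("#FFD966", 7), rep("#FD8D77", 5), rep("#EB0335", 2))

# Attach the labels as names so every event is guaranteed a color
custom_colors_sketch <- setNames(tier_colors, evtypes)
print(custom_colors_sketch)
```

In the analysis itself, top_15_ph$evtype could take the place of the hand-copied evtypes vector.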
To find the events with the greatest economic consequences, measured here by property damage (PROPDMG), we first convert the damage amounts to a common scale using the magnitude codes in PROPDMGEXP, then compute the total cost of each event type. We sort the totals in descending order, select the top 15, and display them in a bar chart. Flood is by far the event with the greatest economic consequences. The next most important group consists of hurricane/typhoon, tornado, and storm surge. Together with flood, they represent around 75% of the economic burden and could be prioritized for resource allocation.
data$costmag <- data$PROPDMGEXP
data$costmag[data$costmag=="K"] <- 1000
data$costmag[data$costmag=="M"] <- 1000000
data$costmag[data$costmag=="B"] <- 1000000000
data$costmag <- as.numeric(data$costmag)
## Warning: NAs introduced by coercion
data$costmag[is.na(data$costmag) ] <- 0
data$ecoconsq <- data$PROPDMG*data$costmag
eco_dmg <- tapply(data$ecoconsq, data$EVTYPE, sum)
eco_dmg <- as.data.frame(eco_dmg)
eco_dmg$evtype <- rownames(eco_dmg)
rownames(eco_dmg) <- NULL
eco_dmg <- eco_dmg %>%
select(evtype, eco_dmg)
eco_dmg <- eco_dmg %>%
arrange(desc(eco_dmg))
top_15_cost <- eco_dmg[1:15, ]
top_15_cost <- top_15_cost %>%
arrange(desc(eco_dmg))
print(top_15_cost)
## evtype eco_dmg
## 1 FLOOD 144657709800
## 2 HURRICANE/TYPHOON 69305840000
## 3 TORNADO 56925660991
## 4 STORM SURGE 43323536000
## 5 FLASH FLOOD 16140812087
## 6 HAIL 15727366870
## 7 HURRICANE 11868319010
## 8 TROPICAL STORM 7703890550
## 9 WINTER STORM 6688497250
## 10 HIGH WIND 5270046260
## 11 RIVER FLOOD 5118945500
## 12 WILDFIRE 4765114000
## 13 STORM SURGE/TIDE 4641188000
## 14 TSTM WIND 4484928440
## 15 ICE STORM 3944927810
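The conversion of magnitude codes to multipliers can also be written as a lookup table. The sketch below uses toy rows with made-up values. Treating lowercase codes like their uppercase counterparts is an assumption of this sketch, not something the code above does; unknown or empty codes get a multiplier of 0, as in the code above.

```r
# Hypothetical lookup table: magnitude code -> multiplier
magnitude <- c(K = 1e3, M = 1e6, B = 1e9)

# Toy rows (hypothetical values)
toy <- data.frame(
  PROPDMG    = c(25, 2.5, 1, 10, 3),
  PROPDMGEXP = c("K", "M", "B", "m", ""),
  stringsAsFactors = FALSE
)

# Look up each code; codes missing from the table come back as NA
mult <- unname(magnitude[toupper(toy$PROPDMGEXP)])
mult[is.na(mult)] <- 0

toy$cost <- toy$PROPDMG * mult
print(toy)
```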
ggplot(top_15_cost, aes(x = reorder(evtype, -eco_dmg), y = eco_dmg))+
geom_bar(stat = "identity")+
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))+
labs(x = "Events", y = "Economic consequences", title = "Ranking of the top 15 events with the greatest economic consequences")
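The weight of the leading events can be checked with a cumulative share. The sketch below uses the totals printed in the table above and computes shares within the top 15 only; the share relative to all events would be somewhat lower, because the denominator would also include every event outside the top 15. The names top_15_totals and cum_share are hypothetical.

```r
# Totals copied from the printed top 15, in rank order
top_15_totals <- c(144657709800, 69305840000, 56925660991, 43323536000,
                   16140812087, 15727366870, 11868319010, 7703890550,
                   6688497250, 5270046260, 5118945500, 4765114000,
                   4641188000, 4484928440, 3944927810)

# Cumulative share of the top-15 total reached after each rank
cum_share <- cumsum(top_15_totals) / sum(top_15_totals)

# Share held by the four leading events (flood, hurricane/typhoon,
# tornado, storm surge) within the top 15
print(round(cum_share[4], 3))
```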