This project is made as part of the Data visualization course at CEU. The goal of the project is to make useful data visualization of selected data using obtained during the course visualization techniques in R. The data set selected is Volcano Eruptions from Tidy Tuesday Github Repo. The project is published on Rpubs.
I used mainly data table and ggplot2 packages. In order to set the theme for all the code only once, I used package ggthemr from Mikata Project. Dust theme was selected.
#load packages
library(data.table)
library(maps)
library(knitr)
library(skimr)
library(lubridate)
library(ggplot2)
#install package ggthemr: devtools::install_github('Mikata-Project/ggthemr')
library(ggthemr)
library(ggrepel)
library(ggmap)
library(GGally)
library(gganimate)
library(animation)
library(emojifont)
library(ggiraph)
library(treemap)
#set theme
ggthemr('dust', layout = 'scientific', spacing = 1, type = 'inner')
There are two data tables used for visualization of Volcano Eruptions:
#load data
volcano <- fread('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-05-12/volcano.csv')
eruptions <- fread('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-05-12/eruptions.csv')
I cleaned data and remained only columns that are useful for our visualization. Moreover, I joined volcano and eruptions tables for better analysis. I formatted column for Volcanic Explosivity Index (vei) from numerical to characteristic variable describing status of vei (from non-explosive to very large) for easier understanding of meaning of the column.
#format dates
eruptions$start_date <- as.Date(with(eruptions,paste(start_year, start_month, start_day, sep = "/")))
eruptions$end_date <- as.Date(with(eruptions,paste(end_year, end_month, end_day, sep = "/")))
eruptions$duration_days <- as.numeric(eruptions$end_date - eruptions$start_date)
#join datasets
volcano_eruptions = merge(volcano, eruptions, by=c("volcano_number", "volcano_name"))
rm(volcano)
rm(eruptions)
#keep needed columns
keeps <- c("volcano_number", "volcano_name", "primary_volcano_type", "major_rock_1", "population_within_5_km", "eruption_number","eruption_category","vei","latitude.x","longitude.x","start_date", "end_date", "start_month", "end_month", "start_year", "end_year", "duration_days", "country","region","elevation", "evidence_category")
volcano_eruptions <- volcano_eruptions[, keeps, with=FALSE]
rm(keeps)
#format vei(Volcanic Explosivity Index) and create new variable vei_status
volcano_eruptions <- volcano_eruptions[,vei_status:= cut(vei, c(0,1,2,3,4,5,6,7, Inf), labels = c("non explosive", "small", "moderate", "moderate", "large", "large", "very large", "very large"))]
#delete rows with missing values
volcano_eruptions <- na.omit(volcano_eruptions)
Firstly, I showed on the map where volcanoes erupted. The size of circles defines the vei status of volcanoes. The bigger the size of circle, the worse its vei status. From the map it can be noticed that volcanoes mainly erupted in Indonesia and Japan.
bp <- get_stamenmap(
c(
left = min(volcano_eruptions$longitude.x) * 0.995,
right = max(volcano_eruptions$longitude.x) * 1.001,
bottom = min(volcano_eruptions$latitude.x) * 0.999,
top = max(volcano_eruptions$latitude.x)) * 1.001,
maptype = 'terrain',
zoom = 4)
ggmap(bp) +
geom_point(
data = volcano_eruptions,
aes(longitude.x, latitude.x, size = vei^5, color=vei_status),
alpha = 0.9,
) + theme_void() +
ggtitle("Volcanoes Eruptions") +
guides(size=FALSE) +
theme(legend.position = 'bottom')
Secondly, in order to get the general idea of the data I firstly checked distributions of numerical variables, including duration days of volcanoes eruptions, vei status, elevation and population within 5 km. As it can be seen from the plot below there is no strong correlation between numerical variables.
#distributions and correlations
cols <- c('duration_days', 'vei', 'elevation', 'population_within_5_km')
pairwise <- volcano_eruptions[ , cols, with=FALSE]
ggpairs(pairwise, ggtitle('Main dstributions and pairwise correlations'))
Then, it was interesting to check in which countries there is the largest number of volcanoes and which volcanoes are erupted more than others. Thus, I built bar charts for top countries and top volcanoes. Please see results below.
#countries with most volcanoes
mostvolcanoes_country <- data.frame(table(volcano_eruptions$country))
#mostvolcanoes_country[order(-mostvolcanoes_country$Freq),] #top 8 (>100): Indonesia, Japan, Russia, United States, Italy, Nicaragua, Chile, France
topcountries <- volcano_eruptions[volcano_eruptions$country=="Indonesia"|volcano_eruptions$country=="Japan"| volcano_eruptions$country=="Russia"|volcano_eruptions$country=="US"|volcano_eruptions$country=="Italy"|volcano_eruptions$country=="Nicaragua"|volcano_eruptions$country=="Chile"|volcano_eruptions$country=="France",]
#countries with most number of volcanoes
ggplot(topcountries, aes(x = country, na.rm = TRUE)) +
geom_bar(stat = "count") +
ggtitle('Top countries') +
labs(x="Major Countries")
#most volcanoes
mostvolcanoes <- data.frame(table(volcano_eruptions$volcano_name))
#mostvolcanoes[order(-mostvolcanoes$Freq),] #top 8: Etna, Fournaise, Piton de la, Asosan, Asamayama, Klyuchevskoy, Gamalama, Bezymianny
topvolcanoes <- volcano_eruptions[volcano_eruptions$volcano_name=="Etna"|volcano_eruptions$volcano_name=="Fournaise"| volcano_eruptions$volcano_name=="Piton de la"|volcano_eruptions$volcano_name=="Asosan"|volcano_eruptions$volcano_name=="Asamayama"|volcano_eruptions$volcano_name=="Klyuchevskoy"|volcano_eruptions$volcano_name=="Bezymianny"|volcano_eruptions$volcano_name=="Gamalama",]
#volcanoes with most eruptions
ggplot(topvolcanoes, aes(x = volcano_name, na.rm = TRUE)) +
geom_bar(stat = "count") +
ggtitle('Top volcanoes') +
labs(x="Volcano name")
Using animation I decided to show how annual volcano eruption duration days changed among top volcanoes during the period from 1600 to 2020. From the animated line chart below, it can be seen that duration days increased for top volcanoes during 400 years.
volcano_annual <- topvolcanoes [, .(annual_total_days = sum(duration_days)) , by=.(volcano_name, start_year)]
p<-ggplot(volcano_annual, aes(x = start_year, y = annual_total_days)) +
geom_line(aes(color=volcano_name)) +
labs( x='Year', y='Duration days', title = 'Annual volcano eruption days among top volcanoes') +
ylim(0,500) + xlim(1600,2020) +
theme(legend.title = element_blank())
p + transition_reveal(start_year)
Also using animation I wanted to show number of eruptions by duration days and mean duration days for top countries. Please see the result below:
volcano_average <- topcountries[, .(mean_duration_days = mean(duration_days, na.rm = T), number_of_eruptions = .N, duration_days), by = country]
ggplot(volcano_average, aes(x = duration_days)) +
geom_histogram() +
scale_x_continuous(limits = c(0,1000), breaks = seq(0,1000, by = 100)) +
scale_y_continuous(limits = c(0, 100), breaks = seq(0, 100, by = 10)) +
transition_states(country) +
labs (title = "{closest_state}", subtitle = paste('Number of eruptions: {nrow(subset(volcano_average, country == closest_state))}', 'Mean duration days: {volcano_average[country == closest_state, round(mean(duration_days))]}', sep = " \n "))
I also wanted to check if there is any association between elevation of volcano and duration days of its eruption conditioning on vei status. As it can be seen from the scatter plot below there is no visible pattern between those two variables.
#relationship btw duration days (less than 500 days) and volcano elevation conditioning on vei status
volcano_eruptions2 <- volcano_eruptions[ duration_days < 500]
ggplot(volcano_eruptions2, aes(x = duration_days, y = elevation, color = vei_status)) +
geom_point(size = 0.8, alpha = 0.5, color = 'black') +
geom_smooth(method = lm, se =FALSE) +
ggtitle ('Association between days of eruption and elevation of volcano') +
labs(x="Duration days", y = "Elevation")
Then, I decided to check distribution of volcanoes eruption days among starting months. As it can be seen from the boxplots below all distributions have long right tail and mean below 100 days regardless of the month.
#boxplot for start month
ggplot(volcano_eruptions, aes(factor(start_month),duration_days)) +
geom_boxplot() +
coord_cartesian(ylim = c(0, 500)) +
labs(x = 'Start Month')
Finally, I used treemap package to show dominant rock types for all volcanoes. From the treemap below we see that Andesite/Basaltic Andesite and Basalt/Picro-Basalt are the most dominant rock types.
#treemap for dominant rock types
commomrocks<-data.frame(table(volcano_eruptions$major_rock_1))
commomrocks$Freq<-as.numeric(commomrocks$Freq)
#str(commomrocks)
treemap(dtf = commomrocks,index = c("Var1"),vSize = "Freq",vColor = "Freq",
palette="RdYlBu",type="manual",
border.col ="white", title="Dominant Rock types")