A Violin Plot is mainly used to present the distribution of the data and its probability density. This chart combines a Box Plot and a Density Plot and shows the data’s distribution shape. Box Plots are limited in their ability to display many aspects of the data, as their visual simplicity tends to hide significant details about how values in the data are distributed.
A violin plot is more informative than a plain box plot. While a box plot shows only summary statistics such as the median, the quartiles (and hence the interquartile range), and outliers, the violin plot shows the full shape of the distribution. The difference is especially useful when the data distribution is multimodal (with more than one peak). In this case, a violin plot reveals the presence of the different peaks, their positions, and their relative amplitudes.
Like box plots, violin plots are used to present the comparison of a variable distribution (or sample distribution) across different “categories”.
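To make the multimodality point concrete, here is a minimal sketch in plain Python (standard library only). The two samples and the helper names `box_summary` and `text_histogram` are illustrative assumptions, not part of the handout's dataset. The samples share the same minimum, quartiles, median, and maximum, so a box plot would draw them identically, yet a simple count per score shows that one has a single peak and the other has two.

```python
from collections import Counter
from statistics import quantiles

# Illustrative samples (assumed for this sketch, not taken from the survey data).
unimodal = [0, 2, 3, 4, 5, 5, 5, 6, 7, 8, 10]
bimodal = [0, 3, 3, 3, 3, 5, 7, 7, 7, 7, 10]


def box_summary(sample):
    """Return (min, Q1, median, Q3, max): the values a box plot is drawn from."""
    q1, median, q3 = quantiles(sample, n=4)
    return (min(sample), q1, median, q3, max(sample))


def text_histogram(sample):
    """Return one line per integer score, with one '#' per observation."""
    counts = Counter(sample)
    return [
        f"{score:>2} | {'#' * counts[score]}"
        for score in range(min(sample), max(sample) + 1)
    ]


for name, sample in (("unimodal", unimodal), ("bimodal", bimodal)):
    print(name, box_summary(sample))
    print("\n".join(text_histogram(sample)))
```

Both calls to `box_summary` return the same five numbers, while the text histograms differ: this is the information a violin plot adds on top of the box.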
About the dataset: Number and proportion of individuals by age and sex who indicated their level of satisfaction with specific areas of their lives on a scale from 0 to 10. Source: Statistics Canada, General Social Survey, 2016, Canadians at Work and Home.
A boxplot can summarize the distribution of a numeric variable for several groups. However, summarizing the data can cause information loss, which negatively impacts the results of exploratory analyses. If we consider the boxplot below, we can easily conclude that Feelings of Safety has a higher value than the others. However, we cannot see the actual underlying distribution of data in each group or their number of observations.
# Libraries
library(tidyverse)
library(hrbrthemes)
library(viridis)
extrafont::loadfonts()
setwd("/Users/zahrashakeri/Library/Mobile Documents/com~apple~CloudDocs/Teaching/2021/Data 624/Lectures/Lecture 02/Handout#2")
# Load the data [the dataset is available on D2L]
data <- read.csv("LifeSatisfaction.csv")
# Plot
data %>%
  ggplot(aes(x=Satisfaction, y=Value, fill=Satisfaction)) +
  geom_boxplot() +
  scale_fill_viridis(discrete = TRUE) +
  theme_ipsum() +
  theme(
    legend.position="none",
    plot.title = element_text(size=11)
  ) +
  coord_flip() +
  ggtitle("") +
  xlab("")
If the number of data points you are working with is not too large, adding jitter on top of your boxplot can make the graphic more insightful, as it adds information about the frequency of each value and the distribution of the data.
# Plot
data %>%
  ggplot(aes(x=Satisfaction, y=Value, fill=Satisfaction)) +
  geom_boxplot() +
  geom_jitter(color="gray", size=0.7, alpha=0.5) +
  scale_fill_viridis(discrete = TRUE) +
  theme_ipsum() +
  theme(
    legend.position="none",
    plot.title = element_text(size=11)
  ) +
  coord_flip() +
  ggtitle("") +
  xlab("")
Here, because the dataset is large, the jittered points overlap too heavily to show clearly how the values associated with each factor are distributed. To address this, we introduce the violin chart.
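As a rough illustration of what jittering does, the sketch below (plain Python, standard library only) shifts each point by a small random offset so that identical positions no longer coincide. The helper name `jitter`, the `width` of 0.2, and the fixed seed are assumptions made for this example and are not the defaults of any plotting library.

```python
import random


def jitter(positions, width=0.2, seed=0):
    """Shift each position by a uniform random offset in [-width, width]."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [p + rng.uniform(-width, width) for p in positions]


# Five observations that all sit at category position 1 would be drawn
# on top of each other; after jittering they occupy distinct positions.
positions = [1, 1, 1, 1, 1]
jittered = jitter(positions)
print(jittered)
```

With only a handful of points the offsets separate them cleanly. With thousands of points inside the same narrow band, the offsets can no longer prevent overlap, which is the limitation described above.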
With a large sample size, jitter is no longer an option because the dots overlap and make the figure uninterpretable. An alternative is the violin plot, which describes the distribution of the data for each group:
# Plot
data %>%
  ggplot(aes(x=Satisfaction, y=Value, fill=Satisfaction)) +
  geom_violin() +
  geom_boxplot(width=0.35, color="white") +
  geom_jitter(color="black", size=0.2, alpha=0.25) +
  scale_fill_viridis(discrete = TRUE) +
  theme_ipsum() +
  theme(
    legend.position="none",
    plot.title = element_text(size=11)
  ) +
  coord_flip() +
  ggtitle("") +
  xlab("")
Here it is obvious that the groups have different distributions. Violin plots are a powerful way to display information, and they are probably under-utilized compared to boxplots.
# Load dataset from your computer
setwd("/Users/zahrashakeri/Library/Mobile Documents/com~apple~CloudDocs/Teaching/2021/Data 624/Lectures/Lecture 02/Handout#2")
# Load the data
data <- read.csv("LifeSatisfaction.csv")
data$Satisfaction <- as.factor(data$Satisfaction)
# Plot
data %>%
  ggplot(aes(x=Satisfaction, y=Value, fill=Satisfaction, color=Satisfaction)) +
  # To change the transparency, add alpha=0.5; alpha can range from 0 to 1.
  geom_violin(width=1, size=0.2) +
  scale_fill_viridis(discrete=TRUE) +
  scale_color_viridis(discrete=TRUE) +
  # scale_fill_brewer(palette = "OrRd") +
  theme_ipsum() +
  theme(
    legend.position="none"
  ) +
  coord_flip() +
  xlab("") +
  ylab("")
A Ridgeline plot (or Joyplot) shows the distribution of numeric values for several groups of data (i.e. categorical data). Distribution can be visualized using histograms or density plots aligned to the same horizontal scale and presented with a slight overlap.
Are there any differences between the provinces in the country?
library(ggridges)
data %>%
  ggplot(aes(x=Value, y=GEO, fill=GEO)) +
  geom_density_ridges(alpha=0.6, bandwidth=4) +
  # Only one fill scale can be active; a second one replaces the first.
  scale_fill_viridis(discrete=TRUE) +
  scale_color_viridis(discrete=TRUE) +
  # scale_fill_brewer(palette = "YlOrRd") +  # alternative; this palette has at most 9 colours
  theme_ipsum() +
  theme(
    legend.position="none",
    panel.spacing = unit(0.1, "lines"),
    strip.text.x = element_text(size = 8)
  ) +
  xlab("") +
  ylab("")
Ridgeline visualization is more effective when the number of categories to visualize is medium to high (Why?). Because the groups overlap each other, the plot uses space more efficiently. If you have fewer than ~3 groups in your data, another distribution plot is probably a better choice (e.g. area plot).
This visualization works well when there is a clear pattern in the result, such as an obvious ranking of the groups. Otherwise, groups will tend to overlap each other, leading to a messy plot that provides no insight. In addition to density plots, it is possible to use histograms:
data %>%
  ggplot(aes(x=Value, y=GEO, fill=GEO)) +
  geom_density_ridges(alpha=0.6, stat="binline", bins=20) +
  scale_fill_brewer(palette = "BrBG") +
  theme_ipsum() +
  theme(
    legend.position="none",
    panel.spacing = unit(0.1, "lines"),
    strip.text.x = element_text(size = 8)
  ) +
  xlab("") +
  ylab("")
## Warning in RColorBrewer::brewer.pal(n, pal): n too large, allowed maximum for palette BrBG is 11
## Returning the palette you asked for with that many colors
It is also possible to colour by the numeric variable instead of the categorical one.
# after_stat(x) replaces the deprecated ..x.. notation (requires ggplot2 >= 3.3.0)
ggplot(data, aes(x=Value, y=GEO, fill = after_stat(x))) +
  geom_density_ridges_gradient(scale = 3, rel_min_height = 0.01) +
  scale_fill_viridis(option = "E") +  # Try A, B, C, D, and E for the option parameter
  labs(title = '') +
  theme_ipsum() +
  theme(
    legend.position="none",
    panel.spacing = unit(0.1, "lines"),
    strip.text.x = element_text(size = 8)
  )
## Picking joint bandwidth of 0.162
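The `bandwidth` argument and the "Picking joint bandwidth" message above both refer to the width of the kernel used to estimate each density curve. The sketch below is a hand-written Gaussian kernel density estimate in plain Python, meant only to show how the bandwidth changes the shape of the curve. It is not how ggridges computes its densities internally, and the `scores` sample, the evaluation grid, and the helper names are assumptions made for illustration.

```python
import math


def gaussian_kde(sample, bandwidth):
    """Return a function that estimates the density of `sample` at a point x."""
    norm = 1.0 / (len(sample) * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in sample
        )

    return density


def count_peaks(density, grid):
    """Count strict local maxima of the density evaluated on the grid."""
    ys = [density(x) for x in grid]
    return sum(1 for i in range(1, len(ys) - 1) if ys[i - 1] < ys[i] > ys[i + 1])


# Illustrative scores with two clusters, around 3 and around 8 (assumed data).
scores = [2.5, 3.0, 3.0, 3.5, 7.5, 8.0, 8.0, 8.5]
grid = [i / 10 for i in range(0, 111)]  # 0.0, 0.1, ..., 11.0

for bw in (0.5, 4.0):
    peaks = count_peaks(gaussian_kde(scores, bw), grid)
    print(f"bandwidth={bw}: {peaks} peak(s)")
```

On this sample the narrow bandwidth keeps both clusters visible as separate peaks, while the wide one smooths them into a single bump. A bandwidth that is large relative to a 0 to 10 scale can therefore hide multimodality in a ridgeline or violin plot.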
Seaborn's violinplot function always treats one of the variables as categorical and draws the data at ordinal positions (0, 1, … n) on the relevant axis, even when that variable has a numeric or date type.
library(reticulate)
use_python("/Users/zahrashakeri/opt/anaconda3/bin/python/")
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
#----------------------------------------------------------
sns.set(style="whitegrid")
#sns.set(rc={'figure.figsize':(15,10)})
df= pd.read_csv("LifeSatisfaction.csv")
df.head()
## REF_DATE GEO ... Satisfaction Value
## 0 2016 Canada ... Life as a whole 7.9
## 1 2016 Canada ... Standard of living 7.7
## 2 2016 Canada ... Health 7.4
## 3 2016 Canada ... Achievement in life 7.4
## 4 2016 Canada ... Personal relationships 7.9
##
## [5 rows x 6 columns]
plt.figure(figsize=(5,5))
sns.violinplot(x=df["Value"], color="yellow")
#To add more data to your visualization try this:
plt.figure(figsize=(12,10))
sns.violinplot(x="Value", y="GEO", data=df)
df= df[df['Sex']!='Both sexes']
plt.figure(figsize=(15,12))
plot=sns.violinplot(x="Satisfaction", y="Value", hue="Sex",data=df, palette="colorblind")
plt.setp(plot.get_xticklabels(), rotation=90)
## [None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None]
plot
df= df[df['Sex']!='Both sexes']
plt.figure(figsize=(10,13))
# scale="count" scales the width of each violin by its number of observations.
# In seaborn >= 0.13, scale is deprecated in favour of density_norm="count".
plot=sns.violinplot(x="Satisfaction", y="Value", hue="Sex",data=df, palette="pastel", split=True, scale="count")
plt.setp(plot.get_xticklabels(), rotation=90)
## [None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None]
plot
Draw the quartiles as horizontal lines instead of a mini-box.
df= df[df['Sex']!='Both sexes']
plt.figure(figsize=(10,13))
# To reduce the amount of smoothing, use a narrower bandwidth, for example bw=.2
# (in seaborn >= 0.13, bw is deprecated in favour of bw_method and bw_adjust).
plot=sns.violinplot(x="Satisfaction", y="Value", hue="Sex",data=df, palette="pastel", split=True, scale="count", inner="quartile")
plt.setp(plot.get_xticklabels(), rotation=90)
## [None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None]
plot
Please note: Feel free to use your own Python IDE, but in case you want to work with R Markdown, please check this before starting this handout: https://abndistro.com/post/2018/01/09/getting-started-with-python-in-r-markdown-using-the-reticulate-package/
A treemap displays data in the form of nested rectangles. The dimensions define the structure of the treemap, and the measures determine the size or colour of each rectangle. Treemaps are easy to read because both visual variables, the size and the shade of a rectangle, reflect the value of a measure.
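To show how a measure can be turned into nested rectangles, here is a minimal sketch of a "slice-and-dice" layout in plain Python. It is one simple treemap layout chosen for illustration and is not necessarily the algorithm Tableau uses. The `split` helper, the category labels, and the numbers are all hypothetical.

```python
def split(values, rect, vertical=True):
    """Split rect = (x, y, width, height) into slices whose areas are
    proportional to the given values."""
    x, y, w, h = rect
    total = sum(values.values())
    rects, offset = {}, 0.0
    for label, value in values.items():
        share = value / total
        if vertical:  # side-by-side slices along the x axis
            rects[label] = (x + offset * w, y, share * w, h)
        else:  # stacked slices along the y axis
            rects[label] = (x, y + offset * h, w, share * h)
        offset += share
    return rects


# Hypothetical measure per category (illustrative numbers only).
measure = {"A": 50, "B": 30, "C": 20}
top = split(measure, (0, 0, 100, 60))

# Nesting: subdivide the rectangle of "A" by a second dimension,
# switching direction so that the sub-rectangles stay readable.
nested = split({"A1": 3, "A2": 1}, top["A"], vertical=False)

for label, (rx, ry, rw, rh) in {**top, **nested}.items():
    print(label, (rx, ry, rw, rh), "area =", rw * rh)
```

Each rectangle's area is proportional to its value, and the sub-rectangles of "A" tile exactly the space allotted to "A". This is what lets a treemap encode both the hierarchy and the measure at once.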
Let’s start with a basic treemap:
As the Show Me tab indicates, we need at least one dimension and one or two measures to build a treemap. So we multi-select “Value” and “Satisfaction” by holding the Control key (Command key on Mac), then choose “treemaps” in Show Me. Tableau will generate a raw treemap automatically.
Compared with other pre-defined charts in Tableau, the treemap has the following advantages:
Meanwhile, there are many opposing voices:
You can implement several different types of maps for your geo-spatial analysis in Tableau. If you are new to maps or simply want to take advantage of the built-in mapping features that Tableau provides, you can create a simple point or filled (polygon) map similar to the examples below. After importing the dataset, we need to change the type of the GEO column from string to location:
Prerequisites: To build a simple map, your data source must contain location/geo-spatial data (location names, or latitude and longitude coordinates).
In Tableau, a geographic role associates each value in a field with a latitude and longitude value. When you assign the correct geographic role to a field, Tableau gives latitude and longitude values to each location by finding a match built into the installed geocoding database. This is how Tableau knows where to plot your locations on the map.
When you assign a geographic role to a field, such as State, Tableau creates a Latitude (generated) field and a Longitude (generated) field.
Follow the steps below to generate a geo-visual presentation of the dataset. We will have a more detailed lecture on this topic in our future lectures.
Select the area map from the Show Me panel. Why is the point-map not a good option for this dataset?
To make the comparisons clearer, we can add different levels of filters on the Sex and Satisfaction columns:
A crosstab chart in Tableau (also called a text table) is a type of visualization that shows the data in textual form. The chart is made up of one or more dimensions and one or more measures. This chart can also offer various calculations on the values of the measure field.
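The idea behind a crosstab can be sketched outside Tableau in a few lines of plain Python: two dimensions define the rows and columns, and a measure is aggregated within each cell. The records below reuse the dataset's column names (Satisfaction, Sex, Value), but the numbers are made up for illustration, and the `crosstab` and `mean` helpers are hypothetical.

```python
from collections import defaultdict

# Made-up records of (Satisfaction, Sex, Value); not the real survey values.
records = [
    ("Health", "Males", 6),
    ("Health", "Males", 8),
    ("Health", "Females", 7),
    ("Standard of living", "Males", 7),
    ("Standard of living", "Females", 8),
]


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def crosstab(records, aggregate):
    """Group the measure by (row, column) and aggregate each cell."""
    cells = defaultdict(list)
    for row, col, value in records:
        cells[(row, col)].append(value)
    return {key: aggregate(values) for key, values in cells.items()}


table = crosstab(records, mean)
rows = sorted({row for row, _ in table})
cols = sorted({col for _, col in table})

# Print the table: one line per row dimension, one column per column dimension.
print("".ljust(20) + "".join(col.rjust(10) for col in cols))
for row in rows:
    cells = "".join(f"{table.get((row, col), float('nan')):10.2f}" for col in cols)
    print(row.ljust(20) + cells)
```

Swapping `mean` for `sum`, `max`, or `len` changes the calculation applied to the measure, in the same spirit as choosing a different aggregation for the measure field in the text table.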
The resulting chart will look as shown below:
Suppose you want to share your findings with a larger team. Instead of having to guess which key findings your team is interested in and including them in a PowerPoint presentation, you can create a visual story in Tableau. This way, you can walk viewers through your data discovery process, and you have the option to interactively explore and present your data to answer any questions that come up during your presentation.
For the presentation, you will start with an overview:
You are presented with a blank workspace that reads, “Drag a sheet here.” This is where you will create your first story point.
If you think that blank stories look a lot like blank dashboards, that is because they do. Similar to a dashboard, you can drag worksheets over to present the stories.
From the Story pane on the left, drag one of your worksheets onto your view.
Add a caption by editing the text in the gray box above the worksheet.
To add more sheets, in the Story pane, select Blank.
In the left-hand pane, select Drag to add text and drag it onto your view.
Enter a description for your dashboard that emphasizes the key points of your story/findings. You can change the format of this textbox by right-clicking on the text.
Enjoy the Story Telling experience!