Data 624 - Advanced Exploration and Visualization in Health

Instructor: Zahra Shakeri, Winter 2021

Handout #2: Visual Exploration and Analysis

A Violin Plot is mainly used to present the distribution of the data and its probability density. This chart combines a Box Plot and a Density Plot and shows the data’s distribution shape. Box Plots are limited in their ability to display many aspects of the data, as their visual simplicity tends to hide significant details about how values in the data are distributed.

A violin plot is more informative than a plain box plot. While a box plot shows only summary statistics such as the median, the quartiles, and the interquartile range, a violin plot shows the full shape of the distribution. The difference is most useful when the distribution is multimodal (it has more than one peak). In that case, a violin plot reveals the presence of the different peaks, their positions, and their relative amplitudes.
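To make the multimodality point concrete, here is a minimal sketch in plain Python (standard library only). The two samples are made up for illustration and are not taken from the handout's dataset; `five_number_summary` and `count_peaks` are hypothetical helper names. Both samples share the same box-plot statistics, yet one has a single peak and the other has two.

```python
from collections import Counter
from statistics import quantiles

def five_number_summary(sample):
    """Return (min, Q1, median, Q3, max), the values a box plot draws."""
    q1, median, q3 = quantiles(sample, n=4, method="inclusive")
    return (min(sample), q1, median, q3, max(sample))

def count_peaks(sample, low=0, high=10):
    """Count local maxima in the frequency table of integer scores."""
    freq = Counter(sample)
    counts = [freq.get(score, 0) for score in range(low, high + 1)]
    padded = [0] + counts + [0]
    return sum(
        1
        for i in range(1, len(padded) - 1)
        if padded[i] > padded[i - 1] and padded[i] > padded[i + 1]
    )

# Made-up scores on a 0-10 scale (illustrative only)
unimodal = [2] + [3] * 2 + [4] * 4 + [5] * 7 + [6] * 4 + [7] * 2 + [8]
bimodal = [2] + [3] * 2 + [4] * 7 + [5] + [6] * 7 + [7] * 2 + [8]

print(five_number_summary(unimodal))
print(five_number_summary(bimodal))
print(count_peaks(unimodal), count_peaks(bimodal))
```

A box plot of either sample would look the same, whereas a violin plot of the second sample would show the dip between its two peaks.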

Like box plots, violin plots are used to present the comparison of a variable distribution (or sample distribution) across different “categories”.

About the dataset: Number and proportion of individuals, by age and sex, who indicated their level of satisfaction with specific areas of their lives on a scale from 0 to 10. Source: Statistics Canada, General Social Survey, 2016, Canadians at Work and Home.

Implementation in R

BoxPlot

A boxplot can summarize the distribution of a numeric variable for several groups. However, summarizing the data can cause information loss, which negatively impacts the results of exploratory analyses. If we consider the boxplot below, we can easily conclude that Feelings of Safety has a higher value than the others. However, we cannot see the actual underlying distribution of data in each group or their number of observations.

# Libraries
library(tidyverse)
library(hrbrthemes)
library(viridis)
extrafont::loadfonts()

setwd("/Users/zahrashakeri/Library/Mobile Documents/com~apple~CloudDocs/Teaching/2021/Data 624/Lectures/Lecture 02/Handout#2")

#load the data [The dataset is available on D2L]
data<- read.csv("LifeSatisfaction.csv")
# Plot
data %>%
  ggplot( aes(x=Satisfaction, y=Value, fill=Satisfaction)) +
  geom_boxplot() +
  scale_fill_viridis(discrete = TRUE) +
  theme_ipsum() +
  theme(
    legend.position="none",
    plot.title = element_text(size=11)
  ) +
  coord_flip() +
  ggtitle("") +
  xlab("")

If the number of data points is not too large, adding jitter on top of your boxplot can make the graphic more insightful, as it adds information about the frequency of each value and the distribution of the data.
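Jittering simply adds a small random offset to each point's position so that identical values no longer sit on top of each other. The sketch below illustrates the idea in plain Python; the `jitter` helper, its `width` parameter, and the sample positions are assumptions made for illustration and do not reproduce the exact defaults of `geom_jitter`.

```python
import random

def jitter(positions, width=0.2, seed=0):
    """Shift each position by a uniform random offset in [-width, +width]."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    return [p + rng.uniform(-width, width) for p in positions]

# Five observations that all fall in the same category (axis position 1)
category_positions = [1, 1, 1, 1, 1]
jittered = jitter(category_positions)

# Every point stays close to its category but no longer overlaps the others
print(jittered)
```

The offset is purely cosmetic: it spreads the points out for display and should not be read as part of the data.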

# Plot
data %>%
  ggplot( aes(x=Satisfaction, y=Value, fill=Satisfaction)) +
  geom_boxplot() +
  geom_jitter(color="gray", size=0.7, alpha=0.5) +
  scale_fill_viridis(discrete = TRUE) +
  theme_ipsum() +
  theme(
    legend.position="none",
    plot.title = element_text(size=11)
  ) +
  coord_flip() +
  ggtitle("") +
  xlab("")

Here, because the dataset is large, the jittered points overlap too much to reveal the distribution of the values within each group. To address this, we introduce the violin plot.

Violin Plot

If you have a large sample size, using jitter is no longer an option since dots will overlap, making the figure uninterpretable. An alternative is the violin plot, which describes the distribution of the data for each group:

# Plot
data %>%
  ggplot( aes(x=Satisfaction, y=Value, fill=Satisfaction)) +
  geom_violin()+
  geom_boxplot(width=0.35, color="white") +
  geom_jitter(color="black", size=0.2, alpha=0.25) +
  scale_fill_viridis(discrete = TRUE) +
  theme_ipsum() +
  theme(
    legend.position="none",
    plot.title = element_text(size=11)
  ) +
  coord_flip() +
  ggtitle("") +
  xlab("")

Here it is obvious that the groups have different distributions. Violin plots are a powerful way to display information, and they are probably under-utilized compared to boxplots.

# Load dataset from your computer
setwd("/Users/zahrashakeri/Library/Mobile Documents/com~apple~CloudDocs/Teaching/2021/Data 624/Lectures/Lecture 02/Handout#2")

#load the data
data<- read.csv("LifeSatisfaction.csv")
data$Satisfaction <-as.factor(data$Satisfaction)

# Plot
data %>%
  ggplot( aes(x=Satisfaction, y=Value, fill=Satisfaction, color=Satisfaction)) +
  geom_violin(width=1, size=0.2) + #To change the transparency use alpha=0.5. alpha can range from 0 to 1.
  scale_fill_viridis(discrete=TRUE) +
  scale_color_viridis(discrete=TRUE) +
  # scale_fill_brewer(palette = "OrRd")+
  theme_ipsum() +
  theme(
    legend.position="none"
  ) +
  coord_flip() +
  xlab("") +
  ylab("")
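
The outline of a violin is a kernel density estimate (KDE) of the values in each group, mirrored around the group's axis. The following is a minimal sketch of a Gaussian KDE in plain Python, not the implementation used by `geom_violin`; the function name `gaussian_kde`, the bandwidth of 0.5, and the sample scores are assumptions chosen for illustration.

```python
import math

def gaussian_kde(sample, bandwidth):
    """Return a function that estimates the density of `sample` at x."""
    norm = 1.0 / (len(sample) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in sample
        )

    return density

# Made-up satisfaction scores for one group (illustrative only)
scores = [6.8, 7.1, 7.4, 7.4, 7.7, 7.9, 7.9, 8.2]
density = gaussian_kde(scores, bandwidth=0.5)

# Evaluate the density on a grid; a violin draws +/- this width at each value
grid = [i / 10 for i in range(50, 101)]  # 5.0, 5.1, ..., 10.0
half_widths = [density(x) for x in grid]
widest = grid[half_widths.index(max(half_widths))]
print(f"violin is widest near {widest}")
```

A smaller bandwidth follows the data more closely and shows more bumps; a larger one smooths them away.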

RidgeLine

A Ridgeline plot (or Joyplot) shows the distribution of a numeric variable for several groups (the levels of a categorical variable). The distributions can be drawn as histograms or density plots, all aligned to the same horizontal scale and presented with a slight overlap.

Are there any differences between the provinces?

library(ggridges)

data %>%
  ggplot(aes(x=Value, y=GEO, fill=GEO)) +
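  geom_density_ridges(alpha=0.6, bandwidth=4) + # bandwidth is given in units of Value; leave it out to have one estimated from the data
  # scale_fill_viridis(discrete=TRUE) + # alternative palette; only one fill scale can be active
  scale_fill_brewer(palette = "YlOrRd")+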
  theme_ipsum() +
  theme(
    legend.position="none",
    panel.spacing = unit(0.1, "lines"),
    strip.text.x = element_text(size = 8)
  ) +
  xlab("") +
  ylab("")

Ridgeline visualization is more effective when the number of categories is medium to high (Why?). Because the groups overlap, the plot uses space more efficiently. If you have fewer than about 3 groups in your data, another type of distribution plot is probably a better choice (e.g. area plot).

This visualization works well when there is a clear pattern in the result, for example an obvious ranking among the groups. Otherwise, the groups tend to overlap each other, leading to a messy plot that provides little insight. In addition to density plots, it is also possible to use histograms:

data %>%
  ggplot( aes(x=Value, y=GEO, fill=GEO)) +
    geom_density_ridges(alpha=0.6, stat="binline", bins=20) +
    scale_fill_brewer(palette = "BrBG")+
    theme_ipsum() +
    theme(
      legend.position="none",
      panel.spacing = unit(0.1, "lines"),
      strip.text.x = element_text(size = 8)
    ) +
    xlab("") +
    ylab("")
## Warning in RColorBrewer::brewer.pal(n, pal): n too large, allowed maximum for palette BrBG is 11
## Returning the palette you asked for with that many colors

It is also possible to colour by the numeric variable instead of the categorical one.

ggplot(data, aes(x=Value, y=GEO, fill = ..x..)) + # newer ggplot2 versions prefer after_stat(x) over ..x..
  geom_density_ridges_gradient(scale = 3, rel_min_height = 0.01) +
  scale_fill_viridis(option = "E") + #Try A, B, C, D, and E for the option parameter
  labs(title = '') +
  theme_ipsum() +
    theme(
      legend.position="none",
      panel.spacing = unit(0.1, "lines"),
      strip.text.x = element_text(size = 8)
    )
## Picking joint bandwidth of 0.162

Violin Plot in Python

The seaborn violinplot function always treats one of the variables as categorical and draws the data at ordinal positions (0, 1, … n) on the relevant axis, even when that variable has a numeric or date type.
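
The idea of "ordinal positions" can be sketched in a few lines of plain Python. This is an illustration of the concept only, not seaborn's code: `ordinal_positions` is a hypothetical helper, and the ordering rule (sorted for numeric values, order of first appearance otherwise) is an assumption about the default behaviour.

```python
from numbers import Number

def ordinal_positions(values):
    """Assign each distinct category an integer position 0, 1, ..., n-1."""
    distinct = list(dict.fromkeys(values))  # unique values, first-seen order
    if all(isinstance(v, Number) for v in distinct):
        distinct.sort()  # assumption: numeric categories are placed in sorted order
    return {category: position for position, category in enumerate(distinct)}

# Numeric categories are spaced evenly, regardless of their numeric gaps
print(ordinal_positions([2016, 2008, 2013, 2008]))

# String categories keep the order in which they first appear
print(ordinal_positions(["Life as a whole", "Health", "Life as a whole"]))
```

The practical consequence is that the distance between two violins on the categorical axis carries no numeric meaning, even when the category labels are numbers.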

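# R chunk: tell reticulate which Python installation to use
library(reticulate)
use_python("/Users/zahrashakeri/opt/anaconda3/bin/python/")
# Python chunk: everything from here on is Python
import seaborn as sns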
import pandas as pd
import matplotlib.pyplot as plt

#----------------------------------------------------------
sns.set(style="whitegrid")

#sns.set(rc={'figure.figsize':(15,10)})

df= pd.read_csv("LifeSatisfaction.csv")
df.head()
##    REF_DATE     GEO  ...            Satisfaction Value
## 0      2016  Canada  ...         Life as a whole   7.9
## 1      2016  Canada  ...      Standard of living   7.7
## 2      2016  Canada  ...                  Health   7.4
## 3      2016  Canada  ...     Achievement in life   7.4
## 4      2016  Canada  ...  Personal relationships   7.9
## 
## [5 rows x 6 columns]
plt.figure(figsize=(5,5))

sns.violinplot(x=df["Value"], color="yellow")

#To add more data to your visualization try this:

plt.figure(figsize=(12,10))
sns.violinplot(x="Value", y="GEO", data=df)


df= df[df['Sex']!='Both sexes']
plt.figure(figsize=(15,12))
plot=sns.violinplot(x="Satisfaction", y="Value", hue="Sex",data=df, palette="colorblind")
plt.setp(plot.get_xticklabels(), rotation=90)
## [None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None]
plot


df= df[df['Sex']!='Both sexes']
plt.figure(figsize=(10,13))

plot=sns.violinplot(x="Satisfaction", y="Value", hue="Sex",data=df, palette="pastel", split=True, scale="count") # scale="count" scales the violin width by the number of observations in each bin; seaborn 0.13+ names this argument density_norm
plt.setp(plot.get_xticklabels(), rotation=90)
## [None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None]
plot

Draw the quartiles as horizontal lines instead of a mini-box.


df= df[df['Sex']!='Both sexes']
plt.figure(figsize=(10,13))
plot=sns.violinplot(x="Satisfaction", y="Value", hue="Sex",data=df, palette="pastel", split=True, scale="count",  inner="quartile") # To reduce the amount of smoothing, use a narrower bandwidth, for example bw=.2
plt.setp(plot.get_xticklabels(), rotation=90)
## [None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None]
plot

Please note: Feel free to use your own Python IDE, but in case you want to work with R Markdown, please check this before starting this handout: https://abndistro.com/post/2018/01/09/getting-started-with-python-in-r-markdown-using-the-reticulate-package/

Data Exploration in Tableau

TreeMap

A treemap displays data in the form of nested rectangles. The dimensions define the structure of the treemap, and measures represent the individual rectangle’s size or colour. The rectangles are easy to visualize as both visual variables of size and shade of the rectangle reflect the value of the measure.
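
To illustrate how a measure is turned into rectangle sizes, here is a minimal single-level "slice" layout in plain Python: the available width is divided in proportion to each value, so each rectangle's area is proportional to the measure. This is a simplified sketch, not the layout algorithm Tableau uses; `slice_layout` and the canvas size are hypothetical, and the values are the five `Value` entries visible in the `df.head()` output earlier in this handout.

```python
def slice_layout(values, width=100.0, height=60.0):
    """Split a width x height canvas into side-by-side rectangles.

    Returns a list of (label, x, y, rect_width, rect_height) tuples whose
    areas are proportional to the values.
    """
    total = sum(values.values())
    rectangles = []
    x = 0.0
    # Largest values first, so the major contributors are placed first
    for label, value in sorted(values.items(), key=lambda kv: kv[1], reverse=True):
        rect_width = width * value / total
        rectangles.append((label, x, 0.0, rect_width, height))
        x += rect_width
    return rectangles

# The five rows shown in the df.head() output
satisfaction = {
    "Life as a whole": 7.9,
    "Standard of living": 7.7,
    "Health": 7.4,
    "Achievement in life": 7.4,
    "Personal relationships": 7.9,
}

for label, x, y, w, h in slice_layout(satisfaction):
    print(f"{label:<24} x={x:6.2f} width={w:6.2f} area={w * h:8.2f}")
```

Because these values are so close to each other, the rectangles come out almost the same size, which previews the comparison problem discussed in the drawbacks below.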

Let’s start with a basic treemap:

  1. As a standard chart, we can click on Show Me and see the treemap.

As the Show Me tab indicates, we need at least one dimension and one or two measures to build a treemap. So we select both “Value” and “Satisfaction” while holding the Control key (Command key on Mac), then choose “treemaps” in Show Me. Tableau generates a raw treemap automatically.

  2. Drag Sex to Color. This adds more information to the chart, but you cannot clearly see which factors contribute more for each sex, so this is not an insightful visualization.
  3. We can also add the value of each factor to its corresponding leaf.

Compared with other pre-defined charts in Tableau, the treemap has the following advantages:

  • Displays hierarchical data types: This is the purpose the treemap was created for. For nested and hierarchical data, the best visualization usually has two layers (We’ll show this in our Lecture on Hierarchical Data).
  • Scalability (supports a large number of categories): If you need to compare a large number of elements and highlight contributing members, you can consider a treemap.
  • Higher utilization of space: Packed rectangles can cover almost the entire area.
  • Focus on highlighting major contributors: The treemap displays the main contributors of data with larger rectangles or a conspicuous colour. If you need to highlight these contributors, the treemap is an appropriate solution.

However, treemaps also have drawbacks:

  • Difficult to make accurate comparisons: Studies show that people are better at judging and comparing lengths and positions than at comparing the areas of rectangles. Another problem is that, in most cases, it is hard to define a baseline against which to compare the components under study. Bar charts, scatter plots, and line charts provide a more quantitative and standard method for comparisons.
  • No labels on the small components: Although a treemap can show many categories, if it contains too many sub-classes, the rectangles may become very small and hard to detect. As a result, Tableau cannot display all labels. This means you have to rely on Tableau's interactive features, such as tooltips or highlighting, to understand the visualized data.

Map

You can implement several different types of maps for your geo-spatial analysis in Tableau. If you are new to maps or simply want to take advantage of the built-in mapping features that Tableau provides, you can create a simple point or filled (polygon) map similar to the examples below. After importing the dataset, we need to change the type of the GEO column from string to a geographic role (location).

Prerequisites: To build a simple map, your data source must contain location/geo-spatial data (location names, or latitude and longitude coordinates).

In Tableau, a geographic role associates each value in a field with a latitude and longitude value. When you assign the correct geographic role to a field, Tableau gives latitude and longitude values to each location by finding a match built into the installed geocoding database. This is how Tableau knows where to plot your locations on the map.

When you assign a geographic role to a field, such as State, Tableau creates a Latitude (generated) field and a Longitude (generated) field.

Follow the steps below to generate a geo-visual presentation of the dataset. We will cover this topic in more detail in a future lecture.

  1. Select the area map from the Show Me panel. Why is the point-map not a good option for this dataset?

  2. To make the comparisons clearer, we can add filters on the Sex and Satisfaction columns.

  3. To be able to include/exclude the filters, right-click on the filter and select the Show Filter option.

Crosstabs in Tableau

A crosstab chart in Tableau (also called a text table) is a type of visualization that shows the data in textual form. The chart is made up of one or more dimensions and one or more measures. It can also apply various calculations to the values of the measure field.
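
A text table is essentially a pivot: one dimension defines the rows, another the columns, and each cell holds an aggregate of the measure. The following plain-Python sketch shows that structure. The `crosstab` helper is hypothetical, averaging is just one possible aggregate, and the records (including the "Group 1" and "Group 2" labels) are made-up numbers used only to show the mechanics; they are not values from the survey.

```python
from collections import defaultdict
from statistics import mean

def crosstab(records, row_key, col_key, measure):
    """Average `measure` for every (row, column) combination."""
    cells = defaultdict(list)
    for record in records:
        cells[(record[row_key], record[col_key])].append(record[measure])
    table = defaultdict(dict)
    for (row, col), values in cells.items():
        table[row][col] = mean(values)
    return {row: dict(cols) for row, cols in table.items()}

# Made-up records (illustrative only, not survey values)
records = [
    {"Satisfaction": "Health", "Sex": "Group 1", "Value": 7.0},
    {"Satisfaction": "Health", "Sex": "Group 1", "Value": 8.0},
    {"Satisfaction": "Health", "Sex": "Group 2", "Value": 6.0},
    {"Satisfaction": "Life as a whole", "Sex": "Group 1", "Value": 8.0},
    {"Satisfaction": "Life as a whole", "Sex": "Group 2", "Value": 9.0},
]

table = crosstab(records, "Satisfaction", "Sex", "Value")
for row, cols in table.items():
    print(row, cols)
```

Colour-encoding the cells, as done in the steps that follow, maps each of these aggregated numbers to a shade.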

  1. Drag Satisfaction and Sex to Rows and Value to Columns.
  2. Drag GEO to Color.

The resulting chart looks like the one below:

  3. Right-click on the current worksheet and select Duplicate as Crosstab.
  4. You can colour-encode the values in the crosstab chart by dropping the measure field onto the Color shelf, as shown in the following screenshot. The strength of the colour then depends on the value of the measure: larger values get a darker shade than smaller values. Drag Value to Color and change the colour.
  5. Change the Marks type to Square and click Show Mark Labels.
  6. Drag GEO and Satisfaction to Filters. Then right-click on each filter and select Show Filter.

Wrapping Up: Build a story to present

Suppose you want to share your findings with a larger team. Instead of having to guess which key findings your team is interested in and including them in a PowerPoint presentation, you can create a visual story in Tableau. This way, you can walk viewers through your data discovery process, and you have the option to interactively explore and present your data to answer any questions that come up during your presentation.

For the presentation, you will start with an overview:

  1. Click the New story button

You are presented with a blank workspace that reads, “Drag a sheet here.” This is where you will create your first story point.

If you think that blank stories look a lot like blank dashboards, that is because they do. Similar to a dashboard, you can drag worksheets over to present the stories.

  2. From the Story pane on the left, drag one of your worksheets onto your view.

  3. Add a caption by editing the text in the gray box above the worksheet.

  4. To add more sheets, in the Story pane, select Blank.

  5. In the left-hand pane, select Drag to add text and drag it onto your view.

  6. Enter a description for your dashboard that emphasizes the key points of your story/findings. You can change the format of this textbox by right-clicking on the text.

Enjoy the storytelling experience!