Mosaic plots using ggmosaic

Author

Dhafer Malouche

Mosaic plots

First, let me explain what mosaic plots are. Mosaic plots provide a visual representation of the relationship between two or more categorical variables in a dataset. They help us understand how these variables are related to each other.

To create a mosaic plot, let’s consider two categorical variables, \(X\) and \(Y\), with levels \(1\) to \(I\) and \(1\) to \(J\), respectively. Let \((x_1, y_1), \ldots, (x_n, y_n)\) be a sample of \(n\) observations of \((X, Y)\), stored as two columns of our data. We define the following frequencies:

  • Marginal frequencies of \(X\) and \(Y\), denoted as \(f_{i+}\) and \(f_{+j}\), respectively: \[f_{i+} = \frac{1}{n} \sum_{l=1}^n 1\{x_l=i\}\] \[f_{+j} = \frac{1}{n} \sum_{l=1}^n 1\{y_l=j\}\]

  • Joint frequencies, denoted as \(f_{ij}\): \[f_{ij} = \frac{1}{n} \sum_{l=1}^n 1\{x_l=i\} \cdot 1\{y_l=j\}\]

  • Conditional frequency of \(Y\) given \(X\), denoted as \(f_{j\mid i}\): \[f_{j\mid i} = \frac{f_{ij}}{f_{i+}}\]
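As a minimal sketch of these definitions, the three kinds of frequencies can be computed in base R with `table()` and `prop.table()`. The vectors `x` and `y` below are made-up toy data, not taken from any real dataset.

```r
# Toy data (hypothetical): two categorical variables observed on n = 10 units
x <- factor(c("a", "a", "a", "b", "b", "b", "b", "c", "c", "c"))
y <- factor(c("u", "v", "u", "u", "v", "v", "v", "u", "u", "v"))

# Joint frequencies f_ij: cell counts divided by n
f_joint <- prop.table(table(x, y))

# Marginal frequencies f_i+ (rows) and f_+j (columns)
f_x <- margin.table(f_joint, 1)
f_y <- margin.table(f_joint, 2)

# Conditional frequencies f_j|i: each row of counts divided by its row total
f_cond <- prop.table(table(x, y), margin = 1)

print(round(f_joint, 2))
print(round(f_cond, 2))
```

Each row of `f_cond` sums to 1, since it describes the distribution of `y` within one level of `x`.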

We can observe two facts about \(X\) and \(Y\) for any two levels \(i\) and \(j\):

  1. \(\sum_{i,j} f_{ij} = 1\)
  2. \(f_{ij} = f_{i+} \cdot f_{j\mid i}\). Therefore, the area of the rectangle with width \(f_{i+}\) and length \(f_{j\mid i}\) represents the frequency \(f_{ij}\).

A mosaic plot is a square divided into tiles, where the areas of the tiles sum to 1. Each tile corresponds to one of the joint levels \((i, j)\) of the variables \((X, Y)\). It is defined as a rectangle with width \(f_{i+}\), length \(f_{j\mid i}\), and area \(f_{ij}\). The horizontal edge represents the conditioning variable \(X\), and the vertical edge represents the target variable \(Y\).
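To make the construction concrete, here is a sketch that computes the tile coordinates by hand, without any spacing between tiles. The counts in `counts` and the helper name `mosaic_tiles` are hypothetical and only serve the illustration; the layout produced by a plotting package may differ in details such as gaps.

```r
# Hypothetical 2 x 3 table of counts: rows are levels of X, columns levels of Y
counts <- matrix(c(10, 20, 10,
                   30, 15, 15),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(X = c("x1", "x2"), Y = c("y1", "y2", "y3")))

mosaic_tiles <- function(counts) {
  f_joint <- counts / sum(counts)        # f_ij
  f_x <- rowSums(f_joint)                # f_i+
  f_cond <- f_joint / f_x                # f_j|i: row i divided by f_i+
  xmax <- cumsum(f_x)                    # right edge of each column of tiles
  ycum <- t(apply(f_cond, 1, cumsum))    # cumulative f_j|i within each level of X
  dimnames(ycum) <- dimnames(f_cond)

  # One row per tile (i, j)
  tiles <- expand.grid(X = rownames(counts), Y = colnames(counts),
                       stringsAsFactors = FALSE)
  idx <- cbind(tiles$X, tiles$Y)
  tiles$xmax <- unname(xmax[tiles$X])
  tiles$xmin <- tiles$xmax - unname(f_x[tiles$X])     # width  = f_i+
  tiles$ymax <- unname(ycum[idx])
  tiles$ymin <- tiles$ymax - unname(f_cond[idx])      # length = f_j|i
  tiles$area <- (tiles$xmax - tiles$xmin) * (tiles$ymax - tiles$ymin)
  tiles
}

tiles <- mosaic_tiles(counts)
print(tiles)
```

The `area` column reproduces the joint frequencies, and the areas sum to 1, as stated above.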

If there is no evidence of a relationship between \(X\) and \(Y\), the joint frequency \(f_{ij}\) will approximately equal the product of the marginal frequencies, which can be written as: \[f_{ij} \approx f_{i+} \cdot f_{+j}\]

Furthermore, for all \(j\), \[f_{j\mid i} \approx f_{+j}\]

In that case, the tiles for a given level \(j\) have the same length in every column, so the horizontal cuts line up across the whole plot and the mosaic looks like a regular grid. This is the benchmark against which other mosaic plots are compared: the more the horizontal cuts are misaligned from one column to the next, the stronger the evidence of a relationship between \(X\) and \(Y\).
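The independence benchmark can also be checked numerically. The sketch below, on two hypothetical tables of counts, measures the largest gap between the observed joint frequencies and the product of the marginal frequencies; the helper name `max_gap` is an assumption made for this illustration.

```r
# Hypothetical counts where X and Y are exactly independent:
# both rows have the same profile (1 : 2 : 1)
indep <- matrix(c(10, 20, 10,
                  15, 30, 15),
                nrow = 2, byrow = TRUE)

# Hypothetical counts with a clear association between X and Y
assoc <- matrix(c(30,  5,  5,
                   5, 25, 30),
                nrow = 2, byrow = TRUE)

# Largest absolute difference between f_ij and f_i+ * f_+j
max_gap <- function(counts) {
  f_joint <- counts / sum(counts)
  expected <- outer(rowSums(f_joint), colSums(f_joint))
  max(abs(f_joint - expected))
}

print(max_gap(indep))   # essentially zero
print(max_gap(assoc))   # clearly positive
```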

Using ggmosaic with two variables

Let us consider the dataset tea, which can be found in the FactoMineR package. This dataset contains the responses of 300 individuals, with data on how they drink tea (18 questions), how they perceive different products (12 questions), and some personal details (4 questions).

> library(FactoMineR)
> data(tea)

For example, let’s examine the relationship between the variables “sex” and “frequency”. Can we predict the frequency of tea consumption based on the gender of the consumer? In this case, we will place the “sex” variable on the horizontal edge as the conditioning variable, and the “frequency” variable on the vertical edge as the target variable.

Let’s then display the basic mosaic plot:

> library(ggmosaic)
> p1<-ggplot(data=tea)+
+   geom_mosaic(aes(x=product(frequency,sex),fill=sex))
> p1
Warning: `unite_()` was deprecated in tidyr 1.2.0.
ℹ Please use `unite()` instead.
ℹ The deprecated feature was likely used in the ggmosaic package.
  Please report the issue at <https://github.com/haleyjeppson/ggmosaic>.

This is a reasonable start, but the figure lacks several useful features. Let’s begin by adding percentages: we will display the conditional frequencies \(f_{j\mid i}\) inside each rectangle. To do this, we use the tidyverse package, in particular the pipe operator %>%, together with the ggplot_build function, which gives access to the data that ggplot2 computed in order to draw the plot p1.

> library(tidyverse)
> p1d<- ggplot_build(p1)$data %>% as.data.frame() %>% filter(.wt > 0)
> head(p1d)
     fill x__fill__sex .wt   xmin   xmax      ymin      ymax level x__frequency
1 #F8766D            F  49 0.0000 0.5874 0.0000000 0.2674068     2        1/day
2 #F8766D            F  21 0.0000 0.5874 0.2769414 0.3915443     2  1 to 2/week
3 #F8766D            F  89 0.0000 0.5874 0.4010790 0.8867770     2       +2/day
4 #F8766D            F  19 0.0000 0.5874 0.8963116 1.0000000     2  3 to 6/week
5 #00BFC4            M  46 0.5974 1.0000 0.0000000 0.3662641     2        1/day
6 #00BFC4            M  23 0.5974 1.0000 0.3757987 0.5589308     2  1 to 2/week
  .n          label x y group PANEL width linetype fontsize shape colour size
1 49       1/day\nF 0 0     1     1  0.75    solid        5    19     NA  0.1
2 21 1 to 2/week\nF 0 0     1     1  0.75    solid        5    19     NA  0.1
3 89      +2/day\nF 0 0     1     1  0.75    solid        5    19     NA  0.1
4 19 3 to 6/week\nF 0 0     1     1  0.75    solid        5    19     NA  0.1
5 46       1/day\nM 0 0     1     1  0.75    solid        5    19     NA  0.1
6 23 1 to 2/week\nM 0 0     1     1  0.75    solid        5    19     NA  0.1
  alpha stroke linewidth weight
1   0.8    0.1       0.1      1
2   0.8    0.1       0.1      1
3   0.8    0.1       0.1      1
4   0.8    0.1       0.1      1
5   0.8    0.1       0.1      1
6   0.8    0.1       0.1      1

In the object p1d, the column “ymax” gives the upper edge of each tile; within each level of \(X\), it tracks the cumulative conditional frequency, up to the small gaps that ggmosaic leaves between tiles. Next, we create an R function that recovers the conditional frequencies from these cumulative values. We then apply it with tapply, using the “fill” column of p1d to group the tiles by level of \(X\). It is important that the resulting vector follows the same order as the rows of p1d.

Here is the R function I have developed:

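> compt_perc=function(x){
+   # x: the ymax values of the tiles within one level of the conditioning variable
+   # successive differences of c(0, x); the extra trailing term is dropped
+   d=c(x,1)-c(0,x)
+   d[-length(d)]
+ }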

We now compute all of the conditional frequencies:

> x=tapply(p1d$ymax,factor(p1d$fill,levels=unique(p1d$fill)),compt_perc)
> x=unlist(x)
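The reason for `factor(..., levels = unique(...))` is that `tapply` returns its groups in factor-level order, and a plain `factor()` sorts the levels alphabetically. The sketch below uses a small hypothetical stand-in for the “fill” and “ymax” columns, and a hypothetical helper `step_heights` that takes successive differences, to show how the two orderings differ.

```r
# Hypothetical stand-in for two columns of p1d: group colours and cumulative heights
fill <- c("#F8766D", "#F8766D", "#00BFC4", "#00BFC4")
ymax <- c(0.3, 1.0, 0.6, 1.0)

# Successive differences of the cumulative values within one group
step_heights <- function(v) diff(c(0, v))

# Default factor(): levels are sorted, so "#00BFC4" comes before "#F8766D"
by_sorted <- unlist(tapply(ymax, factor(fill), step_heights))

# Levels in order of first appearance: groups stay in the row order of the data
by_appearance <- unlist(tapply(ymax, factor(fill, levels = unique(fill)), step_heights))

print(unname(by_sorted))      # groups swapped relative to the rows
print(unname(by_appearance))  # aligned with the rows
```

Only the second version can be attached to the data row by row without mislabelling the tiles.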

Then, we add to p1d a vector containing percentages:

> p1d$percentage=paste0(round(100*x,2),"%")
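Because “ymax” is a plotting coordinate, percentages derived from it are close to, but not exactly equal to, the conditional frequencies. A possible alternative, sketched here under the assumption that the “.wt” column holds the cell counts (as the output of head(p1d) suggests), is to compute the percentages directly from the counts. The data frame `tiles` below is a stand-in containing only the four rows for sex = F printed earlier; with the real data one would use p1d in its place.

```r
# Stand-in for p1d: the four rows for sex = F shown in the head(p1d) output
tiles <- data.frame(
  x__fill__sex = c("F", "F", "F", "F"),
  x__frequency = c("1/day", "1 to 2/week", "+2/day", "3 to 6/week"),
  .wt = c(49, 21, 89, 19)
)

# Conditional frequency: cell count divided by the total count of its group
cond <- ave(tiles$.wt, tiles$x__fill__sex, FUN = function(w) w / sum(w))
tiles$percentage <- paste0(round(100 * cond, 2), "%")

print(tiles[, c("x__frequency", "percentage")])
```

Since `ave()` returns its result in the original row order, no reordering step is needed before attaching the labels.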

We now build a new mosaic plot with the conditional frequencies displayed at the center of the tiles. I have also removed the legend, switched the graph’s theme to black and white, and added appropriate axis labels.

> p2<-p1 + 
+   geom_label(data = p1d, 
+              aes(x = (xmin + xmax)/2, 
+                  y = (ymin + ymax)/2, 
+                  label = percentage))
> p2<-p2+xlab("Sex")+ylab("Frequency")+theme_bw()
> p2<-p2+theme(legend.position="none")
> p2

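From this graph, we can read that 36.63% of men fall in the “1/day” category, compared with only 26.74% of women. (The exact proportions from the counts, 46/122 ≈ 37.70% and 49/178 ≈ 27.53%, are slightly higher, because the labels are computed from tile coordinates rather than from the counts.)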

Finally, we run a chi-square test to assess the independence of these two categorical variables, and we display the result of the test in the title of the figure.

> x=chisq.test(xtabs(~frequency+sex,data=tea))
> title=paste("Pearson's Chi-squared test:", round(x$statistic,4),"p-value:", round(x$p.value,4))
> p2<-p2+ggtitle(title)
> p2
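To see what the test statistic measures, the following sketch recomputes Pearson's statistic by hand on a hypothetical 2 x 4 table of counts (not the tea data) and compares it with the output of `chisq.test`. The expected counts are \(n\) times the product of the marginal frequencies, which is the independence benchmark described earlier.

```r
# Hypothetical 2 x 4 contingency table (rows: two groups, columns: four categories)
tab <- matrix(c(40, 25, 60, 25,
                35, 30, 40, 45),
              nrow = 2, byrow = TRUE)

n <- sum(tab)
expected <- outer(rowSums(tab), colSums(tab)) / n   # n * f_i+ * f_+j

# Pearson's statistic: sum over cells of (observed - expected)^2 / expected
stat <- sum((tab - expected)^2 / expected)
df <- (nrow(tab) - 1) * (ncol(tab) - 1)
p_value <- pchisq(stat, df = df, lower.tail = FALSE)

# Same quantities from the built-in test
res <- chisq.test(tab)
print(c(by_hand = stat, chisq_test = unname(res$statistic)))
print(c(by_hand = p_value, chisq_test = res$p.value))
```

A large statistic (small p-value) corresponds to a mosaic plot whose horizontal cuts are visibly misaligned between columns.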