Introduction

These notes provide essential details for learning to build visual summaries of a data set using the ggplot2 package in R. The learning method starts from the nature of a variable and the list of plots suited to that variable. We shall also note ways to improve the presentation of a plot by enhancing its components.

Code may be obtained using the tab appearing at the top right of each output.

General Points

To obtain a visual representation of one or more variables, we first consider the type of each variable, since this helps us decide on a specific plot. We then work on ways to improve the look and feel of the plot. This enhancement can be carried out in three ways:

  1. Context - we may transform a variable or change the data range to focus on a specific area of interest

  2. Aesthetics - color, size, position, and other changes that make the plot easier to read

  3. Information - we can provide suitable titles for the plot and axes, axis labels, or data values, so that the plot conveys as much information as possible about the data / variables considered

The following is an indicative guideline for choosing a plot based on the number of variables (dimension) and the type of variables

library(kableExtra)
p1=c("Plot 1","Plot 2","Plot 3","Plot 4")
p2=c("1-D","2-D More","2-D More","2-D More")
p3=c("Numeric","Numeric-Grouped by Factor (Dichot/Polychot)","Numeric and Numeric","Binary (Dichot/Polychot)")
p4=c("Histogram, Box plot, Density, Scatter Plot","Histogram, Box plot, Density, Scatter Plot","Scatter Plot","Bar plot (1-D, 2-D - Factor and Factor - Stack, Side), Pie Chart, Donut")
plda=as.data.frame(cbind(p1,p2,p3,p4))
colnames(plda)=c("Plots","Dimension","Variable Type","Plot Type")

kable(rbind(plda)) %>%   kable_styling()
| Plots | Dimension | Variable Type | Plot Type |
|---|---|---|---|
| Plot 1 | 1-D | Numeric | Histogram, Box plot, Density, Scatter Plot |
| Plot 2 | 2-D More | Numeric-Grouped by Factor (Dichot/Polychot) | Histogram, Box plot, Density, Scatter Plot |
| Plot 3 | 2-D More | Numeric and Numeric | Scatter Plot |
| Plot 4 | 2-D More | Binary (Dichot/Polychot) | Bar plot (1-D, 2-D - Factor and Factor - Stack, Side), Pie Chart, Donut |

Data Set

We shall consider the Credit data set from ISLR, an R package. The following table provides the metadata (variables and their nature) of the Credit data set.

library(kableExtra)
da1=ISLR::Credit
meda1=1:12
meda2=as.vector(names(da1))
meda3=c("Int/Char","Numeric","Int","Int","Int","Int","Int","Binary Factor","Binary Factor","Binary Factor","Poly Factor","Int")
meda4=c("Nil","Boxplot","Histogram","Histogram","Histogram","Histogram","Histogram","Bar Plot","Bar Plot","Bar Plot","Pie","Density")
meda=as.data.frame(cbind(meda1,meda2,meda3,meda4))
colnames(meda)=c("SNo","Variables","Type","VisualSummary")


kable(rbind(meda)) %>%   kable_styling()
| SNo | Variables | Type | VisualSummary |
|---|---|---|---|
| 1 | ID | Int/Char | Nil |
| 2 | Income | Numeric | Boxplot |
| 3 | Limit | Int | Histogram |
| 4 | Rating | Int | Histogram |
| 5 | Cards | Int | Histogram |
| 6 | Age | Int | Histogram |
| 7 | Education | Int | Histogram |
| 8 | Gender | Binary Factor | Bar Plot |
| 9 | Student | Binary Factor | Bar Plot |
| 10 | Married | Binary Factor | Bar Plot |
| 11 | Ethnicity | Poly Factor | Pie |
| 12 | Balance | Int | Density |
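The types in the table above were typed by hand. As a minimal sketch, assuming the ISLR package is installed, they can be cross-checked against the data itself. The object names `var_class`, `n_levels`, and `type_check` are illustrative only and are not used elsewhere in these notes.

```r
# Load the Credit data (assumes the ISLR package is installed)
da1 <- ISLR::Credit

# Storage class of every column, in column order
var_class <- vapply(da1, function(v) class(v)[1], character(1))

# Number of levels for factor columns (NA for non-factor columns)
n_levels <- vapply(da1,
                   function(v) if (is.factor(v)) nlevels(v) else NA_integer_,
                   integer(1))

# One row per variable: name, class, and number of factor levels
type_check <- data.frame(Variable = names(da1),
                         Class = var_class,
                         Levels = n_levels,
                         row.names = NULL)
print(type_check)
```

A factor with two levels corresponds to "Binary Factor" in the table, and a factor with more than two levels to "Poly Factor".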

Plots

We shall illustrate each type of plot (listed in the first table) with suitable numeric / factor variables; a count variable can be treated as numeric for this purpose. In the Credit data set, Cards is a count variable: the number of credit cards a person owns. We use the plots for numeric variables to obtain visual summaries of Cards.

Also, the order of learning follows the procedure outlined above: a base plot first, then suitable enhancements in Context, Aesthetics, and Information.

Remarks

  1. Plots are presented as illustrative cases with code. Readers should try to go beyond the plots and options shown

  2. Code may be reused

  3. In the subsequent discussions, plots are two dimensional. The word variate refers to the number of variables used in a plot

    1. Univariate: When the plot has only one variable

    2. Multivariate: More than one variable

  4. Plots can further be grouped based on type of the variables

Univariate Numeric Variable

For a single numeric variable we can try a scatter plot, histogram, box plot, or density plot. A scatter plot needs values for both axes, so we usually take the row index (from 1 to the number of rows of the data set) as the x axis value; the y axis holds our (numeric) variable of interest

library(ggplot2)
library(GGally)
library(corrplot) 
library(ggpubr)
library(ggridges)
library(gridExtra)

ggplot(da1,aes(x=1:nrow(da1),y=Balance))+
  geom_point()

Enhancements

Aesthetics: Changes in Color, Size, and Shape

ggplot(da1,aes(x=1:nrow(da1),y=Balance))+
  geom_point(size=5, shape=3, color="darkgreen")

Information - Title, Subtitle, and Caption

ggplot(da1,aes(x=1:nrow(da1),y=Balance))+
  geom_point(size=5, shape=2, color="maroon")+
  labs(title="Balance", 
      subtitle="Average credit card balance in $",
       caption="From ISLR - Credit data set",
       x="Row Index",y="Balance in $")

More options using theme() for formatting axis labels, titles

ggplot(da1,aes(x=1:nrow(da1),y=Balance))+
  geom_point(size=1.5, shape=2, color="maroon")+
  labs(title="Balance", 
  subtitle="Average credit card balance in $",
       caption="From ISLR - Credit data set",
       x="Row Index",y="Balance in $")+
  theme(plot.title=element_text(hjust=0.5,vjust=0.5,color="tomato",size=20, face="bold.italic"),
        plot.subtitle=element_text(size=18, face="bold",hjust=0.5),  
        plot.caption=element_text(size=18,face="bold.italic",color="blue3"),  
        axis.title.x=element_text(color="blue", size=20, face="bold",vjust=10, hjust=0.1),  
        axis.title.y=element_text(color="brown", size=20, face="bold",angle = 270),  
        axis.text.x=element_text(size=20, angle = 30,vjust=.5),  
        axis.text.y=element_text(size=20)) 

Contextual Changes

Depending on the context of the data set / variable of interest, we may be interested to consider a partial set of values. Or, we may use mathematical transformations (for example, logarithm) of a variable.

Let us leave out of view the Balance values that are less than 100; we use coord_cartesian(), which zooms the axis without removing rows from the data

ggplot(da1,aes(x=1:nrow(da1),y=Balance))+
  geom_point(size=5, shape=2, color="maroon")+
  coord_cartesian(ylim=c(100,max(da1$Balance)))+
  labs(title="Balance", 
      subtitle="Average credit card balance in $",
       caption="From ISLR - Credit data set",
       x="Row Index",y="Balance in $ (>100)")
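The logarithm mentioned above is not shown in these notes, and zooming differs from dropping rows. The sketch below contrasts the three options, assuming ggplot2 and ISLR are installed. The `Index` column and the names `p_zoom`, `p_filter`, `p_log`, and `da_sub` are introduced here only for illustration. `log1p()` (that is, log(1 + x)) is used instead of a plain logarithm because Balance contains zeros.

```r
library(ggplot2)

da1 <- ISLR::Credit
da1$Index <- seq_len(nrow(da1))   # illustrative row-index column

# Zoom only: every row stays in the data, values below 100 are out of view
p_zoom <- ggplot(da1, aes(x = Index, y = Balance)) +
  geom_point() +
  coord_cartesian(ylim = c(100, max(da1$Balance)))

# Filter: rows with Balance below 100 are removed before plotting
da_sub <- subset(da1, Balance >= 100)
p_filter <- ggplot(da_sub, aes(x = Index, y = Balance)) +
  geom_point()

# Transform: log1p(x) = log(1 + x), which is defined when Balance is 0
p_log <- ggplot(da1, aes(x = Index, y = log1p(Balance))) +
  geom_point() +
  labs(x = "Row Index", y = "log(1 + Balance)")

p_zoom
p_filter
p_log
```

For a scatter plot of points the zoomed and filtered versions look alike. The difference matters for layers that compute statistics (for example a box plot or a smoother), because those use whatever rows remain in the data.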

More Univariate plots

Box plot

The variable of interest is mapped to the x axis; the plot is horizontal

ggplot(da1,aes(x=Balance))+
  geom_boxplot(col="darkred")+
  labs(title="Balance", 
  subtitle="Average credit card balance in $",
       caption="From ISLR - Credit data set",
       y=" ")+
  theme(plot.title=element_text(hjust=0.5,vjust=0.5,color="blue",size=20, face="bold.italic"),
      plot.subtitle=element_text(size=18,face="bold",hjust=0.5),      plot.caption=element_text(size=18,face="bold.italic",color="blue3"),  
  axis.title.x=element_text(color="blue", size=20, face="bold"),  axis.title.y=element_text(color="brown", size=20, face="bold"),  axis.text.x=element_text(size=20),     axis.text.y=element_text(size=20)) 

The variable of interest is mapped to the y axis; the plot is vertical

ggplot(da1,aes(y=Balance))+
  geom_boxplot(fill="yellow")+
  labs(title="Balance", 
  subtitle="Average credit card balance in $",
       caption="From ISLR - Credit data set",
       x=" ")+
  theme(plot.title=element_text(hjust=0.5,vjust=0.5,color="blue",size=20, face="bold.italic"),
      plot.subtitle=element_text(size=18,face="bold",hjust=0.5),      plot.caption=element_text(size=18,face="bold.italic",color="blue3"),  
  axis.title.x=element_text(color="blue", size=20, face="bold"),  axis.title.y=element_text(color="brown", size=20, face="bold"),  axis.text.x=element_text(size=20),     axis.text.y=element_text(size=20)) 

Histogram and Density Plot

ggplot(da1,aes(x=Age))+
  geom_histogram(col="darkred",fill="yellow")+
  labs(title="Age", 
  subtitle="Age Distribution",
       caption="From ISLR - Credit data set",
       y=" ")+
  theme(plot.title=element_text(hjust=0.5,vjust=0.5,color="blue",size=20, face="bold.italic"),
      plot.subtitle=element_text(size=18,face="bold",hjust=0.5),      plot.caption=element_text(size=18,face="bold.italic",color="blue3"),  
  axis.title.x=element_text(color="blue", size=20, face="bold"),  axis.title.y=element_text(color="brown", size=20, face="bold"),  axis.text.x=element_text(size=20),     axis.text.y=element_text(size=20)) 

ggplot(da1,aes(x=Age))+
  geom_density(fill="pink",col="tan3")+
  labs(title="Age", 
  subtitle="Age Distribution",
       caption="From ISLR - Credit data set",
       y=" ")+
  theme(plot.title=element_text(hjust=0.5,vjust=0.5,color="blue",size=20, face="bold.italic"),
      plot.subtitle=element_text(size=18,face="bold",hjust=0.5),      plot.caption=element_text(size=18,face="bold.italic",color="blue3"),  
  axis.title.x=element_text(color="blue", size=20, face="bold"),  axis.title.y=element_text(color="brown", size=20, face="bold"),  axis.text.x=element_text(size=20),     axis.text.y=element_text(size=20)) 

Histogram with density plot

ggplot(da1,aes(x=Age))+
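  # after_stat(density) replaces the deprecated ..density.. notation;
  # linewidth replaces size for lines (ggplot2 >= 3.4.0)
  geom_histogram(aes(y = after_stat(density)),
                 col="darkred",fill="yellow",
                 position = "identity")+
  geom_density(col="tan3",linewidth = 2)+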
  labs(title="Age", 
  subtitle="Age Distribution",
       caption="From ISLR - Credit data set",
       y=" ")+
  theme(plot.title=element_text(hjust=0.5,vjust=0.5,color="blue",size=20, face="bold.italic"),
      plot.subtitle=element_text(size=18,face="bold",hjust=0.5),      plot.caption=element_text(size=18,face="bold.italic",color="blue3"),  
  axis.title.x=element_text(color="blue", size=20, face="bold"),  axis.title.y=element_text(color="brown", size=20, face="bold"),  axis.text.x=element_text(size=20),     axis.text.y=element_text(size=20)) 

Multivariate Numeric Variable

A metric (numeric) variable is grouped by factor variables (binary / polychotomous) and paired with one or more numeric variables.

Color, size, and / or shape can be used to show the grouping in the plot

The distribution of a metric variable (Balance) is grouped by Ethnicity using the color of the geometry object (here, point)

ggplot(da1,aes(x=1:nrow(da1),y=Balance))+
  geom_point(aes(color=Ethnicity),size=3)+
  theme(legend.position = "bottom")+
  labs(title="Balance by Ethnicity", 
       subtitle="Average credit card balance in $",
       caption="From ISLR - Credit data set",
       x="Row Index",y="Balance in $")

Another variable Gender is added through the size of the object

ggplot(da1,aes(x=1:nrow(da1),y=Balance))+
  geom_point(aes(color=Ethnicity,size=Gender))+
  theme(legend.position = "bottom")+
  labs(title="Balance by Ethnicity and Gender", 
       subtitle="Average credit card balance in $",
       caption="From ISLR - Credit data set",
       x="Row Index",y="Balance in $")

A numeric variable Income is used for a similar task

ggplot(da1,aes(x=1:nrow(da1),y=Balance))+
  geom_point(aes(color=Gender,size=Income))+
  theme(legend.position = "bottom")+
  labs(title="Balance by Gender and Income", 
       subtitle="Average credit card balance in $",
       caption="From ISLR - Credit data set",
       x="Row Index",y="Balance in $")

When we have more than one numeric variable such as Balance and Income, we may choose scatter plot; other factor variables may be used for more detailed and simultaneous comparison

ggplot(da1,aes(x=Income,y=Balance))+
  geom_point(aes(color=Gender,size=Ethnicity))+
  theme(legend.position = "bottom")+
  labs(title="Income and Balance", 
       subtitle="Average credit card balance in $",
       caption="From ISLR - Credit data set",
       x="Income",y="Balance in $")

Multiple Plots using Facet

Using ggpubr and ggridges packages

These packages are useful for creating multiple plots of a numeric variable split by a factor variable

The facet option is used to bring more variables into a plot beyond the two axes. In general, faceting is done with factor variables.

ggdensity(da1, x = "Balance",
          add = "mean",
          color = "Gender", fill = "Gender",
          palette = c("darkgreen", "yellow"))

ggplot(da1, aes(x = Balance, y = Ethnicity)) +
  geom_density_ridges(aes(fill = Ethnicity)) +
  theme(legend.position = "none")

ggplot(da1, aes(x = Balance, y = Ethnicity)) +
  geom_density_ridges(aes(fill = Ethnicity)) +
  theme(legend.position = "none")+
  facet_grid(~Student)

ggplot(da1, aes(x = Balance, y = Ethnicity)) +
  geom_density_ridges(aes(fill = Ethnicity)) +
  theme(legend.position = "none")+
  facet_grid(Student~.)

ggplot(da1, aes(x = Balance, y = Married)) +
  geom_density_ridges(aes(fill = Married)) +
  theme(legend.position = "none")+
  facet_grid(Student~Ethnicity)+
  scale_fill_manual(values = c("violet", "green3"))

ggplot(da1, aes(x = Balance, y = Married)) +
  geom_density_ridges(aes(fill = Married)) +
  theme(legend.position = "none")+
  facet_grid(Student~Ethnicity+Gender)+
  scale_fill_manual(values = c("violet", "green3"))

ggplot(da1, aes(x = Balance, y = Married)) +
  geom_density_ridges(aes(fill = Married)) +
  theme(legend.position = "none")+
  facet_grid(Student+Gender~Ethnicity)+
  scale_fill_manual(values = c("maroon", "yellow"))

ggplot(da1, aes(x = Balance,y=as.factor(Cards))) +
  geom_density_ridges(aes(fill = as.factor(Cards))) +
  theme(legend.position = "none")+
  facet_grid(Gender~ Ethnicity)+
  labs(y="Number of Cards")
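The examples above all use facet_grid(), which lays panels out in a strict rows-by-columns grid. As a small sketch of an alternative not used in these notes, facet_wrap() wraps the panels of a single factor into a chosen number of columns. The sketch assumes ggplot2 and ISLR are installed; the name `p_wrap` is illustrative.

```r
library(ggplot2)

da1 <- ISLR::Credit

# One density panel per Ethnicity level, wrapped into two columns;
# Student is shown through the fill color inside each panel
p_wrap <- ggplot(da1, aes(x = Balance, fill = Student)) +
  geom_density(alpha = 0.5) +
  facet_wrap(~ Ethnicity, ncol = 2) +
  theme(legend.position = "bottom")

p_wrap
```

facet_wrap() may be the more convenient choice when there is only one faceting variable with several levels, while facet_grid() suits a cross of two or more factors.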

Univariate Factor Variables

We shall consider the plots for qualitative / categorical / factor variables

Simple Bar Plot

ggplot(da1,aes(Ethnicity))+
  geom_bar(col="green",fill="yellow",linewidth=2)

Bar Plot with enhancements - Information. Frequency counts may be added in each class/level of the factor variable

ggplot(da1,aes(Ethnicity))+
  geom_bar(col="green",fill="yellow")+
  geom_text(aes(label = after_stat(count)), stat = "count",
            vjust=3, size=4,color='darkred',fontface="bold")+
  labs(title="Distribution of Ethnicity",caption ="From ISLR - Credit data set")+
  theme(plot.title=element_text(hjust=0.5,vjust=0.5,color="tomato",size=14, face="bold.italic"),
        plot.caption=element_text(size=12,face="bold.italic",color="tan3"),  
        axis.title.x=element_text(color="blue", size=10, face="bold",vjust=10, hjust=0.1),  
        axis.title.y=element_text(color="brown", size=14, face="bold",angle = 270),  
        axis.text.x=element_text(size=10, angle = 30),  
        axis.text.y=element_text(size=10)) 
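The counts printed on the bars can be checked, and extended with percentages, by computing the frequency table first. This is a sketch under the assumption that ggplot2 and ISLR are installed; `eth_counts` and `p_counts` are illustrative names, and geom_col() is used here in place of geom_bar() because the counts are already computed.

```r
library(ggplot2)

da1 <- ISLR::Credit

# Frequency table of the factor, converted to a data frame
# with columns Ethnicity and Freq
eth_counts <- as.data.frame(table(Ethnicity = da1$Ethnicity))
eth_counts$Percent <- round(100 * eth_counts$Freq / sum(eth_counts$Freq), 1)
print(eth_counts)

# geom_col() draws the pre-computed counts as bar heights
p_counts <- ggplot(eth_counts, aes(x = Ethnicity, y = Freq)) +
  geom_col(fill = "yellow", color = "green") +
  geom_text(aes(label = paste0(Freq, " (", Percent, "%)")),
            vjust = 1.5, color = "darkred", fontface = "bold") +
  labs(title = "Distribution of Ethnicity", y = "count")

p_counts
```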

Multivariate Factor Variable

Bar Plots using more than one factor variable

ggplot(da1,aes(x=Ethnicity,fill=Gender))+
  geom_bar()+
  scale_fill_manual(values=c("yellow", "green"))+
  geom_text(aes(label = after_stat(count)), stat = "count",
            position = position_stack(vjust = 0.5),
            size=4,color='darkred',fontface="bold")+
  labs(title="Distribution of Ethnicity",
       subtitle = "Gender Wise",
       caption ="From ISLR - Credit data set")+
theme(legend.position = "bottom")

Bar Plots using more than one factor variable, with position_dodge()

ggplot(da1,aes(x=Ethnicity,fill=Gender))+
  geom_bar(position=position_dodge())+
  geom_text(aes(label = after_stat(count)), stat = "count",
            position = position_dodge(width =  0.9),
        vjust=1.5,size=4,color='darkred',fontface="bold")+
  labs(title="Distribution of Ethnicity",
       subtitle = "Gender Wise",
       caption ="From ISLR - Credit data set")+
  theme(legend.position = "bottom")

Pie chart

ggplot(da1,
       aes(x = factor(""), fill = Ethnicity) ) +
  geom_bar() +
  coord_polar(theta = "y") +
  scale_x_discrete("")+
  theme(axis.ticks=element_blank(),  
        axis.title=element_blank(),  
        axis.text.y=element_blank(),
        axis.text.x=element_blank(),
        panel.grid  = element_blank(),
        legend.position = "bottom")+
  labs(title = "Ethnicity",
       caption="Credit Data Set from ISLR")
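The first table lists a donut chart alongside the pie chart, but no donut example appears in these notes. One common way to sketch it, assuming ggplot2 and ISLR are installed, is to place the bar at a numeric x position and widen the x range so that the unused part becomes the hole. The names `eth_counts` and `p_donut` are illustrative.

```r
library(ggplot2)

da1 <- ISLR::Credit
eth_counts <- as.data.frame(table(Ethnicity = da1$Ethnicity))

# The bar sits at x = 2 with width 1, so it covers 1.5 to 2.5;
# the x range 0.5 to 1.5 stays empty and becomes the hole
p_donut <- ggplot(eth_counts, aes(x = 2, y = Freq, fill = Ethnicity)) +
  geom_col(width = 1, color = "white") +
  coord_polar(theta = "y") +
  xlim(0.5, 2.5) +
  theme_void() +
  theme(legend.position = "bottom") +
  labs(title = "Ethnicity",
       caption = "Credit Data Set from ISLR")

p_donut
```

Moving the lower x limit closer to 1.5 shrinks the hole, and moving it further away enlarges it.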

More on Numeric variables

When we deal with more than one numeric variable, it is common practice to study their association (relation) using a scatter plot and the correlation, a numerical summary of the association.

Using GGally and corrplot packages

ggpairs(da1,columns=c(2:6,12),
        lower = list(continuous = wrap("points", 
                                       color = "darkgreen",
                                       alpha = 0.25,
                                       size=2,
                                       shape=5)),
        diag = list(continuous = wrap("densityDiag", 
                                      color = "red", 
                                      fill="yellow")))

res_corr=cor(da1[,c(2:6,12)])

#basic plot
corrplot(res_corr)

#desired enhancements
corrplot(res_corr,type="lower")

corrplot(res_corr,type="lower",diag=FALSE)

corrplot(res_corr,method="number",
         type="lower",
         number.digits=2,
         col.lim=c(-1,1),  # this argument was named cl.lim before corrplot 0.90
         col=colorRampPalette(c("darkblue","red","darkgreen"))(10),
         tl.pos="n")
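Each number in the matrix above is, by default, a Pearson correlation coefficient, since cor() uses method = "pearson" unless told otherwise. As a small sketch, assuming the ISLR package is installed, one entry can be reproduced from its definition. The pair Limit and Rating is chosen only for illustration, and `r_manual` and `r_builtin` are illustrative names.

```r
da1 <- ISLR::Credit

x <- da1$Limit
y <- da1$Rating

# Pearson correlation from its definition:
# sum of cross-deviations over the root of the product of squared deviations
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

# Same quantity from the built-in function
r_builtin <- cor(x, y)

print(c(manual = r_manual, builtin = r_builtin))
all.equal(r_manual, r_builtin)
```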

gridExtra Package

We use this package to arrange several plots in a layout specified by the number of rows and columns

Balanced Layout

If we plan to arrange an even or other non-prime number of plots, the arguments are direct to specify; for example, 16 plots fit a \(4\times4\) layout, which has a better appearance. Any such number can be split exactly into rows \(\times\) columns, so the plots fill the grid without white space

For example, we arrange four plots in a balanced layout with two rows and two columns (\(2\times2\))

PL1=ggplot(da1,aes(x=1:nrow(da1),y=Balance))+
  geom_point()+
  labs(title="Balance", 
       subtitle="Average credit card balance in $",
       caption="From ISLR - Credit data set",
       x="Row Index",y="Balance in $")

PL2=ggplot(da1,aes(x=Age))+
  geom_histogram(aes(y = after_stat(density)),
                 col="darkred",fill="yellow",
                 position = "identity")+
  geom_density(col="tan3",linewidth = 2)

PL3=ggplot(da1, aes(x = Balance, y = Ethnicity)) +
  geom_density_ridges(aes(fill = Ethnicity)) +
  theme(legend.position = "none")

PL4=ggplot(da1,aes(x=Ethnicity,fill=Gender))+
  geom_bar()+
  scale_fill_manual(values=c("yellow", "green"))+
  geom_text(aes(label = after_stat(count)), stat = "count",
            position = position_stack(vjust = 0.5),
            size=4,color='darkred',fontface="bold")+
  labs(title="Distribution of Ethnicity",
       subtitle = "Gender Wise",
       caption ="From ISLR - Credit data set")+
theme(legend.position = "bottom")

grid.arrange(PL1,PL2,PL3,PL4,nrow=2,ncol=2)
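The choice of rows and columns for a given number of plots can be sketched with a little arithmetic. The helper `grid_dims()` below is hypothetical (it is not part of gridExtra); it picks a near-square grid and reports how many cells would stay empty.

```r
# Hypothetical helper, not a gridExtra function:
# near-square grid for n plots and the number of empty cells left over
grid_dims <- function(n) {
  ncol <- ceiling(sqrt(n))
  nrow <- ceiling(n / ncol)
  c(nrow = nrow, ncol = ncol, empty = nrow * ncol - n)
}

grid_dims(4)    # 2 x 2 grid, no empty cell
grid_dims(16)   # 4 x 4 grid, no empty cell
grid_dims(6)    # 2 x 3 grid, no empty cell
grid_dims(3)    # 2 x 2 grid, one empty cell
```

The resulting nrow and ncol values can be passed to grid.arrange() as in the call above. A non-zero `empty` value signals the white-space situation discussed next.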

Imbalanced Layout

On the other hand, if the number of plots is an odd prime number, it cannot be split into a grid with more than one row and more than one column, so the above specification results in undesirable white space.

For example, we arrange three plots in a layout with two rows and two columns (\(2\times2\)). The cell in the second row, second column is left as white space because there is no plot to fill it

PL1=ggplot(da1,aes(x=1:nrow(da1),y=Balance))+
  geom_point()+ 
  labs(title="Balance",
       subtitle="Average credit card balance in $",
       caption="From ISLR - Credit data set",
       x="Row Index",y="Balance in $")

PL2=ggplot(da1,aes(x=Age))+
  geom_histogram(aes(y = after_stat(density)),
                 col="darkred",fill="yellow",
                 position = "identity")+
  geom_density(col="tan3",linewidth = 2)

PL3=ggplot(da1, aes(x = Balance, y = Ethnicity)) +
  geom_density_ridges(aes(fill = Ethnicity)) +
  theme(legend.position = "none")

grid.arrange(PL1,PL2,PL3,nrow=2,ncol=2)

To avoid such unwanted white space in the layout, we can use the grid.arrange function from the gridExtra package with the additional argument layout_matrix. This option provides a simple way to plan a layout for an odd number of plots.

In the above case with three plots, let us plan one of the possible ways as

  • We have a \(2\times2\) layout

  • Plot 1: First row, First column

  • Plot 2: First row, Second column

  • Plot 3: Second row, but stretched in two columns

This may be provided as input to the layout_matrix argument of the grid.arrange function in the gridExtra package. The syntax will be

\[\texttt{layout\_matrix=rbind(c(1,2), c(3,3))}\]

This will yield the desired output as

PL1=ggplot(da1,aes(x=1:nrow(da1),y=Balance))+
    geom_point()+ 
    labs(title="Balance", 
       subtitle="Average credit card balance in $",
       caption="From ISLR - Credit data set",
       x="Row Index",y="Balance in $")

PL2=ggplot(da1,aes(x=Age))+
  geom_histogram(aes(y = after_stat(density)),
                 col="darkred",fill="yellow",
                 position = "identity")+
  geom_density(col="tan3",linewidth = 2)

PL3=ggplot(da1, aes(x = Balance, y = Ethnicity)) +
  geom_density_ridges(aes(fill = Ethnicity)) +
  theme(legend.position = "none")

grid.arrange(PL1,PL2,PL3,nrow=2,ncol=2,
             layout_matrix=rbind(c(1,2), c(3,3)))

Another possible layout stretches Plot 1 over both rows of the first column

PL1=ggplot(da1,aes(x=1:nrow(da1),y=Balance))+
    geom_point()+ 
    labs(title="Balance", 
       subtitle="Average credit card balance in $",
       caption="From ISLR - Credit data set",
       x="Row Index",y="Balance in $")

PL2=ggplot(da1,aes(x=Age))+
  geom_histogram(aes(y = after_stat(density)),
                 col="darkred",fill="yellow",
                 position = "identity")+
  geom_density(col="tan3",linewidth = 2)

PL3=ggplot(da1, aes(x = Balance, y = Ethnicity)) +
  geom_density_ridges(aes(fill = Ethnicity)) +
  theme(legend.position = "none")

grid.arrange(PL1,PL2,PL3,nrow=2,ncol=2,
             layout_matrix=rbind(c(1,2), c(1,3)))
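It can help to print a layout matrix before passing it to grid.arrange(): each cell holds the number of the plot that occupies it, and a plot number repeated in adjacent cells means that plot is stretched over them. The sketch below uses base R only; `lay_a` and `lay_b` are illustrative names for the two layouts used above.

```r
# Layout used first: plot 3 stretched across the second row
lay_a <- rbind(c(1, 2),
               c(3, 3))

# Alternative layout: plot 1 stretched down the first column
lay_b <- rbind(c(1, 2),
               c(1, 3))

print(lay_a)
print(lay_b)

# Number of cells occupied by each plot in each layout
table(lay_a)
table(lay_b)
```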

If we have five plots in a single layout, then we may plan a \(3\times2\) layout and position the plots in a similar way. Then the syntax \(\texttt{layout\_matrix=rbind(c(1,2), c(3,4), c(5,5))}\) will produce plots 1 and 2 in the first row, 3 and 4 in the second row, and plot 5 stretched across the third row.

PL1=ggplot(da1,aes(x=1:nrow(da1),y=Balance))+
    geom_point()+ 
    labs(title="Balance", 
       subtitle="Average credit card balance in $",
       caption="From ISLR - Credit data set",
       x="Row Index",y="Balance in $")

PL2=ggplot(da1,aes(x=Age))+
  geom_histogram(aes(y = after_stat(density)),
                 col="darkred",fill="yellow",
                 position = "identity")+
  geom_density(col="tan3",linewidth = 2)

PL3=ggplot(da1, aes(x = Balance, y = Ethnicity)) +
  geom_density_ridges(aes(fill = Ethnicity)) +
  theme(legend.position = "none")

PL4= ggplot(da1, aes(x = Balance)) +
  geom_density(aes(fill = Student,col=Student)) +
  theme(legend.position = "bottom")

PL5= ggplot(da1, aes(x = Balance, y = Education)) +
  geom_point(aes(col = Student),size=2) +
  theme(legend.position = "bottom")

grid.arrange(PL1,PL2,PL3,PL4,PL5, nrow=3,ncol=2,
             layout_matrix=rbind(c(1,2), c(3,4),c(5,5)))
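The pattern in the three-plot and five-plot cases can be generalised. The function `two_col_layout()` below is a hypothetical helper (not part of gridExtra) that builds a two-column layout matrix for any number of plots, stretching the last plot across the final row when the number is odd. It is one possible convention among many.

```r
# Hypothetical helper, not a gridExtra function:
# two-column layout matrix for n plots, filled row by row
two_col_layout <- function(n) {
  ids <- seq_len(n)
  # For an odd n, repeat the last plot number so it spans the final row
  if (n %% 2 == 1) ids <- c(ids, n)
  matrix(ids, ncol = 2, byrow = TRUE)
}

two_col_layout(3)   # rows: (1, 2), (3, 3)
two_col_layout(5)   # rows: (1, 2), (3, 4), (5, 5)
two_col_layout(4)   # rows: (1, 2), (3, 4)
```

The result can be supplied as the layout_matrix argument, for example layout_matrix = two_col_layout(5) in the call above.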

Final Remarks

Visualization of a variable or a set of variables requires a clear idea of what to look for: the spread and shape of a variable and the association between variables. This material may help in taking a methodical approach to visualization exercises using ggplot2 and a few more packages. It should be considered an overview of producing plots with the ggplot2 package. More options can be tried with additional examples, perhaps in a self-paced manner.

Now we may be ready to generate plots using ggplot2 and other necessary packages. That is, we may be able to

  • know the different plots based on the nature of the variables

  • create basic plots

  • use appropriate enhancements in the appearance of the plot and its components such as axes, texts, labels, and titles

  • plan multiple plots in the required layouts with the available options

  • increase the dimensions of a plot by adding as many variables as required for comparison, while retaining the readability of the plot for better insights