These notes provide essential details for learning to build visual summaries of a data set using the ggplot2 package in R. The learning approach starts from the nature of a variable and the list of plots suited to that variable. We also note ways to improve the presentation of a plot by enhancing its components.
The code may be viewed using the tab at the top right of each output.
To obtain a visual representation of one or more variables, we first consider the type of each variable, which helps us choose a specific plot. We then work on ways to improve the look and feel of the plot. This enhancement can be carried out in three ways:
Context - we may transform a variable or change the data range to focus on a specific area of interest
Aesthetics - color, size, position, and other changes that make the plot more user-friendly in appearance
Information - we can provide suitable titles for the plot and axes, axis labels, or data values to convey as much information as possible about the data / variables considered
The following is an indicative guideline for choosing a plot based on the number of variables (dimension) and the type of variables.
library(kableExtra)
p1=c("Plot 1","Plot 2","Plot 3","Plot 4")
p2=c("1-D","2-D More","2-D More","2-D More")
p3=c("Numeric","Numeric-Grouped by Factor (Dichot/Polychot)","Numeric and Numeric","Binary (Dichot/Polychot)")
p4=c("Histogram, Box plot, Density, Scatter Plot","Histogram, Box plot, Density, Scatter Plot","Scatter Plot","Bar plot (1-D, 2-D - Factor and Factor - Stack, Side), Pie Chart, Donut")
plda=as.data.frame(cbind(p1,p2,p3,p4))
colnames(plda)=c("Plots","Dimension","Variable Type","Plot Type")
kable(rbind(plda)) %>% kable_styling()
| Plots | Dimension | Variable Type | Plot Type |
|---|---|---|---|
| Plot 1 | 1-D | Numeric | Histogram, Box plot, Density, Scatter Plot |
| Plot 2 | 2-D More | Numeric-Grouped by Factor (Dichot/Polychot) | Histogram, Box plot, Density, Scatter Plot |
| Plot 3 | 2-D More | Numeric and Numeric | Scatter Plot |
| Plot 4 | 2-D More | Binary (Dichot/Polychot) | Bar plot (1-D, 2-D - Factor and Factor - Stack, Side), Pie Chart, Donut |
We shall consider the Credit data set from ISLR, a package in R. The following table provides the metadata (variables and their nature) of the Credit data set.
library(kableExtra)
da1=ISLR::Credit
meda1=1:12
meda2=as.vector(names(da1))
meda3=c("Int/Char","Numeric","Int","Int","Int","Int","Int","Binary Factor","Binary Factor","Binary Factor","Poly Factor","Int")
meda4=c("Nil","Boxplot","Histogram","Histogram","Histogram","Histogram","Histogram","Bar Plot","Bar Plot","Bar Plot","Pie","Density")
meda=as.data.frame(cbind(meda1,meda2,meda3,meda4))
colnames(meda)=c("SNo","Variables","Type","VisualSummary")
kable(rbind(meda)) %>% kable_styling()
| SNo | Variables | Type | VisualSummary |
|---|---|---|---|
| 1 | ID | Int/Char | Nil |
| 2 | Income | Numeric | Boxplot |
| 3 | Limit | Int | Histogram |
| 4 | Rating | Int | Histogram |
| 5 | Cards | Int | Histogram |
| 6 | Age | Int | Histogram |
| 7 | Education | Int | Histogram |
| 8 | Gender | Binary Factor | Bar Plot |
| 9 | Student | Binary Factor | Bar Plot |
| 10 | Married | Binary Factor | Bar Plot |
| 11 | Ethnicity | Poly Factor | Pie |
| 12 | Balance | Int | Density |
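The types in the table above were entered by hand. As a cross-check, the following minimal sketch reads the class of each column directly from the data. It assumes the ISLR package is installed; `var_info` is a name introduced here for illustration only.

```r
# Load the Credit data set (assumes the ISLR package is installed)
da1 <- ISLR::Credit

# For every column, record its first class and, for factors, the number of levels
var_info <- data.frame(
  Variable = names(da1),
  Class    = vapply(da1, function(col) class(col)[1], character(1)),
  Levels   = vapply(da1,
                    function(col) if (is.factor(col)) nlevels(col) else NA_integer_,
                    integer(1)),
  row.names = NULL
)

print(var_info)
```

A factor with two levels corresponds to "Binary Factor" in the table and a factor with more levels to "Poly Factor".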
We shall illustrate each type of plot (listed in the first table) with suitable numeric / factor variables; a count variable can be treated as numeric for this purpose. In the Credit data set, Cards is a count variable that records the number of credit cards a person owns. We use the plots for numeric variables to obtain visual summaries of this variable, Cards.
Also, the order of learning follows the procedure outlined above: a base plot, followed by suitable enhancements in Context, Aesthetics, and Information.
Remarks
Plots are presented as illustrative cases with code. Readers are encouraged to go beyond the plots and options shown
The code may be reused
In the subsequent discussions, plots are two-dimensional. The word variate refers to the number of variables used in a plot
Univariate: When the plot has only one variable
Multivariate: More than one variable
Plots can further be grouped based on type of the variables
For a single numeric variable we can attempt a scatter plot, histogram, box plot, or density plot. A scatter plot of a single numeric variable needs values for both axes, so we usually take the row index (1 to the number of rows of the data set) as the x-axis value; the y axis holds the (numeric) variable of interest.
library(ggplot2)
library(GGally)
library(corrplot)
library(ggpubr)
library(ggridges)
library(gridExtra)
ggplot(da1,aes(x=1:nrow(da1),y=Balance))+
geom_point()
Aesthetics: Changes in Color, Size, and Shape
ggplot(da1,aes(x=1:nrow(da1),y=Balance))+
geom_point(size=5, shape=3, color="darkgreen")
Information - Title, Subtitle, and Caption
ggplot(da1,aes(x=1:nrow(da1),y=Balance))+
geom_point(size=5, shape=2, color="maroon")+
labs(title="Balance",
subtitle="Average credit card balance in $",
caption="From ISLR - Credit data set",
x="Row Index",y="Balance in $")
More options using theme() for formatting axis labels, titles
ggplot(da1,aes(x=1:nrow(da1),y=Balance))+
geom_point(size=1.5, shape=2, color="maroon")+
labs(title="Balance",
subtitle="Average credit card balance in $",
caption="From ISLR - Credit data set",
x="Row Index",y="Balance in $")+
theme(plot.title=element_text(hjust=0.5,vjust=0.5,color="tomato",size=20, face="bold.italic"),
plot.subtitle=element_text(size=18, face="bold",hjust=0.5),
plot.caption=element_text(size=18,face="bold.italic",color="blue3"),
axis.title.x=element_text(color="blue", size=20, face="bold",vjust=10, hjust=0.1),
axis.title.y=element_text(color="brown", size=20, face="bold",angle = 270),
axis.text.x=element_text(size=20, angle = 30,vjust=.5),
axis.text.y=element_text(size=20))
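Since the same theme() settings recur in many of the plots, one option is to store them once and add the stored object to each plot. The sketch below is only an illustration of that idea; `notes_theme` and `p_themed` are hypothetical names not used elsewhere in these notes, and the settings are a simplified subset of those above.

```r
library(ggplot2)

da1 <- ISLR::Credit

# Hypothetical reusable theme object: defined once, added to any plot with "+"
notes_theme <- theme(
  plot.title    = element_text(hjust = 0.5, color = "tomato", size = 20,
                               face = "bold.italic"),
  plot.subtitle = element_text(hjust = 0.5, size = 18, face = "bold"),
  plot.caption  = element_text(size = 18, face = "bold.italic", color = "blue3"),
  axis.title    = element_text(size = 20, face = "bold"),
  axis.text     = element_text(size = 20)
)

# The same index scatter plot as above, styled with the stored theme
p_themed <- ggplot(da1, aes(x = seq_len(nrow(da1)), y = Balance)) +
  geom_point(size = 1.5, shape = 2, color = "maroon") +
  labs(title = "Balance",
       subtitle = "Average credit card balance in $",
       caption = "From ISLR - Credit data set",
       x = "Row Index", y = "Balance in $") +
  notes_theme

p_themed
```

Changing a font size in `notes_theme` then updates every plot that uses it.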
Contextual Changes
Depending on the context of the data set / variable of interest, we may be interested to consider a partial set of values. Or, we may use mathematical transformations (for example, logarithm) of a variable.
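Let us focus on Balance values of 100 or more. We use coord_cartesian(), which zooms the y axis to the chosen range without removing the remaining observations from the data.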
ggplot(da1,aes(x=1:nrow(da1),y=Balance))+
geom_point(size=5, shape=2, color="maroon")+
coord_cartesian(ylim=c(100,max(da1$Balance)))+
labs(title="Balance",
subtitle="Average credit card balance in $",
caption="From ISLR - Credit data set",
x="Row Index",y="Balance in $ (>100)")
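Two other contextual changes mentioned above can be sketched in the same way: removing rows before plotting, and plotting a logarithmic transformation. This is a minimal sketch, not part of the original notes. The names `da_sub`, `p_filter`, and `p_log` are introduced here, and log1p() (that is, log(1 + x)) is chosen on the assumption that some balances may be zero, where a plain logarithm is undefined.

```r
library(ggplot2)

da1 <- ISLR::Credit

# (a) Filtering: rows with Balance below 100 are dropped from the data itself,
#     so the row index is recomputed on the smaller data set
da_sub <- subset(da1, Balance >= 100)
p_filter <- ggplot(da_sub, aes(x = seq_len(nrow(da_sub)), y = Balance)) +
  geom_point(shape = 2, color = "maroon") +
  labs(title = "Balance (rows with Balance >= 100 only)",
       x = "Row Index (filtered data)", y = "Balance in $")

# (b) Transformation: all rows are kept, but the y axis shows log(1 + Balance)
p_log <- ggplot(da1, aes(x = seq_len(nrow(da1)), y = log1p(Balance))) +
  geom_point(shape = 2, color = "maroon") +
  labs(title = "Balance on a logarithmic scale",
       x = "Row Index", y = "log(1 + Balance)")

p_filter
p_log
```

Filtering changes the data behind the plot, whereas coord_cartesian() changes only the visible window.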
More Univariate plots
Box plot
The variable of interest is placed on the x axis; the plot is horizontal
ggplot(da1,aes(x=Balance))+
geom_boxplot(col="darkred")+
labs(title="Balance",
subtitle="Average credit card balance in $",
caption="From ISLR - Credit data set",
y=" ")+
theme(plot.title=element_text(hjust=0.5,vjust=0.5,color="blue",size=20, face="bold.italic"),
plot.subtitle=element_text(size=18,face="bold",hjust=0.5), plot.caption=element_text(size=18,face="bold.italic",color="blue3"),
axis.title.x=element_text(color="blue", size=20, face="bold"), axis.title.y=element_text(color="brown", size=20, face="bold"), axis.text.x=element_text(size=20), axis.text.y=element_text(size=20))
The variable of interest is placed on the y axis; the plot is vertical
ggplot(da1,aes(y=Balance))+
geom_boxplot(fill="yellow")+
labs(title="Balance",
subtitle="Average credit card balance in $",
caption="From ISLR - Credit data set",
y=" ")+
theme(plot.title=element_text(hjust=0.5,vjust=0.5,color="blue",size=20, face="bold.italic"),
plot.subtitle=element_text(size=18,face="bold",hjust=0.5), plot.caption=element_text(size=18,face="bold.italic",color="blue3"),
axis.title.x=element_text(color="blue", size=20, face="bold"), axis.title.y=element_text(color="brown", size=20, face="bold"), axis.text.x=element_text(size=20), axis.text.y=element_text(size=20))
Histogram and Density Plot
ggplot(da1,aes(x=Age))+
geom_histogram(col="darkred",fill="yellow")+
labs(title="Age",
subtitle="Age Distribution",
caption="From ISLR - Credit data set",
y=" ")+
theme(plot.title=element_text(hjust=0.5,vjust=0.5,color="blue",size=20, face="bold.italic"),
plot.subtitle=element_text(size=18,face="bold",hjust=0.5), plot.caption=element_text(size=18,face="bold.italic",color="blue3"),
axis.title.x=element_text(color="blue", size=20, face="bold"), axis.title.y=element_text(color="brown", size=20, face="bold"), axis.text.x=element_text(size=20), axis.text.y=element_text(size=20))
ggplot(da1,aes(x=Age))+
geom_density(fill="pink",col="tan3")+
labs(title="Age",
subtitle="Age Distribution",
caption="From ISLR - Credit data set",
y=" ")+
theme(plot.title=element_text(hjust=0.5,vjust=0.5,color="blue",size=20, face="bold.italic"),
plot.subtitle=element_text(size=18,face="bold",hjust=0.5), plot.caption=element_text(size=18,face="bold.italic",color="blue3"),
axis.title.x=element_text(color="blue", size=20, face="bold"), axis.title.y=element_text(color="brown", size=20, face="bold"), axis.text.x=element_text(size=20), axis.text.y=element_text(size=20))
Histogram with density plot
ggplot(da1,aes(x=Age))+
geom_histogram(aes(y = ..density..),
col="darkred",fill="yellow",
position = "identity")+
geom_density(col="tan3",size = 2)+
labs(title="Age",
subtitle="Age Distribution",
caption="From ISLR - Credit data set",
y=" ")+
theme(plot.title=element_text(hjust=0.5,vjust=0.5,color="blue",size=20, face="bold.italic"),
plot.subtitle=element_text(size=18,face="bold",hjust=0.5), plot.caption=element_text(size=18,face="bold.italic",color="blue3"),
axis.title.x=element_text(color="blue", size=20, face="bold"), axis.title.y=element_text(color="brown", size=20, face="bold"), axis.text.x=element_text(size=20), axis.text.y=element_text(size=20))
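In ggplot2 3.4.0 and later, the `..density..` notation and the `size` argument for lines are deprecated in favor of after_stat(density) and `linewidth`. The following sketch redraws the same overlay with the newer syntax, assuming such a version is installed. The bin width of 5 years is an illustrative choice, not a value from the original notes; without it, geom_histogram() falls back to 30 bins and prints a message.

```r
library(ggplot2)

da1 <- ISLR::Credit

p_hist <- ggplot(da1, aes(x = Age)) +
  # after_stat(density) rescales bar heights so the bars have total area 1
  geom_histogram(aes(y = after_stat(density)),
                 binwidth = 5,              # illustrative bin width (years)
                 col = "darkred", fill = "yellow") +
  # linewidth replaces size for line thickness
  geom_density(col = "tan3", linewidth = 2) +
  labs(title = "Age",
       subtitle = "Age Distribution",
       caption = "From ISLR - Credit data set",
       y = "Density")

p_hist
```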
A Metric variable categorized by factor variables (binary / polychotomous) and paired with one or more numeric variables.
Color, size, and / or shape can be used to show the grouping in the plot
The distribution of a metric variable (Balance) is grouped by Ethnicity using the color of the geometric object (here, point)
ggplot(da1,aes(x=1:nrow(da1),y=Balance))+
geom_point(aes(color=Ethnicity),size=3)+
theme(legend.position = "bottom")+
labs(title="Balance by Ethnicity",
subtitle="Average credit card balance in $",
caption="From ISLR - Credit data set",
x="Row Index",y="Balance in $")
Another variable Gender is added through the size of the object
ggplot(da1,aes(x=1:nrow(da1),y=Balance))+
geom_point(aes(color=Ethnicity,size=Gender))+
theme(legend.position = "bottom")+
labs(title="Balance by Ethnicity and Gender",
subtitle="Average credit card balance in $",
caption="From ISLR - Credit data set",
x="Row Index",y="Balance in $")
A numeric variable Income is used for a similar task
ggplot(da1,aes(x=1:nrow(da1),y=Balance))+
geom_point(aes(color=Gender,size=Income))+
theme(legend.position = "bottom")+
labs(title="Balance by Gender and Income",
subtitle="Average credit card balance in $",
caption="From ISLR - Credit data set",
x="Row Index",y="Balance in $")
When we have more than one numeric variable, such as Balance and Income, we may choose a scatter plot; other factor variables may be used for a more detailed and simultaneous comparison
ggplot(da1,aes(x=Income,y=Balance))+
geom_point(aes(color=Gender,size=Ethnicity))+
theme(legend.position = "bottom")+
labs(title="Income and Balance",
subtitle="Average credit card balance in $",
caption="From ISLR - Credit data set",
x="Income",y="Balance in $")
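Shape is listed above as a third way to show a factor, but it is not illustrated. A minimal sketch, assuming the same Credit data, maps Ethnicity to shape instead of size; `p_shape` is a name introduced here. Shape suits a factor better than size, because ggplot2 warns that size is not advised for a discrete variable.

```r
library(ggplot2)

da1 <- ISLR::Credit

p_shape <- ggplot(da1, aes(x = Income, y = Balance)) +
  # color distinguishes Gender, point shape distinguishes Ethnicity
  geom_point(aes(color = Gender, shape = Ethnicity), size = 3) +
  theme(legend.position = "bottom") +
  labs(title = "Income and Balance",
       subtitle = "Average credit card balance in $",
       caption = "From ISLR - Credit data set",
       x = "Income", y = "Balance in $")

p_shape
```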
Using ggpubr and ggridges packages
These packages are useful for creating multiple plots of a numeric variable split by a factor variable: ggdensity() comes from ggpubr and geom_density_ridges() from ggridges.
The facet option is used to bring in more variables beyond the two axes. In general, faceting is attempted with factor variables.
ggdensity(da1, x = "Balance",
add = "mean",
color = "Gender", fill = "Gender",
palette = c("darkgreen", "yellow"))
# Ridgeline densities of Balance, one ridge per Ethnicity
ggplot(da1, aes(x = Balance, y = Ethnicity)) +
  geom_density_ridges(aes(fill = Ethnicity)) +
  theme(legend.position = "none")
# Student levels as facet columns
ggplot(da1, aes(x = Balance, y = Ethnicity)) +
  geom_density_ridges(aes(fill = Ethnicity)) +
  theme(legend.position = "none")+
  facet_grid(~Student)
# Student levels as facet rows
ggplot(da1, aes(x = Balance, y = Ethnicity)) +
  geom_density_ridges(aes(fill = Ethnicity)) +
  theme(legend.position = "none")+
  facet_grid(Student~.)
# Student in rows, Ethnicity in columns
ggplot(da1, aes(x = Balance, y = Married)) +
  geom_density_ridges(aes(fill = Married)) +
  theme(legend.position = "none")+
  facet_grid(Student~Ethnicity)+
  scale_fill_manual(values = c("violet", "green3"))
# Two factors (Ethnicity and Gender) combined in the columns
ggplot(da1, aes(x = Balance, y = Married)) +
  geom_density_ridges(aes(fill = Married)) +
  theme(legend.position = "none")+
  facet_grid(Student~Ethnicity+Gender)+
  scale_fill_manual(values = c("violet", "green3"))
# Two factors (Student and Gender) combined in the rows
ggplot(da1, aes(x = Balance, y = Married)) +
  geom_density_ridges(aes(fill = Married)) +
  theme(legend.position = "none")+
  facet_grid(Student+Gender~Ethnicity)+
  scale_fill_manual(values = c("maroon", "yellow"))
# A count variable (Cards) converted to a factor for the ridges
ggplot(da1, aes(x = Balance,y=as.factor(Cards))) +
  geom_density_ridges(aes(fill = as.factor(Cards))) +
  theme(legend.position = "none")+
  facet_grid(Gender~ Ethnicity)+
  labs(y="Number of Cards")
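facet_grid() lays panels out by rows and columns. A related ggplot2 function, facet_wrap(), is not used in these notes but may be convenient when there is a single faceting variable. The sketch below is an illustration only; `p_wrap` and the bin width of 100 are choices made here.

```r
library(ggplot2)

da1 <- ISLR::Credit

p_wrap <- ggplot(da1, aes(x = Balance)) +
  geom_histogram(binwidth = 100,            # illustrative bin width in $
                 col = "darkred", fill = "yellow") +
  # one panel per Ethnicity level, stacked in a single column
  facet_wrap(~ Ethnicity, ncol = 1) +
  labs(title = "Balance by Ethnicity",
       caption = "From ISLR - Credit data set",
       x = "Balance in $", y = "Count")

p_wrap
```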
We shall consider the plots for qualitative / categorical / factor variables
Simple Bar Plot
ggplot(da1,aes(Ethnicity))+
geom_bar(col="green",fill="yellow",size=2)
Bar Plot with enhancements - Information. Frequency counts may be added in each class/level of the factor variable
ggplot(da1,aes(Ethnicity))+
geom_bar(col="green",fill="yellow")+
geom_text(aes(label = ..count..), stat = "count",
vjust=3, size=4,color='darkred',fontface="bold")+
labs(title="Distribution of Ethnicity",caption ="From ISLR - Credit data set")+
theme(plot.title=element_text(hjust=0.5,vjust=0.5,color="tomato",size=14, face="bold.italic"),
plot.caption=element_text(size=12,face="bold.italic",color="tan3"),
axis.title.x=element_text(color="blue", size=10, face="bold",vjust=10, hjust=0.1),
axis.title.y=element_text(color="brown", size=14, face="bold",angle = 270),
axis.text.x=element_text(size=10, angle = 30),
axis.text.y=element_text(size=10))
Bar plots using more than one factor variable
ggplot(da1,aes(x=Ethnicity,fill=Gender))+
geom_bar()+
scale_fill_manual(values=c("yellow", "green"))+
geom_text(aes(label = ..count..), stat = "count",
position = position_stack(vjust = 0.5),
size=4,color='darkred',fontface="bold")+
labs(title="Distribution of Ethnicity",
subtitle = "Gender Wise",
caption ="From ISLR - Credit data set")+
theme(legend.position = "bottom")
Bar plots using more than one factor variable and the use of position_dodge()
ggplot(da1,aes(x=Ethnicity,fill=Gender))+
geom_bar(position=position_dodge())+
geom_text(aes(label = ..count..), stat = "count",
position = position_dodge(width = 0.9),
vjust=1.5,size=4,color='darkred',fontface="bold")+
labs(title="Distribution of Ethnicity",
subtitle = "Gender Wise",
caption ="From ISLR - Credit data set")+
theme(legend.position = "bottom")
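The stacked and side-by-side bar plots above compare counts. When the groups differ in size, shares may be easier to compare. A minimal sketch, assuming the scales package (installed with ggplot2) is available, uses position = "fill"; `p_fill` and `share_tab` are names introduced here.

```r
library(ggplot2)

da1 <- ISLR::Credit

# Row-wise proportions: share of each Gender within every Ethnicity level
share_tab <- prop.table(table(da1$Ethnicity, da1$Gender), margin = 1)
print(round(share_tab, 2))

p_fill <- ggplot(da1, aes(x = Ethnicity, fill = Gender)) +
  # position = "fill" rescales every bar to height 1
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::label_percent()) +
  scale_fill_manual(values = c("yellow", "green")) +
  labs(title = "Distribution of Ethnicity",
       subtitle = "Gender Wise (shares)",
       caption = "From ISLR - Credit data set",
       y = "Share within Ethnicity") +
  theme(legend.position = "bottom")

p_fill
```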
Pie chart
ggplot(da1,
aes(x = factor(""), fill = Ethnicity) ) +
geom_bar() +
coord_polar(theta = "y") +
scale_x_discrete("")+
theme(axis.ticks=element_blank(),
axis.title=element_blank(),
axis.text.y=element_blank(),
axis.text.x=element_blank(),
panel.grid = element_blank(),
legend.position = "bottom")+
labs(title = "Ethnicity",
caption="Credit Data Set from ISLR")
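The guideline table also lists a donut chart, which is not illustrated in these notes. One common way to draw it, sketched here under the assumption that the counts are tabulated first, is to leave empty space on the x axis before switching to polar coordinates. The names `eth_counts` and `p_donut` are introduced for illustration.

```r
library(ggplot2)

da1 <- ISLR::Credit

# Frequency of each Ethnicity level (columns: Ethnicity, Freq)
eth_counts <- as.data.frame(table(Ethnicity = da1$Ethnicity))

p_donut <- ggplot(eth_counts, aes(x = 2, y = Freq, fill = Ethnicity)) +
  geom_col(width = 1, color = "white") +   # one stacked bar spanning x = 1.5 to 2.5
  coord_polar(theta = "y") +               # bar heights become angles
  xlim(0.5, 2.5) +                         # unused range 0.5 to 1.5 forms the hole
  theme_void() +
  theme(legend.position = "bottom") +
  labs(title = "Ethnicity",
       caption = "Credit Data Set from ISLR")

p_donut
```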
When we deal with more than one numeric variable, it is common practice to study their association (relation) using a scatter plot and the correlation, a numerical summary of the association.
Using GGally and corrplot packages
ggpairs(da1,columns=c(2:6,12),
lower = list(continuous = wrap("points",
color = "darkgreen", alpha = 0.25,
size=2,
shape=5)),
diag = list(continuous = wrap("densityDiag",
color = "red",
fill="yellow")))
res_corr=cor(da1[,c(2:6,12)])
#basic plot
corrplot(res_corr)
#desired enhancements
corrplot(res_corr,type="lower")
corrplot(res_corr,type="lower",diag=F)
corrplot(res_corr,method="number",
type="lower",
number.digits=2,
cl.lim=c(-1,1),
col=colorRampPalette(c("darkblue","red","darkgreen"))(10) ,tl.pos="n")
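Alongside the plots, the correlations of one variable of interest can be read off the same matrix. The sketch below is an addition for illustration: it selects the columns by name (the same columns as positions 2:6 and 12) and sorts the correlations with Balance; `num_cols` and `bal_corr` are names introduced here.

```r
da1 <- ISLR::Credit

# The numeric columns used above, written by name instead of by position
num_cols <- c("Income", "Limit", "Rating", "Cards", "Age", "Balance")
res_corr <- cor(da1[, num_cols])

# Correlation of Balance with each of the other variables, largest first
bal_corr <- sort(res_corr["Balance", setdiff(num_cols, "Balance")],
                 decreasing = TRUE)
print(round(bal_corr, 2))
```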
We use the gridExtra package to arrange many plots in a layout specified by the number of rows and columns
Balanced Layout
If we plan to arrange an even / non-prime number of plots, the arguments are direct to specify; for example, 16 plots may be arranged in a \(4\times4\) layout, which gives a balanced appearance. Any even / non-prime number of plots can fill a grid of rows and columns without white space because the number factors into rows and columns exactly.
For example, we arrange four plots in a balanced layout with two rows and two columns (\(2\times2\))
PL1=ggplot(da1,aes(x=1:nrow(da1),y=Balance))+
geom_point()+
labs(title="Balance",
subtitle="Average credit card balance in $",
caption="From ISLR - Credit data set",
x="Row Index",y="Balance in $")
PL2=ggplot(da1,aes(x=Age))+
geom_histogram(aes(y = ..density..),
col="darkred",fill="yellow",
position = "identity")+
geom_density(col="tan3",size = 2)
PL3=ggplot(da1, aes(x = Balance, y = Ethnicity)) +
geom_density_ridges(aes(fill = Ethnicity)) +
theme(legend.position = "none")
PL4=ggplot(da1,aes(x=Ethnicity,fill=Gender))+
geom_bar()+
scale_fill_manual(values=c("yellow", "green"))+
geom_text(aes(label = ..count..), stat = "count",
position = position_stack(vjust = 0.5),
size=4,color='darkred',fontface="bold")+
labs(title="Distribution of Ethnicity",
subtitle = "Gender Wise",
caption ="From ISLR - Credit data set")+
theme(legend.position = "bottom")
grid.arrange(PL1,PL2,PL3,PL4,nrow=2,ncol=2)
Imbalanced Layout
On the other hand, if the number of plots is an odd prime number, the above specification will result in undesirable white space because the plots cannot fill a grid with more than one row and column.
For example, we arrange three plots in a layout with two rows and two columns (\(2\times2\)). The cell in the second row, second column will then be white space because it has no plot to hold
PL1=ggplot(da1,aes(x=1:nrow(da1),y=Balance))+
geom_point()+
labs(title="Balance",
subtitle="Average credit card balance in $",
caption="From ISLR - Credit data set",
x="Row Index",y="Balance in $")
PL2=ggplot(da1,aes(x=Age))+
geom_histogram(aes(y = ..density..),
col="darkred",fill="yellow",
position = "identity")+
geom_density(col="tan3",size = 2)
PL3=ggplot(da1, aes(x = Balance, y = Ethnicity)) +
geom_density_ridges(aes(fill = Ethnicity)) +
theme(legend.position = "none")
grid.arrange(PL1,PL2,PL3,nrow=2,ncol=2)
To avoid such unwanted white space in the layout, we can use grid.arrange function from gridExtra package with additional argument layout_matrix. This option provides a simple way to plan a possible layout to arrange the odd number of plots.
In the above case with three plots, let us plan one of the possible arrangements as follows
We have a \(2\times2\) layout
Plot 1: First row, First column
Plot 2: First row, Second column
Plot 3: Second row, but stretched in two columns
This may be provided as input in the argument layout_matrix of the function grid.arrange in gridExtra package. The syntax will be
\[\textrm{layout_matrix=rbind(c(1,2), c(3,3))}\] This will yield the desired output as
PL1=ggplot(da1,aes(x=1:nrow(da1),y=Balance))+
geom_point()+
labs(title="Balance",
subtitle="Average credit card balance in $",
caption="From ISLR - Credit data set",
x="Row Index",y="Balance in $")
PL2=ggplot(da1,aes(x=Age))+
geom_histogram(aes(y = ..density..),
col="darkred",fill="yellow",
position = "identity")+
geom_density(col="tan3",size = 2)
PL3=ggplot(da1, aes(x = Balance, y = Ethnicity)) +
geom_density_ridges(aes(fill = Ethnicity)) +
theme(legend.position = "none")
grid.arrange(PL1,PL2,PL3,nrow=2,ncol=2,
layout_matrix=rbind(c(1,2), c(3,3)))
Another possible layout, with plot 1 stretched over both rows of the first column, is
PL1=ggplot(da1,aes(x=1:nrow(da1),y=Balance))+
geom_point()+
labs(title="Balance",
subtitle="Average credit card balance in $",
caption="From ISLR - Credit data set",
x="Row Index",y="Balance in $")
PL2=ggplot(da1,aes(x=Age))+
geom_histogram(aes(y = ..density..),
col="darkred",fill="yellow",
position = "identity")+
geom_density(col="tan3",size = 2)
PL3=ggplot(da1, aes(x = Balance, y = Ethnicity)) +
geom_density_ridges(aes(fill = Ethnicity)) +
theme(legend.position = "none")
grid.arrange(PL1,PL2,PL3,nrow=2,ncol=2,
layout_matrix=rbind(c(1,2), c(1,3)))
If we have five plots in a single layout, we may plan a \(3\times2\) layout and position the plots in a similar way. The syntax \(\textrm{layout_matrix=rbind(c(1,2), c(3,4), c(5,5))}\) will place plots 1 and 2 in the first row, plots 3 and 4 in the second row, and plot 5 stretched across the third row.
PL1=ggplot(da1,aes(x=1:nrow(da1),y=Balance))+
geom_point()+
labs(title="Balance",
subtitle="Average credit card balance in $",
caption="From ISLR - Credit data set",
x="Row Index",y="Balance in $")
PL2=ggplot(da1,aes(x=Age))+
geom_histogram(aes(y = ..density..),
col="darkred",fill="yellow",
position = "identity")+
geom_density(col="tan3",size = 2)
PL3=ggplot(da1, aes(x = Balance, y = Ethnicity)) +
geom_density_ridges(aes(fill = Ethnicity)) +
theme(legend.position = "none")
PL4= ggplot(da1, aes(x = Balance)) +
geom_density(aes(fill = Student,col=Student)) +
theme(legend.position = "bottom")
PL5= ggplot(da1, aes(x = Balance, y = Education)) +
geom_point(aes(col = Student),size=2) +
theme(legend.position = "bottom")
grid.arrange(PL1,PL2,PL3,PL4,PL5, nrow=3,ncol=2,
layout_matrix=rbind(c(1,2), c(3,4),c(5,5)))
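The layout matrices above follow one pattern: fill the grid row by row and let the last plot take any cells left over. A small helper can build such a matrix for any number of plots. This is only a sketch; `make_layout` is a hypothetical function written for these notes, not part of gridExtra.

```r
# Hypothetical helper: build a layout_matrix for n_plots plots in n_col columns.
# Cells are numbered row by row; surplus cells repeat the last plot number,
# so the last plot is stretched over the remainder of the final row.
make_layout <- function(n_plots, n_col = 2) {
  n_row <- ceiling(n_plots / n_col)
  cells <- seq_len(n_row * n_col)
  cells[cells > n_plots] <- n_plots
  matrix(cells, nrow = n_row, ncol = n_col, byrow = TRUE)
}

lay3 <- make_layout(3)   # rows (1, 2) and (3, 3)
lay5 <- make_layout(5)   # rows (1, 2), (3, 4), and (5, 5)
print(lay3)
print(lay5)

# Possible use with the plots defined above (not run here):
# grid.arrange(PL1, PL2, PL3, PL4, PL5, layout_matrix = make_layout(5))
```

The helper always stretches the last plot; other arrangements, such as stretching plot 1 over a column, still need a matrix written by hand.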
Visualizing a variable or a set of variables requires a clear idea of what to look for: the spread and shape of a variable and the association between variables. This material may help in taking a methodical approach to visualization exercises using the ggplot2 package and a few more packages. Note that this material is only an overview of producing plots with ggplot2; more options can be explored with additional examples, perhaps in a self-paced manner.
We should now be ready to generate plots using ggplot2 and the other necessary packages. That is, we should be able to
know the different plots based on the nature of the variables
create basic plots
use appropriate enhancements in the appearance of the plot and its components such as axes, text, labels, and titles
plan multiple plots in the required layouts with the available options
increase the dimensions of a plot by adding as many variables as required for comparison, while retaining the readability of the plot for better insights