Introduction

ggplot2, by Hadley Wickham, is an excellent and flexible package for elegant data visualization in R. However, the default plots require some formatting before they are ready for publication. Furthermore, the syntax for customizing a ggplot can be opaque, which raises the level of difficulty for researchers without advanced R programming skills.

The {ggpubr} package provides easy-to-use functions for creating and customizing 'ggplot2'-based, publication-ready plots.

Find out more at https://rpkgs.datanovia.com/ggpubr.

Why {ggpubr}?

  • The syntax is simpler compared to ggplot2.
  • Creates publication-ready plots with minimal code.
  • It can add p-values and significance levels to box plots and line plots automatically.
  • Annotating plots is straightforward.
  • You can easily play with colors and labels of the plot.

Install ggpubr in R

# Install the required package
install.packages('ggpubr')

Load the Package

# Load the package 
library(tidyverse) 
library(gt)
library(ggpubr)
library(ggsci)
library(gridExtra)

Loading and Exploring Data

# Load data into R 
data <- read.csv("../data/pulse_data.csv")
# Explore first few rows of the data
data %>% 
  head() %>% 
  gt()
Height  Weight  Age  Gender  Smokes  Alcohol  Exercise  Ran  Pulse1  Pulse2  BMI       BMICat
1.73    57      18   Female  No      Yes      Moderate  No   86      88      19.04507  Underweight
1.79    58      19   Female  No      Yes      Moderate  Yes  82      150     18.10181  Underweight
1.67    62      18   Female  No      Yes      High      Yes  96      176     22.23099  Normal
1.95    84      18   Male    No      Yes      High      No   71      73      22.09073  Normal
1.73    64      18   Female  No      Yes      Low       No   90      88      21.38394  Normal
1.84    74      22   Male    No      Yes      Low       Yes  78      141     21.85728  Normal
# Check Data Structure 
glimpse(data)
Rows: 108
Columns: 12
$ Height   <dbl> 1.73, 1.79, 1.67, 1.95, 1.73, 1.84, 1.62, 1.69, 1.64, 1.68, 1…
$ Weight   <dbl> 57, 58, 62, 84, 64, 74, 57, 55, 56, 60, 75, 58, 68, 59, 72, 1…
$ Age      <int> 18, 19, 18, 18, 18, 22, 20, 18, 19, 23, 20, 19, 22, 18, 18, 2…
$ Gender   <chr> "Female", "Female", "Female", "Male", "Female", "Male", "Fema…
$ Smokes   <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "…
$ Alcohol  <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes"…
$ Exercise <chr> "Moderate", "Moderate", "High", "High", "Low", "Low", "Modera…
$ Ran      <chr> "No", "Yes", "Yes", "No", "No", "Yes", "No", "No", "No", "Yes…
$ Pulse1   <dbl> 86, 82, 96, 71, 90, 78, 68, 71, 68, 88, 76, 74, 70, 78, 69, 7…
$ Pulse2   <dbl> 88, 150, 176, 73, 88, 141, 72, 77, 68, 150, 88, 76, 71, 82, 6…
$ BMI      <dbl> 19.04507, 18.10181, 22.23099, 22.09073, 21.38394, 21.85728, 2…
$ BMICat   <chr> "Underweight", "Underweight", "Normal", "Normal", "Normal", "…

Plot One Variable – X, Continuous

Histogram

4 Main Aspects

  • Shape: Overall appearance of histogram. Can be symmetric, bell-shaped, left skewed, right skewed, etc.

  • Center: Mean or Median

  • Spread: How widely the data vary, measured by the range, interquartile range (IQR), standard deviation, or variance.

  • Outliers: Data points that fall far from the bulk of the data
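The four aspects above can also be checked numerically before plotting. The following is a minimal base R sketch; the vector `x` holds made-up values (it is not the pulse data), and the 1.5 × IQR rule for outliers is a common rule of thumb rather than anything ggpubr itself applies.

```r
# Hypothetical BMI-like values, made up for illustration only
x <- c(18, 19, 20, 21, 21, 22, 22, 23, 24, 40)

# Center: mean and median
center <- c(mean = mean(x), median = median(x))

# Spread: range width, interquartile range, standard deviation, variance
spread <- c(range = diff(range(x)), iqr = IQR(x), sd = sd(x), var = var(x))

# Outliers: points more than 1.5 * IQR beyond the quartiles (rule of thumb)
q <- quantile(x, c(0.25, 0.75))
fence <- 1.5 * IQR(x)
outliers <- x[x < q[1] - fence | x > q[2] + fence]

# Shape: a mean above the median hints at a right-skewed distribution
right_skew_hint <- mean(x) > median(x)
```

Here the single large value pulls the mean above the median, which is the kind of right skew a histogram would show as a long tail on the right.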

gghistogram(data, x = "BMI")
Warning: Using `bins = 30` by default. Pick better value with the argument
`bins`.

# Change the number of bins
gghistogram(data, x = "BMI", bins = 15)

# Color 
gghistogram(data, x = "BMI", bins = 15, color = "Gender")

# fill 
gghistogram(data, x = "BMI", bins = 15, color = "Gender", fill="Gender")

# Add statistics 
gghistogram(data, x = "BMI", bins = 15, color = "Gender", fill="Gender", add = "mean")

# Add rug  
gghistogram(data, x = "BMI", bins = 15, color = "Gender", fill="Gender", add = "mean", rug = TRUE)

# Add density curve
gghistogram(data, x = "BMI", bins = 15, color = "Gender", fill="Gender", add = "mean", rug = TRUE, add_density = TRUE)

# Add palette  
gghistogram(data, x = "BMI", bins = 15, color = "Gender", fill="Gender", add = "mean", rug = TRUE, add_density = TRUE, palette = c("#00AFBB", "#E7B800")) 

Density Plots

  • Density plots are another way of getting a quick idea of the distribution of each attribute.

  • The plots look like an abstracted histogram with a smooth curve drawn through the top of each bin, much as your eye tries to do with a histogram.

# Create density plot 
ggdensity(data, x = "Height")

# Fill by Gender
ggdensity(data, x = "Height", fill="Gender")

# Color by a categorical variable 
ggdensity(data, x = "Height", fill="Gender", color = "Gender")

# Add rug 
ggdensity(data, x = "Height", fill="Gender", color = "Gender", rug = TRUE)

# Add statistics 
ggdensity(data, x = "Height", fill="Gender", color = "Gender", rug = TRUE, add = "median")

# Combine density plots with histogram
gghistogram(data, x = "Height", bins = 15, color = "Gender", fill="Gender", rug = TRUE, add = "mean", add_density = TRUE)

QQ Plot

  • Q-Q plots (quantile-quantile plots) plot the quantiles of two distributions against each other. A quantile is the value below which a given fraction of the data falls; for example, 30% of the data lie below the 0.3 quantile.
  • The purpose of a Q-Q plot is to find out whether two sets of data come from the same distribution.
  • Normality is an important assumption for many statistical tests; you assume you are sampling from a normally distributed population.
  • The normal Q-Q plot is one way to assess normality.
ggqqplot(data, x = "Weight")
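
To make the idea behind a normal Q-Q plot concrete, the quantile pairs can be computed by hand. This is a base R sketch on a simulated sample (the values are hypothetical, not the Weight column), and the correlation at the end is only a rough informal check, not a formal normality test.

```r
set.seed(1)                          # make the simulated sample reproducible
x <- rnorm(50, mean = 70, sd = 10)   # hypothetical weights, not the pulse data

# Theoretical quantiles of a standard normal at evenly spaced probabilities
theoretical <- qnorm(ppoints(length(x)))

# Sample quantiles are simply the sorted observations
sample_q <- sort(x)

# For roughly normal data the points fall close to a straight line,
# so the correlation between the two sets of quantiles is near 1
qq_cor <- cor(theoretical, sample_q)

# Base R computes the same pairs (returned in the original data order)
qq <- qqnorm(x, plot.it = FALSE)
```

Plotting `theoretical` against `sample_q` gives the same picture as a normal Q-Q plot, up to styling and the reference line.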

Overlay Normal Density Plot

  • Overlay a normal density curve (with the same mean and SD) on the density distribution of 'x'.
  • This is useful for visually inspecting the degree of deviation from normality.
ggdensity(data, x = "BMI", fill = "red") +
  scale_x_continuous(limits = c(-1, 50)) +
  stat_overlay_normal_density(color = "red", linetype = "dashed")

# Color by groups (map the Exercise column, unquoted, inside aes())
ggdensity(data, "BMI", color = "Exercise") +
 stat_overlay_normal_density(aes(color = Exercise), linetype = "dashed")

# Facet by groups
ggdensity(data, "BMI", color = "Exercise", facet.by = "Exercise") +
 stat_overlay_normal_density(aes(color = Exercise), linetype = "dashed")
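
The comparison that the overlay makes visually can also be sketched in base R: estimate the density of the data, then evaluate a normal curve with the same mean and SD on the same grid. The sample below is simulated and deliberately skewed (an assumption for illustration), and `max_gap` is just an informal index invented for this sketch.

```r
set.seed(42)
x <- rexp(200, rate = 1 / 5)   # hypothetical right-skewed sample

# Kernel density estimate of the data
kde <- density(x)

# Normal density with the same mean and SD, evaluated on the same grid
normal <- dnorm(kde$x, mean = mean(x), sd = sd(x))

# Largest vertical gap between the two curves: a rough, informal index
# of how far the sample departs from normality
max_gap <- max(abs(kde$y - normal))
```

The larger the gap between the two curves, the more visibly the dashed normal overlay will separate from the data density in the plot.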

Plot Two Variables - X and Y, Discrete X and Continuous Y

Boxplot

  • {ggpubr} documentation link - https://rpkgs.datanovia.com/ggpubr/reference/ggboxplot.html

  • Boxplots provide a graphical picture of the five-number summary: they show the center (median) and spread (IQR and range), and identify potential outliers.

  • Boxplots can hide some aspects of shape (histograms do a better job of displaying shape).

  • Side-by-Side Boxplots are useful for comparing two or more sets of observations.
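
The five-number summary behind a box plot can be inspected directly in base R. This sketch uses a small made-up vector of ages (not the Age column); `fivenum()` and `boxplot.stats()` are base R functions and are shown here only to illustrate what the box and whiskers represent.

```r
# Hypothetical ages, made up for illustration only
ages <- c(18, 18, 19, 19, 20, 20, 21, 22, 23, 45)

# Five-number summary: minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(ages)

# boxplot.stats() returns what a box plot actually draws
stats <- boxplot.stats(ages)
whiskers_and_box <- stats$stats   # whisker ends and hinges, outliers excluded
flagged <- stats$out              # points drawn individually as outliers
```

Note that the upper whisker stops at the largest value that is not flagged as an outlier, so it can differ from the maximum in the five-number summary.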

ggboxplot(data, x = "BMICat", y = "Age")

# Change the plot orientation: horizontal
ggboxplot(data, x = "BMICat", y = "Age", orientation = "horiz")

# Set width 
ggboxplot(data, x = "BMICat", y = "Age", width = 0.8)

# Fill color
ggboxplot(data, x = "BMICat", y = "Age", width = 0.8, fill="red")

# Color by Gender
ggboxplot(data, x = "BMICat", y = "Age", color = "Gender")

# Add jitter 
ggboxplot(data, x = "BMICat", y = "Age", color = "Gender", 
          add = "jitter")

# Add shape 
ggboxplot(data, x = "BMICat", y = "Age", color = "Gender", 
          add = "jitter", shape = "BMICat")

Violin Plots

ggviolin(data, x = "BMICat", y = "Weight")

# Change the plot orientation: horizontal
ggviolin(data, x = "BMICat", y = "Weight", orientation = "horiz")

# Add summary statistics
# Draw quantiles
ggviolin(data, "BMICat", "Weight", add = "none",
   draw_quantiles = 0.5)

# Add box plot
ggviolin(data, x = "BMICat", y = "Weight",
 add = "boxplot")

# Color by Gender and add jittered points
ggviolin(data, x = "BMICat", y = "Weight", color = "Gender", 
          add = "jitter", error.plot = "crossbar")

Bar Charts

  • {ggpubr} documentation link - https://rpkgs.datanovia.com/ggpubr/reference/ggbarplot.html
  • A bar plot shows the relationship between a numeric and a categorical variable. Each level of the categorical variable is represented as a bar, and the size of the bar represents its numeric value.
  • Bar plots are sometimes described as a boring way to visualize information. However, they are probably the most efficient way to show this kind of data. Ordering the bars and providing good annotation are often necessary.
  • Bar plots can show the number of observations in each category of a discrete variable.
  • Bar plots can also show the estimated error for each category of a discrete variable.
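
When the bars should show the number of observations per category, the counts can be computed first and stored in a small data frame. This is a base R sketch with made-up category labels; the column name `n` is a choice made for this example.

```r
# Hypothetical BMI categories, made up for illustration only
bmi_cat <- c("Normal", "Underweight", "Normal",
             "Overweight", "Normal", "Underweight")

# Count the observations in each category
counts <- as.data.frame(table(BMICat = bmi_cat))
names(counts)[2] <- "n"   # rename the default "Freq" column

# A frame like this has one row per bar, for example:
# ggbarplot(counts, x = "BMICat", y = "n")
```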
# Data: Reading Hours 
df <- data.frame(days = c("D1", "D2", "D3"),
   hours = c(4.2, 10, 10.5))
df
  days hours
1   D1   4.2
2   D2  10.0
3   D3  10.5
ggbarplot(df, x = "days", y = "hours")

# Change width
ggbarplot(df, x = "days", y = "hours", width = 0.5)

# Change the plot orientation: horizontal
ggbarplot(df, x = "days", y = "hours", width = 0.5, orientation = "horiz")

# Change the default order of items
ggbarplot(df, x = "days", y = "hours", width = 0.5, orientation = "horiz", order = c("D3", "D2", "D1"))

# Change colors
ggbarplot(df, x = "days", y = "hours", width = 0.5, color = "steelblue",  fill = "steelblue")

# Add label 
ggbarplot(df, x = "days", y = "hours", width = 0.5, color = "steelblue",  fill = "steelblue",  label = TRUE, lab.pos = "in", lab.col = "white")

# Color the bar outlines by group with a custom palette
ggbarplot(df, x = "days", y = "hours", width = 0.5, color = "days",  fill = "steelblue",  label = TRUE, lab.pos = "in", lab.col = "white",  palette = c("#00AFCB", "#E7B800", "#FC4E07"))

# Fill the bars by group with the same palette
ggbarplot(df, x = "days", y = "hours", width = 0.5, color = "days",  fill = "days",  label = TRUE, lab.pos = "in", lab.col = "white",  palette = c("#00AFCB", "#E7B800", "#FC4E07"))

Pie Charts

# Data 
df <- data.frame(
  group = c("Male", "Female", "Child"),
  value = c(25, 25, 50))
df
   group value
1   Male    25
2 Female    25
3  Child    50
# Basic pie charts
ggpie(df, "value", label = "group")

# Change color:
# fill by group, set the line color to white,
# and use a custom color palette
ggpie(df, "value", label = "group",
   fill = "group", color = "white",
   palette = c("#00AFBB", "#E7B800", "#FC4E07"))

# Change label
# Show group names and value as labels
labs <- paste0(df$group, " (", df$value, "%)")
ggpie(df, "value", label = labs,
   fill = "group", color = "white",
   palette = c("#00AFBB", "#E7B800", "#FC4E07"))

# Change the position and font color of labels
ggpie(df, "value", label = labs,
   lab.pos = "in", lab.font = "white",
   fill = "group", color = "white",
   palette = c("#00AFBB", "#E7B800", "#FC4E07"))

Line Plots

ggline(data, x = "BMICat", y = "Weight")

ggline(data, x = "BMICat", y = "Weight", shape = "Gender", linetype = "Gender")

ggline(data, x = "BMICat", y = "Weight", shape = "Gender", linetype = "Gender", color = "Gender")

# Visualize the mean of each group
ggline(data, x = "BMICat", y = "Weight", shape = "Gender", linetype = "Gender", color = "Gender", add = "mean")

# Add error bars: mean_se
ggline(data, x = "BMICat", y = "Weight", shape = "Gender", linetype = "Gender", color = "Gender", add = "mean_se")

# Change the error plot type: pointrange
ggline(data, x = "BMICat", y = "Weight", shape = "Gender", linetype = "Gender", color = "Gender", add = "mean_se", error.plot = "pointrange")

# Add jitter points and errors (mean_se)
ggline(data, x = "BMICat", y = "Weight", shape = "Gender", linetype = "Gender", color = "Gender", add = c("mean_se", "jitter"))
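
The "mean_se" summary used in the line plots above is the group mean plus or minus one standard error of the mean. The following base R sketch shows that calculation on made-up weights; the helper `se()` and the data are assumptions for illustration, not part of ggpubr.

```r
# Hypothetical weights and BMI categories, made up for illustration only
weight <- c(50, 54, 60, 64, 80, 84)
group  <- c("Underweight", "Underweight", "Normal",
            "Normal", "Overweight", "Overweight")

# Standard error of the mean: SD divided by the square root of n
se <- function(v) sd(v) / sqrt(length(v))

m <- tapply(weight, group, mean)   # group means
s <- tapply(weight, group, se)     # group standard errors

# One row per group: the point and the ends of its error bar
summary_df <- data.frame(
  group = names(m),
  mean  = as.vector(m),
  lower = as.vector(m - s),
  upper = as.vector(m + s)
)
```

Each row corresponds to one point on the line and the extent of its error bar; groups are listed in alphabetical order because `tapply()` sorts the grouping values.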