Introduction

This tutorial explores how to use the ggpubr package in R. Specifically, it demonstrates how some of the functions from the package can be used to simplify the production of ggplot2-style data visualizations, as well as to integrate statistical analysis into those visualizations.

Why ggpubr?

Many data visualizations created in R or RStudio use the ggplot2 package, and many users are already familiar with it. For users who regularly make a variety of types of plots, or who need to consider statistical analysis in addition to creating visualizations, the ggpubr package can provide simplified ways to do both at the same time.

Loading ggpubr and Loading Data

Loading ggpubr

Because ggpubr is an enhancement to ggplot2, it is important to load both packages, using the following code. Because ggplot2 is part of the tidyverse collection of packages, I choose to load the whole tidyverse when I am working in R, so that I also have access to associated packages, such as dplyr, in the course of my work. I also elect to load the plyr package, whose ddply() function is used later in this tutorial to calculate group means. Note that plyr is loaded before the tidyverse so that plyr functions do not mask the dplyr functions of the same name, such as mutate() and summarise(). A critical note about ggpubr is that, although the package is sometimes stylized as “ggpubR”, the “R” must be lowercase (“ggpubr”) in code, because package names in R are case-sensitive.

# Load Packages
library(plyr)       # load before the tidyverse to avoid masking dplyr functions
library(tidyverse)
library(ggpubr)

If you don’t already have these packages installed, use the following code to install them before loading the libraries.

# Installing Packages if Needed
install.packages("tidyverse")
install.packages("plyr")
install.packages("ggpubr")
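
Because install.packages() downloads a package each time it is called, one option is to install only what is missing. The following is a minimal sketch rather than part of the workflow above; the helper names missing_packages() and install_if_missing() are hypothetical and written for this illustration only.

```r
# Hypothetical helper: return the names in `pkgs` that are not yet installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Hypothetical helper: install only the packages that are missing
# (requires an internet connection and a configured CRAN mirror)
install_if_missing <- function(pkgs) {
  needed <- missing_packages(pkgs)
  if (length(needed) > 0) {
    install.packages(needed)
  }
  invisible(needed)
}

# Packages that ship with R are always present, so nothing is reported here
missing_packages(c("stats", "utils"))
```

Calling install_if_missing(c("tidyverse", "plyr", "ggpubr")) once before the library() calls would then skip any package that is already available.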

Loading the Data

In my professional life, I often look at student satisfaction survey data for the college where I work in order to identify patterns and understand results. However, because that data is private, we will look instead at the Descriptive Dataset on Student’s Level of Satisfaction on Facilities Provided available via Mendeley Data, which bears some resemblance to the student data I look at professionally.

From the Mendeley Data page, I am able to download the student survey data as an Excel file, save it as a .csv file to a preferred location on my computer, and load it into R. The code below assumes that the .Rmd file I’m working in and the .csv file with the data are saved to the same location.

# Define Data Set
dat <- read.csv("Student_Satisfaction.csv")
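
By default, read.csv() converts header text into syntactically valid R names, which is one likely reason that the column names used later in this tutorial contain periods. The sketch below illustrates this with a tiny stand-in file; the header text and values are made up for the example and are an assumption about, not a copy of, the real spreadsheet.

```r
# Write a tiny stand-in CSV with made-up values (not the real survey data)
tmp <- tempfile(fileext = ".csv")
writeLines(c("Jantina,Health Centre,Wireless Internet",
             "Lelaki,7,5",
             "Perempuan,9,6"),
           tmp)

# Read it the same way as the real file
toy <- read.csv(tmp)

# Spaces in the header are replaced with periods
names(toy)  # "Jantina" "Health.Centre" "Wireless.Internet"

# Quick structural checks: number of rows and columns, then column types
dim(toy)
str(toy)

# Remove the temporary file
unlink(tmp)
```

Running dim(dat) and str(dat) on the real data in the same way is a quick check that the file loaded with the expected number of rows and columns.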

The Student Satisfaction data is fairly straightforward. It is a small data set with one row per student respondent, for a total of 280 rows, and 22 columns: the first column contains a time stamp for when the survey was completed, columns 2-8 contain student demographic information, and columns 9-22 contain the satisfaction scores (1-10) for several areas of the campus.

Preparing the Data

It is important to note that the original data set is in a combination of English and the Malay language, which I have translated to American English using Google for the purposes of this tutorial. To apply the translations, we first rename the columns as needed using base R.

# Rename Columns to American English
colnames(dat)[colnames(dat) == 'Jantina'] <- 'Gender'
colnames(dat)[colnames(dat) == 'umur'] <- 'Age'
colnames(dat)[colnames(dat) == 'Etnik'] <- 'Ethnicity'
colnames(dat)[colnames(dat) == 'Agama'] <- 'Religion'
colnames(dat)[colnames(dat) == 'Bursary'] <- 'Bursar'
colnames(dat)[colnames(dat) == 'Health.Centre'] <- 'Health.Center'
colnames(dat)[colnames(dat) == 'Sport.Centre'] <- 'Athletic.Center'
colnames(dat)[colnames(dat) == 'Islamic.Centre'] <- 'Islamic.Center'
colnames(dat)[colnames(dat) == 'Auto.teller.Machine'] <- 'ATM'
colnames(dat)[colnames(dat) == 'Residential.Collage'] <- 'Residence.Halls'
colnames(dat)[colnames(dat) == 'Transportation.Centre'] <- 'Transportation.Center'
colnames(dat)[colnames(dat) == 'Wireless.Internet'] <- 'WiFi'
colnames(dat)[colnames(dat) == 'Toilet'] <- 'Bathrooms'
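
When many columns need new names, the same renaming can be driven by a single lookup vector instead of one line per column. This is a minimal base R sketch on a made-up data frame; the object names toy, lookup, and hits are hypothetical, and only three of the columns are shown.

```r
# Toy data frame with made-up values and three of the original column names
toy <- data.frame(Jantina = c("Lelaki", "Perempuan"),
                  umur = c("20 < 25 tahun", "20 < 25 tahun"),
                  Toilet = c(7, 9))

# Named vector: names are the old column names, values are the new ones
lookup <- c(Jantina = "Gender", umur = "Age", Toilet = "Bathrooms")

# Rename only the columns that appear in the lookup
hits <- colnames(toy) %in% names(lookup)
colnames(toy)[hits] <- unname(lookup[colnames(toy)[hits]])

colnames(toy)  # "Gender" "Age" "Bathrooms"
```

Columns that do not appear in the lookup keep their original names, so the approach is safe to run on the full data set.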

The values associated with some of the variables also need to be translated. We can do this using the mutate() and recode() functions that are part of the dplyr package, which we loaded with the tidyverse at the beginning of the tutorial.

## Rename Values
# Gender
dat <- dat %>%
  mutate(Gender = recode(Gender, 
                         Lelaki = "Male", 
                         Perempuan = "Female")
         )

# Age
dat <- dat %>%
  mutate(Age = recode(Age, 
                      "? 19 tahun" = "under 19", 
                      "20 < 25 tahun" = "20 - 25", 
                      "? 26 tahun" = "26 and older")
         )
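
After recoding, it is worth confirming that every value was translated. The sketch below uses a made-up column to show that recode() leaves values it does not recognize unchanged, so a quick table() reveals anything that was missed; the "Unknown" value is hypothetical and included only to demonstrate this.

```r
library(dplyr)

# Toy column with made-up values, including one that has no translation
toy <- data.frame(Gender = c("Lelaki", "Perempuan", "Lelaki", "Unknown"),
                  stringsAsFactors = FALSE)

# Same recoding pattern as above
toy <- toy %>%
  mutate(Gender = recode(Gender,
                         Lelaki = "Male",
                         Perempuan = "Female"))

# Count each value; untranslated values would still appear here
table(toy$Gender, useNA = "ifany")
```

Running table(dat$Gender) and table(dat$Age) on the real data serves the same purpose.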

The data in the Semester column is more complicated, as that column contains both semester (1 or 2) and year (1, 2, 3, 4) data. We will clean this up by changing to English and removing the excess words and characters, then splitting the existing column into separate semester and year columns. We can accomplish this using the mutate() and recode() functions from the dplyr package and the separate() function from the tidyr package, both of which we loaded with the tidyverse at the beginning of the tutorial.

## Semester Column
# Change to English and Remove Excess Words and Characters
dat <- dat %>%
  mutate(Semester = recode(Semester, 
                           "Semester Tambahan" = "Other Other",
                           "Semester 1 Tahun 1" = "Freshman 1", 
                           "Semester 2 Tahun 1" = "Freshman 2",
                           "Semester 1 Tahun 2" = "Sophomore 1", 
                           "Semester 2 Tahun 2" = "Sophomore 2",
                           "Semester 1 Tahun 3" = "Junior 1", 
                           "Semester 2 Tahun 3" = "Junior 2",
                           "Semester 1 Tahun 4" = "Senior 1", 
                           "Semester 2 Tahun 4" = "Senior 2"))

## Split Column 
dat <- dat %>%
  separate(Semester, c("Year", "Semester"))
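
To make the effect of separate() concrete, here is a small sketch on made-up values. It spells out the into and sep arguments, which the call above leaves to their positional and default forms; the toy data frame is hypothetical.

```r
library(tidyr)

# Toy column holding recoded values in "Year Semester" form
toy <- data.frame(Semester = c("Freshman 1", "Senior 2", "Other Other"),
                  stringsAsFactors = FALSE)

# Split on the space: the first piece becomes Year, the second Semester
toy <- separate(toy, Semester, into = c("Year", "Semester"), sep = " ")

toy
```

Both new columns are character columns, so the semester values are stored as "1" and "2" rather than as numbers.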

Once the data is transformed for our use, we can review its general format using the head() function, which prints the first six rows.

# Review Data
head(dat)

Using the ggpubr Package Functions

Example 1: Comparing Simple Density Plots

Using ggplot2

If we wanted to look at student welfare scores by gender, one way to do so is to use the geom_density() function in ggplot2. This requires us to use the ggplot() function, as well as the aes() (aesthetic) function, for a basic plot, and then add the geom_density() function to indicate that we are creating a density plot specifically. Additionally, we can use the ggtitle() function to add a title to our plot.

# Basic Density Plot from ggplot2
p1 <- ggplot(dat, aes(x = Welfare, color = Gender)) + 
  geom_density() + 
  ggtitle("Density Plot of Student Welfare by Gender (ggplot2)")
p1

If we calculate the mean for each group, which we can do by using the ddply() function from the plyr package, we can even add a line to represent the mean to the ggplot.

# Calculate Group Means
mu <- ddply(dat, "Gender", summarise, grp.mean=mean(Welfare))
head(mu)
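
For readers who prefer not to load plyr, the same group means can be computed with base R. This is a minimal sketch on made-up scores, not the approach used in this tutorial; the toy values are hypothetical.

```r
# Toy data with made-up welfare scores
toy <- data.frame(Gender = c("Male", "Male", "Female", "Female"),
                  Welfare = c(6, 8, 7, 9))

# Mean of Welfare within each Gender group
mu_toy <- aggregate(Welfare ~ Gender, data = toy, FUN = mean)

# Rename the result column to match the name used with ddply()
names(mu_toy)[names(mu_toy) == "Welfare"] <- "grp.mean"

mu_toy
```

The result has one row per group and a grp.mean column, so it could be passed to geom_vline() in the same way as the ddply() output.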

To add the line for the mean to the existing density plot, we can add the geom_vline() function, which requires its own data set and aesthetics, to the existing plot function.

# Density Plot with Mean Line from ggplot2
p2 <- ggplot(dat, aes(x = Welfare, color = Gender)) + 
  geom_density() + 
  ggtitle("Density Plot of Student Welfare by Gender (ggplot2)") +
  geom_vline(data = mu, 
             aes(xintercept = grp.mean, color = Gender),
             linetype="dashed")
p2

Using ggpubr

Using ggpubr, however, we can create effectively the same graph, but in a single step, without using multiple functions or manually calculating the mean. Instead, a single function, ggdensity(), contains all of the same details that we manually added to the ggplot() function, but requires only that we add our own variables and allows us to define aesthetic choices without using an additional function.

# Density Plot with Mean Line from ggpubr
p3 <- ggdensity(dat, 
                x = "Welfare", 
                add = "mean", 
                color="Gender", 
                title = "Density Plot of Student Welfare by Gender (ggpubr)")
p3
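
Other aesthetic choices can be set through arguments of the same function. The sketch below uses simulated scores rather than the survey data, and the particular options chosen (a median line, filled curves, a custom palette, and a rug) are only one possible combination.

```r
library(ggpubr)

# Simulated scores for illustration only (not the survey data)
set.seed(1)
toy <- data.frame(Gender = rep(c("Male", "Female"), each = 50),
                  Welfare = c(rnorm(50, mean = 6, sd = 1.5),
                              rnorm(50, mean = 7, sd = 1.5)))

# Density plot with a median line, filled curves, custom colors, and a rug
p_toy <- ggdensity(toy,
                   x = "Welfare",
                   add = "median",
                   color = "Gender",
                   fill = "Gender",
                   palette = c("#00AFBB", "#E7B800"),
                   rug = TRUE)
p_toy
```

Because the result is a regular ggplot object, ggplot2 layers can still be added to it with the + operator.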

ggpubr also allows for the addition of a statistical overlay, meaning a normal density curve with the same mean and standard deviation as the real data, which allows us to visually investigate the extent to which the real data deviates from the normal distribution. The stat_overlay_normal_density() function that does this is part of the ggpubr package, but operates similarly to the functions from ggplot2 in the sense that it requires the use of the aesthetic function, aes(), to work.

# Density Plot with Mean Line and Statistical Overlay from ggpubr
p4 <- ggdensity(dat, 
                x = "Welfare", 
                add = "mean", 
                color="Gender", 
                title = "Density Plot of Student Welfare by Gender (ggpubr)",
                subtitle = "With Statistical Overlay") + 
  stat_overlay_normal_density(aes(color = Gender), 
                              linetype = "dashed")
p4

In this example, the statistical overlay allows us to visually demonstrate that, while the mean welfare score for female students is greater than that for male students, the distribution of the actual scores for the male students is much closer to normal than it is for the female students, which suggests that the mean may be a more representative summary of the scores for male students than for female students.
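
To make concrete what such an overlay represents, the sketch below builds the same kind of curve by hand in base R: a normal density evaluated with the sample mean and standard deviation. It uses made-up scores and reflects my reading of the overlay's description, not the internals of stat_overlay_normal_density().

```r
# Made-up scores on a 1-10 scale (not the survey data)
set.seed(42)
scores <- round(runif(100, min = 1, max = 10))

# Normal curve with the same mean and standard deviation as the scores
grid <- seq(min(scores), max(scores), length.out = 200)
normal_curve <- dnorm(grid, mean = mean(scores), sd = sd(scores))

# Observed density (solid) with the normal curve on top (dashed)
plot(density(scores), main = "Toy scores with normal overlay")
lines(grid, normal_curve, lty = 2)
```

The further the solid curve departs from the dashed one, the less the data resemble a normal distribution.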

Example 2: Box Plots

Using ggplot2

If we wanted to look at the scores for student satisfaction with the residence halls by year, one way to do so is to use the geom_boxplot() function in ggplot2. As was the case for the density plot, this requires us to use the ggplot() function, as well as the aes() (aesthetic) function, for a basic plot. However, it is a bit more complicated, as it requires the addition of the geom_boxplot() function to indicate that we are creating a box plot specifically (similar to the geom_density() function used before) and the geom_jitter() function to display individual data points, each of which requires the addition of the aes() function to indicate what colors will be used. Finally, we need to use the ggtitle() function to add a title to our plot here as well.

# Basic Box Plot from ggplot2
p5 <- ggplot(dat, 
             aes(x = factor(Year), 
                 y = Residence.Halls)) + 
  geom_boxplot(aes(color = factor(Year))) + 
  geom_jitter(aes(color = factor(Year),
                  shape = factor(Year))) + 
  ggtitle("Box Plot of Student Satisfaction with Residence Halls by Year (ggplot2)")
p5
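
One detail worth knowing about factor(Year) is that factor() sorts the levels of a character vector alphabetically, which would place Junior before Sophomore on the x-axis. The base R sketch below shows one way to set the academic order explicitly; the vector is made up, and placing "Other" last is my own assumption.

```r
# Made-up year values in no particular order
year <- c("Sophomore", "Freshman", "Senior", "Junior", "Other")

# Default behavior: levels are sorted alphabetically
levels(factor(year))

# Explicit levels give the academic order instead
year_ordered <- factor(year,
                       levels = c("Freshman", "Sophomore", "Junior",
                                  "Senior", "Other"))
levels(year_ordered)
```

Applying the same levels to dat$Year before plotting would order the boxes from Freshman to Senior.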

Using ggpubr

Using the ggboxplot() function from ggpubr once again simplifies the process of creating the same basic plot by allowing us to add components without the use of additional functions.

# Box Plot from ggpubr
p6 <- ggboxplot(dat, 
                x = "Year", 
                y = "Residence.Halls", 
                color="Year", 
                add = "jitter", 
                shape = "Year",
                title = "Box Plot of Student Satisfaction with Residence Halls by Year (ggpubr)")
p6

In our first example, we saw that ggpubr allowed for the addition of a mean without calculating it separately and manually adding it to the plot. Here, we will see that, in the case where we have multiple groups, we can use the stat_compare_means() function within ggpubr to add comparison data in the form of p-values to our plot. By default, the function uses the Kruskal-Wallis test for the overall comparison across more than two groups and the Wilcoxon test for each pairwise comparison.

This requires us to first define the groups that we want to make comparisons between.

# Specify the Comparison Groups
my_comparisons <- list( c("Freshman", "Sophomore"), 
                        c("Sophomore", "Junior"), 
                        c("Junior", "Senior"), 
                        c("Freshman", "Senior")
                        )
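
If every possible pair should be compared, rather than a hand-picked subset, the list can be generated instead of typed. This is a small base R sketch; the object names year_levels and all_pairs are hypothetical.

```r
# The four class years used in the comparisons
year_levels <- c("Freshman", "Sophomore", "Junior", "Senior")

# Every unordered pair of years, returned as a list of length-2 vectors
all_pairs <- combn(year_levels, 2, simplify = FALSE)

length(all_pairs)  # 6 pairs
all_pairs[[1]]     # "Freshman" "Sophomore"
```

The resulting list has the same structure as my_comparisons and could be passed to the comparisons argument in its place.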

Once the comparison groups are defined, we can add them to the plot using the stat_compare_means() function. This function can be used both to specify the nature of the comparisons and to position the label for the comparisons. Here, I have chosen to indicate simply whether the pairwise comparisons are statistically significant, but not the actual numeric values, for visual clarity. The second call to stat_compare_means() adds the overall p-value, with label.y setting its vertical position.

# Box Plot Including Comparison Groups from ggpubr
p7 <- ggboxplot(dat, 
                x = "Year", 
                y = "Residence.Halls", 
                color="Year", 
                add = "jitter", 
                shape = "Year",
                title = "Box Plot of Student Satisfaction with Residence Halls by Year (ggpubr)", 
                subtitle = "Including Statistical Significance of Comparison Groups" ) + 
  stat_compare_means(comparisons = my_comparisons, 
                     label = "p.signif")+
  stat_compare_means(label.y = 15)
p7

In this example, while we can visually see some slight deviation in the data for student satisfaction with residence halls by year, our statistical analysis reveals that these differences are not statistically significant at any of the points of comparison.
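
The numbers behind labels like these can also be inspected directly with base R. The sketch below runs a Kruskal-Wallis test and pairwise Wilcoxon tests on simulated scores, not the survey data; the Holm adjustment is my own choice for the illustration and is not necessarily what the plot labels use.

```r
# Simulated satisfaction scores for illustration only
set.seed(123)
toy <- data.frame(
  Year = rep(c("Freshman", "Sophomore", "Junior", "Senior"), each = 20),
  Residence.Halls = sample(1:10, 80, replace = TRUE)
)

# Overall comparison across all four groups
kw <- kruskal.test(Residence.Halls ~ Year, data = toy)
kw$p.value

# Pairwise comparisons; exact = FALSE avoids warnings caused by tied scores
pw <- pairwise.wilcox.test(toy$Residence.Halls, toy$Year,
                           p.adjust.method = "holm",
                           exact = FALSE)
pw$p.value
```

Printing kw and pw in full also shows the test statistics, which the plot labels omit.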

Works Cited

This tutorial references and/or cites the following sources:

  • Comprehensive R Archive Network (CRAN). (n.d.). ggplot2: Create Elegant Data Visualisations Using the Grammar of Graphics. CRAN

  • Comprehensive R Archive Network (CRAN). (n.d.). ggpubr: ‘ggplot2’ Based Publication Ready Plots. CRAN

  • Datanovia. (n.d.). Overlay Normal Density Plot. Datanovia

  • Data Visualization and Statistical Integration with ggpubr. (April 2025). Bioinformatics Training and Education Program

  • Descriptive Dataset on Student’s Level of Satisfaction on Facilities Provided. (1 April 2020). Mendeley Data

  • ggpubr. (n.d.). github.com