Getting started with ggplot2

The ggplot2 library is imported using the library() command.

library(ggplot2)

The file Data.csv is used throughout this post. It can be found at https://github.com/juanklopper/R_statistics .

df <- read.csv("Data.csv", header = TRUE, sep = ",", stringsAsFactors = FALSE)

The variable names of the dataset are WCC, CRP, Grade, and Group. Categorical variables can be set manually using the as.factor() command.

df$Grade <- as.factor(df$Grade)
df$Group <- as.factor(df$Group)
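
The conversion can be checked before any plotting is done. The sketch below does not read Data.csv; it assumes a small made-up data frame with the same four column names, so the values (and the assumption that Grade is coded 1 to 4) are purely illustrative.

```r
# Hypothetical stand-in for Data.csv: same column names, made-up values.
df <- data.frame(
  WCC   = c(7.2, 11.5, 9.8, 14.1),
  CRP   = c(3.0, 12.4, 8.1, 20.6),
  Grade = c(1, 2, 3, 4),
  Group = c("A", "B", "A", "B"),
  stringsAsFactors = FALSE
)

# Convert the categorical columns, as in the post.
df$Grade <- as.factor(df$Grade)
df$Group <- as.factor(df$Group)

str(df)           # Grade and Group are now listed as factors
levels(df$Grade)  # the distinct grades
levels(df$Group)  # the distinct groups
```

Running str() and levels() on the real dataset in the same way shows which sample space each categorical variable has.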

Introduction

As with any language, the elements that make up a plot can be described by a grammar. The two g’s in ggplot2 refer to the grammar of graphics: a set of rules that breaks a plot down into parts, so that the code to create it is simple yet powerful.

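The important parts of the grammar of ggplot2 that appear in this post are the data, the aesthetic mappings (aes()), the geometry layers (the geom_ commands), the facets, the coordinates, and the themes.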

These concepts will become clear as the plots are introduced.

Scatter plots

An easy plot to start with is the scatter plot. In the code cell below, the WCC is on the x axis and the corresponding patient’s CRP is on the y axis.

ggplot(df, # the data
       aes(x = WCC,
           y = CRP)) + # the aesthetic
  geom_point() # the layer

More than one geometry can be added to a plot. In the code snippet below, geom_smooth(method = "lm") adds the regression line of a linear model, together with a shaded band based on its standard error.

ggplot(df,
       aes(x = WCC,
           y = CRP)) + 
  geom_point() +
  geom_smooth(method = "lm")
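
The line drawn by geom_smooth(method = "lm") comes from an ordinary linear model of y on x, so the same fit can be inspected directly with lm(). The sketch below is only an illustration: it uses simulated WCC and CRP values rather than Data.csv, and the intercept, slope, and noise level used to simulate them are arbitrary assumptions.

```r
set.seed(1)

# Simulated stand-in for the WCC and CRP columns (not the real data).
df <- data.frame(WCC = runif(50, min = 4, max = 20))
df$CRP <- 2 + 1.5 * df$WCC + rnorm(50, sd = 3)

# The same kind of model that geom_smooth(method = "lm") fits: CRP ~ WCC.
fit <- lm(CRP ~ WCC, data = df)

coef(fit)  # intercept and slope of the regression line

# Fitted values with the lower and upper limits of the 95% confidence band.
band <- predict(fit, interval = "confidence")
head(band)
```

The columns fit, lwr, and upr correspond to the line and to the edges of the shaded band on the plot.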

There are two groups in the sample space of the Group variable. This can be used to show a third variable on a scatter plot. The first way to achieve this is to use a separate color for each group. The code snippet below uses the col = argument to separate on Group. A second geometry is added to show a linear model for each group, this time without the standard error.

ggplot(df,
       aes(x = WCC,
           y = CRP,
           col = Group)) +
  geom_point() +
  geom_smooth(method = "lm",
              se = FALSE)

A shape can be used instead of color. The code snippet below uses the shape = argument to separate on Group.

ggplot(df,
       aes(x = WCC, # same axes as the previous plots
           y = CRP,
           shape = Group)) +
  geom_point()

Color and shape can also be mapped to the same variable at once.

ggplot(df,
       aes(x = WCC,
           y = CRP,
           col = Group,
           shape = Group)) +
  geom_point()

If there is too much information on a single figure, the data can be faceted on a categorical variable. In the code snippet below, groups A and B are each plotted in their own panel.

ggplot(df,
       aes(x = WCC,
           y = CRP)) +
  geom_point() +
  facet_grid(~ Group) # Note the ~ symbol

If the Grade variable is used, there will be four facets.

ggplot(df,
       aes(x = WCC,
           y = CRP)) +
  geom_point() +
  facet_grid(~ Grade)

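There is also a facet_wrap() command, which wraps the sequence of panels into rows and columns instead of placing them all in a single row.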

ggplot(df,
       aes(x = WCC,
           y = CRP)) +
  geom_point() +
  facet_wrap(~ Grade)
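
The layout that facet_wrap() chooses can be overridden with its ncol = (or nrow =) argument. The following is a minimal sketch on simulated data, assuming four grades coded 1 to 4; the values are made up and only the column names match the post.

```r
library(ggplot2)
set.seed(1)

# Simulated stand-in for the dataset: 20 made-up patients per grade.
df <- data.frame(
  WCC   = runif(80, min = 4, max = 20),
  CRP   = runif(80, min = 0, max = 30),
  Grade = factor(rep(1:4, each = 20))
)

# Ask for two columns, so that the four panels form a 2 x 2 arrangement.
p <- ggplot(df,
            aes(x = WCC,
                y = CRP)) +
  geom_point() +
  facet_wrap(~ Grade, ncol = 2)

p
```

Setting ncol = 4 instead would place the panels side by side, similar to the facet_grid(~ Grade) plot.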

Histograms

A histogram splits a numerical variable into bins and counts how many values occur in each bin. The default number of bins is 30. A specific number can be passed to the geom_histogram() command using the bins = argument.

ggplot(df,
       aes(x = CRP)) + 
  geom_histogram(bins = 20)

The width of each bin can be specified using binwidth = as an alternative argument in the geom_histogram() command.

ggplot(df,
       aes(x = CRP)) +
  geom_histogram(binwidth = 0.5)
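
To see what the histogram actually computes, the binned values can be pulled out of the plot with layer_data(). This is a sketch on simulated CRP values (an assumption, since the real file is not read here); it only shows that every value lands in exactly one bin.

```r
library(ggplot2)
set.seed(1)

# Simulated stand-in for the CRP column (not the real data).
df <- data.frame(CRP = rnorm(200, mean = 10, sd = 3))

p <- ggplot(df,
            aes(x = CRP)) +
  geom_histogram(binwidth = 0.5)

# One row per bin: its lower edge, upper edge, and the number of values in it.
bins <- layer_data(p, 1)
head(bins[, c("xmin", "xmax", "count")])

# The counts over all bins add up to the number of observations.
sum(bins$count) == nrow(df)
```

A smaller binwidth = value gives more, narrower bins with lower counts; a larger value does the opposite.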

Color can be changed using the fill = argument in the geom_histogram() command.

ggplot(df,
       aes(x = CRP)) +
  geom_histogram(binwidth = 0.5,
                 fill = "deepskyblue")

Boundaries around the bars can make reading the plot easier. To achieve this, add the col = argument to the geom_histogram() command.

ggplot(df,
       aes(x = CRP)) +
  geom_histogram(binwidth = 0.5,
                 fill = "deepskyblue",
                 col = "black")

More than one group (represented in the dataset by a categorical variable) can be plotted. This is achieved using the fill = argument in the aes() function. By default, the bars of the groups are stacked on top of each other.

ggplot(df,
       aes(x = CRP,
           fill = Group)) +
  geom_histogram(bins = 20,
                 col = "black")

The proportion of each group in every bin can be displayed instead of the counts. Set the position = argument in the geom_histogram() command to "fill".

ggplot(df,
       aes(x = CRP,
           fill = Group)) +
  geom_histogram(bins = 10,
                 position = "fill",
                 col = "black")
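
The proportions shown by position = "fill" can be approximated by hand with base R. The sketch below assumes simulated CRP values and two made-up groups A and B, and it bins with cut(), whose bin edges are not guaranteed to match the ones ggplot2 picks, so it illustrates the idea rather than reproducing the plot exactly.

```r
set.seed(1)

# Simulated stand-in for the CRP and Group columns (not the real data).
df <- data.frame(
  CRP   = rnorm(200, mean = 10, sd = 3),
  Group = factor(sample(c("A", "B"), 200, replace = TRUE))
)

# Split CRP into 10 equal-width bins and drop any bin that is empty.
bin <- droplevels(cut(df$CRP, breaks = 10))

# Counts per bin and group, then the share of each group within every bin.
counts <- table(bin, Group = df$Group)
props  <- prop.table(counts, margin = 1)

round(props, 2)
rowSums(props)  # each bin's proportions add up to 1
```

This is why every bar in the filled histogram has the same height: only the split between the groups varies from bin to bin.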

Another method uses density plots, which replace the rectangles with a smoothed density estimate. Setting a transparency value using the alpha = argument in the geom_density() command keeps overlapping groups visible and gives a good representation of the spread of the data.

ggplot(df,
       aes(x = CRP,
           fill = Group)) +
  geom_density(alpha = 0.5)

A frequency polygon gives similar information to a histogram, using lines instead of bars.

ggplot(df,
       aes(x = CRP,
           col = Group)) +
  geom_freqpoly(bins = 20)

Bar charts

A bar chart counts the number of occurrences of each unique value of a categorical variable.

ggplot(df,
       aes(x = Grade)) +
  geom_bar()

A fill = argument, if it holds a factor (categorical variable), adds a second variable by which to divide the bars.

ggplot(df,
       aes(x = Grade,
           fill = Group)) +
  geom_bar()
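
The heights of these bars are simple counts, which can be checked with table(). The data frame below is a tiny made-up example, not the real dataset; the grades and groups are assumptions chosen so that the counts are easy to verify by eye.

```r
# Tiny made-up example with the same column names as the post.
df <- data.frame(
  Grade = factor(c(1, 2, 2, 3, 3, 3, 4)),
  Group = factor(c("A", "A", "B", "A", "B", "B", "A"))
)

# One-way counts: the bar heights of geom_bar() with x = Grade.
table(df$Grade)

# Two-way counts: the segment heights once fill = Group is added.
table(Grade = df$Grade, Group = df$Group)
```

In the two-way table, each row corresponds to one bar and each column to one fill color within it.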

Box-and-whisker plots

These plots map a categorical variable to the x axis, and for each of its levels the same numerical variable is plotted on the y axis. It might be required to specify that the x axis variable is a factor (categorical variable). Since this was done at the start of the post, the factor() call is not strictly necessary here.

Note the arguments for the outliers in the code snippet below.

ggplot(df,
       aes(x = factor(Grade),
           y = WCC)) +
  geom_boxplot(outlier.color = "red",
               outlier.size = 3,
               outlier.shape = 3)

Another categorical variable can be used to make groups.

ggplot(df,
       aes(x = factor(Grade),
           y = WCC,
           fill = factor(Group))) +
  geom_boxplot(outlier.color = "black",
               outlier.size = 4,
               outlier.shape = "o", # Specified instead of a numerical value
               outlier.alpha = 0.5)

Changing the background with a theme

The light gray background might not be the best for reports and publications. The easiest way to create a white background is with a theme.

ggplot(df,
       aes(x = Grade,
           y = WCC,
           fill = Group)) +
  geom_boxplot(outlier.color = "red") +
  theme_bw()

The individual data point values can all be plotted on top of the boxes using the geom_jitter() command.

ggplot(df,
       aes(x = Grade,
           y = WCC)) +
  geom_boxplot() +
  geom_jitter(width = 0.2) +
  theme_bw()

Box plots can be flipped to a horizontal orientation using the coord_flip() command.

ggplot(df,
       aes(x = Grade,
           y = WCC)) +
  geom_boxplot() +
  geom_jitter(width = 0.2) +
  theme_bw() +
  coord_flip()

Adding labels

A plot can be stored in a variable and extended later. The labs() command adds a title and axis labels.

boxPlot1 <- ggplot(df,
                   aes(x = factor(Grade),
                       y = WCC)) +
  geom_boxplot(outlier.color = "red") +
  theme_bw()

boxPlot2 <- boxPlot1 + labs(title = "WCC for each disease grade",
                            x = "Grade of disease",
                            y = "White cell count")

boxPlot2

The title can be centered by setting hjust = 0.5 for plot.title in the theme() command.

boxPlot3 <- boxPlot2 +
  theme(plot.title = element_text(hjust = 0.5))

boxPlot3