Data Visualization Using ggplot2

Plotting One Variable

Introduction

To visualize one variable, the type of graph to use depends on the type of the variable:

For a categorical (grouping) variable, you can visualize the count of each category using a bar plot, or show the proportion of each category using a pie chart. Dot charts are another alternative.
For a continuous variable, you can visualize the distribution using density plots, histograms, box plots, empirical cumulative distribution function (ECDF) plots, and quantile-quantile (QQ) plots.

One categorical variable: Bar plot

A basic bar plot of a categorical variable can be generated using the following code chunk.

library(tidyverse)  # assumed setup: loads ggplot2 (with the msleep data), dplyr, and tidyr

msleep %>% 
  select(vore) %>%
  ggplot(aes(vore)) + 
  geom_bar()

Notice that the missing observations are also included in the plot. To exclude missing values, we modify the code as follows.

msleep %>% 
  select(vore) %>%
  drop_na(vore) %>% 
  ggplot(aes(vore)) + 
  geom_bar()

Based on the plot, herbivores are the largest group of mammals in the data, followed by omnivores and carnivores. Insectivores are the smallest group.
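The ranking read from the plot can be checked against a table of counts. The chunk below is a minimal sketch; the object name vore_counts is chosen here for illustration and is not part of the original code.

```r
library(dplyr)
library(ggplot2)  # provides the msleep data set

# Count the animals in each diet category, largest group first.
# Rows with a missing vore are dropped, as in the plot above.
vore_counts <- msleep %>%
  filter(!is.na(vore)) %>%
  count(vore, sort = TRUE, name = "count")

vore_counts
```

The first row of vore_counts corresponds to the tallest bar and the last row to the shortest.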

Oftentimes, we want to indicate on the plot the actual count of each type of mammal. Also, we can add more aesthetics to the plot such as color of the bars and labels. We do this in the code chunk below.

msleep %>% 
  select(vore) %>%
  drop_na(vore) %>% 
  group_by(vore) %>% 
  summarize(count = n()) %>% 
  ggplot(aes(x=reorder(vore,-count),y=count)) +
  geom_bar(fill = "blue", stat="identity") +
  geom_text(aes(label = count), vjust = -0.3) +
  labs(x = "Type of Mammal", y = "Number")

One categorical variable: Pie chart

As an alternative to a bar plot of counts for the levels of a categorical variable, we can display the proportions in a pie chart. To illustrate, let us save a separate data frame containing the count of each type of mammal and add the proportion of each type.

The code below sorts the categories by count in descending order and computes the share of each category as a percentage (count*100/total), rounded to one decimal place. To center the labels in the slices by hand, we would instead arrange the grouping variable (vore) in descending order, which is important for computing the y coordinates of the labels, and use \(\text{cumsum}(prop) - 0.5 \times prop\) as the label position, that is, the cumulative sum of the proportions minus half of each slice. The pie chart code further below avoids this step by letting position_stack(vjust = 0.5) place each label in the middle of its slice.

df1 <- msleep %>% 
  select(vore) %>%
  drop_na(vore) %>% 
  group_by(vore) %>% 
  summarize(count = n()) %>% 
  arrange(desc(count)) %>% 
  mutate(prop = round(count*100/sum(count), 1))

The following code chunk generates the pie chart.

df1 %>% 
  ggplot(aes(x = "", y = prop, fill = vore))+
  geom_col(color = "black") +
  geom_text(aes(label = prop),
            position = position_stack(vjust = 0.5)) +
  coord_polar(theta = "y") +
  theme_void()
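For comparison, the manual label position cumsum(prop) - 0.5*prop can be sketched as follows. This is only an illustration under a few assumptions: the names df_lab, lab_ypos, and p_pie are hypothetical, and the proportions are left unrounded until the labels are printed so that they sum to exactly 100.

```r
library(dplyr)
library(ggplot2)  # provides the msleep data set

# Hypothetical helper table (df_lab is not part of the original code)
df_lab <- msleep %>%
  filter(!is.na(vore)) %>%
  count(vore, name = "count") %>%
  # ggplot2 stacks the first level of the fill variable on top, so sort in
  # reverse alphabetical order to make the cumulative sum run bottom-up
  arrange(desc(vore)) %>%
  mutate(prop = count * 100 / sum(count),
         lab_ypos = cumsum(prop) - 0.5 * prop)

# Stacked bar in polar coordinates, with labels placed at lab_ypos
p_pie <- ggplot(df_lab, aes(x = "", y = prop, fill = vore)) +
  geom_col(color = "black") +
  geom_text(aes(y = lab_ypos, label = round(prop, 1))) +
  coord_polar(theta = "y") +
  theme_void()

p_pie
```

The result should match the chart produced with position_stack(vjust = 0.5); the manual version is mainly useful when the labels need to be shifted or computed outside ggplot2.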

An alternative way to easily create a pie chart is the function ggpie() [in the ggpubr package]:

library(ggpubr)
df1 %>% 
  ggpie(x = "prop", label = "prop",
        lab.pos = "in", 
        lab.font = list(color = "white"),
        fill = "vore", color = "white", palette = "jco")

One continuous variable: Histogram

msleep %>% 
  drop_na(sleep_rem) %>% 
  ggplot(aes(sleep_rem)) +
  geom_histogram(bins = 15, 
                 color = "black", 
                 fill = "maroon") +
  labs(x = 'Sleep(REM)', y = "Count")

One continuous variable: Density plot

msleep %>% 
  drop_na(sleep_rem) %>% 
  ggplot(aes(sleep_rem)) +
  geom_density(color = "red") + 
  labs(x = 'Sleep(REM)', y = "Density")

We can combine the density plot and the histogram in one plot. We do this by creating a histogram with density values on the y-axis (instead of counts), then adding a transparent density curve.

msleep %>% 
  drop_na(sleep_rem) %>% 
  ggplot(aes(sleep_rem)) +
  geom_histogram(aes(y = after_stat(density)),
                 colour="black",
                 bins = 15,
                 fill="light blue") +
  geom_density(alpha = 0.2, color="red")

One continuous variable: Frequency polygon

A frequency polygon is very similar to a histogram, but it uses lines instead of bars.

msleep %>% 
  drop_na(sleep_rem) %>% 
  ggplot(aes(sleep_rem)) +
  geom_freqpoly(bins = 15)

One continuous variable: Area plot

msleep %>% 
  drop_na(sleep_rem) %>% 
  ggplot(aes(sleep_rem)) +
  geom_area(stat = "bin", bins = 15,
             color = "black", fill = "light blue")

One continuous variable: Dot plot

A dot plot is another alternative to histograms and density plots for visualizing a continuous variable. Dots are stacked, with each dot representing one observation. The width of a dot corresponds to the bin width.

msleep %>% 
  drop_na(sleep_rem) %>% 
  ggplot(aes(sleep_rem)) +
  geom_dotplot(binwidth = 0.25,
               color = "black", 
               fill = "red", 
               dotsize = 1)

One continuous variable: Boxplot

msleep %>% 
  drop_na(sleep_rem) %>% 
  ggplot(aes(sleep_rem)) +
  geom_boxplot(width = 0.2, fill = "yellow green") +
  coord_flip()

One continuous variable: Empirical cumulative distribution function (ECDF)

The ECDF provides another way to visualize a distribution. For any given value, it reports the proportion of observations that are at or below that value.

msleep %>% 
  drop_na(sleep_rem) %>% 
  ggplot(aes(sleep_rem)) + 
  stat_ecdf(color = "black",
            geom = "step", 
            linewidth = 1.5)
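The values drawn by the step curve can also be computed directly with the base R function ecdf(). The sketch below is illustrative only; the object names rem and Fn and the threshold of 2 hours are arbitrary choices, not part of the original code.

```r
library(ggplot2)  # provides the msleep data set

# REM sleep values with missing observations removed
rem <- msleep$sleep_rem[!is.na(msleep$sleep_rem)]

# ecdf() returns a function: Fn(q) is the proportion of values <= q
Fn <- ecdf(rem)

Fn(2)            # proportion of mammals with at most 2 hours of REM sleep
Fn(median(rem))  # at least 0.5, by definition of the median
```

Reading the plot at x = 2 should give the same height as Fn(2).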

One continuous variable: Quantile-quantile plot (QQ plot)

A QQ plot is used to check whether the data follow a normal distribution. If they do, the points fall approximately along a straight line.

msleep %>% 
  drop_na(sleep_rem) %>% 
  ggplot(aes(sample = sleep_rem)) +
  stat_qq(color = "black")
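Judging straightness is easier with a reference line. A minimal sketch, assuming a ggplot2 version that provides stat_qq_line() (the object name p_qq and the red color are arbitrary choices):

```r
library(dplyr)
library(tidyr)
library(ggplot2)

p_qq <- msleep %>%
  drop_na(sleep_rem) %>%
  ggplot(aes(sample = sleep_rem)) +
  stat_qq(color = "black") +
  # by default the line passes through the first and third quartiles
  # of the sample and theoretical distributions
  stat_qq_line(color = "red")

p_qq
```

Points that bend away from the line, especially in the tails, suggest a departure from normality.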

An alternative is the function ggqqplot() [in ggpubr]. It shows a 95% confidence band by default; the code below turns it off with conf.int = FALSE.

msleep %>% 
  drop_na(sleep_rem) %>% 
  ggqqplot(x = "sleep_rem",
           color = "black",
           conf.int = FALSE)

Plotting Two or More Variables

Side-by-side, stacked, and percent stacked barplot

We first count the number of mammals in each combination of vore (herbi, carni, omni) and conservation (domesticated, lc), then create the bar plot. The following code chunk generates a side-by-side bar plot of counts. To generate a stacked or a percent stacked bar plot, replace the argument position = "dodge" in the geom_bar() function with position = "stack" or position = "fill", respectively. Alternatively, one can use position = position_dodge() in place of position = "dodge", position = position_stack() in place of position = "stack", or position = position_fill() in place of position = "fill".

msleep %>% 
  select(vore, conservation) %>%
  filter(conservation %in% c("domesticated", "lc")) %>%
  filter(vore %in% c("herbi" , "carni", "omni"))%>% 
  group_by(vore, conservation) %>% 
  summarize(count = n()) %>% 
  ggplot(aes(x = conservation, y = count, fill = vore)) +
  geom_bar(position="dodge", stat="identity")
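The two other layouts can be sketched as follows. The object names (vore_cons, p_base, p_stack, p_fill) and the "Proportion" axis label are choices made for this illustration; count() is used here as a shorthand for group_by() followed by summarize(count = n()).

```r
library(dplyr)
library(ggplot2)

# Counts of mammals by diet and conservation status
vore_cons <- msleep %>%
  filter(conservation %in% c("domesticated", "lc"),
         vore %in% c("herbi", "carni", "omni")) %>%
  count(vore, conservation, name = "count")

p_base <- ggplot(vore_cons, aes(x = conservation, y = count, fill = vore))

# Stacked: bar height is the total count per conservation status
p_stack <- p_base + geom_bar(position = "stack", stat = "identity")

# Percent stacked: each bar is rescaled to 1, so segments show proportions
p_fill <- p_base + geom_bar(position = "fill", stat = "identity") +
  labs(y = "Proportion")

p_stack
p_fill
```

The stacked version is convenient for comparing totals, while the percent stacked version makes the composition of each conservation status easier to compare.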

Bar plot of means with error bars

msleep %>% 
  select(vore, conservation, sleep_rem) %>%
  filter(conservation %in% c("domesticated", "lc")) %>%
  filter(vore %in% c("herbi" , "carni"))%>% 
  drop_na(sleep_rem) %>% 
  group_by(vore, conservation) %>% 
  summarize(m = mean(sleep_rem),s = sd(sleep_rem)) %>% 
  ggplot(aes(x = vore, y = m, ymin = m-s, ymax = m+s)) +
  geom_bar(aes(fill = conservation), stat = "identity",
           position = position_dodge()) + 
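  # error bars have no fill aesthetic; group tells position_dodge()
  # which bar each error bar belongs to
  geom_errorbar(aes(group = conservation), width=0.1, 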
                color = "black",
                position = position_dodge(0.9))

Histogram of multiple groups

msleep %>%
  select(vore, sleep_rem) %>% 
  filter(vore %in% c("herbi" , "carni", "omni"))%>% 
  drop_na(sleep_rem) %>% 
  ggplot(aes(sleep_rem))+
  # alpha makes the overlapping histograms visible through one another
  geom_histogram(aes(fill = vore), color = "black", position="identity", alpha = 0.5)

Boxplot of multiple groups

msleep %>%
  select(vore, sleep_rem) %>% 
  filter(vore %in% c("herbi" , "carni", "omni"))%>% 
  drop_na(sleep_rem) %>% 
  ggplot(aes(sleep_rem))+
  geom_boxplot(aes(fill = vore), color = "black")+
  coord_flip()

Scatterplot

Recall: A scatterplot shows the relationship between two continuous variables. The following code chunk will generate the scatter plot of sleep_total and sleep_rem.

msleep %>% 
  drop_na(sleep_total) %>% 
  drop_na(sleep_rem) %>% 
  ggplot(aes(x = sleep_total, y = sleep_rem)) +
  geom_point(size = 2, 
             shape = 21, 
             fill = "red", 
             color = "black")

We can add a third continuous variable to a scatter plot to see how this variable relates to the other two. Here, body weight (bodywt) is mapped to the size of the points.

msleep %>% 
  drop_na(sleep_total) %>% 
  drop_na(sleep_rem) %>% 
  ggplot(aes(x = sleep_total, y = sleep_rem)) +
  geom_point(aes(size =  bodywt), 
             shape = 21, 
             fill = "red", 
             color = "black")

Scatter plots with multiple groups

Alternatively, we can apply the facet_wrap() function to see how a third (usually categorical) variable changes the relationship between two continuous variables.

msleep %>% 
  filter(vore %in% c("herbi", "omni", "carni")) %>% 
  drop_na(sleep_total) %>% 
  drop_na(sleep_rem) %>% 
  ggplot(aes(x = sleep_total, y = sleep_rem)) +
  geom_point(size = 2, 
             shape = 21, 
             fill = "red", 
             color = "black") +
  facet_wrap(~vore)
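Besides faceting, the groups can be shown in a single panel by mapping the grouping variable to the fill color of the points. A minimal sketch (the object name p_grp and the legend title "Diet" are arbitrary choices):

```r
library(dplyr)
library(tidyr)
library(ggplot2)

p_grp <- msleep %>%
  filter(vore %in% c("herbi", "omni", "carni")) %>%
  drop_na(sleep_total, sleep_rem) %>%
  ggplot(aes(x = sleep_total, y = sleep_rem)) +
  # shape 21 has a separate fill color, so the group is mapped to fill
  geom_point(aes(fill = vore), size = 2, shape = 21, color = "black") +
  labs(fill = "Diet")

p_grp
```

A single panel makes it easier to compare the groups on a common scale, while facets are easier to read when the groups overlap heavily.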