okramushroom
RLanguage

Help please :) Highlighting a column in ggplot2

I have the following code, with help from https://stackoverflow.com/questions/58866575/how-to-highlight-a-column-in-ggplot2. But when I apply it to my actual test data, where there are about 30 columns to be highlighted, there are diagonal lines connecting the highlighted columns (ideally, there should be nothing between them). My y-axis range also increases a lot (not the data values, just the axis). Any ideas? Thank you!

Load libraries and set up the data frame.

library(tidyr)
library(kableExtra)
library(ggplot2)

fruits <- c("apple", "orange", "watermelons")

juice_content <- c(10, 1, 1000)

weight <- c(5, 2, 2000)

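# stringsAsFactors = TRUE keeps fruits a factor on R >= 4.0,
# so levels(df$fruits) further down returns the level names instead of NULL.
df <- data.frame(fruits, juice_content, weight, stringsAsFactors = TRUE)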

Note that the data frame is ‘short & skinny’.

fruits       juice_content  weight
apple                   10       5
orange                   1       2
watermelons           1000    2000

Use the tidyr package to reshape the data.

df <- gather(df, compare, measure, juice_content:weight, factor_key = TRUE)

Now, the data is ‘long & skinny’.

fruits       compare        measure
apple        juice_content       10
orange       juice_content        1
watermelons  juice_content     1000
apple        weight               5
orange       weight               2
watermelons  weight            2000
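To make the reshape concrete, here is a minimal base R sketch that builds the same long table by hand, without tidyr. It is only an illustration of what the gather() call produces; the names df_wide and df_long are made up for this example.

```r
# Wide input: one row per fruit, one column per measurement.
df_wide <- data.frame(
  fruits = c("apple", "orange", "watermelons"),
  juice_content = c(10, 1, 1000),
  weight = c(5, 2, 2000)
)

# Stack the two measurement columns on top of each other and record,
# in `compare`, which column each value came from.
measure_cols <- c("juice_content", "weight")
df_long <- data.frame(
  fruits = rep(df_wide$fruits, times = length(measure_cols)),
  compare = factor(
    rep(measure_cols, each = nrow(df_wide)),
    levels = measure_cols  # keep column order, like factor_key = TRUE
  ),
  measure = c(df_wide$juice_content, df_wide$weight)
)

df_long
```

The result has six rows (three fruits times two measurements), in the same order as the table above.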

First, make a plot with no background highlighting.

plot <- ggplot(df, aes(fruits, measure, fill = compare)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  scale_y_log10()

plot

Second, generate the data for background highlighting.

highlight_level <- which(levels(df$fruits) %in% c("apple", "watermelons"))

AreaDF <- data.frame(
  fruits = unlist(
    lapply(highlight_level, function(x) c(x - 0.51, x - 0.5, x + 0.5, x + 0.51))
  ),
  yval = rep(
    c(1, max(df$measure), max(df$measure), 1), length(highlight_level)
  )
)
AreaDF
fruits  yval
  0.49     1
  0.50  2000
  1.50  2000
  1.51     1
  2.49     1
  2.50  2000
  3.50  2000
  3.51     1

Third, create the plot with background highlights.

plot <- ggplot(df, aes(fruits)) +
  geom_blank(aes(y = measure, fill = compare)) +
  geom_area(data = AreaDF, aes(y = yval), fill = "yellow") +
  geom_bar(aes(y = measure, fill = compare), stat = "identity", position = position_dodge()) +
  scale_y_log10()

plot

Summary

Your problem exists in this section of code.

AreaDF <- data.frame(
  fruits = unlist(
    lapply(highlight_level, function(x) c(x - 0.51, x - 0.5, x + 0.5, x + 0.51))
    ),
  yval = rep(
    c(1, max(df$measure), max(df$measure), 1), length(highlight_level))
  )

Specifically, this code is hardcoded for the case where the dataset contains only three factor levels. It depends on highlight_level (used inside the unlist() call that starts on line 2 of the preceding code block), which is defined with hardcoded values for the levels in positions 1 and 3: apple and watermelons.

highlight_level <- which(levels(df$fruits) %in% c("apple", "watermelons"))

In your sample code, you have only three factor levels: apple, orange, and watermelons. The which() call returns the integer positions of the selected levels, and the unlist() call, which starts on line two in the preceding code block, flattens the list that lapply() builds from those positions into a single numeric vector of x coordinates. When you apply this code to your actual dataset, it breaks because you have 30 factor levels.

You could modify your code to accommodate an arbitrary number of factor levels. The key to this sample code is understanding the purpose of these two lines:

  1. highlight_level <- which(levels(df$fruits) %in% c("apple", "watermelons"))

  2. lapply(highlight_level, function(x) c(x - 0.51, x - 0.5, x + 0.5, x + 0.51))

Line 1 simply returns a two-element vector equal to c(1, 3), but only in the case where levels 1 and 3 are apple and watermelons, respectively.

Then, the anonymous function in Line 2 calculates, for each highlighted position, the four x coordinates that outline that column's background polygon.
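If it helps, the two lines can be run on their own in base R to see what each one returns. This is just a sketch on the three-fruit example; the intermediate names corners and x_positions are mine, not part of the original code.

```r
# A factor with the same three levels as the sample data.
fruits <- factor(c("apple", "orange", "watermelons"))

# Line 1: integer positions of the levels to highlight.
highlight_level <- which(levels(fruits) %in% c("apple", "watermelons"))
highlight_level
# 1 3

# Line 2: four x coordinates per highlighted position, as a list.
corners <- lapply(
  highlight_level,
  function(x) c(x - 0.51, x - 0.5, x + 0.5, x + 0.51)
)

# unlist() flattens that list into one numeric vector.
x_positions <- unlist(corners)
x_positions
# 0.49 0.50 1.50 1.51 2.49 2.50 3.50 3.51
```

These are the same values that show up in the fruits column of AreaDF.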

A better approach would be to modify Line 1 to make it return a vector of all odd-numbered integers from 1 to n, where n is the number of factor levels in the fruits column. Here’s one way to accomplish that goal.

highlight_level <- which(seq(length(unique(df$fruits))) %% 2 != 0)
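As a quick check, here is that line applied to a made-up factor with 30 levels (the fruit_01 ... fruit_30 names are placeholders, not your data). The seq() call with by = 2 is an equivalent alternative that I find a bit easier to read.

```r
# Hypothetical stand-in for a column with 30 distinct values.
fruits <- factor(sprintf("fruit_%02d", 1:30))

# The line from above, applied to the stand-in factor.
highlight_level <- which(seq(length(unique(fruits))) %% 2 != 0)
highlight_level
# 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29

# Equivalent: every second position, starting at 1.
highlight_alt <- seq(1, nlevels(fruits), by = 2)

length(highlight_level)
# 15
```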

Of course, an even better approach would be to build a few functions to automatically calculate the polygon regions to be highlighted.
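Here is a rough sketch of what such a helper might look like. Note that it swaps in a different technique from the geom_area() code above: instead of one connected area, it returns one independent rectangle per highlighted level. The function name highlight_rects and its arguments are hypothetical, and I have only tried it on the three-fruit example.

```r
# Hypothetical helper: one rectangle (xmin, xmax, ymin, ymax) per
# highlighted factor level. Positions follow the order of levels(f).
highlight_rects <- function(f, highlight, ymin, ymax, half_width = 0.5) {
  f <- as.factor(f)
  idx <- which(levels(f) %in% highlight)
  data.frame(
    level = levels(f)[idx],
    xmin = idx - half_width,
    xmax = idx + half_width,
    ymin = rep(ymin, length(idx)),
    ymax = rep(ymax, length(idx))
  )
}

rects <- highlight_rects(
  f = c("apple", "orange", "watermelons"),
  highlight = c("apple", "watermelons"),
  ymin = 1,      # assumed lower bound; must be > 0 for a log scale
  ymax = 2000    # e.g. max(df$measure)
)
rects
```

For the sample data this gives two rows, spanning x from 0.5 to 1.5 and from 2.5 to 3.5. Each row could then be drawn with something like geom_rect(data = rects, aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax), inherit.aes = FALSE, fill = "yellow") in place of the geom_area() layer. Because every row is its own rectangle, nothing should be drawn between the highlighted columns.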