How to make a Pareto Chart using ggplot2 (and dplyr)

A first draft of this post was published on my WordPress blog:
http://dav1d00.wordpress.com/2014/11/16/enjoy-r-how-to-make-a-pareto-chart-using-ggplot2-and-dplyr/

Hi all.

The well-known decision to push ggplot2 users towards a cleaner and more correct way of plotting data has left the package without an implementation of a secondary axis.

This is the main reason why plotting a Pareto chart with this smart R package is difficult.

In this post, I suggest a way to overcome this hurdle: adding a vertical segment, with ticks and text labels showing the percentages, at the right-hand end of the x-axis. I also decided to use theme_bw().

Let’s see an example.

# creating a categorical variable (a character vector):
Example <- rep(c(letters[1:2], LETTERS[1:3]), c(15, 39, 6, 42, 50))

# implementing the function:
ggpareto <- function(x) {
  
  title <- deparse(substitute(x))
  
  x <- data.frame(modality = na.omit(x))
  
  library(dplyr)
  
  Df <- x %>% group_by(modality) %>% summarise(frequency=n()) %>% 
    arrange(desc(frequency))
  
  # fix the bar order: levels sorted by decreasing frequency
  Df$modality <- ordered(Df$modality, levels = as.character(Df$modality))
  
  Df <- Df %>% mutate(modality_int = as.integer(modality), 
                      cumfreq = cumsum(frequency), cumperc = cumfreq/nrow(x) * 100)
  nr <- nrow(Df)
  N <- sum(Df$frequency)
  
  Df_ticks <- data.frame(xtick0 = rep(nr +.55, 11), xtick1 = rep(nr +.59, 11), 
                         ytick = seq(0, N, N/10))
  
  y2 <- c("  0%", " 10%", " 20%", " 30%", " 40%", " 50%", " 60%", " 70%", " 80%", " 90%", "100%")
  
  library(ggplot2)
  
  g <- ggplot(Df, aes(x=modality, y=frequency)) + 
    geom_bar(stat="identity", aes(fill = modality_int)) +
    geom_line(aes(x=modality_int, y = cumfreq, color = modality_int)) +
    geom_point(aes(x=modality_int, y = cumfreq, color = modality_int), pch = 19) +
    scale_y_continuous(breaks=seq(0, N, N/10), limits=c(-.02 * N, N * 1.02)) + 
    scale_x_discrete(breaks = Df$modality) +
    guides(fill = "none", color = "none") + 
    annotate("rect", xmin = nr + .55, xmax = nr + 1, 
             ymin = -.02 * N, ymax = N * 1.02, fill = "white") +
    annotate("text", x = nr + .8, y = seq(0, N, N/10), label = y2, size = 3.5) +
    annotate("segment", x = nr + .55, xend = nr + .55, y = -.02 * N, yend = N * 1.02, color = "grey50") +
    geom_segment(data = Df_ticks, aes(x = xtick0, y = ytick, xend = xtick1, yend = ytick)) +
    labs(title = paste0("Pareto Chart of ", title), y = "absolute frequency") +
    theme_bw()
  
  return(list(graph = g, Df = Df[, c(3, 1, 2, 4, 5)]))
}

# applying the function to the variable:
ggpareto(Example)

## 
## Attaching package: 'dplyr'
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

## $graph

## 
## $Df
## Source: local data frame [5 x 5]
## 
##   modality_int modality frequency cumfreq   cumperc
## 1            1        C        50      50  32.89474
## 2            2        B        42      92  60.52632
## 3            3        b        39     131  86.18421
## 4            4        a        15     146  96.05263
## 5            5        A         6     152 100.00000

As you can see, the function returns a list of two elements: one is the plot, while the other is the data frame generated with dplyr, which has been used to create the graph.
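
To make the columns of that data frame less mysterious, here is a small base-R cross-check of how the frequencies, the cumulative frequencies and the cumulative percentages can be derived. It is only a sketch: the name `check` is mine, and it reproduces the numbers without dplyr rather than the exact object returned by ggpareto().

```r
# same example vector as in the post
Example <- rep(c(letters[1:2], LETTERS[1:3]), c(15, 39, 6, 42, 50))

# absolute frequencies, most frequent level first
freq <- sort(table(Example), decreasing = TRUE)

# `check` is a hypothetical name used only for this cross-check
check <- data.frame(
  modality  = names(freq),
  frequency = as.integer(freq),
  stringsAsFactors = FALSE
)

# running total of the frequencies, and the same as a share of all observations
check$cumfreq <- cumsum(check$frequency)
check$cumperc <- check$cumfreq / sum(check$frequency) * 100

print(check)
# cumfreq is 50, 92, 131, 146, 152, as in the $Df output above
```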

The plot could be improved by adapting the second axis to cases with many levels (at present such a plot would look really bad), though I believe this kind of chart is most useful when there are only a few levels to visualize.
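
One possible route to a second axis that adapts to any number of levels: ggplot2 releases newer than this post (2.2.0 and later, as far as I know) offer sec_axis(), which draws a real right-hand axis as a one-to-one transformation of the primary one. The sketch below is not the ggpareto() function above, just a stripped-down variant under that assumption; the names `Df`, `N` and `g` are local to the example.

```r
library(ggplot2)

# same example vector as in the post
Example <- rep(c(letters[1:2], LETTERS[1:3]), c(15, 39, 6, 42, 50))

# frequency table sorted by decreasing frequency (base R, to keep this short)
freq <- sort(table(Example), decreasing = TRUE)
Df <- data.frame(
  modality  = factor(names(freq), levels = names(freq)),
  frequency = as.integer(freq)
)
Df$cumfreq <- cumsum(Df$frequency)
N <- sum(Df$frequency)

g <- ggplot(Df, aes(x = modality, y = frequency)) +
  geom_col() +
  # group = 1 joins the points across the discrete x levels
  geom_line(aes(y = cumfreq, group = 1)) +
  geom_point(aes(y = cumfreq)) +
  scale_y_continuous(
    breaks = seq(0, N, N / 10),
    # right-hand axis: absolute frequency rescaled to a percentage of N
    sec.axis = sec_axis(~ . / N * 100,
                        name = "cumulative percentage",
                        breaks = seq(0, 100, 10))
  ) +
  labs(y = "absolute frequency") +
  theme_bw()

print(g)
```

Because the percentage axis is a real axis here, it does not depend on the hand-placed rectangle and ticks, so it should not need adjusting when the number of levels changes.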

If one, two or three levels are highly frequent while some twenty others have frequencies very close to zero, I would plot those first one, two or three levels against a single extra level that groups together all twenty of the least frequent ones.
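
The grouping just described can be sketched in base R. The helper `lump_rare()` and its arguments `n_keep` and `other` are hypothetical names of mine, not part of ggplot2 or dplyr, and the sketch assumes that no existing level is already called "Other".

```r
# Hypothetical helper: keep the `n_keep` most frequent levels and collapse
# every remaining level into one level named `other`.
lump_rare <- function(x, n_keep = 3, other = "Other") {
  x <- as.character(x[!is.na(x)])
  freq <- sort(table(x), decreasing = TRUE)
  keep <- names(freq)[seq_len(min(n_keep, length(freq)))]
  x[!(x %in% keep)] <- other
  # add the `other` level only if something was actually collapsed
  factor(x, levels = c(keep, if (any(x == other)) other))
}

# three frequent levels and twenty levels observed once each (made-up data)
many <- rep(c("p", "q", "r", paste0("z", 1:20)), c(100, 80, 60, rep(1, 20)))
lumped <- lump_rare(many, n_keep = 3)

table(lumped)
# p: 100, q: 80, r: 60, Other: 20
```

The lumped vector can then be handed to ggpareto() like any other categorical variable; since the function sorts by frequency, the "Other" bar ends up wherever its count places it.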

If several levels all have similar frequencies, I would simply avoid this chart: I can’t see what it would add to my analysis.