1 Tools
2 Prepare the Data
3 Long Format to Manipulate
- 3.1 Plot Proportions
4 Wide Format to Report
- 4.1 Cross-tabulate for a Distribution
- 4.2 Join an Overall Count to the Distribution
5 A Function for Proportions
- 5.1 Plot facets Conditioned on a 3rd Variable
6 What’s this for?
- 6.1 What’s it really about?

Professor Jennifer Bryan (@JennyBryan) of the University of British Columbia asked how one might perform efficient cross-tabulation with dplyr in R. Professor Bryan has written up several answers on github, using both dplyr and data.table, some also use tidyr for reformatting tables or reshape2 to specify cross-tabulation using a formula.

I’d previously thought about the question of cross-tabulation in the context of analysing and reporting student course grades and award classifications. This elaborates on the ‘count & spread’ approach I suggested and extends it to tables and plots required for both data manipulation and reporting.

This entails deriving tables from a data frame in two distinct formats for different purposes:

Counts and proportions in long format, suitable for manipulation, e.g. plotting.
An overall count and a cross-tabulation of proportions in wide format for reporting a distribution.

For reporting, an overall count for the parent group is required alongside the percentage distribution within each sub-group. Each measure elaborates on the other without duplicating information, so both are required for a reader to gauge the significance of the values. This is more compact and readable than long format table required for data manipulation such as plotting.

1 Tools

The packages required fall into three groups, for data manipulation, plotting & presenting tables:

library('dplyr')      # for data manipulation
library('tidyr')      # for reshaping data

library('ggplot2')    # plotting data
library('scales')     # for scale_y_continuous(label = percent)
library('ggthemes')   # for scale_fill_few('medium')

library('ztable')     # format tables for reporting

knitr::opts_chunk$set(comment = NA)
options(ztable.type = 'html')

2 Prepare the Data

Prepare data to minimise subsequent coding necessary, e.g. convert variable types. This data uses numeric values to indicate discrete values, convert these to factors.

dfr <- mtcars %>%             
  mutate(cyl = as.factor(cyl)     # numeric values to factor
      , gear = as.ordered(gear))  # numeric to ordered factor (like a grade)

3 Long Format to Manipulate

The grouping and a summary by count are both implicit in count(). The count, n, can then be mutated into proportions of each sub-group within their parent group with prop.table().

dfr_prop <- dfr %>% 
  count(cyl, gear) %>%            # group_by() & summarise(n = n()) are implicit
  mutate(prop = prop.table(n))    # prop = n/sum(n) works too

as.data.frame(dfr_prop)           # strip tbl_df() properties to print

  cyl gear  n       prop
1   4    3  1 0.09090909
2   4    4  8 0.72727273
3   4    5  2 0.18181818
4   6    3  2 0.28571429
5   6    4  4 0.57142857
6   6    5  1 0.14285714
7   8    3 12 0.85714286
8   8    5  2 0.14285714

This long format is central to the concept of ‘tidy data’, which is easy to manipulate, e.g. to filter or arrange by count or proportion, or to plot with.

3.1 Plot Proportions

Plot proportions on the y-axis, indicating each subgroup with fill, by parent groups on the x-axis. First define a plot, then insert data. This allows the same plot to be reused with different data and other arguments and avoids creating ggplot objects bloated with data.

Note the parent group is assigned to the x-axis, the measure of proportions to the y-axis, filled with colour assigned by sub-group:

# create a reusable ggplot object *without* data = data.frame()

gg_prop <- ggplot(data = data.frame()
                  , aes(x = cyl, y = prop, fill = gear)) + 
  geom_bar(stat = 'identity', position = 'dodge', alpha = 2/3) +  
  scale_y_continuous(labels = percent) +    
  scale_fill_few('medium', drop = FALSE) +              # keep levels, if data is filtered
  labs(x = 'Cylinders', y = NULL, fill = 'Gears'
       , title = 'Proportions in Sub-groups by Group')  # what's the reader looking at?

Insert data with the ‘add’ operator in ggplot2, %+%, documented here:

gg_prop %+%   # use %+% to add...
  dfr_prop    # a dataframe

Ideally, I’d like to keep the bar position equal, keeping a space for missing values.

4 Wide Format to Report

Long format is not suitable for reporting, cross tabulation in wide format of counts and proportions is more readable. Both an overall count and proportions across a distribution are required for ‘tidy reporting’. Each measure elaborates on the other, so both are required for a reader to gauge their significance.

4.1 Cross-tabulate for a Distribution

First mutate() proportions into percentages to make those more readable and spread() the values for each subgroup into a cross-tabular layout.

dfr_perc <- dfr %>% 
  count(cyl, gear) %>% 
  mutate(perc = prop.table(n)*100) %>%      # mutate count(n) into perc
  select(-n) %>%                            # remove the count...
  spread(gear, perc)                        # to spread perc by subgroup

as.data.frame(dfr_perc)

  cyl         3        4        5
1   4  9.090909 72.72727 18.18182
2   6 28.571429 57.14286 14.28571
3   8 85.714286       NA 14.28571

4.2 Join an Overall Count to the Distribution

Only an overall count for parent groups is required. A full count by sub-groups would duplicate information in the distribution above.

Create an overall count and percentage for parent groups and join them by the common variable, the parent group:

dfr_dist <- dfr %>% 
  count(cyl) %>%                            # count for parent groups
  mutate(`(\\%)` = prop.table(n)*100) %>%   # & percentage in those groups
  left_join(dfr_perc, by = 'cyl')           # join to dfr_perc

Format so that counts are integers and percentages have 1 decimal place making them distinguishable for the reader.

Caption the table, then label and group the columns to guide the reader.

# 1 digit to distinguish % from n and caption
ztab <- ztable(as.data.frame(dfr_dist), digits = 1
               , caption = 'Distribution of Gears by Cylinders')

# name and number column groups
ztab <- addcgroup(ztab, cgroup = c('Cylinders', 'Gear Distribution (\\%)') 
                 , n.cgroup = c(3, ncol(dfr_dist)-3))   # 3 columns & others

This produces a compact and readable table for reporting:

ztab

Distribution of Gears by Cylinders
	cyl	n	(%)	3	4	5
	Cylinders			Gear Distribution (%)
1	4	11	34.4	9.1	72.7	18.2
2	6	7	21.9	28.6	57.1	14.3
3	8	14	43.8	85.7		14.3

Ideally, I’d like to be able remove the row names when appropriate, but adding include.rownames = FALSE to ztable() leads to an error when column groups are added.

5 A Function for Proportions

In practice, proportions are required for many variables from various datasets or subsets. A function allows the same operation to be applied to various groups in any dataset.

This requires some Non-Standard Evaluation (NSE) in dplyr, to allow groups to be defined as arguments. count() doesn’t work here, so NSE versions of group_by_() and summarise_() are used instead:

## using NSE, except for mutate, the last step

fun_prop <- function (data = data.frame(), dots = NULL, ...) {

  dfr_prop <- data %>% 
    group_by_(.dots = dots) %>% 
    summarise_(n = ~n()) %>% 
    mutate(prop = prop.table(n))  # can this be done by NSE with mutate_() ??
  
  dfr_prop
}

Adding a third variable allows a cross-tabulation to be conditioned, splitting the dataset based on that third variable. In this case for engine type, a logical value either vertical (0) or straight (1).

dfr <- dfr %>%             
  mutate(vs = as.logical(vs))     # 0 = Vertical, 1 = Straight

Note the conditional variable is first, over and above the factors representing parent and sub-groups:

dfr_cond <- dfr %>% 
  fun_prop(c('vs','cyl','gear'))  # piping & position take care of arguments

5.1 Plot facets Conditioned on a 3rd Variable

Using the same empty ggplot object above, insert the new data frame and use the conditional variable to define facets:

gg_prop %+% 
  dfr_cond +                # add new data into gg_prop
  facet_grid(~vs) +         # conditional variable defines facets, and re-title
  labs(title = 'Proportions in Sub-groups by Group, \n Conditioned on a 3rd Variable')

6 What’s this for?

The function for proportions is generally applicable to any data frame. Modifying gg_plot objects with %+% is integral to ggplot2, though perhaps not well-known as it’s only briefly referenced in the original book. The particular specification of gg_prop above is not general, but could be made into a function that is.

Counting gears in mtcars as an ordered factor is analogous to counting students’ course grades or award classifications and calculating a distribution, to report in both tables and plots.

The use of cross-tabulation and a little non-standard evaluation in a function allows the same operations to be applied to data for a variety of students on different mark schemes that generate different grades, e.g. Bachelor’s schemes with six or seven grades and Master’s ones with five.

6.1 What’s it really about?

Examination results, with both marks and ordered grades, make ideal datasets for data analysis and manipulation of numeric and factor variables. They’re almost universal in education, easily understood and immediately relevant to both teachers and students. Any useful analysis entails idiomatic operations to count, cross tabulation to calculate distribution, to plot and present in tabular form.

Educational outcomes are often poorly analysed and reported by administrators to both teachers and students, usually using licensed tools that offer recipients limited access to the underlying data and analysis. Reproducible reporting in education with open-source tools would make both data and analyses more accessible, exposing them to scrutiny to make them more reliable and allow readers with greater expertise to specify more sophisticated analyses for administrators to reproduce as new data is generated.

Common records systems and national reporting obligations impose the same data structures across different institutions, making analyses generally applicable and potentially reducing the costs of excessive dependence on licensed tools.

Tables and Plots of Counts and Proportions

M.Devlin

16 May 2015