Professor Jennifer Bryan (@JennyBryan) of the University of British Columbia asked how one might perform efficient cross-tabulation with dplyr
in R. Professor Bryan has written up several answers on github, using both dplyr
and data.table
, some also use tidyr
for reformatting tables or reshape2
to specify cross-tabulation using a formula.
I’d previously thought about the question of cross-tabulation in the context of analysing and reporting student course grades and award classifications. This elaborates on the ‘count
& spread
’ approach I suggested and extends it to tables and plots required for both data manipulation and reporting.
This entails deriving tables from a data frame in two distinct formats for different purposes:
For reporting, an overall count for the parent group is required alongside the percentage distribution within each sub-group. Each measure elaborates on the other without duplicating information, so both are required for a reader to gauge the significance of the values. This is more compact and readable than long format table required for data manipulation such as plotting.
The packages required fall into three groups, for data manipulation, plotting & presenting tables:
library('dplyr') # for data manipulation
library('tidyr') # for reshaping data
library('ggplot2') # plotting data
library('scales') # for scale_y_continuous(label = percent)
library('ggthemes') # for scale_fill_few('medium')
library('ztable') # format tables for reporting
knitr::opts_chunk$set(comment = NA)
options(ztable.type = 'html')
Prepare data to minimise subsequent coding necessary, e.g. convert variable types. This data uses numeric values to indicate discrete values, convert these to factors.
dfr <- mtcars %>%
mutate(cyl = as.factor(cyl) # numeric values to factor
, gear = as.ordered(gear)) # numeric to ordered factor (like a grade)
The grouping and a summary by count are both implicit in count()
. The count, n
, can then be mutated into proportions of each sub-group within their parent group with prop.table()
.
dfr_prop <- dfr %>%
count(cyl, gear) %>% # group_by() & summarise(n = n()) are implicit
mutate(prop = prop.table(n)) # prop = n/sum(n) works too
as.data.frame(dfr_prop) # strip tbl_df() properties to print
cyl gear n prop
1 4 3 1 0.09090909
2 4 4 8 0.72727273
3 4 5 2 0.18181818
4 6 3 2 0.28571429
5 6 4 4 0.57142857
6 6 5 1 0.14285714
7 8 3 12 0.85714286
8 8 5 2 0.14285714
This long format is central to the concept of ‘tidy data’, which is easy to manipulate, e.g. to filter or arrange by count or proportion, or to plot with.
Plot proportions on the y-axis, indicating each subgroup with fill, by parent groups on the x-axis. First define a plot, then insert data. This allows the same plot to be reused with different data and other arguments and avoids creating ggplot
objects bloated with data.
Note the parent group is assigned to the x-axis, the measure of proportions to the y-axis, filled with colour assigned by sub-group:
# create a reusable ggplot object *without* data = data.frame()
gg_prop <- ggplot(data = data.frame()
, aes(x = cyl, y = prop, fill = gear)) +
geom_bar(stat = 'identity', position = 'dodge', alpha = 2/3) +
scale_y_continuous(labels = percent) +
scale_fill_few('medium', drop = FALSE) + # keep levels, if data is filtered
labs(x = 'Cylinders', y = NULL, fill = 'Gears'
, title = 'Proportions in Sub-groups by Group') # what's the reader looking at?
Insert data with the ‘add’ operator in ggplot2
, %+%
, documented here:
gg_prop %+% # use %+% to add...
dfr_prop # a dataframe
Ideally, I’d like to keep the bar position equal, keeping a space for missing values.
Long format is not suitable for reporting, cross tabulation in wide format of counts and proportions is more readable. Both an overall count and proportions across a distribution are required for ‘tidy reporting’. Each measure elaborates on the other, so both are required for a reader to gauge their significance.
First mutate()
proportions into percentages to make those more readable and spread()
the values for each subgroup into a cross-tabular layout.
dfr_perc <- dfr %>%
count(cyl, gear) %>%
mutate(perc = prop.table(n)*100) %>% # mutate count(n) into perc
select(-n) %>% # remove the count...
spread(gear, perc) # to spread perc by subgroup
as.data.frame(dfr_perc)
cyl 3 4 5
1 4 9.090909 72.72727 18.18182
2 6 28.571429 57.14286 14.28571
3 8 85.714286 NA 14.28571
Only an overall count for parent groups is required. A full count by sub-groups would duplicate information in the distribution above.
Create an overall count and percentage for parent groups and join them by the common variable, the parent group:
dfr_dist <- dfr %>%
count(cyl) %>% # count for parent groups
mutate(`(\\%)` = prop.table(n)*100) %>% # & percentage in those groups
left_join(dfr_perc, by = 'cyl') # join to dfr_perc
Format so that counts are integers and percentages have 1 decimal place making them distinguishable for the reader.
Caption the table, then label and group the columns to guide the reader.
# 1 digit to distinguish % from n and caption
ztab <- ztable(as.data.frame(dfr_dist), digits = 1
, caption = 'Distribution of Gears by Cylinders')
# name and number column groups
ztab <- addcgroup(ztab, cgroup = c('Cylinders', 'Gear Distribution (\\%)')
, n.cgroup = c(3, ncol(dfr_dist)-3)) # 3 columns & others
This produces a compact and readable table for reporting:
ztab
Cylinders | Gear Distribution (%) | ||||||
cyl | n | (%) | 3 | 4 | 5 | ||
---|---|---|---|---|---|---|---|
1 | 4 | 11 | 34.4 | 9.1 | 72.7 | 18.2 | |
Ideally, I’d like to be able remove the row names when appropriate, but adding include.rownames = FALSE
to ztable()
leads to an error when column groups are added.
In practice, proportions are required for many variables from various datasets or subsets. A function allows the same operation to be applied to various groups in any dataset.
This requires some Non-Standard Evaluation (NSE) in dplyr
, to allow groups to be defined as arguments. count()
doesn’t work here, so NSE versions of group_by_()
and summarise_()
are used instead:
## using NSE, except for mutate, the last step
fun_prop <- function (data = data.frame(), dots = NULL, ...) {
dfr_prop <- data %>%
group_by_(.dots = dots) %>%
summarise_(n = ~n()) %>%
mutate(prop = prop.table(n)) # can this be done by NSE with mutate_() ??
dfr_prop
}
Adding a third variable allows a cross-tabulation to be conditioned, splitting the dataset based on that third variable. In this case for engine type, a logical value either vertical (0) or straight (1).
dfr <- dfr %>%
mutate(vs = as.logical(vs)) # 0 = Vertical, 1 = Straight
Note the conditional variable is first, over and above the factors representing parent and sub-groups:
dfr_cond <- dfr %>%
fun_prop(c('vs','cyl','gear')) # piping & position take care of arguments
Using the same empty ggplot
object above, insert the new data frame and use the conditional variable to define facets:
gg_prop %+%
dfr_cond + # add new data into gg_prop
facet_grid(~vs) + # conditional variable defines facets, and re-title
labs(title = 'Proportions in Sub-groups by Group, \n Conditioned on a 3rd Variable')
The function for proportions is generally applicable to any data frame. Modifying gg_plot
objects with %+%
is integral to ggplot2
, though perhaps not well-known as it’s only briefly referenced in the original book. The particular specification of gg_prop
above is not general, but could be made into a function that is.
Counting gears in mtcars
as an ordered factor is analogous to counting students’ course grades or award classifications and calculating a distribution, to report in both tables and plots.
The use of cross-tabulation and a little non-standard evaluation in a function allows the same operations to be applied to data for a variety of students on different mark schemes that generate different grades, e.g. Bachelor’s schemes with six or seven grades and Master’s ones with five.
Examination results, with both marks and ordered grades, make ideal datasets for data analysis and manipulation of numeric and factor variables. They’re almost universal in education, easily understood and immediately relevant to both teachers and students. Any useful analysis entails idiomatic operations to count, cross tabulation to calculate distribution, to plot and present in tabular form.
Educational outcomes are often poorly analysed and reported by administrators to both teachers and students, usually using licensed tools that offer recipients limited access to the underlying data and analysis. Reproducible reporting in education with open-source tools would make both data and analyses more accessible, exposing them to scrutiny to make them more reliable and allow readers with greater expertise to specify more sophisticated analyses for administrators to reproduce as new data is generated.
Common records systems and national reporting obligations impose the same data structures across different institutions, making analyses generally applicable and potentially reducing the costs of excessive dependence on licensed tools.