This tutorial will illustrate how to create cross-tabs from your data and how to visualize the results via a variety of bar charts.

To start, let’s create a sample dataset for 25 employees where we have each employee’s department (HR, IT, or Sales) and their tenure status (Probationary/Tenured).

library(tidyverse)  # tibble(), dplyr verbs, and ggplot2
library(janitor)    # tabyl() and the adorn_*() helpers

dat1 <- tibble(employee_id = sample(x = 100000:200000, size = 25, replace = FALSE),
               department = c(rep(x = "HR", times = 5),
                              rep(x = "IT", times = 7),
                              rep(x = "Sales", times = 13)),
               tenured = sample(x = 0:1, size = 25, replace = TRUE))

dat1$department <- factor(dat1$department)
# levels = 0:1 keeps both labels valid even if one status is never sampled
dat1$tenured <- factor(dat1$tenured, levels = 0:1, labels = c("Probationary", "Tenured"))

dat1[1:10,]

Cross-tabs & Contingency Tables

In this scenario, we want to create a cross-tab by department & tenure status. In other words, we want to know the counts/proportions of employees who fall under each department and/or tenure status.

There are multiple ways to create a cross-tab, but I almost always use the janitor package.
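
For comparison, here is a minimal base R sketch of the same idea using table(), addmargins(), and prop.table(). The dat_demo data frame and the seed are stand-ins made up so the block runs on its own; they are not the dataset created above.

```r
# Hypothetical stand-in for dat1 (the seed is an arbitrary choice for reproducibility)
set.seed(42)
dat_demo <- data.frame(
  department = factor(c(rep("HR", 5), rep("IT", 7), rep("Sales", 13))),
  tenured = factor(sample(0:1, size = 25, replace = TRUE),
                   levels = 0:1, labels = c("Probationary", "Tenured"))
)

# Counts by department and tenure status
xtab <- table(dat_demo$department, dat_demo$tenured,
              dnn = c("department", "tenured"))
xtab

# Same table with row and column totals appended
addmargins(xtab)

# Cell-wise proportions (each cell divided by the grand total)
prop.table(xtab)
```

This works without extra packages, but the output is a table object rather than a data frame, which is one reason the janitor approach below can be more convenient in a pipeline.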

tabyl() with the janitor package

We can start with a simple crosstab without row or column totals. We pipe our dataset into tabyl() and name the variables we want to cross-tabulate. I’ve also added an optional call to adorn_title(), which labels the column variable and makes the results clearer:

dat1 %>% 
  tabyl(department, tenured) %>% 
  adorn_title()

If we want to include row, column, and/or overall totals/sub-totals, we can use adorn_totals(). Its where argument controls which totals are added: “row” appends a totals row, “col” appends a totals column, and supplying both adds both, as shown below:

dat1 %>% 
  tabyl(department, tenured) %>% 
  adorn_totals(where = c("col","row")) %>% 
  adorn_title()

Next, if we want proportions/percentages, we can use adorn_percentages(). Its denominator argument specifies whether we want row-wise, column-wise, or cell-wise percentages, and accepts “row”, “col”, or “all”, respectively. We’ll also format and round the results using adorn_pct_formatting(), and include our percentage totals/subtotals by calling adorn_totals() again:

dat1 %>% 
  tabyl(department, tenured) %>%
  adorn_totals(where = c("row","col")) %>% 
  adorn_percentages(denominator = "all") %>% 
  adorn_pct_formatting(digits = 0) %>% 
  adorn_title()
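
To see how the denominator choice changes the result, here is a small sketch using row-wise percentages, so that each department's row sums to 100%. As an assumption for illustration, dat_demo and the seed are hypothetical stand-ins so the block runs on its own.

```r
library(dplyr)    # provides the %>% pipe
library(janitor)  # tabyl() and the adorn_*() helpers

# Hypothetical stand-in for dat1 (arbitrary seed for reproducibility)
set.seed(42)
dat_demo <- data.frame(
  department = c(rep("HR", 5), rep("IT", 7), rep("Sales", 13)),
  tenured = factor(sample(0:1, size = 25, replace = TRUE),
                   levels = 0:1, labels = c("Probationary", "Tenured"))
)

# Row-wise proportions: each cell is divided by its department's total
row_props <- dat_demo %>%
  tabyl(department, tenured) %>%
  adorn_percentages(denominator = "row")

# Format as whole-number percentages and label the column variable
row_props %>%
  adorn_pct_formatting(digits = 0) %>%
  adorn_title()
```

Row-wise percentages answer "what share of each department is tenured?", whereas denominator = "all" answers "what share of the whole company falls in this cell?".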

Lastly, if we want counts, totals, and percentages all displayed together, we can add in the adorn_ns() function:

dat1 %>% 
  tabyl(department, tenured) %>%
  adorn_totals(where = c("row","col")) %>% 
  adorn_percentages(denominator = "all") %>% 
  adorn_pct_formatting(digits = 0) %>% 
  adorn_ns() %>% 
  adorn_title()

Bar Charts

Now we’ll take a look at how to visualize data using bar charts. We’ll cover simple, side-by-side, and stacked bar charts. Let’s first start by distinguishing between the two different formats your data could be in prior to creating the bar chart.

“Raw” Format

“Raw” data is a dataset without counts/cross-tabs for the categorical variables. It looks exactly like our sample dataset:

head(dat1)

Tabular Data

“Tabular” data is when your categorical variables have already been tabulated and counted. It looks like this:

dat1 %>% 
  count(department, tenured) %>% 
  rename(count = n) -> dat2_counts

dat2_counts

dat1 %>% 
  count(department) %>% 
  rename(count = n) -> dat1_counts

dat1_counts

Note: It is VERY important to distinguish between the two formats when creating your bar chart, as each requires a slightly different approach.
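
To make the relationship between the two formats concrete, here is a rough sketch that expands a tabular dataset back into one row per employee with tidyr's uncount(), then counts it again. The counts_demo object is a made-up stand-in that mirrors the department sizes used in this tutorial.

```r
library(dplyr)
library(tidyr)

# Hypothetical "tabular" data: one row per department with a pre-computed count
counts_demo <- tibble(department = c("HR", "IT", "Sales"),
                      count = c(5, 7, 13))

# Tabular -> raw: repeat each row `count` times (the count column is dropped)
raw_demo <- counts_demo %>%
  uncount(weights = count)

nrow(raw_demo)

# Raw -> tabular: count the rows per department again
recounted <- raw_demo %>%
  count(department, name = "count")

recounted
```

Both objects carry the same information; the only difference is whether the counting has already been done, which is exactly what ggplot needs to be told.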

Simple Bar Charts

We’ll start with a simple (single-variable) bar chart where we wish to visualize the number of employees in each department. We’ll use the ggplot2 functions, which are loaded as part of the tidyverse.

“Raw” Format Simple Bar Chart

To do so, we need to supply the name of our dataset, our x-axis variable, and how we wish to color and fill our bar chart. As this is a simple bar chart, our x-axis variable, color, and fill will all be “mapped” to ‘department’:

ggplot(data = dat1, 
       aes(x = department,
           color = department,
           fill = department)) + 
  geom_bar() -> p1

p1

We can also make this a little more visually appealing by adding some optional arguments to include titles and make our color fill transparent (using the alpha argument):

ggplot(data = dat1, 
       aes(x = department,
           color = department,
           fill = department)) + 
  geom_bar(alpha = 0.6) + 
  labs(title = "HeadCount by Department",
       x = "Department",
       y = "Count") -> p1

p1

“Tabular” Format Simple Bar Chart

Now we’ll show how to do this when our data is tabular (already has counts).

The only adjustments we need to make are to add y = COUNT VAR. NAME inside aes() (which specifies which variable contains your category counts), and to add the stat = “identity” argument to geom_bar(), which tells ggplot it doesn’t need to count anything itself because those category counts are already in the dataset.

Note: If you type in ?geom_bar, you’ll see the “stat =” argument has a default of ‘stat = “count”’, which means that when we plot our bar chart, ggplot counts up the rows in each category by default. This is why we need to change this argument if our data is already counted and in tabular form.

Here’s what that looks like (note that I changed the data argument to “dat1_counts” because that’s the “tabular” form dataset we created as an example earlier):

ggplot(data = dat1_counts, 
       aes(x = department,
           y = count,
           color = department,
           fill = department)) + 
  geom_bar(alpha = 0.6,
           stat = "identity") + 
  labs(title = "HeadCount by Department",
       x = "Department",
       y = "Count") -> p1b

p1b
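
As an aside, ggplot2 also provides geom_col(), which uses the identity stat by default, so it can stand in for geom_bar(stat = "identity"). The sketch below assumes a made-up counts_demo data frame shaped like dat1_counts so that it runs on its own.

```r
library(ggplot2)

# Hypothetical pre-counted data, shaped like dat1_counts
counts_demo <- data.frame(department = c("HR", "IT", "Sales"),
                          count = c(5, 7, 13))

# geom_col() plots the y values as given, with no counting step
p_col <- ggplot(data = counts_demo,
                aes(x = department,
                    y = count,
                    color = department,
                    fill = department)) +
  geom_col(alpha = 0.6) +
  labs(title = "HeadCount by Department",
       x = "Department",
       y = "Count")

p_col
```

Which form to use is largely a matter of taste; geom_bar(stat = "identity") makes the contrast with the default counting behavior explicit, while geom_col() is shorter.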

Stacked Bar Charts

Instead of just visualizing headcounts by department, let’s also add in tenured status, so that we can see both the overall counts by department, as well as the within-department breakdowns by tenure status.

The main change to the code is that we are now switching our color and fill to the name of the secondary variable (tenure status). That’s all we need to change!

“Raw” Format Stacked Bar Chart

ggplot(data = dat1, 
       aes(x = department,
           color = tenured,
           fill = tenured)) + 
  geom_bar(alpha = 0.6)  + 
  labs(title = "HeadCount by Department & Tenure",
       x = "Department",
       y = "Count") -> p2

p2

As we can see, now we not only see how many employees are in each department, but also how many are probationary vs tenured.

“Tabular” Format Stacked Bar Chart

Now let’s look at how to create the same plot with “tabular” (pre-counted) data. As with the simple bar chart, the only real change is to specify stat = “identity” and y = COUNT VAR. NAME. We’ll still keep the color and fill mapped to our second variable (tenure status):

ggplot(data = dat2_counts,
       aes(x = department,
           y = count,
           color = tenured,
           fill = tenured)) +
  geom_bar(stat = "identity",
           alpha = 0.6) + 
  labs(title = "HeadCount by Department & Tenure",
       x = "Department",
       y = "Count") -> p2b

p2b
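
If the within-department proportions matter more than the raw headcounts, one option that goes slightly beyond the charts above is position = "fill", which rescales every stacked bar to a total height of 1. This is a minimal sketch using a hypothetical dat_demo stand-in and an arbitrary seed.

```r
library(ggplot2)

# Hypothetical stand-in for dat1 (arbitrary seed for reproducibility)
set.seed(42)
dat_demo <- data.frame(
  department = c(rep("HR", 5), rep("IT", 7), rep("Sales", 13)),
  tenured = factor(sample(0:1, size = 25, replace = TRUE),
                   levels = 0:1, labels = c("Probationary", "Tenured"))
)

# position = "fill" stacks the bars and scales each one to sum to 1
p_fill <- ggplot(data = dat_demo,
                 aes(x = department,
                     color = tenured,
                     fill = tenured)) +
  geom_bar(alpha = 0.6,
           position = "fill") +
  labs(title = "Tenure Mix by Department",
       x = "Department",
       y = "Proportion")

p_fill
```

The trade-off is that the chart no longer shows how large each department is, so it complements the count-based stacked chart rather than replacing it.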

Side-by-side Bar Charts

Lastly, we’ll show how to visualize the same data (employee counts by department and tenure status) using a non-stacked (side-by-side) bar chart. While they convey pretty much the same info, the emphasis of stacked bar charts is typically on the proportions of the second variable within each primary variable group, whereas a side-by-side bar chart helps visualize the exact employee counts for both variables. Both charts are effective options, and the decision of which one to use will usually depend on your specific goal.

In creating the side-by-side bar chart, the main change is the addition of the position = “dodge” argument.

Note: We can again use ?geom_bar to see that the default for the position argument is position = “stack”. By switching to “dodge”, we’re telling ggplot to place the bars next to each other instead of stacking them.

“Raw” Format Side-by-Side Bar Chart

Here’s what the side-by-side bar chart looks like for “raw” (non-count) data formats:

ggplot(data = dat1, 
       aes(x = department,
           color = tenured,
           fill = tenured)) + 
  geom_bar(alpha = 0.6,
           position = "dodge") + 
  labs(title = "HeadCount by Department & Tenure",
       x = "Department",
       y = "Count") -> p3

p3
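
One thing to watch for with dodged bars: if a department happens to have no employees in one tenure group (which is possible with randomly sampled data), the default behavior stretches the remaining bar across the department's full slot. A possible workaround is position_dodge(preserve = "single"), sketched below with a deliberately lopsided, made-up dat_gap data frame.

```r
library(ggplot2)

# Hypothetical data in which HR has no probationary employees
dat_gap <- data.frame(
  department = c(rep("HR", 3), rep("IT", 4)),
  tenured = factor(c("Tenured", "Tenured", "Tenured",
                     "Probationary", "Tenured", "Tenured", "Probationary"),
                   levels = c("Probationary", "Tenured"))
)

# preserve = "single" keeps every bar the same width,
# even where a group is missing for a department
p_dodge <- ggplot(data = dat_gap,
                  aes(x = department,
                      color = tenured,
                      fill = tenured)) +
  geom_bar(alpha = 0.6,
           position = position_dodge(preserve = "single")) +
  labs(title = "HeadCount by Department & Tenure",
       x = "Department",
       y = "Count")

p_dodge
```

With plain position = "dodge", the single HR bar would be twice as wide as the IT bars, which can make it look more important than it is.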

“Tabular” Format Side-by-Side Bar Chart

And finally, here’s how this looks for “tabular” (pre-counted) data. As with the previous examples, the main change for the side-by-side bar chart when using tabular vs raw data is to map y to the count variable and specify stat = “identity”:

ggplot(data = dat2_counts, 
       aes(x = department,
           y = count,
           color = tenured,
           fill = tenured)) + 
  geom_bar(alpha = 0.6,
           stat = "identity",
           position = "dodge") + 
  labs(title = "HeadCount by Department & Tenure",
       x = "Department",
       y = "Count") -> p3b

p3b

Additional Resources

Thanks for reading! Here are two fantastic data viz resources that have helped me on countless occasions:

Crosstabs & Bar Charts in R

Creating Crosstabs & Side-by-Side/Stacked Bar Charts