This tutorial will illustrate how to create cross-tabs from your data and how to visualize the results via a variety of bar charts.
To start, let’s create a sample dataset for 25 employees where we have each employee’s department (HR, IT, or Sales) and their tenure status (Probationary/Tenured).
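library(tidyverse)  # tibble, dplyr, and ggplot2
library(janitor)    # tabyl() and the adorn_*() helpers
# No seed is set, so the sampled values will differ from the output shown below
dat1 <- tibble(employee_id = sample(x = 100000:200000, size = 25, replace = FALSE),
               department = c(rep(x = "HR", times = 5),
                              rep(x = "IT", times = 7),
                              rep(x = "Sales", times = 13)),
               tenured = sample(x = 0:1, size = 25, replace = TRUE))
# Convert both grouping variables to factors; 0/1 become readable labels
dat1$department <- factor(dat1$department)
dat1$tenured <- factor(dat1$tenured, labels = c("Probationary", "Tenured"))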
dat1[1:10,]
## # A tibble: 10 × 3
##    employee_id department tenured
##          <int> <fct>      <fct>
##  1      124968 HR         Probationary
##  2      157835 HR         Tenured
##  3      174234 HR         Tenured
##  4      142962 HR         Tenured
##  5      164466 HR         Tenured
##  6      103636 IT         Probationary
##  7      106285 IT         Tenured
##  8      190830 IT         Probationary
##  9      133617 IT         Tenured
## 10      198528 IT         Probationary
In this scenario, we want to create a cross-tab by department & tenure status. In other words, we want to know the counts/proportions of employees who fall under each department and/or tenure status.
There are multiple ways to create a cross-tab, but I almost always use the janitor package.
We can start by getting a simple cross-tab without row or column totals. We pipe our dataset into tabyl() and name the two variables we want to cross-tabulate. I’ve also added an optional call to adorn_title() to make the results clearer:
dat1 %>%
  tabyl(department, tenured) %>%
  adorn_title()
##             tenured
##  department Probationary Tenured
##          HR            1       4
##          IT            4       3
##       Sales            8       5
If we want to include row, column, and/or overall totals/sub-totals, we can use adorn_totals(). If we do so, we need to specify whether we want row-wise, column-wise, or both types of totals, as shown below:
dat1 %>%
  tabyl(department, tenured) %>%
  adorn_totals(where = c("col", "row")) %>%
  adorn_title()
##             tenured
##  department Probationary Tenured Total
##          HR            1       4     5
##          IT            4       3     7
##       Sales            8       5    13
##       Total           13      12    25
Next, if we want proportions/percentages, we can use adorn_percentages(). Its denominator argument controls whether we get row-wise, column-wise, or cell-wise percentages, and accepts “row”, “col”, or “all”, respectively. We’ll also round the results using adorn_pct_formatting(), and include our percentage totals/subtotals by calling adorn_totals() before the percentages are computed:
dat1 %>%
  tabyl(department, tenured) %>%
  adorn_totals(where = c("row", "col")) %>%
  adorn_percentages(denominator = "all") %>%
  adorn_pct_formatting(digits = 0) %>%
  adorn_title()
##             tenured
##  department Probationary Tenured Total
##          HR           4%     16%   20%
##          IT          16%     12%   28%
##       Sales          32%     20%   52%
##       Total          52%     48%  100%
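For comparison, here is a minimal sketch of row-wise percentages, which answer a slightly different question: within each department, what share of employees is probationary vs tenured? The `demo` data frame is a hypothetical, non-random stand-in for dat1 that reproduces the counts shown above, so the numbers are stable from run to run.

```r
library(dplyr)    # provides the %>% pipe
library(janitor)  # tabyl() and the adorn_*() helpers

# Hypothetical stand-in for dat1 with fixed values that match the
# counts in the cross-tab above: HR 1/4, IT 4/3, Sales 8/5
demo <- data.frame(
  department = rep(c("HR", "HR", "IT", "IT", "Sales", "Sales"),
                   times = c(1, 4, 4, 3, 8, 5)),
  tenured = rep(rep(c("Probationary", "Tenured"), times = 3),
                times = c(1, 4, 4, 3, 8, 5))
)

# Each row is divided by its own row total, so every row sums to 1
row_pct <- demo %>%
  tabyl(department, tenured) %>%
  adorn_percentages(denominator = "row")

# Format as whole percentages and add the column-variable title
row_pct %>%
  adorn_pct_formatting(digits = 0) %>%
  adorn_title() %>%
  print()
```

With denominator = "row", HR reads as 1 of 5 probationary (20%) rather than 1 of 25 (4%), so the choice of denominator should follow the question being asked.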
Lastly, if we want counts, totals, and percentages all displayed together, we can add in the adorn_ns() function:
dat1 %>%
  tabyl(department, tenured) %>%
  adorn_totals(where = c("row", "col")) %>%
  adorn_percentages(denominator = "all") %>%
  adorn_pct_formatting(digits = 0) %>%
  adorn_ns() %>%
  adorn_title()
##             tenured
##  department Probationary  Tenured     Total
##          HR       4% (1)  16% (4)   20% (5)
##          IT      16% (4)  12% (3)   28% (7)
##       Sales      32% (8)  20% (5)  52% (13)
##       Total     52% (13) 48% (12) 100% (25)
Now we’ll take a look at how to visualize data using bar charts. We’ll cover simple, side-by-side, and stacked bar charts. Let’s first start by distinguishing between the two different formats your data could be in prior to creating the bar chart.
“Raw” data is a dataset without counts/cross-tabs for the categorical variables. It looks exactly like our sample dataset:
head(dat1)
## # A tibble: 6 × 3
##   employee_id department tenured
##         <int> <fct>      <fct>
## 1      124968 HR         Probationary
## 2      157835 HR         Tenured
## 3      174234 HR         Tenured
## 4      142962 HR         Tenured
## 5      164466 HR         Tenured
## 6      103636 IT         Probationary
“Tabular” data is when your categorical variables have already been tabulated and counted. It looks like this:
dat1 %>%
  count(department, tenured) %>%
  rename(count = n) -> dat2_counts
dat2_counts
## # A tibble: 6 × 3
##   department tenured      count
##   <fct>      <fct>        <int>
## 1 HR         Probationary     1
## 2 HR         Tenured          4
## 3 IT         Probationary     4
## 4 IT         Tenured          3
## 5 Sales      Probationary     8
## 6 Sales      Tenured          5
Or, if you just have one categorical variable, it looks like this:
dat1 %>%
  count(department) %>%
  rename(count = n) -> dat1_counts
dat1_counts
## # A tibble: 3 × 2
##   department count
##   <fct>      <int>
## 1 HR             5
## 2 IT             7
## 3 Sales         13
Note: It is VERY important to distinguish between the 2 formats when creating your bar chart, as they each require a slightly different approach.
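If your data arrives in one format and you would rather work in the other, the two can be converted back and forth. The sketch below assumes the tidyr package is available (it is part of the tidyverse) and uses a small hypothetical data frame, `raw_demo`, with the same department sizes as our sample data.

```r
library(dplyr)  # count()
library(tidyr)  # uncount()

# Hypothetical "raw" data: one row per employee
raw_demo <- data.frame(
  department = rep(c("HR", "IT", "Sales"), times = c(5, 7, 13))
)

# Raw -> tabular: one row per category, plus a column of counts
tabular_demo <- raw_demo %>%
  count(department, name = "count")

# Tabular -> raw: repeat each row `count` times and drop the count column
back_to_raw <- tabular_demo %>%
  uncount(weights = count)

print(tabular_demo)
print(nrow(back_to_raw))
```

A quick way to tell the formats apart: raw data has one row per observation (here, 25 rows), while tabular data has one row per category combination and a numeric count column.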
We’ll start with a simple (single-variable) bar chart where we wish to visualize the number of employees in each department. We’ll use the ggplot2 functions, which are loaded as part of the tidyverse.
To do so, we need to supply the name of our dataset, our x-axis variable, and how we wish to color and fill our bar chart. As this is a simple bar chart, our x-axis variable, color, and fill will all be “mapped” to ‘department’:
ggplot(data = dat1,
       aes(x = department,
           color = department,
           fill = department)) +
  geom_bar() -> p1
p1
We can also make this a little more visually appealing by adding some optional arguments to include titles and make our color fill transparent (using the alpha argument):
ggplot(data = dat1,
       aes(x = department,
           color = department,
           fill = department)) +
  geom_bar(alpha = 0.6) +
  labs(title = "HeadCount by Department",
       x = "Department",
       y = "Count") -> p1
p1
Now we’ll show how to do this when our data is tabular (already has counts).
The only adjustments we need to make are to add y = <name of your count variable> inside aes() (which tells ggplot which variable contains the category counts) and to add stat = “identity” to geom_bar(), which tells ggplot that it doesn’t need to count anything itself because the category counts are already in the dataset.
Note: If you type ?geom_bar, you’ll see that the stat argument defaults to stat = “count”, which means geom_bar() normally counts the rows in each category for us. This is why we need to change this argument when our data is already counted and in tabular form.
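As an aside, ggplot2 also provides geom_col(), which plots the y values as given and so behaves like geom_bar(stat = “identity”). The sketch below compares the two on a small hypothetical data frame, `counts_demo`, that mirrors the department counts from earlier; it is an illustration rather than a replacement for the approach in this tutorial.

```r
library(ggplot2)

# Hypothetical pre-counted ("tabular") data
counts_demo <- data.frame(
  department = c("HR", "IT", "Sales"),
  count = c(5, 7, 13)
)

# Approach used in this tutorial: geom_bar() with stat = "identity"
p_identity <- ggplot(counts_demo,
                     aes(x = department, y = count, fill = department)) +
  geom_bar(stat = "identity", alpha = 0.6)

# Shorthand: geom_col() uses the supplied y values directly
p_col <- ggplot(counts_demo,
                aes(x = department, y = count, fill = department)) +
  geom_col(alpha = 0.6)

# Both layers end up with the same bar heights
print(layer_data(p_identity)$y)
print(layer_data(p_col)$y)
```

Either form works for pre-counted data; the rest of this tutorial sticks with stat = “identity” so that the contrast with the default stat = “count” stays visible.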
Here’s what the stat = “identity” version looks like (note that I changed the data argument to dat1_counts because that’s the “tabular” form dataset we created as an example earlier):
ggplot(data = dat1_counts,
       aes(x = department,
           y = count,
           color = department,
           fill = department)) +
  geom_bar(alpha = 0.6,
           stat = "identity") +
  labs(title = "HeadCount by Department",
       x = "Department",
       y = "Count") -> p1b
p1b
Instead of just visualizing headcounts by department, let’s also add in tenure status, so that we can see both the overall counts by department and the within-department breakdowns by tenure status.
The main change to the code is that we are now switching our color and fill to the name of the secondary variable (tenure status). That’s all we need to change!
We’ll start first with the “raw” data format:
ggplot(data = dat1,
       aes(x = department,
           color = tenured,
           fill = tenured)) +
  geom_bar(alpha = 0.6) +
  labs(title = "HeadCount by Department & Tenure",
       x = "Department",
       y = "Count") -> p2
p2
As we can see, we now see not only how many employees are in each department, but also how many are probationary vs tenured.
Now let’s look at how to create the same plot with “tabular” (pre-counted) data. As with the simple bar chart, the only real change is to specify stat = “identity” and to map y to the variable that holds the counts. We’ll still keep the color and fill mapped to our second variable (tenure status):
ggplot(data = dat2_counts,
       aes(x = department,
           y = count,
           color = tenured,
           fill = tenured)) +
  geom_bar(stat = "identity",
           alpha = 0.6) +
  labs(title = "HeadCount by Department & Tenure",
       x = "Department",
       y = "Count") -> p2b
p2b
Sure enough, our results match what we got with the “raw” data format.
Lastly, we’ll show how to visualize the same data (employee counts by department and tenure status) using a non-stacked (side-by-side) bar chart. While they convey pretty much the same info, the emphasis of stacked bar charts is typically on the proportions of the second variable within each primary variable group, whereas a side-by-side bar chart helps visualize the exact employee counts for both variables. Both charts are effective options, and the decision of which one to use will usually depend on your specific goal.
In creating the side-by-side bar chart, the main change is the addition of the position = “dodge” argument.
Note: We can again use ?geom_bar to see that the default for the position argument is position = “stack”. By switching to “dodge”, we’re telling ggplot to place the bars side by side instead of stacking them.
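The position argument accepts other values as well. As a brief aside, position = “fill” stacks the bars and rescales each one to a height of 1, which puts the emphasis entirely on within-department proportions. The sketch below uses a hypothetical, non-random stand-in for dat1 (`demo`) with the same counts as our sample data.

```r
library(ggplot2)

# Hypothetical stand-in for dat1 with fixed values that match the
# counts used throughout: HR 1/4, IT 4/3, Sales 8/5
demo <- data.frame(
  department = rep(c("HR", "HR", "IT", "IT", "Sales", "Sales"),
                   times = c(1, 4, 4, 3, 8, 5)),
  tenured = rep(rep(c("Probationary", "Tenured"), times = 3),
                times = c(1, 4, 4, 3, 8, 5))
)

# position = "fill" rescales every stacked bar to span 0 to 1
p_fill <- ggplot(data = demo,
                 aes(x = department,
                     color = tenured,
                     fill = tenured)) +
  geom_bar(alpha = 0.6,
           position = "fill") +
  labs(title = "Tenure Mix by Department",
       x = "Department",
       y = "Proportion")
p_fill
```

Because every bar has the same height, this variant hides the differences in department size, so it suits questions about composition rather than headcount.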
Here’s what the side-by-side bar chart looks like for “raw” (non-count) data formats:
ggplot(data = dat1,
       aes(x = department,
           color = tenured,
           fill = tenured)) +
  geom_bar(alpha = 0.6,
           position = "dodge") +
  labs(title = "HeadCount by Department & Tenure",
       x = "Department",
       y = "Count") -> p3
p3
And finally, here’s how this looks for “tabular” (pre-counted) data. As with the previous examples, the main change for the side-by-side bar chart when using tabular vs raw data is to map y to the count variable and to specify stat = “identity”:
ggplot(data = dat2_counts,
       aes(x = department,
           y = count,
           color = tenured,
           fill = tenured)) +
  geom_bar(alpha = 0.6,
           stat = "identity",
           position = "dodge") +
  labs(title = "HeadCount by Department & Tenure",
       x = "Department",
       y = "Count") -> p3b
p3b
Thanks for reading! Here are two fantastic data viz resources that have helped me on countless occasions: