The proportion of Australian students completing Year 12 varies considerably with a range of demographic characteristics (e.g., gender, State/Territory, geographic location, socio-economic status, language background, and Indigenous status).
This week, I challenged myself to use R to replicate Figure 1.1 below. The figure shows Year 12 completion rates for 19-year-olds, based on data from the Australian Census of Population and Housing 2016, as reported in Lamb et al. (2020).
Figure 1.1: Percentage of 19-year-olds who have completed a Year 12 or equivalent qualification, by selected background characteristics (2016). Data from Australian Census of Population and Housing 2016, as reported by Lamb et al. (2020).
Figure 2.1 was my solution, depicting Year 12 completion rates by grouped demographic characteristics, with each group incorporated within the graph as a separate facet.
Figure 2.1: Percentage of 19-year-olds who have completed a Year 12 or equivalent qualification, by selected background characteristics (2016). Data from Australian Census of Population and Housing 2016, as reported by Lamb et al. (2020). Replication in R.
First, load packages into R.
library(tidyverse)
library(knitr)
library(bookdown)
library(here)
The data used in this example can be found here.
But here’s how I read in and previewed the original data (see the note at the end of this post).
Census2016 <- read.csv(here("data", "Census2016.csv"))
head(Census2016[1:3]) #preview columns 1-3 only
## ï..Group Characteristic Proportion
## 1 Australia 81.6
## 2 Gender Males 78.4
## 3 Gender Females 85.0
## 4 Indigeneity Non-Indigenous 82.7
## 5 Indigeneity Indigenous 57.8
## 6 Indigeneity Aboriginal 56.8
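Note that the ‘Group’ variable name above has been read in with stray characters (ï..) in front of it. This is the usual symptom of a CSV file saved as UTF-8 with a byte-order mark (BOM) being read without that encoding declared. I fixed it by renaming the column as follows.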
names(Census2016)[1] <- "Group"
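As an alternative to renaming the column afterwards, read.csv can be told about the byte-order mark up front. The sketch below is a minimal illustration, assuming the stray characters really do come from a UTF-8 BOM. The temporary file and its single data row are stand-ins written for the demonstration, not the original Census2016.csv.

```r
# Write a small CSV that starts with a UTF-8 byte-order mark (EF BB BF).
# This file is a stand-in for illustration only.
bom_file <- tempfile(fileext = ".csv")
con <- file(bom_file, open = "wb")
writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)
writeBin(charToRaw("Group,Characteristic,Proportion\nGender,Males,78.4\n"), con)
close(con)

# Declaring the encoding lets R drop the BOM while reading,
# so the first column name arrives clean.
fixed <- read.csv(bom_file, fileEncoding = "UTF-8-BOM")
names(fixed)
```

If this applies to your file, the same fileEncoding argument could be added to the read.csv(here("data", "Census2016.csv")) call above, which would make the renaming step unnecessary.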
The following code can be tweaked as necessary to set options for figure height, width, scale, and legend size in R Markdown documents.
scale_height = knitr::opts_chunk$get('fig.height')*2
scale_width = knitr::opts_chunk$get('fig.width')*1.25
knitr::opts_chunk$set(fig.height = scale_height, fig.width = scale_width)
theme_update(legend.text = element_text(size = rel(0.6)))
Figure 3.1 below shows my first attempt to produce a grouped bar graph based on this data, depicting Year 12 completion rates by grouped demographic characteristics, with each group incorporated as a separate facet.
Census2016 %>%
ggplot(aes(y = Proportion, x = Characteristic)) +
geom_bar(stat = 'identity') +
facet_grid(rows = vars(Group), scales = "free_y", space = "free_y") +
coord_flip() +
theme(panel.spacing = unit(1, "lines")) +
labs(y = "Percent", x = "") +
geom_text(aes(label=round(Proportion, digits = 0)), hjust = 1.6, color="white", size=3.5)
Figure 3.1: First attempt (using facets, default sort order)
In this graph, the facets (Group) are presented in alphabetical order, while the bars (Proportion by Characteristic) are presented within each group in reverse alphabetical order, because coord_flip() draws the first factor level at the bottom of the axis.
The code that follows is used to establish the desired sort order for facets and characteristics, before repeating the above code to generate Figure 3.2.
# first, sort order of facets as desired
Census2016$Group <- factor(Census2016$Group,
levels = c("", "Gender", "State/Territory", "Location",
"SES deciles (Low to High)", "Indigeneity", "Language background"))
# next, order characteristics by the proportion variable; levels run in ascending order,
# which coord_flip() draws from the bottom up, so bars read in descending order from the top
# use of `levels = unique()` required because some proportions are identical across categories
# High (SES) and Eastern Asian (Language background) are both 91.8
Census2016$Characteristic <- factor(Census2016$Characteristic,
levels = unique(Census2016$Characteristic[order(Census2016$Proportion)]))
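To see what this releveling does, here is a small check on a three-row toy data frame. The values are taken from the table above, but the object names toy, default_levels, and sorted_levels are mine, for illustration only.

```r
# Three rows borrowed from the Census2016 table, for illustration only
toy <- data.frame(
  Characteristic = c("Males", "Females", "Indigenous"),
  Proportion = c(78.4, 85.0, 57.8)
)

# Without explicit levels, factor() sorts the labels alphabetically
default_levels <- levels(factor(toy$Characteristic))

# With explicit levels, the labels follow ascending Proportion instead
sorted_levels <- levels(factor(
  toy$Characteristic,
  levels = unique(toy$Characteristic[order(toy$Proportion)])
))

default_levels
sorted_levels
```

The first vector comes back alphabetical (Females, Indigenous, Males) and the second in ascending order of proportion (Indigenous, Males, Females), which is the order ggplot then uses for the axis.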
Figure 3.2: Second attempt (using facets, sorted by descending order of proportions within each group)
This graph was better, but still not ideal, because it sorted in descending order of the proportion variable within categories without regard to whether entries were reference or comparator categories (i.e., Indigenous vs. non-Indigenous; Language Background Other than English (LBOTE) vs. English) or subgroups within these categories (e.g., Aboriginal, Torres Strait Islander, both).
To further refine the graph, I explored ways of coloring selected bars, after amending the datafile to include a Color variable (coded “Reference”, “Comparator”, or “Default”). I also added an Order variable, manually assigning a number to each observation to represent the order in which I wanted it to appear. (The alternative would be to write code specifying that Reference should precede Comparator where present, with the remaining values following in descending order of proportion.) Finally, I added a vertical line at the national average (81.6%) so that each bar can be compared against it.
Here is what the amended datafile looked like:
head(Census2016)
## Group Characteristic Proportion Color Order
## 1 Australia 81.6 Reference 1
## 2 Gender Males 78.4 Default 3
## 3 Gender Females 85.0 Default 2
## 4 Indigeneity Non-Indigenous 82.7 Reference 4
## 5 Indigeneity Indigenous 57.8 Comparator 5
## 6 Indigeneity Aboriginal 56.8 Default 8
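The rule-based alternative to a hand-assigned Order variable could look something like the sketch below. It is not the approach used in this post, and the names toy and color_rank are hypothetical. It assumes the Color codes are exactly “Reference”, “Comparator”, and “Default”, and uses only the Indigeneity rows from the data.

```r
# Indigeneity rows from the data, deliberately listed out of order
toy <- data.frame(
  Group = rep("Indigeneity", 5),
  Characteristic = c("Aboriginal", "Indigenous", "Torres Strait Islander",
                     "Non-Indigenous", "Both"),
  Proportion = c(56.8, 57.8, 68.7, 82.7, 65.3),
  Color = c("Default", "Comparator", "Default", "Reference", "Default")
)

# Rank the colour codes: Reference first, Comparator second, everything else last
color_rank <- match(toy$Color, c("Reference", "Comparator", "Default"))

# Sort within group by that rank, then by descending proportion
toy <- toy[order(toy$Group, color_rank, -toy$Proportion), ]

# Number the rows in their new order
toy$Order <- seq_len(nrow(toy))
toy[, c("Characteristic", "Proportion", "Color", "Order")]
```

For these rows the rule gives the same sequence as the manual numbering (Non-Indigenous, Indigenous, Torres Strait Islander, Both, Aboriginal). The resulting Order values only need to be meaningful within a group, because each group is drawn in its own facet.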
The final code used to produce Figure 3.3 is presented below:
# first, sort order of facets as desired
Census2016$Group <- factor(Census2016$Group,
levels = c("", "Gender", "State/Territory", "Location",
"SES deciles (Low to High)", "Indigeneity",
"Language background"))
# second, sort order of characteristics in graph by descending values of order variable
Census2016$Characteristic <- factor(Census2016$Characteristic,
levels = unique(Census2016$Characteristic[order(desc(Census2016$Order))]))
# next, generate graph of proportions by characteristic, with group as facet, and Color determining fill
Census2016 %>%
ggplot(aes(y = Proportion, x = Characteristic,
fill=factor(ifelse(Color=="Reference", "Highlight1",
ifelse(Color=="Comparator", "Highlight2",
"Normal"))))) +
geom_bar(stat = 'identity', show.legend = FALSE) +
scale_fill_manual(name="Color", values = c("red", "black", "grey50")) +
facet_grid(rows = vars(Group), scales = "free_y", space = "free_y") +
coord_flip() +
theme(panel.spacing = unit(1, "lines")) +
labs(y = "Percent", x = "") +
geom_hline(yintercept = 81.6, color="red", linetype = "longdash") +
geom_text(aes(label=round(Proportion, digits = 0)), hjust = 1.6, color="white", size=3.5)
Figure 3.3: Final attempt (colored bars)
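The nested ifelse works because scale_fill_manual assigns unnamed colors in the alphabetical order of the fill labels (Highlight1, Highlight2, Normal). A possible variation, sketched below on three rows of the data, maps fill straight to the Color variable and passes a named vector of colors, so the pairing no longer depends on that ordering. The names toy, fill_values, and p are mine, and geom_col() is used as shorthand for geom_bar(stat = 'identity').

```r
library(ggplot2)

# Three rows from the data, one per Color code, for illustration only
toy <- data.frame(
  Characteristic = c("Non-Indigenous", "Indigenous", "Aboriginal"),
  Proportion = c(82.7, 57.8, 56.8),
  Color = c("Reference", "Comparator", "Default")
)

# Named vector: each Color code is tied to its fill explicitly
fill_values <- c(Reference = "red", Comparator = "black", Default = "grey50")

p <- ggplot(toy, aes(x = Characteristic, y = Proportion, fill = Color)) +
  geom_col(show.legend = FALSE) +
  scale_fill_manual(values = fill_values) +
  coord_flip() +
  labs(y = "Percent", x = "")
```

Printing p draws the plot. The facet, reference line, and text label layers from the code above can be added to it unchanged.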
Please feel free to use the code that follows to copy this data into R and try the above examples for yourself.
Census2016 <- structure(list(
Characteristic = c("Australia", "Males", "Females", "NSW", "Victoria",
"Queensland", "South Australia", "Western Australia", "Tasmania",
"Northern Territory", "Australian Capital Territory", "Major Cities",
"Inner Regional", "Outer Regional", "Remote", "Very Remote",
"Low", "2", "3", "4", "5", "6", "7", "8", "9", "High", "English",
"LBOTE", "Northern European", "Southern European", "Eastern European",
"Southwest and Central Asian", "Southern Asian", "Southeast Asian",
"Eastern Asian", "Australian Indigenous", "Other", "Non-Indigenous",
"Indigenous", "Aboriginal", "Torres Strait Islander", "Both"),
Proportion = c(81.6,
78.4, 85, 80.2, 82.9, 84.4, 79.6, 80.7, 71.2, 58.8, 90.6,
85.1, 72, 70.9, 65, 48.4, 66.8, 74, 76.9, 78.9, 80, 82.4,
84.2, 86.3, 88.7, 91.8, 79.7, 88.3, 79.8, 89, 90, 79.1, 94.9,
88.6, 91.8, 33.3, 79.4, 82.7, 57.8, 56.8, 68.7, 65.3),
Group = c("", "Gender", "Gender", "State/Territory",
"State/Territory", "State/Territory", "State/Territory", "State/Territory",
"State/Territory", "State/Territory", "State/Territory", "Location",
"Location", "Location", "Location", "Location", "SES deciles (Low to High)",
"SES deciles (Low to High)", "SES deciles (Low to High)", "SES deciles (Low to High)",
"SES deciles (Low to High)", "SES deciles (Low to High)", "SES deciles (Low to High)",
"SES deciles (Low to High)", "SES deciles (Low to High)", "SES deciles (Low to High)",
"Language background", "Language background", "Language background",
"Language background", "Language background", "Language background",
"Language background", "Language background", "Language background",
"Language background", "Language background", "Indigeneity",
"Indigeneity", "Indigeneity", "Indigeneity", "Indigeneity"),
Order = c(1L, 3L, 2L, 38L,
36L, 35L, 39L, 37L, 40L, 41L, 34L, 19L, 20L, 21L, 22L, 23L, 33L,
32L, 31L, 30L, 29L, 28L, 27L, 26L, 25L, 24L, 9L, 10L, 16L, 15L,
13L, 17L, 11L, 14L, 12L, 19L, 18L, 4L, 5L, 8L, 6L, 7L),
Color = c("Reference", "Default", "Default",
"Default", "Default", "Default", "Default", "Default", "Default",
"Default", "Default", "Default", "Default", "Default", "Default",
"Default", "Default", "Default", "Default", "Default", "Default",
"Default", "Default", "Default", "Default", "Default", "Reference",
"Comparator", "Default", "Default", "Default", "Default", "Default",
"Default", "Default", "Default", "Default", "Reference", "Comparator",
"Default", "Default", "Default")),
row.names = c(NA, -42L),
class = "data.frame")
Thanks to the following contributors on Stack Overflow for tips and tricks used in this document:
RoB, for pointers on how to group bars using facet_grid within ggplot
Russ Thomas, for tips on using ifelse within ggplot to format the fill of bars based on an additional variable in the data frame
Joris Meys, for tips on using dput to generate a list of data for use in reproducible examples
Lamb, Stephen, Shuyan Huo, Anne Walstab, Andrew Wade, Quentin Maire, Esther Doecke, Jen Jackson, and Zoran Endekov. 2020. “Educational Opportunity in Australia 2020: Who Succeeds and Who Misses Out?” Melbourne: Centre for International Research on Education Systems, Victoria University, for the Mitchell Institute. https://www.vu.edu.au/sites/default/files/educational-opportunity-in-australia-2020.pdf.
Note: If variable names are numbers, read.csv adds an X to the front of each one when the file is read in. Use check.names = FALSE in the read.csv function to avoid this.