Be sure to change the author in the YAML to your name. Remember to keep it inside the quotes.
Questions that require the use of R will have an R code chunk below them.
Download the UNESCO Institute of Statistics dataset on global student to teacher ratios (titled student_teacher_ratio.csv) from the Canvas page for this assignment and save this file to the folder where the RMD file is located.
Remember to change the filepath location in the
read.csv() function to where the .csv dataset is
located on your computer. You can find the filepath by using the
file.choose() function.
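The import step can be sketched as follows. This is a minimal sketch, assuming the CSV is saved next to the RMD file; the relative path and the variable name csv_path are assumptions about your setup, while student_teacher matches the object name used in the later code chunks.

```r
# Assumed relative path: the CSV is saved in the same folder as the RMD file.
csv_path <- "student_teacher_ratio.csv"

# If you need the full path, run file.choose() once in the console and paste
# the result into csv_path. It opens an interactive file picker, so it is
# usually best not to leave the call in the RMD itself.
if (file.exists(csv_path)) {
  student_teacher <- read.csv(csv_path)
  str(student_teacher)  # check column names and types before wrangling
} else {
  message("File not found: ", csv_path)
}
```

Checking file.exists() first gives a readable message instead of a read.csv() error when the path is wrong.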
Starting with this assignment, there will be less hand-holding through the analysis process.
After you update the filepath in the read.csv() function, you can delete the # at the start of line 26.
For this Challenge Problem assignment, you are going to be using the UNESCO Institute of Statistics dataset on global student to teacher ratios. The data includes educational information at the country level. The original documentation for this dataset can be found on its Tidy Tuesday page. Note that the Tidy Tuesday data has been merged with another dataset so that it also contains region information:
region: Continent/region name
sub.region: Sub-region name
region.code: Continent/region code
sub.region.code: Sub-region code
The goal of this assignment is to take you through the process of creating a data visualization from a basic ggplot to one that is appealing and easy to understand. The question of interest for this assignment is:
What are the student to teacher ratios in primary education for each continent/region in the world (excluding Antarctica) in the most recent year? Note: Use the most recent year we have data for each country (this may be different across countries).
[Hints: How many rows should there be for each country if you want the most recent year? Which educational level group are you interested in? Are there some countries with a missing value for region?]
No, the data is not in the appropriate form to answer the question. Currently, there are multiple rows for each country and for several education levels. The research question focuses on the primary education level group, so the data should be wrangled to remove countries with a missing region, keep only the primary education level group, and keep only one row per country (the most recent year).
[Hint: You should have 165 observations in your wrangled data frame.]
S_T <- student_teacher %>%
  filter(!is.na(region), indicator == "Primary Education") %>%
  group_by(country) %>%
  filter(year == max(year, na.rm = TRUE))
nrow(S_T)
## [1] 165
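Beyond the row count, it can help to confirm that no country still has more than one row. The sketch below uses a small made-up data frame (the values are hypothetical; only the column names follow the assignment data) to show one way to run that check.

```r
library(dplyr)

# Toy data with hypothetical values; column names mirror the assignment data.
toy <- data.frame(
  country       = c("A", "A", "B", "B", "B"),
  year          = c(2016, 2017, 2015, 2017, 2018),
  student_ratio = c(20.1, 19.8, 31.0, 30.2, 29.5)
)

# Keep the most recent year for each country, then drop the grouping.
latest <- toy %>%
  group_by(country) %>%
  filter(year == max(year, na.rm = TRUE)) %>%
  ungroup()

# Any country listed here still has more than one row.
duplicates <- latest %>%
  count(country) %>%
  filter(n > 1)

nrow(latest)      # number of rows kept, one per country in this toy data
nrow(duplicates)  # 0 means every country appears exactly once
```

The same count() and filter() check can be applied to S_T after wrangling.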
[Hints: Which plot is better for comparing groups: histograms, boxplots, or density plots? No faceting allowed.]
ggplot(S_T, aes(x = student_ratio, y = region, fill = region)) +
  geom_boxplot() +
  theme_classic() +
  labs(x = "Student To Teacher Ratio",
       y = "Region",
       title = "Primary Education: Student to Teacher Ratio by Region") +
  scale_fill_paletteer_d("RSkittleBrewer::smarties") +
  theme(legend.position = "none",
        plot.title = element_text(size = 15, face = "bold", hjust = 1))
Let’s improve the plot with these suggestions:
reorder, in descending order, the regions by the median of student to teacher ratio rather than by alphabetical order of region (HINT: if reordering outside of ggplot, e.g. using the mutate function, make sure there is no grouping structure on the data),
put the region variable on the y-axis and the student to teacher ratio variable on the x-axis, if you haven't done so already, so that long labels can be read,
customize the axis labels so that y has no label and x has a meaningful label, and
choose the classic theme for the plot.
S_T <- S_T %>%
  ungroup() %>%
  mutate(region = fct_reorder(region, student_ratio, .fun = median, .desc = TRUE))
ggplot(S_T, aes(x = student_ratio, y = region, fill = region)) +
  geom_boxplot() +
  theme_classic() +
  labs(x = "Student To Teacher Ratio",
       y = NULL,
       title = "Primary Education: Student to Teacher Ratio by Region") +
  scale_fill_paletteer_d("RSkittleBrewer::smarties") +
  theme(legend.position = "none",
        plot.title = element_text(size = 15, face = "bold", hjust = 1))
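To see what the reordering does to the factor levels, here is a small sketch on made-up values (the vectors region and ratio are hypothetical and stand in for the assignment columns).

```r
library(forcats)

# Hypothetical values: medians are A = 11, B = 31, C = 21.
region <- factor(c("A", "A", "B", "B", "C", "C"))
ratio  <- c(10, 12, 30, 32, 20, 22)

# Order the levels from the largest median to the smallest.
reordered <- fct_reorder(region, ratio, .fun = median, .desc = TRUE)

levels(region)     # "A" "B" "C" (alphabetical default)
levels(reordered)  # "B" "C" "A" (descending median)
```

ggplot2 draws the first level of a discrete y-axis at the bottom, so flipping .desc changes whether the largest median ends up at the bottom or the top of the plot.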
Educational Note: If this visualization was for a paper or a presentation, you could stop after this plot. However, you could add the raw data to the plot to show more detail for each region.
Let’s include multiple geoms to provide even more information in one plot:
use the argument to remove outliers in the original geom (NOTE: proceed very carefully if doing this in practice! This is fine here because we are about to put all the points on the plot, including these),
modify the color of the original geom to gray60,
add another geom layer of jittered points that is colored by region and semi-transparent, and
remove the legend.
[Note: Choose a color palette that you think is visually pleasing.]
##Setting the seed so that the same jittered plot is reproduced instead of always having random points
set.seed(20211026)
ggplot(S_T, aes(x = student_ratio, y = region, fill = region)) +
  geom_boxplot(outlier.shape = NA, fill = "gray60") +
  geom_jitter(aes(color = region), size = 2, alpha = 0.1) +
  theme_classic() +
  labs(x = "Student To Teacher Ratio",
       y = NULL,
       title = "Primary Education: Student to Teacher Ratio by Region") +
  scale_color_paletteer_d("RSkittleBrewer::smarties") +
  theme(legend.position = "none",
        plot.title = element_text(size = 15, face = "bold", hjust = 1))
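One optional refinement, shown here as a sketch rather than as part of the required answer: geom_jitter() adds noise in both directions by default, and setting width = 0 keeps each point at its exact value on the x-axis while still spreading points within a group. The example uses the mpg dataset that ships with ggplot2 as a stand-in for the assignment data, and the height and alpha values are arbitrary choices.

```r
library(ggplot2)

# Fix the seed so the jittered positions are reproducible.
set.seed(20211026)

# mpg is a stand-in dataset: hwy plays the role of the ratio, class the region.
p <- ggplot(mpg, aes(x = hwy, y = class)) +
  geom_boxplot(outlier.shape = NA, color = "gray60") +
  # width = 0 leaves x-values untouched; height spreads points within a group.
  geom_jitter(aes(color = class), width = 0, height = 0.2,
              size = 2, alpha = 0.4) +
  theme_classic() +
  theme(legend.position = "none")

p
```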
Educational Note: You might add the raw data to the visualization to identify gaps in each region (e.g., the gap in Oceania) or to show the number of observations per region.
Let’s suppose you decide that you want to show the distribution using violin plots instead of boxes and you also want to add a larger dot for the mean of student to teacher ratio by region:
add a column to the data frame that computes the mean of student to teacher ratio by region,
create a data visualization with the jittered points (as created previously), violin geom, and the other customizations (e.g., modified axes labels, no legend), and
add a larger point for the mean of student to teacher ratio by region. Be sure that the added point is colored by region.
[Note: Think about the ordering of the geoms when creating the plot.]
set.seed(20211026)
mean_S_T <- S_T %>%
  group_by(region) %>%
  mutate(ST_mean = mean(student_ratio, na.rm = TRUE)) %>%
  ungroup()
ggplot(mean_S_T, aes(x = student_ratio, y = region, fill = region)) +
  geom_violin() +
  geom_jitter(aes(color = region), size = 2, alpha = 0.1) +
  geom_point(aes(x = ST_mean, color = region), size = 3) +
  theme_classic() +
  labs(x = "Student To Teacher Ratio",
       y = NULL,
       title = "Primary Education: Student to Teacher Ratio Mean by Region") +
  scale_color_paletteer_d("RSkittleBrewer::smarties") +
  theme(legend.position = "none",
        plot.title = element_text(size = 15, face = "bold", hjust = 1))
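Because the mean column is repeated on every row, the mean point is drawn once per country and the copies sit on top of each other. An alternative, sketched below on made-up values, is to summarise to one row per region and pass that smaller data frame to geom_point(). The names toy and region_means are hypothetical.

```r
library(dplyr)
library(ggplot2)

# Toy data with hypothetical values; column names mirror the assignment data.
toy <- data.frame(
  region        = c("Africa", "Africa", "Europe", "Europe", "Europe"),
  student_ratio = c(40, 50, 12, 14, 16)
)

# One row per region, so each mean is drawn exactly once.
region_means <- toy %>%
  group_by(region) %>%
  summarise(ST_mean = mean(student_ratio, na.rm = TRUE), .groups = "drop")

set.seed(20211026)
p <- ggplot(toy, aes(x = student_ratio, y = region)) +
  geom_jitter(aes(color = region), width = 0, height = 0.1, alpha = 0.4) +
  # The data argument swaps in the summary table for this layer only.
  geom_point(data = region_means,
             aes(x = ST_mean, color = region), size = 4) +
  theme_classic() +
  theme(legend.position = "none")

p
```

The plot looks the same either way; the summarised version mainly avoids heavy overplotting if the mean points are ever made semi-transparent.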
Let’s add a vertical line to relate these points to a baseline (worldwide average):
color the line gray70.
[Note: Use online resources (e.g., STHDA) for how to create a vertical line in the plot.]
set.seed(20211026)
mean_global <- mean(S_T$student_ratio, na.rm = TRUE)
ggplot(mean_S_T, aes(x = student_ratio, y = region, fill = region)) +
  geom_violin() +
  geom_jitter(aes(color = region), size = 2, alpha = 0.1) +
  geom_point(aes(x = ST_mean, color = region), size = 3) +
  geom_vline(xintercept = mean_global, color = "gray70", linewidth = 1.5) +
  theme_classic() +
  labs(x = "Student To Teacher Ratio",
       y = NULL,
       title = "Primary Education: Student to Teacher Ratio Mean by Region") +
  scale_color_paletteer_d("RSkittleBrewer::smarties") +
  theme(legend.position = "none",
        plot.title = element_text(size = 15, face = "bold", hjust = 1))
Lastly, let’s add a meaningful title, subtitle, caption, and text annotations so the plot speaks for itself:
add a meaningful title for the plot,
add a meaningful subtitle that summarizes the plot (you may need to use \n in your label to add a new line if you have a long subtitle),
add one meaningful text annotation for each geom in the plot (this link is a good resource), and
add a meaningful caption describing the original source of the dataset.
set.seed(20211026)
ggplot(mean_S_T, aes(x = student_ratio, y = region, fill = region)) +
  geom_violin() +
  geom_jitter(aes(color = region), size = 2, alpha = 0.1) +
  geom_point(aes(x = ST_mean, color = region), size = 3) +
  geom_vline(xintercept = mean_global, color = "gray70", linewidth = 1.5) +
  # One text annotation per geom: violins, jittered points, mean points, line
  annotate("text", x = 44, y = "Oceania",
           label = "Africa has the highest student to teacher ratios",
           color = "black", size = 3.5, hjust = 0, vjust = -0.98) +
  annotate("text", x = 37, y = "South America",
           label = "Each semi-transparent dot represents a country",
           color = "black", size = 3.5, hjust = 0.05, vjust = -0.5) +
  annotate("text", x = 35, y = "Europe",
           label = "Large colored circles represent the regional mean student to teacher ratio",
           color = "black", size = 3.5, hjust = 0, vjust = -0.5) +
  annotate("text", x = 35, y = "North America",
           label = "The gray line represents the global mean student to teacher ratio",
           color = "black", size = 3.5, hjust = 0, vjust = -0.8) +
  theme_classic() +
  labs(x = "Student To Teacher Ratio",
       y = NULL,
       title = "Primary Education: Student to Teacher Ratio Mean by Region",
       subtitle = "Violin plots show the distribution of student to teacher ratios by region,\nwith the highest mean in Africa and the lowest mean in Europe",
       caption = "Source: UNESCO Institute of Statistics (Tidy Tuesday, May 7, 2019)") +
  scale_color_paletteer_d("RSkittleBrewer::smarties") +
  theme(legend.position = "none",
        plot.title = element_text(size = 15, face = "bold", hjust = 1),
        plot.subtitle = element_text(size = 8, hjust = 1),
        plot.caption = element_text(size = 8, face = "italic", hjust = 0.5))
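As an alternative to typing \n by hand, a long subtitle can be wrapped programmatically. The sketch below assumes the stringr package is available (it is part of the tidyverse); the width of 70 characters and the variable names are arbitrary choices for illustration.

```r
library(stringr)

# A long subtitle built from two pieces so the source line stays short.
long_subtitle <- paste(
  "Violin plots show the distribution of student to teacher ratios by region,",
  "with the highest mean in Africa and the lowest mean in Europe"
)

# str_wrap() inserts line breaks so that no line exceeds the given width.
wrapped <- str_wrap(long_subtitle, width = 70)

cat(wrapped)
```

The wrapped string can then be passed to the plot with labs(subtitle = wrapped).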
CHALLENGE (optional): Another part of EDA is to
identify interesting or relevant points on the plot. Using the
geom_label_repel() function from the {ggrepel}
package, denote the countries that have the smallest ratio with a green
label on the plot and denote the countries that have the largest ratio
with a red label on the plot.
set.seed(20211026)