The visualization below shows the 100 most frequent word pairings (reduced to their dictionary stems to limit redundancy) drawn from documents assigned as readings in the Yale Medical School curriculum.
Question: Which word pairs occur most frequently together in the documents, and can they reveal major concepts in the curriculum?
Methods: The text will be tokenized into words, stop words will be removed, and the remaining words will be stemmed so that words sharing the same root are counted together. The stemmed text will then be rejoined, split into bigrams, and filtered of any remaining stop words and numbers, and the most common bigrams will be visualized.
First, load in the required packages:
# load packages
library(rjson)
library(dplyr)
library(tidytext)
library(tidyverse)
library(tidyr)
library(ggplot2)
library(igraph)
library(ggraph)
library(SnowballC)
Read in the data. The original data is stored as a JSON file, so we use fromJSON() to read it and then reshape the result into a data frame.
# read in data
read_curriculum <- fromJSON(file = 'class_2021.json')
curriculum_data <- data.frame(matrix(unlist(read_curriculum),
                                     ncol = length(read_curriculum[[1]]),
                                     byrow = TRUE),
                              stringsAsFactors = FALSE)
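Before moving on, it helps to confirm the reshape worked. A quick sanity check (this assumes, as the rest of the code does, that X1 is the document identifier and X2 holds the reading text):
# inspect the reshaped data; X1 should be the document identifier, X2 the text
glimpse(curriculum_data)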
Now we will stem the words, reducing each word to its dictionary stem or root. To do this, we first unnest the text into individual words and remove stop words, common words we don’t want in the analysis. We then rejoin the stemmed words so the text can be split into bigrams in the next step.
# create a new variable called stem with our stemmed words. remove stop words.
stemmed_curriculum <- curriculum_data %>%
  unnest_tokens(output = word, input = X2) %>%
  anti_join(stop_words, by = "word") %>%
  mutate(stem = wordStem(word))
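As a quick illustration of what wordStem() does, here is a tiny made-up example; the stems in the comment are what the default Porter stemmer typically returns and may differ slightly from your output:
# illustrate stemming on a few example words
wordStem(c("cells", "diseases", "learning"))
# typically returns "cell" "diseas" "learn"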
# rejoin the words
joinstemmed_curriculum <- stemmed_curriculum %>%
  group_by(X1) %>%
  summarize(text = str_c(stem, collapse = " ")) %>%
  ungroup()
Separate the text into bigrams by specifying n = 2.
# tokenize bigrams by specifying n = 2.
curriculum_bigrams <- joinstemmed_curriculum %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)
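To see concretely what token = "ngrams", n = 2 produces, here is a small toy example with an invented sentence:
# toy example: every pair of consecutive words becomes one bigram row
tibble(X1 = "doc1", text = "cell membrane transport protein") %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)
# yields "cell membrane", "membrane transport", "transport protein"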
Next we remove even more stop words. Some words are reduced by stemming into forms that match stop words, so we separate each bigram into two columns and filter both.
# separate into two columns to remove stop words
cbigrams_separated <- curriculum_bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ")
Looking at our data, it also contains many numbers, so we will filter those out as well.
# remove stop words and purely numeric tokens in both columns
cbigrams_filtered <- cbigrams_separated %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word) %>%
  filter(!str_detect(word1, "^[0-9.]+$")) %>%
  filter(!str_detect(word2, "^[0-9.]+$"))
Look at the most common bigrams. This reveals words that don’t appear relevant to the analysis, which we can collect into a custom list of stop words and remove.
# check number of occurrences
cbigram_counts <- cbigrams_filtered %>%
  count(word1, word2, sort = TRUE)
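To actually inspect the most frequent pairs, print the top of the count table:
# show the most frequent bigrams
head(cbigram_counts, 10)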
# custom stop words, to be removed from analysis
custom_stop_words <- tibble(word = c("mm","hg","md","section", "mg", "yale", "kei", "www", "http", "html", "school", "learn", "answer", "lesson"))
Filter out these stop words:
# filter stop words
cbigrams_filtered2 <- cbigrams_filtered %>%
  filter(!word1 %in% custom_stop_words$word) %>%
  filter(!word2 %in% custom_stop_words$word)
Check our analysis again:
# check number of occurrences
cbigram_counts2 <- cbigrams_filtered2 %>%
  count(word1, word2, sort = TRUE)
We don’t want to include all of the bigrams in our graph, only the most common ones, so we’ll keep the top 100 rows of our counts.
# keep only the 100 most frequent bigrams
cbigram_occurance <- cbigram_counts2 %>%
  top_n(100, n)
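Note that top_n() keeps ties, so this can return slightly more than 100 rows. If exactly 100 rows are wanted, slice_max() is one alternative (the name cbigram_top100 below is just illustrative):
# alternative: cap at exactly 100 rows, breaking ties arbitrarily
cbigram_top100 <- cbigram_counts2 %>%
  slice_max(n, n = 100, with_ties = FALSE)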
We want to create a graph with directed edges: each edge runs from the first word of a bigram to the second, with the bigram’s count as its weight. graph_from_data_frame() builds such a graph from our count table, treating the first two columns as the edge endpoints and the remaining column as an edge attribute.
# convert the bigram counts into an igraph object
cbigram_graph_filtered <- cbigram_occurance %>%
  graph_from_data_frame()
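To make the from/to/weight mapping concrete, here is a tiny made-up example: graph_from_data_frame() treats the first two columns as the edge endpoints and carries any remaining columns along as edge attributes.
# toy example with invented words and counts
toy_counts <- tibble(word1 = c("cell", "blood"), word2 = c("membran", "pressur"), n = c(12, 9))
toy_graph <- graph_from_data_frame(toy_counts)
E(toy_graph)$n  # the count column is now an edge attribute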
Visualize the network, with arrow transparency and thickness corresponding to how frequently each bigram occurs:
# visualize net
set.seed(100)
a <- grid::arrow(type = "open", length = unit(.01, "inches"))
ggraph(cbigram_graph_filtered, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n, edge_width = n), show.legend = FALSE,
                 arrow = a, end_cap = circle(.07, 'inches')) +
  geom_node_point(color = "red", size = 3) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  theme_void()
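If the network figure is needed outside the notebook, it can be written to disk with ggsave(); the file name here is just an example:
# save the most recent plot
ggsave("curriculum_bigram_network.png", width = 10, height = 8, dpi = 300)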
The visualization reveals that many terms relate to cell, blood, disease, and system, which makes sense for a medical curriculum. It also shows terms, such as acid, that appear in many different bigrams without necessarily sharing the same meaning or connotation. When limiting to only the top 100 bigrams, some terms are connected only to the other word of their own bigram.
This analysis is intended to help visualize the key two-word terms and ideas studied by Yale Medical students. It can be useful when evaluating curriculum and learning objectives, revising learning materials, and creating assessments. Further analysis could compare bigram frequencies across lecture materials, readings, and assessments, and could also test whether stemming the words is actually helpful for this kind of analysis.