Week 7 Independent Analysis: Topic Modeling

Introduction

This analysis uses topic modeling to find common topics in the assigned readings of the Yale Medical School curriculum. Both Latent Dirichlet Allocation (LDA) and Structural Topic Modeling (STM) are used to identify topics and the words associated with them. It looks to answer the question: What central topics are featured in the assigned readings?

Prepare the Environment

Install (if needed) and load the necessary packages:

# load packages

  pacman::p_load(rjson,
                 tm,
                 tidyr,
                 tidyverse, 
                 tidytext,
                 DT,
                 SnowballC,
                 topicmodels,
                 stm,
                 ldatuning,
                 knitr,
                 LDAvis
                 )

Wrangle Data

Read in the data:

# read data into environment
  read_curriculum <- fromJSON(file = 'class_2021.json')
  curriculum_data <- data.frame(matrix(unlist(read_curriculum),
                                       ncol = length(read_curriculum[[1]]), byrow = TRUE),
                                stringsAsFactors = FALSE)

Later in this analysis we will use STM to identify topics. This algorithm includes an argument that factors document metadata into the topic estimates. To use the course column as metadata, we need to make its values less unique so that readings from the same course share a value. We will drop the last six characters of each value, keeping just the initial 3-character course identifier:

# remove unneeded characters in course identifiers
curriculum_3 <- curriculum_data %>%
    mutate(X3 = substr(X3, 1, nchar(X3) - 6))
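
As a minimal sketch of what this substr() call does, assume course identifiers are nine characters long, like the made-up values below (the real format in class_2021.json is not shown here). Dropping the last six characters then leaves the three-character prefix:

```r
# Hypothetical course identifiers (assumed format, not taken from the data)
course_ids <- c("ABC-00123", "XYZ-04567")

# Keep everything except the last 6 characters of each identifier
course_prefix <- substr(course_ids, 1, nchar(course_ids) - 6)
print(course_prefix)
```

If the identifiers vary in length, the prefix will too; substr(X3, 1, 3) would instead keep exactly the first three characters.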

We’ll also take a smaller random sample of 100 documents so the algorithms run faster:

# take a random sample of documents
  set.seed(7)  # for reproducibility
  curriculum_sample <- curriculum_3[sample(nrow(curriculum_3), size = 100), ]

Separate the text strings into words for analysis by tokenizing the data. We will also stem the data in this analysis to reduce the dataset even further and consolidate words like “disease” and “diseases.”

# tokenize curriculum text and remove stop words and numbers
  curriculum_tidy <- curriculum_sample %>%
      unnest_tokens(output = word, input = X2) %>%
      anti_join(stop_words, by = "word") %>%
      filter(!word %in% (0:5000)) %>%
      mutate(stem = wordStem(word))
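
To make the pipeline concrete, here is a small sketch run on two made-up sentences (not from the curriculum data). It assumes the same column names, X1 and X2, and shows that numeric tokens are dropped and that “disease” and “diseases” end up with the same stem:

```r
library(dplyr)
library(tidytext)
library(SnowballC)

# Made-up example text (not from the curriculum data)
toy_readings <- tibble(
  X1 = c("reading_1", "reading_2"),
  X2 = c("The disease affects 2 valves of the heart.",
         "Heart diseases are common in 2021.")
)

toy_tidy <- toy_readings %>%
  unnest_tokens(output = word, input = X2) %>%  # one lowercase token per row
  anti_join(stop_words, by = "word") %>%        # drop common English stop words
  filter(!word %in% (0:5000)) %>%               # drop tokens that are integers from 0 to 5000
  mutate(stem = wordStem(word))                 # Porter stem of each remaining word

print(toy_tidy)
```

Note that the number filter only matches whole-number tokens between 0 and 5000; other numeric tokens would need a separate rule.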

Examine the results:

# find most common words
  curriculum_tidy %>%
    count(word, sort = TRUE) %>%
    datatable(options = list(pageLength = 20))

## Warning in instance$preRenderHook(instance): It seems your data is too big for
## client-side DataTables. You may consider server-side processing:
## https://rstudio.github.io/DT/server.html

Looking at the results, there are a few words that don’t seem important to the analysis. We will remove these as custom stop words:

# create custom stop words list
  custom_stop_words <- c("mm", "hg", "md", "section", "mg", "yale",
                         "kei", "www", "http", "html", "school",
                         "learn", "answer", "lesson", "will")

# filter them from the data
  curriculum_tidy <- curriculum_sample %>%
      unnest_tokens(output = word, input = X2) %>%
      anti_join(stop_words, by = "word") %>%
      filter(!word %in% (0:5000)) %>%
      filter(!word %in% custom_stop_words) %>%
      mutate(stem = wordStem(word))

Prepare for Modeling

We must create a document term matrix for the LDA algorithm.

To do this, we will consider each individual reading as a unique document. We use count() to find how many times each stem occurs in each reading, then create a matrix with one row per reading and one column per stem, where each cell holds the number of times that stem occurs in that reading.

# word count per reading, one column for each word, include number of times word occurs
  curr_dtm <- curriculum_tidy %>%
    count(X1, stem) %>%
    cast_dtm(X1, stem, n)
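
The shape of the resulting matrix is easier to see on a toy example. The sketch below uses made-up readings and stems (assumed values, not from the data) and the same count() and cast_dtm() calls:

```r
library(dplyr)
library(tidytext)
library(tm)  # provides the DocumentTermMatrix class that cast_dtm() returns

# Made-up stems for two hypothetical readings
toy_stems <- tibble(
  X1   = c("reading_1", "reading_1", "reading_1", "reading_2", "reading_2"),
  stem = c("heart", "heart", "valv", "heart", "lung")
)

toy_dtm <- toy_stems %>%
  count(X1, stem) %>%       # one row per reading-stem pair with its frequency n
  cast_dtm(X1, stem, n)     # rows = readings, columns = stems, cells = counts

# Convert to a dense matrix to inspect it (fine for toy data, costly for large corpora)
as.matrix(toy_dtm)
```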

Next, to prepare for the Structural Topic Modeling, we will use the textProcessor() function.

# process curriculum document text in preparation for structural topic modeling
  curr_temp <- textProcessor(curriculum_sample$X2, # column in our data frame that contains the text to be processed
                        metadata = curriculum_sample, # the data frame that contains the text of interest
                        lowercase = TRUE,
                        removestopwords = TRUE,
                        removenumbers = TRUE,
                        removepunctuation = TRUE,
                        wordLengths = c(3,Inf),
                        stem = TRUE,
                        onlycharacter = FALSE,
                        striphtml = TRUE,
                        customstopwords = custom_stop_words)

## Building corpus... 
## Converting to Lower Case... 
## Removing punctuation... 
## Removing stopwords... 
## Remove Custom Stopwords...
## Removing numbers... 
## Stemming... 
## Creating Output...

The stm package requires its inputs in a package-specific format. The following code pulls the elements from the curr_temp list created above that we will need later when we call the stm() function.

# pull elements from the temp list that was created that will be required for the stm() function
  curr_meta <- curr_temp$meta
  curr_vocab <- curr_temp$vocab
  curr_docs <- curr_temp$documents
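
For orientation, the sketch below builds by hand what we understand the native stm document format to be: a list with one two-row integer matrix per document, where the first row holds 1-based indices into the vocabulary and the second row holds counts. The vocabulary and counts are made up for illustration.

```r
# Hypothetical vocabulary of stemmed terms
toy_vocab <- c("diabet", "heart", "lung")

# Each document is a 2-row integer matrix:
#   row 1 = 1-based index into the vocabulary, row 2 = count of that term
toy_docs <- list(
  matrix(c(2L, 3L,   # term indices: "heart", "lung"
           4L, 1L),  # counts for those terms
         nrow = 2, byrow = TRUE),
  matrix(c(1L,       # term index: "diabet"
           2L),      # count
         nrow = 2, byrow = TRUE)
)

# Recover the terms and counts for the first document
first_doc <- data.frame(term  = toy_vocab[toy_docs[[1]][1, ]],
                        count = toy_docs[[1]][2, ])
print(first_doc)
```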

Model

Now we’re going to analyze the data with topic modeling. These models produce a specified number of topics based on the data and parameters they are given. This allows us to see the broader topics discussed in the curriculum.

We’ll start with the LDA algorithm. This unsupervised model finds groupings of words, or topics, in textual data. LDA allows documents to overlap in content, meaning one document can be made up of many topics. Likewise, each topic is a mixture of words, and words can be shared between topics.

Find k-value

Both models require a value of k, the number of topics in our corpus. We can look for a suitable value of k using FindTopicsNumber() from the ldatuning package and the DTM created earlier. The following code computes the metric for k-values from 5 to 50 in steps of 5 and plots the results:

# find values of k
  k_metrics <- FindTopicsNumber(
    curr_dtm,
    topics = seq(5, 50, by = 5),
    metrics = "Griffiths2004",
    method = "Gibbs",
    control = list(),
    mc.cores = NA,
    return_models = FALSE,
    verbose = FALSE,
    libpath = NULL
  )

  FindTopicsNumber_plot(k_metrics)
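
Besides reading the plot, the best candidate can be pulled out of the returned data frame directly. The sketch below uses made-up metric values shaped like the FindTopicsNumber() output (a topics column plus one column per metric); the numbers are hypothetical and not results from this analysis.

```r
# Hypothetical metric values shaped like FindTopicsNumber() output
toy_metrics <- data.frame(
  topics = seq(5, 50, by = 5),
  Griffiths2004 = c(-9100, -8800, -8600, -8550, -8700,
                    -8900, -9050, -9200, -9400, -9600)
)

# Griffiths2004 is a metric to maximize, so take the k with the highest value
best_k <- toy_metrics$topics[which.max(toy_metrics$Griffiths2004)]
print(best_k)

# A finer grid around the best coarse value could then be passed to `topics =`
fine_grid <- seq(best_k - 4, best_k + 4, by = 1)
print(fine_grid)
```

Searching a finer grid this way is one option for narrowing k down to a value that falls between the coarse steps.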

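The plot suggests the metric peaks near k = 18, so we will fit both models with that value. Because 18 is not one of the tested values (5, 10, ..., 50), this is an approximate reading of the plot.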

curr_lda <- LDA(curr_dtm, 
                    k = 18, 
                    control = list(seed = 7) # set seed for reproducibility
                    )

Then we will use the stm() function. STM factors in metadata about the documents when it estimates topic prevalence. We will ask it to use the course identifier column we tidied earlier.

# fit STM topic model
  curr_stm <- stm(documents = curr_docs,
                  data = curr_meta,
                  vocab = curr_vocab,
                  prevalence = ~X3,   # course identifier as a prevalence covariate
                  K = 18,
                  max.em.its = 100,
                  verbose = FALSE)

To help visualize the output, use the plot.STM() function to see the expected proportion of each topic along with its top 5 words.

  plot.STM(curr_stm, n = 5)

Now we will explore these models.

Explore

We can explore the models by visualizing the beta values of the LDA model and the topic layout of the STM.

We will start by looking at the beta values of our LDA model. Beta is the estimated probability of a term being generated by a given topic. First, convert the LDA model to a tidy data frame to prepare for the beta value visualization:

# convert LDA model to tidy dataframe
      tidy_lda_curr <- tidy(curr_lda)

Now we will be able to visualize the top 5 terms for each topic:

# visualize top 5 terms for each topic
  curr_top_terms <- tidy_lda_curr %>%
    group_by(topic) %>%
    slice_max(beta, n = 5, with_ties = FALSE) %>%
    ungroup() %>%
    arrange(topic, -beta)

  curr_top_terms %>%
    mutate(term = reorder_within(term, beta, topic)) %>%  # order terms within each facet
    ggplot(aes(beta, term, fill = as.factor(topic))) +
    geom_col(show.legend = FALSE) +
    scale_y_reordered() +
    labs(title = "Top 5 terms in each LDA topic",
         x = expression(beta), y = NULL) +
    facet_wrap(~ topic, ncol = 4, scales = "free")
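
The same tidy() function can also return gamma values, the estimated proportion of each document that belongs to each topic. The sketch below fits a tiny LDA model on made-up counts (hypothetical readings and stems, with k = 2 chosen only for illustration) and checks that each document's gamma values sum to 1:

```r
library(dplyr)
library(tidytext)
library(topicmodels)

# Made-up stem counts for three hypothetical readings
toy_counts <- tibble(
  doc  = c("r1", "r1", "r2", "r2", "r3", "r3"),
  stem = c("heart", "valv", "heart", "lung", "lung", "asthma"),
  n    = c(4L, 3L, 2L, 2L, 5L, 3L)
)

toy_dtm <- cast_dtm(toy_counts, doc, stem, n)
toy_lda <- LDA(toy_dtm, k = 2, control = list(seed = 7))

# One row per document-topic pair with its estimated proportion
toy_gamma <- tidy(toy_lda, matrix = "gamma")

# Topic proportions within each document should add up to 1
gamma_totals <- toy_gamma %>%
  group_by(document) %>%
  summarise(total = sum(gamma))
print(gamma_totals)
```

Applied to curr_lda, the same call would show which topics dominate each reading.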

To look at the STM, we can use toLDAvis():

  toLDAvis(mod = curr_stm, docs = curr_docs)

Communicate

Summary: question(s), methods, findings, and discussion

This analysis used unsupervised machine learning to identify topics within the given curriculum texts from Yale Medical School. It sought to discover what central topics are featured in the assigned readings.

The STM plot shows that the topic with the highest expected proportion contains many general medical terms that point to the overarching theme of the data. The STM has some topics related to medical conditions, but others appear to concern diagnostics, testing, or treatment, such as topics 11 and 8. The LDA model also has an overarching topic, topic 3. Otherwise it appears to have separated topics largely by medical condition; topic 14, for example, appears to relate to pregnancy. For topics that contain more than one medical condition, such as topic 11 with “diabetes,” “adrenal,” and “mrsa,” it is possible that these diseases require similar skills or treatments, or have similar symptoms. Further analysis with the findThoughts() function may reveal why these words were grouped together. Examining the STM visually with toLDAvis() reveals two distinct groups containing many topics that partly or heavily overlap. It also reveals 6 isolated groups, 3 of which sit farther from the rest than the others.

Topic modeling can be useful when evaluating curriculum and learning objectives, altering learning materials, and creating assessments. It could even be used to create curriculum from relevant documents or to reorganize unit or topic sequencing so that similar topics are placed together. Further analysis could examine lecture materials, readings, and assessments to see whether the topics are similar across these documents. Such results can reveal whether the assigned readings pertain to topics included in the curriculum and discussed in the course. More analysis may also find that stemming is not helpful, that a different k-value is optimal, or that the course identifiers are unhelpful in the STM. No other variables or features were collected with the original data, but future analyses could focus on medium or source.