1. PREPARE

To help us better understand the context, questions, and data sources we’ll be using in Unit 3, this section will focus on the following topics:

  1. Context. As context for our analysis this week, we’ll review several related papers by my colleagues relevant to our analysis of MOOC-Ed discussion forums.
  2. Questions. We’ll also examine what insight topic modeling can provide into a question that we asked participants to answer in their professional learning teams (PLTs).
  3. Project Setup. This should be very familiar by now, but we’ll set up a new R project and install and load the required packages for the topic modeling walkthrough.

1a. Context

Participating in a MOOC and Professional Learning Team: How a Blended Approach to Professional Development Makes a Difference

Abstract

Massive Open Online Courses for Educators (MOOC-Eds) provide opportunities for using research-based learning and teaching practices, along with new technological tools and facilitation approaches for delivering quality online professional development. The Teaching Statistics Through Data Investigations MOOC-Ed was built for preparing teachers in pedagogy for teaching statistics, and it has been offered to participants from around the world. During 2016-2017, professional learning teams (PLTs) were formed from a subset of MOOC-Ed participants. These teams met several times to share and discuss their learning and experiences. This study focused on examining the ways that a blended approach to professional development may result in similar or different patterns of engagement to those who only participate in a large-scale online course. Results show the benefits of a blended learning environment for retention, engagement with course materials, and connectedness within the online community of learners in an online professional development on teaching statistics. The findings suggest the use of self-forming autonomous PLTs for supporting a deeper and more comprehensive experience with self-directed online professional developments such as MOOCs. Other online professional development courses, such as MOOCs, may benefit from purposely suggesting and advertising, and perhaps facilitating, the formation of small face-to-face or virtual PLTs who commit to engage in learning together.

Data Source & Analysis

All peer interaction, including peer discussion, takes place within the discussion forums of MOOC-Eds, which are hosted using the Moodle Learning Management System. To build the dataset you’ll be using for this walkthrough, the research team wrote a query for Moodle’s MySQL database, which records participants’ user logs of activity in the online forums. This SQL query combines separate database tables containing postings and comments, including participant IDs, timestamps, discussion text, and other attributes or “metadata.”

Summary of Key Findings

The following highlight some key findings related to the discussion forums in the papers cited above:

  1. MOOCs designed specifically for K-12 teachers can provide positive self-directed learning experiences and rich engagement in discussion forums that help form online communities for educators.
  2. Analysis of discussion forum data in TSDI provided a very clear picture of how enthusiastic many PLT members and leaders were to talk to others in the online community. They posed their questions and shared ideas with others about teaching statistics throughout the units, even though they were also meeting synchronously several times with their colleagues in small group PLTs.
  3. Findings on knowledge construction demonstrated that over half of the discussions in both courses moved beyond sharing information and statements of agreement and entered a process of dissonance, negotiation and co-construction of knowledge, but seldom moved beyond this phase to one in which new knowledge was tested or applied. These findings echo similar research on difficulties in promoting knowledge construction in online settings.
  4. Topic modeling provides more interpretable and cohesive models for discussion forums than other popular unsupervised modeling techniques such as k-means and k-medoids clustering algorithms.

1b. Guiding Questions

For the paper, Participating in a MOOC and Professional Learning Team: How a Blended Approach to Professional Development Makes a Difference, the researchers were interested in unpacking how participants who enrolled in the Teaching Statistics through Data Investigations MOOC-Ed might benefit from also being in a smaller group of professionals committed to engaging in the same professional development. The specific research question for this paper was:

What are the similarities and differences between how PLT members and Non-PLT online participants engage and meet course goals in a MOOC-Ed designed for educators in secondary and collegiate settings?

Dr. Hollylynne Lee and the TSDI team also developed a facilitation guide designed specifically for PLT teams to help groups synthesize the ideas in the course and make plans for how to implement new strategies in their classroom in order to impact students’ learning of statistics. One question PLT members were asked to address was:

What ideas or issues emerged in the discussion forums this past week?

For this walkthrough, we will further examine that question through the use of topic modeling.

And just to reiterate yet again from Unit 1, one overarching question we’ll explore throughout this course, and that Silge and Robinson (2018) identify as a central question to text mining and natural language processing, is:

How do we quantify what a document or collection of documents is about?

1c. Set Up

As highlighted in Chapter 6 of Data Science in Education Using R (DSIEUR), one of the first steps of every workflow should be to set up a “Project” within RStudio. This will be your “home” for any files and code used or created in Unit 3.

You are welcome to continue using the same project created for Unit 1, or to create an entirely new project for Unit 3. Either way, once your project is set up, open a new R script and load the following packages that we’ll need for this walkthrough:
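
A minimal sketch of the loading step, assuming a package list inferred from this walkthrough (the exact list used by the original script is not shown). The names pkgs and not_installed are illustrative only. Rather than failing part-way through, it reports anything that still needs to be installed and attaches the rest:

```r
# Packages referenced in this walkthrough (an assumed list; adjust as needed)
pkgs <- c("tidyverse", "tidytext", "SnowballC", "topicmodels",
          "stm", "ldatuning", "LDAvis")

# Which of these are not installed yet?
not_installed <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]

if (length(not_installed) > 0) {
  message("Install these first: install.packages(c(",
          paste0('"', not_installed, '"', collapse = ", "), "))")
}

# Attach every package that is available
for (p in setdiff(pkgs, not_installed)) {
  library(p, character.only = TRUE)
}
```

Warnings like the ones shown next only mean a package was built under a slightly different version of R; they do not stop the package from loading.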

## Warning: package 'tidyverse' was built under R version 4.0.5
## Warning: package 'ggplot2' was built under R version 4.0.5
## Warning: package 'tibble' was built under R version 4.0.5
## Warning: package 'tidyr' was built under R version 4.0.5
## Warning: package 'readr' was built under R version 4.0.5
## Warning: package 'purrr' was built under R version 4.0.5
## Warning: package 'dplyr' was built under R version 4.0.5
## Warning: package 'stringr' was built under R version 4.0.5
## Warning: package 'forcats' was built under R version 4.0.5
## Warning: package 'tidytext' was built under R version 4.0.5
## Warning: package 'topicmodels' was built under R version 4.0.5
## Warning: package 'ldatuning' was built under R version 4.0.5
## Warning: package 'knitr' was built under R version 4.0.5
## Warning: package 'LDAvis' was built under R version 4.0.5
## Warning: package 'devtools' was built under R version 4.0.5
## Warning: package 'usethis' was built under R version 4.0.5

2. WRANGLE

2a. Import Forum Data

By default, many of the columns like course_id and forum_id are read in as numeric data. For our purposes, we plan to treat them as unique identifiers or names for our courses, forums, discussions, and posts. The read_csv() function has a handy col_types = argument for changing these column types from numeric to character.
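
Here is a small sketch of that import step. Since the name of the real forum data file is not given here, the example writes a tiny stand-in CSV to a temporary file first; the rows and the temp file are hypothetical, while the column names follow the output shown later in this walkthrough:

```r
library(readr)

# Hypothetical stand-in for the forum export (the real file name is not shown here)
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "course_id,forum_id,discussion_id,post_id,post_content",
  "9,126,6822,1,Not much comparison",
  "9,127,6823,2,I agree with your comments"
), tmp)

# Read the id columns as character instead of letting readr guess numeric
ts_forum_data <- read_csv(
  tmp,
  col_types = cols(
    course_id = col_character(),
    forum_id = col_character(),
    discussion_id = col_character(),
    post_id = col_character()
  )
)

ts_forum_data
```

Any column not named in cols() is still guessed by readr, so post_content comes through as character on its own.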

2b. Cast a Document Term Matrix

In this section we’ll revisit some familiar tidytext functions used in Units 1 & 2 for tidying and tokenizing text and introduce some new functions from the stm package for processing text and transforming our data frames into new data structures required for topic modeling.

Functions Used

tidytext functions

  • unnest_tokens() splits a column into tokens
  • anti_join() (from dplyr) returns all rows from x without a match in y; we use it to remove stop words from our data.
  • cast_dtm() takes a tidied data frame and “casts” it into a document-term matrix (dtm)

dplyr functions

  • count() lets you quickly count the unique values of one or more variables
  • group_by() takes a data frame and one or more variables to group by
  • summarise() creates a summary of data using arguments like sum and mean

stm functions

  • textProcessor() takes in a vector or column of raw texts and performs text processing like removing punctuation and word stemming.
  • prepDocuments() performs several corpus manipulations including removing words and renumbering word indices

Tidying Text

Prior to topic modeling, we have a few remaining steps to tidy our text that hopefully should feel familiar by this point. If you recall from Chapter 1 of Text Mining With R, these preprocessing steps include:

  1. Transforming our text into “tokens”
  2. Removing unnecessary characters, punctuation, and whitespace
  3. Converting all text to lowercase
  4. Removing stop words such as “the”, “of”, and “to”

Let’s tokenize our forum text using the familiar unnest_tokens() function and remove stop words per usual:
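
As a rough sketch of this step on toy data (the two posts below are made up, and the name forums_tidy is just an assumption for the tokenized result):

```r
library(dplyr)
library(tidytext)

# Toy stand-in for ts_forum_data (hypothetical posts)
ts_forum_data <- tibble(
  post_id = c("1", "2"),
  post_content = c(
    "I agree that the students need more time with data.",
    "Students collect real data in my statistics class."
  )
)

forums_tidy <- ts_forum_data %>%
  unnest_tokens(output = word, input = post_content) %>%  # one row per word, lowercased
  anti_join(stop_words, by = "word")                      # drop common stop words

forums_tidy
```

Note that unnest_tokens() also strips punctuation and converts to lowercase by default, which covers steps 2 and 3 from the list above.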

## # A tibble: 165,720 x 10
##    course_id course_name       forum_id forum_name discussion_id discussion_name
##    <chr>     <chr>             <chr>    <chr>      <chr>         <chr>          
##  1 9         Teaching Statist~ 126      Investiga~ 6822          Not much compa~
##  2 9         Teaching Statist~ 126      Investiga~ 6822          Not much compa~
##  3 9         Teaching Statist~ 126      Investiga~ 6822          Not much compa~
##  4 9         Teaching Statist~ 126      Investiga~ 6822          Not much compa~
##  5 9         Teaching Statist~ 126      Investiga~ 6822          Not much compa~
##  6 9         Teaching Statist~ 126      Investiga~ 6822          Not much compa~
##  7 9         Teaching Statist~ 126      Investiga~ 6822          Not much compa~
##  8 9         Teaching Statist~ 126      Investiga~ 6822          Not much compa~
##  9 9         Teaching Statist~ 126      Investiga~ 6822          Not much compa~
## 10 9         Teaching Statist~ 126      Investiga~ 6822          Not much compa~
## # ... with 165,710 more rows, and 4 more variables: post_id <chr>,
## #   post_title <chr>, post_date <chr>, word <chr>

Now let’s do a quick word count to see some of the most common words used throughout the forums. This should give us a sense of what we’re working with, and later we’ll need these word counts for creating our document term matrix for topic modeling:
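
A minimal sketch of the count, assuming a tokenized data frame like the one above (forums_tidy and word_counts are illustrative names, and the rows are toy data):

```r
library(dplyr)

# Toy stand-in for the tokenized forum data: one row per word per post
forums_tidy <- tibble(
  post_id = c("1", "1", "1", "2", "2", "3"),
  word = c("students", "data", "agree", "students", "class", "students")
)

# Count how often each word occurs across all posts, most frequent first
word_counts <- forums_tidy %>%
  count(word, sort = TRUE)

word_counts
```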

## # A tibble: 13,666 x 2
##    word            n
##    <chr>       <int>
##  1 students     6837
##  2 data         4338
##  3 statistics   3095
##  4 school       1488
##  5 questions    1470
##  6 time         1252
##  7 class        1208
##  8 agree         999
##  9 teaching      987
## 10 statistical   957
## # ... with 13,656 more rows

Terms like “students,” “data,” and “class” are about what we would have expected from a course on teaching statistics. The terms “agree” and “time,” however, are not so intuitive and are worth a quick look as well.

✅ Comprehension Check

Use the filter() and grepl() functions introduced in Unit 1, Section 3b to filter for rows in our ts_forum_data data frame that contain the terms “agree” and “time” and another term or terms of your choosing. Select a random sample of 10 posts using the sample_n() function for your terms and answer the following questions:
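
One possible way to approach this, sketched on toy posts. It assumes you want posts containing either term (hence the | in the pattern) and samples 2 rows rather than 10 because the toy data is so small; sampled_posts is an illustrative name:

```r
library(dplyr)

# Toy stand-in for ts_forum_data (hypothetical posts)
ts_forum_data <- tibble(
  post_content = c(
    "I agree. The best display depends on the question.",
    "Every time I teach statistics I add more projects.",
    "Technology ushers in structural changes.",
    "I agree with your comments about class time."
  )
)

set.seed(588)  # only so the random sample is repeatable

sampled_posts <- ts_forum_data %>%
  select(post_content) %>%
  filter(grepl("agree|time", post_content)) %>%  # posts mentioning either term
  sample_n(2)

sampled_posts
```

Keep in mind that grepl() is case sensitive by default and matches partial words, so “time” would also match “sometimes”.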

## # A tibble: 10 x 1
##    post_content                                                                 
##    <chr>                                                                        
##  1 "-  Airport Data     <What learning goal(s) could this task be used for stud~
##  2 "As a learner of statistics  and then as a teacher  I think that is too much~
##  3 "I agree.  The best form of graphical data display depends on what you are t~
##  4 "I really liked how the students refined their questions by looking at the d~
##  5 "I think it's great that so many of you are teaching non-AP stats courses!! ~
##  6 "Technology ushers in fundamental structural changes that can be integral to~
##  7 "I try to have my students engaged in projects at least 8 times during the y~
##  8 "This unit uses data from Census @School. However  the data evolved from jus~
##  9 "One specific thought sticks out in my mind as I reflect on these questions ~
## 10 "Maybe I am not understanding the task fully without actually working with t~
  1. What, if anything, do these posts have in common?
  • similar forum topics: either “discuss w/ your colleagues or investigate”
  2. What topics or themes might be apparent, or do you anticipate emerging, from our topic modeling?
  • Just guessing here but: technology, application to classroom, including time limitations

Your output should look something like this:

## # A tibble: 10 x 1
##    post_content                                                                 
##    <chr>                                                                        
##  1 "I teach Algebra 2 as well.  I taught a Unit on Statistics in the Algebra 2 ~
##  2 "If we look at the CCSL  starting in 6th grade students begin to ask statist~
##  3 "The short class can also be a challenge.   I have sent students home with t~
##  4 "Hi Margaret      I agree with your comments.  The Coke vs Pepsi activity wa~
##  5 "I found the definitions of math and statistics helpful.  As a math major  I~
##  6 "Unfortunately  my class is small (typically 5 - 15 students) and that makes~
##  7 "Every time I teach statistics  I try to add more that my students can do  s~
##  8 "Lana  you are right!  The louder and brighter - the better for the students~
##  9 "In my first year teaching AP Stats  I found my biggest problem was finding ~
## 10 "Dear Participant         It has been a pleasure to offer the view.php?52 Te~

Creating a Document Term Matrix

For now, however, let’s treat each individual post as a unique “document.” As noted above, to create our document term matrix, we’ll first need to count() how many times each word occurs in each document, or post_id in our case. We can then create a matrix that contains one row per post, as our original data frame did, but that now contains a column for each word in the entire corpus and a value of n for how many times that word occurs in each post.

To create this document term matrix from our post counts, we’ll use the cast_dtm() function like so and assign it to the variable forums_dtm:
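
A small sketch of the count-then-cast step on toy data (forums_tidy is an assumed name for the tokenized data; the rows are made up):

```r
library(dplyr)
library(tidytext)

# Toy tokenized data: one row per word per post
forums_tidy <- tibble(
  post_id = c("1", "1", "1", "2", "2", "2"),
  word = c("students", "data", "data", "students", "class", "time")
)

forums_dtm <- forums_tidy %>%
  count(post_id, word) %>%    # n = times each word occurs in each post
  cast_dtm(post_id, word, n)  # rows = posts, columns = words, values = n

forums_dtm
```

In this toy matrix there are 2 documents and 4 terms, and most of the 8 cells are already zero, which is what the sparsity figure in the printout summarizes.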

✅ Comprehension Check

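Take a look at our forums_dtm object in the console and answer the following questions:

  1. What “class” of object is forums_dtm? a DocumentTermMatrix, which is stored as a simple_triplet_matrix

  2. How many unique documents and terms are included in our matrix? documents: 5761, terms: 13666

  3. Why might there be fewer documents/posts than were in our original data frame? posts left with no words after tokenizing and removing stop words (for example, empty posts or posts made up only of stop words) have nothing to count, so they never make it into the matrix

  4. What exactly is meant by “sparsity”? the share of cells in the matrix that are zero, i.e. document-term pairs where the word never appears in the post; here 78,594,030 of 78,729,826 cells (about 99.8%) are empty, which is reported as 100%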

## [1] "DocumentTermMatrix"    "simple_triplet_matrix"
## <<DocumentTermMatrix (documents: 5761, terms: 13666)>>
## Non-/sparse entries: 135796/78594030
## Sparsity           : 100%
## Maximal term length: 320
## Weighting          : term frequency (tf)

2c. To Stem or not to Stem?

Processing and Stemming for STM

Like unnest_tokens(), the textProcessor() function includes several useful arguments for processing text, such as converting text to lowercase and removing punctuation and numbers. I’ve included several of these in the script below along with their default values, which are used whenever you do not set an argument explicitly. Most of these are pretty intuitive and you can learn more by viewing the ?textProcessor documentation.

Let’s go ahead and process our discussion forum post_content in preparation for structural topic modeling:
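
A sketch of this call on a toy data frame. The three posts are hypothetical, and the arguments written out are simply the textProcessor() defaults made explicit, not necessarily the exact set used in the original script:

```r
library(stm)

# Toy stand-in for ts_forum_data (a plain data frame; column names follow the walkthrough)
ts_forum_data <- data.frame(
  course_id = c("9", "9", "9"),
  forum_id = c("126", "126", "127"),
  post_content = c(
    "Students collected real data for their class projects.",
    "I agree, my students need more time with the activities!",
    "Teaching statistics with technology engages students."
  ),
  stringsAsFactors = FALSE
)

temp <- textProcessor(
  ts_forum_data$post_content,  # the raw text to process
  metadata = ts_forum_data,    # columns carried along as document metadata
  lowercase = TRUE,            # default
  removestopwords = TRUE,      # default
  removenumbers = TRUE,        # default
  removepunctuation = TRUE,    # default
  stem = TRUE                  # default
)

names(temp)
```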

## Building corpus... 
## Converting to Lower Case... 
## Removing punctuation... 
## Removing stopwords... 
## Removing numbers... 
## Stemming... 
## Creating Output...

Note that the first argument the textProcessor() function expects is the column in our data frame that contains the text to be processed. The second argument, metadata =, expects the data frame that contains the text of interest and uses its column names to label the metadata, such as course ids and forum names. This metadata can be used to improve the assignment of words to topics in a corpus and to examine the relationship between topics and various covariates of interest.

Unlike the unnest_tokens() function, the output is not a nice tidy data frame. Topic modeling using the stm package requires a set of inputs that is specific to the package. The following code will pull the elements from the temp list we just created that are required by the stm() function we’ll use in Section 3b:
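
A minimal sketch of pulling out those elements. The toy data frame is hypothetical, and docs, meta and vocab are assumed names for the extracted pieces:

```r
library(stm)

# Toy stand-in for ts_forum_data (hypothetical posts)
ts_forum_data <- data.frame(
  forum_id = c("126", "126", "127"),
  post_content = c(
    "Students collected real data for their class projects.",
    "My students need more time with the activities.",
    "Teaching statistics with technology engages students."
  ),
  stringsAsFactors = FALSE
)

temp <- textProcessor(ts_forum_data$post_content,
                      metadata = ts_forum_data,
                      verbose = FALSE)

docs <- temp$documents  # word indices and counts for each document
meta <- temp$meta       # metadata rows for the documents that were kept
vocab <- temp$vocab     # character vector of the (stemmed) vocabulary

length(docs)
head(vocab)
```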

Stemming Tidy Text

Notice that the textProcessor() stem argument we used above is set to TRUE by default. We haven’t introduced word stemming up to this point because there is some debate about the actual value of this process. It might make sense to collapse words like “students” and “student” into their base word, which can make analyses and visualizations more concise and easier to interpret. Hvitfeldt and Silge (2021) note, however, that words like the following differ dramatically in meaning, semantics, and use, and that collapsing them could result in poor models or misinterpreted results:

  • meaning and mean
  • likely, like, liking
  • university and universe

The first word pair is particularly relevant to discussion posts from our Teaching Statistics course data. In addition, collapsing words like “teachers” and “teaching” could dramatically alter the results from a topic model.

For now, we will leave as is the forums_dtm we created earlier with words unstemmed, but what if we wanted to stem words in a “tidy” way?

Since the unnest_tokens() function does not (intentionally, I believe) include a stemming option, one approach would be to use the wordStem() function from the SnowballC package to either replace our word column with word stems or create a new variable called stem with our stemmed words. Let’s do the latter and take a look at the original words and the stems that were produced:
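
Sketched on a few toy words (forums_tidy and stemmed_forums are illustrative names), the stemming step might look like this:

```r
library(dplyr)
library(SnowballC)

# Toy tokenized data (hypothetical rows)
forums_tidy <- tibble(
  post_id = c("1", "1", "2", "2"),
  word = c("activity", "students", "activities", "student")
)

# Keep the original word and add its stem alongside it
stemmed_forums <- forums_tidy %>%
  mutate(stem = wordStem(word))

stemmed_forums
```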

## # A tibble: 165,720 x 11
##    course_id course_name       forum_id forum_name discussion_id discussion_name
##    <chr>     <chr>             <chr>    <chr>      <chr>         <chr>          
##  1 9         Teaching Statist~ 126      Investiga~ 6822          Not much compa~
##  2 9         Teaching Statist~ 126      Investiga~ 6822          Not much compa~
##  3 9         Teaching Statist~ 126      Investiga~ 6822          Not much compa~
##  4 9         Teaching Statist~ 126      Investiga~ 6822          Not much compa~
##  5 9         Teaching Statist~ 126      Investiga~ 6822          Not much compa~
##  6 9         Teaching Statist~ 126      Investiga~ 6822          Not much compa~
##  7 9         Teaching Statist~ 126      Investiga~ 6822          Not much compa~
##  8 9         Teaching Statist~ 126      Investiga~ 6822          Not much compa~
##  9 9         Teaching Statist~ 126      Investiga~ 6822          Not much compa~
## 10 9         Teaching Statist~ 126      Investiga~ 6822          Not much compa~
## # ... with 165,710 more rows, and 5 more variables: post_id <chr>,
## #   post_title <chr>, post_date <chr>, word <chr>, stem <chr>

You can see that words like “activity” and “activities” that occur frequently in our discussions have been reduced to the word stem “activ”. If you are interested in learning other approaches for word stemming in R, as well as reading a more in-depth description of the stemming process, I highly recommend Chapter 4, Stemming, from Hvitfeldt and Silge’s (2021) book, Supervised Machine Learning for Text Analysis in R.

✅ Comprehension Check

Complete the following code using what we learned in the section on Creating a Document Term Matrix and answer the following questions:

  1. How many fewer terms are in our stemmed document term matrix? 3,606 fewer (documents: 5761, terms: 10060, down from 13666)

  2. Did stemming words significantly reduce the sparsity of the matrix? No, the reported sparsity stayed at 100% (roughly 99.8% of cells are empty both before and after stemming)

Hint: Make sure your code includes stem counts rather than word counts.
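
One way the completed code could look, sketched on toy data (stemmed_dtm is an assumed name; the rows are made up). The only change from the earlier cast is that we count and cast stems instead of words:

```r
library(dplyr)
library(tidytext)
library(SnowballC)

# Toy tokenized data (hypothetical rows)
forums_tidy <- tibble(
  post_id = c("1", "1", "2", "2", "3", "3"),
  word = c("students", "activity", "student", "activities", "teachers", "teaching")
)

# Stemmed version: count stems per post, then cast
stemmed_dtm <- forums_tidy %>%
  mutate(stem = wordStem(word)) %>%
  count(post_id, stem) %>%
  cast_dtm(post_id, stem, n)

# Unstemmed version for comparison
forums_dtm <- forums_tidy %>%
  count(post_id, word) %>%
  cast_dtm(post_id, word, n)

stemmed_dtm
forums_dtm
```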

## <<DocumentTermMatrix (documents: 5761, terms: 10060)>>
## Non-/sparse entries: 129593/57826067
## Sparsity           : 100%
## Maximal term length: 320
## Weighting          : term frequency (tf)
## <<DocumentTermMatrix (documents: 5761, terms: 13666)>>
## Non-/sparse entries: 135796/78594030
## Sparsity           : 100%
## Maximal term length: 320
## Weighting          : term frequency (tf)
## # A tibble: 10,060 x 2
##    stem         n
##    <chr>    <int>
##  1 student   7346
##  2 data      4338
##  3 statist   4152
##  4 question  2470
##  5 teach     1841
##  6 school    1606
##  7 class     1520
##  8 time      1424
##  9 learn     1355
## 10 task      1214
## # ... with 10,050 more rows

3. MODEL

This unit provides our first opportunity for modeling a text as data. In very simple terms, modeling involves developing a mathematical summary of a dataset. These summaries can help us further explore trends and patterns in our data.

3a. Fitting a Topic Model with LDA

Before running our first topic model using the LDA() function, let’s quickly recap from our readings some basic principles behind Latent Dirichlet allocation and why LDA is often preferred over other automatic classification or clustering approaches.

Unlike simple forms of cluster analysis such as k-means clustering, LDA is a “mixture” model, which in our context means that:

  1. Every document contains a mixture of topics. Unlike algorithms like k-means, LDA treats each document as a mixture of topics, which allows documents to “overlap” each other in terms of content, rather than being separated into discrete groups. So in practice, this means that a discussion forum post could have an estimated topic proportion of 70% for Topic 1 (i.e., be mostly about Topic 1), but also be partly about Topic 2.
  2. Every topic contains a mixture of words. For example, if we specified in our LDA model just 2 topics for our discussion posts, we might find that one topic seems to be about pedagogy while another is about learning. The most common words in the pedagogy topic might be “teacher”, “strategies”, and “instruction”, while the learning topic may be made up of words like “understanding” and “students”. However, words can be shared between topics and words like “statistics” or “assessment” might appear in both equally.

Similar to k-means and other simple clustering approaches, however, LDA does require us to specify a value of k for the number of topics in our corpus. Selecting k is no trivial matter and can greatly impact your results.

Since we don’t have a strong rationale about the number of topics that might exist in discussion forums, let’s use the n_distinct() function from the dplyr package to find the number of unique forum names in our course data and run with that:
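
A quick sketch on toy data (the forum names below are made up, and n_forums is an illustrative name):

```r
library(dplyr)

# Toy stand-in for ts_forum_data (hypothetical forum names)
ts_forum_data <- tibble(
  forum_name = c("Investigate", "Investigate",
                 "Discuss with your colleagues", "Introductions")
)

# Number of unique forum names
n_forums <- n_distinct(ts_forum_data$forum_name)
n_forums
```

This toy data has 3 distinct forum names; the real course data returns the value shown next.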

## [1] 20

Since it looks like there are 20 distinct discussion forums, we’ll use that as our value for the k = argument of the LDA() function. Be patient while this runs, since the default settings perform a large number of iterations.
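
A minimal sketch of the model call, assuming a tiny toy corpus so that it runs in a moment. The six posts are made up and k is set to 2 only because the toy corpus is so small; with the real forums_dtm we would pass k = 20:

```r
library(dplyr)
library(tidytext)
library(topicmodels)

# Toy stand-in for forums_dtm built from six hypothetical posts
forums_dtm <- tibble(
  post_id = as.character(1:6),
  post_content = c(
    "students collect real data sets for class projects",
    "real data projects help students learn statistics",
    "students use software and excel for technology tasks",
    "technology software helps students explore data",
    "teachers agree the lesson activity takes time",
    "the activity and lesson plans need more class time"
  )
) %>%
  unnest_tokens(word, post_content) %>%
  anti_join(stop_words, by = "word") %>%
  count(post_id, word) %>%
  cast_dtm(post_id, word, n)

# k = number of topics; the seed makes the fit reproducible
forums_lda <- LDA(forums_dtm, k = 2, control = list(seed = 588))

forums_lda
```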

## [1] 20
## A LDA_VEM topic model with 20 topics.

Note that we used the control = argument to pass a random number (588) to seed the assignment of topics to each word in our corpus. Since LDA is a stochastic algorithm that could produce different results depending on where the algorithm starts, we specified a seed for reproducibility so that we all see the same results every time we specify the same number of topics.

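And tying back to our work in Unit 1, Bail (2020) notes that topic assignments for each word are updated in an iterative fashion, and describes this stage of LDA as drawing on the Term Frequency-Inverse Document Frequency (TF-IDF) metric. Keep in mind, though, that the input to our model is the raw term counts in forums_dtm (note the “term frequency (tf)” weighting in its printout above), which is the weighting that LDA() expects.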

3b. Fitting a Structural Topic Model

Bail notes that LDA, while perhaps the most common approach to topic modeling, is just one of many different types, including Dynamic Topic Models, Correlated Topic Models, Hierarchical Topic Models, and more recently, Structural Topic Modeling (STM). He argues that one reason STM has risen in popularity and use is that it employs metadata about documents to improve the assignment of words to topics in a corpus, and that this metadata can be used to examine relationships between covariates and documents.

Also, since Julia Silge has indicated that STM is, “my current favorite implementation of topic modeling in R” and has built supports in the tidytext package for building structural topic models, this package definitely is worth discussing in this walkthrough. I also highly recommend her own walkthrough of the stm package: The game is afoot! Topic modeling of Sherlock Holmes stories as well as her follow up post, Training, evaluating, and interpreting topic models.

The stm Package

As we’ve seen above, the textProcessor() function produced an unusual temp object whose format is unique to the stm package. And as you’ve probably already guessed, the stm() function for fitting a structural topic model does not take a standard document term matrix the way the LDA() function does.

Before we fit our model, we’ll have to extract the elements from the temp object created after we processed our text. Specifically, the stm() function expects the following arguments:

  • documents = the document term matrix to be modeled in the native stm format
  • data = an optional data frame containing meta data for the prevalence and/or content covariates to include in the model
  • vocab = a character vector specifying the words in the corpus in the order of the vocab indices in documents.

Let’s go ahead and extract these elements:

And now let’s use these elements to fit the model using the same number of topics for K that we specified for our LDA topic model. Let’s also take advantage of the fact that we can include the course_id and forum_id covariates in the prevalence = argument to help improve, in theory, our model fit:
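
Here is a sketch of both steps. So that it runs quickly without the forum data, it uses the gadarian survey responses that ship with the stm package as a stand-in corpus, with that data’s treatment and pid_rep columns standing in for course_id and forum_id. The name forums_stm, K = 5 and the cap on iterations are all choices made for this illustration only:

```r
library(stm)

# Stand-in corpus: the gadarian survey data that ships with stm
temp <- textProcessor(gadarian$open.ended.response,
                      metadata = gadarian,
                      verbose = FALSE)

# Extract the elements stm() expects
docs <- temp$documents
meta <- temp$meta
vocab <- temp$vocab

forums_stm <- stm(
  documents = docs,
  vocab = vocab,
  K = 5,                               # the walkthrough uses K = 20
  prevalence = ~ treatment + pid_rep,  # walkthrough: ~ course_id + forum_id
  data = meta,
  max.em.its = 25,                     # capped to keep this sketch fast
  seed = 588,
  verbose = FALSE
)

forums_stm
```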

## A topic model with 20 topics, 5777 documents and a 9306 word dictionary.

As noted earlier, the stm package has a number of handy features. One of these is the plot.STM() function for viewing the most probable words assigned to each topic.

By default, it only shows the first 3 terms so let’s change that to 5 to help with interpretation:

Note that you can also just use plot():
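
A sketch of the plotting call, again using the gadarian data from the stm package as a stand-in corpus (forums_stm, K = 5 and the iteration cap are illustrative choices). Calling plot() on a fitted stm model dispatches to the package’s plotting method for STM objects:

```r
library(stm)

# Stand-in corpus and a quick model fit (see the assumptions in the text above)
temp <- textProcessor(gadarian$open.ended.response,
                      metadata = gadarian,
                      verbose = FALSE)

forums_stm <- stm(temp$documents, temp$vocab, K = 5,
                  max.em.its = 25, seed = 588, verbose = FALSE)

# Summary plot of topic proportions, labeled with the top 5 words per topic
plot(forums_stm, type = "summary", n = 5)
```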

✅ Comprehension Check

Fit a model for both LDA and STM using different values for K and answer the following questions:

  1. What topics appear to be similar to those using 20 topics for K?
## A topic model with 14 topics, 5777 documents and a 9306 word dictionary.

## A topic model with 12 topics, 5777 documents and a 9306 word dictionary.

  2. Knowing that you don’t have as much context as the researchers of this study do, how might you interpret one of these latent topics or themes using the key terms assigned?
    • Topic 10 seems to be about experiential learning via a project that required using technology
  3. What topic emerged that seems dramatically different, and how might you interpret this topic?
    • Topic 6 seems to be about the speed of a roller coaster (maybe a wooden versus a metal track?). Not knowing more, I think this may refer to a specific project

3c. Finding K

As alluded to earlier, selecting the number of topics for your model is a non-trivial decision and can dramatically impact your results. Bail (2018) notes that

The results of topic models should not be over-interpreted unless the researcher has strong theoretical apriori about the number of topics in a given corpus, or if the researcher has carefully validated the results of a topic model using both the quantitative and qualitative techniques described above.

There are several approaches to estimating a value for K and we’ll take a quick look at one from the ldatuning package and one from our stm package.

The FindTopicsNumber Function

The ldatuning package has functions for both calculating and plotting different metrics that can be used to estimate the most preferable number of topics for an LDA model. It also conveniently takes the standard document term matrix object that we created from our tidy text and has the added benefit of running fairly quickly, especially compared to the function for finding K from the stm package.

Let’s use the defaults specified in the ?FindTopicsNumber documentation and modify them slightly to get metrics for a sequence of topics from 10 to 75, counting by 5, and then plot the output we saved using the FindTopicsNumber_plot() function:
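
A sketch of what this could look like. To keep it fast, it runs on the first 10 documents of the AssociatedPress matrix that ships with topicmodels rather than on forums_dtm, and it searches a much smaller range of topics. The name k_metrics, the seed and the single core are choices made for this illustration:

```r
library(topicmodels)
library(ldatuning)

# Stand-in document term matrix (swap in forums_dtm for the real analysis)
data("AssociatedPress", package = "topicmodels")
dtm <- AssociatedPress[1:10, ]

k_metrics <- FindTopicsNumber(
  dtm,
  topics = seq(from = 2, to = 8, by = 2),  # walkthrough: seq(10, 75, by = 5)
  metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
  method = "Gibbs",
  control = list(seed = 588),
  mc.cores = 1L,
  verbose = FALSE
)

# One panel for metrics to minimize, one for metrics to maximize
FindTopicsNumber_plot(k_metrics)
```

Roughly speaking, you are looking for the range of k where the “minimize” metrics bottom out and the “maximize” metrics peak.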

The searchK() Function

Finally, Bail (2018) notes that the stm package has a useful function called searchK() which allows us to specify a range of values for k and outputs multiple goodness-of-fit measures that are “very useful in identifying a range of values for k that provide the best fit for the data.”

The syntax of this function is very similar to the stm() function we used above, except that we specify a range for k as one of the arguments. In the code below, we search all values of k between 10 and 30.
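
A scaled-down sketch of that search. It uses the gadarian data from the stm package as a stand-in corpus, tries only two values of K, and caps the iterations so that it finishes quickly; k_search is an illustrative name, and prepDocuments() is used here to drop very rare words first:

```r
library(stm)

# Stand-in corpus: the gadarian survey data that ships with stm
temp <- textProcessor(gadarian$open.ended.response,
                      metadata = gadarian,
                      verbose = FALSE)
out <- prepDocuments(temp$documents, temp$vocab, temp$meta, verbose = FALSE)

set.seed(588)
k_search <- searchK(
  out$documents,
  out$vocab,
  K = c(3, 5),                         # walkthrough: values between 10 and 30
  prevalence = ~ treatment + pid_rep,  # walkthrough: ~ course_id + forum_id
  data = out$meta,
  max.em.its = 10,                     # capped to keep this sketch fast
  verbose = FALSE
)

# Diagnostic plots (held-out likelihood, residuals, semantic coherence, bound)
plot(k_search)
```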

Note that running the searchK() function on this corpus took all night on a pretty powerful MacBook Pro and crashed once as well, so I do not expect you to run this for the walkthrough. I ran a couple of iterations and landed on a range between 5 and 15, with an optimal number of topics somewhere around 14:

Given the somewhat conflicting results, and somewhat selfishly for the sake of simplicity in this walkthrough, I’m just going to stick with the rather arbitrary selection of 20 topics for the remainder of this unit.

The LDAvis Explorer

One final tool that I want to introduce from the stm package is the toLDAvis() function, which provides great visualizations for exploring topic and word distributions using the LDAvis topic browser:
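
A sketch of the call, using the gadarian data from the stm package as a stand-in corpus. The names forums_stm and out_dir are illustrative, and open.browser is switched off here so the files are only written to a temporary folder; in an interactive session you can leave it at its default to open the browser:

```r
library(stm)

# Stand-in corpus and a quick model fit
temp <- textProcessor(gadarian$open.ended.response,
                      metadata = gadarian,
                      verbose = FALSE)
docs <- temp$documents

forums_stm <- stm(docs, temp$vocab, K = 5,
                  max.em.its = 25, seed = 588, verbose = FALSE)

# Write the LDAvis files to a temporary folder instead of opening a browser
out_dir <- tempfile("forums_ldavis")
toLDAvis(mod = forums_stm, docs = docs,
         out.dir = out_dir, open.browser = FALSE)

list.files(out_dir)
```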

## Loading required namespace: servr

Our current stm model of 20 topics results in a lot of overlap among topics, which suggests that 20 may not be an optimal number of topics, as the other approaches for finding k also suggest.

4. EXPLORE

Silge and Robinson (2018) note that fitting a topic model is the “easy part.” The hard part is making sense of the model results, and the rest of the analysis involves exploring and interpreting the model using a variety of approaches, which we’ll walk through in this section.

Bail (2018) cautions, however, that:

…post-hoc interpretation of topic models is rather dangerous… and can quickly come to resemble the process of “reading tea leaves,” or finding meaning in patterns that are in fact quite arbitrary or even random.

4a. Exploring Beta Values

Hidden within the forums_lda topic model object we created are per-topic-per-word probabilities, called β (“beta”). Each β value is the probability of a term (word) being generated from a given topic.

Let’s take a look at the 5 most likely terms assigned to each topic, i.e. those with the largest β values using the terms() function from the topicmodels package:
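
A small sketch on a toy model. The six posts are made up, k is 2 only because the toy corpus is tiny, and top_terms_matrix is an illustrative name:

```r
library(dplyr)
library(tidytext)
library(topicmodels)

# Toy stand-in for forums_dtm built from six hypothetical posts
forums_dtm <- tibble(
  post_id = as.character(1:6),
  post_content = c(
    "students collect real data sets for class projects",
    "real data projects help students learn statistics",
    "students use software and excel for technology tasks",
    "technology software helps students explore data",
    "teachers agree the lesson activity takes time",
    "the activity and lesson plans need more class time"
  )
) %>%
  unnest_tokens(word, post_content) %>%
  anti_join(stop_words, by = "word") %>%
  count(post_id, word) %>%
  cast_dtm(post_id, word, n)

forums_lda <- LDA(forums_dtm, k = 2, control = list(seed = 588))

# The 5 terms with the largest beta values in each topic
top_terms_matrix <- terms(forums_lda, 5)
top_terms_matrix
```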

##      Topic 1      Topic 2      Topic 3       Topic 4    Topic 5    Topic 6   
## [1,] "school"     "statistics" "test"        "grade"    "students" "students"
## [2,] "students"   "em"         "students"    "data"     "task"     "level"   
## [3,] "middle"     "unit"       "standard"    "graphs"   "data"     "dice"    
## [4,] "video"      "education"  "deviation"   "students" "tasks"    "size"    
## [5,] "elementary" "teaching"   "statistical" "plots"    "question" "sample"  
##      Topic 7      Topic 8       Topic 9     Topic 10     Topic 11    
## [1,] "students"   "statistics"  "questions" "assessment" "technology"
## [2,] "statistics" "teaching"    "kids"      "resource"   "students"  
## [3,] "feel"       "math"        "love"      "items"      "software"  
## [4,] "teach"      "teachers"    "students"  "locus"      "program"   
## [5,] "questions"  "mathematics" "sharing"   "students"   "excel"     
##      Topic 12   Topic 13     Topic 14    Topic 15    Topic 16     Topic 17    
## [1,] "students" "activity"   "question"  "students"  "fuel"       "students"  
## [2,] "project"  "students"   "data"      "questions" "cost"       "makes"     
## [3,] "real"     "agree"      "students"  "class"     "src"        "regression"
## [4,] "class"    "lesson"     "questions" "learn"     "codap"      "wondering" 
## [5,] "projects" "activities" "census"    "answer"    "pluginfile" "model"     
##      Topic 18   Topic 19     Topic 20
## [1,] "data"     "sample"     "stats" 
## [2,] "students" "difference" "ap"    
## [3,] "real"     "samples"    "class" 
## [4,] "collect"  "random"     "school"
## [5,] "sets"     "population" "stat"

Even though we’ve somewhat arbitrarily selected the number of topics for our corpus, some of these topics or themes are fairly intuitive to interpret. For example:

  • Topic 11 (technology, students, software, program, excel) seems to be about students’ use of technology, including software programs like Excel;

  • Topic 9 (questions, kids, love, students, sharing, with gapminder also among its top 7 terms) seems to be about the Gapminder activity from the MOOC-Ed and kids’ enjoyment of it; and

  • Topic 18 (data, students, real, collect, sets) seems to be about student collection and use of real-world data sets.

Not surprisingly, the tidytext package has a handy function, conveniently named tidy(), to convert our LDA model to a tidy data frame containing these beta values for each term:

## # A tibble: 273,320 x 3
##    topic term       beta
##    <int> <chr>     <dbl>
##  1     1 2015  1.36e- 67
##  2     2 2015  1.55e-  4
##  3     3 2015  1.73e-  4
##  4     4 2015  3.27e-  4
##  5     5 2015  4.43e- 80
##  6     6 2015  1.17e-114
##  7     7 2015  2.73e- 50
##  8     8 2015  7.10e- 40
##  9     9 2015  2.45e- 29
## 10    10 2015  1.12e-  3
## # ... with 273,310 more rows

Obviously, it’s not very easy to interpret what the topics are about from a data frame like this so let’s borrow code again from Chapter 8.4.3 Interpreting the topic model in Text Mining with R to examine the top 5 terms for each topic and then look at this information visually:
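As a rough illustration of the "top terms per topic" step, here is a minimal sketch that uses only base R and a small hand-made data frame (the values and the helper name top_terms_by_topic are hypothetical). The walkthrough itself relies on tidytext and dplyr, so treat this as a package-free stand-in for the same idea rather than the code from Text Mining with R: group the long-format beta values by topic, sort by beta, and keep the first n rows of each group.

```r
# Toy long-format beta values (hypothetical numbers), one row per topic-term
# pair, mirroring the topic / term / beta columns shown above.
toy_beta <- data.frame(
  topic = c(1, 1, 1, 2, 2, 2),
  term  = c("school", "students", "video", "data", "graphs", "plots"),
  beta  = c(0.40, 0.35, 0.25, 0.50, 0.20, 0.30),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep the n highest-beta terms within each topic.
top_terms_by_topic <- function(df, n) {
  per_topic <- lapply(split(df, df$topic), function(d) {
    head(d[order(d$beta, decreasing = TRUE), ], n)
  })
  out <- do.call(rbind, per_topic)
  rownames(out) <- NULL
  out
}

top_terms <- top_terms_by_topic(toy_beta, n = 2)
print(top_terms)
```

A data frame shaped like top_terms (one row per topic-term pair, already ranked) is what you would then hand to a plotting function, for example a bar chart with one panel per topic.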

4b. Exploring Gamma Values

Now that we have a sense of the most common words associated with each topic, let’s take a look at the topic prevalence in our MOOC-Ed discussion forum corpus, including the words that contribute to each topic we examined above.

Also hidden within the forums_lda topic model object we created are per-document-per-topic probabilities, called γ (“gamma”). The gamma matrix gives the probability that each document is generated from each topic. We can combine our beta and gamma values to understand the topic prevalence in our corpus and which words contribute to each topic.

To do this, we’re going to borrow some code from the Silge (2018) post, Training, evaluating, and interpreting topic models.

First, let’s create two tidy data frames for our beta and gamma values:

## # A tibble: 273,320 x 3
##    topic term       beta
##    <int> <chr>     <dbl>
##  1     1 2015  1.36e- 67
##  2     2 2015  1.55e-  4
##  3     3 2015  1.73e-  4
##  4     4 2015  3.27e-  4
##  5     5 2015  4.43e- 80
##  6     6 2015  1.17e-114
##  7     7 2015  2.73e- 50
##  8     8 2015  7.10e- 40
##  9     9 2015  2.45e- 29
## 10    10 2015  1.12e-  3
## # ... with 273,310 more rows
## # A tibble: 115,220 x 3
##    document topic    gamma
##    <chr>    <int>    <dbl>
##  1 11295        1 0.00241 
##  2 12711        1 0.000381
##  3 12725        1 0.0289  
##  4 12733        1 0.142   
##  5 12743        1 0.00928 
##  6 12744        1 0.00476 
##  7 12756        1 0.0289  
##  8 12757        1 0.00353 
##  9 12775        1 0.00353 
## 10 12816        1 0.00353 
## # ... with 115,210 more rows

Next, we’ll adopt Julia’s code wholesale to create a filtered data frame of our top_terms, join this to a new data frame for gamma_terms, and create a nice clean table using the kable() function from the knitr package:

## Warning: `cols` is now required when using unnest().
## Please use `cols = c(terms)`
Topic      Expected topic proportion   Top 7 terms
Topic 7    0.087                       students, statistics, feel, teach, questions, level, school
Topic 13   0.077                       activity, students, agree, lesson, activities, ideas, resources
Topic 15   0.070                       students, questions, class, learn, answer, reading, learning
Topic 18   0.067                       data, students, real, collect, sets, set, analyze
Topic 5    0.065                       students, task, data, tasks, question, statistical, activity
Topic 8    0.063                       statistics, teaching, math, teachers, mathematics, teach, science
Topic 9    0.053                       questions, kids, love, students, sharing, lot, gapminder
Topic 11   0.051                       technology, students, software, program, excel, time, core
Topic 6    0.050                       students, level, dice, size, sample, trials, technology
Topic 14   0.048                       question, data, students, questions, census, answer, survey
Topic 12   0.047                       students, project, real, class, projects, life, time
Topic 10   0.047                       assessment, resource, items, locus, students, resources, website
Topic 1    0.046                       school, students, middle, video, elementary, age, census
Topic 3    0.039                       test, students, standard, deviation, statistical, results, understand
Topic 20   0.039                       stats, ap, class, school, stat, math, students
Topic 4    0.037                       grade, data, graphs, students, plots, 1, box
Topic 17   0.036                       students, makes, regression, wondering, model, days, linear
Topic 19   0.034                       sample, difference, samples, random, population, sampling, minutes
Topic 2    0.030                       statistics, em, unit, education, teaching, learning, online
Topic 16   0.013                       fuel, cost, src, codap, pluginfile, vehicles, city
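To see where a column like "Expected topic proportion" can come from, here is a minimal base R sketch with made-up gamma values. It assumes that the expected proportion for a topic is the mean of that topic's gamma across all documents, which is my reading of the approach rather than something shown in the output above. The data frame toy_gamma and its numbers are hypothetical.

```r
# Toy long-format gamma values (hypothetical numbers), one row per
# document-topic pair. Each document's gammas sum to 1 across topics.
toy_gamma <- data.frame(
  document = rep(c("d1", "d2", "d3", "d4"), times = 2),
  topic    = rep(c(1, 2), each = 4),
  gamma    = c(0.9, 0.6, 0.2, 0.1,   # topic 1 for d1..d4
               0.1, 0.4, 0.8, 0.9),  # topic 2 for d1..d4
  stringsAsFactors = FALSE
)

# Assumed definition: expected topic proportion = mean gamma per topic.
prevalence <- aggregate(gamma ~ topic, data = toy_gamma, FUN = mean)

# Most prevalent topic first, as in the table above.
prevalence <- prevalence[order(prevalence$gamma, decreasing = TRUE), ]
print(prevalence)
```

Because each document's gamma values sum to 1, the per-topic means also sum to 1, which is consistent with the proportions in the table adding up to roughly 1.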

Let’s also use the plot() function to compare this to the most prevalent topics and terms from the forums_stm model we created:

4c. Reading the Tea Leaves

Recognizing that topic modeling is best used as a “tool for reading” and provides only an incomplete answer to our overarching question, “How do we quantify what a corpus is about?”, the results do suggest some potential topics that have emerged as well as some areas worth following up on.

Specifically, looking at some of the common clusters of words for the more prevalent topics suggests that some key topics or “latent themes” (renamed in bold) might include:

  • Teaching Statistics: Unsurprisingly, given the course title, the most prevalent topics in both the forums_stm and forums_lda models contain the terms “teach”, “students”, and “statistics”. This could be an “overarching theme,” but more likely it is simply residue of the course title sprinkled throughout the forums, and it deserves some follow-up. Topic 8 from the LDA model may overlap with this topic as well.
  • Course Utility: The second most prevalent topics in the LDA and STM models (Topics 13 and 2, respectively) seem to be about the usefulness of course “resources” like lessons, tools, videos, and activities. I’m wagering this might be a forum dedicated to course feedback. Topic 15 from the STM model also suggests this may be a broader theme.
  • Using Real-World Data: Topics 18 & 12 from the LDA model particularly intrigue me, and I’m wagering they reflect fairly positive sentiment among participants about the value and benefit of having students collect and analyze real data sets (e.g. Census data in Topic 1) and work on projects relevant to their real lives. I will definitely follow up on this one.
  • Technology Use: Several topics (6 & 11 from LDA and 8 & 19 from STM) appear to be about student use of technology and software like calculators and Excel for teaching statistics and using simulations. Topic 16 from LDA also suggests the use of the Common Online Data Analysis Platform (CODAP).
  • Student Struggle & Engagement: Topic 15 from LDA and Topic 16 from STM also intrigue me and appear to be two sides of the same coin. The former includes “struggle” and “reading”, which perhaps suggests a barrier to teaching statistics, while the latter contains top stems like “engage”, “activ”, and “think” and may suggest that participants anticipate the activities will engage students.

To serve as a check on my tea leaf reading, I’m going to follow Bail’s recommendation to examine some of these topics qualitatively. The stm package has another useful, though exceptionally fussy, function called findThoughts(), which extracts passages from documents within the corpus that are associated with topics you specify.

The first line of code may not be necessary for your independent analysis, but because the textProcessor() function removed several documents during processing, the findThoughts() function can’t properly index the processed docs. This line of code, found on Stack Overflow, removes from the original ts_forum_data source the documents that were dropped during processing, so that there is a one-to-one correspondence with forums_stm and you can use the function to find posts associated with a given topic.
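The realignment idea can be sketched in base R with toy data. Everything here is hypothetical: posts stands in for the original post text, docs_removed for the positions of the documents that preprocessing dropped, and drop_removed is an illustrative helper, not the Stack Overflow line itself. The one detail worth noticing is the guard for the case where nothing was removed, because negative indexing with an empty vector returns an empty result in R.

```r
# Hypothetical stand-ins for the original texts and the positions of the
# documents that were dropped during preprocessing.
posts <- c("great video", "", "thanks for sharing", "ok", "real data sets")
docs_removed <- c(2L, 4L)

# Hypothetical helper: drop the removed documents so the remaining texts
# line up one-to-one with the documents the model was fit on.
drop_removed <- function(texts, removed) {
  # texts[-integer(0)] returns an EMPTY vector in R, so handle the
  # "nothing was removed" case explicitly.
  if (length(removed) == 0) {
    return(texts)
  }
  texts[-removed]
}

kept <- drop_removed(posts, docs_removed)
print(kept)
```

After this step, the number of texts should match the number of documents in the fitted model, which is worth checking before looking up posts by topic.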

Let’s slightly reduce our original data set to match our STM model, pass both to the findThoughts() function, and set our arguments to return n = 10 posts from topics = 2 (i.e. Topic 2) that meet a minimum threshold of thresh = 0.5, meaning an estimated topic proportion of at least 50%.
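To build some intuition for what the n and thresh arguments are doing, here is a toy base R sketch of the selection logic. This is not stm's implementation: the matrix theta, the texts, and the helper find_top_posts are all hypothetical, and whether the real function treats the threshold as inclusive is an assumption on my part. The idea is simply to keep documents whose estimated proportion for the chosen topic clears the threshold, rank them, and return the top n.

```r
# Toy document-topic proportions (hypothetical numbers): rows are posts,
# columns are topics, and each row sums to 1.
theta <- matrix(
  c(0.80, 0.20,
    0.30, 0.70,
    0.55, 0.45,
    0.95, 0.05),
  ncol = 2, byrow = TRUE
)
texts <- c("post A", "post B", "post C", "post D")

# Hypothetical helper illustrating the idea behind n and thresh.
find_top_posts <- function(theta, texts, topic, n, thresh) {
  stopifnot(nrow(theta) == length(texts))
  proportion <- theta[, topic]
  # Keep only documents that meet the minimum topic proportion (assumed inclusive).
  eligible <- which(proportion >= thresh)
  # Rank the eligible documents from highest to lowest proportion.
  ranked <- eligible[order(proportion[eligible], decreasing = TRUE)]
  texts[head(ranked, n)]
}

top_posts <- find_top_posts(theta, texts, topic = 1, n = 2, thresh = 0.5)
print(top_posts)
```

Note the stopifnot() check at the top of the helper: the texts and the model's documents have to line up one-to-one, which is exactly why the documents removed during processing need to be dropped first.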

## 
##  Topic 2: 
##       WHoops!  forgot the link to the video.  VERY SORRY!     www.ted.com talks hans_rosling_shows_the_best_stats_you_ve_ever_seen www.ted.com talks hans_rosling_shows_the_best_stats_you_ve_ever_seen  "
##      Hello Dina  your excitement perked my interest. I must say I did not really bother about these tools as I was reading them but promised myself to look closer and learn more after I read your comments. Your excitement tells me that there must be something in there that I should take a closer look on. Thanks for the vibes.
##      Thank you for putting that up! I haven't looked at the Extend your Learning section at all  but I have now bookmarked the Against all Odds website!
##      I too bookmarked the LinkedIn video.  I also continued to watch some of the SAS videos.  I plan to show some to my AP students.  I also plan to send it as a link to the guidance counselors.
##      Thank you Bonnie. I visited stattrek to help me refresh my own craftsmanship.
##      Thanks for the link. I have never used that Site before and the nurse Geen example may stimulate some good discussions.
##      When I returned from the preceding link  I accidentally rated this resource.    I would give it zero stars if I could  or one star in a do over  because it takes me to a link to purchase something.
##      Thanks for the video links.  I enjoy TED talks.
##      I've never heard of FiveThirtyEight  could you tell me more about the resource?     Thanks!  Carisa
##      Hi Keren and all      I used the following videos (or parts of them) with Business students but I think many of them would be suitable for other ares as well. I found suitable their size (about 10-15 minutes)  and often used them as a review  www.economicsnetwork.ac.uk statistics videos   And some of the older joy of stats videos  www.open.edu openlearn science-maths-technology mathematics-and-statistics statistics watch-the-joy-stats   Klara Kelecsenyi

Duplicate posts aside, this Course Utility topic returns posts that were expected based on my interpretation of the key terms for Topic 2. It looks like I may have read those tea leaves correctly.

Now let’s take a look at Topic 16 that we thought might be related to student engagement:

## 
##  Topic 16: 
##       Maureen   You have a good point  students want to rush through things and just get to the answer but not really think about things. I intentionally didn't re-read the questions and rushed through it just to see how I did  and I missed some things. Students would as well. I like your idea of demonstrating this for students.   "
##      The framework really makes you think about what level the students are at and what you can do to fit the task to that level. It also gives you an idea of what you can do to maybe challenge your students a little and keep them growing.
##      I agree that the reading would be difficult for many of my students. I work with mostly Tier 2 students. I had to read the questions several times myself and reason out the answers. I can see many of my students having trouble with this task. We would need to practice these types of problems many times and discuss the answers for the students to feel comfortable attempting these types of questions.
##      I have the same issues with initial questions. This is where I try to inject motivating questions to explore where my students' motivation may lie. Eventually I hope to get them asking their own questions and not get overwhelmed by the open-endness of the process.
##      I think that AP students would be up to this task and it is a topic that would interest them. I think that the questions are thought provoking and would facilitate good discussions in the classroom. It would be a good extension to ask the students what other information might improve this task to see if they would come up with age and gender as pertinent information.
##      I agree.  I think that the final question was almost more confusing (and multi-layered) then the initial questions.
##      I agree Eric in that I like this activity the best. It gives guidance and helps students learn how to conduct an experiment. Plus  it gets the entire class involved. And  as you said  it can be expanded a little to hit all 4 phases of the statistical process  and it can be transferred to other activities. I think it has a lot of potential!
##      I agree on the two different approaches  and both (I think) are very valid. I  struggle with how to intertwine the two without losing the students. They seem to want to compartmentalize the two instead of looking at them as two parts of a whole  though that could be more due to my lack of experience.   I also would like to think that many of my students would do well. I could see them not doing well because of the differences in question presentation  though. I don't give tests in the traditional sense  so this question format might be a little foreign.
##      I feel like my ap students are good students who will learn to answer those questions as well. I think they are self motivated and want to do well. After a semester I do think they will feel good about answer those questions
##      I really like the war game activity. It is a game that your student are probably going to know how to play already  which should help you when explaining. Also it can really get your students up and going and you will see the competitiveness really kick in with your students. I also like the discussion part you have with it as well because in reality who really knows which deck of cards is going to be the best to pick from  so this makes a great topic for discussion and it would be interesting to see what all of your students say. Overall I would have to say you did a great job with your project here. The only way I can see for adjustments though would be after you do your activities the first time because you really don't know how things are going to work out until you actually try them yourself.

It looks like my tea leaf reading was partially correct for Topic 16, though the results seem to be about a specific “Pepsi challenge” activity to conduct with students.

Finally, let’s look at posts from Topic 3, which we thought might be an overarching theme about teaching statistics:

## 
##  Topic 3: 
##       I am a non native English speaker as well  taking the test was a little intimedating  I hope to gradually gain a better feel of terminology and content
##      The area that I feel I need the most improvement is asking better questions.  Not only do I need to pose better questions  but I need to help my students develop this skill as well.  I feel more confident that I will be able to do so or at least I am willing to try!
##      My confidence has increased slightly and my ability to find useful resources has increased a lot.  So that should translate into a classroom atmosphere where I am even more comfortable pushing students and when they ask things I cannot immediately answer  being able to say  I will get back to you tomorrow with a better answer or question! "
##      I am sure this applies to every area of math  but I find that my students with disabilities really struggle with mathematical reasoning.  They also struggle with math vocabulary.  Most could match a definition to a term  but they do not have understanding of the term when needing to make a mathematical statement to go with it.
##      Hello Katherine. My students are in the upper level Math courses  however  I found this strategy very helpful for my Math vocabulary. This might help in the Math vocabulary development for ELD students. I ask my students to just focus on one Math word  or one Math concept they learned during the week's lessons. Then I ask them to make a Math Graffiti out of the word or concept  a drawing that would subtly depict what they learned. I emphasize that I do not just want a drawing  not a formula solution  not an essay  rather I want a subtle drawing of what they learned using the word or concept. Then at the back of the Math Graffiti  I'd ask them to describe the word or concept. It really helped in how they learned concepts  at the same time asking them to verbalize  in writing  whatever they have learned. Here several modalities of learning had been used.
##      I agree that teaching students how to decode the language used in questions is one of the most important ways we can help them prepare for tests.  I think that too often we focus on vocabulary as individual mathematics and statistics words  and think that if students understand the individual words they will be able to understand the test questions.  In fact  understanding test questions is often about interpreting the little words like to or if or not  as well as the statistics words. "
##      It is always easier to teach a topic when you have had more training on the area and I believe this has made me feel way more confidant in this area!
##      I learned that I need to improve on my understanding of the basic concepts in order to teach students. I need a clear understanding of variability and other key vocabulary terms that left me guessing on the answer. "
##      My students are not at all prepared to answer the questions in the investigation.  They are all sophomores in high school and haven't had much interaction with statistics.  I would definitely need to understand the concepts in order to teach them the concepts.  I would need to feel completely confident in teaching the material beforehand.
##      I also agree that the wording in statistics questions is often difficult to understand  but like learning the specific language of any subject  the more time we and our students spend studying statistics and working statistics problems  the better prepared we will be to overcome those misconceptions.

Looking at just the 10 posts returned, perhaps a better name for this topic would be Course Reflections on Teaching Statistics.

Unit Takeaway

In addition to some useful R packages and functions for the actual process of topic modeling, there are two main lessons I’m hoping you take away from this walkthrough:

  1. Topic modeling requires a lot of decisions. Beyond deciding on a value for K, there are a number of key decisions that you have to make that can dramatically affect your results. For example, to stem or not to stem? What qualifies as a document? What flavor of topic modeling is best suited to your data and research questions? How many iterations to run?
  2. Topic modeling is as much art as (data) science. As Bail (2018) noted, the term “topic” is somewhat ambiguous, and topic models do not produce highly nuanced classifications of texts. Once you’ve fit your model, interpreting it requires some mental gymnastics and, ideally, some knowledge of the context from which the data came. Moreover, the quantitative approaches for making the decisions highlighted above are imperfect, and a good deal of human judgment is required.
✅ Comprehension Check

Using the STM model you fit from the Section 3 [Comprehension Check] with a different value for K, use the approaches demonstrated in Section 4 to explore and interpret your topics and terms and revisit the following question:

  1. Now that you have a little more context, how might you revise your initial interpretation of some of the latent topics or latent themes from your model?
## 
##  Topic 10: 
##       In the past I have used tinkerplots to allow students to  explore various outcomes associated with rolling two dice. They start by playing the Two-Dice Elimination game  and then simulate rolling two dice many times using TinkerPlots to determine whether a step model or a triangle model of the distribution of sums is correct. Finally  they use the triangle model to calculate the probability of rolling each sum. Students really liked this activity and really grasped the concept.
##      Is there an underlying assumption here that the two teams are equally likely to win the game? We are assuming that the octopus is equally likely to swim to either flag. But if one team is heavily favored (for example  if one of the teams has a 90% chance of winning)  and the octopus has picked the long shot team  then wouldn't his likelihood of picking the correct team be less? Or is this accounted for with the fact that the octopus is equally likely to swim to either flag  and that he was equally likely to choose the heavily favored team?
##      I also agree with the fact that the students were able to see more clearly by doing the different trials with the different amount of trials in each.  It really did help the students to see how things can be fair or unfair by doing all the trials.
##      I agree with you that I liked how the students could run a simulation of 3000 trials.  On the graphing calculator that we use (TI-84) there is a simulation program also that involves dice and coins  but we are not able to go up to 3000 trials!  We have done something similar to this using the calculator  but it would be nice to try something new with software like what was used in the video.
##      I  personally  enjoy the video simulations.  I find them more engaging and even my children are watching the video at home  )  Adding the technology aspect to the dice roll eased  created prediction  and helped to solve a pattern formation.  The student were talking to help the metacognition of thinking through their thoughts  but the computer quickly revealed dice rolls at various numbers including 3 000.  Trying to roll the dice that many times would take forever.  With the bar graph and pie graph showing the outcome  it helped students determine if the dice were bias or not.
##      Howard   There are many different software options for creating animations.  Most will allow you to use real recorded voices or a computer generated voice  but not all.  We have tried a few  the ones that you see in the course were created using goanimate.com  GoAnimate and "www.powtoon.com  Powtoon. "
##      Once students are able to manipulate the simulation software  they start to make their own inferences about the data.  Students can visually see what happens to the data when they roll the dice more than one time  or more than 50  or even more than 100.  I thought that it was interested to see the pie graph as well.  The percentages looked like they were so far apart when using the bar graph  and then in the pie graph they looked closer.  I think this was great for the students to see as well.
##      From the videos  I saw that the students we trying to level out the dice rolls. Both partners were discussing how it was it was a fair dice. The computer helped them by having the students critically think about their next move. The first pair of boys kept rolling until they got to 3 000.Then the girls rolled until they got to 36. They used and algorithm to find a good stopping point. Even with the AP Stats students  they were able to show their work to prove if the dice was fair or unfair. Eventually  at some point the students thought that the outcomes would even out.
##      It was great to see the activity being used at much different levels  but expecting similar outcomes.  The use of the various technologies really makes it accessible for all learners.
##      I liked that they had the  tool to try the simulations with increasing number of trials.  (It would be nice if my students had similar  tools.)  I wonder why one group did not  explore using larger number of trials.   The first group was able to see the difference and the second buy became  convinced after seeing enough trails.

This topic has to do with different platforms and tools used to conduct experiments (seemingly to better understand statistics). Many teachers seemed to be sharing individual tools and commenting on how they liked the learning experiences that the technology seemed to evoke from students (i.e. “metacognition”).

