1. PREPARE

To help us better understand the context, questions, and data sources we’ll be using in Unit 3, this section will focus on the following topics:

  1. Context. As context for our analysis this week, we’ll review several related papers by my colleagues relevant to our analysis of MOOC-Ed discussion forums.
  2. Questions. We’ll also examine what insight topic modeling can provide into a question that we asked participants to answer in their professional learning teams (PLTs).
  3. Project Setup. This should be very familiar by now, but we’ll set up a new R project and install and load the required packages for the topic modeling walkthrough.

1a. Context

Participating in a MOOC and Professional Learning Team: How a Blended Approach to Professional Development Makes a Difference

Abstract

Massive Open Online Courses for Educators (MOOC-Eds) provide opportunities for using research-based learning and teaching practices, along with new technological tools and facilitation approaches for delivering quality online professional development. The Teaching Statistics Through Data Investigations MOOC-Ed was built for preparing teachers in pedagogy for teaching statistics, and it has been offered to participants from around the world. During 2016-2017, professional learning teams (PLTs) were formed from a subset of MOOC-Ed participants. These teams met several times to share and discuss their learning and experiences. This study focused on examining the ways that a blended approach to professional development may result in similar or different patterns of engagement to those who only participate in a large-scale online course. Results show the benefits of a blended learning environment for retention, engagement with course materials, and connectedness within the online community of learners in an online professional development on teaching statistics. The findings suggest the use of self-forming autonomous PLTs for supporting a deeper and more comprehensive experience with self-directed online professional developments such as MOOCs. Other online professional development courses, such as MOOCs, may benefit from purposely suggesting and advertising, and perhaps facilitating, the formation of small face-to-face or virtual PLTs who commit to engage in learning together.

Data Source & Analysis

All peer interaction, including peer discussion, takes place within the discussion forums of MOOC-Eds, which are hosted using the Moodle Learning Management System. To build the dataset you’ll be using for this walkthrough, the research team wrote a query for Moodle’s MySQL database, which records participants’ user logs of activity in the online forums. This SQL query combines separate database tables containing postings and comments, including participant IDs, timestamps, discussion text, and other attributes or “metadata.”

Summary of Key Findings

The following highlight some key findings related to the discussion forums in the papers cited above:

  1. MOOCs designed specifically for K-12 teachers can provide positive self-directed learning experiences and rich engagement in discussion forums that help form online communities for educators.
  2. Analysis of discussion forum data in TSDI provided a very clear picture of how enthusiastic many PLT members and leaders were to talk to others in the online community. They posed their questions and shared ideas with others about teaching statistics throughout the units, even though they were also meeting synchronously several times with their colleagues in small group PLTs.
  3. Findings on knowledge construction demonstrated that over half of the discussions in both courses moved beyond sharing information and statements of agreement and entered a process of dissonance, negotiation and co-construction of knowledge, but seldom moved beyond this phase in which new knowledge was tested or applied. These findings echo similar research on difficulties in promoting knowledge construction in online settings.
  4. Topic modeling provides more interpretable and cohesive models for discussion forums than other popular unsupervised modeling techniques such as k-means and k-medoids clustering algorithms.

1b. Guiding Questions

For the paper, Participating in a MOOC and Professional Learning Team: How a Blended Approach to Professional Development Makes a Difference, the researchers were interested in unpacking how participants who enrolled in the Teaching Statistics through Data Investigations MOOC-Ed might benefit from also being in a smaller group of professionals committed to engaging in the same professional development. The specific research question for this paper was:

What are the similarities and differences between how PLT members and Non-PLT online participants engage and meet course goals in a MOOC-Ed designed for educators in secondary and collegiate settings?

Dr. Hollylynne Lee and the TSDI team also developed a facilitation guide designed specifically for PLT teams to help groups synthesize the ideas in the course and make plans for how to implement new strategies in their classroom in order to impact students’ learning of statistics. One question PLT members were asked to address was:

What ideas or issues emerged in the discussion forums this past week?

For this walkthrough, we will further examine that question through the use of topic modeling.

And just to reiterate yet again from Unit 1, one overarching question we’ll explore throughout this course, and that Silge and Robinson (2018) identify as a central question to text mining and natural language processing, is:

How do we quantify what a document or collection of documents is about?

1c. Set Up

As highlighted in Chapter 6 of Data Science in Education Using R (DSIEUR), one of the first steps of every workflow should be to set up a “Project” within RStudio. This will be your “home” for any files and code used or created in Unit 3.

You are welcome to continue using the same project created for Unit 1, or create an entirely new project for Unit 3. Either way, once your project is set up, open a new R script and load the following packages that we’ll need for this walkthrough:
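As a minimal sketch of this step, the script below loads the packages this walkthrough mentions. The exact list is an assumption pieced together from the package names that appear in this unit (stm and SnowballC are used in later sections), and the helper names pkgs and not_installed are just for illustration:

```r
# Packages named in this walkthrough (the exact list is an assumption).
pkgs <- c("tidyverse", "tidytext", "SnowballC", "topicmodels",
          "stm", "ldatuning", "knitr", "LDAvis")

# Report anything that still needs install.packages() before loading.
not_installed <- setdiff(pkgs, rownames(installed.packages()))
if (length(not_installed) > 0) {
  message("Install first: ", paste(not_installed, collapse = ", "))
}

# Load whatever is available; library() needs character.only = TRUE
# when the package name is stored in a variable.
for (p in setdiff(pkgs, not_installed)) {
  library(p, character.only = TRUE)
}
```

If you prefer, one library() call per package works just as well. Version warnings may print as each package loads; they are generally harmless.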

## Warning: package 'tidyverse' was built under R version 4.0.5
## Warning: package 'ggplot2' was built under R version 4.0.5
## Warning: package 'tibble' was built under R version 4.0.5
## Warning: package 'tidyr' was built under R version 4.0.5
## Warning: package 'readr' was built under R version 4.0.5
## Warning: package 'purrr' was built under R version 4.0.5
## Warning: package 'dplyr' was built under R version 4.0.5
## Warning: package 'stringr' was built under R version 4.0.5
## Warning: package 'forcats' was built under R version 4.0.5
## Warning: package 'tidytext' was built under R version 4.0.5
## Warning: package 'topicmodels' was built under R version 4.0.5
## Warning: package 'ldatuning' was built under R version 4.0.5
## Warning: package 'knitr' was built under R version 4.0.5
## Warning: package 'LDAvis' was built under R version 4.0.5
## Warning: package 'devtools' was built under R version 4.0.5
## Warning: package 'usethis' was built under R version 4.0.5

2. WRANGLE

2a. Import Forum Data

By default, many of the columns like course_id and forum_id are read in as numeric data. For our purposes, we plan to treat them as unique identifiers or names for our courses, forums, discussions, and posts. The read_csv() function has a handy col_types = argument for changing the column types from numeric to character.
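Here is a minimal sketch of the col_types = argument in action. Since the path to the real forum file isn’t shown here, the example writes a tiny stand-in CSV to a temporary file first; the rows, toy_path, and toy_forum_data are invented for illustration:

```r
library(readr)

# Write a tiny stand-in CSV so the example runs on its own;
# the column names mirror the forum data but the rows are invented.
toy_path <- tempfile(fileext = ".csv")
writeLines(c(
  "course_id,forum_id,discussion_id,post_id,post_content",
  "9,126,6822,1,I agree that students need time with real data",
  "9,126,6822,2,My class collected data on travel time to school"
), toy_path)

# Read every ID column as character rather than numeric;
# columns not listed in cols() are still guessed by readr.
toy_forum_data <- read_csv(
  toy_path,
  col_types = cols(
    course_id     = col_character(),
    forum_id      = col_character(),
    discussion_id = col_character(),
    post_id       = col_character()
  )
)

toy_forum_data
```

With the real data you would swap toy_path for the path to your forum CSV and keep the same cols() specification.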

2b. Cast a Document Term Matrix

In this section we’ll revisit some familiar tidytext functions used in Units 1 & 2 for tidying and tokenizing text and introduce some new functions from the stm package for processing text and transforming our data frames into new data structures required for topic modeling.

Functions Used

tidytext functions

  • unnest_tokens() splits a column into tokens
  • anti_join() (technically a dplyr function, but used alongside tidytext) returns all rows from x without a match in y and is used to remove stop words from our data.
  • cast_dtm() takes a tidied data frame and “casts” it into a document-term matrix (dtm)

dplyr functions

  • count() lets you quickly count the unique values of one or more variables
  • group_by() takes a data frame and one or more variables to group by
  • summarise() creates a summary of data using arguments like sum and mean

stm functions

  • textProcessor() takes in a vector or column of raw texts and performs text processing like removing punctuation and word stemming.
  • prepDocuments() performs several corpus manipulations including removing words and renumbering word indices

Tidying Text

Prior to topic modeling, we have a few remaining steps to tidy our text that hopefully should feel familiar by this point. If you recall from Chapter 1 of Text Mining With R, these preprocessing steps include:

  1. Transforming our text into “tokens”
  2. Removing unnecessary characters, punctuation, and whitespace
  3. Converting all text to lowercase
  4. Removing stop words such as “the”, “of”, and “to”

Let’s tokenize our forum text using the familiar unnest_tokens() function and remove stop words as usual:

## # A tibble: 165,720 x 10
##    course_id course_name       forum_id forum_name discussion_id discussion_name
##    <chr>     <chr>             <chr>    <chr>      <chr>         <chr>          
##  1 9         Teaching Statist~ 126      Investiga~ 6822          Not much compa~
##  2 9         Teaching Statist~ 126      Investiga~ 6822          Not much compa~
##  3 9         Teaching Statist~ 126      Investiga~ 6822          Not much compa~
##  4 9         Teaching Statist~ 126      Investiga~ 6822          Not much compa~
##  5 9         Teaching Statist~ 126      Investiga~ 6822          Not much compa~
##  6 9         Teaching Statist~ 126      Investiga~ 6822          Not much compa~
##  7 9         Teaching Statist~ 126      Investiga~ 6822          Not much compa~
##  8 9         Teaching Statist~ 126      Investiga~ 6822          Not much compa~
##  9 9         Teaching Statist~ 126      Investiga~ 6822          Not much compa~
## 10 9         Teaching Statist~ 126      Investiga~ 6822          Not much compa~
## # ... with 165,710 more rows, and 4 more variables: post_id <chr>,
## #   post_title <chr>, post_date <chr>, word <chr>
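The tokenizing and stop-word steps above can be sketched on a toy data frame. The two posts and the names toy_posts and toy_tidy are made up for illustration; with the real data you would start from ts_forum_data and its post_content column:

```r
library(dplyr)
library(tibble)
library(tidytext)

# Two invented posts standing in for the forum data.
toy_posts <- tibble(
  post_id = c("1", "2"),
  post_content = c("I agree that the students need time with real data.",
                   "My class collected data on travel time to school!")
)

toy_tidy <- toy_posts %>%
  unnest_tokens(output = word, input = post_content) %>%  # one lowercase word per row
  anti_join(stop_words, by = "word")                      # drop words in the stop_words lexicon

toy_tidy
```

Note that unnest_tokens() lowercases the text and strips punctuation by default, which covers steps 2 and 3 of the preprocessing list.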

Now let’s do a quick word count to see some of the most common words used throughout the forums. This should give us a sense of what we’re working with, and later we’ll need these word counts for creating our document term matrix for topic modeling:

## # A tibble: 13,666 x 2
##    word            n
##    <chr>       <int>
##  1 students     6837
##  2 data         4338
##  3 statistics   3095
##  4 school       1488
##  5 questions    1470
##  6 time         1252
##  7 class        1208
##  8 agree         999
##  9 teaching      987
## 10 statistical   957
## # ... with 13,656 more rows
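A count like the one above comes from dplyr’s count() function. Here is a minimal sketch on a handful of invented tokens (toy_words and toy_counts are illustrative names):

```r
library(dplyr)
library(tibble)

# Six invented tokens standing in for the tidied forum text.
toy_words <- tibble(
  word = c("students", "data", "students", "time", "data", "students")
)

toy_counts <- toy_words %>%
  count(word, sort = TRUE)   # sort = TRUE puts the most frequent words first

toy_counts
```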

Terms like “students,” “data,” and “class” are about what we would have expected from a course on teaching statistics. The terms “agree” and “time,” however, are not so intuitive and are worth a quick look as well.

✅ Comprehension Check

Use the filter() and grepl() functions introduced in Unit 1, Section 3b to filter for rows in our ts_forum_data data frame that contain the terms “agree” and “time” and another term or terms of your choosing. Select a random sample of 10 posts using the sample_n() function for your terms and answer the following questions:

## # A tibble: 10 x 1
##    post_content                                                                 
##    <chr>                                                                        
##  1 "-  Airport Data     <What learning goal(s) could this task be used for stud~
##  2 "As a learner of statistics  and then as a teacher  I think that is too much~
##  3 "I agree.  The best form of graphical data display depends on what you are t~
##  4 "I really liked how the students refined their questions by looking at the d~
##  5 "I think it's great that so many of you are teaching non-AP stats courses!! ~
##  6 "Technology ushers in fundamental structural changes that can be integral to~
##  7 "I try to have my students engaged in projects at least 8 times during the y~
##  8 "This unit uses data from Census @School. However  the data evolved from jus~
##  9 "One specific thought sticks out in my mind as I reflect on these questions ~
## 10 "Maybe I am not understanding the task fully without actually working with t~
  1. What, if anything, do these posts have in common?
    • similar forum topics: either “discuss w/ your colleagues or investigate”
  2. What topics or themes might be apparent, or do you anticipate emerging, from our topic modeling?
    • Just guessing here but: technology, application to classroom, including time limitations
Your output should look something like this:

## # A tibble: 10 x 1
##    post_content                                                                 
##    <chr>                                                                        
##  1 "I teach Algebra 2 as well.  I taught a Unit on Statistics in the Algebra 2 ~
##  2 "If we look at the CCSL  starting in 6th grade students begin to ask statist~
##  3 "The short class can also be a challenge.   I have sent students home with t~
##  4 "Hi Margaret      I agree with your comments.  The Coke vs Pepsi activity wa~
##  5 "I found the definitions of math and statistics helpful.  As a math major  I~
##  6 "Unfortunately  my class is small (typically 5 - 15 students) and that makes~
##  7 "Every time I teach statistics  I try to add more that my students can do  s~
##  8 "Lana  you are right!  The louder and brighter - the better for the students~
##  9 "In my first year teaching AP Stats  I found my biggest problem was finding ~
## 10 "Dear Participant         It has been a pleasure to offer the view.php?52 Te~
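One way to produce a sample like this is sketched below on a toy data frame. The posts are invented, the pattern "agree|time" matches posts containing either term (the | means “or”), and the seed is only there so the random sample is repeatable:

```r
library(dplyr)
library(tibble)

# Five invented posts standing in for ts_forum_data.
toy_forum_data <- tibble(
  post_id = as.character(1:5),
  post_content = c(
    "I agree with your comments about the activity.",
    "Every time I teach statistics I try something new.",
    "Students collected their own data this week.",
    "I agree, and finding class time is the hard part.",
    "The video on dot plots was helpful."
  )
)

set.seed(588)  # make the random sample repeatable

toy_quotes <- toy_forum_data %>%
  select(post_content) %>%
  filter(grepl("agree|time", post_content, ignore.case = TRUE)) %>%  # keep matching posts
  sample_n(2)                                                        # draw 2 of them at random

toy_quotes
```

With the real data you would use sample_n(10) and add your own terms to the pattern.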

Creating a Document Term Matrix

For now, however, let’s treat each individual post as a unique “document.” As noted above, to create our document term matrix, we’ll first need to count() how many times each word occurs in each document, or post_id in our case. We’ll then create a matrix that contains one row per post, as our original data frame did, but now with a column for each word in the entire corpus and a value of n for how many times that word occurs in each post.

To create this document term matrix from our post counts, we’ll use the cast_dtm() function like so and assign it to the variable forums_dtm:
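A minimal sketch of the count-then-cast pattern, using a handful of invented tokens (toy_tidy and toy_dtm are illustrative names; the walkthrough’s own object is forums_dtm):

```r
library(dplyr)
library(tibble)
library(tidytext)

# Invented tokens: three posts, three distinct words.
toy_tidy <- tibble(
  post_id = c("1", "1", "1", "2", "2", "3"),
  word    = c("students", "data", "students", "data", "class", "students")
)

toy_dtm <- toy_tidy %>%
  count(post_id, word) %>%                              # n = times each word occurs in each post
  cast_dtm(document = post_id, term = word, value = n)  # one row per post, one column per word

toy_dtm
class(toy_dtm)
```

Note that cast_dtm() relies on the tm package being installed, since the resulting DocumentTermMatrix is a tm object.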

✅ Comprehension Check

Take a look at our forums_dtm object in the console and answer the following question:

  1. What “class” of object is forums_dtm? a DocumentTermMatrix, stored as a simple triplet matrix

  2. How many unique documents and terms are included in our matrix? documents: 5761, terms: 13666

  3. Why might there be fewer documents/posts than were in our original data frame? most likely because posts made up entirely of stop words (or with no text at all) lose every row when stop words are removed, so they never make it into the matrix

  4. What exactly is meant by “sparsity”? the share of cells in the matrix that are zero, i.e. post-word combinations where the word never occurs in the post

## [1] "DocumentTermMatrix"    "simple_triplet_matrix"
## <<DocumentTermMatrix (documents: 5761, terms: 13666)>>
## Non-/sparse entries: 135796/78594030
## Sparsity           : 100%
## Maximal term length: 320
## Weighting          : term frequency (tf)
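To make the sparsity figure concrete, here is a quick check using only the numbers printed in the summary above. The matrix is about 99.8% zeros, which the printout rounds to 100%:

```r
# Counts copied from the printed forums_dtm summary above.
n_documents <- 5761
n_terms     <- 13666
non_sparse  <- 135796     # cells holding a count greater than zero
sparse      <- 78594030   # cells equal to zero

total_cells <- n_documents * n_terms
sparsity    <- sparse / total_cells

total_cells == non_sparse + sparse   # the two kinds of cell account for every cell
round(100 * sparsity, 2)             # percentage of cells that are zero
```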

2c. To Stem or not to Stem?

Processing and Stemming for STM

Like unnest_tokens(), the textProcessor() function includes several useful arguments for processing text, like converting text to lowercase and removing punctuation and numbers. I’ve included several of these in the script below, along with the default values that are used if you do not specify them explicitly. Most of these are pretty intuitive and you can learn more by viewing the ?textProcessor documentation.

Let’s go ahead and process our discussion forum post_content in preparation for structural topic modeling:

## Building corpus... 
## Converting to Lower Case... 
## Removing punctuation... 
## Removing stopwords... 
## Removing numbers... 
## Stemming... 
## Creating Output...

Note that the first argument the textProcessor() function expects is the column in our data frame that contains the text to be processed. The second argument, metadata =, expects the data frame that contains the text of interest and uses the column names to label the metadata, such as course ids and forum names. This metadata can be used to improve the assignment of words to topics in a corpus and to examine the relationship between topics and various covariates of interest.

Unlike the unnest_tokens() function, the output is not a nice tidy data frame. Topic modeling with the stm package requires a set of inputs that are specific to the package. The following code pulls the elements from the temp list we just created that the stm() function will require in Section 3b:
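As a rough sketch of both steps, the example below processes text and then pulls out the pieces of the resulting list. It uses the gadarian survey data that ships with stm as a stand-in for ts_forum_data, and the names docs, vocab, and meta are assumptions chosen for illustration:

```r
library(stm)

# gadarian ships with stm; it stands in here for ts_forum_data.
data(gadarian)

temp <- textProcessor(
  documents = gadarian$open.ended.response,  # the text column
  metadata = gadarian,                       # the full data frame, kept as metadata
  lowercase = TRUE,                          # the values below are the documented defaults
  removestopwords = TRUE,
  removenumbers = TRUE,
  removepunctuation = TRUE,
  stem = TRUE,
  verbose = FALSE                            # silence the progress messages
)

# Pull out the three pieces that stm() will need later.
docs  <- temp$documents
vocab <- temp$vocab
meta  <- temp$meta

length(docs)
head(vocab)
```

For the forum data, the first argument would be ts_forum_data$post_content and metadata would be ts_forum_data itself.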

Stemming Tidy Text

Notice that the textProcessor() stem argument we used above is set to TRUE by default. We haven’t introduced word stemming up to this point because there is some debate about the actual value of this process. Collapsing words like “students” and “student” into their base word might make sense and can make analyses and visualizations more concise and easier to interpret. Hvitfeldt and Silge (2021) note, however, that words like the following have dramatic differences in meaning, semantics, and use, and collapsing them could result in poor models or misinterpreted results:

  • meaning and mean
  • likely, like, liking
  • university and universe

The first word pair is particularly relevant to discussion posts from our Teaching Statistics course data. In addition, collapsing words like “teachers” and “teaching” could dramatically alter the results from a topic model.

For now, we will leave as is the forums_dtm we created earlier with words unstemmed, but what if we wanted to stem words in a “tidy” way?

Since the unnest_tokens() function does not (intentionally, I believe) include a stemming option, one approach would be to use the wordStem() function from the SnowballC package to either replace our word column with word stems or create a new variable called stem with our stemmed words. Let’s do the latter and take a look at the original words and the stems that were produced:

## # A tibble: 165,720 x 11
##    course_id course_name       forum_id forum_name discussion_id discussion_name
##    <chr>     <chr>             <chr>    <chr>      <chr>         <chr>          
##  1 9         Teaching Statist~ 126      Investiga~ 6822          Not much compa~
##  2 9         Teaching Statist~ 126      Investiga~ 6822          Not much compa~
##  3 9         Teaching Statist~ 126      Investiga~ 6822          Not much compa~
##  4 9         Teaching Statist~ 126      Investiga~ 6822          Not much compa~
##  5 9         Teaching Statist~ 126      Investiga~ 6822          Not much compa~
##  6 9         Teaching Statist~ 126      Investiga~ 6822          Not much compa~
##  7 9         Teaching Statist~ 126      Investiga~ 6822          Not much compa~
##  8 9         Teaching Statist~ 126      Investiga~ 6822          Not much compa~
##  9 9         Teaching Statist~ 126      Investiga~ 6822          Not much compa~
## 10 9         Teaching Statist~ 126      Investiga~ 6822          Not much compa~
## # ... with 165,710 more rows, and 5 more variables: post_id <chr>,
## #   post_title <chr>, post_date <chr>, word <chr>, stem <chr>

You can see that words like “activity” and “activities” that occur frequently in our discussions have been reduced to the word stem “activ”. If you are interested in learning other approaches for word stemming in R, as well as reading a more in-depth description of the stemming process, I highly recommend Chapter 4, Stemming, of Hvitfeldt and Silge’s (2021) book, Supervised Machine Learning for Text Analysis in R.
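The stemming step described above can be sketched on a few words from our discussion. The toy_tidy and toy_stemmed names are illustrative only:

```r
library(dplyr)
library(tibble)
library(SnowballC)

# Four words of the kind that appear in the forum posts.
toy_tidy <- tibble(word = c("activity", "activities", "students", "student"))

toy_stemmed <- toy_tidy %>%
  mutate(stem = wordStem(word))   # adds a stem column next to the original word

toy_stemmed
```

Keeping the original word column alongside stem makes it easy to spot-check which words were collapsed together.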

✅ Comprehension Check

Complete the following code using what we learned in the section on Creating a Document Term Matrix and answer the following questions:

  1. How many fewer terms are in our stemmed document term matrix? documents: 5761, terms: 10060 (reduced by 3,606)

  2. Did stemming words significantly reduce the sparsity of the matrix? sparsity = 100% - the sparsity stayed the same

Hint: Make sure your code includes stem counts rather than word counts.

## <<DocumentTermMatrix (documents: 5761, terms: 10060)>>
## Non-/sparse entries: 129593/57826067
## Sparsity           : 100%
## Maximal term length: 320
## Weighting          : term frequency (tf)
## <<DocumentTermMatrix (documents: 5761, terms: 13666)>>
## Non-/sparse entries: 135796/78594030
## Sparsity           : 100%
## Maximal term length: 320
## Weighting          : term frequency (tf)
## # A tibble: 10,060 x 2
##    stem         n
##    <chr>    <int>
##  1 student   7346
##  2 data      4338
##  3 statist   4152
##  4 question  2470
##  5 teach     1841
##  6 school    1606
##  7 class     1520
##  8 time      1424
##  9 learn     1355
## 10 task      1214
## # ... with 10,050 more rows

3. MODEL

This unit provides our first opportunity for modeling a text as data. In very simple terms, modeling involves developing a mathematical summary of a dataset. These summaries can help us further explore trends and patterns in our data.

3a. Fitting a Topic Model with LDA

Before running our first topic model using the LDA() function, let’s quickly recap from our readings some basic principles behind Latent Dirichlet allocation and why LDA is often preferred over other automatic classification or clustering approaches.

Unlike simple forms of cluster analysis such as k-means clustering, LDA is a “mixture” model, which in our context means that:

  1. Every document contains a mixture of topics. Unlike algorithms like k-means, LDA treats each document as a mixture of topics, which allows documents to “overlap” each other in terms of content, rather than being separated into discrete groups. So in practice, this means that a discussion forum post could have an estimated topic proportion of 70% for Topic 1 (i.e. be mostly about Topic 1), but also be partly about Topic 2.
  2. Every topic contains a mixture of words. For example, if we specified in our LDA model just 2 topics for our discussion posts, we might find that one topic seems to be about pedagogy while another is about learning. The most common words in the pedagogy topic might be “teacher”, “strategies”, and “instruction”, while the learning topic may be made up of words like “understanding” and “students”. However, words can be shared between topics and words like “statistics” or “assessment” might appear in both equally.

Similar to k-means and other simple clustering approaches, however, LDA does require us to specify a value of k for the number of topics in our corpus. Selecting k is no trivial matter and can greatly impact your results.

Since we don’t have a strong rationale for the number of topics that might exist in the discussion forums, let’s use the n_distinct() function from the dplyr package to find the number of unique forum names in our course data and run with that:

## [1] 20

Since it looks like there are 20 distinct discussion forums, we’ll use that as our value for the k = argument of LDA(). Be patient while this runs, since the default setting is to perform a large number of iterations.

## [1] 20
## A LDA_VEM topic model with 20 topics.

Note that we used the control = argument to pass a random number (588) to seed the assignment of topics to each word in our corpus. Since LDA is a stochastic algorithm that could produce different results depending on where the algorithm starts, we specified a seed for reproducibility so that we all see the same results every time we specify the same number of topics.
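Putting these pieces together, here is a minimal sketch of the whole sequence on invented tokens: count the distinct forums, cast a document term matrix, and fit an LDA model with a seed. All toy_ names and the two forum names are made up, so k comes out as 2 here rather than 20:

```r
library(dplyr)
library(tibble)
library(tidytext)
library(topicmodels)

# Invented tokens: six short posts spread across two forums.
toy_tidy <- tibble(
  forum_name = rep(c("Forum A", "Forum B"), each = 12),
  post_id    = rep(as.character(1:6), each = 4),
  word = c("students", "data", "collect", "real",
           "data", "students", "real", "sets",
           "students", "collect", "data", "data",
           "technology", "software", "excel", "program",
           "software", "technology", "students", "excel",
           "program", "excel", "technology", "software")
)

k_topics <- n_distinct(toy_tidy$forum_name)   # number of unique forums

toy_dtm <- toy_tidy %>%
  count(post_id, word) %>%
  cast_dtm(post_id, word, n)

# control = list(seed = ...) fixes the random starting point.
toy_lda <- LDA(toy_dtm, k = k_topics, control = list(seed = 588))

toy_lda

# Each row is one post's estimated mixture of topics; rows sum to 1.
round(posterior(toy_lda)$topics, 2)
```

The last line illustrates the “mixture” idea from above: every post gets a proportion for every topic rather than a single cluster label.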

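And tying back to our work in Unit 1, Bail (2020) notes that topic assignments for each word are updated in an iterative fashion and that LDA employs the Term Frequency-Inverse Document Frequency (TF-IDF) metric to assign probabilities. Keep in mind, though, that the forums_dtm we passed to LDA() uses plain term frequency (tf) weighting, as its printed summary in Section 2b shows.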

3b. Fitting a Structural Topic Model

Bail notes that LDA, while perhaps the most common approach to topic modeling, is just one of many different types, including Dynamic Topic Models, Correlated Topic Models, Hierarchical Topic Models, and more recently, Structural Topic Modeling (STM). He argues that one reason STM has been rising in popularity and use is that it employs metadata about documents to improve the assignment of words to topics in a corpus, and that this metadata can also be used to examine relationships between covariates and documents.

Also, since Julia Silge has indicated that STM is “my current favorite implementation of topic modeling in R” and has built support in the tidytext package for structural topic models, this package is definitely worth discussing in this walkthrough. I also highly recommend her own walkthrough of the stm package, The game is afoot! Topic modeling of Sherlock Holmes stories, as well as her follow-up post, Training, evaluating, and interpreting topic models.

The stm Package

As we’ve seen above, the textProcessor() function produced an unusual temp output that is unique to the stm package. And as you’ve probably already guessed, the stm() function for fitting a structural topic model does not take a standard document term matrix like the LDA() function does.

Before we fit our model, we’ll have to extract the elements from the temp object created after we processed our text. Specifically, the stm() function expects the following arguments:

  • documents = the document term matrix to be modeled in the native stm format
  • data = an optional data frame containing metadata for the prevalence and/or content covariates to include in the model
  • vocab = a character vector specifying the words in the corpus in the order of the vocab indices in documents.

Let’s go ahead and extract these elements:

And now let’s use these elements to fit the model, using the same number of topics for K that we specified for our LDA topic model. Let’s also take advantage of the fact that we can include the course_id and forum_id covariates in the prevalence = argument to help improve, in theory, our model fit:

## A topic model with 20 topics, 5777 documents and a 9306 word dictionary.

As noted earlier, the stm package has a number of handy features. One of these is the plot.STM() function for viewing the most probable words assigned to each topic.

By default, it only shows the first 3 terms so let’s change that to 5 to help with interpretation:

Note that you can also just use plot():
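The fitting and plotting steps above can be sketched as follows. This is only an illustration under stated assumptions: it uses the gadarian data bundled with stm rather than the forum data, runs the processed text through prepDocuments() (listed in Section 2b) before fitting, and borrows the treatment and pid_rep covariates from that dataset in place of course_id and forum_id. The name toy_stm is hypothetical:

```r
library(stm)

data(gadarian)   # stand-in corpus bundled with stm, not the forum data

temp <- textProcessor(gadarian$open.ended.response, metadata = gadarian,
                      verbose = FALSE)
out <- prepDocuments(temp$documents, temp$vocab, temp$meta, verbose = FALSE)

toy_stm <- stm(
  documents = out$documents,
  vocab = out$vocab,
  data = out$meta,
  K = 5,                                  # the walkthrough uses K = 20
  prevalence = ~ treatment + s(pid_rep),  # covariates from gadarian, not the forums
  seed = 588,
  verbose = FALSE
)

toy_stm

plot(toy_stm, n = 5)   # dispatches to the STM plot method; n sets the words shown per topic
```

For the forum data, the prevalence formula would instead name course_id and forum_id, and K would be 20.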

✅ Comprehension Check

Fit a model for both LDA and STM using different values for K and answer the following questions:

  1. What topics appear to be similar to those using 20 topics for K?
## A topic model with 14 topics, 5777 documents and a 9306 word dictionary.

## A topic model with 12 topics, 5777 documents and a 9306 word dictionary.

  2. Knowing that you don’t have as much context as the researchers of this study do, how might you interpret one of these latent topics or themes using the key terms assigned?
    • Topic 10 seems to be about experiential learning via a project that required using technology
  3. What topic emerged that seems dramatically different and how might you interpret this topic?
    • Topic 6 seems to be about the speed of a roller coaster (maybe a wooden versus a metal track?). Not knowing more, I think this may refer to a specific project

3c. Finding K

As alluded to earlier, selecting the number of topics for your model is a non-trivial decision and can dramatically impact your results. Bail (2018) notes that

The results of topic models should not be over-interpreted unless the researcher has strong theoretical apriori about the number of topics in a given corpus, or if the researcher has carefully validated the results of a topic model using both the quantitative and qualitative techniques described above.

There are several approaches to estimating a value for K and we’ll take a quick look at one from the ldatuning package and one from our stm package.

The FindTopicsNumber Function

The ldatuning package has functions for both calculating and plotting different metrics that can be used to estimate the most preferable number of topics for an LDA model. It also conveniently takes the standard document term matrix object that we created from our tidy text and has the added benefit of running fairly quickly, especially compared to the function for finding K from the stm package.

Let’s use the defaults specified in the ?FindTopicsNumber documentation and modify them slightly to get metrics for a sequence of topics from 10 to 75, counting by 5, and then plot the output we saved using the FindTopicsNumber_plot() function:
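Here is a small sketch of that call. So that it finishes in seconds, it uses a 10-document slice of the AssociatedPress data bundled with topicmodels instead of forums_dtm, and a much shorter sequence of topic numbers; the names toy_dtm and k_metrics are illustrative:

```r
library(topicmodels)
library(ldatuning)

# A 10-document slice of the AssociatedPress data bundled with topicmodels
# stands in for forums_dtm so this runs quickly.
data("AssociatedPress", package = "topicmodels")
toy_dtm <- AssociatedPress[1:10, ]

k_metrics <- FindTopicsNumber(
  toy_dtm,
  topics = seq(2, 10, by = 2),   # the walkthrough uses seq(10, 75, by = 5)
  metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
  method = "Gibbs",
  control = list(seed = 588),
  mc.cores = 1L,                 # number of cores to use
  verbose = FALSE
)

k_metrics                        # one row per candidate number of topics
FindTopicsNumber_plot(k_metrics)
```

In the resulting plot, two of the metrics are meant to be minimized and two maximized, so look for the range of topic numbers where the curves level off or agree.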

The searchK() Function

Finally, Bail (2018) notes that the stm package has a useful function called searchK() which allows us to specify a range of values for k and outputs multiple goodness-of-fit measures that are “very useful in identifying a range of values for k that provide the best fit for the data.”

The syntax of this function is very similar to the stm() function we used above, except that we specify a range for k as one of the arguments. In the code below, we search all values of k between 10 and 30.
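A minimal sketch of the searchK() syntax is shown here on the small gadarian dataset bundled with stm, with only three candidate values for K so that it runs quickly. The covariates come from that dataset, not from the forums, and toy_search is a hypothetical name:

```r
library(stm)

data(gadarian)   # stand-in corpus bundled with stm, not the forum data

temp <- textProcessor(gadarian$open.ended.response, metadata = gadarian,
                      verbose = FALSE)
out <- prepDocuments(temp$documents, temp$vocab, temp$meta, verbose = FALSE)

set.seed(588)
toy_search <- searchK(
  documents = out$documents,
  vocab = out$vocab,
  K = c(3, 5, 7),                         # the walkthrough searches 10 through 30
  prevalence = ~ treatment + s(pid_rep),  # covariates from gadarian
  data = out$meta
)

plot(toy_search)   # diagnostic plots for each value of K
```

Because a separate model is fit for every value of K, run time grows quickly with both the size of the corpus and the number of values searched.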

Note that running the searchK() function on this corpus took all night on a pretty powerful MacBook Pro and crashed once as well, so I do not expect you to run this for the walkthrough. I ran a couple of iterations and landed on a range between 5 and 15, with an optimal number of topics somewhere around 14:

Given the somewhat conflicting results, and somewhat selfishly for the sake of simplicity in this walkthrough, I’m just going to stick with the rather arbitrary selection of 20 topics for the remainder of this Unit.

The LDAvis Explorer

One final tool that I want to introduce from the stm package is the toLDAvis() function, which provides great visualizations for exploring topic and word distributions using the LDAvis topic browser:

## Loading required namespace: servr
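As a rough sketch, the call might look like the following. It again uses the gadarian stand-in data and a 5-topic model rather than our 20-topic forum model, and it assumes the LDAvis and servr packages are installed. The out.dir location and the toy_stm and vis_dir names are choices made for this example:

```r
library(stm)

data(gadarian)   # stand-in corpus bundled with stm, not the forum data

temp <- textProcessor(gadarian$open.ended.response, metadata = gadarian,
                      verbose = FALSE)
out <- prepDocuments(temp$documents, temp$vocab, temp$meta, verbose = FALSE)

toy_stm <- stm(out$documents, out$vocab, K = 5,
               prevalence = ~ treatment + s(pid_rep),
               data = out$meta, seed = 588, verbose = FALSE)

# Write the browser files to a temporary folder; the browser only
# opens automatically when R is running interactively.
vis_dir <- file.path(tempdir(), "toy_ldavis")
toLDAvis(mod = toy_stm, docs = out$documents,
         out.dir = vis_dir, open.browser = interactive())
```

The docs argument should be the same documents object the model was fit on.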

Our current stm model of 20 topics is resulting in a lot of overlap among topics, which suggests that 20 may not be an optimal number of topics, as the other approaches for finding k also suggest:

4. EXPLORE

Silge and Robinson (2018) note that fitting a topic model is the “easy part.” The hard part is making sense of the model results, and the rest of the analysis involves exploring and interpreting the model using a variety of approaches, which we’ll walk through in this section.

Bail (2018) cautions, however, that:

…post-hoc interpretation of topic models is rather dangerous… and can quickly come to resemble the process of “reading tea leaves,” or finding meaning in patterns that are in fact quite arbitrary or even random.

4a. Exploring Beta Values

Hidden within this forums_lda topic model object we created are per-topic-per-word probabilities, called β (“beta”). It is the probability of a term (word) belonging to a topic. 

Let’s take a look at the 5 most likely terms assigned to each topic, i.e. those with the largest β values using the terms() function from the topicmodels package:

##      Topic 1      Topic 2      Topic 3       Topic 4    Topic 5    Topic 6   
## [1,] "school"     "statistics" "test"        "grade"    "students" "students"
## [2,] "students"   "em"         "students"    "data"     "task"     "level"   
## [3,] "middle"     "unit"       "standard"    "graphs"   "data"     "dice"    
## [4,] "video"      "education"  "deviation"   "students" "tasks"    "size"    
## [5,] "elementary" "teaching"   "statistical" "plots"    "question" "sample"  
##      Topic 7      Topic 8       Topic 9     Topic 10     Topic 11    
## [1,] "students"   "statistics"  "questions" "assessment" "technology"
## [2,] "statistics" "teaching"    "kids"      "resource"   "students"  
## [3,] "feel"       "math"        "love"      "items"      "software"  
## [4,] "teach"      "teachers"    "students"  "locus"      "program"   
## [5,] "questions"  "mathematics" "sharing"   "students"   "excel"     
##      Topic 12   Topic 13     Topic 14    Topic 15    Topic 16     Topic 17    
## [1,] "students" "activity"   "question"  "students"  "fuel"       "students"  
## [2,] "project"  "students"   "data"      "questions" "cost"       "makes"     
## [3,] "real"     "agree"      "students"  "class"     "src"        "regression"
## [4,] "class"    "lesson"     "questions" "learn"     "codap"      "wondering" 
## [5,] "projects" "activities" "census"    "answer"    "pluginfile" "model"     
##      Topic 18   Topic 19     Topic 20
## [1,] "data"     "sample"     "stats" 
## [2,] "students" "difference" "ap"    
## [3,] "real"     "samples"    "class" 
## [4,] "collect"  "random"     "school"
## [5,] "sets"     "population" "stat"
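For readers who want to see the call itself, here is a minimal sketch on a made-up six-post corpus rather than the actual forum data. The toy_posts text, the seed, and k = 2 are all assumptions for illustration; with the real model the last step would simply be terms(forums_lda, 5).

```r
library(dplyr)
library(tidytext)
library(topicmodels)

# Made-up posts standing in for the forum data (hypothetical)
toy_posts <- tibble(
  post_id = as.character(1:6),
  text = c(
    "students collect real data sets",
    "students analyze real data in class",
    "software like excel helps students",
    "technology and software in the classroom",
    "collect data and analyze data sets",
    "excel is technology students use"
  )
)

# Tokenize, drop stop words, count words per post, and cast to a
# document-term matrix
toy_dtm <- toy_posts %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  count(post_id, word) %>%
  cast_dtm(post_id, word, n)

# Fit a small LDA model; the seed only makes the toy run reproducible
toy_lda <- LDA(toy_dtm, k = 2, control = list(seed = 588))

# A matrix with one column per topic and the 5 most likely terms in each
top5 <- terms(toy_lda, 5)
top5
```

With so few posts the toy topics are not meaningful; the point is only the shape of the result, one column of terms per topic.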

Even though we’ve somewhat arbitrarily selected the number of topics for our corpus, some of these topics or themes are fairly intuitive to interpret. For example:

  • Topic 11 (technology, students, software, program, excel) seems to be about students’ use of technology, including software programs like Excel;

  • Topic 9 (questions, kids, love, students, sharing, with “gapminder” appearing among its top seven terms) seems to be about the Gapminder activity from the MOOC-Ed and kids’ enjoyment of it; and

  • Topic 18 (data, students, real, collect, sets) seems to be about student collection and use of real-world data sets.

Not surprisingly, the tidytext package has a handy function, conveniently named tidy(), to convert our LDA model to a tidy data frame containing these beta values for each term:

## # A tibble: 273,320 x 3
##    topic term       beta
##    <int> <chr>     <dbl>
##  1     1 2015  1.36e- 67
##  2     2 2015  1.55e-  4
##  3     3 2015  1.73e-  4
##  4     4 2015  3.27e-  4
##  5     5 2015  4.43e- 80
##  6     6 2015  1.17e-114
##  7     7 2015  2.73e- 50
##  8     8 2015  7.10e- 40
##  9     9 2015  2.45e- 29
## 10    10 2015  1.12e-  3
## # ... with 273,310 more rows

Obviously, it’s not very easy to interpret what the topics are about from a data frame like this so let’s borrow code again from Chapter 8.4.3 Interpreting the topic model in Text Mining with R to examine the top 5 terms for each topic and then look at this information visually:
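A minimal sketch of that step, adapted from the approach in Text Mining with R. The toy_beta tibble is a hand-made stand-in for the tidy(forums_lda, matrix = "beta") output, so its terms and probabilities are invented for illustration, and it keeps the top 3 terms rather than 5 because the toy vocabulary is so small:

```r
library(dplyr)
library(ggplot2)
library(tidytext)

# Hypothetical stand-in for tidy(forums_lda, matrix = "beta")
toy_beta <- tibble(
  topic = rep(1:2, each = 4),
  term  = rep(c("data", "students", "excel", "software"), times = 2),
  beta  = c(0.50, 0.30, 0.15, 0.05,
            0.05, 0.25, 0.30, 0.40)
)

# Keep the highest-beta terms within each topic
top_terms <- toy_beta %>%
  group_by(topic) %>%
  slice_max(beta, n = 3, with_ties = FALSE) %>%
  ungroup() %>%
  arrange(topic, desc(beta))

# One bar-chart panel per topic, with terms ordered by beta within the panel
top_terms_plot <- top_terms %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(beta, term, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  scale_y_reordered()

top_terms_plot
```

reorder_within() and scale_y_reordered() come from tidytext and are needed because the same term can rank differently in different topics.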

4b. Exploring Gamma Values

Now that we have a sense of the most common words associated with each topic, let’s take a look at the topic prevalence in our MOOC-Ed discussion forum corpus, including the words that contribute to each topic we examined above.

Also hidden within the forums_lda topic model object we created are per-document-per-topic probabilities, called γ (“gamma”). These give the probability that each document is generated from each topic. We can combine our beta and gamma values to understand the topic prevalence in our corpus and which words contribute to each topic.

To do this, we’re going to borrow some code from the Silge (2018) post, Training, evaluating, and interpreting topic models.

First, let’s create two tidy data frames for our beta and gamma values:

## # A tibble: 273,320 x 3
##    topic term       beta
##    <int> <chr>     <dbl>
##  1     1 2015  1.36e- 67
##  2     2 2015  1.55e-  4
##  3     3 2015  1.73e-  4
##  4     4 2015  3.27e-  4
##  5     5 2015  4.43e- 80
##  6     6 2015  1.17e-114
##  7     7 2015  2.73e- 50
##  8     8 2015  7.10e- 40
##  9     9 2015  2.45e- 29
## 10    10 2015  1.12e-  3
## # ... with 273,310 more rows
## # A tibble: 115,220 x 3
##    document topic    gamma
##    <chr>    <int>    <dbl>
##  1 11295        1 0.00241 
##  2 12711        1 0.000381
##  3 12725        1 0.0289  
##  4 12733        1 0.142   
##  5 12743        1 0.00928 
##  6 12744        1 0.00476 
##  7 12756        1 0.0289  
##  8 12757        1 0.00353 
##  9 12775        1 0.00353 
## 10 12816        1 0.00353 
## # ... with 115,210 more rows
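The two data frames above come from calling tidy() with the matrix argument. Here is a minimal sketch on a tiny invented word-count table; toy_counts, the seed, and k = 2 are assumptions for illustration, not the forum data:

```r
library(dplyr)
library(tidytext)
library(topicmodels)

# Invented per-document word counts (hypothetical)
toy_counts <- tribble(
  ~document, ~term,      ~n,
  "d1",      "data",     3L,
  "d1",      "students", 2L,
  "d2",      "data",     2L,
  "d2",      "collect",  1L,
  "d3",      "excel",    2L,
  "d3",      "software", 3L,
  "d4",      "software", 1L,
  "d4",      "students", 2L
)

# Cast to a document-term matrix and fit a two-topic LDA model
toy_lda <- toy_counts %>%
  cast_dtm(document, term, n) %>%
  LDA(k = 2, control = list(seed = 123))

# One row per topic-term pair and one row per document-topic pair
td_beta  <- tidy(toy_lda, matrix = "beta")
td_gamma <- tidy(toy_lda, matrix = "gamma")

# Sanity checks: each set of probabilities should total 1
td_beta %>% group_by(topic) %>% summarise(total = sum(beta))
td_gamma %>% group_by(document) %>% summarise(total = sum(gamma))
```

Because β is a distribution over terms and γ a distribution over topics, the beta values should sum to 1 within each topic and the gamma values to 1 within each document, which is what the last two lines check.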

Next, we’ll adopt Julia’s code wholesale to create a filtered data frame of our top_terms, join this to a new data frame for gamma-terms, and create a nice clean table using the kable() function from the knitr package:

## Warning: `cols` is now required when using unnest().
## Please use `cols = c(terms)`
| Topic | Expected topic proportion | Top 7 terms |
|:---------|------:|:-------------------------------------------------------------------|
| Topic 7  | 0.087 | students, statistics, feel, teach, questions, level, school |
| Topic 13 | 0.077 | activity, students, agree, lesson, activities, ideas, resources |
| Topic 15 | 0.070 | students, questions, class, learn, answer, reading, learning |
| Topic 18 | 0.067 | data, students, real, collect, sets, set, analyze |
| Topic 5  | 0.065 | students, task, data, tasks, question, statistical, activity |
| Topic 8  | 0.063 | statistics, teaching, math, teachers, mathematics, teach, science |
| Topic 9  | 0.053 | questions, kids, love, students, sharing, lot, gapminder |
| Topic 11 | 0.051 | technology, students, software, program, excel, time, core |
| Topic 6  | 0.050 | students, level, dice, size, sample, trials, technology |
| Topic 14 | 0.048 | question, data, students, questions, census, answer, survey |
| Topic 12 | 0.047 | students, project, real, class, projects, life, time |
| Topic 10 | 0.047 | assessment, resource, items, locus, students, resources, website |
| Topic 1  | 0.046 | school, students, middle, video, elementary, age, census |
| Topic 3  | 0.039 | test, students, standard, deviation, statistical, results, understand |
| Topic 20 | 0.039 | stats, ap, class, school, stat, math, students |
| Topic 4  | 0.037 | grade, data, graphs, students, plots, 1, box |
| Topic 17 | 0.036 | students, makes, regression, wondering, model, days, linear |
| Topic 19 | 0.034 | sample, difference, samples, random, population, sampling, minutes |
| Topic 2  | 0.030 | statistics, em, unit, education, teaching, learning, online |
| Topic 16 | 0.013 | fuel, cost, src, codap, pluginfile, vehicles, city |
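The “expected topic proportion” in a table like this is the mean gamma for a topic across all documents. A minimal sketch of the idea follows, using hand-made toy_gamma and toy_top_terms tibbles (both hypothetical) in place of the real tidy data frames. It collapses the terms with paste() directly, which is a simplification rather than the code from the post:

```r
library(dplyr)
library(knitr)

# Hypothetical stand-in for tidy(forums_lda, matrix = "gamma")
toy_gamma <- tibble(
  document = rep(c("d1", "d2", "d3"), times = 2),
  topic    = rep(1:2, each = 3),
  gamma    = c(0.9, 0.6, 0.3,
               0.1, 0.4, 0.7)
)

# Hypothetical top terms per topic, already filtered and ordered
toy_top_terms <- tibble(
  topic = c(1L, 1L, 2L, 2L),
  term  = c("data", "students", "software", "excel")
)

# Collapse each topic's terms into a single comma-separated string
top_terms <- toy_top_terms %>%
  group_by(topic) %>%
  summarise(terms = paste(term, collapse = ", "))

# Average gamma per topic, most prevalent first, then attach the terms
gamma_terms <- toy_gamma %>%
  group_by(topic) %>%
  summarise(gamma = mean(gamma)) %>%
  arrange(desc(gamma)) %>%
  left_join(top_terms, by = "topic") %>%
  mutate(topic = paste("Topic", topic))

kable(gamma_terms, digits = 3,
      col.names = c("Topic", "Expected topic proportion", "Top terms"))
```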

And let’s also use the plot() function to compare this to the most prevalent topics and terms from the forums_stm model that we created:

4c. Reading the Tea Leaves

Recognizing that topic modeling is best used as a “tool for reading” and provides only an incomplete answer to our overarching question, “How do we quantify what a corpus is about?”, the results do suggest some potential topics that have emerged as well as some areas worth following up on.

Specifically, looking at some of the common clusters of words for the more prevalent topics suggests that some key topics or “latent themes” (renamed in bold) might include:

  • Teaching Statistics: Unsurprisingly, given the course title, the most prevalent topics in both the forums_stm and forums_lda models contain the terms “teach”, “students”, and “statistics”. This could be an “overarching theme,” but more likely it is simply residue of the course title being sprinkled throughout the forums; it deserves some follow-up. Topic 8 from the LDA model may overlap with this topic as well.
  • Course Utility: The second most prevalent topics in the LDA and STM models (Topics 13 and 2, respectively) seem to be about the usefulness of course “resources” like lessons, tools, videos, and activities. I’m wagering this might be a forum dedicated to course feedback. Topic 15 from the STM model also suggests this may be a broader theme.
  • Using Real-World Data: Topics 18 & 12 from the LDA model particularly intrigue me, and I’m wagering they reflect pretty positive sentiment among participants about the value and benefit of having students collect and analyze real data sets (e.g. Census data in Topic 1) and work on projects relevant to their real lives. Will definitely follow up on this one.
  • Technology Use: Several topics (6 & 11 from LDA and 8 & 19 from STM) appear to be about student use of technology and software like calculators and Excel for teaching statistics and using simulations. Topic 16 from LDA also suggests the use of the Common Online Data Analysis Platform (CODAP).
  • Student Struggle & Engagement: Topic 15 from LDA and Topic 16 from STM also intrigue me and appear to be two sides of the same coin. The former includes “struggle” and “reading”, which suggests perhaps a barrier to teaching statistics, while Topic 16 contains top stems like “engage”, “activ”, and “think” and may suggest participants anticipate that activities may engage students.

To serve as a check on my tea leaf reading, I’m going to follow Bail’s recommendation to examine some of these topics qualitatively. The stm package has another useful, though exceptionally fussy, function called findThoughts(), which extracts passages from documents within the corpus associated with topics that you specify.

The first line of code may not be necessary for your independent analysis, but because the textProcessor() function removed several documents during processing, the findThoughts() function can’t properly index the processed docs. This line of code, found on Stack Overflow, removes the documents from the original ts_forum_data source that were removed during processing so that there is a one-to-one correspondence with forums_stm and you can use the function to find posts associated with a given topic.

Let’s slightly reduce our original data set to match our STM model, pass both to the findThoughts() function, and set our arguments to return n = 10 posts from topics = 2 (i.e. Topic 2) that have an estimated topic proportion of at least 50%, i.e. a minimum threshold of thresh = 0.5.
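As a rough illustration of that alignment step, here is a minimal sketch in base R. The posts vector and docs_removed are hypothetical stand-ins for the post text column of ts_forum_data and for the docs.removed element that textProcessor() returns. The findThoughts() call is left commented out because it needs the fitted forums_stm model.

```r
# Hypothetical post texts; the empty second post is the kind of document
# that text processing would drop
posts <- c("first post", "", "third post", "fourth post")

# Hypothetical stand-in for the docs.removed element returned by
# textProcessor(): the positions of the dropped documents
docs_removed <- 2L

# Drop the same positions from the original texts. The length check matters:
# posts[-integer(0)] would return an empty vector, not the full one.
posts_reduced <- if (length(docs_removed) > 0) posts[-docs_removed] else posts

posts_reduced

# Not run: requires the fitted STM model from the walkthrough
# findThoughts(forums_stm,
#              texts  = posts_reduced,
#              topics = 2,
#              n      = 10,
#              thresh = 0.5)
```

The texts passed to findThoughts() need to line up one-to-one with the documents the model was fit on, which is why the removed positions are dropped first.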

## 
##  Topic 2: 
##       WHoops!  forgot the link to the video.  VERY SORRY!     www.ted.com talks hans_rosling_shows_the_best_stats_you_ve_ever_seen www.ted.com talks hans_rosling_shows_the_best_stats_you_ve_ever_seen  "
##      Hello Dina  your excitement perked my interest. I must say I did not really bother about these tools as I was reading them but promised myself to look closer and learn more after I read your comments. Your excitement tells me that there must be something in there that I should take a closer look on. Thanks for the vibes.
##      Thank you for putting that up! I haven't looked at the Extend your Learning section at all  but I have now bookmarked the Against all Odds website!
##      I too bookmarked the LinkedIn video.  I also continued to watch some of the SAS videos.  I plan to show some to my AP students.  I also plan to send it as a link to the guidance counselors.
##      Thank you Bonnie. I visited stattrek to help me refresh my own craftsmanship.
##      Thanks for the link. I have never used that Site before and the nurse Geen example may stimulate some good discussions.
##      When I returned from the preceding link  I accidentally rated this resource.    I would give it zero stars if I could  or one star in a do over  because it takes me to a link to purchase something.
##      Thanks for the video links.  I enjoy TED talks.
##      I've never heard of FiveThirtyEight  could you tell me more about the resource?     Thanks!  Carisa
##      Hi Keren and all      I used the following videos (or parts of them) with Business students but I think many of them would be suitable for other ares as well. I found suitable their size (about 10-15 minutes)  and often used them as a review  www.economicsnetwork.ac.uk statistics videos   And some of the older joy of stats videos  www.open.edu openlearn science-maths-technology mathematics-and-statistics statistics watch-the-joy-stats   Klara Kelecsenyi

Duplicate posts aside, this Course Utility topic returns posts that were expected based on my interpretation of the key terms for Topic 2. It looks like I may have read those tea leaves correctly.

Now let’s take a look at Topic 16 that we thought might be related to student engagement:

## 
##  Topic 16: 
##       Maureen   You have a good point  students want to rush through things and just get to the answer but not really think about things. I intentionally didn't re-read the questions and rushed through it just to see how I did  and I missed some things. Students would as well. I like your idea of demonstrating this for students.   "
##      The framework really makes you think about what level the students are at and what you can do to fit the task to that level. It also gives you an idea of what you can do to maybe challenge your students a little and keep them growing.
##      I agree that the reading would be difficult for many of my students. I work with mostly Tier 2 students. I had to read the questions several times myself and reason out the answers. I can see many of my students having trouble with this task. We would need to practice these types of problems many times and discuss the answers for the students to feel comfortable attempting these types of questions.
##      I have the same issues with initial questions. This is where I try to inject motivating questions to explore where my students' motivation may lie. Eventually I hope to get them asking their own questions and not get overwhelmed by the open-endness of the process.
##      I think that AP students would be up to this task and it is a topic that would interest them. I think that the questions are thought provoking and would facilitate good discussions in the classroom. It would be a good extension to ask the students what other information might improve this task to see if they would come up with age and gender as pertinent information.
##      I agree.  I think that the final question was almost more confusing (and multi-layered) then the initial questions.
##      I agree Eric in that I like this activity the best. It gives guidance and helps students learn how to conduct an experiment. Plus  it gets the entire class involved. And  as you said  it can be expanded a little to hit all 4 phases of the statistical process  and it can be transferred to other activities. I think it has a lot of potential!
##      I agree on the two different approaches  and both (I think) are very valid. I  struggle with how to intertwine the two without losing the students. They seem to want to compartmentalize the two instead of looking at them as two parts of a whole  though that could be more due to my lack of experience.   I also would like to think that many of my students would do well. I could see them not doing well because of the differences in question presentation  though. I don't give tests in the traditional sense  so this question format might be a little foreign.
##      I feel like my ap students are good students who will learn to answer those questions as well. I think they are self motivated and want to do well. After a semester I do think they will feel good about answer those questions
##      I really like the war game activity. It is a game that your student are probably going to know how to play already  which should help you when explaining. Also it can really get your students up and going and you will see the competitiveness really kick in with your students. I also like the discussion part you have with it as well because in reality who really knows which deck of cards is going to be the best to pick from  so this makes a great topic for discussion and it would be interesting to see what all of your students say. Overall I would have to say you did a great job with your project here. The only way I can see for adjustments though would be after you do your activities the first time because you really don't know how things are going to work out until you actually try them yourself.

It looks like my tea leaf reading was partially correct for Topic 16, though the results seem to be about a specific “Pepsi challenge” activity to conduct with students.

Finally, let’s look at posts from Topic 3, which we thought might be an overarching theme about teaching statistics:

## 
##  Topic 3: 
##       I am a non native English speaker as well  taking the test was a little intimedating  I hope to gradually gain a better feel of terminology and content
##      The area that I feel I need the most improvement is asking better questions.  Not only do I need to pose better questions  but I need to help my students develop this skill as well.  I feel more confident that I will be able to do so or at least I am willing to try!
##      My confidence has increased slightly and my ability to find useful resources has increased a lot.  So that should translate into a classroom atmosphere where I am even more comfortable pushing students and when they ask things I cannot immediately answer  being able to say  I will get back to you tomorrow with a better answer or question! "
##      I am sure this applies to every area of math  but I find that my students with disabilities really struggle with mathematical reasoning.  They also struggle with math vocabulary.  Most could match a definition to a term  but they do not have understanding of the term when needing to make a mathematical statement to go with it.
##      Hello Katherine. My students are in the upper level Math courses  however  I found this strategy very helpful for my Math vocabulary. This might help in the Math vocabulary development for ELD students. I ask my students to just focus on one Math word  or one Math concept they learned during the week's lessons. Then I ask them to make a Math Graffiti out of the word or concept  a drawing that would subtly depict what they learned. I emphasize that I do not just want a drawing  not a formula solution  not an essay  rather I want a subtle drawing of what they learned using the word or concept. Then at the back of the Math Graffiti  I'd ask them to describe the word or concept. It really helped in how they learned concepts  at the same time asking them to verbalize  in writing  whatever they have learned. Here several modalities of learning had been used.
##      I agree that teaching students how to decode the language used in questions is one of the most important ways we can help them prepare for tests.  I think that too often we focus on vocabulary as individual mathematics and statistics words  and think that if students understand the individual words they will be able to understand the test questions.  In fact  understanding test questions is often about interpreting the little words like to or if or not  as well as the statistics words. "
##      It is always easier to teach a topic when you have had more training on the area and I believe this has made me feel way more confidant in this area!
##      I learned that I need to improve on my understanding of the basic concepts in order to teach students. I need a clear understanding of variability and other key vocabulary terms that left me guessing on the answer. "
##      My students are not at all prepared to answer the questions in the investigation.  They are all sophomores in high school and haven't had much interaction with statistics.  I would definitely need to understand the concepts in order to teach them the concepts.  I would need to feel completely confident in teaching the material beforehand.
##      I also agree that the wording in statistics questions is often difficult to understand  but like learning the specific language of any subject  the more time we and our students spend studying statistics and working statistics problems  the better prepared we will be to overcome those misconceptions.

Looking at just the 10 posts returned, perhaps a better name for this topic would be Course Reflections on Teaching Statistics.

Unit Takeaway

In addition to some useful R packages and functions for the actual process of topic modeling, there are two main lessons I’m hoping you take away from this walkthrough:

  1. Topic modeling requires a lot of decisions. Beyond deciding on a value for K, there are a number of key decisions that you have to make that can dramatically affect your results. For example, to stem or not to stem? What qualifies as a document? What flavor of topic modeling is best suited to your data and research questions? How many iterations to run?
  2. Topic modeling is as much art as (data) science. As Bail (2018) noted, the term “topic” is somewhat ambitious, and topic models do not produce highly nuanced classification of texts. Once you’ve fit your model, interpreting it requires some mental gymnastics and ideally some knowledge of the context from which the data came to help with interpretation of your topics. Moreover, the quantitative approaches for making the decisions highlighted above are imperfect, and a good deal of human judgment is required.

✅ Comprehension Check

Using the STM model you fit from the Section 3 [Comprehension Check] with a different value for K, use the approaches demonstrated in Section 4 to explore and interpret your topics and terms and revisit the following question:

  1. Now that you have a little more context, how might you revise your initial interpretation of some of the latent topics or latent themes from your model?
## 
##  Topic 10: 
##       In the past I have used tinkerplots to allow students to  explore various outcomes associated with rolling two dice. They start by playing the Two-Dice Elimination game  and then simulate rolling two dice many times using TinkerPlots to determine whether a step model or a triangle model of the distribution of sums is correct. Finally  they use the triangle model to calculate the probability of rolling each sum. Students really liked this activity and really grasped the concept.
##      Is there an underlying assumption here that the two teams are equally likely to win the game? We are assuming that the octopus is equally likely to swim to either flag. But if one team is heavily favored (for example  if one of the teams has a 90% chance of winning)  and the octopus has picked the long shot team  then wouldn't his likelihood of picking the correct team be less? Or is this accounted for with the fact that the octopus is equally likely to swim to either flag  and that he was equally likely to choose the heavily favored team?
##      I also agree with the fact that the students were able to see more clearly by doing the different trials with the different amount of trials in each.  It really did help the students to see how things can be fair or unfair by doing all the trials.
##      I agree with you that I liked how the students could run a simulation of 3000 trials.  On the graphing calculator that we use (TI-84) there is a simulation program also that involves dice and coins  but we are not able to go up to 3000 trials!  We have done something similar to this using the calculator  but it would be nice to try something new with software like what was used in the video.
##      I  personally  enjoy the video simulations.  I find them more engaging and even my children are watching the video at home  )  Adding the technology aspect to the dice roll eased  created prediction  and helped to solve a pattern formation.  The student were talking to help the metacognition of thinking through their thoughts  but the computer quickly revealed dice rolls at various numbers including 3 000.  Trying to roll the dice that many times would take forever.  With the bar graph and pie graph showing the outcome  it helped students determine if the dice were bias or not.
##      Howard   There are many different software options for creating animations.  Most will allow you to use real recorded voices or a computer generated voice  but not all.  We have tried a few  the ones that you see in the course were created using goanimate.com  GoAnimate and "www.powtoon.com  Powtoon. "
##      Once students are able to manipulate the simulation software  they start to make their own inferences about the data.  Students can visually see what happens to the data when they roll the dice more than one time  or more than 50  or even more than 100.  I thought that it was interested to see the pie graph as well.  The percentages looked like they were so far apart when using the bar graph  and then in the pie graph they looked closer.  I think this was great for the students to see as well.
##      From the videos  I saw that the students we trying to level out the dice rolls. Both partners were discussing how it was it was a fair dice. The computer helped them by having the students critically think about their next move. The first pair of boys kept rolling until they got to 3 000.Then the girls rolled until they got to 36. They used and algorithm to find a good stopping point. Even with the AP Stats students  they were able to show their work to prove if the dice was fair or unfair. Eventually  at some point the students thought that the outcomes would even out.
##      It was great to see the activity being used at much different levels  but expecting similar outcomes.  The use of the various technologies really makes it accessible for all learners.
##      I liked that they had the  tool to try the simulations with increasing number of trials.  (It would be nice if my students had similar  tools.)  I wonder why one group did not  explore using larger number of trials.   The first group was able to see the difference and the second buy became  convinced after seeing enough trails.

This topic has to do with different platforms and tools used to conduct experiments (seemingly to better understand statistics). Many teachers seemed to be sharing individual tools and commenting on how they liked the learning experiences that the technology seemed to evoke from students (e.g. “metacognition”).

---
title: "Topic Modeling"
author: "Catherine Noonan"
date: "2/18/2022"
output: 
  html_document:
    toc: true
    toc_depth: 3
    toc_float: yes
    code_folding: hide
    code_download: TRUE
editor_options: 
  markdown: 
    wrap: 72
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
```


## 1. PREPARE

To help us better understand the context, questions, and data sources
we'll be using in Unit 3, this section will focus on the following
topics:

a.  **Context**. As context for our analysis this week, we'll review
    several related papers by my colleagues relevant to our analysis of
    MOOC-Ed discussion forums.
b.  **Questions.** We'll also examine what insight topic modeling can
    provide to a question that we asked participants to answer in their
    professional learning teams (PLTs).
c.  **Project Setup.** This should be very familiar by now, but we'll
    set up a new R project and install and load the required packages
    for the topic modeling walkthrough.

### 1a. Context

#### Participating in a MOOC and Professional Learning Team: How a Blended Approach to Professional Development Makes a Difference



**Abstract**

Massive Open Online Courses for Educators (MOOC-Eds) provide
opportunities for using research-based learning and teaching practices,
along with new technological tools and facilitation approaches for
delivering quality online professional development. The Teaching
Statistics Through Data Investigations MOOC-Ed was built for preparing
teachers in pedagogy for teaching statistics, and it has been offered to
participants from around the world. During 2016-2017, professional
learning teams (PLTs) were formed from a subset of MOOC-Ed participants.
These teams met several times to share and discuss their learning and
experiences. This study focused on examining the ways that a blended
approach to professional development may result in similar or different
patterns of engagement to those who only participate in a large-scale
online course. Results show the benefits of a blended learning
environment for retention, engagement with course materials, and
connectedness within the online community of learners in an online
professional development on teaching statistics. The findings suggest
the use of self-forming autonomous PLTs for supporting a deeper and more
comprehensive experience with self-directed online professional
developments such as MOOCs. Other online professional development
courses, such as MOOCs, may benefit from purposely suggesting and
advertising, and perhaps facilitating, the formation of small
face-to-face or virtual PLTs who commit to engage in learning together.

**Data Source & Analysis**

All peer interaction, including peer discussion, takes place within
discussion forums of MOOC-Eds, which are hosted using the Moodle
Learning Management System. To build the dataset you'll be using for
this walkthrough, the research team wrote a query for Moodle's MySQL
database, which records participants' user-logs of activity in the
online forums. This SQL query combines separate database tables
containing postings and comments including participant IDs, timestamps,
discussion text and other attributes or "metadata."



**Summary of Key Findings**

The following highlight some key findings related to the discussion
forums in the papers cited above:

1.  MOOCs designed specifically for K-12 teachers can provide positive
    self-directed learning experiences and rich engagement in discussion
    forums that help form online communities for educators.
2.  Analysis of discussion forum data in TSDI provided a very clear
    picture of how enthusiastic many PLT members and leaders were to
    talk to others in the online community. They posed their questions
    and shared ideas with others about teaching statistics throughout
    the units, even though they were also meeting synchronously several
    times with their colleagues in small group PLTs.
3.  Findings on knowledge construction demonstrated that over half of
    the discussions in both courses moved beyond sharing information and
    statements of agreement and entered a process of dissonance,
    negotiation and co-construction of knowledge, but seldom moved
    beyond this phase in which new knowledge was tested or applied.
    These findings echo similar research on difficulties in promoting
    knowledge construction in online settings.
4.  Topic modeling provides more interpretable and cohesive models for
    discussion forums than other popular unsupervised modeling
    techniques such as k-means and k-medoids clustering algorithms.

### 1b. Guiding Questions

For the paper, [*Participating in a MOOC and Professional Learning Team:
How a Blended Approach to Professional Development Makes a
Difference*](https://www.learntechlib.org/p/195234/), the researchers
were interested in unpacking how participants who enrolled in the
Teaching Statistics through Data Investigations MOOC-Ed might benefit
from also being in a smaller group of professionals committed to
engaging in the same professional development. The specific research
question for this paper was:

> What are the similarities and differences between how PLT members and
> Non-PLT online participants engage and meet course goals in a MOOC-Ed
> designed for educators in secondary and collegiate settings?

Dr. Hollylynne Lee and the TSDI team also developed a facilitation guide
designed specifically for PLT teams to help groups synthesize the ideas
in the course and make plans for how to implement new strategies in
their classroom in order to impact students' learning of statistics. One
question PLT members were asked to address was:

> What ideas or issues emerged in the discussion forums this past week?

For this walkthrough, we will further examine that question through the
use of topic modeling.

And just to reiterate yet again from Unit 1, one overarching question
we'll explore throughout this course, and that Silge and Robinson (2018)
identify as a central question to text mining and natural language
processing, is:

> How do we **quantify** what a document or collection of documents
> is about?

### 1c. Set Up

As highlighted in [Chapter 6 of Data Science in Education Using
R](https://datascienceineducation.com/c06.html) (DSIEUR), one of the
first steps of every workflow should be to set up a "Project" within
RStudio. This will be your "home" for any files and code used or created
in Unit 3.

You are welcome to continue using the same project created for Unit 1,
or create an entirely new project for Unit 3. However, after you've
created your project, open up a new R script and load the following
packages that we'll be needing for this walkthrough:

```{r, eval=FALSE}
# Optional fallback: if installing stm from CRAN (see the next chunk) fails,
# try the development version from GitHub instead.
# This chunk caused problems when knitting, so it is not evaluated (eval=FALSE).
if (!require(devtools)) install.packages("devtools")

library(devtools)
install_github("bstewart/stm", dependencies = TRUE)
```

```{r, eval=FALSE}
#installing and loading required packages
install.packages("tidyverse")
install.packages("tidytext")
install.packages("SnowballC")
install.packages("topicmodels")
install.packages("ldatuning")
install.packages("knitr")
install.packages("LDAvis")
install.packages("stm")
```


```{r load-packages, message=FALSE}
library(tidyverse)
library(tidytext)
library(SnowballC)
library(topicmodels)
library(stm)
library(ldatuning)
library(knitr)
library(LDAvis)
library(devtools)
```


## 2. WRANGLE

### 2a. Import Forum Data


```{r read-csv}
ts_forum_data <- read_csv("data/ts_forum_data.csv", 
     col_types = cols(course_id = col_character(),
                   forum_id = col_character(), 
                   discussion_id = col_character(), 
                   post_id = col_character()
                   )
    )
```

By default, many of the columns like `course_id` and `forum_id` are read
in as numeric data. For our purposes, we plan to treat them as unique
identifiers or names for our courses, forums, discussions, and posts.
The `read_csv()` function has a handy `col_types =` argument for
changing the column types from numeric to character.
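
As a small illustration of what `col_types =` does, the sketch below reads the same made-up CSV text twice, once with guessed types and once with the id columns forced to character. The column names mirror ours, but the values are invented, and reading literal text with `I()` assumes readr 2.0 or later.

```r
library(readr)

# Made-up CSV text standing in for a file on disk (values are hypothetical)
csv_text <- "course_id,post_id,post_content\n101,7,Great unit\n101,8,I agree"

# Default behavior: readr guesses that the id columns are numeric
guessed <- read_csv(I(csv_text), show_col_types = FALSE)

# Explicit column types: treat the ids as character labels instead
as_ids <- read_csv(
  I(csv_text),
  col_types = cols(
    course_id = col_character(),
    post_id = col_character()
  )
)

class(guessed$course_id)
class(as_ids$course_id)
```

Columns not named in `cols()` are still guessed, which is why `post_content` needs no entry here.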

### 2b. Cast a Document Term Matrix

In this section we'll revisit some familiar `tidytext` functions used in
Units 1 & 2 for tidying and tokenizing text and introduce some new
functions from the `stm` package for processing text and transforming
our data frames into new data structures required for topic modeling.

#### Functions Used

**`tidytext` functions**

-   `unnest_tokens()` splits a column into tokens
-   `cast_dtm()` takes a tidied data frame and "casts" it into a
    document-term matrix (dtm)

**`dplyr`** **functions**

-   `anti_join()` returns all rows from x without a match in y and is
    used to remove stop words from our data
-   `count()` lets you quickly count the unique values of one or more
    variables
-   `group_by()` takes a data frame and one or more variables to group
    by
-   `summarise()` creates a summary of data using arguments like sum and
    mean

**`stm` functions**

-   `textProcessor()` takes in a vector or column of raw texts and
    performs text processing like removing punctuation and word
    stemming.
-   `prepDocuments()` performs several corpus manipulations including
    removing words and renumbering word indices

#### Tidying Text

Prior to topic modeling, we have a few remaining steps to tidy our text
that hopefully should feel familiar by this point. If you recall from
[Chapter 1 of Text Mining With
R](https://www.tidytextmining.com/tidytext.html), these preprocessing
steps include:

1.  Transforming our text into "tokens"
2.  Removing unnecessary characters, punctuation, and whitespace
3.  Converting all text to lowercase
4.  Removing stop words such as "the", "of", and "to"

Let's tokenize our forum text using the familiar `unnest_tokens()`
function and remove stop words as usual:

```{r tokenize-forums}
forums_tidy <- ts_forum_data %>%
  unnest_tokens(output = word, input = post_content) %>%
  anti_join(stop_words, by = "word")

forums_tidy
```
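
To see what these steps do on a small scale, here is a minimal sketch using two invented posts (the `toy_posts` data frame is hypothetical and not part of our course data). It shows that `unnest_tokens()` lowercases the text and strips punctuation, and that `anti_join()` then drops the stop words.

```r
library(dplyr)
library(tidytext)

# Two invented posts, standing in for rows of ts_forum_data
toy_posts <- tibble(
  post_id = c("1", "2"),
  post_content = c("The students LOVED the data!", "I agree, time is limited.")
)

toy_tidy <- toy_posts %>%
  # one row per word; text is lowercased and punctuation is removed
  unnest_tokens(output = word, input = post_content) %>%
  # drop any word that appears in the stop_words lexicon
  anti_join(stop_words, by = "word")

toy_tidy
```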

Now let's do a quick word count to see some of the most common words
used throughout the forums. This should give us a sense of what we're
working with, and later we'll need these word counts for creating our
document term matrix for topic modeling:

```{r count-words}
forums_tidy %>%
  count(word, sort = TRUE)
```

Terms like "students," "data," and "class" are about what we would have
expected from a course teaching statistics. The terms "agree" and
"time," however, are not so intuitive and are worth a quick look as well.

##### ✅ Comprehension Check

Use the `filter()` and `grepl()` functions introduced in [Unit 1,
Section
3b](http://shiyan2030.com/unit-1-walkthrough.html#3b_Word_Search) to
filter for rows in our `ts_forum_data` data frame that contain the terms
"agree" and "time" and another term or terms of your choosing. Select a
random sample of 10 posts using the `sample_n()` function for your
terms, then answer the following questions:

```{r}
# Comprehension check: filter for rows in our `ts_forum_data` data frame that
# contain the terms "agree" and "time" and another term or terms of your
# choosing.

# Token-level view: rows of the tidied data where the word is "time" or "agree"
ts_forum_timeagree <- forums_tidy %>%
  filter(word %in% c("time", "agree"))

# Post-level view: posts whose text contains "agree" or "time"
forum_quotes <- ts_forum_data %>%
  select(post_content) %>%
  filter(grepl("agree|time", post_content))

# Select a random sample of 10 of the matching posts
sample_n(forum_quotes, 10)
```

1.  What, if anything, do these posts have in common?

    -   Similar forum topics: either "discuss with your colleagues or
        investigate"

2.  What topics or themes might be apparent, or do you anticipate
    emerging, from our topic modeling?

    -   Just guessing here, but: technology and application to the
        classroom, including time limitations

Your output should look something like this:

```{r find-quotes, echo=FALSE}
forum_quotes <- ts_forum_data %>%
  select(post_content) %>% 
  filter(grepl('time', post_content))

sample_n(forum_quotes,10)
```
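
One caveat with pattern searches like this: `grepl()` matches substrings, so a search for "time" also flags posts that only contain "sometimes". The sketch below, using three invented posts, contrasts a plain substring match with a whole-word match built from `\\b` word boundaries.

```r
# Three invented posts (hypothetical, not from the course data)
posts <- c("We ran out of time today.",
           "Sometimes students guess.",
           "I agree completely.")

# Plain substring match: also flags "Sometimes"
substring_match <- grepl("time", posts)
substring_match
# → TRUE TRUE FALSE

# Whole-word match, ignoring case: only the first post qualifies
whole_word_match <- grepl("\\btime\\b", posts, ignore.case = TRUE, perl = TRUE)
whole_word_match
# → TRUE FALSE FALSE
```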

#### Creating a Document Term Matrix


For now, however, let's treat each individual post as a unique
"document." As noted above, to create our document term matrix, we'll
need to first `count()` how many times each `word` occurs in each
document, or `post_id` in our case, and create a matrix that contains
one row per post, as our original data frame did, but now contains a
column for each `word` in the entire corpus and a value of `n` for how
many times that word occurs in each post.

To create this document term matrix from our post counts, we'll use the
`cast_dtm()` function like so and assign it to the variable
`forums_dtm`:

```{r cast-dtm}
forums_dtm <- forums_tidy %>%
  count(post_id, word) %>%
  cast_dtm(post_id, word, n)
```
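
To make the shape of a document term matrix concrete, here is a toy sketch with invented counts for two posts and three words. It assumes the `tm` package is installed, which `cast_dtm()` relies on. Converting the result to an ordinary matrix shows one row per post and one column per word, and the share of zero cells is what the printed summary reports as sparsity.

```r
library(dplyr)
library(tidytext)

# Invented per-post word counts (hypothetical data)
toy_counts <- tibble(
  post_id = c("p1", "p1", "p2", "p2"),
  word    = c("data", "students", "data", "time"),
  n       = c(2, 1, 1, 1)
)

toy_dtm <- toy_counts %>%
  cast_dtm(post_id, word, n)

# View the sparse object as a regular matrix: rows are posts, columns are words
toy_matrix <- as.matrix(toy_dtm)
toy_matrix

# Share of cells equal to zero (2 of the 6 cells here)
toy_sparsity <- mean(toy_matrix == 0)
toy_sparsity
```

With thousands of posts and terms, most words never occur in most posts, which is why real document term matrices are almost entirely zeros.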

##### ✅ Comprehension Check

Take a look at our `forums_dtm` object in the console and answer the
following questions:

1.  What "class" of object is `forums_dtm`?

    -   A `DocumentTermMatrix`, which is stored as a simple triplet
        matrix (`simple_triplet_matrix`)

2.  How many unique documents and terms are included in our matrix?

    -   documents: 5761, terms: 13666

3.  Why might there be fewer documents/posts than were in our original
    data frame?

    -   Most likely because posts made up entirely of stop words have no
        rows left after `anti_join()`, so they never reach the matrix

4.  What exactly is meant by "sparsity"?

    -   The share of cells in the matrix that are zero, i.e.
        document-term pairs where the term never occurs in the document

```{r class-dtm, echo=FALSE}
class(forums_dtm)

forums_dtm
```

### 2c. To Stem or not to Stem?

#### Processing and Stemming for STM

Like `unnest_tokens()`, the `textProcessor()` function includes several
useful arguments for processing text, like converting text to lowercase
and removing punctuation and numbers. I've included several of these in
the script below; apart from `striphtml`, which I've switched on, each is
set to the default it would take if you did not specify it explicitly.
Most of these are pretty intuitive and you can learn more by viewing the
`?textProcessor` documentation.

Let's go ahead and process our discussion forum `post_content` in
preparation for structural topic modeling:

```{r textProcessor}
temp <- textProcessor(ts_forum_data$post_content, 
                    metadata = ts_forum_data,  
                    lowercase=TRUE, 
                    removestopwords=TRUE, 
                    removenumbers=TRUE,  
                    removepunctuation=TRUE, 
                    wordLengths=c(3,Inf),
                    stem=TRUE,
                    onlycharacter= FALSE, 
                    striphtml=TRUE, 
                    customstopwords=NULL)
```

Note that the first argument the `textProcessor()` function expects is
the column in our data frame that contains the text to be processed. The
second argument, `metadata =`, expects the data frame that contains the
text of interest and uses the column names to label the metadata such as
course ids and forum names. This metadata can be used to improve the
assignment of words to topics in a corpus and to examine the
relationship between topics and various covariates of interest.

Unlike the `unnest_tokens()` function, the output is not a nice tidy
data frame. Topic modeling using the `stm` package requires a set of
inputs that are specific to the package. The following code pulls the
elements from the `temp` list that will be required for the `stm()`
function we'll use in Section 3:

```{r stm-inputs}
meta <- temp$meta
vocab <- temp$vocab
docs <- temp$documents
```
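
To demystify the `documents` element, the sketch below builds a tiny example of the native `stm` format by hand in base R. The vocabulary and documents are invented; the point is only that each document is a two-row integer matrix, with indices into `vocab` in the first row and the corresponding counts in the second.

```r
# Hypothetical vocabulary and two tiny "documents" in stm's native format
toy_vocab <- c("data", "student", "teach")

toy_docs <- list(
  # document 1: "data" (index 1) twice, "student" (index 2) once
  rbind(c(1L, 2L), c(2L, 1L)),
  # document 2: "teach" (index 3) three times
  rbind(3L, 3L)
)

# Translate each document back into readable word counts
toy_readable <- lapply(toy_docs, function(doc) {
  setNames(doc[2, ], toy_vocab[doc[1, ]])
})

toy_readable
```

Running `str(docs[[1]])` on our own processed data should show the same two-row structure.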

#### Stemming Tidy Text

Notice that the `textProcessor()` `stem` argument we used above is set
to `TRUE` by default. We haven't introduced word stemming until this
point because there is some debate about the actual value of this
process. Collapsing words like "students" and "student" into their base
word might make sense and can actually make analyses and visualizations
more concise and easier to interpret. [Hvitfeldt and Silge
(2021)](https://smltar.com/stemming.html) note, however, that words like
the following have dramatic differences in meaning, semantics, and use,
and that collapsing them could result in poor models or misinterpreted
results:

-   meaning and mean
-   likely, like, liking
-   university and universe

The first word pair is particularly relevant to discussion posts from
our Teaching Statistics course data. In addition, collapsing words like
"teachers" and "teaching" could dramatically alter the results from a
topic model.

For now, we will leave as is the `forums_dtm` we created earlier with
words unstemmed, but what if we wanted to stem words in a "tidy" way?

Since the `unnest_tokens()` function does not (intentionally I believe)
include a stemming function, one approach would be to use the
`wordStem()` function from the `SnowballC` package to either replace our
`word` column with word stems or create a new variable called `stem`
with our stemmed words. Let's do the latter and take a look at the
original words and the stems that were produced:

```{r wordStem}
stemmed_forums <- ts_forum_data %>%
  unnest_tokens(output = word, input = post_content) %>%
  anti_join(stop_words, by = "word") %>%
  mutate(stem = wordStem(word))

stemmed_forums
```

You can see that words like "activity" and "activities" that occur
frequently in our discussions have been reduced to the word stem
"activ". If you are interested in learning other approaches for word
stemming in R, as well as reading a more in-depth description of the
stemming process, I highly recommend [Chapter 4
Stemming](https://smltar.com/stemming.html) from the Hvitfeldt and Silge
(2021) book, *Supervised Machine Learning for Text Analysis in R*.
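
As a quick way to see both the benefit and the risk of stemming, the sketch below runs `wordStem()` on a handful of hand-picked words. The word list is my own selection for illustration. Each pair collapses to a shared stem, which is helpful for "student" and "students" but hides a real difference in meaning for "mean" and "meaning".

```r
library(SnowballC)

# Hand-picked word pairs (an illustrative selection, not from the corpus)
words <- c("student", "students",
           "activity", "activities",
           "mean", "meaning")

stems <- wordStem(words)

# Side-by-side view of each word and its stem
data.frame(word = words, stem = stems)
```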

##### ✅ Comprehension Check

Complete the following code using what we learned in the section on
[Creating a Document Term Matrix] and answer the following questions:

1.  How many fewer terms are in our stemmed document term matrix?

    -   documents: 5761, terms: 10060 (reduced by 3,606)

2.  Did stemming words significantly reduce the sparsity of the matrix?

    -   No. The reported sparsity is still 100%, so it stayed the same.

**Hint:** Make sure your code includes stem counts rather than word
counts.

```{r stem-practice, eval=FALSE}
stemmed_dtm <- ts_forum_data %>%
  unnest_tokens(output = word, input = post_content) %>%
  anti_join(stop_words, by = "word") %>%
  mutate(stem = wordStem(word)) %>%
  count(post_id, stem) %>%
  cast_dtm(post_id, stem, n)
  
stemmed_dtm
```

```{r stem-counts, echo=FALSE, message=FALSE}
stemmed_dtm <- ts_forum_data %>%
  unnest_tokens(output = word, input = post_content) %>%
  anti_join(stop_words, by = "word") %>%
  mutate(stem = wordStem(word)) %>%
  count(post_id, stem, sort = TRUE) %>%
  cast_dtm(post_id, stem, n)

stemmed_dtm
forums_dtm

stem_counts <- ts_forum_data %>%
  unnest_tokens(output = word, input = post_content) %>%
  anti_join(stop_words, by = "word") %>%
  mutate(stem = wordStem(word)) %>%
  count(stem, sort = TRUE)

stem_counts
```

## 3. MODEL

This unit provides our first opportunity for modeling a text as data. In
very simple terms, modeling involves developing a mathematical summary
of a dataset. These summaries can help us further explore trends and
patterns in our data.


### 3a. Fitting a Topic Model with LDA

Before running our first topic model using the `LDA()` function, let's
quickly recap from our readings some basic principles behind Latent
Dirichlet allocation and why LDA is often preferred over other automatic
classification or clustering approaches.

Unlike simple forms of cluster analysis such as k-means clustering, LDA
is a **"mixture" model**, which in our context means that:

1.  **Every [document]{.ul} contains a mixture of topics.** Unlike
    algorithms like k-means, LDA treats each document as a mixture of
    topics, which allows documents to "overlap" each other in terms of
    content, rather than being separated into discrete groups. So in
    practice, this means that a discussion forum post could have an
    estimated topic proportion of 70% for Topic 1 (i.e. be mostly about
    Topic 1), but also be partly about Topic 2.
2.  **Every [topic]{.ul} contains a mixture of words.** For example, if
    we specified in our LDA model just 2 topics for our discussion
    posts, we might find that one topic seems to be about pedagogy while
    another is about learning. The most common words in the pedagogy
    topic might be "teacher", "strategies", and "instruction", while the
    learning topic may be made up of words like "understanding" and
    "students". However, words can be shared between topics and words
    like "statistics" or "assessment" might appear in both equally.
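
The two principles above can be sketched with a little matrix arithmetic. All of the numbers below are invented for illustration: `beta` holds hypothetical per-topic word probabilities, and `gamma` holds one post's hypothetical topic proportions. Multiplying them gives the word distribution the model would expect for that post.

```r
# Hypothetical per-topic word probabilities (beta): each row sums to 1
beta <- rbind(
  pedagogy = c(teacher = 0.5, instruction = 0.3, students = 0.1, statistics = 0.1),
  learning = c(teacher = 0.1, instruction = 0.1, students = 0.7, statistics = 0.1)
)

# Hypothetical topic proportions (gamma) for one post: 70% pedagogy, 30% learning
gamma <- c(pedagogy = 0.7, learning = 0.3)

# Expected word distribution for the post: a weighted blend of the two topics
post_word_probs <- gamma %*% beta
post_word_probs

# The blended probabilities still sum to 1
sum(post_word_probs)
```

Note that "statistics" has the same probability in both topics here, which is the sense in which a word can be shared between topics.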

Similar to k-means and other simple clustering approaches, however, LDA
does require us to specify a value of *k* for the number of topics in
our corpus. Selecting *k* is no trivial matter and can greatly impact
your results.

Since we don't have a strong rationale about the number of topics that
might exist in discussion forums, let's use the `n_distinct()` function
from the `dplyr` package to find the number of unique forum names in our
course data and run with that:

```{r n-distinct}
n_distinct(ts_forum_data$forum_name)
```

Since it looks like there are 20 distinct discussion forums, we'll use
that as our value for the `k =` argument of the `LDA()` function. Be
patient while this runs, since the default setting is to perform a large
number of iterations.

```{r LDA}
forums_lda <- LDA(forums_dtm, 
                  k = 20, 
                  control = list(seed = 588)
                  )

forums_lda
```

Note that we used the `control =` argument to pass a random number
(`588`) to seed the assignment of topics to each word in our corpus.
Since LDA is a [stochastic
algorithm](https://machinelearningmastery.com/stochastic-in-machine-learning/)
that could have different results depending on where the algorithm
starts, we specified a `seed` for reproducibility, so that we all see
the same results every time we specify the same number of topics.

And tying back to our work in Unit 1: although Bail (2020) notes that
topic assignments for each word are updated in an iterative fashion, LDA
does not rely on the Term Frequency-Inverse Document Frequency (TF-IDF)
metric to assign probabilities. It estimates them from the raw term
counts in our document term matrix.

### 3b. Fitting a Structural Topic Model

Bail notes that LDA, while perhaps the most common approach to topic
modeling, is just one of many different types, including Dynamic Topic
Models, Correlated Topic Models, Hierarchical Topic Models, and more
recently, Structural Topic Modeling (STM). He argues that one reason STM
has risen in popularity and use is that it employs metadata about
documents to improve the assignment of words to topics in a corpus, and
that this metadata can be used to examine relationships between
covariates and topics.

Also, since Julia Silge has indicated that STM is "my current favorite
implementation of topic modeling in R" and has built support into the
`tidytext` package for structural topic models, this package is
definitely worth discussing in this walkthrough. I also highly
recommend her own walkthrough of the `stm` package, [The game is afoot!
Topic modeling of Sherlock Holmes
stories](https://juliasilge.com/blog/sherlock-holmes-stm/), as well as
her follow up post, [Training, evaluating, and interpreting topic
models](https://juliasilge.com/blog/evaluating-stm/).

#### The `stm` Package

As we've seen above, the `textProcessor()` function produced an unusual
`temp` output that is unique to the `stm` package. And as you've
probably already guessed, the `stm()` function for fitting a structural
topic model does not take a standard document term matrix like the
`LDA()` function does.

Before we fit our model, we'll have to extract the elements from the
`temp` object created after we processed our text. Specifically, the
`stm()` function expects the following arguments:

-   `documents =` the document term matrix to be modeled in the native
    stm format
-   `data =` an optional data frame containing meta data for the
    prevalence and/or content covariates to include in the model
-   `vocab =` a character vector specifying the words in the corpus in
    the order of the vocab indices in documents.

Let's go ahead and extract these elements:

```{r stm-docs}
docs <- temp$documents 
meta <- temp$meta 
vocab <- temp$vocab 
```

And now let's use these elements to fit the model using the same number
of topics for *K* that we specified for our LDA topic model. Let's also
take advantage of the fact that we can include the `course_id` and
`forum_id` covariates in the `prevalence =` argument to help improve, in
theory, our model fit:

```{r stm}
forums_stm <- stm(documents=docs, 
         data=meta,
         vocab=vocab, 
         prevalence =~ course_id + forum_id,
         K=20,
         max.em.its=25,
         verbose = FALSE)

forums_stm
```

As noted earlier, the `stm` package has a number of handy features. One
of these is the `plot.STM()` function for viewing the most probable
words assigned to each topic.

By default, it only shows the first 3 terms so let's change that to 5 to
help with interpretation:

```{r plot-stm}
plot.STM(forums_stm, n = 5)
```

Note that you can also just use `plot()`:

```{r plot}
plot(forums_stm, n = 5)
```

##### ✅ Comprehension Check

Fit a model for both LDA and STM using different values for K and answer
the following questions:

1.  What topics appear to be similar to those using 20 topics for K?
```{r}
forums_stm_fourteen <- stm(documents=docs, 
                  data=meta,
                  vocab=vocab, 
                  prevalence =~ course_id + forum_id,
                  K=14,
                  max.em.its=25,
                  verbose = FALSE)

forums_stm_fourteen
plot(forums_stm_fourteen, n=5)

# Interestingly, these topics sort out differently. There are still references
# to statistics, learning and technology use, questions, and resources. However,
# there seem to be multiple separate topics (3, 13, 11) which all center around
# statistics. It might be interesting to reduce the topics to 12.

forums_stm_twelve <- stm(documents=docs, 
                           data=meta,
                           vocab=vocab, 
                           prevalence =~ course_id + forum_id,
                           K=12,
                           max.em.its=25,
                           verbose = FALSE)

forums_stm_twelve
plot(forums_stm_twelve, n=5)

# Still multiple topics that include terms related to statistics. I guess this
# is an important theme in the responses.
```


2.  Knowing that you don't have as much context as the researchers of
    this study do, how might you interpret one of these latent topics or
    themes using the key terms assigned?

    -   Topic 10 seems to be about experiential learning via a project
        that required using technology.

3.  What topic emerged that seems dramatically different, and how might
    you interpret this topic?

    -   Topic 6 seems to be about the speed of a roller coaster (maybe a
        wooden versus a metal track?). Not knowing more, I think this may
        refer to a specific project.

### 3c. Finding *K*

As alluded to earlier, selecting the number of topics for your model is
a non-trivial decision and can dramatically impact your results. Bail
(2018) notes that

> *The results of topic models should not be over-interpreted unless the
> researcher has strong theoretical apriori about the number of topics
> in a given corpus, or if the researcher has carefully validated the
> results of a topic model using both the quantitative and qualitative
> techniques described above.*

There are several approaches to estimating a value for K and we'll take
a quick look at one from the `ldatuning` package and one from our `stm`
package.

#### The FindTopicsNumber Function

The `ldatuning` package has functions for both calculating and plotting
different metrics that can be used to estimate the most preferable
number of topics for an LDA model. It also conveniently takes the
standard document term matrix object that we created from our tidy text
and has the added benefit of running fairly quickly, especially compared
to the function for finding K from the `stm` package.

Let's use the defaults specified in the `?FindTopicsNumber`
documentation and modify them slightly to get metrics for a sequence of
topics from 10 to 75, counting by 5, and then plot the output we saved
using the `FindTopicsNumber_plot()` function:

```{r find-topic, eval=FALSE}
# This chunk did not run successfully in my environment, so it is left
# unevaluated (eval=FALSE). It fits one LDA model per candidate value of k.
k_metrics <- FindTopicsNumber(
  forums_dtm,
  topics = seq(10, 75, by = 5),
  metrics = "Griffiths2004",
  method = "Gibbs",
  control = list(),
  mc.cores = NA,
  return_models = FALSE,
  verbose = FALSE,
  libpath = NULL
)

FindTopicsNumber_plot(k_metrics)
```



#### The searchK() Function

Finally, Bail (2018) notes that the `stm` package has a useful function
called `searchK()` which allows us to specify a range of values for `k`
and outputs multiple goodness-of-fit measures that are "very useful in
identifying a range of values for `k` that provide the best fit for the
data."

The syntax of this function is very similar to the `stm()` function we
used above, except that we specify a range for `k` as one of the
arguments. In the code below, we search all values of `k` between 5 and
15.

```{r search-k, eval=FALSE}
# Not evaluated (eval=FALSE): you are not expected to run this code,
# as it takes far too long.
findingk <- searchK(docs,
                    vocab,
                    K = c(5:15),
                    data = meta,
                    verbose = FALSE)

plot(findingk)
```

Note that running the `searchK()` function on this corpus took all night
on a pretty powerful MacBook Pro and crashed once as well, so I do not
expect you to run this for the walkthrough. I ran a couple of iterations
and landed on a range between 5 and 15, with an optimal number of topics
somewhere around 14.

Given the somewhat conflicting results, and also somewhat selfishly and
for the sake of simplicity for this walkthrough, I'm just going to stick
with the rather arbitrary selection of 20 topics for the remainder of
this unit.

#### The LDAvis Explorer

One final tool that I want to introduce from the `stm` package is the
`toLDAvis()` function, which provides a great visualization for
exploring topic and word distributions using the `LDAvis` topic browser:

```{r LDAvis}
toLDAvis(mod = forums_stm, docs = docs)
```

Our current stm model of 20 topics is resulting in a lot of overlap
among topics, which suggests that 20 may not be an optimal number of
topics, as the other approaches for finding k also suggest.

## 4. EXPLORE

Silge and Robinson (2018) note that fitting a topic model is the "easy
part." The hard part is making sense of the model results, and the rest
of the analysis involves exploring and interpreting the model using a
variety of approaches, which we'll walk through in this section.

Bail (2018) cautions, however, that:

> *...post-hoc interpretation of topic models is rather dangerous... and
> can quickly come to resemble the process of "reading tea leaves," or
> finding meaning in patterns that are in fact quite arbitrary or even
> random.*

### 4a. Exploring Beta Values

Hidden within this `forums_lda` topic model object we created are
per-topic-per-word probabilities, called β ("beta"). It is the
probability of a term (word) belonging to a topic. 

Let's take a look at the 5 most likely terms assigned to each topic,
i.e. those with the largest β values using the `terms()` function from
the `topicmodels` package:

```{r terms}
terms(forums_lda, 5)
```

Even though we've somewhat arbitrarily selected the number of topics for
our corpus, some of these topics or themes are fairly intuitive to
interpret. For example:

-   Topic 11 (technology, students, software, program, excel) seems to
    be about students' use of technology, including software programs
    like Excel;

-   Topic 9 (questions, kids, love, gapminder, sharing) seems to be
    about the Gapminder activity from the MOOC-Ed and kids' enjoyment of
    it; and

-   Topic 18 (data, students, collect, real, sets) seems to be about
    student collection and use of real-world data sets.

Not surprisingly, the `tidytext` package has a handy function
conveniently named `tidy()` to convert our LDA model to a tidy data
frame containing these beta values for each term:

```{r tidy_lda}

tidy_lda <- tidy(forums_lda)

tidy_lda
```

Obviously, it's not very easy to interpret what the topics are about
from a data frame like this so let's borrow code again from [Chapter
8.4.3 Interpreting the topic
model](https://www.tidytextmining.com/nasa.html?q=beta#interpreting-the-topic-model)
in Text Mining with R to examine the top 5 terms for each topic and then
look at this information visually:

```{r top_terms}

top_terms <- tidy_lda %>%
  group_by(topic) %>%
  slice_max(beta, n = 5, with_ties = FALSE) %>%
  ungroup() %>%
  arrange(topic, -beta)

# reorder_within() and scale_y_reordered() sort terms by beta within each facet
top_terms %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(beta, term, fill = as.factor(topic))) +
  geom_col(show.legend = FALSE) +
  scale_y_reordered() +
  labs(title = "Top 5 terms in each LDA topic",
       x = expression(beta), y = NULL) +
  facet_wrap(~ topic, ncol = 4, scales = "free")
```

### 4b. Exploring Gamma Values

Now that we have a sense of the most common words associated with each
topic, let's take a look at the topic prevalence in our MOOC-Ed
discussion forum corpus, including the words that contribute to each
topic we examined above.

Also hidden within the `forums_lda` topic model object we created are
per-document-per-topic probabilities, called γ ("gamma"). The gamma
matrix provides the probability that each document is generated from
each topic. We can combine our beta and gamma values to understand the
topic prevalence in our corpus, and which words contribute to each
topic.

To do this, we're going to borrow some code from the Silge (2018) post,
[Training, evaluating, and interpreting topic
models](https://juliasilge.com/blog/evaluating-stm/).

First, let's create two tidy data frames for our beta and gamma values:

```{r beta_gamma}
td_beta <- tidy(forums_lda)

td_gamma <- tidy(forums_lda, matrix = "gamma")

td_beta
td_gamma

```
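
Before summarizing the real values, it may help to see how gamma translates into topic prevalence on a toy scale. The sketch below uses an invented gamma table for two posts and two topics. Each post's gamma values sum to 1, and averaging gamma across posts gives the expected proportion of the corpus devoted to each topic.

```r
library(dplyr)

# Invented gamma values in the format returned by tidy(..., matrix = "gamma")
toy_gamma <- tibble(
  document = c("p1", "p1", "p2", "p2"),
  topic    = c(1, 2, 1, 2),
  gamma    = c(0.7, 0.3, 0.2, 0.8)
)

# Each document's topic proportions sum to 1
toy_totals <- toy_gamma %>%
  group_by(document) %>%
  summarise(total = sum(gamma))

# Mean gamma per topic: the expected topic proportion across the corpus
toy_prevalence <- toy_gamma %>%
  group_by(topic) %>%
  summarise(gamma = mean(gamma)) %>%
  arrange(desc(gamma))

toy_prevalence
```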

Next, we'll adopt Julia's code wholesale to create a filtered data frame
of our `top_terms`, join this to a new data frame for `gamma_terms`, and
create a nice clean table using the `kable()` function from the `knitr`
package:

```{r prevalence_table}
top_terms <- td_beta %>%
  arrange(beta) %>%
  group_by(topic) %>%
  top_n(7, beta) %>%
  arrange(-beta) %>%
  select(topic, term) %>%
  summarise(terms = list(term)) %>%
  mutate(terms = map(terms, paste, collapse = ", ")) %>% 
  unnest(cols = terms)

gamma_terms <- td_gamma %>%
  group_by(topic) %>%
  summarise(gamma = mean(gamma)) %>%
  arrange(desc(gamma)) %>%
  left_join(top_terms, by = "topic") %>%
  mutate(topic = paste0("Topic ", topic),
         topic = reorder(topic, gamma))

gamma_terms %>%
  select(topic, gamma, terms) %>%
  kable(digits = 3, 
        col.names = c("Topic", "Expected topic proportion", "Top 7 terms"))
```

And let's also compare this to the most prevalent topics and terms from
our `forums_stm` model that we created using the `plot()` function:

```{r plot_stm}
plot(forums_stm, n = 7)
```

### 4c. Reading the Tea Leaves

Recognizing that topic modeling is best used as a "tool for reading" and
provides only an incomplete answer to our overarching question, **"How
do we quantify what a corpus is about?"**, the results do suggest some
potential topics that have emerged as well as some areas worth following
up on.

Specifically, looking at some of the common clusters of words for the
more prevalent topics suggest that some key topics or "latent themes"
(renamed in bold) might include:

-   **Teaching Statistics:** Unsurprising, given the course title, the
    topics most prevalent in both the `forums_stm` and `forums_lda`
    models contains the terms "teach", "students", "statistics". This
    could be an "overarching theme" but more likely may simply be just
    the residue of the course title though being sprinkled throughout
    the forums and deserves some follow up. Topics 8 from the LDA model
    may overlap with this topic as well.
-   **Course Utility:** The second most prevalent Topics (13 and 2) in
    the `lda` and `stm` models respectively, seem to potentially be
    about the usefulness of course "resources" like lessons, tools,
    videos, and activities. I'm wagering this might be a forum dedicated
    to course feedback. Topic 15 from the STM model also suggest this
    may be a broader theme.
-   **Using Real-World Data:** Topics 18 & 12 from the LDA model
    particularly intrigue me and I'm wagering this is pretty positive
    sentiment among participants about the value and benefit of having
    students collect and analyze real data sets (e.g. Census data in
    Topic 1) and work on projects relevant to their real life. Will
    definitely follow up on this one.
-   **Technology Use:** Several topics (6 & 11 from LDA and 8 & 19 from
    STM) appear to be about student use of technology and software like
    calculators and Excel for teaching statistics and using simulations.
    Topic 16 from LDA also suggest the use of the Common Online Data
    Analysis Platform.
-   **Student Struggle & Engagement:** Topic 15 from LDA and Topic 16
    from STM also intrigue me and appear to be two opposite sides of
    perhaps the same coin. The former includes "struggle" and "reading",
    which suggests perhaps a barrier to teaching statistics, while STM
    Topic 16 contains top stems like "engage", "activ", and "think" and
    may suggest participants anticipate that activities will engage
    students.

To serve as a check on my tea leaf reading, I'm going to follow Bail's
recommendation to examine some of these topics qualitatively. The `stm`
package has another useful, though exceptionally fussy, function called
`findThoughts()`, which extracts passages from documents within the
corpus that are associated with the topics you specify.

The first line of code may not be necessary for your independent
analysis, but because the `textProcessor()` function removed several
documents during processing, the `findThoughts()` function can't
properly index the processed docs. This [line of code found on
Stack Overflow](https://stackoverflow.com/questions/43492667/r-stm-number-of-provided-texts-and-number-of-documents-modeled-do-not-match)
drops from the original `ts_forum_data` source the documents that were
removed during processing, so there is a one-to-one correspondence with
`forums_stm` and you can use the function to find posts associated
with a given topic.
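
To make the indexing issue described above a bit more concrete, here is a minimal base R sketch using made-up stand-ins: `posts` and `docs_removed` are hypothetical and simply play the roles of `ts_forum_data$post_content` and `temp$docs.removed`. It also shows a gotcha worth knowing about: if no documents were removed, the negative index is empty and the subset returns nothing at all.

```{r docs_removed_sketch}
# Toy stand-ins (hypothetical): five "posts" and the positions that a
# preprocessing step reported as dropped.
posts <- c("great lesson", "", "loved the videos", "!!!", "census data")
docs_removed <- c(2, 4)

# Negative indexing drops those positions and keeps the rest in order,
# so the texts line up one-to-one with the documents that were modeled.
posts_reduced <- posts[-docs_removed]
posts_reduced
length(posts_reduced)

# Caution: with an empty index, posts[-integer(0)] returns an empty
# vector rather than all posts, so guard the subset when the removed
# index might be empty.
none_removed <- integer(0)
posts_safe <- if (length(none_removed) > 0) posts[-none_removed] else posts
length(posts_safe)
```

If the length of your reduced texts does not match the number of documents in your model, `findThoughts()` will likely complain, so a quick length check like the one above can save some head scratching.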

Let's slightly reduce our original data set to match our STM model, pass
both to the `findThoughts()` function, and set our arguments to return
`n = 10` posts from `topics = 2` (i.e. Topic 2) whose estimated topic
proportion is at least 50%, which we set as a minimum threshold with
`thresh = 0.5`.

```{r findThoughts_2}

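# Drop the posts that textProcessor() removed so the texts line up
# one-to-one with the documents in forums_stm
ts_forum_data_reduced <- ts_forum_data$post_content[-temp$docs.removed]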

findThoughts(forums_stm,
             texts = ts_forum_data_reduced,
             topics = 2, 
             n = 10,
             thresh = 0.5)
```

Duplicate posts aside, this **Course Utility** topic returns posts that
were expected based on my interpretation of the key terms for Topic 2.
It looks like I may have read those tea leaves correctly.
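
If the `n` and `thresh` arguments used above feel a little opaque, the sketch below shows roughly what they do, using a made-up matrix of document-topic proportions. This is my own simplified illustration, not the package's actual code: `theta`, `texts`, and the helper `top_posts()` are all hypothetical, though a fitted `stm` model does store its estimated proportions in an element called `theta`.

```{r thresh_sketch}
# Hypothetical document-topic proportions: 6 documents x 3 topics,
# with each row summing to 1.
theta <- matrix(c(0.70, 0.20, 0.10,
                  0.10, 0.85, 0.05,
                  0.30, 0.55, 0.15,
                  0.40, 0.45, 0.15,
                  0.05, 0.90, 0.05,
                  0.60, 0.30, 0.10),
                ncol = 3, byrow = TRUE)
texts <- paste("post", 1:6)

# Hypothetical helper: rank documents by their proportion for one topic,
# keep at most n, then drop any that fall below the threshold.
top_posts <- function(theta, texts, topic, n, thresh) {
  ranked <- order(theta[, topic], decreasing = TRUE)
  ranked <- head(ranked, n)
  keep <- ranked[theta[ranked, topic] >= thresh]
  texts[keep]
}

top_posts(theta, texts, topic = 2, n = 10, thresh = 0.5)
```

Note that asking for `n = 10` posts does not guarantee ten results: only three of the six toy posts clear the 0.5 threshold for Topic 2, so only those three come back.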

Now let's take a look at Topic 16 that we thought might be related to
student engagement:

```{r findThoughts_16}

findThoughts(forums_stm,
             texts = ts_forum_data_reduced,
             topics = 16, 
             n = 10,
             thresh = 0.5)
```

It looks like my tea leaf reading was partially correct for Topic 16,
though the results seem to be about a specific "Pepsi challenge"
activity to conduct with students.

Finally, let's look at posts from Topic 3, which we thought might be an
overarching theme about teaching statistics:

```{r findThoughts_3}

ts_forum_data_reduced <- ts_forum_data$post_content[-temp$docs.removed]

findThoughts(forums_stm,
             texts = ts_forum_data_reduced,
             topics = 3, 
             n = 10,
             thresh = 0.5)
```

Looking at just the 10 posts returned, perhaps a better name for this
topic would be **Course Reflections on Teaching Statistics**.

#### Unit Takeaway

In addition to some useful R packages and functions for the actual
process of topic modeling, there are two main lessons I hope you take
away from this walkthrough:

1.  **Topic modeling requires a lot of decisions.** Beyond deciding on a
    value for K, there are a number of key decisions that you have to
    make that can dramatically affect your results. For example, to stem
    or not to stem? What qualifies as a document? What flavor of topic
    modeling is best suited to your data and research questions? How
    many iterations to run?
2.  **Topic modeling is as much art as (data) science.** As Bail (2018)
    noted, the term "topic" is somewhat ambiguous, and topic models do
    not produce highly nuanced classification of texts. Once you've fit
    your model, interpreting your model requires some mental gymnastics
    and ideally some knowledge of the context from which the data came
    to help with interpretation of your topics. Moreover, the
    quantitative approaches for making the decisions highlighted above
    are imperfect, and a good deal of human judgment is required.

##### ✅ Comprehension Check

Using the STM model you fit from the Section 3 [Comprehension Check]
with a different value for K, use the approaches demonstrated in Section
4 to explore and interpret your topics and terms and revisit the
following question:

1.  Now that you have a little more context, how might you revise your
    initial interpretation of some of the latent topics or latent themes
    from your model?
    
```{r}
# I'm going to look at topic 10 (k=14) which seems to be related to
# students using technology
  
ts_forum_data_reduced <- ts_forum_data$post_content[-temp$docs.removed]

findThoughts(forums_stm_fourteen,
             texts = ts_forum_data_reduced,
             topics = 10, 
             n = 10,
             thresh = 0.5)
```
    
This topic has to do with different platforms and tools used to conduct
experiments (seemingly to better understand statistics). Many teachers
seemed to be sharing individual tools and commenting on how they liked
the learning experiences that the technology seemed to evoke from students
(i.e. "metacognition").