1. PREPARE
To help us better understand the context, questions, and data sources we’ll be using in Unit 3, this section will focus on the following topics:
- Context. As context for our analysis this week, we’ll review several related papers by my colleagues relevant to our analysis of MOOC-Ed discussion forums.
- Questions. We’ll also examine what insight topic modeling can provide to a question that we asked participants to answer in their professional learning teams (PLTs).
- Project Setup. This should be very familiar by now, but we’ll set up a new R project and install and load the required packages for the topic modeling walkthrough.
1a. Context
Participating in a MOOC and Professional Learning Team: How a Blended Approach to Professional Development Makes a Difference

Full text: https://www.learntechlib.org/p/195234/
Abstract
Massive Open Online Courses for Educators (MOOC-Eds) provide opportunities for using research-based learning and teaching practices, along with new technological tools and facilitation approaches for delivering quality online professional development. The Teaching Statistics Through Data Investigations MOOC-Ed was built for preparing teachers in pedagogy for teaching statistics, and it has been offered to participants from around the world. During 2016-2017, professional learning teams (PLTs) were formed from a subset of MOOC-Ed participants. These teams met several times to share and discuss their learning and experiences. This study focused on examining the ways that a blended approach to professional development may result in similar or different patterns of engagement to those who only participate in a large-scale online course. Results show the benefits of a blended learning environment for retention, engagement with course materials, and connectedness within the online community of learners in an online professional development on teaching statistics. The findings suggest the use of self-forming autonomous PLTs for supporting a deeper and more comprehensive experience with self-directed online professional developments such as MOOCs. Other online professional development courses, such as MOOCs, may benefit from purposely suggesting and advertising, and perhaps facilitating, the formation of small face-to-face or virtual PLTs who commit to engage in learning together.
Data Source & Analysis
All peer interaction, including peer discussion, takes place within discussion forums of MOOC-Eds, which are hosted using the Moodle Learning Management System. To build the dataset you’ll be using for this walkthrough, the research team wrote a query for Moodle’s MySQL database, which records participants’ user-logs of activity in the online forums. This SQL query combines separate database tables containing postings and comments, including participant IDs, timestamps, discussion text and other attributes or “metadata.”
For further description of the forums and data retrieval process, see also the following papers:
Kellogg, S., & Edelmann, A. (2015). Massively Open Online Course for Educators (MOOC‐Ed) network dataset. British Journal of Educational Technology, 46(5), 977-983.
Ezen-Can, A., Boyer, K. E., Kellogg, S., & Booth, S. (2015, March). Unsupervised modeling for understanding MOOC discussion forums: a learning analytics approach. In Proceedings of the fifth international conference on learning analytics and knowledge (pp. 146-150).
Kellogg, S., Booth, S., & Oliver, K. (2014). A social network perspective on peer supported learning in MOOCs for educators. International Review of Research in Open and Distributed Learning, 15(5), 263-289.
Summary of Key Findings
The following are some key findings related to the discussion forums in the papers cited above:
- MOOCs designed specifically for K-12 teachers can provide positive self-directed learning experiences and rich engagement in discussion forums that help form online communities for educators.
- Analysis of discussion forum data in TSDI provided a very clear picture of how enthusiastic many PLT members and leaders were to talk to others in the online community. They posed their questions and shared ideas with others about teaching statistics throughout the units, even though they were also meeting synchronously several times with their colleagues in small group PLTs.
- Findings on knowledge construction demonstrated that over half of the discussions in both courses moved beyond sharing information and statements of agreement and entered a process of dissonance, negotiation and co-construction of knowledge, but seldom moved beyond this phase in which new knowledge was tested or applied. These findings echo similar research on difficulties in promoting knowledge construction in online settings.
- Topic modeling provides more interpretable and cohesive models for discussion forums than other popular unsupervised modeling techniques such as k-means and k-medoids clustering algorithms.
1b. Guiding Questions
For the paper, Participating in a MOOC and Professional Learning Team: How a Blended Approach to Professional Development Makes a Difference, the researchers were interested in unpacking how participants who enrolled in the Teaching Statistics through Data Investigations MOOC-Ed might benefit from also being in a smaller group of professionals committed to engaging in the same professional development. The specific research question for this paper was:
What are the similarities and differences between how PLT members and Non-PLT online participants engage and meet course goals in a MOOC-Ed designed for educators in secondary and collegiate settings?
Dr. Hollylynne Lee and the TSDI team also developed a facilitation guide designed specifically for PLT teams to help groups synthesize the ideas in the course and make plans for how to implement new strategies in their classroom in order to impact students’ learning of statistics. One question PLT members were asked to address was:
What ideas or issues emerged in the discussion forums this past week?
For this walkthrough, we will further examine that question through the use of topic modeling.
And just to reiterate yet again from Unit 1, one overarching question we’ll explore throughout this course, and that Silge and Robinson (2018) identify as a central question to text mining and natural language processing, is:
How do we quantify what a document or collection of documents is about?
1c. Set Up
As highlighted in Chapter 6 of Data Science in Education Using R (DSIEUR), one of the first steps of every workflow should be to set up a “Project” within RStudio. This will be your “home” for any files and code used or created in Unit 3.
You are welcome to continue using the same project created for Unit 1, or create an entirely new project for Unit 3. Once you’ve created your project, open up a new R script and load the following packages that we’ll need for this walkthrough:
library(tidyverse)
library(tidytext)
library(SnowballC)
library(topicmodels)
library(stm)
library(ldatuning)
library(knitr)
library(LDAvis)
At the end of this week, I encourage you to share your R script with me as evidence that you have completed the walkthrough. Although I highly recommend that you manually type the code shared throughout this walkthrough, for large blocks of text it may be easier to copy and paste.
2. WRANGLE
As noted previously, data wrangling involves some combination of cleaning, reshaping, transforming, and merging data (Wickham & Grolemund, 2017). This week we’ll revisit tidying and tokenizing text using the tidytext package, but we’ll also be introduced to the stm package. This package makes use of the tm text mining package to preprocess text (e.g., removing punctuation, stop words, etc.) and will also be our first introduction to word stemming.
- Import Data. We’ll be working with .csv files this week and the read_csv() function but will introduce a new argument for changing column types.
- Cast a DTM. We revisit the tidytext package to “tidy” and tokenize our forum data and introduce the cast_dtm() function to create the document term matrix (dtm) needed for topic modeling.
- To Stem or not to Stem? We conclude our data wrangling by also introducing the textProcessor() function for preprocessing and discuss the pros and cons of word stemming.
2a. Import Forum Data
To get started, we need to import, or “read”, our data into R. The function used to import your data will depend on the file format of the data you are trying to import. First, however, you’ll need to do the following:
- Download the ts_forum_data.csv file we’ll be using for this Unit from our NCSU Moodle course site.
- Create a folder in the directory on your computer where you stored your R Project and name it “data”.
- Add the file to your data folder.
- Check your Files tab in RStudio to verify that your file is indeed in your data folder.
Now let’s read our data into our Environment using the read_csv() function and assign it to a variable name so we can work with it like any other object in R.
ts_forum_data <- read_csv("data/ts_forum_data.csv",
  col_types = cols(course_id = col_character(),
                   forum_id = col_character(),
                   discussion_id = col_character(),
                   post_id = col_character()
  )
)
By default, many of the columns like course_id and forum_id are read in as numeric data. For our purposes, we plan to treat them as unique identifiers or names for our courses, forums, discussions, and posts. The read_csv() function has a handy col_types = argument for changing the column types from numeric to character.
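To see the effect of col_types = on a small scale, here is a minimal sketch that writes a few made-up rows to a temporary file and reads them back twice. The rows, the tmp_csv file, and the object names guessed and as_ids are hypothetical and only stand in for ts_forum_data.csv.

```r
library(readr)

# Hypothetical stand-in for a few rows of the forum data (not the real file)
tmp_csv <- tempfile(fileext = ".csv")
writeLines(
  c("course_id,post_id,post_content",
    "9,101,Hello everyone",
    "9,102,Thanks for sharing"),
  tmp_csv
)

# Default behavior: readr guesses that the id columns are numeric
guessed <- suppressMessages(read_csv(tmp_csv))

# With col_types, the listed id columns are read as character instead
as_ids <- read_csv(
  tmp_csv,
  col_types = cols(
    course_id = col_character(),
    post_id = col_character()
  )
)

class(guessed$course_id)  # "numeric"
class(as_ids$course_id)   # "character"
```

Columns not named inside cols() are still guessed, so only the identifiers you list change type.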
2b. Cast a Document Term Matrix
In this section we’ll revisit some familiar tidytext functions used in Units 1 & 2 for tidying and tokenizing text and introduce some new functions from the stm package for processing text and transforming our data frames into new data structures required for topic modeling.
Functions Used
tidytext functions
unnest_tokens() splits a column into tokens
anti_join() returns all rows from x without a match in y and is used to remove stop words from our data (it comes from dplyr but is used here alongside tidytext).
cast_dtm() takes a tidied data frame and “casts” it into a document-term matrix (dtm)
dplyr functions
count() lets you quickly count the unique values of one or more variables
group_by() takes a data frame and one or more variables to group by
summarise() creates a summary of data using arguments like sum and mean
stm functions
textProcessor() takes in a vector or column of raw texts and performs text processing like removing punctuation and word stemming.
prepDocuments() performs several corpus manipulations including removing words and renumbering word indices
Tidying Text
Prior to topic modeling, we have a few remaining steps to tidy our text that hopefully should feel familiar by this point. If you recall from Chapter 1 of Text Mining With R, these preprocessing steps include:
- Transforming our text into “tokens”
- Removing unnecessary characters, punctuation, and whitespace
- Converting all text to lowercase
- Removing stop words such as “the”, “of”, and “to”
Let’s tokenize our forum text using the familiar unnest_tokens() function and remove stop words as usual:
forums_tidy <- ts_forum_data %>%
  unnest_tokens(output = word, input = post_content) %>%
  anti_join(stop_words, by = "word")
forums_tidy
## # A tibble: 192,159 x 14
## course_id course_name forum_id forum_name discussion_id discussion_name
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 9 Teaching Statis~ 126 Investigat~ 6822 Not much compa~
## 2 9 Teaching Statis~ 126 Investigat~ 6822 Not much compa~
## 3 9 Teaching Statis~ 126 Investigat~ 6822 Not much compa~
## 4 9 Teaching Statis~ 126 Investigat~ 6822 Not much compa~
## 5 9 Teaching Statis~ 126 Investigat~ 6822 Not much compa~
## 6 9 Teaching Statis~ 126 Investigat~ 6822 Not much compa~
## 7 9 Teaching Statis~ 126 Investigat~ 6822 Not much compa~
## 8 9 Teaching Statis~ 126 Investigat~ 6822 Not much compa~
## 9 9 Teaching Statis~ 126 Investigat~ 6822 Not much compa~
## 10 9 Teaching Statis~ 126 Investigat~ 6822 Not much compa~
## # ... with 192,149 more rows, and 8 more variables: discussion_creator <dbl>,
## # discussion_poster <dbl>, discussion_reference <chr>, parent_id <dbl>,
## # post_date <chr>, post_id <chr>, post_title <chr>, word <chr>
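As a quick, self-contained check of what these two steps do, the sketch below runs the same pipeline on two made-up posts. The toy_posts data frame is hypothetical and not taken from ts_forum_data.

```r
library(dplyr)
library(tidytext)

# Two made-up posts (hypothetical, not from the forum data)
toy_posts <- tibble(
  post_id = c("1", "2"),
  post_content = c("The students LOVED the data!",
                   "Students need time with the data.")
)

toy_tidy <- toy_posts %>%
  # one row per word; text is lowercased and punctuation is stripped
  unnest_tokens(output = word, input = post_content) %>%
  # drop rows whose word appears in the stop_words lexicon (e.g. "the")
  anti_join(stop_words, by = "word")

toy_tidy
```

Note that post_id is carried along on every row, which is what later lets us count words per post.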
Now let’s do a quick word count to see some of the most common words used throughout the forums. This should give us a sense of what we’re working with, and later we’ll need these word counts for creating our document term matrix for topic modeling:
forums_tidy %>%
  count(word, sort = TRUE)
## # A tibble: 13,620 x 2
## word n
## <chr> <int>
## 1 students 6841
## 2 data 4365
## 3 statistics 3103
## 4 school 1488
## 5 questions 1470
## 6 class 1426
## 7 font 1311
## 8 span 1267
## 9 time 1253
## 10 style 1150
## # ... with 13,610 more rows
Terms like “students,” “data,” and “class” are about what we would have expected from a course on teaching statistics. The terms “agree” and “time,” however, are not so intuitive and are worth a quick look as well.
✅ Comprehension Check
Use the filter() and grepl() functions introduced in Unit 1, Section 3b to filter for rows in our ts_forum_data data frame that contain the terms “agree” and “time” and another term or terms of your choosing. Select a random sample of 10 posts using the sample_n() function for your terms and answer the following questions:
- What, if anything, do these posts have in common?
- These posts are all replies to a separate discussion. Depending on the words used you may find similarities for the original question.
- What topics or themes might be apparent, or do you anticipate emerging, from our topic modeling?
- Lessons that the participants worked together on, or shared from past experience. Also, statistical models that are useful for explaining hard-to-understand concepts.
Your output should look something like this:
## # A tibble: 10 x 1
## post_content
## <chr>
## 1 "I couldn't agree more about how fabulously brilliant i think this lesson is~
## 2 "Great point! I agree that it was interesting the students only wanted to do~
## 3 "Hi Stephanie I agree with you it is a shame that too much of our focus ne~
## 4 "I thought the same thing when watching the video! I thought it was a good ~
## 5 "I also agree. The most beneficially part of this course for me was the exp~
## 6 "Sherri I agree with how this is helping you design your lesson plans. I ~
## 7 "I agree with this comment. Sometimes time constraints limit our ability to~
## 8 "Kelly- I love that you acknowledge the importance of still doing some thin~
## 9 "Doug I completely agree with you about how beneficial this online course ~
## 10 "I agree. I also think its a let less \\high stakes\\\" than a test. This wi~
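One possible shape for this check is sketched below on a handful of made-up posts rather than the real data. The toy_posts data frame and the pattern "agree|time" are assumptions for illustration; swap in ts_forum_data and your own terms.

```r
library(dplyr)

# Made-up posts standing in for ts_forum_data (hypothetical)
toy_posts <- tibble(
  post_content = c("I agree with you about the lesson",
                   "There is never enough time in class",
                   "Great lesson plan",
                   "I agree, time is always short")
)

set.seed(588)  # make the random sample repeatable

toy_sample <- toy_posts %>%
  # keep posts whose text matches either term; "|" means "or" in a regular expression
  filter(grepl("agree|time", post_content)) %>%
  # draw a random sample of the matching posts
  sample_n(2)

toy_sample
```

With the real data you would sample 10 posts instead of 2, and select(post_content) before sampling keeps the output narrow.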
Creating a Document Term Matrix
As highlighted in Chapter 5 of Text Mining with R, the topicmodels package, along with the Latent Dirichlet allocation (LDA) algorithm and the LDA() function it uses, expects a document-term matrix as the data input.
Before we create our document-term matrix, however, we have an important decision to make:
What do we consider to be a “document” in a MOOC-Ed discussion forum?
For example, we could consider each individual discussion post, or post_id in our data frame, as a document. It might also make sense to combine texts from all posts within each discussion, or discussion_id, and consider that as a document since these posts are often interconnected and build off one another.
For now, however, let’s treat each individual post as a unique “document.” As noted above, to create our document term matrix, we’ll need to first count() how many times each word occurs in each document, or post_id in our case, and create a matrix that contains one row per post as our original data frame did, but now contains a column for each word in the entire corpus and a value of n for how many times that word occurs in each post.
To create this document term matrix from our post counts, we’ll use the cast_dtm() function like so and assign it to the variable forums_dtm:
forums_dtm <- forums_tidy %>%
  count(post_id, word) %>%
  cast_dtm(post_id, word, n)
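To make the one-row-per-post, one-column-per-word layout concrete, here is a minimal sketch that casts a tiny, made-up set of word counts. The toy_counts values are hypothetical.

```r
library(dplyr)
library(tidytext)

# Hypothetical word counts for two posts
toy_counts <- tibble(
  post_id = c("1", "1", "2"),
  word    = c("data", "students", "data"),
  n       = c(2L, 1L, 1L)
)

# Same call pattern as above: document column, term column, count column
toy_dtm <- toy_counts %>%
  cast_dtm(post_id, word, n)

dim(toy_dtm)        # 2 documents by 2 terms
as.matrix(toy_dtm)  # dense view; the missing post/word pair shows up as 0
```

Post 2 never used “students”, so that cell is 0. In a real corpus most cells look like this, which is where sparsity comes from.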
✅ Comprehension Check
Take a look at our forums_dtm object in the console and answer the following question:
- What “class” of object is forums_dtm?
- It is a DocumentTermMatrix, which is stored as a simple triplet matrix (a sparse matrix format).
- How many unique documents and terms are included in our matrix?
- There are 5,766 documents and 13,620 terms in our matrix.
- Why might there be fewer documents/posts than were in our original data frame?
- Some posts drop out because they have no words left after tokenizing and removing stop words, so they never appear in the word counts.
- What exactly is meant by “sparsity”?
- Sparsity is the share of cells in the document-term matrix that are zero, meaning the term does not occur in that document. Here 78,390,279 of the 78,532,920 cells (about 99.8%) are empty, which is printed as 100% after rounding.
class(forums_dtm)
## [1] "DocumentTermMatrix" "simple_triplet_matrix"
forums_dtm
## <<DocumentTermMatrix (documents: 5766, terms: 13620)>>
## Non-/sparse entries: 142641/78390279
## Sparsity : 100%
## Maximal term length: NA
## Weighting : term frequency (tf)
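The sparsity figure can be reproduced by hand from the numbers printed above. The short calculation below uses only those printed values; the variable names are just for illustration.

```r
# Values copied from the printed forums_dtm summary above
n_docs     <- 5766
n_terms    <- 13620
non_sparse <- 142641     # cells with a count greater than zero
sparse     <- 78390279   # cells equal to zero

# Every document/term pair is one cell in the matrix
total_cells <- n_docs * n_terms

# Share of cells that are zero
sparsity <- sparse / total_cells

total_cells            # equals non_sparse + sparse
round(sparsity * 100)  # rounds to 100, matching the printed "Sparsity : 100%"
```

So a reported sparsity of 100% does not mean the matrix is empty, only that well under 1% of the cells hold a count.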
2c. To Stem or not to Stem?
Next we’ll need to prepare our original data set for structural topic modeling using the textProcessor() function. The stm package has a number of features that extend the functionality of the topicmodels package, including an argument for “stemming” words, which Schofield and Mimno (2016) describe as follows:
Stemming is a popular way to reduce the size of a vocabulary in natural language tasks by conflating words with related meanings. Specifically, stemming aims to convert words with the same “stem” or root (e.g “creative” and “creator”) to a single word type (“create”). Though originally developed in the context of information retrieval (IR) systems, stemmers are now commonly used as a preprocessing step in unsupervised machine learning tasks.
The rationale behind stemming is that it can dramatically reduce the number of words or terms to be modeled, which in theory should help simplify and improve the performance of your model. We’ll explore this assumption a little later in this section.
Processing and Stemming for STM
Like unnest_tokens(), the textProcessor() function includes several useful arguments for processing text, like converting text to lowercase and removing punctuation and numbers. I’ve included several of these in the script below, set to the defaults that are used if you do not specify them explicitly. Most of these are pretty intuitive and you can learn more by viewing the ?textProcessor documentation.
Let’s go ahead and process our discussion forum post_content in preparation for structural topic modeling:
temp <- textProcessor(ts_forum_data$post_content,
                      metadata = ts_forum_data,
                      lowercase = TRUE,
                      removestopwords = TRUE,
                      removenumbers = TRUE,
                      removepunctuation = TRUE,
                      wordLengths = c(3, Inf),
                      stem = TRUE,
                      onlycharacter = FALSE,
                      striphtml = TRUE,
                      customstopwords = NULL)
## Building corpus...
## Converting to Lower Case...
## Removing punctuation...
## Removing stopwords...
## Removing numbers...
## Stemming...
## Creating Output...
Note that the first argument the textProcessor() function expects is the column in our data frame that contains the text to be processed. The second argument, metadata =, expects the data frame that contains the text of interest and uses the column names to label the metadata such as course ids and forum names. This metadata can be used to improve the assignment of words to topics in a corpus and examine the relationship between topics and various covariates of interest.
Unlike the unnest_tokens() function, the output is not a nice tidy data frame. Topic modeling using the stm package requires a set of inputs that are specific to the package. The following code pulls the elements from the temp list that will be required for the stm() function we’ll use in Section 3:
meta <- temp$meta
vocab <- temp$vocab
docs <- temp$documents
Stemming Tidy Text
Notice that the textProcessor() stem argument we used above is set to TRUE by default. We haven’t introduced word stemming until now because there is some debate about the actual value of this process. Collapsing words like “students” and “student” into their base word might make sense and could make analyses and visualizations more concise and easier to interpret. Hvitfeldt and Silge (2021) note, however, that words like the following have dramatic differences in meaning, semantics, and use, and that collapsing them could result in poor models or misinterpreted results:
- meaning and mean
- likely, like, liking
- university and universe
The first word pair is particularly relevant to discussion posts from our Teaching Statistics course data. In addition, collapsing words like “teachers” and “teaching” could dramatically alter the results from a topic model.
Schofield and Mimno (2016) specifically conclude:
Despite their frequent use in topic modeling, we find that stemmers produce no meaningful improvement in likelihood and coherence and in fact can degrade topic stability.
For now, we will leave as is the forums_dtm we created earlier with words unstemmed, but what if we wanted to stem words in a “tidy” way?
Since the unnest_tokens() function does not (intentionally I believe) include a stemming function, one approach would be to use the wordStem() function from the SnowballC package to either replace our word column with word stems or create a new variable called stem with our stemmed words. Let’s do the latter and take a look at the original words and the stem that was produced:
stemmed_forums <- ts_forum_data %>%
  unnest_tokens(output = word, input = post_content) %>%
  anti_join(stop_words, by = "word") %>%
  mutate(stem = wordStem(word))
stemmed_forums
## # A tibble: 192,159 x 15
## course_id course_name forum_id forum_name discussion_id discussion_name
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 9 Teaching Statis~ 126 Investigat~ 6822 Not much compa~
## 2 9 Teaching Statis~ 126 Investigat~ 6822 Not much compa~
## 3 9 Teaching Statis~ 126 Investigat~ 6822 Not much compa~
## 4 9 Teaching Statis~ 126 Investigat~ 6822 Not much compa~
## 5 9 Teaching Statis~ 126 Investigat~ 6822 Not much compa~
## 6 9 Teaching Statis~ 126 Investigat~ 6822 Not much compa~
## 7 9 Teaching Statis~ 126 Investigat~ 6822 Not much compa~
## 8 9 Teaching Statis~ 126 Investigat~ 6822 Not much compa~
## 9 9 Teaching Statis~ 126 Investigat~ 6822 Not much compa~
## 10 9 Teaching Statis~ 126 Investigat~ 6822 Not much compa~
## # ... with 192,149 more rows, and 9 more variables: discussion_creator <dbl>,
## # discussion_poster <dbl>, discussion_reference <chr>, parent_id <dbl>,
## # post_date <chr>, post_id <chr>, post_title <chr>, word <chr>, stem <chr>
You can see that words like “activity” and “activities” that occur frequently in our discussions have been reduced to the word stem “activ”. If you are interested in learning other approaches for word stemming in R, as well as reading a more in-depth description of the stemming process, I highly recommend Chapter 4, Stemming, from Hvitfeldt and Silge’s (2021) book, Supervised Machine Learning for Text Analysis in R.
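To check how the stemmer treats a few of the word pairs discussed in this section, a small sketch with wordStem() is shown below. The word list is chosen here for illustration, and the function is called with its default (Porter) settings.

```r
library(SnowballC)

# Word pairs mentioned in this section
words <- c("students", "student",
           "activity", "activities",
           "meaning", "mean",
           "university", "universe")

# Put each word next to its stem for easy comparison
stems <- data.frame(word = words, stem = wordStem(words))
stems
```

Pairs that share a stem, such as “university” and “universe”, become indistinguishable to any model fit on the stemmed text, which is the concern raised by Hvitfeldt and Silge (2021).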
✅ Comprehension Check
Complete the following code using what we learned in the section on Creating a Document Term Matrix and answer the following questions:
- How many fewer terms are in our stemmed document term matrix?
- Stemming leaves 10,001 terms compared to 13,620 in the unstemmed matrix, or 3,619 fewer terms.
- Did stemming words significantly reduce the sparsity of the matrix?
- Not in any meaningful way. The number of sparse entries drops from 78,390,279 in forums_dtm to 57,529,581 in stemmed_dtm (about 27% fewer), but that mostly reflects having fewer columns. As a share of all cells, sparsity is about 99.8% in both matrices.
Hint: Make sure your code includes stem counts rather than word counts.
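stemmed_dtm <- ts_forum_data %>%
  unnest_tokens(output = word, input = post_content) %>%
  anti_join(stop_words, by = "word") %>%
  mutate(stem = wordStem(word)) %>%
  count(post_id, stem) %>%      # count stems (not words) within each post
  cast_dtm(post_id, stem, n)
# Compare the stemmed and unstemmed matrices
stemmed_dtm
forums_dtm
# Most common stems across all posts
stemmed_forums %>%
  count(stem, sort = TRUE)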
## <<DocumentTermMatrix (documents: 5766, terms: 10001)>>
## Non-/sparse entries: 136185/57529581
## Sparsity : 100%
## Maximal term length: NA
## Weighting : term frequency (tf)
## <<DocumentTermMatrix (documents: 5766, terms: 13620)>>
## Non-/sparse entries: 142641/78390279
## Sparsity : 100%
## Maximal term length: NA
## Weighting : term frequency (tf)
## # A tibble: 10,001 x 2
## stem n
## <chr> <int>
## 1 student 7354
## 2 data 4365
## 3 statist 4161
## 4 question 2470
## 5 teach 1858
## 6 class 1738
## 7 school 1606
## 8 time 1457
## 9 learn 1372
## 10 font 1311
## # ... with 9,991 more rows
3. MODEL
This unit provides our first opportunity for modeling a text as data. In very simple terms, modeling involves developing a mathematical summary of a dataset. These summaries can help us further explore trends and patterns in our data.
In their book, Learning Analytics Goes to School, Krumm and Means (2018) describe two general types of modeling approaches used in the Data-Intensive Research workflow: unsupervised and supervised machine learning. In distinguishing between the two, they note:
Unsupervised learning algorithms can be used to understand the structure of one’s dataset. Supervised models, on the other hand, help to quantify relationships between features and a known outcome. Known outcomes are also commonly referred to as labels or dependent variables.
In Section 3 we focus on Topic Modeling, an unsupervised learning approach to automatically identify topics in a collection of documents. In fact, we’ll explore two different approaches to topic modeling, as well as strategies for identifying the “right” number of topics:
- Fitting a Topic Model with LDA. In this section we learn to use the topicmodels package and associated LDA() function for unsupervised classification of our forum discussions to find natural groupings of words, or topics.
- Fitting a Structural Topic Model. We then explore the use of the stm package and stm() function to fit our model, which uses metadata about documents to improve the assignment of words to “topics” in a corpus.
- Choosing K. Finally, we wrap up Section 3 by learning about diagnostic properties like exclusivity, semantic coherence, and heldout likelihood for helping to select an appropriate number of topics.
3a. Fitting a Topic Model with LDA
Before running our first topic model using the LDA() function, let’s quickly recap from our readings some basic principles behind Latent Dirichlet allocation and why LDA is often preferred over other automatic classification or clustering approaches.
Unlike simple forms of cluster analysis such as k-means clustering, LDA is a “mixture” model, which in our context means that:
- Every document contains a mixture of topics. Unlike algorithms like k-means, LDA treats each document as a mixture of topics, which allows documents to “overlap” each other in terms of content, rather than being separated into discrete groups. So in practice, this means that a discussion forum post could have an estimated topic proportion of 70% for Topic 1 (i.e., be mostly about Topic 1), but also be partly about Topic 2.
- Every topic contains a mixture of words. For example, if we specified in our LDA model just 2 topics for our discussion posts, we might find that one topic seems to be about pedagogy while another is about learning. The most common words in the pedagogy topic might be “teacher”, “strategies”, and “instruction”, while the learning topic may be made up of words like “understanding” and “students”. However, words can be shared between topics and words like “statistics” or “assessment” might appear in both equally.
Similar to k-means and other simple clustering approaches, however, LDA does require us to specify a value of k for the number of topics in our corpus. Selecting k is no trivial matter and can greatly impact your results.
Since we don’t have a strong rationale about the number of topics that might exist in discussion forums, let’s use the n_distinct() function from the dplyr package to find the number of unique forum names in our course data and run with that:
n_distinct(ts_forum_data$forum_name)
## [1] 21
Since it looks like there are about 20 distinct discussion forums (21, according to the output above), we’ll use 20 as our value for the k = argument of the LDA() function. Be patient while this runs, since the default setting is to perform a large number of iterations.
forums_lda <- LDA(forums_dtm,
                  k = 20,
                  control = list(seed = 588)
)
forums_lda
## A LDA_VEM topic model with 20 topics.
Note that we used the control = argument to pass a random number (588) to seed the assignment of topics to each word in our corpus. Since LDA is a stochastic algorithm that could have different results depending on where the algorithm starts, we specified a seed for reproducibility so that we all see the same results every time we specify the same number of topics.
And tying back to our work in Unit 1, Bail (2020) notes that topic assignments for each word are updated in an iterative fashion and that LDA employs the Term Frequency-Inverse Document Frequency (TF-IDF) metric to assign probabilities.
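To make the “mixture” idea from the start of this section concrete, the sketch below fits a two-topic model to a tiny, made-up set of word counts and pulls out both sets of estimated proportions with the posterior() function from topicmodels. The toy_counts data are hypothetical, and with so little text the topics themselves are not meaningful; the point is only the shape of the output.

```r
library(dplyr)
library(tidytext)
library(topicmodels)

# Hypothetical word counts for four short posts
toy_counts <- tibble(
  post_id = rep(c("1", "2", "3", "4"), each = 3),
  word = c("teacher", "strategies", "statistics",
           "teacher", "instruction", "assessment",
           "students", "understanding", "statistics",
           "students", "understanding", "assessment"),
  n = c(3L, 2L, 1L, 2L, 3L, 1L, 3L, 2L, 1L, 2L, 3L, 1L)
)

toy_dtm <- toy_counts %>%
  cast_dtm(post_id, word, n)

# Same call pattern as forums_lda, just with k = 2
toy_lda <- LDA(toy_dtm, k = 2, control = list(seed = 588))

toy_post <- posterior(toy_lda)

# One row per document: estimated share of each topic in that document
round(toy_post$topics, 2)

# One row per topic: estimated probability of each word within that topic
round(toy_post$terms, 2)
```

Each row of toy_post$topics sums to 1 (every document is a mixture of topics), and each row of toy_post$terms sums to 1 (every topic is a mixture of words).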
3b. Fitting a Structural Topic Model
Bail notes that LDA, while perhaps the most common approach to topic modeling, is just one of many different types, including Dynamic Topic Models, Correlated Topic Models, Hierarchical Topic Models, and more recently, Structural Topic Modeling (STM). He argues that one reason STM has risen in popularity and use is that it employs metadata about documents to improve the assignment of words to topics in a corpus, and that this metadata can be used to examine relationships between covariates and documents.
Also, since Julia Silge has indicated that STM is, “my current favorite implementation of topic modeling in R” and has built supports in the tidytext package for building structural topic models, this package definitely is worth discussing in this walkthrough. I also highly recommend her own walkthrough of the stm package: The game is afoot! Topic modeling of Sherlock Holmes stories as well as her follow up post, Training, evaluating, and interpreting topic models.
The stm Package
As we’ve seen above, the textProcessor() function produced an unusual temp output that is unique to the stm package. And as you’ve probably already guessed, the stm() function for fitting a structural topic model does not take a fairly standard document term matrix like the LDA() function does.
Before we fit our model, we’ll have to extract the elements from the temp object created after we processed our text. Specifically, the stm() function expects the following arguments:
documents = the document term matrix to be modeled in the native stm format
data = an optional data frame containing meta data for the prevalence and/or content covariates to include in the model
vocab = a character vector specifying the words in the corpus in the order of the vocab indices in documents.
Let’s go ahead and extract these elements:
docs <- temp$documents
meta <- temp$meta
vocab <- temp$vocab
And now let’s use these elements to fit the model using the same number of topics for K that we specified for our LDA topic model. Let’s also take advantage of the fact that we can include the course_id and forum_id covariates in the prevalence = argument to help improve, in theory, our model fit:
forums_stm <- stm(documents = docs,
                  data = meta,
                  vocab = vocab,
                  prevalence = ~ course_id + forum_id,
                  K = 20,
                  max.em.its = 25,
                  verbose = FALSE)
forums_stm
## A topic model with 20 topics, 5781 documents and a 7820 word dictionary.
As noted earlier, the stm package has a number of handy features. One of these is the plot.STM() function for viewing the most probable words assigned to each topic.
By default, it only shows the first 3 terms so let’s change that to 5 to help with interpretation:
plot.STM(forums_stm, n = 5)

Note that you can also just use plot() as well:
plot(forums_stm, n = 5)
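The Functions Used list in Section 2b mentions prepDocuments(), which this walkthrough does not otherwise call. As a hedged, self-contained sketch of the full textProcessor() to prepDocuments() to stm() sequence, the example below uses the small gadarian survey dataset that ships with the stm package rather than our forum data. The choices of K = 3, max.em.its = 5, and the treatment covariate are arbitrary and only keep the example fast.

```r
library(stm)

# gadarian: a small survey dataset bundled with stm (not the forum data)
data(gadarian)

# Process the open-ended responses and keep the other columns as metadata
processed <- textProcessor(gadarian$open.ended.response,
                           metadata = gadarian,
                           verbose = FALSE)

# prepDocuments() drops infrequent terms and any documents left empty,
# then renumbers the word indices so documents, vocab, and meta stay aligned
out <- prepDocuments(processed$documents,
                     processed$vocab,
                     processed$meta,
                     verbose = FALSE)

# Fit a small structural topic model with one prevalence covariate
toy_stm <- stm(documents = out$documents,
               vocab = out$vocab,
               data = out$meta,
               prevalence = ~ treatment,
               K = 3,
               max.em.its = 5,
               verbose = FALSE)

# Top words per topic, a text alternative to the plot above
labelTopics(toy_stm, n = 5)
```

Because prepDocuments() can remove documents, passing its meta element (rather than the original data frame) to stm() keeps the covariates matched to the documents that remain.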

✅ Comprehension Check
Fit a model for both LDA and STM using different values for K and answer the following questions:
#stm model with 10 topics
forums_stm2 <- stm(documents=docs,
data=meta,
vocab=vocab,
prevalence =~ course_id + forum_id,
K=10,
max.em.its=25,
verbose = FALSE)
forums_stm2
## A topic model with 10 topics, 5781 documents and a 7820 word dictionary.
n_distinct(ts_forum_data$forum_name)
## [1] 21
forums_lda2 <- LDA(forums_dtm,
k = 5,
control = list(seed = 588)
)
forums_lda2
## A LDA_VEM topic model with 5 topics.
- What topics appear to be similar to those using 20 topics for K?
- Knowing that you don’t have as much context as the researchers of this study do, how might you interpret one of these latent topics or themes using the key terms assigned?
- What topics emerged that seem dramatically different, and how might you interpret them?
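If eyeballing two lists of topics gets tedious, one possible aid (my own suggestion, not something from the papers or packages above) is to score how much the top terms of each topic overlap using Jaccard similarity. The sketch below is base R only. The two "20-topic" entries reuse top terms from the LDA output shown in this walkthrough, while the "10-topic" entries are hypothetical stand-ins for whatever your own model returns.

```r
# Jaccard similarity: shared terms divided by all distinct terms
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

# Top terms from two topics of a 20-topic model
topics_k20 <- list(
  "20-topic 9"  = c("students", "software", "technology", "statistical", "calculator"),
  "20-topic 13" = c("questions", "students", "answer", "question", "answers")
)

# Hypothetical top terms from a model fit with a different K
topics_k10 <- list(
  "10-topic A" = c("technology", "software", "students", "excel", "calculator"),
  "10-topic B" = c("questions", "data", "students", "question", "collect")
)

# Rows = topics from the 20-topic model, columns = topics from the other model
sim <- sapply(topics_k10, function(b) sapply(topics_k20, function(a) jaccard(a, b)))

# For each 20-topic topic, the most similar topic in the other model
best_match <- apply(sim, 1, function(r) names(r)[which.max(r)])
```

High overlap only says the word lists look alike. It is a prompt for closer reading, not evidence that two topics mean the same thing.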
3c. Finding K
As alluded to earlier, selecting the number of topics for your model is a non-trivial decision and can dramatically impact your results. Bail (2018) notes that
The results of topic models should not be over-interpreted unless the researcher has strong theoretical apriori about the number of topics in a given corpus, or if the researcher has carefully validated the results of a topic model using both the quantitative and qualitative techniques described above.
There are several approaches to estimating a value for K and we’ll take a quick look at one from the ldatuning package and one from our stm package.
The FindTopicsNumber Function
The ldatuning package has functions for both calculating and plotting different metrics that can be used to estimate the most preferable number of topics for an LDA model. It also conveniently takes the standard document term matrix object that we created from our tidy text and has the added benefit of running fairly quickly, especially compared to the function for finding K from the stm package.
Let’s use the defaults specified in the ?FindTopicsNumber documentation and modify them slightly to get metrics for a sequence of topics from 10 to 75, counting by 5, and then plot the output we saved using the FindTopicsNumber_plot() function:
k_metrics <- FindTopicsNumber(
forums_dtm,
topics = seq(10, 75, by = 5),
metrics = "Griffiths2004",
method = "Gibbs",
control = list(),
mc.cores = NA,
return_models = FALSE,
verbose = FALSE,
libpath = NULL
)
FindTopicsNumber_plot(k_metrics)
Note that the FindTopicsNumber() function offers three additional metrics that can be used to estimate the most preferable number of topics for an LDA model. We used the Griffiths2004 metric included in the default example, and I’ve also found it to produce the most interpretable results, as shown in the figure below:

As a general rule of thumb and overly simplistic heuristic, we’re looking for an inflection point in our plot which indicates an optimal number of topics to select for a value of K.
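If you'd rather not rely on squinting at the plot, here is a rough base R sketch of one way to turn that heuristic into code: look for the first value of K after which adding more topics buys very little improvement. The toy_metrics values are invented for illustration (they are not results from our corpus), the column names simply mimic the topics and Griffiths2004 columns, and the 10% cutoff is an arbitrary assumption of mine.

```r
# Hypothetical metric values shaped like FindTopicsNumber() output
# (higher Griffiths2004 is better); NOT results from our forum data
toy_metrics <- data.frame(
  topics        = seq(10, 50, by = 5),
  Griffiths2004 = c(-1000, -940, -900, -880, -872, -868, -866, -865, -864.5)
)

# Improvement gained by each step up in K
gains <- diff(toy_metrics$Griffiths2004)

# Steps whose gain is less than 10% of the largest gain (arbitrary cutoff)
small_steps <- which(gains < 0.10 * max(gains))

# The K value just before the first "not worth it" step
elbow_k <- toy_metrics$topics[small_steps[1]]
elbow_k
```

Treat a number like this as a starting point for a range of K values to inspect by hand, since a different cutoff would move it.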
The searchK() Function
Finally, Bail (2018) notes that the stm package has a useful function called searchK() which allows us to specify a range of values for k and outputs multiple goodness-of-fit measures that are “very useful in identifying a range of values for k that provide the best fit for the data.”
The syntax of this function is very similar to the stm() function we used above, except that we specify a range for k as one of the arguments. In the code below, we search all values of k between 5 and 15.
# I am not expecting you to run this code as it will take too long
#findingk <- searchK(docs,
#vocab,
#K = c(5:15),
#data = meta,
#verbose=FALSE)
#plot(findingk)
Note that running the searchK() function on this corpus took all night on a pretty powerful MacBook Pro and crashed once as well, so I do not expect you to run this for the walkthrough. I ran a couple of iterations and landed on a range between 5 and 15, with an optimal number of topics somewhere around 14:

Given the somewhat conflicting results, and also somewhat selfishly for the sake of simplicity in this walkthrough, I’m just going to stick with the rather arbitrary selection of 20 topics for the remainder of this Unit.
The LDAvis Explorer
One final tool that I want to introduce from the stm package is the toLDAvis() function, which provides a great visualization for exploring topic and word distributions using the LDAvis topic browser:
toLDAvis(mod = forums_stm, docs = docs)
## Loading required namespace: servr
As you can see from the browser screenshot below, our current STM model of 20 topics results in a lot of overlap among topics, which suggests that 20 may not be an optimal number of topics, as the other approaches for finding K also suggest:

4. EXPLORE
Silge and Robinson (2018) note that fitting a topic model is the “easy part.” The hard part is making sense of the model results, and the rest of the analysis involves exploring and interpreting the model using a variety of approaches, which we’ll walk through in this section.
Bail (2018) cautions, however, that:
…post-hoc interpretation of topic models is rather dangerous… and can quickly come to resemble the process of “reading tea leaves,” or finding meaning in patterns that are in fact quite arbitrary or even random.
4a. Exploring Beta Values
Hidden within this forums_lda topic model object we created are per-topic-per-word probabilities, called β (“beta”). It is the probability of a term (word) belonging to a topic.
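To make beta concrete, here is a tiny base R sketch. The toy_beta matrix is entirely made up (two topics, four terms) and only illustrates the idea: each topic is a probability distribution over the vocabulary, so each row sums to 1, and a topic's "top terms" are simply the terms with the largest values in its row.

```r
# Hypothetical beta matrix: one row per topic, one column per term
toy_beta <- matrix(
  c(0.50, 0.30, 0.15, 0.05,
    0.05, 0.10, 0.25, 0.60),
  nrow = 2, byrow = TRUE,
  dimnames = list(topic = c("1", "2"),
                  term  = c("data", "students", "software", "excel"))
)

# Each topic's probabilities over the vocabulary sum to 1
topic_totals <- rowSums(toy_beta)

# Top 2 terms per topic: sort each row from largest to smallest beta
top_terms_toy <- apply(toy_beta, 1, function(p) names(sort(p, decreasing = TRUE))[1:2])
top_terms_toy
```

In this toy example Topic 1 leans on "data" and "students" while Topic 2 leans on "excel" and "software", which is the same kind of reading we are about to do with the real model.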
Let’s take a look at the 5 most likely terms assigned to each topic, i.e. those with the largest β values using the terms() function from the topicmodels package:
terms(forums_lda, 5)
## Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6
## [1,] "dice" "resources" "students" "statistics" "age" "stats"
## [2,] "trials" "statistics" "kids" "math" "tv" "students"
## [3,] "fair" "teaching" "understand" "mathematics" "coasters" "statistics"
## [4,] "sasi" "unit" "standard" "students" "roller" "class"
## [5,] "level" "mooc" "test" "common" "steel" "ap"
## Topic 7 Topic 8 Topic 9 Topic 10 Topic 11 Topic 12
## [1,] "students" "gapminder" "students" "li" "td" "students"
## [2,] "school" "students" "software" "strong" "difference" "time"
## [3,] "statistics" "td" "technology" "href" "class" "technology"
## [4,] "middle" "love" "statistical" "https" "1" "school"
## [5,] "level" "videos" "calculator" "target" "5" "data"
## Topic 13 Topic 14 Topic 15 Topic 16 Topic 17 Topic 18
## [1,] "questions" "students" "data" "font" "task" "students"
## [2,] "students" "data" "students" "normal" "students" "sample"
## [3,] "answer" "real" "questions" "text" "data" "results"
## [4,] "question" "agree" "set" "0px" "tasks" "size"
## [5,] "answers" "activity" "collecting" "51" "mind" "data"
## Topic 19 Topic 20
## [1,] "div" "span"
## [2,] "http" "style"
## [3,] "href" "height"
## [4,] "https" "font"
## [5,] "target" "0"
Even though we’ve somewhat arbitrarily selected the number of topics for our corpus, some of these topics or themes are fairly intuitive to interpret. For example:
Topic 11 (technology, students, software, program, excel) seems to be about students’ use of technology, including software programs like Excel;
Topic 9 (questions, kids, love, gapminder, sharing) seems to be about the Gapminder activity from the MOOC-Ed and kids’ enjoyment of it; and
Topic 18 (data, students, collect, real, sets) seems to be about student collection and use of real-world data sets.
Not surprisingly, the tidytext package has a handy function, conveniently named tidy(), to convert our LDA model to a tidy data frame containing these beta values for each term:
# convert to a tidy data frame with Beta values
tidy_lda <- tidy(forums_lda)
tidy_lda
## # A tibble: 272,400 x 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 2015 1.01e-234
## 2 2 2015 7.71e- 4
## 3 3 2015 2.50e- 4
## 4 4 2015 3.97e- 4
## 5 5 2015 3.01e- 4
## 6 6 2015 3.39e- 17
## 7 7 2015 2.25e- 52
## 8 8 2015 2.10e- 39
## 9 9 2015 1.31e- 31
## 10 10 2015 7.11e- 5
## # ... with 272,390 more rows
Obviously, it’s not very easy to interpret what the topics are about from a data frame like this so let’s borrow code again from Chapter 8.4.3 Interpreting the topic model in Text Mining with R to examine the top 5 terms for each topic and then look at this information visually:
# visually interpret top terms for each topic
top_terms <- tidy_lda %>%
group_by(topic) %>%
slice_max(beta, n = 5, with_ties = FALSE) %>%
ungroup() %>%
arrange(topic, -beta)
top_terms %>%
mutate(term = reorder_within(term, beta, topic)) %>%
group_by(topic, term) %>%
arrange(desc(beta)) %>%
ungroup() %>%
ggplot(aes(beta, term, fill = as.factor(topic))) +
geom_col(show.legend = FALSE) +
scale_y_reordered() +
labs(title = "Top 5 terms in each LDA topic",
x = expression(beta), y = NULL) +
facet_wrap(~ topic, ncol = 4, scales = "free")

4b. Exploring Gamma Values
Now that we have a sense of the most common words associated with each topic, let’s take a look at the topic prevalence in our MOOC-Ed discussion forum corpus, including the words that contribute to each topic we examined above.
Also hidden within the forums_lda topic model object we created are per-document-per-topic probabilities, called γ (“gamma”). The gamma matrix gives the estimated proportion of each document that is generated from each topic. We can combine our beta and gamma values to understand the topic prevalence in our corpus, and which words contribute to each topic.
To do this, we’re going to borrow some code from the Silge (2018) post, Training, evaluating, and interpreting topic models.
First, let’s create two tidy data frames for our beta and gamma values:
td_beta <- tidy(forums_lda)
td_gamma <- tidy(forums_lda, matrix = "gamma")
td_beta
## # A tibble: 272,400 x 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 2015 1.01e-234
## 2 2 2015 7.71e- 4
## 3 3 2015 2.50e- 4
## 4 4 2015 3.97e- 4
## 5 5 2015 3.01e- 4
## 6 6 2015 3.39e- 17
## 7 7 2015 2.25e- 52
## 8 8 2015 2.10e- 39
## 9 9 2015 1.31e- 31
## 10 10 2015 7.11e- 5
## # ... with 272,390 more rows
td_gamma
## # A tibble: 115,320 x 3
## document topic gamma
## <chr> <int> <dbl>
## 1 11295 1 0.00188
## 2 12711 1 0.000237
## 3 12725 1 0.0256
## 4 12733 1 0.00219
## 5 12743 1 0.00746
## 6 12744 1 0.00374
## 7 12756 1 0.0256
## 8 12757 1 0.00276
## 9 12775 1 0.00276
## 10 12816 1 0.00276
## # ... with 115,310 more rows
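Before combining the two, it may help to see what gamma looks like in miniature. The base R sketch below uses an invented toy_gamma matrix (three documents, two topics), not our forum data. It shows two things: each document's gamma values sum to 1, and averaging gamma over documents gives a corpus-level topic proportion, which is the same idea as the summarise(gamma = mean(gamma)) step used later in this section.

```r
# Hypothetical gamma matrix: one row per document, one column per topic
toy_gamma <- matrix(
  c(0.90, 0.10,
    0.20, 0.80,
    0.10, 0.90),
  nrow = 3, byrow = TRUE,
  dimnames = list(document = c("d1", "d2", "d3"), topic = c("1", "2"))
)

# Each document's topic proportions sum to 1
doc_totals <- rowSums(toy_gamma)

# Expected topic proportion across the corpus = mean gamma per topic
expected_prop <- colMeans(toy_gamma)

# The single most likely topic for each document
dominant_topic <- colnames(toy_gamma)[apply(toy_gamma, 1, which.max)]
```

Here Topic 2 accounts for about 60% of the toy corpus even though document d1 is almost entirely Topic 1, which is why it is worth looking at both the averages and the individual documents.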
Next, we’ll adopt Julia’s code wholesale to create a filtered data frame of our top_terms, join this to a new data frame for gamma_terms, and create a nice clean table using the kable() function from the knitr package:
top_terms <- td_beta %>%
arrange(beta) %>%
group_by(topic) %>%
top_n(7, beta) %>%
arrange(-beta) %>%
select(topic, term) %>%
summarise(terms = list(term)) %>%
mutate(terms = map(terms, paste, collapse = ", ")) %>%
# name the list-column explicitly; newer versions of tidyr warn if cols is omitted
unnest(cols = c(terms))
gamma_terms <- td_gamma %>%
group_by(topic) %>%
summarise(gamma = mean(gamma)) %>%
arrange(desc(gamma)) %>%
left_join(top_terms, by = "topic") %>%
mutate(topic = paste0("Topic ", topic),
topic = reorder(topic, gamma))
gamma_terms %>%
select(topic, gamma, terms) %>%
kable(digits = 3,
col.names = c("Topic", "Expected topic proportion", "Top 7 terms"))
| Topic | Expected topic proportion | Top 7 terms |
|---|---|---|
| Topic 7 | 0.098 | students, school, statistics, middle, level, questions, understanding |
| Topic 6 | 0.087 | stats, students, statistics, class, ap, teach, school |
| Topic 2 | 0.085 | resources, statistics, teaching, unit, mooc, ideas, learning |
| Topic 14 | 0.083 | students, data, real, agree, activity, time, analyze |
| Topic 13 | 0.070 | questions, students, answer, question, answers, correct, results |
| Topic 15 | 0.064 | data, students, questions, set, collecting, collection, question |
| Topic 8 | 0.061 | gapminder, students, td, love, videos, video, top |
| Topic 4 | 0.057 | statistics, math, mathematics, students, common, teaching, teach |
| Topic 17 | 0.055 | task, students, data, tasks, mind, activity, statistical |
| Topic 18 | 0.050 | students, sample, results, size, data, simulation, statistical |
| Topic 9 | 0.048 | students, software, technology, statistical, calculator, excel, learn |
| Topic 12 | 0.046 | students, time, technology, school, data, student, video |
| Topic 3 | 0.043 | students, kids, understand, standard, test, box, deviation |
| Topic 19 | 0.028 | div, http, href, https, target, _blank, class |
| Topic 11 | 0.027 | td, difference, class, 1, 5, 2, align |
| Topic 10 | 0.026 | li, strong, href, https, target, _blank, statistics |
| Topic 1 | 0.019 | dice, trials, fair, sasi, level, framework, sophistication |
| Topic 5 | 0.019 | age, tv, coasters, roller, steel, coaster, hours |
| Topic 16 | 0.018 | font, normal, text, 0px, 51, 255, style |
| Topic 20 | 0.013 | span, style, height, font, 0, line, margin |
And let’s also compare this to the most prevalent topics and terms from our forums_stm model that we created using the plot() function:
plot(forums_stm, n = 7)

4c. Reading the Tea Leaves
Recognizing that topic modeling is best used as a “tool for reading” and provides only an incomplete answer to our overarching question, “How do we quantify what a corpus is about?”, the results do suggest some potential topics that have emerged as well as some areas worth following up on.
Specifically, looking at some of the common clusters of words for the more prevalent topics suggests that some key topics or “latent themes” (renamed in bold) might include:
- Teaching Statistics: Unsurprisingly, given the course title, the most prevalent topics in both the forums_stm and forums_lda models contain the terms “teach”, “students”, and “statistics”. This could be an “overarching theme”, but more likely it is simply residue of the course title being sprinkled throughout the forums, and it deserves some follow-up. Topic 8 from the LDA model may overlap with this topic as well.
- Course Utility: The second most prevalent topics (13 and 2) in the LDA and STM models, respectively, seem to be about the usefulness of course “resources” like lessons, tools, videos, and activities. I’m wagering this might be a forum dedicated to course feedback. Topic 15 from the STM model also suggests this may be a broader theme.
- Using Real-World Data: Topics 18 & 12 from the LDA model particularly intrigue me and I’m wagering this is pretty positive sentiment among participants about the value and benefit of having students collect and analyze real data sets (e.g. Census data in Topic 1) and work on projects relevant to their real life. Will definitely follow up on this one.
- Technology Use: Several topics (6 & 11 from LDA and 8 & 19 from STM) appear to be about student use of technology and software like calculators and Excel for teaching statistics and using simulations. Topic 16 from LDA also suggests the use of the Common Online Data Analysis Platform (CODAP).
- Student Struggle & Engagement: Topic 15 from LDA and Topic 16 from STM also intrigue me and appear to be two opposite sides of perhaps the same coin. The former includes “struggle” and “reading” which suggests perhaps a barrier to teaching statistics while Topic 16 contains top stems like “engage”, “activ”, and “think” and may suggest participants anticipate activities may engage students.
To serve as a check on my tea leaf reading, I’m going to follow Bail’s recommendation to examine some of these topics qualitatively. The stm package has another useful, though exceptionally fussy, function called findThoughts(), which extracts passages from documents within the corpus associated with topics that you specify.
The first line of code below may not be necessary for your independent analysis, but because the textProcessor() function removed several documents during processing, the findThoughts() function can’t properly index the processed docs. This line of code, found on Stack Overflow, drops from the original ts_forum_data source the documents that were removed during processing, so that there is a one-to-one correspondence with forums_stm and you can use the function to find posts associated with a given topic.
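Here is a small base R sketch of what that realignment is doing, along with one gotcha worth knowing about. The posts and removed objects are hypothetical stand-ins for ts_forum_data$post_content and temp$docs.removed, and drop_removed() is a helper name I made up for illustration.

```r
# Hypothetical stand-ins for the original posts and the removed-document indices
posts   <- c("post A", "post B", "post C", "post D", "post E")
removed <- c(2L, 4L)

# Negative indexing drops the removed positions, keeping the rest in order
kept <- posts[-removed]   # "post A" "post C" "post E"

# Gotcha: if nothing was removed, negative indexing with an empty vector
# returns an EMPTY result rather than the full vector
none_removed <- integer(0)
oops <- posts[-none_removed]

# A guarded version that handles both cases
drop_removed <- function(x, removed) {
  if (length(removed) == 0) x else x[-removed]
}

safe_all  <- drop_removed(posts, none_removed)  # all five posts
safe_kept <- drop_removed(posts, removed)       # same as kept
```

For our corpus documents were in fact removed, so the unguarded one-liner works, but the guarded version is safer to reuse in your independent analysis.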
Let’s slightly reduce our original data set to match our STM model, pass both to the findThoughts() function, and set our arguments to return n = 10 posts from topics = 2 (i.e. Topic 2) that have at least 50%, or thresh = 0.5, as a minimum threshold for the estimated topic proportion.
ts_forum_data_reduced <-ts_forum_data$post_content[-temp$docs.removed]
findThoughts(forums_stm,
texts = ts_forum_data_reduced,
topics = 2,
n = 10,
thresh = 0.5)
##
## Topic 2:
## Ronald- We are glad that you have identied resources that you want to return to. If you do bookmark the urls they are available on the web after the course is closed.
## The materials presented in this course have been great resources. I can't wait to use them int he future.
## Thank you for providing these links and resources. The resources you provided in this MOOC have been very beneficial. I look forward to utilizing them in future lessons.
## I would download all pdfs and bookmark sites you might want in the future.
## The Annenberg Learner - is a teaching PD website that not only has \Against All Odds\" but whole lessons PD programs you can also access free with additional instructional footage - teaching teachers. It's a great resource. There are also a lot of lesson plans. Another great resource is \"The Teaching Channel\" They have lesson plans along with video. "
## I wonder the same thing Donna. I am trying to save in my hard drive the materials that had been provided us something I could pull out in the near future for use or reference.
## I as well found my weaknesses and now have resources to improve.
## I'll add my voice to the chorus of gratitude. I've already used a couple of the activities presented here in my fall quarter course and I will definitely bookmark as many of the resources as possible to try to use in the future. Thank you!
## Thanks for the info. I will definitely download many of the resources I liked for future use. Having a bank of resources for statistics is crucial.
## I agree. The resources made available are excellent for future use. I printed out several that were in print form to use and refresh my memory.
Duplicate posts aside, this Course Utility topic returns posts that were expected based on my interpretation of the key terms for Topic 2. It looks like I may have read those tea leaves correctly.
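For intuition about how the n and thresh arguments interact, here is a rough base R sketch of the selection logic as I understand it: keep only documents whose estimated proportion for the topic clears the threshold, then return the n highest. This is my own simplification and not the package's actual source code. The toy_theta values are invented, and how a value exactly equal to the threshold is treated is an assumption of the sketch.

```r
# Hypothetical estimated proportions of ONE topic across five documents
toy_theta <- c(d1 = 0.72, d2 = 0.10, d3 = 0.55, d4 = 0.91, d5 = 0.49)

thresh <- 0.5   # minimum topic proportion to be considered
n      <- 2     # number of documents to return

# Step 1: drop documents below the threshold (ties at the threshold kept here by assumption)
eligible <- toy_theta[toy_theta >= thresh]

# Step 2: rank what is left and keep at most n documents
ranked   <- sort(eligible, decreasing = TRUE)
selected <- names(ranked)[seq_len(min(n, length(ranked)))]
selected
```

One practical consequence: with a high threshold and a small topic, you can get back fewer than n posts, so an unexpectedly short list is not necessarily an error.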
Now let’s take a look at Topic 16 that we thought might be related to student engagement:
findThoughts(forums_stm,
texts = ts_forum_data_reduced,
topics = 16,
n = 10,
thresh = 0.5)
##
## Topic 16:
## For a final project two years ago I had a couple of students do a study on water. They used 4 types of water: tap Aquafina Poland Springs and Fiji. Conclusion: Fiji water tasted the best. It was interesting to note that Fiji water also works best for \snap freezing\" experiments. If you're not familiar with snap freezing try a you-tube search...pretty cool stuff! "
## If you have not yet explored the Animated Movie dataset in the Dive Into Data I highly recommend it. I have two graphs below that I created. I'd love to hear what worthwhile statistics tasks you could imagine getting students engaged in where making sense of one or both of these graphical displays of the data would come into play. <b>Graph 1: two boxplots of movie ratings for a sample of Pixar and DreamWorks movies. The green color scale shows the overall budget for producing the movie.</b> <img src=\@@PLUGINFILE@@/MovieratinngsCOmpareBoxplots.JPG\" alt=\"two box plots\" width=\"961\" height=\"530\" style=\"vertical-align:text-bottom; margin: 0 .5em;\" class=\"img-responsive\"> What statistical question might we pose to students to investigate? <b>Graph 2: Scatterplot of Profit vs. Production budget. Legend Color indicates whether the movie was produced by Pixar or Dreamworks.</b> <img src=\"@@PLUGINFILE@@/BudgetimpactonProfitMovies.JPG\" alt=\"scatterplot of profit vs. budget\" width=\"831\" height=\"570\" style=\"vertical-align:text-bottom; margin: 0 .5em;\" class=\"img-responsive\"> What statistical question might we pose to students to investigate? "
## If you have not yet explored the Animated Movie dataset in the <b>Dive Into Data</b> I highly recommend it. I have two graphs below that I created. I'd love to hear what worthwhile statistics tasks you could imagine getting students engaged in where making sense of one or both of these graphical displays of the data would come into play. <b>Graph 1: two boxplots of movie ratings for a sample of Pixar and DreamWorks movies. The green color scale shows the overall budget for producing the movie.</b> <img src=\https://place.fi.ncsu.edu/pluginfile.php?file=/36093/mod_forum/post/36848/MovieratinngsCOmpareBoxplots.JPG\" alt=\"two box plots\" width=\"961\" height=\"530\"> What statistical question might we pose to students to investigate? <b>Graph 2: Scatterplot of Profit vs. Production budget. Legend Color indicates whether the movie was produced by Pixar or Dreamworks.</b> <img src=\"https://place.fi.ncsu.edu/pluginfile.php?file=/36093/mod_forum/post/36848/BudgetimpactonProfitMovies.JPG\" alt=\"scatterplot of profit vs. budget\" width=\"831\" height=\"570\"> What statistical question might we pose to students to investigate? "
## The television activity is more basic for sixth grade to comprehend. In my county this past week was Digital Citizenship week. The first task was to talk about what social media they use. I would like to say here that I was shocked at some of the media sites my sixth graders use on a regular basis but that was another discussion. We created a dot plot of the social media used and how many used it. The next day the students were asked to think about how long they are on social media on a daily basis. We graphed the data using a histogram. My sixth graders discovered that they spend too much time on social media and not on their studies. Since data collect is not until a later unit this was a great way to introduce the types of graphs and measures of center in a way that absolutely applies to them. The airplane data was great for teaching students how to interact with large numbers. I would love to do a follow up investigation on flight cancellations since this topic impacts many travelers all to often. I have done the Old Faithful investigation that I found in the Connected Math Program Series with my eighth graders but this activity was not as interesting to them.
## I analyzed data for air quality using excel and found the activity valuable but very time consuming. Being my students would only have access to excel I used that program to download data into. The first problem my students will have is to find the data they want to analyze. For instance there are four locations to look at before considering the affect location may have on measurements. Are these representative of a larger area or affected by geographical features? After students download the data this would require their becoming familiar with the data and a discussion asking many questions. This would be good for developing “habits of mind” but may be overwhelming if care is not taken to allow them time to discern the possibilities. Second the amount of data requires a lot of work when using Excel. Most students have only used small bits of data to make graphs if at all. Now they have to move large bits of information which they should be able to graph with little direction. However this brings up new problems as their initial graphs were I made very busy and would not make sense to many. For instance a dot plot of all the data from say one location for a graph of ozone on primary axis and nitrogen on secondary access by location shows a comparison of these two variables by location. The graph is very busy and I don not think my students would think of breaking down the location data by finding the means for each month and plotting those to simplify and communicate any possible relationships. Making such a graph will show a trend (using linear regression) that most will think as a means to an end. They would see a slight increase over the period measured and incorrectly make connections to global warming and would tend to end their motivation for further investigation there. 
My challenge question to myself is “how to keep students engaged with the bigger ideas behind the data and not get overwhelmed?” What I anticipate is much confusion initially and keeping students motivated in higher thinking will be my greatest challenge. Second finding another data analysis program besides Excel and the necessary technology would be equally problematic. I would consider TUVA before Census at School and the tools supplied would be a better choice if the we have access to the technology. Lastly time for implementation of such lessons may be greater than I am allowed; I need to find some way to integrate many standards into each lesson that follow existing course objectives.
## I wonder if introducing more than two variables in a graph would be engaging with Khan Academy Pixar in a Box.
## HI. Great idea with the balloons. With my year 9 class I always pose the quesiton 'Who writes bigger Adults or Year 9?' We then discuss what I mean by bigger and we end up by defining bigger by length as it's easy to measure. We use a sentence that has all the letters of the alphabet (The quick brown fox jumps over the lazy dog) and measure the length from the first pen stroke to the furthest pen stoke. We then collect the data in a tally chart (continuous data) draw a histogram and calculate the average(s) Homework is then set to get two adults to do the same thing. Next lesson we do the same analysis and compare the results. You normally get an interesting set of responses and some pretty obvious bias if all the students bring 2 bits of data in. Great to get the students thinking and applying some statistical skills.
## What are some of the ways you have already used in your classroom to make statistics real? In collaboration with the another teacher at my previous school we found the USGS Data on Stream Flow (cubic feet per second) compared with the size of a watershed. We shared this data with our student. They created a scatterplot and calculated the line of regression. We then took our students outside. Using a cross-section we calculated the cubic feet per second of our stream. From there we could predict how large our watershed was. We then looked at the topographic map and looked at how big our watershed was.
## You could have students analyze each of the contacts. They could identify how many contacts they have how many of those contacts they use daily and how many of those contact they haven't used in the past six months. With these questions you could make them think about which factors might influence the variability between their classmates data among other things.
## Help me make use of this dreamworks vs pixar graph. What does the placement of the dots relative to the other dots have to do with anything?
It looks like my tea leaf reading was partially correct for Topic 16, though the results seem to be about a specific “Pepsi challenge” activity to conduct with students.
Finally, let’s look at posts from Topic 3, which we thought might be an overarching theme about teaching statistics:
ts_forum_data_reduced <-ts_forum_data$post_content[-temp$docs.removed]
findThoughts(forums_stm,
texts = ts_forum_data_reduced,
topics = 3,
n = 10,
thresh = 0.5)
##
## Topic 3:
## In my college faculty are given a syllabus (internal course designers) which we can modify. And I do modify the content. The standard syllabus for third year college students is less than HS AP Stats. Some students will get another dose of Statistics in their fourth year which is good. I have seen the quality and degradation of content over the 14 years. It is too bad as stats is far more than a math course.
## I teach statistics in HS. Most of the students take it because it sounds easier than trig or calculus. One comment I get frequently every year is how they enjoy the relevance of the course. \It's the first math class where I didn't have to ask why do we need to know this?\" "
## I feel I have grown in my knowledge of statistics because of this course. I do not teach a course in stats but do have to teach a part to my math 3 students and always felt like I could be doing a better job. I feel much more confident that this year will be much better.
## In the\ Extend you Learning Area\" there were two sections that made me laugh and want to dance One of these was on Harry Potter. It was amazingly creative. <div id=\"page-content\" class=\"row-fluid\"><section id=\"region-main-course\" class=\"col-md-10\"><div class=\"region-main-inner\"><div role=\"main\"><div class=\"list_content\"><section class=\"course-section course_resource\"><div class=\"section-content\"><div class=\"resource_item\"><div class=\"bookmark_container\" id=\"bookmark_container_318\" style=\"float: left; margin: 5px 0; padding-right: 10px;\"><div class=\"resource_title\"><img src=\"https://s3.amazonaws.com/fi-assets/icons/video.png\" alt=\"https://s3.amazonaws.com/fi-assets/icons/video.png\" class=\"iconsmall\" style=\"width: 30px; height: 30px;\"><a target=\"_blank\" class=\"resourcelink\" data-objectid=\"370\" href=\"https://www.youtube.com/playlist?list=PLG6iFkLydgaqBCtH6jjMgrUQnernI1Yqh\">Student Projects in Statistics There was one on Harry Potter which was amazingly creative. </a> </div><a class=\"bookmarklink\" id=\"bookmarklink_318\" data-objectid=\"318\" data-bookmark=\"\" data-action=\"bookmark\" style=\"cursor: pointer;\"></a></div></div><div class=\"resource_item\"><div class=\"resource_body\"><div class=\"resource_title\"><img src=\"https://s3.amazonaws.com/fi-assets/icons/video.png\" alt=\"https://s3.amazonaws.com/fi-assets/icons/video.png\" class=\"iconsmall\" style=\"width: 30px; height: 30px;\"><a target=\"_blank\" class=\"resourcelink\" data-objectid=\"371\" href=\"https://www.youtube.com/playlist?list=PLG6iFkLydgaoelxWwAozvKOVWFcw3P0bi\">Student-created parodies about statistics The one to Soul Sister got me moving/</a> I think most people find Statistics dry and boring. Although it is hard to make learning entertaining all the time now and then would certainly be great. If nothing else it will may help keep students awake. 
<a target=\"_blank\" class=\"resourcelink\" data-objectid=\"371\" href=\"https://www.youtube.com/playlist?list=PLG6iFkLydgaoelxWwAozvKOVWFcw3P0bi\"></a></div><div class=\"a2a_kit a2a_default_style\"> <a aria-label=\"Share\" class=\"a2a_dd\" href=\"https://www.addtoany.com/share#url=https%3A%2F%2Fwww.youtube.com%2Fplaylist%3Flist%3DPLG6iFkLydgaoelxWwAozvKOVWFcw3P0bi&title=Student-created%20parodies%20about%20statistics&description=\"></a><span class=\"a2a_divider\"></span><a aria-label=\"Facebook\" rel=\"nofollow\" href=\"https://place.fi.ncsu.edu/\" target=\"_blank\" class=\"a2a_button_facebook\"><span class=\"a2a_img a2a_i_facebook\"></span></a><a aria-label=\"Twitter\" rel=\"nofollow\" href=\"https://place.fi.ncsu.edu/\" target=\"_blank\" class=\"a2a_button_twitter\"><span class=\"a2a_img a2a_i_twitter\"></span></a><a aria-label=\"Google+\" rel=\"nofollow\" href=\"https://place.fi.ncsu.edu/\" target=\"_blank\" class=\"a2a_button_google_plus\"><span class=\"a2a_img a2a_i_google_plus\"></span></a></div></div><div class=\"addtoany\" style=\"margin: 5px 0;\"> </div></div></div></section></div> </div> </div> </section> </div> "
## I prepare math teachers and couldn't agree more. Our pre-service teachers only take one statistics course and it isn't a methods course. I am going to try to incorporate some of the statistics methods into my technology course for pre-service math teachers.
## All these posts are relevant to my everyday teaching and the way I assess students and the material I use in class. I am often questioned about whether statistics is really mathematics and whether my class should count in the math department offerings. Of course it does because that is how statistics is labeled but teachers disagree. Clarity about how important this is and how it is and isn't like other mathematics classes is important. I am originally trained as a sociologist and find that even when teaching a \pure\" mathematics class I look for motivation or context to engage the students...after many years of teaching math I would not dare call myself a mathematician but when I get to delve into statistics I feel I integrate all my loves. "
## As a person who prepares math teachers I feel that we don't spend enough time with the pre-service teachers on Statistics. They only take one class and I am not convinced that class is giving them the tools needed to be a good Stats teacher. It is reassuring (although sad) that our university is not the only that doesn't provide enough guidance in statistics to our pre-service teachers.
## I totally agree. I teach both Alg 2 and AP Statistics. We don't teach anything beyond the basics of mean median and mode and most years stats is the first thing to get cut when there is a time crunch at the end of the year. We have to start from the very beginning in AP. I think one of the major road blocks is that teachers themselves don't know enough about statistics to know where to start when deciding what and how to teach statistics. We weren't required to take much as far as statistics is concerned and so the importance of the subject was never emphasized to us.
## I agree with you Janet. Our introductory course does teach some calculations but focuses on the concepts and interpretations. I have had students who have taken AP stats and through conversations I have learned that they enjoyed this class and learned a lot. I do believe that an AP stats course is different than an introductory stats course.
## I am very happy to have found this course. I am willing to admit that I am one of the educators ill-prepared to teach statistics. I have little to no experience and I have found it hard and expensive to get the training I need. I suffer from a lack of experience for two main reasons: my K-12 teachers had very little statistical knowledge and I was given the choice to take Statistics or not in college. I made the wrong choice. I just wanted to post this for all the others out there who may be feeling inadequate!
Looking at just the 10 posts returned, perhaps a better name for this topic would be Course Reflections on Teaching Statistics.
Unit Takeaway
In addition to some useful R packages and functions for the actual process of topic modeling, there are two main lessons I hope you take away from this walkthrough:
- Topic modeling requires a lot of decisions. Beyond deciding on a value for K, there are a number of key decisions that you have to make that can dramatically affect your results. For example, to stem or not to stem? What qualifies as a document? What flavor of topic modeling is best suited to your data and research questions? How many iterations to run?
- Topic modeling is as much art as (data) science. As Bail (2018) noted, the term “topic” is somewhat ambiguous, and topic models do not produce highly nuanced classification of texts. Once you’ve fit your model, interpreting it requires some mental gymnastics and ideally some knowledge of the context from which the data came. Moreover, the quantitative approaches for making the decisions highlighted above are imperfect, and a good deal of human judgment is required.
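To make the first point concrete, here is a minimal sketch of how two of those decisions (what counts as a document, and the value of K) show up explicitly in code. It assumes the dplyr, tidytext, and topicmodels packages are installed. The four-sentence toy corpus and the object names (`toy_posts`, `toy_dtm`, `candidate_k`, `fits`) are made up for illustration and are not the MOOC-Ed forum data.

```r
library(dplyr)
library(tidytext)
library(topicmodels)

# Hypothetical toy corpus standing in for forum posts (not the MOOC-Ed data)
toy_posts <- tibble(
  post_id = 1:4,
  text = c(
    "students collect data and ask statistical questions about data",
    "teachers need more statistics courses to teach statistics well",
    "students compare sample size and test results using data",
    "teaching statistics in school requires better teacher preparation"
  )
)

# Decision: what qualifies as a document? Here, one post = one document.
toy_dtm <- toy_posts %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  count(post_id, word) %>%
  cast_dtm(post_id, word, n)

# Decision: the value of K. Fit the same data with two candidate values,
# fixing the seed so that each fit is reproducible.
candidate_k <- c(2, 3)
fits <- lapply(candidate_k, function(k) {
  LDA(toy_dtm, k = k, control = list(seed = 588))
})
names(fits) <- paste0("k_", candidate_k)

# Compare the top terms side by side; which K reads best is a judgment call
lapply(fits, function(model) terms(model, 3))
```

With a corpus this small the topics themselves are meaningless; the point is only that every one of these choices is a line of code you can change and rerun.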
✅ Comprehension Check
Using the STM model you fit from the Section 3 [Comprehension Check] with a different value for K, use the approaches demonstrated in Section 4 to explore and interpret your topics and terms and revisit the following question:
# Ran LDA model with 5 topics
# find top 5 terms in the lda model
terms(forums_lda2, 5)
##      Topic 1  Topic 2  Topic 3    Topic 4      Topic 5
## [1,] "font"   "li"     "students" "students"   "data"
## [2,] "span"   "href"   "sample"   "statistics" "students"
## [3,] "style"  "strong" "data"     "school"     "task"
## [4,] "text"   "https"  "test"     "teach"      "questions"
## [5,] "normal" "target" "size"     "teaching"   "question"
# convert to a tidy data frame with Beta values
tidy_lda2 <- tidy(forums_lda2)
# visually interpret top terms for each topic
top_terms2 <- tidy_lda2 %>%
  group_by(topic) %>%
  slice_max(beta, n = 5, with_ties = FALSE) %>%
  ungroup() %>%
  arrange(topic, -beta)
td_beta2 <- tidy(forums_lda2)
td_beta2 %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  mutate(topic = paste0("Topic ", topic),
         term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(term, beta, fill = as.factor(topic))) +
  geom_col(alpha = 0.8, show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free_y") +
  coord_flip() +
  scale_x_reordered() +
  labs(x = NULL, y = expression(beta),
       title = "Highest word probabilities for each topic",
       subtitle = "Different words are associated with different topics")

td_gamma2 <- tidy(forums_lda2, matrix = "gamma")
# collapse the top 7 terms per topic into one comma-separated string
beta_terms2 <- td_beta2 %>%
  arrange(beta) %>%
  group_by(topic) %>%
  top_n(7, beta) %>%
  arrange(-beta) %>%
  select(topic, term) %>%
  summarise(terms = list(term)) %>%
  mutate(terms = map(terms, paste, collapse = ", ")) %>%
  unnest(cols = c(terms))
gamma_terms2 <- td_gamma2 %>%
  group_by(topic) %>%
  summarise(gamma = mean(gamma)) %>%
  arrange(desc(gamma)) %>%
  left_join(top_terms, by = "topic") %>%
  mutate(topic = paste0("Topic ", topic),
         topic = reorder(topic, gamma))
gamma_terms2 %>%
  select(topic, gamma, terms) %>%
  kable(digits = 3,
        col.names = c("Topic", "Expected topic proportion", "Top 7 terms"))
| Topic | Expected topic proportion | Top 7 terms |
|---|---|---|
| Topic 4 | 0.436 | statistics, math, mathematics, students, common, teaching, teach |
| Topic 5 | 0.293 | age, tv, coasters, roller, steel, coaster, hours |
| Topic 3 | 0.160 | students, kids, understand, standard, test, box, deviation |
| Topic 2 | 0.079 | resources, statistics, teaching, unit, mooc, ideas, learning |
| Topic 1 | 0.032 | dice, trials, fair, sasi, level, framework, sophistication |
- Now that you have a little more context, how might you revise your initial interpretation of some of the latent topics or latent themes from your model?
- I refit the model with K set to 10 and then to 5. After fitting the 10-topic model, I inspected it with LDAvis. There was still overlap, so I reduced K to 5. Although the 10-topic STM showed overlap, the 5-topic LDA model was too hard to interpret. Ten topics are easier to make sense of, and that number of topics can be supported quantitatively. My tea-leaf interpretation of the 5-topic LDA model:
- Topic 4 - Teaching Students statistics using the common core
- Topic 5 - A popular data set is about Steel Roller Coasters
- Topic 3 - Students can understand Standard Deviation by testing in a Box Plot.
- Topic 2 - Educators sharing resources and ideas in the learning MOOCs
- Topic 1 - Items for learning and teaching statistics
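The overlap mentioned in this answer can also be checked numerically rather than only by eye. The following is a minimal sketch of one simple approach, not what LDAvis itself computes: it takes the top-7 term lists from the table above as plain character vectors and calculates the Jaccard overlap (shared terms divided by all distinct terms) for each pair of topics. The names `top_terms_by_topic`, `jaccard`, and `overlap` are made up for illustration, and only base R is used.

```r
# Top-7 terms per topic, copied from the table above
top_terms_by_topic <- list(
  `Topic 1` = c("dice", "trials", "fair", "sasi", "level",
                "framework", "sophistication"),
  `Topic 2` = c("resources", "statistics", "teaching", "unit", "mooc",
                "ideas", "learning"),
  `Topic 3` = c("students", "kids", "understand", "standard", "test",
                "box", "deviation"),
  `Topic 4` = c("statistics", "math", "mathematics", "students", "common",
                "teaching", "teach"),
  `Topic 5` = c("age", "tv", "coasters", "roller", "steel",
                "coaster", "hours")
)

# Jaccard overlap: size of the intersection divided by size of the union
jaccard <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

# Pairwise overlap matrix (rows and columns are topics)
topic_names <- names(top_terms_by_topic)
overlap <- sapply(topic_names, function(i) {
  sapply(topic_names, function(j) {
    jaccard(top_terms_by_topic[[i]], top_terms_by_topic[[j]])
  })
})

round(overlap, 2)
```

For these term lists, the only nonzero off-diagonal values are between Topics 2 and 4 (which share "statistics" and "teaching") and between Topics 3 and 4 (which share "students"). Top-term overlap is a coarse signal, so it complements rather than replaces reading the posts themselves.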
