Introduction

This project goes through the following steps to conduct topic modeling on discussion forums from the Massive Open Online Courses for Educators (MOOC-Ed) and Online Professional Learning programs of the Friday Institute for Educational Innovation at North Carolina State University.

  1. Prepare: Prior to analysis, we’ll take a quick look at some of the related MOOC-Ed research and evaluation work to gain some context for our analysis. This should aid in the interpretation of our results and help guide some decisions as we tidy, model, and visualize our data.
  2. Wrangle: In section 2 we revisit tidying and tokenizing text using the tidytext package and are also introduced to the stm package. This package uses the tm text mining package to preprocess text and will also be our first introduction to word stemming.
  3. Model: We take a look at two different approaches to topic modeling: Latent Dirichlet Allocation (LDA) and Structural Topic Modeling (STM), which is very similar to LDA but can use metadata about documents to improve the assignment of words to “topics” in a corpus and examine relationships between topics and covariates. 
  4. Explore: To further explore the results of our topic model, we use several handy functions from the topicmodels and stm packages, including the findThoughts function for viewing documents assigned to a given topic and the toLDAvis function for exploring topic and word distributions.

1. PREPARE

1a. Context

Data Source & Analysis

All peer interaction, including peer discussion, takes place within the discussion forums of MOOC-Eds, which are hosted using the Moodle Learning Management System. To build the dataset, the research team wrote a query for Moodle’s MySQL database, which records participants’ user logs of activity in the online forums. This SQL query combines separate database tables containing postings and comments, including participant IDs, timestamps, discussion text, and other attributes or “metadata.”

Summary of Key Findings

The following highlight some key findings related to the discussion forums in the papers cited above:

  1. MOOCs designed specifically for K-12 teachers can provide positive self-directed learning experiences and rich engagement in discussion forums that help form online communities for educators.
  2. Analysis of discussion forum data in TSDI provided a very clear picture of how enthusiastic many PLT members and leaders were to talk to others in the online community. They posed their questions and shared ideas with others about teaching statistics throughout the units, even though they were also meeting synchronously several times with their colleagues in small group PLTs.
  3. Findings on knowledge construction demonstrated that over half of the discussions in both courses moved beyond sharing information and statements of agreement and entered a process of dissonance, negotiation, and co-construction of knowledge, but seldom moved beyond this phase to one in which new knowledge was tested or applied. These findings echo similar research on difficulties in promoting knowledge construction in online settings.
  4. Topic modeling provides more interpretable and cohesive models for discussion forums than other popular unsupervised modeling techniques such as k-means and k-medoids clustering algorithms.

1b. Guiding Questions

What are the similarities and differences between how PLT members and Non-PLT online participants engage and meet course goals in a MOOC-Ed designed for educators in secondary and collegiate settings?

What ideas or issues emerged in the discussion forums this past week?

How do we quantify what a document or collection of documents is about?

1c. Set Up

The following packages were loaded:

library(tidyverse)
library(tidytext)
library(SnowballC)
library(topicmodels)
library(stm)
library(ldatuning)
library(knitr)
library(LDAvis)

2. WRANGLE

2a. Import Forum Data

2b. Cast a Document Term Matrix

Tidy Text

  1. Transform text into “tokens”
  2. Remove unnecessary characters, punctuation, and whitespace
  3. Convert all text to lowercase
  4. Remove stop words such as “the”, “of”, and “to”
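
As a rough illustration of the steps above, the following sketch tokenizes a tiny made-up data frame with tidytext. The `forums` tibble and its `post_id` and `post_content` columns are hypothetical stand-ins, not the actual forum data or its column names.

```r
library(tidyverse)
library(tidytext)

# Hypothetical toy data standing in for the forum posts;
# the real data set and its column names may differ.
forums <- tibble(
  post_id = c(1, 2),
  post_content = c("The students love collecting real data!",
                   "Teaching statistics with simulations is engaging.")
)

forums_tidy <- forums %>%
  # unnest_tokens() splits text into one word per row; by default it also
  # lowercases the tokens and strips punctuation
  unnest_tokens(output = word, input = post_content) %>%
  # remove common stop words using the stop_words lexicon shipped with tidytext
  anti_join(stop_words, by = "word")

forums_tidy
```

Because `unnest_tokens()` handles lowercasing and punctuation by default, steps 1 through 3 collapse into a single call in this sketch, with `anti_join()` covering step 4.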

Create a Document Term Matrix
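
One way this step might look, assuming a tidy one-word-per-row table like the one produced by the tidy text steps, is to count words per document and cast the counts with tidytext's `cast_dtm()`. The `forums_tidy` table below is a hypothetical toy example, not the project data.

```r
library(tidyverse)
library(tidytext)

# Hypothetical tidy text: one row per word occurrence per post.
forums_tidy <- tibble(
  post_id = c(1, 1, 1, 2, 2),
  word    = c("students", "data", "data", "statistics", "students")
)

forums_dtm <- forums_tidy %>%
  # count how often each word appears in each post
  count(post_id, word) %>%
  # cast the counts into a DocumentTermMatrix (requires the tm package)
  cast_dtm(document = post_id, term = word, value = n)

forums_dtm
```

The resulting object is the sparse document-by-term format that `topicmodels::LDA()` expects as input.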

2c. Stem Tidy Text
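
A minimal sketch of stemming tidy text, assuming the `wordStem()` function from the SnowballC package loaded in the set-up. The word list is made up for illustration.

```r
library(tidyverse)
library(SnowballC)

# Hypothetical tokens; in practice this would be the tidied forum text.
stem_tidy <- tibble(word = c("teaching", "activities", "engaged", "students")) %>%
  # wordStem() applies the Porter stemmer by default,
  # reducing each word to its stem
  mutate(stem = wordStem(word))

stem_tidy
```

Stemming merges inflected forms of a word into a single term, which shrinks the vocabulary but also produces truncated stems that can be harder to read when interpreting topics.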

3. MODEL

Topic modeling, an unsupervised learning approach that automatically identifies topics in a collection of documents, was conducted.

  1. Fit a topic model with LDA.
  2. Chose K = 20.
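
The fitting step might be sketched as follows. Since the forum data is not reproduced here, this example substitutes a 100-document slice of the `AssociatedPress` document term matrix that ships with topicmodels; the seed value is arbitrary, and the object name `forums_lda` is borrowed from the discussion in section 4 purely for illustration.

```r
library(topicmodels)

# Stand-in corpus: the first 100 documents of the AssociatedPress
# DocumentTermMatrix bundled with topicmodels (not the forum data).
data("AssociatedPress", package = "topicmodels")
ap_dtm <- AssociatedPress[1:100, ]

# Fit an LDA model with K = 20 topics; the seed is an arbitrary choice
# that makes the fit reproducible.
forums_lda <- LDA(ap_dtm, k = 20, control = list(seed = 588))

forums_lda
```

The ldatuning package loaded in the set-up offers metrics for comparing candidate values of K, though the final choice usually also rests on how interpretable the resulting topics are.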

4. EXPLORE

4a. Explored Beta Values
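
Beta values are the per-topic word probabilities. A sketch of pulling out the top terms per topic with tidytext's `tidy()` method follows; it fits a small 4-topic model on a slice of the bundled `AssociatedPress` data as a stand-in, so the topics shown are not those of the forum model.

```r
library(tidyverse)
library(tidytext)
library(topicmodels)

# Stand-in model: a small LDA fit on bundled data, not the forum model.
data("AssociatedPress", package = "topicmodels")
toy_lda <- LDA(AssociatedPress[1:100, ], k = 4, control = list(seed = 588))

top_terms <- tidy(toy_lda, matrix = "beta") %>%  # one row per topic-term pair
  group_by(topic) %>%
  slice_max(beta, n = 5, with_ties = FALSE) %>%  # five most probable terms per topic
  ungroup() %>%
  arrange(topic, desc(beta))

top_terms
```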

4b. Explored Gamma Values
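
Gamma values are the per-document topic proportions. The sketch below extracts them with `tidy()` from a small stand-in model fit on the bundled `AssociatedPress` data (again, not the forum model) and lists the documents most strongly associated with a single topic.

```r
library(tidyverse)
library(tidytext)
library(topicmodels)

# Stand-in model: a small LDA fit on bundled data, not the forum model.
data("AssociatedPress", package = "topicmodels")
toy_lda <- LDA(AssociatedPress[1:100, ], k = 4, control = list(seed = 588))

# One row per document-topic pair; gamma is the estimated share of the
# document's words generated by that topic.
td_gamma <- tidy(toy_lda, matrix = "gamma")

# Documents most strongly associated with topic 1
td_gamma %>%
  filter(topic == 1) %>%
  arrange(desc(gamma)) %>%
  head(5)
```

Within each document the gamma values sum to 1, so they can be read as a breakdown of the document across topics.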

4c. Read the Tea Leaves

  • Teaching Statistics: Unsurprisingly, given the course title, the most prevalent topics in both the forums_stm and forums_lda models contain the terms “teach”, “students”, and “statistics”. This could be an “overarching theme”, but more likely it is simply residue of the course title sprinkled throughout the forums, and it deserves some follow-up. Topic 8 from the LDA model may overlap with this topic as well.
  • Course Utility: The second most prevalent topics (13 and 2 in the LDA and STM models, respectively) seem to be about the usefulness of course “resources” like lessons, tools, videos, and activities. I’m wagering this might be a forum dedicated to course feedback. Topic 15 from the STM model also suggests this may be a broader theme.
  • Using Real-World Data: Topics 18 & 12 from the LDA model particularly intrigue me, and I’m wagering they reflect fairly positive sentiment among participants about the value and benefit of having students collect and analyze real data sets (e.g. Census data in Topic 1) and work on projects relevant to their real lives. Will definitely follow up on this one.
  • Technology Use: Several topics (6 & 11 from LDA and 8 & 19 from STM) appear to be about student use of technology and software like calculators and Excel for teaching statistics and using simulations. Topic 16 from LDA also suggests the use of the Common Online Data Analysis Platform (CODAP).
  • Student Struggle & Engagement: Topic 15 from LDA and Topic 16 from STM also intrigue me and appear to be two opposite sides of perhaps the same coin. The former includes “struggle” and “reading”, which suggests perhaps a barrier to teaching statistics, while Topic 16 contains top stems like “engage”, “activ”, and “think” and may suggest that participants anticipate activities will engage students.

To serve as a check on my tea leaf reading, I’m going to follow Bail’s recommendation to examine some of these topics qualitatively. The stm package has another useful, though exceptionally fussy, function called findThoughts, which extracts passages from documents within the corpus associated with topics that you specify.
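
A hedged sketch of calling findThoughts is shown below. It uses the `gadarian` survey responses and the pre-fit `gadarianFit` model that ship with stm as stand-ins, since the forum model is not reproduced here. Much of the fussiness comes from the `texts` argument, which must line up one-to-one with the documents the model was fit on.

```r
library(stm)

# Stand-ins bundled with stm: gadarian (open-ended survey responses) and
# gadarianFit (a pre-fit STM on those responses). Not the forum data.
thoughts <- findThoughts(
  gadarianFit,
  texts  = gadarian$open.ended.response,  # must align with the modeled documents
  topics = 1,                             # topic(s) to pull passages for
  n      = 2                              # number of passages per topic
)

# The returned list holds the passages in $docs and their row positions in $index
thoughts$docs[[1]]
```

If documents were dropped during preprocessing, passing the original, unfiltered text vector will fail or misalign, so the texts should be taken from the same filtered metadata used to fit the model.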