Introduction

Corpus Building


Model Building With Math Symbols

What is LDA?

  • Latent Dirichlet Allocation (LDA) is a type of topic model: it takes in a collection of papers and helps us understand which underlying topics exist within that collection.
  • In other words, we are trying to create fuzzy clusters of papers based on the words that appear in those papers.
  • The “topics” are sets of words and the corresponding likelihoods that each word appears within a particular topic
  • It is up to us to look at the list of words LDA uses to describe a topic and decide what the topic means.
  • It’s also important to point out that we are creating “fuzzy clusters”: a specific word doesn’t have to belong to one cluster or another. Instead, each cluster is defined by how likely each word is to show up in it.
  • Ex: Later in our discussion the word “school” appears in both topics, but it is more likely to appear in topic 1 (0.016) than in topic 2 (0.008).
  • We are trying to find out whether there are latent, or hidden, structural differences between the papers in ERIC and the papers approved by the WWC. If there are, we would expect papers from the two groups to fall naturally into two different topics.
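
To make the idea of fuzzy clusters concrete, here is a minimal Python sketch. Only the two “school” probabilities come from the example above; the other words and values, and the helper name `topic_membership`, are hypothetical and are not taken from our model.

```python
# Toy topics: each topic maps words to the probability of that word under the topic.
# Only the "school" values come from the text; every other entry is invented.
topics = {
    "topic_1": {"school": 0.016, "child": 0.020, "teacher": 0.012},
    "topic_2": {"school": 0.008, "treatment": 0.018, "control": 0.015},
}


def topic_membership(word, topics):
    """Return the probability of `word` under every topic (0.0 if absent)."""
    return {name: probs.get(word, 0.0) for name, probs in topics.items()}


membership = topic_membership("school", topics)
# The word belongs to both clusters; it is simply more likely in one of them.
likeliest = max(membership, key=membership.get)
print(membership)
print(likeliest)
```

The point of the sketch is that a word is never assigned to a single topic; it carries a probability in each one.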

Structural Changes and Data Cleaning

  • Tokenizing: breaking the sentences in the titles and abstracts down into individual words
  • Removing stopwords: taking out frequently used English words like “the” or “and”
  • Getting frequency counts during exploratory data analysis
  • Lemmatization: reducing the different forms of a word to a single base form - for example, “ran” and “run” would both be condensed down to just “run”

  • The three boxes below show the results of data cleaning. Boxes one and two show the original structure of the data for a paper within each corpus; after lemmatizing, combining, and taking out punctuation and quotation marks, we are left with box three, which is what we eventually feed into our model. By doing this, words like “randomize” and “randomized” are combined, so different forms of the same word are not double or triple counted.

## [1] "The Effects of Math Video Games on Learning: A Randomized Evaluation Study with Innovative Impact Estimation Techniques. CRESST Report 841"
## [1] "A large-scale randomized controlled trial tested the effects of researcher-developed learning games on a transfer measure of fractions knowledge. The measure contained items similar to standardized assessments. Thirty treatment and 29 control classrooms (~1500 students, 9 districts, 26 schools) participated in the study. Students in treatment classrooms played fractions games and students in the control classrooms played solving equations games. Multilevel multidimensional item response theory modeling of the outcome measure produced scaled scores that were more sensitive to the instructional treatment than standard measurement approaches. Hierarchical linear modeling of the scaled scores showed that the treatment condition performed significantly higher on the outcome measure than the control condition. The effect (d = 0.58) was medium to large (Cohen, 1992). Two appendices are included: (1) Descriptive Statistics of Pretest and Posttest Scores by Schools and Conditions; and (2) Summary of Efficacy Trial Procedures."
## [1] "The effect of Math Video game on learn A randomize Evaluation Study with Innovative Impact Estimation technique CRESST Report 841 A large scale randomize control trial test the effect of researcher develop learn game on a transfer measure of fraction knowledge The measure contain item similar to standardize assessment Thirty treatment and 29 control classroom ~1500 student 9 district 26 school participate in the study student in treatment classroom play fraction game and student in the control classroom play solve equation game Multilevel multidimensional item response theory model of the outcome measure produce scale score that be much sensitive to the instructional treatment than standard measurement approach Hierarchical linear model of the scale score show that the treatment condition perform significantly high on the outcome measure than the control condition The effect have = 0 58 be medium to large Cohen 1992 Two appendix be include 1 Descriptive statistic of Pretest and Posttest score by school and condition and 2 Summary of Efficacy Trial procedure ED555700"
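
The cleaning steps listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline that produced the boxes: the stopword set and the lemma table are tiny hand-written stand-ins, the sketch lowercases and drops digits (which box three does not), and the names `tokenize` and `clean` are hypothetical.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny resources. A real pipeline would use a full
# stopword list and a proper lemmatizer instead of these hand-written tables.
STOPWORDS = {"the", "and", "of", "on", "a"}
LEMMAS = {"ran": "run", "randomized": "randomize", "games": "game", "effects": "effect"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def clean(text):
    """Tokenize, drop stopwords, and map each token to its lemma if one is known."""
    kept = [token for token in tokenize(text) if token not in STOPWORDS]
    return [LEMMAS.get(token, token) for token in kept]


title = "The Effects of Math Video Games on Learning: A Randomized Evaluation Study"
cleaned = clean(title)
print(cleaned)

# Frequency counts after lemmatization: inflected forms collapse into one entry.
counts = Counter(clean("ran run randomized randomize"))
print(counts)
```

Counting after lemmatization is what keeps “randomize” and “randomized” from being tallied as two different words.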

Finding 1

  • Table 1 shows the probability ("beta") of a given word under topic one and under topic two. A word’s probabilities do not need to add up to one across topics; a higher "beta" value signifies that the word is more likely to appear in that topic. Ex. "intervention" will show up more in topic two than in topic one, but only slightly.

Table 1

Finding 2

  • Tables 2 and 3 convey the same situation in different formats. In the dataframe, we can view the top 15 words for each topic based on the probabilities that they occur, with some words likely to show up in both topics. Between topic one and topic two, many of these words align; “student”, “mathematics”, and “intervention”, for example, are common to both topics. On the other hand, common words in topic two include “treatment”, “control”, and “knowledge”, whereas topic one includes “child”, “impact”, and “achievement”. As we will discuss, topic two is more strongly associated with papers from WWC. This seems to show that topic two is associated more with the analytical (or methodology) side of research, and that topic one is associated more with the instructional (or summary) side of research.
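
As a rough illustration of how top words and their overlap can be read off a table of per-topic word probabilities, consider the Python sketch below. The words are the ones named above, but every numeric value is invented for the example, and `top_words` is a hypothetical helper rather than part of our analysis code.

```python
# Hypothetical per-topic word probabilities. The words come from the discussion
# above; the numeric values are invented for illustration only.
beta = {
    1: {"student": 0.030, "mathematics": 0.020, "intervention": 0.010,
        "child": 0.015, "impact": 0.012, "achievement": 0.011},
    2: {"student": 0.028, "mathematics": 0.018, "intervention": 0.011,
        "treatment": 0.016, "control": 0.014, "knowledge": 0.013},
}


def top_words(word_probs, n):
    """Return the n most probable words of one topic, most probable first."""
    ranked = sorted(word_probs.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked[:n]]


top = {topic: top_words(probs, 6) for topic, probs in beta.items()}
shared = set(top[1]) & set(top[2])   # words in the top list of both topics
only_one = set(top[1]) - shared      # top words seen only in topic one
only_two = set(top[2]) - shared      # top words seen only in topic two
print(sorted(shared), sorted(only_one), sorted(only_two))
```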

Table 2

Table 3

Finding 3

  • Tables 4 and 5 display words unique to each topic. These are often words that come up frequently in a paper strongly associated with topic one or topic two (but not both).

  • In table 5, we take a log transformation of the probability of topic two over the probability of topic one for each word in order to visualize the distribution of probabilities for these words. Values on the x-axis greater than zero correspond to topic two, and values less than zero correspond to topic one. As shown, examples of unique words include "parent", "instructor", and "mentor" in topic one and "schema", "numerical", and "conceptual" in topic two. Because these words are unique to each topic, none of them appears more than once.
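
The log ratio described above can be sketched as follows. The base of the logarithm is not stated in this document, so base 2 is an assumption here (it makes the value read as "number of doublings"), and `log_ratio` is a hypothetical helper. The input values are the “school” probabilities quoted earlier in the LDA overview.

```python
import math


def log_ratio(beta_topic_one, beta_topic_two):
    """log2(beta_two / beta_one): positive leans topic two, negative leans topic one.

    Base 2 is an assumption made for this sketch, not a detail from the report.
    """
    return math.log2(beta_topic_two / beta_topic_one)


# "school": 0.016 in topic one, 0.008 in topic two (values quoted in the text).
school = log_ratio(0.016, 0.008)
leans = "topic two" if school > 0 else "topic one"
print(school, leans)
```

A word that is equally likely in both topics would score exactly zero, which is why zero separates the two sides of the x-axis.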

Table 4

Table 5

Table 6

Finding 4

  • Table 7 shows that for papers in our non-WWC corpus, the mean proportion of allocation to topic one is 0.493, while the mean proportion of allocation to topic two is 0.507. For papers in our WWC corpus, the mean proportion of allocation to topic one is 0.269, and the mean proportion of allocation to topic two is 0.731. In other words, a paper in our non-WWC corpus is, on average, allocated about one-half to topic one and one-half to topic two, whereas a paper in our WWC corpus is, on average, allocated about one-quarter to topic one and three-quarters to topic two. With this model, we find a relationship between topic and corpus, which leads us to conclude that there may be structural differences between the papers in our WWC corpus and the papers in our non-WWC corpus.
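
A summary like the one in Table 7 amounts to grouping the per-paper topic proportions by corpus and averaging them. The Python sketch below shows that calculation on made-up numbers; the five papers and their proportions are hypothetical and are not our data, and `mean_topic_two_by_corpus` is an illustrative name.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (corpus, topic-two proportion) pairs, one per paper.
papers = [
    ("WWC", 0.80), ("WWC", 0.70), ("WWC", 0.60),
    ("non-WWC", 0.55), ("non-WWC", 0.45),
]


def mean_topic_two_by_corpus(papers):
    """Average the topic-two proportion of the papers within each corpus."""
    grouped = defaultdict(list)
    for corpus, gamma_two in papers:
        grouped[corpus].append(gamma_two)
    return {corpus: mean(values) for corpus, values in grouped.items()}


topic_two_means = mean_topic_two_by_corpus(papers)
# With only two topics, each paper's proportions sum to one, so the topic-one
# mean is the complement of the topic-two mean.
topic_one_means = {corpus: 1 - m for corpus, m in topic_two_means.items()}
print(topic_two_means, topic_one_means)
```

The same complement relationship holds in Table 7: 0.493 + 0.507 = 1 and 0.269 + 0.731 = 1.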

Table 7

Table 8

Finding 5

  • Histogram 1 shows the distribution of proportions of allocation to topic one (the red bars) and to topic two (the blue bars) for the papers in our WWC corpus. At the upper end of the x-axis, there are many high topic two proportions and few high topic one proportions. At the lower end of the x-axis, there are many low topic one proportions and few low topic two proportions. Since the topic one and topic two counts differ in the bins at the upper and lower ends, this shows that papers in our WWC corpus tend to have a higher allocation to topic two and a lower allocation to topic one.

  • Histogram 2 shows the distribution of proportions of allocation to topic one (the red bars) and to topic two (the blue bars) for the papers in our non-WWC corpus. At the upper end of the x-axis, there are about as many high topic one proportions as high topic two proportions. At the lower end of the x-axis, there are about as many low topic two proportions as low topic one proportions. Since the topic one and topic two counts are about the same in all bins, this shows that papers in our non-WWC corpus are allocated to topic one about as much as to topic two.

  • These histograms also lead to the conclusion that there seem to be structural differences between the papers in our WWC corpus and the papers in our non-WWC corpus. This distribution of probabilities makes sense because the non-WWC papers were not rejected by WWC; they simply were not reviewed by WWC. This means that some of these non-WWC papers may share structural similarities with the WWC papers that were reviewed and approved. This subset of non-WWC papers would be the ones allocated to topic two along with most of the WWC papers.
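
For readers who want to see what the histogram bars count, the following Python sketch bins per-paper proportions into ten equal-width bins. The proportions are hypothetical and chosen to mimic the WWC pattern; the bin count of ten and the helper name `bin_counts` are assumptions made for the illustration.

```python
def bin_counts(values, n_bins=10):
    """Count values in [0, 1] that fall into each of n_bins equal-width bins."""
    counts = [0] * n_bins
    for value in values:
        index = min(int(value * n_bins), n_bins - 1)  # a value of 1.0 goes in the last bin
        counts[index] += 1
    return counts


# Hypothetical topic-two proportions for five papers, skewed toward topic two.
gamma_two = [0.95, 0.85, 0.92, 0.45, 0.75]
gamma_one = [1 - g for g in gamma_two]  # two topics, so the proportions are complements

topic_two_bars = bin_counts(gamma_two)
topic_one_bars = bin_counts(gamma_one)
print(topic_two_bars)
print(topic_one_bars)
```

When the topic two bars pile up on the right and the topic one bars pile up on the left, as in this toy input, the two distributions mirror each other in the way described for Histogram 1.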

Histogram 1

Histogram 2