This demo is designed to give a feel for the type of analysis that is possible with text mining approaches. The caveat is that we are still developing a process for building a corpus. The goal is to use citation analysis to identify highly influential publications in sub-domain areas. Currently this is an exploratory process to see whether the approach could prove useful; it will also serve as a tool for experts to use in developing the schema.
We are going to use data sourced from an ERIC publication search focused on math interventions at the K-12 level. This is a sub-sample of 128 articles that will be used to build the topic model.
The word content, which serves as the data in this example, is pulled from the abstracts, titles, and keywords listed in the ERIC search results for math interventions delivered in elementary schools.
Below is the result of a good amount of data cleaning: essentially one long list of tokenized elements at the "word" level. We can use this list of words to determine which words, and then which phrases, are most "important" to these journal articles.
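As a rough illustration, a count like the one below can be produced with tidytext. The object names here (`eric_raw`, `text`) are placeholders for the cleaned export, and the stop-word removal shown stands in for the fuller cleaning that was actually applied.

# a minimal sketch, assuming the ERIC export lives in `eric_raw` with an
# `otherid` document id and a `text` column (title + abstract + keywords)
library(dplyr)
library(tidytext)

xx <- eric_raw %>%
  unnest_tokens(word, text) %>%         # tokenize to one word per row
  anti_join(stop_words, by = "word")    # one of several cleaning steps

xx %>%
  count(word, sort = TRUE) %>%          # most frequent words in the corpus
  head(20)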
## # A tibble: 20 x 2
## `xx$word` n
## <chr> <int>
## 1 intervention 610
## 2 mathematics 568
## 3 students 539
## 4 grade 254
## 5 instruction 229
## 6 school 224
## 7 elementary 179
## 8 learning 165
## 9 study 141
## 10 control 124
## 11 achievement 115
## 12 effects 113
## 13 skills 111
## 14 student 111
## 15 math 109
## 16 teaching 99
## 17 solving 96
## 18 risk 94
## 19 interventions 91
## 20 2 89
We can also take a look at the data through two- or three-word phrases, called n-grams.
# two-word phrases (bigrams); unnest_tokens() comes from tidytext,
# datatable() from the DT package
xx_ngrams <- xx %>%
  unnest_tokens(word, word, token = "ngrams", n = 2)
datatable(xx_ngrams)

# three-word phrases (trigrams)
xx_ngrams_3 <- xx %>%
  unnest_tokens(word, word, token = "ngrams", n = 3)
datatable(xx_ngrams_3)
Much of what is above is exploratory; now we can build a TF-IDF model to assess the importance of these words in the context of each document. "The statistic tf-idf is intended to measure how important a word is to a document in a collection (or corpus) of documents, for example, to one novel in a collection of novels or to one website in a collection of websites" (source: Tidy Text Mining, https://www.tidytextmining.com/tfidf.html).
## # A tibble: 8 x 3
## otherid word n
## <chr> <chr> <int>
## 1 EJ1182560 the 24
## 2 EJ1222664 the 21
## 3 EJ1158816 the 19
## 4 EJ1072506 the 18
## 5 EJ1092098 the 18
## 6 EJ1116305 the 18
## 7 EJ1182560 of 18
## 8 EJ1222664 and 18
## # A tibble: 8 x 2
## otherid total
## <chr> <int>
## 1 ED545392 59
## 2 ED552820 217
## 3 ED557355 30
## 4 ED570289 39
## 5 ED572835 236
## 6 ED578216 99
## 7 ED595063 216
## 8 ED595127 219
## Joining, by = "otherid"
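The counts and totals above feed the TF-IDF calculation. A minimal sketch of that step, assuming the per-document tokens sit in a data frame called `xx2` keyed by `otherid` (the name is illustrative):

library(dplyr)
library(tidytext)

# per-document word counts (the 8 x 3 tibble above)
article_words <- xx2 %>%
  count(otherid, word, sort = TRUE)

# total words per document (the 8 x 2 tibble above)
total_words <- article_words %>%
  group_by(otherid) %>%
  summarize(total = sum(n))

# join the totals back on and compute tf, idf, and tf-idf per word/document
article_words <- left_join(article_words, total_words) %>%
  bind_tf_idf(word, otherid, n)

article_words %>% arrange(desc(tf_idf))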
Now that we’ve done some initial analysis focused on word importance, we can move on to topic modeling to see whether patterns emerge across documents. Per-topic-per-word probabilities, called β (“beta”), will be generated; a minimal sketch of the model fit follows the per-document counts below.
* Every document is a mixture of topics. We imagine that each document may contain words from several topics in particular proportions. For example, in a two-topic model we could say “Document 1 is 90% topic A and 10% topic B, while Document 2 is 30% topic A and 70% topic B.”
* Every topic is a mixture of words. For example, we could imagine a two-topic model of American news, with one topic for “politics” and one for “entertainment.” The most common words in the politics topic might be “President”, “Congress”, and “government”, while the entertainment topic may be made up of words such as “movies”, “television”, and “actor”. Importantly, words can be shared between topics; a word like “budget” might appear in both equally.
## Joining, by = "word"
## # A tibble: 15 x 3
## otherid word n
## <chr> <chr> <int>
## 1 EJ1257093 students 15
## 2 EJ1230022 word 14
## 3 EJ1257093 md 13
## 4 EJ1099265 learning 12
## 5 EJ1168275 students 12
## 6 EJ1079390 grade 11
## 7 EJ1112672 guided 11
## 8 EJ1246045 grade 11
## 9 ED595322 interleaved 10
## 10 EJ1049576 word 10
## 11 EJ1099265 students 10
## 12 EJ1115270 students 10
## 13 EJ1158172 skills 10
## 14 EJ1184246 strategies 10
## 15 EJ1196106 fractions 10
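A minimal sketch of the model fit, assuming the stop-word-filtered counts above are in a data frame called `word_counts` with columns `otherid`, `word`, and `n` (names are placeholders):

library(topicmodels)
library(tidytext)

# cast the per-document word counts to a document-term matrix
article_dtm <- word_counts %>%
  cast_dtm(otherid, word, n)

# fit a two-topic LDA model (k = 3 for the three-topic version)
article_lda <- LDA(article_dtm, k = 2, control = list(seed = 1234))

# per-topic-per-word probabilities (beta)
article_topics <- tidy(article_lda, matrix = "beta")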
We’ve generated models based on two and three topics, so now we will compare the per-topic probabilities (β) of the top 10 terms for each topic.
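One way to line the topics up, sketched here with the objects from the fit above (ggplot2 and tidytext assumed to be loaded):

library(dplyr)
library(ggplot2)
library(tidytext)

# top 10 terms per topic, ranked by beta
top_terms <- article_topics %>%
  group_by(topic) %>%
  slice_max(beta, n = 10) %>%
  ungroup() %>%
  arrange(topic, -beta)

# side-by-side bar charts of the top terms in each topic
top_terms %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(beta, term, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  scale_y_reordered()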
We can also estimate the total percentage of words generated by each topic. We can see there is a fairly clean split, with each article belonging largely to either topic 1 or topic 2.
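The per-document-per-topic proportions, called γ (“gamma”), give that split; a minimal sketch using the model object from above:

# gamma: the estimated share of each document's words generated by each topic
article_gamma <- tidy(article_lda, matrix = "gamma")

article_gamma %>%
  arrange(document, topic)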
We can also explore adding more topics and assessing the ideal number for this particular corpus.
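One rough way to do that is to refit the model across several values of k and compare perplexity. This sketch scores the same corpus the model was fit on, so a held-out split (or a package such as ldatuning) would give a more rigorous answer:

library(topicmodels)

# refit for several candidate numbers of topics and compare perplexity
ks <- c(2, 3, 4, 5, 6)
perp <- sapply(ks, function(k) {
  fit <- LDA(article_dtm, k = k, control = list(seed = 1234))
  perplexity(fit, article_dtm)
})
data.frame(k = ks, perplexity = perp)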