This demo is designed to give a feel for the type of analysis that is possible with text mining approaches. The caveat is that we are using a really small data set, one that has already been processed through a citation analysis meant to identify highly influential publications in sub-domain areas. This is really just an exploratory process to see if this approach could prove useful.
We are going to use data sourced from an ERIC publication search focused on math interventions at the K-12 level. This is a sub-sample that includes some articles that are very specific in topic and others that are a bit more vague. This was done on purpose, to see what topics might emerge from a sample that was heterogeneous in approach.
The word content, which will be the data in this example, is pulled from the abstracts, titles, and keywords listed in the ERIC search results for math interventions delivered in elementary schools.
Below is the result of a good amount of data cleaning: essentially one long list of tokenized elements at the “word” level. We can use this list of words to determine which words, and later which phrases, are most “important” to these journal articles.
## # A tibble: 20 x 2
## `xx$word` n
## <chr> <int>
## 1 intervention 57
## 2 students 41
## 3 mathematics 40
## 4 school 21
## 5 grade 19
## 6 math 19
## 7 instruction 18
## 8 knowledge 15
## 9 achievement 14
## 10 research 13
## 11 student 13
## 12 elementary 12
## 13 program 11
## 14 content 9
## 15 effectiveness 9
## 16 3 8
## 17 6 8
## 18 interventions 8
## 19 learning 8
## 20 mathematical 8
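For context, here is a minimal sketch of the kind of cleaning that could produce a count like the one above, assuming the combined abstract/title/keyword text sits in a data frame called `eric_raw` with a `text` column (both names are hypothetical):

```r
library(dplyr)
library(tidytext)

word_counts <- eric_raw %>%
  # break the combined abstract/title/keyword text into one word per row
  unnest_tokens(word, text) %>%
  # drop common English stop words ("the", "of", "and", ...)
  anti_join(stop_words, by = "word") %>%
  # tally the remaining words across all of the articles
  count(word, sort = TRUE)

head(word_counts, 20)
```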
Much of what is above is just exploratory; now we can develop a tf-idf model that will assess the importance of these words in the context of the texts. “The statistic tf-idf is intended to measure how important a word is to a document in a collection (or corpus) of documents, for example, to one novel in a collection of novels or to one website in a collection of websites” (source: Tidy Text Mining, https://www.tidytextmining.com/tfidf.html).
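For reference, the statistic multiplies a term’s frequency within a document by a log-scaled inverse document frequency:

$$\text{tf-idf}(t, d) = \text{tf}(t, d) \times \ln\!\left(\frac{N}{n_t}\right)$$

where $N$ is the number of documents in the corpus and $n_t$ is the number of documents that contain term $t$. This natural-log form is the one used by `tidytext::bind_tf_idf()`, which the sketch further down relies on.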
## # A tibble: 15 x 3
## otherid word n
## <chr> <chr> <int>
## 1 EJ1218437 of 14
## 2 EJ1258255 the 13
## 3 EJ1156025 the 12
## 4 EJ935766 the 11
## 5 EJ1156025 of 10
## 6 EJ1249756 of 10
## 7 EJ1258255 of 10
## 8 EJ1249756 the 9
## 9 EJ1014930 in 8
## 10 EJ1098528 the 8
## 11 EJ1098528 to 8
## 12 EJ1099346 in 8
## 13 EJ1099346 of 8
## 14 EJ1156025 in 8
## 15 EJ1258255 and 8
## # A tibble: 11 x 2
## otherid total
## <chr> <int>
## 1 EJ1010749 142
## 2 EJ1014930 200
## 3 EJ1063565 174
## 4 EJ1098528 193
## 5 EJ1099346 189
## 6 EJ1156025 175
## 7 EJ1218437 253
## 8 EJ1228107 155
## 9 EJ1249756 148
## 10 EJ1258255 169
## 11 EJ935766 171
## Joining, by = "otherid"
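As a sketch of the next step, assume the per-document counts and totals shown above have been joined into a single data frame `doc_words` with columns `otherid`, `word`, `n`, and `total` (the name `doc_words` is an assumption); tf-idf can then be added with tidytext:

```r
library(dplyr)
library(tidytext)

doc_tf_idf <- doc_words %>%
  # bind_tf_idf() wants the term column, the document id, and the raw count
  bind_tf_idf(word, otherid, n) %>%
  arrange(desc(tf_idf))

# the highest-scoring words are frequent in one article but rare in the rest
head(doc_tf_idf, 15)
```

Note that stop words like “the” and “of” largely take care of themselves here: because they appear in every document, their idf (and therefore their tf-idf) drops to zero.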
Now that we’ve done some initial analysis to see the important words in these limited texts, we can move on to topic modeling to see if there are patterns that emerge across documents. The model will generate per-topic-per-word probabilities, called β (“beta”).
* Every document is a mixture of topics. We imagine that each document may contain words from several topics in particular proportions. For example, in a two-topic model we could say “Document 1 is 90% topic A and 10% topic B, while Document 2 is 30% topic A and 70% topic B.”
* Every topic is a mixture of words. For example, we could imagine a two-topic model of American news, with one topic for “politics” and one for “entertainment.” The most common words in the politics topic might be “President”, “Congress”, and “government”, while the entertainment topic may be made up of words such as “movies”, “television”, and “actor”. Importantly, words can be shared between topics; a word like “budget” might appear in both equally.
## Joining, by = "word"
## # A tibble: 15 x 3
## otherid word n
## <chr> <chr> <int>
## 1 EJ1258255 spatial 8
## 2 EJ1218437 content 7
## 3 EJ1218437 knowledge 7
## 4 EJ1228107 students 7
## 5 EJ1010749 math 6
## 6 EJ1014930 quot 6
## 7 EJ1063565 students 6
## 8 EJ1098528 multiplication 6
## 9 EJ1156025 proof 6
## 10 EJ1218437 students 6
## 11 EJ1249756 writing 6
## 12 EJ1010749 2 5
## 13 EJ1014930 follow 5
## 14 EJ1098528 math 5
## 15 EJ1098528 set 5
## A LDA_VEM topic model with 2 topics.
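Here is a sketch of how a model like this could be fit, assuming the stop-word-filtered counts above live in a data frame `doc_words` with columns `otherid`, `word`, and `n` (the object names are assumptions):

```r
library(dplyr)
library(tidytext)
library(topicmodels)

# cast the tidy counts into the document-term matrix that LDA() expects
article_dtm <- doc_words %>%
  cast_dtm(otherid, word, n)

# fit a two-topic LDA model; the seed only makes the run reproducible
article_lda <- LDA(article_dtm, k = 2, control = list(seed = 1234))

# per-topic-per-word probabilities (beta)
article_beta <- tidy(article_lda, matrix = "beta")
```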
We’ve generated a model based on 2 topics, so now we will compare the estimated probabilities of the top 10 terms for each topic.
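One way to pull those terms out, assuming the `article_beta` object from the sketch above:

```r
library(dplyr)
library(ggplot2)
library(tidytext)

top_terms <- article_beta %>%
  group_by(topic) %>%
  # keep the 10 most probable terms within each topic
  slice_max(beta, n = 10) %>%
  ungroup() %>%
  arrange(topic, desc(beta))

# bar chart of the top terms, one facet per topic
top_terms %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(beta, term, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free_y") +
  scale_y_reordered()
```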
We can also estimate the total percentage of each document’s words that were generated by each topic. We can see there’s a pretty clean split, with each journal article belonging mostly to either topic 1 or topic 2.
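Those per-document proportions come from the same fitted model via the gamma matrix; a short sketch, again assuming `article_lda` from above:

```r
library(dplyr)
library(tidytext)

# gamma = the estimated share of each document's words generated by each topic
article_gamma <- tidy(article_lda, matrix = "gamma")

# documents with gamma near 1 for a topic belong almost entirely to that topic
article_gamma %>%
  arrange(document, topic)
```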
Next, we are going to explore a different approach using the tm package.
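As a rough starting point, and assuming the raw abstracts are available as a character vector called `abstract_text` (the name is hypothetical), the tm workflow might look something like this:

```r
library(tm)

# build a corpus from the raw character vector
corp <- VCorpus(VectorSource(abstract_text))

# standard tm cleaning steps
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, removeNumbers)
corp <- tm_map(corp, removeWords, stopwords("english"))
corp <- tm_map(corp, stripWhitespace)

# document-term matrix for downstream weighting or topic models
dtm_tm <- DocumentTermMatrix(corp)
inspect(dtm_tm)
```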