Data Schema Demo

This demo is designed to give a feel for the type of analysis that is possible with text mining approaches. The caveat is that we are using a really small data set, one that has been processed through a citation analysis intended to identify highly influential publications in sub-domain areas. This is really just an exploratory process to see whether this approach could prove useful.

We are going to use data sourced from an ERIC publication search focused on math interventions at the K-12 level. This is a sub-sample that includes some articles that are very specific in topic and others that are a bit more vague. This was done on purpose to see what topics might emerge from a sample that was heterogeneous in approach.

The word content, which will serve as the data in this example, is pulled from the abstracts, titles, and keywords listed in the ERIC search results for math interventions delivered in elementary schools.

Below is the result of a good amount of data cleaning: essentially one long list of tokenized elements at the “word” level. We can use this list of words to determine which words, and later phrases, are most “important” to these journal articles.
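A minimal sketch of that cleaning step, assuming the raw ERIC export sits in a data frame `eric_raw` with a `text` column that combines each article’s title, abstract, and keywords (both names are hypothetical):

```r
library(dplyr)
library(tidytext)

# Split the combined title/abstract/keyword text into one word per row
# and drop common English stop words
xx <- eric_raw %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")

# Most frequent remaining words across all articles
xx %>%
  count(word, sort = TRUE) %>%
  head(20)
```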

## # A tibble: 20 x 2
##    `xx$word`         n
##    <chr>         <int>
##  1 intervention     57
##  2 students         41
##  3 mathematics      40
##  4 school           21
##  5 grade            19
##  6 math             19
##  7 instruction      18
##  8 knowledge        15
##  9 achievement      14
## 10 research         13
## 11 student          13
## 12 elementary       12
## 13 program          11
## 14 content           9
## 15 effectiveness     9
## 16 3                 8
## 17 6                 8
## 18 interventions     8
## 19 learning          8
## 20 mathematical      8

Much of what is above is just exploratory. Now we can develop a TF-IDF model to assess the importance of these words in the context of each document. “The statistic tf-idf is intended to measure how important a word is to a document in a collection (or corpus) of documents, for example, to one novel in a collection of novels or to one website in a collection of websites” (source: Tidy Text Mining, https://www.tidytextmining.com/tfidf.html).
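A sketch of that step, assuming the tokenized words (with stop words still included, as in the counts below) are stored in `eric_words` along with the ERIC accession number `otherid`; the object name is an assumption:

```r
library(dplyr)
library(tidytext)

# Word counts within each article (otherid identifies the ERIC record)
doc_words <- eric_words %>%
  count(otherid, word, sort = TRUE)

# Total words per article, used as the term-frequency denominator
total_words <- doc_words %>%
  group_by(otherid) %>%
  summarize(total = sum(n))

doc_words <- left_join(doc_words, total_words)  # prints: Joining, by = "otherid"

# bind_tf_idf() adds tf, idf, and tf_idf columns to the counts
doc_tfidf <- doc_words %>%
  bind_tf_idf(word, otherid, n)
```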

## # A tibble: 15 x 3
##    otherid   word      n
##    <chr>     <chr> <int>
##  1 EJ1218437 of       14
##  2 EJ1258255 the      13
##  3 EJ1156025 the      12
##  4 EJ935766  the      11
##  5 EJ1156025 of       10
##  6 EJ1249756 of       10
##  7 EJ1258255 of       10
##  8 EJ1249756 the       9
##  9 EJ1014930 in        8
## 10 EJ1098528 the       8
## 11 EJ1098528 to        8
## 12 EJ1099346 in        8
## 13 EJ1099346 of        8
## 14 EJ1156025 in        8
## 15 EJ1258255 and       8
## # A tibble: 11 x 2
##    otherid   total
##    <chr>     <int>
##  1 EJ1010749   142
##  2 EJ1014930   200
##  3 EJ1063565   174
##  4 EJ1098528   193
##  5 EJ1099346   189
##  6 EJ1156025   175
##  7 EJ1218437   253
##  8 EJ1228107   155
##  9 EJ1249756   148
## 10 EJ1258255   169
## 11 EJ935766    171
## Joining, by = "otherid"

Now that we’ve done some initial analysis to see the important words in these limited texts, we can move on to topic modeling to see whether patterns emerge across documents. Per-topic-per-word probabilities, called β (“beta”), will be generated. Topic modeling rests on two principles (a sketch of fitting the model follows the list below):

* Every document is a mixture of topics. We imagine that each document may contain words from several topics in particular proportions. For example, in a two-topic model we could say “Document 1 is 90% topic A and 10% topic B, while Document 2 is 30% topic A and 70% topic B.”

* Every topic is a mixture of words. For example, we could imagine a two-topic model of American news, with one topic for “politics” and one for “entertainment.” The most common words in the politics topic might be “President”, “Congress”, and “government”, while the entertainment topic may be made up of words such as “movies”, “television”, and “actor”. Importantly, words can be shared between topics; a word like “budget” might appear in both equally.
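A minimal sketch of fitting the two-topic LDA model, assuming the stop-word-filtered per-document counts are in `doc_words_clean` with columns `otherid`, `word`, and `n` (the object name is hypothetical):

```r
library(dplyr)
library(tidytext)
library(topicmodels)

# Cast the per-document word counts into a document-term matrix
eric_dtm <- doc_words_clean %>%
  cast_dtm(otherid, word, n)

# Fit a two-topic LDA model; the seed keeps the VEM fit reproducible
eric_lda <- LDA(eric_dtm, k = 2, control = list(seed = 1234))

# Per-topic-per-word probabilities (beta)
eric_topics <- tidy(eric_lda, matrix = "beta")
```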

## Joining, by = "word"
## # A tibble: 15 x 3
##    otherid   word               n
##    <chr>     <chr>          <int>
##  1 EJ1258255 spatial            8
##  2 EJ1218437 content            7
##  3 EJ1218437 knowledge          7
##  4 EJ1228107 students           7
##  5 EJ1010749 math               6
##  6 EJ1014930 quot               6
##  7 EJ1063565 students           6
##  8 EJ1098528 multiplication     6
##  9 EJ1156025 proof              6
## 10 EJ1218437 students           6
## 11 EJ1249756 writing            6
## 12 EJ1010749 2                  5
## 13 EJ1014930 follow             5
## 14 EJ1098528 math               5
## 15 EJ1098528 set                5
## A LDA_VEM topic model with 2 topics.

We’ve generated a model with 2 topics, so now we can compare the per-topic probabilities (β) of the top 10 terms in each topic.
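A sketch of extracting and plotting those top terms from the tidied β matrix (object names are carried over from the sketch above):

```r
library(dplyr)
library(tidytext)
library(ggplot2)

# Top 10 terms per topic, ranked by their beta probability
top_terms <- eric_topics %>%
  group_by(topic) %>%
  slice_max(beta, n = 10) %>%
  ungroup() %>%
  arrange(topic, -beta)

# Faceted bar chart comparing the top terms in each topic
top_terms %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(beta, term, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free_y") +
  scale_y_reordered()
```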

We can also estimate the percentage of words in each document that were generated by each topic. We can see there’s a pretty clean split, with each article belonging almost entirely to either topic 1 or topic 2.
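Those per-document-per-topic proportions come from the γ (“gamma”) matrix; a short sketch, reusing `eric_lda` from above:

```r
library(dplyr)
library(tidytext)

# Per-document-per-topic proportions (gamma); values near 0 or 1 mean an
# article's words are attributed almost entirely to a single topic
eric_gamma <- tidy(eric_lda, matrix = "gamma")

eric_gamma %>%
  arrange(document, topic)
```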

Next, we are going to explore a different approach using the tm package.