Data Schema Demo

This demo is designed to give a feel for the type of analysis that text mining approaches make possible. The caveat is that we are still developing a process for building a corpus. The goal is to use citation analysis to identify highly influential publications in sub-domain areas. For now, this is an exploratory process to see whether the approach could prove useful, and it will serve as a tool for experts to use in developing the schema.

We are going to use data sourced from an ERIC publication search focused on math interventions at the K-12 level. A sub-sample of 128 articles will be used to build the topic model.

The word content, which serves as the data in this example, is pulled from the abstracts, titles, and keywords listed in the ERIC search results for math interventions delivered in elementary schools.

Below is the result of a good amount of data cleaning: essentially one long list of tokenized elements at the “word” level. We can use this list to determine which words, and then which phrases, are most “important” to these journal articles.
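For reference, here is a minimal sketch of what that cleaning might look like with tidytext. The table name `eric_raw` and its columns `otherid` and `text` are hypothetical stand-ins for the real ERIC export, and the actual cleaning involved more steps than shown here.

library(dplyr)
library(tidytext)

# Hypothetical raw table: one row per article, with the abstract, title,
# and keywords pasted into a single `text` column.
# eric_raw <- tibble(otherid = ..., text = ...)

xx <- eric_raw %>%
  unnest_tokens(word, text) %>%        # one token ("word") per row
  anti_join(stop_words, by = "word")   # drop common stop words

# Most frequent words across the corpus
xx %>%
  count(word, sort = TRUE) %>%
  head(20)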

## # A tibble: 20 x 2
##    `xx$word`         n
##    <chr>         <int>
##  1 intervention    610
##  2 mathematics     568
##  3 students        539
##  4 grade           254
##  5 instruction     229
##  6 school          224
##  7 elementary      179
##  8 learning        165
##  9 study           141
## 10 control         124
## 11 achievement     115
## 12 effects         113
## 13 skills          111
## 14 student         111
## 15 math            109
## 16 teaching         99
## 17 solving          96
## 18 risk             94
## 19 interventions    91
## 20 2                89

We can also take a look at the data through two- and three-word phrases, called n-grams (bigrams and trigrams).

library(DT)  # datatable() renders an interactive table

# Two-word phrases (bigrams)
xx_ngrams <- xx %>%
  unnest_tokens(word, word, token = "ngrams", n = 2)

datatable(xx_ngrams)

# Three-word phrases (trigrams)
xx_ngrams_3 <- xx %>%
  unnest_tokens(word, word, token = "ngrams", n = 3)

datatable(xx_ngrams_3)

Much of what is above is just exploratory; now we can compute TF-IDF to assess the importance of these words in the context of each document. The TF-IDF statistic is intended to measure how important a word is to a document in a collection (or corpus) of documents, “for example, to one novel in a collection of novels or to one website in a collection of websites” (source: Tidy Text Mining, https://www.tidytextmining.com/tfidf.html). A sketch of this computation follows the output below.

## # A tibble: 8 x 3
##   otherid   word      n
##   <chr>     <chr> <int>
## 1 EJ1182560 the      24
## 2 EJ1222664 the      21
## 3 EJ1158816 the      19
## 4 EJ1072506 the      18
## 5 EJ1092098 the      18
## 6 EJ1116305 the      18
## 7 EJ1182560 of       18
## 8 EJ1222664 and      18
## # A tibble: 8 x 2
##   otherid  total
##   <chr>    <int>
## 1 ED545392    59
## 2 ED552820   217
## 3 ED557355    30
## 4 ED570289    39
## 5 ED572835   236
## 6 ED578216    99
## 7 ED595063   216
## 8 ED595127   219
## Joining, by = "otherid"
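The counts and totals above are the ingredients for TF-IDF. Here is a minimal sketch of that step, assuming a tokenized table `xx_by_doc` with one row per word per article (the name is hypothetical) and using tidytext’s bind_tf_idf():

library(dplyr)
library(tidytext)

# Per-document word counts (as in the first table above)
doc_words <- xx_by_doc %>%
  count(otherid, word, sort = TRUE)

# Total words per document (as in the second table above)
doc_totals <- doc_words %>%
  group_by(otherid) %>%
  summarise(total = sum(n))

# Join the totals back on and compute tf, idf, and tf-idf per word per article
doc_tfidf <- doc_words %>%
  left_join(doc_totals, by = "otherid") %>%
  bind_tf_idf(word, otherid, n) %>%
  arrange(desc(tf_idf))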

Now that we’ve done some initial analysis focused on word importance, we can move on to topic modeling to see whether patterns emerge across documents. The model will generate per-topic-per-word probabilities, called β (“beta”). Topic modeling rests on two ideas:

* Every document is a mixture of topics. We imagine that each document may contain words from several topics in particular proportions. For example, in a two-topic model we could say “Document 1 is 90% topic A and 10% topic B, while Document 2 is 30% topic A and 70% topic B.”

* Every topic is a mixture of words. For example, we could imagine a two-topic model of American news, with one topic for “politics” and one for “entertainment.” The most common words in the politics topic might be “President”, “Congress”, and “government”, while the entertainment topic may be made up of words such as “movies”, “television”, and “actor”. Importantly, words can be shared between topics; a word like “budget” might appear in both equally.

## Joining, by = "word"
## # A tibble: 15 x 3
##    otherid   word            n
##    <chr>     <chr>       <int>
##  1 EJ1257093 students       15
##  2 EJ1230022 word           14
##  3 EJ1257093 md             13
##  4 EJ1099265 learning       12
##  5 EJ1168275 students       12
##  6 EJ1079390 grade          11
##  7 EJ1112672 guided         11
##  8 EJ1246045 grade          11
##  9 ED595322  interleaved    10
## 10 EJ1049576 word           10
## 11 EJ1099265 students       10
## 12 EJ1115270 students       10
## 13 EJ1158172 skills         10
## 14 EJ1184246 strategies     10
## 15 EJ1196106 fractions      10
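To fit the topic model, the per-document counts can be cast to a document-term matrix. A minimal sketch, assuming the model is fit with LDA() from the topicmodels package and that `doc_words` is the hypothetical per-document count table from the TF-IDF sketch above (the seed is arbitrary):

library(dplyr)
library(tidytext)
library(topicmodels)

# Cast the per-document word counts into a document-term matrix
doc_dtm <- doc_words %>%
  cast_dtm(otherid, word, n)

# Fit a two-topic model (use k = 3 for the three-topic version)
lda_2 <- LDA(doc_dtm, k = 2, control = list(seed = 1234))

# Per-topic-per-word probabilities (beta)
topics_beta <- tidy(lda_2, matrix = "beta")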

We’ve generated models with two and three topics, so now we can compare the top 10 terms for each topic by their per-topic probabilities (β).
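A sketch of pulling the top 10 terms per topic from the β table, continuing from the hypothetical `topics_beta` object in the sketch above:

library(dplyr)

# Top 10 terms by beta within each topic, for side-by-side comparison
top_terms <- topics_beta %>%
  group_by(topic) %>%
  slice_max(beta, n = 10) %>%
  ungroup() %>%
  arrange(topic, desc(beta))

top_terms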

We can also estimate, for each document, the percentage of its words that were generated by each topic. We can see there’s a pretty clean split, with each article belonging mostly to either topic 1 or topic 2.
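That per-document split comes from the per-document-per-topic probabilities, called γ (“gamma”). A sketch of extracting them and tallying which topic dominates each article, continuing from the hypothetical `lda_2` model above:

# Per-document-per-topic probabilities (gamma)
docs_gamma <- tidy(lda_2, matrix = "gamma")

# Assign each article to its highest-probability topic and tally the split
docs_gamma %>%
  group_by(document) %>%
  slice_max(gamma, n = 1) %>%
  ungroup() %>%
  count(topic)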

We can also explore adding more topics and assessing the ideal number for this particular corpus.
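One rough way to compare candidate topic counts is to fit a model for several values of k and compare perplexity (lower is generally better). This sketch assumes the same topicmodels functions and the hypothetical `doc_dtm` matrix used above; more formal approaches exist, but this gives a first pass:

library(topicmodels)

# Fit models for several values of k and compare perplexity on the same corpus
ks <- c(2, 3, 4, 5, 6)

perplexities <- sapply(ks, function(k) {
  model <- LDA(doc_dtm, k = k, control = list(seed = 1234))
  perplexity(model, newdata = doc_dtm)
})

data.frame(k = ks, perplexity = perplexities)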