1. PURPOSE

1a. Motivation and Focus

This research illustrates the use of text-mining concepts, techniques, and tools to generate topics from a literature review examining the current state of AI, or machine intelligence, in education.

Methods for conducting systematic literature reviews are lengthy and require rigorous attention to reading and coding. To acquire a sense of the most important topics within the literature, the researcher needs a quick overview of which journals are most relevant to answering the research questions.

Coppin (2004) defines artificial intelligence as the ability of machines to adapt to new situations, deal with emerging situations, solve problems, answer questions, devise plans, and perform various other functions that require some level of intelligence typically evident in human beings. Whitby (2009) defined artificial intelligence as the study of intelligent behavior in human beings, animals, and machines, and the endeavor to engineer such behavior into artifacts such as computers and computer-related technologies.

Guiding Questions:

  1. What is the current state of research of machine intelligence in educational contexts?
  2. In what ways, if any, is machine intelligence supporting teaching and learning?

1b. Load Libraries

First we load the packages we will use to answer our questions, focusing on topic modeling packages that will benefit our research. We also set parameters for colors, theme, and style.

library(tidyverse)
library(tidytext) 
library(topicmodels) 
library(tidyr) 
library(dplyr) 
library(ggplot2) 
library(kableExtra) 
library(knitr) 
library(ggrepel) 
library(gridExtra)
library(formattable) 
library(tm) 
library(circlize) 
library(plotly) 
library(wordcloud2)
library(lubridate)
library(stringr)
library(SnowballC)

#SET PARAMETERS
#define colors to use throughout
my_colors <- c("#E69F00", "#56B4E9", "#009E73", "#CC79A7", "#D55E00", "#D65E00")

theme_plot <- function(aticks = element_blank(),
                         pgminor = element_blank(),
                         lt = element_blank(),
                         lp = "none")
{
  theme(plot.title = element_text(hjust = 0.5), #center the title
        axis.ticks = aticks, #set axis ticks to on or off
        panel.grid.minor = pgminor, #turn on or off the minor grid lines
        legend.title = lt, #turn on or off the legend title
        legend.position = lp) #turn on or off the legend
}

#customize the text tables for consistency using HTML formatting
my_kable_styling <- function(dat, caption) {
  kable(dat, "html", escape = FALSE, caption = caption) %>%
  kable_styling(bootstrap_options = c("striped", "condensed", "bordered"),
                full_width = FALSE)
}

word_chart <- function(data, input, title) {
  data %>%
  #set y = 1 to just plot one variable and use word as the label
  ggplot(aes(as.factor(row), 1, label = input, fill = factor(topic) )) +
  #you want the words, not the points
  geom_point(color = "transparent") +
  #make sure the labels don't overlap
  geom_label_repel(nudge_x = .2,  
                   direction = "y",
                   box.padding = 0.1,
                   segment.color = "transparent",
                   size = 3) +
  facet_grid(~topic) +
  theme_plot() +
  theme(axis.text.y = element_blank(), axis.text.x = element_blank(),
        #axis.title.x = element_text(size = 9),
        panel.grid = element_blank(), panel.background = element_blank(),
        panel.border = element_rect("lightgray", fill = NA),
        strip.text.x = element_text(size = 9)) +
  labs(x = NULL, y = NULL, title = title) +
  coord_flip()
}

2. METHOD

Our initial data frame includes 137 observations of eighteen variables, including Author, Title, Abstract, and publication date. After reading in the data, we will wrangle it. Data wrangling involves some combination of cleaning, reshaping, transforming, and merging data (Wickham & Grolemund, 2017). We then tokenize the text and cast a document-term matrix (DTM).

2a. Read and Inspect the Meta-Data

Looking at the first few observations, we can see that we do not need all of the variables.

#read in literature review
review_data3 <- read_csv("data/review_noCR.csv")

review_data3 %>%
  head(5) %>%
  kbl(caption = "First 5 - Initial CS Literature Review Metadata") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
First 5 - Initial CS Literature Review Metadata
(Rendered table: the first rows of the metadata, one row per paper, with eighteen columns: Title, Authors, Abstract, Published Year, Published Month, Journal, Field, subField, Volume, Issue, Pages, Accession Number, DOI, Ref, Covidence #, Study, Notes, Tags. Full abstracts omitted here.)

Just out of curiosity, let's inspect the data with a histogram to see how many papers were published per year. It looks as though 2014 was a big year for papers. Simonite (2014) excitedly writes, "2014 saw major strides in machine learning software that can gain abilities from experience."

hist(review_data3$`Published Year`)
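For a labeled version using the packages we already loaded, a ggplot2 histogram might look like this (a sketch; the one-year binwidth and color choice are my assumptions):

#ggplot2 version of the papers-per-year histogram (sketch)
review_data3 %>%
  ggplot(aes(`Published Year`)) +
  geom_histogram(binwidth = 1, fill = my_colors[2], color = "white") +
  labs(x = "Published Year", y = "Number of papers",
       title = "Papers Published per Year")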

2b. Tidy Data

We tidy the data by converting variable names to lowercase and selecting only the abstract, published_year, journal, field, and subfield variables. We then add a unique identifier and use unite() to rename field as document.

# convert all variable names to lower case
names(review_data3) <- tolower(names(review_data3))
 
#Clean Data and include unique identifier
tidy_data3 <- review_data3 %>% 
  rename(published_year = `published year`)%>%
  select(c('abstract', 'published_year', 'journal', 'field', 'subfield')) %>% # keep only the variables we need
  mutate(number = row_number())%>%
  unite(document, field)
  
# inspect
tidy_data3 %>%
  head(5) %>%
  kbl(caption = "First 5 - Tidy and Restructured Meta-Data") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
First 5 - Tidy and Restructured Meta-Data
(Rendered table: the same papers restructured to the columns abstract, published_year, journal, document, subfield, and number. Full abstracts omitted here.)

Now let's inspect how many papers each journal contributed. It looks as though the two journals with the most contributions are 'International Journal of Artificial Intelligence in Education' (Springer Science & Business Media B.V.) and 'Computers in Education.'


tidy_data3 %>%
  group_by(journal) %>%
  summarize(abstract = n_distinct(number)) %>%
  ggplot(aes(abstract, journal)) +
  geom_col() +
  scale_y_discrete(guide = guide_axis(check.overlap = TRUE)) +
  labs(y = NULL)

2c. Unnest and Tokenize

Here we take the necessary steps to:

  1. Transform our text into "tokens"
  2. Remove unnecessary characters, punctuation, and whitespace
  3. Convert all text to lowercase (unnest_tokens() does this by default)
  4. Remove stop words such as "the", "of", and "to"

After transforming, we can quickly look at the word counts. We can see that "learning" and "students" are at the top. This is exciting since we have papers from five different fields.

#unnest
token_words3 <- tidy_data3 %>%
  unnest_tokens(word, abstract) %>%
  filter(str_detect(word, "[a-z']$"),
         !word %in% stop_words$word)


token_words3 %>%
  group_by(word) %>%
  filter(n() >= 98) %>%
  count(word, sort = TRUE)%>%
  kbl(caption = "Tokenized Words >= 98") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Tokenized Words >= 98
word n
learning 386
students 180
system 139
based 138
design 137
user 116
data 98

We can also look at the top words in a word cloud, a nice way to visualize the top tokenized words from the abstracts.

top_tokens <- token_words3 %>%
  ungroup() %>% #ungroup the tokenized data before counting
  count(word, sort = TRUE) %>%
  top_n(50)

wordcloud2(top_tokens)

2d. Create Document Term Matrix and Inspect

When organizing the metadata, we grouped the journals into their respective fields. We will use these fields as the documents that connect to the topics later on. We have 5 documents and 3854 terms.

review_dtm3 <- token_words3 %>%
  count(document, word, sort = TRUE) %>%
  ungroup()

cast_dtm3 <- review_dtm3 %>%
  cast_dtm(document, word, n)

dim(cast_dtm3)
## [1]    5 3854
cast_dtm3
## <<DocumentTermMatrix (documents: 5, terms: 3854)>>
## Non-/sparse entries: 6253/13017
## Sparsity           : 68%
## Maximal term length: 22
## Weighting          : term frequency (tf)
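As an optional sanity check on the DTM, tm's findFreqTerms() lists the terms that occur at least a given number of times across the corpus (the threshold of 100 is my assumption):

#terms occurring at least 100 times across the corpus (sketch)
findFreqTerms(cast_dtm3, lowfreq = 100)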

We can inspect all five documents, looking at eight words within the DTM.

#look at 5 documents and 8 words of the DTM
inspect(cast_dtm3[1:5,1:8])
## <<DocumentTermMatrix (documents: 5, terms: 8)>>
## Non-/sparse entries: 39/1
## Sparsity           : 3%
## Maximal term length: 11
## Weighting          : term frequency (tf)
## Sample             :
##              Terms
## Docs          based data design intelligent learning results students system
##   ai              9    8     12           7       15       4       20     18
##   education      64   53     91          52      235      53      141     67
##   engineering    11   10      2           1       20       5        5     23
##   science         8    5      2           8       17       3        0      6
##   technology     46   22     30           8       99      20       14     25

We assign generic variable names: source_dtm3 for the DTM and source_tidy3 for the tokenized words.

#assign the source dataset to generic var names

source_dtm3 <- cast_dtm3
source_tidy3 <- token_words3

3. MODEL

3a. Fit Topic Model

We fit the model using Gibbs sampling rather than the default VEM method, and we will classify documents into topics based on the mean gamma per topic/source. The number of topics k is set to five, equal to the number of fields. We then:

  i. look at the class of the LDA object
  ii. inspect the topics

k <- 5 #number of topics
seed = 1234 #necessary for reproducibility
#fit the model 
#you could have more control parameters but will just use seed here
lda <- LDA(source_dtm3, k = k, method = "GIBBS", control = list(seed = seed))
#examine the class of the LDA object
class(lda)
## [1] "LDA_Gibbs"
## attr(,"package")
## [1] "topicmodels"
#inspect lda topics
lda
## A LDA_Gibbs topic model with 5 topics.
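To inspect the topics themselves, topicmodels also provides terms(), which returns the most probable words per topic (a minimal sketch; these words should correspond to the beta values explored in the next section):

#inspect the five most probable terms in each topic
terms(lda, 5)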

4. EXPLORE

4a. Beta Values

Jiang (2022) notes that "hidden within our topic model object we created are per-topic-per-word probabilities, called β ('beta')." Beta is the probability of a term (word) belonging to a topic. We will extract these per-topic-per-word probabilities from the model and show the top five results in each topic.

The tidy() function turns the model into a one-topic-per-term-per-row format. For each combination, the model computes the probability of that term being generated from that topic. For example, the term "learning" has a 0.0058312 probability of being generated from topic 1, but only a 0.0000339 probability from topic 5.

topics <- tidy(lda, matrix = "beta")
topics %>%
  head() %>%
  kbl(caption = "Term Probability by topic") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Term Probability by topic
topic term beta
1 learning 0.0058312
2 learning 0.0000426
3 learning 0.0000438
4 learning 0.0515828
5 learning 0.0000339
1 students 0.0290633

Now let's look at the top beta terms.

review_topics3 <- tidy(lda, matrix = "beta")

top_terms <- review_topics3 %>%
  group_by(topic) %>%
  slice_max(beta, n = 5) %>% 
  ungroup() %>%
  arrange(topic, -beta)

top_terms %>%
  head() %>%
  kbl(caption = "Top terms by Beta") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Top terms by Beta
topic term beta
1 students 0.0290633
1 intelligent 0.0121039
1 study 0.0121039
1 education 0.0109423
1 tutoring 0.0104777
2 users 0.0175237

Finally, we inspect the terms in a layout that makes it easy to compare the topics, looking at the 10 most common terms within each topic.

num_words <- 10 #number of words to visualize

#create a function that accepts the lda model and the number of words to display
top_terms_per_topic <- function(lda_model, num_words) {
  #tidy the LDA object to get word, topic, and probability (beta)
  topics_tidy <- tidy(lda_model, matrix = "beta")

  top_terms <- topics_tidy %>%
    group_by(topic) %>%
    arrange(topic, desc(beta)) %>%
    #get the top num_words PER topic
    slice(seq_len(num_words)) %>%
    arrange(topic, beta) %>%
    #row is required by the word_chart() function
    mutate(row = row_number()) %>%
    ungroup() %>%
    #add the word Topic to the topic labels
    mutate(topic = paste("Topic", topic, sep = " "))

  #create a title to pass to word_chart()
  title <- paste("LDA Top Terms for", k, "Topics")
  #call the word_chart() function
  word_chart(top_terms, top_terms$term, title)
}
#call the function you just built!
top_terms_per_topic(lda, num_words)

4b. Gamma Values

Silge & Robinson (2017) state, "besides estimating each topic as a mixture of words, LDA also models each document as a mixture of topics. We can examine the per-document-per-topic probabilities, called γ ('gamma'), with the matrix = "gamma" argument to tidy()."

We then show the relationship between each topic and the journal field (document).

tidy_data3
## # A tibble: 137 x 6
##    abstract           published_year journal           document subfield  number
##    <chr>                       <dbl> <chr>             <chr>    <chr>      <int>
##  1 "With the rapid g~           2014 Knowledge-Based ~ ai       ai             1
##  2 "Plagiarism refer~           2017 Engineering Appl~ enginee~ eng_ai         2
##  3 "<bold>Introducti~           2018 International Jo~ technol~ tech_hea~      3
##  4 "Spoken dialog sy~           2015 Neurocomputing    technol~ tech_nc        4
##  5 "We designed a wo~           2015 Early Child Deve~ educati~ edu_early      5
##  6 "• Sequential dee~           2021 Safety Science    enginee~ eng_safe~      6
##  7 "The goal of this~           2019 International Jo~ educati~ edu_ai         7
##  8 "A Materials Acce~           2020 Advanced Science  science  sci_multi      8
##  9 "• Machine learni~           2020 Environment Inte~ science  sci_env_~      9
## 10 "Highlights: [•] ~           2014 Computers in Hum~ technol~ tech_hum~     10
## # ... with 127 more rows
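Before averaging, a quick peek at the raw gamma values (a minimal sketch) shows the per-document-per-topic probabilities we are about to work with:

#inspect per-document-per-topic probabilities (sketch)
tidy(lda, matrix = "gamma") %>%
  arrange(document, desc(gamma)) %>%
  head()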
#using tidy with gamma gets document probabilities into topic
#but you only have document, topic and gamma
source_topic_relationship <- tidy(lda, matrix = "gamma") %>%
  #join to the original tidy data by document to get the source field
  inner_join(tidy_data3, by = "document") %>%
  select(document, topic, gamma) %>%
  group_by(document, topic) %>%
  #get the avg doc gamma value per source/topic
  mutate(mean = mean(gamma)) %>%
  #remove the gamma value as you only need the mean
  select(-gamma) %>%
  #removing gamma created duplicates so remove them
  distinct()

#relabel topics to include the word Topic
source_topic_relationship$topic = paste("Topic", source_topic_relationship$topic, sep = " ")

circos.clear() #very important! Reset the circular layout parameters
#assign colors to the outside bars around the circle
grid.col = c("Education" = my_colors[1],
             "Science" = my_colors[2],
             "AI" = my_colors[3],
             "Technology" = my_colors[4],
             "Engineering"= my_colors[5],
             "Topic 1" = "grey", "Topic 2" = "grey", "Topic 3" = "grey", "Topic 4" = "grey", "Topic 5" = "grey")

# set the global parameters for the circular layout. Specifically the gap size (15)
#this also determines that topic goes on top half and source on bottom half
circos.par(gap.after = c(rep(5, length(unique(source_topic_relationship[[1]])) - 1), 15,
                         rep(5, length(unique(source_topic_relationship[[2]])) - 1), 15))
#main function that draws the diagram. transparency goes from 0-1
chordDiagram(source_topic_relationship, grid.col = grid.col, transparency = .2)
title("Relationship Between Topic and Journal Field")

4c. Combine Gamma and Beta

Following Julia Silge's approach, we combine the beta and gamma matrices to summarize each topic by its expected proportion and its top terms.

#save to beta var
td_beta <- tidy(lda)
#save to gamma var
td_gamma <- tidy(lda, matrix = "gamma")
#copy Julia Silge code to combine
top_terms <- td_beta %>%
  arrange(beta) %>%
  group_by(topic) %>%
  top_n(7, beta) %>%
  arrange(-beta) %>%
  select(topic, term) %>%
  summarise(terms = list(term)) %>%
  mutate(terms = map(terms, paste, collapse = ", ")) %>% 
  unnest(cols = c(terms))

gamma_terms <- td_gamma %>%
  group_by(topic) %>%
  summarise(gamma = mean(gamma)) %>%
  arrange(desc(gamma)) %>%
  left_join(top_terms, by = "topic") %>%
  mutate(topic = paste0("Topic ", topic),
         topic = reorder(topic, gamma))

gamma_terms %>%
  select(topic, gamma, terms) %>%
  kable(digits = 3, 
        col.names = c("Topic", "Expected topic proportion", "Top 7 terms"))
Topic Expected topic proportion Top 7 terms
Topic 4 0.344 learning, based, design, system, data, paper, research, systems
Topic 3 0.211 human, service, machine, tests, relevant, collaborative, screening
Topic 2 0.192 users, system, interactions, tool, models, user, classification
Topic 5 0.135 human, user, bold, study, web, neural, multi
Topic 1 0.119 students, intelligent, study, education, tutoring, student, knowledge
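Following the same Julia Silge approach, the expected topic proportions can also be plotted with the top terms as labels. This is a sketch that reuses the theme_plot() helper defined above; the y-axis limit is my assumption:

#plot expected topic proportions with top terms as labels (sketch)
gamma_terms %>%
  ggplot(aes(topic, gamma, label = terms, fill = topic)) +
  geom_col(show.legend = FALSE) +
  geom_text(hjust = 0, nudge_y = 0.005, size = 3) +
  coord_flip() +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 0.5)) +
  theme_plot() +
  labs(x = NULL, y = expression(gamma),
       title = "Topics by Expected Proportion with Top Terms")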

My predictions of the topics are as follows:

  - Topic 1 is about intelligent tutoring systems
  - Topic 2 is about …
  - Topic 3 is about …
  - Topic 4 is about …
  - Topic 5 is about …

4d. Fit Model 2: K-Means

The structure of the k-means object reveals two important pieces of information: clusters and centers. In k-means clustering, each document can be assigned to one, and only one, cluster.

source_dtm2 <- cast_dtm3
source_tidy2 <- token_words3
#Set a seed for replicable results
set.seed(1234)
k <- 4
kmeansResult <- kmeans(source_dtm2, k)
str(kmeansResult)
## List of 9
##  $ cluster     : Named int [1:5] 3 4 1 1 2
##   ..- attr(*, "names")= chr [1:5] "education" "technology" "engineering" "ai" ...
##  $ centers     : num [1:4, 1:3854] 17.5 17 235 99 12.5 0 141 14 7 2 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:4] "1" "2" "3" "4"
##   .. ..$ : chr [1:3854] "learning" "students" "design" "system" ...
##  $ totss       : num 133285
##  $ withinss    : num [1:4] 2626 0 0 0
##  $ tot.withinss: num 2626
##  $ betweenss   : num 130658
##  $ size        : int [1:4] 2 1 1 1
##  $ iter        : int 2
##  $ ifault      : int 0
##  - attr(*, "class")= chr "kmeans"
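To make the one-document, one-cluster assignment explicit, we can pull the cluster vector out of the k-means result (a quick sketch):

#each field (document) is assigned to exactly one cluster
sort(kmeansResult$cluster)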

The term "intelligent" appears in all four clusters, but falls mainly in cluster three.

head(kmeansResult$centers[,"intelligent"])
##  1  2  3  4 
##  4  8 52  8

4e. K-Means Top Terms

num_words <- 8 #number of words to display
#get the top words from the kmeans centers
kmeans_topics <- lapply(1:k, function(i) {
  s <- sort(kmeansResult$centers[i, ], decreasing = T)
  names(s)[1:num_words]
})

#make sure it's a data frame
kmeans_topics_df <- as.data.frame(kmeans_topics)
#label the topics with the word Topic
names(kmeans_topics_df) <- paste("Topic", seq(1:k), sep = " ")
#create a sequential row id to use with gather()
kmeans_topics_df <- cbind(id = rownames(kmeans_topics_df),
                          kmeans_topics_df)
kmeans_topics_df
##   id  Topic 1      Topic 2     Topic 3   Topic 4
## 1  1   system     learning    learning  learning
## 2  2 learning        human    students     based
## 3  3     user      service      design      user
## 4  4 students    screening      system     model
## 5  5  systems           ai       based     human
## 6  6    based intelligence        data    design
## 7  7     data      machine     results     paper
## 8  8    users   artificial intelligent interface
#transpose it into the format required for word_chart()
kmeans_top_terms <- kmeans_topics_df %>%
  gather(key = "topic", value = "term", -id) %>% #exclude the id column from gathering
  select(topic, term)

kmeans_top_terms <- kmeans_top_terms %>%
  group_by(topic) %>%
  mutate(row = row_number()) %>% #needed by word_chart()
  ungroup()

title <- paste("K-Means Top Terms for", k, "Topics")
word_chart(kmeans_top_terms, kmeans_top_terms$term, title)

References:

Coppin, B. (2004). Artificial intelligence illuminated. Jones & Bartlett Learning.

Silge, J., & Robinson, D. (2017). Text mining with R: A tidy approach. O'Reilly Media.

Simonite, T. (2014). 2014 in computing: Breakthroughs in artificial intelligence. MIT Technology Review.

Warwick, K. (2013). Artificial intelligence: The basics. Routledge.

Whitby, B. (2009). Artificial intelligence. The Rosen Publishing Group, Inc.

Wickham, H., & Grolemund, G. (2017). R for data science. O'Reilly Media.