This research aims to illustrate how text mining concepts, techniques, and tools can be used to generate topics from a literature review examining the current state of AI (machine intelligence) in education.
Methods for conducting systematic literature reviews are lengthy and require rigorous attention to reading and coding. To acquire a sense of the most important topics within the literature, the researcher needs a quick overview of which journals matter most for answering the research questions.
Coppin (2004) defines artificial intelligence as the ability of machines to adapt to new situations, deal with emerging situations, solve problems, answer questions, devise plans, and perform various other functions that require some level of intelligence typically evident in human beings. Whitby (2009) defined artificial intelligence as the study of intelligent behavior in human beings, animals, and machines and endeavoring to engineer such behavior into an artifact, such as computers and computer-related technologies.
Guiding Questions:
First we load the libraries we will use to answer our questions, focusing on topic modeling packages that will benefit our research. We also set parameters for the colors, theme, and style used throughout.
library(tidyverse)
library(tidytext)
library(topicmodels)
library(tidyr)
library(dplyr)
library(ggplot2)
library(kableExtra)
library(knitr)
library(ggrepel)
library(gridExtra)
library(formattable)
library(tm)
library(circlize)
library(plotly)
library(wordcloud2)
library(lubridate)
library(stringr)
library(SnowballC)
#SET PARAMETERS
#define colors to use throughout
my_colors <- c("#E69F00", "#56B4E9", "#009E73", "#CC79A7", "#D55E00", "#D65E00")

theme_plot <- function(aticks = element_blank(),
                       pgminor = element_blank(),
                       lt = element_blank(),
                       lp = "none") {
  theme(plot.title = element_text(hjust = 0.5), #center the title
        axis.ticks = aticks, #set axis ticks to on or off
        panel.grid.minor = pgminor, #turn on or off the minor grid lines
        legend.title = lt, #turn on or off the legend title
        legend.position = lp) #turn on or off the legend
}

#customize the text tables for consistency using HTML formatting
my_kable_styling <- function(dat, caption) {
  kable(dat, "html", escape = FALSE, caption = caption) %>%
    kable_styling(bootstrap_options = c("striped", "condensed", "bordered"),
                  full_width = FALSE)
}

word_chart <- function(data, input, title) {
  data %>%
    #set y = 1 to just plot one variable and use word as the label
    ggplot(aes(as.factor(row), 1, label = input, fill = factor(topic))) +
    #you want the words, not the points
    geom_point(color = "transparent") +
    #make sure the labels don't overlap
    geom_label_repel(nudge_x = .2,
                     direction = "y",
                     box.padding = 0.1,
                     segment.color = "transparent",
                     size = 3) +
    facet_grid(~topic) +
    theme_plot() +
    theme(axis.text.y = element_blank(), axis.text.x = element_blank(),
          panel.grid = element_blank(), panel.background = element_blank(),
          panel.border = element_rect("lightgray", fill = NA),
          strip.text.x = element_text(size = 9)) +
    labs(x = NULL, y = NULL, title = title) +
    coord_flip()
}
Our initial read-in data frame includes 137 observations of eighteen variables, including Author, Title, Abstract, and publication date. After reading in the data we will wrangle it; data wrangling involves some combination of cleaning, reshaping, transforming, and merging data (Wickham & Grolemund, 2017). We will then tokenize the text and cast it as a document-term matrix (DTM).
Looking at the first six observations, we can see that we do not need all of the variables.
#read in literature review
review_data3 <- read_csv("data/review_noCR.csv")
review_data3 %>%
  head() %>%
  kbl(caption = "First 6 - Initial CS Literature Review Metadata") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
| Title | Authors | Abstract | Published Year | Published Month | Journal | Field | subField | Volume | Issue | Pages | Accession Number | DOI | Ref | Covidence # | Study | Notes | Tags |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Crowd explicit sentiment analysis. | Montejo-Ráez, A.; Díaz-Galiano, M.C.; Martínez-Santiago, F.; Ureña-López, L.A. | With the rapid growth of data generated by social web applications new paradigms in the generation of knowledge are opening. This paper introduces Crowd Explicit Sentiment Analysis (CESA) as an approach for sentiment analysis in social media environments. Similar to Explicit Semantic Analysis, microblog posts are indexed by a predefined collection of documents. In CESA, these documents are built up from common emotional expressions in social streams. In this way, texts are projected to feelings or emotions . This process is performed within a Latent Semantic Analysis. A few simple regular expressions (e.g. “I feel X” , considering X a term representing an emotion or feeling) are used to scratch the enormous flow of micro-blog posts to generate a textual representation of an emotional state with clear polarity value (e.g. angry, happy, sad, confident, etc. ). In this way, new posts can be indexed by these feelings according to the distance to their textual representation. The approach is suitable in many scenarios dealing with social media publications and can be implemented in other languages with little effort. In particular, we have evaluated the system on Polarity Classification with both English and Spanish data sets. The results show that CESA is a valid solution for sentiment analysis and that similar approaches for model building from the continuous flow of posts could be exploited in other scenarios. | 2014 | 10// | Knowledge-Based Systems | ai | ai | 69 | NA | 134-139 | NA | NA | NA | #6618 | Montejo-Ráez 2014 | NA | NA |
| Docode 5: Building a real-world plagiarism detection system. | Pizarro V., Gaspar; Velásquez, Juan D. | Plagiarism refers to the appropriation of someone else’s ideas and expression. Its ubiquity makes it necessary to counter it, and invites the development of commercial systems to do so. In this document we introduce Docode 5, a system for plagiarism detection that can perform analyses on the World Wide Web and on user-defined collections, and can be used as a decision support system. Our contribution in this document is to present this system in all its range of components, from the algorithms used in it to the user interfaces, and the issues with deployment on a commercial scale at an algorithmic and architectural level. We ran performance tests on the plagiarism detection algorithm showing an acceptable performance from an academic and commercial point of view, and load tests on the deployed system, showing that we can benefit from a distributed deployment. With this, we conclude we can adapt algorithms made for small-scale plagiarism detection to a commercial-scale system. | 2017 | 09// | Engineering Applications of Artificial Intelligence | engineering | eng_ai | 64 | NA | 261-271 | NA | NA | NA | #6111 | Pizarro 2017 | NA | NA |
| Assessing mobile health applications with twitter analytics. | Pai, Rajesh R.; Alathur, Sreejith | <bold>Introduction: </bold>Advancement in the field of information technology and rise in the use of Internet has changed the lives of people by enabling various services online. In recent times, healthcare sector which faces its service delivery challenges started promoting and using mobile health applications with the intention of cutting down the cost making it accessible and affordable to the people.<bold>Objectives: </bold>The objective of the study is to perform sentiment analysis using the Twitter data which measures the perception and use of various mobile health applications among the citizens.<bold>Methods: </bold>The methodology followed in this research is qualitative with the data extracted from a social networking site “Twitter” through a tool RStudio. This tool with the help of Twitter Application Programming Interface requested one thousand tweets each for four different phrases of mobile health applications (apps) such as “fitness app”, “diabetes app”, “meditation app”, and “cancer app”. Depending on the tweets, sentiment analysis was carried out, and its polarity and emotions were measured.<bold>Results: </bold>Except for cancer app there exists a positive polarity towards the fitness, diabetes, and meditation apps among the users. Following a system thinking approach for our results, this paper also explains the causal relationships between the accessibility and acceptability of mobile health applications which helps the healthcare facility and the application developers in understanding and analyzing the dynamics involved the adopting a new system or modifying an existing one. | 2018 | 05// | International Journal of Medical Informatics | technology | tech_health_info | 113 | NA | 72-84 | NA | NA | NA | #6365 | Pai 2018 | NA | NA |
| A proposal for the development of adaptive spoken interfaces to access the Web. | Griol, David; Molina, José Manuel; Callejas, Zoraida | Spoken dialog systems have been proposed as a solution to facilitate a more natural human–machine interaction. In this paper, we propose a framework to model the user<U+05F3>s intention during the dialog and adapt the dialog model dynamically to the user needs and preferences, thus developing more efficient, adapted, and usable spoken dialog systems. Our framework employs statistical models based on neural networks that take into account the history of the dialog up to the current dialog state in order to predict the user<U+05F3>s intention and the next system response. We describe our proposal and detail its application in the Let<U+05F3>s Go spoken dialog system. | 2015 | 09/02/ | Neurocomputing | technology | tech_nc | 163 | NA | 56-68 | NA | NA | NA | #6072 | Griol 2015 | NA | NA |
| Computer-based working memory training in children with mild intellectual disability. | Delavarian, Mona; Bokharaeian, Behrouz; Towhidkhah, Farzad; Gharibzadeh, Shahriar | We designed a working memory (WM) training programme in game framework for mild intellectually disabled students. Twenty-four students participated as test and control groups. The auditory and visual–spatial WM were assessed by primary test, which included computerised Wechsler numerical forward and backward sub-tests and secondary tests, which contained three parts: dual visual–spatial test, auditory test and a one-syllable word recalling test. The results showed significant difference between WM capacity in the intellectually disabled children and normal ones (p-value < 0.00001). Visual–spatial WM, auditory WM and speaking were improved in the trained group. Four tests showed significant differences between pre-test and post-tests. The trained group showed more improvements in forward tasks. The trained participant’s processing speed increased with training. We found that school is the best place for training. More comprehensive human–computer interfaces could be suitable for intellectually disabled students with visual and auditory impairments and problems in motor skills. | 2015 | 01// | Early Child Development & Care | education | edu_early | 185 | 1 | 66-74 | NA | NA | NA | #6576 | Delavarian 2015 | NA | NA |
| Sequential deep learning from NTSB reports for aviation safety prognosis. | Zhang, Xiaoge; Srinivasan, Prabhakar; Mahadevan, Sankaran | • Sequential deep learning models are developed for performing prognosis of adverse aviation events. • We compare the performance of classification models trained with event sequences and text narratives. • The developed model can be used for prognosis with partial and evolving event sequences. In this paper, we apply a set of data-mining and sequential deep learning techniques to accident investigation reports published by the National Transportation Safety Board (NTSB) in support of the prognosis of adverse events. Our focus is on learning with text data that describes the sequences of events. NTSB creates post hoc investigation reports which contain raw text narratives of their investigation and their corresponding concise event sequences. Classification models are developed for passenger air carriers, that take either an observed sequence of events or the corresponding raw text narrative as input and make predictions regarding whether an accident or an incident is the likely outcome, whether the aircraft would be damaged or not and whether any fatalities are likely or not. The classification models are developed using Word Embedding and the Long Short-term Memory (LSTM) neural network. The proposed methodology is implemented in two steps: (i) transform the NTSB data extracts into labeled dataset for building supervised machine learning models; and (ii) develop deep learning (DL) models for performing prognosis of adverse events like accidents, aircraft damage or fatalities. We also develop a prototype for an interactive query interface for end-users to test various scenarios including complete or partial event sequences or narratives and get predictions regarding the adverse events. The development of sequential deep learning models facilitates safety professionals in auditing, reviewing, and analyzing accident investigation reports, performing what-if scenario analyses to quantify the contributions of various hazardous events to the occurrence of aviation accidents/incidents. | 2021 | 10// | Safety Science | engineering | eng_safety | 142 | NA | N.PAG-N.PAG | NA | NA | NA | #5990 | Zhang 2021 | NA | NA |
Just out of curiosity, let's inspect the data with a histogram to see how many papers were published per year. It looks as though 2014 was a big year for papers; Simonite (2014) excitedly writes, "2014 saw major strides in machine learning software that can gain abilities from experience."
hist(review_data3$`Published Year`)
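For exact counts per year, a quick sketch complementing the histogram (using the column name as read in, before we lowercase the variable names below):
#count papers per publication year (dplyr is loaded via tidyverse)
review_data3 %>%
  count(`Published Year`, sort = TRUE)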
Next we tidy the data: convert the variable names to lowercase; select only abstract, published_year, journal, field, and subfield; add a unique identifier; and unite field into a document column.
# convert all variable names to lower case
names(review_data3) <- tolower(names(review_data3))

#clean data and include a unique identifier
tidy_data3 <- review_data3 %>%
  rename(published_year = `published year`) %>%
  select(c('abstract', 'published_year', 'journal', 'field', 'subfield')) %>% # only select
  mutate(number = row_number()) %>%
  unite(document, field)

# inspect
tidy_data3 %>%
  head() %>%
  kbl(caption = "First 6 - Tidy and Restructured Metadata") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
| abstract | published_year | journal | document | subfield | number |
|---|---|---|---|---|---|
| With the rapid growth of data generated by social web applications new paradigms in the generation of knowledge are opening. This paper introduces Crowd Explicit Sentiment Analysis (CESA) as an approach for sentiment analysis in social media environments. Similar to Explicit Semantic Analysis, microblog posts are indexed by a predefined collection of documents. In CESA, these documents are built up from common emotional expressions in social streams. In this way, texts are projected to feelings or emotions . This process is performed within a Latent Semantic Analysis. A few simple regular expressions (e.g. “I feel X” , considering X a term representing an emotion or feeling) are used to scratch the enormous flow of micro-blog posts to generate a textual representation of an emotional state with clear polarity value (e.g. angry, happy, sad, confident, etc. ). In this way, new posts can be indexed by these feelings according to the distance to their textual representation. The approach is suitable in many scenarios dealing with social media publications and can be implemented in other languages with little effort. In particular, we have evaluated the system on Polarity Classification with both English and Spanish data sets. The results show that CESA is a valid solution for sentiment analysis and that similar approaches for model building from the continuous flow of posts could be exploited in other scenarios. | 2014 | Knowledge-Based Systems | ai | ai | 1 |
| Plagiarism refers to the appropriation of someone else’s ideas and expression. Its ubiquity makes it necessary to counter it, and invites the development of commercial systems to do so. In this document we introduce Docode 5, a system for plagiarism detection that can perform analyses on the World Wide Web and on user-defined collections, and can be used as a decision support system. Our contribution in this document is to present this system in all its range of components, from the algorithms used in it to the user interfaces, and the issues with deployment on a commercial scale at an algorithmic and architectural level. We ran performance tests on the plagiarism detection algorithm showing an acceptable performance from an academic and commercial point of view, and load tests on the deployed system, showing that we can benefit from a distributed deployment. With this, we conclude we can adapt algorithms made for small-scale plagiarism detection to a commercial-scale system. | 2017 | Engineering Applications of Artificial Intelligence | engineering | eng_ai | 2 |
| <bold>Introduction: </bold>Advancement in the field of information technology and rise in the use of Internet has changed the lives of people by enabling various services online. In recent times, healthcare sector which faces its service delivery challenges started promoting and using mobile health applications with the intention of cutting down the cost making it accessible and affordable to the people.<bold>Objectives: </bold>The objective of the study is to perform sentiment analysis using the Twitter data which measures the perception and use of various mobile health applications among the citizens.<bold>Methods: </bold>The methodology followed in this research is qualitative with the data extracted from a social networking site “Twitter” through a tool RStudio. This tool with the help of Twitter Application Programming Interface requested one thousand tweets each for four different phrases of mobile health applications (apps) such as “fitness app”, “diabetes app”, “meditation app”, and “cancer app”. Depending on the tweets, sentiment analysis was carried out, and its polarity and emotions were measured.<bold>Results: </bold>Except for cancer app there exists a positive polarity towards the fitness, diabetes, and meditation apps among the users. Following a system thinking approach for our results, this paper also explains the causal relationships between the accessibility and acceptability of mobile health applications which helps the healthcare facility and the application developers in understanding and analyzing the dynamics involved the adopting a new system or modifying an existing one. | 2018 | International Journal of Medical Informatics | technology | tech_health_info | 3 |
| Spoken dialog systems have been proposed as a solution to facilitate a more natural human–machine interaction. In this paper, we propose a framework to model the user<U+05F3>s intention during the dialog and adapt the dialog model dynamically to the user needs and preferences, thus developing more efficient, adapted, and usable spoken dialog systems. Our framework employs statistical models based on neural networks that take into account the history of the dialog up to the current dialog state in order to predict the user<U+05F3>s intention and the next system response. We describe our proposal and detail its application in the Let<U+05F3>s Go spoken dialog system. | 2015 | Neurocomputing | technology | tech_nc | 4 |
| We designed a working memory (WM) training programme in game framework for mild intellectually disabled students. Twenty-four students participated as test and control groups. The auditory and visual–spatial WM were assessed by primary test, which included computerised Wechsler numerical forward and backward sub-tests and secondary tests, which contained three parts: dual visual–spatial test, auditory test and a one-syllable word recalling test. The results showed significant difference between WM capacity in the intellectually disabled children and normal ones (p-value < 0.00001). Visual–spatial WM, auditory WM and speaking were improved in the trained group. Four tests showed significant differences between pre-test and post-tests. The trained group showed more improvements in forward tasks. The trained participant’s processing speed increased with training. We found that school is the best place for training. More comprehensive human–computer interfaces could be suitable for intellectually disabled students with visual and auditory impairments and problems in motor skills. | 2015 | Early Child Development & Care | education | edu_early | 5 |
| • Sequential deep learning models are developed for performing prognosis of adverse aviation events. • We compare the performance of classification models trained with event sequences and text narratives. • The developed model can be used for prognosis with partial and evolving event sequences. In this paper, we apply a set of data-mining and sequential deep learning techniques to accident investigation reports published by the National Transportation Safety Board (NTSB) in support of the prognosis of adverse events. Our focus is on learning with text data that describes the sequences of events. NTSB creates post hoc investigation reports which contain raw text narratives of their investigation and their corresponding concise event sequences. Classification models are developed for passenger air carriers, that take either an observed sequence of events or the corresponding raw text narrative as input and make predictions regarding whether an accident or an incident is the likely outcome, whether the aircraft would be damaged or not and whether any fatalities are likely or not. The classification models are developed using Word Embedding and the Long Short-term Memory (LSTM) neural network. The proposed methodology is implemented in two steps: (i) transform the NTSB data extracts into labeled dataset for building supervised machine learning models; and (ii) develop deep learning (DL) models for performing prognosis of adverse events like accidents, aircraft damage or fatalities. We also develop a prototype for an interactive query interface for end-users to test various scenarios including complete or partial event sequences or narratives and get predictions regarding the adverse events. The development of sequential deep learning models facilitates safety professionals in auditing, reviewing, and analyzing accident investigation reports, performing what-if scenario analyses to quantify the contributions of various hazardous events to the occurrence of aviation accidents/incidents. | 2021 | Safety Science | engineering | eng_safety | 6 |
Now let's inspect how many papers each journal contributed. It looks as though the two journals with the most contributions are the International Journal of Artificial Intelligence in Education (Springer Science & Business Media B.V.) and Computers in Education.
tidy_data3 %>%
  group_by(journal) %>%
  summarize(abstract = n_distinct(number)) %>%
  ggplot(aes(abstract, journal)) +
  geom_col() +
  scale_y_discrete(guide = guide_axis(check.overlap = TRUE)) +
  labs(y = NULL)
Here we take the necessary steps to:
1. transform our text into "tokens"
2. remove unnecessary characters, punctuation, and whitespace
3. convert all text to lowercase
4. remove stop words such as "the", "of", and "to"
After transforming we can quickly look at the word count. We can see that “learning” and “students” are at the top. This is exciting since we have papers from five different fields.
#unnest
token_words3 <- tidy_data3 %>%
  unnest_tokens(word, abstract) %>%
  filter(str_detect(word, "[a-z']$"),
         !word %in% stop_words$word)

token_words3 %>%
  group_by(word) %>%
  filter(n() >= 98) %>%
  count(word, sort = TRUE) %>%
  kbl(caption = "Tokenized Words >= 98") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
| word | n |
|---|---|
| learning | 386 |
| students | 180 |
| system | 139 |
| based | 138 |
| design | 137 |
| user | 116 |
| data | 98 |
We can also look at the top words in a word cloud, a nice visualization of the top tokenized words from the abstracts.
top_tokens <- token_words3 %>%
  ungroup() %>% #ungroup the tokenized data to create a word cloud
  count(word, sort = TRUE) %>%
  top_n(50)

wordcloud2(top_tokens)
When organizing the metadata, we organized the journals into their respective fields. We will use the fields as the documents to connect the topics later on. We have 5 documents and 3854 terms.
review_dtm3 <- token_words3 %>%
  count(document, word, sort = TRUE) %>%
  ungroup()

cast_dtm3 <- review_dtm3 %>%
  cast_dtm(document, word, n)

dim(cast_dtm3)
## [1] 5 3854
cast_dtm3
## <<DocumentTermMatrix (documents: 5, terms: 3854)>>
## Non-/sparse entries: 6253/13017
## Sparsity : 68%
## Maximal term length: 22
## Weighting : term frequency (tf)
We can inspect all five documents, looking at eight terms within the DTM.
#look at 5 documents and 8 words of the DTM
inspect(cast_dtm3[1:5,1:8])
## <<DocumentTermMatrix (documents: 5, terms: 8)>>
## Non-/sparse entries: 39/1
## Sparsity : 3%
## Maximal term length: 11
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs based data design intelligent learning results students system
## ai 9 8 12 7 15 4 20 18
## education 64 53 91 52 235 53 141 67
## engineering 11 10 2 1 20 5 5 23
## science 8 5 2 8 17 3 0 6
## technology 46 22 30 8 99 20 14 25
We assign generic variable names: source_dtm3 for the DTM and source_tidy3 for the tokenized words.
#assign the source dataset to generic var names
source_dtm3 <- cast_dtm3
source_tidy3 <- token_words3
We will use the Gibbs sampling method rather than the default VEM. We will classify documents into topics based on the mean gamma for each topic/source. We set k, the number of topics, to five, matching the number of fields. Then we: (i) look at the class of the LDA object, and (ii) inspect the topics.
k <- 5 #number of topics
seed = 1234 #necessary for reproducibility
#fit the model
#you could have more control parameters but will just use seed here
lda <- LDA(source_dtm3, k = k, method = "GIBBS", control = list(seed = seed))
#examine the class of the LDA object
class(lda)
## [1] "LDA_Gibbs"
## attr(,"package")
## [1] "topicmodels"
#inspect lda topics
lda
## A LDA_Gibbs topic model with 5 topics.
Jiang (2022) notes that "hidden within our topic model object we created are per-topic-per-word probabilities, called β ('beta')." Beta is the probability of a term (word) belonging to a topic. We will extract these per-topic-per-word probabilities from the model and show the top five results in each topic.
The tidy() function turns the model into a one-topic-per-term-per-row format. For each combination, the model computes the probability of that term being generated from that topic. For example, the term "learning" has a 0.0058312 probability of being generated from topic 1, but only a 0.0000339 probability of being generated from topic 5.
topics <- tidy(lda, matrix = "beta")

topics %>%
  head() %>%
  kbl(caption = "Term Probability by topic") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
| topic | term | beta |
|---|---|---|
| 1 | learning | 0.0058312 |
| 2 | learning | 0.0000426 |
| 3 | learning | 0.0000438 |
| 4 | learning | 0.0515828 |
| 5 | learning | 0.0000339 |
| 1 | students | 0.0290633 |
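Because beta is a probability distribution over the terms, the beta values within each topic should sum to (approximately) one. A quick sanity check on the topics object above:
#confirm each topic's per-term probabilities sum to 1
topics %>%
  group_by(topic) %>%
  summarize(total_beta = sum(beta))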
Now let's look at the top terms by beta.
review_topics3 <- tidy(lda, matrix = "beta")

top_terms <- review_topics3 %>%
  group_by(topic) %>%
  slice_max(beta, n = 5) %>%
  ungroup() %>%
  arrange(topic, -beta)

top_terms %>%
  head() %>%
  kbl(caption = "Top terms by Beta") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
| topic | term | beta |
|---|---|---|
| 1 | students | 0.0290633 |
| 1 | intelligent | 0.0121039 |
| 1 | study | 0.0121039 |
| 1 | education | 0.0109423 |
| 1 | tutoring | 0.0104777 |
| 2 | users | 0.0175237 |
Finally, we inspect the terms in a clean layout that makes comparing topics easy, looking at the 10 terms that are most common within each topic.
num_words <- 10 #number of words to visualize

#create a function that accepts the lda model and the number of words to display
top_terms_per_topic <- function(lda_model, num_words) {
  #tidy the LDA object to get word, topic, and probability (beta)
  topics_tidy <- tidy(lda_model, matrix = "beta")
  top_terms <- topics_tidy %>%
    group_by(topic) %>%
    arrange(topic, desc(beta)) %>%
    #get the top num_words PER topic
    slice(seq_len(num_words)) %>%
    arrange(topic, beta) %>%
    #row is required for the word_chart() function
    mutate(row = row_number()) %>%
    ungroup() %>%
    #add the word Topic to the topic labels
    mutate(topic = paste("Topic", topic, sep = " "))
  #create a title to pass to word_chart
  title <- paste("LDA Top Terms for", k, "Topics")
  #call the word_chart function
  word_chart(top_terms, top_terms$term, title)
}
#call the function you just built!
top_terms_per_topic(lda, num_words)
### 4b. Gamma
Silge & Robinson (2017) state, "besides estimating each topic as a mixture of words, LDA also models each document as a mixture of topics. We can examine the per-document-per-topic probabilities, called γ ('gamma'), with the matrix = 'gamma' argument to tidy()."
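Because each document is modeled as a mixture of the five topics, its gamma values should sum to one. A minimal sanity check using the same tidy() call:
#confirm each document's per-topic probabilities sum to 1
tidy(lda, matrix = "gamma") %>%
  group_by(document) %>%
  summarize(total_gamma = sum(gamma))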
Next we show the relationship between topic and journal field.
tidy_data3
## # A tibble: 137 x 6
## abstract published_year journal document subfield number
## <chr> <dbl> <chr> <chr> <chr> <int>
## 1 "With the rapid g~ 2014 Knowledge-Based ~ ai ai 1
## 2 "Plagiarism refer~ 2017 Engineering Appl~ enginee~ eng_ai 2
## 3 "<bold>Introducti~ 2018 International Jo~ technol~ tech_hea~ 3
## 4 "Spoken dialog sy~ 2015 Neurocomputing technol~ tech_nc 4
## 5 "We designed a wo~ 2015 Early Child Deve~ educati~ edu_early 5
## 6 "• Sequential dee~ 2021 Safety Science enginee~ eng_safe~ 6
## 7 "The goal of this~ 2019 International Jo~ educati~ edu_ai 7
## 8 "A Materials Acce~ 2020 Advanced Science science sci_multi 8
## 9 "• Machine learni~ 2020 Environment Inte~ science sci_env_~ 9
## 10 "Highlights: [•] ~ 2014 Computers in Hum~ technol~ tech_hum~ 10
## # ... with 127 more rows
#tidy with matrix = "gamma" gets per-document topic probabilities,
#but you only have document, topic, and gamma
source_topic_relationship <- tidy(lda, matrix = "gamma") %>%
  #join to the original tidy data by document to get the source field
  inner_join(tidy_data3, by = "document") %>%
  select(document, topic, gamma) %>%
  group_by(document, topic) %>%
  #get the avg doc gamma value per source/topic
  mutate(mean = mean(gamma)) %>%
  #remove the gamma value as you only need the mean
  select(-gamma) %>%
  #removing gamma created duplicates so remove them
  distinct()

#relabel topics to include the word Topic
source_topic_relationship$topic = paste("Topic", source_topic_relationship$topic, sep = " ")
circos.clear() #very important! reset the circular layout parameters
#assign colors to the outside bars around the circle
#(names are lowercase to match the document values in the data)
grid.col = c("education" = my_colors[1],
             "science" = my_colors[2],
             "ai" = my_colors[3],
             "technology" = my_colors[4],
             "engineering" = my_colors[5],
             "Topic 1" = "grey", "Topic 2" = "grey", "Topic 3" = "grey",
             "Topic 4" = "grey", "Topic 5" = "grey")

#set the global parameters for the circular layout, specifically the gap size (15);
#this also determines that topics go on the top half and sources on the bottom half
circos.par(gap.after = c(rep(5, length(unique(source_topic_relationship[[1]])) - 1), 15,
                         rep(5, length(unique(source_topic_relationship[[2]])) - 1), 15))

#main function that draws the diagram; transparency goes from 0-1
chordDiagram(source_topic_relationship, grid.col = grid.col, transparency = .2)
title("Relationship Between Topic and Journal Field")
#save to beta var
td_beta <- tidy(lda)
#save to gamma var
td_gamma <- tidy(lda, matrix = "gamma")

#combine top terms per topic, following Julia Silge's approach
top_terms <- td_beta %>%
  arrange(beta) %>%
  group_by(topic) %>%
  top_n(7, beta) %>%
  arrange(-beta) %>%
  select(topic, term) %>%
  summarise(terms = list(term)) %>%
  mutate(terms = map(terms, paste, collapse = ", ")) %>%
  unnest(cols = c(terms))

gamma_terms <- td_gamma %>%
  group_by(topic) %>%
  summarise(gamma = mean(gamma)) %>%
  arrange(desc(gamma)) %>%
  left_join(top_terms, by = "topic") %>%
  mutate(topic = paste0("Topic ", topic),
         topic = reorder(topic, gamma))

gamma_terms %>%
  select(topic, gamma, terms) %>%
  kable(digits = 3,
        col.names = c("Topic", "Expected topic proportion", "Top 7 terms"))
| Topic | Expected topic proportion | Top 7 terms |
|---|---|---|
| Topic 4 | 0.344 | learning, based, design, system, data, paper, research, systems |
| Topic 3 | 0.211 | human, service, machine, tests, relevant, collaborative, screening |
| Topic 2 | 0.192 | users, system, interactions, tool, models, user, classification |
| Topic 5 | 0.135 | human, user, bold, study, web, neural, multi |
| Topic 1 | 0.119 | students, intelligent, study, education, tutoring, student, knowledge |
My predictions for the topics are as follows (see the sketch below for each field's dominant topic):
- Topic 1 is about intelligent tutoring systems
- Topic 2 is about
- Topic 3 is about
- Topic 4 is about
- Topic 5 is about
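One way to ground these predictions is to look at which topic dominates each field. A minimal sketch using the td_gamma object created above:
#find the single highest-gamma topic for each document (field)
td_gamma %>%
  group_by(document) %>%
  slice_max(gamma, n = 1) %>%
  ungroup()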
Next we apply k-means clustering with k = 4. The structure of the k-means object reveals two important pieces of information: clusters and centers. In k-means clustering, each document can be assigned to one, and only one, cluster.
source_dtm2 <- cast_dtm3
source_tidy2 <- token_words3
#Set a seed for replicable results
set.seed(1234)
k <- 4
kmeansResult <- kmeans(source_dtm2, k)
str(kmeansResult)
## List of 9
## $ cluster : Named int [1:5] 3 4 1 1 2
## ..- attr(*, "names")= chr [1:5] "education" "technology" "engineering" "ai" ...
## $ centers : num [1:4, 1:3854] 17.5 17 235 99 12.5 0 141 14 7 2 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:4] "1" "2" "3" "4"
## .. ..$ : chr [1:3854] "learning" "students" "design" "system" ...
## $ totss : num 133285
## $ withinss : num [1:4] 2626 0 0 0
## $ tot.withinss: num 2626
## $ betweenss : num 130658
## $ size : int [1:4] 2 1 1 1
## $ iter : int 2
## $ ifault : int 0
## - attr(*, "class")= chr "kmeans"
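Since each document belongs to exactly one cluster, we can pull the assignments out of the named cluster vector. A small sketch:
#tabulate the one-to-one document-to-cluster assignments
tibble(document = names(kmeansResult$cluster),
       cluster = kmeansResult$cluster)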
The term "intelligent" appears in all four cluster centers, but carries by far the most weight in cluster three.
head(kmeansResult$centers[,"intelligent"])
## 1 2 3 4
## 4 8 52 8
num_words <- 8 #number of words to display

#get the top words from the kmeans centers
kmeans_topics <- lapply(1:k, function(i) {
  s <- sort(kmeansResult$centers[i, ], decreasing = TRUE)
  names(s)[1:num_words]
})

#make sure it's a data frame
kmeans_topics_df <- as.data.frame(kmeans_topics)
#label the topics with the word Topic
names(kmeans_topics_df) <- paste("Topic", seq(1:k), sep = " ")
#create a sequential row id to use with gather()
kmeans_topics_df <- cbind(id = rownames(kmeans_topics_df),
                          kmeans_topics_df)
kmeans_topics_df
kmeans_topics_df
## id Topic 1 Topic 2 Topic 3 Topic 4
## 1 1 system learning learning learning
## 2 2 learning human students based
## 3 3 user service design user
## 4 4 students screening system model
## 5 5 systems ai based human
## 6 6 based intelligence data design
## 7 7 data machine results paper
## 8 8 users artificial intelligent interface
#transpose it into the format required for word_chart()
kmeans_top_terms <- kmeans_topics_df %>%
  gather(id)
colnames(kmeans_top_terms) = c("topic", "term")

kmeans_top_terms <- kmeans_top_terms %>%
  group_by(topic) %>%
  mutate(row = row_number()) %>% #needed by word_chart()
  ungroup()

title <- paste("K-Means Top Terms for", k, "Topics")
word_chart(kmeans_top_terms, kmeans_top_terms$term, title)
Coppin, B. (2004). Artificial intelligence illuminated. Jones & Bartlett Learning.
Silge, J., & Robinson, D. (2017). Text mining with R: A tidy approach. O'Reilly Media.
Simonite, T. (2014). 2014 in computing: Breakthroughs in artificial intelligence. MIT Technology Review.
Warwick, K. (2013). Artificial intelligence: The basics. Routledge.
Whitby, B. (2009). Artificial intelligence. The Rosen Publishing Group, Inc.
Wickham, H., & Grolemund, G. (2017). R for data science. O'Reilly Media.