Final Project for ECI 588, Text Mining in Education: Analyzing Literature Through Text Mining

Introduction

This literature analysis identifies trends, developments, and emerging themes in learning analytics as applied to STEM education in higher education settings. By mapping how the field has evolved, it aims to give researchers and practitioners who focus on STEM education in higher education a broad overview of the literature.

To that end, this research project aims to answer the following questions:

What is the annual publication trend of the studies?
What are the most influential sources, affiliations, publishers, and publications in the field?
What are the most common words used?
Which topic modeling approach is optimal for generating potential topics from short text?
What are the potential topic trends observed over the years?

Preparation

2.1 Data Source

The raw data were collected from the Web of Science (WoS) database, a comprehensive research database that covers scholarly literature across a wide range of disciplines. The search query was: “learning analytics” AND (“higher education” OR colleg* OR universit*) AND (STEM OR science OR technology OR engineering OR mathematics), applied to the Abstract field. With all document types included, the search returned 373 documents; restricting the document type to Article and the publication years to 2014-2024 left 205 articles. The search was run on a single day, 7 April 2024, to avoid deviations caused by daily updates to the database. The following inclusion criteria were then applied to select relevant studies: (1) the paper is written in English; (2) the full text is available; (3) the paper is peer-reviewed; (4) the study is conducted in a higher education context; (5) the study focuses on STEM education; and (6) the study is empirical. Other document types, such as non-peer-reviewed papers, government reports, reviews, meta-analyses, surveys, commentaries, editorials, book chapters, theses, dissertations, and author notes, were therefore excluded. Finally, the records of the 45 articles that remained were downloaded in Excel format using Export Records to Excel File with the “Record Content: Full Record and Cited References” option. Each WoS record includes the author, document type, Web of Science category, keywords, year of publication, publisher, affiliated institutions, countries/regions, and indexes.

2.2 Load libraries

library(readr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(tidytext)
library(tidyr)
library(SnowballC)
library(topicmodels)
library(stm)

## stm v1.3.7 successfully loaded. See ?stm for help. 
##  Papers, resources, and other materials at structuraltopicmodel.com

library(ldatuning)
library(knitr)
library(LDAvis)
library(tm)

## Loading required package: NLP
## 
## Attaching package: 'NLP'
## 
## The following object is masked from 'package:ggplot2':
## 
##     annotate

library(lubridate)
library(kableExtra)

## Warning: package 'kableExtra' was built under R version 4.3.3

## 
## Attaching package: 'kableExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     group_rows

library(BTM)

## Warning: package 'BTM' was built under R version 4.3.3

library(textplot)

## Warning: package 'textplot' was built under R version 4.3.3

library(concaveman)

## Warning: package 'concaveman' was built under R version 4.3.3

library(udpipe)

## Warning: package 'udpipe' was built under R version 4.3.3

library(data.table)

## 
## Attaching package: 'data.table'
## 
## The following objects are masked from 'package:lubridate':
## 
##     hour, isoweek, mday, minute, month, quarter, second, wday, week,
##     yday, year
## 
## The following object is masked from 'package:purrr':
## 
##     transpose
## 
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last

library(stopwords)

## 
## Attaching package: 'stopwords'
## 
## The following object is masked from 'package:tm':
## 
##     stopwords

library(ggplot2)
library(igraph)

## 
## Attaching package: 'igraph'
## 
## The following objects are masked from 'package:lubridate':
## 
##     %--%, union
## 
## The following objects are masked from 'package:purrr':
## 
##     compose, simplify
## 
## The following object is masked from 'package:tidyr':
## 
##     crossing
## 
## The following object is masked from 'package:tibble':
## 
##     as_data_frame
## 
## The following objects are masked from 'package:dplyr':
## 
##     as_data_frame, groups, union
## 
## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum
## 
## The following object is masked from 'package:base':
## 
##     union

library(ggraph)

Wrangle

3.1 Read the data into the R project

library(readxl)
downloadLASTEM_data <- read_excel("C:/Users/kaoya/Downloads/ECI_588_Datasets/download_LASTEM.xls")
View(downloadLASTEM_data)

3.2 Rename Columns

# Rename all columns of interest in a single call
downloadLASTEM_data <- downloadLASTEM_data %>%
  rename(
    Year           = `Publication Year`,
    Title          = `Article Title`,
    Source         = `Source Title`,
    Type           = `Publication Type`,
    Keywords       = `Author Keywords`,
    Usagecount     = `Since 2013 Usage Count`,
    Keywordsplus   = `Keywords Plus`,
    Referencecount = `Cited Reference Count`,
    Journal        = `Journal Abbreviation`
  )

3.3 Subset columns

LASTEM_data <- select(downloadLASTEM_data, Type, Authors, Title, Source, Keywords, Keywordsplus, Abstract, Affiliations, Referencecount, Usagecount, Journal, Publisher, Year)

3.4 Create tidytext corpus

LASTEM_data_tidy <- LASTEM_data %>%
  unnest_tokens(output = word, input = Abstract) %>%
  anti_join(stop_words, by = "word")


tidy_top_tokens <- LASTEM_data_tidy %>% 
  count(word, sort = TRUE) %>% 
  top_n(50)

## Selecting by n

tidy_top_tokens

## # A tibble: 51 × 2
##    word           n
##    <chr>      <int>
##  1 learning     230
##  2 students     147
##  3 analytics     82
##  4 study         63
##  5 data          59
##  6 student       59
##  7 education     45
##  8 academic      41
##  9 assessment    39
## 10 university    39
## # ℹ 41 more rows
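The tokenization step above can be illustrated on a small scale. The following is a minimal sketch, assuming a hypothetical two-row tibble named `toy_abstracts` (not part of the project data), that shows what `unnest_tokens()` and the stop word `anti_join()` do to each abstract:

```r
library(dplyr)
library(tidytext)

# Hypothetical two-abstract corpus, for illustration only
toy_abstracts <- tibble(
  Title    = c("Paper A", "Paper B"),
  Abstract = c("The students used learning analytics.",
               "Learning analytics supports the students.")
)

toy_tokens <- toy_abstracts %>%
  # one row per word; text is lowercased and punctuation is stripped by default
  unnest_tokens(output = word, input = Abstract) %>%
  # drop every token that appears in tidytext's stop_words lexicon
  anti_join(stop_words, by = "word")

# Frequency table of the remaining tokens
toy_tokens %>% count(word, sort = TRUE)
```

Because the other columns (here `Title`) are carried along on every token row, later steps can still group the words by document or year.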

3.5 Stemming

Stemming reduces the feature size of a corpus by transforming terms to their base stem. This lowers the chance that variants of the same word are treated as separate terms when the topic modeling techniques are explored.

StemLASTEM_data_tidy <- LASTEM_data_tidy %>% 
  mutate(word = wordStem(word))

StemLASTEM_data_tidy

## # A tibble: 5,376 × 13
##    Type  Authors  Title Source Keywords Keywordsplus Affiliations Referencecount
##    <chr> <chr>    <chr> <chr>  <chr>    <chr>        <chr>                 <dbl>
##  1 J     Vargas,… A Pr… IEEE … Educati… ENGINEERING… Pontificia …             36
##  2 J     Vargas,… A Pr… IEEE … Educati… ENGINEERING… Pontificia …             36
##  3 J     Vargas,… A Pr… IEEE … Educati… ENGINEERING… Pontificia …             36
##  4 J     Vargas,… A Pr… IEEE … Educati… ENGINEERING… Pontificia …             36
##  5 J     Vargas,… A Pr… IEEE … Educati… ENGINEERING… Pontificia …             36
##  6 J     Vargas,… A Pr… IEEE … Educati… ENGINEERING… Pontificia …             36
##  7 J     Vargas,… A Pr… IEEE … Educati… ENGINEERING… Pontificia …             36
##  8 J     Vargas,… A Pr… IEEE … Educati… ENGINEERING… Pontificia …             36
##  9 J     Vargas,… A Pr… IEEE … Educati… ENGINEERING… Pontificia …             36
## 10 J     Vargas,… A Pr… IEEE … Educati… ENGINEERING… Pontificia …             36
## # ℹ 5,366 more rows
## # ℹ 5 more variables: Usagecount <dbl>, Journal <chr>, Publisher <chr>,
## #   Year <dbl>, word <chr>
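As a small illustration of what stemming does, the sketch below applies `wordStem()` to a hypothetical vector of words (chosen for illustration, not drawn from the data). It shows how singular and plural forms collapse onto one stem, which is consistent with “students” (147) and “student” (59) in section 3.4 appearing as a single “student” stem with 206 occurrences in section 3.7:

```r
library(SnowballC)

# Hypothetical example words; wordStem() uses the Porter stemmer by default
example_words <- c("learning", "students", "student", "analytics", "studies")

stems <- wordStem(example_words)

# Pair each original word with its stem for inspection
data.frame(word = example_words, stem = stems)
```

Note that stems are not always dictionary words, so plots built on the stemmed corpus show truncated forms.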

3.6 Cast a Document Term Matrix

The LDA model requires the text in the form of a document-term matrix (DTM), with one row per document, one column per term, and each cell holding the number of times that term occurs in that document. In this case, the article title acts as the unique document identifier.

tidy_tds_DTM <- StemLASTEM_data_tidy %>%
  count(Title, word) %>%
  cast_dtm(Title, word, n)

tidy_tds_DTM

## <<DocumentTermMatrix (documents: 45, terms: 1222)>>
## Non-/sparse entries: 3497/51493
## Sparsity           : 94%
## Maximal term length: 13
## Weighting          : term frequency (tf)
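To make the structure of a DTM concrete, here is a minimal sketch that assumes a hypothetical three-row count table (`toy_counts`, for illustration only). It also shows why the document identifier should be unique: rows that share an identifier would be merged into one document.

```r
library(dplyr)
library(tidytext)
library(tm)  # supplies the DocumentTermMatrix class that cast_dtm() returns

# Hypothetical per-document word counts, for illustration only
toy_counts <- tibble(
  Title = c("Paper A", "Paper A", "Paper B"),
  word  = c("learn", "student", "learn"),
  n     = c(2L, 1L, 3L)
)

toy_dtm <- toy_counts %>%
  cast_dtm(Title, word, n)

dim(toy_dtm)        # number of documents, number of terms
as.matrix(toy_dtm)  # dense view; a term missing from a document has a count of 0
```

In the real corpus most cells are zero, which is what the 94% sparsity figure above reports.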

3.7 Word counts

LASTEM_counts <- StemLASTEM_data_tidy %>% 
  count(word, sort = TRUE)

LASTEM_counts

## # A tibble: 1,222 × 2
##    word        n
##    <chr>   <int>
##  1 learn     233
##  2 student   206
##  3 analyt     86
##  4 studi      79
##  5 educ       68
##  6 data       59
##  7 model      52
##  8 assess     48
##  9 univers    46
## 10 result     43
## # ℹ 1,212 more rows

3.8 Word frequency

LASTEM_frequencies <- StemLASTEM_data_tidy %>%
  count(word, sort = TRUE) %>%
  mutate(proportion = n / sum(n))

LASTEM_frequencies

## # A tibble: 1,222 × 3
##    word        n proportion
##    <chr>   <int>      <dbl>
##  1 learn     233    0.0433 
##  2 student   206    0.0383 
##  3 analyt     86    0.0160 
##  4 studi      79    0.0147 
##  5 educ       68    0.0126 
##  6 data       59    0.0110 
##  7 model      52    0.00967
##  8 assess     48    0.00893
##  9 univers    46    0.00856
## 10 result     43    0.00800
## # ℹ 1,212 more rows

3.9 Bigram tokenization

The original data are tokenized again to enable term frequency analysis at the bigram (word pair) level. For these iterations, stop word removal and stemming have been incorporated:

tds_bigrams <- LASTEM_data %>%   
  unnest_tokens(output = bigram, input = Abstract, token = "ngrams", n = 2)

tds_bigrams <- tds_bigrams %>% 
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word) %>%
  mutate(word1 = wordStem(word1)) %>% 
  mutate(word2 = wordStem(word2)) %>% 
  unite(bigram, c(word1, word2), sep = " ")

bigram_top_tokens <- tds_bigrams %>% 
  count(bigram, sort = TRUE) %>% 
  top_n(20)

## Selecting by n

bigram_top_tokens

## # A tibble: 22 × 2
##    bigram              n
##    <chr>           <int>
##  1 learn analyt       66
##  2 academ perform     12
##  3 machin learn       11
##  4 student learn      11
##  5 learn environ      10
##  6 learn manag        10
##  7 manag system        9
##  8 student perform     9
##  9 student success     8
## 10 academ achiev       7
## # ℹ 12 more rows

3.10 Bigram Counts

LASTEM_countsbi <- tds_bigrams %>% 
  count(bigram, sort = TRUE)

LASTEM_countsbi

## # A tibble: 1,878 × 2
##    bigram              n
##    <chr>           <int>
##  1 learn analyt       66
##  2 academ perform     12
##  3 machin learn       11
##  4 student learn      11
##  5 learn environ      10
##  6 learn manag        10
##  7 manag system        9
##  8 student perform     9
##  9 student success     8
## 10 academ achiev       7
## # ℹ 1,868 more rows

Exploratory Analysis

4.1 Wordcloud

library(wordcloud2)

wordcloud2(LASTEM_counts)

The most meaningful common words were ‘student,’ ‘study,’ ‘data,’ ‘model,’ ‘assess,’ and ‘result.’

4.2 Basic Bar Chart

LASTEM_counts %>%
  filter(n > 20) %>% # keep rows with word counts greater than 20
  mutate(word = reorder(word, n)) %>% #reorder the word variable by n and replace with new variable called word
  ggplot(aes(n, word)) + # create a plot with n on x axis and word on y axis
  geom_col() # make it a bar plot

LASTEM_countsbi %>%
  filter(n > 5) %>% # keep bigrams that occur more than 5 times
  mutate(bigram = reorder(bigram, n)) %>% # reorder bigram by n so the bars are sorted by count
  ggplot(aes(n, bigram)) + # create a plot with n on x axis and bigram on y axis
  geom_col() # make it a bar plot

The most meaningful common bigrams included ‘machine learning,’ ‘student learning,’ ‘learning environment,’ ‘learning management,’ ‘management system,’ and ‘student performance.’

4.3 Published Article Counts

all_years <- as.character(2014:2024)  # Define the range of years
LASTEM_data$Year <- factor(LASTEM_data$Year, levels = all_years)

LASTEM_data %>% 
  ggplot(aes(x = Year)) + # the stray color argument sat outside aes() and had no effect
  geom_bar(show.legend = FALSE) +
  scale_x_discrete(limits = all_years) +  # Explicitly set x-axis limits
  labs(x = "Year",
     y = "Article Counts",
     title = "Articles Published by Years",
     subtitle = "Published from 2014 - 2024")

This chart shows the number of articles published per year from 2014 to 2024. No articles in the sample were published in 2014-2016. Four were published in 2017, nine in 2018, seven in 2019, and three in 2020. The count peaked at 11 in 2021, dropped to three in 2022, and rose to seven in 2023. Only one article appears for 2024, a year the search covers only up to 7 April. Overall, the 45 articles average about four per year across the full window, or between five and six per year across 2017-2024.
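The yearly figures can be checked with a few lines of arithmetic. This is a minimal sketch that re-enters the per-year counts reported in the text by hand (the tibble `year_counts` is an illustrative name, not an object from the analysis):

```r
library(dplyr)

# Article counts per year as reported in the text above
year_counts <- tibble(
  Year = 2014:2024,
  n    = c(0, 0, 0, 4, 9, 7, 3, 11, 3, 7, 1)
)

total_articles <- sum(year_counts$n)   # should equal the 45 articles in the sample
mean_all_years <- mean(year_counts$n)  # average over all 11 years in the window

# Average over only those years that have at least one article
mean_active_years <- year_counts %>%
  filter(n > 0) %>%
  summarise(mean_n = mean(n)) %>%
  pull(mean_n)

c(total = total_articles, all_years = mean_all_years, active_years = mean_active_years)
```

Which average is more appropriate depends on whether the empty years 2014-2016 are treated as part of the field's publication history.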

4.4 Influential sources

LASTEM_countsSource <- LASTEM_data %>% 
  count(Source, sort = TRUE)

LASTEM_countsSource %>%
  filter(n > 1) %>% # keep sources with more than one article
  mutate(Source = reorder(Source, n)) %>% # reorder Source by n so the bars are sorted by count
  ggplot(aes(n, Source)) + # create a plot with n on x axis and Source on y axis
  geom_col() # make it a bar plot

This chart illustrates the distribution of published papers across sources. The International Journal of Engineering Education and IEEE Access each published three papers. Two papers each were found in the Multidisciplinary Journal for Education Social and Technological Sciences, Education and Information Technologies, Computers & Education, Computer Applications in Engineering Education, and Applied Sciences-Basel.

4.5 Influential affiliations

LASTEM_countsAffiliations <- LASTEM_data %>% 
  count(Affiliations, sort = TRUE)

LASTEM_countsAffiliations %>%
  filter(n > 1) %>% # keep affiliation strings that appear more than once
  mutate(Affiliations = reorder(Affiliations, n)) %>% # reorder Affiliations by n so the bars are sorted by count
  ggplot(aes(n, Affiliations)) + # create a plot with n on x axis and Affiliations on y axis
  geom_col() # make it a bar plot

The University of Sydney and the Universidad de Castilla-La Mancha each accounted for two works, while every other affiliation string appeared only once.
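Because `count(Affiliations)` compares whole strings, an institution is only matched when the full semicolon-separated list is identical across articles. One possible alternative, sketched here on a hypothetical three-article tibble (`toy_affil`, for illustration only), is to split the lists first and count each institution once per article:

```r
library(dplyr)
library(tidyr)

# Hypothetical affiliation strings in the same "A; B" layout as the WoS export
toy_affil <- tibble(
  Title        = c("Paper A", "Paper B", "Paper C"),
  Affiliations = c("University of Sydney; University of Sydney",
                   "University of Sydney; Kyoto University",
                   "Kyoto University")
)

institution_counts <- toy_affil %>%
  separate_rows(Affiliations, sep = ";\\s*") %>%  # one row per listed institution
  distinct(Title, Affiliations) %>%               # count an institution once per article
  count(Affiliations, sort = TRUE)

institution_counts
```

This is an assumption about how one might refine the count, not the procedure used for the chart above; applied to the real data it could change which institutions rank highest.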

4.6 Influential publications

LASTEM_sortUsagecount <- LASTEM_data %>%
  arrange((desc(Usagecount)))

View(LASTEM_sortUsagecount)

library(knitr)

# Select the desired variables and arrange them in the desired order
selected_LASTEM_sortUsagecount <- LASTEM_sortUsagecount %>%
  select(Usagecount, Authors, Year, Title, Source, Affiliations)

# Display the table using kable
kable(selected_LASTEM_sortUsagecount)

Usagecount	Authors	Year	Title	Source	Affiliations
167	Pardo, A; Han, FF; Ellis, RA	2017	Combining University Student Self-Regulated Learning Indicators and Engagement with Online Learning Events to Predict Academic Performance	IEEE TRANSACTIONS ON LEARNING TECHNOLOGIES	University of Sydney; University of Sydney
144	Muñoz-Merino, PJ; Ruipérez-Valiente, JA; Kloos, CD; Auger, MA; Briz, S; de Castro, V; Santalla, SN	2017	Flipping the Classroom to Improve Learning With MOOCs Technology	COMPUTER APPLICATIONS IN ENGINEERING EDUCATION	Universidad Carlos III de Madrid; IMDEA Networks Institute
133	Hu, YH	2022	Effects and acceptance of precision education in an AI-supported smart learning environment	EDUCATION AND INFORMATION TECHNOLOGIES	National Yunlin University Science & Technology
108	Wu, JY	2021	Learning analytics on structured and unstructured heterogeneous data sources: Perspectives from procrastination, help-seeking, and machine-learning defined cognitive engagement	COMPUTERS & EDUCATION	National Yang Ming Chiao Tung University
70	Zhu, QL; Wang, MJ	2020	Team-based mobile learning supported by an intelligent system: case study of STEM students	INTERACTIVE LEARNING ENVIRONMENTS	Hainan University; California State University System; San Diego State University
63	Aparicio, F; Morales-Botello, ML; Rubio, M; Hernando, A; Muñoz, R; López-Fernández, H; Glez-Peña, D; Fdez-Riverola, F; de la Villa, M; Maña, M; Gachet, D; de Buenaga, M	2018	Perceptions of the use of intelligent information access systems in university level active learning activities among teachers of biomedical subjects	INTERNATIONAL JOURNAL OF MEDICAL INFORMATICS	European University of Madrid; European University of Madrid; Universidade de Vigo; Universidade de Vigo; CINBIO; Universidad de Huelva
47	Saqr, M; Alamro, A	2019	The role of social network analysis as a learning analytics tool in online problem based learning	BMC MEDICAL EDUCATION	University of Eastern Finland; Stockholm University; Qassim University
46	Iatrellis, O; Savvas, IK; Fitsilis, P; Gerogiannis, VC	2021	A two-phase machine learning approach for predicting student outcomes	EDUCATION AND INFORMATION TECHNOLOGIES	University of Thessaly
46	Zheng, M; Bender, D; Nadershahi, N	2017	Faculty professional development in emergent pedagogies for instructional innovation in dental education	EUROPEAN JOURNAL OF DENTAL EDUCATION	University of the Pacific
45	Vaz-Fernandes, P; Caeiro, S	2019	Students’ perceptions of a food safety and quality e-learning course: a CASE study for a MSC in food consumption	INTERNATIONAL JOURNAL OF EDUCATIONAL TECHNOLOGY IN HIGHER EDUCATION	Universidade Aberta; Universidade de Lisboa; Universidade Nova de Lisboa
39	Noel, R; Riquelme, F; Mac Lean, R; Merino, E; Cechinel, C; Barcelos, TS; Villarroel, R; Munoz, R	2018	Exploring Collaborative Writing of User Stories With Multimodal Learning Analytics: A Case Study on a Software Engineering Course	IEEE ACCESS	Universidad de Valparaiso; Universidad de Valparaiso; Universidade Federal de Santa Catarina (UFSC); Instituto Federal de Sao Paulo (IFSP); Pontificia Universidad Catolica de Valparaiso
38	Lacave, C; Molina, AI; Cruz-Lemus, JA	2018	Learning Analytics to identify dropout factors of Computer Science studies through Bayesian networks	BEHAVIOUR & INFORMATION TECHNOLOGY	Universidad de Castilla-La Mancha
37	Kuromiya, H; Majumdar, R; Ogata, H	2020	Fostering Evidence-Based Education with Learning Analytics: Capturing Teaching-Learning Cases from Log Data	EDUCATIONAL TECHNOLOGY & SOCIETY	Kyoto University; Kyoto University
35	Cheng, S; Xie, K; Collier, J	2023	Motivational beliefs moderate the relation between academic delay and academic achievement in online learning environments	COMPUTERS & EDUCATION	National Taipei University of Technology; University System of Ohio; Ohio State University; Texas State University System; Sam Houston State University
35	Bertolini, R; Finch, SJ; Nehm, RH	2021	Testing the Impact of Novel Assessment Sources and Machine Learning Methods on Predictive Outcome Modeling in Undergraduate Biology	JOURNAL OF SCIENCE EDUCATION AND TECHNOLOGY	State University of New York (SUNY) System; State University of New York (SUNY) Stony Brook; State University of New York (SUNY) System; State University of New York (SUNY) Stony Brook
34	Divjak, B; Svetec, B; Horvat, D; Kadoic, N	2023	Assessment validity and learning analytics as prerequisites for ensuring student-centred learning design	BRITISH JOURNAL OF EDUCATIONAL TECHNOLOGY	University of Zagreb
30	Vargas, H; Heradio, R; Chacon, J; De la Torre, L; Farias, G; Galan, D; Dormido, S	2019	Automated Assessment and Monitoring Support for Competency-Based Courses	IEEE ACCESS	Pontificia Universidad Catolica de Valparaiso; Universidad Nacional de Educacion a Distancia (UNED); Complutense University of Madrid; Universidad Nacional de Educacion a Distancia (UNED)
30	Al-Shabandar, R; Hussain, AJ; Liatsis, P; Keight, R	2018	Anlaying Learners Behavior in MOOCs: An Examination of Performance and Motivation Using a Data-Driven Approach	IEEE ACCESS	Liverpool John Moores University; Khalifa University of Science & Technology
30	Mwalumbwe, I; Mtebe, JS	2017	USING LEARNING ANALYTICS TO PREDICT STUDENTS’ PERFORMANCE IN MOODLE LEARNING MANAGEMENT SYSTEM: A CASE OF MBEYA UNIVERSITY OF SCIENCE AND TECHNOLOGY	ELECTRONIC JOURNAL OF INFORMATION SYSTEMS IN DEVELOPING COUNTRIES	University of Dar es Salaam
26	Raza, SH; Reddy, E	2021	Intentionality and Players of Effective Online Courses in Mathematics	FRONTIERS IN APPLIED MATHEMATICS AND STATISTICS	University of the South Pacific
26	Scott, K; Morris, A; Marais, B	2018	Medical student use of digital learning resources	CLINICAL TEACHER	University of Sydney; University of Sydney
22	Menchaca, I; Guenaga, M; Solabarrieta, J	2018	Learning Analytics for Formative Assessment in Engineering Education	INTERNATIONAL JOURNAL OF ENGINEERING EDUCATION	University of Deusto
20	Kritzinger, A; Lemmens, JC; Potgieter, M	2018	Learning Strategies for First-Year Biology: Toward Moving the Murky Middle	CBE-LIFE SCIENCES EDUCATION	University of Pretoria; University of Pretoria; University of Pretoria
19	Gardner, C; Jones, A; Jefferis, H	2020	Analytics for Tracking Student Engagement	JOURNAL OF INTERACTIVE MEDIA IN EDUCATION	NA
16	Salazar-Fernandez, JP; Munoz-Gama, J; Maldonado-Mahauad, J; Bustamante, D; Sepúlveda, M	2021	Backpack Process Model (BPPM): A Process Mining Approach for Curricular Analytics	APPLIED SCIENCES-BASEL	Pontificia Universidad Catolica de Chile; Universidad Austral de Chile; Universidad de Cuenca
16	He, LJ; Levine, RA; Bohonak, AJ; Fan, JJ; Stronach, J	2018	Predictive Analytics Machinery for STEM Student Success Studies	APPLIED ARTIFICIAL INTELLIGENCE	California State University System; San Diego State University; California State University System; San Diego State University; California State University System; San Diego State University
15	Oliva-Córdova, LM; Garcia-Cabot, A; Recinos-Fernández, SA; Bojórquez-Roque, MS; Amado-Salvatierra, HR	2022	Evaluating Technological Acceptance of Virtual Learning Environments (VLE) in an Emergency Remote Situation	INTERNATIONAL JOURNAL OF ENGINEERING EDUCATION	Universidad de Alcala; Universidad de San Carlos de Guatemala; Universidad de San Carlos de Guatemala; Universidad de San Carlos de Guatemala
14	Apiola, M; Lokkila, E; Laakso, MJ	2019	Digital learning approaches in an intermediate-level computer science course	INTERNATIONAL JOURNAL OF INFORMATION AND LEARNING TECHNOLOGY	University of Turku
13	Olney, T; Walker, S; Wood, C; Clarke, A	2021	Are We Living in LA (P)LA Land? Reporting on the Practice of 30 STEM Tutors in Their Use of a Learning Analytics Implementation at The Open University	JOURNAL OF LEARNING ANALYTICS	Open University - UK; Open University - UK
12	Prat, A; Code, WJ	2021	WeBWorK log files as a rich source of data on student homework behaviours	INTERNATIONAL JOURNAL OF MATHEMATICAL EDUCATION IN SCIENCE AND TECHNOLOGY	University of British Columbia; University of British Columbia
11	Lobos, K; Sáez-Delgado, F; Cobo-Rendón, R; Mella Norambuena, J; Maldonado Trapp, A; Cisternas San Martín, N; Bruna Jofré, C	2021	Learning Beliefs, Time on Platform, and Academic Performance During the COVID-19 in University STEM Students	FRONTIERS IN PSYCHOLOGY	Universidad de Concepcion; Universidad Catolica de la Santisima Concepcion; Universidad Catolica de la Santisima Concepcion; Universidad de Concepcion; Universidad de Concepcion
11	Llopis-Albert, C; Rubio, F	2021	Application of Learning Analytics to Improve Higher Education	MULTIDISCIPLINARY JOURNAL FOR EDUCATION SOCIAL AND TECHNOLOGICAL SCIENCES	Universitat Politecnica de Valencia
11	Dennehy, D; Conboy, K; Babu, J	2023	Adopting Learning Analytics to Inform Postgraduate Curriculum Design: Recommendations and Research Agenda	INFORMATION SYSTEMS FRONTIERS	Ollscoil na Gaillimhe-University of Galway
11	Lacave, C; Molina, AI	2018	Using Bayesian Networks for Learning Analytics in Engineering Education: A Case Study on Computer Science Dropout at UCLM	INTERNATIONAL JOURNAL OF ENGINEERING EDUCATION	Universidad de Castilla-La Mancha
10	Vargas, H; Heradio, R; Farias, G; Lei, ZC; Torre, LD	2024	A Pragmatic Framework for Assessing Learning Outcomes in Competency-Based Courses	IEEE TRANSACTIONS ON EDUCATION	Pontificia Universidad Catolica de Valparaiso; Universidad Nacional de Educacion a Distancia (UNED); Wuhan University
10	Huang, LW; Willcox, KE	2021	Network models and sensor layers to design adaptive learning using educational mapping	DESIGN SCIENCE	Massachusetts Institute of Technology (MIT); University of Texas System; University of Texas Austin
9	Raj, NS; Renumol, VG	2022	Early prediction of student engagement in virtual learning environments using machine learning techniques	E-LEARNING AND DIGITAL MEDIA	Cochin University Science & Technology
8	Chapman, KE; Davidson, ME; Azuka, N; Liberatore, MW	2023	Quantifying deliberate practice using auto-graded questions: Analyzing multiple metrics in a chemical engineering course	COMPUTER APPLICATIONS IN ENGINEERING EDUCATION	University System of Ohio; University of Toledo; University System of Ohio; University of Toledo
8	Simanca, F; Crespo, RG; Rodríguez-Baena, L; Burgos, D	2019	Identifying Students at Risk of Failing a Subject by Using Learning Analytics for Subsequent Customised Tutoring	APPLIED SCIENCES-BASEL	Universidad Cooperativa de Colombia; Universidad Internacional de La Rioja (UNIR); Universidad Internacional de La Rioja (UNIR)
7	Walker, S; Olney, T; Wood, C; Clarke, A; Dunworth, M	2019	How do tutors use data to support their students?	OPEN LEARNING	Open University - UK
5	Vaithilingam, CA; Gamboa, RA; Lim, SC	2019	EMPOWERED PEDAGOGY: CATCHING UP WITH THE FUTURE	MALAYSIAN JOURNAL OF LEARNING & INSTRUCTION	Taylor’s University; Multimedia University
4	Tsoni, R; Sakkopoulos, E; Panagiotakopoulos, CT; Verykios, VS	2021	On the equivalence between bimodal and unimodal students’ collaboration networks in distance learning	INTELLIGENT DECISION TECHNOLOGIES-NETHERLANDS	Hellenic Open University; University of Piraeus; University of Patras
3	Qazdar, A; Hasidi, O; Qassimi, S; Abdelwahed, E	2023	Newly Proposed Student Performance Indicators Based on Learning Analytics for Continuous Monitoring in Learning Management Systems	INTERNATIONAL JOURNAL OF ONLINE AND BIOMEDICAL ENGINEERING	Ibn Zohr University of Agadir; Cadi Ayyad University of Marrakech; Cadi Ayyad University of Marrakech
1	Torner, ME; Aparicio-Fernández, C; Vivancos, JL; Cañada-Soriano, M	2023	Analysis of the optimization of resources with Learning Analytics techniques	MULTIDISCIPLINARY JOURNAL FOR EDUCATION SOCIAL AND TECHNOLOGICAL SCIENCES	Universitat Politecnica de Valencia; Universitat Politecnica de Valencia; Universitat Politecnica de Valencia; Universitat Politecnica de Valencia
0	Soto-Acevedo, M; Abuchar-Curi, AM; Zuluaga-Ortiz, RA; Delahoz-Domínguez, EJ	2023	A Machine Learning Model to Predict Standardized Tests in Engineering Programs in Colombia	IEEE REVISTA IBEROAMERICANA DE TECNOLOGIAS DEL APRENDIZAJE-IEEE RITA	Universidad Tecnologica de Bolivar; Universidad Tecnologica de Bolivar; Universidad de la Costa

This table ranks the publications by their Web of Science usage count (the `Since 2013 Usage Count` field), which measures user activity on a record rather than citations. The five highest-ranked titles are “Combining University Student Self-Regulated Learning Indicators and Engagement with Online Learning Events to Predict Academic Performance,” “Flipping the Classroom to Improve Learning With MOOCs Technology,” “Effects and acceptance of precision education in an AI-supported smart learning environment,” “Learning analytics on structured and unstructured heterogeneous data sources: Perspectives from procrastination, help-seeking, and machine-learning defined cognitive engagement,” and “Team-based mobile learning supported by an intelligent system: case study of STEM students.”

4.7 Influential publishers

LASTEM_countsPublisher <- LASTEM_data %>% 
  count(Publisher, sort = TRUE)

LASTEM_countsPublisher %>%
  filter(n > 1) %>% # keep publishers with more than one article
  mutate(Publisher = reorder(Publisher, n)) %>% # reorder Publisher by n so the bars are sorted by count
  ggplot(aes(n, Publisher)) + # create a plot with n on x axis and Publisher on y axis
  geom_col() # make it a bar plot

This visualization highlights the most influential publishers, with Wiley leading, followed by Springer and IEEE-Institute of Electrical and Electronics Engineers Inc.

4.8 Word Counts by Year

StemLASTEM_data_tidy %>%
  group_by(Year) %>%
  count(word, sort = TRUE) %>%
  top_n(8) %>%
  ungroup %>%
  mutate(word = reorder_within(word, n, Year)) %>%
  ggplot(aes(x = word, y = n, fill = word)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ Year, scales = "free_y") +
  coord_flip() +
  scale_x_reordered() +
  scale_y_continuous(expand = c(0,0)) +
  labs(y = "Count",
       x = "Unique words",
       title = "Most frequent words found in Abstract through Years",
       subtitle = "Stop words removed from the list")

## Selecting by n

In diagramming the top eight unigrams by year, certain terms recur. Given the project’s focus on learning analytics in STEM higher education, it is expected that terms like “learn,” “student,” “analytics,” “study,” and “education” would consistently rank among the top terms across the years. However, each year also presents unique terms. For instance, “mooc,” “develop,” and “design” featured prominently in 2017, while “system” and “success” stood out in 2018. In 2022, “virtual” and “predict” were notable. In 2024, which is represented by a single article, “assess” and “model” were the most common terms.

4.9 Bigram Counts

tds_bigrams %>%
  group_by(Year) %>%
  count(bigram, sort = TRUE) %>%
  top_n(5) %>%
  ungroup %>%
  mutate(bigram = reorder_within(bigram, n, Year)) %>%
  ggplot(aes(x = bigram, y = n, fill = bigram)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ Year, scales = "free_y") +
  coord_flip() +
  scale_x_reordered() +
  scale_y_continuous(expand = c(0,0)) +
  labs(y = "Count",
       x = "Unique Bigrams",
       title = "Most frequent bigrams found in Abstract through Years",
       subtitle = "Stop words removed from the list")

## Selecting by n

Moreover, I shifted the emphasis away from individual terms. In our community, key topics are often expressed as multi-word terms, such as ‘learning analytics’ and ‘machine learning,’ so treating each word in isolation loses information. With this in mind, I replicated the previous visualization using bigrams, keeping the top five per year. The expansion to two-word phrases reveals common topics like ‘learning analytics’ and ‘academic performance,’ along with unique topics for each year. Noteworthy examples include ‘blended learning’ and ‘analytic tools’ in 2017, ‘student success’ and ‘learning activities’ in 2018, ‘program outcomes’ and ‘learning systems’ in 2019, ‘student engagement’ and ‘evidence-based’ in 2020, ‘model approach’ and ‘machine learning’ in 2021, ‘virtual learning’ and ‘implementation strategies’ in 2022, ‘teacher inquiry’ and ‘academic achievement’ in 2023, and ‘competency-based’ and ‘assessing models’ in 2024.

Model

To explore the potential for generating topics, I will compare three topic models to see how they differ in delineating distinct topics:

Latent Dirichlet Allocation (LDA): LDA operates on the assumption that each document comprises a blend of topics, with each topic comprising a blend of words.

Structural Topic Model (STM): STM leverages metadata to enhance the assignment of words to topics within a corpus. It also enables the examination of relationships between covariates and documents.

Biterm Topic Model (BTM): BTM is specifically designed for short-format corpora. It identifies topics by explicitly modeling co-occurrences of words within a specified text window.

By comparing the outcomes of these models, I aim to gain insights into how they differ in identifying and delineating distinct topics within the dataset.
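The mixture idea that underlies LDA can be made concrete with a small numeric sketch: if a document's topic proportions and each topic's word probabilities are known, the document's expected word distribution is their weighted combination. All topic names and numbers below are invented for illustration and are not estimates from the corpus.

```r
# Toy topic-word matrix (rows = topics, columns = words); each row sums to 1.
phi <- rbind(
  topic_prediction = c(student = 0.2, predict = 0.5, model = 0.3, course = 0.0),
  topic_teaching   = c(student = 0.4, predict = 0.0, model = 0.1, course = 0.5)
)

# Toy document-topic proportions for a single abstract; they sum to 1.
theta <- c(topic_prediction = 0.7, topic_teaching = 0.3)

# Expected word distribution of the document: theta (1 x 2) times phi (2 x 4)
word_probs <- drop(theta %*% phi)
print(round(word_probs, 2))
```

Topic models work in the opposite direction: they observe only the word counts and try to recover plausible values for both the topic proportions and the topic-word probabilities.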

5.1 Determining K

Each model relies on determining an optimal value for K, representing the number of potential topics to be identified. If K is too small, the corpus may be divided into a few overly generic topics. Conversely, if K is too large, the collection may be fragmented into numerous topics that either overlap significantly or become indistinguishable. To ensure consistent comparisons across models, a common K value needs to be established before applying the models.

FindTopicsNumber() Function in the LDA Model

For the LDA model, four metrics were computed and then plotted to show the K value at which each metric reaches its maximum or minimum:

k_metrics <- FindTopicsNumber(
  tidy_tds_DTM,
  topics = seq(5, 80, by = 5),
  metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
  method = "Gibbs",
  control = list(seed = 77),
  mc.cores = NA,
  return_models = FALSE,
  verbose = FALSE,
  libpath = NULL
)
FindTopicsNumber_plot(k_metrics)

The crucial aspect is pinpointing the visible bend or inflection point on each line. This is where the lines shift from a rapid increase or decrease to a more gradual trajectory. Across all lines, this inflection point falls between 20 and 30 topics. However, considering that the dataset comprises only 45 items, I opted for the smallest candidate value tested, K = 5.
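One way to cross-check a visual reading of the plot is to list the K that each metric prefers. In ldatuning, Griffiths2004 and Deveaud2014 are meant to be maximized, while CaoJuan2009 and Arun2010 are meant to be minimized. The sketch below applies that rule to a made-up data frame shaped like the output of FindTopicsNumber(); the values are placeholders, not the project's results.

```r
# Placeholder values shaped like FindTopicsNumber() output (NOT real results)
k_metrics_toy <- data.frame(
  topics        = c(5, 10, 15, 20, 25, 30),
  Griffiths2004 = c(-5200, -5050, -4980, -4940, -4950, -4990),
  CaoJuan2009   = c(0.21, 0.16, 0.13, 0.11, 0.12, 0.14),
  Arun2010      = c(48, 39, 33, 30, 29, 31),
  Deveaud2014   = c(1.9, 2.1, 2.0, 1.8, 1.6, 1.5)
)

maximize <- c("Griffiths2004", "Deveaud2014")
minimize <- c("CaoJuan2009", "Arun2010")

# For each metric, report the K at which it attains its preferred extreme
best_k <- c(
  sapply(maximize, function(m) k_metrics_toy$topics[which.max(k_metrics_toy[[m]])]),
  sapply(minimize, function(m) k_metrics_toy$topics[which.min(k_metrics_toy[[m]])])
)
print(best_k)
```

When the metrics disagree, as they do in this toy table, the choice of K remains a judgment call that has to weigh corpus size and interpretability.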

5.2 Latent Dirichlet Allocation (LDA) Model

LDA is a mathematical technique used to estimate the combination of words associated with each topic and to determine the blend of topics that characterize each document (Silge & Robinson, 2017).

n_distinct(LASTEM_data$Abstract)

## [1] 45

tds_lda <- LDA(tidy_tds_DTM, 
                  k = 5, 
                  control = list(seed = 588))

terms(tds_lda, 10)

##       Topic 1   Topic 2    Topic 3    Topic 4   Topic 5  
##  [1,] "learn"   "learn"    "student"  "learn"   "learn"  
##  [2,] "student" "student"  "learn"    "student" "student"
##  [3,] "academ"  "analyt"   "studi"    "assess"  "data"   
##  [4,] "educ"    "resourc"  "perform"  "studi"   "studi"  
##  [5,] "analyt"  "statist"  "data"     "analyt"  "model"  
##  [6,] "predict" "engag"    "academ"   "model"   "analyt" 
##  [7,] "time"    "onlin"    "educ"     "develop" "correct"
##  [8,] "model"   "mathemat" "analyt"   "result"  "educ"   
##  [9,] "univers" "cours"    "interact" "system"  "network"
## [10,] "studi"   "data"     "result"   "lo"      "practic"

For K = 5, the topics appear generally similar across the board, although a couple stand out as unique. However, discerning those distinct topics is challenging in this format. The faceted plot below offers a more insightful visual representation:

top_terms_lda <- tidy(tds_lda, matrix="beta") %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, desc(beta))
top_terms_lda %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  group_by(topic, term) %>%    
  arrange(desc(beta)) %>%  
  ungroup() %>%
  ggplot(aes(beta, term, fill = as.factor(topic))) +
  geom_col(show.legend = FALSE) +
  scale_y_reordered() +
  labs(title = "Top 10 terms in each LDA topic",
       x = expression(beta), y = NULL) +
  facet_wrap(~ topic, ncol = 4, scales = "free")

Performance Summary

The LDA model struggled to differentiate between topics, providing generic summaries similar to what one might find in a typical data science article. These findings qualitatively validate the model’s difficulty in consistently identifying unique topics within this dataset.
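The degree of overlap can be quantified directly from the top-10 term table printed above. The sketch below retypes those stemmed terms and counts how many each pair of topics shares; this counting approach is an illustrative check added here and was not part of the original analysis.

```r
# Top-10 stemmed terms per topic, copied from the terms(tds_lda, 10) output
topics <- list(
  topic1 = c("learn", "student", "academ", "educ", "analyt",
             "predict", "time", "model", "univers", "studi"),
  topic2 = c("learn", "student", "analyt", "resourc", "statist",
             "engag", "onlin", "mathemat", "cours", "data"),
  topic3 = c("student", "learn", "studi", "perform", "data",
             "academ", "educ", "analyt", "interact", "result"),
  topic4 = c("learn", "student", "assess", "studi", "analyt",
             "model", "develop", "result", "system", "lo"),
  topic5 = c("learn", "student", "data", "studi", "model",
             "analyt", "correct", "educ", "network", "practic")
)

# Pairwise counts of shared top terms (the diagonal is 10 by construction)
overlap <- sapply(topics, function(a) {
  sapply(topics, function(b) length(intersect(a, b)))
})
print(overlap)

# Terms that appear in the top 10 of every topic
shared_by_all <- Reduce(intersect, topics)
print(shared_by_all)
```

Every pair of topics shares at least three of its ten top terms, and ‘learn,’ ‘student,’ and ‘analyt’ appear in all five, which is consistent with the difficulty of telling the LDA topics apart.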

5.3 Structural Topic Model (STM)

The stm package in R requires that the documents, metadata, and “vocab” (the complete list of words in the documents) be stored in distinct objects, as shown in the code below. The first call, textProcessor(), lowercases the abstracts, removes stop words, numbers, and punctuation, and stems the remaining words. Filtering out exceedingly common and rare terms is a common practice in topic modeling, as these terms can complicate word-topic assignments (Bail, 2019).

temp <- textProcessor(LASTEM_data$Abstract, 
                      metadata = LASTEM_data,  
                      lowercase=TRUE, 
                      removestopwords=TRUE, 
                      removenumbers=TRUE,  
                      removepunctuation=TRUE, 
                      wordLengths=c(3,Inf),
                      stem=TRUE,
                      onlycharacter= FALSE, 
                      striphtml=TRUE, 
                      customstopwords=NULL)

## Building corpus... 
## Converting to Lower Case... 
## Removing punctuation... 
## Removing stopwords... 
## Removing numbers... 
## Stemming... 
## Creating Output...

docs <- temp$documents 
meta <- temp$meta 
vocab <- temp$vocab 
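In stm, low-frequency terms are typically dropped with prepDocuments() after textProcessor() has run. The following is a minimal sketch of that step on gadarian, a small survey data set bundled with the stm package, which stands in for the project corpus here; the threshold of 2 is an arbitrary illustrative choice, and the object names are hypothetical.

```r
library(stm)

# gadarian ships with stm; it is used only as a stand-in corpus.
data("gadarian", package = "stm")

processed <- textProcessor(gadarian$open.ended.response,
                           metadata = gadarian,
                           verbose = FALSE)

# Drop terms that appear in 2 or fewer documents (illustrative threshold)
prepped <- prepDocuments(processed$documents,
                         processed$vocab,
                         processed$meta,
                         lower.thresh = 2,
                         verbose = FALSE)

length(processed$vocab)  # vocabulary size before filtering
length(prepped$vocab)    # vocabulary size after filtering
```

With only 45 abstracts, an aggressive threshold could remove a large share of the vocabulary, so any such filter would need to be chosen with care.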

tds_stm <- stm(documents = docs,
               data = meta,
               vocab = vocab,
               prevalence = ~ Year,
               K = 5,
               max.em.its = 25,
               verbose = FALSE)

plot.STM(tds_stm, n = 20)

The initial iteration of the STM model with K=5 exhibits a broader array of terms compared to LDA. However, this linear representation still struggles to clearly depict the differences between each topic and its adjacent topics. A more visually informative method is facilitated by the toLDAvis function.

toLDAvis(mod = tds_stm, docs = docs)

## Loading required namespace: servr

This tool explores the spatial relationships between topics, aiming for a scenario where the model identifies 5 topics that are distinctly separate from each other. In an ideal setting, the diagram would depict 5 non-overlapping circles, each representing a unique topic. While it’s unlikely for a model to achieve such precision, this diagram with K=5 shows relatively good separation between topics. There’s only one region where topics overlap, indicating a larger proportion of similar terms shared between them.

Topic 5 is highlighted due to its size and significant spacing from neighboring topics. In LDAvis, the area of each circle reflects the topic’s overall prevalence in the corpus rather than the number of terms it contains. In this instance, the prominent term is ‘learn,’ which is expected, but it’s the accompanying terms that set it apart from neighboring topics. The solitary bubble signifies that this topic is distinct in its composition of terms.

Performance Summary

The STM model demonstrated a better ability to differentiate between topics than the LDA model. The clear separation of topic terms and the presence of medium to large-sized bubbles suggest that K = 5 is a workable number of topics for this corpus and that incorporating metadata enhances the likelihood of distinguishing between unique topics. However, it’s worth noting that most of the top 20 terms recur across the 5 topics, suggesting some overlap or redundancy in the topics identified.

5.4 Biterm Topic Model (BTM)

The Biterm Topic Model (BTM) is a word co-occurrence based model designed to learn topics by analyzing word-word co-occurrence patterns (Wijffels, 2021).

A biterm comprises two words that frequently occur together within the same context, typically within a short text window. This window is defined by parameters such as skipgram and width.

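In the code below, the skipgram argument of udpipe’s cooccurrence() determines how far ahead in the text to look when pairing words, while the window argument of BTM() sets the window size used when the package extracts biterms itself (default 15). In this project, a skipgram value of 10 was employed and window was left at its default; because the biterms are supplied explicitly, the skipgram setting is what governs which word pairs enter the model.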

Unlike LDA models that focus on word occurrences within individual documents, BTM models biterm occurrences across the entire corpus, providing a different perspective on topic modeling.
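To make the notion of a biterm concrete, the base R sketch below lists every unordered word pair whose positions are at most max_dist apart in a short token sequence. The helper make_biterms(), its max_dist parameter, and the toy tokens are assumptions for illustration; udpipe’s cooccurrence() and the BTM package apply their own windowing rules and also aggregate counts.

```r
# Hypothetical helper: enumerate word pairs (biterms) whose positions are
# at most `max_dist` apart. Illustrative only; not the udpipe/BTM internals.
make_biterms <- function(tokens, max_dist = 2) {
  n <- length(tokens)
  if (n < 2) {
    return(data.frame(term1 = character(0), term2 = character(0)))
  }
  idx <- expand.grid(i = seq_len(n), j = seq_len(n))
  idx <- idx[idx$j > idx$i & (idx$j - idx$i) <= max_dist, ]
  idx <- idx[order(idx$i, idx$j), ]
  data.frame(term1 = tokens[idx$i],
             term2 = tokens[idx$j],
             stringsAsFactors = FALSE)
}

# Toy token sequence, invented for this example
tokens <- c("learning", "analytics", "student", "data")
biterms_toy <- make_biterms(tokens, max_dist = 2)
print(biterms_toy)
```

A larger distance yields more pairs per document, which is what lets BTM gather usable co-occurrence evidence from texts as short as abstracts.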

LASTEM_data$ID <- 1:nrow(LASTEM_data)

LASTEM_data$ID

##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45

LASTEM_data_btm <- select(LASTEM_data, c(ID, Abstract)) %>%
  rename (doc_id = ID) %>%
  rename (text = Abstract)

As with the first two models, the BTM trial focused on training a model to identify 5 topics:

anno    <- udpipe(LASTEM_data_btm, "english", trace = 10)

## 2024-04-30 13:14:47.740299 Annotating text fragment 1/45
## 2024-04-30 13:14:50.266528 Annotating text fragment 11/45
## 2024-04-30 13:14:52.932178 Annotating text fragment 21/45
## 2024-04-30 13:14:55.63223 Annotating text fragment 31/45
## 2024-04-30 13:14:57.910263 Annotating text fragment 41/45

biterms <- as.data.table(anno)
biterms <- biterms[, cooccurrence(x = lemma,
                                  relevant = upos %in% c("NOUN",
                                                         "ADJ",
                                                         "PROPN"),
                                  skipgram = 10),
                   by = list(doc_id)]

# Build BTM
set.seed(588)
traindata <- subset(anno, upos %in% c("NOUN", "ADJ", "PROPN"))
traindata <- traindata[, c("doc_id", "lemma")]
model <- BTM(traindata, k = 5, 
             beta = 0.01, 
             iter = 500,
             biterms = biterms, 
             trace = 100)

## 2024-04-30 13:15:01 Start Gibbs sampling iteration 1/500
## 2024-04-30 13:15:02 Start Gibbs sampling iteration 101/500
## 2024-04-30 13:15:02 Start Gibbs sampling iteration 201/500
## 2024-04-30 13:15:03 Start Gibbs sampling iteration 301/500
## 2024-04-30 13:15:03 Start Gibbs sampling iteration 401/500

# Plot model results (requires the ggraph package)
library(ggraph)
plot(model,
     top_n = 15,
     title = "BTM model",
     subtitle = "K = 5, 500 Training Iterations",
     labels = as.character(1:5))  # one label per topic, since K = 5

The words ‘data’, ‘analytic’ and ‘university’ were most important for topic 1. The words ‘assessment’, ‘model’, ‘framework’ and ‘study’ were most important for topic 2. The words ‘academic’, ‘correct’, ‘delay’ and ‘question’ were most important for topic 3. The words ‘student’, ‘course’ and ‘education’ were most important for topic 4. The words ‘intelligent’, ‘system’ and ‘activity’ were most important for topic 5.

Performance Summary

The visual outcomes of the BTM with K set to 5 topics reveal several distinct characteristics:

Each grouped topic is unique in terminology; there are no repetitions among them.
The size of words reflects their importance within the respective topic. Smaller font sizes indicate lower probabilities (phi) of those words appearing in the topic.
The line weights within topics signify the strength of relationships between words. Thicker lines indicate stronger associations among terms.

Although the visualization may resemble word clouds, the underlying algorithm is more intricate than simple term frequencies. It is based on the co-occurrence of terms, adding a layer of complexity to the graphic.

Communication

6.1 Conclusion

This project delved into publications within WoS concerning learning analytics, higher education, and STEM, employing text mining to identify research trends and characteristics. The search covered January 1, 2014, to April 7, 2024, and returned 205 articles, of which 45 were selected for analysis based on the inclusion criteria. Results indicated fluctuating publication numbers, with the highest count in 2021. Key unigrams included ‘student,’ ‘study,’ ‘data,’ ‘model,’ ‘assess,’ and ‘result,’ while prominent bigrams comprised ‘machine learning,’ ‘student learn,’ ‘learn environment,’ ‘learn management,’ ‘management system,’ and ‘student performance.’

Regarding influential sources, the International Journal of Engineering Education and IEEE ACCESS stood out, with University of Sydney and Universidad de Castilla-La Mancha as prominent affiliations. Wiley, Springer, and IEEE-Institute of Electrical and Electronics Engineers Inc emerged as leading publishers. Notable publications included titles like “Student Self-Regulated Learning Indicators and Engagement with Online Learning Events to Predict Academic Performance,” “Flipping the Classroom to Improve Learning With MOOCs Technology,” “Effects and Acceptance of Precision Education in an AI-Supported Smart Learning Environment,” “Learning Analytics on Structured and Unstructured Heterogeneous Data,” and “Perspectives from Procrastination, Help-Seeking, and Machine-Learning Defined Cognitive Engagement Team-Based Mobile.”

Furthermore, the study explored topic generation using three models—LDA, STM, and BTM. BTM exhibited superior effectiveness by revealing the most unique latent topics, with five topics identified through this model.

6.2 Limitations

The Web of Science database’s coverage may not include all relevant articles in the learning analytics field, especially those from journals or publications not indexed in this database. This limitation could lead to a biased or incomplete understanding of the research landscape. Additionally, the dataset used in this analysis is relatively small, and determining the optimal K value for topic modeling may not be definitive.

6.3 Ethical Issues

Ethical Use of Data and Findings: It’s essential to handle the data and findings presented in the literature ethically and responsibly. This includes accurately representing research results without misinterpretation or misrepresentation. Additionally, acknowledging any limitations or ethical considerations highlighted by the original authors is crucial to maintain ethical standards in literature analysis.

Transparency and Reproducibility: I prioritized transparency throughout my literature analysis by meticulously documenting my search strategies, inclusion criteria, and article selection process. I strove for reproducibility by offering comprehensive information that enables others to replicate or validate my review methodology. This commitment to transparency and reproducibility enhances the credibility and trustworthiness of my literature analysis.

References

Bail, C. (2019). Topic Modeling. Text as Data Course. Retrieved April 22, 2022, from https://sicss.io/2019/materials/day3-text-analysis/topic-modeling/rmarkdown/Topic_Modeling.html

Jipeng, Q., Zhenyu, Q., Yun, L., Yunhao, Y., & Xindong, W. (2019). Short text topic modeling techniques, applications, and performance: A survey. arXiv.org. Retrieved April 21, 2022, from https://doi.org/10.48550/arXiv.1904.07695

Silge, J., & Robinson, D. (2017). Topic Modeling: Text mining with R. 6 Topic modeling | Text Mining with R. Retrieved April 30, 2022, from https://www.tidytextmining.com/topicmodeling.html

Wijffels, J. (2021). BTM: Biterm topic models for short text - cran.r-project.org. CRAN - Package BTM. Retrieved April 27, 2022, from https://cran.r-project.org/web/packages/BTM/BTM.pdf

Yan, X., Guo, J., Lan, Y., & Cheng, X. (2013). A biterm topic model for short texts. Xiaohui Yan’s Homepage. Retrieved April 27, 2022, from http://xiaohuiyan.github.io/paper/BTM-WWW13.pdf

Ming Cai

2024-04-10