1. Introduction

This literature analysis identifies trends, developments, and emerging themes in learning analytics as applied to STEM education in higher education. It offers an overview of the field and of how its topics have evolved, which should be of value to researchers and practitioners working on STEM education in higher education settings.

To that end, this research project aims to answer the following questions:

(1) What is the annual publication trend of the studies?
(2) What are the most influential sources, affiliations, publishers, and publications in the field?
(3) What are the most common words used?
(4) Which topic modeling approach is optimal for generating potential topics from short text?
(5) What are the potential topic trends observed over the years?

2. Preparation

2.1 Data Source

The raw data were collected from the Web of Science (WoS) database, a comprehensive research database that covers scholarly literature across a wide range of disciplines. The search string was: “learning analytics” AND (“higher education” OR colleg* OR universit*) AND (STEM OR science OR technology OR engineering OR mathematics), applied to the Abstract field and restricted to the Article document type. With all publication types in scope, the search returned 373 documents; limiting it to articles published between 2014 and 2024 left 205. The search was run on a single day, 7 April 2024, to avoid deviations caused by daily updates of the database.

Inclusion criteria were then applied to select relevant studies: (1) the paper is written in English; (2) the full text is available; (3) the paper is peer-reviewed; (4) the study is conducted in a higher education context; (5) the study focuses on STEM education; (6) the paper is empirical. Other types of documents, such as non-peer-reviewed papers, government reports, reviews, meta-analyses, surveys, commentary articles, editorials, book chapters, theses, dissertations, and author notes, were therefore excluded from this review. This screening left 45 articles.

Finally, the records of these 45 articles were downloaded in Excel format using the Export Records to Excel File option, with Record Content set to “Full Record and Cited References.” Each WoS record includes the authors, document type, Web of Science category, keywords, year of publication, publisher, affiliated institutions, countries/regions, and indexes for the article.

2.2 Load libraries

library(readr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidytext)
library(tidyr)
library(SnowballC)
library(topicmodels)
library(stm)
## stm v1.3.7 successfully loaded. See ?stm for help. 
##  Papers, resources, and other materials at structuraltopicmodel.com
library(ldatuning)
library(knitr)
library(LDAvis)
library(tm)
## Loading required package: NLP
## 
## Attaching package: 'NLP'
## 
## The following object is masked from 'package:ggplot2':
## 
##     annotate
library(lubridate)
library(kableExtra)
## Warning: package 'kableExtra' was built under R version 4.3.3
## 
## Attaching package: 'kableExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     group_rows
library(BTM)
## Warning: package 'BTM' was built under R version 4.3.3
library(textplot)
## Warning: package 'textplot' was built under R version 4.3.3
library(concaveman)
## Warning: package 'concaveman' was built under R version 4.3.3
library(udpipe)
## Warning: package 'udpipe' was built under R version 4.3.3
library(data.table)
## 
## Attaching package: 'data.table'
## 
## The following objects are masked from 'package:lubridate':
## 
##     hour, isoweek, mday, minute, month, quarter, second, wday, week,
##     yday, year
## 
## The following object is masked from 'package:purrr':
## 
##     transpose
## 
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
library(stopwords)
## 
## Attaching package: 'stopwords'
## 
## The following object is masked from 'package:tm':
## 
##     stopwords
library(ggplot2)
library(igraph)
## 
## Attaching package: 'igraph'
## 
## The following objects are masked from 'package:lubridate':
## 
##     %--%, union
## 
## The following objects are masked from 'package:purrr':
## 
##     compose, simplify
## 
## The following object is masked from 'package:tidyr':
## 
##     crossing
## 
## The following object is masked from 'package:tibble':
## 
##     as_data_frame
## 
## The following objects are masked from 'package:dplyr':
## 
##     as_data_frame, groups, union
## 
## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum
## 
## The following object is masked from 'package:base':
## 
##     union
library(ggraph)

3. Wrangle

3.1 Read the data into the R project

library(readxl)
downloadLASTEM_data <- read_excel("C:/Users/kaoya/Downloads/ECI_588_Datasets/download_LASTEM.xls")
View(downloadLASTEM_data)

3.2 Rename Columns

# Rename all columns in a single rename() call (new_name = old_name)
downloadLASTEM_data <- downloadLASTEM_data %>%
  rename(
    Year           = `Publication Year`,
    Title          = `Article Title`,
    Source         = `Source Title`,
    Type           = `Publication Type`,
    Keywords       = `Author Keywords`,
    Usagecount     = `Since 2013 Usage Count`,
    Keywordsplus   = `Keywords Plus`,
    Referencecount = `Cited Reference Count`,
    Journal        = `Journal Abbreviation`
  )

3.3 Subset columns

LASTEM_data <- select(downloadLASTEM_data, Type, Authors, Title, Source, Keywords, Keywordsplus, Abstract, Affiliations, Referencecount, Usagecount, Journal, Publisher, Year)

3.4 Create tidytext corpus

LASTEM_data_tidy <- LASTEM_data %>%
  unnest_tokens(output = word, input = Abstract) %>%
  anti_join(stop_words, by = "word")


tidy_top_tokens <- LASTEM_data_tidy %>% 
  count(word, sort = TRUE) %>% 
  top_n(50)
## Selecting by n
tidy_top_tokens
## # A tibble: 51 × 2
##    word           n
##    <chr>      <int>
##  1 learning     230
##  2 students     147
##  3 analytics     82
##  4 study         63
##  5 data          59
##  6 student       59
##  7 education     45
##  8 academic      41
##  9 assessment    39
## 10 university    39
## # ℹ 41 more rows

3.5 Stemming

Stemming reduces the feature size of a corpus by transforming terms to their base stem, so that inflected forms of the same word are counted together. This limits redundancy among terms and phrases when the various topic modeling techniques are explored.

StemLASTEM_data_tidy <- LASTEM_data_tidy %>% 
  mutate(word = wordStem(word))

StemLASTEM_data_tidy
## # A tibble: 5,376 × 13
##    Type  Authors  Title Source Keywords Keywordsplus Affiliations Referencecount
##    <chr> <chr>    <chr> <chr>  <chr>    <chr>        <chr>                 <dbl>
##  1 J     Vargas,… A Pr… IEEE … Educati… ENGINEERING… Pontificia …             36
##  2 J     Vargas,… A Pr… IEEE … Educati… ENGINEERING… Pontificia …             36
##  3 J     Vargas,… A Pr… IEEE … Educati… ENGINEERING… Pontificia …             36
##  4 J     Vargas,… A Pr… IEEE … Educati… ENGINEERING… Pontificia …             36
##  5 J     Vargas,… A Pr… IEEE … Educati… ENGINEERING… Pontificia …             36
##  6 J     Vargas,… A Pr… IEEE … Educati… ENGINEERING… Pontificia …             36
##  7 J     Vargas,… A Pr… IEEE … Educati… ENGINEERING… Pontificia …             36
##  8 J     Vargas,… A Pr… IEEE … Educati… ENGINEERING… Pontificia …             36
##  9 J     Vargas,… A Pr… IEEE … Educati… ENGINEERING… Pontificia …             36
## 10 J     Vargas,… A Pr… IEEE … Educati… ENGINEERING… Pontificia …             36
## # ℹ 5,366 more rows
## # ℹ 5 more variables: Usagecount <dbl>, Journal <chr>, Publisher <chr>,
## #   Year <dbl>, word <chr>

3.6 Cast a Document Term Matrix

The LDA model requires the text in the form of a document-term matrix (DTM): one row per document, one column per term, and each cell holding the number of times that term occurs in that document. In this case, the article title acts as the unique document identifier.

tidy_tds_DTM <- StemLASTEM_data_tidy %>%
  count(Title, word) %>%
  cast_dtm(Title, word, n)

tidy_tds_DTM
## <<DocumentTermMatrix (documents: 45, terms: 1222)>>
## Non-/sparse entries: 3497/51493
## Sparsity           : 94%
## Maximal term length: 13
## Weighting          : term frequency (tf)
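
To make the structure and the reported sparsity concrete, here is a minimal base R sketch. The toy tokens are invented for illustration, and `table()` is used as a simple stand-in for `cast_dtm()`; only the last two lines use figures from the output above.

```r
# Toy corpus (hypothetical): one row per token occurrence after tokenizing.
tokens <- data.frame(
  doc  = c("A", "A", "A", "B", "B"),
  term = c("learn", "learn", "student", "student", "model"),
  stringsAsFactors = FALSE
)

# Cross-tabulate into a document-term matrix: rows = documents, columns = terms.
dtm <- unclass(table(tokens$doc, tokens$term))
print(dtm)

# Sparsity is the share of cells that are zero.
toy_sparsity <- mean(dtm == 0)

# The same arithmetic on the figures printed above (45 documents, 1222 terms).
cells             <- 45 * 1222      # total number of cells
reported_sparsity <- 51493 / cells  # sparse entries / all cells
print(round(reported_sparsity, 2))
```

The non-sparse and sparse entries reported above (3497 and 51493) add up to the 45 × 1222 cells of the matrix, which is where the 94% sparsity comes from.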

3.7 Word counts

LASTEM_counts <- StemLASTEM_data_tidy %>% 
  count(word, sort = TRUE)

LASTEM_counts
## # A tibble: 1,222 × 2
##    word        n
##    <chr>   <int>
##  1 learn     233
##  2 student   206
##  3 analyt     86
##  4 studi      79
##  5 educ       68
##  6 data       59
##  7 model      52
##  8 assess     48
##  9 univers    46
## 10 result     43
## # ℹ 1,212 more rows
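
Comparing these stemmed counts with the unstemmed tokens in section 3.4 shows what stemming did to the corpus. The sketch below reproduces one case in base R. The lookup vector is a hand-written stand-in for `wordStem()`, and it rests on the assumption that both surface forms reduce to "student"; the counts are copied from the two tables.

```r
# Unstemmed counts copied from the top-token table in section 3.4.
counts <- data.frame(
  word = c("students", "student"),
  n    = c(147, 59),
  stringsAsFactors = FALSE
)

# Hand-written stem lookup standing in for SnowballC::wordStem()
# (assumption: both forms share the stem "student").
stem_lookup <- c(students = "student", student = "student")
counts$stem <- unname(stem_lookup[counts$word])

# Sum the counts per stem: the two surface forms collapse into one row.
stemmed <- aggregate(n ~ stem, data = counts, FUN = sum)
print(stemmed)
```

The combined count of 206 matches the value reported for the stem "student" above, which is consistent with the plural and singular forms having been merged.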

3.8 Word frequency

LASTEM_frequencies <- StemLASTEM_data_tidy %>%
  count(word, sort = TRUE) %>%
  mutate(proportion = n / sum(n))

LASTEM_frequencies
## # A tibble: 1,222 × 3
##    word        n proportion
##    <chr>   <int>      <dbl>
##  1 learn     233    0.0433 
##  2 student   206    0.0383 
##  3 analyt     86    0.0160 
##  4 studi      79    0.0147 
##  5 educ       68    0.0126 
##  6 data       59    0.0110 
##  7 model      52    0.00967
##  8 assess     48    0.00893
##  9 univers    46    0.00856
## 10 result     43    0.00800
## # ℹ 1,212 more rows

3.9 Bigram Tokenization

The original data are tokenized again to enable term frequency analysis at the bigram (word pair) level. For this iteration, stop word removal and stemming are applied to both words of each pair:

tds_bigrams <- LASTEM_data %>%   
  unnest_tokens(output = bigram, input = Abstract, token = "ngrams", n = 2)

tds_bigrams <- tds_bigrams %>% 
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word) %>%
  mutate(word1 = wordStem(word1)) %>% 
  mutate(word2 = wordStem(word2)) %>% 
  unite(bigram, c(word1, word2), sep = " ")

bigram_top_tokens <- tds_bigrams %>% 
  count(bigram, sort = TRUE) %>% 
  top_n(20)
## Selecting by n
bigram_top_tokens
## # A tibble: 22 × 2
##    bigram              n
##    <chr>           <int>
##  1 learn analyt       66
##  2 academ perform     12
##  3 machin learn       11
##  4 student learn      11
##  5 learn environ      10
##  6 learn manag        10
##  7 manag system        9
##  8 student perform     9
##  9 student success     8
## 10 academ achiev       7
## # ℹ 12 more rows
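
As a rough illustration of what the bigram pipeline does, the following base R sketch builds bigrams from a single invented sentence and drops every pair that contains a stop word. The sentence and the three-word stop list are assumptions made for the example; the report itself relies on `unnest_tokens()` and the `stop_words` lexicon.

```r
# Illustrative sentence (hypothetical, not taken from the corpus).
text  <- "The role of learning analytics in student success"
words <- strsplit(tolower(text), "[^a-z]+")[[1]]
words <- words[nzchar(words)]

# Pair each word with its successor to form overlapping bigrams.
bigrams <- paste(head(words, -1), tail(words, -1))
print(bigrams)

# Tiny stop list for the example only (not tidytext's stop_words).
stop_list <- c("the", "of", "in")

# Keep a bigram only if neither of its words is a stop word.
parts <- strsplit(bigrams, " ", fixed = TRUE)
keep  <- vapply(parts, function(p) !any(p %in% stop_list), logical(1))
clean_bigrams <- bigrams[keep]
print(clean_bigrams)
```

Filtering on both words after splitting, as the pipeline above does, is what prevents pairs such as "role of" from appearing in the counts.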

3.10 Bigrams Counts

LASTEM_countsbi <- tds_bigrams %>% 
  count(bigram, sort = TRUE)

LASTEM_countsbi
## # A tibble: 1,878 × 2
##    bigram              n
##    <chr>           <int>
##  1 learn analyt       66
##  2 academ perform     12
##  3 machin learn       11
##  4 student learn      11
##  5 learn environ      10
##  6 learn manag        10
##  7 manag system        9
##  8 student perform     9
##  9 student success     8
## 10 academ achiev       7
## # ℹ 1,868 more rows

4. Exploratory Analysis

4.1 Wordcloud

library(wordcloud2)

wordcloud2(LASTEM_counts)

The most meaningful common words were ‘student,’ ‘study,’ ‘data,’ ‘model,’ ‘assess,’ and ‘result.’

4.2 Basic Bar Chart

LASTEM_counts %>%
  filter(n > 20) %>% # keep rows with word counts greater than 20
  mutate(word = reorder(word, n)) %>% #reorder the word variable by n and replace with new variable called word
  ggplot(aes(n, word)) + # create a plot with n on x axis and word on y axis
  geom_col() # make it a bar plot

LASTEM_countsbi %>%
  filter(n > 5) %>% # keep rows with bigram counts greater than 5
  mutate(bigram = reorder(bigram, n)) %>% # reorder the bigram variable by n
  ggplot(aes(n, bigram)) + # create a plot with n on x axis and bigram on y axis
  geom_col() # make it a bar plot

The most meaningful common bigrams included ‘machine learning,’ ‘student learning,’ ‘learning environment,’ ‘learning management,’ ‘management system,’ and ‘student performance.’

4.3 Published Article Counts

all_years <- as.character(2014:2024)  # Define the range of years
LASTEM_data$Year <- factor(LASTEM_data$Year, levels = all_years)

LASTEM_data %>% 
  ggplot(aes(x = Year)) + # the stray color argument outside aes() was ignored by ggplot()
  geom_bar(show.legend = FALSE) +
  scale_x_discrete(limits = all_years) +  # Explicitly set x-axis limits
  labs(x = "Year",
     y = "Article Counts",
     title = "Articles Published by Years",
     subtitle = "Published from 2014 - 2024")

This visual depicts the number of articles published per year from 2014 to 2024. No articles in the sample were published in 2014-2016. Four works were found in 2017, nine in 2018, seven in 2019, and three in 2020. The number of papers peaked at 11 in 2021, dropped to three in 2022, and rose again to seven in 2023. Only one work from 2024 had been indexed by the search date. Across the 45 articles, this averages about four publications per year over the full period, or between five and six per year from 2017 onward.
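
The yearly counts described above can be tabulated, and the averages checked, with a few lines of base R. This is a sketch that re-enters the per-year counts by hand rather than reading them from `LASTEM_data`.

```r
# Publication years for the 45 articles, rebuilt from the per-year counts above.
years <- rep(2017:2024, times = c(4, 9, 7, 3, 11, 3, 7, 1))

# Use a factor with every year as a level so that empty years show up as 0.
year_counts <- table(factor(years, levels = 2014:2024))
print(year_counts)

total          <- sum(year_counts)              # all articles
mean_all_years <- total / length(year_counts)   # over all 11 calendar years
mean_active    <- total / sum(year_counts > 0)  # over the years with articles
print(c(total = total, mean_all_years = mean_all_years, mean_active = mean_active))
```

Setting the factor levels explicitly serves the same purpose as the `factor(..., levels = all_years)` step in the plotting code: years without publications stay visible instead of being dropped.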

4.4 Influential sources

LASTEM_countsSource <- LASTEM_data %>% 
  count(Source, sort = TRUE)

LASTEM_countsSource %>%
  filter(n > 1) %>% # keep rows with source counts greater than 1
  mutate(Source = reorder(Source, n)) %>% # reorder the Source variable by n
  ggplot(aes(n, Source)) + # create a plot with n on x axis and Source on y axis
  geom_col() # make it a bar plot

This visual illustrates the distribution of published papers across sources. Specifically, it shows that three papers each were published in the International Journal of Engineering Education and IEEE Access. Additionally, two papers each were found in the Multidisciplinary Journal for Education Social and Technological Sciences, Education and Information Technologies, Computers & Education, Computer Applications in Engineering Education, and Applied Sciences-Basel.

4.5 Influential affiliations

LASTEM_countsAffiliations <- LASTEM_data %>% 
  count(Affiliations, sort = TRUE)

LASTEM_countsAffiliations %>%
  filter(n > 1) %>% # keep rows with affiliation counts greater than 1
  mutate(Affiliations = reorder(Affiliations, n)) %>% # reorder the Affiliations variable by n
  ggplot(aes(n, Affiliations)) + # create a plot with n on x axis and Affiliations on y axis
  geom_col() # make it a bar plot

Because the count is taken over the full affiliation string of each article, two works each were attributed to the University of Sydney and the Universidad de Castilla-La Mancha, while every other affiliation string appeared only once.
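
Counting whole affiliation strings treats "Open University - UK" and "Open University - UK; Open University - UK" as different affiliations. As a possible refinement, and not what the analysis above does, the sketch below splits each string into institutions and counts each institution once per article. The four strings are copied from the publications table in section 4.6.

```r
# Affiliation strings for four articles, copied from the table in section 4.6.
affil <- c(
  "University of Sydney; University of Sydney",
  "University of Sydney; University of Sydney",
  "Open University - UK; Open University - UK",
  "Open University - UK"
)

# Exact-string counting, as in count(Affiliations): the two Open University
# strings differ, so each is counted once.
string_counts <- table(affil)

# Split on ";", trim whitespace, and keep each institution once per article.
per_article <- lapply(
  strsplit(affil, ";", fixed = TRUE),
  function(x) unique(trimws(x))
)

# Number of articles per institution.
inst_counts <- sort(table(unlist(per_article)), decreasing = TRUE)
print(inst_counts)
```

On these four records the institution-level count gives two articles for each university, whereas the string-level count reports two only for the University of Sydney.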

4.6 Influential publications

LASTEM_sortUsagecount <- LASTEM_data %>%
  arrange(desc(Usagecount))

View(LASTEM_sortUsagecount)
library(knitr)

# Select the desired variables and arrange them in the desired order
selected_LASTEM_sortUsagecount <- LASTEM_sortUsagecount %>%
  select(Usagecount, Authors, Year, Title, Source, Affiliations)

# Display the table using kable
kable(selected_LASTEM_sortUsagecount)

| Usagecount | Authors | Year | Title | Source | Affiliations |
|---|---|---|---|---|---|
| 167 | Pardo, A; Han, FF; Ellis, RA | 2017 | Combining University Student Self-Regulated Learning Indicators and Engagement with Online Learning Events to Predict Academic Performance | IEEE TRANSACTIONS ON LEARNING TECHNOLOGIES | University of Sydney; University of Sydney |
| 144 | Muñoz-Merino, PJ; Ruipérez-Valiente, JA; Kloos, CD; Auger, MA; Briz, S; de Castro, V; Santalla, SN | 2017 | Flipping the Classroom to Improve Learning With MOOCs Technology | COMPUTER APPLICATIONS IN ENGINEERING EDUCATION | Universidad Carlos III de Madrid; IMDEA Networks Institute |
| 133 | Hu, YH | 2022 | Effects and acceptance of precision education in an AI-supported smart learning environment | EDUCATION AND INFORMATION TECHNOLOGIES | National Yunlin University Science & Technology |
| 108 | Wu, JY | 2021 | Learning analytics on structured and unstructured heterogeneous data sources: Perspectives from procrastination, help-seeking, and machine-learning defined cognitive engagement | COMPUTERS & EDUCATION | National Yang Ming Chiao Tung University |
| 70 | Zhu, QL; Wang, MJ | 2020 | Team-based mobile learning supported by an intelligent system: case study of STEM students | INTERACTIVE LEARNING ENVIRONMENTS | Hainan University; California State University System; San Diego State University |
| 63 | Aparicio, F; Morales-Botello, ML; Rubio, M; Hernando, A; Muñoz, R; López-Fernández, H; Glez-Peña, D; Fdez-Riverola, F; de la Villa, M; Maña, M; Gachet, D; de Buenaga, M | 2018 | Perceptions of the use of intelligent information access systems in university level active learning activities among teachers of biomedical subjects | INTERNATIONAL JOURNAL OF MEDICAL INFORMATICS | European University of Madrid; European University of Madrid; Universidade de Vigo; Universidade de Vigo; CINBIO; Universidad de Huelva |
| 47 | Saqr, M; Alamro, A | 2019 | The role of social network analysis as a learning analytics tool in online problem based learning | BMC MEDICAL EDUCATION | University of Eastern Finland; Stockholm University; Qassim University |
| 46 | Iatrellis, O; Savvas, IK; Fitsilis, P; Gerogiannis, VC | 2021 | A two-phase machine learning approach for predicting student outcomes | EDUCATION AND INFORMATION TECHNOLOGIES | University of Thessaly |
| 46 | Zheng, M; Bender, D; Nadershahi, N | 2017 | Faculty professional development in emergent pedagogies for instructional innovation in dental education | EUROPEAN JOURNAL OF DENTAL EDUCATION | University of the Pacific |
| 45 | Vaz-Fernandes, P; Caeiro, S | 2019 | Students’ perceptions of a food safety and quality e-learning course: a CASE study for a MSC in food consumption | INTERNATIONAL JOURNAL OF EDUCATIONAL TECHNOLOGY IN HIGHER EDUCATION | Universidade Aberta; Universidade de Lisboa; Universidade Nova de Lisboa |
| 39 | Noel, R; Riquelme, F; Mac Lean, R; Merino, E; Cechinel, C; Barcelos, TS; Villarroel, R; Munoz, R | 2018 | Exploring Collaborative Writing of User Stories With Multimodal Learning Analytics: A Case Study on a Software Engineering Course | IEEE ACCESS | Universidad de Valparaiso; Universidad de Valparaiso; Universidade Federal de Santa Catarina (UFSC); Instituto Federal de Sao Paulo (IFSP); Pontificia Universidad Catolica de Valparaiso |
| 38 | Lacave, C; Molina, AI; Cruz-Lemus, JA | 2018 | Learning Analytics to identify dropout factors of Computer Science studies through Bayesian networks | BEHAVIOUR & INFORMATION TECHNOLOGY | Universidad de Castilla-La Mancha |
| 37 | Kuromiya, H; Majumdar, R; Ogata, H | 2020 | Fostering Evidence-Based Education with Learning Analytics: Capturing Teaching-Learning Cases from Log Data | EDUCATIONAL TECHNOLOGY & SOCIETY | Kyoto University; Kyoto University |
| 35 | Cheng, S; Xie, K; Collier, J | 2023 | Motivational beliefs moderate the relation between academic delay and academic achievement in online learning environments | COMPUTERS & EDUCATION | National Taipei University of Technology; University System of Ohio; Ohio State University; Texas State University System; Sam Houston State University |
| 35 | Bertolini, R; Finch, SJ; Nehm, RH | 2021 | Testing the Impact of Novel Assessment Sources and Machine Learning Methods on Predictive Outcome Modeling in Undergraduate Biology | JOURNAL OF SCIENCE EDUCATION AND TECHNOLOGY | State University of New York (SUNY) System; State University of New York (SUNY) Stony Brook; State University of New York (SUNY) System; State University of New York (SUNY) Stony Brook |
| 34 | Divjak, B; Svetec, B; Horvat, D; Kadoic, N | 2023 | Assessment validity and learning analytics as prerequisites for ensuring student-centred learning design | BRITISH JOURNAL OF EDUCATIONAL TECHNOLOGY | University of Zagreb |
| 30 | Vargas, H; Heradio, R; Chacon, J; De la Torre, L; Farias, G; Galan, D; Dormido, S | 2019 | Automated Assessment and Monitoring Support for Competency-Based Courses | IEEE ACCESS | Pontificia Universidad Catolica de Valparaiso; Universidad Nacional de Educacion a Distancia (UNED); Complutense University of Madrid; Universidad Nacional de Educacion a Distancia (UNED) |
| 30 | Al-Shabandar, R; Hussain, AJ; Liatsis, P; Keight, R | 2018 | Anlaying Learners Behavior in MOOCs: An Examination of Performance and Motivation Using a Data-Driven Approach | IEEE ACCESS | Liverpool John Moores University; Khalifa University of Science & Technology |
| 30 | Mwalumbwe, I; Mtebe, JS | 2017 | USING LEARNING ANALYTICS TO PREDICT STUDENTS’ PERFORMANCE IN MOODLE LEARNING MANAGEMENT SYSTEM: A CASE OF MBEYA UNIVERSITY OF SCIENCE AND TECHNOLOGY | ELECTRONIC JOURNAL OF INFORMATION SYSTEMS IN DEVELOPING COUNTRIES | University of Dar es Salaam |
| 26 | Raza, SH; Reddy, E | 2021 | Intentionality and Players of Effective Online Courses in Mathematics | FRONTIERS IN APPLIED MATHEMATICS AND STATISTICS | University of the South Pacific |
| 26 | Scott, K; Morris, A; Marais, B | 2018 | Medical student use of digital learning resources | CLINICAL TEACHER | University of Sydney; University of Sydney |
| 22 | Menchaca, I; Guenaga, M; Solabarrieta, J | 2018 | Learning Analytics for Formative Assessment in Engineering Education | INTERNATIONAL JOURNAL OF ENGINEERING EDUCATION | University of Deusto |
| 20 | Kritzinger, A; Lemmens, JC; Potgieter, M | 2018 | Learning Strategies for First-Year Biology: Toward Moving the Murky Middle | CBE-LIFE SCIENCES EDUCATION | University of Pretoria; University of Pretoria; University of Pretoria |
| 19 | Gardner, C; Jones, A; Jefferis, H | 2020 | Analytics for Tracking Student Engagement | JOURNAL OF INTERACTIVE MEDIA IN EDUCATION | NA |
| 16 | Salazar-Fernandez, JP; Munoz-Gama, J; Maldonado-Mahauad, J; Bustamante, D; Sepúlveda, M | 2021 | Backpack Process Model (BPPM): A Process Mining Approach for Curricular Analytics | APPLIED SCIENCES-BASEL | Pontificia Universidad Catolica de Chile; Universidad Austral de Chile; Universidad de Cuenca |
| 16 | He, LJ; Levine, RA; Bohonak, AJ; Fan, JJ; Stronach, J | 2018 | Predictive Analytics Machinery for STEM Student Success Studies | APPLIED ARTIFICIAL INTELLIGENCE | California State University System; San Diego State University; California State University System; San Diego State University; California State University System; San Diego State University |
| 15 | Oliva-Córdova, LM; Garcia-Cabot, A; Recinos-Fernández, SA; Bojórquez-Roque, MS; Amado-Salvatierra, HR | 2022 | Evaluating Technological Acceptance of Virtual Learning Environments (VLE) in an Emergency Remote Situation | INTERNATIONAL JOURNAL OF ENGINEERING EDUCATION | Universidad de Alcala; Universidad de San Carlos de Guatemala; Universidad de San Carlos de Guatemala; Universidad de San Carlos de Guatemala |
| 14 | Apiola, M; Lokkila, E; Laakso, MJ | 2019 | Digital learning approaches in an intermediate-level computer science course | INTERNATIONAL JOURNAL OF INFORMATION AND LEARNING TECHNOLOGY | University of Turku |
| 13 | Olney, T; Walker, S; Wood, C; Clarke, A | 2021 | Are We Living in LA (P)LA Land? Reporting on the Practice of 30 STEM Tutors in Their Use of a Learning Analytics Implementation at The Open University | JOURNAL OF LEARNING ANALYTICS | Open University - UK; Open University - UK |
| 12 | Prat, A; Code, WJ | 2021 | WeBWorK log files as a rich source of data on student homework behaviours | INTERNATIONAL JOURNAL OF MATHEMATICAL EDUCATION IN SCIENCE AND TECHNOLOGY | University of British Columbia; University of British Columbia |
| 11 | Lobos, K; Sáez-Delgado, F; Cobo-Rendón, R; Mella Norambuena, J; Maldonado Trapp, A; Cisternas San Martín, N; Bruna Jofré, C | 2021 | Learning Beliefs, Time on Platform, and Academic Performance During the COVID-19 in University STEM Students | FRONTIERS IN PSYCHOLOGY | Universidad de Concepcion; Universidad Catolica de la Santisima Concepcion; Universidad Catolica de la Santisima Concepcion; Universidad de Concepcion; Universidad de Concepcion |
| 11 | Llopis-Albert, C; Rubio, F | 2021 | Application of Learning Analytics to Improve Higher Education | MULTIDISCIPLINARY JOURNAL FOR EDUCATION SOCIAL AND TECHNOLOGICAL SCIENCES | Universitat Politecnica de Valencia |
| 11 | Dennehy, D; Conboy, K; Babu, J | 2023 | Adopting Learning Analytics to Inform Postgraduate Curriculum Design: Recommendations and Research Agenda | INFORMATION SYSTEMS FRONTIERS | Ollscoil na Gaillimhe-University of Galway |
| 11 | Lacave, C; Molina, AI | 2018 | Using Bayesian Networks for Learning Analytics in Engineering Education: A Case Study on Computer Science Dropout at UCLM | INTERNATIONAL JOURNAL OF ENGINEERING EDUCATION | Universidad de Castilla-La Mancha |
| 10 | Vargas, H; Heradio, R; Farias, G; Lei, ZC; Torre, LD | 2024 | A Pragmatic Framework for Assessing Learning Outcomes in Competency-Based Courses | IEEE TRANSACTIONS ON EDUCATION | Pontificia Universidad Catolica de Valparaiso; Universidad Nacional de Educacion a Distancia (UNED); Wuhan University |
| 10 | Huang, LW; Willcox, KE | 2021 | Network models and sensor layers to design adaptive learning using educational mapping | DESIGN SCIENCE | Massachusetts Institute of Technology (MIT); University of Texas System; University of Texas Austin |
| 9 | Raj, NS; Renumol, VG | 2022 | Early prediction of student engagement in virtual learning environments using machine learning techniques | E-LEARNING AND DIGITAL MEDIA | Cochin University Science & Technology |
| 8 | Chapman, KE; Davidson, ME; Azuka, N; Liberatore, MW | 2023 | Quantifying deliberate practice using auto-graded questions: Analyzing multiple metrics in a chemical engineering course | COMPUTER APPLICATIONS IN ENGINEERING EDUCATION | University System of Ohio; University of Toledo; University System of Ohio; University of Toledo |
| 8 | Simanca, F; Crespo, RG; Rodríguez-Baena, L; Burgos, D | 2019 | Identifying Students at Risk of Failing a Subject by Using Learning Analytics for Subsequent Customised Tutoring | APPLIED SCIENCES-BASEL | Universidad Cooperativa de Colombia; Universidad Internacional de La Rioja (UNIR); Universidad Internacional de La Rioja (UNIR) |
| 7 | Walker, S; Olney, T; Wood, C; Clarke, A; Dunworth, M | 2019 | How do tutors use data to support their students? | OPEN LEARNING | Open University - UK |
| 5 | Vaithilingam, CA; Gamboa, RA; Lim, SC | 2019 | EMPOWERED PEDAGOGY: CATCHING UP WITH THE FUTURE | MALAYSIAN JOURNAL OF LEARNING & INSTRUCTION | Taylor’s University; Multimedia University |
| 4 | Tsoni, R; Sakkopoulos, E; Panagiotakopoulos, CT; Verykios, VS | 2021 | On the equivalence between bimodal and unimodal students’ collaboration networks in distance learning | INTELLIGENT DECISION TECHNOLOGIES-NETHERLANDS | Hellenic Open University; University of Piraeus; University of Patras |
| 3 | Qazdar, A; Hasidi, O; Qassimi, S; Abdelwahed, E | 2023 | Newly Proposed Student Performance Indicators Based on Learning Analytics for Continuous Monitoring in Learning Management Systems | INTERNATIONAL JOURNAL OF ONLINE AND BIOMEDICAL ENGINEERING | Ibn Zohr University of Agadir; Cadi Ayyad University of Marrakech; Cadi Ayyad University of Marrakech |
| 1 | Torner, ME; Aparicio-Fernández, C; Vivancos, JL; Cañada-Soriano, M | 2023 | Analysis of the optimization of resources with Learning Analytics techniques | MULTIDISCIPLINARY JOURNAL FOR EDUCATION SOCIAL AND TECHNOLOGICAL SCIENCES | Universitat Politecnica de Valencia; Universitat Politecnica de Valencia; Universitat Politecnica de Valencia; Universitat Politecnica de Valencia |
| 0 | Soto-Acevedo, M; Abuchar-Curi, AM; Zuluaga-Ortiz, RA; Delahoz-Domínguez, EJ | 2023 | A Machine Learning Model to Predict Standardized Tests in Engineering Programs in Colombia | IEEE REVISTA IBEROAMERICANA DE TECNOLOGIAS DEL APRENDIZAJE-IEEE RITA | Universidad Tecnologica de Bolivar; Universidad Tecnologica de Bolivar; Universidad de la Costa |

This table ranks the publications by the Web of Science “Since 2013 Usage Count” field, which reflects how often a record has been used rather than how often it has been cited. The five publications with the highest usage counts are “Combining University Student Self-Regulated Learning Indicators and Engagement with Online Learning Events to Predict Academic Performance” (167), “Flipping the Classroom to Improve Learning With MOOCs Technology” (144), “Effects and acceptance of precision education in an AI-supported smart learning environment” (133), “Learning analytics on structured and unstructured heterogeneous data sources: Perspectives from procrastination, help-seeking, and machine-learning defined cognitive engagement” (108), and “Team-based mobile learning supported by an intelligent system: case study of STEM students” (70).

4.7 Influential publishers

LASTEM_countsPublisher <- LASTEM_data %>% 
  count(Publisher, sort = TRUE)

LASTEM_countsPublisher %>%
  filter(n > 1) %>% # keep rows with publisher counts greater than 1
  mutate(Publisher = reorder(Publisher, n)) %>% # reorder the Publisher variable by n
  ggplot(aes(n, Publisher)) + # create a plot with n on x axis and Publisher on y axis
  geom_col() # make it a bar plot

This visualization highlights the most influential publishers, with Wiley leading, followed by Springer and IEEE-Institute of Electrical and Electronics Engineers Inc.

4.8 Word Counts by Year

StemLASTEM_data_tidy %>%
  group_by(Year) %>%
  count(word, sort = TRUE) %>%
  top_n(8) %>%
  ungroup() %>%
  mutate(word = reorder_within(word, n, Year)) %>%
  ggplot(aes(x = word, y = n, fill = word)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ Year, scales = "free_y") +
  coord_flip() +
  scale_x_reordered() +
  scale_y_continuous(expand = c(0,0)) +
  labs(y = "Count",
       x = "Unique words",
       title = "Most frequent words found in Abstract through Years",
       subtitle = "Stop words removed from the list")
## Selecting by n

In diagramming the top eight unigrams by year, certain terms recur. Given the project’s focus on leveraging learning analytics to enhance STEM education within higher education contexts, it is expected that terms like “learn,” “student,” “analytics,” “study,” and “education” consistently rank among the top terms across the years. However, each year also presents distinctive terms. For instance, “mooc,” “develop,” and “design” featured prominently in 2017, while “system” and “success” stood out in 2018. In 2022, “virtual” and “predict” were notable, and in 2024, a year represented by a single article, “assess” and “model” were the most common terms.

4.9 Bigram Counts

tds_bigrams %>%
  group_by(Year) %>%
  count(bigram, sort = TRUE) %>%
  top_n(5) %>%
  ungroup() %>%
  mutate(bigram = reorder_within(bigram, n, Year)) %>%
  ggplot(aes(x = bigram, y = n, fill = bigram)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ Year, scales = "free_y") +
  coord_flip() +
  scale_x_reordered() +
  scale_y_continuous(expand = c(0,0)) +
  labs(y = "Count",
       x = "Unique Bigrams",
       title = "Most frequent bigrams found in Abstract through Years",
       subtitle = "Stop words removed from the list")
## Selecting by n

Moreover, I’ve chosen to shift the emphasis away from individual terms. In our community, key topics are often represented by multi-word terms, such as ‘learning analytics’ and ‘machine learning,’ rather than by single words in isolation. With this in mind, I replicated the previous visualization using bigrams instead. The expansion to two-word phrases reveals common topics like ‘learning analytics’ and ‘academic performance,’ along with distinctive topics for each year. Noteworthy examples include ‘blended learning’ and ‘analytic tools’ in 2017, ‘student success’ and ‘learning activities’ in 2018, ‘program outcomes’ and ‘learning systems’ in 2019, ‘student engagement’ and ‘evidence-based’ in 2020, ‘model approach’ and ‘machine learning’ in 2021, ‘virtual learning’ and ‘implementation strategies’ in 2022, ‘teacher inquiry’ and ‘academic achievement’ in 2023, and ‘competency-based’ and ‘assessing models’ in 2024.

5. Model

To explore the potential for generating topics, I will conduct a comparative analysis of several topic models to identify any differences in how they delineate unique topics. This analysis will encompass three models:

Latent Dirichlet Allocation (LDA): LDA operates on the assumption that each document comprises a blend of topics, with each topic comprising a blend of words.

Structural Topic Model (STM): STM leverages metadata to enhance the assignment of words to topics within a corpus. It also enables the examination of relationships between covariates and documents.

Biterm Topic Model (BTM): BTM is specifically designed for short-format corpora. It identifies topics by explicitly modeling co-occurrences of words within a specified text window.

By comparing the outcomes of these models, I aim to gain insights into how they differ in identifying and delineating distinct topics within the dataset.
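The mixture assumption shared by these models can be sketched in a few lines of base R. The probabilities below are hypothetical and chosen only for illustration: each document’s word distribution is its topic proportions multiplied by the topic-word probabilities.

```r
# Hypothetical topic-word probabilities (rows: topics, columns: words)
beta <- rbind(
  topic1 = c(student = 0.5, model = 0.1, course = 0.4),
  topic2 = c(student = 0.2, model = 0.7, course = 0.1)
)

# Hypothetical document-topic proportions (rows: documents)
theta <- rbind(
  doc1 = c(topic1 = 0.9, topic2 = 0.1),
  doc2 = c(topic1 = 0.3, topic2 = 0.7)
)

# Word distribution per document: p(w | d) = sum over k of theta[d, k] * beta[k, w]
word_probs <- theta %*% beta
print(round(word_probs, 3))
```

Each row of `word_probs` sums to 1, so every document is itself a probability distribution over words. A topic model runs this logic in reverse, estimating `theta` and `beta` from the observed word counts.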

5.1 Determining K

Each model relies on determining an optimal value for K, representing the number of potential topics to be identified. If K is too small, the corpus may be divided into a few overly generic topics. Conversely, if K is too large, the collection may be fragmented into numerous topics that either overlap significantly or become indistinguishable. To ensure consistent comparisons across models, a common K value needs to be established before applying the models.

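FindTopicsNumber() Function in the LDA Model

For the LDA model, four metrics were computed with FindTopicsNumber() from the ldatuning package and then plotted to locate the K at which each metric reaches its best value (a minimum for CaoJuan2009 and Arun2010, a maximum for Griffiths2004 and Deveaud2014):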

k_metrics <- FindTopicsNumber(
  tidy_tds_DTM,
  topics = seq(5, 80, by = 5),
  metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
  method = "Gibbs",
  control = list(seed = 77),
  mc.cores = NA,
  return_models = FALSE,
  verbose = FALSE,
  libpath = NULL
)
FindTopicsNumber_plot(k_metrics)
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the ldatuning package.
##   Please report the issue at <https://github.com/nikita-moor/ldatuning/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

The crucial aspect is pinpointing the visible bend or inflection point on each line. This is where the lines shift from a rapid increase or decrease to a more gradual trajectory. Across all lines, this inflection point falls between 20 and 30 topics. However, because the dataset comprises only 45 abstracts, that many topics would leave only about two documents per topic, so I opted for the smallest candidate value tested, K = 5.
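Reading the best K off the plot can be complemented by a numeric check. The sketch below assumes a data frame shaped like the one FindTopicsNumber() returns (a `topics` column plus one column per metric). The values are made up for illustration and are not the results of this study.

```r
# Hypothetical metric values, one row per candidate K (numbers are invented)
toy_metrics <- data.frame(
  topics        = c(5, 10, 15, 20, 25, 30),
  Griffiths2004 = c(-9100, -8900, -8750, -8700, -8690, -8695),
  CaoJuan2009   = c(0.30, 0.22, 0.17, 0.15, 0.16, 0.18),
  Arun2010      = c(60, 48, 40, 35, 33, 34),
  Deveaud2014   = c(1.10, 1.25, 1.32, 1.35, 1.33, 1.30)
)

# CaoJuan2009 and Arun2010 are minimised; Griffiths2004 and Deveaud2014 are maximised
minimise <- c("CaoJuan2009", "Arun2010")
maximise <- c("Griffiths2004", "Deveaud2014")

# Candidate K at which each metric reaches its best value
best_k <- c(
  sapply(minimise, function(m) toy_metrics$topics[which.min(toy_metrics[[m]])]),
  sapply(maximise, function(m) toy_metrics$topics[which.max(toy_metrics[[m]])])
)
print(best_k)
```

The same two `sapply()` calls could be applied to `k_metrics` directly. With a corpus this small, the metrics are best treated as a guide rather than a rule.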

5.2 Latent Dirichlet Allocation (LDA) Model

LDA is a mathematical technique used to estimate the combination of words associated with each topic and to determine the blend of topics that characterize each document (Silge & Robinson, 2017).

n_distinct(LASTEM_data$Abstract)
## [1] 45
tds_lda <- LDA(tidy_tds_DTM, 
                  k = 5, 
                  control = list(seed = 588))

terms(tds_lda, 10)
##       Topic 1   Topic 2    Topic 3    Topic 4   Topic 5  
##  [1,] "learn"   "learn"    "student"  "learn"   "learn"  
##  [2,] "student" "student"  "learn"    "student" "student"
##  [3,] "academ"  "analyt"   "studi"    "assess"  "data"   
##  [4,] "educ"    "resourc"  "perform"  "studi"   "studi"  
##  [5,] "analyt"  "statist"  "data"     "analyt"  "model"  
##  [6,] "predict" "engag"    "academ"   "model"   "analyt" 
##  [7,] "time"    "onlin"    "educ"     "develop" "correct"
##  [8,] "model"   "mathemat" "analyt"   "result"  "educ"   
##  [9,] "univers" "cours"    "interact" "system"  "network"
## [10,] "studi"   "data"     "result"   "lo"      "practic"

For K = 5, the topics appear generally similar across the board, although a couple stand out as unique. However, discerning those distinct topics is challenging in this tabular format. The faceted plot below offers a more insightful visual representation:

top_terms_lda <- tidy(tds_lda, matrix="beta") %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, desc(beta))
top_terms_lda %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  group_by(topic, term) %>%    
  arrange(desc(beta)) %>%  
  ungroup() %>%
  ggplot(aes(beta, term, fill = as.factor(topic))) +
  geom_col(show.legend = FALSE) +
  scale_y_reordered() +
  labs(title = "Top 10 terms in each LDA topic",
       x = expression(beta), y = NULL) +
  facet_wrap(~ topic, ncol = 4, scales = "free")

Performance Summary

The LDA model struggled to differentiate between topics: ‘learn,’ ‘student,’ and ‘analyt’ appear among the top ten terms of all five topics, so the topics read as generic summaries of the corpus rather than distinct themes. These findings qualitatively confirm the model’s difficulty in identifying unique topics within this dataset.
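The overlap between the LDA topics can also be put into numbers. The following base R sketch copies the top ten stemmed terms from the terms(tds_lda, 10) output above and computes the Jaccard similarity between every pair of topics. The Jaccard measure is my own choice of overlap statistic here, not part of the original analysis.

```r
# Top-10 stemmed terms per topic, copied from the terms(tds_lda, 10) output
top_terms <- list(
  topic1 = c("learn", "student", "academ", "educ", "analyt",
             "predict", "time", "model", "univers", "studi"),
  topic2 = c("learn", "student", "analyt", "resourc", "statist",
             "engag", "onlin", "mathemat", "cours", "data"),
  topic3 = c("student", "learn", "studi", "perform", "data",
             "academ", "educ", "analyt", "interact", "result"),
  topic4 = c("learn", "student", "assess", "studi", "analyt",
             "model", "develop", "result", "system", "lo"),
  topic5 = c("learn", "student", "data", "studi", "model",
             "analyt", "correct", "educ", "network", "practic")
)

# Jaccard similarity: shared terms divided by all distinct terms of the two topics
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

k <- length(top_terms)
overlap <- matrix(0, k, k, dimnames = list(names(top_terms), names(top_terms)))
for (i in seq_len(k)) {
  for (j in seq_len(k)) {
    overlap[i, j] <- jaccard(top_terms[[i]], top_terms[[j]])
  }
}
print(round(overlap, 2))

# Terms that appear in the top 10 of every topic
shared_by_all <- Reduce(intersect, top_terms)
print(shared_by_all)
```

Values close to 1 indicate topics that share most of their top terms. The terms in `shared_by_all` carry no information for telling the topics apart.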

5.3 Structural Topic Model (STM)

The stm package in R requires that the documents, metadata, and “vocab” (the complete list of words that appear in the documents) be stored in distinct objects, as shown in the code below. The first call, textProcessor(), lowercases the abstracts, removes stop words, numbers, and punctuation, drops words shorter than three characters, and stems the remaining words. Removing very common terms in this way is a common practice in topic modeling, as such terms can complicate word-topic assignments (Bail, 2019).

temp <- textProcessor(LASTEM_data$Abstract, 
                      metadata = LASTEM_data,  
                      lowercase=TRUE, 
                      removestopwords=TRUE, 
                      removenumbers=TRUE,  
                      removepunctuation=TRUE, 
                      wordLengths=c(3,Inf),
                      stem=TRUE,
                      onlycharacter= FALSE, 
                      striphtml=TRUE, 
                      customstopwords=NULL)
## Building corpus... 
## Converting to Lower Case... 
## Removing punctuation... 
## Removing stopwords... 
## Removing numbers... 
## Stemming... 
## Creating Output...
docs <- temp$documents 
meta <- temp$meta 
vocab <- temp$vocab 

tds_stm <- stm(documents=docs, 
               data=meta,
               vocab=vocab, 
               prevalence =~ Year,
               K=5,
               max.em.its=25,
               verbose = FALSE)
plot.STM(tds_stm, n = 20)

The initial iteration of the STM model with K=5 exhibits a broader array of terms compared to LDA. However, this linear representation still struggles to clearly depict the differences between each topic and its adjacent topics. A more visually informative method is facilitated by the toLDAvis function.

toLDAvis(mod = tds_stm, docs = docs)
## Loading required namespace: servr

This tool explores the spatial relationships between topics, aiming for a scenario where the model identifies 5 topics that are distinctly separate from each other. In an ideal setting, the diagram would depict 5 non-overlapping circles, each representing a unique topic. While it’s unlikely for a model to achieve such precision, this diagram with K=5 shows relatively good separation between topics. There’s only one region where topics overlap, indicating a larger proportion of similar terms shared between them.

Topic 5 is highlighted due to its size and significant spacing from neighboring topics. The area of each circle represents the topic’s overall prevalence in the corpus. In this instance, the prominent term is ‘learn,’ which is expected, but it’s the accompanying terms that set it apart from neighboring topics. The solitary bubble signifies that this topic is distinct in its composition of terms.
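The geometry behind such an inter-topic distance map can be sketched in base R. The topic-word probabilities below are hypothetical. The sketch measures how different two topics are with the Jensen-Shannon divergence and projects the topics onto two dimensions with classical multidimensional scaling. This follows the spirit of the LDAvis layout; the exact computation inside the package may differ.

```r
# Hypothetical topic-word probabilities (rows: topics, columns: words; rows sum to 1)
beta <- rbind(
  topic1 = c(0.40, 0.30, 0.15, 0.10, 0.05),
  topic2 = c(0.35, 0.30, 0.20, 0.10, 0.05),
  topic3 = c(0.05, 0.10, 0.15, 0.30, 0.40),
  topic4 = c(0.10, 0.05, 0.50, 0.05, 0.30)
)

# Jensen-Shannon divergence between two strictly positive probability vectors
js_divergence <- function(p, q) {
  m <- (p + q) / 2
  0.5 * sum(p * log(p / m)) + 0.5 * sum(q * log(q / m))
}

# Pairwise divergence between all topics
k <- nrow(beta)
js <- matrix(0, k, k, dimnames = list(rownames(beta), rownames(beta)))
for (i in seq_len(k)) {
  for (j in seq_len(k)) {
    js[i, j] <- js_divergence(beta[i, ], beta[j, ])
  }
}

# Project the topics onto two dimensions with classical multidimensional scaling;
# the square root of the divergence is used because it is a proper distance
coords <- cmdscale(as.dist(sqrt(js)), k = 2)
print(round(js, 3))
print(round(coords, 3))
```

In this toy example, topic1 and topic2 have nearly identical word distributions and therefore land close together, which corresponds to overlapping circles in the diagram.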

Performance Summary

The STM model appears to differentiate between topics better than the LDA model. The clear separation of topic terms and the presence of medium to large-sized bubbles suggest that K = 5 is a workable number of topics for this corpus and that incorporating metadata enhances the likelihood of distinguishing between unique topics. However, many of the top 20 terms recur across the 5 topics, suggesting some overlap or redundancy in the topics identified.

5.4 Biterm Topic Model (BTM)

The Biterm Topic Model (BTM) is a word co-occurrence based model designed to learn topics by analyzing word-word co-occurrence patterns (Wijffels, 2021).

A biterm comprises two words that frequently occur together within the same context, typically within a short text window. This window is defined by parameters such as skipgram and width.

Two parameters control this window. The skipgram argument of the cooccurrence() function determines how far apart two words may be and still be counted as a biterm, while the window argument of BTM() sets the window size that BTM uses when it extracts biterms itself. In this project, a skipgram value of 10 was used to build the biterms, while window remained at its default value of 15.

Unlike LDA models that focus on word occurrences within individual documents, BTM models biterm occurrences across the entire corpus, providing a different perspective on topic modeling.
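To illustrate what a biterm is, here is a minimal base R sketch that enumerates the unordered word pairs of one short, tokenized text. The function `extract_biterms()` and its `max_gap` parameter are hypothetical names for this illustration. They are not part of the BTM or udpipe packages, and they do not reproduce the exact skipgram definition of cooccurrence().

```r
# Enumerate unordered word pairs (biterms) whose positions are at most
# `max_gap` apart; `max_gap` is a hypothetical parameter for this sketch
extract_biterms <- function(tokens, max_gap = Inf) {
  n <- length(tokens)
  if (n < 2) {
    return(data.frame(term1 = character(0), term2 = character(0)))
  }
  pairs <- t(combn(n, 2))                       # all index pairs with i < j
  keep <- (pairs[, 2] - pairs[, 1]) <= max_gap  # restrict to the window
  pairs <- pairs[keep, , drop = FALSE]
  data.frame(term1 = tokens[pairs[, 1]],
             term2 = tokens[pairs[, 2]],
             stringsAsFactors = FALSE)
}

# Toy tokens (hypothetical, already lemmatized and filtered)
tokens <- c("student", "engagement", "online", "course")
all_pairs <- extract_biterms(tokens)                    # every pair in the text
adjacent_pairs <- extract_biterms(tokens, max_gap = 1)  # neighbouring words only
print(all_pairs)
print(adjacent_pairs)
```

Pooling such pairs over all documents gives the corpus-level set of biterms on which BTM estimates its topics, which is why the model copes better with short texts than document-level approaches.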

LASTEM_data$ID <- 1:nrow(LASTEM_data)

LASTEM_data$ID
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
LASTEM_data_btm <- select(LASTEM_data, c(ID, Abstract)) %>%
  rename (doc_id = ID) %>%
  rename (text = Abstract)

For consistency with the first two models, the BTM trial also trained a model to identify 5 topics:

anno    <- udpipe(LASTEM_data_btm, "english", trace = 10)
## 2024-04-30 13:14:47.740299 Annotating text fragment 1/45
## 2024-04-30 13:14:50.266528 Annotating text fragment 11/45
## 2024-04-30 13:14:52.932178 Annotating text fragment 21/45
## 2024-04-30 13:14:55.63223 Annotating text fragment 31/45
## 2024-04-30 13:14:57.910263 Annotating text fragment 41/45
biterms <- as.data.table(anno)
biterms <- biterms[, cooccurrence(x = lemma,
                                  relevant = upos %in% c("NOUN",
                                                         "ADJ",
                                                         "PROPN"),
                                  skipgram = 10),
                   by = list(doc_id)]
# Build BTM
set.seed(588)
traindata <- subset(anno, upos %in% c("NOUN", "ADJ", "PROPN"))
traindata <- traindata[, c("doc_id", "lemma")]
model <- BTM(traindata, k = 5, 
             beta = 0.01, 
             iter = 500,
             biterms = biterms, 
             trace = 100)
## 2024-04-30 13:15:01 Start Gibbs sampling iteration 1/500
## 2024-04-30 13:15:02 Start Gibbs sampling iteration 101/500
## 2024-04-30 13:15:02 Start Gibbs sampling iteration 201/500
## 2024-04-30 13:15:03 Start Gibbs sampling iteration 301/500
## 2024-04-30 13:15:03 Start Gibbs sampling iteration 401/500
# Plot model results
library(ggraph)
plot(model,
     top_n = 15,
     title = "BTM model",
     subtitle = "K = 5, 500 Training Iterations",
     labels = c("0", "1", "2", "3", "4", "5", "6", "7", "8", "9",
               "10", "11", "12", "13",  "14", "15", "16", "17", 
                "18", "19"))

The words ‘data’, ‘analytic’ and ‘university’ were most important for topic 1. The words ‘assessment’, ‘model’, ‘framework’ and ‘study’ were most important for topic 2. The words ‘academic’, ‘correct’, ‘delay’ and ‘question’ were most important for topic 3. The words ‘student’, ‘course’ and ‘education’ were most important for topic 4. The words ‘intelligent’, ‘system’ and ‘activity’ were most important for topic 5.

Performance Summary

The visual outcomes of the BTM with K set to 5 topics reveal several distinct characteristics:

  1. Each grouped topic is unique in terminology; there are no repetitions among them.
  2. The size of words reflects their importance or frequency within the respective topic. Smaller font sizes indicate lower probabilities (phi) of those words appearing in the topic.
  3. The line weights within topics signify the strength of relationships between words. Thicker lines indicate stronger associations among terms.

Although the visualization may resemble word clouds, the underlying algorithm is more intricate than simple term frequencies. It is based on the co-occurrence of terms, adding a layer of complexity to the graphic.

6. Communication

6.1 Conclusion

This project examined publications within WoS concerning learning analytics, higher education, and STEM, employing text mining to identify research trends and characteristics. The search returned 205 articles published between January 1, 2014, and April 7, 2024, of which 45 were chosen for analysis based on the inclusion criteria. Results indicated fluctuating publication numbers, with the highest count in 2021. Key unigrams included ‘student,’ ‘study,’ ‘data,’ ‘model,’ ‘assess,’ and ‘result,’ while prominent bigrams comprised ‘machine learning,’ ‘student learn,’ ‘learn environment,’ ‘learn management,’ ‘management system,’ and ‘student performance.’

Regarding influential sources, the International Journal of Engineering Education and IEEE ACCESS stood out, with University of Sydney and Universidad de Castilla-La Mancha as prominent affiliations. Wiley, Springer, and IEEE-Institute of Electrical and Electronics Engineers Inc emerged as leading publishers. Notable publications included titles like “Student Self-Regulated Learning Indicators and Engagement with Online Learning Events to Predict Academic Performance,” “Flipping the Classroom to Improve Learning With MOOCs Technology,” “Effects and Acceptance of Precision Education in an AI-Supported Smart Learning Environment,” “Learning Analytics on Structured and Unstructured Heterogeneous Data,” and “Perspectives from Procrastination, Help-Seeking, and Machine-Learning Defined Cognitive Engagement Team-Based Mobile.”

Furthermore, the study explored topic generation using three models—LDA, STM, and BTM. BTM exhibited superior effectiveness by revealing the most unique latent topics, with five topics identified through this model.

6.2 Limitations

The Web of Science database’s coverage may not include all relevant articles in the learning analytics field, especially those from journals or publications not indexed in this database. This limitation could lead to a biased or incomplete understanding of the research landscape. Additionally, the dataset used in this analysis is relatively small, and determining the optimal K value for topic modeling may not be definitive.

6.3 Ethical Issues

Ethical Use of Data and Findings: It’s essential to handle the data and findings presented in the literature ethically and responsibly. This includes accurately representing research results without misinterpretation or misrepresentation. Additionally, acknowledging any limitations or ethical considerations highlighted by the original authors is crucial to maintain ethical standards in literature analysis.

Transparency and Reproducibility: I prioritize transparency throughout my literature analysis by meticulously documenting my search strategies, inclusion criteria, and article selection process. I strive for reproducibility by offering comprehensive information that enables others to replicate or validate my review methodology. This commitment to transparency and reproducibility enhances the credibility and trustworthiness of my literature analysis.

References

Bail, C. (2019). Topic Modeling. Text as Data Course. Retrieved April 22, 2022, from https://sicss.io/2019/materials/day3-text-analysis/topic-modeling/rmarkdown/Topic_Modeling.html

Wijffels, J. (2021). BTM: Biterm topic models for short text - cran.r-project.org. CRAN - Package BTM. Retrieved April 27, 2022, from https://cran.r-project.org/web/packages/BTM/BTM.pdf

Jipeng, Q., Zhenyu, Q., Yun, L., Yunhao, Y., & Xindong, W. (2019). Short text topic modeling techniques, applications, and performance: A survey. arXiv.org. Retrieved April 21, 2022, from https://doi.org/10.48550/arXiv.1904.07695

Silge, J., & Robinson, D. (2017). Topic Modeling: Text mining with R. 6 Topic modeling | Text Mining with R. Retrieved April 30, 2022, from https://www.tidytextmining.com/topicmodeling.html

Yan, X., Guo, J., Lan, Y., & Cheng, X. (2013). A biterm topic model for short texts. Xiaohui Yan’s Homepage. Retrieved April 27, 2022, from http://xiaohuiyan.github.io/paper/BTM-WWW13.pdf