This literature analysis identifies trends, developments, and emerging themes in learning analytics as applied to STEM education in higher education settings. By offering a comprehensive view of the field and tracing how it has evolved, the analysis holds value for researchers and practitioners focusing on STEM education within higher education.
As a result, this research project aims to answer the following questions:
1. What is the annual publication trend of the studies?
2. What are the most influential sources, affiliations, publishers, and publications in the field?
3. What are the most common words used?
4. Which topic modeling approach is optimal for generating potential topics from short text?
5. What are the potential topic trends observed over the years?
2.1 Data Source
The raw data were collected from the Web of Science (WoS) database, a comprehensive research database covering scholarly literature across a wide range of disciplines. The search string was: “learning analytics” AND (“higher education” OR colleg* OR universit*) AND (STEM OR science OR technology OR engineering OR mathematics) (Abstract), with Article as the document type. With all publication types in scope, the search returned 373 documents; limiting it to articles published between 2014 and 2024 yielded 205 articles. The search was performed on 7 April 2024 to avoid deviations caused by daily updates of the database. Inclusion criteria were then applied to select relevant studies: (1) the paper is written in English; (2) the full text is available; (3) the paper is peer-reviewed; (4) the study is conducted in a higher education context; (5) the study focuses on STEM education; and (6) the study is empirical. Accordingly, other document types, such as non-peer-reviewed papers, government reports, reviews, meta-analyses, surveys, commentary articles, editorials, book chapters, theses, dissertations, and author notes, were excluded from this review. Finally, the full records of the 45 remaining articles (“Record Content: Full Record and Cited References”) were downloaded in Excel format via Export Records to Excel File. Each WoS record includes the authors, document type, Web of Science category, keywords, year of publication, publisher, affiliated institutions, countries/regions, and indexes.
2.2 Load libraries
library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidytext)
library(tidyr)
library(SnowballC)
library(topicmodels)
library(stm)
## stm v1.3.7 successfully loaded. See ?stm for help.
## Papers, resources, and other materials at structuraltopicmodel.com
library(ldatuning)
library(knitr)
library(LDAvis)
library(tm)
## Loading required package: NLP
##
## Attaching package: 'NLP'
##
## The following object is masked from 'package:ggplot2':
##
## annotate
library(lubridate)
library(kableExtra)
## Warning: package 'kableExtra' was built under R version 4.3.3
##
## Attaching package: 'kableExtra'
##
## The following object is masked from 'package:dplyr':
##
## group_rows
library(BTM)
## Warning: package 'BTM' was built under R version 4.3.3
library(textplot)
## Warning: package 'textplot' was built under R version 4.3.3
library(concaveman)
## Warning: package 'concaveman' was built under R version 4.3.3
library(udpipe)
## Warning: package 'udpipe' was built under R version 4.3.3
library(data.table)
##
## Attaching package: 'data.table'
##
## The following objects are masked from 'package:lubridate':
##
## hour, isoweek, mday, minute, month, quarter, second, wday, week,
## yday, year
##
## The following object is masked from 'package:purrr':
##
## transpose
##
## The following objects are masked from 'package:dplyr':
##
## between, first, last
library(stopwords)
##
## Attaching package: 'stopwords'
##
## The following object is masked from 'package:tm':
##
## stopwords
library(ggplot2)
library(igraph)
##
## Attaching package: 'igraph'
##
## The following objects are masked from 'package:lubridate':
##
## %--%, union
##
## The following objects are masked from 'package:purrr':
##
## compose, simplify
##
## The following object is masked from 'package:tidyr':
##
## crossing
##
## The following object is masked from 'package:tibble':
##
## as_data_frame
##
## The following objects are masked from 'package:dplyr':
##
## as_data_frame, groups, union
##
## The following objects are masked from 'package:stats':
##
## decompose, spectrum
##
## The following object is masked from 'package:base':
##
## union
library(ggraph)
3.1 Read the data into the R project
library(readxl)
downloadLASTEM_data <- read_excel("C:/Users/kaoya/Downloads/ECI_588_Datasets/download_LASTEM.xls")
View(downloadLASTEM_data)
3.2 Rename Columns
downloadLASTEM_data <- downloadLASTEM_data %>%
rename(Year = `Publication Year`,
Title = `Article Title`,
Source = `Source Title`,
Type = `Publication Type`,
Keywords = `Author Keywords`,
Usagecount = `Since 2013 Usage Count`,
Keywordsplus = `Keywords Plus`,
Referencecount = `Cited Reference Count`,
Journal = `Journal Abbreviation`)
3.3 Subset columns
LASTEM_data <- select(downloadLASTEM_data, Type, Authors, Title, Source, Keywords, Keywordsplus, Abstract, Affiliations, Referencecount, Usagecount, Journal, Publisher, Year)
3.4 Create tidytext corpus
LASTEM_data_tidy <- LASTEM_data %>%
unnest_tokens(output = word, input = Abstract) %>%
anti_join(stop_words, by = "word")
tidy_top_tokens <- LASTEM_data_tidy %>%
count(word, sort = TRUE) %>%
top_n(50)
## Selecting by n
tidy_top_tokens
## # A tibble: 51 × 2
## word n
## <chr> <int>
## 1 learning 230
## 2 students 147
## 3 analytics 82
## 4 study 63
## 5 data 59
## 6 student 59
## 7 education 45
## 8 academic 41
## 9 assessment 39
## 10 university 39
## # ℹ 41 more rows
3.5 Stemming
Stemming reduces the feature size of a corpus by transforming terms to their base stems. It also reduces redundancy in terms and phrases as the various topic modeling techniques are explored.
StemLASTEM_data_tidy <- LASTEM_data_tidy %>%
mutate(word = wordStem(word))
StemLASTEM_data_tidy
## # A tibble: 5,376 × 13
## Type Authors Title Source Keywords Keywordsplus Affiliations Referencecount
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 J Vargas,… A Pr… IEEE … Educati… ENGINEERING… Pontificia … 36
## 2 J Vargas,… A Pr… IEEE … Educati… ENGINEERING… Pontificia … 36
## 3 J Vargas,… A Pr… IEEE … Educati… ENGINEERING… Pontificia … 36
## 4 J Vargas,… A Pr… IEEE … Educati… ENGINEERING… Pontificia … 36
## 5 J Vargas,… A Pr… IEEE … Educati… ENGINEERING… Pontificia … 36
## 6 J Vargas,… A Pr… IEEE … Educati… ENGINEERING… Pontificia … 36
## 7 J Vargas,… A Pr… IEEE … Educati… ENGINEERING… Pontificia … 36
## 8 J Vargas,… A Pr… IEEE … Educati… ENGINEERING… Pontificia … 36
## 9 J Vargas,… A Pr… IEEE … Educati… ENGINEERING… Pontificia … 36
## 10 J Vargas,… A Pr… IEEE … Educati… ENGINEERING… Pontificia … 36
## # ℹ 5,366 more rows
## # ℹ 5 more variables: Usagecount <dbl>, Journal <chr>, Publisher <chr>,
## # Year <dbl>, word <chr>
3.6 Cast a Document-Term Matrix
The LDA model requires the text to be cast into a document-term matrix (DTM), in which each cell records a term count for a given document. Here, the article title acts as the unique document identifier.
tidy_tds_DTM <- StemLASTEM_data_tidy %>%
count(Title, word) %>%
cast_dtm(Title, word, n)
tidy_tds_DTM
## <<DocumentTermMatrix (documents: 45, terms: 1222)>>
## Non-/sparse entries: 3497/51493
## Sparsity : 94%
## Maximal term length: 13
## Weighting : term frequency (tf)
3.7 Word counts
LASTEM_counts <- StemLASTEM_data_tidy %>%
count(word, sort = TRUE)
LASTEM_counts
## # A tibble: 1,222 × 2
## word n
## <chr> <int>
## 1 learn 233
## 2 student 206
## 3 analyt 86
## 4 studi 79
## 5 educ 68
## 6 data 59
## 7 model 52
## 8 assess 48
## 9 univers 46
## 10 result 43
## # ℹ 1,212 more rows
3.8 Word frequency
LASTEM_frequencies <- StemLASTEM_data_tidy %>%
count(word, sort = TRUE) %>%
mutate(proportion = n / sum(n))
LASTEM_frequencies
## # A tibble: 1,222 × 3
## word n proportion
## <chr> <int> <dbl>
## 1 learn 233 0.0433
## 2 student 206 0.0383
## 3 analyt 86 0.0160
## 4 studi 79 0.0147
## 5 educ 68 0.0126
## 6 data 59 0.0110
## 7 model 52 0.00967
## 8 assess 48 0.00893
## 9 univers 46 0.00856
## 10 result 43 0.00800
## # ℹ 1,212 more rows
3.9 Tokenization
The original data are tokenized again to enable term frequency analysis at the bigram (word pair) level. For these iterations, stop word removal and stemming have been incorporated:
tds_bigrams <- LASTEM_data %>%
unnest_tokens(output = bigram, input = Abstract, token = "ngrams", n = 2)
tds_bigrams <- tds_bigrams %>%
separate(bigram, into = c("word1", "word2"), sep = " ") %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word) %>%
mutate(word1 = wordStem(word1)) %>%
mutate(word2 = wordStem(word2)) %>%
unite(bigram, c(word1, word2), sep = " ")
bigram_top_tokens <- tds_bigrams %>%
count(bigram, sort = TRUE) %>%
top_n(20)
## Selecting by n
bigram_top_tokens
## # A tibble: 22 × 2
## bigram n
## <chr> <int>
## 1 learn analyt 66
## 2 academ perform 12
## 3 machin learn 11
## 4 student learn 11
## 5 learn environ 10
## 6 learn manag 10
## 7 manag system 9
## 8 student perform 9
## 9 student success 8
## 10 academ achiev 7
## # ℹ 12 more rows
3.10 Bigram Counts
LASTEM_countsbi <- tds_bigrams %>%
count(bigram, sort = TRUE)
LASTEM_countsbi
## # A tibble: 1,878 × 2
## bigram n
## <chr> <int>
## 1 learn analyt 66
## 2 academ perform 12
## 3 machin learn 11
## 4 student learn 11
## 5 learn environ 10
## 6 learn manag 10
## 7 manag system 9
## 8 student perform 9
## 9 student success 8
## 10 academ achiev 7
## # ℹ 1,868 more rows
4.1 Wordcloud
library(wordcloud2)
wordcloud2(LASTEM_counts)
The most meaningful common words were ‘student,’ ‘study,’ ‘data,’ ‘model,’ ‘assess,’ and ‘result.’
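Because the stems of the search terms themselves (‘learn’ and ‘analyt’) dominate the counts, filtering them out makes the cloud easier to read. A minimal sketch, assuming those two stems are the only ones worth dropping:
LASTEM_counts %>%
filter(!word %in% c("learn", "analyt")) %>% # drop the corpus-defining search-term stems
wordcloud2()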
4.2 Basic Bar Chart
LASTEM_counts %>%
filter(n > 20) %>% # keep rows with word counts greater than 20
mutate(word = reorder(word, n)) %>% #reorder the word variable by n and replace with new variable called word
ggplot(aes(n, word)) + # create a plot with n on x axis and word on y axis
geom_col() # make it a bar plot
LASTEM_countsbi %>%
filter(n > 5) %>% # keep rows with word counts greater than 5
mutate(bigram = reorder(bigram, n)) %>% # reorder the bigram variable by n
ggplot(aes(n, bigram)) + # create a plot with n on x axis and bigram on y axis
geom_col() # make it a bar plot
The most meaningful common bigrams included ‘machine learning,’ ‘student learning,’ ‘learning environment,’ ‘learning management,’ ‘management system,’ and ‘student performance.’
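Since igraph and ggraph are already loaded, the bigram counts can also be viewed as a network of connected stems. A minimal sketch, assuming the same count threshold of 5 used in the bar chart above:
bigram_graph <- LASTEM_countsbi %>%
filter(n > 5) %>% # keep frequent bigrams only, for legibility
separate(bigram, into = c("word1", "word2"), sep = " ") %>%
graph_from_data_frame() # nodes are stems, edges carry the bigram count n
set.seed(588)
ggraph(bigram_graph, layout = "fr") +
geom_edge_link(aes(edge_alpha = n)) + # darker edges mark more frequent pairs
geom_node_point() +
geom_node_text(aes(label = name), repel = TRUE) +
theme_void()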
4.3 Published Article Counts
all_years <- as.character(2014:2024) # Define the range of years
LASTEM_data$Year <- factor(LASTEM_data$Year, levels = all_years)
LASTEM_data %>%
ggplot(aes(x = Year, fill = Type)) + # fill bars by publication type
geom_bar(show.legend = FALSE) +
scale_x_discrete(limits = all_years) + # Explicitly set x-axis limits
labs(x = "Year",
y = "Article Counts",
title = "Articles Published by Years",
subtitle = "Published from 2014 - 2024")
This visual depicts the number of papers published over the past ten years. No publications appeared from 2014 to 2016, although the average number of publications per year across the remaining period was about five. Four works were published in 2017, nine in 2018, seven in 2019, and three in 2020. The count peaked at 11 in 2021, then fell to three in 2022 and rose to seven in 2023. Only one work had appeared in 2024 at the time of the search.
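The yearly counts summarized above can also be tabulated directly; a quick check, relying on the factor levels defined earlier so that the empty years 2014-2016 are retained:
LASTEM_data %>%
count(Year, .drop = FALSE) # .drop = FALSE keeps factor levels with zero articles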
4.4 Influential sources
LASTEM_countsSource <- LASTEM_data %>%
count(Source, sort = TRUE)
LASTEM_countsSource %>%
filter(n > 1) %>% # keep rows with source counts greater than 1
mutate(Source = reorder(Source, n)) %>% # reorder the Source variable by n
ggplot(aes(n, Source)) + # create a plot with n on x axis and Source on y axis
geom_col() # make it a bar plot
This visual illustrates the distribution of published papers across sources. Three papers each were published in the International Journal of Engineering Education and IEEE ACCESS. Additionally, two papers each appeared in the Multidisciplinary Journal for Education Social and Technological Sciences, Education and Information Technologies, Computers & Education, Computer Applications in Engineering Education, and Applied Sciences-Basel.
4.5 Influential affiliations
LASTEM_countsAffiliations <- LASTEM_data %>%
count(Affiliations, sort = TRUE)
LASTEM_countsAffiliations %>%
filter(n > 1) %>% # keep rows with affiliation counts greater than 1
mutate(Affiliations = reorder(Affiliations, n)) %>% # reorder the Affiliations variable by n
ggplot(aes(n, Affiliations)) + # create a plot with n on x axis and Affiliations on y axis
geom_col() # make it a bar plot
Two works each came from the University of Sydney and the Universidad de Castilla-La Mancha, while all other affiliations contributed a single work.
4.6 Influential publications
LASTEM_sortUsagecount <- LASTEM_data %>%
arrange(desc(Usagecount))
View(LASTEM_sortUsagecount)
library(knitr)
# Select the desired variables and arrange them in the desired order
selected_LASTEM_sortUsagecount <- LASTEM_sortUsagecount %>%
select(Usagecount, Authors, Year, Title, Source, Affiliations)
# Display the table using kable
kable(selected_LASTEM_sortUsagecount)
| Usagecount | Authors | Year | Title | Source | Affiliations |
|---|---|---|---|---|---|
| 167 | Pardo, A; Han, FF; Ellis, RA | 2017 | Combining University Student Self-Regulated Learning Indicators and Engagement with Online Learning Events to Predict Academic Performance | IEEE TRANSACTIONS ON LEARNING TECHNOLOGIES | University of Sydney; University of Sydney |
| 144 | Muñoz-Merino, PJ; Ruipérez-Valiente, JA; Kloos, CD; Auger, MA; Briz, S; de Castro, V; Santalla, SN | 2017 | Flipping the Classroom to Improve Learning With MOOCs Technology | COMPUTER APPLICATIONS IN ENGINEERING EDUCATION | Universidad Carlos III de Madrid; IMDEA Networks Institute |
| 133 | Hu, YH | 2022 | Effects and acceptance of precision education in an AI-supported smart learning environment | EDUCATION AND INFORMATION TECHNOLOGIES | National Yunlin University Science & Technology |
| 108 | Wu, JY | 2021 | Learning analytics on structured and unstructured heterogeneous data sources: Perspectives from procrastination, help-seeking, and machine-learning defined cognitive engagement | COMPUTERS & EDUCATION | National Yang Ming Chiao Tung University |
| 70 | Zhu, QL; Wang, MJ | 2020 | Team-based mobile learning supported by an intelligent system: case study of STEM students | INTERACTIVE LEARNING ENVIRONMENTS | Hainan University; California State University System; San Diego State University |
| 63 | Aparicio, F; Morales-Botello, ML; Rubio, M; Hernando, A; Muñoz, R; López-Fernández, H; Glez-Peña, D; Fdez-Riverola, F; de la Villa, M; Maña, M; Gachet, D; de Buenaga, M | 2018 | Perceptions of the use of intelligent information access systems in university level active learning activities among teachers of biomedical subjects | INTERNATIONAL JOURNAL OF MEDICAL INFORMATICS | European University of Madrid; European University of Madrid; Universidade de Vigo; Universidade de Vigo; CINBIO; Universidad de Huelva |
| 47 | Saqr, M; Alamro, A | 2019 | The role of social network analysis as a learning analytics tool in online problem based learning | BMC MEDICAL EDUCATION | University of Eastern Finland; Stockholm University; Qassim University |
| 46 | Iatrellis, O; Savvas, IK; Fitsilis, P; Gerogiannis, VC | 2021 | A two-phase machine learning approach for predicting student outcomes | EDUCATION AND INFORMATION TECHNOLOGIES | University of Thessaly |
| 46 | Zheng, M; Bender, D; Nadershahi, N | 2017 | Faculty professional development in emergent pedagogies for instructional innovation in dental education | EUROPEAN JOURNAL OF DENTAL EDUCATION | University of the Pacific |
| 45 | Vaz-Fernandes, P; Caeiro, S | 2019 | Students’ perceptions of a food safety and quality e-learning course: a CASE study for a MSC in food consumption | INTERNATIONAL JOURNAL OF EDUCATIONAL TECHNOLOGY IN HIGHER EDUCATION | Universidade Aberta; Universidade de Lisboa; Universidade Nova de Lisboa |
| 39 | Noel, R; Riquelme, F; Mac Lean, R; Merino, E; Cechinel, C; Barcelos, TS; Villarroel, R; Munoz, R | 2018 | Exploring Collaborative Writing of User Stories With Multimodal Learning Analytics: A Case Study on a Software Engineering Course | IEEE ACCESS | Universidad de Valparaiso; Universidad de Valparaiso; Universidade Federal de Santa Catarina (UFSC); Instituto Federal de Sao Paulo (IFSP); Pontificia Universidad Catolica de Valparaiso |
| 38 | Lacave, C; Molina, AI; Cruz-Lemus, JA | 2018 | Learning Analytics to identify dropout factors of Computer Science studies through Bayesian networks | BEHAVIOUR & INFORMATION TECHNOLOGY | Universidad de Castilla-La Mancha |
| 37 | Kuromiya, H; Majumdar, R; Ogata, H | 2020 | Fostering Evidence-Based Education with Learning Analytics: Capturing Teaching-Learning Cases from Log Data | EDUCATIONAL TECHNOLOGY & SOCIETY | Kyoto University; Kyoto University |
| 35 | Cheng, S; Xie, K; Collier, J | 2023 | Motivational beliefs moderate the relation between academic delay and academic achievement in online learning environments | COMPUTERS & EDUCATION | National Taipei University of Technology; University System of Ohio; Ohio State University; Texas State University System; Sam Houston State University |
| 35 | Bertolini, R; Finch, SJ; Nehm, RH | 2021 | Testing the Impact of Novel Assessment Sources and Machine Learning Methods on Predictive Outcome Modeling in Undergraduate Biology | JOURNAL OF SCIENCE EDUCATION AND TECHNOLOGY | State University of New York (SUNY) System; State University of New York (SUNY) Stony Brook; State University of New York (SUNY) System; State University of New York (SUNY) Stony Brook |
| 34 | Divjak, B; Svetec, B; Horvat, D; Kadoic, N | 2023 | Assessment validity and learning analytics as prerequisites for ensuring student-centred learning design | BRITISH JOURNAL OF EDUCATIONAL TECHNOLOGY | University of Zagreb |
| 30 | Vargas, H; Heradio, R; Chacon, J; De la Torre, L; Farias, G; Galan, D; Dormido, S | 2019 | Automated Assessment and Monitoring Support for Competency-Based Courses | IEEE ACCESS | Pontificia Universidad Catolica de Valparaiso; Universidad Nacional de Educacion a Distancia (UNED); Complutense University of Madrid; Universidad Nacional de Educacion a Distancia (UNED) |
| 30 | Al-Shabandar, R; Hussain, AJ; Liatsis, P; Keight, R | 2018 | Analyzing Learners Behavior in MOOCs: An Examination of Performance and Motivation Using a Data-Driven Approach | IEEE ACCESS | Liverpool John Moores University; Khalifa University of Science & Technology |
| 30 | Mwalumbwe, I; Mtebe, JS | 2017 | USING LEARNING ANALYTICS TO PREDICT STUDENTS’ PERFORMANCE IN MOODLE LEARNING MANAGEMENT SYSTEM: A CASE OF MBEYA UNIVERSITY OF SCIENCE AND TECHNOLOGY | ELECTRONIC JOURNAL OF INFORMATION SYSTEMS IN DEVELOPING COUNTRIES | University of Dar es Salaam |
| 26 | Raza, SH; Reddy, E | 2021 | Intentionality and Players of Effective Online Courses in Mathematics | FRONTIERS IN APPLIED MATHEMATICS AND STATISTICS | University of the South Pacific |
| 26 | Scott, K; Morris, A; Marais, B | 2018 | Medical student use of digital learning resources | CLINICAL TEACHER | University of Sydney; University of Sydney |
| 22 | Menchaca, I; Guenaga, M; Solabarrieta, J | 2018 | Learning Analytics for Formative Assessment in Engineering Education | INTERNATIONAL JOURNAL OF ENGINEERING EDUCATION | University of Deusto |
| 20 | Kritzinger, A; Lemmens, JC; Potgieter, M | 2018 | Learning Strategies for First-Year Biology: Toward Moving the Murky Middle | CBE-LIFE SCIENCES EDUCATION | University of Pretoria; University of Pretoria; University of Pretoria |
| 19 | Gardner, C; Jones, A; Jefferis, H | 2020 | Analytics for Tracking Student Engagement | JOURNAL OF INTERACTIVE MEDIA IN EDUCATION | NA |
| 16 | Salazar-Fernandez, JP; Munoz-Gama, J; Maldonado-Mahauad, J; Bustamante, D; Sepúlveda, M | 2021 | Backpack Process Model (BPPM): A Process Mining Approach for Curricular Analytics | APPLIED SCIENCES-BASEL | Pontificia Universidad Catolica de Chile; Universidad Austral de Chile; Universidad de Cuenca |
| 16 | He, LJ; Levine, RA; Bohonak, AJ; Fan, JJ; Stronach, J | 2018 | Predictive Analytics Machinery for STEM Student Success Studies | APPLIED ARTIFICIAL INTELLIGENCE | California State University System; San Diego State University; California State University System; San Diego State University; California State University System; San Diego State University |
| 15 | Oliva-Córdova, LM; Garcia-Cabot, A; Recinos-Fernández, SA; Bojórquez-Roque, MS; Amado-Salvatierra, HR | 2022 | Evaluating Technological Acceptance of Virtual Learning Environments (VLE) in an Emergency Remote Situation | INTERNATIONAL JOURNAL OF ENGINEERING EDUCATION | Universidad de Alcala; Universidad de San Carlos de Guatemala; Universidad de San Carlos de Guatemala; Universidad de San Carlos de Guatemala |
| 14 | Apiola, M; Lokkila, E; Laakso, MJ | 2019 | Digital learning approaches in an intermediate-level computer science course | INTERNATIONAL JOURNAL OF INFORMATION AND LEARNING TECHNOLOGY | University of Turku |
| 13 | Olney, T; Walker, S; Wood, C; Clarke, A | 2021 | Are We Living in LA (P)LA Land? Reporting on the Practice of 30 STEM Tutors in Their Use of a Learning Analytics Implementation at The Open University | JOURNAL OF LEARNING ANALYTICS | Open University - UK; Open University - UK |
| 12 | Prat, A; Code, WJ | 2021 | WeBWorK log files as a rich source of data on student homework behaviours | INTERNATIONAL JOURNAL OF MATHEMATICAL EDUCATION IN SCIENCE AND TECHNOLOGY | University of British Columbia; University of British Columbia |
| 11 | Lobos, K; Sáez-Delgado, F; Cobo-Rendón, R; Mella Norambuena, J; Maldonado Trapp, A; Cisternas San Martín, N; Bruna Jofré, C | 2021 | Learning Beliefs, Time on Platform, and Academic Performance During the COVID-19 in University STEM Students | FRONTIERS IN PSYCHOLOGY | Universidad de Concepcion; Universidad Catolica de la Santisima Concepcion; Universidad Catolica de la Santisima Concepcion; Universidad de Concepcion; Universidad de Concepcion |
| 11 | Llopis-Albert, C; Rubio, F | 2021 | Application of Learning Analytics to Improve Higher Education | MULTIDISCIPLINARY JOURNAL FOR EDUCATION SOCIAL AND TECHNOLOGICAL SCIENCES | Universitat Politecnica de Valencia |
| 11 | Dennehy, D; Conboy, K; Babu, J | 2023 | Adopting Learning Analytics to Inform Postgraduate Curriculum Design: Recommendations and Research Agenda | INFORMATION SYSTEMS FRONTIERS | Ollscoil na Gaillimhe-University of Galway |
| 11 | Lacave, C; Molina, AI | 2018 | Using Bayesian Networks for Learning Analytics in Engineering Education: A Case Study on Computer Science Dropout at UCLM | INTERNATIONAL JOURNAL OF ENGINEERING EDUCATION | Universidad de Castilla-La Mancha |
| 10 | Vargas, H; Heradio, R; Farias, G; Lei, ZC; Torre, LD | 2024 | A Pragmatic Framework for Assessing Learning Outcomes in Competency-Based Courses | IEEE TRANSACTIONS ON EDUCATION | Pontificia Universidad Catolica de Valparaiso; Universidad Nacional de Educacion a Distancia (UNED); Wuhan University |
| 10 | Huang, LW; Willcox, KE | 2021 | Network models and sensor layers to design adaptive learning using educational mapping | DESIGN SCIENCE | Massachusetts Institute of Technology (MIT); University of Texas System; University of Texas Austin |
| 9 | Raj, NS; Renumol, VG | 2022 | Early prediction of student engagement in virtual learning environments using machine learning techniques | E-LEARNING AND DIGITAL MEDIA | Cochin University Science & Technology |
| 8 | Chapman, KE; Davidson, ME; Azuka, N; Liberatore, MW | 2023 | Quantifying deliberate practice using auto-graded questions: Analyzing multiple metrics in a chemical engineering course | COMPUTER APPLICATIONS IN ENGINEERING EDUCATION | University System of Ohio; University of Toledo; University System of Ohio; University of Toledo |
| 8 | Simanca, F; Crespo, RG; Rodríguez-Baena, L; Burgos, D | 2019 | Identifying Students at Risk of Failing a Subject by Using Learning Analytics for Subsequent Customised Tutoring | APPLIED SCIENCES-BASEL | Universidad Cooperativa de Colombia; Universidad Internacional de La Rioja (UNIR); Universidad Internacional de La Rioja (UNIR) |
| 7 | Walker, S; Olney, T; Wood, C; Clarke, A; Dunworth, M | 2019 | How do tutors use data to support their students? | OPEN LEARNING | Open University - UK |
| 5 | Vaithilingam, CA; Gamboa, RA; Lim, SC | 2019 | EMPOWERED PEDAGOGY: CATCHING UP WITH THE FUTURE | MALAYSIAN JOURNAL OF LEARNING & INSTRUCTION | Taylor’s University; Multimedia University |
| 4 | Tsoni, R; Sakkopoulos, E; Panagiotakopoulos, CT; Verykios, VS | 2021 | On the equivalence between bimodal and unimodal students’ collaboration networks in distance learning | INTELLIGENT DECISION TECHNOLOGIES-NETHERLANDS | Hellenic Open University; University of Piraeus; University of Patras |
| 3 | Qazdar, A; Hasidi, O; Qassimi, S; Abdelwahed, E | 2023 | Newly Proposed Student Performance Indicators Based on Learning Analytics for Continuous Monitoring in Learning Management Systems | INTERNATIONAL JOURNAL OF ONLINE AND BIOMEDICAL ENGINEERING | Ibn Zohr University of Agadir; Cadi Ayyad University of Marrakech; Cadi Ayyad University of Marrakech |
| 1 | Torner, ME; Aparicio-Fernández, C; Vivancos, JL; Cañada-Soriano, M | 2023 | Analysis of the optimization of resources with Learning Analytics techniques | MULTIDISCIPLINARY JOURNAL FOR EDUCATION SOCIAL AND TECHNOLOGICAL SCIENCES | Universitat Politecnica de Valencia; Universitat Politecnica de Valencia; Universitat Politecnica de Valencia; Universitat Politecnica de Valencia |
| 0 | Soto-Acevedo, M; Abuchar-Curi, AM; Zuluaga-Ortiz, RA; Delahoz-Domínguez, EJ | 2023 | A Machine Learning Model to Predict Standardized Tests in Engineering Programs in Colombia | IEEE REVISTA IBEROAMERICANA DE TECNOLOGIAS DEL APRENDIZAJE-IEEE RITA | Universidad Tecnologica de Bolivar; Universidad Tecnologica de Bolivar; Universidad de la Costa |
This table lists the most influential publications by usage count, including “Combining University Student Self-Regulated Learning Indicators and Engagement with Online Learning Events to Predict Academic Performance,” “Flipping the Classroom to Improve Learning With MOOCs Technology,” “Effects and acceptance of precision education in an AI-supported smart learning environment,” “Learning analytics on structured and unstructured heterogeneous data sources: Perspectives from procrastination, help-seeking, and machine-learning defined cognitive engagement,” and “Team-based mobile learning supported by an intelligent system: case study of STEM students.”
4.7 Influential publishers
LASTEM_countsPublisher <- LASTEM_data %>%
count(Publisher, sort = TRUE)
LASTEM_countsPublisher %>%
filter(n > 1) %>% # keep rows with publisher counts greater than 1
mutate(Publisher = reorder(Publisher, n)) %>% # reorder the Publisher variable by n
ggplot(aes(n, Publisher)) + # create a plot with n on x axis and Publisher on y axis
geom_col() # make it a bar plot
This visualization highlights the most influential publishers, with
Wiley leading, followed by Springer and IEEE-Institute of Electrical and
Electronics Engineers Inc.
4.8 Word Counts by Year
StemLASTEM_data_tidy %>%
group_by(Year) %>%
count(word, sort = TRUE) %>%
top_n(8) %>%
ungroup %>%
mutate(word = reorder_within(word, n, Year)) %>%
ggplot(aes(x = word, y = n, fill = word)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ Year, scales = "free_y") +
coord_flip() +
scale_x_reordered() +
scale_y_continuous(expand = c(0,0)) +
labs(y = "Count",
x = "Unique words",
title = "Most frequent words found in Abstract through Years",
subtitle = "Stop words removed from the list")
## Selecting by n
Charting the top eight unigrams by year shows that certain terms recur. Given the project’s focus on leveraging learning analytics to enhance STEM education within higher education contexts, it is expected that terms like “learn,” “student,” “analytics,” “study,” and “education” consistently rank among the top terms across the years. However, each year also presents unique terms. For instance, “mooc,” “develop,” and “design” featured prominently in 2017, while “system” and “success” stood out in 2018. In 2022, “virtual” and “predict” were notable, and by 2024, “assess” and “model” had emerged as the most common terms.
4.9 Bigram Counts
tds_bigrams %>%
group_by(Year) %>%
count(bigram, sort = TRUE) %>%
top_n(5) %>%
ungroup %>%
mutate(bigram = reorder_within(bigram, n, Year)) %>%
ggplot(aes(x = bigram, y = n, fill = bigram)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ Year, scales = "free_y") +
coord_flip() +
scale_x_reordered() +
scale_y_continuous(expand = c(0,0)) +
labs(y = "Count",
x = "Unique Bigrams",
title = "Most frequent bigrams found in Abstract through Years",
subtitle = "Stop words removed from the list")
## Selecting by n
Moreover, I’ve chosen to shift the emphasis away from individual terms. In our community, key topics are often represented by multiple terms, such as ‘learning analytics’ and ‘machine learning,’ rather than treating each word in isolation. With this in mind, I replicated the previous visualizations but focused on bi-word groupings instead. The expansion to 2-word phrases reveals common topics like ‘learning analytics’ and ‘academic performance,’ along with unique topics for each year. Noteworthy examples include ‘blended learning’ and ‘analytic tools’ in 2017, ‘student success’ and ‘learning activities’ in 2018, ‘program outcomes’ and ‘learning systems’ in 2019, ‘student engagement’ and ‘evidence-based’ in 2020, ‘model approach’ and ‘machine learning’ in 2021, ‘virtual learning’ and ‘implementation strategies’ in 2022, ‘teacher inquiry’ and ‘academic achievement’ in 2023, and ‘competitive-based’ and ‘assessing models’ in 2024.
To explore the potential for generating topics, I will conduct a comparative analysis of three topic models to identify any differences in how they delineate unique topics:
Latent Dirichlet Allocation (LDA): LDA operates on the assumption that each document comprises a blend of topics, with each topic comprising a blend of words.
Structural Topic Model (STM): STM leverages metadata to enhance the assignment of words to topics within a corpus. It also enables the examination of relationships between covariates and documents.
Biterm Topic Model (BTM): BTM is specifically designed for short-format corpora. It identifies topics by explicitly modeling co-occurrences of words within a specified text window.
By comparing the outcomes of these models, I aim to gain insights into how they differ in identifying and delineating distinct topics within the dataset.
5.1 Determining K
Each model relies on determining an optimal value for K, representing the number of potential topics to be identified. If K is too small, the corpus may be divided into a few overly generic topics. Conversely, if K is too large, the collection may be fragmented into numerous topics that either overlap significantly or become indistinguishable. To ensure consistent comparisons across models, a common K value needs to be established before applying the models.
FindTopicsNumber() Function in the LDA Model
For the LDA model, four metrics were extracted, then plotted to visualize the maximum or minimum K value of each metric:
k_metrics <- FindTopicsNumber(
tidy_tds_DTM,
topics = seq(5, 80, by = 5),
metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
method = "Gibbs",
control = list(seed = 77),
mc.cores = NA,
return_models = FALSE,
verbose = FALSE,
libpath = NULL
)
FindTopicsNumber_plot(k_metrics)
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the ldatuning package.
## Please report the issue at <https://github.com/nikita-moor/ldatuning/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
The crucial aspect is pinpointing the visible bend or inflection point on each line, where the metric shifts from a rapid increase or decrease to a more gradual trajectory. Across all four metrics, this inflection point falls between 20 and 30 topics. However, because the dataset comprises only 45 documents, I opted for the smallest practical number of topics, K = 5.
5.2 Latent Dirichlet Allocation (LDA) Model
LDA is a mathematical technique used to estimate the combination of words associated with each topic and to determine the blend of topics that characterize each document (Silge & Robinson, 2017).
n_distinct(LASTEM_data$Abstract)
## [1] 45
tds_lda <- LDA(tidy_tds_DTM,
k = 5,
control = list(seed = 588))
terms(tds_lda, 10)
## Topic 1 Topic 2 Topic 3 Topic 4 Topic 5
## [1,] "learn" "learn" "student" "learn" "learn"
## [2,] "student" "student" "learn" "student" "student"
## [3,] "academ" "analyt" "studi" "assess" "data"
## [4,] "educ" "resourc" "perform" "studi" "studi"
## [5,] "analyt" "statist" "data" "analyt" "model"
## [6,] "predict" "engag" "academ" "model" "analyt"
## [7,] "time" "onlin" "educ" "develop" "correct"
## [8,] "model" "mathemat" "analyt" "result" "educ"
## [9,] "univers" "cours" "interact" "system" "network"
## [10,] "studi" "data" "result" "lo" "practic"
For K = 5, the topics appear generally similar across the board, although a couple stand out as unique. However, discerning those distinct topics in this format is challenging. The faceted plot below offers a more insightful visual representation:
top_terms_lda <- tidy(tds_lda, matrix="beta") %>%
group_by(topic) %>%
top_n(10, beta) %>%
ungroup() %>%
arrange(topic, desc(beta))
top_terms_lda %>%
mutate(term = reorder_within(term, beta, topic)) %>%
group_by(topic, term) %>%
arrange(desc(beta)) %>%
ungroup() %>%
ggplot(aes(beta, term, fill = as.factor(topic))) +
geom_col(show.legend = FALSE) +
scale_y_reordered() +
labs(title = "Top 10 terms in each LDA topic",
x = expression(beta), y = NULL) +
facet_wrap(~ topic, ncol = 4, scales = "free")
Performance Summary
The LDA model struggled to differentiate between topics, providing generic summaries similar to what one might find in a typical data science article. These findings qualitatively validate the model’s difficulty in consistently identifying unique topics within this dataset.
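One way to corroborate this impression is to inspect the document-topic probabilities (gamma): when topics are well separated, most documents load heavily on a single topic. A minimal sketch using tidytext’s tidy() on the fitted model:
lda_gamma <- tidy(tds_lda, matrix = "gamma") # one row per document-topic pair
lda_gamma %>%
group_by(document) %>%
summarise(max_gamma = max(gamma)) %>% # each document's strongest topic loading
arrange(max_gamma) # low values flag abstracts split across several topics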
5.3 Structural Topic Model (STM)
The stm package in R requires that the documents, metadata, and “vocab” (the complete list of words appearing in the documents) be stored in distinct objects, as shown in the code below. The first step filters out exceedingly common and rare terms, a common practice in topic modeling, as these terms can complicate word-topic assignments (Bail, 2019).
temp <- textProcessor(LASTEM_data$Abstract,
metadata = LASTEM_data,
lowercase=TRUE,
removestopwords=TRUE,
removenumbers=TRUE,
removepunctuation=TRUE,
wordLengths=c(3,Inf),
stem=TRUE,
onlycharacter= FALSE,
striphtml=TRUE,
customstopwords=NULL)
## Building corpus...
## Converting to Lower Case...
## Removing punctuation...
## Removing stopwords...
## Removing numbers...
## Stemming...
## Creating Output...
docs <- temp$documents
meta <- temp$meta
vocab <- temp$vocab
tds_stm <- stm(documents=docs,
data=meta,
vocab=vocab,
prevalence =~ Year,
K=5,
max.em.its=25,
verbose = FALSE)
plot.STM(tds_stm, n = 20)
The initial iteration of the STM model with K=5 exhibits a broader array
of terms compared to LDA. However, this linear representation still
struggles to clearly depict the differences between each topic and its
adjacent topics. A more visually informative method is facilitated by
the toLDAvis function.
toLDAvis(mod = tds_stm, docs = docs)
## Loading required namespace: servr
This tool explores the spatial relationships between topics, aiming for a scenario where the model identifies 5 topics that are distinctly separate from each other. In an ideal setting, the diagram would depict 5 non-overlapping circles, each representing a unique topic. While it’s unlikely for a model to achieve such precision, this diagram with K=5 shows relatively good separation between topics. There’s only one region where topics overlap, indicating a larger proportion of similar terms shared between them.
Topic 5 is highlighted due to its size and significant spacing from neighboring topics. The size of the circle represents the number of terms associated with the topic. In this instance, the prominent term is ‘learn,’ which is expected, but it’s the accompanying terms that set it apart from neighboring topics. The solitary bubble signifies that this topic is distinct in its composition of terms.
Performance Summary
The STM model demonstrated a better ability to differentiate between topics than the LDA model did. The clear separation of topic terms and the presence of medium to large bubbles also echo the earlier metric-based indication that the statistically optimal number of topics may lie closer to 20, while incorporating metadata enhances the likelihood of distinguishing unique topics. It is worth noting, however, that most of the top 20 terms are shared across the 5 topics, suggesting some overlap or redundancy in the topics identified.
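Because the STM was fit with Year as a prevalence covariate, the relationship between publication year and topic prevalence can also be estimated directly. A minimal sketch using stm’s estimateEffect(); treating Year (converted to a factor earlier) as a categorical covariate is an assumption of this sketch:
prep <- estimateEffect(1:5 ~ Year, tds_stm, metadata = meta)
summary(prep, topics = 1) # regression-style summary for topic 1
plot(prep, covariate = "Year", topics = 1:5,
method = "pointestimate") # per-year point estimates of topic proportions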
5.4 Biterm Topic Model (BTM)
The Biterm Topic Model (BTM) is a word co-occurrence based model designed to learn topics by analyzing word-word co-occurrence patterns (Wijffels, 2021).
A biterm comprises two words that frequently occur together within the same context, typically within a short text window. This window is defined by parameters such as skipgram and width.
Skipgram determines the number of words considered in the biterm search space, while width specifies the average number of words within a single document. In this project, a skipgram value of 10 was employed, while the width remained at the default value of 15.
Unlike LDA models that focus on word occurrences within individual documents, BTM models biterm occurrences across the entire corpus, providing a different perspective on topic modeling.
LASTEM_data$ID <- 1:nrow(LASTEM_data)
LASTEM_data$ID
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
LASTEM_data_btm <- select(LASTEM_data, c(ID, Abstract)) %>%
rename (doc_id = ID) %>%
rename (text = Abstract)
Following the first two models, the BTM trial focused on training a model to identify 5 topics:
anno <- udpipe(LASTEM_data_btm, "english", trace = 10)
## 2024-04-30 13:14:47.740299 Annotating text fragment 1/45
## 2024-04-30 13:14:50.266528 Annotating text fragment 11/45
## 2024-04-30 13:14:52.932178 Annotating text fragment 21/45
## 2024-04-30 13:14:55.63223 Annotating text fragment 31/45
## 2024-04-30 13:14:57.910263 Annotating text fragment 41/45
biterms <- as.data.table(anno)
biterms <- biterms[, cooccurrence(x = lemma,
relevant = upos %in% c("NOUN",
"ADJ",
"PROPN"),
skipgram = 10),
by = list(doc_id)]
# Build BTM
set.seed(588)
traindata <- subset(anno, upos %in% c("NOUN", "ADJ", "PROPN"))
traindata <- traindata[, c("doc_id", "lemma")]
model <- BTM(traindata, k = 5,
beta = 0.01,
iter = 500,
biterms = biterms,
trace = 100)
## 2024-04-30 13:15:01 Start Gibbs sampling iteration 1/500
## 2024-04-30 13:15:02 Start Gibbs sampling iteration 101/500
## 2024-04-30 13:15:02 Start Gibbs sampling iteration 201/500
## 2024-04-30 13:15:03 Start Gibbs sampling iteration 301/500
## 2024-04-30 13:15:03 Start Gibbs sampling iteration 401/500
# Plot Model Results
library(ggraph)
plot(model,
top_n = 15,
title = "BTM model",
subtitle = "K = 5, 500 Training Iterations") # default labels number the topics 1-5
The words ‘data’, ‘analytic’ and ‘university’ were most important for topic 1. The words ‘assessment’, ‘model’, ‘framework’ and ‘study’ were most important for topic 2. The words ‘academic’, ‘correct’, ‘delay’ and ‘question’ were most important for topic 3. The words ‘student’, ‘course’ and ‘education’ were most important for topic 4. The words ‘intelligent’, ‘system’ and ‘activity’ were most important for topic 5.
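The per-topic term lists described above can also be extracted programmatically; a brief sketch using the BTM package’s terms() and predict() methods, with document-topic scores mapped back to abstracts via the doc_id created earlier:
terms(model, top_n = 10) # list of the 10 highest-probability terms per topic
scores <- predict(model, newdata = traindata) # document-topic score matrix
head(scores) # one row per abstract, one column per topic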
Performance Summary
The visual output of the BTM with K set to 5 topics has a distinct character:
Although the visualization may resemble word clouds, the underlying algorithm is more intricate than simple term frequencies. It is based on the co-occurrence of terms, adding a layer of complexity to the graphic.
6.1 Conclusion
This project delved into publications within WoS concerning learning analytics, higher education, and STEM, employing text mining to identify research trends and characteristics. From January 1, 2014, to April 7, 2024, 205 papers were published, of which 45 were chosen for analysis based on specific criteria. Results indicated fluctuating publication numbers, with the highest count in 2021. Key unigrams included ‘student,’ ‘study,’ ‘data,’ ‘model,’ ‘assess,’ and ‘result,’ while prominent bigrams comprised ‘machine learning,’ ‘student learning,’ ‘learning environment,’ ‘learning management,’ ‘management system,’ and ‘student performance.’
Regarding influential sources, the International Journal of Engineering Education and IEEE ACCESS stood out, with the University of Sydney and the Universidad de Castilla-La Mancha as prominent affiliations. Wiley, Springer, and IEEE emerged as leading publishers. Notable publications included “Combining University Student Self-Regulated Learning Indicators and Engagement with Online Learning Events to Predict Academic Performance,” “Flipping the Classroom to Improve Learning With MOOCs Technology,” “Effects and acceptance of precision education in an AI-supported smart learning environment,” “Learning analytics on structured and unstructured heterogeneous data sources: Perspectives from procrastination, help-seeking, and machine-learning defined cognitive engagement,” and “Team-based mobile learning supported by an intelligent system: case study of STEM students.”
Furthermore, the study explored topic generation using three models—LDA, STM, and BTM. BTM exhibited superior effectiveness by revealing the most unique latent topics, with five topics identified through this model.
6.2 Limitations
The Web of Science database’s coverage may not include all relevant articles in the learning analytics field, especially those from journals or publications not indexed in this database. This limitation could lead to a biased or incomplete understanding of the research landscape. Additionally, the dataset used in this analysis is relatively small, and determining the optimal K value for topic modeling may not be definitive.
6.3 Ethical Issues
Ethical Use of Data and Findings: It’s essential to handle the data and findings presented in the literature ethically and responsibly. This includes accurately representing research results without misinterpretation or misrepresentation. Additionally, acknowledging any limitations or ethical considerations highlighted by the original authors is crucial to maintain ethical standards in literature analysis.
Transparency and Reproducibility: I prioritized transparency throughout this literature analysis by meticulously documenting the search strategies, inclusion criteria, and article selection process, and strove for reproducibility by offering comprehensive information that enables others to replicate or validate the review methodology. This commitment to transparency and reproducibility enhances the credibility and trustworthiness of the literature analysis.
References
Bail, C. (2019). Topic modeling. Text as Data Course. Retrieved April 22, 2022, from https://sicss.io/2019/materials/day3-text-analysis/topic-modeling/rmarkdown/Topic_Modeling.html
Jipeng, Q., Zhenyu, Q., Yun, L., Yunhao, Y., & Xindong, W. (2019). Short text topic modeling techniques, applications, and performance: A survey. arXiv. https://doi.org/10.48550/arXiv.1904.07695
Silge, J., & Robinson, D. (2017). Topic modeling. In Text mining with R. Retrieved April 30, 2022, from https://www.tidytextmining.com/topicmodeling.html
Wijffels, J. (2021). BTM: Biterm topic models for short text. CRAN. Retrieved April 27, 2022, from https://cran.r-project.org/web/packages/BTM/BTM.pdf
Yan, X., Guo, J., Lan, Y., & Cheng, X. (2013). A biterm topic model for short texts. Retrieved April 27, 2022, from http://xiaohuiyan.github.io/paper/BTM-WWW13.pdf