TALL (https://github.com/massimoaria/tall)

Corpus
Info & References

Import texts

Importing Data in TALL

TALL provides a versatile and user-friendly interface for importing textual data from various sources, ensuring flexibility in data handling for diverse analytical needs. The platform supports multiple file formats and structures, allowing users to seamlessly prepare their datasets for analysis.

Supported File Formats

1. Plain Text Files (.txt)

Plain text files can be imported in three different ways, depending on the structure of the data:

Single file containing a single document: Ideal for analyzing an individual document, such as a speech transcript, literary work, or report.
Single file with multiple documents separated by alphanumeric codes (e.g., 'Chapter', '0001', '****'):
- TALL automatically detects these separators, enabling structured document segmentation.
- Users can further refine the segmentation using the Edit → Split menu.
Multiple .txt files, where each file represents a separate document:
- Users can either select individual files manually or import a compressed (.zip) folder containing multiple text files.
- Each document will be automatically assigned an ID based on its file name, ensuring clear organization.

2. Tabular Data (.csv, .xlsx)

Tabular formats are useful for structured datasets, such as online reviews, survey responses, or social media posts.

The text to be analyzed must be stored in a dedicated column named 'text' to ensure proper identification.
Each row in the dataset is treated as an individual document.
Additional metadata (e.g., timestamps, user IDs, categories) can be retained for contextual analysis.
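
The row-per-document convention can be sketched as follows. This is an illustration in Python, not TALL's own code (TALL is an R Shiny app); the function name `load_tabular_corpus`, the generated `doc_N` identifiers, and the sample rows are hypothetical.

```python
import csv
import io

def load_tabular_corpus(csv_text, text_column="text"):
    """Treat each row as one document and keep the other columns as metadata."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if text_column not in (reader.fieldnames or []):
        raise ValueError(f"missing required column: {text_column!r}")
    docs = []
    for i, row in enumerate(reader, start=1):
        text = row.pop(text_column)  # whatever remains in `row` is metadata
        docs.append({"doc_id": f"doc_{i}", "text": text, "metadata": row})
    return docs

# Hypothetical sample: two reviews with two metadata columns.
sample = "text,category,user_id\nGreat phone,review,u1\nBattery died fast,review,u2\n"
corpus = load_tabular_corpus(sample)
```

A file without a 'text' column is rejected up front, which mirrors the requirement stated above.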

3. PDF Documents (.pdf)

TALL supports the import of PDF files, facilitating the analysis of academic papers, reports, books, and other document types.

Text extraction occurs automatically, converting the content into a format suitable for processing.
Limitation: At the moment, TALL can only import and process PDFs that follow a single-column formatting. PDFs with multi-column layouts, footnotes, or complex page structures may not be correctly parsed, and additional preprocessing may be required.

4. Biblioshiny Export Files

TALL supports the import of files exported from Biblioshiny, the graphical user interface of the Bibliometrix R package. This feature allows users to directly analyze the textual content of bibliographic metadata extracted from bibliometric databases such as Scopus or Web of Science.

The exported file (typically in .csv format) can be loaded into TALL.
Users must specify which column (e.g., Abstract, Keywords, or Title) should be used as the main textual content for analysis.
Other fields (e.g., authors, year, journal) can be imported and used as metadata for document grouping or filtering.

TALL Structured Files (.tall)

TALL allows users to save their analysis progress in a structured format, ensuring continuity across sessions.

Save Progress: Users can export their current session as a .tall file, preserving all imported data, configurations, and analytical steps.
Load Saved Sessions: Previously saved .tall files can be reloaded, allowing users to resume their work seamlessly without the need to re-import or preprocess data.

By offering flexible and structured data import capabilities, TALL streamlines the initial steps of text analysis, enabling users to focus on extracting insights efficiently.

References

Aria, M., Cuccurullo, C., D’Aniello, L., Misuraca, M., & Spano, M. (2024). Breaking Barriers with TALL: A Text Analysis Shiny app for ALL. In A. Dister, D. Longrée (eds.), Mots comptés, textes déchiffrés (JADT24), Presses Universitaires de Louvain, Vol. 1, pp. 39-48.

Aria, M., Cuccurullo, C., D’Aniello, L., Misuraca, M., & Spano, M. (2024). TALL: A New Shiny App for Text Analysis. In Scientific Meeting of the Italian Statistical Society (pp. 64-70). Cham: Springer Nature Switzerland.

Aria, M., Cuccurullo, C., D'Aniello, L., Misuraca, M., & Spano, M. (2023). TALL: A New Shiny App of Text Analysis for All. In CLiC-it.

Split texts

Split texts by a sequence of characters (e.g. **H1**)

The delimiter must consist of at least three characters.

The delimiter is case-sensitive (e.g., 'CHAPTER' is different from 'chapter').

Splitting the Corpus in TALL

TALL allows users to split textual data into smaller segments based on a specified sequence of characters. This feature is particularly useful when dealing with large documents containing multiple sections or structured content that needs to be analyzed separately.

How It Works

Users can define a delimiter, which is a sequence of characters used to segment the text.
The delimiter must contain at least three characters to ensure accurate text splitting.
The splitting process is case-sensitive, meaning that uppercase and lowercase variations are treated as distinct (e.g., 'CHAPTER' is different from 'chapter').
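
A minimal Python sketch of these three rules follows. It is illustrative only and not TALL's implementation; the function name `split_by_delimiter` is hypothetical, and dropping the delimiter and empty segments from the output is an assumption.

```python
def split_by_delimiter(text, delimiter):
    """Split text on a case-sensitive delimiter of at least three characters."""
    if len(delimiter) < 3:
        raise ValueError("delimiter must contain at least three characters")
    # str.split compares characters exactly, so 'CHAPTER ' never matches 'chapter '.
    parts = [part.strip() for part in text.split(delimiter)]
    # Assumption: the delimiter itself and empty segments are discarded.
    return [part for part in parts if part]

book = "CHAPTER One. It began. CHAPTER Two. It ended. A final chapter note."
segments = split_by_delimiter(book, "CHAPTER ")
# segments == ["One. It began.", "Two. It ended. A final chapter note."]
```

The lowercase 'chapter' in the last sentence does not trigger a split, which is the case-sensitivity rule in action.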

Example Use Cases

Books or Reports: Splitting a novel into chapters using 'CHAPTER ' as a delimiter.
Survey Responses: Separating responses when they are structured using a marker like '###' between answers.
Transcriptions: Dividing interview transcripts based on speaker labels (e.g., 'Speaker 1:').

By offering a flexible splitting mechanism, TALL ensures that text segmentation aligns with the user's analytical needs, preserving the original structure for meaningful interpretation.

Random Selection

Info & References

Random Text Selection

Extract a random sample of texts to analyze

Sample Size (%)

Random Text Selection in TALL

TALL allows users to extract a random subset of imported texts for focused analysis. This feature is particularly useful when working with large corpora, enabling users to explore representative samples without processing the entire dataset.

How It Works

The total number of imported texts is displayed, providing an overview of the dataset size.
Users can define the sample size as a percentage (%) of the total corpus.
The selection process is random, ensuring an unbiased representation of the dataset.
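
Percentage-based sampling can be sketched in Python as below. This is an illustration rather than TALL's code; the function name `sample_documents`, the rounding rule, and the optional `seed` parameter (added here for reproducibility) are assumptions.

```python
import random

def sample_documents(doc_ids, percent, seed=None):
    """Draw a simple random sample covering `percent` % of the documents."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be in the interval (0, 100]")
    if not doc_ids:
        return []
    # Assumption: round to the nearest document and keep at least one.
    k = max(1, round(len(doc_ids) * percent / 100))
    rng = random.Random(seed)
    # random.sample draws without replacement, so no document is picked twice.
    return sorted(rng.sample(doc_ids, k))

ids = [f"doc_{i:03d}" for i in range(1, 201)]   # hypothetical corpus of 200 texts
subset = sample_documents(ids, percent=10, seed=42)
```

With 200 documents and a 10% sample size, `subset` contains 20 distinct identifiers.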

Example Use Cases

Analyzing Social Media Data: Selecting 10% of tweets from a large dataset to perform sentiment analysis.
Survey Research: Extracting a random subset of open-ended responses for qualitative coding.
Document Sampling: Reviewing a sample of reports or articles instead of analyzing the full collection.

By enabling controlled sampling, TALL helps users balance efficiency and analytical depth, making text exploration more manageable and meaningful.

External Information

Corpus with External Information
Info & References

Add from a file

To import external information, make sure that the file to be uploaded is in Excel format and contains a column labeled 'doc_id' that identifies the documents associated with the imported text(s).

You can download the list of doc_id associated with the imported text files below.

Import external information

Importing External Information in TALL

TALL allows users to integrate additional information into their analysis by importing external datasets. This feature is particularly useful for enriching text data with metadata, annotations, or categorical variables, enabling a more comprehensive exploration of textual patterns.

How to Import External Data

The external file must be in Excel format (.xlsx).
The dataset must include a column labeled 'doc_id', which is used to match external information with the previously imported text data.
The 'doc_id' values must correspond exactly to the document identifiers assigned during text import to ensure proper alignment.
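
The exact-match requirement on 'doc_id' can be illustrated with a small Python sketch. It is not TALL's implementation; the function name `attach_external_info`, the sample identifiers, and the choice to report unmatched documents are hypothetical.

```python
def attach_external_info(documents, external_rows):
    """Join external attributes onto documents using an exact doc_id match."""
    by_id = {row["doc_id"]: row for row in external_rows}
    merged, unmatched = [], []
    for doc in documents:
        info = by_id.get(doc["doc_id"])
        if info is None:
            # No row with exactly this identifier: keep the document as it is.
            unmatched.append(doc["doc_id"])
            info = {}
        extra = {key: value for key, value in info.items() if key != "doc_id"}
        merged.append({**doc, **extra})
    return merged, unmatched

documents = [{"doc_id": "speech_01", "text": "..."}, {"doc_id": "speech_02", "text": "..."}]
external = [
    {"doc_id": "speech_01", "author": "A. Rossi", "year": 2021},
    {"doc_id": "Speech_02", "author": "B. Bianchi", "year": 2022},  # wrong letter case
]
merged, unmatched = attach_external_info(documents, external)
# unmatched == ["speech_02"] because "Speech_02" differs in case
```

Downloading the list of 'doc_id' values before preparing the external file avoids this kind of mismatch.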

Using External Information

Imported external data can be used to filter or group documents based on specific attributes (e.g., author, category, sentiment).
This allows users to segment text collections efficiently, focusing on subsets relevant to their research questions.

Download Document Identifiers

To facilitate the integration process, users can download a list of 'doc_id' values associated with the imported text files below. This ensures that external data is formatted correctly before uploading.

By supporting the import of structured external data, TALL enhances text analysis capabilities, allowing users to incorporate contextual information for richer insights.

Tokenization & PoS Tagging

Annotated Text Table
Info & References

Language Model

Tokenization, Lemmatization, and PoS Tagging in TALL

TALL provides robust Natural Language Processing (NLP) capabilities for preprocessing textual data, including tokenization, lemmatization, and Part-of-Speech (PoS) tagging. These steps are essential for transforming raw text into a structured format suitable for further analysis.

Powered by UDPipe for NLP Preprocessing

TALL leverages the UDPipe library to perform tokenization, tagging, lemmatization, and dependency parsing. The udpipe R package offers seamless access to pre-trained annotation models, supporting multiple languages.

Tokenization: Splits raw text into individual words or tokens.
Lemmatization: Converts words into their base or dictionary form (e.g., 'running' → 'run').
PoS Tagging: Assigns grammatical categories (e.g., noun, verb, adjective) to each word.
Dependency Parsing: Identifies syntactic relationships between words in a sentence.

Updated Pre-trained Language Models

By default, UDPipe includes models based on Universal Dependencies (UD) version 2.5, which have not been updated in some time. To improve accuracy and linguistic processing, TALL now integrates updated pre-trained language models based on Universal Dependencies (UD) version 2.15.

These models were trained using gold standard annotated corpora from the UD project, significantly improving the quality of text analysis in TALL. The updated pre-trained models used in TALL can be accessed through our GitHub repository.

Applications in NLP and Text Analysis

Sentiment Analysis: Better understanding of word usage and context.
Topic Modeling: Improved preprocessing for cleaner topic extraction.
Corpus Exploration: Advanced filtering and segmentation of texts based on linguistic attributes.

By integrating updated NLP models and leveraging powerful preprocessing techniques, TALL ensures high-quality text analysis, making it a valuable tool for researchers and practitioners in computational linguistics.

References

TALL Pre-trained Models Repository: GitHub repository for pre-trained models

UDPipe R Package: CRAN link to UDPipe

Universal Dependencies Repository: Universal Dependencies project

Tagging Special Entities

Tagging Special Entities in TALL

TALL automatically detects and tags special entities within texts, ensuring that key non-linguistic elements are properly identified and can be leveraged in further analysis.
Recognizing these entities helps improve text preprocessing, pattern recognition, and contextual analysis.

Detected Special Entities

When processing textual data, TALL assigns specific tags to the following entities:

Email Addresses: Recognizes and tags email formats (e.g., example@domain.com).
URLs: Detects web links, ensuring they can be excluded or analyzed separately (e.g., https://www.example.com/path).
Emojis: Identifies and classifies emojis used in digital communication (e.g., 😊, 🚀, ❤️).
Hashtags: Extracts hashtags commonly used in social media and categorization (e.g., #ExampleTag).
IP Addresses: Detects standard IP address formats (e.g., 192.168.1.1), which may be useful in network-related text analysis.
Mentions: Identifies references to usernames, particularly in social media or chat applications (e.g., @username).
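
A rough idea of how such entities can be detected with regular expressions is sketched below in Python. The patterns are deliberately simplified assumptions and are not the ones TALL uses; the tag names, the function name `tag_special_entities`, and the emoji code-point range (which covers only part of Unicode's emoji) are hypothetical.

```python
import re

# Alternatives are tried in order at each position, so URLs and emails are
# checked before the shorter hashtag and mention patterns.
ENTITY_PATTERN = re.compile(
    r"(?P<URL>https?://\S+)"
    r"|(?P<EMAIL>[\w.+-]+@[\w-]+(?:\.[\w-]+)+)"
    r"|(?P<IP_ADDRESS>\b(?:\d{1,3}\.){3}\d{1,3}\b)"
    r"|(?P<HASHTAG>#\w+)"
    r"|(?P<MENTION>@\w+)"
    r"|(?P<EMOJI>[\U0001F300-\U0001FAFF\u2600-\u27BF])"
)

def tag_special_entities(text):
    """Return (tag, matched_text) pairs in order of appearance."""
    return [(m.lastgroup, m.group()) for m in ENTITY_PATTERN.finditer(text)]

text = ("Write to example@domain.com, see https://www.example.com/path "
        "#ExampleTag @username 192.168.1.1 🚀")
entities = tag_special_entities(text)
```

Here `entities` lists one EMAIL, one URL, one HASHTAG, one MENTION, one IP_ADDRESS, and one EMOJI, in that order. The simplified IP pattern does not check that each number is at most 255.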

Why Special Entity Tagging Matters

Enhanced Text Cleaning: Filtering out or isolating elements that may not contribute to linguistic analysis.
Social Media and Web Analysis: Extracting meaningful patterns from hashtags, mentions, and URLs.
Sentiment and Emotion Studies: Analyzing the role of emojis in sentiment-based communication.
Cybersecurity and Digital Forensics: Identifying sensitive data points such as email addresses and IP addresses.

By integrating special entity recognition, TALL enhances the preprocessing phase, ensuring that these elements are structured for more effective text analysis.

Special Entities

When processing text, special tags will be assigned to certain detected entities.

These include:

- Email addresses: example@domain.com
- URLs: https://www.example.com/path
- Emojis: 😊, 🚀, ❤️
- Hashtags: #ExampleTag
- IP addresses: 192.168.1.1
- Mentions: @username

This ensures that these elements are identified and marked for further analysis within the text.

Custom Term List Loading and Merging

PoS Tagging with Custom List
Info & References

Custom Term List in TALL

TALL allows users to define a Custom Term List, enabling more precise control over text processing and linguistic analysis. This feature allows users to manually assign custom tags to specific terms, overriding their default categorization by the language model.

Why Use a Custom Term List?

Highlighting Specific Concepts: Identifying key terms related to methodologies, specialized vocabulary, or domain-specific jargon.
Filtering Stop Words: Removing terms that are irrelevant to the analysis, ensuring a cleaner dataset.
Enhancing Named Entity Recognition (NER): Manually tagging specific words that the language model may misclassify.
Overriding Default PoS Assignments: Ensuring consistency in tagging across texts by defining a fixed categorization for certain terms.

How to Import a Custom Term List

To integrate a custom list of terms, users must provide a properly formatted file:

The list must be in Excel format (.xlsx).
The file should contain two columns:

First column: The list of terms to be tagged.
Second column: The corresponding Part-of-Speech (PoS) or user-defined category assigned to each term.

The specified tags should align with standard linguistic categories (e.g., noun, verb, adjective) or custom categories for specific analysis needs.

Example of Custom Term List Format

Term	Custom Tag
artificial intelligence	methodology
deep learning	methodology
preprocess	data_handling
dataset	data_handling
remove	Ignore
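
The overriding behaviour can be sketched in Python as follows. This is an illustration, not TALL's code: the function name `apply_custom_tags`, matching on the lemma, and ignoring letter case are all assumptions, and multi-word terms such as 'artificial intelligence' would first need to be merged into single tokens.

```python
def apply_custom_tags(tokens, custom_terms):
    """Replace the model-assigned tag of every token found in the custom list."""
    # Assumption: terms are matched against the lemma, ignoring letter case.
    lookup = {term.lower(): tag for term, tag in custom_terms}
    return [
        {**tok, "upos": lookup.get(tok["lemma"].lower(), tok["upos"])}
        for tok in tokens
    ]

custom_terms = [("dataset", "data_handling"), ("remove", "Ignore")]
tokens = [
    {"lemma": "dataset", "upos": "NOUN"},
    {"lemma": "remove", "upos": "VERB"},
    {"lemma": "model", "upos": "NOUN"},
]
retagged = apply_custom_tags(tokens, custom_terms)
# 'dataset' and 'remove' receive the custom tags; 'model' keeps NOUN.
```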

By allowing users to define and control term tagging, TALL provides enhanced flexibility for text analysis, making it a powerful tool for domain-specific research and refined linguistic processing.

Import Custom Term List

Please ensure that the Custom Term List is formatted as an Excel file with two columns: the first column contains the terms, and the second contains the PoS tag (or custom category) assigned to each term.

Pressing the Run button will delete any previously assigned custom PoS tags.

Multi-Word Creation

Algorithms for Automatic Multi-Word Extraction

TALL (Text Analysis for All) employs four algorithms to automatically generate multi-word sequences from a corpus of documents. These methods, widely used in computational linguistics and text mining, are Rapid Automatic Keyword Extraction (RAKE), Pointwise Mutual Information (PMI), Mutual Dependency (MD), and Log-Frequency Biased Mutual Dependency (LF-MD).

- Rapid Automatic Keyword Extraction (RAKE)

RAKE is a domain-independent keyword extraction algorithm that identifies key phrases by analyzing word co-occurrences within a document. It segments text into candidate keyword phrases based on stopword delimiters and then assigns scores based on word co-occurrence and frequency. Higher-scoring phrases are considered more relevant as multi-word expressions.

Reference:
Rose, S., Engel, D., Cramer, N., & Cowley, W. (2010). Automatic keyword extraction from individual documents. Text Mining: Applications and Theory, 1(1), 1-20.

- Pointwise Mutual Information (PMI)

PMI is a statistical measure used to assess the association strength between two words. It is defined as:

PMI(w₁, w₂) = log ( P(w₁, w₂) / (P(w₁) P(w₂)) )

where P(w₁, w₂) is the probability of words w₁ and w₂ appearing together, and P(w₁) and P(w₂) are their individual probabilities. High PMI values indicate strong word associations, making the phrase a good multi-word candidate.

Reference:
Church, K. W., & Hanks, P. (1990). Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1), 22-29.

- Mutual Dependency (MD)

Mutual Dependency extends PMI by considering the full context of a multi-word expression rather than just pairwise co-occurrence. It incorporates statistical dependency measures, ensuring that all words in a multi-word sequence contribute significantly to its overall meaning. This approach is particularly useful for identifying multi-word units beyond simple bigrams.

Reference:
Thanopoulos, A., Fakotakis, N., & Kokkinakis, G. (2002, May). Comparative Evaluation of Collocation Extraction Metrics. In LREC (Vol. 2, pp. 620-625).

- Log-Frequency Biased Mutual Dependency (LF-MD)

LF-MD refines the MD approach by incorporating word frequency into the dependency calculation. This method biases the selection of multi-word expressions toward frequent collocations while maintaining a balance between statistical significance and linguistic relevance. It is particularly useful in extracting meaningful multi-word expressions in large corpora where rare but statistically significant collocations might otherwise dominate.

Reference:
Thanopoulos, A., Fakotakis, N., & Kokkinakis, G. (2002, May). Comparative Evaluation of Collocation Extraction Metrics. In LREC (Vol. 2, pp. 620-625).
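
The three association measures can be compared on a toy example. The Python sketch below uses the bigram formulas as they are usually stated following Thanopoulos et al. (2002): MD = log2(P(w₁, w₂)² / (P(w₁) P(w₂))) and LF-MD = MD + log2 P(w₁, w₂). The base-2 logarithm, the relative-frequency probability estimates, and the function name `collocation_scores` are assumptions; this is not TALL's implementation.

```python
import math
from collections import Counter

def collocation_scores(tokens):
    """Score every adjacent word pair with PMI, MD and LF-MD."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (w1, w2), count in bigrams.items():
        p12 = count / n_bi           # estimated P(w1, w2)
        p1 = unigrams[w1] / n_uni    # estimated P(w1)
        p2 = unigrams[w2] / n_uni    # estimated P(w2)
        pmi = math.log2(p12 / (p1 * p2))
        md = math.log2(p12 ** 2 / (p1 * p2))   # equals pmi + log2(p12)
        lfmd = md + math.log2(p12)             # extra bias toward frequent pairs
        scores[(w1, w2)] = {"freq": count, "pmi": pmi, "md": md, "lfmd": lfmd}
    return scores

tokens = "machine learning is fun and machine learning is useful".split()
scores = collocation_scores(tokens)
best = max(scores, key=lambda pair: scores[pair]["lfmd"])
```

Because log2 P(w₁, w₂) is never positive, MD and LF-MD penalize rare pairs more strongly than PMI does, which is why they favor frequent collocations.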

Automatic Multi-Words

Multi-word creation extracts keywords (sequences of terms) from the text.

After keywords are generated, select those you wish to include in your data from the list.

Relevant Collocation Algorithm

Ngrams

Freq Min

Multi-Words created by:

Multi-Word Creation by a List

Multi-Word Creation by a List in TALL

TALL allows users to define multi-word expressions (MWEs) by importing a predefined list of multi-word terms. This feature is particularly useful for ensuring that specific phrases or domain-specific expressions are treated as single units during text processing, improving linguistic analysis.

How to Import a Multi-Word List

To integrate multi-word expressions into the analysis, users must provide a properly formatted list:

The list must be in Excel (.xlsx) or CSV (.csv) format.
The file should contain a single column where each row represents one multi-word expression.
Each term within a multi-word expression must be separated by a single whitespace (e.g., machine learning, natural language processing).
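
How a list of multi-word expressions might be applied to a token sequence is sketched below in Python. It is an illustration only: the function name `merge_multiwords`, the underscore joiner, case-insensitive matching, and the longest-match-first rule are assumptions, not documented TALL behaviour.

```python
def merge_multiwords(tokens, multiwords, joiner="_"):
    """Collapse each listed multi-word expression into a single token."""
    # Splitting on exactly one space mirrors the single-whitespace requirement:
    # an entry typed with a double space would never match.
    patterns = sorted(
        {tuple(mw.lower().split(" ")) for mw in multiwords if mw.strip()},
        key=len,
        reverse=True,  # try longer expressions first
    )
    merged, i = [], 0
    while i < len(tokens):
        for pattern in patterns:
            window = tuple(t.lower() for t in tokens[i:i + len(pattern)])
            if window == pattern:
                merged.append(joiner.join(tokens[i:i + len(pattern)]))
                i += len(pattern)
                break
        else:
            # No expression starts at this position: keep the token unchanged.
            merged.append(tokens[i])
            i += 1
    return merged

tokens = "We apply natural language processing and machine learning".split()
multiwords = ["machine learning", "natural language processing", "natural language"]
result = merge_multiwords(tokens, multiwords)
# result == ["We", "apply", "natural_language_processing", "and", "machine_learning"]
```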

Why Use Multi-Word Expressions?

Preserving Meaningful Phrases: Ensuring that key terms (e.g., artificial intelligence) are not split into separate words.
Improving Text Preprocessing: Enhancing tokenization and lemmatization by treating phrases as cohesive units.
Enhancing Domain-Specific Analysis: Beneficial in specialized fields such as legal, medical, or technical texts, where multi-word terms have precise meanings.

By supporting multi-word recognition, TALL provides users with greater flexibility in structuring their text analysis and ensures that critical expressions are accurately identified and processed.

Import a Multi-Word List

Please ensure that the Multi-Word List is formatted as an Excel/CSV file with one column. Each cell of that column contains one multi-word expression, and the terms within it must be separated by a single whitespace.

PoS Tag Selection

Annotated Text
Info & References

PoS Tag Selection in TALL

TALL provides users with the flexibility to select specific Part-of-Speech (PoS) tags to be used in subsequent analyses. This feature allows for greater control over the linguistic elements included in text processing, ensuring that only relevant grammatical categories are considered.

Why Select PoS Tags?

Filtering Out Unnecessary Elements: Excluding determiners, conjunctions, or punctuation that may not contribute to the analysis.
Focusing on Key Linguistic Features: Selecting only nouns and verbs for topic modeling, or adjectives and adverbs for sentiment analysis.
Improving Computational Efficiency: Reducing data size and processing time by analyzing only the most relevant word categories.

How It Works in TALL

Users can manually select or deselect PoS categories from a predefined list.
The available PoS tags follow the Universal Dependencies (UD) annotation scheme, ensuring consistency across different languages.

Default Selected PoS Tags

By default, TALL selects the following PoS categories:

ADJ: Adjective – Descriptive words (e.g., 'beautiful', 'quick').
NOUN: Noun – Common nouns representing entities (e.g., 'dog', 'city').
PROPN: Proper Noun – Specific names of places, people, or organizations (e.g., 'London', 'NASA').
VERB: Verb – Action words representing processes (e.g., 'run', 'speak').
HAPAX: Words appearing only once in the corpus, useful for lexical richness analysis.
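
The effect of a PoS selection can be sketched in Python as below. This is an assumption-laden illustration, not TALL's code: the function name `select_tokens` is hypothetical, hapax words are counted on lemmas, and deselecting Hapax is assumed to drop words that occur only once.

```python
from collections import Counter

DEFAULT_UPOS = {"ADJ", "NOUN", "PROPN", "VERB"}

def select_tokens(tokens, keep_upos=DEFAULT_UPOS, keep_hapax=True):
    """Keep tokens whose tag is selected; optionally drop words seen only once."""
    counts = Counter(tok["lemma"] for tok in tokens)
    selected = []
    for tok in tokens:
        is_hapax = counts[tok["lemma"]] == 1
        if tok["upos"] in keep_upos and (keep_hapax or not is_hapax):
            selected.append(tok)
    return selected

tokens = [
    {"lemma": "the", "upos": "DET"},
    {"lemma": "model", "upos": "NOUN"},
    {"lemma": "learn", "upos": "VERB"},
    {"lemma": "model", "upos": "NOUN"},
    {"lemma": "quickly", "upos": "ADV"},
    {"lemma": "London", "upos": "PROPN"},
]
with_hapax = select_tokens(tokens)
without_hapax = select_tokens(tokens, keep_hapax=False)
```

With the default selection, the determiner and the adverb are dropped; deselecting Hapax additionally removes 'learn' and 'London', which occur only once.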

Available PoS Categories in TALL

PoS Tag	Description
ADJ	Adjective
ADP	Adposition
ADV	Adverb
AUX	Auxiliary
CCONJ	Coordinating Conjunction
DET	Determiner
INTJ	Interjection
NOUN	Noun
NUM	Numeral
PART	Particle
PRON	Pronoun
PROPN	Proper Noun
PUNCT	Punctuation
SCONJ	Subordinating Conjunction
SYM	Symbol
VERB	Verb
X	Other
Hapax	Words appearing only once in the corpus
Single Character	Individual symbols or characters

Custom Categories

In addition to the predefined PoS categories, users may also have generated custom categories through the Custom List and Multi-Word menus.
These user-defined tags allow for specialized analysis by grouping specific terms under a unique classification system.

Enhancing Analysis with PoS Selection

By allowing users to choose specific PoS categories, TALL ensures that the analysis is tailored to the user's research goals.
Whether performing keyword extraction, syntactic analysis, topic modeling, or sentiment analysis, the ability to refine PoS selection enhances the precision and interpretability of results.

Select:

Hapax

Single Character

Filter docs by available external information

Filter by

Select the external information used to filter docs:

Define groups by available external information

Select external information

Select the external information used to define new document groups:

Overview

Words

Text Size

Word Frequency by PoS

Corpus Metrics in TALL

These metrics provide a summary of the key textual characteristics of the analyzed corpus.

📂 Corpus Size & Structure

Documents → The total number of documents in the corpus.
Sentences → The total number of sentences in the corpus.
Tokens → The total number of words or linguistic units, including punctuation marks.
Types → The number of unique words in the corpus, representing vocabulary richness.
Lemma → The number of unique lemmas, considering the base form of words.

📏 Average Length Metrics

Doc Avg Length in Chars → The average number of characters per document.
$\frac{\text{Total Characters}}{\text{Number of Documents}}$
Doc Avg Length in Tokens → The average number of tokens per document.
$\frac{\text{Total Tokens}}{\text{Number of Documents}}$
Sent Avg Length in Chars → The average number of characters per sentence.
$\frac{\text{Total Characters}}{\text{Number of Sentences}}$
Sent Avg Length in Tokens → The average number of tokens per sentence.
$\frac{\text{Total Tokens}}{\text{Number of Sentences}}$

📊 Lexical Metrics

Type-Token Ratio (TTR) → Ratio of unique words (types) to total words (tokens). Higher values indicate greater lexical diversity.
$\text{TTR} = \frac{\text{Types}}{\text{Tokens}}$
Hapax Legomena (%) → Percentage of words that appear only once in the corpus.
$\text{Hapax \%} = \frac{\text{Hapax}}{\text{Types}} \times 100$
Guiraud Index → Measure of lexical richness correcting for text length.
$\text{Guiraud} = \frac{\text{Types}}{\sqrt{\text{Tokens}}}$

📊 Additional Lexical Measures

Lexical Density → Proportion of content words over total tokens.
$\text{Lexical Density} = \frac{\text{Content Words}}{\text{Total Tokens}}$
Nominal Ratio → Ratio between nouns and verbs.
$\text{Nominal Ratio} = \frac{\text{Number of Nouns}}{\text{Number of Verbs}}$
Gini Index → Measure of inequality in word frequency distribution. Calculated from the Lorenz curve of word frequencies.
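Yule’s K Index → Measure of lexical diversity based on word repetition. Here $f_i$ is the frequency of the $i$-th type, $n$ is the number of types, and $N$ is the total number of tokens.
$K = 10{,}000 \times \frac{\left(\sum_{i=1}^{n} f_i^{2}\right) - N}{N^{2}}$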
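
The lexical measures above can be checked by hand on a tiny example. The Python sketch below is an illustration, not TALL's implementation; the function name `lexical_metrics` is hypothetical, and the Gini index uses the standard sorted-frequency formula as one possible way to compute it from the Lorenz curve.

```python
import math
from collections import Counter

def lexical_metrics(tokens):
    """Compute TTR, hapax %, Guiraud, Yule's K and Gini from a token list."""
    freqs = Counter(tokens)
    n_tokens = sum(freqs.values())   # N
    n_types = len(freqs)             # number of distinct words
    hapax = sum(1 for f in freqs.values() if f == 1)
    sum_sq = sum(f * f for f in freqs.values())
    # Gini: frequencies sorted in ascending order, ranks starting at 1.
    ranked = sum(rank * f for rank, f in enumerate(sorted(freqs.values()), start=1))
    gini = 2 * ranked / (n_types * n_tokens) - (n_types + 1) / n_types
    return {
        "ttr": n_types / n_tokens,
        "hapax_pct": 100 * hapax / n_types,
        "guiraud": n_types / math.sqrt(n_tokens),
        "yule_k": 10_000 * (sum_sq - n_tokens) / n_tokens ** 2,
        "gini": gini,
    }

metrics = lexical_metrics("the cat sat on the mat the end".split())
# 8 tokens, 6 types, 5 hapax: ttr = 0.75, yule_k = 937.5
```

In this example the sum of squared frequencies is 3² + 5 × 1² = 14, so K = 10,000 × (14 − 8) / 8² = 937.5.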

References

Baayen, R. H. The effect of lexical specialization on the growth curve of vocabulary. Computational Linguistics, 22(2), 1996.

Bentz, C., Alikaniotis, D., Cysouw, M., & Ferrer-i-Cancho, R. The entropy of words—learnability and expressivity across more than 1000 languages. Entropy, 19(6), 2017.

Biber, D. Variation across speech and writing. Cambridge University Press, 1988.

Guiraud, P. Les caractères statistiques du vocabulaire. Presse Universitaire de France, 1954.

Tweedie, F. J., & Baayen, R. H. How variable may a constant be? Measures of lexical richness in perspective. Computers and the Humanities, 32(5), 323–352, 1998.

Ure, J. Lexical density and register differentiation. In G. Perren and J.L.M. Trim (eds). Applications of Linguistics, Cambridge University Press, 443–452, 1971.

Yule, G. U. The statistical study of literary vocabulary. Cambridge University Press, 1944.

Part of Speech Frequency List

Plot
Table

Words in Context

Words in Context in TALL

The Words in Context feature in TALL allows users to analyze how specific words appear in textual data, offering valuable insights into semantic usage, contextual meaning, and discourse structure. This tool is particularly useful for qualitative text analysis, linguistic research, and content exploration in diverse domains, such as social sciences, digital humanities, marketing, and legal studies.

How Words in Context Works in TALL

1. Concordance Analysis (Keyword in Context - KWIC)

Displays a side-by-side view of words and their surrounding textual context (left and right neighbors).
Helps in identifying common phrases, recurring structures, and usage variations.
Useful for studying semantic shifts, idiomatic expressions, and collocations.

📌 Example:
If analyzing the term 'sustainable' in a corpus of news articles, KWIC might show:
- 'sustainable development is a key focus of international policies'
- 'the company promotes sustainable and ethical supply chains'
- 'concerns over sustainable agricultural practices are increasing'
This helps in understanding how 'sustainable' is used in different thematic contexts.

2. Context Window Customization

Users can define the window size (number of words before and after the target term) to adjust the level of contextual information displayed.
Shorter windows highlight immediate linguistic relationships, while larger windows help analyze broader semantic dependencies.

📌 Example:
When studying 'risk' in financial reports, adjusting the window size allows users to see if it is used in association with:
- 'risk management,' 'high-risk investments' (short window)
- 'the recent economic downturn has increased financial risk for small businesses' (larger window)
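
A keyword-in-context view with an adjustable window can be sketched in a few lines of Python. This is an illustration rather than TALL's implementation; the function name `kwic`, whitespace tokenization, and case-insensitive matching are assumptions.

```python
def kwic(tokens, keyword, before=3, after=3):
    """Return (left context, keyword, right context) for every occurrence."""
    rows = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = tokens[max(0, i - before):i]      # up to `before` words
            right = tokens[i + 1:i + 1 + after]      # up to `after` words
            rows.append((" ".join(left), tok, " ".join(right)))
    return rows

tokens = "the company promotes sustainable and ethical supply chains".split()
narrow = kwic(tokens, "sustainable", before=2, after=2)
wide = kwic(tokens, "sustainable", before=5, after=5)
# narrow == [("company promotes", "sustainable", "and ethical")]
```

The wider window simply returns more of the sentence on each side, truncated at the start and end of the text.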

3. Frequency and Distribution Insights

Words appearing in multiple contexts can be analyzed for frequency trends, helping users identify dominant themes associated with a term.
Examines whether a word is evenly distributed across the corpus or clustered in specific sections/documents.

📌 Example:
In a dataset of customer reviews, the word 'expensive' might frequently co-occur with:
- 'but worth it' in positive reviews
- 'not justified for the quality' in negative reviews
This helps distinguish when 'expensive' has a neutral, positive, or negative connotation.

By enabling customizable and interactive text exploration, the Words in Context tool in TALL provides users with a deeper understanding of language patterns in large textual datasets.

Words in Context

Search word(s) in text

Window Length:

Before

After

Clustering

Dendrogram
Table

Reinert Clustering

Reinert Clustering in TALL

Reinert clustering is a hierarchical descending classification method used for textual data clustering. It identifies lexically homogeneous word clusters based on the co-occurrence of terms within textual contexts. Originally developed by Max Reinert (1983, 1990), this approach has become a core method in corpus linguistics, sociolinguistics, and content analysis.

Reinert’s method is particularly effective in structuring large textual datasets, making it a powerful tool for thematic segmentation, discourse analysis, and socio-linguistic research.

How Reinert Clustering Works in TALL

1. Text Segmentation into Context Units

The text is divided into small context units (CUs), typically paragraphs or fixed-length segments, to capture local lexical co-occurrence patterns.
Each CU is treated as a vector of word frequencies.

2. Iterative Splitting of Clusters

The method starts with all CUs grouped together.
A first split is performed, maximizing intra-cluster homogeneity while ensuring that word distributions differ between groups.
This recursive process continues until no further meaningful lexical differentiation can be achieved.
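
A single split can be illustrated with the chi-square criterion commonly used to describe the Reinert method. The Python sketch below is a strong simplification and not TALL's or rainette's code: it assumes the context units are already ordered (in descriptions of the method, this ordering typically comes from the first axis of a correspondence analysis) and only searches for the cut point that maximizes the chi-square between the two resulting groups. The function names are hypothetical.

```python
def chi_square(group_a, group_b):
    """Chi-square of the 2 x terms table built from two groups' term totals."""
    row_a, row_b = sum(group_a), sum(group_b)
    total = row_a + row_b
    if row_a == 0 or row_b == 0:
        return 0.0  # an empty group cannot be contrasted with the other
    chi2 = 0.0
    for a, b in zip(group_a, group_b):
        col = a + b
        if col == 0:
            continue  # term absent from both groups
        for observed, row in ((a, row_a), (b, row_b)):
            expected = row * col / total
            chi2 += (observed - expected) ** 2 / expected
    return chi2

def best_split(ordered_units):
    """Find the cut of an ordered list of CU vectors that maximizes chi-square."""
    n_terms = len(ordered_units[0])
    best_cut, best_score = None, -1.0
    for cut in range(1, len(ordered_units)):
        left = [sum(u[j] for u in ordered_units[:cut]) for j in range(n_terms)]
        right = [sum(u[j] for u in ordered_units[cut:]) for j in range(n_terms)]
        score = chi_square(left, right)
        if score > best_score:
            best_cut, best_score = cut, score
    return best_cut, best_score

# Five hypothetical context units over four terms (1 = term present).
units = [[1, 1, 0, 0], [1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
cut, score = best_split(units)
```

Here the best cut falls after the third unit, separating the units that use the first two terms from those that use the last two. Repeating the same step on the resulting groups gives the descending hierarchy.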

3. Statistical Association of Words to Clusters

Words are assigned probabilistic weights based on their distribution within each cluster.
The most characteristic words of each cluster are identified, forming the lexical profile of the topic.

4. Interpretation and Thematic Analysis

The final clusters represent coherent thematic units.
Thematic interpretation is facilitated by analyzing the most significant words in each cluster.

Reinert Clustering vs. Traditional Topic Modeling

Feature	Reinert Clustering	LDA Topic Modeling
Method	Hierarchical word clustering	Probabilistic word-topic assignment
Output	Discrete word clusters with distinct themes	Soft assignment of words to topics
Context Sensitivity	High – Uses local lexical co-occurrence	Medium – Uses global probability distributions
Interpretability	Direct thematic segmentation	Requires manual topic interpretation
Application	Text segmentation, discourse analysis	Thematic classification, topic inference

Implementation of Reinert Clustering in TALL

The implementation of Reinert clustering in TALL was inspired by the 'rainette' package (Barnier & Privé, 2023). The original routines have been adapted to work with the TALL data structure, which includes tokenized, lemmatized, and PoS-tagged corpora.

This adaptation allows:

Customization of context unit size to fit different corpus structures.
Compatibility with pre-processed linguistic data, ensuring greater accuracy in lexical clustering.
Optimized performance for large-scale text analysis, leveraging TALL’s text processing pipeline.
Graphical visualization of thematic structures to facilitate interpretation and reporting.

By adapting Reinert’s methodology to TALL’s specialized NLP framework, researchers can conduct advanced text clustering analyses while maintaining compatibility with state-of-the-art linguistic preprocessing techniques.

References

Reinert, M. Une méthode de classification descendante hiérarchique : application à l'analyse lexicale par contexte. Cahiers de l'analyse des données, 8(2), 1983.

Reinert, M. Alceste: Une méthodologie d'analyse des données textuelles et une application: Aurelia De Gerard De Nerval. Bulletin de Méthodologie Sociologique, 26(1), 1990. DOI: 10.1177/075910639002600103

Barnier, J., & Privé, F. rainette: The Reinert Method for Textual Data Clustering. R CRAN Repository, 2023. DOI: 10.32614/CRAN.package.rainette

Correspondence Analysis

Correspondence Analysis in TALL

Correspondence Analysis (CA) is a fundamental technique for exploring semantic relationships among words within a text collection (Benzécri, 1982; Lebart et al., 1997). By applying dimensionality reduction, CA represents the most relevant information in a low-rank vector space, uncovering latent structures within the data. These structures are then visualized on factorial maps, allowing users to detect associations between terms and documents effectively.

Why Use Correspondence Analysis?

Revealing Hidden Patterns: CA captures relationships between words and documents that might not be immediately apparent.
Dimensionality Reduction: By projecting the data into a lower-dimensional space, CA simplifies complex text corpora while retaining key semantic information.
Visualization on Factorial Maps: The results are displayed on a graphical representation, enabling easy interpretation of term clusters and document similarities.

Limitations of Correspondence Analysis

One of the primary challenges of CA is that the new features generated through dimensionality reduction often lack direct interpretability. Since the transformation is data-driven, the factors extracted do not always correspond to clear linguistic or thematic constructs, making it more difficult to derive explicit meaning from the analysis.

Enhancing Interpretability: The Tandem Approach

To address this limitation, TALL integrates a tandem approach, which combines CA with clustering techniques to improve the interpretability of results (Misuraca & Spano, 2020). This approach follows a two-step process:

Dimensionality Reduction with CA: The text data is transformed into orthogonal and ordered features, preserving essential relationships while reducing complexity.
Hierarchical Clustering: Clustering is applied to the transformed data, allowing for multi-level aggregation of terms and documents. Unlike simple factor analysis, this method provides non-overlapping clusters, making the results easier to interpret.

Applications of Correspondence Analysis in Text Mining

Exploring Co-occurrence Patterns: Identifying how frequently certain words appear together in a corpus.
Thematic Segmentation: Grouping documents based on their shared linguistic characteristics.
Semantic Mapping: Revealing latent structures within unstructured text data.
Lexical Field Analysis: Understanding how words are distributed and related within a text collection.

By integrating Correspondence Analysis with clustering methods, TALL enhances the interpretability and usability of text mining workflows, offering a powerful framework for unsupervised exploration of large document collections.

References

Benzécri, J. P. (1982). Histoire et préhistoire de l’analyse des données. Paris: Dunod.

Lebart, L., Salem, A., & Berry, L. (1997). Exploring textual data. Volume 4. Springer Science & Business Media.

Misuraca, M., & Spano, M. (2020). Unsupervised Analytic Strategies to Explore Large Document Collections. Heidelberg: Springer, 06, 17-28.

Co-word analysis

Co-Word Analysis in TALL

Co-word analysis is a network-based text mining technique that examines co-occurrence patterns of words within a corpus, identifying semantic structures based on term relationships (Callon et al., 1983). This method is particularly valuable in detecting thematic clusters within large textual datasets, as it helps uncover conceptual linkages and emerging research topics in various fields.

How Co-Word Analysis Works

Nodes represent words (terms extracted from the corpus).
Edges represent co-occurrence relationships (connections between words appearing together in the same context).
Edge weights reflect frequency, meaning stronger relationships are represented by thicker connections.
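
The network elements above can be sketched in a few lines of Python. This is a minimal illustration, not TALL's implementation: the function name `cooccurrence_network` and the toy corpus are hypothetical, and each context unit (sentence or document) is assumed to be already tokenized.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_network(contexts):
    """Count term occurrences (nodes) and pairwise co-occurrences (edges).

    `contexts` is a list of token lists; each inner list is one context unit.
    """
    term_counts = Counter()   # number of contexts containing each term
    edge_weights = Counter()  # number of contexts containing both terms
    for tokens in contexts:
        terms = sorted(set(tokens))  # count each term once per context
        term_counts.update(terms)
        edge_weights.update(combinations(terms, 2))
    return term_counts, edge_weights


# Toy corpus (hypothetical data)
contexts = [
    ["text", "mining", "network"],
    ["text", "mining"],
    ["network", "analysis", "text"],
]
nodes, edges = cooccurrence_network(contexts)
print(nodes["text"], edges[("mining", "text")])
```

Pairs are stored in alphabetical order, so each undirected edge appears under a single key.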

Normalization Measures in Co-Word Analysis

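Raw co-occurrence frequencies can be biased by term frequency in the corpus, making normalization essential to provide meaningful co-word relationships. TALL allows users to apply different normalization measures to refine co-occurrence networks (Eck & Waltman, 2009). In the formulas below, C_ij denotes the number of co-occurrences of terms i and j, while C_i and C_j denote the number of occurrences of each term: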

Association Index

The Association Index (AI) normalizes co-occurrence counts relative to the expected frequency of terms in the corpus:

AI_ij = C_ij / (C_i × C_j)

Cosine Similarity

Cosine Similarity measures how similar two terms are based on their co-occurrence across different documents:

cos(θ) = C_ij / sqrt(C_i × C_j)

Jaccard Similarity

The Jaccard Similarity measures the co-occurrence strength relative to the total occurrences of both words:

J_ij = C_ij / (C_i + C_j - C_ij)
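
As a worked example, the three measures can be computed directly from the counts. The sketch below assumes that C_ij is the co-occurrence count of the two terms and that C_i and C_j are their individual occurrence counts; the function names and the example values are illustrative only.

```python
import math


def association_index(c_ij, c_i, c_j):
    """AI_ij = C_ij / (C_i * C_j)"""
    return c_ij / (c_i * c_j)


def cosine_similarity(c_ij, c_i, c_j):
    """cos = C_ij / sqrt(C_i * C_j)"""
    return c_ij / math.sqrt(c_i * c_j)


def jaccard_similarity(c_ij, c_i, c_j):
    """J_ij = C_ij / (C_i + C_j - C_ij)"""
    return c_ij / (c_i + c_j - c_ij)


# Hypothetical counts: the terms co-occur twice and occur 4 and 9 times
c_ij, c_i, c_j = 2, 4, 9
print(association_index(c_ij, c_i, c_j))   # 2 / 36
print(cosine_similarity(c_ij, c_i, c_j))   # 2 / 6
print(jaccard_similarity(c_ij, c_i, c_j))  # 2 / 11
```

All three measures use the same numerator and differ only in how strongly they penalize frequent terms.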

Community Detection for Semantic Clustering

To extract thematic clusters, TALL applies the Walktrap algorithm for community detection (Pons & Latapy, 2006):

Uses random walks on the co-occurrence network to detect structurally cohesive word communities.
Efficiently discovers hierarchical relationships among terms.
Groups words into non-overlapping clusters, representing latent topics or conceptual domains within the corpus.

Applications of Co-Word Analysis

Bibliometric and Scientometric Studies: Identifying research trends and thematic structures in academic literature.
Topic Detection in Large Text Collections: Extracting underlying themes from newspapers, reports, or social media content.
Keyword Network Exploration: Understanding how keywords interconnect and contribute to discourse formation.
Patent and Innovation Analysis: Revealing technological trends by examining term co-occurrence in patent databases.
Social Media and Sentiment Analysis: Discovering key discussion topics within online platforms.

Advantages of Co-Word Analysis in TALL

Unsupervised Approach: Extracts thematic clusters without requiring predefined categories.
Graph-Based Representation: Provides an intuitive visualization of textual structures.
Scalable to Large Text Corpora: Efficiently handles extensive document collections.
Integration with Other Analytical Techniques: Can be combined with Correspondence Analysis, Topic Modeling, and Sentiment Analysis for richer insights.

References

Callon, M., Courtial, J.-P., Turner, W. A., & Bauin, S. (1983). From translations to problematic networks: An introduction to co-word analysis. Social Science Information, 22(2), 191-235.

Eck, N. J. V., & Waltman, L. (2009). How to normalize co-occurrence data? An analysis of some well‐known similarity measures. Journal of the American Society for Information Science and Technology, 60(8), 1635-1651.

Fortunato, S., & Hric, D. (2016). Community detection in networks: A user guide. Physics Reports, 659, 1-44.

Pons, P., & Latapy, M. (2006). Computing communities in large networks using random walks. Retrieved from arXiv:physics/0512106.

Thematic Map

The Thematic Map feature in TALL enables users to explore the conceptual structure of a text corpus by visually mapping the most relevant topics. It is based on an unsupervised, network-based method designed to extract, cluster, and characterize groups of words representing distinct semantic areas within the analyzed texts. This approach has been successfully applied in bibliometric research and adapted in TALL for general-purpose text analysis.

Methodological Framework

Thematic mapping starts with the construction of a co-occurrence matrix from the pre-processed text corpus. The association strength between terms is then calculated to normalize the raw co-occurrence frequencies:

AS_{jj'} = \frac{a_{jj'}}{a_{jj} \cdot a_{j'j'}}

where AS_jj' is the association strength between terms j and j', a_jj' is their observed co-occurrence, and a_jj and a_j'j' are the occurrence counts of the two terms. This metric expresses the semantic relatedness of term pairs.

A community detection algorithm (WalkTrap) is then applied to the normalized network to identify clusters of terms (i.e., topics). Each cluster is projected onto a two-dimensional plane using two dimensions:

Callon Centrality (CC): measures a topic’s interaction with others, indicating its relevance in the corpus.
Callon Density (CD): measures the internal cohesion of the topic, reflecting its development.

Each topic is placed on a strategic diagram based on its centrality and density values:

Upper-right (Hot Topics): High centrality and high density – well-developed and important.
Lower-right (Basic Topics): High centrality and low density – important but still under development.
Upper-left (Niche Topics): Low centrality and high density – well developed but marginal.
Lower-left (Peripheral Topics): Low centrality and low density – weakly developed and marginal.
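
The quadrant assignment can be illustrated with a short sketch. The text above does not state where the boundary between "high" and "low" lies, so this example assumes the median centrality and median density across topics as cut-offs; the function name `classify_topics` and the topic values are hypothetical.

```python
from statistics import median


def classify_topics(topics):
    """Map each topic to a quadrant of the strategic diagram.

    `topics` maps a topic label to a (centrality, density) pair.
    Assumption: the medians of the two measures act as cut-offs.
    """
    c_cut = median(c for c, _ in topics.values())
    d_cut = median(d for _, d in topics.values())
    labels = {}
    for name, (c, d) in topics.items():
        if c >= c_cut and d >= d_cut:
            labels[name] = "Hot Topic"
        elif c >= c_cut:
            labels[name] = "Basic Topic"
        elif d >= d_cut:
            labels[name] = "Niche Topic"
        else:
            labels[name] = "Peripheral Topic"
    return labels


# Hypothetical (centrality, density) values for four topics
topics = {
    "vaccine": (0.9, 0.8),
    "lockdown": (0.8, 0.2),
    "sport": (0.1, 0.7),
    "weather": (0.2, 0.1),
}
quadrants = classify_topics(topics)
print(quadrants)
```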

Features in TALL

Users can generate thematic maps from any textual dataset preprocessed and tokenized in TALL.
The algorithm works automatically and does not require setting the number of topics in advance.
Topics are labeled by the most frequent keywords within each cluster.
Topic size (i.e., the size of the bubble) represents the number of terms in the cluster.
The user can select specific time slices or metadata filters to perform comparative thematic analysis across groups or periods.

Thematic maps offer a rich, interpretable representation of discourse structure and are particularly effective for exploratory text mining and culturomic studies.

References

Aria, M., Cuccurullo, C., D’Aniello, L., Misuraca, M., & Spano, M. (2022). Thematic Analysis as a New Culturomic Tool: The Social Media Coverage on COVID-19 Pandemic in Italy. Sustainability, 14(6), 3643. https://doi.org/10.3390/su14063643

Cobo, M.J., López-Herrera, A.G., Herrera-Viedma, E., & Herrera, F. (2011). An approach for detecting, quantifying, and visualising the evolution of a research field: A practical application to the fuzzy sets theory field. Journal of Informetrics, 5(1), 146–166.

Training Word Embeddings in TALL

The Training module in TALL enables users to generate custom word embeddings from their own corpus using the word2vec algorithm, which includes both the Continuous Bag-of-Words (CBOW) and Skip-gram architectures. These models create dense vector representations that capture semantic and syntactic relationships among words based on their distributional context.

Available Architectures

CBOW: Predicts a word from its surrounding context. It is faster and works well with frequent words.
Skip-gram: Predicts surrounding context words from a target word. It is slower but performs better with infrequent words.
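
The difference between the two architectures is easiest to see in the training examples each one derives from a sentence. The sketch below only builds those examples (it does not train a model); the function name `training_pairs` and the window size are illustrative assumptions.

```python
def training_pairs(tokens, window=2):
    """Build CBOW and Skip-gram training examples from one sentence."""
    cbow, skipgram = [], []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        if not context:
            continue
        # CBOW: the whole context predicts the target word
        cbow.append((context, target))
        # Skip-gram: the target word predicts each context word separately
        skipgram.extend((target, c) for c in context)
    return cbow, skipgram


sentence = ["fast", "delivery", "great", "price"]
cbow, skipgram = training_pairs(sentence, window=1)
print(cbow[1])        # context words paired with the target "delivery"
print(len(skipgram))  # one example per (target, context word) pair
```

Skip-gram produces more training examples per sentence, which is consistent with it being slower but more informative for infrequent words.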

How It Works

Text data is lemmatized and filtered to exclude non-informative tokens (e.g., punctuation, auxiliaries, determiners).
Training is performed at the sentence level to preserve local context.
Stopwords are automatically identified and excluded.
Parameters such as dimensionality, number of iterations, and architecture (CBOW/Skip-gram) can be configured.

Outputs

Word embedding matrix.
Descriptive statistics for each vector dimension (mean, SD, skewness, kurtosis).
PCA analysis to evaluate variance explained by each component.
Cosine similarity and Euclidean distance metrics for quality assessment.

Example

Training a word2vec model on a corpus of product reviews may reveal that terms like “delivery” and “shipping” appear close in vector space, indicating their semantic similarity within that context.

References

Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781
Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013). Distributed Representations of Words and Phrases and their Compositionality. arXiv:1310.4546

Model Training

Options:

Embedding Method

Dimensions

Iterations

Embedding Similarity

Word Similarity Network in TALL

The Similarity module in TALL allows users to explore semantic relationships between words through an interactive similarity network generated from word embeddings trained in the Training tab. These embeddings are built using the word2vec algorithm (either CBOW or Skip-gram).

How It Works

TALL selects the top 100 most frequent content words in the corpus (restricted to POS: NOUN, PROPN, ADJ).
For each of these 100 terms, the system computes the 10 most similar words based on cosine similarity in the embedding space.
The resulting network is composed of:
- Nodes: the 100 target words (triangles) and their similar terms (dots).
- Edges: connections representing semantic similarity scores (cosine similarity ≥ 0.5), with width proportional to similarity.
The network also undergoes community detection using the Walktrap algorithm to highlight thematic clusters.
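
The edge-building step can be sketched as follows. This is an illustration under stated assumptions, not TALL's code: the embeddings are a toy two-dimensional dictionary, and `similarity_edges` is a hypothetical helper whose `top_n` and `threshold` parameters mirror the values described above (10 neighbours, cosine similarity ≥ 0.5).

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_edges(embeddings, targets, top_n=10, threshold=0.5):
    """Link each target word to its most similar words above a threshold."""
    edges = []
    for t in targets:
        scored = [
            (cosine(embeddings[t], vec), w)
            for w, vec in embeddings.items()
            if w != t
        ]
        scored.sort(reverse=True)  # most similar first
        for sim, w in scored[:top_n]:
            if sim >= threshold:
                edges.append((t, w, round(sim, 3)))
    return edges


# Toy embeddings (hypothetical values)
embeddings = {
    "method": [1.0, 0.1],
    "approach": [0.9, 0.2],
    "algorithm": [0.8, 0.3],
    "banana": [-0.2, 1.0],
}
edges = similarity_edges(embeddings, targets=["method"], top_n=2)
print(edges)
```

The third element of each edge is the similarity score, which could drive edge width in a visualization.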

Visualization Tools

UMAP projection: two-dimensional semantic mapping of all words in the embedding matrix.
Overlap reduction: improves readability by adjusting label positions and opacity in dense areas.
Interactive display: with zoom, node highlighting, and draggable layout via visNetwork.

Example

After training on a corpus of scientific publications, the similarity network might display “method”, “approach”, and “model” as top frequent terms, each connected to semantically related concepts such as “algorithm”, “technique”, or “framework”.

References

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems (NeurIPS 2013), 26, 3111–3119.
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781

Grako

Topic Modeling: Optimal selection of topic number

Topic Modeling in TALL: K Selection

Topic modeling is a fundamental technique in unsupervised text mining, allowing users to uncover latent themes within large collections of documents. One of the key challenges in Latent Dirichlet Allocation (LDA) and other topic modeling techniques is determining the optimal number of topics (K).

TALL estimates K automatically using well-established statistical measures (Deveaud et al., 2014; Cao et al., 2009; Arun et al., 2010), including Perplexity.
However, users can also manually adjust K and explore different solutions in the Model Estimation Menu, enabling greater flexibility based on the dataset and research objectives.

Why is K Selection Important?

A K that is too small may merge distinct topics, reducing the model's ability to separate different thematic structures.
A K that is too large may fragment coherent topics, introducing unnecessary complexity and reducing interpretability.
The correct K ensures that topics are coherent, interpretable, and representative of the dataset.

Automatic K Estimation in TALL

TALL integrates several standard measures for determining the optimal number of topics in LDA:

Blei et al. (2003) – Perplexity Measure

- Perplexity is a likelihood-based metric that measures how well a model generalizes to unseen data.

- It evaluates the model's ability to predict a held-out test set, with lower values indicating better performance.

- Perplexity is defined as the inverse of the geometric mean per-word likelihood, computed over the test corpus.
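
A small worked example may help fix the definition. Taking the inverse of the geometric mean of per-word likelihoods is equivalent to exponentiating the negative mean log-likelihood, which is what the sketch below computes. The function name and the held-out values are hypothetical.

```python
import math


def perplexity(word_log_likelihoods):
    """exp(-(1/N) * sum of per-word log-likelihoods) over a held-out set."""
    n = len(word_log_likelihoods)
    return math.exp(-sum(word_log_likelihoods) / n)


# A model that assigns probability 0.25 to every held-out word is as
# uncertain as a uniform choice among 4 words.
held_out = [math.log(0.25)] * 10
print(perplexity(held_out))
```

Lower values mean that the model assigns higher probability to the held-out words.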

Cao et al. (2009) – Topic Coherence Measure

- Computes the average pairwise similarity between topics based on word distributions.

- The optimal K is found when inter-topic similarity is minimized, ensuring that topics are well-separated.
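
The idea of average pairwise similarity between topics can be sketched as below. This is a simplified illustration of the description above, not the exact statistic defined by Cao et al. (2009); topics are assumed to be given as word-probability vectors, and the two toy models are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two topic-word distributions."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def average_topic_similarity(topic_word):
    """Mean cosine similarity over all pairs of topics (lower is better)."""
    pairs = list(combinations(topic_word, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Two hypothetical 3-topic models over a 3-word vocabulary
separated = [[0.9, 0.1, 0.0], [0.0, 0.1, 0.9], [0.1, 0.8, 0.1]]
overlapping = [[0.4, 0.3, 0.3], [0.3, 0.4, 0.3], [0.3, 0.3, 0.4]]
print(average_topic_similarity(separated))
print(average_topic_similarity(overlapping))
```

In a K search, this score would be computed for each candidate K and the value with the lowest average similarity preferred.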

Arun et al. (2010) – KL Divergence-Based Measure

- Compares the word-topic distribution and document-topic distribution using Kullback-Leibler (KL) divergence.

- The optimal K is identified as the point where KL divergence stabilizes, meaning topics balance between coherence and specificity.

Deveaud et al. (2014) – A Hybrid Approach

- A refinement of previous approaches that balances topic coherence and diversity.

- The optimal K is chosen where topic distinctiveness is maximized while preserving thematic coverage.

Manual K Adjustment for Customization

While automatic estimation provides a strong baseline, users may need to adjust K manually based on domain knowledge and interpretability:

For exploratory research: Start with low K values (e.g., 5–20 topics) to gain an overview of broad themes.
For fine-grained analysis: Use higher K values (e.g., 30–100 topics) to capture more nuanced subtopics.
For benchmarking: Compare different K values using topic coherence scores and human interpretability.

References

Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003) Latent Dirichlet Allocation. Journal of Machine Learning Research, 3, 993–1022.

Deveaud, R., Sanjuan, E., & Bellot, P. (2014) Accurate and effective latent concept modeling for ad hoc information retrieval. Document Numérique, 17, 61–84.

Cao, J., Xia, T., Li, J., Zhang, Y., & Tang, S. (2009) A density-based method for adaptive LDA model selection. Neurocomputing, 72(7), 1775–1781.

Arun, R., Suresh, V., Veni Madhavan, C.E., & Narasimha Murthy, M.N. (2010) On finding the natural number of topics with latent Dirichlet allocation: Some observations. In Zaki, M.J., Yu, J.X., Ravindran, B., & Pudi, V. (Eds.), Advances in Knowledge Discovery and Data Mining (pp. 391–402). Berlin, Heidelberg: Springer.

Find optimal K

Topics

Metric for model tuning

N. of terms

Terms selection by:

K min

K max

K by:

Topic Modeling: Model estimation

Topic Modeling in TALL: Model Estimation

Topic modeling is a family of generative statistical models designed to uncover semantic structures within large document collections. These models aim to identify latent topics that explain the observed word distributions in text corpora, allowing for a low-dimensional representation of textual data.

Through probabilistic modeling, topic modeling enables:

Discovery of underlying themes within a collection of documents.
Assignment of probabilistic membership scores to documents, indicating their association with different topics.
Dimensionality reduction, making it easier to analyze large text datasets by structuring them into meaningful clusters.
Human interpretability, as each topic is characterized by a set of highly associated terms, making it easier for users to extract insights.

Latent Dirichlet Allocation (LDA) in TALL

TALL implements the Latent Dirichlet Allocation (LDA) algorithm (Blei et al., 2003), one of the most widely used topic modeling techniques. LDA is a Bayesian probabilistic model that assumes:

Each document is a mixture of multiple topics, with different proportions.
Each topic is defined by a probability distribution over words, meaning that some words are more strongly associated with a given topic.
The goal of LDA is to infer these hidden topic distributions, making it possible to automatically organize, summarize, and analyze large textual datasets.
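
The two assumptions can be made concrete by simulating the generative process that LDA posits. The sketch below generates a document from known topics; it does not perform inference, which runs in the opposite direction. All names, the vocabulary, and the parameter values are illustrative assumptions.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution via normalized gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topics, vocab, n_words, alpha, rng):
    """Generate one document as a mixture of topics."""
    # Document-level topic proportions
    theta = sample_dirichlet([alpha] * len(topics), rng)
    words = []
    for _ in range(n_words):
        # Pick a topic for this word, then a word from that topic
        k = rng.choices(range(len(topics)), weights=theta)[0]
        words.append(rng.choices(vocab, weights=topics[k])[0])
    return theta, words


rng = random.Random(42)
vocab = ["goal", "match", "vote", "party"]
topics = [
    [0.5, 0.5, 0.0, 0.0],  # a "sport" topic
    [0.0, 0.0, 0.5, 0.5],  # a "politics" topic
]
theta, doc = generate_document(topics, vocab, n_words=8, alpha=0.5, rng=rng)
print(theta, doc)
```

Fitting LDA amounts to recovering `theta` and `topics` when only the words are observed.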

LDA operates by:

Assigning each word in a document to a latent topic, estimating topic-word distributions.
Iteratively adjusting topic probabilities to maximize likelihood, ensuring that words are grouped into meaningful semantic structures.
Producing a document-topic matrix, where each document is represented as a probability distribution over the identified topics.

Advantages of Topic Modeling in TALL

Unsupervised Learning – No prior labeling is required; topics emerge naturally from the dataset.
Scalability – LDA efficiently handles large text corpora, making it useful for applications ranging from scientific literature to customer reviews.
Flexibility – Users can define K (number of topics) manually or use automatic estimation techniques (see the K Selection Menu).
Enhanced Text Understanding – Topics provide a thematic summary of a collection, improving text exploration and classification.

By integrating state-of-the-art topic modeling techniques, TALL enables researchers and analysts to discover hidden structures in textual data, making it an essential tool for content analysis, knowledge extraction, and thematic clustering.

References

Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3(Jan), 993-1022.

Polarity Detection

Polarity Detection in TALL

Polarity detection is a fundamental sentiment analysis technique used to determine whether a document expresses a positive, negative, or neutral sentiment. This process is essential in analyzing consumer feedback, financial reports, product reviews, and social media discussions, where understanding sentiment trends can provide valuable insights into public opinion and decision-making processes.

How Polarity Detection Works in TALL

TALL calculates document polarity using a lexicon-based approach, incorporating contextual adjustments to refine sentiment scoring. The methodology follows three key steps:

1. Lexicon-Based Sentiment Scoring

Each word in the text is assigned a polarity score based on its presence in sentiment lexicons.
Positive words (e.g., 'excellent,' 'happy') are assigned +1, while negative words (e.g., 'bad,' 'fail') receive -1.
Words not found in sentiment lexicons are considered neutral and assigned a score of 0.

2. Contextual Modifications Using Valence Shifters

Negators: Words like “not,” “never,” or “no” invert the polarity of a nearby sentiment word (e.g., 'not happy' changes from +1 to -1).
Amplifiers: Words such as 'very,' 'extremely,' and 'highly' increase the intensity of a sentiment (e.g., 'very good' is weighted more than 'good').
De-amplifiers (Diminishers): Terms like 'slightly' or 'somewhat' reduce sentiment intensity (e.g., 'slightly disappointing' has a weaker negative score than 'disappointing').

3. Aggregation and Normalization

Sentiment scores are summed across the document to obtain an overall polarity score.
An optional normalization step scales the final score within the [-1, 1] range, ensuring comparability across different text lengths.
Documents with scores near 0 are classified as neutral, indicating a balanced mix of sentiment or the absence of strong emotions.
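
The three steps can be sketched as follows. This is a minimal illustration under several assumptions: the word lists are tiny hypothetical lexicons, valence shifters are looked up in the two preceding tokens, the shifter weight of 0.8 is arbitrary, and normalization uses a squashing function of the form x / sqrt(x² + α), which is one common choice and not necessarily the one TALL applies.

```python
import math

# Tiny hypothetical lexicons for illustration only
POSITIVE = {"excellent", "happy", "good"}
NEGATIVE = {"bad", "fail", "disappointing"}
NEGATORS = {"not", "never", "no"}
AMPLIFIERS = {"very", "extremely", "highly"}
DEAMPLIFIERS = {"slightly", "somewhat"}


def polarity(tokens, weight=0.8, normalize=True, alpha=15.0):
    """Lexicon-based polarity with valence shifters."""
    total = 0.0
    for i, tok in enumerate(tokens):
        # Step 1: lexicon lookup (+1, -1, or 0)
        score = 1.0 if tok in POSITIVE else -1.0 if tok in NEGATIVE else 0.0
        if score == 0.0:
            continue
        # Step 2: adjust using shifters among the two preceding tokens
        for prev in tokens[max(0, i - 2):i]:
            if prev in NEGATORS:
                score = -score
            elif prev in AMPLIFIERS:
                score *= 1 + weight
            elif prev in DEAMPLIFIERS:
                score *= 1 - weight
        # Step 3a: aggregate
        total += score
    # Step 3b: optional squashing into the open interval (-1, 1)
    if normalize:
        total = total / math.sqrt(total * total + alpha)
    return total


print(polarity(["not", "happy"], normalize=False))
print(polarity(["very", "good"], normalize=False))
print(polarity(["the", "service", "was", "very", "good"]))
```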

Sentiment Lexicons Used in TALL

1. Hu and Liu (2004) - Opinion Lexicon

Designed for analyzing consumer reviews, categorizing words into positive and negative classes.
Particularly useful for e-commerce platforms, review aggregation sites, and user-generated feedback.
Language: English

2. Loughran and McDonald (2016) - Financial Sentiment Dictionary

Developed for financial and accounting texts, including categories such as “positive,” “negative,” “uncertainty,” “litigious,” and “constraining”.
Widely used in financial risk assessment, investor sentiment analysis, and stock market forecasting.
Language: English

3. NRC Emotion Lexicon (Mohammad & Turney, 2010)

Captures emotions beyond basic polarity, categorizing words into eight primary emotions: Joy, Sadness, Anger, Fear, Surprise, Disgust, Trust, and Anticipation.
Useful for social media mining, psychological studies, and literary analysis.
Language: Multilingual

References

Hu, M., & Liu, B. (2004). Mining and summarizing customer reviews. Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’04, 168-177. New York, NY, USA: Association for Computing Machinery.

Loughran, T., & McDonald, B. (2016). Textual analysis in accounting and finance: A survey. Journal of Accounting Research, 54(4), 1187-1230.

Mohammad, S., & Turney, P. (2010). Emotions evoked by common words and phrases: Using Mechanical Turk to create an emotion lexicon. In Proceedings of the NAACL HLT 2010 Workshop on Computational Approaches to Analysis and Generation of Emotion in Text, 26-34. Los Angeles, CA: Association for Computational Linguistics.

Abstractive Summarization

Summarization in TALL

Summarization is a key technique in text analysis that allows users to extract the most relevant information from a document while maintaining its core meaning.
TALL implements extractive summarization, a method that selects and reorders the most important sentences directly from the original text to generate a coherent, condensed version of the content.

Unlike abstractive summarization, which rephrases content using deep learning models, extractive summarization ensures that the summary remains factually consistent with the input document, making it a reliable method for automated text compression.

How Summarization Works in TALL

1. Sentence Tokenization and Preprocessing

The text is split into individual sentences to form the basis of the summarization process.
Sentences are preprocessed, removing unnecessary punctuation and stopwords to enhance semantic clarity.

2. Graph Construction Using Sentence Similarity

A graph-based representation of the document is created, where:

Nodes represent sentences.
Edges connect sentences based on their semantic similarity (measured using cosine similarity or word overlap).

Sentences that share a high degree of lexical similarity are considered strongly connected in the graph.

3. Application of TextRank Algorithm

The TextRank algorithm assigns an importance score to each sentence based on its connectivity within the graph.
Sentences with the highest PageRank scores are deemed the most representative of the overall document.

4. Sentence Selection and Ordering

The top-ranked sentences are selected for the summary.
A reordering step ensures that sentences are presented in a logical and coherent structure, preserving the original document’s flow.
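
The four steps can be sketched in plain Python. This is a simplified illustration rather than TALL's routine: it uses the word-overlap similarity from the original TextRank paper, a fixed number of PageRank iterations, and no stopword removal; the function names and example sentences are hypothetical.

```python
import math
import re


def tokenize(sentence):
    """Step 1: lowercase the sentence and keep its distinct words."""
    return set(re.findall(r"[a-z]+", sentence.lower()))


def similarity(a, b):
    """Step 2: word overlap normalized by the log of sentence lengths."""
    if not a or not b:
        return 0.0
    denom = math.log(len(a)) + math.log(len(b))
    return len(a & b) / denom if denom > 0 else 0.0


def textrank_summary(sentences, n=2, damping=0.85, iterations=50):
    bags = [tokenize(s) for s in sentences]
    size = len(sentences)
    weights = [
        [similarity(bags[i], bags[j]) if i != j else 0.0 for j in range(size)]
        for i in range(size)
    ]
    out_sum = [sum(row) for row in weights]
    # Step 3: weighted PageRank by repeated score updates
    scores = [1.0] * size
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(size) if out_sum[j] > 0
            )
            for i in range(size)
        ]
    # Step 4: keep the top-n sentences and restore document order
    top = sorted(range(size), key=lambda i: scores[i], reverse=True)[:n]
    return [sentences[i] for i in sorted(top)]


sentences = [
    "TALL imports text files and tabular data.",
    "TALL analyzes text with topic models and networks.",
    "The weather was sunny yesterday.",
    "Topic models summarize text collections in TALL.",
]
summary = textrank_summary(sentences, n=2)
print(summary)
```

The sentence about the weather shares no words with the others, so it receives only the baseline score and is left out of the summary.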

Advantages of Summarization in TALL

Extractive and Factually Consistent – Ensures that summaries are directly sourced from the original text, reducing the risk of hallucinations or misinterpretations.
Graph-Based Ranking for Objective Selection – Uses TextRank, an unsupervised method that ranks sentences based on their connectivity in the similarity graph rather than on manual judgment.
Efficient and Scalable – Processes large documents quickly, making it ideal for summarizing research papers, news articles, legal documents, and reviews.
No Need for Pre-Trained Models – Unlike abstractive methods that require deep learning models, extractive summarization works effectively on any text without additional training.
Customizable Summary Length – Users can adjust the number of extracted sentences to control the level of detail in the summary.

Implementation of Summarization in TALL

TALL’s summarization routines are built upon the TextRank algorithm, with optimizations for handling preprocessed and structured corpora:

Customized Text Preprocessing – The system operates on tokenized, lemmatized, and PoS-tagged corpora, ensuring better sentence representation.
Sentence Similarity Based on Multiple Metrics – Supports TF-IDF, cosine similarity, and word embeddings for improved ranking.
Multi-Document Summarization (Future Work) – The framework is being expanded to support multi-document summarization, allowing users to extract summaries from multiple related texts.

By integrating unsupervised graph-based techniques, TALL provides users with a robust and efficient summarization tool, ideal for academic, business, and legal applications.

References

Mihalcea, R., & Tarau, P. (2004). TextRank: Bringing order into text. Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, 404-411, Barcelona, Spain, July. Association for Computational Linguistics.

Page, L., Brin, S., Motwani, R., & Winograd, T. (1998). The PageRank Citation Ranking: Bringing Order to the Web. Technical report, Stanford Digital Library Technologies Project.

Abstractive Summarization

TALL generates a coherent and concise summary by interpreting and paraphrasing the main ideas from the original text

Summary Length (in words)

Extractive Summarization

Extractive Summarization

TALL selects and reorders the most relevant sentences from the original text to generate a coherent and concise summary

Report

Select results to include in the Report

Settings

Working Folder

Select a folder where the analysis outputs will be saved

Language Models

'Tall AI' Api Key

Set a valid API Key to use 'Tall AI' features powered by Google Gemini.

If you don’t have one yet, you can generate it by logging into AI Studio with your Google account and creating a new API Key.

Enter your Gemini API Key:

Donation

Massimo Aria

Corrado Cuccurullo

Maria Spano

Luca D'Aniello

Michelangelo Misuraca

K-Synth

Github

Import texts

Importing Data in TALL

Supported File Formats

1. Plain Text Files (.txt)

2. Tabular Data (.csv, .xlsx)

3. PDF Documents (.pdf)

4. Biblioshiny Export Files

TALL Structured Files (.tall)

References

Split Corpus

Split texts

Splitting the Corpus in TALL

How It Works

Example Use Cases

Random Selection

Random Text Selection

Random Text Selection in TALL

How It Works

Example Use Cases

External Information

Add from a file

To import external information, please make sure that the file to be uploaded is in Excel format and contains a column labeled 'doc_id' that identifies the documents associated with the imported text(s).

You can download the list of doc_id associated with the imported text files below.

Importing External Information in TALL

How to Import External Data

Using External Information

Download Document Identifiers

Tokenization & PoS Tagging

Language Model

Language Model

Tokenization, Lemmatization, and PoS Tagging in TALL

Powered by UDPipe for NLP Preprocessing

Updated Pre-trained Language Models

Applications in NLP and Text Analysis

References

Tagging Special Entities

Tagging Special Entities in TALL

Detected Special Entities

Why Special Entity Tagging Matters?

Special Entities

When processing text, special tags will be assigned to certain detected entities.

These include:

• Email addresses: example@domain.com

• URLs: https://www.example.com/path

• Emojis: 😊, 🚀, ❤️

• Hashtags: #ExampleTag

• IP addresses: 192.168.1.1

• Mentions: @username

This ensures that these elements are identified and marked for further analysis within the text.
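
A regular-expression sketch shows how such entities might be detected. The patterns and tag names below are simplified assumptions for illustration, not the rules TALL applies, and emojis are omitted because matching them reliably requires Unicode property tables.

```python
import re

# Order matters: emails are tried before mentions so that the "@" inside
# an address is not tagged as a mention.
PATTERNS = [
    ("EMAIL", r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    ("URL", r"https?://\S+"),
    ("IP_ADDRESS", r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    ("HASHTAG", r"#\w+"),
    ("MENTION", r"@\w+"),
]
TAGGER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in PATTERNS))


def tag_entities(text):
    """Return (tag, matched text) pairs in order of appearance."""
    return [(m.lastgroup, m.group()) for m in TAGGER.finditer(text)]


text = (
    "Write to example@domain.com or @username, "
    "see https://www.example.com/path #ExampleTag from 192.168.1.1"
)
entities = tag_entities(text)
print(entities)
```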

Custom Term List Loading and Merging

Custom Term List in TALL

Why Use a Custom Term List?

How to Import a Custom Term List

Example of Custom Term List Format

Import Custom Term List

Please ensure that the Custom Term List is formatted as an Excel file with two columns. In the first column, include the desired terms. In the second column, provide the PoS tag associated with each term.

Multi-Word Creation

Algorithms for Automatic Multi-Word Extraction

- Rapid Automatic Keyword Extraction (RAKE)

- Pointwise Mutual Information (PMI)

- Mutual Dependency (MD)

- Log-Frequency Biased Mutual Dependency (LF-MD)
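
As an illustration of one of the listed measures, the sketch below scores adjacent word pairs with Pointwise Mutual Information, PMI(x, y) = log2(p(x, y) / (p(x) p(y))). It is a generic textbook formulation; the function name, the `min_count` filter, and the toy text are assumptions and do not describe TALL's implementation.

```python
import math
from collections import Counter


def bigram_pmi(tokens, min_count=2):
    """Score adjacent word pairs by Pointwise Mutual Information."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable PMI estimates
        p_xy = count / n_bi
        p_x, p_y = unigrams[w1] / n_uni, unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores


tokens = (
    "machine learning helps text mining and "
    "machine learning helps topic modeling"
).split()
scores = bigram_pmi(tokens)
print(scores)
```

Pairs with a high positive score co-occur more often than their individual frequencies would suggest, which makes them candidates for multi-word expressions.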

Automatic Multi-Words

Multi-word creation extracts keywords (sequences of terms) from the text.

After keywords are generated, select those you wish to include in your data from the list.

Multi-Words created by:

Multi-Word Creation by a List

Multi-Word Creation by a List in TALL

How to Import a Multi-Word List

Why Use Multi-Word Expressions?

Import a Multi-Word List

Please ensure that the Multi-Word List is formatted as an Excel/CSV file with one column. Each cell of that column must contain one multi-word, with its terms separated by a single whitespace.
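
To show how such a list might be applied to a tokenized text, the sketch below merges matching token sequences into single tokens. It is a hypothetical illustration: the function name, the longest-match-first strategy, and the underscore joiner are assumptions, not a description of TALL's behavior.

```python
def merge_multiwords(tokens, multiwords, joiner="_"):
    """Replace token sequences that match a listed multi-word with one token."""
    # Try longer expressions first so "new york city" wins over "new york"
    patterns = sorted(
        (mw.split(" ") for mw in multiwords), key=len, reverse=True
    )
    merged, i = [], 0
    while i < len(tokens):
        for pat in patterns:
            if tokens[i:i + len(pat)] == pat:
                merged.append(joiner.join(pat))
                i += len(pat)
                break
        else:
            # No multi-word starts here: keep the token unchanged
            merged.append(tokens[i])
            i += 1
    return merged


tokens = ["new", "york", "city", "is", "big"]
result = merge_multiwords(tokens, ["new york", "new york city"])
print(result)
```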