Author: Gabriel Demetrios Lafis
Date: June 3, 2025
Course: Data Science Capstone Project
This report presents the results of an exploratory data analysis conducted on text data as the foundation for developing a predictive text algorithm. The primary objective was to understand the fundamental characteristics of the text corpus, identify word frequency patterns and n-grams, and establish the groundwork for building an efficient predictive model.
The analysis revealed important insights about the structure and distribution of the textual data, including the identification of frequency patterns that follow Zipf’s Law, determination of vocabulary coverage needed for different precision levels, and characterization of the linguistic complexity of the corpus. These results provide clear guidelines for developing the predictive algorithm and subsequent Shiny application.
Natural language processing and text prediction represent critical areas of modern data science, with applications ranging from virtual assistants to autocomplete systems on mobile devices. Developing an effective predictive model requires a deep understanding of the statistical and linguistic characteristics of the textual data that will be used for training.
This exploratory analysis had the following main objectives:
The analyzed corpus consists of 110 text documents containing content related to data science, machine learning, and artificial intelligence. The texts include material in both Portuguese and English, reflecting the multilingual nature common in modern technical contexts.
The preparation process included:
- Text tokenization and normalization
- Punctuation removal and conversion to lowercase
- Language identification and content separation
- Calculation of basic text complexity metrics
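The tokenization and normalization steps can be illustrated with a minimal sketch. The report used NLTK; the regular expression and the `tokenize` helper below are assumptions made for illustration (standard library only), not the code behind the reported figures.

```python
import re

# Runs of letters, including accented ones; digits, underscores and
# punctuation act as separators and are dropped.
TOKEN_RE = re.compile(r"[^\W\d_]+")

def tokenize(text):
    """Lowercase the text and return its alphabetic tokens."""
    return TOKEN_RE.findall(text.lower())

# Hypothetical example documents, not taken from the corpus.
docs = [
    "Machine learning é uma área da ciência de dados.",
    "Deep Learning, in short!",
]
tokenized_docs = [tokenize(d) for d in docs]
print(tokenized_docs)
```

Keeping accented letters inside tokens matters here because the language identification step relies on them.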
The analysis was conducted using Python with specialized natural language processing libraries (NLTK) and statistical analysis tools (pandas, numpy). Visualizations were created with matplotlib and seaborn to ensure clarity in results presentation.
The analysis revealed a compact corpus whose characteristics make it suitable for developing predictive models. It contains 1,221 total words (tokens) and 254 unique words (types), giving a type/token ratio of 0.208 (254 / 1,221). This ratio indicates a moderate level of vocabulary diversity, appropriate for model training without excessive sparsity.
Documents average 11.1 words in length, ranging from 6 to 20 words. This narrow range simplifies processing and reduces the computational cost of the predictive model.
Key Statistics:
- Total documents: 110
- Total words (tokens): 1,221
- Unique words: 254
- Type/token ratio: 0.208
- Average document length: 11.1 words
- Document length range: 6-20 words
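These statistics follow directly from the token counts. The sketch below shows one way to compute them from already tokenized documents; `corpus_stats` is a hypothetical helper, and the last two lines only re-derive the ratios from the totals quoted above.

```python
def corpus_stats(tokenized_docs):
    """Basic size and diversity metrics for a list of token lists."""
    lengths = [len(doc) for doc in tokenized_docs]
    total = sum(lengths)
    vocab = {tok for doc in tokenized_docs for tok in doc}
    return {
        "documents": len(tokenized_docs),
        "tokens": total,
        "types": len(vocab),
        "type_token_ratio": len(vocab) / total,
        "mean_length": total / len(tokenized_docs),
        "min_length": min(lengths),
        "max_length": max(lengths),
    }

# Consistency check on the reported totals.
print(round(254 / 1221, 3))  # 0.208
print(round(1221 / 110, 1))  # 11.1
```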
The frequency analysis revealed patterns consistent with Zipf’s Law, where a small number of words dominate the frequency distribution. The most frequent words include:
This distribution is typical of natural language texts and indicates that the corpus maintains authentic linguistic characteristics. The predominance of prepositions and articles in the top positions is expected and beneficial for prediction models, as these function words provide predictable syntactic structure.
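Zipf's Law predicts that frequency is roughly proportional to 1/rank, so a plot of log frequency against log rank should be close to a straight line with slope near -1. A minimal way to check this is a least-squares fit on the rank-frequency list. The helper name `zipf_slope` is an assumption, and the toy token list is constructed to be perfectly Zipfian; it is not the corpus data.

```python
import math
from collections import Counter

def zipf_slope(tokens):
    """Least-squares slope of log(frequency) against log(rank).

    Needs at least two distinct words, otherwise the fit is undefined.
    """
    freqs = sorted(Counter(tokens).values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Toy data with frequencies 12, 6, 4, 3, i.e. exactly 12 / rank.
toy_tokens = ["a"] * 12 + ["b"] * 6 + ["c"] * 4 + ["d"] * 3
print(zipf_slope(toy_tokens))
```

On real text the slope is only approximately -1, and on a corpus this small the fit will be noisy.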
One of the most significant findings of the analysis relates to vocabulary coverage efficiency:
These results have important implications for predictive model design. The high concentration of frequency in a relatively small subset of the vocabulary suggests that an efficient model can be built focusing on the most frequent words, with specific strategies to handle rare or unobserved words.
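Vocabulary coverage can be computed by sorting words by frequency and accumulating counts until a target share of all tokens is reached. The sketch below assumes a flat list of tokens; the function name and the toy data are illustrative only.

```python
from collections import Counter

def words_needed_for_coverage(tokens, target):
    """Smallest number of most-frequent words covering `target` of all tokens."""
    counts = sorted(Counter(tokens).values(), reverse=True)
    total = sum(counts)
    running = 0
    for n_words, count in enumerate(counts, start=1):
        running += count
        if running / total >= target:
            return n_words
    return len(counts)

# Toy data: 10 tokens with frequencies 5, 3, 1, 1.
toy_tokens = ["a"] * 5 + ["b"] * 3 + ["c", "d"]
for target in (0.5, 0.8, 1.0):
    print(target, words_needed_for_coverage(toy_tokens, target))
```

The same function applied at several targets gives the vocabulary size a model would need at each coverage level.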
The n-gram analysis revealed important structural patterns:
Bigrams (2-grams):
- Total: 1,111 bigrams with 347 unique combinations
- Most frequent: “de dados” (15 occurrences), “é uma” (12 occurrences), “machine learning” (12 occurrences)
Trigrams (3-grams):
- Total: 1,001 trigrams with 343 unique combinations
- Most frequent: “o processamento de” (6 occurrences), “processamento de linguagem” (6 occurrences)
Individual n-grams are observed only a few times each: even the most frequent trigrams occur just 6 times among 1,001. This reflects linguistic variety in the corpus, but also suggests that models based on higher-order n-grams may face sparsity problems. This observation guides the choice of smoothing techniques and backoff strategies in the final model.
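The reported totals are consistent with n-grams being extracted within each document, never across document boundaries: a document of m words yields m - n + 1 n-grams, so 110 documents with 1,221 tokens give 1,221 - 110 = 1,111 bigrams and 1,221 - 220 = 1,001 trigrams. A minimal sketch of that counting, with a hypothetical helper name and toy documents:

```python
from collections import Counter

def ngram_counts(tokenized_docs, n):
    """Count n-grams within each document, never across document boundaries."""
    counts = Counter()
    for doc in tokenized_docs:
        for i in range(len(doc) - n + 1):
            counts[tuple(doc[i:i + n])] += 1
    return counts

# Toy documents, not the corpus itself.
toy_docs = [
    ["o", "processamento", "de", "linguagem"],
    ["processamento", "de", "dados"],
]
bigrams = ngram_counts(toy_docs, 2)
trigrams = ngram_counts(toy_docs, 3)
print(sum(bigrams.values()), len(bigrams))
print(sum(trigrams.values()), len(trigrams))
```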
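The analysis identified the following distribution between languages:
- Portuguese: 11.5% of words (identified by accented characters)
- English: 88.5% of words
- Mixed documents: 52 documents contain both languages

Because Portuguese is detected only through accented characters, unaccented Portuguese words such as “de” and “dados” are counted as English. The Portuguese share should therefore be read as a lower bound.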
This multilingual characteristic presents both challenges and opportunities for the predictive model. It will be necessary to implement strategies to handle code-switching and maintain adequate context in multilingual environments.
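The accent-based language heuristic described above can be sketched with the standard library. The helper names are assumptions, and the Unicode decomposition test is one possible way to detect accented characters; the report does not state how the check was implemented.

```python
import unicodedata

def has_accent(word):
    """True if any character decomposes into a base letter plus a combining mark."""
    decomposed = unicodedata.normalize("NFD", word)
    return any(unicodedata.combining(ch) for ch in decomposed)

def accented_share(tokens):
    """Fraction of tokens that contain at least one accented character."""
    return sum(has_accent(tok) for tok in tokens) / len(tokens)

# Toy tokens: only "é" is flagged, although all four words are Portuguese.
toy_tokens = ["é", "uma", "de", "dados"]
print(accented_share(toy_tokens))  # 0.25
```

A dictionary or character n-gram based language identifier would be a natural refinement if code-switching needs to be handled at the word level.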
Based on the analysis results, a hybrid architecture is recommended that combines:
The analysis suggests that the model can be optimized for mobile devices through:
The model should be evaluated considering:
- Accuracy in predicting the top-3 most probable words
- Response time (target: < 100 ms on mobile devices)
- Memory usage (target: < 50 MB for the complete model)
- Generalization capability for unseen texts
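Top-3 accuracy can be measured by checking, for each held-out (context, next word) pair, whether the true next word appears among the first three suggestions. The sketch below assumes a `predict` callable that returns a ranked list of words; both that interface and the toy predictor are hypothetical.

```python
def top_k_accuracy(predict, examples, k=3):
    """Share of (context, target) pairs whose target is among the first k predictions."""
    if not examples:
        raise ValueError("examples must not be empty")
    hits = sum(1 for context, target in examples if target in predict(context)[:k])
    return hits / len(examples)

# Toy predictor (hypothetical): always suggests the same three words.
def toy_predict(context):
    return ["de", "dados", "learning"]

toy_examples = [
    (["ciência", "de"], "dados"),
    (["machine"], "learning"),
    (["é"], "uma"),
]
print(top_k_accuracy(toy_predict, toy_examples))
```

Evaluating on documents held out from training is what makes this number a measure of generalization rather than memorization.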
The planned Shiny application will feature:
Key Features:
- Text input field with real-time predictions
- Display of top 3 word suggestions with probabilities
- Visualization of n-gram patterns being used
- Performance metrics dashboard
- Option to switch between different model configurations
Model Storage and Retrieval:
- Efficient hash tables for n-gram lookup
- Compressed vocabulary storage
- Fast backoff mechanism implementation
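A hash-table lookup with backoff can be sketched as a dictionary keyed by context tuples. This is a simplified illustration in Python, not the planned Shiny implementation: it backs off to shorter contexts on raw counts without discounting or smoothing, and all names are assumptions.

```python
from collections import Counter, defaultdict

def build_tables(tokenized_docs, max_order=3):
    """Map each context tuple (length 0 to max_order - 1) to next-word counts."""
    tables = defaultdict(Counter)
    for doc in tokenized_docs:
        for i, word in enumerate(doc):
            for k in range(max_order):
                if i - k < 0:
                    break
                tables[tuple(doc[i - k:i])][word] += 1
    return tables

def predict_next(tables, context, top_n=3, max_order=3):
    """Return up to top_n words, backing off to shorter contexts when unseen."""
    keep = max_order - 1
    context = tuple(context)[-keep:] if keep > 0 else ()
    while True:
        if context in tables:
            return [w for w, _ in tables[context].most_common(top_n)]
        if not context:
            return []
        context = context[1:]  # drop the oldest word and retry

# Toy documents, not the corpus itself.
toy_docs = [
    ["processamento", "de", "linguagem"],
    ["ciência", "de", "dados"],
    ["análise", "de", "dados"],
]
tables = build_tables(toy_docs)
print(predict_next(tables, ["ciência", "de"]))
print(predict_next(tables, ["xyz", "de"]))
```

Each lookup is a handful of dictionary accesses, which is what keeps prediction latency low; pruning rare contexts would be one way to approach the memory target.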
User Experience:
- Responsive design for mobile and desktop
- Minimal latency for predictions
- Clear visualization of model confidence
The exploratory analysis provided valuable insights into the characteristics of the text corpus and established clear guidelines for developing the predictive model. Key findings include:
These results indicate that the project is well-positioned to develop an efficient predictive model and functional Shiny application that meets the established performance and usability requirements.
The next phase will focus on implementing the predictive algorithm using the insights gained from this analysis, followed by the development of an interactive Shiny application that demonstrates the model’s capabilities in a user-friendly interface.
Report prepared by Gabriel Demetrios Lafis
Exploratory Data Analysis for Text Prediction - June 2025