Introduction

The goal of this project is to perform exploratory data analysis (EDA) on a large collection of text data obtained from blogs, news articles, and Twitter posts. This analysis serves as the foundation for developing a predictive text model that can suggest the most likely next word based on a user’s input.

Predictive text technologies are widely used in mobile keyboards, messaging applications, search engines, and virtual assistants. These systems improve user experience by reducing typing effort and increasing communication efficiency. Before building such a model, it is important to understand the structure, size, and characteristics of the available data.

The datasets come from three sources, each with its own style of communication. Blog posts generally contain detailed and descriptive content, news articles use formal and structured language, and Twitter messages are short and informal. Together, these datasets provide a diverse corpus suitable for language modeling.

The objectives of this exploratory analysis are to load and examine the datasets, generate summary statistics, visualize important features of the data, and identify patterns that may be useful for predictive modeling. The insights gained from this analysis will support the development of an N-gram-based prediction algorithm and an interactive Shiny application for real-time next-word prediction.

Data Loading and Summary Statistics

##   Dataset   Lines    Words Characters
## 1   Blogs  899288 37546806  206824505
## 2    News 1010206 34761151  203214543
## 3 Twitter 2360148 30096690  162096241
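
Counts of this kind can be computed with base R alone. The sketch below is a minimal illustration, not the code used to produce the table: the helper name `summarize_text`, the in-memory demo lines, and the file name in the comment are all assumptions, and "words" are taken to be whitespace-separated tokens.

```r
# Summarize a character vector of text lines: line, word, and character counts.
# `summarize_text` is a hypothetical helper name used only for this sketch.
summarize_text <- function(name, lines) {
  # Split each line on runs of whitespace; empty lines contribute zero words.
  tokens <- strsplit(trimws(lines), "\\s+")
  data.frame(
    Dataset    = name,
    Lines      = length(lines),
    Words      = sum(lengths(tokens)),
    Characters = sum(nchar(lines)),
    stringsAsFactors = FALSE
  )
}

# Tiny in-memory stand-in for a real file, for illustration only.
demo_lines <- c("the quick brown fox", "hello world", "")
demo_summary <- summarize_text("Demo", demo_lines)
print(demo_summary)

# With a real file (hypothetical path), the lines would be read first:
# blogs <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
# summarize_text("Blogs", blogs)
```

Different definitions of a word (for example, splitting on punctuation as well as whitespace) would give somewhat different word counts, so the exact figures depend on the tokenization chosen.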

Histogram of Blog Word Counts

The histogram below illustrates the distribution of words per line in the Blogs dataset.
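
A histogram of this kind could be produced along the following lines. This is a sketch under stated assumptions: `words_per_line` is a hypothetical helper, the sample lines stand in for the real Blogs text, and words are counted as whitespace-separated tokens.

```r
# Count whitespace-separated words on each line.
# `words_per_line` is a hypothetical helper name used only for this sketch.
words_per_line <- function(lines) {
  lengths(strsplit(trimws(lines), "\\s+"))
}

# Small stand-in sample; the real input would be the lines of the Blogs dataset.
sample_lines <- c("one", "two words", "three short words", "four words in total")
counts <- words_per_line(sample_lines)

# Draw the histogram of words per line.
hist(counts,
     breaks = 10,
     main = "Histogram of Blog Word Counts",
     xlab = "Words per line",
     ylab = "Number of lines")
```

If the real distribution turns out to be strongly right-skewed, limiting the x-axis range or plotting on a log scale may make the bulk of the data easier to read.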

Dataset Comparison

The chart below compares the number of lines present in each dataset.
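
As a minimal sketch, a comparison chart like this can be drawn with base R's `barplot`, using the line counts from the summary table. The chart title, axis labels, and color are illustrative choices, not taken from the original figure.

```r
# Line counts copied from the summary statistics table.
line_counts <- c(Blogs = 899288, News = 1010206, Twitter = 2360148)

# One bar per dataset, with bar height equal to the number of lines.
barplot(line_counts,
        main = "Number of Lines per Dataset",
        xlab = "Dataset",
        ylab = "Lines",
        col  = "steelblue")
```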

Interesting Findings

The Twitter dataset contains the highest number of text entries with more than 2.3 million lines.
The Blogs dataset contains the highest total number of words.
The combined datasets contain about 102 million words, providing a rich source of training data.
The datasets represent different writing styles, ranging from formal news articles to informal social media communication.
Data cleaning and preprocessing will be required before developing the prediction model.
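
The figures behind these findings can be checked directly from the summary table. The short sketch below recomputes the combined word total and adds an average words-per-line column; the column name `WordsPerLine` is introduced here for illustration only.

```r
# Line and word counts copied from the summary statistics table.
stats <- data.frame(
  Dataset = c("Blogs", "News", "Twitter"),
  Lines   = c(899288, 1010206, 2360148),
  Words   = c(37546806, 34761151, 30096690)
)

# Combined size of the corpus in words.
total_words <- sum(stats$Words)

# Average number of words per line in each dataset, rounded to one decimal.
stats$WordsPerLine <- round(stats$Words / stats$Lines, 1)

print(total_words)
print(stats)
```

The averages come out at roughly 42 words per line for Blogs, 34 for News, and 13 for Twitter, which is consistent with Twitter having the most lines but the fewest total words.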

Prediction Algorithm Plan

The final prediction algorithm will be based on N-gram language modeling. An N-gram is a contiguous sequence of N words, and the frequency of each N-gram in the training data is used to estimate the probability of a word given the words that precede it. The model will use these estimates to suggest the most likely next word based on the words the user has already entered.
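
To make the idea concrete, the sketch below predicts the next word from bigram counts on a toy token sequence. It is a simplified illustration under stated assumptions, not the planned algorithm: the tokens are invented, `predict_next` is a hypothetical helper, and the prediction is simply the most frequent follower of the previous word, with no smoothing or backoff.

```r
# Toy token sequence standing in for a tokenized training corpus.
tokens <- c("i", "like", "tea", "i", "like", "coffee",
            "i", "like", "tea", "you", "drink", "tea")

# Pair every word with the word that follows it (all bigrams in order).
bigrams <- data.frame(
  prev = head(tokens, -1),
  nxt  = tail(tokens, -1),
  stringsAsFactors = FALSE
)

# Return the word that most often follows `word`, or NA if `word` was never seen.
# `predict_next` is a hypothetical helper name used only for this sketch.
predict_next <- function(word, bigrams) {
  followers <- bigrams$nxt[bigrams$prev == word]
  if (length(followers) == 0) return(NA_character_)
  counts <- table(followers)
  names(counts)[which.max(counts)]
}

print(predict_next("like", bigrams))
```

In this toy data "like" is followed by "tea" twice and "coffee" once, so "tea" is the suggestion. A practical model would also need a strategy for words or contexts that never appear in the training data.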

The text data will first be preprocessed by removing unnecessary symbols, handling punctuation, and standardizing the text (for example, by converting it to a consistent case). Unigram, bigram, and trigram models (sequences of one, two, and three words) will then be generated to support next-word prediction.
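
One possible shape for these steps is sketched below in base R. The specific cleaning rules are assumptions made for illustration (lowercasing, keeping only the letters a to z, apostrophes, and spaces), and `clean_text` and `make_ngrams` are hypothetical helper names; the final pipeline may well treat digits, non-English characters, and punctuation differently.

```r
# Normalize one or more strings: lowercase, drop symbols, collapse whitespace.
clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z' ]", " ", x)  # assumption: keep only a-z, apostrophes, spaces
  x <- gsub("\\s+", " ", x)      # collapse runs of whitespace
  trimws(x)
}

# Build all N-grams of size n from a single cleaned line.
make_ngrams <- function(line, n) {
  words <- strsplit(line, " ", fixed = TRUE)[[1]]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

cleaned <- clean_text("Hello, World! Hello again...")
uni <- make_ngrams(cleaned, 1)
bi  <- make_ngrams(cleaned, 2)
tri <- make_ngrams(cleaned, 3)

print(cleaned)
print(bi)
print(tri)
```

Applying `table()` to the combined N-grams from all lines would give the frequency counts that an N-gram model is built from.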

Shiny Application Plan

A Shiny application will be developed to provide an interactive interface for users. The application will contain a text input box where users can enter phrases or sentences and receive real-time next-word predictions.

The application will be designed with the following goals:

Simple and user-friendly interface
Fast prediction response time
Accurate next-word suggestions
Easy accessibility for non-technical users
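
A minimal sketch of such an application is shown below. Everything specific in it is an assumption for illustration: the input and output IDs (`phrase`, `prediction`), the page title, and especially `predict_next_word`, which is a hypothetical placeholder backed by a three-entry lookup table rather than a trained N-gram model. The Shiny calls are guarded so that the prediction function can still be run where the shiny package is not installed.

```r
# Hypothetical placeholder predictor: a tiny lookup table, not a trained model.
predict_next_word <- function(text) {
  lookup <- c(thank = "you", good = "morning", how = "are")
  words <- strsplit(trimws(tolower(text)), "\\s+")[[1]]
  if (length(words) == 0) return("")
  guess <- lookup[tail(words, 1)]       # look up the last word typed
  if (is.na(guess)) "" else unname(guess)
}

if (requireNamespace("shiny", quietly = TRUE)) {
  # User interface: a text box for the phrase and a text area for the suggestion.
  ui <- shiny::fluidPage(
    shiny::titlePanel("Next-Word Prediction"),
    shiny::textInput("phrase", "Enter a phrase:"),
    shiny::textOutput("prediction")
  )

  # Server logic: recompute the suggestion whenever the input text changes.
  server <- function(input, output, session) {
    output$prediction <- shiny::renderText(predict_next_word(input$phrase))
  }

  app <- shiny::shinyApp(ui, server)
  # shiny::runApp(app) would start the app in an interactive session.
}
```

Because `renderText` is reactive, the suggestion updates as the user types, so response time depends mainly on how quickly the prediction function returns.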

Conclusion

This exploratory analysis successfully loaded and examined the Blogs, News, and Twitter datasets. Summary statistics and visualizations were generated to better understand the structure and characteristics of the data. The findings indicate that the datasets provide a substantial and diverse text corpus suitable for predictive modeling.

The next phase of the project will focus on data preprocessing, N-gram model construction, algorithm optimization, and deployment through an interactive Shiny application.

Exploratory Analysis of Text Prediction Data

Annappa Madiwal

2026-06-13