Introduction

This report outlines the initial exploratory data analysis performed on the text datasets for a text prediction project. It gives a concise overview of the data and of the plans for building a prediction algorithm and a Shiny application, written so that a manager without a data science background can follow it.

Data Loading and Initial Summary

The project uses three text datasets: blogs, news, and Twitter. The data was downloaded and successfully loaded into R.

Summary Statistics of the Text Datasets
Dataset	File Size (MB)	Number of Lines	Number of Characters	Number of Words
Blogs	200.42	899288	206824505	37546250
News	196.28	1010242	203223159	34762395
Twitter	159.36	2360148	162096241	30093413

Exploratory Analysis: Word Counts

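To characterize the text data, the distribution of word counts per line was analyzed for each dataset, treating each line as one document. The analysis focused on the average number of words per line and how it varies across the three sources. Dividing the word totals by the line totals in the summary table gives roughly 42 words per line for blogs, 34 for news, and 13 for Twitter.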
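As a rough illustration of this kind of per-line analysis, the sketch below counts whitespace-separated words in each line and then derives the average words per line from the totals in the summary table. The helper name `words_per_line` and the sample lines are hypothetical stand-ins, not the project's actual code or data, and splitting on whitespace is an assumed tokenization rule.

```r
# Hypothetical helper: number of whitespace-separated words in each line.
words_per_line <- function(lines) {
  tokens <- strsplit(trimws(lines), "\\s+")
  # An empty line yields no tokens, so it counts as zero words.
  vapply(tokens, function(tok) sum(nzchar(tok)), integer(1))
}

# Made-up sample lines standing in for the real files.
sample_lines <- c(
  "The quick brown fox jumps over the lazy dog",
  "Short tweet here",
  ""
)
counts <- words_per_line(sample_lines)
print(summary(counts))

# Average words per line implied by the totals in the summary table.
totals <- data.frame(
  dataset = c("Blogs", "News", "Twitter"),
  lines   = c(899288, 1010242, 2360148),
  words   = c(37546250, 34762395, 30093413)
)
totals$avg_words_per_line <- round(totals$words / totals$lines, 1)
print(totals)
```

On the full datasets, the same helper could be applied to the output of `readLines()` for each file, with a histogram of the resulting counts showing the shape of each distribution.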

Future Plans: Prediction Algorithm and Shiny App

The goal is to develop a text prediction algorithm and deploy it as a Shiny application.

Prediction Algorithm

A basic n-gram model with a backoff strategy is planned. If a higher-order n-gram (e.g., a trigram) isn’t found in the training data, the model will “back off” to a lower-order n-gram (e.g., a bigram) to make a prediction. This helps to handle unseen word combinations.
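The backoff idea can be sketched in a few lines of base R. This is a minimal illustration under stated assumptions, not the planned implementation: the toy corpus, the helper names `count_ngrams` and `predict_next`, and the choice of plain count-based backoff with no smoothing or discounting are all hypothetical.

```r
# Toy corpus (hypothetical); the real model would be trained on the datasets.
corpus <- c(
  "i love new york",
  "i love new ideas",
  "i love new york pizza",
  "new york is big"
)

# Count every n-gram of order n; returns a named integer vector.
count_ngrams <- function(sentences, n) {
  tokens <- strsplit(tolower(trimws(sentences)), "\\s+")
  grams <- unlist(lapply(tokens, function(w) {
    if (length(w) < n) return(character(0))
    starts <- seq_len(length(w) - n + 1)
    vapply(starts, function(i) paste(w[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  tab <- table(grams)
  setNames(as.integer(tab), names(tab))
}

# tables[[1]] holds unigrams, tables[[2]] bigrams, tables[[3]] trigrams.
tables <- lapply(1:3, function(n) count_ngrams(corpus, n))

# Try the longest context first, then back off to shorter contexts.
predict_next <- function(text, tables, top = 3) {
  words <- strsplit(tolower(trimws(text)), "\\s+")[[1]]
  for (n in rev(seq_along(tables))) {
    k <- n - 1                      # number of context words for this order
    if (length(words) < k) next
    counts <- tables[[n]]
    if (k == 0) {
      matches <- counts             # unigram fallback: most frequent words
    } else {
      prefix <- paste0(paste(tail(words, k), collapse = " "), " ")
      matches <- counts[startsWith(names(counts), prefix)]
    }
    if (length(matches) > 0) {
      best <- names(sort(matches, decreasing = TRUE))
      best <- best[seq_len(min(top, length(best)))]
      return(sub(".* ", "", best))  # keep only the final (predicted) word
    }
  }
  character(0)
}

print(predict_next("I love new", tables))  # trigram context "love new" is found
print(predict_next("brand new", tables))   # unseen trigram, backs off to "new"
```

In the second call the context "brand new" never appears in the toy corpus, so the function falls back to the bigram table and ranks the words that follow "new". A production model would likely also weight or discount the lower-order counts rather than use them raw.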

Shiny Application

The Shiny app will provide a user-friendly interface for text prediction. Users will be able to:

1. Enter text into an input field.
2. Receive the top three predicted next words.
3. Potentially explore some of the underlying data features through interactive plots or tables.

This approach balances accuracy with computational efficiency, providing a useful predictive text experience.
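The interface described above could start from a skeleton like the following. It is only a sketch: the input and output IDs, the page title, and the placeholder function `predict_top3` (which returns fixed words instead of calling a real model) are assumptions made for illustration, and the `shiny` package must be installed.

```r
library(shiny)

# Placeholder predictor (hypothetical): the real app would call the n-gram model.
predict_top3 <- function(text) c("the", "to", "and")

ui <- fluidPage(
  titlePanel("Next-word prediction"),
  textInput("phrase", "Enter text:"),
  verbatimTextOutput("predictions")
)

server <- function(input, output, session) {
  output$predictions <- renderText({
    req(input$phrase)  # wait until the user has typed something
    paste(predict_top3(input$phrase), collapse = ", ")
  })
}

# Launch the app only when run in an interactive R session.
if (interactive()) shinyApp(ui, server)
```

Keeping the prediction logic in its own function means the model can be swapped in later without changing the user interface code.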

Conclusion

The initial data exploration confirms that the three datasets loaded successfully and establishes their basic structure. The word count distributions show how the sources differ in typical line length. The plan to build an n-gram-based prediction algorithm with backoff and deploy it via a Shiny app provides a solid foundation for the next stages of this project.

Text Prediction Project: Exploratory Data Analysis and Future Plans

Dominic Nadeau

2025-07-27