Introduction

This report outlines the initial exploratory data analysis of the text datasets for a text prediction project. It gives a concise overview of the data and of the plans for building a prediction algorithm and a Shiny application, written for a manager without a data science background.

Data Loading and Initial Summary

The project uses three text datasets: blogs, news, and Twitter. The data was downloaded and successfully loaded into R.

Summary Statistics of the Text Datasets
Dataset   File Size (MB)   Number of Lines   Number of Characters   Number of Words
Blogs             200.42            899288              206824505          37546250
News              196.28           1010242              203223159          34762395
Twitter           159.36           2360148              162096241          30093413
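Statistics of this kind can be computed with a few lines of base R. The following is a minimal sketch, not the code used for this report: the function name summarize_text_file and the file names are hypothetical, and splitting on whitespace is only one possible definition of a word, so the counts it produces may differ slightly from the table.

```r
# Summarise one text file: size on disk, number of lines, characters, and words.
# Assumption: a "word" is any run of non-whitespace characters.
summarize_text_file <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
  data.frame(
    file_size_mb = file.size(path) / 1024^2,
    n_lines = length(lines),
    n_chars = sum(nchar(lines, type = "chars", allowNA = TRUE), na.rm = TRUE),
    n_words = sum(lengths(strsplit(trimws(lines), "\\s+")))
  )
}

# Hypothetical file names; adjust them to wherever the data was downloaded.
paths <- c(Blogs = "blogs.txt", News = "news.txt", Twitter = "twitter.txt")
existing <- paths[file.exists(paths)]
if (length(existing) > 0) {
  # One row per dataset, labelled by the names given in `paths`.
  summary_stats <- do.call(rbind, lapply(existing, summarize_text_file))
  print(summary_stats)
}
```

Reading each file fully into memory is acceptable at these sizes (roughly 160 to 200 MB each), but a sampled subset may be more practical for later modelling steps.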

Exploratory Analysis: Word Counts

To understand the characteristics of the text data, the distribution of word counts within each document (line) of the datasets was analyzed. The analysis focused on the average number of words per line, used here as a rough measure of sentence length, and on how it varies across the three text sources.
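As a rough illustration of this step, the average number of words per line can be derived directly from the line and word totals in the summary table, and a per-line distribution can be obtained by counting whitespace-separated tokens. This is a sketch under assumptions: the helper words_per_line and the three sample lines are made up for illustration and are not taken from the datasets.

```r
# Count whitespace-separated words in each line (assumed definition of a word).
words_per_line <- function(lines) {
  lengths(strsplit(trimws(lines), "\\s+"))
}

# Line and word totals copied from the summary statistics table.
summary_table <- data.frame(
  dataset = c("Blogs", "News", "Twitter"),
  lines = c(899288, 1010242, 2360148),
  words = c(37546250, 34762395, 30093413)
)
summary_table$mean_words_per_line <- summary_table$words / summary_table$lines
print(summary_table)

# Distribution summary for a tiny, made-up sample of lines.
sample_lines <- c("a short line", "a somewhat longer line of text", "one")
print(summary(words_per_line(sample_lines)))
```

From the table totals, this works out to roughly 42 words per line for blogs, 34 for news, and 13 for Twitter, which is consistent with Twitter's short message format.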

Future Plans: Prediction Algorithm and Shiny App

The goal is to develop a text prediction algorithm and deploy it as a Shiny application.

Prediction Algorithm

A basic n-gram model with a backoff strategy is planned. If a higher-order n-gram (e.g., a trigram) isn’t found in the training data, the model will “back off” to a lower-order n-gram (e.g., a bigram) to make a prediction. This helps to handle unseen word combinations.
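The backoff idea can be sketched in base R as follows. This is a simplified illustration rather than the planned implementation: it uses raw counts with no smoothing or discounting, and the functions tokenize, build_model, and predict_next, as well as the toy corpus, are hypothetical.

```r
# Lower-case the text and split it into word tokens (letters and apostrophes only).
tokenize <- function(text) {
  tokens <- strsplit(tolower(text), "[^a-z']+")[[1]]
  tokens[nzchar(tokens)]
}

# Build count tables for 1-grams up to max_n-grams, each sorted by frequency.
build_model <- function(tokens, max_n = 3) {
  stopifnot(length(tokens) >= max_n)
  lapply(seq_len(max_n), function(n) {
    starts <- seq_len(length(tokens) - n + 1)
    grams <- vapply(
      starts,
      function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
      character(1)
    )
    sort(table(grams), decreasing = TRUE)
  })
}

# Predict up to k next words, backing off from the longest context to unigrams.
predict_next <- function(model, text, k = 3) {
  tokens <- tokenize(text)
  for (n in seq(min(length(model), length(tokens) + 1), 1)) {
    counts <- model[[n]]
    if (n == 1) {
      # No context matched: fall back to the most frequent single words.
      return(head(names(counts), k))
    }
    prefix <- paste0(paste(tail(tokens, n - 1), collapse = " "), " ")
    hits <- counts[startsWith(names(counts), prefix)]
    if (length(hits) > 0) {
      # The predicted word is whatever follows the matched context.
      return(head(substring(names(hits), nchar(prefix) + 1), k))
    }
  }
  character(0)
}

# Toy corpus, made up for illustration only.
corpus <- "the cat sat on the mat the cat ate the fish the cat sat down"
model <- build_model(tokenize(corpus), max_n = 3)

print(predict_next(model, "the cat"))    # trigram context "the cat" is found
print(predict_next(model, "a dog sat"))  # backs off to the bigram context "sat"
print(predict_next(model, "zebra"))      # backs off to the most frequent words
```

A production model would likely store the n-gram tables in a more compact lookup structure and prune rare n-grams, since scanning every n-gram name on each prediction does not scale to the full datasets.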

Shiny Application

The Shiny app will provide a user-friendly interface for text prediction. Users will be able to:

1. Enter text into an input field.
2. Receive the top three predicted next words.
3. Potentially explore some of the underlying data features through interactive plots or tables.

Pairing a lightweight n-gram model with a simple interface balances accuracy with computational efficiency, providing a useful predictive text experience.
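A minimal sketch of such an app is shown below, assuming the shiny package is installed. The predictor here is a placeholder that always returns the same three words; predict_next_stub and format_predictions are hypothetical names, and the real app would call the trained prediction model instead.

```r
# Placeholder predictor (assumption): always returns the same candidate words.
# The real app would call the trained n-gram model here.
predict_next_stub <- function(text, k = 3) {
  head(c("the", "to", "and"), k)
}

# Turn a vector of predicted words into a single display string.
format_predictions <- function(words) {
  paste(words, collapse = ", ")
}

if (requireNamespace("shiny", quietly = TRUE)) {
  ui <- shiny::fluidPage(
    shiny::titlePanel("Next Word Prediction"),
    shiny::textInput("user_text", "Enter text:"),
    shiny::textOutput("predictions")
  )

  server <- function(input, output, session) {
    # Recomputed automatically whenever the text input changes.
    output$predictions <- shiny::renderText({
      format_predictions(predict_next_stub(input$user_text, k = 3))
    })
  }

  app <- shiny::shinyApp(ui, server)

  # Launch the app only in an interactive R session.
  if (interactive()) {
    shiny::runApp(app)
  }
}
```

Keeping the prediction logic in plain functions, separate from the UI and server code, makes it possible to test the model without starting the app.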

Conclusion

The initial data exploration confirms that the three text datasets loaded successfully and establishes their basic structure. The word count distributions give insight into how the sources differ. The plan to build an n-gram based prediction algorithm and deploy it via a Shiny app provides a solid foundation for the next stages of this project.