Milestone Report: Text Prediction Project

Introduction

This report explores text data from Twitter, blogs, and news to prepare for a word prediction app. We examined the size, structure, and word patterns of each source to confirm that the data can support a fast, user-friendly tool.

Data Overview

We loaded three text files: Twitter, blogs, and news. Because the full dataset is large, we used a 10% sample of each file for the exploratory analysis. The table below summarizes the size of the full datasets:

## Warning in readLines("en_US/en_US.news.txt",
## encoding = "UTF-8", skipNul = TRUE): incomplete
## final line found on 'en_US/en_US.news.txt'

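Summary Statistics of Text Files
File	Lines	Words
Twitter	2,360,148	9
Blogs	0	0
News	630,799	0

Note: The word counts for all three files and the line count for Blogs are not plausible for text files of this size. They point to a problem in the loading or counting step, and these figures should be recomputed before modeling.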
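The loading, sampling, and counting steps described above could be sketched as follows. This is a minimal base R sketch, not the code used for the report: the helper names `read_corpus`, `sample_lines`, and `summarize_corpus` are hypothetical, the seed is arbitrary, and words are assumed to be whitespace-separated tokens. Opening the connection in binary mode is a common way to avoid truncated reads when a file contains embedded control characters.

```r
# Read every line of a text file. The connection is opened in binary mode
# ("rb") so that embedded control characters do not cut the read short.
read_corpus <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  readLines(con, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
}

# Draw a reproducible sample of lines (10% by default).
# The seed value is an arbitrary assumption.
sample_lines <- function(lines, fraction = 0.10, seed = 1234) {
  set.seed(seed)
  n <- round(length(lines) * fraction)
  lines[sample.int(length(lines), n)]
}

# Count lines and whitespace-separated words.
summarize_corpus <- function(lines) {
  words_per_line <- lengths(strsplit(trimws(lines), "\\s+"))
  data.frame(lines = length(lines), words = sum(words_per_line))
}

# Example usage, run only if the file is present at this path.
path <- "en_US/en_US.news.txt"
if (file.exists(path)) {
  news <- read_corpus(path)
  print(summarize_corpus(news))
  news_sample <- sample_lines(news)
}
```

Counting words on the trimmed line avoids an empty leading token, and an empty line counts as zero words.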

Exploratory Findings

The datasets vary in style:

Twitter: Short, informal text with emojis and slang like “lol.”
Blogs: Longer, narrative text with personal stories.
News: Formal, structured sentences with a professional tone.

The bar chart below shows the 10 most frequent words in the Twitter sample after removing punctuation and common words such as “the”:
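A minimal sketch of how such word counts could be computed in base R. The function name `top_words` and the short stop-word list are assumptions made for illustration; the report does not say which stop-word list or tokenizer was used.

```r
# Small illustrative stop-word list (an assumption, not the report's list).
stop_words <- c("the", "a", "an", "and", "to", "of", "in", "is", "it",
                "i", "you", "for", "on")

# Return the n most frequent words after lowercasing, stripping everything
# except letters, apostrophes, and spaces, and dropping stop words.
top_words <- function(lines, n = 10, stop = stop_words) {
  text <- tolower(lines)
  text <- gsub("[^a-z' ]+", " ", text)  # punctuation, digits, emoji -> space
  tokens <- unlist(strsplit(text, "\\s+"))
  tokens <- tokens[nzchar(tokens) & !(tokens %in% stop)]
  counts <- sort(table(tokens), decreasing = TRUE)
  head(counts, n)
}

# Example usage on a hypothetical character vector `twitter_sample`:
# barplot(top_words(twitter_sample), las = 2, main = "Top 10 words")
```

Replacing unwanted characters with a space rather than deleting them keeps words on either side of punctuation from being merged.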

The histogram below shows the distribution of words per line in the Twitter sample, highlighting that most lines are short (under 20 words):
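The words-per-line distribution could be computed along these lines. This is a sketch under the assumption that a word is any whitespace-separated token; `words_per_line` and `share_under` are hypothetical helper names.

```r
# Number of whitespace-separated words on each line (empty lines count as 0).
words_per_line <- function(lines) {
  lengths(strsplit(trimws(lines), "\\s+"))
}

# Fraction of lines with fewer than `limit` words.
share_under <- function(lines, limit = 20) {
  mean(words_per_line(lines) < limit)
}

# Example usage on a hypothetical character vector `twitter_sample`:
# hist(words_per_line(twitter_sample), breaks = 30,
#      main = "Words per line (Twitter sample)", xlab = "Words per line")
# share_under(twitter_sample, 20)
```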

Plans for Prediction Algorithm and Shiny App

We will build a tool that predicts the next word as a user types, similar to a phone’s auto-complete. For example, if someone types “I love to,” the tool might suggest “eat” or “run” based on how often those words follow that phrase in the data. To keep it fast, we’ll train on a sample of the data and keep only the most frequent word combinations (n-grams).
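One simple way to turn frequent word combinations into predictions is a bigram model: count how often each word follows another, then suggest the most frequent followers of the last word typed. The sketch below illustrates that idea in base R; it is not the project's final algorithm, the names `tokenize`, `build_bigrams`, and `predict_next` are hypothetical, and a real model would likely use longer n-grams and a more efficient data structure.

```r
# Lowercase a line and split it into words (letters and apostrophes only).
tokenize <- function(line) {
  words <- strsplit(gsub("[^a-z' ]+", " ", tolower(line)), "\\s+")[[1]]
  words[nzchar(words)]
}

# Count every (previous word, next word) pair in the training lines.
build_bigrams <- function(lines) {
  pairs <- lapply(lines, function(line) {
    w <- tokenize(line)
    if (length(w) < 2) return(NULL)
    data.frame(prev = head(w, -1), nxt = tail(w, -1),
               stringsAsFactors = FALSE)
  })
  pairs <- do.call(rbind, pairs)
  if (is.null(pairs)) {
    return(data.frame(prev = character(0), nxt = character(0),
                      n = integer(0)))
  }
  pairs$n <- 1L
  counts <- aggregate(n ~ prev + nxt, data = pairs, FUN = sum)
  # Most frequent followers first within each previous word.
  counts[order(counts$prev, -counts$n), ]
}

# Suggest up to k words that most often follow the last word of the phrase.
predict_next <- function(model, phrase, k = 3) {
  w <- tokenize(phrase)
  if (length(w) == 0) return(character(0))
  hits <- model[model$prev == w[length(w)], ]
  head(hits$nxt, k)
}

# Example usage with a tiny made-up training set:
model <- build_bigrams(c("I love to eat", "I love to eat pizza",
                         "I love to run", "we love pizza"))
predict_next(model, "I love to", k = 2)
```

In this toy example only the final word “to” is used for the lookup; using the last two or three words would give more context at the cost of a larger table.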

The Shiny app will be simple and user-friendly:

A text box for users to type.
A dropdown to choose data (Twitter, blogs, or news).
A display showing 3–5 suggested words.

This will make typing faster and more intuitive, like a virtual keyboard assistant.
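The three interface elements listed above map onto standard Shiny inputs and outputs. The sketch below is only an outline under stated assumptions: the input IDs, labels, and the `suggest_words` function are hypothetical, and `suggest_words` returns fixed placeholder words where the real prediction model would be called.

```r
library(shiny)

# Hypothetical stand-in for the prediction model. It ignores `source` and
# returns fixed placeholder words; the real model would be called here.
suggest_words <- function(phrase, source, k = 3) {
  if (!nzchar(trimws(phrase))) return(character(0))
  head(c("eat", "run", "sleep", "read", "travel"), k)
}

ui <- fluidPage(
  titlePanel("Next-word prediction"),
  # Text box for users to type.
  textInput("phrase", "Type a phrase:"),
  # Dropdown to choose the data source.
  selectInput("source", "Data source:",
              choices = c("Twitter", "Blogs", "News")),
  # Display for the suggested words.
  verbatimTextOutput("suggestions")
)

server <- function(input, output, session) {
  output$suggestions <- renderText({
    paste(suggest_words(input$phrase, input$source), collapse = ", ")
  })
}

# Uncomment to launch the app interactively:
# shinyApp(ui, server)
```

Keeping the prediction logic in a plain function, separate from the UI, means it can be tested without starting the app.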

Conclusion

This analysis shows that we have loaded and explored the data, identified key patterns, and planned a practical word prediction app. The next step is to develop the algorithm and the app, and we welcome feedback on the approach.