Text Prediction Project

Exploratory Data Analysis

Report Date: 2025-11-17

Data Summary

The dataset consists of text from three sources:

Blogs: 899,288 lines, 37 million words
News: 77,259 lines, 34 million words
Twitter: 2,360,148 lines, 30 million words

Total: about 101 million words available for training the prediction algorithm.
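
Counts like those above can be reproduced with a single pass over each file. The following is a minimal Python sketch, not the code used for this report. The function name `count_lines_and_words` and the sample lines are hypothetical, and words are assumed to be whitespace-separated, so totals from a different tokenizer may differ slightly.

```python
from typing import Iterable, Tuple


def count_lines_and_words(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (line_count, word_count), splitting words on whitespace."""
    n_lines = 0
    n_words = 0
    for line in lines:
        n_lines += 1
        n_words += len(line.split())
    return n_lines, n_words


# Hypothetical in-memory sample standing in for one of the corpus files.
sample = ["the quick brown fox", "jumps over", "the lazy dog"]
sample_counts = count_lines_and_words(sample)
print(sample_counts)
```

Because the function accepts any iterable of strings, an open file handle can be passed directly, which reads the file line by line rather than loading it into memory at once.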

Word Frequency Analysis

[Figure: plot of the most common words]

Common function words such as “the”, “and”, and “to” are the most frequent.
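
A word frequency ranking of this kind can be computed by tokenizing each line and tallying the tokens. The sketch below is illustrative only: `top_words`, the regular expression used for tokenizing, and the three sample lines are assumptions, not taken from the project.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple


def top_words(lines: Iterable[str], k: int = 3) -> List[Tuple[str, int]]:
    """Return the k most frequent lowercase words with their counts."""
    counts: Counter = Counter()
    for line in lines:
        # Treat runs of letters and apostrophes as words.
        counts.update(re.findall(r"[a-z']+", line.lower()))
    return counts.most_common(k)


# Hypothetical sample; the real input would be the blog, news and Twitter text.
sample = [
    "The cat and the bird",
    "the dog went to the park",
    "to be and to see",
]
print(top_words(sample))
```

Even in this tiny sample the top of the ranking is taken by “the”, “to”, and “and”, which mirrors the pattern reported for the full dataset.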

Prediction Algorithm Plan

Approach:

Clean data - remove profanity, normalize text
Build n-gram models - analyze word sequences
Create backoff method - fall back to shorter n-grams when the full context has not been seen
Build Shiny app - user-friendly interface
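
The first three steps can be sketched end to end as follows. This is a minimal Python illustration under stated assumptions, not the project's implementation: the names `clean`, `build_model`, and `predict`, the one-word `BLOCKED_WORDS` placeholder, and the sample corpus are all hypothetical. The backoff shown is the simplest form, using raw counts and no discounting or smoothing.

```python
import re
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Optional, Set, Tuple

# Placeholder only; a real project would load a maintained profanity list.
BLOCKED_WORDS: Set[str] = {"darn"}

Model = Dict[Tuple[str, ...], Counter]


def clean(line: str) -> List[str]:
    """Lowercase, keep runs of letters and apostrophes, drop blocked words."""
    tokens = re.findall(r"[a-z']+", line.lower())
    return [t for t in tokens if t not in BLOCKED_WORDS]


def build_model(lines: Iterable[str], max_n: int = 3) -> Model:
    """Map each context of 0 to max_n - 1 preceding words to next-word counts."""
    model: Model = defaultdict(Counter)
    for line in lines:
        tokens = clean(line)
        for i, word in enumerate(tokens):
            for ctx_len in range(max_n):
                if i - ctx_len < 0:
                    break  # not enough preceding words for this context length
                context = tuple(tokens[i - ctx_len:i])
                model[context][word] += 1
    return model


def predict(model: Model, text: str, max_n: int = 3) -> Optional[str]:
    """Try the longest available context first, then back off to shorter ones."""
    tokens = clean(text)
    for ctx_len in range(min(max_n - 1, len(tokens)), -1, -1):
        context = tuple(tokens[len(tokens) - ctx_len:])
        if context in model:
            return model[context].most_common(1)[0][0]
    return None  # only reached when the model is empty


# Hypothetical training lines.
corpus = [
    "I want to go home",
    "I want to eat",
    "I want to go out",
    "you have to go home",
]
model = build_model(corpus)
print(predict(model, "I want to"))   # trigram context ("want", "to") is known
print(predict(model, "we need to"))  # backs off to the bigram context ("to",)
```

In the second call the two-word context was never seen in training, so the lookup falls back to the one-word context and still returns a prediction. The empty context holds plain word counts, which serves as the last resort.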

Features:

Real-time word prediction
Mobile-friendly design
Fast response times

Next Steps

Complete data processing
Build and test prediction algorithm
Develop Shiny application
Deploy to production

This report presents the initial exploratory analysis for the text prediction project.