Text Prediction Project

Exploratory Data Analysis

Report Date: 2025-11-17


Data Summary

The dataset consists of text from three sources:

  • Blogs: 899,288 lines, 37 million words
  • News: 77,259 lines, 34 million words
  • Twitter: 2,360,148 lines, 30 million words

Total: Over 100 million words for training the prediction algorithm.
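
Line and word counts like those above can be obtained with a single pass over each file. The following is a minimal sketch, assuming words are delimited by whitespace; the function name and the in-memory sample are hypothetical, and Python is used only for illustration, since the report does not say how the counts were produced.

```python
from io import StringIO
from typing import Iterable, Tuple


def count_lines_and_words(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (line count, whitespace-delimited word count)."""
    n_lines = 0
    n_words = 0
    for line in lines:
        n_lines += 1
        n_words += len(line.split())  # split() with no args splits on any whitespace
    return n_lines, n_words


# Tiny in-memory stand-in for a corpus file (hypothetical data).
# A real run would pass an open file handle instead.
sample = StringIO("the quick brown fox\njumps over\nthe lazy dog\n")
print(count_lines_and_words(sample))  # (3, 9)
```

Streaming line by line keeps memory use flat, which matters for files with millions of lines.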


Word Frequency Analysis

Plot showing most common words

Common function words such as “the”, “and”, and “to” are the most frequent.
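
A word frequency table of this kind can be sketched as follows. This is an illustrative example, not the analysis code behind the plot: the tokenizer (lowercase, letters and apostrophes only) and the name top_words are assumptions.

```python
import re
from collections import Counter


def top_words(lines, n=10):
    """Return the n most frequent lowercase word tokens as (word, count) pairs."""
    counts = Counter()
    for line in lines:
        # Assumed tokenization: runs of letters and apostrophes, case-folded
        counts.update(re.findall(r"[a-z']+", line.lower()))
    return counts.most_common(n)


# Hypothetical sample lines
sample = ["The cat and the dog", "the end"]
print(top_words(sample, n=1))  # [('the', 3)]
```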


Prediction Algorithm Plan

Approach:

  1. Clean data - remove profanity, normalize text
  2. Build n-gram models - analyze word sequences
  3. Create backoff method - fall back to shorter n-grams when the longer context is unseen
  4. Build Shiny app - user-friendly interface
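
Steps 1 to 3 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the project's implementation: it uses a simple longest-context-first backoff with raw counts (no discounting or score weighting), a hypothetical one-word blocklist in place of a real profanity list, and hypothetical function names. Python is used only for illustration.

```python
import re
from collections import Counter, defaultdict

BLOCKLIST = {"darn"}  # hypothetical placeholder for a real profanity list


def clean(line, blocklist=BLOCKLIST):
    """Lowercase, keep letter/apostrophe tokens, drop blocklisted words."""
    tokens = re.findall(r"[a-z']+", line.lower())
    return [t for t in tokens if t not in blocklist]


def build_ngram_tables(sentences, max_n=3):
    """tables[k][context] is a Counter of next words; context is a tuple of k words."""
    tables = {k: defaultdict(Counter) for k in range(max_n)}
    for tokens in sentences:
        for i, word in enumerate(tokens):
            for k in range(max_n):
                if i - k >= 0:
                    context = tuple(tokens[i - k:i])  # k == 0 gives the empty context
                    tables[k][context][word] += 1
    return tables


def predict(tables, text, top=3):
    """Try the longest available context first, then back off to shorter ones."""
    tokens = clean(text)
    longest = min(max(tables), len(tokens))
    for k in range(longest, -1, -1):
        context = tuple(tokens[len(tokens) - k:]) if k else ()
        candidates = tables[k].get(context)
        if candidates:
            return [word for word, _ in candidates.most_common(top)]
    return []


# Hypothetical toy corpus
corpus = [
    "I love text prediction",
    "I love text mining",
    "I love text mining a lot",
    "you love darn cats",
]
tables = build_ngram_tables([clean(s) for s in corpus])
print(predict(tables, "I love text"))    # two-word context is seen
print(predict(tables, "we enjoy text"))  # backs off to the one-word context
```

Storing counts per context makes each lookup a dictionary access, which suits the fast response times listed below. A fuller version would likely weight or discount the lower-order counts rather than returning them unchanged.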

Features:

  • Real-time word prediction
  • Mobile-friendly design
  • Fast response times

Next Steps

  1. Complete data processing
  2. Build and test prediction algorithm
  3. Develop Shiny application
  4. Deploy to production

This report presents the initial exploratory analysis for the text prediction project.