Basic Statistics of the Data Files

Source    File Size (MB)   Line Count   Word Count
Blogs     200              899288       37334131
News      196              1010242      34372530
Twitter   159              2360148      30373583

3. Future Goals for Prediction Algorithm and Shiny App

The final goal of this project is to build a predictive text application. Based on this exploratory analysis, my plan is as follows:

  1. Prediction Model: I will develop an N-gram model (using 2-word, 3-word, and 4-word sequences) to predict the next word based on user input.
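  2. Handling Unseen Phrases: I will implement a “back-off” strategy. If the longest context (for example, the last three words typed) has no match among the 4-word sequences, the algorithm will back off to the 3-word and then the 2-word sequences to provide the best possible guess.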
  3. App Design: The Shiny app will feature a simple text interface. As the user types, the top 3 most likely next words will be displayed instantly.
  4. Optimization: To ensure the app is fast and memory-efficient for mobile users, I will prune the dictionary to remove very rare words and phrases.
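
As a rough illustration of how steps 1, 2, and 4 could fit together, here is a minimal sketch in Python. It is only meant to show the logic, not the final Shiny implementation. The function names (build_model, predict), the min_count pruning threshold, and the toy sentences are all hypothetical, and the back-off shown is a simple longest-match lookup with no smoothing or probability discounting.

```python
from collections import Counter


def build_model(sentences, max_n=4, min_count=1):
    """Count n-grams (n = 2..max_n) and index them as context -> Counter of next words.

    min_count is an assumed pruning threshold: n-grams seen fewer times are dropped.
    """
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()  # naive whitespace tokenizer (assumption)
        for n in range(2, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1

    model = {}
    for ngram, count in counts.items():
        if count >= min_count:  # pruning step: skip rare n-grams
            context, next_word = ngram[:-1], ngram[-1]
            model.setdefault(context, Counter())[next_word] = count
    return model


def predict(model, text, k=3, max_n=4):
    """Return up to k candidate next words, backing off to shorter contexts."""
    tokens = text.lower().split()
    # Try the longest available context first (3 words for a 4-gram model),
    # then back off to 2 words, then 1 word.
    for context_len in range(min(max_n - 1, len(tokens)), 0, -1):
        context = tuple(tokens[-context_len:])
        if context in model:
            return [word for word, _ in model[context].most_common(k)]
    return []  # nothing matched at any context length


# Toy corpus, invented purely for illustration.
sentences = [
    "thanks for the follow",
    "thanks for the help",
    "thanks for the follow back",
    "looking for the exit",
]

full_model = build_model(sentences)
pruned_model = build_model(sentences, min_count=2)

print(predict(full_model, "thanks for the"))    # matched on the 3-word context
print(predict(full_model, "waiting for the"))   # backs off to the 2-word context "for the"
print(predict(pruned_model, "thanks for the"))  # rare continuations have been pruned
```

Raising min_count shrinks the lookup table at the cost of fewer candidate words, which is the trade-off behind step 4. The right threshold would have to be chosen by experiment on the real data.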