1. Executive Summary

The goal of this project is to develop a predictive text algorithm that can suggest the next word in a sequence. This report demonstrates the initial exploratory analysis of three large datasets: Blogs, News, and Twitter. We have successfully cleaned the data, performed tokenization, and identified the most frequent word patterns (N-grams) that will form the basis of our prediction model.

2. Data Summary Statistics

The datasets were downloaded and loaded into R. Initial inspection reveals a significant volume of text across all three sources.

| Source  | Approx. File Size | Line Count | Word Count |
|---------|-------------------|------------|------------|
| Blogs   | ~200 MB           | [Insert]   | [Insert]   |
| News    | ~196 MB           | [Insert]   | [Insert]   |
| Twitter | ~159 MB           | [Insert]   | [Insert]   |

3. Exploratory Data Analysis (EDA)

Data was cleaned by removing punctuation, numbers, extra whitespace, and profanity. We then analyzed the frequency of single words (Unigrams) and pairs of words (Bigrams).
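
The cleaning and counting steps described above could be sketched in base R roughly as follows. This is a minimal illustration, not the code used for this report: `sample_lines`, the one-word `profanity` list, and the helper names `clean_text`, `ngram_counts`, and `plot_top` are all hypothetical, and the sketch assumes N-grams should not cross line boundaries.

```r
# Hypothetical sample; the real input would be lines read from the corpus files.
sample_lines <- c(
  "The quick brown fox, born in 2019, jumps over the lazy dog!",
  "The lazy dog sleeps; the quick fox runs."
)

# Placeholder list (assumption); a real profanity list would be loaded from a file.
profanity <- c("darn")

# Lowercase, strip punctuation and digits, collapse whitespace, drop banned words.
# Returns one character vector of tokens per input line.
clean_text <- function(lines, banned = character(0)) {
  x <- tolower(lines)
  x <- gsub("[^a-z' ]", " ", x)  # keep only letters, apostrophes, and spaces
  x <- gsub("\\s+", " ", x)      # collapse runs of whitespace
  x <- trimws(x)
  tokens <- strsplit(x, " ", fixed = TRUE)
  lapply(tokens, function(tk) tk[nzchar(tk) & !(tk %in% banned)])
}

# Count N-grams of size n within each line; result is sorted by decreasing count.
ngram_counts <- function(token_list, n = 1) {
  grams <- unlist(lapply(token_list, function(tk) {
    if (length(tk) < n) return(character(0))
    starts <- seq_len(length(tk) - n + 1)
    vapply(starts,
           function(i) paste(tk[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  sort(table(grams), decreasing = TRUE)
}

# Horizontal bar plot of the k most frequent entries.
plot_top <- function(counts, k = 15, title = "") {
  top <- head(counts, k)
  vals <- rev(setNames(as.vector(top), names(top)))
  barplot(vals, horiz = TRUE, las = 1, main = title, xlab = "Frequency")
}

toks <- clean_text(sample_lines, profanity)
uni  <- ngram_counts(toks, n = 1)
bi   <- ngram_counts(toks, n = 2)
```

Under these assumptions, a call such as `plot_top(uni, 15, "Top 15 Unigrams")` would draw the kind of bar plot the next two subsections call for. On the full corpus a dedicated text-mining package would likely be faster than these base R loops.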

Top 15 Most Frequent Words (Unigrams)

# INSERT YOUR CODE HERE to generate a bar plot of top 15 words

Top 15 Most Frequent Word Pairs (Bigrams)

# INSERT YOUR CODE HERE to generate a bar plot of top 15 word pairs

Key Findings

Sparsity: A small percentage of unique words account for a large portion of the total word occurrences.

Context Matters: Twitter data contains shorter sentences and more informal language than News and Blogs.

Stop Words: Common words like “the”, “and”, and “to” dominate the frequency counts, but they are vital for predicting sentence structure, since they are often the correct next word.
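
The sparsity finding can be quantified by asking how many distinct words are needed to cover a given share of all word occurrences. The sketch below shows one way to compute that; the function name `coverage_words` and the `toy_counts` values are invented for illustration and are not results from the corpus.

```r
# Number of distinct words needed to reach `target` share of all occurrences.
coverage_words <- function(counts, target = 0.5) {
  sorted <- sort(as.numeric(counts), decreasing = TRUE)
  cum_share <- cumsum(sorted) / sum(sorted)
  which(cum_share >= target)[1]
}

# Toy frequencies (hypothetical), 100 occurrences in total.
toy_counts <- c(the = 50, and = 20, to = 15, fox = 10, dog = 5)

half   <- coverage_words(toy_counts, 0.5)  # 1: "the" alone is 50 of 100
ninety <- coverage_words(toy_counts, 0.9)  # 4: 50 + 20 + 15 + 10 = 95 of 100
```

Applied to the real unigram table, the same function would show how small a vocabulary covers 50% or 90% of the text.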

4. Plans for Prediction Algorithm and Shiny App

The final goal is to create a user-friendly Shiny application.

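Prediction Model: I will implement a “Katz Back-off” algorithm. The model first looks for three-word sequences (Trigrams) that begin with the last two words typed. If none is found, it will back off to two-word sequences (Bigrams) that begin with the last word, and finally to the most frequent single words (Unigrams).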
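
The back-off order described above can be illustrated with a simplified lookup. Note that this sketch only shows the trigram to bigram to unigram fallback: it omits the discounting and back-off weights that a full Katz model uses, and the toy tables and the function name `predict_next` are assumptions made for the example.

```r
# Toy N-gram tables (hypothetical). Each context maps to candidate next words
# with counts, already sorted by decreasing count.
trigrams <- list("thanks for" = c(the = 5, your = 3))
bigrams  <- list("for" = c(the = 9, a = 4, your = 3),
                 "of"  = c(the = 12))
unigrams <- c(the = 100, to = 80, and = 75)

# Return up to k suggestions from the highest-order table that has a match.
predict_next <- function(text, k = 3) {
  words <- strsplit(trimws(tolower(text)), "\\s+")[[1]]
  words <- words[nzchar(words)]
  n <- length(words)

  if (n >= 2) {  # try the last two words as a trigram context
    hit <- trigrams[[paste(words[n - 1], words[n])]]
    if (!is.null(hit)) return(names(head(hit, k)))
  }
  if (n >= 1) {  # back off to the last word as a bigram context
    hit <- bigrams[[words[n]]]
    if (!is.null(hit)) return(names(head(hit, k)))
  }
  names(head(unigrams, k))  # final fallback: most frequent words overall
}
```

For example, `predict_next("Thanks for")` is answered from the trigram table, `predict_next("a cup of")` falls back to the bigram table, and `predict_next("hello world")` falls all the way back to the unigrams. A fuller version could also top up a short suggestion list with lower-order candidates.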

Optimization: To ensure the app runs quickly, I will prune the N-gram tables to remove very rare word combinations that do not significantly improve accuracy.
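
One simple way to prune is a minimum-count threshold. The sketch below assumes a cutoff of 2 (dropping N-grams seen only once); the threshold, the helper name `prune_ngrams`, and the toy counts are illustrative choices rather than settled design decisions.

```r
# Toy bigram counts (hypothetical).
toy_bigrams <- c("of the" = 120, "in the" = 95, "to be" = 40,
                 "purple walrus" = 1, "zinc umbrella" = 1)

# Keep N-grams seen at least `min_count` times and report what was lost.
prune_ngrams <- function(counts, min_count = 2) {
  kept <- counts[counts >= min_count]
  list(
    counts          = kept,
    share_kept      = sum(kept) / sum(counts),       # share of occurrences retained
    entries_dropped = length(counts) - length(kept)  # rows removed from the table
  )
}

pruned <- prune_ngrams(toy_bigrams, min_count = 2)
```

Comparing `share_kept` with `entries_dropped` at several thresholds would indicate how much table size can be saved before coverage starts to suffer.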

Shiny App UI: The app will feature a simple text input box. As the user types, the top 3 most likely next words will be displayed instantly.
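
A minimal sketch of that interface, assuming the `shiny` package is installed, might look like this. The input and output IDs (`phrase`, `suggestions`) are arbitrary, and `predict_next` here is a hypothetical stand-in that always returns the same words; the real app would call the prediction model instead.

```r
library(shiny)

# Hypothetical stand-in for the real prediction model.
predict_next <- function(text, k = 3) head(c("the", "to", "and"), k)

ui <- fluidPage(
  titlePanel("Next Word Predictor"),
  textInput("phrase", "Type a phrase:", value = ""),
  h4("Suggested next words"),
  textOutput("suggestions")
)

server <- function(input, output, session) {
  # Re-evaluated automatically whenever the text box changes.
  output$suggestions <- renderText({
    paste(predict_next(input$phrase, k = 3), collapse = "   |   ")
  })
}

app <- shinyApp(ui = ui, server = server)
# To launch interactively: runApp(app)
```

Because `renderText` is reactive, the suggestions refresh as the user types, with no submit button needed.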