Introduction

The goal of this capstone project is to build a next-word prediction model similar to those used in mobile keyboards. The training data come from the HC Corpora collection and include text from blogs, news websites, and Twitter in multiple languages.

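This milestone report loads the three datasets, summarizes their line and word counts, plots the distribution of words per line, and outlines the next steps.

The goal is to demonstrate that we are comfortable working with the data and are on track to build the final prediction model.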

## Data loading

Basic Summaries

Line Counts

| Dataset | Lines |
|---|---:|
| Blogs | 899288 |
| News | 1010242 |
| Twitter | 2360148 |

Word Counts

| Dataset | WordCount |
|---|---:|
| Blogs | 37546250 |
| News | 34762395 |
| Twitter | 30093413 |

Summary Table

| Dataset | Lines | WordCount | AvgWordsPerLine |
|---|---:|---:|---:|
| Blogs | 899288 | 37546250 | 41.75 |
| News | 1010242 | 34762395 | 34.41 |
| Twitter | 2360148 | 30093413 | 12.75 |
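
The summary columns above can be reproduced with a single streaming pass over each file. The following is a minimal Python sketch, not the code used to produce this report: the function names `summarize` and `summarize_file` are hypothetical, and splitting on whitespace is an assumed word-counting rule, so counts from a different tokenizer may differ slightly.

```python
from pathlib import Path


def summarize(lines):
    """Return (line_count, word_count, avg_words_per_line) for an iterable of lines."""
    n_lines = 0
    n_words = 0
    for line in lines:
        n_lines += 1
        # Assumption: a "word" is any whitespace-separated token.
        n_words += len(line.split())
    avg = round(n_words / n_lines, 2) if n_lines else 0.0
    return n_lines, n_words, avg


def summarize_file(path):
    """Stream a UTF-8 text file so the whole corpus never sits in memory."""
    with Path(path).open(encoding="utf-8", errors="replace") as handle:
        return summarize(handle)


# Hypothetical three-line stand-in for one of the corpus files.
demo = ["the quick brown fox", "hello world", "one"]
print(summarize(demo))  # (3, 7, 2.33)
```

Dividing WordCount by Lines in this way gives the AvgWordsPerLine column, for example 37546250 / 899288 ≈ 41.75 for Blogs.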

Exploratory Plots

To reduce processing time, we sample 10,000 lines from each file.
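
The report does not state how the 10,000 lines are drawn. One option that avoids loading a whole file into memory is reservoir sampling, sketched below in Python under that assumption. The function name `sample_lines` and the fixed seed are illustrative choices, not part of the original analysis.

```python
import random


def sample_lines(lines, k, seed=42):
    """Reservoir-sample k items from an iterable in one pass (Algorithm R)."""
    rng = random.Random(seed)  # fixed seed (an assumption) keeps the sample reproducible
    reservoir = []
    for i, line in enumerate(lines):
        if i < k:
            # Fill the reservoir with the first k lines.
            reservoir.append(line)
        else:
            # Keep the new line with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = line
    return reservoir


# Hypothetical stand-in for one corpus file.
corpus = [f"line {i}" for i in range(100_000)]
sample = sample_lines(corpus, k=10_000)
print(len(sample))  # 10000
```

Because the function accepts any iterable, an open file handle can be passed directly in place of the `corpus` list.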

Histogram of Word Counts per Line
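
The data behind such a histogram can be computed by binning each line's word count. This is a small Python sketch using only the standard library; the bin width of 10 and the name `word_count_histogram` are assumptions made for illustration, and the text bars stand in for a real plotting library.

```python
from collections import Counter


def word_count_histogram(lines, bin_width=10):
    """Count lines per word-count bin; each key is the lower edge of a bin."""
    bins = Counter()
    for line in lines:
        n_words = len(line.split())  # whitespace tokenization (an assumption)
        bins[(n_words // bin_width) * bin_width] += 1
    return dict(sorted(bins.items()))


# Hypothetical lines with 3, 12, 25 and 4 words.
demo = ["a b c", "a " * 12, "a " * 25, "short tweet here now"]
bin_width = 10
for lower, count in word_count_histogram(demo, bin_width).items():
    # One '#' per line that falls in the bin.
    print(f"{lower:>3}-{lower + bin_width - 1:<3} {'#' * count}")
```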

## Early Findings

Twitter lines are much shorter than Blogs or News lines (12.75 words per line on average, against 41.75 and 34.41), which is consistent with Twitter's message length limit.

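Blogs have the highest average words per line (41.75), which may reflect longer, more elaborate sentences.

News text falls between the two in line length (34.41 words per line) and is more formal in register, which may influence the word sequences the model learns to predict.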

The dataset is very large (about 4.27 million lines in total), so sampling will be necessary when building the prediction model.

Next Steps

The next phase will focus on:

  1. Cleaning and preprocessing
  1. Building an n-gram model (1–3 word sequences)
  1. Model optimization
  1. Developing the Shiny app
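
The first two steps can be illustrated with a toy example. The Python sketch below is one possible design, not the planned implementation: it lowercases text, keeps only letters and apostrophes, counts which word follows each context of zero to two words, and predicts by backing off from the longest matching context. The names `tokenize`, `build_model` and `predict` are hypothetical, and the cleaning rule and simple backoff are assumptions.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and keep runs of letters/apostrophes (one possible cleaning rule)."""
    return re.findall(r"[a-z']+", text.lower())


def build_model(lines, max_n=3):
    """Map each context tuple (0 to max_n - 1 words) to a Counter of next words."""
    model = defaultdict(Counter)
    for line in lines:
        tokens = tokenize(line)
        for i, word in enumerate(tokens):
            for n in range(max_n):  # context lengths 0, 1, ..., max_n - 1
                if i - n >= 0:
                    context = tuple(tokens[i - n:i])
                    model[context][word] += 1
    return model


def predict(model, text, max_n=3):
    """Back off from the longest matching context down to unigram counts."""
    tokens = tokenize(text)
    for n in range(max_n - 1, -1, -1):
        context = tuple(tokens[-n:]) if n else ()
        # Skip contexts longer than the input or never seen in training.
        if len(context) == n and context in model:
            return model[context].most_common(1)[0][0]
    return None


# Hypothetical training lines.
lines = ["I love new york", "I love new music", "I love new york city"]
model = build_model(lines)
print(predict(model, "we love new"))  # york
print(predict(model, "I"))            # love
```

Storing raw counts per context keeps the sketch short. A full model would likely need pruning of rare n-grams and some form of smoothing, which belongs to the optimization step.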

This milestone shows that we are comfortable with the data and ready to begin model development.