Data Science Capstone – Milestone Report

Introduction

The goal of this capstone project is to build a next-word prediction model similar to those used in mobile keyboards. The training data come from the HC Corpora collection and include text from blogs, news websites, and Twitter in multiple languages.

This milestone report:

Shows that the English data have been downloaded and successfully loaded into R.
Provides basic summaries of the three text files (line counts, word counts, etc.).
Outlines an initial plan for the prediction algorithm and the Shiny application in simple, non-technical language.

Together, these points show that we are comfortable working with the data and on track to build the final prediction model.

Data Loading
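The loading code itself is not shown in this report. As a minimal sketch, the English files could be read into R along these lines. The helper name `read_corpus` and the file names mentioned in the comments are assumptions made for illustration, not details taken from the report. Opening the connection in binary mode and skipping embedded NULs is a common precaution with raw web text.

```r
# Hypothetical helper: read one text file into a character vector,
# one element per line.
read_corpus <- function(path) {
  con <- file(path, open = "rb")   # binary mode avoids early stops on stray control bytes
  on.exit(close(con))
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}

# Runnable demonstration on a temporary file. For the real data the path
# would point at the blogs, news, or Twitter file instead
# (for example "en_US.blogs.txt", an assumed file name).
demo_path <- tempfile(fileext = ".txt")
writeLines(c("first line of text", "second line"), demo_path)

demo_lines <- read_corpus(demo_path)
length(demo_lines)   # number of lines read
```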

Basic Summaries

Line Counts

Dataset	Lines
Blogs	899288
News	1010242
Twitter	2360148

Word Counts

Dataset	WordCount
Blogs	37546250
News	34762395
Twitter	30093413

Summary Table

Dataset	Lines	WordCount	AvgWordsPerLine
Blogs	899288	37546250	41.75
News	1010242	34762395	34.41
Twitter	2360148	30093413	12.75
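The tables above could be produced with a few lines of base R. The sketch below assumes that words are counted by splitting on whitespace. The report does not state the tokenisation it used, so counts from this approach may differ slightly from the reported ones. The helper names `count_words` and `summarise_corpus` and the toy data are hypothetical.

```r
# Count whitespace-separated words in each line (assumed tokenisation).
count_words <- function(lines) {
  lengths(strsplit(trimws(lines), "\\s+"))
}

# Build one row of the summary table for a named dataset.
summarise_corpus <- function(name, lines) {
  wc <- count_words(lines)
  data.frame(
    Dataset         = name,
    Lines           = length(lines),
    WordCount       = sum(wc),
    AvgWordsPerLine = round(sum(wc) / length(lines), 2)
  )
}

# Toy stand-ins for the real files.
toy <- list(
  Blogs   = c("a short blog entry", "another entry with a few more words"),
  Twitter = c("hi there", "ok")
)
summary_table <- do.call(rbind, Map(summarise_corpus, names(toy), toy))
print(summary_table)

# Consistency check on the reported totals: the averages in the summary
# table follow from WordCount / Lines.
reported_avg <- round(c(37546250, 34762395, 30093413) /
                      c(899288, 1010242, 2360148), 2)
print(reported_avg)   # 41.75 34.41 12.75
```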

Exploratory Plots

To reduce processing time, we sample 10,000 lines from each file.

Histogram of Word Counts per Line
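The sampling and histogram step could look roughly like the sketch below. The seed value, the helper name `sample_lines`, and the synthetic `toy_lines` are assumptions for illustration. With the real data, each file's lines would be passed in with the default `n = 10000`.

```r
set.seed(2024)   # arbitrary seed so the sample is reproducible (assumption)

# Draw up to n lines at random, without replacement.
sample_lines <- function(lines, n = 10000) {
  lines[sample.int(length(lines), min(n, length(lines)))]
}

# Synthetic stand-in for a corpus: 500 lines of 1 to 40 words each.
toy_lines <- vapply(
  1:500,
  function(i) paste(rep("word", (i %% 40) + 1), collapse = " "),
  character(1)
)

sampled        <- sample_lines(toy_lines, n = 100)
words_per_line <- lengths(strsplit(trimws(sampled), "\\s+"))

# Compute the histogram bins; draw the plot only in an interactive session.
h <- hist(words_per_line, breaks = 20, plot = FALSE)
if (interactive()) {
  plot(h, main = "Histogram of Word Counts per Line", xlab = "Words per line")
}
```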

Early Findings

Twitter lines are much shorter than Blogs or News lines (about 12.75 words per line on average), most likely because of Twitter's message length limit.

Blogs have the highest average words per line (41.75), which suggests longer, more elaborate entries.

News text sits between the two (34.41 words per line) and is more formal in register, which may influence the word patterns the model learns.

The dataset is very large (about 4.27 million lines and 102 million words in total), so sampling strategies will be necessary when building the prediction model.

Next Steps

The next phase will focus on:

Cleaning and preprocessing

Removing punctuation, numbers, and profanity
Keeping common words (e.g., “the”, “to”, “and”) since they help prediction
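A minimal cleaning sketch in base R is shown below. It lowercases the text, replaces digits and punctuation with spaces, and drops words found on a profanity list, while leaving common words such as "the" and "to" in place. The helper name `clean_line` and the two placeholder words in `banned` are assumptions. A real profanity list would need to be supplied separately. Lowercasing is also an added assumption, and removing punctuation this way splits contractions such as "don't" into "don" and "t".

```r
# Hypothetical cleaning helper for a single line of text.
clean_line <- function(line, banned = c("darn", "heck")) {
  x <- tolower(line)
  x <- gsub("[[:digit:]]+", " ", x)   # remove numbers
  x <- gsub("[[:punct:]]+", " ", x)   # remove punctuation
  tokens <- strsplit(trimws(x), "\\s+")[[1]]
  tokens <- tokens[!(tokens %in% banned)]   # drop listed words, keep stop words
  paste(tokens, collapse = " ")
}

cleaned <- clean_line("The 2 dogs ran to the park, darn it!")
print(cleaned)   # "the dogs ran to the park it"
```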

Building an n-gram model (1–3 word sequences)

Using n-gram frequencies to estimate the most likely next word
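To make the n-gram idea concrete, the sketch below counts 1-, 2-, and 3-word sequences in a toy corpus and predicts the next word by looking for the most frequent trigram that continues the last two words, then falling back to bigrams and finally to the most common single word. This simple back-off rule is an assumption on our part, since the plan above does not fix a particular scheme. All function names and the toy corpus are hypothetical.

```r
# All n-word sequences in one tokenised line.
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Frequency table of n-grams over a set of (already cleaned) lines.
count_ngrams <- function(lines, n) {
  token_list <- strsplit(trimws(lines), "\\s+")
  table(unlist(lapply(token_list, make_ngrams, n = n)))
}

# Predict the next word: try trigrams, then bigrams, then unigrams.
predict_next <- function(text, counts_by_n) {
  tokens <- strsplit(trimws(tolower(text)), "\\s+")[[1]]
  for (n in 3:2) {
    k <- n - 1                                   # context length
    if (length(tokens) < k) next
    prefix <- paste(tail(tokens, k), collapse = " ")
    tab    <- counts_by_n[[n]]
    hits   <- tab[startsWith(names(tab), paste0(prefix, " "))]
    if (length(hits) > 0) {
      best <- names(hits)[which.max(hits)]
      return(tail(strsplit(best, " ")[[1]], 1))  # last word of the best n-gram
    }
  }
  uni <- counts_by_n[[1]]
  names(uni)[which.max(uni)]                     # most common single word
}

corpus <- c("i love new york",
            "i love new ideas",
            "i love new york pizza",
            "we love pizza")
counts_by_n <- lapply(1:3, function(n) count_ngrams(corpus, n))

predict_next("I love new", counts_by_n)   # "york"
predict_next("they love", counts_by_n)    # "new"
```

In the first call the trigram "love new york" (seen twice) beats "love new ideas" (seen once). In the second call no trigram starts with "they love", so the sketch backs off to bigrams starting with "love".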

Model optimization

Reducing memory by trimming rare word combinations
Ensuring fast prediction speeds for deployment
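Both optimization ideas can be sketched on a small table of trigram counts. Pruning drops combinations seen fewer than `min_count` times, and a precomputed lookup keeps only the best next word for each two-word context, so prediction becomes a single lookup. The threshold of 2, the helper names, and the toy counts are assumptions. A named vector is used here for readability, and a hashed structure such as an environment may be preferable at full scale.

```r
# Toy trigram counts (hypothetical values).
counts <- c("love new york"  = 2L,
            "love new ideas" = 1L,
            "i love new"     = 3L,
            "we love pizza"  = 1L,
            "new york pizza" = 1L)

# Keep only n-grams seen at least min_count times.
prune_counts <- function(counts, min_count = 2) {
  counts[counts >= min_count]
}

# For each context (all words but the last), keep the most frequent next word.
build_lookup <- function(counts) {
  parts  <- strsplit(names(counts), " ")
  prefix <- vapply(parts, function(p) paste(head(p, -1), collapse = " "), character(1))
  last   <- vapply(parts, function(p) tail(p, 1), character(1))
  ord    <- order(counts, decreasing = TRUE)   # most frequent first
  keep   <- ord[!duplicated(prefix[ord])]      # first hit per context wins
  setNames(last[keep], prefix[keep])
}

pruned <- prune_counts(counts, min_count = 2)
lookup <- build_lookup(counts)

length(counts)          # 5 trigrams before pruning
length(pruned)          # 2 trigrams after pruning
lookup[["love new"]]    # "york"
```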

Developing the Shiny app

User types text
Model suggests the next word in real time
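A bare-bones version of the planned app might look like the sketch below, assuming the `shiny` package is installed. The function `build_app`, the input and output IDs, and the placeholder predictor `stub_predict` are all hypothetical. In the final app, the trained model's prediction function would be passed in instead of the stub.

```r
# Build a Shiny app around any function that maps typed text to a next word.
build_app <- function(predict_fn) {
  ui <- shiny::fluidPage(
    shiny::textInput("phrase", "Type a phrase:"),
    shiny::textOutput("suggestion")
  )
  server <- function(input, output, session) {
    # Re-runs whenever the text box changes, so the suggestion updates as the user types.
    output$suggestion <- shiny::renderText({
      if (!nzchar(trimws(input$phrase))) "" else predict_fn(input$phrase)
    })
  }
  shiny::shinyApp(ui = ui, server = server)
}

# Placeholder predictor that always suggests "the" (stand-in for the real model).
stub_predict <- function(text) "the"

# Launch only in an interactive session where shiny is available.
if (interactive() && requireNamespace("shiny", quietly = TRUE)) {
  shiny::runApp(build_app(stub_predict))
}
```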

This milestone shows that we are comfortable with the data and ready to begin model development.