---
title: "Data Science Capstone Milestone Report: Exploratory Analysis and Prediction Algorithm"
author: "Your Name"
date: "2026-06-24"
output:
  html_document:
    theme: cosmo
    highlight: tango
---

Introduction

Project Overview

The objective of this project is to build a text prediction model that predicts the next word in a sentence as the user types. Text prediction algorithms have become a core part of modern communication tools and are widely used in search engines, smartphones, and messaging apps. The final capstone deliverable will consist of two parts:

Algorithm for predicting the most likely next word.

Interactive Shiny app for testing and using the prediction algorithm.

This milestone report focuses on exploring the text dataset provided by SwiftKey. The exploration should give insight into the properties of the data that matter for building an effective prediction algorithm.

The dataset contains three large English text files:

Blogs

News articles

Twitter messages

These files collectively contain millions of words and represent different writing styles and language patterns.

Objectives of the Milestone Report

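The key purposes of this milestone report are to show that the data has been loaded successfully, to present summary statistics for each file, to report the main findings of the exploratory analysis, and to outline the plans for the prediction algorithm and the Shiny app.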

Loading the Data

The data files used in this analysis are:

File Name            Description
en_US.blogs.txt      Blog posts
en_US.news.txt       News articles
en_US.twitter.txt    Twitter messages

The files were imported into R using the readLines() function.

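# encoding marks the text as UTF-8; skipNul skips any embedded NUL bytes instead of warning
blogs   <- readLines("en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)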

Summary Statistics

The following code computes the number of lines, the number of words, and the file size in megabytes for each of the three files, using the full data.

library(stringi)

stats <- data.frame(
File = c("Blogs","News","Twitter"),
Lines = c(length(blogs),
          length(news),
          length(twitter)),
Words = c(sum(stri_count_words(blogs)),
          sum(stri_count_words(news)),
          sum(stri_count_words(twitter))),
SizeMB = c(
file.info("en_US.blogs.txt")$size/1024^2,
file.info("en_US.news.txt")$size/1024^2,
file.info("en_US.twitter.txt")$size/1024^2
)
)

knitr::kable(stats)

Interpretation

The Twitter data set has the most lines, while the Blogs data set has the highest average number of words per line. The News data falls between the two.
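The average words per line behind this comparison can be computed by dividing the word count by the line count. The sketch below uses made-up counts purely for illustration (they are not the figures from the real files), and `toyStats` is a hypothetical stand-in for the `stats` data frame built above.

```r
# Illustrative counts only; these are NOT the real corpus figures
toyStats <- data.frame(
  File  = c("Blogs", "News", "Twitter"),
  Lines = c(100, 120, 300),
  Words = c(4000, 3600, 3900)
)

# Average words per line for each source
toyStats$AvgWordsPerLine <- toyStats$Words / toyStats$Lines

# Sort so the source with the longest lines comes first
toyStats[order(-toyStats$AvgWordsPerLine), ]
```

Applied to the real table, the same division would add an average column that could be reported alongside the line and word counts.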

Sampling the Data

Because the complete data set is very large, a random 1% sample of the lines in each file was taken for further analysis.

set.seed(123)

sampleData <- c(
sample(blogs, length(blogs)*0.01),
sample(news, length(news)*0.01),
sample(twitter, length(twitter)*0.01)
)

Sampling reduces the computation required while approximately preserving the properties of the full data set.
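As a small illustration of why the seed matters, the sketch below draws a fixed fraction of lines twice with the same seed and checks that the two samples match. The helper `sampleFraction` and the toy input are assumptions made for this example; they are not part of the report's code.

```r
# Hypothetical helper: draw a fixed fraction of lines without replacement
sampleFraction <- function(x, fraction) {
  n <- max(1, floor(length(x) * fraction))  # explicit integer sample size
  sample(x, n)
}

# Toy input standing in for one of the text files
toyLines <- paste("line", 1:1000)

set.seed(123)
first <- sampleFraction(toyLines, 0.01)

set.seed(123)
second <- sampleFraction(toyLines, 0.01)

length(first)             # 10 lines, i.e. 1% of 1000
identical(first, second)  # TRUE: the same seed reproduces the same sample
```

Making the sample size an explicit integer avoids relying on `sample()` to truncate a fractional size silently.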

Data Cleaning

The data cleaning steps are converting the text to lowercase, removing punctuation, removing numbers, and collapsing extra whitespace:

library(tm)

corpus <- VCorpus(VectorSource(sampleData))

corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
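To see what these four transformations do to a single line, the following base-R sketch approximates them with regular expressions. It is an illustration rather than the tm pipeline itself: `cleanText` is a hypothetical helper, and the final `trimws()` call is an extra step not present in the code above.

```r
# Base-R approximation of the four cleaning steps (illustration only)
cleanText <- function(x) {
  x <- tolower(x)                     # convert to lowercase
  x <- gsub("[[:punct:]]+", "", x)    # remove punctuation
  x <- gsub("[[:digit:]]+", "", x)    # remove numbers
  x <- gsub("[[:space:]]+", " ", x)   # collapse runs of whitespace
  trimws(x)                           # extra step: drop leading/trailing spaces
}

cleanText("Hello, World!  It's 2024.")
# "hello world its"
```

Note that removing punctuation turns "it's" into "its", so contractions merge with other words. Whether that is acceptable depends on how the prediction model should treat apostrophes.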

Word Count Distribution

The distribution of words per line shows how long typical entries are.

# stri_count_words() is vectorized, so no sapply() is needed
wordCount <- stri_count_words(sampleData)

hist(wordCount,
     col="lightblue",
     main="Distribution of Words Per Line",
     xlab="Words",
     ylab="Frequency")

Interpretation

The histogram shows that most lines contain fewer than 30 words. A small number of lines are much longer, which gives the distribution a long right tail.
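A reading taken from a histogram can be backed up with a couple of numbers. The sketch below shows one way to compute the share of lines under 30 words and a few quantiles. It uses invented toy lines and a whitespace-based counter (`countWords`, a hypothetical base-R stand-in for `stri_count_words`), so the values are illustrative only.

```r
# Toy lines standing in for sampleData
toyLines <- c("short line",
              "a slightly longer line of text",
              "one",
              paste(rep("word", 45), collapse = " "))

# Whitespace-based word count (base-R stand-in for stri_count_words)
countWords <- function(x) lengths(strsplit(trimws(x), "[[:space:]]+"))

toyCounts <- countWords(toyLines)

mean(toyCounts < 30)                    # share of lines under 30 words
quantile(toyCounts, c(0.5, 0.9, 0.99))  # median and upper-tail lengths
```

Running the same two lines on `wordCount` would quantify both the "fewer than 30 words" observation and the length of the tail.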

Most Frequent Words

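library(RWeka)  # provides NGramTokenizer(), used in the n-gram sections

# Keep words of any length; by default tm drops terms shorter than three characters
tdm <- TermDocumentMatrix(corpus,
                          control = list(wordLengths = c(1, Inf)))

# Sum counts on the sparse matrix; as.matrix() would build a large dense matrix
freq <- sort(slam::row_sums(tdm),
             decreasing = TRUE)

freqData <- data.frame(
  word = names(freq),
  freq = freq
)

head(freqData, 20)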
library(ggplot2)

ggplot(head(freqData,20),
       aes(reorder(word,freq),freq))+
geom_bar(stat="identity",
fill="steelblue")+
coord_flip()+
labs(title="Top 20 Most Frequent Words",
x="Words",
y="Frequency")

The most common words are generally stop words, that is, function words such as articles, prepositions, and pronouns.

These words dominate the corpus because they appear in nearly every sentence.

Bigram Analysis

Bigrams are sequences of two consecutive words.

# Tokenizer that emits two-word sequences
bigramTokenizer <- function(x)
  NGramTokenizer(x, Weka_control(min = 2, max = 2))

bigram <- TermDocumentMatrix(
  corpus,
  control = list(tokenize = bigramTokenizer)
)

# Sum counts on the sparse matrix instead of converting to a dense matrix
bigramFreq <- sort(slam::row_sums(bigram), decreasing = TRUE)

bigramData <- data.frame(
  bigram = names(bigramFreq),
  freq = bigramFreq
)

ggplot(head(bigramData, 20),
       aes(reorder(bigram, freq), freq)) +
  geom_bar(stat = "identity", fill = "orange") +
  coord_flip() +
  labs(title = "Top 20 Bigrams",
       x = "Bigrams",
       y = "Frequency")

Trigram Analysis

Trigrams are sequences of three consecutive words.

# Tokenizer that emits three-word sequences
trigramTokenizer <- function(x)
  NGramTokenizer(x, Weka_control(min = 3, max = 3))

trigram <- TermDocumentMatrix(
  corpus,
  control = list(tokenize = trigramTokenizer)
)

# Sum counts on the sparse matrix instead of converting to a dense matrix
trigramFreq <- sort(slam::row_sums(trigram), decreasing = TRUE)

trigramData <- data.frame(
  trigram = names(trigramFreq),
  freq = trigramFreq
)

ggplot(head(trigramData, 20),
       aes(reorder(trigram, freq), freq)) +
  geom_bar(stat = "identity", fill = "forestgreen") +
  coord_flip() +
  labs(title = "Top 20 Trigrams",
       x = "Trigrams",
       y = "Frequency")
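To make the idea of bigrams and trigrams concrete, the sketch below builds n-grams from two toy lines using only base R. The helper `makeNgrams` is hypothetical and is not how `NGramTokenizer` is implemented; it only illustrates the sliding-window idea behind the analysis above.

```r
# Hypothetical helper: build all n-grams from one cleaned line
makeNgrams <- function(line, n) {
  tokens <- strsplit(trimws(line), "[[:space:]]+")[[1]]
  if (length(tokens) < n) return(character(0))  # line too short for this order
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Invented example lines
toyLines <- c("thanks for the follow", "thanks for the help")

bigrams  <- unlist(lapply(toyLines, makeNgrams, n = 2))
trigrams <- unlist(lapply(toyLines, makeNgrams, n = 3))

sort(table(bigrams), decreasing = TRUE)
sort(table(trigrams), decreasing = TRUE)
```

Because n-grams are built one line at a time, no n-gram spans the boundary between two unrelated lines.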

Word Cloud

library(wordcloud)

wordcloud(
words=names(freq),
freq=freq,
max.words=100,
colors=rainbow(10)
)

The word cloud provides a visual representation of the most frequently occurring words.

Interesting Findings

Several interesting observations emerged during the exploratory analysis:

  1. Twitter contains the largest number of records.
  2. Blog posts contain the longest sentences.
  3. A relatively small number of words account for a significant percentage of the corpus.
  4. Common phrases repeat frequently, making n-gram modelling suitable.
  5. Data cleaning substantially reduces the vocabulary size.
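The third finding can be quantified by asking how many of the most frequent words are needed to cover a given share of all word occurrences. The sketch below shows one way to do this. The helper `wordsForCoverage` is hypothetical and the counts are invented for illustration; on the real data it would be applied to the `freq` vector computed earlier.

```r
# Hypothetical helper: number of top words needed to reach a coverage target
wordsForCoverage <- function(freq, target) {
  freq <- sort(freq, decreasing = TRUE)
  coverage <- cumsum(freq) / sum(freq)  # cumulative share of all occurrences
  unname(which(coverage >= target)[1])
}

# Illustrative counts only
toyFreq <- c(the = 50, to = 20, and = 15, cat = 10, zebra = 5)

wordsForCoverage(toyFreq, 0.5)  # 1: "the" alone covers 50%
wordsForCoverage(toyFreq, 0.9)  # 4: the first four words cover 95%
```

A coverage figure like this also helps decide how much of the vocabulary the prediction model needs to keep.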

Proposed Prediction Algorithm

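The prediction algorithm will be based on an n-gram language model.

N-gram models of several orders will be constructed from the cleaned corpus, so that the algorithm can fall back from longer contexts to shorter ones.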

The prediction process will follow these steps:

  1. User enters text.
  2. Last one, two, or three words are extracted.
  3. Matching n-grams are searched.
  4. Most probable next word is returned.
  5. If no match exists, the algorithm backs off to lower-order n-grams.

This approach is computationally efficient and widely used in predictive text applications.
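The steps above can be sketched with small lookup tables. This is a minimal illustration under stated assumptions, not the final algorithm: the tables, their entries, and the helpers `lookup` and `predictNext` are all invented for the example, and for brevity the longest context used here is two words (a trigram model).

```r
# Toy n-gram tables: each row maps a context to a next word with a count.
# All entries are invented for illustration.
trigramTable <- data.frame(
  context  = c("thanks for", "thanks for", "one of"),
  nextWord = c("sharing", "the", "the"),
  count    = c(5, 3, 4),
  stringsAsFactors = FALSE
)
bigramTable <- data.frame(
  context  = c("for", "for", "of"),
  nextWord = c("the", "a", "course"),
  count    = c(9, 6, 8),
  stringsAsFactors = FALSE
)
unigramTable <- data.frame(
  nextWord = c("the", "to"),
  count    = c(20, 12),
  stringsAsFactors = FALSE
)

# Return the most frequent next word for a context, or NULL if unseen
lookup <- function(tbl, ctx) {
  hits <- tbl[tbl$context == ctx, ]
  if (nrow(hits) == 0) return(NULL)
  hits$nextWord[which.max(hits$count)]
}

predictNext <- function(text) {
  tokens <- strsplit(trimws(tolower(text)), "[[:space:]]+")[[1]]
  n <- length(tokens)
  # Try the longest context first (last two words)
  if (n >= 2) {
    res <- lookup(trigramTable, paste(tokens[(n - 1):n], collapse = " "))
    if (!is.null(res)) return(res)
  }
  # Back off to the last word only
  if (n >= 1) {
    res <- lookup(bigramTable, tokens[n])
    if (!is.null(res)) return(res)
  }
  # Final fallback: the most frequent single word
  unigramTable$nextWord[which.max(unigramTable$count)]
}

predictNext("Thanks for")  # trigram match: "sharing"
predictNext("a cup of")    # no trigram for "cup of"; bigram "of" gives "course"
predictNext("hello")       # nothing matches; falls back to "the"
```

This sketch returns only the single most frequent continuation. Returning several candidates, or weighting the lower-order matches, would be natural extensions.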

Proposed Shiny Application

The Shiny application will contain:

Text Input Box

Allows users to enter text.

Prediction Button

Initiates prediction.

Predicted Words Section

Displays the most likely next words.

Additional Features

  • Fast response time.
  • User-friendly interface.
  • Multiple predictions.
  • Responsive design.
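A minimal skeleton of such an application might look like the sketch below. It is only an outline of the three components listed above. The input and output IDs are arbitrary choices, and `predictNext` here is a placeholder that returns fixed words, standing in for the real prediction function.

```r
library(shiny)

# Placeholder predictor: returns fixed words, NOT real predictions
predictNext <- function(text, k = 3) {
  head(c("the", "to", "and"), k)
}

ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("text", "Enter text:"),         # text input box
  actionButton("go", "Predict"),            # prediction button
  h4("Predicted next words"),
  verbatimTextOutput("prediction")          # predicted words section
)

server <- function(input, output, session) {
  # Recompute only when the button is pressed
  predicted <- eventReactive(input$go, predictNext(input$text))
  output$prediction <- renderText(paste(predicted(), collapse = ", "))
}

# Uncomment to launch the app interactively:
# shinyApp(ui, server)
```

Tying the prediction to the button with `eventReactive()` keeps the app from recomputing on every keystroke. A design that predicts while typing would instead read `input$text` directly in a reactive expression.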

Conclusion

This exploratory analysis demonstrates that the text corpus contains sufficient information to build an effective next-word prediction model. Summary statistics, visualizations, and n-gram analysis reveal significant patterns in the data. The next phase of the project will involve developing and optimizing the prediction algorithm and integrating it into an interactive Shiny application.

The final application aims to provide accurate and real-time next-word predictions using statistical language modelling techniques.