---
title: "Data Science Capstone Milestone Report: Exploratory Analysis and Prediction Algorithm"
author: "Your Name"
date: "2026-06-24"
output:
  html_document:
    theme: cosmo
    highlight: tango
---

Introduction

Project Overview

The objective of this project is to build a text prediction model that predicts the next word in a sentence as a user types. Text prediction algorithms have become a core part of modern communication tools and are widely used in search engines, smartphones, and messaging apps. The final product of this capstone project will consist of two parts:

An algorithm for predicting the most likely next word.

An interactive Shiny app for testing and using the prediction algorithm.

This milestone report focuses on exploring the text dataset from SwiftKey. The exploration should reveal properties of the data that will guide the design of an effective prediction algorithm.

The dataset contains three large English text files:

Blogs

News articles

Twitter messages

These files collectively contain millions of words and represent different writing styles and language patterns.

Objectives of the Milestone Report

The key goals of this milestone report are:

Downloading and loading the dataset.
Performing exploratory data analysis.
Calculating summary statistics.
Visualizing features of the data.
Describing the planned prediction algorithm.
Describing the planned Shiny app design.

Loading the Data

The data files used in this analysis are:

File Name	Description
en_US.blogs.txt	Blog posts
en_US.news.txt	News articles
en_US.twitter.txt	Twitter messages

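The files were read into R with the readLines() function, declaring UTF-8 encoding and skipping any embedded NUL characters so that no line is truncated or mis-decoded.

# Read each file as UTF-8; skipNul = TRUE skips embedded NUL characters
blogs <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)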

Summary Statistics

The table below reports the number of lines, the number of words, and the file size in megabytes for each complete file.

library(stringi)

stats <- data.frame(
  File = c("Blogs", "News", "Twitter"),
  Lines = c(length(blogs),
            length(news),
            length(twitter)),
  Words = c(sum(stri_count_words(blogs)),
            sum(stri_count_words(news)),
            sum(stri_count_words(twitter))),
  SizeMB = c(
    file.info("en_US.blogs.txt")$size/1024^2,
    file.info("en_US.news.txt")$size/1024^2,
    file.info("en_US.twitter.txt")$size/1024^2
  )
)

knitr::kable(stats)

Interpretation

The Twitter file has the most lines, while the Blogs file has the highest average number of words per line. The News file falls between the two.
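The average words per line mentioned above is not a column of the summary table. A minimal sketch of how it could be derived is shown here, using a small made-up data frame (`stats_demo`, a hypothetical stand-in for `stats`); the numbers are placeholders, not results from the dataset.

```r
# Toy stand-in for the `stats` table; the counts are made-up placeholders
stats_demo <- data.frame(
  File  = c("A", "B"),
  Lines = c(4, 10),
  Words = c(100, 120)
)

# Average words per line = total words / total lines
stats_demo$WordsPerLine <- stats_demo$Words / stats_demo$Lines
stats_demo
```

The same one-line division could presumably be applied to the real `stats` data frame before it is passed to knitr::kable().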

Sampling the Data

Because the complete data set is too large to process comfortably in memory, a random 1% sample of the lines in each file was taken for further analysis.

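set.seed(123)  # fixed seed so the sample is reproducible

# Draw 1% of the lines from each source, without replacement
sampleData <- c(
  sample(blogs, floor(length(blogs)*0.01)),
  sample(news, floor(length(news)*0.01)),
  sample(twitter, floor(length(twitter)*0.01))
)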

Sampling reduces the amount of computation needed while approximately preserving the properties of the full data set.

Data Cleaning

The steps involved in data cleaning include:

Converting to lower case
Eliminating punctuation marks
Eliminating numbers
Eliminating extra white spaces
Eliminating profanity
Tokenizing into words

library(tm)

corpus <- VCorpus(VectorSource(sampleData))

corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
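The cleaning steps listed above include profanity removal and tokenization, which the tm calls do not cover. The following is a minimal base R sketch of the whole sequence applied to a single line. The function name `clean_line` and the word list `bad_words` are hypothetical; a real filter would load a curated profanity list rather than the two placeholder words used here.

```r
# Hypothetical placeholder list; a real filter would load a curated word list
bad_words <- c("darn", "heck")

clean_line <- function(x, banned = bad_words) {
  x <- tolower(x)                        # convert to lower case
  x <- gsub("[[:punct:]]+", "", x)       # drop punctuation
  x <- gsub("[[:digit:]]+", "", x)       # drop numbers
  x <- gsub("\\s+", " ", trimws(x))      # collapse extra whitespace
  tokens <- strsplit(x, " ", fixed = TRUE)[[1]]    # tokenize into words
  tokens[nzchar(tokens) & !(tokens %in% banned)]   # drop empties and banned words
}

clean_line("Well, DARN it: 3 cats  ran!")
```

Within the tm pipeline itself, the profanity step would most likely be expressed with tm's removeWords transformation and the same kind of word list.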

Word Count Distribution

The distribution of words per line shows how long individual entries (blog posts, news items, and tweets) tend to be.

# stri_count_words() is vectorized, so no sapply() wrapper is needed
wordCount <- stri_count_words(sampleData)

hist(wordCount,
     col="lightblue",
     main="Distribution of Words Per Line",
     xlab="Words",
     ylab="Frequency")

Interpretation

The histogram shows that most lines contain fewer than 30 words. A small number of lines are much longer, which gives the distribution a long right tail.
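A reading taken from a histogram can be checked numerically. The sketch below uses a made-up vector (`word_count_demo`, a hypothetical stand-in for `wordCount`) to show the two calculations that would back the statement above; the values are illustrative only.

```r
# Toy word counts standing in for `wordCount`; values are illustrative only
word_count_demo <- c(3, 8, 12, 15, 22, 27, 31, 45, 9, 18)

# Share of lines with fewer than 30 words
share_under_30 <- mean(word_count_demo < 30)
share_under_30

# Median and upper quantiles show how long the right tail is
quantile(word_count_demo, probs = c(0.5, 0.9, 0.99))
```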

Most Frequent Words

library(RWeka)

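tdm <- TermDocumentMatrix(corpus)

# Sum term counts directly on the sparse matrix; as.matrix() would build a
# dense terms-by-documents matrix that can exhaust memory on a large sample
freq <- sort(slam::row_sums(tdm),
             decreasing=TRUE)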

freqData <- data.frame(
word=names(freq),
freq=freq
)

head(freqData,20)

library(ggplot2)

ggplot(head(freqData,20),
       aes(reorder(word,freq),freq))+
geom_bar(stat="identity",
fill="steelblue")+
coord_flip()+
labs(title="Top 20 Most Frequent Words",
x="Words",
y="Frequency")

The most common words are generally stop words. These words dominate the corpus because they appear in nearly every sentence.

Bigram Analysis

Bigrams are sequences of two adjacent words.

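# Tokenizer that emits only two-word n-grams
bigramTokenizer <- function(x)
  NGramTokenizer(x, Weka_control(min=2, max=2))

bigram <- TermDocumentMatrix(
  corpus,
  control=list(tokenize=bigramTokenizer)
)

# Sum counts on the sparse matrix to avoid a dense as.matrix() conversion
bigramFreq <- sort(
  slam::row_sums(bigram),
  decreasing=TRUE
)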

bigramData <- data.frame(
bigram=names(bigramFreq),
freq=bigramFreq
)

ggplot(head(bigramData,20),
aes(reorder(bigram,freq),freq))+
geom_bar(stat="identity",
fill="orange")+
coord_flip()+
labs(title="Top 20 Bigrams")
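To make the counting step concrete, here is a dependency-free sketch of n-gram extraction in base R. It is an illustration of the idea, not the RWeka tokenizer used above. The helper `make_ngrams` and the toy input `lines_demo` are hypothetical, and the sketch assumes the text is already cleaned and that n-grams should not cross line boundaries.

```r
# Build all n-grams from a vector of tokens (returns character(0) if too short)
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Toy lines, already lower-cased and stripped of punctuation
lines_demo <- c("i love new york", "i love new music")

# Tokenize each line separately so n-grams never span two lines
tokens_by_line <- strsplit(lines_demo, " ", fixed = TRUE)
bigrams <- unlist(lapply(tokens_by_line, make_ngrams, n = 2))

# Frequency table, most common first
bigram_counts <- sort(table(bigrams), decreasing = TRUE)
bigram_counts
```

Calling the same helper with n = 3 or n = 4 would give the higher-order n-grams.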

Trigram Analysis

Trigrams are sequences of three adjacent words.

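# Tokenizer that emits only three-word n-grams
trigramTokenizer <- function(x)
  NGramTokenizer(x, Weka_control(min=3, max=3))

trigram <- TermDocumentMatrix(
  corpus,
  control=list(tokenize=trigramTokenizer)
)

# Sum counts on the sparse matrix to avoid a dense as.matrix() conversion
trigramFreq <- sort(
  slam::row_sums(trigram),
  decreasing=TRUE
)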

trigramData <- data.frame(
trigram=names(trigramFreq),
freq=trigramFreq
)

ggplot(head(trigramData,20),
aes(reorder(trigram,freq),freq))+
geom_bar(stat="identity",
fill="forestgreen")+
coord_flip()+
labs(title="Top 20 Trigrams")

Word Cloud

library(wordcloud)

wordcloud(
words=names(freq),
freq=freq,
max.words=100,
colors=rainbow(10)
)

The word cloud provides a visual representation of the most frequently occurring words.

Interesting Findings

Several interesting observations emerged during the exploratory analysis:

The Twitter file contains the largest number of lines.
Blog posts have the most words per line on average.
A relatively small number of words account for a significant percentage of the corpus.
Common phrases repeat frequently, making n-gram modelling suitable.
Data cleaning substantially reduces the vocabulary size.
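The observation that a small number of words accounts for much of the corpus can be quantified as word coverage: how many of the top-ranked words are needed to cover a given share of all word occurrences. The sketch below is a hedged illustration. `words_for_coverage` is a hypothetical helper and `freq_demo` is a made-up frequency vector standing in for `freq`.

```r
# Toy frequency vector standing in for `freq`; counts are illustrative only
freq_demo <- c(the = 50, and = 30, cat = 10, sat = 5, mat = 3, zebra = 2)

# Number of top-ranked words needed to cover a given share of all occurrences
words_for_coverage <- function(freq, share) {
  freq <- sort(freq, decreasing = TRUE)
  cumulative <- cumsum(freq) / sum(freq)   # running share of the corpus
  unname(which(cumulative >= share)[1])    # first rank reaching the target
}

words_for_coverage(freq_demo, 0.5)
words_for_coverage(freq_demo, 0.9)
```

Applied to the real frequency vector, the same function would show how small a vocabulary the prediction model could keep while still covering most of the text.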

Proposed Prediction Algorithm

The prediction algorithm will be based on the n-gram language model.

The following models will be constructed:

Unigram model
Bigram model
Trigram model
Four-gram model

The prediction process will follow these steps:

User enters text.
Last one, two, or three words are extracted.
Matching n-grams are searched.
Most probable next word is returned.
If no match exists, the algorithm backs off to lower-order n-grams.

This approach is computationally efficient and widely used in predictive text applications.
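The steps above can be sketched in base R as follows. This is a simple highest-count backoff under stated assumptions, not a finished model: it applies no smoothing or discounting, the function name `predict_next` is hypothetical, and `ngram_counts` is a tiny made-up table standing in for counts built from the cleaned corpus.

```r
# Toy n-gram counts; a real model would be built from the cleaned corpus
ngram_counts <- c(
  "the" = 5, "of" = 3,
  "thanks for" = 4, "for the" = 4, "for a" = 6,
  "thanks for the" = 3, "thanks for your" = 1
)

predict_next <- function(text, counts, max_order = 4) {
  words <- strsplit(tolower(trimws(text)), "\\s+")[[1]]
  words <- words[nzchar(words)]
  # Try the longest usable context first, then back off to shorter ones
  for (k in seq(min(max_order - 1, length(words)), 0)) {
    if (k == 0) {
      # No context left: fall back to the most frequent single word
      unigrams <- counts[!grepl(" ", names(counts), fixed = TRUE)]
      return(names(unigrams)[which.max(unigrams)])
    }
    prefix <- paste0(paste(tail(words, k), collapse = " "), " ")
    hits <- counts[startsWith(names(counts), prefix)]
    # Keep only n-grams exactly one word longer than the context
    hits <- hits[lengths(strsplit(names(hits), " ", fixed = TRUE)) == k + 1]
    if (length(hits) > 0) {
      best <- names(hits)[which.max(hits)]
      return(sub(prefix, "", best, fixed = TRUE))
    }
  }
}

predict_next("Thanks for", ngram_counts)   # matched by a trigram
predict_next("waiting for", ngram_counts)  # backs off to a bigram
predict_next("hello zebra", ngram_counts)  # backs off to the unigram model
```

Scanning a named vector is fine for a toy table. For corpus-sized counts, a keyed lookup structure would be the more likely choice to keep response times low.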

Proposed Shiny Application

The Shiny application will contain:

Text Input Box

Allows users to enter text.

Prediction Button

Initiates prediction.

Predicted Words Section

Displays the most likely next words.

Additional Features

Fast response time.
User-friendly interface.
Multiple predictions.
Responsive design.
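A minimal sketch of how these components might be wired together in Shiny is given below. It is an assumption about the layout, not the final design: the input and output IDs (`phrase`, `go`, `predictions`) and the function `predict_top3` are hypothetical, and the predictor returns fixed placeholder words rather than real model output.

```r
library(shiny)

# Hypothetical stand-in for the real predictor; returns fixed placeholder words
predict_top3 <- function(text) {
  c("the", "and", "to")
}

ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("phrase", "Enter text:"),        # text input box
  actionButton("go", "Predict"),             # prediction button
  verbatimTextOutput("predictions")          # predicted words section
)

server <- function(input, output, session) {
  # Recompute only when the button is clicked
  result <- eventReactive(input$go, {
    predict_top3(input$phrase)
  })
  output$predictions <- renderText({
    paste(result(), collapse = ", ")
  })
}

# Build the app object; launch it interactively with runApp(app)
app <- shinyApp(ui, server)
```

Tying the prediction to the button through eventReactive() keeps the model from being called on every keystroke, which is one way to keep the interface responsive.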

Conclusion

This exploratory analysis demonstrates that the text corpus contains sufficient information to build an effective next-word prediction model. Summary statistics, visualizations, and n-gram analysis reveal significant patterns in the data. The next phase of the project will involve developing and optimizing the prediction algorithm and integrating it into an interactive Shiny application.

The final application aims to provide accurate and real-time next-word predictions using statistical language modelling techniques.
