---
title: "Data Science Capstone Milestone Report: Exploratory Analysis and Prediction Algorithm"
author: "Your Name"
date: "2026-06-24"
output:
  html_document:
    theme: cosmo
    highlight: tango
---
Project Overview
The objective of this project is to build a text prediction model that predicts the next word of a sentence as a user types it. Text prediction algorithms have become a core part of modern communication tools and are widely used in search engines, smartphones, and messaging apps. The final capstone deliverable will consist of two parts:
Algorithm for predicting the most likely next word.
Interactive Shiny app for testing and using the prediction algorithm.
This milestone report focuses on exploring the SwiftKey text dataset. The exploration should reveal properties of the data that will inform the design of an effective prediction algorithm.
The dataset contains three large English text files:
Blogs
News articles
Twitter messages
These files collectively contain millions of words and represent different writing styles and language patterns.
The key purposes of this milestone report include:
Download and loading of the dataset.
Exploratory data analysis.
Calculation of summary statistics.
Visualization of the features of the data.
Description of the prediction algorithm to be used.
Description of the Shiny app design.
The data files used in this analysis are:
| File Name | Description |
|---|---|
| en_US.blogs.txt | Blog posts |
| en_US.news.txt | News articles |
| en_US.twitter.txt | Twitter messages |
The files were imported into R using the readLines() function.
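# encoding and skipNul guard against mis-decoded characters and embedded NUL bytes
blogs <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)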
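On some platforms, reading a large text file in text mode can stop early when the file contains certain control characters. A minimal sketch of a more defensive reader is given here. It assumes a hypothetical helper named `read_lines_binary` (not part of the original analysis) that opens the connection in binary mode before calling `readLines()`. The temporary file only stands in for the real data files.

```r
# Hypothetical helper: read all lines through a binary-mode connection.
read_lines_binary <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}

# Small demonstration on a temporary file standing in for a data file.
tmp <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line"), tmp)
demo_lines <- read_lines_binary(tmp)
length(demo_lines)
```

If the line counts reported for a file look suspiciously low, comparing the result of a reader like this with the plain `readLines()` call is a quick check.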
Summary statistics (line counts, word counts, and file sizes in MB) were then computed for each file.
library(stringi)
stats <- data.frame(
File = c("Blogs","News","Twitter"),
Lines = c(length(blogs),
length(news),
length(twitter)),
Words = c(sum(stri_count_words(blogs)),
sum(stri_count_words(news)),
sum(stri_count_words(twitter))),
SizeMB = c(
file.info("en_US.blogs.txt")$size/1024^2,
file.info("en_US.news.txt")$size/1024^2,
file.info("en_US.twitter.txt")$size/1024^2
)
)
knitr::kable(stats)
The Twitter file has the most lines, while the Blogs file has the highest average number of words per line. The News file falls in between.
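The average number of words per line is not a column of the summary table, so it may help to see how it can be derived. The sketch below uses toy data and a hypothetical helper, `avg_words_per_line`, which splits on whitespace. This is an assumption made to keep the example in base R: `stri_count_words()` uses different word-boundary rules, so its counts can differ slightly.

```r
# Hypothetical helper: mean number of whitespace-separated tokens per line.
avg_words_per_line <- function(lines) {
  tokens <- strsplit(trimws(lines), "\\s+")
  mean(lengths(tokens))
}

# Toy stand-ins for two of the files (not real data).
toy <- list(
  Blogs   = c("a fairly long blog sentence with many words", "another blog line"),
  Twitter = c("short tweet", "lol")
)
avg_toy <- sapply(toy, avg_words_per_line)
avg_toy
```

With the `stats` table from the analysis, the same quantity is simply `stats$Words / stats$Lines`.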
Because the complete data set is extremely large, a 1% random sample of each file was taken for further analysis.
set.seed(123)
sampleData <- c(
sample(blogs, length(blogs)*0.01),
sample(news, length(news)*0.01),
sample(twitter, length(twitter)*0.01)
)
Sampling reduces the amount of computation required while approximately preserving the properties of the full data set.
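The sampling step can also be written with an explicit integer sample size, which makes the intended fraction easier to change. This is a sketch on toy data, and `sample_fraction` is a hypothetical helper name rather than part of the original code.

```r
# Hypothetical helper: draw a fixed fraction of lines without replacement.
sample_fraction <- function(lines, frac) {
  n <- floor(length(lines) * frac)   # explicit integer sample size
  lines[sample.int(length(lines), size = n)]
}

set.seed(123)                         # fixed seed for reproducibility
toy_lines <- paste("line", 1:1000)    # toy stand-in for a text file
toy_sample <- sample_fraction(toy_lines, 0.01)
length(toy_sample)
```

Sampling each file separately, as the analysis does, keeps the three sources in roughly the same proportions as in the full data.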
The data cleaning steps are converting text to lowercase, removing punctuation, removing numbers, and collapsing extra whitespace:
library(tm)
corpus <- VCorpus(VectorSource(sampleData))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
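To make the effect of each cleaning step concrete, here is a rough base R equivalent applied to a single string. `clean_text` is a hypothetical helper. Its regular expressions approximate the tm transformations and are not guaranteed to match them exactly, and it also trims leading and trailing spaces, which `stripWhitespace` does not.

```r
# Hypothetical helper: approximate the tm cleaning pipeline with base R.
clean_text <- function(x) {
  x <- tolower(x)                     # lowercase
  x <- gsub("[[:punct:]]+", "", x)    # remove punctuation
  x <- gsub("[[:digit:]]+", "", x)    # remove numbers
  x <- gsub("\\s+", " ", x)           # collapse runs of whitespace
  trimws(x)                           # extra step: trim the ends
}

cleaned <- clean_text("Hello, World! 123  ok")
cleaned
```

Removing punctuation also removes apostrophes, so contractions such as "don't" become "dont". That may matter later when predictions are shown to users.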
The distribution of words per line helps in understanding sentence lengths.
# stri_count_words() is already vectorised, so sapply() is unnecessary
wordCount <- stri_count_words(sampleData)
hist(wordCount,
col="lightblue",
main="Distribution of Words Per Line",
xlab="Words",
ylab="Frequency")
The histogram shows that most lines contain fewer than 30 words, with a long right tail made up of a small number of very long lines.
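A reading taken from a histogram can be checked numerically. The sketch below uses toy counts, not the real `wordCount` vector, to show how the share of lines under a threshold and a few quantiles of the tail might be computed.

```r
# Toy word counts per line (not real data).
toy_counts <- c(5, 12, 8, 40, 22, 3, 95, 17)

# Proportion of lines with fewer than 30 words.
share_below_30 <- mean(toy_counts < 30)

# Selected quantiles summarise the long right tail.
tail_quantiles <- quantile(toy_counts, probs = c(0.5, 0.9, 0.99))
share_below_30
```

Applied to the sampled data, `mean(wordCount < 30)` would give the exact share behind the statement above.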
library(RWeka)
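tdm <- TermDocumentMatrix(corpus)
# Sum counts on the sparse matrix; as.matrix() would build a dense
# terms-by-documents matrix that can exhaust memory on a corpus this size
freq <- sort(slam::row_sums(tdm),
decreasing=TRUE)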
freqData <- data.frame(
word=names(freq),
freq=freq
)
head(freqData,20)
library(ggplot2)
ggplot(head(freqData,20),
aes(reorder(word,freq),freq))+
geom_bar(stat="identity",
fill="steelblue")+
coord_flip()+
labs(title="Top 20 Most Frequent Words",
x="Words",
y="Frequency")
The most common words are generally stop words.
These words dominate the corpus because they appear in nearly every sentence.
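One way to quantify how strongly a few frequent words dominate is to ask how many distinct words are needed to cover a given share of all word occurrences. The sketch below uses toy frequencies, and `coverage_words` is a hypothetical helper that is not part of the original analysis.

```r
# Hypothetical helper: number of most-frequent words needed to reach
# a target share of all word occurrences.
coverage_words <- function(freq, target) {
  freq <- sort(freq, decreasing = TRUE)
  covered <- cumsum(freq) / sum(freq)
  unname(which(covered >= target)[1])
}

# Toy frequency table (not real data).
toy_freq <- c(the = 50, and = 30, cat = 15, zebra = 5)
words_for_50 <- coverage_words(toy_freq, 0.5)
words_for_90 <- coverage_words(toy_freq, 0.9)
c(words_for_50, words_for_90)
```

Run on the `freq` vector from the analysis, this kind of calculation indicates how small a vocabulary the prediction model could keep.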
Bigrams are sequences of two adjacent words.
bigramTokenizer <- function(x)
NGramTokenizer(x,
Weka_control(min=2,max=2))
bigram <- TermDocumentMatrix(
corpus,
control=list(
tokenize=bigramTokenizer
)
)
# Sum counts on the sparse matrix instead of converting it to a dense one
bigramFreq <- sort(
slam::row_sums(bigram),
decreasing=TRUE
)
bigramData <- data.frame(
bigram=names(bigramFreq),
freq=bigramFreq
)
ggplot(head(bigramData,20),
aes(reorder(bigram,freq),freq))+
geom_bar(stat="identity",
fill="orange")+
coord_flip()+
labs(title="Top 20 Bigrams")
Trigrams are sequences of three adjacent words.
trigramTokenizer <- function(x)
NGramTokenizer(x,
Weka_control(min=3,max=3))
trigram <- TermDocumentMatrix(
corpus,
control=list(
tokenize=trigramTokenizer
)
)
# Sum counts on the sparse matrix instead of converting it to a dense one
trigramFreq <- sort(
slam::row_sums(trigram),
decreasing=TRUE
)
trigramData <- data.frame(
trigram=names(trigramFreq),
freq=trigramFreq
)
ggplot(head(trigramData,20),
aes(reorder(trigram,freq),freq))+
geom_bar(stat="identity",
fill="forestgreen")+
coord_flip()+
labs(title="Top 20 Trigrams")
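The tokenizers above delegate n-gram construction to RWeka. As a rough illustration of what an n-gram tokenizer produces, here is a base R sketch. `make_ngrams` is a hypothetical helper, not the RWeka implementation, and it assumes the text has already been split into word tokens.

```r
# Hypothetical helper: build all n-grams from a vector of word tokens.
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(
    starts,
    function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
    character(1)
  )
}

tokens <- c("the", "quick", "brown", "fox")
toy_bigrams  <- make_ngrams(tokens, 2)
toy_trigrams <- make_ngrams(tokens, 3)
toy_bigrams
```

A line with fewer than n words yields no n-grams, which is why short tweets contribute relatively little to the trigram counts.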
library(wordcloud)
wordcloud(
words=names(freq),
freq=freq,
max.words=100,
colors=rainbow(10)
)
The word cloud provides a visual representation of the most frequently occurring words.
Several interesting observations emerged during the exploratory analysis:
The prediction algorithm will be based on the n-gram language model.
The following models will be constructed:
The prediction process will follow these steps:
This approach is computationally efficient and widely used in predictive text applications.
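As an illustration of how an n-gram model can produce a prediction, the sketch below looks up the last two words in a trigram table, falls back to the last word in a bigram table, and finally falls back to the most frequent single words. The backoff order, the toy counts, and the helper names `next_from` and `predict_next` are all assumptions made for this example. The report does not specify these details.

```r
# Toy n-gram counts; in practice these would come from the corpus.
trigrams <- c("i love you" = 5, "i love it" = 3, "love you too" = 2)
bigrams  <- c("i love" = 8, "love you" = 6, "love it" = 4, "you too" = 2)
unigrams <- c("the" = 10, "you" = 7, "it" = 5)

# Hypothetical helper: top-k words that follow `context` in one table.
next_from <- function(counts, context, k) {
  prefix <- paste0(context, " ")
  hits <- counts[startsWith(names(counts), prefix)]
  hits <- sort(hits, decreasing = TRUE)
  substring(names(head(hits, k)), nchar(prefix) + 1)
}

# Hypothetical predictor: trigram, then bigram, then unigram fallback.
predict_next <- function(text, k = 3) {
  words <- strsplit(tolower(trimws(text)), "\\s+")[[1]]
  n <- length(words)
  if (n >= 2) {
    res <- next_from(trigrams, paste(words[(n - 1):n], collapse = " "), k)
    if (length(res) > 0) return(res)
  }
  if (n >= 1) {
    res <- next_from(bigrams, words[n], k)
    if (length(res) > 0) return(res)
  }
  names(head(sort(unigrams, decreasing = TRUE), k))
}

from_trigram <- predict_next("I love")
from_bigram  <- predict_next("they love")
from_unigram <- predict_next("zzz")
```

Here `"I love"` is resolved by the trigram table, `"they love"` falls back to bigrams because no trigram starts with "they love", and `"zzz"` falls through to the unigram list. A production model would typically also weight or smooth the counts instead of using raw frequencies.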
The Shiny application will contain:
A text input that allows users to enter text.
An output area that displays the most likely next words.
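A minimal sketch of such an app is shown below. The input and output IDs, the labels, and the placeholder `predict_next` function are hypothetical. The real app would call the trained n-gram model instead of returning a fixed list.

```r
library(shiny)

# Hypothetical placeholder; the real app would query the n-gram model.
predict_next <- function(text) {
  if (nchar(trimws(text)) == 0) return(character(0))
  c("the", "and", "to")
}

# User interface: one text input and one text output.
ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("phrase", "Enter text:"),
  h4("Most likely next words:"),
  textOutput("prediction")
)

# Server logic: recompute the prediction whenever the input changes.
server <- function(input, output, session) {
  output$prediction <- renderText({
    paste(predict_next(input$phrase), collapse = ", ")
  })
}

app <- shinyApp(ui = ui, server = server)
# runApp(app)  # uncomment to launch the app locally
```

Because `renderText()` is reactive, the displayed words update as the user types, without a separate submit button.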
This exploratory analysis demonstrates that the text corpus contains sufficient information to build an effective next-word prediction model. Summary statistics, visualizations, and n-gram analysis reveal significant patterns in the data. The next phase of the project will involve developing and optimizing the prediction algorithm and integrating it into an interactive Shiny application.
The final application aims to provide accurate and real-time next-word predictions using statistical language modelling techniques.