Introduction

In many applications and devices, when we type text there is an option that suggests the next word, offering several choices. Sometimes it helps, sometimes it drives you crazy …

If we imagine ourselves predicting the next word from the previous input words, we see what an enormously difficult task it is: “my dog is ….” (running, jumping, sleeping …?), “he is a …” (driver, husband, boy …?)

Two things become clear from this mental simulation of word prediction: the next word is rarely determined uniquely, and any reasonable prediction must rely on the words that come before it.

Overview

The objective of the project is to build a model that predicts the next word from the previous input text and to make it available in a Shiny application.

Model constraints

  • The model should run on mobile devices and predict the next word as the user types, so it must be light both in memory (a few dozen MB at most) and in computation (prediction should feel immediate, under one second).
  • It should be based on the N-gram approach (more advanced generative AI models exist, but they are out of scope here); a toy illustration of the idea follows this list.
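
To make the N-gram idea concrete, here is a minimal sketch in R. The toy corpus, the bigram table, and the chosen prefix are purely illustrative assumptions, not part of the project code: we count how often each word follows a given prefix and suggest the most frequent continuation.

## Toy bigram model (illustrative only; sentence boundaries are ignored for simplicity)
corpus <- c("my dog is running", "my dog is sleeping", "my dog is running fast")
words  <- unlist(strsplit(tolower(corpus), "\\s+"))

## Pair every word with the word that follows it and count the pairs
bigrams     <- paste(head(words, -1), tail(words, -1))
bigram_freq <- sort(table(bigrams), decreasing = TRUE)

## Suggest the most frequent continuation of the prefix "is"
candidates <- bigram_freq[grepl("^is ", names(bigram_freq))]
sub("^is ", "", names(candidates)[1])   # returns "running"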

Project steps

  1. Obtain the dataset and divide it into training, validation, and testing sub-sets (one possible split is sketched after this list).

  2. Clean and analyse the training data.

  3. Build the optimal model for word prediction.

     3a. Identify the model options.
     3b. Build and “train” the models based on training data.
     3c. Evaluate the models based on cleaned validation data and select the best one.
  4. Evaluate the selected model based on cleaned testing data.

  5. Deploy it in Shiny server for users to try.
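
As an example of step 1, the lines of a file can be assigned at random to the three sub-sets. The 60/20/20 proportions and the object names below are assumptions for illustration; the actual split is decided during the project.

## Assign each line at random to training / validation / testing (assumed 60/20/20 split)
set.seed(1234)
text_lines <- readLines("en_US.blogs.txt")   # any of the three files; blogs shown as an example
idx <- sample(c("train", "valid", "test"), length(text_lines),
              replace = TRUE, prob = c(0.6, 0.2, 0.2))
train_set <- text_lines[idx == "train"]
valid_set <- text_lines[idx == "valid"]
test_set  <- text_lines[idx == "test"]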

Questions to consider

  • What part of the provided data set should be used?
  • How should the data be cleaned?
  • How can we optimize the N-gram model in terms of volume? (one common technique is sketched below)
  • What options are there for the prediction model?
  • How many input words are required to make a prediction?
  • What accuracy can the model achieve?
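
On the volume question, one common technique is to prune rare N-grams: they dominate the size of the frequency tables but contribute little to prediction quality. A minimal sketch, where the frequency table and the threshold of 5 are assumptions for illustration:

## Keep only n-grams observed at least min_count times (hypothetical counts)
ngram_counts <- c("my dog is" = 120, "he is a" = 95, "dog is purple" = 1, "is a driver" = 3)
min_count <- 5
pruned <- ngram_counts[ngram_counts >= min_count]
pruned   # only the frequent n-grams remain, which greatly reduces memory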

Data

Loading data

The data was loaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

The data includes three files: blogs, news, and Twitter.

## Fix the random seed for reproducibility and point to the unzipped data folder
set.seed(1234)
setwd("~/GitHub/Capstone_Project/Coursera-SwiftKey/final/en_US")

## Read each file line by line; skipNul suppresses warnings about embedded nul characters
blogs   <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")
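
Note: on some systems readLines() stops early on en_US.news.txt because the file contains a control character that is treated as end-of-file in text mode. A common workaround, shown here only as a sketch (the summary table below was produced with the plain calls above), is to read the file through a binary connection:

## Read the news file through a binary connection to avoid a premature end-of-file
con  <- file("en_US.news.txt", open = "rb")
news <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)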

Data analysis

## Number of Lines
lines.blogs <- length(blogs)
lines.news <- length(news)
lines.twitter <- length(twitter)
lines <- c(lines.blogs, lines.news, lines.twitter) 

## Number of Words
library(stringi)
words.blogs <- sum(stri_count_words(blogs))
words.news <- sum(stri_count_words(news))
words.twitter <- sum(stri_count_words(twitter))
words <- c(words.blogs, words.news, words.twitter)

## Size of each object in memory, in MB
size.blogs <- object.size(blogs)/1000000
size.news <- object.size(news)/1000000
size.twitter <- object.size(twitter)/1000000
size_MB <- c(round(size.blogs,0), round(size.news,0), round(size.twitter,0))

files_summary <- data.frame(files, lines, words, size_MB)

library(knitr)
kable(files_summary)
files                 lines      words   size_MB
en_US.blogs.txt      899288   37546250       268
en_US.news.txt        77259    2674536        21
en_US.twitter.txt   2360148   30093372       334

As the summary shows, the provided data contains a large number of lines and words and occupies hundreds of MB in memory.

That means we will need to use only a part of the data to build the model. The good side is that this leaves plenty of data for validation and testing.

Moving forward with the plan above, we will need to make important decisions on the scope of data to use, on splitting the data into training, validation, and testing sub-sets, and on data cleaning.
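
For example, one simple way to keep only a small part of each file for model building is to sample lines with a fixed probability. The 5% rate and the object names below are assumptions for illustration only:

## Keep roughly 5% of the lines from each file (assumed sampling rate)
set.seed(1234)
sample_rate <- 0.05
blogs_sample   <- blogs[rbinom(length(blogs), 1, sample_rate) == 1]
news_sample    <- news[rbinom(length(news), 1, sample_rate) == 1]
twitter_sample <- twitter[rbinom(length(twitter), 1, sample_rate) == 1]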