What and why

The capstone project of the Coursera Data Science Specialization includes a milestone report in week 2. The aim of this report is to demonstrate an understanding of the statistical properties of the data set that can later be used when building the word prediction model. Using exploratory data analysis, the report describes the major features of the available data and then summarizes my plans for creating the word prediction model.

Environment setup

rm(list = ls(all.names = TRUE))
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readtext)
library(quanteda)
## Warning in .recacheSubclasses(def@className, def, env): undefined subclass
## "ndiMatrix" of class "replValueSp"; definition not updated
## Package version: 3.3.1
## Unicode version: 15.1
## ICU version: 74.1
## Parallel computing: 8 of 8 threads used.
## See https://quanteda.io for tutorials and examples.
## 
## Attaching package: 'quanteda'
## 
## The following object is masked from 'package:readtext':
## 
##     texts
library(quanteda.textstats)
## Warning in .recacheSubclasses(def@className, def, env): undefined subclass
## "ndiMatrix" of class "replValueSp"; definition not updated
library(quanteda.textplots)
library(quanteda.textmodels)
library(gt)

Getting data

I am using data from the three available sources in English: blogs, news, and Twitter. The data are downloaded from the provided link as a zip file, unzipped, and the directory structure is listed.

# download and unzip
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download_zip <- "download.zip"

if(!file.exists(download_zip)){
  download.file(url, destfile = download_zip)
}

if(!dir.exists("final")){
  unzip(download_zip)
}

list <- unzip(download_zip, list = T)
file_list <- gt(list)
file_list
Name Length Date
final/ 0 2014-07-22 10:10:00
final/de_DE/ 0 2014-07-22 10:10:00
final/de_DE/de_DE.twitter.txt 75578341 2014-07-22 10:11:00
final/de_DE/de_DE.blogs.txt 85459666 2014-07-22 10:11:00
final/de_DE/de_DE.news.txt 95591959 2014-07-22 10:11:00
final/ru_RU/ 0 2014-07-22 10:10:00
final/ru_RU/ru_RU.blogs.txt 116855835 2014-07-22 10:12:00
final/ru_RU/ru_RU.news.txt 118996424 2014-07-22 10:12:00
final/ru_RU/ru_RU.twitter.txt 105182346 2014-07-22 10:12:00
final/en_US/ 0 2014-07-22 10:10:00
final/en_US/en_US.twitter.txt 167105338 2014-07-22 10:12:00
final/en_US/en_US.news.txt 205811889 2014-07-22 10:13:00
final/en_US/en_US.blogs.txt 210160014 2014-07-22 10:13:00
final/fi_FI/ 0 2014-07-22 10:10:00
final/fi_FI/fi_FI.news.txt 94234350 2014-07-22 10:11:00
final/fi_FI/fi_FI.blogs.txt 108503595 2014-07-22 10:12:00
final/fi_FI/fi_FI.twitter.txt 25331142 2014-07-22 10:10:00

Basic exploratory analysis of provided data

# paths to the three en_US files (rows 11-13 of the zip listing)
path_twitter <- list[11,1]
path_news <- list[12,1]
path_blogs <- list[13,1]

# file sizes in MB
file_size <- c(round(file.size(path_twitter)/1024^2, 1),
               round(file.size(path_news)/1024^2, 1),
               round(file.size(path_blogs)/1024^2, 1))

# first three lines of each file as examples
con <- file(path_twitter, "r")
tw1 <- readLines(con, 3)
close(con)

con <- file(path_news, "r")
ne1 <- readLines(con, 3)
close(con)

con <- file(path_blogs, "r")
bl1 <- readLines(con, 3)
close(con)

# read the whole en_US directory and build the raw corpus
txt_raw <- readtext(list[10,1], docvarsfrom = "filenames", docvarnames = c("language", "source"))
corpus_raw <- corpus(txt_raw) # preparing raw corpus
sentence_no <- c(nsentence(corpus_raw[3]), nsentence(corpus_raw[2]), nsentence(corpus_raw[1])) # ordered as twitter, news, blogs

Basic parameters of the source files are listed below.

# visualization of exploratory analysis outcome
source_file <- c("US Twitter", "US News", "US Blogs")
param <- data.frame(source_file, file_size, sentence_no)
table_gt <- gt(param)
table_gt
source_file file_size sentence_no
US Twitter 159.4 2580555
US News 196.3 143662
US Blogs 200.4 2072623

There are three rather large text files, ranging from about 159 MB to 200 MB (file_size in the table is in MB). They contain from roughly 144,000 sentences (news) to 2.6 million sentences (Twitter).

Example texts, the first three lines of each file, are shown below.

example_text <- data.frame(tw1, ne1, bl1)
names(example_text) <- source_file
example_gt <- example_text |> gt()
example_gt
US Twitter US News US Blogs
How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long. He wasn't home alone, apparently. In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”.
When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason. The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s. We love you Mr. Brown.
they've decided its more fun if I don't. WSU's plans quickly became a hot topic on local online sites. Though most people applauded plans for the new biomedical center, many deplored the potential loss of the building. Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him.

As expected, I am dealing with “typical” natural language data: high variability, a wide range of expressions, diverse vocabulary and grammar structures, regional dialects, ambiguity and context-dependent meanings, creativity (novel combinations of words and ideas), and imperfection (errors, hesitations, informal expressions).

To be able to work with the corpus efficiently, given the limitations of my PC, I am going to work with only 2% of the original corpus.

Further exploratory analysis using simplified corpus

Creation of simplified corpus

I am merging the content of all three files because I would like to train my model on a single training set in later stages of this project. As already stated, I am using only 2% of the original corpus.

# sub-setting corpus to be easier to handle in following steps
corpus_sentences <- corpus_reshape(corpus_raw, to = "sentences")

corpus_samples <- corpus_sample(corpus_sentences, size = 0.02*ndoc(corpus_sentences))
corpus_samples_stats <- summary(corpus_samples)
ndoc(corpus_samples)
## [1] 95936
head(corpus_samples_stats)
##                        Text Types Tokens Sentences language     source
## 1   en_US.blogs.txt.1293786    24     26         1       en   US.blogs
## 2 en_US.twitter.txt.1620545    30     31         1       en US.twitter
## 3 en_US.twitter.txt.1990352    10     10         1       en US.twitter
## 4  en_US.twitter.txt.819516    16     17         1       en US.twitter
## 5  en_US.twitter.txt.669838    18     19         1       en US.twitter
## 6  en_US.twitter.txt.922339    23     25         1       en US.twitter

The simplified corpus contains around 96,000 sentences.

Tokenization

Once I have a corpus, I can create a basic token list with punctuation, numbers, and URLs removed.

token_samples <- tokens(corpus_samples, remove_punct = TRUE, remove_numbers = TRUE, remove_url = TRUE) |>
  tokens_tolower() # tokens() has no tolower argument, so lowercasing is applied explicitly

Preparation of a document-feature matrix (DFM) is my next step. I will use it for several of the following analyses and steps of this project.

dfm_samples <- dfm(token_samples)
dfm_samples[, 1:5]
## Document-feature matrix of: 95,936 documents, 5 features (99.92% sparse) and 2 docvars.
##                            features
## docs                        north coast brother thelonius 355ml
##   en_US.blogs.txt.1293786       1     1       1         1     1
##   en_US.twitter.txt.1620545     0     0       0         0     0
##   en_US.twitter.txt.1990352     0     0       0         0     0
##   en_US.twitter.txt.819516      0     0       0         0     0
##   en_US.twitter.txt.669838      0     0       0         0     0
##   en_US.twitter.txt.922339      0     0       0         0     0
## [ reached max_ndoc ... 95,930 more documents ]

I decided not to remove stop words, as I expect them to be important for word prediction: they are used extensively in ordinary sentences.
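
Should removing them still be useful for some side analysis later, quanteda makes this straightforward. A minimal sketch, not applied to the corpus used below (token_nostop is my own illustrative name):

# illustrative only, NOT applied in this report: drop English stop words
token_nostop <- tokens_remove(token_samples, pattern = stopwords("en"))
head(featnames(dfm(token_nostop)), 10)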

Let's check whether the word “data” is part of my corpus and how it is used in context.

kw_data <- kwic(token_samples, pattern = "data")
nrow(kw_data)
## [1] 129
gt(head(kw_data, 5))
docname from to pre keyword post pattern
en_US.twitter.txt.1242970 6 6 Apple #39 s North Carolina Data Center Will Feature Biogas Generators data
en_US.twitter.txt.1208822 3 3 Yes the data will be available for others data
en_US.twitter.txt.260506 11 11 shocked to raise fees throttle data for Shout out to my data
en_US.twitter.txt.92853 15 15 inputting the crazy amounts of data from library wkrs calling it data
en_US.blogs.txt.643660 53 53 for department cell phones and data plans data

There are 129 occurrences of the word “data” in my data set. I show five examples of this word in context, just for fun.

Visualizations

Couple of visualizations can be helpful to get better understanding of corpus content.

top_f <- data.frame(word = names(topfeatures(dfm_samples, 20)), n = topfeatures(dfm_samples, 20))

top_f %>%
  ggplot(aes(n, reorder(word, n))) +
  geom_col() +
  labs(x = "Frequency", y = "Unigram") +
  theme_light()

We can see that words typically classified as stop words are the most common in our data set. This is a further indication that they are needed in the prediction model.

Creating ngrams

Unigrams are already available, but I would also like to look at bi-, tri-, and quadgrams, as these could be useful for the prediction model. This is a computationally intensive process and needs a lot of memory; I might need to save the n-grams to disk while optimizing my model (a small sketch of this follows the code below).

bigrams <- tokens_ngrams(token_samples, n = 2)
trigrams <- tokens_ngrams(token_samples, n = 3)
quadgrams <- tokens_ngrams(token_samples, n = 4)

dfm_bi <- dfm(bigrams)
dfm_tri <- dfm(trigrams)
dfm_quad <- dfm(quadgrams)

# overall n-gram frequency tables, sorted by descending count
bigram_tot <- enframe(sort(colSums(dfm_bi), TRUE))
trigram_tot <- enframe(sort(colSums(dfm_tri), TRUE))
quadgram_tot <- enframe(sort(colSums(dfm_quad), TRUE))
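
As mentioned above, memory may become a constraint, so the frequency tables can be persisted to disk and reloaded later instead of being recomputed. A minimal sketch with base R's saveRDS()/readRDS() (the file names are my own choice):

# optional: save frequency tables to disk for later reuse
saveRDS(bigram_tot, "bigram_tot.rds")
saveRDS(trigram_tot, "trigram_tot.rds")
saveRDS(quadgram_tot, "quadgram_tot.rds")
# reload later with, e.g., bigram_tot <- readRDS("bigram_tot.rds")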

Now we can have a look at which n-grams are the most common.
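
For example, the top of each frequency table can be inspected directly (output not shown here):

head(bigram_tot, 10)
head(trigram_tot, 10)
head(quadgram_tot, 10)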

Plans for next actions

My initial steps are barely the start of the real work. There are many things I still need to tackle.

Ideas about models

Bigram Model with Laplace Smoothing

Concept: In a bigram model, the probability of a word depends only on the preceding word. Laplace smoothing is applied to handle cases where certain bigrams are unseen in the training data.

Implementation: Calculate bigram probabilities by counting occurrences of word pairs and apply Laplace smoothing to account for unseen combinations. To predict the next word, use the bigram with the highest probability given the current word.
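
A minimal sketch of this idea using the frequency tables built above: the smoothed probability is (count(w1 w2) + 1) / (count(w1) + V), with V the vocabulary size. The objects unigram_tot, V and bigram_prob, and the example word “of”, are my own illustrative choices, not the final model:

# unigram counts and vocabulary size
unigram_tot <- enframe(sort(colSums(dfm_samples), TRUE))
V <- nrow(unigram_tot)

# Laplace-smoothed bigram probabilities: (count(w1_w2) + 1) / (count(w1) + V)
bigram_prob <- bigram_tot |>
  separate(name, into = c("w1", "w2"), sep = "_", remove = FALSE) |>
  left_join(unigram_tot, by = c("w1" = "name"), suffix = c("_bi", "_uni")) |>
  mutate(prob = (value_bi + 1) / (value_uni + V))

# illustrative prediction: the three most probable words following "of"
bigram_prob |> filter(w1 == "of") |> slice_max(prob, n = 3)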

Trigram Model with Back-off

Concept: A trigram model predicts the next word based on the two preceding words. Back-off is employed to gracefully handle cases where a trigram is unseen, falling back to a bigram or unigram model.

Implementation: Calculate trigram probabilities and, if a trigram is unseen, fall back to the corresponding bigram or unigram probabilities. Choose the word with the highest probability for prediction.
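
A minimal sketch of this fallback logic, reusing the frequency tables (and unigram_tot from the sketch above). It backs off on raw counts rather than fully smoothed probabilities, and the function predict_next() is my own illustrative starting point, not the final implementation:

predict_next <- function(w1, w2, n = 3) {
  # try trigrams that start with "w1_w2_"
  hits <- trigram_tot |> filter(str_starts(name, fixed(paste0(w1, "_", w2, "_"))))
  if (nrow(hits) == 0) {
    # back off to bigrams that start with "w2_"
    hits <- bigram_tot |> filter(str_starts(name, fixed(paste0(w2, "_"))))
  }
  if (nrow(hits) == 0) {
    # final fallback: the most frequent unigrams overall
    return(head(unigram_tot, n))
  }
  head(hits, n)  # tables are already sorted by descending frequency
}

predict_next("one", "of")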

These models represent a progression in complexity from the bigram to the trigram model. They serve as foundational building blocks for more advanced N-gram models and can be implemented with relative simplicity using basic probability calculations.

My next steps

  • Calculation of probabilities for each ngram
  • Application of smoothing
  • Implementation of prediction logic
  • Evaluation of model performance
  • Iterative refinement
  • Back-off strategies
  • Optimization for computational efficiency