What and why

The capstone project of the Coursera Data Science Specialization includes a milestone report in week 2. The aim of this report is to demonstrate an understanding of the statistical properties of the data set that can later be used when building the word prediction model. Using exploratory data analysis, the report describes the major features of the available data and then summarizes my plans for creating the word prediction model.

Environment setup

rm(list = ls(all.names = TRUE))
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readtext)
library(quanteda)
## Warning in .recacheSubclasses(def@className, def, env): undefined subclass
## "ndiMatrix" of class "replValueSp"; definition not updated
## Package version: 3.3.1
## Unicode version: 15.1
## ICU version: 74.1
## Parallel computing: 8 of 8 threads used.
## See https://quanteda.io for tutorials and examples.
## 
## Attaching package: 'quanteda'
## 
## The following object is masked from 'package:readtext':
## 
##     texts
library(quanteda.textstats)
## Warning in .recacheSubclasses(def@className, def, env): undefined subclass
## "ndiMatrix" of class "replValueSp"; definition not updated
library(quanteda.textplots)
library(quanteda.textmodels)
library(gt)

Getting data

I am using data from the three available sources in English: blogs, news, and Twitter. The data are downloaded from the provided link as a zip file, unzipped, and the directory structure is listed.

# download and unzip
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download_zip <- "download.zip"

if(!file.exists(download_zip)){
  download.file(url, destfile = download_zip)
}

if(!dir.exists("final")){
  unzip(download_zip)
}

list <- unzip(download_zip, list = T)
file_list <- gt(list)
file_list
Name Length Date
final/ 0 2014-07-22 10:10:00
final/de_DE/ 0 2014-07-22 10:10:00
final/de_DE/de_DE.twitter.txt 75578341 2014-07-22 10:11:00
final/de_DE/de_DE.blogs.txt 85459666 2014-07-22 10:11:00
final/de_DE/de_DE.news.txt 95591959 2014-07-22 10:11:00
final/ru_RU/ 0 2014-07-22 10:10:00
final/ru_RU/ru_RU.blogs.txt 116855835 2014-07-22 10:12:00
final/ru_RU/ru_RU.news.txt 118996424 2014-07-22 10:12:00
final/ru_RU/ru_RU.twitter.txt 105182346 2014-07-22 10:12:00
final/en_US/ 0 2014-07-22 10:10:00
final/en_US/en_US.twitter.txt 167105338 2014-07-22 10:12:00
final/en_US/en_US.news.txt 205811889 2014-07-22 10:13:00
final/en_US/en_US.blogs.txt 210160014 2014-07-22 10:13:00
final/fi_FI/ 0 2014-07-22 10:10:00
final/fi_FI/fi_FI.news.txt 94234350 2014-07-22 10:11:00
final/fi_FI/fi_FI.blogs.txt 108503595 2014-07-22 10:12:00
final/fi_FI/fi_FI.twitter.txt 25331142 2014-07-22 10:10:00

Basic exploratory analysis of provided data

# paths to the three en_US files (rows 11-13 of the zip listing)
path_twitter <- list[11,1]
path_news <- list[12,1]
path_blogs <- list[13,1]

# file sizes in MB
file_size <- c(round(file.size(path_twitter)/1024^2, 1),
               round(file.size(path_news)/1024^2, 1),
               round(file.size(path_blogs)/1024^2, 1))

# first three lines of each file as examples
con <- file(path_twitter, "r")
tw1 <- readLines(con, 3)
close(con)

con <- file(path_news, "r")
ne1 <- readLines(con, 3)
close(con)

con <- file(path_blogs, "r")
bl1 <- readLines(con, 3)
close(con)

# read the whole en_US directory and build the raw corpus
txt_raw <- readtext(list[10,1], docvarsfrom = "filenames", docvarnames = c("language", "source"))
corpus_raw <- corpus(txt_raw) # preparing raw corpus
sentence_no <- c(nsentence(corpus_raw[3]), nsentence(corpus_raw[2]), nsentence(corpus_raw[1])) # ordered as twitter, news, blogs

Basic parameters of the source files are listed below.

# visualization of exploratory analysis outcome
source_file <- c("US Twitter", "US News", "US Blogs")
param <- data.frame(source_file, file_size, sentence_no)
table_gt <- gt(param)
table_gt
source_file file_size sentence_no
US Twitter 159.4 2580555
US News 196.3 143662
US Blogs 200.4 2072623

There are three rather large text files, ranging from about 159 MB to 200 MB (file_size in the table is in MB). They contain from roughly 144,000 sentences (news) to 2.6 million sentences (Twitter).

Example texts, the first three lines of each file, are shown below.

example_text <- data.frame(tw1, ne1, bl1)
names(example_text) <- source_file
example_gt <- example_text |> gt()
example_gt
US Twitter US News US Blogs
How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long. He wasn't home alone, apparently. In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”.
When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason. The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s. We love you Mr. Brown.
they've decided its more fun if I don't. WSU's plans quickly became a hot topic on local online sites. Though most people applauded plans for the new biomedical center, many deplored the potential loss of the building. Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him.

As expected, I am dealing with “typical” natural language data: high variability, a wide range of expressions, diverse vocabulary and grammar structures, regional dialects, ambiguity and context-dependent meanings, creativity (novel combinations of words and ideas), and imperfection (errors, hesitations, informal expressions).

To be able to work with the corpus efficiently, given the limitations of my PC, I am going to work with only 2% of the original corpus.

Further exploratory analysis using simplified corpus

Creation of simplified corpus

I am merging the content of all three files because I would like to train my model on a single training set in later stages of this project. As already stated, I am using only 2% of the original corpus.

# sub-setting corpus to be easier to handle in following steps
corpus_sentences <- corpus_reshape(corpus_raw, to = "sentences")

corpus_samples <- corpus_sample(corpus_sentences, size = 0.02*ndoc(corpus_sentences))
corpus_samples_stats <- summary(corpus_samples)
ndoc(corpus_samples)
## [1] 95936
head(corpus_samples_stats)
##                        Text Types Tokens Sentences language     source
## 1   en_US.blogs.txt.1293786    24     26         1       en   US.blogs
## 2 en_US.twitter.txt.1620545    30     31         1       en US.twitter
## 3 en_US.twitter.txt.1990352    10     10         1       en US.twitter
## 4  en_US.twitter.txt.819516    16     17         1       en US.twitter
## 5  en_US.twitter.txt.669838    18     19         1       en US.twitter
## 6  en_US.twitter.txt.922339    23     25         1       en US.twitter

The simplified corpus contains around 96,000 sentences.

Tokenization

Once I have a corpus, I can create a basic token list with punctuation, numbers, and URLs removed.

token_samples <- tokens(corpus_samples, remove_punct = TRUE, remove_numbers = TRUE, remove_url = TRUE) |>
  tokens_tolower() # tokens() has no tolower argument, so lowercasing is applied explicitly

Preparation of a document-feature matrix (DFM) is my next step. I will use it for several of the following analyses and steps of this project.

dfm_samples <- dfm(token_samples)
dfm_samples[, 1:5]
## Document-feature matrix of: 95,936 documents, 5 features (99.92% sparse) and 2 docvars.
##                            features
## docs                        north coast brother thelonius 355ml
##   en_US.blogs.txt.1293786       1     1       1         1     1
##   en_US.twitter.txt.1620545     0     0       0         0     0
##   en_US.twitter.txt.1990352     0     0       0         0     0
##   en_US.twitter.txt.819516      0     0       0         0     0
##   en_US.twitter.txt.669838      0     0       0         0     0
##   en_US.twitter.txt.922339      0     0       0         0     0
## [ reached max_ndoc ... 95,930 more documents ]

I decided not to remove stop words, as I expect them to be important for word prediction: they are used extensively in ordinary sentences.
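
Should removing them still be useful for some side analysis later, quanteda makes this straightforward. A minimal sketch, not applied to the corpus used below (token_nostop is my own illustrative name):

# illustrative only, NOT applied in this report: drop English stop words
token_nostop <- tokens_remove(token_samples, pattern = stopwords("en"))
head(featnames(dfm(token_nostop)), 10)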

Let's check whether the word “data” is part of my corpus and how it is used in context.

kw_data <- kwic(token_samples, pattern = "data")
nrow(kw_data)
## [1] 129
gt(head(kw_data, 5))
docname from to pre keyword post pattern
en_US.twitter.txt.1242970 6 6 Apple #39 s North Carolina Data Center Will Feature Biogas Generators data
en_US.twitter.txt.1208822 3 3 Yes the data will be available for others data
en_US.twitter.txt.260506 11 11 shocked to raise fees throttle data for Shout out to my data
en_US.twitter.txt.92853 15 15 inputting the crazy amounts of data from library wkrs calling it data
en_US.blogs.txt.643660 53 53 for department cell phones and data plans data

There are 129 occurrences of the word “data” in my data set. I show five examples of this word in context, just for fun.

Visualizations

Couple of visualizations can be helpful to get better understanding of corpus content.

top_f <- data.frame(word = names(topfeatures(dfm_samples, 20)), n = topfeatures(dfm_samples, 20))

top_f %>%
  ggplot(aes(n, reorder(word, n))) +
  geom_col() +
  labs(x = "Frequency", y = "Unigram") +
  theme_light()

We can see that words typically classified as stop words are the most common in our data set. This is a further indication that they are needed in the prediction model.

Creating ngrams

Unigrams are already available, but I would also like to look at bi-, tri-, and quadgrams, as these could be useful for the prediction model. This is a computationally intensive process and needs a lot of memory; I might need to save the n-grams to disk while optimizing my model (a small sketch of this follows the code below).

bigrams <- tokens_ngrams(token_samples, n = 2)
trigrams <- tokens_ngrams(token_samples, n = 3)
quadgrams <- tokens_ngrams(token_samples, n = 4)

dfm_bi <- dfm(bigrams)
dfm_tri <- dfm(trigrams)
dfm_quad <- dfm(quadgrams)

# overall n-gram frequency tables, sorted by descending count
bigram_tot <- enframe(sort(colSums(dfm_bi), TRUE))
trigram_tot <- enframe(sort(colSums(dfm_tri), TRUE))
quadgram_tot <- enframe(sort(colSums(dfm_quad), TRUE))
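
As mentioned above, memory may become a constraint, so the frequency tables can be persisted to disk and reloaded later instead of being recomputed. A minimal sketch with base R's saveRDS()/readRDS() (the file names are my own choice):

# optional: save frequency tables to disk for later reuse
saveRDS(bigram_tot, "bigram_tot.rds")
saveRDS(trigram_tot, "trigram_tot.rds")
saveRDS(quadgram_tot, "quadgram_tot.rds")
# reload later with, e.g., bigram_tot <- readRDS("bigram_tot.rds")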

Now we can have a look at which n-grams are the most common.
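
For example, the top of each frequency table can be inspected directly (output not shown here):

head(bigram_tot, 10)
head(trigram_tot, 10)
head(quadgram_tot, 10)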

Plans for next actions

My initial steps are barely the start of the real work. There are many things I still need to tackle.

Ideas about models

Bigram Model with Laplace Smoothing

Concept: In a bigram model, the probability of a word depends only on the preceding word. Laplace smoothing is applied to handle cases where certain bigrams are unseen in the training data.

Implementation: Calculate bigram probabilities by counting occurrences of word pairs and apply Laplace smoothing to account for unseen combinations. To predict the next word, use the bigram with the highest probability given the current word.
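
A minimal sketch of this idea using the frequency tables built above: the smoothed probability is (count(w1 w2) + 1) / (count(w1) + V), with V the vocabulary size. The objects unigram_tot, V and bigram_prob, and the example word “of”, are my own illustrative choices, not the final model:

# unigram counts and vocabulary size
unigram_tot <- enframe(sort(colSums(dfm_samples), TRUE))
V <- nrow(unigram_tot)

# Laplace-smoothed bigram probabilities: (count(w1_w2) + 1) / (count(w1) + V)
bigram_prob <- bigram_tot |>
  separate(name, into = c("w1", "w2"), sep = "_", remove = FALSE) |>
  left_join(unigram_tot, by = c("w1" = "name"), suffix = c("_bi", "_uni")) |>
  mutate(prob = (value_bi + 1) / (value_uni + V))

# illustrative prediction: the three most probable words following "of"
bigram_prob |> filter(w1 == "of") |> slice_max(prob, n = 3)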

Trigram Model with Back-off

Concept: A trigram model predicts the next word based on the two preceding words. Back-off is employed to gracefully handle cases where a trigram is unseen, falling back to a bigram or unigram model.

Implementation: Calculate trigram probabilities and, if a trigram is unseen, fall back to the corresponding bigram or unigram probabilities. Choose the word with the highest probability for prediction.
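
A minimal sketch of this fallback logic, reusing the frequency tables (and unigram_tot from the sketch above). It backs off on raw counts rather than fully smoothed probabilities, and the function predict_next() is my own illustrative starting point, not the final implementation:

predict_next <- function(w1, w2, n = 3) {
  # try trigrams that start with "w1_w2_"
  hits <- trigram_tot |> filter(str_starts(name, fixed(paste0(w1, "_", w2, "_"))))
  if (nrow(hits) == 0) {
    # back off to bigrams that start with "w2_"
    hits <- bigram_tot |> filter(str_starts(name, fixed(paste0(w2, "_"))))
  }
  if (nrow(hits) == 0) {
    # final fallback: the most frequent unigrams overall
    return(head(unigram_tot, n))
  }
  head(hits, n)  # tables are already sorted by descending frequency
}

predict_next("one", "of")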

These models represent a progression in complexity from the bigram to the trigram model. They serve as foundational building blocks for more advanced N-gram models and can be implemented with relative simplicity using basic probability calculations.

My next steps

  • Calculation of probabilities for each ngram
  • Application of smoothing
  • Implementation of prediction logic
  • Evaluation of model performance
  • Iterative refinement
  • Back-off strategies
  • Optimization for computational efficiency