The capstone project of the Coursera Data Science Specialization includes a milestone report in week 2. The aim of this report is to show an understanding of the statistical properties of the data set that can later be used when building the word prediction model. Using exploratory data analysis, this report describes the major features of the available data and then summarizes my plans for creating the word prediction model.
rm(list = ls(all.names = TRUE))
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readtext)
library(quanteda)
## Warning in .recacheSubclasses(def@className, def, env): undefined subclass
## "ndiMatrix" of class "replValueSp"; definition not updated
## Package version: 3.3.1
## Unicode version: 15.1
## ICU version: 74.1
## Parallel computing: 8 of 8 threads used.
## See https://quanteda.io for tutorials and examples.
##
## Attaching package: 'quanteda'
##
## The following object is masked from 'package:readtext':
##
## texts
library(quanteda.textstats)
## Warning in .recacheSubclasses(def@className, def, env): undefined subclass
## "ndiMatrix" of class "replValueSp"; definition not updated
library(quanteda.textplots)
library(quanteda.textmodels)
library(gt)
I am using data from three available sources in English: blogs, news, and Twitter. The data are downloaded from the provided link as a zip file and unzipped, and the directory structure is visualized.
# download and unzip
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download_zip <- "download.zip"
if(!file.exists(download_zip)){
download.file(url, destfile = download_zip)
}
if(!dir.exists("final")){
unzip(download_zip)
}
zip_contents <- unzip(download_zip, list = TRUE) # data frame listing the files in the archive
file_list <- gt(zip_contents)
file_list
| Name | Length | Date |
|---|---|---|
| final/ | 0 | 2014-07-22 10:10:00 |
| final/de_DE/ | 0 | 2014-07-22 10:10:00 |
| final/de_DE/de_DE.twitter.txt | 75578341 | 2014-07-22 10:11:00 |
| final/de_DE/de_DE.blogs.txt | 85459666 | 2014-07-22 10:11:00 |
| final/de_DE/de_DE.news.txt | 95591959 | 2014-07-22 10:11:00 |
| final/ru_RU/ | 0 | 2014-07-22 10:10:00 |
| final/ru_RU/ru_RU.blogs.txt | 116855835 | 2014-07-22 10:12:00 |
| final/ru_RU/ru_RU.news.txt | 118996424 | 2014-07-22 10:12:00 |
| final/ru_RU/ru_RU.twitter.txt | 105182346 | 2014-07-22 10:12:00 |
| final/en_US/ | 0 | 2014-07-22 10:10:00 |
| final/en_US/en_US.twitter.txt | 167105338 | 2014-07-22 10:12:00 |
| final/en_US/en_US.news.txt | 205811889 | 2014-07-22 10:13:00 |
| final/en_US/en_US.blogs.txt | 210160014 | 2014-07-22 10:13:00 |
| final/fi_FI/ | 0 | 2014-07-22 10:10:00 |
| final/fi_FI/fi_FI.news.txt | 94234350 | 2014-07-22 10:11:00 |
| final/fi_FI/fi_FI.blogs.txt | 108503595 | 2014-07-22 10:12:00 |
| final/fi_FI/fi_FI.twitter.txt | 25331142 | 2014-07-22 10:10:00 |
# paths to the en_US files (rows 11-13 of the zip listing)
path_twitter <- zip_contents[11, 1]
path_news <- zip_contents[12, 1]
path_blogs <- zip_contents[13, 1]
file_size <- c(round(file.size(path_twitter)/1024^2, 1), round(file.size(path_news)/1024^2, 1), round(file.size(path_blogs)/1024^2, 1))
con <- file(path_twitter, "r")
tw1 <- readLines(con, 3)
close(con)
con <- file(path_news, "r")
ne1 <- readLines(con, 3)
close(con)
con <- file(path_blogs, "r")
bl1 <- readLines(con, 3)
close(con)
# read the en_US files; docvars are parsed from the file names
txt_raw <- readtext(zip_contents[10, 1], docvarsfrom = "filenames", docvarnames = c("language", "source"))
corpus_raw <- corpus(txt_raw) # preparing raw corpus
# corpus documents are ordered alphabetically (blogs, news, twitter), so the indices are reversed to match the twitter/news/blogs order used below
sentence_no <- c(nsentence(corpus_raw[3]), nsentence(corpus_raw[2]), nsentence(corpus_raw[1]))
Basic parameters of the source files are listed below.
# visualization of exploratory analysis outcome
source_file <- c("US Twitter", "US News", "US Blogs")
param <- data.frame(source_file, file_size, sentence_no)
table_gt <- gt(param)
table_gt
| source_file | file_size | sentence_no |
|---|---|---|
| US Twitter | 159.4 | 2580555 |
| US News | 196.3 | 143662 |
| US Blogs | 200.4 | 2072623 |
There are three rather large text files, ranging in size from 159 MB to 200 MB. Each of them contains from hundreds of thousands to millions of sentences.
Example texts, the first three lines of each file, are shown below.
example_text <- data.frame(tw1, ne1, bl1)
names(example_text) <- source_file
example_gt <- example_text|> gt()
example_gt
| US Twitter | US News | US Blogs |
|---|---|---|
| How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long. | He wasn't home alone, apparently. | In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”. |
| When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason. | The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s. | We love you Mr. Brown. |
| they've decided its more fun if I don't. | WSU's plans quickly became a hot topic on local online sites. Though most people applauded plans for the new biomedical center, many deplored the potential loss of the building. | Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him. |
As expected, I am dealing with “typical” natural language data: variability, a wide range of expressions, diverse vocabulary and grammar structures, regional dialects, ambiguity and context-dependent meanings, creativity in the form of novel combinations of words and ideas, and imperfection in the form of errors, hesitations, and informal expressions.
To work efficiently with the corpus, given the limitations of my PC, I am going to work with only 2% of the original corpus.
I am merging the content of all three files because I would like to train my model on a single training set in the later stages of this project.
# sub-setting corpus to be easier to handle in following steps
corpus_sentences <- corpus_reshape(corpus_raw, to = "sentences")
corpus_samples <- corpus_sample(corpus_sentences, size = 0.02*ndoc(corpus_sentences))
corpus_samples_stats <- summary(corpus_samples)
ndoc(corpus_samples)
## [1] 95936
head(corpus_samples_stats)
## Text Types Tokens Sentences language source
## 1 en_US.blogs.txt.1293786 24 26 1 en US.blogs
## 2 en_US.twitter.txt.1620545 30 31 1 en US.twitter
## 3 en_US.twitter.txt.1990352 10 10 1 en US.twitter
## 4 en_US.twitter.txt.819516 16 17 1 en US.twitter
## 5 en_US.twitter.txt.669838 18 19 1 en US.twitter
## 6 en_US.twitter.txt.922339 23 25 1 en US.twitter
The sampled corpus contains around 96,000 sentences.
Once I have the corpus, I can create a basic token list with punctuation, numbers, and URLs removed.
token_samples <- tokens(corpus_samples, remove_punct = TRUE, remove_numbers = TRUE, remove_url = TRUE) |>
  tokens_tolower() # tokens() no longer takes a tolower argument, so lowercase in a separate step
Preparing a document-feature matrix (DFM) is my next step. I will use it for several of the following analyses and steps of this project.
dfm_samples <- dfm(token_samples)
dfm_samples[, 1:5]
## Document-feature matrix of: 95,936 documents, 5 features (99.92% sparse) and 2 docvars.
## features
## docs north coast brother thelonius 355ml
## en_US.blogs.txt.1293786 1 1 1 1 1
## en_US.twitter.txt.1620545 0 0 0 0 0
## en_US.twitter.txt.1990352 0 0 0 0 0
## en_US.twitter.txt.819516 0 0 0 0 0
## en_US.twitter.txt.669838 0 0 0 0 0
## en_US.twitter.txt.922339 0 0 0 0 0
## [ reached max_ndoc ... 95,930 more documents ]
I decided NOT to remove stop words, as I expect them to be important for word prediction because they are used extensively in sentences.
Let's check whether the word "data" appears in my corpus and how it is used in context.
kw_data <- kwic(token_samples, pattern = "data")
nrow(kw_data)
## [1] 129
gt(head(kw_data, 5))
| docname | from | to | pre | keyword | post | pattern |
|---|---|---|---|---|---|---|
| en_US.twitter.txt.1242970 | 6 | 6 | Apple #39 s North Carolina | Data | Center Will Feature Biogas Generators | data |
| en_US.twitter.txt.1208822 | 3 | 3 | Yes the | data | will be available for others | data |
| en_US.twitter.txt.260506 | 11 | 11 | shocked to raise fees throttle | data | for Shout out to my | data |
| en_US.twitter.txt.92853 | 15 | 15 | inputting the crazy amounts of | data | from library wkrs calling it | data |
| en_US.blogs.txt.643660 | 53 | 53 | for department cell phones and | data | plans | data |
There are 129 occurrences of the word "data" in my data set. Five examples of the word in context are shown above. Just for fun.
A couple of visualizations can help to get a better understanding of the corpus content.
top_20 <- topfeatures(dfm_samples, 20)
top_f <- data.frame(word = names(top_20), n = top_20)
top_f %>%
ggplot(aes(n, reorder(word,n))) +
geom_col() +
labs(x="Frequency",y = "Unigram") +
theme_light()
We can see that words usually classified as stop words are the most common in our data set. This is another indication that they are needed in our prediction model.
Unigrams are already available, but I would also like to look at bigrams, trigrams, and quadgrams, as these could be useful for the prediction model. This is a computationally intensive process and it needs a lot of memory. I might need to save the n-grams to disk while optimizing my model (a minimal sketch of doing so follows the code below).
bigrams <- tokens_ngrams(token_samples, n = 2)
trigrams <- tokens_ngrams(token_samples, n = 3)
quadgrams <- tokens_ngrams(token_samples, n = 4)
dfm_bi <- dfm(bigrams)
dfm_tri <- dfm(trigrams)
dfm_quad <- dfm(quadgrams)
bigram_tot <- enframe(sort(colSums(dfm_bi),TRUE))
trigram_tot <- enframe(sort(colSums(dfm_tri),TRUE))
quadgram_tot <- enframe(sort(colSums(dfm_quad),TRUE))
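Since memory is the main constraint, the n-gram count tables can be persisted and reloaded later. A minimal sketch, assuming a hypothetical ngrams/ output directory:
# save the n-gram count tables so they can be reloaded without recomputing them
if (!dir.exists("ngrams")) dir.create("ngrams")
saveRDS(bigram_tot, "ngrams/bigram_tot.rds")
saveRDS(trigram_tot, "ngrams/trigram_tot.rds")
saveRDS(quadgram_tot, "ngrams/quadgram_tot.rds")
# later, e.g. in the prediction app: bigram_tot <- readRDS("ngrams/bigram_tot.rds")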
Now we can have a look at which n-grams are the most common.
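For example, the ten most frequent bigrams could be displayed like this (a minimal sketch; the same pattern applies to trigram_tot and quadgram_tot):
# enframe() produced name/value columns; rename them for readability
bigram_tot |>
  rename(bigram = name, n = value) |>
  head(10) |>
  gt()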
My initial steps are barely the start of the real work. There are many things I still need to tackle.
Bigram model with Laplace smoothing. Concept: In a bigram model, the probability of a word depends only on the preceding word. Laplace smoothing is applied to handle cases where certain bigrams are unseen in the training data.
Implementation: Calculate bigram probabilities by counting occurrences of word pairs and apply Laplace smoothing to account for unseen combinations. To predict the next word, use the bigram with the highest probability given the current word.
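A minimal sketch of this idea using the objects built above; the helper predict_bigram() and its exact form are my own assumptions, not final code:
# unigram counts and vocabulary size from the sampled DFM
unigram_tot <- enframe(colSums(dfm_samples))
V <- nrow(unigram_tot)

# hypothetical helper: rank candidate next words after w1 using
# add-one (Laplace) smoothed bigram probabilities
predict_bigram <- function(w1, top_n = 3) {
  w1 <- tolower(w1)
  c_w1 <- sum(unigram_tot$value[unigram_tot$name == w1])
  bigram_tot |>
    separate(name, into = c("first", "second"), sep = "_", extra = "merge") |>
    filter(first == w1) |>
    mutate(prob = (value + 1) / (c_w1 + V)) |> # Laplace smoothing
    arrange(desc(prob)) |>
    slice_head(n = top_n)
}

predict_bigram("of")
Note that for a fixed preceding word the smoothed ranking equals the raw-count ranking; the smoothing mainly matters for assigning non-zero probability to unseen bigrams.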
Trigram model with back-off. Concept: A trigram model predicts the next word based on the two preceding words. Back-off is employed to gracefully handle cases where a trigram is unseen, falling back to a bigram or unigram model.
Implementation: Calculate trigram probabilities and, if a trigram is unseen, fall back to the corresponding bigram or unigram probabilities. Choose the word with the highest probability for prediction.
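And a minimal sketch of the back-off idea, reusing the hypothetical predict_bigram() helper from above:
# hypothetical helper: predict the next word from the two preceding words,
# backing off to the bigram model when the trigram prefix is unseen
predict_backoff <- function(w1, w2, top_n = 3) {
  prefix <- paste(tolower(w1), tolower(w2), sep = "_")
  hits <- trigram_tot |>
    filter(str_starts(name, fixed(paste0(prefix, "_")))) |>
    arrange(desc(value)) |>
    slice_head(n = top_n)
  if (nrow(hits) > 0) return(hits)
  predict_bigram(w2, top_n) # back off to the bigram model
}

predict_backoff("one", "of")
A full implementation would also discount the backed-off scores (for example, stupid back-off multiplies them by a fixed factor), which I will address when building the final model.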
These models represent a progression in complexity from the bigram to the trigram model. They serve as foundational building blocks for more advanced N-gram models and can be implemented with relative simplicity using basic probability calculations.