Understand the distribution and relationship between words, tokens and phrases in the text and build a linguistic predictive model.
Data source: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip.
As the data was relatively large, a 10% random sample of each of the following data sets was used: en_US.blogs.txt, en_US.news.txt and en_US.twitter.txt. These samples were combined and cleaned for analysis (e.g. removal of HTML tags, punctuation, bullet points and uncommon characters).
The tidytext and tidyverse packages were used for data processing and exploratory data analysis. The first step was to create tokens (mainly single words), which were cleaned to remove “non-words” (e.g. repeated vowels such as “iii”, and symbols). The tidy tokens were then explored to decide which models should be used.
N-gram language modeling was used for next-word prediction (i.e. predicting the following word). Markov chain network visualizations (built with ggraph on top of ggplot2) are used to understand how the model would predict the following words. This approach was used to plan how the machine learning model would be developed.
Load the data
# define the file path where the zipfile will be saved
destfile <- "/Documents/Data_Science_Projects/Coursera/JHU_DS/Capstone/Data/Coursera-Swiftkey.zip"
# save the URL with the zipfile
fileUrl <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
# Download the zipfile
download.file(url = fileUrl, destfile = destfile, method = "curl")
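The archive then needs to be unzipped so that the en_US files are available locally. A minimal sketch, assuming the final/ folder inside the zip is extracted next to the downloaded file:
# Unzip the archive next to the downloaded file (assumed layout: Data/final/en_US/)
unzip(destfile, exdir = dirname(destfile))
# Check that the English files are in place
list.files(file.path(dirname(destfile), "final/en_US"))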
Load libraries
library(purrr) # for map function
library(tidyverse)
library(tidytext)
library(stringr)
library(ggthemes)
library(gridExtra)
library(igraph)
library(ggraph)
library(pander)
Read the data
# Load the three data sets together
enUS_folder <- "Data/final/en_US/"
Corpus <- tibble(file = dir(enUS_folder, full.names = TRUE)) %>%
  mutate(text = map(file, read_lines)) %>%
  transmute(id = basename(file), text) %>%
  unnest(text)
# Print the file sizes and the Corpus object size
print(object.size(Corpus), units = "MB")
## 831.9 Mb
fileSizes <- tibble(
  id = list.files(enUS_folder),
  size = file.size(list.files(enUS_folder, full.names = TRUE))) %>%
  mutate(sizeKB = size / 1024)  # size is in bytes, so dividing by 1024 gives kilobytes
pander::pander(fileSizes)
| id | size | sizeKB |
|---|---|---|
| en_US.blogs.txt | 210160014 | 205234.4 |
| en_US.news.txt | 205811889 | 200988.2 |
| en_US.twitter.txt | 167105338 | 163188.8 |
The 3 files combined have 4269678 lines.
The table below presents the total number of lines for each data set (source), together with summary statistics of the per-line character counts: mean, standard deviation, median, minimum and the longest line (maximum). The longest line is the line with the most characters. Note that a tweet can have at most 140 characters.
| id | N. of Lines | mean | sd | median | min | Longest Line |
|---|---|---|---|---|---|---|
| en_US.blogs.txt | 899288 | 229.98695 | 258.66081 | 156 | 1 | 40833 |
| en_US.news.txt | 1010242 | 201.16285 | 133.21714 | 185 | 1 | 11384 |
| en_US.twitter.txt | 2360148 | 68.68045 | 37.22725 | 64 | 2 | 140 |
As the full data is too large to process, a 10% random sample will be taken.
set.seed(7067729)
# Take a 10% random sample of the combined corpus
# Clean the whole sample in one go before splitting it into training and test sets
# Remove "unusual" characters and vowel-only tokens
sampleCorpus <- sample_frac(Corpus, 0.1) %>%
  mutate(id = str_replace_all(id, c("en_US.twitter.txt" = "Twitter",
                                    "en_US.news.txt" = "News",
                                    "en_US.blogs.txt" = "Blogs"))) %>%
  mutate(text = str_replace_all(text, "[\r?\n|\røØ\\/\\#:)!?^~&=]|[^a-zA-Z0-9 ']|\\_|\\b[aeiou]{2,}\\b|'\\s+", "")) %>%
  mutate(text = tolower(text))
dim(sampleCorpus)
## [1] 426968 2
Splitting sampleCorpus into a Training set (80%) and a Test set (20%).
set.seed(2017)
cleanTrain <- sampleCorpus %>% sample_frac(0.8)
cleanTest <- anti_join(sampleCorpus, cleanTrain, by = c("id", "text"))
Save the file for future reuse.
save(cleanTrain, file = 'cleanTrain.RData')
# Remove the full corpus to free up memory for the steps below
rm(Corpus)
The first step is to transform the text into single tokens (words/Unigrams).
Words
# words by id
wordToken <- cleanTrain %>%
  # separate each line of text into single words (1-grams)
  unnest_tokens(unigram, text, token = "ngrams", n = 1) %>%
  # remove tokens consisting only of vowels (e.g. "iii")
  filter(!str_detect(unigram, "\\b[aeiou]{2,}\\b")) %>%
  mutate(unigram = factor(unigram, levels = rev(unique(unigram)))) %>%
  # count each word within each data source
  group_by(id, unigram) %>%
  count(unigram, sort = TRUE)
pander::pander(head(wordToken, 5))
| id | unigram | n |
|---|---|---|
| News | the | 157736 |
| Blogs | the | 148400 |
| Blogs | and | 86688 |
| Blogs | to | 85265 |
| Twitter | the | 74347 |
Compare total unigrams between groups (Blogs, News and Twitter)
Although the Blogs file has fewer lines, we can see that most words come from the Blogs data source.
The three data sets have different lengths, but since the purpose of the exercise is to develop a Shiny app based on the words that appear most frequently in the text, we’ll explore the three data sets together.
N-grams are consecutive sequences of words.
We’ve covered words as individual units and considered their frequencies to visualize the most common words in the three data sets. The next step is to build figures and tables to understand variation in the frequencies of words and word pairs in the data.
Define a function to calculate N-grams of different sizes
GetGrams <- function(clean_set, value){
  sentences <- clean_set %>%
    unnest_tokens(sentence, text, token = "ngrams", n = value) %>%
    # remove n-grams containing vowel-only tokens
    filter(!str_detect(sentence, "\\b[aeiou]{2,}\\b")) %>%
    mutate(sentence = factor(sentence, levels = rev(unique(sentence)))) %>%
    count(sentence, sort = TRUE)
  return(sentences)
}
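The n-gram objects used in the tables and code below (UniGram through PentaGram, matching the later save() call) are presumably produced by applying this function to the training set. The exact chunk is not shown, but it would look like:
# Build the n-gram tables from the training set (object names match the later save() call)
UniGram   <- GetGrams(cleanTrain, 1)
BiGram    <- GetGrams(cleanTrain, 2)
TriGram   <- GetGrams(cleanTrain, 3)
TetraGram <- GetGrams(cleanTrain, 4)
PentaGram <- GetGrams(cleanTrain, 5)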
Unigrams
| sentence | n |
|---|---|
| the | 380483 |
| to | 220709 |
| and | 192480 |
Bigrams
| sentence | n |
|---|---|
| of the | 34332 |
| in the | 32395 |
| to the | 17097 |
Trigrams
| sentence | n |
|---|---|
| one of the | 2791 |
| a lot of | 2475 |
| thanks for the | 1831 |
Tetragrams
| sentence | n |
|---|---|
| the end of the | 606 |
| the rest of the | 570 |
| at the end of | 517 |
Pentagrams
| sentence | n |
|---|---|
| at the end of the | 287 |
| for the first time in | 132 |
| in the middle of the | 107 |
# Total number of words by ID
TotalWords <- wordToken %>%
  group_by(id) %>%
  summarise(total = sum(n))
pander::pander(TotalWords)
| id | total |
|---|---|
| Blogs | 2973476 |
| News | 2740952 |
| Twitter | 2374121 |
Here n is the number of times a word is used in each data set (Twitter, Blogs, News). To look at the distribution for each data set we use n/total, the number of times a word appears divided by the total number of words in that set, which corresponds to the term frequency.
Plot word proportions distribution by id
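The plotting chunk is not reproduced in this excerpt; a minimal sketch of a faceted histogram of term frequency (n/total), assuming the wordToken and TotalWords objects defined above, could be:
# Histogram of term frequency (n / total) for each data source
wordToken %>%
  left_join(TotalWords, by = "id") %>%
  ggplot(aes(n / total, fill = id)) +
  geom_histogram(show.legend = FALSE) +
  xlim(NA, 0.0001) +
  facet_wrap(~ id, ncol = 3, scales = "free_y")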
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 22884 rows containing non-finite values (stat_bin).
The plots exhibit similar distributions for the three data sets: many words occur rarely and few occur frequently. Zipf's law states that the frequency with which a word appears is inversely proportional to its rank. We therefore consider the proportion of word counts and the cumulative proportion as a probability of word appearance in the text corpus. Covering 50% of all word instances should still include enough words for prediction.
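The helper termProp() is not defined in this excerpt; judging from the columns of its output below (prop, rank, cumprop) and the messages it prints, a sketch of it could be:
# Proportion, rank and cumulative proportion of each term
# (reconstructed from the output shown below; the original helper is not included here)
termProp <- function(gram) {
  name <- deparse(substitute(gram))
  out <- gram %>%
    mutate(prop = n / sum(n),
           rank = row_number(),
           cumprop = cumsum(prop))
  covered <- sum(out$cumprop <= 0.5)
  cat("\nFor", name, covered, "words cover 50% of all word instances.\n")
  out
}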
UniProp <- termProp(UniGram)
##
## For UniGram 152 words cover 50% of all word instances.
BiProp <- termProp(BiGram)
##
## For BiGram 50320 words cover 50% of all word instances.
TriProp <- termProp(TriGram)
##
## For TriGram 1793372 words cover 50% of all word instances.
TetraProp <- termProp(TetraGram)
##
## For TetraGram 3456988 words cover 50% of all word instances.
PentaProp <- termProp(PentaGram)
##
## For PentaGram 3914094 words cover 50% of all word instances.
head(PentaProp)
## # A tibble: 6 × 5
## sentence n prop rank cumprop
## <fctr> <int> <dbl> <int> <dbl>
## 1 at the end of the 287 3.550563e-05 1 3.550563e-05
## 2 for the first time in 132 1.633012e-05 2 5.183575e-05
## 3 in the middle of the 107 1.323729e-05 3 6.507304e-05
## 4 for the rest of the 102 1.261873e-05 4 7.769176e-05
## 5 thank you so much for 96 1.187645e-05 5 8.956821e-05
## 6 by the end of the 95 1.175273e-05 6 1.013209e-04
The number of rows (rank) gives us the top 213864 unique terms that could be used to cover 50% of all word instances in the language. In a frequency-sorted dictionary, 50% coverage should be enough for prediction. This is also shown in the histograms above, where the distributions have a long tail and most observations are concentrated at low frequencies.
Save ngrams into a single file for memory efficiency
save(UniGram, BiGram, TriGram, TetraGram, PentaGram, file = 'ngrams.RData')
save(UniProp, BiProp, TriProp, TetraProp, PentaProp, file = 'nprop.RData')
I would use an English dictionary (a list of English words) and match it against each word in the text corpus. The hunspell package could be useful here to detect words that do not match the list; these would be considered foreign or misspelled words. I would still keep them in the text for future predictions with bigrams and trigrams, so that when a non-English word occurs there is a chance that the next predicted words are non-English as well.
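As a quick hypothetical illustration (not part of the original analysis), hunspell could flag such words like this:
# Flag tokens that are not in the en_US hunspell dictionary (hypothetical example)
library(hunspell)
tokens <- c("the", "house", "bonjour", "wrld")
hunspell_check(tokens)           # TRUE for words found in the dictionary, FALSE otherwise
tokens[!hunspell_check(tokens)]  # candidate foreign or misspelled words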
We start by visualizing the relationship between words using a Markov chain: a model where the choice (probability) of each word depends only on the previous one. A word is generated by considering the most common words following the previous one.
To calculate the most common n-grams we need to separate the sentence column into one column per word.
BiProp_split <- BiProp %>%
  select(sentence, n, cumprop) %>%
  separate(sentence, c("word1", "word2"), sep = " ")
TriProp_split <- TriProp %>%
  select(sentence, n, cumprop) %>%
  separate(sentence, c("word1", "word2", "word3"), sep = " ")
## [1] 55
About 15% of the bigram instances in BiProp_split correspond to the top bigrams (BiGrams_top, whose row count is printed above). These are what we use in the Markov network visualization below.
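The chunk that creates BiGrams_top is not included in this excerpt; one plausible construction, assuming it keeps the most frequent bigrams up to 15% cumulative coverage, would be:
# Keep the most frequent bigrams, up to 15% cumulative coverage (assumed filter)
BiGrams_top <- BiProp_split %>%
  filter(cumprop <= 0.15)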
Use the top bigrams to build a data frame suitable for visualizing the Markov chain
bigram_all_graph <- BiGrams_top %>%
  graph_from_data_frame()
Visualizing a network with bigrams
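The plot itself is not reproduced here; a sketch of the network plot with ggraph (loaded above), using the bigram_all_graph object created just before, could be:
# Bigram network: nodes are words, edges link a word to the word that follows it
set.seed(2017)
ggraph(bigram_all_graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE) +
  geom_node_point(color = "lightblue", size = 3) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  theme_void()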
Split pentagrams into one word per column
PentaProp_split <- PentaProp %>%
  select(sentence, n, cumprop) %>%
  separate(sentence, c("word1", "word2", "word3", "word4", "word5"), sep = " ")
# Top observations: unique words that appear more times
PentaProp_top <- PentaProp_split %>% filter(n > 30)
PentaProp_top
## # A tibble: 71 × 7
## word1 word2 word3 word4 word5 n cumprop
## <chr> <chr> <chr> <chr> <chr> <int> <dbl>
## 1 at the end of the 287 3.550563e-05
## 2 for the first time in 132 5.183575e-05
## 3 in the middle of the 107 6.507304e-05
## 4 for the rest of the 102 7.769176e-05
## 5 thank you so much for 96 8.956821e-05
## 6 by the end of the 95 1.013209e-04
## 7 the end of the day 88 1.122077e-04
## 8 can't wait to see you 82 1.223522e-04
## 9 is going to be a 79 1.321255e-04
## 10 i can't wait to see 71 1.409091e-04
## # ... with 61 more rows
Visualizing a network with pentagrams
We should be able to get good estimates with models from bigrams up to 5-grams. Considering conditional probabilities of word occurrences, we can predict the next word: the bigram looks one word into the past, the trigram looks two words into the past, and so on; an N-gram looks N - 1 words into the past. If we just considered unigram frequencies we would get a skewed distribution of results, so we use a Kneser-Ney model to correct predictions in relation to the possible preceding words.
We can estimate probabilities with maximum likelihood estimation (MLE) on the training set by normalising the counts from the text corpus so that they lie between 0 and 1. For example, count all bigrams that share the same first word and use the count of that first word as the denominator: dividing each word sequence by the observed frequency of its prefix gives the relative frequency. One approach is to generate a matrix of probabilities for each word combination; multiplying all the bigram probabilities of a sentence then gives the probability of that sentence. The more probabilities we multiply together, the smaller the product becomes, which leads to numerical underflow. To overcome this we use log probabilities instead, adding them together: p1 * p2 * p3 * p4 = exp(log p1 + log p2 + log p3 + log p4).
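A minimal sketch of this MLE step, assuming the BiProp_split and UniGram objects from earlier, could be:
# MLE bigram probabilities: count(w1, w2) / count(w1)
unigram_counts <- UniGram %>%
  transmute(word1 = as.character(sentence), total = n)
bigram_mle <- BiProp_split %>%
  left_join(unigram_counts, by = "word1") %>%
  mutate(prob = n / total,
         logprob = log(prob))  # log probabilities avoid numerical underflow when multiplied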
We will have to compare different N-gram models. This is accomplished by dividing the data into two sets: we train the parameters of two or three models on the training set and then compare how well the models fit the test set. In the end we compare the models by their prediction accuracy and by perplexity (the inverse probability of the test set, normalized by the number of words). Building models on a training set and then testing them on a test set is what will be applied in this scenario, although the ideal would be to test them through an application, which would give a better sense of how much the application is improving. Considering time and memory efficiency, we keep to an intrinsic evaluation of the model.
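For reference, perplexity can be computed from the per-word log probabilities obtained on the test set; a minimal sketch, assuming a vector logprobs holding log P(word | context) for each test word:
# Perplexity: inverse probability of the test set, normalized by the number of words
perplexity <- function(logprobs) {
  exp(-mean(logprobs))
}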
We can apply a discounting method to assign probability to words that would otherwise have zero probability. We estimate the third word based on the previous two words: first we create a backoff estimate, in which we apply a discount to the probability estimates (count proportions). This produces sets of candidate words with different discounted count probabilities.
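A rough sketch of the backoff idea, simplified (not a full Kneser-Ney implementation) and assuming the TriProp_split and BiProp_split tables from earlier:
# Back off from the trigram table to the bigram table when the two-word context was never seen
predict_backoff <- function(w1, w2) {
  tri <- TriProp_split %>% filter(word1 == w1, word2 == w2)
  if (nrow(tri) > 0) {
    tri %>% arrange(desc(n)) %>% slice(1) %>% pull(word3)
  } else {
    BiProp_split %>% filter(word1 == w2) %>% arrange(desc(n)) %>% slice(1) %>% pull(word2)
  }
}
predict_backoff("end", "of")  # likely "the", judging from the tetragram table above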
For this report I used: