Scope

Understand the distribution of and relationships between words, tokens and phrases in the text, and build a linguistic predictive model.

Executive summary

Data source https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip.

As the data was relatively large, a 10% random sample of each of the following datasets was used: en_US.blogs.txt, en_US.twitter.txt and en_US.news.txt. These samples were combined and cleaned for analysis (e.g. removal of HTML tags, punctuation, bullet points and uncommon characters).

The tidytext and tidyverse packages were used for data processing and exploratory data analysis. The first step was to create tokens (mainly words), which were cleaned to remove “non-words” (e.g. repeated vowels such as “iii” and symbols). The tidy tokens were then explored to decide which models should be used.

N-gram language modeling was used for next-word prediction (i.e. predicting the following word). Markov chain network visualizations (using igraph and ggraph) are used to understand how the model would predict the following words. This approach was used to plan how the machine learning model would be developed.

Data Import

Load the data

# define the directory to store the zipfile
destfile <- "/Documents/Data_Science_Projects/Coursera/JHU_DS/Capstone/Data/Coursera-Swiftkey.zip"
# save the URL with the zipfile

fileUrl <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
# Download the zipfile
download.file(url = fileUrl, destfile = destfile, method = "curl")
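
The archive then needs to be extracted before it can be read; a minimal sketch, assuming the same Data folder that is read in the next section:

# Extract the zipfile into the Data folder read below (the target path is an assumption)
unzip(zipfile = destfile, exdir = "Data")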

Load libraries

library(purrr) # for map function
library(tidyverse)
library(tidytext)
library(stringr)
library(ggthemes)
library(gridExtra)
library(igraph)
library(ggraph)
library(pander)

Read the data

# Load the three data sets together
enUS_folder <- "Data/final/en_US/"
Corpus <- tibble(file = dir(enUS_folder, full.names = TRUE)) %>% 
          mutate(text = map(file, read_lines)) %>% 
          transmute(id = basename(file), text) %>% 
          unnest(text)

# Print the file sizes and the Corpus object size
print(object.size(Corpus), units = "MB")
## 831.9 Mb
fileSizes <- tibble(
             id = list.files(enUS_folder),
             size = file.size(list.files(enUS_folder, full.names = TRUE))) %>%
             mutate(sizeKb = size/1024) # bytes to kilobytes
pander::pander(fileSizes)
id                       size      sizeKb
en_US.blogs.txt     210160014    205234.4
en_US.news.txt      205811889    200988.2
en_US.twitter.txt   167105338    163188.8

The 3 files combined have 4269678 lines.

Data Processing

The table below presents the total number of lines per data set (source), together with the mean, standard deviation (sd), median, minimum and maximum of the per-line character counts. The longest line (max) is the line with the most characters. Note that a tweet could contain at most 140 characters.

id                  N. of Lines        mean          sd   median   min   Longest Line
en_US.blogs.txt          899288    229.98695   258.66081      156     1          40833
en_US.news.txt          1010242    201.16285   133.21714      185     1          11384
en_US.twitter.txt       2360148     68.68045    37.22725       64     2            140
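
The code that produced these summaries is not shown in the report; a minimal sketch, assuming the statistics are per-line character counts computed with nchar():

# per-line character counts summarised by source (a reconstruction, not the original code)
lineStats <- Corpus %>%
        mutate(chars = nchar(text)) %>%
        group_by(id) %>%
        summarise(`N. of Lines` = n(),
                  mean = mean(chars),
                  sd = sd(chars),
                  median = median(chars),
                  min = min(chars),
                  `Longest Line` = max(chars))
pander::pander(lineStats)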

Data Cleaning and Tokenization

As the full data set is too large to process comfortably, a 10% random sample will be taken.

set.seed(7067729)
# sample each dataset
# Clean the whole sample in one go before splitting it into a training and a test set
# Remove "unusual" characters (symbols, vowel-only tokens, etc.)
sampleCorpus <- sample_frac(Corpus, 0.1) %>%
                mutate(id = str_replace_all(id, c("en_US.twitter.txt" = "Twitter", 
                                             "en_US.news.txt" = "News",
                                             "en_US.blogs.txt" = "Blogs"))) %>% 
               mutate(text = str_replace_all(text, "[\r?\n|\røØ\\/\\#:)!?^~&=]|[^a-zA-Z0-9 ']|\\_|\\b[aeiou]{2,}\\b|'\\s+", "")) %>%
               mutate(text = tolower(text))

dim(sampleCorpus)
## [1] 426968      2

Splitting sampleCorpus into a Training set (80%) and a Test set (20%).

set.seed(2017)
cleanTrain <- sampleCorpus %>% sample_frac(0.8)
cleanTest  <- anti_join(sampleCorpus, cleanTrain, by = c("id", "text"))

Save the file for future reuse.

save(cleanTrain, file = 'cleanTrain.RData')

# Remove the full Corpus to free up memory for the steps that follow.
rm(Corpus)

Exploratory Data Analysis

Some words are more frequent than others - what are the distributions of word frequencies?

The first step is to transform the text into single tokens (words/Unigrams).

Words

# words by id
wordToken <- cleanTrain %>%
        # separate each line of text into 1-gram
        unnest_tokens(unigram, text, token = "ngrams", n = 1) %>%  
        # remove tokens that consist only of vowels (e.g. "iii")
        filter(!str_detect(unigram, "\\b[aeiou]{2,}\\b")) %>% 
        mutate(
                unigram = factor(unigram, levels = rev(unique(unigram)))
                ) %>%
        group_by(id, unigram) %>%
        count(unigram, sort = TRUE)
pander::pander(head(wordToken, 5))
id        unigram        n
News      the       157736
Blogs     the       148400
Blogs     and        86688
Blogs     to         85265
Twitter   the        74347

Compare total unigrams between groups (blogs, news and twitter)
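
The bar chart for this comparison did not survive extraction; a minimal sketch, assuming the wordToken object created above:

# total number of word tokens per source (sketch of the missing plot)
wordToken %>%
        group_by(id) %>%
        summarise(total = sum(n)) %>%
        ggplot(aes(x = id, y = total, fill = id)) +
        geom_col(show.legend = FALSE) +
        labs(x = NULL, y = "Total unigrams")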

Although the Blogs file has fewer lines, most of the words come from the Blogs source.

The three datasets have different lengths, but since the purpose of this exercise is to develop a Shiny app based on the words that appear most frequently in the text, we’ll explore the three datasets together.

What are the frequencies of 2-grams and 3-grams in the dataset?

N-grams are consecutive sequences of words.

We’ve covered words as individual units and considered their frequencies to visualize the most common words in the three data sets. The next step is to build figures and tables to understand the variation in the frequencies of words and word pairs in the data.

Define a function to calculate N-grams of different sizes

GetGrams <- function(clean_set, value){
        sentences <- clean_set %>%
                # separate each line of text into n-grams of the requested size
                unnest_tokens(sentence, text, token = "ngrams", n = value) %>%
                # remove n-grams containing vowel-only tokens
                filter(!str_detect(sentence, "\\b[aeiou]{2,}\\b")) %>% 
                mutate(
                        sentence = factor(sentence, levels = rev(unique(sentence)))
                ) %>%
                count(sentence, sort = TRUE) 
        return(sentences)
}
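
The calls that generate the n-gram tables used in the rest of the report are not shown; presumably something along these lines:

# build the n-gram frequency tables referenced below (a reconstruction)
UniGram   <- GetGrams(cleanTrain, 1)
BiGram    <- GetGrams(cleanTrain, 2)
TriGram   <- GetGrams(cleanTrain, 3)
TetraGram <- GetGrams(cleanTrain, 4)
PentaGram <- GetGrams(cleanTrain, 5)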

Unigrams

sentence        n
the        380483
to         220709
and        192480

Bigrams

sentence        n
of the      34332
in the      32395
to the      17097

Trigrams

sentence             n
one of the        2791
a lot of          2475
thanks for the    1831

Tetragrams

sentence               n
the end of the       606
the rest of the      570
at the end of        517

Pentagrams

sentence                    n
at the end of the         287
for the first time in     132
in the middle of the      107

How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?

# Total number of words by ID
TotalWords <- wordToken %>%
        group_by(id) %>% 
        summarise(total = sum(n)) 
pander::pander(TotalWords)
id          total
Blogs     2973476
News      2740952
Twitter   2374121

Here n is the number of times a word is used in each data set (Twitter, Blogs, News). To look at the distribution for each data set we divide the number of times a word appears (n) by the total number of words in that set (total); this ratio is the term frequency.

Plot word proportions distribution by id
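
The plotting code did not survive extraction; a minimal sketch of the term-frequency histograms, assuming the wordToken and TotalWords objects above (the axis cutoff is an assumption, included only to account for the warning below):

# histogram of term frequency (n/total) faceted by source (a reconstruction)
wordToken %>%
        left_join(TotalWords, by = "id") %>%
        ggplot(aes(x = n/total, fill = id)) +
        geom_histogram(show.legend = FALSE) +
        xlim(NA, 0.0009) +      # truncating the long tail drops some rows, hence the warning
        facet_wrap(~ id, ncol = 3, scales = "free_y")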

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 22884 rows containing non-finite values (stat_bin).

The plots exhibit similar distributions for the three data sets: many words occur rarely and few occur frequently. Zipf’s law states that the frequency with which a word appears is inversely proportional to its rank. We therefore consider the proportion of word counts and the cumulative proportion as a probability of a word appearing in the text corpus. Covering 50% of all word instances should still provide enough words for prediction.
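
The termProp() helper used below is not defined in the report; a minimal sketch consistent with its printed messages and output columns, assuming it computes term proportions, ranks and cumulative proportions:

# sketch of the missing termProp() helper (an assumption, not the original code)
termProp <- function(gram_tbl){
        name <- deparse(substitute(gram_tbl))
        out <- gram_tbl %>%
                ungroup() %>%
                mutate(prop = n / sum(n),
                       rank = row_number(),
                       cumprop = cumsum(prop))
        cat("\n For", name, sum(out$cumprop <= 0.5),
            "words cover 50% of all word instances.\n")
        return(out)
}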

UniProp <- termProp(UniGram)
## 
##  For UniGram 152 words cover 50% of all word instances.
BiProp <- termProp(BiGram)
## 
##  For BiGram 50320 words cover 50% of all word instances.
TriProp <- termProp(TriGram)
## 
##  For TriGram 1793372 words cover 50% of all word instances.
TetraProp <- termProp(TetraGram)
## 
##  For TetraGram 3456988 words cover 50% of all word instances.
PentaProp <- termProp(PentaGram)
## 
##  For PentaGram 3914094 words cover 50% of all word instances.
head(PentaProp)
## # A tibble: 6 × 5
##                sentence     n         prop  rank      cumprop
##                  <fctr> <int>        <dbl> <int>        <dbl>
## 1     at the end of the   287 3.550563e-05     1 3.550563e-05
## 2 for the first time in   132 1.633012e-05     2 5.183575e-05
## 3  in the middle of the   107 1.323729e-05     3 6.507304e-05
## 4   for the rest of the   102 1.261873e-05     4 7.769176e-05
## 5 thank you so much for    96 1.187645e-05     5 8.956821e-05
## 6     by the end of the    95 1.175273e-05     6 1.013209e-04

The number of rows (rank) gives us the top 213864 unique terms that could be used to cover 50% of all word instances in the language. In a frequency-sorted dictionary, covering 50% of all word instances should be enough for prediction. This is also shown in the histograms above, whose long tails indicate many rare words and only a few very frequent ones.

Save the n-grams and their proportion tables into files for memory efficiency and future reuse

save(UniGram, BiGram, TriGram, TetraGram, PentaGram, file = 'ngrams.RData')
save(UniProp, BiProp, TriProp, TetraProp, PentaProp, file = 'nprop.RData')

How do you evaluate how many of the words come from foreign languages?

I would use an English dictionary or a list of English words and match it against each word in the text corpus. The hunspell package could be useful here to detect words that do not match the list; these would be considered foreign or misspelled. I would still keep them in the text for future predictions with bi-grams and tri-grams, meaning that when a non-English word occurs there is a chance that the next words predicted are non-English words as well.
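
A rough sketch of that check, assuming the hunspell package and the wordToken table above (not part of the original analysis):

# flag unigrams that fail a US-English spell check (a sketch, not the original code)
library(hunspell)
nonEnglish <- wordToken %>%
        ungroup() %>%
        mutate(unigram = as.character(unigram)) %>%
        filter(!hunspell_check(unigram, dict = dictionary("en_US")))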

Can you think of a way to increase the coverage - identifying words that may not be in the corpora or using a smaller number of words in the dictionary to cover the same number of phrases?

  • Generate term-frequency matrices with ngrams and run a prediction model on it.
  • Use a Markov chain as a measure to save memory and randomly predict n-grams.

Text Modeling

How can you efficiently store an n-gram model?

We start by visualizing the relationships between words using a Markov chain. A Markov chain is a model in which the probability of each word depends only on the previous word. A word is generated by considering the most common words that follow the previous one.

To calculate the most common n-grams we need to separate the sentence column into N word columns

BiProp_split <- BiProp %>% 
        select(sentence, n, cumprop) %>%
        separate(sentence, c("word1", "word2"), sep = " ") 
TriProp_split <- TriProp %>%
                        select(sentence, n, cumprop) %>%
                        separate(sentence, c("word1", "word2", "word3"), sep = " ")
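
The chunk that builds BiGrams_top did not survive extraction; a sketch that simply keeps the most frequent bigrams (the 55 is taken from the output below; the author's exact selection rule is unknown):

# keep the most frequent bigrams for the network visualization (a reconstruction)
# BiProp_split is already sorted by count, so the first rows are the most frequent bigrams
BiGrams_top <- head(BiProp_split, 55)
nrow(BiGrams_top)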
## [1] 55

The top bigrams in BiProp_split (around 15% of all bigram instances) correspond to 55 rows (BiGrams_top). That’s what we use in the Markov network visualization below.

Use the top bigrams to build a data frame suitable for visualizing the Markov chain as a network

bigram_all_graph <- BiGrams_top %>%
        graph_from_data_frame()

Visualizing a network with bigrams
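
The ggraph code for this figure is missing; a minimal sketch following the standard ggraph pattern, assuming the bigram_all_graph object above:

# draw the bigram network: nodes are words, edges point from word1 to word2 (a reconstruction)
set.seed(2017)  # fix the layout for reproducibility
ggraph(bigram_all_graph, layout = "fr") +
        geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
                       arrow = grid::arrow(type = "closed", length = grid::unit(0.1, "inches"))) +
        geom_node_point(color = "lightblue", size = 3) +
        geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
        theme_void()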

Split pentagrams into one word per column

PentaProp_split <- PentaProp %>% 
        select(sentence, n, cumprop) %>%
        separate(sentence, c("word1", "word2", "word3", "word4", "word5"), sep = " ") 
# Top observations: the pentagrams that occur most often (n > 30)
PentaProp_top <- PentaProp_split %>% filter(n > 30)
PentaProp_top
## # A tibble: 71 × 7
##    word1 word2  word3 word4 word5     n      cumprop
##    <chr> <chr>  <chr> <chr> <chr> <int>        <dbl>
## 1     at   the    end    of   the   287 3.550563e-05
## 2    for   the  first  time    in   132 5.183575e-05
## 3     in   the middle    of   the   107 6.507304e-05
## 4    for   the   rest    of   the   102 7.769176e-05
## 5  thank   you     so  much   for    96 8.956821e-05
## 6     by   the    end    of   the    95 1.013209e-04
## 7    the   end     of   the   day    88 1.122077e-04
## 8  can't  wait     to   see   you    82 1.223522e-04
## 9     is going     to    be     a    79 1.321255e-04
## 10     i can't   wait    to   see    71 1.409091e-04
## # ... with 61 more rows

Visualizing a network with pentagrams

How many parameters do you need (i.e. how big is n in your n-gram model)?

We should be able to get good estimates with n-grams up to n = 5. Considering conditional probabilities of word occurrences, we can predict the next word: the bigram looks one word into the past, the trigram looks two words into the past, and so on; an N-gram looks N - 1 words into the past. If we just considered the unigram frequencies we would get a skewed distribution of results, so we use Kneser-Ney smoothing to correct the predictions with respect to the possible preceding words.

Can you think of simple ways to “smooth” the probabilities (think about giving all n-grams a non-zero probability even if they aren’t observed in the data)?

We can estimate probabilities with maximum likelihood estimation (MLE) on the training set, normalising the counts from the text corpus so that they lie between 0 and 1. For example, count all bigrams that share the same first word and use the total count of that first word as the denominator: dividing each word sequence's count by the observed frequency of its prefix gives the relative frequency. One approach is to generate a matrix of probabilities for each word combination; multiplying all the bigram probabilities in a sentence then gives the probability of that sentence. The more probabilities we multiply together, the smaller the product becomes, which leads to numerical underflow. To overcome this we can use log probabilities instead and add them together (p1 * p2 * p3 * p4 = exp(logp1 + logp2 + logp3 + logp4)).
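
As an illustration of that normalisation, a small sketch using the BiProp_split table above (an illustration only, not the author's final model):

# MLE bigram probabilities: count(word1, word2) / count(word1), kept as log probabilities
bigram_mle <- BiProp_split %>%
        group_by(word1) %>%
        mutate(prob = n / sum(n),        # relative frequency given the prefix word1
               logprob = log(prob)) %>%  # log probabilities avoid numerical underflow
        ungroup()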

How do you evaluate whether your model is any good?

We will have to compare different N-gram models. This is accomplished by splitting the data into two sets: we train the parameters of two or three models on the training set and then compare how well the models fit the test set. In the end we compare the models by their prediction accuracy and by perplexity (the inverse probability of the test set, normalized by the number of words). Building models on a training set and testing them on a test set is what will be applied in this scenario, though the ideal would be to test the model through an application, which would give a better sense of how much the application is improving. Considering time and memory efficiency, we keep to an intrinsic evaluation of the model.
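
For reference, a tiny sketch of the perplexity computation described above, assuming per-word log probabilities from whichever model is being evaluated:

# perplexity = exp of the negative mean log probability of the test words
# PP(W) = exp(-(1/N) * sum(log p(w_i | history)))
perplexity <- function(logprobs) exp(-mean(logprobs))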

How can you use backoff models to estimate the probability of unobserved n-grams?

We can apply a discounting method to assign non-zero probability to unseen words. We estimate the third word based on the previous two words; when a trigram has not been observed, we back off to the bigram estimate, and from there to the unigram estimate. First we create a backoff estimate where we apply a discount to the probability estimates (count proportions), which redistributes some probability mass across sets of words with different count probabilities.
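
A simple backoff lookup along those lines, using the split n-gram tables above (a sketch of the idea, without the discounting step, not the final model):

# given the two preceding words, prefer trigram continuations, then bigram, then top unigrams
predictNext <- function(w1, w2, k = 3){
        tri <- TriProp_split %>% filter(word1 == w1, word2 == w2) %>% arrange(desc(n)) %>% head(k)
        if (nrow(tri) > 0) return(as.character(tri$word3))
        bi <- BiProp_split %>% filter(word1 == w2) %>% arrange(desc(n)) %>% head(k)
        if (nrow(bi) > 0) return(as.character(bi$word2))
        as.character(head(UniProp$sentence, k))
}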

References

For this report I used: