Document motivation

The final project for the Coursera Data Science Capstone course is to implement a Shiny-based application that predicts the next word based on the words entered by a user. For this assignment the aims are to:

  1. Demonstrate that you’ve downloaded the data and have successfully loaded it in.
  2. Create a basic report of summary statistics about the data sets.
  3. Report any interesting findings that you amassed so far.
  4. Get feedback on your plans for creating a prediction algorithm and Shiny app.

In this document I will show how some basic exploratory data analysis produces findings that help shape and refine the final data product. At this stage, these key learnings are the focus of the analysis.

Synopsis

The datasets that will be used to implement the word predictor have been provided by Coursera and consist of text files coming from three different sources (news, blogs and twitter) in four different languages (English, Russian, German and Finnish). The full repository can be downloaded from the link reported in the next section. In this report only the English/US datasets have been analysed, following the steps described below.

To explore the datasets I have used a tidyverse approach, mostly based on Text Mining with R. The main reason is that it closely reflects the workflow and the libraries used so far in the Coursera specialization. Moreover, at least in this preliminary part, it lets me avoid dealing with massive data structures that would slow down computation and plotting.

Downloading and evaluating size of the dataset

The dataset has been downloaded from the following link:

  • https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

unzipped and stored in a Data subfolder. For the project and the application I will focus on the en_US language. Using the cygwin command:

$ wc -lwmc en_US.blogs.txt en_US.twitter.txt en_US.news.txt > summary.txt

we can extract the first summary information:

Summary table for datasets

File                 Lines      Words  Characters  Size (Mb)
en_US.blogs.txt     899288   37333958   208623085     200.42
en_US.twitter.txt  2360148   30357171   166843164     159.36
en_US.news.txt     1010242   34365905   205243643     196.28

As expected, the en_US.blogs.txt file contains more words although fewer lines, meaning that its lines are on average longer. As the files are quite large and processing all the data would be unfeasible, the approach is to open each file separately, sample a preset number of lines, and delete the object containing the whole file in order to save memory.

Sampling the dataset

In order to decide which percentage of data to sample, I compared a \(1\%\) sampled subset of the blogs dataset with the whole dataset (the code of the function is in the Appendix). The blogs dataset was chosen because of its variety (its vocabulary sits somewhere between a “news” and a “twitter” vocabulary), its size, and also to effectively test the profanity filter, since the news dataset contains, not surprisingly, very little swearing.
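The sampling itself relies on the createSampledDfText function reported in the Appendix. As a usage sketch (the file path and object name are assumptions, not the exact code used):

# sample 1% of the blogs dataset; the complete file is read only inside
# the function, so no large object is kept in the workspace
blogs.sampled <- createSampledDfText("Data/en_US.blogs.txt",
                                     sample.percentage = 0.01,
                                     book = "blogs")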

Sampled vs Entire dataset - Number of Characters

The following table summarises the basic statistics about the number of characters in the two datasets before cleaning the text.

Number of characters of the sampled vs complete datasets

                    Sampled Text      All Text
Sum                   2006628.00  206824505.00
Mean                      223.16        229.99
Standard Deviation        239.77        258.66
Median                    151.00        156.00
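These statistics can be reproduced with a minimal nchar-based sketch (blogs.sampled and blogs.all are hypothetical data frames holding the sampled and the complete text):

char.summary <- function(text)
{
    n.char <- nchar(text)
    c(Sum = sum(n.char), Mean = mean(n.char),
      `Standard Deviation` = sd(n.char), Median = median(n.char))
}

# statistics as rows, datasets as columns, as in the table above
cbind(`Sampled Text` = char.summary(blogs.sampled$text),
      `All Text` = char.summary(blogs.all$text))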

Without performing any further analysis (e.g. a significance test), the table shows that the numbers are consistent, at least for the number of characters. Let's now consider the word distribution.

Sampled vs Entire dataset - Token frequency

In this section I will compare the word frequency¹ between the sampled and the entire dataset. Moreover, the workflow used to compute the word frequency will later be applied to the other datasets (news and twitter). To calculate the word counts, all text has been preprocessed using the clean.text function (see Appendix; a short usage sketch follows the list below), which performs the following steps:

  1. transform all text to lower case
  2. remove non UTF-8 characters
  3. remove punctuation
  4. remove digits
  5. remove references to websites (to improve)
  6. remove extra white spaces
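A short usage sketch of clean.text on an invented example line (stringr is needed for str_trim):

library(stringr)

clean.text("Check out http://example.com: 2 GREAT posts,   right?!")
# [1] "check out great posts right"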

Sampled vs Entire dataset - Most frequent tokens

The following plots compare the top 15 most frequent words for the sampled and complete Blogs datasets.

As we can see, there is a good match in the most frequent words with just a 1% sample of the dataset. If we now consider the less frequent words, we notice that they form the vast majority of the words unnested from the text. In both the sampled and the complete dataset, more than \(50\%\) of the unique words appear in the text just once.
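Such a plot can be produced from the output of the word.frequency function listed in the Appendix; a sketch, assuming tidytext, dplyr and a package providing stopwords() are loaded and blogs.sampled is the hypothetical sampled data frame:

library(ggplot2)
library(dplyr)

word.frequency(blogs.sampled) %>%
    slice_max(frequency, n = 15) %>%
    ggplot(aes(x = reorder(word, frequency), y = frequency)) +
    geom_col() +
    coord_flip() +
    labs(x = NULL, title = "Top 15 words - sampled blogs")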

Sampled vs Entire dataset - Less frequent tokens

The following plot shows the percentage of unique words with respect to the word frequency.

The plot shows that, for both datasets, \(75\%\) of the unique words are words appearing less than \(3\) times (the plot also shows at which frequency \(95\%\) of the words is covered for each dataset).
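The underlying coverage numbers can be derived from the word counts; a sketch, where df.wordfreq is a hypothetical output of the word.frequency function:

library(dplyr)

# for each frequency value, the cumulative share of unique words
coverage <- df.wordfreq %>%
    count(frequency, name = "n.words") %>%
    arrange(frequency) %>%
    mutate(pct.unique.words = cumsum(n.words) / sum(n.words))

head(coverage, 3)  # e.g. share of unique words appearing at most 1, 2 or 3 times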

Dataset Sampling final remarks

With \(1\%\) of the data sampled, the blogs dataset seems to properly reflect the behavior of the entire data^[I also performed further investigations on the word, bigram and trigram frequency distributions, which are not reported in this document in order to be more concise; the full code is available on my GitHub repository.]. To be on the safe side, the next analysis will be based on a \(2\%\) sample of the data for all the different datasets.

Tokens Analysis

In the previous sections I introduced the clean.text function to remove all text that would impact the tokenization. The approach I followed is to use words as tokens and to build data frames containing from 1-grams (basically words) up to 4-grams, which will be used to implement the next word predictor (the unnest_tokens function from the tidytext library splits the text into tokens). The distribution of the different \(N\)-grams will give some insight into the datasets themselves and into how to use the \(N\)-grams to improve the prediction rate of the application. I will compare the \(N\)-gram frequencies with and without stop words. Stop Words are a set (customizable based on language and purpose) of commonly used words. By removing Stop Words we will obtain the most “salient” terms for each dataset (news, blogs and twitter) and possibly get some hints about the differences between them, which could lead to a different approach in sampling the overall text (see the results section).
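A usage sketch of building the \(N\)-gram frequency tables with the functions listed in the Appendix (blogs.sampled is a hypothetical sampled data frame; the same calls apply to the other datasets):

blogs.words      <- word.frequency(blogs.sampled)
blogs.bigrams    <- create.bigramsfreq(blogs.sampled)
blogs.trigrams   <- create.trigramsfreq(blogs.sampled)
blogs.tetragrams <- create.tetragramsfreq(blogs.sampled)

# the same tables without Stop Words
blogs.words.nostop <- word.frequency(blogs.sampled, remove.stopwords = TRUE)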

Profanity Filter

In cleaning the text I just removed characters that would impact the tokenization. A profanity filter removes all the words that could be considered “offensive”. The profanity filter is an interesting data analysis exercise in itself, as one can spend a lot of time arguing about what is offensive and to which degree. Besides, it is really difficult to implement a perfect profanity filter, as human inventiveness has very few limits when it comes to being offensive. My approach was to use the sentimentr library with the profanity_arr_bad and profanity_alvarez lists, and still the results are quite poor.
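A minimal sketch of the filtering step, assuming the profanity_arr_bad and profanity_alvarez word lists from the lexicon package (which sentimentr builds on) and a unigram table such as the hypothetical blogs.words from the previous sketch:

library(dplyr)

# combined list of offensive terms
profanity.words <- unique(c(lexicon::profanity_arr_bad, lexicon::profanity_alvarez))

# drop the flagged tokens from the unigram frequency table
blogs.words.clean <- blogs.words %>%
    filter(!word %in% profanity.words)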

Unigrams Analysis

The following is a table summarising the basic statistics for the unigrams in the different datasets.

Unigram Statistics (including Stop Words)

          total words  unique words  max length  mean length  sd length
news           662382         44428          80          7.7        2.9
blogs          725962         43584         269          7.7        3.2
twitter        586628         39689          76          7.5        3.3
complete      1974972         84338         269          7.9        3.4

The previous table indicates some interesting points. First, when the stop words (124 words) are not considered, the total number of words drops by about \(40\%\), indicating that the stop words all have a high frequency in the corpus. Second, the longest words in English rarely exceed 25 characters, so the max length column suggests having a further look at the data and eventually pruning them.

As for the second point, if we have a further look at the words whose length exceeds 20 characters, we notice that their frequency is very low; since in the final application I will cut tokens with low frequency, this issue will solve itself. Moreover, the data suggest that further cleaning of the dataset (e.g. removing hashtags in the twitter dataset) could improve the working dataset.
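A sketch of that check on overlong tokens, where blogs.words is the hypothetical unigram frequency table built earlier:

library(dplyr)

# words longer than 20 characters: their frequency is expected to be very low
blogs.words %>%
    filter(nchar(word) > 20) %>%
    arrange(desc(frequency)) %>%
    head(10)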

Unigrams Frequency - Most frequent tokens

The following plot shows the word frequency of the top 15 words for the different datasets, with and without stop words.

The unigram distribution of the top 15 most frequent words is not really surprising: as we would expect, the news and blogs datasets show some differences but remain quite similar, while the twitter dataset differs more. Especially if we look at the distributions without stop words, we can see that a further data manipulation step could improve the final application, for instance by changing “im”, “youre”, “dont” into their correct forms “I am”, “you are”, “do not” before removing the “'” symbol (more stemming and lemmatisation could also help).
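A possible sketch of that extra step (expand.contractions is a hypothetical helper whose substitutions would run inside clean.text, after the lower-casing and before the punctuation removal; only a few contractions are shown):

expand.contractions <- function(lines)
{
    lines <- gsub("i'm", "i am", lines, fixed = TRUE)
    lines <- gsub("you're", "you are", lines, fixed = TRUE)
    lines <- gsub("don't", "do not", lines, fixed = TRUE)
    return(lines)
}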

Unigrams Frequency - Less frequent tokens

Besides the most frequent tokens, it is interesting to have a look at the less frequent words, i.e. the tail of the frequency distribution.

As we also saw in the previous section, words with just one or a few occurrences are the large majority of terms in all datasets. The plot also shows that the twitter dataset approaches \(100\%\) of the unique words faster than the other datasets, which suggests that its vocabulary contains less variety and that a further investigation may be required.

Bigrams, Trigrams and Tetragrams analysis

The following are the plots of the 2, 3, and 4-grams.

Bigrams

The most frequent bigrams look quite homogeneous across the different datasets. As for the unigrams, the twitter dataset is slightly different from the other two. That is even more evident by looking at the distribution without Stop Words.

Trigrams

Like the most frequent bigrams, the trigrams also show some similarity between news and blogs, whereas the twitter dataset brings more colloquial phrase constructions.

Tetragrams

What is interesting in the tetragrams analysis is that, apart from the most frequent ones, the meaning becomes more confused with less frequent tokens. Moreover, if we look at the frequencies including stop words, the news dataset still carries some meaning, while blogs and twitter show that a further text cleaning might help in building the model.

2-grams, 3-grams, 4-grams less frequent tokens

As expected, increasing the number of grams considered means that most of the unique tokens (2, 3 and 4-grams) fall into very few occurrences.
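The share of unique \(N\)-grams occurring only once can be checked with a sketch like this (the frequency tables are the hypothetical ones built earlier; their count column is named n):

singleton.share <- function(freq.table) mean(freq.table$n == 1)

sapply(list(bigrams    = blogs.bigrams,
            trigrams   = blogs.trigrams,
            tetragrams = blogs.tetragrams),
       singleton.share)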

Final remarks and model outline

A possible model for the word predictor application is a Markov chain model in which the probability of a word occurring is based on the words appearing before the one to be predicted. There are different approaches using N-grams (see chapter 4 of the book Speech and Language Processing), such as add-k smoothing, Stupid Backoff, and Kneser-Ney smoothing, that will be evaluated.
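As an illustration only, and not the final model, a heavily simplified Stupid Backoff-style lookup over the \(N\)-gram tables built earlier might look like the following sketch (table and column names are assumptions; the 0.4 back-off penalty follows the usual formulation of the algorithm):

library(dplyr)
library(stringr)

# predict the k most likely next words from the last words typed, backing off
# from trigrams to bigrams to unigrams with a 0.4 penalty per back-off step
predict.next <- function(prefix, trigrams, bigrams, unigrams, k = 3)
{
    words <- str_split(clean.text(prefix), " ")[[1]]
    last2 <- paste(tail(words, 2), collapse = " ")
    last1 <- tail(words, 1)

    cand3 <- trigrams %>%
        filter(str_detect(trigram, paste0("^", last2, " "))) %>%
        mutate(word = stringr::word(trigram, 3), score = n / sum(n))

    cand2 <- bigrams %>%
        filter(str_detect(bigram, paste0("^", last1, " "))) %>%
        mutate(word = stringr::word(bigram, 2), score = 0.4 * n / sum(n))

    cand1 <- unigrams %>%
        mutate(score = 0.4 * 0.4 * frequency / sum(frequency))

    bind_rows(select(cand3, word, score),
              select(cand2, word, score),
              select(cand1, word, score)) %>%
        group_by(word) %>%
        summarise(score = max(score), .groups = "drop") %>%
        slice_max(score, n = k)
}

# hypothetical usage: predict.next("thanks for the", blogs.trigrams, blogs.bigrams, blogs.words)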

This preliminary exploratory data analysis suggests that sampling \(2\%\) of the text from the different datasets could bring good prediction results while keeping computational time and resources manageable. Moreover, some further effort in cleaning the datasets is needed (e.g. improving the removal of website references, not considering words that start with more than 3 consecutive identical letters). One possible improvement could be sampling the different datasets with different weights, since they bring different types of vocabulary.

Appendix

In the appendix I report the most relevant functions used to create the data structures used in this report. The full code is on my GitHub repository.

clean.text function

clean.text <- function(lines)
{
    lines <- tolower(lines)
    lines <- gsub("[^[:alnum:][:blank:]?&/\\-]", "", lines) # remove non UTF-8 characters from text
    lines <- gsub("[[:punct:]]", "", lines) # remove punctuation
    lines <- gsub("[[:digit:]]", "", lines) # remove digits
    lines <- gsub("http[[:alnum:]]", "", lines) # removing references to websites
    lines <- gsub("www[[:alnum:]]", "", lines) # removing references to websites
    lines <- gsub("\\s+", " ", str_trim(lines)) # remove extra whitespaces
    return(lines)
}

createSampledDfText function

createSampledDfText <- function(original, sample.percentage = 0.5, book = "default")
{
    set.seed(1)
    if(!file.exists(original))
    {
        print("no file")
        return(NULL)
    } 
    f <- file(original, "rb")
    original.text <- readLines(f, encoding = "UTF-8", skipNul = TRUE)
    close(f)
    n.lines <- sort(sample(1:length(original.text),
                           as.integer(length(original.text) * sample.percentage),
                           replace = FALSE))
    sampled.text <- original.text[n.lines]
    return(data.frame(doc_id = 1:as.integer(length(original.text) * sample.percentage),
                      book = book,
                      text = sampled.text,
                      stringsAsFactors = FALSE))
}

word.frequency function

word.frequency <- function(df.corpus, remove.stopwords = FALSE)
{
    custom.stopwords <- data.frame(word = stopwords('english'),
                                   lexicon = "mylexicon")
    
    df.wordfreq <- df.corpus %>%
        unnest_tokens(word, text) %>%
        {if(remove.stopwords) anti_join(., y = custom.stopwords) else .} %>%
        count(word, sort = TRUE) %>%
        rename(frequency = n)
    
    return(df.wordfreq)
}

create.bigramsfreq function

create.bigramsfreq <- function(df.text, remove.stopwords = FALSE)
{
    custom.stopwords <- data.frame(word = stopwords('english'),
                                   lexicon = "mylexicon")
    
    bigrams <- df.text %>%
        unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
        drop_na() %>%
        separate(bigram, c("word1","word2")) %>%
        {if(remove.stopwords)
            filter(.,!word1 %in% custom.stopwords$word,
                   !word2 %in% custom.stopwords$word) else .} %>%
        count(word1, word2, sort = TRUE) %>%
        unite(bigram, word1, word2, sep = " ")
    
    return(bigrams)
}

create.trigramsfreq function

create.trigramsfreq <- function(df.text, remove.stopwords = FALSE)
{
    custom.stopwords <- data.frame(word = stopwords('english'),
                                   lexicon = "mylexicon")
    
    trigrams <- df.text %>%
        unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
        drop_na() %>%
        separate(trigram, c("word1","word2","word3")) %>%
        {if(remove.stopwords)
            filter(.,!word1 %in% custom.stopwords$word,
                   !word2 %in% custom.stopwords$word,
                   !word3 %in% custom.stopwords$word) else .} %>%
        count(word1, word2, word3, sort = TRUE) %>%
        unite(trigram, word1, word2, word3, sep = " ")
    
    return(trigrams)
}

create.tetragramsfreq function

create.tetragramsfreq <- function(df.text, remove.stopwords = FALSE)
{
    custom.stopwords <- data.frame(word = stopwords('english'),
                                   lexicon = "mylexicon")
    
    tetragrams <- df.text %>%
        unnest_tokens(tetragram, text, token = "ngrams", n = 4) %>%
        drop_na() %>%
        separate(tetragram, c("word1","word2","word3","word4")) %>%
        {if(remove.stopwords)
            filter(.,!word1 %in% custom.stopwords$word,
                   !word2 %in% custom.stopwords$word,
                   !word3 %in% custom.stopwords$word,
                   !word4 %in% custom.stopwords$word) else .} %>%
        count(word1, word2, word3, word4, sort = TRUE) %>%
        unite(tetragram, word1, word2, word3, word4, sep = " ")
    
    return(tetragrams)
}

  1. It would be more correct to use the term word count rather than word frequency, but since I will mostly compare occurrences, the results are the same as when dividing by the total number of words.↩︎