The final project of the Coursera Data Science Capstone course is to implement a Shiny-based application that predicts the next word given the text typed by a user. For this assignment the aims are to:
- Demonstrate that you’ve downloaded the data and have successfully loaded it in.
- Create a basic report of summary statistics about the data sets.
- Report any interesting findings that you amassed so far.
- Get feedback on your plans for creating a prediction algorithm and Shiny app.
In this document I will show how some basic exploratory data analysis yields findings that help shape and refine the final data product. At this stage, these key findings are the focus of the analysis.
The datasets that will be used to implement the word predictor have been provided by Coursera and consist of text files coming from three different sources (news, blogs and twitter) in four different languages (English, Russian, German and Finnish). The full repository can be downloaded at the link reported below. In this report only the English (en_US) datasets have been analysed.
To explore the datasets I have used a tidyverse approach, mostly based on Text Mining with R. The main reason is that it best reflects the workflow and the libraries used throughout the Coursera specialization. Moreover, at least in this preliminary part, it lets me avoid dealing with massive data structures that would slow down computation and plotting.
The dataset has been downloaded from the following link: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip, then unzipped and stored in a Data subfolder. For the project and the application I will focus on the en_US language.
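For reproducibility, the download and extraction step could also be scripted in R; the following is a minimal sketch (the destination paths are assumptions and may need adjusting to match the Data subfolder layout):

url.data <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if(!dir.exists("Data")) dir.create("Data")
if(!file.exists("Data/Coursera-SwiftKey.zip"))
{
    download.file(url.data, destfile = "Data/Coursera-SwiftKey.zip", mode = "wb")
    unzip("Data/Coursera-SwiftKey.zip", exdir = "Data")  # extract the text files into Data
}

Using the cygwin command: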
$ wc -lwmc en_US.blogs.txt en_US.twitter.txt en_US.news.txt > summary.txt
we can extract the first summary information:
| File | Lines | Words | Characters | Size (MB) |
|---|---|---|---|---|
| en_US.blogs.txt | 899288 | 37333958 | 208623085 | 200.42 |
| en_US.twitter.txt | 2360148 | 30357171 | 166843164 | 159.36 |
| en_US.news.txt | 1010242 | 34365905 | 205243643 | 196.28 |
As expected, the en_US.blogs.txt file contains more words although fewer lines, meaning that the lines themselves are longer. As the files are quite large and processing all the data would be unfeasible, the approach is to open each file separately, sample a preset number of lines, and delete the object containing the whole file in order to save memory.
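This is essentially what the createSampledDfText function reported in the Appendix does; a bare-bones sketch of the idea (file paths are an assumption) is the following:

con <- file("Data/en_US.blogs.txt", "rb")
blogs.all <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)
set.seed(1)
# keep a 1% random sample of the lines, then free the full text
blogs.sample <- blogs.all[sample(seq_along(blogs.all),
                                 as.integer(length(blogs.all) * 0.01))]
rm(blogs.all)   # delete the object containing the whole file to save memory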
In order to decide which percentage of data to sample, I compared a \(1\%\) sampled subset of the blogs dataset with the whole dataset (the code of the function is in the Appendix). The blogs dataset was chosen because of its variety (its vocabulary sits somewhere between a "news" and a "twitter" vocabulary), its size, and also to effectively test the profanity filter, as the news dataset contains, not surprisingly, very little swearing.
The following table summarises the basic statistics about the number of characters per line in the two datasets before cleaning the text.
| Statistic | Sampled Text | All Text |
|---|---|---|
| Sum | 2006628.00 | 206824505.00 |
| Mean | 223.16 | 229.99 |
| Standard Deviation | 239.77 | 258.66 |
| Median | 151.00 | 156.00 |
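These figures can be obtained with a simple nchar()-based comparison; a sketch, assuming blogs.sample and blogs.all hold the sampled and the complete lines (for this comparison the full text is of course kept in memory):

char.stats <- function(lines)
{
    n.char <- nchar(lines)
    c(Sum = sum(n.char), Mean = mean(n.char),
      "Standard Deviation" = sd(n.char), Median = median(n.char))
}
cbind("Sampled Text" = char.stats(blogs.sample),
      "All Text"     = char.stats(blogs.all))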
Without performing any formal test (e.g. a t-test), the table shows that the numbers are consistent, at least for the number of characters. Let's now consider the word distribution.
In this section I will compare the word frequency[^1] between the sampled and the entire blogs dataset. The same workflow used to compute the word frequency will later be applied to the other datasets (news and twitter). To calculate the word counts, all text has been preprocessed with the clean.text function (see Appendix), which lower-cases the text and removes non-alphanumeric characters, punctuation, digits, references to websites and extra whitespace.
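A hedged sketch of this workflow applied to the sampled blogs text, reusing the createSampledDfText, clean.text and word.frequency functions reported in the Appendix (and assuming the tidyverse/tidytext libraries used throughout the report are loaded); the same steps are run on the complete file for the comparison:

df.blogs <- createSampledDfText("Data/en_US.blogs.txt", sample.percentage = 0.01,
                                book = "blogs")
df.blogs$text <- clean.text(df.blogs$text)   # preprocess the sampled lines
freq.blogs <- word.frequency(df.blogs)       # one row per word, sorted by frequency
head(freq.blogs, 15)                         # the top 15 words shown in the plots below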
The following plots compare the top 15 most frequent words for the sampled and complete Blogs datasets.
As we can see, there is a good match in the most frequent words with just a \(1\%\) sample of the dataset. If we now consider the less frequent words, we notice they form the vast majority of the words unnested from the text: in both the sampled and the complete datasets, more than \(50\%\) of the unique words appear in the text just once.
The following plot shows the percentage of unique words with respect to the word frequency.
The plot shows that \(75\%\) of the unique words is formed by words appearing less than \(3\) times in both datasets (the plot also shows at which frequency we cover \(95\%\) of the unique words for the different datasets).
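The coverage curve can be computed directly from the word-frequency table; a possible sketch:

library(dplyr)
coverage <- freq.blogs %>%
    count(frequency, name = "n.words") %>%   # how many unique words occur exactly k times
    arrange(frequency) %>%
    mutate(pct.unique = 100 * cumsum(n.words) / sum(n.words))  # % of unique words occurring at most k times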
With \(1\%\) of the data sampled, the blogs dataset seems to properly reflect the behaviour of the entire data^[I also performed further investigations on the word, bigram and trigram frequency distributions that are not reported in this document in order to be more concise; the full code is available on the GitHub repository.]. To be on the safe side, the next analyses will be based on a \(2\%\) sample of the data for all the different datasets.
In the previous sections I introduced the clean.text function to remove all text that would impact the tokenization. The approach I followed is to use words as tokens and to build data frames containing n-grams from 1 (basically single words) up to 4, which will be used to implement the next-word predictor (the unnest_tokens function from the tidytext library splits the text into tokens), as sketched below. The distribution of the different \(N\)-grams will give some insight into the datasets themselves and into how to use the \(N\)-grams to improve the prediction rate of the application. I will compare the \(N\)-gram frequencies considering the datasets with and without stop words. Stop words are a set (customizable based on language and purpose) of commonly used words. By removing stop words we will obtain the most "salient" terms for each dataset (news, blogs and twitter) and possibly get some hints about the differences between them; this could lead to a different approach in sampling the overall text (see the results section).
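The n-gram tables used in the following sections can be built with the functions reported in the Appendix; a sketch, where df.corpus denotes the combined \(2\%\) sample of the three sources:

unigrams   <- word.frequency(df.corpus)
bigrams    <- create.bigramsfreq(df.corpus)
trigrams   <- create.trigramsfreq(df.corpus)
tetragrams <- create.tetragramsfreq(df.corpus)
# the same calls with remove.stopwords = TRUE give the distributions without stop words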
In cleaning the text I just removed characters that would impact the tokenization. A profanity filter will remove all the words that could be considered "offensive". The profanity filter is an interesting data analysis exercise in itself, as we can spend a lot of time arguing about what is offensive and to which degree. Besides, it is really difficult to implement a perfect profanity filter, as human inventiveness has very few limits when it comes to being offensive. My approach was to rely on the sentimentr library, using the profanity_arr_bad and profanity_alvarez word lists, and still the results are quite poor.
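A minimal sketch of such a filter, assuming the two word lists are taken from the lexicon package (on which sentimentr relies) and applied to the unigram table built above:

library(dplyr)
library(lexicon)
profanity.words <- data.frame(word = unique(c(profanity_arr_bad, profanity_alvarez)),
                              stringsAsFactors = FALSE)
unigrams.clean <- unigrams %>%
    anti_join(profanity.words, by = "word")   # drop tokens found in the profanity lists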
The following is a table summarising the basic statistics for the unigrams in the different datasets.
| dataset | total words | unique words | max length (characters) | mean length | sd |
|---|---|---|---|---|---|
| news | 662382 | 44428 | 80 | 7.7 | 2.9 |
| blogs | 725962 | 43584 | 269 | 7.7 | 3.2 |
| twitter | 586628 | 39689 | 76 | 7.5 | 3.3 |
| complete | 1974972 | 84338 | 269 | 7.9 | 3.4 |
The previous tables indicate some interesting points. First, when the stop words (124 words) are not considered, the total number of words drops by about \(40\%\), indicating that the stop words all have a high frequency in the corpus. Second, the longest words in English rarely exceed 25 characters, so the max length column suggests having a further look at the data and eventually pruning those tokens.
As for the second point, if we take a further look at the words whose length exceeds 20 characters, we notice that their frequency is very low; since in the final application I will cut tokens with low frequency, this issue will solve itself. Moreover, the data suggest that some further cleaning of the datasets (e.g. removing hashtags in the twitter dataset) could improve the working dataset.
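A quick way to inspect those suspiciously long tokens, starting from the unigram table:

long.words <- unigrams %>%
    filter(nchar(word) > 20) %>%   # tokens longer than any plausible English word
    arrange(desc(frequency))
head(long.words)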
The following plots show the word frequency of the top 15 words for the different datasets, with and without stop words.
The unigram distribution of the top 15 most frequent words is not really surprising: as we would expect, the news and blogs datasets show some differences but give somewhat similar results, while the twitter dataset differs more. Especially if we look at the distributions without stop words, we can see that a further step in manipulating the data could improve the final application, for example by changing "im", "youre", "dont" into their correct forms "i am", "you are", "do not" before removing the "'" symbol (and more generally by applying stemming and lemmatisation), as in the sketch below.
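A hypothetical helper for this extra step (to be applied before the apostrophe is stripped by clean.text; only a few common contractions are shown):

expand.contractions <- function(lines)
{
    lines <- gsub("i'm", "i am", lines, ignore.case = TRUE)
    lines <- gsub("you're", "you are", lines, ignore.case = TRUE)
    lines <- gsub("don't", "do not", lines, ignore.case = TRUE)
    lines <- gsub("can't", "can not", lines, ignore.case = TRUE)
    return(lines)
}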
Besides the top most frequent words, it is also interesting to have a look at the less frequent ones, i.e. the tail of the distribution.
As we also saw in the previous section, words with just one or a few occurrences are the large majority of terms in all datasets. The plot also shows that the twitter dataset approaches \(100\%\) of the unique words faster than the other datasets, which suggests its vocabulary contains less variety; further investigation may be required.
The following are the plots of the 2-, 3- and 4-gram frequencies.
The most frequent bigrams look quite homogeneous across the different datasets. As with the unigrams, the twitter dataset is slightly different from the other two; this is even more evident when looking at the distributions without stop words.
Like the most frequent bigrams, the trigrams also show some similarity between news and blogs, while the twitter dataset brings more colloquial phrase constructions.
What is interesting in the tetragram analysis is that, apart from the most frequent ones, the meaning becomes more confused for the less frequent tokens. Moreover, if we look at the frequencies including stop words, the news dataset still carries some meaning, while blogs and twitter show that further text cleaning might help in building the model.
As expected, increasing the number of grams considered means that most of the unique tokens (2-, 3- and 4-grams) occur only very few times.
A possible model for the word predictor application would be based on a Markov chain model, in which the probability of a word occurring depends only on the words appearing immediately before the one to be predicted. There are different approaches using \(N\)-grams (see chapter 4 of the book Speech and Language Processing), like add-k smoothing, Stupid Backoff and Kneser-Ney smoothing, that will be evaluated.
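For reference, the underlying Markov assumption for an \(N\)-gram model and the Stupid Backoff score (Brants et al., 2007, with \(\lambda\) typically set to \(0.4\)) can be written as:

\[
P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-N+1}^{n-1})
\]

\[
S(w_i \mid w_{i-k+1}^{i-1}) =
\begin{cases}
\dfrac{\mathrm{count}(w_{i-k+1}^{i})}{\mathrm{count}(w_{i-k+1}^{i-1})} & \text{if } \mathrm{count}(w_{i-k+1}^{i}) > 0 \\
\lambda \, S(w_i \mid w_{i-k+2}^{i-1}) & \text{otherwise}
\end{cases}
\]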
This preliminary exploratory data analysis suggests that sampling about \(2\%\) of the text from the different datasets could give good prediction results while keeping computational time and resources manageable. Moreover, some further effort in cleaning the datasets is needed (e.g. better elimination of website references, discarding words starting with more than 3 consecutive identical letters). One possible improvement could be sampling the different datasets with different weights, since they bring different types of vocabulary, for example as sketched below.
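A hypothetical sketch of this weighted sampling (the weight values are placeholders to be tuned):

sample.weights <- c(blogs = 0.02, news = 0.02, twitter = 0.01)
df.corpus <- rbind(
    createSampledDfText("Data/en_US.blogs.txt",   sample.weights["blogs"],   book = "blogs"),
    createSampledDfText("Data/en_US.news.txt",    sample.weights["news"],    book = "news"),
    createSampledDfText("Data/en_US.twitter.txt", sample.weights["twitter"], book = "twitter"))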
In the Appendix I report the most relevant functions used to create the data structures used in this report. The full code is available on my GitHub repository.
clean.text <- function(lines)
{
    lines <- tolower(lines)
    lines <- gsub("[^[:alnum:][:blank:]?&/\\-]", "", lines) # keep only alphanumerics, blanks and a few allowed symbols
    lines <- gsub("[[:punct:]]", "", lines)                 # remove punctuation
    lines <- gsub("[[:digit:]]", "", lines)                 # remove digits
    lines <- gsub("http[[:alnum:]]*", "", lines)            # remove references to websites
    lines <- gsub("www[[:alnum:]]*", "", lines)             # remove references to websites
    lines <- gsub("\\s+", " ", str_trim(lines))             # collapse extra whitespace (str_trim from stringr)
    return(lines)
}
createSampledDfText <- function(original, sample.percentage = 0.5, book = "default")
{
    set.seed(1)
    if(!file.exists(original))
    {
        print("no file")
        return(NULL)
    }
    # read the whole file, then keep only a random sample of its lines
    f <- file(original, "rb")
    original.text <- readLines(f, encoding = "UTF-8", skipNul = TRUE)
    close(f)
    n.lines <- sort(sample(seq_along(original.text),
                           as.integer(length(original.text) * sample.percentage),
                           replace = FALSE))
    sampled.text <- original.text[n.lines]
    # the full text goes out of scope on return, so only the sampled data frame stays in memory
    return(data.frame(doc_id = seq_along(sampled.text),
                      book = book,
                      text = sampled.text,
                      stringsAsFactors = FALSE))
}
word.frequency <- function(df.corpus, remove.stopwords = FALSE)
{
    custom.stopwords <- data.frame(word = stopwords('english'),
                                   lexicon = "mylexicon")
    df.wordfreq <- df.corpus %>%
        unnest_tokens(word, text) %>%
        {if(remove.stopwords) anti_join(., y = custom.stopwords, by = "word") else .} %>%
        count(word, sort = TRUE) %>%
        rename(frequency = n)
    return(df.wordfreq)
}
create.bigramsfreq <- function(df.text, remove.stopwords = FALSE)
{
    custom.stopwords <- data.frame(word = stopwords('english'),
                                   lexicon = "mylexicon")
    bigrams <- df.text %>%
        unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
        drop_na() %>%
        separate(bigram, c("word1", "word2")) %>%
        {if(remove.stopwords)
            filter(., !word1 %in% custom.stopwords$word,
                      !word2 %in% custom.stopwords$word) else .} %>%
        count(word1, word2, sort = TRUE) %>%
        unite(bigram, word1, word2, sep = " ")
}
create.trigramsfreq <- function(df.text, remove.stopwords = FALSE)
{
    custom.stopwords <- data.frame(word = stopwords('english'),
                                   lexicon = "mylexicon")
    trigrams <- df.text %>%
        unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
        drop_na() %>%
        separate(trigram, c("word1", "word2", "word3")) %>%
        {if(remove.stopwords)
            filter(., !word1 %in% custom.stopwords$word,
                      !word2 %in% custom.stopwords$word,
                      !word3 %in% custom.stopwords$word) else .} %>%
        count(word1, word2, word3, sort = TRUE) %>%
        unite(trigram, word1, word2, word3, sep = " ")
}
create.tetragramsfreq <- function(df.text, remove.stopwords = FALSE)
{
    custom.stopwords <- data.frame(word = stopwords('english'),
                                   lexicon = "mylexicon")
    tetragrams <- df.text %>%
        unnest_tokens(tetragram, text, token = "ngrams", n = 4) %>%
        drop_na() %>%
        separate(tetragram, c("word1", "word2", "word3", "word4")) %>%
        {if(remove.stopwords)
            filter(., !word1 %in% custom.stopwords$word,
                      !word2 %in% custom.stopwords$word,
                      !word3 %in% custom.stopwords$word,
                      !word4 %in% custom.stopwords$word) else .} %>%
        count(word1, word2, word3, word4, sort = TRUE) %>%
        unite(tetragram, word1, word2, word3, word4, sep = " ")
}
[^1]: It would be more correct to use the term word count rather than word frequency, but since I will mostly compare occurrences, the results are the same as dividing by the total number of words.