Overview

This reports is a first milestone report in the capstone project for the DataScience specialisation course on Coursera. The project is about predicting text when typing. It is the kind of functionality you may typically encounter on mobile devices, in chat apps for example.

The main idea is to retrieve statistical information from a given set of texts, in order to predict the most likely next word(s). The concept of n-grams will be used to achieve this. An n-gram is an ordered set of n words. For example “I love you” and “I love movies” are two 3-grams. If the former is more frequent than the latter in our given set of texts, our prediction after typing “I love” will be “you”, rather than “movies”.

This is the whole idea :

In this report, the preliminary steps are shown …

Getting the data

The data was downloaded from here, and unzipped. The dataset comes in 4 languages. We will continue with the English one.

All texts have been sourced by a web crawler (i.e. by scanning the internet and getting text) and categorized into 3 text files : blogs, news and twitter.

# Get file sizes in MB

blogs.size <- file.info("Coursera-SwiftKey/final/en_US/en_US.blogs.txt")$size / 1024 ^ 2
news.size <- file.info("Coursera-SwiftKey/final/en_US/en_US.news.txt")$size / 1024 ^ 2
twitter.size <- file.info("Coursera-SwiftKey/final/en_US/en_US.twitter.txt")$size / 1024 ^ 2

# Read the data into R
blogs <- readLines("Coursera-SwiftKey/final/en_US/en_US.blogs.txt", skipNul = TRUE)
news <- readLines("Coursera-SwiftKey/final/en_US/en_US.news.txt", skipNul = TRUE)
twitter <- readLines("Coursera-SwiftKey/final/en_US/en_US.twitter.txt", skipNul = TRUE)

# Summary of the data sets
summary <- data.frame(source = c("blogs", "news", "twitter"),
           size.MB = c(round(blogs.size,0), round(news.size,0), round(twitter.size,0)),
           lines = c(length(blogs), length(news), length(twitter)),
           average.length = c(round(mean(nchar(blogs)),0), round(mean(nchar(news)),0), round(mean(nchar(twitter)),0)),
           maximum.length = c(max(nchar(blogs)), max(nchar(news)), max(nchar(twitter))))
knitr::kable(summary)
source size.MB lines average.length maximum.length
blogs 200 899288 232 40835
news 196 77259 203 5760
twitter 159 2360148 69 213

The texts are quite large. Twitter has shorter lines as expected.

Now let’s have a look at how the lines look by randomly picking 2 lines from each text :

## [1] "OH MY WENIS. REALLY?! :D I'm gonna see if I can drop college and career 7th so I can just have lunch all day all year:"
## [2] "Seeing the shy girl from class hammered dancing on the bar..."
## [1] "What is significant about Brancusi past the beauty of the artifacts he produced is the process of making and creating that constantly evolved over his life and career. He was interested not with the outer form but the underlying idea of the form â<U+0080><U+0093> the very essence of the thing. What he captured in bronze and stone and wood is breathtaking. The beauty of â<U+0080><U+009C>the thingâ<U+0080> is found in its abstract simplicity. Comprehension of this design construct eluded me for a long time, but when it clicked, Iâ<U+0080><U+0099>ve been hooked since. I can (and have) sit in a room with his work"
## [2] "Here are some of the contenders that were tried on. I wont be giving away the actual dresses but can you pick your favourite?"
## [1] "Instead of second and third and none out, the Indians got nothing. Hafner struck out swinging and Shelley Duncan grounded into a double play."
## [2] "\"It is just difficult to really grasp that and feel like that is racing to you.\""

We can see that a line is in fact a stand-alone chunck of text, which may consist of several phrases. It may contain exotic characters, proper names, emoticons, double quotes and other punctuation marks, spelling errors or alternative (colloquial?) writing styles such as dont instead of don’t. This calls for data cleaning …

Preprocessing the data

Without looking at the texts, we can already think of some data cleaning actions which will make statistical interpretation easier :

Another aspect concerns what to do with certain words :

Stemnisation (i.e. keeping the stems of words instead of their various forms, i.e. go instead of goes, going, gone) will not be done here, as again, we are less concerned with meaning than with prediction, thus preferring to keep the language as natural as feasable.

Now, bearing in mind our goal of predicting text, it would make sense not to work with the lines as given, but rather with groups of words belonging together. In English, logical groups are typically identified by punctuations marks. We will thus split the lines by punctuation marks (except “-” and “’”).

preproc <- function(texts){
  texts <- tolower(texts)                    # all lower case
  texts <- iconv(texts,"latin1", "ASCII", sub="") # get rid of non-english characters
  texts <- gsub("[#/@\\$%\\+\\*_\\^]"," ",texts) # remove special characters 
  texts <- gsub("[0-9]","",texts)              # remove numbers
  
  texts <- strsplit(texts,"[\\.;:!\\?&\\)\\(,\"]")  # cut up by punctuations
  texts <- unlist(texts)                     # all strings at same level

  texts <- gsub("\\s+"," ",texts)            # consecutive spaces to one space
  texts <- trimws(texts)                     # trim leading and trailing whitespaces
  texts <- texts[lapply(texts,nchar)>3]      # remove meaningless strings (shorter than 4)
  return(texts)
}

twitter <-  preproc(twitter)
news <-  preproc(news)
blogs <-  preproc(blogs)

# create sample

set.seed(5768)
sample <- c(sample(blogs, 10000),
            sample(news, 10000),
            sample(twitter, 10000))

# save the cleaned datasets to disk

write(news,"news.txt")
write(twitter,"twitter.txt")
write(blogs,"blogs.txt")

# release large objects that were saved previously in order to liberate some memory
rm(news)
rm(blogs)
rm(twitter)
#gc()

Exploratory Analysis

Given the size of the datasets, our exploration will be performed on a sample (10.000 lines of each of the blogs, news and twitter texts). Now that we have somewhat cleaned our data, we will explore it. For this we will use the tm (“text mining”) library, which gives us access to powerful features as well as to other packages such as RWeka.

Frequency of N-grams are calculated on the sample, and unfrequent N-grams (occuring once or twice in the full dataset) are removed. This removal should not alter significantly the prediction accuracy potential of the dataset, but lowers its size significantly.

library(tm)
library(RWeka)

# define functions for 1-;2-;3- and 4-grams based on RWeka

UnigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

# define a function to retrieve some basic statistics from the term document matrix

FreqVector <- function(tdm,highfreqThreshold) {
  FreqVector <- list(tdm$nrow,
                     length(findFreqTerms(tdm, lowfreq = 0, highfreq = 1)), # terms occuring only once
                     length(findFreqTerms(tdm, lowfreq = highfreqThreshold))) # terms occuring > hfth times
  return(FreqVector)
}

# Create VCorpus
corpus <- VCorpus(VectorSource(sample))

# RWeka : create term document matrices for 1-;2-;3- and 4-grams

Unigrams <- TermDocumentMatrix(corpus, control = list(tokenize = UnigramTokenizer))
UniVec <- FreqVector(Unigrams,200)
Unigrams <- removeSparseTerms(Unigrams,0.9998) # remove sparse items
UniVecDense <- FreqVector(Unigrams,200)

Bigrams <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))
BiVec <- FreqVector(Bigrams,50)
Bigrams <- removeSparseTerms(Bigrams,0.9998)  # removing sparse terms
BiVecDense <- FreqVector(Bigrams,50)

Trigrams <- TermDocumentMatrix(corpus, control = list(tokenize = TrigramTokenizer))
TriVec <- FreqVector(Trigrams,10)
Trigrams<-removeSparseTerms(Trigrams,0.9998)   # removing sparse terms
TriVecDense <- FreqVector(Trigrams,10)


# Summary of the data sets
summary <- data.frame(source = c("1-grams (# terms)", "1-grams (# LowFreq terms)","1-grams (# HighFreq terms)",
                                 "2-grams (# terms)", "2-grams (# LowFreq terms)","2-grams (# HighFreq terms)",
                                 "3-grams (# terms)", "3-grams (# LowFreq terms)","3-grams (# HighFreq terms)"),
           FULL = c(unlist(UniVec),unlist(BiVec),unlist(TriVec)),
           DENSE = c(unlist(UniVecDense),unlist(BiVecDense),unlist(TriVecDense))
           )
knitr::kable(summary)
source FULL DENSE
1-grams (# terms) 23770 3667
1-grams (# LowFreq terms) 13635 0
1-grams (# HighFreq terms) 98 98
2-grams (# terms) 109501 3320
2-grams (# LowFreq terms) 90752 0
2-grams (# HighFreq terms) 172 172
3-grams (# terms) 147142 644
3-grams (# LowFreq terms) 139121 0
3-grams (# HighFreq terms) 209 209

We see that in our sample of about 30.000 chunks, we have about 25.000 distinct words. Even for our modest sample size, a matrix representing counts per line of terms would have 30.000*25.000 cells, or about 750.000.000. However, more than half of the words are used only once, we will remove here the terms that occur less than once in 10.000, or occuring less than about 3 times in our corpus. This reduces the number of words to less than 4000.

For 2- and 3-grams the reduction is even more pronounced.

We list now the most frequent n-grams …

library(ggplot2)

Unigrams <- head(sort(rowSums(as.matrix(removeSparseTerms(Unigrams,.99))), decreasing=TRUE),20)
Unigrams <- data.frame(Ngram=names(Unigrams),Freq = Unigrams)

GG <- ggplot(data = Unigrams, aes(x = reorder(Ngram, Freq), y = Freq)) + coord_flip() 
GG <- GG + geom_bar(stat="identity") + theme(axis.title.y = element_blank())
GG <- GG + labs(y = "Frequency", title = "Most Frequent words")
GG

Bigrams <- head(sort(rowSums(as.matrix(removeSparseTerms(Bigrams,.999))), decreasing=TRUE),20)
Bigrams <- data.frame(Ngram=names(Bigrams),Freq = Bigrams)

GG <- ggplot(data = Bigrams, aes(x = reorder(Ngram, Freq), y = Freq)) + coord_flip() 
GG <- GG + geom_bar(stat="identity") + theme(axis.title.y = element_blank())
GG <- GG + labs(y = "Frequency", title = "Most Frequent word couples")
GG

Trigrams <- head(sort(rowSums(as.matrix(removeSparseTerms(Trigrams,.999))), decreasing=TRUE),20)
Trigrams <- data.frame(Ngram=names(Trigrams),Freq = Trigrams)

GG <- ggplot(data = Trigrams, aes(x = reorder(Ngram, Freq), y = Freq)) + coord_flip() 
GG <- GG + geom_bar(stat="identity") + theme(axis.title.y = element_blank())
GG <- GG + labs(y = "Frequency", title = "Most Frequent word triplets")
GG

Next Steps

After this exploratory phase, we will need to make some decisions to make our prediction model.

The model itself will propose the 3 most frequent options based on the frequency N-grams tables. These tables should be available at run-time in the Shiny app, so will be limited in size. The app could react either after a space (calculate next probable word) or while typing (propose current word). I will probably propose only the former.

Constructing the model itself can be done on a large dataset, but having an ordinary pc, I will need to reduce the size of the texts when analyzing them statistically.Sampling will thus play an important role. Analyzing the whole set of texts at once is prohibitive, so dividing into manageble chunks and testing and combining the results will be necesary. We will test if it is necessary to use the whole text or if a sample would suffise. Comparing lists of sparse items over different samples could give clues here. Removing both high- and low frequency N-grams might help in a stepped approach.

Each of the collections represents a different style (a tweet will be less eloquent than a blog or a news article, although distinctions are probably less pronounced than could be expected) and as such will predict differently. I choose richer predictions over risk of mismatch because of style differences. All 3 texts collections will be used in the prediction model.

I will not remove stopwords, as I believe removing them makes more sense in analyzing “meaning” but not in the case of predicting, as removing them will make the language “unnatural”. This means I may have to include N-grams for higher N.