This reports is a first milestone report in the capstone project for the DataScience specialisation course on Coursera. The project is about predicting text when typing. It is the kind of functionality you may typically encounter on mobile devices, in chat apps for example.
The main idea is to retrieve statistical information from a given set of texts, in order to predict the most likely next word(s). The concept of n-grams will be used to achieve this. An n-gram is an ordered set of n words. For example “I love you” and “I love movies” are two 3-grams. If the former is more frequent than the latter in our given set of texts, our prediction after typing “I love” will be “you”, rather than “movies”.
This is the whole idea :
In this report, the preliminary steps are shown …
The data was downloaded from here, and unzipped. The dataset comes in 4 languages. We will continue with the English one.
All texts have been sourced by a web crawler (i.e. by scanning the internet and getting text) and categorized into 3 text files : blogs, news and twitter.
# Get file sizes in MB
blogs.size <- file.info("Coursera-SwiftKey/final/en_US/en_US.blogs.txt")$size / 1024 ^ 2
news.size <- file.info("Coursera-SwiftKey/final/en_US/en_US.news.txt")$size / 1024 ^ 2
twitter.size <- file.info("Coursera-SwiftKey/final/en_US/en_US.twitter.txt")$size / 1024 ^ 2
# Read the data into R
blogs <- readLines("Coursera-SwiftKey/final/en_US/en_US.blogs.txt", skipNul = TRUE)
news <- readLines("Coursera-SwiftKey/final/en_US/en_US.news.txt", skipNul = TRUE)
twitter <- readLines("Coursera-SwiftKey/final/en_US/en_US.twitter.txt", skipNul = TRUE)
# Summary of the data sets
summary <- data.frame(source = c("blogs", "news", "twitter"),
size.MB = c(round(blogs.size,0), round(news.size,0), round(twitter.size,0)),
lines = c(length(blogs), length(news), length(twitter)),
average.length = c(round(mean(nchar(blogs)),0), round(mean(nchar(news)),0), round(mean(nchar(twitter)),0)),
maximum.length = c(max(nchar(blogs)), max(nchar(news)), max(nchar(twitter))))
knitr::kable(summary)
| source | size.MB | lines | average.length | maximum.length |
|---|---|---|---|---|
| blogs | 200 | 899288 | 232 | 40835 |
| news | 196 | 77259 | 203 | 5760 |
| 159 | 2360148 | 69 | 213 |
The texts are quite large. Twitter has shorter lines as expected.
Now let’s have a look at how the lines look by randomly picking 2 lines from each text :
## [1] "OH MY WENIS. REALLY?! :D I'm gonna see if I can drop college and career 7th so I can just have lunch all day all year:"
## [2] "Seeing the shy girl from class hammered dancing on the bar..."
## [1] "What is significant about Brancusi past the beauty of the artifacts he produced is the process of making and creating that constantly evolved over his life and career. He was interested not with the outer form but the underlying idea of the form â<U+0080><U+0093> the very essence of the thing. What he captured in bronze and stone and wood is breathtaking. The beauty of â<U+0080><U+009C>the thingâ<U+0080> is found in its abstract simplicity. Comprehension of this design construct eluded me for a long time, but when it clicked, Iâ<U+0080><U+0099>ve been hooked since. I can (and have) sit in a room with his work"
## [2] "Here are some of the contenders that were tried on. I wont be giving away the actual dresses but can you pick your favourite?"
## [1] "Instead of second and third and none out, the Indians got nothing. Hafner struck out swinging and Shelley Duncan grounded into a double play."
## [2] "\"It is just difficult to really grasp that and feel like that is racing to you.\""
We can see that a line is in fact a stand-alone chunck of text, which may consist of several phrases. It may contain exotic characters, proper names, emoticons, double quotes and other punctuation marks, spelling errors or alternative (colloquial?) writing styles such as dont instead of don’t. This calls for data cleaning …
Without looking at the texts, we can already think of some data cleaning actions which will make statistical interpretation easier :
Another aspect concerns what to do with certain words :
Stemnisation (i.e. keeping the stems of words instead of their various forms, i.e. go instead of goes, going, gone) will not be done here, as again, we are less concerned with meaning than with prediction, thus preferring to keep the language as natural as feasable.
Now, bearing in mind our goal of predicting text, it would make sense not to work with the lines as given, but rather with groups of words belonging together. In English, logical groups are typically identified by punctuations marks. We will thus split the lines by punctuation marks (except “-” and “’”).
preproc <- function(texts){
texts <- tolower(texts) # all lower case
texts <- iconv(texts,"latin1", "ASCII", sub="") # get rid of non-english characters
texts <- gsub("[#/@\\$%\\+\\*_\\^]"," ",texts) # remove special characters
texts <- gsub("[0-9]","",texts) # remove numbers
texts <- strsplit(texts,"[\\.;:!\\?&\\)\\(,\"]") # cut up by punctuations
texts <- unlist(texts) # all strings at same level
texts <- gsub("\\s+"," ",texts) # consecutive spaces to one space
texts <- trimws(texts) # trim leading and trailing whitespaces
texts <- texts[lapply(texts,nchar)>3] # remove meaningless strings (shorter than 4)
return(texts)
}
twitter <- preproc(twitter)
news <- preproc(news)
blogs <- preproc(blogs)
# create sample
set.seed(5768)
sample <- c(sample(blogs, 10000),
sample(news, 10000),
sample(twitter, 10000))
# save the cleaned datasets to disk
write(news,"news.txt")
write(twitter,"twitter.txt")
write(blogs,"blogs.txt")
# release large objects that were saved previously in order to liberate some memory
rm(news)
rm(blogs)
rm(twitter)
#gc()
Given the size of the datasets, our exploration will be performed on a sample (10.000 lines of each of the blogs, news and twitter texts). Now that we have somewhat cleaned our data, we will explore it. For this we will use the tm (“text mining”) library, which gives us access to powerful features as well as to other packages such as RWeka.
Frequency of N-grams are calculated on the sample, and unfrequent N-grams (occuring once or twice in the full dataset) are removed. This removal should not alter significantly the prediction accuracy potential of the dataset, but lowers its size significantly.
library(tm)
library(RWeka)
# define functions for 1-;2-;3- and 4-grams based on RWeka
UnigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
# define a function to retrieve some basic statistics from the term document matrix
FreqVector <- function(tdm,highfreqThreshold) {
FreqVector <- list(tdm$nrow,
length(findFreqTerms(tdm, lowfreq = 0, highfreq = 1)), # terms occuring only once
length(findFreqTerms(tdm, lowfreq = highfreqThreshold))) # terms occuring > hfth times
return(FreqVector)
}
# Create VCorpus
corpus <- VCorpus(VectorSource(sample))
# RWeka : create term document matrices for 1-;2-;3- and 4-grams
Unigrams <- TermDocumentMatrix(corpus, control = list(tokenize = UnigramTokenizer))
UniVec <- FreqVector(Unigrams,200)
Unigrams <- removeSparseTerms(Unigrams,0.9998) # remove sparse items
UniVecDense <- FreqVector(Unigrams,200)
Bigrams <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))
BiVec <- FreqVector(Bigrams,50)
Bigrams <- removeSparseTerms(Bigrams,0.9998) # removing sparse terms
BiVecDense <- FreqVector(Bigrams,50)
Trigrams <- TermDocumentMatrix(corpus, control = list(tokenize = TrigramTokenizer))
TriVec <- FreqVector(Trigrams,10)
Trigrams<-removeSparseTerms(Trigrams,0.9998) # removing sparse terms
TriVecDense <- FreqVector(Trigrams,10)
# Summary of the data sets
summary <- data.frame(source = c("1-grams (# terms)", "1-grams (# LowFreq terms)","1-grams (# HighFreq terms)",
"2-grams (# terms)", "2-grams (# LowFreq terms)","2-grams (# HighFreq terms)",
"3-grams (# terms)", "3-grams (# LowFreq terms)","3-grams (# HighFreq terms)"),
FULL = c(unlist(UniVec),unlist(BiVec),unlist(TriVec)),
DENSE = c(unlist(UniVecDense),unlist(BiVecDense),unlist(TriVecDense))
)
knitr::kable(summary)
| source | FULL | DENSE |
|---|---|---|
| 1-grams (# terms) | 23770 | 3667 |
| 1-grams (# LowFreq terms) | 13635 | 0 |
| 1-grams (# HighFreq terms) | 98 | 98 |
| 2-grams (# terms) | 109501 | 3320 |
| 2-grams (# LowFreq terms) | 90752 | 0 |
| 2-grams (# HighFreq terms) | 172 | 172 |
| 3-grams (# terms) | 147142 | 644 |
| 3-grams (# LowFreq terms) | 139121 | 0 |
| 3-grams (# HighFreq terms) | 209 | 209 |
We see that in our sample of about 30.000 chunks, we have about 25.000 distinct words. Even for our modest sample size, a matrix representing counts per line of terms would have 30.000*25.000 cells, or about 750.000.000. However, more than half of the words are used only once, we will remove here the terms that occur less than once in 10.000, or occuring less than about 3 times in our corpus. This reduces the number of words to less than 4000.
For 2- and 3-grams the reduction is even more pronounced.
We list now the most frequent n-grams …
library(ggplot2)
Unigrams <- head(sort(rowSums(as.matrix(removeSparseTerms(Unigrams,.99))), decreasing=TRUE),20)
Unigrams <- data.frame(Ngram=names(Unigrams),Freq = Unigrams)
GG <- ggplot(data = Unigrams, aes(x = reorder(Ngram, Freq), y = Freq)) + coord_flip()
GG <- GG + geom_bar(stat="identity") + theme(axis.title.y = element_blank())
GG <- GG + labs(y = "Frequency", title = "Most Frequent words")
GG
Bigrams <- head(sort(rowSums(as.matrix(removeSparseTerms(Bigrams,.999))), decreasing=TRUE),20)
Bigrams <- data.frame(Ngram=names(Bigrams),Freq = Bigrams)
GG <- ggplot(data = Bigrams, aes(x = reorder(Ngram, Freq), y = Freq)) + coord_flip()
GG <- GG + geom_bar(stat="identity") + theme(axis.title.y = element_blank())
GG <- GG + labs(y = "Frequency", title = "Most Frequent word couples")
GG
Trigrams <- head(sort(rowSums(as.matrix(removeSparseTerms(Trigrams,.999))), decreasing=TRUE),20)
Trigrams <- data.frame(Ngram=names(Trigrams),Freq = Trigrams)
GG <- ggplot(data = Trigrams, aes(x = reorder(Ngram, Freq), y = Freq)) + coord_flip()
GG <- GG + geom_bar(stat="identity") + theme(axis.title.y = element_blank())
GG <- GG + labs(y = "Frequency", title = "Most Frequent word triplets")
GG
After this exploratory phase, we will need to make some decisions to make our prediction model.
The model itself will propose the 3 most frequent options based on the frequency N-grams tables. These tables should be available at run-time in the Shiny app, so will be limited in size. The app could react either after a space (calculate next probable word) or while typing (propose current word). I will probably propose only the former.
Constructing the model itself can be done on a large dataset, but having an ordinary pc, I will need to reduce the size of the texts when analyzing them statistically.Sampling will thus play an important role. Analyzing the whole set of texts at once is prohibitive, so dividing into manageble chunks and testing and combining the results will be necesary. We will test if it is necessary to use the whole text or if a sample would suffise. Comparing lists of sparse items over different samples could give clues here. Removing both high- and low frequency N-grams might help in a stepped approach.
Each of the collections represents a different style (a tweet will be less eloquent than a blog or a news article, although distinctions are probably less pronounced than could be expected) and as such will predict differently. I choose richer predictions over risk of mismatch because of style differences. All 3 texts collections will be used in the prediction model.
I will not remove stopwords, as I believe removing them makes more sense in analyzing “meaning” but not in the case of predicting, as removing them will make the language “unnatural”. This means I may have to include N-grams for higher N.