Exploration of Data Sources for the SwiftKey Capstone Project

Executive Summary

This report outlines the exploration of the raw text data required for this project.
There are three data sets:
* Blogs
* Twitter feeds
* News feeds

We will explore the characteristics of these three very different types of text. Blogs are written by individuals in their own style. Twitter entries are quick and limited to 140 characters, which we expect to produce a different type of communication with a different shape from the more formal sources. Lastly, the News feeds are provided by professional news organisations and are generally edited in a very specific way.

We will compare and contrast the results when creating the various N-Gram lists and the frequency spectrum for each.

Loading the Data

Three files are provided for this project; for this report we use the English versions. They were loaded once and saved as processed RDS files so they can be reloaded quickly later rather than re-reading the raw files.

First, the libraries required for text mining and other data manipulation activities are loaded.
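
A minimal sketch of this loading step is shown below. It assumes the English files from the standard SwiftKey download sit in the working directory under their usual names; the exact set of libraries may differ slightly from the original run.

# Libraries for text mining, tokenisation and plotting
library(tm); library(RWeka); library(stringr); library(stringi)
library(dplyr); library(ggplot2); library(zipfR)

# Read the three English source files (file names assumed from the standard
# SwiftKey download) and cache them as RDS objects for faster reloading
blogs   <- readLines("en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)

saveRDS(blogs,   "blogs.RDS")
saveRDS(twitter, "twitter.RDS")
saveRDS(news,    "news.RDS")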

Preprocessing & Examining the Data

When looking at the data, we need to understand the structure and size of each data set in order to decide how the sampling and comparison should be tackled. Here the basic parameters are identified and presented.

# Get the number of lines in each data set and the length of each line
blog.lines <- length(blogs)
twitter.lines <- length(twitter)
news.lines <- length(news)

blog.linelength <- nchar(blogs)
twitter.linelength <- nchar(twitter)
news.linelength <- nchar(news)
# Get words in files
blogs.words <- sum(stri_count_words(blogs))
news.words <- sum(stri_count_words(news))
twitter.words <- sum(stri_count_words(twitter))

blogs.data <- cbind(c("Blogs"), blog.lines, max(blog.linelength), blogs.words)
twitter.data <- cbind(c("Twitter"), twitter.lines, max(twitter.linelength), twitter.words)
news.data <- cbind(c("News"), news.lines, max(news.linelength), news.words)
Source.data <- data.frame(rbind(blogs.data,twitter.data,news.data))
colnames(Source.data) <- c("Data Source", "No of Lines","Max Line Length", "No of Words")

knitr::kable(Source.data, caption = "Properties of the Three Source Files")
Properties of the Three Source Files

Data Source    No of Lines    Max Line Length    No of Words
Blogs              899,288             40,833     37,546,246
Twitter          2,360,148                140     30,093,410
News                77,259              5,760      2,674,536

As can be seen from the data above, blog entries are much longer than the news articles, which are more succinct, while the Twitter items are of course limited to a very compact 140 characters. The word counts also differ significantly. To even up the comparison, the report therefore takes samples of each source sized so that roughly comparable numbers of words are being considered when analysing structure.

Sampling the Data for Analysis

The sampling rates were chosen to even up the disparity in size between the given data sets.
For this analysis, we take a 20% sample of Blogs, a 10% sample of Twitter, and a 40% sample of News.

# Get samples for each set of documents
set.seed(10212)
blogs.sample <- blogs[sample(1:length(blogs), length(blogs)*0.20)]
saveRDS(blogs.sample , file = "blogs.sample.RDS")

twitter.sample <- twitter[sample(1:length(twitter), length(twitter)*0.10)]
saveRDS(twitter.sample , file = "twitter.sample.RDS")
# take more news data as it is under-represented in the three sources - to balance the samples 
news.sample <- news[sample(1:length(news), length(news)*0.40)]
saveRDS(news.sample , file = "news.sample.RDS")

Preprocessing the Data

When pre-processing the data, a number of activities must be done to clean and standardise the text.
In our case:
* Stripping out bad characters that cannot be handled by the text mining software. These can be foreign characters or system control characters that have been inadvertently included in the text.
* Converting everything to lower case so that the same word is compared and counted as the same thing.
* Removing multiple spaces (white space) to give clean boundaries between words for tokenisation.
* Removing profane or offensive words.
* Removing punctuation. Unless the requirement is to understand sentence structure, punctuation is not useful in this type of exercise.
* Removing very common words, called stopwords, such as ‘the’ and ‘and’, which appear very often in text and skew the probabilities. In this report I have chosen to leave stopwords in, because for many type-ahead applications stopwords are an integral part of the phrase, e.g. ‘on the go’ - if you lose ‘the’ and ‘on’ you lose much of the meaning of the trigram.

The list of bad or profane words used was found at:
http://www.cs.cmu.edu/~biglou/resources/bad-words.txt
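
As an aside, a minimal sketch of how that list could be fetched and stored locally as the badwords.csv file read below (this assumes the list is simply saved as downloaded, one word per line):

# One-off step: download the profanity word list and cache it locally
download.file("http://www.cs.cmu.edu/~biglou/resources/bad-words.txt",
              destfile = "badwords.csv")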

Below are the two functions used to create a document Corpus, a structure of documents ready for analysis, and to apply the preprocessing activities.
I chose to use a separate process for Twitter because, on examining the unigrams, I found many words like “I’ll” that are probably very relevant to the first-person style of communication on Twitter, unlike the more structured and formal third-person writing in blogs and, even more so, in news. For Twitter, therefore, the apostrophe character (’) is preserved for unigrams.

The two functions used are shown here:

# Set up various functions
# Profanity list: read in as a character vector so it can be passed to removeWords()
profanity <- read.csv("badwords.csv", stringsAsFactors = FALSE)[, 1]
getCorpus <- function(sample) {
      # replace any non-visible control characters with a space (so words stay
      # separated), then remove all characters that are not letters or a space
      sample <- str_replace_all(sample, "[[:cntrl:]]", " ")
      sample <- str_replace_all(sample, "[^[a-zA-Z ]]", "")
      # treat the whole sample as a single document
      corpus <- Corpus(VectorSource(list(sample)))
      corpus <- tm_map(corpus, stripWhitespace)
      corpus <- tm_map(corpus, content_transformer(tolower))
      corpus <- tm_map(corpus, removePunctuation, preserve_intra_word_dashes = TRUE)
      # Get rid of special characters in case we missed them before
      corpus <- tm_map(corpus, content_transformer(function(x) iconv(x, "latin1", "ASCII", sub = "")))
      corpus <- tm_map(corpus, removeWords, profanity)
      return(corpus)
}

getTwitterCorpus <- function(sample) {
      # as above, but keep the apostrophe so first-person forms like "i'll" survive
      sample <- str_replace_all(sample, "[[:cntrl:]]", " ")
      sample <- str_replace_all(sample, "[^[a-zA-Z' ]]", "")
      corpus <- Corpus(VectorSource(list(sample)))
      corpus <- tm_map(corpus, stripWhitespace)
      corpus <- tm_map(corpus, content_transformer(tolower))
      # Get rid of special characters in case we missed them before
      corpus <- tm_map(corpus, content_transformer(function(x) iconv(x, "latin1", "ASCII", sub = "")))
      corpus <- tm_map(corpus, removeWords, profanity)
      return(corpus)
}

Tokenising the Data into Useful Groups for Comparison

Tokenisation is the process of breaking sentences up into single- and multiple-word groups. These are called N-Grams, with N being a number of our choice: UniGrams have one word, BiGrams have two words, TriGrams have three words, and so on. The software takes all consecutive two-, three- or more-word groupings and counts the number of occurrences of each in our data. This establishes the likelihood of encountering that word or phrase when predicting what will come next, so it is used to estimate probability.
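
As a small illustration, using the same RWeka tokeniser employed by the functions below, a four-word sentence yields three consecutive two-word groupings:

library(RWeka)
# All consecutive two-word groupings (BiGrams) of a short example sentence
NGramTokenizer("the quick brown fox", Weka_control(min = 2, max = 2))
## [1] "the quick"   "quick brown" "brown fox"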

For this report, UniGrams, BiGrams and TriGrams will be compared across the three data sets. The following functions are used to build the N-Gram frequency tables for the three datasets and to plot the results.

getUniGrams <- function(dtm) {
      freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)  
      wordFreq <- data.frame(Term = names(freq), Frequency = freq)
      colnames(wordFreq) <- c("Term", "Frequency")
      return(wordFreq)
}

getTriGrams <- function(corpus) {
      # Tokenise into three-word groups (TriGrams) using the RWeka tokeniser
      TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
      triGrams <- TermDocumentMatrix(corpus, 
                                     control = list(tokenize = TrigramTokenizer, wordLengths = c(3, Inf)))
      gmatrix <- as.matrix(triGrams)
      # Build the frequency table directly so Frequency stays numeric
      triGramFreq.table <- data.frame(Term = rownames(gmatrix),
                                      Frequency = rowSums(gmatrix),
                                      stringsAsFactors = FALSE)
      triGramFreq.table <- arrange(triGramFreq.table, desc(Frequency))
      return(triGramFreq.table)
}

getBiGrams <- function(corpus) {
      # Tokenise into two-word groups (BiGrams) using the RWeka tokeniser
      BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
      biGrams <- TermDocumentMatrix(corpus, 
                                    control = list(tokenize = BigramTokenizer, wordLengths = c(2, Inf)))
      gmatrix <- as.matrix(biGrams)
      biGramFreq.table <- data.frame(Term = rownames(gmatrix),
                                     Frequency = rowSums(gmatrix),
                                     stringsAsFactors = FALSE)
      biGramFreq.table <- arrange(biGramFreq.table, desc(Frequency))
      return(biGramFreq.table)
} 
plotFrequencyFacets <- function(freqTable) {
      # Plot the resulting frequency charts, one facet per data source
      g <- ggplot(freqTable, aes(Term, Frequency, fill = Frequency))
      g <- g + facet_grid(Source ~ .)
      g <- g + geom_bar(stat = "identity")   
      g <- g + theme(axis.text.x = element_text(angle = 45, hjust = 1))   
      g <- g + ggtitle("Top Terms by Frequency, by Source")
      g 
}
plotFrequency <- function(freqTable) {
      g <- ggplot(freqTable[1:20,], aes(Term, Frequency,  fill=Frequency))    
      g <- g + geom_bar(stat="identity")   
      g <- g + theme(axis.text.x=element_text(angle=45, hjust=1))   
      g <- g + ggtitle("Plot of top 20 terms Frequency.")
      g 
}
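
Two further helpers, getDTM and topN, are called in the processing and comparison steps below but their definitions are not reproduced in this report. The sketches here are plausible reconstructions, consistent with how they are used; the originals may differ.

# Helper sketches (reconstructed - not the original definitions):
# getDTM builds a Document Term Matrix of single words for UniGram counts;
# topN takes the n most frequent terms and tags them with their data source
# so that plotFrequencyFacets() can facet on the Source column.
getDTM <- function(corpus) {
      DocumentTermMatrix(corpus, control = list(wordLengths = c(1, Inf)))
}

topN <- function(freqTable, sourceName, n) {
      top <- head(arrange(freqTable, desc(Frequency)), n)
      top$Source <- sourceName
      return(top)
}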

Processing the Data Sets

Now the datasets are ready to process. The following activities are included here.
* The Document Corpus is created.
* A Document Term Matrix is created to allow analysis of word frequencies.
* Finally, the N-Grams are created, ready to be plotted.

# Process Blogs 
blogs.corpus <- getCorpus(blogs.sample)
blogs.dtm <- getDTM(blogs.corpus)
blogs.uniGrams <- getUniGrams(blogs.dtm)
blogs.biGrams <- getBiGrams(blogs.corpus)
blogs.triGrams <- getTriGrams(blogs.corpus)
# Process Twitter
twitter.corpus <- getTwitterCorpus(twitter.sample)
twitter.dtm <- getDTM(twitter.corpus)
twitter.uniGrams <- getUniGrams(twitter.dtm)
twitter.biGrams <- getBiGrams(twitter.corpus)
twitter.triGrams <- getTriGrams(twitter.corpus)
# Process News
news.corpus <- getCorpus(news.sample)
news.dtm <- getDTM(news.corpus)
news.uniGrams <- getUniGrams(news.dtm)
news.biGrams <- getBiGrams(news.corpus)
news.triGrams <- getTriGrams(news.corpus)

The Results

Having processed the N-Gram lists, each is visualised using a frequency plot showing the top N UniGrams, BiGrams and so on.

# Reload the cached UniGram table and plot the top 20 terms
blogs.uniGrams <- readRDS("blogs.uniGrams.RDS")
plotFrequency(blogs.uniGrams)

As can be seen, ‘and’ and ‘the’ sit far above all the others because they are very common words; it may be useful to remove them later, as they overwhelm the other words.

Comparing UniGrams

The top 20 UniGrams, by frequency, were compared; as can be seen, there is little difference between the sources.

top20UniGrams <- as.data.frame(rbind(topN(blogs.uniGrams, "Blogs",20), 
                                     topN(news.uniGrams, "News",20), 
                                     topN(twitter.uniGrams, "Twitter",20) ))
plotFrequencyFacets(top20UniGrams)

It is interesting that in Blogs, ‘the’ and ‘and’ are used proportionately far more than in either News or Twitter. This could be a commentary on the writing style of bloggers! Otherwise, when looking at UniGrams, there is not much difference in UniGram coverage. Where a bar is blank for one source, that word appeared in the top 20 of the other sources but not that one.

Comparing BiGrams

top10BiGrams <- as.data.frame(rbind(topN(blogs.biGrams, "Blogs",10), 
                                    topN(news.biGrams, "News",10),
                                    topN(twitter.biGrams, "Twitter",10) ))
plotFrequencyFacets(top10BiGrams)

When looking at BiGrams, however, there is little overlap between the data sets in their top 10 BiGrams. This indicates that, when looking to predict next words, the probabilities of N-Grams appear to differ across the different writing styles and content. This needs to be considered when doing prediction in those domains.

Comparing TriGrams

Now the TriGrams are compared; a similar, and perhaps more distinct, variation is expected here, as there are many more permutations and combinations that can occur.

top10TriGrams <- as.data.frame(rbind(topN(blogs.triGrams, "Blogs",10), 
                                     topN(news.triGrams, "News",10), 
                                     topN(twitter.triGrams, "Twitter",10) ))
plotFrequencyFacets(top10TriGrams)

As can be seen, the effect is even more pronounced. Again, this needs to be considered when developing a text prediction application - context will be important.

Frequency Spectrums

A useful visualisation of the structure of the vocabulary is a Frequency Spectrum chart. This shows, for each word frequency, the number of words that occur with that frequency; i.e. for a word frequency of 4 (a word that was seen 4 times), how many words were seen exactly 4 times - for example, 5 words each occurring 4 times.

This is called the (Frequency) Class - the group of words that appeared a specific number of times. The structure can be shown by calculating the Classes and the number of occurrences in each, and then plotting them.
Here we are looking at News.
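
In essence this is just a tabulation of the Frequency column: for each frequency value, count how many distinct words have it. A one-line sketch of the same idea, before the fuller calculation used in the report:

# Number of distinct UniGrams at each frequency value (the frequency classes)
head(table(news.uniGrams$Frequency), 10)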

# Get Frequency Classes: for each frequency value, count how many words have it
FreqClass <- data.frame(cbind(news.uniGrams, as.character(news.uniGrams$Frequency)))
colnames(FreqClass) <- c("Term", "Frequency", "Class")
# Cross-tabulate class against frequency, then sum to get words per frequency value
FreqClassTab <- table(factor(FreqClass$Class), FreqClass$Frequency)
FreqClassDF <- as.data.frame(colSums(as.matrix(FreqClassTab)))
FreqClassFinal <- cbind(rownames(FreqClassDF), FreqClassDF)
colnames(FreqClassFinal) <- c("NoOfOccurences", "NumberOfWords")
fc <- FreqClassFinal
# Order the factor levels by descending word count so the plot runs from the
# most populous class (frequency 1) outwards
fc$NoOfOccurences <- factor(fc$NoOfOccurences, levels = fc$NoOfOccurences[order(fc$NumberOfWords, decreasing = TRUE)]) 

      g <- ggplot(fc[1:50,], aes(NoOfOccurences, NumberOfWords, group=1)) 
      g <- g + geom_line(stat="identity", colour="blue", size=1)
      g <- g + ggtitle("Number of Words in First 50 Frequency Classes") 
      g <- g + xlab("Frequency Class") + ylab("No of Words")
      g

c("Total Number of Classes found:")
## [1] "Total Number of Classes found:"
nrow(fc)
## [1] 514

As can be seen, a large number of words occur between one and five times. Looking at the remaining classes up to 50, the shape of the curve can be seen more clearly.

      g <- ggplot(fc[6:50,], aes(NoOfOccurences, NumberOfWords, group=1)) 
      g <- g + geom_line(stat="identity", colour="blue", size=1)
      g <- g + ggtitle("Number of Words in Classes 6 to 50") 
      g <- g + xlab("Frequency Class") + ylab("No of Words")
      g

Using a specialised package for lexical statistics, called zipfR, we can represent this structure and plot it easily.

# Create the spc frequency-spectrum data structure used by zipfR
# (convert the factor labels back to their numeric class values for m)
fc.spc <- spc(Vm = fc$NumberOfWords, m = as.numeric(as.character(fc$NoOfOccurences)))
# show summary information
summary(fc.spc)
## zipfR object for frequency spectrum
## Sample size:     N  = 499074 
## Vocabulary size: V  = 51110 
## Class sizes:     Vm = 25267 7327 3790 2222 1526 1198 866 704 ...
c("Number of lines or observations")
## [1] "Number of lines or observations"
as.character(N(fc.spc)) # no of observations - words
## [1] "499074"
c("The Vocabulary, or number of unique words identified")
## [1] "The Vocabulary, or number of unique words identified"
V(fc.spc) # no of unique words (vocabulary)
## [1] 51110

These plots show Vm, the number of words which had m occurrences, with m (the number of occurrences) on the X axis. So roughly 25,000 words occurred only once. The second chart, on a log scale, gives a better indication of the spread of word occurrences.

# Frequency spectrum plot - only the first 15 classes are shown by default, as there are so many
plot(fc.spc)

# Frequency spectrum plot - log scale version
plot(fc.spc, log="x")

There are also functions in the package to calculate vocabulary coverage, which will be used in the further research for this project.
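
As a preview of that coverage idea, a simple sketch (a direct calculation on the sorted UniGram table rather than a zipfR function) showing how many of the most frequent words are needed to cover 50% and 90% of all word occurrences:

# news.uniGrams is already sorted by decreasing frequency, so the cumulative
# sum of frequencies gives the coverage achieved by the top-n words
cum.coverage <- cumsum(news.uniGrams$Frequency) / sum(news.uniGrams$Frequency)
c(words.for.50.percent = which(cum.coverage >= 0.5)[1],
  words.for.90.percent = which(cum.coverage >= 0.9)[1])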

Conclusion

This analysis has shone a light on the differences in approach that might be required for text prediction, as each domain the data is sourced from has its own context and shape. This will need to be considered in the next phase of the project: determining the data product to be produced.

Developing the Data Product

On investigation, only a small proportion of the unique words in the corpora account for 90% of the word occurrences. Constructing and using the best set of UniGrams and other N-Grams will need to be optimised to ensure a good predictor for input text.
Based on the analysis in this report, the chosen data product should deal with a specific genre of text; e.g. a predictor for text entered into Twitter searches or feeds might look very different from one for news feeds. This will be investigated in the next phase.