The SwiftKey Corpus consists of news articles, blogs, and tweets in English, Russian, German, and Finnish. The purpose of these corpora is to provide natural language data to aid in text prediction. This exploratory analysis will briefly examine these three genres of writing in English.
Preliminary exploratory analysis suggests that the SwiftKey corpus follows the patterns predicted by Zipf’s Law, which are common to natural language data. Overall, about 45% of word types occur only once in the data, and nearly 4% of all word tokens are the word the.
Since Finnish and Russian use non-ASCII characters, we will need the Unicode package to handle them.
library(Unicode)
Next, we set our working directory and load all the data. This is demonstrated for the English corpora, but can be performed on the other languages as well.
setwd("~/Desktop/SwiftKey Corpus/en_US")
en.tweets <- readLines("en_US.twitter.txt")
en.news <- readLines("en_US.news.txt")
en.blogs <- readLines("en_US.blogs.txt")
My prediction algorithm will focus only on words, ignoring case, punctuation, and numbers. Although predicting capitalization, numbers, and punctuation (including emoji) is interesting, this level of complexity will be reserved for later versions of the prediction algorithm. Therefore, it is necessary to remove numbers and extraneous spaces, and to normalize the data to lowercase. This process will be demonstrated on the English blog data; however, the same process will be performed on the Twitter and news corpora as well.
en.blogs2 <- gsub("\\d","",en.blogs) #get rid of numbers
en.blogs3 <- gsub("\\W", " ", en.blogs2) #get rid of punctuation
en.blogs4 <- gsub("\\s+", " ", en.blogs3) #replace 2+ spaces with one
en.blogs5 <- tolower(en.blogs4) #make everything lowercase
Getting rid of punctuation will cause contractions like should’ve to become separate single words, as in should and ve.
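Since the same four steps are applied to every corpus, they could be wrapped in one function. The following is a minimal sketch: the name clean_lines is hypothetical, and the final trimws() call is an added assumption (the steps above leave leading and trailing spaces in place).

```r
# Hypothetical helper wrapping the four cleaning steps shown above.
clean_lines <- function(lines) {
  lines <- gsub("\\d", "", lines)    # remove digits
  lines <- gsub("\\W", " ", lines)   # replace every non-word character with a space
  lines <- gsub("\\s+", " ", lines)  # collapse runs of whitespace into one space
  lines <- tolower(lines)            # normalize to lowercase
  trimws(lines)                      # assumption: also strip leading/trailing spaces
}

cleaned <- clean_lines(c("I should've called 2 days ago!", "  Hello,   World  "))
print(cleaned)
```

The first toy line shows the contraction effect described above: should've comes out as the two tokens should and ve.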
A unigram is a single word. Again, this process will be demonstrated on the English blog data; however, the same process was performed on all twelve corpora.
en.blogs6 <-strsplit(en.blogs5, "\\s") #split on spaces
en.blogs7 <-unlist(en.blogs6) #unlist the vector
en.blogs.freq <-table(en.blogs7) #makes a frequency list of every word
sorted.en.blogs.freq<-sort(en.blogs.freq, decreasing=T) #sort the frequency list
write.table(sorted.en.blogs.freq, "enBlogFreqs.csv",sep=",") #make a CSV file with the frequency list
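As a small self-contained illustration of the counting steps above, the sketch below runs them on two toy lines. Dropping empty strings is an added assumption, not part of the original pipeline: splitting a line that begins with a space produces an empty first element, which would otherwise be counted as a word.

```r
toy_lines <- c("the cat sat", " the dog sat on the mat")  # toy, already-cleaned lines
words <- unlist(strsplit(toy_lines, "\\s"))               # split on whitespace
words <- words[words != ""]                               # assumption: discard empty strings
freq <- sort(table(words), decreasing = TRUE)             # counts, most frequent first

# Store the result as a two-column data frame, like the CSV files used later
toy_unigram <- data.frame(word = names(freq),
                          freq = as.vector(freq),
                          stringsAsFactors = FALSE)
print(toy_unigram)
```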
Now that the unigram frequency data have been saved as a .csv file, we can return to this at a later point, without loading or manipulating the original data again.
setwd("~/Desktop/SwiftKey Corpus/en_US")
en.blog.unigram <- read.csv("enBlogFreqs.csv",col.names=c("word","blogFreq"))
en.news.unigram <- read.csv("enNewsFreqs.csv",col.names=c("word","newsFreq"))
en.tweet.unigram <- read.csv("enTweetFreqs.csv",col.names=c("word","tweetFreq"))
It is important to note that I named the words column as a generic “word”, but the frequency column has a unique name for each corpus. This will be important later on when merging the data.
When predicting text, word order matters. We expect a native English speaker to frequently say the red house and almost never say house red the. Moreover, certain combinations of words (such as the White House) might be more common than others (such as the red house). An n-gram is a sequence of n consecutive words: a 2-gram is a pair of adjacent words, a 3-gram is a triple, and so on. The functions ngram2, ngram3, and ngram4 will help produce these frequency lists.
ngram2 <- function(corpus){
  ngrams <- vector()
  for(i in seq_along(corpus)){ #for each line
    temp <- unlist(strsplit(corpus[i], "\\W"))
    cat(i/length(corpus)*100, "%\n") #give progress
    for(j in seq_len(max(length(temp)-1, 0))){ #each position that starts a complete 2-gram
      ngram <- paste(temp[j],temp[j+1]) #join the words into a single string
      ngrams <- append(ngrams, ngram)
    }
  }
  return(table(ngrams))
}
ngram3 <- function(corpus){
  ngrams <- vector()
  for(i in seq_along(corpus)){ #for each line
    temp <- unlist(strsplit(corpus[i], "\\W"))
    cat(i/length(corpus)*100, "%\n") #give progress
    for(j in seq_len(max(length(temp)-2, 0))){ #each position that starts a complete 3-gram
      ngram <- paste(temp[j],temp[j+1],temp[j+2]) #join the words into a single string
      ngrams <- append(ngrams, ngram)
    }
  }
  return(table(ngrams))
}
ngram4 <- function(corpus){
  ngrams <- vector()
  for(i in seq_along(corpus)){ #for each line
    temp <- unlist(strsplit(corpus[i], "\\W"))
    cat(i/length(corpus)*100, "%\n") #give progress
    for(j in seq_len(max(length(temp)-3, 0))){ #each position that starts a complete 4-gram
      ngram <- paste(temp[j],temp[j+1],temp[j+2],temp[j+3]) #join the words into a single string
      ngrams <- append(ngrams, ngram)
    }
  }
  return(table(ngrams))
}
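This process was highly inefficient: growing a vector with append() copies it on every iteration, and a progress line is printed for every line of the corpus.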
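One possible way to reduce that cost is to build all the n-grams of a line at once with vectorized indexing. The sketch below is not the method used in this analysis; the function name ngram_counts and its argument n are hypothetical, and it drops empty strings before forming n-grams.

```r
# Hypothetical helper: count n-grams of any order n without growing a vector
# inside a loop.
ngram_counts <- function(corpus, n = 2) {
  per_line <- lapply(strsplit(corpus, "\\W+"), function(words) {
    words <- words[words != ""]        # drop empty strings left by leading separators
    k <- length(words) - n + 1         # number of n-grams in this line
    if (k < 1) return(character(0))    # line too short to contain an n-gram
    # parts[[i]] holds the i-th word of every n-gram; paste joins them element-wise
    parts <- lapply(seq_len(n), function(i) words[i:(i + k - 1)])
    do.call(paste, parts)
  })
  table(unlist(per_line))
}

toy_corpus <- c("the red house", "the red door", "house")
bigrams <- ngram_counts(toy_corpus, n = 2)
trigrams <- ngram_counts(toy_corpus, n = 3)
print(bigrams)
```

A single function with an n argument also avoids maintaining three near-identical copies of the same loop.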
The number of rows tells us how many unique words are in the corpus (word types), while the sum of the freq column tells us how many total words are in the corpus (word tokens).
blogTypes <- nrow(en.blog.unigram); blogTypes
## [1] 256889
blogTokens <- sum(en.blog.unigram$blogFreq); blogTokens
## [1] 43015603
In the English blogs corpus, there are 256,889 unique word types and a total of 43,015,603 word tokens. However, we cannot expect each word to occur about blogTokens/blogTypes times (roughly 167 tokens per type). This is because language data tends to follow a highly skewed distribution.
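To see why the mean is misleading for skewed counts, consider a toy frequency list (made-up numbers, not taken from the corpus) with a few very frequent words and many words that occur once:

```r
# Made-up counts: three frequent words and seven words that occur once
toy_freq <- c(1000, 500, 333, 1, 1, 1, 1, 1, 1, 1)

mean_tokens <- mean(toy_freq)      # tokens per type on average
median_tokens <- median(toy_freq)  # tokens for the typical type
print(c(mean = mean_tokens, median = median_tokens))
```

The mean is 184 tokens per type, yet the typical (median) word occurs only once.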
According to Zipf’s Law (http://en.wikipedia.org/wiki/Zipf%27s_law), word frequencies in language data follow a power-law pattern: a word’s frequency is roughly inversely proportional to its frequency rank. In other words, the most frequent word (or n-gram) can be expected to be about twice as frequent as the second most frequent word (or n-gram), about three times as frequent as the third most frequent word (or n-gram), and so on. This means there is a small set of very highly frequent words and a large set of extremely infrequent words.
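Under the usual statement of Zipf’s Law, frequency is inversely proportional to rank, so log frequency plotted against log rank is a straight line with a slope near -1. The sketch below illustrates this with idealized, made-up frequencies; the count top_freq is hypothetical, and real corpora only approximate this pattern.

```r
word_rank <- 1:1000
top_freq <- 100000                 # hypothetical count of the most frequent word
freq <- top_freq / word_rank       # idealized Zipfian frequencies

ratio_2 <- freq[1] / freq[2]       # rank 1 vs. rank 2
ratio_3 <- freq[1] / freq[3]       # rank 1 vs. rank 3

# On a log-log scale the relationship is linear; estimate its slope
fit <- lm(log(freq) ~ log(word_rank))
slope <- unname(coef(fit)[2])
print(c(ratio_2 = ratio_2, ratio_3 = ratio_3, slope = slope))
```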
Moreover, the highly frequent words also overwhelmingly tend to be function words (pronouns, articles, prepositions, auxiliary verbs, and conjunctions) rather than content words (nouns, verbs, adverbs, and adjectives). Let’s look at the 10 most frequent words in the blogs corpus:
head(en.blog.unigram,10)
## word blogFreq
## 1 5144773
## 2 the 1860655
## 3 and 1094849
## 4 to 1069558
## 5 i 906685
## 6 a 903931
## 7 of 876836
## 8 in 598749
## 9 it 485362
## 10 that 484196
en.blog.unigram$blogFreq[1]/blogTokens * 100
## [1] 11.96025
sum(en.blog.unigram$blogFreq[1:10])/blogTokens * 100
## [1] 31.21099
Row 1 of this table is the empty string, most likely an artifact of splitting on whitespace rather than a word. Setting it aside, the nine most frequent words in the blogs corpus are, as expected, all function words: articles such as the and a, conjunctions such as and, prepositions such as to, of, and in, and pronouns such as i, it, and that. The ten rows shown account for 31.2% of all tokens; excluding the empty string, the nine words alone make up about 21.9% of the remaining 37,870,830 tokens. Over a fifth of the blog corpus can be accounted for by just these nine words.
Similar results were found in the other corpora. Among tweets, the most common words are the, i, to, a, you, and, for, it, in, and of, which account for 16.9% of the Twitter corpus. In the news corpus, the most common words are the, to, a, and, of, in, s, that, for, and it, which account for 21.8% of the news corpus.
It is important to note the differences between the three corpora. Tweets, which tend to be more social, contain the pronouns I and you more frequently than the other corpora. Words such as of and that, which are common in relative clauses and other syntactically complex phrases, are more common in the news and blogs corpora. The distributions of these words also differ from one corpus to the next. Therefore, it is important to consider context (news, blog, or tweet) when predicting text.
The least frequent words, which occur only once in a corpus, are known as hapaxes (http://en.wikipedia.org/wiki/Hapax_legomenon), and may make up as much as half of all the word types in the corpus. For example, in a corpus of 1,000,000 unique words, up to 500,000 may be hapaxes.
hapaxes <- en.blog.unigram$blogFreq==1; table(hapaxes)
## hapaxes
## FALSE TRUE
## 137910 118979
As we can see, 118,979 (46%) of the word types in the English blogs corpus occur only once. This, again, is expected based on predictions from Zipf’s Law. Similar rates are found in the Twitter corpus (56% hapaxes) and news corpus (39% hapaxes).
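The hapax rate is simply the share of word types whose frequency is 1. A short sketch on made-up counts, followed by the same arithmetic on the counts printed in the table above:

```r
# Made-up frequency list: two of the four word types occur once
toy_freq <- c(the = 5, sat = 2, cat = 1, dog = 1)
toy_hapaxes <- toy_freq == 1
toy_rate <- mean(toy_hapaxes)      # proportion of types that are hapaxes

# Same calculation from the FALSE/TRUE counts shown above
blog_rate <- 118979 / (137910 + 118979) * 100
print(c(toy_rate = toy_rate, blog_rate = blog_rate))
```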
One popular method of visualizing word frequency data is through a word cloud, which can be done in R using library(wordcloud). This is demonstrated below for all three corpora:
From left to right: blogs, tweets, and news. While visually appealing, these plots do not demonstrate any statistical information. The Zipfian distribution is better demonstrated by plotting the (logged) frequencies of each word against their ranked orders:
Here we can see the logarithmic curve predicted by Zipf’s Law. Again, we can see differences from one corpus to the next. While the is the most common word in all three, social words like thanks and question words like what are more common in the Twitter corpus, due to its more interactive nature. Third person pronouns like they are more common in the news and blogs corpora, because these may be more reportative.
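A rank-frequency plot of the kind described above can be sketched as follows. The helper name zipf_points and the toy counts are hypothetical; with real data, the frequency column of a unigram table would be passed in instead.

```r
# Hypothetical helper: pair each word's frequency rank with its logged frequency
zipf_points <- function(freq) {
  freq <- sort(freq, decreasing = TRUE)
  data.frame(rank = seq_along(freq), logFreq = log(as.vector(freq)))
}

toy_freq <- c(the = 1000, of = 500, and = 333, house = 10, botches = 1)
pts <- zipf_points(toy_freq)

# Logged frequency against rank, one point per word type
plot(pts$rank, pts$logFreq, type = "b",
     xlab = "Frequency rank", ylab = "log(frequency)")
```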
First, we’ll load our n-grams:
#blog2gram <- read.csv("enBlog2gramFreqs.csv",col.names=c("word","blog2gramFreq"))
#blog2gram <- blog2gram[ order(-blog2gram$blog2gramFreq), ]
#head(blog2gram)
Unigram frequencies can help overall prediction. For example, given no other information, a highly frequent word such as the is much more likely to occur than an infrequent word like botches. However, raw frequencies alone will not be helpful in prediction. Therefore, the relative frequencies of each word will be used as factors in a machine learning predictive algorithm.
The most basic type of prediction would only use the relative frequency of unigrams. To add more complexity, and predictive power, we can also add 2-grams, 3-grams, or other n-grams as factors. To add even more predictive power, we can consider the context. For example, someone writing a tweet may be more likely to type you than someone writing a news article.
Although context is important, having an overall idea of word frequencies is important as well. We can merge the data into one data set using the merge() function from base R:
English <- merge(en.tweet.unigram, merge(en.blog.unigram, en.news.unigram,all=T), all=T)
English[is.na(English)]=0 #makes NAs zero (common with hapaxes)
English$totalFreq <- English$tweetFreq+English$blogFreq+English$newsFreq
English <- English[ order(-English$totalFreq), ]
englishTypes <- nrow(English); englishTypes
## [1] 546645
englishTokens <- sum(English$totalFreq); englishTokens
## [1] 119778688
head(English, 10)
## word tweetFreq blogFreq newsFreq totalFreq
## 1 6025535 5144773 5571626 16741934
## 265280 the 937889 1860655 1974508 4773052
## 270832 to 788925 1069558 906168 2764651
## 9117 and 438732 1094849 889538 2423119
## 316 a 617132 903931 894434 2415497
## 123620 i 918750 906685 195398 2020833
## 189296 of 359743 876836 774512 2011091
## 127621 in 380762 598749 679106 1658617
## 132831 it 383721 485362 286537 1155620
## 264903 that 271083 484196 371795 1127074
There are over half a million unique word types and over 100 million word tokens in this corpus!
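The merge step above relies on the shared word column and the uniquely named frequency columns. A toy illustration with made-up counts (the data frames blog and news here are hypothetical stand-ins for the unigram tables):

```r
blog <- data.frame(word = c("the", "house", "blog"),
                   blogFreq = c(10, 3, 1), stringsAsFactors = FALSE)
news <- data.frame(word = c("the", "house", "senate"),
                   newsFreq = c(12, 2, 4), stringsAsFactors = FALSE)

both <- merge(blog, news, all = TRUE)   # outer join on the shared column "word"
both[is.na(both)] <- 0                  # a word missing from one corpus gets a count of 0
both$totalFreq <- both$blogFreq + both$newsFreq
both <- both[order(-both$totalFreq), ]  # most frequent first
print(both)
```

Words found in only one corpus (blog, senate) survive the merge because of all = TRUE, and their missing counts become zeros.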
Raw numbers do not give an accurate picture of the data. Relative frequencies are more like probabilities, and can be used in a prediction model.
tweetTokens <- sum(English$tweetFreq)
blogTokens <- sum(English$blogFreq)
newsTokens <- sum(English$newsFreq)
English$tweetRelFreq <- English$tweetFreq/tweetTokens
English$blogRelFreq <- English$blogFreq/blogTokens
English$newsRelFreq <- English$newsFreq/newsTokens
English$totalRelFreq <- English$totalFreq/englishTokens
head(English)
## word tweetFreq blogFreq newsFreq totalFreq tweetRelFreq blogRelFreq
## 1 6025535 5144773 5571626 16741934 0.16472815 0.11960248
## 265280 the 937889 1860655 1974508 4773052 0.02564033 0.04325535
## 270832 to 788925 1069558 906168 2764651 0.02156790 0.02486442
## 9117 and 438732 1094849 889538 2423119 0.01199421 0.02545237
## 316 a 617132 903931 894434 2415497 0.01687137 0.02101403
## 123620 i 918750 906685 195398 2020833 0.02511710 0.02107805
## newsRelFreq totalRelFreq
## 1 0.138651384 0.13977390
## 265280 0.049136153 0.03984893
## 270832 0.022550230 0.02308133
## 9117 0.022136388 0.02022997
## 316 0.022258226 0.02016633
## 123620 0.004862531 0.01687139
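As a quick check on the idea, relative frequencies within one corpus sum to 1, which is what makes them usable as probabilities. The helper rel_freq and the toy counts are hypothetical; the last line repeats the calculation for the in the blogs corpus using the counts printed above.

```r
# Hypothetical helper: convert raw counts to relative frequencies
rel_freq <- function(freq) freq / sum(freq)

toy <- c(the = 6, house = 3, red = 1)   # made-up counts
toy_rel <- rel_freq(toy)
print(toy_rel)
print(sum(toy_rel))

# "the" in the blogs corpus: count divided by total blog tokens
the_blog_rel <- 1860655 / 43015603
print(the_blog_rel)
```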
Future work will calculate relative frequencies for all the n-gram tables. All the n-gram frequencies will be merged with the unigram frequencies, such that any n-gram or unigram contained in a 4-gram will appear in a single row. For example (fake data):
example
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## colname "4gram" "freq" "3gram1" "freq" "3gram2" "freq" "2gram1" "freq"
## data "A B C D" "0.01" "A B C" "0.01" "B C D" "0.01" "A B" "0.01"
## [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16]
## colname "2gram2" "freq" "2gram3" "freq" "1gram1" "freq" "1gram2" "freq"
## data "B C" "0.01" "C D" "0.01" "A" "0.01" "B" "0.01"
## [,17] [,18] [,19] [,20]
## colname "1gram3" "freq" "1gram4" "freq"
## data "C" "0.01" "D" "0.01"
Hapaxes or other extremely infrequent combinations may be discarded, as they take up nearly half the data set and provide relatively little predictive power at a large memory cost.
The ultimate goal of the prediction algorithm will be to predict “D” given the sequence “A B C” and using the associated frequency probabilities.
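As a rough illustration of that goal, the sketch below picks the most frequent 4-gram that begins with a given three-word context and returns its last word. This is only a maximum-frequency lookup under stated assumptions, not the machine learning model planned above; the function name predict_next and the frequencies are hypothetical (fake data, extending the example row).

```r
# Hypothetical helper: return the most frequent continuation of a 3-word context
predict_next <- function(context, fourgram_freq) {
  prefix <- paste0(context, " ")
  hits <- fourgram_freq[startsWith(names(fourgram_freq), prefix)]
  if (length(hits) == 0) return(NA_character_)   # context never seen
  best <- names(hits)[which.max(hits)]           # most frequent matching 4-gram
  substring(best, nchar(prefix) + 1)             # keep only the final word
}

# Fake relative frequencies, keyed by 4-gram
fourgram_freq <- c("A B C D" = 0.01, "A B C E" = 0.03, "B C D A" = 0.02)
guess <- predict_next("A B C", fourgram_freq)
print(guess)
```

With these fake numbers the lookup returns E rather than D, because A B C E is the more frequent 4-gram; an unseen context returns NA, which is where lower-order n-grams could be consulted instead.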