As people spend more and more time on their mobile devices for email, social networking, banking, and so on, smart keyboards that can predict and automatically complete the words users are trying to type are of great interest. The goal of this project is to build a predictive text model in the form of a Shiny app, using English text data (Twitter, news, and blog posts) from a corpus called HC Corpora.
This preliminary report (1) explains the major features of the data, such as the distribution of and relationships between the words, tokens, and phrases in the text; and (2) summarizes my plans for creating the prediction algorithm and Shiny app.
# Load data into R
twitter <- readLines('en_US.twitter.txt', encoding = 'UTF-8', skipNul = T)
blog <- readLines('en_US.blogs.txt', encoding = 'UTF-8', skipNul = T)
news <- readLines('en_US.news.txt', encoding = 'UTF-8', skipNul = T)
The downloaded text files are loaded into R. Some basic information, including the number of lines, words, and characters in each file, is shown in the table below.
File | Size (MB) | Line Count | Word Count | Character Count |
---|---|---|---|---|
en_US.twitter.txt | 159.3641 | 2360148 | 30433550 | 162096241 |
en_US.blogs.txt | 200.4242 | 899288 | 38222300 | 206824505 |
en_US.news.txt | 196.2775 | 1010242 | 35710849 | 203223159 |
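The report does not show the code that produces these statistics; the sketch below is one way they could be computed, assuming the stringi package (not used elsewhere in this report) for the word counts.
# Sketch: per-file summary statistics (stringi assumed for word counting)
library(stringi)
file_stats <- function(path, lines) {
  data.frame(File            = basename(path),
             Size_MB         = file.size(path) / 1024^2,
             Line_Count      = length(lines),
             Word_Count      = sum(stri_count_words(lines)),
             Character_Count = sum(nchar(lines)))
}
rbind(file_stats('en_US.twitter.txt', twitter),
      file_stats('en_US.blogs.txt', blog),
      file_stats('en_US.news.txt', news))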
As the table shows, the full dataset is very large, which makes any further processing slow. Because it is not actually necessary to use all of the data to build the algorithm, only 5% of the lines from each source (Twitter, news, and blogs) are randomly sampled and then combined to build a new, smaller corpus.
# Randomly sample 5% of the lines from each of the 3 data sources
set.seed(25)
twitter_lines <- length(twitter); blog_lines <- length(blog); news_lines <- length(news)
sampletwitter <- twitter[sample(twitter_lines, round(0.05 * twitter_lines))]
sampleblog <- blog[sample(blog_lines, round(0.05 * blog_lines))]
samplenews <- news[sample(news_lines, round(0.05 * news_lines))]
writeLines(c(sampletwitter, sampleblog, samplenews), 'sample.txt')
rm(twitter, blog, news, sampletwitter, sampleblog, samplenews) # free up memory
The statistics of the sampled data are shown in the table below:
File | Size (MB) | Line Count | Word Count | Character Count |
---|---|---|---|---|
sample.txt | 27.55027 | 213483 | 5208554 | 28555992 |
With this much smaller dataset, a new corpus can be constructed for model building. After comparing two R packages, tm and quanteda, I found that quanteda is significantly faster and offers a more convenient pre-processing interface. It is therefore used to build the new corpus and to perform the steps described in the next section.
# Create a new corpus from the sampled data
library(quanteda)
sample <- readLines('sample.txt', encoding = 'UTF-8', skipNul = T)
corpus <- corpus(sample)
With the new corpus constructed, the next step is to build a basic n-gram model. An n-gram is a contiguous sequence of n items from a given sequence of text or speech, in this case the corpus built from the Twitter, blog, and news data. For example, a 2-gram (bigram) is a sequence of two consecutive words.
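As a toy illustration (not part of the report's pipeline), the 2-grams of a short sentence can be listed with base R:
# Toy example: 2-grams ('the quick', 'quick brown', 'brown fox')
words <- strsplit('the quick brown fox', ' ')[[1]]
paste(head(words, -1), tail(words, -1))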
Here, n from 1 to 4 is chosen to create four different n-gram tokenizations with the quanteda library. Note that the corpus needs to be pre-processed during tokenization and document-feature matrix construction. The quanteda library allows common pre-processing options to be passed directly to the tokenize() function; in this case punctuation, symbols, separators, Twitter characters, hyphens, and URLs are removed, as the code below shows.
# N-gram tokenization & document-term matrix construction
# 1-gram
Uni_token <- tokenize(corpus, removePunct = T, removeSymbols = T, removeSeparators = T, removeTwitter = T,
removeHyphens = T, removeURL = T, ngrams = 1, verbose = F)
Uni_dfm <- dfm(Uni_token)
##
## ... indexing documents: 213,483 documents
##
## ... indexing features:
## 131,234 feature types
##
## ... created a 213483 x 131235 sparse dfm
## ... complete.
## Elapsed time: 3.52 seconds.
# 2-gram
Bi_token <- tokenize(corpus, removePunct = T, removeSymbols = T, removeSeparators = T, removeTwitter = T,
removeHyphens = T, removeURL = T, ngrams = 2, concatenator = ' ', verbose = F)
Bi_dfm <- dfm(Bi_token)
##
## ... indexing documents: 213,483 documents
##
## ... indexing features:
## 1,610,202 feature types
##
## ... created a 213483 x 1610203 sparse dfm
## ... complete.
## Elapsed time: 5.81 seconds.
# 3-gram
Tri_token <- tokenize(corpus, removePunct = T, removeSymbols = T, removeSeparators = T, removeTwitter = T,
removeHyphens = T, removeURL = T, ngrams = 3, concatenator = ' ', verbose = F)
Tri_dfm <- dfm(Tri_token)
##
## ... indexing documents: 213,483 documents
##
## ... indexing features:
## 3,449,394 feature types
##
## ... created a 213483 x 3449395 sparse dfm
## ... complete.
## Elapsed time: 7.15 seconds.
# 4-gram
Quadri_token <- tokenize(corpus, removePunct = T, removeSymbols = T, removeSeparators = T, removeTwitter = T,
removeHyphens = T, removeURL = T, ngrams = 4, concatenator = ' ', verbose = F)
Quadri_dfm <- dfm(Quadri_token)
##
## ... indexing documents: 213,483 documents
##
## ... indexing features:
## 4,173,563 feature types
##
## ... created a 213483 x 4173564 sparse dfm
## ... complete.
## Elapsed time: 12.5 seconds.
With the document-feature matrices, we can identify the most frequent terms in the corpus for each n-gram model. The results are shown in the plots below:
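The bar plots are built from the document-feature matrices; the report's actual plotting code is not shown, but a minimal sketch for the uni-gram case, using quanteda's topfeatures() and a base R bar plot, is:
# Sketch: 20 most frequent unigrams (assumes Uni_dfm from above)
top_uni <- topfeatures(Uni_dfm, 20)
barplot(rev(top_uni), horiz = TRUE, las = 1, cex.names = 0.7,
        main = 'Top 20 unigrams', xlab = 'Frequency')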
We can also visualize term frequencies with word clouds via quanteda's plot() method, which passes its arguments through to the wordcloud() function from the wordcloud package. For example, the uni-gram word cloud of the top 100 words is shown below, with more frequent words drawn in a larger size.
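The same cloud can also be drawn directly with the wordcloud package from the unigram frequencies; the call below is a sketch, not necessarily the exact call used for the figure.
# Sketch: word cloud of the 100 most frequent unigrams
library(wordcloud)
library(RColorBrewer)
freq <- topfeatures(Uni_dfm, 100)
wordcloud(names(freq), freq, random.order = FALSE, colors = brewer.pal(8, 'Dark2'))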
From the bar plots and the word cloud, we see that the most frequent terms are mainly stopwords such as 'the' and 'and'. It may be necessary to remove these stopwords for some models, but they also make up a significant portion of everyday vocabulary, so removing them might not be helpful if the goal is to build a smart keyboard for mobile devices.
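Should stopword removal prove useful later, it is straightforward with quanteda; the sketch below assumes a quanteda version that provides dfm_remove() (older versions expose the same operation as removeFeatures()).
# Sketch: drop English stopwords from the unigram dfm and re-check the top terms
Uni_dfm_nostop <- dfm_remove(Uni_dfm, stopwords('english'))
topfeatures(Uni_dfm_nostop, 20)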
So far, a basic n-gram model has been built to help us understand the relationships between words. The next steps are to build the prediction algorithm on top of these n-gram frequencies and to deploy it as a Shiny app.