Introduction

Natural language processing and generation constitute a spectrum of problems that are both difficult and interesting for computer science and artificial intelligence. The present project elaborates on and addresses one of those problems in particular: text prediction.

As of 2018, there were 4.57 billion mobile users worldwide [1]. All of them, to different degrees, consume and rely on text prediction technology to facilitate communication through their mobile devices. Such exposure, and such a role in the way we communicate, requires constant improvement of the models and algorithms behind a technology that is effectively enabling mankind to communicate globally during this century.

The scope of the present work is to perform an initial exploratory analysis on a provided training data set in U.S. English. After addressing a set of initial questions about the nature and content of the data, an initial model for text prediction is proposed based on the findings of the provided training data sets.

Introducing the Target Data Sets

The data sets used for the present work consist of three files containing raw text sampled from the following types of sources: news articles (en_US.news.txt), personal blogs (en_US.blogs.txt), and Twitter posts (en_US.twitter.txt).

A summary of the text corpus composed of these files is produced by the following code:

library(tm)

# First read the raw content of the files
newsFile <- readLines("C:\\Users\\psainza\\Documents\\Capstone Project\\final\\en_US\\en_US.news.txt")
blogFile <- readLines("C:\\Users\\psainza\\Documents\\Capstone Project\\final\\en_US\\en_US.blogs.txt")
twitterFile <- readLines("C:\\Users\\psainza\\Documents\\Capstone Project\\final\\en_US\\en_US.twitter.txt")

# Count the number of lines in each source file
newsTotalSize <- length(newsFile)
blogTotalSize <-length(blogFile)
twitterTotalSize <- length(twitterFile)

# Then extract a sample of it, starting with 1% for optimization purposes
newsSample <-sample(newsFile,newsTotalSize*0.01)
blogSample <-sample(blogFile,blogTotalSize*0.01)
twitterSample <- sample(twitterFile,twitterTotalSize*0.01)

# Create a unified sample vector to create the corpus
unifiedSampleVector <- c(newsSample,blogSample,twitterSample)

targetCorpus <- VCorpus(VectorSource(unifiedSampleVector))
# As a first assumption and for simplicity, let's normalize everything to lower case
targetCorpus <- tm_map(targetCorpus,content_transformer(tolower))
# Then we remove the punctuation
targetCorpus <-tm_map(targetCorpus,removePunctuation)
# Then remove actual numbers
targetCorpus <-tm_map(targetCorpus,removeNumbers)
# Collapse any accidental extra whitespace
targetCorpus <- tm_map(targetCorpus,stripWhitespace)

# Then generate the corresponding document-term matrix
targetDTM <- DocumentTermMatrix(targetCorpus)
inspect(targetDTM)
## <<DocumentTermMatrix (documents: 33365, terms: 46798)>>
## Non-/sparse entries: 472839/1560942431
## Sparsity           : 100%
## Maximal term length: 126
## Weighting          : term frequency (tf)
## Sample             :
##       Terms
## Docs   and are for have that the this was with you
##   1159  16   1   5    3    2  12    0  10    3   1
##   2270  16   5   2    2    5  13    0   0    4  11
##   2712   8   3   3    6    7  38    7   2    4   1
##   3277   8   0   2    2    3  15    3  11    5   1
##   3660  13   1   4    4    2  18    0   3    9   1
##   3952  14   0   3    0    4  12    1   9    0   0
##   4155   7   0   4    0    4  29    1   3    1   0
##   7064  21   4   2    0    9  18    2   3    3   1
##   8076  13   3   4    3    4  20    6   1    6   3
##   9162  20   1   1    2    7  18    5   0    6   3

Addressing the First Questions

  1. How big could the model be?

We initially started with the complete raw files as the source for the corpus.

After a couple of attempts at creating a working corpus for further analysis and mining, it became clear that the required processing time would make it impractical to process all the data from the source files into a single text corpus. Therefore, as noted in the previous code block, the work instead uses a random sample equivalent to 1% of the original text volume. This keeps the sample representative while considerably lowering the processing and memory load for the purposes of this work.
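Since sample() draws a different subset on every run, figures derived from the sample will vary between runs. A minimal sketch of making the sampling reproducible, where the seed value is arbitrary and purely illustrative:

# Fix the RNG seed right before sampling so the 1% sample, and therefore
# every figure derived from it, can be reproduced
set.seed(1234)
newsSample <- sample(newsFile, newsTotalSize * 0.01)
blogSample <- sample(blogFile, blogTotalSize * 0.01)
twitterSample <- sample(twitterFile, twitterTotalSize * 0.01)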

After reducing the size of the source data used to create the corpus, we ended up with a text corpus of 123.7 MB and a document-term matrix of just 11.8 MB, a reduction of 97.9% in memory allocation.
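The quoted figures can be checked directly in the R session; the sketch below shows one way to do it, with the exact values depending on the sampled lines and the R version:

# Compare the in-memory footprint of the raw vectors against the
# 1% sample corpus and its document-term matrix
format(object.size(newsFile), units = "Mb")      # raw news lines
format(object.size(blogFile), units = "Mb")      # raw blog lines
format(object.size(twitterFile), units = "Mb")   # raw tweets
format(object.size(targetCorpus), units = "Mb")  # sampled text corpus
format(object.size(targetDTM), units = "Mb")     # document-term matrix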

  2. What are the most frequent terms?

For the purposes of the present work, the frequency distribution of the most common terms in the created corpus is shown below:

# First pick up the terms that appear 2000 times or more
topFrequentTerms <- findFreqTerms(targetDTM, lowfreq = 2000)
# Calculate the word frequencies across the overall corpus
wordFrequencies <- slam::col_sums(targetDTM)

# Create the data frame for the plot, pairing each frequent term with its
# total count; Term is kept as a factor so plot() draws one column per term
frequencyDataFrame <- data.frame(
  Term = factor(topFrequentTerms),
  Frequency = as.integer(wordFrequencies[topFrequentTerms])
)

plot(frequencyDataFrame,main="Top Frequent Terms")

As illustrated by the previous plot, there are currently 24 terms with 2000 or more occurrences across the corpus; as expected, these are mostly U.S. English stop words.
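As a quick sanity check of that observation (not part of the original analysis), the frequent terms can be compared against tm's built-in English stop word list:

# Which of the top frequent terms are standard English stop words?
intersect(topFrequentTerms, stopwords("en"))
# And which terms cross the threshold without being stop words?
setdiff(topFrequentTerms, stopwords("en"))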

Initial Model Proposal

Based on the data obtained in the previous section, an initial model to enable text prediction could be a decision-tree-based model in which, at least on a per-token basis, the suggested predictions would be the three terms most strongly associated with the provided term.

An example of the data that would be used to train such a model is shown below, using a subset of the top frequent terms analyzed in the present work:

for(topFrequentTerm in topFrequentTerms[1:5]){
  print(topFrequentTerm)
  # Report the terms whose correlation with the current term is at least 0.2
  print(findAssocs(targetDTM,topFrequentTerm,0.2))
  
}
## [1] "about"
## $about
##  and  the 
## 0.23 0.21 
## 
## [1] "all"
## $all
##  and  the that 
## 0.28 0.25 0.22 
## 
## [1] "and"
## $and
##   the  that  with   was   for   but  this  from  have   all   not   are 
##  0.60  0.44  0.39  0.36  0.33  0.31  0.31  0.29  0.29  0.28  0.27  0.26 
##   had  they   his  were   one their about  into  more other   out  some 
##  0.26  0.26  0.25  0.25  0.24  0.24  0.23  0.23  0.23  0.23  0.23  0.23 
##  then  when  also  even there which  been   has   her   our  them   who 
##  0.23  0.23  0.22  0.22  0.22  0.22  0.20  0.20  0.20  0.20  0.20  0.20 
## 
## [1] "are"
## $are
##  the  and they that  you 
## 0.27 0.26 0.26 0.23 0.23 
## 
## [1] "but"
## $but
##  and  the  not that  was with have this 
## 0.31 0.31 0.30 0.29 0.25 0.23 0.21 0.21
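As a sketch of how these associations could drive the proposed per-token suggestions, the hypothetical helper below returns the three terms most strongly associated with a given token. The name suggestNextTerms, the 0.2 correlation floor, and the use of document-level correlation as a proxy for next-word likelihood are assumptions made for illustration, not part of the original analysis.

# Hypothetical helper: suggest the n terms most associated with a given token,
# reusing the document-term matrix built earlier in this report
suggestNextTerms <- function(dtm, term, corlimit = 0.2, n = 3) {
  associations <- findAssocs(dtm, term, corlimit)[[term]]
  # findAssocs returns a named vector sorted by decreasing correlation,
  # so the first n names are the strongest candidates
  head(names(associations), n)
}

suggestNextTerms(targetDTM, "about")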

Conclusions

The work presented in this paper involved a series of challenges inherent to the problem domain of text mining.

The original intent was to keep the English stop words in order to prevent the predictor from producing only 'robotic' English. However, the partial results of the first prediction model proposal suggest that keeping the stop words might be counterproductive: since those terms are associated with too many different terms, it becomes very hard to pick the most suitable suggestion without any additional context about the phrase being handled.

Therefore, in order to refine the model proposal, the following focus areas are proposed:

  1. Remove the U.S. English stop words (a minimal sketch is shown after this list).
  2. Look for further optimizations of representativeness and corpus size: the 1% sample already gives good results, but it could potentially be reduced even further, considering the vocabulary size of an average native English speaker.
  3. Explore ways to filter out quoted foreign language within the source data, which may lead to establishing a term-frequency-based threshold.
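Regarding the first item, a minimal sketch of how the stop word removal could be added to the tm pipeline used earlier, assuming tm's built-in English stop word list is an acceptable starting point for this U.S. English corpus:

# Remove the standard English stop words before rebuilding the matrix;
# stripWhitespace is applied again to collapse the gaps left behind
targetCorpus <- tm_map(targetCorpus, removeWords, stopwords("en"))
targetCorpus <- tm_map(targetCorpus, stripWhitespace)
targetDTM <- DocumentTermMatrix(targetCorpus)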

References