There are now many small devices on which you can type text. Some of them allow fast writing, but it is still hard to type quickly and without spelling mistakes. When you write on a desktop keyboard, it is not so difficult to type correctly and quickly, but typing on an iPhone or another small device is a different experience and can be frustrating for many of us. In this project, I will explain how to improve the typing experience on small devices by predicting the next word in a sentence fragment. With such a predictor we do not have to write whole sentences: the algorithm suggests the next word for us, which also reduces misspellings substantially. This project requires learning and applying Natural Language Processing techniques, which draw on several subfields such as statistics, machine learning, linguistics, and computer science. The main goal of this project is to help users of small devices type faster and avoid misspellings. In addition, we intend to personalize the prediction of the next word using the user's personal writing history. For instance, a Twitter user will be able to load all of their tweets into the application, and the next-word prediction algorithm will learn from these data.
Here, I will explain our exploratory data analysis along with the goals for the implementation of the algorithm.
We present a probabilistic model to predict the next word of a text. The model will be trained on an English-language dataset. It will be built using the frequencies of single words and of combinations of 2, 3, and 4 consecutive words; these combinations are called 1-grams, 2-grams, 3-grams, and 4-grams. We will estimate the probability distribution of these n-grams from our dataset and use it to predict the next word of an unseen text. First, we need to explore and understand our dataset.
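For reference, the core quantity behind this approach is the conditional probability of a word given the preceding words, estimated from n-gram counts. A standard maximum-likelihood estimate for the 4-gram case (lower orders are analogous) is:

$$
P(w_i \mid w_{i-3}, w_{i-2}, w_{i-1}) \approx \frac{\mathrm{count}(w_{i-3}\, w_{i-2}\, w_{i-1}\, w_i)}{\mathrm{count}(w_{i-3}\, w_{i-2}\, w_{i-1})}
$$

The final model may combine the different orders, for example by backing off to lower-order counts when a higher-order n-gram has not been seen, as discussed at the end of this report.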
We will use a public dataset from http://www.corpora.heliohost.org/download.html. This dataset contains 2,462,000,000 words in 67 languages, drawn from material published on various webpages since 2005. The corpus was collected from publicly available sources by a web crawler. Here we use only the English data from three different sources (blogs, newspapers, and Twitter).
First, we take a look at the data. As a first step we load a random sample from each source and explore it to identify the most frequent words in each source and to compare their behavior.
We explore the original datasets from the different sources to gather basic statistics about these files; in particular, we are interested in the number of lines and the number of words in each file (newspaper, blogs, and Twitter).
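A minimal sketch of how these counts can be obtained in R is shown below; the file paths are assumptions for illustration and may differ from the actual layout:

```r
# Count the number of lines and words in a text file (illustrative sketch;
# the file paths below are assumptions, not necessarily the real ones)
count_file <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  words <- sum(vapply(strsplit(lines, "\\s+"), length, integer(1)))
  c(`Number of Words` = words, `Number of Lines` = length(lines))
}

files <- c(Newspaper = "data/en_US/en_US.news.txt",
           Twitter   = "data/en_US/en_US.twitter.txt",
           Blogs     = "data/en_US/en_US.blogs.txt")
t(sapply(files, count_file))
```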
## [1] "Next table shows the number of words and lines in the corresponding sources"
## Number of Words Number of Lines
## newspaper 205811889 1010242
## Twitter 167105117 2360148
## Blogs 210160014 899288
The table shows that the number of words is very similar across sources (about 200,000,000 each), and that the newspaper and blogs files contain a similar number of lines (around 1,000,000) while Twitter has roughly twice as many. Overall, the three files are of comparable size. The next question is whether they contain similar words at similar frequencies. Before answering it, we need to do some cleaning of our data.
Data from public sources usually contain a lot of noise, so we first need to clean them. The first step is to remove non-printable characters from our files; otherwise they will cause problems when processing the text. We do this with the command-line utility tr and an ASCII character range, which removes any non-printable ASCII character.
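One possible implementation of this step calls tr from R via system(); the octal ranges keep tabs, newlines, carriage returns, and printable ASCII characters, and the file names are placeholders:

```r
# Strip non-printable ASCII characters using the command-line utility tr.
# Input/output file names are placeholders for the raw and cleaned files.
infile  <- "data/en_US/en_US.blogs.txt"
outfile <- "data/en_US/en_US.blogs_clean.txt"
# -c complements the set, -d deletes: everything outside tab (\11), newline (\12),
# carriage return (\15) and printable ASCII (\40-\176) is removed
system(paste0("tr -cd '\\11\\12\\15\\40-\\176' < ", infile, " > ", outfile))
```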
We take only 1000 lines at random from each source to explore the data. We do this sampling to avoid memory problems when running our code at this stage. For the final algorithm and app we may need to process the complete files; we will come back to this point later in the project.
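The sampling can be done along the following lines; the input file name is an assumption, while the output name matches the sample files used in the rest of this report:

```r
# Draw a reproducible random sample of 1000 lines from a cleaned source file
set.seed(1234)
sample_lines <- function(infile, outfile, n = 1000) {
  lines <- readLines(infile, encoding = "UTF-8", skipNul = TRUE)
  writeLines(sample(lines, min(n, length(lines))), con = outfile)
}
sample_lines("data/en_US/en_US.blogs_clean.txt", "data/en_US/en_US.blogs_Sample.txt")
```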
We load the sampled data and run the cleaning process. The cleaning depends on the goals of the project: since we want to predict the next word of a sentence, we do not want to suggest a profane word, so our cleaning process includes removing such words from the vocabulary, along with standard normalization steps. A sketch of such a pipeline is shown below, followed by a few example texts from each source.
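The sketch below shows the kind of cleaning pipeline we have in mind, using the tm package; the exact steps and the profanity list (here a placeholder argument) may change in the final implementation:

```r
library(tm)

# Build a corpus from the sampled lines and apply basic cleaning steps.
# 'badwords' is a placeholder for a profanity list supplied separately.
clean_corpus <- function(lines, badwords = character(0)) {
  corpus <- VCorpus(VectorSource(lines))
  corpus <- tm_map(corpus, content_transformer(tolower))  # lower-case everything
  corpus <- tm_map(corpus, removePunctuation)             # drop punctuation
  corpus <- tm_map(corpus, removeNumbers)                 # drop digits
  corpus <- tm_map(corpus, removeWords, badwords)         # drop unwanted words
  corpus <- tm_map(corpus, stripWhitespace)               # collapse extra spaces
  corpus
}

corpusblogs <- clean_corpus(readLines("data/en_US/en_US.blogs_Sample.txt"))
```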
## [1] "Example of two text from the blog data"
## [1] "No, I can't say that I'm kept up at night dreaming of plastic surgery or bashing my head on the wall for stupid things said and done. But of course I can recognise where there is room for improvement and of course I have regrets."
## [2] "I am nearing that magical age of forty. Not quite there yet, but it is at the end of the month. Fabled as the middle ages (this assuming that I am living until 80, but with the family genes of women in my family living until the 90s, doesn't quite work for me), I am excited about this decade."
## [1] "Example of two text from the news data"
## [1] "After the sale of the last 4,000 sets -- 32 heavy, hardbound volumes chock-full of information and illustrations on, well, nearly everything -- will exist weightless on the Internet, where they can be quickly updated."
## [2] "\"I just kind of shifted into my ER mode,\" Gail said. \"I thought, 'OK, I'm going to be supportive, watchful, clinical and maintain calmness for everybody, especially for Alli.' \""
## [1] "Example of two text from the twitter data"
## [1] "Clyde Stubblefield's drums are the backbone of hip-hop. Clyde Stubblefield is UNSUNG!!!"
## [2] "I love makeup because it enhances my beauty!! RT if you agree"
To carry out this analysis we tokenize each text in the dataset and create a term-document matrix for each source; from these we can make a wordcloud plot of the most frequent words in each source.
Blogs dataset:
```r
# Term-document matrix and word frequencies for the blogs sample
TDMblogs <- TermDocumentMatrix(corpusblogs)
matrixb <- as.matrix(TDMblogs)
freq_word_blog <- sort(rowSums(matrixb), decreasing = TRUE)  # total count of each word
df_blogs <- data.frame(word = names(freq_word_blog), freq = freq_word_blog)
wordcloud(df_blogs$word, df_blogs$freq, random.order = FALSE,
          colors = brewer.pal(8, "Dark2"), min.freq = 10)
head(freq_word_blog, 10)
```
## the and that for with you was this have but
## 2068 1168 477 390 341 316 295 285 235 220
We can see from the wordcloud that the most frequent word in the blogs sample is "the". This is expected because "the" is a stop-word, meaning that it occurs very often in the language. Here we decide to keep the stop-words because they will be important for predicting the next word.
Now, let's check the newspaper source:
```r
# Term-document matrix and word frequencies for the news sample
TDMnews <- TermDocumentMatrix(corpusnews)
matrixb <- as.matrix(TDMnews)
freq_word_news <- sort(rowSums(matrixb), decreasing = TRUE)
df_news <- data.frame(word = names(freq_word_news), freq = freq_word_news)
wordcloud(df_news$word, df_news$freq, random.order = FALSE,
          colors = brewer.pal(8, "Dark2"), min.freq = 10)
head(freq_word_news, 10)
```
## the and that for said was with from are but
## 1915 880 338 316 265 241 239 153 148 147
From the last plot we can conclude that the most frequent word in the newspaper dataset is also "the".
Now, let's study the Twitter dataset:
```r
# Term-document matrix and word frequencies for the Twitter sample
TDMtwitter <- TermDocumentMatrix(corpustwitter)
matrixb <- as.matrix(TDMtwitter)
freq_word_twitter <- sort(rowSums(matrixb), decreasing = TRUE)
df_twitter <- data.frame(word = names(freq_word_twitter), freq = freq_word_twitter)
wordcloud(df_twitter$word, df_twitter$freq, random.order = FALSE,
          colors = brewer.pal(8, "Dark2"), min.freq = 10)
head(freq_word_twitter, 10)
```
## the you for and that with have just this your
## 390 214 182 166 116 79 74 70 69 66
Again, the most frequent word in the Twitter dataset is "the". The second most frequent word, however, differs between datasets (blogs -> "and", news -> "and", Twitter -> "you"), which indicates that even though the most frequent word is the same, the three samples differ in their word usage and total size. The frequency of the most frequent word is also not the same (blogs -> 2068, news -> 1915, Twitter -> 390): even though we use the same number of lines from each source, "the" appears far fewer times in the Twitter sample than in the blogs and news samples. The explanation is that a tweet cannot contain more than 140 characters, while a single blog or news text can be very long.
Now I am going to plot the most frequent words for a single source (news) after removing stop-words, to see how different the wordcloud looks.
```r
# Same analysis for the news sample, this time removing English stop-words
TDMnews1 <- TermDocumentMatrix(corpusnews, control = list(stopwords = TRUE))
matrixb1 <- as.matrix(TDMnews1)
freq_word_news1 <- sort(rowSums(matrixb1), decreasing = TRUE)
df_news1 <- data.frame(word = names(freq_word_news1), freq = freq_word_news1)
wordcloud(df_news1$word, df_news1$freq, random.order = FALSE,
          colors = brewer.pal(8, "Dark2"), min.freq = 10)
head(freq_word_news1, 10)
```
## said will one new two first last year just years
## 265 107 78 71 63 60 60 60 57 56
Now the most frequent word is "said", followed by "will". As we can see, removing stop-words makes a huge difference in the analysis. We nevertheless keep stop-words in our dataset, because they are exactly the kind of word we will often need to predict.
The following bar plot shows the frequencies of the 15 most frequent words in the news dataset after removing stop-words.
```r
library(ggplot2)

# Bar plot of the 15 most frequent words in the news sample (stop-words removed)
first15 <- df_news1[1:15, ]
ggplot(first15, aes(x = word, y = freq)) +
  geom_bar(stat = "identity", fill = "black") +
  geom_text(aes(label = freq), vjust = -0.4) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```
From the last plot we can see that the word "said" is very common in the news dataset, and that the second most frequent word appears less than half as often. We could combine the three datasets into one and generate a single wordcloud, but we do not do so here because the result would be very similar to the previous figures.
Once the cleaning and the exploratory data analysis are done, we will develop the algorithm to be used in our app. We plan the following steps:
1. Use a larger sample of the data, ideally all of it (or even additional data), to generate our term-document matrix.
2. Generate the n-gram tables for our dataset, focusing on 1-grams, 2-grams, 3-grams, and 4-grams. If time allows we will include higher-order n-grams, but we expect 4-grams to be good enough for the final product.
3. Develop an algorithm that takes all the n-gram orders into account. For an unseen text, we look up its final words in the highest-order n-gram table; if no match is found, we back off to the (n-1)-gram table and repeat, down to the lowest order, until we obtain the most probable next word. We still have to decide how to handle the case where the context does not appear in our dataset at all; possibilities include suggesting a frequent word starting with the letters typed so far, or a more sophisticated scheme that estimates probabilities from similar word histories in our dataset. This part is still in progress; a minimal sketch of the back-off idea is given after this list.
4. Check that the model works properly by setting aside a test set, either from separate data or from a held-out fraction of the same dataset. This will provide feedback to improve the algorithm; we may need to go back and refine the cleaning or other steps.
5. Write a web app around our best model. It will predict the next word of an incomplete sentence, recommending the most probable word while also offering the next four most probable candidates.
6. As a final step, allow users to load their own texts from Facebook or Twitter so the model can make more personalized predictions.
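The following is only a minimal sketch of the back-off idea described in step 3, built on plain R tables; the function names, the tokenization, and the final fallback word are illustrative assumptions, not the final implementation:

```r
# Minimal sketch of n-gram counting and a back-off next-word predictor.
# Object and function names are illustrative only.

build_ngrams <- function(lines, n) {
  # Very simple tokenization: lower-case words made of letters and apostrophes
  words <- unlist(strsplit(tolower(lines), "[^a-z']+"))
  words <- words[words != ""]
  if (length(words) < n) return(table(character(0)))
  grams <- vapply(seq_len(length(words) - n + 1),
                  function(i) paste(words[i:(i + n - 1)], collapse = " "),
                  character(1))
  table(grams)
}

predict_next <- function(text, ngram_tables) {
  words <- unlist(strsplit(tolower(text), "[^a-z']+"))
  words <- words[words != ""]
  # ngram_tables[[k]] holds (k+1)-gram counts, so a prefix of k words is needed
  for (k in rev(seq_along(ngram_tables))) {      # try the highest order first
    if (length(words) < k) next
    prefix <- paste(tail(words, k), collapse = " ")
    counts <- ngram_tables[[k]]
    hits <- counts[startsWith(names(counts), paste0(prefix, " "))]
    if (length(hits) > 0) {
      best <- names(sort(hits, decreasing = TRUE))[1]
      return(tail(strsplit(best, " ")[[1]], 1))  # last word of the best n-gram
    }
  }
  "the"  # crude fallback: the overall most frequent word in our samples
}

# Example usage on the blog sample
lines <- readLines("data/en_US/en_US.blogs_Sample.txt")
ngram_tables <- lapply(2:4, function(n) build_ngrams(lines, n))  # 2-, 3-, 4-grams
predict_next("thanks for the", ngram_tables)
```

In the final app, the raw counts would be replaced by a smoothed or discounted model trained on a much larger sample, and the five most probable candidates would be returned rather than a single word.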