Purpose

The aim of this project is to develop an application to predict the next word that could be typed from words that have been typed previously.

This report looks at the characteristics of the body of text that will be used to develop the model for predicting the word.

Background

To develop a text prediction model, three large files of text written in United States English have been provided. They have been extracted from Twitter, Usenet news (hereafter referred to as news) and blogs.

What is a Word?

For this project a word is a string of text that is preceded and/or followed by a space. This therefore includes all the words we consider normal, as well as numbers and hyphenated expressions such as “so-and-so”.

Sources of Text

The source of the data is:

https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

For this report I have only used the files in the en_US directory.

“Non-English Words”

Given that languages evolve and people often use phrases or words from languages other than those they are primarily writing or speaking in, I have not attempted to filter out “non-English words”.

What We Will Look At

This report will consider:

  1. The size of the text files and sampling;
  2. The frequency of common words;
  3. The use of combinations of words; and
  4. An outline of how I will go about text prediction.

The Size of Files and Sampling

The number of lines in each of the files of tweets, news and blogs is as follows:

##  tweets    news   blogs 
## 2360148 1010242  899288
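A minimal sketch, in R, of how these line counts could be obtained; the file paths are assumptions based on the layout of the en_US directory:

files <- c(tweets = "en_US/en_US.twitter.txt",   # assumed paths
           news   = "en_US/en_US.news.txt",
           blogs  = "en_US/en_US.blogs.txt")

# Count the lines in each file (skipNul avoids problems with embedded nulls)
sapply(files, function(f) length(readLines(f, skipNul = TRUE, warn = FALSE)))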

These files are very big, so trying to parse all the text of each file may take a very long time. I therefore took a sample of about 10 percent of the lines of each file. To choose which lines to sample I simulated the tossing of a coin that is weighted so that I get the outcome I want in only about 10 percent of tosses.
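A minimal sketch of this sampling step, assuming the lines of one file have been read into a character vector called lines; rbinom() plays the part of the weighted coin:

set.seed(1234)   # an assumed seed, so the sample is reproducible
keep <- rbinom(length(lines), size = 1, prob = 0.10) == 1
sample_lines <- lines[keep]   # roughly 10 percent of the lines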

Frequency of Words

The most frequent words in an English body of text are words such as “as”, “I”, and “is”. These words are known as stop words. Below is a table of the number of lines, number of words and number of words excluding stop words for the samples of tweets, news and blogs:

##                             tweets    news   blogs
## Lines                       236135  101222   90037
## Words                      3012711 3481526 3738632
## Words excluding stop words 1248828 1639348 1248828

Stop words account for roughly 60 percent of the words in the tweets, about two-thirds of the words in the blogs, and just over half the words in the Usenet news. The stop_words data set provided by the tidytext package in R contains 1149 distinct stop words.
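A minimal sketch of how the word counts in the table above could be produced with the tidytext approach; sample_lines is assumed to hold one of the samples as a character vector:

library(dplyr)
library(tidytext)

# One row per word, converted to lower case by unnest_tokens()
words <- tibble(text = sample_lines) %>%
  unnest_tokens(word, text)

nrow(words)                                        # number of words
nrow(anti_join(words, stop_words, by = "word"))    # words excluding stop words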

Content of Each Sample

Below are plots of the 20 most common words, excluding stop words, in each of the tweets, news and blogs samples:
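A minimal sketch of how one of these plots could be drawn with ggplot2, assuming words is the tokenised sample from the previous sketch:

library(dplyr)
library(ggplot2)
library(tidytext)

# Count non-stop words, keep the 20 most frequent, and plot them
words %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE) %>%
  slice_head(n = 20) %>%
  ggplot(aes(x = reorder(word, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Number of occurrences")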

The number of distinct words in each sample is:

## tweets   news  blogs 
##  95697  96399  96292

The words people, day and time appear in the top 20 most common non-stop words in each of the three samples, but no other words occur in the top 20 of all three files. You might expect the tweets to differ from the news and blogs because abbreviations such as “rt” are common on Twitter.

Given that most of the top 20 words that are not stop words differ between the tweets, news and blogs, perhaps the occurrence of other words also varies, and maybe there are words that occur exclusively in each sample. To find out whether the latter is true I created this Venn diagram. The numbers show the number of distinct words in each category.

The diagram shows that about half of the distinct words in each of the tweets, news and blogs are unique to that category. Therefore taking a sample of all three categories produces a bigger vocabulary for developing a model for text prediction.
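A minimal sketch of how such a diagram could be drawn with the VennDiagram package, assuming tweet_words, news_words and blog_words are character vectors of the distinct words in each sample (the names are illustrative):

library(VennDiagram)

venn <- venn.diagram(
  x = list(Tweets = tweet_words, News = news_words, Blogs = blog_words),
  filename = NULL,                  # return a grid object instead of writing a file
  fill = c("skyblue", "pink", "lightgreen")
)
grid::grid.draw(venn)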

How Words Occur Together

The number of distinct strings of two words is:

## [1] 2736157

and the number of distinct strings of three words is 6416453.
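A minimal sketch of how these n-gram counts could be obtained with tidytext; sample_lines is assumed to hold the combined sample:

library(dplyr)
library(tidytext)

bigrams <- tibble(text = sample_lines) %>%
  unnest_tokens(ngram, text, token = "ngrams", n = 2)
trigrams <- tibble(text = sample_lines) %>%
  unnest_tokens(ngram, text, token = "ngrams", n = 3)

n_distinct(bigrams$ngram)    # distinct two-word strings
n_distinct(trigrams$ngram)   # distinct three-word strings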

Developing a Model for Text Prediction

Method

To predict the next word in a text I will compare the end of the text provided with strings of six, five, four, three and two words, the length of the comparison string depending on the length of the text provided. I will find the string whose words, up to but not including its final word, match the last words of the text provided and which has the highest probability of occurring, and I will take the last word of that string as the predicted word.
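A minimal sketch of this back-off lookup, assuming ngram_counts is a data frame with columns prefix (a string of one to five words), last_word and prob (the names and structure are illustrative):

predict_next_word <- function(text, ngram_counts, max_prefix = 5) {
  words <- unlist(strsplit(tolower(text), "\\s+"))
  # Try the longest prefix available first, then back off to shorter ones
  for (k in seq(min(max_prefix, length(words)), 1)) {
    prefix <- paste(tail(words, k), collapse = " ")
    matches <- ngram_counts[ngram_counts$prefix == prefix, ]
    if (nrow(matches) > 0) {
      return(matches$last_word[which.max(matches$prob)])
    }
  }
  NA_character_   # no matching string at any length
}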

Size of Sample

According to the Oxford English Dictionary there are about 171,000 words in current use in the English language (Ref: https://en.oxforddictionaries.com/explore/how-many-words-are-there-in-the-english-language). From the Venn diagram and the number of distinct stop words, the sample contains 193,604 distinct words. Some of these will be numbers, hyphenated expressions and words that are not English. It would be difficult to find out how many of these there are, but it looks as though the sample will cover nearly all the words likely to appear in typical English text.

The number of possible strings of two words is about 171,000 squared (roughly 29 billion), but not all of these will make grammatical sense. It is most unlikely that any sample will contain all the possible grammatically correct strings of two words, let alone all grammatically correct strings of more than two words.

It took hours to extract the 6416453 distinct strings of three words. I therefore intend to develop the model using a smaller sample of the text, probably 20% of the sample text, and to test it using a different sample comprising about 10% of the sample text.
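A minimal sketch of how each line of the sample could be assigned to the training or testing set in those proportions; sample_lines is again the assumed character vector of sampled lines:

set.seed(2017)   # an assumed seed, so the split is reproducible
split <- sample(c("train", "test", "unused"),
                size = length(sample_lines), replace = TRUE,
                prob = c(0.20, 0.10, 0.70))
train_lines <- sample_lines[split == "train"]
test_lines  <- sample_lines[split == "test"]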

Conclusions

  1. To obtain the widest vocabulary it is best to use a sample of each of the tweets, news and blogs files provided.

  2. A sample of about 2% of all the text provided is probably sufficient to provide all the words likely to be used for predicting text, so I shall use about 2% of the text to train the prediction model.

  3. To predict text I will use the probability of occurrence of strings of words corresponding to the text provided.

Credit

Most of this work is based on guidance from Julia Silge and David Robinson, ‘Text Mining with R: A Tidy Approach’, O’Reilly, which you can also read at: http://tidytextmining.com/ . I thank the authors for their assistance.

I would also like to thank Hanbo Chen and Paul Boutros for the VennDiagram package and the documentation thereof, which you can find at: https://cran.r-project.org/web/packages/VennDiagram/VennDiagram.pdf