The aim of this project is to develop an application to predict the next word that could be typed from words that have been typed previously.
This report looks at the characteristics of the body of text that will be used to develop the model for predicting the word.
To develop a text prediction model, three large files of text written in United States English have been provided. They have been extracted from Twitter, Usenet news (hereafter referred to as news) and blogs.
For this project a word is a string of text that is preceded and/or followed by a space. This therefore includes all the words that we consider normal, as well as numbers and hyphenated expressions such as “so-and-so”.
The data can be downloaded from:
https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
For this report I have only used the files in the en_US directory.
Given that languages evolve and people often use phrases or words from languages other than those they are primarily writing or speaking in, I have not attempted to filter out “non-English words”.
This report considers the number of lines and words in each file, the most frequent words with and without stop words, the extent to which the vocabularies of the three sources overlap, and the numbers of distinct two- and three-word strings.
The number of lines in each of the files of tweets, news and blogs is as follows:
## tweets news blogs
## 2360148 1010242 899288
These files are very big, so trying to parse all the text of each file could take a very long time. I therefore took a sample of about 10 percent of the lines of each file. To choose which lines to sample I simulated tossing a coin that is weighted so that it comes up with the outcome I want in about 10 percent of tosses.
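The sampling could be done along the following lines (a minimal sketch of the weighted coin-toss idea; the file name, the seed and the 10 percent rate are illustrative, not taken from the report's code):

```r
set.seed(1234)

sample_lines <- function(path, prob = 0.1) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  # Keep a line when the weighted "coin toss" comes up 1, which happens
  # with probability `prob`, i.e. for about 10 percent of the lines.
  keep <- rbinom(length(lines), size = 1, prob = prob) == 1
  lines[keep]
}

tweets_sample <- sample_lines("en_US.twitter.txt")
```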
The most frequent words in an English body of text are words such as “as”, “I”, and “is”. These words are known as stop words. Below is a table of the number of lines, number of words and number of words excluding stop words for the samples of tweets, news and blogs:
##                             tweets    news   blogs
## Lines                       236135  101222   90037
## Words                      3012711 3481526 3738632
## Words excluding stop words 1248828 1639348 1248828
Stop words account for about two-thirds of the words in the tweets and blogs, and just over half of the words in the news. The stop_words data set in R (from the tidytext package) contains 1,149 distinct stop words.
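The word counts above could be obtained along these lines (a sketch using the tidytext approach described in the Silge and Robinson book cited at the end; `tweets_sample` is the character vector sampled above and the object names are illustrative):

```r
library(dplyr)
library(tidytext)

tweet_words <- tibble(line = seq_along(tweets_sample), text = tweets_sample) %>%
  unnest_tokens(word, text)               # one row per word

nrow(tweet_words)                         # total number of words in the sample
tweet_words %>%
  anti_join(stop_words, by = "word") %>%  # drop the 1,149 tidytext stop words
  nrow()                                  # number of words excluding stop words
```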
Below are plots of the 20 most common words, excluding stop words, in each of the tweets, news and blogs samples:
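One such plot could be produced roughly as follows (a sketch using ggplot2, continuing from the `tweet_words` table in the sketch above; names are illustrative):

```r
library(dplyr)
library(tidytext)
library(ggplot2)

tweet_words %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE) %>%
  slice_max(n, n = 20) %>%                    # keep the 20 most frequent words
  ggplot(aes(x = reorder(word, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Count", title = "Top 20 non-stop words in the tweets sample")
```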
The number of distinct words in each sample is:
## tweets news blogs
## 95697 96399 96292
The words “people”, “day” and “time” are in the top 20 most common non-stop words in each of the three samples, but no other word occurs in the top 20 of all three files. You might expect the tweets to differ from the news and blogs because abbreviations such as “rt” are common on Twitter.
Given that most of the top 20 non-stop words differ between the tweets, news and blogs, perhaps the occurrence of other words also varies, and there may even be words that occur exclusively in one of the samples. To find out whether the latter is true I created the Venn diagram below. The numbers show the number of distinct words in each category.
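The diagram could be drawn with the VennDiagram package acknowledged at the end of this report, roughly as follows (a sketch; `news_words` and `blog_words` are assumed to have been built in the same way as `tweet_words` above, and the output file name is illustrative):

```r
library(VennDiagram)

venn.diagram(
  x = list(
    tweets = unique(tweet_words$word),
    news   = unique(news_words$word),
    blogs  = unique(blog_words$word)
  ),
  filename  = "word_overlap_venn.png",
  imagetype = "png"
)
```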
The diagram shows that about half of the distinct words in each of the tweets, news and blogs are unique to that category. Therefore taking a sample from all three categories produces a bigger vocabulary for developing a text prediction model.
The number of distinct two-word strings is:
## [1] 2736157
and the number of distinct three-word strings is 6,416,453.
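These counts could be obtained with tidytext's n-gram tokeniser, along the following lines (a sketch; `all_text` is assumed to be a tibble with a `text` column containing the combined tweets, news and blogs samples):

```r
library(dplyr)
library(tidytext)

bigrams  <- all_text %>% unnest_tokens(ngram, text, token = "ngrams", n = 2)
trigrams <- all_text %>% unnest_tokens(ngram, text, token = "ngrams", n = 3)

n_distinct(bigrams$ngram)   # distinct two-word strings
n_distinct(trigrams$ngram)  # distinct three-word strings
```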
To predict the next word in a text I will compare the end of the text provided with six-, five-, four-, three- and two-word strings, the size of the comparison string depending on the length of the text provided. I will find the string whose words, up to but not including the final word, are the same as the last words of the given text and which has the highest probability of occurring, and then take the last word of that string as the predicted word.
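A minimal sketch of this back-off look-up is given below. It assumes an n-gram frequency table `ngram_counts` with columns `prefix` (the string minus its last word), `last_word` and `count`; all names and the function itself are illustrative, not the report's actual code:

```r
library(dplyr)
library(stringr)

predict_next_word <- function(text, ngram_counts, max_prefix = 5) {
  words <- str_split(str_to_lower(str_trim(text)), "\\s+")[[1]]
  # Try the longest available prefix first (up to five words, i.e. a six-word
  # string), then back off to shorter prefixes until a match is found.
  for (k in seq(min(max_prefix, length(words)), 1)) {
    pfx <- paste(tail(words, k), collapse = " ")
    match <- ngram_counts %>%
      filter(prefix == pfx) %>%    # strings whose first words match the text
      arrange(desc(count)) %>%
      slice_head(n = 1)            # the most frequent, i.e. most probable, string
    if (nrow(match) > 0) return(match$last_word)
  }
  NA_character_                    # no match at any prefix length
}
```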
According to the Oxford English Dictionary (see https://en.oxforddictionaries.com/explore/how-many-words-are-there-in-the-english-language) there are about 171,000 words in current use in the English language. From the Venn diagram and the number of distinct stop words, the sample contains 193,604 distinct words. Some of these will be numbers, hyphenated expressions and words that are not in the English language. It would be difficult to find out how many of these there are, but it looks as though the sample will cover nearly all the words likely to appear in typical English text.
The number of possible strings of two words is about 176,000 squared (roughly 3 × 10^10), but not all of these will make grammatical sense. It is most unlikely that any sample will contain all the possible grammatically correct strings of two words, let alone all grammatically correct strings of more than two words.
It took hours to extract the 6,416,453 distinct strings of three words. Therefore, I intend to develop the model using a smaller sample of the text, probably 20% of the sample text, and to test it using a different sample comprising about 10% of the sample text.
To obtain the widest vocabulary it is best to use a sample of each of the tweets, news and blogs files provided.
A sample of about 2% of all the text provided is probably sufficient to cover all the words likely to be needed for predicting text, so I shall use about 2% of the text to train the prediction model.
To predict text I will use the probability of occurrence of strings of words corresponding to the text provided.
Most of this work is based on guidance from Julia Silge and David Robinson, ‘Text Mining with R: A Tidy Approach’, O’Reilly, which you can also read at http://tidytextmining.com/. I thank the authors for their assistance.
I would also like to thank Hanbo Chen and Paul Boutros for the VennDiagram package and the documentation thereof, which you can find at: https://cran.r-project.org/web/packages/VennDiagram/VennDiagram.pdf