This Capstone project is held in collaboration with SwiftKey. Its main goal is to create an algorithm that predicts the next possible word while a text fragment is being typed into an input field, as many people know from their mobile devices. Because these devices have limited storage and RAM, keeping huge databases on the device to predict the next word is not practical, so predictive algorithms are used instead.

This intermediate report provides a short overview and some exploratory results for our training data set. The English texts are used for the exploratory analysis.
Blog, Twitter, and news text files are available for this analysis. The data sets for this project are reasonably large, and reading a whole data set into memory at once may cause problems, so a sample of 5000 lines of text is used for each category (a sketch of this sampling step follows the summary table below).
| Description | Blogs | News | Twitter |
|---|---|---|---|
| Total lines | 899288 | 1010242 | 2360148 |
| Total words | 37334131 | 34372530 | 30373543 |
| File size (MB) | 200.42 | 196.28 | 159.36 |
| Sample lines | 5000 | 5000 | 5000 |
| Sample word count | 205555 | 63747 | 170940 |
| Sample word count (after cleanup) | 104347 | 35947 | 96239 |
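As an illustration, the sampling could be done along the following lines. This is a minimal sketch, not the exact code used for the table above; the file paths and the seed are assumptions.

```r
set.seed(1234)                                     # reproducible sampling (assumed seed)
sample_lines <- function(path, n = 5000) {
  con <- file(path, open = "r", encoding = "UTF-8")
  lines <- readLines(con, skipNul = TRUE)          # read the full file once
  close(con)
  sample(lines, n)                                 # keep a random subset of lines
}

blogs_sample   <- sample_lines("final/en_US/en_US.blogs.txt")
news_sample    <- sample_lines("final/en_US/en_US.news.txt")
twitter_sample <- sample_lines("final/en_US/en_US.twitter.txt")
```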
A corpus is created from the text file of each category, and cleanup operations are performed on it as part of tokenization.

Tokenization breaks the text into words and cleans up the corpus by removing special characters, punctuation, numbers, and extra whitespace; profanity is also removed. A sketch of this cleanup step is shown below.
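The exact cleanup code is not reproduced here; the following is a minimal sketch of how such a corpus could be built and cleaned with the tm package. The function name `build_clean_corpus`, the `blogs_sample` input, and the empty profanity list are assumptions for illustration.

```r
library(tm)

build_clean_corpus <- function(text_lines, profanity_words = character(0)) {
  corpus <- VCorpus(VectorSource(text_lines))
  corpus <- tm_map(corpus, content_transformer(tolower))   # lower-case everything
  corpus <- tm_map(corpus, removePunctuation)              # drop punctuation / special characters
  corpus <- tm_map(corpus, removeNumbers)                  # drop numbers
  corpus <- tm_map(corpus, removeWords, profanity_words)   # profanity filtering (word list assumed)
  corpus <- tm_map(corpus, stripWhitespace)                # collapse extra whitespace
  corpus
}

blogs_corpus <- build_clean_corpus(blogs_sample)
```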
In many cases, words need to be stemmed to retrieve their radicals. For instance, “example” and “examples” are both stemmed to “exampl”. Afterwards, one may want to complete the stems back to their original forms so that the words look “normal” again. Tokenization in this project includes this stemming step, as illustrated below.
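A hedged illustration of the stemming step with the tm and SnowballC packages, assuming the hypothetical `blogs_corpus` object from the sketch above:

```r
library(tm)
library(SnowballC)

stemDocument(c("example", "examples", "predicting"))   # e.g. "exampl" "exampl" "predict"

# stem every document in the corpus
blogs_corpus <- tm_map(blogs_corpus, stemDocument)

# optionally complete a stem back to a readable word, given a dictionary
stemCompletion("exampl", dictionary = c("example", "examples"))
```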
The exploratory analysis includes determining which terms are used most often. It includes word clouds of the 100 highest-frequency terms in each category; a sketch of the frequency calculation is shown below.
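A minimal sketch of how the term frequencies and the word cloud could be produced with the tm and wordcloud packages, again assuming the hypothetical `blogs_corpus` object:

```r
library(tm)
library(wordcloud)

dtm  <- TermDocumentMatrix(blogs_corpus)
freq <- sort(rowSums(as.matrix(dtm)), decreasing = TRUE)   # term frequencies across the sample

head(freq, 10)                                  # most frequent terms
wordcloud(names(freq), freq, max.words = 100)   # word cloud of the top 100 terms
```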
An n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram of size 1 is referred to as a “unigram”, size 2 is a “bigram” (or, less commonly, a “digram”), and size 3 is a “trigram”. Larger sizes are sometimes referred to by the value of n, e.g., “four-gram”, “five-gram”, and so on.
“Memorylessness” means that the probability distribution of the next word depends only on the current word or the previous one to three words, and not on the whole sequence of words that preceded them. This specific kind of memorylessness is called the Markov property.
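A standard way to write this approximation (a general statement of the Markov assumption for an n-gram model, not something specific to this project) is:

$$P(w_i \mid w_1, \dots, w_{i-1}) \approx P(w_i \mid w_{i-n+1}, \dots, w_{i-1})$$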
In this project, n-grams are collected from the text corpus of each category, and bigram, trigram, and four-gram estimates are used, in line with the Markov property. A sketch of the n-gram extraction is shown below.
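A hedged sketch of the n-gram extraction, assuming the RWeka tokenizer together with tm; the object names are the hypothetical ones used in the earlier sketches:

```r
library(tm)
library(RWeka)

bigram_tokenizer  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigram_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

bigram_tdm  <- TermDocumentMatrix(blogs_corpus, control = list(tokenize = bigram_tokenizer))
trigram_tdm <- TermDocumentMatrix(blogs_corpus, control = list(tokenize = trigram_tokenizer))

bigram_freq <- sort(rowSums(as.matrix(bigram_tdm)), decreasing = TRUE)
head(bigram_freq, 10)                           # most frequent bigrams in the sample
```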
While the strategy for modeling and prediction has not been finalized, an n-gram model with a frequency look-up table might be used, based on the analysis above. A possible method of prediction is to use the 4-gram model to find the most likely next word first; if none is found, the 3-gram model is used, and so forth. Furthermore, stemming might also be applied during data preprocessing. A sketch of this back-off look-up is shown below.
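A minimal sketch of such a back-off look-up, assuming named frequency vectors (`fourgram_freq`, `trigram_freq`, `bigram_freq`) whose names are space-separated word sequences; this illustrates the idea and is not the final model:

```r
predict_next_word <- function(phrase, fourgram_freq, trigram_freq, bigram_freq) {
  words <- unlist(strsplit(tolower(phrase), "\\s+"))

  lookup <- function(freq_table, context) {
    # keep only n-grams that start with the given context
    hits <- freq_table[startsWith(names(freq_table), paste0(context, " "))]
    if (length(hits) == 0) return(NULL)
    top <- names(hits)[which.max(hits)]          # most frequent matching n-gram
    tail(unlist(strsplit(top, " ")), 1)          # its last word is the prediction
  }

  n <- length(words)
  result <- NULL
  if (n >= 3) result <- lookup(fourgram_freq, paste(tail(words, 3), collapse = " "))
  if (is.null(result) && n >= 2) result <- lookup(trigram_freq, paste(tail(words, 2), collapse = " "))
  if (is.null(result) && n >= 1) result <- lookup(bigram_freq, tail(words, 1))
  result
}
```

For example, `predict_next_word("thanks for the", ...)` would first search the four-gram table for entries beginning with “thanks for the” and only fall back to shorter n-grams if no match is found.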
For the Shiny app, the plan is to create an app with a simple interface where the user can enter a string of text; the prediction model then suggests the most likely next words. A minimal sketch of such an interface follows.
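This sketch assumes the hypothetical `predict_next_word` function and frequency tables from above are available in the app's environment:

```r
library(shiny)

ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("user_text", "Enter a phrase:"),
  textOutput("suggestion")
)

server <- function(input, output) {
  output$suggestion <- renderText({
    req(input$user_text)                         # wait until the user has typed something
    predict_next_word(input$user_text,
                      fourgram_freq, trigram_freq, bigram_freq)
  })
}

shinyApp(ui = ui, server = server)
```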
The next steps for this project are to finalize the n-gram prediction model described above and to build and deploy the Shiny app.
Natural language processing is a completely new topic for me, so my analysis may have some inconsistencies. However, I enjoyed learning NLP techniques.