This report describes the progress of my work on building a Shiny app that predicts the words most likely to complete a sentence as the user types a meaningful set of words. In it I present an exploratory data analysis of the dataset and my goals for the eventual app and algorithm. Please read the report and provide feedback on my plans.
The dataset can be downloaded from this link.
In this step we explore the dataset and perform basic summaries to get to know it better. The key findings are presented in tabular form below: the number of lines, the total number of characters, the longest line length, and the mean and median number of characters per line.
Object | Lines | Characters | Longest line | Mean chars | Median chars |
---|---|---|---|---|---|
Blog | 899288 | 208361438 | 40835 | 231.7 | 157 |
News | 77259 | 15683765 | 5760 | 203 | 186 |
Twitter | 2360148 | 162384825 | 213 | 68.8 | 64 |
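As a rough illustration, the sketch below shows how the figures in the table can be gathered for the Twitter file; the file name `en_US.twitter.txt` is assumed from the standard layout of the capstone dataset and may differ from the one actually used. The same code applies to the blog and news files.

```r
## Sketch: per-file summary statistics (file name is an assumption)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
chars   <- nchar(twitter)

data.frame(
  lines        = length(twitter),   # number of lines in the file
  characters   = sum(chars),        # total number of characters
  longest_line = max(chars),        # length of the longest line
  mean_chars   = round(mean(chars), 1),
  median_chars = median(chars)
)
```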
In the Twitter data, we can see that even though the maximum character limit of a tweet is 140, the longest line in the data is 213 characters; this is most likely an artifact of how the multi-byte characters in that line are counted rather than a genuinely longer tweet. The longest tweet is shown below.
## [1] "It's time for you to give me a little bit of lovin'ï¼\210ã\201•ã\201\201ã\201¡ã‚‡ã\201£ã\201¨ã\201¯ã\201‚ã\201ªã\201Ÿã\201®æ„›ã‚’ã\201¡ã‚‡ã\201†ã\201 ã\201„)Baby, hold me tight and do what I tell youï¼\201ï¼\210ãƒ\231イビー抱ã\201\215ã\201—ã‚\201ã\201¦ç§\201ã\201Œè¨\200ã\201†ã‚\210ã\201†ã\201«ï¼\201)"
To build the n-gram model, the dataset first has to be cleaned, so the following steps are performed in order:
I did not remove stopwords, because people use them very frequently, and removing them would lead to an app whose predictions do not match user expectations.
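A minimal sketch of a cleaning and tokenization pipeline with quanteda is shown below; the file name, the 10% sampling rate and the exact set of `remove_*` options are illustrative assumptions rather than the final choices.

```r
## Sketch of cleaning and tokenization (file name and sampling rate are assumptions)
library(quanteda)

blogs <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
set.seed(123)
blogs_sample <- sample(blogs, round(length(blogs) * 0.1))  # sample to keep memory manageable

corp <- corpus(blogs_sample)

## tokenize, dropping punctuation, numbers, symbols and URLs
toks <- tokens(corp,
               remove_punct   = TRUE,
               remove_numbers = TRUE,
               remove_symbols = TRUE,
               remove_url     = TRUE)

## lower-case everything; stopwords are deliberately kept (see note above)
toks <- tokens_tolower(toks)
```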
In this step I build the n-gram model and examine the frequencies of the most frequent n-grams using the quanteda package. I first read the files into a corpus, then tokenize it, and then build the n-grams. Below you can see the most frequent n-grams, covering unigrams, bigrams and trigrams.
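A minimal sketch of how these frequencies can be tabulated, continuing from the `toks` object built in the cleaning sketch above; the cut-off of ten features is illustrative.

```r
## Sketch: n-gram frequency tables from the cleaned tokens
library(quanteda)

unigrams <- dfm(toks)
bigrams  <- dfm(tokens_ngrams(toks, n = 2))
trigrams <- dfm(tokens_ngrams(toks, n = 3))

## ten most frequent n-grams of each order
topfeatures(unigrams, 10)
topfeatures(bigrams, 10)
topfeatures(trigrams, 10)
```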
Some of the observed five-grams are very unusual, such as "the magiano little italic boston" or "the santelena hotel venice itali", which I think could come from hashtag campaigns run on Twitter or from news stories covered over a long period. We can also see that certain n-grams contain repeated words, which vary from user to user. Such n-grams are detrimental to building a good word predictor: because they are unique to individual users, learning them from a generalized corpus is not good practice. These special cases should instead be captured in real time from the user's own input.
My goal for the algorithm is that it should predict the five words most likely to be entered next by the user, based on his or her previous entries. If no words have been entered, the most frequent unigrams are displayed; if one word has been entered, the most frequent bigrams are displayed; and this process is repeated for longer inputs, up to five words, as that is the mean number of words in a sentence. To achieve this I will implement the Stupid Backoff mechanism, because it is very inexpensive to compute even on a large corpus. Later I will evaluate the performance of the algorithm using a benchmark tool that many students of this specialization use. Please feel free to give your feedback on the report.
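A minimal sketch of the Stupid Backoff scoring I have in mind is shown below: an observed n-gram is scored by its count relative to the count of its context, and unseen n-grams back off to a shorter history with a fixed penalty of 0.4 (as in Brants et al., 2007). The data layout (a list of named count vectors, with n-gram names joined by "_" as produced by `tokens_ngrams()`), the function name `stupid_backoff` and the toy counts are assumptions for illustration, not the final implementation.

```r
## Sketch of Stupid Backoff scoring over named n-gram count vectors
stupid_backoff <- function(word, history, counts, alpha = 0.4) {
  ## word    : candidate next word
  ## history : character vector of preceding words (most recent last)
  ## counts  : list of named count vectors, counts[[k]] holding k-gram counts
  history <- tail(history, length(counts) - 1)  # cap history at the highest order available
  n <- length(history) + 1
  if (n == 1) {
    ## unigram case: relative frequency of the word
    freq <- counts[[1]][word]
    if (is.na(freq)) return(0)
    return(unname(freq) / sum(counts[[1]]))
  }
  ngram   <- paste(c(history, word), collapse = "_")
  context <- paste(history, collapse = "_")
  num <- counts[[n]][ngram]
  den <- counts[[n - 1]][context]
  if (!is.na(num) && !is.na(den) && den > 0) {
    unname(num / den)
  } else {
    ## unseen n-gram: back off to a shorter history, penalised by alpha
    alpha * stupid_backoff(word, history[-1], counts, alpha)
  }
}

## toy counts for illustration only
counts_list <- list(
  c(thank = 5, you = 7, very = 3),   # unigram counts
  c(thank_you = 4, you_very = 2),    # bigram counts
  c(thank_you_very = 1)              # trigram counts
)
stupid_backoff("you", "thank", counts_list)             # 4 / 5 = 0.8
stupid_backoff("very", c("thank", "you"), counts_list)  # 1 / 4 = 0.25
stupid_backoff("very", c("hey", "you"), counts_list)    # backs off: 0.4 * 2/7
```

In the app, the candidate words with the highest scores for the current history would be the five suggestions shown to the user.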