This report describes the progress of my work on building a Shiny app that predicts the words most likely to complete a sentence as the user types a meaningful set of words. In it I present an exploratory data analysis of the dataset and my goals for the eventual app and algorithm. Please read the report and provide feedback on my plans.
The dataset can be downloaded from this link.
In this step we explore the dataset and perform basic summaries to get to know it better. The key findings are presented in tabular form below: the number of lines, the total number of characters, the longest line length, and the mean and median number of characters per line.
Object | Lines | Characters | Longest line | Mean chars | Median chars |
---|---|---|---|---|---|
Blog | 899288 | 208361438 | 40835 | 231.7 | 157 |
News | 77259 | 15683765 | 5760 | 203 | 186 |
Twitter | 2360148 | 162384825 | 213 | 68.8 | 64 |
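As a rough illustration, the sketch below shows how the figures in the table can be gathered for the Twitter file; the file name `en_US.twitter.txt` is assumed from the standard layout of the capstone dataset and may differ from the one actually used. The same code applies to the blog and news files.

```r
## Sketch: per-file summary statistics (file name is an assumption)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
chars   <- nchar(twitter)

data.frame(
  lines        = length(twitter),   # number of lines in the file
  characters   = sum(chars),        # total number of characters
  longest_line = max(chars),        # length of the longest line
  mean_chars   = round(mean(chars), 1),
  median_chars = median(chars)
)
```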
In the Twitter data, we can see that even though the maximum character limit of a tweet is 140, the longest line in the data is 213 characters; this is most likely an artifact of how the multi-byte characters in that line are counted rather than a genuinely longer tweet. The longest tweet is shown below.
## [1] "It's time for you to give me a little bit of lovin'ï¼\210ã\201•ã\201\201ã\201¡ã‚‡ã\201£ã\201¨ã\201¯ã\201‚ã\201ªã\201Ÿã\201®æ„›ã‚’ã\201¡ã‚‡ã\201†ã\201 ã\201„)Baby, hold me tight and do what I tell youï¼\201ï¼\210ãƒ\231イビー抱ã\201\215ã\201—ã‚\201ã\201¦ç§\201ã\201Œè¨\200ã\201†ã‚\210ã\201†ã\201«ï¼\201)"
To build the n-gram model, the dataset first has to be cleaned, so the following steps are performed in order:
I did not remove stopwords, because people use them very frequently, and removing them would lead to an app whose predictions do not match user expectations.
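A minimal sketch of a cleaning and tokenization pipeline with quanteda is shown below; the file name, the 10% sampling rate and the exact set of `remove_*` options are illustrative assumptions rather than the final choices.

```r
## Sketch of cleaning and tokenization (file name and sampling rate are assumptions)
library(quanteda)

blogs <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
set.seed(123)
blogs_sample <- sample(blogs, round(length(blogs) * 0.1))  # sample to keep memory manageable

corp <- corpus(blogs_sample)

## tokenize, dropping punctuation, numbers, symbols and URLs
toks <- tokens(corp,
               remove_punct   = TRUE,
               remove_numbers = TRUE,
               remove_symbols = TRUE,
               remove_url     = TRUE)

## lower-case everything; stopwords are deliberately kept (see note above)
toks <- tokens_tolower(toks)
```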
In this step I build the n-gram model and examine the frequencies of the most frequent n-grams using the quanteda package. I first read the files into a corpus, then tokenize it, and then build the n-grams. Below you can see the most frequent n-grams, covering unigrams, bigrams and trigrams.
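A minimal sketch of how these frequencies can be tabulated, continuing from the `toks` object built in the cleaning sketch above; the cut-off of ten features is illustrative.

```r
## Sketch: n-gram frequency tables from the cleaned tokens
library(quanteda)

unigrams <- dfm(toks)
bigrams  <- dfm(tokens_ngrams(toks, n = 2))
trigrams <- dfm(tokens_ngrams(toks, n = 3))

## ten most frequent n-grams of each order
topfeatures(unigrams, 10)
topfeatures(bigrams, 10)
topfeatures(trigrams, 10)
```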
Some of the observed five-grams are very unusual, such as "the magiano little italic boston" or "the santelena hotel venice itali", which I think could come from hashtag campaigns run on Twitter or from news stories covered over a long period. We can also see that certain n-grams contain repeated words, which vary from user to user. Such n-grams are detrimental to building a good word predictor: because they are unique to individual users, learning them from a generalized corpus is not good practice. These special cases should instead be captured in real time from the user's own input.
My goal for the algorithm is that it should predict the five words most likely to be entered next by the user, based on his or her previous entries. If no words have been entered, the most frequent unigrams are displayed; if one word has been entered, the most frequent bigrams are displayed; and this process is repeated for longer inputs, up to five words, as that is the mean number of words in a sentence. To achieve this I will implement the Stupid Backoff mechanism, because it is very inexpensive to compute even on a large corpus. Later I will evaluate the performance of the algorithm using a benchmark tool that many students of this specialization use. Please feel free to give your feedback on the report.
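A minimal sketch of the Stupid Backoff scoring I have in mind is shown below: an observed n-gram is scored by its count relative to the count of its context, and unseen n-grams back off to a shorter history with a fixed penalty of 0.4 (as in Brants et al., 2007). The data layout (a list of named count vectors, with n-gram names joined by "_" as produced by `tokens_ngrams()`), the function name `stupid_backoff` and the toy counts are assumptions for illustration, not the final implementation.

```r
## Sketch of Stupid Backoff scoring over named n-gram count vectors
stupid_backoff <- function(word, history, counts, alpha = 0.4) {
  ## word    : candidate next word
  ## history : character vector of preceding words (most recent last)
  ## counts  : list of named count vectors, counts[[k]] holding k-gram counts
  history <- tail(history, length(counts) - 1)  # cap history at the highest order available
  n <- length(history) + 1
  if (n == 1) {
    ## unigram case: relative frequency of the word
    freq <- counts[[1]][word]
    if (is.na(freq)) return(0)
    return(unname(freq) / sum(counts[[1]]))
  }
  ngram   <- paste(c(history, word), collapse = "_")
  context <- paste(history, collapse = "_")
  num <- counts[[n]][ngram]
  den <- counts[[n - 1]][context]
  if (!is.na(num) && !is.na(den) && den > 0) {
    unname(num / den)
  } else {
    ## unseen n-gram: back off to a shorter history, penalised by alpha
    alpha * stupid_backoff(word, history[-1], counts, alpha)
  }
}

## toy counts for illustration only
counts_list <- list(
  c(thank = 5, you = 7, very = 3),   # unigram counts
  c(thank_you = 4, you_very = 2),    # bigram counts
  c(thank_you_very = 1)              # trigram counts
)
stupid_backoff("you", "thank", counts_list)             # 4 / 5 = 0.8
stupid_backoff("very", c("thank", "you"), counts_list)  # 1 / 4 = 0.25
stupid_backoff("very", c("hey", "you"), counts_list)    # backs off: 0.4 * 2/7
```

In the app, the candidate words with the highest scores for the current history would be the five suggestions shown to the user.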