Data Science Capstone Project

Introduction

The purpose of this project is to build a model that suggests the next word in a sentence. Such predictive models are widely used in mobile keyboards, search engines, and messaging software. The project uses a large English-language corpus collected from blogs, news articles, and Twitter messages.

The end product will be a predictive text algorithm together with a Shiny app that showcases its features. Before building the predictive engine, it is important to understand the dataset through exploratory data analysis.

Data Collection

The data comes from the SwiftKey dataset provided through Coursera.

There are three different files in the corpus.

Dataset    Lines of data
Blogs      899,288
News       1,010,242
Twitter    2,360,148

In total, the corpus has more than 4 million text records.

Size of Files

Dataset    Size (MB)
Blogs      200
News       196
Twitter    159

Together, the three files take up more than 500 MB (200 + 196 + 159 = 555 MB).

Data Loading and Sampling

Because the corpus is large, a random sample of each data set was drawn for exploratory analysis. Sampling reduces the computation required while still preserving the most common language patterns.
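The sampling step can be sketched as follows. This is a minimal illustration in Python rather than the R code used in the project, and it rests on assumptions: the function name sample_lines, the 5% fraction, and the seed are hypothetical, since the report does not state the sampling rate or method.

```python
import random

def sample_lines(lines, fraction, seed=42):
    """Keep each line independently with probability `fraction`."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return [line for line in lines if rng.random() < fraction]

# Toy stand-in for one corpus file; the real files hold far more lines.
toy_lines = [f"line {i}" for i in range(1000)]
sample = sample_lines(toy_lines, fraction=0.05)  # 5% is an assumed rate
print(len(sample), "of", len(toy_lines), "lines kept")
```

Fixing the seed keeps the exploratory results repeatable between runs, and applying the same fraction to each file keeps the three sources in roughly their original proportions.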

The samples were combined into a single corpus and analyzed with R packages including tm, stringi, RWeka, ggplot2, and dplyr.

Data Cleaning

The following pre-processing techniques were applied:

Lower case conversion.
Punctuation removal.
Numbers removal.
Extra whitespace removal.
Special characters removal.
Profanity filtering.
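The cleaning steps listed above might look like the sketch below. It is written in Python for illustration only (the project itself relies on R packages such as tm), and the function name clean_text and the one-word profanity list are placeholders, because the report does not say which profanity list was used.

```python
import re

# Placeholder list: the report does not name the profanity list that was used.
PROFANITY = {"badword"}

def clean_text(text, banned=PROFANITY):
    text = text.lower()                      # lower case conversion
    text = re.sub(r"['’]", "", text)         # drop apostrophes so "it's" becomes "its"
    text = re.sub(r"[^a-z\s]", " ", text)    # punctuation, numbers, special characters
    words = [w for w in text.split() if w not in banned]  # profanity filtering
    return " ".join(words)                   # split/join collapses extra whitespace

print(clean_text("Hello,   World! It's 2024 & this is a BADWORD test."))
# prints: hello world its this is a test
```

Keeping only the letters a to z is a simplification that suits an English corpus; accented characters would be dropped along with other special characters.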

These pre-processing steps help ensure consistency in the corpus.

Exploratory Data Analysis

Distribution of Word Counts

The data sets vary greatly in terms of sentence length.

Data Set   Maximum Sentence Length
Blogs      40,833
News       11,384
Twitter    140

Twitter texts are naturally short because the platform limited messages to 140 characters, while blogs contain by far the longest texts.

Most Frequent Words

Figure 1: Top 20 most frequent words in the sampled data set.

The most frequent words in the sample are common English words, and together they account for a substantial share of all word occurrences.

Most Frequent Bigrams

Figure 2: Top 20 most frequent bigrams.

Frequent bigrams capture common word order and can be used to predict the next word from the word that precedes it.

Most Frequent Trigrams

Figure 3: Top 20 most frequent trigrams.

Trigrams carry more context than single words or bigrams, so they are expected to give more accurate predictions whenever a match exists.

Interesting Observations

Several observations emerged from the exploratory data analysis:

The Twitter dataset has more records than the other two datasets.
Blogs contain significantly longer texts than news articles and tweets.
Word frequencies follow Zipf's Law: very few words have extremely high frequency, whereas most words have low frequency.
Common phrases are found in all three datasets.
Higher-order n-grams give contextual information that can be helpful in making predictions.

These observations suggest that there is enough structure in language to make predictions.
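One way to make the Zipf observation concrete is to count how many of the most frequent distinct words are needed to cover a given share of all word occurrences. The sketch below is a Python illustration on a toy sentence, not a result from the project; the function name words_needed_for_coverage is hypothetical.

```python
from collections import Counter

def words_needed_for_coverage(tokens, coverage):
    """Smallest number of most-frequent distinct words covering `coverage` of all tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    covered = 0
    for rank, (_, freq) in enumerate(counts.most_common(), start=1):
        covered += freq
        if covered >= coverage * total:
            return rank
    return len(counts)

toy_tokens = "the cat and the dog and the bird saw the fish".split()
print(words_needed_for_coverage(toy_tokens, 0.5))  # 2 of the 7 distinct words
print(words_needed_for_coverage(toy_tokens, 1.0))  # all 7 distinct words
```

On a corpus that follows Zipf's Law, the number needed for 50% coverage is expected to be a small fraction of the vocabulary, which is what makes a compact prediction model plausible.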

Prediction Algorithm Approach

The predictive algorithm will be based on an N-gram language model.

It will involve the following steps:

Step 1: Tokenization

Break the text into individual words.

Step 2: N-gram Generation

Create:

Unigrams
Bigrams
Trigrams
Four-grams

Step 3: Frequency Calculation

Count the frequency of each n-gram.
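Steps 1 to 3 can be sketched together as shown below. This is an illustrative Python version on a toy corpus, assuming the text has already been cleaned; the project itself lists R packages such as RWeka for this work, and the helper names tokenize, ngrams, and count_ngrams are hypothetical.

```python
from collections import Counter

def tokenize(text):
    """Step 1: split cleaned text into word tokens on whitespace."""
    return text.split()

def ngrams(tokens, n):
    """Step 2: return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def count_ngrams(lines, n):
    """Step 3: count n-grams line by line so that none spans two lines."""
    counts = Counter()
    for line in lines:
        counts.update(ngrams(tokenize(line), n))
    return counts

toy_corpus = ["thanks for the follow", "thanks for the help", "one of the best"]
bigram_counts = count_ngrams(toy_corpus, 2)
trigram_counts = count_ngrams(toy_corpus, 3)
print(bigram_counts[("for", "the")])             # 2
print(trigram_counts[("thanks", "for", "the")])  # 2
```

Counting per line rather than over the concatenated corpus avoids creating n-grams that join the end of one text to the start of the next.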

Step 4: Backoff Model

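Predictions will follow this hierarchy:

Look for a four-gram whose first three words match the last three words of the input.
If none is found, look for a trigram that matches the last two words.
If none is found, look for a bigram that matches the last word.
If none is found, fall back to the most frequent unigrams.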
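The backoff hierarchy above can be sketched as a lookup that tries the longest context first. The Python code below is a simplified, frequency-only illustration under stated assumptions: it ranks candidates by raw counts without smoothing or backoff weights, and the names build_model and predict_next are hypothetical rather than part of the planned R implementation.

```python
from collections import Counter, defaultdict

def build_model(lines, max_n=4):
    """Map each context (tuple of preceding words) to a Counter of next words."""
    model = defaultdict(Counter)
    for line in lines:
        tokens = line.split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                *context, word = tokens[i:i + n]
                model[tuple(context)][word] += 1  # empty context holds unigram counts
    return model

def predict_next(model, text, k=3, max_n=4):
    """Try the longest context first, then back off to shorter ones."""
    tokens = text.lower().split()
    for size in range(max_n - 1, -1, -1):  # context sizes 3, 2, 1, 0
        if size > len(tokens):
            continue
        context = tuple(tokens[len(tokens) - size:])
        if context in model:
            return [w for w, _ in model[context].most_common(k)]
    return []

toy_corpus = ["thanks for the follow", "thanks for the follow",
              "thanks for the help", "one of the best"]
model = build_model(toy_corpus)
print(predict_next(model, "thanks for the", k=1))  # ['follow'] from the four-gram table
print(predict_next(model, "one of", k=1))          # ['the'] from the trigram table
print(predict_next(model, "zzz", k=1))             # ['the'] from unigram frequencies
```

Storing every n-gram order in one dictionary keyed by context keeps the lookup to a few hash probes per prediction, which suits the low-latency goal, at the cost of memory that pruning would later need to address.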

Planned Shiny Application

In order to demonstrate the use of the predictive text model, an interactive Shiny application will be developed.

Main components of the application include:

Field for text input.
Next-word prediction in real time.
Several candidate words per prediction.
Low response latency.
Simple and user-friendly interface.

Future Work

Future work includes:

Improving the accuracy of predictions.
Reducing memory consumption.
Decreasing response latency.
Evaluating the model.
Improving the Shiny interface.

Smoothing techniques may improve prediction quality for unseen word sequences, and pruning rare n-grams may reduce memory use.

Conclusion

This report showed that the SwiftKey datasets can be downloaded and explored successfully. The exploratory analysis highlighted key characteristics of the corpus, such as word frequencies, phrase patterns, and differences between the datasets.

Based on the results of the analysis, an N-gram model is a suitable choice for the next-word prediction problem. Further work should focus on implementing the predictive text model and deploying it through a Shiny interface.