Introduction

The Capstone Project involves building a web application with a predictive text system that provides a list of possible words based on the previous word(s) entered.

This project draws on concepts from Natural Language Processing (NLP), in particular n-grams.

N-gram

An n-gram is a contiguous sequence of n items from a given sequence of text or speech.

Using the sentence below as an example:

A quick brown fox jumps over the fence.

The associated 2-grams are: “A quick”, “quick brown”, “brown fox”, “fox jumps”, “jumps over”, “over the”, “the fence”. By extending this idea, the 3-grams are: “A quick brown”, “quick brown fox”, “brown fox jumps”, “fox jumps over”, “jumps over the”, “over the fence”.
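
As a simple illustration, the n-grams above can be generated with a few lines of base R (a minimal sketch for this example sentence, not the exact code used later in this report):

# Split the example sentence into words, dropping punctuation
sentence = "A quick brown fox jumps over the fence."
words = strsplit(gsub("[[:punct:]]", "", sentence), "\\s+")[[1]]

# All contiguous sequences of n words
ngrams = function(words, n) {
  sapply(seq_len(length(words) - n + 1),
         function(i) paste(words[i:(i + n - 1)], collapse = " "))
}

ngrams(words, 2)  # "A quick", "quick brown", ..., "the fence"
ngrams(words, 3)  # "A quick brown", ..., "over the fence"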

The application that we are building relies extensively on n-grams, since the word we are trying to predict depends on the word(s) that precede it.

Data Source

The data for this project is obtained from Coursera; the original source is HC Corpora.

We are using the English (en_US) files for this project.

Removing profanity

The list of profanity is based on Google’s “What Do You Love” list.

Lines containing any of the listed words are excluded from all subsequent analysis.
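
A sketch of how this filtering could be done is shown below; the file names and profanity_list.txt are placeholders rather than the exact files used in this project:

# Hypothetical word list; the actual list comes from Google's "What Do You Love"
profanity_words = readLines("profanity_list.txt")

# Match whole words only, case-insensitively, and drop any line that matches
pattern = paste0("\\b(", paste(profanity_words, collapse = "|"), ")\\b")
lines = readLines("en_US.blogs.txt")
cleaned = lines[!grepl(pattern, lines, ignore.case = TRUE)]
writeLines(cleaned, "en_US.blogs.no_profanity.txt")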

Summary of data

The table below shows some basic statistics on the files that are used in the analysis. Note that lines containing profanities have been removed.

setwd("../downloaded_data/cleaned")
en_files_full = list.files(getwd(), pattern = "no_profanity")
file_sizes = lapply(en_files_full, function(f) file.size(f)/1024/1024 )
lines_count = lapply(en_files_full, function(f) length(readLines(f)) )
words_count = lapply(en_files_full, function(f) sum(nchar(readLines(f))) )
files_stat = cbind(file_sizes, lines_count, words_count)
colnames(files_stat) = c("Size (MB)", "Lines", "Words")
rownames(files_stat) = en_files_full
files_stat
##                                Size (MB) Lines   Words    
## en_US.blogs.no_profanity.txt   183.6844  862601  190825135
## en_US.news.no_profanity.txt    193.3121  1001449 200649356
## en_US.twitter.no_profanity.txt 151.8963  2260065 154625097

Analysis

As each text file is very large, we take a 10% sample of the lines in each file. For our analysis, we present the most frequent unigrams, their coverage, and the 2-gram and 3-gram frequencies.
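
The sampling itself can be done along the following lines (a minimal sketch; the seed and the helper name are arbitrary choices, not the exact code used here):

set.seed(1234)  # for reproducibility of the sample

# Keep a random 10% of the lines of a file
sample_lines = function(path, fraction = 0.1) {
  lines = readLines(path)
  lines[sample(length(lines), size = floor(fraction * length(lines)))]
}

blogs_sample = sample_lines("en_US.blogs.no_profanity.txt")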

Unigram

Frequency

Coverage

The Frequency and Coverage figures above show that 50% coverage is reached with only 0.33% of the unique unigrams, and 90% coverage with 17.11% of them. In other words, a small number of words, largely the stopwords seen in the Frequency figure, appear very often in the corpus, so a very small percentage of the vocabulary yields high coverage: about 9772 words cover 90% of all word occurrences. This gives us a good estimate of how much data we will need to build the final model.
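
For reference, coverage figures like the ones above can be derived from a table of unigram counts roughly as follows (a sketch; unigram_freq is assumed to be a named vector of word counts built from the sampled text, e.g. with table()):

# Sort words from most to least frequent and accumulate their share of all tokens
unigram_freq = sort(unigram_freq, decreasing = TRUE)
coverage = cumsum(unigram_freq) / sum(unigram_freq)

# Fraction of unique unigrams needed for 50% and 90% coverage
which(coverage >= 0.5)[1] / length(coverage)
which(coverage >= 0.9)[1] / length(coverage)

# Number of distinct words needed to cover 90% of the corpus
which(coverage >= 0.9)[1]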

2-gram and 3-gram Frequencies

The graphs above show the most common 2-grams and 3-grams. As expected, stopwords top the charts. However, they cannot be ignored when building a system that predicts the next word, so they are taken into account in the model development.
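
A bare-bones way of producing such frequency tables, reusing the ngrams() helper and blogs_sample from the sketches above, could look like this (an illustration, not the exact code behind the graphs):

# Tokenise each sampled line, then count its 2-grams
tokens = strsplit(tolower(gsub("[[:punct:]]", "", blogs_sample)), "\\s+")
bigrams = unlist(lapply(tokens, function(w) if (length(w) >= 2) ngrams(w, 2)))
bigram_freq = sort(table(bigrams), decreasing = TRUE)
head(bigram_freq, 20)  # the most frequent 2-grams, dominated by stopwords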

Conclusion

We have completed a good initial exploratory analysis of the corpus provided. It gives us the knowledge and understanding needed to proceed with developing the model for the Shiny app.

Plans for building the application

  1. Develop associative relationships between words (excluding stopwords); some pairs of words often appear together in a sentence. This will allow the model to predict the next word based on words that are further away (three or more words before the word to be predicted). A rough sketch of such pair counting follows this list.
  2. Shrink the size of the model so that it fits into the Shiny app.
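
The pair counting mentioned in the first item could start from something like the sketch below, where stopwords_en is assumed to be a character vector of English stopwords; this illustrates the idea rather than the final implementation:

# Count how often two non-stopword words appear together in the same sentence
count_word_pairs = function(sentences, stopwords_en) {
  pairs = lapply(sentences, function(s) {
    w = strsplit(tolower(gsub("[[:punct:]]", "", s)), "\\s+")[[1]]
    w = setdiff(unique(w), c(stopwords_en, ""))
    if (length(w) >= 2) apply(combn(sort(w), 2), 2, paste, collapse = " ")
  })
  sort(table(unlist(pairs)), decreasing = TRUE)
}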