This is the first milestone report for the Capstone project of the Data Science Specialization. The main goal of the project is to create an R application that can predict the next word given the two previous words. For this application to work, we need to design a predictive text model trained on the given dataset (blogs, Twitter and news). In this report, we perform some exploratory data analysis, try to discover interesting relationships in the data, and build n-gram models from the given dataset. Let's get started.
The dataset provided is fairly large, and with the internet speed available in Nepal it took me almost 8 hours to download. The dataset is available at https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
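A minimal sketch of how the file could be downloaded and extracted directly from R is shown below; the destination file and folder names are just examples, not part of the original workflow.
# Download and extract the dataset (destination names are arbitrary)
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
zipfile <- "Coursera-SwiftKey.zip"
if (!file.exists(zipfile)) {
  download.file(url, destfile = zipfile, mode = "wb")
}
unzip(zipfile, exdir = "dataset")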
First, load the R packages used throughout this report.
# Load the packages used in this report
library(dplyr)
library(stringi)
library(knitr)
library(ggplot2)
library(tm)         # also loads NLP
library(RWeka)
library(doParallel) # attaches foreach, iterators and parallel
The datasets are provided in four languages: English, Finnish, German and Russian. Each language includes text from three sources: news, blogs and Twitter. In this report, we focus only on the English datasets.
# Set extracted file locations
en_blogs<-"/home/tensor/Documents/Tutorials/R/Capestone/dataset/en_US/en_US.blogs.txt"
en_twitter<-"/home/tensor/Documents/Tutorials/R/Capestone/dataset/en_US/en_US.twitter.txt"
en_news<-"/home/tensor/Documents/Tutorials/R/Capestone/dataset/en_US/en_US.news.txt"
Load the datasets
# Load data into R
data_blogs <- readLines(en_blogs, encoding = "UTF-8", skipNul = TRUE)
data_news <- readLines(en_news, encoding = "UTF-8", skipNul = TRUE)
data_twitter <- readLines(en_twitter, encoding = "UTF-8", skipNul = TRUE)
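# Count the number of words in each line of each file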
words_blogs <- stri_count_words(data_blogs)
words_news <- stri_count_words(data_news)
words_twitter <- stri_count_words(data_twitter)
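# Compute file sizes in megabytes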
size_blogs <- file.info(en_blogs)$size/1024^2
size_news <- file.info(en_news)$size/1024^2
size_twitter <- file.info(en_twitter)$size/1024^2
summary_table <- data.frame(filename = c("en_US.blogs.txt","en_US.news.txt","en_US.twitter.txt"),
file_size_MB = c(size_blogs, size_news, size_twitter),
num_lines = c(length(data_blogs),length(data_news),length(data_twitter)),
num_words = c(sum(words_blogs),sum(words_news),sum(words_twitter)),
mean_num_words = c(mean(words_blogs),mean(words_news),mean(words_twitter)))
kable(summary_table)
| filename | file_size_MB | num_lines | num_words | mean_num_words |
|---|---|---|---|---|
| en_US.blogs.txt | 200.4242 | 899288 | 37546246 | 41.75108 |
| en_US.news.txt | 196.2775 | 1010242 | 34762395 | 34.40997 |
| en_US.twitter.txt | 159.3641 | 2360148 | 30093410 | 12.75065 |
Before proceeding to build models from the given datasets, we need to clean the data. We will remove special characters, extra whitespace, URLs and other unimportant characters from the text. For this report, we work with a small sample of the data; the full dataset will be used in the later stages of the project.
# Sample the data
set.seed(420)
data.sample <- c(sample(data_blogs, length(data_blogs) * 0.01),
sample(data_news, length(data_news) * 0.01),
sample(data_twitter, length(data_twitter) * 0.01))
# Create corpus and clean the data
corpus <- VCorpus(VectorSource(data.sample))
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
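# Replace URLs and Twitter handles with spaces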
corpus <- tm_map(corpus, toSpace, "(f|ht)tp(s?)://(.*)[.][a-z]+")
corpus <- tm_map(corpus, toSpace, "@[^\\s]+")
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("en"))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
# prepare the word n-gram data
my_corpus <- data.frame(text = unlist(sapply(corpus, `[`, "content")),
                        stringsAsFactors = FALSE)
findNGrams <- function(corp, grams, top) {
ngram <- NGramTokenizer(corp, Weka_control(min = grams, max = grams,
delimiters = " \\r\\n\\t.,;:\"()?!"))
ngram <- data.frame(table(ngram))
ngram <- ngram[order(ngram$Freq, decreasing = TRUE),][1:top,]
colnames(ngram) <- c("Words","Count")
ngram
}
mono_grams <- findNGrams(my_corpus, 1, 10)
bi_grams <- findNGrams(my_corpus, 2, 10)
tri_grams <- findNGrams(my_corpus, 3, 10)
quad_grams <- findNGrams(my_corpus, 4, 10)
Now that we have built four different n-gram models, it is helpful to look at the most frequently occurring grams in each of them.
makePlot <- function(data, label) {
ggplot(data[1:10,], aes(reorder(Words, -Count), Count)) +
labs(x = label, y = "Frequency") +
theme(axis.text.x = element_text(angle = 60, size = 12, hjust = 1)) +
geom_bar(stat = "identity", fill = I("red"))
}
makePlot(mono_grams, "10 Most Common Mono-grams")
makePlot(bi_grams, "10 Most Common Bi-grams")
makePlot(tri_grams, "10 Most Common Tri-grams")
makePlot(quad_grams, "10 Most Common Quad-grams")
So far we have performed exploratory data analysis on the given datasets. The next challenge is to build a predictive model, evaluate it, and build a user-friendly UI in Shiny.
Our predictive algorithm will use an n-gram model with frequency lookup, combined with logistic regression. One possible strategy is to use the trigram model to predict the next word; if no matching trigram can be found, the algorithm would back off to logistic regression to predict the word.
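As a rough illustration of the frequency-lookup part of this plan, the sketch below uses the trigram table built earlier (tri_grams) to find the most frequent completion of a two-word prefix. The function name predict_next is hypothetical, the back-off step is only indicated by a comment, and with the small top-10 table above the lookup will often return NA; the real model would use full frequency tables.
# Hypothetical sketch of a trigram frequency lookup (not the final model)
predict_next <- function(w1, w2, trigrams) {
  prefix <- paste(w1, w2)
  # keep trigrams whose first two words match the input prefix
  matches <- trigrams[startsWith(as.character(trigrams$Words), paste0(prefix, " ")), ]
  if (nrow(matches) == 0) {
    return(NA)  # no matching trigram: back off as described above
  }
  best <- as.character(matches$Words[which.max(matches$Count)])
  # return only the last word of the most frequent matching trigram
  tail(strsplit(best, " ")[[1]], 1)
}
predict_next("happy", "new", tri_grams)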
The user interface of the Shiny app will consist of a text input box that allows the user to enter a phrase. After a short delay, the app will use our algorithm to suggest the most likely next word. We also plan to let the user configure how many suggestions the app displays.
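A bare-bones sketch of what that interface could look like is shown below; the prediction function next_words() is a placeholder for the algorithm described above, and the widget names are assumptions rather than the final design.
library(shiny)

# Minimal UI sketch; next_words() is a placeholder for the prediction algorithm
ui <- fluidPage(
  textInput("phrase", "Enter a phrase:"),
  sliderInput("n", "Number of suggestions:", min = 1, max = 5, value = 3),
  textOutput("suggestions")
)

server <- function(input, output) {
  output$suggestions <- renderText({
    # next_words() would return the most likely next words (placeholder)
    paste(next_words(input$phrase, input$n), collapse = ", ")
  })
}

shinyApp(ui = ui, server = server)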