Coursera Data Science Capstone Project Milestone Report

Introduction

Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. But typing on mobile devices can be a serious pain. SwiftKey, the corporate partner in this capstone, builds a smart keyboard that makes it easier for people to type on their mobile devices. One cornerstone of their smart keyboard is predictive text models.

In this capstone I will work on understanding and building predictive text models like those used by SwiftKey. This project will start with the basics, analyzing a large corpus of text documents to discover the structure in the data and how words are put together. It will cover cleaning and analyzing text data, then building and sampling from a predictive text model. Finally, I will use the knowledge gained in data products to build a predictive text product.

Set Environment and Load Data

library(NLP)
library(tm)
library(stringi)
library(dplyr)
library(ggplot2)
library(textcat)
library(RWeka)
library(wordcloud)

Load data:

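# Read each file through its own connection and close it afterwards.
# Binary mode ("rb") avoids a premature end-of-file in text mode on Windows
# when a file contains an embedded control character.
readAllLines <- function(path) {
        con <- file(path, "rb")
        on.exit(close(con))
        readLines(con, encoding = "UTF-8", skipNul = TRUE)
}
lineBlogs <- readAllLines("./Data/Coursera-SwiftKey/final/en_US/en_US.blogs.txt")
lineNews <- readAllLines("./Data/Coursera-SwiftKey/final/en_US/en_US.news.txt")
lineTwitter <- readAllLines("./Data/Coursera-SwiftKey/final/en_US/en_US.twitter.txt")
lineAll <- c(lineBlogs, lineNews, lineTwitter)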

Exploratory Data Analysis

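Check the size of the three files (MB), the number of lines read from each, and the maximum line length (words). Note that the News line count is far lower than the others despite a similar file size, so that file may have been read only partially (a text-mode read on Windows can stop early at an embedded control character).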

  fileName fileSize_MB linesNumber_lines maxLinesLength_words
1    Blogs    200.4242            899288                40835
2     News    196.2775             77259                 5760
3  Twitter    159.3641           2360148                  213
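
The code that produced this table is not shown. As a minimal sketch, the same three statistics could be computed per file roughly as follows. The helper name fileStats, the whitespace-based word count, and the small temporary demo file are assumptions made for illustration, not the original method.

```r
# Hypothetical helper: summary statistics for one text file.
fileStats <- function(path, label) {
        lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
        # Assumption: a "word" is any run of non-whitespace characters
        wordsPerLine <- lengths(strsplit(trimws(lines), "\\s+"))
        data.frame(
                fileName = label,
                fileSize_MB = file.info(path)$size / 1024^2,
                linesNumber_lines = length(lines),
                maxLinesLength_words = max(wordsPerLine),
                stringsAsFactors = FALSE
        )
}

# Demo on a small temporary file so the sketch runs without the course data
demoPath <- tempfile(fileext = ".txt")
writeLines(c("one two three", "four five", "six"), demoPath)
demoStats <- fileStats(demoPath, "Demo")
print(demoStats)
```

Calling fileStats once per corpus file and combining the rows with rbind would give a table with the same columns as the one above.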

Data Cleaning

Sampling: keep roughly 10% of the lines, chosen at random with a fixed seed so the sample is reproducible:

set.seed(314159)
lenAll <- length(lineAll)
index <- rbinom(n=lenAll,size=1,prob=0.1) %>%
        as.logical()
lineSub <- lineAll[index]
rm(lineAll)
corpus <- VCorpus(VectorSource(lineSub))

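Clean the corpus: remove numbers and punctuation, convert to lower case, and strip extra whitespace. A bad-word list is also loaded, but the step that removes those words is left commented out in the code below: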

# Get bad words
badWords <- read.csv("./Data/BadWords.csv", stringsAsFactors = FALSE)
badWords <- as.character(badWords[,1])
# Clean Data
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation, preserve_intra_word_dashes = TRUE)
corpus <- tm_map(corpus, content_transformer(tolower))
#corpus <- tm_map(corpus, removeWords, badWords)
corpus <- tm_map(corpus, stripWhitespace)

Then convert the corpus to a data frame for later use. At the same time, remove the lines that are not identified as English.

lineInf <- data.frame(text = sapply(corpus, as.character), stringsAsFactors = FALSE)
lineInf$language <- textcat(lineInf$text)
lineInf <- filter(lineInf, language == "english") %>%
        select(text)

Tokenization

Now that the corpus has been processed, use the RWeka package to tokenize it. To save time, wrap the tokenization in a function. Here N ranges from 1 to 4, giving unigrams, bigrams, trigrams, and quadgrams.

# Sample first
lenLine <- length(lineInf[,1])
index <- rbinom(n=lenLine, size = 1, prob = 0.5) %>%
        as.logical()
lineSample <- lineInf[index,]
# Construct function
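NGT <- function(text, numControl){
        # Tokenize into n-grams of exactly numControl words
        NGTProcess <- NGramTokenizer(text, 
                        control = Weka_control(min=numControl, max=numControl))
        # Count each distinct n-gram and sort by descending frequency
        NGTProcess <- data.frame(table(NGTProcess))
        NGTProcess <- NGTProcess[order(NGTProcess$Freq, decreasing = TRUE),]
        colnames(NGTProcess) <- c("String","Freq")
        # Keep at most the 1000 most frequent n-grams
        NGTProcess <- head(NGTProcess, 1000)
        NGTProcess$String <- as.character(NGTProcess$String)
        # Return the data frame explicitly
        NGTProcess
}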
# Tokenize
NGT1 <- NGT(lineSample, 1)
NGT2 <- NGT(lineSample, 2)
NGT3 <- NGT(lineSample, 3)
NGT4 <- NGT(lineSample, 4)

As an example, here are the most frequent unigrams:

head(NGT1)

      String   Freq
97815    the 132257
99361     to  88587
6869     and  75250
44         a  67606
69576     of  62705
48165      i  57410
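
To make the counting step concrete, here is a minimal base-R sketch of the same idea that runs without RWeka or Java. It is not the tokenizer used above: the function name countNGrams and the plain whitespace split are assumptions for illustration, and RWeka's NGramTokenizer may split on additional delimiters.

```r
# Hypothetical helper: count n-grams of a fixed size n in a character vector.
countNGrams <- function(lines, n, top = 1000) {
        tokens <- strsplit(lines, "\\s+")
        grams <- unlist(lapply(tokens, function(w) {
                w <- w[nzchar(w)]
                # Lines shorter than n words contribute no n-grams
                if (length(w) < n) return(character(0))
                starts <- seq_len(length(w) - n + 1)
                vapply(starts,
                       function(i) paste(w[i:(i + n - 1)], collapse = " "),
                       character(1))
        }))
        # Frequency table, most frequent first, in the String/Freq layout
        counts <- sort(table(grams), decreasing = TRUE)
        result <- data.frame(String = names(counts),
                             Freq = as.integer(counts),
                             stringsAsFactors = FALSE)
        head(result, top)
}

# Made-up example lines, already lower-cased and without punctuation
demoLines <- c("thanks for the follow",
               "thanks for the help",
               "see you at the game")
demoUnigrams <- countNGrams(demoLines, 1)
demoBigrams <- countNGrams(demoLines, 2)
print(head(demoUnigrams, 3))
print(head(demoBigrams, 3))
```

Each line is tokenized separately, so no n-gram spans a line boundary.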

Plot N-Gram Frequency

Plot the frequency of the top unigrams, bigrams, trigrams, and quadgrams.

Also the word cloud:
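
The plotting code is not included in this text. As a rough sketch, a frequency bar chart for one N-gram table could be drawn with ggplot2 along these lines. The function name plotNGram and its topN parameter are hypothetical; the demo table reuses the six unigram counts shown earlier in place of the full NGT1 table.

```r
library(ggplot2)

# Demo data: the six most frequent unigrams reported above
topUnigrams <- data.frame(
        String = c("the", "to", "and", "a", "of", "i"),
        Freq = c(132257, 88587, 75250, 67606, 62705, 57410),
        stringsAsFactors = FALSE
)

# Hypothetical helper: horizontal bar chart of the topN most frequent n-grams
plotNGram <- function(ngramTable, title, topN = 20) {
        top <- head(ngramTable, topN)
        ggplot(top, aes(x = reorder(String, Freq), y = Freq)) +
                geom_col() +
                coord_flip() +
                labs(title = title, x = "N-gram", y = "Frequency")
}

unigramPlot <- plotNGram(topUnigrams, "Most frequent unigrams")
print(unigramPlot)
```

The same table could feed the word cloud, for example with wordcloud(words = topUnigrams$String, freq = topUnigrams$Freq) from the wordcloud package loaded at the start.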

Further Planning

Now that the N-gram tables (N from 1 to 4) are built, the next step is to build the prediction algorithm and develop the data product. The idea for the algorithm is to look up the user's input in the N-gram tables in order of decreasing N. For example, search the quadgrams first and store the matches. If the results are not satisfactory, search the trigrams, and so on. Finally, compare the frequencies of the candidates and make the final decision. For the data product, the plan is to use the Shiny package in R to develop an interactive web application.
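
One possible shape for this lookup is sketched below. It is a simplified back-off, not a finished design: it returns the most frequent continuation from the highest-order table that has a match, rather than comparing candidates across orders. The function name predictNext and the toy tables (with made-up counts) are assumptions for illustration; the real tables would be NGT1 to NGT4.

```r
# Toy tables in the String/Freq layout, ordered unigram to quadgram.
# The entries and counts are invented for this example.
ngramTables <- list(
        data.frame(String = c("the", "to", "and"),
                   Freq = c(30, 20, 10), stringsAsFactors = FALSE),
        data.frame(String = c("of the", "thank you", "of course"),
                   Freq = c(12, 9, 4), stringsAsFactors = FALSE),
        data.frame(String = c("one of the", "thanks for the"),
                   Freq = c(6, 5), stringsAsFactors = FALSE),
        data.frame(String = c("thanks for the follow", "thanks for the help"),
                   Freq = c(3, 2), stringsAsFactors = FALSE)
)

# Hypothetical helper: predict the next word by backing off to shorter contexts.
predictNext <- function(input, tables) {
        words <- strsplit(tolower(trimws(input)), "\\s+")[[1]]
        words <- words[nzchar(words)]
        # Highest usable order: an n-gram needs n - 1 words of context
        maxOrder <- min(length(tables), length(words) + 1)
        if (maxOrder >= 2) {
                for (n in maxOrder:2) {
                        prefix <- paste0(paste(tail(words, n - 1), collapse = " "), " ")
                        tbl <- tables[[n]]
                        hits <- tbl[startsWith(tbl$String, prefix), ]
                        if (nrow(hits) > 0) {
                                best <- hits$String[which.max(hits$Freq)]
                                # Drop the context, keep only the predicted word
                                return(substring(best, nchar(prefix) + 1))
                        }
                }
        }
        # No context matched: fall back to the most frequent unigram
        tables[[1]]$String[which.max(tables[[1]]$Freq)]
}

print(predictNext("thanks for the", ngramTables))
print(predictNext("because of", ngramTables))
```

In the second call no trigram starts with "because of", so the lookup backs off to the bigrams that start with "of".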