The capstone project for the Data Science specialization is to build a predictive text model and deliver it as a Shiny app that suggests the next word as the user types a sentence. A familiar everyday example is the smartphone keyboard, many of which use SwiftKey technology.
There are three data sets for this project, obtained from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip.
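The archive can be fetched and unpacked directly from R; here is a small sketch (the extracted folder layout, final/en_US/, is an assumption about the archive's contents):

# fetch and unpack the SwiftKey archive (run once; the zip file is large)
zipfile <- "Coursera-SwiftKey.zip"
if (!file.exists(zipfile)) {
  download.file(
    "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
    destfile = zipfile, mode = "wb")
  unzip(zipfile)
}
# the English files are then expected under final/en_US/
list.files("final/en_US")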
For this milestone report, the English versions of the three data sets were used. The following libraries were used:
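The library chunk itself is not reproduced here; judging from the functions called below, a plausible set is the following (qdap is assumed as the source of sent_detect, and ggplot2 is only needed if the frequency plots are redrawn):

library(tm)       # Corpus, tm_map, and the standard text transformations
library(RWeka)    # NGramTokenizer and Weka_control
library(qdap)     # sent_detect, for splitting paragraphs into sentences
library(ggplot2)  # bar charts of the n-gram frequencies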
Here is a summary of the data:
| File name | Size in bytes | Line count | Word count | Words per line |
|---|---|---|---|---|
| blogs | 210160014 | 899288 | 37546246 | 41 |
| news | 205811889 | 1010242 | 34762395 | 34 |
| twitter | 167105338 | 2360148 | 30093369 | 12 |
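For reference, here is a sketch of how these figures could be computed; words are counted by splitting on whitespace, so exact counts may differ slightly depending on the counting rule:

# summarize one raw file: size, line count, word count, words per line
summarize_file <- function(path) {
  lines <- readLines(path, skipNul = TRUE)
  words <- sum(sapply(strsplit(lines, "\\s+"), length))
  data.frame(file = basename(path),
             size_bytes = file.size(path),
             lines = length(lines),
             words = words,
             words_per_line = round(words / length(lines)))
}
summarize_file("en_US.blogs.txt")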
The raw data sets are large, which makes processing slow, so the remainder of this report works with a subset of the data: the first 2,000 lines of each file.
con1 <- file("en_US.blogs.txt", "r")
con2 <- file("en_US.news.txt", "r")
con3 <- file("en_US.twitter.txt", "r")
blogs <- readLines(con1, 2000)
news <- readLines(con2, 2000)
twitter <- readLines(con3, 2000)
close(con1)
close(con2)
close(con3)
Before any further analysis, each data set is cleaned and then combined:
# convert to ASCII to strip out non-ASCII characters
blogs <- iconv(blogs, to = "ASCII", sub = "")
news <- iconv(news, to = "ASCII", sub = "")
twitter <- iconv(twitter, to = "ASCII", sub = "")
# combine the three samples into a single character vector
all <- c(blogs, news, twitter)
# split text paragraphs into sentences
all <- sent_detect(all, language = "en", model = NULL)
Then, the text data is converted into a corpus, a structured collection of texts used for statistical analysis.
corpus <- Corpus(VectorSource(all))
corpus <- tm_map(corpus, PlainTextDocument)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
Profanity is removed using a word list, and any remaining URLs are stripped:
badwords <- readLines("./badwords.txt")
corpus <- tm_map(corpus, removeWords, badwords)
# flatten the corpus back into a plain character vector, then strip URLs
corpus <- unlist(lapply(corpus, as.character))
corpus <- gsub("http\\w+", "", corpus)
Create n-grams.
unigram <- NGramTokenizer(corpus, Weka_control(min = 1, max = 1))
bigram <- NGramTokenizer(corpus, Weka_control(min = 2, max = 2))
trigram <- NGramTokenizer(corpus, Weka_control(min = 3, max = 3))
Convert the n-grams into frequency tables.
tbl_unigram <- data.frame(table(unigram))
tbl_bigram <- data.frame(table(bigram))
tbl_trigram <- data.frame(table(trigram))
Sort each frequency table in decreasing order of frequency.
unigramgroup <- tbl_unigram[order(tbl_unigram$Freq,decreasing = TRUE),]
bigramgroup <- tbl_bigram[order(tbl_bigram$Freq,decreasing = TRUE),]
trigramgroup <- tbl_trigram[order(tbl_trigram$Freq,decreasing = TRUE),]
Take the 30 most frequent entries from each sorted table.
unisample <- unigramgroup[1:30,]
colnames(unisample) <- c("Word","Frequency")
bisample <- bigramgroup[1:30,]
colnames(bisample) <- c("Word","Frequency")
trisample <- trigramgroup[1:30,]
colnames(trisample) <- c("Word","Frequency")
Here are the most commonly used single words, two-word combinations, and three-word combinations:
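The bar charts themselves are not embedded in this document; the following is a minimal ggplot2 sketch that would draw them from the top-30 samples built above (the helper name plot_ngram is illustrative):

# bar chart of the top 30 n-grams, most frequent at the top
plot_ngram <- function(sample_df, title) {
  ggplot(sample_df, aes(x = reorder(Word, Frequency), y = Frequency)) +
    geom_col() +
    coord_flip() +
    labs(title = title, x = NULL, y = "Frequency")
}
plot_ngram(unisample, "Top 30 unigrams")
plot_ngram(bisample,  "Top 30 bigrams")
plot_ngram(trisample, "Top 30 trigrams")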
As the n-gram frequency tables show, a relatively small set of words and word combinations accounts for most of the text. The predictive model will therefore be built from the high-frequency n-grams, while low-frequency n-grams will be excluded to keep the model compact and fast.
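As an illustration of how these sorted tables could eventually drive the prediction step, here is a minimal sketch of a frequency lookup with a simple back-off from trigrams to bigrams; the helper predict_next and its handling of the tables are assumptions for this report, not the final model, and it ignores smoothing and regex-escaping of the input:

# predict the next word from the sorted n-gram tables built above
predict_next <- function(phrase,
                         trigrams = trigramgroup,
                         bigrams  = bigramgroup) {
  words <- tail(strsplit(tolower(phrase), "\\s+")[[1]], 2)
  tri <- as.character(trigrams[[1]])  # n-gram strings, already sorted by frequency
  bi  <- as.character(bigrams[[1]])
  # trigrams that start with the last two words of the phrase
  hit <- grep(paste0("^", paste(words, collapse = " "), " "), tri, value = TRUE)
  if (length(hit) == 0) {
    # back off to bigrams that start with the last word
    hit <- grep(paste0("^", tail(words, 1), " "), bi, value = TRUE)
  }
  if (length(hit) == 0) return(NA_character_)
  tail(strsplit(hit[1], " ")[[1]], 1)  # final word of the best-ranked match
}

predict_next("thanks for the")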
Improvement ideas and next steps include building the prediction algorithm from these n-gram frequency tables and wrapping it in the Shiny app described at the start of this report.