Overview

The capstone project for the Data Science specialization is to build a predictive text model, exposed through a Shiny app, that suggests the next word as the user types a sentence. A familiar everyday example is the predictive keyboard on smartphones, many of which use SwiftKey technology.

The data

There are three data sets for this project, obtained from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip.

For this milestone report, the English versions of the three data sets were used, together with the libraries listed below.
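
The report does not show the library calls themselves, so the set below is inferred from the functions used later in the code; the choice of qdap for sent_detect and of ggplot2 for plotting is an assumption.

library(tm)       # Corpus(), tm_map(), and the standard text transformations
library(RWeka)    # NGramTokenizer(), Weka_control()
library(qdap)     # sent_detect() for splitting paragraphs into sentences
library(ggplot2)  # bar charts of the n-gram frequencies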

Here is a summary of the data:

File      Size (bytes)    Line count    Word count    Words per line
blogs     210,160,014        899,288    37,546,246    41
news      205,811,889      1,010,242    34,762,395    34
twitter   167,105,338      2,360,148    30,093,369    12
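
The report does not show the code behind this summary; a minimal sketch, assuming the stringi package is used for the word counts, might look like this:

files <- c(blogs = "en_US.blogs.txt", news = "en_US.news.txt", twitter = "en_US.twitter.txt")
lines <- lapply(files, readLines, skipNul = TRUE)   # read each file in full
word_counts <- sapply(lines, function(x) sum(stringi::stri_count_words(x)))
line_counts <- sapply(lines, length)
data.frame(file           = names(files),
           size_bytes     = file.size(files),
           line_count     = line_counts,
           word_count     = word_counts,
           words_per_line = round(word_counts / line_counts))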

The raw data sets are large, which makes processing slow, so the remainder of this report works with a subset of the data: the first 2,000 lines of each file.

# open a connection to each file and read in the first 2,000 lines
con1 <- file("en_US.blogs.txt", "r")
con2 <- file("en_US.news.txt", "r")
con3 <- file("en_US.twitter.txt", "r")
blogs   <- readLines(con1, 2000)
news    <- readLines(con2, 2000)
twitter <- readLines(con3, 2000)
close(con1)
close(con2)
close(con3)

Cleaning and pre-processing the data

First, each data set is cleaned, the three are combined, and the combined text is split into sentences:

# convert to ASCII to strip out non-ASCII characters
blogs   <- iconv(blogs, to = "ASCII", sub = "")
news    <- iconv(news, to = "ASCII", sub = "")
twitter <- iconv(twitter, to = "ASCII", sub = "")

# combine the three data sets into a single character vector
all <- c(blogs, news, twitter)

# split text paragraphs into sentences
all <- sent_detect(all, language = "en", model = NULL)

Then the text is converted into a corpus, a structured collection of documents suitable for statistical text analysis, and cleaned further:

corpus <- Corpus(VectorSource(all))
corpus <- tm_map(corpus, content_transformer(tolower))  # lower-case all text
corpus <- tm_map(corpus, removeNumbers)                 # remove digits
corpus <- tm_map(corpus, removePunctuation)             # remove punctuation
corpus <- tm_map(corpus, stripWhitespace)               # collapse repeated whitespace

Profanity and URLs are removed.

badwords <- readLines("./badwords.txt")                 # list of words to filter out
corpus <- tm_map(corpus, removeWords, badwords)
# strip URLs: drop anything starting with "http" up to the next whitespace
corpus <- tm_map(corpus, content_transformer(function(x) gsub("http\\S*", "", x)))

N-gram tokenization

Create unigrams, bigrams, and trigrams with RWeka's NGramTokenizer. The tokenizer works on a plain character vector, so the corpus is first flattened back into text.

text <- sapply(corpus, as.character)   # flatten the corpus back into a character vector
unigram <- NGramTokenizer(text, Weka_control(min = 1, max = 1))
bigram  <- NGramTokenizer(text, Weka_control(min = 2, max = 2))
trigram <- NGramTokenizer(text, Weka_control(min = 3, max = 3))

Count each n-gram and store the frequencies in data frames.

tbl_unigram <- data.frame(table(unigram))
tbl_bigram <- data.frame(table(bigram))
tbl_trigram <- data.frame(table(trigram))

Sort each frequency table in decreasing order of frequency.

unigramgroup <- tbl_unigram[order(tbl_unigram$Freq,decreasing = TRUE),]
bigramgroup <- tbl_bigram[order(tbl_bigram$Freq,decreasing = TRUE),]
trigramgroup <- tbl_trigram[order(tbl_trigram$Freq,decreasing = TRUE),]

Keep the 30 most frequent entries from each sorted table.

unisample <- unigramgroup[1:30,]
colnames(unisample) <- c("Word","Frequency")
bisample <- bigramgroup[1:30,]
colnames(bisample) <- c("Word","Frequency")
trisample <- trigramgroup[1:30,]
colnames(trisample) <- c("Word","Frequency")

Results

Unigram analysis

Here are the most commonly used single words:
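
These counts are easiest to read as a bar chart; the sketch below assumes ggplot2 as the plotting library and a hypothetical helper, plot_freq, for the top-30 tables.

plot_freq <- function(df, title) {
  ggplot(df, aes(x = reorder(Word, Frequency), y = Frequency)) +
    geom_col() +                                  # one bar per word, height = frequency
    coord_flip() +                                # horizontal bars are easier to label
    labs(title = title, x = NULL, y = "Frequency")
}
plot_freq(unisample, "Top 30 unigrams")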

Bigram analysis

Here are the most commonly used two-word combinations:
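
The same plot_freq helper sketched above can chart the bigram counts:

plot_freq(bisample, "Top 30 bigrams")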

Trigram analysis

Here are the most commonly used three-word combinations:
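
And likewise for the trigrams:

plot_freq(trisample, "Top 30 trigrams")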

Conclusion and next steps

As the n-gram results show, the predictive model should be built around the most frequent words and word combinations: high-frequency n-grams will be kept, while low-frequency n-grams will be excluded to keep the model small and fast.
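
To illustrate how the frequency tables can drive prediction, a minimal next-word lookup over the bigram table might look like the sketch below; the function name and the simple most-frequent-continuation rule are assumptions, not the final model.

predict_next_word <- function(word, n = 3) {
  # bigram entries look like "word1 word2"; keep those whose first word matches
  pattern <- paste0("^", tolower(word), " ")
  matches <- tbl_bigram[grepl(pattern, tbl_bigram$bigram), ]
  # most frequent continuations first
  matches <- matches[order(matches$Freq, decreasing = TRUE), ]
  # return the second word of the top n matching bigrams
  sub(pattern, "", head(as.character(matches$bigram), n))
}

predict_next_word("in")   # returns up to three candidate next words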

Planned improvements and next steps:

  1. Build a predictive word model from the n-gram frequency tables
  2. Wrap the model in a Shiny app that suggests the next word as the user types a sentence