The capstone project for the Data Science specialization is to build a predictive text model and deliver it as a Shiny app that suggests the next word as the user types a sentence. A familiar everyday example is the smartphone keyboard, many of which use SwiftKey technology.
There are three data sets for this project, obtained from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip.
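The archive can be fetched and unpacked directly from R; here is a small sketch (the extracted folder layout, final/en_US/, is an assumption about the archive's contents):

# fetch and unpack the SwiftKey archive (run once; the zip file is large)
zipfile <- "Coursera-SwiftKey.zip"
if (!file.exists(zipfile)) {
  download.file(
    "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
    destfile = zipfile, mode = "wb")
  unzip(zipfile)
}
# the English files are then expected under final/en_US/
list.files("final/en_US")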
For this milestone report, the English versions of the three data sets were used. The following libraries were used:
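The library chunk itself is not reproduced here; judging from the functions called below, a plausible set is the following (qdap is assumed as the source of sent_detect, and ggplot2 is only needed if the frequency plots are redrawn):

library(tm)       # Corpus, tm_map, and the standard text transformations
library(RWeka)    # NGramTokenizer and Weka_control
library(qdap)     # sent_detect, for splitting paragraphs into sentences
library(ggplot2)  # bar charts of the n-gram frequencies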
Here is a summary of the data:
| File name | Size in bytes | Line count | Word count | Words per line |
|---|---|---|---|---|
| blogs | 210160014 | 899288 | 37546246 | 41 |
| news | 205811889 | 1010242 | 34762395 | 34 |
| twitter | 167105338 | 2360148 | 30093369 | 12 |
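For reference, here is a sketch of how these figures could be computed; words are counted by splitting on whitespace, so exact counts may differ slightly depending on the counting rule:

# summarize one raw file: size, line count, word count, words per line
summarize_file <- function(path) {
  lines <- readLines(path, skipNul = TRUE)
  words <- sum(sapply(strsplit(lines, "\\s+"), length))
  data.frame(file = basename(path),
             size_bytes = file.size(path),
             lines = length(lines),
             words = words,
             words_per_line = round(words / length(lines)))
}
summarize_file("en_US.blogs.txt")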
The raw data sets are large, which makes processing slow, so the remainder of this report works with a subset of the data: the first 2,000 lines of each file.
con1 <- file("en_US.blogs.txt", "r")
con2 <- file("en_US.news.txt", "r")
con3 <- file("en_US.twitter.txt", "r")
blogs <- readLines(con1, 2000)
news <- readLines(con2, 2000)
twitter <- readLines(con3, 2000)
close(con1)
close(con2)
close(con3)
Before any further analysis, each data set is cleaned and then combined:
# convert to ASCII to strip out non-ASCII characters
blogs <- iconv(blogs, to = "ASCII", sub = "")
news <- iconv(news, to = "ASCII", sub = "")
twitter <- iconv(twitter, to = "ASCII", sub = "")
# combine the three samples into a single character vector
all <- c(blogs, news, twitter)
# split text paragraphs into sentences
all <- sent_detect(all, language = "en", model = NULL)
Then, the text data is converted into a corpus, a structured collection of texts used for statistical analysis.
corpus <- Corpus(VectorSource(all))
corpus <- tm_map(corpus, PlainTextDocument)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
Profanity is removed using a word list, and any remaining URLs are stripped:
badwords <- readLines("./badwords.txt")
corpus <- tm_map(corpus, removeWords, badwords)
# flatten the corpus back into a plain character vector, then strip URLs
corpus <- unlist(lapply(corpus, as.character))
corpus <- gsub("http\\w+", "", corpus)
Create n-grams.
unigram <- NGramTokenizer(corpus, Weka_control(min = 1, max = 1))
bigram <- NGramTokenizer(corpus, Weka_control(min = 2, max = 2))
trigram <- NGramTokenizer(corpus, Weka_control(min = 3, max = 3))
Convert the n-grams into frequency tables.
tbl_unigram <- data.frame(table(unigram))
tbl_bigram <- data.frame(table(bigram))
tbl_trigram <- data.frame(table(trigram))
Sort each frequency table in decreasing order of frequency.
unigramgroup <- tbl_unigram[order(tbl_unigram$Freq,decreasing = TRUE),]
bigramgroup <- tbl_bigram[order(tbl_bigram$Freq,decreasing = TRUE),]
trigramgroup <- tbl_trigram[order(tbl_trigram$Freq,decreasing = TRUE),]
Take the 30 most frequent entries from each sorted table.
unisample <- unigramgroup[1:30,]
colnames(unisample) <- c("Word","Frequency")
bisample <- bigramgroup[1:30,]
colnames(bisample) <- c("Word","Frequency")
trisample <- trigramgroup[1:30,]
colnames(trisample) <- c("Word","Frequency")
Here are the most commonly used single words, two-word combinations, and three-word combinations:
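The bar charts themselves are not embedded in this document; the following is a minimal ggplot2 sketch that would draw them from the top-30 samples built above (the helper name plot_ngram is illustrative):

# bar chart of the top 30 n-grams, most frequent at the top
plot_ngram <- function(sample_df, title) {
  ggplot(sample_df, aes(x = reorder(Word, Frequency), y = Frequency)) +
    geom_col() +
    coord_flip() +
    labs(title = title, x = NULL, y = "Frequency")
}
plot_ngram(unisample, "Top 30 unigrams")
plot_ngram(bisample,  "Top 30 bigrams")
plot_ngram(trisample, "Top 30 trigrams")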
As the n-gram frequency tables show, a relatively small set of words and word combinations accounts for most of the text. The predictive model will therefore be built from the high-frequency n-grams, while low-frequency n-grams will be excluded to keep the model compact and fast.
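As an illustration of how these sorted tables could eventually drive the prediction step, here is a minimal sketch of a frequency lookup with a simple back-off from trigrams to bigrams; the helper predict_next and its handling of the tables are assumptions for this report, not the final model, and it ignores smoothing and regex-escaping of the input:

# predict the next word from the sorted n-gram tables built above
predict_next <- function(phrase,
                         trigrams = trigramgroup,
                         bigrams  = bigramgroup) {
  words <- tail(strsplit(tolower(phrase), "\\s+")[[1]], 2)
  tri <- as.character(trigrams[[1]])  # n-gram strings, already sorted by frequency
  bi  <- as.character(bigrams[[1]])
  # trigrams that start with the last two words of the phrase
  hit <- grep(paste0("^", paste(words, collapse = " "), " "), tri, value = TRUE)
  if (length(hit) == 0) {
    # back off to bigrams that start with the last word
    hit <- grep(paste0("^", tail(words, 1), " "), bi, value = TRUE)
  }
  if (length(hit) == 0) return(NA_character_)
  tail(strsplit(hit[1], " ")[[1]], 1)  # final word of the best-ranked match
}

predict_next("thanks for the")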
Improvement ideas and next steps include building the prediction algorithm from these n-gram frequency tables and wrapping it in the Shiny app described at the start of this report.