For the Coursera Data Science Specialization Capstone, we will create a Natural Language Processing algorithm that predicts the next word while someone is typing one or more words.
The text data has been provided by Swiftkey, a leader in mobile virtual keyboards. In this report we take a first overview of the data and explore some of its interesting features.
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(url, "swiftkey.zip")
unzip("swiftkey.zip")
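On later runs the download can be skipped when the archive is already present; a minimal guard, assuming the same file name as above, might be:
# download and extract only if the archive is not already in the working directory
if (!file.exists("swiftkey.zip")) {
  download.file(url, "swiftkey.zip")
  unzip("swiftkey.zip")
}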
The files are in our working directory.
library(stringr)
library(tm)
library(slam)
library(wordcloud)
library(ggplot2)
library(ngram)
library(gridExtra)
We will focus on the English language datasets.
# helper function to read all lines of a text file
readfile <- function(x) {
  con <- file(x, "r")
  text <- readLines(con)
  close(con)
  return(text)
}
twit <- readfile("final/en_US/en_US.twitter.txt")
## Warning in readLines(con): line 167155 appears to contain an embedded nul
## Warning in readLines(con): line 268547 appears to contain an embedded nul
## Warning in readLines(con): line 1274086 appears to contain an embedded nul
## Warning in readLines(con): line 1759032 appears to contain an embedded nul
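The embedded-nul warnings come from a handful of lines in the Twitter file and are harmless for this analysis. If one wanted to avoid them, readLines() can skip nul characters; the sketch below is illustrative only and was not used for the counts reported further down:
# read the Twitter file while skipping embedded nul bytes, avoiding the warnings above
twitNoNul <- readLines("final/en_US/en_US.twitter.txt", skipNul = TRUE)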
blog <- readfile("final/en_US/en_US.blogs.txt")
news <- readfile("final/en_US/en_US.news.txt")
# helper returning the number of lines and the number of characters of a dataset
features <- function(x) {
  temp <- length(x)                    # number of lines
  temp <- c(temp, sum(str_count(x)))   # str_count() with its default pattern counts characters
  return(temp)
}
twitFeat <- features(twit)
blogFeat <- features(blog)
newsFeat <- features(news)
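Note that str_count() with its default pattern counts characters rather than words; if word counts were wanted as well, one could count whitespace-delimited tokens, for example with the (hypothetical) helper below:
# count whitespace-separated tokens, a rough proxy for the number of words
countWords <- function(x) sum(str_count(x, "\\S+"))
# e.g. countWords(twit)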
The Twitter dataset has 2,360,148 lines and 162,095,755 characters, the blogs dataset has 899,288 lines and 206,824,257 characters, and the news dataset has 1,010,242 lines and 203,223,153 characters.
We will sample the data to obtain a smaller, easily computable dataset; for prediction we will also provide a training and a testing set (not realized in this report). We "reverse-engineered" Swiftkey to decide whether to keep three separate datasets or to group them into a single, more heterogeneous dataset for prediction. We think it is more interesting to group the three datasets into one on which to base our prediction, and we believe Swiftkey does the same (text messages, Facebook and Twitter seem to be predicted from the same dataset).
# number of lines to sample: 2% of the average line count across the three datasets
# (mean() needs a single vector argument, hence the c())
size <- round(mean(c(twitFeat[1], blogFeat[1], newsFeat[1])) * .02)
set.seed(12345)
sampleT <- sample(twit, size)
sampleB <- sample(blog, size)
sampleN <- sample(news, size)
allSam <- c(sampleT, sampleB, sampleN)
We now have a sample of 28,465 lines (2% of the average line count) from each of the three datasets.
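The training and testing split mentioned above is not realized in this report, but a minimal sketch of it, assuming a simple 80/20 split of the combined sample, could look like this:
# hold out 20% of the sampled lines for testing (illustrative only, not used below)
trainIdx <- sample(seq_along(allSam), size = floor(0.8 * length(allSam)))
training <- allSam[trainIdx]
testing  <- allSam[-trainIdx]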
vs <- VectorSource(allSam)
corpus <- VCorpus(vs)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation, preserve_intra_word_dashes = TRUE)
# the profanity list comes from the LDNOOBW GitHub repository
con <- url("https://raw.githubusercontent.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/master/en")
profanity <- readLines(con)
close(con)
corpus <- tm_map(corpus, removeWords, profanity)
corpus <- tm_map(corpus, removeNumbers)
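To verify the cleaning, we can inspect one of the transformed documents, for example:
# look at the first cleaned document
as.character(corpus[[1]])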
We chose to keep the stopwords, as they are useful for the accuracy of our prediction, and we did not apply stemming, as we believe that being able to predict "apples" rather than just "apple" is valuable for users.
We can look at a wordcloud to get a first impression of the distribution of the words.
wordcloud(corpus, max.words=100, colors=brewer.pal(8, "Dark2"))
We can plot the 15 most frequent words in the dataset.
tdm <- TermDocumentMatrix(corpus)
frequency <- rowapply_simple_triplet_matrix(tdm,FUN=sum)
frequency <- sort(frequency, decreasing=TRUE)
gram1 <- data.frame(term=names(frequency), count=frequency, row.names=NULL)
top <- gram1[1:15,]
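A minimal ggplot2 sketch of such a bar chart (the variable p1 is reused below) could be:
# horizontal bar chart of the 15 most frequent words
p1 <- ggplot(top, aes(x = reorder(term, count), y = count)) +
  geom_col() +
  coord_flip() +
  labs(x = "Word", y = "Count")
p1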
We can also plot the 10 most frequent trigrams (sequences of 3 words).
# tokenizer producing trigrams for the TermDocumentMatrix
trigramToken <- function(x) {
  unlist(lapply(ngrams(words(x), 3), paste, collapse = " "), use.names = FALSE)
}
tdm3 <- TermDocumentMatrix(corpus, control=list(tokenize=trigramToken))
frequency3 <- rowapply_simple_triplet_matrix(tdm3,FUN=sum)
frequency3 <- sort(frequency3, decreasing=TRUE)
gram3 <- data.frame(term=names(frequency3), count=frequency3, row.names=NULL)
top3 <- gram3[1:10,]
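A similar sketch for the trigrams, using gridExtra to place the two frequency charts side by side, could be:
# bar chart of the 10 most frequent trigrams, shown next to the unigram chart
p3 <- ggplot(top3, aes(x = reorder(term, count), y = count)) +
  geom_col() +
  coord_flip() +
  labs(x = "Trigram", y = "Count")
grid.arrange(p1, p3, ncol = 2)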
We can also plot the cumulative coverage of the words, with 90% coverage marked by the red line.
#gram1
total1 <- sum(gram1$count)
gram1$cover <- cumsum(gram1$count)/total1
gram1$index <- seq.int(nrow(gram1))
x1 <- table(gram1$cover<=.9)[2]   # number of unique words needed to reach 90% coverage
#gram3
total3 <- sum(gram3$count)
gram3$cover <- cumsum(gram3$count)/total3
gram3$index <- seq.int(nrow(gram3))
x3 <- table(gram3$cover<=.9)[2]   # number of unique trigrams needed to reach 90% coverage
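A sketch of the 1-gram coverage plot, with the red vertical line at the number of words giving 90% coverage, could be:
# cumulative word coverage with the 90% threshold marked in red
ggplot(gram1, aes(x = index, y = cover)) +
  geom_line() +
  geom_vline(xintercept = as.numeric(x1), colour = "red") +
  labs(x = "Number of unique words", y = "Cumulative coverage")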
We can see that we reach 90% coverage (red vertical line) quite fast for 1-grams, and may remove some of the rarer terms in the final algorithm to make it more efficient in terms of memory.
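A one-line sketch of that pruning, keeping only the words needed to reach 90% coverage (the object name is only illustrative), could be:
# keep only the most frequent words up to 90% cumulative coverage
gram1Small <- gram1[gram1$cover <= .9, ]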
The next step will be to make the dataset more concise, using the word distribution, so that the algorithm can run in an app without losing accuracy; we will then compute the probability of an unseen combination of words occurring (with Good-Turing smoothing). For the prediction model we will use Markov chain theory to provide an efficient model based on unigrams, bigrams and trigrams.
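As a toy illustration of the trigram-based prediction idea (no smoothing or backoff yet, and the function is only a sketch), one could look up the most frequent trigram that starts with the last two typed words:
# hypothetical sketch: return the most frequent trigram continuation of the last two words
predictNext <- function(phrase, trigrams = gram3) {
  w <- tail(unlist(strsplit(tolower(phrase), "\\s+")), 2)
  if (length(w) < 2) return(NA_character_)            # need at least two words of context
  hits <- trigrams[grepl(paste0("^", w[1], " ", w[2], " "), trigrams$term), ]
  if (nrow(hits) == 0) return(NA_character_)          # unseen context: smoothing/backoff will handle this
  strsplit(as.character(hits$term[1]), " ")[[1]][3]   # gram3 is already sorted by frequency
}
predictNext("thanks for the")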
Thank you for reading!