This is the milestone report for the SwiftKey next-word prediction task, which is based on natural language processing (NLP). The goal of the SwiftKey NLP task is to create a next-word prediction algorithm that helps mobile users type messages faster by recommending the most likely next word based on the text the user has already typed.
In this milestone report, the basic preparation of the raw data, preliminary statistics and visualizations, and plans for the algorithm and application are introduced in order to give readers an overall picture of the project.
We use the following packages in this text mining report:
library(knitr)
options(java.parameters = "-Xmx4g" )
library(RWeka)
library(tm)
## Loading required package: NLP
library(stringr)
The first step in the analysis is to gather the data. We download the data set and unzip it. The following code was used to obtain the data.
# url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
# download.file(url = url,method = "curl",destfile = "Coursera-SwiftKey.zip")
# unzip(zipfile = "Coursera-SwiftKey.zip")
The SwiftKey task data consist of tweets, blog posts, and news articles in four languages: English, German, Russian, and Finnish. There are 12 large files in total. For the milestone report we look only at the English-language files.
English-language text files: size in MB, number of lines, and number of words
kable(read.csv("file_size.csv",sep = ";"))
| File | Size | Lines | WordCount |
|---|---|---|---|
| en_US/en_US.blogs.txt | 200MB | 899288 | 37334690 |
| en_US/en_US.news.txt | 196MB | 1010242 | 34372720 |
| en_US/en_US.twitter.txt | 159MB | 2360148 | 30374206 |
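The table above is read from a prepared CSV file, and the report does not show how the figures were produced. As a rough sketch of one way such statistics could be computed in base R: the helper name `file_stats` is hypothetical, and counting whitespace-separated tokens is an assumption that may not match the tool used for the original word counts.

```r
# Hypothetical helper: summarise one text file (size in MB, lines, words).
file_stats <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", warn = FALSE, skipNul = TRUE)
  # Count whitespace-separated tokens per line; empty lines contribute 0.
  words <- sum(lengths(strsplit(trimws(lines), "[[:space:]]+")))
  data.frame(File      = basename(path),
             SizeMB    = round(file.size(path) / 1024^2, 1),
             Lines     = length(lines),
             WordCount = words)
}

# Small self-contained demonstration on a temporary file.
tmp <- tempfile(fileext = ".txt")
writeLines(c("hello world", "", "one two three"), tmp)
stats <- file_stats(tmp)
print(stats)
```

For the real data, the same function could be applied to each of the three `en_US` files and the rows combined with `rbind`.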
As the next step, we load the data. Because the data set is large, we read only the first 3000 lines of each file and merge them into a single text string, so the resulting corpus contains one document.
n=3000
eng_tw = list(readLines(file("final/en_US/en_US.twitter.txt", "r"), n, encoding = "UTF-8", warn = FALSE))
eng_news = list(readLines(file("final/en_US/en_US.news.txt", "r"), n, encoding = "UTF-8", warn = FALSE))
eng_blogs = list(readLines(file("final/en_US/en_US.blogs.txt", "r"), n, encoding = "UTF-8", warn = FALSE))
eng_txt = paste(eng_tw,eng_news,eng_blogs)
save("eng_txt",file = "eng_txt.txt")
To create a corpus, the text documents must first be prepared. For this we apply several standard procedures: converting the text to lower case, removing numbers, removing punctuation, and stripping extra whitespace.
As a result we get a corpus of texts that will be the basis of the application.
corpus <- VCorpus(VectorSource(eng_txt))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
writeCorpus(corpus,filenames = "corpus.txt")
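To make the effect of the cleaning steps above concrete, here is a minimal base-R sketch that applies approximately equivalent transformations to a single string. The function name `clean_text` is hypothetical, and the regular expressions are an assumption about what the `tm` transformations do by default, not a copy of their implementation.

```r
# Hypothetical helper mirroring the four cleaning steps on a character vector.
clean_text <- function(x) {
  x <- tolower(x)                     # lower case
  x <- gsub("[[:digit:]]+", "", x)    # drop numbers
  x <- gsub("[[:punct:]]+", "", x)    # drop punctuation, including apostrophes
  x <- gsub("[[:space:]]+", " ", x)   # collapse runs of whitespace
  x
}

cleaned <- clean_text("I don't know, it was 2 PM!")
print(cleaned)
# [1] "i dont know it was pm"
```

Note that removing punctuation also removes apostrophes, which is why contractions appear as single tokens such as "dont" in the frequency tables.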
Creating the document-term matrix
dtm <- DocumentTermMatrix(corpus)
dtm
## <<DocumentTermMatrix (documents: 1, terms: 26378)>>
## Non-/sparse entries: 26378/0
## Sparsity : 0%
## Maximal term length: 95
## Weighting : term frequency (tf)
Next we create n-grams from the data tokens for the next-word prediction algorithm. N-grams are sequences of n consecutive words that occur in the text.
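As an illustration of the n-gram definition, the following base-R sketch splits a sentence on whitespace and slides a window of n words over it. The function name `make_ngrams` is hypothetical and is not the tokenizer used in this report, which relies on `RWeka`.

```r
# Hypothetical helper: return all n-grams of a single string.
make_ngrams <- function(text, n) {
  words <- strsplit(trimws(text), "[[:space:]]+")[[1]]
  if (length(words) < n) return(character(0))   # too short for any n-gram
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

bi  <- make_ngrams("one of the best", 2)
tri <- make_ngrams("one of the best", 3)
print(bi)
# [1] "one of"   "of the"   "the best"
print(tri)
# [1] "one of the"  "of the best"
```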
First we create unigram tokens and show the top of the frequency table.
token1 <- NGramTokenizer(corpus, Weka_control(min=1, max=1))
token1 <- data.frame(table(token1))
token1 <- token1[order(token1$Freq,decreasing=TRUE),]
colnames(token1) <- c("Word", "Freq")
token1[1:15,]
## Word Freq
## 23715 the 13094
## 24078 to 7208
## 875 and 6853
## 18 a 6316
## 16377 of 5591
## 11687 in 4379
## 11456 i 3958
## 23699 that 2886
## 9042 for 2784
## 12257 is 2724
## 12298 it 2383
## 16489 on 2059
## 26222 with 1993
## 26601 you 1986
## 25746 was 1727
Histogram of Unigrams
par(ps=8, las=2, mar=c(5.1,5.1,4.1,2.1))
barplot(token1[1:40,]$Freq,
col = "blue",
main = "Most frequent Unigram words",
ylab = "Word frequencies",
names.arg = token1$Word[1:40])
Second, we create bigram tokens and show the top of the frequency table.
token2 <- NGramTokenizer(corpus, Weka_control(min=2, max=2))
token2 <- data.frame(table(token2))
token2 <- token2[order(token2$Freq,decreasing=TRUE),]
colnames(token2) <- c("Word", "Freq")
token2[1:15,]
## Word Freq
## 91486 of the 1195
## 65402 in the 1133
## 137945 to the 582
## 49170 for the 503
## 93198 on the 491
## 136356 to be 453
## 10605 and the 386
## 14822 at the 344
## 64329 in a 334
## 151015 with the 309
## 69217 it was 286
## 50932 from the 267
## 89610 of a 247
## 9200 and i 237
## 67577 is a 233
Histogram of most frequent Bigrams
par(ps=8, las=2, mar=c(5.1,5.1,4.1,2.1))
barplot(token2[1:40,]$Freq,
col = "blue",
main = "Most frequent 2-gram words",
ylab = "Word frequencies",
names.arg = token2$Word[1:40])
Third, we create 3-gram tokens and show the top of the frequency table.
token3 <- NGramTokenizer(corpus, Weka_control(min=3, max=3))
token3 <- data.frame(table(token3))
token3 <- token3[order(token3$Freq,decreasing=TRUE),]
colnames(token3) <- c("Word", "Freq")
token3[1:15,]
## Word Freq
## 142331 one of the 98
## 2663 a lot of 96
## 76921 going to be 44
## 105297 it was a 41
## 93099 i want to 40
## 189573 the end of 40
## 148027 part of the 38
## 91851 i have a 37
## 23525 as well as 34
## 91445 i dont know 34
## 205606 to be a 34
## 27999 be able to 32
## 145951 out of the 31
## 935 a couple of 30
## 91930 i have to 30
Histogram of 3-gram word frequencies
par(ps=8, las=2, mar=c(5.1,5.1,4.1,2.1))
barplot(token3[1:40,]$Freq,
col = "blue",
main = "Most frequent 3-gram words",
ylab = "Word frequencies",
names.arg = token3$Word[1:40])
As the plan for the next steps, we consider the following improvements to the current setup:
- Review the corpus and document-term matrix, and clean up sparse terms and profanity.
- Use random selection and a larger number of documents for the corpus, so that it better represents the texts.
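The random selection mentioned in the plan could look roughly like the sketch below, which draws lines at random instead of taking the first ones. The helper name `sample_lines` and the seed value are assumptions for illustration, and reading a whole file into memory is assumed to be acceptable for files of this size.

```r
# Hypothetical helper: draw up to n random lines from a text file,
# keeping them in their original order.
sample_lines <- function(path, n) {
  lines <- readLines(path, encoding = "UTF-8", warn = FALSE, skipNul = TRUE)
  idx <- sort(sample.int(length(lines), min(n, length(lines))))
  lines[idx]
}

# Self-contained demonstration on a temporary file of 100 lines.
set.seed(1234)  # assumed seed, only to make the sample reproducible
tmp <- tempfile(fileext = ".txt")
all_lines <- paste("line", 1:100)
writeLines(all_lines, tmp)
picked <- sample_lines(tmp, 10)
print(picked)
```

Sampling without replacement avoids duplicate lines, and the `min()` guard keeps the call valid when a file has fewer lines than requested.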
For the Shiny application, we plan to reduce the document-term matrix so that it is small enough to be used in the application without hurting performance. We will also look for an appropriate prediction model for the next-word prediction algorithm.
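The prediction model has not been chosen yet. Purely as an illustration of how a frequency table like `token2` could drive a prediction, the sketch below looks up the most frequent continuations of a given word in a small bigram table. The function name `predict_next` is hypothetical, the toy table reuses four rows from the bigram output above, and this is one simple candidate approach rather than the model the project will necessarily use.

```r
# Toy bigram table in the same shape as token2 (columns Word, Freq).
bigrams <- data.frame(Word = c("of the", "of a", "in the", "to be"),
                      Freq = c(1195, 247, 1133, 453),
                      stringsAsFactors = FALSE)

# Hypothetical helper: return up to k most frequent words following prev_word.
predict_next <- function(prev_word, table, k = 3) {
  parts  <- strsplit(as.character(table$Word), " ", fixed = TRUE)
  first  <- vapply(parts, `[`, character(1), 1)
  second <- vapply(parts, `[`, character(1), 2)
  hits   <- first == tolower(prev_word)
  # Order the matching continuations by descending frequency.
  cand <- second[hits][order(table$Freq[hits], decreasing = TRUE)]
  head(cand, k)
}

suggestions <- predict_next("of", bigrams)
print(suggestions)
# [1] "the" "a"
```

A lookup of this kind needs only the n-gram strings and their counts, so a pruned table may be small enough for the performance constraint described above.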