To access the version of this report without code, here is the link: http://rpubs.com/rdsn/138846
People are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. But typing on mobile devices can be a serious pain. SwiftKey, our corporate partner in this capstone, builds a smart keyboard that makes it easier for people to type on their mobile devices. One cornerstone of their smart keyboard is predictive text models. When someone types:
I went to the
the keyboard presents three options for what the next word might be. For example, the three words might be gym, store, restaurant. In this capstone we will work on understanding and building predictive text models like those used by SwiftKey.
This project is about creating a Shiny application designed to make text predictions. To do so, we will use the HC Corpora dataset, perform an exploratory analysis, and implement a predictive model to predict the next word.
This Milestone Report presents our understanding of the data, our exploratory analysis, and our plans for implementing a predictive algorithm.
HC Corpora is a collection of corpora for various languages, freely available for download. The corpora have been collected from numerous webpages by a web crawler.
More details on the corpora can be found here: http://www.corpora.heliohost.org/aboutcorpus.html
In this dataset, we are provided with 3 text files (.txt) available in 4 different languages: German, English, Finnish, and Russian.
We will focus here only on the English files: en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt.
Here is the link to download the dataset: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
Once we have loaded the data into R, we can compute the following characteristics for each file:
# Loading the libraries
library(NLP)
library(tm)
library(ggplot2)
##
## Attaching package: 'ggplot2'
##
## The following object is masked from 'package:NLP':
##
## annotate
cname <- file.path("~", "Documents", "Coursera", "Data science - Johns Hopkins University", "Capstone Project", "final", "en_US")
docs <- Corpus(DirSource(cname))
# size of files
blogSize <- as.character(round(file.info("~/Documents/Coursera/Data science - Johns Hopkins University/Capstone Project/final/en_US/en_US.blogs.txt")$size / 1024^2,2))
newSize <- as.character(round(file.info("~/Documents/Coursera/Data science - Johns Hopkins University/Capstone Project/final/en_US/en_US.news.txt")$size / 1024^2,2))
twitterSize <- as.character(round(file.info("~/Documents/Coursera/Data science - Johns Hopkins University/Capstone Project/final/en_US/en_US.twitter.txt")$size / 1024^2,2))
# number of lines per file
blogLines <- as.character(format(length(docs[[1]]$content), big.mark = " "))
newsLines <- as.character(format(length(docs[[2]]$content), big.mark = " "))
twitterLines <- as.character(format(length(docs[[3]]$content), big.mark = " "))
# maximum number of characters in a line
blogMax <- as.character(format(max(nchar(docs[[1]]$content)), big.mark = " "))
newsMax <- as.character(format(max(nchar(docs[[2]]$content)), big.mark = " "))
twitterMax <- as.character(format(max(nchar(docs[[3]]$content)), big.mark = " "))
# minimum number of characters in a line
blogMin <- as.character(min(nchar(docs[[1]]$content)))
newsMin <- as.character(min(nchar(docs[[2]]$content)))
twitterMin <- as.character(min(nchar(docs[[3]]$content)))
# total character counts
blogWcount <- format(sum(nchar(docs[[1]]$content)), big.mark = " ")
newsWcount <- format(sum(nchar(docs[[2]]$content)), big.mark = " ")
twitterWcount <- format(sum(nchar(docs[[3]]$content)), big.mark = " ")
| Data | Size (MB) | Number of lines | Min characters per line | Max characters per line | Total character count |
|---|---|---|---|---|---|
| Blogs | 200.42 | 899 288 | 1 | 40 833 | 206 824 505 |
| News | 196.28 | 1 010 242 | 1 | 11 384 | 203 223 159 |
| Twitter | 159.36 | 2 360 148 | 2 | 140 | 162 096 031 |
Each of the three files discussed above is very large, which may cause problems in terms of computation time. To avoid these issues, we have sampled each file to obtain much smaller objects that remain representative of the originals. Here are the characteristics of those subsets:
# Subsetting
set.seed(1)
subset <- docs
subset[[1]]$content <- subset[[1]]$content[as.logical(rbinom(length(subset[[1]]$content),1,prob = 0.15))]
subset[[2]]$content <- subset[[2]]$content[as.logical(rbinom(length(subset[[2]]$content),1,prob = 0.15))]
subset[[3]]$content <- subset[[3]]$content[as.logical(rbinom(length(subset[[3]]$content),1,prob = 0.15))]
# save(subset, file="subset.RData")
# load("subset.RData")
rm(docs)
# size of subsets in memory (MB)
blogSize2 <- as.character(round(object.size(subset[[1]]$content) / 1024^2,2))
newSize2 <- as.character(round(object.size(subset[[2]]$content) / 1024^2,2))
twitterSize2 <- as.character(round(object.size(subset[[3]]$content) / 1024^2,2))
# number of lines per file
blogLines2 <- as.character(format(length(subset[[1]]$content), big.mark = " "))
newsLines2 <- as.character(format(length(subset[[2]]$content), big.mark = " "))
twitterLines2 <- as.character(format(length(subset[[3]]$content), big.mark = " "))
# maximum number of characters in a line
blogMax2 <- as.character(format(max(nchar(subset[[1]]$content)), big.mark = " "))
newsMax2 <- as.character(format(max(nchar(subset[[2]]$content)), big.mark = " "))
twitterMax2 <- as.character(format(max(nchar(subset[[3]]$content)), big.mark = " "))
# minimum number of characters in a line
blogMin2 <- as.character(min(nchar(subset[[1]]$content)))
newsMin2 <- as.character(min(nchar(subset[[2]]$content)))
twitterMin2 <- as.character(min(nchar(subset[[3]]$content)))
# total character counts for the subsets
blogWcount2 <- sum(nchar(subset[[1]]$content))
blogWcount2t <- format(blogWcount2, big.mark = " ")
newsWcount2 <- sum(nchar(subset[[2]]$content))
newsWcount2t <- format(newsWcount2, big.mark = " ")
twitterWcount2 <- sum(nchar(subset[[3]]$content))
twitterWcount2t <- format(twitterWcount2, big.mark = " ")
| Data | Size (MB) | Number of lines | Min characters per line | Max characters per line | Total character count |
|---|---|---|---|---|---|
| Blogs | 37.24 | 134 734 | 1 | 37 191 | 30 990 754 |
| News | 6.57 | 151 267 | 1 | 37 191 | 4 888 176 |
| Twitter | 8.18 | 354 075 | 2 | 3 845 | 5 357 746 |
In order to analyse these datasets properly, we need to perform some transformations that "clean" the data: removing punctuation, removing numbers, converting to lower case, stemming, and stripping extra whitespace.
Let's look at an example to see how these transformations perform:
Before Preprocessing
subset[[1]]$content[13]
## [1] "Origin: Middle English: from Old French joie, based on Latin gaudium, from gaudere ‘rejoice’"
library(SnowballC)
sub <- tm_map(subset, removePunctuation)
sub <- tm_map(sub, removeNumbers)
sub <- tm_map(sub, content_transformer(tolower))
sub <- tm_map(sub, stemDocument, language = "english")
sub <- tm_map(sub, stripWhitespace)
# save(sub, file="sub.RData")
# load("sub.RData")
rm(subset)
After Preprocessing
sub[[1]]$content[13]
## [1] "origin middl english from old french joie base on latin gaudium from gauder rejoic"
Now that we have preprocessed the data, we can build a document-term matrix that will allow us to perform some exploratory analysis and observe word frequencies and n-gram frequencies.
# Creating a document-term matrix (the commented save/load lines below can be used to cache it)
mat <- DocumentTermMatrix(sub)
# save(mat, file="Mat.RData")
# load("Mat.RData")
# Build a dataframe containing the frequency for each word, sorted in decreasing order
freq <- sort(colSums(as.matrix(mat)), decreasing=TRUE)
wordF <- data.frame(word=names(freq), freq=freq)
# wordF$F <- wordF$freq * 100 /(blogWcount2 + newsWcount2 + twitterWcount2)
library(ggplot2)
require(gridExtra)
## Loading required package: gridExtra
g <- ggplot(wordF[wordF$freq>35000, ], aes(x=word, y=freq))
g <- g + geom_bar(stat="identity", fill = "darkblue")
g <- g + xlab("") + ylab("Frequency (number of occurrences)")
g <- g + ggtitle("Words that appear more than 35 000\ntimes in the 3 Datasets")
g <- g + coord_flip()
h <- ggplot(wordF, aes(x = freq)) + geom_histogram(breaks=seq(0, 20, by = 1), fill="green", alpha = .7)
h <- h + xlab("Frequency (number of occurrences)") + ylab("Number of words")
h <- h + ggtitle("Word frequencies")
h <- h + xlim(0,20)
grid.arrange(g, h, nrow = 1, ncol = 2)
As we can see in this graph, the words that occur most frequently in the three texts are mainly connecting words, as we could have expected; the most frequent word is "the". This shows that we will have to go beyond simple word frequencies to build an efficient predictive model, and consider other patterns, namely n-gram frequencies.
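As a side note, if we only wanted to look at content words during this exploration, those connecting words could be filtered out with tm's stop-word list; a minimal sketch is shown below. The object sub_nostop is introduced here purely for illustration and is not used in the rest of this report, since a next-word predictor needs to handle those words.
# Exploratory sketch only: remove common English stop words with tm.
# Not applied to 'sub' below, because stop words must stay available
# for next-word prediction.
sub_nostop <- tm_map(sub, removeWords, stopwords("english"))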
An n-gram is a contiguous sequence of n items from a given sequence of text.
So in this analysis, we split all the strings into n-grams (n being the number of contiguous words) and look at their frequencies across the 3 text files.
For the purpose of this analysis, we look at 1-grams, 2-grams, and 3-grams.
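To make this concrete, here is a minimal sketch of how the RWeka tokenizer used below splits a short, arbitrary sample sentence into 2-grams; the sentence and the printed result are illustrative only.
# Illustrative sketch: tokenizing one sample sentence into 2-grams with RWeka
library(RWeka)
NGramTokenizer("i went to the store", Weka_control(min = 2, max = 2))
## e.g. "i went" "went to" "to the" "the store"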
options(java.parameters = "-Xmx2048m")
library(RWeka)
library(SnowballC)
# 1-gram
if(!file.exists("Unimat.RData")){
options(mc.cores=1)
UnigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
unimat <- TermDocumentMatrix(sub, control = list(tokenize = UnigramTokenizer))
save(unimat, file="Unimat.RData")
}else{
load("Unimat.RData")
}
UgramFreq <- sort(rowSums(as.matrix(unimat)), decreasing=TRUE)
UgramF <- data.frame(word=names(UgramFreq), freq=UgramFreq)
# 2-gram
if(!file.exists("Bimat.RData")){
options(mc.cores=1)
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
bimat <- TermDocumentMatrix(sub, control = list(tokenize = BigramTokenizer))
save(bimat, file="Bimat.RData")
}else{
load("Bimat.RData")
}
BgramFreq <- sort(rowSums(as.matrix(bimat)), decreasing=TRUE)
BgramF <- data.frame(word=names(BgramFreq), freq=BgramFreq)
# 3-gram
if(!file.exists("Trimat.RData")){
options(mc.cores=1)
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
trimat <- TermDocumentMatrix(sub, control = list(tokenize = TrigramTokenizer))
save(trimat, file="Trimat.RData")
}else{
load("Trimat.RData")
}
TgramFreq <- sort(rowSums(as.matrix(trimat)), decreasing=TRUE)
TgramF <- data.frame(word=names(TgramFreq), freq=TgramFreq)
require(gridExtra)
g <- ggplot(UgramF[UgramF$freq>35000, ], aes(x=word, y=freq))
g <- g + geom_bar(stat="identity", fill = "darkblue")
g <- g + xlab("") + ylab("Frequency (number of occurrences)")
g <- g + ggtitle("1-grams that appear more than 35 000\ntimes in the three Datasets")
g <- g + coord_flip()
h <- ggplot(BgramF[BgramF$freq>10000, ], aes(x=word, y=freq))
h <- h + geom_bar(stat="identity", fill = "green")
h <- h + xlab("") + ylab("Frequency (number of occurrences)")
h <- h + ggtitle("2-grams that appear more than 10 000\ntimes in the three Datasets")
h <- h + coord_flip()
i <- ggplot(TgramF[TgramF$freq>1000, ], aes(x=word, y=freq))
i <- i + geom_bar(stat="identity", fill = "red")
i <- i + xlab("") + ylab("Frequency (number of occurrences)")
i <- i + ggtitle("3-grams that appear more than 1 000\ntimes in the three Datasets")
i <- i + coord_flip()
grid.arrange(g, h, i, nrow = 2, ncol = 2)
UgramLength <- as.character(format(length(UgramF$word),big.mark = " "))
BgramLength <- as.character(format(length(BgramF$word),big.mark = " "))
TgramLength <- as.character(format(length(TgramF$word),big.mark = " "))
Here are the characteristics of each n-gram corpus defined above:
Looking at the graphs above, we can observe that the frequencies are very skewed: the most frequent n-grams (n = 1, 2 or 3) have much higher frequencies than the others. Furthermore, function words such as "the" or "and" occur far more often than the rest.
Profanity filtering: there are some words that we do not want to predict, nor to use when building our prediction model. Those words must be removed. But we cannot simply remove the words themselves; we have to remove the entire sentences in which they appear, because those sentences are made unusable by the presence of such words.
To perform this task, we can use a list of profane words, for example the one available at https://gist.github.com/jamiew/1112488.
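As a hedged illustration, a minimal sketch of such a filter could look like the following. The file name badwords.txt is an assumption (a local copy of the list linked above, one word per line), and no escaping of regex metacharacters is attempted.
# Sketch only: drop every line of the corpus that contains a profane word.
# Assumes "badwords.txt" holds the downloaded word list, one word per line.
badwords <- readLines("badwords.txt", warn = FALSE)
pattern <- paste0("\\b(", paste(badwords, collapse = "|"), ")\\b")
for (i in seq_along(sub)) {
  keep <- !grepl(pattern, sub[[i]]$content, ignore.case = TRUE)
  sub[[i]]$content <- sub[[i]]$content[keep]
}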
# Number of most frequent words needed to cover 90% of all word instances
perc = 0
total <- sum(UgramF$freq)
totW <- length(UgramF$word)
i = 1
while (perc <= 0.9){
  perc = perc + UgramF$freq[i]/total
  i = i+1
}
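As a sketch, the same threshold can also be computed without an explicit loop, using a cumulative sum over the same UgramF data frame; coverage90 is a name introduced here for illustration.
# Vectorised equivalent of the loop above: index of the first word at which
# the cumulative frequency reaches 90% of all word instances
coverage90 <- which(cumsum(UgramF$freq) / sum(UgramF$freq) >= 0.9)[1]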
Covering 90% of word instances: to cover 90% of all word instances in the subset defined above, we only need to consider the 4 044 most frequent words, out of 96 257 distinct words in the subsetted corpus, i.e. around 4% of the vocabulary.
This gives us a large margin to reduce the size of the corpus used by our predictive model, and therefore to reduce computation time.
Here are some avenues we will explore to build our predictive model:
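As one purely illustrative possibility, and not necessarily the approach that will be retained, a frequency-based backoff lookup over the n-gram tables built above (TgramF and BgramF) could look like the sketch below. predictNext is a hypothetical helper; no input cleaning or regex escaping is handled.
# Hypothetical sketch of a simple n-gram backoff lookup (illustration only)
predictNext <- function(phrase) {
  words <- tail(strsplit(tolower(phrase), "\\s+")[[1]], 2)
  # try trigrams starting with the last two words of the phrase
  hits <- grep(paste0("^", paste(words, collapse = " "), " "),
               as.character(TgramF$word), value = TRUE)
  if (length(hits) == 0) {
    # back off to bigrams starting with the last word only
    hits <- grep(paste0("^", tail(words, 1), " "),
                 as.character(BgramF$word), value = TRUE)
  }
  if (length(hits) == 0) return(NA_character_)
  # both tables are sorted by decreasing frequency, so take the first match
  # and return its last word
  tail(strsplit(hits[1], " ")[[1]], 1)
}
# predictNext("I went to the")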