The Coursera Data Science Capstone project is about creating a next-word prediction app for typing. The objective of this initial work is an exploratory analysis of the data. This report includes a basic analysis of the data (word and line counts) and an assessment of how large a sample of the data is needed to carry out a reliable prediction of the next word.
We load some libraries first:
library(tm)
## Loading required package: NLP
library(SnowballC)
library(RWeka)
library(plyr)
library(ggplot2)
##
## Attaching package: 'ggplot2'
##
## The following object is masked from 'package:NLP':
##
## annotate
library(wordcloud)
## Loading required package: RColorBrewer
library(gridExtra)
## Loading required package: grid
After downloading and unzipping the initial archive (here and below we use only the English files) we have three text files: en_US.news.txt, en_US.blogs.txt and en_US.twitter.txt.
Now we calculate the number of lines and words in each of them using the wc system tool.
newsPath <- paste0(getwd(), "/final/en_US/en_US.news.txt")
blogsPath <- paste0(getwd(), "/final/en_US/en_US.blogs.txt")
twitterPath <- paste0(getwd(), "/final/en_US/en_US.twitter.txt")
#Calculating number of lines (intern = TRUE so the counts appear in the knitted report)
system(sprintf("wc -l %s", newsPath), intern = TRUE)
system(sprintf("wc -l %s", blogsPath), intern = TRUE)
system(sprintf("wc -l %s", twitterPath), intern = TRUE)
#Calculating number of words
system(sprintf("wc -w %s", newsPath), intern = TRUE)
system(sprintf("wc -w %s", blogsPath), intern = TRUE)
system(sprintf("wc -w %s", twitterPath), intern = TRUE)
#Calculating file size in megabytes
file.info(newsPath)$size / 1024^2
file.info(blogsPath)$size / 1024^2
file.info(twitterPath)$size / 1024^2
All three source files are very large. Therefore, to make working with the data faster, we take a random sample of 50000 lines from each file. During sampling we also expand contractions such as "'ve" to " have". As a result we get three sampled files, 50000 lines each, in a new directory.
getSample <- function(Path, sampleSize)
{
    # Read the file, replacing invalid UTF-8 bytes
    text <- iconv(readLines(Path, warn = FALSE, encoding = "UTF-8"), "UTF-8", "UTF-8", "byte")
    text <- text[!is.na(text)]
    # A negative sampleSize means "return the whole file"
    if (sampleSize < 0) return(text)
    l <- length(text)
    # Sample without replacement so no line is picked twice
    sampleIndexes <- sample(1:l, size = sampleSize, replace = FALSE)
    result <- text[sampleIndexes]
    # Expand the most common contractions (an approximation: "'d" and "'s" are ambiguous)
    result <- gsub("'ll", " will", result, perl = TRUE)
    result <- gsub("'ve", " have", result, perl = TRUE)
    result <- gsub("'d", " had", result, perl = TRUE)
    result <- gsub("n't", " not", result, perl = TRUE)
    result <- gsub("'s", " is", result, perl = TRUE)
    result <- gsub("'m", " am", result, perl = TRUE)
    return(result)
}
writeSample <- function(sample, sampleFilePath)
{
    fileConnection <- file(sampleFilePath, "w+")
    writeLines(sample, fileConnection)
    close(fileConnection)
}
newsSampleFilePath <- paste0(getwd(), "/Samples/en_US/en_US.sample.news.txt")
blogsSampleFilePath <- paste0(getwd(), "/Samples/en_US/en_US.sample.blogs.txt")
twitterSampleFilePath <- paste0(getwd(), "/Samples/en_US/en_US.sample.twitter.txt")
sampleSize <- 50000
writeSample(getSample(newsPath,sampleSize),newsSampleFilePath)
writeSample(getSample(blogsPath,sampleSize),blogsSampleFilePath)
writeSample(getSample(twitterPath,sampleSize),twitterSampleFilePath)
Now we can load the sampled files and build a corpus.
corpus <- Corpus(DirSource("Samples/en_US/", encoding="UTF-8"), readerControl =list(language="en_US"))
summary(corpus)
## Length Class Mode
## en_US.sample.blogs.txt 2 PlainTextDocument list
## en_US.sample.news.txt 2 PlainTextDocument list
## en_US.sample.twitter.txt 2 PlainTextDocument list
Now we can apply transformations to the texts using tm functions:
# Convert all to lower case
corpus <- tm_map(corpus, content_transformer(tolower))
# Remove punctuation and numbers
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
# Collapse extra whitespace
corpus <- tm_map(corpus, stripWhitespace)
# Remove non-ASCII characters (content_transformer keeps the documents as PlainTextDocument)
corpus <- tm_map(corpus, content_transformer(function(x) iconv(x, "latin1", "ASCII", sub = "")))
In order to build a term-document matrix not only for single words but also for sequences of 2 and 3 words, we have to perform tokenization.
#Unigrams
corpusDTM <- TermDocumentMatrix(corpus)
Unigrams <- as.data.frame(rowSums(as.matrix(corpusDTM)))
Unigrams$words <- row.names(Unigrams)
row.names(Unigrams) <- NULL
colnames(Unigrams)[1] <- 'Freq'
Unigrams <- arrange(Unigrams, desc(Freq))
#Bigrams
BigramTokenizer <- function(x) unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
BigramsTDM <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))
Bigrams <- as.data.frame(rowSums(as.matrix(BigramsTDM)))
Bigrams$words <- row.names(Bigrams)
row.names(Bigrams) <- NULL
colnames(Bigrams)[1] <- 'Freq'
Bigrams <- arrange(Bigrams, desc(Freq))
#Trigrams
TrigramTokenizer <- function(x) unlist(lapply(ngrams(words(x), 3), paste, collapse = " "), use.names = FALSE)
TrigramsTDM <- TermDocumentMatrix(corpus, control = list(tokenize = TrigramTokenizer))
Trigrams <- as.data.frame(rowSums(as.matrix(TrigramsTDM)))
Trigrams$words <- row.names(Trigrams)
row.names(Trigrams) <- NULL
colnames(Trigrams)[1] <- 'Freq'
Trigrams <- arrange(Trigrams, desc(Freq))
Let's look at the most frequent uni-, bi- and trigrams:
wordcloud(Unigrams$words, Unigrams$Freq, random.order=FALSE, max.words = 100,
colors=brewer.pal(8, "Dark2"))
p1 <- ggplot(data=within(Unigrams[1:10,], words <- factor(words, levels=words)), aes(x=words,y=Freq)) + theme(axis.text.x=element_text(angle=90, hjust=1)) + geom_bar(stat="identity")+labs(x = "Words", y="Frequency", title="Top Unigrams")
p2 <- ggplot(data=within(Bigrams[1:10,], words <- factor(words, levels=words)), aes(x=words,y=Freq)) + theme(axis.text.x=element_text(angle=90, hjust=1))+ geom_bar(stat="identity")+labs(x = "Words", y="Frequency", title="Top Bigrams")
p3 <- ggplot(data=within(Trigrams[1:10,], words <- factor(words, levels=words)), aes(x=words,y=Freq)) + theme(axis.text.x=element_text(angle=90, hjust=1))+ geom_bar(stat="identity")+labs(x = "Words", y="Frequency", title="Top Trigrams")
grid.arrange(p1, p2, p3, ncol=3)
We see that, as expected, the most frequent words are stopwords. Let's look at how the distribution changes if we remove stopwords with corpus <- tm_map(corpus, removeWords, stopwords("english")) in the data cleaning step (a sketch is given below):
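A minimal sketch of that stopword-free variant, assuming the same pipeline as above (the names corpusNoStop, noStopTDM and UnigramsNoStop are illustrative):
# Build a second corpus with English stopwords removed and recompute the unigram counts
corpusNoStop <- tm_map(corpus, removeWords, stopwords("english"))
corpusNoStop <- tm_map(corpusNoStop, stripWhitespace)
noStopTDM <- TermDocumentMatrix(corpusNoStop)
UnigramsNoStop <- as.data.frame(rowSums(as.matrix(noStopTDM)))
UnigramsNoStop$words <- row.names(UnigramsNoStop)
row.names(UnigramsNoStop) <- NULL
colnames(UnigramsNoStop)[1] <- 'Freq'
UnigramsNoStop <- arrange(UnigramsNoStop, desc(Freq))
The bi- and trigram tables and the plots above can be rebuilt from corpusNoStop in exactly the same way.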
Without stopwords, these n-grams are far more informative and better reflect which words actually carry content in the text.
The next question we want to answer is: "How many unique words do you need in a frequency-sorted dictionary to cover 50% of all word instances in the language? 80%? 90%?"
For this purpose we will make a function and then run it on different coverage percentages:
getWordsByCoverage <- function(tdm, percentage)
{
    # tdm is a frequency-sorted n-gram data frame with a Freq column; the function returns
    # the number of top words needed to cover the given fraction of all word instances
    Break <- tdm
    Break$cumsum <- cumsum(Break$Freq)
    t <- min(which(Break$cumsum > percentage * sum(tdm$Freq)))
    return(t)
}
breaks <- c(0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.99)
Coverage <- data.frame()
for (i in seq_along(breaks)) {
    Coverage[i, 1] <- breaks[i]
    Coverage[i, 2] <- getWordsByCoverage(Unigrams, breaks[i])
}
names(Coverage) <- c("part","words")
Let's look at how many words are needed for each coverage percentage (Coverage2 is the same calculation repeated on the corpus without stopwords):
Coverage # on corpus with stopwords
## part words
## 1 0.10 3
## 2 0.20 15
## 3 0.30 47
## 4 0.40 121
## 5 0.50 312
## 6 0.60 723
## 7 0.70 1584
## 8 0.80 3591
## 9 0.90 10249
## 10 0.99 87919
Coverage2 # on corpus without stopwords
## part words
## 1 0.10 34
## 2 0.20 125
## 3 0.30 303
## 4 0.40 591
## 5 0.50 1076
## 6 0.60 1885
## 7 0.70 3387
## 8 0.80 6612
## 9 0.90 16531
## 10 0.99 98927
Now let's plot this coverage function and compare it to the coverage function for the corpus without stopwords:
p1 <- qplot(Coverage$part, Coverage$words, geom = c("line"), xlab="Coverage", ylab="Number of words", main="Number of words required to reach the coverage")
p2 <- qplot(Coverage2$part, Coverage2$words, geom = c("line"), xlab="Coverage", ylab="Number of words", main="Number of words required to reach the coverage (corpus without stopwords)")
grid.arrange(p1, p2, ncol=2)
We see that the coverage function is not linear: on the corpus without stopwords, 50% coverage is reached with only 1076 words, 90% with 16531 words, 95% with 34963 words and 99% with 98927 words. That means we can cut the long tail and lose only about 10% of coverage while reducing the dictionary size by (98927-16531)/98927 = 83%!
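As a quick check, this reduction can be computed directly from the coverage table (assuming Coverage2 has the same part/words columns as Coverage):
# Relative reduction in dictionary size when keeping only the words needed for 90% coverage
words99 <- Coverage2$words[Coverage2$part == 0.99]
words90 <- Coverage2$words[Coverage2$part == 0.90]
(words99 - words90) / words99 # about 0.83 with the numbers above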
I'm going to create a prediction model using the n-grams from the analysis above. A larger sample could be tested in order to include more n-grams. The model would start with an X-gram (maybe a 5-gram?) model and back off when no match is found (down to 4-grams, 3-grams and so on).
The main problem I will have to solve is reducing the prediction time: a lookup currently takes several seconds, which will not work well for online typing prediction.
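A minimal sketch of such a backoff lookup over the frequency-sorted Bigrams and Trigrams tables built above (the predictNextWord helper and its details are illustrative assumptions, not the final implementation):
# Naive backoff prediction over the frequency-sorted n-gram tables built above
predictNextWord <- function(phrase, Trigrams, Bigrams) {
    tokens <- tolower(unlist(strsplit(phrase, "\\s+")))
    n <- length(tokens)
    # Try trigrams first: match the last two words of the phrase
    if (n >= 2) {
        prefix <- paste(tokens[n - 1], tokens[n])
        hits <- Trigrams$words[startsWith(Trigrams$words, paste0(prefix, " "))]
        # The tables are sorted by Freq, so the first hit is the most frequent continuation
        if (length(hits) > 0) return(sub(".* ", "", hits[1]))
    }
    # Back off to bigrams: match the last word only
    if (n >= 1) {
        hits <- Bigrams$words[startsWith(Bigrams$words, paste0(tokens[n], " "))]
        if (length(hits) > 0) return(sub(".* ", "", hits[1]))
    }
    # Final fallback could be the most frequent unigram
    return(NA_character_)
}
# Example usage: predictNextWord("thanks for", Trigrams, Bigrams)
Because the tables are sorted by frequency, the first match is the most likely continuation; keeping only the top few continuations per prefix (or indexing the tables, e.g. with data.table) is one way the lookup time could be reduced.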
In the end I'm going to create a Shiny application with the prototype and prepare a brief slide presentation.