Synopsis

This document is the Milestone Report for Coursera’s Data Science Specialization Capstone. It describes the major features of the data provided for the Capstone Project and briefly summarizes my plans for creating the prediction algorithm and Shiny app.

Loading the data

First we download the data from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip and unzip it:

DownloadData <- function()
{
    cw_dir <- getwd()          # remember the current working directory
    setwd("../data")
    ## Download (if necessary) and unzip the data for the assignment
    if(!file.exists("final"))
    {
        print("Data will be downloaded and unzipped...")
        url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
        download.file(url, "Coursera-SwiftKey.zip")
        unzip("Coursera-SwiftKey.zip" )
    }
    else
    {
        print("Data was already downloaded ...")
    }
    setwd(cw_dir)
    return()
}
DownloadData()

We can see that the data consists of very large text files in several languages. Because we are only interested in English text, we load only the English files into R. We also replace every non-alphanumeric character with a space on the fly:

read_file <- function(file){
    ## Build the path to the English version of the requested file
    file <- paste("../data/final/en_US/en_US.",file,".txt", sep="")
    con <- file(file, encoding="UTF-8")
    ## Read all lines and replace every non-alphanumeric character with a space
    content <- gsub("[^[:alnum:]]", " ",scan(con,what=character(), sep="\n", skipNul = TRUE))
    close(con)
    return(content)
}

blogs <- read_file("blogs")
news <- read_file("news")
twitter <- read_file("twitter")
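
To get a sense of just how large the raw files are, we can check their sizes on disk. A quick sketch, assuming the directory layout created by DownloadData above:

## Sizes of the English files in megabytes
files <- list.files("../data/final/en_US", full.names = TRUE)
round(file.size(files) / 1024^2, 1)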

Then we split the lines into words, saving each word vector to disk and removing it from memory to keep the workspace small.

blogs_words <- unlist(strsplit(blogs, " +"))
save(blogs_words, file="blogs_words.saved")
rm(blogs_words)

news_words <- unlist(strsplit(news, " +"))
save(news_words, file="news_words.saved")
rm(news_words)

twitter_words <- unlist(strsplit(twitter, " +"))
save(twitter_words, file="twitter_words.saved")
rm(twitter_words)

Basic summary statistics

After that we are ready to compute some basic statistics. First we count the lines and words in each file.

load("blogs_words.saved")
num_blines <- length(blogs)
num_blines
## [1] 899288
num_bwords <- length(blogs_words)
num_bwords
## [1] 38372148
load("news_words.saved")
num_nlines <- length(news)
num_nlines
## [1] 77259
num_nwords <-length(news_words)
num_nwords
## [1] 2753580
load("twitter_words.saved")
num_tlines <- length(twitter)
num_tlines
## [1] 2360148
num_twords <- length(twitter_words)
num_twords
## [1] 31143477

There are 899288 lines and 38372148 words in the blogs dataset, 77259 lines and 2753580 words in the news dataset, and 2360148 lines and 31143477 words in the twitter dataset.
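
These counts can be gathered into one table for easier comparison. A small sketch reusing the variables computed above:

## Collect the line and word counts into a single data frame
data.frame(dataset = c("blogs", "news", "twitter"),
           lines   = c(num_blines, num_nlines, num_tlines),
           words   = c(num_bwords, num_nwords, num_twords))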

Next we can count the distinct words across all three datasets combined.

words <- c(blogs_words,news_words,twitter_words)
rm(blogs_words)
rm(news_words)
rm(twitter_words)
words <- gsub("[^[:alpha:]]","",words)
w_freq <- table(words)
num_words <- length(w_freq)
num_words
## [1] 612443
save(words, file="words.saved")

There are 612443 different words in these datasets. The twenty most common are listed below (the blank label is the empty string left when a token contained no alphabetic characters):

head(sort(w_freq,TRUE),20)
## words
##     the      to       I       a     and      of              in     you 
## 2644537 1895413 1709175 1508553 1507890 1279497 1188756  965140  817800 
##      is     for    that      it       s      on      my    with       t 
##  787545  742552  719653  707336  656037  550472  495734  463884  413358 
##     was      be 
##  407181  395965

Interesting findings

We can see that these are all words shorter than five characters, so we eliminate them to see how many “real” words we have.

load("words.saved")
r_words <- words[nchar(words)>4]
num_r_words <- length(table(r_words))

So, there are 544687 different words longer than four characters. The most common of these words are:

r_w_freq <- head(sort(table(r_words),TRUE),20)
r_w_freq
## r_words
##   about   there   would   their  people   think   going  really   great 
##  207831  141631  134280  120984  107718   99346   91255   88871   86317 
##   today   which   first   other   right because   could   still  should 
##   85815   83068   76657   76656   76547   75995   72486   69782   63105 
##  little   being 
##   61960   60199
barplot(r_w_freq, main="Frequencies of most common words longer than four characters",las=2)

The number of different words in these datasets is more than twice the number of distinct English words, which is around a quarter of a million. This tells us there must be a lot of misspellings and slang words in our datasets. So, if we want a good prediction model, we should correct the misspellings, map synonyms together, and clean the datasets. A convenient way to do that is the R text mining library “tm”. For the purposes of this milestone report, we will use only 10% of all the text data.

library("tm")
## Loading required package: NLP
library("SnowballC")

## Keep a random 10% sample of each dataset
blogs <- blogs[sample(length(blogs), size=0.1*length(blogs))]
news <- news[sample(length(news), size=0.1*length(news))]
twitter <- twitter[sample(length(twitter), size=0.1*length(twitter))]

text <- c(blogs,news,twitter)

rm(blogs)
rm(news)
rm(twitter)

corpus <- Corpus(VectorSource(text))
rm(text)

## English stopwords plus the fragments left when apostrophes were replaced
## with spaces earlier ("don't" -> "don t", "I'm" -> "I m")
extra_stopwords <- c(stopwords("english"), "t", "a", "s", "the", "don", "m")

corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, extra_stopwords)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stemDocument)
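
Before building n-grams it is worth spot-checking that the cleaning worked. A quick sketch (the index is arbitrary):

## Look at the cleaned text of the first document
as.character(corpus[[1]])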

Once we have a clean corpus, we can build bigrams to show the most common word pairs and their frequencies:

library("RWeka")
library(slam)

n_gram_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tdm <- TermDocumentMatrix(corpus, control = list(tokenize = n_gram_tokenizer))
tdm <- rollup(tdm, 2, na.rm=TRUE, FUN = sum)
pairs_freq <- head(sort(rowSums(as.matrix(tdm)),TRUE), 20)
pairs_freq
##      right now      look like       can wait     last night      feel like 
##           2170           1764           1644           1559           1557 
##   look forward   thank follow        can get       year old      last year 
##           1494           1250           1107           1057            926 
##      make sure       new york     first time       year ago happi birthday 
##            922            902            889            885            874 
##       let know        one day      good morn         let go       just got 
##            837            816            803            803            797
barplot(pairs_freq, main="Frequencies of most common word pairs",las=2)
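
The same pipeline extends directly to trigrams, which the planned prediction model will also need; only the tokenizer changes. A sketch mirroring the bigram code above (output omitted):

## Trigram version of the tokenizer above (min = max = 3)
tri_gram_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
tdm3 <- TermDocumentMatrix(corpus, control = list(tokenize = tri_gram_tokenizer))
tdm3 <- rollup(tdm3, 2, na.rm = TRUE, FUN = sum)
triples_freq <- head(sort(rowSums(as.matrix(tdm3)), TRUE), 20)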

My plans

My plan for the final project is to build a Shiny app that predicts the next word in a phrase based on the previous one, two or three words. For cases where such a combination of words doesn’t appear in the base texts, I’ll try to predict the next word using a back-off model that estimates the conditional probability of a word given its history in the n-gram. If I have enough time, I’ll also try to improve the predictive accuracy and to reduce the computational runtime and model complexity.
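
To make the back-off idea concrete, here is a minimal sketch. The function and the frequency tables are hypothetical, not the final app code: it assumes named numeric frequency vectors (trigram_freq, bigram_freq, unigram_freq) whose names are space-separated n-grams, built the same way as pairs_freq above. A real back-off model would also discount the scores when falling back to a shorter history; this sketch only illustrates the lookup order.

## Hypothetical helper: greedy back-off from trigrams to bigrams to unigrams
predict_next <- function(phrase, trigram_freq, bigram_freq, unigram_freq)
{
    tokens <- tail(unlist(strsplit(tolower(phrase), " +")), 2)
    ## 1. Try trigrams whose first two words match the end of the phrase
    if (length(tokens) == 2) {
        prefix <- paste(tokens, collapse = " ")
        hits <- trigram_freq[startsWith(names(trigram_freq), paste0(prefix, " "))]
        if (length(hits) > 0) return(sub(".* ", "", names(which.max(hits))))
    }
    ## 2. Back off to bigrams that start with the last word only
    hits <- bigram_freq[startsWith(names(bigram_freq), paste0(tail(tokens, 1), " "))]
    if (length(hits) > 0) return(sub(".* ", "", names(which.max(hits))))
    ## 3. Final fallback: the single most frequent word overall
    names(which.max(unigram_freq))
}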