I. Introduction

The subject of this report is data exploration for “autocomplete” keyboard typing. The sponsor, SwiftKey, is known for building smart keyboards that make it easier for people to type on their mobile devices by predicting what they are going to type next. One cornerstone technology of smart keyboards is the use of predictive text models.

This initial milestone report explains the major features identified in the data and briefly summarizes, at a high level, the plans for creating the prediction algorithm and Shiny app.

The outline for this project is: (II) data preparation, collection and loading; (III) initial findings; (IV) plans for creating a prediction algorithm and Shiny app.

II. Data preparation: collection and loading

The text analysis dataset offers four language options: de_DE, en_US, fi_FI, and ru_RU. I will only examine the English (en_US) subset. I will now download and read the files.
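The code below is a minimal sketch of the download and extraction step; the URL is the one distributed with the course assignment and is assumed here, as is the ./dataset target directory.

# Download and unzip the raw data once, into ./dataset (assumed layout)
if (!dir.exists("./dataset")) {
    dir.create("./dataset")
}

zipFile <- "./dataset/Coursera-SwiftKey.zip"
if (!file.exists(zipFile)) {
    download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
                  destfile = zipFile)
    unzip(zipFile, exdir = "./dataset")
}

# List the archive contents (produces the table shown below)
unzip(zipFile, list = TRUE)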

##                             Name    Length                Date
## 1                         final/         0 2014-07-22 10:10:00
## 2                   final/de_DE/         0 2014-07-22 10:10:00
## 3  final/de_DE/de_DE.twitter.txt  75578341 2014-07-22 10:11:00
## 4    final/de_DE/de_DE.blogs.txt  85459666 2014-07-22 10:11:00
## 5     final/de_DE/de_DE.news.txt  95591959 2014-07-22 10:11:00
## 6                   final/ru_RU/         0 2014-07-22 10:10:00
## 7    final/ru_RU/ru_RU.blogs.txt 116855835 2014-07-22 10:12:00
## 8     final/ru_RU/ru_RU.news.txt 118996424 2014-07-22 10:12:00
## 9  final/ru_RU/ru_RU.twitter.txt 105182346 2014-07-22 10:12:00
## 10                  final/en_US/         0 2014-07-22 10:10:00
## 11 final/en_US/en_US.twitter.txt 167105338 2014-07-22 10:12:00
## 12    final/en_US/en_US.news.txt 205811889 2014-07-22 10:13:00
## 13   final/en_US/en_US.blogs.txt 210160014 2014-07-22 10:13:00
## 14                  final/fi_FI/         0 2014-07-22 10:10:00
## 15    final/fi_FI/fi_FI.news.txt  94234350 2014-07-22 10:11:00
## 16   final/fi_FI/fi_FI.blogs.txt 108503595 2014-07-22 10:12:00
## 17 final/fi_FI/fi_FI.twitter.txt  25331142 2014-07-22 10:10:00
readFiles <- function () {
    files <- list()
    paths <- c(
        "blogs"   = "./dataset/final/en_US/en_US.blogs.txt",
        "news"    = "./dataset/final/en_US/en_US.news.txt",
        "twitter" = "./dataset/final/en_US/en_US.twitter.txt"
    )

    for (name in names(paths)) {
        # Read each file as UTF-8, skipping embedded null characters
        files[[name]] <- readLines(paths[name], encoding = "UTF-8", skipNul = TRUE)
        # Replace invalid multibyte characters with their byte codes
        files[[name]] <- sapply(files[[name]], function(x) iconv(enc2utf8(x), sub = "byte"))
    }

    return(files)
}

if (!exists("files")) {
    files <- readFiles()
}
## Warning in readLines(paths[name], encoding = "UTF-8", skipNul = TRUE):
## incomplete final line found on './dataset/final/en_US/en_US.news.txt'

The next step is some initial data exploration of the three files in question. We start with count statistics of lines, words, and characters. (The warning above about an incomplete final line simply means the news file does not end with a newline; it can be safely ignored.)

library(stringi)

getFileStats <- function () {
    files_stat <- NULL

    for (name in names(files)) {
        file <- files[[name]]
        # General line and character counts
        stats <- stri_stats_general(file)
        data <- data.frame(t(stats), row.names = name)
        # Add a word count column
        data$Words <- sum(stri_count_words(file))

        if (is.null(files_stat)) {
            files_stat <- data
        } else {
            files_stat <- rbind(files_stat, data)
        }
    }

    return(files_stat)
}

files_stat <- getFileStats()
files_stat
##           Lines LinesNEmpty     Chars CharsNWhite    Words
## blogs    899288      899288 208361438   171925775 38153767
## news      77259       77259  15683765    13117028  2694417
## twitter 2360148     2360148 162385035   134370242 30195719

III. Initial findings

One of the most important tasks in data science is cleaning: “Garbage in = Garbage out” holds more truth than you might think. Cleaning is done on the merged sample of the three sources with the following procedure: convert to lowercase, remove punctuation and numbers, remove English stopwords, strip extra whitespace, stem the words, and filter profanity using Google's banned-words list.

However, what are the initial findings? It is difficult to say, but it looks like we are dealing with two or three different ways of communicating, and hence different ways of typing - something to think about in the coming weeks.

set.seed(2016)

# Read the profanity word list used for filtering
googlebanwords <- read.delim("googlebanwords.txt", sep = ":", header = FALSE)
googlebanwords <- googlebanwords[, 1]

createSample <- function (blogs, news, twitter) {
    # Number of lines to sample from each source
    sizes <- c(blogs = blogs, news = news, twitter = twitter)
    samples <- c()

    for (name in names(files)) {
        file <- files[[name]]
        samples <- c(samples, sample(file, sizes[name]))
    }

    return(samples)
}

library(tm)

getCorpus <- function (data) {
    corpus <- Corpus(VectorSource(data))
    corpus <- tm_map(corpus, content_transformer(tolower))
    corpus <- tm_map(corpus, removePunctuation)
    corpus <- tm_map(corpus, removeNumbers)
    corpus <- tm_map(corpus, removeWords, stopwords("english"))
    corpus <- tm_map(corpus, stripWhitespace)
    corpus <- tm_map(corpus, stemDocument, language = "english")
    # Profanity filtering using Google's banned-words list
    corpus <- tm_map(corpus, removeWords, googlebanwords)
    return(corpus)
}

getSampleCorpus <- function (blogs = 10000, news = 10000, twitter = 30000, cache = TRUE) {
    # Cache the cleaned corpus so sampling and cleaning are not repeated on every run
    cacheFile <- "./dataset/sample_corpus.RDS"
    if (cache && file.exists(cacheFile)) {
        corpus <- readRDS(cacheFile)
    } else {
        samples <- createSample(blogs, news, twitter)
        corpus <- getCorpus(samples)
        saveRDS(corpus, file = cacheFile)
    }

    return(corpus)
}

corpus <- getSampleCorpus(cache = TRUE)

After the cleaning process we take a quick look at the most common words, both on their own and grouped into pairs and triples (1-, 2-, and 3-grams). I have chosen to combine all three sources into one sample, but will explore further whether the three groups should be treated as separate text streams.

library(RWeka)

printNGram <- function (corpus, n = 1, topN = 10, delim = " \\r\\n\\t.!?,;\"()") {
    label <- paste('Top ', topN, ' ', n, '-grams', sep = '')
    # Tokenize the corpus into n-grams and tabulate their frequencies
    token <- NGramTokenizer(corpus$content, Weka_control(min = n, max = n, delimiters = delim))
    top <- as.data.frame(table(token))
    top <- head(top[order(-top$Freq), ], topN)

    par(mar = c(5, 8, 2, 1))
    barplot(rev(top$Freq), names.arg = rev(top$token), main = label, xlab = "Frequency",
            horiz = TRUE, las = 1, cex.names = 0.9, col = "darkblue")

    return(top)
}

printNGrams <- function (corpus, num = 3, topN = 10) {
    for (n in 1:num) { 
        printNGram(corpus, n, topN)
    }
}

printNGrams(corpus)

IV. Plans for creating a prediction algorithm and Shiny app

The next stage is to create a prediction algorithm for suggesting the next word while typing. I foresee three stages in the process: deciding whether to keep the data together or build specialized models per source, checking whether the data are clean enough, and splitting the data for model building and choosing a model.

On the first point: should we keep the data together, or work on a more specialized application for each of the three sources? On the second point: are the data clean enough, or do we need another sweep? Finally, on the third point, we need to decide how to split the data into training, calibration, and testing sets and, equally important, which model to use - however, it is still early days. We have a lot of data, so perhaps we can do some kernel regression or split the data into smart subgroups. It is too early to say; I need to do more investigation.
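As an early illustration of the kind of model under consideration (not a final design), the sketch below builds a bigram frequency table from the cleaned sample corpus and looks up the most frequent continuations of a given word. The helper names buildBigrams and predictNext are hypothetical, and the sketch assumes the corpus object created above.

library(RWeka)

# Hypothetical helper: count all bigrams in the cleaned sample corpus
buildBigrams <- function (corpus) {
    tokens <- NGramTokenizer(corpus$content, Weka_control(min = 2, max = 2))
    as.data.frame(table(tokens), stringsAsFactors = FALSE)
}

# Hypothetical helper: return the most frequent words that follow `word`
predictNext <- function (bigrams, word, topN = 3) {
    hits <- bigrams[grepl(paste0("^", word, " "), bigrams$tokens), ]
    hits <- hits[order(-hits$Freq), ]
    sub(paste0("^", word, " "), "", head(hits$tokens, topN))
}

bigrams <- buildBigrams(corpus)
predictNext(bigrams, "new")

A full solution would need smoothing or back-off to lower-order n-grams when a word has not been seen before, which is part of the investigation planned for the coming weeks.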