Introduction

Around the world, people spend an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. But typing on mobile devices can be a serious pain. This project builds a smart algorithm that suggests the next word of a phrase. For example, if someone types:

I went to the

the keyboard presents three options for what the next word might be, for example gym, store or restaurant.

Goal

Using the dataset provided by SwiftKey, predict the next word of a phrase.

The dataset

The dataset is composed of texts from blogs, news and tweets. The texts are available in four different languages: German, English, Russian and Finnish. English was the language used for prediction in this project. Each file is organised as follows: every line represents a document, so for Twitter each line is a tweet, for blogs each line is a post, and for news each line is a news article. The dataset can be downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip. The zip file (Coursera-SwiftKey.zip) contains four folders, final/de_DE, final/en_US, final/fi_FI and final/ru_RU, each with three files of documents from Twitter, news and blogs. The files used for this project are final/en_US/en_US.blogs.txt, final/en_US/en_US.news.txt and final/en_US/en_US.twitter.txt.

# Obtaining the data
if(!file.exists("Coursera-SwiftKey.zip")) {
    download.file(
        url = "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
        file = "Coursera-SwiftKey.zip")
}
# Unzipping the data
if(!file.exists("final")) {
    unzip(zipfile = "Coursera-SwiftKey.zip")
}
# Reading the data (skipNul = TRUE avoids warnings about embedded nul characters)
twitterData <- readLines("final/en_US/en_US.twitter.txt", skipNul = TRUE)
blogsData <- readLines("final/en_US/en_US.blogs.txt", skipNul = TRUE)
newsData <- readLines("final/en_US/en_US.news.txt", skipNul = TRUE)

The datasets are fairly large. For each file, the size on disk, the number of lines (each line represents a document) and the maximum number of words found in a single line were computed.
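
These statistics can be computed directly from the files read above; the snippet below is a minimal illustrative sketch (the layout of the summary table is not from the original report).

# Summarise each file: size on disk, number of lines and maximum
# number of words found in a single line
files <- c(blogs   = "final/en_US/en_US.blogs.txt",
           news    = "final/en_US/en_US.news.txt",
           twitter = "final/en_US/en_US.twitter.txt")
data.frame(
    sizeMB   = round(file.info(files)$size / 1024^2, 1),
    lines    = sapply(list(blogsData, newsData, twitterData), length),
    maxWords = sapply(list(blogsData, newsData, twitterData),
                      function(x) max(lengths(strsplit(x, "\\s+")))),
    row.names = names(files)
)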

Sampling the datasets

The training dataset was sampled using 1% of the total number of lines. For each file, random lines were chosen among those detected as ENGLISH phrases by the cldr library and added to the training dataset.

library(tm)
library(SnowballC)
library(cldr)

docs <- c("final/en_US/en_US.twitter.txt", 
          "final/en_US/en_US.blogs.txt",
          "final/en_US/en_US.news.txt")
dataDocuments <- character()
# Sampling the data
for(doc in docs) {
    data <- readLines(doc, skipNul = TRUE)
    # Sample size: 1% of the total number of lines in the file
    sampleSize <- round(length(data) * .01)
    # Keep only the lines detected as English, then draw the random sample
    data <- data[detectLanguage(data)$detectedLanguage == "ENGLISH"]
    data <- sample(data, sampleSize)
    dataDocuments <- c(dataDocuments, data)
}

# Saving the data
if(!dir.exists("tidy")) dir.create("tidy")
saveRDS(dataDocuments, file = "tidy/sample.RDS")

With the sampling done, a cleaning procedure was applied to remove punctuation and numbers, convert the words to lower case and strip extra whitespace.

dataDocuments <- readRDS("tidy/sample.RDS")
# Cleaning the data
dataVS <- VectorSource(dataDocuments)
dataCorpus <- VCorpus(dataVS)
dataCorpus <- tm_map(dataCorpus, removePunctuation)
dataCorpus <- tm_map(dataCorpus, removeNumbers)
dataCorpus <- tm_map(dataCorpus, content_transformer(tolower))
dataCorpus <- tm_map(dataCorpus, stripWhitespace)
# Saving the data
dataLines <- unlist(lapply(dataCorpus, function(doc) doc$content))
saveRDS(dataLines, file = "tidy/tidy.RDS")

With the training dataset sampled and tidied, an n-gram tokenizer was used to split the phrases into sequences of one, two and three words and to compute their frequencies of use. For each n-gram size the following plots were generated.

library(tm)
library(SnowballC)
library(RWeka)
library(wordcloud)
library(RColorBrewer)
library(cldr)
# ============================================
# Generate plot for the frequencies of words
# ============================================
# Load the tidy sampled data (not stemmed)
dataDocuments <- readRDS("tidy/tidy.RDS")
# Prepare the corpus
dataVS <- VectorSource(dataDocuments)
dataCorpus <- VCorpus(dataVS)
# Use a single core (RWeka tokenizers can fail with parallel tm_map)
options(mc.cores=1)

Mono-gram plot

A mono-gram barplot of the most used words in the training dataset:

monoGramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
tdmMono <- TermDocumentMatrix(dataCorpus, control = list(tokenize = monoGramTokenizer))
freqMono <- slam::row_sums(tdmMono)
freqMono <- freqMono[order(-freqMono)]
freqMono <- data.frame(word = names(freqMono), freq = freqMono)
saveRDS(freqMono, file = "tidy/freqMono.RDS")
freqMono = readRDS("tidy/freqMono.RDS")
par(mar=c(8,4,4,4))
barplot(freqMono[1:50,]$freq, las = 2, names.arg = freqMono[1:50,]$word,
        col ="lightblue", main ="Most frequent mono-gram words",
        ylab = "Word frequencies")

Bi-gram plot

A bi-gram barplot of the most used sequences of two words in the training dataset:

biGramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tdmBi <- TermDocumentMatrix(dataCorpus, control = list(tokenize = biGramTokenizer))
freqBi <- slam::row_sums(tdmBi)
freqBi <- freqBi[order(-freqBi)]
freqBi <- data.frame(word = names(freqBi), freq = freqBi)
saveRDS(freqBi, file = "tidy/freqBi.RDS")
freqBi = readRDS("tidy/freqBi.RDS")
par(mar=c(8,4,4,4))
barplot(freqBi[1:50,]$freq, las = 2, names.arg = freqBi[1:50,]$word,
        col ="lightblue", main ="Most frequent bi-gram words",
        ylab = "Word frequencies")

Three-gram plot

A three-gram barplot of the most used sequences of three words in the training dataset:

threeGramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
tdmThree <- TermDocumentMatrix(dataCorpus, control = list(tokenize = threeGramTokenizer))
freqThree <- slam::row_sums(tdmThree)
freqThree <- freqThree[order(-freqThree)]
freqThree <- data.frame(word = names(freqThree), freq = freqThree)
saveRDS(freqThree, file = "tidy/freqThree.RDS")
freqThree = readRDS("tidy/freqThree.RDS")
par(mar=c(8,4,4,4))
barplot(freqThree[1:50,]$freq, las = 2, names.arg = freqThree[1:50,]$word,
        col ="lightblue", main ="Most frequent three-gram words",
        ylab = "Word frequencies")

Statistics of mono-grams

For the mono-grams the following statistics were obtained:

The number of words covering 50% of the frequencies

freqMono = readRDS("tidy/freqMono.RDS")
# Calculate the words quantity for 50%
totalFreq <- sum(freqMono$freq) * .5
currentFreq <- 0
selectedIdx <- 0
for(i in 1:length(freqMono$freq)) {
    currentFreq <- currentFreq + freqMono$freq[i]
    if(currentFreq >= totalFreq) {
        selectedIdx <- i - 1
        break
    }
}

Only 0.551828% of the distinct words account for 50% of the total word occurrences in the training dataset.

The number of words covering 90% of the frequencies

freqMono = readRDS("tidy/freqMono.RDS")
# Calculate the words quantity for 90%
totalFreq <- sum(freqMono$freq) * .9
currentFreq <- 0
selectedIdx <- 0
for(i in 1:length(freqMono$freq)) {
    currentFreq <- currentFreq + freqMono$freq[i]
    if(currentFreq >= totalFreq) {
        selectedIdx <- i - 1
        break
    }
}

Only 17.1346986% of the distinct words account for 90% of the total word occurrences in the training dataset.
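
Both coverage figures can also be obtained in a single vectorized pass with cumsum. The coverage helper below is a hypothetical sketch, not part of the original analysis; the commented percentages are the values reported above.

freqMono <- readRDS("tidy/freqMono.RDS")
# Number of distinct words whose cumulative frequency reaches a given share
coverage <- function(freq, share) {
    which(cumsum(freq) >= sum(freq) * share)[1]
}
coverage(freqMono$freq, .5) / nrow(freqMono) * 100   # ~0.55%
coverage(freqMono$freq, .9) / nrow(freqMono) * 100   # ~17.1%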

The number of foreign words

dataDocuments <- readRDS("tidy/tidy.RDS")
foreignWords <- 0
for(i in 1:length(dataDocuments)) {
    # Iterate over the individual words of each document
    for(j in strsplit(dataDocuments[i], " ")[[1]]) {
        if(detectLanguage(j)$detectedLanguage[1] != "ENGLISH") {
            foreignWords <- foreignWords + 1
        }
    }
}

In the training dataset, a total of 7572 words were identified as foreign words.

Considerations

Based on the analysis, the n-grams identified can be used to predict the next word of a phrase, but a strategy will be necessary to estimate probabilities for words that have never been seen before in the n-grams. Various models can be generated: for example, bi-grams can be combined with three-grams, or the first words of a three-gram can be combined with its last word, and so on. A possible algorithm to estimate the word probabilities is SGT (Simple Good-Turing), which ranks word probabilities based on their frequencies. After that, test predictions will be executed to analyse the precision of the algorithm. Finally, a Shiny app will be developed to predict the next word of a phrase with a lean summary and a good prediction time.
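
As a first illustration of such a model, the bi-gram and three-gram frequency tables computed above can be combined into a simple back-off lookup. The predictNextWord helper below is a hypothetical sketch: it does not include Simple Good-Turing smoothing and is not the final model.

# Minimal back-off sketch: try three-grams that start with the last two words,
# fall back to bi-grams starting with the last word, then to the most
# frequent mono-grams (hypothetical helper, not the final model)
freqMono  <- readRDS("tidy/freqMono.RDS")
freqBi    <- readRDS("tidy/freqBi.RDS")
freqThree <- readRDS("tidy/freqThree.RDS")

predictNextWord <- function(phrase, n = 3) {
    words <- strsplit(tolower(phrase), "\\s+")[[1]]
    last2 <- paste(tail(words, 2), collapse = " ")
    last1 <- tail(words, 1)
    # Candidate n-grams whose first words match the end of the phrase
    cand <- as.character(freqThree$word[startsWith(as.character(freqThree$word),
                                                   paste0(last2, " "))])
    if(length(cand) < n) {
        cand <- c(cand, as.character(freqBi$word[startsWith(as.character(freqBi$word),
                                                            paste0(last1, " "))]))
    }
    # Keep only the last word of each candidate and fill up with mono-grams
    nextWords <- sub(".* ", "", cand)
    unique(c(nextWords, as.character(freqMono$word)))[1:n]
}

predictNextWord("I went to the")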