Overview

Smartphones have become remarkably ubiquitous. Time that used to be “wasted”, such as while travelling to and from work, waiting in queues or waiting for the bus, is now put to good use: chatting with friends and colleagues, reading the latest news or carrying out bank transactions. Time is of the essence, and since smartphones are not comfortable for typing lengthy texts, many users would welcome an application that predicts the next word to be typed, saving time and reducing “smartphone typing fatigue”.

This report summarises the preliminary work done to build such an application. For input data we have used three data files, containing extracts from Internet blogs, news and Twitter messages.

The R code used to perform the analysis is included for those who wish to see the technical details. Readers unfamiliar with the R language may skip these code blocks and focus solely on the results.

Tables and Plots from the Exploratory Analysis of the Data

We loaded the data files and extracted a sample in order to reduce processing time.

# Libraries:
library(tm)
library(openNLP)
library(data.table)
library(dplyr)
library(ggplot2)
# Open connections to source data files:
conBlogs = file("../Data/en_US/en_US.blogs.txt", open="rb")
conNews = file("../Data/en_US/en_US.news.txt", open="rb")
conTwitter = file("../Data/en_US/en_US.twitter.txt", open="rb")

# Read entire data files:
Blogs = readLines(conBlogs, skipNul=T) # 18 sec
close(conBlogs)
News = readLines(conNews, skipNul=T) # 17 sec
close(conNews)
Twitter = readLines(conTwitter, skipNul=T) # 23 sec
close(conTwitter)

# Create entire data set vector:
dat = c(Blogs, News, Twitter)
totlines = length(dat)

# Extract a sample, to reduce waiting time:
nl = 1000 # Number of lines to extract
set.seed(0)
sdat = sample(dat, nl)

The entire data set from the three files had a total of 4,269,678 lines of text. We extracted a random sample of 1,000 lines.

We tokenised the lines of text, i.e. we split each line into its constituent words and placed them in a vector. For this we used the MC_tokenizer function from the tm R package.

# Tokenise vectors:
tokenDat = MC_tokenizer(sdat)
tokenDat = tokenDat[(tokenDat!="")] # Remove empty tokens
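
As a quick sanity check we can inspect the first few tokens of the sample; the exact output depends on the random sample and is not shown here.

# Inspect the first few tokens (results depend on the random seed):
head(tokenDat, 10)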

We counted the word instances and created a word frequency table, from which we obtained the number of unique words in the sample.

WordInstances = length(tokenDat)

# Create frequency table for words (tokens), sorted by descending frequency:
fwords = as.data.table(table(tokenDat))
names(fwords)[1] = "UniqueWord"
fwords = arrange(fwords, desc(N))

UniqueWords = nrow(fwords)

The sample had 25,710 word instances and 6,848 unique words.

Table of Word Frequencies

##        UniqueWord    N
##    1:         the 1102
##    2:          to  642
##    3:         and  597
##    4:           a  575
##    5:          of  555
##   ---                 
## 6844:     zombies    1
## 6845:       zones    1
## 6846:         Zoo    1
## 6847: Zoroastrian    1
## 6848:       Zumba    1

We created a histogram of the individual word frequencies. For clarity the frequency table is ordered by descending frequency, so the histogram descends from left to right.

# Create a histogram from the word frequency table:
fwords$rowN = as.numeric(rownames(fwords)) # Create row number for x-axis.
fwords$logN = log(fwords$N) + 0.1 # Use logN+0.1 on y-axis for better visualisation.

# Plots:
g = ggplot(fwords, aes(y=logN, x=rowN)) + geom_bar(stat="identity", colour="red")
g = g + ggtitle("Frequencies of Words")
g = g + ylab("log Frequency")
g = g + xlab("Word ID")
g

Likewise, we created frequency tables and histograms of 2-grams and 3-grams, i.e. sequences of two and three consecutive words.

# Create vector of 2-grams (pairs of consecutive words):
n = length(tokenDat)
odds = seq(1, n-1, by=2)  # 2-grams starting at odd positions
evens = seq(2, n-1, by=2) # 2-grams starting at even positions

Twograms1 = paste(tokenDat[odds], "-", tokenDat[odds+1], sep="")
Twograms2 = paste(tokenDat[evens], "-", tokenDat[evens+1], sep="")

# Together the two interleaved sets cover every pair of consecutive words:
Twogram = c(Twograms1, Twograms2)

# Create frequency table for 2-grams:
fTwogram = as.data.table(table(Twogram))
fTwogram = arrange(fTwogram, desc(N))

# Create vector of 3-grams (triplets of consecutive words):
s13 = seq(1, n-2, by=3) # 3-grams starting at positions 1, 4, 7, ...
s23 = seq(2, n-2, by=3) # 3-grams starting at positions 2, 5, 8, ...
s33 = seq(3, n-2, by=3) # 3-grams starting at positions 3, 6, 9, ...

Threegrams1 = paste(tokenDat[s13], "-", tokenDat[s13+1], "-", tokenDat[s13+2], sep="")
Threegrams2 = paste(tokenDat[s23], "-", tokenDat[s23+1], "-", tokenDat[s23+2], sep="")
Threegrams3 = paste(tokenDat[s33], "-", tokenDat[s33+1], "-", tokenDat[s33+2], sep="")

# Together the three offset sets cover every triplet of consecutive words:
Threegram = c(Threegrams1, Threegrams2, Threegrams3)

# Create frequency table for 3-grams:
fThreegram = as.data.table(table(Threegram))
fThreegram = arrange(fThreegram, desc(N))

Table of 2-Gram Frequencies

##                      Twogram   N
##     1:                of-the 103
##     2:                in-the  92
##     3:               for-the  52
##     4:                 to-be  48
##     5:                to-the  47
##    ---                          
## 20664:                Zoo-in   1
## 20665:      Zoroastrian-idea   1
## 20666:   Zoroastrianism-Asha   1
## 20667: Zoroastrianism-places   1
## 20668:            Zumba-days   1

Table of 3-Gram Frequencies

##                       Threegram  N
##     1:                  I-don-t 12
##     2:                 a-lot-of  9
##     3:         block-blocks-obj  8
##     4:           blocks-obj-tim  8
##     5:            ela-log-block  8
##    ---                            
## 25053:              Zoo-in-Palm  1
## 25054:      Zoroastrian-idea-of  1
## 25055:  Zoroastrianism-Asha-and  1
## 25056: Zoroastrianism-places-on  1
## 25057:         Zumba-days-every  1

# Create a histogram from the 2-gram frequency table:
fTwogram$rowN = as.numeric(rownames(fTwogram)) # Create row number for x-axis.
fTwogram$logN = log(fTwogram$N) + 0.1 # Use logN+0.1 on y-axis for better visualisation.

# Plots:
g = ggplot(fTwogram, aes(y=logN, x=rowN)) + geom_bar(stat="identity", colour="blue")
g = g + ggtitle("Frequencies of 2-grams")
g = g + ylab("log Frequency")
g = g + xlab("2-gram ID")
g

# Create a histogram from the 3-gram frequency table:
fThreegram$rowN = as.numeric(rownames(fThreegram)) # Create row number for x-axis.
fThreegram$logN = log(fThreegram$N) + 0.1 # Use logN+0.1 on y-axis for better visualisation.

# Plots:
g = ggplot(fThreegram, aes(y=logN, x=rowN)) + geom_bar(stat="identity", colour="green")
g = g + ggtitle("Frequencies of 3-grams")
g = g + ylab("log Frequency")
g = g + xlab("3-gram ID")
g

By quick trial and error we found the number of most frequent unique words needed to cover 50% of all word instances (50% of 25,710 = 12,855).

k = 176 # Candidate number of top-ranked words
sum(fwords$N[1:k]) # Word instances covered by the 176 most frequent words
## [1] 12848

A surprisingly modest number: just 176 words, i.e. 2.6% of the 6,848 unique words, cover 50% of all word instances.
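
The same cutoff can also be computed directly, without trial and error, from the cumulative sums of the descending word frequencies. A minimal sketch (the names coverage and k50 are introduced here purely for illustration):

# Cumulative fraction of word instances covered by the k most frequent words:
coverage = cumsum(fwords$N) / WordInstances
# Smallest k whose cumulative coverage reaches 50%:
k50 = which(coverage >= 0.5)[1]
k50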

Summary of Plan for Prediction Algorithm

To build the prediction algorithm we will use these n-gram frequency tables: for the word or words just typed, we will look up the sequences in which they occur most frequently and suggest the most likely next word.
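
As an illustration only, and not the final algorithm, a simple next-word lookup based on the 2-gram table built above might look as follows; the helper predictNext is a name introduced here, and the prefix match relies on the “-” separator used when the 2-grams were created:

# Illustrative sketch: predict the next word by returning the most frequent
# 2-gram that starts with the previous word.
predictNext = function(prevWord, fTwogram) {
  # 2-grams are stored as "word1-word2"; keep those whose first word is prevWord.
  # (prevWord is assumed to contain no regex special characters.)
  candidates = fTwogram[grepl(paste0("^", prevWord, "-"), fTwogram$Twogram), ]
  if (nrow(candidates) == 0) return(NA_character_) # No match found in the sample
  # fTwogram is sorted by descending frequency, so the first match is the most
  # frequent; return its second word.
  sub("^[^-]+-", "", candidates$Twogram[1])
}

predictNext("of", fTwogram) # With this sample, should suggest "the"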
