The Johns Hopkins University / Coursera Data Science Specialization capstone project involves the development of a predictive text model.
The model is to be built using a fairly large corpus of text made available for this project. The project was built in conjunction with SwiftKey.
Please note that during experimentation it was found that Python is a more suitable programming language for the computationally intensive pre-processing required in this project.
Specifically, the Natural Language Toolkit (NLTK) Python library is well suited to the work required.
The bulk of the pre-processing tasks are therefore implemented in Python, and code snippets are provided to illustrate its usage. CSV files are used to pass results from Python to R.
import os
import csv
import unicodedata
import nltk
import re
import gc
from nltk.tokenize import sent_tokenize
from nltk.tokenize import RegexpTokenizer
from nltk.util import ngrams
full_execute = False
os.chdir(os.path.expanduser("~/Documents/Coursera/Capstone/final/en_US"))
The dataset used for training the model includes data from Twitter, blogs, and news feeds. The specific dataset is available here: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
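For completeness, a minimal sketch of how the archive could be downloaded and unpacked is shown below; it assumes Python 2 (as in the rest of this report) and the final/en_US layout used by the archive, and was not part of the original pre-processing run.
import urllib
import zipfile
# Download the Coursera-SwiftKey corpus and unpack it into the working directory;
# the archive expands into a final/<locale> directory structure.
url = "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
urllib.urlretrieve(url, "Coursera-SwiftKey.zip")
with zipfile.ZipFile("Coursera-SwiftKey.zip") as archive:
    archive.extractall(".")
The three English files are then read in and decoded as UTF-8: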
blogs=open("en_US.blogs.txt","r").read().decode("utf8")
tweets=open("en_US.twitter.txt").read().decode("utf8")
news=open("en_US.news.txt", "Ur").read().decode("utf8")
blogs = blogs[0:int(0.3*len(blogs))]
tweets = tweets[0:int(0.3*len(tweets))]
news = news[0:int(0.3*len(news))]
Thirty percent of each source was used for the exploratory data analysis.
Initial cleaning of the data was done by converting all text to ASCII and lowercase:
blogs=unicodedata.normalize("NFKC", blogs).encode("ascii", "ignore").lower()
tweets=unicodedata.normalize("NFKC", tweets).encode("ascii", "ignore").lower()
news=unicodedata.normalize("NFKC", news).encode("ascii", "ignore").lower()
alldata = blogs + tweets + news
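Since the individual source strings are no longer needed once they have been concatenated, the memory can be released before tokenization; a minimal sketch (assuming this is the purpose of the gc import above):
# Free the per-source strings and reclaim memory before tokenizing
del blogs, tweets, news
gc.collect()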
Data was then tokenized into words:
tokenizer=RegexpTokenizer(r'\w+')
tokens=tokenizer.tokenize(alldata)
library(R.utils)  # countLines() is provided by the R.utils package
l_blog = countLines("en_US.blogs.txt")[[1]]
l_twitter = countLines("en_US.twitter.txt")[[1]]
l_news = countLines("en_US.news.txt")[[1]]
There were three relevant files provided:
| File | Number of lines |
|---|---|
| en_US.blogs.txt | 899288 |
| en_US.twitter.txt | 2360148 |
| en_US.news.txt | 1010242 |
The first item to investigate is the frequency of individual words. This is extracted via Python and saved to CSV:
words = nltk.FreqDist(tokens)
# Save words
with open("words.csv", "w") as writerf:
writer = csv.writer(writerf)
writer.writerow(["WORD","COUNT"])
for the_word in words.most_common():
writer.writerow(the_word)
The top 12 words are graphed below:
words = read.csv("words.csv", nrows=50)
words$COUNT<-as.numeric(words$COUNT)
plot(words[1:12,]$COUNT, xaxt='n', xlab="Word", ylab="Frequency", main="Words")
axis(1,at=1:12,labels=words[1:12,]$WORD)
n_words<-countLines("words.csv")
A total of 456506 distinct words were found in the exploratory data analysis.
Bigrams are pairs of consecutive words that occur in the text. Bigrams were extracted using the Python ngrams function as below:
# bigrams
fbigram = nltk.FreqDist(ngrams(tokens,2))
# save bigrams
with open("bigrams.csv", "w") as writerf:
writer = csv.writer(writerf)
writer.writerow(["BIGRAM","COUNT"])
for bigram in fbigram.most_common():
writer.writerow([" ".join(bigram[0]), bigram[1]])
The top 8 bigrams are graphed below:
bigrams = read.csv("bigrams.csv", nrows=50)
bigrams$COUNT<-as.numeric(bigrams$COUNT)
plot(bigrams[1:8,]$COUNT, xaxt='n', xlab="Bigram", ylab="Frequency", main="Bigrams")
axis(1,at=1:8,labels=bigrams[1:8,]$BIGRAM)
n_bigrams<-countLines("bigrams.csv")
A total of 8455234 distinct bigrams were found in the exploratory data analysis.
Trigrams are groups of three consecutive words and were extracted using the ngrams function as below:
# trigrams
ftrigrams = nltk.FreqDist(ngrams(tokens,3))
# save trigrams
with open("trigrams.csv", "w") as writerf:
writer = csv.writer(writerf)
writer.writerow(["TRIGRAM","COUNT"])
for trigram in ftrigrams.most_common():
writer.writerow([" ".join(trigram[0]), trigram[1]])
The top 8 trigrams are graphed below:
trigrams = read.csv("trigrams.csv", nrows=50)
trigrams$COUNT<-as.numeric(trigrams$COUNT)
plot(trigrams[1:8,]$COUNT, xaxt='n', xlab="trigram", ylab="Frequency", main="Trigrams")
axis(1,at=1:8,labels=trigrams[1:8,]$TRIGRAM)
n_trigrams<-countLines("trigrams.csv")
A total of 22859430 distinct trigrams were found in the exploratory data analysis.
Quadgrams are groups of four consecutive words found together in the text and were extracted using the ngrams function as below:
# quadgrams
fquadgrams = nltk.FreqDist(ngrams(tokens,4))
with open("quadgrams.csv", "w") as writerf:
writer = csv.writer(writerf)
writer.writerow(["QUADGRAM","COUNT"])
for quadgram in fquadgrams.most_common():
writer.writerow([" ".join(quadgram[0]), quadgram[1]])
The top 8 quadgrams are graphed below:
quadgrams = read.csv("quadgrams.csv", nrows=50)
quadgrams$COUNT<-as.numeric(quadgrams$COUNT)
plot(quadgrams[1:8,]$COUNT, xaxt='n', xlab="quadgram", ylab="Frequency", main="Quadgrams")
axis(1,at=1:8,labels=quadgrams[1:8,]$QUADGRAM)
n_quadgrams<-countLines("quadgrams.csv")
A total of 32001938 distinct quadgrams were found in the exploratory data analysis.
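To show how these frequency tables could eventually drive word prediction, the sketch below implements a simple trigram-to-bigram back-off lookup over the saved CSV files. The load_ngrams and predict_next names, and the back-off strategy itself, are illustrative assumptions rather than the final model design; loading the full tables into memory like this would not scale to the counts reported above without pruning.
import csv

def load_ngrams(filename):
    # Read an n-gram CSV: first column is the space-separated n-gram,
    # second column is its frequency count.
    table = {}
    with open(filename) as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        for row in reader:
            table[tuple(row[0].split())] = int(row[1])
    return table

bigram_table = load_ngrams("bigrams.csv")
trigram_table = load_ngrams("trigrams.csv")

def predict_next(phrase):
    # Back off from trigrams to bigrams: return the most frequent
    # continuation of the last one or two words typed.
    context = tuple(phrase.lower().split())
    for table, n in ((trigram_table, 2), (bigram_table, 1)):
        candidates = [(count, key[-1]) for key, count in table.items()
                      if key[:-1] == context[-n:]]
        if candidates:
            return max(candidates)[1]
    return None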
Next steps will include the following: