The Johns Hopkins University / Coursera Data Science Specialization capstone project involves the development of a predictive text model.
The model is to be built using a fairly large corpus of text made available for this project. The project was built in conjunction with SwiftKey.
Please note that during experimentation it was found that Python is a more suitable programming language for the computationally intensive pre-processing required in this project.
Specifically, the Natural Language Toolkit (NLTK) Python library is well suited to the work required.
The bulk of the pre-processing tasks are therefore implemented in Python, and code snippets are provided to illustrate its usage. CSV files are used to pass results from Python to R.
import os
import csv
import unicodedata
import nltk
import re
import gc
from nltk.tokenize import sent_tokenize
from nltk.tokenize import RegexpTokenizer
from nltk.util import ngrams
full_execute = False
os.chdir(os.path.expanduser("~/Documents/Coursera/Capstone/final/en_US"))
The dataset used for training the model includes data from Twitter, blogs, and news feeds. The specific dataset is available here: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
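For completeness, a minimal sketch of how the archive could be downloaded and unpacked is shown below; it assumes Python 2 (as in the rest of this report) and the final/en_US layout used by the archive, and was not part of the original pre-processing run.
import urllib
import zipfile
# Download the Coursera-SwiftKey corpus and unpack it into the working directory;
# the archive expands into a final/<locale> directory structure.
url = "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
urllib.urlretrieve(url, "Coursera-SwiftKey.zip")
with zipfile.ZipFile("Coursera-SwiftKey.zip") as archive:
    archive.extractall(".")
The three English files are then read in and decoded as UTF-8: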
blogs=open("en_US.blogs.txt","r").read().decode("utf8")
tweets=open("en_US.twitter.txt").read().decode("utf8")
news=open("en_US.news.txt", "Ur").read().decode("utf8")
blogs = blogs[0:int(0.3*len(blogs))]
tweets = tweets[0:int(0.3*len(tweets))]
news = news[0:int(0.3*len(news))]
Thirty percent of each source was used for the exploratory data analysis.
Initial cleaning of the data was done by converting all text to ASCII and lowercase:
blogs=unicodedata.normalize("NFKC", blogs).encode("ascii", "ignore").lower()
tweets=unicodedata.normalize("NFKC", tweets).encode("ascii", "ignore").lower()
news=unicodedata.normalize("NFKC", news).encode("ascii", "ignore").lower()
alldata = blogs + tweets + news
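Since the individual source strings are no longer needed once they have been concatenated, the memory can be released before tokenization; a minimal sketch (assuming this is the purpose of the gc import above):
# Free the per-source strings and reclaim memory before tokenizing
del blogs, tweets, news
gc.collect()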
Data was then tokenized into words:
tokenizer=RegexpTokenizer(r'\w+')
tokens=tokenizer.tokenize(alldata)
library(R.utils)  # countLines() is provided by the R.utils package
l_blog = countLines("en_US.blogs.txt")[[1]]
l_twitter = countLines("en_US.twitter.txt")[[1]]
l_news = countLines("en_US.news.txt")[[1]]
There were three relevant files provided:
| File | Number of lines |
|---|---|
| en_US.blogs.txt | 899288 |
| en_US.twitter.txt | 2360148 |
| en_US.news.txt | 1010242 |
The first item to investigate is the frequency of individual words. This is extracted via Python and saved to CSV:
words = nltk.FreqDist(tokens)
# Save words
with open("words.csv", "w") as writerf:
writer = csv.writer(writerf)
writer.writerow(["WORD","COUNT"])
for the_word in words.most_common():
writer.writerow(the_word)
The top 12 words are graphed below:
words = read.csv("words.csv", nrows=50)
words$COUNT<-as.numeric(words$COUNT)
plot(words[1:12,]$COUNT, xaxt='n', xlab="Word", ylab="Frequency", main="Words")
axis(1,at=1:12,labels=words[1:12,]$WORD)
n_words<-countLines("words.csv")
A total of 456506 distinct words were found in the exploratory data analysis.
Bigrams are pairs of consecutive words that occur in the text. Bigrams were extracted using the Python ngrams function as below:
# bigrams
fbigram = nltk.FreqDist(ngrams(tokens,2))
# save bigrams
with open("bigrams.csv", "w") as writerf:
writer = csv.writer(writerf)
writer.writerow(["BIGRAM","COUNT"])
for bigram in fbigram.most_common():
writer.writerow([" ".join(bigram[0]), bigram[1]])
The top 8 bigrams are graphed below:
bigrams = read.csv("bigrams.csv", nrows=50)
bigrams$COUNT<-as.numeric(bigrams$COUNT)
plot(bigrams[1:8,]$COUNT, xaxt='n', xlab="Bigram", ylab="Frequency", main="Bigrams")
axis(1,at=1:8,labels=bigrams[1:8,]$BIGRAM)
n_bigrams<-countLines("bigrams.csv")
A total of 8455234 distinct bigrams were found in the exploratory data analysis.
Trigrams are groups of three consecutive words and were extracted using the ngrams function as below:
# trigrams
ftrigrams = nltk.FreqDist(ngrams(tokens,3))
# save trigrams
with open("trigrams.csv", "w") as writerf:
writer = csv.writer(writerf)
writer.writerow(["TRIGRAM","COUNT"])
for trigram in ftrigrams.most_common():
writer.writerow([" ".join(trigram[0]), trigram[1]])
The top 8 trigrams are graphed below:
trigrams = read.csv("trigrams.csv", nrows=50)
trigrams$COUNT<-as.numeric(trigrams$COUNT)
plot(trigrams[1:8,]$COUNT, xaxt='n', xlab="trigram", ylab="Frequency", main="Trigrams")
axis(1,at=1:8,labels=trigrams[1:8,]$TRIGRAM)
n_trigrams<-countLines("trigrams.csv")
A total of 22859430 distinct trigrams were found in the exploratory data analysis.
Quadgrams are groups of four consecutive words found together in the text and were extracted using the ngrams function as below:
# quadgrams
fquadgrams = nltk.FreqDist(ngrams(tokens,4))
with open("quadgrams.csv", "w") as writerf:
writer = csv.writer(writerf)
writer.writerow(["QUADGRAM","COUNT"])
for quadgram in fquadgrams.most_common():
writer.writerow([" ".join(quadgram[0]), quadgram[1]])
The top 8 quadgrams are graphed below:
quadgrams = read.csv("quadgrams.csv", nrows=50)
quadgrams$COUNT<-as.numeric(quadgrams$COUNT)
plot(quadgrams[1:8,]$COUNT, xaxt='n', xlab="quadgram", ylab="Frequency", main="Quadgrams")
axis(1,at=1:8,labels=quadgrams[1:8,]$QUADGRAM)
n_quadgrams<-countLines("quadgrams.csv")
A total of 32001938 distinct quadgrams were found in the exploratory data analysis.
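To show how these frequency tables could eventually drive word prediction, the sketch below implements a simple trigram-to-bigram back-off lookup over the saved CSV files. The load_ngrams and predict_next names, and the back-off strategy itself, are illustrative assumptions rather than the final model design; loading the full tables into memory like this would not scale to the counts reported above without pruning.
import csv

def load_ngrams(filename):
    # Read an n-gram CSV: first column is the space-separated n-gram,
    # second column is its frequency count.
    table = {}
    with open(filename) as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        for row in reader:
            table[tuple(row[0].split())] = int(row[1])
    return table

bigram_table = load_ngrams("bigrams.csv")
trigram_table = load_ngrams("trigrams.csv")

def predict_next(phrase):
    # Back off from trigrams to bigrams: return the most frequent
    # continuation of the last one or two words typed.
    context = tuple(phrase.lower().split())
    for table, n in ((trigram_table, 2), (bigram_table, 1)):
        candidates = [(count, key[-1]) for key, count in table.items()
                      if key[:-1] == context[-n:]]
        if candidates:
            return max(candidates)[1]
    return None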
Next steps will include the following: