Introduction

This project aims to develop a predictive text product as part of the Capstone Project for the Data Science Specialisation offered on Coursera. The product is based on Natural Language Processing (NLP), a multi-disciplinary field concerned with the interactions between computers and human (natural) languages.

This report summarises the exploratory analysis I have performed on the data, and briefly outlines my plans and goals for the prediction algorithm and application.

Data Processing and Summary

The dataset is from a corpus called HC Corpora.

Downloading and loading the data

# Download the dataset and unzip it into the working directory
zipfile = "Coursera-SwiftKey.zip"
fileURL = "http://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(fileURL, zipfile)
unzip(zipfile)

There are four language sub-folders included in the dataset.

list.files("final")
## [1] "de_DE" "en_US" "fi_FI" "ru_RU"

For the subsequent analysis, we will use only the “news” file in the en_US folder.

list.files("final/en_US")
## [1] "en_US.blogs.txt"   "en_US.news.txt"    "en_US.twitter.txt"
# Read the news file line by line; skipNul = TRUE drops embedded nul characters
datanews = readLines("final/en_US/en_US.news.txt", encoding="UTF-8", warn=FALSE, skipNul=TRUE)

Summarising the data

Basic information about the “news” file is summarised below.

File size (in MB):

filesize = (file.info("final/en_US/en_US.news.txt")$size)/1024^2
print(filesize)
## [1] 196.2775

Row count:

linecount = length(datanews)
print(linecount)
## [1] 1010242

Word count:

# Approximate word count: count runs of non-word characters in each line, plus one
wordcount = sum(sapply(gregexpr("\\W+",datanews),length)+1)
print(wordcount)
## [1] 36721087
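
This regex-based figure is only a rough approximation, since it simply counts runs of non-word characters per line. If the stringi package happens to be installed, a quick cross-check could look like the sketch below; this is just an optional sanity check and not part of the analysis above.

# Optional cross-check, assuming the stringi package is installed
library(stringi)
sum(stri_count_words(datanews))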

Exploratory Analysis

Creating a sample

Since the file is too large to process in full, we extract a random 10% of its lines as a sample for further analysis.

set.seed(100)                                      # for reproducibility
sample = sample(datanews, length(datanews) * 0.1)  # draw a random 10% of the lines

Cleaning the data

We then clean up the sample to remove numbers, punctuation and extra whitespace, and to convert the text to lower case.

library(tm)
## Warning: package 'NLP' was built under R version 3.1.3
sample = VCorpus(VectorSource(sample))
sample = tm_map(sample, removeNumbers)
sample = tm_map(sample, removePunctuation)
sample = tm_map(sample, stripWhitespace)
sample = tm_map(sample, content_transformer(tolower))  # content_transformer() preserves the corpus structure

We also want to remove profanities. The reference list of common profanities I used is provided by tjrobinson via GitHub; I copied the list into a .txt file.

profanities = scan("profanity.txt","")             # read the profanity list, one entry per word
sample = tm_map(sample, removeWords, profanities)  # strip those words from the corpus

Attempt at Tokenisation

I aimed to create n-grams to prepare the dataset for word frequency analysis and, subsequently, construction of the prediction model. However, I could not get my function to work properly and I am nearly out of time, so I am leaving the report here for now and will keep working the kinks out.

My original function for tokenisation is shown below, commented out since it does not yet run.

# library(RWeka)
#
# nGram = function(ng) {
#     NgramTokenizer = function(x) NGramTokenizer(x, Weka_control(min = ng, max = ng))
#     gramtdm <- TermDocumentMatrix(sample, control = list(tokenize = NgramTokenizer))
#     gramtdm <- as.data.frame(apply(gramtdm, 1, sum))
#     colnames(gramtdm) = c("Frequency")
#     return(gramtdm)
# }
#
# unigramtdm = nGram(1)
# bigramtdm = nGram(2)
# trigramtdm = nGram(3)
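
Looking at this again, a likely culprit is the cleaning step rather than the function itself: in recent versions of tm, passing plain tolower to tm_map strips the document structure that TermDocumentMatrix expects, which is why the cleaning code above now wraps it in content_transformer(). With that fixed, a working version might look like the sketch below. It assumes the RWeka package (which needs a working Java installation) and the cleaned corpus sample from the previous section, and uses row_sums from the slam package (a dependency of tm) to avoid converting the sparse matrix to a dense one. This is a sketch of the intended approach, not a final implementation.

library(RWeka)

# Build an n-gram frequency table from the cleaned corpus
nGram = function(ng) {
    NgramTokenizer = function(x) NGramTokenizer(x, Weka_control(min = ng, max = ng))
    gramtdm = TermDocumentMatrix(sample, control = list(tokenize = NgramTokenizer))
    # Sum each n-gram's counts across all documents, most frequent first
    freq = sort(slam::row_sums(gramtdm), decreasing = TRUE)
    data.frame(Frequency = freq)
}

unigramtdm = nGram(1)
bigramtdm = nGram(2)
trigramtdm = nGram(3)
head(unigramtdm)   # a quick look at the most frequent terms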

Conclusion

I will continue working out the kinks in my tokenisation function, and extend what I have done above to the rest of the en_US files (the blogs and Twitter data). Hopefully the application will turn out fine!
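
As a rough outline of where the prediction algorithm is headed, the sketch below shows one simple way the n-gram frequency tables could be used once they are built: look up the most frequent trigram beginning with the user's last two words, and back off to bigrams when there is no match. The nextWord function is purely illustrative (it assumes the frequency tables produced by the nGram sketch above) and is not part of the work completed so far.

# Illustrative next-word lookup using the n-gram frequency tables (not yet implemented)
nextWord = function(lastTwoWords, trigramtdm, bigramtdm) {
    # Trigrams whose first two words match the input
    hits = grep(paste0("^", lastTwoWords, " "), rownames(trigramtdm))
    if (length(hits) > 0) {
        best = rownames(trigramtdm)[hits[which.max(trigramtdm$Frequency[hits])]]
        return(tail(strsplit(best, " ")[[1]], 1))
    }
    # Back off to bigrams starting with the last word only
    lastWord = tail(strsplit(lastTwoWords, " ")[[1]], 1)
    hits = grep(paste0("^", lastWord, " "), rownames(bigramtdm))
    if (length(hits) > 0) {
        best = rownames(bigramtdm)[hits[which.max(bigramtdm$Frequency[hits])]]
        return(tail(strsplit(best, " ")[[1]], 1))
    }
    NA_character_
}

# Example usage once the tables exist:
# nextWord("one of", trigramtdm, bigramtdm)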