This project aims to develop a predictive text product as part of the Capstone Project for the Data Science Specialisation on Coursera. The product is based on Natural Language Processing (NLP), a multi-disciplinary field concerned with the interactions between computers and human (natural) languages.
This report explains and summarises the exploratory analysis I have carried out on the data, and briefly outlines my plans and goals for the prediction algorithm and application.
The dataset is from a corpus called HC Corpora.
zipfile = "Coursera-SwiftKey.zip"
fileURL = "http://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(fileURL, zipfile)
unzip(zipfile)
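Since the archive only needs to be downloaded once, re-runs of this report could skip the download and extraction when the files are already present. A minimal sketch of such a guard (my own addition, not part of the original workflow):
if (!file.exists(zipfile)) {
    download.file(fileURL, zipfile)
}
if (!file.exists("final")) {
    # unzip() extracts the corpus into the "final" folder
    unzip(zipfile)
}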
There are four sub-folders included in the dataset.
list.files("final")
## [1] "de_DE" "en_US" "fi_FI" "ru_RU"
For the subsequent analysis, we will use only the “news” file in the en_US folder.
list.files("final/en_US")
## [1] "en_US.blogs.txt" "en_US.news.txt" "en_US.twitter.txt"
datanews = readLines("final/en_US/en_US.news.txt", encoding="UTF-8", warn=FALSE, skipNul=TRUE)
Basic information about the “news” file is summarised below.
File size (in MB):
filesize = (file.info("final/en_US/en_US.news.txt")$size)/1024^2
print(filesize)
## [1] 196.2775
Line count:
linecount = length(datanews)
print(linecount)
## [1] 1010242
Approximate word count:
wordcount = sum(sapply(gregexpr("\\W+",datanews),length)+1)
print(wordcount)
## [1] 36721087
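The regular-expression count above is only a rough estimate, since it adds one word per run of non-word characters on each line. As a cross-check (my own addition), one could instead count whitespace-separated tokens; both approaches are approximations, and the exact figure is not critical for this analysis.
# Alternative rough word count: split each line on runs of whitespace and count the pieces
altwordcount = sum(sapply(strsplit(datanews, "\\s+"), length))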
Since the file is too big to examine completely, we extract 10% of the file as a sample for further analysis.
set.seed(100)
sample = sample(datanews, length(datanews) * 0.1)
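Since reading and sampling the full 196 MB file is relatively slow, the sample could optionally be written to disk so that later sessions can start from the smaller file. A small sketch (my own addition; the file name is just an example):
# Persist the 10% sample for reuse in later sessions (optional)
writeLines(sample, "en_US.news.sample.txt")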
We then clean up the sample by removing numbers, punctuation and extra whitespace, and converting the text to lower case.
library(tm)
sample = VCorpus(VectorSource(sample))
sample = tm_map(sample, removeNumbers)
sample = tm_map(sample, removePunctuation)
sample = tm_map(sample, stripWhitespace)
sample = tm_map(sample, content_transformer(tolower))
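Note that in current versions of tm, base functions such as tolower need to be wrapped in content_transformer() so that the corpus structure is preserved; that is the reason for the wrapper in the last step above. The same wrapper can be used for custom cleaning steps. As a sketch of my own (not part of the original pipeline), URLs could be stripped the same way, ideally before punctuation is removed:
# Custom transformation built with content_transformer(); sketch only.
# In practice this would be applied before removePunctuation, which breaks URLs apart.
removeURL = content_transformer(function(x) gsub("http\\S+|www\\.\\S+", "", x))
# sample = tm_map(sample, removeURL)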
We also want to remove profanities. The reference list of common profanities used here is provided by tjrobinson on GitHub; I copied and pasted the list into a local .txt file.
profanities = scan("profanity.txt","")
sample = tm_map(sample, removeWords, profanities)
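One small caveat (my own note): scan() with what = "" reads whitespace-separated tokens, so any multi-word entries in the profanity list are split into separate words. If the list contains multi-word phrases, reading it line by line would keep them intact:
# Alternative: one profanity per line, multi-word phrases preserved
# profanities = readLines("profanity.txt")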
I aimed to create n-grams to prepare the dataset for word frequency analysis and, subsequently, construction of the prediction model. However, I could not get my tokenisation function to work properly in time for this report, so I am leaving the report here for now and will continue to work the kinks out.
My original tokenisation function is shown below, commented out since it does not yet run end-to-end.
# library(RWeka)
#
# nGram = function(ng) {
#     # Tokeniser producing n-grams of length ng
#     NgramTokenizer = function(x) NGramTokenizer(x, Weka_control(min = ng, max = ng))
#     gramtdm = TermDocumentMatrix(sample, control = list(tokenize = NgramTokenizer))
#     # Sum term counts across documents; slam::row_sums avoids building a dense matrix
#     gramtdm = as.data.frame(slam::row_sums(gramtdm))
#     colnames(gramtdm) = c("Frequency")
#     return(gramtdm)
# }
#
# unigramtdm = nGram(1)
# bigramtdm = nGram(2)
# trigramtdm = nGram(3)
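Once the tokeniser runs end-to-end, the resulting frequency tables would be sorted and inspected to pick out the most common n-grams for the prediction model. A rough sketch of that inspection step, shown commented out like the function above since it depends on the code above being run:
# head(bigramtdm[order(-bigramtdm$Frequency), , drop = FALSE], 10)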
Next, I will work out the remaining issues in my tokenisation function and extend the analysis above to the rest of the en_US files. The word frequency tables will then feed into the prediction model and, finally, the application. Hopefully the application will turn out fine!