This Capstone Project is completed as part of the requirements for the Coursera Data Science Specialization. To find out more about this specialization, follow this link: https://www.coursera.org/specializations/jhu-data-science
In this project, Natural Language Processing (NLP) is used to analyse a large collection of text and predict the next word for a series of words a user keys in. The algorithm is constructed from text samples extracted from 3 main sources: blogs, news and Twitter. This Milestone Report focuses mainly on loading and summarising the source data, sampling and cleaning it, exploring n-gram frequencies, and outlining the planned prediction algorithm.
The dataset used is downloaded from the following location: Coursera Download Link
Downloading and extracting this file yields the folder /final, which contains sample data in 4 different languages. For the purposes of this project, only the en_US data is used.
The 3 data files are:

- en_US.blogs.txt - text from blog posts
- en_US.news.txt - text from news articles posted online
- en_US.twitter.txt - tweets on Twitter

Firstly, some basic statistics on the source data files are explored; a sketch of how such figures could be computed follows the table below.
| File | File Size (bytes) | No of Rows | Word Count | Average Word Count (per row) | Longest Row (characters) | Shortest Row (characters) |
|---|---|---|---|---|---|---|
| Blogs | 210160014 | 899288 | 37334131 | 42 | 40833 | 1 |
| News | 205811889 | 77259 | 2643969 | 34 | 5760 | 2 |
| Twitter | 167105338 | 2360148 | 30373583 | 13 | 140 | 2 |
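The exact code used to compute these figures is not included in the Appendix. The following is a minimal sketch of one way such per-file statistics could be gathered; the file path is an assumption matching the layout used in the Appendix.

## Sketch only: one possible way to gather the per-file statistics above
file_stats <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  wordsPerRow <- sapply(strsplit(lines, "\\s+"), length)
  data.frame(
    size_bytes   = file.info(path)$size,
    rows         = length(lines),
    word_count   = sum(wordsPerRow),
    avg_words    = round(mean(wordsPerRow)),
    longest_row  = max(nchar(lines)),
    shortest_row = min(nchar(lines))
  )
}
file_stats('../myCapCode/final/en_US/en_US.blogs.txt')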
The whole dataset is huge and therefore difficult to process in its entirety given the limited resources of a regular laptop, so a subset of the data is extracted for processing. About 30,000 rows are taken at random from each dataset.
Next, the data is cleaned, generally by converting all text to lower case, removing punctuation and numbers, and stripping extra whitespace.
**Code for sampling & data cleanup is available in the Appendix.**
The cleaned-up data is converted into a corpus for further processing. The corpus (which consists of all 3 datasets) is tokenized into unigrams, bigrams, trigrams and quadgrams.
The top terms for each of these n-grams are listed here.
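The code that produced the top-term lists is not reproduced in the Appendix. As a rough sketch, assuming the term-document matrices dtm1 to dtm4 built in the Appendix, term frequencies could be summed, ranked and plotted like this (shown for bigrams):

## Sketch only: rank bigram frequencies from the term-document matrix dtm2
## (the slam package is installed alongside tm and stores its sparse matrices)
library(ggplot2)
bigram_freq <- sort(slam::row_sums(dtm2), decreasing = TRUE)
head(bigram_freq, 10)
## simple frequency plot of the top 10 bigrams
top10 <- data.frame(term  = names(bigram_freq)[1:10],
                    count = as.integer(bigram_freq[1:10]))
ggplot(top10, aes(x = reorder(term, count), y = count)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(x = "Bigram", y = "Frequency")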
A word cloud is used for better visualization of the top terms.
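The plotting code is not shown in the Appendix; a word cloud could be generated with the wordcloud package (an assumption), reusing the same frequency-ranking approach as above:

## Sketch only: word cloud of the most frequent unigrams from dtm1
library(wordcloud)
unigram_freq <- sort(slam::row_sums(dtm1), decreasing = TRUE)
wordcloud(words = names(unigram_freq), freq = unigram_freq,
          max.words = 100, random.order = FALSE)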
Further work will need to be done to clean up the dataset.
It is planned to use trigrams, bigrams and unigrams to predict the next word the user might type. Quadgrams might be included in the final algorithm, depending on processing capability and performance optimization. Generally, the algorithm will look for a match on the longest available n-gram first and back off to shorter n-grams when no match is found.
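As an illustration of this backoff idea only, the sketch below assumes hypothetical frequency tables tri_df, bi_df and uni_df holding n-gram counts with columns prefix, word and count; it is not the final algorithm.

## Sketch only: hypothetical backoff lookup over assumed frequency tables
predict_next <- function(phrase, tri_df, bi_df, uni_df) {
  words <- tolower(unlist(strsplit(phrase, "\\s+")))
  words <- words[words != ""]
  n <- length(words)
  ## try the trigram table first: match on the last two words typed
  if (n >= 2) {
    hits <- tri_df[tri_df$prefix == paste(words[n-1], words[n]), ]
    if (nrow(hits) > 0) return(hits$word[which.max(hits$count)])
  }
  ## back off to the bigram table: match on the last word typed
  if (n >= 1) {
    hits <- bi_df[bi_df$prefix == words[n], ]
    if (nrow(hits) > 0) return(hits$word[which.max(hits$count)])
  }
  ## final fallback: the most frequent unigram overall
  uni_df$word[which.max(uni_df$count)]
}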
## Generate Sample
gen_sample <- function(file, row=1000, seed=123) {
  ## set file paths for reading and writing
  infile <- file(paste('../myCapCode/final/en_US/',
                       ifelse(file=='blog', 'en_US.blogs.txt',
                       ifelse(file=='news', 'en_US.news.txt', 'en_US.twitter.txt')),
                       sep=''), 'rb')
  outfile <- file(paste('../myCapCode/final/sample/',
                        ifelse(file=='blog', 'blogs.txt',
                        ifelse(file=='news', 'news.txt', 'twitter.txt')),
                        sep=''), 'w')
  ## set seed and decide (via rbinom) which of the next row*2 lines to keep
  set.seed(seed)
  samplerows <- rbinom(n=row*2, size=1, prob=0.6)
  ## read in rows and write out the chosen ones, until the required # of rows are attained
  written <- 0
  for (i in 1:(row*2)) {
    ## read in one line at a time
    currLine <- readLines(infile, n=1, encoding="UTF-8", skipNul=TRUE)
    ## stop if end of file is reached
    if (length(currLine) == 0) break
    ## write out current row if chosen
    if (samplerows[i] == 1) {
      writeLines(currLine, outfile)
      written <- written + 1
    }
    ## stop if enough rows written out
    if (written >= row) break
  }
  ## close connections once, whether or not the loop ended early
  close(infile)
  close(outfile)
  return(written)
}
gen_sample('blog',30000,seed=88)
gen_sample('news',30000,seed=88)
gen_sample('twitter',30000,seed=88)
Further refinements will be made here to improve corpus quality.
library(tm)
library(RWeka)
library(ggplot2)
sample_path <- '../myCapCode/final/sample/'
## create corpus using all 3 sample files
doc <- Corpus(DirSource(sample_path),
              readerControl = list(reader = readPlain, language = "en_US", load = TRUE))
## corpus cleanup
d <- doc
d <- tm_map(d, content_transformer(tolower))  # lower-case all text
d <- tm_map(d, removePunctuation)             # remove punctuation
d <- tm_map(d, removeNumbers)                 # remove numbers
d <- tm_map(d, stripWhitespace)               # collapse extra whitespace
## RWeka tokenizers for unigrams through quadgrams
n1gramtoken <- function (x) NGramTokenizer(x, Weka_control(min = 1, max = 1, delimiters = " \\r\\n\\t.,;:\"()?!"))
n2gramtoken <- function (x) NGramTokenizer(x, Weka_control(min = 2, max = 2, delimiters = " \\r\\n\\t.,;:\"()?!"))
n3gramtoken <- function (x) NGramTokenizer(x, Weka_control(min = 3, max = 3, delimiters = " \\r\\n\\t.,;:\"()?!"))
n4gramtoken <- function (x) NGramTokenizer(x, Weka_control(min = 4, max = 4, delimiters = " \\r\\n\\t.,;:\"()?!"))
## term-document matrices for each n-gram size
dtm1 <- TermDocumentMatrix(d, control = list(tokenize = n1gramtoken))
dtm2 <- TermDocumentMatrix(d, control = list(tokenize = n2gramtoken))
dtm3 <- TermDocumentMatrix(d, control = list(tokenize = n3gramtoken))
dtm4 <- TermDocumentMatrix(d, control = list(tokenize = n4gramtoken))
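As a final sketch (an assumption, not code from this report), the trigram term-document matrix could be collapsed into the kind of frequency table that the backoff sketch earlier assumes, with columns prefix, word and count:

## Sketch only: collapse dtm3 into a trigram frequency table
tri_counts <- slam::row_sums(dtm3)
tri_terms  <- strsplit(names(tri_counts), " ")
tri_df <- data.frame(
  prefix = sapply(tri_terms, function(t) paste(t[1], t[2])),  # first two words
  word   = sapply(tri_terms, function(t) t[3]),               # word to predict
  count  = as.integer(tri_counts),
  stringsAsFactors = FALSE
)
## sort so the most frequent completion of each prefix comes first
tri_df <- tri_df[order(tri_df$prefix, -tri_df$count), ]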