This report is a milestone for the Data Science Capstone. Its main goal is to describe the progress made so far towards a text prediction web app. The three main tasks covered in this milestone report are loading and summarizing the data, performing exploratory analysis on it, and outlining the plans for the prediction algorithm and the Shiny app.
The downloaded dataset contains files in four different languages (German, English, Finnish and Russian). For this project, English is the chosen language. There are three files containing text from news websites, Twitter, and blogs. The code below reads the text files.
con_news <- file("./data/final/en_US/en_US.news.txt", "r")
con_twitter <- file("./data/final/en_US/en_US.twitter.txt", "r")
con_blogs <- file("./data/final/en_US/en_US.blogs.txt", "r")
lines_news <- readLines(con_news, encoding = "UTF-8" , skipNul = TRUE)
lines_twitter <- readLines(con_twitter, encoding = "UTF-8", skipNul = TRUE)
lines_blogs <- readLines(con_blogs, encoding = "UTF-8", skipNul = TRUE)
# Close connections
close(con_news)
close(con_twitter)
close(con_blogs)
Some basic statistics about the size, number of lines, and number of words of each dataset are shown below.
if (!require("stringi")) install.packages("stringi")
library(stringi)
# Number of lines in each file
num_lines_news <- length(lines_news)
num_lines_twitter <- length(lines_twitter)
num_lines_blogs <- length(lines_blogs)
# Total number of words
num_words_news <- sum(stri_count_words(lines_news))
num_words_twitter <- sum(stri_count_words(lines_twitter))
num_words_blogs <- sum(stri_count_words(lines_blogs))
# File size of each dataset in megabytes
size_news <- file.info("./data/final/en_US/en_US.news.txt")$size / 1024^2
size_twitter <- file.info("./data/final/en_US/en_US.twitter.txt")$size / 1024^2
size_blog <- file.info("./data/final/en_US/en_US.blogs.txt")$size / 1024^2
# Maximum number of characters in a single line
max_char_news <- max(nchar(lines_news))
max_char_twitter <- max(nchar(lines_twitter))
max_char_blogs <- max(nchar(lines_blogs))
summary_stats <- data.frame(Data = c("News", "Twitter", "Blog"),
                            File_size_MB = c(size_news, size_twitter, size_blog),
                            Number_lines = c(num_lines_news, num_lines_twitter, num_lines_blogs),
                            Number_words = c(num_words_news, num_words_twitter, num_words_blogs),
                            Max_num_char = c(max_char_news, max_char_twitter, max_char_blogs))
summary_stats
##      Data File_size_MB Number_lines Number_words Max_num_char
## 1    News     196.2775      1010242     34762395        11384
## 2 Twitter     159.3641      2360148     30093410          140
## 3    Blog     200.4242       899288     37546246        40833
Note that the files are large (roughly 150-200 MB each), which may require extra attention in the implementation of the proposed approach, since memory usage is a real concern.
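As a quick diagnostic (not part of the modelling pipeline itself), the in-memory footprint of the raw character vectors can be checked with base R's object.size():

# Approximate in-memory size of each raw text vector
format(object.size(lines_news), units = "MB")
format(object.size(lines_twitter), units = "MB")
format(object.size(lines_blogs), units = "MB")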
The dataset is extremely large; for this reason, only a portion of it was used for training in order to reduce the computational overhead. A total of 1% of the data was randomly sampled. The training data was saved to separate files to avoid recreating the subsample every time.
# Randomly sample 1% of each source; set a seed so the subsample is reproducible
set.seed(1234)
training_perc <- 0.01
news_train <- sample(lines_news, size = round(training_perc * length(lines_news)))
twitter_train <- sample(lines_twitter, size = round(training_perc * length(lines_twitter)))
blogs_train <- sample(lines_blogs, size = round(training_perc * length(lines_blogs)))
# Write the training samples out to separate files
fileConn <- file("./data/final/en_US/training/news_train.txt")
writeLines(news_train, fileConn)
close(fileConn)
fileConn <- file("./data/final/en_US/training/twitter_train.txt")
writeLines(twitter_train, fileConn)
close(fileConn)
fileConn <- file("./data/final/en_US/training/blogs_train.txt")
writeLines(blogs_train, fileConn)
close(fileConn)
# Load training data
con_newstrain <- file("./data/final/en_US/training/news_train.txt", "r")
con_blogstrain <- file("./data/final/en_US/training/blogs_train.txt", "r")
con_twittertrain <- file("./data/final/en_US/training/twitter_train.txt", "r")
train_news <- readLines(con_newstrain, -1)
train_twitter <- readLines(con_twittertrain, -1)
train_blogs <- readLines(con_blogstrain, -1)
close(con_newstrain)
close(con_blogstrain)
close(con_twittertrain)
This next step consists of breaking the text into words, a process called tokenization. In fact, tokenization does not only break text into words, but into items or tokens, which can also be things like punctuation and numbers. To avoid redundancy, all tokens are converted to lowercase and stemmed. In order to obtain cleaner data, numbers, extra whitespace, and stopwords are also removed.
if (!require("tm")) install.packages("tm")
if (!require("SnowballC")) install.packages("SnowballC")
library(tm)
library(SnowballC)
# Create a corpus
data_path <- "./data/final/en_US/training/"
docs <- Corpus(DirSource(data_path))
# Removing punctuation
docs <- tm_map(docs, removePunctuation)
# Removing numbers
docs <- tm_map(docs, removeNumbers)
# Converting to lowercase (content_transformer keeps the corpus structure intact)
docs <- tm_map(docs, content_transformer(tolower))
# Removing stopwords
docs <- tm_map(docs, removeWords, stopwords("english"))
# Stemming (e.g., stripping endings such as "ing", "es", "s")
docs <- tm_map(docs, stemDocument)
#Removing whitespace
docs <- tm_map(docs, stripWhitespace)
# Transform docs back into text
docs <- tm_map(docs, PlainTextDocument)
In order to explore the data, some frequency analyses were performed. First, after all the cleaning and tokenization, a document-feature matrix was created and the 100 most frequent words were plotted as a word cloud. Words like "will", "one", and "just" are very frequent. However, a more meaningful representation is given by bi-grams and tri-grams, which correspond to combinations of two and three consecutive words, respectively.
if (!require("RColorBrewer")) install.packages("RColorBrewer")
if (!require("quanteda")) install.packages("quanteda")
if (!require("wordcloud")) install.packages("wordcloud")
library(RColorBrewer)
library(quanteda)
library(wordcloud)
# Create a document-feature matrix (dfm) of single words
myCorpus <- corpus(docs)
myDfm <- dfm(myCorpus, ngrams = 1, verbose = FALSE)
# Plot the 100 most frequent words
plot(myDfm, max.words = 100, colors = brewer.pal(6, "Dark2"),
scale = c(4, .2))
myDfm_2gram <- dfm(myCorpus, ngrams = 2)
## Creating a dfm from a corpus ...
##    ... lowercasing
##    ... tokenizing
##    ... indexing documents: 3 documents
##    ... indexing features: 472,270 feature types
##    ... created a 3 x 472270 sparse dfm
##    ... complete.
## Elapsed time: 962 seconds.
# Total count of each 2-gram across the three sources
freq_2gram <- colSums(myDfm_2gram)
barplot(sort(freq_2gram, decreasing = TRUE)[1:10], cex.names = 0.7,
main = "2-grams", las = 2)
It can be seen that bigrams such as "right now", "dont know", "high school", and "new york" are among the most frequent.
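As a first hint of how these counts can drive prediction, the short sketch below filters the bigram counts to those starting with a chosen word (here "high") so the most frequent continuations can be read off. It assumes the default "_" concatenator that this version of dfm() uses to join the tokens of an n-gram.

# Most frequent bigrams beginning with "high" (tokens joined by "_")
following_high <- freq_2gram[grepl("^high_", names(freq_2gram))]
head(sort(following_high, decreasing = TRUE), 5)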
myDfm_3gram <- dfm(myCorpus, ngrams = 3)
## Creating a dfm from a corpus ...
##    ... lowercasing
##    ... tokenizing
##    ... indexing documents: 3 documents
##    ... indexing features: 558,344 feature types
##    ... created a 3 x 558344 sparse dfm
##    ... complete.
## Elapsed time: 1070 seconds.
# Total count of each 3-gram across the three sources
freq_3gram <- colSums(myDfm_3gram)
barplot(sort(freq_3gram, decreasing = TRUE)[1:10], cex.names = 0.7,
main = "3-grams", las = 2)
For 3-grams, constructions like "cant wait see", "happy mothers day", and "let us know" are very common, which indicates that once we see the term "happy mothers", there is a high probability that the next term will be "day".
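To make this intuition concrete, a simple maximum-likelihood estimate of the next-word probability divides the count of a trigram by the count of its two-word prefix. The helper below is only a sketch: it assumes the "_" concatenator used by dfm(), and the exact token forms depend on the cleaning steps above (stemming may alter them, e.g. "happi_mother_day").

# MLE of P(w3 | w1 w2) = count(w1_w2_w3) / count(w1_w2)
next_word_prob <- function(w1, w2, w3) {
  tri <- paste(w1, w2, w3, sep = "_")
  bi <- paste(w1, w2, sep = "_")
  tri_count <- if (tri %in% names(freq_3gram)) freq_3gram[[tri]] else 0
  bi_count <- if (bi %in% names(freq_2gram)) freq_2gram[[bi]] else 0
  if (bi_count == 0) return(NA_real_)
  tri_count / bi_count
}
next_word_prob("happy", "mothers", "day")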
Next steps include learning more about Markov chains and how to apply them to this problem. What I have gathered so far from this approach is that, given the word(s) we have just seen, we want to find the most likely word to appear next, and the most frequent n-grams are indications of the likelihood of the next words. However, it is still not clear what the optimal n-gram size is, which may take some experimentation on the test set to determine.
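As a rough sketch of how this could turn into a predictor (a draft for exploration, not the final algorithm), the function below looks for the most frequent trigram continuation of the last two words and backs off to the bigram table when no matching trigram exists. It reuses the freq_2gram and freq_3gram tables from above, assumes the same "_" concatenator, and real input would need the same preprocessing applied to the corpus.

# Draft next-word predictor with a simple trigram-to-bigram backoff
predict_next <- function(w1, w2) {
  # Trigrams whose first two tokens match the input words
  cand3 <- freq_3gram[grepl(paste0("^", w1, "_", w2, "_"), names(freq_3gram))]
  if (length(cand3) > 0) {
    best <- names(which.max(cand3))
    return(strsplit(best, "_")[[1]][3])
  }
  # Back off to bigrams starting with the last word only
  cand2 <- freq_2gram[grepl(paste0("^", w2, "_"), names(freq_2gram))]
  if (length(cand2) > 0) {
    best <- names(which.max(cand2))
    return(strsplit(best, "_")[[1]][2])
  }
  NA_character_
}
predict_next("happy", "mothers")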
Once the algorithm is implemented based on the training set, a Shiny app will be developed with an input interface that captures some text from the user and predicts the most likely next word.
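A minimal sketch of what the planned Shiny app could look like, assuming the draft predict_next() function above; the real interface will be refined once the prediction algorithm is finalized.

library(shiny)

ui <- fluidPage(
  titlePanel("Next word prediction (prototype)"),
  textInput("user_text", "Type a phrase:"),
  textOutput("prediction")
)

server <- function(input, output) {
  output$prediction <- renderText({
    # Split the input on whitespace and feed the last two words
    # to the draft predictor sketched above
    words <- strsplit(tolower(input$user_text), "\\s+")[[1]]
    if (length(words) < 2) return("Type at least two words...")
    predict_next(words[length(words) - 1], words[length(words)])
  })
}

shinyApp(ui = ui, server = server)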