Introduction

This report is a milestone for the Data Science Capstone. Its main goal is to describe the progress made so far towards a text prediction web app. The main tasks expected for this milestone report are:

  1. Demonstrate that you’ve downloaded the data and have successfully loaded it in.
  2. Create a basic report of summary statistics about the data sets.
  3. Report any interesting findings that you amassed so far.
  4. Get feedback on your plans for creating a prediction algorithm and Shiny app.

Loading Data

The downloaded dataset contains files in four different languages (German, English, Finnish, and Russian). For this project, English is the chosen language. There are three files containing text from news websites, Twitter, and blogs. The code below reads the text files.

# Open connections to the three English text files
con_news <- file("./data/final/en_US/en_US.news.txt", "r")
con_twitter <- file("./data/final/en_US/en_US.twitter.txt", "r")
con_blogs <- file("./data/final/en_US/en_US.blogs.txt", "r")

# Read all lines, keeping UTF-8 encoding and skipping embedded nulls
lines_news <- readLines(con_news, encoding = "UTF-8", skipNul = TRUE)
lines_twitter <- readLines(con_twitter, encoding = "UTF-8", skipNul = TRUE)
lines_blogs <- readLines(con_blogs, encoding = "UTF-8", skipNul = TRUE)

# Close connections
close(con_news)
close(con_twitter)
close(con_blogs)

Basic Statistics of the dataset

Some basic statistics about the file size (in megabytes), number of lines, and number of words of each file are shown below.

if (!require("stringi")) install.packages("stringi")
library(stringi)

# Number of lines in each file
num_lines_news <- length(lines_news)
num_lines_twitter <- length(lines_twitter)
num_lines_blogs <- length(lines_blogs)

# Total number of words
num_words_news <- sum(stri_count_words(lines_news))
num_words_twitter <- sum(stri_count_words(lines_twitter))
num_words_blogs <- sum(stri_count_words(lines_blogs))

# File size in megabytes (MB)
size_news <- file.info("./data/final/en_US/en_US.news.txt")$size / 1024^2
size_twitter <- file.info("./data/final/en_US/en_US.twitter.txt")$size / 1024^2
size_blog <- file.info("./data/final/en_US/en_US.blogs.txt")$size / 1024^2

# Maximum number of characters in a line
max_char_news <- max(nchar(lines_news))
max_char_twitter <- max(nchar(lines_twitter))
max_char_blogs <- max(nchar(lines_blogs))

summary_stats <- data.frame(Data = c("News", "Twitter", "Blog"),
                            File_size = c(size_news, size_twitter, size_blog),
                            Number_lines = c(num_lines_news, num_lines_twitter, num_lines_blogs),
                            Number_words = c(num_words_news, num_words_twitter, num_words_blogs),
                            Max_num_char = c(max_char_news, max_char_twitter, max_char_blogs))
summary_stats
##      Data File_size Number_lines Number_words Max_num_char
## 1    News  196.2775      1010242     34762395        11384
## 2 Twitter  159.3641      2360148     30093410          140
## 3    Blog  200.4242       899288     37546246        40833

Note that the files are quite large, which will require extra attention in the implementation of the proposed approach, since memory usage is a concern.
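
As a quick check on this concern, the in-memory size of each character vector can be inspected. This is a minimal sketch using base R's object.size; the values will differ from the on-disk file sizes because of R's internal string representation.

# Approximate in-memory size of each character vector
format(object.size(lines_news), units = "MB")
format(object.size(lines_twitter), units = "MB")
format(object.size(lines_blogs), units = "MB")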

Data Partition

Because the dataset is extremely large, only a portion of it was used for training in order to reduce the overhead. A total of 1% of each file was randomly sampled. The training data was saved to separate files to avoid recreating the subsample every time.

training_perc <- 0.01

set.seed(1234)  # fix the random seed so the subsample is reproducible
news_train <- sample(lines_news, round(training_perc * length(lines_news)))
twitter_train <- sample(lines_twitter, round(training_perc * length(lines_twitter)))
blogs_train <- sample(lines_blogs, round(training_perc * length(lines_blogs)))

# Write the training samples to separate files
fileConn <- file("./data/final/en_US/training/news_train.txt")
writeLines(news_train, fileConn)
close(fileConn)

fileConn <- file("./data/final/en_US/training/twitter_train.txt")
writeLines(twitter_train, fileConn)
close(fileConn)

fileConn <- file("./data/final/en_US/training/blogs_train.txt")
writeLines(blogs_train, fileConn)
close(fileConn)

# Load training data
con_newstrain <- file("./data/final/en_US/training/news_train.txt", "r")
con_blogstrain <- file("./data/final/en_US/training/blogs_train.txt", "r")
con_twittertrain <- file("./data/final/en_US/training/twitter_train.txt", "r")

train_news <- readLines(con_newstrain, -1)
train_twitter <- readLines(con_twittertrain, -1)
train_blogs <- readLines(con_blogstrain, -1)

close(con_newstrain)
close(con_blogstrain)
close(con_twittertrain)

Tokenization

This next step consists of breaking the text into words, a process called tokenization. In fact, tokenization does not only break text into words but into items, or tokens, which can also include punctuation and numbers. To avoid redundancy, all tokens are converted to lowercase and stemmed. To obtain cleaner data, numbers, extra whitespace, and stopwords are also removed.

if (!require("tm")) install.packages("tm")
if (!require("SnowballC")) install.packages("SnowballC")
library(tm)
library(SnowballC) 

# Create a corpus
data_path <- "./data/final/en_US/training/"
docs <- Corpus(DirSource(data_path))   

# Removing punctuation
docs <- tm_map(docs, removePunctuation)   
# Removing numbers
docs <- tm_map(docs, removeNumbers)   
# Converting to lowercase
docs <- tm_map(docs, content_transformer(tolower))
# Removing English stopwords
docs <- tm_map(docs, removeWords, stopwords("english"))
# Stemming (e.g., removing "ing", "es", "s" endings)
docs <- tm_map(docs, stemDocument)
# Removing extra whitespace
docs <- tm_map(docs, stripWhitespace)

# Transform docs back into text
docs <- tm_map(docs, PlainTextDocument)

Exploratory Data Analysis

In order to explore the data, some frequency analysis was performed. First, after all the cleaning and tokenization, a document-feature matrix was created and the 100 most frequent words were plotted as a word cloud. Words like “will”, “one”, and “just” are very frequent. However, a more meaningful representation is given by bi-grams and tri-grams, which correspond to combinations of two and three consecutive words, respectively.

if (!require("RColorBrewer")) install.packages("RColorBrewer")
if (!require("quanteda")) install.packages("quanteda")
if (!require("wordcloud")) install.packages("wordcloud")

library(RColorBrewer)
library(quanteda)
library(wordcloud)


# Create a document-feature matrix of unigrams
myCorpus <- corpus(docs)
myDfm <- dfm(myCorpus, ngrams = 1, verbose = FALSE)

# Plot the 100 most frequent words as a word cloud
plot(myDfm, max.words = 100, colors = brewer.pal(6, "Dark2"),
     scale = c(4, .2))

myDfm_2gram <- dfm(myCorpus, ngrams = 2)
## Creating a dfm from a corpus ...
##    ... lowercasing
##    ... tokenizing
##    ... indexing documents: 3 documents
##    ... indexing features: 472,270 feature types
##    ... created a 3 x 472270 sparse dfm
##    ... complete.
## Elapsed time: 962 seconds.
freq_2gram <- colSums(myDfm_2gram)


barplot(sort(freq_2gram, decreasing = TRUE)[1:10], cex.names = 0.7, 
        main = "2-grams", las = 2)

It can be seen that bi-grams such as “right now”, “dont know”, “high school”, and “new york” are also very frequent.

myDfm_3gram <- dfm(myCorpus, ngrams = 3)
## Creating a dfm from a corpus ...
##    ... lowercasing
##    ... tokenizing
##    ... indexing documents: 3 documents
##    ... indexing features: 558,344 feature types
##    ... created a 3 x 558344 sparse dfm
##    ... complete.
## Elapsed time: 1070 seconds.
freq_3gram <- colSums(myDfm_3gram)

barplot(sort(freq_3gram, decreasing = TRUE)[1:10], cex.names = 0.7, 
        main = "3-grams", las = 2)

For 3-grams, constructions like “cant wait see”, “happy mothers day”, and “let us know” are very common, which indicates that once we see the term “happy mothers”, there is a high probability that the next term will be “day”.
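
This intuition can be made concrete by turning the n-gram counts into a conditional probability estimate. Below is a minimal sketch, assuming the quanteda features are joined with "_" (the default concatenator) and that the quoted feature names survive the cleaning and stemming steps exactly as written; otherwise the lookup will return NA.

# Estimate P(next = "day" | prefix = "happy mothers") as
# count("happy_mothers_day") / count("happy_mothers")
count_trigram <- freq_3gram["happy_mothers_day"]
count_bigram <- freq_2gram["happy_mothers"]
unname(count_trigram / count_bigram)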

Plans for creating a prediction algorithm

Next steps include studying Markov chains in more depth and applying them to this problem. The idea so far is that, given the words we have already seen, the model should predict the most likely word to appear next. The most frequent n-grams are indications of the likelihood of the next words. However, it is still not clear what the optimal n-gram size is, which may require some experimentation on a test set. A first rough sketch of this idea is shown below.
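
The trigram and bigram frequency tables computed above could already drive a simple back-off predictor. The helper below is a hypothetical sketch, not the final algorithm; it assumes the quanteda features use "_" as the separator and that the input text has been cleaned the same way as the training data.

# Hypothetical back-off predictor: look for trigrams that start with the last
# two observed words, falling back to bigrams that start with the last word.
predict_next_word <- function(text, freq_2gram, freq_3gram, n_best = 3) {
  words <- unlist(strsplit(tolower(text), "\\s+"))
  last_two <- paste(tail(words, 2), collapse = "_")
  last_one <- tail(words, 1)

  # Candidate trigrams whose first two words match the last two observed words
  hits <- freq_3gram[grepl(paste0("^", last_two, "_"), names(freq_3gram))]
  if (length(hits) == 0) {
    # Back off to bigrams whose first word matches the last observed word
    hits <- freq_2gram[grepl(paste0("^", last_one, "_"), names(freq_2gram))]
  }
  if (length(hits) == 0) return(character(0))

  # Keep the n_best most frequent matches and return only their final word
  top <- sort(hits, decreasing = TRUE)[seq_len(min(n_best, length(hits)))]
  sapply(strsplit(names(top), "_"), tail, 1)
}

predict_next_word("happy mothers", freq_2gram, freq_3gram)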

Once the algorithm is implemented based on the training set, a Shiny app will be developed containing an input interface that captures some text from the user and predicts the most likely next word.
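
A minimal sketch of what such a Shiny app could look like, assuming a predict_next_word() helper like the hypothetical one sketched above and the n-gram frequency tables are available in the app environment:

if (!require("shiny")) install.packages("shiny")
library(shiny)

ui <- fluidPage(
  titlePanel("Next Word Prediction (sketch)"),
  textInput("user_text", "Type some text:", value = ""),
  verbatimTextOutput("prediction")
)

server <- function(input, output) {
  output$prediction <- renderText({
    if (nchar(trimws(input$user_text)) == 0) return("Waiting for input...")
    # predict_next_word() is the hypothetical helper sketched above
    preds <- predict_next_word(input$user_text, freq_2gram, freq_3gram)
    if (length(preds) == 0) "No prediction available" else paste(preds, collapse = ", ")
  })
}

shinyApp(ui = ui, server = server)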