Overview

This milestone report presents an exploratory analysis and outlines the next steps toward building a “next word” prediction algorithm. The algorithm will be based on English-language sentences contained in the Corpora dataset, which was collected from publicly available sources by a web crawler. The report summarizes the dataset, highlights interesting findings, and sketches a plan for the prediction algorithm.

Exploratory analysis

First, it is important to look at the structure of the available datasets. Here are samples from the three datasets, which contain Twitter, news, and blog sentences:

## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”."
## [1] "He wasn't home alone, apparently."
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."

Dataset Descriptive Statistics

Next, we compute simple statistics for each dataset (blogs, news, Twitter): the number of lines (samples), the total number of words, and the average number of words per line:

library(stringi)

## Count lines in sources
nBlogLines <- length(blogs)
nNewsLines <- length(news)
nTwitterLines <- length(twitter)
nTotalLines <- nBlogLines + nNewsLines + nTwitterLines

## Count words in each source
blogsWords <- sum(stri_count_words(blogs))
newsWords <- sum(stri_count_words(news))
twitterWords <- sum(stri_count_words(twitter))
totalWords <- blogsWords + newsWords + twitterWords

## Summarize sources
sourceLines <- c(nBlogLines, nNewsLines, nTwitterLines, nTotalLines)
sourceWords <- c(blogsWords, newsWords, twitterWords, totalWords)
sourceWordsPerLine <- c(blogsWords/nBlogLines,
                        newsWords/nNewsLines,
                        twitterWords/nTwitterLines,
                        totalWords/nTotalLines
)
sourceSummary <- data.frame(sourceLines, sourceWords, sourceWordsPerLine)
rownames(sourceSummary) <- c("Blogs", "News", "Twitter", "Total")
colnames(sourceSummary) <- c("Number of lines", "Number of words", "Average number of words per line")
sourceSummary

Data Cleaning

In natural language processing it is common practice to clean the data (e.g., removing numbers and punctuation, converting all characters to lower case, removing stop words) before using it to build a model, since cleaning generally improves model performance. Hence, numbers, stop words, punctuation, and extra whitespace have been removed from the text data using the following code:

library(dplyr)
library(tm)
library(tidytext)

toSpace <- content_transformer(function (x, pattern) gsub(pattern, " ", x))

text_transformer <- function(text_corpus) {
    text_corpus <- tm_map(text_corpus, toSpace, "[[:punct:]]")
    text_corpus <- tm_map(text_corpus, content_transformer(tolower))
    text_corpus <- tm_map(text_corpus, removeNumbers)
    text_corpus <- tm_map(text_corpus, removeWords, stopwords("english"))
    text_corpus <- tm_map(text_corpus, removeWords, c("rt"))
    text_corpus <- tm_map(text_corpus, stripWhitespace)
    text_corpus
}

twit_c_t <- text_transformer(Corpus(VectorSource(twitter)))
blogs_c_t <- text_transformer(Corpus(VectorSource(blogs)))
news_c_t <- text_transformer(Corpus(VectorSource(news)))
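
As a quick sanity check (an illustrative step, not part of the original processing), the first cleaned document of each corpus can be inspected:

## Peek at the first cleaned document from each source
as.character(twit_c_t[[1]])
as.character(blogs_c_t[[1]])
as.character(news_c_t[[1]])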

N-grams and word clouds

An n-gram is a contiguous sequence of n items from a given sample of text. An n-gram of size 1 is referred to as a unigram, of size 2 as a bigram, and of size 3 as a trigram. N-grams can be nicely visualized with a “word cloud”, where the importance of each n-gram is conveyed by font size or color: the larger an n-gram is drawn, the more frequently it occurs in the text. This is shown below for unigrams constructed from all three datasets:

library(tm)
library(wordcloud)

wordcloud(words = twit_c_t, min.freq = 1,
    max.words=100, random.order=FALSE, rot.per=0.35,
    colors=brewer.pal(8, "Dark2"))

wordcloud(words = blogs_c_t, min.freq = 1,
    max.words=100, random.order=FALSE, rot.per=0.35,
    colors=brewer.pal(8, "Dark2"))

wordcloud(words = news_c_t, min.freq=1,
    max.words=100, random.order=FALSE, rot.per=0.35,
    colors=brewer.pal(8, "Dark2"))
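
The word clouds above visualize unigrams only. As a sketch of how higher-order n-grams could be tabulated, the tidytext and dplyr packages loaded earlier can count bigram frequencies directly on the raw text; the blogs source is used here purely as an example, and the same call works for the news and twitter vectors:

## Tabulate bigram frequencies for the blogs source (illustrative example)
blogs_df <- data.frame(text = blogs, stringsAsFactors = FALSE)
blog_bigrams <- blogs_df %>%
    unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
    count(bigram, sort = TRUE)
head(blog_bigrams, 10)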

Next steps towards model building

N-grams can be used to construct a probabilistic language model for predicting the next item in a sequence of words. Such a model can be expressed as an \((n-1)\)-order Markov model. However, one needs to be careful when choosing \(n\), because n-gram statistics become increasingly sparse as \(n\) grows. It is therefore necessary to counter this sparseness with smoothing methods such as add-\(\lambda\) (additive) smoothing or Good-Turing smoothing.
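
For illustration, under the simplest (bigram, i.e. first-order Markov) assumption the probability of the next word \(w_i\) given the previous word \(w_{i-1}\) could be estimated with add-\(\lambda\) smoothing as

\[
P(w_i \mid w_{i-1}) = \frac{C(w_{i-1} w_i) + \lambda}{C(w_{i-1}) + \lambda V},
\]

where \(C(\cdot)\) denotes a count in the training corpus and \(V\) is the vocabulary size. Higher-order models condition on more preceding words and typically back off to lower-order estimates when counts are unavailable.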