The aim of this report is to present the exploratory data analysis that will serve as the basis for a prediction algorithm in the Natural Language Processing (NLP) field. The analysis deals with a set of text data: a collection of entries from social media, news and blogs in the US. The files were language filtered but may still contain some foreign text. The idea is to analyse the texts and apply statistical pattern learning to them in order to obtain high-quality information, typically patterns and trends in the texts (also referred to as text mining). The algorithm built on these findings will later be deployed in an app.
library(dplyr)     # data manipulation; provides the %>% pipe
library(ggplot2)   # plotting
library(stringr)   # string manipulation helpers
library(tm)        # text mining framework (corpus, term matrices)
library(lexicon)   # word lists, including profanity lists
library(RWeka)     # NGramTokenizer for n-gram tokenization
library(qdap)      # quantitative discourse analysis / text utilities
library(tidytext)  # text mining with tidy data principles
library(ggraph)    # graph/network visualisation
library(widyr)     # pairwise/wide data operations
library(ngram)     # n-gram utilities, including wordcount()
setwd("C:/Users/CChaves/Documents/Big Data/Capstone")
filePath <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists("Coursera-SwiftKey.zip")) {
  download.file(filePath, "Coursera-SwiftKey.zip")
  unzip("Coursera-SwiftKey.zip")
}
setwd("C:/Users/CChaves/Documents/Big Data/Capstone/final/en_US")
blogs <- readLines("en_US.blogs.txt")
news <- readLines("en_US.news.txt")
twitter <- readLines("en_US.twitter.txt")
Three data sets in English are available: blogs, news and Twitter.
setwd("C:/Users/CChaves/Documents/Big Data/Capstone/final/en_US")
blogs <- readLines("en_US.blogs.txt")
news <- readLines("en_US.news.txt")
twitter <- readLines("en_US.twitter.txt")
The first step is to encode the texts as Unicode (UTF-8) to support many languages and to accommodate pages and forms in any mixture of those languages.
Encoding(blogs) <- "UTF-8"
Encoding(news) <- "UTF-8"
Encoding(twitter) <- "UTF-8"
Below is a summary of the data sets showing the file size and the number of lines, words and characters of each one.
data.frame("Data Set" = c("blogs","news","twitter"),
"File Size" = sapply(list(blogs, news, twitter), function(x){
format(object.size(x), units = "auto")}),
"Lines Count" = sapply(list(blogs, news, twitter), length),
"Words Count" = sapply(list(blogs, news, twitter), wordcount),
"Characters Count" = sapply(list(blogs, news, twitter), function(x){
sum(nchar(x))}))
The table above shows that the data sets are very large, so processing all of them would require too much time and memory. Since a representative sample can be used to infer details about a population, the data will be subset for this first analysis.
A few randomly selected lines are sampled to approximate the full data. The rbinom function is used to sample roughly 0.1% of the lines.
set.seed(1234)
sampling <- function(data, p) {
  # keep each line with probability p
  return(data[as.logical(rbinom(length(data), 1, p))])
}
sample_rate <- 0.001
blogs_sample <- sampling(blogs, sample_rate)
news_sample <- sampling(news, sample_rate)
twitter_sample <- sampling(twitter, sample_rate)
For cleaning and subsequent processing, a corpus-class object is created. This object contains the original texts, the document-level variables, the document-level metadata, the corpus-level metadata and the default settings.
corpus <- VCorpus(VectorSource(c(blogs_sample, news_sample, twitter_sample)), readerControl=list(reader=readPlain,language="en"))
The steps below transform the text to lower case; remove punctuation, numbers, extra whitespace and bad words (word lists from the lexicon package); stem the text; and convert the result back to plain text documents. Stopwords are removed only for the unigram corpus (corpus1), for the reasons discussed at the end of this report.
bad_words <- c(profanity_arr_bad, profanity_banned, profanity_racist)
corpus <- tm_map(corpus, tolower) %>%
  tm_map(removePunctuation) %>%
  tm_map(removeNumbers) %>%
  tm_map(stripWhitespace) %>%
  tm_map(removeWords, bad_words) %>%
  #tm_map(removeWords, stopwords("english")) %>%
  tm_map(stemDocument) %>%
  tm_map(PlainTextDocument)
corpus1 <- tm_map(corpus, removeWords, stopwords("english")) %>% # for unigram
  tm_map(PlainTextDocument)
The idea is to break the text content into words (meaningful elements called tokens). Later, this list of tokens will become the input for the text mining. Three basic tokenizer functions are created for the n-grams, where n is the number of words. The unigram (n = 1) considers each word occurrence in a document as independent of all other word occurrences. The bigrams (n = 2) are all possible word pairs formed from neighbouring words in a sentence. The trigrams (n = 3) are all possible sets of 3 words formed from neighbouring words in a sentence. A short illustration of the tokenizer output is given after the function definitions.
uni_gramTokenizer <- function(x) {
  NGramTokenizer(x, Weka_control(min = 1, max = 1))
}
bi_gramTokenizer <- function(x) {
  NGramTokenizer(x, Weka_control(min = 2, max = 2))
}
tri_gramTokenizer <- function(x) {
  NGramTokenizer(x, Weka_control(min = 3, max = 3))
}
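As a quick, purely illustrative check, the tokenizers can be applied to a short made-up sentence (not taken from the data sets); the commented output is indicative only.
example_sentence <- "thank you so much for the follow"  # made-up example
uni_gramTokenizer(example_sentence)
# e.g. "thank" "you" "so" "much" "for" "the" "follow"
bi_gramTokenizer(example_sentence)
# e.g. "thank you" "you so" "so much" "much for" "for the" "the follow"
tri_gramTokenizer(example_sentence)
# e.g. "thank you so" "you so much" "so much for" "much for the" "for the follow"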
Document-term matrices are created from the corpus data and the tokenization functions in order to perform the text mining.
unigram <- DocumentTermMatrix(VCorpus(VectorSource(corpus1)),
                              control = list(weighting = weightTf,
                                             tokenize = uni_gramTokenizer))
bigram <- DocumentTermMatrix(VCorpus(VectorSource(corpus)),
                             control = list(weighting = weightTf,
                                            tokenize = bi_gramTokenizer))
trigram <- DocumentTermMatrix(VCorpus(VectorSource(corpus)),
                              control = list(weighting = weightTf,
                                             tokenize = tri_gramTokenizer))
The next step is to create frequency matrices for each n-gram. The idea is to count the occurrences of each term and display the most frequent unigrams, bigrams and trigrams.
freq_unigram <- sort(colSums(as.matrix(unigram)), decreasing = TRUE)
unigram_df <- data.frame(Word = names(freq_unigram), Frequency = freq_unigram )
row.names(unigram_df) <- 1:nrow(unigram_df)
ggplot(unigram_df[1:20,], aes(x = reorder(Word, -Frequency), y = Frequency)) + ggtitle("Unigram Most Frequent Terms") + geom_bar(stat = "identity", fill = "red") + geom_text(aes(label = Frequency), vjust = -0.2) + xlab("Word") + theme(axis.text.x = element_text(angle = 45, hjust = 1))
freq_bigram <- sort(colSums(as.matrix(bigram)), decreasing = TRUE)
bigram_df <- data.frame(Word = names(freq_bigram), Frequency = freq_bigram )
row.names(bigram_df) <- 1:nrow(bigram_df)
ggplot(bigram_df[1:20,], aes(x = reorder(Word, -Frequency), y = Frequency)) + ggtitle("Bigrams Most Frequent Terms") + geom_bar(stat = "identity", fill = "green") + geom_text(aes(label = Frequency), vjust = -0.2) + xlab("Word") + theme(axis.text.x = element_text(angle = 45, hjust = 1))
freq_trigram <- sort(colSums(as.matrix(trigram)), decreasing = TRUE)
trigram_df <- data.frame(Word = names(freq_trigram), Frequency = freq_trigram )
row.names(trigram_df) <- 1:nrow(trigram_df)
ggplot(trigram_df[1:20,], aes(x = reorder(Word, -Frequency), y = Frequency)) + ggtitle("Trigrams Most Frequent Terms") + geom_bar(stat = "identity", fill = "yellow") + geom_text(aes(label = Frequency), vjust = -0.2) + xlab("Word") + theme(axis.text.x = element_text(angle = 45, hjust = 1))
The size of the data was definitely a problem. There were several trials with different sampling rates to find an acceptable processing time. In addition, it was not possible to create the frequency matrices with some sampling rates, as they generated vectors too large to be allocated in memory. So the workable sampling rate was 0.1%. It was decided to keep the stopwords for the bigrams and trigrams; otherwise, only a few significant expressions would have appeared. On the other hand, the stopwords were removed from the unigram corpus to obtain meaningful words. Finally, there was a problem creating the corpus object with the Corpus function, so the VCorpus function was used instead. A possible way to ease the memory pressure is sketched below.
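Should a larger sample be needed later, one option (not used above) is to avoid converting the document-term matrix to a dense matrix and to compute the term frequencies directly on the sparse representation via the slam package, which tm already uses internally. A minimal sketch, assuming the bigram matrix created above:
library(slam)
# col_sums() works on the sparse document-term matrix, avoiding as.matrix()
freq_bigram_sparse <- sort(col_sums(bigram), decreasing = TRUE)
bigram_df_sparse <- data.frame(Word = names(freq_bigram_sparse),
                               Frequency = freq_bigram_sparse,
                               row.names = NULL)
head(bigram_df_sparse, 20)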
The idea is to use the frequency matrices to build the prediction models. The prediction model will probably use a random forest algorithm. The strategy is to suggest entries based on the user's input; for example, when the user enters a word, the app will suggest a word to complete it based on the bigram terms, as sketched below. One concern is that the processing may be slow. There are two possible solutions: the first would be to decrease the sampling further, which would probably hurt the prediction accuracy; the second would be to use some caching to speed up the process.
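As an illustration of this strategy only (not the final model), a simple lookup over the bigram frequency table built above can already suggest likely next words. The function predict_next_word below is a hypothetical sketch and is not part of the analysis above.
# Hypothetical sketch: suggest the most frequent continuations of a word,
# using the bigram frequency table (bigram_df) created earlier.
predict_next_word <- function(word, bigram_freq, n = 3) {
  word <- tolower(word)
  terms <- as.character(bigram_freq$Word)
  # keep bigrams whose first token matches the input word
  matches <- bigram_freq[startsWith(terms, paste0(word, " ")), ]
  if (nrow(matches) == 0) return(character(0))
  # return the second token of the n most frequent matching bigrams
  top <- head(matches[order(-matches$Frequency), ], n)
  sub("^\\S+ ", "", as.character(top$Word))
}
predict_next_word("thank", bigram_df)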