Abstract

The main goal of this capstone project is to build a Shiny application that predicts the next word (or words) a user will type. The exercise was divided into seven sub-tasks, including data cleaning, exploratory analysis and the creation of a predictive model. All text data used to build the frequency dictionary, and thus to predict the next word, comes from a corpus called HC Corpora.

All text mining and natural language processing was done with a variety of well-known R packages. After drawing a sample from the HC Corpora data, the sample was cleaned by converting it to lowercase and removing punctuation, links, whitespace, numbers, stopwords and special characters. The cleaned sample was then tokenized into so-called n-grams, and the resulting data frames are used to predict the next word.

Steps of this Report

Data Acquisition

1. Load Libraries

First we set up the environment by loading the required R packages. We assume they have already been installed with the install.packages() command.

library(NLP);        library(tm);           library(RWeka);      library(RColorBrewer)
library(wordcloud);  library(ggplot2);      library(slam);       library(hash)
library(rpart);      library(data.table);   library(SnowballC);  library(stringi)
library(qdap);       library(scales);       library(gridExtra);  library(stringr)

2. Load Data

Second, we load the three English data sets (blogs, news and Twitter) to start the cleaning and analysis.

## Load twitter data
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding="UTF-8")
twitter <- iconv(twitter, from="latin1", to="ASCII", sub="")

## Load blog data
blogs <- readLines("final/en_US/en_US.blogs.txt", encoding="UTF-8")
blogs <- iconv(blogs, from="latin1", to="ASCII", sub="")

## Load news data (read through a binary connection so readLines is not
## truncated by an embedded control character in the file)
con <- file("final/en_US/en_US.news.txt", open="rb")
news <- readLines(con, encoding="UTF-8")
close(con)
news <- iconv(news, from="latin1", to="ASCII", sub="")

3. Sample Data

Random lines are selected from all three files to create a training data set; the sample() function is used to take about 1/10 (10%) of the lines from each text file:

blogs_sample <- sample(blogs, length(blogs)/10)
news_sample <- sample(news, length(news)/10)
twitter_sample <- sample(twitter, length(twitter)/10)

sampleData <- c(blogs_sample, news_sample, twitter_sample)
## Create a corpus for the combined sample data (wrapping it in list()
## stores the whole sample as a single document)
en_documents_sample <- Corpus(VectorSource(list(sampleData)))

Data summary details:

## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 57073418
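
This summary is simply the print output of the single PlainTextDocument held in the corpus; it can be reproduced with, for example:

en_documents_sample[[1]]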

Data pre-processing

1. Cleaning and Transformation

In this section we:

1.1 Clean the collection of text documents by removing:
  - numbers
  - punctuation
  - extra whitespace
  - English stopwords
  - profanity words
  - foreign words/characters

1.2 Translate characters from upper case to lower case.

1.3 Stem the words in the text documents using Porter's stemming algorithm, so that, for example, the singular and plural forms of a word are not treated as distinct terms.

## Clean a corpus: lower-case the text, then remove stopwords, punctuation,
## numbers, profanity and extra whitespace, and stem the remaining words
tokenize_file <- function(txt_file) {
      tidy_txt <- tm_map(txt_file, content_transformer(stri_trans_tolower))
      tidy_txt <- tm_map(tidy_txt, removeWords, stopwords("english"))
      tidy_txt <- tm_map(tidy_txt, removePunctuation)
      tidy_txt <- tm_map(tidy_txt, removeNumbers)
      tidy_txt <- tm_map(tidy_txt, stemDocument, language="english")
      ## profanity list read from the Google "bad words" file
      P_WORDS <- scan(file.path("google_bad_words_utf.txt"), what="", sep="\n")
      tidy_txt <- tm_map(tidy_txt, removeWords, P_WORDS)
      tidy_txt <- tm_map(tidy_txt, stripWhitespace)
      tidy_txt
}

# remove foreign/special words and characters
txtonly <- content_transformer(function(x) stri_replace_all_regex(x,"[^\\p{L}\\s[']]+",""))
en_documents_sample <- tm_map(en_documents_sample, txtonly)

en_documents_sample <- tokenize_file(en_documents_sample)

Data summary details after cleaning:

## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 33446329

2. n-grams

In this section we build a Term-Document Matrix and then compute the frequencies of the unigrams, bigrams, trigrams and quadgrams in the sampled corpus:
  - a unigram is a single word;
  - bigrams are sets of two words that appear together in the corpus;
  - trigrams are sets of three words that appear together in the corpus;
  - quadgrams are sets of four words that appear together in the corpus.
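
The four tokenizers can be defined with RWeka's NGramTokenizer(); the function names below are illustrative, not taken from the original code:

## n-gram tokenizers based on RWeka (one per n-gram order)
unigram_tokenizer  <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
bigram_tokenizer   <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigram_tokenizer  <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
quadgram_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 4, max = 4))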

2.1 Building a Term-Document Matrix (TDM)

We build a TDM to organise the terms by their frequency, use the removeSparseTerms() function to drop very rare terms and keep the interesting part of the vocabulary, and then look at the 10 most frequently occurring words.
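
A minimal sketch of how the TDMs and the frequency tables might be built with the tokenizers above; the helper name build_freq and the 0.999 sparsity threshold are assumptions for illustration:

## Build a TDM for one n-gram order and return a term/frequency data.frame
build_freq <- function(corpus, tokenizer) {
  tdm  <- TermDocumentMatrix(corpus, control = list(tokenize = tokenizer))
  tdm  <- removeSparseTerms(tdm, 0.999)                 # drop very sparse terms
  freq <- sort(slam::row_sums(tdm), decreasing = TRUE)  # total count per term
  data.frame(term = names(freq), freq = freq,
             row.names = NULL, stringsAsFactors = FALSE)
}

unigram_freq  <- build_freq(en_documents_sample, unigram_tokenizer)
bigram_freq   <- build_freq(en_documents_sample, bigram_tokenizer)
trigram_freq  <- build_freq(en_documents_sample, trigram_tokenizer)
quadgram_freq <- build_freq(en_documents_sample, quadgram_tokenizer)

head(unigram_freq, 10)   # the 10 most frequently occurring words

These frequency data frames are the input for the prediction step described in the Future section.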

Exploratory data analysis

1. Basic Analysis

In this section we compute basic metrics for the three files and compare them in terms of lines, words and characters. We use the complete/original data for this analysis (not the sample). As the results below show:
  - Twitter has the highest number of lines and the lowest number of characters/words;
  - blogs have the lowest number of lines and the highest number of characters/words.

## Count lines, words and characters in a character vector of text
get_Comparson <- function(text_file){
  total_lines <- length(text_file)
  words_per_line <- sapply(strsplit(text_file, "\\s+"), length)
  total_words <- sum(words_per_line)
  total_chars <- sum(nchar(text_file))
  
  c(total_lines,total_words,total_chars)
}
sumry_blogs <- get_Comparson(blogs)
sumry_news <- get_Comparson(news)
sumry_twitter <- get_Comparson(twitter)
summary_stat <- as.data.frame(rbind(sumry_blogs,sumry_news,sumry_twitter))
row.names(summary_stat) <- NULL
summary_stat$name <- c("Blogs","News","Twitter")
names(summary_stat) <-  c("total_lines","total_words","total_chars","name")

2. Comparison

We compare the three sources on word counts, line counts and character counts:

plot1 <- ggplot(summary_stat, aes(x=name, y=total_lines, fill=name)) +
     geom_bar(stat="identity", position=position_dodge(), colour="black")

plot2 <- ggplot(summary_stat, aes(x=name, y=total_words, fill=name)) +
     geom_bar(stat="identity", position=position_dodge(), colour="black")

plot3 <- ggplot(summary_stat, aes(x=name, y=total_chars, fill=name)) +
     geom_bar(stat="identity", position=position_dodge(), colour="black")

grid.arrange(plot1, plot2, plot3, ncol=3)

Future Work

The main outcome of this project will be a predictive application in which the user enters a phrase and the application suggests the next word. The steps to build this application are as follows (a sketch of the intended lookup appears after the list):

  • Base the prediction on the n-gram frequency tables built above.
  • Create a statistical model for choosing the most likely next word.
  • Deal with memory limitations.
  • Build the data product (the Shiny application).
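
As a sketch of that lookup, assume each n-gram frequency table has been split into a prefix column (all words but the last) and a next_word column; a simple back-off style prediction could then look like the following. The table layout and the function name predict_next_word are assumptions for illustration, not existing code:

## Sketch of a back-off next-word lookup. ngram_tables is assumed to be a
## list in which element k is a data.frame with columns "prefix" (k words),
## "next_word" and "freq", e.g. list(bigram_tab, trigram_tab, quadgram_tab).
predict_next_word <- function(phrase, ngram_tables) {
  words <- unlist(strsplit(tolower(phrase), "\\s+"))
  for (n in rev(seq_along(ngram_tables))) {           # longest prefix first
    if (length(words) < n) next
    prefix <- paste(tail(words, n), collapse = " ")
    hits <- ngram_tables[[n]][ngram_tables[[n]]$prefix == prefix, ]
    if (nrow(hits) > 0)
      return(hits$next_word[which.max(hits$freq)])    # most frequent continuation
  }
  NA_character_                                       # no match in any table
}

For example, predict_next_word("thanks for the", ngram_tables) would return the most frequent word seen after "thanks for the", backing off to shorter prefixes if that trigram never occurred in the sample.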