Introduction

The aim of this project is to build a Shiny application that uses an algorithm to predict the word most likely to follow one or more words typed by a user. This problem is a language modeling task in the field of Natural Language Processing (NLP). We are provided with text data from social media (Twitter), news and blogs in three languages: English, German and French. We use the English data for our project. Here is a basic size summary of these data sets, which shows that we are dealing with large data sets.

library(tm); library(koRpus); library(data.table); library(ggplot2); library(wordcloud); library(cowplot); library(gridExtra)
##             fileName size_in_MB lineCount
## 1:   en_US.blogs.txt      210.2    899288
## 2:    en_US.news.txt      205.8   1010242
## 3: en_US.twitter.txt      167.1   2360148
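
A summary of this kind could be produced along the following lines. This is a minimal sketch in base R, not necessarily how the table above was generated: `summarize_files` is an illustrative name, and counting lines with `readLines` loads each file fully into memory, which may be slow for files of this size.

```r
# Hypothetical helper: size (in MB) and line count for a set of text files.
summarize_files <- function(paths) {
  data.frame(
    fileName   = basename(paths),
    size_in_MB = round(file.size(paths) / 1024^2, 1),
    lineCount  = vapply(
      paths,
      function(p) length(readLines(p, warn = FALSE, skipNul = TRUE)),
      integer(1),
      USE.NAMES = FALSE
    ),
    stringsAsFactors = FALSE
  )
}

# Example on a small temporary file standing in for the real data.
tmp <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line"), tmp)
print(summarize_files(tmp))
```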

Data Processing

Cleaning and Sampling. We first remove profane words using regular expressions, since we do not want the prediction algorithm to return such words. We then choose a random subset of the data as the training set. We use the n.readLines function from the reader package to read the data in chunks and select a random subset of lines. For the purpose of this report we work with a relatively small sample of \(1\%\) of the cleaned data.
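
The two steps can be sketched in base R as follows. This is an illustration under stated assumptions rather than the code used for this report: the word list `bad_words` is a harmless placeholder, the helper names are hypothetical, and the sampling works on lines already held in memory instead of reading the file in chunks. Each line is kept with a fixed probability, so the sample size is only approximately the requested fraction.

```r
# Hypothetical placeholder list; a real profanity list would be loaded from a file.
bad_words <- c("darn", "heck")

# Remove whole-word, case-insensitive matches of the listed words.
remove_profanity <- function(lines, words) {
  pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")
  cleaned <- gsub(pattern, "", lines, ignore.case = TRUE, perl = TRUE)
  # Collapse the extra spaces left behind by the removal.
  trimws(gsub("\\s+", " ", cleaned))
}

# Keep each line independently with probability `fraction`.
sample_lines <- function(lines, fraction, seed = 1234) {
  set.seed(seed)
  lines[rbinom(length(lines), size = 1, prob = fraction) == 1]
}

lines <- c("What the heck is this", "A clean line", "Darn it all")
cleaned <- remove_profanity(lines, bad_words)
print(cleaned)

# For the report the fraction would be 0.01; 0.5 is used here for the toy data.
train <- sample_lines(cleaned, fraction = 0.5)
```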

Tokenization. Tokenization is the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens. We tokenize the data to understand the distribution of these tokens, such as words, numbers, commas and full stops. We see that the Twitter data has more full stops and fewer commas than the other two data sets. The news data has the most numbers. This reflects a fundamental difference in how language is used in the data sets: the news data is more formal and reports more factual figures and numbers, while blogs and Twitter are more informal. Also, the distribution of word lengths shows very long words in the Twitter data compared to the others, probably due to web link addresses.

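# Tokenize the training file with koRpus
tokenized <- koRpus::tokenize(txt = train.file.path, format = "file", lang = "en")
tokenDesc <- slot(tokenized, name = "desc")   # descriptive statistics of the text
tokensdf <- slot(tokenized, name = "TT.res")  # data frame of tagged tokens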

##                  fileName lineCount wordCount sentCount punctCount
## 1    train.en_US.blog.txt      8993    380472     23997      65396
## 2    train.en_US.news.txt     10102    353717     22089      71202
## 3 train.en_US.twitter.txt     23601    310737     37325      81145
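
To make the idea of counting token types concrete, here is a minimal base R sketch. It is not the koRpus tokenizer: `simple_tokenize` and `token_summary` are hypothetical helpers, and the regular expression is a deliberately crude assumption that splits text into runs of letters, runs of digits and single punctuation marks (so a contraction such as "don't" becomes three tokens).

```r
# Split text into word, number and punctuation tokens with a regular expression.
simple_tokenize <- function(text) {
  pattern <- "[[:alpha:]]+|[[:digit:]]+|[[:punct:]]"
  unlist(regmatches(text, gregexpr(pattern, text)))
}

# Classify each token and count the classes.
token_summary <- function(tokens) {
  kind <- ifelse(grepl("^[[:alpha:]]+$", tokens), "word",
          ifelse(grepl("^[[:digit:]]+$", tokens), "number",
          ifelse(tokens == ".", "fullstop",
          ifelse(tokens == ",", "comma", "other_punct"))))
  levels <- c("word", "number", "fullstop", "comma", "other_punct")
  table(factor(kind, levels = levels))
}

tokens <- simple_tokenize("Sales rose 12 percent, analysts said. Really?")
print(token_summary(tokens))
```

Applying such a summary to each of the three samples would give per-source counts comparable in spirit to the table above, although the exact numbers would differ from those of koRpus.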

Document Term Matrix

A Document Term Matrix (dtm) is a matrix that describes the frequency of terms occurring in a collection of documents, where the rows correspond to documents in the collection and the columns correspond to terms. Creating such a matrix from the corpus is an important step in text analysis. Using the tm package, we read the corpus, convert the text to lowercase, and remove punctuation, numbers and extra whitespace before creating the dtm.

# Use tm package
docs <- Corpus(DirSource(trainData.path)) # Load the data
# Preprocessing
docs <- tm_map(docs, content_transformer(tolower)) # wrap base functions so the corpus structure is kept
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, PlainTextDocument) # treat your preprocessed documents as text documents

dtm <- DocumentTermMatrix(docs)
inspect(dtm[, 100:106])
## <<DocumentTermMatrix (documents: 3, terms: 7)>>
## Non-/sparse entries: 9/12
## Sparsity           : 57%
## Maximal term length: 8
## Weighting          : term frequency (tf)
## 
##               Terms
## Docs           ably abnormal abny aboard abode abodes abolish
##   character(0)    0        3    0      5     3      1       0
##   character(0)    2        0    0      7     0      0       1
##   character(0)    0        0    1      5     0      0       0
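
The structure of a dtm can also be illustrated without tm. The following base R sketch builds a small dense count matrix by hand; `build_dtm` is a hypothetical helper and the three toy documents are invented for illustration. It mirrors the preprocessing described above (lowercasing, removing punctuation and numbers), but a dense matrix like this would not scale to the full corpus, which is why a sparse representation is normally used.

```r
# Build a small document-term matrix by hand: rows are documents, columns are terms.
build_dtm <- function(docs) {
  tokens <- lapply(docs, function(d) {
    d <- tolower(d)
    d <- gsub("[[:punct:][:digit:]]+", " ", d)   # drop punctuation and numbers
    words <- strsplit(trimws(d), "\\s+")[[1]]
    words[nzchar(words)]
  })
  terms <- sort(unique(unlist(tokens)))
  # Count every term in every document, including zeros for absent terms.
  rows <- lapply(tokens, function(w) as.integer(table(factor(w, levels = terms))))
  m <- do.call(rbind, rows)
  dimnames(m) <- list(Docs = names(docs), Terms = terms)
  m
}

docs <- c(blogs = "The cat sat.",
          news = "The dog sat, the cat ran!",
          twitter = "cat cat 2 cats")
dtm_small <- build_dtm(docs)
print(dtm_small)
```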

Exploratory Analysis

Word Frequency. The dtm allows us to analyse the data by word frequency. We first look at the most frequently occurring words in all documents combined. A more attractive visualization is a word cloud of the 100 most frequent words. “The” and “and” are the two most frequent words.

wordcloud(names(freq), freq, min.freq = 500, max.words = 100, colors = brewer.pal(8, "Dark2"), random.order = FALSE, rot.per = .15, scale = c(8, .9))
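
The `freq` vector passed to `wordcloud` is not defined in this section. One plausible construction, shown here on a small hand-made count matrix rather than the real dtm, is to sum each term's column and sort the totals in decreasing order. The matrix `counts` and its values are invented for illustration; with a tm dtm the same idea would presumably apply after converting it with `as.matrix`.

```r
# Hand-made stand-in for a document-term matrix (rows: documents, columns: terms).
counts <- matrix(
  c(5, 2, 0,
    7, 0, 1,
    4, 3, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("blogs", "news", "twitter"), c("the", "and", "aboard"))
)

# Total frequency of each term across all documents, most frequent first.
freq <- sort(colSums(counts), decreasing = TRUE)
print(head(freq, 2))
```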

Coverage. By coverage we mean the number of unique words needed to account for a given percentage of all word occurrences in the data, which contains 56519 unique words.

##    coverage.percent cutoff.index uniqword.percent cutoff.freq
## 5              0.50          313        0.5537961         299
## 9              0.90         9807       17.3516870           6
## 10             0.95        21313       37.7094428           2

We can see from the above table and graphs that for \(50 \%\) coverage we need only 313 unique words (about \(0.55 \%\) of the vocabulary), each occurring at least 299 times. For \(90 \%\) coverage we need 9807 unique words, each occurring at least 6 times. So if we remove the terms occurring only once or twice in the data, we still retain more than \(90 \%\) coverage, which should be enough for a robust model. This will also remove most foreign-language and unknown words.
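
A coverage table of this shape can be derived from a vector of word frequencies with a cumulative sum. The sketch below is an assumption about how the columns were computed, not the report's actual code: `coverage_table` is a hypothetical helper, the column names simply mirror the table above, and `toy_freq` is invented data.

```r
# For each target coverage, find how many of the most frequent words are needed.
coverage_table <- function(freq, targets = c(0.5, 0.9, 0.95)) {
  freq <- sort(freq, decreasing = TRUE)
  cum_share <- cumsum(freq) / sum(freq)
  # First rank at which the cumulative share reaches the target
  # (small tolerance guards against floating-point rounding).
  idx <- vapply(targets, function(p) which(cum_share >= p - 1e-12)[1], integer(1))
  data.frame(
    coverage.percent = targets,
    cutoff.index     = idx,
    uniqword.percent = 100 * idx / length(freq),
    cutoff.freq      = unname(freq[idx])
  )
}

toy_freq <- c(the = 50, and = 30, cat = 10, dog = 5, aboard = 3, abode = 2)
print(coverage_table(toy_freq))
```

Here `cutoff.freq` is the frequency of the last word needed to reach the target, so every word kept at that coverage level occurs at least that often.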

Proposed Plan