Download and import the data

We first download the file from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip. After unzipping it, we find a folder titled en_US that contains three text files: en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt. We will focus on these three files.

Data summary

After importing the three text files with the readLines command, we compute the size of each file in Megabytes (MB). We also obtain the number of lines of each file directly from the length of the character vectors returned by readLines, and finally count the number of words in each file. The data frame below summarizes the three text files:

##      File     Size   Lines Word_Number
## 1   Blogs 200.4242  899288    37334147
## 2    News 196.2775 1010242    34372530
## 3 Twitter 159.3641 2360148    30373603

Exploratory analysis of the text files

Before we build the n-gram models with n > 1, let's first perform some exploratory data analysis. Since the three text files are fairly large and contain a huge number of lines, we randomly select a comparatively small part of each file (here, 10000 lines per file) to form a training data set, which we will also use to build the prediction model later.
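
A condensed sketch of this sampling step is shown below (the full code appears in the appendix); the set.seed call is an addition here so that the random sample is reproducible:

## Condensed sketch of the sampling step; set.seed makes the random sample reproducible
#set.seed(17)
#nblog <- sample(1:length(data_blogs), 10000)
#nnews <- sample(1:length(data_news), 10000)
#ntwit <- sample(1:length(data_twit), 10000)
#train_data <- c(data_blogs[nblog], data_news[nnews], data_twit[ntwit])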

Now that we have the training set, we first create a corpus using the Corpus command and then perform a series of standard procedures to remove punctuation, numbers, and stopwords. We also convert the letters to lowercase and stem the text (all words are converted to their stem, e.g. learning -> learn, walked -> walk). After these steps, we create a wordcloud to show visually which words occur most frequently in the training data set.

To extract more information from the training set, we create the document term matrix, which provides the frequency of each word. Below, we list the 10 most frequent words:

##      word freq
## said said 2896
## will will 2761
## one   one 2493
## just just 2193
## can   can 2054
## like like 1992
## time time 1647
## get   get 1641
## new   new 1541
## now   now 1377

To illustrate which words occur most often, we show a histogram of the top 10 words below. We can see that the most frequent word is actually the word "said".

For a more visual illustration of the words occurring in the text, we show the wordcloud below, which gives a clear qualitative view of the training data.

Building N-gram models:

We will consider the 2-gram and 3-gram models below.

2-gram model

We build the 2-gram model below. We list the top 10 bigrams below, along with the corresponding histogram.

##                    word freq
## last year     last year  193
## new york       new york  181
## dont know     dont know  145
## right now     right now  145
## last week     last week  133
## years ago     years ago  125
## high school high school  123
## feel like     feel like  106
## dont think   dont think  101
## first time   first time   95

We again create a wordcloud for a visual illustration of the bigrams.

3-gram model

We use the tm and RWeka packages to build the 3-gram model. We list the 10 most frequent trigrams below, along with the corresponding histogram:

##                                          word freq
## new york city                   new york city   24
## new york times                 new york times   14
## happy new year                 happy new year   13
## dont think can                 dont think can   12
## let us know                       let us know   12
## two years ago                   two years ago   12
## cant wait see                   cant wait see   11
## cinco de mayo                   cinco de mayo   11
## president barack obama president barack obama   11
## dont even know                 dont even know   10

The wordcloud is illustrated below:

What’s next for building a text prediction app

In this milestone report, we randomly extracted 10000 lines from each of the en_US text files and extracted the most frequent single words, bigrams, and trigrams. To build a more appropriate text prediction app, the simplest approach is to adopt Shannon's idea, published in a 1950 paper on information theory (http://languagelog.ldc.upenn.edu/myl/Shannon1950.pdf). Given a word, we predict the word that follows it by using the counts above to assign a probability to each of the most frequent following words, where the probabilities add up to 1. This is the most straightforward way to predict the next word with the algorithm developed above. To make the algorithm faster, we could import smaller chunks of text to build a bigram model, repeat these steps over a few days, and combine the results to build a less biased bigram model.
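
As a concrete illustration of this idea, below is a minimal sketch of how the bigram counts computed above (bg_freq in the code appendix) could be turned into conditional next-word probabilities. The predict_next helper and its name are hypothetical and only illustrate the approach; they are not part of the code used in this report:

## Minimal sketch (illustration only): turn the bigram frequencies bg_freq into
## conditional probabilities P(next word | current word)
#bigrams <- strsplit(names(bg_freq), " ")
#bg_table <- data.frame(first  = sapply(bigrams, `[`, 1),
#                       second = sapply(bigrams, `[`, 2),
#                       freq   = as.numeric(bg_freq),
#                       stringsAsFactors = FALSE)
#predict_next <- function(word, n = 5) {        ## hypothetical helper
#    cand <- bg_table[bg_table$first == word, ]
#    cand <- head(cand[order(cand$freq, decreasing = TRUE), ], n)
#    cand$prob <- cand$freq / sum(cand$freq)    ## probabilities add up to 1
#    cand
#}
#predict_next("last")  ## e.g. "year" and "week" should rank highly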

We also need to find a way to improve the cleaning of the text data, especially the en_US.twitter.txt and blogs files, which contain many nonsense words, e.g. aaaaah, whooooah, etc. Although the frequencies of these words are small compared with those of the meaningful words, it would still be better to find a systematic way to remove them.
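
One systematic option, sketched below under the assumption that most of these nonsense words are simply elongated spellings, is to drop tokens in which the same letter is repeated three or more times. This regex-based filter is an illustration only, not part of the code used in this report:

## Sketch (illustration only): remove tokens with the same letter repeated
## three or more times, e.g. "aaaaah", "whooooah"
#removeElongated <- content_transformer(function(x) {
#    gsub("\\b\\w*([a-z])\\1{2,}\\w*\\b", "", x, ignore.case = TRUE, perl = TRUE)
#})
#trainCorpus <- tm_map(trainCorpus, removeElongated)
#trainCorpus <- tm_map(trainCorpus, stripWhitespace)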

Code used in this report

We list all the code used in this report below:

## We first set the working directory
#setwd("~/Documents/Coursera/Capstone/Milestone_Report")

## We can download the file using the command below and 
## unzip the file
#fileURL <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
#download.file(fileURL, destfile = "TextData.zip", method = "curl")
#unzip("TextData.zip")
#unlink("TextData.zip") ## delete the zip file after extraction

## We first import the packages we will use
#library(tm)
#library(SnowballC)
#library(wordcloud)
#library(RWeka)
#library(ggplot2)
#library(slam)

## There are three text files in the ./final/en_US directory
## Let's first import the files
#data_blogs <- readLines("./final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = T)
#data_news <- readLines("./final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = T)
#data_twit<- readLines("./final/en_US/en_US.twitter.txt", encoding = "UTF-8",skipNul = T)

## We first check the size of each file in MB (bytes / 1024 / 1024)
#file_blogs <- file.info("./final/en_US/en_US.blogs.txt")$size / 1024.0 / 1024.0
#file_news <- file.info("./final/en_US/en_US.news.txt")$size / 1024.0 / 1024.0
#file_twit <- file.info("./final/en_US/en_US.twitter.txt")$size / 1024.0 / 1024.0

## We check number of lines of each text file
#line_blogs <- length(data_blogs)
#line_news <- length(data_news)
#line_twit <- length(data_twit)

## Check the total number of words
#word_blogs <- sum(sapply(gregexpr("\\S+", data_blogs), length))
#word_news <- sum(sapply(gregexpr("\\S+", data_news), length))
#word_twit <- sum(sapply(gregexpr("\\S+", data_twit), length))

## We summarize all the information in a data frame
#file_summary <- data.frame(File=c("Blogs", "News", "Twitter"), Size = c(file_blogs, file_news,file_twit), Lines = c(line_blogs, line_news, line_twit), Word_Number =c(word_blogs, word_news, word_twit))

#file_summary

##Let's pull out parts of the text files to be the training set
#nblog <- sample(1:length(data_blogs), 10000)
#nnews <- sample(1:length(data_news), 10000)
#ntwit <- sample(1:length(data_twit), 10000)

## The subdata below will be our training sets
#train_blogs <- data_blogs[nblog]
#train_news <- data_news[nnews]
#train_twit <- data_twit[ntwit]

## Let's combine the three subdata to be one
#train_data <- c(train_blogs,train_news,train_twit) 

##Now, we will perform a series of operations on the text data to simplify it.
##First, we need to create a corpus.
#trainCorpus <- Corpus(VectorSource(train_data))

##Next, we will convert the corpus to a plain text document.
#trainCorpus <- tm_map(trainCorpus, PlainTextDocument)

##Then, we will remove all punctuation and stopwords. Stopwords are commonly
##used words in the English language such as I, me, my, etc. You can see the
##full list of stopwords using stopwords('english').
#trainCorpus <- tm_map(trainCorpus, removePunctuation)
#trainCorpus <- tm_map(trainCorpus, removeNumbers)
#trainCorpus <- tm_map(trainCorpus, content_transformer(tolower))
#trainCorpus <- tm_map(trainCorpus, removeWords, stopwords('english'))

##Next, we will perform stemming. This means that all the words are converted to their stem (Ex: learning -> learn, walked -> walk, etc.).
##This will ensure that different forms of the word are converted to
## the same form and plotted only once in the wordcloud.
#trainCorpus <- tm_map(trainCorpus, stemDocument)

## Remove white spaces left after we perform all the steps above
#trainCorpus <- tm_map(trainCorpus, stripWhitespace)

## We convert the trainCorpus to a plain text document.
#trainCorpus <- tm_map(trainCorpus, PlainTextDocument)

## Create the document term matrix (DTM) and the term document matrix (TDM)
#DTM <- DocumentTermMatrix(trainCorpus)

#TDM <- TermDocumentMatrix(trainCorpus)

## Let's organize terms by their frequency:
#word_freq <- colSums(as.matrix(DTM))
#ord_wordf <- order(word_freq)

## Examining the matrices, we find that the document term matrix (DTM) and the term document matrix (TDM) are very sparse. Let's remove the sparse terms from the matrix
#DTMsparse <- removeSparseTerms(DTM,0.99)
## inspect(DTMsparse)

##Now we can check the words occurring most frequently
#word_freq[tail(ord_wordf)]

## We can also check the distribution of word frequencies; many words occur only once
#head(table(word_freq),20)
#word_freq <- sort(word_freq, decreasing = T)
#wf <- data.frame(word=names(word_freq), freq=word_freq)   
#head(wf, 10)  

#subwf <- wf[1:10,]
#p <- ggplot(subwf, aes(x= word, y= freq, fill = word))
#p <- p + geom_bar(stat = "identity")
#p

## Let's look at the wordcloud to see which words occur most frequently
#set.seed(17)
#wordcloud(trainCorpus, max.words = 100, scale = c(5,0.3),
#          random.order = FALSE, colors = brewer.pal(8,"Dark2"))

### 2-gram model

#options(mc.cores=1) ## keep tokenization single-threaded so the RWeka tokenizer works reliably with tm
#BigramTokenizer <- function(x){NGramTokenizer(x,Weka_control(min = 2, max = 2))} ## create 2-grams
#bgDTM <- DocumentTermMatrix(trainCorpus, control = list(tokenize = BigramTokenizer)) # create DTM from 2-grams

#bgDTM2 <- rollup(bgDTM, 1, na.rm=TRUE, FUN = sum) ## collapse over documents to total counts (slam package)
#bg_freq <- colSums(as.matrix(bgDTM2))

## Now let's list the top 10 bigrams occurring most frequently
#bg_freq <- sort(bg_freq, decreasing = T)
#head(bg_freq,10)

## create the bigrams data frame
#bgwf <- data.frame(word=names(bg_freq), freq=bg_freq)   
#head(bgwf,10)

#subbg_wf <- bgwf[1:10,]
#bg_p <- ggplot(subbg_wf, aes(x= word, y= freq, fill = word))
#bg_p <- bg_p + geom_bar(stat = "identity")
#bg_p <- bg_p + theme(axis.text.x = element_text(angle = 90, hjust = 1))
#bg_p

#set.seed(17)
#wordcloud(words = bgwf$word,
#          freq = bgwf$freq,
#          scale = c(3,0.3),
#          max.words = 50,
#          random.order = FALSE, colors = brewer.pal(8,"Dark2"))


### 3-gram model

#options(mc.cores=1) ## same single-threaded setting as for the 2-gram model
#TrgramTokenizer <- function(x){NGramTokenizer(x,Weka_control(min = 3, max = 3))} ## create 3-grams

#TrgDTM <- DocumentTermMatrix(trainCorpus, control = list(tokenize = TrgramTokenizer)) # create DTM from n-grams

#TrgDTM2 <- rollup(TrgDTM, 1, na.rm=TRUE, FUN = sum)
#Trg_freq <- colSums(as.matrix(TrgDTM2))


## Now let's sort the trigrams by frequency
#Trg_freq <- sort(Trg_freq, decreasing = T)

#Trgwf <- data.frame(word = names(Trg_freq), freq = Trg_freq)  
#head(Trgwf,10)

#subtrg_wf <- Trgwf[1:10,]
#trg_p <- ggplot(subtrg_wf, aes(x= word, y= freq, fill = word))
#trg_p <- trg_p + geom_bar(stat = "identity")
#trg_p <- trg_p + theme(axis.text.x = element_text(angle = 90, hjust = 1))
#trg_p

#set.seed(17)
#wordcloud(words = Trgwf$word,
#          freq = Trgwf$freq,
#          scale = c(2.5,0.3),
#          max.words = 30,
#          random.order = FALSE, colors = brewer.pal(8,"Dark2"))