We first download the file from “https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip”. After we unzip the file, we find a final/en_US folder that contains three text files: en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt. We will focus on these three files.
After importing the three text files with the readLines command, we extract the size of each file in megabytes (MB). We also obtain the number of lines in each file directly from the length of the objects returned by readLines, and finally count the number of words in each file. We summarize the three text files in the data frame below:
## File Size_MB Lines Word_Number
## 1 Blogs 200.4242 899288 37334147
## 2 News 196.2775 1010242 34372530
## 3 Twitter 159.3641 2360148 30373603
Before we build the n-gram models with n > 1, let’s first perform some exploratory data analysis. Since the three text files are fairly large and contain a huge number of lines, we randomly select a comparatively small part of each file below (here we choose 10000 lines per file) to form a training data set, which we will use later to build the prediction model.
Now that we have the training data, we first create a corpus using the Corpus command and then perform a series of standard procedures to remove punctuation, numbers, and stopwords. We also convert all letters to lower case and stem the text (meaning every word is converted to its stem, e.g. learning -> learn, walked -> walk). After all these procedures, we create a wordcloud to show visually which words occur most frequently in this training data set.
To extract more information from the training set, we create the document term matrix, which provides the frequency of each word. Below, we list the 10 words that appear most frequently:
## word freq
## said said 2896
## will will 2761
## one one 2493
## just just 2193
## can can 2054
## like like 1992
## time time 1647
## get get 1641
## new new 1541
## now now 1377
To illustrate which words occur most often, we show a bar plot of the top 10 words below. We can see that the most frequent word is actually “said”.
For a more visual illustration of the words occurring in the training set, we show the wordcloud below, which gives a clear qualitative view of the text.
We will consider the 2-gram and 3-gram models below.
We build the 2-gram model below and list the top 10 bigrams along with the corresponding bar plot.
## word freq
## last year last year 193
## new york new york 181
## dont know dont know 145
## right now right now 145
## last week last week 133
## years ago years ago 125
## high school high school 123
## feel like feel like 106
## dont think dont think 101
## first time first time 95
We again create a wordcloud for visual illustration of the texts.
We use the tm and RWeka packages to build the 3-gram model. We list the top 10 trigrams below, along with the corresponding bar plot:
## word freq
## new york city new york city 24
## new york times new york times 14
## happy new year happy new year 13
## dont think can dont think can 12
## let us know let us know 12
## two years ago two years ago 12
## cant wait see cant wait see 11
## cinco de mayo cinco de mayo 11
## president barack obama president barack obama 11
## dont even know dont even know 10
The wordcloud is illustrated below:
In this milestone report, we randomly extract 10000 lines from each en_US text file and identify the most frequent unigrams, bigrams, and trigrams. In order to build a more appropriate text prediction app, I think the easiest way is to adopt Shannon’s idea, published in a 1950 paper on information theory that can be found at http://languagelog.ldc.upenn.edu/myl/Shannon1950.pdf. For a given word, in order to predict the word that follows it, we use the frequencies computed above to assign a probability to each of the most frequent words following it, with all the probabilities adding up to 1. This should be the simplest way to predict the next word using the algorithm developed above. To make the algorithm faster, we could also import smaller chunks of text to build the bigram model, repeat the steps over a few days, and combine the results to obtain a less biased bigram model.
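To make the first idea concrete, here is a minimal sketch of such a predictor, assuming the bigram frequency data frame bgwf built in the code listed at the end of this report; the function name predict_next_word and the top_n cutoff are illustrative choices rather than part of the analysis above.
## A hypothetical helper: for a given word, keep the most frequent bigrams
## starting with that word and normalize their counts into probabilities
#predict_next_word <- function(current_word, bigram_freq = bgwf, top_n = 5) {
#  parts  <- strsplit(as.character(bigram_freq$word), " ")
#  first  <- sapply(parts, `[`, 1)
#  second <- sapply(parts, `[`, 2)
#  idx <- which(first == tolower(current_word))
#  if (length(idx) == 0) return(NULL)
#  cand <- data.frame(word = second[idx], freq = bigram_freq$freq[idx])
#  cand <- head(cand[order(-cand$freq), ], top_n)
#  ## the probabilities of the retained candidates add up to 1
#  cand$prob <- cand$freq / sum(cand$freq)
#  cand
#}
## Example: the most likely words to follow "last" in the training sample
#predict_next_word("last")
Words that never start a bigram in the training sample return nothing here, so a real predictor would need smoothing or a back-off to the unigram frequencies.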
We also need to figure out a way to improve the cleaning of the text data, especially the en_US.twitter.txt and blogs files, which contain many nonsense words, e.g. aaaaah, whooooah, etc. Though the frequencies of those words are small compared with those of meaningful words, it would still be better to find a systematic way to remove them.
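One possible sketch of such a cleaning step, assuming it is applied before the rest of the tm pipeline (the transformer name squashRepeats is an illustrative choice), is to collapse any letter repeated three or more times, so that aaaaah becomes ah and whooooah becomes whoah while ordinary double letters are left alone:
## A possible pre-cleaning step for elongated "nonsense" tokens
#library(tm)   ## content_transformer and tm_map come from the tm package
#squashRepeats <- content_transformer(function(x) {
#  gsub("([a-z])\\1{2,}", "\\1", x, ignore.case = TRUE, perl = TRUE)
#})
#trainCorpus <- tm_map(trainCorpus, squashRepeats)
Collapsing rather than deleting keeps the underlying word when one exists (e.g. soooo -> so) while merging all the elongated spellings into a single form; tokens that remain gibberish afterwards could then be dropped by frequency.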
All the code used in this report is listed below:
## We first set the working directory
#setwd("~/Documents/Coursera/Capstone/Milestone_Report")
## We can download the file using the command below and
## unzip the file
#fileURL <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
#download.file(fileURL, destfile = "TextData.zip", method = "curl")
#unzip("TextData.zip")
#unlink("TextData.zip") ## remove the zip archive once it has been extracted
## We first import the packages we will use
#library(tm)
#library(SnowballC)
#library(wordcloud)
#library(RWeka)
#library(ggplot2)
#library(slam)
## There are three text files in the ./final/en_US directory
## Let's first import the files
#data_blogs <- readLines("./final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = T)
#data_news <- readLines("./final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = T)
#data_twit<- readLines("./final/en_US/en_US.twitter.txt", encoding = "UTF-8",skipNul = T)
## We first check the size of each file in megabytes (MB)
#file_blogs <- file.info("./final/en_US/en_US.blogs.txt")$size / 1024.0 / 1024.0
#file_news <- file.info("./final/en_US/en_US.news.txt")$size / 1024.0 / 1024.0
#file_twit <- file.info("./final/en_US/en_US.twitter.txt")$size / 1024.0 / 1024.0
## We check number of lines of each text file
#line_blogs <- length(data_blogs)
#line_news <- length(data_news)
#line_twit <- length(data_twit)
## Check the total number of words
#word_blogs <- sum(sapply(gregexpr("\\S+", data_blogs), length))
#word_news <- sum(sapply(gregexpr("\\S+", data_news), length))
#word_twit <- sum(sapply(gregexpr("\\S+", data_twit), length))
## We summarize all the information in a data frame
#file_summary <- data.frame(File=c("Blogs", "News", "Twitter"), Size_MB = c(file_blogs, file_news, file_twit), Lines = c(line_blogs, line_news, line_twit), Word_Number = c(word_blogs, word_news, word_twit))
#file_summary
##Let's pull out parts of the text files to be the training set
#nblog <- sample(1:length(data_blogs), 10000)
#nnews <- sample(1:length(data_news), 10000)
#ntwit <- sample(1:length(data_twit), 10000)
## The subdata below will be our training sets
#train_blogs <- data_blogs[nblog]
#train_news <- data_news[nnews]
#train_twit <- data_twit[ntwit]
## Let's combine the three subdata to be one
#train_data <- c(train_blogs,train_news,train_twit)
##Now, we will perform a series of operations on the text data to simplify it.
##First, we need to create a corpus.
#trainCorpus <- Corpus(VectorSource(train_data))
##Next, we will convert the corpus to a plain text document.
#trainCorpus <- tm_map(trainCorpus, PlainTextDocument)
##Then, we will remove all punctuation and stopwords. Stopwords are commonly
##used words in the English language such as I, me, my, etc. You can see the
##full list of stopwords using stopwords('english').
#trainCorpus <- tm_map(trainCorpus, removePunctuation)
#trainCorpus <- tm_map(trainCorpus, removeNumbers)
#trainCorpus <- tm_map(trainCorpus, content_transformer(tolower))
#trainCorpus <- tm_map(trainCorpus, removeWords, stopwords('english'))
##Next, we will perform stemming. This means that all the words are converted to their stem (Ex: learning -> learn, walked -> walk, etc.).
##This will ensure that different forms of the word are converted to
## the same form and plotted only once in the wordcloud.
#trainCorpus <- tm_map(trainCorpus, stemDocument)
## Remove white spaces left after we perform all the steps above
#trainCorpus <- tm_map(trainCorpus, stripWhitespace)
## We convert the trainCorpus to a plain text document.
#trainCorpus <- tm_map(trainCorpus, PlainTextDocument)
#DTM <- DocumentTermMatrix(trainCorpus)
#TDM <- TermDocumentMatrix(trainCorpus)
## Let's organize terms by their frequency:
#word_freq <- colSums(as.matrix(DTM))
#ord_wordf <- order(word_freq)
## Examining the matrix, we find that the document term matrix (DTM) and the term document matrix (TDM) are very sparse. Let's remove the sparsest terms
#DTMsparse <- removeSparseTerms(DTM,0.99)
## inspect(DTMsparse)
##Now we can check the words occurring most frequently
#word_freq[tail(ord_wordf)]
## We can also check the frequency with which words appear. We will see many words occur only once
#head(table(word_freq),20)
#word_freq <- sort(word_freq, decreasing = T)
#wf <- data.frame(word=names(word_freq), freq=word_freq)
#head(wf, 10)
#subwf <- wf[1:10,]
#p <- ggplot(subwf, aes(x= word, y= freq, fill = word))
#p <- p + geom_bar(stat = "identity")
#p
## Let's look at the wordcloud first to see which words occur most frequently
#set.seed(17)
#wordcloud(trainCorpus, max.words = 100, scale = c(5,0.3),
# random.order = FALSE, colors = brewer.pal(8,"Dark2"))
#We will consider the 2-gram and 3-gram models below
### 2-gram model
#We build the 2-gram model below and list the top 10 bigrams along with the bar plot.
#options(mc.cores=1)
#BigramTokenizer <- function(x){NGramTokenizer(x,Weka_control(min = 2, max = 2))} ## create n-grams
#bgDTM <- DocumentTermMatrix(trainCorpus, control = list(tokenize = BigramTokenizer)) # create DTM from n-grams
#bgDTM2 <- rollup(bgDTM, 1, na.rm=TRUE, FUN = sum)
#bg_freq <- colSums(as.matrix(bgDTM2))
## Now let's list the top 10 bigrams occurring most frequently
#bg_freq <- sort(bg_freq, decreasing = T)
#head(bg_freq,10)
## create the bigrams data frame
#bgwf <- data.frame(word=names(bg_freq), freq=bg_freq)
#head(bgwf,10)
#subbg_wf <- bgwf[1:10,]
#bg_p <- ggplot(subbg_wf, aes(x= word, y= freq, fill = word))
#bg_p <- bg_p + geom_bar(stat = "identity")
#bg_p <- bg_p + theme(axis.text.x = element_text(angle = 90, hjust = 1))
#bg_p
#set.seed(17)
#wordcloud(words = bgwf$word,
# freq = bgwf$freq,
# scale = c(3,0.3),
# max.words = 50,
# random.order = FALSE, colors = brewer.pal(8,"Dark2"))
### 3-gram model
#options(mc.cores=1)
#TrgramTokenizer <- function(x){NGramTokenizer(x,Weka_control(min = 3, max = 3))} ## create n-grams
#TrgDTM <- DocumentTermMatrix(trainCorpus, control = list(tokenize = TrgramTokenizer)) # create DTM from n-grams
#TrgDTM2 <- rollup(TrgDTM, 1, na.rm=TRUE, FUN = sum)
#Trg_freq <- colSums(as.matrix(TrgDTM2))
## Now let's list the trigrams occurring most frequently
#Trg_freq <- sort(Trg_freq, decreasing = T)
#Trgwf <- data.frame(word = names(Trg_freq), freq = Trg_freq)
#head(Trgwf,10)
#subtrg_wf <- Trgwf[1:10,]
#trg_p <- ggplot(subtrg_wf, aes(x= word, y= freq, fill = word))
#trg_p <- trg_p + geom_bar(stat = "identity")
#trg_p <- trg_p + theme(axis.text.x = element_text(angle = 90, hjust = 1))
#trg_p
#set.seed(17)
#wordcloud(words = Trgwf$word,
# freq = Trgwf$freq,
# scale = c(2.5,0.3),
# max.words = 30,
# random.order = FALSE, colors = brewer.pal(8,"Dark2"))