This is the Milestone Report for the Capstone project of the Data Science Specialization.
The goal of the Capstone project is to build an application that predicts the next word as the user types a sentence.
We are using the following libraries to process the data and visualize the results:
library(stringi)
## Warning: package 'stringi' was built under R version 3.2.5
library(tm)
## Loading required package: NLP
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
The Capstone project uses a large text corpus of documents as training data. The data provided for this project contains three text sources (blogs, news and Twitter) and comes in four languages: English, German, Finnish and Russian. We use only the English data for this exploratory analysis.
We downloaded the zip file with the text data and extracted the files into the current directory: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
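As a minimal sketch, the download step could look like this (the local file name and the assumption that the archive extracts into final/ are ours, not part of the original workflow):
# Download and extract the dataset once (local file name is an assumption)
zip_url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists("Coursera-SwiftKey.zip")) {
    download.file(zip_url, destfile = "Coursera-SwiftKey.zip", mode = "wb")
}
unzip("Coursera-SwiftKey.zip")  # creates the final/en_US/... files used below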
#blogs <- readLines("final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
#news <- readLines("final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
#twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
# Load the data from a previously saved environment (the raw files were read once with the code above)
load("~/CourseraR/Capstone/e1.RData")
Here are some statistics for these files and data:
# Disk size (in MB)
blogs_dsize <- file.info("final/en_US/en_US.blogs.txt")$size / 1024 / 1024
news_dsize <- file.info("final/en_US/en_US.news.txt")$size / 1024 / 1024
twitter_dsize <- file.info("final/en_US/en_US.twitter.txt")$size / 1024 / 1024
# In-memory size (in MB)
blogs_msize <- object.size(blogs) / 1024 / 1024
news_msize <- object.size(news) / 1024 / 1024
twitter_msize <- object.size(twitter) / 1024 / 1024
# Words in lines
blogs_words <- stri_count_words(blogs)
news_words <- stri_count_words(news)
twitter_words <- stri_count_words(twitter)
# Summary
data.frame(source = c("blogs", "news", "twitter"),
files_MB = c(blogs_dsize, news_dsize, twitter_dsize),
in_memory_MB = c(blogs_msize, news_msize, twitter_msize),
lines = c(length(blogs), length(news), length(twitter)),
words_num = c(sum(blogs_words), sum(news_words), sum(twitter_words)),
mean_words_num = c(mean(blogs_words), mean(news_words), mean(twitter_words)))
## source files_MB in_memory_MB lines words_num mean_words_num
## 1 blogs 200.4242 248.4935 899288 37546246 41.75108
## 2 news 196.2775 249.6329 1010242 34762395 34.40997
## 3 twitter 159.3641 301.3969 2360148 30093410 12.75065
We clean the data by removing special characters, numbers, punctuation and excess whitespace, then convert the text to lower case and remove English stopwords.
Creating a corpus is a very labour-intensive operation, and in our case it practically halts the computer when trying to process nearly 1 GB of in-memory text, so we load only 5000 lines from each text source.
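A minimal sketch of how that reduced environment could have been produced; the save() step is an assumption, only head(..., 5000) and the e2.RData name come from the code below:
# Keep only the first 5000 lines of each source and save them for reuse
blogs <- head(blogs, 5000)
news <- head(news, 5000)
twitter <- head(twitter, 5000)
save(blogs, news, twitter, file = "~/CourseraR/Capstone/e2.RData")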
# this file contains limited data - head(..., 5000)
load("~/CourseraR/Capstone/e2.RData")
data <- c(blogs, news, twitter)
# Corpus object for tm_map functions
vector_doc <- VectorSource(data)
corpus <- VCorpus(vector_doc)
# Convert text encoding to UTF-8 ('UTF-8-MAC' is the macOS-specific encoding name)
corpus <- tm_map(corpus, content_transformer(function(x) iconv(x, to = 'UTF-8-MAC', sub = 'byte')), mc.cores = 1)
# Replace a matched pattern (regular expression) with a space
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
# Removing special characters and URLs (currently disabled)
#corpus <- tm_map(corpus, toSpace, "(f|ht)tp(s?)://(.*)[.][a-z]+")
#corpus <- tm_map(corpus, toSpace, "@[^\\s]+")
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, stripWhitespace)
# Remove punctuation, numbers and English stopwords
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords('english'))
In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application. The n-grams typically are collected from a text or speech corpus. https://en.wikipedia.org/wiki/N-gram
The goal of our analysis is to create different n-grams from this corpus and to analyze the frequency of words and phrases.
#bigram_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
#trigram_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
# The RWeka-based tokenizers above failed for me with a strange instant error, so I used alternative tokenizers based on the NLP package:
BigramTokenizer <- function(x)
    unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
TrigramTokenizer <- function(x)
    unlist(lapply(ngrams(words(x), 3), paste, collapse = " "), use.names = FALSE)
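To illustrate what these tokenizers produce, here is the underlying ngrams() call applied directly to a toy token vector (the example sentence is made up for illustration):
# Illustrative only: bigrams of a toy token vector
toy_tokens <- c("the", "quick", "brown", "fox")
unlist(lapply(ngrams(toy_tokens, 2), paste, collapse = " "), use.names = FALSE)
# expected result: "the quick" "quick brown" "brown fox"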
# Build a sorted word-frequency data frame from a term-document matrix
freq_df <- function(tdm) {
    freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
    data.frame(word = names(freq), freq = freq)
}
unigram <- removeSparseTerms(TermDocumentMatrix(corpus), 0.9999)
unigram_freq <- freq_df(unigram)
bigram <- removeSparseTerms(TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer)), 0.9999)
bigram_freq <- freq_df(bigram)
trigram <- removeSparseTerms(TermDocumentMatrix(corpus, control = list(tokenize = TrigramTokenizer)), 0.9999)
trigram_freq <- freq_df(trigram)
freq_plot <- function(data, title) {
    ggplot(data[1:25, ], aes(reorder(word, -freq), freq)) +
        labs(x = "Words/Phrases", y = "Frequency") +
        ggtitle(title) +
        theme(axis.text.x = element_text(angle = 90, size = 12, hjust = 1)) +
        geom_bar(stat = "identity")
}
freq_plot(unigram_freq, "Top-25 Unigrams")
freq_plot(bigram_freq, "Top-25 Bigrams")
freq_plot(trigram_freq, "Top-25 Trigrams")
As mentioned above, loading and processing the dataset takes a lot of time; building the corpus in particular is extremely slow. We had to limit the amount of text data to perform this analysis.
Stopwords are very important: they are a fundamental part of the language. We have to test whether to keep these words in our prediction algorithm.
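As a sketch of that experiment, the corpus could be rebuilt with the same cleaning steps as above but without the stopword-removal step (corpus_with_stop is a name introduced here purely for illustration):
# Same cleaning pipeline as above, but keeping English stopwords
corpus_with_stop <- VCorpus(VectorSource(data))
corpus_with_stop <- tm_map(corpus_with_stop, content_transformer(tolower))
corpus_with_stop <- tm_map(corpus_with_stop, stripWhitespace)
corpus_with_stop <- tm_map(corpus_with_stop, removePunctuation)
corpus_with_stop <- tm_map(corpus_with_stop, removeNumbers)
# note: no removeWords(..., stopwords('english')) step here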
The next step of this project is to build the predictive algorithm, test it, wrap it in a Shiny application and deploy it to shinyapps.io.
We have to find and test several strategies for predicting the next word. At the moment we are thinking about using a trigram model, but this is not a final decision and we need to test it. If the trigram model fails to predict the next word, we fall back to the bigram model.
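A minimal sketch of that backoff idea, using the trigram_freq and bigram_freq tables built above; the function name predict_next_word and the simple prefix matching are illustrative assumptions, not the final algorithm:
# Illustrative backoff sketch: try trigrams first, then fall back to bigrams
predict_next_word <- function(phrase, trigrams = trigram_freq, bigrams = bigram_freq) {
    tokens <- tail(strsplit(tolower(phrase), "\\s+")[[1]], 2)
    # look for trigrams that start with the last two words of the phrase
    if (length(tokens) == 2) {
        hits <- trigrams[grepl(paste0("^", tokens[1], " ", tokens[2], " "), trigrams$word), ]
        if (nrow(hits) > 0) return(sub(".* ", "", as.character(hits$word[1])))
    }
    # back off to bigrams that start with the last word
    hits <- bigrams[grepl(paste0("^", tail(tokens, 1), " "), bigrams$word), ]
    if (nrow(hits) > 0) return(sub(".* ", "", as.character(hits$word[1])))
    NA_character_
}
predict_next_word("thanks for the")  # returns the most frequent continuation, if any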