This is the first milestone report for the Capstone project of the Data Science Specialization. The main goal of the project is to create an R application that can predict the next word given the two previous words. For this application to work, we need to design a predictive text model trained on the given dataset (blogs, Twitter and news). In this report, we perform some exploratory data analysis, try to discover interesting relationships in the data, and build n-gram models from the given dataset. Let's get started.
The dataset provided is fairly large, and with the internet speed available in Nepal it took me almost 8 hours to download. The dataset is available at https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
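A minimal sketch of how the file could be downloaded and extracted directly from R is shown below; the destination file and folder names are just examples, not part of the original workflow.
# Download and extract the dataset (destination names are arbitrary)
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
zipfile <- "Coursera-SwiftKey.zip"
if (!file.exists(zipfile)) {
  download.file(url, destfile = zipfile, mode = "wb")
}
unzip(zipfile, exdir = "dataset")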
First, load the R packages used throughout this report.
# Load the packages used in this report
library(dplyr)
library(stringi)
library(knitr)
library(ggplot2)
library(tm)         # also loads NLP
library(RWeka)
library(doParallel) # attaches foreach, iterators and parallel
The datasets are provided in four languages: English, Finnish, German and Russian. Each language includes text from three sources: news, blogs and Twitter. In this report, we focus only on the English datasets.
# Set extracted file locations
en_blogs<-"/home/tensor/Documents/Tutorials/R/Capestone/dataset/en_US/en_US.blogs.txt"
en_twitter<-"/home/tensor/Documents/Tutorials/R/Capestone/dataset/en_US/en_US.twitter.txt"
en_news<-"/home/tensor/Documents/Tutorials/R/Capestone/dataset/en_US/en_US.news.txt"
Load the datasets
# Load data into R
data_blogs <- readLines(en_blogs, encoding = "UTF-8", skipNul = TRUE)
data_news <- readLines(en_news, encoding = "UTF-8", skipNul = TRUE)
data_twitter <- readLines(en_twitter, encoding = "UTF-8", skipNul = TRUE)
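# Count the number of words in each line of each file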
words_blogs <- stri_count_words(data_blogs)
words_news <- stri_count_words(data_news)
words_twitter <- stri_count_words(data_twitter)
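# Compute file sizes in megabytes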
size_blogs <- file.info(en_blogs)$size/1024^2
size_news <- file.info(en_news)$size/1024^2
size_twitter <- file.info(en_twitter)$size/1024^2
summary_table <- data.frame(filename = c("en_US.blogs.txt","en_US.news.txt","en_US.twitter.txt"),
file_size_MB = c(size_blogs, size_news, size_twitter),
num_lines = c(length(data_blogs),length(data_news),length(data_twitter)),
num_words = c(sum(words_blogs),sum(words_news),sum(words_twitter)),
mean_num_words = c(mean(words_blogs),mean(words_news),mean(words_twitter)))
kable(summary_table)
| filename | file_size_MB | num_lines | num_words | mean_num_words |
|---|---|---|---|---|
| en_US.blogs.txt | 200.4242 | 899288 | 37546246 | 41.75108 |
| en_US.news.txt | 196.2775 | 1010242 | 34762395 | 34.40997 |
| en_US.twitter.txt | 159.3641 | 2360148 | 30093410 | 12.75065 |
Before proceeding to build models from the given datasets, we need to clean the data. We will remove special characters, extra whitespace, URLs and other unimportant characters from the text. For this report, we work with a small sample of the data; the full dataset will be used in the later stages of the project.
# Sample the data
set.seed(420)
data.sample <- c(sample(data_blogs, length(data_blogs) * 0.01),
sample(data_news, length(data_news) * 0.01),
sample(data_twitter, length(data_twitter) * 0.01))
# Create corpus and clean the data
corpus <- VCorpus(VectorSource(data.sample))
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
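# Replace URLs and Twitter handles with spaces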
corpus <- tm_map(corpus, toSpace, "(f|ht)tp(s?)://(.*)[.][a-z]+")
corpus <- tm_map(corpus, toSpace, "@[^\\s]+")
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("en"))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
# prepare the word n-gram data
my_corpus <- data.frame(text = unlist(sapply(corpus, `[`, "content")),
                        stringsAsFactors = FALSE)
findNGrams <- function(corp, grams, top) {
ngram <- NGramTokenizer(corp, Weka_control(min = grams, max = grams,
delimiters = " \\r\\n\\t.,;:\"()?!"))
ngram <- data.frame(table(ngram))
ngram <- ngram[order(ngram$Freq, decreasing = TRUE),][1:top,]
colnames(ngram) <- c("Words","Count")
ngram
}
mono_grams <- findNGrams(my_corpus, 1, 10)
bi_grams <- findNGrams(my_corpus, 2, 10)
tri_grams <- findNGrams(my_corpus, 3, 10)
quad_grams <- findNGrams(my_corpus, 4, 10)
Now that we have built four different n-gram models, it is helpful to look at the most frequently occurring grams in each of them.
makePlot <- function(data, label) {
ggplot(data[1:10,], aes(reorder(Words, -Count), Count)) +
labs(x = label, y = "Frequency") +
theme(axis.text.x = element_text(angle = 60, size = 12, hjust = 1)) +
geom_bar(stat = "identity", fill = I("red"))
}
makePlot(mono_grams, "10 Most Common Mono-grams")
makePlot(bi_grams, "10 Most Common Bi-grams")
makePlot(tri_grams, "10 Most Common Tri-grams")
makePlot(quad_grams, "10 Most Common Quad-grams")
So far we have performed exploratory data analysis on the given datasets. The next challenge is to build a predictive model, evaluate it, and build a user-friendly UI in Shiny.
Our predictive algorithm will use an n-gram model with frequency lookup, combined with logistic regression. One possible strategy is to use the trigram model to predict the next word; if no matching trigram can be found, the algorithm would back off to logistic regression to predict the word.
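As a rough illustration of the frequency-lookup part of this plan, the sketch below uses the trigram table built earlier (tri_grams) to find the most frequent completion of a two-word prefix. The function name predict_next is hypothetical, the back-off step is only indicated by a comment, and with the small top-10 table above the lookup will often return NA; the real model would use full frequency tables.
# Hypothetical sketch of a trigram frequency lookup (not the final model)
predict_next <- function(w1, w2, trigrams) {
  prefix <- paste(w1, w2)
  # keep trigrams whose first two words match the input prefix
  matches <- trigrams[startsWith(as.character(trigrams$Words), paste0(prefix, " ")), ]
  if (nrow(matches) == 0) {
    return(NA)  # no matching trigram: back off as described above
  }
  best <- as.character(matches$Words[which.max(matches$Count)])
  # return only the last word of the most frequent matching trigram
  tail(strsplit(best, " ")[[1]], 1)
}
predict_next("happy", "new", tri_grams)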
The user interface of the Shiny app will consist of a text input box that allows the user to enter a phrase. After a short delay, the app will use our algorithm to suggest the most likely next word. We also plan to let the user configure how many suggestions the app displays.
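A bare-bones sketch of what that interface could look like is shown below; the prediction function next_words() is a placeholder for the algorithm described above, and the widget names are assumptions rather than the final design.
library(shiny)

# Minimal UI sketch; next_words() is a placeholder for the prediction algorithm
ui <- fluidPage(
  textInput("phrase", "Enter a phrase:"),
  sliderInput("n", "Number of suggestions:", min = 1, max = 5, value = 3),
  textOutput("suggestions")
)

server <- function(input, output) {
  output$suggestions <- renderText({
    # next_words() would return the most likely next words (placeholder)
    paste(next_words(input$phrase, input$n), collapse = ", ")
  })
}

shinyApp(ui = ui, server = server)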