This is the Milestone Report for the Capstone project of Coursera's Data Science Specialization. The main goal of the capstone project is to develop a predictive text application that predicts the next word as the user types a sentence. In this report I describe the main features of the data and briefly summarize my plans for creating the prediction algorithm and Shiny app. The motivation for this project is to:
1. Demonstrate that the data sets have been downloaded and successfully loaded into R
2. Create a basic report of summary statistics about the data sets
3. Report any interesting findings
4. Get feedback on plans for creating a prediction algorithm and Shiny app
We downloaded the zip file containing the text files from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip. The data sets consist of text from three different sources: 1) News, 2) Blogs, and 3) Twitter feeds. The text data is provided in several languages, but in this project we will concentrate on the en_US files.
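For reproducibility, here is a minimal sketch of the download and extraction step (the destination file name is an arbitrary choice; unzipping creates the final/en_US/ folder used below):
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!dir.exists("final/en_US")) {
  download.file(url, destfile = "Coursera-SwiftKey.zip", mode = "wb")
  unzip("Coursera-SwiftKey.zip")  # extracts the final/ directory with the language subfolders
}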
library(tm)
## Loading required package: NLP
library(ngram)
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
library(stringi)
blogs <- readLines(file("final/en_US/en_US.blogs.txt","rb"))
news <- readLines(file("final/en_US/en_US.news.txt","rb"))
twitter <- readLines(file("final/en_US/en_US.twitter.txt","rb"))
## Warning in readLines(file("final/en_US/en_US.twitter.txt", "rb")): line 167155
## appears to contain an embedded nul
## Warning in readLines(file("final/en_US/en_US.twitter.txt", "rb")): line 268547
## appears to contain an embedded nul
## Warning in readLines(file("final/en_US/en_US.twitter.txt", "rb")): line 1274086
## appears to contain an embedded nul
## Warning in readLines(file("final/en_US/en_US.twitter.txt", "rb")): line 1759032
## appears to contain an embedded nul
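These embedded-nul warnings are harmless for our purposes; an alternative (not used above) is to drop the nuls silently with the skipNul argument of readLines():
twitter <- readLines(file("final/en_US/en_US.twitter.txt", "rb"), skipNul = TRUE)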
# Get file sizes
blogs.size <- file.info("final/en_US/en_US.blogs.txt")$size / 1024 ^ 2
news.size <- file.info("final/en_US/en_US.news.txt")$size / 1024 ^ 2
twitter.size <- file.info("final/en_US/en_US.twitter.txt")$size / 1024 ^ 2
## Count words in each file
blogs.words <- stri_count_words(blogs)
news.words <- stri_count_words(news)
twitter.words <- stri_count_words(twitter)
## Summarize file sizes, line counts, and word counts
data.frame(filename = c("blogs", "news", "twitter"),
           filesize = c(blogs.size, news.size, twitter.size),
           line_count = c(length(blogs), length(news), length(twitter)),
           word_count = c(sum(blogs.words), sum(news.words), sum(twitter.words)))
## filename filesize line_count word_count
## 1 blogs 200.4242 899288 38154238
## 2 news 196.2775 1010242 35010782
## 3 twitter 159.3641 2360148 30218125
Considering that the data files are very large, we will create a data sample by randomly choosing about 1% of the blog and news lines and 0.5% of the tweets, and then perform some data cleaning on that sample before doing an exploratory analysis.
set.seed(124)
sam_twitter <- sample(twitter, floor(length(twitter) * 0.005))  # uniform 0.5% sample of tweets
length(sam_twitter)
## [1] 11800
set.seed(124)
sam_blogs <- sample(blogs, floor(length(blogs) * 0.01))  # uniform 1% sample of blog lines
length(sam_blogs)
## [1] 8992
set.seed(124)
sam_news <- sample(news, floor(length(news) * 0.01))  # uniform 1% sample of news lines
length(sam_news)
## [1] 10102
rm(blogs, news, twitter)
Next we combine the three samples and clean the text: lowercase everything and remove punctuation and numbers.
data <- c(sam_twitter, sam_news, sam_blogs)
dataCh <- paste(data, collapse = " ")  # collapse the sample lines into one string for ngram()
proData <- preprocess(dataCh, case = "lower", remove.punct = TRUE,
                      remove.numbers = TRUE, fix.spacing = TRUE)
We will perform exploratory analysis on the data sample to analyze the frequency of terms. We use the ngram() function from the ngram package to build the different n-grams from the corpus and get.phrasetable() to tabulate the frequency of each n-gram. We then plot the ten most frequent bigrams, trigrams, and quadgrams with ggplot2.
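## Bigram Analysis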
ng2 <- ngram(proData, n=2)
ng2freq <- get.phrasetable(ng2)
big2_10 <- head(ng2freq,10)
g <- ggplot(big2_10, aes(x=reorder(ngrams,freq), y=freq, fill=ngrams))
g <- g + geom_bar(stat="identity")+ coord_flip() + xlab("Bigram")
g <- g + ylab("Frequency")+ labs(title="Top 10 Bigrams")
print(g)
## Trigram Analysis
ng3 <- ngram(proData, n=3)
ng3freq <- get.phrasetable(ng3)
big3Top <- head(ng3freq,10)
g <- ggplot(big3Top, aes(x=reorder(ngrams,freq), y=freq, fill=ngrams))
g <- g + geom_bar(stat="identity")+ coord_flip() + xlab("Trigram")
g <- g + ylab("Frequency")+ labs(title="Top 10 Trigrams")
print(g)
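## Quadgram Analysis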
ng4 <- ngram(proData, n=4)
ng4freq <- get.phrasetable(ng4)
big4_10 <- head(ng4freq,10)
g <- ggplot(big4_10, aes(x=reorder(ngrams,freq), y=freq, fill=ngrams))
g <- g + geom_bar(stat="identity")+ coord_flip() + xlab("Quadgram")
g <- g + ylab("Frequency")+ labs(title="Top 10 Quadgrams")
print(g)
This concludes our exploratory analysis. The next steps of this capstone project are to finalize our predictive algorithm and deploy it as a Shiny app.
Our predictive algorithm will use an n-gram model with frequency lookup, similar to the exploratory analysis above. One possible strategy is to use the trigram model to predict the next word: if no matching trigram can be found, the algorithm backs off to the bigram model, and then to the unigram model if needed, as in the sketch below.
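A minimal sketch of this backoff lookup, reusing the phrasetables built above (predict_next() is a hypothetical helper, and ng1freq denotes a unigram phrasetable built the same way as ng2freq and ng3freq):
predict_next <- function(phrase, ng3freq, ng2freq, ng1freq) {
  # In practice the input phrase should get the same preprocess() treatment as the corpus
  words <- strsplit(trimws(tolower(phrase)), "\\s+")[[1]]
  # Trigram lookup: match the last two typed words as a prefix
  if (length(words) >= 2) {
    prefix <- paste(tail(words, 2), collapse = " ")
    hits <- ng3freq[startsWith(ng3freq$ngrams, paste0(prefix, " ")), ]
    if (nrow(hits) > 0) return(strsplit(trimws(hits$ngrams[1]), " ")[[1]][3])
  }
  # Back off to the bigram lookup: match the last typed word
  if (length(words) >= 1) {
    hits <- ng2freq[startsWith(ng2freq$ngrams, paste0(tail(words, 1), " ")), ]
    if (nrow(hits) > 0) return(strsplit(trimws(hits$ngrams[1]), " ")[[1]][2])
  }
  # Final fallback: the single most frequent unigram
  trimws(ng1freq$ngrams[1])
}
predict_next("thanks for the", ng3freq, ng2freq, ng1freq)
This works because get.phrasetable() returns the n-grams sorted by frequency, so the first prefix match is also the most frequent continuation.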
The user interface of the Shiny app will consist of a text input box that lets the user enter a phrase. After a short delay, the app will use our algorithm to suggest the most likely next word. We also plan to let the user configure how many words the app should suggest; a bare-bones sketch of this interface follows.
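This is only a sketch of the planned app, assuming a predict_next() helper like the one above (the 500 ms debounce and the suggestion-count control are placeholder choices):
library(shiny)
ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("phrase", "Enter a phrase:"),
  numericInput("n_words", "Number of suggestions:", value = 1, min = 1, max = 5),
  textOutput("prediction")
)
server <- function(input, output) {
  typed <- debounce(reactive(input$phrase), 500)  # short delay before predicting
  output$prediction <- renderText({
    req(typed())
    # honoring n_words would require predict_next() to return multiple candidates
    predict_next(typed(), ng3freq, ng2freq, ng1freq)
  })
}
shinyApp(ui, server)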