This is the Milestone Report for the Capstone project of Coursera's Data Science Specialization. The main goal of the capstone project is to develop a predictive text application that predicts the next word as the user types a sentence. In this report I describe the main features of the data and briefly summarize my plans for creating the prediction algorithm and Shiny app. The motivation for this project is to:
1. Demonstrate that the data sets have been downloaded and successfully loaded into R
2. Create a basic report of summary statistics about the data sets
3. Report any interesting findings
4. Get feedback on plans for creating a prediction algorithm and Shiny app
We downloaded the zip file containing the text files from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip. The data sets consist of text from three different sources: 1) News, 2) Blogs, and 3) Twitter feeds. The text data is provided in several languages, but in this project we will concentrate on the en_US files.
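For reproducibility, here is a minimal sketch of the download and extraction step (the destination file name is an arbitrary choice; unzipping creates the final/en_US/ folder used below):
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!dir.exists("final/en_US")) {
  download.file(url, destfile = "Coursera-SwiftKey.zip", mode = "wb")
  unzip("Coursera-SwiftKey.zip")  # extracts the final/ directory with the language subfolders
}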
library(tm)
## Loading required package: NLP
library(ngram)
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
library(stringi)
blogs <- readLines(file("final/en_US/en_US.blogs.txt","rb"))
news <- readLines(file("final/en_US/en_US.news.txt","rb"))
twitter <- readLines(file("final/en_US/en_US.twitter.txt","rb"))
## Warning in readLines(file("final/en_US/en_US.twitter.txt", "rb")): line 167155
## appears to contain an embedded nul
## Warning in readLines(file("final/en_US/en_US.twitter.txt", "rb")): line 268547
## appears to contain an embedded nul
## Warning in readLines(file("final/en_US/en_US.twitter.txt", "rb")): line 1274086
## appears to contain an embedded nul
## Warning in readLines(file("final/en_US/en_US.twitter.txt", "rb")): line 1759032
## appears to contain an embedded nul
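These embedded-nul warnings are harmless for our purposes; an alternative (not used above) is to drop the nuls silently with the skipNul argument of readLines():
twitter <- readLines(file("final/en_US/en_US.twitter.txt", "rb"), skipNul = TRUE)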
# Get file sizes
blogs.size <- file.info("final/en_US/en_US.blogs.txt")$size / 1024 ^ 2
news.size <- file.info("final/en_US/en_US.news.txt")$size / 1024 ^ 2
twitter.size <- file.info("final/en_US/en_US.twitter.txt")$size / 1024 ^ 2
## Count words in each file
blogs.words <- stri_count_words(blogs)
news.words <- stri_count_words(news)
twitter.words <- stri_count_words(twitter)
## Summarize file sizes, line counts, and word counts
data.frame(filename = c("blogs", "news", "twitter"),
           filesize = c(blogs.size, news.size, twitter.size),
           line_count = c(length(blogs), length(news), length(twitter)),
           word_count = c(sum(blogs.words), sum(news.words), sum(twitter.words)))
## filename filesize line_count word_count
## 1 blogs 200.4242 899288 38154238
## 2 news 196.2775 1010242 35010782
## 3 twitter 159.3641 2360148 30218125
Considering that the data files are very large, we will create a data sample by randomly choosing about 1% of the blog and news lines and 0.5% of the tweets, and then perform some data cleaning on that sample before doing an exploratory analysis.
set.seed(124)
sam_twitter <- sample(twitter, floor(length(twitter) * 0.005))  # uniform 0.5% sample of tweets
length(sam_twitter)
## [1] 11800
set.seed(124)
sam_blogs <- sample(blogs, floor(length(blogs) * 0.01))  # uniform 1% sample of blog lines
length(sam_blogs)
## [1] 8992
set.seed(124)
sam_news <- sample(news, floor(length(news) * 0.01))  # uniform 1% sample of news lines
length(sam_news)
## [1] 10102
rm(blogs, news, twitter)
Next we combine the three samples and clean the text: lowercase everything and remove punctuation and numbers.
data <- c(sam_twitter, sam_news, sam_blogs)
dataCh <- paste(data, collapse = " ")  # collapse the sample lines into one string for ngram()
proData <- preprocess(dataCh, case = "lower", remove.punct = TRUE,
                      remove.numbers = TRUE, fix.spacing = TRUE)
We will perform exploratory analysis on the data sample to analyze the frequency of terms. We use the ngram() function from the ngram package to build the different n-grams from the corpus and get.phrasetable() to tabulate the frequency of each n-gram. We then plot the ten most frequent bigrams, trigrams, and quadgrams with ggplot2.
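## Bigram Analysis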
ng2 <- ngram(proData, n=2)
ng2freq <- get.phrasetable(ng2)
big2_10 <- head(ng2freq,10)
g <- ggplot(big2_10, aes(x=reorder(ngrams,freq), y=freq, fill=ngrams))
g <- g + geom_bar(stat="identity")+ coord_flip() + xlab("Bigram")
g <- g + ylab("Frequency")+ labs(title="Top 10 Bigrams")
print(g)
## Trigram Analysis
ng3 <- ngram(proData, n=3)
ng3freq <- get.phrasetable(ng3)
big3Top <- head(ng3freq,10)
g <- ggplot(big3Top, aes(x=reorder(ngrams,freq), y=freq, fill=ngrams))
g <- g + geom_bar(stat="identity")+ coord_flip() + xlab("Trigram")
g <- g + ylab("Frequency")+ labs(title="Top 10 Trigrams")
print(g)
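## Quadgram Analysis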
ng4 <- ngram(proData, n=4)
ng4freq <- get.phrasetable(ng4)
big4_10 <- head(ng4freq,10)
g <- ggplot(big4_10, aes(x=reorder(ngrams,freq), y=freq, fill=ngrams))
g <- g + geom_bar(stat="identity")+ coord_flip() + xlab("Quadgram")
g <- g + ylab("Frequency")+ labs(title="Top 10 Quadgrams")
print(g)
This concludes our exploratory analysis. The next steps of this capstone project are to finalize our predictive algorithm and deploy it as a Shiny app.
Our predictive algorithm will use an n-gram model with frequency lookup, similar to the exploratory analysis above. One possible strategy is to use the trigram model to predict the next word: if no matching trigram can be found, the algorithm backs off to the bigram model, and then to the unigram model if needed, as in the sketch below.
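A minimal sketch of this backoff lookup, reusing the phrasetables built above (predict_next() is a hypothetical helper, and ng1freq denotes a unigram phrasetable built the same way as ng2freq and ng3freq):
predict_next <- function(phrase, ng3freq, ng2freq, ng1freq) {
  # In practice the input phrase should get the same preprocess() treatment as the corpus
  words <- strsplit(trimws(tolower(phrase)), "\\s+")[[1]]
  # Trigram lookup: match the last two typed words as a prefix
  if (length(words) >= 2) {
    prefix <- paste(tail(words, 2), collapse = " ")
    hits <- ng3freq[startsWith(ng3freq$ngrams, paste0(prefix, " ")), ]
    if (nrow(hits) > 0) return(strsplit(trimws(hits$ngrams[1]), " ")[[1]][3])
  }
  # Back off to the bigram lookup: match the last typed word
  if (length(words) >= 1) {
    hits <- ng2freq[startsWith(ng2freq$ngrams, paste0(tail(words, 1), " ")), ]
    if (nrow(hits) > 0) return(strsplit(trimws(hits$ngrams[1]), " ")[[1]][2])
  }
  # Final fallback: the single most frequent unigram
  trimws(ng1freq$ngrams[1])
}
predict_next("thanks for the", ng3freq, ng2freq, ng1freq)
This works because get.phrasetable() returns the n-grams sorted by frequency, so the first prefix match is also the most frequent continuation.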
The user interface of the Shiny app will consist of a text input box that lets the user enter a phrase. After a short delay, the app will use our algorithm to suggest the most likely next word. We also plan to let the user configure how many words the app should suggest; a bare-bones sketch of this interface follows.
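This is only a sketch of the planned app, assuming a predict_next() helper like the one above (the 500 ms debounce and the suggestion-count control are placeholder choices):
library(shiny)
ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("phrase", "Enter a phrase:"),
  numericInput("n_words", "Number of suggestions:", value = 1, min = 1, max = 5),
  textOutput("prediction")
)
server <- function(input, output) {
  typed <- debounce(reactive(input$phrase), 500)  # short delay before predicting
  output$prediction <- renderText({
    req(typed())
    # honoring n_words would require predict_next() to return multiple candidates
    predict_next(typed(), ng3freq, ng2freq, ng1freq)
  })
}
shinyApp(ui, server)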