The Capstone project demonstrates a data scientist’s ability to process and analyze large data sets. The goal of the project is to develop an algorithm that predicts the next word on the basis of the previous words, similar to the predictive text feature on today’s smartphones.

This report provides an overview of the exploratory analysis of the text data used for the capstone project. The goals of this milestone report are to:
1. Demonstrate that you’ve downloaded the data and have successfully loaded it in.
2. Create a basic report of summary statistics about the data sets.
3. Report any interesting findings that you have amassed so far.
4. Get feedback on your plans for creating a prediction algorithm and Shiny app.

Getting and Cleaning Data

The data set was downloaded from the Coursera-SwiftKey dataset.
It consists of four folders with data in different languages. We’ll be using the English-language data stored in the en_US folder, which contains three files with text from Twitter, blogs, and news.
Downloading the data set:

fUrl <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"

if(!file.exists("data.zip")){
  download.file(fUrl, "data.zip", method = "curl")
  unzip("data.zip")
}

Since the data has now been downloaded and unzipped, let’s move forward and load the required libraries.

library(tm)
## Loading required package: NLP
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(RWeka)
library(wordcloud)
## Loading required package: RColorBrewer
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate
# Load the three English data files into the environment.
files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")
blog <- readLines(files[1], skipNul = TRUE)
news <- readLines(files[2], skipNul = TRUE)
twitter <- readLines(files[3], skipNul = TRUE)

size <- sapply(files, file.size) / 1024^2
files <- list(blog = blog, news = news, twitter = twitter)
# Remove the individual vectors; they are now stored in the files list.
rm(blog)
rm(news)
rm(twitter)

lcount <- sapply(files, length)
counts <- data.frame(fileSizeInMb = size, lines = lcount)
counts
##                   fileSizeInMb   lines
## en_US.blogs.txt       200.4242  899288
## en_US.news.txt        196.2775 1010242
## en_US.twitter.txt     159.3641 2360148

Here we can see that the data is really large: each file is 150-200 MB and contains between roughly 0.9 and 2.4 million lines.
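A rough word count per file could also be added to the summary. The sketch below is only an approximation that assumes the files list created above is still in memory and splits each line on whitespace; the wcount name is illustrative and not part of the original analysis.

# Approximate word count per file (split lines on whitespace).
wcount <- sapply(files, function(x) sum(lengths(strsplit(x, "\\s+"))))
counts$words <- wcount
counts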

Word Frequency

Here we’ll process the data and see which words appear most frequently.
Since the data is really large, let’s take only a 0.5% sample of each file for this analysis.

set.seed(12121)
samp <- c(sample(files$blog, round(0.005 * lcount[1]), replace = FALSE),
          sample(files$news, round(0.005 * lcount[2]), replace = FALSE),
          sample(files$twitter, round(0.005 * lcount[3]), replace = FALSE))
rm(files)

Build Corpus

At this stage, the sampled text is converted to a single Corpus object and the following transformations are performed:
- text is converted to lower case
- extra white space is removed
- numbers are removed
- punctuation is removed
- the data is converted to standard ASCII
Stop words are not removed here because the prediction algorithm should handle natural language input, in which stop words occur frequently.

# Convert to ASCII, which drops emoticons and other non-ASCII characters.
samp <- iconv(samp, "UTF-8", "ASCII")
# Create a corpus from the sampled text.
corpus <- VCorpus(VectorSource(as.data.frame(samp, stringsAsFactors = FALSE)))
corpus <- corpus %>% 
  tm_map(content_transformer(tolower)) %>% 
  tm_map(stripWhitespace) %>% 
  tm_map(removePunctuation) %>% 
  tm_map(removeNumbers)

rm(samp)

Tokenize

The corpus is converted to a TermDocumentMatrix so that we can compute the frequency of each word. These single words are known as tokens.

#Creating tdm
tdm <- TermDocumentMatrix(corpus)

#Most frequent terms:
dfTdm <- data.frame(word = tdm$dimnames$Terms, freq = tdm$v)
odfTdm <- plyr::arrange(dfTdm, -freq)
head(odfTdm)
##   word  freq
## 1  the 18024
## 2  and  9103
## 3  for  4527
## 4  you  3917
## 5 that  3783
## 6 with  2708

Word Cloud

# Word cloud of terms that appear at least 200 times.
wordcloud(dfTdm$word, dfTdm$freq, min.freq = 200, colors = brewer.pal(6, "Dark2"))

Exploratory Analysis

# Bar chart of the 25 most frequent words.
dfTdm <- odfTdm[1:25,]
dfTdm$word <- reorder(dfTdm$word, dfTdm$freq)
g <- ggplot(dfTdm, aes(x = word, y = freq))
g + geom_bar(stat = "identity") + coord_flip() + ggtitle("Single Word Frequency")

Since stop words are not removed, the most frequent words are commonly used words like ‘the’, ‘for’, etc.

Prediction Algorithm Plans

Moving further, the goal is to build a prediction algorithm that predicts the next word from the previous input words. For example, if the input is “go near the”, then the next word is predicted.
The prediction is based not only on the last input word but on the previous one, two, or three words.

Hence n-grams will be used. N-grams are constructed from the corpus; the single-word analysis performed in this report is useful for initial exploration. N-grams of several sizes are needed: unigrams, bigrams, trigrams, and four-grams. A unigram is a single word, a bigram is a two-word phrase, a trigram is a three-word phrase, and a four-gram is a four-word phrase.

Constructing N-grams:

bigram <- NGramTokenizer(corpus, Weka_control(min = 2, max = 2))
biDf <- data.frame(table(bigram))
#order according to decreasing frequency.
obiDf <- biDf[order(biDf$Freq, decreasing = TRUE), ]
biDf <- obiDf[1:25,]
biDf$bigram <- reorder(biDf$bigram, biDf$Freq)
b <- ggplot(biDf, aes(x = bigram, y = Freq))
b + geom_bar(stat = "identity") + coord_flip() + ggtitle("Most Common 25 Bigrams")

trigram <- NGramTokenizer(corpus, Weka_control(min = 3, max = 3))
triDf <- data.frame(table(trigram))
otriDf <- triDf[order(triDf$Freq, decreasing = TRUE), ]
triDf <- otriDf[1:25,]
triDf$trigram <- reorder(triDf$trigram, triDf$Freq)
t <- ggplot(triDf, aes(x = trigram, y = Freq))
t + geom_bar(stat = "identity") + coord_flip() + ggtitle("Most Common 25 Trigrams")
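Four-grams can be tokenized in the same way. The sketch below simply mirrors the bigram and trigram code above and assumes the same corpus object is still in memory; the quadgram names are illustrative only.

quadgram <- NGramTokenizer(corpus, Weka_control(min = 4, max = 4))
quadDf <- data.frame(table(quadgram))
#order according to decreasing frequency.
oquadDf <- quadDf[order(quadDf$Freq, decreasing = TRUE), ]
quadDf <- oquadDf[1:25,]
quadDf$quadgram <- reorder(quadDf$quadgram, quadDf$Freq)
q <- ggplot(quadDf, aes(x = quadgram, y = Freq))
q + geom_bar(stat = "identity") + coord_flip() + ggtitle("Most Common 25 Four-grams")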

Summary

After constructing the various n-grams, a prediction algorithm will be built on top of them and a Shiny app will be developed that allows users to enter text. When the user enters some text, the number of input words is checked first and the lookup proceeds accordingly (a rough sketch of this lookup is shown after the list):
- If the input is one word, it is looked up in the bigram table and the next word from the most frequent matching bigram is returned.
- If the input is two words, they are looked up in the trigram table and the next word from the most frequent matching trigram is returned.
- If the input is three words, they are looked up in the four-gram table and the next word from the most frequent matching four-gram is returned.
- If the input is more than three words, only the last three words are considered, looked up in the four-gram table, and the next word from the most frequent matching four-gram is returned.
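As a rough illustration of that lookup, and not the final implementation, the sketch below assumes the n-gram frequency tables have been split into a prefix column (all words but the last) and a nextWord column; those column names and the predictNext function are placeholders. For brevity it backs off only from trigrams to bigrams.

# Hypothetical next-word lookup with a simple back-off from trigrams to bigrams.
predictNext <- function(input, bigramTab, trigramTab) {
  words <- unlist(strsplit(tolower(input), "\\s+"))
  n <- length(words)
  if (n >= 2) {
    # Try the trigram table first, using the last two words as the prefix.
    prefix <- paste(words[(n - 1):n], collapse = " ")
    hit <- trigramTab[trigramTab$prefix == prefix, ]
    if (nrow(hit) > 0) return(hit$nextWord[which.max(hit$Freq)])
  }
  # Fall back to the bigram table, using only the last word.
  hit <- bigramTab[bigramTab$prefix == words[n], ]
  if (nrow(hit) > 0) return(hit$nextWord[which.max(hit$Freq)])
  NA_character_
}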