Downloading the Corpus Data

download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip", "Coursera-SwiftKey.zip")
unzip("Coursera-SwiftKey.zip")

Reading Data

## skipNul = TRUE guards against embedded nul characters (notably in the news file)
RawTwitterData <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
RawBlogData <- readLines("final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
RawNewsData <- readLines("final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)

Exploratory Analysis

Line Counts

  • Number of lines in the raw Twitter data
length(RawTwitterData)
## [1] 2360148
  • Number of lines in the raw blog data
length(RawBlogData)
## [1] 899288
  • Number of lines in the raw news data
length(RawNewsData)
## [1] 1010242
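
The same counts can also be gathered in one call; a minimal sketch over the three objects read above:

## Line counts for all three corpora in one pass
sapply(list(twitter = RawTwitterData,
            blogs   = RawBlogData,
            news    = RawNewsData), length)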

Slicing the Data into Training, Dev-Testing, and Final Testing Datasets

training <- 0.8
devtesting <- 0.1
testing <- 0.1

## TwitterData
set.seed(1234)  # fix the RNG so the split is reproducible
trainingSampleSize <- floor(length(RawTwitterData) * training)
devtestingSampleSize <- floor(length(RawTwitterData) * devtesting)
testingSampleSize <- floor(length(RawTwitterData) * testing)

trainingIds <- sample(seq_along(RawTwitterData), trainingSampleSize) # for training
trainingTwitter <- RawTwitterData[trainingIds]
testingTwitter <- RawTwitterData[-trainingIds]
# further split this testing data into dev-testing and final testing data
devtestingIds <- sample(seq_along(testingTwitter), devtestingSampleSize) # for dev-testing
devtestingTwitter <- testingTwitter[devtestingIds]
finaltestingTwitter <- testingTwitter[-devtestingIds]
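
The blog and news corpora need the same 80/10/10 split. A small helper (hypothetical; splitCorpus is not part of the original code) avoids repeating the chunk above:

## Split a corpus into training / dev-testing / final-testing pieces,
## using the same proportions defined above
splitCorpus <- function(corpus, training = 0.8, devtesting = 0.1) {
  trainIds <- sample(seq_along(corpus), floor(length(corpus) * training))
  rest <- corpus[-trainIds]
  devIds <- sample(seq_along(rest), floor(length(corpus) * devtesting))
  list(training = corpus[trainIds],
       devtesting = rest[devIds],
       finaltesting = rest[-devIds])
}

blogSplit <- splitCorpus(RawBlogData)
newsSplit <- splitCorpus(RawNewsData)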

Preprocessing the Training Dataset

We will use the ngram package in R for preprocessing.

Note: For demonstration purposes I will use only the first 10,000 lines of the trainingTwitter dataset.

library(ngram)
SingleStringTwitterData <- paste(trainingTwitter[1:10000], collapse = " ")
PreprocessedData <- preprocess(SingleStringTwitterData, case = "lower",
                               remove.punct = TRUE, remove.numbers = TRUE, fix.spacing = TRUE)
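
Tweets often contain emoji and other non-ASCII characters that preprocess() does not remove. An optional extra step (an assumption here, not part of the original pipeline) is to strip them before the preprocess() call above:

## Drop non-ASCII characters (emoji, curly quotes, etc.); sub = "" deletes
## anything that cannot be converted. Run this before preprocess() above.
SingleStringTwitterData <- iconv(SingleStringTwitterData, from = "UTF-8",
                                 to = "ASCII", sub = "")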

Tokenization

Again we will use the ngram package to create unigrams, bigrams, and trigrams from PreprocessedData.

Creating a Unigram

unigram <- ngram(PreprocessedData, n = 1)
UnigramTable <- get.phrasetable(unigram)
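
get.phrasetable() returns a data frame sorted by frequency, with ngrams, freq, and prop columns; the plotting code below relies on the first two. A quick look:

head(UnigramTable, 3)  # columns: ngrams, freq, prop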

Creating a Histogram from Unigram data

  • We will take only the top 20 most frequently occurring words (unigrams)
library(ggplot2)
plotdata <- UnigramTable[1:20, ]

p <- ggplot(data = plotdata, aes(x = ngrams, y = freq)) +
  geom_bar(stat = "identity", fill = "orange") +
  scale_x_discrete(limits = plotdata$ngrams) +
  theme(axis.text.x = element_text(angle = 60, hjust = 1))
p

Creating a Bigram

bigram <- ngram(PreprocessedData, n = 2)
BigramTable <- get.phrasetable(bigram)

Creating a Histogram from Bigram data

  • We will take only the top 20 most frequently occurring bigrams
plotdata <- BigramTable[1:20, ]

p <- ggplot(data = plotdata, aes(x = ngrams, y = freq)) +
  geom_bar(stat = "identity", fill = "blue") +
  scale_x_discrete(limits = plotdata$ngrams) +
  theme(axis.text.x = element_text(angle = 60, hjust = 1))
p

Creating a Trigram

trigram <- ngram(PreprocessedData, n = 3)
TrigramTable <- get.phrasetable(trigram)

Creating a Histogram from Trigram data

  • We will take only the top 20 most frequently occurring trigrams
plotdata <- TrigramTable[1:20, ]

p <- ggplot(data = plotdata, aes(x = ngrams, y = freq)) +
  geom_bar(stat = "identity", fill = "pink") +
  scale_x_discrete(limits = plotdata$ngrams) +
  theme(axis.text.x = element_text(angle = 60, hjust = 1))
p
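
The three plotting chunks above differ only in the phrase table and the fill colour; a small helper (hypothetical, named plotTopNgrams here) would capture the pattern:

## Plot the top-n most frequent n-grams from a phrase table
plotTopNgrams <- function(phrasetable, n = 20, fill = "orange") {
  plotdata <- phrasetable[1:n, ]
  ggplot(data = plotdata, aes(x = ngrams, y = freq)) +
    geom_bar(stat = "identity", fill = fill) +
    scale_x_discrete(limits = plotdata$ngrams) +
    theme(axis.text.x = element_text(angle = 60, hjust = 1))
}

plotTopNgrams(BigramTable, fill = "blue")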

Way Forward - Building the Algorithm

Case 1: the user enters only 1 word

If the word doesn't exist in the training data, we suggest the top 3 most frequently occurring words in the training data. We can find these easily from the Unigram table.

If the word does exist in the training data, we look for the most frequently occurring bigrams starting with the entered word and take their second words. We suggest the most frequently occurring such words as the probable next word.

Case 2: the user enters 2 words

If the second (last) word doesn't exist in our training data, we suggest the top 3 most frequently occurring words in the training data. We can find these easily from the Unigram table.

If the first word doesn't exist in the training data but the second does, we proceed as if the user had entered only 1 word (the second one) and predict the next word as described in Case 1.

If both words exist in the training data, we check whether they ever appear together in the entered order, i.e. whether the bigram formed by these words exists.

If the bigram does exist, we look for the most frequently occurring trigrams starting with the entered words and take their third words. We suggest the most frequently occurring such third words as the probable next word.

If the bigram doesn't exist, we proceed as if the user had entered only 1 word (the second one) and predict the next word as described in Case 1.

Case 3: the user enters 3 words

Since our model only goes up to trigrams, we keep just the last 2 words and proceed as described in Case 2.

Case 4: the user enters more than 3 words

As in Case 3, we keep just the last 2 words and proceed as described in Case 2.

Case 5: the user enters an empty string

We suggest the top 3 most frequently occurring words in the training data, taken from the Unigram table. A sketch of this backoff logic follows.
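
A minimal sketch of the backoff logic above, assuming the UnigramTable, BigramTable, and TrigramTable built earlier (note that get.phrasetable() leaves a trailing space after each n-gram, which the paste0() prefixes below account for). The function name and details are illustrative, not a final implementation:

## Predict up to k next-word candidates using the n-gram tables built above
predictNextWord <- function(input, k = 3) {
  words <- strsplit(trimws(tolower(input)), "\\s+")[[1]]
  topUnigrams <- trimws(head(UnigramTable$ngrams, k))   # overall fallback

  # Case 5: empty string -> top k unigrams
  if (length(words) == 0 || words[1] == "") return(topUnigrams)

  # Cases 3 and 4: keep only the last two words
  if (length(words) > 2) words <- tail(words, 2)

  if (length(words) == 2) {
    # Case 2: look for trigrams starting with the entered bigram
    prefix <- paste0(paste(words, collapse = " "), " ")
    matches <- TrigramTable$ngrams[startsWith(TrigramTable$ngrams, prefix)]
    if (length(matches) > 0)
      return(trimws(sub(prefix, "", head(matches, k), fixed = TRUE)))
    words <- words[2]  # bigram unseen: back off to the last word
  }

  # Case 1: look for bigrams starting with the entered word
  prefix <- paste0(words[1], " ")
  matches <- BigramTable$ngrams[startsWith(BigramTable$ngrams, prefix)]
  if (length(matches) > 0)
    return(trimws(sub(prefix, "", head(matches, k), fixed = TRUE)))

  topUnigrams  # word unseen in training data: fall back to top unigrams
}

predictNextWord("thanks for")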