Introduction

The objective of this project is to model the prediction function performed by the SwiftKey keyboard app. The input to the model is a sequence of n words, and the model predicts the most probable/appropriate next word. This project aims to model this prediction function efficiently and to apply data science concepts to the domain of Natural Language Processing.

Obtaining The Data

The data can be downloaded from:

[https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip]
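If the archive is not already present, it can be fetched and unpacked from R. A minimal sketch (the destination file name is an assumption; the archive unpacks into the final/ directory read below):

download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
              destfile = "Coursera-SwiftKey.zip", mode = "wb")
unzip("Coursera-SwiftKey.zip")   # creates the final/en_US/ files read below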

library(tm)         # corpus handling and text cleaning
library(RWeka)      # n-gram tokenizers
library(ggplot2)    # plotting
library(stringi)    # fast string statistics
library(qdap)       # additional text cleaning helpers
library(wordcloud)  # word clouds

setwd("~/Capstone Project")
# read each dataset and sample it: 5,000 lines from blogs and news, 10,000 from Twitter
set.seed(1234)  # fix the random seed so the samples are reproducible

blogs <- readLines("final/en_US/en_US.blogs.txt", encoding="UTF-8")
blogs <- sample(blogs,5000)

twitter <- readLines("final/en_US/en_US.twitter.txt", encoding="UTF-8")
twitter <- sample(twitter,10000)

news <- readLines("final/en_US/en_US.news.txt", encoding="UTF-8")
news <- sample(news,5000)

Summary Statistics

The data used for the project comes from three sources: blogs, news articles and Twitter.

File Name            Number Of Lines   Size (MB)
en_US.blogs.txt              899,288         201
en_US.news.txt             1,010,242         197
en_US.twitter.txt          2,360,148         160
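The line counts and sizes above can be reproduced with base R; a sketch (it reads the full files, so it is slow):

files <- file.path("final/en_US",
                   c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"))
data.frame(file    = basename(files),
           lines   = sapply(files, function(f) length(readLines(f, encoding = "UTF-8"))),
           size_mb = round(file.size(files) / 1024^2))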

Because the full dataset is large and complex, a smaller sample that is representative of the whole corpus is used.

The sample consists of 10,000 lines from the Twitter file and 5,000 lines each from the blogs and news files.

Data Preprocessing
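
The cleaning code is not shown in this report; a minimal sketch of typical preprocessing with the tm package loaded above could look as follows (the corpus object name is an assumption):

corpus <- VCorpus(VectorSource(c(blogs, twitter, news)))  # combine the three samples
corpus <- tm_map(corpus, content_transformer(tolower))    # lower-case everything
corpus <- tm_map(corpus, removePunctuation)               # drop punctuation
corpus <- tm_map(corpus, removeNumbers)                   # drop digits
corpus <- tm_map(corpus, stripWhitespace)                 # collapse repeated spaces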

Exploratory Analysis

1. Sentence Length

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00    8.00   14.00   15.36   21.00  132.00
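
A sketch of how this summary can be produced, assuming it is based on the number of words per sampled line (stri_count_words comes from the stringi package loaded above):

words_per_line <- stri_count_words(c(blogs, twitter, news))  # words in each sampled line
summary(words_per_line)                                      # five-number summary plus mean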

2. Word Frequencies - 15 Most Frequent Words

Wordcloud Depicting Word Frequencies
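
A sketch of how the word frequencies and the word cloud can be produced from the corpus built in the preprocessing sketch (the names tdm and freqs are assumptions):

tdm   <- TermDocumentMatrix(corpus)                        # term-document matrix
freqs <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)  # total count per word
head(freqs, 15)                                            # 15 most frequent words
wordcloud(names(freqs), freqs, max.words = 100)            # word cloud of top terms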

Frequency Of 2-Grams : Most Frequent Bi-Grams in the Dataset Sample

Wordcloud Depicting Most Frequent Bi-Grams

Frequency Of 3-Grams : Most Frequent Tri-Grams in the Dataset Sample

Wordcloud Depicting Most Frequent Tri-Grams
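
Both the bi-gram and tri-gram counts above can be obtained with RWeka's NGramTokenizer; a sketch for bi-grams follows (tri-grams only change min and max to 3). The names bigram_tok, tdm2 and freq2 are illustrative.

bigram_tok <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tdm2  <- TermDocumentMatrix(corpus, control = list(tokenize = bigram_tok))
freq2 <- sort(rowSums(as.matrix(tdm2)), decreasing = TRUE)
head(freq2, 15)                                  # most frequent bi-grams
wordcloud(names(freq2), freq2, max.words = 50)   # bi-gram word cloud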

Plotting Percent Coverage

This graph depicts the relationship between the number of unique words and the percentage of all word instances in the sample that those words cover.

From this graph, the trade-off between vocabulary size (and therefore model size) and percent coverage can be analyzed.

It can be seen that the number of unique words required grows roughly exponentially as the target percent coverage increases.
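
A sketch of how the coverage curve can be computed from the sorted word frequencies (the freqs vector from the word-frequency sketch; the other names are illustrative):

coverage     <- cumsum(freqs) / sum(freqs)      # cumulative share of all word instances
targets      <- seq(0.1, 0.9, by = 0.1)         # coverage levels to plot
words_needed <- sapply(targets, function(p) which(coverage >= p)[1])
plot(targets * 100, words_needed, type = "b",
     xlab = "Percent coverage", ylab = "Number of unique words required")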

Plans For Final Project

Questions and bottlenecks to consider:

1. Sample Dataset Size

Less than 1% of the dataset was used for this sample, so much of the language may be missing from it. A second sample of the same size, and a larger sample of roughly 10% of the data, should be evaluated to compare the gain in prediction accuracy against the increase in run time.

2. Language Modelling

  • Word, 2-gram and 3-gram frequencies have been calculated. These will be used to estimate probabilities for the n-grams.

  • Given a string w1 … w(i-1), the word wi that maximizes P(wi | w(i-n+1) … w(i-1)) is chosen as the prediction, where n is the order of the largest n-gram used.

3. Back-off Models

  • The above probability estimates break down when the input contains phrases (n-grams) that do not appear in the training sample.

  • Possible solutions to be considered (a toy back-off lookup is sketched at the end of this section):
    1. Katz Back-Off Models
    2. Interpolated Models
    3. Kneser-Ney Models

4. Overall Run-Time

  • After the n-gram probabilities and back-off models have been computed, run time and memory usage should be measured for different sample sizes to find an efficient balance between prediction accuracy, memory footprint and speed.
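
As a toy illustration of the back-off idea only (not Katz, interpolation or Kneser-Ney smoothing), the sketch below predicts a next word from the frequency tables built in the exploratory sketches: freqs (words), freq2 (bi-grams) and a freq3 built like freq2 but with min = 3, max = 3. All object and function names here are illustrative assumptions, not part of the report's code.

predict_next <- function(w1, w2) {
  # try tri-grams starting with "w1 w2", then bi-grams starting with "w2",
  # then fall back to the single most frequent word; the tables are sorted
  # by frequency, so the first match is the most frequent candidate
  tri <- freq3[startsWith(names(freq3), paste(w1, w2, ""))]
  if (length(tri) > 0) return(strsplit(names(tri)[1], " ")[[1]][3])
  bi <- freq2[startsWith(names(freq2), paste(w2, ""))]
  if (length(bi) > 0) return(strsplit(names(bi)[1], " ")[[1]][2])
  names(freqs)[1]
}
predict_next("one", "of")   # example call with the two most recent words typed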