The objective of this project is to model the prediction function performed by the SwiftKey keyboard app. The input to the model is a sequence of n words, and the model predicts the next most probable/appropriate word. The project aims to model this prediction function efficiently and to apply data science concepts to the domain of Natural Language Processing.
The data can be downloaded from:
[https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip]
library(tm)
library(RWeka)
library(ggplot2)
library(stringi)
library(qdap)
library(wordcloud)
setwd("~/Capstone Project")
# reading each dataset and sampling lines from each (5,000 blogs, 10,000 tweets, 5,000 news entries)
blogs <- readLines("final/en_US/en_US.blogs.txt", encoding="UTF-8")
blogs <- sample(blogs,5000)
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding="UTF-8")
twitter <- sample(twitter,10000)
news <- readLines("final/en_US/en_US.news.txt", encoding="UTF-8")
news <- sample(news,5000)
The data used for the project comes from three sources: blogs, news articles, and Twitter.
| File Name | Number of Lines | Size |
|---|---|---|
| en_US.blogs.txt | 899,288 | 201 MB |
| en_US.news.txt | 1,010,242 | 197 MB |
| en_US.twitter.txt | 2,360,148 | 160 MB |
Because the full dataset is large and complex, a smaller sample is drawn that is intended to be representative of the whole.
The sample consists of 10,000 lines from the Twitter file and 5,000 lines each from the blogs and news files.
The loaded data contains one line per tweet, blog post, or news article, and a single line may contain multiple sentences. These entries must therefore be split so that each line holds exactly one sentence.
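A minimal sketch of this step, assuming the sampled vectors `blogs`, `news`, and `twitter` defined above and using `sent_detect()` from the already-loaded qdap package:

```r
# combine the three samples and split each entry into individual sentences
corpus_text <- c(blogs, news, twitter)
sentences   <- unlist(lapply(corpus_text, sent_detect))
head(sentences)
```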
Finally, the processed data is stored as a VCorpus, Plain Text Documents, and a Term Document Matrix, the structures required for the text-mining steps that follow (a sketch of this pipeline appears below the bad-words link).
Bad Words list - [https://github.com/shutterstock/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words]
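A sketch of this pipeline, assuming the `sentences` vector from the previous step and that the bad-words list has been saved locally as `bad_words.txt` (the file name is an assumption):

```r
# build a VCorpus from the sentences and clean it with standard tm transformations
bad_words <- readLines("bad_words.txt", encoding = "UTF-8")

corpus <- VCorpus(VectorSource(sentences))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, bad_words)   # profanity filtering
corpus <- tm_map(corpus, stripWhitespace)

# term-document matrix of unigram counts for the exploratory analysis below
tdm <- TermDocumentMatrix(corpus)
```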
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 8.00 14.00 15.36 21.00 132.00
This graph depicts the relationship between the number of unique words and the percentage of all word instances in the dataset that they cover.
From this graph, the trade-off between dictionary size and percent coverage can be analyzed.
It shows clearly that the number of unique words required rises steeply as the percent coverage increases.
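The coverage calculation behind the graph can be sketched as follows, assuming the unigram term-document matrix `tdm` built above (converting it to a dense matrix is feasible only because the sample is small):

```r
# sorted unigram frequencies and their cumulative share of all word instances
word_freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
coverage  <- cumsum(word_freq) / sum(word_freq)

# number of unique words needed to reach a given coverage level
words_needed <- function(p) which(coverage >= p)[1]
words_needed(0.5)   # words covering 50% of all instances
words_needed(0.9)   # words covering 90% of all instances
```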
Questions and bottlenecks to consider:
Less than 1% of the dataset has been taken in this sample, so much of the language may be missing from it. A second sample of the same size, and a larger sample of roughly 10% of the dataset, should be evaluated to weigh the improvement in prediction performance against the decrease in speed.
Word, 2-gram, and 3-gram frequencies have been calculated. These must now be used to estimate probabilities for the n-grams.
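A sketch of the 2-gram calculation using RWeka's `NGramTokenizer` (3-grams are analogous with `min = 3, max = 3`), reusing the cleaned `corpus` and the unigram frequencies `word_freq` from the sketches above; the maximum-likelihood estimate shown is one way the probabilities could be computed:

```r
# bigram counts via a custom tokenizer
bigram_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
bigram_tdm  <- TermDocumentMatrix(corpus, control = list(tokenize = bigram_tokenizer))
bigram_freq <- sort(rowSums(as.matrix(bigram_tdm)), decreasing = TRUE)

# maximum-likelihood estimate: P(w2 | w1) = count(w1 w2) / count(w1)
first_word  <- sub(" .*$", "", names(bigram_freq))
bigram_prob <- bigram_freq / word_freq[first_word]
head(sort(bigram_prob, decreasing = TRUE))
```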
Given a string w1 … wi-1, the word wi that maximizes P(wi | wi-n+1 … wi-1) must be chosen as the prediction, where n is the maximum n-gram order.
This probability calculation suffers when phrases not present in the training data are encountered, since the corresponding n-gram counts are zero.
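As an illustration, a very simple prediction function (shown with bigrams only, for brevity) could back off to the most frequent unigram when the preceding word has never been seen; the actual smoothing/back-off strategy for the final model is still to be decided:

```r
# predict the next word from the bigram probabilities computed above;
# fall back to the overall most frequent word for unseen contexts
predict_next <- function(prev_word) {
  candidates <- bigram_prob[first_word == prev_word]
  if (length(candidates) == 0 || all(is.na(candidates))) {
    return(names(word_freq)[1])            # unseen context: back off to top unigram
  }
  best <- names(which.max(candidates))     # e.g. "of the"
  sub("^\\S+ ", "", best)                  # return only the predicted word
}

predict_next("in")
```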