The objective of this project is to model the prediction function performed by the SwiftKey keyboard app. The input to the model is a sequence of n words, and the model predicts the next most probable/appropriate word. The project aims to model this prediction function efficiently and to apply data science concepts to the domain of Natural Language Processing.
The data can be downloaded from:
[https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip]
library(tm)
library(RWeka)
library(ggplot2)
library(stringi)
library(qdap)
library(wordcloud)
setwd("~/Capstone Project")
# reading each dataset and sampling lines from each (5,000 blogs, 10,000 tweets, 5,000 news entries)
blogs <- readLines("final/en_US/en_US.blogs.txt", encoding="UTF-8")
blogs <- sample(blogs,5000)
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding="UTF-8")
twitter <- sample(twitter,10000)
news <- readLines("final/en_US/en_US.news.txt", encoding="UTF-8")
news <- sample(news,5000)
The data used for the project comes from three sources: blogs, news articles, and Twitter.
| File Name | Number of Lines | Size |
|---|---|---|
| en_US.blogs.txt | 899,288 | 201 MB |
| en_US.news.txt | 1,010,242 | 197 MB |
| en_US.twitter.txt | 2,360,148 | 160 MB |
Because the full dataset is large and complex, a smaller sample is drawn that is intended to be representative of the whole.
The sample consists of 10,000 lines from the Twitter file and 5,000 lines each from the blogs and news files.
The loaded data contains one line per tweet, blog post, or news article, and a single line may contain multiple sentences. These entries must therefore be split so that each line holds exactly one sentence.
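A minimal sketch of this step, assuming the sampled vectors `blogs`, `news`, and `twitter` defined above and using `sent_detect()` from the already-loaded qdap package:

```r
# combine the three samples and split each entry into individual sentences
corpus_text <- c(blogs, news, twitter)
sentences   <- unlist(lapply(corpus_text, sent_detect))
head(sentences)
```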
Finally, the processed data is stored as a VCorpus, Plain Text Documents, and a Term Document Matrix, the structures required for the text-mining steps that follow (a sketch of this pipeline appears below the bad-words link).
Bad Words list - [https://github.com/shutterstock/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words]
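A sketch of this pipeline, assuming the `sentences` vector from the previous step and that the bad-words list has been saved locally as `bad_words.txt` (the file name is an assumption):

```r
# build a VCorpus from the sentences and clean it with standard tm transformations
bad_words <- readLines("bad_words.txt", encoding = "UTF-8")

corpus <- VCorpus(VectorSource(sentences))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, bad_words)   # profanity filtering
corpus <- tm_map(corpus, stripWhitespace)

# term-document matrix of unigram counts for the exploratory analysis below
tdm <- TermDocumentMatrix(corpus)
```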
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 8.00 14.00 15.36 21.00 132.00
This graph depicts the relationship between the number of unique words and the percentage of all word instances in the dataset that they cover.
From this graph, the trade-off between dictionary size and percent coverage can be analyzed.
It shows clearly that the number of unique words required rises steeply as the percent coverage increases.
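The coverage calculation behind the graph can be sketched as follows, assuming the unigram term-document matrix `tdm` built above (converting it to a dense matrix is feasible only because the sample is small):

```r
# sorted unigram frequencies and their cumulative share of all word instances
word_freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
coverage  <- cumsum(word_freq) / sum(word_freq)

# number of unique words needed to reach a given coverage level
words_needed <- function(p) which(coverage >= p)[1]
words_needed(0.5)   # words covering 50% of all instances
words_needed(0.9)   # words covering 90% of all instances
```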
Questions and bottlenecks to consider:
Less than 1% of the dataset has been taken in this sample, so much of the language may be missing from it. A second sample of the same size, and a larger sample of roughly 10% of the dataset, should be evaluated to weigh the improvement in prediction performance against the decrease in speed.
Word, 2-gram, and 3-gram frequencies have been calculated. These must now be used to estimate probabilities for the n-grams.
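A sketch of the 2-gram calculation using RWeka's `NGramTokenizer` (3-grams are analogous with `min = 3, max = 3`), reusing the cleaned `corpus` and the unigram frequencies `word_freq` from the sketches above; the maximum-likelihood estimate shown is one way the probabilities could be computed:

```r
# bigram counts via a custom tokenizer
bigram_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
bigram_tdm  <- TermDocumentMatrix(corpus, control = list(tokenize = bigram_tokenizer))
bigram_freq <- sort(rowSums(as.matrix(bigram_tdm)), decreasing = TRUE)

# maximum-likelihood estimate: P(w2 | w1) = count(w1 w2) / count(w1)
first_word  <- sub(" .*$", "", names(bigram_freq))
bigram_prob <- bigram_freq / word_freq[first_word]
head(sort(bigram_prob, decreasing = TRUE))
```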
Given a string w1 … wi-1, the word wi that maximizes P(wi | wi-n+1 … wi-1) must be chosen as the prediction, where n is the maximum n-gram order.
This probability calculation suffers when phrases not present in the training data are encountered, since the corresponding n-gram counts are zero.
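As an illustration, a very simple prediction function (shown with bigrams only, for brevity) could back off to the most frequent unigram when the preceding word has never been seen; the actual smoothing/back-off strategy for the final model is still to be decided:

```r
# predict the next word from the bigram probabilities computed above;
# fall back to the overall most frequent word for unseen contexts
predict_next <- function(prev_word) {
  candidates <- bigram_prob[first_word == prev_word]
  if (length(candidates) == 0 || all(is.na(candidates))) {
    return(names(word_freq)[1])            # unseen context: back off to top unigram
  }
  best <- names(which.max(candidates))     # e.g. "of the"
  sub("^\\S+ ", "", best)                  # return only the predicted word
}

predict_next("in")
```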