Capstone Milestone Report

Introduction

The objective of this document is to explain the main steps taken towards the creation of a text prediction application.

Obtaining the Data

We received three sets of data, which can be accessed through this link: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

The files included there consist of three different sources:
- News
- Twitter
- Blogs

Each is available in several languages, such as German and Russian. For the purposes of this project we will focus on the English dataset.

Understanding the Data

As expected, the length of the individual entries differs across the three files: Twitter is the shortest, with a maximum of 140 characters, and Blogs has the longest entries.

#Blogs
print(max(nchar(corpus[[1]]$content)))
## [1] 40833
#News
print(max(nchar(corpus[[2]]$content)))
## [1] 11384
#TW
print(max(nchar(corpus[[3]]$content)))
## [1] 140

The opposite happens with the number of entries: Twitter has the largest number of rows.

#TW
nrow(usTW)
## [1] 2360148
#News
nrow(usNews)
## [1] 1010242
#Blogs
nrow(usBlogs)
## [1] 899288

For processing reasons we will work with a sample that has the following characteristics:

print(sampleSize)
## 
##      Recommended sample size for a population of 4269678 at a 99% confidence level 
## 
##              Population = 4269678
##        Confidence level = 99
##         Margin of error = 0.01
##   Response distribution = 0.5
## Recommended sample size = 16523

In practice this means we will work with a much smaller dataset that still represents the full corpus, at a 99% confidence level with a 1% margin of error.
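The recommended sample size above can be reproduced with Cochran's formula plus a finite population correction; the result matches the 16523 reported in the output. The sketch below is only illustrative, and the function name sample_size is my own rather than part of the project code.

# Cochran's formula with a finite population correction (illustrative sketch)
sample_size <- function(N, conf = 0.99, moe = 0.01, p = 0.5) {
  z  <- qnorm(1 - (1 - conf) / 2)   # z-score for the confidence level
  n0 <- z^2 * p * (1 - p) / moe^2   # sample size for an infinite population
  round(n0 / (1 + (n0 - 1) / N))    # correct for the finite population N
}
sample_size(N = 2360148 + 1010242 + 899288)
## [1] 16523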

Test / Train

In order to build our model we need to be able to test any hypothesis, so we divide the data into two sets: train and test.

The models will be built on the train set, and the test set will be used to verify that they actually work on unseen data.

The split will be: 75% train / 25% test.
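A simple random draw is enough for this split. The sketch below assumes the sampled lines are stored in a character vector called sampleLines; the variable names and the test file name are illustrative, while train.csv is the file read in the next section.

# Illustrative 75/25 split of the sampled lines
set.seed(1234)                                   # make the split reproducible
trainIdx <- sample(seq_along(sampleLines), size = floor(0.75 * length(sampleLines)))
train    <- sampleLines[trainIdx]                # 75% used to build the models
test     <- sampleLines[-trainIdx]               # 25% held out for evaluation
writeLines(train, "train.csv")
writeLines(test, "test.csv")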

Building n-grams

The process is simple, we take the training file we built, and we need to perform the following actions:

  • Remove profanity
  • Remove any blank spaces
  • Remove punctuation, urls & numbers
  • Turn everything to lowercase

Once we have our cleaned data set we are able to count how many times each word or set of words (n-gram) appears in the text. A few words appear very frequently while most appear rarely, and this distribution will be the base for the prediction algorithm. Words like the / and / you / that are among the most common words in this example.

library(tm)
library(RTextTools)
library(caret)
library(RWeka)
library(wordcloud)
library(SnowballC)

#Use the train file as source
df <- readLines("train.csv", -1, skipNul = TRUE)

#Basic cleaning: drop NA markers, punctuation, digits and URLs
df <- gsub("NA", "", df)
df <- gsub("[[:punct:]]", "", df)
df <- gsub("[[:digit:]]", "", df)
df <- gsub("http\\w+", "", df)

#Remove profanity using a word list
prof <- read.csv("profanity.csv", header = FALSE, na.strings = c("NA", "NaN", ""))
pat  <- paste0("\\b(", paste0(prof$V1, collapse = "|"), ")\\b")
df   <- gsub(pat, "", df)
options(mc.cores=1)
# Tokenizer that produces n-grams of length 1 to 3 (unigrams to trigrams)
BigramTokenizer <- function(x) RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 1, max = 3))


#Control parameters for the document-term matrix
dtm.control <- list(
  tokenize          = BigramTokenizer,
  tolower           = TRUE,
  removePunctuation = TRUE,
  removeNumbers     = TRUE,
  stopwords         = FALSE,  # keep stop words, they matter for prediction
  stemming          = FALSE,  # keep the original word forms
  wordLengths       = c(3, Inf))

corpus <- Corpus(VectorSource(df))
corpus <- tm_map(corpus, stripWhitespace, lazy = TRUE)
dtm <- DocumentTermMatrix(corpus, control = dtm.control)  # 1- to 3-gram matrix
tdm <- TermDocumentMatrix(corpus)                         # single-word matrix used for the counts below
m   <- as.matrix(tdm)

# Count word frequencies
wf <- sort(rowSums(m), decreasing = TRUE)
dm <- data.frame(word = names(wf), freq = wf)
hist(wf)

wordcloud(dm$word, dm$freq, random.order = FALSE, colors = brewer.pal(8, "Dark2"))
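To check the claim about the most common words, the frequency table built above can simply be inspected; this is a minimal check rather than part of the original pipeline.

# Show the ten most frequent terms
head(dm, 10)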

Next Steps

In order to predict which word is most likely to come after the one typed by a user, the approach I am going to take is based on Markov chains, which is better explained here:

“Using an N-gram model, can use a markov chain to generate text where each new word or character is dependent on the previous word (or character) or sequence of words (or characters). For example, given the phrase “I have to” we might say the next word is 50% likely to be “go”, 30% likely to be “run” and 20% likely to be “pee.” We can construct these word sequence probabilities based on a large corpus of source texts.” - Daniel Shiffman

Actions that need to be taken (a minimal sketch of the basic lookup follows the list):

  1. Check which word comes after each one and map the probability
  2. Check the probability for n-grams of length 2 or 3
  3. Based on the three approaches (single word, two words and three words), compute an average probability for which word comes next
  4. Create a function that receives a single word and gives an output of 3 words
  5. After receiving the word, return the 3 words most likely to come after it, based on their probability
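As a rough illustration of steps 1, 4 and 5, the sketch below builds a bigram frequency table from the cleaned training lines and looks up the three most likely next words. The names nextWords and train are illustrative only; this is not the final implementation, and it ignores line boundaries for simplicity.

# Build a bigram table from the cleaned training lines (illustrative sketch)
tokens  <- unlist(strsplit(tolower(train), "\\s+"))
tokens  <- tokens[tokens != ""]
bigrams <- data.frame(first  = head(tokens, -1),
                      second = tail(tokens, -1),
                      stringsAsFactors = FALSE)

# Return the n words most frequently seen after `word`
nextWords <- function(word, n = 3) {
  followers <- bigrams$second[bigrams$first == word]
  freq      <- sort(table(followers), decreasing = TRUE)  # counts act as probabilities
  names(head(freq, n))
}

nextWords("i")  # the three words most often typed after "i"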

References

Some of the main sources used to gain a better understanding of the text mining process required for the success of this project:
- Basic word cloud example: http://www.webmining.cl/2012/07/text-mining-de-twitter-usando-r/
- Explanation of the tm package: https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf
- N-grams explanation: http://shiffman.net/teaching/a2z/generate/#ngrams
- Examples of text mining: http://www.rdatamining.com/examples/text-mining