The objective of this document is to explain the main steps taken towards the creation of a text prediction application.
We received three sets of data, which can be accessed through this link: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
The files included there consist of three different sources:
- News
- Twitter
- Blogs
Each is available in several languages, such as German and Russian; for the sake of this project we will focus on the English dataset.
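As a reference, here is a minimal sketch of how the English files can be fetched and loaded (the file paths assume the standard layout of the zip; wrapping each file in a matrix keeps the nrow() calls below working). The corpus object inspected next is a tm corpus holding these three files as separate documents.
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(url, "Coursera-SwiftKey.zip")
unzip("Coursera-SwiftKey.zip")
#skipNul avoids warnings about embedded NUL characters in the raw files
usBlogs <- as.matrix(readLines("final/en_US/en_US.blogs.txt", skipNul = TRUE))
usNews <- as.matrix(readLines("final/en_US/en_US.news.txt", skipNul = TRUE))
usTW <- as.matrix(readLines("final/en_US/en_US.twitter.txt", skipNul = TRUE))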
As expected, the maximum length of a single entry differs across sources: Twitter is the shortest, capped at 140 characters, and Blogs is the longest.
#Blogs
print(max(nchar(corpus[[1]]$content)))
## [1] 40833
#News
print(max(nchar(corpus[[2]]$content)))
## [1] 11384
#TW
print(max(nchar(corpus[[3]]$content)))
## [1] 140
In terms of number of entries the opposite happens: Twitter is the source with the largest number of rows.
#TW
nrow(usTW)
## [1] 2360148
#News
nrow(usNews)
## [1] 1010242
#Blogs
nrow(usBlogs)
## [1] 899288
The three sources combined add up to 4,269,678 entries. For processing reasons we take a sample with the following characteristics:
print(sampleSize)
##
## Recommended sample size for a population of 4269678 at a 99% confidence level
##
## Population = 4269678
## Confidence level = 99
## Margin of error = 0.01
## Response distribution = 0.5
## Recommended sample size = 16523
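For reference, this recommended size can be reproduced with Cochran's formula plus a finite-population correction (a sketch; the report's sampleSize object may come from a different helper):
N <- 4269678                 #population: total entries across the three sources
z <- qnorm(0.995)            #z-score for a 99% confidence level
e <- 0.01                    #margin of error
p <- 0.5                     #response distribution
n0 <- z^2 * p * (1 - p) / e^2   #infinite-population sample size
round(n0 / (1 + (n0 - 1) / N))  #finite-population correction, gives 16523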
This means we can work with a much smaller dataset while still drawing conclusions that are representative of the full corpus.
In order to build our model we need to be able to test any hypothesis, so we divide the data into two sets: train and test.
All models are built on the train set, and the test set is then used to verify that they work on unseen data.
The split will be 75% train / 25% test.
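A minimal sketch of how such a split could be produced (the seed, the pooled object, and the file names are assumptions; train.csv is the file read in the next step):
set.seed(1234)                         #reproducible split
combined <- c(usBlogs, usNews, usTW)   #pool the three sources
sampled <- sample(combined, 16523)     #recommended sample size from above
trainIdx <- sample(seq_along(sampled), floor(0.75 * length(sampled)))
writeLines(sampled[trainIdx], "train.csv")  #75% goes to the training set
writeLines(sampled[-trainIdx], "test.csv")  #25% goes to the test set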
The process is simple: we take the training file we built and clean it by removing profanity, punctuation, digits, and URLs, as the code below shows.
Once we have our cleaned dataset we can count how many times each word or sequence of words (n-gram) appears in the text. A few words appear very frequently while most appear rarely, and this frequency distribution is the base for the prediction algorithm. Words like the / and / you / that are among the most common in this example.
library(tm)
library(RTextTools)
library(caret)
library(RWeka)
library(wordcloud)
library(SnowballC)
#Use train file as source
df <- readLines("train.csv", skipNul = TRUE)
#Remove profanity and other noise
prof <- read.csv("profanity.csv", header = FALSE, na.strings = c("NA", "NaN", ""))
prof <- as.list(prof)
df <- gsub("http\\S+", "", df)    #drop URLs before punctuation is stripped
df <- gsub("NA", "", df)          #drop literal NA placeholders
df <- gsub("[[:punct:]]", "", df) #drop punctuation
df <- gsub("[[:digit:]]", "", df) #drop numbers
pat <- paste0("\\b(", paste0(prof$V1, collapse = "|"), ")\\b")
df <- gsub(pat, "", df)           #drop words on the profanity list
options(mc.cores = 1) #avoid parallel tokenization issues with RWeka
#n-gram tokenizer: despite the name, it produces all n-grams of size 1 to 3
BigramTokenizer <- function(x) RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 1, max = 3))
#control parameters
dtm.control <- list(
  tokenize = BigramTokenizer,
  tolower = TRUE,
  removePunctuation = TRUE,
  removeNumbers = TRUE,
  stopwords = FALSE,  #keep stop words: they matter for next-word prediction
  stemming = FALSE,   #keep full word forms for prediction
  wordLengths = c(3, Inf))
corpus <- Corpus(VectorSource(df))
corpus <- tm_map(corpus, stripWhitespace, lazy = TRUE)
dtm <- DocumentTermMatrix(corpus, control = dtm.control)
tdm <- TermDocumentMatrix(corpus, control = dtm.control) #same 1-3 gram tokenization
m <- as.matrix(tdm)
# count words
wf <- sort(rowSums(m),decreasing=TRUE)
dm <- data.frame(word = names(wf), freq=wf)
hist(wf)
wordcloud(dm$word, dm$freq, random.order=FALSE, colors=brewer.pal(8, "Dark2"))
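As a quick check on the claim above, the most frequent terms can be listed directly:
#top of the frequency table; common stop words such as "the", "and",
#"you" and "that" dominate
head(dm, 10)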
In order to predict which word is most likely to come after the one typed by a user, the approach I am going to take is based on Markov chains, which are well explained here:
“Using an N-gram model, [we] can use a markov chain to generate text where each new word or character is dependent on the previous word (or character) or sequence of words (or characters). For example, given the phrase “I have to” we might say the next word is 50% likely to be “go”, 30% likely to be “run” and 20% likely to be “pee.” We can construct these word sequence probabilities based on a large corpus of source texts.” - Daniel Shiffman
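As an illustration, here is a minimal sketch of that idea on top of the n-gram counts built above (predictNext is a hypothetical helper; it assumes dm holds space-separated lowercase n-grams with their frequencies, and that the typed phrase contains no regex metacharacters):
predictNext <- function(phrase, ngrams) {
  #keep only n-grams that start with the typed phrase
  hits <- ngrams[grepl(paste0("^", phrase, " "), ngrams$word), ]
  if (nrow(hits) == 0) return(NA)
  #the relative frequency of each continuation approximates the
  #Markov transition probability; return the most likely one
  top <- hits$word[which.max(hits$freq)]
  sub(paste0("^", phrase, " "), "", top)
}
predictNext("i have", dm)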
The next step is to build this prediction model on top of the n-gram frequencies computed from the training set, and to validate its accuracy against the test set.
Below are some of the main sources used to gain a better understanding of the text mining process required for this project:
- Basic word cloud example: http://www.webmining.cl/2012/07/text-mining-de-twitter-usando-r/
- Explanation of the tm package: https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf
- N-grams explanation: http://shiffman.net/teaching/a2z/generate/#ngrams
- Examples of text mining: http://www.rdatamining.com/examples/text-mining