Capstone - Milestone Report

A brief report on the data for the predictive text input capstone project

Executive summary

In this project the objective is to predict the word the user is about to type: the user types a few words, and from those words a prediction is made. Since this report is aimed at readers who are not data scientists, I have kept technical terms to a minimum. For the same reason the full code is kept out of the main text and collected in the appendix; only a few short illustrative sketches appear inline.

For the assignment the data was downloaded from the Coursera course website. It consists of three files containing:

  • Blogs
  • News
  • Tweets

The data was processed and cleaned so that it could be used for prediction: special characters were removed. Since the data included some misspelled words and spelling variants, probably because tweets were part of the dataset, an English lexicon downloaded from the CMU website was used to filter out non-words. A profanity filter, a list of words that should be avoided, was used to remove objectionable words from the dataset.

In the end we had a cleaned version of the data in which only words that are part of the English lexicon were retained. From this list of words, pairs and triplets were identified, which would be used to predict the next word. First, let me present a summary of the data obtained from the course website.

Files Summary

File                Number of lines   Number of characters   Number of words   Words per line   Characters per word
en_US.twitter.txt   2,360,148         184,995,018            33,412,144        14               6
en_US.news.txt      1,010,242         221,024,166            37,060,257        37               6
en_US.blogs.txt     899,288           223,335,241            39,951,030        44               6
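
For reference, here is a minimal sketch of how such counts can be computed in R for one of the files (splitting on whitespace is only an approximation of word counting):

blogs <- readLines("en_US.blogs.txt", warn = FALSE)
lineCount <- length(blogs)                          # number of lines
charCount <- sum(nchar(blogs))                      # number of characters
wordCount <- sum(lengths(strsplit(blogs, "\\s+")))  # number of words (approximate)
round(wordCount / lineCount)                        # words per line
round(charCount / wordCount)                        # characters per word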

Exploratory Analysis

For the exploratory analysis I sampled the data, taking 5% of the total number of lines from each of the three files. A 5% sample is often a good representative of the overall data, and as the summary above shows, the files contain a huge number of lines. Data processing involved the following steps:

  • Sampling the data, randomly selecting 5% of the total number of lines from each file.
  • Merging the three sampled datasets (blogs, news, and tweets) into a single dataset.
  • Replacing short forms such as ’re with are, n’t with not, ’ve with have, etc.
  • Replacing all special characters, numbers, etc. with spaces.
  • Replacing multiple spaces with a single space.
  • Converting the whole dataset to lower case.
  • Splitting the data on spaces and storing the results as tokens.
  • Checking each word for presence in the dictionary.
  • Removing profane words.

The above steps give us an ordered list of words extracted from the raw data. These words can now be used to form the dataset on which the prediction model will be built. Using the sequence of words, denoted as tokens, the following features were also formed (a small worked example follows the list):

  • Preceding word.
  • Preceding two words.
  • Preceding three words.
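
To make this concrete, here is a minimal sketch with a made-up five-word sentence, showing how each word lines up with the word, pair, and triplet that precede it (the full version is in the appendix):

tokens <- c("thanks", "for", "the", "kind", "words")
n <- length(tokens)
word        <- tokens[4:n]                                  # "kind", "words"
prevWord    <- tokens[3:(n - 1)]                            # "the", "kind"
prevPair    <- paste(tokens[2:(n - 2)], tokens[3:(n - 1)])  # "for the", "the kind"
prevTriplet <- paste(tokens[1:(n - 3)], tokens[2:(n - 2)],
                     tokens[3:(n - 1)])                     # "thanks for the", "for the kind"
data.frame(prevTriplet, prevPair, prevWord, word)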

People often remove words such as and, the, in, a, etc., also called stop words, to gain accuracy, as these words usually help little in predicting the following word.
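
As an illustration, here is a minimal sketch of such a filter; the stop word list below is just the ten most frequent words from the table in the next section, whereas a real filter would use a published list:

tokens <- c("one", "of", "the", "best", "days")
stopWords <- c("the", "to", "and", "a", "i", "of", "is", "in", "it", "that")
tokens[!(tokens %in% stopWords)]   # "one" "best" "days"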

Most frequent words, pairs, and triplets

Following is the list of the top 10 most frequently occurring words in the above dataset:

Word   Frequency
the    238452
to     137657
and    120709
a      120540
i      100776
of     100351
is     94657
in     82634
it     57436
that   56122

Following is the list of the top 10 most frequently occurring pairs of words in the above dataset:

Pair      Frequency
of the    21682
in the    20569
it is     13201
i am      13118
to the    10655
for the   10163
on the    9956
to be     8025
is a      7389
do not    7289

Following is the list of the top 10 most frequently occurring triplets in the above dataset:

Triplet          Frequency
i do not         2314
one of the       1754
it is a          1494
a lot of         1435
thanks for the   1221
i am not         1194
i can not        1083
it is not        959
i have been      895
going to be      877

Frequency distribution

Here is a frequency plot of words: it shows the number of words with a particular number of occurrences in the data (for words with frequency less than or equal to 300).

Here is a frequency plot of pairs of words: it shows the number of pairs with a particular number of occurrences in the data (for pairs with frequency less than or equal to 100).

Here is a frequency plot of triplets of words: it shows the number of triplets with a particular number of occurrences in the data (for triplets with frequency less than or equal to 50).

Next steps

Now that the data has been cleaned and processed, and the previous word, previous pair of words, and previous triplet of words have been identified, the task is to build a prediction model. For this I would use the above dataset with an n-gram algorithm, which essentially takes a few previous words and determines the next word. It works as follows: all previous words, pairs, and triplets are stored in a dataset; when the user enters, say, three words, the model looks for all instances of those three words in the dataset and, of all the instances, chooses the most frequent following word as its prediction (a sketch of this lookup appears after the list below). A few things still need to be taken care of:

  • The data is not exhaustive, so there may be cases where the triplet is absent; these need a fallback.
  • Stop words can reduce accuracy because of their excessive presence, but on the other hand they need to be predicted as well. Hence I would build both versions, one with stop words and one without, and choose the model with the higher accuracy.
  • A balance must be maintained between accuracy and memory/computation time.
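
As a rough illustration, here is a minimal sketch of the triplet lookup against the finalData table constructed in the appendix; a real model would also back off to pairs and single words when the triplet is unseen:

predictNext <- function(lastThree, finalData) {
  matches <- finalData[finalData$prevTriplet == lastThree, ]
  if (nrow(matches) == 0) return(NA)               # unseen triplet: back off in a real model
  counts <- sort(table(matches$word), decreasing = TRUE)
  names(counts)[1]                                 # most frequent following word
}

predictNext("thanks for the", finalData)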

Looking forward to your suggestions!


Appendix for data scientists

Code, for reference (it uses the xtable and ggplot2 packages):

library(xtable)
library(ggplot2)

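# Read each corpus and keep a random ~5% sample of its lines
# (length/20; the fixed 50512 below is ~5% of the news file)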
blog <- readLines("en_US.blogs.txt", warn = FALSE)
set.seed(3763)
blog <- blog[sample(length(blog),length(blog)/20)]
twt <- readLines("en_US.twitter.txt", warn = FALSE)
set.seed(1293)
twt <- twt[sample(length(twt),length(twt)/20)]
news <- readLines("en_US.news.txt", warn = FALSE)
set.seed(2734)
news <- news[sample(length(news),50512)]
dataset <- c(blog,news,twt)

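# Expand common contractions before punctuation is stripped
# (note: the 's rule also expands possessives, e.g. "John's" -> "John is")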
dataset <- gsub("'re"," are",dataset)
dataset <- gsub("'m"," am",dataset)
dataset <- gsub("can't","can not",dataset)
dataset <- gsub("'s"," is",dataset)
dataset <- gsub("'ve"," have",dataset)
dataset <- gsub("'d"," would",dataset)
dataset <- gsub("n't"," not",dataset)
dataset <- gsub("'ll"," will",dataset)
dataset <- paste(dataset,collapse=" ")
dataset <- gsub("[^a-zA-Z]"," ",dataset)
dataset <- gsub(" +"," ",dataset)
dataset <- tolower(dataset)
dataset <- strsplit(dataset,split=" ")[[1]]

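# Load the English lexicon (CMU dictionary) and the profanity word list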
dictionary <- readLines("CMUDict.csv")
profane <- readLines("profane.csv")

inDict <- dataset %in% dictionary      # TRUE where the word is in the lexicon
dataset <- dataset[inDict]

isProfane <- dataset %in% profane      # TRUE where the word is on the profanity list
dataset <- dataset[!isProfane]

tokens <- dataset

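# Build adjacent word pairs (bigrams) and triplets (trigrams) from the token stream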
n <- length(tokens)
pairs <- paste(tokens[1:(n - 1)], tokens[2:n])
triplets <- paste(tokens[1:(n - 2)], tokens[2:(n - 1)], tokens[3:n])

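# Align the vectors so that row i pairs the word at position i+3 with the
# single word, word pair, and word triplet immediately before it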
tokenNumber <- length(tokens)
word <- tokens[4:tokenNumber]
tokenNumber <- tokenNumber - 1
prevWord <- tokens[3:tokenNumber]
tokenNumber <- tokenNumber - 1
prevPair <- pairs[2:tokenNumber]
tokenNumber <- tokenNumber - 1
prevTriplet <- triplets[1:tokenNumber]

finalData <- data.frame(prevWord,prevPair,prevTriplet,word)

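# Frequency tables for words, pairs, and triplets, sorted by decreasing frequency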
wordFreq <- data.frame(table(finalData$word))
colnames(wordFreq) <- c("word","Frequency")
wordFreq <- wordFreq[order(-wordFreq[,2]),]
rownames(wordFreq) <- wordFreq$word
wordFreq2 <- data.frame(wordFreq[,2])
rownames(wordFreq2) <- wordFreq$word
colnames(wordFreq2) <- c("Frequency")
wordFreq <- wordFreq2

pairFreq <- data.frame(table(pairs))
colnames(pairFreq) <- c("pairs","Frequency")
pairFreq <- pairFreq[order(-pairFreq[,2]),]
rownames(pairFreq) <- pairFreq$pairs
pairFreq2 <- data.frame(pairFreq[,2])
rownames(pairFreq2) <- pairFreq$pairs
colnames(pairFreq2) <- c("Frequency")
pairFreq <- pairFreq2

tripFreq <- data.frame(table(triplets))
colnames(tripFreq) <- c("trip","Frequency")
tripFreq <- tripFreq[order(-tripFreq[,2]),]
rownames(tripFreq) <- tripFreq$trip
tripFreq2 <- data.frame(tripFreq[,2])
rownames(tripFreq2) <- tripFreq$trip
colnames(tripFreq2) <- c("Frequency")
tripFreq <- tripFreq2

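# Render the top-10 tables as HTML and draw horizontal bar charts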
print(xtable(head(wordFreq,10)),type="html")

topWords <- head(wordFreq,10)
topWords$word <- rownames(topWords)
ggplot(topWords, aes(word, y=Frequency, fill=Frequency)) +
    geom_bar(stat="identity") +
    theme_bw() +
    coord_flip() +
    theme(axis.title.y = element_blank()) +
    labs(y="Frequency", title="Most frequent words")

print(xtable(head(pairFreq,10)),type="html")

topPairs <- head(pairFreq,10)
topPairs$pair <- rownames(topPairs)
ggplot(topPairs, aes(pair, y=Frequency, fill=Frequency)) +
    geom_bar(stat="identity") +
    theme_bw() +
    coord_flip() +
    theme(axis.title.y = element_blank()) +
    labs(y="Frequency", title="Most frequent pairs of words")

print(xtable(head(tripFreq,10)),type="html")

topTrips <- head(tripFreq,10)
topTrips$trip <- rownames(topTrips)
ggplot(topTrips, aes(trip, y=Frequency, fill=Frequency)) +
    geom_bar(stat="identity") +
    theme_bw() +
    coord_flip() +
    theme(axis.title.y = element_blank()) +
    labs(y="Frequency", title="Most frequent triplets of words")


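# Histograms: how many words/pairs/triplets occur a given number of times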
qplot(Frequency, data=wordFreq, binwidth=5, xlim=c(0,300), xlab="number of occurrences", ylab="number of words")
qplot(Frequency, data=pairFreq, binwidth=2, xlim=c(0,100), xlab="number of occurrences", ylab="number of pairs")
qplot(Frequency, data=tripFreq, binwidth=1, xlim=c(0,50), xlab="number of occurrences", ylab="number of triplets")