Capstone - Milestone Report

A brief report on the data for the predictive text input capstone project

Executive summary

In this project the objective is to predict the word the user is about to type: the user types a few words, and from those words a prediction is made. Since this report is aimed at readers who are not data scientists, I have kept technical terms to a minimum. For the same reason the full code is kept out of the main text and collected in the appendix; only a few short illustrative sketches appear inline.

For the assignment the data was downloaded from the Coursera course website. It consists of three files containing:

  • Blogs
  • News
  • Tweets

The data was processed and cleaned so that it could be used for prediction: special characters were removed. Since the data included some misspelled words and spelling variants, probably because tweets were part of the dataset, an English lexicon downloaded from the CMU website was used to filter out non-words. A profanity filter, a list of words that should be avoided, was used to remove objectionable words from the dataset.

In the end we had a cleaned version of the data in which only words that are part of the English lexicon were retained. From this list of words, pairs and triplets were identified, which would be used to predict the next word. First, let me present a summary of the data obtained from the course website.

Files Summary

File                Number of lines   Number of characters   Number of words   Words per line   Characters per word
en_US.twitter.txt   2,360,148         184,995,018            33,412,144        14               6
en_US.news.txt      1,010,242         221,024,166            37,060,257        37               6
en_US.blogs.txt     899,288           223,335,241            39,951,030        44               6
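
For reference, here is a minimal sketch of how such counts can be computed in R for one of the files (splitting on whitespace is only an approximation of word counting):

blogs <- readLines("en_US.blogs.txt", warn = FALSE)
lineCount <- length(blogs)                          # number of lines
charCount <- sum(nchar(blogs))                      # number of characters
wordCount <- sum(lengths(strsplit(blogs, "\\s+")))  # number of words (approximate)
round(wordCount / lineCount)                        # words per line
round(charCount / wordCount)                        # characters per word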

Exploratory Analysis

For the exploratory analysis I sampled the data, taking 5% of the total number of lines from each of the three files. A 5% sample is often a good representative of the overall data, and as the summary above shows, the files contain a huge number of lines. Data processing involved the following steps:

  • Sampling the data, randomly selecting 5% of the total number of lines from each file.
  • Merging the three sampled datasets (blogs, news, and tweets) into a single dataset.
  • Replacing short forms such as ’re with are, n’t with not, ’ve with have, etc.
  • Replacing all special characters, numbers, etc. with spaces.
  • Replacing multiple spaces with a single space.
  • Converting the whole dataset to lower case.
  • Splitting the data on spaces and storing the results as tokens.
  • Checking each word for presence in the dictionary.
  • Removing profane words.

The above steps give us an ordered list of words extracted from the raw data. These words can now be used to form the dataset on which the prediction model will be built. Using the sequence of words, denoted as tokens, the following features were also formed (a small worked example follows the list):

  • Preceding word.
  • Preceding two words.
  • Preceding three words.
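
To make this concrete, here is a minimal sketch with a made-up five-word sentence, showing how each word lines up with the word, pair, and triplet that precede it (the full version is in the appendix):

tokens <- c("thanks", "for", "the", "kind", "words")
n <- length(tokens)
word        <- tokens[4:n]                                  # "kind", "words"
prevWord    <- tokens[3:(n - 1)]                            # "the", "kind"
prevPair    <- paste(tokens[2:(n - 2)], tokens[3:(n - 1)])  # "for the", "the kind"
prevTriplet <- paste(tokens[1:(n - 3)], tokens[2:(n - 2)],
                     tokens[3:(n - 1)])                     # "thanks for the", "for the kind"
data.frame(prevTriplet, prevPair, prevWord, word)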

People often remove words such as and, the, in, a, etc., also called stop words, to gain accuracy, as these words usually help little in predicting the following word.
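
As an illustration, here is a minimal sketch of such a filter; the stop word list below is just the ten most frequent words from the table in the next section, whereas a real filter would use a published list:

tokens <- c("one", "of", "the", "best", "days")
stopWords <- c("the", "to", "and", "a", "i", "of", "is", "in", "it", "that")
tokens[!(tokens %in% stopWords)]   # "one" "best" "days"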

Most frequent words, pairs, and triplets

Following is the list of the top 10 most frequently occurring words in the above dataset:

Word   Frequency
the    238452
to     137657
and    120709
a      120540
i      100776
of     100351
is     94657
in     82634
it     57436
that   56122

Following is the list of the top 10 most frequently occurring pairs of words in the above dataset:

Pair      Frequency
of the    21682
in the    20569
it is     13201
i am      13118
to the    10655
for the   10163
on the    9956
to be     8025
is a      7389
do not    7289

Following is the list of the top 10 most frequently occurring triplets in the above dataset:

Triplet          Frequency
i do not         2314
one of the       1754
it is a          1494
a lot of         1435
thanks for the   1221
i am not         1194
i can not        1083
it is not        959
i have been      895
going to be      877

Frequency distribution

Here is a frequency plot of words: it shows the number of words with a particular number of occurrences in the data (for words with frequency less than or equal to 300).

Here is a frequency plot of pairs of words: it shows the number of pairs with a particular number of occurrences in the data (for pairs with frequency less than or equal to 100).

Here is a frequency plot of triplets of words: it shows the number of triplets with a particular number of occurrences in the data (for triplets with frequency less than or equal to 50).

Next steps

Now that the data has been cleaned and processed, and the previous word, previous pair of words, and previous triplet of words have been identified, the task is to build a prediction model. For this I would use the above dataset with an n-gram algorithm, which essentially takes a few previous words and determines the next word. It works as follows: all previous words, pairs, and triplets are stored in a dataset; when the user enters, say, three words, the model looks for all instances of those three words in the dataset and, of all the instances, chooses the most frequent following word as its prediction (a sketch of this lookup appears after the list below). A few things still need to be taken care of:

  • The data is not exhaustive, so there may be cases where the triplet is absent; these need a fallback.
  • Stop words can reduce accuracy because of their excessive presence, but on the other hand they need to be predicted as well. Hence I would build both versions, one with stop words and one without, and choose the model with the higher accuracy.
  • A balance must be maintained between accuracy and memory/computation time.
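
As a rough illustration, here is a minimal sketch of the triplet lookup against the finalData table constructed in the appendix; a real model would also back off to pairs and single words when the triplet is unseen:

predictNext <- function(lastThree, finalData) {
  matches <- finalData[finalData$prevTriplet == lastThree, ]
  if (nrow(matches) == 0) return(NA)               # unseen triplet: back off in a real model
  counts <- sort(table(matches$word), decreasing = TRUE)
  names(counts)[1]                                 # most frequent following word
}

predictNext("thanks for the", finalData)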

Looking forward to your suggestions!


Appendix for data scientists

Code, for reference (it uses the xtable and ggplot2 packages):

library(xtable)
library(ggplot2)

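# Read each corpus and keep a random ~5% sample of its lines
# (length/20; the fixed 50512 below is ~5% of the news file)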
blog <- readLines("en_US.blogs.txt", warn = FALSE)
set.seed(3763)
blog <- blog[sample(length(blog),length(blog)/20)]
twt <- readLines("en_US.twitter.txt", warn = FALSE)
set.seed(1293)
twt <- twt[sample(length(twt),length(twt)/20)]
news <- readLines("en_US.news.txt", warn = FALSE)
set.seed(2734)
news <- news[sample(length(news),50512)]
dataset <- c(blog,news,twt)

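# Expand common contractions before punctuation is stripped
# (note: the 's rule also expands possessives, e.g. "John's" -> "John is")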
dataset <- gsub("'re"," are",dataset)
dataset <- gsub("'m"," am",dataset)
dataset <- gsub("can't","can not",dataset)
dataset <- gsub("'s"," is",dataset)
dataset <- gsub("'ve"," have",dataset)
dataset <- gsub("'d"," would",dataset)
dataset <- gsub("n't"," not",dataset)
dataset <- gsub("'ll"," will",dataset)
dataset <- paste(dataset,collapse=" ")
dataset <- gsub("[^a-zA-Z]"," ",dataset)
dataset <- gsub(" +"," ",dataset)
dataset <- tolower(dataset)
dataset <- strsplit(dataset,split=" ")[[1]]

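# Load the English lexicon (CMU dictionary) and the profanity word list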
dictionary <- readLines("CMUDict.csv")
profane <- readLines("profane.csv")

inDict <- dataset %in% dictionary      # TRUE where the word is in the lexicon
dataset <- dataset[inDict]

isProfane <- dataset %in% profane      # TRUE where the word is on the profanity list
dataset <- dataset[!isProfane]

tokens <- dataset

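# Build adjacent word pairs (bigrams) and triplets (trigrams) from the token stream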
n <- length(tokens)
pairs <- paste(tokens[1:(n - 1)], tokens[2:n])
triplets <- paste(tokens[1:(n - 2)], tokens[2:(n - 1)], tokens[3:n])

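# Align the vectors so that row i pairs the word at position i+3 with the
# single word, word pair, and word triplet immediately before it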
tokenNumber <- length(tokens)
word <- tokens[4:tokenNumber]
tokenNumber <- tokenNumber - 1
prevWord <- tokens[3:tokenNumber]
tokenNumber <- tokenNumber - 1
prevPair <- pairs[2:tokenNumber]
tokenNumber <- tokenNumber - 1
prevTriplet <- triplets[1:tokenNumber]

finalData <- data.frame(prevWord,prevPair,prevTriplet,word)

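# Frequency tables for words, pairs, and triplets, sorted by decreasing frequency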
wordFreq <- data.frame(table(finalData$word))
colnames(wordFreq) <- c("word","Frequency")
wordFreq <- wordFreq[order(-wordFreq[,2]),]
rownames(wordFreq) <- wordFreq$word
wordFreq2 <- data.frame(wordFreq[,2])
rownames(wordFreq2) <- wordFreq$word
colnames(wordFreq2) <- c("Frequency")
wordFreq <- wordFreq2

pairFreq <- data.frame(table(pairs))
colnames(pairFreq) <- c("pairs","Frequency")
pairFreq <- pairFreq[order(-pairFreq[,2]),]
rownames(pairFreq) <- pairFreq$pairs
pairFreq2 <- data.frame(pairFreq[,2])
rownames(pairFreq2) <- pairFreq$pairs
colnames(pairFreq2) <- c("Frequency")
pairFreq <- pairFreq2

tripFreq <- data.frame(table(triplets))
colnames(tripFreq) <- c("trip","Frequency")
tripFreq <- tripFreq[order(-tripFreq[,2]),]
rownames(tripFreq) <- tripFreq$trip
tripFreq2 <- data.frame(tripFreq[,2])
rownames(tripFreq2) <- tripFreq$trip
colnames(tripFreq2) <- c("Frequency")
tripFreq <- tripFreq2

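# Render the top-10 tables as HTML and draw horizontal bar charts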
print(xtable(head(wordFreq,10)),type="html")

topWords <- head(wordFreq,10)
topWords$word <- rownames(topWords)
ggplot(topWords, aes(word, y=Frequency, fill=Frequency)) +
    geom_bar(stat="identity") +
    theme_bw() +
    coord_flip() +
    theme(axis.title.y = element_blank()) +
    labs(y="Frequency", title="Most frequent words")

print(xtable(head(pairFreq,10)),type="html")

topPairs <- head(pairFreq,10)
topPairs$pair <- rownames(topPairs)
ggplot(topPairs, aes(pair, y=Frequency, fill=Frequency)) +
    geom_bar(stat="identity") +
    theme_bw() +
    coord_flip() +
    theme(axis.title.y = element_blank()) +
    labs(y="Frequency", title="Most frequent pairs of words")

print(xtable(head(tripFreq,10)),type="html")

topTrips <- head(tripFreq,10)
topTrips$trip <- rownames(topTrips)
ggplot(topTrips, aes(trip, y=Frequency, fill=Frequency)) +
    geom_bar(stat="identity") +
    theme_bw() +
    coord_flip() +
    theme(axis.title.y = element_blank()) +
    labs(y="Frequency", title="Most frequent triplets of words")


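# Histograms: how many words/pairs/triplets occur a given number of times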
qplot(Frequency, data=wordFreq, binwidth=5, xlim=c(0,300), xlab="number of occurrences", ylab="number of words")
qplot(Frequency, data=pairFreq, binwidth=2, xlim=c(0,100), xlab="number of occurrences", ylab="number of pairs")
qplot(Frequency, data=tripFreq, binwidth=1, xlim=c(0,50), xlab="number of occurrences", ylab="number of triplets")