A Brief Report on the Data for the Predictive Text Input Capstone Project
In this project the objective is to predict the word the user is about to type: the user types a few words, and from those words a prediction is made. Since this report is aimed at readers who are not data scientists, I will keep technical terms to a minimum. For the same reason the code is not shown inline; the code for all the steps can be found in the appendix instead.
For the assignment the data was downloaded from the Coursera course website. The data consists of 3 files containing:

- tweets (en_US.twitter.txt)
- news articles (en_US.news.txt)
- blog posts (en_US.blogs.txt)
The data was processed and cleaned so that it could be used for predictions: special characters were removed, and since the data included some misspelled words and spelling variants (probably because tweets were part of the dataset), an English lexicon downloaded from the CMU website was used to filter out non-words. A profanity filter, containing a list of words which should be avoided, was used to remove objectionable words from the dataset.
In the end we had a cleaned version of the data in which only words that are part of the English lexicon were retained. From this list of words, pairs and triplets were identified, which would be used to predict the next word. First, let me present a summary of the data obtained from the course website.
| File | Number of lines | Number of characters | Number of words | Words per line | Characters per word |
|---|---|---|---|---|---|
| en_US.twitter.txt | 2360148 | 184995018 | 33412144 | 14 | 6 |
| en_US.news.txt | 1010242 | 221024166 | 37060257 | 37 | 6 |
| en_US.blogs.txt | 899288 | 223335241 | 39951030 | 44 | 6 |
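For the curious reader, the counts in this table can be reproduced with a few lines of R along the following lines. This is a minimal sketch assuming the three files are in the working directory; it uses a crude whitespace split for words, so the exact counts may differ slightly, and the per-line and per-word averages follow by dividing the totals.

```r
# Minimal sketch of how the summary table above can be computed
files <- c("en_US.twitter.txt", "en_US.news.txt", "en_US.blogs.txt")
for (f in files) {
  lines <- readLines(f, warn = FALSE)
  words <- unlist(strsplit(lines, " +"))  # crude whitespace tokenizer
  cat(f, "lines:", length(lines),
         "characters:", sum(nchar(lines)),
         "words:", length(words), "\n")
}
```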
For exploratory analysis I sampled the data, taking 5% of the total number of lines from each of the three files; a 5% sample is usually a good representative of the overall data, and as you can see in the summary above, the files had a huge number of lines. Data processing then involved the following steps (illustrated right after this list):

- expanding common contractions, so that, for example, "can't" becomes "can not" and "I'm" becomes "I am";
- removing every character that is not a letter and collapsing repeated spaces;
- converting everything to lower case and splitting the text into individual words;
- keeping only words that appear in the English lexicon and removing words on the profanity list.
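As a concrete illustration (the full version is in the appendix), here is what these steps do to one made-up line of text:

```r
line <- "I can't believe it's 90 degrees -- what're we doing?!"
line <- gsub("can't", "can not", line)
line <- gsub("'re", " are", line)
line <- gsub("'s", " is", line)
line <- gsub("[^a-zA-Z]", " ", line)  # letters only
line <- gsub(" +", " ", line)         # collapse repeated spaces
tolower(line)
# "i can not believe it is degrees what are we doing "
```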
The above steps give us an ordered list of words from the raw data. These words, denoted as tokens, can now be used to form the data on which the prediction model is developed. Using the sequence of tokens, the following features were formed for every position in the text: the previous word, the previous pair of words, the previous triplet of words, and the word that actually follows them (a toy example is sketched below).
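To make this concrete, here is a toy illustration of how pairs and triplets are built from a short sequence of tokens; the appendix does the same over the full sample:

```r
tokens <- c("one", "of", "the", "best")
# adjacent pairs: "one of", "of the", "the best"
pairs <- paste(head(tokens, -1), tail(tokens, -1))
# adjacent triplets: "one of the", "of the best"
triplets <- paste(head(tokens, -2),
                  tokens[2:(length(tokens) - 1)],
                  tail(tokens, -2))
```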
Often people remove words such as "and", "the", "in", "a" and so on, also called stop words, to gain accuracy, as these words often help little in predicting the following word; they have been kept here, as the top-10 list below shows.
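If one did want to drop stop words, the filtering is a one-liner. A minimal sketch, assuming tokens is the cleaned word vector from the appendix and using a small hand-made list (in practice a published list would be used):

```r
# Hypothetical stop-word filter over the token vector
stopWords <- c("the", "to", "and", "a", "i", "of", "is", "in", "it", "that")
tokensNoStop <- tokens[!(tokens %in% stopWords)]
```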
Following is the list of the top 10 most frequently occurring words in the above dataset:
| Word | Frequency |
|---|---|
| the | 238452 |
| to | 137657 |
| and | 120709 |
| a | 120540 |
| i | 100776 |
| of | 100351 |
| is | 94657 |
| in | 82634 |
| it | 57436 |
| that | 56122 |
Following is the list of the top 10 most frequently occurring pairs of words in the above dataset:
| Pair | Frequency |
|---|---|
| of the | 21682 |
| in the | 20569 |
| it is | 13201 |
| i am | 13118 |
| to the | 10655 |
| for the | 10163 |
| on the | 9956 |
| to be | 8025 |
| is a | 7389 |
| do not | 7289 |
Following is the list of the top 10 most frequently occurring triplets of words in the above dataset:
| Triplet | Frequency |
|---|---|
| i do not | 2314 |
| one of the | 1754 |
| it is a | 1494 |
| a lot of | 1435 |
| thanks for the | 1221 |
| i am not | 1194 |
| i can not | 1083 |
| it is not | 959 |
| i have been | 895 |
| going to be | 877 |
Here is a frequency plot of words; it shows the number of words with a particular number of occurrences in the data (for words with frequency less than or equal to 300).
Here is a frequency plot of pairs of words; it shows the number of pairs with a particular number of occurrences in the data (for pairs with frequency less than or equal to 100).
Here is a frequency plot of triplets of words; it shows the number of triplets with a particular number of occurrences in the data (for triplets with frequency less than or equal to 50).
Now that the data has been cleaned and processed, and the previous word, previous pair of words, and previous triplet of words have been identified, comes the task of building a prediction model. For this I will use the above dataset with an n-gram algorithm, which essentially takes a few previous words and determines the next word. This is how it works: all previous words, previous pairs, and previous triplets are stored in a dataset together with the word that followed them. When the user enters, say, three words, the model looks up all the instances of those three words in the dataset and, of all the instances, chooses the most frequent following word as its prediction. A few things still have to be taken care of, most importantly what to do when the entered words never appear in the dataset; a common remedy is to back off to shorter sequences, trying the last two words and then the last word alone. A sketch of this lookup is given below.
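The prediction step itself is not part of the appendix code yet, so here is a minimal sketch of how the lookup with backoff could work on top of the finalData frame built there; the function name predictWord and the simple most-frequent-follower rule are my own illustration, not a finished model:

```r
# A minimal sketch, assuming 'finalData' from the appendix with columns
# prevWord, prevPair, prevTriplet and word stored as character strings.
predictWord <- function(lastThree, finalData) {
  triplet <- paste(lastThree, collapse = " ")
  pair    <- paste(tail(lastThree, 2), collapse = " ")
  single  <- tail(lastThree, 1)
  # Try the full triplet first, then back off to the pair, then to one word
  for (lookup in list(c("prevTriplet", triplet),
                      c("prevPair", pair),
                      c("prevWord", single))) {
    followers <- finalData$word[finalData[[lookup[1]]] == lookup[2]]
    if (length(followers) > 0) {
      # the word that most often followed this context wins
      return(names(sort(table(followers), decreasing = TRUE))[1])
    }
  }
  NA  # no match at any level
}

predictWord(c("one", "of", "the"), finalData)  # most frequent follower of "one of the"
```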
Looking forward to your suggestions!
Appendix: code, for reference:
# Libraries used below for the tables and plots
library(xtable)
library(ggplot2)

# Read each file and keep a reproducible 5% sample of its lines
blog <- readLines("en_US.blogs.txt", warn = FALSE)
set.seed(3763)
blog <- blog[sample(length(blog), floor(length(blog)/20))]
twt <- readLines("en_US.twitter.txt", warn = FALSE)
set.seed(1293)
twt <- twt[sample(length(twt), floor(length(twt)/20))]
news <- readLines("en_US.news.txt", warn = FALSE)
set.seed(2734)
news <- news[sample(length(news), 50512)]  # 50512 is 5% of the 1010242 news lines
dataset <- c(blog, news, twt)
dataset <- gsub("'re"," are",dataset)
dataset <- gsub("'m"," am",dataset)
dataset <- gsub("can't","can not",dataset)
dataset <- gsub("'s"," is",dataset)
dataset <- gsub("'ve"," have",dataset)
dataset <- gsub("'d"," would",dataset)
dataset <- gsub("n't"," not",dataset)
dataset <- gsub("'ll"," will",dataset)
dataset <- paste(dataset,collapse=" ")
dataset <- gsub("[^a-zA-Z]"," ",dataset)
dataset <- gsub(" +"," ",dataset)
dataset <- tolower(dataset)
dataset <- strsplit(dataset,split=" ")[[1]]
# Keep only words found in the CMU English lexicon,
# then drop anything on the profanity list
dictionary <- readLines("CMUDict.csv")
profane <- readLines("profane.csv")
dataset <- dataset[dataset %in% dictionary]
dataset <- dataset[!(dataset %in% profane)]
tokens <- dataset
# Adjacent pairs and triplets of tokens (vectorised instead of a loop)
pairs <- paste(head(tokens, -1), tail(tokens, -1))
triplets <- paste(head(tokens, -2), tokens[2:(length(tokens) - 1)], tail(tokens, -2))
# Align each word with the word, pair, and triplet that precede it:
# row i pairs tokens[i+3] with tokens[i+2], the pair (tokens[i+1], tokens[i+2])
# and the triplet (tokens[i], tokens[i+1], tokens[i+2])
tokenNumber <- length(tokens)
word <- tokens[4:tokenNumber]
prevWord <- tokens[3:(tokenNumber - 1)]
prevPair <- pairs[2:(tokenNumber - 2)]
prevTriplet <- triplets[1:(tokenNumber - 3)]
finalData <- data.frame(prevWord, prevPair, prevTriplet, word,
                        stringsAsFactors = FALSE)
# Frequency table of single words, sorted most frequent first,
# with the words as row names and a single Frequency column
wordFreq <- data.frame(table(finalData$word))
colnames(wordFreq) <- c("word", "Frequency")
wordFreq <- wordFreq[order(-wordFreq$Frequency), ]
wordFreq <- data.frame(Frequency = wordFreq$Frequency,
                       row.names = wordFreq$word)
# Frequency table of pairs, same layout
pairFreq <- data.frame(table(pairs))
colnames(pairFreq) <- c("pair", "Frequency")
pairFreq <- pairFreq[order(-pairFreq$Frequency), ]
pairFreq <- data.frame(Frequency = pairFreq$Frequency,
                       row.names = pairFreq$pair)
# Frequency table of triplets, same layout
tripFreq <- data.frame(table(triplets))
colnames(tripFreq) <- c("triplet", "Frequency")
tripFreq <- tripFreq[order(-tripFreq$Frequency), ]
tripFreq <- data.frame(Frequency = tripFreq$Frequency,
                       row.names = tripFreq$triplet)
# Table and bar chart of the ten most frequent words
print(xtable(head(wordFreq, 10)), type = "html")
topWords <- head(wordFreq, 10)
topWords$word <- rownames(topWords)
ggplot(topWords, aes(reorder(word, Frequency), y = Frequency, fill = Frequency)) +
  geom_bar(stat = "identity") +
  theme_bw() +
  coord_flip() +
  theme(axis.title.y = element_blank()) +
  labs(y = "Frequency", title = "Most frequently occurring words")
# Table and bar chart of the ten most frequent pairs
print(xtable(head(pairFreq, 10)), type = "html")
topPairs <- head(pairFreq, 10)
topPairs$pair <- rownames(topPairs)
ggplot(topPairs, aes(reorder(pair, Frequency), y = Frequency, fill = Frequency)) +
  geom_bar(stat = "identity") +
  theme_bw() +
  coord_flip() +
  theme(axis.title.y = element_blank()) +
  labs(y = "Frequency", title = "Most frequently occurring pairs of words")
# Table and bar chart of the ten most frequent triplets
print(xtable(head(tripFreq, 10)), type = "html")
topTrips <- head(tripFreq, 10)
topTrips$trip <- rownames(topTrips)
ggplot(topTrips, aes(reorder(trip, Frequency), y = Frequency, fill = Frequency)) +
  geom_bar(stat = "identity") +
  theme_bw() +
  coord_flip() +
  theme(axis.title.y = element_blank()) +
  labs(y = "Frequency", title = "Most frequently occurring triplets of words")
# Histograms: how many words/pairs/triplets occur a given number of times
qplot(Frequency, data = wordFreq, binwidth = 5, xlim = c(0, 300),
      xlab = "number of occurrences", ylab = "number of words")
qplot(Frequency, data = pairFreq, binwidth = 2, xlim = c(0, 100),
      xlab = "number of occurrences", ylab = "number of pairs")
qplot(Frequency, data = tripFreq, binwidth = 1, xlim = c(0, 50),
      xlab = "number of occurrences", ylab = "number of triplets")