The objective of this project is to create an app that predicts the next word while the user is typing. We have several datasets at hand, collected by a web crawler. In this document, we load the datasets, tidy them up and then perform some basic exploratory analysis on the cleaned data.
Download the datasets and store them in the working directory. We will currently work with the English (US) datasets only.
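For completeness, one way to fetch and unpack the data is sketched below. The download URL and the final/en_US folder layout inside the archive are assumptions here; adjust them if the course link or archive structure differs.
# Sketch only: the URL and archive layout below are assumed, not verified here.
if (!dir.exists("./Coursera-SwiftKey")) {
  url <- "https://d396qusza40orc.cloudfront.net/dscapstone/dataset/Coursera-SwiftKey.zip"
  download.file(url, destfile = "Coursera-SwiftKey.zip", mode = "wb")
  unzip("Coursera-SwiftKey.zip", exdir = "./Coursera-SwiftKey")
}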
conblogs <- file("./Coursera-SwiftKey/final/en_US/en_US.blogs.txt")
connews <- file("./Coursera-SwiftKey/final/en_US/en_US.news.txt")
contwitter <- file("./Coursera-SwiftKey/final/en_US/en_US.twitter.txt")
suppressWarnings(blogs <- readLines(conblogs))
suppressWarnings(news <- readLines(connews))
suppressWarnings(twitter <- readLines(contwitter))
close(conblogs)
close(connews)
close(contwitter)
Let us have a brief look at the contents of each of the datasets.
blogs[4]
## [1] "so anyways, i am going to share some home decor inspiration that i have been storing in my folder on the puter. i have all these amazing images stored away ready to come to life when we get our home."
news[2]
## [1] "The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s."
twitter[1:4]
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."
## [2] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
## [3] "they've decided its more fun if I don't."
## [4] "So Tired D; Played Lazer Tag & Ran A LOT D; Ughh Going To Sleep Like In 5 Minutes ;)"
Now, let us look at some properties of the raw data:

| Type | Total posts | Total words | Total characters | Words per post | Characters per post |
|---|---|---|---|---|---|
| Blogs | 899288 | 37334131 | 208361438 | 41.51521 | 231.69601 |
| News | 77259 | 2643969 | 15683765 | 34.22215 | 203.00243 |
| Twitter | 2360148 | 30373543 | 162384825 | 12.86934 | 68.80281 |
This makes sense: Twitter has the fewest words and characters per post, most likely due to its 140/280-character limit. News articles and blog posts have a similar average number of words and characters per post, although their total numbers of posts differ greatly.
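For reference, per-corpus summaries of this kind can be computed along the following lines. This is only a sketch: it assumes the stringi package is available, and stri_count_words() may count words slightly differently from the figures in the table above.
library(stringi)
summarise_corpus <- function(lines) {
  words <- stri_count_words(lines)   # words per post
  chars <- nchar(lines)              # characters per post
  c(posts = length(lines), words = sum(words), chars = sum(chars),
    words_per_post = mean(words), chars_per_post = mean(chars))
}
rbind(Blogs = summarise_corpus(blogs),
      News = summarise_corpus(news),
      Twitter = summarise_corpus(twitter))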
The datasets are massive, so we work with a sample instead. To account for the different numbers of posts and keep the contribution from each source comparable, we sample as follows:
set.seed(1234)
samplelines <- c(sample(blogs, length(blogs) * 0.1),
                 sample(news, length(news)),
                 sample(twitter, length(twitter) * 0.1))
samplelines <- gsub("[^a-zA-Z']", " ", samplelines)   # keep only letters and apostrophes
samplelines <- gsub(" {2,}", " ", samplelines)        # collapse repeated spaces
samplelines <- trimws(samplelines)                    # strip leading/trailing spaces
samplelines <- tolower(samplelines)
samplelines <- strsplit(samplelines, " ")             # one vector of words per post
totallength <- length(samplelines)
Let us now see how this data looks.
print(totallength)
## [1] 403201
head(samplelines, 2)
## [[1]]
## [1] "he" "looked" "back" "at" "me" "his" "eyes" "were"
## [9] "as" "dark" "as" "coal"
##
## [[2]]
## [1] "you've" "set" "up" "a" "problem" "without" "stakes"
## [8] "why" "does" "she" "care" "who" "the" "voice"
## [15] "on" "the" "phone" "is" "why" "would" "she"
## [22] "even" "listen" "to" "him" "past" "hello"
To build the model, we would ideally like a dictionary of counts for every n-gram. Here, we create dictionaries for n = 1, 2 and 3 in order to find the most common unigrams, bigrams and trigrams.
The following code snippet iterates over a slice of the sampled data (roughly 2% of the posts, to keep the run time manageable). If a word is not already present in the unigram list, its count is initialized to one; otherwise the existing count is incremented.
unigram <- list()
count <- totallength/50   # only process about 2% of the sampled posts
for (line in samplelines) {
  count <- count - 1
  if (count < 0) break
  for (word in line) {
    if (is.null(unigram[[word]]))
      unigram[[word]] <- 1
    else
      unigram[[word]] <- unigram[[word]] + 1
  }
}
unigram <- unigram[order(unlist(unigram), decreasing=TRUE)]
barplot(as.numeric(unigram[1:20]), names.arg=names(unigram[1:20]), las=2, col="blue", border="black", density=seq(100, 10, -4), main = "Unigrams", xlab = "Unigram", ylab = "Frequency")
The following code snippet iterates over the same slice of the sampled data, pairing each word with the word immediately before it. If a bigram is not already present in the bigram list, its count is initialized to one; otherwise the existing count is incremented.
bigram <- list()
count <- totallength/50
for (line in samplelines) {
  count <- count - 1
  if (count < 0) break
  if (length(line) < 2) next
  for (i in 2:length(line)) {
    create_bigram <- paste(line[i - 1], line[i])
    if (is.null(bigram[[create_bigram]]))
      bigram[[create_bigram]] <- 1
    else
      bigram[[create_bigram]] <- bigram[[create_bigram]] + 1
  }
}
bigram <- bigram[order(unlist(bigram), decreasing=TRUE)]
barplot(as.numeric(bigram[1:20]), names.arg=names(bigram[1:20]), las=2, col="magenta", border="black", density=seq(100, 10, -4), main = "Bigrams", xlab = "Bigram", ylab = "Frequency")
We repeat the same procedure for trigrams, combining each word with the two words immediately before it.
trigram <- list()
count <- totallength/50
for (line in samplelines) {
  count <- count - 1
  if (count < 0) break
  if (length(line) < 3) next
  for (i in 3:length(line)) {
    create_trigram <- paste(line[i - 2], line[i - 1], line[i])
    if (is.null(trigram[[create_trigram]]))
      trigram[[create_trigram]] <- 1
    else
      trigram[[create_trigram]] <- trigram[[create_trigram]] + 1
  }
}
trigram <- trigram[order(unlist(trigram), decreasing=TRUE)]
barplot(as.numeric(trigram[1:20]), names.arg=names(trigram[1:20]), las=2, col="green", border="black", density=seq(100, 10, -4), main = "Trigrams", xlab = "Trigram", ylab = "Frequency")
First of all, we would like to find out the number of unique words in the dictionary.
uniquewords <- length(unigram)
uniquewords
## [1] 26100
Now, we would like to find out how many unique words account for 90% of the total words occurring in the datasets.
totalwords <- 0
uniquewords <- 0
for (i in 1:length(unigram)) totalwords <- totalwords + unigram[[i]]
sum90 <- 0.9 * totalwords
for (i in 1:length(unigram)) {
  sum90 <- sum90 - unigram[[i]]
  uniquewords <- uniquewords + 1
  if (sum90 <= 0) break
}
print(c(uniquewords, uniquewords/length(unigram)*100))
## [1] 5535.0000 21.2069
Okay, so we see that about 20% of the unique words in the unigram list account for 90% of the total word occurrences in the sample. It looks like the Pareto principle in action. In quantitative linguistics this behaviour is described by Zipf's law, which states that a word's frequency is roughly inversely proportional to its frequency rank, so a handful of very common words dominate the text.
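A quick way to eyeball this is a rank-frequency plot of the unigram counts on log-log axes; under Zipf's law the points should fall roughly on a straight line. A minimal sketch, reusing the sorted unigram list from above:
unigramfreqs <- as.numeric(unlist(unigram))   # counts, already sorted in decreasing order
plot(seq_along(unigramfreqs), unigramfreqs, log = "xy",
     xlab = "Rank", ylab = "Frequency", main = "Unigram rank vs. frequency")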
How do we know whether what we're doing makes any sense? A good barometer is the Oxford English Corpus. It turns out that 18 of our top 20 unigrams appear in the OEC's top 20. Not bad for a crude first attempt!
We have now performed some basic exploratory analysis, and we have a rough idea of the structure of our datasets and of which words are most likely to come up while typing. The basic idea for the rest of the project will be to look up the most recently typed words in the n-gram dictionaries and return the word that most frequently follows them.
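As a rough illustration of that idea, a lookup of the following kind could serve as a starting point. This is only a sketch built on the bigram and trigram lists above; predict_next() is a hypothetical helper, and it ignores smoothing and any unseen histories beyond a simple fallback.
# Sketch: return the most frequent continuation of the last one or two typed words,
# preferring the trigram list and falling back to the bigram list.
predict_next <- function(words) {
  words <- tolower(words)
  n <- length(words)
  if (n >= 2) {
    prefix <- paste(words[n - 1], words[n])
    hits <- names(trigram)[startsWith(names(trigram), paste0(prefix, " "))]
    if (length(hits) > 0)
      return(tail(strsplit(hits[1], " ")[[1]], 1))   # lists are sorted, so hits[1] is the most frequent match
  }
  hits <- names(bigram)[startsWith(names(bigram), paste0(words[n], " "))]
  if (length(hits) > 0)
    return(tail(strsplit(hits[1], " ")[[1]], 1))
  NA_character_
}
predict_next(c("one", "of", "the"))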