This project builds a next-word text predictor using Natural Language Processing. It uses the files named LOCALE.blogs.txt, LOCALE.news.txt and LOCALE.twitter.txt, where LOCALE is each of the four locales en_US, de_DE, ru_RU and fi_FI. The data comes from a corpus called HC Corpora.
library(tidytext)
library(dplyr)
We will load the blogs, news and Twitter data into data frames and draw a sample of the blogs file using the rbinom function.
con <- file("final/en_US/en_US.blogs.txt", "r")
rawblogs <- data.frame(text = readLines(con), stringsAsFactors = FALSE)
close(con)
rawblogs$blogid <- rownames(rawblogs)
con <- file("final/en_US/en_US.news.txt", "r")
rawnews <- data.frame(text = readLines(con), stringsAsFactors = FALSE)
rawnews$newsid <- rownames(rawnews)
close(con)
con <- file("final/en_US/en_US.twitter.txt", "r")
rawtwitter <- data.frame(text = readLines(con), stringsAsFactors = FALSE)
rawtwitter$tweetid <- rownames(rawtwitter)
close(con)
# Making a sample set from the dataset using rbinom
set.seed(3223)
samplerawblogs <- rawblogs[rbinom(nrow(rawblogs), 1, 0.60) == 1, ]
tokenblogs <- samplerawblogs %>% unnest_tokens(word, input = text) %>% filter(grepl("^[a-z]+$", word))
wordfrequency <- tokenblogs %>% count(word, sort=TRUE)
barplot(wordfrequency$n[1:10], names.arg = wordfrequency$word[1:10], las = 2, main = "Frequency of most common words", ylab = "Frequency", col = "steelblue")
Generating a 2-gram dataset
# 2-gram generation
bigram <- tokenblogs %>% group_by(blogid) %>% mutate(nextword = lead(word)) %>% filter(!is.na(nextword)) %>% mutate(bigram = paste(word, nextword)) %>% ungroup()
Sorting the table by frequency of each 2-gram
bifrequency <- bigram %>% count(word, nextword, bigram, sort=TRUE)
barplot(bifrequency$n[1:10], names.arg = bifrequency$bigram[1:10], las = 2, main = "Frequency of most common bigrams", ylab = "Frequency", col = "steelblue")
Generating a 3-gram dataset
# 3-gram generation
trigram <- tokenblogs %>% group_by(blogid) %>% mutate(nextword = lead(word), nexttonextword = lead(word, n = 2L)) %>% filter(!is.na(nextword), !is.na(nexttonextword)) %>% mutate(trigram = paste(word, nextword, nexttonextword)) %>% ungroup()
Sorting the table by frequency of each 3-gram
trifrequency <- trigram %>% count(word, nextword, nexttonextword, trigram, sort=TRUE)
barplot(trifrequency$n[1:10], names.arg = trifrequency$trigram[1:10], las = 2, main = "Frequency of most common trigrams", ylab = "Frequency", col = "steelblue")
length(unique(tokenblogs$word))
## [1] 190577
This suggests that covering 50% of all word instances requires 95,289 unique words, and covering 90% requires 171,519.
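These coverage figures can be checked directly from the cumulative word frequencies. A minimal sketch, using the wordfrequency table built above (already sorted by descending count):
# Cumulative share of all word instances covered by the top-ranked words
coverage <- cumsum(wordfrequency$n) / sum(wordfrequency$n)
which(coverage >= 0.50)[1]  # unique words needed to cover 50% of instances
which(coverage >= 0.90)[1]  # unique words needed to cover 90% of instances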
Foreign-language detection can be achieved by including a dictionary dataset and checking that every word token appears in the dictionary. Profanity filtering can be handled the same way with a profanity list, except that we keep only the tokens that do not appear in it. A sketch of both filters follows.
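A minimal sketch, assuming dictionarywords and profanitywords are character vectors loaded from external word lists (both names are hypothetical):
# dictionarywords (valid words) and profanitywords (words to exclude) are
# assumed to be character vectors loaded elsewhere
cleantokens <- tokenblogs %>% filter(word %in% dictionarywords) %>% filter(!(word %in% profanitywords))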
A backoff model will be used: the 3-gram table is consulted first; if the context is an unseen n-gram, we back off to the 2-gram table; and if that is also unseen, we fall back to the 1-gram level and return the most common word.
predictword <- function(word1, word2) {
  # Trigram check: look up both context words
  result <- trifrequency %>% filter(word == word1, nextword == word2) %>% arrange(desc(n))
  if (nrow(result) > 0) {
    return(result$nexttonextword[1:min(3, nrow(result))])
  }
  # Backoff to bigram check: look up the last context word only
  result <- bifrequency %>% filter(word == word2) %>% arrange(desc(n))
  if (nrow(result) > 0) {
    return(result$nextword[1:min(3, nrow(result))])
  }
  # Backoff to unigram: return the most common word overall
  return(wordfrequency$word[1])
}
The time taken by the predictive model can be measured with system.time().
system.time(predictword("red", "roses"))
## user system elapsed
## 6.377 1.202 8.697