This project builds a next-word predictor using Natural Language Processing. The data is from a corpus called HC Corpora, whose files are named LOCALE.blogs.txt, where LOCALE is one of the four locales en_US, de_DE, ru_RU and fi_FI. The code below loads the en_US blogs, news and twitter files.
library(tidytext)
library(dplyr)
We load the blogs, news and twitter data into a combined data frame. A random sample can then be drawn from it using the rbinom function.
con <- file("final/en_US/en_US.blogs.txt", "r")
rawblogs <- data.frame(text = readLines(con), source = "blog")
close(con)
con <- file("final/en_US/en_US.news.txt", "r")
rawnews <- data.frame(text = readLines(con), source = "news")
close(con)
con <- file("final/en_US/en_US.twitter.txt", "r")
rawtwitter <- data.frame(text = readLines(con), source = "twitter")
close(con)
# Combine all three into one dataset with a unique id and source column
rawdata <- bind_rows(rawblogs, rawnews, rawtwitter) %>%
  mutate(id = row_number()) %>%
  select(id, source, text)
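The sampling step mentioned above is not shown in the code, so here is a minimal sketch of how rbinom could be used for it. The toy data frame, the seed, the 10% sampling rate and the names toydata, samplerate, keep and sampledata are all assumptions made for illustration; in the project the same idea would be applied to rawdata.

```r
# Toy stand-in for the combined data frame (made-up rows, not corpus data)
toydata <- data.frame(
  id = 1:1000,
  source = rep(c("blog", "news", "twitter"), length.out = 1000),
  text = paste("line", 1:1000),
  stringsAsFactors = FALSE
)

set.seed(1234)       # assumed seed, only so the sample is reproducible
samplerate <- 0.1    # assumed sampling fraction

# One Bernoulli draw per row: 1 keeps the row, 0 drops it
keep <- rbinom(n = nrow(toydata), size = 1, prob = samplerate)
sampledata <- toydata[keep == 1, ]

nrow(sampledata)     # close to samplerate * nrow(toydata)
```

Because every row is kept independently with the same probability, the three sources stay in roughly their original proportions in the sample.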
# Split the text into lowercase word tokens and keep only purely alphabetic ones
tokenblogs <- rawdata %>%
  unnest_tokens(word, input = text) %>%
  filter(grepl("^[a-z]+$", word))
wordfrequency <- tokenblogs %>% count(word, sort = TRUE)
barplot(wordfrequency$n[1:10], names.arg = wordfrequency$word[1:10], las = 2,
        main = "Frequency of most common words", ylab = "Frequency", col = "steelblue")
Generating a 2-gram dataset
# 2-gram generation
bigram <- tokenblogs %>%
  group_by(id) %>%
  mutate(nextword = lead(word)) %>%
  filter(!is.na(nextword)) %>%
  mutate(bigram = paste(word, nextword))
Sorting the table by frequency of each 2-gram
bigramfrequency <- bigram %>%
  ungroup() %>%  # drop the per-line grouping so counts cover the whole dataset
  count(word, nextword, bigram, sort = TRUE)
barplot(bigramfrequency$n[1:10], names.arg = bigramfrequency$bigram[1:10], las = 2,
        main = "Frequency of most common bigrams", ylab = "Frequency", col = "steelblue")
Generating a 3-gram dataset
# 3-gram generation
trigram <- tokenblogs %>%
  group_by(id) %>%
  mutate(nextword = lead(word), nexttonextword = lead(word, n = 2L)) %>%
  filter(!is.na(nextword), !is.na(nexttonextword)) %>%
  mutate(trigram = paste(word, nextword, nexttonextword))
Sorting the table by frequency of each 3-gram
trigramfrequency <- trigram %>%
  ungroup() %>%  # drop the per-line grouping so counts cover the whole dataset
  count(word, nextword, nexttonextword, trigram, sort = TRUE)
barplot(trigramfrequency$n[1:10], names.arg = trigramfrequency$trigram[1:10], las = 2,
        main = "Frequency of most common trigrams", ylab = "Frequency", col = "steelblue")
cumulative <- cumsum(wordfrequency$n) / sum(wordfrequency$n)
words_50 <- which(cumulative >= 0.50)[1]
words_90 <- which(cumulative >= 0.90)[1]
cat("Unique words needed to cover 50% of all instances:", words_50, "\n")
## Unique words needed to cover 50% of all instances: 132
cat("Unique words needed to cover 90% of all instances:", words_90, "\n")
## Unique words needed to cover 90% of all instances: 6911
Foreign-language words can be detected by including a dictionary dataset and keeping only the word tokens that appear in it. Profanity can be handled the same way with a profanity dictionary, keeping only the word tokens that do not appear in it.
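The two dictionary checks could be sketched as follows. The word lists and the names dictionary, profanity, tokens and cleantokens are hypothetical placeholders; a real run would load full dictionary files and apply the filters to the token table built earlier.

```r
library(dplyr)

# Hypothetical word lists; a real project would read these from files
dictionary <- c("the", "cat", "sat", "on", "mat", "darn")
profanity  <- c("darn")

# Toy tokens in the same one-word-per-row shape as the token table
tokens <- data.frame(
  word = c("the", "katze", "sat", "on", "the", "darn", "mat"),
  stringsAsFactors = FALSE
)

cleantokens <- tokens %>%
  filter(word %in% dictionary) %>%   # drop words missing from the dictionary
  filter(!(word %in% profanity))     # drop words listed as profanity

cleantokens$word
```

Here "katze" is removed because it is not in the dictionary, and "darn" is removed because it is in the profanity list.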
A backoff model will be used: the 4-gram table is assessed first, and if the context is unseen we fall back to the 3-gram, then to the 2-gram, and finally to the 1-gram. At the 1-gram level we return the most common words. The function below implements the 3-gram, 2-gram and 1-gram levels.
predictword <- function(word1, word2) {
  word1 <- tolower(word1)
  word2 <- tolower(word2)

  # Trigram check
  result <- trigramfrequency %>%
    filter(word == word1, nextword == word2) %>%
    arrange(desc(n))
  if (nrow(result) > 0) {
    return(result$nexttonextword[1:min(3, nrow(result))])
  }

  # Backoff to bigram check
  result <- bigramfrequency %>%
    filter(word == word2) %>%
    arrange(desc(n))
  if (nrow(result) > 0) {
    return(result$nextword[1:min(3, nrow(result))])
  }

  # Backoff to unigram
  return(wordfrequency$word[1:3])
}
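The function above starts at the trigram level, while the description mentions a 4-gram level first. A minimal sketch of that extra level, built the same way as the bigram and trigram tables, might look like this. The toy tokens and the names toytokens, fourgramfrequency and predictfourgram are hypothetical; returning NULL is an assumed convention that lets a caller back off to predictword.

```r
library(dplyr)

# Toy tokens in the same shape as the token table (id = line, word = token)
toytokens <- data.frame(
  id = c(1, 1, 1, 1, 1, 2, 2, 2, 2),
  word = c("i", "will", "see", "you", "soon", "we", "will", "see", "you"),
  stringsAsFactors = FALSE
)

# 4-gram table: each word with the three words that follow it on the same line
fourgramfrequency <- toytokens %>%
  group_by(id) %>%
  mutate(word2 = lead(word),
         word3 = lead(word, n = 2L),
         word4 = lead(word, n = 3L)) %>%
  ungroup() %>%
  filter(!is.na(word4)) %>%
  count(word, word2, word3, word4, sort = TRUE)

# Hypothetical helper: up to three candidates, or NULL when the context is unseen
predictfourgram <- function(w1, w2, w3) {
  result <- fourgramfrequency %>%
    filter(word == tolower(w1), word2 == tolower(w2), word3 == tolower(w3))
  if (nrow(result) == 0) return(NULL)
  head(result$word4, 3)
}

predictfourgram("will", "see", "you")       # seen context
predictfourgram("never", "seen", "before")  # unseen context, gives NULL
```

A wrapper could call predictfourgram first and, when it returns NULL, pass the last two words on to predictword.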
The time taken by a single prediction can be measured with system.time.
system.time(predictword("will", "see"))
##    user  system elapsed
##   0.326   0.059   0.386