0. Introduction

This project builds a next-word predictor using Natural Language Processing. It uses the files named LOCALE.blogs.txt, LOCALE.news.txt and LOCALE.twitter.txt, where LOCALE is each of the four locales en_US, de_DE, ru_RU and fi_FI. The data comes from a corpus called HC Corpora.

library(tidytext)
library(dplyr)

1. Task 1: Getting and cleaning the data

We will load the blogs, news and twitter data into data frames, then prepare a sample of the blogs file using the rbinom function.

1.1 Reading the data and preparing a sample set using the rbinom function

con <- file("final/en_US/en_US.blogs.txt", "r")
rawblogs <- data.frame(readLines(con))
close(con)
rawblogs$blogid <- rownames(rawblogs)


con <- file("final/en_US/en_US.news.txt", "r")
rawnews <- data.frame(readLines(con))
rawnews$newsid <- rownames(rawnews)
close(con)

con <- file("final/en_US/en_US.twitter.txt", "r")
rawtwitter <- data.frame(readLines(con))
rawtwitter$tweetid <- rownames(rawtwitter)
close(con)

# sample roughly 60% of the blog lines: rbinom draws one Bernoulli(0.60)
# value per row, and rows drawn as 1 are kept
set.seed(3223)
samplerawblogs <- rawblogs[rbinom(nrow(rawblogs), 1, 0.60) == 1, ]

1.2 Tokenising the sample dataset into words, accounting for punctuation and numbers

# split each line into lowercase word tokens, then keep only purely
# alphabetic tokens, dropping numbers and leftover punctuation
tokenblogs <- samplerawblogs %>%
  unnest_tokens(word, input = text) %>%
  filter(grepl("^[a-z]+$", word))

2. Task 2: Exploratory analysis

2.1 Distribution of word frequencies

wordfrequency <- tokenblogs %>% count(word, sort = TRUE)
barplot(wordfrequency$n[1:10], names.arg = wordfrequency$word[1:10],
        las = 2, main = "Frequency of most common words",
        ylab = "Frequency", col = "steelblue")
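
The barplot shows only the head of the distribution. A quick way to see the full, heavily skewed shape is a histogram of log counts; a small sketch:

# most words occur only a handful of times, while a few occur very often
hist(log10(wordfrequency$n), breaks = 50,
     main = "Distribution of word frequencies",
     xlab = "log10(count per word)", col = "steelblue")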

2.2 Frequencies of 2-grams and 3-grams

Generating a 2-gram dataset

# 2-gram generation: pair each word with the next word on the same line
bigram <- tokenblogs %>%
  group_by(blogid) %>%
  mutate(nextword = lead(word)) %>%
  filter(!is.na(nextword)) %>%
  mutate(bigram = paste(word, nextword))

Counting each 2-gram and sorting by frequency. The table is ungrouped first so that counts are computed over the whole sample rather than within each blog line.

bifrequency <- bigram %>%
  ungroup() %>%
  count(word, nextword, bigram, sort = TRUE)
barplot(bifrequency$n[1:10], names.arg = bifrequency$bigram[1:10],
        las = 2, main = "Frequency of most common bigrams",
        ylab = "Frequency", col = "steelblue")

Generating a 3-gram dataset

# 3-gram generation: pair each word with the next two words on the same line
trigram <- tokenblogs %>%
  group_by(blogid) %>%
  mutate(nextword = lead(word), nexttonextword = lead(word, n = 2L)) %>%
  filter(!is.na(nextword), !is.na(nexttonextword)) %>%
  mutate(trigram = paste(word, nextword, nexttonextword))

Counting each 3-gram and sorting by frequency, again ungrouping first.

trifrequency <- trigram %>%
  ungroup() %>%
  count(word, nextword, nexttonextword, trigram, sort = TRUE)
barplot(trifrequency$n[1:10], names.arg = trifrequency$trigram[1:10],
        las = 2, main = "Frequency of most common trigrams",
        ylab = "Frequency", col = "steelblue")

2.3 How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?

length(unique(tokenblogs$word))
## [1] 190577

Taking 50% and 90% of the 190,577 unique words gives 95,289 and 171,519 words, but those figures cover the dictionary, not word instances. Because word frequencies follow a Zipf-like distribution, far fewer unique words are needed to cover 50% or 90% of all word instances.
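
The exact counts can be read off the cumulative sum of the frequency-sorted table built earlier; a short sketch:

# cumulative share of all word instances covered by the top words
coverage <- cumsum(wordfrequency$n) / sum(wordfrequency$n)
which(coverage >= 0.5)[1]  # unique words needed for 50% coverage
which(coverage >= 0.9)[1]  # unique words needed for 90% coverage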

2.4 How do you evaluate how many of the words come from foreign languages or are profane?

Foreign words can be flagged by checking every word token against an English dictionary word list: tokens absent from the dictionary are candidate foreign words. Profanity can be handled the same way with a profanity word list, keeping only the tokens that do not appear in it.
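
A minimal sketch of both checks, assuming dictionary.txt and profanity.txt are hypothetical plain-text files with one lowercase word per line:

# hypothetical word lists, one lowercase word per line
dictionary <- readLines("dictionary.txt")
profanity <- readLines("profanity.txt")

# tokens not found in the dictionary are candidate foreign words
foreignwords <- tokenblogs %>% filter(!word %in% dictionary)

# drop tokens that appear in the profanity list
tokenblogs <- tokenblogs %>% filter(!word %in% profanity)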

3. Task 3: Modelling an ngram function based on the dataset

A backoff model will be used: the last two words of the input are first looked up in the trigram table; if that context is unseen, we back off to the bigram table keyed on the last word; and if that is also unseen, we fall back to the unigram level and return the most common word overall.

predictword <- function(word1, word2) {
  # Trigram check: look up both context words
  result <- trifrequency %>%
    filter(word == word1, nextword == word2) %>%
    arrange(desc(n))
  if (nrow(result) > 0) {
    return(result$nexttonextword[1:min(3, nrow(result))])
  }

  # Backoff to bigram check: look up the last context word only
  result <- bifrequency %>%
    filter(word == word2) %>%
    arrange(desc(n))
  if (nrow(result) > 0) {
    return(result$nextword[1:min(3, nrow(result))])
  }

  # Backoff to unigram: return the most common word overall
  return(wordfrequency$word[1])
}

3.1 Runtime analysis

The time taken for a single prediction can be measured with system.time:

system.time(predictword("red", "roses"))
##    user  system elapsed 
##   6.377   1.202   8.697
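
Most of that time is spent filtering the full n-gram tables on every call. One way to cut lookup time (a sketch, not part of the model above; toptrigrams is a name introduced here) is to precompute the top three continuations per trigram context once, so each prediction scans a much smaller table:

# keep only the three most frequent continuations per context
# (slice_head requires dplyr >= 1.0)
toptrigrams <- trifrequency %>%
  group_by(word, nextword) %>%
  arrange(desc(n), .by_group = TRUE) %>%
  slice_head(n = 3) %>%
  ungroup()

system.time(toptrigrams %>% filter(word == "red", nextword == "roses"))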