The primary objective of this project is to demonstrate foundational skills in data handling and analysis, serving as an initial progress report for a word prediction algorithm project.
The final report is published on RPubs (http://rpubs.com/) and covers the following points: line counts of the raw data files, an exploratory word-frequency analysis on a random sample of each file, and the plan for the word prediction algorithm.
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(stringi)
library(stringr)
count_file_lines_en <- function(file_path) {
  # Count the lines of a (potentially very large) text file via the shell,
  # which avoids loading the whole file into memory.
  cat("Counting the rows in the file. This might take a moment if the dataset is large...\n")
  line_count_str <- system(paste0("wc -l '", file_path, "'"), intern = TRUE)
  if (length(line_count_str) == 0 || is.na(line_count_str)) {
    cat("Error: Cannot calculate the number of rows in data. Is the file readable or does the path exist?\n")
    return(NULL)
  }
  # Extract the numeric part of the "wc -l" output.
  file_len <- as.integer(stringr::str_extract(line_count_str, "\\d+"))
  cat(paste("The file", file_path, "has", file_len, "rows.\n"))
  return(file_len)
}
line_count_blogs <- count_file_lines_en("dataset/en_US/en_US.blogs.txt")
## Counting the rows in the file. This might take a moment if the dataset is large...
## The file dataset/en_US/en_US.blogs.txt has 899288 rows.
if (!is.null(line_count_blogs)) {
print(paste("The final line count is:", line_count_blogs))
}
## [1] "The final line count is: 899288"
line_count_news <- count_file_lines_en("dataset/en_US/en_US.news.txt")
## Counting the rows in the file. This might take a moment if the dataset is large...
## The file dataset/en_US/en_US.news.txt has 1010242 rows.
if (!is.null(line_count_news)) {
print(paste("The final line count is:", line_count_news))
}
## [1] "The final line count is: 1010242"
line_count_twitter <- count_file_lines_en("dataset/en_US/en_US.twitter.txt")
## Counting the rows in the file. This might take a moment if the dataset is large...
## The file dataset/en_US/en_US.twitter.txt has 2360148 rows.
if (!is.null(line_count_twitter)) {
print(paste("The final line count is:", line_count_twitter))
}
## [1] "The final line count is: 2360148"
Because of limited computational power, I randomly sample only 6,000 lines from each of the three data files and examine the word counts and word frequencies. In total, 36,271 unique words are found. Below I show the top 20 most frequent words and the 20 least frequent words, followed by a word cloud, for each source.
library(dplyr)
library(stringi)
library(ggplot2)
library(wordcloud)
## Loading required package: RColorBrewer
library(RColorBrewer)
library(stringr)
process_text_sample <- function(file_path, total_lines_count, nr_sample = 15000) {
  cat(paste("Choose", nr_sample, "rows as sample.\n"))
  set.seed(123)
  # Pick nr_sample random line numbers from the whole file (sorted so the file
  # can be read sequentially in a single pass).
  lines_to_read_indices <- sort(sample(total_lines_count, nr_sample))
  cat("Load samples (This can take a few minutes)...\n")
  text_data <- character(nr_sample)
  con <- file(file_path, "r", encoding = "UTF-8")
  current_sample_index <- 1
  current_line_number <- 1
  # Read the file line by line and keep only the pre-selected line numbers.
  while (current_sample_index <= nr_sample) {
    line <- readLines(con, n = 1, warn = FALSE)
    if (length(line) == 0) {
      cat("\nDEBUG: Reached end of file after line number", current_line_number - 1, "\n")
      break
    }
    if (current_line_number == lines_to_read_indices[current_sample_index]) {
      text_data[current_sample_index] <- line
      current_sample_index <- current_sample_index + 1
    }
    current_line_number <- current_line_number + 1
  }
  close(con)
  cat("\n")
  # Tokenise into words, drop empty and missing tokens, and lowercase everything.
  words <- stri_extract_all_words(text_data, simplify = TRUE)
  words_without_na <- words[words != "" & !is.na(words)]
  words_lower <- stri_trans_tolower(words_without_na)
  # Remove punctuation and digits (the tokens are already lowercase).
  clean_text <- function(text) {
    text <- str_remove_all(text, "[[:punct:]]")
    text <- str_remove_all(text, "[[:digit:]]")
    return(text)
  }
  cleaned_text <- clean_text(words_lower)
  unique_words <- unique(cleaned_text)
  count_unique_words <- length(unique_words)
  cat("\nNumber of unique words:\n")
  print(count_unique_words)
  # Build a word-frequency table sorted from most to least frequent.
  word_freq <- as.data.frame(table(cleaned_text))
  names(word_freq) <- c("Word", "Frequency")
  word_freq <- word_freq[order(-word_freq$Frequency), ]
  return(word_freq)
}
cat("\nGenerating histogram of word frequencies and word cloud for blog data ...\n")
##
## Generating histogram of word frequencies and word cloud for blog data ...
word_freq <- process_text_sample(
file_path = "dataset/en_US/en_US.blogs.txt",
total_lines_count = line_count_blogs,
nr_sample = 6000
)
## Choose 6000 rows as sample.
## Load samples (This can take a few minutes)...
##
##
## Number of unique words:
## [1] 22192
top_20_words <- head(word_freq, 20)
ggplot(top_20_words, aes(x = reorder(Word, Frequency), y = Frequency)) +
geom_bar(stat = "identity", fill = "skyblue") +
coord_flip() +
theme_minimal() +
labs(title = "Top 20 Most Frequent Words of blogs", x = "Words", y = "Frequency")
wordcloud(words = word_freq$Word, freq = word_freq$Frequency, min.freq = 1,
max.words=600, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))
cat("\nGenerating histogram of word frequencies and word cloud for news data ...\n")
##
## Generating histogram of word frequencies and word cloud for news data ...
word_freq <- process_text_sample(
file_path = "dataset/en_US/en_US.news.txt",
total_lines_count = line_count_news,
nr_sample = 6000
)
## Choose 6000 rows as sample.
## Load samples (This can take a few minutes)...
##
##
## Number of unique words:
## [1] 21992
top_20_words <- head(word_freq, 20)
ggplot(top_20_words, aes(x = reorder(Word, Frequency), y = Frequency)) +
geom_bar(stat = "identity", fill = "green") +
coord_flip() +
theme_minimal() +
labs(title = "Top 20 Most Frequent Words of news", x = "Words", y = "Frequency")
wordcloud(words = word_freq$Word, freq = word_freq$Frequency, min.freq = 1,
max.words=600, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))
cat("\nGenerating histogram of word frequencies and word cloud for twitter data ...\n")
##
## Generating histogram of word frequencies and word cloud for twitter data ...
word_freq <- process_text_sample(
file_path = "dataset/en_US/en_US.twitter.txt",
total_lines_count = line_count_twitter,
nr_sample = 6000
)
## Choose 6000 rows as sample.
## Load samples (This can take a few minutes)...
##
##
## Number of unique words:
## [1] 10828
top_20_words <- head(word_freq, 20)
ggplot(top_20_words, aes(x = reorder(Word, Frequency), y = Frequency)) +
geom_bar(stat = "identity", fill = "orange") +
coord_flip() +
theme_minimal() +
labs(title = "Top 20 Most Frequent Words of twitter", x = "Words", y = "Frequency")
wordcloud(words = word_freq$Word, freq = word_freq$Frequency, min.freq = 1,
max.words=600, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))
Based on the preparation and exploration of the data, I decided to use an n-gram model, trained on a randomly selected sample of the large dataset, for word prediction; a minimal sketch follows the list of reasons below. This decision is justified by its computational efficiency, simplicity, and scalability compared to more complex models.
Computational Efficiency: N-gram models are simple to build and fast to run, requiring minimal computational resources and less complex training processes than modern deep learning models. This makes them suitable for applications on devices with limited processing power, such as mobile phones for predictive text.
Scalability: The model construction time increases linearly with the corpus size and value of ‘n’, which demonstrates a degree of scalability. The architecture can be designed to operate in a distributed and asynchronous way to handle very large text volumes efficiently.
Data Handling: Large datasets reduce the problem of data sparsity (where certain word sequences might not appear in a smaller corpus) that typically challenges n-gram models. More data means more accurate frequency estimates and better predictive probabilities for a wider range of word sequences.
Strong Baseline Performance: Despite their simplicity, n-gram models often provide competitive baseline performance for various Natural Language Processing (NLP) tasks, including speech recognition, spelling correction, and information retrieval.
Interpretability: N-gram models are relatively easy to understand, debug, and interpret compared to “black-box” neural networks, as the predictions are based on direct, observable word frequencies and probabilities.
Capturing Local Context: They effectively capture short-range dependencies and local word order (e.g., the common phrase “New York City”), which is crucial for distinguishing meaning and improving accuracy in tasks like sentiment analysis.
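To make the n-gram idea concrete, the sketch below builds a toy bigram table and estimates P(next word | previous word) as count(previous, next) / count(previous), the standard maximum-likelihood estimate for a bigram model. This is only an illustration under stated assumptions: the sentences in toy_corpus and the predict_next helper are hypothetical and not part of the code above; in the real model the sampled blog lines would take the place of the toy corpus.

library(stringi)

# Hypothetical toy corpus standing in for the sampled text.
toy_corpus <- c("I live in New York City",
                "New York City is busy",
                "I love New York")

# Tokenise and lowercase each line, mirroring the cleaning steps used above.
token_list <- stri_extract_all_words(stri_trans_tolower(toy_corpus))
tokens <- unlist(token_list)

# Bigram counts: adjacent word pairs within each line (no pairs across lines).
bigrams <- unlist(lapply(token_list, function(w) {
  if (length(w) > 1) paste(head(w, -1), tail(w, -1)) else character(0)
}))
bigram_freq  <- table(bigrams)
unigram_freq <- table(tokens)

# Maximum-likelihood estimate: P(next | previous) = count(previous next) / count(previous)
predict_next <- function(previous) {
  candidates <- bigram_freq[startsWith(names(bigram_freq), paste0(previous, " "))]
  if (length(candidates) == 0) return(NA_character_)
  probs <- as.numeric(candidates) / as.numeric(unigram_freq[previous])
  names(probs) <- sub(paste0("^", previous, " "), "", names(candidates))
  sort(probs, decreasing = TRUE)
}

predict_next("york")
# Should rank "city" highest, since "york city" is the most frequent bigram after "york".

Because the prediction is read directly off observed counts, the model is easy to inspect and debug, which is exactly the interpretability advantage noted above.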
From the histograms of frequencies, all three datasets show a similar distribution of words, but the blog data contains the most unique words, so the blog data will be used for model training.
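As a next step, the sampled blog lines could feed directly into an n-gram frequency table. The sketch below is a rough outline, not the final implementation: it assumes a character vector blog_sample holding the 6,000 sampled blog lines (the same lines read inside process_text_sample, which is not returned by the code above), and the build_ngram_freq helper and the choice of trigrams are illustrative assumptions.

library(stringi)

# Hypothetical helper: count n-grams of a given order in a character vector of
# text lines. `blog_sample` is assumed to exist and hold the sampled blog lines.
build_ngram_freq <- function(lines, n = 3) {
  token_list <- stri_extract_all_words(stri_trans_tolower(lines))
  ngrams <- unlist(lapply(token_list, function(w) {
    if (length(w) < n) return(character(0))
    sapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "))
  }))
  # Same frequency-table pattern as used for single words above.
  ngram_freq <- as.data.frame(table(ngrams))
  names(ngram_freq) <- c("Ngram", "Frequency")
  ngram_freq[order(-ngram_freq$Frequency), ]
}

# Example usage (assuming blog_sample exists):
# trigram_freq <- build_ngram_freq(blog_sample, n = 3)
# head(trigram_freq, 10)   # the ten most common trigrams in the blog sample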