SwiftKey project capstone - Exploratory data analysis

Scope

This is a milestone report for the ‘Data Science Capstone’ course of the Data Science Specialization by Johns Hopkins University.

The goal is to build a predictive text application that suggests the next word as the user types. The model will be trained on a dataset of text from blogs, Twitter and news.

This report presents an exploratory analysis of the data and outlines ideas for the design of the future application.

Source database

The training dataset is available at the following link: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

It contains text files in 4 different languages from Twitter, blogs and news.

For the purpose of this capstone, we take the English version (under the ‘/en_US/’ folder) and apply an initial pre-processing to extract the words from each file:

Remove punctuation and numbers.
Remove empty strings.
Remove one-letter words.
Remove stopwords using the stopwords package.

Stopwords are common words like “a”, “an”, “the” that do not carry significant meaning and can be removed from text data to improve the performance of machine learning models.

The following functions have been defined for the analysis in this milestone report:

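library(stringr)    # str_replace_all, str_split
library(stopwords)  # stopwords
library(dplyr)      # %>%, arrange, mutate, filter

extract_words <- function(text_data)
{
  # Combine all lines into a single text
  text_data <- paste(text_data, collapse = " ")
  
  # Convert to lowercase so that the pattern below does not strip capital letters
  text_data <- tolower(text_data)
  
  # Remove punctuation and numbers (anything other than a-z or whitespace)
  text_data <- str_replace_all(text_data, "[^a-z\\s]", " ")
  
  # Split the text into words
  words <- unlist(str_split(text_data, "\\s+"))
  
  # Remove empty strings
  words <- words[words != ""]
  
  # Remove stop words
  stop_words <- stopwords("en")
  words <- words[!words %in% stop_words]
  
  # Remove one-letter words
  words <- words[nchar(words) > 1]
  return(words)
}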

count_words <- function(words)
{
  # Count the frequency of each word
  word_count <- table(words)
  
  # Convert to a data frame
  word_count_df <- as.data.frame(word_count, stringsAsFactors = FALSE)
  
  # Calculate the total number of words
  total_words <- sum(word_count_df$Freq)
  
  # Sort by descending frequency and add percentage and cumulative percentage columns
  word_count_df <- word_count_df %>%
    arrange(desc(Freq)) %>%
    mutate(Percentage = (Freq / total_words) * 100,
           CumulativePercentage = cumsum(Freq) / total_words * 100)
  
  # Return the data frame explicitly (a bare assignment would return it invisibly)
  return(word_count_df)
}

create_ngrams <- function(words, n) {
  ngrams <- lapply(seq_along(words), function(i) {
    if (i <= length(words) - (n - 1)) {
      paste(words[i:(i + n - 1)], collapse = " ")
    } else {
      NA
    }
  })
  ngrams <- unlist(ngrams)
  ngrams <- ngrams[!is.na(ngrams)]
  
  return(ngrams)
}

freq_ngrams <- function(ngrams)
{
  # Count frequencies and return a dataframe sorted by the frequency in descending order
  
  freq <- table(ngrams)

  ngrams_df <- as.data.frame(freq, stringsAsFactors = FALSE) 
  
  # Calculate the total number of n-grams
  total <- sum(ngrams_df$Freq)
  
  ngrams_df <- ngrams_df %>%
    arrange(desc(Freq)) %>%
    mutate(Percentage = (Freq / total) * 100,
           CumulativePercentage = cumsum(Freq) / total * 100)
  
  return(ngrams_df)
}

Applying these functions to the three English files gives the following summary:

##                File   Lines    Words Unique_words
## 1 en_US.twitter.txt 2360148 17111806       302505
## 2   en_US.blogs.txt  899288 19347162       252893
## 3    en_US.news.txt 1010242 19760894       212079
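
A summary row like the ones above could be computed per file along the following lines. This is only a sketch: the helper names `simple_words` and `summarise_file` are assumptions made for illustration, and the simplified tokenizer does not remove stopwords, so its counts would differ from those obtained with `extract_words`.

```r
# Hypothetical simplified tokenizer: lowercase, letters only, no stopword removal
simple_words <- function(lines) {
  text <- tolower(paste(lines, collapse = " "))
  text <- gsub("[^a-z\\s]", " ", text, perl = TRUE)
  words <- unlist(strsplit(text, "\\s+", perl = TRUE))
  words[nchar(words) > 1]
}

# Hypothetical helper: build one summary row for a text file
summarise_file <- function(path, tokenizer = simple_words) {
  lines <- readLines(path, encoding = "UTF-8", warn = FALSE, skipNul = TRUE)
  words <- tokenizer(lines)
  data.frame(File = basename(path),
             Lines = length(lines),
             Words = length(words),
             Unique_words = length(unique(words)),
             stringsAsFactors = FALSE)
}

# Usage on a small temporary file (toy content, not the real dataset)
tmp <- tempfile(fileext = ".txt")
writeLines(c("The quick brown fox.", "The lazy dog sleeps!"), tmp)
demo_summary <- summarise_file(tmp)
print(demo_summary)
```

Passing `extract_words` as the `tokenizer` argument would apply the full pre-processing described earlier instead of the simplified one.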

Exploratory data analysis

First, we look at how word usage is distributed in each file and calculate how many distinct words are needed to cover 50% or 90% of all word occurrences. This indicates how far the training data set could potentially be reduced.

Plotting the top 20 words gives a first impression of the most used words in the language.

Next, we build trigrams and identify the most used sequences of words. These will later be useful for predicting the next word as the user types in the target application:
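
The coverage calculation can be sketched as follows. The helper name `words_for_coverage` and the toy frequencies are assumptions for illustration only; with the real data, the same information is available from the `CumulativePercentage` column returned by `count_words`.

```r
# Hypothetical helper: number of distinct words needed to reach a coverage target
words_for_coverage <- function(freq, target) {
  freq <- sort(freq, decreasing = TRUE)
  cumulative <- cumsum(freq) / sum(freq)
  # Position of the first word whose cumulative share reaches the target
  unname(which(cumulative >= target)[1])
}

# Toy frequencies (illustrative values, not taken from the dataset)
toy_freq <- c(w1 = 40, w2 = 30, w3 = 15, w4 = 10, w5 = 5)
n50 <- words_for_coverage(toy_freq, 0.50)
n90 <- words_for_coverage(toy_freq, 0.90)
print(c(n50 = n50, n90 = n90))
```

In this toy vocabulary the cumulative shares are 40%, 70%, 85%, 95% and 100%, so two words are enough for 50% coverage and four for 90%.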

##                File   Lines    Words Distinct_words Distinct_trigrams
## 1 en_US.twitter.txt 2360148 17111806         302505          15041960
## 2   en_US.blogs.txt  899288 19347162         252893          18013460
## 3    en_US.news.txt 1010242 19760894         212079          17514013
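
With word vectors of this size, building n-grams one element at a time with `lapply` can be slow. A possible vectorized alternative is sketched below; the function name `create_ngrams_vec` is an assumption for illustration and is not part of the analysis above. Like `create_ngrams`, it treats the whole word vector as one sequence, so n-grams may span line boundaries.

```r
# Hypothetical vectorized alternative to create_ngrams()
create_ngrams_vec <- function(words, n) {
  len <- length(words)
  if (len < n) return(character(0))
  # Build n shifted copies of the word vector, each of length len - n + 1
  shifted <- lapply(seq_len(n), function(k) words[k:(len - n + k)])
  # Paste the shifted copies element-wise to form the n-grams
  do.call(paste, c(shifted, sep = " "))
}

# Toy input (illustrative, not taken from the dataset)
toy_words <- c("thanks", "for", "the", "follow", "back")
toy_trigrams <- create_ngrams_vec(toy_words, 3)
print(toy_trigrams)
```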

Next steps

Some ideas to explore for building the predictive text function:

Apply stemming, i.e., reduce words to their root form. For example, “running” and “runs” both become “run”. This can reduce the number of variations to include in the dictionary.
Apply lemmatization. It is similar to stemming, but more sophisticated: it reduces words to their base or dictionary form, taking the context into account.
Look for ways to reduce memory consumption. While creating this report, the 3 text files weighed 200 MB each, but the RAM usage by R went up to 6-8 GB.
Investigate prediction functions, like the one below:

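# Prediction function based on bigrams and trigrams.
# bigram_df and trigram_df are expected in the format returned by freq_ngrams():
# an `ngrams` column, sorted by descending frequency.
predict_next_word <- function(previous_words, bigram_df, trigram_df) {
  # Use the last two words for trigram prediction when available
  if (length(previous_words) >= 2) {
    # The trailing space avoids matching partial words (e.g. "the" vs "then")
    prefix <- paste0(paste(tail(previous_words, 2), collapse = " "), " ")
    # Filter trigrams starting with the previous words
    trigram_matches <- trigram_df %>%
      filter(startsWith(ngrams, prefix))
    
    if (nrow(trigram_matches) > 0) {
      # Return only the last word of the most frequent matching trigram
      return(sub("^.* ", "", trigram_matches$ngrams[1]))
    }
  }
  
  # Nothing to predict from if no previous word is given
  if (length(previous_words) == 0) {
    return(NA)
  }
  
  # Use last word for bigram prediction
  prefix <- paste0(tail(previous_words, 1), " ")
  bigram_matches <- bigram_df %>%
    filter(startsWith(ngrams, prefix))
  
  if (nrow(bigram_matches) > 0) {
    return(sub("^.* ", "", bigram_matches$ngrams[1]))
  }
  
  return(NA)
}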
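
Scanning every n-gram on each prediction may be slow for tables with millions of rows. One possible refinement, sketched here under assumptions, is to split each n-gram once into a prefix and a next word, keep only the most frequent continuation per prefix, and predict by exact lookup. The helper names `build_lookup` and `predict_from_lookup` and the toy tables are hypothetical; the input is assumed to have an `ngrams` column sorted by descending frequency, as returned by `freq_ngrams`.

```r
# Hypothetical helper: keep the most frequent continuation for each prefix.
# Assumes ngram_df has an `ngrams` column sorted by descending frequency.
build_lookup <- function(ngram_df) {
  prefix <- sub(" [^ ]+$", "", ngram_df$ngrams)   # everything but the last word
  next_word <- sub("^.* ", "", ngram_df$ngrams)   # the last word
  keep <- !duplicated(prefix)                     # first row per prefix is the most frequent
  setNames(next_word[keep], prefix[keep])
}

# Hypothetical helper: try the trigram table first, then back off to bigrams
predict_from_lookup <- function(previous_words, bigram_lookup, trigram_lookup) {
  if (length(previous_words) >= 2) {
    key <- paste(tail(previous_words, 2), collapse = " ")
    if (key %in% names(trigram_lookup)) return(unname(trigram_lookup[key]))
  }
  if (length(previous_words) >= 1) {
    key <- tail(previous_words, 1)
    if (key %in% names(bigram_lookup)) return(unname(bigram_lookup[key]))
  }
  NA_character_
}

# Toy tables (illustrative values, not taken from the dataset)
toy_bigrams <- data.frame(ngrams = c("happy birthday", "happy new", "new year"),
                          Freq = c(5, 3, 2), stringsAsFactors = FALSE)
toy_trigrams <- data.frame(ngrams = c("happy new year", "happy new home"),
                           Freq = c(4, 1), stringsAsFactors = FALSE)
bi <- build_lookup(toy_bigrams)
tri <- build_lookup(toy_trigrams)

p1 <- predict_from_lookup(c("happy", "new"), bi, tri)   # trigram match
p2 <- predict_from_lookup(c("very", "happy"), bi, tri)  # backs off to the bigram table
p3 <- predict_from_lookup(c("unknown"), bi, tri)        # no match in either table
```

Keeping a single continuation per prefix also shrinks the tables considerably, at the cost of offering only one suggestion per context.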

Raul Moya

1/30/2025
