Introduction

The goal of this project is to explore the text data and prepare for building a word-prediction algorithm and Shiny app. The dataset used for this assignment consists of text from sources such as blogs, news, and Twitter, in multiple languages. In this milestone report, I demonstrate that I have successfully downloaded and loaded the data, performed exploratory analysis, and laid out plans for building the prediction model and Shiny app.

Loading the Data

The dataset has been downloaded from the Coursera platform and contains files for different languages. For the purpose of this project, I have focused on the English (en_US) dataset and have loaded a subset of the data for the exploratory analysis. Below are the basic properties of the dataset.

# Load necessary libraries
library(data.table)

# Define file path
file_path <- "S:/CITA/SUSANA, ABRIL/Training and Resources/Data Science Specialization/capstone/Coursera-SwiftKey/final/en_US/"

# Load a small sample of the English blogs dataset
set.seed(123)  # for reproducibility of any random sampling
con <- file(paste0(file_path, "en_US.blogs.txt"), "r")
lines <- readLines(con, n = 10000)  # Read the first 10,000 lines
close(con)

# Display the first few lines of the dataset
head(lines, 3)
## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
## [2] "We love you Mr. Brown."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
## [3] "Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him."

The dataset consists of text data where each line represents a sentence or a small paragraph. Below are the basic summary statistics for the file:

# Word and line counts
num_lines <- length(lines)
num_words <- sum(sapply(strsplit(lines, "\\s+"), length))

data.frame("Lines" = num_lines, "Words" = num_words)
##   Lines  Words
## 1 10000 410620

A preliminary inspection shows that the data contain a mix of short sentences and longer passages. The full dataset is large, so I have worked with a smaller subset for this analysis to conserve memory and time.
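
Note that set.seed() only matters if a random draw is actually made; reading the first 10,000 lines is deterministic and may not be representative of the whole file. A minimal sketch of one alternative, assuming the full file (roughly 200 MB) fits in memory:

# Sketch: draw a random 10,000-line sample instead of the first 10,000 lines
all_lines <- readLines(paste0(file_path, "en_US.blogs.txt"), warn = FALSE)
lines <- sample(all_lines, 10000)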

Exploratory Data Analysis

The next step is to clean the text: convert it to lowercase, remove punctuation, numbers, and English stopwords, and collapse the leftover whitespace. Here’s how I cleaned a small subset of the data:

library(tm)
## Loading required package: NLP
# Function to clean text: lowercase, strip punctuation and numbers,
# remove English stopwords, then collapse leftover runs of whitespace
clean_text <- function(text) {
  text <- tolower(text)
  text <- removePunctuation(text)
  text <- removeNumbers(text)
  text <- removeWords(text, stopwords("en"))
  text <- stripWhitespace(text)
  return(text)
}

# Clean the first 5 lines of the dataset
cleaned_lines <- sapply(lines[1:5], clean_text, USE.NAMES = FALSE)
cleaned_lines
## [1] " years thereafter oil fields platforms named pagan “gods”"
## [2] " love mr brown"
## [3] "chad awesome kids holding fort work later usual kids busy together playing skylander xbox together kyan cashed piggy bank wanted game bad used gift card birthday saving money get never taps thing either know wanted bad made count money make sure enough cute watch reaction realized also good job letting lola feel like playing letting switch characters loves almost much "
## [4] " anyways going share home decor inspiration storing folder puter amazing images stored away ready come life get home"
## [5] " graduation season right around corner nancy whipped fun set help graduation cards gifts occasion brings change ones life stamped images memento tuxedo black cut circle nestabilities embossed kraft red cardstock tes new stars impressions plate double sided gives fantastic patterns can see use impressions plates tutorial taylor created just one pass die cut machine using embossing pad kit need super easy"

Tokenization and Frequency Distribution

After cleaning, I tokenized the text to extract individual words and their frequencies. Here’s a summary of the word frequency distribution for the sample data:

# Tokenize the cleaned data (clean_text already lowercases) and drop
# the empty strings left behind by leading whitespace
tokens <- unlist(strsplit(cleaned_lines, "\\s+"))
tokens <- tokens[nzchar(tokens)]
token_freq <- table(tokens)
token_freq_sorted <- sort(token_freq, decreasing = TRUE)

# Display the top 10 most frequent words
head(token_freq_sorted, 10)
## tokens
##         bad         cut         get  graduation        home      images 
##           2           2           2           2           2           2 
## impressions        kids     letting        life 
##           2           2           2           2

With stopwords such as “the”, “and”, and “to” removed during cleaning, the most frequent remaining tokens are content words such as “bad”, “kids”, and “graduation”, each appearing only a handful of times in this small sample.

Word Frequencies Plot

The plot below shows the frequency distribution of the top 20 most frequent words in the cleaned sample:

library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate
# Prepare data for plotting
freq_data <- data.frame(word = names(token_freq_sorted[1:20]), freq = as.integer(token_freq_sorted[1:20]))

# Create a bar plot for the top 20 frequent words
ggplot(freq_data, aes(x = reorder(word, -freq), y = freq)) +
  geom_col() +
  coord_flip() +
  labs(title = "Top 20 Most Frequent Words", x = "Words", y = "Frequency")

N-gram Analysis

To better understand relationships between words, I generated 2-grams (pairs of consecutive words); 3-grams (triplets of consecutive words) follow the same pattern, as sketched after the table. The table below shows the top 10 2-grams; in this five-line sample every 2-gram occurs exactly once, so the sorted table is essentially alphabetical:

# Function to generate n-grams by sliding a window of n tokens
# (n-grams can span line boundaries here, acceptable for this sample)
generate_ngrams <- function(tokens, n = 2) {
  ngrams <- lapply(1:(length(tokens) - n + 1), function(i) paste(tokens[i:(i + n - 1)], collapse = " "))
  return(unlist(ngrams))
}

# Generate 2-grams
two_grams <- generate_ngrams(tokens, n = 2)
two_gram_freq <- table(two_grams)
two_gram_freq_sorted <- sort(two_gram_freq, decreasing = TRUE)

# Display top 10 2-grams
head(two_gram_freq_sorted, 10)
## two_grams
##    “gods” love    almost much      also good amazing images  anyways going 
##              1              1              1              1              1 
##  around corner     away ready   awesome kids       bad made       bad used 
##              1              1              1              1              1
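
The same helper produces 3-grams with n = 3. In this five-line sample every 3-gram, like every 2-gram, occurs exactly once, so the flat frequency table is omitted; a minimal sketch:

# Generate 3-grams with the same helper (output omitted for this sample)
three_grams <- generate_ngrams(tokens, n = 3)
three_gram_freq_sorted <- sort(table(three_grams), decreasing = TRUE)
head(three_gram_freq_sorted, 10)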

Insights from N-gram Analysis

Because stopwords were removed during cleaning, the top 2-grams are content-word pairings such as “amazing images” and “around corner”, each occurring only once in this small sample. With stopwords retained, high-frequency pairs such as “in the”, “to the”, and “of the” would dominate; the prediction model will likely be trained with stopwords kept, since users expect those words to be suggested.

Plans for Prediction Algorithm and Shiny App

Prediction Algorithm

The goal is to build a predictive text algorithm using an n-gram model (2-grams, 3-grams, etc.). The model will predict the next word from the words the user has typed so far. To handle unseen n-grams, I will employ smoothing and backoff techniques, which should improve prediction accuracy.
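
To illustrate the backoff idea, here is a minimal sketch built on the toy frequency tables above; predict_next_word() is a hypothetical helper, not the final model:

# Sketch of a backoff lookup: try the 2-gram table first, then fall back
# to the most frequent unigram when no matching 2-gram exists
predict_next_word <- function(input, two_gram_freq, token_freq) {
  words <- unlist(strsplit(tolower(input), "\\s+"))
  last_word <- tail(words, 1)
  # keep only 2-grams whose first word matches the last word typed
  candidates <- two_gram_freq[startsWith(names(two_gram_freq), paste0(last_word, " "))]
  if (length(candidates) > 0) {
    best <- names(which.max(candidates))
    return(sub(".* ", "", best))  # second word of the best-scoring 2-gram
  }
  names(which.max(token_freq))    # backoff: most frequent single word
}

predict_next_word("his piggy", two_gram_freq, token_freq)  # suggests "bank"

A production version would add 3-gram counts and a discount for lower-order matches (as in "stupid backoff"), rather than relying on raw frequencies alone.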

Shiny App

In the Shiny app, the user will type a sequence of words and the app will suggest the most likely next word based on the trained n-gram model. The app will be kept lightweight and fast to ensure a smooth user experience, including on mobile devices.
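
A minimal skeleton of such an app, assuming the hypothetical predict_next_word() helper sketched above:

# Sketch of the planned Shiny app (prediction model to be plugged in later)
library(shiny)

ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("phrase", "Type a phrase:"),
  textOutput("suggestion")
)

server <- function(input, output) {
  output$suggestion <- renderText({
    req(input$phrase)  # wait until the user has typed something
    predict_next_word(input$phrase, two_gram_freq, token_freq)
  })
}

# shinyApp(ui, server)  # uncomment to run locally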

Conclusion

In this milestone report, I have demonstrated that I successfully loaded the data, cleaned and tokenized it, performed exploratory data analysis, and outlined my plans for creating the predictive model and Shiny app. The next steps will involve refining the n-gram model and implementing the Shiny app to allow real-time predictions.

The final prediction algorithm will be a simple, efficient model that returns suggestions quickly enough for use on mobile devices, and the Shiny app will provide an interactive user interface for word prediction.