The goal of this project is to explore the text data and prepare for the creation of a prediction algorithm and Shiny app. The dataset used for this assignment consists of text data from blogs, news, and Twitter in multiple languages. In this milestone report, I demonstrate that I have successfully downloaded and loaded the data, performed exploratory analysis, and laid out the plans for building the prediction model and Shiny app.
The dataset has been downloaded from the Coursera platform and contains files for different languages. For the purpose of this project, I have focused on the English (en_US) dataset and have loaded a subset of the data for the exploratory analysis. Below are the basic properties of the dataset.
# Load necessary libraries
library(data.table)
# Define file path
file_path <- "S:/CITA/SUSANA, ABRIL/Training and Resources/Data Science Specialization/capstone/Coursera-SwiftKey/final/en_US/"
# Load a small sample of the English blogs dataset
set.seed(123) # for reproducibility of any later sampling
con <- file(paste0(file_path, "en_US.blogs.txt"), "r")
lines <- readLines(con, n = 10000) # Read the first 10,000 lines as a working sample
close(con)
# Display the first few lines of the dataset
head(lines, 3)
## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”."
## [2] "We love you Mr. Brown."
## [3] "Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him."
The dataset consists of text data where each line represents a sentence or a small paragraph. Below are the basic summary statistics for the file:
# Word and line counts
num_lines <- length(lines)
num_words <- sum(sapply(strsplit(lines, "\\s+"), length))
data.frame("Lines" = num_lines, "Words" = num_words)
## Lines Words
## 1 10000 410620
A preliminary inspection shows that the data contains a mix of short sentences and longer passages. The full dataset is large, so I have worked with a smaller subset for this analysis to conserve memory and time; a sketch for summarizing all three English files is shown below.
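For context, a summary of the full en_US corpus could be produced along the following lines (a minimal sketch, not evaluated for this report; it assumes the standard SwiftKey file names under file_path and reads each file entirely into memory, which is slow for the Twitter file):

# Sketch: basic properties of all three en_US files (not evaluated here)
files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")
file_stats <- do.call(rbind, lapply(files, function(f) {
  path <- paste0(file_path, f)
  txt  <- readLines(path, skipNul = TRUE, warn = FALSE)  # whole file in memory
  data.frame(File   = f,
             SizeMB = round(file.size(path) / 1024^2, 1),
             Lines  = length(txt),
             Words  = sum(sapply(strsplit(txt, "\\s+"), length)))
}))
file_stats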
The next step is cleaning the text by converting it to lowercase and removing punctuation, numbers, and stopwords. Here’s how I cleaned and tokenized a small subset of the data:
library(tm)
## Loading required package: NLP
# Function to clean text
clean_text <- function(text) {
  text <- tolower(text)                       # convert to lowercase
  text <- removePunctuation(text)             # strip punctuation
  text <- removeNumbers(text)                 # strip digits
  text <- stripWhitespace(text)               # collapse repeated whitespace
  text <- removeWords(text, stopwords("en"))  # drop English stopwords
  return(text)
}
# Clean the first 5 lines of the dataset
cleaned_lines <- sapply(lines[1:5], clean_text)
cleaned_lines
## In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”.
## " years thereafter oil fields platforms named pagan “gods”"
## We love you Mr. Brown.
## " love mr brown"
## Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him.
## "chad awesome kids holding fort work later usual kids busy together playing skylander xbox together kyan cashed piggy bank wanted game bad used gift card birthday saving money get never taps thing either know wanted bad made count money make sure enough cute watch reaction realized also good job letting lola feel like playing letting switch characters loves almost much "
## so anyways, i am going to share some home decor inspiration that i have been storing in my folder on the puter. i have all these amazing images stored away ready to come to life when we get our home.
## " anyways going share home decor inspiration storing folder puter amazing images stored away ready come life get home"
## With graduation season right around the corner, Nancy has whipped up a fun set to help you out with not only your graduation cards and gifts, but any occasion that brings on a change in one's life. I stamped the images in Memento Tuxedo Black and cut them out with circle Nestabilities. I embossed the kraft and red cardstock with TE's new Stars Impressions Plate, which is double sided and gives you 2 fantastic patterns. You can see how to use the Impressions Plates in this tutorial Taylor created. Just one pass through your die cut machine using the Embossing Pad Kit is all you need to do - super easy!
## " graduation season right around corner nancy whipped fun set help graduation cards gifts occasion brings change ones life stamped images memento tuxedo black cut circle nestabilities embossed kraft red cardstock tes new stars impressions plate double sided gives fantastic patterns can see use impressions plates tutorial taylor created just one pass die cut machine using embossing pad kit need super easy"
After cleaning, I tokenized the text to extract individual words and their frequencies. Here’s a summary of the word frequency distribution for the sample data:
# Tokenize the cleaned data
tokens <- unlist(strsplit(tolower(cleaned_lines), "\\s+"))
token_freq <- table(tokens)
token_freq_sorted <- sort(token_freq, decreasing = TRUE)
# Display the top 10 most frequent words
head(token_freq_sorted, 10)
## tokens
## bad cut get graduation home
## 4 2 2 2 2 2
## images impressions kids letting
## 2 2 2 2
Because stopwords such as “the”, “and”, and “to” were removed during cleaning, the most frequent tokens in the sample are content words such as “bad”, “kids”, and “images”. In a sample this small most words occur only a handful of times, so the frequency distribution will become more informative on a larger sample.
The plot below shows the frequency distribution of the top 20 most frequent words in the dataset:
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
# Prepare data for plotting
freq_data <- data.frame(word = names(token_freq_sorted[1:20]), freq = as.integer(token_freq_sorted[1:20]))
# Create a bar plot for the top 20 frequent words
ggplot(freq_data, aes(x = reorder(word, -freq), y = freq)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(title = "Top 20 Most Frequent Words", x = "Words", y = "Frequency")
To better understand relationships between words, I generated 2-grams (pairs of consecutive words); 3-grams (triplets of consecutive words) can be produced with the same helper function, as sketched after the 2-gram table. The table below shows the top 10 most frequent 2-grams:
# Function to generate n-grams
generate_ngrams <- function(tokens, n = 2) {
  ngrams <- lapply(1:(length(tokens) - n + 1),
                   function(i) paste(tokens[i:(i + n - 1)], collapse = " "))
  return(unlist(ngrams))
}
# Generate 2-grams
two_grams <- generate_ngrams(tokens, n = 2)
two_gram_freq <- table(two_grams)
two_gram_freq_sorted <- sort(two_gram_freq, decreasing = TRUE)
# Display top 10 2-grams
head(two_gram_freq_sorted, 10)
## two_grams
## anyways graduation love years “gods”
## 1 1 1 1 1
## almost much also good amazing images anyways going around corner
## 1 1 1 1 1
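The same generate_ngrams() helper extends to 3-grams by setting n = 3; a quick sketch (on this small sample the counts would again be mostly 1):

# Generate 3-grams with the same helper function
three_grams <- generate_ngrams(tokens, n = 3)
three_gram_freq_sorted <- sort(table(three_grams), decreasing = TRUE)
# Display top 10 3-grams
head(three_gram_freq_sorted, 10)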
In this small sample, most 2-grams occur only once, so no pairings clearly dominate yet. On the full corpus, and with stopwords retained for the prediction task, very common pairings such as “in the”, “to the”, and “of the” appear frequently; these relationships are what the n-gram prediction model will exploit.
The goal is to build a predictive text algorithm using an n-gram model (2-grams, 3-grams, etc.). This model will predict the next word based on the previous words entered by the user. To handle unseen n-grams and improve prediction accuracy, I will employ backoff models and possibly smoothing techniques. A minimal sketch of the backoff idea follows.
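As an illustration only (not the final algorithm), a backoff lookup could work roughly as follows; the table names tri_tab, bi_tab, and uni_tab are hypothetical data frames of n-gram counts with columns prefix, word, and freq:

# Sketch of a backoff-style next-word lookup (illustrative only).
# tri_tab / bi_tab / uni_tab are hypothetical data frames of n-gram counts
# with columns: prefix (preceding words), word (next word), freq (count).
predict_next <- function(input, tri_tab, bi_tab, uni_tab) {
  words <- unlist(strsplit(tolower(input), "\\s+"))
  n <- length(words)
  # 1. Try the trigram table: condition on the last two words
  if (n >= 2) {
    hits <- tri_tab[tri_tab$prefix == paste(words[n - 1], words[n]), ]
    if (nrow(hits) > 0) return(hits$word[which.max(hits$freq)])
  }
  # 2. Back off to the bigram table: condition on the last word
  if (n >= 1) {
    hits <- bi_tab[bi_tab$prefix == words[n], ]
    if (nrow(hits) > 0) return(hits$word[which.max(hits$freq)])
  }
  # 3. Final fallback: the single most frequent word overall
  uni_tab$word[which.max(uni_tab$freq)]
}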
For the Shiny app, the user will be able to type a sequence of words, and the app will suggest the most likely next word based on the trained n-gram model. The app will be lightweight and fast, ensuring a smooth user experience on mobile devices.
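A minimal sketch of the planned app structure, assuming a prediction function like predict_next() above and precomputed n-gram tables loaded at startup:

library(shiny)

# Sketch of the planned Shiny app (illustrative; the real app will load
# precomputed n-gram tables and call the trained prediction function).
ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("phrase", "Type a phrase:", value = ""),
  h4("Suggested next word:"),
  textOutput("suggestion")
)

server <- function(input, output) {
  output$suggestion <- renderText({
    if (nchar(trimws(input$phrase)) == 0) return("")
    predict_next(input$phrase, tri_tab, bi_tab, uni_tab)  # hypothetical tables
  })
}

shinyApp(ui, server)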
In this milestone report, I have demonstrated that I successfully loaded the data, cleaned and tokenized it, performed exploratory data analysis, and outlined my plans for creating the predictive model and Shiny app. The next steps will involve refining the n-gram model and implementing the Shiny app to allow real-time predictions.
The final prediction algorithm will be a simple and efficient model that can run on mobile devices, and the Shiny app will provide an interactive user interface for word prediction.