This is the milestone report for the ‘Data Science Capstone’ course of the Data Science Specialization by Johns Hopkins University.
The goal of the capstone is to build a predictive text application that suggests the next word, trained on a dataset of blog, Twitter and news text.
This report presents an exploratory analysis of the data and outlines the design of the future application.
The training dataset is available at the following link: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
It contains text files in 4 different languages, drawn from Twitter, blogs and news.
For this capstone we use the English files (under the ‘/en_US/’ folder) and apply an initial pre-processing step to extract the words from each file.
Stopwords are common words like “a”, “an”, “the” that do not carry significant meaning and can be removed from text data to improve the performance of machine learning models.
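As a minimal illustration of what stop-word removal does to a sentence, the sketch below filters a handful of tokens against a tiny, hypothetical stop-word list. The list `toy_stop_words` is an assumption made for the example only; the report itself relies on `stopwords("en")`.

```r
# Hypothetical, deliberately tiny stop-word list (illustration only);
# the report uses the much longer list returned by stopwords("en").
toy_stop_words <- c("a", "an", "the", "is", "on")

tokens <- c("the", "cat", "is", "on", "a", "mat")

# Keep only the tokens that are not in the stop-word list
content_words <- tokens[!tokens %in% toy_stop_words]
print(content_words)
```

Only "cat" and "mat" survive the filter, which is the same `%in%` pattern that `extract_words` applies to the full corpus.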
The following functions have been defined for the analysis in this milestone report:
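library(stringr)  # str_replace_all(), str_split()
library(dplyr)    # %>%, arrange(), mutate(), filter()
library(tm)       # stopwords(); assumed source, the stopwords package has a function of the same name

extract_words <- function(text_data)
{
  # Combine all lines into a single text
  text_data <- paste(text_data, collapse = " ")
  # Convert to lowercase so the [^a-z] filter below does not delete capital letters
  text_data <- tolower(text_data)
  # Remove punctuation, numbers and any other non-letter characters
  text_data <- str_replace_all(text_data, "[^a-z\\s]", " ")
  # Split the text into words
  words <- unlist(str_split(text_data, "\\s+"))
  # Remove empty strings
  words <- words[words != ""]
  # Remove stop words
  stop_words <- stopwords("en")
  words <- words[!words %in% stop_words]
  # Remove one-letter words
  words <- words[nchar(words) > 1]
  return(words)
}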
count_words <- function(words)
{
  # Count the frequency of each word
  word_count <- table(words)
  # Convert to a data frame with columns `words` and `Freq`
  word_count_df <- as.data.frame(word_count, stringsAsFactors = FALSE)
  # Calculate the total number of words
  total_words <- sum(word_count_df$Freq)
  # Sort by descending frequency, then add percentage and cumulative percentage columns
  word_count_df <- word_count_df %>%
    arrange(desc(Freq)) %>%
    mutate(Percentage = (Freq / total_words) * 100,
           CumulativePercentage = cumsum(Freq) / total_words * 100)
  return(word_count_df)
}
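create_ngrams <- function(words, n) {
  # Build one n-gram per starting position; positions too close to the end yield NA
  ngrams <- lapply(seq_along(words), function(i) {
    if (i <= length(words) - (n - 1)) {
      paste(words[i:(i + n - 1)], collapse = " ")
    } else {
      NA
    }
  })
  ngrams <- unlist(ngrams)
  # Drop the incomplete n-grams at the end of the text
  ngrams <- ngrams[!is.na(ngrams)]
  return(ngrams)
}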
freq_ngrams <- function(ngrams)
{
  # Count frequencies; the resulting data frame has columns `ngrams` and `Freq`
  freq <- table(ngrams)
  ngrams_df <- as.data.frame(freq, stringsAsFactors = FALSE)
  # Calculate the total number of n-grams
  total <- sum(ngrams_df$Freq)
  # Sort by descending frequency, then add percentage and cumulative percentage columns
  ngrams_df <- ngrams_df %>%
    arrange(desc(Freq)) %>%
    mutate(Percentage = (Freq / total) * 100,
           CumulativePercentage = cumsum(Freq) / total * 100)
  return(ngrams_df)
}
The resulting summary for the three English files is:
##                File   Lines    Words Unique_words
## 1 en_US.twitter.txt 2360148 17111806       302505
## 2   en_US.blogs.txt  899288 19347162       252893
## 3    en_US.news.txt 1010242 19760894       212079
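A table of this shape could be assembled with a small per-file helper. The sketch below is an assumption about how such a summary might be computed, not the code used for the report: `summarise_file` is a hypothetical name, the tokenisation is a simplified base-R stand-in that does not remove stop words, and a temporary toy file replaces the real corpus so the example runs anywhere.

```r
# Summarise one text file: line count, word count and distinct-word count.
# Tokenisation is simplified: lowercase, letters only, no stop-word removal.
summarise_file <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", warn = FALSE)
  text <- tolower(paste(lines, collapse = " "))
  words <- unlist(strsplit(gsub("[^a-z ]", " ", text), " +"))
  words <- words[words != ""]
  data.frame(File = basename(path),
             Lines = length(lines),
             Words = length(words),
             Unique_words = length(unique(words)))
}

# Toy input written to a temporary file (not part of the real dataset)
toy_path <- file.path(tempdir(), "toy.txt")
writeLines(c("The cat sat.", "The dog sat too!"), toy_path)

toy_summary <- summarise_file(toy_path)
print(toy_summary)
```

For the real data, the same helper could be applied to each file under ‘/en_US/’ and the rows combined with `rbind`.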
First, we look at how word usage is distributed in each file and calculate how many distinct words are needed to cover 50% or 90% of all word occurrences. This indicates how far the training data set could potentially be reduced.
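The coverage figure can be read off a cumulative share of sorted counts, which is what the `CumulativePercentage` column in `count_words` holds. The following base-R sketch shows the calculation on toy counts; `words_for_coverage` and `toy_freq` are hypothetical names introduced for this example.

```r
# Number of distinct words needed to cover a given share of all occurrences.
# `freq` is a numeric vector of word counts in any order; `target` is in (0, 1].
words_for_coverage <- function(freq, target) {
  cum_share <- cumsum(sort(freq, decreasing = TRUE)) / sum(freq)
  unname(which(cum_share >= target)[1])
}

# Toy counts (hypothetical, not taken from the corpus); they sum to 100
toy_freq <- c(the = 50, cat = 20, sat = 15, mat = 10, hat = 5)

n50 <- words_for_coverage(toy_freq, 0.50)
n90 <- words_for_coverage(toy_freq, 0.90)
print(c(n50 = n50, n90 = n90))
```

In this toy case a single word already covers 50% of the occurrences, while four words are needed to reach 90%.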
Plotting the 20 most frequent words gives a first impression of the most used words in the language.
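One simple way to obtain such a ranking is to sort a frequency table and keep its head. The sketch below uses a toy token vector and base R only; the plotting call is an assumption about presentation and not necessarily how the report's figure was produced.

```r
# Toy tokens (hypothetical); in the report these would come from extract_words()
toy_words <- c("time", "day", "time", "love", "day", "time", "good")

# Sort the frequency table in descending order and keep the top entries
# (3 here for the toy data, 20 in the report)
top_words <- head(sort(table(toy_words), decreasing = TRUE), 3)
print(top_words)

# Draw a bar chart only in an interactive session
if (interactive()) {
  barplot(top_words, las = 2, main = "Most frequent words (toy data)")
}
```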
Next, we build trigrams and identify the most frequent word sequences. These will later be useful for predicting the next word as the user types in the target application:
##                File   Lines    Words Distinct_words Distinct_trigrams
## 1 en_US.twitter.txt 2360148 17111806         302505          15041960
## 2   en_US.blogs.txt  899288 19347162         252893          18013460
## 3    en_US.news.txt 1010242 19760894         212079          17514013
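With close to twenty million tokens per file, building n-grams one position at a time can be slow. As a possible alternative, the sketch below pastes n shifted copies of the token vector together in a single vectorised call. `ngrams_vectorised` is a hypothetical name, and this is an illustration of the idea rather than the code used to produce the table above.

```r
# Build all n-grams of a token vector by pasting n shifted copies together.
ngrams_vectorised <- function(words, n) {
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  # shifted[[k]] holds the k-th word of every n-gram
  shifted <- lapply(seq_len(n), function(k) words[starts + k - 1])
  do.call(paste, shifted)
}

# Toy tokens (hypothetical, not taken from the corpus)
toy_words <- c("see", "you", "later", "see", "you", "soon")

toy_trigrams <- ngrams_vectorised(toy_words, 3)
print(toy_trigrams)

# Number of distinct trigrams, as reported in the Distinct_trigrams column
distinct_trigrams <- length(unique(toy_trigrams))
```

For the same input, the function is intended to return the same n-grams in the same order as `create_ngrams`.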
Some ideas to explore for building the predictive text function:
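# Prediction function based on bigrams and trigrams.
# Expects bigram_df with a `bigrams` column and trigram_df with a `trigrams` column,
# both sorted by descending frequency. Note that freq_ngrams() names its column
# `ngrams`, so that column needs renaming before the data frames are passed in.
predict_next_word <- function(previous_words, bigram_df, trigram_df) {
  # Use the last two words for trigram prediction when available
  if (length(previous_words) >= 2) {
    # The trailing space makes the match stop at a word boundary
    prefix <- paste0(paste(tail(previous_words, 2), collapse = " "), " ")
    # Filter trigrams starting with the previous words
    trigram_matches <- trigram_df %>%
      filter(startsWith(trigrams, prefix))
    if (nrow(trigram_matches) > 0) {
      # Return only the final word of the most frequent matching trigram
      return(sub("^.* ", "", trigram_matches$trigrams[1]))
    }
  }
  # Fall back to the last word for bigram prediction
  prefix <- paste0(tail(previous_words, 1), " ")
  bigram_matches <- bigram_df %>%
    filter(startsWith(bigrams, prefix))
  if (nrow(bigram_matches) > 0) {
    # Return only the final word of the most frequent matching bigram
    return(sub("^.* ", "", bigram_matches$bigrams[1]))
  }
  return(NA)
}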
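Another idea that could be explored is to precompute the best continuation for every prefix, so that a prediction becomes a single lookup instead of a scan over all trigrams. The sketch below is one possible shape for this, on toy data. The data frame, its contents and the names `lookup` and `lookup_next` are all assumptions made for the example.

```r
# Toy trigram counts (hypothetical data, not taken from the corpus)
toy_trigram_df <- data.frame(
  trigrams = c("see you later", "see you soon", "thank you much"),
  Freq = c(5, 3, 2),
  stringsAsFactors = FALSE
)

# Split each trigram into its two-word prefix and its final word
toy_trigram_df$prefix <- sub(" [^ ]+$", "", toy_trigram_df$trigrams)
toy_trigram_df$next_word <- sub("^.* ", "", toy_trigram_df$trigrams)

# Keep the most frequent continuation for every prefix
ordered <- toy_trigram_df[order(-toy_trigram_df$Freq), ]
best <- ordered[!duplicated(ordered$prefix), ]
lookup <- setNames(best$next_word, best$prefix)

# Return the stored continuation, or NA when the prefix was never seen
lookup_next <- function(prefix) {
  if (prefix %in% names(lookup)) lookup[[prefix]] else NA_character_
}

print(lookup_next("see you"))
print(lookup_next("thank you"))
print(lookup_next("never seen"))
```

A bigram table built the same way could serve as the fallback when the two-word prefix is missing, mirroring the backoff in `predict_next_word`.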