The overarching goal of this project is to develop a predictive text model based on text data provided by SwiftKey. This model will be presented in the form of a Shiny app, which suggests the next word to the user after receiving some text as input.
In this milestone report, the results of the exploratory analysis are presented, as well as an outline of how the predictive model development will proceed:
1. The data are downloaded, extracted and cleaned.
2. The data are sampled to reduce memory usage and computation time.
3. The samples are combined into a single corpus using the "quanteda" R library.
4. The corpus is processed to yield more informative results in the predictive model (for example, converting all words to lowercase so that capitalization does not cause the same word to be counted as multiple words) and preferable results (for example, removing profanity from the dataset).
5. The corpus is tokenized into single words (unigrams), pairs of words (bigrams) and groups of three words (trigrams).
6. The frequencies of the most common unigrams, bigrams and trigrams are visualized.
7. A brief description of how the model development will proceed is included.
In this step, the goals are to load the data, examine its basic characteristics (number of lines, words and characters per file), and prepare a sampled, cleaned corpus for further analysis.
We have data available for four languages: German, English, Finnish and Russian. In this project, we are only using English-language data. Let’s look at the files contained within the English-language dataset:
library(stringi)
# Load Dataset
twit_data <- read.delim("en_US.twitter.txt",skipNul = TRUE,sep = "")
news_data <- read.delim("en_US.news.txt",skipNul = TRUE,sep = "",header=FALSE)
blog_data <- read.delim("en_US.blogs.txt",skipNul = TRUE,sep = "",header=FALSE)
# Check dimensions of data
dim(twit_data)
## [1] 2421232 24
dim(news_data)
## [1] 76495 85
dim(blog_data)
## [1] 898543 140
The column counts above are an artifact of read.delim() splitting each line on whitespace; for the actual analysis, the data are read from the files line by line using readLines():
twitter_con <- file("en_US.twitter.txt", "r")
twitter_text <- readLines(twitter_con, encoding = "UTF-8", skipNul = TRUE)
close(twitter_con)
news_con <- file("en_US.news.txt", "r")
news_text <- readLines(news_con, encoding = "UTF-8", skipNul = TRUE)
close(news_con)
blog_con <- file("en_US.blogs.txt", "r")
blog_text <- readLines(blog_con, encoding = "UTF-8", skipNul = TRUE)
close(blog_con)
# count number of lines in each file
num_lines_file <- sapply(list(blog_text, news_text, twitter_text), length)
# count the number of words in each file
num_words_file <- sapply(list(blog_text, news_text, twitter_text), stri_stats_latex)[4,]
# count number of characters in each file
num_chars_file <- sapply(list(nchar(blog_text), nchar(news_text), nchar(twitter_text)), sum)
# summarize above information
files_summary <- data.frame(
filename = c("en_US.blogs.txt","en_US.news.txt", "en_US.twitter.txt"),
lines = num_lines_file,
words = num_words_file,
characters = num_chars_file
)
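The summary above covers line, word and character counts; the on-disk file sizes that motivate the sampling below can be checked with base R's file.size(). This is a small optional addition (the size_MB column is not part of the original summary), assuming the files sit in the working directory:

# optional: add on-disk file size (in megabytes) to the summary table
files_summary$size_MB <- round(file.size(as.character(files_summary$filename)) / 1024^2, 1)
files_summary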
As the summary table shows, each of the three files is quite large. Using all of the data would substantially increase computation time; therefore, 5,000 lines are sampled from each file, cleaned, and combined into a unified dataset:
set.seed(54321)
sample_size <- 5000
# sample data from each text file
blog_sample <- sample(blog_text, sample_size, replace = FALSE)
news_sample <- sample(news_text, sample_size, replace = FALSE)
twitter_sample <- sample(twitter_text, sample_size, replace = FALSE)
# combine the above three samples into one dataset
combined_sample <- c(blog_sample, news_sample, twitter_sample)
combined_sample_filename <- "sampled_data_EN.txt"
# write combined sample data to file
combined_sample_con <- file(combined_sample_filename, open = "w")
writeLines(combined_sample, combined_sample_con)
close(combined_sample_con)
library(quanteda)
corpus_EN <- corpus(combined_sample)
Next, content that would not be useful for prediction is removed from the corpus:

1. Website URLs, Twitter handles and email addresses
2. Non-ASCII characters
3. Numbers
4. Punctuation
5. Extra white space
text_tokens <- tokens(corpus_EN,
                      what = "word1",
                      remove_numbers = TRUE,
                      remove_punct = TRUE,
                      remove_url = TRUE,
                      remove_separators = TRUE,
                      remove_symbols = TRUE,
                      verbose = quanteda_options("verbose"))
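The tokens() call above handles numbers, punctuation, URLs, symbols and separators. Removing Twitter handles, email addresses and non-ASCII characters, as listed above, could be done with a simple regex pass over the raw sample before building the corpus; the clean_text helper below is an illustrative sketch (not part of the original pipeline) with deliberately simple patterns:

# illustrative pre-cleaning helper (assumed name, not part of the original pipeline)
clean_text <- function(x) {
  x <- gsub("[[:alnum:]._%+-]+@[[:alnum:].-]+\\.[[:alpha:]]+", " ", x)  # drop email addresses
  x <- gsub("@\\w+", " ", x)                                            # drop Twitter handles
  x <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")                 # drop non-ASCII characters
  gsub("\\s+", " ", x)                                                  # collapse extra white space
}
# usage (before building the corpus): combined_sample <- clean_text(combined_sample)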
Now we remove the stop words, since they hold negligible importance for the frequency analysis:
tokens_without_stopwords <- tokens_remove(text_tokens, pattern = stopwords("en"))
Next, we load a list of offensive words so that we can remove them from the text later:
cusswords_con <- file("bad_words.txt", "r")
cusswords_text<- readLines(cusswords_con, encoding = "UTF-8",skipNul = TRUE)
close(cusswords_con)
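The profanity list is applied further below via the remove argument of dfm(). In recent quanteda releases (3.x) that argument is deprecated, so an equivalent, more future-proof alternative would be to drop these tokens at the tokens stage; the tokens_clean object below is only an illustration and is not used in the rest of this report:

# optional alternative: filter profanity at the tokens stage (quanteda >= 3)
tokens_clean <- tokens_remove(tokens_without_stopwords, pattern = cusswords_text)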
In this step, the goal is to better understand the distribution of, and relationship between, words, tokens and phrases in the dataset. The following questions will be considered:

1. Some words are more frequent than others: what are the distributions of word frequencies?
2. What are the frequencies of 2-grams and 3-grams in the dataset?
3. How many unique words are needed in a frequency-sorted dictionary to cover 50% of all word instances in the language? 90%?
4. How can we evaluate how many of the words come from foreign languages?
5. Can we increase the coverage, either by identifying words that may not be in the corpora or by using a smaller number of words in the dictionary to cover the same number of phrases?
# generate unigrams without stopwords
unigram_without_stopwords <- tokens_ngrams(tokens_without_stopwords, n = 1, concatenator = " ")
# generate unigrams with stopwords
unigram_with_stopwords <- tokens_ngrams(text_tokens, n = 1, concatenator = " ")
Next, a document-feature matrix is created for the unigram (single word) case, without stopwords. During this process, all words are converted to lowercase and profanity and padding (extra white space) are removed:
# create the document-feature matrix
text_unigram <- dfm(unigram_without_stopwords,
                    tolower = TRUE,
                    remove_padding = TRUE,
                    remove = cusswords_text,
                    verbose = FALSE)
The 100 most frequent unigrams in the corpus are identified:
# define most frequently-occurring unigrams
unigram_top_features_no_stopwords <- topfeatures(text_unigram, 100)
# create dataframe displaying most frequent unigrams
unigram_top_features_no_stopwords_df <- data.frame(unigram = names(unigram_top_features_no_stopwords), freq = unigram_top_features_no_stopwords)
# display the most frequent unigrams
head(unigram_top_features_no_stopwords_df)
## unigram freq
## said said 1523
## one one 1265
## just just 1110
## can can 1027
## like like 1022
## time time 966
Now let's create a word cloud of the most frequent unigrams:
library(wordcloud)
wordcloud(words = unigram_top_features_no_stopwords_df$unigram,
          freq = unigram_top_features_no_stopwords_df$freq,
          min.freq = 1,
          max.words = 100,
          random.order = FALSE,
          rot.per = 0.35,
          colors = brewer.pal(9, "Set1"))
The frequencies of the 15 most common unigrams are shown in a bar chart:
library(ggplot2)
unigram_hist <- ggplot(unigram_top_features_no_stopwords_df[1:15, ],
                       aes(x = reorder(unigram, -freq), y = freq)) +
  geom_bar(stat = "identity", fill = "coral") +
  xlab("unigram") + ylab("frequency") + ggtitle("15 Most Frequent Unigrams") +
  theme(plot.title = element_text(size = 14, hjust = 0.5, vjust = 0.5),
        axis.text.x = element_text(hjust = 0.5, vjust = 0.5, angle = 45),
        axis.text.y = element_text(hjust = 0.5, vjust = 0.5))
unigram_hist
The high frequencies of the most common words suggest that a small fraction of the set of unique words comprises most of the corpus. Let’s consider how many unique words would be necessary to cover 50% and 90% of all word instances in the language (represented by the corpus).
Create another data frame, with a large maximum number of features, to capture all unique words in the dataset:
unigram_freqs_no_stopwords <- topfeatures(text_unigram, 100000)
unigram_freqs_no_stopwords_df <- data.frame(unigram = names(unigram_freqs_no_stopwords), freq = unigram_freqs_no_stopwords)
# calculate cumulative percentage of dataset to which each word contributes
unigram_freqs_no_stopwords_df["cumul_perc"] <- cumsum(unigram_freqs_no_stopwords_df$freq / sum(unigram_freqs_no_stopwords_df$freq)) * 100.0
Determine the number of words necessary to cover 50% of the language:

library(dplyr)
# count the words whose cumulative coverage is below 50%, then add one
num_words_fifty_perc <- nrow(filter(unigram_freqs_no_stopwords_df, cumul_perc < 50.0)) + 1
print(paste("Number of words necessary to cover 50% of the language = ", num_words_fifty_perc, sep = ""))
## [1] "Number of words necessary to cover 50% of the language = 36032"
Determine the number of words necessary to cover 90% of the language:

num_words_ninety_perc <- nrow(filter(unigram_freqs_no_stopwords_df, cumul_perc < 90.0)) + 1
print(paste("Number of words necessary to cover 90% of the language = ", num_words_ninety_perc, sep = ""))
## [1] "Number of words necessary to cover 90% of the language = 36032"
Count and print the total number of unique words in the corpus:
total_words <- nrow(unigram_freqs_no_stopwords_df)
print(paste("The total number of words in the corpus (before applying the spellchecker) = ", total_words, sep = ""))
## [1] "The total number of words in the corpus (before applying the spellchecker) = 36031"
Apply the spellchecker and count how many of these words are recognized as English:
library(hunspell)
total_words_spellcheck <- sum(hunspell_check(unlist(unigram_freqs_no_stopwords_df["unigram"])))
print(paste("The total number of English-language words in the corpus (after applying the spellchecker) = ", total_words_spellcheck, sep = ""))
## [1] "The total number of English-language words in the corpus (after applying the spellchecker) = 23826"
# generate bigrams without stopwords
bigram_no_stopwords <- tokens_ngrams(tokens_without_stopwords, n = 2, concatenator = " ")
# generate bigrams with stopwords
bigram_with_stopwords <- tokens_ngrams(text_tokens, n = 2, concatenator = " ")
# create the document-feature matrix
text_bigram <- dfm(bigram_no_stopwords,
                   tolower = TRUE,
                   remove_padding = TRUE,
                   remove = cusswords_text,
                   verbose = FALSE)
# define most frequently-occurring bigrams
bigram_top_features_no_stopwords <- topfeatures(text_bigram, 100)
# create dataframe displaying most frequent bigrams
bigram_top_features_no_stopwords_df <- data.frame(bigram = names(bigram_top_features_no_stopwords), freq = bigram_top_features_no_stopwords)
# display the most frequent bigrams
head(bigram_top_features_no_stopwords_df)
## bigram freq
## new york new york 99
## last year last year 97
## right now right now 74
## years ago years ago 66
## high school high school 62
## last week last week 56
wordcloud(words = bigram_top_features_no_stopwords_df$bigram,
          freq = bigram_top_features_no_stopwords_df$freq,
          min.freq = 1,
          max.words = 100,
          random.order = FALSE,
          rot.per = 0.35,
          colors = brewer.pal(9, "Set1"))
# build bar chart of the 15 most frequent bigrams
bigram_hist <- ggplot(bigram_top_features_no_stopwords_df[1:15, ],
                      aes(x = reorder(bigram, -freq), y = freq)) +
  geom_bar(stat = "identity", fill = "coral") +
  xlab("bigram") + ylab("frequency") + ggtitle("15 Most Frequent Bigrams") +
  theme(plot.title = element_text(size = 14, hjust = 0.5, vjust = 0.5),
        axis.text.x = element_text(hjust = 0.5, vjust = 0.5, angle = 45),
        axis.text.y = element_text(hjust = 0.5, vjust = 0.5))
# display the chart
bigram_hist
### 3. Trigrams
# generate trigrams without stopwords
trigram_no_stopwords <- tokens_ngrams(tokens_without_stopwords, n = 3, concatenator = " ")
# generate trigrams with stopwords
trigram_with_stopwords <- tokens_ngrams(text_tokens, n = 3, concatenator = " ")
# create the document-feature matrix
text_trigram <- dfm(trigram_no_stopwords,
                    tolower = TRUE,
                    remove_padding = TRUE,
                    remove = cusswords_text,
                    verbose = FALSE)
# define most frequently-occurring trigrams
trigram_top_features_no_stopwords <- topfeatures(text_trigram, 100)
# create dataframe displaying most frequent trigrams
trigram_top_features_no_stopwords_df <- data.frame(trigram = names(trigram_top_features_no_stopwords), freq = trigram_top_features_no_stopwords)
# display the most frequent trigrams
head(trigram_top_features_no_stopwords_df)
## trigram freq
## new york city new york city 13
## greenville newspaper south greenville newspaper south 12
## newspaper south carolina newspaper south carolina 12
## paintball marker repair paintball marker repair 8
## amazon services llc amazon services llc 8
## services llc amazon services llc amazon 8
wordcloud(words = trigram_top_features_no_stopwords_df$trigram,
          freq = trigram_top_features_no_stopwords_df$freq,
          min.freq = 1,
          max.words = 100,
          random.order = FALSE,
          rot.per = 0.35,
          colors = brewer.pal(9, "Set1"))
# build bar chart of the 15 most frequent trigrams
trigram_hist <- ggplot(trigram_top_features_no_stopwords_df[1:15, ],
                       aes(x = reorder(trigram, -freq), y = freq)) +
  geom_bar(stat = "identity", fill = "coral") +
  xlab("trigram") + ylab("frequency") + ggtitle("15 Most Frequent Trigrams") +
  theme(plot.title = element_text(size = 14, hjust = 0.5, vjust = 0.5),
        axis.text.x = element_text(hjust = 0.5, vjust = 0.5, angle = 45),
        axis.text.y = element_text(hjust = 0.5, vjust = 0.5))
# display the chart
trigram_hist
The predictive model will use a combination of unigrams, bigrams and trigrams to predict the next word, given some input text. Kneser-Ney smoothing is considered one of the most effective techniques for next-word prediction using n-grams. It not only discounts higher-order n-gram counts but also interpolates the discounted probabilities with lower-order n-gram probabilities. This improves upon simple back-off models by using all n-gram counts rather than just the highest-order one available. However, it is more computationally expensive and complex to implement than other techniques, and it requires a good understanding of the underlying algorithms and a large amount of data to work effectively. Kneser-Ney smoothing will be used to predict the next word, and the output will be displayed through a Shiny app.
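As an illustration of the idea rather than the final implementation, an interpolated Kneser-Ney estimate for bigrams can be written directly from the counts. In the sketch below, bigram_counts and kn_bigram_prob are assumed names: bigram_counts is a data frame with columns w1 and w2 (the two words of each bigram) and n (the observed bigram count), and d is the usual absolute-discount parameter:

# Minimal sketch of interpolated Kneser-Ney for bigrams (illustrative only).
# Assumes bigram_counts is a data frame with columns w1, w2 and n (bigram counts).
kn_bigram_prob <- function(bigram_counts, w1, w2, d = 0.75) {
  c_hist  <- sum(bigram_counts$n[bigram_counts$w1 == w1])         # total count of history w1
  c_pair  <- sum(bigram_counts$n[bigram_counts$w1 == w1 &
                                 bigram_counts$w2 == w2])         # count of the bigram (w1, w2)
  n_types <- nrow(bigram_counts)                                  # number of distinct bigram types
  p_cont  <- sum(bigram_counts$w2 == w2) / n_types                # continuation probability of w2
  if (c_hist == 0) return(p_cont)                                 # unseen history: fall back entirely
  lambda  <- d * sum(bigram_counts$w1 == w1) / c_hist             # interpolation weight for w1
  max(c_pair - d, 0) / c_hist + lambda * p_cont                   # discounted ML + weighted continuation
}
# e.g. kn_bigram_prob(bigram_counts, "new", "york")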