Milestone Report for the following project,
Suggest to the user 3 words that could follow the words already entered by the user
N-grams are word combinations, where N represents the number of words used to create the combination. For example, the sentence "The weather report suggested heavy rainfall" gives
1-grams (words): the | weather | report | suggested | heavy | rainfall
2-grams (bigrams): the weather | weather report | report suggested | ...
3-grams (trigrams): the weather report | weather report suggested | ...
4-grams (quadgrams): the weather report suggested | weather report suggested heavy | ...
This report aims to do the following,
Data used for this project is downloaded from here.
Things to note about our raw data,
Option A : We combine all the data into 1 data set
Advantages - simple, increases sample size
Disadvantages - Will not be able to capture the different language morphology
used among these different sources
Option B : Keep them separate and then conduct different EDA and different
models for each
Advantages - Preserves the language morphology from each source
Disadvantages - Smaller source of information, more computationally heavy
and complex
For now we will go for Option A, since it is much simpler, and may help us capture unique combinations of words which might be more useful
Do we use all the data from each source?
Each line in the files represents 50%-truncated data from one instance of a
source type. So it is natural that blog posts and news have longer texts as
compared to twitter.
Let's first get an idea of the number of instances (i.e. lines) for each source type
if(system.file(package = "R.utils") == "") install.packages("R.utils")
library(R.utils)
twitterN <- countLines(unz("./data/textData.zip",
"final/en_US/en_US.twitter.txt", open = "rb"))
blogsN <- countLines(unz("./data/textData.zip",
"final/en_US/en_US.blogs.txt", open = "rb"))
newsN <- countLines(unz("./data/textData.zip",
"final/en_US/en_US.news.txt", open = "rb"))
cat("Number of instances(ie. lines) from each source type:\n
1. Twitter - ", twitterN/10^6, " million\n
2. Blogs - ", blogsN/10^6, " million\n
3. News - ", newsN/10^6, "million")Number of instances(ie. lines) from each source type:
1. Twitter - 2.360148 million
2. Blogs - 0.899288 million
3. News - 1.010242 million
That is a lot of instances.
A random sample of 50000 instances from each source should provide enough information for drawing conclusions about our population data. (By the Law of Large Numbers, the mean of a sample gets closer and closer to the mean of the population as the sample size increases; in our case, this means words should occur in the sample at approximately the same relative frequencies as in the full data, so we should be able to draw conclusions about the entire data set from this sample.)
if(system.file(package = "readr") == "") install.packages("readr")
library(readr)
set.seed(65198)
twitterTextData <- read_lines(unz("data/textData.zip",
"final/en_US/en_US.twitter.txt"),
n_max = -1)[sample(1:twitterN, size = 50000)]
set.seed(65198)
blogsTextData <- read_lines(unz("data/textData.zip",
"final/en_US/en_US.blogs.txt"),
n_max = -1)[sample(1:blogsN, size = 50000)]
set.seed(65198)
newsTextData <- read_lines(unz("data/textData.zip",
"final/en_US/en_US.news.txt"),
n_max = -1)[sample(1:newsN, size = 50000)]
What proportion of each source type should be used?
All 3 source type texts were broken down into sentences, and then an equal number of sentences was selected from each. This prevents the amount of data from each source type from having an effect, and at the same time removes any bias while we try to determine whether the language morphology differs among the 3 sources.
This is important, since we are trying to predict the next word based on language syntax learned from this data set, and our expectation is that the 3 source types have different language morphology, which should impact our predictions.
if(system.file(package = "dplyr") == "") install.packages("dplyr")
if(system.file(package = "tidytext") == "") install.packages("tidytext")
library(dplyr)
library(tidytext)
# For now, we will preserve the uppercase characters
twitterSentences <- tibble(source = "twitter", text = twitterTextData) %>%
unnest_tokens(output = sentences,
input = text,
token = "sentences",
format = "text", # Confirms usage of tokenizers package
to_lower = FALSE)
blogsSentences <- tibble(source = "blogs", text = blogsTextData) %>%
unnest_tokens(output = sentences,
input = text,
token = "sentences",
format = "text", # Confirms usage of tokenizers package
to_lower = FALSE)
newsSentences <- tibble(source = "news", text = newsTextData) %>%
unnest_tokens(output = sentences,
input = text,
token = "sentences",
format = "text", # Confirms usage of tokenizers package
to_lower = FALSE)
# Displaying minimum and maximum number of characters by source type
cat("Minimum and Maximum number of characters in each sample data(in million characters): \n
1. Twitter - ", summary(nchar(twitterSentences))[c(1,6)]/10^6, "\n
2. Blogs - ", summary(nchar(blogsSentences))[c(1,6)]/10^6, "\n
3. News - ", summary(nchar(newsSentences))[c(1,6)]/10^6)Minimum and Maximum number of characters in each sample data(in million characters):
1. Twitter - 0.88336 3.750952
2. Blogs - 1.197509 12.08569
3. News - 0.804137 10.50334
All sources also contain a different number of characters. Thus, further sampling of these sentences was done to enforce an approximately equal contribution of each source to our data. The sampling ratio was adjusted to achieve approximately equal representation of each source type and was done in the following manner,
These sentences were randomly sampled with seed set at 75168
Data was combined from all 3 sources and the information on source type was preserved in the sample set
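The sampling and combining code itself is not shown in this report; below is a minimal sketch of what it might look like, assuming the combined object is named textSentences (the name used by the tokenization code further down). The per-source sample sizes here are illustrative, not the exact ratios that were used.
# Sketch only: sample a comparable number of sentences from each source and
# combine them, keeping the source column
library(dplyr)
set.seed(75168)
sentences_per_source <- 66000 # illustrative value; actual ratios were tuned for balance
textSentences <- bind_rows(
slice_sample(twitterSentences, n = sentences_per_source),
slice_sample(blogsSentences, n = sentences_per_source),
slice_sample(newsSentences, n = sentences_per_source)
)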
# A tibble: 200,000 × 2
source sentences
<chr> <chr>
1 twitter Follow up, did you ever find that baby deer?
2 twitter Easy A+. c:
3 twitter haha
4 twitter If you could write a book, it would be about ____________ ?
5 twitter Les Créations de NARISAWA in Japan - reservations made
6 twitter Patrick’s Day!
7 twitter The 48 wins in Darlington.
8 twitter My apologize for the last post.
9 twitter followed by much destruction of dining furniture.
10 twitter Saved the American auto industry
# ℹ 199,990 more rows
How many minimum and maximum words should be entered by the user to trigger predictions?
We will limit ourselves to 0-3 words; if more than three words are entered, only the last three words will be taken into consideration.
Impact of this decision - We will tokenize our sentences further into 4-word tokens (quadgrams), 3-word tokens (trigrams), 2-word tokens (bigrams) and 1-word tokens (words).
The following transformation will automatically lowercase everything for us. For now we will convert everything to lowercase; we will revisit this decision when we try to improve our prediction model.
# Tokenizing to 4-word ngrams/quadgrams
quadgrams <- textSentences %>%
unnest_tokens(output = token,
input = sentences,
token = "ngrams",
n = 4L) %>%
mutate(ngram = "quadgram")
# Tokenizing to 3-word ngrams(trigrams)
trigrams <- textSentences %>%
unnest_tokens(output = token,
input = sentences,
token = "ngrams",
n = 3L) %>%
mutate(ngram = "trigram")
# Tokenizing to 2-word ngrams(bigrams)
bigrams <- textSentences %>%
unnest_tokens(output = token,
input = sentences,
token = "ngrams",
n = 2L) %>%
mutate(ngram = "bigram")
# Tokenizing to 1-word ngrams, which are just words
words <- textSentences %>%
unnest_tokens(output = token,
input = sentences) %>%
mutate(ngram = "word")These tokens contained some missing values, since there is a chance that the sentence would end before an n-gram could be formed
cat("Missing values for each type of n-gram:\n
1. Words - ", sum(is.na(words$token)), "out of ", nrow(words), "\n
2. Bigrams - ", sum(is.na(bigrams$token)), "out of ", nrow(bigrams), "\n
3. Trigrams - ", sum(is.na(trigrams$token)), "out of ", nrow(trigrams), "\n
4. Quadgrams - ", sum(is.na(quadgrams$token)), "out of ", nrow(quadgrams))Missing values for each type of n-gram:
1. Words - 0 out of 2620178
2. Bigrams - 9317 out of 2430431
3. Trigrams - 18623 out of 2249054
4. Quadgrams - 28864 out of 2077918
Since these missing values are so few compared to the amount of our data and they have a known pattern, they will be removed (a sketch of this step follows below).
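The removal itself is not shown as a separate chunk; a minimal sketch, assuming it is applied to each n-gram tibble before they are combined:
# Drop rows whose token is NA (the source sentence was too short to form the n-gram)
words <- filter(words, !is.na(token))
bigrams <- filter(bigrams, !is.na(token))
trigrams <- filter(trigrams, !is.na(token))
quadgrams <- filter(quadgrams, !is.na(token))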
All types of n-grams are combined into 1 data set.
The source and n-gram type were given their own columns and converted to factors, which will be useful during exploratory data analysis.
text_ngrams <- rbind(words, bigrams, trigrams, quadgrams)
text_ngrams <- text_ngrams %>%
mutate(source = factor(source),
ngram = factor(ngram, levels = c("word", "bigram", "trigram",
"quadgram")))
text_ngrams
# A tibble: 9,320,777 × 3
source token ngram
<fct> <chr> <fct>
1 twitter follow word
2 twitter up word
3 twitter did word
4 twitter you word
5 twitter ever word
6 twitter find word
7 twitter that word
8 twitter baby word
9 twitter deer word
10 twitter easy word
# ℹ 9,320,767 more rows
The number of lines in each text file and the number of characters in our sample were already shown during data processing.
To conduct a more unbiased analysis during the later stages of model building, the above-mentioned data set was divided into train (80%) and test (20%) sets.
This splitting was done to ensure both sets have the same proportion of data from each source as well as the same proportion of each n-gram type as the un-split data set.
if(system.file(package = "tidymodels") == "") install.packages("tidymodels")
library(tidymodels)
tidymodels_prefer()
# Getting indices for strata which combine both columns
temp_df <- text_ngrams %>%
select(source, ngram) %>%
group_by(source, ngram) %>%
group_indices() -> indeces
set.seed(536189)
trainSplit <- initial_split(cbind(text_ngrams, indeces),
prop = 0.8, strata = indeces)
# Defining the train set
train <- training(trainSplit)
# Defining the test set
test <- testing(trainSplit)All our exploratory analysis will be done on the train split of the data set
if(system.file(package = "ggplot2") == "") install.packages("ggplot2")
library(ggplot2)
# Calculate count of each unique token
train_count <- train %>%
count(ngram, token, sort = TRUE)
# Fig1 on relationship between type of n-gram and number of unique tokens
train_count %>%
group_by(ngram) %>%
summarise(length = length(token), max = max(n)) %>%
ggplot(aes(ngram, length, size = max, color = ngram)) +
geom_point() +
scale_color_brewer(palette = "Dark2", guide = "none") +
scale_size_continuous(range = c(10,30)) +
labs(x = NULL, y = "Number of unique n-grams",
size = "Count of\nmost frequent\nn-gram") +
theme_bw(base_size = 70)
Figure 1: Unique n-grams by n-gram order
Fig. 1 shows the relationship between the number of unique tokens for each n-gram type and the number of occurrences of the most frequent token (its count), represented by the size of the point.
Conclusions: Higher order n-grams have a larger number of unique n-grams, but their counts keep getting smaller by a large magnitude.
Thus, larger order n-grams might give smaller and smaller counts, and the probability of the next word (calculated from counts, as illustrated below) might also keep approaching zero, resulting in unnecessarily complex and time-consuming computations.
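To see why, note that the same count-based estimate introduced later in equation (1) extends to higher-order n-grams; for quadgrams, the counts involved quickly fall to 1 or 0, making the estimated probability of the next word unstable or near zero:
\[ Prob(word4|word1\,word2\,word3) = {Count(word4|word1\,word2\,word3)\over Count(word1\,word2\,word3)} \]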
library(ggplot2)
train_count %>%
ggplot(aes(n, fill = ngram)) +
geom_histogram(bins = 30) +
facet_wrap("ngram", nrow = 2) +
scale_fill_brewer(palette = "Dark2", guide = "none") +
labs(x = "Counts of each unique ngram",
y = "Count") +
theme_bw(base_size = 70) +
theme(plot.caption = element_text(face = "italic", hjust = 0))
Figure 2: Unique N-gram Count Distribution
Fig. 2 shows the distribution of counts of each unique n-gram by n-gram type.
Conclusion: There are a lot of n-grams which occur only once or a handful of times, and their share increases as we go to higher order n-grams.
This once again confirms what we saw in Fig. 1.
Let's take a closer look at these to understand their nature by finding the 10 most frequent and 10 least frequent tokens for each type of n-gram.
if(system.file(package = "cowplot") == "") install.packages("cowplot")
library(ggplot2)
library(cowplot)
train_count_wide <- train_count %>%
pivot_wider(names_from = ngram, values_from = n)
# Getting custom colors
library(RColorBrewer)
cols <- brewer.pal(4, "Dark2")
# Top 10 and Bottom 10 unigrams
plotA <- train_count_wide %>%
select(token, word) %>%
na.omit() %>%
arrange(desc(word)) %>%
slice(1:10,(length(token)-9):length(token)) %>%
mutate(token = reorder(token, word)) %>%
ggplot(aes(word, token)) +
geom_col(fill = cols[1]) +
labs(title = "Words", y = NULL, "Count") +
theme_bw(base_size = 70)
# Top 10 and Bottom 10 bigrams
plotB <- train_count_wide %>%
select(token, bigram) %>%
na.omit() %>%
arrange(desc(bigram)) %>%
slice(1:10,(length(token)-9):length(token)) %>%
mutate(token = reorder(token, bigram)) %>%
ggplot(aes(bigram, token)) +
geom_col(fill = cols[2]) +
labs(title = "Bigrams", y = NULL, x = "Count") +
theme_bw(base_size = 70)
# Top 10 and Bottom 10 trigrams
plotC <- train_count_wide %>%
select(token, trigram) %>%
na.omit() %>%
arrange(desc(trigram)) %>%
slice(1:10,(length(token)-9):length(token)) %>%
mutate(token = reorder(token, trigram)) %>%
ggplot(aes(trigram, token)) +
geom_col(fill = cols[3]) +
labs(title = "Trigrams", y = NULL, x = "Count") +
theme_bw(base_size = 70)
# Top 10 and Bottom 10 quadgrams
plotD <- train_count_wide %>%
select(token, quadgram) %>%
na.omit() %>%
arrange(desc(quadgram)) %>%
slice(1:10,(length(token)-9):length(token)) %>%
mutate(token = reorder(token, quadgram)) %>%
ggplot(aes(quadgram, token)) +
geom_col(fill = cols[4]) +
labs(title = "Quadgrams", y = NULL, x = "Count") +
theme_bw(base_size = 70)
# Top 10 and Bottom 10 n-gram plot
plot_grid(plotA, plotB, plotC, plotD, nrow = 2, labels = c("A", "B", "C", "D"),
label_size = 40)
Figure 3: 10 Most and 10 Least Frequent N-grams
Fig. 3 shows the 10 most and 10 least frequent tokens in each n-gram type.
Conclusion:
The most frequent tokens of every n-gram type are dominated by stop words such as the, to and and.
to and and are usually not words used to start sentences, so we will have to introduce some kind of placeholder to represent the beginning of a sentence to ensure the model can predict even when no text has been entered.
Decisions based on the above figures are listed further below, after the source comparison.
Time to revisit Decision 1, where we combined the text from all source types.
Let's check whether different sources use different proportions of words. Only the 1-word n-grams are used, since our aim is to check if different sources use different words and, in turn, different language morphology.
The stop words will be removed, since these will be the same across the sources and thus may drive the correlation test we will perform.
if(system.file(package = "scales") == "") install.packages("scales")
library(tidytext)
# Get frequency of each word by source
frequency <- train %>%
filter(ngram == "word") %>%
anti_join(stop_words, join_by(token == word)) %>%
count(source, token) %>%
group_by(source) %>%
mutate(proportion = n/sum(n)) %>%
select(-n) %>%
pivot_wider(names_from = source, values_from = proportion) %>%
pivot_longer('blogs':'news',
names_to = "source", values_to = "proportion")
library(scales)
# Plot and compare the frequency of these words from each source
ggplot(frequency, aes(x = proportion, y = twitter,
color = abs(twitter - proportion))) +
geom_abline(color = "gray40", lty = 2, lwd = 3) +
geom_jitter(alpha = 0.1, size = 15, width = 0.3, height = 0.3) +
geom_text(aes(label = token), check_overlap = TRUE, vjust = 1.5, size = 25) +
scale_x_log10(labels = percent_format()) +
scale_y_log10(labels = percent_format()) +
scale_color_gradient(limits = c(0,0.001),
low = "darkslategray4", high = "gray75") +
facet_wrap(~source, ncol = 2,
labeller = labeller(source = c(blogs = "Blogs", news = "News"))) +
labs(y = "Twitter", x = NULL) +
theme_bw(base_size = 90) +
theme(legend.position = "none")Figure 4: Correlation of unique words used in all 3 sources with numbers and punctuation
Fig. 4 compares the frequency (count of each unique word divided by the total number of words from that source) of words from each source. The closer the points are to the gray dashed line through the center, the more likely it is for the word to appear with similar frequency in both sources.
Things to note:
# Correlation between twitter and blogs
corr1 <- cor.test(data = frequency[frequency$source == "blogs",],
~ proportion + twitter)
# Correlation between twitter and news
corr2 <- cor.test(data = frequency[frequency$source == "news",],
~ proportion + twitter)
cat("The correlation between words used on twitter and \n
1. words used on Blogs = ", round(corr1$estimate, 3),
"(", round(corr1$conf.int, 3), ")\n
2. words used on News = ", round(corr2$estimate, 3),
"(", round(corr2$conf.int, 3), ")\n
which implies that words from twitter and blogs are 0.3 times more similar
to each other as compared to words from twitter and news")
The correlation between words used on twitter and
1. words used on Blogs = 0.713 ( 0.705 0.72 )
2. words used on News = 0.555 ( 0.545 0.565 )
which implies that words from twitter and blogs are 0.3 times more similar
to each other as compared to words from twitter and news
To check the impact of numbers and punctuation, we will remove them and repeat the correlation tests done above.
if(system.file(package = "stringr") == "") install.packages("stringr")
# Get frequency of each word by source with number and punctuations removed
library(stringr)
frequency <- train %>%
filter(ngram == "word") %>%
mutate(token = tolower(token)) %>%
anti_join(stop_words, join_by(token == word)) %>%
mutate(token = str_replace_all(token, "[:digit:]+|[:punct:]+", "")) %>%
count(source, token) %>%
group_by(source) %>%
mutate(proportion = n/sum(n)) %>%
select(-n) %>%
pivot_wider(names_from = source, values_from = proportion) %>%
pivot_longer('blogs':'news',
names_to = "source", values_to = "proportion")
# Plot and compare the frequency of these words from each source
ggplot(frequency, aes(x = proportion, y = twitter,
color = abs(twitter - proportion))) +
geom_abline(color = "gray40", lty = 2, lwd = 3) +
geom_jitter(alpha = 0.1, size = 15, width = 0.3, height = 0.3) +
geom_text(aes(label = token), check_overlap = TRUE, vjust = 1.5, size = 25) +
scale_x_log10(labels = percent_format()) +
scale_y_log10(labels = percent_format()) +
scale_color_gradient(limits = c(0,0.001),
low = "darkslategray4", high = "gray75") +
facet_wrap(~source, ncol = 2,
labeller = labeller(source = c(blogs = "Blogs", news = "News"))) +
labs(y = "Twitter", x = NULL) +
theme_bw(base_size = 90) +
theme(legend.position = "none")Figure 5: Correlation of unique words used in all 3 sources without numbers and punctuation
# Correlation between twitter and blogs
corr1 <- cor.test(data = frequency[frequency$source == "blogs",],
~ proportion + twitter)
# Correlation between twitter and news
corr2 <- cor.test(data = frequency[frequency$source == "news",],
~ proportion + twitter)
cat("The correlation between words used on twitter and \n
1. words used on Blogs = ", round(corr1$estimate, 3),
"(", round(corr1$conf.int, 3), ")\n
2. words used on News = ", round(corr2$estimate, 3),
"(", round(corr2$conf.int, 3), ")\n
which implies that once numbers and punctuation are removed, words from all
3 sources are almost the same, with only slight difference(0.05 times)")
The correlation between words used on twitter and
1. words used on Blogs = 0.909 ( 0.906 0.911 )
2. words used on News = 0.868 ( 0.864 0.871 )
which implies that once numbers and punctuation are removed, words from all
3 sources are almost the same, with only slight difference(0.05 times)
Conclusions: Fig. 4 & 5 and the associated correlation tests confirm that numbers and punctuation should be excluded to build a text prediction model which will perform well in all these source types
The inclusion of data from all sources, instead of just 1 or 2 is favored since we will be able to build a more generalized model instead of overfitting to one type of source.
Decisions based on the above figures and analysis:
Numbers will be removed, since they can take any value from -Inf to +Inf. Although this may prevent the model from predicting, let's say, am and pm after the user enters "let's meet at 5 …", we will come back to this if the prediction accuracy of our model is low.
A placeholder (sss) will be introduced to indicate sentence starts - to preserve the natural syntax of the language, which will help the model predict appropriate words accordingly.
The prediction model will be based on conditional probabilities of the form
\[ Prob(word2|word1) = {Count(word2|word1)\over Count(word1)} \tag{1} \]
i.e. the probability of word2 following word1 is equal to the number of times word2 appears after word1 divided by the number of times word1 appears. Splitting each bigram into word1 and word2 helps with calculating the above formula, and the model will suggest the most probable word2 by using equation (1).
if(system.file(package = "igraph") == "") install.packages("igraph")
if(system.file(package = "ggraph") == "") install.packages("ggraph")
if(system.file(package = "grid") == "") install.packages("grid")
# Bigram visualization
library(igraph)
bigram_graph <- bigram_counts_train %>%
filter(num > 500) %>%
graph_from_data_frame()
# Plotting the igraph
library(ggraph)
set.seed(2023)
a <- grid::arrow(type = "closed", length = unit(.50, "inches"))
ggraph(bigram_graph, layout = "fr") +
geom_edge_link(aes(edge_alpha = num), show.legend = FALSE,
arrow = a, end_cap = circle(.20, 'inches')) +
geom_node_point(color = "lightblue", size = 20) +
geom_node_text(aes(label = name), vjust = 1.4, hjust = 0.8, size = 25) +
scale_edge_alpha_continuous(range = c(0.6,1)) +
labs(caption = "Fig. 6") +
theme_void() +
theme(plot.caption = element_text(face = "italic", hjust = 0))
Figure 6: Markov chains of bi-grams from train data set
Fig. 6 visualizes the relationship between the first word and the second word of our bigrams. The tip of the arrow indicates word2, the start of the arrow indicates word1, and the opacity of the arrows represents the strength of the relationship. Only bigrams occurring more than 500 times are included in the graph to avoid overcrowding.
The paths laid out are known as Markov chains.
For example, one can start at any word and follow the arrows along a chain of frequent bigrams.
The model will utilize these Markov chains laid out in Fig. 6: given word1, it will trace along all the paths leading away from it and sort them in decreasing order of the probabilities calculated from equation (1).
Then the model will return the top 3 candidates for word2 (see the sketch below).
Hence, this idea can easily be extended to include higher order n-grams while still maintaining the simple structure shown in Fig. 6.
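To illustrate this lookup, here is a minimal sketch of such a prediction step. The function name predict_next and the word1/word2/num column layout follow the bigram-count sketch above; they are assumptions for illustration, not code from the report.
# Given the last word entered, return the 3 most probable next words,
# using equation (1): Count(word2|word1) / Count(word1)
predict_next <- function(last_word, bigram_counts, top_n = 3) {
bigram_counts %>%
filter(word1 == tolower(last_word)) %>%
mutate(prob = num / sum(num)) %>%
arrange(desc(prob)) %>%
slice_head(n = top_n) %>%
pull(word2)
}
# Example usage: predict_next("the", bigram_counts_train)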
In order to increase the accuracy of the model, the following entry points should be noted:
Tokens such as ive in Fig. 6 can be dealt with specifically by converting them to I have (see the sketch below). Stemming of the text data is one way to do this, where words are included only if they are part of the words included in a data set specifically built for this purpose.
Even after tuning for these variables, if the model accuracy remains low, another model type will be chosen, most likely involving some sort of deep learning model.
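As one example of such targeted cleanup, a small lookup table of informal tokens could be applied to the training tokens before counting. The table below is hypothetical and illustrative only, not taken from the report.
library(stringr)
# Hypothetical normalization table; extend as needed
informal_fixes <- c("\\bive\\b" = "i have", "\\bim\\b" = "i am", "\\bdont\\b" = "do not")
train <- train %>%
mutate(token = str_replace_all(token, informal_fixes))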
Written in an R Markdown file with R version 4.3.1 (2023-06-16 ucrt) using the RStudio IDE
Packages used for this report,