Introduction
This project explores the SwiftKey English-language text dataset, which contains text from three sources: blogs, news articles, and Twitter posts. The goal is to understand the structure and basic characteristics of the data before developing a next-word prediction model.
To keep the analysis computationally manageable, a random sample of 10,000 lines was taken from each source, giving a combined sample of 30,000 text records. The analysis covers line counts for the full files, plus word frequencies and common word sequences (bigrams and trigrams) in the sample that may be useful for building a predictive text application.
blogs <- readLines("en_US.blogs.txt", warn = FALSE, encoding = "UTF-8")
news <- readLines("en_US.news.txt", warn = FALSE, encoding = "UTF-8")
twitter <- readLines("en_US.twitter.txt", warn = FALSE, encoding = "UTF-8")
Loading and Sampling the Data
The three English-language datasets were loaded into R. A random sample was then selected from each source using a fixed seed to ensure reproducibility.
set.seed(123)
sample_data <- c(
  sample(blogs, 10000),
  sample(news, 10000),
  sample(twitter, 10000)
)
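A call such as sample(x, 10000) stops with an error if a source has fewer than 10,000 lines. The sketch below shows one way to guard against that. The helper name sample_lines is hypothetical and not part of the original analysis.

```r
# Hypothetical helper: draw up to n lines without replacement.
sample_lines <- function(x, n) {
  # Cap the sample size at the number of available lines
  x[sample.int(length(x), min(n, length(x)))]
}

set.seed(123)
toy <- c("first line", "second line", "third line")
picked <- sample_lines(toy, 10)  # asks for more lines than exist
length(picked)                   # 3: capped at the available lines
```

For the three full source files this behaves like the sampling code above, since each file has far more than 10,000 lines.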
Data Cleaning
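The sampled text was cleaned before analysis. The following preprocessing steps were applied:
- Converted all text to lowercase
- Removed punctuation
- Removed numbers
- Removed extra whitespace (a by-product of tokenization)
Common English stop words were not removed: the frequency tables in this report are dominated by words such as "the", "to", and "and", which a next-word predictor needs to be able to suggest.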
library(quanteda)
# Create quanteda corpus
corp <- corpus(sample_data)
# Tokenize + clean in ONE step
tokens_all <- tokens(
  corp,
  remove_punct = TRUE,
  remove_numbers = TRUE
)
# Convert to lowercase
tokens_all <- tokens_tolower(tokens_all)
After cleaning, the text was tokenized and transformed into unigrams, bigrams, and trigrams.
# Build document-feature matrices
unigrams_dfm <- dfm(tokens_all)
bigrams_dfm <- dfm(tokens_ngrams(tokens_all, n = 2))
trigrams_dfm <- dfm(tokens_ngrams(tokens_all, n = 3))
# Build frequency tables
unigram_freq <- sort(colSums(unigrams_dfm), decreasing = TRUE)
bigram_freq <- sort(colSums(bigrams_dfm), decreasing = TRUE)
trigram_freq <- sort(colSums(trigrams_dfm), decreasing = TRUE)
# Remove very rare features (frequency < 2)
unigram_freq <- unigram_freq[unigram_freq >= 2]
bigram_freq <- bigram_freq[bigram_freq >= 2]
trigram_freq <- trigram_freq[trigram_freq >= 2]
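# Make sure n-gram names are underscore-joined
# (tokens_ngrams() already uses "_" as its default concatenator, so this is a safeguard)
names(bigram_freq) <- gsub(" ", "_", names(bigram_freq))
names(trigram_freq) <- gsub(" ", "_", names(trigram_freq))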
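To make the n-gram step concrete, the base R sketch below joins consecutive tokens with an underscore, giving names in the same form as the frequency tables in this report. It only illustrates the idea. The analysis itself relies on quanteda's tokens_ngrams(), and the helper name make_ngrams is hypothetical.

```r
# Hypothetical helper: build n-grams from a character vector of tokens.
make_ngrams <- function(tokens, n) {
  # Fewer tokens than n means no n-gram can be formed
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(
    starts,
    function(i) paste(tokens[i:(i + n - 1)], collapse = "_"),
    character(1)
  )
}

toks <- c("one", "of", "the", "best")
bigrams_toy  <- make_ngrams(toks, 2)  # "one_of" "of_the" "the_best"
trigrams_toy <- make_ngrams(toks, 3)  # "one_of_the" "of_the_best"
```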
Summary of the Data
The table below summarizes the number of lines in each source file.
data.frame(
  Source = c("Blogs", "News", "Twitter"),
  Lines = c(length(blogs), length(news), length(twitter))
)
## Source Lines
## 1 Blogs 899288
## 2 News 1010206
## 3 Twitter 2360148
This table shows the relative size of each text source before sampling. Twitter contributes the most lines (about 2.36 million), more than blogs and news combined.
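Line counts alone do not show how much text each source holds. The sketch below approximates word counts by splitting each line on whitespace. The helper name word_stats is hypothetical, and the counts are rough because no cleaning is applied first.

```r
# Hypothetical helper: approximate word counts by splitting on whitespace.
word_stats <- function(lines) {
  # trimws() prevents a leading space from producing an empty first "word"
  counts <- lengths(strsplit(trimws(lines), "\\s+"))
  c(
    lines = length(lines),
    words = sum(counts),
    mean_words_per_line = mean(counts)
  )
}

toy <- c("the quick brown fox", "hello world", "  spaced   out  ")
stats <- word_stats(toy)
stats
```

Applied to the real data, the same helper could be called as word_stats(blogs), word_stats(news), and word_stats(twitter).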
Most Frequent Words
Unigram frequencies were calculated from the cleaned corpus.
head(
  data.frame(
    Word = names(unigram_freq),
    Frequency = as.numeric(unigram_freq)
  ),
  20
)
## Word Frequency
## 1 the 44172
## 2 to 24147
## 3 and 22869
## 4 a 21146
## 5 of 18719
## 6 in 14997
## 7 i 13388
## 8 that 9550
## 9 for 9207
## 10 is 8819
## 11 it 7872
## 12 on 6838
## 13 with 6568
## 14 you 6484
## 15 was 5956
## 16 at 4800
## 17 be 4755
## 18 this 4707
## 19 my 4699
## 20 have 4553
The table above displays the twenty most frequently occurring words in the sampled dataset.
Top 20 Words
library(ggplot2)
top20_uni <- head(unigram_freq, 20)
ggplot(
  data.frame(
    Word = factor(names(top20_uni),
                  levels = rev(names(top20_uni))),
    Frequency = as.numeric(top20_uni)
  ),
  aes(Word, Frequency)
) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(title = "Top 20 Most Frequent Words")
The figure provides a visual representation of the most common words observed in the sample.
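A related way to describe the word distribution is to ask how many of the most frequent words are needed to cover a given share of all word occurrences. The sketch below uses made-up counts, and the helper name words_for_coverage is hypothetical. With the real data it could be called on unigram_freq, bearing in mind that words seen only once were already dropped from that table.

```r
# Hypothetical helper: number of top-ranked words needed to reach a coverage target.
words_for_coverage <- function(freq, target) {
  freq <- sort(freq, decreasing = TRUE)
  cum_share <- cumsum(freq) / sum(freq)
  # Index of the first word at which the cumulative share reaches the target
  unname(which(cum_share >= target)[1])
}

# Made-up counts, for illustration only
toy_freq <- c(the = 50, to = 25, and = 15, cat = 6, dog = 4)
half  <- words_for_coverage(toy_freq, 0.5)  # 1: "the" alone is 50 of 100
most  <- words_for_coverage(toy_freq, 0.8)  # 3: 50 + 25 + 15 = 90 of 100
```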
Most Frequent Bigrams
Bigrams are sequences of two consecutive words.
head(
  data.frame(
    Bigram = names(bigram_freq),
    Frequency = as.numeric(bigram_freq)
  ),
  20
)
## Bigram Frequency
## 1 of_the 4080
## 2 in_the 3956
## 3 to_the 1978
## 4 on_the 1738
## 5 for_the 1688
## 6 to_be 1431
## 7 at_the 1229
## 8 and_the 1211
## 9 in_a 1081
## 10 with_the 986
## 11 it_was 923
## 12 is_a 915
## 13 from_the 856
## 14 for_a 809
## 15 i_was 778
## 16 of_a 762
## 17 with_a 759
## 18 i_have 736
## 19 and_i 736
## 20 it_is 718
Top 20 Bigrams
top20_bi <- head(bigram_freq, 20)
ggplot(
  data.frame(
    Bigram = factor(names(top20_bi),
                    levels = rev(names(top20_bi))),
    Frequency = as.numeric(top20_bi)
  ),
  aes(Bigram, Frequency)
) +
  geom_col(fill = "darkgreen") +
  coord_flip() +
  labs(title = "Top 20 Most Frequent Bigrams")
These word pairs provide additional context beyond individual words and may be useful for predicting the next word in a sequence.
Most Frequent Trigrams
Trigrams are sequences of three consecutive words.
head(
  data.frame(
    Trigram = names(trigram_freq),
    Frequency = as.numeric(trigram_freq)
  ),
  20
)
## Trigram Frequency
## 1 one_of_the 320
## 2 =_=_= 261
## 3 a_lot_of 251
## 4 out_of_the 154
## 5 it_was_a 148
## 6 the_end_of 141
## 7 going_to_be 139
## 8 to_be_a 138
## 9 as_well_as 131
## 10 be_able_to 125
## 11 some_of_the 122
## 12 this_is_a 122
## 13 part_of_the 112
## 14 i_want_to 111
## 15 a_couple_of 106
## 16 the_rest_of 101
## 17 i_have_to 99
## 18 end_of_the 98
## 19 i_have_a 96
## 20 in_the_first 95
Top 20 Trigrams
top20_tri <- head(trigram_freq, 20)
ggplot(
  data.frame(
    Trigram = factor(names(top20_tri),
                     levels = rev(names(top20_tri))),
    Frequency = as.numeric(top20_tri)
  ),
  aes(Trigram, Frequency)
) +
  geom_col(fill = "darkred") +
  coord_flip() +
  labs(title = "Top 20 Most Frequent Trigrams")
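Trigrams capture longer patterns of language and may improve prediction accuracy by incorporating more context. The second-ranked entry, =_=_=, is a run of symbols rather than words: the tokenizer call removes punctuation and numbers but not symbols, so entries like this need to be filtered out before the table is used for prediction.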
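One way to drop entries such as =_=_= is to keep only n-grams whose parts consist of letters and apostrophes. The sketch below is a base R illustration on made-up input. The helper name keep_word_ngrams is hypothetical, and the pattern assumes lowercase ASCII text, so it would also discard tokens containing accented letters or digits.

```r
# Hypothetical helper: keep only n-grams whose parts are lowercase letters or apostrophes.
keep_word_ngrams <- function(freq) {
  freq[grepl("^[a-z']+(_[a-z']+)*$", names(freq))]
}

# Counts copied from the top of the trigram table, for illustration
toy_tri <- setNames(c(320, 261, 251), c("one_of_the", "=_=_=", "a_lot_of"))
cleaned <- keep_word_ngrams(toy_tri)
names(cleaned)  # "one_of_the" "a_lot_of"
```

An alternative, if the tokenization step is rerun, is to pass remove_symbols = TRUE to quanteda's tokens(), which would change the counts reported here.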
Code:
# Word lists used by the hand-written filtering rules below
pronouns <- c("you", "we", "they", "i", "he", "she", "it")
# verbs is defined for later rules but is not used by the function yet
verbs <- c("are", "is", "was", "were", "have", "do", "make", "go", "get", "know", "think")
stop_junk <- c("said", "will", "one", "new", "like", "just", "get", "go", "can", "say")

# Predict the next word by backing off from trigrams to bigrams to unigrams.
# The seed argument is not used inside the function; call set.seed() beforehand
# to make the random unigram fallback reproducible.
predict_next_word <- function(text_input, seed = 123) {
  text_input <- tolower(as.character(text_input))
  # Strip punctuation with base R, so no extra package is required
  text_input <- gsub("[[:punct:]]", "", text_input)
  words <- unlist(strsplit(text_input, " "))
  words <- words[words != ""]
  n <- length(words)

  # ---------------- TRIGRAM ----------------
  if (n >= 2) {
    prefix <- paste0(words[n - 1], "_", words[n], "_")
    # startsWith() matches the prefix literally, with no regex interpretation
    candidates <- trigram_freq[startsWith(names(trigram_freq), prefix)]
    if (length(candidates) > 0) {
      best <- names(sort(candidates, decreasing = TRUE))
      for (b in best) {
        next_word <- strsplit(b, "_")[[1]][3]
        if (next_word %in% stop_junk) next   # skip excluded words
        if (next_word == words[n]) next      # avoid repeating the last word
        # Special case: after "how are", only accept a pronoun
        if (paste(words[n - 1], words[n]) == "how are" && !(next_word %in% pronouns)) next
        return(next_word)
      }
    }
  }

  # ---------------- BIGRAM ----------------
  if (n >= 1) {
    prefix <- paste0(words[n], "_")
    candidates <- bigram_freq[startsWith(names(bigram_freq), prefix)]
    if (length(candidates) > 0) {
      best <- names(sort(candidates, decreasing = TRUE))
      for (b in best) {
        next_word <- strsplit(b, "_")[[1]][2]
        if (next_word %in% stop_junk) next   # skip excluded words
        if (next_word == words[n]) next      # avoid repeating the last word
        return(next_word)
      }
    }
  }

  # ---------------- UNIGRAM ----------------
  # No n-gram match: pick at random among the 5th to 50th most frequent words
  unigram_sorted <- sort(unigram_freq, decreasing = TRUE)
  return(sample(names(unigram_sorted)[5:50], 1))
}
Preliminary Prediction Model
A simple prediction function was created using unigram, bigram, and trigram frequency tables.
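The model first looks for trigrams that begin with the last two words entered and returns the most frequent continuation that passes a few hand-written filters, such as a short list of excluded words. If no trigram qualifies, it repeats the search in the bigram table using only the last word. If that also fails, it falls back to a word drawn at random from the 5th to 50th most frequent unigrams.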
Example predictions generated by the current model are shown below.
set.seed(123)
predict_next_word("I love")
## [1] "you"
predict_next_word("how are")
## [1] "you"
predict_next_word("the weather")
## [1] "is"
predict_next_word("going to")
## [1] "be"
predict_next_word("thank you")
## [1] "for"
predict_next_word("what is")
## [1] "the"
These examples show that the model can generate plausible candidate next words using patterns learned from the sampled corpus.
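A handful of hand-picked phrases cannot measure accuracy. One possible approach, sketched below under simplifying assumptions, is to hold out phrases, hide the last word of each, and count how often the prediction matches it. The helper top1_accuracy and the stand-in predictor always_you are hypothetical. In practice predict_next_word would be passed in, and the test phrases would come from text that was not used to build the frequency tables.

```r
# Hypothetical helper: share of phrases whose final word is predicted correctly.
top1_accuracy <- function(phrases, predict_fn) {
  hits <- vapply(phrases, function(p) {
    words <- strsplit(p, " ")[[1]]
    n <- length(words)
    context <- paste(words[-n], collapse = " ")  # everything except the last word
    identical(predict_fn(context), words[n])
  }, logical(1))
  mean(hits)
}

# Stand-in predictor so the sketch runs on its own
always_you <- function(context) "you"

held_out <- c("how are you", "thank you for", "i love you")
acc <- top1_accuracy(held_out, always_you)  # 2 of 3 phrases end in "you"
```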
Future Work
The next stage of the project will focus on improving the prediction model and deploying it as a Shiny application.
Planned enhancements include:
- Refining the n-gram prediction strategy
- Improving handling of previously unseen word combinations
- Evaluating prediction accuracy
- Creating an interactive Shiny interface for real-time next-word prediction
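For unseen word combinations, one commonly used option is Stupid Backoff scoring: use the trigram relative frequency when the trigram was observed, otherwise fall back to the bigram and then the unigram, multiplying by a fixed factor at each step. The sketch below is an illustration on made-up counts, not the method used in this report. The function name stupid_backoff, the toy tables, and the factor alpha = 0.4 (a commonly used value) are all assumptions.

```r
# Toy frequency tables (made-up counts, for illustration only)
tri <- c(how_are_you = 8, how_are_things = 2)
bi  <- c(are_you = 20, are_we = 10, how_are = 12)
uni <- c(you = 100, are = 60, we = 40, how = 30, things = 5)

# Hypothetical scorer: Stupid Backoff with a fixed factor alpha per backoff level.
stupid_backoff <- function(w1, w2, candidate, alpha = 0.4) {
  tri_key <- paste(w1, w2, candidate, sep = "_")
  ctx_key <- paste(w1, w2, sep = "_")
  # Trigram seen: relative frequency given the two-word context
  if (tri_key %in% names(tri) && ctx_key %in% names(bi)) {
    return(tri[[tri_key]] / bi[[ctx_key]])
  }
  bi_key <- paste(w2, candidate, sep = "_")
  # Back off once: bigram relative frequency, discounted by alpha
  if (bi_key %in% names(bi) && w2 %in% names(uni)) {
    return(alpha * bi[[bi_key]] / uni[[w2]])
  }
  # Back off twice: unigram share, discounted by alpha squared
  count <- if (candidate %in% names(uni)) uni[[candidate]] else 0
  alpha^2 * count / sum(uni)
}

seen_score    <- stupid_backoff("how", "are", "you")  # 8 / 12
backoff_score <- stupid_backoff("how", "are", "we")   # 0.4 * 10 / 60
```

The scores are not probabilities and do not sum to one, but they can be used to rank candidate words without special-case rules.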
Conclusion
This exploratory analysis successfully loaded, cleaned, and analyzed the SwiftKey text data. Frequency analysis of words, bigrams, and trigrams provides a foundation for developing a predictive text model. The results from this analysis will be used to guide the design of the final prediction algorithm and Shiny application.