Introduction
This report documents the exploratory analysis I conducted on the raw text data, to better understand it and to begin developing a strategy for building a text prediction model that predicts the next word a user will type based on her/his previous input.
Raw Data Overview
The training data is provided by SwiftKey. I chose to use the English versions of the Twitter, news, and blog text files.
Let’s first load the source files and convert them into a corpus object using the quanteda package in R. The summary of the three .txt files is below:
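The corpus construction is sketched below (a minimal illustration, not necessarily the exact code used; it assumes the files sit under /home/roger/NLP-R/Data/ as in the later chunks):

library(quanteda)
library(readr)

# Sketch: read each source file into one long document, build a corpus whose
# document names come from the file names, then summarize it.
src_files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")
src_texts <- sapply(src_files, function(f)
  paste(read_lines(file.path("/home/roger/NLP-R/Data", f)), collapse = "\n"))
full_corpus <- corpus(src_texts)
summary(full_corpus)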
## Corpus consisting of 3 documents:
##
## Text Types Tokens Sentences
## en_US.blogs.txt 482484 42840192 2072941
## en_US.news.txt 431667 39918317 1867522
## en_US.twitter.txt 566995 36719702 2588548
##
## Source: /home/roger/NLP-R/* on x86_64 by roger
## Created: Wed Sep 25 14:12:23 2019
## Notes:
Since the raw data consists of a large amount of information - millions of sentences and tens of millions of tokens - it would take large resources and a long time to process it all at once. For the purposes of exploratory data analysis, only 1% of randomly sampled data from each of the three text files is used, for practical reasons: fast processing while retaining enough information to find patterns.
Once a good strategy for cleaning/processing the data and for constructing the text prediction model is developed, a greater portion of the raw data will be used/revisited as needed later.
Exploratory Analysis
Twitter Data
Start by looking into the Twitter file: read the lines from the .txt file into a data.table and randomly draw 1% of the lines for the analysis.
twit.dt <- as.data.table(read_lines(file = "/home/roger/NLP-R/Data/en_US.twitter.txt"))
set.seed(95130)
samp_twit <- twit.dt[sample(.N, round(.N * 0.01))]
Then create a corpus from the sampled Twitter data and see what the most frequent tokens are.
twit_features <- samp_twit[, V1] %>%
corpus() %>%
dfm() %>%
textstat_frequency() %>%
setDT()
twit_features[, 1:2]
## feature frequency
## 1: . 25177
## 2: ! 12619
## 3: the 9258
## 4: to 7776
## 5: , 7456
## ---
## 27276: ankel 1
## 27277: sports-_ 1
## 27278: kassim 1
## 27279: tp 1
## 27280: #iamamentor 1
At first glance, the sampled Twitter data has 27280 unique features/tokens before any processing/trimming/stemming is performed. A closer look at the features is required to determine the strategy for cleaning up the text data.
It’s easy to notice that the following features should be removed:
- punctuation
- numbers
- emojis
- foreign characters
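The exact filter that produced the listing below is not shown; one possible sketch is to keep only features that contain no Latin letters at all:

# Sketch (assumed filter): features with no Latin letters, which captures
# punctuation-only tokens, symbols, and emojis in the raw frequency table.
twit_features[!grepl("[A-Za-z]", feature)]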
## feature frequency rank docfreq group
## 1: . 25177 1 12340 all
## 2: ! 12619 2 7262 all
## 3: , 7456 5 5416 all
## 4: ? 4179 10 3289 all
## 5: : 4041 11 3576 all
## ---
## 172: 😢 1 11085 1 all
## 173: 🍆 1 11085 1 all
## 174: 💏 1 11085 1 all
## 175: 🚼 1 11085 1 all
## 176: 🐬 1 11085 1 all
Additionally, common English stopwords, URLs, Twitter characters, and hyphens will also be removed, and trimming will be applied.
twit_features <- samp_twit[, V1] %>%
corpus() %>%
dfm(
tolower = TRUE,
remove = stopwords("english"),
stem = FALSE,
remove_punct = TRUE,
remove_numbers = TRUE,
remove_twitter = TRUE,
remove_url = TRUE,
remove_symbols = TRUE,
remove_hyphens = TRUE
) %>%
dfm(remove = "^[^a-zA-Z]$", valuetype = "regex") %>% # Remove non-english single char
textstat_frequency() %>%
setDT()
We will then follow a similar approach to analyze/process the news and blog text data.
news.dt <- as.data.table(read_lines(file = "/home/roger/NLP-R/Data/en_US.news.txt"))
blog.dt <- as.data.table(read_lines(file = "/home/roger/NLP-R/Data/en_US.blogs.txt"))
set.seed(95130)
samp_news <- news.dt[sample(.N, round(.N * 0.01))]
samp_blog <- blog.dt[sample(.N, round(.N * 0.01))]
news_features <- samp_news[, V1] %>%
corpus() %>%
dfm(
tolower = TRUE,
remove = stopwords("english"),
stem = FALSE,
remove_punct = TRUE,
remove_numbers = TRUE,
remove_twitter = TRUE,
remove_url = TRUE,
remove_symbols = TRUE,
remove_hyphens = TRUE
) %>%
dfm(remove = "^[^a-zA-Z]$", valuetype = "regex") %>% # Remove non-english single char
textstat_frequency() %>%
setDT()
blog_features <- samp_blog[, V1] %>%
corpus() %>%
dfm(
tolower = TRUE,
remove = stopwords("english"),
stem = FALSE,
remove_punct = TRUE,
remove_numbers = TRUE,
remove_twitter = TRUE,
remove_url = TRUE,
remove_symbols = TRUE,
remove_hyphens = TRUE
) %>%
dfm(remove = "^[^a-zA-Z]$", valuetype = "regex") %>% # Remove non-english single char
textstat_frequency() %>%
setDT()
Visualizing the features/tokens
As the last step of the initial exploratory analysis, we will visualize the top 50 features from each of the three data sets.
p1 <-
ggplot(twit_features[1:50, ], aes(x = reorder(feature, -frequency), y = frequency)) +
geom_point() +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
labs(title = "Top 50 Most Frequent Features of US Twitter Data (1% Sample)",
x = "feature", y = "frequency")
p2 <-
ggplot(news_features[1:50, ], aes(x = reorder(feature, -frequency), y = frequency)) +
geom_point() +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
labs(title = "Top 50 Most Frequent Features of US News Data (1% Sample)",
x = "feature", y = "frequency")
p3 <-
ggplot(blog_features[1:50, ], aes(x = reorder(feature, -frequency), y = frequency)) +
geom_point() +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
labs(title = "Top 50 Most Frequent Features of US Blog Data (1% Sample)",
x = "feature", y = "frequency")
grid.arrange(p1, p2, p3, ncol = 1)
Finally, let's see how the features look when all three data sets are combined.
combined_features <- rbindlist(list(twit_features[, 1:2], blog_features[, 1:2], news_features[, 1:2]))[, lapply(.SD, sum, na.rm = TRUE), by = feature]
ggplot(combined_features[1:50, ], aes(x = reorder(feature, -frequency), y = frequency)) +
geom_point() +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
labs(title = "Top 50 Most Frequent Features of Combined Data (1% Sample)",
x = "feature", y = "frequency") Word cloud of top 200 features from all three source data combined.
topfeatures <- head(combined_features[order(-frequency)],200)
wordcloud(words = topfeatures[,feature],
freq = topfeatures[,frequency],
colors = brewer.pal(6,"Dark2"),
random.order = FALSE)
N-Gram Modeling
N-grams can be created easily using the same process as the feature creation above, with minor modifications to the dfm() call from the quanteda package.
We will first write a function to create n-grams.
createNG <- function(dt, n = 1L) {
ng <- dt[, V1] %>%
corpus() %>%
dfm(
tolower = TRUE,
remove = stopwords("english"),
stem = FALSE,
remove_punct = TRUE,
remove_numbers = TRUE,
remove_twitter = TRUE,
remove_url = TRUE,
remove_symbols = TRUE,
remove_hyphens = TRUE,
ngrams = n
) %>%
dfm(remove = "^[^a-zA-Z]$", valuetype = "regex") %>% # Remove non-english single char
textstat_frequency() %>%
setDT()
ng[,ngram := n]
return(ng[,c("feature", "frequency", "ngram")])
}
We will then create 2-gram, 3-gram, 4-gram, and 5-gram models for each of the three data sets.
twit.ng2 <- createNG(samp_twit,2)
twit.ng3 <- createNG(samp_twit,3)
twit.ng4 <- createNG(samp_twit,4)
twit.ng5 <- createNG(samp_twit,5)
news.ng2 <- createNG(samp_news,2)
news.ng3 <- createNG(samp_news,3)
news.ng4 <- createNG(samp_news,4)
news.ng5 <- createNG(samp_news,5)
blog.ng2 <- createNG(samp_blog,2)
blog.ng3 <- createNG(samp_blog,3)
blog.ng4 <- createNG(samp_blog,4)
blog.ng5 <- createNG(samp_blog,5)
Then combine them into a single table.
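The combining code is not shown above; a minimal sketch, following the same rbindlist() aggregation used earlier for combined_features, would build the combined_ng table used in the plots below:

# Sketch: stack all n-gram tables, sum frequencies of identical features
# within each n-gram order, and sort by frequency within each order.
combined_ng <- rbindlist(list(
  twit.ng2, twit.ng3, twit.ng4, twit.ng5,
  news.ng2, news.ng3, news.ng4, news.ng5,
  blog.ng2, blog.ng3, blog.ng4, blog.ng5
))[, .(frequency = sum(frequency, na.rm = TRUE)), by = .(feature, ngram)]
setorder(combined_ng, ngram, -frequency)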
Visualize N-grams
For demonstration purposes, the top 25 most frequent 2-grams and 3-grams are plotted below.
ggplot(combined_ng[ngram==2, head(.SD, 25)], aes(x = reorder(feature, frequency), y = frequency)) +
geom_point() +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
labs(
title = "Top 25 Frequncy of 2-grams from US Combined Data (1% Sample)",
x = "feature", y = "frequency"
) +
coord_flip()
ggplot(combined_ng[ngram==3, head(.SD, 25)], aes(x = reorder(feature, frequency), y = frequency)) +
geom_point() +
labs(
title = "Top 25 Frequncy of 3-grams from US Combined Data (1% Sample)",
x = "feature", y = "frequency"
)+
coord_flip()
Next Steps
- Study various smoothing methods and choose the most appropriate one to apply to the n-gram model
- Experiment with various sample data sizes (e.g. 1% > 2% > 5% > 10%?) to see if there is an improvement in prediction accuracy. If so, rebuilding the n-gram models with a larger data set will be necessary
- Build a Shiny app that makes next word suggestions as the user types text into an input bar (see the sketch after this list)
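As a rough illustration of the lookup the Shiny app would eventually perform, the sketch below suggests next words from the 2-gram table. It assumes quanteda's "_"-separated n-gram features and the combined_ng table built above; predict_next is a hypothetical helper, not part of the final model:

# Sketch only: return the most frequent continuations of a single word by
# matching "word_" prefixes among the 2-gram features.
predict_next <- function(word, ng_table = combined_ng, top_n = 3) {
  cand <- ng_table[ngram == 2 & grepl(paste0("^", tolower(word), "_"), feature)]
  head(sub("^[^_]+_", "", cand[order(-frequency), feature]), top_n)
}
predict_next("happy")   # e.g. the three most frequent words seen after "happy"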
References
- Speech and Language Processing. Daniel Jurafsky & James H. Martin. [https://lagunita.stanford.edu/c4x/Engineering/CS-224N/asset/slp4.pdf]
- Quanteda Quick Start Guide [https://quanteda.io/articles/quickstart.html]
- Frequently Asked Questions about data.table [https://cran.r-project.org/web/packages/data.table/vignettes/datatable-faq.html]