The key partners for this project are SwiftKey and Coursera.
The project explores the Natural Language Processing facet of Data Science, where a large corpus of text documents is used to predict the next word from preceding input.
The goal of this milestone report is to present exploratory insights obtained while working with the provided Capstone Dataset and to show that the project is on track to create a prediction algorithm.
This report, published on RPubs, explains the exploratory analysis and the goals for the eventual app and algorithm.
It covers only the major features of the data identified so far and gives a brief summary of the plans for creating the prediction algorithm and Shiny app.
Exploration was restricted to the en_US corpus.
In conclusion, suggestions and comments from the upcoming peer review are welcome.
# Libraries and options
library(stringi) # For string processing and examination.
library(caTools) # Provides sample.split() for the train/validation split.
source('./R.script.sources/funs_dataprep.R') # Data-preparation helpers (listed at the end of this report).
source('./R.script.sources/funs_display.R')  # Display helpers (listed at the end of this report).
quanteda_options(threads = 8) # quanteda (loaded by the sourced scripts) may use up to 8 threads.
Only the en_US files were used for exploratory work and were downloaded and stored locally for convenience.
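The data can be fetched with a short, one-off download step. The sketch below is illustrative only: the URL is the course-provided link at the time of writing, and it assumes the extracted en_US folder is moved to ./data/en_US/ to match the file paths used in the rest of this report.
# Sketch of the one-off download step (assumed layout; adjust paths as needed).
if (!dir.exists('./data/en_US')) {
  dir.create('./data', showWarnings = FALSE)
  zipfile <- './data/Coursera-SwiftKey.zip'
  download.file('https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip',
                destfile = zipfile, mode = "wb")
  unzip(zipfile, exdir = './data')
  # The archive unpacks into ./data/final/; move the en_US folder to match the paths below.
  file.rename('./data/final/en_US', './data/en_US')
}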
# Read in data
blogs <- readLines('./data/en_US/en_US.blogs.txt', encoding= "UTF-8", skipNul= T)
news <- readLines('./data/en_US/en_US.news.txt', encoding= "UTF-8", skipNul= T)
twitter <- readLines('./data/en_US/en_US.twitter.txt', encoding= "UTF-8", skipNul= T)
The data sets were examined and tabulated to give a sense of the data.
The intent here was to capture, for each file, its size, the number of lines, characters, and words, and the words-per-line statistics.
fun.sumtab(blogs, news, twitter)
| Dataset | SizeMB | Lines | Chars | CharsWhite | Words | WPLmin | WPLmean | WPLmax |
|---|---|---|---|---|---|---|---|---|
| blogs | 255.35 | 899,288 | 206,824,382 | 42,636,700 | 37,570,839 | 0 | 41.75 | 6,726 |
| news | 19.77 | 77,259 | 15,639,408 | 3,096,618 | 2,651,432 | 1 | 34.62 | 1,123 |
| twitter | 318.99 | 2,360,148 | 162,096,241 | 35,958,529 | 30,451,170 | 1 | 12.75 | 47 |
The initial examination showed that the data objects were very large (up to 320 MB).
Surprisingly, the twitter data was the largest object but held fewer words than the blogs data.
The statistics also show words per line ranging from 0 to over 6,700 across the data sets.
White space was also significant, accounting for roughly 20% of the characters in each dataset.
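As a quick check of this figure, the ratio of the CharsWhite and Chars columns can be computed directly from the values in the summary table above:
# Whitespace share of total characters, using the Chars and CharsWhite values tabulated above.
chars       <- c(blogs = 206824382, news = 15639408, twitter = 162096241)
chars_white <- c(blogs = 42636700,  news = 3096618,  twitter = 35958529)
round(100 * chars_white / chars, 1) # roughly 20.6, 19.8 and 22.2 percent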
Given the size of the data, further exploratory work was carried out on smaller samples (1%) to reduce computing resources requirements.
# Sample and combine data.
set.seed(1234) # Repeatable sampling.
samplesize <- 0.01
blogs_smp <- sample(blogs, length(blogs) * samplesize)
news_smp <- sample(news, length(news) * samplesize)
twitter_smp <- sample(twitter, length(twitter) * samplesize)
combined <- c(blogs_smp, news_smp, twitter_smp)
### Summary of samples
fun.sumtab(blogs_smp, news_smp, twitter_smp)
| Dataset | SizeMB | Lines | Chars | CharsWhite | Words | WPLmin | WPLmean | WPLmax |
|---|---|---|---|---|---|---|---|---|
| blogs_smp | 2.55 | 8,992 | 2,068,110 | 428,117 | 377,158 | 1 | 41.79 | 677 |
| news_smp | 0.20 | 772 | 156,344 | 31,021 | 26,624 | 1 | 34.74 | 195 |
| twitter_smp | 3.23 | 23,601 | 1,620,028 | 359,400 | 303,982 | 1 | 12.73 | 35 |
rm(blogs, blogs_smp, news, news_smp, twitter, twitter_smp)
As can be seen in the table above, the 1% samples were much smaller and easier to work with.
Note, however, that the mean words per line in each sample remains close to that of the corresponding original dataset.
The samples were further split into training and validation sets for later use to validate the prediction app.
# Split into train and validation sets
split <- sample.split(combined, SplitRatio = 0.8) # Reserve 80% for training, 20% for validation.
train <- subset(combined, split == T)
valid <- subset(combined, split == F)
rm(combined)
The sampled data sets were combined and split as shown above. A quanteda corpus was then built from the training set and cleaned. Tokenization into n-grams and the subsequent frequency analysis were conducted using the quanteda package.
The cleaning steps, applied during tokenization below, were: removing punctuation, symbols, numbers, URLs, and separators; converting to lower case; removing English stop words; and removing profanity.
# Transfer to quanteda corpus format.
train <- corpus(train)
# Set up vector of profanity words.
profanity <- readLines('./data/bad-words.txt')
# Clean and tokenize into uni-grams
train1 <- tokens(train,
what= "word",
remove_punct = T,
remove_symbols = T,
remove_numbers = T,
remove_url = T,
remove_separators = T
) %>%
tokens_tolower() %>%
tokens_remove(pattern= stopwords("en")) %>%
tokens_remove(pattern= as.list(profanity))
# Create n-grams
train2 <- fun.ngram(train1, 2)
train3 <- fun.ngram(train1, 3)
rm(train, profanity)
# Frequency tables ####
freqtrain1 <- fun.freq(train1)
freqtrain2 <- fun.freq(train2)
freqtrain3 <- fun.freq(train3)
rm(train1, train2, train3)
Visualizations were created using frequency tables, word clouds, bar plots, and coverage plots.
fun.display(freqtrain1, "Uni-grams")
It was noted that 13,612 uni-gram phrases covered 90% of the training sample.
fun.display(freqtrain2, "Bi-grams")
It was noted that 201,515 bi-gram phrases covered 90% of the training sample.
fun.display(freqtrain3, "Tri-grams")
It was noted that 212,692 tri-gram phrases covered 90% of the training sample.
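The coverage counts quoted above come from the cumulative-frequency calculation inside fun.display() (listed at the end of this report); a standalone version of that calculation looks roughly like this:
# Number of most-frequent phrases needed to reach a given share of all phrase occurrences.
# textstat_frequency() returns phrases already sorted by descending frequency.
coverage_count <- function(freq, coverage = 0.9) {
  p <- cumsum(freq$frequency) / sum(freq$frequency)
  which(p >= coverage)[1]
}
coverage_count(freqtrain1) # uni-grams needed for 90% coverage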
From the frequency analysis above it was noted that far more bi-gram and tri-gram phrases than uni-grams are needed to reach 90% coverage. Memory usage at this point was checked with gc():
gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 3061963 163.6 9845976 525.9 12778614 682.5
Vcells 13262602 101.2 49436284 377.2 96234401 734.3
The next step in the project is text prediction modeling.
Further studies will need to be conducted on how best to turn the n-gram frequency tables into a prediction model and how to deploy it in a Shiny app; an illustrative sketch of the basic idea is given below.
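As an illustration only, and not the eventual algorithm, the bi-gram frequency table produced above can already drive a naive next-word lookup. The helper below is hypothetical and assumes the feature and frequency columns returned by fun.freq():
# Illustrative sketch: most frequent bi-grams that start with the last word typed.
predict_next <- function(last_word, freq2, n = 3) {
  hits <- freq2[startsWith(freq2$feature, paste0(tolower(last_word), " ")), ]
  hits <- hits[order(-hits$frequency), ]
  sub("^\\S+\\s+", "", head(hits$feature, n)) # keep only the predicted second word
}
predict_next("happy", freqtrain2)
The eventual model will also have to cope with words and phrases that do not appear in the training sample, which this sketch does not attempt.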
## Sourced functions: funs_dataprep.R
# Libraries
library(quanteda) # Textual data analyzer.
library(kableExtra) # To display tables.
# Data summary table function
fun.sumtab <- function(t1, t2, t3) {
# capturing row labels
lbl1 <- substitute(t1); lbl2 <- substitute(t2); lbl3 <- substitute(t3)
lbl <- sapply(c(lbl1, lbl2,lbl3),deparse)
# getting object sizes
SizeMB <- c(object.size(t1), object.size(t2), object.size(t3))/1024^2
# calculating Words per Line Stats
WPL <- sapply(list(t1, t2, t3), function(x) summary(stri_count_words(x))[c('Min.', 'Mean', 'Max.')])
rownames(WPL) <- c('WPLmin', 'WPLmean', 'WPLmax')
# setting up output table
stats <- data.frame(Dataset = lbl, SizeMB,
                    t(rbind(sapply(list(t1, t2, t3), stri_stats_general)[c('Lines', 'Chars'), ],
                            sapply(list(t1, t2, t3), stri_stats_latex)[c('CharsWhite', 'Words'), ], WPL)))
# Printing summary table
kbl(stats, digits=2, format.args = list(big.mark= ",", scientific = F)) %>%
kable_minimal(full_width = T, position = "float_right")
}
# n-grams
fun.ngram <- function(x, gram = 1) {
tokens_ngrams(x,
n= gram,
concatenator= " "
)
}
# Frequency analysis
fun.freq <- function(x) {
  # Note: in quanteda >= 3.0 textstat_frequency() is provided by the quanteda.textstats package.
  textstat_frequency(dfm(x))
}
## Sourced functions: funs_display.R
# Libraries
library(ggwordcloud)# word frequency visualizer.
library(ggplot2) # Grammar of Graphics package for Visualizations.
library(grid) # To manipulate visualizations.
library(gridExtra) # supplement to grid.
library(RColorBrewer)# color palette.
fun.display <- function(x, title) {
# Create ngram frequency display table.
table <- tableGrob(x[1:10,1:3], rows = NULL)
# Create Wordcloud
wc <- ggplot(x[1:10,], aes(label= feature, size= frequency, color= frequency)) +
geom_text_wordcloud() +
scale_size_area(max_size= 12) +
scale_color_distiller(palette = "YlOrRd", direction= 1)
# Create BarPlot
plot <- ggplot(x[1:10, 1:2], aes(y = reorder(feature, frequency), x = frequency, fill = frequency)) +
geom_bar(stat = "identity") +
scale_fill_distiller(palette = "YlOrRd", direction= 1) +
theme(legend.position= "none") +
labs(y="Phrase")
# Create Coverage Plot
p <- cumsum(x$frequency)/sum(x$frequency)
p.5 <- which(p >= 0.5)[1]
# p.9 and tit are assigned with <<- so they remain available outside this function.
p.9 <<- which(p >= 0.9)[1]
tit <<- title
p <- data.frame(p)
p$idx <- as.numeric(row.names(p))
names(p) <- c("pr","idx")
cover <- ggplot(p, aes(y= idx, x= pr, fill= pr)) +
geom_col() +
scale_fill_distiller(palette = "YlOrRd", direction= 1) +
theme(legend.position= "none") +
labs(y= "Phrase Count", x= "Probability of Coverage") +
geom_vline(xintercept= 0.5, color= "steelblue", size = 1) +
geom_vline(xintercept= 0.9, color= "steelblue", size = 1) +
annotate("text", x = .45, y = p.5+1025, label = paste(format(p.5, big.mark= ","), "Phrases", "\n", "@ 50% Coverage")) +
annotate("text", x = .85, y = p.9+1025, label = paste(format(p.9, big.mark= ","), "Phrases", "\n", "@ 90% Coverage"))
# Assemble Figure
layout <- rbind(c(1,2),
c(3,4))
grid.arrange(table, cover, plot, wc,
layout_matrix =layout,
top= textGrob(paste("Top 10", title, "out of",
format(dim(x)[1],big.mark= ","), sep= " "),
gp= gpar(fontsize= 20)))
}