The development of the word suggestion application requires data; this document serves as an update on the status of that preliminary work. Three sources of data are available and are explored below: blogs, news, and twitter.
The goal here is to leverage R code to explore and summarize fundamental features observed in the data, providing a foundation upon which further research and model building will occur.
Beginning with the three text files, below are the basic file facts. These results are obtained with Unix system commands (e.g., wc -l) and serve as confirmation that the data read correctly; code is available in Appendix B. Of note is how well data.table's fread works for capturing this output: because the Unix commands count the files without parsing their contents, no read issues arise at this stage (more on this later).
| file_name | source | line_count | word_count | unix_mean_word_count |
|---|---|---|---|---|
| data/final/en_US/en_US.blogs.txt | blogs | 899,288 | 37,334,690 | 41.5 |
| data/final/en_US/en_US.news.txt | news | 1,010,242 | 34,372,720 | 34.0 |
| data/final/en_US/en_US.twitter.txt | twitter | 2,360,148 | 30,374,206 | 12.9 |
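For reference, the counts above are captured by wrapping the Unix commands in data.table's fread (full code in Appendix B); a minimal sketch for the blogs file:

```r
library(data.table)

# wc output is captured as a small table; the first field is the count
line_count <- as.numeric(fread(cmd = "wc -l data/final/en_US/en_US.blogs.txt")[1, 1])
word_count <- as.numeric(fread(cmd = "wc -w data/final/en_US/en_US.blogs.txt")[1, 1])
round(word_count / line_count, 1)   # mean words per line, as reported above
```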
Below is the first record in the twitter data which, after review of the files, is a representative example of the data overall. One observes multiple sentences, abbreviations, mixed case, and occasional odd formatting (e.g., the missing space in “way,way”). This variety will present some challenges and will require treatment to standardize the text so that the application returns accurate suggestions.
Revisiting the desire to leverage fread: with the twitter data, embedded NUL characters were encountered toward the end of the file, causing fread to stop reading. The data table was formed but did not contain the full set of records. Some research was done and several workarounds suggested on Stack Overflow were attempted. This is a minor issue for now, but it may matter more later, since one goal of the application is minimal response time and, on clean data, fread is among the fastest readers available.
[1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."
The data are instead read in with readLines, where the skipNul = TRUE option allows a full read of the files; no equivalent option is available in fread.
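A minimal sketch of that read, assuming a fixed number of records is wanted for exploration (the full getraw() helper is in Appendix B):

```r
sample_size <- 10000   # number of records to read for exploration

# readLines skips embedded NUL characters that stop fread()
con <- file("data/final/en_US/en_US.twitter.txt", "r")
twitter_text <- readLines(con, n = sample_size, skipNul = TRUE)
close(con)
```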
Next, the data are read into an NLP package for further exploration; the quanteda package was chosen for this work. A corpus is assembled from the read-in data. The corpus here is simple: it is just the collection of blog, news, and twitter records, with each record treated as its own document.
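A minimal, self-contained sketch of the corpus assembly using toy records; Appendix B builds the real corpus from the combined blog, news, and twitter tables:

```r
library(quanteda)
library(data.table)

# Each record becomes its own document, with 'source' carried along as a docvar
docs <- data.table(
  doc_id = c("twitter_1", "blogs_1"),
  text   = c("How are you? Btw thanks for the RT.", "A longer blog post goes here."),
  source = c("twitter", "blogs")
)
cx <- corpus(docs)   # corpus() uses the doc_id and text columns automatically
summary(cx)
```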
Below is a summary of the basic data after tokenization into single words. Note that these figures describe a random sample of 100,000 records from each of the blog, news, and twitter sources. The large size of the data, even given relatively robust computing power (see Appendix A), required sampling; smaller samples were used while developing the code.
| source | mean_nsentencex | mean_ntokenx | mean_line_length |
|---|---|---|---|
| blogs | 2.6 | 40.9 | 228.9 |
| news | 2.0 | 33.3 | 201.9 |
| twitter | 1.6 | 12.5 | 68.6 |
Below is a view of the relationship between the number of sentences and the token count. The three sources display some differences in the rate at which additional sentences add to the token count.
Below are two wordclouds of the 50 most frequent words observed in the data; the left plot includes stop words, the right does not.
Because the goal of the application is to suggest the next word given a phrase, stop words will likely be left in for this project to retain as much information as possible.
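A minimal sketch of how such wordclouds can be drawn from the unigram document-feature matrices built in Appendix B (dx1 with stop words, dx1_nsw without); in recent quanteda releases textplot_wordcloud() lives in the quanteda.textplots companion package:

```r
library(quanteda)
library(quanteda.textplots)   # textplot_wordcloud(); part of quanteda itself in older releases

# dx1 / dx1_nsw are the unigram dfm objects built in Appendix B
textplot_wordcloud(dx1,     max_words = 50)   # stop words included
textplot_wordcloud(dx1_nsw, max_words = 50)   # stop words removed
```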
In order for the application to suggest a likely next word for a given phrase, the tokenization will be expanded from single words (useful for digging into the data) to combinations of multiple words (n-grams). Below are the 10 most frequent combinations in the 3-gram set (trigrams).
These ngrams will eventually form a lookup table to be referenced by the application.
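A minimal sketch of how trigram counts are produced with quanteda, using toy text (the tokenator() helper in Appendix B does the same at scale):

```r
library(quanteda)

txt <- c("how are you doing today", "how are you feeling today")
tok <- tokens(txt, remove_punct = TRUE, remove_symbols = TRUE, remove_numbers = TRUE)

# Build trigrams and count them across documents
tri  <- tokens_ngrams(tok, n = 3, concatenator = " ")
dfm3 <- dfm(tri)
topfeatures(dfm3, 10)   # ten most frequent trigrams
```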
Developing the word suggestion application will require a lookup table of phrases, built from the various n-grams. With the volume of data in play, choosing a right-sized ‘n’ will be important: too large an n inflates the table and degrades application performance (speed), while too small an n leaves too few contextual clues and too many candidate words for the suggestion algorithm (accuracy).
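To make the trade-off concrete, a purely hypothetical sketch of what the lookup could look like: a keyed data.table mapping an (n-1)-word prefix to candidate next words ordered by frequency. The column names and example rows are illustrative only, not the final design:

```r
library(data.table)

# Hypothetical trigram lookup: prefix = first two words, next_word = third
lookup <- data.table(
  prefix    = c("thanks for", "thanks for", "way too"),
  next_word = c("the", "your", "long"),
  count     = c(120, 45, 30)
)
setkey(lookup, prefix)   # keyed lookup keeps retrieval fast as the table grows

# Suggest the most frequent continuation for a two-word phrase
lookup["thanks for"][order(-count)][1, next_word]
```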
Below are plots of timing and memory usage for three sample sizes (10,000, 50,000, and 100,000 records; see Appendix B). These data were captured using the pryr package’s mem_change function along with the difference of Sys.time calls taken before and after each run. The number of distinct n-grams generated is also displayed.
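The measurement idiom itself is simple; a minimal sketch of one timing/memory capture (the appendix repeats this for each n-gram level), with rnorm() standing in for the real n-gram build:

```r
library(pryr)

b <- Sys.time()
mem_delta <- mem_change(
  x <- rnorm(1e6)   # stand-in for building one n-gram count table
)
elapsed <- as.numeric(Sys.time() - b, units = "secs")
mem_delta   # bytes allocated by the expression
elapsed     # wall-clock seconds
```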
The shorter twitter documents contain fewer long sentences, and their word choices appear less varied, which makes sense for the medium. A kink in the trend appears at about the 5-gram level, indicating that combinations of 6 or more words actually produce fewer distinct n-grams.
One task during development is to remove profane language from the text. The sentimentr package was used to split the text into sentences and evaluate the language against the lexicon::profanity_banned dictionary. The results include the number of sentences, the word count, and a count of profane words.
These words are then removed from the text, leaving the remaining language available as data. The twitter data, with its more free-style language, shows the highest counts of profanity. The counts depend on the lexicon(s) used: the profanity function’s default settings draw on several lexicons, and as of this writing the fairly simple lexicon::profanity_banned list is a good choice.
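A minimal sketch of the scoring step with sentimentr and the lexicon::profanity_banned list, run on toy sentences; the subsequent removal of flagged words is not shown here:

```r
library(sentimentr)
library(lexicon)

txt  <- c("This is a perfectly clean sentence.", "Well, that was unexpected.")
sent <- get_sentences(txt)

# Count banned terms per sentence; the result includes word_count and profanity_count
prof <- profanity(sent, profanity_list = unique(tolower(lexicon::profanity_banned)))
prof
```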
While using the sentimentr package for the profanity check, its namesake function was employed as well: sentiment returns a basic sentiment score for each sentence. These data may (or may not) play a role in future model work.
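A minimal sketch of the sentiment call, matching the usage in Appendix B where column 4 of the result holds the per-sentence score:

```r
library(sentimentr)

sent <- get_sentences(c("I love this.", "This was way, way too long and boring."))
sentiment(sent)   # per-sentence scores; column 4 holds the sentiment value
```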
Below is a plot of the distribution of negative sentiment scores; one observes slightly more negativity in the twitter data.
The next steps in this process will be to improve the efficiency of the code modules to address processing limitations that may exist on the host platform (in this case shinyapps.io).
There will also likely be further development of features observed in the texts, similar to the sentiment scoring. For example, several candidate words may be returned for a phrase; if the input phrase itself can be assessed, those features may help narrow the results further.
For instance, the next-word choices might be reduced to a single suggestion if it is possible to identify the input as twitter-like via hashtag presence, text length, and possibly sentiment.
Finally, more research is required to understand the modeling possibilities that exist.
In conclusion, the data look acceptable and are ready for further analysis and processing. The corpus yields insights into the quantities of verbiage involved, and the large data size will require efficient code.
Judicious selection of n-gram depth will be guided by system limitations, as it was observed that the more words involved, the larger the eventual lookup table becomes. This concurs with the course suggestion to balance volume and accuracy against response speed.
This update was largely assisted by the following sources:
Text Mining Infrastructure in R
The author extends his appreciation for the excellent information contained in the above sources.
This work was developed on the following system:
Model Name: iMac
Processor Name: Quad-Core Intel Core i7
Memory: 32 GB
R version 3.6.2 (2019-12-12) -- "Dark and Stormy Night"
# Line counts (wc -l) and word counts (wc -w) captured through fread()
file_rowsn<-as.numeric(fread(cmd = paste("wc -l ", "data/final/en_US/en_US.news.txt"))[1,1])
file_rowst<-as.numeric(fread(cmd = paste("wc -l ", "data/final/en_US/en_US.twitter.txt"))[1,1])
file_rowsb<-as.numeric(fread(cmd = paste("wc -l ", "data/final/en_US/en_US.blogs.txt"))[1,1])
xfile_rowsn<-as.numeric(fread(cmd = paste("wc -w ", "data/final/en_US/en_US.news.txt"))[1,1])
xfile_rowst<-as.numeric(fread(cmd = paste("wc -w ", "data/final/en_US/en_US.twitter.txt"))[1,1])
xfile_rowsb<-as.numeric(fread(cmd = paste("wc -w ", "data/final/en_US/en_US.blogs.txt"))[1,1])
unix_summary<-data.table(
file_name=c("data/final/en_US/en_US.blogs.txt",
"data/final/en_US/en_US.news.txt",
"data/final/en_US/en_US.twitter.txt"),
source=c("blogs","news","twitter"),
line_count=c(file_rowsb,file_rowsn,file_rowst),
word_count=c(xfile_rowsb,xfile_rowsn,xfile_rowst), stringsAsFactors = F)
unix_summary[, unix_mean_word_count:=round(word_count/line_count,1)]
head(unix_summary)
sample_size<-10000
# Read the first sample_sizex records of one source with readLines (skipNul = TRUE allows
# a full read past embedded NULs) and return a data.table with simple per-record features.
getraw<-function(sourcex, sample_sizex) {
  filenamex<-str_c("data/final/en_US/en_US.", sourcex, ".txt")
  full_file_rows<-sample_sizex
  con <- file(filenamex, "r")
  rawx<-data.table(
    doc_id=str_c(sourcex, "_", as.character(seq(1,full_file_rows))),
    text=readLines(con, n=full_file_rows, skipNul = TRUE),
    source=rep(sourcex,full_file_rows),
    stringsAsFactors = F)
  close(con)
  # rawx[,pound_symbol:=grepl("#", text)]
  # rawx[,term_i:=grepl("( [iI] )", text)]
  rawx[,line_length:=str_length(text)]       # character length of each record
  rawx[,count_wordsx:=count_words(text)]     # word count per record (e.g., tokenizers::count_words)
  return(rawx)
}
news<-getraw("news", sample_size)
twitter<-getraw("twitter", sample_size)
blogs<-getraw("blogs", sample_size)
head(news)
# Combine the three sources, build a corpus, and compute sentence counts before lowercasing
sx<-funion(news, funion(twitter,blogs))
cx<-corpus(sx)
nsentencex<-nsentence(cx)
sx$text<-tolower(sx$text)
#head(sx)
cx<-corpus(sx)
tx<-tokens(cx,
remove_punct = T,
remove_symbols = T,
remove_numbers = T,
remove_separators = T)
# Build n-grams at level nx from the token object tx defined above
tokenator<-function(nx) {
  tokens_ngrams(tx, n=nx, concatenator = " ")
}
tx1<-tokenator(1)
tx2<-tokenator(2)
tx3<-tokenator(3)
dx1<-dfm(tx1)
dx2<-dfm(tx2)
dx3<-dfm(tx3)
# Same as tokenator(), but with English stop words removed first
tokenator2<-function(nx) {
  tmp<-tokens_remove(tx, stopwords("english"))
  tokens_ngrams(tmp, n=nx, concatenator = " ")
}
tx1<-tokenator2(1)
tx2<-tokenator2(2)
tx3<-tokenator2(3)
dx1_nsw<-dfm(tx1)
dx2_nsw<-dfm(tx2)
dx3_nsw<-dfm(tx3)
summary_terms1<-data.table(termx=names(docfreq(dx1)), docfreqx=docfreq(dx1), featfreqx=featfreq(dx1))
summary_terms2<-data.table(termx=names(docfreq(dx2)), docfreqx=docfreq(dx2), featfreqx=featfreq(dx2))
summary_terms3<-data.table(termx=names(docfreq(dx3)), docfreqx=docfreq(dx3), featfreqx=featfreq(dx3))
# pound_symbol / term_i are populated only if the corresponding flags in getraw() are uncommented
summary_docs1<-data.table(docx=docnames(dx1), nsentencex=nsentencex, ntypex=ntype(dx1), ntokenx=ntoken(dx1),
pound_symbol=cx$pound_symbol, term_i=cx$term_i,
line_length=sx$line_length,source=sx$source, count_wordsx=sx$count_wordsx)
summary_docs3<-data.table(docx=docnames(dx3), nsentencex=nsentencex, ntypex=ntype(dx3), ntokenx=ntoken(dx3),
pound_symbol=cx$pound_symbol, term_i=cx$term_i,
line_length=sx$line_length,source=sx$source, count_wordsx=sx$count_wordsx)
doc_summary<-summary_docs1%>%group_by(source)%>%summarize(mean_nsentencex=round(mean(nsentencex),1),
mean_ntokenx=round(mean(ntokenx),1),
mean_line_length=round(mean(line_length),1))
ggplot(data=summary_docs1)+
geom_jitter(mapping = aes(x=nsentencex, y=ntokenx, color=source),alpha=.3)+
geom_smooth(mapping = aes(x=nsentencex, y=ntokenx), color="black",alpha=.3)+
facet_grid(~source)
setorder(summary_terms1, -featfreqx)
ggplot(data=summary_terms1[1:20])+
geom_col(mapping = aes(y=reorder(termx,featfreqx), x=featfreqx))
# library(lineprof)   # not used below
library(pryr)         # mem_change() for measuring memory deltas

# Build distinct n-gram count tables for n = 1..10, recording elapsed time,
# memory change, and the number of distinct n-grams at a given sample size.
xxx<-function(sample_size, typex) {
#sample_size<-1000
#tx<-tokens(corpus(fread("sed 's/\x7f//g' ../ds_capstone/data/final/en_US/en_US.twitter.txt",
tx<-tokens(corpus(fread(cmd=str_c("sed 's/\x7f//g' ../ds_capstone/data/final/en_US/en_US.", typex,".txt"),
header = F,
stringsAsFactors = F,
data.table = F,
blank.lines.skip = T,
sep = "\n",
col.names = c("text"),
# ,
nrows = sample_size)),
remove_punct = T,
remove_symbols = T,
remove_numbers = T,
remove_separators = T)
tokenator<-function(nx) {
tokens_ngrams(tx, n=nx, concatenator = " ")
}
  # For each n-gram level, time the build of the distinct-n-gram count table,
  # record the memory change, and store the number of distinct n-grams.
  nrow_x <- elapsed_time <- memory_usage <- numeric(10)
  for (nx in 1:10) {
    b <- Sys.time()
    memory_usage[nx] <- mem_change(
      txn <- data.table(unlist(tokenator(nx))) %>%
        group_by(V1) %>%
        summarize(count = n()) %>%
        arrange(-count)
    )
    elapsed_time[nx] <- as.numeric(Sys.time() - b, units = "secs")
    nrow_x[nx] <- nrow(txn)
  }
  j <- data.frame(ngram_level = 1:10,
                  nrow = nrow_x,
                  elapsed_time = elapsed_time,
                  memory_usage = memory_usage,
                  sample_size = rep(format(sample_size, big.mark = ","), 10),
                  type = rep(typex, 10))
  return(j)
}
# j1000<-xxx(1000,"twitter")
# j5000<-xxx(5000)
gc()
j10000b<-xxx(10000,"blogs")
j50000b<-xxx(50000,"blogs")
j100000b<-xxx(100000,"blogs")
gc()
j10000n<-xxx(10000,"news")
j50000n<-xxx(50000,"news")
j100000n<-xxx(100000,"news")
gc()
j10000t<-xxx(10000,"twitter")
j50000t<-xxx(50000,"twitter")
j100000t<-xxx(100000,"twitter")
# j10000b<-xxx(100,"blogs")
# j50000b<-xxx(100,"blogs")
# j100000b<-xxx(100,"blogs")
# j10000n<-xxx(100,"news")
# j50000n<-xxx(100,"news")
# j100000n<-xxx(100,"news")
# j10000t<-xxx(100,"twitter")
# j50000t<-xxx(100,"twitter")
# j100000t<-xxx(100,"twitter")
options(scipen = 6)
#j<-union_all(j1000,j5000)%>%union_all(j10000)%>%union_all(j50000)%>%union_all(j100000)
j<-union_all(j10000b,j50000b)%>%union_all(j100000b)%>%
union_all(j10000n)%>%union_all(j50000n)%>%union_all(j100000n)%>%
union_all(j10000t)%>%union_all(j50000t)%>%union_all(j100000t)
fwrite(j, "plot_resource_demands.txt")
ggplot(j)+
  geom_line(mapping = aes(x=ngram_level, y=elapsed_time, color=sample_size))+
  geom_point(mapping = aes(x=ngram_level, y=elapsed_time, color=sample_size))+
  xlab("N-gram Level")+ylab("Elapsed Time (seconds)")+
  scale_x_continuous(breaks = 1:10)+   # keep the xlab() label rather than overriding it
  facet_wrap(vars(type, sample_size))
# facet_grid(~sample_size)
ggplot(j)+
  geom_line(mapping = aes(x=ngram_level, y=memory_usage/1000000, color=sample_size))+
  geom_point(mapping = aes(x=ngram_level, y=memory_usage/1000000, color=sample_size))+
  xlab("N-gram Level")+ylab("Memory Usage (MB)")+
  scale_x_continuous(breaks = 1:10)+
  facet_wrap(vars(type, sample_size))
# facet_grid(~sample_size)
ggplot(j)+
  geom_line(mapping = aes(x=ngram_level, y=nrow, color=sample_size))+
  geom_point(mapping = aes(x=ngram_level, y=nrow, color=sample_size))+
  xlab("N-gram Level")+ylab("Distinct N-grams")+
  scale_x_continuous(breaks = 1:10)+
  facet_wrap(vars(type, sample_size))
# facet_grid(~sample_size)
read_rows<-10000
# Read read_rows records of one source and score each sentence for profanity and sentiment
x<-function(typex) {
  # read_rows<-1000
  # typex<-"twitter"
  # sed replaces embedded 0x7f (DEL) characters so fread can read the full file
  rawx<-fread(cmd=str_c("sed 's/\x7f/x78/g' ../ds_capstone/data/final/en_US/en_US.", typex, ".txt"),
header = F,
stringsAsFactors = F,
data.table = T,
blank.lines.skip = T,
sep = "\n",
nrows = read_rows,
col.names = c("text")
)
rawx[,text:=tolower(text)]
raw_sentences<-get_sentences(rawx$text)
  # Sentence-level profanity counts (lexicon::profanity_banned) plus basic sentiment scores
  summarized_rawx<-data.table(profanity(raw_sentences,profanity_list = unique(tolower(lexicon::profanity_banned))),
                              sentiment=sentiment(raw_sentences)[,4])
return(summarized_rawx)
}
blogs<-x("blogs")
news<-x("news")
twitter<-x("twitter")
blogs2<-filter(blogs, sentiment.sentiment<0)%>%mutate(type="blogs")
news2<-filter(news, sentiment.sentiment<0)%>%mutate(type="news")
twitter2<-filter(twitter, sentiment.sentiment<0)%>%mutate(type="twitter")
all<-union_all(blogs2, news2)%>%union_all(twitter2)
ggplot(all)+
geom_density(mapping = aes(x=sentiment.sentiment, color=type))+
geom_vline(xintercept = 0)+
facet_grid(~type)
summarized_blogs<-data.table(blogs%>%
group_by(element_id)%>%summarize(sentence_count=max(sentence_id),
word_count=sum(word_count),
profanity_count=sum(profanity_count),
profanity=sum(profanity),
sentiment=sum(sentiment.sentiment)))
summarized_news<-data.table(news%>%
group_by(element_id)%>%summarize(sentence_count=max(sentence_id),
word_count=sum(word_count),
profanity_count=sum(profanity_count),
profanity=sum(profanity),
sentiment=sum(sentiment.sentiment)))
summarized_twitter<-data.table(twitter%>%
group_by(element_id)%>%summarize(sentence_count=max(sentence_id),
word_count=sum(word_count),
profanity_count=sum(profanity_count),
profanity=sum(profanity),
sentiment=sum(sentiment.sentiment)))
(blogs_profanity_percent<-nrow(summarized_blogs[profanity_count>0,])/nrow(summarized_blogs))
(news_profanity_percent<-nrow(summarized_news[profanity_count>0,])/nrow(summarized_news))
(twitter_profanity_percent<-nrow(summarized_twitter[profanity_count>0,])/nrow(summarized_twitter))
(blogs_profanity_count<-sum(summarized_blogs$profanity_count))
(news_profanity_count<-sum(summarized_news$profanity_count))
(twitter_profanity_count<-sum(summarized_twitter$profanity_count))
x<-data.table(percent=c(blogs_profanity_percent,news_profanity_percent, twitter_profanity_percent),
count=c(blogs_profanity_count, news_profanity_count, twitter_profanity_count),
type=as.factor(c("blogs", "news", "twitter")))
ggplot(x)+
  geom_bar(mapping = aes(x=percent, y=type, fill=type), stat = "identity")+
  xlab(str_c("Percent of ", format(read_rows,big.mark = ",")," Documents"))+ylab("Document Type")
sss<-data.table(union_all(summarized_blogs, summarized_news)%>%union_all(summarized_twitter))
sss$type<-c(rep("blogs",read_rows),rep("news",read_rows),rep("twitter",read_rows))
ggplot(sss)+
geom_density(mapping = aes(x=sentiment, color=type))+
geom_vline(xintercept = 0)+
facet_grid(~type)