Synopsis

The development of the word suggestion application requires data; this document serves as an update on the status of that preliminary work. Three sources of data are available and are explored below: blogs, news, and twitter data.

The goal here is to leverage R code to explore and summarize fundamental features observed in the data, providing a foundation upon which further research and model building will occur.

Exploratory Data Analysis

Raw data

Beginning with the three text files, below are the basic file facts. These results were obtained using Unix system commands (e.g., wc -l) and are useful as confirmation that the data were read correctly. Code is available in Appendix B. Of note is the successful use of data.table fread technology here to capture the command output. The simple file-based Unix commands do not require reading the data into R, so they complete without issue (more on this later).

file_name                            source    line_count   word_count   unix_mean_word_count
data/final/en_US/en_US.blogs.txt     blogs        899,288   37,334,690                   41.5
data/final/en_US/en_US.news.txt      news       1,010,242   34,372,720                   34.0
data/final/en_US/en_US.twitter.txt   twitter    2,360,148   30,374,206                   12.9

Below is the first record in the twitter data, which, after a review of the files, is a representative example of the data overall. One observes multiple sentences, use of abbreviations, mixed case, and odd formatting (no space after the comma in "way,way"). This variety will present some challenges and will require some treatment to standardize and format the text so that the application returns accurate suggestions.

Revisiting the desire to leverage fread: with the twitter data, embedded null characters were encountered toward the end of the file, causing fread to stop reading. The data table was formed but did not contain the full set of records. Some research was done, and several suggestions from Stack Overflow were attempted. This is a minor issue, but it may matter more later, as one goal of the application is minimized response time, and on clean data fread is among the fastest readers available.

[1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."

Corpus and Tokenization

The data are read in using readLines logic, and the skipNul = TRUE option allows a full read of the files. Note this option is not available in fread.
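The read pattern, along with the sed-based workaround used for fread later in Appendix B, is sketched below.

library(data.table)

# readLines() with skipNul = TRUE reads past the embedded null characters
con <- file("data/final/en_US/en_US.twitter.txt", "r")
twitter_raw <- readLines(con, skipNul = TRUE)
close(con)

# fread() has no such option; the Appendix B workaround strips the offending
# 0x7f bytes with sed before fread() ever sees them:
# twitter_dt <- fread(cmd = "sed 's/\x7f//g' data/final/en_US/en_US.twitter.txt",
#                     header = FALSE, sep = "\n", col.names = "text")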

Next, the data are read into an NLP package for further exploration. The quanteda package was chosen for this work. A corpus is assembled from the read-in data. The corpus here is simple: it is just the collection of blog, news, and twitter records, with each record treated as its own document.
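A minimal sketch of that assembly, paralleling code_read_raw.R and code_summary_docs.R in Appendix B (the sample records shown here are placeholders):

library(data.table)
library(quanteda)

# One document per record; the source column is carried along as a docvar
records <- data.table(
  doc_id = c("blogs_1", "news_1", "twitter_1"),
  text   = c("A sample blog post.", "A sample news item.", "A sample tweet."),
  source = c("blogs", "news", "twitter")
)

cx <- corpus(records)          # uses the doc_id and text columns by default
tx <- tokens(cx,
             remove_punct = TRUE,
             remove_symbols = TRUE,
             remove_numbers = TRUE)
summary(cx)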

Below is a summary of the basic data after tokenization into single words. Note these figures describe a random sample of 100,000 records each from the blogs, news, and twitter data. The large size of the data, even given relatively robust computing power (see Appendix A), required sampling. Smaller samples were used to develop the code.

source    mean_nsentencex   mean_ntokenx   mean_line_length
blogs                 2.6           40.9              228.9
news                  2.0           33.3              201.9
twitter               1.6           12.5               68.6

Below is a view of the relationship between the number of sentences and the token count. The plots display some differences in the rate at which additional sentences add to the token count.

  • The blogs data have a noticeable number of entries with many tokens and sentences, which makes intuitive sense as blogs are essentially a person’s stream of consciousness.
  • The news data seem to have less variation, suggesting most adhere to some journalistic standards in terms of brevity. These statements are also supported by the table above.
  • The twitter feeds generally contain brief sentences (therefore fewer tokens), typically five or fewer.

Below are two wordclouds of the 50 most frequent words observed in the data; the left plot includes stop words, the right does not.

As the goal of the application is to suggest the next word given a phrase, stop words will likely be left in for this project to retain as much information as possible.
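The exact plotting code is not reproduced in Appendix B, but one way to generate such wordclouds from the quanteda objects is sketched below (textplot_wordcloud lives in the quanteda.textplots package in recent releases).

library(quanteda)
library(quanteda.textplots)  # textplot_wordcloud(); part of quanteda itself in older releases

dx <- dfm(tx)                                    # tx: the single-word tokens
textplot_wordcloud(dx, max_words = 50)           # stop words included

dx_nsw <- dfm_remove(dx, stopwords("english"))   # stop words removed
textplot_wordcloud(dx_nsw, max_words = 50)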

N-gram creation

In order for the application to suggest a likely next word for a given phrase, tokenization will be expanded from single words (useful for digging into the data) to combinations of multiple words (n-grams). Below are the 10 most frequent combinations in the 3-gram set (also known as trigrams).

These n-grams will eventually form a lookup table to be referenced by the application.
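A sketch of how the trigram counts behind such a lookup table can be assembled, paralleling the tokenator helper in Appendix B (tx is the single-word tokens object from above):

library(quanteda)
library(data.table)

# Build trigrams from the single-word tokens, count them, and keep the
# counts as a sorted lookup table
tx3 <- tokens_ngrams(tx, n = 3, concatenator = " ")
dx3 <- dfm(tx3)

trigram_counts <- data.table(ngram = featnames(dx3),
                             count = as.numeric(colSums(dx3)))
setorder(trigram_counts, -count)
head(trigram_counts, 10)   # the 10 most frequent trigrams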

System Resources

Developing the word suggestion application will require a lookup table of phrases, which will be the result of creating various n-grams. With the volume of data in play, keeping the 'n' in the n-gram selection right-sized will be important; otherwise poor application performance will occur (speed). Opposing this, choosing an n that is too small would leave too few contextual clues and therefore too many choices for the word suggestion algorithm (accuracy).

Below are plots of timing and memory usage for three sample sizes (1,000, 50,000, and 100,000 records). These data were captured using the pryr package's mem_change function, along with simply subtracting Sys.time calls taken before and after each sample is run. The number of distinct n-grams generated is also displayed.

N-gram creation speed

N-gram creation system resources
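The capture pattern for each n-gram level is essentially the following, condensed from plot_resource_demands.R in Appendix B.

library(pryr)
library(quanteda)

# Time and memory for building one n-gram level (here n = 3) from an existing
# tokens object tx; the same pattern repeats for levels 1 through 10
b <- Sys.time()
memx <- mem_change(tx3 <- tokens_ngrams(tx, n = 3, concatenator = " "))
elapsed_seconds <- as.numeric(difftime(Sys.time(), b, units = "secs"))

c(distinct_ngrams = length(unique(unlist(tx3))),
  elapsed_seconds = elapsed_seconds,
  memory_change_bytes = as.numeric(memx))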

The shorter twitter documents contain fewer long sentences, and the word choices seem to have less variety, which makes some sense. The kink in the trend appears after about the 5-gram level, indicating that combinations of six or more words actually result in fewer distinct n-grams.

Profanity

One task during development is to remove profane language from the text. The sentimentr package was used to split records into sentences and evaluate the language against the lexicon::profanity_banned dictionary. Results of this function include the number of sentences, the word count, and a count of profane words.

These words are then removed from the text, leaving the remaining language available as data. The twitter data, being more free-style in language, include the highest counts of profanity. The counts depend on the lexicon(s) used; the profanity function's default settings draw on several lexicons, and as of this writing the fairly simple lexicon::profanity_banned list is a good choice.
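A minimal sketch of the profanity scoring, following summarize_profanity.R in Appendix B (the sample records are placeholders):

library(sentimentr)
library(lexicon)

txt <- c("How are you? Been way, way too long.", "Another sample record.")

sent <- get_sentences(tolower(txt))
prof <- profanity(sent, profanity_list = unique(tolower(lexicon::profanity_banned)))

# One row per sentence: element_id, sentence_id, word_count, profanity_count, profanity
head(prof)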

Sentiment

While using the sentimentr package to remove profanity, its namesake function was employed as well. The sentiment function does just that: it returns a basic sentiment score for given sentences. These data may (or may not) play a role in future model work.
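The sentiment scoring follows the same sentence-level pattern; a brief sketch:

library(sentimentr)

sent <- get_sentences(c("Love to see you.", "This has been a terrible day."))
sentiment(sent)   # element_id, sentence_id, word_count, and a sentiment score per sentence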

Below is a plot of the distribution of negative sentiment scores; one observes slightly more negativity in the twitter data.

Next steps

The next steps in this process will be to improve the efficiency of the code modules to address processing limitations that may exist on the host platform (in this case shinyapps.io).

There will also likely be further development of features observed in the texts, similar to the scoring of the sentiment. For example, several candidate words may be returned for a phrase; if the input phrase itself can be assessed, that assessment may help focus the results further.

For instance, the next-word choices might be reduced, possibly to a single choice, if it is possible to identify the input as twitter-like via hashtag presence, length of text, and possibly sentiment.
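As a purely hypothetical illustration (no such function exists in the current code, and the thresholds below are placeholders), such a heuristic might look like:

library(stringr)

# Hypothetical heuristic: flag an input phrase as twitter-like based on
# hashtag presence, short length, and an optional sentiment score
looks_like_twitter <- function(text, sentiment_score = NA) {
  has_hashtag <- str_detect(text, "#\\w+")
  is_short    <- str_length(text) <= 140
  has_hashtag || (is_short && !is.na(sentiment_score) && sentiment_score < 0)
}

looks_like_twitter("Been way, way too long. #tbt")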

Finally, more research is required to understand the modeling possibilities that exist.

Summary

In conclusion, the data look acceptable and are ready for further analysis and processing. The corpus yields insights into the quantities of verbiage involved, and the large data size will require efficient code.

Judicious use of n-gram treatment will be guided by system limitations, as it was observed that the more words involved, the larger the eventual lookup list becomes. This concurs with the course suggestion to balance volume/accuracy against response speed.

Citations

This update was largely assisted by the following sources:

Text Mining Infrastructure in R

quanteda package information

quanteda function reference

coursera mentor help

The author extends his appreciation for the excellent information contained in the above sources.

Appendix A: System setup

This work was developed on the following system:

  Model Name: iMac
  Processor Name: Quad-Core Intel Core i7
  Memory: 32 GB

R version 3.6.2 (2019-12-12) -- "Dark and Stormy Night"

Appendix B: R code

code_raw_summary.R
library(data.table)  # fread(), data.table()

# Line counts via the Unix wc -l command
file_rowsn<-as.numeric(fread(cmd = paste("wc -l ", "data/final/en_US/en_US.news.txt"))[1,1])
file_rowst<-as.numeric(fread(cmd = paste("wc -l ", "data/final/en_US/en_US.twitter.txt"))[1,1])
file_rowsb<-as.numeric(fread(cmd = paste("wc -l ", "data/final/en_US/en_US.blogs.txt"))[1,1])
# Word counts via the Unix wc -w command
xfile_rowsn<-as.numeric(fread(cmd = paste("wc -w ", "data/final/en_US/en_US.news.txt"))[1,1])
xfile_rowst<-as.numeric(fread(cmd = paste("wc -w ", "data/final/en_US/en_US.twitter.txt"))[1,1])
xfile_rowsb<-as.numeric(fread(cmd = paste("wc -w ", "data/final/en_US/en_US.blogs.txt"))[1,1])
unix_summary<-data.table(
    file_name=c("data/final/en_US/en_US.blogs.txt",
            "data/final/en_US/en_US.news.txt",
            "data/final/en_US/en_US.twitter.txt"),
    source=c("blogs","news","twitter"),
    line_count=c(file_rowsb,file_rowsn,file_rowst),
    word_count=c(xfile_rowsb,xfile_rowsn,xfile_rowst), stringsAsFactors = F)
unix_summary[, unix_mean_word_count:=round(word_count/line_count,1)]
head(unix_summary)
code_read_raw.R
library(data.table)  # data.table(), := assignment
library(stringr)     # str_c(), str_length()
library(tokenizers)  # count_words()

sample_size<-10000
getraw<-function(sourcex, sample_sizex) {
    filenamex<-str_c("data/final/en_US/en_US.", sourcex, ".txt")

    # Only sample_sizex rows are read here; full line counts were captured above via wc -l
    full_file_rows<-sample_sizex

    con <- file(filenamex, "r")
    rawx<-data.table(
        doc_id=str_c(sourcex, "_", as.character(seq(1,full_file_rows))),
        text=readLines(con, n=full_file_rows, skipNul = TRUE),
        source=rep(sourcex,full_file_rows),
        stringsAsFactors = F)
    close(con)
    # Simple flags and lengths, carried into the corpus as docvars (used in code_summary_docs.R)
    rawx[,pound_symbol:=grepl("#", text)]
    rawx[,term_i:=grepl("( [iI] )", text)]
    rawx[,line_length:=str_length(text)]
    rawx[,count_wordsx:=count_words(text)]
    return(rawx)
}
news<-getraw("news", sample_size)
twitter<-getraw("twitter", sample_size)
blogs<-getraw("blogs", sample_size)

head(news)
code_summary_docs.R
library(data.table)  # funion(), data.table(), setorder()
library(quanteda)    # corpus(), tokens(), dfm(), and related helpers
library(dplyr)       # group_by(), summarize()
library(ggplot2)

# Combine the three sampled sources into one table (funion drops duplicate rows)
sx<-funion(news, funion(twitter,blogs))

cx<-corpus(sx)
nsentencex<-nsentence(cx)

sx$text<-tolower(sx$text)
#head(sx)

cx<-corpus(sx)

tx<-tokens(cx,
       remove_punct = T,
       remove_symbols = T,
       remove_numbers = T,
       remove_separators = T)

# Build n-grams of order nx from the base tokens object tx
tokenator<-function(nx) {
    tokens_ngrams(tx, n=nx, concatenator = " ")
}
tx1<-tokenator(1)
tx2<-tokenator(2)
tx3<-tokenator(3)

dx1<-dfm(tx1)
dx2<-dfm(tx2)
dx3<-dfm(tx3)

# Same, but with English stop words removed first
tokenator2<-function(nx) {
    tmp<-tokens_remove(tx, stopwords("english"))
    tokens_ngrams(tmp, n=nx, concatenator = " ")
}
tx1<-tokenator2(1)
tx2<-tokenator2(2)
tx3<-tokenator2(3)

dx1_nsw<-dfm(tx1)
dx2_nsw<-dfm(tx2)
dx3_nsw<-dfm(tx3)

summary_terms1<-data.table(termx=names(docfreq(dx1)), docfreqx=docfreq(dx1), featfreqx=featfreq(dx1))
summary_terms2<-data.table(termx=names(docfreq(dx2)), docfreqx=docfreq(dx2), featfreqx=featfreq(dx2))
summary_terms3<-data.table(termx=names(docfreq(dx3)), docfreqx=docfreq(dx3), featfreqx=featfreq(dx3))

summary_docs1<-data.table(docx=docnames(dx1), nsentencex=nsentencex, ntypex=ntype(dx1), ntokenx=ntoken(dx1),
              pound_symbol=cx$pound_symbol, term_i=cx$term_i,
              line_length=sx$line_length,source=sx$source, count_wordsx=sx$count_wordsx)
summary_docs3<-data.table(docx=docnames(dx3), nsentencex=nsentencex, ntypex=ntype(dx3), ntokenx=ntoken(dx3),
              pound_symbol=cx$pound_symbol, term_i=cx$term_i,
              line_length=sx$line_length,source=sx$source, count_wordsx=sx$count_wordsx)

doc_summary<-summary_docs1%>%group_by(source)%>%summarize(mean_nsentencex=round(mean(nsentencex),1),
                              mean_ntokenx=round(mean(ntokenx),1),
                              mean_line_length=round(mean(line_length),1))

ggplot(data=summary_docs1)+
    geom_jitter(mapping = aes(x=nsentencex, y=ntokenx, color=source),alpha=.3)+
    geom_smooth(mapping = aes(x=nsentencex, y=ntokenx), color="black",alpha=.3)+
    facet_grid(~source)
setorder(summary_terms1, -featfreqx)
ggplot(data=summary_terms1[1:20])+
    geom_col(mapping = aes(y=reorder(termx,featfreqx), x=featfreqx))
plot_resource_demands.R
library(data.table)  # fread(), data.table()
library(stringr)     # str_c()
library(quanteda)    # corpus(), tokens(), tokens_ngrams()
library(dplyr)       # group_by(), summarize(), arrange(), union_all()
library(ggplot2)
library(pryr)        # mem_change()
# Measure elapsed time, memory change, and the number of distinct n-grams
# produced at each n-gram level (1 through 10) for a given sample size and source
xxx<-function(sample_size, typex) {
    # Strip embedded 0x7f characters with sed so fread reads the full sample,
    # then tokenize once; n-grams of every order are built from these tokens
    tx<-tokens(corpus(fread(cmd=str_c("sed 's/\x7f//g' ../ds_capstone/data/final/en_US/en_US.", typex,".txt"),
                            header = F,
                            stringsAsFactors = F,
                            data.table = F,
                            blank.lines.skip = T,
                            sep = "\n",
                            col.names = c("text"),
                            nrows = sample_size)),
               remove_punct = T,
               remove_symbols = T,
               remove_numbers = T,
               remove_separators = T)

    tokenator<-function(nx) {
        tokens_ngrams(tx, n=nx, concatenator = " ")
    }

    # For each n-gram level, record the memory change, elapsed time (in seconds),
    # and the number of distinct n-grams generated
    results<-lapply(1:10, function(nx) {
        b<-Sys.time()
        memx<-mem_change(txn<-data.table(unlist(tokenator(nx)))%>%
            group_by(V1)%>%summarize(count=n())%>%arrange(-count))
        elapsed<-as.numeric(difftime(Sys.time(), b, units = "secs"))
        data.frame(ngram_level=nx,
                   nrow=nrow(txn),
                   elapsed_time=elapsed,
                   memory_usage=as.numeric(memx))
    })

    j<-do.call(rbind, results)
    j$sample_size<-format(sample_size, big.mark = ",")
    j$type<-typex
    return(j)
}
# j1000<-xxx(1000,"twitter")
# j5000<-xxx(5000)
gc()
j10000b<-xxx(10000,"blogs")
j50000b<-xxx(50000,"blogs")
j100000b<-xxx(100000,"blogs")
gc()
j10000n<-xxx(10000,"news")
j50000n<-xxx(50000,"news")
j100000n<-xxx(100000,"news")
gc()
j10000t<-xxx(10000,"twitter")
j50000t<-xxx(50000,"twitter")
j100000t<-xxx(100000,"twitter")

options(scipen = 6)
#j<-union_all(j1000,j5000)%>%union_all(j10000)%>%union_all(j50000)%>%union_all(j100000)

j<-union_all(j10000b,j50000b)%>%union_all(j100000b)%>%
    union_all(j10000n)%>%union_all(j50000n)%>%union_all(j100000n)%>%
    union_all(j10000t)%>%union_all(j50000t)%>%union_all(j100000t)
fwrite(j, "plot_resource_demands.txt")
ggplot(j)+
    geom_line(mapping = aes(x=ngram_level, y=elapsed_time, color=sample_size))+
    geom_point(mapping = aes(x=ngram_level, y=elapsed_time, color=sample_size))+
    xlab("N-gram Level")+ylab("Elapsed Time (seconds)")+
    scale_x_continuous(breaks=1:10)+
    facet_wrap(vars(type, sample_size))

ggplot(j)+
    geom_line(mapping = aes(x=ngram_level, y=memory_usage/1000000, color=sample_size))+
    geom_point(mapping = aes(x=ngram_level, y=memory_usage/1000000, color=sample_size))+
    xlab("N-gram Level")+ylab("Memory Usage (MB)")+
    scale_x_continuous(breaks=1:10)+
    facet_wrap(vars(type, sample_size))

ggplot(j)+
    geom_line(mapping = aes(x=ngram_level, y=nrow, color=sample_size))+
    geom_point(mapping = aes(x=ngram_level, y=nrow, color=sample_size))+
    xlab("N-gram Level")+ylab("Distinct N-grams")+
    scale_x_continuous(breaks=1:10)+
    facet_wrap(vars(type, sample_size))
summarize_profanity.R
library(data.table)  # fread(), data.table()
library(stringr)     # str_c()
library(sentimentr)  # get_sentences(), profanity(), sentiment()
library(lexicon)     # profanity_banned word list
library(dplyr)       # filter(), mutate(), group_by(), summarize(), union_all()
library(ggplot2)

read_rows<-10000

# Read a sample of one source, split it into sentences, and score profanity and sentiment per sentence
x<-function(typex) {
# sed replaces embedded 0x7f characters so that fread reads the full sample
rawx<-fread(cmd=str_c("sed 's/\x7f/x78/g' ../ds_capstone/data/final/en_US/en_US.", typex, ".txt"),
        header = F,
        stringsAsFactors = F,
        data.table = T,
        blank.lines.skip = T,
        sep = "\n",
        nrows = read_rows,
        col.names = c("text")
        )

rawx[,text:=tolower(text)]
raw_sentences<-get_sentences(rawx$text)


# One row per sentence: profanity counts plus the sentiment score
summarized_rawx<-data.table(profanity(raw_sentences,profanity_list = unique(tolower(lexicon::profanity_banned))),
                            sentiment=sentiment(raw_sentences)[,4])


return(summarized_rawx)
}

blogs<-x("blogs")
news<-x("news")
twitter<-x("twitter")

blogs2<-filter(blogs, sentiment.sentiment<0)%>%mutate(type="blogs")
news2<-filter(news, sentiment.sentiment<0)%>%mutate(type="news")
twitter2<-filter(twitter, sentiment.sentiment<0)%>%mutate(type="twitter")

all<-union_all(blogs2, news2)%>%union_all(twitter2)

ggplot(all)+
    geom_density(mapping = aes(x=sentiment.sentiment, color=type))+
    geom_vline(xintercept = 0)+
    facet_grid(~type)

summarized_blogs<-data.table(blogs%>%
    group_by(element_id)%>%summarize(sentence_count=max(sentence_id),
                        word_count=sum(word_count),
                        profanity_count=sum(profanity_count),
                        profanity=sum(profanity),
                        sentiment=sum(sentiment.sentiment)))
summarized_news<-data.table(news%>%
    group_by(element_id)%>%summarize(sentence_count=max(sentence_id),
                        word_count=sum(word_count),
                        profanity_count=sum(profanity_count),
                        profanity=sum(profanity),
                        sentiment=sum(sentiment.sentiment)))
summarized_twitter<-data.table(twitter%>%
    group_by(element_id)%>%summarize(sentence_count=max(sentence_id),
                        word_count=sum(word_count),
                        profanity_count=sum(profanity_count),
                        profanity=sum(profanity),
                        sentiment=sum(sentiment.sentiment)))


(blogs_profanity_percent<-nrow(summarized_blogs[profanity_count>0,])/nrow(summarized_blogs))
(news_profanity_percent<-nrow(summarized_news[profanity_count>0,])/nrow(summarized_news))
(twitter_profanity_percent<-nrow(summarized_twitter[profanity_count>0,])/nrow(summarized_twitter))
(blogs_profanity_count<-sum(summarized_blogs$profanity_count))
(news_profanity_count<-sum(summarized_news$profanity_count))
(twitter_profanity_count<-sum(summarized_twitter$profanity_count))
# Named profanity_summary to avoid overwriting the function x() defined above
profanity_summary<-data.table(percent=c(blogs_profanity_percent,news_profanity_percent, twitter_profanity_percent),
       count=c(blogs_profanity_count, news_profanity_count, twitter_profanity_count),
       type=as.factor(c("blogs", "news", "twitter")))

ggplot(profanity_summary)+
    geom_col(mapping = aes(x=percent, y=type, fill=type))+
    xlab(str_c("Percent of ", format(read_rows,big.mark = ",")," Documents"))+ylab("Document Type")

sss<-data.table(union_all(summarized_blogs, summarized_news)%>%union_all(summarized_twitter))
sss$type<-c(rep("blogs",read_rows),rep("news",read_rows),rep("twitter",read_rows))

ggplot(sss)+
    geom_density(mapping = aes(x=sentiment, color=type))+
    geom_vline(xintercept = 0)+
    facet_grid(~type)