This document explores the text data sets provided by SwiftKey, with the aim of answering the following questions:
What are the distributions of word frequencies? What are the frequencies of 2-grams and 3-grams in the dataset?
How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?
The analysis aims to inform a prediction model that can run in a Shiny app.
The original data set consists of three text files containing news, Twitter, and blog text. We load all three files but consider only a sample of lines from each in the exploratory analysis.
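The loading and sampling step is not shown in the code below; the following is a minimal sketch of how it might be done, assuming the files sit under final/en_US/ and a 1% sample fraction (both assumptions, not taken from the analysis):

```r
# Read a file and keep a random sample of its lines (illustrative sketch).
set.seed(1234)
read_sample <- function(path, fraction = 0.01) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  sample(lines, size = ceiling(length(lines) * fraction))
}
blogs   <- read_sample("final/en_US/en_US.blogs.txt")
news    <- read_sample("final/en_US/en_US.news.txt")
twitter <- read_sample("final/en_US/en_US.twitter.txt")
```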
Summary of data files (all fields in millions: bytes for size, counts otherwise)
# tabulate per-file summary statistics (size, lines, sentences, words, chars)
source_name <- c(twit_name, news_name, blogs_name)
kable(data.frame(source_name, size, lines, sentences, words, chars))
| source_name | size | lines | sentences | words | chars |
|---|---|---|---|---|---|
| en_US.twitter.txt | 167.0 | 2.4 | 4.0 | 30.4 | 164.7 |
| en_US.news.txt | 205.8 | 0.1 | 0.2 | 2.6 | 15.8 |
| en_US.blogs.txt | 210.0 | 0.9 | 2.5 | 37.3 | 209.3 |
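The calculation behind this summary is not shown in the document; one way such fields could be computed is sketched below, using stringi word and sentence counts as an assumed approximation of the original method:

```r
library(stringi)

# Summarise one file: size, line, sentence, word and character counts (in millions).
summarise_file <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  c(size      = file.size(path) / 1e6,
    lines     = length(lines) / 1e6,
    sentences = sum(stri_count_boundaries(lines, type = "sentence")) / 1e6,
    words     = sum(stri_count_words(lines)) / 1e6,
    chars     = sum(nchar(lines)) / 1e6)
}
```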
Clean data and calculate N-Gram frequencies.
sample_corpus <- corpus(c(blogs, news, twitter))
# load a reference English word list
dictionary <- read.csv("linuxwords.txt", header = FALSE)
# normalise whitespace (tabs, newlines) to single spaces
texts(sample_corpus) <- gsub("\\s", " ", texts(sample_corpus))
# strip mis-encoded characters left over from the source files
texts(sample_corpus) <- gsub("â", "", texts(sample_corpus))
# keep only alphanumeric characters, slashes, apostrophes and spaces
texts(sample_corpus) <- gsub("[^[:alnum:]/' ]", " ", texts(sample_corpus))
# collapse possessive 's
texts(sample_corpus) <- gsub("'s", "s", texts(sample_corpus))
# tokenise, removing punctuation and numbers
token <- tokens(sample_corpus, remove_punct = TRUE, remove_numbers = TRUE)
token_2n <- tokens_ngrams(token, n = 2, concatenator = " ")
token_3n <- tokens_ngrams(token, n = 3, concatenator = " ")
# unigram, 2-gram and 3-gram document-feature matrices, top features and token totals
dfm_1 <- dfm(token, tolower = TRUE)
top_1 <- topfeatures(dfm_1, n = 5000)
no_token <- sum(ntoken(token))
dfm_2 <- dfm(token_2n, tolower = TRUE)
top_2 <- topfeatures(dfm_2, n = 1000)
no_token_2n <- sum(ntoken(token_2n))
dfm_3 <- dfm(token_3n, tolower = TRUE)
top_3 <- topfeatures(dfm_3, n = 1000)
no_token_3n <- sum(ntoken(token_3n))
topNWordsDf <- data.frame(Words=names(top_1[1:20]), Frequency=top_1[1:20])
topNWordsDf
## Words Frequency
## the the 4372
## and and 2364
## to to 2273
## a a 2082
## of of 1856
## in in 1444
## i i 1440
## that that 982
## for for 906
## is is 888
## it it 789
## on on 726
## with with 651
## was was 626
## you you 609
## this this 490
## at at 484
## my my 471
## but but 460
## be be 457
cum_freq <- cumsum(top_1) / no_token
cumFreqDf <- data.frame(Words = names(top_1), Frequency = cum_freq)
ggplot(data = cumFreqDf, aes(x = seq_along(Frequency), y = Frequency)) +
  geom_line() +
  labs(title = "Cumulative frequency of 5000 most common words", x = "Word Number")
This graph shows that it takes only about 135 words to cover 50% of the corpus, and roughly 5,000 words to cover 90%.
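These thresholds can also be read off directly from the cumulative frequencies computed above; the helper below is a small illustrative sketch (coverage_words is not part of the analysis code):

```r
# Number of words needed to reach a given coverage of the sampled corpus.
coverage_words <- function(cum_freq, threshold) min(which(cum_freq >= threshold))
coverage_words(cum_freq, 0.5)   # ~135 words for 50% coverage
coverage_words(cum_freq, 0.9)   # ~5000 words for 90% coverage
```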
Plot word clouds to represent the frequencies of individual words and n-grams. First, individual words:
textplot_wordcloud(dfm_1, min.freq = 60, random.order = FALSE, rot.per = .25,
                   colors = RColorBrewer::brewer.pal(8, "Dark2"))
Plot a word cloud to represent the frequency of 2-grams:
textplot_wordcloud(dfm_2, min.freq = 30, random.order = FALSE, rot.per = .25,
                   colors = RColorBrewer::brewer.pal(8, "Dark2"))
Plot the frequency of the most common 3-grams:
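The plotting code for the 3-grams is not shown; below is a minimal sketch using the top_3 frequencies computed above (the choice of a bar chart and the data-frame name top3Df are assumptions):

```r
# Bar chart of the 20 most frequent 3-grams, most frequent at the top.
top3Df <- data.frame(Ngram = factor(names(top_3[1:20]), levels = rev(names(top_3[1:20]))),
                     Frequency = top_3[1:20])
ggplot(top3Df, aes(x = Ngram, y = Frequency)) +
  geom_col() +
  coord_flip() +
  labs(title = "20 most frequent 3-grams")
```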
Questions and bottlenecks to consider:
- **N-Gram modelling:** Word, 2-gram, and 3-gram frequencies have been calculated and can be used to estimate probabilities. It is not yet clear how these probabilities can be stored and used efficiently to generate predictions (a simple back-off sketch follows this list).
- **Sample dataset size:** The data used here is only a small fraction of the full set. Efficient use of the whole set, and a formal division into training and test sets, are still to be determined.
- **Novel N-grams:** The current method does not allow for novel word combinations. Means of handling these still need to be researched.
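As a starting point for the modelling questions above, below is a minimal sketch of a frequency-based back-off predictor built on the top_1, top_2 and top_3 tables computed earlier; the function predict_next and its back-off rule are illustrative assumptions, not the intended final model:

```r
# Illustrative next-word prediction by simple back-off:
# look up the last two words among the 3-grams, fall back to the last word
# among the 2-grams, and finally to the most frequent unigram.
predict_next <- function(prev_words, top_3, top_2, top_1) {
  prev_words <- tolower(prev_words)
  if (length(prev_words) >= 2) {
    prefix <- paste(tail(prev_words, 2), collapse = " ")
    hits <- top_3[startsWith(names(top_3), paste0(prefix, " "))]
    if (length(hits) > 0) return(sub(".* ", "", names(hits)[1]))
  }
  prefix <- tail(prev_words, 1)
  hits <- top_2[startsWith(names(top_2), paste0(prefix, " "))]
  if (length(hits) > 0) return(sub(".* ", "", names(hits)[1]))
  names(top_1)[1]
}

predict_next(c("one", "of"), top_3, top_2, top_1)
```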