Synopsis

This document explores the text datasets provided by SwiftKey with the aim of answering the following questions:

What are the distributions of word frequencies? What are the frequencies of 2-grams and 3-grams in the dataset?

How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?

The analysis aims to inform a prediction model that can run in a Shiny app.

Data Processing

Data Loading

The original dataset consists of three text files drawn from news articles, Twitter, and blogs. We load all three files but consider only a sample of lines from each in the exploratory analysis.
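A minimal sketch of this loading and sampling step is shown below; the 1% sampling fraction and the read_sample helper are illustrative assumptions, not the exact values used in this report.

# Sketch: read each file and keep a random sample of its lines.
# The 1% fraction and the helper name are assumptions for illustration.
set.seed(1234)
sample_frac <- 0.01

read_sample <- function(path, frac) {
  txt <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  sample(txt, size = floor(length(txt) * frac))
}

twitter <- read_sample("en_US.twitter.txt", sample_frac)
news    <- read_sample("en_US.news.txt",    sample_frac)
blogs   <- read_sample("en_US.blogs.txt",   sample_frac)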

Summary of data files (all fields in millions)

library(knitr)  # for kable()

source_name <- c(twit_name, news_name, blogs_name)
kable(data.frame(source_name, size, lines, sentences, words, chars))
source_name          size   lines  sentences  words  chars
en_US.twitter.txt   167.0     2.4        4.0   30.4  164.7
en_US.news.txt      205.8     0.1        0.2    2.6   15.8
en_US.blogs.txt     210.0     0.9        2.5   37.3  209.3
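The size, lines, sentences, words and chars fields passed to kable are computed in a chunk not shown above; one possible way to derive them (a sketch, assuming the stringi package and millions as units) is:

# Sketch of how the summary fields could be computed (assumed, not the
# original chunk): size in millions of bytes, counts in millions.
library(stringi)

summarise_file <- function(path) {
  txt <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  c(size      = file.size(path) / 1e6,
    lines     = length(txt) / 1e6,
    sentences = sum(stri_count_boundaries(txt, type = "sentence")) / 1e6,
    words     = sum(stri_count_words(txt)) / 1e6,
    chars     = sum(nchar(txt)) / 1e6)
}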
Clean data and calculate N-gram frequencies.
library(quanteda)  # corpus, tokens and dfm functions

sample_corpus <- corpus(c(blogs, news, twitter))
dictionary <- read.csv("linuxwords.txt", header = FALSE)

# normalise whitespace and strip mis-encoded characters
texts(sample_corpus) <- gsub("\\s", " ", sample_corpus)
texts(sample_corpus) <- gsub("â", "", sample_corpus)
# keep only alphanumerics, slashes, apostrophes and spaces
texts(sample_corpus) <- gsub("[^[:alnum:]///' ]", " ", sample_corpus)
texts(sample_corpus) <- gsub("'s", "s", sample_corpus)
# tokenise, removing punctuation and numbers
token <- tokenize(sample_corpus, remove_punct = TRUE, remove_numbers = TRUE)
token_2n <- tokens_ngrams(token, n = 2, concatenator = " ")
token_3n <- tokens_ngrams(token, n = 3, concatenator = " ")

# unigram document-feature matrix and top features
dfm_1 <- dfm(token, tolower = TRUE, stem = FALSE, remove_punct = TRUE)
top_1 <- topfeatures(dfm_1, n = 5000)
no_token <- length(token$text1) + length(token$text2) + length(token$text3)

# 2-gram frequencies
dfm_2 <- dfm(token_2n, tolower = TRUE, stem = FALSE)
top_2 <- topfeatures(dfm_2, n = 1000)
no_token_2n <- length(token_2n$text1) + length(token_2n$text2) + length(token_2n$text3)

# 3-gram frequencies
dfm_3 <- dfm(token_3n, tolower = TRUE, stem = FALSE)
top_3 <- topfeatures(dfm_3, n = 1000)
no_token_3n <- length(token_3n$text1) + length(token_3n$text2) + length(token_3n$text3)

topNWordsDf <- data.frame(Words=names(top_1[1:20]), Frequency=top_1[1:20])

topNWordsDf
##      Words Frequency
## the    the      4372
## and    and      2364
## to      to      2273
## a        a      2082
## of      of      1856
## in      in      1444
## i        i      1440
## that  that       982
## for    for       906
## is      is       888
## it      it       789
## on      on       726
## with  with       651
## was    was       626
## you    you       609
## this  this       490
## at      at       484
## my      my       471
## but    but       460
## be      be       457

library(ggplot2)

cum_freq <- cumsum(top_1) / no_token
topNWordsDf <- data.frame(Words = names(top_1), Frequency = cum_freq)
ggplot(topNWordsDf, aes(x = seq_along(Frequency), y = Frequency)) +
  geom_line() +
  labs(title = "Cumulative frequency of 5000 most common words", x = "Word Number")

The graph shows that only about 135 words are needed to cover 50% of the sampled corpus, and roughly 5000 words to cover 90%.
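These two figures can be read directly off the cum_freq vector computed above, for example:

# words needed for 50% and 90% coverage of the sampled corpus;
# the second value is NA if 90% is not reached within the 5000 words kept
which(cum_freq >= 0.5)[1]
which(cum_freq >= 0.9)[1]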

Plot a word cloud to represent the frequency of individual words:

textplot_wordcloud(dfm_1, min.freq = 60, random.order = FALSE,
                    rot.per = .25, 
                    colors = RColorBrewer::brewer.pal(8,"Dark2"))

Plot a word cloud to represent the frequency of 2-grams:

textplot_wordcloud(dfm_2, min.freq = 30, random.order = FALSE,
                    rot.per = .25,
                    colors = RColorBrewer::brewer.pal(8,"Dark2"))

Plot the frequency of the most common 3-grams:
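The corresponding code chunk is not shown here; a possible sketch using the top_3 frequencies computed earlier is given below (plotting the top 20 is an assumption).

# Sketch: bar chart of the 20 most frequent 3-grams (top_3 from above).
top3_df <- data.frame(ngram = names(top_3[1:20]), Frequency = top_3[1:20])
ggplot(top3_df, aes(x = reorder(ngram, Frequency), y = Frequency)) +
  geom_col() +
  coord_flip() +
  labs(title = "Most common 3-grams", x = "3-gram")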

Plans For Final Project

Questions and bottlenecks to consider:

  1. N-gram modelling: Word, 2-gram, and 3-gram frequencies have been calculated and can be used to estimate probabilities. It is not yet clear how these probabilities can be stored and queried efficiently to create predictions; one possible direction is sketched after this list.

  2. Sample dataset size: The data used here is only a small fraction of the full dataset. Efficient use of the whole set and a formal division of the data are still to be determined.

  3. Novel n-grams: The current method does not allow for novel word combinations (n-grams that never occur in the training data). Methods for handling these still need to be researched.
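As a starting point for item 1, one possible, but by no means final, approach is a plain maximum-likelihood lookup: estimate the probability of a third word following a 2-gram as count(w1 w2 w3) / count(w1 w2), using the top_2 and top_3 frequency vectors already computed. The predict_next helper below is purely illustrative; a real model would also need back-off or smoothing for the novel n-grams mentioned in item 3.

# Illustrative maximum-likelihood lookup on the 2-gram/3-gram counts above.
# predict_next is a hypothetical helper, not the final prediction model.
predict_next <- function(w1, w2, tri_counts = top_3, bi_counts = top_2) {
  prefix <- paste(w1, w2)
  if (is.na(bi_counts[prefix])) return(NA_character_)      # 2-gram prefix never seen
  matches <- tri_counts[startsWith(names(tri_counts), paste0(prefix, " "))]
  if (length(matches) == 0) return(NA_character_)          # no 3-gram completion (item 3)
  probs <- matches / bi_counts[prefix]                     # count(w1 w2 w3) / count(w1 w2)
  best <- names(which.max(probs))                          # most probable 3-gram
  tail(strsplit(best, " ")[[1]], 1)                        # return its final word
}

predict_next("one", "of")   # result depends on the sampled corpus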