Introduction

This report is part of the Johns Hopkins Data Science Capstone on Coursera. The objective of the course project is to build a Shiny app that predicts the next word based on the word(s) entered.

The aim of this report is to explore the original datasets, clean them and perform some exploratory analysis to gain insight for the course project. The main steps are loading and summarizing the data, sampling and cleaning it, building n-gram frequencies, examining foreign words and vocabulary coverage, and outlining ideas for modelling.

Loading data and basic summary

The data used for this report come from a corpus called HC Corpora. See the readme file for details on the available corpora. The files have been language filtered but may still contain some foreign text. This analysis uses the English datasets (en_US.blogs.txt, en_US.news.txt, en_US.twitter.txt). First the data are read in:

blogs <- readLines("final/en_US/en_US.blogs.txt", encoding="UTF-8")
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding="UTF-8")
#the news file contains characters that break readLines() in text mode, so a binary connection is used
con <- file("final/en_US/en_US.news.txt", open="rb")
news <- readLines(con, encoding="UTF-8")
close(con)

Some summary statistics of the US datasets:

#size of the files in megabytes
size_blogs <- file.info("final/en_US/en_US.blogs.txt")$size / (1024*1024)
size_news <- file.info("final/en_US/en_US.news.txt")$size / (1024*1024)
size_twitter <- file.info("final/en_US/en_US.twitter.txt")$size / (1024*1024)
#approximate word counts (runs of non-word characters are treated as separators)
words_blogs <- sum(sapply(gregexpr("\\W+", blogs), length) + 1)
words_news <- sum(sapply(gregexpr("\\W+", news), length) + 1)
words_twitter <- sum(sapply(gregexpr("\\W+", twitter), length) + 1)
#count of lines
library(R.utils)
lines_blogs <- countLines("final/en_US/en_US.blogs.txt")[1]
lines_news <- countLines("final/en_US/en_US.news.txt")[1]
lines_twitter <- countLines("final/en_US/en_US.twitter.txt")[1]
#count of characters
char_blogs <- sum(nchar(blogs))
char_news <- sum(nchar(news))
char_twitter <- sum(nchar(twitter))

summary <- data.frame(dataset=c("blogs", "news", "twitter"),
                      size=format(round(c(size_blogs, size_news, size_twitter)), big.mark=" ", scientific=FALSE),
                      words=format(c(words_blogs, words_news, words_twitter), big.mark=" ", scientific=FALSE),
                      lines=format(c(lines_blogs, lines_news, lines_twitter), big.mark=" ", scientific=FALSE),
                      characters=format(c(char_blogs, char_news, char_twitter), big.mark=" ", scientific=FALSE))
names(summary) <- c("dataset", "size of dataset (MB)", "count of words", "count of lines", "count of characters")
library(knitr)
kable(summary, align="c")
dataset   size of dataset (MB)   count of words   count of lines   count of characters
blogs              200             39 121 566         899 288           206 824 505
news               196             36 721 104       1 010 242           203 223 159
twitter            159             32 793 388       2 360 148           162 096 031

As seen from the table, the twitter dataset contains the smallest number of characters and words. This is due to the restriction that one tweet can be at most 140 characters long. As the datasets have roughly the same number of words, further analysis is based on equally sized random samples from all three datasets and no weighting between datasets is applied.

Sampling and cleaning files

First a subset of the data is created, because processing all of the data is time consuming and memory intensive. Taking random samples from each dataset should still allow conclusions to be drawn about the datasets as a whole.

#random samples of lines from each dataset for further analysis
set.seed(1234) #fixed seed so the sampling is reproducible
blogs_samp <- blogs[sample(1:length(blogs), 10000)]
news_samp <- news[sample(1:length(news), 10000)]
twitter_samp <- twitter[sample(1:length(twitter), 10000)]
#write the sample files, which are the basis for further analysis
dir.create("./samples", showWarnings=FALSE)
writeLines(blogs_samp, "./samples/blogs.txt")
writeLines(twitter_samp, "./samples/twitter.txt")
writeLines(news_samp, "./samples/news.txt")
#remove the original datasets from memory
rm(blogs, news, twitter)

Next the datasets are tokenized and cleaned. Cleaning removes numbers, punctuation, stopwords and profanity. In the model-building phase it will be decided whether all of these transformations are kept for the final model, because some of them may lower its predictive ability.

library(tm)
#load in the profanity list (from http://www.bannedwordlist.com/lists/swearWords.txt)
con <- file("swearWords.txt", open="rb")
swearwords <- readLines(con, encoding="UTF-8")
close(con)
#function for cleaning the corpora
tokenize <- function(cname) {
        library(tm)
        library(dplyr)
        #helper to replace everything that is not a letter (kept for optional use, not applied below)
        removeSpecialChars <- function(x) gsub("[^a-zA-Z]", " ", x)
        corpus <- Corpus(DirSource(cname)) %>%         #create a corpus from the files
        tm_map(content_transformer(tolower)) %>%       #convert to lower case
        tm_map(removePunctuation) %>%                  #remove punctuation
        tm_map(removeNumbers) %>%                      #remove numbers
        tm_map(removeWords, stopwords("english")) %>%  #remove stopwords
        tm_map(removeWords, swearwords) %>%            #remove profanity
        tm_map(stripWhitespace) %>%                    #strip extra whitespace
        tm_map(PlainTextDocument)                      #coerce to plain text documents, otherwise later steps may error
        return(corpus)
}
#load in the sample files and tokenize them
cname <- file.path(".", "samples")
corpus <- tokenize(cname)

N-grams

For further analysis we calculate the most frequent words (unigrams), 2-word sequences (bigrams) and 3-word sequences (trigrams).

library(RWeka)
#tokenizers for 1-grams, 2-grams and 3-grams
token <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
unidtm <- DocumentTermMatrix(corpus, control = list(tokenize = token))
bitoken <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
bidtm <- DocumentTermMatrix(corpus, control = list(tokenize = bitoken))
tritoken <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
tridtm <- DocumentTermMatrix(corpus, control = list(tokenize = tritoken))
#unigram frequencies, sorted in decreasing order
unifreq <- sort(colSums(as.matrix(unidtm)), decreasing=TRUE)
uniwordfreq <- data.frame(word=names(unifreq), freq=unifreq)
#bigram frequencies, sorted in decreasing order
bifreq <- sort(colSums(as.matrix(bidtm)), decreasing=TRUE)
biwordfreq <- data.frame(word=names(bifreq), freq=bifreq)
#trigram frequencies, sorted in decreasing order
trifreq <- sort(colSums(as.matrix(tridtm)), decreasing=TRUE)
triwordfreq <- data.frame(word=names(trifreq), freq=trifreq)

For a better overview, the word frequencies are plotted. Note that these frequencies do not represent the total frequencies in the full datasets, because only a subset of the data is analysed, but they should give reasonably accurate relative proportions between words.

library(ggplot2)
#ngram frequencies 
#unigram
uniwordfreq$index=c(1:nrow(uniwordfreq))
uniwordfreq$order <- factor(uniwordfreq$word, levels = uniwordfreq[order(-uniwordfreq$freq), "word"])
#plot unigram top words
ggplot(subset(uniwordfreq, index<10), aes(y=freq, x=order))+
  geom_bar(stat="identity", fill="lightblue")+
  ylab("frequency")+
  xlab("")+
  ggtitle("Most frequent words")+
  theme_minimal()

#bigram
biwordfreq$index=c(1:nrow(biwordfreq))
biwordfreq$order <- factor(biwordfreq$word, levels = biwordfreq[order(-biwordfreq$freq), "word"])
#plot bigram top frequencies
ggplot(subset(biwordfreq, index<10), aes(y=freq, x=order))+
  geom_bar(stat="identity", fill="lightblue")+
  ylab("frequency")+
  xlab("")+
  ggtitle("Most frequent 2-word sequencies")+
  theme_minimal()+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

#trigram
triwordfreq$index=c(1:nrow(triwordfreq))
triwordfreq$order <- factor(triwordfreq$word, levels = triwordfreq[order(-triwordfreq$freq), "word"])
#plot trigram top frequencies
ggplot(subset(triwordfreq, index<10), aes(y=freq, x=order))+ 
  geom_bar(stat="identity", fill="lightblue")+
  ylab("frequency")+
  xlab("")+
  ggtitle("Most frequent 3-word sequencies")+
  theme_minimal()+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

There are 51 859 distinct unigrams, 388 579 bigrams and 436 616 trigrams in the sampled corpus.
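These counts can be read directly from the frequency tables built above:

#number of distinct 1-, 2- and 3-grams in the sampled corpus
nrow(uniwordfreq)
nrow(biwordfreq)
nrow(triwordfreq)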

Foreign words

To detect foreign words in the corpus, the letters are analysed first, looking for non-English letters.

#count the occurrences of each letter in the vocabulary and plot the distribution
library(stringr)
library(ggplot2)
library(qdap)
library(dplyr)
as.character(uniwordfreq$word)%>%
  str_split("")%>%
  unlist%>%
  dist_tab%>%
  mutate(Letter=factor(toupper(interval), levels=toupper(interval[order(-freq)])))%>%
  ggplot(aes(Letter, weight=percent))+
  geom_bar(fill="lightblue")+
  ylab("Proportion")+
  scale_y_continuous(breaks=seq(0,12,2),
                     label=function(x)paste0(x,"%"), expand=c(0,0),
                     limits=c(0,12))+
  theme_minimal()

As seen from the plot, the vocabulary contains several letters that do not belong to the English alphabet.

Let's look at the most frequent words that contain these foreign letters.

tomatch <- c("ś", "ß", "ä", "æ", "ö", "ü", "ū", "Ö", "Ü", "Ä", "Ś", "Ø", "ø", "Ō", "Ó", "ó", "ō", "ī", "Ī", "É", "é", "Æ", "Å", "Ā", "å", "ā")
#rows of uniwordfreq whose word contains at least one non-English letter
foreign <- uniwordfreq[grep(paste(tomatch, collapse="|"), uniwordfreq$word), 1:2]
head(foreign)
##            word freq
## björk     björk    6
## café       café    5
## cliché   cliché    4
## córdoba córdoba    4
## québec   québec    4
## josé       josé    3

These words are borrowed from other languages but are used in English, and they are not very frequent. We calculate the proportion of their frequencies relative to all word frequencies, i.e. what share of word occurrences contain non-English letters.

tomatch <- c("ś", "ß", "ä", "æ", "ö", "ü", "ū", "Ö", "Ü", "Ä", "Ś", "Ø", "ø", "Ō", "Ó", "ó", "ō", "ī", "Ī", "É", "é", "Æ", "Å", "Ā", "å", "ā")
#frequencies of words that contain at least one non-English letter
count <- uniwordfreq[grep(paste(tomatch, collapse="|"), uniwordfreq$word), "freq"]
paste0(round(sum(count)/sum(uniwordfreq$freq)*100, 3), "%")
## [1] "0.03%"

Only about 0.03% of all word occurrences come from words containing non-English letters (keeping in mind that only words with non-English letters are counted, not foreign words spelled with English letters). This number is so small that removing these words is not very important: they are infrequent and should not have a big impact on the final model. Keeping them might even improve prediction accuracy, because a user may want the app to predict words that are used in everyday English but contain foreign letters.

Unique words needed to cover most word instances in the language

Next we find out how many unique words are needed to cover 50% and 90% of all word instances in the language.

#function to find how many of the most frequent words are needed to cover
#target_percent of all word instances (uniwordfreq is already sorted by decreasing frequency)
word_pct <- function(target_percent) {
  total_sum <- sum(uniwordfreq$freq)
  cumul_sum <- 0
  i <- 0
  while (cumul_sum/total_sum < target_percent) {
    i <- i+1
    cumul_sum <- cumul_sum + uniwordfreq$freq[i]
  }
  return(i)
}
#make a data frame of vocabulary sizes for different coverage levels
coverage <- c(word_pct(0.1), word_pct(0.2), word_pct(0.3), word_pct(0.4),
              word_pct(0.5), word_pct(0.6), word_pct(0.7), word_pct(0.8), word_pct(0.9))
coverage <- data.frame(coverage=coverage, percent=seq(0.1, 0.9, by=0.1))
#plot it
library(ggplot2)
ggplot(coverage, aes(x=coverage, y=percent))+
  geom_point(colour="blue")+
  geom_line(colour="blue")+
  ylab("percent of all word instances covered")+
  xlab("number of words")+
  scale_y_continuous(label=function(x)paste0(x*100,"%"))+
  theme_minimal()

As seen from the plot, about 958 unique words are needed to cover 50% of the word instances in the language, and about 14 978 unique words to cover 90%.

This analysis shows that not all words are needed to build a reasonably good prediction model, which also helps to save memory in the final app. However, the more word instances we want to cover (the more accurate we want the model to be), the faster the required vocabulary, and with it the memory and CPU cost, grows.
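The same word_pct() function can be used to check how quickly the required vocabulary keeps growing beyond 90% coverage, for example:

#vocabulary sizes needed for higher coverage levels
sapply(c(0.90, 0.95, 0.99), word_pct)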

Ideas for modelling

For modelling, N-grams will be used. The basic concept is as follows: when one word is entered, all bigrams (2-word sequences) that start with that word are looked up, the most frequent bigram is chosen and its second word is predicted. If two words are entered, trigrams (3-word sequences) that start with these two words are looked up, the most frequent trigram is chosen and its third word is predicted. If more words are entered, only the last two are used. If the trigram lookup gives no match, the model backs off to bigrams (using only the last word); if that also gives no match, the most frequent unigram is predicted. This is a very basic concept and may change during model building.
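As a rough illustration of this idea (a sketch only, not the final model), such a back-off lookup could be built directly on the frequency-sorted uniwordfreq, biwordfreq and triwordfreq data frames from the n-gram section; the predict_next() function below is hypothetical:

#sketch only: a hypothetical predict_next() built on the n-gram tables above
predict_next <- function(input) {
  words <- unlist(strsplit(tolower(input), "\\s+"))
  n <- length(words)
  #try trigrams first: look for trigrams starting with the last two words entered
  if (n >= 2) {
    prefix <- paste(words[n-1], words[n])
    hit <- grep(paste0("^", prefix, " "), triwordfreq$word)[1]
    if (!is.na(hit)) {
      return(strsplit(as.character(triwordfreq$word[hit]), " ")[[1]][3])
    }
  }
  #back off to bigrams: look for bigrams starting with the last word only
  hit <- grep(paste0("^", words[n], " "), biwordfreq$word)[1]
  if (!is.na(hit)) {
    return(strsplit(as.character(biwordfreq$word[hit]), " ")[[1]][2])
  }
  #final fallback: the single most frequent word
  as.character(uniwordfreq$word[1])
}
#example usage: predict_next("happy new")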

The accuracy of the models will also be tested. To get better accuracy, stopwords might be kept in the corpus; this will depend on whether keeping them actually improves accuracy.
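One possible way to test this is to build a second version of the corpus that keeps stopwords and compare the accuracy of models trained on both. A hypothetical variant of the tokenize() function above (not used in this report) might look like:

#hypothetical tokenize_keep_stopwords(): same cleaning as tokenize() but without stopword removal
tokenize_keep_stopwords <- function(cname) {
        library(tm)
        library(dplyr)
        Corpus(DirSource(cname)) %>%
        tm_map(content_transformer(tolower)) %>%
        tm_map(removePunctuation) %>%
        tm_map(removeNumbers) %>%
        tm_map(removeWords, swearwords) %>%  #profanity is still removed
        tm_map(stripWhitespace) %>%
        tm_map(PlainTextDocument)
}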

Conclusions

The analysis above showed that working with the original data is very slow and that using subsets of the data is preferable, so a subset of the datasets will also be used for model building. The analysis also showed that 2- and 3-word frequencies (bigrams and trigrams) could be a useful basis for the prediction app.

References

HC Corpora – source corpus for the en_US datasets used in this report (see the corpus readme file for details).
Profanity word list: http://www.bannedwordlist.com/lists/swearWords.txt