This document is the milestone report for the second week of the capstone project, the tenth and final course of the Data Science Specialization. The project covers the creation of a text prediction model: given an incomplete sentence or string of words, the model will suggest a single next word. The training data is spread across three .txt files containing text from blogs, news articles and Twitter.
The data is provided by the course's corporate partner SwiftKey and can be acquired and read into R with the code below. The unzipped file contains a folder called final, which holds datasets for four languages: German (de_DE), English (en_US), Finnish (fi_FI) and Russian (ru_RU). The prediction model will be developed using the English training data only.
# Packages used throughout this report
library(knitr); library(dplyr); library(ggplot2); library(plotly)
library(tm); library(tokenizers); library(tidytext)
url_file<-'https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip'
download.file(url_file,'SwiftKey.zip')
unzip('SwiftKey.zip')
# skipNul = TRUE avoids warnings about embedded nul characters in the raw files
twitter<-readLines('final/en_US/en_US.twitter.txt',skipNul=TRUE)
blogs<-readLines('final/en_US/en_US.blogs.txt',skipNul=TRUE)
news<-readLines('final/en_US/en_US.news.txt',skipNul=TRUE)
This helper function extracts all the words from the twitter, blogs and news character vectors. It selects all stretches of alphabetical characters with strsplit(), unlists the result, converts the words to lower case and returns them as a single-column data frame.
word_extractor <- function(x){
  # split each line on runs of non-alphabetical characters,
  # flatten the resulting list and convert the words to lower case
  output<-data.frame(tolower(unlist(strsplit(x,split="[^[:alpha:]]+"))))
  colnames(output)<-'word'
  output
}
w_twitter<-word_extractor(twitter)
w_blogs<-word_extractor(blogs)
w_news<-word_extractor(news)
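# A quick illustration of word_extractor on a made-up sentence (not part of the corpus).
# Note that contractions are split at the apostrophe, so "it's" yields "it" and "s".
word_extractor("Hello, world! It's a sunny day.")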
# Summary table of line and word counts per dataset
summary_table<-data.frame(matrix(nrow=2,ncol=4))
colnames(summary_table)<-c('twitter','blogs','news','combined')
rownames(summary_table)<-c('lines','words')
summary_table[1,]<-c(length(twitter),length(blogs),length(news),length(c(twitter,blogs,news)))
summary_table[2,]<-c(length(w_twitter$word),length(w_blogs$word),length(w_news$word),
length(c(w_twitter$word,w_blogs$word,w_news$word)))
kable(summary_table)
|       | twitter  | blogs    | news    | combined |
|-------|----------|----------|---------|----------|
| lines | 2360148  | 899288   | 77259   | 3336695  |
| words | 30720047 | 37981296 | 2676302 | 71377645 |
While the twitter dataset has by far the most entries (lines), it is surpassed by the blogs dataset in the number of words. The news dataset is substantially smaller than either of the other two in both word and line count.
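To make this more concrete, the average number of words per line can be derived directly from the summary table above (a quick sanity check rather than part of the analysis):
round(summary_table[2,]/summary_table[1,],1)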
twitter_wfreq<-w_twitter%>%
count(word,sort=T)%>%
mutate(id='twitter')
blogs_wfreq<-w_blogs%>%
count(word,sort=T)%>%
mutate(id='blogs')
news_wfreq<-w_news%>%
count(word,sort=T)%>%
mutate(id='news')
word_freq<-rbind(twitter_wfreq,blogs_wfreq,news_wfreq)
For each dataset (twitter, blogs and news), the frequencies of the 20 most commonly occurring words are displayed.
top20wf<-word_freq%>%
group_by(id)%>%
slice_head(n=20)
gwords<-ggplot(top20wf,aes(x=reorder(word,n),y=n,col=id,fill=id))+
geom_col()+
coord_flip()+
xlab('word')+
ylab('count')+
facet_grid(id~.,scales='free_y')
ggplotly(gwords)
For each dataset (twitter, blogs and news) the distribution of line lengths, obtained with nchar(), is shown as a boxplot. This is useful to see whether a dataset contains particularly long or short lines of text.
par(mfrow=c(1,3))
boxplot(nchar(twitter),main='twitter')
boxplot(nchar(blogs),main='blogs')
boxplot(nchar(news),main='news')
Until 2017, after this data was collected, Twitter's character limit was 140, which explains the much shorter line lengths compared with the other two sources. The blogs data, by contrast, contains some extremely long entries.
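As a rough numeric complement to the boxplots, the line-length distributions can also be summarised directly (the exact values will depend on the downloaded files):
sapply(list(twitter=twitter,blogs=blogs,news=news),function(x) summary(nchar(x)))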
So far we have looked at the data without its context, i.e., we have considered lines and individual words in isolation. Next we explore how words are used in context:

- Which words are commonly associated with one another?
- Across how many words can we feasibly model this word association?
We will make use of the tokenizers package, specifically its tokenize_ngrams() function, which performs most of the work. It does not, however, do any data cleaning, so it is advisable to remove punctuation and numbers before running it.
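As a small illustration of what tokenize_ngrams() returns, consider a made-up sentence (not from the corpus); with n = 2 and n_min = 1 the output is a list with one character vector per input string, containing both the unigrams and the bigrams:
tokenize_ngrams("the quick brown fox", n = 2, n_min = 1)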
all_data<-c(twitter,blogs,news) #combining all three datasets
all_data<-
all_data%>%
removeNumbers()%>%
removePunctuation()
# Sampling 5% of the data; the sample will be skewed towards twitter lines, which are most abundant
set.seed(1234)   # arbitrary seed so the sample is reproducible
subset<-sample(all_data,floor(length(all_data)*0.05))
tokenized_subset<-tokenize_ngrams(subset,lowercase = T,n=4,n_min=1)
The plot below shows the frequency of word combinations (n-grams) for n = 1 to 4, i.e., unigrams, bigrams, trigrams and quadgrams. The code above demonstrates how all n-grams can be created in a single data frame, but that makes it hard to keep track of n. Creating each n-gram set separately and assigning it its own identifier keeps n explicit, without having to infer it from each token afterwards.
unigrams<-data.frame(unlist(tokenize_ngrams(subset,lowercase = T,n=1,n_min=1,
stopwords = stop_words$word)))
bigrams<-data.frame(unlist(tokenize_ngrams(subset,lowercase = T,n=2,n_min=2,
stopwords = stop_words$word)))
trigrams<-data.frame(unlist(tokenize_ngrams(subset,lowercase = T,n=3,n_min=3,
stopwords = stop_words$word)))
quadgrams<-data.frame(unlist(tokenize_ngrams(subset,lowercase = T,n=4,n_min=4,
stopwords = stop_words$word)))
colnames(unigrams)<-'ngram'
colnames(bigrams)<-'ngram'
colnames(trigrams)<-'ngram'
colnames(quadgrams)<-'ngram'
unigrams$id<-'uni'
bigrams$id<-'bi'
trigrams$id<-'tri'
quadgrams$id<-'quad'
ngrams<-rbind(unigrams,bigrams,trigrams,quadgrams)
freqngram<-ngrams%>%
group_by(id)%>%
count(ngram,sort=T)%>%
slice_head(n=20)
gram_plot<-ggplot(freqngram,aes(x=reorder(ngram,n),y=n,color=id,fill=id))+
geom_col()+
coord_flip()+
xlab('')+
ylab('count')+
facet_grid(id~.,scales='free_y')+
theme(axis.text.x = element_text(size =6),
axis.text.y = element_text(size =5))
ggplotly(gram_plot)
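Finally, to indicate where this exploration is heading, below is a minimal sketch (not the final model) of how bigram counts like those above could drive a next-word suggestion. It reuses the bigrams object created earlier; the tidyr::separate() step and the predict_next() helper are illustrative assumptions, not prescribed by the course material.
library(tidyr)
# Split each bigram into the leading word and the word that follows it
bigram_counts<-bigrams%>%
  count(ngram,sort=TRUE)%>%
  separate(ngram,into=c('w1','w2'),sep=' ')
# Hypothetical helper: return up to three of the most frequent followers of a word
predict_next<-function(word){
  bigram_counts%>%
    filter(w1==tolower(word))%>%
    slice_head(n=3)%>%
    pull(w2)
}
predict_next('happy')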