This report is a milestone report for the predictive text model built for the capstone project of the Data Science Specialization. The complete capstone dataset is downloaded from the link provided by the course and contains four folders of text in English, German, Russian and Finnish; this project uses only the English data. After downloading, summarizing and sampling the raw data, an exploratory analysis is performed, and a brief description of the plans for the prediction algorithm and the Shiny App is given.
We download and unzip the raw data from the link and use the English text files in the folder ‘/en_US’, namely ‘en_US.blogs.txt’, ‘en_US.news.txt’ and ‘en_US.twitter.txt’, as our working database. We then read the text files into R with the readLines() function, which returns a character vector with one element per line. The size of each text file in megabytes (MB), together with its number of lines, number of words and number of numeric tokens, is summarized in the table below:
| | Size_MB | Lines | Words | Nums |
|---|---|---|---|---|
| Blog | 200.4242 | 899288 | 37334147 | 157373 |
| News | 196.2775 | 1010242 | 34372530 | 306583 |
| Twitter | 159.3641 | 2360148 | 30373603 | 191792 |
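For reference, below is a minimal sketch of how the figures in this table can be computed for one of the files (the full, commented code is collected at the end of this report); the exact path depends on where the zip archive was extracted, here assumed to be ./data/:

```r
# Summary figures for one file; the path assumes the zip was unpacked under ./data/
f       <- "./data/final/en_US/en_US.blogs.txt"
blogs   <- readLines(f, encoding = "UTF-8", skipNul = TRUE)
size_mb <- file.info(f)$size / 2^20                       # file size in MB
n_lines <- length(blogs)                                  # one element per line
count_w <- function(x) sum(sapply(gregexpr("\\S+", x), length))
n_words <- count_w(blogs)                                 # whitespace-separated tokens
n_nums  <- n_words - count_w(gsub("\\d", "", blogs))      # purely numeric tokens
c(Size_MB = size_mb, Lines = n_lines, Words = n_words, Nums = n_nums)
```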
The table above shows that the dataset is fairly large, so we need to sample a smaller subset of the data. The ratio ‘Words/Lines’ (about 42 words per line for the blog file versus about 13 for the Twitter file) indicates that blog entries tend to use long sentences while tweets tend to be short. The ratio ‘Nums/Words’ (about 0.9% for news versus 0.4–0.6% for the other two files) shows that the news data tends to use more numbers than the blog or Twitter data.
Based on this summary, we work with a smaller subset of the data and create a separate sample dataset: we randomly select 1% of the lines from each text file (percent <- 0.01 in the code listed at the end) and combine them into a single sample.
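A minimal sketch of this sampling step, assuming texts is the named list of character vectors returned by readLines() above:

```r
# Draw a reproducible 1% sample of lines from each file and combine them
set.seed(123)
percent <- 0.01
sample_lines <- function(x) x[sample(seq_along(x), round(percent * length(x)))]
text_sub <- unlist(lapply(texts, sample_lines), use.names = FALSE)
```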
We perform an exploratory analysis on the sample data in this section. First we create a corpus and tidy it: we convert all text to lowercase, remove profanity (a list of bad words) and English stopwords, and then strip special symbols, URL patterns, numbers, punctuation and extra whitespace, stemming the remaining words.
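A condensed sketch of this cleaning pipeline with the tm package (the full version, including URL/symbol removal and stemming, is in the code at the end); here text_sub is the combined sample and badwords a profanity list read from a local file:

```r
library(tm)
library(magrittr)

mycorpus <- Corpus(VectorSource(tolower(text_sub))) %>%       # lowercase the sample
  tm_map(removeWords, c(badwords, stopwords("english"))) %>%  # profanity and stopwords
  tm_map(removeNumbers) %>%
  tm_map(removePunctuation) %>%
  tm_map(stripWhitespace)
```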
After these cleaning procedures, we build a two-column data frame of unique words and their frequencies, ordered by decreasing frequency. As an example, we illustrate the 20 most frequent words in a bar plot.
We also create a word cloud to give a qualitative view of the distribution of words in the sample.
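A short sketch, condensed from the code at the end, of how the frequency table, the bar plot and the word cloud are produced from the cleaned corpus:

```r
library(tm)
library(ggplot2)
library(wordcloud)
library(RColorBrewer)

mytdm      <- TermDocumentMatrix(mycorpus)
word_freqs <- sort(rowSums(as.matrix(mytdm)), decreasing = TRUE)
df_w       <- data.frame(word = names(word_freqs), freq = word_freqs, row.names = NULL)

# bar plot of the 20 most frequent words
ggplot(df_w[1:20, ], aes(x = reorder(word, freq), y = freq)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(x = "", y = "Counts of words", title = "Top 20 Most Frequent Unigrams")

# word cloud of up to 100 words
set.seed(123)
wordcloud(words = df_w$word, freq = df_w$freq, max.words = 100,
          random.order = FALSE, colors = brewer.pal(12, "Set3"))
```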
Building on the exploratory analysis above, we will next consider 2-grams and 3-grams. In analogy to the unigrams of the previous section, we will build two-column data frames of bigrams and trigrams ordered by frequency. As a next step, we need to think of a better way of cleaning the corpus, after which we will build a table of unique n-grams and their frequencies. The next word will then be predicted from the previous one, two or three words. For unseen n-grams that do not appear in the corpora, we will use a backoff algorithm to assign weighted probabilities across trigrams, bigrams and unigrams. We also need to evaluate the accuracy of the prediction model and keep the code efficient enough to serve a user-friendly Shiny app.
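As a rough illustration of the planned n-gram step (a sketch, not the final model), the code below builds bigram and trigram frequency tables with RWeka's NGramTokenizer and performs a very naive backoff lookup; clean_text stands for the cleaned sample as a plain character vector:

```r
library(RWeka)

# frequency table of n-grams of a given order, sorted by decreasing frequency
ngram_freq <- function(text, n) {
  tok <- NGramTokenizer(text, Weka_control(min = n, max = n))
  df  <- as.data.frame(table(tok), stringsAsFactors = FALSE)
  names(df) <- c("ngram", "freq")
  df[order(df$freq, decreasing = TRUE), ]
}

bi  <- ngram_freq(clean_text, 2)
tri <- ngram_freq(clean_text, 3)

# naive backoff: look in the trigram table first, then fall back to bigrams;
# assumes the query words contain no regex metacharacters
predict_next <- function(prev_two, prev_one) {
  hit <- tri[grepl(paste0("^", prev_two, " "), tri$ngram), ]
  if (nrow(hit) == 0) hit <- bi[grepl(paste0("^", prev_one, " "), bi$ngram), ]
  if (nrow(hit) == 0) return(NA_character_)
  tail(strsplit(hit$ngram[1], " ")[[1]], 1)  # last word of the best match
}
```

For example, predict_next("thank you", "you") would return the word that most frequently follows "thank you" in the sample; the real model will replace this lookup with a weighted backoff over the full frequency tables.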
#setwd('your_working_directory')
#install and load required packages
#install.packages(c('magrittr','knitr','tm','ggplot2','SnowballC','wordcloud','RColorBrewer','RWeka','slam'))
#libs<-c('magrittr','knitr','ggplot2','tm','SnowballC','wordcloud','RColorBrewer','RWeka','slam')
#lapply(libs,require,character.only=T)
#download and unzip data
#url<-'https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip'
#if(!file.exists("data")){dir.create("data")}
#download.file(url,destfile = 'data/Coursera-SwiftKey.zip', method='curl')
#unzip('data/Coursera-SwiftKey.zip',exdir='data')
#list the English files
#library(magrittr)
#fls<-paste('./data/',list.files('./data',pattern='en_US',recursive = T),sep='') %>% as.list()
#size of each file in megabyte(MB)
#size<-sapply(fls,function(x){file.info(x)$size/2**20})
#read in the textfiles
#texts<-sapply(fls, function(x){readLines(x,encoding = "UTF-8", skipNul = T)} )
#names(texts)<-c('blogs_txt','news_txt','twitter_txt')
#number of lines in each textfile
#lines<-sapply(texts,length)
#number of words in each textfile
#counter_w<-function(x){sum(sapply(gregexpr("\\S+", x), length))}
#words<- sapply(texts,counter_w)
#number of numbers in each file
#wordsNonum<-sapply(texts,function(x){gsub('\\d','',x) %>% counter_w()})#delete numbers
#nums<-words-wordsNonum
#data summary in a data frame
#df<- data.frame(
#Size_MB=size,
#Lines=lines,
#Words=words,
#Nums =nums,
#row.names = c('Blog','News','Twitter')
#)
#kable(df,align='c',caption='A summary of text files')
#sample text files
#percent<-0.01
#set.seed(123)
#index<-sapply(texts, function(x){sample(1:length(x),round(percent*length(x)))})
#blog_s<-texts$blogs_txt[index$blogs_txt]
#news_s<-texts$news_txt[index$news_txt]
#twitter_s<-texts$twitter_txt[index$twitter_txt]
#text_sub<-c(blog_s,news_s,twitter_s)
#data cleaning
#removePattern<-function(x,pattern){gsub(pattern,'',x)}
#import bad words and blank out a few overly broad patterns so they are not removed from the corpus
#badwords<-readLines("badwords.txt") %>% removePattern(pattern='^god$|adult|bloody')
#fun_clean<-function(x){
# x=tolower(x)
# x=gsub("’","'",x)#use apostrophe as contraction symbol
# x=removePattern(x,pattern="[^[:alnum:][:space:]']")#symbols
# x=removePattern(x,pattern='http|http\\w+|www|www\\w+')#url
# return(x)
#}
#mytext<-fun_clean(text_sub)
#create a corpus and make it a clean plain text document
#mycorpus<-Corpus(VectorSource(mytext))%>%tm_map(PlainTextDocument) %>%tm_map(removeWords,c(badwords,stopwords('english'))) %>% tm_map(removeNumbers) %>%tm_map(removePunctuation) %>% tm_map(stemDocument) %>%tm_map(stripWhitespace)
#construct the term-document matrix and remove sparse terms
#mytdm<-TermDocumentMatrix(mycorpus) %>% removeSparseTerms(sparse=0.99)
#inspect(mytdm[,1])# the tdm is very sparse
#convert mytdm as matrix
#mymatrix<-as.matrix(mytdm)
#order word counts in decreasing order
#word_freqs <-sort(rowSums(mymatrix), decreasing=T)
#create a data frame by words and their frequencies
#df_w<-data.frame(word=names(word_freqs), freq=word_freqs,row.names = NULL)
#the top 20 most frequent words
#top20<-df_w[1:20,]
#make a ggplot
#ggplot(top20, aes(x=reorder(word,freq), y= freq))+
# geom_bar(stat='identity',fill='steelblue',size=2)+
# labs(x='', y="Counts of words")+ggtitle('Top 20 Most Frequent Unigrams')+coord_flip()
#plot word cloud
#set.seed(123)
#wordcloud(words=df_w$word,freq=df_w$freq, random.color=F, random.order=F,color=brewer.pal(12, "Set3"), min.freq=1, max.words=100, scale=c(3, 0.3))