We start by downloading the zip file containing the project data and unzipping the files into a directory “data” in the working directory.
download.file(url="https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip", destfile = "Coursera-SwiftKey.zip",quiet = TRUE )
unzip("Coursera-SwiftKey.zip",exdir = "data")
Read in the English versions of the blogs, news and twitter data using the readLines function.
blogs <- readLines("data/final/en_US/en_US.blogs.txt", encoding="UTF-8",skipNul = TRUE,warn = FALSE)
news <- readLines("data/final/en_US/en_US.news.txt", encoding="UTF-8",skipNul = TRUE, warn = FALSE)
twitter <-readLines("data/final/en_US/en_US.twitter.txt", encoding="UTF-8",skipNul = TRUE, warn = FALSE)
Summary of the three data sets: blogs.txt, news.txt and twitter.txt.
library(stringi)
corpus <- list(blogs,news,twitter)
linecharstat<- sapply(corpus,stri_stats_general)[c('Lines','Chars'),]
wordcount <- sapply(corpus,stri_stats_latex)[c('Words'),]
wordsummary <- sapply(corpus,function(x) summary(stri_count_words(x))[c('Min.','Mean','Max.')])
descstat <- as.data.frame(rbind(linecharstat,wordcount,wordsummary))
colnames(descstat)<- c("blogs.txt","news.txt","twitter.txt")
rownames(descstat) <- c("Number of Lines","Number of Characters","Number of Words","Min words per line","Mean words per line","Max.words per line")
rm(corpus)
format(descstat,scientific = FALSE,digits=0,big.mark=",")
## blogs.txt news.txt twitter.txt
## Number of Lines 899,288 77,259 2,360,148
## Number of Characters 206,824,382 15,639,408 162,096,241
## Number of Words 37,570,839 2,651,432 30,451,170
## Min words per line 0 1 1
## Mean words per line 42 35 13
## Max.words per line 6,726 1,123 47
We will now explore the words that appear most frequently in the three data sets. The analysis is done both with and without stop words. Stop words are very common words such as "the", "a" and "and"; an explanation of stop words is given at https://en.wikipedia.org/wiki/Stop_words.
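For reference, the stop word list removed in the analysis below can be inspected directly; a quick sketch using the same stopwords("english") call used later:
head(stopwords("english"), 10)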
Before exploring word frequencies, the data has to be cleaned and preprocessed to facilitate the exploratory analysis. Numbers, punctuation, symbols, separators, URLs, Twitter handles and hashtags, etc. are removed from the corpus. The quanteda package in R is used to create a document-feature matrix (dfm). In the dfm, each line in the corpus is treated as a document and each unique word in the line is treated as a feature, so the dfm has as many rows as there are lines and as many columns as there are unique words in the corpus. The dfm provides the line-wise frequency of each of these words. While creating the dfm, words or word combinations that occur in only one line are excluded to reduce the size of the dfm.
library(quanteda)
library(data.table)
# Build a document-feature matrix of n-grams, keeping stop words
createdfm <- function(x, n) {
  doc <- dfm(x, tolower = TRUE, remove_numbers = TRUE, remove_punct = TRUE,
             remove_symbols = TRUE, remove_twitter = TRUE, remove_separators = TRUE,
             remove_hyphens = TRUE, remove_url = TRUE, ngrams = n, concatenator = " ")
  doc <- dfm_compress(doc)
  doc <- dfm_trim(doc, min_docfreq = 2)   # drop features that occur in only one line
  return(doc)
}
# Same as createdfm, but with English stop words removed
createstopdfm <- function(x, n) {
  docstop <- dfm(x, remove = stopwords("english"), tolower = TRUE, remove_numbers = TRUE,
                 remove_punct = TRUE, remove_symbols = TRUE, remove_twitter = TRUE,
                 remove_separators = TRUE, remove_hyphens = TRUE, remove_url = TRUE,
                 ngrams = n, concatenator = " ")
  docstop <- dfm_trim(dfm_compress(docstop), min_docfreq = 2)
  return(docstop)
}
# Summarise a dfm as a frequency table sorted by total count
createdatatable <- function(x) {
  table <- data.table(Words = featnames(x), TotalCount = colSums(x),
                      Linespresent = docfreq(x), stringsAsFactors = FALSE)
  table <- table[order(TotalCount, decreasing = TRUE)]
  table$percentoftotal <- (table$TotalCount / sum(table$TotalCount)) * 100
  table$cumpercent <- cumsum(table$percentoftotal)
  return(table)
}
Exploratory Analysis of Blogs Data
blogsunigram = createdfm(blogs,1)
blogsstopunigram = createstopdfm(blogs,1)
blogsunigramtab = createdatatable(blogsunigram)
blogsstopunigramtab =createdatatable(blogsstopunigram)
The top 10 most frequent words appearing in the blogs data, with and without stop words, are shown in the tables below; the table on the right is without stop words.
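These tables can be printed from the unigram frequency tables built above; a sketch of the calls, showing only the first three columns:
blogsunigramtab[1:10, .(Words, TotalCount, Linespresent)]       # with stop words
blogsstopunigramtab[1:10, .(Words, TotalCount, Linespresent)]   # without stop words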
A word cloud of the top 50 words, with and without stop words, in the blogs text is shown below; the figure on the right is without stop words.
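Such word clouds can be drawn from the same frequency tables; a minimal sketch using the wordcloud package (the package choice is an assumption, quanteda's textplot_wordcloud() is an alternative):
library(wordcloud)
par(mfrow = c(1, 2))   # with stop words (left) and without (right)
wordcloud(words = blogsunigramtab$Words[1:50], freq = blogsunigramtab$TotalCount[1:50])
wordcloud(words = blogsstopunigramtab$Words[1:50], freq = blogsstopunigramtab$TotalCount[1:50])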
Let us now examine the word pair (bigram) combinations in the blogs data.
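The bigram code for the blogs data is not echoed in this report; a sketch mirroring the news chunk further below:
blogsbigram = createdfm(blogs, 2)             # word pairs, stop words retained
blogsbigramtab = createdatatable(blogsbigram)
blogsbigramtab[1:10, .(Words, TotalCount, Linespresent)]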
Below are the top 10 frequent word pairs appearing in blogs text data.
## Words TotalCount Linespresent
## 1: of the 187136 139416
## 2: in the 154187 124230
## 3: to the 86031 74644
## 4: on the 75266 66050
## 5: to be 68104 59442
## 6: and the 58737 52726
## 7: for the 58128 52473
## 8: i was 49556 38672
## 9: and i 49486 43929
## 10: i have 48074 40333
A word cloud of the top 25 word pairs in the blogs data is shown below.
Exploratory Analysis of News data
newsunigram = createdfm(news,1)
newsstopunigram = createstopdfm(news,1)
newsunigramtab = createdatatable(newsunigram)
newsstopunigramtab =createdatatable(newsstopunigram)
The top 10 most frequent words appearing in the news data, with and without stop words, are shown in the tables below; the table on the right is without stop words.
A word cloud of the top 50 words, with and without stop words, in the news text is shown below; the figure on the right is without stop words.
Let us now examine the word pair combinations in the news data.
newsbigram = createdfm(news,2)
newsbigramtab =createdatatable(newsbigram)
rm(news)
Below are the top 10 frequent word pairs appearing in news data.
## Words TotalCount Linespresent
## 1: of the 14096 12097
## 2: in the 13708 11808
## 3: to the 6442 5981
## 4: on the 5537 5133
## 5: for the 5397 5079
## 6: at the 4516 4231
## 7: and the 4050 3857
## 8: in a 4041 3859
## 9: to be 3552 3314
## 10: with the 3323 3163
A word cloud of the top 25 word pairs in the news data is shown below.
Exploratory Analysis of Twitter Data
twitterunigram = createdfm(twitter,1)
twitterstopunigram = createstopdfm(twitter,1)
twitterunigramtab = createdatatable(twitterunigram)
twitterstopunigramtab =createdatatable(twitterstopunigram)
The top 10 most frequent words appearing in the twitter data, with and without stop words, are shown in the tables below; the table on the right is without stop words.
A word cloud of the top 50 words, with and without stop words, in the twitter text is shown below; the figure on the right is without stop words.
Let us now examine the word pair combinations in the twitter data.
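As with the blogs data, the twitter bigram chunk is not echoed; a sketch would be:
twitterbigram = createdfm(twitter, 2)
twitterbigramtab = createdatatable(twitterbigram)
twitterbigramtab[1:10, .(Words, TotalCount, Linespresent)]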
Below are the top 10 frequent word pairs appearing in twitter text data.
## Words TotalCount Linespresent
## 1: in the 78335 76159
## 2: for the 73955 72854
## 3: of the 56873 55032
## 4: on the 48469 47488
## 5: to be 46986 45596
## 6: to the 43400 42683
## 7: thanks for 42983 42778
## 8: at the 37229 36601
## 9: i love 35904 34947
## 10: going to 34270 33604
A word cloud of the top 25 word pairs in the twitter data is shown below.
We will now examine how many unique words are needed to cover 50% or more of the word instances in each data set.
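These coverage counts can be read off the cumpercent column of the unigram frequency tables; a small helper sketch:
# number of top-frequency words needed to reach a given cumulative coverage (%)
coverage <- function(pct, tab) which(tab$cumpercent >= pct)[1]
sapply(c(50, 80, 90), coverage, tab = blogsunigramtab)
sapply(c(50, 80, 90), coverage, tab = newsunigramtab)
sapply(c(50, 80, 90), coverage, tab = twitterunigramtab)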
There are 163660 unique words in the blogs data, including stop words. Interestingly, just 105 words cover 50% of the word usage in the data; 1953 words cover 80% and 6351 words cover 90% of the word usage in the blogs data. Out of the 45422 unique words in the news data, just 178, 2615 and 7071 words cover 50%, 80% and 90% of total word usage. Likewise, 120, 1499 and 4942 words cover 50%, 80% and 90% of word usage in the twitter data, which has 151046 unique words.
This shows that a much smaller set of words or word pairs can give a fair representation of the text corpus, and this forms the basis of the model for predicting the next word.
Based on the exploratory analysis it's evident that we don't need to use the complete corpus to build our n-grams. The following steps are proposed to develop the model.
Use about 30% of the entire corpus, which contains words and word pairs covering about 95% of the word usage in the entire corpus (see the sketch after this list).
Build n-grams up to 5-grams using the quanteda library.
Use the Kneser-Ney smoothing algorithm based on conditional probabilities.
Use a stupid backoff model to estimate the probability of unobserved n-grams.
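As a starting point for the first two steps, the sampling and n-gram construction could reuse the helper functions defined earlier. A minimal sketch (the names sampletext and ngramtabs are illustrative, and building 5-grams even on a 30% sample may need substantial memory):
set.seed(1234)
# assumes the raw blogs, news and twitter vectors are still (or again) in memory
sampletext <- c(sample(blogs, round(length(blogs) * 0.30)),
                sample(news, round(length(news) * 0.30)),
                sample(twitter, round(length(twitter) * 0.30)))
# 1-gram to 5-gram frequency tables, stop words retained for prediction
ngramtabs <- lapply(1:5, function(n) createdatatable(createdfm(sampletext, n)))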