Introduction

The first step in building a predictive model for text is understanding the distribution of, and relationships between, the words, tokens, and phrases in the text. The goal of this task is to understand the basic relationships observed in the training data and the preparation required to build the first prediction model. The analysis below uses the EN_US files as a subset to explore further.

Data preparation

The data is loaded into R as plain text and then transformed into the Corpus data structure provided by the text mining framework “tm”. Data frames are not well suited here because they are prone to dimensionality issues; a Corpus is backed by lists instead.

# Set working directory and include packages 
library(stringi)
library(ggplot2)
library(tm)
library(RWeka)
library(SnowballC)
library(wordcloud)
library(dplyr)

setwd("D:/coursera-ds/capstone/Coursera-SwiftKey/final/en_US")

# import the blogs and twitter datasets in text mode
blogs <- readLines("en_US.blogs.txt", encoding="UTF-8") 
twitter <- readLines("en_US.twitter.txt", encoding="UTF-8")
news <- readLines("en_US.news.txt", encoding="UTF-8")
 
# drop non UTF-8 characters and normalise curly quotes 
twitter <- iconv(twitter, from = "latin1", to = "UTF-8", sub="") 
twitter <- stri_replace_all_regex(twitter, "\u2019|`","'") 
twitter <- stri_replace_all_regex(twitter, "\u201c|\u201d|\u201f|``",'"') 
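
The same normalisation could also be applied to the blogs and news datasets. A minimal sketch, not part of the run summarised below, using the same calls as for twitter:

# sketch: apply the same encoding fix and quote normalisation to blogs and news
blogs <- iconv(blogs, from = "latin1", to = "UTF-8", sub = "")
blogs <- stri_replace_all_regex(blogs, "\u2019|`", "'")
blogs <- stri_replace_all_regex(blogs, "\u201c|\u201d|\u201f|``", '"')
news  <- iconv(news, from = "latin1", to = "UTF-8", sub = "")
news  <- stri_replace_all_regex(news, "\u2019|`", "'")
news  <- stri_replace_all_regex(news, "\u201c|\u201d|\u201f|``", '"')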

The details below provide basic counts and summary statistics for the three datasets.

#length 
length(blogs)
## [1] 899288
length(twitter)
## [1] 2360148
length(news)
## [1] 77259
# number of characters per line 
summary( nchar(blogs)   ) 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1      47     156     230     329   40830
summary( nchar(news)    ) 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     2.0   111.0   186.0   202.4   270.0  5760.0
summary( nchar(twitter) ) 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     2.0    37.0    64.0    68.8   100.0   213.0
# word analysis   
w_blogs   <- stri_count_words(blogs) 
w_news    <- stri_count_words(news) 
w_twitter <- stri_count_words(twitter)
summary( w_blogs   ) 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    9.00   28.00   41.75   60.00 6726.00
summary( w_news    ) 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   19.00   32.00   34.62   46.00 1123.00
summary( w_twitter )
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   12.00   12.79   18.00   61.00

Given that we are dealing with a large dataset, we take a sample of the data to explore further. The sample data is then written to a sample folder so it can be converted into a corpus (see the sketch after the sampling code).

sampleTwitter <- twitter[sample(1:length(twitter),10000)]
sampleNews <- news[sample(1:length(news),10000)]
sampleBlogs <- blogs[sample(1:length(blogs),10000)]
sampleData <- c(sampleTwitter,sampleNews,sampleBlogs)
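
The paragraph above mentions writing the sample to the sample folder; a minimal sketch of that step follows (the folder path and the file name en_US.sample.txt are assumptions):

# sketch: persist the combined sample so DirSource can pick it up (file name is assumed)
# (for reproducibility, a set.seed() call before the sampling above would also help)
dir.create("sample", showWarnings = FALSE)
writeLines(sampleData, "sample/en_US.sample.txt")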

As the final step of data preparation, the corpus is created from the sample data.

# create the corpus of the sample data
bag <- Corpus(DirSource("D:/coursera-ds/capstone/Coursera-SwiftKey/final/en_US/sample"))
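
As a quick sanity check (a sketch; the exact output depends on the files present in the sample folder), we can confirm that each file in the folder became one document in the corpus:

# number of documents in the corpus (one per file in the sample folder)
length(bag)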

Data Cleaning

We clean the data by converting all text to lowercase, removing numbers, punctuation and extra whitespace, and stemming the documents. Removal of stop words is not required because we are building a prediction model, and common words are often exactly what needs to be predicted.

# Data cleaning transformations 
bag <- tm_map(bag, content_transformer(tolower))
bag <- tm_map(bag, removePunctuation)
bag <- tm_map(bag, removeNumbers)
bag <- tm_map(bag, stripWhitespace)

# Stemming
bag <- tm_map(bag, stemDocument)

Tokenization

This process involves identifying appropriate tokens such as words, punctuation and numbers. N-gram models are created to explore word frequencies. Using the RWeka package, bigrams and trigrams are created.

# Tokenization 
BiToken <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
TriToken <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
bidtm <- DocumentTermMatrix(bag, control = list(tokenize = BiToken))
tridtm <- DocumentTermMatrix(bag, control = list(tokenize = TriToken))
# removeSparseTerms is not a DocumentTermMatrix control option; apply it as a separate step
bidtm <- removeSparseTerms(bidtm, 0.8)
tridtm <- removeSparseTerms(tridtm, 0.8)

Profanity filtering

This task deals with removing profanity that we do not want to predict. It is required when building the prediction model and is not reviewed in detail while exploring the data and building n-grams.
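
A minimal sketch of how this could be done with tm, assuming a plain-text word list (the file name badwords.txt is a placeholder, one term per line):

# hypothetical profanity list; any plain-text file with one term per line would work
badwords <- readLines("badwords.txt", encoding = "UTF-8")
bag <- tm_map(bag, removeWords, badwords)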

Exploratory analysis

The goal here is to perform a thorough exploratory analysis of the data, understanding the distribution of words and the relationships between words in the corpora. We also examine the frequencies of words and word pairs by building figures and tables.

#bigrams
bifreq <- sort(colSums(as.matrix(bidtm)), decreasing=TRUE)
biwordfreq <- data.frame(word=names(bifreq), freq=bifreq)

#trigrams
trifreq <- sort(colSums(as.matrix(tridtm)), decreasing=TRUE)
triwordfreq <- data.frame(word=names(trifreq), freq=trifreq)
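
Before plotting, the most frequent bigrams and trigrams can be listed as simple tables (a sketch; head is used here only to keep the output short):

# tables of the ten most frequent bigrams and trigrams
head(biwordfreq, 10)
head(triwordfreq, 10)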

#graphs and tables for bigrams
freqplot2 <- filter(biwordfreq,freq>1000)
ggplot(data = freqplot2,aes(word,freq)) +
  geom_bar(stat="identity") +
  ggtitle("bigrams with frequencies > 1000") +
  xlab("bigrams") + ylab("Frequency") +
  theme(axis.text.x=element_text(angle=45, hjust=1))

#word cloud of top 100 bigrams 
wordcloud(names(bifreq), bifreq, max.words=100, scale=c(5, .1), colors=rainbow(6))

#graphs and tables for trigrams
freqplot3 <- filter(triwordfreq,freq>100)
ggplot(data = freqplot3,aes(word,freq)) +
  geom_bar(stat="identity") +
  ggtitle("trigrams with frequencies > 100") +
  xlab("trigrams") + ylab("Frequency")+
  theme(axis.text.x=element_text(angle=45, hjust=1))

#word cloud of top 100 trigrams 
wordcloud(names(trifreq), trifreq, max.words=100, scale=c(5, .1), colors=rainbow(6))

Conclusion

Coverage is non-linear: many more words are required to increase the accuracy of the prediction model, because words do not appear with equal frequency. The n-gram based approach is memory intensive and may not scale well unless we define a target coverage and work backwards from it.
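
A minimal sketch of how coverage could be quantified, assuming a unigram document-term matrix built from the same sample corpus (the object names unidtm, unifreq and coverage are introduced here and do not appear in the analysis above):

# sketch: cumulative coverage of word instances by the most frequent unique words
unidtm   <- DocumentTermMatrix(bag)
unifreq  <- sort(colSums(as.matrix(unidtm)), decreasing = TRUE)
coverage <- cumsum(unifreq) / sum(unifreq)
# unique words needed to cover 50% and 90% of all word instances in the sample
min(which(coverage >= 0.5))
min(which(coverage >= 0.9))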

To identify and handle words from foreign languages, we may use the UTF character codes.