In this report I present a summary of the exploratory analysis I performed on the data. First I describe the characteristics of the files and how the data was imported and preprocessed. Then, based on a sample of the data, I obtain the most common n-grams (1-grams through 4-grams). Finally, I briefly describe how I plan to build the prediction model.
The packages used for the exploratory analysis were:
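The package-loading chunk is not echoed here; based on the functions used later in the report, the libraries were presumably along these lines (tm is named explicitly in the text, while RWeka, ggplot2 and SnowballC are my assumptions):

library(tm)        # Corpus, tm_map, TermDocumentMatrix
library(RWeka)     # NGramTokenizer (assumed source of the 2/3/4-gram tokenizers)
library(ggplot2)   # assumed plotting backend for the top-50 frequency charts
library(SnowballC) # stemming backend used by stemDocument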
First, the working directory was changed to the folder where the files were extracted (in this case I will work only with the English files):
# read each file line by line, skipping embedded NUL characters
blogData <- readLines("en_US.blogs.txt", skipNul = TRUE)
twitterData <- readLines("en_US.twitter.txt", skipNul = TRUE)
newsData <- readLines("en_US.news.txt", skipNul = TRUE)
## Warning in readLines("en_US.news.txt", skipNul = TRUE): incomplete final
## line found on 'en_US.news.txt'
In order to proceed with the analysis, the 3 datasets were merged:
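The merge itself is not echoed; it is presumably just a concatenation of the three character vectors (the name mergedData below is only illustrative):

mergedData <- c(blogData, twitterData, newsData) # combine blog, twitter and news lines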
I sampled 1% of the data with the sample function, as suggested:
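The sampling call is not shown; a minimal sketch, with an illustrative seed and object name, would be:

set.seed(1234) # fix the seed so the 1% sample is reproducible
sampleData <- sample(mergedData, round(length(mergedData) * 0.01)) # keep ~1% of the lines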
Using the sampled data, I defined the corpus with the Corpus function from the tm package:
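The corpus construction is not echoed either; with tm it would typically look like this (assuming the sampled vector from the previous step):

corpus <- Corpus(VectorSource(sampleData)) # one document per sampled line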
As almost every NLP guide on the internet suggests, I performed some standard transformations (removing extra whitespace, converting all characters to lowercase, removing punctuation, and so on):
options(mc.cores = 1) # use a single core (a common workaround for parallel issues with tm/RWeka)
# replace /, @ and | with a space
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
us_files <- tm_map(corpus, toSpace, "/|@|\\|")
# convert to lowercase
us_files <- tm_map(us_files, content_transformer(tolower))
# remove punctuation
us_files <- tm_map(us_files, removePunctuation)
# remove numbers
us_files <- tm_map(us_files, removeNumbers)
# strip whitespace
us_files <- tm_map(us_files, stripWhitespace)
# remove english stop words
us_files <- tm_map(us_files, removeWords, stopwords("english"))
# stem words
corpus <- tm_map(us_files, stemDocument)
From a term-document matrix I then obtained the list of the most frequently used 1-grams:
Now I created the n-grams (the tokenizer and freq_df helper functions used below are sketched after this code block):
# Creating the n-grams
# unigrams
corpus.unigram <- TermDocumentMatrix(corpus)
corpus.unigram <- removeSparseTerms(corpus.unigram, 0.99)
corpus.unigram.freq <- freq_df(corpus.unigram)
# bigrams
corpus.bigram <- TermDocumentMatrix(corpus, control = list(tokenize = bigramTokenizer))
corpus.bigram <- removeSparseTerms(corpus.bigram, 0.999)
corpus.bigram.freq <- freq_df(corpus.bigram)
# trigrams
corpus.trigram <- TermDocumentMatrix(corpus, control = list(tokenize = trigramTokenizer))
corpus.trigram <- removeSparseTerms(corpus.trigram, 0.999)
corpus.trigram.freq <- freq_df(corpus.trigram)
# quadgrams
corpus.quadgram <- TermDocumentMatrix(corpus, control = list(tokenize = quadgramTokenizer))
corpus.quadgram <- removeSparseTerms(corpus.quadgram, 0.9999)
corpus.quadgram.freq <- freq_df(corpus.quadgram)
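The n-gram tokenizers and the freq_df helper are not defined in the chunks shown above; a minimal sketch of how they could be written, assuming RWeka's NGramTokenizer, is:

# n-gram tokenizers (sketch, assuming the RWeka package)
bigramTokenizer   <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigramTokenizer  <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
quadgramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 4, max = 4))

# collapse a term-document matrix into a frequency table sorted by count
freq_df <- function(tdm) {
  freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
  data.frame(word = names(freq), freq = freq, row.names = NULL)
}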
Here is a graphical summary of the most frequent 1-grams:
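The unigram plot call and the top_50_plot helper are not echoed above; a plausible sketch, assuming a ggplot2 bar chart of the 50 most frequent terms (the title string is only illustrative), is:

# horizontal bar chart of the 50 most frequent terms (sketch)
top_50_plot <- function(freq_data, title, bar_colour) {
  top50 <- head(freq_data[order(-freq_data$freq), ], 50)
  ggplot(top50, aes(x = reorder(word, freq), y = freq)) +
    geom_col(fill = bar_colour) +
    coord_flip() +
    labs(title = title, x = NULL, y = "Frequency")
}
top_50_plot(corpus.unigram.freq, "Top 50 words", "steelblue")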
Similarly, here are the most frequently used 2-grams:
top_50_plot(corpus.bigram.freq, "Top 50 2 word phrases", "steelblue")
# Top 50 3-word phrases (trigrams)
top_50_plot(corpus.trigram.freq, "Top 50 3 word phrases", "steelblue")
# Top 50 4-word phrases (quadgrams)
top_50_plot(corpus.quadgram.freq, "Top 50 4 word phrases", "steelblue")
As a next step, the prediction model will be built from these n-gram frequency tables and presented as a Shiny app hosted on a Shiny server.