1. Introduction

This document aims to be a concise report of the major features identified in the data. The exploratory analysis mainly consists of statistical metrics, histograms and word clouds. At the end there is a brief outlook on the plans for creating the prediction algorithm, a Shiny app and, ultimately, a finished data product.

2. Data loading and overview

The complete data set consists of four languages (English, German, Finnish and Russian). In this analysis we will only use the English language data.

Data Source (provided by Coursera): https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip.

2.1 Loading

set.seed(1111)
# loading libs
library(stringi)
library(dplyr)
library(tm)
library(ggplot2)
library(RColorBrewer)
library(RWeka)
library(wordcloud)

# get data and load
twitFile <- file("en_US.twitter.txt", "r")
blogFile <- file("en_US.blogs.txt", "r")
newsFile <- file("en_US.news.txt", "r")
twit <- readLines(twitFile, encoding = "UTF-8", skipNul = TRUE)
blog <- readLines(blogFile, encoding = "UTF-8", skipNul = TRUE)
news <- readLines(newsFile, encoding = "UTF-8", skipNul = TRUE)
close(twitFile)
close(blogFile)
close(newsFile)
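
The loading code above assumes that the three English text files are already present in the working directory. Below is a minimal sketch for fetching them from the URL given above; the internal archive path final/en_US/ is an assumption about the zip layout.

# fetch and extract the English files if they are not yet present (sketch)
if (!file.exists("en_US.twitter.txt")) {
  download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
                destfile = "Coursera-SwiftKey.zip")
  unzip("Coursera-SwiftKey.zip", junkpaths = TRUE,
        files = c("final/en_US/en_US.twitter.txt",
                  "final/en_US/en_US.blogs.txt",
                  "final/en_US/en_US.news.txt"))
}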

2.2 First look

# word counts and general statistics for each data set
completeWordCount <- sapply(list(twit, blog, news), stri_stats_latex)["Words", ]
wordsPerLine <- sapply(list(twit, blog, news), stri_count_words)
wordsPerLineStats <- sapply(list(twit, blog, news), function(x) summary(stri_count_words(x)))
dataSetStats <- sapply(list(twit, blog, news), stri_stats_general)
# combine everything into one table and round the mean words per line
allStats <- data.frame(t(rbind(wordsPerLineStats, completeWordCount, dataSetStats))) %>%
  mutate(Mean = round(Mean, 2))
rownames(allStats) <- c("twitter", "blog", "news")
allStats[, -c(1, 2, 5, 8)] # drop the Min., 1st Qu., 3rd Qu. and Lines columns
##         Median  Mean Max. completeWordCount LinesNEmpty     Chars CharsNWhite
## twitter     12 12.75   47          30451170     2360148 162096241   134082806
## blog        28 41.75 6726          37570839      899288 206824382   170389539
## news        32 34.41 1796          34494539     1010242 203223154   169860866

3. Cleaning and pre-processing

Due to computational constraints and for easier handling we will only use a subset of the data. We have learned in previous classes how a representative sample can be used to infer facts about a population. In this case we will use 0.2 percent of the data to proceed with the exploratory analysis.

sampleSize <- 0.002
twitterSample <- sample(twit, length(twit) * sampleSize)
blogSample <- sample(blog, length(blog) * sampleSize)
newsSample <- sample(news, length(news) * sampleSize)
sampleData <- c(twitterSample, blogSample, newsSample)

3.1 Creating a corpus

Because we are working with online sources, it can be expected that there will be a lot of special characters in the data set (URLs, email addresses, emojis, etc.). In addition to these special characters, I will try to remove all symbols that are not relevant for the analysis. Below is a list of all steps performed to clean the corpus. For cleaning the data and building a corpus we will use the tm library (documentation: cran.r-project.org and rdocumentation.org). Profanity will be removed from the corpus; for this purpose we use a list of profane words provided by Google.

  • remove profanity using Google’s bad word list
  • convert non-word characters to spaces
  • convert all words to lower case
  • remove stopwords with tm’s stopwords function
  • remove numbers
  • remove punctuation
  • strip extra white space
  • convert to plain text documents
profanity <- as.character(t(read.csv("bwl.txt", header = FALSE, skip = 13))) # read bad words list
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x)) # custom content_transformer

corpus <- VCorpus(VectorSource(sampleData))
corpus <- tm_map(corpus, removeWords, profanity) # removing profanity
corpus <- tm_map(corpus, toSpace, "\\W") # convert non word characters to space
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, removeWords, stopwords("en")) # comment this line out if you want to keep stopwords
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, PlainTextDocument)

# for (d in seq(length(corpus))){ # careful prints all docs
#   print(corpus[[d]]$content)
# }
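
Instead of printing the whole corpus (as in the commented-out loop above), a quick sanity check of the cleaning steps could look at just the first few documents:

# print the content of the first three cleaned documents only
for (d in 1:3) {
  print(corpus[[d]]$content)
}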

3.2 Tokenize the data and create n-grams

Now that we have a cleaned corpus to work with, we can focus on building the data representation of the n-grams. We will use RWeka, the R interface to Weka, to do this. The final data frames containing the n-grams are an important step of this analysis and will be used to learn more details about the data.

# create tokenizer functions for `tm`'s TermDocumentMatrix
unigramTokenizer <- function(x) {
  NGramTokenizer(x, Weka_control(min = 1, max = 1))
} 
bigramTokenizer <- function(x) {
  NGramTokenizer(x, Weka_control(min = 2, max = 2))
} 
trigramTokenizer <- function(x) {
  NGramTokenizer(x, Weka_control(min = 3, max = 3))
} 
# quadgramTokenizer <- function(x) {
#   NGramTokenizer(x, Weka_control(min = 4, max = 4))
# } 

# tokenization: build a term-document matrix for each n-gram size
unigrams <- TermDocumentMatrix(corpus, control = list(tokenize = unigramTokenizer))
bigrams <- TermDocumentMatrix(corpus, control = list(tokenize = bigramTokenizer))
trigrams <- TermDocumentMatrix(corpus, control = list(tokenize = trigramTokenizer))
# quadgrams <- TermDocumentMatrix(corpus, control = list(tokenize = quadgramTokenizer))

# convert to data.frame for easier access
toDF <- function(x){
    wordCount <- sort(rowSums(as.matrix(x)), decreasing = TRUE)[1:100] #only the first 100
    gramDF <- data.frame(word = names(wordCount), frequency = wordCount)
    return(gramDF)
}

# create data.frames
uniGrDF <- toDF(unigrams)
biGrDF <- toDF(bigrams)
triGrDF <- toDF(trigrams)
# quaGrDF <- toDF(quadgrams)
allGrams <- list(uniGrDF, biGrDF, triGrDF)

# save for later use
# save(uniGrDF, file = "unigram.rda")
# save(biGrDF, file = "bigram.rda")
# save(triGrDF, file = "trigram.rda")
# save(quaGrDF, file = "quadgram.rda")

4. Exploratory plots of the n-gram frequencies

4.1 Word clouds

A tag cloud (word cloud) is a visual representation of text data which can be used to visualize the word frequencies of a text corpus. By utilizing font size and color, this format is useful for perceiving the most prominent terms at a glance. We will use the wordcloud package to analyze the frequency of the different n-grams in our corpus.

# function to avoid repetitive code
plotClouds <- function (DFs) {
  for (i in 1:3) {
    print(paste(c("Wordcloud for ", i, "-gram"), collapse=""))
    wordcloud(DFs[[i]]$word, DFs[[i]]$frequency, random.order = FALSE, fixed.asp = TRUE, colors = brewer.pal(5, "Accent"))
  }
}

plotClouds(allGrams)
## [1] "Wordcloud for 1-gram"

## [1] "Wordcloud for 2-gram"

## [1] "Wordcloud for 3-gram"

4.2 Histograms

In this section we will plot the histograms of the 15 most used \(n\)-grams for \(n\in\{1,2,3\}\).

# again, a helper to avoid repetitive code
histPlot <- function(DF, xlab) {
  ggplot(DF[1:15, ], aes(reorder(word, frequency), frequency)) +
    geom_col(color = "black", fill = "lightblue") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 11)) +
    labs(x = xlab, y = "Count")
}

histPlot(uniGrDF, "Most used unigrams")

histPlot(biGrDF, "Most used bigrams")

histPlot(triGrDF, "Most used trigrams")

# histPlot(quaGrDF, "Most used quadgrams")

We can see that the unigram distribution does not follow Zipf’s law.

Zipf’s law was originally formulated in terms of quantitative linguistics, stating that given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table.

This may be related to the cleaning process (removing stopwords, for example, takes out the most frequent words) and has to be investigated further. The distribution of the higher-order n-grams follows the law more closely.
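
To check this observation more directly, one can plot word frequency against rank on log-log axes: under Zipf’s law the points fall roughly on a straight line with a slope near -1. Below is a minimal sketch using the unigram frequencies computed above (uniGrDF only holds the top 100 terms, so this is only a rough check; tm also offers a Zipf_plot() helper that works directly on the term-document matrix).

# log-log plot of frequency vs. rank for the top unigrams;
# an approximately straight line with slope near -1 would indicate Zipf-like behaviour
zipfDF <- data.frame(rank = seq_len(nrow(uniGrDF)), frequency = uniGrDF$frequency)
ggplot(zipfDF, aes(log10(rank), log10(frequency))) +
  geom_point(color = "steelblue") +
  geom_smooth(method = "lm", se = FALSE, color = "black", linetype = "dashed") +
  labs(x = "log10(rank)", y = "log10(frequency)")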

5. Summary and outlook

I’m now at the end of my exploratory analysis of this data set. With the distribution of the different n-grams we have learned important facts about the data that will help us tackle the rest of the course.

In the next few days the goal will be to build the complete statistical learning model that can be used to make predictions on given data (text input). This model will then be incorporated into a Shiny app that can be distributed and accessed on shinyapps.io. The first version of the Shiny app will probably consist of a text input field that can be used to enter phrases, for which the app will predict the most likely next word. The app could then be expanded to incorporate different models that can either be selected by the user or used to make parallel predictions on the same input. On the way to a full data product, a presentation to pitch the project will also be prepared.
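
As a first illustration of the planned approach (not the final model), a simple lookup on the n-gram tables built above can already suggest a next word: take the last two words of the input, look for matching trigrams and return the most frequent completion, backing off to the bigram table if nothing matches. Below is a minimal sketch under these assumptions; the helper name predictNextWord is hypothetical, and a real model will need complete n-gram tables and smoothing.

# naive next-word lookup with a simple back-off from trigrams to bigrams (illustration only)
predictNextWord <- function(phrase, triDF = triGrDF, biDF = biGrDF) {
  tokens <- tolower(unlist(strsplit(phrase, "\\s+")))
  lastTwo <- paste(tail(tokens, 2), collapse = " ")
  hits <- triDF[grepl(paste0("^", lastTwo, " "), as.character(triDF$word)), ]
  if (nrow(hits) == 0) { # back off to bigrams starting with the last word
    hits <- biDF[grepl(paste0("^", tail(tokens, 1), " "), as.character(biDF$word)), ]
  }
  if (nrow(hits) == 0) return(NA_character_)
  # the n-gram data frames are already sorted by frequency, so the first match wins
  tail(strsplit(as.character(hits$word[1]), " ")[[1]], 1)
}

predictNextWord("happy new") # example call; the result depends on the sampled corpus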

6. References and other information

# sessionInfo()