Executive Summary

This report is the exploratory data analysis assignment of the Data Science Capstone, the final course in the Statistics and Machine Learning Specialization offered on Coursera.

The project aims to predict the three most probable next words given the one, two or three previous words typed in by the user. This report, however, covers only the exploratory data analysis part of the project. Its main objective is therefore to explore the 20 most common 1-grams, 2-grams and 3-grams occurring in blogs, news articles and tweets.

The data set for the project, sponsored by SwiftKey, comprises data in four languages: English, Finnish, German and Russian. It should be noted, however, that the scope of this report is limited to the English data only.

Loading Data into R

All of the data comes in as text files, which will be read into R.

con <- file("./training_dataset/en_US/en_US.blogs.txt")
blogs <- readLines(con)
close(con)

con <- file("./training_dataset/en_US/en_US.news.txt")
news <- readLines(con)
close(con)

con <- file("./training_dataset/en_US/en_US.twitter.txt")
twit <- readLines(con)
close(con)

rm(con)

Basic Summary of Each Data Type

First, let’s explore the following properties of each file.

1. Total Number of Lines
2. Total Number of Words
3. Average Number of Words per Line
4. Total Number of Characters
5. Average Number of Characters per Line

# number of lines
lines <- c(length(blogs), length(news), length(twit))

# number of words
wordblogs <- lengths(strsplit(blogs, " "))
wordnews <- lengths(strsplit(news, " "))
wordtwit <- lengths(strsplit(twit, " "))

words <- c(sum(wordblogs), sum(wordnews), sum(wordtwit))

# mean number of words per line
meanwords <- c(mean(wordblogs), mean(wordnews), mean(wordtwit))

# number of characters 
charblogs <- nchar(blogs)
charnews <- nchar(news)
chartwit <- nchar(twit)

characters <- c(sum(charblogs), sum(charnews), sum(chartwit))

# mean number of characters per line
meanchar <- c(mean(charblogs), mean(charnews), mean(chartwit))

features <- data.frame(Type = c("Blogs", "News", "Twitter"),
                       Total.Lines = lines,
                       Total.Words = words,
                       Mean.WordsperLine = meanwords,
                       Total.Characters = characters,
                       Mean.Charactersperline = meanchar)

rm(lines, words, wordblogs, wordnews, wordtwit, charblogs, charnews, chartwit, characters, meanchar, meanwords)

features
##      Type Total.Lines Total.Words Mean.WordsperLine Total.Characters
## 1   Blogs      899288    37334131          41.51521        206824505
## 2    News     1010242    34372530          34.02406        203223159
## 3 Twitter     2360148    30373543          12.86934        162096031
##   Mean.Charactersperline
## 1              229.98695
## 2              201.16285
## 3               68.68045

Sample Data

There are so many lines of blogs, news and tweets that fitting all of them into an algorithm would take a very long time. Hence, to improve the efficiency and speed of the program without reducing its effectiveness in predicting the desired words, sampling without replacement is performed.

Please note that some of the data contains words that are not written in English; these are filtered out before the sampling takes place.

From each data set, only 1% of the lines will be taken. The three sub-samples will then be combined into a single data set, which will be used throughout the rest of the analysis.

set.seed(418801)
# strip non-ASCII (non-English) characters
blogs <- iconv(blogs, from = "latin1", to = "ASCII", sub = "")
news <- iconv(news, from = "latin1", to = "ASCII", sub = "")
twit <- iconv(twit, from = "latin1", to = "ASCII", sub = "")

# drop lines that are left empty after stripping non-ASCII characters
blogs <- blogs[grep(pattern = ".+", blogs)]
news <- news[grep(pattern = ".+", news)]
twit <- twit[grep(pattern = ".+", twit)]

# sample
sampleblogs <- blogs[sample(1:length(blogs), size = 0.01*length(blogs), 
                            replace = FALSE)]
samplenews <- news[sample(1:length(news), size = 0.01*length(news), 
                          replace = FALSE)]
sampletwit <- twit[sample(1:length(twit), size = 0.01*length(twit), 
                          replace = FALSE)]

sampledata <- c(sampleblogs, samplenews, sampletwit)    # combine the three samples into one character vector

rm(sampleblogs, samplenews, sampletwit)

# write a separate file for this sample data for later use
writeLines(sampledata, con = "./training_dataset/en_US/en_US.sampledata.txt")

Processing the Data

Even though the data now contains only English (ASCII) characters, it is still messy: it contains numbers, punctuation, and possibly URLs, email addresses, Twitter account names, tabs and extra whitespace. These components should be filtered out because they are not words that would normally be typed, and they would only become unnecessary noise in the algorithm later on.

Please note that the data processing will be done with the gsub function where possible, rather than with the tm_map function from the tm package, because gsub was found to be quicker. More complex steps, however, will be done with tm_map; these include removing the bad words banned by Google and removing English stop words such as the, a, as, etc.
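
As a rough illustration only (a sketch of my own, not part of the original pipeline), the speed claim can be spot-checked with system.time(), for example by removing digits with both approaches; corpus_tmp is a throwaway name used just for this comparison.

library(tm)
corpus_tmp <- VCorpus(VectorSource(sampledata))   # build the corpus once, outside the timing
system.time(gsub("[[:digit:]]", "", sampledata))  # plain gsub on the character vector
system.time(tm_map(corpus_tmp, removeNumbers))    # the equivalent tm_map transformation
rm(corpus_tmp)

The actual cleaning is performed below, starting with the gsub steps.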

sampledata <- gsub("(f|ht)tp(s?)://.*", "", sampledata) # remove url
sampledata <- gsub("@[^\\s]+", "", sampledata)  # remove account name
sampledata <- gsub("[[:alnum:|:punct:]]*[@](.*?)[.]{1,3}", "", sampledata) #remove email
sampledata <- tolower(sampledata)
sampledata <- gsub("[[:digit:]]", "", sampledata)   #remove numbers
sampledata <- gsub("[[:punct:]]", "", sampledata)   # remove punctuation

con <- file("badwordslist.txt")
badwords <- readLines("./badwordslist.txt") # read in list of bad words
close(con)

library(tm)
vcorpus <- VCorpus(VectorSource(sampledata))
vcorpus <- tm_map(vcorpus, removeWords, badwords)   # remove badwords
vcorpus <- tm_map(vcorpus, removeWords, stopwords("en"))    # stopwords
vcorpus <- tm_map(vcorpus, stripWhitespace)    # collapse multiple whitespace
vcorpus <- tm_map(vcorpus, PlainTextDocument)

rm(sampledata, badwords, con)

Word Frequency

After the data has been cleaned, the frequency of the remaining words is calculated. This includes not only individual words (1-grams) but also phrases of two or three consecutive words (2-grams and 3-grams, respectively), such as United States, United Kingdom and Happy New Year.

In order to make the result more compact, terms that are very rare, appearing in fewer than roughly 0.01% of the sampled documents, will be disregarded (this corresponds to the sparsity threshold of 0.9999 passed to removeSparseTerms below).

library(RWeka)  # for NGramTokenizer

# aggregate term frequencies across all documents and sort in decreasing order
count <- function(x) {
    freq <- sort(rowSums(as.matrix(x)), decreasing = TRUE)
    return(data.frame(word = names(freq), frequency = freq))
}

# tokenizers producing 1-grams, 2-grams and 3-grams respectively
token1 <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
token2 <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
token3 <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

freq1 <- count(removeSparseTerms(TermDocumentMatrix(vcorpus, control = list(
    tokenize = token1)), 0.9999))
freq2 <- count(removeSparseTerms(TermDocumentMatrix(vcorpus, control = list(
    tokenize = token2)), 0.9999))
freq3 <- count(removeSparseTerms(TermDocumentMatrix(vcorpus, control = list(
    tokenize = token3)), 0.9999))

Histograms

Let’s visualize the results above using histograms. Each of the following histograms plots the 20 most common n-grams and their corresponding frequencies.

library(ggplot2)
histogram <- function(x, n) {
    g <- ggplot(x[1:20,], aes(x = reorder(word, -frequency), y = frequency))
    g <- g + geom_col(fill = "midnightblue", color = "white")
    g <- g + theme(axis.text.x = element_text(angle = 90))
    title <- paste("20 Most Common ", n, "-Gram Words", sep = "")
    xlab <- paste(n, "-Gram Word", sep = "")
    ylab <- "Frequency"
    g + labs(title = title, x = xlab, y = ylab)
}

# 1-gram
histogram(freq1, 1)

# 2-gram
histogram(freq2, 2) 

# 3-gram
histogram(freq3, 3)

Prediction Algorithm

To emphasize, the actual aim of the whole project is to accurately predict the next word given some number of previous words. The prediction algorithm will be similar to what was done in this exploratory analysis: use n-gram tokenization and predict the next words based on their frequency of occurrence. The algorithm will then be deployed as a Shiny app so that users can type in their sentences and get the forecasted words back.

However, it is noticeable that the larger the value of n in the n-gram tokenization, the more dramatically the frequency of the most common terms falls. Since the most common 3-gram occurs only a little over 60 times, going beyond 3-grams may be a waste of resources and time.

As a consequence, the strategy for the prediction is the following:

No matter how many words have been typed into the sentence, the prediction algorithm should first look at the most recent three words against the 3-gram model. If the probability of the resulting candidate given the data set is low (less than 5%), the model should fall back to the latest two words and the 2-gram model, and so on. However, if the user has typed numbers, punctuation, stop words or bad words, these should be disregarded and the model should consider the preceding words instead. A rough sketch of this back-off idea is given below.
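
The following is a minimal sketch of how such a back-off could look in R, not the final implementation. It assumes the freq1, freq2 and freq3 data frames produced above (columns word and frequency), reads the 3-gram table with the two most recent words and the 2-gram table with the most recent word, and simply backs off when no match is found instead of applying the 5% threshold; the helper name predict_next is hypothetical.

# back-off sketch (assumption, not the final algorithm); freq1/freq2/freq3
# are the n-gram frequency tables built in the Word Frequency section
predict_next <- function(phrase, freq1, freq2, freq3, n = 3) {
    # basic cleaning, mirroring the preprocessing above
    phrase <- gsub("[[:punct:]]|[[:digit:]]", "", tolower(phrase))
    tokens <- unlist(strsplit(phrase, "\\s+"))
    tokens <- tokens[tokens != ""]

    # try a 2-word context against the 3-gram table,
    # then a 1-word context against the 2-gram table
    for (k in 2:1) {
        if (length(tokens) >= k) {
            context <- paste(tail(tokens, k), collapse = " ")
            tab <- if (k == 2) freq3 else freq2
            hits <- tab[grepl(paste0("^", context, " "), tab$word), ]
            if (nrow(hits) > 0) {
                # the tables are sorted by frequency, so keep the last word
                # of the top n matching n-grams as the predictions
                return(sub(".* ", "", head(as.character(hits$word), n)))
            }
        }
    }
    # last resort: the n most frequent single words
    head(as.character(freq1$word), n)
}

# hypothetical usage:
# predict_next("happy new", freq1, freq2, freq3)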