The goal of the project is to build a Predictive Text Application that predicts the next word as the user types a sentence, by creating a predictive text model from a large corpus of documents used as training data.
This “Milestone Report” describes the major features of the training data through exploratory data analysis and summarizes our plans for creating the predictive model. The raw data contains text from several sources and in several languages that can be used to train a prediction algorithm; in particular, this report uses only the English data.
The capstone dataset is downloaded and unzipped into 3 different sources of text files. The training data is available here: Training Data.
The training data is downloaded from the online source and placed into the working directory. The datasets consist of text from 3 different sources: News, Blogs and Twitter feeds. The text data is provided in 4 different languages: German, English - United States, Finnish and Russian. Only the English - United States files are processed.
Sys.setlocale("LC_TIME", "English") # Set properly if different from USA
## [1] "English_United States.1252"
# Download and unzip the data
if (!file.exists("Coursera-SwiftKey.zip")) {
    download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
                  destfile = "Coursera-SwiftKey.zip")
    unzip("Coursera-SwiftKey.zip", exdir = "Coursera-SwiftKey")
}
blog <- readLines("./Coursera-SwiftKey/final/en_US/en_US.blogs.txt", warn = FALSE, encoding = "UTF-8")
news <- readLines("./Coursera-SwiftKey/final/en_US/en_US.news.txt", warn = FALSE, encoding = "UTF-8")
twit <- readLines("./Coursera-SwiftKey/final/en_US/en_US.twitter.txt", warn = FALSE, encoding = "UTF-8")
Basic information for each file: size in MB, number of lines (entries), length of the longest line in characters, and total number of words. The minimum, mean, and maximum words per line are summarised separately after the table.
# Summarise each file: size, number of lines, longest line and total words
folder <- "./Coursera-SwiftKey/final/en_US/"
filelist <- list.files(path=folder)
fsummary <- lapply(paste0(folder, filelist), function(filepath) {
    size <- file.info(filepath)$size/1024/1000      # approximate size in MB
    con <- file(filepath, open="r")
    lines <- readLines(con, warn=FALSE)
    maxchars <- max(nchar(lines))                   # length of the longest line
    nwords <- sum(sapply(strsplit(lines, "\\s+"), length))
    close(con)
    return(c(filepath, format(round(size, 2), nsmall=2), length(lines), maxchars, nwords))
})
df <- data.frame(matrix(unlist(fsummary), nrow=3, byrow=TRUE))
names(df) <- c("File_Name", "Size-MB", "Entries", "Large-line", "Tot_Words")
df
## File_Name Size-MB Entries Large-line
## 1 ./Coursera-SwiftKey/final/en_US/en_US.blogs.txt 205.23 899288 40835
## 2 ./Coursera-SwiftKey/final/en_US/en_US.news.txt 200.99 77259 5760
## 3 ./Coursera-SwiftKey/final/en_US/en_US.twitter.txt 163.19 2360148 213
## Tot_Words
## 1 37334441
## 2 2643972
## 3 30373792
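The table above does not show the words-per-line figures; as a small sketch, they can be obtained for one of the files, using the twit vector already loaded, with base R only (the name wordsPerLine is just illustrative).
wordsPerLine <- sapply(strsplit(twit, "\\s+"), length)   # words in each tweet
summary(wordsPerLine)   # reports the minimum, mean and maximum words per line
The same two lines apply to the blog and news vectors.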
Before proceeding with the analysis it is necessary to clean the datasets; to reduce the time this process takes, only a random 1% sample of each source is used. Non-text elements such as URLs, Twitter handles, special characters, punctuation, numbers and excess whitespace are removed, along with profanities and English stopwords, and the text is converted to lower case.
library(NLP) # Load libraries
library(tm)
set.seed(2021) # Set seed for reproducibility
sampleSize <- 0.01 # Sample 1% of each dataset
# Create data sample (round() keeps the sample sizes integer)
dataSample <- c(sample(blog, round(length(blog) * sampleSize)),
                sample(news, round(length(news) * sampleSize)),
                sample(twit, round(length(twit) * sampleSize)))
# Create corpus from the joined subsets and cleaning
corpus <- VCorpus(VectorSource(dataSample))
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
corpus <- tm_map(corpus, toSpace, "(f|ht)tp(s?)://(.*)[.][a-z]+")
corpus <- tm_map(corpus, toSpace, "@[^\\s]+")
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("en"))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
# Remove offensive words (https://www.cs.cmu.edu/~biglou/resources/bad-words.txt)
profanities <- read.csv("https://www.cs.cmu.edu/~biglou/resources/bad-words.txt",header =FALSE, strip.white = TRUE, stringsAsFactors = FALSE)
corpus <- tm_map(corpus, removeWords, profanities$V1)
The corpus is now ready for the final exploratory analysis on the sample dataset. First, the most frequent words in the dataset are visualised with a word cloud; then the n-gram tokenizer provided by the RWeka library is used to create different n-grams from the corpus and to construct a term-document matrix for each set of n-gram tokens. Finally, the n-gram frequencies are plotted.
library(RColorBrewer) # Load libraries
library(Rcpp)
library(wordcloud)
wordcloud(corpus, max.words = 200,
random.order = FALSE,
rot.per = 0.35,
colors = brewer.pal(8, "Dark2"))
### Tokenize the corpus, build term-document matrices and remove sparse terms for Uni, Duo, Tri, Qua and Qui grams.
library(RWeka) # Weka is a collection of machine learning algorithms for data mining
# Tokenize the lines
UniGramTokens <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
DuoGramTokens <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
TriGramTokens <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
QuaGramTokens <- function(x) NGramTokenizer(x, Weka_control(min = 4, max = 4))
QuiGramTokens <- function(x) NGramTokenizer(x, Weka_control(min = 5, max = 5))
# Create the NGrams
UniGrams <- TermDocumentMatrix(corpus, control = list(tokenize = UniGramTokens))
DuoGrams <- TermDocumentMatrix(corpus, control = list(tokenize = DuoGramTokens))
TriGrams <- TermDocumentMatrix(corpus, control = list(tokenize = TriGramTokens))
QuaGrams <- TermDocumentMatrix(corpus, control = list(tokenize = QuaGramTokens))
QuiGrams <- TermDocumentMatrix(corpus, control = list(tokenize = QuiGramTokens))
# Remove sparse terms
UniGramsDense <- removeSparseTerms(UniGrams, 0.9999)
DuoGramsDense <- removeSparseTerms(DuoGrams, 0.9999)
TriGramsDense <- removeSparseTerms(TriGrams, 0.9999)
QuaGramsDense <- removeSparseTerms(QuaGrams, 0.9999)
QuiGramsDense <- removeSparseTerms(QuiGrams, 0.9999)
Sort the remaining dense terms by descending frequency of occurrence.
freqSample <- function(tdm){
freq <- sort(rowSums(as.matrix(tdm)), decreasing=TRUE)
freqSample <- data.frame(word=names(freq), freq=freq)
return(freqSample)
}
# Get frequencies of most common n-grams in data sample
UniGramsDenseSort <- freqSample(UniGramsDense)
DuoGramsDenseSort <- freqSample(DuoGramsDense)
TriGramsDenseSort <- freqSample(TriGramsDense)
QuaGramsDenseSort <- freqSample(QuaGramsDense)
QuiGramsDenseSort <- freqSample(QuiGramsDense)
Plotting the frequency of the top 20 Uni, Duo and Tri grams.
library(ggplot2) # Load libraries
plotGrams <- function(data, title, num) {
topGrams <- data[1:num,]
topGrams$word <- as.character(topGrams$word)
ggplot(topGrams, aes(x=reorder(word, -freq), y=freq)) +
geom_bar(stat="identity", fill="blue") +
ggtitle(paste(title, "- Top ", num)) +
xlab(title) + ylab("Frequency") +
theme(axis.text.x=element_text(angle=90, hjust=1))
}
plotGrams(UniGramsDenseSort, "Unigrams", 20)
plotGrams(DuoGramsDenseSort, "Duograms", 20)
plotGrams(TriGramsDenseSort, "Trigrams", 20)
With this exploratory analysis done, the dataset is ready for applying a predictive model that can be used in a data product. To use this report's results, it is necessary to:
- Select the correct predictive model.
- Use the calculated tokens.
- Set the app to use the model.
- Make the word prediction with the user input.
This it’s the final goal of the Capstone Project. The predictive algorithm use the n-gram model with frequency lookup with the same way the exploratory analysis was done. Trigram model is the first priority to lookup for the predicted words, and follow by duograms and unigrams. This means that if no matching trigram can be found, then the algorithm would back off to the bigram model, and then to the unigram model if needed.