Capstone Data Science Project

1. Introduction

The objective of the Capstone Project is to develop a predictive text data product. The data used is a collection of text documents, called a corpus. The corpus used for this milestone report has been made available by HC Corpora through the Coursera website.

The goal of this report is to perform exploratory analysis to understand statistical properties of the data set that can later be used when building the prediction model for the final Shiny application. Here we will identify the major features of the training data and then summarize plans for the predictive model.

2. Loading, Cleaning and Summarizing the Data

Understanding the characteristics of the acquired data is important, as it informs how the data should be cleaned and preprocessed for analysis.

The downloaded documents are zipped text files, grouped into folders by language. The folder of interest here is the US English folder (en_US). It holds three text files containing text gathered from three sources (blogs, news and Twitter), and the model will be trained using that same data.

## Data already exists. Skipping download.

## Warning in readLines(con, encoding = "UTF-8", skipNul = TRUE): incomplete final
## line found on 'R_Capstone/final/en_US/en_US.news.txt'

## All files read successfully.
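The report's loading code is not shown, so the following is only a minimal sketch of how the three files might be read. The helper name `read_text_file` and the choice of binary mode are assumptions. One thing worth checking: in the summary table in this section, en_US.news.txt shows far fewer lines and characters than its 196 MB size would suggest, which may mean the read stopped early. A possible cause, offered as an assumption rather than a diagnosis, is an embedded control character that ends a text-mode read on some platforms; opening the connection in binary mode is a common workaround.

```r
# Read one text file into a character vector, one element per line.
# Binary mode ("rb") is an assumption made here so that embedded control
# characters cannot end the read early on platforms where they otherwise would.
read_text_file <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  # warn = FALSE silences the "incomplete final line" warning seen above
  readLines(con, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
}

# Hypothetical usage; the directory matches the path shown in the output above.
# data_dir <- "R_Capstone/final/en_US"
# blogs   <- read_text_file(file.path(data_dir, "en_US.blogs.txt"))
# news    <- read_text_file(file.path(data_dir, "en_US.news.txt"))
# twitter <- read_text_file(file.path(data_dir, "en_US.twitter.txt"))
```

If the line count for the news file changes after re-reading it this way, the summary table and the sample would need to be regenerated.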

The following table lists the file size and the number of lines, characters and words in each document.

Summary of Text Datasets
File	FileSize	Lines	Characters	Words
en_US.blogs.txt	200 MB	899288	206824505	37570839
en_US.news.txt	196 MB	77259	15639408	2651432
en_US.twitter.txt	159 MB	2360148	162096241	30451170
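The statistics in the table above could be computed along the following lines. This is a sketch in base R, not the report's own code: the function name `summarize_text` is hypothetical, and the word count (whitespace-separated tokens) is an assumed definition that may differ slightly from the one used to produce the table.

```r
# Summarize a character vector of lines: line, character and word counts.
# `path` is optional; when given, the on-disk size in MB is reported too.
summarize_text <- function(name, lines, path = NULL) {
  # Split each line on runs of whitespace; empty tokens are not counted
  tokens <- strsplit(trimws(lines), "[[:space:]]+")
  data.frame(
    File = name,
    FileSizeMB = if (is.null(path)) NA_real_ else round(file.info(path)$size / 1024^2, 2),
    Lines = length(lines),
    Characters = sum(nchar(lines, type = "chars", allowNA = TRUE), na.rm = TRUE),
    Words = sum(vapply(tokens, function(tok) sum(nzchar(tok)), integer(1))),
    stringsAsFactors = FALSE
  )
}

# Small illustrative input (not taken from the corpus)
demo <- summarize_text("demo", c("hello world", "one two three"))
print(demo)
```

Calling the function once per file and combining the rows with `rbind` would give a table of the same shape as the one above.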

3. Preprocessing the sample sets

The initial investigation shows that the text files are fairly large. To improve processing time, a 1% sample will be drawn from each of the three data sets, and the samples will be combined into a single corpus for the analyses in the rest of this report. The method used to clean the text also matters, as it has a large bearing on the usefulness of the model.

Summary of Sampled Dataset
File	FileSize	Lines	Characters	Words
en_US.sample.txt	0.75 MB	6674	774790	141721

## Sample data saved: R_Capstone/final/en_US/en_US.sample.txt
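The sampling step might look like the sketch below. The function name `sample_lines`, the seed value and the rounding rule are assumptions chosen for illustration; the report does not show how its sample was drawn.

```r
# Draw a reproducible random sample of lines without replacement.
sample_lines <- function(lines, fraction = 0.01, seed = 1234) {
  if (length(lines) == 0) return(lines)
  set.seed(seed)  # arbitrary fixed seed so the same sample is drawn each run
  n_keep <- max(1L, as.integer(round(length(lines) * fraction)))
  lines[sample.int(length(lines), n_keep)]
}

# Hypothetical usage, assuming blogs, news and twitter are character vectors:
# combined <- c(sample_lines(blogs), sample_lines(news), sample_lines(twitter))
# writeLines(combined, "R_Capstone/final/en_US/en_US.sample.txt")
```

Fixing the seed keeps the exploratory results stable between runs; a different seed, or several seeds, could be used to check how sensitive the frequencies are to the particular sample.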

A custom function named buildCorpus will be employed to perform the following transformation steps for each document:

1. Remove URLs, Twitter handles and email addresses by replacing them with spaces, using a custom content transformer
2. Convert all words to lowercase
3. Remove common English stop words
4. Remove punctuation marks
5. Remove numbers
6. Trim whitespace
7. Remove profanity
8. Convert to plain text documents
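The steps above can be sketched in base R as follows. This is not the report's buildCorpus function, whose code is not shown: the names `clean_text`, `remove_words` and `squish`, the regular expressions, and the example word lists are all assumptions. Step 8 is specific to corpus objects and has no counterpart when working on a plain character vector, so it is omitted here.

```r
# Collapse runs of whitespace and trim both ends
squish <- function(x) trimws(gsub("[[:space:]]+", " ", x))

# Remove whole words; assumes `words` are plain lowercase words
# with no regular-expression metacharacters
remove_words <- function(x, words) {
  if (length(words) == 0) return(x)
  pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")
  gsub(pattern, " ", x, perl = TRUE)
}

clean_text <- function(x, stop_words = character(0), profanity = character(0)) {
  x <- gsub("(https?://|www\\.)[^[:space:]]+", " ", x)  # 1. URLs
  x <- gsub("[^[:space:]]+@[^[:space:]]+", " ", x)      # 1. email addresses
  x <- gsub("@[^[:space:]]+", " ", x)                   # 1. Twitter handles
  x <- tolower(x)                                       # 2. lowercase
  x <- remove_words(x, stop_words)                      # 3. stop words
  x <- gsub("[[:punct:]]+", " ", x)                     # 4. punctuation
  x <- gsub("[[:digit:]]+", " ", x)                     # 5. numbers
  x <- squish(x)                                        # 6. whitespace
  x <- remove_words(x, profanity)                       # 7. profanity
  squish(x)  # extra pass (an addition) to tidy gaps left by step 7
}

# Illustrative input and word lists, not taken from the corpus
clean_text("Check THIS out: http://example.com @user me@mail.com 42 darn cats!",
           stop_words = c("this", "out"), profanity = "darn")
```

Email addresses are handled before Twitter handles so that the handle pattern does not strip only part of an address.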

4. Word Frequencies or Exploratory Data Analysis

Exploratory data analysis will be performed to fulfill the primary goal of this report. Several techniques will be used to develop an understanding of the training data, including looking at the most frequently used words, tokenizing, and generating n-grams.

A bar chart and word cloud will be constructed to illustrate unique word frequencies.
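A word frequency table of the kind that feeds such charts could be built as in the sketch below. It uses base R only; the function names `word_frequencies` and `plot_top_words` are hypothetical, and the input is assumed to be text that has already been cleaned.

```r
# Count how often each word occurs, most frequent first
word_frequencies <- function(lines) {
  words <- unlist(strsplit(lines, "[[:space:]]+"))
  words <- words[nzchar(words)]
  if (length(words) == 0) {
    return(data.frame(word = character(0), count = integer(0)))
  }
  counts <- sort(table(words), decreasing = TRUE)
  data.frame(word = names(counts), count = as.integer(counts),
             stringsAsFactors = FALSE)
}

# Base-graphics bar chart of the n most frequent words
plot_top_words <- function(freq, n = 20) {
  top <- head(freq, n)
  barplot(top$count, names.arg = top$word, las = 2,
          main = "Most frequent words", ylab = "Frequency")
}

# Illustrative input, not taken from the corpus
freq <- word_frequencies(c("a b a", "a c b"))
print(freq)
```

The same table could also supply the words and counts for a word cloud.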

4.1 Exploratory analysis of tokens

The predictive model I plan to develop for the Shiny application will handle unigrams, bigrams, and trigrams. In this section, I will use the RWeka package to construct functions that tokenize the sample data and build matrices of unigrams, bigrams, and trigrams. Tokenization means taking a string and breaking it up into smaller parts, such as words, phrases or word roots. Tokens then serve as the building blocks for understanding how text is structured and how its parts relate to each other. The objective is therefore to decide which tokens to use and to see how, and how often, they appear in the text.

Tokenize Functions
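The report builds its tokenizers with RWeka, and that code is not shown. To illustrate the idea without extra packages, here is a base R sketch; `make_ngrams` and `top_ngrams` are hypothetical names, and the input is assumed to be a vector of already cleaned word tokens.

```r
# Build all n-grams from a vector of word tokens by sliding a window of size n
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Frequency table of the k most common n-grams
top_ngrams <- function(tokens, n, k = 20) {
  head(sort(table(make_ngrams(tokens, n)), decreasing = TRUE), k)
}

# Illustrative tokens, not taken from the corpus
tokens <- c("the", "quick", "brown", "fox")
make_ngrams(tokens, 2)
```

Applying the function to one line or sentence at a time avoids creating n-grams that span two unrelated lines.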

The following graphs show the 20 most common unigrams, bigrams and trigrams by frequency count.

Unigrams

Bigrams

Trigrams

5. Conclusions and Next Steps

This report was based on the corpus that was kindly made available through the Coursera website. This corpus was a great point of departure for understanding the basics of text mining. One point to consider is which other corpora should be included to improve the coverage of words. Further, this report used 1% of the given corpus, and even though it was a random sample, it may still be too small. Either a bigger sample should be taken, or several samples of equivalent size should be used.

The final deliverable of the capstone project is a predictive algorithm deployed as a Shiny app, which provides the user interface. The Shiny app should accept a phrase (multiple words) in a text box and output a prediction of the next word.

The predictive algorithm will be developed using an n-gram model with a word frequency lookup similar to that performed in the exploratory data analysis section of this report. The strategy will build on what was learned during the exploratory analysis. For example, as n increased, the frequency of individual n-grams decreased. One possible strategy is therefore to first suggest the most likely unigram for the text being entered. Once a full word has been entered, followed by a space, the model would look up the most common bigram that starts with that word, then the most common trigram once two words are available, and so on.
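To make the lookup idea concrete, here is a minimal sketch of a next-word predictor. It uses a simple backoff (trigram first, then bigram, then the most frequent unigram), which is a common variant and not necessarily the strategy the final app will adopt. All function names are hypothetical, and the training tokens are a toy example.

```r
# Build all n-grams from a vector of word tokens
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Frequency tables for n = 1, 2, 3, each sorted from most to least frequent
build_model <- function(tokens) {
  lapply(1:3, function(n) sort(table(make_ngrams(tokens, n)), decreasing = TRUE))
}

# Predict the next word: try the longest usable context first, then back off
predict_next <- function(model, phrase) {
  words <- strsplit(tolower(trimws(phrase)), "[[:space:]]+")[[1]]
  words <- words[nzchar(words)]
  for (n in 3:2) {
    if (length(words) >= n - 1) {
      prefix <- paste(tail(words, n - 1), collapse = " ")
      candidates <- names(model[[n]])
      hits <- candidates[startsWith(candidates, paste0(prefix, " "))]
      if (length(hits) > 0) {
        # Tables are sorted, so the first hit is the most frequent match
        return(tail(strsplit(hits[1], " ")[[1]], 1))
      }
    }
  }
  names(model[[1]])[1]  # fall back to the most frequent single word
}

# Toy training text, not taken from the corpus
tokens <- strsplit("i love you i love cake i love you so much", " ")[[1]]
model <- build_model(tokens)
predict_next(model, "I love")
```

A scan over all n-gram names is fine for a toy table; a real app would likely need an indexed lookup to stay responsive.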

The final strategy will be the one that offers the best balance of efficiency and accuracy.