The objective of the Capstone Project is to develop a predictive text data product. The data used is a collection of text documents, called a corpus. The corpus used for this milestone report was made available by HC Corpora through the Coursera website.
The goal of this report is to perform exploratory analysis to understand statistical properties of the data set that can later be used when building the prediction model for the final Shiny application. Here we will identify the major features of the training data and then summarize plans for the predictive model.
Understanding the characteristics of the acquired data is important because it shows how the data should be cleaned and preprocessed for analysis.
The downloaded documents are zipped text files, grouped into folders by language. The folder of interest here is the US English folder (en_US). It contains three text files with text gathered from three sources (blogs, news, and Twitter), and the model will be trained on that same data.
## Data already exists. Skipping download.
## Warning in readLines(con, encoding = "UTF-8", skipNul = TRUE): incomplete final
## line found on 'R_Capstone/final/en_US/en_US.news.txt'
## All files read successfully.
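The messages above come from the file-reading step. As a minimal sketch of how such a step might look, the helper below reads a UTF-8 text file line by line with `readLines`, using the same `encoding` and `skipNul` arguments shown in the warning. The helper name `read_text_file` and the choice to open the connection in binary mode are assumptions for illustration, not necessarily what this report's code does. A temporary file stands in for the real data.

```r
# Hypothetical helper (name and binary-mode choice are assumptions):
# read a UTF-8 text file into a character vector, one element per line.
read_text_file <- function(path) {
  con <- file(path, open = "rb")  # binary mode; readLines still splits on LF/CRLF
  on.exit(close(con))             # always release the connection
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}

# Demo on a small temporary file instead of the real corpus
tmp <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line"), tmp)

demo_lines <- read_text_file(tmp)
print(length(demo_lines))
```

Reading through an explicit connection keeps the open mode under the caller's control, which can matter for files that contain unusual control characters.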
The following table lists the file size, line count, character count, and word count of each document.
| File | FileSize | Lines | Characters | Words |
|---|---|---|---|---|
| en_US.blogs.txt | 200 MB | 899288 | 206824505 | 37570839 |
| en_US.news.txt | 196 MB | 77259 | 15639408 | 2651432 |
| en_US.twitter.txt | 159 MB | 2360148 | 162096241 | 30451170 |
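For reference, summary counts like those in the table can be computed in base R. The sketch below is one possible approach and assumes that a word is any run of non-whitespace characters; the report's own counts may rest on a different definition or package. The helper name `text_stats` is hypothetical.

```r
# Hypothetical helper: line, character, and word counts for a character vector.
# Assumption: words are runs of non-whitespace characters.
text_stats <- function(lines) {
  tokens <- strsplit(trimws(lines), "\\s+")  # empty lines yield zero tokens
  data.frame(
    Lines      = length(lines),
    Characters = sum(nchar(lines)),
    Words      = sum(lengths(tokens))
  )
}

demo_lines <- c("The quick brown fox", "jumps over", "")
stats <- text_stats(demo_lines)
print(stats)

# File size in MB for a file on disk could be obtained with:
# file.size(path) / 1024^2
```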
An important observation from this initial investigation is that the text files are fairly large. To improve processing time, a 1% sample will be drawn from each of the three data sets and combined into a single document corpus for the analyses later in this report. The method used to clean the text matters, as it has a large bearing on the usefulness of the model.
Prior to performing exploratory data analysis, the three data sets will be sampled to improve performance.

| File | FileSize | Lines | Characters | Words |
|---|---|---|---|---|
| en_US.sample.txt | 0.75 MB | 6674 | 774790 | 141721 |
## Sample data saved: R_Capstone/final/en_US/en_US.sample.txt
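The sampling step can be sketched as follows. This is an illustration under stated assumptions: the seed value, the helper name `sample_lines`, and the toy input vectors are all hypothetical, and the real code may sample differently (for example, with a different rounding rule).

```r
# Hypothetical helper: draw a random fraction of lines without replacement.
sample_lines <- function(lines, fraction = 0.01) {
  n <- max(1, round(length(lines) * fraction))  # keep at least one line
  lines[sample.int(length(lines), n)]
}

set.seed(1234)  # assumed seed, only so the demo is reproducible

# Toy stand-ins for the three source files
blogs   <- paste("blog line", 1:1000)
news    <- paste("news line", 1:500)
twitter <- paste("tweet", 1:2000)

# Sample each source separately, then combine into one vector
sample_data <- c(sample_lines(blogs), sample_lines(news), sample_lines(twitter))

# Save the combined sample (a temporary path stands in for the real one)
out_path <- tempfile(fileext = ".txt")
writeLines(sample_data, out_path)
print(length(sample_data))
```

Sampling each source separately keeps every source represented in proportion to its line count.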
A custom function named buildCorpus will be employed to perform the following transformation steps for each document:
1. Remove URLs, Twitter handles, and email patterns by converting them to spaces using a custom content transformer
2. Convert all words to lowercase
3. Remove common English stop words
4. Remove punctuation marks
5. Remove numbers
6. Trim whitespace
7. Remove profanity
8. Convert to plain text documents
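The steps above can be approximated in base R with regular expressions. The sketch below is not the buildCorpus function itself, which is built on the tm package; it is a simplified stand-in. The function names, the tiny stop-word list, and the placeholder profanity list are assumptions for illustration. Whitespace is collapsed last here so that removed words leave no gaps.

```r
# Hypothetical helper: replace whole-word matches with a space.
remove_words <- function(x, words) {
  if (length(words) == 0) return(x)
  pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")
  gsub(pattern, " ", x, perl = TRUE)
}

# Simplified base-R stand-in for the cleaning steps (not the tm-based original).
clean_text <- function(x, stop_words, profanity) {
  x <- gsub("(f|ht)tps?://\\S+", " ", x, perl = TRUE)  # URLs
  x <- gsub("\\S+@\\S+", " ", x, perl = TRUE)          # email patterns
  x <- gsub("@\\w+", " ", x, perl = TRUE)              # Twitter handles
  x <- tolower(x)                                      # lowercase
  x <- remove_words(x, stop_words)                     # stop words
  x <- gsub("[[:punct:]]+", "", x, perl = TRUE)        # punctuation
  x <- gsub("[[:digit:]]+", "", x, perl = TRUE)        # numbers
  x <- remove_words(x, profanity)                      # profanity
  trimws(gsub("\\s+", " ", x, perl = TRUE))            # collapse and trim whitespace
}

# Illustrative word lists (assumptions, far shorter than real lists)
demo_stop_words <- c("the", "and", "are")
demo_profanity  <- c("darn")

raw <- "Check https://example.com and email me@example.com, says @user: The 3 cats ARE darn cute!"
cleaned <- clean_text(raw, demo_stop_words, demo_profanity)
print(cleaned)
```

Emails are removed before handles so that the part after the `@` in an address is not left behind as a stray fragment.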
Exploratory data analysis will be performed to fulfill the primary goal of this report. Several techniques will be used to develop an understanding of the training data, including examining the most frequently used words, tokenizing, and generating n-grams.
A bar chart and word cloud will be constructed to illustrate unique word frequencies.
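The word frequencies behind such charts can be derived with a simple count. The following is a minimal base-R sketch on toy documents that are assumed to be already cleaned; the variable names are hypothetical, and the report itself builds its plots with ggplot2 and wordcloud.

```r
# Toy documents, assumed to be cleaned already
docs <- c("cats like milk", "dogs like bones", "cats like dogs")

# Split into words and count occurrences, most frequent first
words <- unlist(strsplit(docs, "\\s+"))
freq  <- sort(table(words), decreasing = TRUE)

# A data frame in the shape a plotting function would typically expect
freq_df <- data.frame(
  word  = names(freq),
  count = as.integer(freq),
  stringsAsFactors = FALSE
)
print(head(freq_df, 3))

# A quick base-graphics bar chart of the top words could then be drawn with:
# barplot(head(freq, 10), las = 2)
```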
The predictive model I plan to develop for the Shiny application will handle unigrams, bigrams, and trigrams. In this section, I will use the RWeka package to construct functions that tokenize the sample data and build matrices of unigrams, bigrams, and trigrams. Tokenization means taking a string and breaking it up into smaller parts; these parts could be, for example, words, phrases, or word stems. Tokens are then used as the building blocks for understanding how text is structured and how tokens relate to each other. The objective is therefore to understand which tokens to use, how they appear in the text, and with what frequency.
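To make the idea of n-gram tokenization concrete, here is a small base-R sketch. It is not the RWeka-based tokenizer used in this report; the function name `make_ngrams` is hypothetical, and it assumes the text is already cleaned and that n-grams do not span separate lines.

```r
# Hypothetical helper: all n-grams (as space-joined strings) from one string.
make_ngrams <- function(text, n) {
  tokens <- strsplit(text, "\\s+")[[1]]
  tokens <- tokens[nzchar(tokens)]                  # drop empty tokens
  if (length(tokens) < n) return(character(0))      # too short for any n-gram
  starts <- seq_len(length(tokens) - n + 1)
  vapply(
    starts,
    function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
    character(1)
  )
}

bigrams  <- make_ngrams("the quick brown fox", 2)
trigrams <- make_ngrams("the quick brown fox", 3)
print(bigrams)
print(trigrams)

# Frequency counts across several lines
docs <- c("the quick brown fox", "the quick red fox")
bigram_freq <- sort(table(unlist(lapply(docs, make_ngrams, n = 2))), decreasing = TRUE)
print(bigram_freq)
```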
The following graphs show the 20 most common unigrams, bigrams, and trigrams by frequency count.
This report is based on the corpus that was kindly made available through the Coursera website. This corpus was a good starting point for understanding the basics of text mining. One question to consider is which other corpora should be included to improve word coverage. Further, this report used only 1% of the given corpus, and even though it was a random sample, it may still be too small. Either a bigger sample should be taken, or several samples of equivalent size should be used.
The final deliverable in the capstone project is to build a predictive algorithm that will be deployed as a Shiny app for the user interface. The Shiny app should take as input a phrase (multiple words) in a text box input and output a prediction of the next word.
The predictive algorithm will be developed using an n-gram model with a word frequency lookup similar to the one performed in the exploratory data analysis section of this report. The strategy will build on the knowledge gathered during the exploratory analysis. For example, as n increased, the frequency of each individual n-gram decreased. One possible strategy is therefore to have the model first look for the unigram that would follow from the entered text. Once a full term has been entered followed by a space, the model would look up the most common matching bigram, and so on for trigrams.
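One way such a frequency lookup could work is a simple back-off: try the longest available context first (the last two words against the trigram counts), fall back to the last word against the bigram counts, and finally return the most common unigram. This is a common n-gram technique offered here only as a sketch; it is not necessarily the strategy the final app will use. The function names and the toy training lines are hypothetical.

```r
# Hypothetical helper: n-grams from one cleaned string.
make_ngrams <- function(text, n) {
  tokens <- strsplit(text, "\\s+")[[1]]
  tokens <- tokens[nzchar(tokens)]
  if (length(tokens) < n) return(character(0))
  vapply(seq_len(length(tokens) - n + 1),
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Frequency tables for n = 1, 2, 3, stored in a list indexed by n
build_counts <- function(docs) {
  lapply(1:3, function(n) table(unlist(lapply(docs, make_ngrams, n = n))))
}

# Back-off lookup: trigram context, then bigram context, then top unigram
predict_next <- function(phrase, counts) {
  tokens <- strsplit(tolower(trimws(phrase)), "\\s+")[[1]]
  tokens <- tokens[nzchar(tokens)]
  for (n in 3:2) {
    k <- n - 1                                   # context length for this n
    if (length(tokens) < k) next
    prefix <- paste0(paste(tail(tokens, k), collapse = " "), " ")
    tab <- counts[[n]]
    hits <- tab[startsWith(names(tab), prefix)]  # n-grams that extend the context
    if (length(hits) > 0) {
      best <- names(hits)[which.max(hits)]
      return(sub(".* ", "", best))               # last word of the best n-gram
    }
  }
  names(counts[[1]])[which.max(counts[[1]])]     # fallback: most common word
}

# Toy training lines (assumed already cleaned)
docs <- c("i love new york", "i love new shoes",
          "i love new york city", "new york is big")
counts <- build_counts(docs)

print(predict_next("I love new", counts))
print(predict_next("hello new", counts))
print(predict_next("zebra", counts))
```

Scanning the table names with `startsWith` is fine for a toy example; on the full sample, a lookup keyed by the context would likely be needed for acceptable response times in the app.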
The final strategy will be the one that offers the best combination of efficiency and accuracy.