Three text datasets will be used to understand the distribution of, and relationships between, words and phrases. This will be accomplished through exploratory analysis of the text in these datasets, as described below.
The tm, NLP, dplyr, quanteda (version 0.9.9.65), and ggplot2 packages are used for this analysis; quanteda reports using 7 of 8 cores for parallel computing.
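The setup chunk itself is not shown; a minimal sketch of the package loading implied by the startup messages:

```r
# Load the text-mining and plotting packages used throughout the report
# (tm attaches NLP as a dependency)
library(tm)
library(dplyr)
library(quanteda)
library(ggplot2)
```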
The blog, Twitter, and news data are read in for text-mining analysis.
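The read-in chunk is not shown in the report. A sketch, assuming the standard en_US.blogs.txt, en_US.twitter.txt, and en_US.news.txt file names in the working directory (only en_US.news.txt and the UTF-8 encoding are confirmed by the warning below; the other arguments and object names are assumptions):

```r
# Read each corpus as a character vector, one element per line of text
blogs   <- readLines(con <- file("./en_US.blogs.txt"),   encoding = "UTF-8", skipNul = TRUE); close(con)
twitter <- readLines(con <- file("./en_US.twitter.txt"), encoding = "UTF-8", skipNul = TRUE); close(con)
news    <- readLines(con <- file("./en_US.news.txt"),    encoding = "UTF-8", skipNul = TRUE); close(con)

# Line counts reported below
length(blogs); length(twitter); length(news)
```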
## Warning in readLines(con <- file("./en_US.news.txt"), encoding = "UTF-8", :
## incomplete final line found on './en_US.news.txt'
Number of blog lines:
## [1] 899288
Number of Twitter lines:
## [1] 2360148
Number of news lines:
## [1] 77259
Since the initial datasets contain a large number of lines of text, a random sample of 10% of each dataset will be used for further analysis. The following summary shows the number of lines per dataset after taking the random sample.
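The sampling chunk is not shown; a sketch assuming a simple sample() of 10% of the lines (the seed and object names are assumptions):

```r
set.seed(1234)  # assumed seed, for reproducibility

# floor() gives whole-line sample sizes matching the counts reported below
blogs_sample   <- sample(blogs,   floor(length(blogs)   * 0.10))
twitter_sample <- sample(twitter, floor(length(twitter) * 0.10))
news_sample    <- sample(news,    floor(length(news)    * 0.10))

length(blogs_sample); length(twitter_sample); length(news_sample)
```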
Number of blog sample lines:
## [1] 89928
Number of Twitter sample lines:
## [1] 236014
Number of news sample lines:
## [1] 7725
The text will be broken up into meaningful units of text (tokens). To clean up the dataset further, numbers, punctuation, hyphens, and Twitter hashtags will be removed. The datasets are then examined for word and phrase frequencies, which are plotted below; a sketch of the tokenization step follows.
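The tokenization chunk is not shown in the report. A sketch using quanteda, written against the current tokens() interface (argument names differed slightly in the 0.9.9.x release used here, and the object names are assumptions):

```r
# Combine the three samples and tokenize, removing numbers, punctuation,
# and symbols, and splitting hyphenated words
all_text <- c(blogs_sample, twitter_sample, news_sample)
toks <- tokens(all_text,
               remove_numbers = TRUE,
               remove_punct   = TRUE,
               remove_symbols = TRUE,
               split_hyphens  = TRUE)
toks <- tokens_remove(toks, pattern = "#*")  # drop Twitter hashtags
toks <- tokens_tolower(toks)

# Word frequencies from a document-feature matrix
dfmat <- dfm(toks)
topfeatures(dfmat, 20)                      # twenty most frequent words
topfeatures(dfmat, 20, decreasing = FALSE)  # twenty least frequent words
```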
Top twenty words from the blogs, Twitter, and news text combined
## the to and a i of in you is for
## 293907 192625 160354 157797 151466 130069 102763 85508 81501 77211
## that it on my with this was be have at
## 71939 71343 58042 56586 48008 43434 41658 40480 39917 37503
Twenty least frequent words from the blogs, Twitter, and news text combined
## tranquilpc.co.uk worshipful midvalley lettra
## 1 1 1 1
## flatcards lehavdil sportingly accrington
## 1 1 1 1
## yeovil lightshade tienamos quantas
## 1 1 1 1
## ningun grabbers jeering boozo
## 1 1 1 1
## ruminant gretch's gretch yardarm
## 1 1 1 1
The least frequent words show that not every number was removed (some were attached to letters) and that many misspelled or unusual words appear in the text.
The datasets were examined further to compare word frequencies for each separate dataset, as sketched below.
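One way the per-dataset comparison might be produced (a sketch; object names are assumptions):

```r
# Top ten words for each source, tokenized and cleaned separately
samples <- list(blogs = blogs_sample, twitter = twitter_sample, news = news_sample)
top_words <- lapply(samples, function(x) {
  toks_x <- tokens_tolower(tokens(x, remove_numbers = TRUE, remove_punct = TRUE))
  topfeatures(dfm(toks_x), 10)
})
top_words
```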
The datasets contain mostly the same frequent words, with only three differences in their top tens: only the blogs contain the word "it", only the news contains the word "on", and only Twitter contains the word "you".
Again, the datasets contain mostly the same frequent phrases: only five differ between the datasets, and two phrases are extremely common in all three.
As expected, the number of distinct phrases per dataset increases with phrase length. Two phrases are extremely common in both the blogs and news datasets, and one phrase that is extremely common in the Twitter dataset did not crack the top ten of the other two.
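A sketch of how the 2-gram and 3-gram comparisons above might be computed with quanteda's tokens_ngrams() (the exact settings used in the report are not shown; `samples` is the named list of text samples from the earlier sketch):

```r
# Top ten 2-grams and 3-grams per source, built from cleaned tokens
top_ngrams <- lapply(samples, function(x) {
  toks_x <- tokens_tolower(tokens(x, remove_numbers = TRUE, remove_punct = TRUE))
  list(bigrams  = topfeatures(dfm(tokens_ngrams(toks_x, n = 2)), 10),
       trigrams = topfeatures(dfm(tokens_ngrams(toks_x, n = 3)), 10))
})
top_ngrams
```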
The next step is to create a model that predicts the next word in a phrase. After examining the datasets, it is clear that many words are common to all three, but differences emerge as the phrase length increases. The main things to consider when making the prediction will be: