Three text datasets will be used to understand the distribution of, and relationships between, words and phrases. This will be accomplished through exploratory analysis of the text in these datasets, as described below.
The tm, NLP, dplyr, quanteda (version 0.9.9.65), and ggplot2 packages are used for this analysis; quanteda reports using 7 of 8 cores for parallel computing.
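The setup chunk itself is not shown; a minimal sketch of the package loading implied by the startup messages:

```r
# Load the text-mining and plotting packages used throughout the report
# (tm attaches NLP as a dependency)
library(tm)
library(dplyr)
library(quanteda)
library(ggplot2)
```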
The blog, Twitter, and news data are read in for text-mining analysis.
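The read-in chunk is not shown in the report. A sketch, assuming the standard en_US.blogs.txt, en_US.twitter.txt, and en_US.news.txt file names in the working directory (only en_US.news.txt and the UTF-8 encoding are confirmed by the warning below; the other arguments and object names are assumptions):

```r
# Read each corpus as a character vector, one element per line of text
blogs   <- readLines(con <- file("./en_US.blogs.txt"),   encoding = "UTF-8", skipNul = TRUE); close(con)
twitter <- readLines(con <- file("./en_US.twitter.txt"), encoding = "UTF-8", skipNul = TRUE); close(con)
news    <- readLines(con <- file("./en_US.news.txt"),    encoding = "UTF-8", skipNul = TRUE); close(con)

# Line counts reported below
length(blogs); length(twitter); length(news)
```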
## Warning in readLines(con <- file("./en_US.news.txt"), encoding = "UTF-8", :
## incomplete final line found on './en_US.news.txt'
Number of blog lines:
## [1] 899288
Number of Twitter lines:
## [1] 2360148
Number of news lines:
## [1] 77259
Since the initial datasets contain a large number of lines of text, a random sample of 10% of each dataset will be used for further analysis. The following summary shows the number of lines per dataset after taking the random sample.
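The sampling chunk is not shown; a sketch assuming a simple sample() of 10% of the lines (the seed and object names are assumptions):

```r
set.seed(1234)  # assumed seed, for reproducibility

# floor() gives whole-line sample sizes matching the counts reported below
blogs_sample   <- sample(blogs,   floor(length(blogs)   * 0.10))
twitter_sample <- sample(twitter, floor(length(twitter) * 0.10))
news_sample    <- sample(news,    floor(length(news)    * 0.10))

length(blogs_sample); length(twitter_sample); length(news_sample)
```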
Number of blog sample lines:
## [1] 89928
Number of Twitter sample lines:
## [1] 236014
Number of news sample lines:
## [1] 7725
The text will be broken up into meaningful units of text (tokens). To clean up the dataset further, numbers, punctuation, hyphens, and Twitter hashtags will be removed. The datasets are then examined for word and phrase frequencies, which are plotted below; a sketch of the tokenization step follows.
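The tokenization chunk is not shown in the report. A sketch using quanteda, written against the current tokens() interface (argument names differed slightly in the 0.9.9.x release used here, and the object names are assumptions):

```r
# Combine the three samples and tokenize, removing numbers, punctuation,
# and symbols, and splitting hyphenated words
all_text <- c(blogs_sample, twitter_sample, news_sample)
toks <- tokens(all_text,
               remove_numbers = TRUE,
               remove_punct   = TRUE,
               remove_symbols = TRUE,
               split_hyphens  = TRUE)
toks <- tokens_remove(toks, pattern = "#*")  # drop Twitter hashtags
toks <- tokens_tolower(toks)

# Word frequencies from a document-feature matrix
dfmat <- dfm(toks)
topfeatures(dfmat, 20)                      # twenty most frequent words
topfeatures(dfmat, 20, decreasing = FALSE)  # twenty least frequent words
```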
Top twenty words from the blogs, Twitter, and news text combined
## the to and a i of in you is for
## 293907 192625 160354 157797 151466 130069 102763 85508 81501 77211
## that it on my with this was be have at
## 71939 71343 58042 56586 48008 43434 41658 40480 39917 37503
Twenty least frequent words from the blogs, Twitter, and news text combined
## tranquilpc.co.uk worshipful midvalley lettra
## 1 1 1 1
## flatcards lehavdil sportingly accrington
## 1 1 1 1
## yeovil lightshade tienamos quantas
## 1 1 1 1
## ningun grabbers jeering boozo
## 1 1 1 1
## ruminant gretch's gretch yardarm
## 1 1 1 1
The least frequent words show that not every number was removed (some were attached to letters) and that many misspelled or unusual words appear in the text.
The datasets were examined further to compare word frequencies for each separate dataset, as sketched below.
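One way the per-dataset comparison might be produced (a sketch; object names are assumptions):

```r
# Top ten words for each source, tokenized and cleaned separately
samples <- list(blogs = blogs_sample, twitter = twitter_sample, news = news_sample)
top_words <- lapply(samples, function(x) {
  toks_x <- tokens_tolower(tokens(x, remove_numbers = TRUE, remove_punct = TRUE))
  topfeatures(dfm(toks_x), 10)
})
top_words
```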
The datasets contain mostly the same frequent words, with only three differences in their top tens: only the blogs contain the word "it", only the news contains the word "on", and only Twitter contains the word "you".
Again, the datasets contain mostly the same frequent phrases: only five differ between the datasets, and two phrases are extremely common in all three.
As expected, the number of distinct phrases per dataset increases with phrase length. Two phrases are extremely common in both the blogs and news datasets, and one phrase that is extremely common in the Twitter dataset did not crack the top ten of the other two.
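A sketch of how the 2-gram and 3-gram comparisons above might be computed with quanteda's tokens_ngrams() (the exact settings used in the report are not shown; `samples` is the named list of text samples from the earlier sketch):

```r
# Top ten 2-grams and 3-grams per source, built from cleaned tokens
top_ngrams <- lapply(samples, function(x) {
  toks_x <- tokens_tolower(tokens(x, remove_numbers = TRUE, remove_punct = TRUE))
  list(bigrams  = topfeatures(dfm(tokens_ngrams(toks_x, n = 2)), 10),
       trigrams = topfeatures(dfm(tokens_ngrams(toks_x, n = 3)), 10))
})
top_ngrams
```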
The next step is to create a model that predicts the next word in a phrase. After examining the datasets, it is clear that many words are common to all three, but differences emerge as the phrase length increases. The main things to consider when making the prediction will be: