Jenifer PK & Srinath KS
30th August 2015
ABOUT THE DATASET:
The project we have selected for this course is “TEXT PREDICTION - EXPLORATORY ANALYSIS ON SWIFTKEY DATASETS”. The HC Corpora dataset is compiled from the output of numerous news sites, blogs and Twitter feeds. It contains three files (blogs, news and Twitter) for each of four languages (Russian, Finnish, German and English); this project focuses on the English-language files. The names of the data files are as follows:
en_US.blogs.txt
en_US.twitter.txt
en_US.news.txt
The datasets will be referred to as “Blogs”, “Twitter” and “News” for the remainder of this report.
SUMMARY & DESCRIPTION OF THE DATASET:
| Dataset | Words | Characters | Letters | Lines | Avg Word Length | Avg Words/Line |
|---------|-------|------------|---------|-------|-----------------|----------------|
| Blogs | 37,242,000 | 206,824,000 | 163,815,000 | 899,000 | 4.40 | 41.41 |
| News | 34,275,000 | 203,223,000 | 162,803,000 | 1,010,000 | 4.75 | 33.93 |
| Twitter | 29,876,000 | 162,122,000 | 125,998,000 | 2,360,000 | 4.22 | 12.66 |
| Total | 101,393,000 | 572,170,000 | 452,617,000 | 4,269,000 | 4.46 | 23.75 |
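Counts of this kind can be reproduced in R along the following lines (a minimal sketch; the exact word-splitting and "letters" rules behind the table above are our assumptions):
# Basic counts for a character vector of lines: words are whitespace-separated
# tokens, and "Letters" counts alphabetic characters only.
lineStats <- function(lines) {
  words <- unlist(strsplit(lines, "\\s+"))
  words <- words[nchar(words) > 0]
  c(Words        = length(words),
    Characters   = sum(nchar(lines)),
    Letters      = sum(nchar(gsub("[^A-Za-z]", "", lines))),
    Lines        = length(lines),
    AvgWordLen   = round(mean(nchar(words)), 2),
    AvgWordsLine = round(length(words) / length(lines), 2))
}
# e.g. lineStats(blogs), once the files have been read in below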
TASK:
Explore the data
Profanity filtering - removing profanity and other words you do not want to predict
Tokenization - identifying appropriate tokens such as words, punctuation, and numbers
Train a natural language processing (NLP) model on the data
Build a Shiny app with the model
The task is to build a predictive model that predicts the next word when a user types a word or phrase, similar to how Google suggests what you want to search for based on the most popular search terms.
Using the SwiftKey Twitter, news and blog files in English, we create a data product that predicts the next word. The blog and news datasets are each approximately 200 megabytes, and the Twitter dataset is approximately 160 megabytes.
In this project we apply Natural Language Processing (NLP) and text-mining tools in R, both for exploratory data analysis and for the subsequent text modelling and prediction.
DATASET DOWNLOAD:
We downloaded the dataset for this project from the web and unzipped it into our working directory using the following code:
fileName = "Coursera-SwiftKey.zip"
if(!file.exists(fileName)){
    #Download the dataset
    download.file(url = "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
                  destfile = fileName)
    Download_Date <- Sys.time()
    Download_Date
    #Unzip the dataset
    unzip(zipfile = fileName, overwrite = TRUE)
}else{
    print("Dataset is already downloaded!")
}
## [1] "Dataset is already downloaded!"
PACKAGES TO BE LOADED:
library(NLP)
library(openNLP)
library(tm)
library(RWeka)
library(qdapDictionaries)
library(qdapRegex)
library(qdapTools)
library(RColorBrewer)
library(qdap)
library(stringr)
library(ggplot2)
library(SnowballC)
library(wordcloud)
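If any of these packages are not yet installed, they can be installed beforehand (a one-off setup step; note that RWeka additionally requires a working Java installation):
pkgs <- c("NLP", "openNLP", "tm", "RWeka", "qdap", "qdapDictionaries",
          "qdapRegex", "qdapTools", "RColorBrewer", "stringr",
          "ggplot2", "SnowballC", "wordcloud")
missing <- pkgs[!pkgs %in% rownames(installed.packages())]
if (length(missing) > 0) install.packages(missing)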
No of Characters & Lines in Blogs
connection = file ("file:///E:/LIBA_EDBA/PL/EDBA- PL-1st Sem-Project/en_US/en_US.blogs.txt", "r")
blogs = readLines (connection, n = -1, encoding = "UTF-8")
close (connection)
nCharBlog = sum(nchar(blogs)); lenBlog = length(blogs)
nCharBlog;lenBlog
## [1] 206824505
## [1] 899288
No of Characters & Lines in News
connection = file ("file:///E:/LIBA_EDBA/PL/EDBA- PL-1st Sem-Project/en_US/en_US.news.txt", "r")
news = readLines (connection, n = -1, encoding = "UTF-8")
close (connection)
nCharNews = sum(nchar(news)); lenNews = length(news)
nCharNews;lenNews
## [1] 15639408
## [1] 77259
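The line count reported here (77,259) is well below the roughly 1,010,000 lines shown in the summary table, because reading en_US.news.txt in text mode stops at an embedded control character on some platforms. If the full file is needed, a common workaround is to open the connection in binary mode (a sketch, assuming the same local path):
connection = file ("file:///E:/LIBA_EDBA/PL/EDBA- PL-1st Sem-Project/en_US/en_US.news.txt", "rb")
news = readLines (connection, n = -1, encoding = "UTF-8", skipNul = TRUE)
close (connection)
length (news)   # should be close to the line count in the summary table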
No of Characters & Lines in Twitter
connection = file ("file:///E:/LIBA_EDBA/PL/EDBA- PL-1st Sem-Project/en_US/en_US.twitter.txt", "r")
tweets = readLines (connection, n = -1, encoding = "UTF-8")
close (connection)
nCharTweets = sum(nchar(tweets)); lenTweets = length (tweets)
nCharTweets;lenTweets
## [1] 162096031
## [1] 2360148
READ & SUMMARIZE DATA:
linesToRead = 500
connection = file ("file:///E:/LIBA_EDBA/PL/EDBA- PL-1st Sem-Project/en_US/en_US.blogs.txt", "r")
blogs = readLines (connection, n = -1, encoding = "UTF-8")
close (connection)
nCharBlog = nchar (blogs); lenBlog = length (blogs)
connection = file ("file:///E:/LIBA_EDBA/PL/EDBA- PL-1st Sem-Project/en_US/en_US.blogs.txt", "r")
blogs = readLines (connection, n = linesToRead, encoding = "UTF-8")
close (connection)
connection = file ("file:///E:/LIBA_EDBA/PL/EDBA- PL-1st Sem-Project/en_US/en_US.news.txt", "r")
news = readLines (connection, n = -1, encoding = "UTF-8")
close (connection)
nCharNews = nchar (news); lenNews = length (news)
connection = file ("file:///E:/LIBA_EDBA/PL/EDBA- PL-1st Sem-Project/en_US/en_US.news.txt", "r")
news = readLines (connection, n = linesToRead, encoding = "UTF-8")
close (connection)
connection = file ("file:///E:/LIBA_EDBA/PL/EDBA- PL-1st Sem-Project/en_US/en_US.twitter.txt", "r")
tweets = readLines (connection, n = -1, encoding = "UTF-8")
close (connection)
nCharTweets = nchar (tweets); lenTweets = length (tweets)
connection = file ("file:///E:/LIBA_EDBA/PL/EDBA- PL-1st Sem-Project/en_US/en_US.twitter.txt", "r")
tweets = readLines (connection, n = linesToRead, encoding = "UTF-8")
close (connection)
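The same open-read-close pattern is repeated for each file; a small helper function (not part of the original script, shown only to illustrate how the duplication could be reduced) would look like this:
readCorpusFile <- function(path, n = -1) {
  con <- file(path, "r")
  on.exit(close(con))          # make sure the connection is closed even on error
  readLines(con, n = n, encoding = "UTF-8")
}
# e.g. blogs <- readCorpusFile("en_US.blogs.txt", n = linesToRead)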
DATA PROCESSING - TOKENIZATION AND PROFANITY FILTERING
REMOVE RETWEETS
tweets = gsub ("(RT|via)((?:\\b\\W*@\\w+)+)", "", tweets)
# Remove @people
tweets = gsub ("@\\w+", "", tweets)
Sampledata =c(blogs,news,tweets)
Replace abbreviations so that sentences are not split at incorrect places, then split the text paragraphs into sentences:
Sampledata = replace_abbreviation (Sampledata)
Sampledata <- sent_detect(Sampledata, language = "en", model = NULL)
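As a quick illustration of why abbreviation replacement comes before sentence detection, consider a toy string (the exact output depends on the qdap abbreviation dictionary, so none is shown here):
example <- "Dr. Smith arrived at 5 p.m. He was very late."
replace_abbreviation (example)                 # expands abbreviations such as "Dr."
sent_detect (replace_abbreviation (example))   # now splits only at real sentence ends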
Cleaning the Data:
We first reduce the dataset by taking a 10% random sample of the initial corpus, so that it can be handled easily during exploration; afterwards, some basic text-mining preprocessing is applied. One way to draw such a sample is sketched below.
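A minimal sampling sketch (the seed is an arbitrary choice, used only so the sample is reproducible):
set.seed(1234)    # arbitrary seed for reproducibility
Sampledata <- sample(Sampledata, size = round(0.10 * length(Sampledata)))
The preprocessing steps that follow are then applied to this reduced sample: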
corpusdata <- VCorpus (VectorSource (Sampledata)) # main corpus with all sample files
removeURL = function(x) gsub("http\\S+", "", x)   # drop URLs before other cleaning
corpusdata <- tm_map(corpusdata, content_transformer(removeURL))
corpusdata <- tm_map(corpusdata, removeNumbers)
corpusdata <- tm_map(corpusdata, stripWhitespace)
corpusdata <- tm_map(corpusdata, content_transformer(tolower))
corpusdata <- tm_map(corpusdata, removePunctuation)
corpusdata <- tm_map(corpusdata, removeWords, stopwords("english"))
REMOVE PROFANITY:
profanityFileName = "profanity.txt"
if (!file.exists(profanityFileName)) download.file(url = "http://pattern-for-python.googlecode.com/svn-history/r20/trunk/pattern/vector/wordlists/profanity.txt",
destfile = profanityFileName)
profanityWords = str_trim(as.character(read.table(profanityFileName, sep = ",",
stringsAsFactors = FALSE)))
corpusdata = tm_map(corpusdata, removeWords, profanityWords)
Finally, remove all the white space created by the removals:
corpusdata = tm_map(corpusdata, stripWhitespace)
TOKENIZATION
The corpus is tokenized into 1-, 2- and 3-grams, and term-document matrices are created to understand the frequency of words and phrases. This information provides further help with the modelling.
We use the RWeka package for the 1-gram (single-word), 2-gram and 3-gram tokenization in the exploratory analysis below.
OneGramTokenizer = function (corpus) {
NGramTokenizer (corpus, Weka_control (min = 1, max = 1))
}
TwoGramTokenizer = function (corpus) {
NGramTokenizer (corpus, Weka_control (min = 2, max = 2))
}
ThreeGramTokenizer = function (corpus) {
NGramTokenizer (corpus, Weka_control (min = 3, max = 3))
}
tdmOneToken = TermDocumentMatrix (corpusdata,
                control = list (tokenize = OneGramTokenizer))
tdmOneToken <- removeSparseTerms(tdmOneToken,0.99)
tdmTwoToken = TermDocumentMatrix (corpusdata,
                control = list (tokenize = TwoGramTokenizer))
tdmTwoToken <- removeSparseTerms(tdmTwoToken,0.99)
tdmThreeToken = TermDocumentMatrix (corpusdata,
                control = list (tokenize = ThreeGramTokenizer))
tdmThreeToken <- removeSparseTerms(tdmThreeToken,0.99)
OneTokenTermFreq = sort (rowSums (as.matrix (tdmOneToken)), decreasing = TRUE)
OneTokenTermFreqPerc = 100 * (OneTokenTermFreq / sum (OneTokenTermFreq))
OneTokenTermFreqTopTwenty = head (OneTokenTermFreqPerc, 20)
TwoTokenTermFreq = sort (rowSums (as.matrix (tdmTwoToken)), decreasing = TRUE)
TwoTokenTermFreqPerc = 100 * (TwoTokenTermFreq / sum (TwoTokenTermFreq))
TwoTokenTermFreqTopTwenty = head (TwoTokenTermFreqPerc, 20)
ThreeTokenTermFreq = sort (rowSums (as.matrix (tdmThreeToken)), decreasing = TRUE)
ThreeTokenTermFreqPerc = 100 * (ThreeTokenTermFreq / sum (ThreeTokenTermFreq))
ThreeTokenTermFreqTopTwenty = head (ThreeTokenTermFreqPerc, 20)
BOXPLOT OF 1-GRAM TERM LENGTHS:
boxplot(nchar(names(OneTokenTermFreq)))
The boxplot statistics show that 1-gram term lengths have a median of 7 characters (quartiles 5 and 8, whiskers from 3 to 12) across 9,096 terms; outliers extend up to 74 characters.
BOXPLOT OF 2-GRAM TERM LENGTHS:
boxplot(nchar(names(TwoTokenTermFreq)))
The boxplot statistics show that 2-gram term lengths have a median of 10 characters (quartiles 8 and 13, whiskers from 3 to 20) across 30,598 terms; outliers extend up to 79 characters.
BOXPLOT OF 3-GRAM TERM LENGTHS:
boxplot(nchar(names(ThreeTokenTermFreq)))
The boxplot statistics show that 3-gram term lengths have a median of 15 characters (quartiles 13 and 18, whiskers from 6 to 25) across 36,857 terms; outliers extend up to 84 characters.
MOST FREQUENT 1-GRAMS
qplot (names (OneTokenTermFreqTopTwenty), OneTokenTermFreqTopTwenty,
main = "One Token Term Frequency: Top Twenty in Percentage",
geom = "bar", stat = "identity",
xlab = "Word", ylab = "% of all terms") + theme (
axis.text.x = element_text (angle = 45))
MOST FREQUENT 2-GRAMS
qplot (names (TwoTokenTermFreqTopTwenty), TwoTokenTermFreqTopTwenty,
main = "Most Frequent 2-Grams",
geom = "bar", stat = "identity",
xlab = "Phrase", ylab = "% of all terms") + theme (
axis.text.x = element_text (angle = 45))
MOST FREQUENT 3-GRAMS
qplot (names (ThreeTokenTermFreqTopTwenty), ThreeTokenTermFreqTopTwenty,
main = "Most Frequent 3-Grams",
geom = "bar", stat = "identity",
xlab = "Phrase", ylab = "% of all terms") + theme (
axis.text.x = element_text (angle = 45))
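The wordcloud package loaded earlier can also be used to visualise the most frequent single words (a sketch; the number of words and the colour palette are arbitrary choices):
wordcloud (words = names (OneTokenTermFreq), freq = OneTokenTermFreq,
           max.words = 100, random.order = FALSE,
           colors = brewer.pal (8, "Dark2"))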
CONCLUSION
After these initial steps, we can draw a few conclusions.
ANALYSIS
The data set is quite large.
It contains:
* Formal sources (news)
* Longer, informal sources (blogs)
* Short, more informal sources (Twitter)
Using the tm package, the large dataset needs to be subsampled to run effectively on a desktop PC.
A very simple predictor can be implemented using n-grams: predicting the next word based solely on the number of occurrences of two- or three-word phrases.
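As a first illustration of this idea, the sorted 2-gram and 3-gram frequency tables built above can already back a naive next-word lookup; the sketch below is only a starting point (a real model would add smoothing and back-off):
# Naive next-word prediction: find the most frequent n-gram that starts with
# the last one or two words typed, and return its final word.
predictNextWord <- function(phrase) {
  words <- tolower(strsplit(phrase, "\\s+")[[1]])
  hits <- character(0)
  if (length(words) >= 2) {
    lastTwo <- paste(tail(words, 2), collapse = " ")
    hits <- grep(paste0("^", lastTwo, " "), names(ThreeTokenTermFreq), value = TRUE)
  }
  if (length(hits) == 0) {
    hits <- grep(paste0("^", tail(words, 1), " "), names(TwoTokenTermFreq), value = TRUE)
  }
  if (length(hits) == 0) return(NA_character_)
  tail(strsplit(hits[1], " ")[[1]], 1)   # tables are sorted, so hits[1] is the most frequent match
}
# e.g. predictNextWord("thanks for the")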