Executive Summary

This report explores and analyzes three text files that will be used to build a word-prediction model. The three files come from www.corpora.heliohost.org and cover only the en_US locale. To speed up the initial analysis, only a sample of the original data set is used to build the document-feature matrix.

Reading in the data

library(tm)
## Loading required package: NLP
library(quanteda)
## 
## Attaching package: 'quanteda'
## 
## The following objects are masked from 'package:tm':
## 
##     as.DocumentTermMatrix, stopwords
## 
## The following object is masked from 'package:NLP':
## 
##     ngrams
## 
## The following object is masked from 'package:stats':
## 
##     df
## 
## The following object is masked from 'package:base':
## 
##     sample
con1 <- file("C:/Users/cynth/Documents/en_US/Data/en_US.blogs.txt", "r")
con2 <- file("C:/Users/cynth/Documents/en_US/Data/en_US.news.txt", "r")
con3 <- file("C:/Users/cynth/Documents/en_US/Data/en_US.twitter.txt", "r")
blogsData <- readLines(con1)
newsData <- readLines(con2)
## Warning in readLines(con2): incomplete final line found on 'C:/Users/cynth/
## Documents/en_US/Data/en_US.news.txt'
twitterData <- readLines(con3)
## Warning in readLines(con3): line 167155 appears to contain an embedded nul
## Warning in readLines(con3): line 268547 appears to contain an embedded nul
## Warning in readLines(con3): line 1274086 appears to contain an embedded nul
## Warning in readLines(con3): line 1759032 appears to contain an embedded nul
close(con1)
close(con2)
close(con3)
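
The warnings above come from reading the files in text mode. The "incomplete final line" warning on the news file may also mean the read stopped before the real end of the file, since a text-mode read on Windows can treat an embedded control character as end-of-file. Below is a minimal sketch of one alternative, assuming that binary mode plus `skipNul = TRUE` is acceptable for these files. `read_text_lines` is a hypothetical helper and is not part of the original analysis.

```r
# Hypothetical helper: read a text file in binary mode and skip embedded nuls.
read_text_lines <- function(path) {
  con <- file(path, open = "rb")   # binary mode: no text-mode end-of-file handling
  on.exit(close(con))              # close the connection even if readLines() fails
  readLines(con, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
}

# Self-contained demonstration on a temporary file that contains an embedded nul
# and has no trailing newline.
tmp <- tempfile(fileext = ".txt")
writeBin(c(charToRaw("first line\nsec"), as.raw(0), charToRaw("ond line")), tmp)
demo_lines <- read_text_lines(tmp)
print(demo_lines)
```

If this approach is used on the real files, it is worth comparing `length()` of each result with the line counts reported in the summary table.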

Let’s take a look at a summary of each data set. Only the first 10 documents of blogsData are shown below.

summary(blogsData,10)
##      Text Types Tokens Sentences
## 1   text1    20     22         1
## 2   text2     6      7         1
## 3   text3   104    154         7
## 4   text4    36     43         1
## 5   text5    91    119         5
## 6   text6    13     13         1
## 7   text7     6      6         1
## 8   text8    55     67         3
## 9   text9    47     53         3
## 10 text10    96    154         7
str(blogsData)
##  chr [1:899288] "In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”." ...
str(newsData)
##  chr [1:77259] "He wasn't home alone, apparently." ...
str(twitterData)
##  chr [1:2360148] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long." ...
##      File TotalLines  FileSize TotalWords
## 1   blogs     899288 248.49350   37334441
## 2    news      77259  19.17972    2643972
## 3 twitter    2360148 301.39670   30373792
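
The code that produced the table above is not shown in the report. A minimal sketch of how a comparable table could be computed in base R follows. It assumes that words are whitespace-separated tokens and reports the in-memory size of the object in MB; the original FileSize column may have been computed differently. `summarise_lines` and the toy input are hypothetical.

```r
# Hypothetical helper: one summary row for a character vector of lines.
summarise_lines <- function(name, lines) {
  # Count whitespace-separated tokens per line; empty lines count as zero words.
  words_per_line <- lengths(strsplit(trimws(lines), "[[:space:]]+"))
  data.frame(File       = name,
             TotalLines = length(lines),
             SizeMB     = as.numeric(object.size(lines)) / 1024^2,
             TotalWords = sum(words_per_line),
             stringsAsFactors = FALSE)
}

# Toy input for illustration only
toy <- c("How are you?", "Fine", "")
toy_summary <- summarise_lines("toy", toy)
print(toy_summary)
```

Applied to the real data, the rows could be combined with `rbind(summarise_lines("blogs", blogsData), summarise_lines("news", newsData), summarise_lines("twitter", twitterData))`.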

Sampling the data

These are large files. A 5% random sample of each data set is used for further analysis, and the three samples are combined into one data set.

set.seed(1234)
blogsSubset <- sample(blogsData,length(blogsData)*0.05)
newsSubset <- sample(newsData,length(newsData)*0.05)
twitterSubset <- sample(twitterData,length(twitterData)*0.05)
OneData <- c(blogsSubset,newsSubset,twitterSubset)
head(OneData)
## [1] "#70...Babs"
## [2] "“I don’t know. Maybe they’re getting too much sun. I think I’m going to cut them way back.” I replied."
## [3] "The reason could be anything. Maybe you violated some arcane, meaningless regulation among the hundreds of thousands of pages of US Code (ignorance of the law is NOT an excuse!). Maybe you were at the wrong place at the wrong time. Or maybe they had no real reason at all other than mere suspicion."
## [4] "Last but certainly far from least, I want to talk about the magnetic triggers that was mentioned yesterday. I had seen for a couple of weeks various people just waking up one day and walking out of their lives. I had not talked about it because it was really strange. It looked almost zombie like… blank stares just leaving. I had no clue where they were going, I was too transfixed on the blank facial expressions… some even had older children along side of them, equally with the same blank look on their face. I am sure, if I had really looked at the expression on my own face as I moved out of my family’s life to New Mexico, I would have looked the same. Had no clue why I was doing it, or what would happen…. I just had to go. I am more than grateful that I did!!"
## [5] "I think I can believe that, though it’s hard"
## [6] "Josef Strauss: Delirien waltz"
OneData <- iconv(OneData, 'UTF-8', 'ASCII', "byte")
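
With `sub = "byte"`, `iconv` writes every non-ASCII byte as a hex escape such as `<e2>`, and those escapes later get tokenized like ordinary text. A minimal sketch of one alternative follows: map the common curly punctuation marks to ASCII first, then drop whatever still cannot be converted. `to_ascii` and its replacement list are hypothetical and would need extending for real data.

```r
# Hypothetical helper: transliterate common curly punctuation, then drop the rest.
to_ascii <- function(x) {
  x <- enc2utf8(x)
  from <- c("\u2018", "\u2019", "\u201c", "\u201d", "\u2026")  # curly quotes, ellipsis
  to   <- c("'",      "'",      "\"",     "\"",     "...")
  for (i in seq_along(from)) {
    x <- gsub(from[i], to[i], x, fixed = TRUE)
  }
  # sub = "" removes any bytes that still have no ASCII equivalent
  iconv(x, from = "UTF-8", to = "ASCII", sub = "")
}

cleaned <- to_ascii(c("I don\u2019t know", "caf\u00e9"))
print(cleaned)
```

The trade-off is that accented letters are dropped rather than transliterated, which may or may not matter for an English-only corpus.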

Build the corpus

myCorpus <- corpus(OneData)
summary(myCorpus,n=10)
## Corpus consisting of 166833 documents, showing 10 documents.
## 
##    Text Types Tokens Sentences
##   text1     3      5         2
##   text2    28     72         3
##   text3    47     62         4
##   text4   110    204         9
##   text5    15     20         1
##   text6     5      5         1
##   text7    36     44         1
##   text8     1      1         1
##   text9    47     67         5
##  text10    10     13         1
## 
## Source:  C:/Users/cynth/OneDrive/Training/Data Science Certification/Capstone/* on x86-64 by cynth
## Created: Sun Mar 20 10:16:18 2016
## Notes:

Create Document-feature matrix

The ‘quanteda’ package is used to clean the data and create the document-feature matrix. Preprocessing includes lowercasing, stopword removal, and stemming.

myDfm <- dfm(myCorpus, ignoredFeatures = c("鈥","檚",stopwords("english")), stem = TRUE)
## Creating a dfm from a corpus ...
##    ... lowercasing
##    ... tokenizing
##    ... indexing documents: 166,833 documents
##    ... indexing features: 112,235 feature types
##    ... removed 174 features, from 176 supplied (glob) feature types
##    ... stemming features (English), trimmed 27364 feature variants
##    ... created a 166833 x 84697 sparse dfm
##    ... complete. 
## Elapsed time: 16.4 seconds.
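
To make the idea of a document-feature matrix concrete, here is a minimal base R sketch on toy data. It only illustrates the structure (one row per document, one column per feature, counts in the cells) and is not how quanteda implements it. The documents and the two-word stopword list are hypothetical, and stemming is left out.

```r
# Toy documents and a tiny stopword list (both hypothetical)
docs  <- c(d1 = "The cat sat on the mat", d2 = "The dog chased the cat")
stops <- c("the", "on")

# Lowercase, split on non-letters, then drop empty strings and stopwords
toks <- lapply(strsplit(tolower(docs), "[^a-z]+"),
               function(w) w[nzchar(w) & !(w %in% stops)])

# Features are the sorted unique tokens; count each feature in each document
features <- sort(unique(unlist(toks)))
counts   <- sapply(toks, function(w) as.integer(table(factor(w, levels = features))))

# Transpose so that rows are documents and columns are features
toy_dfm <- t(counts)
colnames(toy_dfm) <- features
print(toy_dfm)

# Most frequent features across all documents
toy_top <- sort(colSums(toy_dfm), decreasing = TRUE)
print(toy_top)
```

A real corpus needs a sparse representation, because most cells are zero, and that is what the dfm created above provides.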

Top features

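Below is the table of the top 20 features. The entries e2, 9c, and 9d are not words: they are fragments of the byte escapes (such as <e2><80><9c>) that the earlier iconv call with "byte" substitution produces for curly quotes and other non-ASCII characters.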

topfeatures(myDfm, 20)
##    e2  just   get  like   one  will    go   can     s  time  love   day 
## 42387 12767 12368 12149 11254 11047 10739 10474  9958  9902  9423  9072 
##  make  good  know thank   now    9c    9d   see 
##  7932  7803  7793  7612  7325  7050  7048  6794

We can also visualize the most frequent features.

## Loading required package: RColorBrewer
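
A minimal sketch of one way to visualize these frequencies, assuming a base R bar chart rather than whatever plotting function the report used. The counts are copied from the first ten entries of the topfeatures() output above.

```r
# Counts copied from the topfeatures() output (first ten entries)
top <- c(e2 = 42387, just = 12767, get = 12368, like = 12149, one = 11254,
         will = 11047, go = 10739, can = 10474, s = 9958, time = 9902)

# Horizontal bars, most frequent feature at the top
barplot(rev(top), horiz = TRUE, las = 1,
        main = "Most frequent features (5% sample)",
        xlab = "Count")
```

With a dfm in hand, `top` could instead be taken directly from `topfeatures(myDfm, 10)`.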

Plan for prediction model

N-gram models will be used to build the prediction model. The predicted next word will be chosen by its conditional probability given the previous n words.

1. Use the three text files to build a dictionary that covers at least 50% of the in-sample scenarios.
2. Calculate the conditional probabilities using the dictionary (2-gram and 3-gram models).
3. Assign a small probability to words not covered in the dictionary.
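
The plan can be sketched for the 2-gram case in base R. This is a minimal illustration under stated assumptions, not the final model: the training sentences are hypothetical, `cond_prob` and `predict_next` are hypothetical helpers, and the fixed floor of 1e-6 for unseen pairs is a stand-in for a proper smoothing method (the resulting probabilities are not renormalized).

```r
# Toy training text (hypothetical)
train <- c("i love you", "i love coffee", "i like tea")

# Collect (previous word, next word) pairs from every sentence
words <- strsplit(tolower(train), "[[:space:]]+")
bigrams <- do.call(rbind, lapply(words, function(w) {
  if (length(w) < 2) return(NULL)
  data.frame(prev = head(w, -1), nxt = tail(w, -1), stringsAsFactors = FALSE)
}))

# Bigram count table: rows are previous words, columns are next words
counts <- table(bigrams$prev, bigrams$nxt)

# P(nxt | prev) = count(prev, nxt) / count(prev, anything);
# unseen pairs get a small fixed probability instead of zero.
cond_prob <- function(prev, nxt, floor_prob = 1e-6) {
  seen <- prev %in% rownames(counts) &&
          nxt %in% colnames(counts) &&
          counts[prev, nxt] > 0
  if (!seen) return(floor_prob)
  as.numeric(counts[prev, nxt] / sum(counts[prev, ]))
}

# Predict the most likely next word, or NA if the previous word is unknown
predict_next <- function(prev) {
  if (!(prev %in% rownames(counts))) return(NA_character_)
  names(which.max(counts[prev, ]))
}

print(predict_next("i"))
print(cond_prob("i", "love"))
print(cond_prob("i", "pizza"))
```

A 3-gram version would use the previous two words as the row key and could fall back to the 2-gram table when the pair has not been seen.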