This report explores and analyzes three text files that will be used to build the prediction model. The files come from www.corpora.heliohost.org and cover only the en_US locale. To speed up the initial analysis, only a sample of the original data set is used to build the document-feature matrix.
library(tm)
## Loading required package: NLP
library(quanteda)
##
## Attaching package: 'quanteda'
##
## The following objects are masked from 'package:tm':
##
## as.DocumentTermMatrix, stopwords
##
## The following object is masked from 'package:NLP':
##
## ngrams
##
## The following object is masked from 'package:stats':
##
## df
##
## The following object is masked from 'package:base':
##
## sample
con1 <- file("C:/Users/cynth/Documents/en_US/Data/en_US.blogs.txt", "r")
con2 <- file("C:/Users/cynth/Documents/en_US/Data/en_US.news.txt", "r")
con3 <- file("C:/Users/cynth/Documents/en_US/Data/en_US.twitter.txt", "r")
blogsData <- readLines(con1)
newsData <- readLines(con2)
## Warning in readLines(con2): incomplete final line found on 'C:/Users/cynth/
## Documents/en_US/Data/en_US.news.txt'
twitterData <- readLines(con3)
## Warning in readLines(con3): line 167155 appears to contain an embedded nul
## Warning in readLines(con3): line 268547 appears to contain an embedded nul
## Warning in readLines(con3): line 1274086 appears to contain an embedded nul
## Warning in readLines(con3): line 1759032 appears to contain an embedded nul
close(con1)
close(con2)
close(con3)
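The warnings above (incomplete final line, embedded nul) come from reading the files through text-mode connections. A minimal sketch of a more defensive reader follows. The helper name `read_text_file` is hypothetical and not part of the original report. It assumes the files are UTF-8 encoded and that dropping embedded nul characters is acceptable. Opening the connection in binary mode is a common suggestion on Windows, where text mode may stop early at certain control characters, so it may be worth comparing line counts between the two approaches.

```r
# Hypothetical helper (an assumption, not the report's code): read a text file
# through a binary connection, skip embedded nuls, and mark strings as UTF-8.
read_text_file <- function(path) {
  con <- file(path, open = "rb")   # binary mode
  on.exit(close(con))              # always release the connection
  readLines(con, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
}

# Self-contained demo on a temporary file instead of the real corpus files
tmp <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line"), tmp)
demo_lines <- read_text_file(tmp)
length(demo_lines)
```

With the real data, the call would take the same paths used above, for example the path passed to `con1`.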
Let’s take a look at a summary of each data set. Only the first 10 documents of blogsData are shown below.
summary(blogsData,10)
## Text Types Tokens Sentences
## 1 text1 20 22 1
## 2 text2 6 7 1
## 3 text3 104 154 7
## 4 text4 36 43 1
## 5 text5 91 119 5
## 6 text6 13 13 1
## 7 text7 6 6 1
## 8 text8 55 67 3
## 9 text9 47 53 3
## 10 text10 96 154 7
str(blogsData)
## chr [1:899288] "In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”." ...
str(newsData)
## chr [1:77259] "He wasn't home alone, apparently." ...
str(twitterData)
## chr [1:2360148] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long." ...
## File TotalLines FileSize TotalWords
## 1 blogs 899288 248.49350 37334441
## 2 news 77259 19.17972 2643972
## 3 twitter 2360148 301.39670 30373792
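The report does not show the code that produced the table above, and the unit of FileSize is not stated. The sketch below shows one way such a table could be assembled. It assumes FileSize is the in-memory size in megabytes and TotalWords is a whitespace-delimited word count; both are guesses, and `summarise_text` is a hypothetical helper name.

```r
# Hypothetical helper: one summary row per data set.
# Assumptions: FileSize = in-memory object size in MB,
#              TotalWords = number of whitespace-separated tokens.
summarise_text <- function(name, lines) {
  words_per_line <- lengths(strsplit(trimws(lines), "\\s+"))
  data.frame(
    File       = name,
    TotalLines = length(lines),
    FileSize   = as.numeric(object.size(lines)) / 1024^2,
    TotalWords = sum(words_per_line),
    stringsAsFactors = FALSE
  )
}

# Toy input standing in for blogsData, newsData and twitterData
toy_summary <- rbind(
  summarise_text("toyA", c("a b c", " d e ")),
  summarise_text("toyB", c("one two"))
)
toy_summary
```

Applied to the real objects, the calls would be `summarise_text("blogs", blogsData)` and so on; the resulting numbers need not match the table above if the original used different definitions.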
These are large files. For further analysis, draw a 5% sample from each data set and combine the three samples into one data set.
set.seed(1234)
blogsSubset <- sample(blogsData,length(blogsData)*0.05)
newsSubset <- sample(newsData,length(newsData)*0.05)
twitterSubset <- sample(twitterData,length(twitterData)*0.05)
OneData <- c(blogsSubset,newsSubset,twitterSubset)
head(OneData)
## [1] "#70...Babs"
## [2] "“I don’t know. Maybe they’re getting too much sun. I think I’m going to cut them way back.” I replied."
## [3] "The reason could be anything. Maybe you violated some arcane, meaningless regulation among the hundreds of thousands of pages of US Code (ignorance of the law is NOT an excuse!). Maybe you were at the wrong place at the wrong time. Or maybe they had no real reason at all other than mere suspicion."
## [4] "Last but certainly far from least, I want to talk about the magnetic triggers that was mentioned yesterday. I had seen for a couple of weeks various people just waking up one day and walking out of their lives. I had not talked about it because it was really strange. It looked almost zombie like… blank stares just leaving. I had no clue where they were going, I was too transfixed on the blank facial expressions… some even had older children along side of them, equally with the same blank look on their face. I am sure, if I had really looked at the expression on my own face as I moved out of my family’s life to New Mexico, I would have looked the same. Had no clue why I was doing it, or what would happen…. I just had to go. I am more than grateful that I did!!"
## [5] "I think I can believe that, though it’s hard"
## [6] "Josef Strauss: Delirien waltz"
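The sample sizes passed to sample() above are not whole numbers (for example 899288 * 0.05 = 44964.4), and R appears to truncate them. As a quick arithmetic check, the sketch below applies floor() to the line counts reported by str() earlier; the total agrees with the 166833 documents reported for the corpus further down. The variable names are illustrative only.

```r
# Line counts as reported by str() for the three data sets
n_lines <- c(blogs = 899288, news = 77259, twitter = 2360148)

# 5% of each, rounded down to a whole number of lines
n_sampled <- floor(n_lines * 0.05)
n_sampled

# Total number of sampled lines across the three sources
sum(n_sampled)
```

Writing the size explicitly, for example `sample(blogsData, floor(length(blogsData) * 0.05))`, makes the rounding visible in the code.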
OneData <- iconv(OneData, 'UTF-8', 'ASCII', "byte")
myCorpus <- corpus(OneData)
summary(myCorpus,n=10)
## Corpus consisting of 166833 documents, showing 10 documents.
##
## Text Types Tokens Sentences
## text1 3 5 2
## text2 28 72 3
## text3 47 62 4
## text4 110 204 9
## text5 15 20 1
## text6 5 5 1
## text7 36 44 1
## text8 1 1 1
## text9 47 67 5
## text10 10 13 1
##
## Source: C:/Users/cynth/OneDrive/Training/Data Science Certification/Capstone/* on x86-64 by cynth
## Created: Sun Mar 20 10:16:18 2016
## Notes:
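The iconv() call above passes "byte" as the substitution argument, so every non-ASCII byte is rewritten as a `<xx>` hex escape instead of being dropped. The small sketch below illustrates the effect on a curly-quoted word. It may explain why tokens such as e2, 9c and 9d show up among the most frequent features later in the report, although that link is an inference and not something the report states. The alternative shown, `sub = ""`, simply removes the non-ASCII characters; whether that is preferable depends on the goals of the analysis.

```r
# A curly-quoted word, written with Unicode escapes so the string is UTF-8
x <- "\u201cgods\u201d"

# Same call pattern as in the report: non-ASCII bytes become <xx> escapes
as_bytes <- iconv(x, "UTF-8", "ASCII", sub = "byte")
as_bytes

# Alternative (an assumption, not the report's choice): drop non-ASCII bytes
stripped <- iconv(x, "UTF-8", "ASCII", sub = "")
stripped
```

Each curly quote is three bytes in UTF-8, which is why a single character turns into three escapes.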
Use the ‘quanteda’ package to clean the data and create the document-feature matrix. Preprocessing includes lowercasing, stopword removal, stemming, etc.
myDfm <- dfm(myCorpus, ignoredFeatures = c("鈥","檚",stopwords("english")), stem = TRUE)
## Creating a dfm from a corpus ...
## ... lowercasing
## ... tokenizing
## ... indexing documents: 166,833 documents
## ... indexing features: 112,235 feature types
## ... removed 174 features, from 176 supplied (glob) feature types
## ... stemming features (English), trimmed 27364 feature variants
## ... created a 166833 x 84697 sparse dfm
## ... complete.
## Elapsed time: 16.4 seconds.
The table below lists the 20 most frequent features.
topfeatures(myDfm, 20)
## e2 just get like one will go can s time love day
## 42387 12767 12368 12149 11254 11047 10739 10474 9958 9902 9423 9072
## make good know thank now 9c 9d see
## 7932 7803 7793 7612 7325 7050 7048 6794
We can also visualize the most frequent features.
## Loading required package: RColorBrewer
N-gram models will be used to build the prediction model. The next word will be predicted from its probability conditional on the preceding words (n - 1 of them for an n-gram model).
1. Use the three text files to build a dictionary that covers at least 50% of the in-sample scenarios.
2. Calculate the conditional probabilities using the dictionary (2-gram and 3-gram models).
3. Assign a small probability to words not covered in the dictionary.
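The plan above can be sketched in base R on a toy corpus. This is only an illustration under stated assumptions, not the model the report will build: coverage is interpreted as the share of word instances, only the 2-gram case is shown, and the small probability for unseen words is implemented with add-k smoothing, which is one possible choice among several. All object and function names are hypothetical.

```r
# Toy corpus standing in for the sampled text (assumption, not the real data)
docs   <- c("i love you", "i love coffee", "you love coffee")
tokens <- strsplit(tolower(docs), "\\s+")

# Step 1: smallest dictionary covering at least 50% of all word instances
unigram    <- sort(table(unlist(tokens)), decreasing = TRUE)
coverage   <- cumsum(unigram) / sum(unigram)
dict_size  <- which(coverage >= 0.5)[1]
dictionary <- names(unigram)[seq_len(dict_size)]

# Step 2: bigram counts and maximum-likelihood estimates of P(next | prev)
make_bigrams <- function(w) {
  if (length(w) < 2) return(NULL)
  data.frame(prev = head(w, -1), nxt = tail(w, -1), stringsAsFactors = FALSE)
}
bigrams   <- do.call(rbind, lapply(tokens, make_bigrams))
counts    <- table(bigrams$prev, bigrams$nxt)
cond_prob <- prop.table(counts, margin = 1)

# Most likely word after "i" under the unsmoothed estimates
best_after_i <- names(which.max(cond_prob["i", ]))

# Step 3: add-k smoothing so unseen pairs get a small, non-zero probability
vocab <- names(unigram)
smoothed_prob <- function(prev, nxt, k = 0.01) {
  known_prev <- prev %in% rownames(counts)
  seen  <- if (known_prev && nxt %in% colnames(counts)) counts[prev, nxt] else 0
  total <- if (known_prev) sum(counts[prev, ]) else 0
  as.numeric((seen + k) / (total + k * length(vocab)))
}

p_seen   <- smoothed_prob("love", "coffee")  # pair observed in the toy corpus
p_unseen <- smoothed_prob("love", "i")       # pair never observed
```

With this form of smoothing, the probabilities for a given previous word still sum to one over the vocabulary. A 3-gram version would follow the same pattern with the two preceding words as the conditioning context.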