Instructions
The goal here is to build your first simple model for the relationship between words. This is the first step in building a predictive text mining application. You will explore simple models and discover more complicated modeling techniques.
Tasks to accomplish
Build basic n-gram model - using the exploratory analysis you performed, build a basic n-gram model for predicting the next word based on the previous 1, 2, or 3 words.
Build a model to handle unseen n-grams - in some cases people will want to type a combination of words that does not appear in the corpora. Build a model to handle cases where a particular n-gram isn't observed (a minimal sketch of both ideas follows below).
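As rough orientation, here is a minimal sketch of both ideas in R; the toy sentences and the predictNext() function are purely illustrative and are not part of the course data or any required API.
# Toy corpus; the real project uses the cleaned blogs/news/twitter samples
toy <- c("i love data science", "i love new york", "i hate waiting")
tokens <- strsplit(tolower(toy), "\\s+")
# Bigram counts: how often does word2 follow word1?
bigrams <- unlist(lapply(tokens, function(w) paste(head(w, -1), tail(w, -1))))
bigramFreq <- sort(table(bigrams), decreasing = TRUE)
# Unigram counts serve as the fallback for unseen combinations
unigramFreq <- sort(table(unlist(tokens)), decreasing = TRUE)
predictNext <- function(word) {
  hits <- bigramFreq[grep(paste0("^", word, " "), names(bigramFreq))]
  if (length(hits) > 0) sub(".* ", "", names(hits)[1]) else names(unigramFreq)[1]
}
predictNext("i")      # observed bigram: returns "love"
predictNext("zebra")  # unseen n-gram: falls back to the most frequent unigram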
The data set was downloaded from Coursera: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
library(tm)
## Loading required package: NLP
setwd("D:\\Data Scientist\\Capstone\\Coursera-SwiftKey\\final\\en_US")
1. The en_US.blogs.txt file is how many megabytes? Answer: about 200 MB.
Step: run ls -alh in the Coursera-SwiftKey/final/en_US directory.
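Equivalently, the size can be checked from within R (the file is assumed to be in the working directory set above):
# File size in megabytes
file.info("en_US.blogs.txt")$size / 1024^2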
2. The en_US.twitter.txt file has how many lines of text? Answer: over 2 million.
Step:
twitter <- readLines(con <- file("./en_US.twitter.txt"), encoding = "UTF-8", skipNul = TRUE)
close(con)
# Count the number of lines in the Twitter data
length(twitter)
## [1] 2360148
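As a cross-check, the line count can also be obtained without keeping the full vector in memory, for example with countLines() from the R.utils package (assuming that package is installed):
# Count lines without storing them (requires the R.utils package)
R.utils::countLines("./en_US.twitter.txt")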
3. What is the length of the longest line seen in any of the three en_US data sets? Answer: over 40 thousand characters, in the blogs data set.
Step: Read each file into a character vector, one element per line:
fileName="en_US.blogs.txt"
con=file(fileName,open="r")
lineBlogs=readLines(con)
close(con)
fileName="en_US.news.txt"
con=file(fileName,open="r")
lineNews=readLines(con)
## Warning in readLines(con): incomplete final line found on 'en_US.news.txt'
close(con)
fileName="en_US.twitter.txt"
con=file(fileName,open="r")
lineTwitter=readLines(con)
## Warning in readLines(con): line 167155 appears to contain an embedded nul
## Warning in readLines(con): line 268547 appears to contain an embedded nul
## Warning in readLines(con): line 1274086 appears to contain an embedded nul
## Warning in readLines(con): line 1759032 appears to contain an embedded nul
close(con)
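The embedded-nul warnings on the Twitter file are harmless for this question; they can be avoided by passing skipNul = TRUE, as was already done for question 2:
# Re-read the Twitter file while silently dropping embedded nul characters
lineTwitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)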
# Find the longest line in each data set
require(stringi)
## Loading required package: stringi
# Maximum line length in the blogs data
longBlogs<-stri_length(lineBlogs)
max(longBlogs)
## [1] 40835
# Maximum line length in the news data
longNews<-stri_length(lineNews)
max(longNews)
## [1] 5760
# Maximum line length in the twitter data
longTwitter<-stri_length(lineTwitter)
max(longTwitter)
## [1] 213
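The three maxima can also be computed in one pass as a compact summary:
# Longest line (in characters) for each data set
sapply(list(blogs = lineBlogs, news = lineNews, twitter = lineTwitter),
       function(x) max(stri_length(x)))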
4. In the en_US twitter data set, if you divide the number of lines where the word "love" (all lowercase) occurs by the number of lines where the word "hate" (all lowercase) occurs, about what do you get? Answer: about 4.
Step: Count the lines containing "love" and "hate" (case-sensitive) with grep:
#Word "love"
loveTwitter<-grep("love",lineTwitter)
length(loveTwitter)
## [1] 90956
#Word "hate"
hateTwitter<-grep("hate",lineTwitter)
length(hateTwitter)
## [1] 22138
#Divide love by hate
length(loveTwitter)/length(hateTwitter)
## [1] 4.108592
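Note that grep("love", ...) also counts lines where "love" only appears inside a longer word such as "lovely" or "glove"; if only whole-word matches are wanted, a word-boundary pattern can be used (the counts, and therefore the ratio, will differ slightly):
# Whole-word matches only
length(grep("\\blove\\b", lineTwitter, perl = TRUE))
length(grep("\\bhate\\b", lineTwitter, perl = TRUE))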
5. The one tweet in the en_US twitter data set that matches the word "biostats" says what? Answer: see the output below.
Step:
biostatsTwitter<-grep("biostats",lineTwitter)
lineTwitter[biostatsTwitter]
## [1] "i know how you feel.. i have biostats on tuesday and i have yet to study =/"
6. How many tweets have the exact characters "A computer once beat me at chess, but it was no match for me at kickboxing"? (I.e., the line matches those characters exactly.) Answer: 3.
Step:
sentenceTwitter<-grep("A computer once beat me at chess, but it was no match for me at kickboxing",lineTwitter)
length(sentenceTwitter)
## [1] 3
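Since grep() counts any line containing the sentence, a stricter check for lines that consist of exactly those characters is a plain equality test; the count may differ if some tweets wrap the sentence in extra text:
# Lines that are exactly the given sentence and nothing more
sum(lineTwitter == "A computer once beat me at chess, but it was no match for me at kickboxing")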
n-gram model (think Markov Chains)
# Convert non-ASCII characters to byte codes so the text is safe for tm
cleanedTwitter<- iconv(twitter, 'UTF-8', 'ASCII', "byte")
# Take a reproducible random sample of 10,000 tweets
set.seed(1234)
twitterSample <- sample(cleanedTwitter, 10000)
doc.vec <- VectorSource(twitterSample)
doc.corpus <- Corpus(doc.vec)
# Convert to lower case (wrapped in content_transformer for current tm versions)
doc.corpus <- tm_map(doc.corpus, content_transformer(tolower))
# Remove all punctuation
doc.corpus <- tm_map(doc.corpus, removePunctuation)
# Remove all numbers
doc.corpus <- tm_map(doc.corpus, removeNumbers)
# Strip extra whitespace
doc.corpus <- tm_map(doc.corpus, stripWhitespace)
# Force everything back to plain text documents
doc.corpus <- tm_map(doc.corpus, PlainTextDocument)
#Visualize using wordcloud
library(wordcloud)
## Loading required package: RColorBrewer
wordcloud(doc.corpus, max.words = 200, random.order = FALSE, rot.per = 0.35, use.r.layout = FALSE, colors = brewer.pal(8, "Dark2"))
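The same cleaned corpus can also be turned into an explicit term-frequency table with tm's TermDocumentMatrix; these token frequencies are the raw material for the n-gram counts used in the model:
# Term-document matrix and overall token frequencies for the sample
tdm <- TermDocumentMatrix(doc.corpus)
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
head(freq, 10)  # ten most frequent tokens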
The next phase of this project will continue with building the predictive model. An n-gram model (think Markov chains) will be used to tokenize the text, and the frequency of tokens will be used to build the model. Finally, a Shiny app will be developed as the data product: it will predict the next word following the text a user types in, with the predictive model tuned to run fast enough for interactive use.
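As a preview of that data product, here is a minimal Shiny skeleton; predictNext() stands in for the not-yet-built prediction function and is a placeholder, not working code from this report.
library(shiny)
ui <- fluidPage(
  textInput("phrase", "Type a word or phrase:"),
  textOutput("nextWord")
)
server <- function(input, output) {
  output$nextWord <- renderText({
    if (!nzchar(trimws(input$phrase))) return("")
    words <- strsplit(tolower(trimws(input$phrase)), "\\s+")[[1]]
    predictNext(tail(words, 1))  # placeholder prediction function
  })
}
# shinyApp(ui, server)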