Instructions
The goal here is to build your first simple model for the relationship between words. This is the first step in building a predictive text mining application. You will explore simple models and discover more complicated modeling techniques.
Tasks to accomplish
Build basic n-gram model - using the exploratory analysis you performed, build a basic n-gram model for predicting the next word based on the previous 1, 2, or 3 words.
Build a model to handle unseen n-grams - in some cases people will want to type a combination of words that does not appear in the corpora. Build a model to handle cases where a particular n-gram isn't observed (a minimal sketch of both ideas follows below).
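As rough orientation, here is a minimal sketch of both ideas in R; the toy sentences and the predictNext() function are purely illustrative and are not part of the course data or any required API.
# Toy corpus; the real project uses the cleaned blogs/news/twitter samples
toy <- c("i love data science", "i love new york", "i hate waiting")
tokens <- strsplit(tolower(toy), "\\s+")
# Bigram counts: how often does word2 follow word1?
bigrams <- unlist(lapply(tokens, function(w) paste(head(w, -1), tail(w, -1))))
bigramFreq <- sort(table(bigrams), decreasing = TRUE)
# Unigram counts serve as the fallback for unseen combinations
unigramFreq <- sort(table(unlist(tokens)), decreasing = TRUE)
predictNext <- function(word) {
  hits <- bigramFreq[grep(paste0("^", word, " "), names(bigramFreq))]
  if (length(hits) > 0) sub(".* ", "", names(hits)[1]) else names(unigramFreq)[1]
}
predictNext("i")      # observed bigram: returns "love"
predictNext("zebra")  # unseen n-gram: falls back to the most frequent unigram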
The data set was downloaded from Coursera: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
library(tm)
## Loading required package: NLP
setwd("D:\\Data Scientist\\Capstone\\Coursera-SwiftKey\\final\\en_US")
1. The en_US.blogs.txt file is how many megabytes? Answer: about 200 MB.
Step: run ls -alh in the Coursera-SwiftKey/final/en_US directory.
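Equivalently, the size can be checked from within R (the file is assumed to be in the working directory set above):
# File size in megabytes
file.info("en_US.blogs.txt")$size / 1024^2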
2. The en_US.twitter.txt file has how many lines of text? Answer: over 2 million.
Step:
twitter <- readLines(con <- file("./en_US.twitter.txt"), encoding = "UTF-8", skipNul = TRUE)
close(con)
# Count the number of lines in the Twitter data
length(twitter)
## [1] 2360148
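As a cross-check, the line count can also be obtained without keeping the full vector in memory, for example with countLines() from the R.utils package (assuming that package is installed):
# Count lines without storing them (requires the R.utils package)
R.utils::countLines("./en_US.twitter.txt")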
3. What is the length of the longest line seen in any of the three en_US data sets? Answer: over 40 thousand characters, in the blogs data set.
Step: Read each file into a character vector, one element per line:
fileName="en_US.blogs.txt"
con=file(fileName,open="r")
lineBlogs=readLines(con)
close(con)
fileName="en_US.news.txt"
con=file(fileName,open="r")
lineNews=readLines(con)
## Warning in readLines(con): incomplete final line found on 'en_US.news.txt'
close(con)
fileName="en_US.twitter.txt"
con=file(fileName,open="r")
lineTwitter=readLines(con)
## Warning in readLines(con): line 167155 appears to contain an embedded nul
## Warning in readLines(con): line 268547 appears to contain an embedded nul
## Warning in readLines(con): line 1274086 appears to contain an embedded nul
## Warning in readLines(con): line 1759032 appears to contain an embedded nul
close(con)
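The embedded-nul warnings on the Twitter file are harmless for this question; they can be avoided by passing skipNul = TRUE, as was already done for question 2:
# Re-read the Twitter file while silently dropping embedded nul characters
lineTwitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)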
# Find the longest line in each data set
require(stringi)
## Loading required package: stringi
# Maximum line length in the blogs data
longBlogs<-stri_length(lineBlogs)
max(longBlogs)
## [1] 40835
# Maximum line length in the news data
longNews<-stri_length(lineNews)
max(longNews)
## [1] 5760
# Maximum line length in the twitter data
longTwitter<-stri_length(lineTwitter)
max(longTwitter)
## [1] 213
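The three maxima can also be computed in one pass as a compact summary:
# Longest line (in characters) for each data set
sapply(list(blogs = lineBlogs, news = lineNews, twitter = lineTwitter),
       function(x) max(stri_length(x)))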
4. In the en_US twitter data set, if you divide the number of lines where the word "love" (all lowercase) occurs by the number of lines where the word "hate" (all lowercase) occurs, about what do you get? Answer: about 4.
Step: Count the lines containing "love" and "hate" (case-sensitive) with grep:
#Word "love"
loveTwitter<-grep("love",lineTwitter)
length(loveTwitter)
## [1] 90956
#Word "hate"
hateTwitter<-grep("hate",lineTwitter)
length(hateTwitter)
## [1] 22138
#Divide love by hate
length(loveTwitter)/length(hateTwitter)
## [1] 4.108592
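Note that grep("love", ...) also counts lines where "love" only appears inside a longer word such as "lovely" or "glove"; if only whole-word matches are wanted, a word-boundary pattern can be used (the counts, and therefore the ratio, will differ slightly):
# Whole-word matches only
length(grep("\\blove\\b", lineTwitter, perl = TRUE))
length(grep("\\bhate\\b", lineTwitter, perl = TRUE))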
5. The one tweet in the en_US twitter data set that matches the word "biostats" says what? Answer: see the output below.
Step:
biostatsTwitter<-grep("biostats",lineTwitter)
lineTwitter[biostatsTwitter]
## [1] "i know how you feel.. i have biostats on tuesday and i have yet to study =/"
6. How many tweets have the exact characters "A computer once beat me at chess, but it was no match for me at kickboxing"? (I.e., the line matches those characters exactly.) Answer: 3.
Step:
sentenceTwitter<-grep("A computer once beat me at chess, but it was no match for me at kickboxing",lineTwitter)
length(sentenceTwitter)
## [1] 3
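Since grep() counts any line containing the sentence, a stricter check for lines that consist of exactly those characters is a plain equality test; the count may differ if some tweets wrap the sentence in extra text:
# Lines that are exactly the given sentence and nothing more
sum(lineTwitter == "A computer once beat me at chess, but it was no match for me at kickboxing")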
n-gram model (think Markov Chains)
# Convert non-ASCII characters to byte codes so the text is safe for tm
cleanedTwitter<- iconv(twitter, 'UTF-8', 'ASCII', "byte")
# Take a reproducible random sample of 10,000 tweets
set.seed(1234)
twitterSample <- sample(cleanedTwitter, 10000)
doc.vec <- VectorSource(twitterSample)
doc.corpus <- Corpus(doc.vec)
# Convert to lower case (wrapped in content_transformer for current tm versions)
doc.corpus <- tm_map(doc.corpus, content_transformer(tolower))
# Remove all punctuation
doc.corpus <- tm_map(doc.corpus, removePunctuation)
# Remove all numbers
doc.corpus <- tm_map(doc.corpus, removeNumbers)
# Strip extra whitespace
doc.corpus <- tm_map(doc.corpus, stripWhitespace)
# Force everything back to plain text documents
doc.corpus <- tm_map(doc.corpus, PlainTextDocument)
#Visualize using wordcloud
library(wordcloud)
## Loading required package: RColorBrewer
wordcloud(doc.corpus, max.words = 200, random.order = FALSE, rot.per = 0.35, use.r.layout = FALSE, colors = brewer.pal(8, "Dark2"))
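The same cleaned corpus can also be turned into an explicit term-frequency table with tm's TermDocumentMatrix; these token frequencies are the raw material for the n-gram counts used in the model:
# Term-document matrix and overall token frequencies for the sample
tdm <- TermDocumentMatrix(doc.corpus)
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
head(freq, 10)  # ten most frequent tokens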
The next phase of this project will continue with building the predictive model. An n-gram model (think Markov chains) will be used to tokenize the text, and the frequency of tokens will be used to build the model. Finally, a Shiny app will be developed as the data product: it will predict the next word following the text a user types in, with the predictive model tuned to run fast enough for interactive use.
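As a preview of that data product, here is a minimal Shiny skeleton; predictNext() stands in for the not-yet-built prediction function and is a placeholder, not working code from this report.
library(shiny)
ui <- fluidPage(
  textInput("phrase", "Type a word or phrase:"),
  textOutput("nextWord")
)
server <- function(input, output) {
  output$nextWord <- renderText({
    if (!nzchar(trimws(input$phrase))) return("")
    words <- strsplit(tolower(trimws(input$phrase)), "\\s+")[[1]]
    predictNext(tail(words, 1))  # placeholder prediction function
  })
}
# shinyApp(ui, server)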