---
title: "capstone quiz 1"
author: "Michelle Tan"
date: "1/27/2018"
output: html_document
---

Data Science Capstone Milestone Report Quiz 1

The goal here is to build your first simple model for the relationship between words. This is the first step in building a predictive text mining application. You will explore simple models and discover more complicated modeling techniques.

Tasks to accomplish

1. Build a basic n-gram model: using the exploratory analysis you performed, build a basic n-gram model for predicting the next word based on the previous 1, 2, or 3 words.
2. Build a model to handle unseen n-grams: in some cases people will want to type a combination of words that does not appear in the corpora. Build a model to handle cases where a particular n-gram isn’t observed.

## Basic summary

The data set was downloaded from Coursera: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
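Before loading the data, the two modeling tasks (next-word prediction from n-grams, and a fallback for unseen n-grams) can be illustrated on a toy corpus. The following is only a minimal sketch: the corpus, the helper `count_items()`, and the function `predict_next()` are hypothetical names introduced for illustration, and the fallback is a plain back-off to the most frequent single word rather than any smoothed model.

```r
# Toy corpus (an assumption for illustration, not the capstone data)
corpus <- c("i love data science", "i love text mining", "data science is fun")

# Split each line into lowercase word tokens
tokens <- strsplit(tolower(corpus), " ", fixed = TRUE)

# Hypothetical helper: frequency counts as a plain named vector
count_items <- function(x) {
  tab <- table(x)
  setNames(as.vector(tab), names(tab))
}

# Unigram counts over the whole corpus
unigrams <- count_items(unlist(tokens))

# Bigram counts: pair each word with the word that follows it on the same line
bigrams <- count_items(unlist(lapply(tokens, function(w) {
  if (length(w) < 2) return(character(0))
  paste(head(w, -1), tail(w, -1))
})))

# Hypothetical predictor: pick the most frequent bigram that starts with `prev`;
# if `prev` never appears as a first word, back off to a most frequent unigram.
predict_next <- function(prev) {
  hits <- bigrams[startsWith(names(bigrams), paste0(prev, " "))]
  if (length(hits) > 0) {
    best <- names(hits)[which.max(hits)]
    return(strsplit(best, " ", fixed = TRUE)[[1]][2])
  }
  names(unigrams)[which.max(unigrams)]
}

predict_next("i")           # seen context: uses bigram counts
predict_next("kickboxing")  # unseen context: falls back to unigram counts
```

The same idea extends to trigrams and 4-grams by trying the longest available context first and shortening it until a match is found.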

Loading library

library(tm)
## Loading required package: NLP
setwd("~/Desktop/en_US")
twitter <- readLines(con <- file("./en_US.twitter.txt"), encoding = "UTF-8", skipNul = TRUE)
close(con)
length(twitter)
## [1] 2360148
fileName="en_US.blogs.txt"
con=file(fileName,open="r")
lineBlogs=readLines(con) 
longBlogs=length(lineBlogs)  # number of lines read from the blogs file
close(con)
fileName="en_US.news.txt"
con=file(fileName,open="r")
lineNews=readLines(con)
longNews=length(lineNews)  # number of lines read from the news file
close(con)
fileName="en_US.twitter.txt"
con=file(fileName,open="r")
lineTwitter=readLines(con)
## Warning in readLines(con): line 167155 appears to contain an embedded nul
## Warning in readLines(con): line 268547 appears to contain an embedded nul
## Warning in readLines(con): line 1274086 appears to contain an embedded nul
## Warning in readLines(con): line 1759032 appears to contain an embedded nul
longTwitter=length(lineTwitter)  # number of lines read from the twitter file
close(con)
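# Note: longBlogs holds a single number here, not the text, so nchar() only
# counts the digits of that number. The per-line lengths are computed from
# lineBlogs with stri_length() further down.
longBlogs=nchar(longBlogs)
max(nchar(longBlogs))
## [1] 1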
require(stringi)
## Loading required package: stringi
longBlogs <- stri_length(lineBlogs)
max(longBlogs)
## [1] 40833
longNews <- stri_length(lineNews)
max(longNews)
## [1] 11384
longTwitter <- stri_length(lineTwitter)
max(longTwitter)
## [1] 140
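The three files go through the same read-and-measure steps, so they could be wrapped in one function. The sketch below is an assumption rather than part of the original analysis: `summarise_file()` is a hypothetical helper, it uses base `nchar()` instead of `stri_length()` to avoid an extra dependency, and it opens the connection in binary mode with `skipNul = TRUE`, which is a common way to avoid embedded-nul warnings and truncated reads on some platforms.

```r
# Hypothetical helper: read a text file and report the number of lines
# and the length in characters of its longest line.
summarise_file <- function(path) {
  con <- file(path, open = "rb")   # binary mode: read the bytes as they are
  on.exit(close(con))              # close the connection even if reading fails
  lines <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
  # allowNA = TRUE returns NA for invalid strings instead of stopping
  lengths <- nchar(lines, type = "chars", allowNA = TRUE)
  c(lines = length(lines), longest = max(lengths, na.rm = TRUE))
}

# Usage, assuming the working directory contains the en_US files:
# sapply(c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"), summarise_file)
```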
loveTwitter<-grep("love",lineTwitter)
length(loveTwitter)
## [1] 90956
hateTwitter<- grep("hate",lineTwitter)
length(hateTwitter)
## [1] 22138
length(loveTwitter)/length(hateTwitter)
## [1] 4.108592
biostatsTwitter<-grep("biostats",lineTwitter)
lineTwitter[biostatsTwitter]
## [1] "i know how you feel.. i have biostats on tuesday and i have yet to study =/"
sentenceTwitter<-grep("A computer once beat me at chess, but it was no match for me at kickboxing",lineTwitter)
length(sentenceTwitter)
## [1] 3
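The word and sentence searches above all count the lines that contain a given string. A minimal sketch of that step as a reusable function follows; `count_lines()` and the `tweets` vector are hypothetical and stand in for the searches on `lineTwitter`. It passes `fixed = TRUE` so the pattern is matched literally, which matters if a search string ever contains regular-expression characters such as `.` or `?`.

```r
# Hypothetical helper: number of lines that contain `pattern` as a literal string
count_lines <- function(lines, pattern) {
  sum(grepl(pattern, lines, fixed = TRUE))
}

# Toy stand-in for lineTwitter
tweets <- c("i love R", "i hate bugs", "love it or hate it", "love love love")

love <- count_lines(tweets, "love")  # 3 lines contain "love"
hate <- count_lines(tweets, "hate")  # 2 lines contain "hate"
love / hate                          # 1.5
```

The function counts matching lines, not occurrences, so a line with "love" three times counts once. The match is also case-sensitive, like the `grep()` calls above.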