title: "capstone quiz 1"
author: "Michelle Tan"
date: "1/27/2018"
output: html_document
_________________________________________________________

Data Science Capstone Milestone Report Quiz 1

The goal here is to build your first simple model for the relationship between words. This is the first step in building a predictive text mining application. You will explore simple models and discover more complicated modeling techniques.
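To make the idea of a simple word model concrete, here is a minimal sketch in base R of a bigram model with a crude fallback for unseen words. The toy corpus and the helper name `predict_next` are illustrative assumptions, not part of the capstone data or of any package. A real model would be trained on the SwiftKey files and would likely use longer contexts and proper smoothing.

```r
# Toy corpus (assumption); the real input would be the SwiftKey text.
corpus <- c("i love data science", "i love r", "you love data")

# Split each line into lowercase word tokens.
tokens <- strsplit(tolower(corpus), "\\s+")

# Build (previous word, next word) pairs for every line.
pairs <- do.call(rbind, lapply(tokens, function(w) {
  if (length(w) < 2) return(NULL)
  data.frame(prev = head(w, -1), nxt = tail(w, -1), stringsAsFactors = FALSE)
}))

# Bigram counts: rows are the previous word, columns the next word.
bigram_counts <- table(pairs$prev, pairs$nxt)

# Unigram counts, most frequent first, used when the context is unseen.
unigram_counts <- sort(table(unlist(tokens)), decreasing = TRUE)

# Hypothetical helper: most likely next word after `word`.
predict_next <- function(word) {
  word <- tolower(word)
  if (word %in% rownames(bigram_counts)) {
    row <- bigram_counts[word, ]
    return(names(row)[which.max(row)])
  }
  # Unseen context: fall back to the most frequent single word.
  names(unigram_counts)[1]
}

predict_next("love")
predict_next("zebra")
```

Falling back to the overall most frequent word is the simplest possible way to handle an unseen context. It is used here only to show where such a rule would plug in.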

Build basic n-gram model: using the exploratory analysis you performed, build a basic n-gram model for predicting the next word based on the previous 1, 2, or 3 words.

Build a model to handle unseen n-grams: in some cases people will want to type a combination of words that does not appear in the corpora. Build a model to handle cases where a particular n-gram isn’t observed.

## Basic summary

The data were downloaded from Coursera: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

Loading library

library(tm)

## Loading required package: NLP

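setwd("~/Desktop/en_US")
# Read the Twitter file as UTF-8, skipping embedded NUL characters, then close the connection
con <- file("./en_US.twitter.txt", open = "r")
twitter <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)
length(twitter)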

## [1] 2360148

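# Read the blogs file and record how many lines it has
fileName <- "en_US.blogs.txt"
con <- file(fileName, open = "r")
lineBlogs <- readLines(con)
longBlogs <- length(lineBlogs)  # number of lines in the blogs file
close(con)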

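# Read the news file and record how many lines it has
fileName <- "en_US.news.txt"
con <- file(fileName, open = "r")
lineNews <- readLines(con)
longNews <- length(lineNews)  # number of lines in the news file
close(con)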

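# Read the Twitter file again, this time without skipNul, which triggers the warnings below
fileName <- "en_US.twitter.txt"
con <- file(fileName, open = "r")
lineTwitter <- readLines(con)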

## Warning in readLines(con): line 167155 appears to contain an embedded nul

## Warning in readLines(con): line 268547 appears to contain an embedded nul

## Warning in readLines(con): line 1274086 appears to contain an embedded nul

## Warning in readLines(con): line 1759032 appears to contain an embedded nul

longTwitter <- length(lineTwitter)  # number of lines in the Twitter file
close(con)
length(twitter)

## [1] 2360148
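The three files are read with the same open, read, close pattern, which could be wrapped in one small function. The sketch below is one way to do that. The name `read_corpus` is hypothetical, and the demonstration uses a temporary file rather than the real data so that it runs anywhere.

```r
# Hypothetical helper: read a text file and always release the connection.
read_corpus <- function(path) {
  con <- file(path, open = "r")
  on.exit(close(con))  # runs even if readLines() fails
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}

# Demonstration on a small temporary file (not the SwiftKey data).
tmp <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line"), tmp)
demo_lines <- read_corpus(tmp)
length(demo_lines)
unlink(tmp)
```

With the real data this would be called as `lineBlogs <- read_corpus("en_US.blogs.txt")`, and similarly for the news and Twitter files. Passing `skipNul = TRUE` should avoid the embedded nul warnings shown above.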

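# longBlogs holds a single number at this point, so nchar() counts digits,
# not the characters in each blog line; the 1 printed below is not the longest line length.
longBlogs <- nchar(longBlogs)
max(nchar(longBlogs))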

## [1] 1

require(stringi)

## Loading required package: stringi

longBlogs <- stri_length(lineBlogs)
max(longBlogs)

## [1] 40833

longNews <- stri_length(lineNews)
max(longNews)

## [1] 11384

longTwitter <- stri_length(lineTwitter)
max(longTwitter)

## [1] 140
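The three maxima above can also be computed in one pass and compared directly. The following is a rough sketch on toy vectors that stand in for `lineBlogs`, `lineNews` and `lineTwitter`. It uses base `nchar()` in place of `stri_length()` so that it runs without stringi, and the name `sources` is an assumption made for illustration.

```r
# Toy stand-ins for lineBlogs, lineNews and lineTwitter (assumption).
sources <- list(
  blogs   = c("a short post", "a somewhat longer blog post"),
  news    = c("headline", "a full news paragraph here"),
  twitter = c("tweet", "another tweet")
)

# Longest line, in characters, for each source.
max_lengths <- sapply(sources, function(x) max(nchar(x)))
max_lengths

# Name of the source that contains the longest line overall.
longest_source <- names(which.max(max_lengths))
longest_source
```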

loveTwitter<-grep("love",lineTwitter)
length(loveTwitter)

## [1] 90956

hateTwitter<- grep("hate",lineTwitter)
length(hateTwitter)

## [1] 22138

# Ratio of lines containing "love" to lines containing "hate"
length(loveTwitter) / length(hateTwitter)

## [1] 4.108592
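Note that `grep("love", ...)` is case-sensitive and matches any line containing those four letters, including words such as "lovely" or "gloves". The sketch below illustrates the difference on a handful of made-up tweets. The vector `tweets` and the variable names are assumptions for illustration only, and the counts say nothing about the real Twitter file.

```r
# Made-up tweets (assumption), not taken from the data set.
tweets <- c("i love this", "what a lovely day", "gloves are on sale",
            "i hate mondays", "Love wins")

# Substring match: any line containing the lowercase letters "love".
n_substring <- sum(grepl("love", tweets))

# Whole-word match: \\b marks a word boundary, so "lovely" and "gloves" are excluded.
n_word <- sum(grepl("\\blove\\b", tweets))

# Ratio of love lines to hate lines, using the substring counts.
ratio <- n_substring / sum(grepl("hate", tweets))

c(substring = n_substring, whole_word = n_word, ratio = ratio)
```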

biostatsTwitter<-grep("biostats",lineTwitter)
lineTwitter[biostatsTwitter]

## [1] "i know how you feel.. i have biostats on tuesday and i have yet to study =/"

sentenceTwitter<-grep("A computer once beat me at chess, but it was no match for me at kickboxing",lineTwitter)
length(sentenceTwitter)

## [1] 3
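The `grep()` call above counts lines that contain the sentence anywhere and interprets the pattern as a regular expression. As a small sketch on made-up input, the code below shows two variants: a literal match with `fixed = TRUE`, and a strict equality test for lines that consist of the sentence and nothing else. The example tweets and the variable names are assumptions.

```r
phrase <- "A computer once beat me at chess, but it was no match for me at kickboxing"

# Made-up tweets (assumption), not taken from the data set.
tweets <- c(
  phrase,
  paste("lol:", phrase),
  "something else entirely"
)

# Lines that contain the phrase anywhere; fixed = TRUE treats it as literal text.
n_containing <- sum(grepl(phrase, tweets, fixed = TRUE))

# Lines that are exactly the phrase and nothing more.
n_exact <- sum(tweets == phrase)

c(containing = n_containing, exact = n_exact)
```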