The purpose of this report is to explore the corpus data provided by the Coursera Data Science Capstone course and to use it to build a text-prediction model. Given an input string, the model should predict the most probable next word.
A few things to note about the given corpus data:

* The data contains a lot of unnecessary noise, foreign words, and words from different encodings.
* Most words occur only a few times, so associating each word with its neighbours is important for predicting the next word.
* We keep only English-language words by applying a regular expression (see the cleaning step below).

## Task 1: Data Acquisition & Cleaning
Make sure the working directory is set to the location where your files are stored. The data provided by Coursera in partnership with SwiftKey contains text in several languages, such as Russian, German, Finnish and English. We are interested in English, so let's load the files inside the "en_US" folder.

Data set provided by the Coursera web site.
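The chunk that reads the raw text files is not shown here; below is a minimal sketch of what it might look like, assuming the three en_US files sit in the working directory (the file names follow the standard SwiftKey layout and are an assumption):

news    <- readLines("en_US.news.txt",    encoding = "UTF-8")
blogs   <- readLines("en_US.blogs.txt",   encoding = "UTF-8")
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8")

Note that on Windows readLines() may stop early on the news file because of embedded nul characters, which is why the line count reported for News below is much lower than for the other two sources.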
library(stringi)
library(stringr)
number.lines <- cbind(stri_stats_general(news)["Lines"],
                      stri_stats_general(twitter)["Lines"],
                      stri_stats_general(blogs)["Lines"])
number.words <- c(sum(stri_count_words(news)),
                  sum(stri_count_words(twitter)),
                  sum(stri_count_words(blogs)))
summary_table <- rbind(number.lines, number.words)
rownames(summary_table) <- c("Number of Lines", "Number of words")
colnames(summary_table) <- c("News", "Twitter", "Blogs")  # labels match the order the sources are combined in
summary_table
##                     News  Twitter    Blogs
## Number of Lines    77259  2360148   899288
## Number of words  2674536 30093369 37546246
# Take a random sample of 1,000 lines from each source to keep later processing manageable
set.seed(48)
news.sample <- sample(news, 1000, replace = FALSE)
twitter.sample <- sample(twitter, 1000, replace = FALSE)
blogs.sample <- sample(blogs, 1000, replace = FALSE)
setwd("C:/Users/elias/Downloads/Coursera-SwiftKey (1)/final/en_US")
writeLines(news.sample, "news.sample.txt")
writeLines(twitter.sample, "twitter.sample.txt")
writeLines(blogs.sample, "blogs.sample.txt")
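The chunk that produced the following summary of the samples is not shown in the original; a minimal sketch, reusing the stringi helpers from above, could look like this (its output is printed below):

sample.lines <- cbind(stri_stats_general(news.sample)["Lines"],
                      stri_stats_general(twitter.sample)["Lines"],
                      stri_stats_general(blogs.sample)["Lines"])
sample.words <- c(sum(stri_count_words(news.sample)),
                  sum(stri_count_words(twitter.sample)),
                  sum(stri_count_words(blogs.sample)))
sample_table <- rbind(sample.lines, sample.words)
rownames(sample_table) <- c("Number of Lines", "Number of words")
colnames(sample_table) <- c("News", "Twitter", "Blogs")
sample_table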
##                  News Twitter Blogs
## Number of Lines  1000    1000  1000
## Number of words 33669   12941 41164
news.sample <- readLines("news.sample.txt", encoding = "UTF-8")
blogs.sample <- readLines("blogs.sample.txt", encoding = "UTF-8")
twitter.sample <- readLines("twitter.sample.txt", encoding = "UTF-8")
Next we remove unwanted content and clean the sampled data to build the final corpus.

library(tm)

## Loading required package: NLP
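The chunk that builds the corpus from the three samples and applies the basic cleaning transformations is not shown in the original; a minimal sketch, assuming the tm package and the sample vectors reloaded above, is:

# Combine the three samples into a single tm corpus
corpus <- Corpus(VectorSource(c(news.sample, blogs.sample, twitter.sample)))

# Basic cleaning: lower case, drop numbers and punctuation, collapse whitespace
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
# Keep only English letters, apostrophes and spaces (the regular expression mentioned in the introduction)
corpus <- tm_map(corpus, content_transformer(function(x) gsub("[^a-z' ]", " ", x)))
corpus <- tm_map(corpus, stripWhitespace)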
# Remove profanity using a downloaded block list
profanity <- read.csv("Terms-to-block.csv", header = FALSE)
profanity <- as.character(profanity$V1)
corpus <- tm_map(corpus, removeWords, profanity)
# Rebuild the corpus so the RWeka tokenizers below can work on the cleaned text
corpus <- Corpus(VectorSource(corpus))
library(RWeka)
options(mc.cores = 1)  # RWeka tokenizers can fail with parallel processing, so use a single core

# Tokenizer functions for n-grams of size 1 to 5
Uni.Gram_Tokenizer  <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
Bi.Gram_Tokenizer   <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
Tri.Gram_Tokenizer  <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
Quad.Gram_Tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 4, max = 4))
Five.Gram_Tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 5, max = 5))
# Build a term-document matrix for each n-gram size
TDM_UniGram <- TermDocumentMatrix(corpus, control = list(tokenize = Uni.Gram_Tokenizer))
TDM_BiGram <- TermDocumentMatrix(corpus, control = list(tokenize = Bi.Gram_Tokenizer))
TDM_TriGram <- TermDocumentMatrix(corpus, control = list(tokenize = Tri.Gram_Tokenizer))
TDM_QuadGram <- TermDocumentMatrix(corpus, control = list(tokenize = Quad.Gram_Tokenizer))
TDM_FiveGram <- TermDocumentMatrix(corpus, control = list(tokenize = Five.Gram_Tokenizer))
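These matrices are the raw material for the prediction model. As an illustration (not part of the original analysis), the sketch below pulls the most frequent bigrams out of TDM_BiGram using the slam package that tm builds on:

library(slam)
bi.freq <- sort(row_sums(TDM_BiGram), decreasing = TRUE)  # total count of each bigram across documents
head(bi.freq, 10)                                         # ten most frequent bigrams in the sample corpus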