The purpose of this report is to explore the corpus data provided by the Coursera Data Science Capstone course and to use it to build a text-prediction model. Given an input string, the model should predict the most probable next word.
A few things to note about the given corpus data:

* The data contains a lot of unnecessary noise, foreign words, and words from different encodings.
* Most words occur only a few times, so associating each word with its neighbours is important for predicting the next word.
* We keep only English-language words by applying a regular expression (see the cleaning step below).

## Task 1: Data Acquisition & Cleaning
Make sure the working directory is set to the location where your files are stored. The data provided by Coursera in partnership with SwiftKey contains text in several languages, such as Russian, German, Finnish and English. We are interested in English, so let's load the files inside the "en_US" folder.

Data set provided by the Coursera web site.
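The chunk that reads the raw text files is not shown here; below is a minimal sketch of what it might look like, assuming the three en_US files sit in the working directory (the file names follow the standard SwiftKey layout and are an assumption):

news    <- readLines("en_US.news.txt",    encoding = "UTF-8")
blogs   <- readLines("en_US.blogs.txt",   encoding = "UTF-8")
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8")

Note that on Windows readLines() may stop early on the news file because of embedded nul characters, which is why the line count reported for News below is much lower than for the other two sources.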
library(stringi)
library(stringr)
number.lines <- cbind(stri_stats_general(news)["Lines"],
                      stri_stats_general(twitter)["Lines"],
                      stri_stats_general(blogs)["Lines"])
number.words <- c(sum(stri_count_words(news)),
                  sum(stri_count_words(twitter)),
                  sum(stri_count_words(blogs)))
summary_table <- rbind(number.lines, number.words)
rownames(summary_table) <- c("Number of Lines", "Number of words")
colnames(summary_table) <- c("News", "Twitter", "Blogs")  # labels match the order the sources are combined in
summary_table
##                     News  Twitter    Blogs
## Number of Lines    77259  2360148   899288
## Number of words  2674536 30093369 37546246
# Take a random sample of 1,000 lines from each source to keep later processing manageable
set.seed(48)
news.sample <- sample(news, 1000, replace = FALSE)
twitter.sample <- sample(twitter, 1000, replace = FALSE)
blogs.sample <- sample(blogs, 1000, replace = FALSE)
setwd("C:/Users/elias/Downloads/Coursera-SwiftKey (1)/final/en_US")
writeLines(news.sample, "news.sample.txt")
writeLines(twitter.sample, "twitter.sample.txt")
writeLines(blogs.sample, "blogs.sample.txt")
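The chunk that produced the following summary of the samples is not shown in the original; a minimal sketch, reusing the stringi helpers from above, could look like this (its output is printed below):

sample.lines <- cbind(stri_stats_general(news.sample)["Lines"],
                      stri_stats_general(twitter.sample)["Lines"],
                      stri_stats_general(blogs.sample)["Lines"])
sample.words <- c(sum(stri_count_words(news.sample)),
                  sum(stri_count_words(twitter.sample)),
                  sum(stri_count_words(blogs.sample)))
sample_table <- rbind(sample.lines, sample.words)
rownames(sample_table) <- c("Number of Lines", "Number of words")
colnames(sample_table) <- c("News", "Twitter", "Blogs")
sample_table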
##                  News Twitter Blogs
## Number of Lines  1000    1000  1000
## Number of words 33669   12941 41164
news.sample <- readLines("news.sample.txt", encoding = "UTF-8")
blogs.sample <- readLines("blogs.sample.txt", encoding = "UTF-8")
twitter.sample <- readLines("twitter.sample.txt", encoding = "UTF-8")
Next we remove unwanted content and clean the sampled data to build the final corpus.

library(tm)

## Loading required package: NLP
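The chunk that builds the corpus from the three samples and applies the basic cleaning transformations is not shown in the original; a minimal sketch, assuming the tm package and the sample vectors reloaded above, is:

# Combine the three samples into a single tm corpus
corpus <- Corpus(VectorSource(c(news.sample, blogs.sample, twitter.sample)))

# Basic cleaning: lower case, drop numbers and punctuation, collapse whitespace
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
# Keep only English letters, apostrophes and spaces (the regular expression mentioned in the introduction)
corpus <- tm_map(corpus, content_transformer(function(x) gsub("[^a-z' ]", " ", x)))
corpus <- tm_map(corpus, stripWhitespace)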
# Remove profanity using a downloaded block list
profanity <- read.csv("Terms-to-block.csv", header = FALSE)
profanity <- as.character(profanity$V1)
corpus <- tm_map(corpus, removeWords, profanity)
# Rebuild the corpus so the RWeka tokenizers below can work on the cleaned text
corpus <- Corpus(VectorSource(corpus))
library(RWeka)
options(mc.cores = 1)  # RWeka tokenizers can fail with parallel processing, so use a single core

# Tokenizer functions for n-grams of size 1 to 5
Uni.Gram_Tokenizer  <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
Bi.Gram_Tokenizer   <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
Tri.Gram_Tokenizer  <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
Quad.Gram_Tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 4, max = 4))
Five.Gram_Tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 5, max = 5))
# Build a term-document matrix for each n-gram size
TDM_UniGram <- TermDocumentMatrix(corpus, control = list(tokenize = Uni.Gram_Tokenizer))
TDM_BiGram <- TermDocumentMatrix(corpus, control = list(tokenize = Bi.Gram_Tokenizer))
TDM_TriGram <- TermDocumentMatrix(corpus, control = list(tokenize = Tri.Gram_Tokenizer))
TDM_QuadGram <- TermDocumentMatrix(corpus, control = list(tokenize = Quad.Gram_Tokenizer))
TDM_FiveGram <- TermDocumentMatrix(corpus, control = list(tokenize = Five.Gram_Tokenizer))
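These matrices are the raw material for the prediction model. As an illustration (not part of the original analysis), the sketch below pulls the most frequent bigrams out of TDM_BiGram using the slam package that tm builds on:

library(slam)
bi.freq <- sort(row_sums(TDM_BiGram), decreasing = TRUE)  # total count of each bigram across documents
head(bi.freq, 10)                                         # ten most frequent bigrams in the sample corpus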