The goal of this project is to get familiar with the data and stay on track toward building the prediction algorithm. This report, published on RPubs, covers downloading the data from a URL as a zip file, unzipping it, cleaning the data, summary statistics, exploratory data analysis, plots, and the plan for the prediction algorithm and Shiny app.
setwd("C:/Users/farza/Desktop/Data Science/Course 10/Milestone Report")
getwd()
[1] "C:/Users/farza/Desktop/Data Science/Course 10/Milestone Report"
library(RWeka)
library(dplyr)
library(stringi)
library(ggplot2)
library(NLP)
library(tm)
if(!file.exists("Coursera-SwiftKey.zip")){
download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
"Coursera-SwiftKey.zip")
unzip("Coursera-SwiftKey.zip")
}
blogs <- readLines("final/en_US/en_US.blogs.txt", warn = FALSE, encoding = "UTF-8")
news <- readLines("final/en_US/en_US.news.txt", warn = FALSE, encoding = "UTF-8")
twitter <- readLines("final/en_US/en_US.twitter.txt", warn = FALSE, encoding = "UTF-8")
Stat_summary <- data.frame('File' = c("Blogs","News","Twitter"),
"Size" = sapply(list(blogs, news, twitter), function(x){format(object.size(x),"MB")}),
'No_Entries' = sapply(list(blogs, news, twitter), function(x){length(x)}),
'Total_Chars' = sapply(list(blogs, news, twitter), function(x){sum(nchar(x))}),
'Max_Chars' = sapply(list(blogs, news, twitter), function(x){max(unlist(lapply(x, function(y) nchar(y))))})
)
Stat_summary
File Size No_Entries Total_Chars Max_Chars
1 Blogs 248.5 Mb 899288 206824505 40833
2 News 19.2 Mb 77259 15639408 5760
3 Twitter 301.4 Mb 2360148 162096031 140
stri_stats_general(blogs)
Lines LinesNEmpty Chars CharsNWhite
899288 899165 206043906 169609063
stri_stats_general(news)
Lines LinesNEmpty Chars CharsNWhite
77259 77259 15615538 13048828
stri_stats_general(twitter)
Lines LinesNEmpty Chars CharsNWhite
2360148 2360148 161961345 133947948
Removal of non-English characters
blogs <- iconv(blogs, "latin1", "ASCII", sub="")
news <- iconv(news, "latin1", "ASCII", sub="")
twitter <- iconv(twitter, "latin1", "ASCII", sub="")
Because the files are very large, only a 5% sample of each file is used.
set.seed(1379)
dataSample <- c(sample(blogs, length(blogs) * 0.05),
sample(news, length(news) * 0.05),
sample(twitter, length(twitter) * 0.05))
summary(dataSample)
Length Class Mode
166833 character character
Build the corpus, then clean it: remove punctuation, numbers, and extra white space, convert everything to lower case, and store the result as plain text documents.
corpus <- Corpus(VectorSource(dataSample))       # build the corpus from the sample
corpus <- tm_map(corpus, tolower)                # convert to lower case
corpus <- tm_map(corpus, removePunctuation)      # drop punctuation
corpus <- tm_map(corpus, removeNumbers)          # drop digits
corpus <- tm_map(corpus, stripWhitespace)        # collapse extra white space
corpus <- tm_map(corpus, PlainTextDocument)      # keep documents in plain text format
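As a quick optional sanity check (not part of the original steps), the cleaned corpus can be inspected to confirm the transformations were applied:

# Illustrative check only: print the first couple of cleaned documents
inspect(corpus[1:2])
# or view the raw text of a single document
as.character(corpus[[1]])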
Tokenize the corpus into N-grams (uni-, bi-, tri-, and quadri-grams) with RWeka and build term-document matrices so that N-gram frequencies can be calculated.
Uni_Tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
Bi_Tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
Tri_Tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
Quadi_Tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 4, max = 4))
Uni_matrix <- TermDocumentMatrix(corpus, control = list(tokenize = Uni_Tokenizer))
Bi_matrix <- TermDocumentMatrix(corpus, control = list(tokenize = Bi_Tokenizer))
Tri_matrix <- TermDocumentMatrix(corpus, control = list(tokenize = Tri_Tokenizer))
Quadi_matrix<-TermDocumentMatrix(corpus, control = list(tokenize = Quadi_Tokenizer))
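The introduction mentions frequency plots; below is a minimal sketch of how the term-document matrices above could be turned into sorted frequency tables and plotted with ggplot2. The helper ngram_freq and the use of slam::row_sums are assumptions for illustration, not part of the original report (slam is installed as a dependency of tm).

# Sketch: collapse a term-document matrix into a sorted term-frequency table.
# slam::row_sums() sums counts per term without densifying the sparse matrix.
ngram_freq <- function(tdm, top = 20) {
  freq <- sort(slam::row_sums(tdm), decreasing = TRUE)
  data.frame(term = names(freq)[seq_len(top)], freq = freq[seq_len(top)])
}

uni_freq <- ngram_freq(Uni_matrix)
ggplot(uni_freq, aes(x = reorder(term, freq), y = freq)) +
  geom_col() +
  coord_flip() +
  labs(x = "Unigram", y = "Frequency", title = "Top 20 unigrams in the sample")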
words_blogs <- stri_count_words(blogs)
summary(words_blogs)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 9.00 28.00 41.71 60.00 6725.
qplot(words_blogs, binwidth = 2)   # binwidth is a ggplot2 argument, so qplot is used instead of hist
words_news <- stri_count_words(news)
summary(words_news)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.0 19.0 32.0 34.6 46.0 1123.0
qplot(words_news, binwidth = 1)
words_twitter <- stri_count_words(twitter)
summary(words_twitter)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 7.00 12.00 12.75 18.00 47.00
qplot(words_twitter, binwidth = 1)
The blogs and Twitter objects take roughly 250 MB and 300 MB in memory, while the news object read here is much smaller (about 19 MB). The blogs file has about 900,000 entries and the news file about 77,000, whereas the Twitter file has more than 2.3 million: because tweets are limited to 140 characters, the file contains many more, but much shorter, items. The word-count distributions for blogs and news entries are similar, while tweets cluster at low word counts because of the character limit.
Building the corpus and the uni-, bi-, tri-, and quadri-gram term-document matrices plays a significant role in evaluating word frequencies and predicting the following word. The next step is a Shiny app that can run the prediction model without reducing the sample size further.
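As an illustration of how the N-gram counts could feed the eventual prediction algorithm, here is a minimal sketch of a bigram lookup. The bi_df table and predict_next() helper are hypothetical names for illustration, not the final algorithm.

# Sketch only: build a bigram frequency table and return the words that most
# often follow a given word in the sample.
bi_counts <- sort(slam::row_sums(Bi_matrix), decreasing = TRUE)
bi_df <- data.frame(ngram = names(bi_counts), freq = bi_counts,
                    stringsAsFactors = FALSE)

predict_next <- function(word, n = 3) {
  pattern <- paste0("^", word, " ")
  hits <- bi_df[grepl(pattern, bi_df$ngram), ]   # bigrams whose first word matches
  head(sub(pattern, "", hits$ngram), n)          # keep only the second word
}

predict_next("thanks")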