Abstract
The final goal of this project is to simulate the development of a data product that predicts the next word of an English text, as known from smartphone keyboards. This document describes the first part of the development process: obtaining the data, splitting it into training and test sets, cleaning it, and exploring word and word pair frequencies.
The data basis of this Capstone project is the Capstone Dataset provided via CloudFront.
It contains the 3 major files for this analysis: en_US.blogs.txt, en_US.news.txt and en_US.twitter.txt.
I put these files in my local development directory and start to get a feeling for the data.
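For reproducibility, the raw data can be downloaded and unpacked with a few lines of R. This is only a sketch: the zip URL and the internal folder structure (final/en_US/...) are assumptions based on the commonly distributed Coursera-SwiftKey archive, not taken from this report.
# sketch: download and unpack the English files of the Capstone Dataset
# (URL and archive layout are assumptions, see note above)
zip_url = "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
zip_file = "../DataSet/Coursera-SwiftKey.zip"
if (!file.exists(zip_file)) {
  download.file(zip_url, destfile = zip_file, mode = "wb")
  unzip(zip_file, files = c("final/en_US/en_US.blogs.txt",
                            "final/en_US/en_US.news.txt",
                            "final/en_US/en_US.twitter.txt"),
        exdir = "../DataSet", junkpaths = TRUE)
}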
My first investigation is to determine the file size, the number of lines and the number of words of each file.
For this I use the standard Linux commands ls and wc.
# file size
>ls -l en_US.blogs.txt
# count lines
>wc -l en_US.blogs.txt
# count words
>wc -w en_US.blogs.txt
## size_in_MB lines words
## en_US.blogs.txt 210 899288 37334114
## en_US.news.txt 205 1010242 34365936
## en_US.twitter.txt 167 2360148 30359804
The available data needs to be split into training data and test data. Because the source files are quite large, I decided to perform the split chunk by chunk using file streaming, so it can also run on less powerful computers.
split_data <- function(path_filename_extension, p){
  library("tools")
  # set seed for reproducible sampling
  set.seed(100)
  # set chunk_size for read/write
  chunk_size = 10000
  # assigns each line to the training set with probability p
  # stores training data in <filename>_train.<extension>
  # stores test data in <filename>_test.<extension>
  # build file names for the training data set (file) and test data set (file)
  filename = file_path_sans_ext(basename(path_filename_extension))
  path = dirname(path_filename_extension)
  extension = file_ext(path_filename_extension)
  training_file_name = paste(path, "/", filename, "_", "train", ".", extension, sep = "")
  test_file_name = paste(path, "/", filename, "_", "test", ".", extension, sep = "")
  # connections for reading and writing data
  con_read <- file(description = path_filename_extension, open = "r")
  con_write_training = file(description = training_file_name, open = "w")
  con_write_test = file(description = test_file_name, open = "w")
  # read first source data chunk
  source_data = readLines(con_read, chunk_size, skipNul = TRUE, warn = FALSE)
  # loop over source data
  while(length(source_data) != 0)
  {
    # randomly split the chunk into training data / test data
    is_training = rbinom(n = length(source_data), size = 1, prob = p)
    training_data = source_data[is_training == 1]
    test_data = source_data[is_training == 0]
    # write data
    writeLines(training_data, con_write_training)
    writeLines(test_data, con_write_test)
    # read next source data chunk
    source_data = readLines(con_read, chunk_size, skipNul = TRUE, warn = FALSE)
  }
  # close file streams
  close(con_read)
  close(con_write_training)
  close(con_write_test)
}
# split source data into training data (10%) and test data (90%)
split_data("../DataSet/en_US.blogs.txt",0.1)
split_data("../DataSet/en_US.news.txt",0.1)
split_data("../DataSet/en_US.twitter.txt",0.1)
Only about 10% of the raw data is used for training. That is enough for exploring the data.
In the following steps I only operate on the training data. The remaining test data will come into focus in later phases of this project.
Now the data is preprocessed to provide a reasonably clean data source for the further steps. Since we are ultimately interested in sequences of words, I keep only alphabetic characters and line breaks and replace everything else by spaces.
In addition, the following preprocessing steps are performed: conversion to lower case and collapsing of multiple spaces into one.
# writes a cleaned version of the file to disk
clean_file <- function(source_file, dest_file){
  chunk_size = 10000
  # connections for reading and writing data
  con_read <- file(description = source_file, open = "r")
  con_write = file(description = dest_file, open = "w")
  # read first source data chunk
  source_data = readLines(con_read, chunk_size, skipNul = TRUE, warn = FALSE)
  # loop over source data
  while(length(source_data) != 0)
  {
    # keep only alphabetic characters and line breaks
    clean_data = gsub("[^A-Za-z\n\r]", " ", source_data)
    # convert to lower case
    clean_data = tolower(clean_data)
    # collapse multiple spaces into one
    clean_data = gsub("\\s+", " ", clean_data)
    # write data
    writeLines(clean_data, con_write)
    # read next source data chunk
    source_data = readLines(con_read, chunk_size, skipNul = TRUE, warn = FALSE)
  }
  # close file streams
  close(con_read)
  close(con_write)
}
# training data
clean_file(source_file = "../DataSet/en_US.blogs_train.txt", dest_file = "../DataSet/clean_data/en_US.blogs_train.txt")
clean_file(source_file = "../DataSet/en_US.news_train.txt", dest_file = "../DataSet/clean_data/en_US.news_train.txt")
clean_file(source_file = "../DataSet/en_US.twitter_train.txt", dest_file = "../DataSet/clean_data/en_US.twitter_train.txt")
At this step I did not remove stopwords, because they should be predicted as well, so I keep them in the training data.
Now we are interested in the frequency of words and the frequency of word sequences. All 3 source types are examined separately, so the differences between the 3 sources become visible and comparable.
Calculate the frequency of words: first compute the absolute counts and then the probability of occurrence of each word.
# get data.frame with the word frequency per document
freqWord1 <- function(){
  library(tm)
  library(RWeka)
  # load all 3 cleaned up training files
  docs <- VCorpus(DirSource("../DataSet/clean_data/"))
  # setup tokenizer for single words (unigrams)
  UnigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
  # generate TermDocumentMatrix
  tdm <- TermDocumentMatrix(docs, control = list(tokenize = UnigramTokenizer)) # about 531 MB
  # convert into frequency matrix (one column per document)
  data.frame(as.matrix(tdm))
}
freq_tdm_M = freqWord1()
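To get from the absolute counts to the probability of occurrence, each document column can be normalized by its total word count; the sumfreq column shown below sums the counts over all documents. The exact code used for this report is not shown here, so the following is only a minimal sketch, assuming freq_tdm_M holds the raw counts with one column per document.
# sketch: probabilities per document and combined top-20 list (assumes freq_tdm_M as above)
prob_tdm_M = sweep(freq_tdm_M, 2, colSums(freq_tdm_M), FUN = "/")
# summed counts over all documents, used for the combined top-20 list
sumfreq = rowSums(freq_tdm_M)
head(rownames(freq_tdm_M)[order(-sumfreq)], 20)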
Now take a look at some characteristics.
Here are the top 20 words per document, followed by the top 20 words over all documents combined (sumfreq).
## en_US.blogs_train.txt en_US.news_train.txt en_US.twitter_train.txt
## 1 the the the
## 2 and and you
## 3 that that and
## 4 for for for
## 5 you with that
## 6 with said with
## 7 was was your
## 8 this his have
## 9 have from this
## 10 but but are
## 11 are have just
## 12 not are can
## 13 they they but
## 14 from this what
## 15 all has all
## 16 one you not
## 17 can not like
## 18 about who was
## 19 what will out
## 20 will about get
## sumfreq
## 1 the
## 2 and
## 3 that
## 4 for
## 5 you
## 6 with
## 7 was
## 8 this
## 9 have
## 10 are
## 11 but
## 12 not
## 13 from
## 14 they
## 15 all
## 16 can
## 17 will
## 18 just
## 19 said
## 20 your
You can see that the ordering differs slightly between the documents.
A plot of the probability of occurrence of the words across the different documents shows:
Note that the plot has a logarithmic scale on the y-axis.
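A minimal sketch of how such a plot can be produced with ggplot2 and reshape2, assuming prob_tdm_M from the sketch above (the packages and the exact plotting code are assumptions, not taken from the report):
library(ggplot2)
library(reshape2)
# take the 20 most frequent words and reshape to long format for plotting
top = prob_tdm_M[order(-rowSums(prob_tdm_M)), ][1:20, ]
top$word = factor(rownames(top), levels = rownames(top))
plot_data = melt(top, id.vars = "word", variable.name = "document", value.name = "probability")
# one point per word and document, logarithmic scale on the y-axis
ggplot(plot_data, aes(x = word, y = probability, colour = document)) +
  geom_point() +
  scale_y_log10() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))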
Calculate the frequency of word pairs. As above, first compute the absolute counts and then the probability of occurrence of each word pair.
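The word pair frequencies can be computed analogously to freqWord1 above; a minimal sketch, assuming the same tm/RWeka setup with an n-gram length of 2 (the name freqWord2 is chosen here for illustration):
# get data.frame with the word pair (bigram) frequency per document
freqWord2 <- function(){
  library(tm)
  library(RWeka)
  # load all 3 cleaned up training files
  docs <- VCorpus(DirSource("../DataSet/clean_data/"))
  # setup tokenizer for word pairs (bigrams)
  BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
  # generate TermDocumentMatrix of word pairs
  tdm <- TermDocumentMatrix(docs, control = list(tokenize = BigramTokenizer))
  # convert into frequency matrix (one column per document)
  data.frame(as.matrix(tdm))
}
freq2_tdm_M = freqWord2()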
And take a look at some characteristics.
Here are the top 20 word pairs per document, followed by the top 20 word pairs over all documents combined (sumfreq).
## en_US.blogs_train.txt en_US.news_train.txt en_US.twitter_train.txt
## 1 of the of the i m
## 2 in the in the it s
## 3 to the to the don t
## 4 it s on the in the
## 5 on the for the for the
## 6 to be it s of the
## 7 i m at the on the
## 8 and the and the to be
## 9 and i in a can t
## 10 for the to be to the
## 11 don t with the thanks for
## 12 i was from the you re
## 13 i have he said if you
## 14 at the with a at the
## 15 it was of a i love
## 16 it is as a that s
## 17 with the don t i can
## 18 that i for a thank you
## 19 is a is a going to
## 20 in a it was have a
## sumfreq
## 1 of the
## 2 in the
## 3 it s
## 4 i m
## 5 to the
## 6 for the
## 7 on the
## 8 don t
## 9 to be
## 10 at the
## 11 and the
## 12 in a
## 13 with the
## 14 and i
## 15 is a
## 16 it was
## 17 for a
## 18 from the
## 19 i have
## 20 if you
You can see that the ordering differs slightly between the documents.
A plot of the probability of occurrence of the word pairs across the different documents shows a different picture than for the single word frequencies.
Note that the plot has a logarithmic scale on the y-axis.
The size of the model is too large; more restrictions are needed.

* Build another model without stopwords (see the sketch below).
* Identify words of other languages, e.g. by evaluating the data files for other languages provided by the Capstone Dataset.
* Take care of line breaks when extracting word pairs.
* Remove the many junk words like 'aaaa' or 'uuuhh' that are still present.
* Compress the model.
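For the stopword-free model, a minimal sketch of how stopwords could be filtered out of a cleaned, lower-case line, using the English stopword list shipped with the tm package (the helper name remove_stopwords is hypothetical):
library(tm)
# remove English stopwords from a single cleaned, lower-case line
remove_stopwords <- function(line){
  words = unlist(strsplit(line, "\\s+"))
  words = words[words != "" & !(words %in% stopwords("en"))]
  paste(words, collapse = " ")
}
remove_stopwords("this is a test of the model")
# should yield something like "test model"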
We found some interesting results, but the model still needs to be optimized and compressed.