The word-to-sentence ratio is similar to the word-to-line ratio for the Twitter data, but the two differ significantly for the news and blog data-sets (en_US). This influences the approach for creating the training data-sets used to generate and test the n-gram data-models behind the next-word prediction tool.
The goal of this Milestone Report submission is to provide a brief and concise summary of the data analysis carried out during the initial stages of the project.
It gives an overview of how the data was obtained, presents initial summary statistics for the data-set, and outlines an initial plan for the creation of the prediction algorithm.
The data-set was obtained directly from Coursera and has already been language filtered. The original data comes from a corpus called HC Corpora [www.corpora.heliohost.org](http://www.corpora.heliohost.org), which was collected by web-crawler programs.
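For reference, a minimal sketch of how the files could be downloaded and unpacked with base R is shown below. The download URL is assumed to be the standard Coursera-SwiftKey capstone link (the report itself does not list it), and the target paths simply mirror the data_directory used in the analysis chunk further down.

# Hypothetical download of the course data-set -- the URL is an assumption based on
# the Coursera-SwiftKey capstone materials, not taken from this report
zip_url  <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
zip_file <- "../data/Coursera-SwiftKey.zip"
if (!file.exists(zip_file)) {
    download.file(zip_url, destfile=zip_file, mode="wb")
    unzip(zip_file, exdir="../data/raw")   # unpack the language sub-folders (en_US, ...)
}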
When analysing the files I am prioritising looking at the data as sentences rather than as lines. My reasoning is that I am looking to predict the next word within a sentence, and never the starting word of a new sentence, so it does not make sense at this stage to treat the end of one sentence and the start of the next as a single sequence. Investigating the average number of words per sentence, rather than per line, should highlight:-
The provided data files contain non-ASCII characters and must be opened in binary mode (open="rb"); otherwise, on Windows, only part of each file is read and the resulting view of the data-set is incomplete.
start.time <- Sys.time()
library(stringr)

data_directory <- "../data/raw/en_US/"
save_directory <- "../data/clean/en_US/"
files <- list.files( data_directory, pattern = ".txt", recursive=TRUE)
summary_table <- NULL
#
for (filename in files) {
    # Open in binary mode so the full file is read on Windows
    fin <- file(paste0(data_directory,filename), open="rb", encoding="UTF-8" )
    data.text <- readLines(fin, n=-1, encoding="UTF-8", warn=FALSE)
    close(fin)
    # Replace accented characters
    data.text <- iconv(data.text, to='ASCII//TRANSLIT')
    # Convert to lowercase
    data.text <- tolower(data.text)
    # Count sentence terminators and lines
    sentencecnt <- sum(str_count(data.text,"[\\.|?]"))
    linecnt <- length(data.text)
    # Split the strings into separate words
    data.words.list <- strsplit(data.text, "\\W+", perl=TRUE)
    data.words.vector <- unlist(data.words.list)
    wordcnt <- length(data.words.vector)
    ## List each word and frequency (top-10)
    data.freq.list <- sort( table(data.words.vector) , decreasing=TRUE)
    print( paste("Top 10 words for data-set",filename))
    print( data.freq.list[1:10]) # Top 10 common words
    ## List frequency distribution of word length
    data.freq.wordlength <- table( unlist( lapply( data.words.list , nchar) ) )
    print( paste("Most popular word length for",filename,"is:",
                 names(data.freq.wordlength)[which.max(data.freq.wordlength)]) )
    #
    title.text <- paste('Word length frequency', str_replace(filename,"en_US.",""))
    barplot( data.freq.wordlength[data.freq.wordlength>100] ,
             col=rainbow(16), main=title.text, xlab="Word size", ylab="Frequency", xlim=c(1,15),
             cex.axis=0.6, cex.names=0.6, cex.main=0.8)
    #
    summary_table <- rbind(summary_table ,
                           cbind(filename, sentencecnt, linecnt,
                                 wordsentenceratio=round( wordcnt / sentencecnt , 2),
                                 wordlineratio=round( wordcnt / linecnt , 2) ) )
    # Preserve object
    #save(data.text,file=paste0(save_directory,filename,".RData"),compress=TRUE,compression_level=9 )
}
## [1] "Top 10 words for data-set en_US.blogs.txt"
## data.words.vector
##     the       a     and      to     of      i     in   that     it 
## 1854668 1174753 1093421 1068774 876535 846674 597655 471084 443575 
##     is 
## 431813
## [1] "Most popular word length for en_US.blogs.txt is: 3"
## [1] "Top 10 words for data-set en_US.news.txt"
## data.words.vector
##     the       a     to    and     of     in      s   that    for 
## 1971936 1059007 905973 889072 774469 678854 386394 367727 353763 
##     is 
## 284161
## [1] "Most popular word length for en_US.news.txt is: 3"
## [1] "Top 10 words for data-set en_US.twitter.txt"
## data.words.vector
##    the      i     to      a    you    and    for     it     in     of 
## 937298 921720 788788 674361 599859 438658 385397 382222 380498 359670
## [1] "Most popular word length for en_US.twitter.txt is: 4"
This highlights the contextual differences between the data sources. The top-10 words are fairly consistent across all three sources, and these are exactly the words (the so-called stop-words) that would need to be removed prior to building the model. Unfortunately, the numbers do not give a clear indication of the value of ‘n’ to use for the n-grams.
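As an illustration, a minimal sketch of how these stop-words could be filtered out of the frequency table produced above; the stop-word list here is only illustrative (taken from the top-10 tables), and the final model may use a fuller list from an NLP package.

# Illustrative stop-word list (not the definitive one for the final model)
stop_words <- c("the", "a", "and", "to", "of", "i", "in", "that", "it",
                "is", "you", "for", "s")
# data.freq.list holds the word frequencies for the last file processed above
data.freq.nostop <- data.freq.list[ !(names(data.freq.list) %in% stop_words) ]
print( data.freq.nostop[1:10] )   # top-10 words once stop-words are removed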
The table below compares the three data-sets: for blogs and news there is a significant difference between the word-to-line and word-to-sentence ratios, because those lines typically contain several sentences. It is not practical to create, store and use n-grams of length 10 or more, but these ratios do support feeding individual sentences into the NLP library rather than supplying an entire line of text as a single sequence (a short sketch of this sentence splitting follows the table).
summary_table
##      filename            sentencecnt linecnt   wordsentenceratio wordlineratio
## [1,] "en_US.blogs.txt"   "3213456"   "899288"  "12.03"           "42.99"
## [2,] "en_US.news.txt"    "2523440"   "1010242" "14.24"           "35.57"
## [3,] "en_US.twitter.txt" "3185329"   "2360148" "9.81"            "13.25"
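A minimal sketch of the sentence splitting referred to above, assuming sentences are delimited by '.', '?' or '!'; the exact tokeniser may change once the NLP library is chosen.

# Split each line of data.text into sentences on terminal punctuation (assumed delimiters)
data.sentences <- unlist( strsplit(data.text, "[.?!]+", perl=TRUE) )
data.sentences <- trimws(data.sentences)                       # drop leading/trailing spaces
data.sentences <- data.sentences[ nchar(data.sentences) > 0 ]  # drop empty fragments
head(data.sentences, 3)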
My planned next steps for this project are:-
Processing time: 5.2327428
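The elapsed time above is presumably derived from the start.time recorded at the top of the analysis chunk; a minimal sketch of that calculation follows (the unit of minutes is an assumption, as the report does not state it).

# Elapsed processing time since start.time (minutes assumed; not stated in the report)
processing.time <- as.numeric( difftime(Sys.time(), start.time, units="mins") )
paste("Processing time:", processing.time)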