Introduction

The goal of this project is to build a predictive text mining application. To do this, we will load several text files (news, blogs & twitter), explore the dataset, clean it & build a simple text predictor.

Data Exploration

First, we will load some R packages.

library(R.utils)
## Warning: package 'R.utils' was built under R version 3.4.4
## Warning: package 'R.oo' was built under R version 3.4.4
library(tokenizers)
## Warning: package 'tokenizers' was built under R version 3.4.4
library(stringr)
## Warning: package 'stringr' was built under R version 3.4.4
library(stringi)
## Warning: package 'stringi' was built under R version 3.4.4

Now we can explore our files from the following aspects:
- File size
- Lines per file
- Index of the longest line in each file
- Total word count per file

setwd("D:/Proj1/final/en_US")
fileNames <- list.files(path=".", recursive=T, pattern=".*en_.*.txt")
df <- data.frame(fileNames)
length(fileNames)
## [1] 4
for (i in seq_along(fileNames)) {
    # File size in whole megabytes (1 MB = 1048576 bytes)
    df$Size.MB[i] <- floor(file.size(fileNames[i]) / 1048576)
    conn <- file(fileNames[i], open="r")
    someFile <- readLines(conn, encoding = "UTF-8", skipNul = TRUE)
    close(conn)

    df$linesCount[i] <- length(someFile)
    # Index of the longest line (by character count), not its length
    df$biggestLine[i] <- which.max(nchar(someFile))
    # Words are approximated as whitespace-separated tokens
    df$wordsCount[i] <- sum(lengths(strsplit(someFile, "\\s+")))
}
## Warning in readLines(conn, encoding = "UTF-8", skipNul = TRUE): incomplete
## final line found on 'en_US.news.txt'
df
##           fileNames Size.MB linesCount biggestLine wordsCount
## 1   en_US.blogs.txt     200     899288      483415   37334131
## 2  en_US.corpus.txt     107     831746      174454   21350030
## 3    en_US.news.txt     196      77259       14556    2643969
## 4 en_US.twitter.txt     159    2360148          26   30373583
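
The news file reports far fewer lines (77259) than its 196 MB size would suggest, and the run ends with an "incomplete final line" warning. One possible explanation, which is an assumption and not something this report confirms, is that a text-mode connection on Windows stops reading at an embedded end-of-file control character. A minimal sketch of reading through a binary-mode connection instead, using a hypothetical temporary file in place of the real data:

```r
# Hypothetical stand-in for one of the en_US files
tmp <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line"), tmp)

# open = "rb" reads the raw bytes, so a control character should not end the read early
conn <- file(tmp, open = "rb")
sampleLines <- readLines(conn, encoding = "UTF-8", skipNul = TRUE)
close(conn)

length(sampleLines)
unlink(tmp)
```

If the line count of the news file changes after switching to "rb", the word count and the longest-line index in the table would need to be recomputed as well.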

Let’s load the text files (news, blogs & twitter).

setwd("D:/Proj1/final/en_US")
conn <- file("en_US.blogs.txt", open="r")
blogs <- readLines(conn, encoding = "UTF-8", skipNul = TRUE)
close(conn)

conn <- file("en_US.news.txt", open="r")
news <- readLines(conn, encoding = "UTF-8", skipNul = TRUE)
close(conn)

conn <- file("en_US.twitter.txt", open="r")
twitter <- readLines(conn,encoding = "UTF-8", skipNul = TRUE)
close(conn)

Create a single corpus to simplify the data cleaning work.

corpus <- c(blogs, news, twitter)
# Remove the separate vectors to free memory
rm(blogs, news, twitter)

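Let’s clean our data as follows:
- Remove non-alphabetic characters (the pattern keeps only letters and whitespace, so digits are dropped too).
- Collapse repeated whitespace into a single space.
- Convert all words to lower case.
- Break chunks into reasonable sentences.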

corpus <- stringr::str_replace_all(corpus,"[^a-zA-Z\\s]", " ")
corpus <- stringr::str_replace_all(corpus,"[\\s]+", " ")
corpus <- tolower(corpus)
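
To see what these three steps do to a single line, here is a small sketch on a made-up sentence (the sample text is an assumption and not taken from the corpus):

```r
library(stringr)

# Made-up sample line, not taken from the corpus
sampleLine <- "Hello, World!  It's 2018..."

cleaned <- str_replace_all(sampleLine, "[^a-zA-Z\\s]", " ")  # non-letters become spaces
cleaned <- str_replace_all(cleaned, "[\\s]+", " ")           # collapse whitespace runs
cleaned <- tolower(cleaned)
cleaned
# "hello world it s "
```

Note two side effects: the apostrophe splits "It's" into "it s", and a trailing space is left where the digits and dots used to be.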

Build N-grams of 2 to 5 words. The N-grams will help us predict the next word.

Ngrams2 <- tokenize_ngrams(corpus, n = 2, n_min = 2)
Ngrams3 <- tokenize_ngrams(corpus, n = 3, n_min = 3)
Ngrams4 <- tokenize_ngrams(corpus, n = 4, n_min = 4)
Ngrams5 <- tokenize_ngrams(corpus, n = 5, n_min = 5)
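
As a quick illustration of what tokenize_ngrams returns, here is a sketch on a made-up sentence. Setting n and n_min to the same value yields N-grams of exactly that length:

```r
library(tokenizers)

# Made-up sample sentence, not taken from the corpus
sampleText <- "the quick brown fox jumps"

# The result is a list with one character vector per input element
bigrams <- tokenize_ngrams(sampleText, n = 2, n_min = 2)[[1]]
bigrams
# "the quick" "quick brown" "brown fox" "fox jumps"

trigrams <- tokenize_ngrams(sampleText, n = 3, n_min = 3)[[1]]
trigrams
# "the quick brown" "quick brown fox" "brown fox jumps"
```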

Let’s plot the most frequent 3-grams. To do this, we need to calculate the frequency of each N-gram & sort.

FreqWordsNgrams3 <- table(unlist(Ngrams3))
FreqWordsNgrams3 <- sort(FreqWordsNgrams3, decreasing = TRUE)
FreqWordsNgrams3 <- head(FreqWordsNgrams3, 50)
# width and height are arguments of png(), not of plot()
#png(filename="FreqWordsNgrams3.png", width = 480, height = 480)
plot(FreqWordsNgrams3)
#dev.off()
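
The counting and sorting step can be checked on a handful of made-up trigrams. This is only a sketch: the sample vector stands in for unlist(Ngrams3), and the bar plot with rotated labels is one possible way to keep long N-gram labels readable, not the plot used in this report.

```r
# Made-up trigrams standing in for unlist(Ngrams3)
sampleNgrams <- c("one of the", "a lot of", "one of the",
                  "thanks for the", "one of the", "a lot of")

# Count each distinct N-gram, then order from most to least frequent
freq <- sort(table(sampleNgrams), decreasing = TRUE)
topNgrams <- head(freq, 2)
topNgrams

# las = 2 rotates the axis labels, cex.names shrinks them
barplot(topNgrams, las = 2, cex.names = 0.7)
```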

Next Step - Prediction

To make a prediction, we need to take the following steps:
(1) Grab a phrase to search for. The phrase also needs to be cleaned of punctuation, numbers & extra whitespace.
(2) Take the phrase & extract its last 4 words, since the 5-gram is the longest N-gram.
(3) Search for the phrase in the (n+1)-grams. Point to think about: how to match the phrase only at the beginning of an N-gram, since it could also appear at the end.
(4) Decide whether to search in all 4 N-gram lists or only in a specific one.
(5) Extract the predicted word.
(6) If there are too many candidate words, return the most relevant one.

Let’s search for a phrase. The phrase needs to be cleaned of punctuation, numbers & extra whitespace. Then we’ll extract the last 4 words, since the 5-gram is the longest N-gram.

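myFunction <- function(inputPhrase){
    # Same cleaning as for the corpus: letters only, single spaces, lower case
    inputPhrase <- stringr::str_replace_all(inputPhrase,"[^a-zA-Z\\s]", " ")
    inputPhrase <- stringr::str_replace_all(inputPhrase,"[\\s]+", " ")
    inputPhrase <- tolower(trimws(inputPhrase))
    # Keep only the last 4 words of the input phrase
    phraseSearch <- paste(tail(strsplit(inputPhrase,split=" ")[[1]],4), collapse = " ")
    # Count the kept words to choose the matching N-gram list
    wc <- count_words(phraseSearch)
    # Trailing space so that only whole words are matched
    phraseSearch1 <- paste0(phraseSearch, " ")
    return(list(wc = wc, phraseSearch1 = phraseSearch1))
}
# Usage: res <- myFunction("some phrase"); wc <- res$wc; phraseSearch1 <- res$phraseSearch1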

Since we have several N-gram lists, we’ll search only the one whose N-grams are one word longer than the phrase.

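# A phrase of wc words is looked up in the (wc + 1)-grams
if (wc == 1) {
    searchTemp <- unlist(Ngrams2)
} else if (wc == 2) {
    searchTemp <- unlist(Ngrams3)
} else if (wc == 3) {
    searchTemp <- unlist(Ngrams4)
} else if (wc >= 4) {
    # Longer phrases were already cut to their last 4 words
    searchTemp <- unlist(Ngrams5)
}
# "^" anchors the phrase at the start of the N-gram, so the word after it is the prediction
searchList <- grep(paste0("^", phraseSearch1), searchTemp)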

First, search for the phrase in the (n+1)-grams. Second, return the most relevant words by frequency.

# Extract the next word (the last word) from each matching N-gram
if (length(searchList) > 0) {
    words <- vapply(strsplit(searchTemp[searchList], split=" "),
                    function(x) tail(x, 1), character(1))
    # Rank the candidate words by frequency and show the top 5
    words.freq <- sort(table(words), decreasing = TRUE)
    print("Suggested words:")
    print(head(words.freq, 5))
} else {
    print("No suggestions were found!")
}
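
One possible answer to the question of whether to search all N-gram lists or only one is a simple back-off: try the longest matching N-gram first and, if nothing is found, drop the oldest word and retry with a shorter one. The sketch below illustrates that idea on a made-up toy corpus using base R only. The names buildNgrams, predictNext and toyCorpus are hypothetical helpers introduced here, and this plain back-off (without any probability discounting) is an assumption, not the method used in this report.

```r
# Build all n-grams of size n from a vector of cleaned lines (base R only)
buildNgrams <- function(lines, n) {
  unlist(lapply(strsplit(lines, " ", fixed = TRUE), function(words) {
    words <- words[words != ""]
    if (length(words) < n) return(character(0))
    starts <- seq_len(length(words) - n + 1)
    vapply(starts, function(s) paste(words[s:(s + n - 1)], collapse = " "),
           character(1))
  }))
}

# Made-up toy corpus, already lower-cased and stripped of punctuation
toyCorpus <- c("i want to go home", "i want to eat",
               "we want to go out", "i want to go now")

# Element k of the list holds the (k + 1)-grams, i.e. 2-grams up to 5-grams
ngramLists <- lapply(2:5, function(n) buildNgrams(toyCorpus, n))

predictNext <- function(phrase, ngramLists, top = 3) {
  words <- strsplit(trimws(tolower(phrase)), "\\s+")[[1]]
  words <- tail(words, length(ngramLists))   # keep at most the last 4 words
  while (length(words) > 0) {
    # Trailing space so that only whole words match at the start of the N-gram
    prefix <- paste0(paste(words, collapse = " "), " ")
    candidates <- ngramLists[[length(words)]]
    hits <- candidates[startsWith(candidates, prefix)]
    if (length(hits) > 0) {
      nextWords <- sub(".* ", "", hits)      # last word of each matching N-gram
      ranked <- sort(table(nextWords), decreasing = TRUE)
      return(names(head(ranked, top)))
    }
    words <- words[-1]                       # back off: drop the oldest word
  }
  character(0)                               # nothing matched at any length
}

predictNext("they want to", ngramLists)
# "go" "eat"
```

Here "they want to" has no match among the 4-grams, so the search backs off to "want to", which is followed by "go" three times and by "eat" once in the toy corpus.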

Conclusion

For now, we can conclude the following:
- The bigger the dataset, the more accurate the word prediction. However, it causes performance issues on a PC. For example, when I used a regular PC I ran into many memory & CPU issues, so I used a small sample (~3%), while working with the full dataset was not an issue on a powerful server.
- Longer N-grams give more accurate word predictions.
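
The report does not show how the ~3% sample was drawn. One common way, sketched here under the assumption of simple random sampling of lines, is to pick line indices with sample(). The seed value and the stand-in vector fullCorpus are arbitrary choices made for this illustration:

```r
set.seed(1234)  # arbitrary seed, used only to make the sample reproducible

# Made-up stand-in for the full corpus vector
fullCorpus <- paste("line", seq_len(1000))

sampleFraction <- 0.03  # the ~3% mentioned in the conclusion
sampleSize <- round(length(fullCorpus) * sampleFraction)

# Draw line indices without replacement and keep only those lines
sampleIdx <- sample(length(fullCorpus), size = sampleSize)
corpusSample <- fullCorpus[sampleIdx]
length(corpusSample)
# 30
```

Sampling before the cleaning and N-gram steps keeps memory use proportional to the sample rather than to the full corpus.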

The NLP project gave me a better understanding of how to build a full R project. Thank you JHU!