ATTENTION: This project does not use the data set suggested in the rubric. I am using a data set from my personal research that I want to show you.

My understanding is that we can use any data set known in the Data Science literature.

In the next sections we will analyze a subset of the data that I will work with throughout the Capstone.

Introduction

There are many models and metrics for text prediction. Here I will show a novel text prediction approach using one of the most recognized data sets in machine learning. I chose a data set with academic standing; it is available from the UCI repository.

Source:
Original Owner and Donor:

Tom Mitchell
School of Computer Science
Carnegie Mellon University
tom.mitchell ‘@’ cmu.edu
http://www.cs.cmu.edu/~tom/

After downloading and unpacking the archive (using tar -zxf 20_newsgroups.tar.gz), you can inspect the data set like this:

list.files("class")
##  [1] "alt.atheism"              "comp.graphics"           
##  [3] "comp.os.ms-windows.misc"  "comp.sys.ibm.pc.hardware"
##  [5] "comp.sys.mac.hardware"    "comp.windows.x"          
##  [7] "misc.forsale"             "rec.autos"               
##  [9] "rec.motorcycles"          "rec.sport.baseball"      
## [11] "rec.sport.hockey"         "sci.crypt"               
## [13] "sci.electronics"          "sci.med"                 
## [15] "sci.space"                "soc.religion.christian"  
## [17] "talk.politics.guns"       "talk.politics.mideast"   
## [19] "talk.politics.misc"       "talk.religion.misc"

Corpus: 20 classes and 20,000 files

The original class directory is 20_newsgroups, but I renamed it to the generic name class.
I prepared the text preprocessing with shell scripts. One of the most important metrics for assessing the importance of words is TF-IDF.

TF-IDF stands for “Term Frequency, Inverse Document Frequency”. It is a way to score the importance of words (or “terms”) in a document based on how frequently they appear across multiple documents.

If a word appears frequently in a document, it’s important. Give the word a high score. But if a word appears in many documents, it’s not a unique identifier. Give the word a low score.

Therefore, common words like “the” and “for”, which appear in many documents, will be scaled down. Words that appear frequently in a single document will be scaled up.

[Yates2011] R. Baeza-Yates and B. Ribeiro-Neto (2011). Modern Information Retrieval. Addison Wesley, pp. 68-74.
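To make the scoring concrete, here is a minimal R sketch of TF-IDF on a toy three-document corpus. The toy documents and the log-based IDF form are illustrative assumptions, not the exact weighting used by my preprocessing scripts:

# Minimal TF-IDF sketch on a toy corpus (illustrative only).
docs   <- c("the cat sat on the mat",
            "the dog sat on the log",
            "cats and dogs are friends")
tokens <- strsplit(tolower(docs), "\\s+")

tf    <- function(d) table(d) / length(d)                    # term frequency within one document
vocab <- unique(unlist(tokens))
df    <- sapply(vocab, function(w)                           # in how many documents each term appears
                  sum(sapply(tokens, function(d) w %in% d)))
idf   <- log(length(tokens) / df)                            # inverse document frequency

tfidf1 <- tf(tokens[[1]]) * idf[names(tf(tokens[[1]]))]      # TF-IDF for the first document
round(sort(tfidf1, decreasing = TRUE), 3)                    # terms unique to this document score highest

Words that occur everywhere ("the", "on") end up with small weights, while words specific to a single document get the largest ones, which is exactly the behavior described above.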

I created all the files with their respective weights (TF-IDF) using the scripts posted on GitHub under the name DistProcess. These scripts perform the data cleaning.

You can download all the source scripts from DistProcess.

The Data Set summary

After the weights have been computed for all files and terms, the file names stay the same, but the contents look like this:

doc <- read.csv("./index/51124-7.txt.idx", stringsAsFactors = FALSE, header = FALSE,
                col.names = c("term", "tfidf"), sep = ";", encoding = "UTF-8")

head(doc)
##       term    tfidf
## 1    about 1.411171
## 2  against 3.683870
## 3 anecdote 9.158213
## 4 annoying 7.368633
## 5  another 3.187491
## 6    argue 6.034831

All files (e.g. 8514-2408.txt) are transformed into their .idx versions under the index directory.
In this example, the file holds a vector with 152 dimensions.

Looking for patterns, we show the first plot.

maxylim <- as.numeric(max(doc$tfidf))  # upper limit for the y axis
doc$i <- 1:length(doc$term)            # sequential index for the terms
plot(doc$i, doc$tfidf, col = "blue", type = "p", main = "51124-7.txt",
     xlim = c(0, length(doc$term)), ylim = c(0, maxylim + 5), xlab = "Terms", ylab = "TF-IDF")
abline(glm(doc$tfidf ~ doc$i), col = "blue")  # linear trend of TF-IDF over the term index

The term words are replaced by a sequential index to illustrate the pattern.

In this paper, a Generalized Centroid-based Classifier (GCC) and its variants for text categorization are proposed and compared against two well-known classifiers: the K-nearest-neighbor (KNN) classifier and the Rocchio classifier. KNN, a lazy learning method, achieves remarkable effectiveness but suffers from inefficiency in online categorization. Rocchio categorizes efficiently but fails to obtain an expressive categorization model due to its inherent linear-separability assumption. The proposed method focuses on two points: first, we use each word's importance from the TF-IDF metric to strengthen the expressiveness of the Zipf's-law model proposed in this experiment; second, we improve the results by combining a GLM (Generalized Linear Model) to speed up the categorization process, while a correlation test guarantees fast classification with accuracy close to that of the Rocchio algorithm.

How can I classify documents by the importance of their words?

The answer comes from the centroid, which you can see in the next picture. For each class, we take the mean of the values of each term over all its documents. For example, we compute the centroid of the alt.atheism class.
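As an illustration of the idea (not the exact DistProcess code), a per-class centroid could be built by averaging the TF-IDF of each term over all .idx files of that class; the directory layout below is an assumption:

# Hedged sketch: average TF-IDF per term over the documents of one class.
# The path "./index/alt.atheism" and the aggregation are illustrative only.
files    <- list.files("./index/alt.atheism", full.names = TRUE)
idx_docs <- lapply(files, read.csv, header = FALSE, sep = ";",
                   col.names = c("term", "tfidf"),
                   stringsAsFactors = FALSE, encoding = "UTF-8")
combined <- do.call(rbind, idx_docs)
centroid <- aggregate(tfidf ~ term, data = combined, FUN = mean)  # mean weight of each term
head(centroid[order(centroid$tfidf, decreasing = TRUE), ])

In practice the precomputed centroids are stored under the statistic directory, as read below.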

ni <- read.csv("./statistic/alt.atheism.trn", stringsAsFactors = FALSE)
head(subset(ni, mean > 0))
##       X       term  tf       mean   i
## 10   10    aaaahhh   1 0.02945875  10
## 60   60      aario   6 0.50761679  60
## 61   61     aarnet  22 0.41474211  61
## 62   62      aaron 122 0.03562637  62
## 95   95  abandoned  47 0.06100583  95
## 109 109 abberation   2 0.15361267 109

We can produce the same kind of plot (as the document plot shown before):

ni$i <- 1:length(ni$term)  # sequential index for the terms
plot(ni$i, ni$mean, col = "red", pch = 16, xlim = c(0, max(ni$i)), ylim = c(0, max(ni$mean)),
     xlab = "Terms", ylab = "Mean")
abline(glm(ni$mean ~ ni$i), col = "red")  # linear trend of the class means

If you look at both plots you cannot see any pattern, but a data scientist must take a step further and look ahead. Let's make some modifications for the exploratory data analysis.

Averages in descending order:
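The modification is simply to reorder both vectors by decreasing weight so they can be compared on the same rank axis; here is a minimal sketch with the ni and doc objects loaded above (the plotClass helper presumably does this reordering, plus the plotting and scoring, internally):

# Sort the class means and the document TF-IDF in descending order (illustrative only).
ni_sorted  <- ni[order(ni$mean,   decreasing = TRUE), ]
doc_sorted <- doc[order(doc$tfidf, decreasing = TRUE), ]
head(ni_sorted$term)   # most characteristic terms of the class
head(doc_sorted$term)  # heaviest terms of the document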

source("plotClass.R")
plotClass(lfile = "alt.atheism", compare = "./index/51124-7.txt.idx",wplot = TRUE)

## [1] 0.06656369

Now we can classify some documents just using Zipf's law.

see more about Zipf’s law
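For readers unfamiliar with Zipf's law (the k-th most frequent term occurs roughly in proportion to 1/k), here is a minimal illustration using the doc vector loaded earlier; plotting TF-IDF weights instead of raw frequencies is an assumption of this sketch:

# Zipf-like decay: sorted weights against their rank, on a log-log scale.
w <- sort(doc$tfidf, decreasing = TRUE)
r <- seq_along(w)
plot(log(r), log(w), col = "darkgreen", type = "p",
     xlab = "log(rank)", ylab = "log(TF-IDF)", main = "Zipf-like decay")
abline(lm(log(w) ~ log(r)), col = "darkgreen")  # approximately linear under a Zipf-like distribution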

If you take another document from a different class, you will see a different arrangement.
The file 14990-11011.txt belongs to the sci.crypt class.

TF-IDF in descending order, over the same terms as the centroid.

source("plotClass.R")
plotClass(lfile = "alt.atheism", compare = "./index/14990-11011.txt.idx",wplot = TRUE)

## [1] 0.0392327

Just take a look: do they look alike?
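Taken together, the two scores suggest a simple decision rule: score the document against every class centroid and pick the highest. The sketch below assumes plotClass returns the printed similarity value and accepts wplot = FALSE to suppress the plot; both are assumptions about my helper, not verified here:

# Hypothetical decision rule built on the plotClass scores shown above.
classes <- list.files("class")
scores  <- sapply(classes, function(cl)
  plotClass(lfile = cl, compare = "./index/51124-7.txt.idx", wplot = FALSE))
names(which.max(scores))   # expected to be "alt.atheism" for this document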

The next step is to try to get a classification using Zipf's law.

The research problem is:
What accuracy can we achieve when classifying documents using Zipf's law?

The Scientist