A simple way to classify texts

I developed this algorithm to classify texts quickly and with a good hit rate. It is aimed at real-time classification and at scenarios that demand high performance, and it handles large documents as well as large collections. It has been tested on the 20 Newsgroups and aTribuna corpora, always with an F1 above 0.783.

Source

After downloading the sources from:

MyGithub

all files and directories worked as expected (last checked on 04/30/2017 at 3:33 pm).

Dataset

The first step is loading the dataset from Kaggle.

Dataset credits: samdeeplearning

Loading sources and executing

In this step only the TF-IDF table is created (a sketch of the construction follows the output below).

source("createTFIDF.R")
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## Loading required package: NLP
## 
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
## 
##     annotate
## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
## Joining, by = "file"
head(book_words)
##          file     word n total    tf       idf     tf_idf       class
## 1: response_1      and 1     8 0.125 0.6443570 0.08054463 not_flagged
## 2: response_1    avoid 1     8 0.125 4.3820266 0.54775333 not_flagged
## 3: response_1 conflict 1     8 0.125 4.3820266 0.54775333 not_flagged
## 4: response_1        i 1     8 0.125 0.4700036 0.05875045 not_flagged
## 5: response_1       of 1     8 0.125 0.9808293 0.12260366 not_flagged
## 6: response_1     sort 1     8 0.125 4.3820266 0.54775333 not_flagged
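
For reference, here is a minimal sketch of how a table like book_words can be built with tidytext, dplyr and data.table. It is an assumption about the general approach, not the contents of createTFIDF.R; the input data frame docs and the helper build_tfidf are illustrative names.

library(dplyr)
library(tidytext)
library(data.table)

# Sketch only: docs is assumed to hold one row per document with the
# columns file (id), text (raw text) and class (label).
build_tfidf <- function(docs) {
  tokens <- docs %>%
    unnest_tokens(word, text) %>%     # one row per word occurrence
    count(file, word, name = "n")     # term count n per document
  totals <- tokens %>%
    group_by(file) %>%
    summarise(total = sum(n))         # total words per document
  tokens %>%
    left_join(totals, by = "file") %>%
    bind_tf_idf(word, file, n) %>%    # adds tf, idf and tf_idf
    left_join(select(docs, file, class), by = "file") %>%
    as.data.table()
}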

Creating centroids

Loading the source and creating the class centroids, using 70% of the files for training and 30% for testing (a sketch of the centroid idea follows the code below).

source("createClassCentroid.R")
createClassCentroid()
createFiles2Test()
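
As a rough illustration of the centroid idea, a class centroid can be taken as the mean tf-idf of every word over the training files of that class. The sketch below is an assumption, not the code of createClassCentroid.R; only the word and mean columns mirror the centroid files read later on.

library(dplyr)

# Sketch only: average tf_idf per word over the training files of one
# class to obtain that class's centroid.
make_centroid <- function(book_words, train_files, target_class) {
  book_words %>%
    filter(file %in% train_files, class == target_class) %>%
    group_by(word) %>%
    summarise(mean = mean(tf_idf)) %>%
    arrange(desc(mean))
}

# Hypothetical 70/30 split for the flagged class:
# flagged_files <- unique(book_words$file[book_words$class == "flagged"])
# train         <- sample(flagged_files, floor(0.7 * length(flagged_files)))
# centroid      <- make_centroid(book_words, train, "flagged")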

Classifying

After training we can classify, but first we will demonstrate a graphical representation of this classification using a couple of files, response_1 and response_49.

source("plotFiles.R")
par(mfrow=c(1,2))
plotFile(file1 = "response_1",wplot = TRUE,classCentroid = "flagged")
## [1] 0.3026763
plotFile(file1 = "response_1",wplot = TRUE,classCentroid = "not_flagged")

## [1] 0.9843987

Note the correlations returned above: strong for not_flagged and weak for flagged, as we expected, so response_1 is assigned correctly. Now let's take a look at response_49.

ind <- which(book_words$file == "response_1")
book_words[ind,]
##          file     word n total    tf       idf     tf_idf       class
## 1: response_1      and 1     8 0.125 0.6443570 0.08054463 not_flagged
## 2: response_1    avoid 1     8 0.125 4.3820266 0.54775333 not_flagged
## 3: response_1 conflict 1     8 0.125 4.3820266 0.54775333 not_flagged
## 4: response_1        i 1     8 0.125 0.4700036 0.05875045 not_flagged
## 5: response_1       of 1     8 0.125 0.9808293 0.12260366 not_flagged
## 6: response_1     sort 1     8 0.125 4.3820266 0.54775333 not_flagged
## 7: response_1     this 1     8 0.125 2.9957323 0.37446653 not_flagged
## 8: response_1      try 1     8 0.125 2.3025851 0.28782314 not_flagged
par(mfrow=c(1,2))
plotFile(file1 = "response_49",wplot = TRUE,classCentroid = "flagged")
## [1] 0.674903
plotFile(file1 = "response_49",wplot = TRUE,classCentroid = "not_flagged")

## [1] 0.9414847

A wrong decision! Now we run this test for all the files and extract the accuracy of the Zipf's Law algorithm; a sketch of the decision rule follows the term listing below.

ind <- which(book_words$file == "response_49")
book_words[ind,]
##             file    word  n total          tf       idf      tf_idf
##   1: response_49       a 11   304 0.036184211 0.8266786 0.029912712
##   2: response_49   about  1   304 0.003289474 1.6739764 0.005506501
##   3: response_49  advice  1   304 0.003289474 2.7725887 0.009120358
##   4: response_49     all  1   304 0.003289474 2.7725887 0.009120358
##   5: response_49 already  1   304 0.003289474 4.3820266 0.014414561
##  ---                                                               
## 161: response_49    will  2   304 0.006578947 3.2834143 0.021601410
## 162: response_49    with  5   304 0.016447368 1.1631508 0.019130770
## 163: response_49   would  3   304 0.009868421 2.7725887 0.027361073
## 164: response_49    year  1   304 0.003289474 3.2834143 0.010800705
## 165: response_49   years  3   304 0.009868421 3.2834143 0.032402115
##        class
##   1: flagged
##   2: flagged
##   3: flagged
##   4: flagged
##   5: flagged
##  ---        
## 161: flagged
## 162: flagged
## 163: flagged
## 164: flagged
## 165: flagged
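
Before measuring the accuracy, here is a rough sketch of the decision rule suggested by the plots above: correlate the file's tf-idf vector with each class centroid and keep the class with the stronger correlation. The helper below and its names are illustrative assumptions; the actual logic lives in iClassFile.R.

library(dplyr)

# Sketch only: match the file's words against each centroid, treat
# missing words as zero and pick the class with the higher correlation.
classify_file <- function(book_words, fname, centroids) {
  doc <- book_words %>%
    filter(file == fname) %>%
    select(word, tf_idf)
  scores <- sapply(names(centroids), function(cls) {
    m <- merge(centroids[[cls]][, c("word", "mean")], doc,
               by = "word", all = TRUE)
    m$mean[is.na(m$mean)]     <- 0
    m$tf_idf[is.na(m$tf_idf)] <- 0
    cor(m$mean, m$tf_idf)
  })
  names(which.max(scores))
}

# Hypothetical usage with the centroid files written earlier:
# centroids <- list(flagged     = read.csv("data/centroid.flagged"),
#                   not_flagged = read.csv("data/centroid.not_flagged"))
# classify_file(book_words, "response_49", centroids)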

Algorithm Accuracy

Flagged Class

source("iClassFile.R")
printf <- function(...) invisible(print(sprintf(...)))
f1 <- iClassFileAll(iclass = "flagged")[4]
##    iclass
## 1 flagged
printf("Flagged Class Accuracy (F1): %f",f1)
## [1] "Flagged Class Accuracy (F1): 0.750000"

Not Flagged Class

f1 <- iClassFileAll(iclass = "not_flagged")[4]
##        iclass
## 1 not_flagged
printf("not_Flagged Class Accuracy (F1): %f",f1)
## [1] "not_Flagged Class Accuracy (F1): 0.705882"

Plotting words from the class centroids

library("wordcloud")
## Loading required package: RColorBrewer
classFlagged <- read.csv("data/centroid.flagged")
classFlagged <- classFlagged[order(classFlagged$mean,decreasing = TRUE),]
wordcloud(classFlagged$word,classFlagged$mean, scale=c(3,.1),min.freq=0.001,
          max.words=300, random.order=FALSE, rot.per=.35,
          colors=brewer.pal(8,"Dark2"))

classFlagged <- read.csv("data/centroid.not_flagged")
classFlagged <- classFlagged[order(classFlagged$mean,decreasing = TRUE),]
wordcloud(classFlagged$word,classFlagged$mean, scale=c(4,.3),min.freq=0.001,
          max.words=300, random.order=FALSE, rot.per=.15,
          colors=brewer.pal(8,"Dark2"))

The Scientist