I developed this algorithm to classify texts quickly and with good accuracy. It is aimed at real-time scoring and at scenarios that demand high performance, and it works well for large documents as well as for large collections. It has been tested on the 20 Newsgroups and aTribuna corpora, in every case with an F1 above 0.783.
After downloading the sources from:
All files and directories were verified to work as of the last check on 04/30/2017 3:33 pm.
In this step, only the TF-IDF table is created:
source("createTFIDF.R")
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## Loading required package: NLP
##
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
##
## annotate
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
## Joining, by = "file"
head(book_words)
## file word n total tf idf tf_idf class
## 1: response_1 and 1 8 0.125 0.6443570 0.08054463 not_flagged
## 2: response_1 avoid 1 8 0.125 4.3820266 0.54775333 not_flagged
## 3: response_1 conflict 1 8 0.125 4.3820266 0.54775333 not_flagged
## 4: response_1 i 1 8 0.125 0.4700036 0.05875045 not_flagged
## 5: response_1 of 1 8 0.125 0.9808293 0.12260366 not_flagged
## 6: response_1 sort 1 8 0.125 4.3820266 0.54775333 not_flagged
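createTFIDF.R itself is not reproduced here; the sketch below shows one way a table like book_words could be built with tidytext, assuming the corpus has already been read into a data frame with file, text and class columns (corpus and book_words_sketch are illustrative names, not objects created by the script).

library(dplyr)
library(tidytext)

# corpus: one row per document, with columns file, text, class (assumed layout)
book_words_sketch <- corpus %>%
  unnest_tokens(word, text) %>%              # one row per (file, word) occurrence
  count(file, class, word) %>%               # term frequency n per file
  group_by(file) %>%
  mutate(total = sum(n)) %>%                 # document length in tokens
  ungroup() %>%
  bind_tf_idf(word, file, n)                 # adds the tf, idf and tf_idf columns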
Loading the source and creating the class centroids, using 70% of the files for training and 30% for testing:
source("createClassCentroid.R")
createClassCentroid()
createFiles2Test()
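createClassCentroid.R and createFiles2Test() are not reproduced here; the sketch below captures the general idea under the assumption that a class centroid is the per-word mean tf-idf over that class's training files and that the 70/30 split is drawn at random (train_files, test_files and centroid_flagged are illustrative names).

library(dplyr)

set.seed(1)
files       <- unique(book_words$file)
train_files <- sample(files, size = floor(0.7 * length(files)))  # 70% for training
test_files  <- setdiff(files, train_files)                       # 30% held out for testing

# Centroid of the flagged class: mean tf-idf of each word over its training files
centroid_flagged <- book_words %>%
  filter(file %in% train_files, class == "flagged") %>%
  group_by(word) %>%
  summarise(mean = mean(tf_idf))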
After training we will classify, but first we will demonstrate a graphical representation of this classification, using two files: response_1 and response_49.
source("plotFiles.R")
par(mfrow=c(1,2))
plotFile(file1 = "response_1",wplot = TRUE,classCentroid = "flagged")
## [1] 0.3026763
plotFile(file1 = "response_1",wplot = TRUE,classCentroid = "not_flagged")
## [1] 0.9843987
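The value printed by plotFile() behaves like a similarity between the file and the chosen class centroid. The exact computation in plotFiles.R is not reproduced here; the following is only a plausible sketch, assuming a Pearson correlation between the file's tf_idf weights and the centroid means over the words they share (centroidSimilarity is an illustrative name, not a function from the repository).

library(dplyr)

centroidSimilarity <- function(file1, centroid) {
  doc    <- book_words %>% filter(file == file1) %>% select(word, tf_idf)
  shared <- inner_join(doc, centroid, by = "word")  # keep words present in both
  cor(shared$tf_idf, shared$mean)                   # Pearson correlation
}

# e.g. centroidSimilarity("response_1", centroid_flagged)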
Note the correlation of the predictors: strong for not_flagged and weak for flagged, as we expected. Below are the terms of response_1; then we take a look at response_49.
ind <- which(book_words$file == "response_1")
book_words[ind,]
## file word n total tf idf tf_idf class
## 1: response_1 and 1 8 0.125 0.6443570 0.08054463 not_flagged
## 2: response_1 avoid 1 8 0.125 4.3820266 0.54775333 not_flagged
## 3: response_1 conflict 1 8 0.125 4.3820266 0.54775333 not_flagged
## 4: response_1 i 1 8 0.125 0.4700036 0.05875045 not_flagged
## 5: response_1 of 1 8 0.125 0.9808293 0.12260366 not_flagged
## 6: response_1 sort 1 8 0.125 4.3820266 0.54775333 not_flagged
## 7: response_1 this 1 8 0.125 2.9957323 0.37446653 not_flagged
## 8: response_1 try 1 8 0.125 2.3025851 0.28782314 not_flagged
par(mfrow=c(1,2))
plotFile(file1 = "response_49",wplot = TRUE,classCentroid = "flagged")
## [1] 0.674903
plotFile(file1 = "response_49",wplot = TRUE,classCentroid = "not_flagged")
## [1] 0.9414847
A wrong decision! Now we run this test on all files and extract the accuracy of the Zipf's Law algorithm.
ind <- which(book_words$file == "response_49")
book_words[ind,]
##             file    word  n total          tf       idf      tf_idf   class
##   1: response_49       a 11   304 0.036184211 0.8266786 0.029912712 flagged
##   2: response_49   about  1   304 0.003289474 1.6739764 0.005506501 flagged
##   3: response_49  advice  1   304 0.003289474 2.7725887 0.009120358 flagged
##   4: response_49     all  1   304 0.003289474 2.7725887 0.009120358 flagged
##   5: response_49 already  1   304 0.003289474 4.3820266 0.014414561 flagged
##  ---
## 161: response_49    will  2   304 0.006578947 3.2834143 0.021601410 flagged
## 162: response_49    with  5   304 0.016447368 1.1631508 0.019130770 flagged
## 163: response_49   would  3   304 0.009868421 2.7725887 0.027361073 flagged
## 164: response_49    year  1   304 0.003289474 3.2834143 0.010800705 flagged
## 165: response_49   years  3   304 0.009868421 3.2834143 0.032402115 flagged
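This is why the result above is a wrong decision: under a nearest-centroid rule, response_49 correlates more strongly with not_flagged (0.94) than with flagged (0.67), even though its true class is flagged. A hypothetical sketch of that rule, reusing the illustrative centroidSimilarity() from earlier (classifyFile and centroid_not_flagged are illustrative names):

classifyFile <- function(file1, centroids) {
  # centroids: a named list of centroid data frames, one per class
  scores <- sapply(centroids, function(ctr) centroidSimilarity(file1, ctr))
  names(which.max(scores))   # the class with the highest correlation wins
}

# e.g. classifyFile("response_49",
#                   list(flagged = centroid_flagged, not_flagged = centroid_not_flagged))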
Flagged Class
source("iClassFile.R")
printf <- function(...) invisible(print(sprintf(...)))
f1 <- iClassFileAll(iclass = "flagged")[4]
## iclass
## 1 flagged
printf("Flagged Class Accuracy (F1): %f",f1)
## [1] "Flagged Class Accuracy (F1): 0.750000"
Not Flagged Class
f1 <- iClassFileAll(iclass = "not_flagged")[4]
## iclass
## 1 not_flagged
printf("not_Flagged Class Accuracy (F1): %f",f1)
## [1] "not_Flagged Class Accuracy (F1): 0.705882"
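The fourth element returned by iClassFileAll() is the F1 score reported above. For reference, F1 is the harmonic mean of precision and recall; the minimal, self-contained sketch below shows how it is derived from predicted and actual labels (f1_score is an illustrative helper, not part of the repository).

f1_score <- function(predicted, actual, positive) {
  tp <- sum(predicted == positive & actual == positive)  # true positives
  fp <- sum(predicted == positive & actual != positive)  # false positives
  fn <- sum(predicted != positive & actual == positive)  # false negatives
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  2 * precision * recall / (precision + recall)
}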
library("wordcloud")
## Loading required package: RColorBrewer
# Word cloud of the flagged class centroid, heaviest mean tf-idf terms first
classFlagged <- read.csv("data/centroid.flagged")
classFlagged <- classFlagged[order(classFlagged$mean, decreasing = TRUE),]
wordcloud(classFlagged$word, classFlagged$mean, scale=c(3,.1), min.freq=0.001,
          max.words=300, random.order=FALSE, rot.per=.35,
          colors=brewer.pal(8,"Dark2"))

# Word cloud of the not_flagged class centroid
classNotFlagged <- read.csv("data/centroid.not_flagged")
classNotFlagged <- classNotFlagged[order(classNotFlagged$mean, decreasing = TRUE),]
wordcloud(classNotFlagged$word, classNotFlagged$mean, scale=c(4,.3), min.freq=0.001,
          max.words=300, random.order=FALSE, rot.per=.15,
          colors=brewer.pal(8,"Dark2"))