word <- read.delim("data_input/trainingdata.txt")
head(word)
## X5485
## 1 1 champion products ch approves stock split champion products inc said its board of directors approved a two for one stock split of its common shares for shareholders of record as of april the company also said its board voted to recommend to shareholders at the annual meeting april an increase in the authorized capital stock from five mln to mln shares reuter
## 2 2 computer terminal systems cpml completes sale computer terminal systems inc said it has completed the sale of shares of its common stock and warrants to acquire an additional one mln shares to sedio n v of lugano switzerland for dlrs the company said the warrants are exercisable for five years at a purchase price of dlrs per share computer terminal said sedio also has the right to buy additional shares and increase its total holdings up to pct of the computer terminal s outstanding common stock under certain circumstances involving change of control at the company the company said if the conditions occur the warrants would be exercisable at a price equal to pct of its common stock s market price at the time not to exceed dlrs per share computer terminal also said it sold the technolgy rights to its dot matrix impact technology including any future improvements to woodco inc of houston tex for dlrs but it said it would continue to be the exclusive worldwide licensee of the technology for woodco the company said the moves were part of its reorganization plan and would help pay current operation costs and ensure product delivery computer terminal makes computer generated labels forms tags and ticket printers and terminals reuter
## 3 1 cobanco inc cbco year net shr cts vs dlrs net vs assets mln vs mln deposits mln vs mln loans mln vs mln note th qtr not available year includes extraordinary gain from tax carry forward of dlrs or five cts per shr reuter
## 4 1 am international inc am nd qtr jan oper shr loss two cts vs profit seven cts oper shr profit vs profit revs mln vs mln avg shrs mln vs mln six mths oper shr profit nil vs profit cts oper net profit vs profit revs mln vs mln avg shrs mln vs mln note per shr calculated after payment of preferred dividends results exclude credits of or four cts and or nine cts for qtr and six mths vs or six cts and or cts for prior periods from operating loss carryforwards reuter
## 5 1 brown forman inc bfd th qtr net shr one dlr vs cts net mln vs mln revs mln vs mln nine mths shr dlrs vs dlrs net mln vs mln revs billion vs mln reuter
## 6 1 dean foods df sees strong th qtr earnings dean foods co expects earnings for the fourth quarter ending may to exceed those of the same year ago period chairman kenneth douglas told analysts in the fiscal fourth quarter the food processor reported earnings of cts a share douglas also said the year s sales should exceed billion dlrs up from billion dlrs the prior year he repeated an earlier projection that third quarter earnings will probably be off slightly from last year s cts a share falling in the range of cts to cts a share douglas said it was too early to project whether the anticipated fourth quarter performance would be enough for us to exceed the prior year s overall earnings of dlrs a share in douglas said dean should experience a pct improvement in our bottom line from effects of the tax reform act alone president howard dean said in fiscal the company will derive benefits of various dairy and frozen vegetable acquisitions from ryan milk to the larsen co dean also said the company will benefit from its acquisition in late december of elgin blenders inc west chicago he said the company is a major shareholder of e b i foods ltd a united kingdom blender and has licensing arrangements in australia canada brazil and japan it provides ann entry to mcdonalds corp mcd we ve been after for years douglas told analysts reuter
word_class <- read.delim("https://raw.githubusercontent.com/TATABOX42/text-mining-in-r/master/labels.txt",header=FALSE)
word_text <- read.delim("https://raw.githubusercontent.com/TATABOX42/text-mining-in-r/master/dataset.txt", header=FALSE)
word_text <- word_text %>%
rename(text = V1)
word_class <- word_class %>%
rename(label = V1)
word <- data.frame(word_class, word_text)
head(word)
## label
## 1 1
## 2 2
## 3 1
## 4 1
## 5 1
## 6 1
## text
## 1 champion products ch approves stock split champion products inc said its board of directors approved a two for one stock split of its common shares for shareholders of record as of april the company also said its board voted to recommend to shareholders at the annual meeting april an increase in the authorized capital stock from five mln to mln shares reuter
## 2 computer terminal systems cpml completes sale computer terminal systems inc said it has completed the sale of shares of its common stock and warrants to acquire an additional one mln shares to sedio n v of lugano switzerland for dlrs the company said the warrants are exercisable for five years at a purchase price of dlrs per share computer terminal said sedio also has the right to buy additional shares and increase its total holdings up to pct of the computer terminal s outstanding common stock under certain circumstances involving change of control at the company the company said if the conditions occur the warrants would be exercisable at a price equal to pct of its common stock s market price at the time not to exceed dlrs per share computer terminal also said it sold the technolgy rights to its dot matrix impact technology including any future improvements to woodco inc of houston tex for dlrs but it said it would continue to be the exclusive worldwide licensee of the technology for woodco the company said the moves were part of its reorganization plan and would help pay current operation costs and ensure product delivery computer terminal makes computer generated labels forms tags and ticket printers and terminals reuter
## 3 cobanco inc cbco year net shr cts vs dlrs net vs assets mln vs mln deposits mln vs mln loans mln vs mln note th qtr not available year includes extraordinary gain from tax carry forward of dlrs or five cts per shr reuter
## 4 am international inc am nd qtr jan oper shr loss two cts vs profit seven cts oper shr profit vs profit revs mln vs mln avg shrs mln vs mln six mths oper shr profit nil vs profit cts oper net profit vs profit revs mln vs mln avg shrs mln vs mln note per shr calculated after payment of preferred dividends results exclude credits of or four cts and or nine cts for qtr and six mths vs or six cts and or cts for prior periods from operating loss carryforwards reuter
## 5 brown forman inc bfd th qtr net shr one dlr vs cts net mln vs mln revs mln vs mln nine mths shr dlrs vs dlrs net mln vs mln revs billion vs mln reuter
## 6 dean foods df sees strong th qtr earnings dean foods co expects earnings for the fourth quarter ending may to exceed those of the same year ago period chairman kenneth douglas told analysts in the fiscal fourth quarter the food processor reported earnings of cts a share douglas also said the year s sales should exceed billion dlrs up from billion dlrs the prior year he repeated an earlier projection that third quarter earnings will probably be off slightly from last year s cts a share falling in the range of cts to cts a share douglas said it was too early to project whether the anticipated fourth quarter performance would be enough for us to exceed the prior year s overall earnings of dlrs a share in douglas said dean should experience a pct improvement in our bottom line from effects of the tax reform act alone president howard dean said in fiscal the company will derive benefits of various dairy and frozen vegetable acquisitions from ryan milk to the larsen co dean also said the company will benefit from its acquisition in late december of elgin blenders inc west chicago he said the company is a major shareholder of e b i foods ltd a united kingdom blender and has licensing arrangements in australia canada brazil and japan it provides ann entry to mcdonalds corp mcd we ve been after for years douglas told analysts reuter
prop.table(table(word$label))
##
## 1 2 3 4 5 6
## 0.517775752 0.290975387 0.045761167 0.019690064 0.007474932 0.046125798
## 7 8
## 0.034639927 0.037556974
word <- word %>%
mutate(label = as.factor(label))
glimpse(word)
## Rows: 5,485
## Columns: 2
## $ label <fct> 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 3, 4, 1, 1, 1, 4...
## $ text <chr> "champion products ch approves stock split champion products ...
word_corpus <- VCorpus(VectorSource(word$text))
word_corpus <- tm_map(word_corpus, removeNumbers)
word_corpus <- tm_map(word_corpus, function(x) removeWords(x,stopwords("english")))
word_corpus <- tm_map(word_corpus, removePunctuation)
word_dtm <- DocumentTermMatrix(x = word_corpus)
inspect(word_dtm)
## <<DocumentTermMatrix (documents: 5485, terms: 19600)>>
## Non-/sparse entries: 247365/107258635
## Sparsity : 100%
## Maximal term length: 24
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs company cts dlrs loss mln net pct reuter said year
## 1342 0 0 3 0 0 0 2 1 11 2
## 1577 0 0 0 2 10 5 13 1 7 5
## 1978 0 0 4 0 0 0 0 1 13 2
## 2108 0 0 1 0 0 0 0 1 7 5
## 2902 0 0 5 0 1 0 8 1 14 7
## 3660 5 0 5 2 5 1 3 1 17 7
## 4278 0 0 1 0 0 0 1 1 18 3
## 4696 8 0 10 0 6 1 1 1 15 2
## 5332 0 0 0 0 0 0 10 1 16 3
## 706 0 0 6 0 4 0 0 1 12 1
tdm <- DocumentTermMatrix(x = word_corpus)
tdm_sparse <- removeSparseTerms(tdm, 0.90)
tdm_dm <- as.data.frame(as.matrix(tdm_sparse))
tdm_df <- as.matrix((tdm_dm > 0) + 0)
tdm_df <- as.data.frame(tdm_df)
tdm_df <- cbind(tdm_df, word$label)
tdm_df[1:10,]
## also april bank billion company corp cts dlr dlrs group inc last loss march
## 1 1 1 0 0 1 0 0 0 0 0 1 0 0 0
## 2 1 0 0 0 1 0 0 0 1 0 1 0 0 0
## 3 0 0 0 0 0 0 1 0 1 0 1 0 0 0
## 4 0 0 0 0 0 0 1 0 0 0 1 0 1 0
## 5 0 0 0 1 0 0 1 1 1 0 1 0 0 0
## 6 1 0 0 1 1 1 1 0 1 0 1 1 0 0
## 7 0 1 0 0 1 0 1 1 1 0 1 0 0 1
## 8 0 1 0 0 0 0 1 0 0 0 1 0 1 1
## 9 0 0 0 0 0 1 1 0 1 0 0 0 0 0
## 10 0 0 0 0 0 0 1 0 0 0 1 0 0 1
## may mln net new note one pay pct per prior profit qtr record reuter revs
## 1 0 1 0 0 0 1 0 0 0 0 0 0 1 1 0
## 2 0 1 0 0 0 1 1 1 1 0 0 0 0 1 0
## 3 0 1 1 0 1 0 0 0 1 0 0 1 0 1 0
## 4 0 1 1 0 1 0 0 0 1 1 1 1 0 1 1
## 5 0 1 1 0 0 1 0 0 0 0 0 1 0 1 1
## 6 1 0 0 0 0 0 0 1 0 1 0 1 0 1 0
## 7 0 1 0 1 0 1 0 1 0 1 0 0 1 1 0
## 8 0 1 1 0 1 0 0 0 0 1 1 1 1 1 1
## 9 0 1 1 0 1 0 0 0 0 0 0 1 0 1 1
## 10 0 0 0 0 0 0 1 0 0 1 0 0 1 1 0
## said sales share shares shr stock three two will year word$label
## 1 1 0 0 1 0 1 0 1 0 0 1
## 2 1 0 1 1 0 1 0 0 0 0 2
## 3 0 0 0 0 1 0 0 0 0 1 1
## 4 0 0 0 0 1 0 0 1 0 0 1
## 5 0 0 0 0 1 0 0 0 0 0 1
## 6 1 1 1 0 0 0 0 0 1 1 1
## 7 1 1 1 1 0 1 1 1 0 1 1
## 8 0 0 0 0 1 0 0 0 0 0 1
## 9 0 0 0 0 1 0 0 0 0 1 1
## 10 0 0 0 0 0 0 0 0 0 0 1
tdm_df <- tdm_df %>%
rename(label = `word$label`)
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(100)
intrain <- sample(nrow(tdm_df), nrow(tdm_df)*0.75)
word_train <- tdm_df[intrain, ]
word_test <- tdm_df[-intrain, ]
table(word_train$label)
##
## 1 2 3 4 5 6 7 8
## 2093 1214 197 80 34 198 144 153
#control <- trainControl(method="repeatedcv", number = 5, repeats = 4)
#set.seed(128)
#word_model <- train(label ~ ., data = word_train, method = "rf", trControl = control)
#saveRDS(word_model, "word_model1.RDS") # simpan model
word_model <- readRDS("word_model1.RDS")
word_model
## Random Forest
##
## 4113 samples
## 39 predictor
## 8 classes: '1', '2', '3', '4', '5', '6', '7', '8'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 4 times)
## Summary of sample sizes: 3290, 3290, 3292, 3288, 3292, 3291, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.7862898 0.6512330
## 20 0.8284083 0.7320242
## 39 0.8153398 0.7122870
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 20.
pred <- predict(word_model, newdata = word_test)
confusionMatrix(pred, word_test$label)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3 4 5 6 7 8
## 1 701 25 4 1 1 4 1 3
## 2 34 342 4 8 3 17 8 7
## 3 3 5 33 2 1 4 4 7
## 4 0 0 3 13 1 2 0 0
## 5 0 0 0 0 0 2 0 0
## 6 5 5 4 2 0 22 0 0
## 7 3 3 1 2 0 3 27 6
## 8 1 2 5 0 1 1 6 30
##
## Overall Statistics
##
## Accuracy : 0.8513
## 95% CI : (0.8314, 0.8697)
## No Information Rate : 0.5445
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7582
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3 Class: 4 Class: 5 Class: 6
## Sensitivity 0.9384 0.8953 0.61111 0.464286 0.000000 0.40000
## Specificity 0.9376 0.9182 0.98027 0.995536 0.998535 0.98785
## Pos Pred Value 0.9473 0.8085 0.55932 0.684211 0.000000 0.57895
## Neg Pred Value 0.9272 0.9579 0.98401 0.988914 0.994891 0.97526
## Prevalence 0.5445 0.2784 0.03936 0.020408 0.005102 0.04009
## Detection Rate 0.5109 0.2493 0.02405 0.009475 0.000000 0.01603
## Detection Prevalence 0.5394 0.3083 0.04300 0.013848 0.001458 0.02770
## Balanced Accuracy 0.9380 0.9067 0.79569 0.729911 0.499267 0.69393
## Class: 7 Class: 8
## Sensitivity 0.58696 0.56604
## Specificity 0.98643 0.98787
## Pos Pred Value 0.60000 0.65217
## Neg Pred Value 0.98568 0.98265
## Prevalence 0.03353 0.03863
## Detection Rate 0.01968 0.02187
## Detection Prevalence 0.03280 0.03353
## Balanced Accuracy 0.78669 0.77695
summary(predict(word_model, newdata = word_test))
## 1 2 3 4 5 6 7 8
## 740 423 59 19 2 38 45 46
Dari model random forest di atas saya berhasil mendapatkan accuracy 85% untuk memprediksi data word_test, yang didapatkan dari data Document Classification trainingdata.txt