Intro
oleh karena itu perlu adanya alat pendeteksi berita hoax, agar masyarakat bisa dengan mudah memfilter mana berita yang layak dikonsumsi, dan mana berita yang seharusnya kita abaikan dan tidak kita sebarluaskan. dan pada kesempatan kali ini, saya akan mencoba membuat model untuk mendeteksi mana berita yang hoax dan mana berita yang valid
Import Library
library(tidyverse) ## Warning: package 'tidyverse' was built under R version 4.0.4
## Warning: package 'ggplot2' was built under R version 4.0.4
## Warning: package 'tibble' was built under R version 4.0.4
## Warning: package 'tidyr' was built under R version 4.0.4
## Warning: package 'readr' was built under R version 4.0.4
## Warning: package 'purrr' was built under R version 4.0.4
## Warning: package 'dplyr' was built under R version 4.0.4
## Warning: package 'stringr' was built under R version 4.0.4
## Warning: package 'forcats' was built under R version 4.0.4
library(readr) #Read Rectangular Text Data
library(caret) #Classification and Regression Training## Warning: package 'caret' was built under R version 4.0.4
library(e1071) ### Warning: package 'e1071' was built under R version 4.0.4
library(devtools)## Warning: package 'devtools' was built under R version 4.0.4
## Warning: package 'usethis' was built under R version 4.0.4
library(katadasaR)
library(textclean)## Warning: package 'textclean' was built under R version 4.0.4
library(wordcloud)## Warning: package 'wordcloud' was built under R version 4.0.4
library(wordcloud2)## Warning: package 'wordcloud2' was built under R version 4.0.4
library(RColorBrewer)
library(tm)## Warning: package 'tm' was built under R version 4.0.4
library(stopwords)## Warning: package 'stopwords' was built under R version 4.0.4
library(xlsx)## Warning: package 'xlsx' was built under R version 4.0.4
library(randomForest)## Warning: package 'randomForest' was built under R version 4.0.4
Import dataset
hoax <- read.xlsx("hoaxnews.xlsx", sheetIndex = 1)
head(hoax)Data Wrangling
hoax$tagging <- as.factor(hoax$tagging)
hoax$berita <- str_replace_all(hoax$berita, "[\r\n]" , "")
hoax$berita <- str_replace_all(hoax$berita, "â.." , "")
hoax$berita <- str_replace_all(hoax$berita, "â.œ" , "")
hoax$berita <- str_replace_all(hoax$berita, "“bristleâ€\u009d" , "")
hoax$berita <- str_replace_all(hoax$berita, "â€\u009d" , "")WordClouds
hoax %>%
filter(tagging == "Hoax") %>%
pull(berita) %>%
wordcloud(max.words = 50,
scale = c(3, 1),
random.order = FALSE,
colors = brewer.pal(8, "BrBG"))## Warning in tm_map.SimpleCorpus(corpus, tm::removePunctuation): transformation
## drops documents
## Warning in tm_map.SimpleCorpus(corpus, function(x) tm::removeWords(x,
## tm::stopwords())): transformation drops documents
hoax %>%
filter(tagging == "Valid") %>%
pull(berita) %>%
wordcloud(max.words = 50,
scale = c(2, 0.8),
random.order = FALSE,
colors = brewer.pal(8, "BrBG"))## Warning in tm_map.SimpleCorpus(corpus, tm::removePunctuation): transformation
## drops documents
## Warning in tm_map.SimpleCorpus(corpus, function(x) tm::removeWords(x,
## tm::stopwords())): transformation drops documents
Start Text Mining
Transform into Corpus
hoax_corpus <- VCorpus(VectorSource(hoax$berita))hoax_corpus[[400]]$content## [1] "Semakin maraknya para gamer memainkan Pokemon Go, ternyata menimbulkan Pro dan Kontra khususnya di Indonesia. Apalagi, baru-baru ini terdengar kabar bahwa game tersebut memiliki makna tersirat yang tidak bisa ditolerir khususnya untuk para pemeluk agama Islam.Dari beberapa link yang tersebar di sosial media, dibahas bahwa Pokemon sebenarnya memiliki arti \"Aku Yahudi\" dalam bahasa Syriac, sedangkan Pikachu disebut sebagai \"Jadilah Yahudi\" dan Charmander berarti \"Tuhan itu Lemah\". Dimulai dari situlah masalah ini menyeruak, beberapa orang ada yang langsung percaya dan menyebarkan kembali link tersebut untuk mengajak yang lainnya menghindari Pokemon Go. Ada juga yang tidak percaya dan menganggap hal tersebut sesuatu yang dibuat-buat. Lantas, gosip tersebut benar atau tidak? Ternyata Nintendo membantahnya dan LA Times mencatat hal tersebut, bisa dibilang semua ini hanya sebuah rumor yang tidak terbukti sama sekali. Pokemon dalam bahasa Inggris merupakan singkatan dari dua kata, yaitu Pocket Monster yang berarti monster saku. Bisa dibilang, monster-monster yang disebut Pokemon ini memiliki ukuran yang kecil dan bisa disimpan ke dalam saku trainer-nya.Sedangkan Pikachu yang diketahui sebagai icon dari Pokemon juga terbentuk dari dari dua kata, yakni Pika (bersinar) dan Chu (tikus). Dari bentuknya saja jelas bahwa Pikachu adalah tikus yang bersinar atau bersenjatakan sengatan listrik. Terakhir Charmander juga memiliki arti api yang menyala/membakar, seperti tampilan dari Charmander sendiri di dalam anime maupun game Pokemon."
Cleaning unwanted words and symbol
transformer <- content_transformer(FUN = function(x, pattern){
gsub(x = x, # data text
pattern = pattern, # pattern yang ditemui
replacement = " ") # ganti pattern dengan spasi " "
})hoax_corpus <- tm_map(hoax_corpus, removeNumbers)
#to lower
hoax_corpus <- tm_map(hoax_corpus, content_transformer(tolower))
#mengganti symbol menjadi spasi
hoax_corpus <- tm_map(hoax_corpus, transformer, "/")
hoax_corpus <- tm_map(hoax_corpus, transformer, "@")
hoax_corpus <- tm_map(hoax_corpus, transformer, "\\.")
hoax_corpus <- tm_map(hoax_corpus, transformer, ":")
hoax_corpus <- tm_map(hoax_corpus, removePunctuation)
hoax_corpus <- tm_map(hoax_corpus, removeWords, stopwords("id", source = "stopwords-iso"))
hoax_corpus <- tm_map(hoax_corpus, stripWhitespace)
hoax_corpus <- tm_map(hoax_corpus, function(x) { stemDocument(x, language="indonesian") })hoax_corpus[[400]]$content## [1] "maraknya gamer memainkan pokemon go menimbulkan pro kontra indonesia barubaru terdengar kabar game memiliki makna tersirat ditolerir pemeluk agama islam link tersebar sosial media dibaha pokemon memiliki arti yahudi bahasa syriac pikachu yahudi charmand tuhan lemah situlah menyeruak orang langsung percaya menyebarkan link mengajak menghindari pokemon go percaya menganggap dibuatbuat lanta gosip nintendo membantahnya la time mencatat dibilang rumor terbukti pokemon bahasa inggri singkatan pocket monster monster saku dibilang monstermonst pokemon memiliki ukuran disimpan saku trainernya pikachu icon pokemon terbentuk pika bersinar chu tikus bentuknya pikachu tikus bersinar bersenjatakan sengatan listrik charmand memiliki arti api menyala membakar tampilan charmand anim game pokemon"
Transform into documenttermmatrix
hoax_dtm <- DocumentTermMatrix(hoax_corpus)
inspect(hoax_dtm)## <<DocumentTermMatrix (documents: 600, terms: 10003)>>
## Non-/sparse entries: 62837/5938963
## Sparsity : 99%
## Maximal term length: 51
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs babi bulu facebook ikan indonesia lele media pokemon sikat stroke
## 25 4 0 0 58 0 53 0 0 0 0
## 360 0 0 3 0 5 0 6 26 0 0
## 361 0 0 0 0 2 0 0 21 0 0
## 363 0 0 0 0 0 0 2 28 0 0
## 380 0 0 0 0 7 0 6 12 0 0
## 388 0 0 2 0 1 0 2 39 0 0
## 409 0 0 0 0 0 0 2 0 0 0
## 43 0 0 0 46 1 51 0 0 0 0
## 556 0 0 0 0 1 0 0 0 0 0
## 574 0 0 0 0 1 0 0 0 0 0
Take only words that show minimum 10
freqq <- findFreqTerms(hoax_dtm, lowfreq = 10)
hoax_dtm <- hoax_dtm[, freqq]Create First Model
Splitting
set.seed(100)
index <- sample(nrow(hoax_dtm), nrow(hoax_dtm)*0.75)
data_train <- hoax_dtm[index, ]
data_test <- hoax_dtm[-index, ]# Target Variable
status_train <- hoax[index, "tagging"]
status_test <- hoax[-index, "tagging"]prop.table(table(status_test))## status_test
## Hoax Valid
## 0.3666667 0.6333333
prop.table(table(status_train))## status_train
## Hoax Valid
## 0.3844444 0.6155556
Bernouli Conv
bernoulli_conv <- function(x){
x <- as.factor(ifelse(x > 1, 1, 0))
return(x)
}data_train_bn <- apply(X = data_train, MARGIN = 2, FUN = bernoulli_conv)
data_test_bn <- apply(X = data_test, MARGIN = 2, FUN = bernoulli_conv)Modeling
model_naive <- naiveBayes(x = data_train_bn, # data prediktor
y = status_train, # data target
laplace = 1)Prediction
hoax_pred_class <- predict(object = model_naive,
newdata = data_test_bn,
type = "class")Evaluation
confusionMatrix(data = hoax_pred_class,
reference = status_test,
positive = "Hoax")## Confusion Matrix and Statistics
##
## Reference
## Prediction Hoax Valid
## Hoax 50 56
## Valid 5 39
##
## Accuracy : 0.5933
## 95% CI : (0.5102, 0.6727)
## No Information Rate : 0.6333
## P-Value [Acc > NIR] : 0.8643
##
## Kappa : 0.2674
##
## Mcnemar's Test P-Value : 1.535e-10
##
## Sensitivity : 0.9091
## Specificity : 0.4105
## Pos Pred Value : 0.4717
## Neg Pred Value : 0.8864
## Prevalence : 0.3667
## Detection Rate : 0.3333
## Detection Prevalence : 0.7067
## Balanced Accuracy : 0.6598
##
## 'Positive' Class : Hoax
##
Create using another algirithm
Data Preprocessing
tdmdata <- data.frame(data.matrix(hoax_dtm), stringsAsFactors = FALSE)
a <- cbind(tagging = hoax$tagging, tdmdata)Splitting
set.seed(1234)
indexx <- sample(1:nrow(a), size = round(0.7*nrow(a)), replace=FALSE)
trainn <- a[indexx, ]
testt <- a[-indexx, ]Second Model
model1 <- svm(tagging~., data = trainn, scale = F, kernel = 'linear')pred <- predict(model1, testt[,-1])confusionMatrix(factor(pred), factor(testt$tagging))## Confusion Matrix and Statistics
##
## Reference
## Prediction Hoax Valid
## Hoax 31 30
## Valid 32 87
##
## Accuracy : 0.6556
## 95% CI : (0.5812, 0.7247)
## No Information Rate : 0.65
## P-Value [Acc > NIR] : 0.4720
##
## Kappa : 0.2374
##
## Mcnemar's Test P-Value : 0.8989
##
## Sensitivity : 0.4921
## Specificity : 0.7436
## Pos Pred Value : 0.5082
## Neg Pred Value : 0.7311
## Prevalence : 0.3500
## Detection Rate : 0.1722
## Detection Prevalence : 0.3389
## Balanced Accuracy : 0.6178
##
## 'Positive' Class : Hoax
##
Third Model
library(partykit)## Warning: package 'partykit' was built under R version 4.0.4
## Warning: package 'libcoin' was built under R version 4.0.4
model2 <- ctree(tagging~., data = trainn)pred2 <- predict(model2, testt[,-1])confusionMatrix(factor(pred2), factor(testt$tagging))## Warning in confusionMatrix.default(factor(pred2), factor(testt$tagging)): Levels
## are not in the same order for reference and data. Refactoring data to match.
## Confusion Matrix and Statistics
##
## Reference
## Prediction Hoax Valid
## Hoax 0 0
## Valid 63 117
##
## Accuracy : 0.65
## 95% CI : (0.5755, 0.7195)
## No Information Rate : 0.65
## P-Value [Acc > NIR] : 0.5342
##
## Kappa : 0
##
## Mcnemar's Test P-Value : 5.662e-15
##
## Sensitivity : 0.00
## Specificity : 1.00
## Pos Pred Value : NaN
## Neg Pred Value : 0.65
## Prevalence : 0.35
## Detection Rate : 0.00
## Detection Prevalence : 0.00
## Balanced Accuracy : 0.50
##
## 'Positive' Class : Hoax
##
Fourth Model
model3 <- randomForest(tagging~., data = trainn)pred3 <- predict(model3, testt[,-1])confusionMatrix(factor(pred3), factor(testt$tagging))## Confusion Matrix and Statistics
##
## Reference
## Prediction Hoax Valid
## Hoax 26 11
## Valid 37 106
##
## Accuracy : 0.7333
## 95% CI : (0.6624, 0.7964)
## No Information Rate : 0.65
## P-Value [Acc > NIR] : 0.010586
##
## Kappa : 0.3522
##
## Mcnemar's Test P-Value : 0.000308
##
## Sensitivity : 0.4127
## Specificity : 0.9060
## Pos Pred Value : 0.7027
## Neg Pred Value : 0.7413
## Prevalence : 0.3500
## Detection Rate : 0.1444
## Detection Prevalence : 0.2056
## Balanced Accuracy : 0.6593
##
## 'Positive' Class : Hoax
##
Conclusion
Berdasarkan 4 model yang saya coba, model dengan algoritma random forrest mendapatkan hasil akurasi tertinggi, namun itu juga belum cukup karena belum mencapai lebih dari 85%.kekurangan dari rmd ini mungkin kurang banyaknya sample, dan mungkin masih ada kekurangan dari segi data preprocessing dan data wrangling