Text mining, a form of data mining applied to text, is the process of discovering knowledge by extracting information from text-formatted data using Natural Language Processing (NLP). Classification, in turn, is the process of learning a function or model from a set of training data so that the model can be used to predict the class of test data. There are many classification methods, for example Naive Bayes, Support Vector Machine (SVM), and Random Forest. Several journal articles report that Random Forest achieves higher accuracy than the other methods they compare, namely Nalatissifa et al. (2021), Himawan and Eliyani (2021), and Hartmann et al. (2019), so this study uses Random Forest. The data are the Women’s E-Commerce Clothing Reviews dataset from kaggle.com, with two classes: recommended and not recommended. The libraries used are as follows.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(knitr) #kable
library(kableExtra) #kable_styling
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows
library(tokenizers) # tokenization
library(stringr) # str_* functions
library(SnowballC) # for wordStem
library(tidytext) #unnest tokens
library(tm) #tdm
## Loading required package: NLP
library(randomForest) #randomforest
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
library(rebus.base) #or
## 
## Attaching package: 'rebus.base'
## The following object is masked from 'package:stringr':
## 
##     regex
or()
## <regex> (?:)

df <- read.csv("D:/Semester 6/MDTT/Womens Clothing E-Commerce Reviews.csv", sep = ",")
str(df)

## 'data.frame':    23486 obs. of  11 variables:
##  $ X                      : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ Clothing.ID            : int  767 1080 1077 1049 847 1080 858 858 1077 1077 ...
##  $ Age                    : int  33 34 60 50 47 49 39 39 24 34 ...
##  $ Title                  : chr  "" "" "Some major design flaws" "My favorite buy!" ...
##  $ Review.Text            : chr  "Absolutely wonderful - silky and sexy and comfortable" "Love this dress!  it's sooo pretty.  i happened to find it in a store, and i'm glad i did bc i never would have"| __truncated__ "I had such high hopes for this dress and really wanted it to work for me. i initially ordered the petite small "| __truncated__ "I love, love, love this jumpsuit. it's fun, flirty, and fabulous! every time i wear it, i get nothing but great compliments!" ...
##  $ Rating                 : int  4 5 3 5 5 2 5 4 5 5 ...
##  $ Recommended.IND        : int  1 1 0 1 1 0 1 1 1 1 ...
##  $ Positive.Feedback.Count: int  0 4 0 0 6 4 1 4 0 0 ...
##  $ Division.Name          : chr  "Initmates" "General" "General" "General Petite" ...
##  $ Department.Name        : chr  "Intimate" "Dresses" "Dresses" "Bottoms" ...
##  $ Class.Name             : chr  "Intimates" "Dresses" "Dresses" "Pants" ...

The variables used in the analysis are X (the ID), Review.Text, and Recommended.IND, where 1 means recommended and 0 means not recommended.

df_review <- df[,c(1,5,7)]
kable(head(df_review,3), 
      caption ="<center>Table 1. Review Data</center>", 
      format = "html", align = 'ccc') %>% 
  kable_styling(bootstrap_options = "bordered", full_width = FALSE)

Table 1. Review Data
X	Review.Text	Recommended.IND
0	Absolutely wonderful - silky and sexy and comfortable	1
1	Love this dress! it’s sooo pretty. i happened to find it in a store, and i’m glad i did bc i never would have ordered it online bc it’s petite. i bought a petite and am 5’8”. i love the length on me- hits just a little below the knee. would definitely be a true midi on someone who is truly petite.	1
2	I had such high hopes for this dress and really wanted it to work for me. i initially ordered the petite small (my usual size) but i found this to be outrageously small. so small in fact that i could not zip it up! i reordered it in petite medium, which was just ok. overall, the top half was comfortable and fit nicely, but the bottom half had a very tight under layer and several somewhat cheap (net) over layers. imo, a major design flaw was the net over layer sewn directly into the zipper - it c	0
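
The three columns above are picked by position (1, 5, 7). As a minimal sketch on a made-up toy data frame (the values and the names `toy`, `keep`, and `toy_review` are assumptions for illustration), the same selection can be written by column name, which keeps working if the column order in the file changes:

```r
# Toy stand-in for the review data; the values are made up for illustration.
toy <- data.frame(
  X = 0:2,
  Clothing.ID = c(767, 1080, 1077),
  Review.Text = c("nice", "love it", "too small"),
  Recommended.IND = c(1, 1, 0),
  stringsAsFactors = FALSE
)

# Select the analysis columns by name instead of by position.
keep <- c("X", "Review.Text", "Recommended.IND")
toy_review <- toy[, keep]

names(toy_review)
```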

The steps in this classification analysis are as follows.

Preprocessing: data cleaning, stopword removal, tokenization, stemming, building the document-term matrix (DTM), and splitting the data into 80% training data and 20% testing data.
Building the model and predicting
Classification table
Computing the model accuracy.

1 Preprocessing - Cleaning Data

Data cleaning removes usernames, hashtags, links, numbers, punctuation, and repeated white space, and converts the text to lower case. The cleaned reviews are stored in a new variable named text_clean.

# remove usernames
df_review$text_clean <- str_replace_all(df_review$Review.Text,
                                        pattern=or("@\\w*: ","@\\w*"),
                                        replacement = "")
# remove hashtags
df_review$text_clean <- str_replace_all(df_review$text_clean,
                                       pattern="#\\w*",
                                       replacement = "")
# remove links (everything from the link to the end of the text)
df_review$text_clean <- str_replace_all(df_review$text_clean,
                                       pattern=or("https:.*","http:.*"),
                                       replacement = "")
# remove numbers (and any word characters attached after them)
df_review$text_clean <- str_replace_all(df_review$text_clean,
                                       pattern="\\d+\\w*",
                                       replacement = "")
# remove punctuation
df_review$text_clean <- str_replace_all(df_review$text_clean,
                                       pattern="[^[:alnum:][:space:]]",
                                       replacement = "")
# collapse repeated white space
df_review$text_clean <- str_squish(df_review$text_clean)
# convert to lower case
df_review$text_clean <- str_to_lower(df_review$text_clean)
kable(head(df_review,3), 
      caption ="<center>Table 2. Clean Data</center>", 
      format = "html", align = 'cccc') %>% 
  kable_styling(bootstrap_options = "bordered", full_width = FALSE)

Table 2. Clean Data
X	Review.Text	Recommended.IND	text_clean
0	Absolutely wonderful - silky and sexy and comfortable	1	absolutely wonderful silky and sexy and comfortable
1	Love this dress! it’s sooo pretty. i happened to find it in a store, and i’m glad i did bc i never would have ordered it online bc it’s petite. i bought a petite and am 5’8”. i love the length on me- hits just a little below the knee. would definitely be a true midi on someone who is truly petite.	1	love this dress its sooo pretty i happened to find it in a store and im glad i did bc i never would have ordered it online bc its petite i bought a petite and am i love the length on me hits just a little below the knee would definitely be a true midi on someone who is truly petite
2	I had such high hopes for this dress and really wanted it to work for me. i initially ordered the petite small (my usual size) but i found this to be outrageously small. so small in fact that i could not zip it up! i reordered it in petite medium, which was just ok. overall, the top half was comfortable and fit nicely, but the bottom half had a very tight under layer and several somewhat cheap (net) over layers. imo, a major design flaw was the net over layer sewn directly into the zipper - it c	0	i had such high hopes for this dress and really wanted it to work for me i initially ordered the petite small my usual size but i found this to be outrageously small so small in fact that i could not zip it up i reordered it in petite medium which was just ok overall the top half was comfortable and fit nicely but the bottom half had a very tight under layer and several somewhat cheap net over layers imo a major design flaw was the net over layer sewn directly into the zipper it c
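
The cleaning steps can also be collected into one function, which makes them easy to try on a single string. The sketch below is an illustration in base R, not the author's code: `clean_review` is a hypothetical helper, and it reuses the same patterns in the same order, with `gsub` standing in for `str_replace_all`.

```r
# Hypothetical helper: applies the document's cleaning steps in order, in base R.
clean_review <- function(x) {
  x <- gsub("(?:@\\w*: |@\\w*)", "", x, perl = TRUE)        # usernames
  x <- gsub("#\\w*", "", x, perl = TRUE)                    # hashtags
  x <- gsub("(?:https:.*|http:.*)", "", x, perl = TRUE)     # links, to end of text
  x <- gsub("\\d+\\w*", "", x, perl = TRUE)                 # numbers
  x <- gsub("[^[:alnum:][:space:]]", "", x, perl = TRUE)    # punctuation
  x <- gsub("\\s+", " ", trimws(x), perl = TRUE)            # repeated white space
  tolower(x)                                                # lower case
}

clean_review(c("Love this dress! It's sooo pretty, and I'm 5'8\".",
               "@anna: great fit #summer2021"))
```

Note that the number pattern removes the whole height "5'8\"" in the first string, which matches what happens to the second review in Table 2.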

2 Preprocessing - Stopwords

Stopword removal deletes words that carry little meaning in a sentence. The stopwords come from a separate file containing Indonesian and English words, which is loaded into stopwords_ind_eng. After the stopwords are removed, the resulting text is stored in a new variable named text_clean2.

stopwords_ind_eng <- readLines("D:/Semester 6/MDTT/stop_words_ind_eng.txt")

## Warning in readLines("D:/Semester 6/MDTT/stop_words_ind_eng.txt"): incomplete
## final line found on 'D:/Semester 6/MDTT/stop_words_ind_eng.txt'

df_tokens <- tokenize_words(df_review$text_clean, stopwords = stopwords_ind_eng)
# Rejoin each review's remaining tokens into a single string
df_review$text_clean2 <- vapply(df_tokens, paste, character(1), collapse = " ")
kable(head(df_review[,c(1,3,4,5)],3), 
      caption ="<center>Table 3. Clean Data 2</center>", 
      format = "html", align = 'cccc') %>% 
  kable_styling(bootstrap_options = "bordered", full_width = FALSE)

Table 3. Clean Data 2
X	Recommended.IND	text_clean	text_clean2
0	1	absolutely wonderful silky and sexy and comfortable	absolutely wonderful silky sexy comfortable
1	1	love this dress its sooo pretty i happened to find it in a store and im glad i did bc i never would have ordered it online bc its petite i bought a petite and am i love the length on me hits just a little below the knee would definitely be a true midi on someone who is truly petite	love dress sooo pretty happened store im glad bc online bc petite bought petite love length hits knee true midi petite
2	0	i had such high hopes for this dress and really wanted it to work for me i initially ordered the petite small my usual size but i found this to be outrageously small so small in fact that i could not zip it up i reordered it in petite medium which was just ok overall the top half was comfortable and fit nicely but the bottom half had a very tight under layer and several somewhat cheap net over layers imo a major design flaw was the net over layer sewn directly into the zipper it c	hopes dress initially petite usual size found outrageously zip reordered petite medium top half comfortable fit nicely bottom half tight layer cheap net layers imo major design flaw net layer sewn directly zipper
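
To make the stopword step concrete, here is a minimal base R sketch on one review. It is an illustration under assumptions: `stopwords_demo` is a tiny made-up list (the document loads its own list from a file) and `remove_stopwords` is a hypothetical helper that splits on white space instead of calling `tokenize_words`.

```r
# Hypothetical, tiny stopword list; the document reads its own list from a file.
stopwords_demo <- c("and", "this", "its", "i", "so")

# Hypothetical helper: drop stopwords and rejoin the remaining words.
remove_stopwords <- function(text, stopwords) {
  tokens <- strsplit(text, "\\s+")                          # split each text on white space
  kept <- lapply(tokens, function(w) w[!(w %in% stopwords)])
  vapply(kept, paste, character(1), collapse = " ")         # one string per text
}

remove_stopwords("absolutely wonderful silky and sexy and comfortable",
                 stopwords_demo)
```

For this input the result agrees with the first row of Table 3, since "and" is the only stopword present.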

3 Preprocessing - Tokenization and Stemming

Tokenization splits sentences into individual words, while stemming reduces each word to its stem. Tokenization uses the unnest_tokens function, and stemming uses wordStem.

df_token <- df_review %>%
  unnest_tokens(output="word", token = "words", input = text_clean2) %>%
  mutate(word = wordStem(word))
kable(head(df_token[,c(1,3,4,5)],3), 
      caption ="<center>Table 4. Token Data</center>", 
      format = "html", align = 'cccc') %>% 
  kable_styling(bootstrap_options = "bordered", full_width = FALSE)

Table 4. Token Data
Recommended.IND	text_clean	word
1	absolutely wonderful silky and sexy and comfortable	absolut
1	absolutely wonderful silky and sexy and comfortable	wonder
1	absolutely wonderful silky and sexy and comfortable	silki
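
The stemming step can be checked in isolation. This short sketch calls `wordStem` from SnowballC (already loaded above) on the three words of the first review; the vector names `words` and `stems` are illustrative.

```r
library(SnowballC)  # provides wordStem(), as loaded earlier in the document

words <- c("absolutely", "wonderful", "silky")
stems <- wordStem(words)   # default language is "porter"
stems
```

The stems are not always dictionary words: as Table 4 shows, "silky" becomes "silki" and "absolutely" becomes "absolut".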

4 Preprocessing - Document Term Matrix

The DTM is built to weight the tokenized words with TF-IDF. Term Frequency-Inverse Document Frequency (TF-IDF) is a commonly used method for computing the weight of each word. After the DTM is obtained, removeSparseTerms is applied to drop the terms (columns) whose entries are zero in at least 95% of the documents.

dtm <- df_token %>%
  count(X, word) %>%
  cast_dtm(document = X, term = word,
           value = n, weighting = weightTfIdf)
dtm1 <- removeSparseTerms(dtm, sparse = 0.95)
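
To show what the weighting and the sparsity filter do, the sketch below recomputes both by hand on a made-up count matrix in base R. It assumes the default normalized form of `weightTfIdf` (term counts divided by document length, times log2 of the number of documents over the document frequency); the matrix values, the names `counts`, `doc_freq`, and `keep`, and the threshold 0.5 are illustrative choices, not taken from the analysis.

```r
# Toy document-term counts: 3 documents (rows) x 3 terms (columns). Values are made up.
counts <- matrix(c(2, 1, 0,
                   0, 1, 1,
                   0, 1, 3),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(doc = c("d1", "d2", "d3"),
                                 term = c("dress", "love", "small")))

tf       <- counts / rowSums(counts)       # term frequency, normalized by document length
doc_freq <- colSums(counts > 0)            # number of documents containing each term
idf      <- log2(nrow(counts) / doc_freq)  # inverse document frequency
tfidf    <- sweep(tf, 2, idf, `*`)         # weight = tf * idf, column by column

# Sparsity filter: keep terms found in more than (1 - sparse) of the documents.
sparse <- 0.5                              # illustrative; the document uses 0.95
keep <- doc_freq > nrow(counts) * (1 - sparse)
tfidf[, keep, drop = FALSE]
```

A term that appears in every document, such as "love" here, gets weight 0 because its inverse document frequency is log2(1) = 0.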

5 Preprocessing - Training and Testing Data

This study uses 80% of the data for training and 20% for testing. set.seed is called before sampling so that the sampled rows do not change between runs.

sample_size <- floor(0.8*nrow(dtm1))
set.seed(111)
train_ind <- sample(nrow(dtm1),
                    size = sample_size)
train <- dtm1[train_ind,]
test <- dtm1[-train_ind,]
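
The split logic can be checked on a small example. This is a minimal sketch with a toy size of 10 rows (the names `n`, `train_idx`, and `test_idx` are illustrative); it confirms that the two index sets do not overlap and together cover every row.

```r
n <- 10                                         # toy number of documents
set.seed(111)                                   # same seed value as in the document
train_idx <- sample(n, size = floor(0.8 * n))   # 80% of the row indices, without replacement
test_idx  <- setdiff(seq_len(n), train_idx)     # the remaining 20%

length(train_idx)   # 8
length(test_idx)    # 2
```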

6 Building the Model and Predicting

The classification model is built with randomForest, using Recommended.IND as Y and the training data formed above as X, with ntree set to 1000.

train_rf <- randomForest(x = as.data.frame(as.matrix(train)),
                         y = as.factor(df_review$Recommended.IND[train_ind]), ntree = 1000)
pred_rf <- predict(train_rf, as.data.frame(as.matrix(test)))
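
The model code pairs labels with matrix rows by position, which relies on the matrix having one row per review in the original order. As a hedged sketch on toy data (an assumption about a possible edge case, not something the source reports), labels can instead be looked up by document ID, which stays correct even if a review with no remaining tokens has no row in the matrix. The objects `x`, `labels`, and `y` are hypothetical.

```r
# Toy feature matrix whose row names are document IDs; document 1 is absent.
x <- matrix(c(0.2, 0.0,
              0.0, 0.5,
              0.1, 0.3),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("0", "2", "3"), c("love", "small")))

# Toy label table keyed by the same ID column as the review data.
labels <- data.frame(X = 0:3, Recommended.IND = c(1, 1, 0, 1))

# Look up each matrix row's label by ID rather than by row position.
y <- labels$Recommended.IND[match(rownames(x), as.character(labels$X))]
y
```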

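7 Classification Table

The classification table shows how many observations were predicted correctly and incorrectly. Rows (Rekomendasi) are the actual labels and columns (prediksi) are the predicted labels.

## Build the confusion matrix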
c_matrix <- table(Rekomendasi = as.factor(df_review$Recommended.IND[-train_ind]), prediksi = pred_rf)
c_matrix

##            prediksi
## Rekomendasi    0    1
##           0  203  607
##           1  112 3776
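
The same table also gives per-class rates, which help when the classes are unbalanced. The sketch below copies the four counts from the output above into a matrix and derives recall, precision, and the share of the larger class; the names `cm`, `recall`, `precision`, and `majority_rate` are illustrative.

```r
# Confusion matrix copied from the output above (rows = actual, columns = predicted).
cm <- matrix(c(203, 607,
               112, 3776),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

recall    <- diag(cm) / rowSums(cm)            # share of each actual class predicted correctly
precision <- diag(cm) / colSums(cm)            # share of each predicted class that is correct
majority_rate <- max(rowSums(cm)) / sum(cm)    # accuracy of always predicting the larger class

round(rbind(recall, precision), 3)
round(majority_rate, 3)
```

Of the 810 reviews that are actually not recommended, 203 are predicted as such, while 3776 of the 3888 recommended reviews are predicted correctly.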

8 Model Accuracy

Model accuracy measures the proportion of predictions that match the actual labels.

\(\text{accuracy} = \dfrac{n_{11}+n_{22}}{\sum_{i}\sum_{j} n_{ij}}\)

# Compute accuracy
sum(diag(c_matrix))/sum(c_matrix)

## [1] 0.8469562
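
As a check, substituting the four counts from the classification table into the accuracy formula gives the same value:

\(\text{accuracy} = \dfrac{203 + 3776}{203 + 607 + 112 + 3776} = \dfrac{3979}{4698} \approx 0.847\)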

9 References

Nalatissifa, H., Gata, W., Diantika, S., & Nisa, K. (2021). “Perbandingan Kinerja Algoritma Klasifikasi Naive Bayes, Support Vector Machine (SVM), dan Random Forest untuk Prediksi Ketidakhadiran di Tempat Kerja”. Jurnal Informatika Universitas Pamulang, 5(4).

Himawan, R. D., & Eliyani, E. (2021). “Perbandingan Akurasi Analisis Sentimen Tweet terhadap Pemerintah Provinsi DKI Jakarta di Masa Pandemi”. JEPIN (Jurnal Edukasi dan Penelitian Informatika), 7(1).

Hartmann, J., Huppertz, J., Schamp, C., & Heitmann, M. (2019). “Comparing automated text classification methods”. International Journal of Research in Marketing, 36(1).

Text Classification

Odelia_10611810000033

7/5/2021