Text mining, a form of data mining, is the process of discovering knowledge by extracting information from text data using Natural Language Processing (NLP). Classification is the process of learning a function or model from a set of training data so that the model can then predict the class of test data. There are many classification methods, for example Naive Bayes, Support Vector Machine (SVM), and Random Forest. Several papers report that Random Forest achieved higher accuracy than the other methods they compared, namely Nalatissifa et al. (2021), Himawan and Eliyani (2021), and Hartmann et al. (2019). For that reason, this study uses Random Forest. The data are the Women’s E-Commerce Clothing Reviews dataset from kaggle.com, with two classes: recommended and not recommended. The following libraries are used.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(knitr) #kable
library(kableExtra) #kable_styling
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
library(tokenizers) # tokenization
library(stringr) #str
library(SnowballC) # for wordStem
library(tidytext) #unnest tokens
library(tm) #tdm
## Loading required package: NLP
library(randomForest) #randomforest
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
library(rebus.base) #or
##
## Attaching package: 'rebus.base'
## The following object is masked from 'package:stringr':
##
## regex
or()
## <regex> (?:)
df <- read.csv("D:/Semester 6/MDTT/Womens Clothing E-Commerce Reviews.csv", sep = ",")
str(df)
## 'data.frame': 23486 obs. of 11 variables:
## $ X : int 0 1 2 3 4 5 6 7 8 9 ...
## $ Clothing.ID : int 767 1080 1077 1049 847 1080 858 858 1077 1077 ...
## $ Age : int 33 34 60 50 47 49 39 39 24 34 ...
## $ Title : chr "" "" "Some major design flaws" "My favorite buy!" ...
## $ Review.Text : chr "Absolutely wonderful - silky and sexy and comfortable" "Love this dress! it's sooo pretty. i happened to find it in a store, and i'm glad i did bc i never would have"| __truncated__ "I had such high hopes for this dress and really wanted it to work for me. i initially ordered the petite small "| __truncated__ "I love, love, love this jumpsuit. it's fun, flirty, and fabulous! every time i wear it, i get nothing but great compliments!" ...
## $ Rating : int 4 5 3 5 5 2 5 4 5 5 ...
## $ Recommended.IND : int 1 1 0 1 1 0 1 1 1 1 ...
## $ Positive.Feedback.Count: int 0 4 0 0 6 4 1 4 0 0 ...
## $ Division.Name : chr "Initmates" "General" "General" "General Petite" ...
## $ Department.Name : chr "Intimate" "Dresses" "Dresses" "Bottoms" ...
## $ Class.Name : chr "Intimates" "Dresses" "Dresses" "Pants" ...
The variables used in the analysis are X (the review ID), Review.Text, and Recommended.IND, where 1 means recommended and 0 means not recommended.
df_review <- df[,c(1,5,7)]
kable(head(df_review,3),
caption ="<center>Table 1. Review Data</center>",
format = "html", align = 'ccc') %>%
kable_styling(bootstrap_options = "bordered", full_width = FALSE)
| X | Review.Text | Recommended.IND |
|---|---|---|
| 0 | Absolutely wonderful - silky and sexy and comfortable | 1 |
| 1 | Love this dress! it’s sooo pretty. i happened to find it in a store, and i’m glad i did bc i never would have ordered it online bc it’s petite. i bought a petite and am 5’8”. i love the length on me- hits just a little below the knee. would definitely be a true midi on someone who is truly petite. | 1 |
| 2 | I had such high hopes for this dress and really wanted it to work for me. i initially ordered the petite small (my usual size) but i found this to be outrageously small. so small in fact that i could not zip it up! i reordered it in petite medium, which was just ok. overall, the top half was comfortable and fit nicely, but the bottom half had a very tight under layer and several somewhat cheap (net) over layers. imo, a major design flaw was the net over layer sewn directly into the zipper - it c | 0 |
The classification analysis proceeds as follows.
1. Preprocessing: data cleaning, stopword removal, tokenization, stemming, building the document-term matrix (DTM), and splitting the data into 80% training and 20% testing sets.
2. Building the model and generating predictions.
3. Building the classification table (confusion matrix).
4. Computing the model accuracy.
Data cleaning removes usernames, hashtags, links, numbers, punctuation, and repeated whitespace, and converts the text to lower case. The cleaned reviews are stored in a new variable named text_clean.
# remove usernames
df_review$text_clean <- str_replace_all(df_review$Review.Text,
pattern=or("@\\w*: ","@\\w*"),
replacement = "")
# remove hashtags
df_review$text_clean <- str_replace_all(df_review$text_clean,
pattern="#\\w*",
replacement = "")
# remove links (note: this pattern also removes everything after the link)
df_review$text_clean <- str_replace_all(df_review$text_clean,
pattern=or("https:.*","http:.*"),
replacement = "")
# remove numbers (and any word characters attached to them)
df_review$text_clean <- str_replace_all(df_review$text_clean,
pattern="\\d+\\w*",
replacement = "")
# remove punctuation
df_review$text_clean <- str_replace_all(df_review$text_clean,
pattern="[^[:alnum:][:space:]]",
replacement = "")
# collapse repeated whitespace
df_review$text_clean <- str_squish(df_review$text_clean)
# convert to lower case
df_review$text_clean <- str_to_lower(df_review$text_clean)
kable(head(df_review,3),
caption ="<center>Table 2. Cleaned Data</center>",
format = "html", align = 'cccc') %>%
kable_styling(bootstrap_options = "bordered", full_width = FALSE)
| X | Review.Text | Recommended.IND | text_clean |
|---|---|---|---|
| 0 | Absolutely wonderful - silky and sexy and comfortable | 1 | absolutely wonderful silky and sexy and comfortable |
| 1 | Love this dress! it’s sooo pretty. i happened to find it in a store, and i’m glad i did bc i never would have ordered it online bc it’s petite. i bought a petite and am 5’8”. i love the length on me- hits just a little below the knee. would definitely be a true midi on someone who is truly petite. | 1 | love this dress its sooo pretty i happened to find it in a store and im glad i did bc i never would have ordered it online bc its petite i bought a petite and am i love the length on me hits just a little below the knee would definitely be a true midi on someone who is truly petite |
| 2 | I had such high hopes for this dress and really wanted it to work for me. i initially ordered the petite small (my usual size) but i found this to be outrageously small. so small in fact that i could not zip it up! i reordered it in petite medium, which was just ok. overall, the top half was comfortable and fit nicely, but the bottom half had a very tight under layer and several somewhat cheap (net) over layers. imo, a major design flaw was the net over layer sewn directly into the zipper - it c | 0 | i had such high hopes for this dress and really wanted it to work for me i initially ordered the petite small my usual size but i found this to be outrageously small so small in fact that i could not zip it up i reordered it in petite medium which was just ok overall the top half was comfortable and fit nicely but the bottom half had a very tight under layer and several somewhat cheap net over layers imo a major design flaw was the net over layer sewn directly into the zipper it c |
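The cleaning steps can be traced on a single string. The following is a minimal base-R sketch, not the code used in the analysis: `clean_text` and the sample sentence are hypothetical, and the link pattern here removes only the URL itself, which is an assumption that differs from the `http:.*` pattern above.

```r
# Hypothetical helper that applies the cleaning steps in the same order,
# using base R's gsub() instead of stringr.
clean_text <- function(x) {
  x <- gsub("@\\w*", "", x, perl = TRUE)                    # usernames
  x <- gsub("#\\w*", "", x, perl = TRUE)                    # hashtags
  x <- gsub("https?:\\S*", "", x, perl = TRUE)              # links (URL only; assumption)
  x <- gsub("\\d+\\w*", "", x, perl = TRUE)                 # numbers
  x <- gsub("[^[:alnum:][:space:]]", "", x, perl = TRUE)    # punctuation
  x <- gsub("\\s+", " ", trimws(x), perl = TRUE)            # repeated whitespace
  tolower(x)                                                # lower case
}

# Made-up example sentence, not taken from the dataset
raw <- "Love it @shop! See http://example.com #dress, size 12 fits   GREAT."
cleaned <- clean_text(raw)
print(cleaned)
```

Wrapping the steps in one function makes it easier to apply the same cleaning to new reviews at prediction time.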
Stopword removal drops words that carry little meaning in a sentence. The stopword list is read from a separate file containing both Indonesian and English words (stopwords_ind_eng). After stopword removal, the sentences are stored in a new variable named text_clean2.
stopwords_ind_eng <- readLines("D:/Semester 6/MDTT/stop_words_ind_eng.txt")
## Warning in readLines("D:/Semester 6/MDTT/stop_words_ind_eng.txt"): incomplete
## final line found on 'D:/Semester 6/MDTT/stop_words_ind_eng.txt'
df_tokens <- tokenize_words(df_review$text_clean, stopwords = stopwords_ind_eng)
# Collapse each review's remaining tokens back into a single string
clean_word <- vapply(df_tokens, paste, character(1), collapse = " ")
df_review$text_clean2 <- clean_word
kable(head(df_review[,c(1,3,4,5)],3),
caption ="<center>Table 3. Cleaned Data After Stopword Removal</center>",
format = "html", align = 'cccc') %>%
kable_styling(bootstrap_options = "bordered", full_width = FALSE)
| X | Recommended.IND | text_clean | text_clean2 |
|---|---|---|---|
| 0 | 1 | absolutely wonderful silky and sexy and comfortable | absolutely wonderful silky sexy comfortable |
| 1 | 1 | love this dress its sooo pretty i happened to find it in a store and im glad i did bc i never would have ordered it online bc its petite i bought a petite and am i love the length on me hits just a little below the knee would definitely be a true midi on someone who is truly petite | love dress sooo pretty happened store im glad bc online bc petite bought petite love length hits knee true midi petite |
| 2 | 0 | i had such high hopes for this dress and really wanted it to work for me i initially ordered the petite small my usual size but i found this to be outrageously small so small in fact that i could not zip it up i reordered it in petite medium which was just ok overall the top half was comfortable and fit nicely but the bottom half had a very tight under layer and several somewhat cheap net over layers imo a major design flaw was the net over layer sewn directly into the zipper it c | hopes dress initially petite usual size found outrageously zip reordered petite medium top half comfortable fit nicely bottom half tight layer cheap net layers imo major design flaw net layer sewn directly zipper |
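As a small illustration of stopword removal, the sketch below uses base R only. The list `stop_list` is a hypothetical stand-in for the much larger stopword file, and the second review is an empty string to show how reviews with no text behave.

```r
# Hypothetical mini stopword list (the real file is much larger)
stop_list <- c("and", "this", "its", "i")

# First review is row 0 of text_clean; second is an empty review
reviews <- c("absolutely wonderful silky and sexy and comfortable", "")

# Split into words, drop stopwords, then paste the rest back together
tokens <- strsplit(reviews, "\\s+")
kept <- lapply(tokens, function(w) w[!(w %in% stop_list)])
text_no_stop <- vapply(kept, paste, character(1), collapse = " ")
print(text_no_stop)
```

An empty review stays an empty string, so the number of rows is unchanged by this step.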
Tokenization splits each sentence into individual words, while stemming reduces each word to its root form. Tokenization uses the unnest_tokens function and stemming uses wordStem.
df_token <- df_review %>%
unnest_tokens(output="word", token = "words", input = text_clean2) %>%
mutate(word = wordStem(word))
kable(head(df_token[,c(1,3,4,5)],3),
caption ="<center>Table 4. Tokenized Data</center>",
format = "html", align = 'cccc') %>%
kable_styling(bootstrap_options = "bordered", full_width = FALSE)
| X | Recommended.IND | text_clean | word |
|---|---|---|---|
| 0 | 1 | absolutely wonderful silky and sexy and comfortable | absolut |
| 0 | 1 | absolutely wonderful silky and sexy and comfortable | wonder |
| 0 | 1 | absolutely wonderful silky and sexy and comfortable | silki |
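The one-row-per-token layout shown in the table can be reproduced with base R. This is only a sketch of the reshaping idea on a made-up two-row data frame. It does not stem the words, since base R has no stemmer, and `tokens_long` is a hypothetical name.

```r
# Made-up miniature of the review table
reviews <- data.frame(
  X = c(0, 1),
  text_clean2 = c("absolutely wonderful silky", "love dress"),
  stringsAsFactors = FALSE
)

# Split each review into words
words <- strsplit(reviews$text_clean2, " ", fixed = TRUE)

# Repeat each ID once per word so every token keeps its document ID
tokens_long <- data.frame(
  X = rep(reviews$X, lengths(words)),
  word = unlist(words),
  stringsAsFactors = FALSE
)
print(tokens_long)
```

Keeping the ID next to every token is what later allows the tokens to be counted per document.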
The DTM assigns a weight to each tokenized word using TF-IDF. Term Frequency-Inverse Document Frequency (TF-IDF) is a commonly used method for computing the weight of each word: a word gets a high weight when it is frequent in a document but rare across the collection. After the DTM is built, removeSparseTerms is applied to drop terms (columns) whose entries are zero in at least 95% of the documents.
dtm <- df_token %>%
count(X, word) %>%
cast_dtm(document = X, term = word,
value = n, weighting = weightTfIdf)
dtm1 <- removeSparseTerms(dtm, sparse = 0.95)
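To make the weighting concrete, here is a base-R sketch on a made-up 3 x 3 count matrix. It assumes term frequency normalized by document length and a base-2 logarithm for the inverse document frequency, which is how I understand tm's weightTfIdf to behave by default. Treat that as an assumption and check the tm documentation. The 0.5 sparsity threshold is chosen only so that the toy example drops a term.

```r
# Made-up counts: rows are documents, columns are terms
counts <- matrix(c(2, 0, 1,
                   1, 1, 0,
                   0, 3, 0),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(doc = c("d1", "d2", "d3"),
                                 term = c("dress", "fit", "love")))

tf  <- counts / rowSums(counts)        # term frequency within each document
df  <- colSums(counts > 0)             # number of documents containing each term
idf <- log2(nrow(counts) / df)         # inverse document frequency (base 2; assumption)
tfidf <- sweep(tf, 2, idf, `*`)        # multiply each column by its idf

# Sparsity = share of documents in which the term does not occur
sparsity <- colMeans(counts == 0)
keep <- sparsity < 0.5                 # toy threshold; the analysis uses 0.95
tfidf_reduced <- tfidf[, keep, drop = FALSE]
print(round(tfidf_reduced, 3))
```

In this toy matrix "love" occurs in only one of three documents, so it has the highest idf but is also the term removed by the sparsity filter.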
This study uses 80% of the data for training and 20% for testing. The set.seed function is called so that the random sample is reproducible.
sample_size <- floor(0.8*nrow(dtm1))
set.seed(111)
train_ind <- sample(nrow(dtm1),
size = sample_size)
train <- dtm1[train_ind,]
test <- dtm1[-train_ind,]
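The split logic can be checked on a small scale. The sketch below uses a hypothetical data size of 10 rows and confirms that the training and testing indices do not overlap and that the same seed yields the same sample.

```r
n <- 10                                    # hypothetical number of documents

set.seed(111)
train_ind <- sample(n, size = floor(0.8 * n))
test_ind  <- setdiff(seq_len(n), train_ind)   # same rows as negative indexing with -train_ind

# Re-drawing with the same seed gives the same indices
set.seed(111)
train_again <- sample(n, size = floor(0.8 * n))

print(train_ind)
print(test_ind)
```

Because the labels are later selected with the same positional indices, the rows of the DTM must be in the same order as the rows of the label vector.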
The classification model is built with randomForest, using Recommended.IND as the response (Y) and the training portion of the DTM as the predictors (X), with ntree set to 1000.
train_rf <- randomForest(x = as.data.frame(as.matrix(train)),
y = as.factor(df_review$Recommended.IND[train_ind]), ntree = 1000)
pred_rf <- predict(train_rf, as.data.frame(as.matrix(test)))
The classification table (confusion matrix) shows how many observations were predicted correctly and incorrectly.
# Build the confusion matrix (rows: actual class, columns: predicted class)
c_matrix <- table(Rekomendasi = as.factor(df_review$Recommended.IND[-train_ind]), prediksi = pred_rf)
c_matrix
## prediksi
## Rekomendasi 0 1
## 0 203 607
## 1 112 3776
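The table above can also be read per class. The sketch below re-enters the reported counts by hand and computes recall and precision for each class. These per-class measures are an addition for illustration and are not part of the original analysis.

```r
# Confusion matrix copied from the output above
# (rows: actual class, columns: predicted class)
c_matrix <- matrix(c(203, 607,
                     112, 3776),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Rekomendasi = c("0", "1"),
                                   prediksi = c("0", "1")))

recall    <- diag(c_matrix) / rowSums(c_matrix)   # correct / actual, per class
precision <- diag(c_matrix) / colSums(c_matrix)   # correct / predicted, per class

print(round(recall, 4))
print(round(precision, 4))
```

Recall for class 0 is about 0.25, against about 0.97 for class 1, so the model identifies recommended reviews far more reliably than not-recommended ones.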
Model accuracy is the proportion of predictions that match the actual labels.
\(\text{accuracy} = \dfrac{n_{11} + n_{22}}{\sum_{i}\sum_{j} n_{ij}}\)
# Compute accuracy
sum(diag(c_matrix))/sum(c_matrix)
## [1] 0.8469562
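For context, the accuracy can be compared with a simple majority-class baseline, meaning the accuracy obtained by always predicting the more frequent class. The sketch below recomputes both from the reported counts. The baseline comparison is an addition for illustration and is not part of the original analysis.

```r
# Counts copied from the confusion matrix (rows: actual, columns: predicted)
c_matrix <- matrix(c(203, 607,
                     112, 3776),
                   nrow = 2, byrow = TRUE)

# (203 + 3776) / 4698
accuracy <- sum(diag(c_matrix)) / sum(c_matrix)

# Always predicting the larger actual class: 3888 / 4698
baseline <- max(rowSums(c_matrix)) / sum(c_matrix)

print(round(c(accuracy = accuracy, baseline = baseline), 4))
```

The model's accuracy of about 0.847 is roughly two percentage points above the baseline of about 0.828, which reflects how imbalanced the two classes are in the test set.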