Learn by Building C2: Sentiment Labelled Sentences

Problem Statement

In this exercise we build models to predict the sentiment label (positive or negative) of reviews in the IMDb Movie Review dataset, obtained from Kaggle:
https://www.kaggle.com/marklvl/sentiment-labelled-sentences-data-set

The models used are Naive Bayes and Random Forest, and they are compared on accuracy. Accuracy is chosen because the model should predict both positive and negative reviews correctly.

Data Preprocessing

Reading the data and inspecting its structure

imdb <- read.delim("imdb_labelled.txt", header = FALSE)
str(imdb)

## 'data.frame':    528 obs. of  2 variables:
##  $ V1: Factor w/ 527 levels " But Storm Trooper is not even bad enough to make it to the list of wonderfully terrible movies.  \t0\nIt's jus"| __truncated__,..: 22 295 51 502 352 402 505 318 12 280 ...
##  $ V2: int  0 0 0 0 1 0 0 1 0 1 ...
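The factor levels above contain embedded "\t0\n" sequences, which suggests that double quotes inside some reviews caused several lines of the file to be read as a single record. The sketch below reproduces that effect on three made-up lines (not taken from the dataset) and shows that passing quote = "" to read.delim keeps one row per line. Whether this changes the row count for imdb_labelled.txt is an assumption that would have to be checked by rerunning the import; all outputs in this report come from the call as written above.

```r
# Three made-up lines in "text<TAB>label" form; two of them contain a lone double quote.
raw_lines <- c("Good \"movie.\t1", "Bad film.\t0", "He said \"no.\t0")
raw_text <- paste(raw_lines, collapse = "\n")

# Default quote handling: tabs and newlines between the two quote
# characters are kept inside one field instead of splitting records.
default_read <- read.delim(text = raw_text, header = FALSE)

# quote = "" disables quote handling, so every line becomes its own row.
literal_read <- read.delim(text = raw_text, header = FALSE, quote = "")

nrow(default_read)
nrow(literal_read)
```

Comparing the two row counts on the real file is a quick way to see whether any reviews were merged.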

Checking for missing values

anyNA(imdb)

## [1] FALSE

Feature Selection and Feature Engineering

Column V1 is converted from factor to character and column V2 from integer to factor (levels: 1 = positif, 0 = negatif). Both variables are renamed in the same step.

library(dplyr)  # provides %>%, select() and mutate()
imdb <- imdb %>% 
  select(text = V1, sentiment = V2) %>% 
  mutate(text = as.character(text), 
         sentiment = factor(sentiment, levels = c(1,0), labels = c("positif", "negatif") ))
imdb[sample(nrow(imdb), 10),"text"]

##  [1] "But in terms of the writing it's very fresh and bold.  "                                                                         
##  [2] "Frankly, after Cotton club and Unfaithful, it was kind of embarrassing to watch Lane and Gere in this film, because it is BAD.  "
##  [3] "Wasted two hours.  "                                                                                                             
##  [4] "This movie now joins Revenge of the Boogeyman and Zombiez as part of the hellish trinity of horror films.  "                     
##  [5] "It is a true classic.  "                                                                                                         
##  [6] "I loved it, it was really scary.  "                                                                                              
##  [7] "Alexander Nevsky is a great film.  "                                                                                             
##  [8] "It was so BORING!  "                                                                                                             
##  [9] "The film is well paced, understated and one of the best courtroom documentaries I've seen.  "                                    
## [10] "So I am here to warn you--DO NOT RENT THIS MOVIE, it is the dumbest thing you have never seen!  "

Modelling

Naive Bayes Model - Cross Validation

We use the VCorpus function from the tm library to build a corpus from the dataset, and $content to view the content of each document.

library(tm)
imdb.corpus <- VCorpus(VectorSource(imdb$text))
imdb.corpus[[1]]$content

## [1] "A very, very, very slow-moving, aimless movie about a distressed, drifting young man.  "

imdb.corpus[[2]]$content

## [1] "Not sure who was more lost - the flat characters or the audience, nearly half of whom walked out.  "

Next, the content of each document is transformed so that it can be used for modelling.
First, convert uppercase letters to lowercase.

imdb.corpus <- tm_map(imdb.corpus, content_transformer(tolower))
imdb.corpus[[1]]$content

## [1] "a very, very, very slow-moving, aimless movie about a distressed, drifting young man.  "

imdb.corpus[[2]]$content

## [1] "not sure who was more lost - the flat characters or the audience, nearly half of whom walked out.  "

Second, remove numbers, punctuation, and words that carry little information for modelling (stop words), such as “am”, “or”, and “if”.

imdb.corpus <- tm_map(imdb.corpus, removeNumbers)
imdb.corpus <- tm_map(imdb.corpus, removePunctuation)
imdb.corpus <- tm_map(imdb.corpus, removeWords, stopwords("english"))
imdb.corpus[[1]]$content

## [1] "    slowmoving aimless movie   distressed drifting young man  "

imdb.corpus[[2]]$content

## [1] " sure    lost   flat characters   audience nearly half   walked   "

Third, collapse the extra whitespace left between words.

imdb.corpus <- tm_map(imdb.corpus, stripWhitespace)
imdb.corpus[[1]]$content

## [1] " slowmoving aimless movie distressed drifting young man "

imdb.corpus[[2]]$content

## [1] " sure lost flat characters audience nearly half walked "
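The first three cleaning steps can be approximated in base R, which makes it easier to see why "slow-moving" ends up as "slowmoving": punctuation is deleted rather than replaced by a space. This is a minimal sketch, not the tm implementation. The function name clean_text and the three-word stop list are hypothetical; the report itself uses stopwords("english").

```r
# Approximate the lower-casing, number/punctuation removal, stop-word removal
# and whitespace steps with base R regular expressions.
clean_text <- function(x, stop_words) {
  x <- tolower(x)
  x <- gsub("[[:digit:]]+", "", x)   # drop numbers
  x <- gsub("[[:punct:]]+", "", x)   # drop punctuation (no space is inserted)
  # Remove whole words only, using word boundaries.
  pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")
  x <- gsub(pattern, "", x, perl = TRUE)
  x <- gsub("[[:space:]]+", " ", x)  # collapse runs of whitespace
  trimws(x)                          # extra step: also trim both ends
}

# Hypothetical, tiny stop-word list for this one sentence.
stop_words <- c("a", "very", "about")

cleaned <- clean_text(
  "A very, very, very slow-moving, aimless movie about a distressed, drifting young man.  ",
  stop_words
)
cleaned
```

Unlike stripWhitespace, this sketch also trims leading and trailing spaces, so its result has no surrounding blanks.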

Fourth, reduce every word to its stem.

library(SnowballC)
imdb.corpus <- tm_map(imdb.corpus, stemDocument)
imdb.corpus[[1]]$content

## [1] "slowmov aimless movi distress drift young man"

imdb.corpus[[2]]$content

## [1] "sure lost flat charact audienc near half walk"

The last step is to split the text into individual terms (tokenization) and count them in a document-term matrix.

imdb.dtm <- DocumentTermMatrix(imdb.corpus)
inspect(imdb.dtm)

## <<DocumentTermMatrix (documents: 528, terms: 2423)>>
## Non-/sparse entries: 5577/1273767
## Sparsity           : 100%
## Maximal term length: 28
## Weighting          : term frequency (tf)
## Sample             :
##      Terms
## Docs  bad charact film good just like movi one time watch
##   196   3       1    4    0    1    1    6   2    0     0
##   197   8       7   22    4    7    4   23  11    3     6
##   222  29      17   60   17   21   19   61  17   15    19
##   226   0       0    0    0    0    0    0   2    0     0
##   301   0       0    1    1    0    1    0   0    0     0
##   344   1       0    0    0    0    2    0   0    0     0
##   345   0       0    1    0    1    0    0   0    0     0
##   360   0       1    0    0    1    0    0   0    0     0
##   367   0       0    1    0    0    0    0   0    0     0
##   373   1       0    1    0    1    0    5   4    1     0
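To make the structure of a document-term matrix concrete, here is a small base R sketch that builds one by hand for two made-up, already-stemmed documents. It only illustrates the idea of one row per document and one column per term; DocumentTermMatrix stores the same kind of counts in a sparse format, and the object names used here are hypothetical.

```r
# Two made-up documents, already lower-cased and stemmed.
docs <- c(doc1 = "good movi", doc2 = "bad movi bad")

# Split each document into tokens on single spaces.
tokens <- strsplit(docs, " ", fixed = TRUE)

# The vocabulary is the sorted set of distinct tokens.
vocab <- sort(unique(unlist(tokens)))

# Count how often each vocabulary term occurs in each document.
counts <- vapply(
  tokens,
  function(tok) as.integer(table(factor(tok, levels = vocab))),
  integer(length(vocab))
)

# vapply puts terms in rows; transpose so that rows are documents.
toy_dtm <- t(counts)
dimnames(toy_dtm) <- list(Docs = names(docs), Terms = vocab)
toy_dtm
```

Most cells of a real matrix are zero, which is why inspect() reports a sparsity close to 100%.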

Split the data into train (70%) and test (30%) sets, together with the sentiment labels to be predicted.

set.seed(100)
idx <- sample(nrow(imdb.dtm), nrow(imdb.dtm)*0.7)
train.imdb.dtm <- imdb.dtm[idx,]
test.imdb.dtm <- imdb.dtm[-idx,]

train.imdb.sentiment <- imdb[idx,2]
test.imdb.sentiment <- imdb[-idx,2]

Check the proportion of each sentiment in the train and test data.

prop.table(table(train.imdb.sentiment))

## train.imdb.sentiment
##   positif   negatif 
## 0.5203252 0.4796748

prop.table(table(test.imdb.sentiment))

## test.imdb.sentiment
##   positif   negatif 
## 0.5157233 0.4842767

To reduce the number of terms and make modelling lighter, only words that occur at least 5 times in the corpus are kept (5 occurrences correspond to roughly 1% of the 528 documents).

set.seed(100)
term.freq <- findFreqTerms(imdb.dtm,5)
train.imdb.dtm <- train.imdb.dtm[,term.freq]
test.imdb.dtm <- test.imdb.dtm[,term.freq]

Before modelling, the term counts in the train and test matrices are converted to a two-level factor (0 and 1).
The levels mean:
1. 0: the word does not occur in the document (count of 0).
2. 1: the word occurs at least once in the document (count of 1 or more).

# Convert a vector of term counts to a presence indicator (0 = absent, 1 = present).
bernoulli_conv <- function(x){
        as.factor(as.numeric(x > 0))
}

# Apply the conversion column by column (MARGIN = 2), one column per term.
train.imdb.dtm.bn <- apply(train.imdb.dtm, 2, bernoulli_conv)
test.imdb.dtm.bn <- apply(test.imdb.dtm, 2, bernoulli_conv)
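The effect of the conversion is easiest to see on a tiny matrix. The sketch below applies the same rule (any positive count becomes 1) to made-up counts; it leaves out the factor conversion and is only meant to show what information is kept.

```r
# Made-up term counts: rows are documents, columns are terms.
counts <- matrix(c(0, 2, 1, 0, 0, 3), nrow = 2,
                 dimnames = list(c("doc1", "doc2"), c("bad", "film", "good")))

# Same rule as bernoulli_conv: any positive count becomes 1, zero stays 0.
presence <- (counts > 0) * 1L
presence
```

Here doc2 has counts 2, 0, 3 and becomes 1, 0, 1, so a word repeated within one review carries no extra weight in the model.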

Naive Bayes Model - Prediction and Model Evaluation

library(e1071)  # provides naiveBayes()
model.nb <- naiveBayes(train.imdb.dtm.bn, train.imdb.sentiment)
pred <- predict(model.nb, test.imdb.dtm.bn)

The Naive Bayes model reaches an accuracy of 65.4% on the test data.

table(prediction = pred, actual= test.imdb.sentiment)

##           actual
## prediction positif negatif
##    positif      54      27
##    negatif      28      50

sum(pred == test.imdb.sentiment)/length(test.imdb.sentiment)*100

## [1] 65.40881

Next, we set laplace = 1 and look at its effect on the model.

model.nb.lap <- naiveBayes(train.imdb.dtm.bn, train.imdb.sentiment, laplace = 1)
pred <- predict(model.nb.lap, test.imdb.dtm.bn)

The accuracy of the Naive Bayes model with laplace = 1 is 69.2%.

table(prediction = pred, actual= test.imdb.sentiment)

##           actual
## prediction positif negatif
##    positif      47      14
##    negatif      35      63

sum(pred == test.imdb.sentiment)/length(test.imdb.sentiment)*100

## [1] 69.18239

With laplace = 1, model accuracy rises from 65.4% to 69.2%.
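A plausible reason for the improvement: without smoothing, a word that never appears in one class of the training data gets a conditional probability of 0 for that class, and that zero wipes out the whole product for any test review containing the word. The sketch below shows the standard additive (Laplace) estimate for a two-level feature. The helper name smoothed_prob is hypothetical and the word count is made up; the class size of 177 is the number of negatif training reviews implied by the split above (0.4797 of 369 rows). The formula is assumed to match what naiveBayes does for categorical predictors.

```r
# Additive (Laplace) smoothing for a categorical feature with n_levels levels.
smoothed_prob <- function(count, n_class, laplace = 0, n_levels = 2) {
  (count + laplace) / (n_class + laplace * n_levels)
}

n_negatif <- 177   # negatif reviews in the training split
word_count <- 0    # made-up: the word never appears in a negatif training review

p_raw <- smoothed_prob(word_count, n_negatif)                   # unsmoothed estimate
p_smooth <- smoothed_prob(word_count, n_negatif, laplace = 1)   # smoothed estimate
c(raw = p_raw, smoothed = p_smooth)
```

With laplace = 1 the estimate becomes 1/179 instead of 0, so a single unseen word can no longer rule out a class on its own.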

Random Forest Model - Cross Validation

First, the document-term matrix is converted to a data frame and the sentiment variable is added, so that it can be used by Random Forest.
As with Naive Bayes, to keep the model light only words that occur at least 5 times in the corpus are used (roughly 1% of the 528 documents).

set.seed(100)
term.freq <- findFreqTerms(imdb.dtm,5)
imdb.dtm <- imdb.dtm[,term.freq]
imdb.rf <- as.data.frame(as.matrix(imdb.dtm))
imdb.rf$sentiment <- imdb$sentiment

Split the data into train and test sets.

set.seed(100)
idx <- sample(nrow(imdb.rf), nrow(imdb.rf)*0.7)
train <- imdb.rf[idx,]
test <- imdb.rf[-idx,]

Random Forest Model - Prediction and Model Evaluation

library(randomForest)  # provides randomForest()
model.rf <- randomForest(sentiment ~ ., train)
model.rf

## 
## Call:
##  randomForest(formula = sentiment ~ ., data = train) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 18
## 
##         OOB estimate of  error rate: 30.62%
## Confusion matrix:
##         positif negatif class.error
## positif     133      59   0.3072917
## negatif      54     123   0.3050847
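The class.error column and the OOB error rate can both be recomputed from the confusion matrix printed above, which is a useful check on how to read it (rows are actual classes, columns are predicted classes). The numbers below are copied from that output; the object names are hypothetical.

```r
# OOB confusion matrix from the output above (rows = actual, columns = predicted).
oob <- matrix(c(133, 54, 59, 123), nrow = 2,
              dimnames = list(actual = c("positif", "negatif"),
                              predicted = c("positif", "negatif")))

# Per-class error: misclassified cases divided by the row total.
class_error <- 1 - diag(oob) / rowSums(oob)

# Overall OOB error: all off-diagonal cases divided by the number of training rows.
oob_error <- 1 - sum(diag(oob)) / sum(oob)

round(class_error, 7)
round(100 * oob_error, 2)
```

The row totals (192 positif, 177 negatif) add up to the 369 training rows, and 113 misclassified rows out of 369 give the reported 30.62%.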

The Random Forest model reaches an accuracy of 64.8% on the test data.

pred <- predict(model.rf, test)
table(prediction = pred, actual= test$sentiment)

##           actual
## prediction positif negatif
##    positif      55      29
##    negatif      27      48

sum(pred == test$sentiment)/length(test$sentiment)*100

## [1] 64.77987

Several values of mtry are tried next: a small value (mtry = 2), about half the number of predictors (mtry = 333/2 ~ 167), and all predictors (mtry = 333).
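For reference, the value of 18 reported by the first model is the default for classification in randomForest, floor(sqrt(p)), where p is the number of predictors. The sketch below reproduces that default and the three candidate values; the variable names are hypothetical.

```r
p <- 333  # number of predictor terms after frequency filtering

# Default mtry for classification in randomForest: floor(sqrt(p)).
default_mtry <- floor(sqrt(p))

# Candidate values tried in this section: small, about half, and all predictors.
mtry_candidates <- c(2, ceiling(p / 2), p)

default_mtry
mtry_candidates
```

With mtry = p every split considers all predictors, so the forest reduces to bagged trees and the individual trees become more similar to each other.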

mtry = 2

model.rf2<- randomForest(sentiment ~ ., train, mtry = 2)
model.rf2

## 
## Call:
##  randomForest(formula = sentiment ~ ., data = train, mtry = 2) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 48.51%
## Confusion matrix:
##         positif negatif class.error
## positif     188       4  0.02083333
## negatif     175       2  0.98870056

The accuracy of the Random Forest model with mtry = 2 is 51.6%: the model predicts positif for every test review.

pred <- predict(model.rf2, test)
table(prediction = pred, actual= test$sentiment)

##           actual
## prediction positif negatif
##    positif      82      77
##    negatif       0       0

sum(pred == test$sentiment)/length(test$sentiment)*100

## [1] 51.57233

mtry = 167

model.rf3<- randomForest(sentiment ~ ., train, mtry = 167)
model.rf3

## 
## Call:
##  randomForest(formula = sentiment ~ ., data = train, mtry = 167) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 167
## 
##         OOB estimate of  error rate: 32.25%
## Confusion matrix:
##         positif negatif class.error
## positif     134      58   0.3020833
## negatif      61     116   0.3446328

The accuracy of the Random Forest model with mtry = 167 is 59.7%.

pred <- predict(model.rf3, test)
table(prediction = pred, actual= test$sentiment)

##           actual
## prediction positif negatif
##    positif      54      36
##    negatif      28      41

sum(pred == test$sentiment)/length(test$sentiment)*100

## [1] 59.74843

mtry = 333

model.rf4<- randomForest(sentiment ~ ., train, mtry = 333)
model.rf4

## 
## Call:
##  randomForest(formula = sentiment ~ ., data = train, mtry = 333) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 333
## 
##         OOB estimate of  error rate: 34.42%
## Confusion matrix:
##         positif negatif class.error
## positif     128      64   0.3333333
## negatif      63     114   0.3559322

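For mtry = 333 the prediction has to use model.rf4. The output shown below was produced with model.rf3, so it repeats the mtry = 167 table and its 59.7% accuracy; it needs to be regenerated with the call below.

pred <- predict(model.rf4, test)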
table(prediction = pred, actual= test$sentiment)

##           actual
## prediction positif negatif
##    positif      54      36
##    negatif      28      41

sum(pred == test$sentiment)/length(test$sentiment)*100

## [1] 59.74843

Conclusion

The following conclusions can be drawn from the modelling results:
1. Naive Bayes performed better on the test data, with a best accuracy of 69.2%, while the best Random Forest model reached only 64.8%.
2. The best Naive Bayes result was obtained with laplace = 1, which raised accuracy from 65.4% to 69.2%.
3. The best Random Forest result (64.8%) was obtained with the default mtry = 18.