## Loading required package: NLP
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.2.1 v purrr 0.3.3
## v tibble 2.1.3 v stringr 1.4.0
## v tidyr 1.0.0 v forcats 0.4.0
## v readr 1.3.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x ggplot2::annotate() masks NLP::annotate()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## Loading required package: usethis
## Loading required package: RColorBrewer
The data were obtained from the Indonesian Institute of Sciences (LIPI). Some cleaning, such as removing missing values, had already been done. The data consist of two data sets, one with positive and one with negative sentiment. The aim of this analysis is to detect sentiment polarity in Indonesian user-generated text.
The two data sets are read in and combined into one data frame:
neg <- read.csv("data_input/olshop_negative.csv", sep = "|")
pos <- read.csv("data_input/olshop_positive.csv", sep = "|")
## 'data.frame': 12319 obs. of 4 variables:
## $ no : Factor w/ 12319 levels "100","10003",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ title : Factor w/ 8942 levels "'' Bukalapak adalah Jaminan Kepuasan Pelanggan\"",..: 4859 619 3387 3858 5970 4307 5391 4338 3173 4800 ...
## $ text : Factor w/ 11934 levels "-","- rate is always the cheapest - point is good - straightforward term and condition - many easy ways of payment "| __truncated__,..: 7630 811 4065 652 9071 3372 6800 8895 3430 6840 ...
## $ senti_value: Factor w/ 7 levels "4","5","senti_value",..: 2 1 2 2 2 1 1 2 2 1 ...
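As a minimal sketch of the combining step, assuming both files share the same columns, the two data frames can be stacked with base R's rbind(). The toy data frames below are hypothetical stand-ins for the real files. The header-row filter is an assumption as well: the "senti_value" level in the output above suggests that a header line may have been read as data.

```r
# Toy stand-ins for the two files (hypothetical values, not the real data)
neg <- data.frame(no = c("1", "2"),
                  title = c("Kecewa", "Lambat"),
                  text = c("barang rusak", "pengiriman lama"),
                  senti_value = c("1", "2"),
                  stringsAsFactors = FALSE)
pos <- data.frame(no = c("no", "3"),
                  title = c("title", "Puas"),
                  text = c("text", "barang bagus"),
                  senti_value = c("senti_value", "5"),
                  stringsAsFactors = FALSE)

# Stack the rows; rbind() requires both data frames to have the same columns
sentimen <- rbind(neg, pos)

# Assumption: drop any repeated header line that was read as a data row
sentimen <- sentimen[sentimen$senti_value != "senti_value", ]

str(sentimen)
```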
no: The record number
title: The heading of the comment
text: The body of the comment, which carries the customer's opinion

We only need the title and text columns, so we take them into sentimen1 and combine them into a new column named comb.
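One way to do this in base R is sketched below. The data frame name sentimen and its values are hypothetical stand-ins for the combined data, and joining title and text with a single space is an assumption based on the sample document printed later.

```r
# Toy stand-in for the combined data (hypothetical values)
sentimen <- data.frame(
  no = c("1", "2"),
  title = c("Puas", "Kecewa"),
  text = c("barang bagus", "barang rusak"),
  senti_value = c("5", "1"),
  stringsAsFactors = FALSE
)

# Keep only the title and text columns
sentimen1 <- sentimen[, c("title", "text")]

# Join title and text with a single space into the new column comb
sentimen1$comb <- paste(sentimen1$title, sentimen1$text)

sentimen1$comb
```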
Check for missing values in the data set:
## title text comb
## 0 0 0
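A per-column count like the one above can be produced with colSums(is.na(...)). The sketch below uses a hypothetical toy data frame with one missing title. Note that paste() turns NA into the string "NA", so a column built with paste() will report zero missing values even when its inputs had gaps; checking the source columns is the more reliable test.

```r
# Toy data with one missing title (hypothetical values)
sentimen1 <- data.frame(
  title = c("Puas", NA),
  text = c("barang bagus", "barang rusak"),
  stringsAsFactors = FALSE
)
sentimen1$comb <- paste(sentimen1$title, sentimen1$text)

# Count missing values per column
na_counts <- colSums(is.na(sentimen1))
na_counts

# Keep only rows where the source columns are complete
sentimen1_complete <- sentimen1[complete.cases(sentimen1[, c("title", "text")]), ]
```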
#VCorpus(VectorSource(sentimen12))
sentimen1_corpus <- sentimen1 %>%
pull(comb) %>%
VectorSource() %>%
VCorpus()
## [1] "Pesan barang di Ebay dlm satu klik Sekarang mau blanja barang apa ajah di Ebay mudah sekali, tidak direpotkan dgn urusan pajak, ongkir, bea cukai dll. prosesnya semudah blanja barang di dlm negeri dgn pilihan pembayaran yg beragam dan aman. Tidak perlu khawatir barang tidak akan sampai atau tersasar. tinggal klik, bayar dan tunggu barang sampai di rumah."
The next main step is transformation of the Corpus, so that the corpus is ready for our analysis. Transformation involves performing the following steps.
- Remove punctuation
- Convert to lower case
- Remove stopwords such as dont, can, etc. using the lexicon available in the tm package
- Replace numbers with words
- Remove brackets
- Remove whitespaces
- Stem document, which involves stemming words into a root form, i.e. words such as “serve”, “service”, “server” are stemmed to a common root word “serv”. Stemming is optional as it might lead to a loss of context. We have used the SnowballC package to perform the stemming.
- Transform the documents to a Term Document Matrix, so that we get a matrix of terms and their frequencies, which can then be converted to a normal matrix for analytical tasks.
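The cleaning steps above can be sketched with tm_map() as follows. This is an illustration under stated assumptions, not the exact pipeline used here: the two documents and the stopword vector stop_id are hypothetical, removeNumbers drops digits rather than replacing them with words, and brackets are removed as part of removePunctuation. Stemming is left out of this sketch.

```r
library(tm)

# Hypothetical two-document corpus standing in for sentimen1_corpus
docs <- c("Pesan barang di Ebay, mudah sekali!! 123",
          "Good job (belanja) di Blibli.com")
corpus <- VCorpus(VectorSource(docs))

# Assumption: a tiny hand-made Indonesian stopword list for illustration
stop_id <- c("di", "yg", "dan")

cleaned <- corpus
cleaned <- tm_map(cleaned, removePunctuation)             # also strips brackets
cleaned <- tm_map(cleaned, content_transformer(tolower))  # lower case
cleaned <- tm_map(cleaned, removeNumbers)                 # drops digits (swapped in for "replace numbers with words")
cleaned <- tm_map(cleaned, removeWords, stop_id)          # remove stopwords
cleaned <- tm_map(cleaned, stripWhitespace)               # collapse runs of whitespace
cleaned <- tm_map(cleaned, content_transformer(trimws))   # trim leading and trailing spaces

# Extract the cleaned text of each document as a character vector
out <- vapply(seq_along(cleaned),
              function(i) content(cleaned[[i]]),
              character(1))
out
```

Wrapping base functions such as tolower in content_transformer() keeps each element a text document, which later steps such as DocumentTermMatrix() expect.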
stemming_bahasa <- content_transformer(function(x){
paste(sapply(words(x),katadasar),collapse = " ")
})
sen_t12 <- tm_map(sen_t1, stemming_bahasa)
## [1] "pesan barang ebay dlm klik blanja barang ajah ebay mudah repot dgn urus pajak ongkir bea cukai dll proses mudah blanja barang dlm neger dgn pilih bayar yg agam aman khawatir barang sasar tinggal klik bayar tunggu barang rumah"
## [1] "good job belanja bliblicom five star abissss"
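# Stem the documents with tm's stemDocument (backed by SnowballC)
clean_corpus <- tm_map(sen_t12, stemDocument)
# Create a document-term matrix (documents in rows, terms in columns)
clean_dtm <- DocumentTermMatrix(clean_corpus)
# Convert the document-term matrix to a regular matrix for analysis
clean_m <- as.matrix(clean_dtm)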
We can build a word cloud
## 11822 10342 11782 11809 11813 11915
## 347 278 271 266 266 260
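A word cloud needs one total count per term. With documents in rows and terms in columns, those totals are the column sums of the matrix, while row sums give the number of terms per document. The sketch below uses a hypothetical toy matrix in place of clean_m. The wordcloud() call assumes the wordcloud package is installed and is skipped otherwise.

```r
# Toy document-term matrix (hypothetical counts): rows are documents, columns are terms
clean_m <- matrix(c(2, 0, 1,
                    1, 3, 0,
                    0, 1, 1),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(c("1", "2", "3"),
                                  c("barang", "mudah", "bayar")))

# Total count per term, most frequent first
term_freq <- sort(colSums(clean_m), decreasing = TRUE)
term_freq

# Total number of terms per document, for comparison
doc_length <- rowSums(clean_m)

# Draw the word cloud only if the wordcloud package is available
if (requireNamespace("wordcloud", quietly = TRUE)) {
  wordcloud::wordcloud(words = names(term_freq),
                       freq = term_freq,
                       min.freq = 1)
}
```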
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 3