Introduction

We are given reviews of a certain restaurant, written by customers who have dined there in the past. We have to train a machine learning model that classifies new reviews as positive or negative based on these historical reviews. So the data we are going to work with here is in text form. Using text mining techniques we first clean the data as per our use, and then we classify the reviews with a classification algorithm.

Importing the data set :

dataset_original = read.delim('Restaurant_Reviews_new.txt',quote = '', stringsAsFactors = FALSE)
# It is better to have the file in tab-separated format, because people often write commas
# in their reviews, and a csv file would then treat one review as two reviews.
# It is very rare that people enter tabs while writing a review.
class(dataset_original)
## [1] "data.frame"
head(dataset_original)  #this is what the data looks like
dim(dataset_original) 
## [1] 1000    2
#It has 1000 reviews, each marked as 0 or 1 in the Liked column.
names(dataset_original)
## [1] "Review" "Liked"

Cleaning of the reviews :

To train a classification model we have to create a matrix in which each row corresponds to one review and the columns correspond to all the unique words in the entire dataset. The entry in, say, the 1st row and 1st column is the number of times the word of the first column appears in the first review. (A small illustrative example follows right after the corpus is created below.)

# We need a package called as tm for the text cleaning :
# install.packages("tm")
library(tm)
## Loading required package: NLP
corpus = VCorpus(VectorSource(dataset_original$Review))
#This creates a corpus: a collection of documents, one per review.
#The matrix described above will be built from this corpus later.
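To make this concrete, here is a minimal sketch (purely illustrative; toy_corpus is a made-up name, not part of the pipeline) of the matrix tm builds from two tiny reviews:

toy_corpus = VCorpus(VectorSource(c("good food good service", "bad food")))
as.matrix(DocumentTermMatrix(toy_corpus))
# Each row is a review and each column a unique word; the entries are counts:
# review 1 -> good = 2, food = 1, service = 1 ; review 2 -> bad = 1, food = 1.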

Now imagine how many unique words there will be in these 1000 reviews. Very many, which would result in a matrix with a huge number of columns. Hence, we only want to keep the words that have a significant impact on training the model. So, let's begin:

The tm library has a very useful tm_map function. We want all the reviews to be in lower case to avoid counting the same word in different forms. For example, if the word nice appears in reviews as NICE, nice and Nice, we would get 3 columns for the same word in our corpus. To avoid this, and to reduce the number of columns, the following is done:

corpus = tm_map(corpus, content_transformer(tolower))

Also, some people have the habit of writing their phone numbers in a review. A phone number has no correlation with whether the review is positive or negative, so the model cannot gain any significant information from it. It is better to remove numbers.

corpus = tm_map(corpus, removeNumbers)

While writing, we use exclamation marks and quotes to convey what we feel, e.g. "Excellent food!!". The ! marks are not necessary for building the model and would only add extra columns (e.g. "food!!" would be counted separately from "food"). So we remove all such punctuation from our corpus.

corpus = tm_map(corpus, removePunctuation)

Certain words like the, it, I, my, as well as common verbs and connectives, are essential for forming a sentence but carry no importance on their own for training the model. We will remove such words. tm ships with a default list of 174 English stop words.

head(stopwords()) #174 words. We can also add to the list of stop words.
## [1] "i"      "me"     "my"     "myself" "we"     "our"
# For example, people tend to write the restaurant's name in a review.
# Say "Taj" is the restaurant's name; we can add it to the stop words to remove it as well.
# Note that the corpus is already lower-cased, so custom stop words must be lower case too.
stopwords_new = c("taj", stopwords("en"))  #"en" selects the English stop word list.
head(stopwords_new)
## [1] "taj"    "i"      "me"     "my"     "myself" "we"
corpus = tm_map(corpus, removeWords, stopwords_new)

Words have different forms. For example, the word love can appear as loved, lovely, loving, etc. The basic word is "love", and the other forms carry the same original meaning in a different context. So we want to keep only the original form of each word, both to train our model and to decrease the number of columns in our corpus. This is done with the stemDocument function, and the process is called stemming.

#This requires the package "SnowballC".
# install.packages("SnowballC")
library(SnowballC)
corpus = tm_map(corpus, stemDocument)
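As a quick sanity check on the example above, SnowballC's wordStem function (the same Porter stemmer that stemDocument uses) should reduce all three forms to the base word:

wordStem(c("loved", "lovely", "loving"))
## [1] "love" "love" "love"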

We also need to strip the extra whitespace left behind by the removal steps above.

corpus = tm_map(corpus, stripWhitespace)
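It is worth inspecting one review before and after the whole cleaning pipeline; the exact output depends on the data file, so none is shown here:

dataset_original$Review[1]   #raw text of the first review
as.character(corpus[[1]])    #the same review after all cleaning steps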

Creating Bag of words :

dtm = DocumentTermMatrix(corpus)
dtm
## <<DocumentTermMatrix (documents: 1000, terms: 1575)>>
## Non-/sparse entries: 5425/1569575
## Sparsity           : 100%
## Maximal term length: 32
## Weighting          : term frequency (tf)

The matrix is going to be a sparse matrix. The dtm above reports 100% sparsity (after rounding) because most of the entries are zeros: out of 1000 × 1575 = 1,575,000 entries, only 5,425 are non-zero. Many words may have appeared only once in the whole dataset, so we keep only the terms that appear in at least 0.1% of the reviews, i.e. we drop terms whose sparsity exceeds 99.9%. The following function does this:

dtm = removeSparseTerms(dtm, 0.999)
dtm  #sparsity is reduced to 99%
## <<DocumentTermMatrix (documents: 1000, terms: 691)>>
## Non-/sparse entries: 4541/686459
## Sparsity           : 99%
## Maximal term length: 12
## Weighting          : term frequency (tf)
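tm's findFreqTerms gives a quick peek at which terms survived the pruning. Based on the term frequencies reported in the word cloud section below, the terms occurring at least 50 times should be:

findFreqTerms(dtm, lowfreq = 50)
## [1] "back"   "food"   "good"   "great"  "like"   "place"  "servic" "time"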
dataset = as.data.frame(as.matrix(dtm)) #converting to data frame
dim(dataset) 
## [1] 1000  691
class(dataset)
## [1] "data.frame"

1000 reviews as rows and 691 unique words after cleaning of the text will be used to train the model.

Visualising the words with word clouds :

library(wordcloud)
## Loading required package: RColorBrewer
m <- as.matrix(dtm)
v <- sort(colSums(m),decreasing=TRUE)
head(v,14)
##   food  place   good servic  great   back   time   like   will friend 
##    124    112     94     84     70     61     54     51     37     35 
##   just realli   love   best 
##     35     34     33     30
words <- names(v)
d <- data.frame(word=words, freq=v)
wordcloud(d$word,d$freq,min.freq=10,colors=brewer.pal(8, "Dark2"))

The above word cloud clearly shows that "food", "place", "good", "servic" and "great" are the five most frequent words in the reviews.

The frequencies of the 10 most frequent words are plotted :

barplot(d[1:10,]$freq, las = 2, names.arg = d[1:10,]$word,
        col ="lightblue", main ="Most frequent words",
        ylab = "Word frequencies")

Training the model :

We are going to use the Random forest algorithm to train the model for classification of the reviews.

dataset$Liked = dataset_original$Liked
# Encoding the target feature as factor
dataset$Liked = factor(dataset$Liked, levels = c(0, 1))
# Splitting the dataset into the Training set and Test set
# install.packages('caTools')
library(caTools)
set.seed(123)
split = sample.split(dataset$Liked, SplitRatio = 0.8)
training_set = subset(dataset, split == TRUE)  #the model will be trained on this set
test_set = subset(dataset, split == FALSE) #the predictive power of the model will be tested on this set
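sample.split stratifies on the label, so the 0/1 balance is preserved in both sets. Since this dataset turns out to contain 500 reviews of each class (the test-set confusion matrix below sums to 100 per class), the training set should hold 400 of each:

table(training_set$Liked)
##   0   1 
## 400 400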

Fitting Random Forest Classification to the Training set :

# install.packages('randomForest')
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
classifier = randomForest(x = training_set[-692],  #all 691 word columns; column 692 is Liked
                          y = training_set$Liked,
                          ntree = 10)  #10 decision trees are built in the random forest.
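As an optional side check (not needed for prediction), randomForest's varImpPlot shows which words the trees rely on most:

varImpPlot(classifier, n.var = 10, main = "Top 10 words by importance")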

# Predicting the Test set results
y_pred = predict(classifier, newdata = test_set[-692])

#Confusion matrix to check accuracy
cm=table(y_pred,test_set$Liked)
cm
##       
## y_pred  0  1
##      0 75 23
##      1 25 77
accuracy=((cm[1,1]+cm[2,2])/sum(cm))*100
accuracy
## [1] 76

The accuracy of the classifier is 76%. The accuracy is not that great, but we have used only elementary methods, so this is a reasonable baseline for these techniques.
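Accuracy alone can hide class-wise behaviour. A small sketch of precision and recall for the positive class, computed from the same cm above:

precision = cm[2,2] / sum(cm[2,])  #77 / (25 + 77): share of predicted positives that are truly positive
recall    = cm[2,2] / sum(cm[,2])  #77 / (23 + 77): share of true positives that were found
c(precision = precision, recall = recall)
## precision    recall 
##  0.754902  0.770000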