Introduction

The exercise posted here was part of a project I developed for an assessment. I have chosen to post it with some improvements.

For this task I undertake a sentiment analysis of food products sold on Amazon, attempting to identify the positive and negative assessments of the products. Analysing such data is critical because the volume of comments that customers post online is growing exponentially. This leaves buyers with the burden of processing information about the product they may want to buy. Moreover, producers face ubiquitous, unstructured and untapped data that does not allow an immediate and effective review of a product.

Mining this data is therefore critical: it supports ambitious decision-making on the company side and helps businesses build a strong presence in today's competitive market. This exercise tackles a fundamental problem of sentiment analysis, namely classifying sentiment with Score as the target variable and a bag of words as predictors.

With this in mind, I attempt to predict the score from the bag of words built from the Amazon data.

Data, Data Manipulation, Software usage and libraries

The data is provided by Kaggle (kaggle.com) and has been online since 2012. It contains roughly 500,000 customer reviews posted on Amazon. Each review contains: 1) reviewer ID; 2) product ID; 3) rating; 4) time of the review; 5) helpfulness; 6) review text. Every rating is on a 5-star scale from 1 (lowest) to 5 (highest).

For this task, only the Score and Text columns are used. I have also randomly selected only 10,000 data points. Additionally, the NA values generated in the preprocessing stage have been deleted. The data was then cleaned, as can be seen below. I have also used the uncleaned data to build the machine learning models, in order to test the hypothesis that the models give better results when preprocessing is in place.

Methodology

In light of the goal of the exercise, producing a sentiment analysis by predicting the score as the target variable with a bag of words as predictors, the following methods were used.

Extracting sentiment sentences has been the focus of the report. A machine tagging approach was used that counts the occurrences of positive and negative tokens in every sentence. These are called ground truth labels. This approach supports the categorization of sentiment polarity. Moreover, feature vectors have been used for sentiment categorization.
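
The token-counting idea described above can be sketched in a few lines of base R. This is only an illustration: the two word lists and the helper `tag_sentence` are hypothetical and are not the lexicon or the function used in this project.

```r
# Hypothetical mini-lexicons, for illustration only
positive_words <- c("good", "great", "love", "best", "delicious")
negative_words <- c("bad", "old", "stale", "awful", "small")

tag_sentence <- function(sentence) {
  # lower-case, turn every run of non-letters into a space, then split into tokens
  tokens <- strsplit(gsub("[^a-z]+", " ", tolower(sentence)), " +")[[1]]
  tokens <- tokens[tokens != ""]
  pos <- sum(tokens %in% positive_words)   # number of positive tokens
  neg <- sum(tokens %in% negative_words)   # number of negative tokens
  label <- if (pos > neg) "Positive" else if (neg > pos) "Negative" else "Neutral"
  list(positive = pos, negative = neg, label = label)
}

tag_sentence("It has the BEST flavor and I love it!")$label
tag_sentence("Too small for the price, and it tastes old.")$label
```

A sentence with equally many positive and negative tokens is labelled "Neutral" here; that tie rule is an assumption of the sketch.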

The classification models selected for categorization are Naïve Bayes, Decision Trees and Support Vector Machines.

Reading Data & Preprocessing

# Sentiment analysis in RStudio
# Installing packages for WordCloud
    
# We want to produce the sentiment analysis by using Score as the target variable
# and the bag of words as predictors
    
library("tm") # text mining
## Loading required package: NLP
library("rJava")
library(wordcloud) 
## Loading required package: RColorBrewer
library("textir") # word stemming
## Loading required package: distrom
## Loading required package: Matrix
## Loading required package: gamlr
## Loading required package: parallel
#library("RWeka") ## classification and evaluations from Weka
library("qdap") 
## Loading required package: qdapDictionaries
## Loading required package: qdapRegex
## Loading required package: qdapTools
## 
## Attaching package: 'qdap'
## The following object is masked from 'package:Matrix':
## 
##     %&%
## The following objects are masked from 'package:tm':
## 
##     as.DocumentTermMatrix, as.TermDocumentMatrix
## The following object is masked from 'package:NLP':
## 
##     ngrams
## The following object is masked from 'package:base':
## 
##     Filter
library("maptpx")
## Loading required package: slam
library(syuzhet)
library(rpart) # classification, regression and survival trees
library(SnowballC) # word stemming
library(caret) # partitioning dataset
## Loading required package: lattice
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:qdapRegex':
## 
##     %+%
## The following object is masked from 'package:NLP':
## 
##     annotate
library(rminer) #  classification evaluation
library(kernlab) #support vector Machine Classifier
## 
## Attaching package: 'kernlab'
## The following object is masked from 'package:ggplot2':
## 
##     alpha

Reading the data

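# Note: with the defaults used here the Text column is imported as a factor (see the str() output below);
# pass stringsAsFactors = FALSE to import it as character
dat = read.csv(file='Reviews.csv', header = TRUE)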

dim(dat) 
## [1] 568454     10

Randomly selecting 10,000 points due to the size of the data.

subset = dat[sample(nrow(dat), 10000),] # no seed is set before sampling, so the selected rows differ between runs

Recoding the target variable into two categories, Positive and Negative, as the purpose of the exercise is to identify positive and negative words. Reviews with a score of 3 are set to NA and dropped.

subset[,7] = ifelse(subset[,7] == 1, "Negative",
               ifelse(subset[,7] == 2, "Negative", 
               ifelse(subset[,7] == 4, "Positive",
               ifelse(subset[,7] == 5, "Positive", 99))))
    
    
subset[subset == 99] <- NA
subset = na.omit(subset) # drop the rows with a score of 3
dim(subset) # 9269 data points remain in this run
## [1] 9269   10
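
As a hedged alternative to the nested `ifelse` with a `99` placeholder, the recoding can be written against the score vector alone. The sketch below uses a toy vector `score` (hypothetical values standing in for the Score column). One motivation is that a comparison over the whole data frame, such as `subset == 99`, would also hit any other cell that happens to equal 99.

```r
# Toy ratings standing in for the Score column (hypothetical values)
score <- c(5, 1, 3, 4, 2, 5)

# Map 1-2 to "Negative", 4-5 to "Positive", and everything else (here: 3) to NA
sentiment <- ifelse(score %in% c(1, 2), "Negative",
             ifelse(score %in% c(4, 5), "Positive", NA_character_))

# Keep only the rows that received a label
keep <- !is.na(sentiment)
recoded <- data.frame(Score = sentiment[keep], stringsAsFactors = FALSE)
recoded$Score
```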

Storing only Score and Text, as the goal of the exercise is to predict the score using the bag of words that I will develop below.

text = subset[,c("Score", "Text")]
    
head(text)
##           Score
## 152595 Positive
## 465007 Positive
## 398863 Positive
## 30925  Positive
## 100934 Positive
## 282375 Negative
##                                                                                                                                                                                                                                                                                                                                                                                                      Text
## 152595                                                     This is my top favorite hot sauce.  I use this on my breakfast...eggs, potatoes and bacon. I use it on my sandwiches and steaks, chicken, pork.  It's even great on mashed potatoes or on pasta salad...It has the BEST FLAVOR and I find myself trying it on more and more dishes and loving it! It can make just about anything taste great!
## 465007                                                                                                                                                                                                         Probably the best packaged cookie we've tasted.  Better than the vanilla Oreo.  Good cookie and the filling is not to sweet.  We can't get them in the S.E., finally found them on Amazon.
## 398863 We love these bars...they make a great snack for my preschool daughter who has multiple food allergies.  No soy, hooray!  I have been buying them for about a year and a half.<br />This box we got from NetRush, though, tastes "off" and "old"...perhaps they were not properly stored.  I will not buy from this seller again, although Cranberry KIND bars will remain a staple in our pantry!
## 30925                                                                                                                                                                                                                                                                  They are so tasty and it feels like I am eating real BBQ chips. They are a great snack and satisfying when I have a salty craving!
## 100934                                                                                                                                                                                                                                                               16 dollar for a 12 pack of these chips is a steal =D!!!! These chips are freaking delicious if you want something sweet to snack on.
## 282375                                                                                                                                                                               I figured for the price this would be a nice sized piece of chocolate. It was a gift to my daughter... who.... when questioned said the thing was smaller than a Hershey bar. Too small for the price.... me thinks!
write.csv(text, file= "text.txt")
str(text)
## 'data.frame':    9269 obs. of  2 variables:
##  $ Score: chr  "Positive" "Positive" "Positive" "Positive" ...
##  $ Text : Factor w/ 393579 levels "__________________________________________________________________________________<br />NOTE: I have left an update on my eczem"| __truncated__,..: 332859 255698 375618 309131 3122 86153 286951 144846 63518 359789 ...
nrow(text)
## [1] 9269
prop.table(table(text$Score)) # checking the class balance
## 
##  Negative  Positive 
## 0.1538462 0.8461538

Partitioning the dataset with the caret library

set.seed(14)
train = createDataPartition(y=text$Score, p = 0.50, list = FALSE)
train_d = text[train,]
test_d =text[-train,]
    
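# Caution: itest is drawn from text$Score (9269 rows) but is used to index test_d (4634 rows),
# so some indices fall outside test_d and yield NA rows; partitioning on test_d$Score would avoid this
itest = createDataPartition(y= text$Score, p = 0.50, list = FALSE)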
    
test_one = test_d[itest,] 
test_two = test_d[-itest,]
nrow(train_d)
## [1] 4635
nrow(test_one)
## [1] 4635
prop.table(table(train_d$Score))
## 
##  Negative  Positive 
## 0.1538296 0.8461704
prop.table(table(test_one$Score))
## 
## Negative Positive 
## 0.152513 0.847487
prop.table(table(test_two$Score))
## 
##  Negative  Positive 
## 0.1552021 0.8447979
train_corp = Corpus(VectorSource(train_d$Text)) # building a corpus with one document per review

length(train_corp)
## [1] 4635
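
The two-stage partition above can be illustrated without caret. The sketch below is a base-R approximation of a stratified split, not what `createDataPartition` does internally; the helper `stratified_idx` and the toy data are hypothetical. It draws the second set of indices from the hold-out rows themselves, so every index is a valid row position.

```r
set.seed(14)

# Imbalanced toy labels (hypothetical data)
labels <- c(rep("Negative", 40), rep("Positive", 160))
toy <- data.frame(id = seq_along(labels), Score = labels, stringsAsFactors = FALSE)

# Hypothetical helper: sample a proportion p of the row positions within each class
stratified_idx <- function(y, p) {
  idx <- unlist(lapply(split(seq_along(y), y), function(rows) {
    rows[sample.int(length(rows), round(p * length(rows)))]
  }))
  sort(unname(idx))
}

train_idx <- stratified_idx(toy$Score, 0.5)
train_toy <- toy[train_idx, ]
rest_toy  <- toy[-train_idx, ]

# Second split: indices are drawn from rest_toy itself
test1_idx <- stratified_idx(rest_toy$Score, 0.5)
test1_toy <- rest_toy[test1_idx, ]
test2_toy <- rest_toy[-test1_idx, ]

c(train = nrow(train_toy), test1 = nrow(test1_toy), test2 = nrow(test2_toy))
```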

Pre-processing - cleaning the data - tm library used

Below, the steps used to obtain the cleaned text are explained in detail, mainly using the tm library.

Step 1 - transform all characters to a lower case using tm library

library(tm)
train_d$Text[1]
## [1] This is my top favorite hot sauce.  I use this on my breakfast...eggs, potatoes and bacon. I use it on my sandwiches and steaks, chicken, pork.  It's even great on mashed potatoes or on pasta salad...It has the BEST FLAVOR and I find myself trying it on more and more dishes and loving it! It can make just about anything taste great!
## 393579 Levels: __________________________________________________________________________________<br />NOTE: I have left an update on my eczema condition in the comments section of this review. To summarize the update: I tried applying the oil topically to the affected area and it stopped the itch and burn completely! Applying the hemp seed oil directly to the area will cure eczema!!! (8/15/2011)<br />__________________________________________________________________________________<br /><br />Couldn't be happier with this hemp seed oil. The delicious nutty flavor makes it easy to take this as a daily supplement. The texture will take some getting used to if you're taking it straight as a supplement (it is oil after all). And you can't beat the price here on Amazon. The original price is $20 and here it's half that!<br /><br />I actually decided to get this because I heard an omega6 deficiency causes eczema (a skin disease that I have). I've been taking it for about a week and my skin is showing improvement. I itch less and my skin doesn't inflame as badly. I'm thinking I need to up the dose a little bit. Maybe I'll start taking 1 full tablespoon instead of 2 teaspoons that I'm taking daily. I understand this probably won't be of use to most people simply looking for hemp seed oil, but I will update in the comments of this review about how my eczema curing progress is going. Maybe this can help someone.<br /><br />Nutiva is obviously very passionate and dedicated to the products they sell. The label on this oil is informative and their website is even more informative. By quality of this oil alone I can say I trust this company, and I'll be buying from them in the future. ...
St1 = tm_map(train_corp, tolower)
St1[1] # see the first review using single square brackets
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 1
St1[[1]] # only the review 
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 334
inspect(St1[1:3]) # we can see it has been transformed to lower case
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 3
## 
## [1] this is my top favorite hot sauce.  i use this on my breakfast...eggs, potatoes and bacon. i use it on my sandwiches and steaks, chicken, pork.  it's even great on mashed potatoes or on pasta salad...it has the best flavor and i find myself trying it on more and more dishes and loving it! it can make just about anything taste great!                                                    
## [2] probably the best packaged cookie we've tasted.  better than the vanilla oreo.  good cookie and the filling is not to sweet.  we can't get them in the s.e., finally found them on amazon.                                                                                                                                                                                                        
## [3] we love these bars...they make a great snack for my preschool daughter who has multiple food allergies.  no soy, hooray!  i have been buying them for about a year and a half.<br />this box we got from netrush, though, tastes "off" and "old"...perhaps they were not properly stored.  i will not buy from this seller again, although cranberry kind bars will remain a staple in our pantry!

Step 2 - removing numerical values

St2 = tm_map(St1, removeNumbers)
St2[[2]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 186
inspect(St2[1:3])
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 3
## 
## [1] this is my top favorite hot sauce.  i use this on my breakfast...eggs, potatoes and bacon. i use it on my sandwiches and steaks, chicken, pork.  it's even great on mashed potatoes or on pasta salad...it has the best flavor and i find myself trying it on more and more dishes and loving it! it can make just about anything taste great!                                                    
## [2] probably the best packaged cookie we've tasted.  better than the vanilla oreo.  good cookie and the filling is not to sweet.  we can't get them in the s.e., finally found them on amazon.                                                                                                                                                                                                        
## [3] we love these bars...they make a great snack for my preschool daughter who has multiple food allergies.  no soy, hooray!  i have been buying them for about a year and a half.<br />this box we got from netrush, though, tastes "off" and "old"...perhaps they were not properly stored.  i will not buy from this seller again, although cranberry kind bars will remain a staple in our pantry!

Step 3 - Remove stopwords - snowball.tartarus

St3 = tm_map(St2, removeWords, stopwords())
St3[[2]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 136
St3[2:3]
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 2
inspect(St3[1:3])
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 3
## 
## [1]    top favorite hot sauce.   use    breakfast...eggs, potatoes  bacon.  use    sandwiches  steaks, chicken, pork.   even great  mashed potatoes   pasta salad...   best flavor   find  trying      dishes  loving !  can make just  anything taste great!                                    
## [2] probably  best packaged cookie  tasted.  better   vanilla oreo.  good cookie   filling    sweet.    get    s.e., finally found   amazon.                                                                                                                                                     
## [3]  love  bars... make  great snack   preschool daughter   multiple food allergies.   soy, hooray!     buying     year   half.<br /> box  got  netrush, though, tastes ""  "old"...perhaps    properly stored.   will  buy   seller , although cranberry kind bars will remain  staple   pantry!

Step 4 - Remove punctuation from Corpus

St4 = tm_map(St3, removePunctuation)
St4[[2]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 129
St4[1:3]
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 3
inspect(St4[1:3])
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 3
## 
## [1]    top favorite hot sauce   use    breakfasteggs potatoes  bacon  use    sandwiches  steaks chicken pork   even great  mashed potatoes   pasta salad   best flavor   find  trying      dishes  loving   can make just  anything taste great                            
## [2] probably  best packaged cookie  tasted  better   vanilla oreo  good cookie   filling    sweet    get    se finally found   amazon                                                                                                                                      
## [3]  love  bars make  great snack   preschool daughter   multiple food allergies   soy hooray     buying     year   halfbr  box  got  netrush though tastes   oldperhaps    properly stored   will  buy   seller  although cranberry kind bars will remain  staple   pantry

Step 5 - Remove white space

St5 = tm_map(St4, stripWhitespace)
St5[[2]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 110
St5[1:3]
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 3
inspect(St5[1:3])
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 3
## 
## [1]  top favorite hot sauce use breakfasteggs potatoes bacon use sandwiches steaks chicken pork even great mashed potatoes pasta salad best flavor find trying dishes loving can make just anything taste great                         
## [2] probably best packaged cookie tasted better vanilla oreo good cookie filling sweet get se finally found amazon                                                                                                                      
## [3]  love bars make great snack preschool daughter multiple food allergies soy hooray buying year halfbr box got netrush though tastes oldperhaps properly stored will buy seller although cranberry kind bars will remain staple pantry

Step 6 - Stemming

St6 = tm_map(St5,stemDocument, language="english") # not working well
St6[[2]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 98
St6[2:3]
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 2
inspect(St6[1:3])
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 3
## 
## [1] top favorit hot sauc use breakfastegg potato bacon use sandwich steak chicken pork even great mash potato pasta salad best flavor find tri dish love can make just anyth tast great                                 
## [2] probabl best packag cooki tast better vanilla oreo good cooki fill sweet get se final found amazon                                                                                                                  
## [3] love bar make great snack preschool daughter multipl food allergi soy hooray buy year halfbr box got netrush though tast oldperhap proper store will buy seller although cranberri kind bar will remain stapl pantri
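
In the output above, removing punctuation glues some words together (`breakfasteggs`, `halfbr`, `oldperhaps`). A minimal base-R sketch of the same cleaning steps (without stemming) is shown below, under the assumption that punctuation and HTML tags are replaced with a space rather than deleted. The helper `clean_review` and the short stopword list are hypothetical, and the steps run in a slightly different order than the tm pipeline above.

```r
# Tiny stopword list, for illustration only (tm::stopwords() is much longer)
toy_stopwords <- c("this", "is", "my", "i", "it", "on", "and", "the", "a")

clean_review <- function(x, stop_words = toy_stopwords) {
  x <- tolower(x)                        # step 1: lower case
  x <- gsub("<[^>]+>", " ", x)           # drop HTML tags such as <br />
  x <- gsub("[0-9]+", " ", x)            # step 2: remove numbers
  x <- gsub("[[:punct:]]+", " ", x)      # step 4: punctuation becomes a space, not ""
  tokens <- strsplit(x, "[[:space:]]+")[[1]]
  tokens <- tokens[tokens != "" & !(tokens %in% stop_words)]  # step 3: stopwords
  paste(tokens, collapse = " ")          # step 5: single spaces between tokens
}

clean_review("I use this on my breakfast...eggs, potatoes and bacon.<br />Great!")
```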

To capture a more accurate text with the most positive and negative words, the function for removing sparse terms was used with a threshold of 0.95, meaning that only the most frequent terms remain. The two chunks below show how many distinct words the text contained before removing sparse terms (12203) and how many remained afterwards (112).

# Corpus transformation into a Document Terms Matrix 
dtm_train = DocumentTermMatrix(St6) # each review becomes one row, each distinct term one column

dim(dtm_train) # 4635 reviews and 12203 distinct words
## [1]  4635 12203
# Removing sparse terms 
dtm_train_spte= removeSparseTerms(dtm_train, 0.95) # drop terms that are absent from 95% or more of the reviews
dim(dtm_train_spte) # we now have 112 distinct words
## [1] 4635  112
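
To make the meaning of the 0.95 threshold concrete, here is a small base-R sketch on a toy matrix. It mirrors my understanding of the documented behaviour of `removeSparseTerms` (a term is kept only if it appears in more than a share `1 - sparse` of the documents); the helper `drop_sparse` and the counts are hypothetical.

```r
# Toy document-term matrix: 4 documents x 3 terms (hypothetical counts)
dtm_toy <- matrix(c(1, 0, 2, 1,    # "good": present in 3 of 4 documents
                    0, 0, 0, 3,    # "tea":  present in 1 of 4 documents
                    1, 1, 1, 1),   # "like": present in all documents
                  nrow = 4,
                  dimnames = list(NULL, c("good", "tea", "like")))

drop_sparse <- function(m, sparse) {
  doc_freq <- colSums(m > 0)   # number of documents containing each term
  m[, doc_freq > nrow(m) * (1 - sparse), drop = FALSE]
}

colnames(drop_sparse(dtm_toy, 0.5))    # stricter threshold: rare terms are dropped
colnames(drop_sparse(dtm_toy, 0.95))   # lenient threshold: every term survives here
```

With only four documents, the 0.95 threshold removes nothing; on thousands of reviews it removes the long tail of words that appear in very few of them.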

A first look at the cleaned text is obtained by calculating the average frequency of the top 30 words. As shown, words such as like, taste, flavor, good and love are among the top 30 terms, with an average of between 0.3 and 0.5 occurrences per review, and are mostly used to describe the products. A bar plot of the 30 most used words reveals that positive words dominate, pointing to a high frequency of positive feedback overall in the sample of 10,000 points.

# Average frequency of the top 30 words - sorted in descending order
meantrain = sort(colMeans(as.matrix(dtm_train_spte)), decreasing= T)
    
meantrain[1:30]
##      like      tast    flavor      good      love       one   product 
## 0.4722762 0.4327940 0.3710895 0.3553398 0.3337648 0.3186624 0.3154261 
##       use      just     great       can     coffe      food       tri 
## 0.3085221 0.3022654 0.3016181 0.2944984 0.2944984 0.2871629 0.2806904 
##       tea       get      make      will       dog       buy       eat 
## 0.2532902 0.2459547 0.2409924 0.2129450 0.1909385 0.1810140 0.1803668 
##      time    realli      also     price      much      well     order 
## 0.1704423 0.1661273 0.1555556 0.1540453 0.1529666 0.1497303 0.1486516 
##    amazon     littl 
## 0.1473571 0.1454153
mean30 = mean(meantrain[1:30])
mean30
## [1] 0.2521467

Plotting the frequency

barplot(meantrain[1:30], border=NA, las=3, xlab="top 30 words", ylab="Frequency", ylim=c(0, 0.5))

# Compare the average frequency of the top 30 words with the zeros excluded. Zeros are currently included, so they are set to NA
    
dtm_train_spte_n0 = as.matrix(dtm_train_spte)
    
dim(dtm_train_spte_n0)
## [1] 4635  112
is.na(dtm_train_spte_n0) = dtm_train_spte_n0 == 0
#Calculating the mean without NA - yet this approach does not help, since positive and negative words are less likely to occur; therefore the non-important words are removed next
meantrainM = sort(colMeans(dtm_train_spte_n0, na.rm = TRUE), decreasing=T)

meantrainM
##       tea     coffe       dog      food    chocol     treat     water 
##  2.913151  2.271215  2.169118  2.139871  1.883019  1.738019  1.660066 
##     drink      like    flavor       bag       use       box       can 
##  1.645244  1.503434  1.502183  1.493088  1.478800  1.477064  1.459893 
##       cup      tast     sugar   product       one       eat       mix 
##  1.452941  1.428775  1.428571  1.392381  1.377799  1.352751  1.342466 
##     brand      pack       get       day    review       tri      just 
##  1.337950  1.322476  1.307339  1.305747  1.302251  1.299700  1.298424 
##      good     sweet      make       add      love     snack   ingredi 
##  1.297872  1.287500  1.275114  1.264000  1.251618  1.250871  1.249097 
##   healthi    packag      made    amazon    realli     order     great 
##  1.247934  1.247024  1.238994  1.230631  1.228070  1.225979  1.225241 
##      feel      well      work      will     stuff     price      time 
##  1.223140  1.217544  1.217514  1.217016  1.214592  1.214286  1.213518 
##     littl     first      year     store      much       old       two 
##  1.205725  1.203704  1.203233  1.197959  1.195616  1.193133  1.183280 
##     thing      also       buy      take      look       now      mani 
##  1.181538  1.180033  1.180028  1.176667  1.176606  1.174699  1.172414 
##      less      find     found      even      back   purchas     think 
##  1.172269  1.165794  1.165450  1.163498  1.161826  1.161725  1.159892 
##     still     fresh   qualiti      give       way      seem      nice 
##  1.154639  1.149306  1.146341  1.143852  1.143631  1.140288  1.138889 
##    better       say      ship       lot      last     alway      want 
##  1.138636  1.138028  1.137339  1.136508  1.134454  1.133333  1.133028 
##       bit    differ       got     enjoy      high      need      come 
##  1.128814  1.125392  1.120521  1.115702  1.115578  1.115044  1.110759 
##       put   definit    bought      long    someth    wonder      best 
##  1.110687  1.109756  1.107692  1.104000  1.101626  1.101562  1.101329 
##   without recommend      keep      know     local     never    enough 
##  1.099644  1.098712  1.096346  1.095679  1.094828  1.093985  1.090129 
##      sinc   perfect   favorit    delici     right     everi      ever 
##  1.089636  1.084142  1.080332  1.079734  1.075397  1.071698  1.068441
#30 most freq words without na 
meantrainM[1:30] # we have higher frequencies 
##      tea    coffe      dog     food   chocol    treat    water    drink 
## 2.913151 2.271215 2.169118 2.139871 1.883019 1.738019 1.660066 1.645244 
##     like   flavor      bag      use      box      can      cup     tast 
## 1.503434 1.502183 1.493088 1.478800 1.477064 1.459893 1.452941 1.428775 
##    sugar  product      one      eat      mix    brand     pack      get 
## 1.428571 1.392381 1.377799 1.352751 1.342466 1.337950 1.322476 1.307339 
##      day   review      tri     just     good    sweet 
## 1.305747 1.302251 1.299700 1.298424 1.297872 1.287500
av30top = mean(meantrainM[1:30])
av30top 
## [1] 1.562304
barplot(meantrainM[1:30], border = NA, las=3, xlab= "top 30 words", ylab= "Frequency", ylim = c(0,3))
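
The two averages computed above answer different questions, which is why the rankings differ so much. A toy example in base R (hypothetical counts) shows the effect: a word that is rare but repeated within a review rises to the top once zeros are excluded.

```r
# Toy counts: 4 reviews x 2 terms (hypothetical)
counts <- matrix(c(0, 0, 0, 4,    # "tea": rare, but repeated when present
                   1, 1, 1, 0),   # "like": common, used once per review
                 nrow = 4, dimnames = list(NULL, c("tea", "like")))

# Average over every review, zeros included
mean_all <- colMeans(counts)

# Average over the reviews that actually contain the term
nonzero <- counts
nonzero[nonzero == 0] <- NA       # same idea as is.na(x) <- x == 0
mean_present <- colMeans(nonzero, na.rm = TRUE)

rbind(mean_all, mean_present)
```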

However, as can be observed, some words are unnecessary given the goal of this project. I have therefore removed them by creating my own word list.

#Removing non-interesting words, such as tea, dog, coffe, food, water, bag, cup, drink, sugar, mix, eat, one, two, three, box, product, ingredients, just, day, get, brand


mystopwords = c("tea", "dog", "coffe", "food", "water", "tri", "will", "bag", "cup", "drink", "sugar", "mix", "eat", "one", "two", "three", "box", "product", "ingredients", "just", "day", "get", "brand", "didn", "come", "someth", "see", "put", "pack", "got", "say" , "packag" , "even" , "can" , "use" ,"price" , "time" , "find" , "order" , "bag" , "chocol" , "amazon" , "make" , "also" , "order" , "way" , "purchase" , "give" , "year" , "way" , "review" , "thing" , "bought" , "take", "eat" , "ship", "everi" , "alway", "back", "howev", "put", "stuff", "see", "actual", "month" ,"without" , "local" , "sure" , "ever" , "think", "make", "don", "found", "work", "product", "though" , "amazon", "wonder")

It is observed that, after removing the words that are not important for the problem at hand, the number of unique words is lower: 73 terms remain after sparse-term removal, compared with 112 before. That is because the data no longer contains the non-important words and only the non-sparse terms are kept. Sparse-term removal was applied to the text from which the non-important words had been removed. Compared with the instances seen previously, the difference is large.

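# Note: mystopwords holds mostly stemmed forms (e.g. "coffe", "tri"), while train_corp is still raw text,
# so several of these words survive in the output below
train_corp2 = tm_map(train_corp, removeWords, mystopwords)
# Note: tolower = t passes the base function t() rather than TRUE, which is likely why capitalised terms
# such as "The" and "This" appear below; tolower = TRUE would lower-case the text
dtm_train2 = DocumentTermMatrix(train_corp2, control = list(tolower  =t, removeNumbers=T, removePunctuation=T, stopwords=T, stripWhitespace=T, stemming=T))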
    
dim(dtm_train2)
## [1]  4635 13061
dtm_train_spte2 = removeSparseTerms(dtm_train2, 0.95)
dim(dtm_train_spte2) 
## [1] 4635   73
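
The custom word list mixes stemmed forms (`coffe`, `tri`, `chocol`) with raw text, and whole-word removal only works when the two match. The base-R sketch below illustrates this with a hypothetical helper `remove_words` that, like tm's `removeWords` as far as I know, matches on word boundaries; the example sentences are made up.

```r
raw_text <- "The coffee tasted great and I tried the chocolate"   # hypothetical review
stem_stopwords <- c("coffe", "tri", "chocol")                      # stemmed forms from mystopwords

# Hypothetical helper: remove whole words only, then tidy the spaces
remove_words <- function(x, words) {
  pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")
  trimws(gsub(" +", " ", gsub(pattern, "", x, perl = TRUE)))
}

# On raw text the stems never match a whole word, so nothing is removed
remove_words(raw_text, stem_stopwords)

# On already stemmed, lower-cased text they do match
stemmed_text <- "the coffe tast great and i tri the chocol"
remove_words(stemmed_text, stem_stopwords)
```

Under this assumption, applying the custom list after stemming (or listing the unstemmed words) would remove the intended terms.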

The same process of checking the top 30 words in the dataset was then applied. It reveals the occurrences of the words that are relevant to us.

#Average of top 30 after removing the non interesting stop words
meantrain2 = sort(colMeans(as.matrix(dtm_train_spte2)), decreasing= T)
meantrain2[1:30]
##       like       tast     flavor        The       good       love 
## 0.46774542 0.42912621 0.37842503 0.37734628 0.34023732 0.28932039 
##      coffe        tri       This      great        buy        use 
## 0.27831715 0.27119741 0.26796117 0.24660194 0.17303128 0.16094930 
##     realli       They       much      littl     Amazon       well 
## 0.15620280 0.15059331 0.14843581 0.14390507 0.14282632 0.12923409 
##      store       best      treat      sweet  recommend     better 
## 0.12772384 0.12578209 0.11672060 0.11089536 0.10852211 0.10765912 
##       look       want      first       make      These     chocol 
## 0.10701187 0.10507012 0.10140237 0.10075512 0.09902913 0.09751888
ave30top_mean = mean(meantrain2[1:30]) # calculating the average 
    
barplot(meantrain2[1:30], border = NA, las=3, xlab= "top 30 words", ylab= "Frequency", ylim = c(0, 0.5))

#Plotting the 30 most frequent words with wordcloud
wordcloud(names(meantrain2[1:30]), meantrain2[1:30], col = rainbow(4)) # shows which words dominate the corpus

Obtaining the bag of words, the top 30 words used in the random dataset

Modeling – Sentiment classification on training and test datasets

The sentiment classification on the training and test datasets follows, storing the positive and negative tags in a bag of words. Before modelling, however, a few more steps are taken that build on what was achieved above.

Sentiment classification on the training and test datasets: generate the training dataset by transforming the Document Term Matrix into a matrix

train_freq_bagw = as.matrix(dtm_train_spte)

train_data_m = data.frame(y = train_d$Score, x = train_freq_bagw) # combining the term frequencies of each review with its Score
    
dim(train_data_m)
## [1] 4635  113

Bag of words - training set - use later

train_m_bagw = findFreqTerms(dtm_train_spte)
length(train_m_bagw)
## [1] 112

Test 1, including the term frequency of the words from the training data

test1_corp = Corpus(DataframeSource(as.matrix(test_one$Text)))  

Generate the test1 and test2 Document Term Matrices using the bag of words from the training data - for later use

test1_m_bagw = DocumentTermMatrix(test1_corp, 
                                    control = list(tolower  = T, 
                                    removeNumbers=T, 
                                    removePunctuation=T, 
                                    stopwords=T, 
                                    stripWhitespace=T, 
                                    stemming=T, 
                                    dictionary=train_m_bagw)) # dictionary restricts the counts to the 112 words in the training bag of words
    
test1_freq_bagw = as.matrix(test1_m_bagw) # document transformed into matrix from a doc term matrix
    
test1_d_m = data.frame(y=test_one$Score, x =test1_freq_bagw)
dim(test1_d_m)
## [1] 4635  113
test2_corp <- Corpus(DataframeSource(as.matrix(test_two$Text)))
test2_m_bagw <- DocumentTermMatrix(test2_corp, 
                                   control = list(tolower  = T,
                                                  removeNumbers=T,
                                                  removePunctuation=T,
                                                  stopwords=T,
                                                  stripWhitespace=T,
                                                  stemming=T,
                                                  dictionary=train_m_bagw))
dim(test2_m_bagw) # 2326 data points and 112 terms
## [1] 2326  112
test2_bagw_freq = as.matrix(test2_m_bagw)
test2_data_m = data.frame(y = test_two$Score, x = test2_bagw_freq)
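
The role of the `dictionary` option can be illustrated in base R: test documents are counted only against the vocabulary learned from the training data, so the columns of the test matrix line up with those of the training matrix. The vocabulary, documents and helper `count_vocab` below are hypothetical.

```r
train_vocab <- c("good", "great", "love", "tast")   # hypothetical training vocabulary (stems)

# Test documents, assumed to be already cleaned and stemmed
test_docs <- c("great tast great price", "not good", "arriv late")

count_vocab <- function(doc, vocab) {
  tokens <- strsplit(doc, " +")[[1]]
  # factor() with fixed levels keeps zero counts and ignores out-of-vocabulary tokens
  as.integer(table(factor(tokens, levels = vocab)))
}

test_matrix <- t(vapply(test_docs, count_vocab, integer(length(train_vocab)),
                        vocab = train_vocab, USE.NAMES = FALSE))
colnames(test_matrix) <- train_vocab
test_matrix
```

Words such as "price" or "arriv" that are not in the training vocabulary simply do not get a column, which is what lets a model fitted on the training matrix be applied to the test matrix.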

In addition, a version of the training and test datasets is left untouched by the cleaning process. Models are built on it in order to highlight how the results differ when preprocessing has not been done.

The same models are also developed using the test data, with the bag of words obtained in the preceding preprocessing as a dictionary. It is hypothesised that the model generated and tested with the second test dataset will perform better in terms of accuracy.

Modelling - Ctree, NaiveBayes, KSVM from kernlab

Decision Trees

Evaluating the model with a confusion matrix shows an accuracy of about 84 % on the preprocessed test data, a good percentage.

In the plot, the words perfect and good appear on the left-hand side. When one of these words occurs at least once in a review, the likelihood that the review is classified as positive is very high: the black block in the bar plot of the terminal node is above 80 %. The bar plots may be hard to read due to the lack of space, yet the same structure can be observed in the printed inference tree, with its 24 terminal nodes. The white block shows the likelihood that the review falls under the negative classification.

On the data without preprocessing, the confusion matrix shows an accuracy of about 42 %, a poor percentage.
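
As a reminder of how these figures relate to a confusion matrix, here is a small base R sketch. The labels are toy values invented for illustration, not the Amazon data, and the calculation is done by hand without `mmetric`.

```r
# Toy labels, invented purely for illustration
truth <- factor(c("neg", "neg", "neg", "pos", "pos", "pos", "pos", "pos", "pos", "pos"),
                levels = c("neg", "pos"))
pred  <- factor(c("neg", "pos", "pos", "pos", "pos", "pos", "pos", "pos", "pos", "neg"),
                levels = c("neg", "pos"))

# rows = predicted class, columns = true class
cm <- table(Prediction = pred, True = truth)

accuracy  <- sum(diag(cm)) / sum(cm)        # share of correct predictions
recall    <- diag(cm) / colSums(cm)         # TPR per class
precision <- diag(cm) / rowSums(cm)         # per class
f1        <- 2 * precision * recall / (precision + recall)

print(cm)
print(round(100 * c(ACC = accuracy, TPR = recall, PRECISION = precision, F1 = f1), 2))
```

Accuracy is the average of the per-class TPR values weighted by class size, so with two classes it always lies between the two TPR values. This is a useful sanity check when reading the metric tables.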

# Decision tree for the data where preprocessing hasn't been done
library(party)
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## 
## Attaching package: 'modeltools'
## The following object is masked from 'package:kernlab':
## 
##     prior
## The following object is masked from 'package:rminer':
## 
##     fit
## The following object is masked from 'package:rJava':
## 
##     clone
## Loading required package: strucchange
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: sandwich
# ctree model on training dataset
bagw_ctree = ctree (y ~., data = train_data_m)
bagw_ctree
## 
##   Conditional inference tree with 24 terminal nodes
## 
## Response:  y 
## Inputs:  x.best, x.can, x.even, x.favorit, x.find, x.flavor, x.great, x.just, x.love, x.make, x.tast, x.tri, x.use, x.amazon, x.better, x.found, x.get, x.good, x.packag, x.sweet, x.box, x.buy, x.food, x.got, x.snack, x.store, x.will, x.year, x.delici, x.pack, x.someth, x.want, x.also, x.dog, x.feel, x.healthi, x.look, x.mix, x.old, x.perfect, x.price, x.purchas, x.review, x.seem, x.treat, x.without, x.work, x.order, x.first, x.ingredi, x.last, x.like, x.long, x.one, x.well, x.product, x.ever, x.time, x.wonder, x.chocol, x.cup, x.day, x.enough, x.give, x.lot, x.qualiti, x.realli, x.say, x.take, x.tea, x.way, x.thing, x.water, x.bit, x.enjoy, x.eat, x.keep, x.local, x.still, x.bought, x.bag, x.nice, x.differ, x.made, x.need, x.now, x.recommend, x.know, x.much, x.definit, x.fresh, x.everi, x.mani, x.stuff, x.sugar, x.brand, x.coffe, x.high, x.littl, x.add, x.ship, x.put, x.drink, x.alway, x.back, x.never, x.think, x.less, x.sinc, x.come, x.two, x.right 
## Number of observations:  4635 
## 
## 1) x.great <= 0; criterion = 1, statistic = 97.582
##   2) x.product <= 1; criterion = 1, statistic = 70.426
##     3) x.love <= 0; criterion = 1, statistic = 66.904
##       4) x.best <= 0; criterion = 1, statistic = 51.342
##         5) x.good <= 0; criterion = 1, statistic = 49.311
##           6) x.delici <= 0; criterion = 1, statistic = 29.658
##             7) x.tast <= 1; criterion = 0.999, statistic = 19.837
##               8) x.perfect <= 0; criterion = 0.999, statistic = 19.015
##                 9) x.well <= 0; criterion = 0.987, statistic = 14.789
##                   10) x.make <= 0; criterion = 0.988, statistic = 15.043
##                     11) x.enjoy <= 0; criterion = 0.977, statistic = 13.729
##                       12)*  weights = 889 
##                     11) x.enjoy > 0
##                       13)*  weights = 64 
##                   10) x.make > 0
##                     14) x.think <= 0; criterion = 0.994, statistic = 17.206
##                       15)*  weights = 158 
##                     14) x.think > 0
##                       16)*  weights = 13 
##                 9) x.well > 0
##                   17) x.even <= 0; criterion = 0.965, statistic = 12.967
##                     18)*  weights = 105 
##                   17) x.even > 0
##                     19)*  weights = 19 
##               8) x.perfect > 0
##                 20)*  weights = 79 
##             7) x.tast > 1
##               21)*  weights = 87 
##           6) x.delici > 0
##             22)*  weights = 98 
##         5) x.good > 0
##           23) x.tast <= 0; criterion = 0.999, statistic = 19.378
##             24)*  weights = 379 
##           23) x.tast > 0
##             25) x.bought <= 0; criterion = 0.997, statistic = 17.925
##               26)*  weights = 208 
##             25) x.bought > 0
##               27)*  weights = 27 
##       4) x.best > 0
##         28)*  weights = 319 
##     3) x.love > 0
##       29) x.look <= 0; criterion = 0.985, statistic = 14.514
##         30)*  weights = 773 
##       29) x.look > 0
##         31)*  weights = 67 
##   2) x.product > 1
##     32)*  weights = 209 
## 1) x.great > 0
##   33) x.got <= 0; criterion = 1, statistic = 48.877
##     34) x.tast <= 1; criterion = 0.997, statistic = 17.48
##       35) x.bought <= 0; criterion = 0.976, statistic = 13.666
##         36) x.purchas <= 0; criterion = 0.986, statistic = 14.7
##           37)*  weights = 819 
##         36) x.purchas > 0
##           38) x.look <= 0; criterion = 0.979, statistic = 13.947
##             39) x.year <= 0; criterion = 1, statistic = 22.222
##               40)*  weights = 48 
##             39) x.year > 0
##               41)*  weights = 9 
##           38) x.look > 0
##             42)*  weights = 10 
##       35) x.bought > 0
##         43)*  weights = 83 
##     34) x.tast > 1
##       44)*  weights = 99 
##   33) x.got > 0
##     45) x.now <= 0; criterion = 1, statistic = 24.559
##       46)*  weights = 55 
##     45) x.now > 0
##       47)*  weights = 18
library(partykit)
## 
## Attaching package: 'partykit'
## The following objects are masked from 'package:party':
## 
##     cforest, ctree, ctree_control, edge_simple, mob, mob_control,
##     node_barplot, node_bivplot, node_boxplot, node_inner,
##     node_surv, node_terminal
plot(bagw_ctree, gp=gpar(fontsize=6))

# prediction for testing data1
test1pdction = predict(bagw_ctree, newdata=test1_d_m)

# evaluating the prediction results
#confusionMatrix(test1pdction, test1_d_m[,1], positive="Positive", dnn=c("Prediction", "True"))

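# note: rminer documents mmetric(y, x, metric) with the target values first;
# predictions are passed first here, so the TPR and PRECISION columns read as swapped
mmetric(test1pdction, test1_d_m[,1], c("ACC", "TPR", "PRECISION", "F1"))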
##        ACC       TPR1       TPR2 PRECISION1 PRECISION2        F11 
##  42.222222  57.142857  84.876141   1.136364  99.846626   2.228412 
##        F12 
##  91.754757

Decision trees with preprocessing in place: accuracy improved from about 42 % to about 84 %, a huge difference. This difference supports what was hypothesised, that accuracy is better on the preprocessed data.

#Preprocessing in place 

test2dt = predict(bagw_ctree, newdata=test2_data_m)
#confusionMatrix(test2dt,test2_data_m[,1],positive="Positive",dnn=c("Prediction", "True"))
mmetric(test2dt, test2_data_m[,1], c("ACC", "TPR", "PRECISION", "F1"))
##        ACC       TPR1       TPR2 PRECISION1 PRECISION2        F11 
## 84.4368014 40.0000000 84.5325291  0.5540166 99.8473282  1.0928962 
##        F12 
## 91.5538964
# Accuracy is about 84 %, a very good result

Naive Bayes

Naive Bayes classifier. Accuracy is about 40 %, which is not very good.

# Naive Bayes without preprocessing 
library(e1071)
bagw_nb = naiveBayes(y ~., data= train_data_m)
    
test1prdNB = predict(bagw_nb, newdata=test1_d_m)
# evaluating the prediction results
#confusionMatrix(test1prdNB, test1_d_m[,1], positive="Positive", dnn=c("Prediction", "True"))   

mmetric(test1prdNB, test1_d_m[,1], c("ACC", "TPR", "PRECISION", "F1"))
##        ACC       TPR1       TPR2 PRECISION1 PRECISION2        F11 
##   40.17260   36.64773   88.59918   36.64773   88.59918   36.64773 
##        F12 
##   88.59918

Accuracy for the Naive Bayes model improved by 40 points, from about 40 % to about 80 %, with preprocessing in place. This is exactly what was hypothesised.

# Naive Bayes with preprocessing in place
library(e1071)
bagw_nb = naiveBayes(y ~., data= train_data_m)
test2prdNB = predict(bagw_nb, newdata=test2_data_m)
    # evaluating the prediction results
#confusionMatrix(test2prdNB, test2_data_m[,1], positive="Positive", dnn=c ("Prediction", "True"))
mmetric(test2prdNB, test2_data_m[,1], c("ACC", "TPR", "PRECISION", "F1"))
##        ACC       TPR1       TPR2 PRECISION1 PRECISION2        F11 
##   80.18057   37.11340   88.80289   39.88920   87.58270   38.45127 
##        F12 
##   88.18857

Support Vector Machines Kernel Model

KSVM model. Accuracy is about 42 %, which is not very good.

library(kernlab)
bagw_ksvm = ksvm( y~., data= train_data_m)
    
# support vector machine kernel model
test1prdSV = predict(bagw_ksvm, newdata = test1_d_m)
#confusionMatrix(test1prdSV, test1_d_m[,1], positive="Positive",dnn=c("Prediction","True"))

mmetric(test1prdSV, test1_d_m[,1], c("ACC", "TPR", "PRECISION", "F1"))
##        ACC       TPR1       TPR2 PRECISION1 PRECISION2        F11 
##  42.265372  80.000000  84.889275   1.136364  99.948875   2.240896 
##        F12 
##  91.805588

Support vector machine kernel model. Accuracy is about 85 %, a good result. It improved from about 42 % to about 85 % with preprocessing in place, which supports the hypothesis.

test2prdSV = predict(bagw_ksvm, newdata = test2_data_m)

#confusionMatrix(test2prdSV, test2_data_m[,1], positive="Positive", dnn=c 
                   # ("Prediction", "True"))

mmetric(test2prdSV, test2_data_m[,1], c("ACC", "TPR", "PRECISION", "F1"))
##         ACC        TPR1        TPR2  PRECISION1  PRECISION2         F11 
##  84.5227859 100.0000000  84.5161290   0.2770083 100.0000000   0.5524862 
##         F12 
##  91.6083916

Conclusion

With the machine learning models generated above, the goal of the exercise has been achieved: producing a sentiment analysis with Score as the target variable and a bag of words as predictors. The best accuracy is obtained by the kernel support vector machine, while the other models, decision trees and Naïve Bayes, also perform well once preprocessing has been done. This is not the case for the data where preprocessing hasn’t been done, where the accuracy metric showed low percentages for all of the models. However, more remains to be done on class imbalance, as the data reveals this gap. I chose not to balance the classes, as there were more positive assessments of the products than negative ones in this particular data. It is acknowledged that there are more positive comments than negative ones for Amazon products.
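
One possible way to address the class imbalance, which was not done in this exercise, is random undersampling of the majority class. The base R sketch below shows the idea. The `reviews` data frame, its 85/15 split and the seed are assumptions invented for illustration, not the Amazon data.

```r
set.seed(42)  # fixed seed so the sketch is reproducible

# Toy data frame standing in for the review table (invented values)
reviews <- data.frame(
  y = factor(c(rep("Positive", 85), rep("Negative", 15))),
  x = seq_len(100)
)

# size of the smallest class
n_min <- min(table(reviews$y))

# draw n_min row indices from each class, then stack them
rows_by_class <- split(seq_len(nrow(reviews)), reviews$y)
idx <- unlist(lapply(rows_by_class, function(rows) {
  rows[sample.int(length(rows), n_min)]
}))

balanced <- reviews[idx, ]
print(table(balanced$y))
```

If this were tried, it would normally be applied to the training data only, so that the test data keeps the class proportions seen in practice.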