Lecture 203
https://www.udemy.com/machinelearning/learn/lecture/6069684
Lecture 204
https://www.udemy.com/machinelearning/learn/lecture/6073514
Lecture 205
https://www.udemy.com/machinelearning/learn/lecture/6074630
Lecture 206
https://www.udemy.com/machinelearning/learn/lecture/6074890
Lecture 207
https://www.udemy.com/machinelearning/learn/lecture/6075148
Lecture 208
https://www.udemy.com/machinelearning/learn/lecture/6075468
Lecture 209
https://www.udemy.com/machinelearning/learn/lecture/6075784
Lecture 210
https://www.udemy.com/machinelearning/learn/lecture/6076060
Lecture 211
https://www.udemy.com/machinelearning/learn/lecture/6080678
Lecture 212
https://www.udemy.com/machinelearning/learn/lecture/6083658
Check the working directory with getwd().
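A minimal sketch (the path below is hypothetical; point it at whatever folder holds Restaurant_Reviews.tsv):
getwd()                           # where R will look for the file
# setwd("~/Machine_Learning/NLP") # hypothetical path; uncomment and adjust as needed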
The quote = '' argument tells read.delim not to treat any character as a quoting character, so quotation marks inside the reviews don't break the parsing. stringsAsFactors is a logical: should character vectors be converted to factors? (Note that this is overridden by as.is and colClasses, both of which allow finer control.) Since we don't want the reviews to be treated as factors, we set this to FALSE; we'll be digging into the words of the reviews themselves, so we won't treat the Review column as a factor.
dataset_original = read.delim('Restaurant_Reviews.tsv', quote = '', stringsAsFactors = FALSE)
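Before cleaning, it's worth a quick sanity check on the import; a small sketch using base R:
dim(dataset_original)   # expect 1000 rows and 2 columns: Review and Liked
str(dataset_original)   # Review should be character (not a factor), Liked an integer 0/1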
We’ll create a corpus https://en.wikipedia.org/wiki/Text_corpus
In the end we are creating a sparse matrix, and to do this well we want to remove all the unnecessary elements of the reviews. Basically, we want to pare the text down to the fewest words possible so that we have fewer columns in the sparse matrix.
# install.packages('tm')
# install.packages('SnowballC')
library(tm)
## Loading required package: NLP
library(SnowballC)
# initialize the corpus
corpus = VCorpus(VectorSource(dataset_original$Review)) # point at the Review column
Let's have a look; review 841 is a good example with several things that need cleaning.
as.character(corpus[[841]])
## [1] "for 40 bucks a head, i really expect better food."
The rest of the cleaning follows the same idea for the various processes we wish to execute on our data.
corpus = tm_map(corpus, content_transformer(tolower)) # make all words lower case
corpus = tm_map(corpus, removeNumbers) # remove numbers
corpus = tm_map(corpus, removePunctuation) # remove punctuation
corpus = tm_map(corpus, removeWords, stopwords()) # remove common stop words using tm's stopwords() list
corpus = tm_map(corpus, stemDocument) # stemming reduces loved, loving, loves, etc. to love
corpus = tm_map(corpus, stripWhitespace) # strip the extra spaces the steps above leave behind
Let's look again. Interestingly, "really" was changed to "realli" (this is OK). I believe the stemmer does this on purpose because many words ending in "y" are adjectives that change to an "i" in other forms; for example, "pretty" becomes "pretti", since "pretty", "prettier", and "prettiest" are all essentially the same for this kind of analysis.
as.character(corpus[[841]])
## [1] "buck head realli expect better food"
https://en.wikipedia.org/wiki/Bag-of-words_model
https://www.r-bloggers.com/using-sparse-matrices-in-r/
A bag of words is a matrix in which the rows are the reviews and the columns are the unique words; each cell holds the number of times the word represented by that column appears in that review.
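As a tiny illustration of the idea, here is a toy bag of words built from two made-up sentences (not part of our data; just to show the shape of the matrix):
toy = VCorpus(VectorSource(c("good food good service", "bad food")))
inspect(DocumentTermMatrix(toy)) # 2 rows (sentences) x 4 columns (bad, food, good, service); cells are counts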
dtm = DocumentTermMatrix(corpus) # this creates the rows and columns with the data
Let’s have a look at some details of dtm before we continue
dtm
## <<DocumentTermMatrix (documents: 1000, terms: 1577)>>
## Non-/sparse entries: 5435/1571565
## Sparsity : 100%
## Maximal term length: 32
## Weighting : term frequency (tf)
Let’s check rows and columns.
nrow(dtm)
## [1] 1000
ncol(dtm)
## [1] 1577
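Before trimming, we can also peek at which stems are common across the corpus; a hedged sketch using tm's findFreqTerms() (the threshold of 50 is arbitrary):
findFreqTerms(dtm, lowfreq = 50) # stems whose total count across all 1000 reviews is at least 50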
Let's trim the sparsest terms. removeSparseTerms(dtm, 0.999) keeps only the words that appear in more than 0.1% of the reviews, which here means dropping any word that shows up in just a single review. The threshold is a bit of an art, influenced by things such as how many documents you have.
dtm = removeSparseTerms(dtm, 0.999) # drop the rarest terms (those appearing in only a single review)
Let’s have a look at some details of dtm again before we continue
dtm
## <<DocumentTermMatrix (documents: 1000, terms: 691)>>
## Non-/sparse entries: 4549/686451
## Sparsity : 99%
## Maximal term length: 12
## Weighting : term frequency (tf)
Let’s check rows and columns again.
nrow(dtm)
## [1] 1000
ncol(dtm)
## [1] 691
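If you want to see how sensitive the term count is to the threshold, you can apply other values to the (already trimmed) dtm; a sketch, noting that trimming at 0.99 after 0.999 gives the same result as trimming at 0.99 directly:
sapply(c(0.99, 0.995, 0.999), function(s) ncol(removeSparseTerms(dtm, s))) # terms kept at each threshold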
Our model will require a dataframe.
dataset = as.data.frame(as.matrix(dtm))
Have a quick look at rough structure.
head(dataset, 2)
## absolut acknowledg actual ago almost also although alway amaz ambianc
## 1 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0
## ambienc amount anoth anyon anyth anytim anyway apolog appet area arent
## 1 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0
## around arriv ask assur ate atmospher attack attent attitud authent
## 1 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0
## averag avoid away awesom awkward babi bachi back bacon bad bagel bakeri
## 1 0 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0 0
## bar bare bartend basic bathroom batter bay bean beat beauti becom beef
## 1 0 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0 0
## beer behind believ belli best better beyond big bill biscuit bisqu bit
## 1 0 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0 0
## bite black bland blow boba boot bother bowl box boy boyfriend bread
## 1 0 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0 0
## break breakfast brick bring brought brunch buck buffet build burger busi
## 1 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0
## butter cafe call came can cant car care cashier char charcoal charg
## 1 0 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0 0
## cheap check chees cheeseburg chef chewi chicken chines chip choos
## 1 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0
## classic clean close cocktail coffe cold color combin combo come comfort
## 1 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0
## compani complain complaint complet consid contain conveni cook cool
## 1 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0
## correct couldnt coupl cours cover cow crazi cream creami crowd crust
## 1 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 1
## curri custom cut cute damn dark date day deal decent decid decor defin
## 1 0 0 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0 0 0
## definit delici delight delish deserv dessert didnt die differ dine
## 1 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0
## dinner dirt dirti disappoint disgrac disgust dish disrespect dog done
## 1 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0
## dont door doubl doubt downtown dress dri driest drink drive duck eat
## 1 0 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0 0
## eaten edibl egg eggplant either els elsewher employe empti end enjoy
## 1 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0
## enough entre equal especi establish even event ever everi everyon
## 1 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0
## everyth excel excus expect experi experienc extra extrem eye fact fail
## 1 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0
## fair famili familiar fan fantast far fare fast favor favorit feel fell
## 1 0 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0 0
## felt filet fill final find fine finish first fish flavor flavorless
## 1 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0
## flower focus folk food found fresh fri friend front frozen full fun
## 1 0 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0 0
## garlic gave generous get give given glad gold gone good got greas great
## 1 0 0 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 1 0 0 0
## greek green greet grill gross group guess guest guy gyro hair half hand
## 1 0 0 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0 0 0
## handl happen happi hard hate head healthi heard heart heat help high
## 1 0 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0 0
## highlight hit home homemad honest hope horribl hot hour hous howev huge
## 1 0 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0 0
## human hummus husband ice ignor ill imagin immedi impecc impress includ
## 1 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0
## incred indian inexpens insid insult interest isnt italian ive job joint
## 1 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0
## joke judg just kept kid kind know known lack ladi larg last late later
## 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## least leav left legit let life light like list liter littl live lobster
## 1 0 0 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0 0 0
## locat long longer look lost lot love lover lukewarm lunch made main make
## 1 0 0 0 0 0 0 1 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0 0 0
## mall manag mani margarita mari may mayb meal mean meat mediocr meh melt
## 1 0 0 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0 0 0
## menu mexican mid min mind minut miss mistak moist mom money mood mouth
## 1 0 0 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0 0 0
## much multipl mushroom music must nacho nasti need needless neighborhood
## 1 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0
## never new next nice nicest night none note noth now offer old omg one
## 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## opportun option order other outsid outstand oven overal overcook overpr
## 1 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0
## overwhelm owner pace pack paid pancak paper par part parti pass pasta
## 1 0 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0 0
## patio pay peanut peopl perfect person pho phoenix pictur piec pita pizza
## 1 0 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0 0
## place plate play pleas pleasant plus point poor pop pork portion possibl
## 1 1 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0 0
## potato prepar present pretti price probabl profession promis prompt
## 1 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0
## provid public pull pure put qualiti quick quit rare rate rather rave
## 1 0 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0 0
## read real realiz realli reason receiv recent recommend red refil regular
## 1 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0
## relax remind restaur return review rice right roast roll room rude run
## 1 0 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0 0
## sad said salad salmon salsa salt sandwich sashimi sat satisfi sauc say
## 1 0 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0 0
## scallop seafood season seat second see seem seen select serious serv
## 1 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0
## server servic set sever shop show shrimp sick side sign similar simpl
## 1 0 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0 0
## simpli sinc singl sit six slice slow small smell soggi someon someth
## 1 0 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0 0
## soon soooo sore soup special spend spice spici spot staff stale star
## 1 0 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0 0
## start station stay steak step stick still stir stomach stop strip stuf
## 1 0 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0 0
## stuff style subpar subway suck sugari suggest summer super sure surpris
## 1 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0
## sushi sweet tabl taco take talk tap tapa tartar tast tasteless tasti tea
## 1 0 0 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0 0 0
## tell ten tender terribl textur thai that thin thing think third though
## 1 0 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0 0
## thought thumb time tip toast today told took top tot total touch toward
## 1 0 0 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0 0 0
## town treat tri trip tuna twice two unbeliev undercook underwhelm
## 1 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0
## unfortun unless use valley valu vega veget vegetarian ventur vibe
## 1 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0
## vinegrett visit wait waiter waitress walk wall want warm wasnt wast
## 1 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0
## watch water way week well went weve white whole wife will wine wing
## 1 0 0 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0 0 0
## without wonder wont word work worker world wors worst worth wouldnt wow
## 1 0 0 0 0 0 0 0 0 0 0 0 1
## 2 0 0 0 0 0 0 0 0 0 0 0 0
## wrap wrong year yet youd your yummi zero
## 1 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0
Remembering that our dtm holds only the review content (the independent variables), we need to add the Liked variable (the dependent variable).
# add a Liked column to dataset, pulling its values from the original dataset
dataset$Liked = dataset_original$Liked
dataset$Liked = factor(dataset$Liked, levels = c(0, 1))
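A couple of hedged sanity checks before splitting (the expected counts follow from the outputs above):
dim(dataset)         # expect 1000 rows and 692 columns: 691 word stems + Liked
table(dataset$Liked) # how many reviews were liked (1) vs not (0)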
# install.packages('caTools')
library(caTools)
set.seed(123)
split = sample.split(dataset$Liked, SplitRatio = 0.8)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
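Again a quick hedged check that the split looks sensible:
nrow(training_set)        # expect 800 rows (80% of 1000)
nrow(test_set)            # expect 200 rows
table(training_set$Liked) # class distribution in the training set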
The idea is to train a classification model that learns the correlations between the words in a review and whether it was liked or not. With this classification model in hand, we can then predict whether a future review of the restaurant (in this case) is positive (liked) or not.
# install.packages('randomForest')
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
classifier = randomForest(x = training_set[-692], # x: all the word columns; drop the dependent variable Liked, which is the last column (692 here: 691 terms + Liked)
                          y = training_set$Liked,
                          ntree = 10) # number of trees
Using the test set, we'll predict and then build the confusion matrix.
y_pred = predict(classifier, newdata = test_set[-692]) # again drop the Liked column (692)
cm = table(test_set[, 692], y_pred) # confusion matrix: actual Liked vs predicted
cm
##    y_pred
##       0   1
##   0  71  24
##   1  29  76
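Since the whole point is to score future reviews, here is a hedged sketch of how one might classify a brand-new review with the trained model. The review text and variable names below are made up, and it assumes tm's dictionary control keeps the full training vocabulary as columns (the usual pattern for aligning new documents with a trained model).
new_review = "The food was great and the staff were lovely"   # hypothetical review
new_corpus = VCorpus(VectorSource(new_review))
new_corpus = tm_map(new_corpus, content_transformer(tolower))
new_corpus = tm_map(new_corpus, removeNumbers)
new_corpus = tm_map(new_corpus, removePunctuation)
new_corpus = tm_map(new_corpus, removeWords, stopwords())
new_corpus = tm_map(new_corpus, stemDocument)
new_corpus = tm_map(new_corpus, stripWhitespace)
# build the document-term matrix against the training vocabulary so the columns line up
new_dtm  = DocumentTermMatrix(new_corpus, control = list(dictionary = Terms(dtm)))
new_data = as.data.frame(as.matrix(new_dtm))
predict(classifier, newdata = new_data) # 1 = liked, 0 = not liked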
knitr::include_graphics("Confusion_Matrix_Explained.png")
Model evaluation steps from the lecture
Evaluate the performance of each of these models; accuracy alone is not enough, so you should also look at other performance metrics like Precision (measuring exactness), Recall (measuring completeness), and the F1 Score (a compromise between Precision and Recall).
Root-mean-square deviation is another way to measure performance.
https://en.wikipedia.org/wiki/Root-mean-square_deviation
Another good source on metrics for measuring model effectiveness:
https://medium.com/usf-msds/choosing-the-right-metric-for-machine-learning-models-part-1-a99d7d7414e4
The usefulness of each error metric depends on the objective and the problem we are trying to solve. When someone tells you that "USA is the best country", the first question you should ask is on what basis the statement is being made: are we judging each country by its economic status, its health facilities, etc.? Similarly, each machine learning model is trying to solve a problem with a different objective using a different dataset, so it is important to understand the context before choosing a metric.
TP (true positives) = 76
TN (true negatives) = 71
FP (false positives) = 24
FN (false negatives) = 29
Accuracy = (TP + TN) / (TP + TN + FP + FN)
(76+71)/200
## [1] 0.735
Precision = TP / (TP + FP)
76/(76+24)
## [1] 0.76
Recall = TP / (TP + FN)
76/(76+29)
## [1] 0.7238095
F1 Score = 2 * Precision * Recall / (Precision + Recall)
(2*0.76*0.7238095)/(0.76+0.7238095)
## [1] 0.7414634
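A small hedged helper that wraps the four metrics above into one function (assuming a 2x2 table laid out as table(actual, predicted) with the positive class second):
classification_metrics = function(cm) {
  tn = cm[1, 1]; fp = cm[1, 2]
  fn = cm[2, 1]; tp = cm[2, 2]
  precision = tp / (tp + fp)
  recall    = tp / (tp + fn)
  c(accuracy  = (tp + tn) / sum(cm),
    precision = precision,
    recall    = recall,
    f1        = 2 * precision * recall / (precision + recall))
}
classification_metrics(cm)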
=========================
Github files; https://github.com/ghettocounselor
Useful PDF for common questions in Lectures;
https://github.com/ghettocounselor/Machine_Learning/blob/master/Machine-Learning-A-Z-Q-A.pdf