Background

Natural Language Processing (NLP) is broadly defined as the automatic manipulation of natural language, such as speech and text, by software. In this project, we explore review texts, build a supervised machine learning model (a random forest) on the text data, and compare it with a naive Bayes classifier, using the popular statistical programming language R.

We extracted Yale New Haven Hospital (YNHH) review data from the Yelp website. A review with more than 3 stars is treated as a recommendation and labelled 1; a review with 3 stars or fewer is labelled 0.
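
As a minimal sketch of this labelling rule, assuming the star ratings are available as a numeric vector (the `stars` vector here is hypothetical and is not a column of the original data set):

```r
# Hypothetical star ratings, used only for illustration.
stars <- c(5, 1, 4, 3, 2)

# More than 3 stars -> 1 (recommended); 3 stars or fewer -> 0.
target <- factor(ifelse(stars > 3, 1, 0), levels = c(0, 1))
table(target)
```

Note that a 3-star review falls on the negative side of the rule.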

Import data

library(readxl)

## Warning: package 'readxl' was built under R version 3.6.2

ynhhreviews <- read_excel("ynhhreviews.xlsx")

Let’s take a look at the data

ynhhreviews$target <- as.factor(ynhhreviews$target)
str(ynhhreviews)

## Classes 'tbl_df', 'tbl' and 'data.frame':    40 obs. of  4 variables:
##  $ id     : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ keyword: chr  "nice" "worst" "beautiful" "excellent" ...
##  $ text   : chr  "Appointment was made at the children's surgery center, text message to confirm to be there at 9:15, and a voice"| __truncated__ "The worst ER I've ever seen. I waited in pain for 8 hours. The ladies at registration had extreme attitudes whe"| __truncated__ "I feel weird posting 5 stars because I have only completed a hospital tour as part of a childbirth class. Howev"| __truncated__ "Excellent service!! Monday, July 8, 2019 admitted 4:16pm discharged at 7:02pm I drove myself to the emergency r"| __truncated__ ...
##  $ target : Factor w/ 2 levels "0","1": 2 1 2 2 1 1 1 2 2 1 ...

table(ynhhreviews$target)

## 
##  0  1 
## 27 13

We have 27 negative reviews and 13 positive reviews.

1- We will create a corpus, which is a collection of documents, using the ‘tm’ package.

#install.packages("tm")
library(tm)

## Warning: package 'tm' was built under R version 3.6.3

## Loading required package: NLP

review_corpus <- VCorpus(VectorSource(ynhhreviews$text))

2- Text cleaning:

review_corpus_clean <- tm_map(review_corpus, content_transformer(tolower)) # convert to lower case
review_corpus_clean <- tm_map(review_corpus_clean, removeNumbers) # remove numbers
review_corpus_clean <- tm_map(review_corpus_clean, removeWords, stopwords()) # remove stop words
review_corpus_clean <- tm_map(review_corpus_clean, removePunctuation) # remove punctuation
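
To see what these transformations do to a single review, here is a rough base R approximation applied to the opening of the second review. It is only a sketch: it uses `gsub()` instead of the ‘tm’ functions, and it leaves out stop-word removal.

```r
# Opening sentences of the second review in the data.
review <- "The worst ER I've ever seen. I waited in pain for 8 hours."

clean <- tolower(review)                     # lower case
clean <- gsub("[[:digit:]]+", "", clean)     # drop numbers
clean <- gsub("[[:punct:]]+", "", clean)     # drop punctuation
clean <- trimws(gsub("\\s+", " ", clean))    # collapse repeated spaces
clean
```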

3- Word Stemming:

library(SnowballC)
review_corpus_clean <- tm_map(review_corpus_clean,stemDocument)
review_corpus_clean <- tm_map(review_corpus_clean, stripWhitespace) # collapse the extra whitespace left by the previous steps

4- Now we will visualize the words

library(wordcloud)

## Warning: package 'wordcloud' was built under R version 3.6.3

## Loading required package: RColorBrewer

wordcloud(review_corpus_clean,min.freq = 100, scale = c(4,.1),max.words = 50,random.order = FALSE, random.color = FALSE, colors = brewer.pal(6, 'Dark2'))

5- Tokenization, which means splitting the reviews into individual terms

review_dtm <- DocumentTermMatrix(review_corpus_clean)
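
The structure of a document-term matrix can be illustrated with a small base R sketch. The three documents below are hypothetical, and the code only mimics the idea of counting terms per document; it is not how ‘tm’ builds the matrix internally.

```r
# Hypothetical, already-cleaned documents.
docs <- c("wait pain wait", "nice staff", "staff wait")

tokens <- strsplit(docs, " ", fixed = TRUE)   # split each document into terms
terms  <- sort(unique(unlist(tokens)))        # vocabulary: one column per term

# Count how often each term occurs in each document (rows = documents).
dtm <- t(vapply(tokens,
                function(tok) tabulate(match(tok, terms), nbins = length(terms)),
                integer(length(terms))))
dimnames(dtm) <- list(Docs = as.character(seq_along(docs)), Terms = terms)
dtm
```

Each row is a document and each column is a term, so most cells are zero once the vocabulary grows.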

We could do all the previous steps in a single call:

review_dtm2 <- DocumentTermMatrix(review_corpus,
                              control = list(tolower = TRUE,
                                            removeNumbers = TRUE,
                                            stopwords = TRUE,
                                            removePunctuation = TRUE,
                                            stemming = TRUE))

6- Now we will divide the data into training and test sets to evaluate how the predictive model performs. We use a 0.7 ratio: 28 reviews for training and 12 for testing.

Let’s first check the baseline accuracy, that is, the accuracy obtained by always predicting the most frequent class:

prop.table(table(ynhhreviews$target))

## 
##     0     1 
## 0.675 0.325
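
As a small sketch, the baseline can be read off programmatically as the largest class proportion. The label vector below is rebuilt from the class counts reported earlier (27 negative, 13 positive) rather than loaded from the spreadsheet.

```r
# Class counts reported above: 27 negative (0) and 13 positive (1) reviews.
labels <- factor(c(rep(0, 27), rep(1, 13)))

# Always predicting the majority class is correct this often.
baseline_accuracy <- max(prop.table(table(labels)))
baseline_accuracy
```

A model is only useful if it beats this value on unseen data.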

review_dtm_train <- review_dtm[c(1:20, 25:32),]
review_dtm_test <- review_dtm[c(21:24, 33:40),]
review_train_labels <- ynhhreviews[c(1:20, 25:32),]$target
review_test_labels <- ynhhreviews[c(21:24, 33:40),]$target
# check whether the class proportions in the subsets match the full YNHH review data
prop.table(table(review_train_labels))

## review_train_labels
##         0         1 
## 0.6785714 0.3214286

prop.table(table(review_test_labels))

## review_test_labels
##         0         1 
## 0.6666667 0.3333333

Both subsets closely match the class proportions of the full data set (about 68% negative and 32% positive).

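7- ‘DocumentTermMatrix’ produces a matrix with zeros in most of its cells, a property called sparsity. We keep only the terms that appear at least five times in the training data and convert the numeric counts to categorical Yes/No values, so that naiveBayes treats each term as a categorical feature.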

review_freq_words <- findFreqTerms(review_dtm_train,5)
str(review_freq_words)

##  chr [1:74] "also" "ask" "avail" "bed" "call" "came" "can" "care" "chest" ...

# findFreqTerms() returns the terms appearing at least five times in the review_dtm_train matrix

review_dtm_freq_train <- review_dtm_train[,review_freq_words]
review_dtm_freq_test <- review_dtm_test[,review_freq_words]
convert_counts <- function(x) {
  ifelse(x > 0, "Yes", "No")
}

review_train <- apply(review_dtm_freq_train,MARGIN = 2,convert_counts)
review_test <- apply(review_dtm_freq_test,MARGIN = 2,convert_counts)
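
To make the conversion concrete, here is a small sketch that applies the same function to a hypothetical count matrix (the `counts` values and term names are made up for illustration):

```r
convert_counts <- function(x) {
  ifelse(x > 0, "Yes", "No")
}

# Hypothetical counts: 3 reviews (rows) by 2 terms (columns).
counts <- matrix(c(0, 2, 1,
                   1, 0, 0),
                 nrow = 3,
                 dimnames = list(NULL, c("care", "wait")))

# MARGIN = 2 applies the function column by column, i.e. term by term.
converted <- apply(counts, MARGIN = 2, convert_counts)
converted
```

Any count above zero becomes "Yes", so the model only sees whether a term is present, not how often it occurs.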

Bayes Modeling

1- Train the model on the training data

#install.packages("e1071")
library(e1071)

## Warning: package 'e1071' was built under R version 3.6.3

review_classifier <- naiveBayes(review_train,review_train_labels)

2- Evaluate model performance

bayes_test_pred <- predict(review_classifier,review_test)
library(gmodels)

## Warning: package 'gmodels' was built under R version 3.6.3

table(bayes_test_pred, review_test_labels)

##                review_test_labels
## bayes_test_pred 0 1
##               0 7 2
##               1 1 2

Let’s calculate the accuracy, the share of correct predictions on the diagonal of the table: (7+2)/(9+3) = 9/12 = 0.75, or 75%.
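
The same calculation can be sketched in R. The confusion matrix is typed in by hand from the output above, so the snippet runs without the fitted model.

```r
# Confusion matrix copied from the output above
# (rows: predicted class, columns: actual class).
bayes_cm <- matrix(c(7, 1,
                     2, 2),
                   nrow = 2,
                   dimnames = list(predicted = c("0", "1"),
                                   actual = c("0", "1")))

# Accuracy = correct predictions (diagonal) / all predictions.
bayes_accuracy <- sum(diag(bayes_cm)) / sum(bayes_cm)
bayes_accuracy
```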

Random forest modeling

Random forest modeling needs a different preparation of the sparse matrix.

# drop very sparse terms (sparsity above 0.995), then convert the matrix to a data frame
review_sparse <- removeSparseTerms(review_dtm, 0.995)
review_data <- as.data.frame(as.matrix(review_sparse))
colnames(review_data) <- make.names(colnames(review_data)) # make the terms valid column names
review_data$target <- ynhhreviews$target

# divide the data into training and test sets
review_train <- review_data[c(1:20, 25:32),]
review_test <- review_data[c(21:24, 33:40),]

library(randomForest)

## Warning: package 'randomForest' was built under R version 3.6.3

## randomForest 4.6-14

## Type rfNews() to see new features/changes/bug fixes.

review_model <- randomForest(target ~ ., data = review_train)

random_test_pred <- predict(review_model, newdata = review_test)
table(random_test_pred, review_test$target)

##                 
## random_test_pred 0 1
##                0 8 4
##                1 0 0

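The accuracy of the random forest model is 8/(8+4) = 0.67, or 67%, compared with 75% for naive Bayes, so naive Bayes is more accurate on this test set. Note that the random forest predicted every test review as negative, so its accuracy equals the share of negative reviews in the test set.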
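
Accuracy alone hides how each model treats the positive reviews. As a sketch, the recall for the positive class (the share of actual positive reviews that were predicted as positive) can be computed from the two confusion matrices reported above. The matrices are typed in by hand, and `recall_positive` is a helper defined here for illustration.

```r
# Confusion matrices copied from the outputs above
# (rows: predicted class, columns: actual class).
cm_names <- list(predicted = c("0", "1"), actual = c("0", "1"))
bayes_cm <- matrix(c(7, 1, 2, 2), nrow = 2, dimnames = cm_names)
rf_cm    <- matrix(c(8, 0, 4, 0), nrow = 2, dimnames = cm_names)

# Recall for class "1": correctly predicted positives / all actual positives.
recall_positive <- function(cm) cm["1", "1"] / sum(cm[, "1"])

bayes_recall <- recall_positive(bayes_cm)
rf_recall    <- recall_positive(rf_cm)
c(bayes = bayes_recall, random_forest = rf_recall)
```

On this test set, naive Bayes finds half of the positive reviews, while the random forest finds none of them.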

NLP random forest model vs naive bayes prediction model

Amany Marey

March 12, 2020
