Natural Language Processing (NLP) is broadly defined as the automatic manipulation of natural language, such as speech and text, by software. In this project, we will explore the review texts, build a supervised machine learning model (a naive Bayes classifier) on the text data, and compare it with a random forest model, using the popular statistical programming language R.
We extracted Yale New Haven Hospital (YNHH) review data from the Yelp website. A review with more than 3 stars counts as a recommendation and is labeled 1; a review with 3 stars or fewer is labeled 0.
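As a minimal sketch of this labeling rule, assuming the star ratings are available as a numeric vector (the `stars` vector below is hypothetical and not part of the spreadsheet shown later):

```r
# Hypothetical star ratings for five reviews (illustration only)
stars <- c(5, 1, 3, 4, 2)

# More than 3 stars -> 1 (recommended); 3 stars or fewer -> 0
target <- ifelse(stars > 3, 1, 0)
target
```

Note that a 3-star review falls on the "not recommended" side of the rule.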
library(readxl)
## Warning: package 'readxl' was built under R version 3.6.2
ynhhreviews <- read_excel("ynhhreviews.xlsx")
Let’s take a look at the data
ynhhreviews$target <- as.factor(ynhhreviews$target)
str(ynhhreviews)
## Classes 'tbl_df', 'tbl' and 'data.frame': 40 obs. of 4 variables:
## $ id : num 1 2 3 4 5 6 7 8 9 10 ...
## $ keyword: chr "nice" "worst" "beautiful" "excellent" ...
## $ text : chr "Appointment was made at the children's surgery center, text message to confirm to be there at 9:15, and a voice"| __truncated__ "The worst ER I've ever seen. I waited in pain for 8 hours. The ladies at registration had extreme attitudes whe"| __truncated__ "I feel weird posting 5 stars because I have only completed a hospital tour as part of a childbirth class. Howev"| __truncated__ "Excellent service!! Monday, July 8, 2019 admitted 4:16pm discharged at 7:02pm I drove myself to the emergency r"| __truncated__ ...
## $ target : Factor w/ 2 levels "0","1": 2 1 2 2 1 1 1 2 2 1 ...
table(ynhhreviews$target)
##
## 0 1
## 27 13
We have 27 negative reviews and 13 positive reviews.
1- We will create a corpus, which is a collection of documents, using the ‘tm’ package.
#install.packages("tm")
library(tm)
## Warning: package 'tm' was built under R version 3.6.3
## Loading required package: NLP
review_corpus <- VCorpus(VectorSource(ynhhreviews$text))
2- Text cleaning:
review_corpus_clean <- tm_map(review_corpus,content_transformer(tolower)) #converting to lower case letters
review_corpus_clean <- tm_map(review_corpus_clean,removeNumbers) #removing numbers
review_corpus_clean <- tm_map(review_corpus_clean,removeWords,stopwords()) #removing stop words
review_corpus_clean <- tm_map(review_corpus_clean,removePunctuation) #removing punctuation
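To make the effect of these cleaning steps concrete, here is a rough base-R sketch applied to a single sentence. It is not how ‘tm’ works internally; the helper `clean_review` and its three-word stop list are assumptions made for illustration (the real `stopwords()` list is much longer).

```r
# Illustrative helper: mimics the order of the cleaning steps on one string
clean_review <- function(x, stop_words = c("i", "in", "for")) {
  x <- tolower(x)                                   # lower case
  x <- gsub("[[:digit:]]+", "", x)                  # drop numbers
  pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")
  x <- gsub(pattern, "", x, perl = TRUE)            # drop stop words
  x <- gsub("[[:punct:]]+", "", x)                  # drop punctuation
  trimws(gsub("[[:space:]]+", " ", x))              # collapse leftover spaces
}

clean_review("I waited in pain for 8 hours.")
# "waited pain hours"
```

The last line of the helper plays the role that `stripWhitespace` plays in the pipeline: removing words and numbers leaves gaps that need to be collapsed.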
3- Word Stemming:
library(SnowballC)
review_corpus_clean <- tm_map(review_corpus_clean,stemDocument)
review_corpus_clean <- tm_map(review_corpus_clean,stripWhitespace) #removing extra whitespace left by the steps above
4- Now we will visualize the words
library(wordcloud)
## Warning: package 'wordcloud' was built under R version 3.6.3
## Loading required package: RColorBrewer
wordcloud(review_corpus_clean,min.freq = 100, scale = c(4,.1),max.words = 50,random.order = FALSE, random.color = FALSE, colors = brewer.pal(6, 'Dark2'))
5- Tokenization, which means splitting the reviews into individual terms and counting them per review
review_dtm <- DocumentTermMatrix(review_corpus_clean)
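To show what a document-term matrix contains, the following base-R sketch builds one by hand for three tiny, already cleaned documents. The documents and the names `docs`, `vocab` and `dtm` are made up for illustration; `DocumentTermMatrix` stores the same kind of counts in a sparse format.

```r
# Three hypothetical cleaned and stemmed reviews
docs <- c("worst er wait", "excel servic nice staff", "wait wait long")

# Split each review into terms and collect the vocabulary
tokens <- strsplit(docs, " ", fixed = TRUE)
vocab <- sort(unique(unlist(tokens)))

# One row per document, one column per term, cells hold term counts
dtm <- t(vapply(tokens,
                function(tok) tabulate(match(tok, vocab), nbins = length(vocab)),
                integer(length(vocab))))
dimnames(dtm) <- list(paste0("doc", seq_along(docs)), vocab)
dtm

# Share of cells that are zero
mean(dtm == 0)
```

Even in this tiny example most cells are zero, because each review uses only a few of the terms in the vocabulary.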
We could also do all the previous steps in a single call
review_dtm2 <- DocumentTermMatrix(review_corpus,
                                  control = list(tolower = TRUE,
                                                 removeNumbers = TRUE,
                                                 stopwords = TRUE,
                                                 removePunctuation = TRUE,
                                                 stemming = TRUE))
6- Now we will divide the data into training and test sets to evaluate how the predictive model performs. We will use a 0.7 ratio, i.e. 28 reviews for training and 12 for testing.
Let’s first check the baseline accuracy, which is the proportion of the majority class
prop.table(table(ynhhreviews$target))
##
## 0 1
## 0.675 0.325
review_dtm_train <- review_dtm[c(1:20, 25:32),]
review_dtm_test <- review_dtm[c(21:24, 33:40),]
review_train_labels <- ynhhreviews[c(1:20, 25:32),]$target
review_test_labels <- ynhhreviews[c(21:24, 33:40),]$target
#let's check whether the subsets are representative of the complete set of ynhh review data
prop.table(table(review_train_labels))
## review_train_labels
## 0 1
## 0.6785714 0.3214286
prop.table(table(review_test_labels))
## review_test_labels
## 0 1
## 0.6666667 0.3333333
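Both subsets closely match the class proportions of the full data set (about 67% negative and 33% positive).
7- ‘DocumentTermMatrix’ results in a matrix that contains zeros in most of its cells, a property called sparsity. We will keep only the frequent terms and convert the numeric counts to categorical values: "Yes" if the term appears in the review and "No" otherwise.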
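The split above uses hand-picked row ranges. As a hedged alternative, a reproducible stratified random split could look like the sketch below. The seed value and the `labels` vector (rebuilt here from the class counts 27 and 13) are assumptions for illustration, not part of the original analysis.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

# Stand-in for ynhhreviews$target: 27 negative and 13 positive reviews
labels <- factor(c(rep(0, 27), rep(1, 13)))

# Sample about 70% of the row indices within each class
train_idx <- unlist(lapply(split(seq_along(labels), labels), function(idx) {
  idx[sample.int(length(idx), size = round(0.7 * length(idx)))]
}), use.names = FALSE)
test_idx <- setdiff(seq_along(labels), train_idx)

# Class proportions in each subset
prop.table(table(labels[train_idx]))
prop.table(table(labels[test_idx]))
```

Sampling within each class keeps the positive/negative ratio nearly identical in both subsets, which is what the manual check above verifies.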
review_freq_words <- findFreqTerms(review_dtm_train,5)
str(review_freq_words)
## chr [1:74] "also" "ask" "avail" "bed" "call" "came" "can" "care" "chest" ...
#findFreqTerms returns the words appearing at least five times in the review_dtm_train matrix
review_dtm_freq_train <- review_dtm_train[,review_freq_words]
review_dtm_freq_test <- review_dtm_test[,review_freq_words]
convert_counts <- function(x){
  #"Yes" if the term occurs at least once in the review, "No" otherwise
  ifelse(x>0,"Yes","No")
}
review_train <- apply(review_dtm_freq_train,MARGIN = 2,convert_counts)
review_test <- apply(review_dtm_freq_test,MARGIN = 2,convert_counts)
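A small sketch of what this conversion does, using a made-up 3 x 2 count matrix (the `counts` matrix and its terms are hypothetical):

```r
# Hypothetical term counts: rows are reviews, columns are terms
counts <- matrix(c(0, 2, 1,
                   0, 0, 3),
                 nrow = 3,
                 dimnames = list(paste0("doc", 1:3), c("wait", "nice")))

convert_counts <- function(x) ifelse(x > 0, "Yes", "No")

# MARGIN = 2 applies the function column by column (term by term)
presence <- apply(counts, MARGIN = 2, convert_counts)
presence
```

The result is a character matrix of the same shape, so a count of 3 and a count of 1 both become "Yes": only the presence of a term is kept, not how often it occurs.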
1- Train the model on the data
#install.packages("e1071")
library(e1071)
## Warning: package 'e1071' was built under R version 3.6.3
review_classifier <- naiveBayes(review_train,review_train_labels)
2- Evaluate model performance
bayes_test_pred <- predict(review_classifier,review_test)
library(gmodels)
## Warning: package 'gmodels' was built under R version 3.6.3
table(bayes_test_pred, review_test_labels)
## review_test_labels
## bayes_test_pred 0 1
## 0 7 2
## 1 1 2
Let’s calculate the accuracy: the model classified 7 + 2 = 9 of the 12 test reviews correctly, so accuracy = 9/(9+3) = 0.75, or 75%
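The same figure can be computed directly from the confusion matrix. The sketch below re-enters the table printed above by hand (the object name `cm` is ours):

```r
# Confusion matrix from the naive Bayes output: rows = predicted, columns = actual
cm <- matrix(c(7, 1, 2, 2), nrow = 2,
             dimnames = list(predicted = c("0", "1"), actual = c("0", "1")))

# Correct predictions sit on the diagonal
accuracy <- sum(diag(cm)) / sum(cm)
accuracy
# 0.75
```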
Random forest modeling needs a different preparation of the sparse matrix
#remove sparse terms, then convert the matrix to a data frame
review_sparse <- removeSparseTerms(review_dtm, 0.995)
review_data <- as.data.frame(as.matrix(review_sparse))
colnames(review_data) <- make.names(colnames(review_data))
review_data$target <- ynhhreviews$target
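A rough base-R sketch of the idea behind a sparsity threshold follows. It assumes the threshold acts as a document-frequency cutoff (a term is kept when it appears in more than `n_docs * (1 - sparse)` documents); the exact rule inside `removeSparseTerms` may differ, and the helper `keep_terms` and the toy matrix are hypothetical.

```r
# Hypothetical counts: 4 documents, 3 terms
m <- matrix(c(1, 0, 0, 0,    # "rare":   appears in 1 document
              1, 1, 0, 1,    # "common": appears in 3 documents
              0, 1, 0, 1),   # "mid":    appears in 2 documents
            nrow = 4,
            dimnames = list(paste0("doc", 1:4), c("rare", "common", "mid")))

# Assumed rule: keep terms present in more than n_docs * (1 - sparse) documents
keep_terms <- function(m, sparse) {
  doc_freq <- colSums(m > 0)
  colnames(m)[doc_freq > nrow(m) * (1 - sparse)]
}

keep_terms(m, 0.995)  # very permissive threshold
keep_terms(m, 0.6)    # stricter threshold
```

Under this assumption, a threshold as high as 0.995 on a collection of only 40 reviews would probably keep every term, since any term that occurs at all appears in at least 1 of 40 documents; a lower value would be needed to actually shrink the matrix.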
#dividing the data into training and test sets (same rows as before)
review_train <- review_data[c(1:20, 25:32),]
review_test <- review_data[c(21:24, 33:40),]
library(randomForest)
## Warning: package 'randomForest' was built under R version 3.6.3
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
review_model <- randomForest(target ~ ., data = review_train)
random_test_pred <- predict(review_model, newdata = review_test)
table(random_test_pred, review_test$target)
##
## random_test_pred 0 1
## 0 8 4
## 1 0 0
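The accuracy of the random forest model is 8/(8+4) = 0.67, or 67%, while naive Bayes reaches 75%, so naive Bayes is more accurate than random forest on this test set. Note that the random forest predicted 0 for every test review, so its accuracy equals the share of negative reviews in the test set.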
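Accuracy alone hides how each model treats the positive reviews. As a sketch, the two confusion matrices printed above can be re-entered by hand and summarized with accuracy and recall for the positive class (the helper `summarise_cm` is ours, not from any package):

```r
# Rows = predicted, columns = actual, copied from the two tables above
bayes_cm <- matrix(c(7, 1, 2, 2), nrow = 2,
                   dimnames = list(predicted = c("0", "1"), actual = c("0", "1")))
rf_cm <- matrix(c(8, 0, 4, 0), nrow = 2,
                dimnames = list(predicted = c("0", "1"), actual = c("0", "1")))

# Accuracy and the share of truly positive reviews that were predicted positive
summarise_cm <- function(cm) {
  c(accuracy = sum(diag(cm)) / sum(cm),
    recall_positive = cm["1", "1"] / sum(cm[, "1"]))
}

rbind(naive_bayes = summarise_cm(bayes_cm),
      random_forest = summarise_cm(rf_cm))
```

Naive Bayes finds 2 of the 4 positive test reviews, while the random forest finds none of them. With only 12 test reviews, the gap in accuracy between the two models comes down to a single review.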