This is a web scraped social media review site of two chriopractic clinics offering massage, one low priced grocery store, and one high end massage retreat.The names of the doctors have been replaced with ‘DOCTOR’, and name of chiropracic facility to CHIROPRACTIC. The names of the high end massage retreat have been replaced with ‘HIGH END SPA’. The name of the low cost grocery store was replaced with ‘LOW COST GROCERY STORE.’
This data table originally had 1338 observations, but that was an error due to copy and paste in Excel, so there is a need to remove empty rows after reading in the data if your copy of RStudio reads in those empty rows. After reading in the data there will be 516 reviews of mixed ratings for these business models. There aren’t many reviews lower than four stars for the chiropractic clinics, but there is a lot of variation in the high end massage retreat and low cost grocery store reviews. These businesses are in the Corona area and the first to be listed when typing massage, except for the grocery store, because it was directly typed in. Note that the social media site will send your api information on your demographics to these businesses when extracting the data, so you should have an alias.
library(DT)
library(tidyverse)
library(dplyr)
library(lubridate)
library(tm)
library(SnowballC)
library(wordcloud)
library(ggplot2)
library(textstem)
library(stringr)
library(visNetwork)
library(igraph)
#the following packages are needed by throw errors on some of the functions in dplyr and the tidyvere that make this script not run, so I am adding them to a chunk in the ML section to use them instead, after the other commands have been entered.
# library(RANN) #this pkg supplements caret for one out bag validation
# library(e1071)
# library(caret)
# library(randomForest)
# library(MASS)
# library(gbm)
reviews <- read.csv('ReviewsMassageChiropractorYelp_withCompanyNamesOmitted.csv',
sep=',',header=TRUE, na.strings=c('',' ','NA','NULL'))
Clean up this data of NA rows and empty fields if you have more than 516 observations. You should have five columns.
Reviews <- reviews[complete.cases(reviews),]
colnames(Reviews)
## [1] "review"
## [2] "rating_last_first_if_multipleUpdated"
## [3] "site"
## [4] "LowAvgHighCost"
## [5] "businessType"
You can download this data to follow along with this DT datatable.
Reviews_DT <- datatable(data=Reviews, rownames=FALSE,
extensions=c('Buttons','Responsive'),
filter=list(position='top'),
options=list( dom='Bfrtip',scrollX = TRUE, scrollY=TRUE,
buttons=c('colvis','csv'),
language=list(sSearch='Filter:')
)
)
Reviews_DT
The rating has more than one rating, separated by a comma for those reviews that are updated and the other reviews displayed have different ratings. We will see this after running the next chunk. The first listed value is the rating for the latest review, and the subsequent ratings (1-5) are for the next subsequent reviews backtracking in time. Each review will have a date listed before each previous review that the later review updated.
unique(Reviews$rating_last_first_if_multipleUpdated)
## [1] 5 1 3 5,5 4 4,3,1 2
## [8] 5,3 4,4 1,1 2,2 3,5 5,2 2,4
## [15] 1,1,1 4,4,1,3,3 5,4
## 17 Levels: 1 1,1 1,1,1 2 2,2 2,4 3 3,5 4 4,3,1 4,4 4,4,1,3,3 5 5,2 5,3 ... 5,5
You can see from the above there are various review values, and we could choose to keep these or make separate dummy fields for how many time the review was updated from last to first review. The most updates appears to be five times, so we could create dummy fields to capture that rating and if it is the first,second,…, or fifth review and rating for each reviewer. Why not lets just do this. And so the next chunk will add those dummy fields.
rating <- strsplit(as.character(paste(Reviews$rating_last_first_if_multipleUpdated)), split=',')
Reviews$mostRecentVisit_rating <- as.character(paste(lapply(rating,'[',1)))
Reviews$lastVisit_rating <- as.character(paste(lapply(rating,'[',2)))
Reviews$twoVisitsPrior_rating <- as.character(paste(lapply(rating,'[',3)))
Reviews$threeVisitsPrior_rating <- as.character(paste(lapply(rating,'[',4)))
Reviews$fourVisitsPrior_rating <- as.character(paste(lapply(rating,'[',5)))
Reviews1 <- Reviews[with(Reviews, order(fourVisitsPrior_rating,
threeVisitsPrior_rating,
twoVisitsPrior_rating,
lastVisit_rating, decreasing=FALSE)),]
# head(Reviews1)
The order by decreasing=FALSE had to be used to see those sequential visits from last visit, because these fields are character fields. When using predictive analytics they can be changed to factor, or we can change them to numeric.
Reviews1$mostRecentVisit_rating <- as.numeric(paste(Reviews1$mostRecentVisit_rating))
Reviews1$lastVisit_rating <- as.numeric(paste(Reviews1$lastVisit_rating))
## Warning: NAs introduced by coercion
Reviews1$twoVisitsPrior_rating <- as.numeric(paste(Reviews1$twoVisitsPrior_rating))
## Warning: NAs introduced by coercion
Reviews1$threeVisitsPrior_rating <- as.numeric(paste(Reviews1$threeVisitsPrior_rating))
## Warning: NAs introduced by coercion
Reviews1$fourVisitsPrior_rating <- as.numeric(paste(Reviews1$fourVisitsPrior_rating))
## Warning: NAs introduced by coercion
str(Reviews1)
## 'data.frame': 516 obs. of 10 variables:
## $ review : Factor w/ 516 levels "\t\nAffinity Z.\nCorona, CA\n64 friends\n216 reviews\n23 photos\n\n\n\n\n\n\n\n\n Affinity Z.\n\n\t5/6/2014\nSE"| __truncated__,..: 408 414 409 31 5 217 79 168 430 49 ...
## $ rating_last_first_if_multipleUpdated: Factor w/ 17 levels "1","1,1","1,1,1",..: 12 3 10 2 2 2 5 14 15 11 ...
## $ site : Factor w/ 1 level "yelp": 1 1 1 1 1 1 1 1 1 1 ...
## $ LowAvgHighCost : Factor w/ 3 levels "Avg","High","Low": 2 2 1 2 2 2 2 2 3 2 ...
## $ businessType : Factor w/ 3 levels "chiropractic",..: 3 3 1 3 3 3 3 3 2 3 ...
## $ mostRecentVisit_rating : num 4 1 4 1 1 1 2 5 5 4 ...
## $ lastVisit_rating : num 4 1 3 1 1 1 2 2 3 4 ...
## $ twoVisitsPrior_rating : num 1 1 1 NA NA NA NA NA NA NA ...
## $ threeVisitsPrior_rating : num 3 NA NA NA NA NA NA NA NA NA ...
## $ fourVisitsPrior_rating : num 3 NA NA NA NA NA NA NA NA NA ...
Now our ‘NA’ filled dummy columns are recognized as actual missing values or NAs of numeric instead of character fields. It is easier to turn the character fields into numeric, then factors so I changed them into numeric. If we want to use them as factors, which they are, when running the models we can. But we are going to focus on extracting hidden features from the data first and cleaning up redundancies in the data from the web scraping extenstions. Like the header user information and the extra dates, or the actual dates, and the ‘updated’ header to every previous review update. Lets look at our table of Reviews right now, but using the DT package for the datatable function.
Reviews_DT1 <- datatable(data=Reviews1, rownames=FALSE, # width = 800, height = 700,
extensions=c('Buttons','Responsive'),#'FixedColumns'),
#filter=list(position='top'),
options=list(pageLength=1,
dom='Bfrtip',scrollX = TRUE,# scrollY=TRUE,fixedColumns = TRUE,
buttons=c('colvis','csv'),
language=list(sSearch='Filter:')
)
)
Reviews_DT1
We should also look at the table within Rmarkdown, because DT is fussy and takes a while to load, plus the amount of text in the first column takes up the rest of the rows.
row.names(Reviews1) <- NULL
# head(Reviews1)
Reviews1 data table is ordered by the review with the most previous reviews in most to least. We see from this first observation in the table and many others that the reviews have a header that needs cleaning up. So, lets do that. We will use gsub to remove these headers with some regex commands. There are a lot of non character elements in the reviews that are considered white space characters for (tabs, ()newlines or a mac newline( or ()space. If any mistakes it’ll be easy to adjust instead of rerunning codes to get the Reviews1 table.
Reviews2 <- Reviews1
Reviews2$review <- gsub('[P].*[.][\\t][\\n]','',perl=TRUE,Reviews2$review)
# head(Reviews2)
We see that the Photo… header was removed if the review included a header. That placemarker is removed. But lets also remove the observations that didn’t have a photo placemarker and with these observations remove anything between the header and the first listed date that is preceeded by two newlines and one tab.We will also extract the first name and the header of the observations we removed the photo placemarker from up to the first date listed.
Reviews2$review <- gsub('^[\\t][\\n]', '', perl=TRUE, Reviews2$review)
# head(Reviews2)
I noticed there is a user name that begins with a single apostrophe, and it throws off this script if not fixed early, because later these names will be put into the userName field.So we have to add in the escape character backslash and apostrophe with a pipe for ‘or’ into this next command.
Reviews2$review <- gsub('^[a-zA-Z|\'].*[.]','', perl=TRUE, Reviews2$review)
# head(Reviews2)
We see that we have the city, state, number of friends, reviews, photos, and a status if they have more than a certain number of reviews. But also that this information could be useful, so we might want to split these string reviews by the 9 newline characters.
reviewStringSplit <- strsplit(Reviews2$review, split='[\n]{9}',perl=TRUE)
# head(reviewStringSplit,1)
This is great, because now we can separate the review with the header information. Lets name one string the headerData and the other the userObservation.
headerData <- lapply(reviewStringSplit, '[',1)
head(headerData)
## [[1]]
## [1] "\nMission Viejo, CA\n500 friends\n404 reviews\n452 photos\nElite '2020"
##
## [[2]]
## [1] "\nRancho Cucamonga, CA\n12 friends\n12 reviews\n4 photos"
##
## [[3]]
## [1] "\nCorona, CA\n10 friends\n95 reviews\n28 photos"
##
## [[4]]
## [1] "\nWestminster, CA\n0 friends\n74 reviews\n78 photos"
##
## [[5]]
## [1] "\nLos Angeles, CA\n92 friends\n69 reviews\n113 photos"
##
## [[6]]
## [1] "\nVan Nuys, CA\n116 friends\n24 reviews\n30 photos"
We see the city, state, number of friends, the number of reviews, number of photos, and if elite separated by a newline. Lets remove the newline, then add all these separately to our table by feature identified accordingly.
headerData2 <- as.character(headerData)
headerData2 <- gsub('^[\n]','', headerData2, perl=TRUE)
headermetaSplit <- strsplit(headerData2,split='[\n]',perl=TRUE)
head(headermetaSplit)
## [[1]]
## [1] "Mission Viejo, CA" "500 friends" "404 reviews"
## [4] "452 photos" "Elite '2020"
##
## [[2]]
## [1] "Rancho Cucamonga, CA" "12 friends" "12 reviews"
## [4] "4 photos"
##
## [[3]]
## [1] "Corona, CA" "10 friends" "95 reviews" "28 photos"
##
## [[4]]
## [1] "Westminster, CA" "0 friends" "74 reviews" "78 photos"
##
## [[5]]
## [1] "Los Angeles, CA" "92 friends" "69 reviews" "113 photos"
##
## [[6]]
## [1] "Van Nuys, CA" "116 friends" "24 reviews" "30 photos"
Reviews2$cityState <- lapply(headermetaSplit,'[',1)
Reviews2$friends <- lapply(headermetaSplit,'[',2)
Reviews2$reviews <- lapply(headermetaSplit,'[',3)
Reviews2$photos <- lapply(headermetaSplit,'[',4)
Reviews2$eliteStatus <- lapply(headermetaSplit,'[',5)
userObservation <- lapply(reviewStringSplit,'[',2)
# head(userObservation,1)
We can see from the second split of the string that the userObservation includes more meta data that includes the user name, date, if it was an updated review, and if available the number of check ins to the business and number of photos. The first name is the first part of this string. Lets split the string and make a field called userName to add to our table. We have to split by one newline and one tab, because one of the dates only has a prepend of one newline and one tab, although most are two newlines followed by a tab, we miss a review if we don’t.
obsStrSplit <- as.character(userObservation)
obsStrSplit2 <- strsplit(obsStrSplit,split='[\n][\t]', perl=TRUE)
Reviews2$userName <- lapply(obsStrSplit2,'[',1)
Reviews2$userName <- gsub('[\n][\n]','',perl=TRUE, Reviews2$userName)
Reviews2$userName <- gsub('^[ ]','',perl=TRUE, Reviews2$userName)
Reviews2$userName <- gsub('[\n]$','',perl=TRUE, Reviews2$userName)
head(Reviews2$userName)
## [1] "Patty A R." "Raven H." "Phillip G." "Julie C." "Amy S."
## [6] "Jack K."
Lets now replace the review field that has been slightly adjusted to exclude the first header. There is still the user name header that occurs right before the data and if the review is updated.This is the obsStrSplit2 list.
# head(obsStrSplit2)
These top reviews are the ones that have more than one review ordered from most to least by rating when this analysis and data cleaning began. We see from our obsStrSplit2 object that there are at most 5 listed reviews that were separated by the double newline and tab that preceeds each date. So we can make dummy fields for these reviews as well by the listed item they correspond to for each user. Later we can gather those fields into one Review field by review by visit. In each of those fields we will take out the response by the business if there was one. They are shown above and start with a double newline and the words, ‘Business Customer Service.’ We’ll mark this so we know to search for this fix later. %^& its tagged. So, lets add in the latest review, previous review, 2nd previous review, 3rd previous review, and 4th previous review because the most reviews for this data is five.
Reviews2$mostRecentVisit_review <- as.character(paste(lapply(obsStrSplit2,'[',2)))
Reviews2$lastVisit_review <- as.character(paste(lapply(obsStrSplit2,'[',3)))
Reviews2$twoVisitsPrior_review <- as.character(paste(lapply(obsStrSplit2,'[',4)))
Reviews2$threeVisitsPrior_review <- as.character(paste(lapply(obsStrSplit2,'[',5)))
Reviews2$fourVisitsPrior_review <- as.character(paste(lapply(obsStrSplit2,'[',5)))
lets remove the first review field from our table using the dplyr package select function. We should also change these added list types so that they are character strings.
Reviews3 <- Reviews2 %>% select(-review)
Reviews3$cityState <- as.factor(paste(Reviews3$cityState))
Reviews3$friends <- gsub(' friends','',Reviews3$friends)
Reviews3$friends <- as.numeric(paste(Reviews3$friends))
## Warning: NAs introduced by coercion
Reviews3$reviews <- gsub(' reviews','', Reviews3$reviews)
Reviews3$reviews <- as.numeric(paste(Reviews3$reviews))
## Warning: NAs introduced by coercion
Reviews3$photos <- gsub(' photos','', Reviews3$photos)
Reviews3$photos <- as.numeric(paste(Reviews3$photos))
## Warning: NAs introduced by coercion
Reviews3$eliteStatus <- as.factor(paste(Reviews3$eliteStatus))
Reviews3$mostRecentVisit_review <- as.character(Reviews3$mostRecentVisit_review)
Reviews3$lastVisit_review <- as.character(Reviews3$lastVisit_review)
Reviews3$twoVisitsPrior_review <- as.character(Reviews3$twoVisitsPrior_review)
Reviews3$threeVisitsPrior_review <- as.character(Reviews3$threeVisitsPrior_review)
Reviews3$fourVisitsPrior_review <- as.character(Reviews3$fourVisitsPrior_review)
str(Reviews3)
## 'data.frame': 516 obs. of 20 variables:
## $ rating_last_first_if_multipleUpdated: Factor w/ 17 levels "1","1,1","1,1,1",..: 12 3 10 2 2 2 5 14 15 11 ...
## $ site : Factor w/ 1 level "yelp": 1 1 1 1 1 1 1 1 1 1 ...
## $ LowAvgHighCost : Factor w/ 3 levels "Avg","High","Low": 2 2 1 2 2 2 2 2 3 2 ...
## $ businessType : Factor w/ 3 levels "chiropractic",..: 3 3 1 3 3 3 3 3 2 3 ...
## $ mostRecentVisit_rating : num 4 1 4 1 1 1 2 5 5 4 ...
## $ lastVisit_rating : num 4 1 3 1 1 1 2 2 3 4 ...
## $ twoVisitsPrior_rating : num 1 1 1 NA NA NA NA NA NA NA ...
## $ threeVisitsPrior_rating : num 3 NA NA NA NA NA NA NA NA NA ...
## $ fourVisitsPrior_rating : num 3 NA NA NA NA NA NA NA NA NA ...
## $ cityState : Factor w/ 148 levels "Alhambra, CA",..: 73 101 24 144 66 138 113 54 13 116 ...
## $ friends : num 500 12 10 0 92 116 258 117 107 408 ...
## $ reviews : num 404 12 95 74 69 24 288 20 94 44 ...
## $ photos : num 452 4 28 78 113 30 132 13 188 33 ...
## $ eliteStatus : Factor w/ 2 levels "Elite '2020",..: 1 2 2 2 2 2 1 2 2 1 ...
## $ userName : chr "Patty A R." "Raven H." "Phillip G." "Julie C." ...
## $ mostRecentVisit_review : chr "6/5/2018Updated review\n 6 photos\n\n 2 check-ins\n\nAnother fabulous trip to HIGH END SPA! Love the new additi"| __truncated__ "2/15/2019Updated review\nStill no update by this facility, don't think I'll ever go back nor will I ever refer "| __truncated__ "6/15/2018Updated review\n 3 check-ins\n\nI've been here consistently for the last few years. Mistake were made "| __truncated__ "1/11/2020Updated review\n 3 photos\n\nImagine planning a family event for the last three months only to be gree"| __truncated__ ...
## $ lastVisit_review : chr "3/30/2018Previous review\nBeen going here for over 27 years. I've seen all the growth and change. It's been bum"| __truncated__ "5/12/2018Previous review\nIt's odd to me how you see the complaint, then a response but no update from the cust"| __truncated__ "9/14/2015Previous review\nMy wife and myself had some issues with the secretary and massages. We had communicat"| __truncated__ "9/15/2019Previous review\nImagine planning a family event for the last three months only to be greeted by list "| __truncated__ ...
## $ twoVisitsPrior_review : chr "2/10/2018Previous review\nI miss the old HIGH END SPA. I've waited 2 days to actually talk to someone there abo"| __truncated__ "5/7/2018Previous review\nIt's too bad, I had such a great time here and some bathroom attendant ruined my whole"| __truncated__ "9/11/2015Previous review\nHad major issues with this place. My wife signed up for a massage and chiropractic se"| __truncated__ "NA" ...
## $ threeVisitsPrior_review : chr "8/27/2017Previous review\nI've been coming here for over 25 years and have seen the ups and downs of this place"| __truncated__ "NA" "NA" "NA" ...
## $ fourVisitsPrior_review : chr "8/27/2017Previous review\nI've been coming here for over 25 years and have seen the ups and downs of this place"| __truncated__ "NA" "NA" "NA" ...
Lets rearrange the columns in our new table. We still need to extract from each of the five reviews by user (if exist) the business response (also if exists).
colnames(Reviews3)
## [1] "rating_last_first_if_multipleUpdated"
## [2] "site"
## [3] "LowAvgHighCost"
## [4] "businessType"
## [5] "mostRecentVisit_rating"
## [6] "lastVisit_rating"
## [7] "twoVisitsPrior_rating"
## [8] "threeVisitsPrior_rating"
## [9] "fourVisitsPrior_rating"
## [10] "cityState"
## [11] "friends"
## [12] "reviews"
## [13] "photos"
## [14] "eliteStatus"
## [15] "userName"
## [16] "mostRecentVisit_review"
## [17] "lastVisit_review"
## [18] "twoVisitsPrior_review"
## [19] "threeVisitsPrior_review"
## [20] "fourVisitsPrior_review"
First lets gather the review fields and the ratings fields and remove the NA values from the userReveiwSeries and userRatingSeries.
Reviews4 <- gather(Reviews3, 'userReviewSeries','userReviewContent',16:20)
Reviews4$userReviewContent <- gsub('NA','', Reviews4$userReviewContent)
#remove the char NAs because complete.cases won't work unless the table is read in with
# the correct NA values, even after converting to empty in the table.
write.csv(Reviews4, 'reviews4.csv', row.names=FALSE)
Reviews4 <- read.csv('reviews4.csv', sep=',', header=TRUE, na.strings=c('',' ','NA',NULL))
#now the table is 543 instead of 2580 observations.
Reviews4 <- Reviews4[complete.cases(Reviews4$userReviewContent),]
Reviews5 <- gather(Reviews4, 'userRatingSeries','userRatingValue',5:9)
#because this userRatingValue field is numeric, the NAs are already read by R as such
#we can remove the NAs with complete.cases to get a 614 obs table instead of 2715
Reviews5 <- Reviews5[complete.cases(Reviews5$userRatingValue),]
colnames(Reviews5)
## [1] "rating_last_first_if_multipleUpdated"
## [2] "site"
## [3] "LowAvgHighCost"
## [4] "businessType"
## [5] "cityState"
## [6] "friends"
## [7] "reviews"
## [8] "photos"
## [9] "eliteStatus"
## [10] "userName"
## [11] "userReviewSeries"
## [12] "userReviewContent"
## [13] "userRatingSeries"
## [14] "userRatingValue"
Reviews6 <- Reviews5 %>% select(userReviewSeries, userReviewContent,
userRatingSeries, userRatingValue,
everything())
Reviews7 <- Reviews6 %>% select(-rating_last_first_if_multipleUpdated,
-site)
# head(Reviews7)
Lets now remove the business response from the review content field.
businessReplied <- grep('Comment from',Reviews7$userReviewContent)
Reviews7$businessReplied <- 'no'
Reviews7$businessReplied[businessReplied] <- 'yes'
Reviews8 <- Reviews7 %>% select(userReviewSeries:userRatingValue,businessReplied,
everything())
Reviews9 <- Reviews8[order(Reviews8$businessReplied, decreasing=TRUE),]
row.names(Reviews9) <- NULL
Lets make this field a new field of the public relations reply removed.
Reviews9$userReviewContent <- as.character(paste(Reviews9$userReviewContent))
Reviews9$userRatingSeries <- as.factor(paste(Reviews9$userRatingSeries))
Reviews9$businessReplied <- as.factor(Reviews9$businessReplied)
PR <- strsplit(Reviews9$userReviewContent, split='[C][o][m][m][e][n][t] [f][r][o][m]',
perl=TRUE)
# head(PR,1)
Lets separate these into two separate character strings of user only review and PR_reply
userOnlyReview <- as.character(paste(lapply(PR,'[',1)))
PR_reply <- as.character(paste(lapply(PR,'[',2)))
Both of the above vectors are the same number of observations as our table.
Lets remove the other data on photos
userOnlyReview <- gsub('[\n][P][h][o][t][o] [o][f].*','', userOnlyReview,perl=TRUE)
grep('Comment', userOnlyReview)
## integer(0)
There shouldn’t be any comments from business owners in this first part of the string.
# head(userOnlyReview,1)
Also, the above shows the beginning is a date with a dropped zero for the months 1-9, and some observations have the photo[s] or check-in[s]. This should be modified with regex to add a date column and also add the numeric values for the photos or check-ins to those fields in our table.
Lets add these two strings of user only and PR reply to the data as two separate fields with ifelse functions.
Reviews9$userReviewOnlyContent <- userOnlyReview
Reviews9$businessReplyContent <- PR_reply
Reviews9$userReviewOnlyContent <- gsub('[\n][P][h][o][t][o] [o][f].*','',
Reviews9$userReviewOnlyContent,perl=TRUE)
Reviews9$userReviewOnlyContent <- gsub('[S][e][e] [a][l][l] [p].*','',
Reviews9$userReviewOnlyContent,
perl=TRUE)
# head(Reviews9[,13:15])
We just mentioned the date beginning each userReviewOnlyContent and userReviewContent columns, so lets create a date column for these dates. There are actually a bunch of anomolies in that first part of the string.
Reviews9$Date <- substr(Reviews9$userReviewOnlyContent,1,11)
date <- strsplit(Reviews9$Date, split='[a-zA-Z]', perl=TRUE)
Date <- as.character(paste(lapply(date,'[',1)))
Date <- trimws(Date, which='right',whitespace='[\n]')
Date <- gsub('[ ][\n][0-9]','', perl=TRUE, Date)
Date <- gsub('[\n][ ][0-9]','', perl=TRUE, Date)
Date <- gsub('[\n][ ]','', perl=TRUE, Date)
Date <- gsub('[\n][\" ][0-9]','', perl=TRUE, Date)
Date <- gsub('[\n][0-9]{2}','', perl=TRUE, Date)
Date <- gsub('[\n][\\]["]','', perl=TRUE, Date)
Date <- gsub('[\n][0-9]','', perl=TRUE,Date)
Date1 <- mdy(Date)
Reviews9$Date <-Date1
Remove the first date string in the userReviewOnlyContent.
Reviews9$userReviewOnlyContent <- gsub('[0-9]{1,2}[/][0-9]{1,2}[/][0-9]{4}','',
perl=TRUE, Reviews9$userReviewOnlyContent)
Lets also remove any photo meta from the original userReviewContent column.
Reviews9$userReviewContent <- gsub('[S][e][e] [a][l][l] [p].*','',Reviews9$userReviewContent,
perl=TRUE)
Reviews9$userReviewContent <- gsub('[\n][P][h][o][t][o] [o][f].*','',
Reviews9$userReviewContent,perl=TRUE)
Now lets rearrange our columns and make this a searchable and downloadable datatable.
Reviews10 <- Reviews9 %>% select(userReviewSeries, userReviewOnlyContent,
userRatingSeries, userRatingValue, businessReplied, businessReplyContent, everything())
colnames(Reviews10)
## [1] "userReviewSeries" "userReviewOnlyContent" "userRatingSeries"
## [4] "userRatingValue" "businessReplied" "businessReplyContent"
## [7] "userReviewContent" "LowAvgHighCost" "businessType"
## [10] "cityState" "friends" "reviews"
## [13] "photos" "eliteStatus" "userName"
## [16] "Date"
The userReviewContent has the text of both business response and the user as well as photo placemarker data.
Reviews10_DT <- datatable(data=Reviews10, rownames=FALSE, # width = 800, height = 700,
extensions=c('Buttons','Responsive'),#'FixedColumns'),
#filter=list(position='top'),
options=list(pageLength=1,
dom='Bfrtip',scrollX = TRUE,# scrollY=TRUE,fixedColumns = TRUE,
buttons=c('colvis','csv'),
language=list(sSearch='Filter:')
)
)
Reviews10_DT
# head(Reviews10)
The userReviewContent was kept in the table to compare the cleaned up columns on user reviews.
We should still remove the updated and previous review descriptions.
Reviews10$userReviewOnlyContent <- gsub('[uU][p][d][a][t][e][d].*[\n]','', perl=TRUE,
Reviews10$userReviewOnlyContent)
Reviews10$userReviewOnlyContent <- gsub('[pP][r][e][v][i][o][u].*[w]','', perl=TRUE,
Reviews10$userReviewOnlyContent)
Reviews10$id <- row.names(Reviews10)
pix <- grep('photo+',Reviews10$userReviewOnlyContent)
pix2 <- Reviews10$userReviewOnlyContent[pix]
pix3 <- trimws(pix2, which="left",whitespace="[\t\r\n]")
pixs <- as.data.frame(pix3)
colnames(pixs) <- 'busPhotos'
pixs$id <- pix
pixs$busPhotos <- gsub('^[ ]','', pixs$busPhotos)
pixs2 <- pixs[grep('^[0-9][ ][pP]',pixs$busPhotos, perl=TRUE),]
# head(pixs2)
pics <- strsplit(pixs2$busPhotos,split='[\n\n]',perl=TRUE)
# head(pics,1)
From the above, we our only interested in the first split of photos, the other reviews are split on the double newline and caused multiple splits for most single reviews.
pixs2$userBusinessPhotos <- as.character(paste(lapply(pics,'[',1)))
pixs2$userBusinessPhotos <- gsub(' photo','', pixs2$userBusinessPhotos)
pixs2$userBusinessPhotos <- gsub('s','',pixs2$userBusinessPhotos)
pixs2$userBusinessPhotos <- trimws(pixs2$userBusinessPhotos,which='right')
pixs2$userBusinessPhotos <- as.numeric(paste(pixs2$userBusinessPhotos))
# head(pixs2)
Lets keep only the id and userBusinessPhotos columns.
pics3 <- pixs2 %>% select(id,userBusinessPhotos)
Combine this new feature to the data table of all features thus far.
Reviews11 <- merge(Reviews10, pics3, by.x='id', by.y='id', all.x=TRUE)
Now lets do the same thing for the check-ins information. The number of times the user checked in or visited the business. Get only those reviews with the check-ins a header and not in the observation or found at the end of the observation.
checks <- Reviews11 %>% select(id,userReviewOnlyContent)
checks$substring <- substr(checks$userReviewOnlyContent, 1,40)
chekn <- grep('check-in',checks$substring)
checks1 <- checks[chekn,]
There are more check-ins than photos by the user.
chekn <- checks1 %>% select(id, substring)
# head(chekn,10)
Lets remove the reference to photos and double newline characters from our substring.
chekn$substring <- gsub('[0-9] [p][h][o][t][o][\n][\n]','', chekn$substring,perl=TRUE)
chekn$substring <- gsub('[0-9][0-9] [p][h][o][t][o][s][\n][\n]','', chekn$substring,perl=TRUE)
chekn$substring <- gsub('[0-9] [p][h][o][t][o][s][\n][\n]','', chekn$substring,perl=TRUE)
Split on the double newline characters and grab the first entries, after verifying the substring column only starts with the number of check-ins per user.
checkN <- strsplit(chekn$substring, split='[\n][\n]',perl=TRUE)
head(checkN)
## [[1]]
## [1] "\n 1 check-in" " has been treating myself,"
##
## [[2]]
## [1] "\n 1 check-in" "My boyfriend too"
##
## [[3]]
## [1] "\n 1 check-in" "My wife and I h"
##
## [[4]]
## [1] "\n 41 check-ins" "When I first moved to Co"
##
## [[5]]
## [1] "\n 1 check-in" "So I was having back probl"
##
## [[6]]
## [1] "\n 1 check-in" "DOCTOR showed me great exe"
checkN2 <- as.character(paste(lapply(checkN,'[',1)))
checkN2 <- trimws(checkN2, which='left', whitespace="[\n]")
checkN2 <- gsub('^ ','',checkN2)
checkN2 <- gsub('^ ','',checkN2)
checkN2 <- gsub(' check-ins','',checkN2)
checkN2 <- gsub(' check-in','',checkN2)
checkN2 <- as.numeric(paste(checkN2))
head(checkN2)
## [1] 1 1 1 41 1 1
chekn$userCheckIns <- checkN2
head(chekn)
## id substring userCheckIns
## 8 105 \n 1 check-in\n\n has been treating myself, 1
## 13 11 \n 1 check-in\n\nMy boyfriend too 1
## 14 110 \n 1 check-in\n\nMy wife and I h 1
## 21 117 \n 41 check-ins\n\nWhen I first moved to Co 41
## 23 119 \n 1 check-in\n\nSo I was having back probl 1
## 26 121 \n 1 check-in\n\nDOCTOR showed me great exe 1
merge this with Reviews11 data.
Reviews12 <- merge(Reviews11, chekn, by.x='id', by.y='id', all.x=TRUE)
head(Reviews12[order(Reviews12$userCheckIns,decreasing=TRUE),c(17,19:20)])
## Date substring userCheckIns
## 207 2014-09-23 \n 45 check-ins\n\nEverywhere I live I alwa 45
## 133 2014-10-21 \n 43 check-ins\n\nLove it here, sucks my c 43
## 21 2018-04-13 \n 41 check-ins\n\nWhen I first moved to Co 41
## 301 2018-09-28 31 check-ins\n\nEveryone is very nice and 31
## 517 2018-09-28 31 check-ins\n\nEveryone is very nice and 31
## 217 2015-12-04 \n 30 check-ins\n\nWe can I say "We Love LO 30
Reviews13 <- Reviews12 %>% select(-id, -substring)
head(Reviews13[order(Reviews13$userCheckIns,decreasing=TRUE),c(16,18)])
## Date userCheckIns
## 207 2014-09-23 45
## 133 2014-10-21 43
## 21 2018-04-13 41
## 301 2018-09-28 31
## 517 2018-09-28 31
## 217 2015-12-04 30
Lets remove the header from the userReviewOnlyContent column now that we have extracted the photos and check-in data per user.
subUser <- substr(Reviews13$userReviewOnlyContent,1,30)
head(subUser,30)
## [1] " 2 photos\n\nWhat a wonderful wa" "\n 12 photos\n\nMy sister and I b"
## [3] "\nI came to CHIROPRACTIC with s" "\nI have to say.... This is by "
## [5] "\nDr. is my chiropractor and h" "\nMany in our family have seen "
## [7] "\nDr. fixed my neck/shoulder p" "\n 1 check-in\n\n has been treati"
## [9] "\nDr. is great! I've been to o" "\nI'm so happy I found CHIROPRA"
## [11] "\nI got my first one hour full " "\n2 months ago I was rear ended"
## [13] "\n 1 photo\n\n 1 check-in\n\nMy boy" "\n 2 photos\n\n 1 check-in\n\nMy wi"
## [15] "\n is a good man and truly care" "\nLies. Shady. Overcharge. Th"
## [17] "\nFirst time here. Highly recom" "\nDOCTOR is great and gives rea"
## [19] "\nThis has been my spa for the " "\nI really enjoyed my visit eve"
## [21] "\n 41 check-ins\n\nWhen I first m" "\nTo be honest I dont even know"
## [23] "\n 1 check-in\n\nSo I was having " "\n 3 photos\n\nI booked a Winter "
## [25] "\nDr. is honestly the best thu" "\n 1 check-in\n\nDOCTOR showed me"
## [27] "\nI was first suspicious of thi" "\nThe team here at CHIROPRACTIC"
## [29] "\n 1 photo\n\n 1 check-in\n\nCHIROP" "\n 1 check-in\n\nGreat experience"
subUser2 <- gsub('[0-9]{1,2}.*[\n][\n]','', subUser, perl=TRUE)
head(subUser2,30)
## [1] " What a wonderful wa" "\n My sister and I b"
## [3] "\nI came to CHIROPRACTIC with s" "\nI have to say.... This is by "
## [5] "\nDr. is my chiropractor and h" "\nMany in our family have seen "
## [7] "\nDr. fixed my neck/shoulder p" "\n has been treati"
## [9] "\nDr. is great! I've been to o" "\nI'm so happy I found CHIROPRA"
## [11] "\nI got my first one hour full " "\n2 months ago I was rear ended"
## [13] "\n My boy" "\n My wi"
## [15] "\n is a good man and truly care" "\nLies. Shady. Overcharge. Th"
## [17] "\nFirst time here. Highly recom" "\nDOCTOR is great and gives rea"
## [19] "\nThis has been my spa for the " "\nI really enjoyed my visit eve"
## [21] "\n When I first m" "\nTo be honest I dont even know"
## [23] "\n So I was having " "\n I booked a Winter "
## [25] "\nDr. is honestly the best thu" "\n DOCTOR showed me"
## [27] "\nI was first suspicious of thi" "\nThe team here at CHIROPRACTIC"
## [29] "\n CHIROP" "\n Great experience"
subUser3 <- gsub('[\n][ ][0-9]{1,2}.*[\n][\n]','',subUser2, perl=TRUE)
head(subUser3,50)
## [1] " What a wonderful wa" "\n My sister and I b"
## [3] "\nI came to CHIROPRACTIC with s" "\nI have to say.... This is by "
## [5] "\nDr. is my chiropractor and h" "\nMany in our family have seen "
## [7] "\nDr. fixed my neck/shoulder p" "\n has been treati"
## [9] "\nDr. is great! I've been to o" "\nI'm so happy I found CHIROPRA"
## [11] "\nI got my first one hour full " "\n2 months ago I was rear ended"
## [13] "\n My boy" "\n My wi"
## [15] "\n is a good man and truly care" "\nLies. Shady. Overcharge. Th"
## [17] "\nFirst time here. Highly recom" "\nDOCTOR is great and gives rea"
## [19] "\nThis has been my spa for the " "\nI really enjoyed my visit eve"
## [21] "\n When I first m" "\nTo be honest I dont even know"
## [23] "\n So I was having " "\n I booked a Winter "
## [25] "\nDr. is honestly the best thu" "\n DOCTOR showed me"
## [27] "\nI was first suspicious of thi" "\nThe team here at CHIROPRACTIC"
## [29] "\n CHIROP" "\n Great experience"
## [31] "\nI am so impressed with CHIROP" "\n I came here bec"
## [33] "\n Thanks for suppo" "\nI love coming to CHIROPRACTIC"
## [35] "\n Had a gift card" "\nI went in because I had pain "
## [37] "\nI came in about a month ago d" "\nI'd been to different chiro,b"
## [39] "\n For ye" "\nBest place to get a massage a"
## [41] "\nGreat place, all the staff is" "\n I love"
## [43] "\n Love this place!" "\n I came to him wi"
## [45] "\nI came here before I was diag" "\n I enjoyed every mo"
## [47] "\n My favorite chir" "\nMy favorite place to get my b"
## [49] "\n My wife booked m" "\n DOC"
Now that we tested the removal using regex on a string, we can apply these regex commands to the column userReviewOnlyContent.
Reviews13$userReviewOnlyContent <- gsub('[0-9]{1,2}.*[\n][\n]','',
Reviews13$userReviewOnlyContent, perl=TRUE)
Reviews13$userReviewOnlyContent <- gsub('[\n][ ][0-9]{1,2}.*[\n][\n]','',
Reviews13$userReviewOnlyContent, perl=TRUE)
Reviews13$userReviewOnlyContent <- trimws(Reviews13$userReviewOnlyContent, which='left',
whitespace="[\n\t\r]")
# head(Reviews13,30)
Now it looks like we have our cleaned data to run sentiment and text analysis of in determining the rating the user review is given by the user. Lets write this file out to csv, and make a DT datatable for downloading from Rpubs.
write.csv(Reviews13, 'cleanedRegexReviews13.csv',row.names=FALSE)
Reviews13_DT <- datatable(data=Reviews13, rownames=FALSE, #width = 800, height = 700,
extensions=c('Buttons','Responsive'),#,'FixedColumns'),
filter=list(position='top'),
options=list(pageLength=2,
dom='Bfrtip',scrollX = TRUE, scrollY=TRUE,#fixedColumns = TRUE,
buttons=c('colvis','csv'),
language=list(sSearch='Filter:')
)
)
Reviews13_DT
Lets keep only the cleaned user review and the rating for that user’s visit.
Reviews14 <- Reviews13 %>% select(userReviewOnlyContent,userRatingValue)
row.names(Reviews14) <- NULL
head(Reviews14)
## userReviewOnlyContent
## 1 What a wonderful way to start the year! This was my second time back to HIGH END SPA, and we had a great time. The crowds were very low (seriously, it felt like we had the place to ourselves most of the day.) We walked right into the mineral baths, club mud, and didn't wait in any kind of line for lunch. None of the pools were crowded, and we were even able to enjoy one of the hammocks in the secret garden.\n\nTiffany at the front check-in desk went above and beyond for us regarding the robes. I had requested a plus-sized robe, since after my last review I knew they had added some to their collection. Unfortunately, all of their plus-sized robes were still dirty from the day before. Tiffany was so accommodating, though! She was able to get us robes from the cabana area that fit me perfectly! It is so great to know that not only do they now offer guests of all sizes the option to enjoy a warm robe, but that they really want to make sure you have a good day. Thank you, Tiffany, for everything.\n\nAll of the staff today were in good spirits. The only thing that would have made today better would have been a massage. We'll have to book one next time. My husband and I are going to make HIGH END SPA our annual New Year's Day tradition!\n\n
## 2 My sister and I brought my mom here for her birthday and overall, we really enjoyed our time here. We're used to going to Korean spas, but this was definitely an upgrade.\n\nPROS:\n- The resort itself is beautiful and so relaxing. Like seriously such a pleasing escape from reality that I needed. It's set up so nicely and feels very luxurious.\n- It was my mom's birthday so she received free admission on birthday with a purchase of a service. Admission is $52, so she booked a manicure for $50 and got in for free. WORTH. My mom had gone 52 years without ever getting her nails done, so it was kind of heartwarming to see how much she loved her experience.\n- The three of us took a Yin Yoga class and really enjoyed it. We definitely want to take advantage of the other class options next time we come.\n- CLUB MUD. We had so much fun there and even made a little clay sculpture. It really does do wonders for your skin, and the area is suprisingly very well-kept.\n- The shower and locker facilities can get pretty crowded, but overall, they are super nice and clean. They have an ample amount of showers, so we didn't have to wait at all.\n- All the staff seemed really friendly and helpful. There's always staff members roaming around, so you always feel somewhat taken care of.\n- I really appreciated the towel and water stands located throughout the resort. So handy and necessary.\n- Parking is free, thank God.\n\nCONS:\n- We went on a fairly cold day (around 60 degrees), so the hot pools were CROWDED,. Like there were a couple of times I touched other people's body parts I definitely did not want to touch. I feel like some of the hot pools exceeded capacity, and I'm sure it was mostly because it was a cold day, but I do wish there were more of the hot pools or they should just be larger!\n- The food is incredibly expensive. Like as ridiculous as Disneyland, which is saying something. Plan to spend around $20 per meal per person. The one thing that was worth it was the nachos ($16 for the small, but this thing is huge).\n- The kitchen moves VERY SLOWLY. Especially the salad section because I came before the lunch rush and still waited 20 minutes to order my salad. The kitchen staff seems a bit incompetent, or maybe it's just run inefficiently.\n- This is more of a side note, but I wish there was a more streamlined reservation system. I made the entire reservation over the phone, which was fine, but it wasn't laid out as clearly as I would have liked it with the premium admissions prices, services, etc. The online one also just seemed really confusing.\n\nOverall, we had a positive experience with just a couple of kinks here and there. We love that there's just a lot to do here and time FLIES when you're here so come as early as you can. We definitely want to try coming back in the summer months when it's warmer!\n\n\n
## 3 I came to CHIROPRACTIC with severe back and neck pain. DOCTOR was AMAZING and helped me to feel much better than I have felt for YEARS! The girls up front also are very sweet and always made sure that all my appointments were set and on time! Heather the billing manager was very kind as well, she was AWESOME when it came to dealing with me and my insurance amd was definitely a huge help! I don't know what I would have done without Heather helping me with all of the insurance problems I had!!! She is the BEST, thank you Heather!! I would definitely recommend going to this clinic!!!!
## 4 I have to say.... This is by far the best Chiropractic place I've ever been to. The staff is super friendly and very professional. From the moment I walk in the door I get greeted by name . The Drs are amazing too. Love this place and I highly recommend them.
## 5 Dr. is my chiropractor and he is a fabulous individual. I've never waited more than few minutes for him to see me. The front team (Both ladies" are great with an outstanding care and smile. Thank you guys for all you do.
## 6 Many in our family have seen DOCTOR for chiropractic care. He is very warm and friendly, knowledgable, puts your mind at ease during his adjustments. He gives great explanations. Our 14yo son said, "he is really good at what he does and he is a good person." We all feel better after visiting him. Recommend him to everyone.
## userRatingValue
## 1 5
## 2 4
## 3 5
## 4 5
## 5 5
## 6 5
We will have to create a corpus of documents for each rating, then we can clean up the text with the programs within these text mining libraries other than what we have done to the data already. We should also remove the words: ‘DOCTOR’, ‘CHIROPRACTIC,’HIGH END SPA’, and ‘LOW COST GROCERY STORE.’
Reviews14$userReviewOnlyContent <- gsub('DOCTOR','', Reviews14$userReviewOnlyContent)
Reviews14$userReviewOnlyContent <- gsub('CHIROPRACTIC','', Reviews14$userReviewOnlyContent)
Reviews14$userReviewOnlyContent <- gsub('HIGH END SPA','', Reviews14$userReviewOnlyContent)
Reviews14$userReviewOnlyContent <- gsub('LOW COST GROCERY STORE','',
Reviews14$userReviewOnlyContent)
Lets lemmatize the document first to grab the root word and not the stem of each review.
lemma <- lemmatize_strings(Reviews14$userReviewOnlyContent, dictionary=lexicon::hash_lemmas)
Lemma <- as.data.frame(lemma)
Lemma <- cbind(Lemma, Reviews14)
colnames(Lemma) <- c('lemmatizedReview','review', 'rating')
Lemma$rating <- as.factor(paste(Lemma$rating))
head(Lemma)
## lemmatizedReview
## 1 What a wonderful way to start the year! This be my 2 time back to, and we have a great time. The crowd be very low ( seriously, it feel like we have the place to ourselves much of the day. ) We walk right into the mineral bath, club mud, and didn't wait in any kind of line for lunch. None of the pool be crowd, and we be even able to enjoy one of the hammock in the secret garden. Tiffany at the front check - in desk go above and beyond for us regard the robe. I have request a plus - size robe, since after my last review I know they have add some to their collection. Unfortunately, all of their plus - size robe be still dirty from the day before. Tiffany be so accommodate, though! She be able to get us robe from the cabana area that fit me perfectly! It be so great to know that not only do they now offer guest of all size the option to enjoy a warm robe, but that they really want to make sure you have a good day. Thank you, Tiffany, for everything. All of the staff today be in good spirit. The only thing that would have make today good would have be a massage. We'll have to book one next time. My husband and I be go to make our annual New Year's Day tradition!
## 2 My sister and I bring my mom here for her birthday and overall, we really enjoy our time here. We're use to go to Korean spa, but this be definitely a upgrade. pro: - The resort itself be beautiful and so relax. Like seriously such a please escape from reality that I need. It's set up so nicely and feel very luxurious. - It be my mom's birthday so she receive free admission on birthday with a purchase of a service. Admission be $52, so she book a manicure for $50 and get in for free. WORTH. My mom have go 52 year without ever get her nail do, so it be kind of heartwarming to see how much she love her experience. - The three of us take a Yin Yoga class and really enjoy it. We definitely want to take advantage of the other class option next time we come. - CLUB MUD. We have so much fun there and even make a little clay sculpture. It really do do wonder for your skin, and the area be suprisingly very good - keep. - The shower and locker facility can get pretty crowd, but overall, they be super nice and clean. They have a ample amount of shower, so we didn't have to wait at all. - All the staff seem really friendly and helpful. There's always staff member roam around, so you always feel somewhat take care of. - I really appreciate the towel and water stand locate throughout the resort. So handy and necessary. - park be free, thank God. con: - We go on a fairly cold day ( around 60 degree ), so the hot pool be crowd,. Like there be a couple of time I touch other people's body part I definitely do not want to touch. I feel like some of the hot pool exceed capacity, and I'm sure it be mostly because it be a cold day, but I do wish there be much of the hot pool or they should just be large! - The food be incredibly expensive. Like as ridiculous as Disneyland, which be say something. Plan to spend around $20 per meal per person. The one thing that be worth it be the nachos ( $16 for the small, but this thing be huge ). - The kitchen move VERY SLOWLY. Especially the salad section because I come before the lunch rush and still wait 20 minute to order my salad. The kitchen staff seem a bite incompetent, or maybe it's just run inefficiently. - This be much of a side note, but I wish there be a much streamline reservation system. I make the entire reservation over the phone, which be fine, but it wasn't lay out as clearly as I would have like it with the premium admission price, service, etc. The online one also just seem really confuse. Overall, we have a positive experience with just a couple of kink here and there. We love that there's just a lot to do here and time fly when you're here so come as early as you can. We definitely want to try come back in the summer month when it's warm!
## 3 I come to with severe back and neck pain. be amaze and help me to feel much good than I have feel for year! The girl up front also be very sweet and always make sure that all my appointment be set and on time! Heather the bill manager be very kind as good, she be AWESOME when it come to deal with me and my insurance amd be definitely a huge help! I don't know what I would have do without Heather help me with all of the insurance problem I have!!! She be the good, thank you Heather!! I would definitely recommend go to this clinic!!!!
## 4 I have to say.... This be by far the good Chiropractic place I've ever be to. The staff be super friendly and very professional. From the moment I walk in the door I get greet by name. The dr be amaze too. Love this place and I highly recommend them.
## 5 Dr. be my chiropractor and he be a fabulous individual. I've never wait much than few minute for him to see me. The front team ( Both lady " be great with a outstanding care and smile. Thank you guy for all you do.
## 6 Many in our family have see for chiropractic care. He be very warm and friendly, knowledgable, put your mind at ease during his adjustment. He give great explanation. Our 14yo son say, " he be really good at what he do and he be a good person. " We all feel good after visit him. Recommend him to everyone.
## review
## 1 What a wonderful way to start the year! This was my second time back to , and we had a great time. The crowds were very low (seriously, it felt like we had the place to ourselves most of the day.) We walked right into the mineral baths, club mud, and didn't wait in any kind of line for lunch. None of the pools were crowded, and we were even able to enjoy one of the hammocks in the secret garden.\n\nTiffany at the front check-in desk went above and beyond for us regarding the robes. I had requested a plus-sized robe, since after my last review I knew they had added some to their collection. Unfortunately, all of their plus-sized robes were still dirty from the day before. Tiffany was so accommodating, though! She was able to get us robes from the cabana area that fit me perfectly! It is so great to know that not only do they now offer guests of all sizes the option to enjoy a warm robe, but that they really want to make sure you have a good day. Thank you, Tiffany, for everything.\n\nAll of the staff today were in good spirits. The only thing that would have made today better would have been a massage. We'll have to book one next time. My husband and I are going to make our annual New Year's Day tradition!\n\n
## 2 My sister and I brought my mom here for her birthday and overall, we really enjoyed our time here. We're used to going to Korean spas, but this was definitely an upgrade.\n\nPROS:\n- The resort itself is beautiful and so relaxing. Like seriously such a pleasing escape from reality that I needed. It's set up so nicely and feels very luxurious.\n- It was my mom's birthday so she received free admission on birthday with a purchase of a service. Admission is $52, so she booked a manicure for $50 and got in for free. WORTH. My mom had gone 52 years without ever getting her nails done, so it was kind of heartwarming to see how much she loved her experience.\n- The three of us took a Yin Yoga class and really enjoyed it. We definitely want to take advantage of the other class options next time we come.\n- CLUB MUD. We had so much fun there and even made a little clay sculpture. It really does do wonders for your skin, and the area is suprisingly very well-kept.\n- The shower and locker facilities can get pretty crowded, but overall, they are super nice and clean. They have an ample amount of showers, so we didn't have to wait at all.\n- All the staff seemed really friendly and helpful. There's always staff members roaming around, so you always feel somewhat taken care of.\n- I really appreciated the towel and water stands located throughout the resort. So handy and necessary.\n- Parking is free, thank God.\n\nCONS:\n- We went on a fairly cold day (around 60 degrees), so the hot pools were CROWDED,. Like there were a couple of times I touched other people's body parts I definitely did not want to touch. I feel like some of the hot pools exceeded capacity, and I'm sure it was mostly because it was a cold day, but I do wish there were more of the hot pools or they should just be larger!\n- The food is incredibly expensive. Like as ridiculous as Disneyland, which is saying something. Plan to spend around $20 per meal per person. The one thing that was worth it was the nachos ($16 for the small, but this thing is huge).\n- The kitchen moves VERY SLOWLY. Especially the salad section because I came before the lunch rush and still waited 20 minutes to order my salad. The kitchen staff seems a bit incompetent, or maybe it's just run inefficiently.\n- This is more of a side note, but I wish there was a more streamlined reservation system. I made the entire reservation over the phone, which was fine, but it wasn't laid out as clearly as I would have liked it with the premium admissions prices, services, etc. The online one also just seemed really confusing.\n\nOverall, we had a positive experience with just a couple of kinks here and there. We love that there's just a lot to do here and time FLIES when you're here so come as early as you can. We definitely want to try coming back in the summer months when it's warmer!\n\n\n
## 3 I came to with severe back and neck pain. was AMAZING and helped me to feel much better than I have felt for YEARS! The girls up front also are very sweet and always made sure that all my appointments were set and on time! Heather the billing manager was very kind as well, she was AWESOME when it came to dealing with me and my insurance amd was definitely a huge help! I don't know what I would have done without Heather helping me with all of the insurance problems I had!!! She is the BEST, thank you Heather!! I would definitely recommend going to this clinic!!!!
## 4 I have to say.... This is by far the best Chiropractic place I've ever been to. The staff is super friendly and very professional. From the moment I walk in the door I get greeted by name . The Drs are amazing too. Love this place and I highly recommend them.
## 5 Dr. is my chiropractor and he is a fabulous individual. I've never waited more than few minutes for him to see me. The front team (Both ladies" are great with an outstanding care and smile. Thank you guys for all you do.
## 6 Many in our family have seen for chiropractic care. He is very warm and friendly, knowledgable, puts your mind at ease during his adjustments. He gives great explanations. Our 14yo son said, "he is really good at what he does and he is a good person." We all feel better after visiting him. Recommend him to everyone.
## rating
## 1 5
## 2 4
## 3 5
## 4 5
## 5 5
## 6 5
From this table, we are going to subset the reviews by rating by the user of 1 through 5.
rating1 <- subset(Lemma, Lemma$rating==1)
rating2 <- subset(Lemma, Lemma$rating==2)
rating3 <- subset(Lemma, Lemma$rating==3)
rating4 <- subset(Lemma, Lemma$rating==4)
rating5 <- subset(Lemma, Lemma$rating==5)
Lets create a directory for each rating.Erase the eval=FALsE, if you want to run this script. I already have the files.
dir.create('./rating1')
dir.create('./rating2')
dir.create('./rating3')
dir.create('./rating4')
dir.create('./rating5')
r1 <- as.character(rating1$lemmatizedReview)
setwd('./rating1')
for (j in 1:length(r1)){
write(r1[j], paste(paste('rating1',j, sep='.'), '.txt', sep=''))
}
setwd('../')
r2 <- as.character(rating2$lemmatizedReview)
setwd('./rating2')
for (j in 1:length(r2)){
write(r2[j], paste(paste('rating2',j, sep='.'), '.txt', sep=''))
}
setwd('../')
r3 <- as.character(rating3$lemmatizedReview)
setwd('./rating3')
for (j in 1:length(r3)){
write(r3[j], paste(paste('rating3',j, sep='.'), '.txt', sep=''))
}
setwd('../')
r4 <- as.character(rating4$lemmatizedReview)
setwd('./rating4')
for (j in 1:length(r4)){
write(r4[j], paste(paste('rating4',j, sep='.'), '.txt', sep=''))
}
setwd('../')
r5 <- as.character(rating5$lemmatizedReview)
setwd('./rating5')
for (j in 1:length(r5)){
write(r5[j], paste(paste('rating5',j, sep='.'), '.txt', sep=''))
}
setwd('../')
List the files in each folder rating 1-5.
list.files('./rating1')
## [1] "rating1.1.txt" "rating1.10.txt" "rating1.11.txt" "rating1.12.txt"
## [5] "rating1.13.txt" "rating1.14.txt" "rating1.15.txt" "rating1.16.txt"
## [9] "rating1.17.txt" "rating1.18.txt" "rating1.19.txt" "rating1.2.txt"
## [13] "rating1.20.txt" "rating1.21.txt" "rating1.22.txt" "rating1.23.txt"
## [17] "rating1.24.txt" "rating1.25.txt" "rating1.26.txt" "rating1.27.txt"
## [21] "rating1.28.txt" "rating1.29.txt" "rating1.3.txt" "rating1.30.txt"
## [25] "rating1.31.txt" "rating1.32.txt" "rating1.33.txt" "rating1.34.txt"
## [29] "rating1.35.txt" "rating1.36.txt" "rating1.37.txt" "rating1.38.txt"
## [33] "rating1.39.txt" "rating1.4.txt" "rating1.40.txt" "rating1.41.txt"
## [37] "rating1.42.txt" "rating1.43.txt" "rating1.44.txt" "rating1.45.txt"
## [41] "rating1.46.txt" "rating1.47.txt" "rating1.48.txt" "rating1.49.txt"
## [45] "rating1.5.txt" "rating1.50.txt" "rating1.51.txt" "rating1.52.txt"
## [49] "rating1.53.txt" "rating1.54.txt" "rating1.55.txt" "rating1.56.txt"
## [53] "rating1.57.txt" "rating1.58.txt" "rating1.59.txt" "rating1.6.txt"
## [57] "rating1.60.txt" "rating1.61.txt" "rating1.62.txt" "rating1.63.txt"
## [61] "rating1.64.txt" "rating1.65.txt" "rating1.66.txt" "rating1.67.txt"
## [65] "rating1.68.txt" "rating1.69.txt" "rating1.7.txt" "rating1.70.txt"
## [69] "rating1.71.txt" "rating1.72.txt" "rating1.73.txt" "rating1.74.txt"
## [73] "rating1.75.txt" "rating1.76.txt" "rating1.77.txt" "rating1.78.txt"
## [77] "rating1.79.txt" "rating1.8.txt" "rating1.80.txt" "rating1.81.txt"
## [81] "rating1.82.txt" "rating1.83.txt" "rating1.84.txt" "rating1.85.txt"
## [85] "rating1.86.txt" "rating1.87.txt" "rating1.88.txt" "rating1.9.txt"
list.files('./rating2')
## [1] "rating2.1.txt" "rating2.10.txt" "rating2.11.txt" "rating2.12.txt"
## [5] "rating2.13.txt" "rating2.14.txt" "rating2.15.txt" "rating2.16.txt"
## [9] "rating2.17.txt" "rating2.18.txt" "rating2.19.txt" "rating2.2.txt"
## [13] "rating2.20.txt" "rating2.21.txt" "rating2.22.txt" "rating2.23.txt"
## [17] "rating2.24.txt" "rating2.25.txt" "rating2.26.txt" "rating2.27.txt"
## [21] "rating2.28.txt" "rating2.29.txt" "rating2.3.txt" "rating2.30.txt"
## [25] "rating2.31.txt" "rating2.32.txt" "rating2.33.txt" "rating2.34.txt"
## [29] "rating2.4.txt" "rating2.5.txt" "rating2.6.txt" "rating2.7.txt"
## [33] "rating2.8.txt" "rating2.9.txt"
list.files('./rating3')
## [1] "rating3.1.txt" "rating3.10.txt" "rating3.11.txt" "rating3.12.txt"
## [5] "rating3.13.txt" "rating3.14.txt" "rating3.15.txt" "rating3.16.txt"
## [9] "rating3.17.txt" "rating3.18.txt" "rating3.19.txt" "rating3.2.txt"
## [13] "rating3.20.txt" "rating3.21.txt" "rating3.22.txt" "rating3.23.txt"
## [17] "rating3.24.txt" "rating3.25.txt" "rating3.26.txt" "rating3.27.txt"
## [21] "rating3.28.txt" "rating3.29.txt" "rating3.3.txt" "rating3.30.txt"
## [25] "rating3.31.txt" "rating3.32.txt" "rating3.33.txt" "rating3.34.txt"
## [29] "rating3.35.txt" "rating3.36.txt" "rating3.37.txt" "rating3.38.txt"
## [33] "rating3.39.txt" "rating3.4.txt" "rating3.40.txt" "rating3.41.txt"
## [37] "rating3.42.txt" "rating3.43.txt" "rating3.44.txt" "rating3.45.txt"
## [41] "rating3.46.txt" "rating3.47.txt" "rating3.48.txt" "rating3.49.txt"
## [45] "rating3.5.txt" "rating3.50.txt" "rating3.51.txt" "rating3.52.txt"
## [49] "rating3.53.txt" "rating3.54.txt" "rating3.6.txt" "rating3.7.txt"
## [53] "rating3.8.txt" "rating3.9.txt"
list.files('./rating4')
## [1] "rating4.1.txt" "rating4.10.txt" "rating4.100.txt" "rating4.101.txt"
## [5] "rating4.102.txt" "rating4.103.txt" "rating4.11.txt" "rating4.12.txt"
## [9] "rating4.13.txt" "rating4.14.txt" "rating4.15.txt" "rating4.16.txt"
## [13] "rating4.17.txt" "rating4.18.txt" "rating4.19.txt" "rating4.2.txt"
## [17] "rating4.20.txt" "rating4.21.txt" "rating4.22.txt" "rating4.23.txt"
## [21] "rating4.24.txt" "rating4.25.txt" "rating4.26.txt" "rating4.27.txt"
## [25] "rating4.28.txt" "rating4.29.txt" "rating4.3.txt" "rating4.30.txt"
## [29] "rating4.31.txt" "rating4.32.txt" "rating4.33.txt" "rating4.34.txt"
## [33] "rating4.35.txt" "rating4.36.txt" "rating4.37.txt" "rating4.38.txt"
## [37] "rating4.39.txt" "rating4.4.txt" "rating4.40.txt" "rating4.41.txt"
## [41] "rating4.42.txt" "rating4.43.txt" "rating4.44.txt" "rating4.45.txt"
## [45] "rating4.46.txt" "rating4.47.txt" "rating4.48.txt" "rating4.49.txt"
## [49] "rating4.5.txt" "rating4.50.txt" "rating4.51.txt" "rating4.52.txt"
## [53] "rating4.53.txt" "rating4.54.txt" "rating4.55.txt" "rating4.56.txt"
## [57] "rating4.57.txt" "rating4.58.txt" "rating4.59.txt" "rating4.6.txt"
## [61] "rating4.60.txt" "rating4.61.txt" "rating4.62.txt" "rating4.63.txt"
## [65] "rating4.64.txt" "rating4.65.txt" "rating4.66.txt" "rating4.67.txt"
## [69] "rating4.68.txt" "rating4.69.txt" "rating4.7.txt" "rating4.70.txt"
## [73] "rating4.71.txt" "rating4.72.txt" "rating4.73.txt" "rating4.74.txt"
## [77] "rating4.75.txt" "rating4.76.txt" "rating4.77.txt" "rating4.78.txt"
## [81] "rating4.79.txt" "rating4.8.txt" "rating4.80.txt" "rating4.81.txt"
## [85] "rating4.82.txt" "rating4.83.txt" "rating4.84.txt" "rating4.85.txt"
## [89] "rating4.86.txt" "rating4.87.txt" "rating4.88.txt" "rating4.89.txt"
## [93] "rating4.9.txt" "rating4.90.txt" "rating4.91.txt" "rating4.92.txt"
## [97] "rating4.93.txt" "rating4.94.txt" "rating4.95.txt" "rating4.96.txt"
## [101] "rating4.97.txt" "rating4.98.txt" "rating4.99.txt"
list.files('./rating5')
## [1] "rating5.1.txt" "rating5.10.txt" "rating5.100.txt" "rating5.101.txt"
## [5] "rating5.102.txt" "rating5.103.txt" "rating5.104.txt" "rating5.105.txt"
## [9] "rating5.106.txt" "rating5.107.txt" "rating5.108.txt" "rating5.109.txt"
## [13] "rating5.11.txt" "rating5.110.txt" "rating5.111.txt" "rating5.112.txt"
## [17] "rating5.113.txt" "rating5.114.txt" "rating5.115.txt" "rating5.116.txt"
## [21] "rating5.117.txt" "rating5.118.txt" "rating5.119.txt" "rating5.12.txt"
## [25] "rating5.120.txt" "rating5.121.txt" "rating5.122.txt" "rating5.123.txt"
## [29] "rating5.124.txt" "rating5.125.txt" "rating5.126.txt" "rating5.127.txt"
## [33] "rating5.128.txt" "rating5.129.txt" "rating5.13.txt" "rating5.130.txt"
## [37] "rating5.131.txt" "rating5.132.txt" "rating5.133.txt" "rating5.134.txt"
## [41] "rating5.135.txt" "rating5.136.txt" "rating5.137.txt" "rating5.138.txt"
## [45] "rating5.139.txt" "rating5.14.txt" "rating5.140.txt" "rating5.141.txt"
## [49] "rating5.142.txt" "rating5.143.txt" "rating5.144.txt" "rating5.145.txt"
## [53] "rating5.146.txt" "rating5.147.txt" "rating5.148.txt" "rating5.149.txt"
## [57] "rating5.15.txt" "rating5.150.txt" "rating5.151.txt" "rating5.152.txt"
## [61] "rating5.153.txt" "rating5.154.txt" "rating5.155.txt" "rating5.156.txt"
## [65] "rating5.157.txt" "rating5.158.txt" "rating5.159.txt" "rating5.16.txt"
## [69] "rating5.160.txt" "rating5.161.txt" "rating5.162.txt" "rating5.163.txt"
## [73] "rating5.164.txt" "rating5.165.txt" "rating5.166.txt" "rating5.167.txt"
## [77] "rating5.168.txt" "rating5.169.txt" "rating5.17.txt" "rating5.170.txt"
## [81] "rating5.171.txt" "rating5.172.txt" "rating5.173.txt" "rating5.174.txt"
## [85] "rating5.175.txt" "rating5.176.txt" "rating5.177.txt" "rating5.178.txt"
## [89] "rating5.179.txt" "rating5.18.txt" "rating5.180.txt" "rating5.181.txt"
## [93] "rating5.182.txt" "rating5.183.txt" "rating5.184.txt" "rating5.185.txt"
## [97] "rating5.186.txt" "rating5.187.txt" "rating5.188.txt" "rating5.189.txt"
## [101] "rating5.19.txt" "rating5.190.txt" "rating5.191.txt" "rating5.192.txt"
## [105] "rating5.193.txt" "rating5.194.txt" "rating5.195.txt" "rating5.196.txt"
## [109] "rating5.197.txt" "rating5.198.txt" "rating5.199.txt" "rating5.2.txt"
## [113] "rating5.20.txt" "rating5.200.txt" "rating5.201.txt" "rating5.202.txt"
## [117] "rating5.203.txt" "rating5.204.txt" "rating5.205.txt" "rating5.206.txt"
## [121] "rating5.207.txt" "rating5.208.txt" "rating5.209.txt" "rating5.21.txt"
## [125] "rating5.210.txt" "rating5.211.txt" "rating5.212.txt" "rating5.213.txt"
## [129] "rating5.214.txt" "rating5.215.txt" "rating5.216.txt" "rating5.217.txt"
## [133] "rating5.218.txt" "rating5.219.txt" "rating5.22.txt" "rating5.220.txt"
## [137] "rating5.221.txt" "rating5.222.txt" "rating5.223.txt" "rating5.224.txt"
## [141] "rating5.225.txt" "rating5.226.txt" "rating5.227.txt" "rating5.228.txt"
## [145] "rating5.229.txt" "rating5.23.txt" "rating5.230.txt" "rating5.231.txt"
## [149] "rating5.232.txt" "rating5.233.txt" "rating5.234.txt" "rating5.235.txt"
## [153] "rating5.236.txt" "rating5.237.txt" "rating5.238.txt" "rating5.239.txt"
## [157] "rating5.24.txt" "rating5.240.txt" "rating5.241.txt" "rating5.242.txt"
## [161] "rating5.243.txt" "rating5.244.txt" "rating5.245.txt" "rating5.246.txt"
## [165] "rating5.247.txt" "rating5.248.txt" "rating5.249.txt" "rating5.25.txt"
## [169] "rating5.250.txt" "rating5.251.txt" "rating5.252.txt" "rating5.253.txt"
## [173] "rating5.254.txt" "rating5.255.txt" "rating5.256.txt" "rating5.257.txt"
## [177] "rating5.258.txt" "rating5.259.txt" "rating5.26.txt" "rating5.260.txt"
## [181] "rating5.261.txt" "rating5.262.txt" "rating5.263.txt" "rating5.264.txt"
## [185] "rating5.265.txt" "rating5.266.txt" "rating5.267.txt" "rating5.268.txt"
## [189] "rating5.269.txt" "rating5.27.txt" "rating5.270.txt" "rating5.271.txt"
## [193] "rating5.272.txt" "rating5.273.txt" "rating5.274.txt" "rating5.275.txt"
## [197] "rating5.276.txt" "rating5.277.txt" "rating5.278.txt" "rating5.279.txt"
## [201] "rating5.28.txt" "rating5.280.txt" "rating5.281.txt" "rating5.282.txt"
## [205] "rating5.283.txt" "rating5.284.txt" "rating5.285.txt" "rating5.286.txt"
## [209] "rating5.287.txt" "rating5.288.txt" "rating5.289.txt" "rating5.29.txt"
## [213] "rating5.290.txt" "rating5.291.txt" "rating5.292.txt" "rating5.293.txt"
## [217] "rating5.294.txt" "rating5.295.txt" "rating5.296.txt" "rating5.297.txt"
## [221] "rating5.298.txt" "rating5.299.txt" "rating5.3.txt" "rating5.30.txt"
## [225] "rating5.300.txt" "rating5.301.txt" "rating5.302.txt" "rating5.303.txt"
## [229] "rating5.304.txt" "rating5.305.txt" "rating5.306.txt" "rating5.307.txt"
## [233] "rating5.308.txt" "rating5.309.txt" "rating5.31.txt" "rating5.310.txt"
## [237] "rating5.311.txt" "rating5.312.txt" "rating5.313.txt" "rating5.314.txt"
## [241] "rating5.315.txt" "rating5.316.txt" "rating5.317.txt" "rating5.318.txt"
## [245] "rating5.319.txt" "rating5.32.txt" "rating5.320.txt" "rating5.321.txt"
## [249] "rating5.322.txt" "rating5.323.txt" "rating5.324.txt" "rating5.325.txt"
## [253] "rating5.326.txt" "rating5.327.txt" "rating5.328.txt" "rating5.329.txt"
## [257] "rating5.33.txt" "rating5.330.txt" "rating5.331.txt" "rating5.332.txt"
## [261] "rating5.333.txt" "rating5.334.txt" "rating5.335.txt" "rating5.34.txt"
## [265] "rating5.35.txt" "rating5.36.txt" "rating5.37.txt" "rating5.38.txt"
## [269] "rating5.39.txt" "rating5.4.txt" "rating5.40.txt" "rating5.41.txt"
## [273] "rating5.42.txt" "rating5.43.txt" "rating5.44.txt" "rating5.45.txt"
## [277] "rating5.46.txt" "rating5.47.txt" "rating5.48.txt" "rating5.49.txt"
## [281] "rating5.5.txt" "rating5.50.txt" "rating5.51.txt" "rating5.52.txt"
## [285] "rating5.53.txt" "rating5.54.txt" "rating5.55.txt" "rating5.56.txt"
## [289] "rating5.57.txt" "rating5.58.txt" "rating5.59.txt" "rating5.6.txt"
## [293] "rating5.60.txt" "rating5.61.txt" "rating5.62.txt" "rating5.63.txt"
## [297] "rating5.64.txt" "rating5.65.txt" "rating5.66.txt" "rating5.67.txt"
## [301] "rating5.68.txt" "rating5.69.txt" "rating5.7.txt" "rating5.70.txt"
## [305] "rating5.71.txt" "rating5.72.txt" "rating5.73.txt" "rating5.74.txt"
## [309] "rating5.75.txt" "rating5.76.txt" "rating5.77.txt" "rating5.78.txt"
## [313] "rating5.79.txt" "rating5.8.txt" "rating5.80.txt" "rating5.81.txt"
## [317] "rating5.82.txt" "rating5.83.txt" "rating5.84.txt" "rating5.85.txt"
## [321] "rating5.86.txt" "rating5.87.txt" "rating5.88.txt" "rating5.89.txt"
## [325] "rating5.9.txt" "rating5.90.txt" "rating5.91.txt" "rating5.92.txt"
## [329] "rating5.93.txt" "rating5.94.txt" "rating5.95.txt" "rating5.96.txt"
## [333] "rating5.97.txt" "rating5.98.txt" "rating5.99.txt"
R1 <- Corpus(DirSource("rating1"))
R1
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 88
R1 <- tm_map(R1, removePunctuation)
R1 <- tm_map(R1, removeNumbers)
#R1 <- tm_map(R1, tolower) # I want to capture the emotion the users write with when All caps
#R1 <- tm_map(R1, removeWords, stopwords("english"))#Also the number of 'and's' and 'not' etc
R1 <- tm_map(R1, stripWhitespace)
#+R1 <- tm_map(R1, stemDocument)#we already lemmatized the document it is more robust to meaning
dtmR1 <- DocumentTermMatrix(R1)
freq <- colSums(as.matrix(dtmR1))
wordcloud(names(freq), freq, min.freq=30,colors=brewer.pal(3,'Dark2'))
freqR1 <- as.data.frame(colSums(as.matrix(dtmR1)))
colnames(freqR1) <- 'rating1'
freqR1$id <- row.names(freqR1)
FREQ_R1 <- freqR1[order(freqR1$rating1,decreasing=TRUE),]
row.names(FREQ_R1) <- NULL
head(FREQ_R1,50)
## rating1 id
## 1 555 the
## 2 454 and
## 3 174 for
## 4 161 have
## 5 144 that
## 6 128 this
## 7 120 they
## 8 106 not
## 9 85 you
## 10 78 with
## 11 76 but
## 12 75 get
## 13 74 time
## 14 69 out
## 15 68 there
## 16 65 tell
## 17 57 good
## 18 56 she
## 19 55 say
## 20 51 from
## 21 51 ask
## 22 50 day
## 23 50 place
## 24 49 our
## 25 48 here
## 26 48 all
## 27 47 what
## 28 47 can
## 29 45 come
## 30 45 didnt
## 31 45 service
## 32 44 back
## 33 44 its
## 34 43 even
## 35 42 just
## 36 42 make
## 37 42 would
## 38 41 her
## 39 41 because
## 40 41 when
## 41 39 after
## 42 39 their
## 43 37 cabana
## 44 36 experience
## 45 35 then
## 46 35 much
## 47 34 bad
## 48 34 customer
## 49 33 call
## 50 33 food
People in general, speaking as American born and raised, when angry usually speak out of anger and disappointment when feeling they have been had, taken, or in some way been victimized for not getting what they paid for when promised or convinced into getting that experience, purchase, feeling, etc. As you can see by not excluding the stop words we now have a count of the number of connections to persuade the reader that he or she was wronged or rewarded by the number of interjections. We learn this early in persuasive writing in grammar school to have at least five paragraphs to build a persuasive story with an introduction, three body paragraphs, and a conclusion, and again in English composition in lower level undergrad work. This means build on three points or perspectives in the body paragraphs to persuade the reader your right, and find them if they aren’t readily considered.
Lets do the same for the other four folders in getting our ordered word counts or frequencies.
R2 <- Corpus(DirSource("rating2"))
R2
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 34
R2 <- tm_map(R2, removePunctuation)
R2 <- tm_map(R2, removeNumbers)
R2 <- tm_map(R2, stripWhitespace)
dtmR2 <- DocumentTermMatrix(R2)
freq2 <- colSums(as.matrix(dtmR2))
wordcloud(names(freq2), freq2, min.freq=25,colors=brewer.pal(3,'Dark2'))
freqR2 <- as.data.frame(colSums(as.matrix(dtmR2)))
colnames(freqR2) <- 'rating2'
freqR2$id <- row.names(freqR2)
FREQ_R2 <- freqR2[order(freqR2$rating2,decreasing=TRUE),]
row.names(FREQ_R2) <- NULL
head(FREQ_R2,50)
## rating2 id
## 1 384 the
## 2 221 and
## 3 88 you
## 4 84 have
## 5 74 for
## 6 70 that
## 7 59 not
## 8 58 they
## 9 57 but
## 10 50 this
## 11 49 all
## 12 44 get
## 13 44 there
## 14 40 with
## 15 37 spa
## 16 36 just
## 17 33 experience
## 18 33 good
## 19 29 your
## 20 28 drink
## 21 28 time
## 22 27 make
## 23 27 day
## 24 26 can
## 25 26 would
## 26 26 people
## 27 25 its
## 28 25 check
## 29 25 our
## 30 24 tell
## 31 24 when
## 32 23 even
## 33 22 relax
## 34 22 come
## 35 22 like
## 36 22 line
## 37 22 pool
## 38 22 service
## 39 22 out
## 40 21 here
## 41 20 much
## 42 20 very
## 43 20 off
## 44 19 what
## 45 19 which
## 46 19 food
## 47 19 she
## 48 18 grotto
## 49 18 pay
## 50 18 place
R3 <- Corpus(DirSource("rating3"))
R3
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 54
R3 <- tm_map(R3, removePunctuation)
R3 <- tm_map(R3, removeNumbers)
R3 <- tm_map(R3, stripWhitespace)
dtmR3 <- DocumentTermMatrix(R3)
freq3 <- colSums(as.matrix(dtmR3))
wordcloud(names(freq3), freq3, min.freq=25,colors=brewer.pal(3,'Dark2'))
freqR3 <- as.data.frame(colSums(as.matrix(dtmR3)))
colnames(freqR3) <- 'rating3'
freqR3$id <- row.names(freqR3)
FREQ_R3 <- freqR3[order(freqR3$rating3,decreasing=TRUE),]
row.names(FREQ_R3) <- NULL
head(FREQ_R3,50)
## rating3 id
## 1 354 the
## 2 210 and
## 3 103 for
## 4 97 have
## 5 63 you
## 6 60 that
## 7 55 this
## 8 52 get
## 9 49 not
## 10 47 but
## 11 46 with
## 12 45 they
## 13 38 good
## 14 36 here
## 15 35 spa
## 16 34 its
## 17 33 just
## 18 32 time
## 19 32 like
## 20 30 cabana
## 21 29 all
## 22 28 much
## 23 28 very
## 24 28 pool
## 25 27 there
## 26 27 day
## 27 27 feel
## 28 26 out
## 29 25 can
## 30 24 come
## 31 23 would
## 32 23 our
## 33 23 place
## 34 22 about
## 35 22 love
## 36 21 when
## 37 21 crowd
## 38 20 because
## 39 20 really
## 40 20 from
## 41 20 price
## 42 19 little
## 43 19 over
## 44 19 check
## 45 19 their
## 46 18 one
## 47 18 wait
## 48 18 service
## 49 17 massage
## 50 17 people
R4 <- Corpus(DirSource("rating4"))
R4
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 103
R4 <- tm_map(R4, removePunctuation)
R4 <- tm_map(R4, removeNumbers)
R4 <- tm_map(R4, stripWhitespace)
dtmR4 <- DocumentTermMatrix(R4)
freq4 <- colSums(as.matrix(dtmR4))
wordcloud(names(freq4), freq4, min.freq=25,colors=brewer.pal(3,'Dark2'))
freqR4 <- as.data.frame(colSums(as.matrix(dtmR4)))
colnames(freqR4) <- 'rating4'
freqR4$id <- row.names(freqR4)
FREQ_R4 <- freqR4[order(freqR4$rating4,decreasing=TRUE),]
row.names(FREQ_R4) <- NULL
head(FREQ_R4,50)
## rating4 id
## 1 652 the
## 2 487 and
## 3 175 have
## 4 172 for
## 5 157 you
## 6 131 that
## 7 117 but
## 8 110 they
## 9 103 good
## 10 100 get
## 11 93 with
## 12 90 this
## 13 80 pool
## 14 79 not
## 15 75 day
## 16 74 your
## 17 73 time
## 18 70 there
## 19 65 like
## 20 65 much
## 21 64 all
## 22 64 can
## 23 61 here
## 24 57 massage
## 25 54 very
## 26 54 great
## 27 52 experience
## 28 50 its
## 29 50 mud
## 30 48 just
## 31 47 from
## 32 46 out
## 33 45 which
## 34 44 come
## 35 43 would
## 36 42 food
## 37 38 relax
## 38 38 spa
## 39 37 love
## 40 36 one
## 41 36 our
## 42 35 price
## 43 35 service
## 44 34 feel
## 45 34 little
## 46 34 thing
## 47 33 some
## 48 32 make
## 49 32 want
## 50 32 too
R5 <- Corpus(DirSource("rating5"))
R5
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 335
R5 <- tm_map(R5, removePunctuation)
R5 <- tm_map(R5, removeNumbers)
R5 <- tm_map(R5, stripWhitespace)
dtmR5 <- DocumentTermMatrix(R5)
freq5 <- colSums(as.matrix(dtmR5))
wordcloud(names(freq5), freq5, min.freq=75,colors=brewer.pal(3,'Dark2'))
freqR5 <- as.data.frame(colSums(as.matrix(dtmR5)))
colnames(freqR5) <- 'rating5'
freqR5$id <- row.names(freqR5)
FREQ_R5 <- freqR5[order(freqR5$rating5,decreasing=TRUE),]
row.names(FREQ_R5) <- NULL
head(FREQ_R5,50)
## rating5 id
## 1 1246 and
## 2 1180 the
## 3 491 have
## 4 384 for
## 5 272 good
## 6 260 you
## 7 213 this
## 8 213 with
## 9 207 they
## 10 205 that
## 11 183 very
## 12 178 get
## 13 152 staff
## 14 150 great
## 15 137 all
## 16 137 time
## 17 135 but
## 18 135 massage
## 19 130 come
## 20 128 place
## 21 127 here
## 22 122 can
## 23 117 much
## 24 113 love
## 25 110 make
## 26 109 day
## 27 108 feel
## 28 105 always
## 29 104 back
## 30 104 recommend
## 31 92 amaze
## 32 91 experience
## 33 90 pain
## 34 89 your
## 35 88 from
## 36 88 there
## 37 85 friendly
## 38 84 also
## 39 83 service
## 40 82 not
## 41 80 really
## 42 80 his
## 43 79 just
## 44 77 pool
## 45 75 out
## 46 72 about
## 47 71 will
## 48 71 when
## 49 70 like
## 50 70 year
Lets add a feature for the ratio of word frequencies to the number of documents in the reviews with each rating 1-5.
l1 <- length(list.files('./rating1'))
l2 <- length(list.files('./rating2'))
l3 <- length(list.files('./rating3'))
l4 <- length(list.files('./rating4'))
l5 <- length(list.files('./rating5'))
FREQ_R1$termTotalFilesRatio <-FREQ_R1$rating1/l1
FREQ_R2$termTotalFilesRatio <- FREQ_R2$rating2/l2
FREQ_R3$termTotalFilesRatio <- FREQ_R3$rating3/l3
FREQ_R4$termTotalFilesRatio <- FREQ_R4$rating4/l4
FREQ_R5$termTotalFilesRatio <- FREQ_R5$rating5/l5
FREQ_R1$termTotalTermsRatio <-FREQ_R1$rating1/length(FREQ_R1$id)
FREQ_R2$termTotalTermsRatio <- FREQ_R2$rating2/length(FREQ_R2$id)
FREQ_R3$termTotalTermsRatio <- FREQ_R3$rating3/length(FREQ_R3$id)
FREQ_R4$termTotalTermsRatio <- FREQ_R4$rating4/length(FREQ_R4$id)
FREQ_R5$termTotalTermsRatio <- FREQ_R5$rating5/length(FREQ_R5$id)
Lets change the column names of each rating table.
colnames(FREQ_R1) <-c('Rating1termfrequency',
'term',
'Rating1_termTotalFilesRatio',
'Rating1_termTotalTermsRatio')
colnames(FREQ_R2) <-c('Rating2termfrequency',
'term',
'Rating2_termTotalFilesRatio',
'Rating2_termTotalTermsRatio')
colnames(FREQ_R3) <-c('Rating3termfrequency',
'term',
'Rating3_termTotalFilesRatio',
'Rating3_termTotalTermsRatio')
colnames(FREQ_R4) <-c('Rating4termfrequency',
'term',
'Rating4_termTotalFilesRatio',
'Rating4_termTotalTermsRatio')
colnames(FREQ_R5) <-c('Rating5termfrequency',
'term',
'Rating5_termTotalFilesRatio',
'Rating5_termTotalTermsRatio')
Lets now combine all these term frequencies.
m1 <- merge(FREQ_R1,FREQ_R2, by.x='term', by.y='term', all=TRUE)
m2 <- merge(m1,FREQ_R3, by.x='term', by.y='term', all=TRUE)
m3 <- merge(m2,FREQ_R4, by.x='term', by.y='term', all=TRUE)
m4 <- merge(m3,FREQ_R5, by.x='term', by.y='term', all=TRUE)
allTerms <- m4 %>% select(term,Rating1termfrequency,Rating2termfrequency,
Rating3termfrequency,Rating4termfrequency,
Rating5termfrequency,Rating1_termTotalFilesRatio,
Rating2_termTotalFilesRatio,Rating3_termTotalFilesRatio,
Rating4_termTotalFilesRatio,Rating5_termTotalFilesRatio,
everything())
colnames(allTerms)
## [1] "term" "Rating1termfrequency"
## [3] "Rating2termfrequency" "Rating3termfrequency"
## [5] "Rating4termfrequency" "Rating5termfrequency"
## [7] "Rating1_termTotalFilesRatio" "Rating2_termTotalFilesRatio"
## [9] "Rating3_termTotalFilesRatio" "Rating4_termTotalFilesRatio"
## [11] "Rating5_termTotalFilesRatio" "Rating1_termTotalTermsRatio"
## [13] "Rating2_termTotalTermsRatio" "Rating3_termTotalTermsRatio"
## [15] "Rating4_termTotalTermsRatio" "Rating5_termTotalTermsRatio"
Lets add a median field for the word in each rating to this table.
allTerms$MedianCount <- apply(allTerms[2:6],1,median, na.rm=TRUE)
medianRating1 <- apply(allTerms[2],2,median,na.rm=TRUE)
medianRating2 <- apply(allTerms[3],2,median,na.rm=TRUE)
medianRating3 <- apply(allTerms[4],2,median,na.rm=TRUE)
medianRating4 <- apply(allTerms[5],2,median,na.rm=TRUE)
medianRating5 <- apply(allTerms[6],2,median,na.rm=TRUE)
meanRating1 <- floor(apply(allTerms[2],2,mean,na.rm=TRUE))
meanRating2 <- floor(apply(allTerms[3],2,mean,na.rm=TRUE))
meanRating3 <- floor(apply(allTerms[4],2,mean,na.rm=TRUE))
meanRating4 <- floor(apply(allTerms[5],2,mean,na.rm=TRUE))
meanRating5 <- floor(apply(allTerms[6],2,mean,na.rm=TRUE))
allTerms2 <- allTerms[order(allTerms$MedianCount,decreasing=TRUE),]
Lets add a bottom and top percentile to this table based on the terms in each rating subset.
allTerms2$Quantile5_R1 <- ifelse(allTerms2$Rating1termfrequency <=
quantile(allTerms2$Rating1termfrequency, .05,na.rm=TRUE),
1,0)
allTerms2$Quantile95_R1 <- ifelse(allTerms2$Rating1termfrequency >=
quantile(allTerms2$Rating1termfrequency, .95,na.rm=TRUE),
1,0)
allTerms2$Quantile5_R2 <- ifelse(allTerms2$Rating2termfrequency <=
quantile(allTerms2$Rating2termfrequency, .05,na.rm=TRUE),
1,0)
allTerms2$Quantile95_R2 <- ifelse(allTerms2$Rating2termfrequency >=
quantile(allTerms2$Rating2termfrequency, .95,na.rm=TRUE),
1,0)
allTerms2$Quantile5_R3 <- ifelse(allTerms2$Rating3termfrequency <=
quantile(allTerms2$Rating3termfrequency, .05,na.rm=TRUE),
1,0)
allTerms2$Quantile95_R3 <- ifelse(allTerms2$Rating3termfrequency >=
quantile(allTerms2$Rating3termfrequency, .95,na.rm=TRUE),
1,0)
allTerms2$Quantile5_R4 <- ifelse(allTerms2$Rating4termfrequency <=
quantile(allTerms2$Rating4termfrequency, .05,na.rm=TRUE),
1,0)
allTerms2$Quantile95_R4 <- ifelse(allTerms2$Rating4termfrequency >=
quantile(allTerms2$Rating4termfrequency, .95,na.rm=TRUE),
1,0)
allTerms2$Quantile5_R5 <- ifelse(allTerms2$Rating5termfrequency <=
quantile(allTerms2$Rating5termfrequency, .05,na.rm=TRUE),
1,0)
allTerms2$Quantile95_R5 <- ifelse(allTerms2$Rating5termfrequency >=
quantile(allTerms2$Rating5termfrequency, .95,na.rm=TRUE),
1,0)
We have to keep this data wide, but it is useful to filter by, and extracting those words more used in each rating for each review.
goodGreat <- subset(allTerms2, allTerms2$Quantile95_R5==1 &
allTerms2$Quantile95_R4==1 &
allTerms2$Rating5termfrequency > allTerms2$MedianCount &
allTerms2$Rating4termfrequency > allTerms2$MedianCount |
allTerms2$Quantile5_R5==1 &
allTerms2$Quantile5_R4==1 |
allTerms2$Rating5termfrequency > meanRating5 |
allTerms2$Rating4termfrequency > meanRating4
)
average <- subset(allTerms2, allTerms2$Quantile95_R3==1 &
allTerms2$Quantile95_R2==1 &
allTerms2$Rating3termfrequency > allTerms2$MedianCount &
allTerms2$Rating2termfrequency > allTerms2$MedianCount |
allTerms2$Quantile5_R3==1 &
allTerms2$Quantile5_R2==1 |
allTerms2$Rating3termfrequency > meanRating3 |
allTerms2$Rating2termfrequency > meanRating2
)
poor <- subset(allTerms2, allTerms2$Quantile95_R1==1 &
allTerms2$Quantile95_R2==1 &
allTerms2$Rating1termfrequency > allTerms2$MedianCount &
allTerms2$Rating2termfrequency > allTerms2$MedianCount |
allTerms2$Quantile5_R1==1 &
allTerms2$Quantile5_R2==1 |
allTerms2$Rating1termfrequency > meanRating1 |
allTerms2$Rating2termfrequency > meanRating2
)
Here is bar chart of the word counts for the poor ratings.
wf <- data.frame(word=poor$term, freq=poor$Rating1termfrequency)
p <- ggplot(subset(wf, freq>60), aes(word, freq))
p <- p + geom_bar(stat= 'identity')
p <- p + theme(axis.text.x=element_text(angle=90, hjust=1))
p
Lets make a word cloud of each of these data tables terms by weights of the lowest for poor, highest for average, and highest for goodGreat The NAs have to be removed before using word cloud.
poorNA <- poor[complete.cases(poor$Rating1termfrequency),]
Poor1 <- as.data.frame(t(poorNA$Rating1termfrequency))
colnames(Poor1) <- poorNA$term
Poor1 <- Poor1 %>% select(-and,-the)
freqPoor <- colSums(as.matrix(Poor1))
wordcloud(names(freqPoor), freqPoor, min.freq=20,
colors=brewer.pal(6,'Dark2'))
wordcloud(names(freqPoor), freqPoor, min.freq=25,colors=brewer.pal(3,'Dark2'))
Here is bar chart of the word counts for the average ratings.
wf <- data.frame(word=average$term, freq=average$Rating3termfrequency)
p <- ggplot(subset(wf, freq>25), aes(word, freq))
p <- p + geom_bar(stat= 'identity')
p <- p + theme(axis.text.x=element_text(angle=90, hjust=1))
p
Lets make a word cloud.
avgNA <- average[complete.cases(average$Rating3termfrequency),]
Avg1 <- as.data.frame(t(avgNA$Rating3termfrequency))
colnames(Avg1) <- avgNA$term
Avg1 <- Avg1 %>% select(-and,-the)
freqAvg <- colSums(as.matrix(Avg1))
wordcloud(names(freqAvg), freqAvg, min.freq=20,
colors=brewer.pal(6,'Dark2'))
wordcloud(names(freqAvg), freqAvg, min.freq=25,colors=brewer.pal(3,'Dark2'))
Here is bar chart of the word counts for the good or great ratings.
wf <- data.frame(word=goodGreat$term, freq=goodGreat$Rating5termfrequency)
p <- ggplot(subset(wf, freq>70), aes(word, freq))
p <- p + geom_bar(stat= 'identity')
p <- p + theme(axis.text.x=element_text(angle=90, hjust=1))
p
Lets make a word cloud.
grtNA <- goodGreat[complete.cases(goodGreat$Rating5termfrequency),]
Grt1 <- as.data.frame(t(grtNA$Rating5termfrequency))
colnames(Grt1) <- grtNA$term
Grt1 <- Grt1[,-c(1:4)] #remove the first 4 words, (the,and,for,have)
freqGrt <- colSums(as.matrix(Grt1))
wordcloud(names(freqGrt), freqGrt, min.freq=80,
colors=brewer.pal(6,'Dark2'))
wordcloud(names(freqGrt), freqGrt, min.freq=95,colors=brewer.pal(3,'Dark2'))
That was a great way to look at the word clouds of these ratings and the words in each set of words in the top and bottom 5th percentiles as well as higher than the median or mean values. The last couple of word plots I removed the interjection words at the top of the list. Otherwise, you would have seen and,the,for, and have. But the for is a keyword and the data table had to be sliced instead of deselecting those words.
We still haven’t done predictive analytics to predict the rating by the review. We will do that next. I would also like to create a visNetwork of these words, with the ratings, and the business type these words are associated with.
The way that sentiment analysis works is to build the document term matrix (dtm) of counts based on the reveiws, and use those counts of words and given ratings to determine the best fit from any particular algorithm that can predict a review as being a specific rating. We have the dtms of all five of our ratings. But we don’t have anything set up manually to count all those words from every review or at least any keywords to build those models in predicting our reviews. Normally, you have each row in a dtm is a review, and the columns are each specific word, and evertime that word is found, the word will be added to its last count to get a final count of each word per document. We could do something like this based on our key words.
We could also quickly jump over to python and wait a bit in running the datatable we cleaned up into a bunch of algorithms like random forest, decision trees, generalized linear models, boosted trees, naive bayes, etc. Or we could look up the text mining and natural language processing packages in the libraries we attached to this document or add to as needed.
Since this document has been manual from the beginning by cleaning up and extracting features from the reviews. We could just use those features, instead of the words, or we could pick a handful of words, even stopwords, that our program will count in each review, and use as features to predict the reviews with what we already know how to do from previous work in github and rpubs.
Lets look again at the features we do have from our big cleaned up table.
colnames(Reviews13)
## [1] "userReviewSeries" "userReviewOnlyContent" "userRatingSeries"
## [4] "userRatingValue" "businessReplied" "businessReplyContent"
## [7] "userReviewContent" "LowAvgHighCost" "businessType"
## [10] "cityState" "friends" "reviews"
## [13] "photos" "eliteStatus" "userName"
## [16] "Date" "userBusinessPhotos" "userCheckIns"
Our target variable would be the 4th column feature above called userRatingValue. We can keep every feature column except the 7th for userReveiwContent that is not our cleaned up review feature and Date. Although, we could get the day of the date feature, because that might have a value added benefit to predicting the rating from these reviews. We also don’t need the business Reply Content and won’t need the user reviews cleaned up as a predictor once we extract the keyword counts we want. We will just use the words we saw from our word clouds above for a poor, average, or great review subsets. Lets keep the top 10 from each, including the stopwords.
The poor ratings keywords are for ratings of 1 or 2.
KW_poor <- poor %>% select(term,Rating1termfrequency,Rating2termfrequency)
KW_poor$medianLowRate <- apply(KW_poor[2:3],1,median, na.rm=TRUE)
keywords_low <- KW_poor[order(KW_poor$medianLowRate,decreasing=TRUE)[1:10],]
keywords_low
## term Rating1termfrequency Rating2termfrequency medianLowRate
## 3199 the 555 384 469.5
## 141 and 454 221 337.5
## 1254 for 174 74 124.0
## 1441 have 161 84 122.5
## 3197 that 144 70 107.0
## 3213 they 120 58 89.0
## 3222 this 128 50 89.0
## 3625 you 85 88 86.5
## 2123 not 106 59 82.5
## 482 but 76 57 66.5
We can now use these as our poor keywords.
low_keys <- as.data.frame(t(keywords_low$medianLowRate))
colnames(low_keys) <- keywords_low$term
row.names(low_keys) <- 'lowRating'
low_keys
## the and for have that they this you not but
## lowRating 469.5 337.5 124 122.5 107 89 89 86.5 82.5 66.5
And these are our average rating keywords. from the median of 2-4 ratings.
KW_avg <- average %>% select(term,Rating2termfrequency,Rating3termfrequency,
Rating4termfrequency)
KW_avg$medianAvgRate <- apply(KW_avg[2:4],1,median, na.rm=TRUE)
keywords_avg <- KW_avg[order(KW_avg$medianAvgRate,decreasing=TRUE)[1:10],]
keywords_avg
## term Rating2termfrequency Rating3termfrequency Rating4termfrequency
## 3199 the 384 354 652
## 141 and 221 210 487
## 1254 for 74 103 172
## 1441 have 84 97 175
## 3625 you 88 63 157
## 3197 that 70 60 131
## 2123 not 59 49 79
## 3213 they 58 45 110
## 482 but 57 47 117
## 3222 this 50 55 90
## medianAvgRate
## 3199 384
## 141 221
## 1254 103
## 1441 97
## 3625 88
## 3197 70
## 2123 59
## 3213 58
## 482 57
## 3222 55
The keywords for the average ratings is a median value of the ratings 2 through 4.
avg_keys <- as.data.frame(t(keywords_avg$medianAvgRate))
colnames(avg_keys) <- keywords_avg$term
row.names(avg_keys) <- 'avgRating'
avg_keys
## the and for have you that not they but this
## avgRating 384 221 103 97 88 70 59 58 57 55
Lets get our great ratings as the median of the 4-5 ratings.
KW_grt <- goodGreat %>% select(term,Rating5termfrequency,Rating4termfrequency)
KW_grt$medianGrtRate <- apply(KW_grt[2:3],1,median, na.rm=TRUE)
keywords_grt <- KW_grt[order(KW_grt$medianGrtRate,decreasing=TRUE)[1:10],]
keywords_grt
## term Rating5termfrequency Rating4termfrequency medianGrtRate
## 3199 the 1180 652 916.0
## 141 and 1246 487 866.5
## 1441 have 491 175 333.0
## 1254 for 384 172 278.0
## 3625 you 260 157 208.5
## 1358 good 272 103 187.5
## 3197 that 205 131 168.0
## 3213 they 207 110 158.5
## 3572 with 213 93 153.0
## 3222 this 213 90 151.5
The keywords for the great ratings is a median value of the ratings 4 through 5.
grt_keys <- as.data.frame(t(keywords_grt$medianGrtRate))
colnames(grt_keys) <- keywords_grt$term
row.names(grt_keys) <- 'grtRating'
grt_keys
## the and have for you good that they with this
## grtRating 916 866.5 333 278 208.5 187.5 168 158.5 153 151.5
Now lets combine these tables.
j1 <- full_join(grt_keys,avg_keys)
## Joining, by = c("the", "and", "have", "for", "you", "that", "they", "this")
j1
## the and have for you good that they with this not but
## 1 916 866.5 333 278 208.5 187.5 168 158.5 153 151.5 NA NA
## 2 384 221.0 97 103 88.0 NA 70 58.0 NA 55.0 59 57
j2 <- full_join(low_keys,j1)
## Joining, by = c("the", "and", "for", "have", "that", "they", "this", "you",
## "not", "but")
row.names(j2) <- c('low','great','average')
j2
## the and for have that they this you not but good with
## low 469.5 337.5 124 122.5 107 89.0 89.0 86.5 82.5 66.5 NA NA
## great 916.0 866.5 278 333.0 168 158.5 151.5 208.5 NA NA 187.5 153
## average 384.0 221.0 103 97.0 70 58.0 55.0 88.0 59.0 57.0 NA NA
Lets fill in these words manually with their median values. Optionally, we could just take the complete.cases of this table and find those words to use as features.
j2$not[2] <- KW_grt[grep('^not$',KW_grt$term),4]
j2$but[2] <- KW_grt[grep('^but$',KW_grt$term),4]
j2$good[1] <- KW_poor[grep('^good$',KW_poor$term),4]
j2$good[3] <- KW_avg[grep('^good$',KW_avg$term),4]
j2$with[1] <- KW_poor[grep('^with$',KW_poor$term),4]
j2$with[3] <- KW_avg[grep('^with$',KW_poor$term),4]
j2
## the and for have that they this you not but good with
## low 469.5 337.5 124 122.5 107 89.0 89.0 86.5 82.5 66.5 45.0 59
## great 916.0 866.5 278 333.0 168 158.5 151.5 208.5 80.5 126.0 187.5 153
## average 384.0 221.0 103 97.0 70 58.0 55.0 88.0 59.0 57.0 103.0 93
But these values are for counts out of the entire count of reviews for each rating. So we should divide each value by the total number of documents to get a ratio or the values in each rating as low, great, or average.
keys_t <- as.data.frame(t(j2))
keys_t
## low great average
## the 469.5 916.0 384
## and 337.5 866.5 221
## for 124.0 278.0 103
## have 122.5 333.0 97
## that 107.0 168.0 70
## they 89.0 158.5 58
## this 89.0 151.5 55
## you 86.5 208.5 88
## not 82.5 80.5 59
## but 66.5 126.0 57
## good 45.0 187.5 103
## with 59.0 153.0 93
s1 <- sum(Reviews13$userRatingValue==1)+sum(Reviews13$userRatingValue==2)
s2 <- sum(Reviews13$userRatingValue==2)+sum(Reviews13$userRatingValue==3)+
sum(Reviews13$userRatingValue==4)
s3 <- sum(Reviews13$userRatingValue==4)+sum(Reviews13$userRatingValue==5)
keys_t$low <- round(((keys_t$low)/s1),2)
keys_t$great <- round(((keys_t$great)/s3),2)
keys_t$average <- round(((keys_t$average)/s2),2)
keys_t
## low great average
## the 3.85 2.09 2.01
## and 2.77 1.98 1.16
## for 1.02 0.63 0.54
## have 1.00 0.76 0.51
## that 0.88 0.38 0.37
## they 0.73 0.36 0.30
## this 0.73 0.35 0.29
## you 0.71 0.48 0.46
## not 0.68 0.18 0.31
## but 0.55 0.29 0.30
## good 0.37 0.43 0.54
## with 0.48 0.35 0.49
The above table is for document term frequency on average that is how many times the term shows up in a single document by category of low, average, or great rating. We made these tables earlier, FREQ_R1, …,FREQ_R5.
What about the ratio for the term against the number in terms in total for all ratings? Lets put that table together.
termKeys <- as.data.frame(row.names(keys_t))
colnames(termKeys) <- 'term'
tk1 <- merge(termKeys, FREQ_R1, by.x='term', by.y='term')
tk2 <- merge(tk1,FREQ_R2, by.x='term', by.y='term')
tk3 <- merge(tk2, FREQ_R3, by.x='term', by.y='term')
tk4 <- merge(tk3, FREQ_R4, by.x='term', by.y='term')
tk5 <- merge(tk4, FREQ_R5, by.x='term', by.y='term')
tk5$Rating1_totalTerms <- sum(FREQ_R1$Rating1termfrequency)
tk5$Rating2_totalTerms <- sum(FREQ_R2$Rating2termfrequency)
tk5$Rating3_totalTerms <- sum(FREQ_R3$Rating3termfrequency)
tk5$Rating4_totalTerms <- sum(FREQ_R4$Rating4termfrequency)
tk5$Rating5_totalTerms <- sum(FREQ_R5$Rating5termfrequency)
#these are total terms over all by rating, not unique terms
tk5$Rating1_term2totalTerm <- tk5$Rating1termfrequency/tk5$Rating1_totalTerms
tk5$Rating2_term2totalTerm <- tk5$Rating2termfrequency/tk5$Rating2_totalTerms
tk5$Rating3_term2totalTerm <- tk5$Rating3termfrequency/tk5$Rating3_totalTerms
tk5$Rating4_term2totalTerm <- tk5$Rating4termfrequency/tk5$Rating4_totalTerms
tk5$Rating5_term2totalTerm <- tk5$Rating5termfrequency/tk5$Rating5_totalTerms
termToTotalTerms <- tk5 %>% select(term,Rating1_term2totalTerm,
Rating2_term2totalTerm,
Rating3_term2totalTerm,
Rating4_term2totalTerm,
Rating5_term2totalTerm)
term_to_totalTerms <- round(termToTotalTerms[,2:6],3)
row.names(term_to_totalTerms) <- termToTotalTerms$term
wordToAllWords <- as.data.frame(t(term_to_totalTerms))
wordToAllWords
## and but for good have not that the they
## Rating1_term2totalTerm 0.046 0.008 0.018 0.006 0.016 0.011 0.015 0.057 0.012
## Rating2_term2totalTerm 0.042 0.011 0.014 0.006 0.016 0.011 0.013 0.073 0.011
## Rating3_term2totalTerm 0.043 0.010 0.021 0.008 0.020 0.010 0.012 0.072 0.009
## Rating4_term2totalTerm 0.046 0.011 0.016 0.010 0.016 0.007 0.012 0.061 0.010
## Rating5_term2totalTerm 0.059 0.006 0.018 0.013 0.023 0.004 0.010 0.056 0.010
## this with you
## Rating1_term2totalTerm 0.013 0.008 0.009
## Rating2_term2totalTerm 0.010 0.008 0.017
## Rating3_term2totalTerm 0.011 0.009 0.013
## Rating4_term2totalTerm 0.008 0.009 0.015
## Rating5_term2totalTerm 0.010 0.010 0.012
This table is the total word ratio to all words (not unique words) in each subset of ratings 1-5. Lets write this last table out to csv. We will use it later, and this script will be a long one, with manu objects.
write.csv(wordToAllWords,'wordToAllWords.csv', row.names=TRUE)
Once we get our counts of each word in each review, we can compare it to these words and see if it appears in the document this percent of the time to aid in classifying each review into the correct rating.
Lets use the stringr library’s function str_match_all function. Lets clean up the first observation and store it as a string. Then we will use str_match_all to find the exact number of times each keyword is in the review. and put it in our table.
str1 <- as.character(paste(Reviews13$userReviewOnlyContent[1]))
str1 <- gsub('[!|.|,|\n|\']',' ',str1,perl=TRUE)
str1 <- gsub('[ ]',' ',str1)
str1 <- trimws(str1, which=c('both'), whitespace='[\t\r\n ]')
totalTerms <- length((strsplit(str1, split=' ')[[1]]))
keys <- row.names(keys_t)
and <- str_match_all(str1,' [aA][nN][dD] ')
AND <- length(and[[1]])
the <- str_match_all(str1,' [tT][hH][eE] ')
THE <- length(the[[1]])
for1 <- str_match_all(str1,' [fF][oO][rR] ')
FOR1 <- length(for1[[1]])
have <- str_match_all(str1,' [hH][aA][vV][eE] ')
HAVE <- length(have[[1]])
that <- str_match_all(str1,' [tT][hH][aA][tT] ')
THAT <- length(that[[1]])
they <- str_match_all(str1,' [tT][hH][eE][yY] ')
THEY <- length(they[[1]])
this <- str_match_all(str1,' [tT][hH][iI][sS] ')
THIS <- length(this[[1]])
you <- str_match_all(str1,' [yY][oO][uU] ')
YOU <- length(you[[1]])
not <- str_match_all(str1,' [nN][oO][tT] ')
NOT <- length(not[[1]])
but <- str_match_all(str1,' [bB][uU][tT] ')
BUT <- length(but[[1]])
good <- str_match_all(str1,' [gG][oO][oO][dD] ')
GOOD <- length(good[[1]])
with <- str_match_all(str1,' [wW][iI][tT][hH] ')
WITH <- length(with[[1]])
values <- as.data.frame(c(THE,AND,FOR1,HAVE,THAT,THEY,THIS,YOU,NOT,BUT,GOOD,WITH))
row.names(values) <- keys
keyValues <- as.data.frame(t(values))
keyValues2 <- keyValues/totalTerms
keyValues3 <- rbind(keyValues,keyValues2)
row.names(keyValues3) <- c('documentTermCount','term_to_totalDocumentTerms')
keyValues3 <- round(keyValues3,3)
keyValues3
## the and for have that they this you
## documentTermCount 15.000 5.000 3.000 4.000 4.000 3.000 1.000 2.000
## term_to_totalDocumentTerms 0.055 0.018 0.011 0.015 0.015 0.011 0.004 0.007
## not but good with
## documentTermCount 1.000 1.000 2.000 0
## term_to_totalDocumentTerms 0.004 0.004 0.007 0
Join this table to the wordToAllWords table using dplyr’s full join function.
joinKeys <- full_join(wordToAllWords,keyValues3)
## Joining, by = c("and", "but", "for", "good", "have", "not", "that", "the",
## "they", "this", "with", "you")
r1 <- row.names(wordToAllWords)
r2 <- row.names(keyValues3)
names <- c(r1,r2)
row.names(joinKeys) <- names
joinKeys
## and but for good have not that the
## Rating1_term2totalTerm 0.046 0.008 0.018 0.006 0.016 0.011 0.015 0.057
## Rating2_term2totalTerm 0.042 0.011 0.014 0.006 0.016 0.011 0.013 0.073
## Rating3_term2totalTerm 0.043 0.010 0.021 0.008 0.020 0.010 0.012 0.072
## Rating4_term2totalTerm 0.046 0.011 0.016 0.010 0.016 0.007 0.012 0.061
## Rating5_term2totalTerm 0.059 0.006 0.018 0.013 0.023 0.004 0.010 0.056
## documentTermCount 5.000 1.000 3.000 2.000 4.000 1.000 4.000 15.000
## term_to_totalDocumentTerms 0.018 0.004 0.011 0.007 0.015 0.004 0.015 0.055
## they this with you
## Rating1_term2totalTerm 0.012 0.013 0.008 0.009
## Rating2_term2totalTerm 0.011 0.010 0.008 0.017
## Rating3_term2totalTerm 0.009 0.011 0.009 0.013
## Rating4_term2totalTerm 0.010 0.008 0.009 0.015
## Rating5_term2totalTerm 0.010 0.010 0.010 0.012
## documentTermCount 3.000 1.000 0.000 2.000
## term_to_totalDocumentTerms 0.011 0.004 0.000 0.007
Looking at the table above, we can use the term_to_totalDocumentTerms values of this observation compared to the ratios of the term2totalTerm ratings for each of these 12 words, and choose the rating with the lowest difference or distance between, then to add up the votes for ratings 1-5 for all 12 choices. There should be a clear winner in this algorithm of selecting or predicting the sentiment rating. So, lets try it out.
and_diff <- joinKeys$and[1:5]-joinKeys$and[7]
but_diff <- joinKeys$but[1:5]-joinKeys$but[7]
for_diff <- joinKeys[1:5,3]-joinKeys[7,3]
good_diff <- joinKeys$good[1:5]-joinKeys$good[7]
have_diff <- joinKeys$have[1:5]-joinKeys$have[7]
not_diff <- joinKeys$not[1:5]-joinKeys$not[7]
that_diff <- joinKeys$that[1:5]-joinKeys$that[7]
the_diff <- joinKeys$the[1:5]-joinKeys$the[7]
they_diff <- joinKeys$they[1:5]-joinKeys$they[7]
this_diff <- joinKeys$this[1:5]-joinKeys$this[7]
with_diff <- joinKeys$with[1:5]-joinKeys$with[7]
you_diff <- joinKeys$you[1:5]-joinKeys$you[7]
diff <- as.data.frame(t(cbind(and_diff, but_diff, for_diff, good_diff, have_diff, not_diff,
that_diff, the_diff, they_diff, this_diff, with_diff, you_diff)))
colnames(diff) <- r1
diff$minValue <- apply(diff,1, min)
diff$vote <- ifelse(diff$Rating1_term2totalTerm==diff$minValue,
1,
ifelse(diff$Rating2_term2totalTerm==diff$minValue,
2,
ifelse(diff$Rating3_term2totalTerm==diff$minValue,
3,
ifelse(diff$Rating4_term2totalTerm==diff$minValue,
4,
5)
)
)
)
diff$minValue2 <- ifelse(abs(diff$minValue)>abs(diff$Rating1_term2totalTerm),
diff$Rating1_term2totalTerm,
ifelse(abs(diff$minValue)>abs(diff$Rating2_term2totalTerm),
diff$Rating2_term2totalTerm,
ifelse(abs(diff$minValue)>abs(diff$Rating3_term2totalTerm),
diff$Rating3_term2totalTerm,
ifelse(abs(diff$minValue)>abs(diff$Rating4_term2totalTerm),
diff$Rating4_term2totalTerm,
ifelse(abs(diff$minValue)>abs(diff$Rating5_term2totalTerm),
diff$Rating5_term2totalTerm,
diff$minValue)
)
)
)
)
diff$vote2 <- ifelse(diff$Rating1_term2totalTerm==diff$minValue2,
1,
ifelse(diff$Rating2_term2totalTerm==diff$minValue2,
2,
ifelse(diff$Rating3_term2totalTerm==diff$minValue2,
3,
ifelse(diff$Rating4_term2totalTerm==diff$minValue2,
4,
5)
)
)
)
diff
## Rating1_term2totalTerm Rating2_term2totalTerm Rating3_term2totalTerm
## and_diff 0.028 0.024 0.025
## but_diff 0.004 0.007 0.006
## for_diff 0.007 0.003 0.010
## good_diff -0.001 -0.001 0.001
## have_diff 0.001 0.001 0.005
## not_diff 0.007 0.007 0.006
## that_diff 0.000 -0.002 -0.003
## the_diff 0.002 0.018 0.017
## they_diff 0.001 0.000 -0.002
## this_diff 0.009 0.006 0.007
## with_diff 0.008 0.008 0.009
## you_diff 0.002 0.010 0.006
## Rating4_term2totalTerm Rating5_term2totalTerm minValue vote minValue2
## and_diff 0.028 0.041 0.024 2 0.024
## but_diff 0.007 0.002 0.002 5 0.002
## for_diff 0.005 0.007 0.003 2 0.003
## good_diff 0.003 0.006 -0.001 1 -0.001
## have_diff 0.001 0.008 0.001 1 0.001
## not_diff 0.003 0.000 0.000 5 0.000
## that_diff -0.003 -0.005 -0.005 5 0.000
## the_diff 0.006 0.001 0.001 5 0.001
## they_diff -0.001 -0.001 -0.002 3 0.001
## this_diff 0.004 0.006 0.004 4 0.004
## with_diff 0.009 0.010 0.008 1 0.008
## you_diff 0.008 0.005 0.002 1 0.002
## vote2
## and_diff 2
## but_diff 5
## for_diff 2
## good_diff 1
## have_diff 1
## not_diff 5
## that_diff 1
## the_diff 5
## they_diff 1
## this_diff 4
## with_diff 1
## you_diff 1
There is actually a tie between the review being a 5 or a 1 when using vote 1 that takes the minimum value that includes very negative values. We need to make a rule for when this happens. How about try out for if there is a tie, the best of the median rounded up or the mean rounded down. There is also a vote2 field that takes the shortest distance to the review ratio out of each review and votes for that review. Lets see the results of the first vote with only the minimum.
bestVote <- diff %>% group_by(vote) %>% count()
bestVote$maxVote <- ifelse(bestVote$n==max(bestVote$n),
1,0)
bestVote$ratingMean <- ifelse(sum(bestVote$maxVote) > 1,
ifelse(ceiling(mean(bestVote$vote*bestVote$n))>5,
5, ceiling(mean(bestVote$vote*bestVote$n))),
ifelse(bestVote$n==max(bestVote$n),
bestVote$vote,
0)
)
bestVote$ratingMedian <- ifelse(sum(bestVote$maxVote) > 1,
ifelse(ceiling(median(bestVote$vote*bestVote$n))>5,
5,ceiling(median(bestVote$vote*bestVote$n))),
ifelse(bestVote$n==max(bestVote$n),
bestVote$vote,
0)
)
max(bestVote$ratingMean)
## [1] 5
max(bestVote$ratingMedian)
## [1] 4
bestVote
## # A tibble: 5 x 5
## # Groups: vote [5]
## vote n maxVote ratingMean ratingMedian
## <dbl> <int> <dbl> <dbl> <dbl>
## 1 1 4 1 5 4
## 2 2 2 0 5 4
## 3 3 1 0 5 4
## 4 4 1 0 5 4
## 5 5 4 1 5 4
From the above table, it identified a tie in votes, and calculated the mean and medians of the votes*the count for each vote as a dot product. The mean is actually 7, so a constraint was also placed or wrapped around the ceiling of the mean if it is greater than our highest rating, that it be the highest rating. Same for the median. Lets use Vote2 which takes the shortest distance from the term to Total Term frequency ratio of the review to each ratings term to Total Term frequency ratio. We could choose to accept the mean driven vote of 5 or median driven vote of 4.But lets see how vote2 measures in for predicting most likely reveiw.
bestVote2 <- diff %>% group_by(vote2) %>% count()
bestVote2$maxVote2 <- ifelse(bestVote2$n==max(bestVote2$n),
1,0)
bestVote2$ratingMean2 <- ifelse(sum(bestVote2$maxVote2) > 1,
ifelse(ceiling(mean(bestVote2$vote2*bestVote2$n))>5,
5, ceiling(mean(bestVote2$vote2*bestVote2$n))),
ifelse(bestVote2$n==max(bestVote2$n),
bestVote2$vote2,
0)
)
bestVote2$ratingMedian2 <- ifelse(sum(bestVote2$maxVote2) > 1,
ifelse(ceiling(median(bestVote2$vote2*bestVote2$n))>5,
5,ceiling(median(bestVote2$vote2*bestVote2$n))),
ifelse(bestVote2$n==max(bestVote2$n),
bestVote2$vote2,
0)
)
max(bestVote2$ratingMean2)
## [1] 1
max(bestVote2$ratingMedian2)
## [1] 1
bestVote2
## # A tibble: 4 x 5
## # Groups: vote2 [4]
## vote2 n maxVote2 ratingMean2 ratingMedian2
## <dbl> <int> <dbl> <dbl> <dbl>
## 1 1 6 1 1 1
## 2 2 2 0 1 1
## 3 4 1 0 1 1
## 4 5 3 0 1 1
When using the shortest distance between the ratio of term to total terms in the review, instead of the minimum distance, the highest votes were not a tie, but for a 1 rating.
Lets see what this rating is. The string object was taken from the first review of the business.
Reviews13[1,]
## userReviewSeries
## 1 mostRecentVisit_review
## userReviewOnlyContent
## 1 What a wonderful way to start the year! This was my second time back to HIGH END SPA, and we had a great time. The crowds were very low (seriously, it felt like we had the place to ourselves most of the day.) We walked right into the mineral baths, club mud, and didn't wait in any kind of line for lunch. None of the pools were crowded, and we were even able to enjoy one of the hammocks in the secret garden.\n\nTiffany at the front check-in desk went above and beyond for us regarding the robes. I had requested a plus-sized robe, since after my last review I knew they had added some to their collection. Unfortunately, all of their plus-sized robes were still dirty from the day before. Tiffany was so accommodating, though! She was able to get us robes from the cabana area that fit me perfectly! It is so great to know that not only do they now offer guests of all sizes the option to enjoy a warm robe, but that they really want to make sure you have a good day. Thank you, Tiffany, for everything.\n\nAll of the staff today were in good spirits. The only thing that would have made today better would have been a massage. We'll have to book one next time. My husband and I are going to make HIGH END SPA our annual New Year's Day tradition!\n\n
## userRatingSeries userRatingValue businessReplied
## 1 mostRecentVisit_rating 5 yes
## businessReplyContent
## 1 Amber P. of HIGH END SPA Hot Springs\n\nBusiness Customer Service\n\n1/2/20191/15/2018-\nHi Michelle, HIGH END SPA is proud to welcome men and women of all shapes and sizes. In response to your day, we are now in the process of ordering a few XL robes so we can continue to have offerings for all of our guests. I wanted to reach out to you to let you know we have sent you a private message as we would like to connect with you directly. Thank you again for communicating your concern with us.\nAlexa Gallegos\n\n1/2/2019 -\n\nHi Michelle,\nI am so happy to hear that you had a great returning experience! Our team members do the best they can to accommodate all of our guests needs and we are very glad to hear you were happy with the solution.\nWe hope to see you and your husband again!\n\nBest,\nAmber Peyghambari\n\nRead less\n
## userReviewContent
## 1 1/1/2019Updated review\n 2 photos\n\nWhat a wonderful way to start the year! This was my second time back to HIGH END SPA, and we had a great time. The crowds were very low (seriously, it felt like we had the place to ourselves most of the day.) We walked right into the mineral baths, club mud, and didn't wait in any kind of line for lunch. None of the pools were crowded, and we were even able to enjoy one of the hammocks in the secret garden.\n\nTiffany at the front check-in desk went above and beyond for us regarding the robes. I had requested a plus-sized robe, since after my last review I knew they had added some to their collection. Unfortunately, all of their plus-sized robes were still dirty from the day before. Tiffany was so accommodating, though! She was able to get us robes from the cabana area that fit me perfectly! It is so great to know that not only do they now offer guests of all sizes the option to enjoy a warm robe, but that they really want to make sure you have a good day. Thank you, Tiffany, for everything.\n\nAll of the staff today were in good spirits. The only thing that would have made today better would have been a massage. We'll have to book one next time. My husband and I are going to make HIGH END SPA our annual New Year's Day tradition!\n\nComment from Amber P. of HIGH END SPA Hot Springs\n\nBusiness Customer Service\n\n1/2/20191/15/2018-\nHi Michelle, HIGH END SPA is proud to welcome men and women of all shapes and sizes. In response to your day, we are now in the process of ordering a few XL robes so we can continue to have offerings for all of our guests. I wanted to reach out to you to let you know we have sent you a private message as we would like to connect with you directly. Thank you again for communicating your concern with us.\nAlexa Gallegos\n\n1/2/2019 -\n\nHi Michelle,\nI am so happy to hear that you had a great returning experience! Our team members do the best they can to accommodate all of our guests needs and we are very glad to hear you were happy with the solution.\nWe hope to see you and your husband again!\n\nBest,\nAmber Peyghambari\n\nRead less\n
## LowAvgHighCost businessType cityState friends reviews photos
## 1 High high end massage retreat Orange, CA 26 33 21
## eliteStatus userName Date userBusinessPhotos userCheckIns
## 1 <NA> Michelle A. 2019-01-01 2 NA
It turns out using the highest vote of the shortest distance is not the best in predicting the sentiment, but it looks like the ceiling of the mean of the dot product when a tie based on selecting the minimum value from the difference between ratios in each rating of word to total words in the document subset by rating is best when using the ratio of each review as a ratio of document term to total terms in the document. Lets re-run this script with a different review now and compare results. We only used the words within each rating ratio and most frequent within a broad category of low (1s and 2s), average(2s,3s,and 4s), and great(4s and 5s) only in selecting best keywords by median ratios for each of those three categories for describing each review rating. Also, the stopwords were not excluded like they normally are or in some cases they are.
We are going to test out this algorithm using only the top 12 keywords of three groups but keeping the five ratings to predict by vote. Our best prediction the first run was on breaking a tie with the ceiling of the mean or the highest value of the vote if it is the highest. Lets use the 2nd review this time.
str1 <- as.character(paste(Reviews13$userReviewOnlyContent[2]))
str1 <- gsub('[!|.|,|\n|\']',' ',str1,perl=TRUE)
str1 <- gsub('[ ]',' ',str1)
str1 <- trimws(str1, which=c('both'), whitespace='[\t\r\n ]')
totalTerms <- length((strsplit(str1, split=' ')[[1]]))
keys <- row.names(keys_t)
and <- str_match_all(str1,' [aA][nN][dD] ')
AND <- length(and[[1]])
the <- str_match_all(str1,' [tT][hH][eE] ')
THE <- length(the[[1]])
for1 <- str_match_all(str1,' [fF][oO][rR] ')
FOR1 <- length(for1[[1]])
have <- str_match_all(str1,' [hH][aA][vV][eE] ')
HAVE <- length(have[[1]])
that <- str_match_all(str1,' [tT][hH][aA][tT] ')
THAT <- length(that[[1]])
they <- str_match_all(str1,' [tT][hH][eE][yY] ')
THEY <- length(they[[1]])
this <- str_match_all(str1,' [tT][hH][iI][sS] ')
THIS <- length(this[[1]])
you <- str_match_all(str1,' [yY][oO][uU] ')
YOU <- length(you[[1]])
not <- str_match_all(str1,' [nN][oO][tT] ')
NOT <- length(not[[1]])
but <- str_match_all(str1,' [bB][uU][tT] ')
BUT <- length(but[[1]])
good <- str_match_all(str1,' [gG][oO][oO][dD] ')
GOOD <- length(good[[1]])
with <- str_match_all(str1,' [wW][iI][tT][hH] ')
WITH <- length(with[[1]])
values <- as.data.frame(c(THE,AND,FOR1,HAVE,THAT,THEY,THIS,YOU,NOT,BUT,GOOD,WITH))
row.names(values) <- keys
keyValues <- as.data.frame(t(values))
keyValues2 <- keyValues/totalTerms
keyValues3 <- rbind(keyValues,keyValues2)
row.names(keyValues3) <- c('documentTermCount','term_to_totalDocumentTerms')
keyValues3 <- round(keyValues3,3)
keyValues3
## the and for have that they this you
## documentTermCount 24.00 17.000 5.000 3.000 3.000 3.000 3.000 3.000
## term_to_totalDocumentTerms 0.04 0.028 0.008 0.005 0.005 0.005 0.005 0.005
## not but good with
## documentTermCount 1.000 6.00 0 3.000
## term_to_totalDocumentTerms 0.002 0.01 0 0.005
Join this table to the wordToAllWords table using dplyr’s full join function.
joinKeys <- full_join(wordToAllWords,keyValues3)
## Joining, by = c("and", "but", "for", "good", "have", "not", "that", "the",
## "they", "this", "with", "you")
r1 <- row.names(wordToAllWords)
r2 <- row.names(keyValues3)
names <- c(r1,r2)
row.names(joinKeys) <- names
joinKeys
## and but for good have not that the
## Rating1_term2totalTerm 0.046 0.008 0.018 0.006 0.016 0.011 0.015 0.057
## Rating2_term2totalTerm 0.042 0.011 0.014 0.006 0.016 0.011 0.013 0.073
## Rating3_term2totalTerm 0.043 0.010 0.021 0.008 0.020 0.010 0.012 0.072
## Rating4_term2totalTerm 0.046 0.011 0.016 0.010 0.016 0.007 0.012 0.061
## Rating5_term2totalTerm 0.059 0.006 0.018 0.013 0.023 0.004 0.010 0.056
## documentTermCount 17.000 6.000 5.000 0.000 3.000 1.000 3.000 24.000
## term_to_totalDocumentTerms 0.028 0.010 0.008 0.000 0.005 0.002 0.005 0.040
## they this with you
## Rating1_term2totalTerm 0.012 0.013 0.008 0.009
## Rating2_term2totalTerm 0.011 0.010 0.008 0.017
## Rating3_term2totalTerm 0.009 0.011 0.009 0.013
## Rating4_term2totalTerm 0.010 0.008 0.009 0.015
## Rating5_term2totalTerm 0.010 0.010 0.010 0.012
## documentTermCount 3.000 3.000 3.000 3.000
## term_to_totalDocumentTerms 0.005 0.005 0.005 0.005
Looking at the table above, we can use the term_to_totalDocumentTerms values of this observation compared to the ratios of the term2totalTerm ratings for each of these 12 words, and choose the rating with the lowest difference or distance between, then to add up the votes for ratings 1-5 for all 12 choices. There should be a clear winner in this algorithm of selecting or predicting the sentiment rating. So, lets try it out.
and_diff <- joinKeys$and[1:5]-joinKeys$and[7]
but_diff <- joinKeys$but[1:5]-joinKeys$but[7]
for_diff <- joinKeys[1:5,3]-joinKeys[7,3]
good_diff <- joinKeys$good[1:5]-joinKeys$good[7]
have_diff <- joinKeys$have[1:5]-joinKeys$have[7]
not_diff <- joinKeys$not[1:5]-joinKeys$not[7]
that_diff <- joinKeys$that[1:5]-joinKeys$that[7]
the_diff <- joinKeys$the[1:5]-joinKeys$the[7]
they_diff <- joinKeys$they[1:5]-joinKeys$they[7]
this_diff <- joinKeys$this[1:5]-joinKeys$this[7]
with_diff <- joinKeys$with[1:5]-joinKeys$with[7]
you_diff <- joinKeys$you[1:5]-joinKeys$you[7]
diff <- as.data.frame(t(cbind(and_diff, but_diff, for_diff, good_diff, have_diff, not_diff,
that_diff, the_diff, they_diff, this_diff, with_diff, you_diff)))
colnames(diff) <- r1
diff$minValue <- apply(diff,1, min)
diff$vote <- ifelse(diff$Rating1_term2totalTerm==diff$minValue,
1,
ifelse(diff$Rating2_term2totalTerm==diff$minValue,
2,
ifelse(diff$Rating3_term2totalTerm==diff$minValue,
3,
ifelse(diff$Rating4_term2totalTerm==diff$minValue,
4,
5)
)
)
)
diff$minValue2 <- ifelse(abs(diff$minValue)>abs(diff$Rating1_term2totalTerm),
diff$Rating1_term2totalTerm,
ifelse(abs(diff$minValue)>abs(diff$Rating2_term2totalTerm),
diff$Rating2_term2totalTerm,
ifelse(abs(diff$minValue)>abs(diff$Rating3_term2totalTerm),
diff$Rating3_term2totalTerm,
ifelse(abs(diff$minValue)>abs(diff$Rating4_term2totalTerm),
diff$Rating4_term2totalTerm,
ifelse(abs(diff$minValue)>abs(diff$Rating5_term2totalTerm),
diff$Rating5_term2totalTerm,
diff$minValue)
)
)
)
)
diff$vote2 <- ifelse(diff$Rating1_term2totalTerm==diff$minValue2,
1,
ifelse(diff$Rating2_term2totalTerm==diff$minValue2,
2,
ifelse(diff$Rating3_term2totalTerm==diff$minValue2,
3,
ifelse(diff$Rating4_term2totalTerm==diff$minValue2,
4,
5)
)
)
)
diff
## Rating1_term2totalTerm Rating2_term2totalTerm Rating3_term2totalTerm
## and_diff 0.018 0.014 0.015
## but_diff -0.002 0.001 0.000
## for_diff 0.010 0.006 0.013
## good_diff 0.006 0.006 0.008
## have_diff 0.011 0.011 0.015
## not_diff 0.009 0.009 0.008
## that_diff 0.010 0.008 0.007
## the_diff 0.017 0.033 0.032
## they_diff 0.007 0.006 0.004
## this_diff 0.008 0.005 0.006
## with_diff 0.003 0.003 0.004
## you_diff 0.004 0.012 0.008
## Rating4_term2totalTerm Rating5_term2totalTerm minValue vote minValue2
## and_diff 0.018 0.031 0.014 2 0.014
## but_diff 0.001 -0.004 -0.004 5 -0.002
## for_diff 0.008 0.010 0.006 2 0.006
## good_diff 0.010 0.013 0.006 1 0.006
## have_diff 0.011 0.018 0.011 1 0.011
## not_diff 0.005 0.002 0.002 5 0.002
## that_diff 0.007 0.005 0.005 5 0.005
## the_diff 0.021 0.016 0.016 5 0.016
## they_diff 0.005 0.005 0.004 3 0.004
## this_diff 0.003 0.005 0.003 4 0.003
## with_diff 0.004 0.005 0.003 1 0.003
## you_diff 0.010 0.007 0.004 1 0.004
## vote2
## and_diff 2
## but_diff 1
## for_diff 2
## good_diff 1
## have_diff 1
## not_diff 5
## that_diff 5
## the_diff 5
## they_diff 3
## this_diff 4
## with_diff 1
## you_diff 1
We need to make a rule for when this happens. How about try out for if there is a tie, the best of the median rounded up or the mean rounded down. There is also a vote2 field that takes the shortest distance to the review ratio out of each review and votes for that review. Lets see the results of the first vote with only the minimum.
bestVote <- diff %>% group_by(vote) %>% count()
bestVote$maxVote <- ifelse(bestVote$n==max(bestVote$n),
1,0)
bestVote$ratingMean <- ifelse(sum(bestVote$maxVote) > 1,
ifelse(ceiling(mean(bestVote$vote*bestVote$n))>5,
5, ceiling(mean(bestVote$vote*bestVote$n))),
ifelse(bestVote$n==max(bestVote$n),
bestVote$vote,
0)
)
bestVote$ratingMedian <- ifelse(sum(bestVote$maxVote) > 1,
ifelse(ceiling(median(bestVote$vote*bestVote$n))>5,
5,ceiling(median(bestVote$vote*bestVote$n))),
ifelse(bestVote$n==max(bestVote$n),
bestVote$vote,
0)
)
max(bestVote$ratingMean)
## [1] 5
max(bestVote$ratingMedian)
## [1] 4
Our best algorithm selected 5 as the best vote, the first run of this program it was a 5 and the mean rating won that prediction.
bestVote
## # A tibble: 5 x 5
## # Groups: vote [5]
## vote n maxVote ratingMean ratingMedian
## <dbl> <int> <dbl> <dbl> <dbl>
## 1 1 4 1 5 4
## 2 2 2 0 5 4
## 3 3 1 0 5 4
## 4 4 1 0 5 4
## 5 5 4 1 5 4
Lets see how vote2 measures in for predicting most likely reveiw.
bestVote2 <- diff %>% group_by(vote2) %>% count()
bestVote2$maxVote2 <- ifelse(bestVote2$n==max(bestVote2$n),
1,0)
bestVote2$ratingMean2 <- ifelse(sum(bestVote2$maxVote2) > 1,
ifelse(ceiling(mean(bestVote2$vote2*bestVote2$n))>5,
5, ceiling(mean(bestVote2$vote2*bestVote2$n))),
ifelse(bestVote2$n==max(bestVote2$n),
bestVote2$vote2,
0)
)
bestVote2$ratingMedian2 <- ifelse(sum(bestVote2$maxVote2) > 1,
ifelse(ceiling(median(bestVote2$vote2*bestVote2$n))>5,
5,ceiling(median(bestVote2$vote2*bestVote2$n))),
ifelse(bestVote2$n==max(bestVote2$n),
bestVote2$vote2,
0)
)
max(bestVote2$ratingMean2)
## [1] 1
max(bestVote2$ratingMedian2)
## [1] 1
bestVote2
## # A tibble: 5 x 5
## # Groups: vote2 [5]
## vote2 n maxVote2 ratingMean2 ratingMedian2
## <dbl> <int> <dbl> <dbl> <dbl>
## 1 1 5 1 1 1
## 2 2 2 0 1 1
## 3 3 1 0 1 1
## 4 4 1 0 1 1
## 5 5 3 0 1 1
Well they have the same results almost for both vote and vote2, except that vote2 this time included all ratings, the first run the rating 3 had no votes, but the vote is the same as a 1 rating, while the ceiling of the mean is a 5 rating, and the ceiling of the median is a 4 rating.
Lets see what this rating is. The string object was taken from the first review of the business.
Reviews13[2,]
## userReviewSeries
## 2 mostRecentVisit_review
## userReviewOnlyContent
## 2 My sister and I brought my mom here for her birthday and overall, we really enjoyed our time here. We're used to going to Korean spas, but this was definitely an upgrade.\n\nPROS:\n- The resort itself is beautiful and so relaxing. Like seriously such a pleasing escape from reality that I needed. It's set up so nicely and feels very luxurious.\n- It was my mom's birthday so she received free admission on birthday with a purchase of a service. Admission is $52, so she booked a manicure for $50 and got in for free. WORTH. My mom had gone 52 years without ever getting her nails done, so it was kind of heartwarming to see how much she loved her experience.\n- The three of us took a Yin Yoga class and really enjoyed it. We definitely want to take advantage of the other class options next time we come.\n- CLUB MUD. We had so much fun there and even made a little clay sculpture. It really does do wonders for your skin, and the area is suprisingly very well-kept.\n- The shower and locker facilities can get pretty crowded, but overall, they are super nice and clean. They have an ample amount of showers, so we didn't have to wait at all.\n- All the staff seemed really friendly and helpful. There's always staff members roaming around, so you always feel somewhat taken care of.\n- I really appreciated the towel and water stands located throughout the resort. So handy and necessary.\n- Parking is free, thank God.\n\nCONS:\n- We went on a fairly cold day (around 60 degrees), so the hot pools were CROWDED,. Like there were a couple of times I touched other people's body parts I definitely did not want to touch. I feel like some of the hot pools exceeded capacity, and I'm sure it was mostly because it was a cold day, but I do wish there were more of the hot pools or they should just be larger!\n- The food is incredibly expensive. Like as ridiculous as Disneyland, which is saying something. Plan to spend around $20 per meal per person. The one thing that was worth it was the nachos ($16 for the small, but this thing is huge).\n- The kitchen moves VERY SLOWLY. Especially the salad section because I came before the lunch rush and still waited 20 minutes to order my salad. The kitchen staff seems a bit incompetent, or maybe it's just run inefficiently.\n- This is more of a side note, but I wish there was a more streamlined reservation system. I made the entire reservation over the phone, which was fine, but it wasn't laid out as clearly as I would have liked it with the premium admissions prices, services, etc. The online one also just seemed really confusing.\n\nOverall, we had a positive experience with just a couple of kinks here and there. We love that there's just a lot to do here and time FLIES when you're here so come as early as you can. We definitely want to try coming back in the summer months when it's warmer!\n\n\n
## userRatingSeries userRatingValue businessReplied
## 2 mostRecentVisit_rating 4 yes
## businessReplyContent
## 2 Amber P. of HIGH END SPA Hot Springs\n\nBusiness Customer Service\n\n3/25/2019Hi Cathy,\n\nThank you for taking the time to share your experience with us. We are happy to hear that you enjoyed your day at HIGH END SPA. We appreciate all feedback and will share these concerns with our team. We hope to see you back this summer!\n\nWith kind,\nAmber Peyghambari\n
## userReviewContent
## 2 3/24/2019\n 12 photos\n\nMy sister and I brought my mom here for her birthday and overall, we really enjoyed our time here. We're used to going to Korean spas, but this was definitely an upgrade.\n\nPROS:\n- The resort itself is beautiful and so relaxing. Like seriously such a pleasing escape from reality that I needed. It's set up so nicely and feels very luxurious.\n- It was my mom's birthday so she received free admission on birthday with a purchase of a service. Admission is $52, so she booked a manicure for $50 and got in for free. WORTH. My mom had gone 52 years without ever getting her nails done, so it was kind of heartwarming to see how much she loved her experience.\n- The three of us took a Yin Yoga class and really enjoyed it. We definitely want to take advantage of the other class options next time we come.\n- CLUB MUD. We had so much fun there and even made a little clay sculpture. It really does do wonders for your skin, and the area is suprisingly very well-kept.\n- The shower and locker facilities can get pretty crowded, but overall, they are super nice and clean. They have an ample amount of showers, so we didn't have to wait at all.\n- All the staff seemed really friendly and helpful. There's always staff members roaming around, so you always feel somewhat taken care of.\n- I really appreciated the towel and water stands located throughout the resort. So handy and necessary.\n- Parking is free, thank God.\n\nCONS:\n- We went on a fairly cold day (around 60 degrees), so the hot pools were CROWDED,. Like there were a couple of times I touched other people's body parts I definitely did not want to touch. I feel like some of the hot pools exceeded capacity, and I'm sure it was mostly because it was a cold day, but I do wish there were more of the hot pools or they should just be larger!\n- The food is incredibly expensive. Like as ridiculous as Disneyland, which is saying something. Plan to spend around $20 per meal per person. The one thing that was worth it was the nachos ($16 for the small, but this thing is huge).\n- The kitchen moves VERY SLOWLY. Especially the salad section because I came before the lunch rush and still waited 20 minutes to order my salad. The kitchen staff seems a bit incompetent, or maybe it's just run inefficiently.\n- This is more of a side note, but I wish there was a more streamlined reservation system. I made the entire reservation over the phone, which was fine, but it wasn't laid out as clearly as I would have liked it with the premium admissions prices, services, etc. The online one also just seemed really confusing.\n\nOverall, we had a positive experience with just a couple of kinks here and there. We love that there's just a lot to do here and time FLIES when you're here so come as early as you can. We definitely want to try coming back in the summer months when it's warmer!\n\n\nComment from Amber P. of HIGH END SPA Hot Springs\n\nBusiness Customer Service\n\n3/25/2019Hi Cathy,\n\nThank you for taking the time to share your experience with us. We are happy to hear that you enjoyed your day at HIGH END SPA. We appreciate all feedback and will share these concerns with our team. We hope to see you back this summer!\n\nWith kind,\nAmber Peyghambari\n
## LowAvgHighCost businessType cityState friends reviews
## 2 High high end massage retreat Los Angeles, CA 894 311
## photos eliteStatus userName Date userBusinessPhotos userCheckIns
## 2 1187 Elite '2020 Cathy P. 2019-03-24 NA NA
This time, the ceiling of the median using the minimum difference between the document to corpus of each rating ratios of term to total terms in the document versus term to total terms within all documents in each rating.
We should try another, maybe a review closer to the tail to see if the minimum distance is still the best, but choosing mean or median is still a fixer upper. We still haven’t used the Reviews13 regular features we spent some time extracting and adding to base what the review’s rating will be. Also, adding a visNetwork link analysis plot to show how the ratings and keywords look or link to each other by weight as the term to total terms ratio, or forgetting these keywords and using the top full join keywords by frequency in each rating.
Lets re-run this script on another review closer to the tail to see how the results are predicted.
str1 <- as.character(paste(Reviews13$userReviewOnlyContent[600]))
str1 <- gsub('[!|.|,|\n|\']',' ',str1,perl=TRUE)
str1 <- gsub('[ ]',' ',str1)
str1 <- trimws(str1, which=c('both'), whitespace='[\t\r\n ]')
totalTerms <- length((strsplit(str1, split=' ')[[1]]))
keys <- row.names(keys_t)
and <- str_match_all(str1,' [aA][nN][dD] ')
AND <- length(and[[1]])
the <- str_match_all(str1,' [tT][hH][eE] ')
THE <- length(the[[1]])
for1 <- str_match_all(str1,' [fF][oO][rR] ')
FOR1 <- length(for1[[1]])
have <- str_match_all(str1,' [hH][aA][vV][eE] ')
HAVE <- length(have[[1]])
that <- str_match_all(str1,' [tT][hH][aA][tT] ')
THAT <- length(that[[1]])
they <- str_match_all(str1,' [tT][hH][eE][yY] ')
THEY <- length(they[[1]])
this <- str_match_all(str1,' [tT][hH][iI][sS] ')
THIS <- length(this[[1]])
you <- str_match_all(str1,' [yY][oO][uU] ')
YOU <- length(you[[1]])
not <- str_match_all(str1,' [nN][oO][tT] ')
NOT <- length(not[[1]])
but <- str_match_all(str1,' [bB][uU][tT] ')
BUT <- length(but[[1]])
good <- str_match_all(str1,' [gG][oO][oO][dD] ')
GOOD <- length(good[[1]])
with <- str_match_all(str1,' [wW][iI][tT][hH] ')
WITH <- length(with[[1]])
values <- as.data.frame(c(THE,AND,FOR1,HAVE,THAT,THEY,THIS,YOU,NOT,BUT,GOOD,WITH))
row.names(values) <- keys
keyValues <- as.data.frame(t(values))
keyValues2 <- keyValues/totalTerms
keyValues3 <- rbind(keyValues,keyValues2)
row.names(keyValues3) <- c('documentTermCount','term_to_totalDocumentTerms')
keyValues3 <- round(keyValues3,3)
keyValues3
## the and for have that they this you not but
## documentTermCount 2.000 4.000 2.000 1.000 1.000 0 0 0 0 0
## term_to_totalDocumentTerms 0.031 0.062 0.031 0.015 0.015 0 0 0 0 0
## good with
## documentTermCount 0 0
## term_to_totalDocumentTerms 0 0
Join this table to the wordToAllWords table using dplyr’s full join function.
joinKeys <- full_join(wordToAllWords,keyValues3)
## Joining, by = c("and", "but", "for", "good", "have", "not", "that", "the",
## "they", "this", "with", "you")
r1 <- row.names(wordToAllWords)
r2 <- row.names(keyValues3)
names <- c(r1,r2)
row.names(joinKeys) <- names
joinKeys
## and but for good have not that the
## Rating1_term2totalTerm 0.046 0.008 0.018 0.006 0.016 0.011 0.015 0.057
## Rating2_term2totalTerm 0.042 0.011 0.014 0.006 0.016 0.011 0.013 0.073
## Rating3_term2totalTerm 0.043 0.010 0.021 0.008 0.020 0.010 0.012 0.072
## Rating4_term2totalTerm 0.046 0.011 0.016 0.010 0.016 0.007 0.012 0.061
## Rating5_term2totalTerm 0.059 0.006 0.018 0.013 0.023 0.004 0.010 0.056
## documentTermCount 4.000 0.000 2.000 0.000 1.000 0.000 1.000 2.000
## term_to_totalDocumentTerms 0.062 0.000 0.031 0.000 0.015 0.000 0.015 0.031
## they this with you
## Rating1_term2totalTerm 0.012 0.013 0.008 0.009
## Rating2_term2totalTerm 0.011 0.010 0.008 0.017
## Rating3_term2totalTerm 0.009 0.011 0.009 0.013
## Rating4_term2totalTerm 0.010 0.008 0.009 0.015
## Rating5_term2totalTerm 0.010 0.010 0.010 0.012
## documentTermCount 0.000 0.000 0.000 0.000
## term_to_totalDocumentTerms 0.000 0.000 0.000 0.000
Looking at the table above, we can use the term_to_totalDocumentTerms values of this observation compared to the ratios of the term2totalTerm ratings for each of these 12 words, and choose the rating with the lowest difference or distance between, then to add up the votes for ratings 1-5 for all 12 choices. There should be a clear winner in this algorithm of selecting or predicting the sentiment rating. So, lets try it out.
and_diff <- joinKeys$and[1:5]-joinKeys$and[7]
but_diff <- joinKeys$but[1:5]-joinKeys$but[7]
for_diff <- joinKeys[1:5,3]-joinKeys[7,3]
good_diff <- joinKeys$good[1:5]-joinKeys$good[7]
have_diff <- joinKeys$have[1:5]-joinKeys$have[7]
not_diff <- joinKeys$not[1:5]-joinKeys$not[7]
that_diff <- joinKeys$that[1:5]-joinKeys$that[7]
the_diff <- joinKeys$the[1:5]-joinKeys$the[7]
they_diff <- joinKeys$they[1:5]-joinKeys$they[7]
this_diff <- joinKeys$this[1:5]-joinKeys$this[7]
with_diff <- joinKeys$with[1:5]-joinKeys$with[7]
you_diff <- joinKeys$you[1:5]-joinKeys$you[7]
diff <- as.data.frame(t(cbind(and_diff, but_diff, for_diff, good_diff, have_diff, not_diff,
that_diff, the_diff, they_diff, this_diff, with_diff, you_diff)))
colnames(diff) <- r1
diff$minValue <- apply(diff,1, min)
diff$vote <- ifelse(diff$Rating1_term2totalTerm==diff$minValue,
1,
ifelse(diff$Rating2_term2totalTerm==diff$minValue,
2,
ifelse(diff$Rating3_term2totalTerm==diff$minValue,
3,
ifelse(diff$Rating4_term2totalTerm==diff$minValue,
4,
5)
)
)
)
diff$minValue2 <- ifelse(abs(diff$minValue)>abs(diff$Rating1_term2totalTerm),
diff$Rating1_term2totalTerm,
ifelse(abs(diff$minValue)>abs(diff$Rating2_term2totalTerm),
diff$Rating2_term2totalTerm,
ifelse(abs(diff$minValue)>abs(diff$Rating3_term2totalTerm),
diff$Rating3_term2totalTerm,
ifelse(abs(diff$minValue)>abs(diff$Rating4_term2totalTerm),
diff$Rating4_term2totalTerm,
ifelse(abs(diff$minValue)>abs(diff$Rating5_term2totalTerm),
diff$Rating5_term2totalTerm,
diff$minValue)
)
)
)
)
diff$vote2 <- ifelse(diff$Rating1_term2totalTerm==diff$minValue2,
1,
ifelse(diff$Rating2_term2totalTerm==diff$minValue2,
2,
ifelse(diff$Rating3_term2totalTerm==diff$minValue2,
3,
ifelse(diff$Rating4_term2totalTerm==diff$minValue2,
4,
5)
)
)
)
diff
## Rating1_term2totalTerm Rating2_term2totalTerm Rating3_term2totalTerm
## and_diff -0.016 -0.020 -0.019
## but_diff 0.008 0.011 0.010
## for_diff -0.013 -0.017 -0.010
## good_diff 0.006 0.006 0.008
## have_diff 0.001 0.001 0.005
## not_diff 0.011 0.011 0.010
## that_diff 0.000 -0.002 -0.003
## the_diff 0.026 0.042 0.041
## they_diff 0.012 0.011 0.009
## this_diff 0.013 0.010 0.011
## with_diff 0.008 0.008 0.009
## you_diff 0.009 0.017 0.013
## Rating4_term2totalTerm Rating5_term2totalTerm minValue vote minValue2
## and_diff -0.016 -0.003 -0.020 2 -0.016
## but_diff 0.011 0.006 0.006 5 0.006
## for_diff -0.015 -0.013 -0.017 2 -0.013
## good_diff 0.010 0.013 0.006 1 0.006
## have_diff 0.001 0.008 0.001 1 0.001
## not_diff 0.007 0.004 0.004 5 0.004
## that_diff -0.003 -0.005 -0.005 5 0.000
## the_diff 0.030 0.025 0.025 5 0.025
## they_diff 0.010 0.010 0.009 3 0.009
## this_diff 0.008 0.010 0.008 4 0.008
## with_diff 0.009 0.010 0.008 1 0.008
## you_diff 0.015 0.012 0.009 1 0.009
## vote2
## and_diff 1
## but_diff 5
## for_diff 1
## good_diff 1
## have_diff 1
## not_diff 5
## that_diff 1
## the_diff 5
## they_diff 3
## this_diff 4
## with_diff 1
## you_diff 1
Lets see the results of the first vote with only the minimum.
bestVote <- diff %>% group_by(vote) %>% count()
bestVote$maxVote <- ifelse(bestVote$n==max(bestVote$n),
1,0)
bestVote$ratingMean <- ifelse(sum(bestVote$maxVote) > 1,
ifelse(ceiling(mean(bestVote$vote*bestVote$n))>5,
5, ceiling(mean(bestVote$vote*bestVote$n))),
ifelse(bestVote$n==max(bestVote$n),
bestVote$vote,
0)
)
bestVote$ratingMedian <- ifelse(sum(bestVote$maxVote) > 1,
ifelse(ceiling(median(bestVote$vote*bestVote$n))>5,
5,ceiling(median(bestVote$vote*bestVote$n))),
ifelse(bestVote$n==max(bestVote$n),
bestVote$vote,
0)
)
max(bestVote$ratingMean)
## [1] 5
max(bestVote$ratingMedian)
## [1] 4
bestVote
## # A tibble: 5 x 5
## # Groups: vote [5]
## vote n maxVote ratingMean ratingMedian
## <dbl> <int> <dbl> <dbl> <dbl>
## 1 1 4 1 5 4
## 2 2 2 0 5 4
## 3 3 1 0 5 4
## 4 4 1 0 5 4
## 5 5 4 1 5 4
Lets see how vote2 measures in for predicting most likely reveiw.
bestVote2 <- diff %>% group_by(vote2) %>% count()
bestVote2$maxVote2 <- ifelse(bestVote2$n==max(bestVote2$n),
1,0)
bestVote2$ratingMean2 <- ifelse(sum(bestVote2$maxVote2) > 1,
ifelse(ceiling(mean(bestVote2$vote2*bestVote2$n))>5,
5, ceiling(mean(bestVote2$vote2*bestVote2$n))),
ifelse(bestVote2$n==max(bestVote2$n),
bestVote2$vote2,
0)
)
bestVote2$ratingMedian2 <- ifelse(sum(bestVote2$maxVote2) > 1,
ifelse(ceiling(median(bestVote2$vote2*bestVote2$n))>5,
5,ceiling(median(bestVote2$vote2*bestVote2$n))),
ifelse(bestVote2$n==max(bestVote2$n),
bestVote2$vote2,
0)
)
max(bestVote2$ratingMean2)
## [1] 1
max(bestVote2$ratingMedian2)
## [1] 1
bestVote2
## # A tibble: 4 x 5
## # Groups: vote2 [4]
## vote2 n maxVote2 ratingMean2 ratingMedian2
## <dbl> <int> <dbl> <dbl> <dbl>
## 1 1 7 1 1 1
## 2 3 1 0 1 1
## 3 4 1 0 1 1
## 4 5 3 0 1 1
These keywords aren’t really showing much benefit, since they keep having the same result. A 1 rating for vote2 by shortest distance between review to all reviews’ term to total terms ratio and a 5 for ceiling of the mean of dot product of votes and ratings by minimum values of ratio differences, and a 4 for the ceiling of the median of dot product of votes and ratings by minimum values of ratio differences.
Lets see what this rating is. The string object was taken from the first review of the business.
Reviews13[600,]
## userReviewSeries
## 600 mostRecentVisit_review
## userReviewOnlyContent
## 600 DOCTOR is great, and has friendly staff. I have had lower back pain for years and he fixed it right up in a few visits. Every now and then the pain comes back and he always gets me in the same day without a problem for an adjustment. He also has a laser that he can use to treat different areas.
## userRatingSeries userRatingValue businessReplied businessReplyContent
## 600 mostRecentVisit_rating 5 no NA
## userReviewContent
## 600 2/19/2013\nDOCTOR is great, and has friendly staff. I have had lower back pain for years and he fixed it right up in a few visits. Every now and then the pain comes back and he always gets me in the same day without a problem for an adjustment. He also has a laser that he can use to treat different areas.
## LowAvgHighCost businessType cityState friends reviews photos eliteStatus
## 600 Avg chiropractic Norco, CA 23 36 NA <NA>
## userName Date userBusinessPhotos userCheckIns
## 600 Mike D. 2013-02-19 NA NA
The rating is actually a 5. So, the ceiling of the mean value of the dot product of votes by ratings selected by minimum value of ratio of review to ratio of reviews in each rating using each term to total terms in the single review to each collection of reviews in each rating of 1-5.
We can later add these features to each observation, make 70% a trainging set to build and train the model to predict the rating. Then use the other 30% the testing set to test the model built on the other partitioned 70% to predict the rating. We would have to run the best model above which used the mean of dot product of the votes by ratings, then the ceiling of that value which rounds up on all measures. As this algorithm was more accurate in predicting the rating.We could either do that now, or step away for a while to let it simmer and see what this data on reviews and key words looks like in a link analysis plot. Why don’t we do the latter and do it.
Lets visualize these keywords by the layout of these keywords to rating by ratios as weight, ratings as edges, and nodes as keywords. We have to first take this information from the data table on the ratios of terms counted each per documents in each rating to total terms per documents per rating. This data table is the wordToAllWords table. From this table we have our weights and our ratings. The weights are what will make the arrows width smaller or larger than the other arrow widths depending on how much weight they have on each word linked to a specific rating. The ratings are the edges. The nodes are the words, and those are also in this table. The label in the nodes table will be the keyword, and the title will be the rating. The id is the row number from the nodes table, which is the from column in the edges table. and the to column in the edges table will be the rating. Lets also add another feature from the Reviews13 data table for the day of the week as Monday through Sunday by adding a feature that takes the day of the week from the date field we added earlier in the data. Lets read in those two data tables and make sure our libraries are loaded in to Rstudio from the top of this script.The visNetwork and igraph link analysis and visualization libraries will be used for this link analysis. The package igraph makes the visNetwork package work faster in uploading and allows editing and modifying the link analysis network using various customized plot layouts and color schemes as well as other added value to the visualization.
Lets add the day of the week using the lubridate package to the Reviews13 datatable Date field.
head(Reviews13)
## userReviewSeries
## 1 mostRecentVisit_review
## 2 mostRecentVisit_review
## 3 mostRecentVisit_review
## 4 mostRecentVisit_review
## 5 mostRecentVisit_review
## 6 mostRecentVisit_review
## userReviewOnlyContent
## 1 What a wonderful way to start the year! This was my second time back to HIGH END SPA, and we had a great time. The crowds were very low (seriously, it felt like we had the place to ourselves most of the day.) We walked right into the mineral baths, club mud, and didn't wait in any kind of line for lunch. None of the pools were crowded, and we were even able to enjoy one of the hammocks in the secret garden.\n\nTiffany at the front check-in desk went above and beyond for us regarding the robes. I had requested a plus-sized robe, since after my last review I knew they had added some to their collection. Unfortunately, all of their plus-sized robes were still dirty from the day before. Tiffany was so accommodating, though! She was able to get us robes from the cabana area that fit me perfectly! It is so great to know that not only do they now offer guests of all sizes the option to enjoy a warm robe, but that they really want to make sure you have a good day. Thank you, Tiffany, for everything.\n\nAll of the staff today were in good spirits. The only thing that would have made today better would have been a massage. We'll have to book one next time. My husband and I are going to make HIGH END SPA our annual New Year's Day tradition!\n\n
## 2 My sister and I brought my mom here for her birthday and overall, we really enjoyed our time here. We're used to going to Korean spas, but this was definitely an upgrade.\n\nPROS:\n- The resort itself is beautiful and so relaxing. Like seriously such a pleasing escape from reality that I needed. It's set up so nicely and feels very luxurious.\n- It was my mom's birthday so she received free admission on birthday with a purchase of a service. Admission is $52, so she booked a manicure for $50 and got in for free. WORTH. My mom had gone 52 years without ever getting her nails done, so it was kind of heartwarming to see how much she loved her experience.\n- The three of us took a Yin Yoga class and really enjoyed it. We definitely want to take advantage of the other class options next time we come.\n- CLUB MUD. We had so much fun there and even made a little clay sculpture. It really does do wonders for your skin, and the area is suprisingly very well-kept.\n- The shower and locker facilities can get pretty crowded, but overall, they are super nice and clean. They have an ample amount of showers, so we didn't have to wait at all.\n- All the staff seemed really friendly and helpful. There's always staff members roaming around, so you always feel somewhat taken care of.\n- I really appreciated the towel and water stands located throughout the resort. So handy and necessary.\n- Parking is free, thank God.\n\nCONS:\n- We went on a fairly cold day (around 60 degrees), so the hot pools were CROWDED,. Like there were a couple of times I touched other people's body parts I definitely did not want to touch. I feel like some of the hot pools exceeded capacity, and I'm sure it was mostly because it was a cold day, but I do wish there were more of the hot pools or they should just be larger!\n- The food is incredibly expensive. Like as ridiculous as Disneyland, which is saying something. Plan to spend around $20 per meal per person. The one thing that was worth it was the nachos ($16 for the small, but this thing is huge).\n- The kitchen moves VERY SLOWLY. Especially the salad section because I came before the lunch rush and still waited 20 minutes to order my salad. The kitchen staff seems a bit incompetent, or maybe it's just run inefficiently.\n- This is more of a side note, but I wish there was a more streamlined reservation system. I made the entire reservation over the phone, which was fine, but it wasn't laid out as clearly as I would have liked it with the premium admissions prices, services, etc. The online one also just seemed really confusing.\n\nOverall, we had a positive experience with just a couple of kinks here and there. We love that there's just a lot to do here and time FLIES when you're here so come as early as you can. We definitely want to try coming back in the summer months when it's warmer!\n\n\n
## 3 I came to CHIROPRACTIC with severe back and neck pain. DOCTOR was AMAZING and helped me to feel much better than I have felt for YEARS! The girls up front also are very sweet and always made sure that all my appointments were set and on time! Heather the billing manager was very kind as well, she was AWESOME when it came to dealing with me and my insurance amd was definitely a huge help! I don't know what I would have done without Heather helping me with all of the insurance problems I had!!! She is the BEST, thank you Heather!! I would definitely recommend going to this clinic!!!!
## 4 I have to say.... This is by far the best Chiropractic place I've ever been to. The staff is super friendly and very professional. From the moment I walk in the door I get greeted by name . The Drs are amazing too. Love this place and I highly recommend them.
## 5 Dr. is my chiropractor and he is a fabulous individual. I've never waited more than few minutes for him to see me. The front team (Both ladies" are great with an outstanding care and smile. Thank you guys for all you do.
## 6 Many in our family have seen DOCTOR for chiropractic care. He is very warm and friendly, knowledgable, puts your mind at ease during his adjustments. He gives great explanations. Our 14yo son said, "he is really good at what he does and he is a good person." We all feel better after visiting him. Recommend him to everyone.
## userRatingSeries userRatingValue businessReplied
## 1 mostRecentVisit_rating 5 yes
## 2 mostRecentVisit_rating 4 yes
## 3 mostRecentVisit_rating 5 no
## 4 mostRecentVisit_rating 5 no
## 5 mostRecentVisit_rating 5 no
## 6 mostRecentVisit_rating 5 no
## businessReplyContent
## 1 Amber P. of HIGH END SPA Hot Springs\n\nBusiness Customer Service\n\n1/2/20191/15/2018-\nHi Michelle, HIGH END SPA is proud to welcome men and women of all shapes and sizes. In response to your day, we are now in the process of ordering a few XL robes so we can continue to have offerings for all of our guests. I wanted to reach out to you to let you know we have sent you a private message as we would like to connect with you directly. Thank you again for communicating your concern with us.\nAlexa Gallegos\n\n1/2/2019 -\n\nHi Michelle,\nI am so happy to hear that you had a great returning experience! Our team members do the best they can to accommodate all of our guests needs and we are very glad to hear you were happy with the solution.\nWe hope to see you and your husband again!\n\nBest,\nAmber Peyghambari\n\nRead less\n
## 2 Amber P. of HIGH END SPA Hot Springs\n\nBusiness Customer Service\n\n3/25/2019Hi Cathy,\n\nThank you for taking the time to share your experience with us. We are happy to hear that you enjoyed your day at HIGH END SPA. We appreciate all feedback and will share these concerns with our team. We hope to see you back this summer!\n\nWith kind,\nAmber Peyghambari\n
## 3 NA
## 4 NA
## 5 NA
## 6 NA
## userReviewContent
## 1 1/1/2019Updated review\n 2 photos\n\nWhat a wonderful way to start the year! This was my second time back to HIGH END SPA, and we had a great time. The crowds were very low (seriously, it felt like we had the place to ourselves most of the day.) We walked right into the mineral baths, club mud, and didn't wait in any kind of line for lunch. None of the pools were crowded, and we were even able to enjoy one of the hammocks in the secret garden.\n\nTiffany at the front check-in desk went above and beyond for us regarding the robes. I had requested a plus-sized robe, since after my last review I knew they had added some to their collection. Unfortunately, all of their plus-sized robes were still dirty from the day before. Tiffany was so accommodating, though! She was able to get us robes from the cabana area that fit me perfectly! It is so great to know that not only do they now offer guests of all sizes the option to enjoy a warm robe, but that they really want to make sure you have a good day. Thank you, Tiffany, for everything.\n\nAll of the staff today were in good spirits. The only thing that would have made today better would have been a massage. We'll have to book one next time. My husband and I are going to make HIGH END SPA our annual New Year's Day tradition!\n\nComment from Amber P. of HIGH END SPA Hot Springs\n\nBusiness Customer Service\n\n1/2/20191/15/2018-\nHi Michelle, HIGH END SPA is proud to welcome men and women of all shapes and sizes. In response to your day, we are now in the process of ordering a few XL robes so we can continue to have offerings for all of our guests. I wanted to reach out to you to let you know we have sent you a private message as we would like to connect with you directly. Thank you again for communicating your concern with us.\nAlexa Gallegos\n\n1/2/2019 -\n\nHi Michelle,\nI am so happy to hear that you had a great returning experience! Our team members do the best they can to accommodate all of our guests needs and we are very glad to hear you were happy with the solution.\nWe hope to see you and your husband again!\n\nBest,\nAmber Peyghambari\n\nRead less\n
## 2 3/24/2019\n 12 photos\n\nMy sister and I brought my mom here for her birthday and overall, we really enjoyed our time here. We're used to going to Korean spas, but this was definitely an upgrade.\n\nPROS:\n- The resort itself is beautiful and so relaxing. Like seriously such a pleasing escape from reality that I needed. It's set up so nicely and feels very luxurious.\n- It was my mom's birthday so she received free admission on birthday with a purchase of a service. Admission is $52, so she booked a manicure for $50 and got in for free. WORTH. My mom had gone 52 years without ever getting her nails done, so it was kind of heartwarming to see how much she loved her experience.\n- The three of us took a Yin Yoga class and really enjoyed it. We definitely want to take advantage of the other class options next time we come.\n- CLUB MUD. We had so much fun there and even made a little clay sculpture. It really does do wonders for your skin, and the area is suprisingly very well-kept.\n- The shower and locker facilities can get pretty crowded, but overall, they are super nice and clean. They have an ample amount of showers, so we didn't have to wait at all.\n- All the staff seemed really friendly and helpful. There's always staff members roaming around, so you always feel somewhat taken care of.\n- I really appreciated the towel and water stands located throughout the resort. So handy and necessary.\n- Parking is free, thank God.\n\nCONS:\n- We went on a fairly cold day (around 60 degrees), so the hot pools were CROWDED,. Like there were a couple of times I touched other people's body parts I definitely did not want to touch. I feel like some of the hot pools exceeded capacity, and I'm sure it was mostly because it was a cold day, but I do wish there were more of the hot pools or they should just be larger!\n- The food is incredibly expensive. Like as ridiculous as Disneyland, which is saying something. Plan to spend around $20 per meal per person. The one thing that was worth it was the nachos ($16 for the small, but this thing is huge).\n- The kitchen moves VERY SLOWLY. Especially the salad section because I came before the lunch rush and still waited 20 minutes to order my salad. The kitchen staff seems a bit incompetent, or maybe it's just run inefficiently.\n- This is more of a side note, but I wish there was a more streamlined reservation system. I made the entire reservation over the phone, which was fine, but it wasn't laid out as clearly as I would have liked it with the premium admissions prices, services, etc. The online one also just seemed really confusing.\n\nOverall, we had a positive experience with just a couple of kinks here and there. We love that there's just a lot to do here and time FLIES when you're here so come as early as you can. We definitely want to try coming back in the summer months when it's warmer!\n\n\nComment from Amber P. of HIGH END SPA Hot Springs\n\nBusiness Customer Service\n\n3/25/2019Hi Cathy,\n\nThank you for taking the time to share your experience with us. We are happy to hear that you enjoyed your day at HIGH END SPA. We appreciate all feedback and will share these concerns with our team. We hope to see you back this summer!\n\nWith kind,\nAmber Peyghambari\n
## 3 1/26/2020\nI came to CHIROPRACTIC with severe back and neck pain. DOCTOR was AMAZING and helped me to feel much better than I have felt for YEARS! The girls up front also are very sweet and always made sure that all my appointments were set and on time! Heather the billing manager was very kind as well, she was AWESOME when it came to dealing with me and my insurance amd was definitely a huge help! I don't know what I would have done without Heather helping me with all of the insurance problems I had!!! She is the BEST, thank you Heather!! I would definitely recommend going to this clinic!!!!
## 4 1/24/2020\nI have to say.... This is by far the best Chiropractic place I've ever been to. The staff is super friendly and very professional. From the moment I walk in the door I get greeted by name . The Drs are amazing too. Love this place and I highly recommend them.
## 5 10/22/2019\nDr. is my chiropractor and he is a fabulous individual. I've never waited more than few minutes for him to see me. The front team (Both ladies" are great with an outstanding care and smile. Thank you guys for all you do.
## 6 12/23/2019\nMany in our family have seen DOCTOR for chiropractic care. He is very warm and friendly, knowledgable, puts your mind at ease during his adjustments. He gives great explanations. Our 14yo son said, "he is really good at what he does and he is a good person." We all feel better after visiting him. Recommend him to everyone.
## LowAvgHighCost businessType cityState friends reviews
## 1 High high end massage retreat Orange, CA 26 33
## 2 High high end massage retreat Los Angeles, CA 894 311
## 3 Avg chiropractic Laguna Beach, CA 0 NA
## 4 Avg chiropractic Moreno Valley, CA 0 NA
## 5 Avg chiropractic Corona, CA 0 11
## 6 Avg chiropractic Corona, CA 0 2
## photos eliteStatus userName Date userBusinessPhotos userCheckIns
## 1 21 <NA> Michelle A. 2019-01-01 2 NA
## 2 1187 Elite '2020 Cathy P. 2019-03-24 NA NA
## 3 NA <NA> Brie W. 2020-01-26 NA NA
## 4 NA <NA> Yoles A. 2020-01-24 NA NA
## 5 NA <NA> Rafeh T. 2019-10-22 NA NA
## 6 NA <NA> Kort U. 2019-12-23 NA NA
When looking at the table above there are other factors that could be grouped by, such as the business type, the cost as low, average, or high, then number of photos each user has, etc. But we will just focus on using the day of the week. Lets add the day of the week now. First lets make sure the Date feature is recognized as a date feature.
class(Reviews13$Date)
## [1] "Date"
It is a factor, so we will change it to a date feature type.
Reviews13$Date <- as.Date(Reviews13$Date)
class(Reviews13$Date)
## [1] "Date"
Now lets extract the day of the week from our date feature.
date <- ymd(Reviews13$Date)
date <- day(Reviews13$Date)
Reviews13$weekday <- wday(date, label=TRUE, week_start=1)#set start of week to Monday)
head(Reviews13$weekday)
## [1] Sun Tue Thu Tue Sun Mon
## Levels: Mon < Tue < Wed < Thu < Fri < Sat < Sun
Now, we could have attached this information to the count of keywords in each review, but we skipped that step, and we have to make the process automated with a for loop to take every row in data table feature and keep applying this string filter program that counts the 12 words in each observation, and returns a vector that is row binded to the previous vector until now more observations left. We could do that later, you could do it now, or we could try it now and see if it is as simple as it sounds to set up. I will try it once, and if it will take more manipulation, or time, we will just work with what is readily available to focus on the link analysis network design we already planned and created instructions though loose on how to create. This is the keyword extraction script:
str1 <- as.character(paste(Reviews13$userReviewOnlyContent[600]))
str1 <- gsub('[!|.|,|\n|\']',' ',str1,perl=TRUE)
str1 <- gsub('[ ]',' ',str1)
str1 <- trimws(str1, which=c('both'), whitespace='[\t\r\n ]')
totalTerms <- length((strsplit(str1, split=' ')[[1]]))
keys <- row.names(keys_t)
and <- str_match_all(str1,' [aA][nN][dD] ')
AND <- length(and[[1]])
the <- str_match_all(str1,' [tT][hH][eE] ')
THE <- length(the[[1]])
for1 <- str_match_all(str1,' [fF][oO][rR] ')
FOR1 <- length(for1[[1]])
have <- str_match_all(str1,' [hH][aA][vV][eE] ')
HAVE <- length(have[[1]])
that <- str_match_all(str1,' [tT][hH][aA][tT] ')
THAT <- length(that[[1]])
they <- str_match_all(str1,' [tT][hH][eE][yY] ')
THEY <- length(they[[1]])
this <- str_match_all(str1,' [tT][hH][iI][sS] ')
THIS <- length(this[[1]])
you <- str_match_all(str1,' [yY][oO][uU] ')
YOU <- length(you[[1]])
not <- str_match_all(str1,' [nN][oO][tT] ')
NOT <- length(not[[1]])
but <- str_match_all(str1,' [bB][uU][tT] ')
BUT <- length(but[[1]])
good <- str_match_all(str1,' [gG][oO][oO][dD] ')
GOOD <- length(good[[1]])
with <- str_match_all(str1,' [wW][iI][tT][hH] ')
WITH <- length(with[[1]])
values <- as.data.frame(c(THE,AND,FOR1,HAVE,THAT,THEY,THIS,YOU,NOT,BUT,GOOD,WITH))
row.names(values) <- keys
keyValues <- as.data.frame(t(values))
keyValues2 <- keyValues/totalTerms
keyValues3 <- rbind(keyValues,keyValues2)
row.names(keyValues3) <- c('documentTermCount','term_to_totalDocumentTerms')
keyValues3 <- round(keyValues3,3)
keyValues3
## the and for have that they this you not but
## documentTermCount 2.000 4.000 2.000 1.000 1.000 0 0 0 0 0
## term_to_totalDocumentTerms 0.031 0.062 0.031 0.015 0.015 0 0 0 0 0
## good with
## documentTermCount 0 0
## term_to_totalDocumentTerms 0 0
We see from this output per review, that we could first add a paste function to attach a new name to every review by row number in the data table Reviews13. We need to make sure they are the correct row number values by order of their listed review.
row.names(Reviews13) <- NULL
row.names(Reviews13) <- as.character(paste(row.names(Reviews13)))
head(row.names(Reviews13))
## [1] "1" "2" "3" "4" "5" "6"
%&%& Looks like they are ordered. There are 614 reviews to extract the 12 keyword counts and ratio of words per document to total words per document. Lets try wrapping this up in a for loop.
str <- as.character(paste(Reviews13$userReviewOnlyContent))
for (review in (str))
{
s <- gsub('[!|.|,|\n|\']',' ',review,perl=TRUE)
s <- gsub('[ ]',' ',s)
s <- trimws(s, which=c('both'), whitespace='[\t\r\n ]')
# s <- strsplit(s, split=' ') #this is where the error is, outside the loop is fine
totalTerms <- length(strsplit(s, split=' ')[[1]])
and <- str_match_all(s,' [aA][nN][dD] ')
AND <- length(and[[1]])
the <- str_match_all(s,' [tT][hH][eE] ')
THE <- length(the[[1]])
for1 <- str_match_all(s,' [fF][oO][rR] ')
FOR1 <- length(for1[[1]])
have <- str_match_all(s,' [hH][aA][vV][eE] ')
HAVE <- length(have[[1]])
that <- str_match_all(s,' [tT][hH][aA][tT] ')
THAT <- length(that[[1]])
they <- str_match_all(s,' [tT][hH][eE][yY] ')
THEY <- length(they[[1]])
this <- str_match_all(s,' [tT][hH][iI][sS] ')
THIS <- length(this[[1]])
you <- str_match_all(s,' [yY][oO][uU] ')
YOU <- length(you[[1]])
not <- str_match_all(s,' [nN][oO][tT] ')
NOT <- length(not[[1]])
but <- str_match_all(s,' [bB][uU][tT] ')
BUT <- length(but[[1]])
good <- str_match_all(s,' [gG][oO][oO][dD] ')
GOOD <- length(good[[1]])
with <- str_match_all(s,' [wW][iI][tT][hH] ')
WITH <- length(with[[1]])
values1 <- c(THE,AND,FOR1,HAVE,THAT,THEY,THIS,YOU,NOT,BUT,GOOD,WITH)
values2 <- values1/totalTerms
##cat function to save the values and work out another way to read in and combine
cat(values1,file="values1.csv",sep="\n",append ="TRUE",fill=TRUE)
cat(values2,file="values2.csv",sep="\n",append ="TRUE",fill=TRUE)
}
#loop fails at the splitstr function that was commented out.
I played around with the for loop longer than expected, and it needs more work. It isn’t doing what I want it to do. I will manually get a handful of values later as needed, or move on to the vis network. I fixed the code, but this chunk won’t evaluate. The script will be assumed to err until we encounter it again so as to move on to the visNetwork link analysis plots.
Lets move onto the visNetwork. We are using the wordToAllWords and Reviews13 tables. Lets select our columns from each.These were both written to csv earlier as ReviewsCleanedWithKeywordsAndRatios.csv for Reviews13 and wordToAllWords.csv for that table. If you cleaned out your environment and left or shut down Rstudio then you can read in these two as their table names and test the script below.
visNodes <- Reviews13 %>% select(userRatingValue,LowAvgHighCost, businessType,weekday)
visNodes$label <- visNodes$userRatingValue
visNodes$label <- paste('rate',visNodes$label,sep='')
visNodes$title <- visNodes$LowAvgHighCost
visNodes$title <- paste(visNodes$title,'Cost',sep='')
visNodes$group <- visNodes$weekday
visEdges <- as.data.frame(t(wordToAllWords ))
colnames(visEdges) <- c('rate1','rate2','rate3','rate4','rate5')
visEdges$label <- row.names(visEdges)
#the weight is the ratio term2alltermsPerRating
visEdges <- gather(visEdges, 'rating','weight', 1:5)
head(visNodes)
## userRatingValue LowAvgHighCost businessType weekday label
## 1 5 High high end massage retreat Sun rate5
## 2 4 High high end massage retreat Tue rate4
## 3 5 Avg chiropractic Thu rate5
## 4 5 Avg chiropractic Tue rate5
## 5 5 Avg chiropractic Sun rate5
## 6 5 Avg chiropractic Mon rate5
## title group
## 1 HighCost Sun
## 2 HighCost Tue
## 3 AvgCost Thu
## 4 AvgCost Tue
## 5 AvgCost Sun
## 6 AvgCost Mon
Nodes1 <- visNodes %>% select(label,title,group)
head(Nodes1)
## label title group
## 1 rate5 HighCost Sun
## 2 rate4 HighCost Tue
## 3 rate5 AvgCost Thu
## 4 rate5 AvgCost Tue
## 5 rate5 AvgCost Sun
## 6 rate5 AvgCost Mon
It moves from 614 to 7368 obsrevations because of the 12 keywords and 614 reveiws which equals 7368.
Nodes2 <- merge(Nodes1, visEdges, by.x='label', by.y='rating')
Nodes2$id <- as.factor(paste(row.names(Nodes2)))
Nodes2$term <- Nodes2$label.y
Nodes3 <- Nodes2 %>% select(id,label,title,group,term)
Nodes3$label <- as.factor(paste(Nodes3$label))
Nodes3$term <- as.factor(paste(Nodes3$term))
Nodes3$title <- as.factor(paste(Nodes3$title))
head(Nodes3)
## id label title group term
## 1 1 rate1 LowCost Mon this
## 2 2 rate1 LowCost Mon they
## 3 3 rate1 LowCost Mon and
## 4 4 rate1 LowCost Mon but
## 5 5 rate1 LowCost Mon for
## 6 6 rate1 LowCost Mon good
visEdges only has 60 because it was 5 ratings ratios foe 12 words each.
visEdges$label <- as.factor(paste(visEdges$label))
visEdges$rating <- as.factor(paste(visEdges$rating))
head(visEdges)
## label rating weight
## 1 and rate1 0.046
## 2 but rate1 0.008
## 3 for rate1 0.018
## 4 good rate1 0.006
## 5 have rate1 0.016
## 6 not rate1 0.011
Because there are only 60 fields in the edges and 7368 in the nodes, there will be errors for those values in the nodes that don’t have values in the edges. But we will still have a plot. I will suppress these warnings messages within the chunk.The nodes have to have unique IDs but the edges don’t.
Edges2 <- visEdges %>% mutate(from=plyr::mapvalues(visEdges$rating,
from=Nodes3$label,to=Nodes3$id))
Edges3 <- Edges2 %>% mutate(to=plyr::mapvalues(Edges2$label,
from=Nodes3$term, to=Nodes3$id))
Edges4 <- Edges3 %>% select(from,to,label,weight)
head(Edges4,20)
## from to label weight
## 1 1 2223 and 0.046
## 2 1 3334 but 0.008
## 3 1 4445 for 0.018
## 4 1 5556 good 0.006
## 5 1 6667 have 0.016
## 6 1 7147 not 0.011
## 7 1 7258 that 0.015
## 8 1 2 the 0.057
## 9 1 1112 they 0.012
## 10 1 1 this 0.013
## 11 1 113 with 0.008
## 12 1 224 you 0.009
## 13 66 2223 and 0.042
## 14 66 3334 but 0.011
## 15 66 4445 for 0.014
## 16 66 5556 good 0.006
## 17 66 6667 have 0.016
## 18 66 7147 not 0.011
## 19 66 7258 that 0.013
## 20 66 2 the 0.073
Now lets use visNetwork and igraph to plot these nodes and edges.
visNetwork(nodes=Nodes3, edges=Edges4, main='Weekday Groups of Rating and 12 Keywords') %>% visEdges(arrows=c('from','middle')) %>%
visInteraction(navigationButtons=TRUE, dragNodes=FALSE,
dragView=TRUE, zoomView = TRUE) %>%
visOptions(nodesIdSelection = TRUE, manipulation=FALSE) %>%
visIgraphLayout() %>%
visLegend
## Warning in if (class(newval) == "factor") {: the condition has length > 1 and
## only the first element will be used
The above is very large because it has the added groups and keywords of 12 to run combinations against the original 614 observations. Lets try limiting the observations.
We will work from the beginning of the last visNetwork plot.
visNodes <- Reviews13 %>% select(userRatingValue,LowAvgHighCost, businessType,weekday)
visNodes$label <- visNodes$userRatingValue
visNodes$label <- paste('rate',visNodes$label,sep='')
visNodes$title <- visNodes$LowAvgHighCost
visNodes$title <- paste(visNodes$title,'Cost',sep='')
visNodes$group <- visNodes$weekday
head(visNodes)
## userRatingValue LowAvgHighCost businessType weekday label
## 1 5 High high end massage retreat Sun rate5
## 2 4 High high end massage retreat Tue rate4
## 3 5 Avg chiropractic Thu rate5
## 4 5 Avg chiropractic Tue rate5
## 5 5 Avg chiropractic Sun rate5
## 6 5 Avg chiropractic Mon rate5
## title group
## 1 HighCost Sun
## 2 HighCost Tue
## 3 AvgCost Thu
## 4 AvgCost Tue
## 5 AvgCost Sun
## 6 AvgCost Mon
visEdges <- as.data.frame(t(wordToAllWords ))
colnames(visEdges) <- c('rate1','rate2','rate3','rate4','rate5')
visEdges$label <- row.names(visEdges)
#the weight is the ratio term2alltermsPerRating
visEdges <- gather(visEdges, 'rating','weight', 1:5)
head(visEdges)
## label rating weight
## 1 and rate1 0.046
## 2 but rate1 0.008
## 3 for rate1 0.018
## 4 good rate1 0.006
## 5 have rate1 0.016
## 6 not rate1 0.011
Nodes1 <- visNodes %>% select(weekday:group)
head(Nodes1)
## weekday label title group
## 1 Sun rate5 HighCost Sun
## 2 Tue rate4 HighCost Tue
## 3 Thu rate5 AvgCost Thu
## 4 Tue rate5 AvgCost Tue
## 5 Sun rate5 AvgCost Sun
## 6 Mon rate5 AvgCost Mon
Nodes2 <- merge(Nodes1, visEdges, by.x='label', by.y='rating')
Nodes2$term <- Nodes2$label.y
Nodes2$id <- row.names(Nodes2)
Nodes3 <- Nodes2 %>% select(id,label,title,group,term,weight)
head(Nodes3)
## id label title group term weight
## 1 1 rate1 LowCost Mon this 0.013
## 2 2 rate1 LowCost Mon they 0.012
## 3 3 rate1 LowCost Mon and 0.046
## 4 4 rate1 LowCost Mon but 0.008
## 5 5 rate1 LowCost Mon for 0.018
## 6 6 rate1 LowCost Mon good 0.006
Now subset from Nodes3 and make the edges table from this table.
Nodes4 <- subset(Nodes3, (Nodes3$group=='Mon'|Nodes3$group=='Wed'|Nodes3$group=='Sat') &
(Nodes3$term=='this'|Nodes3$term=='they'|Nodes3$term=='have'|
Nodes3$term=='you'|Nodes3$term=='not'|Nodes3$term=='good'))
row.names(Nodes4) <- NULL
Nodes4$id <- as.factor(row.names(Nodes4))
head(Nodes4)
## id label title group term weight
## 1 1 rate1 LowCost Mon this 0.013
## 2 2 rate1 LowCost Mon they 0.012
## 3 3 rate1 LowCost Mon good 0.006
## 4 4 rate1 LowCost Mon have 0.016
## 5 5 rate1 LowCost Mon not 0.011
## 6 6 rate1 LowCost Mon you 0.009
Edges1 <- Nodes4 %>% select(label,term,group,weight)
Edges2 <- Edges1 %>% mutate(from=plyr::mapvalues(Edges1$label,
from=Nodes4$label,to=Nodes4$id))
## The following `from` values were not present in `x`: rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate1, rate2, rate2, rate2, rate2, rate2, rate2, rate2, rate2, rate2, rate2, rate2, rate2, rate2, rate2, rate2, rate2, rate2, rate2, rate2, rate2, rate2, rate2, rate2, rate2, rate2, rate2, rate2, rate2, rate2, rate2, rate2, rate2, rate2, rate2, rate2, rate2, rate2, rate2, rate2, rate2, rate2, rate2, rate2, rate2, rate2, rate2, rate2, rate2, rate2, rate2, rate2, rate2, rate2, rate2, rate2, rate2, rate2, rate2, rate2, rate2, rate2, rate2, rate2, rate2, rate2, rate2, rate2, rate2, rate2, rate2, rate2, rate3, rate3, rate3, rate3, rate3, rate3, rate3, rate3, rate3, rate3, rate3, rate3, rate3, rate3, rate3, rate3, rate3, rate3, rate3, rate3, rate3, rate3, rate3, rate3, rate3, rate3, rate3, rate3, rate3, rate3, rate3, rate3, rate3, rate3, rate3, rate3, rate3, rate3, rate3, rate3, rate3, rate3, rate3, rate3, rate3, rate3, rate3, rate3, rate3, rate3, rate3, rate3, rate3, rate3, rate3, rate3, rate3, rate3, rate3, rate3, rate3, rate3, rate3, rate3, rate3, rate3, rate3, rate3, rate3, rate3, rate3, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate4, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5, rate5
Edges3 <- Edges2 %>% mutate(to=plyr::mapvalues(Edges2$term,
from=Nodes4$term, to=Nodes4$id))
## The following `from` values were not present in `x`: this, they, good, have, not, you, this, they, good, have, not, you, this, they, good, have, not, you, this, they, good, have, not, you, this, they, good, have, not, you, this, they, good, have, not, you, this, they, good, have, not, you, this, they, good, have, not, you, this, they, good, have, not, you, this, they, good, have, not, you, this, they, good, have, not, you, this, they, good, have, not, you, this, they, good, have, not, you, this, they, good, have, not, you, this, they, good, have, not, you, this, they, good, have, not, you, this, they, good, have, not, you, this, they, good, have, not, you, this, they, good, have, not, you, this, they, good, have, not, you, this, they, good, have, not, you, this, they, good, have, not, you, this, they, good, have, not, you, this, they, good, have, not, you, this, they, good, have, not, you, this, they, good, have, not, you, this, they, good, have, not, you, this, they, good, have, not, you, this, they, good, have, not, you, this, they, good, have, not, you, this, they, good, have, not, you, this, they, good, have, not, you, this, they, good, have, not, you, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good, have, not, they, this, you, good
Edges4 <- Edges3 %>% select(from,to,label,term,group,weight)
Edges4$label <- Edges4$term
head(Edges4)
## from to label term group weight
## 1 1 1 this this Mon 0.013
## 2 1 535 they they Mon 0.012
## 3 1 646 good good Mon 0.006
## 4 1 757 have have Mon 0.016
## 5 1 868 not not Mon 0.011
## 6 1 979 you you Mon 0.009
visNetwork(nodes=Nodes4, edges=Edges4, main='Three Weekday Groups of Five Ratings and Five Keywords') %>% visEdges(arrows=c('from','middle')) %>%
visInteraction(navigationButtons=TRUE, dragNodes=FALSE,
dragView=TRUE, zoomView = TRUE) %>%
visOptions(nodesIdSelection = TRUE, manipulation=FALSE) %>%
visIgraphLayout() %>%
visLegend
## Warning in if (class(newval) == "factor") {: the condition has length > 1 and
## only the first element will be used
Lets make another visualization on price and ratings with terms omitted. I used five keywords instead of three as planned.
We well use the Reviews13 data table. And select the features needed.
visNodes3 <- Reviews13 %>% select(userRatingValue,LowAvgHighCost,businessReplied,friends)
visNodes3$id <- row.names(visNodes3)
visNodes3$weight <- visNodes3$friends/max(visNodes3$friends,na.rm=TRUE)
visNodes3$group <- visNodes3$LowAvgHighCost
visNodes3$label <- as.factor(paste('rating', visNodes3$userRatingValue, sep=' '))
visNodes3$title <- visNodes3$businessReplied
head(visNodes3)
## userRatingValue LowAvgHighCost businessReplied friends id weight group
## 1 5 High yes 26 1 0.0052 High
## 2 4 High yes 894 2 0.1788 High
## 3 5 Avg no 0 3 0.0000 Avg
## 4 5 Avg no 0 4 0.0000 Avg
## 5 5 Avg no 0 5 0.0000 Avg
## 6 5 Avg no 0 6 0.0000 Avg
## label title
## 1 rating 5 yes
## 2 rating 4 yes
## 3 rating 5 no
## 4 rating 5 no
## 5 rating 5 no
## 6 rating 5 no
nodes1 <- visNodes3 %>% select(id,label,title, group)
edges1 <- visNodes3 %>% select(id,label,group,weight)
edges1$from <- edges1$id
edges2 <- edges1 %>% mutate(to = plyr::mapvalues(edges1$group, from=nodes1$group, to = nodes1$id))
## The following `from` values were not present in `x`: High, Avg, Avg, Avg, Avg, Avg, Avg, Avg, Avg, Avg, High, Avg, Avg, Avg, Avg, Avg, Avg, Avg, Avg, Avg, Avg, High, Avg, Avg, Avg, Avg, Avg, Avg, Avg, Avg, Avg, Avg, High, Avg, Avg, Avg, Avg, Avg, Avg, Avg, Avg, Avg, Avg, High, Avg, Avg, Avg, Avg, Avg, Avg, Avg, Avg, Avg, Avg, High, Avg, Avg, Avg, Avg, Avg, Avg, Avg, Avg, Avg, Avg, High, Avg, Avg, Avg, Avg, Avg, Avg, Avg, Avg, Avg, Avg, High, Avg, Avg, Avg, Avg, Avg, Avg, Avg, Avg, Avg, Avg, High, Avg, Avg, Avg, Avg, Avg, Avg, Avg, Avg, Avg, Avg, High, Avg, Avg, Avg, Avg, Avg, Avg, Avg, Avg, Avg, Avg, Avg, High, Avg, Avg, Avg, Avg, Avg, Avg, Avg, Avg, Avg, Avg, High, Avg, Avg, Avg, Avg, Avg, Avg, Avg, Avg, Avg, Avg, High, Avg, Avg, Avg, Avg, Avg, Avg, Avg, Avg, Avg, Avg, Avg, Avg, Avg, Avg, Avg, Avg, Avg, Avg, Avg, Avg, High, Low, Low, Low, Low, Low, Low, Low, Low, Low, Low, High, Low, Low, Low, Low, Low, Low, Low, Low, Low, Low, High, Low, Low, Low, Low, Low, Low, Low, Low, Low, Low, High, Low, Low, Low, Low, Low, Low, Low, Low, Low, Low, High, Low, Low, Low, Low, Low, Low, Low, Low, Low, Low, Low, Low, Low, Low, Low, Low, Low, Low, Low, Low, Low, Avg, High, Low, Low, Low, Low, Low, Low, Low, Low, Low, Low, High, Low, Low, Low, Low, Low, Low, Low, Low, Low, Low, High, Low, Low, Low, Low, Low, Low, Low, Low, Low, Low, Avg, Low, Low, Low, Low, Low, Low, Low, Low, Low, Low, Avg, Low, Low, Low, Low, Low, Low, Low, Low, Low, Low, Avg, Low, Low, Low, Low, Low, Low, Low, Low, Low, Low, Avg, Low, Low, Low, Low, Low, Low, Low, High, High, High, Avg, High, High, High, High, High, High, High, High, High, High, Avg, High, High, High, High, High, High, High, High, High, High, Avg, High, High, High, High, High, High, High, High, High, High, High, Low, High, High, High, High, High, High, High, High, High, High, High, High, High, High, High, High, High, High, High, High, High, High, High, High, High, High, High, High, High, High, High, High, Avg, High, High, High, High, High, High, High, High, High, High, Avg, High, High, High, High, High, High, High, High, High, High, Avg, High, High, High, High, High, High, High, High, High, High, Avg, High, High, High, High, High, High, High, High, High, High, Avg, High, High, High, High, High, High, High, High, High, High, Avg, High, High, High, High, High, High, High, High, High, High, Avg, High, High, High, High, High, High, High, High, High, High, High, Avg, High, High, High, High, High, High, High, High, High, High, Avg, High, High, High, High, High, High, High, High, High, High, Avg, High, High, High, High, High, Avg, High, High, High, High, Avg, Low, High, High, High, High, Avg, Avg, Avg, Avg, Avg, Avg, Avg, Low, High, High, High, Avg, High, High, High, High, Avg, Avg, High, High, High, High, High, Low, High, High, High, Avg, Avg, Avg, Avg, Avg, Avg, Avg, Avg, Low, High, High, Avg, High, High, Avg, High, High, High, High, Low, High, High, Avg, High, High, Avg, Avg, Avg, Avg, Avg, Avg, Low, High, Avg, High, High, Avg, High, High, High, High, Avg, High, High, High, Avg, Avg, High, Avg, High, High, High, High, High, High, High, Avg, High, High, High, High, High, Avg, Avg, Avg, Avg, Avg, Avg, Avg, Avg, High, Avg, Avg, Avg, Avg, Avg, Avg, Avg, Avg, Avg, Avg, High, Avg, Avg, Avg, Avg, Avg, Avg, Avg, Avg, Avg, Avg, High, Avg, Avg, Avg, Avg, Avg, Avg, Avg, Avg, Avg, Avg
edges3 <- edges2 %>% select(from,to,label, group,weight)
head(edges3)
## from to label group weight
## 1 1 1 rating 5 High 0.0052
## 2 2 1 rating 4 High 0.1788
## 3 3 3 rating 5 Avg 0.0000
## 4 4 3 rating 5 Avg 0.0000
## 5 5 3 rating 5 Avg 0.0000
## 6 6 3 rating 5 Avg 0.0000
head(nodes1)
## id label title group
## 1 1 rating 5 yes High
## 2 2 rating 4 yes High
## 3 3 rating 5 no Avg
## 4 4 rating 5 no Avg
## 5 5 rating 5 no Avg
## 6 6 rating 5 no Avg
visNetwork(nodes=nodes1, edges=edges3, main='Ratings Cost if business replied and Number of Friends as arrow weights') %>% visEdges(arrows=c('from','middle')) %>%
visInteraction(navigationButtons=TRUE, dragNodes=FALSE,
dragView=TRUE, zoomView = TRUE) %>%
visOptions(nodesIdSelection = TRUE, manipulation=FALSE) %>%
visIgraphLayout() %>%
visLegend
We can see in the above plot of the ratings in groups by cost of either high, average, or low, that there are almost the same amount of reviews in each group. When hovering the nodes are going to show if the business replied to his or her review as yes if they did and as no if not. When zooming in on the nodes of each group you can see the rating and arrow weights of the number of friends each review has as ratios of the number of friends a user has divided by the max number of friends all users have.
Lets build another network but with different groupings.
visNodes3 <- Reviews13 %>% select(userRatingValue,LowAvgHighCost,businessReplied,friends)
visNodes3$id <- row.names(visNodes3)
visNodes3$weight <- visNodes3$friends/max(visNodes3$friends,na.rm=TRUE)
visNodes3$label <- as.factor(paste(visNodes3$LowAvgHighCost,'cost',sep=' '))
visNodes3$group <- as.factor(paste('rating', visNodes3$userRatingValue, sep=' '))
visNodes3$title <- paste(visNodes3$friends,'friends',sep=' ')
head(visNodes3)
## userRatingValue LowAvgHighCost businessReplied friends id weight label
## 1 5 High yes 26 1 0.0052 High cost
## 2 4 High yes 894 2 0.1788 High cost
## 3 5 Avg no 0 3 0.0000 Avg cost
## 4 5 Avg no 0 4 0.0000 Avg cost
## 5 5 Avg no 0 5 0.0000 Avg cost
## 6 5 Avg no 0 6 0.0000 Avg cost
## group title
## 1 rating 5 26 friends
## 2 rating 4 894 friends
## 3 rating 5 0 friends
## 4 rating 5 0 friends
## 5 rating 5 0 friends
## 6 rating 5 0 friends
nodes1 <- visNodes3 %>% select(id,label,title, group)
edges1 <- visNodes3 %>% select(id,label,group,weight)
edges1$from <- edges1$id
edges2 <- edges1 %>% mutate(to = plyr::mapvalues(edges1$group, from=nodes1$group, to = nodes1$id))
## The following `from` values were not present in `x`: rating 5, rating 5, rating 5, rating 5, rating 5, rating 5, rating 5, rating 5, rating 5, rating 5, rating 4, rating 5, rating 4, rating 5, rating 5, rating 5, rating 5, rating 5, rating 1, rating 4, rating 5, rating 5, rating 1, rating 5, rating 5, rating 5, rating 5, rating 5, rating 5, rating 5, rating 3, rating 5, rating 4, rating 5, rating 5, rating 5, rating 5, rating 5, rating 5, rating 5, rating 5, rating 4, rating 5, rating 5, rating 5, rating 5, rating 5, rating 5, rating 5, rating 5, rating 5, rating 5, rating 1, rating 5, rating 4, rating 5, rating 5, rating 5, rating 4, rating 5, rating 5, rating 5, rating 5, rating 1, rating 5, rating 5, rating 5, rating 5, rating 1, rating 5, rating 1, rating 5, rating 5, rating 5, rating 4, rating 5, rating 5, rating 5, rating 5, rating 5, rating 5, rating 5, rating 5, rating 5, rating 5, rating 1, rating 5, rating 5, rating 5, rating 5, rating 5, rating 5, rating 5, rating 5, rating 5, rating 5, rating 1, rating 5, rating 5, rating 5, rating 5, rating 5, rating 5, rating 5, rating 5, rating 5, rating 4, rating 1, rating 1, rating 5, rating 5, rating 5, rating 5, rating 5, rating 5, rating 4, rating 5, rating 5, rating 5, rating 4, rating 5, rating 5, rating 5, rating 5, rating 5, rating 5, rating 5, rating 5, rating 4, rating 5, rating 1, rating 5, rating 5, rating 5, rating 5, rating 5, rating 5, rating 5, rating 5, rating 5, rating 5, rating 4, rating 5, rating 5, rating 5, rating 5, rating 5, rating 5, rating 5, rating 5, rating 5, rating 1, rating 1, rating 5, rating 3, rating 1, rating 5, rating 1, rating 5, rating 4, rating 3, rating 5, rating 1, rating 5, rating 1, rating 5, rating 4, rating 4, rating 2, rating 1, rating 5, rating 5, rating 1, rating 1, rating 5, rating 5, rating 5, rating 5, rating 4, rating 5, rating 5, rating 1, rating 5, rating 4, rating 2, rating 2, rating 5, rating 1, rating 1, rating 4, rating 1, rating 5, rating 2, rating 5, rating 1, rating 5, rating 1, rating 2, rating 4, rating 4, rating 1, rating 4, rating 5, rating 5, rating 1, rating 5, rating 5, rating 5, rating 1, rating 1, rating 4, rating 5, rating 4, rating 5, rating 5, rating 5, rating 3, rating 1, rating 4, rating 2, rating 1, rating 5, rating 5, rating 5, rating 4, rating 5, rating 5, rating 5, rating 1, rating 4, rating 4, rating 1, rating 4, rating 5, rating 4, rating 5, rating 1, rating 5, rating 4, rating 5, rating 2, rating 3, rating 5, rating 4, rating 5, rating 5, rating 1, rating 5, rating 5, rating 4, rating 4, rating 5, rating 5, rating 4, rating 3, rating 3, rating 5, rating 5, rating 4, rating 4, rating 1, rating 4, rating 5, rating 4, rating 5, rating 3, rating 3, rating 4, rating 5, rating 5, rating 4, rating 4, rating 5, rating 5, rating 1, rating 3, rating 5, rating 4, rating 4, rating 4, rating 5, rating 4, rating 3, rating 5, rating 5, rating 5, rating 3, rating 5, rating 5, rating 5, rating 3, rating 5, rating 3, rating 3, rating 5, rating 5, rating 5, rating 3, rating 4, rating 4, rating 4, rating 5, rating 2, rating 4, rating 1, rating 4, rating 5, rating 3, rating 3, rating 3, rating 3, rating 2, rating 2, rating 5, rating 1, rating 5, rating 5, rating 5, rating 5, rating 4, rating 5, rating 5, rating 4, rating 5, rating 1, rating 3, rating 5, rating 5, rating 4, rating 5, rating 4, rating 1, rating 5, rating 5, rating 4, rating 1, rating 3, rating 2, rating 5, rating 4, rating 5, rating 5, rating 1, rating 2, rating 2, rating 3, rating 5, rating 4, rating 1, rating 5, rating 2, rating 3, rating 5, rating 4, rating 5, rating 4, rating 4, rating 1, rating 5, rating 3, rating 2, rating 5, rating 5, rating 4, rating 1, rating 4, rating 2, rating 5, rating 1, rating 3, rating 1, rating 2, rating 5, rating 5, rating 3, rating 5, rating 5, rating 4, rating 5, rating 5, rating 4, rating 2, rating 1, rating 5, rating 5, rating 1, rating 2, rating 3, rating 2, rating 5, rating 1, rating 5, rating 5, rating 5, rating 4, rating 5, rating 3, rating 1, rating 5, rating 5, rating 4, rating 3, rating 4, rating 5, rating 1, rating 4, rating 5, rating 3, rating 4, rating 5, rating 5, rating 5, rating 5, rating 4, rating 4, rating 5, rating 5, rating 5, rating 1, rating 5, rating 4, rating 2, rating 1, rating 5, rating 5, rating 4, rating 5, rating 5, rating 5, rating 2, rating 4, rating 5, rating 5, rating 4, rating 4, rating 3, rating 5, rating 1, rating 5, rating 5, rating 5, rating 5, rating 5, rating 5, rating 1, rating 1, rating 2, rating 1, rating 2, rating 3, rating 3, rating 5, rating 5, rating 4, rating 1, rating 2, rating 1, rating 1, rating 1, rating 2, rating 2, rating 4, rating 5, rating 4, rating 3, rating 1, rating 4, rating 1, rating 4, rating 1, rating 1, rating 2, rating 5, rating 5, rating 5, rating 4, rating 4, rating 2, rating 5, rating 5, rating 5, rating 5, rating 5, rating 5, rating 1, rating 5, rating 5, rating 5, rating 3, rating 4, rating 4, rating 4, rating 4, rating 4, rating 1, rating 5, rating 3, rating 1, rating 1, rating 1, rating 2, rating 2, rating 3, rating 4, rating 4, rating 4, rating 3, rating 5, rating 5, rating 5, rating 5, rating 5, rating 5, rating 5, rating 5, rating 5, rating 5, rating 5, rating 4, rating 1, rating 3, rating 1, rating 1, rating 2, rating 2, rating 3, rating 4, rating 4, rating 5, rating 4, rating 4, rating 5, rating 5, rating 5, rating 5, rating 5, rating 5, rating 5, rating 5, rating 5, rating 5, rating 4, rating 3, rating 4, rating 4, rating 1, rating 1, rating 1, rating 1, rating 1, rating 4, rating 5, rating 1, rating 1, rating 1, rating 1, rating 1, rating 3, rating 3, rating 3, rating 3, rating 3, rating 5, rating 3, rating 3, rating 3, rating 3, rating 3, rating 5, rating 5, rating 5, rating 5, rating 5, rating 3, rating 5, rating 5, rating 5, rating 5, rating 5, rating 5, rating 5, rating 5, rating 5, rating 5, rating 5, rating 5, rating 5, rating 4, rating 5, rating 5, rating 5, rating 5, rating 5, rating 5, rating 5, rating 5, rating 5, rating 5, rating 5, rating 5, rating 5, rating 5, rating 5, rating 1, rating 5, rating 5, rating 5, rating 5, rating 5
edges3 <- edges2 %>% select(from,to,label, group,weight)
head(edges3)
## from to label group weight
## 1 1 1 High cost rating 5 0.0052
## 2 2 2 High cost rating 4 0.1788
## 3 3 1 Avg cost rating 5 0.0000
## 4 4 1 Avg cost rating 5 0.0000
## 5 5 1 Avg cost rating 5 0.0000
## 6 6 1 Avg cost rating 5 0.0000
head(nodes1)
## id label title group
## 1 1 High cost 26 friends rating 5
## 2 2 High cost 894 friends rating 4
## 3 3 Avg cost 0 friends rating 5
## 4 4 Avg cost 0 friends rating 5
## 5 5 Avg cost 0 friends rating 5
## 6 6 Avg cost 0 friends rating 5
visNetwork(nodes=nodes1, edges=edges3, main='Ratings as Groups and Cost as Labels with Number of Friends as Arrow Weights') %>% visEdges(arrows=c('from','middle')) %>%
visInteraction(navigationButtons=TRUE, dragNodes=FALSE,
dragView=TRUE, zoomView = TRUE) %>%
visOptions(nodesIdSelection = TRUE, manipulation=FALSE) %>%
visIgraphLayout() %>%
visLegend(ncol=2)
The above visual network is great for looking at the number of reviews in each rating, but also when zooming in to see the cost as Low, High, or Average and hovering shows how many social media friends each reviewer has.
We should make a visual network of the keywords and the ratings to go with and maybe a couple different visualizations on our manually built best model for predicting ratings based on the ceiling of the median of the dot product of votes times ratings when there is a tie between rating votes that were voted on by which review has the minimum value of the ratio of term to total terms in the document to term to total terms by rating.
I worked on the for loop outside this script and thankfully it works and it doesn’t take very long for it to run. Less than two minutes. Here is that content and the new file Reviews15 we will be working with.
!!! CAUTION: !!!
Make sure to only run this once if you already have these files or delete the keyValues3.csv and the keyValues3_ratios, as it will append to your files. When running the chunks even with eval set to FALSE, everything is ran, those header commands only work when kitting the file. It takes about a minute to load all 614 reviews into those 12 keywords with ratios. So it is ok to delete the files.
for (num in 1:length(Reviews13$userReviewOnlyContent)){
#num <- 30
str1 <- as.character(paste(Reviews13$userReviewOnlyContent[num]))
str1 <- gsub('[!|.|,|\n|\']',' ',str1,perl=TRUE)
str1 <- gsub('[ ]',' ',str1)
str1 <- trimws(str1, which=c('both'), whitespace='[\t\r\n ]')
totalTerms <- length((strsplit(str1, split=' ')[[1]]))
keys <- c("the", "and" , "for" , "have" ,"that" ,"they" ,"this" ,"you" ,
"not" , "but" ,"good" ,"with")
and <- str_match_all(str1,' [aA][nN][dD] ')
AND <- length(and[[1]])
the <- str_match_all(str1,' [tT][hH][eE] ')
THE <- length(the[[1]])
for1 <- str_match_all(str1,' [fF][oO][rR] ')
FOR1 <- length(for1[[1]])
have <- str_match_all(str1,' [hH][aA][vV][eE] ')
HAVE <- length(have[[1]])
that <- str_match_all(str1,' [tT][hH][aA][tT] ')
THAT <- length(that[[1]])
they <- str_match_all(str1,' [tT][hH][eE][yY] ')
THEY <- length(they[[1]])
this <- str_match_all(str1,' [tT][hH][iI][sS] ')
THIS <- length(this[[1]])
you <- str_match_all(str1,' [yY][oO][uU] ')
YOU <- length(you[[1]])
not <- str_match_all(str1,' [nN][oO][tT] ')
NOT <- length(not[[1]])
but <- str_match_all(str1,' [bB][uU][tT] ')
BUT <- length(but[[1]])
good <- str_match_all(str1,' [gG][oO][oO][dD] ')
GOOD <- length(good[[1]])
with <- str_match_all(str1,' [wW][iI][tT][hH] ')
WITH <- length(with[[1]])
values <- as.data.frame(c(THE,AND,FOR1,HAVE,THAT,THEY,THIS,YOU,NOT,BUT,GOOD,WITH))
row.names(values) <- keys
keyValues <- as.data.frame(t(values))
keyValues2 <- keyValues/totalTerms
keyValues3 <- rbind(keyValues,keyValues2)
row.names(keyValues3) <- c(paste('documentTermCount', num, sep='_'),
paste('term_to_totalDocumentTerms', num, sep='_'))
keyValues3 <- round(keyValues3,3)
keyValues4 <- as.matrix(keyValues3)
cat(keyValues4[1,1:12],file='keyValues3.csv',append=TRUE, sep='\n',fill=TRUE)
cat(keyValues4[2,1:12],file='keyValues3_ratios.csv',append=TRUE, sep='\n',fill=TRUE)
}
Well, great news! The above looks like it worked and now we have the rest of our keyword data to make a matrix and then data frame out of. It took about one minute as I watched the kb file size in the file window change for each keyValues csv file in the for loop above. These are actually all the records, because they were appended to the other records. So we should have 614X12= r614*12
observations. Lets find out.
all_kws <- read.csv('keyValues3.csv', sep=',', header=FALSE, na.strings=c('',' ','NA'))
Ok, good, because it does say 7368 obs and 1 variable, (if you have more rows than this, you ran the code twice. search for keywords3.csv within Rstudio with the magnifying glass in the toolbar and see if you did, otherwise continue) as expected or anticipating it to. Now lets make this into a data frame after first making it into a matrix.
all_kws1 <- all_kws$V1
ALL_kws <- matrix(all_kws1, nrow=12,ncol=614,byrow=FALSE)
ALL_KWs <- as.data.frame(t(ALL_kws))
row.names(ALL_KWs) <- NULL
colnames(ALL_KWs) <- keys
Now lets get the ratios for all of these reviews and keywords.
all_kwrs <- read.csv('keyValues3_ratios.csv', header=FALSE, sep=',',
na.strings=c('',' ','NA'))
all_kwrs1 <- all_kwrs$V1
ALL_kwrs <- matrix(all_kwrs1, nrow=12,ncol=614,byrow=FALSE)
ALL_KWRs <- as.data.frame(t(ALL_kwrs))
row.names(ALL_KWRs) <- NULL
colnames(ALL_KWRs) <- paste(keys,'ratios', sep='_')
Now lets combine the two tables together.
ALL_keywords <- cbind(ALL_KWs,ALL_KWRs)
head(ALL_keywords)
## the and for have that they this you not but good with the_ratios and_ratios
## 1 15 5 3 4 4 3 1 2 1 1 2 0 0.055 0.018
## 2 24 17 5 3 3 3 3 3 1 6 0 3 0.040 0.028
## 3 4 5 1 2 1 0 1 1 0 0 0 3 0.033 0.041
## 4 5 2 0 1 0 0 2 0 0 0 0 0 0.083 0.033
## 5 1 2 2 0 0 0 0 2 0 0 0 1 0.021 0.042
## 6 0 2 1 1 0 0 0 0 0 0 2 0 0.000 0.030
## for_ratios have_ratios that_ratios they_ratios this_ratios you_ratios
## 1 0.011 0.015 0.015 0.011 0.004 0.007
## 2 0.008 0.005 0.005 0.005 0.005 0.005
## 3 0.008 0.016 0.008 0.000 0.008 0.008
## 4 0.000 0.017 0.000 0.000 0.033 0.000
## 5 0.042 0.000 0.000 0.000 0.000 0.042
## 6 0.015 0.015 0.000 0.000 0.000 0.000
## not_ratios but_ratios good_ratios with_ratios
## 1 0.004 0.004 0.007 0.000
## 2 0.002 0.010 0.000 0.005
## 3 0.000 0.000 0.000 0.024
## 4 0.000 0.000 0.000 0.000
## 5 0.000 0.000 0.000 0.021
## 6 0.000 0.000 0.030 0.000
Super great! These for loops can be tricky. Thankfully, that worked. Lets write this out to file. The row names are not important because they are the order listed the same as the reviews listed from the Reviews13 data table.
write.csv(ALL_keywords,'ALL_keywords.csv', row.names=FALSE)
Lets now combine this with a merge of IDs as row numbers shall we?
Reviews14 <- Reviews13
Reviews14$id <- row.names(Reviews13)
ALL_keywords$id <- row.names(ALL_keywords)
Now merge by id.
Reviews15 <- merge(Reviews14, ALL_keywords, by.x='id', by.y='id')
head(Reviews15)
## id userReviewSeries
## 1 1 mostRecentVisit_review
## 2 10 mostRecentVisit_review
## 3 100 mostRecentVisit_review
## 4 101 twoVisitsPrior_review
## 5 102 mostRecentVisit_review
## 6 103 mostRecentVisit_review
## userReviewOnlyContent
## 1 What a wonderful way to start the year! This was my second time back to HIGH END SPA, and we had a great time. The crowds were very low (seriously, it felt like we had the place to ourselves most of the day.) We walked right into the mineral baths, club mud, and didn't wait in any kind of line for lunch. None of the pools were crowded, and we were even able to enjoy one of the hammocks in the secret garden.\n\nTiffany at the front check-in desk went above and beyond for us regarding the robes. I had requested a plus-sized robe, since after my last review I knew they had added some to their collection. Unfortunately, all of their plus-sized robes were still dirty from the day before. Tiffany was so accommodating, though! She was able to get us robes from the cabana area that fit me perfectly! It is so great to know that not only do they now offer guests of all sizes the option to enjoy a warm robe, but that they really want to make sure you have a good day. Thank you, Tiffany, for everything.\n\nAll of the staff today were in good spirits. The only thing that would have made today better would have been a massage. We'll have to book one next time. My husband and I are going to make HIGH END SPA our annual New Year's Day tradition!\n\n
## 2 I'm so happy I found CHIROPRACTIC!\n\nBrenda was so sweet and attentive, from making my appointment to greeting me upon arrival.\n\nI saw Bertha for a prenatal massage, how I survived my first pregnancy without one, I'm clueless. Bertha listened to my needs and my bodies. She helped relieve tension in my neck and shoulders.\n\nI could have fell asleep, only complaint would be - why aren't massages longer than an hour lol\n\nI cannot wait to come back monthly through this pregnancy. I also am excited to try a prenatal adjustment
## 3 Their staff is super nice. The doctor is also great and always gets the knots out of my back. I felt better right after my first appointment!
## 4 It's too bad, I had such a great time here and some bathroom attendant ruined my whole experience!! Just the worst manners and let's just say customer service was not her specialty or even close.\nThis young girl had some nerve to correct a customer for accidentally missing the trash with some paper from a cinco de mayo mustache.. (jokes) she chases after me to tell me to throw it in the trash I explained half way down the hall I was sorry and had to\nLeAve, my friend was sick and need me to tend to her. She then chased me down again and started to harass me to tell her where my friend threw up. Really? Well, maybe she had a bad day.. but after explaining what happen to management and the front office, Jose, the manager, didn't look too surprised.. I guess this is normal behavior for her.. needless to say I'm almost afraid to go back. I may not hold my tongue next time.. personal space was not In her vocabulary she tapped me on the shoulder, she's lucky i was in a great mood till then..\n
## 5 Fabulous place to get adjusted. The office is calm and clean. The staff is friendly. Dr. Ramada is fantastic! He really understood the cause of my pain and was able to adjust me quickly. I love the availability and evening appointments. Highly recommend!
## 6 I was looking for a Chiropractor in my area and I stumbled upon CHIROPRACTIC. It is a really awesome place. The staff and facilities are very nice. And they are very reasonably priced, much better price then my last Chiropractor. Conveniently located off of the 15 freeway two exits south of the 91. What is really cool is they also offer massages also. If you are looking for a Chiropractor in Corona/Riverside area look no further.
## userRatingSeries userRatingValue businessReplied
## 1 mostRecentVisit_rating 5 yes
## 2 mostRecentVisit_rating 5 no
## 3 mostRecentVisit_rating 5 no
## 4 lastVisit_rating 1 yes
## 5 mostRecentVisit_rating 5 no
## 6 mostRecentVisit_rating 5 no
## businessReplyContent
## 1 Amber P. of HIGH END SPA Hot Springs\n\nBusiness Customer Service\n\n1/2/20191/15/2018-\nHi Michelle, HIGH END SPA is proud to welcome men and women of all shapes and sizes. In response to your day, we are now in the process of ordering a few XL robes so we can continue to have offerings for all of our guests. I wanted to reach out to you to let you know we have sent you a private message as we would like to connect with you directly. Thank you again for communicating your concern with us.\nAlexa Gallegos\n\n1/2/2019 -\n\nHi Michelle,\nI am so happy to hear that you had a great returning experience! Our team members do the best they can to accommodate all of our guests needs and we are very glad to hear you were happy with the solution.\nWe hope to see you and your husband again!\n\nBest,\nAmber Peyghambari\n\nRead less\n
## 2 NA
## 3 NA
## 4 Amber P. of HIGH END SPA Hot Springs\n\nBusiness Customer Service\n\n5/8/2018Hi Raven, thank you for taking the time to write a review of your recent visit. I am sorry to hear about your incident in the bath house at the end of your visit. I spoke with Jose and he mentioned you all were a pleasure to have and that after speaking with you about this occurrence, he internally addressed the issue so this wouldn't happen again. We take our guest comments very seriously because guest comments, good and bad, help us to learn and grow. A guest who makes the commitment to reach out and tell us what was not perfect is invaluable to our company. Always feel free to contact me regarding any of your visits or if you ever have any questions or comments.\nBest regards, Alexa Gallegos, HIGH END SPA Hot Springs
## 5 NA
## 6 NA
## userReviewContent
## 1 1/1/2019Updated review\n 2 photos\n\nWhat a wonderful way to start the year! This was my second time back to HIGH END SPA, and we had a great time. The crowds were very low (seriously, it felt like we had the place to ourselves most of the day.) We walked right into the mineral baths, club mud, and didn't wait in any kind of line for lunch. None of the pools were crowded, and we were even able to enjoy one of the hammocks in the secret garden.\n\nTiffany at the front check-in desk went above and beyond for us regarding the robes. I had requested a plus-sized robe, since after my last review I knew they had added some to their collection. Unfortunately, all of their plus-sized robes were still dirty from the day before. Tiffany was so accommodating, though! She was able to get us robes from the cabana area that fit me perfectly! It is so great to know that not only do they now offer guests of all sizes the option to enjoy a warm robe, but that they really want to make sure you have a good day. Thank you, Tiffany, for everything.\n\nAll of the staff today were in good spirits. The only thing that would have made today better would have been a massage. We'll have to book one next time. My husband and I are going to make HIGH END SPA our annual New Year's Day tradition!\n\nComment from Amber P. of HIGH END SPA Hot Springs\n\nBusiness Customer Service\n\n1/2/20191/15/2018-\nHi Michelle, HIGH END SPA is proud to welcome men and women of all shapes and sizes. In response to your day, we are now in the process of ordering a few XL robes so we can continue to have offerings for all of our guests. I wanted to reach out to you to let you know we have sent you a private message as we would like to connect with you directly. Thank you again for communicating your concern with us.\nAlexa Gallegos\n\n1/2/2019 -\n\nHi Michelle,\nI am so happy to hear that you had a great returning experience! Our team members do the best they can to accommodate all of our guests needs and we are very glad to hear you were happy with the solution.\nWe hope to see you and your husband again!\n\nBest,\nAmber Peyghambari\n\nRead less\n
## 2 11/29/2018\nI'm so happy I found CHIROPRACTIC!\n\nBrenda was so sweet and attentive, from making my appointment to greeting me upon arrival.\n\nI saw Bertha for a prenatal massage, how I survived my first pregnancy without one, I'm clueless. Bertha listened to my needs and my bodies. She helped relieve tension in my neck and shoulders.\n\nI could have fell asleep, only complaint would be - why aren't massages longer than an hour lol\n\nI cannot wait to come back monthly through this pregnancy. I also am excited to try a prenatal adjustment
## 3 3/20/2018\n 2 check-ins\n\nTheir staff is super nice. The doctor is also great and always gets the knots out of my back. I felt better right after my first appointment!
## 4 5/7/2018Previous review\nIt's too bad, I had such a great time here and some bathroom attendant ruined my whole experience!! Just the worst manners and let's just say customer service was not her specialty or even close.\nThis young girl had some nerve to correct a customer for accidentally missing the trash with some paper from a cinco de mayo mustache.. (jokes) she chases after me to tell me to throw it in the trash I explained half way down the hall I was sorry and had to\nLeAve, my friend was sick and need me to tend to her. She then chased me down again and started to harass me to tell her where my friend threw up. Really? Well, maybe she had a bad day.. but after explaining what happen to management and the front office, Jose, the manager, didn't look too surprised.. I guess this is normal behavior for her.. needless to say I'm almost afraid to go back. I may not hold my tongue next time.. personal space was not In her vocabulary she tapped me on the shoulder, she's lucky i was in a great mood till then..\nComment from Amber P. of HIGH END SPA Hot Springs\n\nBusiness Customer Service\n\n5/8/2018Hi Raven, thank you for taking the time to write a review of your recent visit. I am sorry to hear about your incident in the bath house at the end of your visit. I spoke with Jose and he mentioned you all were a pleasure to have and that after speaking with you about this occurrence, he internally addressed the issue so this wouldn't happen again. We take our guest comments very seriously because guest comments, good and bad, help us to learn and grow. A guest who makes the commitment to reach out and tell us what was not perfect is invaluable to our company. Always feel free to contact me regarding any of your visits or if you ever have any questions or comments.\nBest regards, Alexa Gallegos, HIGH END SPA Hot Springs
## 5 8/30/2016\n 1 check-in\n\nFabulous place to get adjusted. The office is calm and clean. The staff is friendly. Dr. Ramada is fantastic! He really understood the cause of my pain and was able to adjust me quickly. I love the availability and evening appointments. Highly recommend!
## 6 2/26/2018\n 27 check-ins\n\nI was looking for a Chiropractor in my area and I stumbled upon CHIROPRACTIC. It is a really awesome place. The staff and facilities are very nice. And they are very reasonably priced, much better price then my last Chiropractor. Conveniently located off of the 15 freeway two exits south of the 91. What is really cool is they also offer massages also. If you are looking for a Chiropractor in Corona/Riverside area look no further.
## LowAvgHighCost businessType cityState friends reviews
## 1 High high end massage retreat Orange, CA 26 33
## 2 Avg chiropractic Fayetteville, NC 0 7
## 3 Avg chiropractic Corona, CA 943 7
## 4 High high end massage retreat Rancho Cucamonga, CA 12 12
## 5 Avg chiropractic Los Angeles, CA 11 24
## 6 Avg chiropractic Corona, CA 4 NA
## photos eliteStatus userName Date userBusinessPhotos userCheckIns
## 1 21 <NA> Michelle A. 2019-01-01 2 NA
## 2 NA <NA> Devin S. 2018-11-29 NA NA
## 3 2 <NA> Courtney S. 2018-03-20 NA 2
## 4 4 <NA> Raven H. 2018-05-07 NA NA
## 5 11 <NA> Monique O. 2016-08-30 NA 1
## 6 NA <NA> Matthew B. 2018-02-26 NA 27
## weekday the and for have that they this you not but good with the_ratios
## 1 Sun 15 5 3 4 4 3 1 2 1 1 2 0 0.055
## 2 Sun 0 3 1 1 0 0 1 0 0 0 0 0 0.000
## 3 Fri 2 1 0 0 0 0 0 0 0 0 0 0 0.069
## 4 Sat 7 6 2 0 0 0 2 0 3 1 0 1 0.032
## 5 Mon 4 3 0 0 0 0 0 0 0 0 0 0 0.080
## 6 Thu 3 3 2 0 0 2 0 1 0 0 0 0 0.036
## and_ratios for_ratios have_ratios that_ratios they_ratios this_ratios
## 1 0.018 0.011 0.015 0.015 0.011 0.004
## 2 0.028 0.009 0.009 0.000 0.000 0.009
## 3 0.034 0.000 0.000 0.000 0.000 0.000
## 4 0.027 0.009 0.000 0.000 0.000 0.009
## 5 0.060 0.000 0.000 0.000 0.000 0.000
## 6 0.036 0.024 0.000 0.000 0.024 0.000
## you_ratios not_ratios but_ratios good_ratios with_ratios
## 1 0.007 0.004 0.004 0.007 0.000
## 2 0.000 0.000 0.000 0.000 0.000
## 3 0.000 0.000 0.000 0.000 0.000
## 4 0.000 0.014 0.005 0.000 0.005
## 5 0.000 0.000 0.000 0.000 0.000
## 6 0.012 0.000 0.000 0.000 0.000
Now we need to write this out to csv to read in to other file or just to have.
write.csv(Reviews15,'ReviewsCleanedWithKeywordsAndRatios.csv', row.names=FALSE)
This file is almost 1 Mb in file size, which actually isn’t that large.
Now that we have our table with the added ratios we can now look at building different data sets with feature selection to predict any of our other targets, but most likely the rating by review. But also we can now use our best prediction model of the mean votes algorithm and other big name machine learning algorithms I have used many times. We will do that next. Caret Cheatsheet is available from the available cheatsheets in the toolbar in Rstudio in the Help menu under Cheatsheets, scroll to the ‘Contributed Cheatsheets’ at the bottom and select the caret cheatsheet if you would like to refresh or learn about the algorithms caret has to work with in R for machine learning on the above data set.
Before moving on to the machine learning algorithms in R, specifically the caret package. Lets make a test and train set out of this data frame, and test the features we individually select on the target variable of the rating using our best model, the ceiling of the mean of the dot product of the votes and ratings or the top votes.We will use our database Reviews15. If you don’t have it in your environment go ahead and read it in from the ReviewsCleanedWithKeywordsAndRatios.csv file.
Reviews15 <- read.csv('ReviewsCleanedWithKeywordsAndRatios.csv',sep=',',
header=TRUE, na.strings=c('',' ','NA'))
wordToAllWords <- read.csv('wordToAllWords.csv', sep=',', header=TRUE, na.strings=c('',' ','NA'),
row.names=1)
Now lets see the table we will compare each review ratio of term to total terms to the term to total terms in all reviews by rating.
head(wordToAllWords)
## and but for. good have not that the they
## Rating1_term2totalTerm 0.046 0.008 0.018 0.006 0.016 0.011 0.015 0.057 0.012
## Rating2_term2totalTerm 0.042 0.011 0.014 0.006 0.016 0.011 0.013 0.073 0.011
## Rating3_term2totalTerm 0.043 0.010 0.021 0.008 0.020 0.010 0.012 0.072 0.009
## Rating4_term2totalTerm 0.046 0.011 0.016 0.010 0.016 0.007 0.012 0.061 0.010
## Rating5_term2totalTerm 0.059 0.006 0.018 0.013 0.023 0.004 0.010 0.056 0.010
## this with you
## Rating1_term2totalTerm 0.013 0.008 0.009
## Rating2_term2totalTerm 0.010 0.008 0.017
## Rating3_term2totalTerm 0.011 0.009 0.013
## Rating4_term2totalTerm 0.008 0.009 0.015
## Rating5_term2totalTerm 0.010 0.010 0.012
Now lets select our features we want to use the top model on. Since our model was based on the ratios, we will only be using the rating and those features, and from their it will create a voting system to grab the rating with the highest votes for the term to total terms per document. Make sure to read in all the packages this Rmarkdown file uses. The following will use stringr, dplyr, and tidyr or tidyverse for sure. Later we will use the caret package for Machine learning models to compare against our model.
colnames(Reviews15)
## [1] "id" "userReviewSeries" "userReviewOnlyContent"
## [4] "userRatingSeries" "userRatingValue" "businessReplied"
## [7] "businessReplyContent" "userReviewContent" "LowAvgHighCost"
## [10] "businessType" "cityState" "friends"
## [13] "reviews" "photos" "eliteStatus"
## [16] "userName" "Date" "userBusinessPhotos"
## [19] "userCheckIns" "weekday" "the"
## [22] "and" "for." "have"
## [25] "that" "they" "this"
## [28] "you" "not" "but"
## [31] "good" "with" "the_ratios"
## [34] "and_ratios" "for_ratios" "have_ratios"
## [37] "that_ratios" "they_ratios" "this_ratios"
## [40] "you_ratios" "not_ratios" "but_ratios"
## [43] "good_ratios" "with_ratios"
ML_rating_data <- Reviews15 %>% select(userRatingValue,
the_ratios:with_ratios)
head(ML_rating_data)
## userRatingValue the_ratios and_ratios for_ratios have_ratios that_ratios
## 1 5 0.055 0.018 0.011 0.015 0.015
## 2 5 0.000 0.028 0.009 0.009 0.000
## 3 5 0.069 0.034 0.000 0.000 0.000
## 4 1 0.032 0.027 0.009 0.000 0.000
## 5 5 0.080 0.060 0.000 0.000 0.000
## 6 5 0.036 0.036 0.024 0.000 0.000
## they_ratios this_ratios you_ratios not_ratios but_ratios good_ratios
## 1 0.011 0.004 0.007 0.004 0.004 0.007
## 2 0.000 0.009 0.000 0.000 0.000 0.000
## 3 0.000 0.000 0.000 0.000 0.000 0.000
## 4 0.000 0.009 0.000 0.014 0.005 0.000
## 5 0.000 0.000 0.000 0.000 0.000 0.000
## 6 0.024 0.000 0.012 0.000 0.000 0.000
## with_ratios
## 1 0.000
## 2 0.000
## 3 0.000
## 4 0.005
## 5 0.000
## 6 0.000
If, or when, we use the caret algorithms to make predictions with many of the available models in caret, the target variable would be the userRatingValue, and the keyword ratios (12) would be the predictors. But we aren’t right at this moment and so we don’t have to separate the target from the predictors just yet. We can keep it attached to compare to our model we build on the ceiling of the mean if a tie in votes for the minimum values of the review ratio to the five rating rations.
Keep the rating column as an integer instead of a factor, because we need it as an integer to use the dot product on those reviews that there is a tie against the votes (the minimum value of the difference between the term to total terms in the document to the ratios of the same for all the documents in each rating). We saw earlier that there are some instances where there is a tie, but that the ceiling of the mean or the highest rating will beat out the ceiling of the median or the shortest distance (absolute value) between the reveiw ratio and rating ratios. We need to also get our data frame on ratios of total of each of these 12 words or terms to the total of all words in each corpus or file of reviews by rating in the wordToAllWords table.
colnames(wordToAllWords)[3] <- 'for_' #this is special keyword in R
w2w <- wordToAllWords #shorten name
MLr <- ML_rating_data #shorten the name
MLr$R1_and <- rep(w2w[1,1],length(MLr$userRatingValue))
MLr$R2_and <- rep(w2w[2,1],length(MLr$userRatingValue))
MLr$R3_and <- rep(w2w[3,1],length(MLr$userRatingValue))
MLr$R4_and <- rep(w2w[4,1],length(MLr$userRatingValue))
MLr$R5_and <- rep(w2w[5,1],length(MLr$userRatingValue))
MLr$R1_but <- rep(w2w[1,2],length(MLr$userRatingValue))
MLr$R2_but <- rep(w2w[2,2],length(MLr$userRatingValue))
MLr$R3_but <- rep(w2w[3,2],length(MLr$userRatingValue))
MLr$R4_but <- rep(w2w[4,2],length(MLr$userRatingValue))
MLr$R5_but <- rep(w2w[5,2],length(MLr$userRatingValue))
MLr$R1_for <- rep(w2w[1,3],length(MLr$userRatingValue))
MLr$R2_for <- rep(w2w[2,3],length(MLr$userRatingValue))
MLr$R3_for <- rep(w2w[3,3],length(MLr$userRatingValue))
MLr$R4_for <- rep(w2w[4,3],length(MLr$userRatingValue))
MLr$R5_for <- rep(w2w[5,3],length(MLr$userRatingValue))
MLr$R1_good <- rep(w2w[1,4],length(MLr$userRatingValue))
MLr$R2_good <- rep(w2w[2,4],length(MLr$userRatingValue))
MLr$R3_good <- rep(w2w[3,4],length(MLr$userRatingValue))
MLr$R4_good <- rep(w2w[4,4],length(MLr$userRatingValue))
MLr$R5_good <- rep(w2w[5,4],length(MLr$userRatingValue))
MLr$R1_have <- rep(w2w[1,5],length(MLr$userRatingValue))
MLr$R2_have <- rep(w2w[2,5],length(MLr$userRatingValue))
MLr$R3_have <- rep(w2w[3,5],length(MLr$userRatingValue))
MLr$R4_have <- rep(w2w[4,5],length(MLr$userRatingValue))
MLr$R5_have <- rep(w2w[5,5],length(MLr$userRatingValue))
MLr$R1_not <- rep(w2w[1,6],length(MLr$userRatingValue))
MLr$R2_not <- rep(w2w[2,6],length(MLr$userRatingValue))
MLr$R3_not <- rep(w2w[3,6],length(MLr$userRatingValue))
MLr$R4_not <- rep(w2w[4,6],length(MLr$userRatingValue))
MLr$R5_not <- rep(w2w[5,6],length(MLr$userRatingValue))
MLr$R1_that <- rep(w2w[1,7],length(MLr$userRatingValue))
MLr$R2_that <- rep(w2w[2,7],length(MLr$userRatingValue))
MLr$R3_that <- rep(w2w[3,7],length(MLr$userRatingValue))
MLr$R4_that <- rep(w2w[4,7],length(MLr$userRatingValue))
MLr$R5_that <- rep(w2w[5,7],length(MLr$userRatingValue))
MLr$R1_the <- rep(w2w[1,8],length(MLr$userRatingValue))
MLr$R2_the <- rep(w2w[2,8],length(MLr$userRatingValue))
MLr$R3_the <- rep(w2w[3,8],length(MLr$userRatingValue))
MLr$R4_the <- rep(w2w[4,8],length(MLr$userRatingValue))
MLr$R5_the <- rep(w2w[5,8],length(MLr$userRatingValue))
MLr$R1_they <- rep(w2w[1,9],length(MLr$userRatingValue))
MLr$R2_they <- rep(w2w[2,9],length(MLr$userRatingValue))
MLr$R3_they <- rep(w2w[3,9],length(MLr$userRatingValue))
MLr$R4_they <- rep(w2w[4,9],length(MLr$userRatingValue))
MLr$R5_they <- rep(w2w[5,9],length(MLr$userRatingValue))
MLr$R1_this <- rep(w2w[1,10],length(MLr$userRatingValue))
MLr$R2_this <- rep(w2w[2,10],length(MLr$userRatingValue))
MLr$R3_this <- rep(w2w[3,10],length(MLr$userRatingValue))
MLr$R4_this <- rep(w2w[4,10],length(MLr$userRatingValue))
MLr$R5_this <- rep(w2w[5,10],length(MLr$userRatingValue))
MLr$R1_with <- rep(w2w[1,11],length(MLr$userRatingValue))
MLr$R2_with <- rep(w2w[2,11],length(MLr$userRatingValue))
MLr$R3_with <- rep(w2w[3,11],length(MLr$userRatingValue))
MLr$R4_with <- rep(w2w[4,11],length(MLr$userRatingValue))
MLr$R5_with <- rep(w2w[5,11],length(MLr$userRatingValue))
MLr$R1_you <- rep(w2w[1,12],length(MLr$userRatingValue))
MLr$R2_you <- rep(w2w[2,12],length(MLr$userRatingValue))
MLr$R3_you <- rep(w2w[3,12],length(MLr$userRatingValue))
MLr$R4_you <- rep(w2w[4,12],length(MLr$userRatingValue))
MLr$R5_you <- rep(w2w[5,12],length(MLr$userRatingValue))
MLr$and_diff1 <- MLr$R1_and-MLr$and_ratios
MLr$and_diff2 <- MLr$R2_and-MLr$and_ratios
MLr$and_diff3 <- MLr$R3_and-MLr$and_ratios
MLr$and_diff4 <- MLr$R4_and-MLr$and_ratios
MLr$and_diff5 <- MLr$R5_and-MLr$and_ratios
MLr$but_diff1 <- MLr$R1_but-MLr$but_ratios
MLr$but_diff2 <- MLr$R2_but-MLr$but_ratios
MLr$but_diff3 <- MLr$R3_but-MLr$but_ratios
MLr$but_diff4 <- MLr$R4_but-MLr$but_ratios
MLr$but_diff5 <- MLr$R5_but-MLr$but_ratios
MLr$for_diff1 <- MLr$R1_for-MLr$for_ratios
MLr$for_diff2 <- MLr$R2_for-MLr$for_ratios
MLr$for_diff3 <- MLr$R3_for-MLr$for_ratios
MLr$for_diff4 <- MLr$R4_for-MLr$for_ratios
MLr$for_diff5 <- MLr$R5_for-MLr$for_ratios
MLr$good_diff1 <- MLr$R1_good-MLr$good_ratios
MLr$good_diff2 <- MLr$R2_good-MLr$good_ratios
MLr$good_diff3 <- MLr$R3_good-MLr$good_ratios
MLr$good_diff4 <- MLr$R4_good-MLr$good_ratios
MLr$good_diff5 <- MLr$R5_good-MLr$good_ratios
MLr$have_diff1 <- MLr$R1_have-MLr$have_ratios
MLr$have_diff2 <- MLr$R2_have-MLr$have_ratios
MLr$have_diff3 <- MLr$R3_have-MLr$have_ratios
MLr$have_diff4 <- MLr$R4_have-MLr$have_ratios
MLr$have_diff5 <- MLr$R5_have-MLr$have_ratios
MLr$not_diff1 <- MLr$R1_not-MLr$not_ratios
MLr$not_diff2 <- MLr$R2_not-MLr$not_ratios
MLr$not_diff3 <- MLr$R3_not-MLr$not_ratios
MLr$not_diff4 <- MLr$R4_not-MLr$not_ratios
MLr$not_diff5 <- MLr$R5_not-MLr$not_ratios
MLr$that_diff1 <- MLr$R1_that-MLr$that_ratios
MLr$that_diff2 <- MLr$R2_that-MLr$that_ratios
MLr$that_diff3 <- MLr$R3_that-MLr$that_ratios
MLr$that_diff4 <- MLr$R4_that-MLr$that_ratios
MLr$that_diff5 <- MLr$R5_that-MLr$that_ratios
MLr$the_diff1 <- MLr$R1_the-MLr$the_ratios
MLr$the_diff2 <- MLr$R2_the-MLr$the_ratios
MLr$the_diff3 <- MLr$R3_the-MLr$the_ratios
MLr$the_diff4 <- MLr$R4_the-MLr$the_ratios
MLr$the_diff5 <- MLr$R5_the-MLr$the_ratios
MLr$they_diff1 <- MLr$R1_they-MLr$they_ratios
MLr$they_diff2 <- MLr$R2_they-MLr$they_ratios
MLr$they_diff3 <- MLr$R3_they-MLr$they_ratios
MLr$they_diff4 <- MLr$R4_they-MLr$they_ratios
MLr$they_diff5 <- MLr$R5_they-MLr$they_ratios
MLr$this_diff1 <- MLr$R1_this-MLr$this_ratios
MLr$this_diff2 <- MLr$R2_this-MLr$this_ratios
MLr$this_diff3 <- MLr$R3_this-MLr$this_ratios
MLr$this_diff4 <- MLr$R4_this-MLr$this_ratios
MLr$this_diff5 <- MLr$R5_this-MLr$this_ratios
MLr$with_diff1 <- MLr$R1_with-MLr$with_ratios
MLr$with_diff2 <- MLr$R2_with-MLr$with_ratios
MLr$with_diff3 <- MLr$R3_with-MLr$with_ratios
MLr$with_diff4 <- MLr$R4_with-MLr$with_ratios
MLr$with_diff5 <- MLr$R5_with-MLr$with_ratios
MLr$you_diff1 <- MLr$R1_you-MLr$you_ratios
MLr$you_diff2 <- MLr$R2_you-MLr$you_ratios
MLr$you_diff3 <- MLr$R3_you-MLr$you_ratios
MLr$you_diff4 <- MLr$R4_you-MLr$you_ratios
MLr$you_diff5 <- MLr$R5_you-MLr$you_ratios
MLr$andMin <- apply(MLr[74:78],1, min)
MLr$andvote <- ifelse(MLr$R1_and==MLr$andMin,
1,
ifelse(MLr$R2_and==MLr$andMin,
2,
ifelse(MLr$R3_and==MLr$andMin,
3,
ifelse(MLr$R4_and==MLr$andMin,
4,
5)
)
)
)
MLr$butMin <- apply(MLr[79:83],1, min)
MLr$butvote <- ifelse(MLr$R1_but==MLr$butMin,
1,
ifelse(MLr$R2_but==MLr$butMin,
2,
ifelse(MLr$R3_but==MLr$butMin,
3,
ifelse(MLr$R4_but==MLr$butMin,
4,
5)
)
)
)
MLr$forMin <- apply(MLr[84:88],1, min)
MLr$forvote <- ifelse(MLr$R1_for==MLr$forMin,
1,
ifelse(MLr$R2_for==MLr$forMin,
2,
ifelse(MLr$R3_for==MLr$forMin,
3,
ifelse(MLr$R4_for==MLr$forMin,
4,
5)
)
)
)
MLr$goodMin <- apply(MLr[89:93],1, min)
MLr$goodvote <- ifelse(MLr$R1_good==MLr$goodMin,
1,
ifelse(MLr$R2_good==MLr$goodMin,
2,
ifelse(MLr$R3_good==MLr$goodMin,
3,
ifelse(MLr$R4_good==MLr$goodMin,
4,
5)
)
)
)
MLr$haveMin <- apply(MLr[94:98],1, min)
MLr$havevote <- ifelse(MLr$R1_have==MLr$haveMin,
1,
ifelse(MLr$R2_have==MLr$haveMin,
2,
ifelse(MLr$R3_have==MLr$haveMin,
3,
ifelse(MLr$R4_have==MLr$haveMin,
4,
5)
)
)
)
MLr$notMin <- apply(MLr[99:103],1, min)
MLr$notvote <- ifelse(MLr$R1_not==MLr$notMin,
1,
ifelse(MLr$R2_not==MLr$notMin,
2,
ifelse(MLr$R3_not==MLr$notMin,
3,
ifelse(MLr$R4_not==MLr$notMin,
4,
5)
)
)
)
MLr$thatMin <- apply(MLr[104:108],1, min)
MLr$thatvote <- ifelse(MLr$R1_that==MLr$thatMin,
1,
ifelse(MLr$R2_that==MLr$thatMin,
2,
ifelse(MLr$R3_that==MLr$thatMin,
3,
ifelse(MLr$R4_that==MLr$thatMin,
4,
5)
)
)
)
MLr$theMin <- apply(MLr[109:113],1, min)
MLr$thevote <- ifelse(MLr$R1_the==MLr$theMin,
1,
ifelse(MLr$R2_the==MLr$theMin,
2,
ifelse(MLr$R3_the==MLr$theMin,
3,
ifelse(MLr$R4_the==MLr$theMin,
4,
5)
)
)
)
MLr$theyMin <- apply(MLr[114:118],1, min)
MLr$theyvote <- ifelse(MLr$R1_they==MLr$theyMin,
1,
ifelse(MLr$R2_they==MLr$theyMin,
2,
ifelse(MLr$R3_they==MLr$theyMin,
3,
ifelse(MLr$R4_they==MLr$theyMin,
4,
5)
)
)
)
MLr$thisMin <- apply(MLr[119:123],1, min)
MLr$thisvote <- ifelse(MLr$R1_this==MLr$thisMin,
1,
ifelse(MLr$R2_this==MLr$thisMin,
2,
ifelse(MLr$R3_this==MLr$thisMin,
3,
ifelse(MLr$R4_this==MLr$thisMin,
4,
5)
)
)
)
MLr$withMin <- apply(MLr[124:128],1, min)
MLr$withvote <- ifelse(MLr$R1_with==MLr$withMin,
1,
ifelse(MLr$R2_with==MLr$withMin,
2,
ifelse(MLr$R3_with==MLr$withMin,
3,
ifelse(MLr$R4_with==MLr$withMin,
4,
5)
)
)
)
MLr$youMin <- apply(MLr[129:133],1, min)
MLr$youvote <- ifelse(MLr$R1_you==MLr$youMin,
1,
ifelse(MLr$R2_you==MLr$youMin,
2,
ifelse(MLr$R3_you==MLr$youMin,
3,
ifelse(MLr$R4_you==MLr$youMin,
4,
5)
)
)
)
bestVote <- MLr %>% select(andvote,
butvote, forvote,
goodvote, havevote,
notvote, thatvote, thevote,theyvote,
thisvote,withvote,
youvote)
bestVote$andvote <- as.factor(paste(bestVote$andvote))
bestVote$butvote <- as.factor(paste(bestVote$butvote))
bestVote$forvote <- as.factor(paste(bestVote$forvote))
bestVote$goodvote <- as.factor(paste(bestVote$goodvote))
bestVote$havevote <- as.factor(paste(bestVote$havevote))
bestVote$notvote <- as.factor(paste(bestVote$notvote))
bestVote$thatvote <- as.factor(paste(bestVote$thatvote))
bestVote$thevote <- as.factor(paste(bestVote$thevote))
bestVote$theyvote <- as.factor(paste(bestVote$theyvote))
bestVote$thisvote <- as.factor(paste(bestVote$thisvote))
bestVote$withvote <- as.factor(paste(bestVote$withvote))
bestVote$youvote <- as.factor(paste(bestVote$youvote))
bestVote$counts1 <- 0
bestVote$counts2 <- 0
bestVote$counts3 <- 0
bestVote$counts4 <- 0
bestVote$counts5 <- 0
a5 <- grep('5',bestVote$andvote)
a4 <- grep('4', bestVote$andvote)
a3 <- grep('3',bestVote$andvote)
a2 <- grep('2',bestVote$andvote)
a1 <- grep('1',bestVote$andvote)
b5 <- grep('5',bestVote$butvote)
b4 <- grep('4', bestVote$butvote)
b3 <- grep('3',bestVote$butvote)
b2 <- grep('2',bestVote$butvote)
b1 <- grep('1',bestVote$butvote)
c5 <- grep('5',bestVote$forvote)
c4 <- grep('4', bestVote$forvote)
c3 <- grep('3',bestVote$forvote)
c2 <- grep('2',bestVote$forvote)
c1 <- grep('1',bestVote$forvote)
d5 <- grep('5',bestVote$goodvote)
d4 <- grep('4', bestVote$goodvote)
d3 <- grep('3',bestVote$goodvote)
d2 <- grep('2',bestVote$goodvote)
d1 <- grep('1',bestVote$goodvote)
e5 <- grep('5',bestVote$havevote)
e4 <- grep('4', bestVote$havevote)
e3 <- grep('3',bestVote$havevote)
e2 <- grep('2',bestVote$havevote)
e1 <- grep('1',bestVote$havevote)
f5 <- grep('5',bestVote$notvote)
f4 <- grep('4', bestVote$notvote)
f3 <- grep('3',bestVote$notvote)
f2 <- grep('2',bestVote$notvote)
f1 <- grep('1',bestVote$notvote)
g5 <- grep('5',bestVote$thatvote)
g4 <- grep('4', bestVote$thatvote)
g3 <- grep('3',bestVote$thatvote)
g2 <- grep('2',bestVote$thatvote)
g1 <- grep('1',bestVote$thatvote)
h5 <- grep('5',bestVote$thevote)
h4 <- grep('4', bestVote$thevote)
h3 <- grep('3',bestVote$thevote)
h2 <- grep('2',bestVote$thevote)
h1 <- grep('1',bestVote$thevote)
i5 <- grep('5',bestVote$theyvote)
i4 <- grep('4', bestVote$theyvote)
i3 <- grep('3',bestVote$theyvote)
i2 <- grep('2',bestVote$theyvote)
i1 <- grep('1',bestVote$theyvote)
j5 <- grep('5',bestVote$thisvote)
j4 <- grep('4', bestVote$thisvote)
j3 <- grep('3',bestVote$thisvote)
j2 <- grep('2',bestVote$thisvote)
j1 <- grep('1',bestVote$thisvote)
k5 <- grep('5',bestVote$withvote)
k4 <- grep('4', bestVote$withvote)
k3 <- grep('3',bestVote$withvote)
k2 <- grep('2',bestVote$withvote)
k1 <- grep('1',bestVote$withvote)
l5 <- grep('5',bestVote$youvote)
l4 <- grep('4', bestVote$youvote)
l3 <- grep('3',bestVote$youvote)
l2 <- grep('2',bestVote$youvote)
l1 <- grep('1',bestVote$youvote)
bestVote$counts1[l1] <- bestVote$counts1[l1]+ 1
bestVote$counts1[k1] <- bestVote$counts1[k1]+ 1
bestVote$counts1[j1] <- bestVote$counts1[j1]+ 1
bestVote$counts1[i1] <- bestVote$counts1[i1]+ 1
bestVote$counts1[h1] <- bestVote$counts1[h1]+ 1
bestVote$counts1[g1] <- bestVote$counts1[g1]+ 1
bestVote$counts1[f1] <- bestVote$counts1[f1]+ 1
bestVote$counts1[e1] <- bestVote$counts1[e1]+ 1
bestVote$counts1[d1] <- bestVote$counts1[d1]+ 1
bestVote$counts1[c1] <- bestVote$counts1[c1]+ 1
bestVote$counts1[b1] <- bestVote$counts1[b1]+ 1
bestVote$counts1[a1] <- bestVote$counts1[a1]+ 1
bestVote$counts2[l2] <- bestVote$counts2[l2] + 1
bestVote$counts2[k2] <- bestVote$counts2[k2] + 1
bestVote$counts2[j2] <- bestVote$counts2[j2] + 1
bestVote$counts2[i2] <- bestVote$counts2[i2] + 1
bestVote$counts2[h2] <- bestVote$counts2[h2] + 1
bestVote$counts2[g2] <- bestVote$counts2[g2] + 1
bestVote$counts2[f2] <- bestVote$counts2[f2] + 1
bestVote$counts2[e2] <- bestVote$counts2[e2] + 1
bestVote$counts2[d2] <- bestVote$counts2[d2] + 1
bestVote$counts2[c2] <- bestVote$counts2[c2] + 1
bestVote$counts2[b2] <- bestVote$counts2[b2] + 1
bestVote$counts2[a2] <- bestVote$counts2[a2] + 1
bestVote$counts3[l3] <- bestVote$counts3[l3] + 1
bestVote$counts3[k3] <- bestVote$counts3[k3] + 1
bestVote$counts3[j3] <- bestVote$counts3[j3] + 1
bestVote$counts3[i3] <- bestVote$counts3[i3] + 1
bestVote$counts3[h3] <- bestVote$counts3[h3] + 1
bestVote$counts3[g3] <- bestVote$counts3[g3] + 1
bestVote$counts3[f3] <- bestVote$counts3[f3] + 1
bestVote$counts3[e3] <- bestVote$counts3[e3] + 1
bestVote$counts3[d3] <- bestVote$counts3[d3] + 1
bestVote$counts3[c3] <- bestVote$counts3[c3] + 1
bestVote$counts3[b3] <- bestVote$counts3[b3] + 1
bestVote$counts3[a3] <- bestVote$counts3[a3] + 1
bestVote$counts4[l4] <- bestVote$counts4[l4] + 1
bestVote$counts4[k4] <- bestVote$counts4[k4] + 1
bestVote$counts4[j4] <- bestVote$counts4[j4] + 1
bestVote$counts4[i4] <- bestVote$counts4[i4] + 1
bestVote$counts4[h4] <- bestVote$counts4[h4] + 1
bestVote$counts4[g4] <- bestVote$counts4[g4] + 1
bestVote$counts4[f4] <- bestVote$counts4[f4] + 1
bestVote$counts4[e4] <- bestVote$counts4[e4] + 1
bestVote$counts4[d4] <- bestVote$counts4[d4] + 1
bestVote$counts4[c4] <- bestVote$counts4[c4] + 1
bestVote$counts4[b4] <- bestVote$counts4[b4] + 1
bestVote$counts4[a4] <- bestVote$counts4[a4] + 1
bestVote$counts5[l5] <- bestVote$counts5[l5] + 1
bestVote$counts5[k5] <- bestVote$counts5[k5] + 1
bestVote$counts5[j5] <- bestVote$counts5[j5] + 1
bestVote$counts5[i5] <- bestVote$counts5[i5] + 1
bestVote$counts5[h5] <- bestVote$counts5[h5] + 1
bestVote$counts5[g5] <- bestVote$counts5[g5] + 1
bestVote$counts5[f5] <- bestVote$counts5[f5] + 1
bestVote$counts5[e5] <- bestVote$counts5[e5] + 1
bestVote$counts5[d5] <- bestVote$counts5[d5] + 1
bestVote$counts5[c5] <- bestVote$counts5[c5] + 1
bestVote$counts5[b5] <- bestVote$counts5[b5] + 1
bestVote$counts5[a5] <- bestVote$counts5[a5] + 1
bestVote$maxVote <- apply(bestVote[13:17],1,max)
mv <- bestVote$maxVote
ct1 <- bestVote$counts1
ct2 <- bestVote$counts2
ct3 <- bestVote$counts3
ct4 <- bestVote$counts4
ct5 <- bestVote$counts5
bestVote$votedRating <- ifelse(mv==ct1, 1,
ifelse(mv==ct2, 2,
ifelse(mv==ct3, 3,
ifelse(mv==ct4, 4, 5))))
bestVote$Rating <- ifelse(mv==ct1 & (mv==ct2|mv==ct3|mv==ct4|mv==ct5),'tie',
ifelse(mv==ct1,1,
ifelse(mv==ct2 & (mv==ct3|mv==ct4|mv==ct5),'tie',
ifelse(mv==ct2, 2,
ifelse(mv==ct3 &(mv==ct4|mv==ct5),'tie',
ifelse(mv==ct3, 3,
ifelse(mv==ct4 & mv==ct5, 'tie',
ifelse(mv==ct4, 4,5
)))))))
)
bestVote$finalPrediction <- ifelse(bestVote$Rating=='tie',
ifelse(ceiling((c(ct1,ct2,ct3,ct4,ct5)*c(1,2,3,4,5))/5) > 5,
5, ceiling((c(ct1,ct2,ct3,ct4,ct5)*c(1,2,3,4,5))/5)),
bestVote$votedRating )
Now that we have our final prediction with this algorithm. Lets attach these two tables together and rearrange the columns.
MLr2 <- cbind(MLr, bestVote)
MLr3 <- MLr2[,c(2:178,1)]
MLr3$CorrectPrediction <- ifelse(MLr3$finalPrediction==MLr3$userRatingValue,
1,0)
MLr3$finalPrediction <- as.factor(paste(MLr3$finalPrediction))
MLr3$userRatingValue <- paste('rating ', MLr3$userRatingValue,sep='')
Accuracy <- sum(MLr3$CorrectPrediction)/length(MLr3$CorrectPrediction)
Accuracy
## [1] 0.5439739
The first bunch of reviews seem to be very accurate, but then more reviews get less accurate.
MLr3[1:20,175:179]
## votedRating Rating finalPrediction userRatingValue CorrectPrediction
## 1 5 5 5 rating 5 1
## 2 5 5 5 rating 5 1
## 3 5 5 5 rating 5 1
## 4 5 5 5 rating 1 0
## 5 5 5 5 rating 5 1
## 6 5 5 5 rating 5 1
## 7 5 5 5 rating 5 1
## 8 5 5 5 rating 5 1
## 9 5 5 5 rating 5 1
## 10 5 5 5 rating 5 1
## 11 5 5 5 rating 5 1
## 12 5 5 5 rating 5 1
## 13 1 tie 3 rating 5 0
## 14 5 5 5 rating 5 1
## 15 5 5 5 rating 4 0
## 16 5 5 5 rating 1 0
## 17 5 5 5 rating 1 0
## 18 5 5 5 rating 5 1
## 19 5 5 5 rating 5 1
## 20 5 5 5 rating 5 1
This was a good approach to building a model from the ground up and developing algorithms that seem like they could be good models to make predictions with. This was a straight algorithm model as in a function you put something in and get an answer out. The caret package has other more heavily used and industry related models tht are used frequently and can be tune. We will look at those later.
Lets write this table out to csv.
write.csv(MLr3, 'ML_Reviews614_resultsTable.csv', row.names=FALSE)
Lets look at those values that the correct prediction was false.
FalsePredictions <- subset(MLr3, MLr3$CorrectPrediction==0)
head(FalsePredictions[,175:179],30)
## votedRating Rating finalPrediction userRatingValue CorrectPrediction
## 4 5 5 5 rating 1 0
## 13 1 tie 3 rating 5 0
## 15 5 5 5 rating 4 0
## 16 5 5 5 rating 1 0
## 17 5 5 5 rating 1 0
## 25 5 5 5 rating 4 0
## 29 5 5 5 rating 4 0
## 35 5 5 5 rating 4 0
## 39 5 5 5 rating 4 0
## 41 5 5 5 rating 1 0
## 52 1 tie 2 rating 5 0
## 53 5 5 5 rating 4 0
## 57 5 5 5 rating 4 0
## 58 5 5 5 rating 2 0
## 65 5 5 5 rating 1 0
## 66 5 5 5 rating 1 0
## 68 5 5 5 rating 1 0
## 69 5 5 5 rating 3 0
## 70 5 5 5 rating 1 0
## 72 5 5 5 rating 1 0
## 74 5 5 5 rating 4 0
## 75 5 5 5 rating 3 0
## 77 5 5 5 rating 1 0
## 80 5 5 5 rating 1 0
## 82 5 5 5 rating 4 0
## 83 5 5 5 rating 4 0
## 84 5 5 5 rating 2 0
## 85 5 5 5 rating 1 0
## 88 1 tie 3 rating 1 0
## 89 5 5 5 rating 1 0
Many times the voted rating was not the rating even if it wasn’t a tie. The 2-4 ratings were most of the incorrect predictions and the rating as being either a 5 or a 1 were more 50/50 probably because their ratios of term to total terms were very close. Some tuning that could be done would be to select different keywords, see if there is a difference between type of business or cost of service or goods. We left in these words of which most are included in stopwords like by,my,has,have,etc.that are personal pronouns or of the text mining tm package.
#library(tm)
stop <- stopwords()
stop
## [1] "i" "me" "my" "myself" "we"
## [6] "our" "ours" "ourselves" "you" "your"
## [11] "yours" "yourself" "yourselves" "he" "him"
## [16] "his" "himself" "she" "her" "hers"
## [21] "herself" "it" "its" "itself" "they"
## [26] "them" "their" "theirs" "themselves" "what"
## [31] "which" "who" "whom" "this" "that"
## [36] "these" "those" "am" "is" "are"
## [41] "was" "were" "be" "been" "being"
## [46] "have" "has" "had" "having" "do"
## [51] "does" "did" "doing" "would" "should"
## [56] "could" "ought" "i'm" "you're" "he's"
## [61] "she's" "it's" "we're" "they're" "i've"
## [66] "you've" "we've" "they've" "i'd" "you'd"
## [71] "he'd" "she'd" "we'd" "they'd" "i'll"
## [76] "you'll" "he'll" "she'll" "we'll" "they'll"
## [81] "isn't" "aren't" "wasn't" "weren't" "hasn't"
## [86] "haven't" "hadn't" "doesn't" "don't" "didn't"
## [91] "won't" "wouldn't" "shan't" "shouldn't" "can't"
## [96] "cannot" "couldn't" "mustn't" "let's" "that's"
## [101] "who's" "what's" "here's" "there's" "when's"
## [106] "where's" "why's" "how's" "a" "an"
## [111] "the" "and" "but" "if" "or"
## [116] "because" "as" "until" "while" "of"
## [121] "at" "by" "for" "with" "about"
## [126] "against" "between" "into" "through" "during"
## [131] "before" "after" "above" "below" "to"
## [136] "from" "up" "down" "in" "out"
## [141] "on" "off" "over" "under" "again"
## [146] "further" "then" "once" "here" "there"
## [151] "when" "where" "why" "how" "all"
## [156] "any" "both" "each" "few" "more"
## [161] "most" "other" "some" "such" "no"
## [166] "nor" "not" "only" "own" "same"
## [171] "so" "than" "too" "very"
The above are the stopwords that are eliminated within text cleaning and preprocessing before running or retrieving a document term matrix for an observation such as a review. Now lets look at are keywords.
keywordsUsed <- colnames(wordToAllWords)
keywordsUsed
## [1] "and" "but" "for_" "good" "have" "not" "that" "the" "they" "this"
## [11] "with" "you"
The above shows our keywords, and note that for is for_ because it errors in R as it is a programming keyword so it was altered in the column name with an appended underscore character, but the search for it as a character or factor is ‘for’ not ‘for_’. We actually see that all 12 of our stopwords are keywords. We briefly explained that this is because in grammar and composition to write a persuasive stories many connections have to be made with the use of pronouns and actions and states of being and descriptors. These were used as features by count to see if they added any value to predicting a review. We do see that these stop words are used a lot for extreme ends of the rating scale as a 1 for the lowest and 5 for the highest review rating.
Lets get a count of each rating by incorrect classification.
userRatings <- FalsePredictions %>% group_by(userRatingValue) %>% count(finalPrediction)
userRatings
## # A tibble: 11 x 3
## # Groups: userRatingValue [5]
## userRatingValue finalPrediction n
## <chr> <fct> <int>
## 1 rating 1 3 1
## 2 rating 1 5 85
## 3 rating 2 5 34
## 4 rating 3 4 2
## 5 rating 3 5 51
## 6 rating 4 2 1
## 7 rating 4 3 1
## 8 rating 4 5 99
## 9 rating 5 2 2
## 10 rating 5 3 1
## 11 rating 5 4 3
We can see from the userRatingValue and the final prediction, that there were 85 instances where this model couldn’t distinguish between a 1 or 5 rating based on the keywords (mostly stopwords). But also those ratings that were 4s were actually classified the most incorrectly as a 5 because some people don’t readily give 5s as others. Also the actual 2s that were classified incorrectly were scaled up to 5s in the error, the same with the 3s. The gray areas were in user ratings of 2-4, where few were classified as a value in a 2-4 that was incorrect. If we instead said to classify any 2-4 as in the range of 2-4 without penalizing the exact value, then the prediction accuracy would be better. But people have their own reasons for giving 2-4s in ratings. Like they had better or there was an absolute worst experience the company still doesn’t meet because he or she knows it could be worse and it could be better. Companies are supposed to use this to improve or adjust needs of consumers. But at the same time some users are just upset with the cost or time wasted or spent when other options they know of were or are better.
Lets look at those predictions with a tie.
ties <- subset(MLr3, MLr3$Rating=='tie')
ties[,175:179]
## votedRating Rating finalPrediction userRatingValue CorrectPrediction
## 13 1 tie 3 rating 5 0
## 52 1 tie 2 rating 5 0
## 88 1 tie 3 rating 1 0
## 119 1 tie 4 rating 4 1
## 182 1 tie 2 rating 5 0
## 195 1 tie 4 rating 5 0
## 197 1 tie 2 rating 4 0
## 290 1 tie 4 rating 5 0
## 345 1 tie 4 rating 5 0
## 440 1 tie 4 rating 4 1
## 460 1 tie 4 rating 3 0
## 498 1 tie 3 rating 4 0
## 511 1 tie 1 rating 1 1
## 518 1 tie 3 rating 3 1
## 525 1 tie 4 rating 3 0
## 586 1 tie 1 rating 1 1
There weren’t a lot of ties on votes for the minimum difference of review to all reviews by ratios of term to total terms.
tieGroups <- ties %>% group_by(userRatingValue) %>% count(finalPrediction)
tieGroups
## # A tibble: 10 x 3
## # Groups: userRatingValue [4]
## userRatingValue finalPrediction n
## <chr> <fct> <int>
## 1 rating 1 1 2
## 2 rating 1 3 1
## 3 rating 3 3 1
## 4 rating 3 4 2
## 5 rating 4 2 1
## 6 rating 4 3 1
## 7 rating 4 4 2
## 8 rating 5 2 2
## 9 rating 5 3 1
## 10 rating 5 4 3
Note that these tie ratings are both correct and incorrect. From the above grouped information of counts of user ratings by final predictions all the 5s were misclassified or predicted to be in the gray area of 2-4 and more of the 1s were classified correctly except for one review in the middle gray area. None of the 2s are in our list of ties but many of the gray area 2-4s were classified into the gray area of incorrect 2-4s such as the 4s and 3s.
We could use a link analysis to see how these results compared with the final predicted value and actual rating value as groups. With the nodes as the user rating, and edges as the final prediction. We should also add in one of either the business type or the cost. First lets combine the Reviews15 table with the MLr3 table of only the results.
MLr4 <- MLr3 %>% select(maxVote:CorrectPrediction)
MLr4$actualRatingValue <- MLr4$userRatingValue
MLr5 <- MLr4 %>% select(-userRatingValue)
Reviews15_results <- cbind(Reviews15, MLr5)
colnames(Reviews15_results)
## [1] "id" "userReviewSeries" "userReviewOnlyContent"
## [4] "userRatingSeries" "userRatingValue" "businessReplied"
## [7] "businessReplyContent" "userReviewContent" "LowAvgHighCost"
## [10] "businessType" "cityState" "friends"
## [13] "reviews" "photos" "eliteStatus"
## [16] "userName" "Date" "userBusinessPhotos"
## [19] "userCheckIns" "weekday" "the"
## [22] "and" "for." "have"
## [25] "that" "they" "this"
## [28] "you" "not" "but"
## [31] "good" "with" "the_ratios"
## [34] "and_ratios" "for_ratios" "have_ratios"
## [37] "that_ratios" "they_ratios" "this_ratios"
## [40] "you_ratios" "not_ratios" "but_ratios"
## [43] "good_ratios" "with_ratios" "maxVote"
## [46] "votedRating" "Rating" "finalPrediction"
## [49] "CorrectPrediction" "actualRatingValue"
Lets keep the ratios because this table minus results (and the review content columns) just added will be useful when running the caret algorithms. For now, lets write this out to csv to easily read in later instead of running whatever chunks preceded this table to get it.
write.csv(Reviews15_results, 'Reviews15_results.csv', row.names=FALSE)
Now lets use visNetwork to group by CorrectPrediction and map the actualRatingValue to the predicted values as well as add a hovering feature as the title column in our nodes table of the businessType. We have to pick a width for the weighted arrows, or else there won’t be any edges. Lets pick the not_ratios.
network <- Reviews15_results %>% select(businessType,
finalPrediction:actualRatingValue, not,
not_ratios)
network$finalPrediction <- paste('predicted',network$finalPrediction,sep=' ')
network$CorrectPrediction <- gsub(1,'TRUE', network$CorrectPrediction)
network$CorrectPrediction <- gsub(0,'FALSE', network$CorrectPrediction)
head(network,20)
## businessType finalPrediction CorrectPrediction actualRatingValue
## 1 high end massage retreat predicted 5 TRUE rating 5
## 2 chiropractic predicted 5 TRUE rating 5
## 3 chiropractic predicted 5 TRUE rating 5
## 4 high end massage retreat predicted 5 FALSE rating 1
## 5 chiropractic predicted 5 TRUE rating 5
## 6 chiropractic predicted 5 TRUE rating 5
## 7 chiropractic predicted 5 TRUE rating 5
## 8 chiropractic predicted 5 TRUE rating 5
## 9 chiropractic predicted 5 TRUE rating 5
## 10 chiropractic predicted 5 TRUE rating 5
## 11 chiropractic predicted 5 TRUE rating 5
## 12 chiropractic predicted 5 TRUE rating 5
## 13 chiropractic predicted 3 FALSE rating 5
## 14 chiropractic predicted 5 TRUE rating 5
## 15 chiropractic predicted 5 FALSE rating 4
## 16 chiropractic predicted 5 FALSE rating 1
## 17 high end massage retreat predicted 5 FALSE rating 1
## 18 chiropractic predicted 5 TRUE rating 5
## 19 chiropractic predicted 5 TRUE rating 5
## 20 chiropractic predicted 5 TRUE rating 5
## not not_ratios
## 1 1 0.004
## 2 0 0.000
## 3 0 0.000
## 4 3 0.014
## 5 0 0.000
## 6 0 0.000
## 7 1 0.013
## 8 0 0.000
## 9 0 0.000
## 10 0 0.000
## 11 0 0.000
## 12 0 0.000
## 13 0 0.000
## 14 0 0.000
## 15 1 0.006
## 16 0 0.000
## 17 3 0.014
## 18 1 0.009
## 19 0 0.000
## 20 0 0.000
The nodes table will include all of the above except the not_ratios.
nodes <- network
nodes$id <- row.names(nodes)
nodes$title <- nodes$businessType
nodes$label <- nodes$actualRatingValue
nodes$group <- nodes$CorrectPrediction
nodes1 <- nodes %>% select(id, label, title,group,finalPrediction)
head(nodes1,10)
## id label title group finalPrediction
## 1 1 rating 5 high end massage retreat TRUE predicted 5
## 2 2 rating 5 chiropractic TRUE predicted 5
## 3 3 rating 5 chiropractic TRUE predicted 5
## 4 4 rating 1 high end massage retreat FALSE predicted 5
## 5 5 rating 5 chiropractic TRUE predicted 5
## 6 6 rating 5 chiropractic TRUE predicted 5
## 7 7 rating 5 chiropractic TRUE predicted 5
## 8 8 rating 5 chiropractic TRUE predicted 5
## 9 9 rating 5 chiropractic TRUE predicted 5
## 10 10 rating 5 chiropractic TRUE predicted 5
The edges will have the finalPrediction and actualRatingValue columns.
edges <- network %>% select(finalPrediction,actualRatingValue,not,not_ratios)
edges$label <- edges$finalPrediction
edges$weight <- edges$not_ratios
edges$width <- edges$not
edges1 <- edges %>% mutate(from = plyr::mapvalues(edges$actualRatingValue,
from = nodes$label,
to = nodes$id)
)
edges2 <- edges1 %>% mutate(to = plyr::mapvalues(edges$finalPrediction,
from = nodes$finalPrediction,
to = nodes$id)
)
edges3 <- edges2 %>% select(from,to,label,width,weight)
head(edges3,10)
## from to label width weight
## 1 1 1 predicted 5 1 0.004
## 2 1 1 predicted 5 0 0.000
## 3 1 1 predicted 5 0 0.000
## 4 4 1 predicted 5 3 0.014
## 5 1 1 predicted 5 0 0.000
## 6 1 1 predicted 5 0 0.000
## 7 1 1 predicted 5 1 0.013
## 8 1 1 predicted 5 0 0.000
## 9 1 1 predicted 5 0 0.000
## 10 1 1 predicted 5 0 0.000
Lets see the link analysis visualization.
visNetwork(nodes=nodes1, edges=edges3, main='Grouped Predictions of True or False Ratings') %>% visEdges(arrows=c('from','middle')) %>%
visInteraction(navigationButtons=TRUE, dragNodes=TRUE,
dragView=TRUE, zoomView = TRUE) %>%
visOptions(nodesIdSelection = TRUE, manipulation=FALSE) %>%
visIgraphLayout() %>%
visLegend
The above shows a visual network that looks mostly half and half by true versus false predicted rating values.None of the edges show except for that tiny cluster above, that shows the predicted ratings.
Lets group by business type now and map the actual to correctly predicted
nodes <- network
nodes$id <- row.names(nodes)
nodes$group <- nodes$businessType
nodes$label <- nodes$actualRatingValue
nodes$title <- nodes$CorrectPrediction
nodes1 <- nodes %>% select(id, label, title,group,finalPrediction)
head(nodes1,10)
## id label title group finalPrediction
## 1 1 rating 5 TRUE high end massage retreat predicted 5
## 2 2 rating 5 TRUE chiropractic predicted 5
## 3 3 rating 5 TRUE chiropractic predicted 5
## 4 4 rating 1 FALSE high end massage retreat predicted 5
## 5 5 rating 5 TRUE chiropractic predicted 5
## 6 6 rating 5 TRUE chiropractic predicted 5
## 7 7 rating 5 TRUE chiropractic predicted 5
## 8 8 rating 5 TRUE chiropractic predicted 5
## 9 9 rating 5 TRUE chiropractic predicted 5
## 10 10 rating 5 TRUE chiropractic predicted 5
The edges will have the finalPrediction and actualRatingValue columns.
edges <- network %>% select(finalPrediction,actualRatingValue,not,not_ratios)
edges$label <- edges$finalPrediction
edges$weight <- edges$not_ratios
edges$width <- edges$not
edges1 <- edges %>% mutate(from = plyr::mapvalues(edges$actualRatingValue,
from = nodes$label,
to = nodes$id)
)
edges2 <- edges1 %>% mutate(to = plyr::mapvalues(edges$finalPrediction,
from = nodes$finalPrediction,
to = nodes$id)
)
edges3 <- edges2 %>% select(from,to,label,width, weight)
head(edges3,10)
## from to label width weight
## 1 1 1 predicted 5 1 0.004
## 2 1 1 predicted 5 0 0.000
## 3 1 1 predicted 5 0 0.000
## 4 4 1 predicted 5 3 0.014
## 5 1 1 predicted 5 0 0.000
## 6 1 1 predicted 5 0 0.000
## 7 1 1 predicted 5 1 0.013
## 8 1 1 predicted 5 0 0.000
## 9 1 1 predicted 5 0 0.000
## 10 1 1 predicted 5 0 0.000
Lets see the link analysis visualization.
visNetwork(nodes=nodes1, edges=edges3, main='Grouped Predictions of Business Type Ratings with True or False and Actual Rating') %>% visEdges(arrows=c('from','middle')) %>%
visInteraction(navigationButtons=TRUE, dragNodes=TRUE,
dragView=TRUE, zoomView = TRUE) %>%
visOptions(nodesIdSelection = TRUE, manipulation=FALSE) %>%
visIgraphLayout() %>%
visLegend
Lets group by business type now and map the business type to the prediction.
nodes <- network
nodes$id <- row.names(nodes)
nodes$group <- nodes$businessType
nodes$label <- nodes$businessType
nodes$title <- nodes$CorrectPrediction
nodes1 <- nodes %>% select(id, label,group,finalPrediction, title)
head(nodes1,10)
## id label group finalPrediction title
## 1 1 high end massage retreat high end massage retreat predicted 5 TRUE
## 2 2 chiropractic chiropractic predicted 5 TRUE
## 3 3 chiropractic chiropractic predicted 5 TRUE
## 4 4 high end massage retreat high end massage retreat predicted 5 FALSE
## 5 5 chiropractic chiropractic predicted 5 TRUE
## 6 6 chiropractic chiropractic predicted 5 TRUE
## 7 7 chiropractic chiropractic predicted 5 TRUE
## 8 8 chiropractic chiropractic predicted 5 TRUE
## 9 9 chiropractic chiropractic predicted 5 TRUE
## 10 10 chiropractic chiropractic predicted 5 TRUE
edges <- network %>% select(CorrectPrediction,actualRatingValue,not,businessType,
finalPrediction)
edges$label <- edges$finalPrediction
#edges$weight <- edges$not_ratios
edges$width <- edges$not
edges1 <- edges %>% mutate(from = plyr::mapvalues(edges$businessType,
from = nodes$label,
to = nodes$id)
)
edges2 <- edges1 %>% mutate(to = plyr::mapvalues(edges$CorrectPrediction,
from = nodes$title,
to = nodes$id)
)
edges3 <- edges2 %>% select(from,to,label,width)
head(edges3,10)
## from to label width
## 1 1 1 predicted 5 1
## 2 2 1 predicted 5 0
## 3 2 1 predicted 5 0
## 4 1 4 predicted 5 3
## 5 2 1 predicted 5 0
## 6 2 1 predicted 5 0
## 7 2 1 predicted 5 1
## 8 2 1 predicted 5 0
## 9 2 1 predicted 5 0
## 10 2 1 predicted 5 0
Lets see the link analysis visualization with the star layout.
visNetwork(nodes=nodes1, edges=edges3, main='Grouped Predictions by Business Type') %>% visEdges(arrows=c('from','middle')) %>%
visInteraction(navigationButtons=TRUE, dragNodes=TRUE,
dragView=TRUE, zoomView = TRUE) %>%
visOptions(nodesIdSelection = TRUE, manipulation=FALSE) %>%
visIgraphLayout(layout='layout.star') %>%
visLegend
The above is a hula hoop because there are not any links for the most part, and only a handfule of the true or false predictions are mapping from business type to other business types by other business types predicted true or false. you can click on a node drag it off the disk and see no links, but click on the background to stop dragging. Then do the same with the center nodes that do have links to the disk and see the edges stay attached with the predicted value. Not many business types had the same predicted value as a true or false prediction of the same actual to predicted by business type.
Lets see this design in a grid layout. The sphere is the default and you can see it doesn’t really describe much visually without grouping when there are no links between the nodes.
visNetwork(nodes=nodes1, edges=edges3, main='Grouped Predictions by Business Type') %>% visEdges(arrows=c('from','middle')) %>%
visInteraction(navigationButtons=TRUE, dragNodes=TRUE,
dragView=TRUE, zoomView = TRUE) %>%
visOptions(nodesIdSelection = TRUE, manipulation=FALSE) %>%
visIgraphLayout(layout='layout_on_grid') %>%
visLegend
The above is a grid layout that looks like tetris or pac man ping pong machine type layout with the groups.
Lets look out the layout.graphopt layout next.
visNetwork(nodes=nodes1, edges=edges3, main='Grouped Predictions by Business Type') %>% visEdges(arrows=c('from','middle')) %>%
visInteraction(navigationButtons=TRUE, dragNodes=TRUE,
dragView=TRUE, zoomView = TRUE) %>%
visOptions(nodesIdSelection = TRUE, manipulation=FALSE) %>%
visIgraphLayout(layout='layout.graphopt') %>%
visLegend
Looks a lot like the default layout for this data of features. This is the spherical layout that follows.
visNetwork(nodes=nodes1, edges=edges3, main='Grouped Predictions by Business Type') %>% visEdges(arrows=c('from','middle')) %>%
visInteraction(navigationButtons=TRUE, dragNodes=TRUE,
dragView=TRUE, zoomView = TRUE) %>%
visOptions(nodesIdSelection = TRUE, manipulation=FALSE) %>%
visIgraphLayout(layout='layout_on_sphere') %>%
visLegend
It looks similar to the hula hoop star layout for this data with minimal links between nodes.
That was interesting to look at the different layouts in igraph with visNetwork, and to also see ways to improve our model design just as the businesses use reviews to improve or make changes as needed by analyzing the correctly and falsely predicted ratings. The next section will use machine learning algorithms in the caret packageto test out results of various models and defaults as well as test out the attributes within each model for tuning and validating for better generalization.
Lets use the Reviews15_results table, you can read it in if you closed your RStudio session and emptied your environment. The file we saved it as is the ML_Reviews614_resultsTable.csv file.
colnames(Reviews15_results)
## [1] "id" "userReviewSeries" "userReviewOnlyContent"
## [4] "userRatingSeries" "userRatingValue" "businessReplied"
## [7] "businessReplyContent" "userReviewContent" "LowAvgHighCost"
## [10] "businessType" "cityState" "friends"
## [13] "reviews" "photos" "eliteStatus"
## [16] "userName" "Date" "userBusinessPhotos"
## [19] "userCheckIns" "weekday" "the"
## [22] "and" "for." "have"
## [25] "that" "they" "this"
## [28] "you" "not" "but"
## [31] "good" "with" "the_ratios"
## [34] "and_ratios" "for_ratios" "have_ratios"
## [37] "that_ratios" "they_ratios" "this_ratios"
## [40] "you_ratios" "not_ratios" "but_ratios"
## [43] "good_ratios" "with_ratios" "maxVote"
## [46] "votedRating" "Rating" "finalPrediction"
## [49] "CorrectPrediction" "actualRatingValue"
The column features to start with in predicting ratings will be the userRatingValue, businessReplied as a yes or no, LowAvgHighCost, businessType, number of friends, number of reviews, number of photos, number of userBusinessPhotos, weekday, number of userCheckIns, if the user is an eliteStatus, and the stopword ratios.Make sure your libraries are loaded, we will use the caret package to run some analysis on this table of features.
businessRatings <- Reviews15_results %>% select(userRatingValue,businessReplied,
LowAvgHighCost,businessType,
friends ,
reviews , photos , eliteStatus ,
userBusinessPhotos ,
userCheckIns , weekday , the_ratios,
and_ratios , for_ratios , have_ratios ,
that_ratios , they_ratios , this_ratios ,
you_ratios , not_ratios , but_ratios ,
good_ratios , with_ratios )
businessRatings$userRatingValue <- as.factor(paste(businessRatings$userRatingValue))
head(businessRatings)
## userRatingValue businessReplied LowAvgHighCost businessType
## 1 5 yes High high end massage retreat
## 2 5 no Avg chiropractic
## 3 5 no Avg chiropractic
## 4 1 yes High high end massage retreat
## 5 5 no Avg chiropractic
## 6 5 no Avg chiropractic
## friends reviews photos eliteStatus userBusinessPhotos userCheckIns weekday
## 1 26 33 21 <NA> 2 NA Sun
## 2 0 7 NA <NA> NA NA Sun
## 3 943 7 2 <NA> NA 2 Fri
## 4 12 12 4 <NA> NA NA Sat
## 5 11 24 11 <NA> NA 1 Mon
## 6 4 NA NA <NA> NA 27 Thu
## the_ratios and_ratios for_ratios have_ratios that_ratios they_ratios
## 1 0.055 0.018 0.011 0.015 0.015 0.011
## 2 0.000 0.028 0.009 0.009 0.000 0.000
## 3 0.069 0.034 0.000 0.000 0.000 0.000
## 4 0.032 0.027 0.009 0.000 0.000 0.000
## 5 0.080 0.060 0.000 0.000 0.000 0.000
## 6 0.036 0.036 0.024 0.000 0.000 0.024
## this_ratios you_ratios not_ratios but_ratios good_ratios with_ratios
## 1 0.004 0.007 0.004 0.004 0.007 0.000
## 2 0.009 0.000 0.000 0.000 0.000 0.000
## 3 0.000 0.000 0.000 0.000 0.000 0.000
## 4 0.009 0.000 0.014 0.005 0.000 0.005
## 5 0.000 0.000 0.000 0.000 0.000 0.000
## 6 0.000 0.012 0.000 0.000 0.000 0.000
Lets use the numeric fields to predict the target of the ratings.
numRegressions <- businessRatings %>% select(userRatingValue,
friends:photos, userBusinessPhotos,userCheckIns,
the_ratios:with_ratios)
numRegressions$userRatingValue <- as.numeric(paste(numRegressions$userRatingValue))
str(numRegressions)
## 'data.frame': 614 obs. of 18 variables:
## $ userRatingValue : num 5 5 5 1 5 5 5 5 5 5 ...
## $ friends : int 26 0 943 12 11 4 244 10 14 149 ...
## $ reviews : int 33 7 7 12 24 NA 5 52 35 66 ...
## $ photos : int 21 NA 2 4 11 NA NA 38 5 112 ...
## $ userBusinessPhotos: int 2 NA NA NA NA NA NA NA NA NA ...
## $ userCheckIns : int NA NA 2 NA 1 27 NA 1 1 NA ...
## $ the_ratios : num 0.055 0 0.069 0.032 0.08 0.036 0.038 0.016 0.014 0 ...
## $ and_ratios : num 0.018 0.028 0.034 0.027 0.06 0.036 0.063 0.048 0.042 0 ...
## $ for_ratios : num 0.011 0.009 0 0.009 0 0.024 0 0.008 0 0.048 ...
## $ have_ratios : num 0.015 0.009 0 0 0 0 0.025 0 0 0 ...
## $ that_ratios : num 0.015 0 0 0 0 0 0.013 0.016 0.014 0 ...
## $ they_ratios : num 0.011 0 0 0 0 0.024 0 0 0.014 0 ...
## $ this_ratios : num 0.004 0.009 0 0.009 0 0 0 0 0 0 ...
## $ you_ratios : num 0.007 0 0 0 0 0.012 0.013 0.008 0.028 0 ...
## $ not_ratios : num 0.004 0 0 0.014 0 0 0.013 0 0 0 ...
## $ but_ratios : num 0.004 0 0 0.005 0 0 0 0.008 0.014 0 ...
## $ good_ratios : num 0.007 0 0 0 0 0 0 0 0 0 ...
## $ with_ratios : num 0 0 0 0.005 0 0 0 0 0 0 ...
Now lets select our training set and our testing set to build the caret models and test the models on. We will sample randomly with the sample function on our indices of the training set and use those indices not in the training set for our testing set.
set.seed(56789)
train <- sample(floor(.7*length(numRegressions$userRatingValue)),replace=FALSE)
trainingSet <- numRegressions[train,]
testingSet <- numRegressions[-train,]
dim(trainingSet);dim(testingSet);dim(trainingSet)[1]+dim(testingSet)[1];dim(numRegressions)
## [1] 429 18
## [1] 185 18
## [1] 614
## [1] 614 18
library(e1071)
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(gbm)
## Loaded gbm 2.1.5
Optionally you could use this method
inTrain <- createDataPartition(y=numRegressions$userRatingValue, p=0.7, list=FALSE)
trainingSet2 <- numRegressions[inTrain,]
testingSet2 <- numRegressions[-inTrain,]
Our training set to build the model is 429 reviews, and our testing set is 185 reviews. There are more 5s in the data overall. Lets look at those numbers.
statsTrain <- trainingSet %>% group_by(userRatingValue) %>% count()
statsTrain$percent <- statsTrain$n/sum(statsTrain$n)
statsTrain
## # A tibble: 5 x 3
## # Groups: userRatingValue [5]
## userRatingValue n percent
## <dbl> <int> <dbl>
## 1 1 64 0.149
## 2 2 30 0.0699
## 3 3 36 0.0839
## 4 4 82 0.191
## 5 5 217 0.506
statsTest <- testingSet %>% group_by(userRatingValue) %>% count()
statsTest$percent <- statsTest$n/sum(statsTest$n)
statsTest
## # A tibble: 5 x 3
## # Groups: userRatingValue [5]
## userRatingValue n percent
## <dbl> <int> <dbl>
## 1 1 24 0.130
## 2 2 4 0.0216
## 3 3 18 0.0973
## 4 4 21 0.114
## 5 5 118 0.638
There are more percent of 3s and 5s in the testing set and more 1s,2s, and 4s in the training set. but not by too much.
Lets see what the percent of sampling is with the createDataPartition function in the second sampled set.
statsTrain2 <- trainingSet2 %>% group_by(userRatingValue) %>% count()
statsTrain2$percent <- statsTrain2$n/sum(statsTrain2$n)
statsTrain2
## # A tibble: 5 x 3
## # Groups: userRatingValue [5]
## userRatingValue n percent
## <dbl> <int> <dbl>
## 1 1 62 0.144
## 2 2 22 0.0510
## 3 3 40 0.0928
## 4 4 71 0.165
## 5 5 236 0.548
statsTest2 <- testingSet2 %>% group_by(userRatingValue) %>% count()
statsTest2$percent <- statsTest2$n/sum(statsTest2$n)
statsTest2
## # A tibble: 5 x 3
## # Groups: userRatingValue [5]
## userRatingValue n percent
## <dbl> <int> <dbl>
## 1 1 26 0.142
## 2 2 12 0.0656
## 3 3 14 0.0765
## 4 4 32 0.175
## 5 5 99 0.541
The number of percents are better with the createDataPartitions function, so we will use that sampling set.
library(RANN) #this pkg supplements caret for one out bag validation, and interferes with the select function of tidyverse and dplyr
## Warning: package 'RANN' was built under R version 3.6.3
rfMod0 <- train(userRatingValue~., method='rf',
na.action=na.pass,
data=(trainingSet2), preProc = c("center", "scale","medianImpute"),
trControl=trainControl(method='oob'), number=5)
The following produces an error because of the imputing of missing data, when running the next line the predRF0 only has 15 rows, but the testingSet2 has 183, it only predicted by the features that were available.
predRF0 <- predict(rfMod0, testingSet2)
predDF0 <- data.frame(predRF0, type=testingSet2$userRatingValue)
predDF0
sum <- sum(predRF0==testingSet2$userRatingValue)
length <- length(testingSet2$userRatingValue)
accuracy_rfMod0 <- (sum/length)
accuracy_rfMod0
Lets just use the data set without the meta data as many values are missing. We need a new data table. We will just remove the features we don’t need from our testingSet2 and trainingSet2 tables. Some of the supplemental packages to caret when tuning the random forest trees interferes with tidyverse packages, so we’ll use slicing.
trainingSet3 <- trainingSet2[,c(1,7:18)]
trainingSet3$userRatingValue <- as.factor(paste(trainingSet3$userRatingValue))
# trainingSet3 <- trainingSet2 %>% select(-friends, -reviews, -photos,
# -userBusinessPhotos,-userCheckIns)
colnames(trainingSet3)
## [1] "userRatingValue" "the_ratios" "and_ratios" "for_ratios"
## [5] "have_ratios" "that_ratios" "they_ratios" "this_ratios"
## [9] "you_ratios" "not_ratios" "but_ratios" "good_ratios"
## [13] "with_ratios"
testingSet3 <- testingSet2[,c(1,7:18)]
testingSet3$userRatingValue <- as.factor(paste(testingSet3$userRatingValue))
# testingSet3 <- testingSet2 %>% select(-friends, -reviews, -photos,
# -userBusinessPhotos,-userCheckIns)
colnames(testingSet3)
## [1] "userRatingValue" "the_ratios" "and_ratios" "for_ratios"
## [5] "have_ratios" "that_ratios" "they_ratios" "this_ratios"
## [9] "you_ratios" "not_ratios" "but_ratios" "good_ratios"
## [13] "with_ratios"
dim(testingSet3);dim(trainingSet3)
## [1] 183 13
## [1] 431 13
Lets see if it works for the caret rf model this time.
# requires the RANN package
rfMod1 <- train(userRatingValue~., method='rf',
na.action=na.pass,
data=(trainingSet3), preProc = c("center", "scale","knnImpute"),
trControl=trainControl(method='oob'), number=5)
It does!
predRF1 <- predict(rfMod1, testingSet3)
predDF1 <- data.frame(predRF1, type=testingSet3$userRatingValue)
predDF1
## predRF1 type
## 1 5 5
## 2 5 5
## 3 5 1
## 4 1 1
## 5 5 5
## 6 5 4
## 7 5 5
## 8 5 5
## 9 5 5
## 10 5 4
## 11 5 5
## 12 5 5
## 13 4 4
## 14 1 1
## 15 5 5
## 16 5 4
## 17 5 5
## 18 5 4
## 19 5 5
## 20 5 5
## 21 5 5
## 22 5 1
## 23 5 1
## 24 5 5
## 25 5 3
## 26 5 5
## 27 5 5
## 28 5 1
## 29 5 5
## 30 4 1
## 31 5 5
## 32 5 5
## 33 5 4
## 34 2 2
## 35 5 2
## 36 5 5
## 37 5 4
## 38 5 5
## 39 5 1
## 40 2 5
## 41 5 2
## 42 1 4
## 43 4 5
## 44 5 5
## 45 5 1
## 46 5 4
## 47 5 5
## 48 5 5
## 49 5 5
## 50 5 1
## 51 4 4
## 52 5 1
## 53 5 5
## 54 5 5
## 55 5 4
## 56 1 5
## 57 4 2
## 58 1 5
## 59 5 5
## 60 1 1
## 61 5 4
## 62 5 4
## 63 5 4
## 64 5 4
## 65 5 4
## 66 5 4
## 67 5 4
## 68 5 5
## 69 5 5
## 70 5 5
## 71 4 3
## 72 5 5
## 73 5 5
## 74 5 5
## 75 1 5
## 76 5 1
## 77 5 5
## 78 1 5
## 79 5 4
## 80 5 2
## 81 5 5
## 82 4 2
## 83 5 3
## 84 5 2
## 85 5 3
## 86 5 4
## 87 5 5
## 88 5 4
## 89 5 1
## 90 2 1
## 91 5 5
## 92 5 4
## 93 5 2
## 94 3 2
## 95 5 5
## 96 5 5
## 97 5 5
## 98 5 5
## 99 5 3
## 100 4 4
## 101 5 5
## 102 4 5
## 103 5 4
## 104 5 5
## 105 5 5
## 106 5 5
## 107 5 5
## 108 1 4
## 109 5 5
## 110 5 5
## 111 5 5
## 112 5 5
## 113 5 3
## 114 2 4
## 115 1 2
## 116 5 1
## 117 4 4
## 118 5 3
## 119 3 4
## 120 5 5
## 121 5 5
## 122 5 5
## 123 4 2
## 124 5 5
## 125 5 5
## 126 5 5
## 127 3 4
## 128 5 5
## 129 5 3
## 130 1 1
## 131 4 1
## 132 5 3
## 133 5 5
## 134 5 5
## 135 5 5
## 136 5 5
## 137 5 5
## 138 4 3
## 139 5 2
## 140 5 3
## 141 5 4
## 142 5 5
## 143 5 5
## 144 3 4
## 145 4 1
## 146 5 1
## 147 3 1
## 148 4 1
## 149 5 1
## 150 3 1
## 151 4 3
## 152 3 3
## 153 1 1
## 154 5 3
## 155 5 5
## 156 1 5
## 157 5 5
## 158 5 5
## 159 5 5
## 160 5 5
## 161 5 5
## 162 5 5
## 163 5 5
## 164 5 5
## 165 5 5
## 166 5 5
## 167 5 5
## 168 1 5
## 169 5 5
## 170 5 5
## 171 5 5
## 172 5 5
## 173 5 5
## 174 5 5
## 175 5 5
## 176 5 4
## 177 5 5
## 178 4 5
## 179 5 5
## 180 5 1
## 181 5 5
## 182 5 5
## 183 5 5
sum1 <- sum(predRF1==testingSet3$userRatingValue)
length1 <- length(testingSet3$userRatingValue)
accuracy_rfMod1 <- (sum1/length1)
accuracy_rfMod1
## [1] 0.5464481
We see that the above is regressing with the random forest, lets change the target to a factor.
trainingSet3$userRatingValue <- as.factor(paste(trainingSet3$userRatingValue))
testingSet3$userRatingValue <- as.factor(paste(testingSet3$userRatingValue))
Lets re-run the above two chunks of the model and predictions to see the results.
# requires the RANN package
rfMod1 <- train(userRatingValue~., method='rf',
na.action=na.pass,
data=(trainingSet3), preProc = c("center", "scale","knnImpute"),
trControl=trainControl(method='oob'), number=5)
predRF1 <- predict(rfMod1, testingSet3)
predDF1 <- data.frame(predRF1, type=testingSet3$userRatingValue)
predDF1
## predRF1 type
## 1 5 5
## 2 5 5
## 3 5 1
## 4 1 1
## 5 5 5
## 6 5 4
## 7 5 5
## 8 5 5
## 9 5 5
## 10 5 4
## 11 5 5
## 12 5 5
## 13 4 4
## 14 1 1
## 15 5 5
## 16 5 4
## 17 5 5
## 18 5 4
## 19 5 5
## 20 5 5
## 21 5 5
## 22 5 1
## 23 5 1
## 24 5 5
## 25 5 3
## 26 5 5
## 27 5 5
## 28 5 1
## 29 5 5
## 30 4 1
## 31 5 5
## 32 5 5
## 33 5 4
## 34 2 2
## 35 5 2
## 36 5 5
## 37 5 4
## 38 5 5
## 39 5 1
## 40 2 5
## 41 5 2
## 42 1 4
## 43 5 5
## 44 5 5
## 45 5 1
## 46 5 4
## 47 5 5
## 48 5 5
## 49 5 5
## 50 5 1
## 51 4 4
## 52 5 1
## 53 5 5
## 54 5 5
## 55 5 4
## 56 5 5
## 57 4 2
## 58 5 5
## 59 5 5
## 60 1 1
## 61 5 4
## 62 5 4
## 63 5 4
## 64 5 4
## 65 5 4
## 66 5 4
## 67 5 4
## 68 5 5
## 69 5 5
## 70 5 5
## 71 4 3
## 72 5 5
## 73 5 5
## 74 5 5
## 75 5 5
## 76 5 1
## 77 5 5
## 78 5 5
## 79 1 4
## 80 5 2
## 81 5 5
## 82 4 2
## 83 5 3
## 84 5 2
## 85 5 3
## 86 5 4
## 87 5 5
## 88 5 4
## 89 5 1
## 90 2 1
## 91 5 5
## 92 5 4
## 93 5 2
## 94 4 2
## 95 5 5
## 96 5 5
## 97 5 5
## 98 5 5
## 99 5 3
## 100 4 4
## 101 5 5
## 102 4 5
## 103 2 4
## 104 5 5
## 105 5 5
## 106 5 5
## 107 5 5
## 108 4 4
## 109 5 5
## 110 5 5
## 111 5 5
## 112 5 5
## 113 5 3
## 114 2 4
## 115 3 2
## 116 5 1
## 117 4 4
## 118 5 3
## 119 3 4
## 120 5 5
## 121 5 5
## 122 5 5
## 123 4 2
## 124 5 5
## 125 5 5
## 126 5 5
## 127 3 4
## 128 5 5
## 129 5 3
## 130 1 1
## 131 4 1
## 132 5 3
## 133 5 5
## 134 5 5
## 135 5 5
## 136 5 5
## 137 5 5
## 138 4 3
## 139 5 2
## 140 5 3
## 141 5 4
## 142 5 5
## 143 5 5
## 144 3 4
## 145 4 1
## 146 5 1
## 147 3 1
## 148 4 1
## 149 5 1
## 150 3 1
## 151 4 3
## 152 3 3
## 153 1 1
## 154 5 3
## 155 5 5
## 156 1 5
## 157 5 5
## 158 5 5
## 159 5 5
## 160 5 5
## 161 4 5
## 162 5 5
## 163 5 5
## 164 5 5
## 165 5 5
## 166 5 5
## 167 5 5
## 168 1 5
## 169 5 5
## 170 5 5
## 171 5 5
## 172 5 5
## 173 5 5
## 174 5 5
## 175 5 5
## 176 5 4
## 177 5 5
## 178 4 5
## 179 5 5
## 180 5 1
## 181 5 5
## 182 5 5
## 183 5 5
sum1 <- sum(predRF1==testingSet3$userRatingValue)
length1 <- length(testingSet3$userRatingValue)
accuracy_rfMod1 <- (sum1/length1)
head(accuracy_rfMod1,30)
## [1] 0.5737705
The results are much better and actually the similar range of half and half as our manually built model by using just the stopwords that appeared the most to predict the rating. The above model also used the one out bagging method for trees in validating, and it preprocessed by centering and scaling the variables to normalize them as well as set to imputing NAs with knnImpute, even though there weren’t any NAs in the ratios, just some zeros.
This next model uses the bagImpute for NAs and with the same number, 5, of iterations to one out bag validation.
rfMod2 <- train(userRatingValue~., method='rf',
na.action=na.pass,
data=(trainingSet3), preProc = c("center", "scale","bagImpute"),
trControl=trainControl(method='oob'), number=5)
predRF2 <- predict(rfMod2, testingSet3)
predDF2 <- data.frame(predRF2, type=testingSet3$userRatingValue)
predDF2
## predRF2 type
## 1 5 5
## 2 5 5
## 3 5 1
## 4 1 1
## 5 5 5
## 6 5 4
## 7 5 5
## 8 5 5
## 9 5 5
## 10 5 4
## 11 5 5
## 12 5 5
## 13 4 4
## 14 1 1
## 15 5 5
## 16 5 4
## 17 5 5
## 18 5 4
## 19 5 5
## 20 5 5
## 21 5 5
## 22 5 1
## 23 5 1
## 24 5 5
## 25 5 3
## 26 5 5
## 27 5 5
## 28 5 1
## 29 5 5
## 30 4 1
## 31 5 5
## 32 5 5
## 33 5 4
## 34 2 2
## 35 5 2
## 36 5 5
## 37 5 4
## 38 5 5
## 39 1 1
## 40 2 5
## 41 5 2
## 42 1 4
## 43 5 5
## 44 5 5
## 45 5 1
## 46 5 4
## 47 5 5
## 48 5 5
## 49 5 5
## 50 5 1
## 51 4 4
## 52 5 1
## 53 5 5
## 54 5 5
## 55 5 4
## 56 5 5
## 57 4 2
## 58 5 5
## 59 5 5
## 60 1 1
## 61 5 4
## 62 5 4
## 63 5 4
## 64 5 4
## 65 5 4
## 66 5 4
## 67 5 4
## 68 5 5
## 69 5 5
## 70 5 5
## 71 4 3
## 72 5 5
## 73 5 5
## 74 5 5
## 75 5 5
## 76 5 1
## 77 5 5
## 78 5 5
## 79 5 4
## 80 5 2
## 81 5 5
## 82 4 2
## 83 5 3
## 84 5 2
## 85 5 3
## 86 5 4
## 87 5 5
## 88 5 4
## 89 5 1
## 90 2 1
## 91 5 5
## 92 5 4
## 93 5 2
## 94 3 2
## 95 5 5
## 96 5 5
## 97 5 5
## 98 5 5
## 99 5 3
## 100 4 4
## 101 5 5
## 102 4 5
## 103 5 4
## 104 5 5
## 105 5 5
## 106 5 5
## 107 5 5
## 108 4 4
## 109 5 5
## 110 5 5
## 111 5 5
## 112 5 5
## 113 5 3
## 114 2 4
## 115 1 2
## 116 5 1
## 117 4 4
## 118 5 3
## 119 3 4
## 120 5 5
## 121 5 5
## 122 5 5
## 123 4 2
## 124 5 5
## 125 5 5
## 126 5 5
## 127 3 4
## 128 5 5
## 129 5 3
## 130 1 1
## 131 4 1
## 132 5 3
## 133 5 5
## 134 5 5
## 135 5 5
## 136 5 5
## 137 5 5
## 138 4 3
## 139 5 2
## 140 5 3
## 141 5 4
## 142 5 5
## 143 5 5
## 144 3 4
## 145 4 1
## 146 5 1
## 147 3 1
## 148 4 1
## 149 5 1
## 150 3 1
## 151 4 3
## 152 3 3
## 153 1 1
## 154 5 3
## 155 5 5
## 156 1 5
## 157 5 5
## 158 5 5
## 159 5 5
## 160 5 5
## 161 5 5
## 162 5 5
## 163 5 5
## 164 5 5
## 165 5 5
## 166 5 5
## 167 5 5
## 168 1 5
## 169 5 5
## 170 5 5
## 171 5 5
## 172 5 5
## 173 5 5
## 174 5 5
## 175 5 5
## 176 5 4
## 177 5 5
## 178 4 5
## 179 5 5
## 180 5 1
## 181 5 5
## 182 5 5
## 183 5 5
sum2 <- sum(predRF2==testingSet3$userRatingValue)
length2 <- length(testingSet3$userRatingValue)
accuracy_rfMod2 <- (sum2/length2)
head(accuracy_rfMod2,30)
## [1] 0.5846995
The results of the bagImpute random forest classification is two percent worse than the knnImpute with the same settings in accuracy.
This next random forest model uses the bootstrap method set to five iterations and medianImpute of NAs.
rfMod3 <- train(userRatingValue~., method='rf',
na.action=na.pass,
data=(trainingSet3), preProc = c("center", "scale","medianImpute"),
trControl=trainControl(method='boot'), number=5)
predRF3 <- predict(rfMod3, testingSet3)
predDF3 <- data.frame(predRF3, type=testingSet3$userRatingValue)
predDF3
## predRF3 type
## 1 5 5
## 2 5 5
## 3 5 1
## 4 1 1
## 5 5 5
## 6 5 4
## 7 5 5
## 8 5 5
## 9 5 5
## 10 5 4
## 11 5 5
## 12 5 5
## 13 4 4
## 14 1 1
## 15 5 5
## 16 5 4
## 17 5 5
## 18 5 4
## 19 5 5
## 20 5 5
## 21 5 5
## 22 5 1
## 23 5 1
## 24 5 5
## 25 5 3
## 26 5 5
## 27 5 5
## 28 5 1
## 29 5 5
## 30 4 1
## 31 5 5
## 32 5 5
## 33 5 4
## 34 2 2
## 35 5 2
## 36 5 5
## 37 1 4
## 38 5 5
## 39 5 1
## 40 2 5
## 41 5 2
## 42 1 4
## 43 4 5
## 44 5 5
## 45 5 1
## 46 5 4
## 47 5 5
## 48 5 5
## 49 5 5
## 50 5 1
## 51 4 4
## 52 5 1
## 53 5 5
## 54 5 5
## 55 5 4
## 56 1 5
## 57 4 2
## 58 5 5
## 59 5 5
## 60 1 1
## 61 5 4
## 62 5 4
## 63 5 4
## 64 5 4
## 65 5 4
## 66 5 4
## 67 5 4
## 68 5 5
## 69 5 5
## 70 5 5
## 71 4 3
## 72 5 5
## 73 5 5
## 74 5 5
## 75 5 5
## 76 5 1
## 77 5 5
## 78 1 5
## 79 5 4
## 80 5 2
## 81 5 5
## 82 4 2
## 83 5 3
## 84 5 2
## 85 5 3
## 86 5 4
## 87 5 5
## 88 5 4
## 89 5 1
## 90 2 1
## 91 5 5
## 92 5 4
## 93 5 2
## 94 3 2
## 95 4 5
## 96 5 5
## 97 5 5
## 98 5 5
## 99 5 3
## 100 4 4
## 101 5 5
## 102 4 5
## 103 2 4
## 104 5 5
## 105 5 5
## 106 5 5
## 107 5 5
## 108 4 4
## 109 5 5
## 110 5 5
## 111 5 5
## 112 5 5
## 113 5 3
## 114 2 4
## 115 1 2
## 116 5 1
## 117 4 4
## 118 5 3
## 119 3 4
## 120 5 5
## 121 5 5
## 122 5 5
## 123 4 2
## 124 5 5
## 125 5 5
## 126 5 5
## 127 3 4
## 128 5 5
## 129 5 3
## 130 1 1
## 131 4 1
## 132 5 3
## 133 5 5
## 134 5 5
## 135 5 5
## 136 5 5
## 137 5 5
## 138 4 3
## 139 5 2
## 140 5 3
## 141 5 4
## 142 5 5
## 143 5 5
## 144 3 4
## 145 4 1
## 146 5 1
## 147 3 1
## 148 4 1
## 149 5 1
## 150 3 1
## 151 4 3
## 152 3 3
## 153 1 1
## 154 5 3
## 155 5 5
## 156 1 5
## 157 5 5
## 158 5 5
## 159 5 5
## 160 5 5
## 161 5 5
## 162 5 5
## 163 5 5
## 164 5 5
## 165 5 5
## 166 5 5
## 167 5 5
## 168 1 5
## 169 5 5
## 170 5 5
## 171 5 5
## 172 5 5
## 173 5 5
## 174 5 5
## 175 5 5
## 176 5 4
## 177 5 5
## 178 4 5
## 179 5 5
## 180 5 1
## 181 5 5
## 182 5 5
## 183 5 5
sum3 <- sum(predRF3==testingSet3$userRatingValue)
length3 <- length(testingSet3$userRatingValue)
accuracy_rfMod3 <- (sum3/length3)
head(accuracy_rfMod3,30)
## [1] 0.557377
The results with the bootstrap method and median imputing of NAs is almost as good as the knnImpute of NAs with the one out bag. Both at 56% but the knnImpute 6/10 percent better.
Lets use the bootstrap method with knnImpute on the next random forest model.
rfMod4 <- train(userRatingValue~., method='rf',
na.action=na.pass,
data=(trainingSet3), preProc = c("center", "scale","knnImpute"),
trControl=trainControl(method='boot'), number=5)
predRF4 <- predict(rfMod4, testingSet3)
predDF4 <- data.frame(predRF4, type=testingSet3$userRatingValue)
predDF4
## predRF4 type
## 1 5 5
## 2 5 5
## 3 5 1
## 4 1 1
## 5 5 5
## 6 1 4
## 7 5 5
## 8 5 5
## 9 5 5
## 10 5 4
## 11 5 5
## 12 5 5
## 13 4 4
## 14 1 1
## 15 5 5
## 16 5 4
## 17 5 5
## 18 5 4
## 19 5 5
## 20 5 5
## 21 5 5
## 22 5 1
## 23 5 1
## 24 5 5
## 25 5 3
## 26 5 5
## 27 5 5
## 28 5 1
## 29 5 5
## 30 4 1
## 31 5 5
## 32 5 5
## 33 5 4
## 34 2 2
## 35 5 2
## 36 5 5
## 37 1 4
## 38 5 5
## 39 5 1
## 40 2 5
## 41 5 2
## 42 1 4
## 43 4 5
## 44 5 5
## 45 5 1
## 46 5 4
## 47 5 5
## 48 5 5
## 49 5 5
## 50 5 1
## 51 4 4
## 52 5 1
## 53 5 5
## 54 5 5
## 55 5 4
## 56 4 5
## 57 5 2
## 58 5 5
## 59 5 5
## 60 1 1
## 61 5 4
## 62 5 4
## 63 5 4
## 64 5 4
## 65 1 4
## 66 5 4
## 67 5 4
## 68 5 5
## 69 5 5
## 70 5 5
## 71 4 3
## 72 5 5
## 73 5 5
## 74 5 5
## 75 5 5
## 76 5 1
## 77 5 5
## 78 1 5
## 79 5 4
## 80 5 2
## 81 5 5
## 82 4 2
## 83 5 3
## 84 5 2
## 85 5 3
## 86 5 4
## 87 5 5
## 88 5 4
## 89 5 1
## 90 2 1
## 91 5 5
## 92 5 4
## 93 5 2
## 94 4 2
## 95 5 5
## 96 5 5
## 97 5 5
## 98 5 5
## 99 5 3
## 100 4 4
## 101 5 5
## 102 4 5
## 103 5 4
## 104 5 5
## 105 5 5
## 106 5 5
## 107 5 5
## 108 4 4
## 109 5 5
## 110 5 5
## 111 5 5
## 112 5 5
## 113 5 3
## 114 2 4
## 115 1 2
## 116 5 1
## 117 4 4
## 118 5 3
## 119 3 4
## 120 5 5
## 121 5 5
## 122 5 5
## 123 4 2
## 124 5 5
## 125 5 5
## 126 5 5
## 127 3 4
## 128 5 5
## 129 5 3
## 130 1 1
## 131 4 1
## 132 5 3
## 133 5 5
## 134 5 5
## 135 5 5
## 136 5 5
## 137 5 5
## 138 4 3
## 139 5 2
## 140 5 3
## 141 5 4
## 142 5 5
## 143 5 5
## 144 3 4
## 145 4 1
## 146 5 1
## 147 3 1
## 148 4 1
## 149 5 1
## 150 3 1
## 151 4 3
## 152 3 3
## 153 1 1
## 154 5 3
## 155 5 5
## 156 1 5
## 157 5 5
## 158 5 5
## 159 5 5
## 160 5 5
## 161 5 5
## 162 5 5
## 163 5 5
## 164 5 5
## 165 5 5
## 166 5 5
## 167 5 5
## 168 1 5
## 169 5 5
## 170 5 5
## 171 5 5
## 172 5 5
## 173 5 5
## 174 5 5
## 175 5 5
## 176 5 4
## 177 5 5
## 178 4 5
## 179 5 5
## 180 5 1
## 181 5 5
## 182 5 5
## 183 5 5
sum4 <- sum(predRF4==testingSet3$userRatingValue)
length4 <- length(testingSet3$userRatingValue)
accuracy_rfMod4 <- (sum4/length4)
head(accuracy_rfMod4,30)
## [1] 0.5628415
The bootstrap with knnImpute above did exactly the same as the medianImpute bootstrap random forest as they both predicted 103/183 correct. The knnImpute with one out bag did the best so far with 104/183. The lowest scoring imputing was bagImpute in the one out bag. Lets see this NA imputation with the adaptive_cv method.
Lets use the
rfMod5 <- train(userRatingValue~., method='rf',
na.action=na.pass,
data=(trainingSet3), preProc = c("center", "scale","bagImpute"),
trControl=trainControl(method='adaptive_cv'), number=5)
predRF5 <- predict(rfMod5, testingSet3)
predDF5 <- data.frame(predRF5, type=testingSet3$userRatingValue)
predDF5
## predRF5 type
## 1 5 5
## 2 4 5
## 3 5 1
## 4 1 1
## 5 5 5
## 6 4 4
## 7 5 5
## 8 5 5
## 9 5 5
## 10 2 4
## 11 5 5
## 12 5 5
## 13 4 4
## 14 1 1
## 15 5 5
## 16 5 4
## 17 5 5
## 18 1 4
## 19 3 5
## 20 5 5
## 21 1 5
## 22 5 1
## 23 5 1
## 24 5 5
## 25 5 3
## 26 5 5
## 27 5 5
## 28 5 1
## 29 5 5
## 30 4 1
## 31 5 5
## 32 5 5
## 33 5 4
## 34 2 2
## 35 5 2
## 36 5 5
## 37 1 4
## 38 5 5
## 39 1 1
## 40 2 5
## 41 5 2
## 42 1 4
## 43 3 5
## 44 5 5
## 45 5 1
## 46 5 4
## 47 5 5
## 48 5 5
## 49 5 5
## 50 5 1
## 51 4 4
## 52 5 1
## 53 5 5
## 54 5 5
## 55 3 4
## 56 4 5
## 57 4 2
## 58 5 5
## 59 5 5
## 60 3 1
## 61 5 4
## 62 5 4
## 63 4 4
## 64 5 4
## 65 1 4
## 66 4 4
## 67 5 4
## 68 5 5
## 69 5 5
## 70 4 5
## 71 1 3
## 72 5 5
## 73 5 5
## 74 5 5
## 75 1 5
## 76 5 1
## 77 5 5
## 78 1 5
## 79 1 4
## 80 5 2
## 81 5 5
## 82 4 2
## 83 5 3
## 84 2 2
## 85 5 3
## 86 4 4
## 87 5 5
## 88 5 4
## 89 5 1
## 90 2 1
## 91 5 5
## 92 5 4
## 93 5 2
## 94 3 2
## 95 4 5
## 96 5 5
## 97 5 5
## 98 5 5
## 99 2 3
## 100 4 4
## 101 5 5
## 102 4 5
## 103 2 4
## 104 5 5
## 105 5 5
## 106 5 5
## 107 5 5
## 108 4 4
## 109 4 5
## 110 5 5
## 111 5 5
## 112 5 5
## 113 5 3
## 114 2 4
## 115 4 2
## 116 5 1
## 117 4 4
## 118 5 3
## 119 3 4
## 120 5 5
## 121 5 5
## 122 5 5
## 123 4 2
## 124 5 5
## 125 1 5
## 126 5 5
## 127 3 4
## 128 5 5
## 129 5 3
## 130 1 1
## 131 4 1
## 132 5 3
## 133 5 5
## 134 5 5
## 135 5 5
## 136 5 5
## 137 5 5
## 138 4 3
## 139 5 2
## 140 5 3
## 141 5 4
## 142 5 5
## 143 5 5
## 144 3 4
## 145 4 1
## 146 5 1
## 147 3 1
## 148 4 1
## 149 4 1
## 150 3 1
## 151 4 3
## 152 3 3
## 153 1 1
## 154 4 3
## 155 5 5
## 156 1 5
## 157 5 5
## 158 5 5
## 159 5 5
## 160 2 5
## 161 4 5
## 162 5 5
## 163 5 5
## 164 5 5
## 165 5 5
## 166 5 5
## 167 5 5
## 168 1 5
## 169 5 5
## 170 5 5
## 171 5 5
## 172 5 5
## 173 5 5
## 174 5 5
## 175 5 5
## 176 4 4
## 177 5 5
## 178 4 5
## 179 5 5
## 180 3 1
## 181 5 5
## 182 5 5
## 183 5 5
sum5 <- sum(predRF5==testingSet3$userRatingValue)
length5 <- length(testingSet3$userRatingValue)
accuracy_rfMod5 <- (sum5/length5)
head(accuracy_rfMod5,30)
## [1] 0.5409836
It turns out the bagImpute with adaptive_cv validation random forest model scored 104/183 correctly as best model with the knnImpute one out bag model.
The next is a medianImpute with adaptive_boot validation.
rfMod6 <- train(userRatingValue ~., method='rf',
na.action=na.pass,
data=(trainingSet3), preProc = c("center", "scale","medianImpute"),
trControl=trainControl(method='adaptive_boot'), number=5)
predRF6 <- predict(rfMod6, testingSet3)
predDF6 <- data.frame(predRF6, type=testingSet3$userRatingValue)
predDF6
## predRF6 type
## 1 5 5
## 2 5 5
## 3 5 1
## 4 1 1
## 5 5 5
## 6 5 4
## 7 5 5
## 8 5 5
## 9 5 5
## 10 5 4
## 11 5 5
## 12 5 5
## 13 4 4
## 14 1 1
## 15 5 5
## 16 5 4
## 17 5 5
## 18 5 4
## 19 5 5
## 20 5 5
## 21 5 5
## 22 5 1
## 23 5 1
## 24 5 5
## 25 5 3
## 26 5 5
## 27 5 5
## 28 5 1
## 29 5 5
## 30 4 1
## 31 5 5
## 32 5 5
## 33 5 4
## 34 2 2
## 35 5 2
## 36 5 5
## 37 5 4
## 38 5 5
## 39 5 1
## 40 2 5
## 41 5 2
## 42 1 4
## 43 5 5
## 44 5 5
## 45 5 1
## 46 5 4
## 47 5 5
## 48 5 5
## 49 5 5
## 50 5 1
## 51 4 4
## 52 5 1
## 53 5 5
## 54 5 5
## 55 5 4
## 56 5 5
## 57 4 2
## 58 1 5
## 59 5 5
## 60 3 1
## 61 5 4
## 62 5 4
## 63 5 4
## 64 5 4
## 65 1 4
## 66 5 4
## 67 5 4
## 68 5 5
## 69 5 5
## 70 5 5
## 71 4 3
## 72 5 5
## 73 5 5
## 74 5 5
## 75 5 5
## 76 5 1
## 77 5 5
## 78 5 5
## 79 5 4
## 80 5 2
## 81 5 5
## 82 4 2
## 83 5 3
## 84 5 2
## 85 5 3
## 86 5 4
## 87 5 5
## 88 5 4
## 89 5 1
## 90 2 1
## 91 5 5
## 92 5 4
## 93 5 2
## 94 3 2
## 95 4 5
## 96 5 5
## 97 5 5
## 98 5 5
## 99 5 3
## 100 5 4
## 101 5 5
## 102 4 5
## 103 5 4
## 104 5 5
## 105 5 5
## 106 5 5
## 107 5 5
## 108 4 4
## 109 5 5
## 110 5 5
## 111 5 5
## 112 5 5
## 113 5 3
## 114 2 4
## 115 1 2
## 116 5 1
## 117 4 4
## 118 5 3
## 119 3 4
## 120 5 5
## 121 5 5
## 122 5 5
## 123 4 2
## 124 5 5
## 125 5 5
## 126 5 5
## 127 3 4
## 128 5 5
## 129 5 3
## 130 1 1
## 131 4 1
## 132 5 3
## 133 5 5
## 134 5 5
## 135 5 5
## 136 5 5
## 137 5 5
## 138 4 3
## 139 5 2
## 140 5 3
## 141 5 4
## 142 5 5
## 143 5 5
## 144 3 4
## 145 4 1
## 146 5 1
## 147 3 1
## 148 4 1
## 149 5 1
## 150 3 1
## 151 4 3
## 152 3 3
## 153 1 1
## 154 5 3
## 155 5 5
## 156 1 5
## 157 5 5
## 158 5 5
## 159 5 5
## 160 5 5
## 161 5 5
## 162 5 5
## 163 5 5
## 164 5 5
## 165 5 5
## 166 5 5
## 167 5 5
## 168 1 5
## 169 5 5
## 170 5 5
## 171 5 5
## 172 5 5
## 173 5 5
## 174 5 5
## 175 5 5
## 176 5 4
## 177 5 5
## 178 4 5
## 179 5 5
## 180 5 1
## 181 5 5
## 182 5 5
## 183 5 5
sum6 <- sum(predRF6==testingSet3$userRatingValue)
length6 <- length(testingSet3$userRatingValue)
accuracy_rfMod6 <- (sum6/length6)
head(accuracy_rfMod6,30)
## [1] 0.557377
So far the range is 100-104 out of 183 ratings predicted correctly.
The next model is medianImpute with adaptive_cv
rfMod7 <- train(userRatingValue ~., method='rf',
na.action=na.pass, search="random",
data=(trainingSet3), preProc = c("center", "scale","medianImpute"),
trControl=trainControl(method='adaptive_cv'), number=5)
predRF7 <- predict(rfMod7, testingSet3)
predDF7 <- data.frame(predRF7, type=testingSet3$userRatingValue)
predDF7
## predRF7 type
## 1 5 5
## 2 5 5
## 3 5 1
## 4 1 1
## 5 5 5
## 6 5 4
## 7 5 5
## 8 5 5
## 9 5 5
## 10 5 4
## 11 5 5
## 12 5 5
## 13 4 4
## 14 1 1
## 15 5 5
## 16 5 4
## 17 5 5
## 18 5 4
## 19 5 5
## 20 5 5
## 21 5 5
## 22 5 1
## 23 5 1
## 24 5 5
## 25 5 3
## 26 5 5
## 27 5 5
## 28 5 1
## 29 5 5
## 30 4 1
## 31 5 5
## 32 5 5
## 33 5 4
## 34 2 2
## 35 5 2
## 36 5 5
## 37 4 4
## 38 5 5
## 39 5 1
## 40 2 5
## 41 5 2
## 42 5 4
## 43 5 5
## 44 5 5
## 45 5 1
## 46 5 4
## 47 5 5
## 48 5 5
## 49 5 5
## 50 5 1
## 51 4 4
## 52 5 1
## 53 5 5
## 54 5 5
## 55 5 4
## 56 5 5
## 57 4 2
## 58 1 5
## 59 5 5
## 60 5 1
## 61 5 4
## 62 5 4
## 63 5 4
## 64 5 4
## 65 5 4
## 66 5 4
## 67 5 4
## 68 5 5
## 69 5 5
## 70 5 5
## 71 4 3
## 72 5 5
## 73 5 5
## 74 5 5
## 75 5 5
## 76 5 1
## 77 5 5
## 78 1 5
## 79 5 4
## 80 5 2
## 81 5 5
## 82 4 2
## 83 5 3
## 84 5 2
## 85 5 3
## 86 5 4
## 87 5 5
## 88 5 4
## 89 5 1
## 90 2 1
## 91 5 5
## 92 5 4
## 93 5 2
## 94 3 2
## 95 4 5
## 96 5 5
## 97 5 5
## 98 5 5
## 99 5 3
## 100 4 4
## 101 5 5
## 102 4 5
## 103 5 4
## 104 5 5
## 105 5 5
## 106 5 5
## 107 5 5
## 108 4 4
## 109 5 5
## 110 5 5
## 111 5 5
## 112 5 5
## 113 5 3
## 114 2 4
## 115 3 2
## 116 5 1
## 117 4 4
## 118 5 3
## 119 3 4
## 120 5 5
## 121 5 5
## 122 5 5
## 123 4 2
## 124 5 5
## 125 5 5
## 126 5 5
## 127 3 4
## 128 5 5
## 129 5 3
## 130 1 1
## 131 4 1
## 132 5 3
## 133 5 5
## 134 5 5
## 135 5 5
## 136 5 5
## 137 5 5
## 138 4 3
## 139 5 2
## 140 5 3
## 141 5 4
## 142 5 5
## 143 5 5
## 144 3 4
## 145 4 1
## 146 5 1
## 147 3 1
## 148 4 1
## 149 5 1
## 150 3 1
## 151 4 3
## 152 3 3
## 153 1 1
## 154 5 3
## 155 5 5
## 156 1 5
## 157 5 5
## 158 5 5
## 159 5 5
## 160 5 5
## 161 5 5
## 162 5 5
## 163 5 5
## 164 5 5
## 165 5 5
## 166 5 5
## 167 5 5
## 168 1 5
## 169 5 5
## 170 5 5
## 171 5 5
## 172 5 5
## 173 5 5
## 174 5 5
## 175 5 5
## 176 5 4
## 177 5 5
## 178 5 5
## 179 5 5
## 180 5 1
## 181 5 5
## 182 5 5
## 183 5 5
sum7 <- sum(predRF7==testingSet3$userRatingValue)
length7 <- length(testingSet3$userRatingValue)
accuracy_rfMod7 <- (sum7/length7)
head(accuracy_rfMod7,30)
## [1] 0.568306
Nothing outside the range yet. Lets keep going.
This next model is medianImpute and adaptive_cv with an added grid search.
rfMod8 <- train(userRatingValue ~., method='rf',
na.action=na.pass, search="grid",
data=(trainingSet3), preProc = c("center", "scale","medianImpute"),
trControl=trainControl(method='adaptive_cv'), number=5)
predRF8 <- predict(rfMod8, testingSet3)
predDF8 <- data.frame(predRF8, type=testingSet3$userRatingValue)
predDF8
## predRF8 type
## 1 5 5
## 2 4 5
## 3 5 1
## 4 1 1
## 5 5 5
## 6 4 4
## 7 5 5
## 8 5 5
## 9 5 5
## 10 2 4
## 11 5 5
## 12 5 5
## 13 4 4
## 14 1 1
## 15 5 5
## 16 5 4
## 17 5 5
## 18 1 4
## 19 3 5
## 20 5 5
## 21 1 5
## 22 5 1
## 23 5 1
## 24 5 5
## 25 5 3
## 26 5 5
## 27 5 5
## 28 5 1
## 29 5 5
## 30 4 1
## 31 5 5
## 32 5 5
## 33 5 4
## 34 2 2
## 35 5 2
## 36 5 5
## 37 1 4
## 38 5 5
## 39 1 1
## 40 2 5
## 41 5 2
## 42 4 4
## 43 3 5
## 44 5 5
## 45 5 1
## 46 5 4
## 47 5 5
## 48 5 5
## 49 5 5
## 50 1 1
## 51 4 4
## 52 5 1
## 53 5 5
## 54 5 5
## 55 3 4
## 56 5 5
## 57 4 2
## 58 5 5
## 59 5 5
## 60 3 1
## 61 5 4
## 62 5 4
## 63 4 4
## 64 5 4
## 65 1 4
## 66 4 4
## 67 5 4
## 68 5 5
## 69 5 5
## 70 3 5
## 71 4 3
## 72 5 5
## 73 5 5
## 74 5 5
## 75 5 5
## 76 5 1
## 77 5 5
## 78 1 5
## 79 5 4
## 80 5 2
## 81 5 5
## 82 4 2
## 83 4 3
## 84 2 2
## 85 5 3
## 86 4 4
## 87 5 5
## 88 5 4
## 89 5 1
## 90 2 1
## 91 5 5
## 92 1 4
## 93 1 2
## 94 4 2
## 95 4 5
## 96 5 5
## 97 5 5
## 98 5 5
## 99 3 3
## 100 4 4
## 101 5 5
## 102 4 5
## 103 2 4
## 104 5 5
## 105 5 5
## 106 5 5
## 107 5 5
## 108 4 4
## 109 4 5
## 110 5 5
## 111 5 5
## 112 1 5
## 113 3 3
## 114 2 4
## 115 5 2
## 116 5 1
## 117 4 4
## 118 5 3
## 119 3 4
## 120 5 5
## 121 5 5
## 122 5 5
## 123 4 2
## 124 5 5
## 125 1 5
## 126 5 5
## 127 3 4
## 128 5 5
## 129 5 3
## 130 1 1
## 131 4 1
## 132 5 3
## 133 5 5
## 134 5 5
## 135 5 5
## 136 5 5
## 137 5 5
## 138 4 3
## 139 5 2
## 140 5 3
## 141 5 4
## 142 5 5
## 143 5 5
## 144 3 4
## 145 4 1
## 146 5 1
## 147 3 1
## 148 4 1
## 149 4 1
## 150 3 1
## 151 4 3
## 152 3 3
## 153 1 1
## 154 4 3
## 155 5 5
## 156 1 5
## 157 5 5
## 158 5 5
## 159 5 5
## 160 2 5
## 161 2 5
## 162 5 5
## 163 5 5
## 164 5 5
## 165 5 5
## 166 5 5
## 167 5 5
## 168 1 5
## 169 5 5
## 170 5 5
## 171 5 5
## 172 5 5
## 173 5 5
## 174 5 5
## 175 5 5
## 176 2 4
## 177 5 5
## 178 4 5
## 179 5 5
## 180 3 1
## 181 5 5
## 182 5 5
## 183 5 5
sum8 <- sum(predRF8==testingSet3$userRatingValue)
length8 <- length(testingSet3$userRatingValue)
accuracy_rfMod8 <- (sum8/length8)
head(accuracy_rfMod8,30)
## [1] 0.5628415
Nothing new on the other model. Lets run two other models of random forest. Lets use the grid search again but use 10 iterations instead of 5 with the medianImputed NAs attribute.
rfMod9 <- train(userRatingValue ~., method='rf',
na.action=na.pass, search="grid",
data=(trainingSet3), preProc = c("center", "scale","medianImpute"),
trControl=trainControl(method='adaptive_cv'), number=10)
predRF9 <- predict(rfMod9, testingSet3)
predDF9 <- data.frame(predRF9, type=testingSet3$userRatingValue)
predDF9
## predRF9 type
## 1 5 5
## 2 5 5
## 3 5 1
## 4 1 1
## 5 5 5
## 6 5 4
## 7 5 5
## 8 5 5
## 9 5 5
## 10 5 4
## 11 5 5
## 12 5 5
## 13 4 4
## 14 1 1
## 15 5 5
## 16 5 4
## 17 5 5
## 18 5 4
## 19 5 5
## 20 5 5
## 21 5 5
## 22 5 1
## 23 5 1
## 24 5 5
## 25 5 3
## 26 5 5
## 27 5 5
## 28 5 1
## 29 5 5
## 30 4 1
## 31 5 5
## 32 5 5
## 33 5 4
## 34 2 2
## 35 5 2
## 36 5 5
## 37 5 4
## 38 5 5
## 39 5 1
## 40 2 5
## 41 5 2
## 42 1 4
## 43 4 5
## 44 5 5
## 45 5 1
## 46 5 4
## 47 5 5
## 48 5 5
## 49 5 5
## 50 5 1
## 51 4 4
## 52 5 1
## 53 5 5
## 54 5 5
## 55 5 4
## 56 5 5
## 57 4 2
## 58 5 5
## 59 5 5
## 60 3 1
## 61 5 4
## 62 5 4
## 63 5 4
## 64 5 4
## 65 5 4
## 66 5 4
## 67 5 4
## 68 5 5
## 69 5 5
## 70 5 5
## 71 4 3
## 72 5 5
## 73 5 5
## 74 5 5
## 75 5 5
## 76 5 1
## 77 5 5
## 78 5 5
## 79 5 4
## 80 5 2
## 81 5 5
## 82 4 2
## 83 5 3
## 84 5 2
## 85 5 3
## 86 5 4
## 87 5 5
## 88 5 4
## 89 5 1
## 90 2 1
## 91 5 5
## 92 5 4
## 93 5 2
## 94 3 2
## 95 5 5
## 96 5 5
## 97 5 5
## 98 5 5
## 99 5 3
## 100 4 4
## 101 5 5
## 102 4 5
## 103 5 4
## 104 5 5
## 105 5 5
## 106 5 5
## 107 5 5
## 108 4 4
## 109 5 5
## 110 5 5
## 111 5 5
## 112 5 5
## 113 5 3
## 114 2 4
## 115 1 2
## 116 5 1
## 117 4 4
## 118 5 3
## 119 3 4
## 120 5 5
## 121 5 5
## 122 5 5
## 123 4 2
## 124 5 5
## 125 5 5
## 126 5 5
## 127 3 4
## 128 5 5
## 129 5 3
## 130 1 1
## 131 4 1
## 132 5 3
## 133 5 5
## 134 5 5
## 135 5 5
## 136 5 5
## 137 5 5
## 138 4 3
## 139 5 2
## 140 5 3
## 141 5 4
## 142 5 5
## 143 5 5
## 144 3 4
## 145 4 1
## 146 5 1
## 147 3 1
## 148 4 1
## 149 5 1
## 150 3 1
## 151 4 3
## 152 3 3
## 153 1 1
## 154 5 3
## 155 5 5
## 156 1 5
## 157 5 5
## 158 5 5
## 159 5 5
## 160 5 5
## 161 4 5
## 162 5 5
## 163 5 5
## 164 5 5
## 165 5 5
## 166 5 5
## 167 5 5
## 168 1 5
## 169 5 5
## 170 5 5
## 171 5 5
## 172 5 5
## 173 5 5
## 174 5 5
## 175 5 5
## 176 5 4
## 177 5 5
## 178 4 5
## 179 5 5
## 180 5 1
## 181 5 5
## 182 5 5
## 183 5 5
sum9 <- sum(predRF9==testingSet3$userRatingValue)
length9 <- length(testingSet3$userRatingValue)
accuracy_rfMod9 <- (sum9/length9)
head(accuracy_rfMod9,30)
## [1] 0.5628415
The above model with medianImpute, 10 iterations of adaptive_cv validation, and grid search did score better than the bottom scoring models at 55 % or 102/183 correct.
This last random forest model uses random search, medianImpute, and 10 iterations of adaptive cv.
rfMod10 <- train(userRatingValue ~., method='rf',
na.action=na.pass, search="random",
data=(trainingSet3), preProc = c("center", "scale","medianImpute"),
trControl=trainControl(method='adaptive_cv'), number=10)
predRF10 <- predict(rfMod10, testingSet3)
predDF10 <- data.frame(predRF10, type=testingSet3$userRatingValue)
predDF10
## predRF10 type
## 1 5 5
## 2 4 5
## 3 5 1
## 4 1 1
## 5 5 5
## 6 4 4
## 7 5 5
## 8 5 5
## 9 5 5
## 10 2 4
## 11 5 5
## 12 5 5
## 13 4 4
## 14 1 1
## 15 5 5
## 16 5 4
## 17 5 5
## 18 5 4
## 19 3 5
## 20 5 5
## 21 1 5
## 22 5 1
## 23 1 1
## 24 5 5
## 25 5 3
## 26 5 5
## 27 5 5
## 28 5 1
## 29 5 5
## 30 4 1
## 31 5 5
## 32 5 5
## 33 5 4
## 34 2 2
## 35 5 2
## 36 5 5
## 37 1 4
## 38 5 5
## 39 1 1
## 40 2 5
## 41 5 2
## 42 5 4
## 43 3 5
## 44 5 5
## 45 5 1
## 46 5 4
## 47 5 5
## 48 5 5
## 49 5 5
## 50 5 1
## 51 4 4
## 52 5 1
## 53 5 5
## 54 5 5
## 55 5 4
## 56 4 5
## 57 4 2
## 58 5 5
## 59 5 5
## 60 3 1
## 61 5 4
## 62 5 4
## 63 4 4
## 64 5 4
## 65 1 4
## 66 5 4
## 67 5 4
## 68 5 5
## 69 5 5
## 70 4 5
## 71 4 3
## 72 5 5
## 73 5 5
## 74 5 5
## 75 1 5
## 76 5 1
## 77 5 5
## 78 1 5
## 79 1 4
## 80 5 2
## 81 5 5
## 82 4 2
## 83 5 3
## 84 2 2
## 85 5 3
## 86 4 4
## 87 5 5
## 88 5 4
## 89 5 1
## 90 2 1
## 91 5 5
## 92 5 4
## 93 5 2
## 94 3 2
## 95 4 5
## 96 5 5
## 97 5 5
## 98 5 5
## 99 3 3
## 100 4 4
## 101 5 5
## 102 4 5
## 103 2 4
## 104 5 5
## 105 5 5
## 106 5 5
## 107 5 5
## 108 4 4
## 109 4 5
## 110 5 5
## 111 5 5
## 112 5 5
## 113 5 3
## 114 2 4
## 115 4 2
## 116 5 1
## 117 4 4
## 118 5 3
## 119 3 4
## 120 5 5
## 121 5 5
## 122 5 5
## 123 4 2
## 124 5 5
## 125 5 5
## 126 5 5
## 127 3 4
## 128 5 5
## 129 5 3
## 130 1 1
## 131 4 1
## 132 5 3
## 133 5 5
## 134 5 5
## 135 5 5
## 136 5 5
## 137 5 5
## 138 4 3
## 139 5 2
## 140 5 3
## 141 5 4
## 142 5 5
## 143 5 5
## 144 3 4
## 145 4 1
## 146 5 1
## 147 3 1
## 148 4 1
## 149 4 1
## 150 3 1
## 151 4 3
## 152 3 3
## 153 1 1
## 154 4 3
## 155 5 5
## 156 1 5
## 157 5 5
## 158 5 5
## 159 5 5
## 160 2 5
## 161 4 5
## 162 5 5
## 163 5 5
## 164 5 5
## 165 5 5
## 166 5 5
## 167 5 5
## 168 1 5
## 169 5 5
## 170 5 5
## 171 5 5
## 172 5 5
## 173 5 5
## 174 5 5
## 175 5 5
## 176 4 4
## 177 5 5
## 178 4 5
## 179 5 5
## 180 5 1
## 181 5 5
## 182 5 5
## 183 5 5
sum10 <- sum(predRF10==testingSet3$userRatingValue)
length10 <- length(testingSet3$userRatingValue)
accuracy_rfMod10 <- (sum10/length10)
head(accuracy_rfMod10,30)
## [1] 0.5519126
And the last random forest model scored 54 % with 99/183 correct. This model last ran rfMod10 is actually the lowest scoring model for accuracy in rating predictions.We should also add in the accuracy for our manual mean model that took the difference of the reviews word to all words and compared to all documents in each rating’s word to each word of those 12 key words, took the difference, selected the minimum value, and if a tie took the ceiling of the mean of the dot product of the ratings time the votes for each rating and if the ceiling was higher than the highest rating it would select the highest rating. We have this from the Accuracy value variable stored earlier when calculating the correct predicted against total reviews. This was put in the MLr3 table and the code is at 588-589 of this script. The score was 54.397%, which when comparing to these random forest models with varying modifications for validation and NA imputing is in the same range but not the highest, nor the lowest percent.
accuracy10RFModels <- as.data.frame(c(accuracy_rfMod1,
accuracy_rfMod2, accuracy_rfMod3,
accuracy_rfMod4, accuracy_rfMod5,
accuracy_rfMod6, accuracy_rfMod7,
accuracy_rfMod8, accuracy_rfMod9,
accuracy_rfMod10,Accuracy))
colnames(accuracy10RFModels) <- 'accuracyResults'
row.names(accuracy10RFModels) <- c('knnImpute_OOB_5',
'bagImpute_OOB_5','medianImpute_boot_5',
'knnImpute_boot_5','bagImpute_adaptive_cv_5',
'medianImpute_adaptive_boot_5',
'medianImpute_randomSearch_adaptive_cv_5',
'medianImpute_gridSearch_adaptive_cv_5',
'medianImpute_gridSearch_adaptiv_cv_10',
'medianImpute_randomSearch_adaptive_cv_10',
'manualCeilingMedianDifferenceRatios')
accuracy10RFModels
## accuracyResults
## knnImpute_OOB_5 0.5737705
## bagImpute_OOB_5 0.5846995
## medianImpute_boot_5 0.5573770
## knnImpute_boot_5 0.5628415
## bagImpute_adaptive_cv_5 0.5409836
## medianImpute_adaptive_boot_5 0.5573770
## medianImpute_randomSearch_adaptive_cv_5 0.5683060
## medianImpute_gridSearch_adaptive_cv_5 0.5628415
## medianImpute_gridSearch_adaptiv_cv_10 0.5628415
## medianImpute_randomSearch_adaptive_cv_10 0.5519126
## manualCeilingMedianDifferenceRatios 0.5439739
From here I could clean up the text and run the ratios for each and see if other keywords would predict better. Or we could test out the other machine learning models in the caret package any play with their function settings to see if we get better results. We could also use the best model above and test it on the meta data headers extracted from the data. but unable to use for the NAs being dropped with the impute method of NAs.
Lets now test out this same set of ratios of the stopwords in predicting our target variable of the rating value.
knnMod0 <- train(userRatingValue ~ .,
method='knn', preProcess=c('center','scale'),
tuneLength=10, trControl=trainControl(method='adaptive_cv'),
data=trainingSet3)
predKNN0 <- predict(knnMod0, testingSet3)
dfKNN0 <- data.frame(predKNN0, true=testingSet3$userRatingValue)
dfKNN0
## predKNN0 true
## 1 5 5
## 2 5 5
## 3 5 1
## 4 1 1
## 5 5 5
## 6 5 4
## 7 5 5
## 8 5 5
## 9 5 5
## 10 5 4
## 11 5 5
## 12 5 5
## 13 4 4
## 14 5 1
## 15 5 5
## 16 5 4
## 17 5 5
## 18 5 4
## 19 5 5
## 20 5 5
## 21 1 5
## 22 1 1
## 23 5 1
## 24 5 5
## 25 5 3
## 26 5 5
## 27 5 5
## 28 5 1
## 29 5 5
## 30 2 1
## 31 5 5
## 32 5 5
## 33 5 4
## 34 5 2
## 35 5 2
## 36 5 5
## 37 5 4
## 38 5 5
## 39 1 1
## 40 5 5
## 41 5 2
## 42 5 4
## 43 5 5
## 44 1 5
## 45 5 1
## 46 5 4
## 47 5 5
## 48 5 5
## 49 5 5
## 50 5 1
## 51 5 4
## 52 5 1
## 53 5 5
## 54 5 5
## 55 5 4
## 56 5 5
## 57 5 2
## 58 5 5
## 59 5 5
## 60 5 1
## 61 5 4
## 62 5 4
## 63 5 4
## 64 1 4
## 65 1 4
## 66 5 4
## 67 5 4
## 68 1 5
## 69 5 5
## 70 2 5
## 71 1 3
## 72 5 5
## 73 5 5
## 74 5 5
## 75 5 5
## 76 5 1
## 77 5 5
## 78 4 5
## 79 2 4
## 80 5 2
## 81 5 5
## 82 4 2
## 83 5 3
## 84 5 2
## 85 5 3
## 86 4 4
## 87 5 5
## 88 1 4
## 89 5 1
## 90 1 1
## 91 5 5
## 92 4 4
## 93 5 2
## 94 1 2
## 95 5 5
## 96 5 5
## 97 5 5
## 98 5 5
## 99 5 3
## 100 5 4
## 101 5 5
## 102 5 5
## 103 5 4
## 104 5 5
## 105 5 5
## 106 5 5
## 107 5 5
## 108 4 4
## 109 5 5
## 110 5 5
## 111 4 5
## 112 5 5
## 113 5 3
## 114 1 4
## 115 1 2
## 116 1 1
## 117 4 4
## 118 1 3
## 119 5 4
## 120 5 5
## 121 5 5
## 122 5 5
## 123 5 2
## 124 5 5
## 125 5 5
## 126 5 5
## 127 4 4
## 128 5 5
## 129 5 3
## 130 5 1
## 131 2 1
## 132 5 3
## 133 5 5
## 134 5 5
## 135 5 5
## 136 5 5
## 137 5 5
## 138 5 3
## 139 5 2
## 140 5 3
## 141 1 4
## 142 5 5
## 143 5 5
## 144 4 4
## 145 5 1
## 146 5 1
## 147 5 1
## 148 5 1
## 149 5 1
## 150 4 1
## 151 5 3
## 152 4 3
## 153 5 1
## 154 5 3
## 155 5 5
## 156 1 5
## 157 5 5
## 158 5 5
## 159 5 5
## 160 5 5
## 161 4 5
## 162 5 5
## 163 5 5
## 164 5 5
## 165 5 5
## 166 5 5
## 167 5 5
## 168 1 5
## 169 5 5
## 170 5 5
## 171 5 5
## 172 5 5
## 173 5 5
## 174 4 5
## 175 5 5
## 176 5 4
## 177 5 5
## 178 4 5
## 179 5 5
## 180 5 1
## 181 5 5
## 182 5 5
## 183 5 5
sumKNN0 <- sum(predKNN0==testingSet3$userRatingValue)
lengthKNN0 <- length(testingSet3$userRatingValue)
accuracy_knnMod0 <- (sumKNN0/lengthKNN0)
head(accuracy_knnMod0,30)
## [1] 0.5464481
rpartMod0 <- train(userRatingValue ~ ., method='rpart', tuneLength=7, data=trainingSet3)
predRPART0 <- predict(rpartMod0, testingSet3)
dfRPART0 <- data.frame(predRPART0, true=testingSet3$userRatingValue)
dfRPART0
## predRPART0 true
## 1 5 5
## 2 4 5
## 3 5 1
## 4 1 1
## 5 5 5
## 6 4 4
## 7 5 5
## 8 5 5
## 9 5 5
## 10 2 4
## 11 5 5
## 12 5 5
## 13 4 4
## 14 5 1
## 15 4 5
## 16 5 4
## 17 5 5
## 18 5 4
## 19 2 5
## 20 5 5
## 21 5 5
## 22 5 1
## 23 5 1
## 24 5 5
## 25 5 3
## 26 5 5
## 27 5 5
## 28 5 1
## 29 5 5
## 30 2 1
## 31 5 5
## 32 5 5
## 33 5 4
## 34 2 2
## 35 2 2
## 36 5 5
## 37 4 4
## 38 5 5
## 39 5 1
## 40 5 5
## 41 5 2
## 42 4 4
## 43 5 5
## 44 4 5
## 45 5 1
## 46 5 4
## 47 5 5
## 48 5 5
## 49 5 5
## 50 1 1
## 51 4 4
## 52 2 1
## 53 5 5
## 54 5 5
## 55 5 4
## 56 4 5
## 57 5 2
## 58 4 5
## 59 5 5
## 60 2 1
## 61 5 4
## 62 5 4
## 63 5 4
## 64 5 4
## 65 5 4
## 66 5 4
## 67 5 4
## 68 4 5
## 69 5 5
## 70 2 5
## 71 4 3
## 72 5 5
## 73 5 5
## 74 5 5
## 75 1 5
## 76 5 1
## 77 5 5
## 78 5 5
## 79 5 4
## 80 5 2
## 81 5 5
## 82 4 2
## 83 4 3
## 84 2 2
## 85 4 3
## 86 4 4
## 87 5 5
## 88 4 4
## 89 5 1
## 90 2 1
## 91 5 5
## 92 1 4
## 93 1 2
## 94 4 2
## 95 4 5
## 96 1 5
## 97 5 5
## 98 5 5
## 99 2 3
## 100 4 4
## 101 4 5
## 102 4 5
## 103 2 4
## 104 2 5
## 105 5 5
## 106 5 5
## 107 5 5
## 108 4 4
## 109 5 5
## 110 5 5
## 111 5 5
## 112 4 5
## 113 2 3
## 114 2 4
## 115 4 2
## 116 4 1
## 117 4 4
## 118 5 3
## 119 5 4
## 120 4 5
## 121 5 5
## 122 5 5
## 123 5 2
## 124 5 5
## 125 5 5
## 126 5 5
## 127 5 4
## 128 5 5
## 129 5 3
## 130 5 1
## 131 2 1
## 132 5 3
## 133 2 5
## 134 5 5
## 135 5 5
## 136 5 5
## 137 5 5
## 138 5 3
## 139 4 2
## 140 4 3
## 141 4 4
## 142 5 5
## 143 5 5
## 144 5 4
## 145 5 1
## 146 5 1
## 147 5 1
## 148 5 1
## 149 5 1
## 150 5 1
## 151 5 3
## 152 5 3
## 153 5 1
## 154 5 3
## 155 5 5
## 156 5 5
## 157 5 5
## 158 5 5
## 159 5 5
## 160 2 5
## 161 2 5
## 162 5 5
## 163 4 5
## 164 5 5
## 165 5 5
## 166 5 5
## 167 5 5
## 168 5 5
## 169 5 5
## 170 5 5
## 171 5 5
## 172 5 5
## 173 5 5
## 174 5 5
## 175 5 5
## 176 2 4
## 177 5 5
## 178 4 5
## 179 5 5
## 180 5 1
## 181 5 5
## 182 5 5
## 183 5 5
sumRPART0 <- sum(predRPART0==testingSet3$userRatingValue)
lengthRPART0 <- length(testingSet3$userRatingValue)
accuracy_RPARTMod0 <- (sumRPART0/lengthRPART0)
head(accuracy_RPARTMod0,30)
## [1] 0.5136612
RFtunes <- cbind(predDF1[1],predDF2[1],predDF3[1],
predDF4[1],predDF5[1],predDF6[1],
predDF7[1],predDF8[1],predDF9[1],
predDF10[1])
ManualMean <- Reviews15_results$finalPrediction[-inTrain]
predDF11 <- data.frame(RFtunes,ManualMean,dfKNN0[1],dfRPART0[1],
true=testingSet3$userRatingValue)
#the following column name assignment doesn't change the name as intended
colnames(predDF11[12:14]) <-c('predKNN','predRPART','trueValue')
results <- as.data.frame(c(round(accuracy_knnMod0,2),
round(accuracy_RPARTMod0,2),
round(100,2)))
colnames(results) <- 'results'
results$results <- as.factor(paste(results$results))
results1 <- as.data.frame(t(results))
colnames(results1) <- colnames(predDF11[12:14])
acc11 <- as.data.frame(accuracy10RFModels)
colnames(acc11) <- 'results'
acc11$results <- round(acc11$results,2)
acc11$results <- as.factor(paste(acc11$results))
names1 <- colnames(predDF11)[1:10]
row.names(acc11) <- c(names1,'ManualMean')
acc11RFs <- as.data.frame(t(acc11))
resultsAll <- cbind(acc11RFs,results1)
Results <- rbind(predDF11, resultsAll)
#the column names have to be changed here as well
colnames(Results)[12:13] <- c('predKNN','predRPART')
Results
## predRF1 predRF2 predRF3 predRF4 predRF5 predRF6 predRF7 predRF8 predRF9
## 1 5 5 5 5 5 5 5 5 5
## 2 5 5 5 5 4 5 5 4 5
## 3 5 5 5 5 5 5 5 5 5
## 4 1 1 1 1 1 1 1 1 1
## 5 5 5 5 5 5 5 5 5 5
## 6 5 5 5 1 4 5 5 4 5
## 7 5 5 5 5 5 5 5 5 5
## 8 5 5 5 5 5 5 5 5 5
## 9 5 5 5 5 5 5 5 5 5
## 10 5 5 5 5 2 5 5 2 5
## 11 5 5 5 5 5 5 5 5 5
## 12 5 5 5 5 5 5 5 5 5
## 13 4 4 4 4 4 4 4 4 4
## 14 1 1 1 1 1 1 1 1 1
## 15 5 5 5 5 5 5 5 5 5
## 16 5 5 5 5 5 5 5 5 5
## 17 5 5 5 5 5 5 5 5 5
## 18 5 5 5 5 1 5 5 1 5
## 19 5 5 5 5 3 5 5 3 5
## 20 5 5 5 5 5 5 5 5 5
## 21 5 5 5 5 1 5 5 1 5
## 22 5 5 5 5 5 5 5 5 5
## 23 5 5 5 5 5 5 5 5 5
## 24 5 5 5 5 5 5 5 5 5
## 25 5 5 5 5 5 5 5 5 5
## 26 5 5 5 5 5 5 5 5 5
## 27 5 5 5 5 5 5 5 5 5
## 28 5 5 5 5 5 5 5 5 5
## 29 5 5 5 5 5 5 5 5 5
## 30 4 4 4 4 4 4 4 4 4
## 31 5 5 5 5 5 5 5 5 5
## 32 5 5 5 5 5 5 5 5 5
## 33 5 5 5 5 5 5 5 5 5
## 34 2 2 2 2 2 2 2 2 2
## 35 5 5 5 5 5 5 5 5 5
## 36 5 5 5 5 5 5 5 5 5
## 37 5 5 1 1 1 5 4 1 5
## 38 5 5 5 5 5 5 5 5 5
## 39 5 1 5 5 1 5 5 1 5
## 40 2 2 2 2 2 2 2 2 2
## 41 5 5 5 5 5 5 5 5 5
## 42 1 1 1 1 1 1 5 4 1
## 43 5 5 4 4 3 5 5 3 4
## 44 5 5 5 5 5 5 5 5 5
## 45 5 5 5 5 5 5 5 5 5
## 46 5 5 5 5 5 5 5 5 5
## 47 5 5 5 5 5 5 5 5 5
## 48 5 5 5 5 5 5 5 5 5
## 49 5 5 5 5 5 5 5 5 5
## 50 5 5 5 5 5 5 5 1 5
## 51 4 4 4 4 4 4 4 4 4
## 52 5 5 5 5 5 5 5 5 5
## 53 5 5 5 5 5 5 5 5 5
## 54 5 5 5 5 5 5 5 5 5
## 55 5 5 5 5 3 5 5 3 5
## 56 5 5 1 4 4 5 5 5 5
## 57 4 4 4 5 4 4 4 4 4
## 58 5 5 5 5 5 1 1 5 5
## 59 5 5 5 5 5 5 5 5 5
## 60 1 1 1 1 3 3 5 3 3
## 61 5 5 5 5 5 5 5 5 5
## 62 5 5 5 5 5 5 5 5 5
## 63 5 5 5 5 4 5 5 4 5
## 64 5 5 5 5 5 5 5 5 5
## 65 5 5 5 1 1 1 5 1 5
## 66 5 5 5 5 4 5 5 4 5
## 67 5 5 5 5 5 5 5 5 5
## 68 5 5 5 5 5 5 5 5 5
## 69 5 5 5 5 5 5 5 5 5
## 70 5 5 5 5 4 5 5 3 5
## 71 4 4 4 4 1 4 4 4 4
## 72 5 5 5 5 5 5 5 5 5
## 73 5 5 5 5 5 5 5 5 5
## 74 5 5 5 5 5 5 5 5 5
## 75 5 5 5 5 1 5 5 5 5
## 76 5 5 5 5 5 5 5 5 5
## 77 5 5 5 5 5 5 5 5 5
## 78 5 5 1 1 1 5 1 1 5
## 79 1 5 5 5 1 5 5 5 5
## 80 5 5 5 5 5 5 5 5 5
## 81 5 5 5 5 5 5 5 5 5
## 82 4 4 4 4 4 4 4 4 4
## 83 5 5 5 5 5 5 5 4 5
## 84 5 5 5 5 2 5 5 2 5
## 85 5 5 5 5 5 5 5 5 5
## 86 5 5 5 5 4 5 5 4 5
## 87 5 5 5 5 5 5 5 5 5
## 88 5 5 5 5 5 5 5 5 5
## 89 5 5 5 5 5 5 5 5 5
## 90 2 2 2 2 2 2 2 2 2
## 91 5 5 5 5 5 5 5 5 5
## 92 5 5 5 5 5 5 5 1 5
## 93 5 5 5 5 5 5 5 1 5
## 94 4 3 3 4 3 3 3 4 3
## 95 5 5 4 5 4 4 4 4 5
## 96 5 5 5 5 5 5 5 5 5
## 97 5 5 5 5 5 5 5 5 5
## 98 5 5 5 5 5 5 5 5 5
## 99 5 5 5 5 2 5 5 3 5
## 100 4 4 4 4 4 5 4 4 4
## 101 5 5 5 5 5 5 5 5 5
## 102 4 4 4 4 4 4 4 4 4
## 103 2 5 2 5 2 5 5 2 5
## 104 5 5 5 5 5 5 5 5 5
## 105 5 5 5 5 5 5 5 5 5
## 106 5 5 5 5 5 5 5 5 5
## 107 5 5 5 5 5 5 5 5 5
## 108 4 4 4 4 4 4 4 4 4
## 109 5 5 5 5 4 5 5 4 5
## 110 5 5 5 5 5 5 5 5 5
## 111 5 5 5 5 5 5 5 5 5
## 112 5 5 5 5 5 5 5 1 5
## 113 5 5 5 5 5 5 5 3 5
## 114 2 2 2 2 2 2 2 2 2
## 115 3 1 1 1 4 1 3 5 1
## 116 5 5 5 5 5 5 5 5 5
## 117 4 4 4 4 4 4 4 4 4
## 118 5 5 5 5 5 5 5 5 5
## 119 3 3 3 3 3 3 3 3 3
## 120 5 5 5 5 5 5 5 5 5
## 121 5 5 5 5 5 5 5 5 5
## 122 5 5 5 5 5 5 5 5 5
## 123 4 4 4 4 4 4 4 4 4
## 124 5 5 5 5 5 5 5 5 5
## 125 5 5 5 5 1 5 5 1 5
## 126 5 5 5 5 5 5 5 5 5
## 127 3 3 3 3 3 3 3 3 3
## 128 5 5 5 5 5 5 5 5 5
## 129 5 5 5 5 5 5 5 5 5
## 130 1 1 1 1 1 1 1 1 1
## 131 4 4 4 4 4 4 4 4 4
## 132 5 5 5 5 5 5 5 5 5
## 133 5 5 5 5 5 5 5 5 5
## 134 5 5 5 5 5 5 5 5 5
## 135 5 5 5 5 5 5 5 5 5
## 136 5 5 5 5 5 5 5 5 5
## 137 5 5 5 5 5 5 5 5 5
## 138 4 4 4 4 4 4 4 4 4
## 139 5 5 5 5 5 5 5 5 5
## 140 5 5 5 5 5 5 5 5 5
## 141 5 5 5 5 5 5 5 5 5
## 142 5 5 5 5 5 5 5 5 5
## 143 5 5 5 5 5 5 5 5 5
## 144 3 3 3 3 3 3 3 3 3
## 145 4 4 4 4 4 4 4 4 4
## 146 5 5 5 5 5 5 5 5 5
## 147 3 3 3 3 3 3 3 3 3
## 148 4 4 4 4 4 4 4 4 4
## 149 5 5 5 5 4 5 5 4 5
## 150 3 3 3 3 3 3 3 3 3
## 151 4 4 4 4 4 4 4 4 4
## 152 3 3 3 3 3 3 3 3 3
## 153 1 1 1 1 1 1 1 1 1
## 154 5 5 5 5 4 5 5 4 5
## 155 5 5 5 5 5 5 5 5 5
## 156 1 1 1 1 1 1 1 1 1
## 157 5 5 5 5 5 5 5 5 5
## 158 5 5 5 5 5 5 5 5 5
## 159 5 5 5 5 5 5 5 5 5
## 160 5 5 5 5 2 5 5 2 5
## 161 4 5 5 5 4 5 5 2 4
## 162 5 5 5 5 5 5 5 5 5
## 163 5 5 5 5 5 5 5 5 5
## 164 5 5 5 5 5 5 5 5 5
## 165 5 5 5 5 5 5 5 5 5
## 166 5 5 5 5 5 5 5 5 5
## 167 5 5 5 5 5 5 5 5 5
## 168 1 1 1 1 1 1 1 1 1
## 169 5 5 5 5 5 5 5 5 5
## 170 5 5 5 5 5 5 5 5 5
## 171 5 5 5 5 5 5 5 5 5
## 172 5 5 5 5 5 5 5 5 5
## 173 5 5 5 5 5 5 5 5 5
## 174 5 5 5 5 5 5 5 5 5
## 175 5 5 5 5 5 5 5 5 5
## 176 5 5 5 5 4 5 5 2 5
## 177 5 5 5 5 5 5 5 5 5
## 178 4 4 4 4 4 4 5 4 4
## 179 5 5 5 5 5 5 5 5 5
## 180 5 5 5 5 3 5 5 3 5
## 181 5 5 5 5 5 5 5 5 5
## 182 5 5 5 5 5 5 5 5 5
## 183 5 5 5 5 5 5 5 5 5
## results 0.57 0.58 0.56 0.56 0.54 0.56 0.57 0.56 0.56
## predRF10 ManualMean predKNN predRPART true
## 1 5 5 5 5 5
## 2 4 5 5 4 5
## 3 5 5 5 5 1
## 4 1 5 1 1 1
## 5 5 5 5 5 5
## 6 4 5 5 4 4
## 7 5 5 5 5 5
## 8 5 5 5 5 5
## 9 5 5 5 5 5
## 10 2 5 5 2 4
## 11 5 5 5 5 5
## 12 5 5 5 5 5
## 13 4 5 4 4 4
## 14 1 5 5 5 1
## 15 5 5 5 4 5
## 16 5 5 5 5 4
## 17 5 5 5 5 5
## 18 5 5 5 5 4
## 19 3 5 5 2 5
## 20 5 5 5 5 5
## 21 1 5 1 5 5
## 22 5 5 1 5 1
## 23 1 5 5 5 1
## 24 5 5 5 5 5
## 25 5 5 5 5 3
## 26 5 5 5 5 5
## 27 5 5 5 5 5
## 28 5 5 5 5 1
## 29 5 5 5 5 5
## 30 4 5 2 2 1
## 31 5 5 5 5 5
## 32 5 5 5 5 5
## 33 5 5 5 5 4
## 34 2 5 5 2 2
## 35 5 5 5 2 2
## 36 5 5 5 5 5
## 37 1 5 5 4 4
## 38 5 5 5 5 5
## 39 1 5 1 5 1
## 40 2 5 5 5 5
## 41 5 5 5 5 2
## 42 5 5 5 4 4
## 43 3 5 5 5 5
## 44 5 5 1 4 5
## 45 5 5 5 5 1
## 46 5 5 5 5 4
## 47 5 5 5 5 5
## 48 5 5 5 5 5
## 49 5 5 5 5 5
## 50 5 5 5 1 1
## 51 4 5 5 4 4
## 52 5 5 5 2 1
## 53 5 5 5 5 5
## 54 5 5 5 5 5
## 55 5 5 5 5 4
## 56 4 5 5 4 5
## 57 4 5 5 5 2
## 58 5 5 5 4 5
## 59 5 5 5 5 5
## 60 3 5 5 2 1
## 61 5 5 5 5 4
## 62 5 5 5 5 4
## 63 4 5 5 5 4
## 64 5 5 1 5 4
## 65 1 5 1 5 4
## 66 5 2 5 5 4
## 67 5 5 5 5 4
## 68 5 5 1 4 5
## 69 5 5 5 5 5
## 70 4 5 2 2 5
## 71 4 5 1 4 3
## 72 5 5 5 5 5
## 73 5 5 5 5 5
## 74 5 5 5 5 5
## 75 1 5 5 1 5
## 76 5 5 5 5 1
## 77 5 5 5 5 5
## 78 1 5 4 5 5
## 79 1 5 2 5 4
## 80 5 5 5 5 2
## 81 5 5 5 5 5
## 82 4 5 4 4 2
## 83 5 5 5 4 3
## 84 2 5 5 2 2
## 85 5 5 5 4 3
## 86 4 5 4 4 4
## 87 5 5 5 5 5
## 88 5 5 1 4 4
## 89 5 5 5 5 1
## 90 2 5 1 2 1
## 91 5 5 5 5 5
## 92 5 5 4 1 4
## 93 5 5 5 1 2
## 94 3 5 1 4 2
## 95 4 5 5 4 5
## 96 5 5 5 1 5
## 97 5 4 5 5 5
## 98 5 5 5 5 5
## 99 3 5 5 2 3
## 100 4 5 5 4 4
## 101 5 5 5 4 5
## 102 4 5 5 4 5
## 103 2 5 5 2 4
## 104 5 5 5 2 5
## 105 5 5 5 5 5
## 106 5 5 5 5 5
## 107 5 5 5 5 5
## 108 4 5 4 4 4
## 109 4 5 5 5 5
## 110 5 5 5 5 5
## 111 5 5 4 5 5
## 112 5 5 5 4 5
## 113 5 5 5 2 3
## 114 2 5 1 2 4
## 115 4 5 1 4 2
## 116 5 5 1 4 1
## 117 4 5 4 4 4
## 118 5 5 1 5 3
## 119 3 5 5 5 4
## 120 5 5 5 4 5
## 121 5 5 5 5 5
## 122 5 5 5 5 5
## 123 4 5 5 5 2
## 124 5 5 5 5 5
## 125 5 5 5 5 5
## 126 5 5 5 5 5
## 127 3 5 4 5 4
## 128 5 5 5 5 5
## 129 5 5 5 5 3
## 130 1 5 5 5 1
## 131 4 5 2 2 1
## 132 5 5 5 5 3
## 133 5 5 5 2 5
## 134 5 5 5 5 5
## 135 5 5 5 5 5
## 136 5 5 5 5 5
## 137 5 5 5 5 5
## 138 4 5 5 5 3
## 139 5 5 5 4 2
## 140 5 5 5 4 3
## 141 5 5 1 4 4
## 142 5 5 5 5 5
## 143 5 5 5 5 5
## 144 3 5 4 5 4
## 145 4 5 5 5 1
## 146 5 5 5 5 1
## 147 3 5 5 5 1
## 148 4 5 5 5 1
## 149 4 1 5 5 1
## 150 3 5 4 5 1
## 151 4 5 5 5 3
## 152 3 5 4 5 3
## 153 1 5 5 5 1
## 154 4 4 5 5 3
## 155 5 5 5 5 5
## 156 1 5 1 5 5
## 157 5 5 5 5 5
## 158 5 5 5 5 5
## 159 5 5 5 5 5
## 160 2 5 5 2 5
## 161 4 5 4 2 5
## 162 5 5 5 5 5
## 163 5 5 5 4 5
## 164 5 5 5 5 5
## 165 5 5 5 5 5
## 166 5 5 5 5 5
## 167 5 5 5 5 5
## 168 1 5 1 5 5
## 169 5 5 5 5 5
## 170 5 5 5 5 5
## 171 5 5 5 5 5
## 172 5 5 5 5 5
## 173 5 5 5 5 5
## 174 5 5 4 5 5
## 175 5 5 5 5 5
## 176 4 5 5 2 4
## 177 5 5 5 5 5
## 178 4 5 4 4 5
## 179 5 5 5 5 5
## 180 5 5 5 5 1
## 181 5 5 5 5 5
## 182 5 5 5 5 5
## 183 5 5 5 5 5
## results 0.55 0.54 0.55 0.51 100
# The Random Forest package
rfpkg <- randomForest(userRatingValue~., data=trainingSet3, method='class')
predRFpkg <- predict(rfpkg, testingSet3, type='class')
sumRFpkg <- sum(predRFpkg==testingSet3$userRatingValue)
lengthRFpkg <- length(testingSet3$userRatingValue)
accuracy_RFpkg <- sumRFpkg/lengthRFpkg
# confusionMatrix(predRFpkg, testingSet3$userRatingValue)
# generalizedBoostedModel
gbmMod <- train(userRatingValue~., method='gbm', data=trainingSet3, verbose=FALSE )
predGbm <- predict(gbmMod, testingSet3)
sumGBM0 <- sum(predGbm==testingSet3$userRatingValue)
lengthGBM0 <- length(testingSet3$userRatingValue)
accuracy_gbmMod <- sumGBM0/lengthGBM0
# linkage dirichlet allocation model
ldaMod <- train(userRatingValue~., method='lda', data=trainingSet3)
predlda <- predict(ldaMod, testingSet3)
sumLDA0 <- sum(predlda==testingSet3$userRatingValue)
lengthLDA0 <- length(testingSet3$userRatingValue)
accuracy_ldaMod <- sumLDA0/lengthLDA0
CombinedGAM <- train(true~., method='gam', data=predDF11)
## Loading required package: mgcv
## Loading required package: nlme
##
## Attaching package: 'nlme'
## The following object is masked from 'package:dplyr':
##
## collapse
## This is mgcv 1.8-31. For overview type 'help("mgcv-package")'.
## Warning in gam.fit3(x = args$X, y = args$y, sp = lsp, Eb = args$Eb, UrS =
## args$UrS, : fitted probabilities numerically 0 or 1 occurred
## Warning in gam.fit3(x = args$X, y = args$y, sp = lsp, Eb = args$Eb, UrS =
## args$UrS, : fitted probabilities numerically 0 or 1 occurred
## Warning in gam.fit3(x = args$X, y = args$y, sp = lsp, Eb = args$Eb, UrS =
## args$UrS, : fitted probabilities numerically 0 or 1 occurred
## Warning in gam.fit3(x = args$X, y = args$y, sp = lsp, Eb = args$Eb, UrS =
## args$UrS, : fitted probabilities numerically 0 or 1 occurred
## Warning in gam.fit3(x = args$X, y = args$y, sp = lsp, Eb = args$Eb, UrS =
## args$UrS, : fitted probabilities numerically 0 or 1 occurred
## Warning in gam.fit3(x = args$X, y = args$y, sp = lsp, Eb = args$Eb, UrS =
## args$UrS, : fitted probabilities numerically 0 or 1 occurred
## Warning in gam.fit3(x = args$X, y = args$y, sp = lsp, Eb = args$Eb, UrS =
## args$UrS, : fitted probabilities numerically 0 or 1 occurred
## Warning in gam.fit3(x = args$X, y = args$y, sp = lsp, Eb = args$Eb, UrS =
## args$UrS, : fitted probabilities numerically 0 or 1 occurred
## Warning in gam.fit3(x = args$X, y = args$y, sp = lsp, Eb = args$Eb, UrS =
## args$UrS, : fitted probabilities numerically 0 or 1 occurred
## Warning in gam.fit3(x = args$X, y = args$y, sp = lsp, Eb = args$Eb, UrS =
## args$UrS, : fitted probabilities numerically 0 or 1 occurred
## Warning in gam.fit3(x = args$X, y = args$y, sp = lsp, Eb = args$Eb, UrS =
## args$UrS, : fitted probabilities numerically 0 or 1 occurred
## Warning in gam.fit3(x = args$X, y = args$y, sp = lsp, Eb = args$Eb, UrS =
## args$UrS, : fitted probabilities numerically 0 or 1 occurred
## Warning in gam.fit3(x = args$X, y = args$y, sp = lsp, Eb = args$Eb, UrS =
## args$UrS, : fitted probabilities numerically 0 or 1 occurred
## Warning in gam.fit3(x = args$X, y = args$y, sp = lsp, Eb = args$Eb, UrS =
## args$UrS, : fitted probabilities numerically 0 or 1 occurred
## Warning in gam.fit3(x = args$X, y = args$y, sp = lsp, Eb = args$Eb, UrS =
## args$UrS, : fitted probabilities numerically 0 or 1 occurred
## Warning in gam.fit3(x = args$X, y = args$y, sp = lsp, Eb = args$Eb, UrS =
## args$UrS, : fitted probabilities numerically 0 or 1 occurred
## Warning in gam.fit3(x = args$X, y = args$y, sp = lsp, Eb = args$Eb, UrS =
## args$UrS, : fitted probabilities numerically 0 or 1 occurred
## Warning in gam.fit3(x = args$X, y = args$y, sp = lsp, Eb = args$Eb, UrS =
## args$UrS, : fitted probabilities numerically 0 or 1 occurred
## Warning in gam.fit3(x = args$X, y = args$y, sp = lsp, Eb = args$Eb, UrS =
## args$UrS, : fitted probabilities numerically 0 or 1 occurred
## Warning in gam.fit3(x = args$X, y = args$y, sp = lsp, Eb = args$Eb, UrS =
## args$UrS, : fitted probabilities numerically 0 or 1 occurred
## Warning in gam.fit3(x = args$X, y = args$y, sp = lsp, Eb = args$Eb, UrS =
## args$UrS, : fitted probabilities numerically 0 or 1 occurred
## Warning in gam.fit3(x = args$X, y = args$y, sp = lsp, Eb = args$Eb, UrS =
## args$UrS, : fitted probabilities numerically 0 or 1 occurred
## Warning in gam.fit3(x = args$X, y = args$y, sp = lsp, Eb = args$Eb, UrS =
## args$UrS, : fitted probabilities numerically 0 or 1 occurred
## Warning in gam.fit3(x = args$X, y = args$y, sp = lsp, Eb = args$Eb, UrS =
## args$UrS, : fitted probabilities numerically 0 or 1 occurred
## Warning in gam.fit3(x = args$X, y = args$y, sp = lsp, Eb = args$Eb, UrS =
## args$UrS, : fitted probabilities numerically 0 or 1 occurred
## Warning in gam.fit3(x = args$X, y = args$y, sp = lsp, Eb = args$Eb, UrS =
## args$UrS, : fitted probabilities numerically 0 or 1 occurred
## Warning in gam.fit3(x = args$X, y = args$y, sp = lsp, Eb = args$Eb, UrS =
## args$UrS, : fitted probabilities numerically 0 or 1 occurred
## Warning in gam.fit3(x = args$X, y = args$y, sp = lsp, Eb = args$Eb, UrS =
## args$UrS, : fitted probabilities numerically 0 or 1 occurred
## Warning in gam.fit3(x = args$X, y = args$y, sp = lsp, Eb = args$Eb, UrS =
## args$UrS, : fitted probabilities numerically 0 or 1 occurred
## Warning in gam.fit3(x = args$X, y = args$y, sp = lsp, Eb = args$Eb, UrS =
## args$UrS, : fitted probabilities numerically 0 or 1 occurred
## Warning in gam.fit3(x = args$X, y = args$y, sp = lsp, Eb = args$Eb, UrS =
## args$UrS, : fitted probabilities numerically 0 or 1 occurred
## Warning in gam.fit3(x = args$X, y = args$y, sp = lsp, Eb = args$Eb, UrS =
## args$UrS, : fitted probabilities numerically 0 or 1 occurred
## Warning in gam.fit3(x = args$X, y = args$y, sp = lsp, Eb = args$Eb, UrS =
## args$UrS, : fitted probabilities numerically 0 or 1 occurred
## Warning in gam.fit3(x = args$X, y = args$y, sp = lsp, Eb = args$Eb, UrS =
## args$UrS, : fitted probabilities numerically 0 or 1 occurred
## Warning in gam.fit3(x = args$X, y = args$y, sp = lsp, Eb = args$Eb, UrS =
## args$UrS, : fitted probabilities numerically 0 or 1 occurred
## Warning in gam.fit3(x = args$X, y = args$y, sp = lsp, Eb = args$Eb, UrS =
## args$UrS, : fitted probabilities numerically 0 or 1 occurred
## Warning in gam.fit3(x = args$X, y = args$y, sp = lsp, Eb = args$Eb, UrS =
## args$UrS, : fitted probabilities numerically 0 or 1 occurred
## Warning in gam.fit3(x = args$X, y = args$y, sp = lsp, Eb = args$Eb, UrS =
## args$UrS, : fitted probabilities numerically 0 or 1 occurred
## Warning in gam.fit3(x = args$X, y = args$y, sp = lsp, Eb = args$Eb, UrS =
## args$UrS, : fitted probabilities numerically 0 or 1 occurred
## Warning in gam.fit3(x = args$X, y = args$y, sp = lsp, Eb = args$Eb, UrS =
## args$UrS, : fitted probabilities numerically 0 or 1 occurred
## Warning in gam.fit3(x = args$X, y = args$y, sp = lsp, Eb = args$Eb, UrS =
## args$UrS, : fitted probabilities numerically 0 or 1 occurred
## Warning in gam.fit3(x = args$X, y = args$y, sp = lsp, Eb = args$Eb, UrS =
## args$UrS, : fitted probabilities numerically 0 or 1 occurred
## Warning in gam.fit3(x = args$X, y = args$y, sp = lsp, Eb = args$Eb, UrS =
## args$UrS, : fitted probabilities numerically 0 or 1 occurred
## Warning in gam.fit3(x = args$X, y = args$y, sp = lsp, Eb = args$Eb, UrS =
## args$UrS, : fitted probabilities numerically 0 or 1 occurred
## Warning in gam.fit3(x = args$X, y = args$y, sp = lsp, Eb = args$Eb, UrS =
## args$UrS, : fitted probabilities numerically 0 or 1 occurred
## Warning in gam.fit3(x = args$X, y = args$y, sp = lsp, Eb = args$Eb, UrS =
## args$UrS, : fitted probabilities numerically 0 or 1 occurred
## Warning in gam.fit3(x = args$X, y = args$y, sp = lsp, Eb = args$Eb, UrS =
## args$UrS, : fitted probabilities numerically 0 or 1 occurred
## Warning in gam.fit3(x = args$X, y = args$y, sp = lsp, Eb = args$Eb, UrS =
## args$UrS, : fitted probabilities numerically 0 or 1 occurred
## Warning in gam.fit3(x = args$X, y = args$y, sp = lsp, Eb = args$Eb, UrS =
## args$UrS, : fitted probabilities numerically 0 or 1 occurred
CombinedGAMPredictions <- predict(CombinedGAM, predDF11)
predDF12 <- data.frame(predDF11[1:13], predRFpkg, predGbm, predlda,
CombinedGAMPredictions,
true=testingSet3$userRatingValue)
sumCP <- sum(CombinedGAMPredictions==testingSet3$userRatingValue)
lengthCP <- length(testingSet3$userRatingValue)
accuracy_CP1 <- sumCP/lengthCP
results3 <- as.data.frame(c(accuracy_RFpkg, accuracy_gbmMod,
accuracy_ldaMod, accuracy_CP1, round(100,2)))
colnames(results3) <- 'results'
results3$results <- round(results3$results,2)
results3$results <- as.factor(paste(results3$results))
results4 <- as.data.frame(t(results3))
colnames(results4) <- colnames(predDF12)[14:18]
results5 <- cbind(resultsAll[1:13],results4)
accuracyAllResults <- rbind(predDF12,results5)
write.csv(accuracyAllResults,'accuracyAllResults.csv',row.names=TRUE)
topbottom <- accuracyAllResults[c(1:10,175:184),]
topbottom
## predRF1 predRF2 predRF3 predRF4 predRF5 predRF6 predRF7 predRF8 predRF9
## 6 5 5 5 5 5 5 5 5 5
## 9 5 5 5 5 4 5 5 4 5
## 16 5 5 5 5 5 5 5 5 5
## 17 1 1 1 1 1 1 1 1 1
## 21 5 5 5 5 5 5 5 5 5
## 25 5 5 5 1 4 5 5 4 5
## 27 5 5 5 5 5 5 5 5 5
## 33 5 5 5 5 5 5 5 5 5
## 34 5 5 5 5 5 5 5 5 5
## 35 5 5 5 5 2 5 5 2 5
## 573 5 5 5 5 5 5 5 5 5
## 575 5 5 5 5 4 5 5 2 5
## 576 5 5 5 5 5 5 5 5 5
## 578 4 4 4 4 4 4 5 4 4
## 581 5 5 5 5 5 5 5 5 5
## 588 5 5 5 5 3 5 5 3 5
## 589 5 5 5 5 5 5 5 5 5
## 591 5 5 5 5 5 5 5 5 5
## 608 5 5 5 5 5 5 5 5 5
## results 0.57 0.58 0.56 0.56 0.54 0.56 0.57 0.56 0.56
## predRF10 ManualMean predKNN0 predRPART0 predRFpkg predGbm predlda
## 6 5 5 5 5 5 5 5
## 9 4 5 5 4 5 5 5
## 16 5 5 5 5 5 5 5
## 17 1 5 1 1 1 1 5
## 21 5 5 5 5 5 5 5
## 25 4 5 5 4 4 1 1
## 27 5 5 5 5 5 5 5
## 33 5 5 5 5 5 5 5
## 34 5 5 5 5 5 5 5
## 35 2 5 5 2 3 5 5
## 573 5 5 5 5 5 5 5
## 575 4 5 5 2 5 5 5
## 576 5 5 5 5 5 5 5
## 578 4 5 4 4 4 5 4
## 581 5 5 5 5 5 5 5
## 588 5 5 5 5 5 5 5
## 589 5 5 5 5 5 5 5
## 591 5 5 5 5 5 5 5
## 608 5 5 5 5 5 5 5
## results 0.55 0.54 0.55 0.51 0.54 0.54 0.49
## CombinedGAMPredictions true
## 6 2 5
## 9 2 5
## 16 2 1
## 17 1 1
## 21 2 5
## 25 2 4
## 27 2 5
## 33 2 5
## 34 2 5
## 35 2 4
## 573 2 5
## 575 2 4
## 576 2 5
## 578 2 5
## 581 2 5
## 588 2 1
## 589 2 5
## 591 2 5
## 608 2 5
## results 0.12 100
The combined model method of ‘gam’ in the caret package to train the data is supposed to look at all the predicted results and vote on the best value. There is some tuning that needs to be done, or else it is unable to classify more than two classes, hence the 1s and 2s for each prediction. It’s accuracy is 10%, the others ranged from 49-57% in accuracy.
Use regression instead of classification for Generalized Linear Machines, temporarily change the target to a numeric instead of factor data type.
# glmMod0 <- train(userRatingValue ~ ., metric='Accuracy',
# method='glm', data=trainingSet3)
############################
trainingSet3$userRatingValue <- as.numeric(paste(trainingSet3$userRatingValue))
testingSet3$userRatingValue <- as.numeric(paste(testingSet3$userRatingValue))
glmMod0 <- train(userRatingValue ~ .,
method='glm', data=trainingSet3)
The above isn’t letting the glm method for linear classification be used, even with the target turned into a numeric data type and using the metric ‘Kappa’ or ‘Accuracy’ for classification. So it will be omitted, it was working with a rounded output of 27% accuracy, but then it stopped working all together when re-ran again.
predGLM0 <- predict(glmMod0, testingSet3) #a numeric vector data type
dfGLM0 <- data.frame(predGLM0,
type=testingSet3$userRatingValue)
dfGLM0$predGLM0 <- round(dfGLM0$predGLM0,0)
dfGLM0$predGLM0 <- ifelse(dfGLM0$predGLM0>5,5,dfGLM0$predGLM0)
dfGLM0$predGLM0 <- as.factor(paste(dfGLM0$predGLM0))
head(dfGLM0)
## predGLM0 type
## 6 4 5
## 9 4 5
## 16 5 1
## 17 3 1
## 21 5 5
## 25 3 4
sumGLM0 <- sum(dfGLM0$predGLM0==testingSet3$userRatingValue)
lengthGLM0 <- length(testingSet3$userRatingValue)
accuracy_GLMMod0 <- (sumGLM0/lengthGLM0)
accuracy_GLMMod0
## [1] 0.284153
We see that the generalized linear model didn’t see a noticeable way the ratios are connected to the outputs, and the accuracy was the worst with 28% accuracy. Worse than any of all the random forest tuned variations of models, the topic modeler Latent Dirichlet Allocation, the generalized boosted machines, the k-nearest neighbor, the recursive partitioned trees, except for the gam voting model that scored 10%. Lets change the target back to a factor data type.
testingSet3$userRatingValue <- as.factor(paste(testingSet3$userRatingValue))
trainingSet3$userRatingValue <- as.factor(paste(trainingSet3$userRatingValue))
What is left to do from here would be to change out the original filtering of top words and exclude the stopwords in the prepocessing step where we built the document term matrices, and also for the manual program we built to replace all searched words to those 12 words that are top for each rating when all stopwords are not included. This will be a separate file and not an attachment.