R is not available on the Heroku platform, so I present my exploratory analysis on this RPubs page instead. The Yelp review data were preprocessed and saved in “.rds” format for the exploratory analysis. The dataset, Assoc, contains two sets of objects: the words most strongly correlated with “food” in the review texts for the given restaurant, Chipotle Mexican Grill, and the corresponding correlations of those words. Below I compare the datasets for 5-, 3-, and 1-star reviews.
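As a minimal sketch of the storage format described above, the snippet below saves a small list to an “.rds” file, reads it back, and combines its two parts into a table. The contents of `Assoc` here, the element names `word` and `correlation`, and the temporary file path are assumptions made for illustration; the real object and its file are not shown in this report.

```r
# Hypothetical stand-in for the preprocessed data: the real Assoc object
# and its file name are not part of this report.
Assoc <- list(
  word        = c("fast", "fresh", "servic"),
  correlation = c(0.75, 0.80, 0.53)
)

# Save to a temporary .rds file and read it back.
rds_path <- tempfile(fileext = ".rds")
saveRDS(Assoc, rds_path)
Assoc_loaded <- readRDS(rds_path)

# Combine the two parts into one table, strongest correlation first.
assoc_table <- data.frame(
  word        = Assoc_loaded$word,
  correlation = Assoc_loaded$correlation,
  stringsAsFactors = FALSE
)
assoc_table <- assoc_table[order(-assoc_table$correlation), ]
print(assoc_table)
```

`saveRDS()` and `readRDS()` store a single R object per file, which is why the words and their correlations are bundled into one list here.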
The dataset below is drawn from 5-star Yelp reviews.
The table of the most correlated words is shown below.
##         correlation    word
## fresh          0.80   fresh
## fast           0.75    fast
## actual         0.54  actual
## can            0.54     can
## consid         0.54  consid
## lifesav        0.54 lifesav
## way            0.54     way
## work           0.54    work
## servic         0.53  servic
## mexican        0.51 mexican
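The correlations in a table like the one above can be sketched as follows: count each term per review, then correlate every term's counts with the counts of “food”. This is a base R illustration on invented reviews, not necessarily the method used for this report (a text-mining package may have been used instead); the objects `reviews`, `dtm`, and `food_cor` are hypothetical names.

```r
# Toy reviews (invented for illustration, not taken from the Yelp data).
reviews <- c(
  "fresh food fast service",
  "food was fresh and fast",
  "slow line today",
  "fresh food again",
  "service was slow"
)

# Build a document-term matrix of word counts using base R only:
# one row per review, one column per term.
tokens <- strsplit(tolower(reviews), "[^a-z]+")
vocab  <- sort(unique(unlist(tokens)))
dtm    <- t(sapply(tokens, function(doc) table(factor(doc, levels = vocab))))

# Pearson correlation of every other term's counts with the counts of "food".
food_cor <- cor(dtm[, colnames(dtm) != "food"], dtm[, "food"])[, 1]

# Keep the most correlated terms, rounded to two digits as in the tables.
top <- round(sort(food_cor, decreasing = TRUE)[1:3], 2)
print(top)
```

In this toy corpus “fresh” occurs in exactly the same reviews as “food”, so it ranks first, while “slow” occurs only in reviews without “food” and is negatively correlated.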
The dataset below is drawn from 3-star Yelp reviews.
The table of the most correlated words is shown below.
##                        correlation                   word
## court                         0.74                  court
## fast                          0.74                   fast
## averagennneutralsnlong        0.66 averagennneutralsnlong
## latenloc                      0.66               latenloc
## look                          0.66                   look
## might                         0.66                  might
## neven                         0.66                  neven
## nfood                         0.66                  nfood
## proper                        0.66                 proper
## prosnopen                     0.66              prosnopen
The dataset below is drawn from 1-star Yelp reviews.
The table of the most correlated words is shown below.
##           correlation      word
## afternoon           1 afternoon
## almost              1    almost
## alreadi             1   alreadi
## alway               1     alway
## anoth               1     anoth
## appear              1    appear
## area                1      area
## awar                1      awar
## away                1      away
## beer                1      beer
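As you can see, for the higher star ratings, words such as “fresh” and “fast” have the highest correlations with “food”. For the 1-star reviews, none of the 10 most correlated words carries sentiment. Two caveats apply. Tokens such as “nfood” and “prosnopen” in the 3-star table appear to be residue of line-break escapes (“\n”) left in the text during preprocessing. In the 1-star table every correlation equals 1 and the words are listed alphabetically, which suggests that the 1-star subset is too small for these correlations to be informative.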
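One possible explanation for a table in which every correlation equals 1 is a very small number of reviews. The number of 1-star reviews is not stated in this report, so the following is only an illustration under that assumption, with invented counts: when there are just two documents, any two terms whose counts differ between them are perfectly correlated.

```r
# Two invented reviews: each term's count vector has only two entries,
# so every pair of non-constant terms lies on a straight line.
dtm_small <- rbind(
  doc1 = c(food = 2, afternoon = 1, beer = 3, area = 1),
  doc2 = c(food = 0, afternoon = 0, beer = 1, area = 0)
)

# Correlation of each remaining term with "food".
cor_small <- cor(dtm_small[, -1], dtm_small[, "food"])[, 1]
print(cor_small)
```

With so few documents the ranking carries no information about which words actually accompany “food”, so the top 10 simply follows the tie-breaking order.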
The script box for the proposal is limited to 10,000 characters, which does not leave enough room for the data-cleaning function for the business dataset, so I put the script here instead.
The script of the dataclean function for the business dataset is shown below.
# Parse one raw JSON line of the Yelp business dataset into a character matrix
# (28 columns: 14 fields followed by the close/open times of each weekday).
businessdataclean <- function(x){
  # Drop all braces, then split the record into "key": value fragments.
  a <- x[[1]]
  a <- gsub("[{}]", "", a)
  b <- unlist(strsplit(a, ", \""))

  # For each field, find its fragment and strip the key (and the quotes of string values).
  # The patterns use alternation: a character class such as "[business_id\":]" would
  # also delete every one of those characters from the value itself.
  business_id_ind <- grep("business_id\":", b)
  b[business_id_ind] <- gsub("(business_id\":)|(\")", "", b[business_id_ind])
  full_address_ind <- grep("full_address\":", b)
  b[full_address_ind] <- gsub("(full_address\":)|(\")", "", b[full_address_ind])
  # "open\":" also matches the daily opening times, so only the last match is kept.
  open_ind <- grep("open\":", b)
  open_ind <- open_ind[length(open_ind)]
  b[open_ind] <- gsub("open\":", "", b[open_ind])
  categories_ind <- grep("categories\":", b)
  b[categories_ind] <- gsub("(categories\":)|([\"])|(\\[)", "", b[categories_ind])
  city_ind <- grep("city\":", b)
  b[city_ind] <- gsub("(city\":)|(\")", "", b[city_ind])
  review_count_ind <- grep("review_count\":", b)
  b[review_count_ind] <- gsub("review_count\":", "", b[review_count_ind])
  name_ind <- grep("name\":", b)
  b[name_ind] <- gsub("(name\":)|(\")", "", b[name_ind])
  neighborhoods_ind <- grep("neighborhoods\":", b)
  b[neighborhoods_ind] <- gsub("neighborhoods\":", "", b[neighborhoods_ind])
  longitude_ind <- grep("longitude\":", b)
  b[longitude_ind] <- gsub("longitude\":", "", b[longitude_ind])
  state_ind <- grep("state\":", b)
  b[state_ind] <- gsub("(state\":)|(\")", "", b[state_ind])
  stars_ind <- grep("stars\":", b)
  b[stars_ind] <- gsub("stars\":", "", b[stars_ind])
  latitude_ind <- grep("latitude\":", b)
  b[latitude_ind] <- gsub("latitude\":", "", b[latitude_ind])
  attributes_ind <- grep("attributes\":", b)
  b[attributes_ind] <- gsub("(attributes\":)|(\")", "", b[attributes_ind])
  type_ind <- grep("type\":", b)
  b[type_ind] <- gsub("(type\":)|(\")", "", b[type_ind])

  # The fragments between full_address and the open flag hold the opening hours.
  hour_ind <- seq(from=full_address_ind+1, to=open_ind-1)
  # remove hours from string
  if(length(hour_ind) == 1){
    b[hour_ind] <- NA   # a single fragment means no hours were listed
  }else{
    b[hour_ind[1]] <- gsub("hours\": \"", "", b[hour_ind[1]])
  }

  # Closing and opening time of one weekday, or NA when the day is not listed.
  # The "open" fragment directly follows the fragment holding the day's "close".
  day_hours <- function(day){
    ind_close <- grep(day, b[hour_ind])
    if(length(ind_close) == 0) return(list(close=NA, open=NA))
    list(close = gsub(paste0("(", day, "\": \"close\": \")|(\")"), "", b[hour_ind[ind_close]]),
         open  = gsub("(open\": \")|(\")", "", b[hour_ind[ind_close + 1]]))
  }
  Mon <- day_hours("Monday")
  Tues <- day_hours("Tuesday")
  Wednes <- day_hours("Wednesday")
  Thurs <- day_hours("Thursday")
  Fri <- day_hours("Friday")
  Satur <- day_hours("Saturday")
  Sun <- day_hours("Sunday")

  # categories and attributes span several fragments: rejoin them,
  # then strip the remaining quotes and square brackets.
  categories_startind <- open_ind+1
  categories_endind <- city_ind-1
  attributes_startind <- latitude_ind+1
  attributes_endind <- type_ind-1
  b[categories_ind] <- paste(b[categories_startind:categories_endind], collapse=",")
  b[categories_ind] <- gsub("([\"])|(\\[)|(\\])", "", b[categories_ind])
  b[attributes_ind] <- paste(b[attributes_startind:attributes_endind], collapse=",")
  b[attributes_ind] <- gsub("([\"])|(\\[)|(\\])", "", b[attributes_ind])

  # Assemble the cleaned fields and the weekday hours into one row.
  data <- cbind(b[business_id_ind], b[full_address_ind], b[open_ind], b[categories_ind], b[city_ind], b[review_count_ind], b[name_ind], b[neighborhoods_ind], b[longitude_ind], b[state_ind], b[stars_ind], b[latitude_ind], b[attributes_ind], b[type_ind],
                Mon_close=Mon$close, Mon_open=Mon$open, Tues_close=Tues$close, Tues_open=Tues$open,
                Wednes_close=Wednes$close, Wednes_open=Wednes$open, Thurs_close=Thurs$close, Thurs_open=Thurs$open,
                Fri_close=Fri$close, Fri_open=Fri$open, Satur_close=Satur$close, Satur_open=Satur$open,
                Sun_close=Sun$close, Sun_open=Sun$open)
  return(data)
}
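To make the split-and-strip approach of the function easier to follow, here is a small sketch of its first steps on a shortened record. The sample line `x` and its values are invented for illustration and contain only three of the fields. The last lines also show a regular-expression detail that matters when stripping keys this way: square brackets form a character class, which removes each listed character wherever it occurs, whereas parentheses with `|` remove only the whole key and the quotes.

```r
# Invented sample record in the style of the Yelp business JSON (shortened).
x <- '{"business_id": "abc123", "city": "Phoenix", "stars": 4.5}'

# Same first steps as the function: drop braces, split into key/value fragments.
a <- gsub("[{}]", "", x)
b <- unlist(strsplit(a, ", \""))
print(b)

# Alternation removes the key and the quotes, leaving the value
# (with a leading space, trimmed here for display).
city_ind <- grep("city\":", b)
city <- trimws(gsub("(city\":)|(\")", "", b[city_ind]))
print(city)

# Pitfall: a character class deletes every listed character,
# including those that occur inside the value.
id_ind  <- grep("business_id\":", b)
damaged <- gsub("[business_id\":]", "", b[id_ind])       # the "b" of the ID is lost
kept    <- gsub("(business_id\":)|(\")", "", b[id_ind])  # the ID is intact
print(c(damaged, kept))
```

Parsing JSON by string splitting depends on the field order and on values that never contain the separator, so a dedicated JSON parser is a sturdier choice when one is available.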