A new business owner is typically interested in finding potential users and delivering high-quality advertisements to them. Yelp, meanwhile, is a multi-dimensional recommendation and review system whose data covers several aspects: users, businesses, and reviews. The purpose of this work is to use the Yelp dataset to answer the following key question:
Can insightful advertisements be generated and delivered to the potential target users of a new business?
An insightful advertisement is one that contains the key features of the particular domain to which the new business belongs. These key features should reflect the users’ potential needs. They may already exist in the leading businesses of the domain and be something the new business owner aims to match.
The target users are those Yelp members who are likely to show great interest in the business. Typically they are either experts in the business domain or people with a strong need for the services of that domain.
The Yelp dataset contains five JSON files: business, checkin, review, tip, and user. In this research we use the business, review, and user data. The business and review data are summarized in the following tables (listing only the columns we use); from the user data we use only user_id and name.
Business table:

| ColumnName | Type | Details |
|---|---|---|
| business_id | array of strings | each contains 22 characters |
| categories | list of string arrays | each array lists the categories of the related business |
| name | array of strings | name of the business |
| attributes | data frame | 38 columns |

Review table:

| ColumnName | Type | Details |
|---|---|---|
| user_id | array of strings | |
| review_id | array of strings | |
| stars | array of numbers | each number is the rating given in the review |
| date | array of dates | |
| text | array of strings | each contains the content of the review |
| business_id | array of strings | the ID of the business the review refers to |
Each business belongs to a set of categories \(c\), which can be regarded as an observation drawn from a population \(\mathcal{C}\). \(\mathcal{C}\) has more than 783 entries, counting both single categories and their combinations. In addition, each business has an array \(a_{0}\) (length = 38) describing its attributes.
Some elements of the attribute array are themselves data frames, so we generate a ‘flattened’ version \(a_{1}\) (length = 78) of each array. We then ‘unlist’ the ‘Accepts Credit Cards’ attribute in \(a_{1}\) so that every element of \(a_{1}\) has a basic data type. We denote this final attribute array \(a_{2}\).
From the perspective of a new business, the owner can coarsely place the business into one of the categories in the business table. By leveraging the existing business instances, the owner can then find common features to follow by extracting the attributes shared by most of these businesses. For example, “Eric Goldberg, MD” in the dataset belongs to Doctors and Health & Medical. Among the businesses belonging to Doctors (1077 instances), 800 accept patients by appointment only, versus 164 that do not. The owner may therefore decide to follow the common trend and broadcast a suitable advertisement based on it to win more customers.
Unfortunately, this approach does not always work. For the same set of businesses, only 1 instance states that it has a TV, while the value is unknown for the rest of the dataset. In such a situation the owner is left without guidance. To resolve this dilemma, in this section we propose a methodology to extract insightful phrases from the reviews.
First, we find all business instances, called buddies, that share a category with the new business. For example, there are 1077 businesses belonging to Doctors and 3213 belonging to Health & Medical, and the total number of buddies is 3213 after removing duplicates. We denote the set of buddies \(\mathcal{B}\).
Then we find all reviews relevant to the buddy set \(\mathcal{B}\) and annotate each of them with Part-of-Speech (POS) tags. An annotation example follows:
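The buddy selection can be sketched in base R as follows. This is a minimal illustration, not the code used in the study: the toy `business` table, its ids, and the helper `find_buddies` are all hypothetical.

```r
# Hypothetical toy business table; ids and categories are made up for illustration.
business <- data.frame(
  business_id = c("b1", "b2", "b3", "b4"),
  stringsAsFactors = FALSE
)
business$categories <- list(
  c("Doctors", "Health & Medical"),
  c("Health & Medical", "Dentists"),
  c("Restaurants"),
  c("Doctors", "Health & Medical")
)

# Return the ids of all businesses sharing at least one category with the target.
# unique() removes duplicates in case a business id appears more than once.
find_buddies <- function(business, target_categories) {
  shares <- vapply(
    business$categories,
    function(cats) any(cats %in% target_categories),
    logical(1)
  )
  unique(business$business_id[shares])
}

buddies <- find_buddies(business, c("Doctors", "Health & Medical"))
```

Here a business counts as a buddy if it matches any of the target's categories, which mirrors the union of the Doctors and Health & Medical sets described above.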
Original tagged review:
Great[NNP] friendly[JJ] staff[NN] ![.] ![.] Clean[NNP]
teeth[NNS] and[CC] no[DT] cavities[NNS] ....[:]
Great[JJ] exam[NN] and[CC] cleaning[NN] .[.] Thanks[NNS] .[.]
For our specific purpose (extracting insightful phrases), we focus on the nouns in these reviews. In sentiment analysis, finding insightful phrases in text is a typical instance of the product feature identification problem (see the work of Professor Liu, [Hu and Liu 2004]). In that paper, the authors hypothesize that “Different customers usually have different stories. However, when they comment on product features, the words that they use converge”. This heuristic leads to association mining to find all frequent itemsets in the reviews. In this study, we go one step further than their methodology and extract not only the frequent itemsets but also the syntactic structure, in the hope that the syntactic structure can filter out spurious nouns that are not real product features.
To do that, we manually mark 50 review samples, randomly drawn from the corpus, and pick out insightful phrases with an explicit opinion statement, e.g.:
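| Row | N4 | N3 | N2 | N1 | C0 | P1 | P2 | P3 | P4 | P5 | YN | ID | Comments |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9 | NA | NA | VB | IN | NN | VBD | JJ | NA | NA | NA | Y | 1 | Check In woman was professional |
| 10 | NA | NA | NA | DT | NN | VBZ | RB | JJ | NA | NA | Y | 2 | The billing is absolutely horrible |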
C0 in this table is a noun (NN, NNP, NNS, etc.) related to the product feature. N1-N4 are the 4 words preceding it, and P1-P5 are the 5 words following it. NA in this table marks an irrelevant word, either one belonging to the next sentence or an adjunct. The Comments column holds the original text from the review.
With this table, we use the Apriori algorithm from association mining to learn the combination rules of the syntactic structure, as in the following code snippet:
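To make the window layout concrete, the following base R sketch builds one N4..P5 row for each noun in a tagged sentence. It is only an approximation of the manual marking described above: the helper names `tag_window` and `noun_windows` are hypothetical, and NA is used only as padding past the sentence boundaries, whereas the manual table also uses NA for adjuncts.

```r
# Tags of the example sentence "The billing is absolutely horrible"
tags <- c("DT", "NN", "VBZ", "RB", "JJ")
window_cols <- c("N4", "N3", "N2", "N1", "C0", "P1", "P2", "P3", "P4", "P5")

# Tags from 4 positions before to 5 positions after `center`,
# padded with NA where the window runs past the sentence.
tag_window <- function(tags, center, before = 4, after = 5) {
  idx <- (center - before):(center + after)
  idx[idx < 1 | idx > length(tags)] <- NA
  tags[idx]
}

# One row per noun (any tag starting with "NN") in the sentence.
noun_windows <- function(tags) {
  centers <- grep("^NN", tags)
  if (length(centers) == 0) return(NULL)
  rows <- lapply(centers, function(i) tag_window(tags, i))
  out <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
  names(out) <- window_cols
  out
}

w <- noun_windows(tags)
```

For this sentence, `w` has a single row whose N1, C0, P1, P2, and P3 entries are DT, NN, VBZ, RB, and JJ, matching row 10 of the table.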
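    # Learn the rules by association mining ----
    library(arules)

    # kkPOSTag[, 1:11] holds the columns N4 through YN
    kkPOSTag <- kkPOSTag[, 1:11]
    # apriori() expects categorical columns, so make sure every column is a factor
    kkPOSTag[] <- lapply(kkPOSTag, factor)

    # Mine rules with at least 4 items whose right-hand side is YN=Y
    rule <- apriori(kkPOSTag,
                    parameter = list(minlen = 4),
                    appearance = list(rhs = c("YN=Y"), default = "lhs"))
    # Keep rules whose left-hand side mentions the center noun C0, sorted by support
    ruleRHS <- subset(sort(rule, by = "support", decreasing = TRUE),
                      subset = lhs %pin% "C0=")

    # Redundancy pruning as done in [Hu and Liu 2004]
    # sparse = FALSE returns a dense matrix so that the lower.tri() indexing works
    subset.matrix <- is.subset(ruleRHS, ruleRHS, sparse = FALSE)
    subset.matrix[lower.tri(subset.matrix, diag = TRUE)] <- NA
    redundant <- colSums(subset.matrix, na.rm = TRUE) >= 1
    rules.pruned <- ruleRHS[!redundant]  # our final rules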
These rules are then used to select candidate product feature phrases from all reviews related to \(\mathcal{B}\). During the selection, each candidate phrase is assigned a floating-point weight that reflects the importance of the phrase in the current text. The weight is the number of times the phrase appears in a given review, multiplied by the review’s star rating and divided by the maximum star rating (=5).
Finally, we sum the weights for each phrase and present the result to the user. A phrase with a high score appears frequently in relatively highly rated reviews, so we consider such phrases suitable for composing an insightful advertisement. More details are given in the results section.
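The weighting and summation can be sketched in base R as below. The `matches` table is hypothetical toy data, assuming one row per (review, phrase) pair with the number of matches and the review's star rating.

```r
# Hypothetical matches: one row per (review, phrase) pair.
matches <- data.frame(
  phrase = c("office", "staff", "office", "billing"),
  count  = c(2, 1, 1, 3),   # appearances of the phrase in the review
  stars  = c(5, 4, 3, 1),   # star rating of the review
  stringsAsFactors = FALSE
)
max_stars <- 5

# Weight of a phrase within one review: count * stars / max_stars
matches$weight <- matches$count * matches$stars / max_stars

# Score of a phrase: sum of its weights over all reviews, highest first
scores <- vapply(split(matches$weight, matches$phrase), sum, numeric(1))
scores <- sort(scores, decreasing = TRUE)
```

With this toy data, "office" ranks first because it appears most often in highly rated reviews, while "billing" is penalized by the one-star review it comes from.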
From the perspective of a new business, it is crucial to find the potential users, so that the advertisement designed around the insightful feature phrases can be delivered to them. In this section, we use collaborative filtering to find those potential users, based on similar businesses.
Since people may pay particular attention to the attributes of a business, e.g. preferring a club with a parking lot, we need to cluster the buddy set \(\mathcal{B}\) into groups based on these features and restrict ourselves to a single subset \(\mathcal{B}_{sub}\). For the clustering operation, we first need to clean the business table further.
A careful look at the business table shows that the attribute array \(a_{2}\) usually contains a lot of missing data, so we need a way to deal with it. The basic variable types in the array are logical, character, and integer. For logical and character variables, we replace NA with an empty string and then cast each variable to character, then to factor, and finally to integer. For integer variables, we replace NA with 0.
The next step is to cluster the business table and use the cluster labels to down-select the buddy businesses into \(\mathcal{B}_{sub}\). The clustering method in this study is K-means, and the number of cluster centers is chosen from a plot of the total within-group sum of squares against the number of clusters. More precisely, the bend in the curve suggests a reasonable choice.
Using the down-selected businesses \(\mathcal{B}_{sub}\), called neighbors, we pick out the relevant reviews from the review table and sort them by date. Not all reviews are relevant to the business under interrogation; we only consider the reviews from the same year as our interrogation.
Each review carries one user ID, so we can look up the corresponding users in the user table. We keep only those who have submitted more than one review as our potential candidates.
To demonstrate the methodology of the previous sections, we use the first business in the business table as our interrogation target.
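A minimal base R sketch of this cleaning step is given below. The `attrs` table and its column names are hypothetical stand-ins for a few columns of \(a_{2}\), and `encode_attribute` is an illustrative helper, not a function from the study.

```r
# Hypothetical slice of the flattened attribute table
attrs <- data.frame(
  wifi        = c("free", NA, "no"),   # character
  parking     = c(TRUE, NA, FALSE),    # logical
  price_range = c(2L, NA, 4L),         # integer
  stringsAsFactors = FALSE
)

# Integer columns: NA -> 0.
# Logical and character columns: NA -> "", then character -> factor -> integer.
encode_attribute <- function(x) {
  if (is.integer(x)) {
    x[is.na(x)] <- 0L
    return(x)
  }
  x <- as.character(x)
  x[is.na(x)] <- ""
  as.integer(factor(x))
}

encoded <- as.data.frame(lapply(attrs, encode_attribute))
```

Note that the integer codes produced for logical and character columns follow the sorted factor levels, so the empty string (the former NA) always receives code 1.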
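The selection of the number of clusters can be sketched with base R's `kmeans` as follows. The matrix `x` is synthetic data standing in for the encoded attribute table, and the range of candidate values `ks` and the choice of three centers are assumptions made only for this illustration.

```r
set.seed(42)
# Synthetic stand-in for the encoded attributes: three well-separated groups
x <- rbind(
  matrix(rnorm(60, mean = 0), ncol = 2),
  matrix(rnorm(60, mean = 6), ncol = 2),
  matrix(rnorm(60, mean = 12), ncol = 2)
)

# Total within-group sum of squares for each candidate number of clusters
ks <- 1:8
wss <- vapply(
  ks,
  function(k) kmeans(x, centers = k, nstart = 10)$tot.withinss,
  numeric(1)
)

# Fit with the chosen number of centers and keep the rows that fall in the
# same cluster as the target business (assumed to be row 1 here)
fit <- kmeans(x, centers = 3, nstart = 10)
same_cluster <- which(fit$cluster == fit$cluster[1])
```

Plotting `wss` against `ks`, for example with `plot(ks, wss, type = "b")`, gives the curve whose bend is used to pick the number of centers.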
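The review and user filtering can be sketched in base R as below. The `reviews` table, the `neighbors` ids, and `target_year` are hypothetical values chosen for illustration.

```r
# Hypothetical review table
reviews <- data.frame(
  user_id     = c("u1", "u2", "u1", "u3", "u2", "u4"),
  business_id = c("b1", "b1", "b2", "b2", "b4", "b9"),
  date        = as.Date(c("2015-03-01", "2015-05-10", "2015-07-22",
                          "2014-11-30", "2013-01-15", "2015-02-02")),
  stringsAsFactors = FALSE
)
neighbors   <- c("b1", "b2", "b4")  # assumed ids of the businesses in B_sub
target_year <- "2015"               # assumed year of the interrogation

# Reviews of neighbor businesses, sorted by date, restricted to the target year
relevant <- reviews[reviews$business_id %in% neighbors, ]
relevant <- relevant[order(relevant$date), ]
relevant <- relevant[format(relevant$date, "%Y") == target_year, ]

# Users with more than one remaining review are the potential candidates
review_counts <- table(relevant$user_id)
target_users  <- names(review_counts)[review_counts > 1]
```

In this toy data only `u1` has more than one review of a neighbor business within the target year, so `u1` is the single candidate.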
| Rule | lhs | rhs | support | confidence | lift |
|---|---|---|---|---|---|
| 5 | {N1=DT,C0=NN,P1=VBZ} | {YN=Y} | 0.1764706 | 1 | 1 |
| 2 | {C0=NN,P1=VBZ,P2=RB} | {YN=Y} | 0.1470588 | 1 | 1 |
| 4 | {N1=DT,C0=NN,P2=JJ} | {YN=Y} | 0.1470588 | 1 | 1 |
| 1 | {C0=NN,P2=RB,P3=JJ} | {YN=Y} | 0.1176471 | 1 | 1 |
| 3 | {N1=DT,C0=NN,P2=RB} | {YN=Y} | 0.1176471 | 1 | 1 |
Each rule is expressed as A (lhs column) => B (rhs column). The support of A => B is defined as \(P(A \cup B)\), the probability that a sample contains every item in both A and B; the confidence of A => B is the conditional probability \(P(B | A)\); and the lift of A => B is the confidence divided by \(P(B)\). Because event B always occurs in our table, \(P(B) = 1\) and \(P(A \cup B) = P(A)\), so confidence and lift are always 1 in the result. Thus we only need to look at the support, which is the fraction of the training set that matches the rule, and we use the top rules with relatively high support.
Using the rules and the procedure introduced in the previous section, we obtain insightful phrases such as those in the following plot, where the horizontal axis shows the top features with relatively high scores (>10) and the vertical axis shows the score, namely the sum of the weights.
Moreover, we can go back to the original review (id=37) to double-check the reviewer's opinion on a certain topic (e.g. office).
| Row | id | phrase |
|---|---|---|
| 1 | 18 | This office does not know |
| 2 | 23 | The office is very clean |
| 3 | 24 | this office are wonderful . |
| 4 | 28 | the office is great , |
| 5 | 37 | The office is always spotless |
The original tagged review reads:
I[PRP] love[VBP] Dr.[NNP] Dan[NNP] and[CC] Eric[NNP] ![.] ...
The[DT] office[NN] is[VBZ] always[RB] spotless[JJ] ,[,] and[CC]
thanks[NNS] to[TO] the[DT] warm[JJ] decor[NN] it[PRP] does[VBZ] n't[RB]
have[VB] that[DT] sterile[JJ] dentists[NNS] office[NN] feeling[NN]...
Next we cluster the buddy businesses. Before clustering, we read the number of cluster centers from the plot (=45). After down-selecting the buddy businesses using the clustering result, we obtain 2576 businesses as our final neighbors \(\mathcal{B}_{sub}\). Using \(\mathcal{B}_{sub}\), we retrieve the relevant reviews, sort them by date, and filter by the current year, which leaves 256 reviews. These reviews come from 248 users in total, of whom we select 8 as our target users.
In this study, we present a method to generate insightful advertisements and to find the potential users to whom they should be delivered. To explore the key feature phrases, we use association mining to find their syntactic structure. The training set for association mining is built by sampling explicit opinion statements. The counterpart of such a statement may be an implicit statement with a very complex syntactic structure, or just a simple noun phrase. These cases are difficult to find and characterize, which makes it impractical to build a classifier; that is the reason we use association mining instead. The sample size is limited, since marking the samples is tedious and arduous. We hope to generate a larger sample corpus in the next phase of the study.
To find the potential users, we use K-means clustering to down-select businesses based on their features, and then use the relevant reviews to find candidates in a collaborative filtering manner. In this step, we hypothesize that people who have recently posted multiple reviews on the same kind of business are likely to show great interest in such businesses. We also explored other features for profiling users, such as statistics on the interval between reviews, but we could not find a good quantification in the raw dataset to support this kind of profiling, so we use a heuristic method to find the potential users.