Final Project

For the final project, we will be examining the relationship between Yelp reviews and businesses in the metropolitan area of Phoenix, Arizona. The data was obtained from Kaggle’s 2013 Yelp Review Challenge and was subset to include only businesses within the food and beverage industries.

We have a two-part goal in this assignment:

  1. Identify the links between key Yelp users and businesses within the Phoenix, Arizona community.
  2. Evaluate the relationship between the text contents of reviews and the rating a business received.

Data Acquisition & Tidying

Data was acquired from Kaggle as a JSON file for a project conducted in Data 612. Our network uses the subsetted data from that project. This subset is stored as a csv file in our data folder and was read into this report for further review.

We applied additional transformations to meet our project goals and split the data into separate dataframes for network building and text processing.

Network Data

Text Data

Network Analysis

2-Mode Network

We set up our initial 2-mode network by connecting our businesses and users using a weighted incidence matrix. We plotted our graph using the plot.igraph function and verified our network was created properly.

Build Network

# define edges; spread data from long to wide; convert to matrix
edges <- yelp_network %>% select(businessID, userID, weight) %>% spread(businessID, 
    weight, fill = 0) %>% column_to_rownames("userID") %>% as.matrix()

# define nodesets
business_nodes <- yelp_network %>% select(businessID, name, size) %>% mutate(type = "business", 
    name = as.character(name)) %>% distinct()

user_nodes <- yelp_network %>% select(userID) %>% mutate(name = paste0("U", 
    userID), size = NA, type = "user") %>% distinct()

# bind rows
nodes <- bind_rows(business_nodes, user_nodes)

# initiate graph from matrix
g <- graph_from_incidence_matrix(edges, weighted = TRUE)

# Define vertex color/shape
V(g)$shape <- ifelse(V(g)$type, "circle", "square")
V(g)$color <- ifelse(V(g)$type, "red", "white")
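
To make the long-to-wide step concrete, here is a minimal base R sketch on a toy edge list. The IDs and weights are made up for illustration, and `xtabs()` stands in for the `spread()` call used above; it is not the code that produced the report's network.

```r
# Toy long-format edge list (hypothetical IDs and weights, not from the Yelp data)
toy <- data.frame(
  userID     = c(1, 1, 2, 3),
  businessID = c("B1", "B2", "B1", "B3"),
  weight     = c(5, 2, 3, 4)
)

# Rows = users, columns = businesses; user/business pairs with no tie get 0
inc <- unclass(xtabs(weight ~ userID + businessID, data = toy))
attr(inc, "call") <- NULL  # drop the stored call so only the matrix remains

# Node counts per mode follow from the matrix dimensions
n_users      <- nrow(inc)
n_businesses <- ncol(inc)

# Degree = number of non-zero ties for each node
user_degree     <- rowSums(inc > 0)
business_degree <- colSums(inc > 0)

print(inc)
```

The same dimension and degree checks apply to the full `edges` matrix, which is one way to confirm node counts before building the graph.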

Network Graph

Verification

Verify Node Counts and Connectivity
Business.Nodes 4332
User.Nodes 10000
Is.Weighted TRUE
Is.Bipartite TRUE

Edge-Trimming

To better understand our network, we applied the island method to find the most influential users and businesses in our dataset. We made the network sparser by keeping only the most important ties and discarding the rest.

Examine Frequency

We looked at a histogram of the edge weights to better understand our network.

# Modify data frame
edgesDf <- yelp_network %>% select(businessID, userID, weight)

# Convert weight to numeric
weight <- as.numeric(unlist(edgesDf$weight))

# Examine frequency of weight
hist(weight)

# Calculate mean and standard deviation
print(paste0("Mean:", round(mean(weight), 2), " Standard Deviation: ", round(sd(weight), 
    2)))
## [1] "Mean:32.55 Standard Deviation: 48.3"

Plot 1

In our first plot, we kept only edges whose weight is at or above our cutoff value, the mean edge weight.

cut.off <- mean(weight)
net.sp <- delete_edges(g, E(g)[weight < cut.off])
plot.igraph(net.sp, vertex.label = NA)

Plot 2

It was still difficult to see the network, so we instead eliminated isolated vertices, those with a degree of 0.

# Eliminate vertices left with degree 0 after edge trimming
net.sp <- delete_vertices(net.sp, V(net.sp)[degree(net.sp) == 0])
plot.igraph(net.sp, vertex.label = NA)
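
The two trimming steps so far can be sketched on a toy edge list in base R. All IDs and weights here are hypothetical, and the sketch works on a plain data frame rather than an igraph object, so it only illustrates the logic of the island method as described above.

```r
# Toy weighted edge list (hypothetical values, not from the Yelp data)
toy_edges <- data.frame(
  userID     = c("U1", "U1", "U2", "U3", "U4"),
  businessID = c("B1", "B2", "B1", "B3", "B3"),
  weight     = c(90, 10, 60, 5, 45)
)
all_nodes <- union(toy_edges$userID, toy_edges$businessID)

# Step 1: keep only ties at or above the mean edge weight
cut_off <- mean(toy_edges$weight)  # (90 + 10 + 60 + 5 + 45) / 5 = 42
strong  <- toy_edges[toy_edges$weight >= cut_off, ]

# Step 2: drop nodes left without any tie (degree 0 after trimming)
kept_nodes <- union(strong$userID, strong$businessID)
isolates   <- setdiff(all_nodes, kept_nodes)

print(strong)
print(isolates)
```

Here U3 and B2 lose all of their ties in step 1, so they are the nodes removed in step 2.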

Plot 3

In our final plot, we kept only the most connected users and businesses: vertices with a degree of at least 10, well above the network's mean degree of about 6.2.

# Calculate mean degree of network
print(paste0("Mean degree: ", mean(degree(g))))
## [1] "Mean degree: 6.20904270164667"
# Eliminate vertices with degree less than 10
net.sp <- delete_vertices(g, V(g)[degree(g) < 10])
plot(net.sp)

Key User Nodes

The key user nodes are identified in the tibble below.

userID type
2308 user
3667 user
1342 user

Key Business Nodes

The key business nodes are identified in the tibbles below.

Text Analysis

We would like to assess each word in the reviews as either positive or negative and find the difference between the number of positive and negative words. This will be the “score” of the review.
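
As a small worked example of this scoring rule, the sketch below scores two made-up reviews against a hypothetical six-word lexicon. The helper name `score_review` and the word lists are assumptions for illustration only; the report itself uses the full Bing lexicon.

```r
# Hypothetical mini-lexicon for illustration (the report uses the Bing lexicon)
pos_words <- c("great", "delicious", "friendly")
neg_words <- c("slow", "cold", "rude")

score_review <- function(text, pos_words, neg_words) {
  # lower-case, strip punctuation, then split on whitespace
  cleaned <- gsub("[[:punct:]]", "", tolower(text))
  words <- strsplit(cleaned, "\\s+")[[1]]
  # score = number of positive words minus number of negative words
  sum(words %in% pos_words) - sum(words %in% neg_words)
}

# two positive words (great, friendly) and one negative word (slow)
mixed <- score_review("Great tacos, friendly staff, but slow service.",
                      pos_words, neg_words)

# no positive words and two negative words (cold, rude)
bad <- score_review("Cold food and a rude waiter", pos_words, neg_words)

print(c(mixed = mixed, bad = bad))
```

The first review scores 2 - 1 = 1 and the second scores 0 - 2 = -2.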

Sentiment Function

First, we will use the Bing sentiment lexicon from the tidytext package. The Bing lexicon classifies certain words as either positive or negative.

# split the Bing lexicon into positive and negative word vectors
m <- get_sentiments("bing")

# logical subsetting avoids the NA gaps that index-by-index assignment in a
# loop would leave in each vector
pos.words <- m$word[m$sentiment == "positive"]
neg.words <- m$word[m$sentiment == "negative"]

neg.words[5]
## [1] "abominably"

Then, we created a function that splits the string of text from the reviews, analyzes each word, and then calculates the score. [4]

v <- as.character(as.vector(yelp_reviews$text))

score.sentiment = function(v, pos.words, neg.words, .progress = "none") {
    require(plyr)
    require(stringr)
    
    # v is a vector of sentences. plyr handles a list or a vector as an 'l'
    # for us; we want a simple array ('a') of scores back, so we use
    # 'l' + 'a' + 'ply' = 'laply':
    
    scores = laply(v, function(sentence, pos.words, neg.words) {
        
        # clean up sentences with R's regex-driven global substitute, gsub():
        sentence = gsub("[[:punct:]]", "", sentence)
        sentence = gsub("[[:cntrl:]]", "", sentence)
        sentence = gsub("\\d+", "", sentence)
        # and convert to lower case:
        sentence = tolower(sentence)
        
        # split into words. str_split is in the stringr package
        word.list = str_split(sentence, "\\s+")
        # sometimes a list() is one level of hierarchy too much
        words = unlist(word.list)
        
        # compare our words to the dictionaries of positive & negative terms
        pos.matches = match(words, pos.words)
        neg.matches = match(words, neg.words)
        
        # match() returns the position of the matched term or NA; we just want
        # a TRUE/FALSE:
        pos.matches = !is.na(pos.matches)
        neg.matches = !is.na(neg.matches)
        
        # and conveniently enough, TRUE/FALSE will be treated as 1/0 by sum():
        score = sum(pos.matches) - sum(neg.matches)
        
        return(score)
    }, pos.words, neg.words, .progress = .progress)
    
    scores.df = data.frame(score = scores, text = v)
    return(scores.df)
}

Visualize Relationships

We added the scores for each review to the original dataframe, which makes the analysis that follows much easier. Below, we compare some of the variables to explore the relationships between the score of a review and the stars associated with it.

# add scores to original df
t <- score.sentiment(v, pos.words, neg.words)
full <- inner_join(t, yelp_reviews, by = "text")
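
One thing worth checking with a join keyed on `text` is whether any two reviews share identical text, because each duplicate would then match more than once. Whether that happens in this Yelp subset is not shown in the report, so the sketch below only illustrates the effect on made-up data, using base R `merge()` as a stand-in for `inner_join()`.

```r
# Made-up reviews; the first two share identical text
reviews <- data.frame(
  review_id = 1:3,
  stars     = c(5, 4, 1),
  text      = c("Great food", "Great food", "Slow service"),
  stringsAsFactors = FALSE
)

# Scores come back in the same row order as the reviews they were computed from
scores <- data.frame(
  score = c(1, 1, -1),
  text  = reviews$text,
  stringsAsFactors = FALSE
)

# Joining on text: each duplicated text matches every copy (2 x 2 + 1 rows)
joined <- merge(scores, reviews, by = "text")

# Attaching by position keeps exactly one row per review
by_position <- cbind(reviews, score = scores$score)

print(nrow(joined))
print(nrow(by_position))
```

Comparing `nrow(full)` with `nrow(yelp_reviews)` would be a quick way to see whether the join changed the number of rows.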

Stars Score

Funny Score

Useful Score

Cool Score

Analysis

The most interesting plot to examine further would be the stars vs score. It appears that as stars increase, so does the score. We will test if there is actually a relationship.

Correlation

## [1] 0.2771639

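The p-value is below 2.2e-16, so at the 5% significance level we reject the null hypothesis of zero correlation. The estimated correlation of about 0.28 (95% confidence interval 0.27 to 0.29) indicates a modest positive relationship between a review's stars and its sentiment score, the number of positive words minus the number of negative words.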

## 
##   Pearson's product-moment correlation
## 
## data:  full$stars and full$score
## t = 60.886, df = 44550, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2685693 0.2857143
## sample estimates:
##       cor 
## 0.2771639
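
As a sanity check on this output, the test statistic and confidence interval can be recomputed from the reported correlation and degrees of freedom alone. The sketch below uses the standard formulas for a Pearson correlation test (a t statistic with n - 2 degrees of freedom and a Fisher z interval); the variable names are illustrative, and the inputs are the rounded values printed above.

```r
r  <- 0.2771639  # reported sample correlation
df <- 44550      # reported degrees of freedom (n - 2)
n  <- df + 2     # number of scored reviews in the joined data

# t statistic for H0: true correlation = 0
t_stat <- r * sqrt(df / (1 - r^2))

# two-sided p-value from the t distribution
p_val <- 2 * pt(-abs(t_stat), df)

# 95% confidence interval via the Fisher z-transformation
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

print(round(t_stat, 3))
print(round(ci, 4))
```

With a sample this large, even a modest correlation yields a very large t statistic, which is why the p-value is so small while the correlation itself stays near 0.28.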

Conclusion

For the text analysis, our first step was to run a sentiment analysis of the reviews. We used the Bing lexicon from the tidytext package and checked each word in the reviews to score whether a review was positive or negative. We then compared the ratings with the review scores to see if there was a pattern. We found that more positive reviews were associated with more positive ratings.

While we cannot say whether previous reviews affect new ratings, we can identify a relationship between the number of stars given to a restaurant and the number of positive words left in the comments. This result comes as no surprise, but the process of learning to parse and analyze the text was exciting and a great way to understand the data.


References

Inspiration for this project was derived from the following sources:

  1. Data Source: https://www.kaggle.com/c/yelp-recsys-2013/data
  2. Related Project: http://rpubs.com/jemceach/D612-Final-Project
  3. R Network Reference: https://kateto.net/network-visualization
  4. R Sentiment Analysis: https://medium.com/@rohitnair_94843/analysis-of-twitter-data-using-r-part-3-sentiment-analysis-53d0e5359cb8