Exploratory Data Analysis with Yelp

Synopsis

Why Yelp?

“People who love to eat are always the best people!” For the love of food, I decided to explore the Yelp dataset. This dataset is publicly available as part of the Yelp Dataset Challenge. It includes information about local businesses in 10 cities across 4 countries.

What are we trying to answer/explore/investigate?

By exploring this data, I am trying to answer a few interesting questions:

1. What does the overall distribution of Average Ratings look like?
2. Is there a correlation between the Price-Range that a restaurant falls under and its Average Rating?
3. Where is the largest number of 5-star rated restaurants located (within the scope of our data-set)?
4. What are the top categories that most of the 5-star rated restaurants fall under?
5. Which city is the most veggie-friendly?

This analysis also attempts to answer a few questions based on the Yelp Reviews Dataset (reviews limited to Restaurants in Pittsburgh):

1. Can we show an interactive map of restaurants in Pittsburgh with an indication of their Ratings?
2. What are the most frequently occurring phrases in reviews for highly rated restaurants and for not-so-highly rated restaurants? Is the difference apparent?
3. How does the total number of ratings for a restaurant compare with its number of high Ratings? Are we making a wise choice by just looking at the Average Rating or the number of Ratings?
4. Which neighborhoods house the largest number of highly rated restaurants?

Methodology Employed

After cleaning the data, I wrangled it with dplyr and tidyr functions to bring it into a suitable format, then performed Exploratory Data Analysis on it to answer the above questions. Slicing and dicing of the data was done with dplyr.
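
As a rough illustration of the kind of wrangling described above, the sketch below applies a dplyr filter-and-summarise step and a tidyr reshape to a small made-up data frame. The `toy` object and its values are hypothetical and are not taken from the Yelp data.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data standing in for the restaurant table
toy <- data.frame(
  city  = c("Pittsburgh", "Pittsburgh", "Phoenix", "Phoenix"),
  price = c(1, 2, 1, 2),
  stars = c(4.0, 3.5, 4.5, 3.0),
  stringsAsFactors = FALSE
)

# dplyr: keep rows rated 3.5 or higher, then average the stars per city
avg_by_city <- toy %>%
  filter(stars >= 3.5) %>%
  group_by(city) %>%
  summarise(mean_stars = mean(stars))

# tidyr: spread the price levels into columns, giving one row per city
wide <- toy %>%
  mutate(price = paste0("price_", price)) %>%
  spread(key = price, value = stars)
```

Here `avg_by_city` has one row per city and `wide` has one column per price level, which is the general shape needed for per-city and per-price comparisons.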

Packages Required

library(stringr)        #This package is used for string manipulation functions
library(dplyr)          #This package is used for data manipulation tasks
library(tidyr)          #This package is used for data manipulation tasks
library(data.table)     #This package is used to access the function fread which is a better/faster way to read large data
library(wordcloud)      #This package is used to generate word clouds
library(tm)             #This is a text mining package used in the process of generating word clouds
library(RWeka)          #This package is used to generate Bigrams and Trigrams
library(ggplot2)        #This package is used for visualizations (chart/graph plotting functions)
library(ggmap)          #This package is used for map functions 
library(maps)           #This package is used for map functions
library(leaflet)        #This package is used for plotting maps
library(knitr)          #This package is used for dynamic report generation

Data Preparation

Original Source of the data: https://www.yelp.com/dataset_challenge/dataset

The CodeBook for the data is provided here: https://www.yelp.com/dataset_challenge

The datasets are located here:

Businesses Data https://s3.amazonaws.com/shreyayelp/yelp_academic_dataset_business.csv
User Data https://s3.amazonaws.com/shreyayelp/yelp_academic_dataset_user.csv
Reviews Data https://s3.amazonaws.com/shreyayelp/yelp_academic_dataset_review.csv
Reviews Data for Pittsburgh Restaurants https://s3.amazonaws.com/shreyayelp/yelp_academic_pittsburgh_restaurant_reviews.csv

The key attributes of the data are as follows:

2.7M reviews and 649K tips by 687K users for 86K businesses.
566K business attributes, e.g., hours, parking availability, ambience, etc.
Social network of 687K users for a total of 4.2M social edges
Aggregated check-ins over time for each of the 86K businesses
200,000 pictures from the included businesses.

The dataset consists of six files: business, tips, reviews, users, check-ins and photos. Each file is composed of a single object type, with one JSON object per line. For the purposes of this project, we will work with 3 of these: business, user and reviews. We will also work with the Review Data for only one city, Pittsburgh.
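
The files read later in this report are CSV versions of these JSON files, and the conversion itself is not shown. As a sketch of one possible route (an assumption, not necessarily how the CSVs used here were produced), the jsonlite package can parse one-object-per-line JSON with `stream_in()` and un-nest it with `flatten()`. The two records below are made up for illustration.

```r
library(jsonlite)

# Two made-up records in the one-JSON-object-per-line layout
json_lines <- c(
  '{"business_id": "b1", "city": "Pittsburgh", "attributes": {"Delivery": true}}',
  '{"business_id": "b2", "city": "Phoenix", "attributes": {"Delivery": false}}'
)

# stream_in() parses line-delimited JSON from a connection into a data frame
business_raw <- stream_in(textConnection(json_lines), verbose = FALSE)

# Nested objects arrive as data-frame columns; flatten() expands them
# into ordinary columns named with a "parent.child" pattern
business_flat <- jsonlite::flatten(business_raw)
names(business_flat)
```

Flattened names of this form resemble the `attributes.`-prefixed column names that appear in the Businesses Data.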

Reading the Data

data_business <- fread("https://s3.amazonaws.com/shreyayelp/yelp_academic_dataset_business.csv")
#Dimensions and attribute Names for the Business Data
dim(data_business)

[1] 85901    98

DT::datatable(as.data.frame(names(data_business)))

data_user <- fread("https://s3.amazonaws.com/shreyayelp/yelp_academic_dataset_user.csv")


Read 24.8% of 686556 rows
Read 52.4% of 686556 rows
Read 83.0% of 686556 rows
Read 686556 rows and 23 (of 23) columns from 0.152 GB file in 00:00:06

#Dimensions and attribute Names for the User Data
dim(data_user)

[1] 686556     23

DT::datatable(as.data.frame(names(data_user)))

pa_reviews <- fread("https://s3.amazonaws.com/shreyayelp/yelp_academic_pittsburgh_restaurant_reviews.csv")
#Dimensions and attribute Names for the Reviews Data
dim(pa_reviews)

[1] 74295    11

DT::datatable(as.data.frame(names(pa_reviews)))

Cleaning up Reviews Data to filter Reviews for Pittsburgh only

The Reviews Data file was read, filtered for restaurant reviews, and then further filtered for one city, Pittsburgh. Since the original Reviews data is 2.2GB, this was done in order to work with a smaller data-set within the scope of this project. The following piece of code was used to derive the reviews dataset for restaurants in Pittsburgh from the original reviews dataset.

    # Assumes data_reviews holds the full Reviews Data, e.g. read with
    # fread("https://s3.amazonaws.com/shreyayelp/yelp_academic_dataset_review.csv"),
    # and that restaurants is the restaurant subset built in the Businesses cleanup step
    # Reviews Data only for Restaurants
    data_reviews_restaurant <- data_reviews[data_reviews$business_id %in% restaurants$business_id,]
    write.csv(data_reviews_restaurant, "data/yelp_academic_restaurant_reviews.csv")
    # Reviews Data only for Restaurants in Pittsburgh
    pa_restaurants <- restaurants %>% filter(city=="Pittsburgh") %>% select(business_id)
    pa_reviews <- data_reviews_restaurant[data_reviews_restaurant$business_id %in% pa_restaurants$business_id,]
    write.csv(pa_reviews, "data/yelp_academic_pittsburgh_restaurant_reviews.csv")
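
The filtering logic above can be tried on a small scale. The following base R sketch uses hypothetical stand-in tables (`toy_business` and `toy_reviews`, with made-up IDs) to show how `%in%` keeps only the reviews whose `business_id` belongs to a Pittsburgh restaurant.

```r
# Hypothetical stand-ins for the business and review tables
toy_business <- data.frame(
  business_id = c("b1", "b2", "b3"),
  city        = c("Pittsburgh", "Phoenix", "Pittsburgh"),
  categories  = c("Restaurants, Pizza", "Restaurants", "Hair Salons"),
  stringsAsFactors = FALSE
)
toy_reviews <- data.frame(
  review_id   = c("r1", "r2", "r3", "r4"),
  business_id = c("b1", "b2", "b3", "b1"),
  stars       = c(5, 3, 4, 2),
  stringsAsFactors = FALSE
)

# Step 1: keep only businesses whose categories mention "Restaurant"
toy_restaurants <- toy_business[grepl("Restaurant", toy_business$categories), ]

# Step 2: collect the IDs of the restaurants located in Pittsburgh
toy_pa_ids <- toy_restaurants$business_id[toy_restaurants$city == "Pittsburgh"]

# Step 3: keep only the reviews written for those restaurants
toy_pa_reviews <- toy_reviews[toy_reviews$business_id %in% toy_pa_ids, ]
```

Only business `b1` is both a restaurant and in Pittsburgh, so `toy_pa_reviews` keeps just the two reviews written for it.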

# The Final Reviews Data Dimensions and Attribute Names
dim(pa_reviews)

[1] 74295    11

DT::datatable(as.data.frame(names(pa_reviews)))

Cleaning up Businesses Data to filter only Restaurants Data

# Extracting only Restaurant information from the data_business data set
restaurants <- data_business[grepl('Restaurant',data_business$categories),]
#The new data-set restaurants has the following dimensions
dim(restaurants)

[1] 26729    98

# Marking missing values as NA
restaurants[restaurants==""] <- NA
sum(is.na(restaurants))

[1] 1218596

# Removing columns where all the values are NA
NA_values <- is.na(restaurants)
NA_Count <- apply(NA_values,2,sum)
NA_Count_df <- NA_Count[NA_Count==dim(restaurants)[1]]
#The following columns are irrelevant for restaurants and are completely empty
names(NA_Count_df)

[1] "attributes.Hair Types Specialized In.africanamerican"
[2] "attributes.Hair Types Specialized In.kids"           
[3] "attributes.Hair Types Specialized In.straightperms"  
[4] "attributes.Hair Types Specialized In.asian"          
[5] "attributes.Hair Types Specialized In.coloring"       
[6] "attributes.Hair Types Specialized In.extensions"     
[7] "attributes.Hair Types Specialized In.perms"          
[8] "attributes.Hair Types Specialized In.curly"

restaurants <- subset(restaurants, select = !(c(names(restaurants)) %in% c(names(NA_Count_df))))
# Cleaning up the Column Names of the Restaurant Data Set
# fixed() matches the prefix literally, so the "." is not treated as a regex wildcard
names(restaurants) <- str_replace(names(restaurants), fixed("attributes."), "")
# Final Restaurants Data Dimensions and Attribute Names
dim(restaurants)

[1] 26729    90

DT::datatable(as.data.frame(names(restaurants)))
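
To make the cleanup steps above easier to follow, here is a minimal base R sketch of the same idea on a made-up data frame: mark empty strings as NA, drop the columns that are entirely NA, and strip the `attributes.` prefix. The `toy` table and its values are hypothetical, and `colSums()` is used here as a compact alternative to `apply(..., 2, sum)`.

```r
# Hypothetical table with one usable attribute and one irrelevant, empty one
toy <- data.frame(
  name = c("A", "B", "C"),
  `attributes.Price Range` = c("2", "", "3"),
  `attributes.Hair Types Specialized In.kids` = c("", "", ""),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Mark empty strings as missing values
toy[toy == ""] <- NA

# Flag the columns in which every value is NA
all_na <- colSums(is.na(toy)) == nrow(toy)

# Drop the flagged columns
toy <- toy[, !all_na, drop = FALSE]

# Remove the literal "attributes." prefix from the remaining column names
names(toy) <- sub("attributes.", "", names(toy), fixed = TRUE)
```

After these steps `toy` keeps the `name` and `Price Range` columns, with a single NA where the price range was blank.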

#Finding Restaurants only in Pittsburgh
pa_restaurants <- restaurants %>% filter(city=="Pittsburgh")