Kids Welcome!

Predicting child friendliness of restaurant & food businesses from Yelp data

Summary presentation of key findings

Timm Suess, 2015-11-13

Picture by Quinn Dombrowski via Flickr, CC-by-sa

Research focus and data preparation

Research question

How well can a “good for kids” rating of restaurant and food businesses be predicted from business features such as Yelp category, service attributes, city and keywords in the business name?

Data Source

Yelp Dataset Challenge 6

Application

  • Parent-centric recommendation engines
  • Restaurant market research
  • Review fraud detection

Preprocessing

  • Extraction of identification, evaluation, categories, attributes and location for all businesses
  • Outcome variable: attribute “good for kids” (binary)
  • Filtering (no NA in outcome, category “Restaurant”, “Food”)
  • k-means clustering of GPS coordinates
  • Inclusion of top 300 words in business names
  • Creation of dummy variables
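
The name-word and dummy-variable steps above can be illustrated with a minimal Python sketch. It is not the original implementation: the function names `top_words` and `name_dummies`, the tokenization rule and the example business names are all assumptions made for illustration, and the vocabulary is cut to 2 words instead of 300 to keep the example readable.

```python
from collections import Counter
import re

def tokenize(name):
    """Lowercase a business name and split it into alphabetic words (assumed rule)."""
    return set(re.findall(r"[a-z]+", name.lower()))

def top_words(names, n):
    """Return the n words that occur in the most business names."""
    counts = Counter()
    for name in names:
        # tokenize() returns a set, so each word counts once per name
        counts.update(tokenize(name))
    return [word for word, _ in counts.most_common(n)]

def name_dummies(names, vocabulary):
    """Build one 0/1 indicator per vocabulary word for every business name."""
    rows = []
    for name in names:
        words = tokenize(name)
        rows.append({f"name_{w}": int(w in words) for w in vocabulary})
    return rows

# Made-up business names, for illustration only
names = ["Mario Pizza", "Pizza Palace", "The Wine Bar", "Sushi Bar"]
vocab = top_words(names, 2)
features = name_dummies(names, vocab)

for name, row in zip(names, features):
    print(name, row)
```

Each business ends up with one binary column per frequent word, which can be joined with the category, attribute and location dummies.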

Machine-learning approach

Model building

  • Impute NAs with variable median
  • Train three individual classifiers and one ensemble
  • Validate and test
  • Select against multiple metrics
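
As an illustration of the imputation step, the following Python sketch replaces missing values with the column median. It is a hedged example, not the code used in the study: the function name `impute_median`, the column names and the sample rows are hypothetical, and missing values are assumed to be represented as `None`.

```python
from statistics import median

def impute_median(rows, columns):
    """Return copies of rows where None in the listed columns is replaced
    by the median of that column's observed (non-missing) values."""
    medians = {}
    for col in columns:
        observed = [r[col] for r in rows if r[col] is not None]
        # median() raises StatisticsError if a column has no observed values
        medians[col] = median(observed)
    return [
        {k: (medians[k] if k in medians and v is None else v) for k, v in r.items()}
        for r in rows
    ]

# Hypothetical feature rows; None marks a missing value
rows = [
    {"price_range": 1, "takeout": 1},
    {"price_range": None, "takeout": 0},
    {"price_range": 3, "takeout": None},
    {"price_range": 2, "takeout": 1},
]
imputed = impute_median(rows, ["price_range", "takeout"])
print(imputed)
```

For binary dummy variables the median is simply the more frequent of 0 and 1, so this strategy fills gaps with the majority value.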

Outcome variable is skewed

More than 80% of all “good for kids” entries are positive:

          Not good for kids   Good for kids   Total
n                      3667           16424   20091
% Total               18.3%           81.7%    100%
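
The practical consequence of this skew can be worked out directly from the counts above. The short Python sketch below recomputes the class shares and shows what a trivial classifier that always answers “good for kids” would score. The always-positive baseline is an illustrative assumption and not part of the original analysis.

```python
# Class counts taken from the table above
counts = {"not_good_for_kids": 3667, "good_for_kids": 16424}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# A classifier that always predicts the majority class is correct
# exactly as often as the majority class occurs.
majority_accuracy = max(shares.values())

# The same classifier finds every positive (sensitivity 1.0) and no
# negative (specificity 0.0), so its balanced accuracy is only 0.5.
balanced_accuracy_baseline = (1.0 + 0.0) / 2

print(f"total: {total}")
print(f"always-positive accuracy: {majority_accuracy:.3f}")
print(f"always-positive balanced accuracy: {balanced_accuracy_baseline:.2f}")
```

Plain accuracy therefore says little on this data, which is one reason to judge models on several metrics such as balanced accuracy, Kappa and AUROC.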

Results: Random Forest as best model

Key Metrics

Metric              RF     LB     NN     GAM
AUROC               0.91   0.87   0.83   0.56
Balanced Accuracy   0.75   0.71   0.72   0.53
Sensitivity         0.97   0.97   0.92   0.90
Specificity         0.54   0.46   0.53   0.15
Kappa               0.57   0.50   0.46   0.06

RF=Random Forest, LB=LogitBoost, NN=Neural Net, GAM=Generalized Additive Model
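
Balanced accuracy is the mean of sensitivity and specificity, so the reported values can be cross-checked against the other rows of the table. The Python sketch below does this with the rounded figures shown above. The function name `balanced_accuracy` is an assumption, and because the inputs are already rounded to two decimals, the recomputed values only match the reported ones to within rounding.

```python
def balanced_accuracy(sensitivity, specificity):
    """Mean of the true-positive rate and the true-negative rate."""
    return (sensitivity + specificity) / 2

# Rounded values copied from the Key Metrics table
reported = {
    "RF":  {"sensitivity": 0.97, "specificity": 0.54, "balanced_accuracy": 0.75},
    "LB":  {"sensitivity": 0.97, "specificity": 0.46, "balanced_accuracy": 0.71},
    "NN":  {"sensitivity": 0.92, "specificity": 0.53, "balanced_accuracy": 0.72},
    "GAM": {"sensitivity": 0.90, "specificity": 0.15, "balanced_accuracy": 0.53},
}

recomputed = {
    model: balanced_accuracy(m["sensitivity"], m["specificity"])
    for model, m in reported.items()
}

for model, value in recomputed.items():
    print(f"{model}: recomputed {value:.3f}, "
          f"reported {reported[model]['balanced_accuracy']:.2f}")
```

The check also makes the weak point of every model visible: sensitivity is high throughout, so differences in balanced accuracy come almost entirely from specificity.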

Random Forest emerges as the clear winner. The GAM performs surprisingly poorly, despite being the ensemble model.

ROC Curves

[Figure: ROC curves of the four models. The GAM curve shows performance against testing data, the others against validation data.]

Interpretation

How well can child-friendliness be predicted?

  • Overall, RF predicts child-friendliness fairly well (good AUROC, Kappa, Balanced Accuracy)
  • Excellent sensitivity (0.97), but poor specificity (0.54)
  • Solid basis for a (positive) recommendation engine

Possible Improvements

  • Get more data on child-unfriendly venues (e.g. from review texts)
  • Improve imputation algorithm
  • Use expert knowledge to discover hidden variables
  • Include additional variables (e.g. opening times)
  • Use other algorithms (SVM, deep learning)