Kids Welcome!

Predicting child friendliness of restaurant & food businesses from Yelp data

Summary presentation of key findings

Timm Suess, 2015-11-13

Picture by Quinn Dombrowski via Flickr, CC-by-sa

Research focus and data preparation

Research question

How well can a “good for kids” rating of restaurant and food businesses be predicted from business features such as Yelp category, service attributes, city and key words in the business's name?

Data Source

Yelp Dataset Challenge 6

Application

Parent-centric recommendation engines
Restaurant market research
Review fraud detection

Preprocessing

Extraction of each business's identification, evaluation, categories, attributes, and location
Outcome variable: attribute “good for kids” (binary)
Filtering (no NA in outcome, category “Restaurant”, “Food”)
k-means clustering of GPS coordinates
Inclusion of top 300 words in business names
Creation of dummy variables
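
The last two preprocessing steps (top words in business names, dummy variables) can be sketched as follows. This is a minimal illustration, not the original pipeline: the business names, the helper names `tokenize`, `top_words` and `name_dummies`, and the tokenization rule are assumptions, and the cutoff is 2 instead of 300 to keep the toy example readable.

```python
import re
from collections import Counter

def tokenize(name):
    # Assumed rule: lowercase, keep runs of letters and apostrophes
    return re.findall(r"[a-z']+", name.lower())

def top_words(names, n):
    # Count each word at most once per business name, keep the n most common
    counts = Counter()
    for name in names:
        counts.update(set(tokenize(name)))
    return [word for word, _ in counts.most_common(n)]

def name_dummies(names, vocabulary):
    # Build one 0/1 indicator per vocabulary word for every business
    rows = []
    for name in names:
        words = set(tokenize(name))
        rows.append({f"name_{w}": int(w in words) for w in vocabulary})
    return rows

# Hypothetical business names, for illustration only
names = ["Joe's Pizza", "Pizza Hut", "Sushi Bar", "Pizza Bar & Grill"]
vocab = top_words(names, 2)
features = name_dummies(names, vocab)
```

Each business ends up with one indicator column per frequent word, which can sit alongside the category, attribute and city dummies in the model matrix.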

Machine-learning approach

Model building

Impute NAs with variable median
Train three individual classifiers and one ensemble
Validate and test
Select against multiple metrics
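
Median imputation, the first step above, might look like the sketch below. The column values and the helper name `impute_median` are hypothetical. Reusing the training median on held-out data is an assumption about good practice, not something the presentation states.

```python
from statistics import median

def impute_median(column):
    # The median is computed from observed (non-missing) values only
    observed = [v for v in column if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in column], fill

# Hypothetical feature column with two missing values (None stands for NA)
train_col = [1, None, 3, 10, None]
train_filled, train_median = impute_median(train_col)

# Assumption: held-out data is filled with the training median to avoid leakage
valid_col = [None, 4]
valid_filled = [train_median if v is None else v for v in valid_col]
```

The median is less sensitive to outliers than the mean, but it ignores relationships between variables, which is why a better imputation method is a natural candidate for improvement.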

Outcome variable is skewed

More than 80% of all “good for kids” entries are positive:

	Not good for kids	Good for kids	Total
n	3667	16424	20091
% Total	18.3%	81.7%	100%
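
The skew matters for model selection: a trivial classifier that always answers "good for kids" already reaches high plain accuracy. The short sketch below works this out from the counts in the table; the variable names are illustrative.

```python
# Class counts taken from the table above
counts = {"not_good": 3667, "good": 16424}
total = sum(counts.values())
share_good = counts["good"] / total

# Baseline: always predict "good for kids"
accuracy = share_good          # every positive is right, every negative is wrong
sensitivity = 1.0              # all positives are caught
specificity = 0.0              # no negative is ever identified
balanced_accuracy = (sensitivity + specificity) / 2

print(f"accuracy={accuracy:.3f}, balanced accuracy={balanced_accuracy:.2f}")
```

Plain accuracy of about 0.82 is therefore the floor, not an achievement, which is why metrics such as balanced accuracy and Kappa are more informative for this outcome.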

Results: Random Forest as best model

Key Metrics

	RF	LB	NN	GAM
AUROC	0.91	0.87	0.83	0.56
Balanced Accuracy	0.75	0.71	0.72	0.53
Sensitivity	0.97	0.97	0.92	0.90
Specificity	0.54	0.46	0.53	0.15
Kappa	0.57	0.50	0.46	0.06

RF=Random Forest, LB=LogitBoost, NN=Neural Net, GAM=Generalized Additive Model
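
For reference, the threshold-based metrics in the table can all be derived from a 2x2 confusion matrix. The sketch below uses hypothetical counts, not the study's results, and the function name `classification_metrics` is illustrative.

```python
def classification_metrics(tp, fn, tn, fp):
    # tp/fn: actual positives predicted right/wrong; tn/fp: same for negatives
    n = tp + fn + tn + fp
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    balanced_accuracy = (sensitivity + specificity) / 2
    # Cohen's kappa: observed agreement corrected for chance agreement
    observed = (tp + tn) / n
    expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / n ** 2
    kappa = (observed - expected) / (1 - expected)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": balanced_accuracy,
        "kappa": kappa,
    }

# Hypothetical confusion matrix, for illustration only
m = classification_metrics(tp=90, fn=10, tn=30, fp=20)
```

Balanced accuracy is simply the mean of sensitivity and specificity, which matches the table: for RF, (0.97 + 0.54) / 2 is about 0.75 after rounding.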

Random Forest emerges as the clear winner on every metric. GAM performs surprisingly poorly, despite being the ensemble model.

ROC Curves

[Figure: ROC curves of the four models. The GAM curve shows performance against testing data, the others against validation data.]

Interpretation

How well can child-friendliness be predicted?

Overall, RF predicts child-friendliness fairly well (good AUROC, Kappa, Balanced Accuracy).
Excellent sensitivity (0.97), but poor specificity (0.54)
Solid basis for a (positive) recommendation engine

Possible Improvements

Get more data on child-unfriendly venues (e.g. from review texts)
Improve imputation algorithm
Use expert knowledge to discover hidden variables
Include additional variables (e.g. opening times)
Use other algorithms (SVM, deep learning)