Summary presentation of key findings
Timm Suess, 2015-11-13
Research question
How well can the “good for kids” rating of restaurants and food businesses be predicted from business features such as Yelp category, service attributes, city, and keywords in the business's name?
Data Source
Application
Preprocessing
Model building
Outcome variable is skewed
More than 80% of all “good for kids” entries are positive:
| | Not good for kids | Good for kids | Total |
|---|---|---|---|
| n | 3667 | 16424 | 20091 |
| % Total | 18.3% | 81.7% | 100% |
Key Metrics
| | RF | LB | NN | GAM |
|---|---|---|---|---|
| AUROC | 0.91 | 0.87 | 0.83 | 0.56 |
| Balanced Accuracy | 0.75 | 0.71 | 0.72 | 0.53 |
| Sensitivity | 0.97 | 0.97 | 0.92 | 0.90 |
| Specificity | 0.54 | 0.46 | 0.53 | 0.15 |
| Kappa | 0.57 | 0.50 | 0.46 | 0.06 |
RF = Random Forest, LB = LogitBoost, NN = Neural Net, GAM = Generalized Additive Model
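Random Forest emerges as the clear winner: it has the highest AUROC, balanced accuracy, specificity, and Kappa, and ties LogitBoost on sensitivity. GAM performs surprisingly poorly (AUROC 0.56, Kappa 0.06), despite being the ensemble model.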
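The metrics in the table are related: balanced accuracy is the mean of sensitivity and specificity. As a rough consistency check, the sketch below derives balanced accuracy from the reported (rounded) sensitivity and specificity and compares it with the reported value. The dictionary layout and function name are illustrative choices; small differences are expected because the inputs are rounded to two decimals.

```python
# Values copied from the Key Metrics table (rounded to two decimals).
reported = {
    "RF":  {"sensitivity": 0.97, "specificity": 0.54, "balanced_accuracy": 0.75},
    "LB":  {"sensitivity": 0.97, "specificity": 0.46, "balanced_accuracy": 0.71},
    "NN":  {"sensitivity": 0.92, "specificity": 0.53, "balanced_accuracy": 0.72},
    "GAM": {"sensitivity": 0.90, "specificity": 0.15, "balanced_accuracy": 0.53},
}


def balanced_accuracy(sensitivity: float, specificity: float) -> float:
    """Mean of the true positive rate and the true negative rate."""
    return (sensitivity + specificity) / 2


derived = {
    model: balanced_accuracy(m["sensitivity"], m["specificity"])
    for model, m in reported.items()
}

for model, value in derived.items():
    print(f"{model}: derived {value:.3f}, "
          f"reported {reported[model]['balanced_accuracy']:.2f}")
```

The low specificity of all four models shows that the “not good for kids” minority class is the hard one to detect, which is why balanced accuracy sits well below sensitivity.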
ROC Curves
(GAM curve shows performance against testing data, others against validation data)
How well can child-friendliness be predicted?
Possible Improvements