LI SUN
2015-11
Data: the Yelp Dataset Challenge, http://www.yelp.com/dataset_challenge
So, do people care about wifi? What do they say about wifi?
People actually don't care about wifi as much as I do :) So, should a restaurant owner provide wifi?
In this stage, I convert all variables to the appropriate class and choose some variables by common sense. I also check the correlations among the variables and remove some highly correlated ones, so that the design matrix is close to full rank.
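The class conversion and correlation screening described above could look roughly like the following. This is a minimal sketch on simulated data: the column names (stars, wifi, review_count, checkins) and the 0.9 cutoff are assumptions made for illustration, not the variables or threshold used in the actual analysis.

```r
# Toy data; every column name here is a hypothetical stand-in.
set.seed(1)
n <- 200
toy <- data.frame(
  stars        = as.character(sample(1:5, n, replace = TRUE)),
  wifi         = sample(c("no", "free", "paid"), n, replace = TRUE),
  review_count = rpois(n, 30),
  stringsAsFactors = FALSE
)
# A second numeric column that is nearly collinear with review_count.
toy$checkins <- 2 * toy$review_count + rnorm(n, sd = 0.5)

# Convert each variable to an appropriate class.
toy$stars <- as.numeric(toy$stars)
toy$wifi  <- factor(toy$wifi, levels = c("no", "free", "paid"))

# Correlations among the numeric predictors.
num_vars <- c("review_count", "checkins")
cors <- cor(toy[, num_vars])

# For each pair above the cutoff, flag the later variable for removal.
cutoff <- 0.9  # assumed threshold
high <- which(abs(cors) > cutoff & upper.tri(cors), arr.ind = TRUE)
drop_vars <- unique(colnames(cors)[high[, "col"]])
kept <- toy[, setdiff(names(toy), drop_vars)]

# Check that the resulting design matrix has full column rank.
X <- model.matrix(stars ~ ., data = kept)
full_rank <- qr(X)$rank == ncol(X)
print(drop_vars)
print(full_rank)
```

Dropping the later variable of each correlated pair is only one possible rule; which member of a pair to keep is a judgment call.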
In this stage, I start with the full model, an ordinary least squares linear regression fitted with the lm function. Results for several different models:
          Rsqr       sigma      vif.ave_pos
original  0.6047648  0.6467814   9.264232
weighted  0.6484257  1.5147396  15.608399
centered  0.6047648  0.6286774   9.264232
W and C   0.6484257  1.4723406  15.608399
As you can see, the full model explains almost 65% of the variation in ratings, which is good. The weighted model has a higher R squared (0.648 versus 0.605 for the unweighted fit), so we will use the weighted model.
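A comparison of this kind can be sketched with base R as follows. The data are simulated, the predictors x1 and x2 and the weight variable review_count are hypothetical, and the helper mean_vif is an assumption: it averages the variance inflation factors taken from the inverse of the (optionally weighted) correlation matrix of the predictors, which may not match how vif.ave_pos was computed.

```r
set.seed(2)
n <- 300
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$x2 <- toy$x2 + 0.5 * toy$x1            # make the predictors correlated
toy$review_count <- rpois(n, 20) + 1       # hypothetical weight variable
toy$stars <- 3.5 + 0.4 * toy$x1 - 0.3 * toy$x2 + rnorm(n, sd = 0.6)

# Mean VIF: VIF_j is the j-th diagonal entry of the inverse of the
# predictors' correlation matrix (weighted when weights are given).
mean_vif <- function(X, w = rep(1, nrow(X))) {
  R <- cov.wt(X, wt = w, cor = TRUE)$cor
  mean(diag(solve(R)))
}

# Collect R squared, residual standard error, and mean VIF for one fit.
summarise_fit <- function(fit, X, w = rep(1, nrow(X))) {
  s <- summary(fit)
  c(Rsqr = s$r.squared, sigma = s$sigma, vif = mean_vif(X, w))
}

X  <- as.matrix(toy[, c("x1", "x2")])
Xc <- scale(X, center = TRUE, scale = FALSE)   # centered predictors
toy_c <- data.frame(stars = toy$stars, Xc, review_count = toy$review_count)

fit_orig <- lm(stars ~ x1 + x2, data = toy)
fit_wt   <- lm(stars ~ x1 + x2, data = toy,   weights = review_count)
fit_cen  <- lm(stars ~ x1 + x2, data = toy_c)
fit_wc   <- lm(stars ~ x1 + x2, data = toy_c, weights = review_count)

results <- rbind(
  original  = summarise_fit(fit_orig, X),
  weighted  = summarise_fit(fit_wt,   X,  toy$review_count),
  centered  = summarise_fit(fit_cen,  Xc),
  "W and C" = summarise_fit(fit_wc,   Xc, toy$review_count)
)
print(results)
```

The weights argument of lm minimizes the weighted sum of squared residuals, so sigma for a weighted fit is on a different scale from sigma for an unweighted fit.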
Finally, the lasso was used to select variables.
These are the variables left after using lambda.1se in the glmnet package.
Take-home message: free wifi is a tiny plus, but paid wifi is a big no!
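As an illustration of this selection step, here is a minimal sketch using cv.glmnet on simulated data. The predictors v1 to v8, the weights w, and the choice to pass observation weights to the lasso are all assumptions; the sketch only runs when the glmnet package is installed.

```r
if (requireNamespace("glmnet", quietly = TRUE)) {
  set.seed(3)
  n <- 300
  p <- 8
  # Simulated predictors; only v1 and v2 truly affect the response.
  X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("v", 1:p)))
  y <- 3.5 + 0.8 * X[, "v1"] - 0.6 * X[, "v2"] + rnorm(n, sd = 0.5)
  w <- rpois(n, 20) + 1   # hypothetical observation weights

  # Cross-validated lasso (alpha = 1 is the lasso penalty).
  cv_fit <- glmnet::cv.glmnet(X, y, weights = w, alpha = 1)

  # Coefficients at lambda.1se: the largest lambda whose cross-validated
  # error is within one standard error of the minimum.
  beta <- as.matrix(coef(cv_fit, s = "lambda.1se"))[, 1]
  selected <- setdiff(names(beta)[beta != 0], "(Intercept)")
  print(selected)
}
```

Compared with lambda.min, lambda.1se applies a stronger penalty and so tends to keep fewer variables.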