First load in the data:
test = read.csv("yelp_test.csv", as.is=TRUE)
train = read.csv("yelp_train.csv", as.is = TRUE)
Lets take a look at the data:
summary(head(train))
funny useful user_cool characters user_average_stars
Min. :0.000 Min. :0.00 Min. : 0.00 Min. : 313.0 Min. :2.400
1st Qu.:0.000 1st Qu.:0.25 1st Qu.: 0.75 1st Qu.: 403.5 1st Qu.:3.143
Median :0.000 Median :1.50 Median : 27.00 Median : 761.5 Median :3.640
Mean :1.167 Mean :2.00 Mean : 208.17 Mean :1138.0 Mean :3.530
3rd Qu.:0.000 3rd Qu.:2.75 3rd Qu.: 103.50 3rd Qu.:1257.5 3rd Qu.:4.055
Max. :7.000 Max. :6.00 Max. :1074.00 Max. :3285.0 Max. :4.330
biz_review_count user_useful user_funny biz_open stars
Min. : 6.00 Min. : 1.0 Min. : 0.0 Min. :0.0000 Min. :1.00
1st Qu.: 6.75 1st Qu.: 5.5 1st Qu.: 0.0 1st Qu.:1.0000 1st Qu.:2.00
Median :40.50 Median : 78.5 Median : 13.5 Median :1.0000 Median :2.00
Mean :41.50 Mean : 282.3 Mean : 198.2 Mean :0.8333 Mean :2.50
3rd Qu.:75.75 3rd Qu.: 211.5 3rd Qu.: 49.5 3rd Qu.:1.0000 3rd Qu.:2.75
Max. :79.00 Max. :1299.0 Max. :1105.0 Max. :1.0000 Max. :5.00
date biz_stars words user_review_count cool
Length:6 Min. :3.000 Min. : 60.00 Min. : 5.00 Min. :0.0000
Class :character 1st Qu.:3.125 1st Qu.: 69.75 1st Qu.: 5.25 1st Qu.:0.0000
Mode :character Median :3.750 Median :145.50 Median : 61.50 Median :0.0000
Mean :3.583 Mean :215.83 Mean :115.83 Mean :0.8333
3rd Qu.:4.000 3rd Qu.:252.00 3rd Qu.:188.25 3rd Qu.:0.0000
Max. :4.000 Max. :612.00 Max. :350.00 Max. :5.0000
** Create a column called positive which is equal to 1 if the review is positive and 0 otherwise. **
The review is positive when it received either a 4 or 5 start review:
test$positive <- ifelse(test$stars <4,0,1)
train$positive <- ifelse(train$stars <4,0,1)
** Make any other transformations to the dataset you determine to be helpful or necessary. **
Lets remove the date column since it will not be relevant:
test$date =NULL
train$date = NULL
We will also remove the stars variable. It would be unfait to use it, because it alone would we able to predict with accuray 100%, because positive is directly created from it.
test$stars =NULL
train$stars = NULL
lr <- glm(positive ~ ., data=train, family = binomial())
Without any transformations the error rate is:
pred_lr = predict(lr, newdata = test, type = "response")
pred_lr_05 <- ifelse(pred_lr < 0.5,0,1)
mean(pred_lr_05 != test$positive)
[1] 0.314
With any transformations, the algorithm does not do well, getting 31% wrong.
** show all steps in model selection as well as the tuning of any parameters. **
One possibility is to use the weights parameter, that way if either postive or negative reviews are more common, it will take that into account.
pos = sum(train$positive == 1)
neg = sum(train$positive == 0)
pos
[1] 1518
neg
[1] 1482
pos_test = sum(test$positive == 1)
neg_test = sum(test$positive == 0)
pos_test
[1] 482
neg_test
[1] 518
Both the training and the test data are pretty well balanced in half. So the weight parameter of glm will not be necessary.
Lets try to choose a different cutting point rather than 0.5
pred_lr_06 <- ifelse(pred_lr < 0.6,0,1)
mean(pred_lr_06 != test$positive)
[1] 0.293
pred_lr_04 <- ifelse(pred_lr < 0.4,0,1)
mean(pred_lr_04 != test$positive)
[1] 0.349
Choosing different cutting points made it a bit better. Lets explore a bit more:
pred_lr_055 <- ifelse(pred_lr < 0.55,0,1)
mean(pred_lr_055 != test$positive)
[1] 0.299
pred_lr_06 <- ifelse(pred_lr < 0.6,0,1)
mean(pred_lr_06 != test$positive)
[1] 0.293
pred_lr_065 <- ifelse(pred_lr < 0.65,0,1)
mean(pred_lr_065 != test$positive)
[1] 0.301
pred_lr_07 <- ifelse(pred_lr < 0.7,0,1)
mean(pred_lr_07 != test$positive)
[1] 0.327
The best cut is at 0.6, where the error rate is 29.3%
** interpretation of the coefficients **
summary(lr)
Call:
glm(formula = positive ~ ., family = binomial(), data = train)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.1820 -0.9399 0.2903 0.8938 3.9722
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.8505708 0.3804446 -18.007 < 2e-16 ***
funny -0.3727667 0.0582204 -6.403 1.53e-10 ***
useful -0.2458292 0.0434809 -5.654 1.57e-08 ***
user_cool 0.0001540 0.0006746 0.228 0.8194
characters 0.0043128 0.0010592 4.072 4.67e-05 ***
user_average_stars 0.3428955 0.0415749 8.248 < 2e-16 ***
biz_review_count -0.0003048 0.0003183 -0.958 0.3382
user_useful -0.0003098 0.0005125 -0.604 0.5456
user_funny 0.0002966 0.0002457 1.207 0.2273
biz_open 0.2280944 0.1470682 1.551 0.1209
biz_stars 1.6389402 0.0881532 18.592 < 2e-16 ***
words -0.0262599 0.0057694 -4.552 5.32e-06 ***
user_review_count -0.0008507 0.0003937 -2.161 0.0307 *
cool 0.6630444 0.0634887 10.443 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 4158.5 on 2999 degrees of freedom
Residual deviance: 3269.5 on 2986 degrees of freedom
AIC: 3297.5
Number of Fisher Scoring iterations: 5
The most important coefficients are funny, useful, characters, user_average_stars, biz_stars, words and cool. But the most important one is user_review_count. The others are not significant.
It makes sense that user_review count makes a difference, usually people will comment more on things that they like.
** Summarize your findings **
Logistic Regression does not seem like a good option for this data set. The minimal error found was around 29%. And in general the predictors don’t seem very significant.
#install.packages("e1071")
library(e1071)
package <U+393C><U+3E31>e1071<U+393C><U+3E32> was built under R version 3.3.3
svm_lin <- svm(positive ~ ., data=train, kernel="linear")
Without doing any tuning lest see the test error:
pred_lin = predict(svm_lin,newdata = test)
pred_lin_05 <- ifelse(pred_lin < 0.5,0,1)
mean(pred_lin_05 != test$positive)
[1] 0.339
The test error is even higher than linear regression, with an error of around 34%.
** show all steps in model selection as well as the tuning of any parameters. **
Lets do some tuning to get a better model. To start with lets tune the cost variable:
set.seed(222)
svm_lin_tune <- tune(svm, positive ~ ., data=train, kernel="linear", ranges=list(cost = c(0.1,1,2,5)))
svm_lin_best <- svm_lin_tune$best.model
pred_lin = predict(svm_lin_best,newdata = test)
pred_lin_05 <- ifelse(pred_lin < 0.5,0,1)
mean(pred_lin_05 != test$positive)
[1] 0.339
summary(svm_lin_best)
Call:
best.tune(method = svm, train.x = positive ~ ., data = train, ranges = list(cost = c(0.1,
1, 2, 5)), kernel = "linear")
Parameters:
SVM-Type: eps-regression
SVM-Kernel: linear
cost: 1
gamma: 0.07692308
epsilon: 0.1
Number of Support Vectors: 2787
The error did not dicrease and the error is still the same.
** interpretation of the coefficients **
The algorithm created 2787 supoort vectors. The best cost found is still 1.
** Summarize your findings **
Linear SVM did not perform well. The best found test error was 34%. Maybe polynomials or radial kernels will perform better.
#install.packages("e1071")
library(e1071)
package <U+393C><U+3E31>e1071<U+393C><U+3E32> was built under R version 3.3.3
svm_pol <- svm(positive ~ ., data=train, kernel="polynomial")
Without doing any tuning lest see the test error:
pred_pol = predict(svm_pol,newdata = test)
pred_pol_05 <- ifelse(pred_pol < 0.5,0,1)
mean(pred_pol_05 != test$positive)
[1] 0.431
The test error is even higher than linear regression, with an error of around 43%.
** show all steps in model selection as well as the tuning of any parameters. **
Lets do some tuning to get a better model. Lets tune the cost and gamma variables:
set.seed(222)
svm_pol1 <- svm(positive ~ ., data=train, kernel="polynomial", cost = 0.1)
svm_pol2 <- svm(positive ~ ., data=train, kernel="polynomial", cost = 0.5)
svm_pol3 <- svm(positive ~ ., data=train, kernel="polynomial", cost = 1)
svm_pol4 <- svm(positive ~ ., data=train, kernel="polynomial", cost = 5)
#svm_pol_tune <- tune(svm, positive ~ ., data=train, kernel="polynomial", ranges=list(cost = c(0.1,1,2,5), gamma = c(0.5,1,2,3)))
#svm_pol_best <- svm_pol_tune$best.model
pred_pol = predict(svm_pol1,newdata = test)
pred_pol <- ifelse(pred_pol < 0.5,0,1)
mean(pred_pol != test$positive)
[1] 0.343
pred_pol = predict(svm_pol2,newdata = test)
pred_pol <- ifelse(pred_pol < 0.5,0,1)
mean(pred_pol != test$positive)
[1] 0.353
pred_pol = predict(svm_pol3,newdata = test)
pred_pol <- ifelse(pred_pol < 0.5,0,1)
mean(pred_pol != test$positive)
[1] 0.345
pred_pol = predict(svm_pol4,newdata = test)
pred_pol <- ifelse(pred_pol < 0.5,0,1)
mean(pred_pol != test$positive)
[1] 0.324
Changing the cost did not change the results much. Lets try varying gamma, using the best cost found at 5.
set.seed(222)
svm_pol1 <- svm(positive ~ ., data=train, kernel="polynomial", degree = 1, cost = 5)
svm_pol2 <- svm(positive ~ ., data=train, kernel="polynomial", degree = 2, cost = 5)
svm_pol3 <- svm(positive ~ ., data=train, kernel="polynomial", degree = 3, cost = 5)
svm_pol4 <- svm(positive ~ ., data=train, kernel="polynomial", degree = 4, cost =5)
#svm_pol_tune <- tune(svm, positive ~ ., data=train, kernel="polynomial", ranges=list(cost = c(0.1,1,2,5), gamma = c(0.5,1,2,3)))
#svm_pol_best <- svm_pol_tune$best.model
pred_pol = predict(svm_pol1,newdata = test)
pred_pol <- ifelse(pred_pol < 0.5,0,1)
mean(pred_pol != test$positive)
[1] 0.367
pred_pol = predict(svm_pol2,newdata = test)
pred_pol <- ifelse(pred_pol < 0.5,0,1)
mean(pred_pol != test$positive)
[1] 0.366
pred_pol = predict(svm_pol3,newdata = test)
pred_pol <- ifelse(pred_pol < 0.5,0,1)
mean(pred_pol != test$positive)
[1] 0.324
pred_pol = predict(svm_pol4,newdata = test)
pred_pol <- ifelse(pred_pol < 0.5,0,1)
mean(pred_pol != test$positive)
[1] 0.344
** Summarize your findings **
best_model_svm = svm(positive ~ ., data=train, kernel="polynomial", degree = 3, cost = 5)
summary(best_model_svm)
Call:
svm(formula = positive ~ ., data = train, kernel = "polynomial", degree = 3, cost = 5)
Parameters:
SVM-Type: eps-regression
SVM-Kernel: polynomial
cost: 5
degree: 3
gamma: 0.07142857
coef.0: 0
epsilon: 0.1
Number of Support Vectors: 2249
pred_pol = predict(best_model_svm,newdata = test)
pred_pol <- ifelse(pred_pol < 0.25,0,1)
mean(pred_pol != test$positive)
[1] 0.264
The best model found has cost 5, cuts at 0.25, and has degree 3. The test error for it is 26%.
#install.packages("e1071")
library(e1071)
svm_rad <- svm(positive ~ ., data=train, kernel="radial")
Without doing any tuning lest see the test error for a radial svm:
pred_rad = predict(svm_rad,newdata = test)
pred_rad_05 <- ifelse(pred_rad < 0.5,0,1)
mean(pred_rad_05 != test$positive)
[1] 0.249
The test error for radial kernel seems more promising. Without any tuning it is so far the best test error. Even though it is still quite high at 25%
** show all steps in model selection as well as the tuning of any parameters. **
Lets do some tuning to get a better model. Lets tune the cost and gamma variables:
set.seed(222)
svm_pol1 <- svm(positive ~ ., data=train, kernel="radial", cost = 0.1)
svm_pol2 <- svm(positive ~ ., data=train, kernel="radial", cost = 0.5)
svm_pol3 <- svm(positive ~ ., data=train, kernel="radial", cost = 1)
svm_pol4 <- svm(positive ~ ., data=train, kernel="radial", cost = 5)
pred_pol = predict(svm_pol1,newdata = test)
pred_pol <- ifelse(pred_pol < 0.5,0,1)
mean(pred_pol != test$positive)
[1] 0.267
pred_pol = predict(svm_pol2,newdata = test)
pred_pol <- ifelse(pred_pol < 0.5,0,1)
mean(pred_pol != test$positive)
[1] 0.246
pred_pol = predict(svm_pol3,newdata = test)
pred_pol <- ifelse(pred_pol < 0.5,0,1)
mean(pred_pol != test$positive)
[1] 0.249
pred_pol = predict(svm_pol4,newdata = test)
pred_pol <- ifelse(pred_pol < 0.5,0,1)
mean(pred_pol != test$positive)
[1] 0.256
Changing the cost did not change the results much. Lets try varying gamma, using the best cost found at 0.5.
set.seed(222)
svm_pol1 <- svm(positive ~ ., data=train, kernel="radial", cost = 0.5, gamma = 0.1)
svm_pol2 <- svm(positive ~ ., data=train, kernel="radial", cost = 0.5, gamma = 0.2)
svm_pol3 <- svm(positive ~ ., data=train, kernel="radial", cost = 0.5, gamma = 0.3)
svm_pol4 <- svm(positive ~ ., data=train, kernel="radial", cost = 0.5, gamma = 0.5)
pred_pol = predict(svm_pol1,newdata = test)
pred_pol <- ifelse(pred_pol < 0.5,0,1)
mean(pred_pol != test$positive)
[1] 0.25
pred_pol = predict(svm_pol2,newdata = test)
pred_pol <- ifelse(pred_pol < 0.5,0,1)
mean(pred_pol != test$positive)
[1] 0.258
pred_pol = predict(svm_pol3,newdata = test)
pred_pol <- ifelse(pred_pol < 0.5,0,1)
mean(pred_pol != test$positive)
[1] 0.261
pred_pol = predict(svm_pol4,newdata = test)
pred_pol <- ifelse(pred_pol < 0.5,0,1)
mean(pred_pol != test$positive)
[1] 0.275
The default gamma seems like the best option.
Lets try varying where we cut:
svm_pol2 <- svm(positive ~ ., data=train, kernel="radial", cost = 0.5)
pred_pol = predict(svm_pol2,newdata = test)
pred_pol <- ifelse(pred_pol < 0.25,0,1)
mean(pred_pol != test$positive)
[1] 0.284
pred_pol = predict(svm_pol2,newdata = test)
pred_pol <- ifelse(pred_pol < 0.3,0,1)
mean(pred_pol != test$positive)
[1] 0.271
pred_pol = predict(svm_pol2,newdata = test)
pred_pol <- ifelse(pred_pol < 0.35,0,1)
mean(pred_pol != test$positive)
[1] 0.257
pred_pol = predict(svm_pol2,newdata = test)
pred_pol <- ifelse(pred_pol < 0.4,0,1)
mean(pred_pol != test$positive)
[1] 0.248
pred_pol = predict(svm_pol2,newdata = test)
pred_pol <- ifelse(pred_pol < 0.45,0,1)
mean(pred_pol != test$positive)
[1] 0.247
pred_pol = predict(svm_pol2,newdata = test)
pred_pol <- ifelse(pred_pol < 0.50,0,1)
mean(pred_pol != test$positive)
[1] 0.246
pred_pol = predict(svm_pol2,newdata = test)
pred_pol <- ifelse(pred_pol < 0.55,0,1)
mean(pred_pol != test$positive)
[1] 0.25
pred_pol = predict(svm_pol2,newdata = test)
pred_pol <- ifelse(pred_pol < 0.60,0,1)
mean(pred_pol != test$positive)
[1] 0.26
The best cut found is still at 0.5.
** Summarize your findings **
best_model_svm = svm(positive ~ ., data=train, kernel="radial", cost = 0.5)
summary(best_model_svm)
Call:
svm(formula = positive ~ ., data = train, kernel = "radial", cost = 0.5)
Parameters:
SVM-Type: eps-regression
SVM-Kernel: radial
cost: 0.5
gamma: 0.07142857
epsilon: 0.1
Number of Support Vectors: 2190
pred_pol = predict(best_model_svm,newdata = test)
pred_pol <- ifelse(pred_pol < 0.5,0,1)
mean(pred_pol != test$positive)
[1] 0.246
The best model found has cost 0.5, cuts at 0.5, and has the default gamma of 0.07. It created 2190 support vectors. The best test error for it is 24,6%.
** Compare the linear, polynomial, and radial kernels. **
Out of all the models, the radial kernel worked the best.