This report uses the Yelp hygiene dataset to predict whether a restaurant will pass its public health inspection. The dataset is composed of a training subset of 546 restaurants, used to train the classifier, and a testing subset of 12753 restaurants, used to evaluate its performance. The data are spread across 3 files such that the first 546 lines in each file correspond to the training subset and the remaining lines to the testing subset. Below is a description of each file:
- hygiene.dat: Each line contains the concatenated text reviews of one restaurant.
- hygiene.dat.labels: For the first 546 lines, a binary label (0 or 1) is used where a 0 indicates that the restaurant has passed the latest public health inspection test, while a 1 means that the restaurant has failed the test. The rest of the lines have “[None]” in their label field implying that they are part of the testing subset.
- hygiene.dat.additional: It is a CSV (Comma-Separated Values) file where the first value is a list containing the cuisines offered, the second value is the zip code, which gives an idea about the location, the third is the number of reviews, and the fourth is the average rating, which can vary between 0 and 5 (5 being the best).
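A minimal sketch of loading the three files into R might look as follows; the file paths, and the assumption that the cuisine list is a properly quoted CSV field, are ours.
# Read the concatenated reviews, the labels and the additional metadata.
hygiene.reviews <- readLines("hygiene.dat")
hygiene.labels.raw <- readLines("hygiene.dat.labels")
hygiene.additional <- read.csv("hygiene.dat.additional", header = FALSE,
                               stringsAsFactors = FALSE)
# Only the first 546 restaurants carry a 0/1 label; the remaining lines are "[None]".
hygiene.labels <- as.numeric(hygiene.labels.raw[1:546])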
As a first step, we read all the reviews from hygiene.dat into a data frame and build a term-document matrix for each restaurant. Since the training subset consists of only 546 restaurants, we use only the first 546 restaurant reviews from hygiene.dat to train our models. We explore multiple models, first using unigrams from the term-document matrix as features and then using bigrams. Before extracting the unigrams and bigrams, we clean up the reviews with the tm package: we convert all text to lowercase, remove profanity, remove stopwords, and apply stemming. We also drop sparse terms, allowing at most 10% sparseness in the term-document matrix.
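A minimal sketch of this preprocessing with the tm package is shown below; the profanity word list (profanity.words) and the bigram tokenizer are our own assumptions, not part of any published code.
library(tm)
library(SnowballC)
corpus <- VCorpus(VectorSource(hygiene.reviews))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, profanity.words)  # hypothetical profanity word list
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stemDocument)
corpus <- tm_map(corpus, stripWhitespace)
# Unigram term-document matrix for the 546 training restaurants, allowing at
# most 10% sparseness.
unigram.dtm <- removeSparseTerms(DocumentTermMatrix(corpus[1:546]), 0.1)
hygiene.unigramDF <- as.data.frame(as.matrix(unigram.dtm))
# Bigram term-document matrix via a simple two-word tokenizer (NLP package).
BigramTokenizer <- function(x)
  unlist(lapply(NLP::ngrams(NLP::words(x), 2), paste, collapse = " "))
bigram.dtm <- removeSparseTerms(DocumentTermMatrix(corpus[1:546],
                control = list(tokenize = BigramTokenizer)), 0.1)
hygiene.bigramDF <- as.data.frame(as.matrix(bigram.dtm))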
We have 184 unigram features, and 57 bigram features.
Unigram Feature Set
colnames(hygiene.unigramDF)
## [1] "actually" "almost" "also" "always" "amazing"
## [6] "amp" "another" "anything" "area" "around"
## [11] "atmosphere" "away" "awesome" "back" "bad"
## [16] "bar" "best" "better" "big" "bit"
## [21] "came" "can" "cant" "cheese" "chicken"
## [26] "come" "couple" "day" "decent" "definitely"
## [31] "delicious" "didnt" "different" "dinner" "dish"
## [36] "dishes" "dont" "drink" "eat" "eating"
## [41] "else" "enough" "especially" "even" "ever"
## [46] "every" "everything" "excellent" "experience" "favorite"
## [51] "feel" "find" "first" "flavor" "food"
## [56] "found" "fresh" "friend" "friendly" "friends"
## [61] "full" "get" "give" "go" "going"
## [66] "good" "got" "great" "happy" "hard"
## [71] "home" "hot" "however" "huge" "id"
## [76] "ill" "im" "isnt" "ive" "just"
## [81] "kind" "know" "large" "last" "least"
## [86] "like" "little" "long" "looking" "lot"
## [91] "love" "lunch" "made" "make" "makes"
## [96] "many" "maybe" "meal" "meat" "menu"
## [101] "might" "minutes" "much" "need" "never"
## [106] "new" "next" "nice" "night" "nothing"
## [111] "now" "oh" "ok" "one" "order"
## [116] "ordered" "people" "perfect" "place" "places"
## [121] "pretty" "price" "prices" "probably" "quality"
## [126] "quick" "quite" "really" "recommend" "restaurant"
## [131] "right" "said" "salad" "sauce" "say"
## [136] "seattle" "see" "served" "service" "side"
## [141] "since" "small" "something" "special" "spot"
## [146] "staff" "stars" "still" "super" "sure"
## [151] "sweet" "table" "take" "taste" "tasty"
## [156] "thats" "thing" "things" "think" "though"
## [161] "thought" "time" "times" "took" "top"
## [166] "tried" "try" "two" "us" "used"
## [171] "usually" "wait" "want" "wanted" "wasnt"
## [176] "way" "well" "went" "will" "without"
## [181] "work" "worth" "years" "youre"
Bigram Feature Set
colnames(hygiene.bigramDF)
## [1] "back try" "best ive" "can get"
## [4] "cant wait" "come back" "coming back"
## [7] "dont get" "dont know" "dont like"
## [10] "dont think" "even though" "every time"
## [13] "feel like" "felt like" "first time"
## [16] "food good" "food great" "food service"
## [19] "go back" "going back" "good food"
## [22] "good place" "good service" "great food"
## [25] "great place" "great service" "happy hour"
## [28] "highly recommend" "im sure" "ive ever"
## [31] "ive never" "just right" "last night"
## [34] "last time" "like place" "little bit"
## [37] "long time" "love place" "make sure"
## [40] "much better" "next time" "one best"
## [43] "one favorite" "place go" "place great"
## [46] "pretty good" "pretty much" "really good"
## [49] "really like" "really nice" "service good"
## [52] "service great" "staff friendly" "tasted like"
## [55] "wait staff" "will back" "will definitely"
We also have additional information about each restaurant in hygiene.dat.additional that we can use to enhance our feature set. We extract features from this additional data for each restaurant and add them to the existing feature set.
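One possible way to do this is sketched below; the column names and the cuisine encoding via a small document-term matrix are assumptions. The zip code, review count and average rating are appended as numeric columns, and each cuisine word becomes its own indicator-style column ("american", "breakfast", "pizza", ...).
colnames(hygiene.additional) <- c("cuisines", "zip", "reviews", "rating")
extra <- hygiene.additional[1:546, ]
# Encode the cuisine lists as one column per cuisine word.
cuisine.corpus <- VCorpus(VectorSource(tolower(extra$cuisines)))
cuisine.dtm <- DocumentTermMatrix(cuisine.corpus,
                                  control = list(removePunctuation = TRUE))
cuisine.dtm <- removeSparseTerms(cuisine.dtm, 0.9)
cuisineDF <- as.data.frame(as.matrix(cuisine.dtm))
# Append the numeric columns and the cuisine columns to both feature sets.
extraDF <- cbind(extra[, c("zip", "reviews", "rating")], cuisineDF)
hygiene.unigramDF <- cbind(hygiene.unigramDF, extraDF)
hygiene.bigramDF <- cbind(hygiene.bigramDF, extraDF)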
After adding the additional features, we have 204 unigram features and 77 bigram features.
Unigram Feature set after adding additional features
colnames(hygiene.unigramDF)
## [1] "actually" "almost" "also" "always" "amazing"
## [6] "amp" "another" "anything" "area" "around"
## [11] "atmosphere" "away" "awesome" "back" "bad"
## [16] "bar" "best" "better" "big" "bit"
## [21] "came" "can" "cant" "cheese" "chicken"
## [26] "come" "couple" "day" "decent" "definitely"
## [31] "delicious" "didnt" "different" "dinner" "dish"
## [36] "dishes" "dont" "drink" "eat" "eating"
## [41] "else" "enough" "especially" "even" "ever"
## [46] "every" "everything" "excellent" "experience" "favorite"
## [51] "feel" "find" "first" "flavor" "food"
## [56] "found" "fresh" "friend" "friendly" "friends"
## [61] "full" "get" "give" "go" "going"
## [66] "good" "got" "great" "happy" "hard"
## [71] "home" "hot" "however" "huge" "id"
## [76] "ill" "im" "isnt" "ive" "just"
## [81] "kind" "know" "large" "last" "least"
## [86] "like" "little" "long" "looking" "lot"
## [91] "love" "lunch" "made" "make" "makes"
## [96] "many" "maybe" "meal" "meat" "menu"
## [101] "might" "minutes" "much" "need" "never"
## [106] "new" "next" "nice" "night" "nothing"
## [111] "now" "oh" "ok" "one" "order"
## [116] "ordered" "people" "perfect" "place" "places"
## [121] "pretty" "price" "prices" "probably" "quality"
## [126] "quick" "quite" "really" "recommend" "restaurant"
## [131] "right" "said" "salad" "sauce" "say"
## [136] "seattle" "see" "served" "service" "side"
## [141] "since" "small" "something" "special" "spot"
## [146] "staff" "stars" "still" "super" "sure"
## [151] "sweet" "table" "take" "taste" "tasty"
## [156] "thats" "thing" "things" "think" "though"
## [161] "thought" "time" "times" "took" "top"
## [166] "tried" "try" "two" "us" "used"
## [171] "usually" "wait" "want" "wanted" "wasnt"
## [176] "way" "well" "went" "will" "without"
## [181] "work" "worth" "years" "youre" "zip"
## [186] "reviews" "rating" "american" "bars" "breakfast"
## [191] "brunch" "chinese" "fast" "food" "italian"
## [196] "japanese" "mexican" "new" "pizza" "sandwiches"
## [201] "seafood" "thai" "traditional" "vietnamese"
Bigram Feature set after adding additional features
colnames(hygiene.bigramDF)
## [1] "back try" "best ive" "can get"
## [4] "cant wait" "come back" "coming back"
## [7] "dont get" "dont know" "dont like"
## [10] "dont think" "even though" "every time"
## [13] "feel like" "felt like" "first time"
## [16] "food good" "food great" "food service"
## [19] "go back" "going back" "good food"
## [22] "good place" "good service" "great food"
## [25] "great place" "great service" "happy hour"
## [28] "highly recommend" "im sure" "ive ever"
## [31] "ive never" "just right" "last night"
## [34] "last time" "like place" "little bit"
## [37] "long time" "love place" "make sure"
## [40] "much better" "next time" "one best"
## [43] "one favorite" "place go" "place great"
## [46] "pretty good" "pretty much" "really good"
## [49] "really like" "really nice" "service good"
## [52] "service great" "staff friendly" "tasted like"
## [55] "wait staff" "will back" "will definitely"
## [58] "zip" "reviews" "rating"
## [61] "american" "bars" "breakfast"
## [64] "brunch" "chinese" "fast"
## [67] "food" "italian" "japanese"
## [70] "mexican" "new" "pizza"
## [73] "sandwiches" "seafood" "thai"
## [76] "traditional" "vietnamese"
We can further reduce the feature set by removing variables with near-zero variance, using caret's nearZeroVar.
library(caret)
# Drop near-zero-variance columns from both the training features and the
# corresponding test data.
nzv <- nearZeroVar(hygiene.unigramDF, saveMetrics = FALSE)
hygiene.unigramDF <- hygiene.unigramDF[, -nzv]
unigramtestData <- unigramtestData[, -nzv]
nzv <- nearZeroVar(hygiene.bigramDF, saveMetrics = FALSE)
hygiene.bigramDF <- hygiene.bigramDF[, -nzv]
bigramtestData <- bigramtestData[, -nzv]
Next we check for correlation between features; when two variables are highly correlated, only one of them is needed. We use caret's findCorrelation with a cutoff of 0.75.
x <- cor(hygiene.unigramDF)
y <- findCorrelation(x, cutoff = .75)
hygiene.unigramDF <- hygiene.unigramDF[,-y]
unigramtestData <- unigramtestData[,-y]
After removing correlated variables we have 154 unigram features.
Unigram Feature set after removing correlated variables
colnames(hygiene.unigramDF)
## [1] "actually" "almost" "always" "amazing" "amp"
## [6] "another" "anything" "area" "atmosphere" "away"
## [11] "awesome" "bad" "bar" "better" "big"
## [16] "bit" "cheese" "chicken" "come" "couple"
## [21] "day" "decent" "different" "dinner" "dish"
## [26] "dishes" "drink" "eat" "eating" "else"
## [31] "enough" "especially" "every" "everything" "excellent"
## [36] "favorite" "feel" "find" "flavor" "found"
## [41] "fresh" "friend" "friendly" "full" "give"
## [46] "going" "got" "happy" "hard" "home"
## [51] "hot" "however" "huge" "id" "ill"
## [56] "isnt" "kind" "know" "large" "last"
## [61] "least" "looking" "lot" "love" "lunch"
## [66] "made" "makes" "many" "maybe" "meal"
## [71] "meat" "menu" "might" "minutes" "need"
## [76] "never" "new" "next" "nice" "night"
## [81] "nothing" "now" "oh" "ok" "order"
## [86] "perfect" "places" "pretty" "price" "prices"
## [91] "probably" "quality" "quick" "quite" "recommend"
## [96] "right" "said" "salad" "sauce" "say"
## [101] "seattle" "see" "served" "side" "since"
## [106] "small" "something" "special" "spot" "staff"
## [111] "stars" "super" "sure" "sweet" "table"
## [116] "take" "taste" "tasty" "thing" "things"
## [121] "thought" "times" "took" "top" "tried"
## [126] "two" "used" "usually" "wait" "want"
## [131] "wanted" "wasnt" "way" "without" "work"
## [136] "worth" "years" "youre" "zip" "rating"
## [141] "american" "brunch" "chinese" "food.1" "italian"
## [146] "japanese" "mexican" "new.1" "pizza" "sandwiches"
## [151] "seafood" "thai" "traditional" "vietnamese"
x <- cor(hygiene.bigramDF)
y <- findCorrelation(x, cutoff = .75)
hygiene.bigramDF <- hygiene.bigramDF[,-y]
bigramtestData <- bigramtestData[,-y]
After removing correlated variables we have 74 bigram features.
Bigram Feature set after removing correlated variables
colnames(hygiene.bigramDF)
## [1] "back try" "best ive" "can get"
## [4] "cant wait" "come back" "coming back"
## [7] "dont get" "dont know" "dont like"
## [10] "dont think" "even though" "every time"
## [13] "feel like" "felt like" "first time"
## [16] "food good" "food great" "food service"
## [19] "go back" "going back" "good food"
## [22] "good place" "good service" "great food"
## [25] "great place" "great service" "happy hour"
## [28] "highly recommend" "im sure" "ive ever"
## [31] "ive never" "just right" "last night"
## [34] "last time" "like place" "little bit"
## [37] "long time" "love place" "make sure"
## [40] "much better" "next time" "one best"
## [43] "one favorite" "place go" "place great"
## [46] "pretty good" "pretty much" "really good"
## [49] "really like" "really nice" "service good"
## [52] "service great" "staff friendly" "tasted like"
## [55] "wait staff" "will back" "will definitely"
## [58] "zip" "reviews" "rating"
## [61] "american" "breakfast" "chinese"
## [64] "food" "italian" "japanese"
## [67] "mexican" "new" "pizza"
## [70] "sandwiches" "seafood" "thai"
## [73] "traditional" "vietnamese"
Now we partition the labeled training sample into training and testing subsets so that we can validate our classifier before applying it to the unlabeled test data. We use a 70:30 split: 70% of the labeled data is used to train the classifier and the remaining 30% is used for validation.
hygiene.unigramDF <- cbind(hygiene.unigramDF, hygiene.labels)
colnames(hygiene.unigramDF)[ncol(hygiene.unigramDF)] <- "inspection"
set.seed(12354)
unigram.train.idx <- sample(nrow(hygiene.unigramDF), ceiling(nrow(hygiene.unigramDF) * 0.7))
unigram.test.idx <- (1:nrow(hygiene.unigramDF))[-unigram.train.idx]
unigram.training <- hygiene.unigramDF[unigram.train.idx,]
unigram.testing <- hygiene.unigramDF[unigram.test.idx,]
unigram.trainingLabels <- as.factor(unigram.training$inspection)
unigram.testingLabels <- as.factor(unigram.testing$inspection)
unigram.training <- unigram.training[,-which(colnames(unigram.training) == "inspection")]
unigram.testing <- unigram.testing[,-which(colnames(unigram.testing) == "inspection")]
hygiene.bigramDF <- cbind(hygiene.bigramDF, hygiene.labels)
colnames(hygiene.bigramDF)[ncol(hygiene.bigramDF)] <- "inspection"
bigram.train.idx <- sample(nrow(hygiene.bigramDF), ceiling(nrow(hygiene.bigramDF) * 0.7))
bigram.test.idx <- (1:nrow(hygiene.bigramDF))[-bigram.train.idx]
bigram.training <- hygiene.bigramDF[bigram.train.idx,]
bigram.testing <- hygiene.bigramDF[bigram.test.idx,]
bigram.trainingLabels <- as.factor(bigram.training$inspection)
bigram.testingLabels <- as.factor(bigram.testing$inspection)
bigram.training <- bigram.training[,-which(colnames(bigram.training) == "inspection")]
bigram.testing <- bigram.testing[,-which(colnames(bigram.testing) == "inspection")]
We will train multiple machine-learning models and keep the one that gives us the best results: k-nearest neighbors, random forest, and stochastic gradient boosting. We will use F1 as the metric to be maximized, via a custom caret summary function.
library(ROCR)
# Custom caret summary function: F1 of the held-out predictions, computed
# with ROCR's F-measure.
F1 <- function(data, lev, model) {
  pred <- prediction(as.numeric(as.character(data$pred)),
                     as.numeric(as.character(data$obs)))
  out <- performance(pred, "f")@y.values[[1]][2]
  names(out) <- "F1"
  out
}
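This summary function plugs into caret's resampling control. A sketch, with fitControl as our name for the shared 10-fold cross-validation setup reused by every model below:
library(caret)
fitControl <- trainControl(method = "cv", number = 10, summaryFunction = F1)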
In k-NN classification, the output is a class membership: an object is classified by a majority vote of its neighbors and assigned to the class most common among its k nearest neighbors.
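A sketch of how the k-NN fits below could be produced with caret, assuming the fitControl object sketched above; the tuning grid k = 5, 7, 9 matches the values shown in the output.
unigram.knn.fit <- train(x = unigram.training, y = unigram.trainingLabels,
                         method = "knn", metric = "F1", trControl = fitControl,
                         tuneGrid = data.frame(k = c(5, 7, 9)))
# Validate on the 30% hold-out split.
unigram.knn.confusionmatrix <- confusionMatrix(predict(unigram.knn.fit, unigram.testing),
                                               unigram.testingLabels)
The bigram k-NN model is fit the same way from bigram.training and bigram.trainingLabels.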
unigram knn model
unigram.knn.fit
## k-Nearest Neighbors
##
## 383 samples
## 153 predictors
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 345, 344, 344, 345, 345, 345, ...
## Resampling results across tuning parameters:
##
## k F1 F1 SD
## 5 0.5890546 0.06779387
## 7 0.6250590 0.07753171
## 9 0.6357103 0.08532342
##
## F1 was used to select the optimal model using the largest value.
## The final value used for the model was k = 9.
confusion matrix for unigram knn model
unigram.knn.confusionmatrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 46 43
## 1 27 47
##
## Accuracy : 0.5706
## 95% CI : (0.4908, 0.6477)
## No Information Rate : 0.5521
## P-Value [Acc > NIR] : 0.3478
##
## Kappa : 0.1493
## Mcnemar's Test P-Value : 0.0730
##
## Sensitivity : 0.6301
## Specificity : 0.5222
## Pos Pred Value : 0.5169
## Neg Pred Value : 0.6351
## Prevalence : 0.4479
## Detection Rate : 0.2822
## Detection Prevalence : 0.5460
## Balanced Accuracy : 0.5762
##
## 'Positive' Class : 0
##
bigram knn model
bigram.knn.fit
## k-Nearest Neighbors
##
## 383 samples
## 73 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 344, 346, 345, 344, 345, 344, ...
## Resampling results across tuning parameters:
##
## k F1 F1 SD
## 5 0.5453647 0.11753720
## 7 0.5211313 0.11848540
## 9 0.5363748 0.09093464
##
## F1 was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.
confusion matrix for bigram knn model
bigram.knn.confusionmatrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 51 26
## 1 50 36
##
## Accuracy : 0.5337
## 95% CI : (0.4541, 0.6121)
## No Information Rate : 0.6196
## P-Value [Acc > NIR] : 0.989740
##
## Kappa : 0.0796
## Mcnemar's Test P-Value : 0.008333
##
## Sensitivity : 0.5050
## Specificity : 0.5806
## Pos Pred Value : 0.6623
## Neg Pred Value : 0.4186
## Prevalence : 0.6196
## Detection Rate : 0.3129
## Detection Prevalence : 0.4724
## Balanced Accuracy : 0.5428
##
## 'Positive' Class : 0
##
Random forests are an ensemble learning method for classification that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes of the individual trees. The algorithm for inducing a random forest was developed by Leo Breiman and Adele Cutler ("Random Forests" is their trademark). The method combines Breiman's bagging idea with random selection of features, introduced independently by Ho and by Amit and Geman, in order to construct a collection of decision trees with controlled variance.
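A sketch of the random forest fits, assuming the same fitControl object; the mtry grid matches the tuning values shown in the output (values larger than the number of predictors are clamped by randomForest with a warning).
rfGrid <- data.frame(mtry = c(10, 50, 100, 500, 1000))
unigram.rf.fit <- train(x = unigram.training, y = unigram.trainingLabels,
                        method = "rf", metric = "F1", trControl = fitControl,
                        tuneGrid = rfGrid)
bigram.rf.fit <- train(x = bigram.training, y = bigram.trainingLabels,
                       method = "rf", metric = "F1", trControl = fitControl,
                       tuneGrid = rfGrid)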
unigram rf model
unigram.rf.fit
## Random Forest
##
## 383 samples
## 153 predictors
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 345, 344, 345, 345, 344, 345, ...
## Resampling results across tuning parameters:
##
## mtry F1 F1 SD
## 10 0.6049382 0.07191913
## 50 0.5892803 0.06098618
## 100 0.6123985 0.07094478
## 500 0.5987488 0.06863563
## 1000 0.6070963 0.06390975
##
## F1 was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 100.
confusion matrix for unigram rf model
unigram.rf.confusionmatrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 51 38
## 1 24 50
##
## Accuracy : 0.6196
## 95% CI : (0.5404, 0.6944)
## No Information Rate : 0.5399
## P-Value [Acc > NIR] : 0.02422
##
## Kappa : 0.2448
## Mcnemar's Test P-Value : 0.09874
##
## Sensitivity : 0.6800
## Specificity : 0.5682
## Pos Pred Value : 0.5730
## Neg Pred Value : 0.6757
## Prevalence : 0.4601
## Detection Rate : 0.3129
## Detection Prevalence : 0.5460
## Balanced Accuracy : 0.6241
##
## 'Positive' Class : 0
##
bigram rf model
bigram.rf.fit
## Random Forest
##
## 383 samples
## 73 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 344, 345, 345, 345, 344, 345, ...
## Resampling results across tuning parameters:
##
## mtry F1 F1 SD
## 10 0.6537525 0.04721641
## 50 0.6615189 0.06886562
## 100 0.6377641 0.07695971
## 500 0.6206777 0.06910854
## 1000 0.6420762 0.06050119
##
## F1 was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 50.
confusion matrix for bigram rf model
bigram.rf.confusionmatrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 47 30
## 1 33 53
##
## Accuracy : 0.6135
## 95% CI : (0.5342, 0.6886)
## No Information Rate : 0.5092
## P-Value [Acc > NIR] : 0.004721
##
## Kappa : 0.2262
## Mcnemar's Test P-Value : 0.801059
##
## Sensitivity : 0.5875
## Specificity : 0.6386
## Pos Pred Value : 0.6104
## Neg Pred Value : 0.6163
## Prevalence : 0.4908
## Detection Rate : 0.2883
## Detection Prevalence : 0.4724
## Balanced Accuracy : 0.6130
##
## 'Positive' Class : 0
##
Gradient boosting is a machine-learning technique for regression and classification problems that produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model in a stage-wise fashion, like other boosting methods, and generalizes them by allowing optimization of an arbitrary differentiable loss function.
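A sketch of the boosting fits, again assuming fitControl; the tuning grid mirrors the combinations reported in the output (interaction depth 2 to 5, 100 to 500 trees, shrinkage 0.1, minimum node size 10).
gbmGrid <- expand.grid(interaction.depth = 2:5,
                       n.trees = seq(100, 500, by = 100),
                       shrinkage = 0.1,
                       n.minobsinnode = 10)
unigram.gbm.fit <- train(x = unigram.training, y = unigram.trainingLabels,
                         method = "gbm", metric = "F1", trControl = fitControl,
                         tuneGrid = gbmGrid, verbose = FALSE)
bigram.gbm.fit <- train(x = bigram.training, y = bigram.trainingLabels,
                        method = "gbm", metric = "F1", trControl = fitControl,
                        tuneGrid = gbmGrid, verbose = FALSE)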
unigram gbm model
unigram.gbm.fit
## Stochastic Gradient Boosting
##
## 383 samples
## 153 predictors
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 344, 345, 346, 345, 345, 344, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.trees F1 F1 SD
## 2 100 0.6324806 0.06978027
## 2 200 0.6293300 0.06769479
## 2 300 0.6130057 0.06217683
## 2 400 0.6063549 0.05824169
## 2 500 0.5959974 0.07142145
## 3 100 0.5896646 0.09144147
## 3 200 0.5718785 0.07660035
## 3 300 0.5584011 0.06444784
## 3 400 0.5881368 0.07923596
## 3 500 0.5655245 0.07952033
## 4 100 0.5833909 0.05070745
## 4 200 0.5728192 0.06939895
## 4 300 0.5625158 0.06988523
## 4 400 0.5693650 0.07504513
## 4 500 0.5630118 0.07180698
## 5 100 0.5996130 0.07325693
## 5 200 0.5960680 0.03772509
## 5 300 0.5791496 0.03712235
## 5 400 0.5869561 0.04584553
## 5 500 0.5673033 0.05597792
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## F1 was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 100,
## interaction.depth = 2, shrinkage = 0.1 and n.minobsinnode = 10.
confusion matrix for unigram gbm model
unigram.gbm.confusionmatrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 57 32
## 1 22 52
##
## Accuracy : 0.6687
## 95% CI : (0.5908, 0.7404)
## No Information Rate : 0.5153
## P-Value [Acc > NIR] : 5.237e-05
##
## Kappa : 0.3393
## Mcnemar's Test P-Value : 0.2207
##
## Sensitivity : 0.7215
## Specificity : 0.6190
## Pos Pred Value : 0.6404
## Neg Pred Value : 0.7027
## Prevalence : 0.4847
## Detection Rate : 0.3497
## Detection Prevalence : 0.5460
## Balanced Accuracy : 0.6703
##
## 'Positive' Class : 0
##
bigram gbm model
bigram.gbm.fit
## Stochastic Gradient Boosting
##
## 383 samples
## 73 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 344, 345, 344, 345, 345, 345, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.trees F1 F1 SD
## 2 100 0.5809324 0.12792297
## 2 200 0.5883536 0.10505908
## 2 300 0.5764462 0.11107246
## 2 400 0.5588518 0.09939930
## 2 500 0.5618966 0.09387904
## 3 100 0.5901226 0.11674285
## 3 200 0.6010791 0.09627211
## 3 300 0.5910125 0.10108267
## 3 400 0.5837870 0.07390445
## 3 500 0.5754913 0.08684396
## 4 100 0.6072137 0.10505361
## 4 200 0.5651959 0.09119413
## 4 300 0.5811336 0.06910575
## 4 400 0.5423515 0.09052754
## 4 500 0.5776810 0.09319436
## 5 100 0.6200092 0.08823854
## 5 200 0.5699468 0.08930689
## 5 300 0.6131817 0.09525975
## 5 400 0.6147405 0.08661185
## 5 500 0.6163756 0.09030642
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## F1 was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 100,
## interaction.depth = 5, shrinkage = 0.1 and n.minobsinnode = 10.
confusion matrix for bigram gbm model
bigram.gbm.confusionmatrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 54 23
## 1 32 54
##
## Accuracy : 0.6626
## 95% CI : (0.5845, 0.7347)
## No Information Rate : 0.5276
## P-Value [Acc > NIR] : 0.000328
##
## Kappa : 0.3272
## Mcnemar's Test P-Value : 0.280713
##
## Sensitivity : 0.6279
## Specificity : 0.7013
## Pos Pred Value : 0.7013
## Neg Pred Value : 0.6279
## Prevalence : 0.5276
## Detection Rate : 0.3313
## Detection Prevalence : 0.4724
## Balanced Accuracy : 0.6646
##
## 'Positive' Class : 0
##
We select the random forest model, as it gives the best F1 measure: the bigram random forest achieves the highest cross-validated F1 (about 0.66) among all the models tried.
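As a final step, the chosen model can be applied to the bigram features of the 12753 unlabeled restaurants; this is only a sketch, and the output file name is illustrative.
# Predict a hygiene-inspection label (0 = pass, 1 = fail) per test restaurant.
hygiene.predictions <- predict(bigram.rf.fit, newdata = bigramtestData)
write.csv(data.frame(prediction = hygiene.predictions),
          "hygiene_predictions.csv", row.names = FALSE)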